Hành trình build Jarvis cho Mac và bài học cay đắng về local LLM

TL;DR cho team product/CTO — 73% AI agent product launched 2025-2026 default chọn local LLM (open-source) vì “free” và “privacy”. Sau khi tự build voice agent thực sự, tôi nhận ra: cho agentic flow (multi-step tool calling), local model 7-13B chưa đủ depth → user trust erode → retry cost cao → TCO thực tế đắt hơn Cloud frontier $0.001/turn. Bài này: hành trình + decision framework cho team đang plan AI feature roadmap 2026.

Vì sao 1 PM lại tự code voice assistant

Tôi là Product Manager, không phải engineer. Daily work là review AI product roadmap, evaluate vendor (Anthropic, OpenAI, Mistral, self-hosted), debate “build vs buy”, “local vs cloud” với team kỹ thuật.

Vấn đề: tôi không thực sự feel được capability gap giữa frontier LLM và open-source. Đọc benchmark MMLU/GPQA không giúp khi user thực tế quan tâm “AI có làm đúng không khi tôi nói ‘mở Excel và dán bảng này’”.

Quyết định 1 cuối tuần: build 1 voice assistant native macOS từ scratch — Jarvis kiểu Iron Man. Không phải để ship, mà để trải nghiệm decision-by-decision mọi trade-off mà team product phải làm khi build agentic AI.

Output:

Native macOS app (~50MB)
Voice loop end-to-end: nói → STT → LLM → TTS phát ra
40 tool: mở app, gõ text, click chuột, control browser tab (Safari + Chrome), search KB cá nhân, weather, maps
Dictation hotkey thay thế Apple Dictation
Continuous mode + interrupt khi nói “stop”

4 quyết định product enterprise sẽ phải làm

1. Cost framework: $/turn không phải metric duy nhất

Tôi test 3 LLM backend cho voice agent. User request: “Open TextEdit and type Hello”. LLM phải gọi đúng 2 tool: open_app("TextEdit") rồi type_text("Hello").

Backend	$/turn	Latency	Tool-call accuracy	User experience
Qwen 2.5:7b local	$0	~3s	30%	Nói “Đã gõ Hello” mà KHÔNG gõ thật
Hermes 3:8b local (function-call tune)	$0	~4s	60%	Open app OK, type fail 40%
Claude Haiku 4.5 cloud	$0.001	~1.5s	~100%	Đúng 5 lần liên tiếp

Naive cost analysis: Local free, Cloud $1/tháng (30 turn/ngày). Local thắng?

TCO analysis thực tế cho enterprise product:

Cost layer	Local	Cloud
LLM API	$0	$1/user/tháng
Compute (Ollama daemon, GPU)	~$5/user/tháng (electricity, hardware amortize)	$0
Retry cost (user phải nói lại 2-3 lần khi fail)	~3x time waste	gần 0
Trust erosion (user mất niềm tin → churn)	High	Low
Engineer debug time	~3 giờ debug code chỉ vì model không call tool	0
Ship velocity (phải prompt-engineer đặc biệt cho local)	-2 tuần	0

Real TCO: Cloud thực ra cheaper 5-10x cho agentic. Local “free” là illusion.

Lesson cho CFO/PM: khi present cost AI feature, dùng TCO framework, không chỉ “API cost”. Frontier LLM đắt hơn 100-1000x raw cost vẫn là bargain nếu accuracy lên 100%.

2. Privacy không binary — kiến trúc layered

Lý do thường nghe để chọn local LLM: “user data privacy”. Tôi từng nghĩ phải all-local.

Sau build, kiến trúc Jarvis của tôi:

Audio capture: 100% on-device (AVAudioEngine local)
Speech-to-text: 100% on-device (Apple SFSpeech)
LLM reasoning: Cloud Haiku — chỉ gửi text transcript (không audio, không KB content)
Tool execution: 100% on-device (mở app, gõ chuột — không leave Mac)
KB search: 100% local (RAG server trên Mac)
TTS playback: 100% on-device (AVSpeechSynthesizer)

Cloud chỉ thấy 1 cái: text user nói trong session. Anthropic API ToS: không log để train. So với “send raw audio to OpenAI Whisper API” thì privacy footprint nhỏ hơn 100x.

Lesson cho compliance/legal: privacy là layered decision, không phải all-or-nothing. Document từng data flow → present “what leaves device, what doesn’t” → compliance team thường OK với selective Cloud.

3. Wake word “Hey Jarvis” hay push-to-talk?

Iron Man pattern: always-listening, “Hey Jarvis” trigger. Trông sexy ở demo.

Production reality:

Always-on mic = battery drain 15-25% pin/ngày
False trigger 2-5 lần/ngày (Siri ai dùng Apple đều biết)
Privacy concern enterprise: mic luôn nghe trong meeting

V1 Jarvis tôi chọn push-to-talk hotkey (Cmd+Shift+J). Trade off:

Mất “Hey Jarvis” magical moment
Win: 0% false trigger, 0% pin drain, 0% privacy concern

Lesson cho PM: hot-pattern (always-on) không phải right-pattern. Test 3 cohort:

Power-user (love magic)
Casual (hate false trigger)
Privacy-sensitive enterprise (require explicit trigger)

2/3 cohort skew push-to-talk → ship V1 push-to-talk, wake word V1.5.

4. Build vs buy — 2026 stack đã ready

Build voice agent năm 2024 = 6 tháng. 2026 = 1 cuối tuần. Stack mature:

STT on-device: Apple SFSpeech, Whisper.cpp, WhisperKit (ANE) — free
LLM: Anthropic SDK, OpenAI SDK, Ollama — plug & play
TTS native: AVSpeechSynthesizer + Premium voices — free
UI: SwiftUI + Metal — native, polished

Build giờ make sense cho:

Custom UX (assistant kiểu của brand mình)
Lock-in stack control (không phụ thuộc 1 vendor)
Enterprise data (không gửi mic/transcript ra ngoài)

Buy vẫn make sense cho:

Generic productivity (Raycast AI, Apple Intelligence)
Voice cho mobile/web (Vapi, Retell)
Time-to-market <2 tuần

Lesson cho head of product: “build voice agent” không còn là 6-month roadmap item. Re-evaluate every 6 tháng — stack tăng tốc nhanh hơn project planning.

Decision framework tổng kết

Khi team plan “AI agent feature” cho product, hỏi 4 câu:

Câu hỏi	Skew Local	Skew Cloud
Tool calls cần ≥90% accuracy?	Không (chit-chat)	Có (agentic action)
Latency < 2s critical?	Không (background)	Có (voice loop, real-time)
Sensitive data ở input?	Có (medical PHI, financial)	Không (generic transcript)
User retry cost cao? (re-do workflow)	Không (idempotent)	Có (sent email, money transfer)

3-4 câu skew Cloud → Cloud Haiku/Sonnet. 3-4 skew Local → Mistral/Qwen. Mix → hybrid router (lightweight classifier route đúng tier).

Cái tôi sẽ làm tiếp

Hybrid intent router: phân loại “tool-heavy” vs “chit-chat” → route đúng tier. Save 80% cost mà UX vẫn fast.
Cost dashboard for users: cho user thấy real cost mỗi loại query → educate trade-off thay vì giấu.
VN voice cloning + multilingual: ship được cho team Việt enterprise.

Take-away cho product/business leader

Tự build / pair-build 1 lần để feel limit của open-source LLM. Roadmap dựa trên feel, không dựa benchmark paper.
TCO framework: $/turn + retry-cost + debug-time + trust-erosion + ship-velocity. Frontier LLM thường thắng cost battle khi tính đủ.
Privacy layered, không binary — document data flow, present compliance, đừng bỏ Cloud chỉ vì “lo privacy”.
Open-source LLM đang đuổi nhanh nhưng cho agentic flow (multi-step + tool reliability), Claude/GPT-4 frontier vẫn lead mid-2026. Re-evaluate Q3.
Stack maturity đã đến tipping point. “Voice AI agent” không còn 6-month roadmap — 1 weekend POC đủ cho team product validate idea trước khi commit headcount.

Tôi là PM ở LittleLives, viết về AI product, agentic system, và build vs buy decision framework. Comment hoặc DM nếu team đang plan AI feature roadmap muốn deep-dive.