# Personal RAG Knowledge Base — PRD
Size S · P0 · Foundation
Status: ✅ M1 done (2026-04-29) — see Implementation for build details
Originally planned: 1 weekend / Actual: ~2 days of concentrated work
## 1. Problem
Personal knowledge is fragmented across many sources: wiki/documentation pages, ticketing systems, chat threads, meeting transcripts, email, markdown notes, and AI conversation transcripts. When trying to recall “what did I read about X?”, finding the answer takes 15–30 minutes of manual hunting across separate tools. OS-level search (Spotlight) is keyword-only and doesn’t understand semantic intent. Per-source MCP search is slow and bloats the context window with raw chunks.
Pain: ~90% of consumed knowledge is not retrievable on demand.
Why now: this is a foundation pattern that 6+ downstream AI projects can reuse (recipe extractor, research agent, support bot, finance advisor, etc.). Build once, reuse many times.
## 2. Goal & Success Metrics
Goal: Ask a Claude client “any thoughts on vector DB X for ~100K vectors?” → get top-5 relevant chunks + source links in under 3 seconds.
Metrics — actual achieved:
| Metric | Target M1 | Achieved | Note |
|---|---|---|---|
| Hit@5 on test queries | ≥60% | ~85% (subjective on 4 queries) | Verified with real-world queries scoring 0.85+ |
| Latency p95 | <5s | 1.16s | Embed query + ADB vector search + tunnel |
| Sources ingested | 100 docs | 5,000+ sources / 47K chunks | Full bulk migrate Day 3 |
| Touchpoint | Telegram | MCP: Claude Desktop + Claude.ai web + iOS app | Pivoted Day 1, validated Day 8 |
## 3. User journey (revised)
Pivot from original: Dropped a custom Telegram bot in favor of MCP — Claude clients can call `kb_search` / `kb_ingest` natively.
- A sync agent (e.g. chat exporter, meeting transcriber, AI session hook, ticket crawler) writes a `.md` file into the local KB folder.
- A post-write hook automatically calls the `kb_ingest` MCP tool → embed + store in ADB.
- User asks Claude: “Anything new on topic X this week?”
- Claude calls `kb_search` over MCP → returns top-5 chunks with source paths.
- Claude synthesizes an answer with citations.
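The retrieval step in this journey can be sketched as a brute-force cosine top-k over stored chunks. This is an illustrative stand-in, not the production path: the real server delegates similarity search to ADB 23ai's native VECTOR type with an HNSW index, and the vectors are 384-dim e5 embeddings rather than the toy 3-dim ones used here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def kb_search(query_vec, chunks, k=5):
    """Return the top-k chunks by cosine similarity to the query vector.

    `chunks` is a list of (source_uri, chunk_index, text, vector) tuples,
    mirroring the citation fields (source URI + chunk index) the tool returns.
    """
    scored = [
        {"source": uri, "chunk": idx, "text": text, "score": cosine(query_vec, vec)}
        for uri, idx, text, vec in chunks
    ]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:k]

# Toy 3-dim stand-ins for the real 384-dim e5 embeddings.
chunks = [
    ("notes/a.md", 0, "vector DB comparison", [1.0, 0.0, 0.0]),
    ("notes/b.md", 1, "meeting recap",        [0.0, 1.0, 0.0]),
    ("notes/c.md", 0, "HNSW index tuning",    [0.9, 0.1, 0.0]),
]
results = kb_search([1.0, 0.0, 0.0], chunks, k=2)
```

The brute-force scan is O(n) per query; at 47K chunks the HNSW index is what keeps p95 latency near 1 second.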
## 4. Scope (MoSCoW) — final
Must — DONE:
- ✅ Ingest markdown files (URL/PDF supported via downstream sync agents)
- ✅ Chunk + embed + store in ADB 23ai with `VECTOR(384, FLOAT32)`
- ✅ MCP tools: `kb_health`, `kb_ingest`, `kb_search`, `kb_stats`
- ✅ Citations: search results include source URI and chunk index
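The chunk step of the pipeline can be illustrated with a minimal fixed-size sliding window. The window and overlap sizes below are assumptions for illustration; this PRD does not specify the production chunking parameters.

```python
def chunk_markdown(text, size=800, overlap=100):
    """Split text into overlapping fixed-size character windows.

    `size` and `overlap` are illustrative defaults, not the values
    used by the real ingest pipeline.
    """
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # slide forward, keeping `overlap` chars of context
    return chunks

doc = "x" * 2000
parts = chunk_markdown(doc, size=800, overlap=100)
```

Each chunk is then embedded and stored as one `VECTOR(384, FLOAT32)` row; the overlap keeps sentences that straddle a boundary retrievable from either side.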
Should — DONE:
- ✅ Idempotent ingest via server-side hash check
- ✅ Re-ingest replaces chunks if content hash changed
- ✅ Auto-tag from folder hierarchy (16 path-based rules)
- ⏸️ Hybrid search (BM25 + vector) — semantic alone proved sufficient
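The idempotent-ingest behavior above can be sketched as a server-side content-hash comparison. The function name mirrors the `kb_ingest` tool, but storage, chunking, and embedding are stubbed out, and SHA-256 is an assumed hash choice.

```python
import hashlib

# In-memory stand-in for the ADB-side (source_uri -> content_hash) lookup.
_store = {}

def kb_ingest(source_uri, content):
    """Idempotent ingest sketch: skip when the content hash is unchanged,
    replace the source's chunks when it differs."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if _store.get(source_uri) == digest:
        return "skipped"            # identical content already ingested
    action = "replaced" if source_uri in _store else "ingested"
    _store[source_uri] = digest     # real server re-chunks + re-embeds here
    return action
```

This is what makes the post-write hook safe to fire on every file save: unchanged files cost one hash lookup, not a re-embed.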
Could — partial:
- ⏸️ Apple Notes import — out of scope, low ROI for actual usage
- ⏸️ Kindle highlights import — same reason
- ✅ Wiki/docs crawler integration — handled by downstream scripts
- ❌ Custom web UI — replaced by Claude clients themselves
Won’t (M1–M3) — kept:
- Multi-user support (single-user system by design)
- Real-time sync — manual triggers + nightly scheduled task is sufficient
- Native mobile app — Claude iOS app inherits access via OAuth
## 5. Architecture (final)
Pivoted from “RAG + Telegram Bot” → MCP Server + Multi-source ingest. See Architecture for diagrams.
## 6. Tech Stack — final choices
| Layer | Original spec | Implemented | Reason for change |
|---|---|---|---|
| LLM serving | Local Llama 3.2 3B | External (caller’s Claude client) | MCP delegates LLM to caller; server only does retrieval |
| Embedder | BGE-small-en-v1.5 | multilingual-e5-small | English-only embedder underperformed on mixed VN/EN queries |
| Vector DB | ADB 23ai | ADB 23ai ✓ | Free 20 GB, native vector type with HNSW index |
| Reranker | BGE-reranker-base (M2) | ❌ skipped | Single-stage e5 alone scored 0.85+ on real queries |
| Bot framework | python-telegram-bot | MCP Streamable HTTP | Native Claude integration, multi-client for free |
| HTTP framework | — | FastMCP + Starlette + uvicorn | MCP SDK provides this out-of-box |
| Tunnel | Public IP + nginx | Cloudflare named tunnel | Persistent URL, no inbound port open |
| Auth | “Telegram only” | OAuth 2.0 (PKCE+DCR) + legacy bearer | Supports mobile/web/desktop clients |
| Deploy | systemd | systemd ✓ | — |
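The PKCE side of the auth row can be sketched with the standard library: the client generates a `code_verifier` and derives its S256 `code_challenge` per RFC 7636. This shows only the key derivation, not the full authorization-code exchange or DCR registration.

```python
import base64
import hashlib
import secrets

def make_pkce_pair():
    """Generate an RFC 7636 code_verifier and its S256 code_challenge.

    43-char URL-safe verifier from 32 random bytes; the challenge is the
    base64url-encoded SHA-256 of the verifier, with padding stripped.
    """
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode("ascii")
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return verifier, challenge

verifier, challenge = make_pkce_pair()
```

The verifier stays on the client; only the challenge travels in the initial authorize request, which is why PKCE works for public clients like the Claude iOS app.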
Cost posture: Designed to fit within cloud free tiers for ADB, Object Storage, and Cloudflare. Compute uses a small commodity VM that can run on any provider; the architecture intentionally avoids vendor-specific lock-in (no managed services for the hot path).
## 7. Milestones — actual
| Day | What shipped |
|---|---|
| Day 1 | MCP server scaffold (FastMCP), tunnel, bearer auth |
| Day 2 | kb_ingest/search/stats tools, embed bench, schema applied |
| Day 3 | Bulk migrate 5,000+ sources / 45K chunks (~95 min wall) |
| Day 3b | Embed model pivot BGE-en → e5-small multilingual (re-embed 131 min) |
| Day 4-5 | Refactor 4 sync sources to call MCP kb_ingest (filesystem-first rule) |
| Day 6 | Persistent named tunnel via custom domain |
| Day 7 | Weekly Object Storage backup via write-only PAR + restore script |
| Day 8 | OAuth 2.0 + DCR + PKCE → Claude.ai web + iOS app access |
M1 DoD passed:
- ✅ Ingest 5,000+ sources (target 10)
- ✅ 5/5 real queries answered correctly (target 3/5)
- ✅ Latency 1.16s (target <5s)
## 8. Cost & Quota
| Item | Free tier? | Actual usage |
|---|---|---|
| ADB 23ai (1 × 20 GB) | ✅ | 47K chunks ≈ ~250 MB stored (~1.3% of quota) |
| Object Storage backup | ✅ <20 GB | ~86 MB/week × ~52 weeks = ~4.5 GB/year (~22% of quota) |
| Cloudflare Tunnel | ✅ | <1 MB/day data |
The serving VM is the only line item that may need migration once a free trial credit ends. The architecture is provider-agnostic — moving compute is a re-deploy, not a re-build.
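A back-of-envelope check on the ~250 MB ADB figure: the raw vector payload alone accounts for roughly 72 MB, with the remainder presumably chunk text, metadata, and HNSW index overhead (an assumed breakdown, not a measured one).

```python
# Raw vector payload for 47K chunks at 384 FLOAT32 dims,
# before chunk text, metadata, and index overhead.
chunks = 47_000
vector_bytes = chunks * 384 * 4        # FLOAT32 = 4 bytes per dimension
vector_mb = vector_bytes / 1_000_000   # ~72 MB of the ~250 MB total
```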
## 9. Risks & open questions — outcomes
Original risks:
- ADB auto-stop after 7 days idle → mitigated by daily health pings via normal usage
- Embedding quality on Vietnamese text with English model → resolved Day 3b (multilingual model swap)
- Local LLM crash recovery → N/A (LLM not used in final architecture)
New risks (M2/M3 backlog):
- Single-password OAuth (no 2FA) — adequate for personal use; layer SSO if shared
- Cloudflare bot management blocking the default `urllib` User-Agent → fixed with a custom UA header
- Token rotation relies on operator memory — push to a password manager
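The `urllib` fix above can be sketched as follows; the helper name, URL, and UA string are illustrative, not the actual values used.

```python
import urllib.request

def make_request(url, token):
    """Build a request with an explicit User-Agent, since Cloudflare's bot
    management rejects the default "Python-urllib/x.y" identifier.
    All header values here are illustrative."""
    return urllib.request.Request(
        url,
        headers={
            "User-Agent": "kb-sync/1.0",
            "Authorization": f"Bearer {token}",
        },
    )

req = make_request("https://kb.example.com/mcp", "dummy-token")
```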
Original open Qs:
- Q1: Web UI from M2? → ❌ dropped, Claude clients are sufficient
- Q2: Privacy of sensitive content in ADB? → ✅ accepted, single-tenant deployment
- Q3: Reranker latency on CPU? → N/A (reranker skipped)
## 10. Definition of Done
M1 Done: ✅ 2026-04-29 — 5K+ sources ingested, OAuth flow live, multilingual search working, infra includes DR backup.
M3 Done (production-ready):
- ⏳ TOTP 2FA or SSO wrap on `/login`
- ⏳ Reranker eval on top-20 candidates → measure Hit@5 lift
- ⏳ Daily-driver criterion: 2 consecutive weeks of natural usage without stack pivot
## See also
- Implementation — technical deep-dive (deploy, code structure, perf numbers)
- Architecture — component diagrams, data flow, security model
- Notes — chronological decision log