Implementation
Sister docs: PRD (intent), Architecture (system view), Notes (decision log).
TL;DR
A production-ready personal RAG system built in ~22 hours over 2 days:
- 5,000+ sources / 47K+ chunks indexed (mixed Vietnamese + English text)
- VECTOR(384) HNSW index on Oracle ADB 23ai free tier
- Multilingual semantic search (VN/EN), p95 latency 1.16 s end-to-end
- MCP Streamable HTTP server with OAuth 2.0 (PKCE + DCR) — accessible from Claude Desktop, Claude.ai web, Claude iOS app
- Persistent Cloudflare named tunnel
- Auto-sync hooks: 4 source types automatically ingest after a markdown write
- Weekly backup → OCI Object Storage via write-only PAR (zero credential surface on the VM)
Stack
| Layer | Component | Version | Notes |
|---|---|---|---|
| Compute | Small commodity VM | 4 vCPU / 16 GB RAM | Provider-agnostic; runs anywhere with Linux + Python |
| OS | Ubuntu | 22.04 minimal | |
| Runtime | Python | 3.10 + venv | |
| MCP SDK | mcp | 1.27.0 | Streamable HTTP transport, built-in OAuth scaffolding |
| HTTP server | uvicorn + Starlette | latest | 127.0.0.1:8080 (tunnel-fronted) |
| Embedding | sentence-transformers + multilingual-e5-small | 5.4 / hf-117M params | 384-dim, max_seq=512, ~5.7 chunks/s on 4-core CPU |
| Vector DB | Oracle Autonomous DB 23ai | free tier | VECTOR(384, FLOAT32) + HNSW INMEMORY NEIGHBOR GRAPH |
| Driver | oracledb (thin) | 3.4 | Wallet-based TLS |
| Tunnel | cloudflared | 2026.3.0 | Named tunnel, auto-restart via systemd |
| Storage backup | OCI Object Storage Standard | free 20 GB | Pre-authenticated write-only URL |
| Auth | OAuth 2.0 + PBKDF2 password | sha256 200K iters | DCR-enabled, refresh rotation |
Directory layout
On VM (/opt/.../personal-rag-kb/)
src/
├── server.py # FastMCP wrapper, Starlette routes (~270 lines)
├── oauth_provider.py # OAuthAuthorizationServerProvider impl (~230 lines)
├── oauth_login.py # /login GET + POST routes (~80 lines)
├── bulk_migrate.py # one-shot Day 3 migration
├── re_embed.py # Day 3b model swap retool
├── backup.py # weekly dump + PAR upload
└── restore.py # disaster recovery (DROP+reload)
backups/ # local cache, last 2 archives
logs/
├── server.log
├── cloudflared.log
├── backup.log
├── migrate.log
└── re_embed.log
.token # legacy bearer (chmod 600)
.oauth_env # OAuth password salt/hash + issuer (chmod 600)
.oauth_state.json # clients/codes/tokens persistent (chmod 600)
.par_url # OCI write-only PAR URL (chmod 600)
On operator host (~/.claude/)
hooks/
├── save-convo.py # Stop hook → archive AI session + auto-ingest
├── save-convo-kb.log
├── kb-ingest-file.py # Generic file → kb_ingest helper (16 path rules)
└── kb-ingest-file.log
scheduled-tasks/
└── nightly-kb-sync/SKILL.md # nightly → run all sync skills
personal-rag-kb.env # KB_MCP_URL + KB_MCP_TOKEN (chmod 600)
Sync skills (in operator’s work-repo .claude/skills/)
- `slack-sync.md` — auto-ingest after save
- `email-highlights.md` — auto-ingest after save
- `meeting-sync.md` — auto-ingest after save
Crawler (in operator’s KB folder)
`crawl_direct.py` — calls `kb-ingest-file.py` after each ticket render
Schema
```sql
CREATE TABLE kb_sources (
  id            NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  source_uri    VARCHAR2(2000) NOT NULL,
  source_type   VARCHAR2(32),
  title         VARCHAR2(500),
  tags          VARCHAR2(500),                -- comma-sep
  metadata      CLOB CHECK (metadata IS JSON),
  content_hash  VARCHAR2(64),                 -- sha256, idempotency key
  ingested_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  CONSTRAINT uq_kb_sources_uri UNIQUE (source_uri)
);
CREATE INDEX idx_kb_sources_type ON kb_sources(source_type);
CREATE INDEX idx_kb_sources_ingested ON kb_sources(ingested_at);

CREATE TABLE kb_chunks (
  id         NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  source_id  NUMBER NOT NULL,
  chunk_idx  NUMBER NOT NULL,
  text       CLOB NOT NULL,
  embedding  VECTOR(384, FLOAT32),
  CONSTRAINT fk_kb_chunks_source FOREIGN KEY (source_id) REFERENCES kb_sources(id) ON DELETE CASCADE,
  CONSTRAINT uq_kb_chunks_src_idx UNIQUE (source_id, chunk_idx)
);
CREATE INDEX idx_kb_chunks_source ON kb_chunks(source_id);

CREATE VECTOR INDEX kb_emb_idx ON kb_chunks(embedding)
  ORGANIZATION INMEMORY NEIGHBOR GRAPH
  DISTANCE COSINE
  WITH TARGET ACCURACY 95;
```
Why this shape:
- `source_uri` UNIQUE + `content_hash` → ingest is idempotent (skip if content is unchanged)
- HNSW INMEMORY NEIGHBOR GRAPH → low-latency ANN search; cold build = 19 s for 45K vectors
- `ON DELETE CASCADE` → dropping a source row auto-removes its chunks
- Tag-based filter is an `INSTR()` substring match — simpler than a separate tag table and sufficient at this scale
MCP tools surface
| Tool | Args | Purpose |
|---|---|---|
| `kb_health` | — | DB ping + sources/chunks count |
| `kb_ingest` | text, source_uri, source_type, title, tags[], metadata{} | Idempotent insert/update; chunk + embed + store |
| `kb_search` | query, top_k=5, tag_filter, source_type | Semantic search; returns top-k chunks with metadata |
| `kb_stats` | — | Totals + breakdown by source_type |
Each tool is ≈30–60 LOC in server.py. The FastMCP `@mcp.tool()` decorator handles JSON Schema generation, request parsing, and response packaging.
Chunking strategy
```python
CHUNK_TOK = 256            # under e5 max_seq_length = 512
OVERLAP_TOK = 32
MAX_CHUNKS_PER_FILE = 50   # head-only cap for very long files

def chunk_text(text: str) -> list[str]:
    ids = tokenizer.encode(text, add_special_tokens=False)
    if len(ids) <= CHUNK_TOK:
        return [text]
    step = CHUNK_TOK - OVERLAP_TOK  # = 224
    chunks = []
    for start in range(0, len(ids), step):
        window = ids[start:start + CHUNK_TOK]
        chunks.append(tokenizer.decode(window, skip_special_tokens=True))
        if start + CHUNK_TOK >= len(ids):
            break
        if len(chunks) >= MAX_CHUNKS_PER_FILE:
            break
    return chunks
```
Why 256: BGE-small benchmarks showed quadratic sequence-length scaling on CPU; 256-token chunks embed roughly 3× faster than 480-token ones. Verified empirically: ~5.7 chunks/s @ 256 vs ~1.8 @ 480.
Cap 50 chunks/file: Some session transcripts run to 100K+ tokens. Head-only embedding covers the first ~12,800 tokens (~25 dense pages), which captures intent + key decisions. Trade-off accepted.
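The cap arithmetic can be sanity-checked: 50 windows advancing 224 tokens each cover 49 × 224 + 256 = 11,232 distinct tokens (the ~12,800 figure is 50 × 256, which counts overlapped tokens twice). A small helper — names are my own — that mirrors the loop in chunk_text and counts windows without decoding:

```python
CHUNK_TOK, OVERLAP_TOK, MAX_CHUNKS = 256, 32, 50
STEP = CHUNK_TOK - OVERLAP_TOK  # 224

def num_chunks(n_tokens: int) -> int:
    """How many windows chunk_text() emits for a file of n_tokens tokens."""
    if n_tokens <= CHUNK_TOK:
        return 1
    count = 0
    for start in range(0, n_tokens, STEP):
        count += 1
        # same exit conditions as chunk_text: full coverage, or head-only cap
        if start + CHUNK_TOK >= n_tokens or count >= MAX_CHUNKS:
            break
    return count
```

So a 100K-token transcript yields exactly 50 chunks, while anything at or under 256 tokens is a single chunk.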
Embedding pipeline (e5 prefix convention)
multilingual-e5-small requires task-specific prefixes for best retrieval:
```python
# Documents
embed_model.encode([f"passage: {chunk}" for chunk in chunks])

# Queries
embed_model.encode([f"query: {user_query}"])
```
Verified impact (verbatim Vietnamese query):
- BGE-en (no prefix): top hit score 0.726, wrong document
- e5-small (with prefix): top hit score 0.857, correct document
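The scores above are cosine similarities over normalized vectors (sentence-transformers can emit these directly with `normalize_embeddings=True`). A self-contained sketch of the metric itself, with toy low-dimensional vectors standing in for real 384-dim embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity — the metric behind the 0.726 / 0.857 scores above."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)
```

Identical directions score 1.0, orthogonal ones 0.0; retrieval ranks chunks by this value (or equivalently by the cosine distance 1 − similarity that Oracle's COSINE metric returns).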
Auth dual-mode
```python
# oauth_provider.py: load_access_token() falls back to the legacy bearer token
async def load_access_token(self, token: str) -> AccessToken | None:
    if secrets.compare_digest(token, LEGACY_BEARER):
        return AccessToken(token=token, client_id="_legacy_bearer", scopes=["mcp"])
    state = _load_state()
    if token not in state["access_tokens"]:
        return None
    return AccessToken.model_validate(state["access_tokens"][token])
```
This allows:
- Claude Desktop (mcp-remote bridge with bearer in header) — initial setup keeps working
- Claude.ai web (full OAuth flow with DCR + PKCE)
- Claude iOS app — inherits from claude.ai web account
- curl scripts on VM — bearer for ad-hoc admin
OAuth flow (sequence)
```mermaid
sequenceDiagram
    autonumber
    participant C as Claude.ai
    participant B as Browser
    participant S as MCP Server
    participant DB as State Store
    C->>S: GET /.well-known/oauth-authorization-server
    S-->>C: AS metadata (RFC 8414)
    C->>S: POST /register (DCR)
    S->>DB: Save client
    S-->>C: 201 {client_id}
    C->>B: Redirect to /authorize (client_id, redirect_uri, code_challenge)
    B->>S: GET /authorize
    S->>DB: Create session
    S-->>B: 302 → /login?session=sid
    B->>S: GET /login (password form)
    S-->>B: HTML form
    B->>S: POST /login (password)
    S->>S: PBKDF2 verify (200K iters)
    S->>DB: Generate auth code
    S-->>B: 302 → redirect_uri?code=...&state=...
    B->>C: Forwards code
    C->>S: POST /token (code + verifier)
    S->>S: Verify PKCE challenge
    S->>DB: Issue access_token + refresh_token
    S-->>C: 200 {access_token, refresh_token, expires_in: 3600}
    Note over C,S: Subsequent MCP calls
    C->>S: POST /mcp + Authorization: Bearer
    S->>DB: Validate token
    S-->>C: MCP response
    Note over C,S: After 1 hour
    C->>S: POST /token (grant_type=refresh_token)
    S->>DB: Rotate refresh + new access
    S-->>C: 200 {new tokens}
```
Step-by-step (text reference)
```text
1. Claude.ai → GET https://<host>/.well-known/oauth-authorization-server
   ← {issuer, authorization_endpoint, token_endpoint, registration_endpoint, ...}
2. Claude.ai → POST /register
   { redirect_uris: [...], token_endpoint_auth_method: "none" }
   ← 201 { client_id: "uuid-v4", client_id_issued_at: ... }
3. Claude.ai → GET /authorize?client_id=...&redirect_uri=...&code_challenge=...&state=...
   ← 302 Location: https://<host>/login?session=<sid>
4. Browser → GET /login?session=<sid>
   ← HTML form (dark theme, password input)
5. Browser → POST /login (form { session, password })
   Server: PBKDF2 verify; if OK: generate code, store, redirect
   ← 302 Location: <redirect_uri>?code=<code>&state=<state>
6. Claude.ai → POST /token
   { grant_type: authorization_code, code, code_verifier, redirect_uri, client_id }
   Server: load AuthCode, verify code_challenge against verifier (PKCE), generate tokens
   ← 200 { access_token, token_type: Bearer, expires_in: 3600, refresh_token, scope: mcp }
7. Claude.ai → POST /mcp with Authorization: Bearer <access_token>
   Server: load_access_token(token) → AccessToken; SDK proceeds with normal MCP
   ← MCP responses
8. (After 1 hour) Claude.ai → POST /token
   { grant_type: refresh_token, refresh_token: ... }
   ← 200 { new access_token, new refresh_token (rotated) }
```
Ingest flow (filesystem-first)
```text
[Skill / Hook / Crawler]
        │ writes
        ▼
<kb-root>/<folder>/<slug>.md          ← canonical, source of truth
        │
        │ exec python3 <hooks>/kb-ingest-file.py <path>
        ▼
[kb-ingest-file.py]
        │ reads file, classifies source_type from path, auto-tags
        │ HTTPS POST <mcp-host>/mcp  tools/call  kb_ingest
        ▼
[Server: kb_ingest tool]
        │ SHA256(content), check kb_sources.content_hash
        │ if match  → return {skipped: true}
        │ if differ → DELETE old chunks, recompute
        │ chunk(text), embed_passages(chunks)
        │ INSERT kb_sources (or UPDATE), bulk INSERT kb_chunks
        ▼
ADB 23ai
```
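The hash check at the heart of the flow can be sketched with in-memory dicts standing in for the two tables (function and variable names are illustrative, and the chunking here is a crude stand-in for chunk + embed):

```python
import hashlib

sources: dict[str, dict] = {}      # source_uri -> row; stand-in for kb_sources
chunks: dict[str, list[str]] = {}  # source_uri -> chunk texts; stand-in for kb_chunks

def kb_ingest(source_uri: str, text: str, title: str = "") -> dict:
    digest = hashlib.sha256(text.encode()).hexdigest()
    row = sources.get(source_uri)
    if row and row["content_hash"] == digest:
        return {"skipped": True}  # idempotent: unchanged content, no re-embed
    # new or changed: replace chunks wholesale (ON DELETE CASCADE does this in the DB)
    sources[source_uri] = {"content_hash": digest, "title": title}
    chunks[source_uri] = [text[i:i + 1000] for i in range(0, len(text), 1000)]
    return {"skipped": False, "chunks": len(chunks[source_uri])}
```

Re-running a hook or crawler over unchanged files therefore costs one SHA256 and one indexed lookup per file, never an embedding pass.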
Why filesystem-first: ADB free tier can be evicted (auto-stop after 7 days idle, possible data loss). The local filesystem (with Time Machine or equivalent) is durable. ADB is treated as a derived index — re-buildable from filesystem in ~3 hours via bulk_migrate.py. Worst case = re-embed cost, no data loss.
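The HTTPS POST in the flow above is a standard MCP tools/call request. A minimal payload builder — argument values are illustrative; the endpoint URL and bearer token come from personal-rag-kb.env:

```python
import json

def kb_ingest_payload(text: str, source_uri: str, source_type: str,
                      title: str, tags: list[str]) -> str:
    """JSON-RPC 2.0 body for an MCP tools/call invoking kb_ingest."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {
            "name": "kb_ingest",
            "arguments": {
                "text": text,
                "source_uri": source_uri,
                "source_type": source_type,
                "title": title,
                "tags": tags,
            },
        },
    })
```

kb-ingest-file.py would POST this body to `<mcp-host>/mcp` with `Authorization: Bearer <token>` and `Content-Type: application/json` headers.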
Performance numbers
Measured on a 4-vCPU VM @ 2.2 GHz, 16 GB RAM:
| Operation | Throughput | Notes |
|---|---|---|
| BGE-small embed @ seq=480 | 1.8 chunks/s | rejected (too slow) |
| e5-small embed @ seq=256 | 5.7 chunks/s | accepted |
| BGE-m3 embed @ seq=256 | 0.39 chunks/s | rejected (would take 32h to re-embed) |
| Bulk migrate 5,000+ files | ~95 min | sequential, single conn |
| Re-embed 47K chunks | ~131 min | resumable via checkpoint |
| Backup 5K sources / 47K chunks → tar.gz 86 MB | 129s | gzip stream |
| HNSW index build (45K vectors) | 19s | INMEMORY NEIGHBOR GRAPH |
| kb_search p50 | ~800 ms | embed + ANN + JSON serialize |
| kb_search p95 (via tunnel) | 1,160 ms | adds ~300 ms tunnel + TLS overhead |
| kb_health | <80 ms | DB ping only |
Reliability features
| Feature | How |
|---|---|
| Idempotent ingest | content_hash compare → skip duplicate uploads |
| Filesystem source-of-truth | re-buildable in ~3 h from local markdown |
| Persistent tunnel | systemd cloudflared.service with Restart=on-failure |
| Auto-restart MCP server | systemd personal-rag-kb.service with Restart=on-failure |
| Weekly backup | systemd timer kb-backup.timer (Sunday 03:00 UTC) |
| Backup restore | restore.py — DROP + recreate + JSONL reload + HNSW rebuild |
| Reconcile script | walks FS + diffs ADB hashes → re-ingests mismatched / new |
| Token rotation | refresh_token rotated on use (one-time) |
| Token TTL | access 1 h, refresh 30 d, auth code 10 min |
| GC | expired sessions/codes purged on each provider call |
Security model
| Threat | Mitigation |
|---|---|
| Public MCP endpoint exposure | Cloudflare tunnel terminates outbound from VM only — no inbound port open. Auth required. |
| Token theft | TLS only (CF-managed cert). Tokens random 32-byte urlsafe. PBKDF2 200K rounds for password. |
| Replay attack | Auth codes single-use (consumed on exchange). Refresh tokens rotated. |
| Data exfil from compromised VM | VM has only write-only PAR for backup bucket — cannot LIST/GET/DELETE existing backups. ADB password limited to single role. |
| Brute-force password | No throttling implemented — single-user; upstream Cloudflare rate limiting provides some protection. TODO: add a rate limit on /login |
| Sensitive content | Embeddings are not directly readable as text; the original text is retrievable only by the authenticated user. Single-tenant deployment. |
Reproducibility — quickstart for a forker
```bash
# 1. Provision Oracle Autonomous DB 23ai (free tier) → download wallet
# 2. Provision a small Linux VM (4-core, 16 GB)
# 3. SSH in:
sudo apt update && sudo apt install -y python3-venv
python3 -m venv /opt/ai-foundation/venv
/opt/ai-foundation/venv/bin/pip install \
    oracledb fastapi uvicorn 'mcp[cli]>=1.2' starlette sse-starlette \
    --extra-index-url https://download.pytorch.org/whl/cpu \
    torch sentence-transformers
# 4. Upload wallet to /opt/ai-foundation/wallet/, write lib/db.py with the TNS DSN
# 5. Apply the schema (see the schema section above)
# 6. Generate legacy token: openssl rand -hex 32 > .token
# 7. Generate password: openssl rand -base64 24 → PBKDF2 → .oauth_env
# 8. Drop in src/server.py, src/oauth_provider.py, src/oauth_login.py
# 9. systemd services + cloudflared tunnel + DNS record on a domain you own
# 10. Bulk-migrate from the local KB folder, or start fresh
# 11. Add a custom connector on claude.ai with URL https://yourdomain/mcp
```
Total: 1–2 hours if all credentials/wallets ready.
Future work (M3+)
- Rate limit on `/login` (5 req/min per IP)
- TOTP 2FA via `pyotp`
- Multi-user: user table, per-user `client_id` isolation, row-level filter on kb_sources
- Reranker (BGE-reranker-base) on top-20 candidates → measure Hit@5 lift
- Embedding model upgrade path: e5-base (768-dim, ~2× better quality) once the CPU is upgraded
- Hybrid search: add an Oracle Text full-text index as keyword fallback when the top vector score is <0.7
License & attribution
Personal project. Built on:
- Model Context Protocol by Anthropic
- multilingual-e5-small by Microsoft
- Oracle Database 23ai free tier
- Cloudflare Tunnel free tier