Implementation
Sister docs: PRD (intent), Architecture (system view), Notes (decision log).
TL;DR
A production-ready personal RAG system built in ~22 hours over 2 days:
- 5,000+ sources / 47K+ chunks indexed (mixed Vietnamese + English text)
- VECTOR(384) HNSW index on Oracle ADB 23ai free tier
- Multilingual semantic search (VN/EN), p95 latency 1.16 s end-to-end
- MCP Streamable HTTP server with OAuth 2.0 (PKCE + DCR) — accessible from Claude Desktop, Claude.ai web, Claude iOS app
- Persistent Cloudflare named tunnel
- Auto-sync hooks: 4 source types automatically ingest after a markdown write
- Weekly backup → OCI Object Storage via write-only PAR (zero credential surface on the VM)
Stack
| Layer | Component | Version | Notes |
|---|---|---|---|
| Compute | Small commodity VM | 4 vCPU / 16 GB RAM | Provider-agnostic; runs anywhere with Linux + Python |
| OS | Ubuntu | 22.04 minimal | |
| Runtime | Python | 3.10 + venv | |
| MCP SDK | mcp | 1.27.0 | Streamable HTTP transport, built-in OAuth scaffolding |
| HTTP server | uvicorn + Starlette | latest | 127.0.0.1:8080 (tunnel-fronted) |
| Embedding | sentence-transformers + multilingual-e5-small | 5.4 / hf-117M params | 384-dim, max_seq=512, ~5.7 chunks/s on 4-core CPU |
| Vector DB | Oracle Autonomous DB 23ai | free tier | VECTOR(384, FLOAT32) + HNSW INMEMORY NEIGHBOR GRAPH |
| Driver | oracledb (thin) | 3.4 | Wallet-based TLS |
| Tunnel | cloudflared | 2026.3.0 | Named tunnel, auto-restart via systemd |
| Storage backup | OCI Object Storage Standard | free 20 GB | Pre-authenticated write-only URL |
| Auth | OAuth 2.0 + PBKDF2 password | sha256 200K iters | DCR-enabled, refresh rotation |
Directory layout
On VM (/opt/.../personal-rag-kb/)
src/
├── server.py # FastMCP wrapper, Starlette routes (~270 lines)
├── oauth_provider.py # OAuthAuthorizationServerProvider impl (~230 lines)
├── oauth_login.py # /login GET + POST routes (~80 lines)
├── bulk_migrate.py # one-shot Day 3 migration
├── re_embed.py # Day 3b model swap retool
├── backup.py # weekly dump + PAR upload
└── restore.py # disaster recovery (DROP+reload)
backups/ # local cache, last 2 archives
logs/
├── server.log
├── cloudflared.log
├── backup.log
├── migrate.log
└── re_embed.log
.token # legacy bearer (chmod 600)
.oauth_env # OAuth password salt/hash + issuer (chmod 600)
.oauth_state.json # clients/codes/tokens persistent (chmod 600)
.par_url # OCI write-only PAR URL (chmod 600)
On operator host (~/.claude/)
hooks/
├── save-convo.py # Stop hook → archive AI session + auto-ingest
├── save-convo-kb.log
├── kb-ingest-file.py # Generic file → kb_ingest helper (16 path rules)
└── kb-ingest-file.log
scheduled-tasks/
└── nightly-kb-sync/SKILL.md # nightly → run all sync skills
personal-rag-kb.env # KB_MCP_URL + KB_MCP_TOKEN (chmod 600)
Sync skills (in operator’s work-repo .claude/skills/)
- `slack-sync.md` — auto-ingest after save
- `email-highlights.md` — auto-ingest after save
- `meeting-sync.md` — auto-ingest after save
Crawler (in operator’s KB folder)
`crawl_direct.py` — calls `kb-ingest-file.py` after each ticket render
Schema
```sql
CREATE TABLE kb_sources (
  id            NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  source_uri    VARCHAR2(2000) NOT NULL,
  source_type   VARCHAR2(32),
  title         VARCHAR2(500),
  tags          VARCHAR2(500),                -- comma-sep
  metadata      CLOB CHECK (metadata IS JSON),
  content_hash  VARCHAR2(64),                 -- sha256, idempotency key
  ingested_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  CONSTRAINT uq_kb_sources_uri UNIQUE (source_uri)
);
CREATE INDEX idx_kb_sources_type ON kb_sources(source_type);
CREATE INDEX idx_kb_sources_ingested ON kb_sources(ingested_at);

CREATE TABLE kb_chunks (
  id         NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  source_id  NUMBER NOT NULL,
  chunk_idx  NUMBER NOT NULL,
  text       CLOB NOT NULL,
  embedding  VECTOR(384, FLOAT32),
  CONSTRAINT fk_kb_chunks_source FOREIGN KEY (source_id) REFERENCES kb_sources(id) ON DELETE CASCADE,
  CONSTRAINT uq_kb_chunks_src_idx UNIQUE (source_id, chunk_idx)
);
CREATE INDEX idx_kb_chunks_source ON kb_chunks(source_id);

CREATE VECTOR INDEX kb_emb_idx ON kb_chunks(embedding)
  ORGANIZATION INMEMORY NEIGHBOR GRAPH
  DISTANCE COSINE
  WITH TARGET ACCURACY 95;
```
Why this shape:
- `source_uri` UNIQUE + `content_hash` → ingest is idempotent (skip if content is unchanged)
- HNSW INMEMORY NEIGHBOR GRAPH → low-latency ANN search; cold build = 19 s for 45K vectors
- `ON DELETE CASCADE` → dropping a source row auto-removes its chunks
- Tag-based filter is an `INSTR()` substring match — simpler than a separate tag table and sufficient at this scale
MCP tools surface
| Tool | Args | Purpose |
|---|---|---|
| `kb_health` | — | DB ping + sources/chunks count |
| `kb_ingest` | text, source_uri, source_type, title, tags[], metadata{} | Idempotent insert/update; chunk + embed + store |
| `kb_search` | query, top_k=5, tag_filter, source_type | Semantic search; returns top-k chunks with metadata |
| `kb_stats` | — | Totals + breakdown by source_type |
Each tool is ≈30–60 LOC in server.py. The FastMCP `@mcp.tool()` decorator handles JSON Schema generation, request parsing, and response packaging.
Chunking strategy
```python
CHUNK_TOK = 256            # under e5 max_seq_length = 512
OVERLAP_TOK = 32
MAX_CHUNKS_PER_FILE = 50   # head-only cap for very long files

def chunk_text(text: str) -> list[str]:
    ids = tokenizer.encode(text, add_special_tokens=False)
    if len(ids) <= CHUNK_TOK:
        return [text]
    step = CHUNK_TOK - OVERLAP_TOK  # = 224
    chunks = []
    for start in range(0, len(ids), step):
        window = ids[start:start + CHUNK_TOK]
        chunks.append(tokenizer.decode(window, skip_special_tokens=True))
        if start + CHUNK_TOK >= len(ids):
            break
        if len(chunks) >= MAX_CHUNKS_PER_FILE:
            break
    return chunks
```
Why 256: BGE-small benchmarks showed quadratic sequence-length scaling on CPU; 256-token chunks embed roughly 3× faster than 480-token ones. Verified empirically: ~5.7 chunks/s @ 256 vs ~1.8 @ 480.
Cap 50 chunks/file: Some session transcripts run to 100K+ tokens. Head-only embedding covers the first ~12,800 tokens (~25 dense pages), which captures intent + key decisions. Trade-off accepted.
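The cap arithmetic can be sanity-checked: 50 windows advancing 224 tokens each cover 49 × 224 + 256 = 11,232 distinct tokens (the ~12,800 figure is 50 × 256, which counts overlapped tokens twice). A small helper — names are my own — that mirrors the loop in chunk_text and counts windows without decoding:

```python
CHUNK_TOK, OVERLAP_TOK, MAX_CHUNKS = 256, 32, 50
STEP = CHUNK_TOK - OVERLAP_TOK  # 224

def num_chunks(n_tokens: int) -> int:
    """How many windows chunk_text() emits for a file of n_tokens tokens."""
    if n_tokens <= CHUNK_TOK:
        return 1
    count = 0
    for start in range(0, n_tokens, STEP):
        count += 1
        # same exit conditions as chunk_text: full coverage, or head-only cap
        if start + CHUNK_TOK >= n_tokens or count >= MAX_CHUNKS:
            break
    return count
```

So a 100K-token transcript yields exactly 50 chunks, while anything at or under 256 tokens is a single chunk.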
Embedding pipeline (e5 prefix convention)
multilingual-e5-small requires task-specific prefixes for best retrieval:
```python
# Documents
embed_model.encode([f"passage: {chunk}" for chunk in chunks])

# Queries
embed_model.encode([f"query: {user_query}"])
```
Verified impact (verbatim Vietnamese query):
- BGE-en (no prefix): top hit score 0.726, wrong document
- e5-small (with prefix): top hit score 0.857, correct document
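The scores above are cosine similarities over normalized vectors (sentence-transformers can emit these directly with `normalize_embeddings=True`). A self-contained sketch of the metric itself, with toy low-dimensional vectors standing in for real 384-dim embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity — the metric behind the 0.726 / 0.857 scores above."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)
```

Identical directions score 1.0, orthogonal ones 0.0; retrieval ranks chunks by this value (or equivalently by the cosine distance 1 − similarity that Oracle's COSINE metric returns).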
Auth dual-mode
```python
# oauth_provider.py: load_access_token() falls back to the legacy bearer token
async def load_access_token(self, token: str) -> AccessToken | None:
    if secrets.compare_digest(token, LEGACY_BEARER):
        return AccessToken(token=token, client_id="_legacy_bearer", scopes=["mcp"])
    state = _load_state()
    if token not in state["access_tokens"]:
        return None
    return AccessToken.model_validate(state["access_tokens"][token])
```
This allows:
- Claude Desktop (mcp-remote bridge with bearer in header) — initial setup keeps working
- Claude.ai web (full OAuth flow with DCR + PKCE)
- Claude iOS app — inherits from claude.ai web account
- curl scripts on VM — bearer for ad-hoc admin
OAuth flow (sequence)
```mermaid
sequenceDiagram
    autonumber
    participant C as Claude.ai
    participant B as Browser
    participant S as MCP Server
    participant DB as State Store
    C->>S: GET /.well-known/oauth-authorization-server
    S-->>C: AS metadata (RFC 8414)
    C->>S: POST /register (DCR)
    S->>DB: Save client
    S-->>C: 201 {client_id}
    C->>B: Redirect to /authorize (client_id, redirect_uri, code_challenge)
    B->>S: GET /authorize
    S->>DB: Create session
    S-->>B: 302 → /login?session=sid
    B->>S: GET /login (password form)
    S-->>B: HTML form
    B->>S: POST /login (password)
    S->>S: PBKDF2 verify (200K iters)
    S->>DB: Generate auth code
    S-->>B: 302 → redirect_uri?code=...&state=...
    B->>C: Forwards code
    C->>S: POST /token (code + verifier)
    S->>S: Verify PKCE challenge
    S->>DB: Issue access_token + refresh_token
    S-->>C: 200 {access_token, refresh_token, expires_in: 3600}
    Note over C,S: Subsequent MCP calls
    C->>S: POST /mcp + Authorization: Bearer
    S->>DB: Validate token
    S-->>C: MCP response
    Note over C,S: After 1 hour
    C->>S: POST /token (grant_type=refresh_token)
    S->>DB: Rotate refresh + new access
    S-->>C: 200 {new tokens}
```
Step-by-step (text reference)
```text
1. Claude.ai → GET https://<host>/.well-known/oauth-authorization-server
   ← {issuer, authorization_endpoint, token_endpoint, registration_endpoint, ...}
2. Claude.ai → POST /register
   { redirect_uris: [...], token_endpoint_auth_method: "none" }
   ← 201 { client_id: "uuid-v4", client_id_issued_at: ... }
3. Claude.ai → GET /authorize?client_id=...&redirect_uri=...&code_challenge=...&state=...
   ← 302 Location: https://<host>/login?session=<sid>
4. Browser → GET /login?session=<sid>
   ← HTML form (dark theme, password input)
5. Browser → POST /login (form { session, password })
   Server: PBKDF2 verify; if OK: generate code, store, redirect
   ← 302 Location: <redirect_uri>?code=<code>&state=<state>
6. Claude.ai → POST /token
   { grant_type: authorization_code, code, code_verifier, redirect_uri, client_id }
   Server: load AuthCode, verify code_challenge against verifier (PKCE), generate tokens
   ← 200 { access_token, token_type: Bearer, expires_in: 3600, refresh_token, scope: mcp }
7. Claude.ai → POST /mcp with Authorization: Bearer <access_token>
   Server: load_access_token(token) → AccessToken; SDK proceeds with normal MCP
   ← MCP responses
8. (After 1 hour) Claude.ai → POST /token
   { grant_type: refresh_token, refresh_token: ... }
   ← 200 { new access_token, new refresh_token (rotated) }
```
Ingest flow (filesystem-first)
```text
[Skill / Hook / Crawler]
        │ writes
        ▼
<kb-root>/<folder>/<slug>.md          ← canonical, source of truth
        │
        │ exec python3 <hooks>/kb-ingest-file.py <path>
        ▼
[kb-ingest-file.py]
        │ reads file, classifies source_type from path, auto-tags
        │ HTTPS POST <mcp-host>/mcp  tools/call  kb_ingest
        ▼
[Server: kb_ingest tool]
        │ SHA256(content), check kb_sources.content_hash
        │ if match  → return {skipped: true}
        │ if differ → DELETE old chunks, recompute
        │ chunk(text), embed_passages(chunks)
        │ INSERT kb_sources (or UPDATE), bulk INSERT kb_chunks
        ▼
ADB 23ai
```
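The hash check at the heart of the flow can be sketched with in-memory dicts standing in for the two tables (function and variable names are illustrative, and the chunking here is a crude stand-in for chunk + embed):

```python
import hashlib

sources: dict[str, dict] = {}      # source_uri -> row; stand-in for kb_sources
chunks: dict[str, list[str]] = {}  # source_uri -> chunk texts; stand-in for kb_chunks

def kb_ingest(source_uri: str, text: str, title: str = "") -> dict:
    digest = hashlib.sha256(text.encode()).hexdigest()
    row = sources.get(source_uri)
    if row and row["content_hash"] == digest:
        return {"skipped": True}  # idempotent: unchanged content, no re-embed
    # new or changed: replace chunks wholesale (ON DELETE CASCADE does this in the DB)
    sources[source_uri] = {"content_hash": digest, "title": title}
    chunks[source_uri] = [text[i:i + 1000] for i in range(0, len(text), 1000)]
    return {"skipped": False, "chunks": len(chunks[source_uri])}
```

Re-running a hook or crawler over unchanged files therefore costs one SHA256 and one indexed lookup per file, never an embedding pass.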
Why filesystem-first: ADB free tier can be evicted (auto-stop after 7 days idle, possible data loss). The local filesystem (with Time Machine or equivalent) is durable. ADB is treated as a derived index — re-buildable from filesystem in ~3 hours via bulk_migrate.py. Worst case = re-embed cost, no data loss.
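The HTTPS POST in the flow above is a standard MCP tools/call request. A minimal payload builder — argument values are illustrative; the endpoint URL and bearer token come from personal-rag-kb.env:

```python
import json

def kb_ingest_payload(text: str, source_uri: str, source_type: str,
                      title: str, tags: list[str]) -> str:
    """JSON-RPC 2.0 body for an MCP tools/call invoking kb_ingest."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {
            "name": "kb_ingest",
            "arguments": {
                "text": text,
                "source_uri": source_uri,
                "source_type": source_type,
                "title": title,
                "tags": tags,
            },
        },
    })
```

kb-ingest-file.py would POST this body to `<mcp-host>/mcp` with `Authorization: Bearer <token>` and `Content-Type: application/json` headers.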
Performance numbers
Measured on a 4-vCPU VM @ 2.2 GHz, 16 GB RAM:
| Operation | Throughput | Notes |
|---|---|---|
| BGE-small embed @ seq=480 | 1.8 chunks/s | rejected (too slow) |
| e5-small embed @ seq=256 | 5.7 chunks/s | accepted |
| BGE-m3 embed @ seq=256 | 0.39 chunks/s | rejected (would take 32h to re-embed) |
| Bulk migrate 5,000+ files | ~95 min | sequential, single conn |
| Re-embed 47K chunks | ~131 min | resumable via checkpoint |
| Backup 5K sources / 47K chunks → tar.gz 86 MB | 129s | gzip stream |
| HNSW index build (45K vectors) | 19s | INMEMORY NEIGHBOR GRAPH |
| kb_search p50 | ~800 ms | embed + ANN + JSON serialize |
| kb_search p95 (via tunnel) | 1,160 ms | adds ~300 ms tunnel + TLS overhead |
| kb_health | <80 ms | DB ping only |
Reliability features
| Feature | How |
|---|---|
| Idempotent ingest | content_hash compare → skip duplicate uploads |
| Filesystem source-of-truth | re-buildable in ~3 h from local markdown |
| Persistent tunnel | systemd cloudflared.service with Restart=on-failure |
| Auto-restart MCP server | systemd personal-rag-kb.service with Restart=on-failure |
| Weekly backup | systemd timer kb-backup.timer (Sunday 03:00 UTC) |
| Backup restore | restore.py — DROP + recreate + JSONL reload + HNSW rebuild |
| Reconcile script | walks FS + diffs ADB hashes → re-ingests mismatched / new |
| Token rotation | refresh_token rotated on use (one-time) |
| Token TTL | access 1 h, refresh 30 d, auth code 10 min |
| GC | expired sessions/codes purged on each provider call |
Security model
| Threat | Mitigation |
|---|---|
| Public MCP endpoint exposure | Cloudflare tunnel terminates outbound from VM only — no inbound port open. Auth required. |
| Token theft | TLS only (CF-managed cert). Tokens random 32-byte urlsafe. PBKDF2 200K rounds for password. |
| Replay attack | Auth codes single-use (consumed on exchange). Refresh tokens rotated. |
| Data exfil from compromised VM | VM has only write-only PAR for backup bucket — cannot LIST/GET/DELETE existing backups. ADB password limited to single role. |
| Brute-force password | No throttling implemented — single-user; upstream Cloudflare rate limiting provides some protection. TODO: add a rate limit on /login |
| Sensitive content | Embeddings are not directly readable as text; the original text is retrievable only by the authenticated user. Single-tenant deployment. |
Reproducibility — quickstart for a forker
```bash
# 1. Provision Oracle Autonomous DB 23ai (free tier) → download wallet
# 2. Provision a small Linux VM (4-core, 16 GB)
# 3. SSH in:
sudo apt update && sudo apt install -y python3-venv
python3 -m venv /opt/ai-foundation/venv
/opt/ai-foundation/venv/bin/pip install \
    oracledb fastapi uvicorn 'mcp[cli]>=1.2' starlette sse-starlette \
    --extra-index-url https://download.pytorch.org/whl/cpu \
    torch sentence-transformers
# 4. Upload wallet to /opt/ai-foundation/wallet/, write lib/db.py with the TNS DSN
# 5. Apply the schema (see the schema section above)
# 6. Generate legacy token: openssl rand -hex 32 > .token
# 7. Generate password: openssl rand -base64 24 → PBKDF2 → .oauth_env
# 8. Drop in src/server.py, src/oauth_provider.py, src/oauth_login.py
# 9. systemd services + cloudflared tunnel + DNS record on a domain you own
# 10. Bulk-migrate from the local KB folder, or start fresh
# 11. Add a custom connector on claude.ai with URL https://yourdomain/mcp
```
Total: 1–2 hours if all credentials/wallets ready.
Future work (M3+)
- Rate limit on `/login` (5 req/min per IP)
- TOTP 2FA via `pyotp`
- Multi-user: user table, per-user `client_id` isolation, row-level filter on kb_sources
- Reranker (BGE-reranker-base) on top-20 candidates → measure Hit@5 lift
- Embedding model upgrade path: e5-base (768-dim, ~2× better quality) once the CPU is upgraded
- Hybrid search: add an Oracle Text full-text index as keyword fallback when the top vector score is <0.7
License & attribution
Personal project. Built on:
- Model Context Protocol by Anthropic
- multilingual-e5-small by Microsoft
- Oracle Database 23ai free tier
- Cloudflare Tunnel free tier