
personal-rag-kb — Implementation

Tech stack deep-dive: schema, OAuth flow, performance numbers, security model, reproducibility.

Implementation

Sister docs: PRD (intent), Architecture (system view), Notes (decision log).

TL;DR

A production-ready personal RAG system built in ~22 hours over 2 days:

  • 5,000+ sources / 47K+ chunks indexed (mixed Vietnamese + English text)
  • VECTOR(384) HNSW index on Oracle ADB 23ai free tier
  • Multilingual semantic search (VN/EN), p95 latency 1.16s end-to-end
  • MCP Streamable HTTP server with OAuth 2.0 (PKCE + DCR) — accessible from Claude Desktop, Claude.ai web, Claude iOS app
  • Persistent Cloudflare named tunnel
  • Auto-sync hooks: 4 source types automatically ingest after a markdown write
  • Weekly backup → OCI Object Storage via write-only PAR (zero credential surface on the VM)

Stack

| Layer | Component | Version | Notes |
|---|---|---|---|
| Compute | Small commodity VM | 4 vCPU / 16 GB RAM | Provider-agnostic; runs anywhere with Linux + Python |
| OS | Ubuntu | 22.04 | minimal |
| Runtime | Python | 3.10 | + venv |
| MCP SDK | mcp | 1.27.0 | Streamable HTTP transport, built-in OAuth scaffolding |
| HTTP server | uvicorn + Starlette | latest | 127.0.0.1:8080 (tunnel-fronted) |
| Embedding | sentence-transformers + multilingual-e5-small | 5.4 / hf-117M params | 384-dim, max_seq=512, ~5.7 chunks/s on 4-core CPU |
| Vector DB | Oracle Autonomous DB 23ai | free tier | VECTOR(384, FLOAT32) + HNSW INMEMORY NEIGHBOR GRAPH |
| Driver | oracledb (thin) | 3.4 | Wallet-based TLS |
| Tunnel | cloudflared | 2026.3.0 | Named tunnel, auto-restart via systemd |
| Storage backup | OCI Object Storage Standard | free 20 GB | Pre-authenticated write-only URL |
| Auth | OAuth 2.0 + PBKDF2 password | sha256, 200K iters | DCR-enabled, refresh rotation |

Directory layout

On VM (/opt/.../personal-rag-kb/)

src/
├── server.py                  # FastMCP wrapper, Starlette routes (~270 lines)
├── oauth_provider.py          # OAuthAuthorizationServerProvider impl (~230 lines)
├── oauth_login.py             # /login GET + POST routes (~80 lines)
├── bulk_migrate.py            # one-shot Day 3 migration
├── re_embed.py                # Day 3b model swap retool
├── backup.py                  # weekly dump + PAR upload
└── restore.py                 # disaster recovery (DROP+reload)

backups/                       # local cache, last 2 archives
logs/
├── server.log
├── cloudflared.log
├── backup.log
├── migrate.log
└── re_embed.log

.token                         # legacy bearer (chmod 600)
.oauth_env                     # OAuth password salt/hash + issuer (chmod 600)
.oauth_state.json              # clients/codes/tokens persistent (chmod 600)
.par_url                       # OCI write-only PAR URL (chmod 600)

On operator host (~/.claude/)

hooks/
├── save-convo.py              # Stop hook → archive AI session + auto-ingest
├── save-convo-kb.log
├── kb-ingest-file.py          # Generic file → kb_ingest helper (16 path rules)
└── kb-ingest-file.log

scheduled-tasks/
└── nightly-kb-sync/SKILL.md   # nightly → run all sync skills

personal-rag-kb.env            # KB_MCP_URL + KB_MCP_TOKEN (chmod 600)

Sync skills (in operator’s work-repo .claude/skills/)

  • slack-sync.md — auto-ingest after save
  • email-highlights.md — auto-ingest after save
  • meeting-sync.md — auto-ingest after save

Crawler (in operator’s KB folder)

  • crawl_direct.py — calls kb-ingest-file.py after each ticket render

Schema

CREATE TABLE kb_sources (
    id            NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    source_uri    VARCHAR2(2000) NOT NULL,
    source_type   VARCHAR2(32),
    title         VARCHAR2(500),
    tags          VARCHAR2(500),       -- comma-sep
    metadata      CLOB CHECK (metadata IS JSON),
    content_hash  VARCHAR2(64),        -- sha256, idempotency key
    ingested_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT uq_kb_sources_uri UNIQUE (source_uri)
);
CREATE INDEX idx_kb_sources_type     ON kb_sources(source_type);
CREATE INDEX idx_kb_sources_ingested ON kb_sources(ingested_at);

CREATE TABLE kb_chunks (
    id          NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    source_id   NUMBER NOT NULL,
    chunk_idx   NUMBER NOT NULL,
    text        CLOB NOT NULL,
    embedding   VECTOR(384, FLOAT32),
    CONSTRAINT fk_kb_chunks_source FOREIGN KEY (source_id) REFERENCES kb_sources(id) ON DELETE CASCADE,
    CONSTRAINT uq_kb_chunks_src_idx UNIQUE (source_id, chunk_idx)
);
CREATE INDEX idx_kb_chunks_source ON kb_chunks(source_id);

CREATE VECTOR INDEX kb_emb_idx ON kb_chunks(embedding)
    ORGANIZATION INMEMORY NEIGHBOR GRAPH
    DISTANCE COSINE
    WITH TARGET ACCURACY 95;

Why this shape:

  • source_uri UNIQUE + content_hash → ingest is idempotent (skip if same content)
  • HNSW INMEMORY NEIGHBOR GRAPH → low-latency ANN search; cold build = 19s for 45K vectors
  • ON DELETE CASCADE → drop a source row → chunks auto-removed
  • Tag-based filter is INSTR() substring match — simpler than separate tag table; sufficient at this scale

MCP tools surface

| Tool | Args | Purpose |
|---|---|---|
| kb_health | (none) | DB ping + sources/chunks count |
| kb_ingest | text, source_uri, source_type, title, tags[], metadata{} | Idempotent insert/update; chunk + embed + store |
| kb_search | query, top_k=5, tag_filter, source_type | Semantic search; returns top-k chunks with metadata |
| kb_stats | (none) | Totals + breakdown by source_type |

Each tool is ≈30–60 LOC in server.py; FastMCP's @mcp.tool() decorator handles JSON Schema generation, request parsing, and response packaging.

Chunking strategy

CHUNK_TOK = 256            # under e5 max_seq_length=512
OVERLAP_TOK = 32
MAX_CHUNKS_PER_FILE = 50   # head-only cap for very long transcripts

def chunk_text(text: str) -> list[str]:
    ids = tokenizer.encode(text, add_special_tokens=False)  # e5 tokenizer
    if len(ids) <= CHUNK_TOK:
        return [text]
    step = CHUNK_TOK - OVERLAP_TOK   # = 224-token stride
    chunks = []
    for start in range(0, len(ids), step):
        window = ids[start:start + CHUNK_TOK]
        chunks.append(tokenizer.decode(window, skip_special_tokens=True))
        if start + CHUNK_TOK >= len(ids):
            break
        if len(chunks) >= MAX_CHUNKS_PER_FILE:
            break
    return chunks

Why 256: BGE-small benchmarks showed quadratic seq-len scaling on CPU. 256 tokens = 3× faster than 480 tokens. Verified empirically: ~5.7 chunks/s @ 256 vs ~1.8 @ 480.

Cap 50 chunks/file: Some session transcripts run to 100K+ tokens. With a 224-token stride, head-only embedding covers roughly the first 11K tokens (~22 dense pages), which captures intent + key decisions. Trade-off accepted.
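The window arithmetic can be checked independently of the tokenizer. This sketch (`chunk_windows` is hypothetical) reproduces only the start/end offsets the chunker walks:

```python
CHUNK_TOK = 256
OVERLAP_TOK = 32
MAX_CHUNKS_PER_FILE = 50

def chunk_windows(n_tokens: int) -> list[tuple[int, int]]:
    """(start, end) token offsets produced for a document of n_tokens tokens."""
    if n_tokens <= CHUNK_TOK:
        return [(0, n_tokens)]
    step = CHUNK_TOK - OVERLAP_TOK  # 224-token stride
    windows: list[tuple[int, int]] = []
    for start in range(0, n_tokens, step):
        windows.append((start, min(start + CHUNK_TOK, n_tokens)))
        if start + CHUNK_TOK >= n_tokens or len(windows) >= MAX_CHUNKS_PER_FILE:
            break
    return windows
```

For a 100K-token transcript this caps at 50 windows, the last one covering tokens 10,976–11,232; adjacent windows overlap by exactly 32 tokens.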

Embedding pipeline (e5 prefix convention)

multilingual-e5-small requires task-specific prefixes for best retrieval:

# Documents
embed_model.encode([f"passage: {chunk}" for chunk in chunks])

# Queries
embed_model.encode([f"query: {user_query}"])

Verified impact (verbatim Vietnamese query):

  • BGE-en (no prefix): top hit score 0.726, wrong document
  • e5-small (with prefix): top hit score 0.857, correct document

Auth dual-mode

# oauth_provider.py: load_access_token() falls back to the legacy bearer
import secrets

async def load_access_token(self, token: str) -> AccessToken | None:
    if secrets.compare_digest(token, LEGACY_BEARER):
        return AccessToken(token=token, client_id="_legacy_bearer", scopes=["mcp"])
    state = _load_state()
    record = state["access_tokens"].get(token)
    return AccessToken.model_validate(record) if record else None

This allows:

  • Claude Desktop (mcp-remote bridge with bearer in header) — initial setup keeps working
  • Claude.ai web (full OAuth flow with DCR + PKCE)
  • Claude iOS app — inherits from claude.ai web account
  • curl scripts on VM — bearer for ad-hoc admin

OAuth flow (sequence)

sequenceDiagram
    autonumber
    participant C as Claude.ai
    participant B as Browser
    participant S as MCP Server
    participant DB as State Store

    C->>S: GET /.well-known/oauth-authorization-server
    S-->>C: AS metadata (RFC 8414)

    C->>S: POST /register (DCR)
    S->>DB: Save client
    S-->>C: 201 {client_id}

    C->>B: Redirect to /authorize?client_id, redirect_uri, code_challenge
    B->>S: GET /authorize
    S->>DB: Create session
    S-->>B: 302 → /login?session=sid
    B->>S: GET /login (password form)
    S-->>B: HTML form
    B->>S: POST /login (password)
    S->>S: PBKDF2 verify (200K iters)
    S->>DB: Generate auth code
    S-->>B: 302 → redirect_uri?code=...&state=...
    B->>C: Forwards code
    C->>S: POST /token (code + verifier)
    S->>S: Verify PKCE challenge
    S->>DB: Issue access_token + refresh_token
    S-->>C: 200 {access_token, refresh_token, expires_in: 3600}
    Note over C,S: Subsequent MCP calls
    C->>S: POST /mcp + Authorization: Bearer
    S->>DB: Validate token
    S-->>C: MCP response
    Note over C,S: After 1 hour
    C->>S: POST /token (grant_type=refresh_token)
    S->>DB: Rotate refresh + new access
    S-->>C: 200 {new tokens}

Step-by-step (text reference)

1. Claude.ai → GET https://<host>/.well-known/oauth-authorization-server
   ← {issuer, authorization_endpoint, token_endpoint, registration_endpoint, ...}

2. Claude.ai → POST /register
   { redirect_uris: [...], token_endpoint_auth_method: "none" }
   ← 201 { client_id: "uuid-v4", client_id_issued_at: ... }

3. Claude.ai → GET /authorize?client_id=...&redirect_uri=...&code_challenge=...&state=...
   ← 302 Location: https://<host>/login?session=<sid>

4. Browser → GET /login?session=<sid>
   ← HTML form (dark theme, password input)

5. Browser → POST /login (form { session, password })
   Server: PBKDF2 verify; if OK: generate code, store, redirect
   ← 302 Location: <redirect_uri>?code=<code>&state=<state>

6. Claude.ai → POST /token
   { grant_type: authorization_code, code, code_verifier, redirect_uri, client_id }
   Server: load AuthCode, verify code_challenge against verifier (PKCE), generate tokens
   ← 200 { access_token, token_type: Bearer, expires_in: 3600, refresh_token, scope: mcp }

7. Claude.ai → POST /mcp with Authorization: Bearer <access_token>
   Server: load_access_token(token) → AccessToken; SDK proceeds with normal MCP
   ← MCP responses

8. (After 1 hour) Claude.ai → POST /token
   { grant_type: refresh_token, refresh_token: ... }
   ← 200 { new access_token, new refresh_token (rotated) }
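The PKCE check in step 6 (S256 method, RFC 7636) amounts to recomputing the challenge from the verifier and comparing in constant time. A minimal sketch, independent of the SDK (function names are assumptions):

```python
import base64
import hashlib
import secrets

def make_pkce_pair() -> tuple[str, str]:
    """Client side: random verifier + its S256 code_challenge."""
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode("ascii")
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return verifier, challenge

def verify_pkce(code_verifier: str, stored_challenge: str) -> bool:
    """Server side at /token: recompute S256 over the verifier and compare."""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    expected = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return secrets.compare_digest(expected, stored_challenge)
```

Because the challenge was bound to the auth code at /authorize time, a stolen code is useless without the matching verifier.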

Ingest flow (filesystem-first)

[Skill / Hook / Crawler]
        │ writes
        ▼
<kb-root>/<folder>/<slug>.md     ← canonical, source-of-truth
        │ exec python3 <hooks>/kb-ingest-file.py <path>
        ▼
[kb-ingest-file.py]
        │ reads file, classifies source_type from path, auto-tags
        │ HTTPS POST <mcp-host>/mcp tools/call kb_ingest
        ▼
[Server: kb_ingest tool]
        │ SHA256(content), check kb_sources.content_hash
        │   if match  → return {skipped: true}
        │   if differ → DELETE old chunks, recompute
        │ chunk(text), embed_passages(chunks)
        │ INSERT kb_sources (or UPDATE), bulk INSERT kb_chunks
        ▼
[ADB 23ai]
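The "classifies source_type from path" step can be sketched as an ordered rule table (the real kb-ingest-file.py has 16 path rules; the three below are invented for illustration):

```python
from pathlib import Path

# Hypothetical rules — first matching path fragment wins.
PATH_RULES = [
    ("conversations/", "ai_session"),
    ("meetings/", "meeting"),
    ("email/", "email"),
]

def classify(path: str) -> str:
    """Map a markdown file path to a kb_sources.source_type value."""
    p = Path(path).as_posix()
    for fragment, source_type in PATH_RULES:
        if fragment in p:
            return source_type
    return "note"  # fallback type for unmatched paths
```

Keeping the rules as plain data makes adding a 17th source type a one-line change.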

Why filesystem-first: ADB free tier can be evicted (auto-stop after 7 days idle, possible data loss). The local filesystem (with Time Machine or equivalent) is durable. ADB is treated as a derived index — re-buildable from filesystem in ~3 hours via bulk_migrate.py. Worst case = re-embed cost, no data loss.

Performance numbers

Measured on a 4-vCPU VM @ 2.2 GHz, 16 GB RAM:

| Operation | Throughput / Time | Notes |
|---|---|---|
| BGE-small embed @ seq=480 | 1.8 chunks/s | rejected (too slow) |
| e5-small embed @ seq=256 | 5.7 chunks/s | accepted |
| BGE-m3 embed @ seq=256 | 0.39 chunks/s | rejected (would take 32 h to re-embed) |
| Bulk migrate 5,000+ files | ~95 min | sequential, single connection |
| Re-embed 47K chunks | ~131 min | resumable via checkpoint |
| Backup 5K sources / 47K chunks → 86 MB tar.gz | 129 s | gzip stream |
| HNSW index build (45K vectors) | 19 s | INMEMORY NEIGHBOR GRAPH |
| kb_search p50 | ~800 ms | embed + ANN + JSON serialize |
| kb_search p95 (via tunnel) | 1,160 ms | adds ~300 ms tunnel + TLS overhead |
| kb_health | <80 ms | DB ping only |

Reliability features

| Feature | How |
|---|---|
| Idempotent ingest | content_hash compare → skip duplicate uploads |
| Filesystem source-of-truth | re-buildable in ~3 h from local markdown |
| Persistent tunnel | systemd cloudflared.service with Restart=on-failure |
| Auto-restart MCP server | systemd personal-rag-kb.service with Restart=on-failure |
| Weekly backup | systemd timer kb-backup.timer (Sunday 03:00 UTC) |
| Backup restore | restore.py — DROP + recreate + JSONL reload + HNSW rebuild |
| Reconcile script | walks FS + diffs ADB hashes → re-ingests mismatched / new |
| Token rotation | refresh_token rotated on use (one-time) |
| Token TTL | access 1 h, refresh 30 d, auth code 10 min |
| GC | expired sessions/codes purged on each provider call |
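The weekly-backup row can be sketched as a service/timer pair. The unit bodies and paths below are assumptions; only the names (kb-backup.timer, backup.py) and the Sunday 03:00 UTC cadence come from this doc:

```ini
# /etc/systemd/system/kb-backup.service (sketch)
[Unit]
Description=Weekly KB dump + write-only PAR upload

[Service]
Type=oneshot
ExecStart=/opt/ai-foundation/venv/bin/python src/backup.py

# /etc/systemd/system/kb-backup.timer (sketch)
[Unit]
Description=Weekly KB backup schedule

[Timer]
OnCalendar=Sun *-*-* 03:00:00 UTC
Persistent=true

[Install]
WantedBy=timers.target
```

Persistent=true makes systemd run a missed backup at next boot if the VM was down at 03:00.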

Security model

| Threat | Mitigation |
|---|---|
| Public MCP endpoint exposure | Cloudflare tunnel makes only outbound connections from the VM — no inbound port open. Auth required. |
| Token theft | TLS only (CF-managed cert). Tokens are random 32-byte urlsafe values. PBKDF2 (200K rounds) for the password. |
| Replay attack | Auth codes are single-use (consumed on exchange). Refresh tokens rotated. |
| Data exfil from compromised VM | VM holds only a write-only PAR for the backup bucket — it cannot LIST/GET/DELETE existing backups. ADB password limited to a single role. |
| Brute-force password | No throttling implemented — single-user; upstream Cloudflare rate limiting provides some protection. TODO: add rate limit on /login. |
| Sensitive content | Embeddings are not trivially reversible; original text is returned only to the authenticated user. Single-tenant deployment. |

Reproducibility — quickstart for a forker

# 1. Provision Oracle Autonomous DB 23ai (free tier) → download wallet
# 2. Provision a small Linux VM (4-core, 16 GB)
# 3. SSH:
sudo apt update && sudo apt install -y python3-venv
python3 -m venv /opt/ai-foundation/venv
/opt/ai-foundation/venv/bin/pip install \
  oracledb fastapi uvicorn 'mcp[cli]>=1.2' starlette sse-starlette \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  torch sentence-transformers
# 4. Upload wallet to /opt/ai-foundation/wallet/, write lib/db.py with TNS DSN
# 5. Apply schema (see schema section above)
# 6. Generate token: openssl rand -hex 32 > .token
# 7. Generate password: openssl rand -base64 24 → PBKDF2 → .oauth_env
# 8. Drop in src/server.py, src/oauth_provider.py, src/oauth_login.py
# 9. systemd service + cloudflared tunnel + DNS record on a domain you own
# 10. Bulk migrate from local KB folder, or start fresh
# 11. Add custom connector on claude.ai with URL https://yourdomain/mcp

Total: 1–2 hours if all credentials and wallets are ready.

Future work (M3+)

  • Rate limit on /login (5 req/min per IP)
  • TOTP 2FA via pyotp
  • Multi-user (user table, per-user client_id isolation, row-level filter on kb_sources)
  • Reranker (BGE-reranker-base) on top-20 candidates → measure Hit@5 lift
  • Embedding model upgrade path: e5-base (768-dim, ~2× better quality) when CPU upgrade
  • Hybrid search: add Oracle Text full-text index for keyword fallback when vector returns score <0.7

License & attribution

Personal project. Built on: