
personal-rag-kb — Enterprise

How this personal RAG architecture adapts to enterprise use cases — five concrete adaptations with deltas, scale numbers, cost models, and risks.

Enterprise patterns

The personal version is the strictest constraint case: single-user, free-tier, no compliance requirements. Relaxing those constraints unlocks B2B applications without rewriting the architecture. This page documents five concrete adaptations.

What stays vs. what changes

The hot path — chunk → embed → vector ANN search → MCP response — is identical across all enterprise use cases below. The deltas are around identity, isolation, governance, and ingest velocity, not around retrieval mechanics.
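For reference, a minimal sketch of the query side of that hot path, assuming the Postgres + pgvector storage option from the matrix below and an e5-family embedder. Table and column names (kb_chunks, embedding, content) and the model checkpoint are illustrative, not the project's actual schema.

```python
# Minimal hot-path sketch: embed the query, ANN-search kb_chunks, return the top chunks.
# Assumes Postgres + pgvector; kb_chunks(chunk_id, content, embedding) is an illustrative schema.
from sentence_transformers import SentenceTransformer
import psycopg2  # conn below would come from psycopg2.connect(dsn)

model = SentenceTransformer("intfloat/e5-small-v2")  # one plausible e5-small checkpoint

def kb_search(conn, query: str, top_k: int = 5):
    vec = model.encode(f"query: {query}", normalize_embeddings=True)  # e5 query prefix
    vec_literal = "[" + ",".join(f"{x:.6f}" for x in vec) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT chunk_id, content, 1 - (embedding <=> %s::vector) AS score
            FROM kb_chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vec_literal, vec_literal, top_k),
        )
        return cur.fetchall()
```

Everything after this point in the page is about who may call this, over what corpus, and how fast that corpus stays fresh.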

Migration matrix: Personal → Enterprise

Aspect | Personal | Enterprise
Tenants | 1 user | N tenants, hard-isolated by tenant_id row filter + per-tenant API key
Auth | Static password + OAuth (PKCE+DCR) | SSO (SAML 2.0 / OIDC) + per-user OAuth, MFA enforced
Authorization | All access = full | RBAC tags + row-level security on kb_chunks (department / sensitivity / project)
Audit | None | Per-query log: who, when, what query, what chunks returned, response latency
Compliance | N/A | SOC2 Type II, GDPR DSR support, data residency, retention policies, encryption keys (BYOK / KMS)
Storage | ADB 23ai free 20 GB | ADB paid tier or self-hosted Postgres + pgvector; tenant data in separate schemas or DBs
Embedder | CPU e5-small (5.7 chunks/s) | GPU batch inference cluster (50–200 chunks/s/node) for bulk ingest; per-tenant model selection
Ingest velocity | Manual hooks + nightly cron | Real-time CDC from Confluence/Slack/Salesforce/Zendesk via webhooks → queue (SQS/PubSub) → worker pool
Search index | HNSW INMEMORY single instance | Distributed (pgvector partitioning, or managed: Pinecone/Qdrant Cloud/OCI Search), tenant-scoped
Latency SLA | <2s p95 best-effort | <500ms p95 with retries, regional replicas, edge caching for popular queries
Cost model | $0 (free tiers) | Per-seat ($X/user/month) or volume-based (chunks ingested, queries served), 3-tier pricing
Deployment | 1 VM + 1 ADB | Multi-region active-active, infra-as-code (Terraform), rolling deploys, observability stack
Support | None | 24/7 P1 SLA, dedicated Slack/Teams channel, customer success manager

The architecture diagram doesn’t change — only labels on components scale up.


Use case A — Internal employee knowledge base (“Where’s that doc?”)

Problem

Mid-to-large companies (500–10,000 employees) accumulate Confluence, SharePoint, Google Docs, internal wikis, runbooks, post-mortems, OKR docs, and meeting recordings. Search across these is fragmented (each tool's search is keyword-only), siloed (per-team SharePoint), and stale (deleted documents still appear). New hires take 3–6 months to learn "where to look for X". Existing employees waste 15–30 minutes per "I know we wrote this somewhere" hunt. At 15 minutes × 2 hunts/day × 5,000 employees × $50/hour over ~240 working days, that's ~$30M/year of pure friction in a typical mid-sized SaaS.

Persona

Every knowledge worker. Especially: new hires (weeks 1–8), cross-functional team leads, and support escalation engineers.

What changes from personal version

  • Sources scaled up: Confluence (10K–100K pages) + Google Drive (1M+ docs) + SharePoint + GitHub wiki + Slack channels (read-only). Per-source connector with delta sync (only changed docs since last poll).
  • Tenant model: 1 tenant per company. Hard isolation by tenant_id on every query.
  • RBAC: each chunk inherits source ACL — if a Confluence page is restricted to “Engineering” space, only employees with that AD/Okta group see it in kb_search results. Implemented as row-level security predicate on kb_chunks.
  • Identity: SAML SSO from Okta/Azure AD/Google Workspace. User’s group memberships injected into JWT, used as filter.
  • Compliance: GDPR DSR — when employee leaves, their personal contributions remain (org owns the IP), but their query log is purged after 90 days. Data residency: per-tenant region pinning.
  • Search-quality investments: hybrid search (BM25 + vector) is now worth the complexity — keyword matching for “PR-1234” or product code names where semantic embed is weak. Reranker (Cohere/BGE-reranker-base) on top-20 → +15% Hit@5 measured on internal eval.
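A sketch of what that hybrid path can look like: reciprocal-rank fusion over the two candidate lists, then a cross-encoder rerank of the fused top 20. bm25_search, vector_search, and load_chunk are assumed wrappers around the existing keyword and ANN queries, and the reranker checkpoint is one plausible choice rather than a fixed dependency.

```python
# Hybrid retrieval sketch: fuse BM25 and vector candidates (reciprocal-rank fusion),
# then rerank the fused top 20 with a cross-encoder before returning top_k.
# bm25_search / vector_search / load_chunk are assumed helpers, not existing project code.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def hybrid_search(query: str, top_k: int = 5):
    fused: dict[str, float] = {}
    for results in (bm25_search(query, limit=20), vector_search(query, limit=20)):
        for rank, chunk_id in enumerate(results):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (60 + rank)  # RRF, k=60
    candidates = sorted(fused, key=fused.get, reverse=True)[:20]
    chunks = [load_chunk(cid) for cid in candidates]          # assumed: returns {"text": ...}
    scores = reranker.predict([(query, c["text"]) for c in chunks])
    reranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in reranked[:top_k]]
```

RRF keeps exact-match hits such as "PR-1234" from being drowned by semantic neighbors, while the reranker does the fine-grained ordering.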

Architecture impact

  • Add kb_acl table: (chunk_id, allowed_groups[]). Enforce in the kb_search SQL (see the sketch after this list).
  • Switch ingest from manual hooks to CDC pipeline: webhooks from Confluence/Drive → Kafka → embed worker pool → ADB write.
  • Add redaction layer at ingest: strip PII (employee SSN, salaries) using regex + LLM classifier before embedding.
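A sketch of how the ACL check can sit inside the search query itself, assuming Postgres array columns and a groups claim in the SSO JWT. Table and column names follow the bullets above but remain illustrative, and the claim name depends on the IdP configuration.

```python
# ACL-enforced search sketch: the caller's group claims (from the SSO JWT) become a SQL
# filter, so restricted chunks never leave the database. kb_acl(chunk_id, allowed_groups[])
# matches the bullet above; the "groups" claim name is an assumption about the IdP config.
ACL_SEARCH_SQL = """
SELECT c.chunk_id, c.content, 1 - (c.embedding <=> %(qvec)s::vector) AS score
FROM kb_chunks c
JOIN kb_acl a ON a.chunk_id = c.chunk_id
WHERE c.tenant_id = %(tenant_id)s
  AND a.allowed_groups && %(groups)s::text[]   -- array overlap: caller shares at least one group
ORDER BY c.embedding <=> %(qvec)s::vector
LIMIT %(top_k)s
"""

def kb_search_with_acl(conn, tenant_id: str, jwt_claims: dict, qvec_literal: str, top_k: int = 5):
    params = {
        "qvec": qvec_literal,
        "tenant_id": tenant_id,
        "groups": jwt_claims.get("groups", []),   # e.g. ["engineering", "sre"] from Okta/AD
        "top_k": top_k,
    }
    with conn.cursor() as cur:
        cur.execute(ACL_SEARCH_SQL, params)
        return cur.fetchall()
```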

Cost & scale (10,000-employee example)

  • Documents: ~500K, ~5M chunks
  • Storage: ~20 GB embeddings + ~50 GB CLOB → managed Postgres with pgvector ($300/mo) or ADB paid ($450/mo)
  • Embed throughput: 1 GPU node (T4) handles 100 chunks/s → 5M chunks bulk = 14 hours one-time, then ~50 chunks/s steady-state for delta sync
  • Query volume: 50K queries/day → ~1 query/employee/day average
  • All-in cost: ~$1,500/mo infra + ~$2K/mo dev maintenance → ~$42K/year vs ~$30M/year of friction ≈ 700× ROI

Risks

  • Stale ACL: source ACL changes (employee leaves group) must propagate to KB index within hours. Solved with periodic ACL refresh job, not at query time (too slow).
  • Index drift from source: hard-deleted source docs need tombstones in KB. Implement soft-delete with TTL.
  • Hallucination on absent info: when KB doesn’t have an answer, system must say “no match” rather than improvise. Tune Claude system prompt + return empty results when top score <0.6.
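One way to wire that last point, using the 0.6 floor from the bullet above; the response shape is illustrative.

```python
# No-match guard sketch: below the similarity floor, return nothing so the system prompt
# can instruct Claude to answer "not in the KB" instead of improvising.
NO_MATCH_THRESHOLD = 0.6   # floor from the bullet above; tune against the eval set

def grounded_results(hits: list[dict]) -> dict:
    confident = [h for h in hits if h["score"] >= NO_MATCH_THRESHOLD]
    if not confident:
        return {"results": [], "note": "no_match"}   # client renders an explicit "no answer"
    return {"results": confident}
```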

Use case B — Customer support agent grounding

Problem

Support agents handle tickets that often have answers buried in: past tickets, KB articles, product docs, recent release notes, internal runbooks. Today an agent searches each system separately, copy-pastes context into the ticket, switches windows. Average resolution time for “I’ve seen this before” tickets is 30–45 minutes; with grounded retrieval it could be 5–10 minutes. Industry benchmark: 40% of support tickets are “previously-answered” patterns.

Persona

L1/L2 support engineers. Customer success managers. Solutions consultants.

What changes from personal version

  • Latency-critical: agents type a ticket draft → system returns top-5 grounded suggestions inline. Target p95 <500ms (vs 1.16s personal). Achieved by:
    • Edge cache for popular query embeddings (Cloudflare Workers KV)
    • Pre-computed embeddings for the 1000 most common ticket patterns
    • Co-locate inference + DB in same region
  • Source freshness SLA: a customer escalates an issue at 14:00; the engineer fixes + writes runbook at 15:30; another agent hits the same issue at 16:00 and MUST get the new runbook. Webhook-driven ingest within 60 seconds of source write.
  • Citation enforcement: every Claude answer must cite its sources. The UI shows clickable chunks the agent can verify before sending to the customer. If there is no citation, the answer is rejected at the application layer (see the check sketched after this list).
  • Per-tenant tone/policy injection: each customer’s support team injects a custom system prompt (“never promise refunds without manager approval”, “always escalate to security if customer mentions GDPR”).
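A sketch of that application-layer check. The [chunk:ID] citation format is an assumption about how the system prompt asks Claude to cite, not an existing convention.

```python
# Citation-enforcement sketch: reject an answer that cites none of the retrieved chunk IDs.
# The [chunk:ID] marker format is assumed to be set by the system prompt.
import re

def enforce_citations(answer: str, retrieved_ids: set[str]) -> str:
    cited = set(re.findall(r"\[chunk:([\w-]+)\]", answer))
    if not cited & retrieved_ids:
        raise ValueError("answer rejected: no citation to a retrieved chunk")
    return answer
```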

Architecture impact

  • Add a query-time cache layer (Redis): hash of the query → cached top-K result, TTL 5 minutes. Hit rate ~60% for common patterns (see the sketch after this list).
  • Streaming response: as Claude generates the answer token-by-token, stream chunks to the agent’s UI. Reduces perceived latency.
  • Feedback loop: every ticket close includes “was this suggestion useful?” thumbs up/down. Negative feedback → demote that chunk’s score for next 24h, escalate to KB owner for review.
  • Quality eval: weekly run 100 sampled queries against a labeled gold set, track Hit@5 + answer accuracy. Alert if drops >5% week-over-week.
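A sketch of that cache wrapper; key layout, TTL, and serialization are illustrative.

```python
# Query-cache sketch: normalized query text hashes to a cached top-K result with a 5-minute
# TTL, keyed per tenant. An "incident mode" can flush by deleting the tenant's key prefix.
import hashlib
import json
import redis

r = redis.Redis()

def cached_search(tenant_id: str, query: str, search_fn, ttl_s: int = 300):
    key = f"kbcache:{tenant_id}:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)                    # warm path: no embed, no DB round trip
    results = search_fn(query)                    # cold path: normal retrieval
    r.setex(key, ttl_s, json.dumps(results))
    return results
```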

Cost & scale (50-agent SaaS support team)

  • KB volume: 10K KB articles + 100K historical tickets + 5K runbooks → ~500K chunks
  • Query volume: 50 agents × 100 queries/day = 5,000 queries/day
  • Storage: ~5 GB embeddings → ADB paid ($150/mo)
  • Inference: minimal — most queries hit cache, cold path is ~$0.10/query Claude usage
  • All-in: ~$2K/mo infra + Claude API + Redis → at a 30% productivity gain on $80K avg agent salary × 50 agents = $1.2M/year value vs ~$24K/year cost ≈ 50× ROI

Risks

  • Stale cached suggestions during incidents: when the product itself is broken, old runbooks may suggest workarounds that no longer apply. Implement “incident mode” that flushes cache + injects warning banner.
  • Customer PII in chunks: ticket history contains customer names, account IDs. Embed must redact before vector store. Use LLM-based PII detector at ingest.
  • Vendor lock-in to Claude: design an abstraction layer so swapping to GPT/Llama is a config change, not a rewrite. This also preserves negotiating leverage.

Use case C — Sales enablement

Problem

Account executives juggle competitive intel (“how do we differentiate vs Competitor X?”), deal histories (“what concessions did we give Acme last year?”), product specs (“does our enterprise tier include SSO?”), pricing sheets, case studies. Today this lives in 8+ tools (Salesforce + Highspot + Confluence + Drive + Gong recordings). On a live discovery call, the AE has 30 seconds to find the right answer or lose credibility. Win rate correlates with response specificity — generic answers lose 23% more deals (Gartner 2024 study).

Persona

Account executives, sales engineers, sales operations, RevOps. Bonus: marketing teams aligning competitive narratives.

What changes from personal version

  • CRM integration: pull deal context (current account, opportunity stage, products discussed) → use as additional query context. “Show me cases where we beat Competitor X for a 5K-seat customer in healthcare” — tenant_id, industry, competitor are structured filters.
  • Multi-modal: ingest sales call recordings (Gong/Chorus) → speech-to-text → embed transcript chunks. Now AE can search “what did we promise on demo last Tuesday”.
  • Recency-weighted scoring: a 6-month-old pricing doc is less reliable than last week’s. Add recency_boost = exp(-age_days / 90) factor in ranking.
  • Battle card auto-generation: nightly job pre-builds competitor battle cards from indexed content, surfaces in MCP tool kb_battle_card(competitor_name).
  • Privacy partition: deal-stage-restricted chunks (M&A confidential, executive comp) only visible to specific roles.

Architecture impact

  • Add a structured filter layer: combine vector search with SQL WHERE on metadata (industry, competitor, deal_stage, pub_date within the last 6 months). Implemented as kb_search_v2(query, filters: dict), sketched after this list.
  • Speech-to-text ingest pipeline: Gong webhook → AWS Transcribe → chunk by speaker turn → embed → store with source_type='sales_call'.
  • Real-time alert: when a battle-card-relevant page is updated, push notification to subscribed AEs.
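A sketch of kb_search_v2: whitelisted metadata filters narrow the candidate set in SQL, and the recency boost exp(-age_days / 90) from the previous list is folded into the final score. Metadata column names are illustrative, and pub_date is assumed to be a DATE column.

```python
# kb_search_v2 sketch: structured filters narrow the candidate set, then a recency boost
# reweights the similarity score. The whitelist keeps filter keys out of the SQL string.
import math
from datetime import date

FILTERABLE = {"industry", "competitor", "deal_stage", "source_type"}   # illustrative columns

def kb_search_v2(conn, qvec_literal: str, filters: dict, top_k: int = 5):
    clauses, params = [], {"qvec": qvec_literal, "limit": top_k * 4}   # over-fetch, boost, cut
    for col, val in filters.items():
        if col in FILTERABLE:
            clauses.append(f"AND {col} = %({col})s")
            params[col] = val
    sql = f"""
        SELECT chunk_id, content, pub_date, 1 - (embedding <=> %(qvec)s::vector) AS sim
        FROM kb_chunks
        WHERE 1=1 {' '.join(clauses)}
        ORDER BY embedding <=> %(qvec)s::vector
        LIMIT %(limit)s
    """
    with conn.cursor() as cur:
        cur.execute(sql, params)
        rows = cur.fetchall()
    scored = [(sim * math.exp(-(date.today() - pub_date).days / 90), cid, content)
              for cid, content, pub_date, sim in rows]
    return sorted(scored, reverse=True)[:top_k]
```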

Cost & scale (200-rep org)

  • Sources: 5K Salesforce records, 10K Confluence pages, 50K call hours/year (3,000 hours of transcripts) → ~1M chunks
  • Query volume: 200 reps × 30 queries/day = 6,000 queries/day
  • Transcription: $0.024/min × 50K hours/year = $72K/year (use Whisper self-hosted to drop to $5K/year)
  • Storage + compute: ~$2K/mo
  • All-in: ~$4K/mo + transcription → at 5% win rate uplift on 200 reps × $2M quota = +$20M ARR vs $48K cost = 400× ROI

Risks

  • Reps relying on RAG > training: junior AEs may stop learning the product. Mitigate with quarterly “no RAG” assessments; RAG is augmentation, not replacement.
  • Out-of-date pricing surfacing: outdated contract terms quoted on a live call = legal exposure. Hard-delete deprecated docs, add validity-window metadata, alert on stale-source citations.
  • Competitor mention as training signal: be careful that competitor intel ingested doesn’t accidentally fine-tune embedder in a way that biases scoring. Keep embedder frozen, only the ranking layer learns.

Use case D — Onboarding new hires

Problem

A new engineer joins on Monday. By Friday they need to: understand the product domain, know who owns what, navigate the codebase, find runbooks, ramp on team norms, attend 15+ “intro to X” sessions. Companies measure time-to-first-PR (typically 2–6 weeks); top performers ship in 5 days. The bottleneck isn’t ability — it’s information access.

Persona

New hires (eng / PM / design / sales / support). HR / People Ops who run the onboarding program. Engineering managers responsible for ramp.

What changes from personal version

  • Curated curriculum collection: a separate kb_curriculum table marks “essential reading” with ordering, role targeting, and estimated read time. MCP tool kb_onboarding_path(role, day) returns the day-N reading list (sketched after this list).
  • Conversational tutor mode: rather than just retrieve, the system Socratic-tutors. New hire asks “what’s our deployment process?” → Claude explains with citations, then asks “want to walk through a sample deploy together?”. Behind: same MCP retrieval, different system prompt + multi-turn context.
  • Progress tracking: which docs has the new hire read, which they’ve asked about, what gaps remain. Surfaces to manager in weekly digest.
  • Glossary intelligence: company-specific acronyms (e.g. “PCF”, “NLP-onboarding”) are auto-detected and explained inline. Built as separate kb_glossary table populated from a wiki page + LLM-extracted from past docs.
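A sketch of that tool, assuming the FastMCP helper from the Python MCP SDK and a kb_curriculum table with (role, day, chunk_id, title, est_read_min, position) columns. The db() connection helper is also an assumption.

```python
# kb_onboarding_path sketch, registered as an MCP tool via the Python SDK's FastMCP helper.
# kb_curriculum columns and the db() connection helper are assumptions, not existing code.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("kb")

@mcp.tool()
def kb_onboarding_path(role: str, day: int) -> list[dict]:
    """Return the curated day-N reading list for a role, in curriculum order."""
    with db() as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT title, chunk_id, est_read_min
            FROM kb_curriculum
            WHERE role = %s AND day = %s
            ORDER BY position
            """,
            (role, day),
        )
        return [{"title": t, "chunk_id": c, "est_read_min": m} for t, c, m in cur.fetchall()]
```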

Architecture impact

  • Add user_progress table tracking which chunk_ids a user has been served. Compute “blind spots” = high-importance docs they haven’t seen.
  • Conversation memory: per-user Redis store of last 10 query-answer pairs. Inject into Claude context for continuity (“you asked about deploys yesterday; today let’s cover monitoring”).
  • Manager dashboard: aggregate metrics — avg ramp velocity, most-asked questions per cohort. Useful for HR to identify documentation gaps (a question asked by 50% of new hires = a missing doc).

Cost & scale (200 new hires/year)

  • Sources: 5K onboarding docs + 1K runbooks + 500 architecture docs + Slack-archived “intro” channels
  • Query volume: 200 hires × 50 queries/day × first 30 days = 300K queries/year
  • Storage: minimal (<1 GB embeddings) — onboarding KB is curated subset
  • All-in: ~$500/mo → at 1-week ramp acceleration × 200 hires × $200K loaded comp / 50 weeks/year = +$800K productivity gain vs $6K cost = 130× ROI

Risks

  • Curated content rot: “essential reading” curated 2 years ago may be obsolete. Quarterly review by team leads, with auto-flagging of low-engagement docs (nobody reads = candidate for archive).
  • Over-reliance on async: new hires may skip human relationships (“RAG knows everything”) and miss tacit knowledge / mentor bonds. Build human-pairing prompts into the curriculum.
  • Context-collapse: “deployment” means different things to backend / frontend / mobile. Use role filter aggressively.

Use case E — PM / Engineering decision archive

Problem

“Why did we deprecate feature X 2 years ago?” “Has anyone tried approach Y before?” “What was the original requirement for module Z?” — these questions are asked constantly by PMs and engineers, especially in larger orgs. Today the answer lives in: PRDs (often in 5 places), past Jira tickets, post-mortems, ADRs (architecture decision records), Slack debates, design docs. Loss of institutional knowledge when employees leave is a measurable productivity tax: engineering teams spend 8–14% of time rediscovering decisions (DORA 2023).

Persona

PMs writing new specs (avoid relitigating settled debates). Engineers exploring architectural choices. Tech leads doing design review. New team leads ramping on org history.

What changes from personal version

  • Time-aware ranking: weight recent decisions higher, but surface “this decision was reversed in 2023” prominently. Implemented as graph metadata: each decision can supersede prior ones (supersedes: doc_id).
  • Multi-source linking: when a PRD references Jira ticket SXI-1234, that ticket’s resolution should be auto-fetched and appended at retrieval time. Cross-source reference resolution layer.
  • Decision graph visualization: optional UI shows decisions as nodes, edges = “supersedes”, “implements”, “blocked-by”. Helps newcomers understand dependency chains.
  • Diff-aware retrieval: when a PRD is edited, store version history. Query “what was the original scope of feature X?” returns v1; “what’s the current scope?” returns latest. Implemented as kb_versions table with valid_from/valid_to.
  • Outcome tagging: each decision linked to outcome metadata (shipped / deprecated / in-progress / cancelled). Avoid relitigating cancelled experiments.

Architecture impact

  • Schema additions: kb_decision_graph (id, supersedes_id, decision_type, outcome, decided_at, decided_by).
  • Reference resolver: at retrieval, expand [Jira: SXI-1234] → fetch live ticket status from Atlassian API → append to chunk text. 200ms overhead, cached 1h.
  • Versioning: each ingest of an updated source creates new row with valid_from = now, prior row gets valid_to. Query layer filters by point-in-time.
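A sketch of the point-in-time filter over kb_versions: passing the doc's earliest valid_from returns the v1 scope, passing the current time returns the latest. Column names follow the bullets above but are illustrative.

```python
# Point-in-time retrieval sketch over kb_versions: a row is visible at time `as_of` when
# valid_from <= as_of < valid_to (valid_to IS NULL marks the live row).
POINT_IN_TIME_SQL = """
SELECT chunk_id, content, valid_from, valid_to
FROM kb_versions
WHERE doc_id = %(doc_id)s
  AND valid_from <= %(as_of)s
  AND (valid_to IS NULL OR valid_to > %(as_of)s)
"""

def chunks_as_of(conn, doc_id: str, as_of):
    """as_of is a datetime: the earliest valid_from gives v1, datetime.now() gives current."""
    with conn.cursor() as cur:
        cur.execute(POINT_IN_TIME_SQL, {"doc_id": doc_id, "as_of": as_of})
        return cur.fetchall()
```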

Cost & scale (1,000-employee org, 10 years of history)

  • Sources: 50K PRDs + 200K Jira tickets + 5K ADRs + 100K Slack debates → ~3M chunks
  • Query volume: ~5,000 queries/day (PMs + engineers + tech leads)
  • Storage: 30 GB → ADB paid + Atlassian API quotas
  • All-in: ~$3K/mo → eliminating just 1% of rediscovery time on 500 engineers × $250K = +$1.25M/year vs $36K = 35× ROI

Risks

  • Confidential decisions surfacing: M&A discussions, individual performance reviews, layoff planning. Hard-tag confidentiality at ingest, exclude from default search; require role + reason-for-access for sensitive queries (audit-logged).
  • Stale decisions presented as current: a 2018 decision about scaling Postgres may be technically wrong now. Always surface decided_at + outcome tags in citations. UI affordance: “this decision is 7 years old — verify still applicable”.
  • Over-confidence in archive: junior PMs may copy old PRD structure without questioning. Pair with senior reviewer norms.

Cross-cutting patterns

These appear in 3+ use cases above and form a second-tier reusable layer beyond the personal-rag-kb foundation:

  1. Tenant + RBAC overlay: row-level security on kb_chunks driven by JWT claims. Any new use case inherits this for free.
  2. Webhook → Queue → Worker ingest: replaces the manual hooks. Same code path, different trigger (sketched after this list).
  3. Hybrid search (BM25 + vector + reranker): 1-week dev investment that lifts every use case ~15% Hit@5.
  4. Audit log table: append-only, 90-day rolling window, queryable for compliance + product analytics.
  5. Per-tenant system prompt injection: customers tune voice/policy without touching code.
  6. Edge cache layer: 5-min TTL on query → top-K mapping. Cuts steady-state inference cost ~60%.
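A sketch of pattern 2, assuming SQS as the queue and an existing ingest_document() chunk/embed/write function carried over from the personal pipeline. The queue URL and payload shape are illustrative.

```python
# Webhook → queue → worker sketch. The webhook handler only enqueues; the worker pool runs
# the same chunk → embed → write path the personal nightly cron uses. SQS is one queue option;
# the queue URL, payload shape, and ingest_document() helper are assumptions.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/kb-ingest"   # placeholder

def on_webhook(event: dict) -> None:
    """Source webhook (Confluence/Drive/Zendesk) lands here; enqueue and return immediately."""
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({
        "tenant_id": event["tenant_id"],
        "source": event["source"],              # e.g. "confluence"
        "doc_id": event["doc_id"],
        "change": event.get("change", "upsert"),
    }))

def worker_loop() -> None:
    """Worker pool entry point: long-poll, ingest, delete on success."""
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            ingest_document(json.loads(msg["Body"]))    # assumed existing chunk/embed/store path
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```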

Building these once = 6 weeks engineering. Then each new vertical = 2–4 weeks to launch instead of 12+.

Go-to-market thinking

The architecture supports 3 plausible business models, each with different pricing/positioning:

Model | Target | Pricing | Sales motion
B2B SaaS | 100–10K employee companies | Per-seat/month + ingest volume tier | PLG signup → 14-day trial → upgrade; AE for enterprise
White-label OEM | Vertical SaaS vendors (e.g. customer support platforms wanting RAG) | Revenue share or fixed license | Direct enterprise sales, 6-month cycles
Open-source + managed | Devs / self-hosters | Free OSS + $X/mo managed cloud | Inbound from GitHub stars; convert to managed for ops cost relief

The B2B SaaS model has the cleanest scaling story given the architecture. White-label is highest revenue per deal but requires a strong vendor brand. OSS is brand-building for the founder but slowest revenue.

What’s NOT in the personal version that enterprise needs

Realistic gap list: items that are zero effort in the personal version but a real engineering investment for enterprise:

Gap | Effort | Priority
SSO / SAML | 2–4 weeks | P0
Multi-tenant isolation tests | 1 week | P0
Audit log + querying | 1 week | P0
GDPR DSR (export + delete) | 2 weeks | P0 (EU customers)
SOC2 Type I controls | 3–6 months | P1 (mid-market+)
Rate limiting (per-tenant) | 3 days | P1
Multi-region failover | 2–4 weeks | P2 (enterprise tier)
Customer-managed encryption keys (KMS) | 2 weeks | P2 (regulated industries)
Custom domains per tenant | 1 week | P2
Embedding model marketplace (per-tenant choice) | 4 weeks | P3

Total to enterprise-ready MVP: ~3 months of 1 engineer + 1 month design + ~$5K compliance audit prep.

See also

  • Architecture — the unchanged hot path that scales across all use cases
  • Implementation — the code that ships personal version; enterprise version extends, doesn’t rewrite
  • PRD — original problem framing; enterprise framing is a superset