Enterprise patterns
The personal version is the strictest constraint case: single-user, free-tier, no compliance requirements. Relaxing those constraints unlocks B2B applications without rewriting the architecture. This page documents five concrete adaptations.
What stays vs. what changes
The hot path — chunk → embed → vector ANN search → MCP response — is identical across all enterprise use cases below. The deltas are around identity, isolation, governance, and ingest velocity, not around retrieval mechanics.
Migration matrix: Personal → Enterprise
| Aspect | Personal | Enterprise |
|---|---|---|
| Tenants | 1 user | N tenants, hard-isolated by tenant_id row filter + per-tenant API key |
| Auth | Static password + OAuth (PKCE+DCR) | SSO (SAML 2.0 / OIDC) + per-user OAuth, MFA enforced |
| Authorization | All access = full | RBAC tags + row-level security on kb_chunks (department / sensitivity / project) |
| Audit | None | Per-query log: who, when, what query, what chunks returned, response latency |
| Compliance | N/A | SOC2 Type II, GDPR DSR support, data residency, retention policies, encryption keys (BYOK / KMS) |
| Storage | ADB 23ai free 20 GB | ADB paid tier or self-hosted Postgres + pgvector; tenant data in separate schemas or DBs |
| Embedder | CPU e5-small (5.7 chunks/s) | GPU batch inference cluster (50–200 chunks/s/node) for bulk ingest; per-tenant model selection |
| Ingest velocity | Manual hooks + nightly cron | Real-time CDC from Confluence/Slack/Salesforce/Zendesk via webhooks → queue (SQS/PubSub) → worker pool |
| Search index | HNSW INMEMORY single instance | Distributed (pgvector partitioning, or managed: Pinecone/Qdrant Cloud/OCI Search), tenant-scoped |
| Latency SLA | <2s p95 best-effort | <500ms p95 with retries, regional replicas, edge caching for popular queries |
| Cost model | $0 (free tiers) | Per-seat ($X/user/month) or volume-based (chunks ingested, queries served), 3-tier pricing |
| Deployment | 1 VM + 1 ADB | Multi-region active-active, infra-as-code (Terraform), rolling deploys, observability stack |
| Support | None | 24/7 P1 SLA, dedicated Slack/Teams channel, customer success manager |
The architecture diagram doesn’t change — only the scale of each labeled component does.
Use case A — Internal employee knowledge base (“Where’s that doc?”)
Problem
Mid-to-large companies (500–10,000 employees) accumulate Confluence, SharePoint, Google Docs, internal wikis, runbooks, post-mortems, OKR docs, meeting recordings. Search across these is fragmented (each tool’s search is keyword-only), siloed (per-team SharePoint), and stale (deleted documents still appear). New hires take 3–6 months to learn “where to look for X”. Existing employees waste 15–30 minutes per “I know we wrote this somewhere” hunt — at 15 minutes × 2/day × 5,000 employees × $50/hour over ~250 working days, that’s ~$31M/year of pure friction in a typical mid-sized SaaS.
Persona
Every knowledge worker. Especially: new hires (weeks 1–8), cross-functional team leads, support escalation engineers.
What changes from personal version
- Sources scaled up: Confluence (10K–100K pages) + Google Drive (1M+ docs) + SharePoint + GitHub wiki + Slack channels (read-only). Per-source connector with delta sync (only changed docs since last poll).
- Tenant model: 1 tenant per company. Hard isolation by `tenant_id` on every query.
- RBAC: each chunk inherits source ACL — if a Confluence page is restricted to the “Engineering” space, only employees with that AD/Okta group see it in `kb_search` results. Implemented as a row-level security predicate on `kb_chunks`.
- Identity: SAML SSO from Okta/Azure AD/Google Workspace. User’s group memberships injected into JWT, used as filter.
- Compliance: GDPR DSR — when an employee leaves, their personal contributions remain (org owns the IP), but their query log is purged after 90 days. Data residency: per-tenant region pinning.
- Search-quality investments: hybrid search (BM25 + vector) is now worth the complexity — keyword matching for “PR-1234” or product code names where semantic embedding is weak. Reranker (Cohere/BGE-reranker-base) on top-20 → +15% Hit@5 measured on internal eval.
Architecture impact
- Add `kb_acl` table: `(chunk_id, allowed_groups[])`. Enforce in `kb_search` SQL.
- Switch ingest from manual hooks to CDC pipeline: webhooks from Confluence/Drive → Kafka → embed worker pool → ADB write.
- Add redaction layer at ingest: strip PII (employee SSN, salaries) using regex + LLM classifier before embedding.
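A minimal sketch of the ACL enforcement, using SQLite and a normalized two-column `kb_acl` as a stand-in for the `allowed_groups[]` array column; the join shape and the sample data are assumptions:

```python
import sqlite3

# Stand-in schema: kb_chunks + kb_acl per the text; in production the
# ANN search produces candidate chunk_ids first, and this predicate is
# applied on top of them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE kb_chunks (chunk_id TEXT PRIMARY KEY, text TEXT);
    CREATE TABLE kb_acl    (chunk_id TEXT, allowed_group TEXT);
    INSERT INTO kb_chunks VALUES ('c1', 'deploy runbook'), ('c2', 'salary bands');
    INSERT INTO kb_acl VALUES ('c1', 'engineering'), ('c2', 'hr');
""")

def kb_search_visible(user_groups):
    """Return chunk_ids the user's groups (from the JWT) may see."""
    placeholders = ",".join("?" * len(user_groups))
    rows = conn.execute(
        f"""SELECT c.chunk_id FROM kb_chunks c
            JOIN kb_acl a ON a.chunk_id = c.chunk_id
            WHERE a.allowed_group IN ({placeholders})""",
        user_groups,
    ).fetchall()
    return [r[0] for r in rows]

kb_search_visible(["engineering"])  # sees the runbook chunk, not salaries
```

On Postgres the same predicate can live in a row-level security policy instead of every query.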
Cost & scale (10,000-employee example)
- Documents: ~500K, ~5M chunks
- Storage: ~20 GB embeddings + ~50 GB CLOB → managed Postgres with pgvector ($300/mo) or ADB paid ($450/mo)
- Embed throughput: 1 GPU node (T4) handles 100 chunks/s → 5M chunks bulk = 14 hours one-time, then ~50 chunks/s steady-state for delta sync
- Query volume: 50K queries/day → ~5 queries/employee/day average
- All-in cost: ~$1,500/mo infra + ~$2K/mo dev maintenance → ~$42K/year vs ~$30M friction ≈ 700× ROI
Risks
- Stale ACL: source ACL changes (employee leaves group) must propagate to KB index within hours. Solved with periodic ACL refresh job, not at query time (too slow).
- Index drift from source: hard-deleted source docs need tombstones in KB. Implement soft-delete with TTL.
- Hallucination on absent info: when KB doesn’t have an answer, system must say “no match” rather than improvise. Tune Claude system prompt + return empty results when top score <0.6.
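The no-match guard from the last risk is a few lines at the application layer; a sketch, assuming scores are cosine similarities on a 0–1 scale:

```python
# Sketch of the "say no match" guard: if the best ANN score is below the
# threshold (0.6 per the text), return nothing so the model reports "no
# match" instead of improvising.
NO_MATCH_THRESHOLD = 0.6

def guard_results(scored_chunks):
    """scored_chunks: list of (chunk_id, score) pairs, best score first."""
    if not scored_chunks or scored_chunks[0][1] < NO_MATCH_THRESHOLD:
        return []  # caller answers "no match in the KB"
    return scored_chunks
```

The threshold should be tuned per embedder; 0.6 for one model is not 0.6 for another.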
Use case B — Customer support agent grounding
Problem
Support agents handle tickets that often have answers buried in: past tickets, KB articles, product docs, recent release notes, internal runbooks. Today an agent searches each system separately, copy-pastes context into the ticket, switches windows. Average resolution time for “I’ve seen this before” tickets is 30–45 minutes; with grounded retrieval it could be 5–10 minutes. Industry benchmark: 40% of support tickets are “previously-answered” patterns.
Persona
L1/L2 support engineers. Customer success managers. Solutions consultants.
What changes from personal version
- Latency-critical: agents type a ticket draft → system returns top-5 grounded suggestions inline. Target p95 <500ms (vs 1.16s personal). Achieved by:
  - Edge cache for popular query embeddings (Cloudflare Workers KV)
  - Pre-computed embeddings for the 1,000 most common ticket patterns
  - Co-locating inference + DB in the same region
- Source freshness SLA: a customer escalates an issue at 14:00; the engineer fixes + writes runbook at 15:30; another agent hits the same issue at 16:00 and MUST get the new runbook. Webhook-driven ingest within 60 seconds of source write.
- Citation enforcement: every Claude answer must cite its sources. UI shows clickable chunks the agent can verify before sending to customer. If no citation, answer rejected at the application layer.
- Per-tenant tone/policy injection: each customer’s support team injects a custom system prompt (“never promise refunds without manager approval”, “always escalate to security if customer mentions GDPR”).
Architecture impact
- Add query-time cache layer (Redis): hash query → cached top-K result, TTL 5 minutes. Hit rate ~60% for common patterns.
- Streaming response: as Claude generates the answer token-by-token, stream chunks to the agent’s UI. Reduces perceived latency.
- Feedback loop: every ticket close includes “was this suggestion useful?” thumbs up/down. Negative feedback → demote that chunk’s score for next 24h, escalate to KB owner for review.
- Quality eval: weekly run 100 sampled queries against a labeled gold set, track Hit@5 + answer accuracy. Alert if drops >5% week-over-week.
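The query-time cache above can be sketched with an in-process dict standing in for Redis; the 5-minute TTL is from the text, while the hashing scheme and the injectable clock are illustrative:

```python
import hashlib
import time

TTL_SECONDS = 300  # 5-minute TTL per the text
_cache = {}        # in-process dict standing in for Redis

def cached_top_k(query, search_fn, now=None):
    """Return top-K results for a query, recomputing only after TTL expiry."""
    now = time.time() if now is None else now
    key = hashlib.sha256(query.encode()).hexdigest()
    hit = _cache.get(key)
    if hit and now - hit[0] < TTL_SECONDS:
        return hit[1]          # cache hit: skip embedding + ANN search
    results = search_fn(query)  # cold path
    _cache[key] = (now, results)
    return results

calls = []
def fake_search(q):
    calls.append(q)
    return ["c1", "c2"]

cached_top_k("reset password", fake_search, now=0)    # miss: runs search
cached_top_k("reset password", fake_search, now=100)  # hit: no search call
```

With Redis the dict becomes a `SET key value EX 300`; the hash-key shape stays the same.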
Cost & scale (50-agent SaaS support team)
- KB volume: 10K KB articles + 100K historical tickets + 5K runbooks → ~500K chunks
- Query volume: 50 agents × 100 queries/day = 5,000 queries/day
- Storage: ~5 GB embeddings → ADB paid ($150/mo)
- Inference: minimal — most queries hit cache, cold path is ~$0.10/query Claude usage
- All-in: ~$2K/mo infra + Claude API + Redis → at 30% productivity gain on $80K avg agent salary × 50 agents = $1.2M/year value vs $24K cost = 50× ROI
Risks
- Stale cached suggestions during incidents: when the product itself is broken, old runbooks may suggest workarounds that no longer apply. Implement “incident mode” that flushes cache + injects warning banner.
- Customer PII in chunks: ticket history contains customer names, account IDs. Embed must redact before vector store. Use LLM-based PII detector at ingest.
- Vendor lock-in to Claude: design an abstraction layer so swapping to GPT/Llama is a config change, not a rewrite. This also preserves negotiating leverage.
Use case C — Sales enablement
Problem
Account executives juggle competitive intel (“how do we differentiate vs Competitor X?”), deal histories (“what concessions did we give Acme last year?”), product specs (“does our enterprise tier include SSO?”), pricing sheets, case studies. Today this lives in 8+ tools (Salesforce + Highspot + Confluence + Drive + Gong recordings). On a live discovery call, the AE has 30 seconds to find the right answer or lose credibility. Win rate correlates with response specificity — generic answers lose 23% more deals (Gartner 2024 study).
Persona
Account executives, sales engineers, sales operations, RevOps. Bonus: marketing teams aligning competitive narratives.
What changes from personal version
- CRM integration: pull deal context (current account, opportunity stage, products discussed) → use as additional query context. “Show me cases where we beat Competitor X for a 5K-seat customer in healthcare” — `tenant_id`, `industry`, `competitor` are structured filters.
- Multi-modal: ingest sales call recordings (Gong/Chorus) → speech-to-text → embed transcript chunks. Now an AE can search “what did we promise on demo last Tuesday”.
- Recency-weighted scoring: a 6-month-old pricing doc is less reliable than last week’s. Add `recency_boost = exp(-age_days / 90)` factor in ranking.
- Battle card auto-generation: nightly job pre-builds competitor battle cards from indexed content, surfaces in MCP tool `kb_battle_card(competitor_name)`.
- Privacy partition: deal-stage-restricted chunks (M&A confidential, executive comp) only visible to specific roles.
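The recency factor above in runnable form; the multiplicative combination with the base similarity score is an assumption, since the text only gives the boost formula:

```python
import math

def recency_boost(age_days, half_life_days=90):
    # exp(-age_days / 90) per the text: 1.0 for today's doc, decaying
    # toward 0 as the doc ages.
    return math.exp(-age_days / half_life_days)

def ranked_score(similarity, age_days):
    # Assumed combination: multiply the ANN similarity by the boost.
    return similarity * recency_boost(age_days)

# A slightly weaker match from last week outranks a 6-month-old one:
assert ranked_score(0.80, age_days=7) > ranked_score(0.85, age_days=180)
```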
Architecture impact
- Add structured filter layer: combine vector search with SQL WHERE on metadata (`industry`, `competitor`, `deal_stage`, `pubDate > 6mo`). Implemented as `kb_search_v2(query, filters: dict)`.
- Speech-to-text ingest pipeline: Gong webhook → AWS Transcribe → chunk by speaker turn → embed → store with `source_type='sales_call'`.
- Real-time alert: when a battle-card-relevant page is updated, push notification to subscribed AEs.
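A sketch of the filter layer behind `kb_search_v2`: the filters dict becomes a parameterized WHERE fragment, with an allow-list so caller-supplied keys can’t inject SQL. Column names beyond those in the text are assumptions:

```python
# Allow-listed metadata columns; anything else is rejected outright.
ALLOWED_FILTERS = {"industry", "competitor", "deal_stage", "source_type"}

def build_filter_sql(filters):
    """Turn {"industry": "healthcare"} into (" AND industry = ?", [...])."""
    clauses, params = [], []
    for column, value in filters.items():
        if column not in ALLOWED_FILTERS:
            raise ValueError(f"unknown filter: {column}")
        clauses.append(f"{column} = ?")  # column name is allow-listed, value is bound
        params.append(value)
    where = (" AND " + " AND ".join(clauses)) if clauses else ""
    return where, params

where, params = build_filter_sql({"industry": "healthcare", "competitor": "X"})
# appended after the vector-distance predicate in the kb_search_v2 SQL
```

Range filters like `pubDate > 6mo` would need their own allow-listed operators; equality is shown for brevity.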
Cost & scale (200-rep org)
- Sources: 5K Salesforce records, 10K Confluence pages, 50K call hours/year (3,000 hours of transcripts) → ~1M chunks
- Query volume: 200 reps × 30 queries/day = 6,000 queries/day
- Transcription: $0.024/min × 50K hours/year = $72K/year (use Whisper self-hosted to drop to $5K/year)
- Storage + compute: ~$2K/mo
- All-in: ~$4K/mo + transcription → at 5% win rate uplift on 200 reps × $2M quota = +$20M ARR vs $48K cost = 400× ROI
Risks
- Reps relying on RAG > training: junior AEs may stop learning the product. Mitigate with quarterly “no RAG” assessments; RAG is augmentation, not replacement.
- Out-of-date pricing surfacing: outdated contract terms quoted on a live call = legal exposure. Hard-delete deprecated docs, add validity-window metadata, alert on stale-source citations.
- Competitor mention as training signal: be careful that competitor intel ingested doesn’t accidentally fine-tune embedder in a way that biases scoring. Keep embedder frozen, only the ranking layer learns.
Use case D — Onboarding new hires
Problem
A new engineer joins on Monday. By Friday they need to: understand the product domain, know who owns what, navigate the codebase, find runbooks, ramp on team norms, attend 15+ “intro to X” sessions. Companies measure time-to-first-PR (typically 2–6 weeks); top performers ship in 5 days. The bottleneck isn’t ability — it’s information access.
Persona
New hires (eng / PM / design / sales / support). HR / People Ops who run the onboarding program. Engineering managers responsible for ramp.
What changes from personal version
- Curated curriculum collection: a separate `kb_curriculum` table marks “essential reading” with order, role-targeting, and est-read-time. MCP tool `kb_onboarding_path(role, day)` returns the day-N reading list.
- Conversational tutor mode: rather than just retrieve, the system Socratic-tutors. New hire asks “what’s our deployment process?” → Claude explains with citations, then asks “want to walk through a sample deploy together?”. Behind: same MCP retrieval, different system prompt + multi-turn context.
- Progress tracking: which docs has the new hire read, which they’ve asked about, what gaps remain. Surfaces to manager in weekly digest.
- Glossary intelligence: company-specific acronyms (e.g. “PCF”, “NLP-onboarding”) are auto-detected and explained inline. Built as separate `kb_glossary` table populated from a wiki page + LLM-extracted from past docs.
Architecture impact
- Add `user_progress` table tracking which `chunk_id`s a user has been served. Compute “blind spots” = high-importance docs they haven’t seen.
- Conversation memory: per-user Redis store of last 10 query-answer pairs. Inject into Claude context for continuity (“you asked about deploys yesterday; today let’s cover monitoring”).
- Manager dashboard: aggregate metrics — avg ramp velocity, most-asked questions per cohort. Useful for HR to identify documentation gaps (a question asked by 50% of new hires = a missing doc).
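The blind-spot computation reduces to an ordered set difference; a sketch with assumed data shapes (doc ids rather than chunk ids, for brevity):

```python
# Essential docs come from the curated curriculum; served docs come from
# the user_progress table. Both shapes are assumptions for illustration.
def blind_spots(essential_doc_ids, served_doc_ids):
    """Return essential docs the user has never been served, in curriculum order."""
    seen = set(served_doc_ids)
    return [d for d in essential_doc_ids if d not in seen]

essential = ["deploy-runbook", "oncall-guide", "arch-overview"]
served = ["deploy-runbook"]
blind_spots(essential, served)  # the two docs still unread
```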
Cost & scale (200 new hires/year)
- Sources: 5K onboarding docs + 1K runbooks + 500 architecture docs + Slack-archived “intro” channels
- Query volume: 200 hires × 50 queries/day × first 30 days = 300K queries/year
- Storage: minimal (<1 GB embeddings) — onboarding KB is curated subset
- All-in: ~$500/mo → at 1-week ramp acceleration × 200 hires × $200K loaded comp / 50 weeks/year = +$800K productivity gain vs $6K cost = 130× ROI
Risks
- Curated content rot: “essential reading” curated 2 years ago may be obsolete. Quarterly review by team leads, with auto-flagging of low-engagement docs (nobody reads = candidate for archive).
- Over-reliance on async: new hires may skip human relationships (“RAG knows everything”) and miss tacit knowledge / mentor bonds. Build human-pairing prompts into the curriculum.
- Context-collapse: “deployment” means different things to backend / frontend / mobile. Use role filter aggressively.
Use case E — PM / Engineering decision archive
Problem
“Why did we deprecate feature X 2 years ago?” “Has anyone tried approach Y before?” “What was the original requirement for module Z?” — these questions are asked constantly by PMs and engineers, especially in larger orgs. Today the answer lives in: PRDs (often in 5 places), past Jira tickets, post-mortems, ADRs (architecture decision records), Slack debates, design docs. Loss of institutional knowledge when employees leave is a measurable productivity tax: engineering teams spend 8–14% of time rediscovering decisions (DORA 2023).
Persona
PMs writing new specs (avoid relitigating settled debates). Engineers exploring architectural choices. Tech leads doing design review. New team leads ramping on org history.
What changes from personal version
- Time-aware ranking: weight recent decisions higher, but surface “this decision was reversed in 2023” prominently. Implemented as graph metadata: each decision can supersede prior ones (`supersedes: doc_id`).
- Multi-source linking: when a PRD references Jira ticket SXI-1234, that ticket’s resolution should be auto-fetched and appended at retrieval time. Cross-source reference resolution layer.
- Decision graph visualization: optional UI shows decisions as nodes, edges = “supersedes”, “implements”, “blocked-by”. Helps newcomers understand dependency chains.
- Diff-aware retrieval: when a PRD is edited, store version history. Query “what was the original scope of feature X?” returns v1; “what’s the current scope?” returns latest. Implemented as `kb_versions` table with `valid_from`/`valid_to`.
- Outcome tagging: each decision linked to outcome metadata (shipped / deprecated / in-progress / cancelled). Avoid relitigating cancelled experiments.
Architecture impact
- Schema additions: `kb_decision_graph (id, supersedes_id, decision_type, outcome, decided_at, decided_by)`.
- Reference resolver: at retrieval, expand `[Jira: SXI-1234]` → fetch live ticket status from Atlassian API → append to chunk text. 200ms overhead, cached 1h.
- Versioning: each ingest of an updated source creates a new row with `valid_from = now`; the prior row gets `valid_to`. Query layer filters by point-in-time.
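A minimal point-in-time query over a `kb_versions` shape like the one described above, demonstrated with SQLite; treating a `NULL` `valid_to` as “current row”, and the integer timestamps, are assumptions:

```python
import sqlite3

# Two versions of one PRD: v1 valid over [100, 200), v2 current.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE kb_versions (
        doc_id TEXT, body TEXT, valid_from INTEGER, valid_to INTEGER
    );
    INSERT INTO kb_versions VALUES
        ('prd-x', 'v1 scope: mobile only', 100, 200),
        ('prd-x', 'v2 scope: mobile + web', 200, NULL);
""")

def body_as_of(doc_id, ts):
    """Return the doc body that was valid at timestamp ts, or None."""
    row = conn.execute(
        """SELECT body FROM kb_versions
           WHERE doc_id = ? AND valid_from <= ?
             AND (valid_to IS NULL OR valid_to > ?)""",
        (doc_id, ts, ts),
    ).fetchone()
    return row[0] if row else None

body_as_of('prd-x', 150)  # the original v1 scope
body_as_of('prd-x', 500)  # the current v2 scope
```

“Original scope” vs “current scope” queries then differ only in the timestamp passed in.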
Cost & scale (1,000-employee org, 10 years of history)
- Sources: 50K PRDs + 200K Jira tickets + 5K ADRs + 100K Slack debates → ~3M chunks
- Query volume: ~5,000 queries/day (PMs + engineers + tech leads)
- Storage: 30 GB → ADB paid + Atlassian API quotas
- All-in: ~$3K/mo → eliminating just 1% of rediscovery time on 500 engineers × $250K = +$1.25M/year vs $36K = 35× ROI
Risks
- Confidential decisions surfacing: M&A discussions, individual performance reviews, layoff planning. Hard-tag confidentiality at ingest, exclude from default search; require role + reason-for-access for sensitive queries (audit-logged).
- Stale decisions presented as current: a 2018 decision about scaling Postgres may be technically wrong now. Always surface `decided_at` + outcome tags in citations. UI affordance: “this decision is 7 years old — verify still applicable”.
- Over-confidence in archive: junior PMs may copy old PRD structure without questioning. Pair with senior reviewer norms.
Cross-cutting patterns
These appear in 3+ use cases above and form a second-tier reusable layer beyond the personal-rag-kb foundation:
- Tenant + RBAC overlay: row-level security on `kb_chunks` driven by JWT claims. Any new use case inherits this for free.
- Webhook → Queue → Worker ingest: replaces hooks. Same code path; different trigger.
- Hybrid search (BM25 + vector + reranker): 1-week dev investment that lifts every use case ~15% Hit@5.
- Audit log table: append-only, 90-day rolling window, queryable for compliance + product analytics.
- Per-tenant system prompt injection: customers tune voice/policy without touching code.
- Edge cache layer: 5-min TTL on query → top-K mapping. Cuts steady-state inference cost ~60%.
Building these once = 6 weeks engineering. Then each new vertical = 2–4 weeks to launch instead of 12+.
Go-to-market thinking
The architecture supports 3 plausible business models, each with different pricing/positioning:
| Model | Target | Pricing | Sales motion |
|---|---|---|---|
| B2B SaaS | 100–10K employee companies | Per-seat/month + ingest volume tier | PLG signup → trial 14 days → upgrade. AE for enterprise. |
| White-label OEM | Vertical SaaS vendors (e.g. customer support platforms wanting RAG) | Revenue share or fixed license | Direct enterprise sales, 6-month cycles |
| Open-source + managed | Devs / self-hosters | Free OSS + $X/mo managed cloud | Inbound from GitHub stars; convert to managed for ops cost relief |
The B2B SaaS model has the cleanest scaling story given the architecture. White-label is highest revenue per deal but requires a strong vendor brand. OSS is brand-building for the founder but slowest revenue.
What’s NOT in the personal version that enterprise needs
Realistic gap list — items that are zero-effort in personal version but real engineering investment for enterprise:
| Gap | Effort | Priority |
|---|---|---|
| SSO / SAML | 2–4 weeks | P0 |
| Multi-tenant isolation tests | 1 week | P0 |
| Audit log + querying | 1 week | P0 |
| GDPR DSR (export + delete) | 2 weeks | P0 (EU customers) |
| SOC2 Type I controls | 3–6 months | P1 (mid-market+) |
| Rate limiting (per-tenant) | 3 days | P1 |
| Multi-region failover | 2–4 weeks | P2 (enterprise tier) |
| Customer-managed encryption keys (KMS) | 2 weeks | P2 (regulated industries) |
| Custom domains per tenant | 1 week | P2 |
| Embedding model marketplace (per-tenant choice) | 4 weeks | P3 |
Total to enterprise-ready MVP: ~3 months of 1 engineer + 1 month design + ~$5K compliance audit prep.
See also
- Architecture — the unchanged hot path that scales across all use cases
- Implementation — the code that ships personal version; enterprise version extends, doesn’t rewrite
- PRD — original problem framing; enterprise framing is a superset