Enterprise RAG: architecture and best practices for 2026
Quick Answer: what is enterprise RAG?
RAG (Retrieval-Augmented Generation) is the most-deployed AI architecture in UK enterprises in 2026 for getting a large language model to answer from your own documentation rather than from its generic training memory. The principle is simple, comparable to a barrister opening a brief before answering counsel:
- The user asks a question in natural language.
- The system retrieves relevant passages from a specially prepared knowledge base — the “vector database”, where each document is stored as a numeric signature so the system can quickly find the ones that resemble the question.
- The relevant passages are inserted into the prompt sent to the LLM.
- The LLM generates an answer grounded in those passages, with source citations.
Reference 2026 stack for the UK market:
- LLM: Mistral Large or Mistral Small 3 (via Mistral’s UK / EU endpoints, or self-hosted), GPT-4o or Claude Sonnet (if you accept the US-vendor dependency), Llama 3.1 / 3.3 70B self-hosted on OVHcloud London or Azure UK South.
- Embeddings (the document signatures): OpenAI
text-embedding-3-large, BGE-M3, E5-Mistral, or Mistral Embed. Multilingual E5 if you index UK content alongside European subsidiary docs. - Vector database: Qdrant (self-hostable, the production reference in 2026), Weaviate, Milvus, pgvector if PostgreSQL is already in your stack, or Chroma for early prototypes. Pinecone only if data residency is not a concern.
- Orchestration: LangChain, LlamaIndex, or Haystack — all three are battle-tested.
Use cases prevalent in the UK: legal research and case-law mining (clear leader given the size of the British legal sector), internal knowledge bases in banking and insurance (FCA-regulated firms), tier-1 customer support, NHS clinical-document Q&A under strict access control, scientific R&D in pharma and biotech.
Why RAG instead of fine-tuning: RAG is simpler, more maintainable, more transparent. Fine-tuning is only justified for very specific cases (style to learn durably, ultra-specialised terminology).
Cost: £50 to £200 per month operations for an SME, £5k to £25k initial integration investment.
Why RAG took over in 2026
Before 2024, embedding internal corporate knowledge into an LLM meant fine-tuning — long, costly, brittle. RAG flipped the equation, for three reasons.
Shift 1 — Long-context LLMs matured. Mistral, GPT-4o, Claude and Llama 3.3 in 2026 routinely handle context windows of 100,000 to 1 million tokens. You can feed them dozens of pages of documents as input — exactly what RAG needs. Before 2024, the 8-32k token cap forced harsh trade-offs on context volume; that friction is gone.
Shift 2 — Open-source vector DBs matured. Qdrant, Weaviate, Milvus, Chroma and pgvector make it possible in 2026 to deploy a production vector DB in a few hours, free or near-free. Before 2023, the realistic options were Pinecone (US SaaS) or an in-house build. Today: Qdrant self-hosted, in a handful of commands on an OVHcloud London or Hetzner box.
Shift 3 — Integration is straightforward. LangChain, LlamaIndex and Haystack (the latter built by Berlin-based Deepset and widely used across the UK enterprise market) have stabilised the integration patterns. A British SME can prototype a RAG in 1-2 weeks with a modest engineering team. The frameworks handle ingestion, chunking, embedding, retrieval, generation with citation. No need to reinvent the pipeline.
Concretely: in 2026, any UK organisation with an internal documentary base above ~500 documents stands to gain from exploring RAG. It has become accessible, predictable, and the ROI is measured in months — not years.
Detailed architecture of a production RAG
A mature RAG pipeline in 2026 has six components. Diagram first, detail after.
Pipeline diagram
[Document sources]
SharePoint, M-Files, wiki, OneDrive, Salesforce, contracts
│
▼
[1. Ingestion] ─── PDF/DOCX/HTML parsing, OCR if scanned
│
▼
[2. Chunking] ─── 200-1000 token passages
│
▼
[3. Embeddings] ─── numeric signature per chunk
│
▼
[4. Vector DB] ── Qdrant / Weaviate / pgvector storage
│
├──── (at runtime, user query)
▼ │
[5. Retrieval] ◄─────────────────── query embedding
│ top-K chunks │
▼ │
[6. Generation] ─── LLM with context ◄─────┘
│
▼
[Answer + source citations]
1. Document ingestion
Sources: SharePoint, M-Files, internal wikis (Confluence is dominant in the UK enterprise market), OneDrive folders, Salesforce/Dynamics exports, contracts, FAQs, intranet pages. Parsing: PDF (with OCR if scanned — Azure Document Intelligence or Tesseract), DOCX, HTML, Markdown, audio transcripts (via Whisper, increasingly via Whisper Large V3 or Distil-Whisper for cost).
Best practice: preserve metadata (author, date, source, sensitivity classification, retention class) throughout the pipeline so you can filter downstream. For FCA-regulated firms, retain the policy reference and effective date — they must be cited verbatim.
2. Chunking
Documents are split into 200-1000 token passages (depending on the target LLM and content nature). Strategies:
- Fixed chunking: 500 tokens per chunk, simple
- Semantic chunking: by paragraph or logical section, more relevant
- Hierarchical chunking: large chunk (overview) plus smaller chunks (detail), good for structured legal or regulatory documents
Good chunking is what separates a mediocre RAG from a strong one. Invest time here. For UK case-law indexing, paragraph-level chunking with the citation header preserved as metadata is a proven pattern.
3. Embeddings
Each chunk is converted into a dense vector (768-3,072 dimensions depending on the model). 2026 models relevant for English content:
| Model | Origin | Multilingual | Sovereignty |
|---|---|---|---|
| OpenAI text-embedding-3-large | US | Excellent | DPF dependency |
| OpenAI ada-002 | US | Good | DPF dependency, legacy |
| Mistral Embed | France | Excellent | EU-resident |
| BGE-M3 | China open-source | Excellent | OK if self-deployed |
| E5-Mistral | Open-source | Good | OK if self-deployed |
For sovereignty-sensitive contexts (financial services under FCA SS1/23, NHS clinical data), prefer Mistral Embed or self-hosted open-source models.
4. Vector storage (vector DB)
Storage of embeddings + metadata + original chunk. 2026 picks:
| Vector DB | Type | Ideal use case | Sovereignty |
|---|---|---|---|
| Qdrant | Open-source self-hostable | The 2026 reference, SME to enterprise | Yes |
| Chroma | Open-source | POC, rapid prototype | Yes |
| pgvector | PostgreSQL extension | Existing Postgres stack | Yes |
| Weaviate | Open-source | Larger scale | Yes if self-hosted |
| Milvus | Open-source | Very large scale | Yes if self-hosted |
| Pinecone | US SaaS | Avoid for sensitive data | No |
For regulated cases (NHS, financial services, legal services with confidentiality obligations), Qdrant self-hosted on OVHcloud London or Azure UK South is the sovereign reference choice in 2026.
5. Retrieval
At runtime, the user query is:
- Converted to an embedding (with the same model used at ingestion)
- Compared to stored embeddings (cosine similarity)
- Top-K chunks retrieved (typically K=5-10)
Common improvements:
- Hybrid search: combines vector search with keyword search (BM25). Improves precision on technical and legal terms — particularly valuable in UK case-law where exact statutory citations matter.
- Reranking: a dedicated cross-encoder model reranks top-K results to keep only the truly relevant ones. Cohere’s Rerank 3 and BGE-Reranker are the go-to options in 2026.
- Metadata filters: restrict search to a subset (by date, source, classification, user profile, jurisdiction).
6. Generation with citation
The LLM receives:
- The user query
- Relevant chunks as context
- A system prompt requiring explicit citations
Typical output: “Per the FCA Handbook SYSC 8.1, an authorised firm must take reasonable steps to avoid undue additional operational risk… [Source: FCA Handbook, SYSC 8.1.1, last updated 2024-09].”
Without citation, you don’t have a RAG — you have an LLM hallucinating on internal documents. Citation is non-negotiable for user trust and for compliance (UK GDPR Article 5(1)(d) — accuracy).
RAG vs fine-tuning — the 2026 decision
| Criterion | RAG | Fine-tuning |
|---|---|---|
| Time to ship | 1-4 weeks | 4-12 weeks |
| Initial cost | £5-25k | £30-100k |
| Maintenance (knowledge updates) | Re-index (hours) | Re-fine-tune (days) |
| Transparency | Citations possible | Black box |
| Factual accuracy | High (source-grounded) | Moderate (hallucination risk) |
| Specific style/tone | Limited | Excellent |
| Inference cost | Moderate (long context = more tokens) | Lower |
| Required skills | Devs with LLM APIs | Data science + GPU |
2026 decision rule:
- Knowledge that evolves, multiple sources, citation required → RAG
- Specific style to learn, ultra-specialised terminology, latency-critical → Fine-tuning
- Most UK business cases → RAG first
Start with RAG; switch or complement with fine-tuning only if rigorous evaluation justifies it.
6 enterprise RAG use cases relevant to the UK market
1. Legal research and case-law mining. The UK has one of the densest legal services markets in the world, with magic-circle and silver-circle firms running large internal knowledge programmes. Indexing case law (BAILII corpora), internal precedent banks and counsel opinions through a RAG dramatically reduces time-to-precedent for associates. Many top-tier firms now ship internal Mistral- or Claude-backed RAGs over their precedent banks, with strict per-matter access control.
Typical volumetrics: 50,000 to 1,000,000 documents, hundreds to thousands of queries per day during peak case work.
2. FCA-regulated knowledge: banking, insurance, asset management. Indexing the FCA Handbook, internal policies, supervisory letters, customer-due-diligence procedures. Compliance officers query the system in natural language instead of trawling 12,000 pages of regulation.
Typical volumetrics: 5,000 to 100,000 documents, hundreds of queries per month.
3. Tier-1 customer support. Indexing product documentation, resolved tickets, troubleshooting playbooks. Outcome: consistent answers, rising self-service resolution rates, 30-50% drop in tier-1 ticket volume on properly scoped scopes.
Typical volumetrics: 1,000 to 50,000 documents, 100 to 10,000 questions per day.
4. NHS and pharma R&D. NHS trusts and UK pharma (AstraZeneca, GSK and the broader Cambridge cluster) deploy RAGs over internal research, clinical guidelines and trial documents. Crucial constraint: per-role access control aligned with Caldicott principles, no exfiltration to non-UK infrastructure for clinical-grade content.
5. Employee onboarding and internal knowledge. Indexing training materials, procedures, HR policies. New joiners ask questions in natural language instead of hunting across 12 wikis.
Typical volumetrics: 500 to 5,000 documents.
6. Sales enablement and proposal drafting. Indexing past proposals, customer references, product datasheets. The salesperson generates a tailored, factually grounded proposal with the right portfolio references.
UK GDPR and ICO compliance for RAG
RAG is an automated processing activity that must be governed.
Key obligations:
- Article 30 ROPA entry: “AI assistance for internal document search”. Purpose, data processed, processors, retention.
- DPIA when the base contains personal data: see the ICO’s Guidance on AI and data protection (updated 2023, still the reference in 2026) and the ICO/Turing Institute Explaining decisions made with AI guidance.
- Pseudonymisation at ingestion when possible: do not index names or identifiers if they are not necessary.
- UK / EU sovereign hosting for sensitive data: Mistral + Qdrant on OVHcloud London, Azure UK South, or Scaleway Paris. Avoid US-only providers for clinical, legal-privileged, or safeguarding content.
- Article 28 processor agreements (UK GDPR): every component of the stack handling personal data — LLM provider, vector DB provider if managed, embedding provider — must be under a processor contract with adequate clauses, including international transfer mechanisms (UK IDTA or addendum to the EU SCCs) when relevant.
- Access control: a user must only see chunks they are legitimately entitled to access in the source documentation. Metadata filters per user profile, aligned on the source-system permissions. Without that filtering, RAG becomes a permissions-leak channel — a salesperson reaches HR records they should not see.
For high-risk AI systems falling under the EU AI Act when an organisation operates cross-border, the UK’s pro-innovation approach diverges, but the ICO has signalled that the substantive UK GDPR controls (transparency, fairness, accuracy) effectively cover the same ground for RAG deployments processing personal data.
What we refuse to promise
Three recurring antipatterns we avoid at DPLIANCE when designing a bespoke RAG.
“We index everything, we’ll sort access out later.” Wrong. Access control must be designed at ingestion, not bolted on. Indexing 100,000 documents with uniform access creates a monumental permissions-leak channel — the RAG will answer based on documents the user never had source-system access to. Retroactively applying rights is technically complex and legally fragile under UK GDPR.
“RAG will solve everything, we no longer need to organise our sources properly.” Wrong. A RAG over poorly organised, contradictory sources will answer with… poorly organised, contradictory information. RAG amplifies source quality — it does not correct it. Using a RAG project as the trigger for a documentary clean-up is often the most useful side effect.
“Let’s start straight with an autonomous agent that does RAG plus actions.” Usually a mistake for a first AI project. RAG alone has its own pitfalls (chunking, access control, citation). Adding an autonomous agent that performs external actions multiplies the risks. Start with a simple RAG, validate, then add agentic orchestration if the need is real.
DPLIANCE is a software editor. When we design a bespoke AI solution that includes a RAG, we own the full stack: model selection (Mistral, on-prem or UK-resident depending on your sensitivity), vector DB selection (sovereign Qdrant by default), source ingestion, access control aligned with your existing permissions, systematic citation, quality monitoring.
FAQ
What is RAG (Retrieval-Augmented Generation)?
RAG is an architecture that pairs an LLM (Mistral, GPT-4o, Claude, Llama 3) with an internal knowledge base. When a user asks a question, the system retrieves relevant documents from the knowledge base, feeds them to the LLM as context, then generates an answer grounded in those documents with source citations. The benefit: answers anchored in your internal documentation, not in the model’s generic memory. It is the most-deployed AI architecture in UK enterprises in 2026 for document search, tier-1 customer support, employee onboarding and legal research.
When should you choose RAG over fine-tuning?
RAG is generally preferable to fine-tuning in 2026 for: knowledge that evolves regularly (policies, procedures, product catalogues), multiple sources to cite, requirement for transparency about the origin of information, teams without a dedicated data science function. Fine-tuning is preferable for: a specific tone or style to learn durably, ultra-specialised terminology, very latency-critical scenarios. Most UK enterprise use cases benefit from starting with RAG; fine-tuning comes as a complement or second-line option if RAG alone proves insufficient.
Which vector database should I choose for RAG?
For SMEs and POCs: Qdrant (open-source, self-hostable, simple — the sovereign reference in 2026), Chroma (very simple, good for getting started), pgvector (PostgreSQL extension, ideal if you already run Postgres). For large-scale production: Qdrant in cluster mode, Weaviate, Milvus. For maximum sovereignty: self-hosted on OVHcloud London, Azure UK South or other UK-resident infrastructure. To avoid for sensitive data: US SaaS vector DBs (Pinecone, certain managed offerings) that reintroduce the cross-border transfer risk RAG was supposed to mitigate.
How much does a production RAG cost?
For a UK SME with 1,000 documents and 100 users: £50 to £200 per month in operations (LLM via API + self-hosted vector DB + storage). Initial integration investment: £5k to £25k depending on complexity (number of sources, ERP/CRM integrations, target UI quality). For a large enterprise with 100,000+ documents: £500 to £3,000 per month in operations. Typical ROI in 6 to 12 months if adoption sticks, mostly through document-search time savings and a reduction in tier-1 support ticket volume.
Is RAG GDPR/UK GDPR compliant?
RAG is an automated processing activity that must be entered in the Article 30 record of processing activities. If the knowledge base contains personal data, a DPIA is recommended and mandatory in certain cases (high volumes, special categories, monitoring). Hosting choice: LLM + vector DB on UK or EU sovereign infrastructure to avoid cross-border transfer risk. Access control is essential: a user must only see chunks they are legitimately entitled to access in the source documentation — otherwise RAG becomes a permissions-leak channel. The ICO’s guidance on AI and data protection sets the bar.
What is the difference between RAG and an AI agent?
A RAG answers a question by leaning on internal documents — it is a component. An AI agent decides on a sequence of actions to accomplish a higher-level mission — it is a system. RAG is a component frequently embedded inside agents.
Does RAG always hallucinate?
Yes, but far less than a standalone LLM. RAG constrains the LLM to ground its answers in supplied documents, drastically reducing factual hallucinations — typically from 90% down to less than 5% on questions whose answer is in the corpus. A good RAG always includes source citations.
How long to ship a production RAG?
Functional POC: 1 to 2 weeks with a development team comfortable with LLM APIs. Restricted production pilot (10-50 users): 4 to 8 additional weeks. Full industrialisation: 3 to 6 months depending on complexity. The bottleneck is rarely the technology — it is source ingestion and access governance.
Sources: official documentation for Mistral AI, Qdrant, Weaviate, Milvus, pgvector, Chroma; the foundational RAG paper (Lewis et al. 2020) and subsequent literature; LangChain, LlamaIndex and Haystack documentation; Information Commissioner’s Office — Guidance on AI and data protection; UK GDPR (Data Protection Act 2018); FCA Handbook and FCA SS1/23 on AI in financial services.
To scope a RAG project in your organisation — architecture choice, vector DB, ERP/CRM integration, access control, compliance — see our guide on GDPR-compliant AI, our enterprise AI charter guide, our AI use cases for business guide, or contact us via our bespoke AI solutions.