Last year, I built a feature for this very website — an AI assistant that can answer questions about me, my work, and my blog. The question I had to answer before writing a single line of code was: should I use RAG or fine-tune a model?
I chose RAG. And it was the right call — for that project. But it’s not always the right call.
After building multiple AI-powered products — from internal knowledge bots to content recommendation engines — I’ve learned that the “RAG vs fine-tuning” debate isn’t about which one is better. It’s about which one fits the problem you’re actually solving.
In this post, I’ll break down both approaches from the perspective of someone who’s actually built with both. No theoretical fluff. Real costs, real trade-offs, real code, and a clear decision framework you can use today.
The Core Problem Both Solve
Large Language Models like GPT-4.5, Claude Opus 4.5, and Gemini 3 Pro are incredibly capable. But they share two fundamental limitations:
- Knowledge cutoff — They don’t know what happened after their training date.
- No access to your data — They’ve never seen your company’s documents, your product database, or your internal policies.
Both RAG and fine-tuning are strategies to bridge this gap, but they do it in fundamentally different ways.
Think of it like this:
RAG is like giving someone a library card. They don’t memorise the books — they look things up when asked.
Fine-tuning is like sending someone to medical school. The knowledge becomes part of who they are.
Neither approach is universally better. The right choice depends on your data, your budget, your latency requirements, and how often your information changes.
What Is RAG (Retrieval-Augmented Generation)?
RAG is an architecture pattern where the AI model retrieves relevant information from external sources before generating a response. Instead of relying solely on what it learned during training, the model gets fresh, contextual data injected into every query.
How RAG Works — Step by Step
User Query → Embedding → Vector Search → Retrieve Top-K Documents →
Inject into Prompt → LLM Generates Answer
Here’s the detailed flow:
1. Your documents are pre-processed — Text is split into chunks, converted into vector embeddings, and stored in a vector database (like Pinecone, Weaviate, Qdrant, or even a simple JSON file).
2. User asks a question — The query is converted into an embedding using the same model.
3. Semantic search — The system finds the most relevant document chunks by comparing vector similarity (cosine similarity, dot product, etc.).
4. Context injection — The retrieved chunks are added to the LLM's prompt as context.
5. Generation — The LLM generates a response grounded in the retrieved information.
A Real RAG Implementation — From My Own Site
I built a RAG-powered AI assistant for xahidex.com. Here’s a simplified version of how I generate embeddings for my knowledge base:
```typescript
// Generate embeddings for knowledge base documents.
// getEmbedding() is a thin wrapper around the embedding API
// (e.g. text-embedding-3-small) that returns a number[] per string.

interface EmbeddingEntry {
  text: string;
  embedding: number[];
  source: string;
}

async function generateEmbeddings(
  documents: { content: string; source: string }[]
): Promise<EmbeddingEntry[]> {
  const entries: EmbeddingEntry[] = [];

  for (const doc of documents) {
    // Split into manageable chunks (~500 tokens each)
    const chunks = splitIntoChunks(doc.content, 500);

    for (const chunk of chunks) {
      const embedding = await getEmbedding(chunk);
      entries.push({
        text: chunk,
        embedding,
        source: doc.source,
      });
    }
  }

  return entries;
}

function splitIntoChunks(text: string, maxTokens: number): string[] {
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks: string[] = [];
  let current = '';

  for (const sentence of sentences) {
    // ~4 characters per token is a rough but workable heuristic
    if ((current + sentence).length > maxTokens * 4) {
      if (current) chunks.push(current.trim());
      current = sentence;
    } else {
      current += ' ' + sentence;
    }
  }

  if (current) chunks.push(current.trim());
  return chunks;
}
```

And here's how the retrieval works at query time:
```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  return dot / (magA * magB);
}

async function retrieveContext(
  query: string,
  embeddings: EmbeddingEntry[],
  topK: number = 5
): Promise<string> {
  const queryEmbedding = await getEmbedding(query);

  const scored = embeddings
    .map((entry) => ({
      ...entry,
      score: cosineSimilarity(queryEmbedding, entry.embedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);

  return scored.map((s) => s.text).join('\n\n');
}
```

This is RAG in its simplest form. No vector database, no complex infrastructure — just embeddings stored in a JSON file, cosine similarity for search, and context injected into the prompt. It works surprisingly well for personal sites and small-scale applications.
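If you want to poke at the full loop without an API key, every piece can be stubbed locally. The sketch below swaps the real embedding model for a toy bag-of-words hash; `toyEmbedding`, `cosine`, and `retrieve` are illustrative names, and the hash trick is only good enough to demo the mechanics, not real retrieval quality:

```typescript
// Toy stand-in for a real embedding API: hash words into a small
// normalised vector. A real system would call a model such as
// text-embedding-3-small instead.
function toyEmbedding(text: string, dims = 16): number[] {
  const vec = new Array(dims).fill(0);
  const words = text.toLowerCase().split(/\W+/).filter(Boolean);
  for (const w of words) {
    let h = 0;
    for (let i = 0; i < w.length; i++) h = ((h * 31 + w.charCodeAt(i)) >>> 0);
    vec[h % dims] += 1;
  }
  const mag = Math.sqrt(vec.reduce((s, v) => s + v * v, 0)) || 1;
  return vec.map((v) => v / mag);
}

// Vectors are pre-normalised, so the dot product IS the cosine similarity.
function cosine(a: number[], b: number[]): number {
  return a.reduce((s, v, i) => s + v * b[i], 0);
}

// Index an in-memory corpus and return the topK most similar texts,
// mirroring the embed -> search -> rank flow above.
function retrieve(query: string, corpus: string[], topK = 1): string[] {
  const index = corpus.map((text) => ({ text, embedding: toyEmbedding(text) }));
  const q = toyEmbedding(query);
  return index
    .map((e) => ({ text: e.text, score: cosine(q, e.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((e) => e.text);
}
```

The same three-step shape (embed the corpus, embed the query, rank by similarity) scales up to a real vector database without changing the logic.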
When RAG Shines
| Scenario | Why RAG Works |
|---|---|
| Data changes frequently | Re-embed new docs, no retraining needed |
| You need source citations | RAG naturally knows where it got the info |
| Company knowledge bases | Internal docs, FAQs, policies — all searchable |
| Real-time information | Plug in live APIs as data sources |
| Budget is limited | No GPU costs for training |
| Regulatory compliance | You control exactly what data the model can access |
RAG Limitations
- Retrieval quality is the bottleneck — If the search returns irrelevant chunks, the answer will be wrong (garbage in, garbage out).
- Context window limits — You can only inject so much context. With very large knowledge bases, you need smart chunking and ranking strategies.
- Latency — The retrieval step adds time. Expect 200-800ms overhead depending on your vector DB and infrastructure.
- Can’t change the model’s behaviour — RAG feeds information, but it can’t change how the model writes, reasons, or formats responses.
What Is Fine-Tuning?
Fine-tuning is the process of further training a pre-trained LLM on your own dataset. Unlike RAG, which works around the model, fine-tuning modifies the model’s weights — embedding new knowledge and behaviours directly into the neural network.
How Fine-Tuning Works
Base Model + Your Training Data → Training Process →
Custom Model (new weights) → Deploy → Inference
1. Prepare training data — Typically as instruction-response pairs (JSONL format).
2. Choose a base model — GPT-4o-mini, LLaMA 3.3, Mistral, Qwen 2.5, etc.
3. Train — The model's weights are updated using your data. This can take minutes to hours depending on dataset size and model.
4. Deploy — Your custom model is now available for inference.
5. Evaluate — Test against held-out examples to measure quality.
Fine-Tuning Training Data Example
```json
[
  {
    "messages": [
      {
        "role": "system",
        "content": "You are an AI assistant for a Dubai-based healthcare clinic. Answer patient queries professionally and accurately."
      },
      {
        "role": "user",
        "content": "What are your paediatric vaccination packages?"
      },
      {
        "role": "assistant",
        "content": "We offer comprehensive paediatric vaccination packages following the UAE Ministry of Health schedule. Our packages include: 1) Newborn Package (0-12 months) covering BCG, Hepatitis B, DPT, and MMR vaccines. 2) Toddler Package (1-4 years) including boosters and additional immunisations. Please call our clinic to book a consultation with our paediatric team."
      }
    ]
  },
  {
    "messages": [
      {
        "role": "user",
        "content": "Do you accept insurance?"
      },
      {
        "role": "assistant",
        "content": "Yes, we accept major insurance providers in the UAE including DHA, Daman, Oman Insurance, AXA, Cigna, MetLife, and Neuron. For a complete list or to verify your specific plan, please contact our reception team."
      }
    ]
  }
]
```

(The examples are shown as a pretty-printed array for readability. The file you actually upload is JSONL: one `{"messages": [...]}` object per line.)

Fine-Tuning With OpenAI — Quick Example
```bash
# Create a fine-tuning job from the CLI
# (the training data has already been uploaded as file-abc123)
openai api fine_tuning.jobs.create \
  -t "file-abc123" \
  -m "gpt-4o-mini-2024-07-18"
```

Or the same thing from Python:

```python
from openai import OpenAI

client = OpenAI()

# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3,
        "learning_rate_multiplier": 1.8,
    }
)

# Use your fine-tuned model
response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:your-org::abc123",
    messages=[
        {"role": "user", "content": "What vaccines does a 6-month-old need?"}
    ]
)
```

When Fine-Tuning Shines
| Scenario | Why Fine-Tuning Works |
|---|---|
| Custom tone/style | Train the model to write like your brand |
| Domain-specific reasoning | Medical, legal, financial — where precision matters |
| Structured output | Consistent JSON, specific formats every time |
| Reducing prompt length | Bake instructions into the model instead of the prompt |
| Classification tasks | Sentiment analysis, intent detection, categorisation |
| Offline/edge deployment | Fine-tuned small models can run locally |
Fine-Tuning Limitations
- Expensive — GPU compute for training isn’t cheap. Even with LoRA/QLoRA, you need decent hardware.
- Data preparation is tedious — You need hundreds to thousands of high-quality examples.
- Stale knowledge — Fine-tuned models can’t update their knowledge without retraining.
- Overfitting risk — Train too much on narrow data and the model loses general ability.
- No source attribution — The model “just knows” things — it can’t tell you where it learned them.
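Since data preparation is the expensive part, it pays to validate examples programmatically before spending anything on training. A minimal sketch of such a check; the rules here are illustrative sanity checks, not OpenAI's official validation:

```typescript
interface ChatMessage {
  role: string;
  content: string;
}

// Sanity-check one training example: at least two messages, only known
// roles, non-empty content, and an assistant turn at the end.
function validateExample(example: { messages: ChatMessage[] }): string[] {
  const errors: string[] = [];
  const allowed = new Set(['system', 'user', 'assistant']);

  if (!example.messages || example.messages.length < 2) {
    errors.push('needs at least a user and an assistant message');
    return errors;
  }
  for (const m of example.messages) {
    if (!allowed.has(m.role)) errors.push(`unknown role: ${m.role}`);
    if (!m.content) errors.push(`empty content in ${m.role} message`);
  }
  const last = example.messages[example.messages.length - 1];
  if (last.role !== 'assistant') {
    errors.push('last message must be from the assistant');
  }
  return errors;
}
```

Running a pass like this over every line of your JSONL file before uploading catches the silent formatting mistakes that otherwise waste a whole training run.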
RAG vs Fine-Tuning: The Complete Comparison
Here’s the comparison table I wish I had when I started building AI features:
| Factor | RAG | Fine-Tuning |
|---|---|---|
| What it changes | The input (context) | The model (weights) |
| Knowledge updates | Instant (re-embed new docs) | Requires retraining |
| Setup cost | Low ($0-50/month for most projects) | Medium-High ($50-5,000+ depending on scale) |
| Running cost | Higher per query (retrieval + longer prompts) | Lower per query (shorter prompts, baked-in knowledge) |
| Latency | +200-800ms for retrieval | No additional latency |
| Data needed | Raw documents (any format) | Curated instruction-response pairs (100s-1000s) |
| Hallucination risk | Lower (grounded in retrieved docs) | Higher (can confidently state wrong info) |
| Source attribution | Yes (knows which document it used) | No |
| Custom behaviour | Limited | Full control over tone, format, reasoning |
| Best for | Knowledge-heavy Q&A, search, support | Style, classification, domain reasoning |
| Maintenance | Update documents, re-embed | Re-train periodically |
| Privacy | Data stays in your vector DB | Data used in training (check provider policies) |
The Decision Framework: Which One Should You Use?
After building with both, here’s the decision framework I use for every new AI project:
Choose RAG If:
- ✅ Your data changes frequently (weekly or more)
- ✅ You need citations and source transparency
- ✅ You’re building a Q&A system over documents
- ✅ Budget is tight and you can’t afford training costs
- ✅ You need it working in days, not weeks
- ✅ Compliance requires you to control data access
- ✅ Your knowledge base is large (1,000+ documents)
Choose Fine-Tuning If:
- ✅ You need a specific writing style or tone
- ✅ Your data is stable and doesn’t change often
- ✅ You need consistent structured outputs (JSON, XML, etc.)
- ✅ You’re building a classifier or intent detector
- ✅ Latency is critical (every millisecond matters)
- ✅ You want to reduce token costs at scale
- ✅ The task requires domain-specific reasoning
Choose Both (Hybrid) If:
- ✅ You need custom behavior AND up-to-date knowledge
- ✅ You’re building a production system at scale
- ✅ You want the best possible quality regardless of complexity
The Hybrid Approach: Why Not Both?
Here’s what most blog posts about this topic miss: you can use RAG and fine-tuning together. In fact, for production systems, combining both often gives the best results.
How the Hybrid Works
Fine-Tuned Model (custom behaviour + domain knowledge)
+
RAG Pipeline (fresh, up-to-date context)
=
Best of both worlds
Example: Imagine you’re building an AI assistant for a law firm.
- Fine-tune the model on thousands of legal documents so it understands legal language, citation formats, and reasoning patterns.
- Use RAG to retrieve the specific case laws, statutes, and client documents relevant to each query.
The fine-tuned model knows how to think like a lawyer. RAG gives it the specific facts it needs for this particular case.
A Real-World Hybrid Architecture
```typescript
// 1. Fine-tuned model handles the reasoning
const model = 'ft:gpt-4o-mini:your-org::legal-assistant-v2';

// 2. RAG retrieves relevant documents
const context = await retrieveContext(userQuery, legalEmbeddings, 8);

// 3. Combine: fine-tuned behaviour + RAG context
const response = await openai.chat.completions.create({
  model,
  messages: [
    {
      role: 'system',
      content: `You are a legal research assistant. Use the following
case documents to support your analysis. Always cite specific
sections and precedents.\n\nRelevant Documents:\n${context}`,
    },
    { role: 'user', content: userQuery },
  ],
});
```

This is where the magic happens. The fine-tuned model already knows legal terminology and formats, so it needs fewer instructions in the prompt. RAG provides the specific, up-to-date case information. Together, they produce responses that neither could achieve alone.
Real Cost Breakdown (2026 Pricing)
Let’s talk real numbers. This is something most articles gloss over.
RAG Costs
| Component | Service | Cost |
|---|---|---|
| Embeddings (generation) | OpenAI text-embedding-3-small | $0.02 per 1M tokens |
| Vector database | Pinecone (starter) | Free tier / $70/month |
| Vector database | Qdrant Cloud | Free tier / $25/month |
| Vector database | Self-hosted (JSON/SQLite) | $0 |
| LLM queries | GPT-4o-mini | $0.15 per 1M input tokens |
| LLM queries | Claude 3.5 Haiku | $0.25 per 1M input tokens |
Typical monthly cost for a small RAG app: $5-50/month
For my personal site’s AI assistant, I spend approximately $3/month — embeddings stored in a JSON file (free), and I pay only for the LLM inference per query.
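The arithmetic behind these estimates is simple enough to sketch. `monthlyRagCost` is a hypothetical helper; the token counts and the output price used in the example are assumptions you should replace with your own numbers:

```typescript
// Rough monthly LLM cost for a RAG app. Prices are per 1M tokens.
// Retrieval makes RAG prompts long, so input tokens dominate.
function monthlyRagCost(
  queriesPerMonth: number,
  promptTokensPerQuery: number,   // retrieved context + instructions + question
  outputTokensPerQuery: number,
  inputPricePerMTok: number,      // e.g. 0.15 for GPT-4o-mini input
  outputPricePerMTok: number      // assumed output price for illustration
): number {
  const inputCost =
    (queriesPerMonth * promptTokensPerQuery / 1e6) * inputPricePerMTok;
  const outputCost =
    (queriesPerMonth * outputTokensPerQuery / 1e6) * outputPricePerMTok;
  return inputCost + outputCost;
}

// 10,000 queries/month, ~3,000 prompt tokens, ~300 output tokens
console.log(monthlyRagCost(10_000, 3_000, 300, 0.15, 0.6).toFixed(2));
```

Plugging in your own per-query token counts is the fastest way to sanity-check whether RAG's longer prompts or fine-tuning's doubled per-token price wins for your workload.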
Fine-Tuning Costs
| Component | Service | Cost |
|---|---|---|
| Training | OpenAI GPT-4o-mini | $3.00 per 1M training tokens |
| Training | OpenAI GPT-4o | $25.00 per 1M training tokens |
| Training | Self-hosted (LLaMA 3.3) | ~$2-10/hour GPU rental |
| Inference | Fine-tuned GPT-4o-mini | $0.30 per 1M input tokens (2x base) |
| Data preparation | Manual curation | Your time (most expensive part) |
Typical cost for a fine-tuning project: $50-500+ one-time, then ongoing inference costs.
Cost Comparison for 10,000 Queries/Month
| Approach | Monthly Cost Estimate |
|---|---|
| RAG with GPT-4o-mini | ~$8-15 |
| Fine-tuned GPT-4o-mini | ~$5-10 (lower per query, no retrieval) |
| Hybrid (both) | ~$12-20 |
| Self-hosted RAG (Ollama + local) | ~$0 (just electricity) |
The takeaway: RAG is cheaper to start, fine-tuning is cheaper at scale. For most indie projects and small businesses, RAG is the pragmatic choice.
Beyond Basic RAG: Advanced Patterns in 2026
The RAG landscape has evolved significantly. Here are the patterns I’m paying attention to:
1. GraphRAG
Traditional RAG treats documents as isolated chunks. GraphRAG builds a knowledge graph from your documents, understanding relationships between entities. When you ask a question, it traverses the graph to find connected information — not just similar text.
Traditional RAG: "Find chunks that mention X"
GraphRAG: "Find chunks about X, then follow relationships to Y and Z"
This is particularly powerful for complex domains where information is interconnected — like medical records, legal cases, or technical documentation.
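The traversal itself can be sketched in a few lines. This is a toy graph expansion, not a real GraphRAG implementation; entity extraction and edge building are assumed to have happened upstream:

```typescript
// Expand a set of seed entities by following graph edges for a fixed
// number of hops. The returned set drives which chunks get retrieved.
function graphExpand(
  seedEntities: string[],
  edges: Map<string, string[]>,
  hops = 1
): Set<string> {
  const found = new Set(seedEntities);
  let frontier = seedEntities;

  for (let h = 0; h < hops; h++) {
    const next: string[] = [];
    for (const entity of frontier) {
      for (const neighbor of edges.get(entity) ?? []) {
        if (!found.has(neighbor)) {
          found.add(neighbor);
          next.push(neighbor);
        }
      }
    }
    frontier = next; // only newly discovered entities seed the next hop
  }
  return found;
}
```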
2. Agentic RAG
Instead of a single retrieval step, agentic RAG uses AI agents that can:
- Decide which data sources to query
- Reformulate queries if initial results are poor
- Chain multiple retrievals together
- Validate retrieved information before using it
Think of it as RAG with a brain — the system doesn’t just blindly retrieve and inject, it strategically hunts for the right information.
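The reformulate-and-retry loop can be sketched as follows. `agenticRetrieve`, the score threshold, and the reformulation callback are all illustrative; a production system would typically ask the LLM itself to rewrite the query:

```typescript
interface ScoredChunk {
  text: string;
  score: number;
}

// Placeholder retriever type: a real one would embed the query and
// search a vector store.
type Retriever = (query: string) => ScoredChunk[];

// Retry retrieval with reformulated queries until the top result clears
// a relevance threshold, falling back to the best attempt seen.
function agenticRetrieve(
  query: string,
  retrieve: Retriever,
  reformulate: (q: string, attempt: number) => string,
  minScore = 0.7,
  maxAttempts = 3
): ScoredChunk[] {
  let currentQuery = query;
  let best: ScoredChunk[] = [];

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const results = retrieve(currentQuery);
    if (results.length > 0 && results[0].score >= minScore) return results;
    if ((results[0]?.score ?? 0) > (best[0]?.score ?? 0)) best = results;
    currentQuery = reformulate(currentQuery, attempt);
  }
  return best;
}
```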
3. MCP-Based RAG
With the rise of Model Context Protocol (MCP), RAG is getting a standardised interface. Instead of building custom retrieval pipelines, you can expose your knowledge base as an MCP resource. Any MCP-compatible AI (Claude, GPT, Gemini) can then access it through a standard protocol.
I wrote about MCP in detail — it’s changing how we think about connecting AI to data.
4. Late-Interaction Retrieval (ColBERT v2+)
Instead of compressing an entire document chunk into a single embedding vector, late-interaction models keep token-level embeddings. This allows for much more precise matching at query time. It’s computationally heavier but significantly more accurate for technical or nuanced queries.
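The core scoring idea, often called MaxSim, is compact enough to sketch: for each query token, take its best match among the document's tokens, then sum those maxima. The vectors below are toy values and assumed to be already normalised:

```typescript
function dot(a: number[], b: number[]): number {
  return a.reduce((s, v, i) => s + v * b[i], 0);
}

// Late-interaction scoring: query and document are each a list of
// per-token embeddings rather than one pooled vector. Each query token
// contributes its similarity to the best-matching document token.
function maxSimScore(queryTokens: number[][], docTokens: number[][]): number {
  let score = 0;
  for (const q of queryTokens) {
    score += Math.max(...docTokens.map((d) => dot(q, d)));
  }
  return score;
}
```

Because every query token gets to pick its own best match, a rare technical term can dominate the score even when the rest of the chunk is only loosely related, which is exactly where single-vector retrieval tends to blur.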
Beyond Basic Fine-Tuning: Modern Techniques
Fine-tuning has also evolved well beyond “train the whole model on your data”:
1. LoRA (Low-Rank Adaptation)
Instead of updating all model parameters, LoRA adds small trainable matrices to specific layers. This reduces training cost by 10-100x while maintaining most of the quality. It’s the standard for fine-tuning open-source models in 2026.
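The parameter savings are easy to quantify: instead of updating a full d_out by d_in weight matrix, LoRA trains two small matrices, B (d_out by r) and A (r by d_in). A quick sketch of the ratio, with `loraSavings` as a hypothetical helper:

```typescript
// Trainable parameters for a full update vs a rank-r LoRA update of
// one d_out x d_in weight matrix.
function loraSavings(dOut: number, dIn: number, r: number) {
  const full = dOut * dIn;
  const lora = r * (dOut + dIn); // B: dOut x r, plus A: r x dIn
  return { full, lora, ratio: full / lora };
}

// e.g. a 4096 x 4096 projection with rank-8 adapters
console.log(loraSavings(4096, 4096, 8));
```

For that single layer, rank-8 adapters train 65,536 parameters instead of ~16.8M, a 256x reduction, which is where the "10-100x cheaper" figure for whole models comes from once you account for the layers left frozen.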
2. QLoRA (Quantized LoRA)
Combines quantization (reducing model precision to 4-bit) with LoRA. This means you can fine-tune a 70B parameter model on a single consumer GPU. A few years ago, this would have required a cluster.
3. DPO (Direct Preference Optimisation)
Instead of training on “correct” answers, DPO trains on human preferences — “response A is better than response B.” This is incredibly effective for aligning model behaviour with what humans actually want.
4. Synthetic Data Fine-Tuning
Use a stronger model (like GPT-4.5 or Claude Opus) to generate training data for a smaller model. This is becoming the standard approach for building cost-effective, domain-specific models. You get 80-90% of the big model’s quality at a fraction of the cost.
Common Mistakes I’ve Seen (and Made)
Mistake 1: Using Fine-Tuning When You Need RAG
I’ve seen teams spend weeks fine-tuning a model on their company wiki — only to realise the wiki changes every week. Every update meant retraining. They should have used RAG from day one.
Rule of thumb: If your data changes more than once a month, start with RAG.
Mistake 2: Using RAG When You Need Fine-Tuning
On the flip side, I once built a RAG system for a client who wanted their AI to write in a very specific brand voice. The retrieved context helped with facts, but the tone was always off. Fine-tuning on 500 examples of their brand writing fixed it instantly.
Rule of thumb: If your problem is about how the model responds (style, format, tone), fine-tuning is the answer.
Mistake 3: Bad Chunking Strategy in RAG
This is the most common RAG failure. If you split documents at arbitrary character limits, you’ll break sentences, lose context, and get terrible retrieval quality. Smart chunking (by paragraphs, sections, or semantic boundaries) makes an enormous difference.
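A sketch of what smart chunking can look like at its simplest: split on paragraph boundaries first, and only start a new chunk when the size budget would be exceeded. `chunkByParagraphs` and the character budget are illustrative:

```typescript
// Chunk by paragraph boundaries instead of raw character offsets.
// maxChars is a rough budget (~4 characters per token is a common heuristic).
function chunkByParagraphs(text: string, maxChars = 2000): string[] {
  const paragraphs = text.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
  const chunks: string[] = [];
  let current = '';

  for (const para of paragraphs) {
    // +2 accounts for the blank line re-inserted between merged paragraphs
    if (current && current.length + para.length + 2 > maxChars) {
      chunks.push(current);
      current = para;
    } else {
      current = current ? current + '\n\n' + para : para;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

No sentence is ever cut in half, so every chunk remains a self-contained unit of meaning for the embedding model.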
Mistake 4: Not Evaluating Retrieval Quality
Most people evaluate the final LLM output but never check whether the retriever is actually finding the right documents. If retrieval is broken, no amount of prompt engineering will fix it. Always measure retrieval precision and recall separately.
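Measuring the retriever separately only requires a small labelled set of query-to-relevant-document pairs. A minimal sketch, with `retrievalMetrics` as a hypothetical helper:

```typescript
// Precision@k and recall for one query: compare the retrieved document
// ids against a hand-labelled set of relevant ids.
function retrievalMetrics(retrieved: string[], relevant: Set<string>) {
  const hits = retrieved.filter((id) => relevant.has(id)).length;
  return {
    precision: retrieved.length ? hits / retrieved.length : 0,
    recall: relevant.size ? hits / relevant.size : 0,
  };
}
```

Averaging these over even 20-30 labelled queries tells you whether to spend your next week on the retriever or on the prompt.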
Mistake 5: Over-Fine-Tuning
Training for too many epochs on a small dataset leads to overfitting. The model becomes great at parroting your training examples but terrible at everything else. Start with 2-3 epochs and evaluate rigorously.
My Personal Decision Process
When a new AI project lands on my desk, here’s exactly how I decide:
Step 1: What’s the core problem?
- Knowledge access → RAG
- Behaviour change → Fine-tuning
- Both → Hybrid
Step 2: How often does the data change?
- Daily/weekly → RAG (definitely)
- Monthly → RAG or hybrid
- Rarely → Fine-tuning is an option
Step 3: What’s the budget?
- Under $50/month → RAG with a hosted LLM
- $50-500/month → RAG or fine-tuning
- $500+/month → Hybrid, or fine-tuned + RAG
Step 4: What’s the timeline?
- Need it this week → RAG
- Can invest 2-4 weeks → Fine-tuning
- Long-term product → Hybrid architecture
Step 5: How critical is accuracy?
- Must cite sources → RAG
- Must be consistent → Fine-tuning
- Both → Hybrid
For my personal projects, I almost always start with RAG. It’s faster to prototype, cheaper to run, and easier to iterate. Fine-tuning comes later when I’ve validated the use case and have enough data.
Frequently Asked Questions
Is RAG better than fine-tuning?
Neither is universally better. RAG excels at knowledge-heavy applications with frequently changing data. Fine-tuning excels at changing model behaviour, tone, and domain reasoning. For many production systems, combining both gives the best results.
Can I use RAG and fine-tuning together?
Yes, and it’s increasingly common. Fine-tune for behaviour and style, use RAG for dynamic knowledge. The fine-tuned model becomes better at interpreting and using the retrieved context.
How much data do I need for fine-tuning?
OpenAI recommends a minimum of 50 examples, but 200-1,000 high-quality instruction-response pairs is the sweet spot for most use cases. Quality matters far more than quantity.
How much data do I need for RAG?
RAG works with any amount of data — from a single document to millions. The key is proper chunking and embedding quality, not raw volume.
Is fine-tuning worth it for small projects?
Usually not. For small projects, RAG + good prompt engineering gets you 80-90% of the way. Fine-tuning makes sense when you’re at scale or need very specific model behaviour.
What vector database should I use for RAG?
For small projects: a JSON file or SQLite works fine (it’s what I use on this site). For production: Pinecone, Qdrant, Weaviate, or pgvector (PostgreSQL extension) are all solid choices. Pick based on your existing infrastructure.
Does RAG work with open-source models?
Absolutely. RAG is model-agnostic. You can use it with LLaMA, Mistral, Qwen, Phi, or any model that accepts a system/context prompt. I’ve tested RAG with Ollama locally and it works great.
Will fine-tuning make my model hallucinate less?
Not necessarily. Fine-tuning can actually increase hallucinations if the model learns to be confidently wrong from noisy training data. RAG is generally better for reducing hallucinations because it grounds responses in retrieved facts.
Final Thoughts
The RAG vs fine-tuning debate is really about understanding your problem deeply enough to choose the right tool. After building with both extensively, here’s my honest summary:
Start with RAG. It’s faster to build, cheaper to run, and easier to debug. For most applications — especially if you’re dealing with knowledge that changes — RAG is the pragmatic choice.
Add fine-tuning when you need it. When you’ve validated your use case and need specific behaviour, consistent output formats, or domain expertise baked into the model, fine-tuning is worth the investment.
Don’t be afraid of the hybrid approach. The best AI products I’ve built combine both. Fine-tune for behaviour, RAG for knowledge. It’s more complex, but the quality difference is significant.
And if you’re just getting started? Build a simple RAG system. Even a JSON file with embeddings and cosine similarity will get you surprisingly far. I know because that’s exactly how the AI on this website works.



