Chapter 6 - Modern AI Systems: RAG, Agents, and Glue Code
The Crux
Models alone are useless. Real AI systems are models + data pipelines + retrieval + guardrails + monitoring + glue code. This chapter is about engineering AI into production, not just training models.
Why Models Alone Are Useless
You've trained a great model. Congratulations. Now what?
Reality:
- The model needs to integrate with existing systems (databases, APIs, user interfaces)
- Users don't send perfectly formatted inputs
- The model drifts as the world changes
- You need to monitor failures, log predictions, and retrain periodically
- You need to handle errors gracefully (what if the API is down?)
As a rough rule of thumb, the model is about 10% of the system. The other 90% is infrastructure.
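One small piece of that infrastructure, handling messy inputs and a failing API, can be sketched as a wrapper around the model call. This is a minimal sketch, not a production pattern: `safe_predict`, `FALLBACK_ANSWER`, and `flaky_model` are hypothetical names invented for illustration, and the retry count is an arbitrary assumption.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("model_service")

# Hypothetical canned reply used when the model cannot be reached.
FALLBACK_ANSWER = "Sorry, I can't answer that right now."

def safe_predict(model_call: Callable[[str], str],
                 raw_input: Optional[str],
                 max_retries: int = 2) -> str:
    """Wrap a model call with input cleanup, logging, retries, and a fallback."""
    # Users don't send perfectly formatted inputs: normalize before calling.
    text = (raw_input or "").strip()
    if not text:
        return FALLBACK_ANSWER
    for attempt in range(1, max_retries + 1):
        try:
            answer = model_call(text)
            # Log every prediction so failures and drift can be inspected later.
            logger.info("prediction ok: input=%r answer=%r", text, answer)
            return answer
        except Exception as exc:
            logger.warning("attempt %d failed: %s", attempt, exc)
    # All attempts failed: degrade gracefully instead of crashing.
    return FALLBACK_ANSWER

def flaky_model(text: str) -> str:
    """Stand-in for a model API that is currently down."""
    raise ConnectionError("API is down")

print(safe_predict(flaky_model, "What's the return policy?"))
```

A real system would likely add backoff between retries and catch narrower exception types; the sketch keeps only the shape of the idea.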
RAG: Retrieval-Augmented Generation
LLMs hallucinate in part because they can only draw on what they memorized during training. What if we give them access to external knowledge?
The Idea
Instead of asking the LLM to answer directly, the system works in three steps:
- Retrieve relevant documents from a database
- Augment the prompt with retrieved information
- Generate the answer based on retrieved context
Example:
- User: "What's the return policy?"
- System retrieves: Company policy doc mentioning "30-day returns"
- Prompt: "Based on this policy: [retrieved text], answer: What's the return policy?"
- LLM: "We offer 30-day returns."
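The retrieve, augment, generate flow in this example can be sketched in a few lines. This is an illustration under stated assumptions: `answer_with_rag`, `fake_retrieve`, and `fake_llm` are hypothetical stand-ins, not a real retriever or LLM client, and the prompt template simply mirrors the one in the example above.

```python
from typing import Callable, List

def answer_with_rag(question: str,
                    retrieve: Callable[[str], List[str]],
                    generate: Callable[[str], str]) -> str:
    """Run the three RAG steps with pluggable retriever and generator."""
    docs = retrieve(question)                      # 1. Retrieve
    context = "\n".join(docs)
    prompt = (                                     # 2. Augment
        f"Based on this policy: {context}, answer: {question}"
    )
    return generate(prompt)                        # 3. Generate

def fake_retrieve(question: str) -> List[str]:
    """Hypothetical retriever that always returns the policy snippet."""
    return ["Items may be returned within 30 days of purchase."]

def fake_llm(prompt: str) -> str:
    """Hypothetical LLM stub: answers only if the context contains the fact."""
    if "30 days" in prompt:
        return "We offer 30-day returns."
    return "I don't know."

answer = answer_with_rag("What's the return policy?", fake_retrieve, fake_llm)
print(answer)
```

Passing the retriever and generator in as functions keeps the pipeline testable: either piece can be swapped for a real implementation without touching the other.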
Why It Works
The LLM doesn't need to memorize every fact. It just needs to read context and extract answers, something LLMs are good at.
Architecture
- Document store: Database of knowledge (vector database, Elasticsearch, etc.)
- Embedding model: Convert queries and documents to vectors
- Retrieval: Find the top-k documents most similar to the query (e.g., by cosine similarity)
- LLM: Generate answer given query + retrieved docs
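The retrieval component can be sketched with nothing but the standard library. The big assumption here is the embedding: `embed` is a toy bag-of-words counter standing in for a real embedding model, and the in-memory list stands in for a document store. Only the cosine-similarity ranking is meant to carry over to a real system.

```python
import math
from collections import Counter
from typing import List

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: word counts as a sparse vector."""
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve_top_k(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents,
                    key=lambda d: cosine_similarity(q, embed(d)),
                    reverse=True)
    return ranked[:k]

# Hypothetical in-memory "document store".
documents = [
    "our return policy allows returns within 30 days",
    "shipping takes 3 to 5 business days",
    "gift cards never expire",
]
top = retrieve_top_k("what is the return policy", documents, k=1)
print(top)
```

In practice the vectors would be dense outputs of a trained embedding model, and a vector database would replace the full sort with an approximate nearest-neighbor index.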
When to Use RAG vs Fine-Tuning
RAG:
- Knowledge changes frequently (e.g., product docs updated weekly)
- You need to cite sources
- You have limited GPU resources
Fine-tuning:
- Knowledge is stable
- You want the model to internalize a style or domain-specific reasoning
- You have labeled data and compute
Often, you use both: fine-tune for style/domain, RAG for up-to-date facts.