8. CAPSTONE PROJECT
Enterprise "Company Brain" - The Ultimate RAG + KG System
Project Overview
Build a complete enterprise knowledge management system that ingests company documents, builds a knowledge graph, and answers questions using hybrid RAG + KG retrieval.
(This is the final boss. This project integrates everything you've learned - chunking, embeddings, graph construction, hybrid retrieval, deployment, monitoring, cost optimization. It will take weeks, not days. It will break in frustrating ways. You will question your life choices. When it finally works, you'll have a portfolio piece that actually demonstrates competence, not just "followed a tutorial." That's worth the pain.)
System Requirements
Input Sources:
- PDF documents (reports, papers, manuals)
- Markdown files (wikis, docs)
- CSV data (employee directory, project list)
- Web pages (company blog, documentation)
Capabilities:
- Document Ingestion: Async pipeline processing all formats
- Knowledge Graph: Auto-build from all documents
- Hybrid Search: Combine structured + unstructured retrieval
- Query Interface: Natural language queries with explanations
- Admin Dashboard: Monitor usage, costs, data sources
Architecture
┌─────────────────────────────────────────────────────────────┐
│                       INGESTION LAYER                       │
│         PDF│MD│CSV│Web → Process → Extract → Split          │
└────────────────────┬────────────────────────────────────────┘
                     │
         ┌───────────┴──────────┐
         ↓                      ↓
┌────────────────┐       ┌────────────────────┐
│ VECTOR STORE   │       │ KNOWLEDGE GRAPH    │
│ (Pinecone/     │       │ (Neo4j)            │
│  Chroma)       │       │                    │
│                │       │ Entities           │
│ Embeddings     │       │ Relationships      │
│ Metadata       │       │ Properties         │
└────────┬───────┘       └──────┬─────────────┘
         │                      │
         └────────┬─────────────┘
                  ↓
         ┌────────────────────┐
         │ HYBRID RETRIEVER   │
         │                    │
         │ - Query Router     │
         │ - Entity Linker    │
         │ - Context Fusion   │
         └────────┬───────────┘
                  ↓
         ┌────────────────────┐
         │ LLM GENERATION     │
         │ + REASONING        │
         └────────┬───────────┘
                  ↓
         ┌────────────────────┐
         │ API + FRONTEND     │
         │ (FastAPI+React)    │
         └────────────────────┘
Technical Specifications
Tech Stack:
- Backend: Python 3.11, FastAPI
- Vector DB: Pinecone or ChromaDB
- Graph DB: Neo4j
- LLM: GPT-4 (primary), GPT-3.5-turbo (fallback)
- Frontend: React or Streamlit
- Deployment: Docker + Docker Compose
- Monitoring: Prometheus + Grafana (bonus)
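The primary/fallback LLM pairing in the tech stack can be sketched as a small wrapper that tries each model in order. This is a minimal illustration, not a prescribed design: `generate_with_fallback` and the stub callables are hypothetical names, and a real version would wrap your actual LLM client calls and catch only the error types that client raises.

```python
def generate_with_fallback(prompt, providers):
    """Try each (name, callable) pair in order; return the first success.

    `providers` is an ordered list such as
    [("gpt-4", call_primary), ("gpt-3.5-turbo", call_fallback)],
    where each callable is a hypothetical wrapper around a real client.
    """
    failures = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # assumption: narrow this in real code
            failures.append(f"{name}: {exc}")
    raise RuntimeError("All providers failed: " + "; ".join(failures))


# Stub callables standing in for real API calls.
def flaky_primary(prompt):
    raise TimeoutError("primary timed out")


def stub_fallback(prompt):
    return f"answer to: {prompt}"


used_model, answer = generate_with_fallback(
    "Who owns Project Atlas?",
    [("gpt-4", flaky_primary), ("gpt-3.5-turbo", stub_fallback)],
)
```

Returning the model name alongside the answer makes it easy to log how often the fallback path is taken, which feeds directly into the cost figures on the admin dashboard.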
Core Features (Must-Have):
- Document upload (drag-and-drop)
- Automatic KG construction
- Natural language queries
- Cited answers with source links
- Reasoning explanation ("how I found this")
- Query routing (auto-select KG vs RAG vs Hybrid)
- Admin dashboard (stats, costs)
Advanced Features (Nice-to-Have):
- Multi-user support with authentication
- Document versioning
- Query history and analytics
- Custom entity types
- Graph visualization
- Export answers as reports
Implementation Steps
Phase 1: Data Ingestion (Week 1)
- Build DocumentProcessor for all file types
- Implement async processing queue
- Add metadata extraction
- Test with 50+ documents
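One way to picture the Phase 1 pieces working together is a worker pool fed by an `asyncio.Queue`. The sketch below is an assumption-heavy outline: the `DocumentProcessor` internals, the `ProcessedDocument` fields, and the `ingest` helper are all hypothetical, and the "parsing" is a placeholder where a real pipeline would call PDF, Markdown, CSV, and HTML libraries.

```python
import asyncio
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ProcessedDocument:
    source: str
    doc_type: str
    text: str
    metadata: dict = field(default_factory=dict)


class DocumentProcessor:
    """Maps a file extension to a document type (hypothetical design)."""

    SUPPORTED = {".pdf": "pdf", ".md": "markdown", ".csv": "csv", ".html": "web"}

    async def process(self, path, raw):
        suffix = Path(path).suffix.lower()
        if suffix not in self.SUPPORTED:
            raise ValueError(f"Unsupported file type: {suffix!r}")
        await asyncio.sleep(0)  # stand-in for real async I/O or parsing
        text = raw.strip()
        return ProcessedDocument(
            source=path,
            doc_type=self.SUPPORTED[suffix],
            text=text,
            metadata={"filename": Path(path).name, "n_chars": len(text)},
        )


async def worker(queue, processor, results, errors):
    """Pull (path, raw) items off the queue until cancelled."""
    while True:
        path, raw = await queue.get()
        try:
            results.append(await processor.process(path, raw))
        except ValueError as exc:
            errors.append((path, str(exc)))  # record and keep going
        finally:
            queue.task_done()


async def ingest(files, n_workers=3):
    queue = asyncio.Queue()
    results, errors = [], []
    processor = DocumentProcessor()
    workers = [
        asyncio.create_task(worker(queue, processor, results, errors))
        for _ in range(n_workers)
    ]
    for item in files:
        queue.put_nowait(item)
    await queue.join()  # wait until every item has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results, errors


results, errors = asyncio.run(
    ingest([("report.pdf", " Q3 summary "), ("wiki.md", "# Onboarding"), ("tool.exe", "x")])
)
```

Collecting failures in a separate list instead of aborting means one unreadable file does not stall the other 49 in a 50-document test run.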
Phase 2: Knowledge Graph Construction (Week 2)
- Entity and relationship extraction
- Entity linking and deduplication
- Load into Neo4j
- Build basic Cypher query interface
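Entity deduplication is often where graph quality is won or lost, so a small sketch may help. The version below merges entities whose normalized names match and rewrites relationships to point at the surviving name. The normalization rules, the suffix list, and the function names are illustrative assumptions; it also assumes each surface name maps to a single entity type. Real systems usually add fuzzy or embedding-based matching on top.

```python
import re

# Assumption: a tiny illustrative list of corporate suffixes to ignore.
_SUFFIXES = {"inc", "corp", "ltd", "llc"}


def normalize(name):
    """Lowercase, strip punctuation, and drop corporate suffixes."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    return " ".join(t for t in tokens if t not in _SUFFIXES)


def deduplicate(entities, relationships):
    """Merge entities that share a normalized name and type.

    entities: list of (name, type) tuples
    relationships: list of (source_name, relation, target_name) tuples
    Returns (canonical_names, rewritten_relationships).
    """
    canonical = {}           # (normalized name, type) -> first-seen surface form
    alias_to_canonical = {}  # every surface form -> its canonical form
    for name, etype in entities:
        key = (normalize(name), etype)
        canonical.setdefault(key, name)
        alias_to_canonical[name] = canonical[key]

    rewritten = {
        (alias_to_canonical.get(src, src), rel, alias_to_canonical.get(dst, dst))
        for src, rel, dst in relationships
    }
    return sorted(set(alias_to_canonical.values())), sorted(rewritten)


entities = [
    ("Acme Corp.", "Organization"),
    ("ACME", "Organization"),
    ("Jane Doe", "Person"),
    ("jane doe", "Person"),
]
relationships = [
    ("Jane Doe", "WORKS_AT", "Acme Corp."),
    ("jane doe", "WORKS_AT", "ACME"),
]
names, edges = deduplicate(entities, relationships)
```

Running deduplication before the Neo4j load keeps duplicate nodes out of the graph in the first place, which is simpler than merging nodes after the fact.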
Phase 3: RAG System (Week 2-3)
- Chunking strategy implementation
- Embedding generation and storage
- Hybrid retriever (BM25 + semantic)
- Reranker integration
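The hybrid retriever needs some way to merge the BM25 ranking with the semantic ranking. One common option, shown here as a sketch and not as the required approach, is reciprocal rank fusion (RRF): each document earns `1 / (k + rank)` from every list it appears in, and the sums decide the final order. The function name, the document IDs, and the choice of `k = 60` (a widely used default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked lists of document IDs into one list.

    rankings: list of lists, each ordered best-first
    k: damping constant; larger values flatten the rank differences
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending, then by ID so ties are deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


bm25_hits = ["d1", "d2", "d3"]       # hypothetical keyword results
semantic_hits = ["d3", "d1", "d4"]   # hypothetical embedding results
fused = reciprocal_rank_fusion([bm25_hits, semantic_hits])
```

RRF only needs ranks, not raw scores, so there is no need to put BM25 scores and cosine similarities on a common scale before combining them. The fused list is then a natural input for the reranker.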
Phase 4: Hybrid System (Week 3-4)
- Query classification and routing
- KG-augmented retrieval
- Context fusion
- Answer generation with citations
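Query classification and routing can start as simple rules before graduating to an LLM-based or trained classifier. The sketch below is one hedged baseline: queries with no known entities go to RAG, a single entity with no analytical cue goes to the KG, and everything else goes to the hybrid path. The cue words, the `Route` enum, and the entity list are hypothetical and would need tuning against your own test questions.

```python
import re
from enum import Enum


class Route(Enum):
    KG = "kg"
    RAG = "rag"
    HYBRID = "hybrid"


# Assumption: cue words hinting that the answer needs free-text context.
ANALYTICAL_CUES = re.compile(
    r"\b(why|explain|summarize|compare|impact|how (?:does|do|did|should))\b",
    re.IGNORECASE,
)


def find_entities(query, known_entities):
    """Naive entity linking: case-insensitive substring match."""
    lowered = query.lower()
    return [e for e in known_entities if e.lower() in lowered]


def route_query(query, known_entities):
    entities = find_entities(query, known_entities)
    analytical = bool(ANALYTICAL_CUES.search(query))
    if not entities:
        return Route.RAG      # nothing to anchor a graph lookup on
    if analytical or len(entities) >= 2:
        return Route.HYBRID   # likely multi-hop or needs supporting text
    return Route.KG           # single-entity factual lookup


KNOWN = ["Jane Doe", "Project Atlas"]  # would come from the graph in practice
factual = route_query("Who manages Project Atlas?", KNOWN)
analytical = route_query("Summarize the main risks in the security policy", KNOWN)
multi_hop = route_query("Why did the team led by Jane Doe delay Project Atlas?", KNOWN)
```

Logging the chosen route for every query gives you the data to measure routing accuracy later and to replace these rules with something smarter.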
Phase 5: API & Frontend (Week 4-5)
- FastAPI endpoints
- Frontend (chat interface)
- Admin dashboard
- Authentication
Phase 6: Testing & Deployment (Week 5-6)
- Unit tests (>80% coverage)
- Integration tests
- Load testing
- Docker deployment
- Documentation
Evaluation Benchmarks
Quantitative Metrics:
- Accuracy: 85%+ on 100-question test set
- Latency: p95 < 2 seconds
- Throughput: 50+ concurrent users
- Cost: < $0.10 per query
- Uptime: 99.5%+
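Three of these targets (accuracy, p95 latency, and cost per query) can be checked from a log of test-set runs. The sketch below assumes a hypothetical record format with `correct`, `latency_ms`, and `cost_usd` fields, and it uses the nearest-rank definition of a percentile; other definitions interpolate and can give slightly different values. Throughput and uptime need load-testing and monitoring tools and are not covered here.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate(records):
    """Summarize test-set runs and compare them with the stated targets."""
    n = len(records)
    accuracy = sum(r["correct"] for r in records) / n
    p95 = percentile([r["latency_ms"] for r in records], 95)
    avg_cost = sum(r["cost_usd"] for r in records) / n
    return {
        "accuracy": accuracy,
        "p95_latency_ms": p95,
        "avg_cost_usd": avg_cost,
        "passed": {
            "accuracy": accuracy >= 0.85,  # 85%+ target
            "latency": p95 < 2000,         # p95 < 2 seconds
            "cost": avg_cost < 0.10,       # < $0.10 per query
        },
    }


# Synthetic example data: 20 runs, 18 correct, latencies 100..2000 ms.
records = [
    {"correct": i > 2, "latency_ms": 100 * i, "cost_usd": 0.05}
    for i in range(1, 21)
]
report = evaluate(records)
```

Note that one 2000 ms outlier does not break the p95 target here, because p95 deliberately ignores the slowest 5% of requests.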
Qualitative Assessment:
- Answer quality (human evaluation)
- Citation accuracy (source verification)
- Reasoning clarity (explanation quality)
- User experience (UI/UX review)
Evaluation Rubric
| Component | Weight | Criteria |
|---|---|---|
| Data Ingestion | 15% | Handles all file types, metadata extraction, async processing |
| Knowledge Graph | 20% | Entity/relation extraction quality, graph completeness, Cypher queries work |
| RAG System | 20% | Retrieval quality, chunking strategy, embedding optimization |
| Hybrid Integration | 25% | Query routing, context fusion, answer quality |
| Production Quality | 20% | API design, testing, deployment, documentation, monitoring |
Total: 100 points
- 90-100: Exceptional - Production-ready, innovative features
- 80-89: Excellent - All core features working well
- 70-79: Good - Core features present, some rough edges
- 60-69: Adequate - Basic functionality works
- <60: Needs improvement
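To see how the weights combine, here is a small worked calculation. It assumes each component is first scored from 0 to 100 and then weighted by the percentages in the table; the rubric does not spell out that detail, so treat it as one plausible reading. The sample scores are made up.

```python
# Weights taken from the rubric table (percent of the total).
WEIGHTS = {
    "Data Ingestion": 15,
    "Knowledge Graph": 20,
    "RAG System": 20,
    "Hybrid Integration": 25,
    "Production Quality": 20,
}

# Lower bound of each band, checked from highest to lowest.
BANDS = [(90, "Exceptional"), (80, "Excellent"), (70, "Good"), (60, "Adequate")]


def total_score(component_scores):
    """Weighted total out of 100, given per-component scores of 0-100."""
    missing = set(WEIGHTS) - set(component_scores)
    if missing:
        raise ValueError(f"Missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[c] * component_scores[c] for c in WEIGHTS) / 100


def band(total):
    for threshold, label in BANDS:
        if total >= threshold:
            return label
    return "Needs improvement"


sample = {
    "Data Ingestion": 90,
    "Knowledge Graph": 80,
    "RAG System": 85,
    "Hybrid Integration": 70,
    "Production Quality": 75,
}
total = total_score(sample)  # (1350 + 1600 + 1700 + 1750 + 1500) / 100
label = band(total)
```

Because Hybrid Integration carries the largest weight, a weak score there drags the total down more than the same weakness anywhere else: in this sample, raising it from 70 to 80 would lift the total by 2.5 points and cross into the next band.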
What Your Portfolio Demo Should Show
5-Minute Video Covering:
- Intro (30s): Problem statement and solution overview
- Data Ingestion (60s): Upload docs, show processing pipeline
- Knowledge Graph (60s): Visualize graph, run Cypher query
- Query Demo (90s):
  - Factual query (KG-routed)
  - Analytical query (RAG-routed)
  - Multi-hop query (Hybrid)
- Advanced Features (60s): Citations, reasoning, admin dashboard
- Technical Deep-Dive (30s): Architecture diagram, tech stack
How This Signals Hire-Readiness
What Employers See:
- ✅ Full-stack skills: Backend + Frontend + DevOps
- ✅ AI/ML expertise: LLMs, embeddings, vector DBs
- ✅ Data engineering: Pipelines, async processing
- ✅ Production thinking: Testing, monitoring, deployment
- ✅ Problem-solving: Complex system design
- ✅ Communication: Clear documentation and demo
Conversation Starters in Interviews:
- "Tell me about your approach to query routing"
- "How did you optimize for cost and latency?"
- "What were the biggest challenges in building this?"
- "How would you scale this to 10M documents?"