Chapter 5 of 16
2A. THEORETICAL FOUNDATIONS (DEEP DIVE)
"Theory without practice is sterile, practice without theory is blind." - Immanuel Kant
This section provides the mathematical and conceptual foundations that power RAG and Knowledge Graph systems. Understanding these principles deeply will transform you from a code copier to an AI systems architect.
Fair warning: This section is dense. We're going to cover vector space theory, information retrieval theory, graph theory, and semantic similarity - all with actual math. If you're here to copy-paste code and move on, skip this section and come back when something breaks in production and you need to understand why. If you want to be the person who designs the system rather than just using it, buckle up.
Vector Space Theory & Embeddings (The Mathematics of Meaning)
The Core Idea: Meaning as Geometry
Fundamental Insight: If we can represent words, sentences, or documents as points in a high-dimensional space, then similar meanings should be close together geometrically.
Historical Context: This idea dates back to distributional semantics (1950s): "You shall know a word by the company it keeps" - J.R. Firth. Modern embeddings (Word2Vec 2013, BERT 2018) are the mathematical realization of this principle.
Vector Spaces: A Primer
A vector space is a mathematical structure where:
- Each point is represented by coordinates: v = [v₁, v₂, ..., vₙ]
- You can add vectors and multiply by scalars
- Distance and angle have meaning
Example in 2D:
Word "king" = [0.5, 0.8]
Word "queen" = [0.4, 0.7]
Word "man" = [0.3, 0.1]
Word "woman" = [0.2, 0.0]
Geometry shows: king - man + woman ≈ queen
This is the famous word analogy property!
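The analogy arithmetic can be checked directly on the toy 2D vectors above. This is a minimal sketch using those made-up coordinates, not real embeddings:

```python
import numpy as np

# Toy 2D "embeddings" from the example above (illustrative values only)
king, queen = np.array([0.5, 0.8]), np.array([0.4, 0.7])
man, woman = np.array([0.3, 0.1]), np.array([0.2, 0.0])

result = king - man + woman
print(result)  # lands on the "queen" vector [0.4, 0.7]
```

With real 768+ dimensional embeddings the result is only approximately equal to "queen", so in practice you take the nearest neighbor of the result vector.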
Why High Dimensions?
Real embeddings use 768-4096 dimensions. Why so many?
(This is one of those questions that seems simple but has a deep answer. The short version: we need enough dimensions to keep millions of different meanings separate. The long version involves manifold theory and will make your head hurt. We'll give you both.)
Curse of Dimensionality Paradox: In high dimensions:
- Most points are far from each other (good for distinguishing meanings)
- But angles become more meaningful than distances
- More capacity to encode subtle semantic distinctions
Information Content: Language has ~10⁵ common words × multiple senses = need high dimensions to keep them separated.
The Mathematical Justification for High Dimensionality
The necessity of high-dimensional embeddings can be rigorously understood through the Johnson-Lindenstrauss Lemma, which states that a set of points in high-dimensional space can be embedded into a lower-dimensional space while approximately preserving pairwise distances.
Formal Statement: For any 0 < ε < 1, a set of n points in ℝ^D can be embedded into ℝ^k where k = O(log(n)/ε²), such that all pairwise distances are preserved within a factor of (1 ± ε).
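The lemma is constructive: a random Gaussian projection already achieves the stated distortion. The sketch below (synthetic data, arbitrary choices of n, D, k) projects 1000-dimensional points down to 300 dimensions and measures how much pairwise distances change:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, k = 50, 1000, 300            # 50 points, project 1000 dims down to 300

X = rng.normal(size=(n, D))

# Random Gaussian projection, scaled so squared distances are preserved in expectation
R = rng.normal(size=(D, k)) / np.sqrt(k)
Y = X @ R

def pairwise(A):
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

mask = ~np.eye(n, dtype=bool)               # ignore zero self-distances
distortion = pairwise(Y)[mask] / pairwise(X)[mask]
print(distortion.min(), distortion.max())   # all ratios stay close to 1
```

The distortion shrinks as k grows, exactly as the k = O(log(n)/ε²) bound predicts.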
Implications for Embeddings:
- With n = 100,000 words and ε = 0.1 (10% error tolerance), the standard bound k ≥ 4 ln(n)/(ε²/2 − ε³/3) gives k ≈ 10,000 dimensions theoretically
- However, semantic structure has redundancy and lower intrinsic dimensionality
- Modern embeddings (768-1536 dims) represent a practical compromise between:
- Expressiveness: Enough dimensions to separate distinct meanings
- Computational efficiency: Small enough for fast similarity computation
- Statistical efficiency: Not so high that we need enormous training data
Intrinsic Dimensionality of Language
Research suggests that while embeddings use 768+ dimensions, the intrinsic dimensionality of semantic space is much lower (estimated 50-200 dimensions). This means:
- Manifold Hypothesis: Semantic meanings lie on a lower-dimensional manifold embedded in high-dimensional space
- Redundancy: Many dimensions encode similar information (entangled representations)
- Optimization: High dimensions make training easier (fewer local minima) even if not all are strictly necessary
Empirical Evidence:
# Principal Component Analysis on word2vec embeddings
# Typically shows that 95% of variance captured in ~100 principal components
# Yet we use 300 dims because:
# - Easier to train
# - Better generalization
# - Captures rare semantic distinctions
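The variance-concentration claim above can be reproduced on synthetic data. This is a sketch, not an experiment on real word2vec vectors: we plant a 50-dimensional latent structure inside 300 ambient dimensions and run PCA via SVD:

```python
import numpy as np

rng = np.random.default_rng(1)
n, intrinsic, ambient = 2000, 50, 300

# Synthetic "embeddings": a 50-dim latent structure mapped into 300 dims, plus noise
latent = rng.normal(size=(n, intrinsic))
mixing = rng.normal(size=(intrinsic, ambient))
X = latent @ mixing + 0.1 * rng.normal(size=(n, ambient))

# PCA via SVD of the centered data matrix
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
var_ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
n95 = int(np.searchsorted(var_ratio, 0.95)) + 1
print(n95)   # close to the intrinsic dimension (50), far below 300
```

On real embeddings the latent structure is not planted, but the same computation is how the intrinsic-dimensionality estimates cited above are obtained.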
The Geometry of Meaning: A Deeper Dive
Vector Space Axioms Applied to Semantics:
A vector space V over a field F (typically ℝ for embeddings) satisfies:
1. Closure under addition: v + w ∈ V for all v, w ∈ V
   - Semantic meaning: Combining concepts creates new concepts
   - Example: "king" + "crown" = "monarchy"
2. Associativity: (u + v) + w = u + (v + w)
   - Meaning composition is consistent regardless of grouping
3. Existence of zero vector: ∃ 0 ∈ V such that v + 0 = v
   - The "null meaning" or "no information" vector
4. Existence of additive inverse: For each v, ∃ -v such that v + (-v) = 0
   - Semantic opposites: "hot" + "cold" ≈ neutral
5. Scalar multiplication: α·v ∈ V for all α ∈ ℝ, v ∈ V
   - Intensity or magnitude of meaning
   - Example: 2·"happy" = "very happy", 0.5·"run" = "jog"
Why This Mathematical Structure Matters:
The vector space structure enables algebraic reasoning about meaning:
- We can "solve" for unknown concepts: "king" - "man" + "woman" = ?
- We can interpolate: 0.7·"walk" + 0.3·"run" ≈ "jog"
- We can find orthogonal (unrelated) concepts using the nullspace
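The interpolation example is just vector arithmetic. A minimal sketch with invented 2D coordinates for "walk" and "run" (not real embeddings):

```python
import numpy as np

walk, run = np.array([0.9, 0.1]), np.array([0.1, 0.9])  # toy 2D embeddings

# Interpolating between concepts: 0.7·"walk" + 0.3·"run"
jog = 0.7 * walk + 0.3 * run
print(jog)  # [0.66, 0.34] — a point between the two, closer to "walk"
```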
The Mathematics of Embeddings
1. Cosine Similarity (The Core Metric)
Why cosine, not Euclidean distance?
(This question comes up constantly. "Why not just use normal distance?" Because normal distance is fooled by vector magnitude. A 10-page essay and a 1-sentence summary might have identical meaning but very different vector magnitudes. Cosine similarity only cares about direction, not length. This makes it scale-invariant, which is exactly what you want for semantic similarity.)
Given two vectors u and v:
Euclidean Distance: d(u,v) = ||u - v|| = √(Σ(uᵢ - vᵢ)²)
Cosine Similarity: cos(θ) = (u·v)/(||u|| ||v||) = Σ(uᵢvᵢ)/√(Σuᵢ²)√(Σvᵢ²)
Cosine similarity ranges from -1 to 1:
- 1 = same direction (identical meaning)
- 0 = orthogonal (unrelated)
- -1 = opposite direction (antonyms)
Why cosine wins:
- Scale invariant: "good movie" and "really really good movie" should be similar despite different vector magnitudes
- Normalized: Always in [-1, 1], easy to interpret
- Angular: Captures semantic relationship independent of frequency
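Scale invariance is easy to demonstrate: scaling a vector leaves its cosine similarity to itself unchanged while its Euclidean distance grows. A minimal sketch with an arbitrary toy vector:

```python
import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v = np.array([0.2, 0.9, 0.4])
scaled = 5 * v                        # same direction, 5x the magnitude

print(round(cos_sim(v, scaled), 6))          # 1.0 — cosine ignores scale
print(round(np.linalg.norm(v - scaled), 2))  # ≈ 4.02 — Euclidean does not
```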
Geometric Intuition:
        u
       /
      / θ
     /_______ v

cos(θ) = how aligned the vectors are
Small θ → cos(θ) ≈ 1 → very similar
Large θ → cos(θ) ≈ 0 → unrelated
Mathematical Properties of Cosine Similarity
1. Relationship to Dot Product:
For normalized vectors (||u|| = ||v|| = 1), cosine similarity reduces to the dot product:
cos(θ) = u · v = Σᵢ uᵢvᵢ
This is why many vector databases normalize embeddings and use dot product for speed!
2. Metric Properties (or lack thereof):
Cosine similarity is NOT a metric: it is a similarity score, and even the usual conversion to a distance violates the triangle inequality:
d_cos(u,v) = 1 - cos(u,v) (cosine distance)
Or for a proper metric:
d_angular(u,v) = arccos(cos(u,v)) / π
This gives values in [0,1] and satisfies triangle inequality.
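A concrete triple shows the triangle-inequality failure: take u and v orthogonal, with w halfway between them. This sketch checks both distances on that triple:

```python
import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cos_dist(a, b):   # 1 - cosine similarity (not a metric)
    return 1.0 - cos_sim(a, b)

def ang_dist(a, b):   # arccos(cos)/pi (a proper metric)
    return np.arccos(np.clip(cos_sim(a, b), -1.0, 1.0)) / np.pi

u, v = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w = np.array([1.0, 1.0])   # halfway between u and v

# Cosine distance: going "through" w is shorter than going direct — violation!
print(cos_dist(u, w) + cos_dist(w, v), "<", cos_dist(u, v))   # ~0.586 < 1.0

# Angular distance: the triangle inequality holds (0.25 + 0.25 >= 0.5)
print(ang_dist(u, w) + ang_dist(w, v), ">=", ang_dist(u, v))
```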
3. Computational Optimization:
For large-scale retrieval:
import numpy as np
# Naive: O(nd) for n documents, d dimensions — one Python-level loop
similarities = [np.dot(query, doc) / (np.linalg.norm(query) * np.linalg.norm(doc))
                for doc in docs]
# Optimized with normalization + matrix multiplication: still O(nd), but one BLAS call
# Normalize once
docs_normalized = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_normalized = query / np.linalg.norm(query)
# Single matrix multiplication
similarities = np.dot(docs_normalized, query_normalized)
4. Why Cosine for Text?:
Theoretical justification from distributional semantics:
- Document vectors represent word co-occurrence statistics
- Longer documents have larger magnitude but same semantic content
- Cosine normalizes away document length, focusing on word distribution
Example:
Doc 1: "dog cat dog cat dog cat" → [3, 3]
Doc 2: "dog cat" → [1, 1]
Euclidean distance: ||[3,3] - [1,1]|| = 2.83 (seems different!)
Cosine similarity: [3,3]·[1,1] / (||[3,3]|| ||[1,1]||) = 6/(4.24 × 1.41) ≈ 1.0 (identical!)
Both documents have the same semantic content (50% dog, 50% cat), and cosine correctly identifies this.
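The numbers above check out directly in numpy:

```python
import numpy as np

# Term-count vectors for the two documents: [dog, cat]
d1, d2 = np.array([3.0, 3.0]), np.array([1.0, 1.0])

euclidean = np.linalg.norm(d1 - d2)
cosine = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(euclidean, 2), round(cosine, 2))  # 2.83 1.0
```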
Alternative Similarity Measures and When to Use Them
Euclidean Distance (L2):
d(u,v) = ||u - v|| = √(Σ(uᵢ - vᵢ)²)
- Use when: Magnitude matters (e.g., embedding dense entities with varying importance)
- RAG application: Less common, but useful for hierarchical embeddings
Manhattan Distance (L1):
d(u,v) = Σ|uᵢ - vᵢ|
- Use when: Sparse vectors, interpretable dimensions
- RAG application: TF-IDF vectors, bag-of-words
Dot Product (Inner Product):
u · v = Σ uᵢvᵢ
- Use when: Vectors are normalized OR magnitude encodes importance
- RAG application: Fast approximate nearest neighbor search (FAISS uses this)
Comparison Table:
Similarity Measure | Normalized? | Metric? | Speed | Best For
-------------------|-------------|---------|----------|------------------
Cosine | Yes | No* | Fast | Text embeddings
Euclidean | No | Yes | Fast | Dense vectors
Dot Product | No | No | Fastest | Pre-normalized
Manhattan | No | Yes | Fast | Sparse vectors
*Cosine distance (1 − cos) is not a true metric (it fails the triangle inequality); the angular distance arccos(cos)/π is.
2. How Embeddings Are Learned
Skip-gram (Word2Vec) - The Original Insight:
Objective: Predict context words from center word
Given sentence: "The quick brown fox jumps"
- Center: "brown"
- Context: ["quick", "fox"]
Neural Network:
Input: one-hot vector for "brown" [0,0,1,0,0,...]
↓
Hidden Layer (embedding): [0.2, -0.5, 0.8, ...] ← This is the embedding!
↓
Output: probability distribution over context words
Training: Adjust embeddings so that words appearing in similar contexts get similar vectors.
Mathematical Formulation:
Maximize: Σ log P(context | word)
Where: P(context | word) = exp(u_context · v_word) / Σ exp(u_i · v_word)
This is softmax - converts dot products into probabilities.
Detailed Training Objective:
For a corpus with vocabulary V and sequence of words w₁, w₂, ..., wₜ, the skip-gram objective is:
J(θ) = (1/T) Σₜ Σ₋c≤j≤c,j≠0 log P(wₜ₊ⱼ | wₜ)
where:
- T = total words in corpus
- c = context window size (typically 5-10)
- wₜ = target word at position t
- wₜ₊ⱼ = context word at offset j
P(wₒ | wᵢ) = exp(u_o^T v_i) / Σₖ₌₁^V exp(u_k^T v_i)
The Computational Challenge:
The softmax denominator requires summing over entire vocabulary V (typically 100k+ words), making this O(V) per training example.
Solution: Negative Sampling
Instead of computing full softmax, approximate with:
log σ(u_o^T v_i) + Σₖ₌₁^K 𝔼_k~P_n[log σ(-u_k^T v_i)]
where:
- σ(x) = 1/(1+e^-x) is sigmoid function
- K = number of negative samples (typically 5-20)
- P_n = noise distribution (usually unigram^(3/4))
Why This Works:
- True context word: Maximize σ(u_o^T v_i) → push u_o and v_i closer
- Random negative samples: Maximize σ(-u_k^T v_i) → push u_k and v_i apart
- Complexity: O(K) instead of O(V) - huge speedup!
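The negative-sampling objective for a single (center, context) pair can be sketched in a few lines. Everything here is synthetic: random toy vectors stand in for learned embeddings, and we compute the loss for one training example rather than run full training:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(42)
dim, K = 50, 5

v_center = rng.normal(scale=0.1, size=dim)          # v_i: center-word embedding
u_context = rng.normal(scale=0.1, size=dim)         # u_o: true context word
u_negatives = rng.normal(scale=0.1, size=(K, dim))  # K words drawn from the noise dist

# Negative-sampling loss: reward the true pair, penalize the K random pairs.
# Only K+1 dot products — O(K) instead of the O(V) full softmax.
loss = -(np.log(sigmoid(u_context @ v_center))
         + np.log(sigmoid(-(u_negatives @ v_center))).sum())
print(loss)
```

Gradient descent on this loss pushes u_o toward v_i and the u_k away from it, exactly as described above.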
The Two Embedding Matrices:
Word2Vec actually learns TWO embeddings per word:
- v_w: Word as center (when predicting context)
- u_w: Word as context (when being predicted)
Final embedding typically averages or concatenates these, giving richer representation.
Skip-gram vs. CBOW (Continuous Bag of Words)
CBOW: Inverse of skip-gram - predict center word from context
Skip-gram: center → context words
CBOW: context words → center
Training Signal:
Skip-gram: "The quick [brown] fox jumps" → predict "quick", "fox"
CBOW: "The quick [?] fox jumps" → predict "brown" from context
When to use each:
- Skip-gram: Better for small datasets, rare words, captures more nuanced semantics
- CBOW: Faster training, better for frequent words, smoother embeddings
Mathematical Difference:
Skip-gram: P(context | center) = Πᵢ P(wᵢ | w_center)
CBOW: P(center | context) = P(w_center | avg(context_words))
GloVe (Global Vectors) - The Matrix Factorization View
Key Insight: Embeddings can also be learned by factorizing word co-occurrence matrices.
Objective:
J = Σᵢ,ⱼ f(Xᵢⱼ)(wᵢ^T w̃ⱼ + bᵢ + b̃ⱼ - log Xᵢⱼ)²
where:
- Xᵢⱼ = number of times word j appears in context of word i
- wᵢ, w̃ⱼ = word embeddings
- bᵢ, b̃ⱼ = bias terms
- f(x) = weighting function (gives less weight to very frequent/rare pairs)
Weighting Function:
f(x) = (x/x_max)^α if x < x_max, else 1
This prevents overfitting to very frequent co-occurrences like "the the".
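The weighting function is a one-liner; `x_max = 100` and `α = 0.75` are the defaults reported in the GloVe paper:

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: down-weights rare co-occurrence pairs, caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

print(glove_weight(1))     # rare pair -> small weight (~0.03)
print(glove_weight(50))    # moderately frequent -> ~0.59
print(glove_weight(500))   # very frequent pair -> capped at 1.0
```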
Why GloVe Matters:
- Bridges word2vec (local context) and LSA (global statistics)
- Often produces slightly better analogies than word2vec
- More interpretable (directly models co-occurrence)
Transformer Embeddings (BERT, GPT) - The Modern Approach:
Instead of static embeddings, transformers create contextualized embeddings:
"I went to the bank to deposit money" → bank₁ = [0.1, 0.8, ...] "I sat by the river bank" → bank₂ = [0.5, 0.2, ...]
How? Attention mechanism (we'll cover next) that looks at surrounding words.
The Attention Mechanism (Intuition)
Problem with Static Embeddings: "bank" always gets same vector regardless of context.
Solution: Compute embedding as weighted combination of all words in sentence.
Simplified Attention Formula:
For word wᵢ in sentence w₁, w₂, ..., wₙ:
1. Compute attention scores: αᵢⱼ = score(wᵢ, wⱼ)
2. Normalize: âᵢⱼ = softmax(αᵢⱼ) = exp(αᵢⱼ) / Σₖ exp(αᵢₖ)
3. Contextualized embedding: hᵢ = Σⱼ âᵢⱼ · v_wⱼ
where v_wⱼ is the initial embedding of word j
Example:
Sentence: "The bank by the river"
For "bank":
- High attention to: "river" (0.4), "the" (before bank, 0.3)
- Low attention to: "The" (start, 0.1), "by" (0.1), "the" (after, 0.1)
Final embedding: 0.4·v_river + 0.3·v_the + 0.1·v_The + 0.1·v_by + 0.1·v_the
→ Heavily influenced by "river", so the contextualized vector encodes the "riverside" sense rather than "financial institution"
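The three steps of the simplified formula can be sketched directly. Random toy vectors stand in for real embeddings, and dot products serve as the score function:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
words = ["The", "bank", "by", "the", "river"]
V = rng.normal(size=(len(words), 8))   # toy initial embeddings, dim 8

i = words.index("bank")                # contextualize "bank"
scores = V @ V[i]                      # step 1: attention scores (dot-product score fn)
weights = softmax(scores)              # step 2: normalize to a distribution
h_bank = weights @ V                   # step 3: weighted sum of all word embeddings
print(weights.round(2), h_bank.shape)
```

With trained embeddings, the weight on "river" would dominate and pull h_bank toward the riverside sense.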
Self-Attention in Transformers:
The actual mechanism is more sophisticated (Query-Key-Value):
For each word wᵢ:
- Query: Qᵢ = wᵢ · W_Q (what am I looking for?)
- Key: Kⱼ = wⱼ · W_K (what do I contain?)
- Value: Vⱼ = wⱼ · W_V (what do I communicate?)
Attention(Q,K,V) = softmax(QK^T / √d_k) V
where:
- d_k = dimension of key vectors
- Division by √d_k keeps dot products from growing with dimension, which would saturate the softmax and make its gradients vanish
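A single self-attention head is a few matrix multiplications. This sketch uses random projection matrices in place of learned weights, with arbitrary toy sizes:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8             # 5 tokens, model dim 16, key/value dim 8

X = rng.normal(size=(n, d_model))      # token embeddings
W_Q = rng.normal(size=(d_model, d_k))  # learned in practice; random here
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
attn = softmax(Q @ K.T / np.sqrt(d_k))  # scaled dot-product attention weights
out = attn @ V                          # contextualized representations
print(attn.shape, out.shape)            # (5, 5) (5, 8)
```

Each row of `attn` is a probability distribution: how much token i attends to every other token.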
Why This Works:
- Different words attend to different contexts
- Multi-head attention captures different relationships (syntax, semantics, etc.)
- Stacked layers build increasingly abstract representations
3. Properties of Good Embeddings
Linearity: Semantic relationships are linear transformations
king - man + woman ≈ queen
Paris - France + Italy ≈ Rome
Clustering: Similar concepts cluster together
Fruits: [apple, orange, banana] are close in space
Animals: [dog, cat, lion] form another cluster
Dimensionality: Each dimension captures a semantic feature
- Dimension 42 might encode "royalty"
- Dimension 108 might encode "gender"
- (Though in practice, dimensions are entangled)
Embedding Quality Metrics
Intrinsic Evaluation:
1. Word Similarity: Correlation with human judgments (WordSim-353 dataset)
2. Analogy Accuracy: "man:woman :: king:?" → should predict "queen"
Extrinsic Evaluation:
Downstream Task Performance: How well does retrieval work?
Practical Implications for RAG
Why this matters for your RAG system:
1. Chunk Size: Larger chunks → more diverse content → lower quality embeddings
   - Sweet spot: 200-1000 tokens per chunk
   - Each chunk should have coherent semantic content
2. Query Expansion: If the query is short, expand it before embedding
   - Short query: "RAG" → poor embedding
   - Expanded: "What is retrieval-augmented generation?" → better
3. Embedding Model Choice:
   - General models (OpenAI): Good for diverse content
   - Domain-specific: Train on your corpus for a 10-20% improvement
4. Similarity Threshold: Not all cosine scores are created equal
   - 0.9+ : Very similar (same topic, same phrasing)
   - 0.7-0.9 : Similar (same topic, different phrasing)
   - 0.5-0.7 : Related (adjacent topics)
   - <0.5 : Probably not relevant
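Putting the threshold idea into practice: a minimal retrieval sketch that normalizes once, scores with a single matrix multiplication, and keeps only hits above a cutoff. The synthetic data and the 0.7 threshold are illustrative choices, not tuned values:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, threshold=0.7, top_k=5):
    """Return indices of docs whose cosine similarity clears the threshold, best first."""
    docs_n = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q_n = query_vec / np.linalg.norm(query_vec)
    sims = docs_n @ q_n                       # one matrix-vector product for all docs
    keep = np.where(sims >= threshold)[0]     # drop "probably not relevant" hits
    return keep[np.argsort(-sims[keep])][:top_k]

rng = np.random.default_rng(3)
docs = rng.normal(size=(100, 64))             # 100 synthetic document embeddings
query = docs[7] + 0.1 * rng.normal(size=64)   # a query close to document 7
print(retrieve(query, docs))                  # index 7 comes back first
```

In production the threshold should be calibrated per embedding model, since different models produce differently distributed similarity scores.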