A team just cracked one of the hardest problems in software engineering: searching 100 million lines of code in under 200ms. They made search more than 10x faster while using 8x less memory and keeping 99.9% accuracy. Here's the story of how they did it.
Imagine you're working on a 100-million-line codebase. You need to find similar functions, understand how something works, or see if someone already solved your problem. Traditional search sucks—it only matches keywords, not meaning.
Modern AI can understand code semantically using embeddings (vectors that capture meaning), but there's a catch: searching 100M vectors takes forever and eats massive amounts of memory.
Let's do the brutal math. Each line of code becomes a 768-dimensional vector of 32-bit floats:

100M vectors × 768 dims × 4 bytes ≈ 307 GB of embeddings

Every search compares your query to all vectors:

100M vectors × 768 multiply-adds ≈ 77 billion operations per query

At 1 GFLOP/s, that's 77 seconds per search. Nobody's waiting that long to find a function.
import numpy as np

def naive_search(db, q, k=10):
    scores = db @ q                       # 100M dot products, ~77 GFLOPs
    return np.argsort(scores)[-k:][::-1]  # top-k indices... ~77 seconds
Instead of storing exact vectors, we compress them. Think of it like this: instead of storing the exact GPS coordinates of every house, you store which neighborhood each house is in.
The key insight: you don't need perfect precision to find similar code. You just need to be close enough.
Here's the insight that changed everything. Instead of compressing a 768-dimensional vector as one piece, split it into m = 8 subvectors and quantize each one separately:

$$x = \left[x^{(1)}, x^{(2)}, \dots, x^{(8)}\right], \qquad x^{(j)} \in \mathbb{R}^{96}$$

Each subvector gets compressed to 1 byte (an index into a codebook of 256 codes). The compression ratio is dramatic:

768 dims × 4 bytes = 3,072 bytes per vector → 8 bytes per vector, a 384x reduction
import numpy as np

def compress(x, codebooks, m=8):
    # Replace each subvector with the index of its nearest codeword (1 byte each)
    chunks = x.reshape(m, -1)
    return np.array([np.argmin(np.linalg.norm(cb - c, axis=1))
                     for cb, c in zip(codebooks, chunks)], dtype=np.uint8)  # 8 bytes

def search(q, codes, codebooks):
    # Still a linear scan, but over 8-byte codes instead of 3 KB vectors (384x smaller)
    return [np.dot(q, np.concatenate([cb[c] for cb, c in zip(codebooks, code)])) for code in codes]
But we're not done. The real magic happens with Asymmetric Distance Computation (ADC). Instead of computing distances on-the-fly, precompute a lookup table of distances from each query subvector to every codeword. This drops the per-query work from $O(N \cdot D)$ multiply-adds to $O(m \cdot k)$ table entries plus $m$ lookups per database vector. For our case: 8 × 256 = 2,048 distance calculations once, then just lookups.
import numpy as np

def precompute(q, codebooks, m=8, k=256):
    # Squared distance from each query subvector to every codeword: an (8, 256) table
    chunks = q.reshape(m, -1)
    return np.array([[np.sum((chunk - codebooks[j][c]) ** 2) for c in range(k)]
                     for j, chunk in enumerate(chunks)])

def fast_search(table, db):
    # Distance to a compressed vector is just m table lookups, summed
    return [sum(table[j][code] for j, code in enumerate(v)) for v in db]
384x compression is great, but you're still searching 100M vectors. The Inverted File (IVF) trick adds coarse quantization. First, cluster the vectors into nlist = 4,096 groups with k-means. During search, only probe the nprobe nearest clusters. This shrinks the search space from N to roughly (nprobe / nlist) × N: with nprobe = 8, that's 100M / 512 ≈ 200K candidates.
import numpy as np

def ivf_search(q, centroids, lists, nprobe=8):
    # Find the nprobe clusters whose centroids are closest to the query
    dists = np.linalg.norm(centroids - q, axis=1)
    top_clusters = np.argsort(dists)[:nprobe]
    # Search only the vectors assigned to those clusters
    candidates = [v for i in top_clusters for v in lists[i]]
    return pq_search(q, candidates)  # ~512x smaller search space
Combining Product Quantization + Inverted Index gives you the best of both worlds:
Metric | Before | After | Improvement |
---|---|---|---|
Search Time | 2+ seconds | <200ms | 10x faster |
Memory Usage | 2GB | 250MB | 8x less |
Accuracy | 100% | 99.9% | Barely any loss |
Fast code search isn't just about convenience; it changes how you work.
Here's how all the pieces fit together in production:
class SearchEngine:
def __init__(self):
self.ivf = IVF(nlist=4096)
self.pq = PQ(m=8, k=256)
self.exact = ExactIndex() # For new code
def search(self, q, k=10):
approx = self.ivf.search(q, nprobe=8)
exact = self.exact.search(q)
return merge_results(approx + exact, k)
def add(self, embedding, meta):
self.exact.add(embedding, meta)
if self.exact.size() > 10000:
self.rebuild()
The result? Developers can search 100 million lines of code as fast as they can type. That's the kind of performance that changes everything.
The magic happens during training. For Product Quantization, you need to learn $m$ codebooks, each with $k$ codewords (here $m = 8$ and $k = 256$). This is done via k-means clustering on each subvector subspace, minimizing the total quantization error:

$$\min_{C_1, \dots, C_m} \; \sum_i \sum_{j=1}^{m} \left\| x_i^{(j)} - q_j\!\left(x_i^{(j)}\right) \right\|^2$$

where $q_j(x_i^{(j)})$ is the quantization of subvector $j$ from vector $x_i$. The training alternates between an assignment step (each subvector goes to its nearest codeword) and an update step (each codeword becomes the mean of its assigned subvectors):
import numpy as np

def encode(vectors, codebooks):
    # Assign every subvector to its nearest codeword; returns an (n, m) array of codes
    m, k, sub_d = codebooks.shape
    chunks = vectors.reshape(len(vectors), m, sub_d)
    return np.stack([np.argmin(((chunks[:, j, None, :] - codebooks[j]) ** 2).sum(-1), axis=1)
                     for j in range(m)], axis=1)

def train_pq(vectors, m=8, k=256, iters=20):
    d = vectors.shape[1]
    codebooks = np.random.randn(m, k, d // m)  # random init; k-means refines it below
    for _ in range(iters):
        # Assignment step
        codes = encode(vectors, codebooks)
        # Update step
        for j in range(m):
            for c in range(k):
                mask = codes[:, j] == c
                if mask.sum() > 0:
                    codebooks[j, c] = vectors[mask, j*d//m:(j+1)*d//m].mean(0)
    return codebooks
How much accuracy do we lose? The quantization error for Product Quantization decomposes across subvectors:

$$\mathbb{E}\left[\left\| x - q(x) \right\|^2\right] = \sum_{j=1}^{m} \mathbb{E}\left[\left\| x^{(j)} - q_j\!\left(x^{(j)}\right) \right\|^2\right]$$

Each subvector's quantization error decreases roughly as $k^{-2/d'}$, where $d' = D/m$ is the subvector dimension. This is why splitting into subvectors helps: the curse of dimensionality is less severe in lower dimensions.
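To make this concrete, here's a minimal sketch (synthetic random data plus the train_pq and encode helpers above, so the absolute numbers are only illustrative) that measures how the reconstruction error changes with the number of subvectors:

import numpy as np

def decode(codes, codebooks):
    # Reconstruct each vector by concatenating its selected codewords
    return np.concatenate([codebooks[j][codes[:, j]] for j in range(codes.shape[1])], axis=1)

rng = np.random.default_rng(0)
sample = rng.standard_normal((2000, 768)).astype(np.float32)

for m in (4, 8, 16):
    cb = train_pq(sample, m=m, k=256, iters=10)
    err = np.mean(np.sum((sample - decode(encode(sample, cb), cb)) ** 2, axis=1))
    print(f"m={m:2d}  mean squared quantization error: {err:.1f}")

More subvectors should drive the error down, at the cost of more bytes per code.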
Standard PQ assumes the subvectors are independent, but code embeddings have correlations across dimensions. Optimized PQ learns an orthogonal rotation matrix $R$ that decorrelates the dimensions before quantizing:

$$\min_{R,\, C_1, \dots, C_m} \; \sum_i \left\| R x_i - q(R x_i) \right\|^2 \quad \text{s.t. } R^\top R = I$$

The rotation is learned by alternating between updating the codebooks (plain PQ on the rotated vectors) and solving the orthogonal Procrustes problem for $R$ via an SVD:
def train_opq(vectors, m=8, k=256, opq_iters=5):
R = np.eye(vectors.shape[1]) # Identity initially
for _ in range(opq_iters):
# Rotate vectors
rotated = vectors @ R.T
# Train PQ on rotated vectors
codebooks = train_pq(rotated, m, k)
# Reconstruct and solve Procrustes
codes = encode(rotated, codebooks)
reconstructed = decode(codes, codebooks)
U, _, Vt = np.linalg.svd(vectors.T @ reconstructed)
R = Vt.T @ U.T
return R, codebooks
The Inverted File index adds a coarse quantization layer. Vectors are first clustered into $n_{list}$ groups using k-means. The optimal number of lists balances coarse-search cost against list length; a common rule of thumb is

$$n_{list} \approx \sqrt{N}$$

For 100M vectors, this suggests ~10K lists. Each vector is then stored in its cluster's inverted list as a (vector id, PQ code) pair, where the PQ code quantizes the residual $x - c_i$ between the vector and its cluster centroid:
def build_ivf(vectors, nlist=4096, m=8, k=256):
# Coarse quantization
centroids = kmeans(vectors, nlist)
assignments = assign_to_centroids(vectors, centroids)
# Compute residuals
residuals = vectors - centroids[assignments]
# Train PQ on residuals
pq_codebooks = train_pq(residuals, m, k)
pq_codes = encode(residuals, pq_codebooks)
# Build inverted lists
lists = [[] for _ in range(nlist)]
for i, (assignment, code) in enumerate(zip(assignments, pq_codes)):
lists[assignment].append((i, code))
return centroids, pq_codebooks, lists
The complete search algorithm combines coarse search, fine search, and reranking, scoring candidates with

$$d(q, x) \approx \|q - c_i\| + \sum_{j=1}^{m} d_j\!\left((q - c_i)^{(j)},\, \mathrm{code}_j(x)\right)$$

The first term is the coarse distance to the probed centroid $c_i$; the second is the fine PQ distance on residuals, read out of the ADC lookup table.
def search_ivf_pq(q, centroids, pq_codebooks, lists, nprobe=8, k=10):
# 1. Coarse search: find nearest clusters
coarse_dists = [np.linalg.norm(q - c) for c in centroids]
probe_lists = np.argsort(coarse_dists)[:nprobe]
# 2. Fine search: ADC on residuals
candidates = []
for list_id in probe_lists:
centroid = centroids[list_id]
q_residual = q - centroid
# Precompute distance table for this residual
dist_table = compute_distance_table(q_residual, pq_codebooks)
# Score all vectors in this list
for vec_id, pq_code in lists[list_id]:
pq_dist = sum(dist_table[j, pq_code[j]] for j in range(len(pq_code)))
total_dist = coarse_dists[list_id] + pq_dist
candidates.append((vec_id, total_dist))
# 3. Return top-k
candidates.sort(key=lambda x: x[1])
return [vec_id for vec_id, _ in candidates[:k]]
In production, memory layout matters. The system stores the 8-byte PQ code for every vector (~800 MB), the vector ids in the inverted lists (a few hundred MB), the 4,096 coarse centroids (~13 MB), and the PQ codebooks (under 1 MB). Total: ~1.1GB vs 307GB for exact vectors, a roughly 280x reduction.
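As a sanity check, here's the back-of-the-envelope arithmetic (a sketch; the exact total depends on how ids are packed into the inverted lists):

N, D, m = 100_000_000, 768, 8
nlist, k = 4096, 256

sizes = {
    "exact float32 vectors": N * D * 4,        # ~307 GB
    "PQ codes (1 byte per subvector)": N * m,  # ~0.8 GB
    "vector ids (4 bytes each)": N * 4,        # ~0.4 GB
    "coarse centroids": nlist * D * 4,         # ~13 MB
    "PQ codebooks": m * k * (D // m) * 4,      # <1 MB
}
for name, size in sizes.items():
    print(f"{name:32s} {size / 1e9:9.3f} GB")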
The key parameters and their effects:
Parameter | Effect on Speed | Effect on Accuracy | Typical Values |
---|---|---|---|
nlist (coarse clusters) | More lists = faster coarse search | Too many = worse clustering | 1K-16K |
nprobe (clusters probed) | Fewer probes = faster | More probes = better recall | 4-32 |
m (subvectors per vector) | More subvectors = slower | More subvectors = better | 4-16 |
k (codewords per codebook) | Larger codebooks = slower training | Larger codebooks = better | 256-1024 |
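For a sense of scale, these same knobs map directly onto an off-the-shelf library like FAISS; the snippet below is only an illustration of the parameter choices, not part of the system described above:

import numpy as np
import faiss

d = 768
xb = np.random.rand(200_000, d).astype("float32")  # stand-in for real code embeddings
xq = np.random.rand(5, d).astype("float32")

# "IVF4096,PQ8": 4,096 coarse lists, 8 subquantizers with 256 codes each
index = faiss.index_factory(d, "IVF4096,PQ8")
index.train(xb)      # learns the coarse centroids and PQ codebooks
index.add(xb)
index.nprobe = 8     # clusters probed per query
distances, ids = index.search(xq, 10)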
Production systems need to handle code changes in real-time. The solution is a hybrid approach:
import time

class HybridIndex:
def __init__(self):
self.main_index = IVFPQIndex()
self.delta_index = ExactIndex()
self.last_rebuild = time.time()
def search(self, q, k=10):
main_results = self.main_index.search(q, k*2)
delta_results = self.delta_index.search(q, k)
# Merge and deduplicate
all_results = merge_results(main_results, delta_results)
return rerank_and_select(all_results, k)
def add_vector(self, vec_id, embedding):
self.delta_index.add(vec_id, embedding)
# Trigger rebuild if delta gets too large
if self.delta_index.size() > 50000:
self.rebuild_main_index()
def rebuild_main_index(self):
# Merge delta into main, rebuild IVF+PQ
all_vectors = self.main_index.get_all() + self.delta_index.get_all()
self.main_index = IVFPQIndex(all_vectors)
self.delta_index.clear()
self.last_rebuild = time.time()
Several advanced techniques push performance even further:
Instead of storing full PQ codes, use multiple hash tables indexed by subsets of codes:
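As an illustration (helper names are hypothetical, not from the system above), bucket each vector by the two halves of its 8-byte PQ code, so candidates can be fetched by an exact match on either half instead of scanning every code:

from collections import defaultdict

def build_code_tables(pq_codes):
    # Two hash tables: one keyed by the first 4 subvector codes, one by the last 4
    tables = (defaultdict(list), defaultdict(list))
    for vec_id, code in enumerate(pq_codes):
        tables[0][tuple(code[:4])].append(vec_id)
        tables[1][tuple(code[4:])].append(vec_id)
    return tables

def probe_code_tables(tables, query_code):
    # Candidates are the vectors sharing either half of the query's quantized code
    return set(tables[0][tuple(query_code[:4])]) | set(tables[1][tuple(query_code[4:])])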
Learn multiple quantizers and use the best one for each vector:
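A minimal sketch of that idea (hypothetical helpers, reusing compress from earlier): train several sets of codebooks, then for each vector keep the code from whichever quantizer reconstructs it best, along with a small id recording which one was used:

import numpy as np

def assign_best_quantizer(vectors, quantizers):
    # quantizers: a list of codebook arrays, each shaped (m, k, d // m)
    best = []
    for x in vectors:
        options = []
        for q_id, codebooks in enumerate(quantizers):
            code = compress(x, codebooks)  # from the PQ sketch above
            rec = np.concatenate([cb[c] for cb, c in zip(codebooks, code)])
            options.append((np.sum((x - rec) ** 2), q_id, code))
        err, q_id, code = min(options, key=lambda t: t[0])
        best.append((q_id, code))  # which quantizer, plus the code itself
    return best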
The distance table computation and candidate scoring parallelize perfectly on GPU:
from numba import cuda

@cuda.jit
def compute_distances_gpu(query_chunks, codebooks, distance_table):
    # One thread per (subvector, codeword) pair fills one entry of the ADC table
    i, j = cuda.grid(2)
    if i < query_chunks.shape[0] and j < codebooks.shape[1]:
        dist = 0.0
        for k in range(codebooks.shape[2]):
            diff = query_chunks[i, k] - codebooks[i, j, k]
            dist += diff * diff
        distance_table[i, j] = dist
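For completeness, here's how such a kernel might be launched with Numba (array names and sizes are illustrative; Numba copies the host arrays to and from the device automatically):

import numpy as np
from numba import cuda

m, k, sub_d = 8, 256, 96
query_chunks = np.random.rand(m, sub_d).astype(np.float32)
codebooks = np.random.rand(m, k, sub_d).astype(np.float32)
distance_table = np.zeros((m, k), dtype=np.float32)

threads_per_block = (8, 32)
blocks = ((m + 7) // 8, (k + 31) // 32)
compute_distances_gpu[blocks, threads_per_block](query_chunks, codebooks, distance_table)
# distance_table now holds the (8, 256) lookup table used for ADC scoring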