← Back

How to Search 100M Lines of Code in 200ms

August 24, 2025

A team just cracked one of the hardest problems in software engineering: searching 100 million lines of code in under 200ms. They made search 40% faster while using 8x less memory and maintaining 99.9% accuracy. Here's the story of how they did it.

The Problem

Imagine you're working on a 100-million-line codebase. You need to find similar functions, understand how something works, or see if someone already solved your problem. Traditional search sucks—it only matches keywords, not meaning.

Modern AI can understand code semantically using embeddings (vectors that capture meaning), but there's a catch: searching 100M vectors takes forever and eats massive amounts of memory.

The Math That Breaks Everything

Let's do the brutal math. Each line of code becomes a 768-dimensional vector:

Memory=N×d×4=108×768×4=307.2 GB\text{Memory} = N \times d \times 4 = 10^8 \times 768 \times 4 = 307.2 \text{ GB}

Every search compares your query to all NN vectors:

Operations=N×d=108×768=7.68×1010\text{Operations} = N \times d = 10^8 \times 768 = 7.68 \times 10^{10}

At 1 GFLOP/s, that's 77 seconds per search. Nobody's waiting that long to find a function.

Python
def naive_search(db, q):
    scores = [dot(q, v) for v in db]  # 100M ops!
    return sorted(scores)[-10:]       # 77 seconds

The Solution: Quantized Search

Instead of storing exact vectors, we compress them. Think of it like this: instead of storing the exact GPS coordinates of every house, you store which neighborhood each house is in.

The key insight: you don't need perfect precision to find similar code. You just need to be close enough.

The Breakthrough: Product Quantization

Here's the insight that changed everything. Instead of compressing a 768-dimensional vector as one piece, split it into mm subvectors and quantize each separately:

x=[x1,x2,,xm]where xjRd/m\mathbf{x} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m] \quad \text{where } \mathbf{x}_j \in \mathbb{R}^{d/m}

Each subvector gets compressed to 1 byte (256 codes). The compression ratio is devastating:

Compression=768×48×1=384x\text{Compression} = \frac{768 \times 4}{8 \times 1} = 384\text{x}
Python
def compress(x, m=8):
    chunks = x.reshape(m, -1)
    return [find_code(chunk) for chunk in chunks]  # 8 bytes

def search(q, db):
    return [sim(q, v) for v in db]  # 384x faster

The Speed Multiplier: Asymmetric Distance

But we're not done. The real magic happens with Asymmetric Distance Computation (ADC). Instead of computing distances on-the-fly, precompute a lookup table:

dADC(q,PQ(x))=j=1mqjcqj(xj)22d_{ADC}(\mathbf{q}, PQ(\mathbf{x})) = \sum_{j=1}^m \|\mathbf{q}_j - \mathbf{c}_{q_j(\mathbf{x}_j)}\|_2^2

This reduces complexity from O(Nd)O(Nd) to O(mk+N)O(mk + N). For our case: 8 × 256 = 2,048 calculations once, then just lookups.

Python
def precompute(q, m=8, k=256):
    chunks = q.reshape(m, -1)
    return [[dist(chunk, code) for code in range(k)] 
            for chunk in chunks]  # (8, 256)

def fast_search(table, db):
    return [sum(table[i][code] for i, code in enumerate(v)) 
            for v in db]  # Just lookups!

The Final Optimization: Inverted File Index

384x compression is great, but you're still searching 100M vectors. The Inverted File (IVF) trick adds coarse quantization. First, cluster vectors into nlistn_{list} groups:

c(x)=argmini=1,,nlistxμi22c(\mathbf{x}) = \arg\min_{i=1,\ldots,n_{list}} \|\mathbf{x} - \boldsymbol{\mu}_i\|_2^2

During search, only probe nproben_{probe} nearest clusters. This reduces search space from NN to Nnprobe/nlistN \cdot n_{probe} / n_{list}.

Python
def ivf_search(q, centroids, lists, nprobe=8):
    # Find closest clusters
    dists = [dist(q, c) for c in centroids]
    top_clusters = sorted(range(len(dists)), key=lambda i: dists[i])[:nprobe]
    
    # Search only those clusters
    candidates = [v for i in top_clusters for v in lists[i]]
    return pq_search(q, candidates)  # 512x smaller search space

The Results

Combining Product Quantization + Inverted Index gives you the best of both worlds:

MetricBeforeAfterImprovement
Search Time2+ seconds<200ms10x faster
Memory Usage2GB250MB8x less
Accuracy100%99.9%Barely any loss

Why This Matters

Fast code search isn't just about convenience—it changes how you work:

The Complete System

Here's how all the pieces fit together in production:

Python
class SearchEngine:
    def __init__(self):
        self.ivf = IVF(nlist=4096)
        self.pq = PQ(m=8, k=256)
        self.exact = ExactIndex()  # For new code
    
    def search(self, q, k=10):
        approx = self.ivf.search(q, nprobe=8)
        exact = self.exact.search(q)
        return merge_results(approx + exact, k)
    
    def add(self, embedding, meta):
        self.exact.add(embedding, meta)
        if self.exact.size() > 10000:
            self.rebuild()

Key Takeaways

  1. Compression works: 384x smaller with minimal accuracy loss
  2. Clustering matters: Don't search everything, search smart
  3. Precomputation is key: Do expensive work once, reuse it
  4. Handle edge cases: Have exact search for new/changing code

The result? Developers can search 100 million lines of code as fast as they can type. That's the kind of performance that changes everything.

Deep Dive: Training the Quantizers

The magic happens during training. For Product Quantization, you need to learn mm codebooks, each with kk codewords. This is done via k-means clustering on each subvector dimension:

min{Cj}i=1Nj=1mxi,jqj(xi,j)22\min_{\{\mathcal{C}_j\}} \sum_{i=1}^N \sum_{j=1}^m \|\mathbf{x}_{i,j} - q_j(\mathbf{x}_{i,j})\|_2^2

Where qj(xi,j)q_j(\mathbf{x}_{i,j}) is the quantization of subvector jjfrom vector ii. The training alternates between:

  1. Assignment: Assign each subvector to nearest codeword
  2. Update: Recompute codewords as centroids of assigned subvectors
Python
def train_pq(vectors, m=8, k=256, iters=20):
    d = vectors.shape[1]
    codebooks = np.random.randn(m, k, d//m)
    
    for _ in range(iters):
        # Assignment step
        codes = encode(vectors, codebooks)
        
        # Update step  
        for j in range(m):
            for c in range(k):
                mask = codes[:, j] == c
                if mask.sum() > 0:
                    codebooks[j, c] = vectors[mask, j*d//m:(j+1)*d//m].mean(0)
    
    return codebooks

The Quantization Error Analysis

How much accuracy do we lose? The quantization error for Product Quantization is bounded by:

E[xPQ(x)22]j=1mE[xjqj(xj)22]\mathbb{E}[\|\mathbf{x} - PQ(\mathbf{x})\|_2^2] \leq \sum_{j=1}^m \mathbb{E}[\|\mathbf{x}_j - q_j(\mathbf{x}_j)\|_2^2]

Each subvector's quantization error decreases as O(k2/dj)O(k^{-2/d_j}) wheredj=d/md_j = d/m is the subvector dimension. This is why splitting into subvectors helps— the curse of dimensionality is less severe in lower dimensions.

Optimized Product Quantization (OPQ)

Standard PQ assumes subvectors are independent, but code embeddings have correlations. Optimized PQ learns a rotation matrix R\mathbf{R} to decorrelate dimensions:

minR,{Cj}i=1NxiPQ(Rxi)22\min_{\mathbf{R}, \{\mathcal{C}_j\}} \sum_{i=1}^N \|\mathbf{x}_i - PQ(\mathbf{R}\mathbf{x}_i)\|_2^2

The rotation is learned by alternating between updating codebooks and solving the Procrustes problem:

R=argminRXYRF2\mathbf{R}^* = \arg\min_{\mathbf{R}} \|\mathbf{X} - \mathbf{Y}\mathbf{R}\|_F^2
Python
def train_opq(vectors, m=8, k=256, opq_iters=5):
    R = np.eye(vectors.shape[1])  # Identity initially
    
    for _ in range(opq_iters):
        # Rotate vectors
        rotated = vectors @ R.T
        
        # Train PQ on rotated vectors
        codebooks = train_pq(rotated, m, k)
        
        # Reconstruct and solve Procrustes
        codes = encode(rotated, codebooks)
        reconstructed = decode(codes, codebooks)
        
        U, _, Vt = np.linalg.svd(vectors.T @ reconstructed)
        R = Vt.T @ U.T
    
    return R, codebooks

IVF Index Construction

The Inverted File index adds a coarse quantization layer. Vectors are first clustered intonlistn_{list} groups using k-means. The optimal number of lists balances search speed vs accuracy:

nlistNn_{list}^* \approx \sqrt{N}

For 100M vectors, this suggests ~10K lists. Each vector is then stored as:

  1. Coarse assignment: Which cluster it belongs to
  2. Residual vector: xμc(x)\mathbf{x} - \boldsymbol{\mu}_{c(\mathbf{x})}
  3. PQ codes: Quantized residual
Python
def build_ivf(vectors, nlist=4096, m=8, k=256):
    # Coarse quantization
    centroids = kmeans(vectors, nlist)
    assignments = assign_to_centroids(vectors, centroids)
    
    # Compute residuals
    residuals = vectors - centroids[assignments]
    
    # Train PQ on residuals
    pq_codebooks = train_pq(residuals, m, k)
    pq_codes = encode(residuals, pq_codebooks)
    
    # Build inverted lists
    lists = [[] for _ in range(nlist)]
    for i, (assignment, code) in enumerate(zip(assignments, pq_codes)):
        lists[assignment].append((i, code))
    
    return centroids, pq_codebooks, lists

Search Algorithm Details

The complete search algorithm combines coarse search, fine search, and reranking:

Score(q,x)=qμc(x)22+dADC(qμc(x),PQ(xμc(x)))\text{Score}(\mathbf{q}, \mathbf{x}) = \|\mathbf{q} - \boldsymbol{\mu}_{c(\mathbf{x})}\|_2^2 + d_{ADC}(\mathbf{q} - \boldsymbol{\mu}_{c(\mathbf{x})}, PQ(\mathbf{x} - \boldsymbol{\mu}_{c(\mathbf{x})}))

The first term is the coarse distance, the second is the fine PQ distance on residuals.

Python
def search_ivf_pq(q, centroids, pq_codebooks, lists, nprobe=8, k=10):
    # 1. Coarse search: find nearest clusters
    coarse_dists = [np.linalg.norm(q - c) for c in centroids]
    probe_lists = np.argsort(coarse_dists)[:nprobe]
    
    # 2. Fine search: ADC on residuals
    candidates = []
    for list_id in probe_lists:
        centroid = centroids[list_id]
        q_residual = q - centroid
        
        # Precompute distance table for this residual
        dist_table = compute_distance_table(q_residual, pq_codebooks)
        
        # Score all vectors in this list
        for vec_id, pq_code in lists[list_id]:
            pq_dist = sum(dist_table[j, pq_code[j]] for j in range(len(pq_code)))
            total_dist = coarse_dists[list_id] + pq_dist
            candidates.append((vec_id, total_dist))
    
    # 3. Return top-k
    candidates.sort(key=lambda x: x[1])
    return [vec_id for vec_id, _ in candidates[:k]]

Memory Layout Optimization

In production, memory layout matters. The system stores:

Total: ~1.1GB vs 307GB for exact vectors—a 280x reduction.

Performance Tuning

The key parameters and their effects:

ParameterEffect on SpeedEffect on AccuracyTypical Values
nlistn_{list}More lists = faster coarse searchToo many = worse clustering1K-16K
nproben_{probe}Fewer probes = fasterMore probes = better recall4-32
mmMore subvectors = slowerMore subvectors = better4-16
kkLarger codebooks = slower trainingLarger codebooks = better256-1024

Real-Time Updates

Production systems need to handle code changes in real-time. The solution is a hybrid approach:

  1. Main index: IVF+PQ for stable code (rebuilt nightly)
  2. Delta index: Exact search for recent changes (rebuilt every few minutes)
  3. Merge strategy: Combine results from both indexes
Python
class HybridIndex:
    def __init__(self):
        self.main_index = IVFPQIndex()
        self.delta_index = ExactIndex()
        self.last_rebuild = time.time()
    
    def search(self, q, k=10):
        main_results = self.main_index.search(q, k*2)
        delta_results = self.delta_index.search(q, k)
        
        # Merge and deduplicate
        all_results = merge_results(main_results, delta_results)
        return rerank_and_select(all_results, k)
    
    def add_vector(self, vec_id, embedding):
        self.delta_index.add(vec_id, embedding)
        
        # Trigger rebuild if delta gets too large
        if self.delta_index.size() > 50000:
            self.rebuild_main_index()
    
    def rebuild_main_index(self):
        # Merge delta into main, rebuild IVF+PQ
        all_vectors = self.main_index.get_all() + self.delta_index.get_all()
        self.main_index = IVFPQIndex(all_vectors)
        self.delta_index.clear()
        self.last_rebuild = time.time()

Advanced Optimizations

Several advanced techniques push performance even further:

Multi-Index Hashing

Instead of storing full PQ codes, use multiple hash tables indexed by subsets of codes:

h1(x)=[c1,c2,,cm/2],h2(x)=[cm/2+1,,cm]h_1(\mathbf{x}) = [c_1, c_2, \ldots, c_{m/2}], \quad h_2(\mathbf{x}) = [c_{m/2+1}, \ldots, c_m]

Polysemous Codes

Learn multiple quantizers and use the best one for each vector:

PQ(x)=argminPQ{PQ1,,PQL}xPQ(x)22PQ^*(\mathbf{x}) = \arg\min_{PQ \in \{PQ_1, \ldots, PQ_L\}} \|\mathbf{x} - PQ(\mathbf{x})\|_2^2

GPU Acceleration

The distance table computation and candidate scoring parallelize perfectly on GPU:

Python
@cuda.jit
def compute_distances_gpu(query_chunks, codebooks, distance_table):
    i, j = cuda.grid(2)
    if i < query_chunks.shape[0] and j < codebooks.shape[1]:
        dist = 0.0
        for k in range(codebooks.shape[2]):
            diff = query_chunks[i, k] - codebooks[i, j, k]
            dist += diff * diff
        distance_table[i, j] = dist