Embeddings

The semantic arm rests on one trick: turn text into a vector so that texts which mean similar things land near each other. Everything else, the similarity metric, the nearest-neighbor index, the fusion with lexical, assumes the vectors already exist and are good. This leaf is about where they come from and what “good” means, and the thing I had to get past first was my own fuzzy sense that an embedding is “the meaning of the text as numbers,” which is true enough to be useless. The useful version is more specific and more honest about what the vector can and cannot carry.

What an embedding actually is

An embedding is a function from a piece of text to a fixed-length list of numbers, a point in a space of a few hundred to a couple thousand dimensions. The function is a learned model, and the only thing that makes the output an embedding rather than an arbitrary hash is the property it was trained to have: texts that mean similar things map to points that are close, and texts that mean different things map to points that are far. Distance in the space is trained to track dissimilarity of meaning.

That property does not come for free, and seeing how it is learned demystifies the whole thing. The standard recipe is contrastive: show the model many pairs of texts known to be similar (a question and its correct answer, a sentence and its paraphrase) and many pairs known to be dissimilar, then adjust the model so the similar pairs come out close and the dissimilar pairs come out far. Do this over enough pairs and the model generalizes, placing texts it has never seen near their paraphrases. The model is not told what “meaning” is. It is shown which things should be close and learns coordinates that make them so. The embedding space is the residue of millions of “these two go together, these two do not” judgments.
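To make the recipe concrete, here is a minimal sketch of one common contrastive objective, an InfoNCE-style loss. The text above does not say which loss any particular model uses, so treat the choice as an assumption; the function name `contrastive_loss`, the `temperature` value, and the 2-D vectors are all hypothetical. It only computes the loss for one anchor; a real trainer would backpropagate this through the model.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumed, illustrative choice of objective).

    Low when the anchor is closer to its positive than to every negative,
    high when a negative is as close as, or closer than, the positive.
    """
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [s / temperature for s in sims]
    # Numerically stable log-sum-exp over positive and negatives.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log of the softmax probability assigned to the positive.
    return log_denominator - logits[0]

# Hypothetical 2-D vectors, purely for illustration.
anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss_close = contrastive_loss(anchor, [0.9, 0.1], negatives)  # positive near the anchor
loss_far = contrastive_loss(anchor, [0.1, 0.9], negatives)    # positive far from the anchor
print(loss_close < loss_far)
```

Training pushes the model's parameters in whatever direction lowers this number, which is the "adjust the model so the similar pairs come out close" step.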

This is why an embedding can match “how do I get my money back” to “refund policy” with no shared words. Somewhere in training the model saw refund-shaped questions paired with refund-shaped answers, and it learned to put that whole region of phrasings in one neighborhood. The match is geometric because the training made the geometry encode the pairing.

Measuring near: cosine

If meaning is a location, similarity is a distance, and the distance the field almost always uses is cosine similarity: the cosine of the angle between two vectors, which is their dot product divided by the product of their lengths. It runs from 1 for vectors pointing the same direction, through 0 for perpendicular, to -1 for opposite. Cosine cares about direction, not magnitude, which is the right choice here because the direction of an embedding is trained to carry the meaning and the length often just reflects incidental things like text length or token count.
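Written out for two vectors a and b with n components each (the symbols are mine, not the original's), the definition in words above is:

```latex
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}
```

As a quick check, a = (1, 0) and b = (1, 1) give 1 / (1 · √2) ≈ 0.707, the cosine of the 45-degree angle between them. Doubling b to (2, 2) leaves the result unchanged, which is the "direction, not magnitude" property.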

Figure: five real query/answer pairs. Lexical overlap (Jaccard over content words) is zero for every one; semantic cosine is 0.93 to 1.00. The same matches word-matching threw away, meaning-matching recovers. (The semantic vectors here are hand-set to show the shape, not learned.)

I leaned on cosine in the territory’s opening experiment: five query/answer pairs that shared zero content words scored cosine 0.93 to 1.00 in a toy concept space, the same pairs that scored zero on lexical overlap. The vectors there were hand-set rather than learned, so the magnitude of the numbers is illustrative, but the mechanism is exactly what a real embedding model does at scale: place paraphrases in nearly the same direction, and let cosine read off the nearness. The convenient consequence is that if you normalize every vector to unit length up front, cosine becomes a plain dot product, which is cheap and is what the nearest-neighbor index actually computes.
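A small sketch of that last point, using made-up vectors: once both vectors are scaled to unit length, their dot product equals the cosine of the originals. The helper names `normalize` and `dot` are illustrative, not taken from the experiment script.

```python
import math

def normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

# Hypothetical vectors with different lengths (norms 5 and 3).
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

unit_a, unit_b = normalize(a), normalize(b)  # done once, at indexing time
print(cosine(a, b))         # full cosine on the raw vectors
print(dot(unit_a, unit_b))  # same value from a plain dot product
```

The saving is that the two square roots and the division are paid once per vector when it is stored, not once per comparison at query time.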

The failure that mirrors lexical’s

The honest part of this leaf is the failure, because it is sharp and it is the reason embeddings do not end the story. Embeddings are weak exactly where lexical matching is strongest: on rare exact literals. An error code, an API name, a SKU, a function signature, a rare identifier: these carry their meaning in their precise literal form, and the embedding model smooths them into a neighborhood of similar-looking things. ENOENT gets embedded somewhere in “error-ish” space, near other error tokens, and a query for ENOENT can come back with plausible, thematically adjacent, completely wrong neighbors. The model has no notion that the user needs that exact string and nothing else.

This is the precise mirror of the lexical blind spot. Lexical fails on paraphrase, where there is no shared word; semantic fails on literals, where the exact word is the whole point. Each is strong where the other is weak. I find it clarifying that the two arms do not differ in degree but in kind: they fail on largely disjoint sets of queries, which is exactly the condition under which combining them is worth the trouble. That combination is hybrid search, and the rarity-weighted lexical arm earns its keep precisely by catching the literals the embeddings smear.

The representation forks

How you turn a document into vectors is not one decision but several, and each is a place real systems diverge.

One vector per document, or per passage. A long document has many topics, and squashing all of them into a single point averages them into a blur that is near everything and excellent at nothing. The common fix is to split the document into passages, the same chunking that complicates IDF counting on the lexical side, and embed each passage separately, so a query can match the one passage that answers it. The cost is more vectors to store and search, and a new bookkeeping problem of mapping a matched passage back to its document, the same logical-document-versus-chunk distinction that haunts the lexical statistics. Chunk too coarsely and each vector is a blur; chunk too finely and you lose the context that made a passage meaningful and multiply the index size.
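A minimal sketch of the passage approach and its bookkeeping, assuming fixed-size word windows with overlap. The window size, the overlap, and the names `Passage`, `split_into_passages`, and `best_passage_per_document` are all hypothetical; real systems often split on sentences or headings instead.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Passage:
    doc_id: str   # the logical document this chunk came from
    index: int    # position of the chunk within that document
    text: str

def split_into_passages(doc_id, text, max_words=50, overlap=10):
    """Split text into overlapping word windows, each tagged with its doc_id."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    words = text.split()
    step = max_words - overlap
    passages = []
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        passages.append(Passage(doc_id, len(passages), " ".join(window)))
        if start + max_words >= len(words):
            break  # this window already reached the end of the document
    return passages

def best_passage_per_document(scored_passages):
    """Collapse (passage, score) hits to the best-scoring passage per document."""
    best = {}
    for passage, score in scored_passages:
        if passage.doc_id not in best or score > best[passage.doc_id][1]:
            best[passage.doc_id] = (passage, score)
    return best

doc_text = " ".join(f"w{i}" for i in range(120))  # stand-in for a long document
passages = split_into_passages("doc-1", doc_text)
print(len(passages), [p.index for p in passages])
```

Each passage would be embedded separately, and `doc_id` is what lets a matched passage be reported as a hit on its parent document. The overlap is one way to soften the too-fine end of the trade-off, since a sentence cut at a boundary still appears whole in a neighboring window.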

How many dimensions. More dimensions give the model more room to separate meanings, up to a point, and cost proportionally more storage and search time, because every distance computation runs over every dimension. Higher dimensionality also makes approximate nearest neighbor harder, because distances grow less discriminating as dimensions rise. The dimension count is a capacity-versus-cost dial, and the right setting is the smallest that does not lose relevance, measured rather than guessed.
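The storage half of that dial is plain arithmetic. The sketch below assumes 4-byte float32 values and counts raw vectors only, with no index overhead; the corpus size and the three dimension counts are illustrative picks, not figures from this page.

```python
def index_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw vector storage: one value per dimension per vector (float32 assumed)."""
    return num_vectors * dimensions * bytes_per_value

NUM_VECTORS = 1_000_000  # hypothetical corpus of one million passages

for dims in (384, 768, 1536):
    gib = index_bytes(NUM_VECTORS, dims) / 2**30
    print(f"{dims:>5} dims: {gib:.2f} GiB")
# prints:
#   384 dims: 1.43 GiB
#   768 dims: 2.86 GiB
#  1536 dims: 5.72 GiB
```

Brute-force search cost scales the same way, since each comparison touches every dimension, so doubling the dimension count roughly doubles both numbers.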

Which model, and the shared-space requirement. The query and the documents must be embedded into the same space by compatible models, or their distances are meaningless, the geometric cousin of the analyzer parity requirement on the lexical side. Swap the document embedding model without re-embedding the corpus, or embed the query with a different model than the documents, and you are measuring distances between points that do not live in the same coordinate system. Re-embedding the whole corpus when the model changes is a real operational cost, the semantic version of rebuilding the index, and it is why model upgrades on the semantic arm are heavier than they look.
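One cheap guard against comparing across spaces is to store the model identity next to every vector and refuse to score a mismatch. This is a sketch under that assumption; `Embedded`, `checked_dot`, and the model ids are hypothetical names, not part of any library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Embedded:
    model_id: str   # name and version of the model that produced the vector
    vector: tuple

def checked_dot(query, doc):
    """Dot product that fails loudly when the two vectors come from different spaces."""
    if query.model_id != doc.model_id:
        raise ValueError(
            f"model mismatch: query={query.model_id!r}, doc={doc.model_id!r}; "
            "re-embed the corpus or the query before comparing"
        )
    if len(query.vector) != len(doc.vector):
        raise ValueError("dimension mismatch")
    return sum(x * y for x, y in zip(query.vector, doc.vector))

query = Embedded("model-a", (1.0, 0.0))
doc_same = Embedded("model-a", (0.6, 0.8))
doc_stale = Embedded("model-b", (0.6, 0.8))  # left over from before a model swap

print(checked_dot(query, doc_same))
try:
    checked_dot(query, doc_stale)
except ValueError as err:
    print("refused:", err)
```

The dimension check alone is not enough, because two different models can emit vectors of the same length whose coordinates mean unrelated things. That is why the guard keys on the model identity.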

The thing I keep returning to is that an embedding is not “the meaning” of a text in any absolute sense. It is a position learned from a particular set of similar-and-dissimilar judgments, good at the kinds of similarity it was trained on and blind to the kinds it was not, which is why the literal failure is structural and not a bug to be patched. Knowing what the vector was trained to encode tells you exactly where to trust it and where to reach for the other arm.

The code

I had no embedding model offline, so the experiment is honest about being a stand-in: it shows the shape (lexical zero, semantic high) with hand-set concept vectors, and the cosine is the real metric a production system uses. The lexical and semantic halves side by side:

```python
import math
import re

# Stop-word list is truncated in the original; the trailing ... stands for the rest.
STOP = {"how", "do", "i", "my", "get", "the", "to", "a", "back", "what", ...}

def content_tokens(s):
    """Lowercase alphabetic tokens with stop words removed."""
    return {w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOP}

def jaccard(a, b):
    """Lexical overlap: shared content words / union of content words."""
    A, B = content_tokens(a), content_tokens(b)
    return len(A & B) / len(A | B) if (A | B) else 0.0

def cosine(a, b):
    """Semantic similarity: cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

q = "how do i get my money back"
d = "refund policy and returns"
# vec is the hand-set concept table from the full script, keyed by text.
jaccard(q, d)              # -> 0.00  (no shared content words: lexical is blind)
cosine(vec[q], vec[d])     # -> 0.93  (same direction: meaning matches)
```

In a real system vec[...] comes from a learned embedding model rather than a hand-built concept table, but the comparison logic is exactly this. The full script with all five pairs is in the experiment workbench beside this page.

Up to semantic retrieval. The structure that searches these vectors without comparing them all is approximate nearest neighbor search. The arm whose literal strengths cover this arm’s literal weakness is lexical retrieval, and the two are combined in hybrid search.