From Search and Retrieval

What Search Is

The simplest version of search is grep. You have a folder of documents, the user types a word, and you open every file in turn, scan the text, and return the files that contain that word. Linear scan, exact string match. This is the baseline every other kind of search is trying to beat.
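As a rough illustration, the grep-style baseline can be sketched in a few lines of Python. The function name grep_search and the demo file contents are made up for this example; the sketch assumes a flat folder of UTF-8 text files and is not a replacement for grep itself.

```python
import tempfile
from pathlib import Path


def grep_search(folder: Path, word: str) -> list[str]:
    """Linear scan: open every file in turn, return names of files containing `word`."""
    matches = []
    for path in sorted(folder.iterdir()):
        # Exact, case-sensitive substring match on the raw text.
        if path.is_file() and word in path.read_text(encoding="utf-8"):
            matches.append(path.name)
    return matches


if __name__ == "__main__":
    # Hypothetical demo corpus, written to a temporary folder.
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        (folder / "refund_policy.txt").write_text(
            "Refunds are issued to the original card.", encoding="utf-8"
        )
        (folder / "shipping.txt").write_text(
            "Orders ship within two days.", encoding="utf-8"
        )
        print(grep_search(folder, "card"))
```

The cost is one full read of every file per query, and a document is found only if it contains the exact string the user typed.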

Why exact-match breaks

The user types "how do I get my money back". The file that actually answers the question is called refund_policy.txt. None of the query’s words appear in it. grep returns nothing, even though the answer is sitting in the folder. This is the failure mode every modern search system exists to solve: queries and documents talk about the same thing using different words.

Why “just ask an LLM” doesn’t scale

You could hand all the documents to a language model and let it pick. This works at 100 files. It collapses at 100,000: the text no longer fits in a single call, and paying per token for the whole corpus on every search is too expensive. So the LLM, if it shows up at all, is the last step of a pipeline. Something cheaper has to narrow the candidate set first.
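A back-of-the-envelope calculation makes the scaling problem concrete. Every number in the demo below is an illustrative assumption, not a real model's limit or price: 500 tokens per document, a 200,000-token context window, and 1 USD per million input tokens. The function name corpus_cost is also made up for this sketch.

```python
def corpus_cost(
    num_docs: int,
    tokens_per_doc: int,
    usd_per_million_tokens: float,
    context_window_tokens: int,
) -> dict:
    """Estimate what it takes to send the entire corpus to an LLM on one search."""
    total_tokens = num_docs * tokens_per_doc
    return {
        "total_tokens": total_tokens,
        # Can the whole corpus be passed in a single call?
        "fits_in_one_call": total_tokens <= context_window_tokens,
        # Input-token cost paid again on every search.
        "usd_per_search": total_tokens / 1_000_000 * usd_per_million_tokens,
    }


if __name__ == "__main__":
    # Assumed figures only; substitute your own model's limits and pricing.
    for n in (100, 100_000):
        print(n, corpus_cost(n, tokens_per_doc=500,
                             usd_per_million_tokens=1.0,
                             context_window_tokens=200_000))
```

Under these assumed figures, 100 files come to 50,000 tokens and fit in one call, while 100,000 files come to 50 million tokens, which does not fit and would cost about 50 USD per search.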

The distributional hypothesis

The way to match across different wordings is to match on meaning, not on exact letters. The intuition is old (J.R. Firth, 1957): “You shall know a word by the company it keeps.” Words that appear in similar positions, surrounded by similar other words, tend to mean similar things. “Refund” and “money back” both show up next to “purchase,” “return,” “card,” “transaction,” so a system that learns from large amounts of text will tend to represent them as nearby points in a vector space. This is the foundation every embedding model is built on.
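The idea can be shown at toy scale by counting which words appear near each word and comparing those counts. This is only a sketch of the intuition, not how real embedding models are trained. The three-sentence corpus is invented, the function names are hypothetical, and the single word "reimbursement" stands in for the two-word phrase "money back" to keep the tokenization trivial.

```python
import math
from collections import Counter


def context_vectors(sentences: list[str], window: int = 2) -> dict[str, Counter]:
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors: dict[str, Counter] = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            neighbours = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(neighbours)
    return vectors


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Invented toy corpus: two words that share neighbours, one that does not.
    corpus = [
        "i want a refund for my purchase",
        "i want a reimbursement for my purchase",
        "the weather was sunny on my walk",
    ]
    vecs = context_vectors(corpus)
    print(cosine(vecs["refund"], vecs["reimbursement"]))
    print(cosine(vecs["refund"], vecs["weather"]))
```

In this corpus "refund" and "reimbursement" never occur together, yet they score as similar because they keep the same company; "weather" shares no neighbours with either.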

The basic structure of meaning-based retrieval

Assume for a moment that there exists a magic function closeness(text_a, text_b) that returns a number measuring how closely related in meaning two pieces of text are. Search becomes very simple:

  1. For each document in the corpus, call closeness(query, document).
  2. Sort by score, highest first.
  3. Return the top few.

That’s the whole structure. No word-by-word loop, no exact-match step. The query and each document are compared as whole units. Every dense-retrieval system is a variation of this, with the magic function replaced by something real and with tricks to avoid actually scanning all the documents.
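The three steps can be sketched directly, with closeness passed in as a parameter because it is still the "magic" part. The search function, the demo corpus, and the hard-coded scores are all hypothetical: the stand-in scorer just looks numbers up in a table so the structure can run, and it says nothing about how a real closeness function would compute them.

```python
from typing import Callable


def search(
    query: str,
    corpus: dict[str, str],
    closeness: Callable[[str, str], float],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Score every document against the query, sort, and return the best few."""
    # Step 1: call closeness(query, document) for each document.
    scored = [(name, closeness(query, text)) for name, text in corpus.items()]
    # Step 2: sort by score, highest first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Step 3: return the top few.
    return scored[:top_k]


if __name__ == "__main__":
    # Hypothetical corpus and made-up scores, for illustration only.
    corpus = {
        "refund_policy.txt": "Refunds are issued to the original card.",
        "shipping.txt": "Orders ship within two days.",
        "careers.txt": "We are hiring engineers.",
    }
    fake_scores = {
        "refund_policy.txt": 0.9,
        "shipping.txt": 0.3,
        "careers.txt": 0.1,
    }
    name_of = {text: name for name, text in corpus.items()}

    def fake_closeness(query: str, document: str) -> float:
        # Stand-in for the magic function: a table lookup, not a real model.
        return fake_scores[name_of[document]]

    print(search("how do I get my money back", corpus, fake_closeness, top_k=2))
```

Note that this version still calls closeness once per document, so it is a linear scan like grep; only the comparison has changed.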