This article is the second edition of the Advent of Patterns series. In this series, running from December 1st to December 24th 2024, I will document one design or programming pattern I have noticed recently. Read the introduction to the series to learn more about the series.
When you are designing a search engine, it is important to filter documents that match a criteria before doing any additional processing. For example, consider a scenario where you search for “coffee” in a search engine. If the search engine computes a relevance score of all documents to the term “coffee”, the server will be doing a lot of extra work. It is more effective for the search engine to find all documents that contain the word “coffee”, then score them.
The abstract pattern is to filter documents to a search space, then run a computation on them. I call it filter, then compute. Filter, then compute is an optimization pattern to help you think about how to order operations in tasks that involve search. The pattern can be followed to ensure that you only run computations on documents that you know are relevant to the given search space.
Russ Cox describes an application where this pattern was used in his article that explains how Google Code Search worked.
Google Code Search accepted regexes as queries. The engine would accept user input then “build a query of ANDs and ORs that gives the trigrams that must be present in any text matching the regular expression.” This was then used to find documents that match the regex. Then, an actual regex is run on the matching documents.
This pattern is important because running regexes is a computationally expensive operation. To run a regex on all files in the search index would take a significant amount of time. Instead, queries are run against a trigram reverse index to find all documents that will match the regex. Then, the regex can be run to find the exact position in a file that matches a regex, and compute additional meta information about the match.
My website search engine applies the filter then compute strategy in several places. The first is what I touched on earlier: the search engine finds all documents that match a search term then scores those documents.
In addition, for some queries (i.e. “what is advent of patterns”), the engine aims to find a direct answer to the user input. This is done by searching the top three most relevant articles to the query for specific text patterns (i.e. “advent of patterns is”). These matches are then graded and the most likely match is returned.
Searching documents for specific patterns is relatively expensive, and I also only want to do it if documents are relevant to the query.
Filtering, ranking, then running the computation is essential to build an experience that delivers on the goal of helping users find relevant information, fast. First, documents are filtered and scored and ranked by relevant. Then, documetns are further filtered to the top three to search for a document that may contain an answer.
Filtering the number of documents to search for a direct answer ensures I have a narrow search space. Ranking ensures I have a relevant search space. Limiting the search space helps keep answers more relevant. If a direct answer to a query can’t be found in the first three ranked results based on BM25, it is unlikely that the answer is available anywhere else.
Can you think of more examples of filter, then compute? Let me know!