In “Regular Expression Matching with a Trigram Index or How Google Code Search Worked”, Russ Cox describes a system that allows for code search with regular expressions. It has taken me several reads of the article to understand roughly what is going on, and I still don’t have a full understanding.
With that said, there is one pattern that stands out to me and that I have kept in mind throughout my recent search work. He notes:
We can run this query against the trigram index to identify a set of candidate documents and then run the full regular expression search against only those documents.
The rough algorithm is:
- Take a user’s regular expression.
- Compute a query containing ANDs and ORs that can be used to find all documents that could match the regular expression .
- Run the full regular expression on the returned documents.
Regular expressions are expensive to run. The more documents you have, the longer it takes to run a regex to find matches.
With that said, Cox first built a system to narrow the search space such that regex matching only needs to happen on documents that contain trigrams that must be in a document in order for it to match the regex.
Then, a regex can be run to find the location of a string in the matching documents, and do any additional post-processing.
The general pattern I derived from this is to filter before running computationally expensive tasks. The fewer pieces of data you have on which you need to run an expensive task, the better.