Filter before running computationally expensive tasks


In “Regular Expression Matching with a Trigram Index or How Google Code Search Worked”, Russ Cox describes a system for searching code with regular expressions. It has taken me several reads of the article to understand roughly what is going on, and I still don’t have a full understanding.

Even so, one pattern stands out to me, and I have kept it in mind throughout my recent search work. Cox notes:

“We can run this query against the trigram index to identify a set of candidate documents and then run the full regular expression search against only those documents.”

The rough algorithm is:

  1. Take a user’s regular expression.
  2. Compute a query containing ANDs and ORs of trigrams that can be used to find all documents that could match the regular expression (a sketch of this step follows the list).
  3. Run the full regular expression on the returned documents.
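To make step 2 concrete, here is a minimal Python sketch of deriving required trigrams. It handles only bare literal strings, and the `trigrams` helper is my own name for illustration, not Cox's; his article shows how to transform an arbitrary regex into a full AND/OR query.

```python
def trigrams(s: str) -> set[str]:
    """Every overlapping 3-character substring of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

# A literal requires ALL of its trigrams to appear in a matching
# document (AND). An alternation like foo|bar requires the trigrams
# of one branch OR the other. This sketch stops at bare literals.
print(sorted(trigrams("Google")))  # ['Goo', 'gle', 'ogl', 'oog']
```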

Regular expressions are expensive to run, and the more documents you have, the longer it takes to scan them all for matches.

To avoid that cost, Cox built a system that narrows the search space: regex matching only needs to happen on documents that contain every trigram a match for the regex must contain.
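Here is a toy sketch of that narrowing step, with a made-up three-document corpus and an in-memory inverted index standing in for a real one. The candidate set is the intersection of the posting lists for every required trigram:

```python
from collections import defaultdict

def trigrams(s):
    return {s[i:i + 3] for i in range(len(s) - 2)}

# Toy corpus standing in for the indexed documents.
docs = {
    "a.go": "package main // prints hello world",
    "b.go": "package main // math helpers",
    "c.txt": "hello, regex world",
}

# Inverted index: trigram -> ids of the documents that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for t in trigrams(text):
        index[t].add(doc_id)

# For a regex like hello\s+world, only the trigrams inside each
# literal run are guaranteed to appear in every match; the \s+ gap
# contributes nothing, so the AND query covers both runs' trigrams.
required = trigrams("hello") | trigrams("world")

candidates = set(docs)
for t in required:
    candidates &= index.get(t, set())

print(sorted(candidates))  # ['a.go', 'c.txt'] -- b.go ruled out cheaply
```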

Then the full regex can be run against just those candidate documents to find match locations and handle any additional post-processing.
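Continuing the same toy example: the regex engine now runs on only the two surviving candidates, and the full match weeds out the false positive that the trigram filter let through.

```python
import re

# Candidates and documents carried over from the sketch above.
docs = {
    "a.go": "package main // prints hello world",
    "c.txt": "hello, regex world",
}
candidates = {"a.go", "c.txt"}  # output of the trigram AND query

pattern = re.compile(r"hello\s+world")
for doc_id in sorted(candidates):
    for m in pattern.finditer(docs[doc_id]):
        print(doc_id, m.start(), m.group())
# Prints: a.go 23 hello world
# c.txt was a false positive; the full regex rejects it.
```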

The general pattern I derived from this is to filter before running computationally expensive tasks. The fewer pieces of data the expensive task has to touch, the better.


