My website search engine uses text search to identify documents relevant to a given term. Up until recently, the search engine treated every word in a term independently.
For example, consider the query “all too well”. The search engine would find every document that contains any of the words in the query. Then, the results would be ordered by their lexical relevance, as measured by TF/IDF (which I have since replaced with BM25).
TF/IDF and BM25 do not account for the proximity of words in documents. This means that a document that mentions “all too well” directly would be treated the same as a document that mentions the three component words separately.
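To make this concrete, here is a minimal sketch of BM25 scoring written from the standard formula. It is not the code my search engine runs: the function name `bm25_score`, the parameter values `k1=1.5` and `b=0.75`, and the toy documents are all assumptions for illustration. Two documents with the same length and the same term counts receive the same score, regardless of word order.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_tokens, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with standard BM25.

    `corpus` is a list of tokenized documents, used for document
    frequencies and the average document length.
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    counts = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        # Number of documents that contain the term at least once.
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        tf = counts[term]
        length_norm = 1 - b + b * len(doc_tokens) / avg_len
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


# Hypothetical documents: same length, same counts for each query word.
phrase_doc = "i remember it all too well".split()
scattered_doc = "all is well i too remember".split()
other_doc = "notes on brewing coffee".split()
corpus = [phrase_doc, scattered_doc, other_doc]

query = "all too well".split()
score_phrase = bm25_score(query, phrase_doc, corpus)
score_scattered = bm25_score(query, scattered_doc, corpus)

# Word order never enters the formula, so the two scores match.
print(math.isclose(score_phrase, score_scattered))  # True
```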
I have recently updated my site search engine to take into account word proximity when ranking documents.
The ranking process is as follows:
- Find candidate documents that contain words in a query.
- Calculate the BM25 scores for each document given the query.
- For each candidate document, identify if the words in the query appear directly in sequence at any point in the document.
- If words in a query appear together in a document, boost the rank of the document.
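The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not my production code: the BM25 scores are passed in as precomputed, made-up numbers, the function names are hypothetical, and the boost is modelled as a simple multiplier (`boost=2.0`), which is only one of several ways a boost could be applied.

```python
def contains_phrase(doc_tokens, query_tokens):
    """Return True if query_tokens appear in sequence in doc_tokens."""
    n = len(query_tokens)
    if n == 0:
        return False
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def rank_with_proximity_boost(query, documents, bm25_scores, boost=2.0):
    """Rank document ids for a query, boosting exact phrase matches.

    `documents` maps a document id to its text, and `bm25_scores` maps
    a document id to a precomputed (here: hypothetical) BM25 score.
    """
    query_tokens = query.lower().split()
    ranked = []
    for doc_id, text in documents.items():
        doc_tokens = text.lower().split()
        # Step 1: keep only candidates that contain a query word.
        if not set(query_tokens) & set(doc_tokens):
            continue
        # Step 2: start from the BM25 score for this document.
        score = bm25_scores[doc_id]
        # Steps 3 and 4: boost documents where the words appear in sequence.
        if contains_phrase(doc_tokens, query_tokens):
            score *= boost
        ranked.append((score, doc_id))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in ranked]


# Made-up documents and scores, for illustration only.
documents = {
    "post-a": "i know that song all too well",
    "post-b": "all is well and i am too tired",
    "post-c": "notes on brewing coffee",
}
bm25_scores = {"post-a": 1.2, "post-b": 1.5, "post-c": 0.0}

ranked = rank_with_proximity_boost("all too well", documents, bm25_scores)
print(ranked)  # ['post-a', 'post-b']
```

In this toy example, "post-b" has the higher BM25 score, but "post-a" contains the exact phrase, so the boost moves it to the top. "post-c" contains none of the query words and is never considered.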
With this process, for the search “all too well”, a blog post that contains that exact phrase will be considered more relevant than one that only contains the component words separately.
This approach has a few benefits for my blog search:
- If you paste a blog post title into my search engine, the blog post with that title should show up first. This is because the word proximity boost pushes the blog post up to the top.
- Named entities with two or more words (e.g. “All Too Well”) will return more relevant results, because word proximity is considered when ranking.
- Generally, it is easier to find documents that contain a phrase.
Let’s walk through an example comparing the old and new algorithms.
We’ll use “all too well” as the example. With this query, the intent is to find my writings related to Taylor Swift’s song “All Too Well”.
For the query “all too well”, BM25 without a proximity boost returns:
- Beyond Tellerrand 2022
- Advent of Technical Writing: Facilitating Ideas
- Announcing Tay Tay Lyric of the Day
The first two results are not related to the query.
For the query “all too well”, BM25 with a proximity boost returns the following as the most relevant results:
- Taylor Swift Subreddit Acronym Reference
- Analyzing use of Taylor Swift song name acronyms on Reddit
- Announcing Tay Tay Lyric of the Day
All three results above are related to the query.
For this query, the word proximity boost makes a significant difference: all of the top three results are now relevant, compared with only one of the top three before.
In “How to find word collocations in a document”, I walked through how I implemented the logic to find phrases in a document using sets. I recommend reviewing the post if you are interested in learning how I implemented a solution to efficiently check if a document contains a multi-word phrase.
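As a rough illustration of the general idea (a sketch, not the code from that post), one way to use sets for phrase lookup is to store every run of n consecutive words in a document as a tuple in a set. Checking whether a phrase of that length appears is then a single membership test. The function names `build_ngram_set` and `contains_phrase_via_set` are hypothetical.

```python
def build_ngram_set(tokens, n):
    """Collect every run of n consecutive tokens as a tuple in a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contains_phrase_via_set(ngram_set, phrase):
    """Check whether a phrase is one of the stored n-grams."""
    return tuple(phrase.lower().split()) in ngram_set


# Build the set once per document, then reuse it for many lookups.
tokens = "i know that song all too well".split()
trigrams = build_ngram_set(tokens, 3)

print(contains_phrase_via_set(trigrams, "all too well"))  # True
print(contains_phrase_via_set(trigrams, "too well all"))  # False
```

The set is built once per document, so each later phrase check is an average constant-time lookup instead of a scan over the whole document. The trade-off is that a set only answers queries of the length it was built for, so phrases with a different number of words need their own set.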