Improving relevance on my site search engine


The search engine on this website is powered by JameSQL, an open source NoSQL database. The engine accepts a query and evaluates it against any conditions specified. I strive to keep query times below 10ms, so that search both feels fast and is fast.

When I first released the search engine, relevance was determined by the number of times keywords were mentioned in a document. This naturally biased search results toward longer documents, which reference keywords more often than shorter ones. I knew that there were better ways to determine relevance, but as with any project I wanted to focus on one piece at a time. First, I wanted to make everything work, while keeping an eye on speed. Then, I could come back to relevance.

This week, I released a new ranking algorithm for my search engine. The algorithm has a few parts. First, all documents are ranked according to TF-IDF, a popular information retrieval algorithm. TF-IDF weighs words differently depending on their occurrence across an entire corpus of text; words that are common across the corpus carry less weight than words that are rare. I explained how TF-IDF works, and how to implement it, in another blog post.
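To make the idea concrete, here is a minimal sketch of one common TF-IDF formulation in Python. It is not JameSQL's actual implementation, which may tokenise and weight terms differently; the `tf_idf_scores` function and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, documents):
    """Score each document against the query terms with a basic TF-IDF."""
    # Naive whitespace tokenisation (an assumption for this sketch).
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: the number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if counts[term] == 0:
                continue
            tf = counts[term] / len(tokens)  # normalise by document length
            idf = math.log(n_docs / doc_freq[term])  # rarer terms weigh more
            score += tf * idf
        scores.append(score)
    return scores


docs = [
    "coffee shops in london",
    "coffee brewing guide for coffee lovers",
    "notes on search ranking",
]
print(tf_idf_scores(["coffee", "ranking"], docs))
```

In this toy corpus, "ranking" appears in only one document, so the third document scores highest even though "coffee" is mentioned more often elsewhere.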

TF-IDF is the foundation of keyword search relevance in my search engine. Using TF-IDF, results were more relevant in my experiments, and long documents that mention a keyword many times were no longer appearing at the top for most queries.

With that said, I had a few other preferences I wanted to encode in the retrieval system.

I wanted older articles to lose weight over time. This is because blogs are temporal, and the most recent information is more likely to be relevant. I introduced a “decay” factor that multiplies the TF-IDF score by 0.9 for every 30 days since an article was published, which will downrank older documents over time. (As I write, I realise I should introduce a base so that this doesn’t significantly downrank relevant documents on the basis that they are old.)
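As a rough sketch of how such a decay behaves, the snippet below computes the multiplier for a few publication dates. The `decay_multiplier` function and its parameter names are hypothetical, and JameSQL's own `decay` operator may be implemented differently.

```python
from datetime import date


def decay_multiplier(published, today, rate=0.9, period_days=30):
    """Return the factor applied to a score: `rate` per `period_days` elapsed."""
    # Clamp to zero so a post dated in the future is not boosted.
    days_since_post = max((today - published).days, 0)
    return rate ** (days_since_post / period_days)


today = date(2024, 10, 1)
for published in (date(2024, 10, 1), date(2024, 9, 1), date(2023, 10, 1)):
    print(published, round(decay_multiplier(published, today), 3))
```

A post published today keeps its full score, a 30-day-old post keeps 90%, and a year-old post keeps a little over a quarter, which shows how quickly older posts can sink without some lower bound.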

I also wanted to improve the rank of blog posts depending on how many articles link to them. I added a boost so that there is a slight increase in score depending on how many times a post has been linked to across my site. I use a logarithm to ensure that a high number of links does not supplant TF-IDF as the main metric for evaluation.

The search algorithm looks like this:

tf_idf_score * (0.9 ** (days_since_post / 30)) * log(num_of_incoming_links + 1)

This is expressed in the JameSQL ranking language as:

((_score * decay published) * log((inlinks + 1)))

The +1 on the inlinks ensures that no error is raised for posts with no incoming links: log(0) is undefined, and the Python math library used to calculate the score raises an error for it.
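Here is a small Python sketch of the full formula, assuming the natural logarithm from the `math` module. The `final_score` function is a hypothetical stand-in written for illustration, not code taken from JameSQL.

```python
import math


def final_score(tf_idf_score, days_since_post, inlinks):
    """Combine TF-IDF, time decay, and a logarithmic inlink boost."""
    decay = 0.9 ** (days_since_post / 30)
    # log(0) is undefined; math.log(0) raises ValueError, hence the +1.
    link_boost = math.log(inlinks + 1)
    return tf_idf_score * decay * link_boost


# Same TF-IDF score and age, different numbers of incoming links.
for inlinks in (0, 1, 10, 100):
    print(inlinks, round(final_score(1.0, 0, inlinks), 3))
```

The logarithm keeps the boost modest: going from 10 to 100 inlinks less than doubles the multiplier. In this sketch, a post with no incoming links gets log(1) = 0 and therefore a final score of 0, which is one case where an additive boost would behave differently.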

I suspect the inlinks boost will eventually need to be added to the score rather than multiplied with it, but the formula above is what has worked for me in my testing.

Try out my blog search engine and see how it performs. If you notice a search that doesn’t return results you would expect, let me know!

JameSQL is open source, so you can build your own search engine with it too!


