Suppose you are looking for my Aeropress recipe. To find this information, you could turn to my blog search engine. This search engine indexes all of my blog posts. The search engine is powered by JameSQL, a NoSQL document database.
Turning text queries into JSON queries
Let’s start with a query:
aeropress recipe
When you type this query into my search engine, it is first interpreted with a transpiler.
The transpiler converts the text that you have typed into a JSON structure that the document database powering the search engine can understand. The transpiler understands various special arguments. For example, aeropress -recipe
would return results that contain “aeropress” but not “recipe”; aeropress recipe category:recipe
would return documents that contain the words aeropress
and recipe
and have a category equal to recipe
.
Here is the JSON representation of the above query:
{
"query": {
"and": [
{
"or": [
{"post": {"contains": "aeropress"}},
{"title_lower": {"contains": "aeropress"}}
]
},
{
"or": [
{"post": {"contains": "recipe"}},
{"title_lower": {"contains": "recipe"}}
]
},
]
},
"limit": 10,
}
There is a lot going on in this query. Let’s talk about it step by step.
The search query has created two OR statements:
- If the post or title contains
aeropress
. - If the post or title contains
recipe
.
First, all documents whose post content or titles contain aeropress
are retrieved. Then, documents whose post content or title contains recipe
are retrieved.
The and
statement at the top of the query says that only documents that match both of the inner clauses are returned. This means that documents that match the above two statements – post or title contains aeropress
, and post or title contains recipe
– will be returned.
At this stage, rudimentary query simplification rules are applied. These rules ensure that redundant terms are removed. These rules follow Boolean logic.
For instance, if you search for coffee coffee
, the query will be reduced to coffee
. If you search for aeropress recipe -recipe
, the query will be reduced to recipe
, since recipe -recipe
cancel each other out.
Results are limited to 10 documents using the limit
query.
Once the query is generated, a special query_score
argument is added that looks like this:
"query_score": "((_score * decay published) * log((inlinks + 1)))"
This algorithm is used to rank documents. The algorithm says:
- Take the TF-IDF document score (keyword relevance).
- Multiply this value by a decay factor.
- Multiply this value by the logarithm of the number of links pointing to a document, plus one.
The decay factor is equal to:
(0.9 ** (days_since_post / 30))
This algorithm allows me to account for keyword relevance, slightly improve the rank of newer pieces of content, and improve the rank of documents that have more internal links pointing to them.
The query lifecycle
At this point, we have a complete JSON query with criteria for:
- Selecting documents.
- Ranking documents.
- Limiting the number of documents returned.
This query is then sent for evaluation, in the order above.
Selecting documents that match the query
First, the query
key is evaluated to retrieve all documents that match the query. This is done using bottom-up parsing. In this case, it means that the following inner clauses are evaluated first:
"or": [
{"post": {"contains": "aeropress"}},
{"title_lower": {"contains": "aeropress"}}
]
Then, the next or statement is evaluated:
"or": [
{"post": {"contains": "recipe"}},
{"title_lower": {"contains": "recipe"}}
]
Then, the and
statement is evaluated that finds the intersection of the results from the above two sub-queries.
Let’s take a look at one of the innermost clauses:
{"post": {"contains": "aeropress"}}
This clause says:
- Look up the
post
index. - Find all documents that contain
aeropress
.
This information is retrieved using a reverse index. A reverse index is an efficient method of retrieving documents.
The reverse index looks like this:
"word": [docid1, docid2...]
Where docid
corresponds with the IDs of all documents that contain a word. Thus, to find all documents that contain aeropress
or recipe
, JameSQL can do two dictionary lookups:
index["aeropress"]
index["recipe"]
Each of the dictionary lookups can be completed in O(1) time.
With a reverse index, the entire query can be processed in sub-millisecond time, even with thousands of documents that each have thousands of words.
You can learn more about how to build reverse indices in my How to build a search index in Python guide.
At this stage, document keyword relevance scores are calculated according to the TF-IDF formula. This score specifically says how related each document is to each keyword. The relevance of a document to the query aeropress recipe
for a single document is the sum of the TF-IDF for aeropress
and recipe
for that document.
Ranking documents
When the query
key is evaluated, the search engine has a list of candidate documents. These are documents that match the search criteria. Then, these documents can be ranked.
The ranking formula, ((_score * decay published) * log((inlinks + 1)))
, is implemented using another grammar. This grammar is for writing ranking algorithms.
The ranking formula is parsed into a syntax tree. This syntax tree and accompanying logic is then used to evaluate every document.
By the end of this process, every document has a score based on its keyword relevance (_score
), its published date, and the number of internal links pointing to the document.
The top 10 results are then retrieved, per the limit
value.
Then, results are returned to the front-end.
Here is an example of a result to the query aeropress recipe:
Of note, the first result is not particularly relevant. The post contains aeropress recipe
as an example. This could be mitigated by giving a boost to the title field, so that titles get more weight. I still need to implement this.
With that said, the second result answers our query exactly. The second result is my Aeropress recipe.
Conclusion
The search system above has three main components:
- A grammar that turns a user query (i.e.
aeropress recipe
) into a JSON format that can be evaluated by the query engine. - A bottoms-up parser that evaluates the JSON query and returns all documents that match the query structure.
- A ranking grammar that parses a ranking algorithm and computes scores for every document.
With the above system, documents can be retrieved in < 10ms then sent to the browser for display.
If you have any questions about how the system works, feel free to email me at readers [at] jamesg [dot] blog.