My website search engine now supports direct answers. Any query starting with “what is” or “what are” is processed with logic that aims to find a direct answer to the question. For example, consider the query “what is a trie”. When this query is run, the search engine returns a “direct answer” result at the top of the page, followed by the “10 blue links”.
For the query “what is a trie”, the following answer is returned:
A trie is a heavily-nested tree data structure that is commonly used in predictive text use cases. With a trie, you can search a tree by character
Below the answer is a link to the article from which the text was taken, and the date on which the quoted article was published.
Here is how the feature is presented from a visual perspective:
In a direct answer, the answer text appears before the document title, whereas in the 10 results that follow, the title appears above the document meta description.
How it works
When you type in a query starting with “what is” or “what are”, the query is reformatted. For the query “what is a trie”, the query is reformatted to:
a trie is
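One way this reformatting could look in Python is sketched below. The function name `reformat_query` and the exact rewrite rule (move the verb to the end) are assumptions inferred from the single example above, not the search engine's actual code.

```python
import re

# Assumption: only the two prefixes named in the post trigger a rewrite.
QUESTION_PREFIX = re.compile(r"^what\s+(is|are)\s+(.+)$", re.IGNORECASE)


def reformat_query(query: str) -> str:
    """Turn 'what is X' into 'X is'; return any other query unchanged."""
    cleaned = query.strip().rstrip("?").strip()
    match = QUESTION_PREFIX.match(cleaned)
    if match is None:
        return cleaned
    verb = match.group(1).lower()
    subject = match.group(2).strip()
    return f"{subject} {verb}"


print(reformat_query("what is a trie"))  # a trie is
```

Moving the verb to the end means the rewritten query looks like the start of a definition sentence, which is the kind of text the later steps try to find.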
The search engine finds all documents that contain those keywords. Next, documents are ordered using the BM25 ranking algorithm, which scores how relevant each document is to the keywords. The BM25 scores for documents are boosted in these scenarios:
- If the terms in the query occur exactly together in the document title or contents, the document is ranked higher.
- An overlap ratio is computed between the query and the title. The more the words overlap, the larger the boost. This boost is small, and is used to give more weight to document titles.
- The closer the query terms appear to the beginning of the document title, the larger the boost. This boost is also small.
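The three boosts above could be layered on top of a BM25 score roughly as follows. This is a sketch only: the additive combination, the weight values, and the names `boosted_score`, `tokenize`, and `contains_phrase` are assumptions for illustration, and the BM25 score is taken as an input rather than computed here.

```python
import re


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(tokens: list, phrase_tokens: list) -> bool:
    """True if phrase_tokens occur together, in order, inside tokens."""
    haystack = " " + " ".join(tokens) + " "
    needle = " " + " ".join(phrase_tokens) + " "
    return needle in haystack


def boosted_score(bm25_score, query, title, body,
                  phrase_boost=1.0, overlap_weight=0.2, position_weight=0.1):
    """Add the three boosts to a base BM25 score (weights are hypothetical)."""
    query_terms = tokenize(query)
    title_terms = tokenize(title)
    if not query_terms:
        return bm25_score
    score = bm25_score
    query_set = set(query_terms)

    # Boost 1: the query terms occur exactly together in the title or contents.
    if (contains_phrase(title_terms, query_terms)
            or contains_phrase(tokenize(body), query_terms)):
        score += phrase_boost

    # Boost 2 (small): share of unique query terms that also appear in the title.
    overlap_ratio = len(query_set & set(title_terms)) / len(query_set)
    score += overlap_weight * overlap_ratio

    # Boost 3 (small): larger when a query term appears earlier in the title.
    positions = [i for i, term in enumerate(title_terms) if term in query_set]
    if positions:
        score += position_weight / (1 + positions[0])

    return score
```

With these assumed weights, the exact-phrase boost dominates and the two title boosts mostly act as tie-breakers, which matches the description of them as small.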
I mention these conditions because I have found good ranking to be essential.
When you do a search, the top 10 documents are returned. Then, the top three of these are scanned to attempt to find a direct answer. Thus, the more relevant the top three documents are, the more likely it is that a quality direct answer can be found.
At this stage, stopwords are removed from the query.
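As a small illustration of this step, the snippet below strips stopwords from the reformatted query. The stopword list and the name `remove_stopwords` are assumptions; the post does not say which list the search engine uses.

```python
# Assumption: a tiny hand-picked stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "is", "are", "what", "of", "for", "to", "in", "on"}


def remove_stopwords(query: str) -> list:
    """Return the lowercased query words that are not stopwords."""
    return [word for word in query.lower().split() if word not in STOPWORDS]


print(remove_stopwords("a trie is"))  # ['trie']
```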
There are three ways that the search engine tries to find a definition.
- Find the first sentence that matches the query “trie” that has the query in parentheses. This is used to extract abbreviations. For example, “what is ATW” will return the document that contains “All Too Well (ATW)”. This rule was chosen because abbreviations are commonly defined in parentheses.
- Find the first sentence that matches the pattern “trie” (our query, minus all stopwords), followed by “a” and that also contains the word “is”, “are”, or “to”. This is effective for extracting definitions of terms.
- If neither of the two conditions above is matched, find the sentence with the greatest overlap between all query words and the sentence. This is effective for addressing longer-tail questions like “what is a trie used for”.
If none of the above conditions match, no result is returned.
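The three rules could be chained as in the sketch below. Several details are assumptions: the sentence splitter is a simple regex, the second rule is read as "the term appears, the word 'a' appears somewhere after it, and the sentence contains 'is', 'are', or 'to'", and the third rule counts overlap against the stopword-free terms. The function names are hypothetical.

```python
import re
from typing import Optional

DEFINITION_WORDS = {"is", "are", "to"}


def split_sentences(text: str) -> list:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(text: str) -> list:
    return re.findall(r"[a-z0-9]+", text.lower())


def find_answer_index(terms: list, sentences: list) -> Optional[int]:
    """Return the index of the best answer sentence, or None if no rule matches."""
    terms = [t.lower() for t in terms]
    phrase = " ".join(terms)
    if not phrase:
        return None

    # Rule 1: the query appears in parentheses, e.g. "All Too Well (ATW)".
    for i, sentence in enumerate(sentences):
        if f"({phrase})" in sentence.lower():
            return i

    # Rule 2: the term, later followed by "a", in a sentence with is/are/to.
    pattern = re.compile(rf"\b{re.escape(phrase)}\b.*\ba\b")
    for i, sentence in enumerate(sentences):
        lowered = sentence.lower()
        if pattern.search(lowered) and DEFINITION_WORDS & set(words(lowered)):
            return i

    # Rule 3: the sentence sharing the most words with the query terms.
    best_index, best_overlap = None, 0
    for i, sentence in enumerate(sentences):
        overlap = len(set(terms) & set(words(sentence)))
        if overlap > best_overlap:
            best_index, best_overlap = i, overlap
    return best_index
```

Because the rules run in order, a parenthesised abbreviation wins over a definition-style sentence, and the overlap rule only runs as a fallback. A sentence with zero overlap is never returned, which corresponds to the "no result" case.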
If a result is found, the preceding and following sentences are returned, if available. This can provide helpful context to understand the sentence that was found to match the query.
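Returning the neighbouring sentences can be done with a slice that is clamped at the document edges. This is a minimal sketch; `with_context` is a hypothetical name, and it assumes the document has already been split into a list of sentences.

```python
def with_context(sentences: list, index: int) -> str:
    """Join the matched sentence with the one before and after it, if present."""
    start = max(0, index - 1)
    end = min(len(sentences), index + 2)
    return " ".join(sentences[start:end])


sentences = ["A.", "B.", "C.", "D."]
print(with_context(sentences, 1))  # A. B. C.
```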
The above conditions are heuristics. Thus, they will not be effective for answering all queries; a sentence matching the patterns above may not necessarily be a definition. That’s where good ranking comes in: the more relevant the documents are, the more likely it is that an appropriate answer is found.
And, of course, this feature is contingent upon there being an answer in the documents. The question “What is BBC” is not answered with a definition of BBC because I haven’t written “British Broadcasting Corporation (BBC)”. Instead, it is answered with a snippet that mentions the phrase.
There are likely many ways I can improve the system. I am actively learning!