In this guide, I walk through how to build a query language in Python. No required knowledge of query languages is required to follow this guide. You will find this article easier to understand if you have some knowledge of trees.
This guide should get you set up with all the information you need to make your own query language. With that said, query languages are a complex topic, so note that it may take some time to get your head around different parts, especially the query execution logic.
This article follows on from the “How to build a search index in Python” tutorial, which walks through how to build a search index. While this guide is not dependent on knowledge of building a search index, it walks through how the search index we will use in this guide works.
The code for this guide is available in full on GitHub.
By the end of this guide, we will have a query language for a lyric search engine that lets us run the query:
And return results like:
["The Bolter", "my tears ricochet"]
Without further ado, let’s begin!
What we will query: A search index
In the “How to build a search index in Python” tutorial, we discussed how to build a search index from scratch, using music lyrics as an example. At the end of the tutorial, we had a Python function, search()
, that we could pass a string. The function would search a list of documents, each a Python dictionary.
In the last guide, we started with the following documents:
documents = [
{
"title": "tolerate it",
"lyric": "I made you my temple, my mural, my sky"
},
{
"title": "my tears ricochet",
"lyric": "And I still talk to you when I'm screaming at the sky",
},
{
"title": "The Bolter",
"lyric": "Started with a kiss"
}
]
We built a function that lets us search the lyrics in each document like so:
The function above returns all documents whose lyrics contain the word “started”:
The search engine we built finds all documents that contain any word in a search. If we search for I still
, the search index would return:
["tolerate it", "my tears ricochet"]
This is because “I” is in both “tolerate it” and “my tears ricohet”, and “I” is in “my tears ricochet” too. With that said, we may not want to return “my tears ricohet” since it doesn’t contain the word “still”.
The merits of a query language
Wouldn’t it be cool if we could find only documents that contain all words in a query? Or contain one of two words?
We can use this as an excuse to build a query language. A query language can be used to instruct a computer how to execute a query, with particular keywords that have meaning. For example, we can build a query language that lets us say:
This language can then be processed by a Python script to retrieve lyrics that contain both “I” and “still”.
Query languages are commonly used by both programmers and humans. A common query language used by programmers (and data analysts, and many other professionals) is SQL, which lets you directly query a database. A (relatively) common query language that may be used by humans is Google’s search operators, which let you say site:reddit.com coffee
to restrict your results to Reddit.
With a query language, you can define what words have special meaning (i.e. AND
, or OR
), and use those words to either write a query for another language or to directly query a database.
Query languages can be extended for specific use cases. For example, Google’s query language provides over a dozen operators that let you do things like restrict a search to PDFs only, find pages on a website, find pages whose title matches a pattern, and more.
The logic that we write in this guide is designed to directly query a document store (the search engine built in the last guide). With that said, the same approach we discuss could be applied to turn queries in your language to another language. This is a problem that you may encounter in some search software engineering problems. For example, you could write a query language that is turned into an Elasticsearch JSON query for use in Elasticsearch querying. With such a system, you can provide a powerful, human-friendly query language that is then structured in exactly the way your index needs it.
Setup
If you haven’t read the “How to build a search index in Python” tutorial, I encourage you to read the guide.
Once you have read the guide, run the following command to clone the search engine code that we will use in this article:
git clone https://github.com/capjamesg/build-a-search-index
The approach
Our query language will consist of three main parts:
- The grammar, which dictates what a valid query looks like;
- The parser, which reads a string and parses it into a tree, and;
- The expression transformer, which reads the tree and evaluates its contents.
In the following sections, we will walk through how to build each of these components.
We are going to use lark
, a Python package for writing query and programming languages, throughout this guide. lark
implements many utilities that allows us to focus on building the functionality of our language, rather than writing complex boilerplate code.
Let’s begin!
The grammar
Before we build a query language, we need to decide on exactly what a valid query should look like. This will allow us to write a formal grammar that tells the computer how to read a query and turn it into a format we can work with.
We will implement two operators:
X AND Y
, which returns documents that contain X and Y, and;X OR Y
, which returns documents that contain X or Y.
These statements can be chained together, so X AND Y OR Z
should be valid.
Let’s define how these operators should be used. Users of our search engine should be able to write queries like:
(I AND still)
(I OR still)
(I AND still AND you)
still
The first three queries use the operators we will implement. The last query is an example of a query with no operators.
Above, we have outlined what we want queries to look like. But computers need a formal grammar to understand the structure of a query. There are several such grammars. A popular one is Earley, which lark
understands. We will use Earley in this guide.
To write a grammar, we need to turn our example statements into general patterns that follow the Earley syntax. Lark will then process any string (i.e. (I AND still)
) and turn it into an Abstract Syntax Tree (AST). The AST is what we will use to understand our query and retrieve search results that conform to the instructions in the query.
Let’s define our grammar:
from lark import Lark, Transformer
from .search_engine import search, documents_by_title
grammar = """
start: query
query: ("(" WORD (OPERATOR WORD)* ")") | WORD
OPERATOR: "AND NOT" | "AND" | "OR"
WORD: /[a-z0-9_ ]+/
%import common.WS
%ignore WS
"""
parser = Lark(grammar)
There are a lot of symbols in the grammar
code, so let’s talk through it step by step.
First, we define start
, which is the entry point.
We define a WORD
as any lowercase letter, or any number or an underscore. WORD
may contain one or more words, separated by a space. We do not support uppercase letters in WORD
because:
- Our search index already normalizes all terms to lowercase before querying, which means having the original uppercase would have no result, and;
- We need a way to distinguish operators from words in our grammar.
We define OPERATOR
as a list of three potential tokens:
Any string that matches these descriptions exactly is turned into an OPERATOR
token. This will be treated in a particular way in the next section, when we use Lark to generate an Abstract Syntax Tree from our grammar.
We then define query
. A query must be, at minimum, a word or phrase like:
This is defined by:
The WORD not in parentheses says that a query must contain at least one word.
A query can also follow:
("(" WORD (OPERATOR WORD)* ")")
This means that a query can be in parentheses. A query that follows these instructions can be either:
(WORD)
(WORD OPERATOR WORD)
An operator must come after the first phrase.
For example, the following query would be valid:
But this one would not be:
Since we used the asterisk after (OPERATOR WORD)*
, we can include as many valid combinations of (WORD OPERATOR WORD)
as we want, like:
We also say that we want to ignore whitespace in parsing. This ensures that (I AND still)
is still treated as a valid query. (I love AND still)
is still valid.
%import common.WS
%ignore WS
With this grammar, we can turn a string into an Abstract Syntax Tree. Lark has a function that lets us see how a tree looks.
Because words need to be in lowercase to work with our grammar, we need to write a transformation function that changes all words to lowercase unless they are reserved tokens (AND
or OR
):
def preprocess(query):
RESERVED_TERMS = ["AND", "OR"]
terms = query.split()
terms = [t.lower() if t not in RESERVED_TERMS else t for t in terms]
return " ".join(terms)
This function turns:
Into:
This has no impact on our search index.
Let’s use this function to see how an example query is parsed with our grammar:
result = parser.parse(preprocess()"(I AND still)")
print(result.pretty())
Our code returns:
We have a start
entry point with one query: I AND still
.
Evaluating the tree
In the last section, we wrote a grammar that reads a string of text and turns the text into an Abstract Syntax Tree (AST). This tree can then be evaluated to return query results.
For our language, we need to evaluate the AST bottoms-up. This means we will start with the left-most token at the bottom of our tree, work our way right, and do so for every token at the bottom level of the tree. We move up to the next level, and so on, until we reach root
, which indicates we have reached the “root” (top) of the tree.
Lark has a utility to read a parsed AST and manipulate the tree: the Transformer
class. You can define a sub-class that inherits from Transformer
and defines how the tree should be manipulated. By default, the tree is changed bottoms-up.
Let’s define the structure of our transformer:
class ExpressionInterpreter(Transformer):
def query(self, items):
pass
def OPERATOR(self, token):
pass
def start(self, items):
return self.query(items)
Classes that inherit from Transformer
use the names of methods in the class to determine how to change specific parts of the AST. The names of methods should match up to tokens in the grammar.
For example, a method called query
will run on any query
token in the tree; a method called OPERATOR
will run on any OPERATOR
token.
Let’s start by defining how an OPERATOR
should behave:
def OPERATOR(self, token):
"""
Transforms OPERATOR tokens into their respective Python set functions.
"""
if token == "AND":
return set.intersection
elif token == "OR":
return set.union
AND
and OR
are strings in our search query. We need to turn them into functions that do something. This code takes an OPERATOR
token and turns it into a Python set function: set.intersection for AND
and set.union for OR
.
Now that we have the OPERATOR
code, we can write the query
function. This will take a statement and use our search engine from the last tutorial to find results that match our query.
Add the following code to your Python script:
def query(self, items):
DOCUMENT_SEARCH_KEY = "lyric"
if len(items) > 1:
left = search(items[0])
right = search(items[2])
operand = items[1]
left_doc_keys = set([doc[DOCUMENT_SEARCH_KEY] for doc in left])
right_doc_keys = set([doc[DOCUMENT_SEARCH_KEY] for doc in right])
doc_keys_after_query = operand(left_doc_keys, right_doc_keys)
result = [documents_by_title[title] for title in doc_keys_after_query]
return result
elif len(items) == 1 and isinstance(items[0], str):
return search(items[0])
else:
return items[0]
Let’s look back to our syntax tree from earlier:
The items
variable in our function corresponds to I
, AND
, and still
: the three values enclosed within query
.
If query
has more than one item, it means there is an OPERAND
. In this case, we need to process the operand. If query
has one item, it means the user has provided a string query (i.e. i still
). In this case, we do not need to process any operands, since there are not any; we can fall back onto the default behaviour of our search engine that finds all documents that mention any word.
If query
has more than one item, we break out items
into three variables:
left
: The first string.operand
: The operand.right
: The second string.
If left or right are strings, we pass them through the search()
function. This will return the results related to our query. In the code above, you may notice that we don’t pass left
or right
through search()
if they are not strings. I will talk about why this is important later in this section.
We then find the lyric
key in all documents that match the search queries on the left and right sides. Then we apply the OPERAND
function that was set for the operand.
Consider the query (I AND still)
. This will go through the following steps:
- Search results for
I
are found. - Search results for
still
are found. - The
set.intersection(left, right)
function is applied.
If we used OR
, the set.union(left, right)
function is applied.
The function application happens in this code:
doc_keys_after_query = operand(left_doc_keys, right_doc_keys)
We then get the full documents (not just the lyric
key we have been working with):
result = [documents_by_title[title] for title in doc_keys_after_query]
The documents are then returned so they can be processed higher in the tree.
Running a query
We now have a tree parser, which means we have all we need to run a query. You can use the following code to run a query:
result = parser.parse(preprocess("(I AND still)"))
ast = ExpressionInterpreter().transform(result)
print([doc["title"] for doc in ast])
This code returns:
Our code works!
At the start of this guide, our search engine returned two results for this query: tolerate it
and my tears ricochet
. But my tears ricochet
is the only record that contains both “I” and “still”.
Let’s try another query:
result = parser.parse(preprocess("(I AND still AND talk)"))
ast = ExpressionInterpreter().transform(result)
print([doc["title"] for doc in ast])
This query returns my tears richcoet
too.
If we run an invalid query, our code returns an error.
Consider the following query, which ends with an AND
operator without a word following:
result = parser.parse(preprocess("(I AND still AND talk AND)"))
This query raises a Python exception. This is because the string does not conform to our grammar, so it cannot be parsed.
Expanding the query engine: Allowing combinations of queries (Bonus)
Our grammar currently allows us to write single queries, like (I AND still)
. But what if we want to combine two queries together?
We can do this with a small modification to the query
part of our grammar:
query: ("(" WORD (OPERATOR WORD)* ")") | (query OPERATOR query)* | WORD
All of the queries we wrote in earlier sections are still valid, but we have added a new feature: an operator can be used between two queries. That means the following query is valid:
This also means we can nest queries within queries, if we want:
(I AND (still AND talk)) OR kiss
With that said, changing the grammar isn’t enough: we also need to change our tree processor to understand this change.
Update your query
code to the following:
def query(self, items):
DOCUMENT_SEARCH_KEY = "lyric"
if len(items) > 1:
if isinstance(items[0], str):
left = search(items[0])
else:
left = items[0]
if isinstance(items[2], str):
right = search(items[2])
else:
right = items[2]
operand = items[1]
left_doc_keys = set([doc[DOCUMENT_SEARCH_KEY] for doc in left])
right_doc_keys = set([doc[DOCUMENT_SEARCH_KEY] for doc in right])
doc_keys_after_query = operand(left_doc_keys, right_doc_keys)
result = [documents_by_title[title] for title in doc_keys_after_query]
return result
elif len(items) == 1 and isinstance(items[0], str):
return search(items[0])
else:
return items[0]
Let’s talk about the changes above and what they mean with reference to an example.
Consider the query:
Here is the tree for this query:
start
query
I
AND
still
OR
query
love
The transformer will evaluate (I AND still)
, then love
. But because query()
returns results, we our code needs to be able to understand those results when they go up the tree. When love
returns a result, the tree needs to understand it when it is parsing the X OR Y
query (the results from evaluating (I AND still)
and love
).
If left or right are lists, this means they have already been processed by query
. We need to account for this to support combining queries (i.e. (I AND still) OR love
). In the code above, we say that if left
or right
is a string, it should be passed to the search()
function to retrieve results. But if left
or right
is not a string, it means left
or right
is results. These can thus be left alone and combined using the existing logic we have.
This is confusing to understand, so let’s add a few print statements to our code:
def query(self, items):
print("Tree before querying", items)
if len(items) > 1:
if isinstance(items[0], str):
left = search(items[0])
else:
left = items[0]
if isinstance(items[2], str):
right = search(items[2])
else:
right = items[2]
operand = items[1]
print("Processed left value", left)
print("Processed right value", right)
left_doc_keys = set([doc[DOCUMENT_SEARCH_KEY] for doc in left])
right_doc_keys = set([doc[DOCUMENT_SEARCH_KEY] for doc in right])
doc_keys_after_query = operand(left_doc_keys, right_doc_keys)
result = [documents_by_title[title] for title in doc_keys_after_query]
print("result", result, "\n")
return result
elif len(items) == 1 and isinstance(items[0], str):
print("Processed single string value", items[0], "\n")
return search(items[0])
else:
return items[0]
When run on the query (I AND still) OR love
, this code returns:
Processed left value [{'title': 'tolerate it', 'lyric': 'I made you my temple, my mural, my sky'}, {'title': 'my tears ricochet', 'lyric': "And I still talk to you when I'm screaming at the sky"}]
Processed right value [{'title': 'my tears ricochet', 'lyric': "And I still talk to you when I'm screaming at the sky"}]
result [{'title': 'my tears ricochet', 'lyric': "And I still talk to you when I'm screaming at the sky"}]
Tree before querying [Token('WORD', 'kiss')]
Processed single string value kiss
Tree before querying [[{'title': 'my tears ricochet', 'lyric': "And I still talk to you when I'm screaming at the sky"}], method 'union' of 'set' objects, [{'title': 'The Bolter', 'lyric': 'Started with a kiss'}]]
Processed left value [{'title': 'my tears ricochet', 'lyric': "And I still talk to you when I'm screaming at the sky"}]
Processed right value [{'title': 'The Bolter', 'lyric': 'Started with a kiss'}]
result [{'title': 'The Bolter', 'lyric': 'Started with a kiss'}, {'title': 'my tears ricochet', 'lyric': "And I still talk to you when I'm screaming at the sky"}]
Tree before querying [[{'title': 'The Bolter', 'lyric': 'Started with a kiss'}, {'title': 'my tears ricochet', 'lyric': "And I still talk to you when I'm screaming at the sky"}]]
['The Bolter', 'my tears ricochet']
Let’s talk through how our code processes the tree.
First, all of the tokens at the bottom of the tree are parsed. Strings (i.e. I
) are ignored, and OPERATORS
are turned into functions.
We then parse I AND still
, where AND
has been transformed into the set.intersection function (<method 'intersection' of 'set' objects>
).
Because left
and right
are strings, the processed value is a list of documents that have been returned by our search engine.
Our code then applies the set.intersection
function to left
and right
. This returns a single document: my tears ricochet
.
I AND still
returns my tears ricochet
.
Next, our code evaluates kiss
. Since it is a single string, it is passed directly to search()
and results are returned. kiss
is not in any of our search documents, so no results are returned.
We now have the tree:
[[{'title': 'my tears ricochet', 'lyric': "And I still talk to you when I'm screaming at the sky"}], , []]
The first item corresponds to the result from I AND still.
The second item is the set.union
method, which was computed when the OR
operator was sent through the OPERAND
function. The third item is our list of lyrics from the kiss
query.
Since left
and right
are both lists, they are not passed through search()
. Instead, The song titles are retrieved from the documents, then set.union(left, right)
is run to find what lyrics are in either left
or right
. This returns:
["The Bolter", "my tears ricochet"]
Our code successfully finds all lyrics that contain I AND still
or kiss
. The two songs that satisfy that criteria are “The Bolter” and “my tears ricochet.”
Conclusion
In this guide, we built a query language for a search engine in Python. We defined a language that lets us use AND
and OR
to assemble search queries. We can combine statements to write more advanced queries.
To build our search engine query language, we first defined an Earley grammar. This grammar is used by Lark to turn strings into Abstract Syntax Trees (ASTs). We then wrote a Lark transformer that reads the syntax tree and processes it from the bottom upward using our own logic. This transformer understands all tokens in our tree – WORD
(which are automatically turned into) strings, query
, and OPERAND
.
This query language can be extended in many ways to allow for more advanced queries. I have a few bonus challenges that you can use to extend this project:
- Add an
AND NOT
operator that returns results that do not contain a specific phrase. - Add the ability to apply expressions like greater than or less than to values in different keys in each document (i.e.
my tears AND song.listens > 100
could return all songs with titles that match “my tears” and that have been listened to more than 100 times). - Write a test suite that validates that our code is working properly.
Let your imagination run wild when it comes to ideas. Remember, first define how the grammar should look, then implement the grammar, validate that the syntax tree looks good using the pretty()
function we used in this guide, then implement the logic to evaluate your tree. Validating that the syntax tree looks good can save a lot of frustration; it can take a few tries before the tree looks the way that you want.
You can find the code for this guide on GitHub.