How to build a query language in Python


In this guide, I walk through how to build a query language in Python. No required knowledge of query languages is required to follow this guide. You will find this article easier to understand if you have some knowledge of trees.

This guide should get you set up with all the information you need to make your own query language. With that said, query languages are a complex topic, so note that it may take some time to get your head around different parts, especially the query execution logic.

This article follows on from the “How to build a search index in Python” tutorial, which walks through how to build a search index. While this guide is not dependent on knowledge of building a search index, it walks through how the search index we will use in this guide works.

The code for this guide is available in full on GitHub.

By the end of this guide, we will have a query language for a lyric search engine that lets us run the query:

And return results like:

["The Bolter", "my tears ricochet"]

Without further ado, let’s begin!

What we will query: A search index

In the “How to build a search index in Python” tutorial, we discussed how to build a search index from scratch, using music lyrics as an example. At the end of the tutorial, we had a Python function, search(), that we could pass a string. The function would search a list of documents, each a Python dictionary.

In the last guide, we started with the following documents:

documents = [
    {
        "title": "tolerate it",
        "lyric": "I made you my temple, my mural, my sky"
    },
    {
        "title": "my tears ricochet",
        "lyric": "And I still talk to you when I'm screaming at the sky",
    },
    {
        "title": "The Bolter",
        "lyric": "Started with a kiss"
    }
]

We built a function that lets us search the lyrics in each document like so:

The function above returns all documents whose lyrics contain the word “started”:

The search engine we built finds all documents that contain any word in a search. If we search for I still, the search index would return:

["tolerate it", "my tears ricochet"]

This is because “I” is in both “tolerate it” and “my tears ricohet”, and “I” is in “my tears ricochet” too. With that said, we may not want to return “my tears ricohet” since it doesn’t contain the word “still”.

The merits of a query language

Wouldn’t it be cool if we could find only documents that contain all words in a query? Or contain one of two words?

We can use this as an excuse to build a query language. A query language can be used to instruct a computer how to execute a query, with particular keywords that have meaning. For example, we can build a query language that lets us say:

This language can then be processed by a Python script to retrieve lyrics that contain both “I” and “still”.

Query languages are commonly used by both programmers and humans. A common query language used by programmers (and data analysts, and many other professionals) is SQL, which lets you directly query a database. A (relatively) common query language that may be used by humans is Google’s search operators, which let you say site:reddit.com coffee to restrict your results to Reddit.

With a query language, you can define what words have special meaning (i.e. AND, or OR), and use those words to either write a query for another language or to directly query a database.

Query languages can be extended for specific use cases. For example, Google’s query language provides over a dozen operators that let you do things like restrict a search to PDFs only, find pages on a website, find pages whose title matches a pattern, and more.

The logic that we write in this guide is designed to directly query a document store (the search engine built in the last guide). With that said, the same approach we discuss could be applied to turn queries in your language to another language. This is a problem that you may encounter in some search software engineering problems. For example, you could write a query language that is turned into an Elasticsearch JSON query for use in Elasticsearch querying. With such a system, you can provide a powerful, human-friendly query language that is then structured in exactly the way your index needs it.

Setup

If you haven’t read the “How to build a search index in Python” tutorial, I encourage you to read the guide.

Once you have read the guide, run the following command to clone the search engine code that we will use in this article:

git clone https://github.com/capjamesg/build-a-search-index

The approach

Our query language will consist of three main parts:

  1. The grammar, which dictates what a valid query looks like;
  2. The parser, which reads a string and parses it into a tree, and;
  3. The expression transformer, which reads the tree and evaluates its contents.

In the following sections, we will walk through how to build each of these components.

We are going to use lark, a Python package for writing query and programming languages, throughout this guide. lark implements many utilities that allows us to focus on building the functionality of our language, rather than writing complex boilerplate code.

Let’s begin!

The grammar

Before we build a query language, we need to decide on exactly what a valid query should look like. This will allow us to write a formal grammar that tells the computer how to read a query and turn it into a format we can work with.

We will implement two operators:

  • X AND Y, which returns documents that contain X and Y, and;
  • X OR Y, which returns documents that contain X or Y.

These statements can be chained together, so X AND Y OR Z should be valid.

Let’s define how these operators should be used. Users of our search engine should be able to write queries like:

(I AND still)
(I OR still)
(I AND still AND you)
still

The first three queries use the operators we will implement. The last query is an example of a query with no operators.

Above, we have outlined what we want queries to look like. But computers need a formal grammar to understand the structure of a query. There are several such grammars. A popular one is Earley, which lark understands. We will use Earley in this guide.

To write a grammar, we need to turn our example statements into general patterns that follow the Earley syntax. Lark will then process any string (i.e. (I AND still)) and turn it into an Abstract Syntax Tree (AST). The AST is what we will use to understand our query and retrieve search results that conform to the instructions in the query.

Let’s define our grammar:

from lark import Lark, Transformer
from .search_engine import search, documents_by_title

grammar = """
start: query
query: ("(" WORD (OPERATOR WORD)* ")") | WORD
OPERATOR: "AND NOT" | "AND" | "OR"

WORD: /[a-z0-9_ ]+/

%import common.WS
%ignore WS
"""

parser = Lark(grammar)

There are a lot of symbols in the grammar code, so let’s talk through it step by step.

First, we define start, which is the entry point.

We define a WORD as any lowercase letter, or any number or an underscore. WORD may contain one or more words, separated by a space. We do not support uppercase letters in WORD because:

  1. Our search index already normalizes all terms to lowercase before querying, which means having the original uppercase would have no result, and;
  2. We need a way to distinguish operators from words in our grammar.

We define OPERATOR as a list of three potential tokens:

Any string that matches these descriptions exactly is turned into an OPERATOR token. This will be treated in a particular way in the next section, when we use Lark to generate an Abstract Syntax Tree from our grammar.

We then define query. A query must be, at minimum, a word or phrase like:

This is defined by:

The WORD not in parentheses says that a query must contain at least one word.

A query can also follow:

("(" WORD (OPERATOR WORD)* ")")

This means that a query can be in parentheses. A query that follows these instructions can be either:

(WORD)
(WORD OPERATOR WORD)

An operator must come after the first phrase.

For example, the following query would be valid:

But this one would not be:

Since we used the asterisk after (OPERATOR WORD)*, we can include as many valid combinations of (WORD OPERATOR WORD) as we want, like:

We also say that we want to ignore whitespace in parsing. This ensures that (I AND still) is still treated as a valid query. (I love AND still) is still valid.

%import common.WS
%ignore WS

With this grammar, we can turn a string into an Abstract Syntax Tree. Lark has a function that lets us see how a tree looks.

Because words need to be in lowercase to work with our grammar, we need to write a transformation function that changes all words to lowercase unless they are reserved tokens (AND or OR):

def preprocess(query):
    RESERVED_TERMS = ["AND", "OR"]

    terms = query.split()
    terms = [t.lower() if t not in RESERVED_TERMS else t for t in terms]

    return " ".join(terms)

This function turns:

Into:

This has no impact on our search index.

Let’s use this function to see how an example query is parsed with our grammar:

result = parser.parse(preprocess()"(I AND still)")
print(result.pretty())

Our code returns:

We have a start entry point with one query: I AND still.

Evaluating the tree

In the last section, we wrote a grammar that reads a string of text and turns the text into an Abstract Syntax Tree (AST). This tree can then be evaluated to return query results.

For our language, we need to evaluate the AST bottoms-up. This means we will start with the left-most token at the bottom of our tree, work our way right, and do so for every token at the bottom level of the tree. We move up to the next level, and so on, until we reach root, which indicates we have reached the “root” (top) of the tree.

Lark has a utility to read a parsed AST and manipulate the tree: the Transformer class. You can define a sub-class that inherits from Transformer and defines how the tree should be manipulated. By default, the tree is changed bottoms-up.

Let’s define the structure of our transformer:

class ExpressionInterpreter(Transformer):
    def query(self, items):
        pass

    def OPERATOR(self, token):
        pass

    def start(self, items):
        return self.query(items)

Classes that inherit from Transformer use the names of methods in the class to determine how to change specific parts of the AST. The names of methods should match up to tokens in the grammar.

For example, a method called query will run on any query token in the tree; a method called OPERATOR will run on any OPERATOR token.

Let’s start by defining how an OPERATOR should behave:

def OPERATOR(self, token):
    """
    Transforms OPERATOR tokens into their respective Python set functions.
    """
    if token == "AND":
        return set.intersection
    elif token == "OR":
        return set.union

AND and OR are strings in our search query. We need to turn them into functions that do something. This code takes an OPERATOR token and turns it into a Python set function: set.intersection for AND and set.union for OR.

Now that we have the OPERATOR code, we can write the query function. This will take a statement and use our search engine from the last tutorial to find results that match our query.

Add the following code to your Python script:

def query(self, items):
    DOCUMENT_SEARCH_KEY = "lyric"
    if len(items) > 1:
        left = search(items[0])
        right = search(items[2])

        operand = items[1]

        left_doc_keys = set([doc[DOCUMENT_SEARCH_KEY] for doc in left])
        right_doc_keys = set([doc[DOCUMENT_SEARCH_KEY] for doc in right])

        doc_keys_after_query = operand(left_doc_keys, right_doc_keys)

        result = [documents_by_title[title] for title in doc_keys_after_query]

        return result
    elif len(items) == 1 and isinstance(items[0], str):
        return search(items[0])
    else:
        return items[0]

Let’s look back to our syntax tree from earlier:

The items variable in our function corresponds to I, AND, and still: the three values enclosed within query.

If query has more than one item, it means there is an OPERAND. In this case, we need to process the operand. If query has one item, it means the user has provided a string query (i.e. i still). In this case, we do not need to process any operands, since there are not any; we can fall back onto the default behaviour of our search engine that finds all documents that mention any word.

If query has more than one item, we break out items into three variables:

  • left: The first string.
  • operand: The operand.
  • right: The second string.

If left or right are strings, we pass them through the search() function. This will return the results related to our query. In the code above, you may notice that we don’t pass left or right through search() if they are not strings. I will talk about why this is important later in this section.

We then find the lyric key in all documents that match the search queries on the left and right sides. Then we apply the OPERAND function that was set for the operand.

Consider the query (I AND still) . This will go through the following steps:

  • Search results for I are found.
  • Search results for still are found.
  • The set.intersection(left, right) function is applied.

If we used OR, the set.union(left, right) function is applied.

The function application happens in this code:

doc_keys_after_query = operand(left_doc_keys, right_doc_keys)

We then get the full documents (not just the lyric key we have been working with):

result = [documents_by_title[title] for title in doc_keys_after_query]

The documents are then returned so they can be processed higher in the tree.

Running a query

We now have a tree parser, which means we have all we need to run a query. You can use the following code to run a query:

result = parser.parse(preprocess("(I AND still)"))

ast = ExpressionInterpreter().transform(result)

print([doc["title"] for doc in ast])

This code returns:

Our code works!

At the start of this guide, our search engine returned two results for this query: tolerate it and my tears ricochet. But my tears ricochet is the only record that contains both “I” and “still”.

Let’s try another query:

result = parser.parse(preprocess("(I AND still AND talk)"))

ast = ExpressionInterpreter().transform(result)

print([doc["title"] for doc in ast])

This query returns my tears richcoet too.

If we run an invalid query, our code returns an error.

Consider the following query, which ends with an AND operator without a word following:

result = parser.parse(preprocess("(I AND still AND talk AND)"))

This query raises a Python exception. This is because the string does not conform to our grammar, so it cannot be parsed.

Expanding the query engine: Allowing combinations of queries (Bonus)

Our grammar currently allows us to write single queries, like (I AND still). But what if we want to combine two queries together?

We can do this with a small modification to the query part of our grammar:

query: ("(" WORD (OPERATOR WORD)* ")") | (query OPERATOR query)* | WORD

All of the queries we wrote in earlier sections are still valid, but we have added a new feature: an operator can be used between two queries. That means the following query is valid:

This also means we can nest queries within queries, if we want:

(I AND (still AND talk)) OR kiss

With that said, changing the grammar isn’t enough: we also need to change our tree processor to understand this change.

Update your query code to the following:

def query(self, items):
    DOCUMENT_SEARCH_KEY = "lyric"
    if len(items) > 1:
        if isinstance(items[0], str):
            left = search(items[0])
        else:
            left = items[0]

        if isinstance(items[2], str):
            right = search(items[2])
        else:
            right = items[2]

        operand = items[1]

        left_doc_keys = set([doc[DOCUMENT_SEARCH_KEY] for doc in left])
        right_doc_keys = set([doc[DOCUMENT_SEARCH_KEY] for doc in right])

        doc_keys_after_query = operand(left_doc_keys, right_doc_keys)

        result = [documents_by_title[title] for title in doc_keys_after_query]

        return result
    elif len(items) == 1 and isinstance(items[0], str):
        return search(items[0])
    else:
        return items[0]

Let’s talk about the changes above and what they mean with reference to an example.

Consider the query:

Here is the tree for this query:

start
  query
    I
    AND
    still
  OR
  query
  	love

The transformer will evaluate (I AND still), then love. But because query() returns results, we our code needs to be able to understand those results when they go up the tree. When love returns a result, the tree needs to understand it when it is parsing the X OR Y query (the results from evaluating (I AND still) and love).

If left or right are lists, this means they have already been processed by query. We need to account for this to support combining queries (i.e. (I AND still) OR love). In the code above, we say that if left or right is a string, it should be passed to the search() function to retrieve results. But if left or right is not a string, it means left or right is results. These can thus be left alone and combined using the existing logic we have.

This is confusing to understand, so let’s add a few print statements to our code:

def query(self, items):
    print("Tree before querying", items)
    if len(items) > 1:
        if isinstance(items[0], str):
            left = search(items[0])
        else:
            left = items[0]

        if isinstance(items[2], str):
            right = search(items[2])
        else:
            right = items[2]

        operand = items[1]

        print("Processed left value", left)
        print("Processed right value", right)

        left_doc_keys = set([doc[DOCUMENT_SEARCH_KEY] for doc in left])
        right_doc_keys = set([doc[DOCUMENT_SEARCH_KEY] for doc in right])

        doc_keys_after_query = operand(left_doc_keys, right_doc_keys)

        result = [documents_by_title[title] for title in doc_keys_after_query]

        print("result", result, "\n")

        return result
    elif len(items) == 1 and isinstance(items[0], str):
        print("Processed single string value", items[0], "\n")
        return search(items[0])
    else:
        return items[0]

When run on the query (I AND still) OR love, this code returns:

Processed left value [{'title': 'tolerate it', 'lyric': 'I made you my temple, my mural, my sky'}, {'title': 'my tears ricochet', 'lyric': "And I still talk to you when I'm screaming at the sky"}]
Processed right value [{'title': 'my tears ricochet', 'lyric': "And I still talk to you when I'm screaming at the sky"}]
result [{'title': 'my tears ricochet', 'lyric': "And I still talk to you when I'm screaming at the sky"}] 

Tree before querying [Token('WORD', 'kiss')]
Processed single string value kiss 

Tree before querying [[{'title': 'my tears ricochet', 'lyric': "And I still talk to you when I'm screaming at the sky"}], method 'union' of 'set' objects, [{'title': 'The Bolter', 'lyric': 'Started with a kiss'}]]
Processed left value [{'title': 'my tears ricochet', 'lyric': "And I still talk to you when I'm screaming at the sky"}]
Processed right value [{'title': 'The Bolter', 'lyric': 'Started with a kiss'}]
result [{'title': 'The Bolter', 'lyric': 'Started with a kiss'}, {'title': 'my tears ricochet', 'lyric': "And I still talk to you when I'm screaming at the sky"}] 

Tree before querying [[{'title': 'The Bolter', 'lyric': 'Started with a kiss'}, {'title': 'my tears ricochet', 'lyric': "And I still talk to you when I'm screaming at the sky"}]]
['The Bolter', 'my tears ricochet']

Let’s talk through how our code processes the tree.

First, all of the tokens at the bottom of the tree are parsed. Strings (i.e. I) are ignored, and OPERATORS are turned into functions.

We then parse I AND still, where AND has been transformed into the set.intersection function (<method 'intersection' of 'set' objects>).

Because left and right are strings, the processed value is a list of documents that have been returned by our search engine.

Our code then applies the set.intersection function to left and right. This returns a single document: my tears ricochet.

I AND still returns my tears ricochet.

Next, our code evaluates kiss. Since it is a single string, it is passed directly to search() and results are returned. kiss is not in any of our search documents, so no results are returned.

We now have the tree:

[[{'title': 'my tears ricochet', 'lyric': "And I still talk to you when I'm screaming at the sky"}], , []]

The first item corresponds to the result from I AND still. The second item is the set.union method, which was computed when the OR operator was sent through the OPERAND function. The third item is our list of lyrics from the kiss query.

Since left and right are both lists, they are not passed through search(). Instead, The song titles are retrieved from the documents, then set.union(left, right) is run to find what lyrics are in either left or right. This returns:

["The Bolter", "my tears ricochet"]

Our code successfully finds all lyrics that contain I AND still or kiss. The two songs that satisfy that criteria are “The Bolter” and “my tears ricochet.”

Conclusion

In this guide, we built a query language for a search engine in Python. We defined a language that lets us use AND and OR to assemble search queries. We can combine statements to write more advanced queries.

To build our search engine query language, we first defined an Earley grammar. This grammar is used by Lark to turn strings into Abstract Syntax Trees (ASTs). We then wrote a Lark transformer that reads the syntax tree and processes it from the bottom upward using our own logic. This transformer understands all tokens in our tree – WORD (which are automatically turned into) strings, query, and OPERAND.

This query language can be extended in many ways to allow for more advanced queries. I have a few bonus challenges that you can use to extend this project:

  1. Add an AND NOT operator that returns results that do not contain a specific phrase.
  2. Add the ability to apply expressions like greater than or less than to values in different keys in each document (i.e. my tears AND song.listens > 100 could return all songs with titles that match “my tears” and that have been listened to more than 100 times).
  3. Write a test suite that validates that our code is working properly.

Let your imagination run wild when it comes to ideas. Remember, first define how the grammar should look, then implement the grammar, validate that the syntax tree looks good using the pretty() function we used in this guide, then implement the logic to evaluate your tree. Validating that the syntax tree looks good can save a lot of frustration; it can take a few tries before the tree looks the way that you want.

You can find the code for this guide on GitHub.



Source link

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top