Building a NoSQL database in Python


This post is not an authority on how to make a NoSQL database, but rather an account of how I made mine. I am still actively learning about this topic, but I wanted to write down my thoughts to help me process what I am learning.

I have used NoSQL databases for search projects in the past. A NoSQL database like Elasticsearch stores documents in (usually) a single table. The database then provides a query interface through which you can retrieve records.

Unlike a relational database (i.e. MySQL or PostgreSQL), NoSQL databases do not store explicit relations between records. Thus, there is no concept of a join in NoSQL; primary and foreign keys are not used, either. Instead, relations are dictated by how you model the shape of data, rather than what tables you create.

One of my favourite ways to learn how programming systems work is to read up on the topic and then try to implement the system myself. After several hours of thinking and coding, I now have a fully-functional NoSQL database that I have open sourced on GitHub. The database is implemented in Python.

In this blog post, I am going to talk through how my database works at a high level.

There are likely many optimizations that can be made to improve query speed. I was inspired by the Advanced Design Patterns for DynamoDB talk by AWS, which helped me reason about how dictionaries and indices could be used to allow for efficient lookup of resources in a NoSQL database. I recommend that talk for a deep dive into the internals of NoSQL databases.

The core data structure

My NoSQL database uses a document-based structure. The core of the index is a Python OrderedDict. This is a dictionary that preserves insertion order; the most recently inserted item is always at the end.

Each record in the index has:

  1. A partition key, which is used to determine what partition the data should be stored in; and
  2. The data in the record.

Here is an example record of a song lyric, where the partition key is the title of the song:

{
    "tolerate it": {
        "title": "tolerate it",
        "lyric": "my mural, my sky",
        "uuid": "..."
    }
}

In NoSQL data structures, partition keys do not have to be unique, but they should have a high cardinality. This means that the values should be diverse and varied, such as customer IDs in an ordering system. In a production system, a database may be split up into shards and saved across several servers. If partition keys have high cardinality, requests can be distributed more equally across the shards.

In addition to a partition key, sort keys are a common part of NoSQL database architectures. These keys can be queried to retrieve records. Sort keys are decided by modeling how you want to query your database. You can then choose a sort key for the main index, and build index views that can be used for efficiently querying different fields. The index views, called Global Secondary Indices (GSIs), refer back to the original index to prevent duplicating data.

Global Secondary Indices (GSIs)

I am still actively learning about how this part of a NoSQL database works. If I have made any mistakes, please let me know by emailing me!

In my database implementation, I decided the root index would have no sort key. Instead, GSIs can be built that let you efficiently query different columns in documents (i.e. lyric in the document above): a secondary index is built specifically for each column you want to search.

For the above document, I may create a GSI for the lyric column that looks like this:

gsis = {"lyric": {"my mural, my sky": ["uuid...", "uuid..."], ...}, ...}

Without the GSI, I would have to iterate over all documents in the main index and run an operation (i.e. to check if the query string is in the lyric, or to check if the query string is exactly equal to the lyric).

With the GSI, we can build an index in which all of the documents that share the same lyric are stored under a single key, so we don’t have to check duplicate values twice. Indeed, the efficiency of my GSI implementation would be especially clear when:

  1. You have a large dataset.
  2. You are doing an equals check between a string in a record and a query.
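
As an illustration, here is a minimal sketch of how a GSI for the lyric column could be built and used for an equality check. It assumes, for simplicity, a main index keyed by document UUID, and the helper name and sample data are hypothetical; the real implementation lives in the repository.

from collections import defaultdict

def build_gsi(main_index, column):
    # map each unique value in `column` to the UUIDs of the documents
    # that contain it, so duplicate values are only stored once
    gsi = defaultdict(list)
    for document_id, document in main_index.items():
        gsi[document[column]].append(document_id)
    return gsi

# sketch data: a main index keyed by UUID
main_index = {
    "uuid-1": {"title": "tolerate it", "lyric": "my mural, my sky", "uuid": "uuid-1"},
}

# an equality check is now one dictionary lookup in the GSI, followed by
# one main index lookup per matching UUID
lyric_gsi = build_gsi(main_index, "lyric")
matches = [main_index[document_id] for document_id in lyric_gsi.get("my mural, my sky", [])]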

The data structure I used could be modified to improve search operations for other types, too. For example, a trie could be computed across all records. Tries are deeply nested dictionaries in which each letter of a word (or each word in a longer piece of text) is nested under the previous one. They are commonly used for autocomplete, where you want to efficiently find all words that start with a given string.

In the context of this problem, a trie could let you find all records whose keys start with “my” by doing two dictionary lookups, rather than iterating over every record and checking if the first two characters are equal to “my”.

The trie would need to be limited to a certain number of starting characters to avoid taking up too much memory.
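
To make the idea more concrete, here is a rough sketch of a prefix trie over record keys, limited to a fixed number of starting characters. This is an illustration only, not part of my implementation, and the helper names are made up.

def build_prefix_trie(values, max_depth=3):
    # nest dictionaries by character, keeping only the first `max_depth`
    # characters of each value so the trie does not grow too large
    trie = {}
    for value in values:
        node = trie
        for character in value[:max_depth]:
            node = node.setdefault(character, {})
        node.setdefault("_values", []).append(value)
    return trie

def find_with_prefix(trie, prefix):
    # one dictionary lookup per character in the prefix, rather than a scan
    # over every record; prefixes longer than the trie depth would need a
    # follow-up filter over the candidates
    node = trie
    for character in prefix:
        node = node.get(character)
        if node is None:
            return []
    candidates = []
    stack = [node]
    while stack:
        current = stack.pop()
        candidates.extend(current.get("_values", []))
        stack.extend(child for key, child in current.items() if key != "_values")
    return candidates

trie = build_prefix_trie(["tolerate it", "my tears ricochet", "The Bolter"])
print(find_with_prefix(trie, "my"))  # ['my tears ricochet']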

Includes checks (i.e. checking if a lyric includes “sky”) would be more expensive. I am unsure of the optimal way to do this at scale beyond the approach I implement below, which is to manually check all GSI keys (unique values in a column) for the substring using an efficient string search algorithm. This is still faster than checking every entry, however, when some entries are duplicates.
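
As a rough sketch of that approach, a contains check could scan each unique value in a GSI exactly once, using the pybmoore library that appears later in this post (this assumes pybmoore.search(pattern, text) returns a list of match positions, and a GSI shaped like the lyric example above):

import pybmoore

def contains_lookup(gsi, substring):
    # scan each unique column value once; documents that share a value are
    # found through the UUID list without re-scanning the text
    matching_ids = []
    for value, document_ids in gsi.items():
        if pybmoore.search(substring, value):
            matching_ids.extend(document_ids)
    return matching_ids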

GSIs do need to be pre-computed and kept in sync, and take up more memory, but this trade-off is acceptable owing to how GSIs speed up the query process at run time. I am unsure of how they scale across servers in a distributed environment, though; perhaps every shard could have its own GSI for its own assets, so GSI lookups could happen concurrently across all servers.

How adding documents works

When a document is added to the index, two things happen:

  1. A UUID is assigned to the document.
  2. The document with its UUID is saved in the main index.

Once a document has been added to the index, it can be queried.
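
As a minimal sketch, the add step might look something like the following, assuming a main index keyed by document UUID (the actual implementation is in the GitHub repository):

import uuid

def add(main_index, document):
    # 1. assign a UUID to the document
    document["uuid"] = str(uuid.uuid4())
    # 2. save the document, with its UUID, in the main index
    main_index[document["uuid"]] = document
    return document["uuid"]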

The query engine

NoSQL database architectures allow for rich, complex queries that are ideal for building information retrieval systems. Inspired by Elasticsearch, I wanted queries to be represented in JSON. This would give me the flexibility to allow users to write complex queries. To start, I wrote a single query that I wanted to support, then anchored my development efforts around that query. The query was something like:

{"query": {"title": {"contains": "tolerate"}}}

This query would retrieve all documents whose title contains the string “tolerate”.

With a query I wanted to support in mind, the next step was to build the query engine. This engine takes a query and finds documents that match the condition in the query. To start, I built support for evaluating one condition at a time.

The query engine follows this algorithm:

  1. Retrieve the column name being queried. In the above example, this is title.
  2. Retrieve the search criteria. In the above example, this is contains.
  3. Create a GSI for the column name being queried. This creates an index of the form {title: uuid}. This can be used to efficiently retrieve a list of titles and their associated document IDs, which can then be reconciled back to the full document by querying the main index. With this approach, retrieving a full document by title is an O(1) + O(1) operation. The first O(1) is a lookup to find the UUID of the item in the GSI. The second O(1) is a lookup to find the document data in the main index.
  4. Iterate over all documents and check whether each one matches the condition. In the above example, the condition is whether title contains tolerate. I decided to use the Python pybmoore package to implement string searching. This library implements Boyer-Moore string search, an efficient string searching algorithm.
  5. If the condition is matched, the document UUID is recorded so the document can be returned.
  6. The document contents are retrieved from the main index using the UUIDs.
  7. Results are returned from the query, as sketched below.
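
Put together, a single-condition evaluation might look roughly like the sketch below. It is simplified and the helper names are hypothetical (it assumes a main index keyed by UUID and a gsis dictionary of per-column indices); the real code is in the repository.

import pybmoore
from collections import defaultdict

main_index = {}  # uuid -> document, as populated when documents are added
gsis = {}        # column name -> GSI

def evaluate_condition(main_index, gsis, condition):
    # 1. and 2. retrieve the column name and the search criteria
    column = next(iter(condition))
    criteria, term = next(iter(condition[column].items()))

    # 3. build (or reuse) a GSI mapping each unique value in the column to
    # the UUIDs of the documents that hold it
    if column not in gsis:
        gsi = defaultdict(set)
        for document_id, document in main_index.items():
            gsi[document[column]].add(document_id)
        gsis[column] = gsi

    # 4. and 5. check each unique value against the condition and record
    # the UUIDs of the documents that match
    matching_ids = set()
    for value, document_ids in gsis[column].items():
        if (
            (criteria == "equals" and value == term)
            or (criteria == "starts_with" and value.startswith(term))
            or (criteria == "contains" and pybmoore.search(term, value))
        ):
            matching_ids.update(document_ids)
    return matching_ids

# 6. and 7. reconcile the matching UUIDs back to full documents
uuids = evaluate_condition(main_index, gsis, {"title": {"contains": "tolerate"}})
results = [main_index[document_id] for document_id in uuids]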

Let’s show how it works!

Consider the following list of documents:

[
    {"title": "tolerate it", "lyric": "I made you my temple, my mural, my sky"},
    {
        "title": "my tears ricochet",
        "lyric": "And I still talk to you when I'm screaming at the sky"
    },
    {"title": "The Bolter", "lyric": "Started with a kiss"}
]

Each document has a song title and an associated lyric.

I can load the documents into an index using the following code:

from nosql import NoSQL
# this is imported from a local file
# the code is available at https://github.com/capjamesg/nosql

index = NoSQL(
    index_by=["title", "lyric"],
)

for document in documents:
    index.add(document)

I can then construct a search query.

After I built the part of the query engine that searches documents, I added a few parameters that you can use to customize queries: sort_by, limit, and start. Variations of these parameters are commonly available across NoSQL databases to allow for sorting, limiting results, and pagination, respectively.

With a limit parameter to limit results, a final query looks like this:

query = {
    "query": {"title": {"contains": "tolerate"}},
    "limit": 10
}

This query can then be executed using the following code:

result = index.search(query)

print("Showing search results for query: ", query)

for r in result["documents"]:
    print(r["title"])

print("results returned in ", result["query_time"] + "s")

When the query runs, the query engine evaluates the query and finds all matching results. Then, the query code limits the number of results returned by search() according to the limit key in our query.

The query engine returns both the documents that match and how long it took to run the query.
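
As an illustrative sketch, applying the sort_by, start, and limit parameters to the matched documents could be as simple as sorting and slicing a list (a hypothetical helper, not the exact implementation):

def paginate(documents, query):
    # sort the matching documents if a sort_by column was given, then slice
    # out the requested page using start and limit
    sort_key = query.get("sort_by")
    if sort_key:
        documents = sorted(documents, key=lambda document: document.get(sort_key, ""))
    start = query.get("start", 0)
    limit = query.get("limit", len(documents))
    return documents[start:start + limit]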

In a test running this on a dummy dataset of 300,000 documents, with each containing a title and a song lyric, it takes < 0.001 seconds to find the results that match the query above.

Of note, query run times increase if you are searching long documents. For example, if every document was 10,000 words and you were searching for the presence of a string, it would take noticeably longer to find a result. There are ways around this, such as building reverse indexes of each indexed result that let you efficiently find specific words in a document. With that said, there is significant nuance that I am still processing. For example, running a contains query across a massive document would take time and I am not sure of the best way to counter this.
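
To illustrate the reverse index idea, a sketch might map each word to the set of document UUIDs that contain it, so a single-word lookup no longer scans every long document (a hypothetical helper, not part of the project):

from collections import defaultdict

def build_reverse_index(main_index, column):
    # map each word in `column` to the UUIDs of the documents containing it
    reverse_index = defaultdict(set)
    for document_id, document in main_index.items():
        for word in document[column].lower().split():
            reverse_index[word].add(document_id)
    return reverse_index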

You can see an example of a benchmark script that evaluates query speed in the project repository.

Adding support for complex queries

At this point, I had a working query engine. But the engine could only evaluate one condition at a time. I decided to extend the engine to add support for operator queries. These are queries that have a list of conditions and an operator, such as and or or, that is used to evaluate them. I started with a common compound query: and.

I wanted to be able to write a query that would let me find documents that met two or more conditions. For example, I could find documents whose title starts with “tolerate” and whose lyrics contain “my mural”:

{
    "query": {
        "and": [
            {"title": {"starts_with": "tolerate"}},
            {"lyric": {"contains": "my mural"}}
        ]
    },
    "limit": 2,
    "sort_by": "title"
}

To implement this, I needed to build a depth-first search function that would start at the top of the query and then recursively evaluate each key and value. If a key was a reserved keyword, such as and, its children could be evaluated as queries or other query keywords. This can happen recursively until the end of a branch. The end of a branch will always be a query.

Results can then bubble up toward the top of the callstack. When a reserved keyword is hit, the results from queries that were evaluated are then processed using the corresponding function for that keyword.

For example, consider the above query. I needed a function that would follow this path:

evaluating {'and': [{'title': {'starts_with': 'tolerate'}}, {'lyric': {'contains': 'my mural'}}]}
evaluating {'title': {'starts_with': 'tolerate'}}
running query {'title': {'starts_with': 'tolerate'}}
evaluating {'lyric': {'contains': 'my mural'}}
running query {'lyric': {'contains': 'my mural'}}
combining results using method and values {UUID('4ea169ac-a45d-4973-a5a3-093c977b685a')}

In this path, the query is evaluated. The and key is retrieved. Then, each condition is evaluated. When a search query is encountered (i.e. {'title': {'starts_with': 'tolerate'}}), the query is executed using the query engine described in the last section. Then, the results are sent up the call stack. The results are the UUIDs of documents that match a condition. When all nodes in and are evaluated, they should be processed to find results that meet all conditions. For the and method, I used the set.intersection function to find the UUIDs that were returned in all queries.

Then, the documents associated with each UUID can be retrieved and sent back to the client.
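
Condensed into code, the recursive evaluation might look like the sketch below. It assumes the evaluate_condition helper sketched earlier, which returns a set of matching UUIDs, and, for simplicity, that each compound keyword maps to a list of sub-queries, as in the and example above.

def evaluate(main_index, gsis, node):
    # depth-first search: reserved keywords combine the UUID sets that
    # bubble up from their children; any other key is a leaf query
    if "and" in node:
        child_results = [evaluate(main_index, gsis, child) for child in node["and"]]
        return set.intersection(*child_results)
    if "or" in node:
        child_results = [evaluate(main_index, gsis, child) for child in node["or"]]
        return set.union(*child_results)
    return evaluate_condition(main_index, gsis, node)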

The depth-first search I implemented is recursive. This means that queries can support nested compound queries, like:

query = {
    "query": {
        "or": {
            "and": [
                {"title": {"starts_with": "tolerate"}},
                {"title": {"contains": "it"}},
            ],
            "lyric": {"contains": "kiss"},
        }
    },
    "limit": 2,
    "sort_by": "title",
}

Conclusion

I have used Elasticsearch for a few search projects in the past. I was always curious about how documents could be queried so fast. With this project, I was able to build a greater understanding of the fundamentals of the structure of a NoSQL database and how they can be represented in Python.

One of the most significant takeaways from this project was the role of a GSI: an auxiliary index that can be used to map items with the same key back to the UUIDs with which they are associated.

The source code for this project is available on GitHub. I have added a test suite that evaluates a wide range of functionality, including query behaviour across simple, boolean, and nested queries, and adding and removing items from an index.

If you have any questions about how this project works, feel free to email me at readers [at] jamesg [dot] blog.


