How can search engines be so fast? While there are many parts of a search system, one of the key concepts to know is the inverted index.
Suppose we have the following strings, all Taylor Swift lyrics:
I made you my temple, my mural, my sky
And I still talk to you when I’m screaming at the sky
Started with a kiss
We could manually scan through these to find every lyric that contains a given word, like “sky”. That approach wouldn’t scale, though, if our documents were every song lyric by every pop artist. For that, we need an inverted index.
An inverted index is a data structure that maps words to the documents they are in.
Here is the structure of an inverted index:
{
    "temple": [0],
    "kiss": [2],
    "sky": [0, 1],
    ...
}
With the structure above, we can look up “sky” and find that it appears in documents 0 and 1. Then, we can return those documents to the user, instead of manually scanning through every single document.
This data structure is a map, also known as a dictionary (in Python, a dict, implemented as a hash table). One of the benefits of a dictionary is that lookups take O(1) time on average: the lookup time stays constant no matter how many items the dictionary holds. If our dictionary has a million records, it will be roughly as fast to query as if it had three.
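As a minimal sketch, a query against the toy index above is a single dictionary access:

index = {
    "temple": [0],
    "kiss": [2],
    "sky": [0, 1],
}
print(index["sky"])  # [0, 1] -- one hash lookup, however large the index grows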
Let’s talk through how to create an inverted index.
Define the documents
Before we begin, we need to define our documents. Our search feature will let us find which songs contain a specific lyric, so each document needs two pieces of information: a song title and a lyric.
Let’s start with a list of documents:
documents = [
    {
        "title": "tolerate it",
        "lyric": "I made you my temple, my mural, my sky"
    },
    {
        "title": "my tears ricochet",
        "lyric": "And I still talk to you when I'm screaming at the sky"
    },
    {
        "title": "The Bolter",
        "lyric": "Started with a kiss"
    }
]
Here, we have three documents, each with a title and a lyric.
Define the index
Now that we have a list of documents, we need to build our index.
To build the index, we will loop through each item in the list of documents. For each item, we will:
- Change all text to lowercase, so that our index is case insensitive;
- Remove all symbols, so symbols do not interfere with searches;
- Add each word in the lyric to an index.
Translated into code, this looks like:
import string

def transform_text(text):
    # Lowercase everything and strip all punctuation
    return text.lower().translate(str.maketrans("", "", string.punctuation))

index = {}

for i, doc in enumerate(documents):
    lyric = transform_text(doc["lyric"])
    for word in lyric.split():
        if word not in index:
            index[word] = set()
        index[word].add(i)
First, we define a function called transform_text. It turns all text lowercase and removes all punctuation. This is important because if our search index kept capital letters, we could only query it if the user typed the same capitals as the index. The same applies to punctuation: if the index stored “I’m” with its apostrophe, we would have to search for exactly that; a query of “im”, without the apostrophe, would find nothing in our dictionary.
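A quick, illustrative check of the function:

print(transform_text("I'm screaming at the sky!"))
# im screaming at the sky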
We are using sets to keep track of documents that contain a word. This is so that a document that contains the same word multiple times will only be saved once in the index.
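For example, adding the same document ID twice leaves the set unchanged:

doc_ids = set()
doc_ids.add(0)
doc_ids.add(0)  # duplicate, ignored
print(doc_ids)  # {0}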
We then loop over each document. We call transform_text to lowercase the lyric and remove its punctuation, then add each individual word to our index. Entries have this structure:
{ word: {document IDs} }
Let’s run the code above on a single lyric: “I made you my temple, my mural, my sky”:
{'i': {0}, 'made': {0}, 'you': {0}, 'my': {0}, 'temple': {0}, 'mural': {0}, 'sky': {0}}
We now have all the words in the lyric in our index. Each word is associated with its document ID. Since this is the first lyric in our list of documents, all IDs are 0.
If we index all of our documents, we will have:
{'i': {0, 1}, 'made': {0}, 'you': {0, 1}, 'my': {0}, 'temple': {0}, 'mural': {0}, 'sky': {0, 1}, 'and': {1}, 'still': {1}, 'talk': {1}, 'to': {1}, 'when': {1}, 'im': {1}, 'screaming': {1}, 'at': {1}, 'the': {1}, 'started': {2}, 'with': {2}, 'a': {2}, 'kiss': {2}}
Notice that sky has two values, 0 and 1, because the word appears in both documents.
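We can confirm this with a direct lookup:

print(index["sky"])  # {0, 1}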
We now have a search index! The next step is to write some code to query it.
Search the index
To query our index, let’s define a search function:
def search(query):
    # Normalize the query the same way we normalized the documents
    words = transform_text(query).split()
    results = set()
    for word in words:
        if word in index:
            results.update(index[word])
    # Fetch the full document for each matching ID
    matches = [documents[idx] for idx in results]
    return matches
This function accepts a query and starts with an empty set of results. We first transform the query so that it is in the same form as the indexed documents: lowercase, without punctuation. We then split the query into a list of words; because our index only contains single words as keys, we have to run a lookup for each word. For each word that appears in the index, we add all of its document IDs to the results set. Finally, we retrieve the document associated with each matching ID.
This code returns every document that contains any word in the search query; the words don’t have to appear together. Returning only documents that contain two or more words in sequence (a phrase search) would require more code, which is out of scope for this guide.
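For instance, a two-word query returns the union of each word’s matches (using the index and documents defined above):

print(search("screaming kiss"))
# Matches both "my tears ricochet" (contains "screaming")
# and "The Bolter" (contains "kiss")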
Let’s call our function with the query “sky”:
[{'title': 'tolerate it', 'lyric': 'I made you my temple, my mural, my sky'}, {'title': 'my tears ricochet', 'lyric': "And I still talk to you when I'm screaming at the sky"}]
Our code successfully identified both lyrics!
If we search for “kiss”, we get:
[{'title': 'The Bolter', 'lyric': 'Started with a kiss'}]
Benchmarking
Earlier in this article, we discussed how an index would be faster than searching through all words in all documents. Let’s test that theory. We can use the code above, with a few changes. First, we need to make the documents list larger:
documents = documents * 5000
This will make the documents list 5,000 times larger, giving us 15,000 documents (3 documents × 5,000). Because the index is built from documents, we also re-run the indexing loop from earlier so that the index covers all 15,000 documents.
Second, we need to write some code that will time the use of our search query. Let’s run 20 searches of the same query and time them:
import time

start = time.time()
for _ in range(20):
    search("sky")
end = time.time()
print("Time taken for index:", end - start)
Finally, let’s write code that manually checks whether every lyric value in documents contains our search query:
def search_by_words(query):
    words = transform_text(query).split()
    results = []
    for word in words:
        # Brute force: scan every document for every query word.
        # Note this is a substring check, so it can also match partial words.
        for doc in documents:
            if word in transform_text(doc["lyric"]):
                results.append(doc)
    titles = [doc["title"] for doc in results]
    return titles
start = time.time()
for _ in range(20):
    search_by_words("sky")
end = time.time()
print("Time taken for manual comparisons:", end - start)
Let’s search for the term “sky”. When tested on an M1 MacBook, this code returns:
Time taken for index: 0.0071811676025390625
Time taken for manual comparisons: 0.8166451454162598
To run 20 queries against our index of 15,000 documents, our script took about 0.007s. The same 20 queries took about 0.8s with manual comparison over the same documents.
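As a sanity check (illustrative, and assuming the index was rebuilt over the enlarged list), we can confirm both functions agree on the number of matches:

print(len(search("sky")), len(search_by_words("sky")))
# Both should report 10000 matches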
Note that our documents are short. If they were much longer (for example, the text of web pages), the manual comparison would take even longer. Querying the index, on the other hand, takes roughly the same time no matter how long the documents are, because the index is pre-built and stored in a data structure with O(1) average lookup time. We only need to build the index once, whereas the manual comparison checks everything by brute force on every query, which is slow and does not scale well.
Conclusion
In this guide, we made an inverted index in Python. We defined a list of documents, then built an inverted index that maps each word in every document to the IDs of the documents that contain it. We then wrote code to query the index, and demonstrated that index lookups are substantially faster than brute-force search at query time.
Our search index does have a significant limitation: we cannot search for an exact phrase whose words must appear together (i.e. “I can”). With that said, for a few dozen lines of code, we have built an efficient data structure for searching documents.
You can see the full code for this blog post on GitHub.
This post was inspired by Tim Bray’s On Search: Basic Basics blog post, which was the first resource that helped me understand search indexing and how it could be so fast.