The Firefox address bar lets you search for URLs given a series of characters. You can then select a URL with your arrow keys and press enter to navigate to that URL.
One use of this for me is finding files on GitHub. For example, I can type gith jam def
to find the default.html
file in the private jamesg.blog
GitHub repository. I usually write these queries incrementally, as the search bar suggests what is available.
This pattern is much faster than using the GitHub user interface to navigate to a specific page that is nested within several folders, and more appropriate than a bookmark since I have fast URL search across all of my browser history.
I was thinking about how this could be implemented. The data structure I would likely use is a bigram code index, a data structure I learned about in the “Regular Expression Matching with a Trigram Index or How Google Code Search Worked” blog post.
The fundamental premise is to create a reverse index where every token is a series of two characters, The aforelinked blog post uses three characters, but two characters are more appropriate for URL search since URLs are typically short. For example, given the URL:
jamesg.blog/coffee
We would have a reverse index like:
ja: [1] am: [1] me: [1] es: [1] ..
Every two character sequence in the URL would have an entry that matches the token to its full URL. 1
refers to the document ID, which in this case would map back to the full URL jamesg.blog/coffee
. Using document IDs is best practice since it reduces the amount of data you need to store.
When I type a search query like jam of
, the bigram index could be searched for ja
, am
, and of
. Then, the system could return all documents that mention those bigrams. This can be done using a set intersection operation, starting with all the resutls for ja
, then intersecting successive results.
The set intersection operation returns all items that appear in all sets being intersected.
Consider a search index with three URLs:
jamesg.blog/coffee jamesg.blog/code jamesg.blog/coffee-tips
Now, consider the query:
jam co
All of the above three URLs would be returned, since they all contain the strings ja
, am
, and co
. They could be ordered by use, allowing the user to navigate more easily to pages they refer to frequently.
Now, if I add ffee
to my search, the search would narrow to:
jamesg.blog/coffee jamesg.blog/coffee-tips
These are the two URLs that contain the sequences ja
, am
, co
, of
, ff
, fe
, and ee
. These are all of the two-character sequences in jam coffee
.
I haven’t built the above solution yet, but I wanted to document my thoughts on how I would build the solution as a thought exercise.