URL search in the browser address bar

The Firefox address bar lets you search for URLs given a series of characters. You can then select a URL with your arrow keys and press enter to navigate to that URL.

One use of this for me is finding files on GitHub. For example, I can type gith jam def to find the default.html file in the private jamesg.blog GitHub repository. I usually write these queries incrementally, as the search bar suggests what is available.

This pattern is much faster than using the GitHub user interface to navigate to a specific page that is nested within several folders, and more appropriate than a bookmark since I have fast URL search across all of my browser history.

I was thinking about how this could be implemented. The data structure I would likely use is a bigram code index, a data structure I learned about in the “Regular Expression Matching with a Trigram Index or How Google Code Search Worked” blog post.

The fundamental premise is to create a reverse index where every token is a series of two characters, The aforelinked blog post uses three characters, but two characters are more appropriate for URL search since URLs are typically short. For example, given the URL:

jamesg.blog/coffee

We would have a reverse index like:

ja: [1]
am: [1]
me: [1]
es: [1]
..

Every two character sequence in the URL would have an entry that matches the token to its full URL. 1 refers to the document ID, which in this case would map back to the full URL jamesg.blog/coffee. Using document IDs is best practice since it reduces the amount of data you need to store.

When I type a search query like jam of, the bigram index could be searched for ja, am, and of. Then, the system could return all documents that mention those bigrams. This can be done using a set intersection operation, starting with all the resutls for ja, then intersecting successive results.

The set intersection operation returns all items that appear in all sets being intersected.

Consider a search index with three URLs:

jamesg.blog/coffee
jamesg.blog/code
jamesg.blog/coffee-tips

Now, consider the query:

jam co

All of the above three URLs would be returned, since they all contain the strings ja, am, and co. They could be ordered by use, allowing the user to navigate more easily to pages they refer to frequently.

Now, if I add ffee to my search, the search would narrow to:

jamesg.blog/coffee
jamesg.blog/coffee-tips

These are the two URLs that contain the sequences ja, am, co, of, ff, fe, and ee. These are all of the two-character sequences in jam coffee.

I haven’t built the above solution yet, but I wanted to document my thoughts on how I would build the solution as a thought exercise.

Source link

Related Posts

Leave a Comment Cancel Reply