How to find word collocations in a document


My blog search engine used to treat every word independently in evaluating a query. For example, results for “homebrew website club” were calculated by finding documents that contained all those words. The words did not necessarily have to be in sequence.

For many queries, documents that were relevant bubbled up to the top of search results using TF/IDF. This is because more relevant documents would accrue higher scores for containing all three words.

Then, I started to think: how could I promote documents that contain words that appear together? My use case was proper nouns like “Homebrew Website Club” (an event) and “Dancing With Our Hands Tied” (a song). I wanted documents that contained the exact phrase to be ranked higher, rather than treating each word individually.

In thinking about this ranking change, I also realised that ranking documents higher if they contain an exact phrase from a search term would assist with retrieving documents by title. This is a use case I have for my search engine: I often remember a phrase from a recent blog post and want to be able to find that post easily using the phrase.

The outcome I wanted was:

  1. A user types in a search query;
  2. Documents that contain the words in the query are retrieved;
  3. Documents are ranked according to a text ranking metric (i.e. TF/IDF);
  4. Documents are boosted if they contain an exact sequence of words from the query.

Thus, “homebrew website club” should promote documents that contain that exact phrase over documents that contain all three of those words but in different places.

I wanted an efficient solution that wouldn’t take too long to compute over lots of documents. That’s when I decided to use sets.

I came up with the following algorithm:

  1. Create a dictionary that maps each word in a document to the positions at which it appears in the document;
  2. Declare a set called words that contains the positions of all occurrences of the first word in a phrase (i.e. “Homebrew”);
  3. For each word in a term (i.e. “Homebrew”, “Website”, “Club”), find all of its positions. Subtract i from each position, where i is the index of the word in the term (i.e. “Homebrew” = position - 0, “Website” = position - 1). Store these adjusted positions in a set called positions_for_word. Calculate the intersection of words and positions_for_word, and set words to the new, intersected set.
  4. After evaluating all terms, return True if the length of words is greater than 0; otherwise, return False.

My code is as follows:

from collections import defaultdict

text = "Homebrew Website Club is a weekly meetup of people interested in personal websites."

def check_for_collocation(text, term):
    # Map each word to every position at which it appears in the document.
    word2pos = defaultdict(list)

    for i, word in enumerate(text.split()):
        word2pos[word].append(i)

    term_parts = term.split()
    # Start with the positions of the first word in the phrase.
    words = set(word2pos[term_parts[0]])

    for i, t in enumerate(term_parts):
        # Shift each position back by the word's index in the phrase so that
        # positions of consecutive words line up on the phrase's start index.
        term_positions = set([pos - i for pos in word2pos[t]])
        words = words.intersection(term_positions)

    # The phrase is present if any positions survived every intersection.
    return len(words) > 0

print(check_for_collocation(text, "Website Club is"))
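
For the sample text, this prints True, because the exact phrase Website Club is appears in the document (starting at word index 1).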

Instead of iterating over each word and checking whether the words that follow it match the rest of the phrase, I decided to use sets.

Let’s work through the above code with an example.

Suppose we have the text in the code snippet above, and we want to check whether the term Website Club is appears in the document. First, we calculate a dictionary of words mapped to positions, like:

{
	"Homebrew": [0],
	"Website": [1],
	"Club": [2],
	"is": [3],
	"a": [4],
	...
}

words is initially set to the positions of the first word of the term, which in the above case is {1}. In the for i, t in enumerate(term_parts): for loop, we iterate over each term (Website, Club, is), and keep a variable i that tracks the position we are at in the term.

For each term, we get its positions in the document. Then, we subtract i from each position.

For the term Website, term_positions becomes {1}, because the position of Website in the document is 1 and i is 0.

We intersect this with the starting set, which contains {1} because we initialized it with the positions of the first word. Initializing words this way matters: if we started with an empty set, every intersection would also be empty.

For the term Club, term_positions becomes {1}.

Wait, {1} again? How does that work? This is why we subtract i from each position. Because Club is one word after Website, its position in the document (2) minus its position in the phrase (1) is also 1.

The intersection of {1} and {1} is {1}.

For the term is, term_positions becomes {1}, because it follows Club: its position in the document (3) minus its position in the phrase (2) is again 1.

The intersection of all of these sets is {1}. Because the set contains an item, our whole phrase has been found.
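
To make the walkthrough concrete, here is a small standalone snippet, separate from the function above, that replays the three intersection steps with the positions hard-coded from the example document:

# Positions hard-coded from the example document above.
word2pos = {"Website": [1], "Club": [2], "is": [3]}

words = set(word2pos["Website"])  # {1}

for i, t in enumerate(["Website", "Club", "is"]):
    term_positions = set([pos - i for pos in word2pos[t]])
    print(t, term_positions)  # {1} for every term
    words = words.intersection(term_positions)

print(words)  # {1}: the exact phrase starts at word index 1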

If we were searching for Website Club cookie, the term_positions for cookie would be an empty set. {1} intersected with an empty set is an empty set. An empty set means the phrase is not in our document.
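
We can confirm this with the function from earlier:

print(check_for_collocation(text, "Website Club cookie"))  # False: "cookie" never appears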

This algorithm works no matter where the phrase appears in a document. You can also use it to count how many occurrences there are of a phrase, and keep track of the start of each occurrence.

In this case, {1} tells us that the phrase Website Club is starts at index 1 in the list of words in text. If Website Club is also started at position 50, our set would be {1, 50}, each item denoting the start of an occurrence of the exact phrase Website Club is.
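
A minimal sketch of that variant could look like the function below. The name find_phrase_starts is mine, not code from my search engine, and it reuses the defaultdict import and text variable from the snippet above. It returns the surviving set instead of a boolean, so its members are the start indices and its length is the number of occurrences:

def find_phrase_starts(text, term):
    # Same algorithm as check_for_collocation, but return the set of
    # word indices at which the exact phrase starts.
    word2pos = defaultdict(list)

    for i, word in enumerate(text.split()):
        word2pos[word].append(i)

    term_parts = term.split()
    starts = set(word2pos[term_parts[0]])

    for i, t in enumerate(term_parts):
        starts = starts.intersection(set([pos - i for pos in word2pos[t]]))

    return starts

print(find_phrase_starts(text, "Website Club is"))  # {1}
print(len(find_phrase_starts(text, "Website Club is")))  # 1 occurrence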

With this computation, my site search can efficiently find documents that contain an exact phrase without having to do a string search. Documents that contain an exact phrase can then be promoted in search results.
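
As a rough sketch of how that boost could be wired into ranking, and not my search engine's actual code, the names tfidf_scores and PHRASE_BOOST below are purely illustrative:

PHRASE_BOOST = 1.5  # hypothetical multiplier for exact-phrase matches

def rank_documents(tfidf_scores, query):
    # tfidf_scores maps each document's text to its TF/IDF score for the query.
    ranked = []
    for doc, score in tfidf_scores.items():
        if check_for_collocation(doc, query):
            score *= PHRASE_BOOST
        ranked.append((doc, score))
    # Highest adjusted score first.
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)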


