Introduction to consensus modeling in Python


Consensus algorithms allow computer scientists and data analysts to answer a question using information from several data sources. These sources may be independent, or they may be data taken from the same source at different points in time.

Consider a scenario where you are building a tool that retrieves tags from Wikipedia album pages to build a graph of tags and albums. You want to build a reliable list of tags.

With that said, there is a limitation of Wikipedia: the site can be changed at any time. If you retrieve tags at a time when a page has been vandalised with a new tag, or an irrelevant tag has been accidentally added, that data will become part of your dataset. One way to retrieve more reliable tags for each album is to look at the last few days of tags and use a consensus approach to decide which tags belong in your graph.

In general, a few common approaches to consensus modeling are:

  1. Return results that appear in all data sets (unanimous consensus). This finds complete agreement between sources.
  2. Return results that appear in one or more data sets. This is useful for finding inconsistencies between data sources.
  3. Return results that appear in over N% of data sets. This is useful for finding data that is consistent across a given share of datasets, such as a majority (more than 50% of sources).
  4. Return results that appear in the last N data sets. This can be used if the data you are working with is modeled over time. You could retrieve results that appear in all of the last N days of data, for example.

There are many more consensus strategies. In addition, the exact implementation of a strategy depends on your use case.

In the context of our problem of retrieving tags to build a graph of albums and tags, we could use consensus modelling to:

  • Retrieve all tags that have appeared on the page every day for the last 10 days. This may be useful, particularly for more popular albums, since edits made within the 10-day window may resolve inaccuracies or vandalism. In addition, it ensures that relatively new tags (e.g. ones added 11 days ago, which have been present for the whole window) are considered.
  • Retrieve all tags that appear in at least 50% of the daily tag lists. This would help prevent a scenario where a tag is accidentally removed nine days ago and then added back eight days ago. Our first strategy would exclude such a tag, but this strategy would include it, provided the tag had been on the page for at least 50% of the history we are considering.

In this guide, we are going to walk through how to implement all of the above strategies.

Step #1: Initialise data

To get started, we need some data. I have made a dummy dataset of tags from the Folklore (Taylor Swift album) Wikipedia page. I made changes to the data so that it can be used to illustrate the consensus algorithm at work:

categories = [
    ["2021 albums", "Taylor Swift albums", "Albums produced by Taylor Swift"],
    [
        "2020 albums",
        "Taylor Swift albums",
        "Albums produced by Taylor Swift",
        "Republic Records albums",
    ],
    [
        "2020 albums",
        "Taylor Swift albums",
        "Albums produced by Taylor Swift",
        "Republic Records albums",
    ],
]

Above, we have three days of dummy data. Day one has three tags. Days two and three have four tags.

Step #2: Define valid strategies (optional)

If you only plan to implement a single strategy, you can skip this step. I have added this code so that we can work through an example of a system that implements several consensus strategies that can be chosen from. While most applications will only need one, this code acts as good scaffolding for the rest of the guide, and allows for an aside on Enums, a useful tool in programming.

We are going to implement all four strategies we discussed in the introduction:

  • Return results in all data sets, which we will call UNANIMOUS.
  • Return results in one or more data sets, which we will call IN_ONE_OR_MORE.
  • Return results in over N% of data sets, which we will call CROSSES_THRESHOLD.
  • Return results in the last N data sets, which we will call IN_LAST_N_DAYS.

We will define these strategies as an enum. An enum is a type whose members are a fixed set of named constant values.

from collections import defaultdict
from enum import Enum

class VALID_STRATEGIES(Enum):
    IN_ONE_OR_MORE = "in_one_or_more"
    CROSSES_THRESHOLD = "crosses_threshold"
    IN_LAST_N_DAYS = "in_last_n_days"
    UNANIMOUS = "unanimous"
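As a quick aside on why Enums are useful here: each member has a name and a value, and members can be looked up by value, which makes validation straightforward. A small illustration, repeating the enum definition so it runs on its own:

```python
from enum import Enum

class VALID_STRATEGIES(Enum):
    IN_ONE_OR_MORE = "in_one_or_more"
    CROSSES_THRESHOLD = "crosses_threshold"
    IN_LAST_N_DAYS = "in_last_n_days"
    UNANIMOUS = "unanimous"

# Each member has a name and a value
print(VALID_STRATEGIES.UNANIMOUS.name)   # UNANIMOUS
print(VALID_STRATEGIES.UNANIMOUS.value)  # unanimous

# Looking up a member by its value returns the member itself,
# so an invalid string raises a ValueError instead of failing silently
print(VALID_STRATEGIES("unanimous") is VALID_STRATEGIES.UNANIMOUS)  # True
```

A misspelled member, such as VALID_STRATEGIES.UNANAMOUS, raises an AttributeError immediately, which is harder to achieve with plain string constants.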

Step #3: Count occurrences of each item

To get started, let’s define a Python function that takes in a list of lists and a strategy that should be used to reach consensus:

def consensus(items: list, strategy: VALID_STRATEGIES) -> set:
    pass

The data we are working with is a list of lists: the items argument will contain lists of tags. The function will return a set containing every unique item that matches our consensus strategy. We return a set because we will use Python’s set methods to find intersections of the items values that we pass into the consensus function. An intersection contains the items that appear in every one of the sets being compared.

Above, we set VALID_STRATEGIES as the type for strategy. This means that code editors like Visual Studio Code can tell developers that a strategy must be in VALID_STRATEGIES.

Of note, sets, the return type of our function, are not ordered, so results may appear in different orders every time you run the function. To retrieve ordered results, you can order them alphabetically after they have been passed through the algorithm using a helper function like this:

def order_alphabetically(categories):
    return sorted(categories)
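For example, applying this helper to an (unordered) set of tags yields a stable, alphabetised list:

```python
def order_alphabetically(categories):
    return sorted(categories)

tags = {"Taylor Swift albums", "2020 albums", "Republic Records albums"}
print(order_alphabetically(tags))
# ['2020 albums', 'Republic Records albums', 'Taylor Swift albums']
```

Note that digits sort before letters, which is why "2020 albums" comes first.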

First, we need to calculate:

  • All unique items in all lists of tags;
  • The number of lists each tag appears in.

We can compute this information using the following code:

unique_items = set()
membership_counts_in_total = defaultdict(int)

for sublist in items:
    # Deduplicate within each list so an item counts at most once per day
    for i in set(sublist):
        unique_items.add(i)
        membership_counts_in_total[i] += 1

In this code, we iterate over every list in items. We convert each list into a set so that we don’t accidentally count the presence of an item more than once per list. This is essential for the “return results in over N% of data sets” strategy, where we need to track how many data sets an item appears in, rather than how many times it appears in total.
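As a quick check, running the counting code against the dummy categories data from Step #1 gives the tallies we would expect:

```python
from collections import defaultdict

categories = [
    ["2021 albums", "Taylor Swift albums", "Albums produced by Taylor Swift"],
    ["2020 albums", "Taylor Swift albums", "Albums produced by Taylor Swift",
     "Republic Records albums"],
    ["2020 albums", "Taylor Swift albums", "Albums produced by Taylor Swift",
     "Republic Records albums"],
]

unique_items = set()
membership_counts_in_total = defaultdict(int)

for sublist in categories:
    # set(sublist) deduplicates within a day, so each tag
    # is counted at most once per list
    for i in set(sublist):
        unique_items.add(i)
        membership_counts_in_total[i] += 1

print(len(unique_items))                                  # 5
print(membership_counts_in_total["Taylor Swift albums"])  # 3
print(membership_counts_in_total["2021 albums"])          # 1
```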

With this code, we now know all we need to implement consensus algorithms.

Step #4: Implementing algorithms

In one or more

Because unique_items is a set, Python ensures that only one of each item can be added. This means that the set has naturally tracked all items that appear in one or more of our tag lists.

Thus, we can return unique_items to represent all items that appear in one or more lists:

if strategy == VALID_STRATEGIES.IN_ONE_OR_MORE:
    return unique_items

This code returns:

{"2020 albums", "Taylor Swift albums", "Albums produced by Taylor Swift", "Republic Records albums", "2021 albums"}

Unanimous

To calculate whether an item appears in all tag lists, we can use the set.intersection() function. This function returns the items that exist in every one of the sets passed to it.

Consider the following lists:

["2021 albums", "Taylor Swift albums", "Albums produced by Taylor Swift"]
["2020 albums", "Taylor Swift albums", "Albums produced by Taylor Swift", "Republic Records albums"]
["2020 albums", "Taylor Swift albums", "Albums produced by Taylor Swift", "Republic Records albums"]

We need to convert each list to a set so it can be used with set.intersection(). Then, set.intersection() finds all items that appear in every set:

if strategy == VALID_STRATEGIES.UNANIMOUS:
    return set.intersection(*[set(i) for i in items])

The asterisk (*) means take all of the sets and pass them in as independent arguments to set.intersection(). This is required because set.intersection() takes one or more sets as arguments, not a list of sets.

With the three lists above as examples, our code would return:

{"Taylor Swift albums", "Albums produced by Taylor Swift"}

This set contains the two tags that appear in all three lists. set.intersection() works across any number of sets, finding the tags common to all of them.
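The unpacking behaviour is worth seeing in isolation. A small, self-contained sketch with toy sets rather than our tag data:

```python
sets = [{"a", "b", "c"}, {"b", "c", "d"}, {"b", "c", "e"}]

# Unpacking passes each set as a separate argument:
# set.intersection(*sets) == sets[0].intersection(sets[1], sets[2])
print(set.intersection(*sets))  # {'b', 'c'} (element order may vary)

# Passing the list itself fails, because set.intersection
# expects sets as arguments, not a list of sets
try:
    set.intersection(sets)
except TypeError:
    print("set.intersection(sets) raises TypeError")
```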

In last N days

We can modify our code above slightly to return tags that appear in all tag lists from the last N days: instead of reading every list, we only read the last N. Note that this strategy only works if the tag lists are ordered by day. If the tag lists are unordered, we cannot apply a last-N-days approach, since we would not know which lists correspond to the last N days.

We can use the following code to find all tags that appear in the last N tag lists:

if strategy == VALID_STRATEGIES.IN_LAST_N_DAYS:
    return set.intersection(*[set(i) for i in items[-n:]])

For this code to work, we need to change our consensus function to accept an n argument, which says how many lists we want to analyse:

def consensus(items: list, strategy: VALID_STRATEGIES, n: int = 2) -> set:

Let’s call our code with:

print(consensus(categories, VALID_STRATEGIES.IN_LAST_N_DAYS, n=2))

This code will run our consensus algorithm and find all tags that appear in the last 2 days of tag history.

Our code returns:

{'2020 albums', 'Albums produced by Taylor Swift', 'Taylor Swift albums', 'Republic Records albums'}

Of note, 2021 albums is not in this list, since it only appears in the first tag list.
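The key mechanic here is the items[-n:] slice, which keeps only the n most recent tag lists before intersecting them. A small sketch using the dummy data from Step #1:

```python
categories = [
    ["2021 albums", "Taylor Swift albums", "Albums produced by Taylor Swift"],
    ["2020 albums", "Taylor Swift albums", "Albums produced by Taylor Swift",
     "Republic Records albums"],
    ["2020 albums", "Taylor Swift albums", "Albums produced by Taylor Swift",
     "Republic Records albums"],
]

n = 2
last_n = categories[-n:]  # keep only the two most recent tag lists
result = set.intersection(*[set(i) for i in last_n])

print("2021 albums" in result)  # False: only in the oldest list
print("2020 albums" in result)  # True: in both of the last two lists
```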

Crosses threshold

A common use of consensus algorithms is to return results that appear in more than N% of data sources. For example, we could consider a tag valid for inclusion in our album graph by saying that if it appears in more than 50% of our tag history, it is less likely to be an erroneous tag, so we can include it.

A common use of the “crosses threshold” algorithm is to implement majority vote: only include data if it appears in more than 50% of data sources.

We can compute the percentage of datasets in which each tag appears using this code:

[membership_counts_in_total[i] / len(items) * 100 for i in membership_counts_in_total]

In the above code, we iterate over all unique items and divide the number of sets each item appears in by the total number of sets. We multiply the result by 100 to get a percentage. We can print these values alongside their corresponding tags to check that the code works:

print(membership_counts_in_total.keys())
print([membership_counts_in_total[i] / len(items) * 100 for i in membership_counts_in_total])

Our code returns:

dict_keys(['2021 albums', 'Taylor Swift albums', 'Albums produced by Taylor Swift', '2020 albums', 'Republic Records albums'])
[33.33333333333333, 100.0, 100.0, 66.66666666666666, 66.66666666666666]

2021 albums appears in 33% of the tag lists we have passed in. This equates to one list, since we passed in three lists of tags. Taylor Swift albums appears in all tag lists.

The code above returns the percentage of datasets each tag is in. We can use the same logic to return all items that appear in more than N% of data sets using this modified code:

if strategy == VALID_STRATEGIES.CROSSES_THRESHOLD:
    return set(
        [
            i
            for i in membership_counts_in_total
            if membership_counts_in_total[i] / len(items) * 100 > threshold
        ]
    )

Above, we use a list comprehension to iterate over all unique items. We then check whether the percentage of datasets each item appears in is greater than a threshold. We need to add this threshold as an argument of our consensus function:

def consensus(items: list, strategy: VALID_STRATEGIES, n: int = 2, threshold: float = 50) -> set:

We can then use the following code to identify items that appear in more than 50% of data sets:

consensus(categories, VALID_STRATEGIES.CROSSES_THRESHOLD, threshold=50)

Our code returns:

{'2020 albums', 'Albums produced by Taylor Swift', 'Republic Records albums', 'Taylor Swift albums'}

We have successfully identified all tags that appear in more than 50% of the tag lists we supplied to our consensus algorithm.
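To round off, here is the full function with all four strategies in one place. This is a consolidated sketch of the snippets above; the n and threshold defaults are illustrative choices:

```python
from collections import defaultdict
from enum import Enum

class VALID_STRATEGIES(Enum):
    IN_ONE_OR_MORE = "in_one_or_more"
    CROSSES_THRESHOLD = "crosses_threshold"
    IN_LAST_N_DAYS = "in_last_n_days"
    UNANIMOUS = "unanimous"

def consensus(items: list, strategy: VALID_STRATEGIES, n: int = 2, threshold: float = 50) -> set:
    unique_items = set()
    membership_counts_in_total = defaultdict(int)

    for sublist in items:
        # Deduplicate within each list so an item counts at most once per day
        for i in set(sublist):
            unique_items.add(i)
            membership_counts_in_total[i] += 1

    if strategy == VALID_STRATEGIES.IN_ONE_OR_MORE:
        return unique_items
    if strategy == VALID_STRATEGIES.UNANIMOUS:
        return set.intersection(*[set(i) for i in items])
    if strategy == VALID_STRATEGIES.IN_LAST_N_DAYS:
        return set.intersection(*[set(i) for i in items[-n:]])
    if strategy == VALID_STRATEGIES.CROSSES_THRESHOLD:
        return {
            i
            for i in membership_counts_in_total
            if membership_counts_in_total[i] / len(items) * 100 > threshold
        }
    raise ValueError(f"Unknown strategy: {strategy}")

categories = [
    ["2021 albums", "Taylor Swift albums", "Albums produced by Taylor Swift"],
    ["2020 albums", "Taylor Swift albums", "Albums produced by Taylor Swift",
     "Republic Records albums"],
    ["2020 albums", "Taylor Swift albums", "Albums produced by Taylor Swift",
     "Republic Records albums"],
]

print(consensus(categories, VALID_STRATEGIES.UNANIMOUS))
print(consensus(categories, VALID_STRATEGIES.CROSSES_THRESHOLD, threshold=50))
```

Raising a ValueError for an unknown strategy means a typo fails loudly instead of silently returning None.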

Conclusion

Consensus modeling is used to find agreement between several data sources, and it appears across computer science and data science. For example, in natural language processing, you could use consensus modeling to automatically assign tags to content: run three open source models over the content, then keep only the tags that a consensus of the models agrees on. This mitigates the risk that any one model returns an inaccurate category, which would muddy your data.

In this guide, we walked through an example of consensus modeling where we had a dummy list of tags from the last three days of a wiki page’s history and wanted to find consensus among the tag history. We implemented four consensus strategies to show how each of them works: in one or more, unanimous, in last N days, and crosses threshold.

If you have any questions about this guide, feel free to reach out to me at readers [at] jamesg [dot] blog.

The code for this tutorial is available on GitHub.


