How Artemis retrieves web feeds


Artemis, the calm web reader I am building, updates once a day with the most recent posts from the authors you are following. I have designed Artemis to update on this cadence so that I don’t feel compelled to check my reader several times each day for new updates. Months after starting to use the software, I have found I check Artemis substantially less than the software I use that can update at any time. I feel calmer. I don’t feel like I need to keep checking to avoid missing out.

I wanted to document the process behind how Artemis retrieves web feeds from a technical perspective.

Every hour, a cron job runs a Python script. This Python script starts by finding all feed URLs that are subscribed to by a user in whose timezone it is midnight. Artemis is designed to update at ~12am in the user’s timezone. Users set their timezone on registration. When the script has a list of all feeds where it is midnight for at least one subscriber, the polling process starts.
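The midnight check above can be sketched as follows. This is a minimal illustration rather than the actual Artemis code: the function names and the `(feed_url, tzinfo)` subscription pairs are assumptions, and "midnight" is taken to mean the local hour is 0, which suits a job that runs once an hour.

```python
from datetime import datetime, timezone, tzinfo


def is_midnight_hour(tz: tzinfo, now_utc: datetime) -> bool:
    """Return True if the local time in tz falls between 00:00 and 00:59."""
    return now_utc.astimezone(tz).hour == 0


def feeds_due(subscriptions, now_utc: datetime) -> set:
    """Collect feed URLs that have at least one subscriber at ~midnight.

    subscriptions is a hypothetical iterable of (feed_url, tzinfo) pairs,
    one pair per user subscription. A set is returned so that a feed with
    several midnight subscribers is only polled once.
    """
    return {url for url, tz in subscriptions if is_midnight_hour(tz, now_utc)}
```

In practice the `tzinfo` for each user could be built from a stored timezone name with `zoneinfo.ZoneInfo("Europe/London")`, which also accounts for daylight saving time.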

Feeds are shuffled into a random order before polling begins. This helps prevent many feeds in a row, or many feeds close together, from being requested from the same source. Such a scenario may cause rate limiting issues.
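A rough sketch of shuffling and then polling with a bounded pool of threads, using only the standard library. The `poll_one` callable is a hypothetical stand-in for whatever fetches a single feed, and the pool size of 20 is the thread limit this post describes.

```python
import random
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 20  # upper bound on concurrent polling threads


def poll_all(feed_urls, poll_one, rng=None):
    """Shuffle feed URLs, then poll them with a bounded thread pool.

    poll_one is a hypothetical callable that takes one feed URL.
    Returns a dict mapping each URL to whatever poll_one returned.
    """
    rng = rng or random.Random()
    urls = list(feed_urls)  # copy so the caller's list is not reordered
    rng.shuffle(urls)       # spread feeds from the same source apart
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        # map() yields results in input order, so zip pairs URL with result
        return dict(zip(urls, pool.map(poll_one, urls)))
```

Shuffling does not guarantee that two feeds from one host never land next to each other; it only makes long runs from the same host unlikely.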

At most, 20 threads operate concurrently to poll web feeds. Each request is made with an If-None-Match and an If-Modified-Since header. The value of the If-None-Match header is the ETag value that the source supplied the last time the feed was polled. If the ETag has not changed, an HTTP 304 status code is returned. This means the feed is unchanged.

The If-Modified-Since header asks the server to check if the contents have been modified since the last time the feed was polled, which I send over as the value of the header. The server will return a 304 if the source is unchanged.
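Building the two conditional headers might look like the sketch below. The function names are illustrative assumptions; the header names and the HTTP date format are standard, and `email.utils.format_datetime` from the standard library produces that date format.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def conditional_headers(etag=None, last_polled=None):
    """Build conditional request headers from values stored at the last poll.

    etag is the ETag the server sent last time (if any); last_polled is a
    timezone-aware datetime recording when the feed was last polled.
    """
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_polled:
        # HTTP dates are always expressed in GMT, so normalise to UTC first
        as_utc = last_polled.astimezone(timezone.utc)
        headers["If-Modified-Since"] = format_datetime(as_utc, usegmt=True)
    return headers


def is_unchanged(status_code: int) -> bool:
    """A 304 Not Modified response means the feed has not changed."""
    return status_code == 304
```

The resulting dictionary can be passed as the request headers to whichever HTTP client makes the request. A feed polled for the first time has no stored values, so the dictionary is empty and the server returns the full feed.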

If a 304 is returned and no posts are available, the system moves on to the next feed. Otherwise, the feed is converted into a standard, modified ActivityStreams 2 (AS2) representation using Granary. Granary is a Python package with many utilities for converting feeds. I convert everything into AS2, then query the AS2 representation to find the information I serve to the user in the feed: the feed titles, links, published dates, and URLs. If no published date is set, the Updated date is used. If no Updated date is set, the current server time is used as the publication date to ensure the post is seen in the feed.
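The date fallback can be sketched as below. This assumes each post is a plain dictionary carrying the AS2 `published` and `updated` properties as strings; the Granary conversion itself is not shown, and the function name is hypothetical.

```python
from datetime import datetime, timezone


def publication_date(post: dict, now=None) -> str:
    """Pick a date for a post: published, else updated, else the current time.

    post is assumed to be an AS2-style dict; now can be injected for testing
    and defaults to the current server time in UTC.
    """
    if post.get("published"):
        return post["published"]
    if post.get("updated"):
        return post["updated"]
    # No date in the feed at all: fall back to the server time so the
    # post still appears in the reader.
    now = now or datetime.now(timezone.utc)
    return now.isoformat()
```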

When all feeds have been polled, the feed titles, links, URLs, and dates are saved in the database.

For each post, the system finds everyone who has subscribed to the author of the feed in which the post is found.

Before a post is saved, there are several “filters” that run. These filters can be set for each feed in the reader. Filters let you decide what content should be saved in your reader. There are three types of filters: most recent k, random k, and keyword. Most recent k returns the most recent 3, 5, or 10 results in the feed; random k returns 3, 5, or 10 random results. Keyword returns results that match one or more keywords or phrases.
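The three filter types could be sketched as follows. This is an illustration under stated assumptions, not the Artemis implementation: posts are dictionaries with `title` and `published` keys, `published` holds ISO 8601 strings in one consistent format so they sort chronologically, and keyword matching is a case-insensitive substring check on the title.

```python
import random

ALLOWED_K = (3, 5, 10)  # the sizes offered for the "k" filters


def most_recent_k(posts, k):
    """Return the k newest posts, newest first."""
    return sorted(posts, key=lambda p: p["published"], reverse=True)[:k]


def random_k(posts, k, rng=None):
    """Return up to k posts chosen at random without repeats."""
    rng = rng or random.Random()
    return rng.sample(list(posts), min(k, len(posts)))


def keyword(posts, keywords):
    """Return posts whose title contains at least one keyword or phrase."""
    lowered = [kw.lower() for kw in keywords]
    return [p for p in posts if any(kw in p["title"].lower() for kw in lowered)]
```

For example, `most_recent_k(posts, 3)` on a feed that published 40 items today would keep only the three newest.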

For each subscriber of the feed associated with each post, a check is run to see if they are in a timezone where it is midnight. If they are, the post is saved in the database; otherwise, the post is skipped. This ensures that users don’t see posts multiple times per day in their feed.

In summary, the overall algorithm is:

  1. Find feed subscribers for each post.
  2. Apply filters.
  3. Make sure it is midnight for the user.
  4. Save the post in the database.
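The four steps above can be sketched as a single loop. The data shapes here are assumptions for illustration: `subscribers` maps a feed URL to `(user_id, tzinfo)` pairs, `passes_filters` is a hypothetical per-post callable standing in for the filters, and matching posts are collected in a list rather than written to a database.

```python
from datetime import datetime, timezone


def posts_to_save(posts, subscribers, now_utc, passes_filters):
    """Return (user_id, post) pairs that should be saved on this run.

    posts: dicts with a "feed_url" key.
    subscribers: dict mapping feed URL -> list of (user_id, tzinfo) pairs.
    passes_filters: hypothetical callable (user_id, post) -> bool.
    """
    saved = []
    for post in posts:
        # 1. Find feed subscribers for the post.
        for user_id, tz in subscribers.get(post["feed_url"], []):
            # 2. Apply the subscriber's filters.
            if not passes_filters(user_id, post):
                continue
            # 3. Make sure it is midnight (local hour 0) for the user.
            if now_utc.astimezone(tz).hour != 0:
                continue
            # 4. Save the post (collected here instead of a database write).
            saved.append((user_id, post))
    return saved
```

A per-post callable is a simplification: filters such as most recent k depend on the whole feed, so a real implementation would likely apply them to the feed's full list of posts before this loop.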

These filters are especially useful for high-volume data sources where you would like either a sample or a keyword-filtered view of the data.

Artemis presently polls over 3,000 feeds across over a dozen timezones. The once-per-day update constraint means the software can be designed a bit differently to other readers. From a software engineering perspective, it is less stressful to maintain a service that does not strive to find the latest information as fast as possible.


