Creating custom feed parsers

If a website does not provide an RSS or Atom feed natively, but nonetheless contains news items structured in a way that could be syndicated, you can construct a tailored feed reader class for that particular website. The parser takes the HTML content, parses it into a structured form (using BeautifulSoup or a similar tool) and then extracts the necessary information. This approach is vulnerable to any structural changes made to the HTML, but it is still better than having no feed at all.

To implement an HTML feed parser, inherit from the BaseFeedReader class and implement the parse_document() method. This method must return a two-element tuple containing:

  • a dictionary of metadata attributes and values (can be just an empty dict)
  • a list of dictionaries, each dictionary representing the constructor keyword arguments for EntryEvent

How the method extracts this information is entirely up to the implementation, but using either lxml.html or BeautifulSoup directly is usually the most robust approach. The implementation needs to return all the events found in the document; filtering out already seen events is handled by the update() method.
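As a rough illustration of the return value described above, the following sketch extracts entries from a hypothetical page layout. Several things here are assumptions rather than part of the library: the markup (links inside <li class="news-item"> elements), the "title" metadata key, and the "title" and "link" entry keys. To stay dependency-free, it uses the standard library's html.parser instead of lxml.html or BeautifulSoup, and it is written as a standalone function; in a real reader the same logic would live in the parse_document() method of a BaseFeedReader subclass, whose exact signature is not shown here.

```python
from html.parser import HTMLParser


class _NewsItemParser(HTMLParser):
    """Collects links found inside <li class="news-item"> elements (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.page_title = None
        self.entries = []
        self._in_title = False
        self._in_item = False
        self._current = None  # the entry dict being built, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "li" and "news-item" in (attrs.get("class") or "").split():
            self._in_item = True
        elif tag == "a" and self._in_item and attrs.get("href"):
            # Use the link target as the unique id of the entry
            self._current = {"id": attrs["href"], "link": attrs["href"], "title": ""}

    def handle_data(self, data):
        if self._in_title:
            self.page_title = (self.page_title or "") + data
        elif self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.entries.append(self._current)
            self._current = None
        elif tag == "li":
            self._in_item = False


def parse_document(document):
    """Return (metadata dict, list of entry keyword-argument dicts)."""
    parser = _NewsItemParser()
    parser.feed(document)
    parser.close()
    metadata = {"title": parser.page_title.strip()} if parser.page_title else {}
    # Every entry found is returned; no filtering of seen entries happens here
    return metadata, parser.entries
```

Calling parse_document() on a page with two such list items would yield a metadata dictionary and a list of two entry dictionaries, each carrying at least the required id.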

The only required piece of information for each event is its id. This unique identifier is used to prevent already seen events from being dispatched again through the event_discovered signal of the feed. Other than that, you can fill in as many of the fields of EntryEvent as you like, or subclass EntryEvent to add extra attributes.
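To show why a stable id matters, here is a conceptual sketch of id-based filtering. It is not the library's actual update() implementation; the function name filter_new_entries and the use of a plain set for seen ids are hypothetical choices made for illustration.

```python
def filter_new_entries(entries, seen_ids):
    """Return only the entries whose id has not been seen, recording new ids."""
    new_entries = []
    for entry in entries:
        if entry["id"] not in seen_ids:
            seen_ids.add(entry["id"])
            new_entries.append(entry)
    return new_entries


# First pass: both entries are new
seen = set()
first = filter_new_entries([{"id": "/news/1"}, {"id": "/news/2"}], seen)

# Second pass: the page still lists /news/2, plus one new item
second = filter_new_entries([{"id": "/news/2"}, {"id": "/news/3"}], seen)
```

If a parser generated a different id for the same news item on each run (for example, one derived from its position on the page), every item would look new each time, so the id should come from something stable such as the item's URL.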

An example of a custom feed reader has been provided in examples/custom_html.py.