Identity in RSS

As an aggregator developer, one of the primary problems you need to solve is how to identify post items from an RSS feed. At the most basic level, you need to know which items you've seen before, because nothing frustrates a user more than seeing the same news items over and over again. Once you have a way to identify posts, there are lots of other things you can do with them, so there are plenty of good reasons to solve this problem.

RSS 2.0, RSS 1.0, and Atom all provide a way to handle post identity: the <guid> element, the rdf:about attribute, and the <atom:id> element, respectively. Unfortunately, not everyone provides this metadata, and some who do get it wrong: for instance, CNN doesn't give you GUIDs, the Cincom weblogs just use big integers (these look like they might be dates, but I'm not sure), and PHP.NET is re-using the rdf:about attribute on different posts.

Taking those problems from last to first: if you identify posts by GUID, re-using a GUID amounts to modifying a post, though that doesn't seem to be the intent in this case. Using big integers is poor practice, because an integer isn't a GUID. Recall that the GU part stands for globally unique: if you use integers as GUIDs, you're just hoping that there won't be a collision, especially if your scheme is to increment a counter with each new post. If you're going to use an integer for a GUID, use a really big one (128 bits or so), and use an algorithm appropriate for the purpose: counters are not appropriate. Mark Pilgrim wrote a comprehensive discussion of how to create a good ID in Atom, and his advice holds for RSS feeds too.

In the first case, where there is no GUID at all, you have to fall back on some other method. You can assume that the <link> is as good as a GUID, or you can generate a hash of the content. A hash will tell you if you've seen the exact same item before, but the item will show up as new if the publisher edits it.

I did a little research on this: of all the news items listed as "current" in NewsGator Online (items that are currently in some feed someone has subscribed to in NGO), roughly 1/3 lack GUIDs; I don't have any statistics on how many of the rest are really "globally unique". The as-yet-undocumented Google API will create its own IDs for feeds that lack them, apparently in the form of a SHA-1 hash. People wonder why server-based aggregators invent their own ID scheme; the reason is that you can't trust publishers to do it for you.
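The fallback order described above (publisher ID first, then the link, then a content hash) might be sketched in Python roughly as follows. This is only an illustration, not how NewsGator or Google actually do it: the function name `item_identity`, its parameters, and the choice to scope keys by feed URL are all assumptions of mine.

```python
import hashlib


def item_identity(feed_url, guid=None, link=None, title="", body=""):
    """Return a (kind, key) pair identifying a feed item.

    Hypothetical helper: tries the publisher's ID, then the link,
    then a SHA-1 hash of the content.
    """
    if guid and guid.strip():
        # Assumption: scope the ID by feed URL, since some publishers
        # use plain counters that can collide across feeds.
        return ("guid", feed_url + "#" + guid.strip())
    if link and link.strip():
        # Treat the link as if it were a GUID.
        return ("link", link.strip())
    # Last resort: hash the content. An edited item hashes differently,
    # so it will look like a new item.
    digest = hashlib.sha1((title + "\n" + body).encode("utf-8")).hexdigest()
    return ("hash", feed_url + "#" + digest)


seen = set()
key = item_identity("http://example.com/rss", link="http://example.com/post/1")
is_new = key not in seen
seen.add(key)
```

Scoping the publisher's ID by feed URL guards against counter-style IDs colliding, at the cost of treating the same item in two different feeds as two items.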

One thing Mark pointed out is that the same item in multiple feeds should have the same GUID. Take, for instance, the entry from Sam Ruby's weblog known as tag:intertwingly.net,2004:2131. That item is also known as http://www.intertwingly.net/blog/2131.html in the RSS 1.0 and 2.0 feeds. Sam's not alone in this: TypePad weblogs also create different IDs in different feeds (compare the Atom and RDF feeds for the NewsGator API weblog).
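On the publishing side, one way to keep IDs consistent is to build the ID once per entry and emit that same string in every feed format. The sketch below assumes a tag: URI in the style of Sam's Atom ID; `tag_uri` and `render_ids` are hypothetical names, and putting a tag: URI in rdf:about is a judgment call, since RSS 1.0 feeds conventionally use the item's URL there.

```python
from xml.sax.saxutils import escape, quoteattr


def tag_uri(authority, date, specific):
    """Build a tag: URI of the form tag:authority,date:specific."""
    return "tag:%s,%s:%s" % (authority, date, specific)


def render_ids(entry_id):
    """Render one entry ID as the identity markup for each feed format."""
    return {
        "atom": "<id>%s</id>" % escape(entry_id),
        # isPermaLink="false" tells readers the GUID is not a URL to visit.
        "rss2": '<guid isPermaLink="false">%s</guid>' % escape(entry_id),
        # Assumption: reuse the same ID here rather than the item's link.
        "rss1": "<item rdf:about=%s>" % quoteattr(entry_id),
    }


entry_id = tag_uri("intertwingly.net", "2004", "2131")
markup = render_ids(entry_id)
```

An aggregator that keys on the publisher's ID would then see one item regardless of which feed it arrived in.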

— Gordon Weakliem at permanent link