Scraping RSS

I mentioned last week that I was working on an RSS version of Richard Thompson's upcoming tour dates page. Conceptually, this is pretty easy: Richard lists tour dates in rows in a table, one row to an engagement. Each row contains things like the name of the venue and possibly a link to it, the location of the venue, and the actual date of the engagement. I've completed a version of this page, rendered as RSS. Don't go hitting the feed validator just yet, though: this feed is invalid RSS. Go ahead and subscribe if you like, and find out when Richard will be near you. Trust me, you haven't lived until you've heard his rendition of Oops!... I Did It Again.

This exercise brought up a couple of issues with generating RSS from a static page. First of all, I used the date of the engagement for the date of the RSS item. This is probably an abuse of the spec, but it seemed to me to be the important info, so I went with it. However, a valid RSS date also needs a time, and Richard doesn't list the stage time, which would be the sensible thing to use here. I could invent a time (say, 8:00 PM), but that brings in timezone issues, since each venue is in its own timezone, and that's probably impossible to resolve from the page alone. The other alternative would be to set the item date to the date I first saw the item, but I'd have to save some info on the server to determine that, which is inconvenient, though not insurmountable.
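To make the date problem concrete, here is a minimal Python sketch of the "invent a time" option. It is not what the feed currently does. The function name `engagement_pub_date` is made up for illustration, and pinning the time to midnight UTC is purely an assumption that sidesteps the venue timezone question rather than answering it.

```python
from datetime import date, datetime, time, timezone
from email.utils import format_datetime


def engagement_pub_date(day: date) -> str:
    """Return an RFC 822 style date-time string for a date-only engagement."""
    # Assumption: midnight UTC stands in for the unknown stage time.
    stamp = datetime.combine(day, time(0, 0), tzinfo=timezone.utc)
    # format_datetime produces the "Thu, 19 Aug 2004 00:00:00 +0000" shape
    # that RSS date elements expect.
    return format_datetime(stamp)


print(engagement_pub_date(date(2004, 8, 19)))
# prints: Thu, 19 Aug 2004 00:00:00 +0000
```

The result validates, but the time is fiction, so an aggregator that sorts by date could still show an evening show as if it started at midnight.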

The second issue is with handling updates. I simply scrape the entire page and regenerate the RSS feed on every request. However, Richard doesn't add dates all that often, so a client polling my feed each hour means I'm pinging his server each hour, looking for updates that usually aren't there. I thought that I could help this situation by passing back the Last-Modified and ETag headers from his page with the feed, then passing the client's If-Modified-Since and If-None-Match headers through to his server on subsequent polls. However, his page is dynamically generated and doesn't support conditional GET. So I really have no alternative but to scrape the page each time to figure out if there were updates. The best alternative I can think of is to schedule a cron job to retrieve the tour dates page from Richard's server, scrape it, and generate a static RSS file once a day. As it is, I gave the feed a <ttl> of 1440 (the value is in minutes, so once a day). I don't know how many aggregators will pay attention to this.
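Even though the source page doesn't support conditional GET, the feed itself could. Here is a rough Python sketch of that idea: derive an ETag from the generated feed text and answer 304 when the client already has that version. The names `feed_etag` and `respond` are hypothetical, the comparison is simplified (it ignores weak validators and comma-separated lists in If-None-Match), and it only works if the feed text contains nothing that changes on every run, such as a build timestamp.

```python
import hashlib
from typing import Dict, Optional, Tuple


def feed_etag(feed_xml: str) -> str:
    """Derive a quoted ETag value from the feed text itself."""
    digest = hashlib.sha256(feed_xml.encode("utf-8")).hexdigest()
    return '"' + digest[:16] + '"'


def respond(feed_xml: str, if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], str]:
    """Return (status, headers, body) for one poll of the feed."""
    etag = feed_etag(feed_xml)
    headers = {"ETag": etag}
    # Simplified match: the client sent back exactly the ETag we issued.
    if if_none_match is not None and if_none_match.strip() == etag:
        return 304, headers, ""
    return 200, headers, feed_xml


status, headers, body = respond("<rss/>", None)            # first poll: 200
status2, _, body2 = respond("<rss/>", headers["ETag"])     # unchanged: 304
```

This saves bandwidth between my server and the aggregator, but it does nothing for Richard's server: the page still has to be scraped to produce the text that gets hashed, unless that step moves into the once-a-day cron job.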

Finally, I had an issue with permalinks. I omitted the <guid> element, since I didn't have a unique identifying URL for each item. Instead, I used <link>, pointing it at the venue link that Richard provided. However, he doesn't always provide a link, so I fall back on the link to the page I scraped. As James Robertson mentioned, this creates a problem for determining unread items if the venue URL is omitted on multiple engagements, because those items all end up with the same link. I could use a guid element, but I don't have an obvious candidate for uniqueness. One possibility is the date itself, since Richard probably doesn't play two engagements in a day, and according to the spec, the guid element is just an opaque string: it needs to be unique, not significant. However, if I did this sort of thing for other artists, the date alone wouldn't be unique when two of them played on the same day, so I'd probably have to wrap the date up in a tag:-style URI.
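As a rough illustration of the tag: idea, the Python sketch below builds a guid from an artist identifier and the engagement date. Everything specific in it is an assumption: the function name `engagement_guid`, the `example.org` authority, the `2004` minting date, and the `richard-thompson` slug are placeholders, not values the feed actually uses.

```python
from datetime import date
from urllib.parse import quote


def engagement_guid(authority: str, minted: str, artist: str, day: date) -> str:
    """Build a tag: URI of the form tag:authority,date:specific."""
    # The artist slug keeps two artists playing on the same day distinct;
    # quote() escapes any characters that are not safe in a URI.
    specific = f"{quote(artist, safe='')}/{day.isoformat()}"
    return f"tag:{authority},{minted}:{specific}"


print(engagement_guid("example.org", "2004", "richard-thompson", date(2004, 8, 19)))
# prints: tag:example.org,2004:richard-thompson/2004-08-19
```

Since a tag: URI isn't something a browser can open, the element would need isPermaLink="false" so aggregators don't treat it as a link. This still assumes one engagement per artist per day; a matinee and an evening show would need something extra in the specific part.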

— Gordon Weakliem at permanent link

Bad Timing

A friend at work noticed that Google is giving this error for all searches this morning:

    The service you requested is not available at this time.

    Service error -27.

Someone else pointed out that Google is scheduled to IPO today. Man, talk about bad timing.

— Gordon Weakliem at permanent link