Scaling On The Cheap · 14 February, 09:43 AM
Yesterday, Greg wrote about NewsGator’s feed retrieval intervals, which brought up a topic that I’ve been meaning to write about for a while now. Greg’s post applies specifically to an online aggregator, but the underlying problem is shared by a number of online applications. If you depend on external data sources, they represent a potential bottleneck.
An online aggregator essentially maintains a list of feeds that people are subscribed to and periodically polls those feeds for news. The obvious architecture is to act like a desktop aggregator: poll each feed hourly and store any new posts. You’ll probably create a daemon or service to handle the polling, so the front end isn’t concerned with that process. This works fine for the first 1,000 users. Then, at about 10,000 users, you find you need to scale out and add more threads, more daemons, more servers, to keep up the hourly polling. Then your subscription list grows into the hundreds of thousands of feeds, and as you scale out, you find that horizontal scaling has diminishing, roughly logarithmic benefit: twice as many servers don’t work twice as fast. At a million feeds, you’re making around 16,000 HTTP requests a minute (close to 280 a second), and buying new servers is getting expensive.
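As a rough check on the load, here is a back-of-the-envelope sketch in Python. It assumes exactly one request per feed per hour, spread evenly, with no retries or redirects; the function name is just for illustration.

```python
# Back-of-the-envelope load for polling every feed once per interval.
SECONDS_PER_HOUR = 3600


def requests_per_second(feed_count, poll_interval_seconds=SECONDS_PER_HOUR):
    """Average request rate needed to poll every feed once per interval."""
    return feed_count / poll_interval_seconds


for feeds in (1_000, 100_000, 1_000_000):
    per_second = requests_per_second(feeds)
    print(f"{feeds:>9,} feeds: {per_second:7.1f} req/s, {per_second * 60:9.0f} req/min")
```

The average hides the real trouble, which is that slow and broken feeds hold connections open far longer than healthy ones.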
What Greg’s outlining is simply re-architecting as the result of observing a few key things:
- Some feeds are just plain broken, possibly permanently. If you’ve polled a feed for a week straight and it returns 401, or 500, or some other error (or possibly a DNS failure), it’s probably not going to be returning 200 an hour from now.
- Some feeds are more subtly broken. The server might be slow, which causes your feed fetcher to tie up at least a socket, and possibly a thread, and possibly other resources.
- Other feeds simply have a lot of data in them. In one case, I ran into a Syndic8 feed that had something like 100,000 items and was about 25MB to download.
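One way to defend against these three kinds of bad feed is to back off on repeated failures and to cap how much you are willing to download. The sketch below is my own illustration, not what NewsGator does; the doubling rule, the weekly ceiling, and the 5MB cap are all assumed values.

```python
import io

BASE_INTERVAL = 3600              # seconds; the hourly poll described above
MAX_INTERVAL = 7 * 24 * 3600      # assumed ceiling: retry broken feeds weekly
MAX_FEED_BYTES = 5 * 1024 * 1024  # assumed cap on a single download


def retry_delay(consecutive_failures):
    """Double the polling delay for each consecutive failure, up to a ceiling."""
    delay = BASE_INTERVAL * (2 ** consecutive_failures)
    return min(delay, MAX_INTERVAL)


def read_capped(stream, max_bytes=MAX_FEED_BYTES, chunk_size=64 * 1024):
    """Read a file-like object in chunks, giving up once max_bytes is exceeded."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            raise ValueError(f"feed larger than {max_bytes} bytes")
        chunks.append(chunk)
    return b"".join(chunks)


# Stand-in for an HTTP response body, so the sketch runs without a network.
body = read_capped(io.BytesIO(b"<rss/>"))
```

In a real fetcher the stream would be an HTTP response, for example the object returned by `urllib.request.urlopen(url, timeout=10)`. The timeout keeps a slow server from holding a socket indefinitely, though it bounds each blocking socket operation rather than the whole download.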
With a few feeds, remote performance problems are no big deal. With hundreds of thousands of feeds, a small percentage of bad feeds can really cause your feed fetcher to fall behind. If a big site like LiveJournal or Typepad has a bad day, you’ll never catch up. But that is only one class of problem. The other is that you can spend a lot of time trying to get content that isn’t there, or that nobody wants:
- Some feeds are just fine, but don’t update regularly. RSS provides (flawed) elements that hint at update times and frequency, and HTTP caching headers offer the same kind of hint. But you can still observe update frequency and determine that if someone historically updates weekly, they don’t need hourly polling. In other cases, the feed’s simply abandoned. If it hasn’t updated in a long time, you might want to check for updates from time to time, but there’s no point checking hourly.
- Other feeds are fine, and might even update regularly, but have only one or two subscribers. In an online aggregator, the 80/20 rule is still in effect: a few feeds have the bulk of the subscribers. If you offer keyword searches and similar features, you’d like to index content from even the feeds with few subscribers, so that content shows up in search results. Still, those feeds don’t need to be polled that often.
- Some feeds have no active subscribers. Again, you might want to index them, but if there are no active subscribers, there’s no need to be in a big hurry to get those updates. In the worst case, there’s a reason some feeds have only one subscriber. If you offer free accounts, spammers will create accounts with a subscription list of tens or hundreds of splogs, just so you can index them. As Mark Pilgrim famously said, spammers don’t read weblogs, they write to them.
- Conversely, some feeds have a LOT of subscribers. You want to make sure that those updates get processed regularly, because a lot of people have indicated an interest in getting those updates.
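The observations in this list can be folded into a per-feed polling interval. This is a minimal sketch under my own assumptions: the `FeedStats` fields, the 90-day abandonment cutoff, and every threshold below are hypothetical numbers you would tune against real data.

```python
from dataclasses import dataclass

HOUR = 3600
DAY = 24 * HOUR


@dataclass
class FeedStats:
    active_subscribers: int         # subscribers who have read recently
    avg_update_gap: float           # observed seconds between new posts
    seconds_since_last_post: float  # age of the newest post


def poll_interval(stats):
    """Pick a polling interval in seconds from observed feed statistics."""
    # Abandoned feeds: nothing new in 90 days, so check only weekly.
    if stats.seconds_since_last_post > 90 * DAY:
        return 7 * DAY
    # Poll at about half the typical gap between posts, between hourly and daily.
    interval = min(max(stats.avg_update_gap / 2, HOUR), DAY)
    # Feeds nobody is reading are polled only to keep the search index fresh.
    if stats.active_subscribers == 0:
        return max(interval, DAY)
    # Feeds with one or two subscribers can wait longer than popular ones.
    if stats.active_subscribers <= 2:
        return max(interval, 6 * HOUR)
    return interval


popular = FeedStats(active_subscribers=5000, avg_update_gap=1800,
                    seconds_since_last_post=600)
weekly = FeedStats(active_subscribers=50, avg_update_gap=7 * DAY,
                   seconds_since_last_post=2 * DAY)
print(poll_interval(popular), poll_interval(weekly))
```

A popular, busy feed stays on the hourly schedule, while a weekly updater drops to a daily check, so fetcher capacity goes where people are actually reading.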
The larger point is that this extends beyond the problem of building an online aggregator. If you have a system that depends on retrieving data from remote sources, particularly sources out of your control, you can use a few statistics about those sources to get some cheap scalability.
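Once each source has its own interval, the fetcher no longer walks the whole list every hour. It only needs to know what is due next. Here is a hedged sketch of that idea using a heap; the `PollScheduler` class and the example URLs are hypothetical, and a production system would also need persistence and locking across workers.

```python
import heapq


class PollScheduler:
    """Min-heap of (due_time, source_url): the earliest-due source comes first."""

    def __init__(self):
        self._heap = []

    def schedule(self, source_url, due_at):
        """Queue a source to be fetched at or after due_at (seconds)."""
        heapq.heappush(self._heap, (due_at, source_url))

    def pop_due(self, now):
        """Remove and return every source whose due time has passed, earliest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due


scheduler = PollScheduler()
scheduler.schedule("http://example.com/popular.xml", due_at=3600)
scheduler.schedule("http://example.com/quiet.xml", due_at=86400)
first_batch = scheduler.pop_due(now=4000)
```

After each fetch, the worker reschedules the source with whatever interval its statistics suggest, so quiet or broken sources drift to the back of the queue on their own.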
— Gordon Weakliem