RSS doesn't scale? Numbers please.

There's been a bit of blogger hubbub about MSDN not publishing full RSS feeds any more because of bandwidth issues. It looks like this only affects the "combined blogs" feed; the individual blog feeds still have full text.

After some discussion on how others deal with bandwidth issues, James Robertson openly questioned whether MSDN uses the conditional-get or compression features of HTTP to save bandwidth. In his comments, Dare Obasanjo confirmed that they do.

But James was skeptical and cooked up some Smalltalk code to see for himself. Turns out MSDN doesn't use conditional-get or compression after all. No one from Microsoft has responded to this evidence yet.

Personally, I'm surprised no one has come up with some hard numbers.

How much bandwidth would MSDN save with compression and conditional-get? From what I understood, conditional-get doesn't save them anything with their combined feed (an RSS feed of all blogs put together), since it changes way too often. You have to pull down a new version every time.

For individual blogs, conditional-get would work well. To get some rough numbers, figure out the average number of posts per day per blog and then figure out how many requests per day a blog gets. Use these to find a rough percentage of requests that could be answered with "not changed" (HTTP 304) by conditional-get and calculate the bandwidth savings based on that. Heck, the server could keep track of not-changed responses (HTTP 304) vs. full RSS feed responses and you could make exact calculations.
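Here's a rough sketch of that back-of-the-envelope calculation in Python. Everything in it is an assumption on my part: the function name, the 300-byte size of a 304 response, and the worst-case idea that every post costs every reader one full download. It's a starting point for plugging in real numbers, not a measurement of anything.

```python
def estimate_savings(posts_per_day, polls_per_reader_per_day, readers,
                     feed_bytes, not_modified_bytes=300):
    """Estimate daily bandwidth for one blog with and without conditional-get.

    Assumes every reader polls at an even rate and that each post lands
    between two different polls, so a reader needs at most one full
    download per post (and never more than one per poll).
    The 300-byte size for a 304 response is a guess, not a measured value.
    """
    total_requests = readers * polls_per_reader_per_day
    full_per_reader = min(posts_per_day, polls_per_reader_per_day)
    full_downloads = readers * full_per_reader
    not_modified = total_requests - full_downloads

    bytes_without = total_requests * feed_bytes
    bytes_with = full_downloads * feed_bytes + not_modified * not_modified_bytes

    return {
        "not_modified_fraction": not_modified / total_requests,
        "bytes_without": bytes_without,
        "bytes_with": bytes_with,
        "savings_fraction": 1 - bytes_with / bytes_without,
    }


if __name__ == "__main__":
    # Made-up inputs: 2 posts a day, 1000 readers polling hourly, 50 KB feed.
    result = estimate_savings(posts_per_day=2, polls_per_reader_per_day=24,
                              readers=1000, feed_bytes=50_000)
    for key, value in result.items():
        print(key, value)
```

With server logs you could replace the estimated 304 fraction with the real one and skip the guessing entirely.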

Compression on text like XML can be very good -- like 50%+ (try zipping up a few xml/html documents in WinZip and you'll see what I mean) and the longer a document is, the better it will compress because the dictionary is used more often. The difference between sending a terabyte and a half terabyte is significant.
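If you'd rather not reach for WinZip, something like the following Python sketch shows the same effect with gzip, which is the same family of compression HTTP servers typically offer. The feed here is synthetic and made up by me (`make_feed` and the example.com links are not from any real blog), and it is far more repetitive than real posts, so it flatters the compression ratio. Treat it as an illustration of the trend, not of actual numbers.

```python
import gzip


def make_feed(n_items):
    """Build a fake RSS document with n_items nearly identical items."""
    items = "".join(
        f"<item><title>Post {i}</title>"
        f"<link>http://example.com/posts/{i}</link>"
        f"<description>Some text for post number {i}.</description></item>"
        for i in range(n_items)
    )
    doc = (
        '<rss version="2.0"><channel><title>Example</title>'
        f"{items}</channel></rss>"
    )
    return doc.encode("utf-8")


def compression_ratio(data):
    """Compressed size divided by original size (smaller is better)."""
    return len(gzip.compress(data)) / len(data)


if __name__ == "__main__":
    for n in (1, 10, 100):
        feed = make_feed(n)
        print(n, "items:", len(feed), "bytes, ratio",
              round(compression_ratio(feed), 3))
```

Running it against a saved copy of a real feed would give a much more honest ratio.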

Does this change any of the "scaling" issues of RSS? Documents will compress at the same % no matter how many people read them, so it scales at the same rate as uncompressed: linearly, with less overall volume.

Conditional-get also scales linearly as traffic increases, assuming all other things are equal. Often not all things are equal, and it does not scale linearly as users increase because of:

1. how often a person posts to their blog and changes the RSS feed (feed entropy)
2. how often each user's feed reader is pulling down the feed and how many times they get the full feed vs. a conditional-get "not changed" 304. If the server doesn't support conditional-get, the feed is redownloaded every time, increasing bandwidth costs.
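For anyone who hasn't looked at what conditional-get does on the wire, here is a minimal sketch of the server side in Python. It only handles the `ETag` / `If-None-Match` pair; real servers also deal with `If-Modified-Since`, weak validators and lists of tags. The function names and the choice of a SHA-1 hash for the tag are my own assumptions, not how MSDN or anyone else does it.

```python
import hashlib


def make_etag(feed_body):
    """Derive a validator from the feed contents (hash choice is arbitrary)."""
    return '"' + hashlib.sha1(feed_body).hexdigest() + '"'


def serve_feed(request_headers, feed_body):
    """Return (status, headers, body) for a feed request.

    If the reader sends back the tag it saw last time and the feed has not
    changed, answer 304 with an empty body instead of the full feed.
    """
    etag = make_etag(feed_body)
    if request_headers.get("If-None-Match") == etag:
        return 304, {"ETag": etag}, b""
    headers = {"ETag": etag, "Content-Length": str(len(feed_body))}
    return 200, headers, feed_body


if __name__ == "__main__":
    feed = b"<rss><channel><title>Example</title></channel></rss>"
    status, headers, body = serve_feed({}, feed)
    print(status, len(body))
    # The reader repeats the request, echoing the tag it was given.
    status, headers, body = serve_feed({"If-None-Match": headers["ETag"]}, feed)
    print(status, len(body))
```

The second request costs a few hundred bytes of headers instead of the whole feed, which is where the savings in point 2 come from.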

Number 2 can be controlled the way Slashdot does it. If a user reads Slashdot's RSS feed too often, the user is blocked temporarily. I like this solution because it puts the onus on the user to correct the behaviour of their feed reader. If they want to read the feed badly enough, they will fix their reader settings. This is good for Slashdot because if you read the feed every 30 minutes or whatever the minimum is, 99.9% of the time it has changed. Conditional-get would buy Slashdot almost nothing.
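I don't know how Slashdot actually implements this, but the idea can be sketched in a few lines of Python. The class name, the per-client bookkeeping and the 30-minute / 60-minute numbers are all hypothetical; timestamps are passed in as plain seconds to keep the sketch easy to follow.

```python
class FeedRateLimiter:
    """Temporarily block clients that poll a feed too often (a sketch)."""

    def __init__(self, min_interval, ban_duration):
        self.min_interval = min_interval    # seconds required between polls
        self.ban_duration = ban_duration    # seconds a client stays blocked
        self.last_seen = {}                 # client -> time of last counted poll
        self.banned_until = {}              # client -> time the block ends

    def allow(self, client, now):
        """Return True if this request should be served."""
        if now < self.banned_until.get(client, 0):
            return False
        last = self.last_seen.get(client)
        self.last_seen[client] = now
        if last is not None and now - last < self.min_interval:
            # Polled again too soon: block the client for a while.
            self.banned_until[client] = now + self.ban_duration
            return False
        return True


if __name__ == "__main__":
    limiter = FeedRateLimiter(min_interval=30 * 60, ban_duration=60 * 60)
    for t in (0, 60, 2000, 3660):
        print(t, limiter.allow("reader-a", t))
```

A real server would key this on IP address or user agent and would need to expire old entries, which I've left out.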

Unfortunately, that solution is not completely user-friendly. It would be nice if the server would tell the feed reader "hey, you can only read me once every 60 minutes!" and the feed reader corrected itself automatically for each feed. Maybe this already happens, I don't know.
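As far as I can tell, RSS 2.0 does define an optional `<ttl>` channel element, a number of minutes the feed can be cached before refreshing. Whether a given feed reader honours it is another matter. A reader that wanted to behave could do something like this Python sketch; the function name and the 60-minute fallback are my own assumptions.

```python
import xml.etree.ElementTree as ET


def polling_interval_minutes(rss_text, default=60):
    """Return the feed's <ttl> in minutes, or a default if it is missing or invalid."""
    channel = ET.fromstring(rss_text).find("channel")
    if channel is None:
        return default
    try:
        ttl = int(channel.findtext("ttl"))
    except (TypeError, ValueError):
        # No <ttl> element at all, or its text is not a whole number.
        return default
    return ttl if ttl > 0 else default


if __name__ == "__main__":
    feed = (
        '<rss version="2.0"><channel>'
        "<title>Example</title><ttl>120</ttl>"
        "</channel></rss>"
    )
    print(polling_interval_minutes(feed))
```

That only helps if feed authors set the value and readers respect it, so the Slashdot-style block is still the only thing with teeth.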

Number 1, what I like to call feed entropy, is controlled by the RSS feed author. Conditional-get savings could be improved if a blogger made 5 posts a day all at the same time instead of spread out over the course of the day. Then feed readers would only have to get one version of the feed (the rest of the requests would return 304) instead of 5 new full feed downloads that day for that user. That's not a linear savings either, it's big.
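To make the batching effect concrete, here is a small Python sketch that counts how many polls in a day would need a full download for one reader. The hourly polling schedule and the post times are invented for illustration, and `full_downloads` is a name I made up.

```python
def full_downloads(post_times, poll_times):
    """Count polls that find a feed newer than the one seen at the previous poll.

    Times are hours since midnight. Every other poll could be answered
    with a 304 if the server supports conditional-get.
    """
    count = 0
    posts_seen = 0  # the reader starts the day with an up-to-date copy
    for poll in sorted(poll_times):
        posts_so_far = sum(1 for post in post_times if post <= poll)
        if posts_so_far != posts_seen:
            count += 1
            posts_seen = posts_so_far
    return count


if __name__ == "__main__":
    hourly_polls = list(range(24))
    spread_out = [8.5, 10.5, 12.5, 14.5, 16.5]   # 5 posts over the day
    all_at_once = [8.5] * 5                      # same 5 posts in one batch
    print("spread out:", full_downloads(spread_out, hourly_polls))
    print("all at once:", full_downloads(all_at_once, hourly_polls))
```

Multiply the difference by the feed size and the number of readers and you get the bandwidth a batching publisher would avoid.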

Is it realistic to expect people to write blog posts like this? No. Could commercial sites like Salon and Wired take advantage of this? Sure, I bet they already do.

I'd really like it if someone in-the-know gave me some numbers and math so I could make an educated and informed decision about RSS/HTTP deadness and scalability. Until I see real numbers based on real traffic, I don't think I'll be completely convinced it's dead in the water.

Posted at September 10, 2004 at 12:36 PM EST
Last updated September 10, 2004 at 12:36 PM EST