HTTP Digest Demoronizer
RFC 2617 is underspecified when it comes to a username or password that contains characters outside the 7-bit ASCII character set. The problem is twofold. First, you have to know what encoding to use to write the username and password when generating the Authorization header. Second, in the case of HTTP Digest, the server and client also have to agree on an encoding when generating the authorization digest - a single bit of difference will cause authentication to fail. I had vaguely known this was a problem, but recently it was brought into sharp focus. The funny thing is that Google is almost entirely silent on this problem. I can't be the only person trying to do Digest authentication with non-ASCII credentials. There are a few hints out there (RFC 2831, for example), but it's like I've just stumbled onto a secret club.
HTTP Digest is the real problem, because it can fail in two ways - you can encode the username parameter incorrectly, or you can get the encoding wrong when generating hashes - and this is true on both the client side and the server side. Firefox seems to use UTF-8, come hell or high water, so if your server is assuming UTF-8 when working with Authorization and WWW-Authenticate headers, you're OK. Python's urllib2.AbstractDigestAuthHandler seems to be using ASCII encoding, so it won't even send the request - the request throws an exception.
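To see why the hash is so sensitive, here is a small Python sketch that encodes the same username:realm:password string two ways and hashes each. The credentials and realm are made up for illustration; only the use of MD5 over the colon-joined string comes from RFC 2617.

```python
import hashlib

# Hypothetical credentials containing non-ASCII characters.
username, realm, password = "jos\u00e9", "example", "s\u00e9cret"
a1 = f"{username}:{realm}:{password}"

# The same text turns into different bytes under each encoding.
utf8_bytes = a1.encode("utf-8")
cp1252_bytes = a1.encode("cp1252")

# Different bytes in, different MD5 out.
ha1_utf8 = hashlib.md5(utf8_bytes).hexdigest()
ha1_cp1252 = hashlib.md5(cp1252_bytes).hexdigest()

print(utf8_bytes)
print(cp1252_bytes)
print("hashes match:", ha1_utf8 == ha1_cp1252)

# A strict ASCII encoder refuses the string outright.
try:
    a1.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print("ASCII can encode it:", ascii_ok)
```

If client and server pick different rows from this experiment, the digest will never match, no matter how correct the password is.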
It turns out that for .NET clients, you can almost pull this off with HTTP Digest. .NET will understand non-ASCII characters, IF you put an undocumented (at least in RFC 2617) parameter, charset=utf-8, in the WWW-Authenticate challenge. You can't put a space before "charset", even though you can before other parameters. You can enclose utf-8 in double quotes, or not. If you do this, .NET will submit the username parameter UTF-8 encoded on the next Authorization: header. It will even set the charset parameter to UTF-8, so you know the encoding explicitly.
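For concreteness, here is a rough Python sketch of a server building such a challenge. The function name build_challenge, the realm and the qop value are all made up for illustration, and I'm assuming that "no space before charset" means the parameter follows the comma directly.

```python
import secrets
from typing import Optional


def build_challenge(realm: str, nonce: Optional[str] = None) -> str:
    """Build a WWW-Authenticate value that ends with charset=utf-8.

    Hypothetical helper: the realm is assumed to contain no double
    quotes, so no escaping is done here.
    """
    nonce = nonce or secrets.token_hex(16)
    params = f'realm="{realm}", nonce="{nonce}", qop="auth"'
    # Assumption: charset is appended with no space after the comma.
    return f"Digest {params},charset=utf-8"


header = build_challenge("example", nonce="abc123")
print("WWW-Authenticate:", header)
```

The charset parameter goes last here only to keep the no-space quirk in one obvious place; whether its position matters to .NET is something I have not checked.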
Except... the response parameter in the Authorization header is still borked. The response is built from a first hash of the username, realm and password, joined by colons, combined with the nonce and a hash of other request parameters. BUT the first hash is apparently generated using Encoding.Default (e.g. Windows-1252 on my machine), not UTF-8 as one might think. I say "apparently" because looking in the disassembly, it looks like they just copy bytes from the string into a buffer to be hashed, instead of using Encoding.GetBytes(string). Whatever encoding that is, it's not UTF-8, at least not on my box. So all the hard work of interpreting the charset parameter is for nothing - the server still has to guess at a charset to use to calculate the digest.
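Here is a minimal Python sketch of the response calculation as I read RFC 2617 for qop=auth with MD5, with the encoding of the first hash pulled out as a parameter. The function names are my own, and I'm assuming the method and URI are plain ASCII.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def digest_response(username, realm, password, method, uri,
                    nonce, nc, cnonce, qop="auth", encoding="utf-8"):
    """Compute the Digest response value for qop=auth.

    Only the first hash depends on the credential encoding, which is
    exactly where client and server can disagree.
    """
    # First hash: username:realm:password, in the chosen encoding.
    ha1 = md5_hex(f"{username}:{realm}:{password}".encode(encoding))
    # Second hash: method and URI (assumed to be ASCII in this sketch).
    ha2 = md5_hex(f"{method}:{uri}".encode("ascii"))
    # Final hash: hex digests and nonce values, all ASCII.
    final = f"{ha1}:{nonce}:{nc}:{cnonce}:{qop}:{ha2}"
    return md5_hex(final.encode("ascii"))


args = ("jos\u00e9", "example", "s\u00e9cret", "GET", "/",
        "abc123", "00000001", "deadbeef")
print(digest_response(*args, encoding="utf-8"))
print(digest_response(*args, encoding="cp1252"))
```

The two printed values differ, and nothing in the Authorization header tells the server which one the client actually computed.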
About the only thing I can think of to work around this on the server side is a demoronizer: attempt the hash with UTF-8 first, then Windows-1252. But outside of the bug, there are a couple of lessons here. First, I had to resort to Reflector to see exactly what the framework was doing, because I don't have source for the framework. This behavior is buried deep in System.Net, inside an internal class, and I can't test it in isolation - it's a black box inside a black box. Source code would have made this much simpler to debug. The second is that because of the nature of the framework and the classes in question (signed assemblies, sealed classes, undocumented internal classes), I have to rewrite a whole lot of code - i.e. everything under System.Net.HttpWebRequest - to fix one bug. Python gets it wrong as well, but in Python, I can subclass urllib2.HTTPDigestAuthHandler and supply a new version of get_authorization, or better yet, submit a patch.
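The demoronizer itself could look something like the Python sketch below. It assumes the Authorization header has already been parsed into a dict called params, that qop=auth with MD5 is in use, and that the server can look up the plaintext password; all names here are hypothetical.

```python
import hashlib
import hmac

# Encodings to try, in order: what the spec-minded clients send first,
# then what .NET appears to use on a Western European Windows box.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252")


def _md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def expected_response(params, password, method, encoding):
    """Recompute the response, encoding the first hash as requested."""
    a1 = f"{params['username']}:{params['realm']}:{password}"
    ha1 = _md5_hex(a1.encode(encoding))
    ha2 = _md5_hex(f"{method}:{params['uri']}".encode("ascii"))
    final = (f"{ha1}:{params['nonce']}:{params['nc']}:"
             f"{params['cnonce']}:{params['qop']}:{ha2}")
    return _md5_hex(final.encode("ascii"))


def verify_digest(params, password, method):
    """Return the encoding that matched, or None if none did."""
    supplied = params["response"].lower().encode("utf-8")
    for encoding in CANDIDATE_ENCODINGS:
        try:
            expected = expected_response(params, password, method, encoding)
        except UnicodeEncodeError:
            # The credentials cannot be written in this encoding.
            continue
        if hmac.compare_digest(expected.encode("ascii"), supplied):
            return encoding
    return None
```

Returning the matching encoding rather than a bare True makes it easy to log which clients are sending which flavor. For pure ASCII credentials both candidates hash identically, so the first one wins.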