Diff for Trees
I've been looking at the problem of generating a diff for a tree structure, which might be sourced as an XML document, but in my case is more like an object tree. One of the first hits Google gave me was a reference for calculating diff for Lisp (Scheme) source code. The problem of calculating a diff on Lisp expressions is somewhat different from the case for XML; for example, in Lisp code, order of expressions is almost always important, while for some XML documents it may not be. Differences in namespace prefixes could also be considered insignificant in an XML document, and so could be normalized out as a preprocessing step. The item I linked also brings up a number of problems with modeling tree diffs; in particular, that difference algorithms typically generate a series of changes as output, which works well for strings but may not be the best answer for trees.
In any case, the Zhang-Shasha algorithm appears to be the starting place for most of the work in this area. It turns out that the Microsoft XML diff and patch toolkit implements both a Zhang-Shasha implementation as well as a "fast" algorithm. In some limited testing, it seems like the "fast" algorithm seems to have the property that it tends to produce a set of diff operations that result in a more minimal transformation. The implementation notes suggest that the Zhang-Shasha is more "precise", which may be true, but for the documents I'm working with, the fast algorithm seems to produce results closer to what transformations were actually used to get from one document to another.
— Gordon Weakliem at permanent link
Reading Lists, Technorati Style
I'm looking at Technorati's favorites feature and just not getting it. I can import an OPML, with a lot of strange warnings about some of my blogs not being blogs, if I happen to have my OPML on my disk, which is sort of a pain for those of us who like server based aggregators. With a limit of 50, I ought to have some way to filter down my OPML to my 50 actual favorites, but not in this incarnation. Now I have a list of headlines. I can render these in HTML (larded with advertisements) or RSS, and expose public URLs for the same. James Robertson says it's reading lists "without all the nasty OPML", Tim Bray calls it reading lists with a "reasonable GUI". I suppose as long as you don't mind looking at ads. Tim says "Frankly, I’m not sure what the use-case here is; but then again, we’re just making this stuff up as we go along." Indeed.
— Gordon Weakliem at permanent link
Stop the Insanity
A request to feed producers: please, please, please, do not vary your output according to User-Agent.
curl -A "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0" http://feeds.feedburner.com/snellspace | findstr "<title" curl -A "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 1.0.3705; .NET CLR 2.0.50727)" http://feeds.feedburner.com/snellspace | findstr "<title"
Guess what User-Agent gets Firefox-style output? Thinking about who owns this bug makes my head hurt. I can't even find a reference to the mode attribute of >title< in the spec - isn't this deprecated? I think, without looking at the code, that NewsGator's parser will treat type="text/html" as type="text", based on the section on the type attribute, specifically:
When present, the value MUST be one of "text", "html" or "xhtml". If the "type" attribute is not provided, Atom Processors MUST behave as though it were present with a value of "text".
Of course, the real irony is in the post that's broken.
Update: I just realized the problem: FeedBurner delivers an Atom 0.3 feed to some UAs, and Atom 1.0 to others. So I have a parsing bug for a dead format. Yay.
— Gordon Weakliem at permanent link
The Appeal of Port 80
Maybe Stefan was begging the question, but the reality is that nobody cares about SOAP over other than HTTP because that was never a serious consideration. James Robertson's fond of calling SOAP "CORBA over port 80", and he's not just saying that to be perverse. SOAP grew out of XML-RPC, which had no pretensions of being anything like CORBA, but simply a way to do remote operations in a manner (arguably) more structured than x-www-form-urlencoded.
That got me to thinking, though, is port 80 really the entire story? CORBA and DCOM did have deployment issues because of having to open up ports on the firewall, but if that's all it is, well, there's no law that says you HAVE to put HTTP on port 80. I know, it's an abuse of well known ports, but it's not like it hasn't happened before. There is another piece to this, I realized: management. We know how to configure and manage our web servers already, so we can do all that without having to learn a new set of tools. Furthermore, we can deploy services the same way we deploy web pages. It sweeps enough issues under the rug to make it much easier to get started on a SOAP service than on a CORBA service.