More on XML Diff · 21 February, 01:07 PM
I was thinking some more about the XML Diff problems being posed lately, and recalled that Zhang and Shasha published some results on calculating differences for trees. I ran into Daniel Ehrenberg’s discussion of the general problem. Daniel’s post is an excellent overview, with lots of links to academic papers on the subject, but a few points bear highlighting:
- For unordered trees, the solution is believed to be NP-hard.
- Zhang-Shasha runs in quadratic time, pretty expensive for large trees. Other algorithms improve on this.
- Zhang-Shasha doesn’t deal with moves, other than representing them as a delete/add. This introduces some problems; other algorithms address this.
- Daniel’s addressing only the algorithm for generating an edit script, not representing the edit script. All of these algorithms support only 2 or 3 operations, so the edit script should be pretty simple to represent.
- Daniel’s also more concerned with generating a visual representation of a merge for a user to review.
- Converting a document to XML Canonical Form is probably a huge component of any solution.
The lesson is that the general case XML diff is a pretty hard problem. For this reason, James’ assertions that simpler is better ring fairly true: complexity appears to be a necessity for solving the entire problem. In this light, Sam’s advice to “Spend some time up front specifying the behaviors that you want to address” seems like the best place to start in paring down the problem space.
— Gordon Weakliem
Comment
Commenting is closed for this article.