Pattern Matching in Scheme

Sometimes it just all makes sense at once. Patrick Logan recently mentioned how pattern matching can be used to process trees in Scheme. I've looked at PLT's match library a few times and never got any traction, but last night, I had an aha! moment while puzzling over an example of how to extract the links from a web page. Let's look at my specific problem, extracting my entries from a Bloglines archive. Bloglines uses a regular format to divide up each entry; essentially, each entry is wrapped by a div like this:
<div id="siteItem.0.0">
The id attribute gets incremented as it goes, to siteItem.0.1, siteItem.0.2, etc. (I'm sure there's a way the first number gets incremented, but it's not in my example). The scheme code to match this is as follows:
(require (lib "htmlprag.ss" "htmlprag")
         (lib "match.ss")
         (lib "url.ss" "net"))
(define (sxml->siteItem sxml)
  (let loop ((sxml sxml)
             (item-list '()))
    (match sxml
      (('div ('@ ('id siteItem)) . item-data)
       (cons (cons siteItem item-data) item-list))
      ((item ...) (append item-list
                          (apply append
                                 (map sxml->siteItem sxml))))
      (else item-list))))
I know that this is daunting, so I'll try to break it down. sxml->siteItem takes an argument that's an SXML representation of HTML. This is a big Scheme S-Expression, i.e. a tree, or more generally, a list. The match function inside sxml->siteItem takes a list of rule - action pairs. The rules are:
  1. For a <div> that has an id attribute, store the attribute's value in siteItem and all the child nodes of the div in item-data; return the siteItemitem-data pair appended to the head of the result list.
  2. For any other tag, return the result of appending the current result list onto the the results of the recursive calls to sxml->siteItem over each element in the list.
  3. For anything else, return the result list.
What this does is return me a dotted list of results, so I can do expressions like (assoc "siteItem.0.3" results) to find a div with a specific id (assuming that I assigned the result of sxml->siteItem to result).

It turns out that once you get past the pattern matching syntax, it's very simple to pull chunks of markup out of an SXML representation of HTML. There's actually much more that you can do that I haven't even touched on; you can match based on evaluating a Scheme expression, for instance. The one gotcha to this is that you have to account for a couple things: attributes get matched exactly, and whitespace also gets matched. What this means is that in my example, it's looking literally for a <div> with a single attribute id. If you're matching nested markup, significant whitespace can end up in the markup, which complicates the pattern. So for more general matching, you'll need a more general pattern. But the general feeling I have is that Scheme's match syntax is like having the power of a regex, but at the markup level, instead of matching on the raw text.

— Gordon Weakliem at permanent link