Pattern Matching in Scheme
div like this:
<div id="siteItem.0.0">The id attribute gets incremented as it goes, to siteItem.0.1, siteItem.0.2, etc. (I'm sure there's a way the first number gets incremented, but it's not in my example). The scheme code to match this is as follows:
(require (lib "htmlprag.ss" "htmlprag")
(lib "match.ss")
(lib "url.ss" "net"))
(define (sxml->siteItem sxml)
(let loop ((sxml sxml)
(item-list '()))
(match sxml
(('div ('@ ('id siteItem)) . item-data)
(cons (cons siteItem item-data) item-list))
((item ...) (append item-list
(apply append
(map sxml->siteItem sxml))))
(else item-list))))
I know that this is daunting, so I'll try to break it down. sxml->siteItem takes an argument that's an SXML representation of HTML. This is a big Scheme S-Expression, i.e. a tree, or more generally, a list. The match function inside sxml->siteItem takes a list of rule - action pairs. The rules are:
- For a <div> that has an
idattribute, store the attribute's value insiteItemand all the child nodes of thedivinitem-data; return thesiteItemitem-datapair appended to the head of the result list. - For any other tag, return the result of appending the current result list onto the the results of the recursive calls to
sxml->siteItemover each element in the list. - For anything else, return the result list.
(assoc "siteItem.0.3" results) to find a div with a specific id (assuming that I assigned the result of sxml->siteItem to result).
It turns out that once you get past the pattern matching syntax, it's very simple to pull chunks of markup out of an SXML representation of HTML. There's actually much more that you can do that I haven't even touched on; you can match based on evaluating a Scheme expression, for instance. The one gotcha to this is that you have to account for a couple things: attributes get matched exactly, and whitespace also gets matched. What this means is that in my example, it's looking literally for a <div> with a single attribute id. If you're matching nested markup, significant whitespace can end up in the markup, which complicates the pattern. So for more general matching, you'll need a more general pattern. But the general feeling I have is that Scheme's match syntax is like having the power of a regex, but at the markup level, instead of matching on the raw text.