Ruby: XML Parsing With SAX
SAX is an event-driven parser for XML.
It reads the XML sequentially and fires an event for each thing it meets: the start of an element, a run of text, the end of an element, and so on. So, if you want to use SAX, you have to write the code that handles those events. It's quite different from the DOM model, where the whole XML document is parsed and loaded into a tree.
As you can guess, the event-driven approach takes more work than the DOM one. So why should we use it? It depends.
If you want to extract a few pieces of information from a big file, you should probably choose a SAX implementation, because it avoids the overhead of loading the whole document into a DOM tree first.
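To make the contrast concrete, here is a minimal sketch of both styles. It uses REXML, which ships with standard Ruby distributions, so it should run without installing anything; libxml's API is different. The sample feed and the `TitleListener` class are made up for illustration.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# A tiny made-up feed, standing in for a large file.
XML_SAMPLE = <<~XML
  <rss>
    <channel>
      <item><title>First post</title></item>
      <item><title>Second post</title></item>
    </channel>
  </rss>
XML

# DOM style: build the whole tree in memory, then query it.
doc = REXML::Document.new(XML_SAMPLE)
dom_titles = REXML::XPath.match(doc, '//item/title').map(&:text)

# Event style: react to callbacks while the parser walks the input.
class TitleListener
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @titles = []
    @buffer = nil # a String only while we are inside a <title> element
  end

  def tag_start(name, _attributes)
    @buffer = +'' if name == 'title'
  end

  def text(data)
    @buffer << data if @buffer
  end

  def tag_end(name)
    return unless name == 'title'

    @titles << @buffer
    @buffer = nil
  end
end

listener = TitleListener.new
REXML::Document.parse_stream(XML_SAMPLE, listener)

puts dom_titles.inspect
puts listener.titles.inspect
```

Both calls should print the same two titles. The difference is that the DOM version holds the whole document in memory before the query runs, while the listener only ever holds the title it is currently reading.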
The Ruby XML Library
Ruby ships with an XML parser called REXML (both DOM and SAX), but it's terribly slow, so it's highly advisable to use libxml instead. It's a binding to the popular libxml2 library from the GNOME project, and it is released as a gem (libxml-ruby).
The Ruby Implementation
First of all we need a handler to deal with the SAX events.

    class Handler
      def method_missing(method_name, *attributes, &block)
      end
    end

Libxml generates several events and expects to find the corresponding methods in the class assigned as handler. With method_missing we simply swallow every call, so no exception is raised for the events we don't care about.
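The catch-all can be tried without a parser by calling the event methods by hand. The sketch below is a slightly stricter variant, not the class used in the rest of the article: the `QuietHandler` name is hypothetical, and it assumes every callback name starts with `on_`, as the ones listed in this article do. Anything else still raises NoMethodError, which keeps typos visible.

```ruby
# A handler that silently accepts any SAX-style callback it does not define.
class QuietHandler
  # Swallow undefined on_* callbacks; let every other unknown method fail as usual.
  def method_missing(method_name, *arguments, &block)
    return nil if method_name.to_s.start_with?('on_')

    super
  end

  # Keep respond_to? consistent with method_missing.
  def respond_to_missing?(method_name, include_private = false)
    method_name.to_s.start_with?('on_') || super
  end
end

handler = QuietHandler.new

# Simulate what a parser does: call event methods the class never defined.
handler.on_start_document
handler.on_start_element('title', {})
handler.on_end_document

puts handler.respond_to?(:on_characters)
```

Defining `respond_to_missing?` next to `method_missing` is a common Ruby convention, so that `respond_to?` tells the truth about the methods the object accepts.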
A More Useful Example
Let's extract the most recent headlines of a blog.
Download the feed:
    curl http://feeds.feedburner.com/LucaGuidi > luca.xml
Now we need our custom SAX parser:
    require 'rubygems'
    require 'xml/libxml'
    require 'handler'

    class SaxParser
      def initialize(xml)
        @parser = XML::SaxParser.new
        @parser.string = xml
        @parser.callbacks = Handler.new
      end

      def parse
        @parser.parse
        @parser.callbacks.elements
      end
    end
We have just wrapped the SAX parser from libxml and we have registered our first class as callback handler.
Now we are going to improve the handler to recognize and save the post titles:
    class Handler
      attr_accessor :elements

      def initialize
        @elements = []
      end

      def on_start_element(element, attributes)
        @print = true if element == 'title'
      end

      def on_characters(characters = '')
        @elements << characters if @print
      end

      def on_end_element(element)
        @print = false
      end

      # Handle all missing methods of the SAX events chain.
      # You can implement or omit one or many of those methods, without raising any exception.
      #
      # The complete chain is:
      #   on_start_document
      #   on_processing_instruction(instruction, arguments)
      #   on_start_element(element, attributes)
      #   on_characters(characters = '')
      #   on_end_element(element)
      #   on_end_document
      def method_missing(method_name, *attributes, &block)
      end
    end

When the handler is instantiated we create an internal array to store our results. When we find a title element we set the print flag to true; while it's true, any character data is stored in elements. The end-of-element handler then sets the flag back to false. Note that this matches every title element, so the feed's own title is collected along with the post titles.
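One caveat: a SAX parser is free to deliver the text of a single element in several on_characters calls, and the handler above would then store each fragment as a separate entry. A possible variant, sketched here with a hypothetical `BufferingHandler` class, collects the fragments and stores one string per title. The events are fired by hand, so the sketch runs without libxml.

```ruby
class BufferingHandler
  attr_reader :elements

  def initialize
    @elements = []
    @buffer = nil # a String only while we are inside a <title> element
  end

  def on_start_element(element, _attributes)
    @buffer = +'' if element == 'title'
  end

  def on_characters(characters = '')
    @buffer << characters if @buffer
  end

  def on_end_element(element)
    return unless element == 'title' && @buffer

    @elements << @buffer
    @buffer = nil
  end

  # Ignore every other event, as in the handler above.
  def method_missing(method_name, *arguments, &block)
    nil
  end
end

handler = BufferingHandler.new

# Simulated event stream for one feed item; the title arrives in two chunks.
handler.on_start_document
handler.on_start_element('item', {})
handler.on_start_element('title', {})
handler.on_characters('Ruby: XML ')
handler.on_characters('Parsing With SAX')
handler.on_end_element('title')
handler.on_start_element('link', {})
handler.on_characters('http://example.com/post')
handler.on_end_element('link')
handler.on_end_element('item')
handler.on_end_document

puts handler.elements.inspect
# prints ["Ruby: XML Parsing With SAX"]
```

Using the buffer itself as the flag (nil outside a title, a String inside) removes the need for a separate @print variable.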
Usage
We create a trivial script:
    #!/usr/bin/env ruby
    require 'sax_parser'

    # Read the whole file named on the command line, then print the titles.
    xml = File.read(ARGV[0])
    puts SaxParser.new(xml).parse
From the shell:
    ./parse luca.xml
Conclusion
SAX is less elegant and less easy to use than DOM, but it can be very useful in certain cases, such as pulling a few values out of a large file.
Posted by nanda on 2008-02-12 22:15:14 UTC (permalink)
Good tutorial, one question, how would I read any particular attribute of an element?
Posted by nanda on 2008-02-12 22:20:58 UTC (permalink)
never mind, got it !
:)