Ruby: XML Parsing With SAX

Posted by luca
on Wednesday, January 30

SAX is an event-driven parser for XML.

It reads the XML sequentially and generates events as it goes, so if you want to use SAX you have to write the code that handles them. This is quite different from the DOM model, where the whole document is parsed and loaded into a tree.
As you can see, the first approach is more difficult than the DOM one. So why should we use it? It depends.
If you want to extract a few pieces of information from a big file, you should probably choose a SAX implementation: that way you avoid the overhead of loading the whole DOM up front.

The Ruby XML Library

Ruby ships with an XML parser called REXML (with both DOM and SAX interfaces), but it's terribly slow, so it's highly advisable to use libxml instead. It's a binding to the popular library from the GNOME project, and it is released as a gem (libxml-ruby).
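To make the difference between the two models concrete, here is a small sketch that extracts the same titles both ways. It uses REXML only because it needs no extra install; the inline XML string and the TitleListener class are made up for the illustration, and REXML's stream listener uses different method names (tag_start, text, tag_end) from the libxml callbacks used in the rest of this post.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Made-up sample document, small enough to read at a glance.
XML_SNIPPET = '<feed><title>First post</title><title>Second post</title></feed>'

# DOM style: the whole document is parsed into a tree first, then queried.
doc = REXML::Document.new(XML_SNIPPET)
dom_titles = REXML::XPath.match(doc, '//title').map(&:text)

# Stream style: the parser calls back into a listener while it reads.
class TitleListener
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @titles = []
    @in_title = false
  end

  def tag_start(name, _attributes)
    @in_title = true if name == 'title'
  end

  def text(data)
    @titles << data if @in_title
  end

  def tag_end(_name)
    @in_title = false
  end
end

listener = TitleListener.new
REXML::Document.parse_stream(XML_SNIPPET, listener)
stream_titles = listener.titles

puts dom_titles.inspect
puts stream_titles.inspect
```

Both approaches collect the same titles; the stream version just never builds the tree.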

The Ruby Implementation

First, we need a handler to deal with the SAX events.

class Handler
  def method_missing(method_name, *attributes, &block)
  end
end

Libxml generates several events and expects to find the corresponding methods in the class assigned as the handler. With an empty method_missing we simply swallow the events we don't care about, so no exception is raised.
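If the method_missing trick is new to you, this plain Ruby sketch shows what it buys us. No parser is involved: the calls are made by hand, and StrictHandler is a hypothetical class added only for contrast.

```ruby
# The catch-all handler from above: every undefined method is silently ignored.
class Handler
  def method_missing(method_name, *attributes, &block)
  end
end

# Hypothetical counterpart without method_missing, for comparison only.
class StrictHandler
end

handler = Handler.new
start_result = handler.on_start_document        # undefined, but no error
end_result = handler.on_end_element('title')    # arguments are swallowed too

begin
  StrictHandler.new.on_start_document
  strict_error = nil
rescue NoMethodError => e
  strict_error = e.class
end

puts start_result.inspect
puts strict_error.inspect
```

The catch-all returns nil for whatever is called on it, while the strict class raises NoMethodError on the first event it does not implement.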

A More Useful Example

Let's extract the most recent headlines from a blog.

Download the feed:

curl http://feeds.feedburner.com/LucaGuidi > luca.xml

Now we need our custom SAX parser:

require 'rubygems'
require 'xml/libxml'
require_relative 'handler' # Ruby 1.9+ no longer has the current directory in the load path

class SaxParser
  def initialize(xml)
    @parser = XML::SaxParser.new
    @parser.string = xml
    @parser.callbacks = Handler.new
  end

  def parse
    @parser.parse
    @parser.callbacks.elements
  end
end

We have just wrapped the SAX parser from libxml and registered our first class as the callback handler.

Now we are going to improve the handler to recognize and save the post titles:

class Handler
  attr_accessor :elements

  def initialize
    @elements = []
  end

  def on_start_element(element, attributes)
    @print = true if element == 'title'
  end

  def on_characters(characters = '')
    @elements << characters if @print
  end

  def on_end_element(element)
    @print = false
  end

  # Handle all missing methods of the SAX events chain.
  # You can implement or omit any of these methods without raising an exception.
  # 
  # The complete chain is:
  #   on_start_document
  #   on_processing_instruction(instruction, arguments)
  #   on_start_element(element, attributes)
  #   on_characters(characters = '')
  #   on_end_element(element)
  #   on_end_document
  def method_missing(method_name, *attributes, &block)
  end
end

When the handler is instantiated we create an internal array to store our results. When we find a title element we set the print flag to true; while it is true, character data is stored in elements. The flag is set back to false in the end-of-element handler.
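One caveat: SAX parsers are generally allowed to deliver the text of a single element in several chunks, so on_characters may fire more than once per title. A possible variant, offered as a sketch rather than a replacement for the handler above, buffers the chunks and stores the title when the element closes. The class name BufferingHandler and the hand-fed events are hypothetical; they are there so the flow can be followed without libxml installed.

```ruby
class BufferingHandler
  attr_reader :elements

  def initialize
    @elements = []
    @buffer = nil # non-nil only while we are inside a title element
  end

  def on_start_element(element, attributes)
    @buffer = String.new if element == 'title'
  end

  def on_characters(characters = '')
    @buffer << characters if @buffer
  end

  def on_end_element(element)
    return unless element == 'title' && @buffer
    @elements << @buffer
    @buffer = nil
  end

  # Ignore every other event, as in the original handler.
  def method_missing(method_name, *attributes, &block)
  end
end

# Simulate the event chain a parser would produce for one feed item.
handler = BufferingHandler.new
handler.on_start_document
handler.on_start_element('item', {})
handler.on_start_element('title', {})
handler.on_characters('Ruby: XML ')        # first chunk of the title
handler.on_characters('Parsing With SAX')  # second chunk of the same title
handler.on_end_element('title')
handler.on_start_element('link', {})
handler.on_characters('http://example.com/post')
handler.on_end_element('link')
handler.on_end_element('item')
handler.on_end_document

puts handler.elements.inspect
```

The two chunks end up as a single entry, and the text of the link element is ignored because the buffer only exists between the opening and closing title tags.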

Usage

We create a trivial script:

#!/usr/bin/env ruby
require_relative 'sax_parser' # Ruby 1.9+ no longer has the current directory in the load path

xml = File.read(ARGV[0]) # read the whole file and close it again
puts SaxParser.new(xml).parse

From the shell:

./parse luca.xml

Conclusion

SAX is less elegant and harder to use than DOM, but it can be very useful in certain cases.

Comments

  1. Surleau, January 31, 2008 @ 10:56 AM
    Instead of using method_missing? in the handler class, you can include the XML::SaxParser::Callbacks module, which defines all the methods called by the SAX parser.
  2. nanda, February 12, 2008 @ 11:15 PM
    Good tutorial, one question, how would I read any particular attribute of an element?
  3. nanda, February 12, 2008 @ 11:20 PM
    never mind, got it ! :)