Ruby: XML Parsing With SAX

SAX (Simple API for XML) is an event-driven interface for parsing XML.

It reads the XML sequentially and fires an event for each thing it encounters, such as the start of an element, a run of text, or the end of an element. To use SAX, you write the code that handles those events. This is quite different from the DOM model, where the whole document is parsed and loaded into a tree in memory.
The SAX approach takes more work than the DOM one, so why should we use it? It depends on the task.
If you want to extract a few pieces of information from a big file, SAX is probably the better choice, because you avoid the cost of building the whole DOM tree up front.
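
To make the idea of "events" concrete, here is a toy sketch in plain Ruby. It is not a real XML parser and does not use libxml: the method name each_sax_like_event and the event symbols are made up for this illustration. It only shows the order in which a SAX-style parser would report things while walking through the text.

```ruby
# Toy illustration only: a regex scanner, NOT a real XML parser.
# It ignores attributes, comments, CDATA, entities and self-closing tags.
# The method name and the event symbols are hypothetical.
def each_sax_like_event(xml)
  xml.scan(%r{<(/?)(\w+)[^>]*>|([^<]+)}) do |closing, name, text|
    if text
      # Text between tags; skip whitespace-only runs.
      yield :characters, text unless text.strip.empty?
    elsif closing == '/'
      yield :end_element, name
    else
      yield :start_element, name
    end
  end
end

each_sax_like_event('<item><title>Hello</title></item>') do |event, value|
  puts "#{event}: #{value}"
end
# Prints:
#   start_element: item
#   start_element: title
#   characters: Hello
#   end_element: title
#   end_element: item
```

Nothing is kept in memory here except what the block decides to keep, which is the whole point of the SAX style.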

The Ruby XML Library

Ruby's standard library ships with an XML parser called REXML, which offers both DOM and SAX-style parsing, but it is very slow on large documents, so it's highly advisable to use libxml instead. It's a binding to the popular libxml2 library from the GNOME project, and it is distributed as a gem (libxml-ruby).

The Ruby Implementation

First of all we need a handler to deal with the SAX events.

    class Handler
      def method_missing(method_name, *attributes, &block)
      end
    end

Libxml generates several events and expects to find the corresponding methods on the object assigned as handler. With method_missing we simply avoid an exception for the events we don't handle.
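
The effect of method_missing can be checked without any XML library. The sketch below uses a hypothetical class name, QuietHandler, and makes one change to the idea above: it only swallows methods whose names start with on_, and it defines respond_to_missing? so that respond_to? stays consistent. That restriction is an assumption of this example, not something libxml requires.

```ruby
# Hypothetical stand-in for the article's Handler; no XML library is involved.
class QuietHandler
  # Swallow SAX-style events we have not implemented explicitly.
  # Anything that does not look like an event still raises NoMethodError.
  def method_missing(method_name, *args, &block)
    return nil if method_name.to_s.start_with?('on_')
    super
  end

  # Keep respond_to? in agreement with method_missing.
  def respond_to_missing?(method_name, include_private = false)
    method_name.to_s.start_with?('on_') || super
  end
end

handler = QuietHandler.new
p handler.on_start_document              # nil
p handler.on_start_element('title', {})  # nil
p handler.respond_to?(:on_end_document)  # true
```

Swallowing only the on_ methods keeps ordinary typos visible: a misspelled helper call still fails loudly instead of silently returning nil.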

A More Useful Example

Let's try to extract the most recent headlines from a blog.

Download the feed:

    curl http://feeds.feedburner.com/LucaGuidi > luca.xml

Now we need our custom SAX parser:

    require 'rubygems'
    require 'xml/libxml'
    require 'handler'

    class SaxParser
      def initialize(xml)
        @parser = XML::SaxParser.new
        @parser.string = xml
        @parser.callbacks = Handler.new
      end

      def parse
        @parser.parse
        @parser.callbacks.elements
      end
    end

We have just wrapped the SAX parser from libxml and registered an instance of our Handler class as the callback handler.

Now we are going to improve the handler to recognize and save the post titles:

    class Handler
      attr_accessor :elements

      def initialize
        @elements = []
        @print = false
      end

      # Start collecting text when a title element opens.
      def on_start_element(element, attributes)
        @print = true if element == 'title'
      end

      # Store the text only while we are inside a title element.
      def on_characters(characters = '')
        @elements << characters if @print
      end

      # Stop collecting when any element closes.
      def on_end_element(element)
        @print = false
      end

      # Handle all missing methods of the SAX events chain.
      # You can implement or omit any of these methods without raising an exception.
      #
      # The complete chain is:
      #   on_start_document
      #   on_processing_instruction(instruction, arguments)
      #   on_start_element(element, attributes)
      #   on_characters(characters = '')
      #   on_end_element(element)
      #   on_end_document
      def method_missing(method_name, *attributes, &block)
      end
    end

When the handler is instantiated we create an internal array to store our results. When we find a title element we set the print flag to true. While the flag is true, on_characters stores the text in elements. The flag is set back to false in the end-of-element handler.
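
The flag logic can be tried without libxml by calling the event methods by hand, in the order a SAX parser would call them. The class below is a hypothetical copy of the handler (renamed TitleHandler, with the flag renamed @in_title), and the sample element names and text are invented for the example.

```ruby
# Hypothetical copy of the title-collecting handler, driven by hand.
class TitleHandler
  attr_reader :elements

  def initialize
    @elements = []
    @in_title = false
  end

  def on_start_element(element, attributes)
    @in_title = true if element == 'title'
  end

  def on_characters(characters = '')
    @elements << characters if @in_title
  end

  def on_end_element(element)
    @in_title = false
  end
end

handler = TitleHandler.new

# Events in the order a SAX parser would report them for:
#   <item><title>Hello</title><link>http://example.com</link></item>
handler.on_start_element('item', {})
handler.on_start_element('title', {})
handler.on_characters('Hello')
handler.on_end_element('title')
handler.on_start_element('link', {})
handler.on_characters('http://example.com')
handler.on_end_element('link')
handler.on_end_element('item')

p handler.elements # ["Hello"]
```

Only the text seen between the start and the end of a title element is kept. The link text arrives while the flag is false, so it is ignored.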

Usage

We create a trivial script:

    #!/usr/bin/env ruby
    require 'sax_parser'

    # Read the whole file given as first argument, then print the titles.
    xml = File.read(ARGV[0])
    puts SaxParser.new(xml).parse

From the shell:

    ./parse luca.xml

Conclusion

SAX is less elegant and less convenient than DOM, but it can be very useful in certain cases, such as pulling a few values out of a big file.

Comments

  1. Posted by nanda on 2008-02-12 22:15:14 UTC

    Good tutorial, one question, how would I read any particular attribute of an element?

  2. Posted by nanda on 2008-02-12 22:20:58 UTC

    never mind, got it !
    :)