How I Learned to Stop Worrying and Love… Hpricot
When I first started writing scripts in Ruby, one of the features I loved was the REXML library, which made parsing XML a breeze. That was most likely because I was coming from a Perl/SAX background and was surrounded by Java programmers at the time, but also because it really is that easy to use.
At home, I have a small script that takes an XML file listing the books I own, processes it, and outputs more XML in a different format and with some filtering. At first, it was OK. But recently the source XML file got a bit hairy and larger than it used to be (just 2MB, but that’s already a bit of a constraint).
So I looked around a bit, and there are now a few faster alternatives. One of them is Hpricot, which is originally an HTML parser (I had already used it as such) but which also works well for well-formed XML. It isn’t much harder to use (even easier in places) and yes, it is faster: (almost) the same script runs about 7 times as fast with Hpricot as with good ol’ REXML. OK, I’m convinced.
So, the old REXML code looked like this:
require 'rexml/document'

file = File.new( Source_db )
doc = REXML::Document.new( file )
doc.elements.each('library/items/book') do |x|
  # do real stuff here
  if x.attributes.has_key? 'author'
    # and more
  end
end
And the corresponding Hpricot code instead looks like:
require 'hpricot'

file = File.new( Source_db )
doc = Hpricot.XML( file )
doc.search('library/items/book') do |x|
  # do real stuff here
  if x.attributes.has_key? 'author'
    # and more
  end
end
This is just a small script, but there are not many compatibility issues and the performance gain is worthwhile.
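To check a speed claim like this on your own data, a rough timing harness can be sketched with Ruby's standard Benchmark module. This is only a sketch under assumptions: the helper names sample_library and count_authored_rexml are made up for illustration, the document is synthetic, and only the REXML side is timed; another parser can be measured by wrapping its loop the same way.

```ruby
require 'benchmark'
require 'rexml/document'

# Hypothetical helper: builds a synthetic library document with n books.
def sample_library(n)
  books = (1..n).map { |i| %(<book title="Book #{i}" author="Author #{i % 10}"/>) }
  "<library><items>#{books.join}</items></library>"
end

# Hypothetical helper: parses with REXML and counts the books
# that carry an author attribute, mirroring the loop in the post.
def count_authored_rexml(xml)
  doc = REXML::Document.new(xml)
  count = 0
  doc.elements.each('library/items/book') do |x|
    count += 1 if x.attributes.has_key?('author')
  end
  count
end

xml = sample_library(500)
# Benchmark.realtime returns the elapsed wall-clock time in seconds.
seconds = Benchmark.realtime { count_authored_rexml(xml) }
puts format('REXML: %.3f s', seconds)
```

Timings from a synthetic file will not match a real 2MB library, so the real source file is the better input once the harness works.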
Update: going a step further, I tested Nokogiri, which is in the race to be the fastest and also provides an Hpricot compatibility layer. It is indeed faster: about 1/3 faster (1/2 faster when running on Ruby 1.9, but I didn’t test Hpricot with the same Ruby version), even though my Hpricot is not the latest version. But it is also slightly incompatible: an attribute value is not just a string but a particular object, so adding a to_s call is required (this stays compatible with the other libraries, since String also provides to_s). Also, the previous code would require an iterating method after the search (for example, the usual each).