Rails + Tidy + REXML

It wasn’t totally straightforward to get Tidy, REXML and Rails to play together, so I thought I would write down what I did and how I did it, to save time for others.

The reason for doing this is that I get text in (X)HTML format through RSS feeds and I want to make excerpts of it. So given a long text as input I want to make a short extract of it.

After a bit of thinking and googling I figured out that slicing an HTML document after a given number of characters is not trivial. Because of the tags, you need to actually parse the HTML document and keep track of which tags you need to close when you reach the given number of characters. Luckily for us, Mike Burns has already written a function for Truncating HTML in Ruby. Perfect!

However, after adding that piece of code (and unit tests for it, of course) you will find out that REXML barfs if the input is not well-formed, and since you have no control over the content of the RSS feeds, there is no way you can guarantee that it is.
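To see the failure for yourself, here is a minimal sketch that runs a string through REXML's pull parser and reports whether it got through. The method name `well_formed?` is my own for illustration and is not part of REXML.

```ruby
require 'rexml/parsers/pullparser'

# Hypothetical helper (not part of REXML): returns true if the pull parser
# can walk the whole string without raising a parse error.
def well_formed?(markup)
  parser = REXML::Parsers::PullParser.new(markup)
  parser.pull while parser.has_next?
  true
rescue REXML::ParseException
  false
end

puts well_formed?('<p>Hello <b>world</b></p>')  # properly nested tags
puts well_formed?('<p>Hello <b>world</p>')      # <b> is never closed
```

The second call is the kind of markup that shows up in real feeds all the time: fine for a browser, fatal for an XML parser.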

Luckily Tidy comes to the rescue. Tidy is a library that corrects invalid HTML. Install the tidy library and then the tidy ruby gem.

gem install tidy

Unfortunately you have to set the path to the library manually before you can use it:

Tidy.path = '/usr/lib/libtidy.so'

If you, like me, use an Apple laptop for development and Linux on the server, that path is going to differ between the environments. So what I did was to introduce a constant in the Rails environment files. In the config/environments/production.rb file I put:

TIDY_LIB_PATH = '/usr/lib/libtidy.so'

And naturally I set it to the correct path for my PowerBook in the config/environments/development.rb file. Then I just do

Tidy.path = TIDY_LIB_PATH

before using Tidy and everything is good.
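A variation on this, which is not what I did above, would be to let the code pick the first library path that actually exists, so the same setting works on both machines. This is only a sketch: `find_tidy_lib` and `TIDY_LIB_CANDIDATES` are names made up for illustration, and the `.dylib` path is an assumption that you should check on your own machine.

```ruby
# Hypothetical helper: return the first candidate path that exists on disk,
# or nil if none of them do.
def find_tidy_lib(candidates)
  candidates.find { |path| File.exist?(path) }
end

# Candidate locations. The first is the production path used above; the
# second is an assumed Mac location and may not match your setup.
TIDY_LIB_CANDIDATES = [
  '/usr/lib/libtidy.so',
  '/usr/lib/libtidy.dylib'
]

TIDY_LIB_PATH = find_tidy_lib(TIDY_LIB_CANDIDATES)
```

If `TIDY_LIB_PATH` comes out as `nil`, none of the candidates were found and the list needs another entry.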

To make Tidy behave decently you need to set the following options:

  • tidy.options.show_body_only = true – don’t output body and html tags
  • tidy.options.output_xhtml = true – output xhtml
  • tidy.options.wrap = 0 – don’t write newlines all over the place
  • tidy.options.char_encoding = 'utf8' – use utf8 to play nice with rails

So in the end this is what I ended up with:

require 'rexml/parsers/pullparser'
require 'tidy'

def make_excerpt
  slice(tidy_up_html(content), 2000)
end

# Run the raw feed HTML through Tidy so that REXML can parse it.
def tidy_up_html(html)
  Tidy.path = TIDY_LIB_PATH

  Tidy.open do |tidy|
    tidy.options.show_body_only = true
    tidy.options.output_xhtml = true
    tidy.options.wrap = 0
    tidy.options.char_encoding = 'utf8'
    tidy.clean(html)
  end
end

# Truncate well-formed (X)HTML after roughly `length` characters of text,
# closing any tags that are still open.
# NOTE: attrs_to_s (not shown here) must render the attribute hash as a string.
def slice(string, length, ellipsis = '...')
  p = REXML::Parsers::PullParser.new(string)
  tags = []
  new_len = length
  results = ''
  while p.has_next? && new_len > 0
    p_e = p.pull
    case p_e.event_type
    when :start_element
      tags.push p_e[0]
      results << "<#{tags.last} #{attrs_to_s(p_e[1])}>"
    when :end_element
      results << "</#{tags.pop}>"
    when :text
      results << p_e[0].first(new_len)
      current_len = new_len
      new_len -= p_e[0].length
      if new_len < 0
        # find next dot
        i = p_e[0].index('.', current_len)
        results << p_e[0].slice(current_len, i - current_len) if i
        results << p_e[0].slice(current_len, p_e[0].length) unless i
        results << ellipsis
      end
    else
      results << "<!-- #{p_e.inspect} -->"
    end
  end
  # close whatever is still open, innermost tag first
  tags.reverse.each do |tag|
    results << "</#{tag}>"
  end
  results
end
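
The slice method calls attrs_to_s, which is not included in this listing. Below is a minimal stand-in, assuming the pull parser hands over the attributes of a start tag as a hash of name/value strings. It is my own sketch and not necessarily identical to the helper in Mike Burns' code.

```ruby
# Hypothetical stand-in for the attrs_to_s helper that slice relies on.
# Turns {'href' => '/a', 'class' => 'x'} into: href="/a" class="x"
# Assumes the values contain no unescaped double quotes.
def attrs_to_s(attrs)
  attrs.map { |name, value| %(#{name}="#{value}") }.join(' ')
end

puts attrs_to_s('href' => '/posts/1', 'class' => 'more')
```

For a tag without attributes this returns an empty string, so slice will emit the tag with a trailing space before the closing bracket, which is harmless.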

I modified Mike Burns’ method so that after the given number of characters has been reached it will still include text until the next ‘.’ character. I figured it’s much nicer with an excerpt that ends with a complete sentence.
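
The sentence-boundary part is easier to follow in isolation. Here is a sketch of the same idea on a plain string with no tags involved; `cut_at_sentence` is a name I made up for this illustration and it is not used in the code above.

```ruby
# Hypothetical plain-text version of the "run on to the next dot" rule.
# Text that fits within `length` is returned unchanged. Otherwise the cut
# is moved forward to the next '.' (the dot itself is dropped) and the
# ellipsis is appended. With no later dot, the whole text is kept.
def cut_at_sentence(text, length, ellipsis = '...')
  return text if text.length <= length

  dot = text.index('.', length)
  kept = dot ? text[0, dot] : text
  kept + ellipsis
end

puts cut_at_sentence('One two. Three four.', 3)  # prints "One two..."
```

Like the method above, this can overshoot the limit by a lot if the next dot is far away, so the limit is a lower bound rather than a hard maximum.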

Feel free to use this code if you want.