Martin @ Blog

software development and life.


Html parser for Java

For my graduation project, I needed to parse HTML documents in order to strip the HTML tags and formatting (I wanted to index the text using Lucene). After some searching on the Internet, I decided to use the HTMLParser library. This is an HTML parser written in Java, and the codebase seems to be rather stable. It is not fully standards-compliant, but that was not required for my purpose. It does, however, cope with HTML documents that are not nicely formatted (missing end tags, improper nesting, etc.).

I did stumble upon a bug, though: websites are parsed incorrectly when they contain multiple META tags defining the charset (using http-equiv="Content-Type" META tags). While this is not allowed according to the HTML specification, there are sites on the Internet doing this. When downloading a website over HTTP, it is also possible to define the charset using HTTP headers. Changing the charset using META tags should generally only be done when it is not possible to control the HTTP headers (because the web server configuration cannot be changed, for example). HTMLParser handles a change of charset by throwing an exception, after which the document can be parsed again using the correct charset. However, this cycle can repeat indefinitely when a single HTML document contains multiple charset definitions.

I’ve created a patch which fixes this problem. The good news is that the patch has been accepted for the 2.0 version of HTMLParser, but I’ve also created a patch for the (older) 1.6 version (which is the version I’m using myself).
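
To make the problem a bit more concrete, here is a minimal sketch of the "throw and re-parse" cycle and one possible way to stop it from looping. This is not the patch itself and it does not use the HTMLParser API: the class, the exception, the regular expression and the toy `parse` method are all hypothetical stand-ins, written against the plain Java standard library. The assumption is that a parser which re-checks every META tag on each pass will keep flip-flopping between two declared charsets, and that honouring only the first change is an acceptable way out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetRetrySketch {

    /** Hypothetical stand-in for the exception a parser throws on a charset change. */
    static class CharsetChangedException extends Exception {
        final Charset newCharset;

        CharsetChangedException(Charset newCharset) {
            super("charset changed to " + newCharset.name());
            this.newCharset = newCharset;
        }
    }

    // Simplified matcher for <meta http-equiv="Content-Type" content="...; charset=X">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*http-equiv\\s*=\\s*[\"']?Content-Type[\"']?[^>]*charset\\s*=\\s*([\\w.:-]+)",
            Pattern.CASE_INSENSITIVE);

    /**
     * Toy parser: decodes the bytes, optionally checks the META charset
     * declarations, and returns the text with all tags stripped.
     */
    static String parse(byte[] html, Charset current, boolean allowCharsetChange)
            throws CharsetChangedException {
        String text = new String(html, current);
        if (allowCharsetChange) {
            Matcher m = META_CHARSET.matcher(text);
            while (m.find()) {
                Charset declared;
                try {
                    declared = Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    continue; // unknown or malformed charset name: ignore it
                }
                if (!declared.equals(current)) {
                    // With two different declarations this fires on every pass,
                    // whichever charset the caller retries with.
                    throw new CharsetChangedException(declared);
                }
            }
        }
        return text.replaceAll("<[^>]*>", "").trim();
    }

    /** Honours the first charset change only, so the retry cannot loop forever. */
    static String parseWithSingleRetry(byte[] html, Charset initial) {
        try {
            return parse(html, initial, true);
        } catch (CharsetChangedException first) {
            try {
                return parse(html, first.newCharset, false);
            } catch (CharsetChangedException unreachable) {
                // Cannot happen: charset checks are switched off on the retry.
                throw new IllegalStateException(unreachable);
            }
        }
    }

    public static void main(String[] args) {
        String page = "<html><head>"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">"
                + "</head><body><p>caf\u00e9</p></body></html>";
        byte[] bytes = page.getBytes(StandardCharsets.UTF_8);
        System.out.println(parseWithSingleRetry(bytes, StandardCharsets.ISO_8859_1));
    }
}
```

Calling `parse(bytes, charset, true)` in a naive retry loop on the sample page would never terminate, because one of the two META tags always disagrees with the charset currently in use. Trusting the first declaration is just one choice; preferring the charset from the HTTP headers, or capping the number of retries, would work as well.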

In other news: I’ve restored the archive of my weblog. It now goes back to the very beginning of my weblogging career. Unfortunately, there are some encoding issues (probably a MySQL problem) and all the images on my weblog are broken.

One Response to “Html parser for Java”

  1. August 4th, 2009 at 23:51

    Rafael Sobek says:


    Recently I described HTMLParser, with examples, on my blog.

    The url is

    Have fun