Converting PDF to Text with pdftohtml

Previously I tried to extract information from PDFs by converting them to text, as described here.

The problem is that a big wall of text is very hard to process.
Here comes pdftohtml. It is part of the poppler package on Linux, but GnuWin32 does not have it for Windows, which is one reason I used pdftotext.

pdftohtml converts a PDF to HTML. The simplest usage is

pdftohtml yourpdffile.pdf

You will get your HTML file, but it is a bit plain because only the text is extracted. If there are images inside the PDF, or your PDF is pretty complicated, like the Malaysian Hansard, you can use the -c option

pdftohtml -c yourpdffile.pdf

Here is the catch: it generates one HTML file per page of the PDF, with images, but the layout is maintained. For a document like the Malaysian Hansard, that means hundreds of pages.

Then there is a way to produce XML

pdftohtml -xml yourpdffile.pdf

You will get an XML file which contains the position information.
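
As a rough sketch of how that position information might be used, the Python below reads the XML and sorts the text pieces by page and position. It assumes the output has page elements containing text elements with top and left attributes, which is what I have seen from pdftohtml; check your own output, since the details may differ between versions. The function name read_positions and the SAMPLE data are made up for illustration.

```python
import xml.etree.ElementTree as ET


def read_positions(xml_text):
    """Return a list of (page_number, top, left, text) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    for page in root.iter("page"):
        page_number = int(page.get("number", "0"))
        for node in page.iter("text"):
            # itertext() also collects text nested in <b> or <i> tags.
            content = "".join(node.itertext()).strip()
            if not content:
                continue
            top = int(node.get("top", "0"))
            left = int(node.get("left", "0"))
            rows.append((page_number, top, left, content))
    # Order by page, then top-to-bottom, then left-to-right.
    rows.sort(key=lambda row: (row[0], row[1], row[2]))
    return rows


# Hand-written sample in the assumed shape, not real pdftohtml output.
SAMPLE = """<pdf2xml>
<page number="1" top="0" left="0" height="1188" width="918">
<text top="120" left="300" width="80" height="16" font="0">second</text>
<text top="120" left="100" width="80" height="16" font="0"><b>first</b></text>
<text top="200" left="100" width="80" height="16" font="0">third</text>
</page>
</pdf2xml>"""

if __name__ == "__main__":
    for row in read_positions(SAMPLE):
        print(row)
```

For a real file you would read the XML that pdftohtml wrote and pass its contents to the function instead of SAMPLE.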

P.S. I’m using this for Whether there will be result today.



Converting PDF to Text

I have recently been involved in a project to extract data from PDFs, which is actually evil, but that is not important now.

On Linux there is a set of utilities that comes with the xpdf program. It should be part of the default package installation; if not, just install it with apt-get or yum.

On Windows you can go to the GnuWin32 page. I just downloaded the zip so I would not have to remove it with an uninstaller later.

I don’t really need the layout information, so I just use pdftotext.

On Windows

program_location/pdftotext.exe -layout pdf_file.pdf

On Linux, just

pdftotext -layout pdf_file.pdf

The -layout option maintains the layout of the text as it appears in the PDF. Otherwise, the positioning of certain text will be inconsistent.
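
If you want to call pdftotext from a script instead of typing the command, something like the Python sketch below might work. It assumes pdftotext is on your PATH (or that you pass its full location, as on Windows), and it relies on pdftotext accepting "-" as the output file to write the text to standard output. The function names are my own, not part of any package.

```python
import subprocess


def pdftotext_command(pdf_path, program="pdftotext", keep_layout=True):
    """Build the argument list for pdftotext; "-" sends the text to stdout."""
    command = [program]
    if keep_layout:
        command.append("-layout")
    command += [pdf_path, "-"]
    return command


def pdf_to_text(pdf_path, program="pdftotext"):
    """Run pdftotext with -layout and return the extracted text as a string."""
    result = subprocess.run(
        pdftotext_command(pdf_path, program),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

On Windows you would pass the location of the program, for example pdf_to_text("pdf_file.pdf", program="program_location/pdftotext.exe").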