I’ve got a mess of HTML page dumps from an Openfire XMPP server admin console that I want to combine into a single PDF document for posterity. Here’s what I did.
1. The pages were captured using the File… Save Page As… function in Firefox. After saving every page under every tab I had 60 .html pages in all (actually “.jsp.html”, since Openfire is a Java application).
2. Made a list of those files sorted from earliest to latest, with one file name per line. Note that ls -t sorts by modification time, which for freshly saved pages matches the order they were saved in:
ls -tr1 *.html >filelist.txt
3. Wrote a little shell script to convert the files to PDF using the htmldoc utility, calling it “htmldoc-pdf.sh”:
#!/bin/bash
# htmldoc-pdf.sh Converts saved web pages to pdf
# Created 1/22/2013 by P Lembo
/usr/bin/htmldoc --outfile output.pdf --webpage --landscape --format pdf --embedfonts --no-links --size letter
5. Replaced the newlines in the file list with spaces:
perl -pi -e 's/\n/ /g' filelist.txt
6. Appended the list of file names to the script file:
cat filelist.txt >>htmldoc-pdf.sh
7. Edited the script so it looked something like this:
#!/bin/bash
/usr/bin/htmldoc --outfile output.pdf --webpage --landscape --format pdf --embedfonts --no-links --size letter index.jsp.html server-properties.jsp.html ...
8. Ran the script in the directory where all the .html files were located.
9. Opened the resulting file in my favorite pdf reader.
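The list-building and script-editing steps could probably be folded into one script. What follows is only a sketch of that idea, not what I actually ran: the function names list_pages and convert_pages and the HTMLDOC override variable are made up for illustration, and it assumes bash 4 or later (for mapfile) and file names that contain no newlines. Reading the list into an array means the file names never have to be pasted into the script, and names containing spaces stay intact.

```shell
#!/bin/bash
# Sketch only: build the file list and run htmldoc without hand-editing the script.
# list_pages, convert_pages and HTMLDOC are illustrative names, not part of htmldoc.

# Print the *.html files in the current directory, one per line, oldest first.
# ls -t sorts by modification time; -r reverses that to oldest first.
list_pages() {
    ls -tr1 -- *.html
}

# Read the list into an array (one element per line), then pass the whole
# array to htmldoc in a single call with the same options as the original script.
convert_pages() {
    local -a pages
    mapfile -t pages < <(list_pages)
    "${HTMLDOC:-/usr/bin/htmldoc}" --outfile output.pdf --webpage --landscape \
        --format pdf --embedfonts --no-links --size letter "${pages[@]}"
}
```

To use it, add a final line that calls convert_pages and run the script in the directory that holds the .html files.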
To be honest, the output I got from the steps above was serviceable, but some pages had text cut off along the right-hand margin (avoiding that was one of the reasons I chose the “--landscape” option), so this is still a very experimental procedure.
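One thing that might be worth trying for the cut-off text is htmldoc's --browserwidth option, which sets the browser width in pixels that htmldoc uses to scale images and pixel measurements. I have not tested this on the Openfire pages, so treat the following as a sketch: the function name try_widths, the HTMLDOC override variable and the three widths are all guesses for experimentation, not recommended settings. It renders the same pages once per width so the resulting PDFs can be compared side by side.

```shell
#!/bin/bash
# Sketch only: render the given pages at several --browserwidth values.
# try_widths and HTMLDOC are illustrative names; the widths are arbitrary guesses.

try_widths() {
    local width
    for width in 800 1024 1280; do
        # Same options as the original script, plus --browserwidth,
        # writing one PDF per width (output-800.pdf, output-1024.pdf, ...).
        "${HTMLDOC:-/usr/bin/htmldoc}" --outfile "output-${width}.pdf" --webpage --landscape \
            --format pdf --embedfonts --no-links --size letter \
            --browserwidth "$width" "$@"
    done
}
```

For example, try_widths index.jsp.html server-properties.jsp.html would produce three PDFs to compare.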