The PDF Manual

Practical tips & tricks for hacking PDF files

by Hadley Bradley

 Introduction

PDF files are commonplace within business and healthcare, and it’s easy to see why. The Portable Document Format (PDF) was developed by Adobe to allow documents to be exchanged between different hardware and operating systems with little impact on the visual presentation of the published document.

The PDF file format, which is based on the PostScript language, achieves this by packaging the text, fonts and images required to faithfully display the content within the file. This has made the PDF file the go-to choice for sharing and distributing documents.

Originally a proprietary format of Adobe, the PDF file format was standardised as ISO 32000-1 and made an open format in 2008 – which means vendors can now use it without having to pay Adobe any royalties.

The author has worked on a regional health information exchange which consisted of several million PDF files – this pocket book outlines many of the hacks and tricks used within that project to help manage a large number of PDF files. The majority of the tips are free to read here at the PDF bible web site. However, a couple of tips have been reserved for the ebook bundle which is available to purchase.

 Reading file metadata

The first hack we’re going to review is reading the PDF document metadata. Associated with each PDF document are a number of metadata items which can be used to describe the content of the document. These metadata items include user-generated fields like the document’s title, subject and keywords. Every PDF document also comes with a few computer-generated metadata items, such as the file size of the document and the number of pages it contains. The paper size used and the PDF version number can also be obtained.

To interrogate the metadata fields we’re going to use a command-line tool called pdfinfo, which is part of the Xpdf tool set. This simple command-line utility is available in both Windows and Linux versions from the creator’s website. If you’re using macOS with the Homebrew package manager, the utility can be installed using the following command:

brew install xpdf

The pdfinfo command prints the contents of the document’s information dictionary to the terminal window. To run the command, launch a terminal window or command prompt, depending on which operating system you’re using, and type the command pdfinfo followed by the name of the PDF file you want to interrogate.

pdfinfo 998784.pdf

If successful, this will print out all the metadata items. A typical example is shown below.

Title:          Invoice 998784
Creator:        _wkhtmltopdf
Producer:       Qt 4.8.7
CreationDate:   12/30/18 10:11:36
Tagged:         no
Form:           none
Pages:          1
Encrypted:      no
Page size:      595 x 842 pts (A4) (rotated 0 degrees)
File size:      31861 bytes
Optimized:      no
PDF version:    1.4
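Because every line of this output is a simple key and value separated by a colon, it is straightforward to read from a script. The following is a minimal Python sketch, not part of pdfinfo itself: the function names parse_pdfinfo and read_metadata are my own, and read_metadata assumes that pdfinfo is installed and on the PATH.

```python
import subprocess


def parse_pdfinfo(text):
    """Turn pdfinfo's 'Key: value' lines into a dictionary of strings."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as
        # '12/30/18 10:11:36' keep their own colons.
        key, separator, value = line.partition(":")
        if separator:
            info[key.strip()] = value.strip()
    return info


def read_metadata(pdf_path):
    """Run pdfinfo on one file and return its parsed output.

    Assumes the pdfinfo executable can be found on the PATH.
    """
    result = subprocess.run(
        ["pdfinfo", pdf_path],
        capture_output=True, text=True, check=True,
    )
    return parse_pdfinfo(result.stdout)
```

All values come back as strings, so a caller that wants the page count as a number would convert it with int(info["Pages"]).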

We can see the Title attribute has been set for this document. We can also see that the document was created using wkhtmltopdf, which is a tool we’ll explore when looking at converting HTML web pages. The PDF has one page, and we can see from the page size that it’s an A4 page in portrait mode, because the width (595 points) is smaller than the height (842 points).
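The page size is reported in points, where 72 points make one inch. As a worked check that 595 x 842 points really is A4 (210 x 297 mm), here is a small Python sketch. The function name page_size_mm is hypothetical, and the sketch ignores the rotation value that pdfinfo also reports.

```python
import re

POINTS_PER_INCH = 72
MM_PER_INCH = 25.4


def page_size_mm(page_size_value):
    """Convert a pdfinfo 'Page size' value to (width_mm, height_mm, orientation)."""
    match = re.match(r"\s*([\d.]+) x ([\d.]+) pts", page_size_value)
    if match is None:
        raise ValueError(f"unrecognised page size: {page_size_value!r}")
    width_pt = float(match.group(1))
    height_pt = float(match.group(2))
    # points -> inches -> millimetres
    width_mm = width_pt * MM_PER_INCH / POINTS_PER_INCH
    height_mm = height_pt * MM_PER_INCH / POINTS_PER_INCH
    orientation = "portrait" if height_pt >= width_pt else "landscape"
    return round(width_mm, 1), round(height_mm, 1), orientation


print(page_size_mm("595 x 842 pts (A4) (rotated 0 degrees)"))
# prints (209.9, 297.0, 'portrait')
```

The small difference from 210 mm comes from the page size having been rounded to whole points.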

If you look at the CreationDate value you’ll notice that it’s in the American month/day/year format. If you’d prefer the date in year-first YYYYMMDD order, you can pass the -rawdates option when calling the pdfinfo command.

pdfinfo -rawdates 998784.pdf

Running this command on the same file outputs the creation date in its raw, undecoded form: the prefix D:, the date and time as YYYYMMDDHHmmSS, and the offset from UTC.

Title:          Invoice 998784
Creator:        _wkhtmltopdf
Producer:       Qt 4.8.7
CreationDate:   D:20171230101136+01'00'
Tagged:         no
Form:           none
Pages:          1
Encrypted:      no
Page size:      595 x 842 pts (A4) (rotated 0 degrees)
File size:      31861 bytes
Optimized:      no
PDF version:    1.4
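If you need the raw date as a proper date value, it can be decoded with a regular expression. The Python sketch below is an illustration under a stated assumption: it only handles dates that are complete down to the seconds, like the one above, even though the PDF format also allows shorter forms. The name parse_pdf_date is my own.

```python
import re
from datetime import datetime, timedelta, timezone

PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"  # YYYYMMDDHHmmSS
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?"          # optional UTC offset
)


def parse_pdf_date(raw):
    """Decode a full-length PDF date string into a datetime."""
    match = PDF_DATE.match(raw.strip())
    if match is None:
        raise ValueError(f"unsupported PDF date: {raw!r}")
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    sign, offset_hours, offset_minutes = match.groups()[6:]
    if sign is None:
        tz = None                      # no offset recorded in the file
    elif sign == "Z":
        tz = timezone.utc
    else:
        offset = timedelta(hours=int(offset_hours or 0),
                           minutes=int(offset_minutes or 0))
        tz = timezone(-offset if sign == "-" else offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


print(parse_pdf_date("D:20171230101136+01'00'").isoformat())
# prints 2017-12-30T10:11:36+01:00
```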

When you run the command against a PDF file that has more metadata items set, those items are printed as well.

In the example below, you can see that the subject and author values have also been set. You can see from the Encrypted field that this particular PDF document has been encrypted with some restrictions. This file can be printed, but the contents can’t be changed or copied.

pdfinfo journal.pdf

Title:          Technical Handover Journal
Subject:        Secondment Handover Documentation
Author:         Hadley Bradley
CreationDate:   09/25/18 10:02:09
Tagged:         no
Form:           none
Pages:          8
Encrypted:      yes (print:yes copy:no change:no addNotes:no)
Page size:      595.276 x 841.89 pts (A4) (rotated 0 degrees)
File size:      154008 bytes
Optimized:      no
PDF version:    1.5

We’ll be looking at how to set these restrictions later on in the book.
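In the meantime, when you are sifting through a large number of files it can be handy to turn the Encrypted value into something a script can test. This is a minimal Python sketch based only on the layout shown in the output above. The function name parse_encryption is hypothetical.

```python
import re


def parse_encryption(value):
    """Interpret pdfinfo's 'Encrypted' value as (is_encrypted, permissions)."""
    encrypted = value.strip().lower().startswith("yes")
    # Collect each 'name:yes' or 'name:no' pair inside the parentheses.
    permissions = {
        name: flag == "yes"
        for name, flag in re.findall(r"(\w+):(yes|no)", value)
    }
    return encrypted, permissions


encrypted, permissions = parse_encryption(
    "yes (print:yes copy:no change:no addNotes:no)"
)
print(encrypted, permissions["print"], permissions["copy"])
# prints True True False
```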

 Fixing broken XREF tables

Have you ever encountered a corrupt or broken PDF file? It can happen: sometimes the cross-reference (XREF) table within the file, which records the byte offset of each object in the file, becomes corrupted.

I encountered a situation in which the PDF files created by a particular clinical system couldn’t be processed by other applications. On further inspection, it turned out that the PDF files were actually corrupted. Although the files would load correctly in Adobe Reader, any other program that tried to process them would fail. I guess Adobe Reader was detecting the corruption and fixing the data stream on the fly as it was loading the file.

It turned out the XREF tables within the PDF file were corrupted. There is a good article titled The trouble with the XREF table which explains the problem in more detail.

The supplier of the clinical system in question couldn’t easily resolve the issue because they were using a third-party tool to generate their PDF files. So I had to find a way to fix these PDF files as they came out of the system.

Luckily, the free version of the PDF Toolkit (pdftk) by PDF Labs would read the files and attempt to fix the broken XREF tables. The command I used to fix the broken files as they came out of the afflicted system was:

pdftk broken.pdf output fixed.pdf

It’s worth noting that the pdf tool kit can also be used on Windows machines.
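When files arrive in bulk, the same command can be applied to a whole folder. The following Python sketch is one way to do that, not the script used on the original project. The function names and the folder arguments are hypothetical, and it assumes pdftk is on the PATH. Writing to a separate output folder keeps the original files untouched.

```python
import subprocess
from pathlib import Path


def repair_command(source, destination):
    """Build the pdftk command line that rewrites one PDF file."""
    return ["pdftk", str(source), "output", str(destination)]


def repair_folder(incoming, repaired):
    """Rewrite every PDF in `incoming` into `repaired`.

    Returns the names of the files that pdftk could not process.
    """
    incoming, repaired = Path(incoming), Path(repaired)
    repaired.mkdir(parents=True, exist_ok=True)
    failures = []
    for source in sorted(incoming.glob("*.pdf")):
        result = subprocess.run(repair_command(source, repaired / source.name))
        if result.returncode != 0:
            failures.append(source.name)
    return failures
```

Collecting the failures rather than stopping at the first one means a single unrecoverable file doesn’t hold up the rest of the batch.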

 Compare PDF files with a visual diff tool

Many of the clinical systems I develop at work produce PDF reports when a patient is discharged. In some instances, these reports can be amended after the initial report is generated. The clinical audit team needed a visual tool to compare any two revisions of a given report and see the differences highlighted in colour.

To do this I used the excellent DiffPDF tool created by Qtrac Ltd.

[Figure: DiffPDF screenshot]

The tool is available for Windows and can be installed on an Ubuntu-based Linux system by using the following command:

sudo apt-get install diffpdf

The DiffPDF tool can be easily integrated into an existing software solution, as you can pass in the PDF file names as command-line arguments. In my case, I produced a screen for the clinical audit team which showed patients whose reports had revisions. When an auditor picked a particular patient for review, I simply launched DiffPDF, passing across the appropriate file names for them to compare.

For example, to launch DiffPDF passing in two versions of a file you would use the syntax:

diffpdf original.pdf revised.pdf
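From inside another application, the launch might look something like the Python sketch below. It is an illustration rather than the code from the audit screen: the function names are my own, and it assumes the diffpdf executable is on the PATH.

```python
import subprocess
from pathlib import Path


def diffpdf_command(original, revised):
    """Build the diffpdf command line, checking both files exist first."""
    paths = [Path(original), Path(revised)]
    for path in paths:
        if not path.is_file():
            raise FileNotFoundError(f"cannot compare, missing file: {path}")
    return ["diffpdf"] + [str(path) for path in paths]


def launch_diffpdf(original, revised):
    """Open DiffPDF on two revisions without waiting for it to close."""
    # Popen returns immediately, so the calling application stays
    # responsive while the auditor works in the DiffPDF window.
    return subprocess.Popen(diffpdf_command(original, revised))
```

Checking for the files up front lets the calling screen show its own error message instead of leaving the auditor with an empty DiffPDF window.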

By default, DiffPDF highlights deleted text in red, inserted text in cyan, and replaced text in magenta. All the colours can be customized, or plain highlighting can be chosen. The change bar colour, thickness, and indent can also be customized, or the change bar can be hidden entirely.

The tool allows toggling between different comparison modes. If the Words comparison mode is chosen, each page’s text is compared word by word, which is ideal for languages like English. If the Characters mode is chosen, each page’s text is compared character by character, which is more suited to languages like Chinese and Japanese. Finally, there is an Appearance comparison mode, which is better suited if the PDF contains images.

 Convert HTML document to PDF

There are various tools for converting an HTML document or web page into a PDF file. One of the better ones is an open-source command-line utility called wkhtmltopdf. This utility uses the Qt WebKit rendering engine which ensures it faithfully reproduces the output when converting HTML documents.

The utility is available for different operating systems, and binaries can be downloaded from the website for Windows, macOS and Linux.

To convert an HTML document called book.html into a PDF you would run the following command:

wkhtmltopdf book.html book.pdf

The default page size used within the generated PDF file is A4. However, this can be changed by using the page size parameter and setting a new value. For example, to use the Letter paper size you would run the following command:

wkhtmltopdf --page-size Letter book.html book.pdf

As wkhtmltopdf uses the WebKit engine to render the HTML, the utility also supports cascading style sheets (CSS). This is very useful as you can use a print style sheet within your HTML document to control how different HTML tags are handled at page boundaries.

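<style>
/* Keep at least three lines of a block together at the bottom
   (orphans) and top (widows) of a page. */
p, h2, h3 {
    orphans: 3;
    widows: 3;
}

/* Keep images and tables in one piece rather than splitting
   them across a page boundary. */
img, table {
    page-break-inside: avoid;
}
</style>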

For example, if the above CSS code was included in the HTML document it would prevent images and tables being broken up across page boundaries. It also tries to prevent widows and orphans within headings (h2 and h3) and paragraph text (p).

A widow is a paragraph-ending line that falls at the beginning of the following page or column, thus separated from the rest of the text. Mnemonically, a widow is alone at the top.

An orphan is a paragraph-opening line that appears by itself at the bottom of a page or column, thus separated from the rest of the text. Mnemonically, an orphan is alone at the bottom.

It’s worth taking the time to review all the possible command-line options for this utility, as it has options for disabling images and JavaScript. It also allows you, for example, to define custom page margins and specify custom headers and footers to use within the output.

To get a full list of all the available options, use the --manpage option.
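To give a flavour of how these options combine, here is a Python sketch that assembles a wkhtmltopdf command line with a page size, top and bottom margins, a page-number footer, and optional switches for JavaScript and images. The function name and its default values (20mm margins, a "[page] / [topage]" footer) are my own choices for illustration. The flags themselves are ones listed in the wkhtmltopdf help output, but do check them against the version you have installed.

```python
import subprocess


def wkhtmltopdf_command(source, destination, page_size="A4", margin="20mm",
                        footer="[page] / [topage]",
                        javascript=True, images=True):
    """Build a wkhtmltopdf command line from a few common options."""
    command = [
        "wkhtmltopdf",
        "--page-size", page_size,
        "--margin-top", margin,
        "--margin-bottom", margin,
        # [page] and [topage] are replaced by wkhtmltopdf with the
        # current page number and the total number of pages.
        "--footer-center", footer,
    ]
    if not javascript:
        command.append("--disable-javascript")
    if not images:
        command.append("--no-images")
    return command + [source, destination]


def convert(source, destination, **options):
    """Run the conversion, raising CalledProcessError if it fails."""
    subprocess.run(wkhtmltopdf_command(source, destination, **options), check=True)
```

For example, convert("book.html", "book.pdf", page_size="Letter", javascript=False) would produce a Letter-sized PDF with scripts disabled.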

wkhtmltopdf --manpage

This pocket book is currently a work in progress. Consider bookmarking the page for later reference, as I’ll be covering more topics shortly.


Table of Contents

  1. Introduction
  2. Reading file metadata
  3. Fixing broken XREF tables
  4. Compare PDF files with a visual diff tool
  5. Convert HTML document to PDF