Sunday, December 22, 2013


In the last post, I described how to convert a PDF into XPS as an intermediate step to split out and convert the individual pages to TIFF format.  Our goal is to combine the single-page TIFF files back into their original form, where each multi-page document is held in a single file.  We have to do this since the health providers insist on sending us our health records combined into a single PDF file.

This post is geek-speak.  If you don’t know C# and .NET, forget it.  Just skip to the bottom and pull down the tool if you want.

The first step is to open the XPS file and to print the individual pages as single page TIFF files. 

We create a new XpsDocument object to hold our XPS file.  Next, we get the number of pages and walk the stack of pages one at a time and save each one as a new TIFF file.
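The listing for this step did not survive on this page, so here is a minimal sketch of the loop described above rather than the DocDoc source itself.  It assumes a WPF project (references to ReachFramework, PresentationCore and PresentationFramework) and a call from an STA thread.  The class name, the method names, the page-0001.tif naming scheme and the choice of Zip compression are my own assumptions for illustration.

```csharp
using System;
using System.IO;
using System.Windows.Documents;
using System.Windows.Media;
using System.Windows.Media.Imaging;
using System.Windows.Xps.Packaging;

public static class XpsToTiff
{
    // XPS page sizes are in 1/96-inch units; convert them to pixels at the target DPI.
    public static int ToPixels(double units, double dpi)
    {
        return (int)Math.Ceiling(units * dpi / 96.0);
    }

    // Hypothetical helper: writes one TIFF per page and returns the page count.
    // Call it from an STA thread, since it renders WPF visuals.
    public static int SavePages(string xpsPath, string outputDir, double dpi)
    {
        Directory.CreateDirectory(outputDir);
        XpsDocument xps = new XpsDocument(xpsPath, FileAccess.Read);
        try
        {
            FixedDocumentSequence sequence = xps.GetFixedDocumentSequence();
            DocumentPaginator paginator = sequence.DocumentPaginator;
            paginator.ComputePageCount();

            for (int i = 0; i < paginator.PageCount; i++)
            {
                using (DocumentPage page = paginator.GetPage(i))
                {
                    // Render the page visual into a bitmap at the requested resolution.
                    var bitmap = new RenderTargetBitmap(
                        ToPixels(page.Size.Width, dpi), ToPixels(page.Size.Height, dpi),
                        dpi, dpi, PixelFormats.Pbgra32);
                    bitmap.Render(page.Visual);

                    // Encode the bitmap as a single-frame (single-page) TIFF.
                    var encoder = new TiffBitmapEncoder { Compression = TiffCompressOption.Zip };
                    encoder.Frames.Add(BitmapFrame.Create(bitmap));

                    string name = Path.Combine(outputDir, string.Format("page-{0:D4}.tif", i + 1));
                    using (FileStream stream = File.Create(name))
                    {
                        encoder.Save(stream);
                    }
                }
            }
            return paginator.PageCount;
        }
        finally
        {
            xps.Close();
        }
    }
}
```

Zero-padded file names keep the pages in order when sorted by name in Explorer.  A higher DPI gives the OCR engine more to work with, at the cost of larger files.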


Now, the toolset becomes pretty crude.  For my sample PDF file of 877 pages, I should now have 877 TIFF files in a directory, and I can use the Windows File Explorer and an image viewer to sort, delete, and combine the files.  I find that the Windows Photo Viewer application works well. 

The tool (download source or object below) allows me to drag in the pages that should be combined into a single TIFF file.  It then (optionally) deletes the original files.

Therefore, this process may look something like:  a) walk the files using Explorer and the Windows Photo Viewer and delete files we don’t want to keep; b) when we find a file we want, walk the individual TIFF files until we see the last page we want to combine into a single saved TIFF file; c) grab all the individual TIFF files and drop them on the application from below; d) make sure the pages are in the correct order, and re-sort them if they are not; e) save the pages into a new single TIFF file that contains one or more pages.
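For reference, the save in step e) can be sketched with the WPF imaging classes.  This is a minimal illustration and not the DocDoc source: the class name, method name and parameters are assumptions, and it expects the caller to pass the page files already in the order they should be saved.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Windows.Media.Imaging;

public static class TiffCombiner
{
    // Hypothetical helper: writes every frame of every input file, in list order,
    // into one multi-page TIFF, then optionally deletes the input files.
    public static void Combine(IList<string> pagePaths, string outputPath, bool deleteOriginals)
    {
        if (pagePaths == null || pagePaths.Count == 0)
            throw new ArgumentException("At least one page file is required.", "pagePaths");

        var encoder = new TiffBitmapEncoder { Compression = TiffCompressOption.Zip };
        foreach (string path in pagePaths)
        {
            using (FileStream input = File.OpenRead(path))
            {
                // OnLoad reads the pixels right away, so the stream can be closed afterwards.
                BitmapDecoder decoder = BitmapDecoder.Create(
                    input, BitmapCreateOptions.PreservePixelFormat, BitmapCacheOption.OnLoad);
                foreach (BitmapFrame frame in decoder.Frames)
                {
                    encoder.Frames.Add(frame);
                }
            }
        }

        using (FileStream output = File.Create(outputPath))
        {
            encoder.Save(output);
        }

        if (deleteOriginals)
        {
            string savedFile = Path.GetFullPath(outputPath);
            foreach (string path in pagePaths)
            {
                // Never delete the file we just wrote, even if it was also an input.
                if (!string.Equals(Path.GetFullPath(path), savedFile, StringComparison.OrdinalIgnoreCase))
                {
                    File.Delete(path);
                }
            }
        }
    }
}
```

The originals are deleted only after the combined file has been written, so a failed save leaves the single pages untouched.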

As I say, this process is a bit rough and the code and tool are not “production” applications; but, I post them here as reference for someone else who may want to use the logic and improve the functionality to meet their needs. 

Of course, all of this would be much simpler if the health care institutions would comply with the federal regs (45 C.F.R. § 164.524) where it says: 

(i) The covered entity must provide the individual with access to the protected health information in the form or format requested by the individual, if it is readily producible in such form or format; or, if not, in a readable hard copy form or such other form or format as agreed to by the covered entity and the individual.

Although I have requested my records in high-quality multi-page TIFF format, or at least as a single TIFF file, this has NEVER happened, and they always come as a very low-quality PDF.  Nor has the health care institution ever contacted me to see if we could “agree” on the format as per these regulations.  Hopefully, the magical “blue button” will fix all this; but in the meantime, here’s the full (and still a bit rough) tool set. 

The “Document Doctor” or DocDoc – source files here; object (executables) here.

Saturday, December 21, 2013

Converting the PDF to XPS

Last time we removed the password from the PDF supplied by our Provider, and now we have an 877-page PDF file that contains dozens of multi-page documents from my medical record.  Inside this one large PDF, I may have four pages of labs, a two-page EKG, four pages of physician notes, etc.  I want these discrete documents as individual files and not munged together in a giant PDF.  Once they are saved back into individual files, each containing the relevant pages of a procedure such as lab results, it will be easier for me and the indexer to find and organize the documents.

One problem with dealing with PDF formatted files is that it is difficult to change from PDF to other formats; at least using Open Source tools and utilities.  Therefore, we need a way to translate the PDF into a format we can use.  It turns out that on the Windows platform, we have a nice tool available that allows us to process the documents:  the XPS printer driver.  Once we convert the PDF to XPS, we can use code to further process the file into individual documents.

If you haven’t done so, install and configure the XPS printer driver.  Use the free Adobe Reader and open the large PDF.  When you choose the Print menu item in Adobe Reader, select the XPS printer, and it will ask you where you want to create the file.  The point here is that the XPS driver writes to a file and not to a printer.

We need to retain as much quality as possible.  Unfortunately, many providers print to PDF at very poor quality, so we will attempt to clean things up so that our OCR engine has a better chance of indexing the documents later.

For our current task, I suggest you use the grayscale setting if you are not printing color.


Once you press the “Print” button, a new dialog will pop up to ask you where you want to save the new XPS document.  You will find that it is very similar to the PDF in both size and quality.  The only thing we have gained is that the file is now in a format we can work with.

In the next post, we will detail another tool which will open the new XPS file and print out each page as a separate high-quality TIFF file.  After that, we can use another tool to combine the individual TIFF pages back into a single multiple-page TIFF file which can be indexed and viewed.  Almost there then.

Friday, December 20, 2013

Remove (or Add) Password Protection to PDF

As I mentioned in the previous post, I created several tools to assist with the processing of my Personal Health Record.  The first tool I wrote will unlock a PDF file.  Locked PDF seems to be how most Providers send along copies of the health record.  In other words, when I request my medical records, they come in one large multi-page PDF file that is password protected.  To make processing easier, I want to remove the password.

To make this work, I use the PDFsharp library, which is an Open Source toolkit that greatly simplifies everything.

I have created two projects:  one to unlock a PDF file and the other to lock the PDF.

The source code is available here.  If you want the executables, they are here.

To illustrate just how great and easy the PDFsharp library is, here are the three lines of code that unlock the PDF document:

PdfDocument document = PdfReader.Open(fileName, pw, PdfDocumentOpenMode.Modify);
document.SecuritySettings.DocumentSecurityLevel = PdfSharp.Pdf.Security.PdfDocumentSecurityLevel.None;
document.Save(fileName);

The “fileName” variable is the path and name to the PDF file and the “pw” variable is the password for the document we want to clear.
That’s it.  Three simple lines of code to clear the password.  The “lock” functions are just as easy.  Mark another win for Open Source software.
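
The lock direction is not listed here, so the following is only a minimal sketch of how it might look with PDFsharp, not the code from my lock project.  The class name, method name and parameters are hypothetical, and it assumes the PDFsharp 1.x behavior where setting a password on SecuritySettings turns encryption on when the document is saved.

```csharp
using System;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

public static class PdfLocker
{
    // Hypothetical helper: adds password protection to an unprotected PDF in place.
    public static void Lock(string fileName, string userPassword, string ownerPassword)
    {
        if (string.IsNullOrEmpty(userPassword))
            throw new ArgumentException("A user password is required.", "userPassword");

        PdfDocument document = PdfReader.Open(fileName, PdfDocumentOpenMode.Modify);

        // The user password opens the document; the owner password controls permissions.
        document.SecuritySettings.UserPassword = userPassword;
        document.SecuritySettings.OwnerPassword = ownerPassword;

        document.Save(fileName);
    }
}
```

Saving over the original file mirrors the unlock code above; pass a different path to Save if you would rather keep the unprotected copy.
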
Once we have the document unlocked, we find that we may have a few hundred pages of medical records.  Next time we’ll talk about how to break those down into manageable and indexable documents so we can use them.