Last time we removed the password from the PDF supplied by our Provider, and now we have an 877-page PDF file that contains dozens of multi-page documents from my medical record. Inside this one large PDF, I may have four pages of labs, a two-page EKG, four pages of physician notes, etc. I want these discrete documents saved as individual files and not mashed together in one giant PDF. Once each file contains only the pages relevant to one procedure, such as a set of lab results, it will be easier for me and the indexer to find and organize the documents.
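To make the goal concrete, here is a minimal sketch of the bookkeeping involved: given the order and length of each document inside the big file, work out which pages belong to which output file. The function name `page_ranges` and the document labels are made up for illustration, and the page counts are just the examples mentioned above; nothing here reads the PDF itself.

```python
from typing import List, Tuple


def page_ranges(docs: List[Tuple[str, int]]) -> List[Tuple[str, int, int]]:
    """Turn (label, page_count) pairs into (label, first_page, last_page).

    Pages are numbered from 1, and documents are assumed to sit back to
    back in the large file, in the order given.
    """
    ranges = []
    start = 1
    for label, count in docs:
        end = start + count - 1
        ranges.append((label, start, end))
        start = end + 1
    return ranges


# Hypothetical labels, using the example page counts from the text.
plan = page_ranges([("labs", 4), ("ekg", 2), ("physician_notes", 4)])
for label, first, last in plan:
    print(f"{label}: pages {first}-{last}")
```

A list like this is what we ultimately want to hand to whatever tool does the splitting.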
One problem with PDF files is that it is difficult to convert them to other formats, at least using open source tools and utilities. Therefore, we need a way to translate the PDF into a format we can use. It turns out that on the Windows platform we have a nice tool available that lets us process the documents: the XPS printer driver. Once we convert the PDF to XPS, we can use code to further process the file into individual documents.
If you haven’t done so, install and configure the XPS printer driver (it is typically listed as "Microsoft XPS Document Writer"). Open the large PDF in the free Adobe Reader. When you choose the Print menu item in Adobe Reader, select the XPS printer; it will ask you where you want to create the file. The point here is that the XPS driver writes to a file and not to a physical printer.
We need to retain as much quality as possible. Unfortunately, many providers print to PDF at very poor quality, so we will attempt to clean things up so that our OCR engine has a better chance of indexing the documents later.
For our current task, I suggest you use the grayscale setting if your documents do not contain color.
Once you press the “Print” button, a new dialog will pop up to ask you where you want to save the new XPS document. You will find that the XPS file is very similar to the PDF in both size and quality. The only thing we have gained is that the file is now in a format we can work with.
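Before moving on, it is worth a quick check that every page made it into the XPS file. An XPS document is a ZIP package, and each page is stored inside it as a part whose name ends in `.fpage`, so the following rough sketch simply counts those parts. The function name `count_xps_pages` and the file name in the usage comment are assumptions for illustration, and the sketch assumes the package holds a single document.

```python
import zipfile


def count_xps_pages(path):
    """Count the fixed-page parts (*.fpage) inside an XPS package.

    `path` may be a file name or a file-like object. Assumes one
    document per package, so every .fpage part is one printed page.
    """
    with zipfile.ZipFile(path) as package:
        return sum(
            1 for name in package.namelist()
            if name.lower().endswith(".fpage")
        )


# Usage (hypothetical file name):
#   print(count_xps_pages("medical_record.xps"))
```

If the number matches the page count of the original PDF (877 in my case), the conversion did not drop anything.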
In the next post, we will detail another tool that opens the new XPS file and writes out each page as a separate high-quality TIFF file. After that, we can use another tool to combine the individual TIFF pages back into a single multi-page TIFF file that can be indexed and viewed. We are almost there.