PDF makes a professional-quality output

CIOL Bureau

03 Jan 2003 00:00 IST

Updated On 03 Jan 2003 00:20 IST

When someone in marketing wants a brochure that looks "just so," or legal needs a document that shouldn't be changed, they publish it as Portable Document Format (PDF). PDF is a standard defined by Adobe Systems for platform-independent, device-independent rendering and display of documents.

PDF builds on the fantastic success of Adobe's PostScript (PS), first released in 1984 to improve the printing sophistication possible with common hardware. In principle, PDF has a fixed appearance, invariant across different Web browsers and different devices including printers; the content of PDF documents is "locked down."

While neither of these propositions is strictly true, they're close enough for most purposes.

Moreover, PDF generally prints well; only a plain text document is more likely to be compatible with any particular printer.

What does that have to do with you? As a systems or server-side programmer, perhaps you think of PDF as just another opaque content type. Your desktop users or document specialists occasionally update instances on your servers, and you serve up the files just as you would any others. That, you say, should be the limit of your involvement.

Programmatic PDF generation

That model misses out on several interesting server-side possibilities for processing PDF, though. When you automate generation of PDF, you can begin to use all the techniques of software engineering: version control, abstraction, professional-quality backups, regression tests, and so on. Programmatic PDF generation means you can customize deliverables in a manageable way.

Perhaps your organization's habit with PDF is to have someone adept with a particular desktop word processor set up a "mail merge" sort of operation to parameterize document output. Automation can reach far deeper, though.

Desktop software vendors have a partial appreciation of this. Several word-processing or desktop-publishing packages have scripting capabilities that reach at least part of the way to PDF. Some shops create PostScript images and transform them into PDF with Ghostscript or similar packages.

My favorite way to automate PDF generation, though, is with one of three actively maintained open source libraries: ReportLab, PJ, and PDFlib. They're all roughly comparable, and I've had medium to good success on projects that relied on each.

ReportLab is the one I currently use most: it handles the multi-megabyte PDFs with which I work, its exposure of Python as a scripting language suits me, its library includes all the functionality I need for daily work, and the ReportLab company behind the library appears to enjoy sustainable business.

Moreover, its convenient integration into the Python interactive shell makes for a delightfully productive development environment. The rest of this month's "Server clinic" illustrates how you can start to program PDF.

PDF's "Hello, world"

While you probably have a good Python installation on your servers already, Python.org's download page can help ensure you're current. Version 2.2.1 is a good choice. With Python installed, you need to visit the ReportLab Download page before you begin your PDF programming career. Even over slow connections, downloading and installing both Python and the ReportLab Toolkit takes well under an hour.

The source code for your first application can be as simple as this:

This code simply puts a headline on an otherwise blank piece of paper. While mundane, it hints at new horizons: font style and size, content, and formatting are all programmable. When your organization decides to publish in Times New Roman rather than Helvetica, you can, in principle, change one configuration assignment and regenerate everything, rather than having to open each of thousands of documents, alter them, and write them back out.

The same is true for other effects: if you want to expand the typeface on information targeted to older readers, for instance, your application can automate that. Don't think you have to develop your own word processor to accomplish anything meaningful, though. While the ReportLab library is broad and deep enough to allow that, it also supports a couple of specific shortcuts that enormously simplify my PDF programming.

First is the import_HTML method. This renders valid HTML source into PDF pages. For many applications, I find it convenient to prototype in HTML, get "stakeholder sign-off" for a sample document, parameterize the HTML generation, then complete an implementation with:

my_document.import_HTML(my_html_source)

This gives me a very fast, easily maintained, fully programmatic way to pour content into PDF. ReportLab's processing efficiency is so good that I can comfortably generate all kinds of PDF documents for Web display on the fly. This gives me the opportunity to keep critical financial or engineering reports fully current with the latest data while preserving an appropriate visual appearance. Print documents enjoy the same choices for customization, of course.

Putting together PDF pieces

A second crucial library function is copyPages. It appends an existing PDF document to a Canvas instance, which makes it easy to construct a PDF document as a concatenation of several pieces. For more sophisticated effects, ReportLab, like other PDF tool vendors, licenses a for-fee product. In ReportLab's case, its PageCatcher product annotates existing PDF documents, reorders their pages, reformats them for different printing methods, adds backgrounds (including watermarks), and fills in PDF forms. ReportLab documents several interesting uses for PageCatcher.

One example is programmatic preparation of completed Internal Revenue Service (IRS) forms.

A final ReportLab capability I've found important is its management of Tables of Contents. Online document readers appreciate these navigational aids, which Adobe calls "bookmarks" or "outlines." Most PDF viewers show these as menus in a left-hand window. The ReportLab Reference itself constitutes a nice example of a bookmarked document. Such ReportLab functions as copyPages include an option to import an outline properly into a larger document, or discard it.

Conclusion

Whenever a computing job seems tedious or error-prone -- updating documents "by hand," for example -- you should be on the lookout for a way to automate the process. Although many systems programmers don't seem to realize it, management of PDF documents presents rich opportunities for automation and abstraction.

Use the ReportLab libraries or other available PDF-savvy tools to teach your server to do your PDF work. That should free your time for more productive pursuits. Future installments of "Server clinic" are likely to touch on other under-appreciated fields for server-side automation, including generation of Excel and Word documents.

Disclaimer: I'm on cordial personal terms with the employees of several companies that specialize in PDF-related products. However, I've never had a financial interest in any of the companies, nor any contractual relationship other than as an ordinary customer.

(The author is Vice President of Phaseit, Inc.)
