Daniel Roy Greenfeld

Adding Metadata to PDFs

How to use Python to combine PDFs and then add metadata to them.

Fri Feb 28 2020 3 min read


For both Django Crash Course and the forthcoming Two Scoops of Django 3.x, we're using a new process to render the PDFs. Unfortunately, until just a few days ago that process didn't include the cover. Instead, covers were inserted manually using Adobe Acrobat.

While that manual process worked, it came with predictable consequences.

# Merging the PDFs

This part was easy and found in any number of blog articles and Stack Overflow answers.

  • Step 1: Install pypdf2
  • Step 2: Write a script as seen below
from PyPDF2 import PdfFileMerger

now = datetime.now()

pdfs = [
  'images/Django_Crash_Course_5.5x8in.pdf',
  '_output/dcc.pdf',
]

merger = PdfFileMerger()

for pdf in pdfs:
    merger.append(pdf)

merger.write("releases/beta-20200226.pdf")
merger.close()    

It was at this point that we discovered that our new file, releases/beta-20200226.pdf, was missing most of the metadata. Oh no!

# Adding the Metadata

According to the PyPDF2 docs, adding metadata is very straight-forward. Just pass a dict into the addMetadata() function. I inserted this code right before the call to merger.write():

merger.addMetadata({
    "Title": "Django Crash Course",  
    "Authors": 'Daniel Roy Greenfeld, Audrey Roy Greenfeld',
    "Description": "Covers Python 3.8 and Django 3.x",
    "ContentCreator": "Two Scoops Press",
    "CreateDate": "2020-02-26",
    "ModifyDate": "2020-02-26",
})

The PDF built! Yeah! Time to open it up and see the results!

Alas, no metadata showed up.

Then I spent a long time with trial-and-error trying to get the metadata to show up properly. While there are lots of Python-related articles on extracting metadata using PyPDF2, I struggled to find anything that explained how to add metadata.

# Doing My Homework

After a bunch of research (googling, stack overlow-ing, and visiting forums) I found a wonderful book on O'Reilly called PDF Explained by John Whitington. Much credit to John Whitington, he's a good writer and very knowledgable on the topic of PDF.

For my purposes, the two critical sections were found in Chapter 4 of PDF Explained:

  • https://www.oreilly.com/library/view/pdf-explained/9781449321581/ch04.html#didentries
  • https://www.oreilly.com/library/view/pdf-explained/9781449321581/ch04.html#dates

Based off what I read, I established the following rules:

  • Every metadata field name had to be prefixed with /
  • Stick to the metadata names found in chapter 4
  • Follow the date format supplied in chapter 4

# Writing the Code!

Now armed with my rules I returned to the code. This is what I came up with:

from datetime import datetime
from PyPDF2 import PdfFileMerger

pdfs = [
  'images/Django_Crash_Course_5.5x8in.pdf',
  '_output/dcc.pdf',
]

merger = PdfFileMerger()

for pdf in pdfs:
    merger.append(pdf)

# Make PDF datestamp
now = datetime.now()
pdf_datestamp = now.strftime("D:%Y%m%d%H%M%S-8'00'")

# https://www.oreilly.com/library/view/pdf-explained/9781449321581/ch04.html#didentries
# Fields are **precisely** named
merger.addMetadata({
    "/Author": 'Daniel Roy Greenfeld, Audrey Roy Greenfeld',
    "/Title": "Django Crash Course",
    "/Subject": "Covers Python 3.8 and Django 3.x",
    "/Creator": "Two Scoops Press",
    "/CreationDate": pdf_datestamp,
    "/ModDate": pdf_datestamp,
})

# Write the release
version = f"beta-{now.strftime('%Y%m%d')}"
merger.write(f"releases/{version}.pdf")
merger.close()

# Conclusion

The lesson I learned writing this little utility is that as useful as Google and Stack Overflow might be, sometimes you need to explore reference manuals. Which, if you ask me, is a lot of fun. 😃

Speaking of reference manuals, while I referenced the online version of PDF Explained to get my work done, I've ordered a kindle version of the book. It's the least I can do.