How to Split a .pdf Every 2 Pages Using Python

So today I found myself in the unenviable position of having to scan
in hundreds of pages of archived paper-based logs at work.
Of course, my first thought was: how can I automate some of this?
Whilst keeping the photocopier company as I scanned the paperwork in,
I found a little treat in the Python Package Index called PyPDF2.

Arch: yaourt -S python-pypdf2
Debian: apt-get install python3-pypdf2
(or, on any distro: pip install PyPDF2)

I then spent the evening poring over the documentation and figuring this
thing out. What I needed was a way to split the PDF files into pairs of
pages, allowing me to bulk scan and save time!

Here is the code:

#!/usr/bin/env python3

from PyPDF2 import PdfFileWriter, PdfFileReader
import glob

pdfs = glob.glob("*.pdf")

for pdf in pdfs:
    inputFile = PdfFileReader(open(pdf, "rb"))

    # Round up so the final page of an odd-length PDF isn't dropped
    for i in range((inputFile.numPages + 1) // 2):
        output = PdfFileWriter()
        output.addPage(inputFile.getPage(i * 2))

        # Only add a second page if one actually exists
        if i * 2 + 1 < inputFile.numPages:
            output.addPage(inputFile.getPage(i * 2 + 1))

        # Name each chunk after the source file, minus its ".pdf" extension
        newname = pdf[:-4] + "-" + str(i) + ".pdf"

        with open(newname, "wb") as outputStream:
            output.write(outputStream)

This code will split every PDF file in the directory where it is run
into two-page chunks, writing each chunk as a new file alongside the
original. A five-page scan.pdf, for example, comes out as scan-0.pdf,
scan-1.pdf and scan-2.pdf, with the last chunk holding just the odd
final page.
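
If you want to sanity-check the results, a quick snippet like this (a small sketch, assuming the chunks were written to the current directory) prints the page count of each generated file:

from PyPDF2 import PdfFileReader
import glob

# Every chunk should report 1 or 2 pages
for chunk in sorted(glob.glob("*-[0-9]*.pdf")):
    with open(chunk, "rb") as f:
        print(chunk, PdfFileReader(f).numPages)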

Can it be faster?

I found that whilst it worked pretty quickly on files up to 100
pages in size, I started to get a little impatient waiting for
it to complete multiple batches of larger files (thousands of pages).
Since I was in the zone, I figured I'd experiment with the multiprocessing
module and see how things went.

This is the modified code:

#!/usr/bin/env python3

from PyPDF2 import PdfFileWriter, PdfFileReader
from multiprocessing import Pool
import glob


def process_pdfs(pdf):
    print("Processing %s" % pdf)
    inputFile = PdfFileReader(open(pdf, "rb"))

    # Round up so the final page of an odd-length PDF isn't dropped
    for i in range((inputFile.numPages + 1) // 2):
        output = PdfFileWriter()
        output.addPage(inputFile.getPage(i * 2))

        # Only add a second page if one actually exists
        if i * 2 + 1 < inputFile.numPages:
            output.addPage(inputFile.getPage(i * 2 + 1))

        # Name each chunk after the source file, minus its ".pdf" extension
        newname = pdf[:-4] + "-" + str(i) + ".pdf"

        with open(newname, "wb") as outputStream:
            output.write(outputStream)


if __name__ == "__main__":
    pdfs = glob.glob("*.pdf")

    # Up to four PDFs are processed in parallel
    with Pool(processes=4) as p:
        p.map(process_pdfs, pdfs)

You can change the line p = Pool(processes=4) to match however many
processor cores your machine has available.
This change resulted in a huge improvement in file processing
speed, but it never managed to max out my system (an Intel i5), so I'm
guessing an SSD upgrade will be on the horizon soon, as my little
laptop HDD is causing a bottleneck!
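
If you would rather not hard-code the worker count, a small tweak (a sketch that reuses the process_pdfs function defined above) lets Python pick it up from the machine itself:

import glob
import os
from multiprocessing import Pool

if __name__ == "__main__":
    # os.cpu_count() reports the number of logical cores;
    # Pool() with no argument uses the same default
    workers = os.cpu_count() or 2
    with Pool(processes=workers) as p:
        p.map(process_pdfs, glob.glob("*.pdf"))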

What next?

Whilst I feel I'm finished with this code, I'm going to put it onto
my Raspberry Pi so that I can run it headless and have it ready
to process all my PDFs for me from a USB stick automatically.
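
A rough sketch of what that might look like, assuming a hypothetical USB mount point of /media/usb, a hypothetical working directory of /home/pi/split, and the process_pdfs function from above living in the same script:

import glob
import os
import shutil
import time

WATCH_DIR = "/media/usb"      # hypothetical USB mount point
WORK_DIR = "/home/pi/split"   # hypothetical local output directory

os.makedirs(WORK_DIR, exist_ok=True)   # make sure the output directory exists

while True:
    for pdf in glob.glob(os.path.join(WATCH_DIR, "*.pdf")):
        dest = os.path.join(WORK_DIR, os.path.basename(pdf))
        if not os.path.exists(dest):
            shutil.copy(pdf, dest)   # copy off the stick first
            process_pdfs(dest)       # split locally, away from the watched folder
    time.sleep(30)                   # poll for newly inserted files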