How to Split a PDF Every 2 Pages Using Python
So today I found myself in the unenviable position of having to scan
in hundreds of pages of archived, paper-based logs at work.
Of course my first thought was: how can I automate some of this?
Whilst keeping the photocopier company as I scanned the paperwork in,
I found a little treat on the Python Package Index called PyPDF2.
Arch: yaourt -S python-pypdf2
Debian: apt-get install python3-pypdf2
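If your distribution doesn't package it, the library also installs straight from PyPI:

```shell
pip install PyPDF2
```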
I then spent the evening poring over the documentation and figuring this
thing out. What I needed was a way to split the PDF files into pairs of
pages, allowing me to bulk scan and save time!
Here is the code:
#!/usr/bin/env python3
from PyPDF2 import PdfFileWriter, PdfFileReader
import glob
import os

pdfs = glob.glob("*.pdf")
for pdf in pdfs:
    inputFile = PdfFileReader(open(pdf, "rb"))
    # Round up so a trailing odd page still gets its own output file
    for i in range((inputFile.numPages + 1) // 2):
        output = PdfFileWriter()
        output.addPage(inputFile.getPage(i * 2))
        if i * 2 + 1 < inputFile.numPages:
            output.addPage(inputFile.getPage(i * 2 + 1))
        # Name each pair after the source file, e.g. scan.pdf -> scan-0.pdf
        newname = os.path.splitext(pdf)[0] + "-" + str(i) + ".pdf"
        with open(newname, "wb") as outputStream:
            output.write(outputStream)
This code will split every PDF file in the directory where it is run,
writing each pair of pages out as a new file.
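The pairing logic is easy to check on its own, without PyPDF2 or any real files. Here's a minimal sketch (the helper name `page_pairs` is just for illustration) of which 0-based page indices are meant to land in each output file, including the case where the page count is odd:

```python
def page_pairs(num_pages):
    """Group 0-based page indices into pairs; a trailing odd page stands alone."""
    pairs = []
    # Round up so an odd final page still produces an output file
    for i in range((num_pages + 1) // 2):
        pair = [i * 2]
        if i * 2 + 1 < num_pages:
            pair.append(i * 2 + 1)
        pairs.append(pair)
    return pairs

print(page_pairs(5))  # [[0, 1], [2, 3], [4]]
```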
Can it be faster?
I found that whilst it worked pretty quickly on files up to 100
pages in size, I started to get a little impatient waiting for
it to complete multiple batches of larger files (thousands of pages).
I felt that as I was in the zone I'd experiment with the multiprocessing
module and see how things went.
This is the modified code:
#!/usr/bin/env python3
from PyPDF2 import PdfFileWriter, PdfFileReader
from multiprocessing import Pool
import glob
import os

def process_pdfs(pdf):
    print("Processing %s" % pdf)
    inputFile = PdfFileReader(open(pdf, "rb"))
    # Round up so a trailing odd page still gets its own output file
    for i in range((inputFile.numPages + 1) // 2):
        output = PdfFileWriter()
        output.addPage(inputFile.getPage(i * 2))
        if i * 2 + 1 < inputFile.numPages:
            output.addPage(inputFile.getPage(i * 2 + 1))
        newname = os.path.splitext(pdf)[0] + "-" + str(i) + ".pdf"
        with open(newname, "wb") as outputStream:
            output.write(outputStream)

# The __main__ guard is required so worker processes don't re-run this block
if __name__ == "__main__":
    pdfs = glob.glob("*.pdf")
    p = Pool(processes=4)
    p.map(process_pdfs, pdfs)
You can change the line p = Pool(processes=4)
to match however many processor cores your machine has available.
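Rather than hard-coding the worker count, the standard library can report it for you. A small sketch (the `len` call is just a trivial stand-in task):

```python
from multiprocessing import Pool, cpu_count

if __name__ == "__main__":
    # One worker per available CPU core, cleaned up by the context manager
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(len, ["a", "bb", "ccc"])
        print(results)  # [1, 2, 3]
```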
This code change resulted in a huge improvement in file-processing
speed, but it never managed to max my system out (Intel i5), so I'm
guessing an SSD upgrade is on the horizon soon, as my little
laptop HDD is causing a bottleneck!
What next?
Whilst I feel I'm finished with this code, I'm going to put it onto
my Raspberry Pi so that I can run it headless, ready to process
all my PDFs from a USB drive automatically.