Written: 2023-06-26 21:34 +0000
Updated: 2024-08-06 00:53 +0000
Managing Scanned Books
This post is part of the Modern reading series, where I consider digitizing color content.
Background
It so happened that I recently acquired a rather unique book which (somewhat oddly) had no corresponding digital variant 1. I’ve discussed how I modify books and papers for consumption on my Kobo Aura HD in the past, but this comes before that. Much of this is essentially a plug for Scantailor Advanced 2.
Goals
Starting with the initial output pdf files from a scanner’s “scan to me” function 3, the goal is to have all the bells and whistles of a well-done digital book:
- Fixing orientations
- This is meant to deal with rotations mostly; skew is handled later
- Splitting pages
- Since scanning a book usually captures two pages at a time
- Handling skew
- Deskewing works to reorient pages for visibility
- Reworking content selections
- The automated selection worked well for this, and is especially useful if there are artifacts
- Adding back margins
- For padding post selection
- Output determination
- 600 DPI is generally enough, and “mixed” mode works great for images; the one thing to add is that “Higher search sensitivity” for picture shapes is pretty useful
These six steps can be carried out with Scantailor Advanced, which has a very intuitive GUI for them, but there are two more things:
- Optical Character Recognition (OCR)
- A proper Table of Contents
We will also have to first get high-resolution images out of the pdf files.
For starters our project might look like this:
❯ ls
appendix.pdf     chap6.pdf  images
chap2_chap3.pdf  chap7.pdf  toc_chap1_intro.pdf
chap4.pdf        chap8.pdf
chap5.pdf
From pdf to tiff
Without any bells and whistles, a simple bash script suffices to extract high-resolution .tiff files from .pdf inputs. The only caveat is the need to carefully format the page names so they can be joined together easily. This led to:
export INP_SCAN_NAME=toc_chap1_intro
# The first page number (output file name) for the current PDF
output_index=1
# -r 600 keeps the rasterization at the 600 DPI targeted later
pdftoppm -r 600 -tiff "${INP_SCAN_NAME}.pdf" temp
i=0
for file in temp*.tif; do
  i=$((i+1))
  # Offset each page name by the running index so chapters join cleanly
  new_index=$(printf "%03d" $((output_index + i - 1)))
  new_file="page_${new_index}.tiff"
  mv "$file" "images/$new_file"
done
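Running this once per scan means manually bumping output_index for each file. A minimal wrapper could carry the counter across PDFs instead; a sketch, assuming the file order from the listing above:

mkdir -p images
output_index=1
for pdf in toc_chap1_intro.pdf chap2_chap3.pdf chap4.pdf chap5.pdf \
    chap6.pdf chap7.pdf chap8.pdf appendix.pdf; do
  pdftoppm -r 600 -tiff "$pdf" temp
  for file in temp*.tif; do
    # Carry the running page counter across PDFs so page_NNN stays contiguous
    mv "$file" "images/page_$(printf "%03d" "$output_index").tiff"
    output_index=$((output_index + 1))
  done
done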
At this point we can plug the images directly into ScanTailor Advanced and work through the GUI.
OCR Setup
Working with the TIFF images that ScanTailor Advanced outputs is a good bet. System packages needed are:
trizen -S tesseract tesseract-data-eng \
    tesseract-data-isl \
    ghostscript unpaper pngquant \
    img2pdf jbig2enc pdftk
Subsequently I ended up with a throwaway conda environment for using OCRmyPDF (like with all python tools).
micromamba create -p $(pwd)/.ocrtmp pip ocrmypdf
micromamba activate $(pwd)/.ocrtmp
Since ocrmypdf works on single PDF files or alternatively single images, one brute-force approach is simply:
cd out # where scantailor puts things
mkdir temp
for file in page*.tif; do
  base=$(basename "$file" .tif)
  ocrmypdf -l eng --output-type pdf "$file" "temp/${base}.pdf"
done
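The loop above is serial, but each page is independent, so it parallelizes trivially. A sketch, assuming GNU xargs and enough RAM for one tesseract run per core:

# Fan pages out to one ocrmypdf process per core
printf '%s\n' page*.tif | xargs -P "$(nproc)" -I {} sh -c \
  'ocrmypdf -l eng --output-type pdf "$1" "temp/$(basename "$1" .tif).pdf"' _ {}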
Which leads to a series of .pdf files.
Concatenation and Compression
Ghostscript is the best way to get a single, reasonably compressed output from these files 4:
export OUTPUT_PDF_DPI=500 # good enough for me
# Or 600 for the Scantailor default
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
    -dCompatibilityLevel=1.7 -dPDFSETTINGS=/ebook \
    -dEmbedAllFonts=true -dSubsetFonts=true \
    -dColorImageDownsampleType=/Bicubic \
    -dColorImageResolution=$OUTPUT_PDF_DPI \
    -dGrayImageDownsampleType=/Bicubic \
    -dGrayImageResolution=$OUTPUT_PDF_DPI \
    -dMonoImageDownsampleType=/Bicubic \
    -dMonoImageResolution=$OUTPUT_PDF_DPI \
    -sOutputFile="output_${OUTPUT_PDF_DPI}.pdf" \
    temp/*.pdf
The compatibility level is set to allow for linearization 5 (which needs PDF 1.7). Sometimes this still leads to files which are too large (see the table below).
| DPI | Size (MB) |
|-----|-----------|
| 300 | 139.2 |
| 600 | 303.5 |
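Before reaching for heavier tools it is worth confirming what actually got embedded; pdfimages (shipped with poppler, alongside pdftoppm) lists each image with its effective resolution:

# Inspect the embedded images and their effective DPI
pdfimages -list "output_${OUTPUT_PDF_DPI}.pdf" | head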
When the files are still too large, pdfsizeopt (from here) can be set up in the environment with:
mkdir -p $CONDA_PREFIX/tmp/pdfsizeopt
cd $CONDA_PREFIX/tmp/pdfsizeopt
wget -O pdfsizeopt_libexec_linux.tar.gz https://github.com/pts/pdfsizeopt/releases/download/2023-04-18/pdfsizeopt_libexec_linux-v9.tar.gz
tar xzvf pdfsizeopt_libexec_linux.tar.gz
rm -f pdfsizeopt_libexec_linux.tar.gz
wget -O pdfsizeopt.single https://raw.githubusercontent.com/pts/pdfsizeopt/master/pdfsizeopt.single
chmod +x pdfsizeopt.single
mv * $CONDA_PREFIX/bin && cd $CONDA_PREFIX/bin
ln -s pdfsizeopt.single $CONDA_PREFIX/bin/pdfsizeopt
rm -rf $CONDA_PREFIX/tmp/*
Which can then be used as:
pdfsizeopt output_600.pdf
The compression takes a while 6, but the results are completely lossless and the reduction was pretty neat: from 390 MB to 250 MB. Still, it took forever, while Adobe Acrobat reduced a 390 MB file to 108 MB in minutes.
For handling the cover image I ended up with:
# from the ScanTailor 600 DPI image
# 39.3 MB
tiffcp -c zip:p9 cpage-1.tif cpage.tif # 19.7 MB
convert cpage.tif -compress JPEG -quality 90% cpage.jpg # 2.0 MB
Then I dumped it through a web tool 7 (around 560 KB) before finishing up with ocrmypdf for the final pdf of 535 KB.
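To actually attach the cover, pdftk (already installed for the TOC work) can prepend it; a sketch, where cover.pdf and the output name are placeholders rather than my actual file names:

# Prepend the compressed cover page to the book body
pdftk cover.pdf output_600.pdf cat output book_with_cover.pdf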
TOC Generation
Sometimes I feel too lazy and end up doing my table of contents editing in Master PDF Editor 8. That being said, my favored way to get a fully-fledged TOC with chapters and sub-headings is with the venerable PDFtk:
cat toc.txt
BookmarkBegin
BookmarkTitle: Chapter 1
BookmarkLevel: 1
BookmarkPageNumber: 5

BookmarkBegin
BookmarkTitle: Section 1.1
BookmarkLevel: 2
BookmarkPageNumber: 6

BookmarkBegin
BookmarkTitle: Section 1.2
BookmarkLevel: 2
BookmarkPageNumber: 10

BookmarkBegin
BookmarkTitle: Chapter 2
BookmarkLevel: 1
BookmarkPageNumber: 15
...
This can be applied with:
pdftk input.pdf update_info toc.txt output output_with_toc.pdf
Sometimes I end up padding the pdf with a few extra pages to ensure the printed page numbers match the ones in the TOC.
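One hedged way to do that padding is to generate a blank page with Ghostscript and splice it in with pdftk; a sketch, where the page size (US letter), split point, and file names are all assumptions:

# Make one blank page, then insert it after page 2
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=blank.pdf \
  -c "<</PageSize [612 792]>> setpagedevice showpage"
pdftk A=output_with_toc.pdf B=blank.pdf cat A1-2 B A3-end output padded.pdf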
Linearization
Since we targeted PDF 1.7, which supports linearization 9, we need qpdf:
micromamba install qpdf
qpdf --linearize output_300.pdf olin300.pdf
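qpdf can also verify its own work:

# Confirm the output really is linearized
qpdf --check-linearization olin300.pdf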
This has seemingly no effect on size, so it can be the last step.
Conclusions
This isn’t my first time working with a scanned book. Felt like a lot more steps
than before though. Some more bash magic could have automated extracting the
images from scanner generated pdf
files, and perhaps some other minor
tweaks 10 or other tools 11, though for my purposes, this
worked well enough. I was a little disappointed to not be able to hit optimal
file sizes without web tools, but given the amount of time expended, it was
probably to be expected.
A few papercuts steeled my resolve to get this onto my devices ↩︎
A more maintained fork of the old and venerable Scantailor which now weirdly has a fan-page with dubious plugs to something called Truthfinder… ↩︎
Scanning the book in one go led to overly large attachments for my email provider. ↩︎
If compression doesn’t matter, using pdftk (needed for the TOC) would be fine too. ↩︎
Swankier explanation here ↩︎
The TOC work can go on simultaneously, and compression can take days ↩︎
When file-sizes are really a bottleneck there’s also the sister pdfcompressor site, or for really small files there’s the Adobe Acrobat compressor ↩︎
They used to have a really nice trial of version 4 which did everything I needed. ↩︎
Some of my devices use web browsers for rendering pdf files. ↩︎
ScanTailor could be run on an HPC most probably, as could the ocrmypdf step; the pdf compression definitely should (took around three days on my ThinkPad X1 Carbon 6th Gen) ↩︎
pdf.tocgen is pretty cool ↩︎
Series info
Modern reading series
- My Life in E-ink
- Managing Scanned Books <-- You are here!
- Managing cloud based calibre libraries