Where I consider digitizing colored content.

Background

It so happened that I recently acquired a rather unique book which (somewhat oddly) had no corresponding digital variant 1. I’ve discussed how I modify books and papers for consumption on my Kobo Aura HD in the past, but this step comes before that. Much of this is essentially a plug for ScanTailor Advanced 2.

Goals

Starting with the initial output pdf files from a scanner’s “scan to me” function 3, the goal is to have all the bells and whistles of a well-done digital book:

Fixing orientations
This mostly deals with rotations; skew is handled later
Splitting pages
Since most book scans capture two pages at a time
Handling skew
Deskewing reorients pages for readability
Reworking content selections
The automated selection worked well for this, and is especially useful if there are artifacts
Adding back margins
For padding after content selection
Output determination
600 DPI is generally enough, and “mixed” mode works great for images; the one thing to add is that “Higher search sensitivity” for picture shapes is pretty useful

These six steps can be carried out with ScanTailor Advanced, which has a very intuitive GUI for them, but there are two more things:

  • Optical Character Recognition (OCR)
  • A proper Table of Contents

We will also first have to get high-resolution images out of the pdf files. For starters, our project might look like this:

❯ ls
appendix.pdf     chap6.pdf  images
chap2_chap3.pdf  chap7.pdf  toc_chap1_intro.pdf
chap4.pdf        chap8.pdf
chap5.pdf

From pdf to tiff

Without any bells and whistles, a simple bash script will suffice to extract high resolution .tiff files from .pdf inputs. The only caveat is the need to carefully format the page names so they can be joined together easily. This led to:

export INP_SCAN_NAME=toc_chap1_intro
# The first page number (output file name) for the current PDF
output_index=1
# Rasterize at 600 DPI; pdftoppm's default of 150 DPI is too low here
pdftoppm -r 600 -tiff "${INP_SCAN_NAME}.pdf" temp
i=0
for file in temp*.tif; do
  i=$((i+1))
  new_index=$(printf "%03d" $((output_index + i - 1)))
  new_file="page_${new_index}.tiff"
  mv "$file" "images/$new_file"
done
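
Running this once per input and bumping output_index by hand works, but the whole set can also be batched. A minimal sketch, assuming the reading order of the files from the ls listing above:

# Convert every scan in reading order, carrying the page counter
# across PDFs so the output names stay contiguous.
output_index=1
for scan in toc_chap1_intro chap2_chap3 chap4 chap5 chap6 chap7 chap8 appendix; do
  pdftoppm -r 600 -tiff "${scan}.pdf" temp
  for file in temp*.tif; do
    mv "$file" "images/$(printf 'page_%03d.tiff' "$output_index")"
    output_index=$((output_index + 1))
  done
done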

At this point we can plug the images directly into ScanTailor Advanced and work through the GUI.

OCR Setup

Working with the TIFF images which form the output of ScanTailor Advanced is a good bet. The system packages needed are:

trizen -S tesseract tesseract-data-eng \
    tesseract-data-isl \
    ghostscript unpaper pngquant \
    img2pdf jbig2enc pdftk

Subsequently, I ended up with a throwaway conda environment for OCRmyPDF (as with all Python tools).

micromamba create -p $(pwd)/.ocrtmp pip ocrmypdf
micromamba activate $(pwd)/.ocrtmp

Since ocrmypdf works on single PDF files (or single images), one brute-force approach is simply:

cd out # where scantailor puts things
mkdir temp
for file in page*.tif; do
  base=$(basename "$file" .tif)
  ocrmypdf -l eng --output-type pdf "$file" "temp/${base}.pdf"
done

This leaves us with a series of per-page .pdf files.
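
The loop above is serial, and each page is independent, so it parallelizes trivially. A sketch with xargs; the job count of 4 is an assumption (ocrmypdf also has its own --jobs flag for parallelism within a single file):

# OCR up to four pages at once; adjust -P to the available cores
ls page*.tif | xargs -P 4 -I {} bash -c \
  'ocrmypdf -l eng --output-type pdf "$1" "temp/$(basename "$1" .tif).pdf"' _ {}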

Concatenation and Compression

Ghostscript is the best way to get a single reasonably compressed output from these files 4:

export OUTPUT_PDF_DPI=500 # good enough for me
# Or 600 for the ScanTailor default
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
    -dCompatibilityLevel=1.7 -dPDFSETTINGS=/ebook \
    -dEmbedAllFonts=true -dSubsetFonts=true \
    -dColorImageDownsampleType=/Bicubic \
    -dColorImageResolution=$OUTPUT_PDF_DPI \
    -dGrayImageDownsampleType=/Bicubic \
    -dGrayImageResolution=$OUTPUT_PDF_DPI \
    -dMonoImageDownsampleType=/Bicubic \
    -dMonoImageResolution=$OUTPUT_PDF_DPI \
    -sOutputFile="output_${OUTPUT_PDF_DPI}.pdf" \
    temp/*.pdf

Where the compatibility level is there to allow for linearization 5 (which needs PDF 1.7). Sometimes this still leads to files which are too large (Table 1).

Table 1: For a color book

DPI    Size (MB)
300    139.2
600    303.5

In such cases, pdfsizeopt (from here) can be set up in the environment with:

mkdir -p $CONDA_PREFIX/tmp/pdfsizeopt
cd $CONDA_PREFIX/tmp/pdfsizeopt
wget -O pdfsizeopt_libexec_linux.tar.gz https://github.com/pts/pdfsizeopt/releases/download/2023-04-18/pdfsizeopt_libexec_linux-v9.tar.gz
tar xzvf pdfsizeopt_libexec_linux.tar.gz
rm -f pdfsizeopt_libexec_linux.tar.gz
wget -O pdfsizeopt.single https://raw.githubusercontent.com/pts/pdfsizeopt/master/pdfsizeopt.single
chmod +x pdfsizeopt.single
mv * $CONDA_PREFIX/bin && cd $CONDA_PREFIX/bin
ln -s pdfsizeopt.single $CONDA_PREFIX/bin/pdfsizeopt
rm -rf $CONDA_PREFIX/tmp/*

Which can then be used as:

pdfsizeopt output_600.pdf

The compression takes a while 6, but it is completely lossless, and the reduction was pretty neat: from 390 MB to 250 MB. It did take forever though; for comparison, Adobe Acrobat reduced the same 390 MB file to 108 MB (lossily) in minutes.
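
When the multi-day runtime is the dealbreaker, pdfsizeopt’s slowest optimizer, pngout, can be disabled via its documented flag, trading a little of the size reduction for a much faster run; a sketch (the output name is a placeholder):

# Skip pngout, the slowest step, for a faster but slightly larger result
pdfsizeopt --use-pngout=no output_600.pdf output_600_fast.pdf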

Figure 1: Lossless compression (250 MB) on the left, lossy Acrobat compression (108 MB) on the right

For handling the cover image I ended up with:

# from the ScanTailor 600 DPI image (39.3 MB)
tiffcp -c zip:p9 cpage-1.tif cpage.tif # 19.7 MB
convert cpage.tif -compress JPEG -quality 90 cpage.jpg # 2.0 MB

Then I ran it through a web tool 7 (down to around 560 KB) before finishing up with ocrmypdf for a final cover pdf of 535 KB.
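
For completeness, a sketch of folding the cover back in; the file names are placeholders. ocrmypdf accepts a bare image if told its DPI, and pdftk can then staple the cover page onto the front:

# cpage_small.jpg: the ~560 KB JPEG from the web tool
ocrmypdf -l eng --image-dpi 600 --output-type pdf cpage_small.jpg cover.pdf
pdftk cover.pdf output_600.pdf cat output book_with_cover.pdf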

TOC Generation

Sometimes I feel too lazy and end up doing my table of contents editing in Master PDF Editor 8. That said, my favorite route to a fully fledged TOC with chapters and sub-headings is the venerable PDFtk:

❯ cat toc.txt
BookmarkBegin
BookmarkTitle: Chapter 1
BookmarkLevel: 1
BookmarkPageNumber: 5

BookmarkBegin
BookmarkTitle: Section 1.1
BookmarkLevel: 2
BookmarkPageNumber: 6

BookmarkBegin
BookmarkTitle: Section 1.2
BookmarkLevel: 2
BookmarkPageNumber: 10

BookmarkBegin
BookmarkTitle: Chapter 2
BookmarkLevel: 1
BookmarkPageNumber: 15
...

This can be applied with:

pdftk input.pdf update_info toc.txt output output_with_toc.pdf
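
If the scanned pdf already carries metadata worth keeping (title, author, and so on), a safer variant is to dump the existing info first and append the bookmarks to it; a hedged sketch:

# Dump current metadata, append our bookmarks, then write it all back
pdftk input.pdf dump_data output meta.txt
cat toc.txt >> meta.txt
pdftk input.pdf update_info meta.txt output output_with_toc.pdf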

Sometimes I end up padding the pdf with a few extra pages so that the printed page numbers match the ones in the TOC.
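
A minimal sketch of that padding, assuming two blank US-Letter pages up front (the page size and count are placeholders; match them to the book):

# A blank page via ImageMagick, prepended twice with pdftk
convert -page Letter xc:white blank.pdf
pdftk B=blank.pdf A=output_with_toc.pdf cat B B A output padded.pdf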

Linearization

Since we targeted PDF 1.7 for its linearization support 9, all that remains is qpdf:

micromamba install qpdf
qpdf --linearize output_300.pdf olin300.pdf

This has seemingly no effect on the size, so it can safely be the last step.
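
As a sanity check, qpdf can also verify that the output really is linearized:

qpdf --check-linearization olin300.pdf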

Conclusions

This isn’t my first time working with a scanned book, but it felt like a lot more steps than before. Some more bash magic could have automated extracting the images from the scanner-generated pdf files, and perhaps some other minor tweaks 10 or other tools 11 would have helped, though for my purposes this worked well enough. I was a little disappointed not to hit optimal file sizes without web tools, but given the amount of time expended, that was probably to be expected.


  1. A few papercuts steeled my resolve to get this onto my devices ↩︎

  2. A more maintained fork of the old and venerable Scantailor which now weirdly has a fan-page with dubious plugs to something called Truthfinder… ↩︎

  3. Scanning the book in one go led to overly large attachments for my email provider. ↩︎

  4. If compression doesn’t matter using pdftk (needed for the TOC) would be fine too. ↩︎

  5. Swankier explanation here ↩︎

  6. The TOC work can go on simultaneously, and compression can take days ↩︎

  7. When file-sizes are really a bottle neck there’s also the sister pdfcompressor site, or for really small files there’s the Adobe Acrobat compressor ↩︎

  8. They used to have a really nice trial of version 4 which did everything I needed. ↩︎

  9. Some of my devices use web browsers for rendering pdf files. ↩︎

  10. ScanTailor could most probably be run on an HPC, as could the ocrmypdf step; the pdf compression definitely should be (it took around three days on my ThinkPad X1 Carbon 6th Gen) ↩︎

  11. pdf.tocgen is pretty cool ↩︎