Off-the-shelf OCR and Deep Learning

In general, detexify and my ScanTools workflow are great. However, sometimes more can be done with off-the-shelf deep learning tools.

micromamba create -n textocr
micromamba activate textocr
micromamba install torchvision -c pytorch
pip install pix2tex[gui]
pip install python-doctr
pip install nougat-ocr

The TeX tool (LaTeX OCR, installed as pix2tex) works great even via the terminal. The doctr library is a bit more finicky, but can be a decent way to extract plain text when regular OCR tools (e.g. ocrmypdf) fail.

import argparse
from pathlib import Path
from doctr.io import DocumentFile
from doctr.models import ocr_predictor


def process_pdf(pdf_path):
    """Run text detection and recognition on every page and return plain text."""
    model = ocr_predictor(
        det_arch="db_resnet50", reco_arch="crnn_vgg16_bn", pretrained=True
    )
    doc = DocumentFile.from_pdf(pdf_path)
    result = model(doc)
    return result.render()


def save_to_text(input_path, output_text):
    """Write the text next to the input file, swapping the suffix for .txt."""
    output_file = Path(input_path).with_suffix(".txt")
    output_file.write_text(output_text, encoding="utf-8")


def main():
    parser = argparse.ArgumentParser(description="OCR processing of a PDF file")
    parser.add_argument("pdf_file", help="Path to the PDF file to be processed")
    args = parser.parse_args()
    ocr_result = process_pdf(args.pdf_file)
    save_to_text(args.pdf_file, ocr_result)


if __name__ == "__main__":
    main()

This works alright when called via python doctr_runner.py blah.pdf, and writes the recognized text to blah.txt next to the PDF.
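Since doctr can be finicky, it sometimes helps to drop words the model was unsure about instead of taking the rendered text as is. The following is only a sketch: it assumes the dictionary returned by result.export() nests pages, blocks, lines and words, with each word carrying a "value" and a "confidence". The function name words_to_text, the min_confidence threshold and the sample data are made up for illustration.

```python
def words_to_text(exported, min_confidence=0.5):
    """Rebuild plain text from a doctr-style export dict, skipping shaky words."""
    pages = []
    for page in exported.get("pages", []):
        lines = []
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                # Keep only words at or above the confidence threshold.
                kept = [
                    word["value"]
                    for word in line.get("words", [])
                    if word.get("confidence", 0.0) >= min_confidence
                ]
                if kept:
                    lines.append(" ".join(kept))
        pages.append("\n".join(lines))
    # Separate pages with a blank line.
    return "\n\n".join(pages)


# Hand-written sample in the assumed layout; real input would be result.export().
sample = {
    "pages": [
        {
            "blocks": [
                {
                    "lines": [
                        {
                            "words": [
                                {"value": "Hello", "confidence": 0.99},
                                {"value": "w0rld", "confidence": 0.20},
                            ]
                        },
                        {"words": [{"value": "again", "confidence": 0.90}]},
                    ]
                }
            ]
        }
    ]
}

print(words_to_text(sample))
```

In the script above this would replace the result.render() call with words_to_text(result.export()). The threshold of 0.5 is arbitrary and worth tuning per document.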

Or Meta’s nougat, which is slower but generally better formatted:

nougat blah.pdf -o output_dir
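To call nougat from Python in the same way, a thin wrapper around the command line is enough. This is a rough sketch rather than nougat's own API: the helper names are made up, and it assumes the result lands in the output directory as a .mmd file named after the PDF.

```python
import subprocess
from pathlib import Path


def nougat_command(pdf_path, output_dir):
    """Build the same command line as the one shown above."""
    return ["nougat", str(pdf_path), "-o", str(output_dir)]


def expected_output(pdf_path, output_dir):
    """Assumed output location: <output_dir>/<pdf stem>.mmd."""
    return Path(output_dir) / (Path(pdf_path).stem + ".mmd")


def run_nougat(pdf_path, output_dir):
    """Run nougat and return the generated file's path, or None if it is missing."""
    subprocess.run(nougat_command(pdf_path, output_dir), check=True)
    out = expected_output(pdf_path, output_dir)
    return out if out.exists() else None
```

Calling run_nougat("blah.pdf", "output_dir") needs the nougat executable on the PATH of the active environment, and check=True raises if nougat exits with an error.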

As of 15-11-2023, both these options have a known warning about a memory leak.
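
If the leak matters for a large batch, one generic workaround is to run each PDF in a fresh process, so whatever memory was held is returned when that process exits. This is a sketch of that idea only; whether it helps depends on where the leak actually is. The name run_isolated is made up, and the usage comment assumes the doctr script above was saved as doctr_runner.py.

```python
import subprocess
import sys


def run_isolated(pdf_paths, command):
    """Run `command + [pdf]` once per file, each in its own process.

    Returns a dict mapping each path to the exit code of its process.
    """
    results = {}
    for pdf in pdf_paths:
        completed = subprocess.run(list(command) + [str(pdf)])
        results[str(pdf)] = completed.returncode
    return results


# Example (assumes doctr_runner.py is in the current directory):
# codes = run_isolated(["a.pdf", "b.pdf"], [sys.executable, "doctr_runner.py"])
# failed = [path for path, code in codes.items() if code != 0]
```

The cost is that the model is reloaded for every file, which is slow but keeps memory use per process bounded.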