Off-the-shelf OCR and Deep Learning
In general, detexify (here) and my ScanTools workflow are great. However, sometimes more can be done.
micromamba create -n textocr
micromamba activate textocr
micromamba install torchvision -c pytorch
pip install pix2tex[gui]
pip install python-doctr
pip install nougat-ocr
The pix2tex tool (LaTeX OCR) works great even via the terminal. The doctr library is a bit more finicky, but can be a decent way to extract plain text when regular OCR tools (e.g. ocrmypdf) fail.
import argparse
from pathlib import Path
from doctr.io import DocumentFile
from doctr.models import ocr_predictor


def process_pdf(pdf_path):
    model = ocr_predictor(
        det_arch="db_resnet50", reco_arch="crnn_vgg16_bn", pretrained=True
    )
    doc = DocumentFile.from_pdf(pdf_path)
    result = model(doc)
    return result.render()


def save_to_text(input_path, output_text):
    output_file = Path(input_path).with_suffix(".txt")
    output_file.write_text(output_text, encoding="utf-8")


def main():
    parser = argparse.ArgumentParser(description="OCR processing of a PDF file")
    parser.add_argument("pdf_file", help="Path to the PDF file to be processed")
    args = parser.parse_args()
    ocr_result = process_pdf(args.pdf_file)
    save_to_text(args.pdf_file, ocr_result)


if __name__ == "__main__":
    main()
This is alright when called via python doctr_runner.py blah.pdf, which writes the text to blah.txt alongside the input.
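To run the same kind of OCR over a whole folder of scans, a small wrapper along these lines could work. It is only a sketch: ocr_directory and its ocr_fn parameter are hypothetical names introduced here, and ocr_fn is assumed to be something like process_pdf from the script above, i.e. a callable that takes a PDF path and returns text.

```python
from pathlib import Path
from typing import Callable, List


def ocr_directory(directory: str, ocr_fn: Callable[[str], str]) -> List[Path]:
    """Run ocr_fn on every PDF in directory and write a .txt file next to each.

    ocr_fn is a hypothetical hook: any callable that takes a PDF path and
    returns text, for example process_pdf from doctr_runner.py.
    """
    written = []
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        output_file = pdf_path.with_suffix(".txt")
        if output_file.exists():
            # Skip PDFs that already have a text file, so reruns stay cheap.
            continue
        output_file.write_text(ocr_fn(str(pdf_path)), encoding="utf-8")
        written.append(output_file)
    return written
```

Passing the OCR function in as an argument keeps the wrapper independent of doctr, so a different backend can be swapped in without touching the loop.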
Another option is Meta’s nougat, which is slower but generally produces better-formatted output:
nougat blah.pdf -o output_dir
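The same call can be scripted from Python. The sketch below only wraps the command line shown above with subprocess; the helper names are hypothetical, and the assumption that nougat writes a file named after the PDF stem with a .mmd suffix into the output directory should be checked against the installed version.

```python
import subprocess
from pathlib import Path
from typing import List


def build_nougat_command(pdf_path: str, output_dir: str) -> List[str]:
    """Build the same command as above: nougat <pdf> -o <output_dir>."""
    return ["nougat", str(pdf_path), "-o", str(output_dir)]


def expected_output(pdf_path: str, output_dir: str) -> Path:
    # Assumption: nougat names its output after the PDF stem, with a .mmd suffix.
    return Path(output_dir) / (Path(pdf_path).stem + ".mmd")


def run_nougat(pdf_path: str, output_dir: str) -> Path:
    """Run nougat on one PDF and return the path where output is expected."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if nougat exits with an error.
    subprocess.run(build_nougat_command(pdf_path, output_dir), check=True)
    return expected_output(pdf_path, output_dir)
```

Building the command as a list avoids shell quoting problems with file names that contain spaces.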
As of 15-11-2023, both of these options have a known warning about a memory leak.