OCR on Ubuntu: tesseract and ocrmypdf


by state 2025. 4. 30. 21:39

1. tesseract

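tesseract is the most widely used open-source OCR engine, and it supports Korean.

One of its advantages is that it can be used from the CLI or through a variety of APIs.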

sudo apt-get update
sudo apt install tesseract-ocr
sudo apt install tesseract-ocr-kor
sudo apt install tesseract-ocr-eng

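After installing, run it from the CLI:

tesseract input.jpg output -l kor

-l kor recognizes Korean text / -l eng recognizes English text. The second argument is an output base name: tesseract appends .txt itself, so this command writes output.txt (passing output.txt would produce output.txt.txt).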
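For pages that mix Korean and English, tesseract accepts several language codes joined with "+" in the -l option. Below is a minimal sketch of that; join_langs and ocr_image are hypothetical helper names I made up for illustration, and the example assumes both language packs from the install step are present.

```shell
# Join language codes with '+', the separator tesseract's -l option accepts.
join_langs() {
    local IFS='+'
    echo "$*"
}

# Hypothetical helper: OCR one image with the given languages.
# The output base is the image name without its extension;
# tesseract appends .txt to it.
ocr_image() {
    local image="$1"
    shift
    tesseract "$image" "${image%.*}" -l "$(join_langs "$@")"
}

# Example (needs tesseract plus the kor and eng language packs):
# ocr_image input.jpg kor eng    # writes input.txt
```

The order of the codes matters somewhat, so it is probably best to put the dominant language of the document first.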

However, what I wanted was to extract text from a PDF, and tesseract does not accept PDF files as input.
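One possible workaround, if you want to stay with plain tesseract, is to render the PDF pages to images first and OCR those. The sketch below assumes pdftoppm from the poppler-utils package is installed (it is not part of the setup above), and pdf_to_text is a hypothetical function name.

```shell
# Hypothetical helper: render each PDF page to PNG, then OCR every page.
# Assumes pdftoppm (poppler-utils) and tesseract are installed.
pdf_to_text() {
    local pdf="$1"
    local lang="${2:-kor}"
    [ -f "$pdf" ] || { echo "no such file: $pdf" >&2; return 1; }
    local base="${pdf%.pdf}"
    # Render pages at 300 dpi as base-1.png, base-2.png, ...
    # (page numbers are zero-padded for longer documents).
    pdftoppm -r 300 -png "$pdf" "$base" || return 1
    local page
    for page in "$base"-*.png; do
        # Writes one .txt file per page next to the PNG.
        tesseract "$page" "${page%.png}" -l "$lang"
    done
}

# Example:
# pdf_to_text input.pdf kor
```

This leaves one text file per page and no searchable PDF, which is part of why a dedicated tool is more convenient.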

2. ocrmypdf

sudo apt-get install ocrmypdf
pip install -U pymupdf

ocrmypdf is also built on tesseract, so the tesseract language pack for any language you need must be installed beforehand.
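Before running ocrmypdf it can help to confirm that the language pack is really there. tesseract --list-langs prints a header line followed by one language code per line; the sketch below checks that list for a code. lang_in_list is a hypothetical helper name, and the install command in the example assumes the Ubuntu package naming used earlier.

```shell
# Return success if language code $1 appears as a whole line in the list $2.
lang_in_list() {
    printf '%s\n' "$2" | grep -qxF "$1"
}

# Example (2>&1 because some older tesseract versions print the list to stderr):
# langs="$(tesseract --list-langs 2>&1)"
# lang_in_list kor "$langs" || sudo apt install tesseract-ocr-kor
```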

To run it:

ocrmypdf -l eng --deskew --force-ocr input.pdf output.pdf

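--deskew : straightens pages that were scanned at an angle

--force-ocr : rasterizes every page and runs OCR on it, even if the page already contains text (without it, ocrmypdf stops with an error on such pages)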
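Since the goal here is to get the text out, ocrmypdf's --sidecar option may be useful: it writes the recognized text to a plain text file in addition to the output PDF. A rough sketch, where ocr_pdf_with_text and the _ocr output names are my own hypothetical choices:

```shell
# Hypothetical helper: OCR a PDF and also save the recognized text.
ocr_pdf_with_text() {
    local pdf="$1"
    local lang="${2:-eng}"
    [ -f "$pdf" ] || { echo "no such file: $pdf" >&2; return 1; }
    local base="${pdf%.pdf}"
    # --sidecar writes the OCR text to a separate file
    # alongside the searchable output PDF.
    ocrmypdf -l "$lang" --deskew --force-ocr \
        --sidecar "${base}_ocr.txt" "$pdf" "${base}_ocr.pdf"
}

# Example (kor+eng needs both tesseract language packs):
# ocr_pdf_with_text input.pdf kor+eng
```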

(Work in progress)
