PDF OCR

Convert scanned PDFs to searchable and selectable text using OCR. Free, no signup.

.pdf · up to 2 GB

Free · No signup · No watermark · OCR included

What OCR is used for

OCR PDF: make any scanned document searchable

Searchable documents

Convert scanned files into PDFs where you can search words, select text, and copy excerpts.

Historical archives

Digitize historical documents, paper files, and physical contract archives, and make them accessible.

Accessibility

Documents with an OCR text layer can be read by screen readers, which helps them meet digital accessibility requirements.

Multi-language

Support for over 100 languages including English, Spanish, Arabic, Chinese, Russian, and more with Tesseract 5.

How it works

Three steps, no hassle

Upload your scanned PDF

Drag or select the scanned PDF. OCR works on image-only PDFs: physically scanned documents, photographs of documents, and digitized faxes.

OCR recognition

The OCR engine analyzes each page as an image, identifies characters, and generates an invisible text layer overlaid on the original document image.

Download the searchable PDF

The resulting PDF looks identical to the original, but you can now search, select, and copy its text, and the content is accessible to indexers and screen readers.

FAQ

Got questions?

What is OCR and how does it work?

OCR (Optical Character Recognition) is the technology that converts images of text into machine-encoded text. The process has three main stages: image preprocessing (skew correction, noise removal, binarization), segmentation (identifying text lines, words, and individual characters), and recognition (comparing each character against reference models to determine the most likely character). Modern OCR engines based on LSTM (Long Short-Term Memory) recurrent neural networks surpass classic template-based methods in accuracy, especially on documents with irregular typefaces or with tilted or degraded text.
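As an illustration of the binarization step mentioned above, here is a minimal sketch of Otsu's method, one common way to pick a black/white threshold automatically. It is not necessarily what any particular OCR engine does internally; the function names are made up for this example, and the input is assumed to be a flat list of 8-bit grayscale values.

```python
def otsu_threshold(pixels):
    """Pick the threshold that maximizes between-class variance (Otsu's method).

    `pixels` is assumed to be a flat sequence of integers in the range 0..255.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of their intensities
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue          # no dark pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break             # every pixel is already in the dark class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map pixels at or below the threshold to black (0), the rest to white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Toy "page": half dark ink pixels, half light paper pixels.
page = [30] * 50 + [220] * 50
threshold = otsu_threshold(page)
clean = binarize(page, threshold)
```

On a real scan the histogram is noisier, so preprocessing usually combines a step like this with deskewing and noise removal.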

What accuracy does OCR achieve on English documents?

Tesseract 5 is the most widely used open-source OCR engine. It was originally developed at HP starting in the 1980s, released as open source in 2005, and later sponsored by Google; it is published under the Apache 2.0 license. Its LSTM-based recognizer was introduced in version 4, and version 5.0 was released in November 2021. On printed English documents scanned at 300 DPI with good quality, it can achieve accuracy rates of 98–99%. Accuracy is highest with standard typefaces (Times New Roman, Arial, Calibri) and lower with decorative typefaces, very small text (under 8 points), or documents degraded by age.
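The page does not say how the accuracy figure is measured. One common convention, assumed here, is character accuracy: one minus the edit distance between the OCR output and a hand-checked reference, divided by the reference length. A minimal sketch, with illustrative function names:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference, ocr_output):
    """Share of reference characters the OCR got right, from 0.0 to 1.0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


reference = "The quick brown fox jumps over the lazy dog."
ocr_output = "The quick brovvn fox jumps over the lazy dog."  # "w" misread as "vv"
accuracy = character_accuracy(reference, ocr_output)
```

Under this convention, 98–99% accuracy means roughly one to two wrong characters per hundred, which on a full page is still several dozen errors.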

What is the difference between OCR to searchable PDF and OCR to text?

OCR to searchable PDF (also called PDF with OCR layer or text-embedded PDF) maintains the original document image and adds an invisible text layer that makes the document searchable. The visual appearance is identical to the original scan. OCR to text extracts only the recognized text without preserving the original image. For documents where the original image has legal value (signed contracts, notarial documents, stamped invoices), the searchable PDF is the correct option. For data extraction or text analysis, direct extraction to TXT is more efficient.
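With the Tesseract command-line tool, the two outputs differ only in the output format named at the end of the command. The sketch below builds the argument list without running it; the helper name and the file names are hypothetical. Tesseract itself reads image files, so this assumes each page of the scanned PDF has already been rendered to an image.

```python
def tesseract_command(image_path, output_base, searchable_pdf=True,
                      lang="eng", dpi=300):
    """Build a Tesseract CLI call for one page image.

    searchable_pdf=True  -> writes <output_base>.pdf (image plus invisible text)
    searchable_pdf=False -> writes <output_base>.txt (recognized text only)
    """
    output_format = "pdf" if searchable_pdf else "txt"
    return ["tesseract", image_path, output_base,
            "-l", lang, "--dpi", str(dpi), output_format]


# Hypothetical file names, for illustration only.
pdf_cmd = tesseract_command("contract_page1.png", "contract_page1")
txt_cmd = tesseract_command("contract_page1.png", "contract_page1",
                            searchable_pdf=False)
```

Either list can be passed to `subprocess.run(cmd, check=True)` on a machine where Tesseract is installed.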

Does it work with documents in multiple languages?

Yes. Tesseract 5 supports over 100 languages including English, Spanish, French, German, Portuguese, Italian, Russian, Simplified and Traditional Chinese, Japanese, Arabic, Hindi, and many more. For documents that mix languages on the same page, multi-language recognition mode can be activated, which improves accuracy compared with forcing a single language.
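In Tesseract, multi-language recognition is requested by joining language codes with `+` in the `-l` option, for example `-l eng+spa`. A small sketch of building that value follows; the helper name is made up, and the codes must match language data files installed on the machine.

```python
def language_argument(codes):
    """Join Tesseract language codes into one '-l' value, e.g. 'eng+spa'.

    Blank entries and duplicates are dropped; the first occurrence keeps its place.
    """
    cleaned = []
    for code in codes:
        code = code.strip()
        if code and code not in cleaned:
            cleaned.append(code)
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


# English plus Spanish on the same page.
lang = language_argument(["eng", "spa"])
```

Listing only the languages that actually appear in the document is usually the safer choice, since every extra language adds processing time.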

What scanner resolution is needed for good OCR accuracy?

The minimum recommended resolution for quality OCR is 300 DPI. At this resolution, most printed typefaces are sufficiently defined for the OCR engine to recognize them correctly. At 150 DPI, accuracy drops notably, even with ordinary body text (10–12 points) and more so with smaller text. At 600 DPI, quality is excellent, but the scan has four times as many pixels as at 300 DPI and a much larger file size, without a proportional improvement in OCR accuracy for normal text. For documents with very small print (such as footnotes in legal documents), scanning at 400–600 DPI may be necessary.
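The arithmetic behind these recommendations is simple: a point is 1/72 of an inch, so the pixel height of text grows linearly with DPI, while the pixel count of the page grows with its square. A short sketch, assuming a US Letter page (8.5 × 11 inches) purely for illustration:

```python
POINTS_PER_INCH = 72


def text_height_px(point_size, dpi):
    """Nominal height of a font's em box in pixels at a given scan resolution.

    Actual letter shapes are smaller than the em box, so this is an upper bound.
    """
    return point_size * dpi / POINTS_PER_INCH


def page_pixels(width_in, height_in, dpi):
    """Total number of pixels in a scanned page."""
    return round(width_in * dpi) * round(height_in * dpi)


# 12-point body text: 25 px tall at 150 DPI, 50 px at 300 DPI, 100 px at 600 DPI.
heights = {dpi: text_height_px(12, dpi) for dpi in (150, 300, 600)}

# Doubling the resolution quadruples the pixel count of the page.
ratio = page_pixels(8.5, 11, 600) / page_pixels(8.5, 11, 300)
```

This is why 600 DPI scans are so much larger than 300 DPI scans while the text is only twice as tall in pixels.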

Can OCR be applied to PDFs that already have some text?

Yes. Mixed PDFs that combine pages of native text with scanned pages are common. One example is a contract whose first pages are digitally generated text and whose last page is a scanned signature. Modern OCR tools can automatically detect which pages already contain real text and which are images, applying OCR only where necessary. This avoids unnecessary reprocessing of pages that already have readable text.
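One simple way to make that decision, sketched below, is to extract the existing text of each page and run OCR only on pages where little or nothing comes back. This is an assumed heuristic, not a description of how this tool works; the function name, the `min_chars` threshold, and the idea that `page_texts` comes from some PDF library's text extraction are all illustrative.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return the 0-based indexes of pages that look like images only.

    `page_texts` holds the text already embedded in each page (None or an
    empty string for pages with no text layer). Pages with fewer than
    `min_chars` non-whitespace-trimmed characters are treated as scans.
    """
    return [index for index, text in enumerate(page_texts)
            if len((text or "").strip()) < min_chars]


# A three-page contract: two native-text pages and a scanned signature page.
contract = [
    "This agreement is entered into by the parties named below.",
    "Clause 1. The supplier shall deliver the goods within thirty days.",
    "",
]
to_ocr = pages_needing_ocr(contract)
```

A small threshold rather than zero helps with scanned pages that carry only a stray page number or stamp as native text.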