PDF OCR

Convert scanned PDFs to searchable and selectable text using OCR. Free, no signup.

.pdf · up to 2 GB

Free · No signup · No watermark · OCR included

OCR PDF: make any scanned document searchable

Searchable documents

Convert scanned files into PDFs where you can search words, select text, and copy excerpts.

Historical archives

Digitize historical documentation, paper files, and physical contract archives, and make them accessible.

Accessibility

Documents with an OCR layer can be read by screen readers, which helps them meet digital accessibility regulations.

Multi-language

Support for over 100 languages including English, Spanish, Arabic, Chinese, Russian, and more with Tesseract 5.

Three steps, no hassle

1. Upload your scanned PDF

Drag or select the scanned PDF. OCR works on PDFs whose pages are images: physically scanned documents, photographs of documents, digitized faxes.

2. OCR recognition

The OCR engine analyzes each page as an image, identifies characters, and generates an invisible text layer overlaid on the original document image.

3. Download the searchable PDF

The resulting PDF looks identical to the original, but you can now search, select, and copy its text, and its content is accessible to indexers and screen readers.
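To make the "invisible text layer" from step 2 concrete, here is a minimal sketch of the PDF content-stream operators that can place a recognized word on a page without drawing it. It relies on text rendering mode 3 from the PDF specification (neither fill nor stroke). The function name, the font resource name /F1, and the coordinates are assumptions chosen for illustration; this is not necessarily how this service writes its output.

```python
def invisible_text_ops(text: str, x: float, y: float, font_size: float) -> str:
    """Return PDF content-stream operators that place `text` invisibly at (x, y)."""
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return (
        "BT "                     # begin a text object
        f"/F1 {font_size:g} Tf "  # select font resource /F1 (hypothetical name)
        "3 Tr "                   # rendering mode 3: neither fill nor stroke
        f"{x:g} {y:g} Td "        # move to the word's position on the page
        f"({escaped}) Tj "        # emit the recognized text
        "ET"                      # end the text object
    )


ops = invisible_text_ops("Invoice (copy)", 72, 700, 12)
print(ops)
```

Because the text is never painted, the scanned image underneath stays visually unchanged, while search and copy operations still find the characters.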

Got questions?

OCR (Optical Character Recognition) is the technology that converts images of text into digitally encoded text. The process has three main stages: image preprocessing (skew correction, noise removal, binarization), segmentation (identifying text lines, words, and individual characters), and recognition (comparing each character against reference models to determine the most likely character). Modern OCR engines based on LSTM (Long Short-Term Memory) recurrent neural networks surpass classic template-based methods in accuracy, especially on documents with irregular typefaces, tilted or degraded text.
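Binarization, one of the preprocessing steps mentioned above, can be sketched with Otsu's method, a classic way to pick a black/white threshold from the gray-level histogram. This is an illustrative pure-Python version operating on a flat list of 0-255 gray values; it is an assumption for demonstration, not the preprocessing used by any particular engine.

```python
def otsu_threshold(pixels: list[int]) -> int:
    """Return the gray level that best separates dark pixels from light ones."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    n_dark, sum_dark = 0, 0
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        n_light = total - n_dark
        if n_dark == 0:
            continue          # no pixels at or below t yet
        if n_light == 0:
            break             # every pixel is already in the dark class
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        # Between-class variance (up to a constant factor).
        score = n_dark * n_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def binarize(pixels: list[int]) -> list[int]:
    """Map every pixel to pure black (0) or pure white (255)."""
    t = otsu_threshold(pixels)
    return [0 if p <= t else 255 for p in pixels]


# Hypothetical scan line: three dark "ink" pixels, three light "paper" pixels.
print(binarize([30, 35, 40, 200, 210, 220]))
```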

Tesseract 5 is the most widely used open-source OCR engine. It was originally developed at HP starting in the mid-1980s, released as open source by HP in 2005, developed further under Google's sponsorship from 2006, and is published under the Apache 2.0 license. LSTM-based recognition arrived in version 4 (2018), and version 5.0 was released in November 2021. It achieves accuracy rates of 98–99% on good-quality printed English documents scanned at 300 DPI. Accuracy is very high for standard typefaces (Times New Roman, Arial, Calibri) and lower for decorative typefaces, very small text (under 8 points), or documents degraded by age.
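Accuracy figures like these are commonly reported at the character level: one minus the edit distance between the recognized text and a hand-checked reference, divided by the reference length. The sketch below assumes that definition (the text above does not say how its figures were measured) and uses hypothetical sample strings.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, recognized: str) -> float:
    """Fraction of reference characters recovered, by edit distance."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 1 - edit_distance(reference, recognized) / len(reference)


# Hypothetical OCR output with typical confusions: i -> 1 and w -> vv.
reference = "The quick brown fox"
recognized = "The qu1ck brovvn fox"
print(edit_distance(reference, recognized))
print(f"{character_accuracy(reference, recognized):.1%}")
```

For scale: at 98% character accuracy, a page of 2,000 characters still contains about 40 errors, which is why scan quality matters even when the headline figure looks high.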

OCR to searchable PDF (also called PDF with OCR layer or text-embedded PDF) maintains the original document image and adds an invisible text layer that makes the document searchable. The visual appearance is identical to the original scan. OCR to text extracts only the recognized text without preserving the original image. For documents where the original image has legal value (signed contracts, notarial documents, stamped invoices), the searchable PDF is the correct option. For data extraction or text analysis, direct extraction to TXT is more efficient.
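With the Tesseract command-line tool, the two output modes correspond to the `pdf` and `txt` output configs. The helper below only builds the argument list; the function name and file names are hypothetical, and whether this service invokes Tesseract this way is an assumption. Note that Tesseract reads images, not PDFs, so each PDF page has to be rendered to an image first.

```python
def tesseract_command(image_path: str, output_base: str, mode: str) -> list[str]:
    """Build a Tesseract CLI invocation for one page image."""
    configs = {
        "searchable_pdf": "pdf",  # image plus invisible text layer
        "plain_text": "txt",      # recognized text only
    }
    if mode not in configs:
        raise ValueError(f"unknown mode: {mode!r}")
    # Tesseract appends the extension to output_base itself.
    return ["tesseract", image_path, output_base, configs[mode]]


print(tesseract_command("page-1.png", "contract", "searchable_pdf"))
print(tesseract_command("page-1.png", "contract", "plain_text"))
```

The resulting list can be passed to `subprocess.run` on a machine where Tesseract is installed.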

Multilingual documents are supported. Tesseract 5 supports over 100 languages, including English, Spanish, French, German, Portuguese, Italian, Russian, Simplified and Traditional Chinese, Japanese, Arabic, Hindi, and many more. For documents that mix languages on the same page, multi-language recognition can be enabled, which improves accuracy compared to fixing a single language.
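In Tesseract's command line, multi-language recognition is requested by joining language codes with `+` in the `-l` option, for example `-l eng+spa`. The sketch below maps a few language names to their Tesseract codes; the helper name and the sample file names are hypothetical, and the table covers only a handful of the available languages.

```python
# A small subset of Tesseract's language codes (traineddata names).
LANGUAGE_CODES = {
    "English": "eng",
    "Spanish": "spa",
    "French": "fra",
    "German": "deu",
    "Russian": "rus",
    "Arabic": "ara",
    "Japanese": "jpn",
    "Simplified Chinese": "chi_sim",
}


def language_argument(names: list[str]) -> str:
    """Return the value for Tesseract's -l option, e.g. 'eng+spa'."""
    if not names:
        raise ValueError("at least one language is required")
    unknown = [n for n in names if n not in LANGUAGE_CODES]
    if unknown:
        raise ValueError(f"no code known for: {unknown}")
    return "+".join(LANGUAGE_CODES[n] for n in names)


langs = language_argument(["English", "Spanish"])
command = ["tesseract", "page-1.png", "page-1", "-l", langs, "pdf"]
print(command)
```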

The minimum recommended resolution for quality OCR is 300 DPI. At this resolution, most printed typefaces are sufficiently defined for the OCR engine to recognize them correctly. At 150 DPI, accuracy drops notably, especially with small body text (10–12 points). At 600 DPI, quality is excellent, but the scan file is much larger without a proportional improvement in OCR accuracy for normal text. For documents with fine print (very small text such as footnotes in legal documents), scanning at 400–600 DPI may be necessary.
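The arithmetic behind these recommendations is simple: a point is 1/72 of an inch, so the nominal height of text in pixels is the point size divided by 72, times the DPI, while the number of pixels on the page grows with the square of the DPI. The sketch below works through both; the function names are illustrative, and the point size is the nominal font size, which is somewhat larger than the height of an individual lowercase letter.

```python
def text_height_px(point_size: float, dpi: int) -> float:
    """Nominal height in pixels of text of a given point size at a given DPI."""
    return point_size / 72 * dpi


def relative_pixel_count(dpi: int, baseline_dpi: int = 300) -> float:
    """How many times more pixels a scan has compared with the baseline DPI."""
    return (dpi / baseline_dpi) ** 2


for dpi in (150, 300, 600):
    height = text_height_px(10, dpi)
    ratio = relative_pixel_count(dpi)
    print(f"{dpi} DPI: 10 pt text is about {height:.1f} px tall, "
          f"{ratio:g}x the pixels of a 300 DPI scan")
```

At 150 DPI, 10-point text is only about 21 pixels tall, which leaves little detail per character; at 600 DPI the page carries four times the pixels of a 300 DPI scan, which is where the larger file size comes from.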

Mixed PDFs that combine native-text pages and scanned pages are common: for example, a contract whose first pages are digitally generated text and whose last page is a scanned signature. OCR tools can detect automatically which pages already contain real text and which are images, and apply OCR only where necessary. This avoids reprocessing pages that already have readable text.
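One simple way to make that decision is to look at how much text each page already yields when extracted with a PDF library, and send only the near-empty pages to OCR. The sketch below assumes the per-page text has already been extracted; the function name and the 20-character threshold are assumptions for illustration, not a description of how any specific tool decides.

```python
def pages_needing_ocr(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return 1-based numbers of pages whose extracted text is too short to be real."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


# Hypothetical contract: two native-text pages and a scanned signature page.
extracted = [
    "This agreement is made between the parties named below.",
    "Clause 2. Payment terms are thirty days from invoice.",
    "   ",
]
print(pages_needing_ocr(extracted))
```

A threshold rather than an emptiness check tolerates scanned pages that carry a few stray characters, such as a page number added by the scanner software.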

OCR PDF: how to make a scanned document searchable with optical recognition technology

OCR (Optical Character Recognition) applied to scanned PDFs is one of the most transformative technologies in document management. Before OCR, physical documents scanned to PDF were silent images: you couldn't search for a word, select text, or have a screen reader read them. OCR turns these page images into documents with real text while preserving the original visual appearance.

The history of OCR is long. The first commercial character recognition machines date from the 1950s, and postal services began sorting mail with OCR in the 1960s. The first commercial PC products arrived in the 1980s and 1990s with OmniPage (Caere Corporation, 1988) and FineReader (ABBYY, 1993). The revolution came with machine learning-based engines: Tesseract, originally developed at HP's laboratories between 1985 and 1994, was released as open source by HP in 2005, and Google sponsored its development from 2006. Version 4 (2018) introduced LSTM architectures that dramatically improved accuracy. Version 5 (November 2021) refined these models to achieve accuracy rates of 98–99% under optimal conditions.

Applying OCR to scanned PDFs has two output modes with distinct use cases. The first is the searchable PDF (also known as PDF/OCR): the resulting PDF keeps the original document image and adds an invisible text layer that enables search, text selection, and accessibility without altering the visual appearance. This mode is correct for documents with legal or archival value where the original image must be preserved intact, such as signed contracts, notarial documents, letterhead invoices, and medical records. The second mode is pure text extraction (TXT): only the recognized text is extracted, and the visual format is lost. This mode is more suitable for text analysis, feeding search systems, or processing content with data processing tools. The PDF/A family of archival standards (ISO 19005) permits an invisible OCR text layer, so a document can be both a faithful visual archive and accessible text; PDF/A-3 (ISO 19005-3, published in 2012) additionally allows other files to be embedded in the PDF. PDF/A is widely used for institutional archives.

OCR accuracy depends on several factors that users can control. Scan resolution is the most important: 300 DPI produces optimal results for most 10–12 point typefaces. Background color matters too: OCR works best on white backgrounds with high-contrast black text, and accuracy is lower for documents with colored backgrounds, watermarks, overlapping stamps, or text printed over background images. Paper quality and document age also matter: a 1970s document printed on yellowed paper with faded ink will have lower accuracy than a document printed in 2020. For deteriorated historical documents, image preprocessing techniques (contrast enhancement, stain removal, skew correction) significantly improve OCR accuracy. Convertir.ai applies automatic preprocessing before OCR to maximize accuracy for most common scanned documents.
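As an illustration of contrast enhancement, the simplest technique is linear contrast stretching: the darkest pixel is mapped to black, the lightest to white, and everything in between is scaled proportionally. The sketch below works on a flat list of 0-255 gray values and is only an example of the idea; it is not a description of the preprocessing Convertir.ai actually applies.

```python
def stretch_contrast(pixels: list[int]) -> list[int]:
    """Linearly rescale gray values so they span the full 0-255 range."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return list(pixels)  # a flat image has no contrast to stretch
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]


# Hypothetical faded scan: "ink" is mid-gray (90-100), "paper" is light gray (170-180).
faded = [90, 100, 170, 180]
print(stretch_contrast(faded))
```

After stretching, faded ink and yellowed paper sit much further apart in gray level, which gives a later binarization step a clearer separation to work with.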