FileKit

How to OCR Scanned Documents — Extract Text from Images

A guide to Optical Character Recognition: what it is, how to use browser-based OCR, tips for better accuracy, and when to use OCR vs. direct text extraction.

What Is OCR?

OCR (Optical Character Recognition) converts images of text into actual, selectable, searchable text. When you scan a paper document, photograph a whiteboard, or screenshot a conversation, the result is an image — pixels on a screen. The text exists visually but not digitally. You cannot select it, copy it, search for a specific word, or feed it into a spreadsheet.

OCR solves this by analyzing the pixel patterns, identifying letter shapes, and producing machine-readable text. Modern OCR engines use neural networks trained on millions of document samples and can reach character-level accuracy above 99% on clean, well-formatted printed text. Accuracy drops, sometimes sharply, on noisy scans, unusual fonts, and complex layouts.

How OCR Works (Simplified)

Understanding the pipeline helps you get better results:

  1. Preprocessing. The engine converts the image to grayscale, adjusts contrast, removes noise, and corrects skew (rotation). Better input at this stage means better output at every subsequent stage.
  2. Layout analysis. The engine identifies text regions, columns, paragraphs, and reading order. It separates text from images, tables, and decorative elements.
  3. Character recognition. Each character shape is matched against learned patterns. The engine considers context — surrounding letters, dictionary words, language rules — to resolve ambiguous shapes (is that a zero or the letter O?).
  4. Post-processing. Spell checking and language models correct likely errors. The result is structured text output.
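The context idea in step 3 can be shown with a toy rule. This is only an illustration, not how Tesseract or any real engine resolves ambiguity internally: the hypothetical helper `fix_numeric_tokens` assumes that a token made up of digits, punctuation, and the look-alike letters `O` and `l` was meant to be a number.

```shell
# Toy context rule: inside tokens that otherwise look numeric,
# read the letter O as zero and lowercase l as one.
# Note: awk rebuilds each edited line with single spaces between tokens.
fix_numeric_tokens() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[0-9Ol.,]+$/ && $i ~ /[0-9]/) {
        gsub(/O/, "0", $i)
        gsub(/l/, "1", $i)
      }
    print
  }'
}

# Usage: fix_numeric_tokens < output.txt
```

Words such as "Oslo" are left alone because they contain letters other than the look-alikes. Real engines weigh many more signals, including dictionaries and language models.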

How to OCR a Document

1. Browser-Based OCR

FileKit's OCR tool uses Tesseract.js, an open-source OCR engine compiled to WebAssembly, to recognize text entirely in your browser. It supports English, Simplified Chinese, Japanese, and mixed English+Chinese. Drop an image or scanned PDF, choose the language, and get the extracted text in seconds. There is no server upload and no third-party processing, so the document never leaves your device.

2. Google Drive

Upload a scanned PDF or image to Google Drive, right-click, and select Open with Google Docs. Google applies OCR automatically and creates an editable document with the recognized text. This works well for simple, single-column layouts but struggles with multi-column documents, tables, and handwriting.

3. Adobe Acrobat

Acrobat's Scan & OCR feature adds a searchable text layer to scanned pages. The original image stays intact, and an invisible text layer is positioned to match it. The document looks identical to the scan, but you can select, copy, and search the text. This is best for archival-quality documents where visual fidelity matters.

4. Command Line with Tesseract

# Basic OCR (English)
tesseract scan.png output -l eng

# Multi-language (English + Chinese)
tesseract scan.png output -l eng+chi_sim

# Generate searchable PDF (keeps original image + adds text layer)
tesseract scan.png output -l eng pdf

# Batch process a folder of images
for f in *.png; do tesseract "$f" "${f%.png}" -l eng; done

Tesseract is one of the most widely used open-source OCR engines. Install it via your system package manager (brew install tesseract on macOS, apt install tesseract-ocr on Ubuntu). Language data for non-English languages must be installed separately.
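Before running a multi-language job, it helps to confirm that every language in the -l spec is actually installed. The sketch below is one way to do that; `missing_langs` is a hypothetical helper name, and it only compares text, so it can be tried without Tesseract present.

```shell
# Print each language from a "+"-joined spec (e.g. eng+chi_sim) that does
# not appear in a list of installed languages (one code per line).
missing_langs() {
  installed=$1
  printf '%s\n' "$2" | tr '+' '\n' | while IFS= read -r lang; do
    printf '%s\n' "$installed" | grep -qxF -- "$lang" || printf '%s\n' "$lang"
  done
}

# Typical use (requires tesseract on PATH):
#   missing_langs "$(tesseract --list-langs 2>&1)" "eng+chi_sim"
```

If a language is reported missing, install its data package. The package names vary by platform, for example tesseract-ocr-chi-sim on Ubuntu or the tesseract-lang formula on Homebrew, so check your package manager for the exact name.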

Tips for Better OCR Accuracy

Image Quality

  • Resolution. Aim for at least 300 DPI. At 150 DPI, small text (below 10pt) becomes unreliable. Phone photos of documents often work out to roughly 150-200 DPI equivalent, depending on the camera and how tightly the page fills the frame. That is adequate for large text but poor for fine print.
  • Contrast. Dark text on a white background gives the best results. Light gray text, colored backgrounds, watermarks, and gradient fills all degrade accuracy. If your source has low contrast, increase it in an image editor before running OCR.
  • Focus and sharpness. Blurry images produce blurry character shapes that the engine cannot match. Use a flatbed scanner when possible; if using a phone camera, hold steady and ensure good lighting.
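The resolution guideline above is easy to check with arithmetic: effective DPI is the number of pixels across the page divided by the physical width in inches. The helper names below are made up for this sketch, and the page width is something you supply yourself.

```shell
# Effective DPI = pixels across the page / physical page width in inches.
effective_dpi() {
  awk -v px="$1" -v inches="$2" 'BEGIN { printf "%.0f\n", px / inches }'
}

# Pixels needed across a page of the given width to reach a target DPI
# (defaults to the 300 DPI guideline).
min_pixels() {
  awk -v inches="$1" -v dpi="${2:-300}" 'BEGIN { printf "%.0f\n", inches * dpi }'
}

# Example: a US Letter page is 8.5 inches wide.
#   effective_dpi 2550 8.5    # prints 300
#   min_pixels 8.5            # prints 2550
# With ImageMagick installed, the pixel width can be read with:
#   identify -format '%w' scan.png
```

For a photo, measure the pixel width of the page itself rather than the whole frame, since background around the page does not add resolution.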

Document Preparation

  • Straighten the image. Skewed text (even 2-3 degrees) significantly hurts accuracy. Most OCR engines attempt auto-deskew, but aligning the page on the scanner, or straightening the image before running OCR, is more reliable. If you have a scanned PDF with skewed pages, consider correcting the rotation first.
  • Remove borders and artifacts. Crop the document to remove scanner borders, black edges, and binding shadows. These artifacts confuse layout analysis.
  • Select the right language. Always specify the primary language of the document. For mixed-language content (English headers with Chinese body text), use the combined language mode if your OCR tool supports it.
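Straightening and border removal can be scripted with ImageMagick, if it is installed. This is a sketch under assumptions: the 10% fuzz value is a starting guess to tune per scanner, `straighten_and_crop` is a hypothetical helper name, and trimming only works when the border is a fairly uniform color.

```shell
# IM names the ImageMagick binary: "convert" on version 6, "magick" on version 7.
IM=${IM:-convert}

straighten_and_crop() {
  # -deskew 40%      auto-straighten (40% is the threshold ImageMagick's docs suggest)
  # -fuzz 10% -trim  cut away near-uniform borders around the page
  # +repage          reset the canvas so the output has no leftover offset
  "$IM" "$1" -deskew 40% -fuzz 10% -trim +repage "$2"
}

# Usage, followed by OCR in combined language mode:
#   straighten_and_crop scan.png clean.png
#   tesseract clean.png output -l eng+chi_sim
```

Compare the cleaned image with the original before batch processing, because an aggressive fuzz value can trim into the text area.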

Common OCR Challenges

Handwriting

Standard OCR engines are trained on printed text and perform poorly on handwriting. For handwritten documents, use specialized services (Google Cloud Vision, Amazon Textract) that have handwriting-specific models. Accuracy varies widely depending on handwriting legibility.

Tables and Forms

Basic OCR engines recognize text but usually do not preserve table structure. A scanned table may come out as jumbled text with lost column alignment. For structured data extraction from tables, dedicated document AI services or manual post-processing are often necessary.
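One form of manual post-processing uses Tesseract's TSV output, which lists each recognized word with its pixel position. The sketch below groups words into rows by vertical position and starts a new cell wherever there is a wide horizontal gap. The helper name, the 30-pixel row height, and the 40-pixel gap are all assumptions to adjust per document, and the approach will fail on tables with wrapped cells or uneven row spacing.

```shell
# Rebuild rough table rows from a Tesseract TSV file.
#   $1 = TSV file (e.g. from: tesseract scan.png output -l eng tsv)
#   $2 = row height in pixels used to bucket words into rows (assumed default 30)
#   $3 = horizontal gap in pixels that separates two cells (assumed default 40)
table_rows_from_tsv() {
  # Keep word-level rows (level 5) and emit: row bucket, left, width, text.
  awk -F '\t' -v h="${2:-30}" '
    NR > 1 && $1 == 5 && $12 !~ /^ *$/ {
      printf "%d\t%d\t%d\t%s\n", int(($8 + $10 / 2) / h), $7, $9, $12
    }' "$1" |
  # Order words top to bottom, then left to right.
  sort -t "$(printf '\t')" -k1,1n -k2,2n |
  # Join words: a space inside a cell, a tab between cells.
  awk -F '\t' -v gap="${3:-40}" '
    BEGIN { row = -1 }
    $1 != row { if (NR > 1) printf "\n"; row = $1; first = 1 }
    {
      if (first) sep = ""; else sep = ($2 - right > gap) ? "\t" : " "
      printf "%s%s", sep, $4
      right = $2 + $3; first = 0
    }
    END { if (NR > 0) printf "\n" }'
}
```

The tab-separated result can be pasted into a spreadsheet for checking. Treat it as a starting point for review, not as finished data.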

Multi-Column Layouts

Newspapers, academic papers, and magazines often use multi-column layouts. OCR engines may read across columns instead of down them, producing nonsensical output. Adobe Acrobat and advanced OCR tools handle columns better than basic engines.
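With command-line Tesseract, one workaround is to crop each column into its own image and recognize it with a single-column page segmentation mode. The helper below only computes the crop geometry and assumes equal-width columns with no gutter adjustment, which real pages rarely match exactly; `column_geometry` is a hypothetical name.

```shell
# Print an ImageMagick crop geometry (WxH+X+Y) for one of N equal-width columns.
#   $1 = page width in pixels, $2 = page height in pixels
#   $3 = number of columns,    $4 = column index, starting at 0
column_geometry() {
  w=$(( $1 / $3 ))
  x=$(( w * $4 ))
  printf '%sx%s+%s+0\n' "$w" "$2" "$x"
}

# Usage for the right-hand column of a two-column 2400x3300 page:
#   convert page.png -crop "$(column_geometry 2400 3300 2 1)" +repage col1.png
#   tesseract col1.png col1 -l eng --psm 4   # psm 4: single column of text
```

Running each column separately keeps the reading order correct, at the cost of having to join the output files afterwards.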

Low-Quality Faxes and Photocopies

Documents that have been faxed or photocopied multiple times accumulate noise, lose contrast, and develop artifacts. Each generation degrades quality. For these documents, aggressive preprocessing (contrast enhancement, noise removal, binarization) before OCR can significantly improve results.
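The three preprocessing steps named above map onto ImageMagick options, assuming ImageMagick is available. The 55% threshold is a guess that needs tuning per document, and `clean_fax` is a hypothetical helper name.

```shell
# IM names the ImageMagick binary: "convert" on version 6, "magick" on version 7.
IM=${IM:-convert}

clean_fax() {
  # -colorspace Gray  drop color information
  # -normalize        stretch contrast across the full range
  # -despeckle        reduce isolated noise pixels
  # -threshold 55%    binarize: pixels above the threshold become white, the rest black
  "$IM" "$1" -colorspace Gray -normalize -despeckle -threshold 55% "$2"
}

# Usage:
#   clean_fax fax.png fax-clean.png
#   tesseract fax-clean.png output -l eng
```

A fixed threshold can erase faint strokes or thicken heavy ones, so try a few values on a sample page and compare the OCR output before committing to one.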

OCR vs. Text Extraction: Know the Difference

Not all PDFs need OCR. There are two kinds of PDFs:

  • Digital PDFs: created from Word, Excel, HTML, or any application's "Export to PDF" function. The text is already embedded and selectable. Use the PDF to Text tool. It reads the embedded text directly, so it is faster and introduces no recognition errors.
  • Scanned PDFs: created by scanning paper documents. Each page is an image, and the text exists only as pixels. This is where OCR is necessary.

How to tell the difference: open the PDF and try to select text with your cursor. If you can highlight individual words, it is a digital PDF and OCR is unnecessary. If clicking and dragging selects the entire page as an image, it is a scanned PDF and needs OCR.
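The same check can be scripted for a batch of files. The sketch below assumes pdftotext from poppler-utils is installed and treats a PDF with fewer than 20 extractable characters as scanned. That threshold and the helper names are arbitrary choices for illustration.

```shell
# Count non-whitespace characters arriving on stdin.
visible_chars() {
  tr -d '[:space:]' | wc -c | tr -d ' '
}

# Classify by character count; $2 overrides the assumed threshold of 20.
classify_pdf_text() {
  if [ "$1" -ge "${2:-20}" ]; then echo digital; else echo scanned; fi
}

# Extract embedded text with pdftotext and classify the file.
pdf_kind() {
  classify_pdf_text "$(pdftotext "$1" - 2>/dev/null | visible_chars)"
}

# Usage: for f in *.pdf; do printf '%s: %s\n' "$f" "$(pdf_kind "$f")"; done
```

A scanned PDF that already carries an OCR text layer will be reported as digital, and a missing pdftotext makes every file look scanned, so spot-check the results.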

After OCR: Next Steps

  • Proofread the output. Even the best OCR makes mistakes. Common errors include confusing similar characters (l/1, O/0, rn/m), dropping punctuation, and merging or splitting words incorrectly.
  • Convert to your target format. Once you have the text, you can paste it into a document, spreadsheet, or database. For creating a formatted PDF from the extracted text, use the Text to PDF tool.
  • Archive the searchable version. If you ran OCR on a scanned PDF and generated a searchable PDF (with the text layer), keep this version as your primary archive. It preserves the original visual appearance while enabling full-text search.
  • Compress if needed. Scanned PDFs with OCR text layers can be large. Compress the PDF to reduce file size while keeping the text layer intact.
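Part of the proofreading step can be narrowed down mechanically. As a rough sketch, the hypothetical helper below lists lines where a letter sits directly next to a digit, a common sign of l/1 or O/0 confusion. It will also flag legitimate tokens such as "A4", and it cannot catch rn/m swaps, so it complements a spell checker and does not replace a read-through.

```shell
# Print line numbers and content for lines where a letter touches a digit.
# "|| true" keeps the exit status clean when nothing is flagged.
flag_suspects() {
  grep -nE '[[:alpha:]][0-9]|[0-9][[:alpha:]]' || true
}

# Usage: flag_suspects < output.txt
```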