Using Python for Text and Data Extraction from Scanned PDFs with OCR#

Extracting information from scanned documents presents a unique challenge. Unlike digitally created PDFs, scanned PDFs are essentially image files wrapped in a PDF container. This means the text within them is not readily selectable or searchable using standard PDF processing tools. Optical Character Recognition (OCR) technology provides the solution by converting images of text into machine-readable text data. Python, with its extensive ecosystem of libraries, offers powerful capabilities for implementing OCR and automating the extraction process from scanned PDFs.

Understanding Scanned PDFs and OCR#

  • Scanned PDF: A scanned PDF is a document created by scanning a physical paper document. Each page is stored as an image (like a JPEG or TIFF) within the PDF file format. Text is visually present but is not encoded as character data.
  • Optical Character Recognition (OCR): OCR is a technology that converts different types of documents, such as scanned paper documents, PDF files, or images captured by a digital camera, into editable and searchable data. OCR software analyzes an image, identifies areas containing text, and then converts these images of characters into actual text characters.

Extracting text and data from scanned PDFs relies entirely on applying OCR to the image content of each page. Python libraries provide the tools to access these images and interface with OCR engines.

Essential Concepts and Tools#

Successfully extracting data from scanned PDFs using Python and OCR involves several components:

  1. PDF Handling: The ability to open a PDF, identify its pages, and extract the image data for each page. Libraries like pdf2image are commonly used for this, converting PDF pages into image objects (e.g., using the PIL/Pillow library or OpenCV).
  2. OCR Engine: The core software that performs the character recognition. Tesseract OCR is a widely used, open-source engine.
  3. Python Interface for OCR: Python libraries that serve as wrappers around OCR engines, allowing Python code to send images to the engine and receive the extracted text. pytesseract is the most popular wrapper for the Tesseract engine. easyocr is another popular library that ships its own OCR models and is known for its ease of use and strong multilingual support.
  4. Image Preprocessing: Techniques applied to the page images before sending them to the OCR engine to improve accuracy. This includes steps like:
    • Grayscaling: Converting color images to black and white or shades of gray.
    • Binarization (Thresholding): Converting an image to purely black and white pixels, which can help separate text from the background.
    • Deskewing: Correcting pages that were scanned at an angle.
    • Denoising: Removing speckles or artifacts from the image.
    Libraries like Pillow (PIL) or OpenCV (cv2) are used for these image manipulations.
  5. Text and Data Processing: Once text is extracted by OCR, further processing in Python is often needed to clean the text, extract specific data points (like names, dates, amounts), or structure the data (e.g., into CSV format). Regular expressions (re module) and string manipulation techniques are essential here.
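As a sketch of this post-processing stage (the sample text here is invented for illustration), a small cleanup pass with the re module can normalize whitespace and drop empty lines before any pattern matching:

```python
import re

def clean_ocr_text(text: str) -> str:
    """Normalize raw OCR output: collapse runs of spaces/tabs,
    trim each line, and drop lines that end up empty."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)

raw = "Invoice   Number :  INV-001 \n\n  Total:   $42.00  "
print(clean_ocr_text(raw))
# → Invoice Number : INV-001
#   Total: $42.00
```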

Step-by-Step Walkthrough: Extracting Text with Python and Tesseract#

This walkthrough demonstrates using pdf2image to convert PDF pages to images and pytesseract to perform OCR.

Prerequisites:

  • Python: Install Python 3.x.
  • Tesseract OCR Engine: Download and install Tesseract from its official repository (installation varies by operating system). Ensure the Tesseract executable is in your system’s PATH, or note its path for configuration.
  • Poppler: pdf2image requires the Poppler utility suite (it invokes Poppler's pdftoppm/pdftocairo tools). Install Poppler for your OS, e.g., poppler-utils on Debian/Ubuntu, poppler on Fedora, via Homebrew on macOS, or a prebuilt Poppler binary added to PATH on Windows.
  • Python Libraries: Install the necessary libraries using pip:
    pip install pytesseract pdf2image Pillow
    (Pillow is a fork of PIL and is used by pdf2image for image handling).
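Before running the pipeline, it can help to confirm from Python that the external executables are actually reachable; a minimal check using the standard library (pdftoppm is the Poppler tool that pdf2image invokes) might look like:

```python
import shutil

def check_dependencies() -> dict:
    """Return whether the Tesseract and Poppler executables are on PATH."""
    return {name: shutil.which(name) is not None
            for name in ("tesseract", "pdftoppm")}

print(check_dependencies())  # e.g. {'tesseract': True, 'pdftoppm': True}
```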

Steps:

  1. Import Libraries: Import the required modules in your Python script.
    from pdf2image import convert_from_path
    import pytesseract
    from PIL import Image # Used by pdf2image, good to import for image manipulation
    import os # For path handling
  2. Configure Tesseract Path (If not in PATH): If Tesseract’s executable is not in your system’s PATH, specify its location.
    # Example for Windows. Adjust path as necessary.
    # pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
  3. Convert PDF Pages to Images: Use convert_from_path to turn each page of the scanned PDF into an image object.
    pdf_path = 'scanned_document.pdf' # Replace with your PDF file path
    try:
        # Store pages in a list of PIL Image objects
        pages = convert_from_path(pdf_path)
        print(f"Successfully converted {len(pages)} pages to images.")
    except Exception as e:
        print(f"Error converting PDF to images: {e}")
        # Handle error, e.g., PDF not found or Poppler not installed correctly
        exit()
  4. Perform OCR on Each Image: Iterate through the list of image pages and apply pytesseract.image_to_string() to extract text from each.
    extracted_text = []
    for i, page_image in enumerate(pages):
        print(f"Processing page {i + 1}...")
        try:
            # Apply OCR to the image
            text = pytesseract.image_to_string(page_image)
            extracted_text.append(f"--- Page {i + 1} ---\n{text}\n")
            print(f"Extracted text from page {i + 1}.")
        except Exception as e:
            print(f"Error during OCR on page {i + 1}: {e}")
            extracted_text.append(f"--- Page {i + 1} (OCR Error) ---\n\n")

    # Combine text from all pages
    full_text = "\n".join(extracted_text)
  5. Save the Extracted Text: Write the collected text to a file.
    output_text_file = 'extracted_text.txt'
    try:
        with open(output_text_file, 'w', encoding='utf-8') as f:
            f.write(full_text)
        print(f"Text extraction complete. Saved to '{output_text_file}'")
    except Exception as e:
        print(f"Error saving extracted text: {e}")

This basic process provides the raw text output from the OCR engine.
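The per-page bookkeeping in steps 4 and 5 can be factored into a small helper; this sketch uses the same '--- Page N ---' marker format as the loop above:

```python
def join_pages(page_texts):
    """Combine per-page OCR results into one string with page markers,
    matching the '--- Page N ---' format used in the walkthrough."""
    return "\n".join(
        f"--- Page {i + 1} ---\n{text}\n"
        for i, text in enumerate(page_texts)
    )

print(join_pages(["First page text", "Second page text"]))
```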

Improving Accuracy with Image Preprocessing#

Preprocessing can significantly boost OCR accuracy, especially for low-quality scans. Integrate preprocessing steps before the pytesseract.image_to_string() call.

# Assuming 'page_image' is the PIL Image object for a page
# Example Preprocessing Steps:
# 1. Convert to Grayscale
gray_image = page_image.convert('L')
# 2. Apply Binarization (Thresholding) - Example: Simple Threshold
# A threshold value (e.g., 128) separates dark pixels (text) from light pixels (background)
threshold_image = gray_image.point(lambda x: 0 if x < 128 else 255, '1') # '1' mode is for 1-bit pixels (black and white)
# You would then perform OCR on the preprocessed image:
# text = pytesseract.image_to_string(threshold_image)
# More advanced preprocessing might use OpenCV for features like deskewing, adaptive thresholding, etc.
# Requires: pip install opencv-python
# import cv2
# import numpy as np
# Example using OpenCV (convert PIL to OpenCV format first)
# open_cv_image = np.array(page_image)
# open_cv_image = open_cv_image[:, :, ::-1].copy() # Convert RGB to BGR
# gray_cv = cv2.cvtColor(open_cv_image, cv2.COLOR_BGR2GRAY)
# thresh_cv = cv2.threshold(gray_cv, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
# pytesseract.image_to_string accepts numpy arrays directly, so the
# thresholded array can be passed as-is (or converted back to a PIL Image):
# text = pytesseract.image_to_string(thresh_cv)

The specific preprocessing steps required depend heavily on the quality and characteristics of the scanned documents.
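The thresholding logic above can be illustrated without any imaging library at all. Operating on a plain grid of grayscale values (a toy stand-in for real pixel data), it applies the same per-pixel decision as the PIL point() lambda:

```python
def binarize(pixels, threshold=128):
    """Map each grayscale value (0-255) to pure black (0) or white (255),
    the same per-pixel rule as gray_image.point(lambda x: 0 if x < 128 else 255)."""
    return [[0 if p < threshold else 255 for p in row] for row in pixels]

# A tiny 2x3 "scan": dark ink mixed with a light background
print(binarize([[30, 200, 127], [129, 10, 255]]))
# → [[0, 255, 0], [255, 0, 255]]
```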

Extracting Structured Data#

Extracting structured data (like tables, key-value pairs) from the raw OCR text is more complex than simple text extraction, especially from scanned documents where layouts can be inconsistent or distorted.

Common techniques involve:

  1. Pattern Matching (Regex): Use regular expressions to find specific formats (e.g., dates \d{2}/\d{2}/\d{4}, currency \$\d+\.?\d*, specific keywords followed by values).
  2. Line-by-Line Analysis: Process the extracted text line by line, looking for spatial relationships or indentation that might indicate structure.
  3. Heuristics: Develop rules based on the expected document structure to identify data fields.
  4. Advanced Libraries/Techniques: For tables, libraries like camelot or tabula-py are excellent for digitally created PDFs but struggle with scanned images of tables. Extracting tables from scanned documents often requires performing OCR on specific image regions corresponding to the table and then using post-processing (potentially involving libraries like pandas for data manipulation) to reconstruct the table structure from the extracted text blocks. This is a significantly more advanced task.
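Technique 2 above can be sketched with plain string handling; the field names in this sample are hypothetical:

```python
def parse_key_values(text, separator=":"):
    """Extract 'Key: Value' pairs from OCR text, one pair per line.
    Lines without the separator are skipped."""
    fields = {}
    for line in text.splitlines():
        if separator in line:
            key, _, value = line.partition(separator)
            fields[key.strip()] = value.strip()
    return fields

sample = "Invoice Number: INV-001\nDate: 01/02/2024\nSome narrative line"
print(parse_key_values(sample))
# → {'Invoice Number': 'INV-001', 'Date': '01/02/2024'}
```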

For instance, extracting an invoice number might involve searching for lines containing “Invoice Number:”, “Invoice #”, or similar patterns and then extracting the text immediately following that pattern.

import re
# Assuming 'full_text' contains the OCR output
invoice_number_pattern = r"Invoice Number:\s*(\w+)"
date_pattern = r"Date:\s*(\d{2}/\d{2}/\d{4})"
invoice_match = re.search(invoice_number_pattern, full_text)
date_match = re.search(date_pattern, full_text)
invoice_number = invoice_match.group(1) if invoice_match else "Not Found"
invoice_date = date_match.group(1) if date_match else "Not Found"
print(f"Extracted Invoice Number: {invoice_number}")
print(f"Extracted Invoice Date: {invoice_date}")

This requires careful analysis of the document types being processed to define appropriate patterns.
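Once fields are extracted, the standard csv module can structure them for spreadsheets or databases; a sketch, with illustrative field names and values:

```python
import csv
import io

def fields_to_csv(records):
    """Write a list of dicts (one per processed document) to CSV text."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

rows = [{"invoice_number": "INV-001", "date": "01/02/2024"}]
print(fields_to_csv(rows))
```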

Real-World Applications#

Python-based OCR from scanned PDFs is invaluable in numerous scenarios:

  • Document Archiving and Search: Making large collections of historical documents, physical archives, or scanned records searchable. Organizations can digitize paper files and use OCR to enable keyword search across the content.
  • Automated Data Entry: Extracting data from standardized forms, invoices, receipts, or applications to populate databases or spreadsheets automatically, reducing manual data entry effort and errors.
  • Legal and Compliance: Processing scanned legal documents to extract case details, dates, or party names for analysis or inclusion in document management systems.
  • Research: Extracting text from scanned books, journals, or manuscripts for digital analysis or creating digital editions.
  • Healthcare: Processing scanned patient records, consent forms, or lab reports for data extraction and integration into electronic health record (EHR) systems (while considering privacy and security regulations).

Challenges and Considerations#

  • Accuracy Varies: OCR accuracy is highly dependent on scan quality, font type, font size, document layout complexity, and image resolution. Distorted text, low resolution, complex backgrounds, or unusual fonts significantly reduce accuracy.
  • Preprocessing is Key: The quality of the output from the OCR engine is directly influenced by the quality of the input image. Investing time in appropriate preprocessing steps for specific document types is crucial.
  • Handling Complex Layouts: Tables, multi-column text, images mixed with text, and handwritten notes pose significant challenges for standard OCR and require more sophisticated post-processing or specialized tools. Tesseract has modes to handle different structures, but they require configuration.
  • Language Support: Tesseract supports many languages but requires installing language data packs. easyocr is designed with multilingual support built-in.
  • Performance: Processing large volumes of multi-page scanned PDFs can be computationally intensive and time-consuming.
  • Alternatives: For mission-critical applications requiring high accuracy, especially with complex layouts or low-quality scans, commercial cloud-based OCR APIs (like Google Cloud Vision AI, AWS Textract, Azure Read API) often offer superior performance and features but come at a cost. Libraries like easyocr provide an open-source alternative that is easier to set up than Tesseract in some cases and excels with multilingual text.
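The page segmentation modes and language packs mentioned above are passed to pytesseract via its config and lang parameters. A small helper to build the config string might look like this (the PSM values are Tesseract's own: 3 is fully automatic segmentation, 6 treats the page as a single uniform block of text, 11 as sparse text):

```python
def tesseract_config(psm=3, oem=3):
    """Build a Tesseract CLI flag string for pytesseract's config= parameter.
    psm: page segmentation mode (3 = automatic, 6 = single uniform block,
    11 = sparse text). oem: OCR engine mode (3 = default)."""
    return f"--oem {oem} --psm {psm}"

# Usage (assumes pytesseract and a page image are available):
# text = pytesseract.image_to_string(page_image, lang="eng",
#                                    config=tesseract_config(psm=6))
print(tesseract_config(psm=6))  # → --oem 3 --psm 6
```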

Key Takeaways#

  • Scanned PDFs contain images, not digital text, requiring OCR for text and data extraction.
  • Python, combined with libraries like pdf2image, pytesseract, and Pillow (or OpenCV), provides a powerful toolkit for performing OCR on scanned PDFs. easyocr is another viable Python library alternative with strong multilingual capabilities.
  • Converting PDF pages to images is the first step in the process.
  • Image preprocessing (grayscaling, thresholding, deskewing, etc.) is crucial for improving OCR accuracy, especially with imperfect scans.
  • Extracting structured data from scanned documents often requires post-processing the raw OCR text using techniques like regular expressions or line-by-line analysis. Extracting tables from scanned images is particularly challenging.
  • Python OCR is widely applicable for tasks like document archiving, automated data entry, and research.
  • OCR accuracy varies, and performance considerations are important for large-scale processing. Commercial APIs or alternative libraries like easyocr might be considered for specific needs.
Source: https://dev-resources.site/posts/how-to-use-python-to-extract-text-and-data-from-scanned-pdfs-with-ocr/
Author: Dev-Resources
Published: 2025-06-30
License: CC BY-NC-SA 4.0