Extracting Tables from PDFs: A Guide Using Python with Tabula and Camelot
Extracting tabular data from PDF documents is a significant challenge in data processing and automation workflows. While PDFs ensure consistent formatting across systems, their structure makes machine-readable extraction, particularly of tables, non-trivial: data is often embedded as visual elements rather than as structured rows and columns that standard text extraction tools can parse. The need arises frequently with financial reports, research papers, government statistics, and other automated data collection scenarios where information is disseminated in PDF format.
Python offers powerful libraries designed specifically to address this challenge. Two prominent libraries for extracting tables from PDFs are Tabula-py (a Python wrapper for Tabula) and Camelot. Both provide programmatic ways to identify and extract tabular data, converting it into formats like CSV or pandas DataFrames for further analysis or processing.
The Need for PDF Table Extraction
The demand for extracting structured data from PDFs stems from several key areas:
- Data Analysis: Researchers, analysts, and data scientists frequently encounter reports or datasets embedded within PDFs. Extracting this data into a structured format is the first step before any analysis can begin.
- Automation: Automating processes that rely on data from PDF reports (e.g., updating databases, generating summaries, comparing figures) requires reliable programmatic extraction. Manual copy-pasting is time-consuming and prone to errors.
- Reporting and Compliance: Organizations may receive financial statements, invoices, or regulatory documents in PDF format. Extracting key figures is essential for internal reporting and compliance checks.
- Digitization of Information: Converting legacy or published PDF documents into searchable, structured datasets makes the information more accessible and usable.
Traditional text extraction tools often fail with tables because they might read across columns or misinterpret borders and spacing. Libraries like Tabula and Camelot employ more sophisticated techniques to identify the boundaries and structure of tables within a PDF page.
Introducing Tabula and Camelot
Tabula-py and Camelot are specialized Python libraries built for the task of PDF table extraction.
- Tabula-py: This library is a Python wrapper for the Java-based Tabula library. It is particularly effective at extracting tables from native PDFs, which are documents created directly from a digital source where text and structure information are still present (though potentially challenging to access). Tabula works well when table cell boundaries are clearly defined or inferred from spacing.
- Camelot: Camelot is a Python library that offers more flexibility, designed to handle a wider variety of text-based PDFs, including those with complex layouts or unclear borders. It provides two parsing “flavors” (“Lattice” and “Stream”) to tackle different table structures. Camelot often requires more tuning and parameter adjustment for optimal results, but it can succeed where simpler tools fail. Note that, like Tabula, Camelot works only on text-based PDFs; purely scanned (image-only) documents need an OCR pass first, since neither library performs OCR.
Both libraries provide the extracted data in a structured format, commonly as a list of pandas DataFrames, making integration with standard Python data science workflows seamless.
Getting Started: Installation
Before using Tabula-py or Camelot, installation is necessary. Both have some dependencies.
Installing Tabula-py
Tabula-py requires a Java Runtime Environment (JRE) to be installed on the system because it uses the underlying Java library. Ensure Java is installed and accessible from your system’s PATH.
```bash
pip install tabula-py
```

To verify the Java installation, open a terminal or command prompt and run `java -version`. If Java is not found, install it according to your operating system's instructions.
Installing Camelot
Camelot requires Ghostscript, a suite of software for interpreting PostScript and PDF files. Camelot uses Ghostscript to render PDF pages as images, which is how the Lattice flavor detects the ruling lines that separate table cells. Ensure Ghostscript is installed and accessible from your system’s PATH.
```bash
pip install "camelot-py[cv]" opencv-python pandas
```

The `[cv]` extra installs the image-processing dependencies (OpenCV) that Camelot uses. pandas is typically needed to work with the extracted data.
To verify Ghostscript installation, open a terminal or command prompt and type gs --version. If Ghostscript is not found, install it from the official Ghostscript website or via a package manager (like brew install ghostscript on macOS, sudo apt-get install ghostscript on Debian/Ubuntu, or downloading the installer on Windows).
Using Tabula-py for Table Extraction
Tabula-py provides a straightforward function, read_pdf(), to extract tables.
Step-by-Step Extraction with Tabula-py
1. Import the library:

   ```python
   import tabula
   ```

2. Specify the PDF file:

   ```python
   pdf_path = "path/to/your/document.pdf"
   ```

3. Extract tables: the core function is `tabula.read_pdf()`.

   ```python
   tables = tabula.read_pdf(pdf_path, pages='all', multiple_tables=True)
   ```

   - `pdf_path`: the path to the PDF file.
   - `pages='all'`: extract from all pages. Can also be a single page (`'1'`), a range (`'1-5'`), or a list of pages (`'1,3,5'`).
   - `multiple_tables=True`: attempt to find and extract multiple tables on a single page.
   - `guess=True` (the default): Tabula attempts to guess the table area and structure. Setting this to `False` may require specifying table areas manually with the `area` parameter.

4. Process the extracted tables: `read_pdf` returns a list of pandas DataFrames, one per table found in the PDF.

   ```python
   for i, table in enumerate(tables):
       print(f"--- Table {i+1} ---")
       print(table.head())  # Print first few rows
       # Further processing or saving
   ```

5. Export data: Tabula-py can also convert directly to formats like CSV or JSON.

   ```python
   tabula.convert_into(pdf_path, "output.csv", output_format="csv", pages='all')
   # Or save individual DataFrames from the 'tables' list
   tables[0].to_csv("first_table.csv", index=False)
   ```
Tabula-py Code Example
```python
import tabula
import pandas as pd

pdf_file = "example_report.pdf"  # Replace with your PDF file path

try:
    # Extract tables from all pages, allowing multiple tables per page
    list_of_dfs = tabula.read_pdf(pdf_file, pages='all',
                                  multiple_tables=True, guess=True)

    if list_of_dfs:
        print(f"Found {len(list_of_dfs)} tables.")

        # Access and display the first table
        first_table = list_of_dfs[0]
        print("\nFirst extracted table (first 5 rows):")
        print(first_table.head().to_markdown(index=False))  # to_markdown for clear printing

        # Example: save all tables to separate CSV files
        for i, df in enumerate(list_of_dfs):
            df.to_csv(f"table_{i+1}.csv", index=False)
            print(f"Saved table {i+1} to table_{i+1}.csv")
    else:
        print("No tables found in the PDF.")

except Exception as e:
    print(f"An error occurred: {e}")

# Example of direct conversion to CSV
# try:
#     tabula.convert_into(pdf_file, "all_tables.csv", output_format="csv", pages='all')
#     print("\nAll tables converted and saved to all_tables.csv")
# except Exception as e:
#     print(f"Error during direct conversion: {e}")
```

Note: the `example_report.pdf` file must exist and contain tables for this code to run.
Tabula-py excels with PDFs where tables have clear visual separation and consistent spacing, often found in reports generated from databases or spreadsheets.
Using Camelot for Table Extraction
Camelot offers two primary parsing “flavors”:
- Lattice: Assumes the table has a grid-like structure with clearly defined lines separating cells. This is suitable for tables with borders.
- Stream: Suitable for tables without borders, where cells are separated by whitespace. It works by grouping text elements based on their proximity.
Camelot’s strength lies in handling less structured PDFs and borderless tables, though it often requires parameter tuning to achieve the best results.
Step-by-Step Extraction with Camelot
1. Import the library:

   ```python
   import camelot
   ```

2. Specify the PDF file:

   ```python
   pdf_path = "path/to/your/document.pdf"
   ```

3. Extract tables with `camelot.read_pdf()`:

   ```python
   tables = camelot.read_pdf(pdf_path, pages='all', flavor='lattice')  # Or flavor='stream'
   ```

   - `pdf_path`: the path to the PDF file.
   - `pages='all'`: pages to extract from.
   - `flavor='lattice'` or `flavor='stream'`: the parsing strategy. 'lattice' is the default and a good starting point for bordered tables; 'stream' is better for borderless tables.

4. Access the extracted tables: `camelot.read_pdf` returns a `TableList`, a list of `Table` objects. Each `Table` holds a pandas DataFrame with the extracted data.

   ```python
   print(f"Found {tables.n} tables.")  # tables.n gives the number of tables found

   for i, table in enumerate(tables):
       print(f"\n--- Table {i+1} ---")
       print("Data (first 5 rows):")
       print(table.df.head().to_markdown(index=False))  # Access the DataFrame via .df

       # Access the parsing report (helpful for debugging)
       print("\nParsing Report:")
       print(table.parsing_report)
   ```

   The `parsing_report` includes metrics such as the detection accuracy (`accuracy`) and the percentage of whitespace in the detected table (`whitespace`).

5. Export data: `TableList` and `Table` objects provide export methods.

   ```python
   tables.export('output.csv', f='csv', compress=False)  # Export all tables
   # Or export a single table
   tables[0].to_csv('first_table_camelot.csv', index=False)
   ```

6. Debug with plotting: Camelot can plot the detected tables and their cells, which is invaluable for tuning parameters.

   ```python
   # Requires matplotlib
   # camelot.plot(tables[0], kind='grid')
   # import matplotlib.pyplot as plt
   # plt.show()
   ```
Camelot Code Example
```python
import camelot
import pandas as pd

pdf_file = "example_complex_report.pdf"  # Replace with your PDF file path

# Consider trying 'lattice' first, then 'stream' if needed
extraction_flavor = 'lattice'  # or 'stream'

try:
    # Extract tables using the specified flavor
    tables = camelot.read_pdf(pdf_file, pages='all', flavor=extraction_flavor)

    print(f"Found {tables.n} tables using '{extraction_flavor}' flavor.")

    if tables.n > 0:
        # Access and display the first table
        first_table = tables[0]
        print("\nFirst extracted table (first 5 rows):")
        print(first_table.df.head().to_markdown(index=False))

        print("\nParsing Report for first table:")
        print(first_table.parsing_report)

        # Example: save all tables to separate CSV files
        for i, table in enumerate(tables):
            table.to_csv(f"camelot_table_{i+1}.csv", index=False)
            print(f"Saved table {i+1} to camelot_table_{i+1}.csv")
    else:
        print("No tables found in the PDF using the current settings.")

except Exception as e:
    print(f"An error occurred: {e}")
    print("Ensure Ghostscript is installed and in your PATH.")
```

Note: the `example_complex_report.pdf` file must exist and contain tables for this code to run.
Camelot’s flexibility, particularly with the ‘stream’ flavor and options for line/text detection, makes it a robust choice for challenging PDF documents.
Choosing Between Tabula and Camelot
Selecting the appropriate library depends largely on the nature of the PDF document:
| Feature | Tabula-py | Camelot |
|---|---|---|
| Primary Use | Native PDFs, well-structured tables | Native PDFs with complex layouts or borderless tables |
| Table Types | Best with bordered or clearly spaced | Lattice (bordered), Stream (borderless) |
| Dependencies | Java | Ghostscript, OpenCV |
| Ease of Use | Often simpler for standard native PDFs | Can require more parameter tuning |
| Flexibility | Less flexible with varying layouts | More flexible with different parsing flavors |
| Output Format | pandas DataFrame, direct CSV/JSON export | pandas DataFrame, multiple export formats |
| Debugging Aids | Limited | Plotting functions for visual debugging |
- For native PDFs with clear tables: start with Tabula-py. Its `guess=True` default often works well out of the box.
- For complex native PDFs: Camelot is generally the better choice. Experiment with both 'lattice' and 'stream' flavors, use Camelot's plotting feature to understand why extraction is failing, and tune parameters such as `table_areas` and `row_tol`.
In some cases, trying one library and falling back to the other if the first fails is a practical approach.
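This fallback pattern can be sketched as a small helper. The `first_successful` function below is hypothetical (not part of either library's API), and the actual tabula/camelot calls appear only in the usage comment:

```python
import pandas as pd

def first_successful(extractors, pdf_path):
    """Try each (name, callable) extractor in order and return the
    first non-empty list of DataFrames it produces."""
    for name, extract in extractors:
        try:
            tables = extract(pdf_path)
        except Exception:
            continue  # missing dependency, parse failure, etc.
        if tables:
            return name, tables
    return None, []

# Usage sketch, assuming both libraries are installed:
# import tabula, camelot
# name, tables = first_successful([
#     ("tabula", lambda p: tabula.read_pdf(p, pages='all', multiple_tables=True)),
#     ("camelot", lambda p: [t.df for t in camelot.read_pdf(p, pages='all')]),
# ], "report.pdf")
```

Wrapping each attempt in `try/except` means a Java or Ghostscript problem with one library does not abort the whole run.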
Troubleshooting Common Issues
- Dependencies not found: Ensure Java (for Tabula-py) and Ghostscript (for Camelot) are installed correctly and are accessible via your system’s PATH. Installation instructions vary by operating system.
- No tables detected or incorrect extraction:
  - Verify that the PDF actually contains tables and is text-based rather than a scanned image.
  - Try different `pages` specifications.
  - For Tabula, try setting `guess=False` and specifying the table area manually with the `area` parameter (this requires identifying coordinates, for example with the Tabula web tool or a PDF reader's coordinate display).
  - For Camelot, try the other `flavor` ('lattice' vs. 'stream'), use the plotting feature to visualize what Camelot sees, specify `table_areas` manually if needed, and adjust parameters such as `row_tol` or `column_tol`.
- Merged cells or complex headers: These are challenging for both libraries. Post-processing the extracted DataFrame with pandas might be necessary to handle merged cells or multi-row headers.
- Multi-page tables: Both libraries extract tables page by page. Combining parts of a table spread across multiple pages typically requires custom code to concatenate the relevant DataFrames based on header identification or page order.
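Both post-processing fixes usually come down to a few lines of pandas. The snippet below is a generic sketch over made-up data (not output from either library): it promotes a misread header row and concatenates fragments of a table split across pages.

```python
import pandas as pd

def promote_header(df):
    """Use the first extracted row as the column header
    (common when a table's header row is read as data)."""
    fixed = df.iloc[1:].reset_index(drop=True)
    fixed.columns = df.iloc[0].tolist()
    return fixed

# A table whose header row came through as data
raw = pd.DataFrame([["Item", "Q1"], ["Revenue", "100"], ["Expenses", "60"]])
table = promote_header(raw)

# Two fragments of one logical table extracted from consecutive pages
page1 = table.iloc[:1]
page2 = table.iloc[1:]
combined = pd.concat([page1, page2], ignore_index=True)
```

In practice `page1` and `page2` would be DataFrames returned by `read_pdf`, matched up by comparing their headers or relying on page order.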
Real-World Application: Automating Financial Data Extraction
Consider a scenario where an analyst needs to regularly extract quarterly revenue and expense tables from PDF reports published by various companies. Manually copying and pasting this data from dozens of PDFs is inefficient and error-prone.
Using Python with Tabula-py or Camelot can automate this process:
- A script identifies the PDF reports to be processed.
- For each PDF, the script calls `tabula.read_pdf()` or `camelot.read_pdf()`, potentially with page numbers or area specifications refined for the known report structure.
- The script processes the resulting DataFrames to identify the tables containing revenue and expense data (e.g., by searching for keywords in column headers).
- Relevant data points are extracted from these tables.
- The extracted data is consolidated into a master spreadsheet or database.
- The script runs on a schedule or is triggered when new reports are available.
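The table-identification step above can be sketched as a keyword search over column headers. The `find_table` helper and the stand-in DataFrames are illustrative, not part of either library's API:

```python
import pandas as pd

def find_table(tables, keywords):
    """Return the first DataFrame whose column headers contain
    all of the given keywords (case-insensitive)."""
    for df in tables:
        header = " ".join(str(c).lower() for c in df.columns)
        if all(k.lower() in header for k in keywords):
            return df
    return None

# Stand-in DataFrames; in practice `tables` would come from
# tabula.read_pdf() or [t.df for t in camelot.read_pdf()]
tables = [
    pd.DataFrame(columns=["Region", "Headcount"]),
    pd.DataFrame(columns=["Quarter", "Revenue", "Expenses"]),
]
revenue_table = find_table(tables, ["revenue", "expenses"])
```

For reports with stable layouts, matching on known headers like this is usually more robust than assuming a fixed table index per page.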
This automation significantly reduces the time spent on data collection, frees the analyst for higher-value work like analysis, and improves data accuracy by eliminating manual transcription errors. The choice between Tabula and Camelot would depend on the consistency and structure of the PDF reports from different companies; Camelot might be preferred if the reports vary significantly in layout, while reports delivered as scanned images would need an OCR step before either library could be used.
Key Takeaways
- Extracting tables from PDFs programmatically saves significant time and improves accuracy compared to manual methods.
- Python libraries Tabula-py and Camelot are powerful tools for this task.
- Tabula-py is generally well-suited for native PDFs with clear table structures and requires Java.
- Camelot offers more flexibility for complex or borderless layouts via its ‘lattice’ and ‘stream’ parsing flavors, and it requires Ghostscript.
- Both libraries return extracted data as pandas DataFrames, facilitating further analysis.
- Choosing the right library depends on the PDF’s characteristics (bordered vs. borderless tables, layout complexity); scanned PDFs require OCR before either library can be used.
- Troubleshooting often involves ensuring dependencies are installed, trying different parsing parameters, and potentially specifying table areas manually.
- Real-world applications include automating data collection from financial reports, research papers, and other documents distributed in PDF format.