Python: Extract Data From PDF

Chris Albert
3 min read · Jul 5, 2024


Intro

True story: the town I live in recently had some issues around the town budget. As I started to get involved, I wanted to look at the data provided by our Board of Finance. To my misfortune, I found myself trying to analyze data in a PDF. Doing any sort of analysis on it was time-consuming and frustrating. To make the data useful, I needed to extract it from the PDF into something more flexible; in this case, a simple CSV will do. Here is a sample of what we are working with:

Tooling

For this project we are going to use pandas and Tabula (through the tabula-py package, a Python wrapper around tabula-java). I also want to note that I am using Python 3.12.2. To get Tabula up and running, you will need to have Java 8+ installed as well. From there it's fairly straightforward.

pip install pandas
pip install tabula-py
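
Before going further, it can help to confirm that the pieces above are actually visible to Python. The snippet below is a minimal sketch, not part of the original project: `check_prerequisites` is a hypothetical helper name, and it only checks that a `java` executable is on the PATH and that the `pandas` and `tabula` modules can be located. It does not verify the Java version.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Report whether a Java runtime and the two Python packages can be found."""
    return {
        # tabula-py shells out to Java, so a `java` executable must be on the PATH
        "java": shutil.which("java") is not None,
        # find_spec returns None when a top-level module is not installed
        "pandas": importlib.util.find_spec("pandas") is not None,
        # the tabula-py package is imported under the name `tabula`
        "tabula": importlib.util.find_spec("tabula") is not None,
    }


for name, found in check_prerequisites().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If anything is reported as missing, install it before moving on to the extraction step.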

Setup

First, we are going to create a function that uses Tabula to extract data from our source PDF. I was pleasantly surprised by how simple this is to implement. Behold the ease of use:
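
What follows is a minimal sketch of what such a function could look like, not necessarily the exact code used for the town budget. It assumes tabula-py's `read_pdf` with `pages="all"`, which returns one DataFrame per detected table. The helper names (`extract_tables`, `combine_tables`, `pdf_to_csv`) are hypothetical, and stacking the tables with `pd.concat` assumes they share the same columns, which may not hold for every PDF.

```python
import pandas as pd


def extract_tables(pdf_path, pages="all"):
    """Return every table tabula-py detects in the PDF as a list of DataFrames."""
    # Imported lazily so that combine_tables below stays usable on a machine
    # without Java or tabula-py installed.
    import tabula

    # With multiple_tables=True, read_pdf returns one DataFrame per detected table.
    return tabula.read_pdf(str(pdf_path), pages=pages, multiple_tables=True)


def combine_tables(tables):
    """Stack the extracted tables into one DataFrame, dropping fully empty rows."""
    if not tables:
        return pd.DataFrame()
    combined = pd.concat(tables, ignore_index=True)
    return combined.dropna(how="all").reset_index(drop=True)


def pdf_to_csv(pdf_path, csv_path, pages="all"):
    """Extract all tables from pdf_path and write them to csv_path as one CSV."""
    combined = combine_tables(extract_tables(pdf_path, pages=pages))
    combined.to_csv(csv_path, index=False)
    return combined
```

With a file in the working directory, a call such as `pdf_to_csv("budget.pdf", "budget.csv")` would write the combined tables to CSV; both file names here are placeholders. Keeping the extraction and the combining in separate functions makes it easier to inspect the raw tables when Tabula guesses the layout wrong.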
