
Python: Extract Data From PDF

Chris Albert
3 min read · Jul 5, 2024


Intro

True story: the town I live in recently had some issues around its budget. As I started to get involved, I wanted to look at the data provided by our Board of Finance. To my misfortune, I found myself trying to analyze data in a PDF. Doing any sort of analysis on it was time-consuming and frustrating. To make the data useful, I needed to extract it from the PDF into something more flexible; in this case, a simple CSV will do. Here is a sample of what we are working with:

Tooling

For this project we are going to use Pandas and Tabula (through the tabula-py package). I am using Python 3.12.2. Tabula runs on Java, so you will also need Java 8+ installed. From there it's fairly straightforward.

pip install pandas
pip install tabula-py
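
Since Tabula depends on Java, a quick check before going further can save some confusion. The snippet below is a minimal sketch, not part of the original project: `java_available` is a hypothetical helper name, and it only confirms that a `java` executable is on your PATH. It does not verify the version.

```python
import shutil


def java_available():
    """Return True if a `java` executable can be found on the PATH."""
    # shutil.which returns the full path to the executable, or None if missing.
    return shutil.which("java") is not None


if not java_available():
    print("Java was not found on the PATH; install Java 8+ before using Tabula.")
```

If the message prints, install Java and open a new terminal before moving on.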

Setup

First we are going to create a function that uses Tabula to extract data from our source PDF. I was pleasantly surprised by how simple this is to implement. Behold the ease of use:
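
As a rough illustration of what such a function might look like, here is a minimal sketch. It is an assumption, not the author's exact code: the names `extract_tables` and `save_tables`, the `reader` parameter, and the `table_N.csv` naming are all hypothetical. It relies on `tabula.read_pdf`, which returns a list of Pandas DataFrames when `multiple_tables=True`, and it writes one CSV per table because tables on different pages may not share the same columns.

```python
from pathlib import Path


def extract_tables(pdf_path, pages="all", reader=None):
    """Return the tables found in a PDF as a list of pandas DataFrames."""
    if reader is None:
        # Imported lazily so this module still loads without tabula-py/Java.
        import tabula
        reader = tabula.read_pdf
    # With multiple_tables=True, read_pdf returns one DataFrame per table.
    return reader(pdf_path, pages=pages, multiple_tables=True)


def save_tables(tables, out_dir):
    """Write each table to its own CSV file and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, table in enumerate(tables, start=1):
        path = out_dir / f"table_{number}.csv"
        # index=False keeps the DataFrame's row index out of the CSV.
        table.to_csv(path, index=False)
        paths.append(path)
    return paths
```

With a placeholder file name, usage would be along the lines of `save_tables(extract_tables("budget.pdf"), "csv_out")`. Extracted tables often need cleanup afterwards (stray header rows, merged columns), so treat the CSVs as a starting point.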
