Member-only story
Python: Extract Data From PDF
**If you don’t have a premium account you can read this blog post here**
Intro
True story, the town I live in recently had some issues around the town budget. As I started to get involved I wanted to look at the data provided by our Board of Finance. To my misfortune, I found myself trying to analyze data in a PDF. Trying to do any sort of analysis on it was time consuming and frustrating. To make the data useful I needed to extract data from PDF and into something more flexible, in this case a simple CSV will do. Here is a sample of what we are working with:
Tooling
For this project we are going to use Pandas and Tabula. I also want to note I am using Python 3.12.2 for this project. To get Tabula up and running you will need to have Java 8+ installed as well. From there its fairly straight forward.
pip install pandas
pip install tabula-py
Setup
First we are going to create a function that uses Tabula to extract data from our source PDF. I was pleasantly surprised by how simple this is to implement. Behold the ease of use: