You can use this notebook to build a pipeline to parse and extract data from OCRed PDF files.

Warning: When using LLMs for entity extraction, be sure to perform extensive quality control. They are very susceptible to distracting language (latching on to text that sounds “kind of like” what you’re looking for) and missing language (making up content to fill any holes), and importantly, they do NOT provide any hints as to when they may be erring. You need to make sure random audits are part of your workflow!

Below we’ve worked out a workflow using regular expressions and LLMs to parse data from zoning board orders, but the process is generalizable.

1. Collect a set of PDFs
2. Place OCRed PDFs into the data folder
3. Write regexes to pull out data
4. Write LLM prompts to pull out data
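To make the regex step and the random-audit advice concrete, here is a minimal sketch in Python. It is not the notebook's own code: the sample text, the field names, the patterns, and the helpers `extract_fields` and `pick_audit_sample` are all hypothetical, and the inline strings stand in for text that would really come from the OCRed PDFs in the data folder. The LLM step is left out because it depends on the provider you use.

```python
import random
import re

# Hypothetical OCR output keyed by filename. In a real run this text would be
# read from the OCRed PDFs placed in the data folder.
SAMPLE_ORDERS = {
    "order_001.pdf": "CASE NO. 2021-014\nDate of Decision: March 3, 2021\nThe variance is GRANTED.",
    "order_002.pdf": "CASE NO. 2021-027\nDate of Decision: April 12, 2021\nThe variance is DENIED.",
    "order_003.pdf": "Case No. 2021-031\nThe special permit is granted with conditions.",
}

# Assumed patterns for illustration; real zoning orders will need their own.
PATTERNS = {
    "case_number": re.compile(r"case\s+no\.?\s*([\d-]+)", re.IGNORECASE),
    "decision_date": re.compile(
        r"date\s+of\s+decision:\s*([A-Za-z]+\s+\d{1,2},\s+\d{4})", re.IGNORECASE
    ),
    "outcome": re.compile(r"\b(granted|denied)\b", re.IGNORECASE),
}


def extract_fields(text):
    """Return one value per field, or None when the pattern finds nothing."""
    record = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        record[field] = match.group(1) if match else None
    return record


def pick_audit_sample(filenames, k, seed=None):
    """Choose up to k files at random for a human to check by hand."""
    rng = random.Random(seed)
    names = sorted(filenames)
    return rng.sample(names, min(k, len(names)))


records = {name: extract_fields(text) for name, text in SAMPLE_ORDERS.items()}

# Print the extracted fields for the audit sample so they can be compared
# against the source documents.
for name in pick_audit_sample(records, k=2, seed=0):
    print(name, records[name])
```

A missing match is stored as None instead of being guessed at, so gaps stay visible during review. The same audit sample can be drawn from LLM-extracted records, which is where the warning above matters most.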
https://github.com/colarusso/entity_extraction/blob/main/PDF%20Entity%20Extraction%20with%20Regex%20and%20LLMs.ipynb
A Jupyter notebook to extract data from PDFs. Useful stuff.