Information extraction using LLMs
The problem:
Saloni Dattani writes excellent articles on many topics (OK, mostly about death, but still). One of those topics is related to when different types of drugs were developed.
She mentioned a problem that she has encountered when trying to prepare data for this type of article: data on drug developments are often hard to access! There are datasets for currently available drugs, but not for historical ones, which would help visualize drug development over time. This data isn’t typically found in handy, machine-readable tables or in structured public repositories.
Instead, they’re often only compiled in books, papers, or manuals. And while those books may be digitized, the relevant data are (at best) in a semi-structured format, and typically completely unstructured.
In Saloni’s case, the data for drugs she was interested in were in 25 pdfs (dividing drug names alphabetically, skipping the letter Y) and contained information in a format like the below image. For copyright reasons the original content can’t be reproduced publicly, but I have made a recreation based on one of psychology’s greatest contributions to the scientific literature:
The process of extracting the desired information (Drug Name, Therapeutic Function, Trade Name, Country, and Year) would take a human a long time. I haven’t counted how many pages there are in total among all of the pdfs, but there are at least a couple of thousand. And there are typically multiple trade names per drug, each of which would require entry in its own separate row in an ideally structured dataset.
Let’s say there are 2000 pages, and an average of 3 trade names per drug (so 3 rows per drug to be inputted). Even if the entry for each row took 5 seconds (which I think is being pretty generous to how quickly someone could do this) this would mean a human, working nonstop, would take about 16 hours to do all of this. Those are valuable hours they could instead be using to sleep or create shareholder value. I couldn’t stand for this. Saloni deserves better.
Large language models are perfect for a repetitive task like this. Below I’ll describe how to tackle this (and similar) problems, and provide code which can be used for this purpose.
The approach:
We can break this whole problem down into 3 simple steps.
Step 1: we need to process the pdfs into text. In doing this, we need to (i) ensure that the order of content within the pdfs is maintained (i.e., content should be extracted in the order it appears visually in each book), and (ii) trim unnecessary parts of the book from the text. This latter step is important for reducing the number of input tokens, which will both keep the LLM on-task and reduce the cost of calling the OpenAI API. In particular, I noticed that all of the relevant information Saloni needed appeared before the lengthy “Manufacturing Process” section of each entry, meaning we could trim that section out completely for each drug. This was good news, because some of those descriptions are… really, really long.
Step 2: After this, we need to get our LLM running. To do this, we specify a prompt for the LLM which includes the extracted text from the step above. Additionally, we would like the output of the LLM to be deterministic (i.e., given the same input, the output should be identical across calls, which ensures reproducibility). Not everyone may know this, but effectively deterministic output can be achieved very simply by setting the temperature hyperparameter to 0. We also need to ensure that the LLM output is in a consistent format, so that it can be combined with the output from other iterations.
Step 3: Lastly, we need to iterate this process over every pdf! So…let’s do that. The code below describes this process.
The code:
Step 0: Import dependencies
Here I import the Python libraries we will need for this:
- fitz for parsing the pdfs
- re for the easy application of regular expressions for text parsing/removal
- openai.OpenAI for accessing ChatGPT’s API
- google.colab.userdata for accessing secret keys within the Colab notebook (aka, where I have stored my API key)
- pandas for handling LLM output and exporting dataframes
- os for handling file directory stuff
- io.StringIO to help parse the output from the LLM into a format readable for pandas
All of the above modules, except for fitz, are available in the Google Colab environment already. For those unfamiliar with Colab: to install fitz, you’ll need to run the commented-out line !pip install pymupdf below on your first run of this code (and again on the first run of any new Colab session).
In the section below I also initialise the OpenAI client using my API key. If you want to use this yourself, you’ll need to either ensure you have an API key stored as a Colab secret called openai_api, or replace userdata.get('openai_api') with your API key as a string (I would not recommend exposing your API key like this, though!).
#!pip install pymupdf # once this has been run in your session once, you can comment it out
import re
from openai import OpenAI
import fitz # this is just an alias for PyMuPDF, I don't know why
from google.colab import userdata
import pandas as pd
import os
from io import StringIO
# initialise openAI client with secret API key (sorry, you can't have mine, I'm broke already)
openai_client = OpenAI(
    api_key = userdata.get('openai_api'),
    timeout = 900.0 # we set a pretty long timeout (15 mins) since some of these books take some time
)
Step 1: Define text processing and data extraction function
One important further consideration not mentioned above: pdfs are weird. If you just feed them into fitz, you’ll often find odd behavior where text is not rendered in the order that you see it visually in the pdf, which leads to some problematic and undesirable output. For example, relevant text information about the drug might be rendered in the middle of a paragraph in the “Manufacturing Process” section. Since we plan to filter out the Manufacturing Process section, we would lose relevant data in this example. There is, fortunately, a simple solution: we just iterate over each page of the pdf and extract information in blocks, ordered by their coordinates on the pdf page. This keeps everything nice and orderly, and preserves the validity of any heuristics for extraction we might come up with based on the visual appearance of the pdf!
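To make the block idea concrete, here’s a minimal sketch of what fitz actually hands back for a single page (the file name is just a hypothetical placeholder): each block is a tuple of the form (x0, y0, x1, y1, text, block_no, block_type), which is why the function below sorts on block[1] (the top y coordinate) and joins together block[4] (the text).

# A minimal sketch of what fitz's "blocks" look like, using a hypothetical file name.
# Each block is a tuple (x0, y0, x1, y1, text, block_no, block_type).
doc = fitz.open("example_book_A.pdf")  # placeholder path, just for illustration
blocks = doc[0].get_text("blocks")
for block in sorted(blocks, key=lambda b: (b[1], b[0]))[:5]:
    print(round(block[1]), repr(block[4][:60]))  # y position and a snippet of the block's text

The full parsing function then looks like this: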
def parse_text_for_llm(pdf_path):
    """
    Processes a single PDF file:
    - Reads and combines text from the PDF in order.
    - Removes unwanted sections between "Manufacturing Process" and "References", which saves us a bunch of tokens.
    """
    # Load the PDF file
    doc = fitz.open(pdf_path)

    # Initialize list to store ordered text blocks
    all_text_blocks = []

    # Extract and order text blocks from all pages
    for page_num in range(doc.page_count):
        page = doc[page_num]
        blocks = page.get_text("blocks") # Extract text as blocks with coordinates

        # Sort blocks by their vertical (y) position for correct reading order
        blocks.sort(key=lambda block: (block[1], block[0])) # Sort by y, then x coordinate

        # Append sorted text from blocks to list
        page_text = " ".join(block[4] for block in blocks)
        all_text_blocks.append(page_text)

    # Combine all ordered text into a single string
    full_text = " ".join(all_text_blocks)

    # Regex pattern to match and remove unwanted sections
    unwanted_section_pattern = re.compile(r"(Manufacturing Process)(.*?)(?=References)", re.DOTALL | re.IGNORECASE)

    # Remove all instances of unwanted sections within the combined text
    cleaned_text = re.sub(unwanted_section_pattern, r"\1\n", full_text)
    cleaned_text = cleaned_text.replace(";", ",") # remove semi-colons so that they can function as separators in the csv

    return cleaned_text
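If you want to check that the trimming behaves as expected before spending any API credits, a quick look at the output for one file is enough (the path below is a hypothetical placeholder; swap in one of your own pdfs):

# Optional sanity check on a single (hypothetical) file before calling the API
sample_text = parse_text_for_llm("/content/example_book_A.pdf")  # placeholder path
print(f"{len(sample_text)} characters after trimming")
print(sample_text[:500])  # should read in visual order, with the Manufacturing Process bodies trimmed out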
Step 2: Call the LLM
We first define the prompt that the LLM will receive. Then we define a function which takes the cleaned text as input, combines it with the pre-defined prompt, and feeds this to the LLM. The prompt specifies that the LLM output should be a semi-colon-separated, csv-friendly table. Finally, we do a little bit of extra cleaning and feed the model output to pandas to turn it into a nice, exportable csv, which is then also saved in the Colab environment.
# Prepare prompt for the LLM
prompt = (
    f"Below is the content of a book containing information about different drugs. "
    f"Each section starts with the drug name, which can be identified since it will be in all caps, and followed immediately by the heading 'Therapeutic Function'. "
    f"Please extract the following information for each drug:\n"
    f"1. Drug name, 2. Therapeutic Function, 3. Trade Name, 4. Country, and 5. Year Introduced.\n"
    f"NEVER extract the manufacturer.\n"
    f"Many drugs will have multiple trade names, countries, and years of introduction; document all.\n"
    f"Output the extracted text in a csv table using semi-colons as separators with the above headings, allowing for multiple rows per drug when there are multiple trade names/countries/years.\n"
    f"Your output should begin with the first column name and with NO TRAILING CHARACTERS or text before or after the table.\n"
    f"Ensure that each row has exactly 5 columns, which correspond to the 5 column names above.\n"
    f"Do not return any commentary or content beyond the table itself. Ensure the table is in a format easily exportable to a csv and uses semi-colons as the separator.\n"
    f"You can find the book text below:\n"
)
def get_llm_response(input_text, prompt, pdf_path, save_path):
    """
    Calls the LLM, processes and returns its output as a csv:
    - Sends the cleaned text to the LLM for extraction.
    - Parses the GPT output and saves it to a CSV file.
    """
    # create the prompt for the model
    model_prompt = (f"{prompt}" f"{input_text}")

    # Call the OpenAI API to run the data extraction on the cleaned text
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        temperature=0, # I set temp to zero to ensure output is deterministic and replicable!
        messages=[{"role": "user", "content": model_prompt}]
    )

    # extract the main output of the LLM
    output = response.choices[0].message.content

    # for some reason the output comes with some stray text characters that mess up csv formatting; this removes them
    output_trimmed = re.sub(r"```(?:csv|python)?\n?", "", output).strip()

    # converts output to a csv; note that bad lines currently are skipped!
    data = pd.read_csv(StringIO(output_trimmed), delimiter=";", on_bad_lines='warn')

    # Save the data to a CSV
    # Generate a CSV file name based on the PDF file name
    base_name = os.path.splitext(os.path.basename(pdf_path))[0]
    csv_filename = f"{save_path}/{base_name}_extracted.csv"
    data.to_csv(csv_filename, sep=";", index=False)

    # give a little bit of feedback during the process to prove it's not stuck
    print(f"Processed {pdf_path} and saved output to {csv_filename}")
Step 3: Iterate the process over the pdfs
We then run the two functions defined above over the list of pdfs. When I did this myself, I pointed the script to folders in my linked Google Drive; however, below the code instead looks for the files in the Files section of Colab. Note that files stored in Files are only present for the duration of the runtime (hence why I used Drive in my case).
For illustration, the below code also creates the relevant directories and populates them with some example files, so you can see the process in action.
In the event that the code gets stopped and restarted, I also add some logic that prevents already-processed files from being reprocessed.
# Specify the folder path where you want to access the books from
# this was my path but obviously yours will differ!
folder_path = '/content/drive/MyDrive/books_to_be_processed'
save_location = '/content/drive/MyDrive/books_info_extracted_blog'

# List all files to be processed
pdf_files = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f))]

# List already processed files, removing "_extracted.csv" to match original PDF names
# this was because I did some of this iteratively for boring and dumb reasons
already_processed_files = [os.path.splitext(f.replace("_extracted", ""))[0] for f in os.listdir(save_location) if f.endswith(".csv")]

# Filter out entries in pdf_files that are in already_processed_files
incomplete_pdf_files = [f for f in pdf_files if os.path.splitext(os.path.basename(f))[0] not in already_processed_files]

# Process each unprocessed PDF file
for pdf_file in incomplete_pdf_files:
    # Run Step 1
    step_1_cleaned_text = parse_text_for_llm(pdf_file)

    # Run Step 2
    get_llm_response(input_text = step_1_cleaned_text,
                     prompt = prompt,
                     pdf_path = pdf_file,
                     save_path = save_location)
…and now we just combine the above outputs into a single dataframe!
# List of files to be combined
# using the location save_location as defined above!
files_to_be_combined = [
    os.path.join(save_location, f)
    for f in os.listdir(save_location) if os.path.isfile(os.path.join(save_location, f))
]

# empty list to store dataframes
dataframes = []

# Loop through each file and read it as a pandas dataframe
for file in files_to_be_combined:
    df = pd.read_csv(file, sep=";") # Assuming CSV uses semi-colons as separators
    dataframes.append(df)

# Combine all dataframes into a single dataframe
combined_df = pd.concat(dataframes, ignore_index=True)

# Save the combined dataframe to a new CSV
output_csv_path = os.path.join(save_location, "combined_extracted_data.csv")
combined_df.to_csv(output_csv_path, sep=";", index=False)
Wrapping up
…and that’s it! A really simple, easily modifiable workflow for extracting unstructured information from pdfs and converting it into structured data, using a combination of systematic pdf text extraction and large language models. Obviously this is only the first step, and there need to be systematic checks for data quality and accuracy in the extractions (just as we would need if a human did this). We can in principle use another call to our LLM to do a quick, high-level sanity check on a few dimensions: for example, by making sure that all entries in the “Country” column are in fact countries:
# get the full csv table
# again using the save_location directory defined above
full_data = pd.read_csv(os.path.join(save_location, "combined_extracted_data.csv"), sep = ";")

# The data is too big for one call to the LLM, so let's just sample 20% of it
# we could also split it all up, but this is just a toy example
# note: pandas' .sample() doesn't read the random module's seed, so we pass random_state directly for reproducibility
sampled_data = full_data.sample(frac=0.2, random_state=42)

# convert the dataframe to a string since LLMs require text data
data_as_string = sampled_data.to_string()
country_validation_prompt = (
    f"Below is text from a csv with 5 columns: 1. Drug name, 2. Therapeutic Function, 3. Trade Name, 4. Country, and 5. Year Introduced.\n"
    f"The country column should contain either a dash (to indicate no data) or the name of a country.\n"
    f"Please go through each row of this text and identify any instances where the entry for the 'Country' column appears erroneous; that is, where its content is neither a country nor a dash indicating no data.\n"
    f"In such cases, please tell me the row number where the erroneous entry can be found.\n"
    f"Here is the text:\n"
    f"{data_as_string}"
)

# Call the OpenAI API to run the simple validation
validation_response = openai_client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[{"role": "user", "content": country_validation_prompt}]
)

# extract the main text output of the LLM
validation_output = validation_response.choices[0].message.content
Sample output I got from this on some test data (I also ran this on the actual data, and there were about 10 actual errors; but by that point I had already made the meme below and I wanted to share it):
The following rows contain entries in the 'Country' column that appear erroneous:
- Row 39: "Switz." should likely be "Switzerland."
- Row 462: There's "S. Africa," which is a common abbreviation for "South Africa."
- Row 467: "Switz." appears again and likely represents "Switzerland."
- Row 715: "Monte Carlo" is not a country; it's an area within Monaco.
- Row 762: "S. Africa" is again used for "South Africa."
- Row 763: "E. Germany" refers to the former East Germany,
which no longer exists as a country.
- Row 928: "Switz." appears again as likely meant to be "Switzerland."
These instances reflect abbreviations or geographical areas
rather than the official or recognized country name.
Of course, it goes without saying that some human-driven validation is needed; in the same manner it would be if a human had done the extraction manually. But doing this initial step with an LLM is much quicker than a human’s manual extraction; it took about an hour for the code above to process all 25 pdfs, so at least 16 times as quick as a human by my back-of-the-envelope calculations. And this workflow can easily be adapted for other problems (e.g., I am currently using a modified version of it for the extraction of numeric values from papers).
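To give a sense of how little needs to change when adapting the workflow, here’s a hedged sketch of what a swapped-in prompt might look like for that numeric-extraction use case; the column names are invented purely for illustration, and everything else (the pdf parsing, get_llm_response, the iteration) stays exactly the same:

# Hypothetical prompt variant for pulling numeric values out of papers; the column names are made up for illustration.
numeric_extraction_prompt = (
    f"Below is the text of a scientific paper.\n"
    f"Please extract every reported sample size and effect size.\n"
    f"Output the extracted values in a csv table using semi-colons as separators, "
    f"with the columns: Study; Outcome; Sample Size; Effect Size.\n"
    f"Return only the table, with no text before or after it.\n"
    f"You can find the paper text below:\n"
)
# The rest of the workflow is unchanged: just pass numeric_extraction_prompt to get_llm_response().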
And best of all, Saloni can get back to making more cool charts!