Information extraction using LLMs
The problem:
Saloni Dattani writes excellent articles on many topics (OK, mostly about death, but still). One of those topics is related to when different types of drugs were developed.
She mentioned a problem that she has encountered when trying to prepare data for this type of article: data on drug developments are often hard to access! There are datasets for currently available drugs, but not for historical ones, which would help visualize drug development over time. This data isn’t typically found in handy, machine-readable tables or in structured public repositories.
Instead, they’re often only compiled in books, papers, or manuals. And while those books may be digitized, the relevant data are (at best) in a semi-structured format, and typically completely unstructured.
In Saloni’s case, the data for drugs she was interested in were in 25 pdfs (dividing drug names alphabetically, skipping the letter Y) and contained information in a format like the below image. For copyright reasons the original content can’t be reproduced publicly, but I have made a recreation based on one of psychology’s greatest contributions to the scientific literature:
The process of extracting the desired information (Drug Name, Therapeutic Function, Trade Name, Country, and Year) would take a human a long time. I haven’t counted how many pages there are in total among all of the pdfs, but there are at least a couple of thousand. And there are typically multiple trade names per drug, each of which would require entry in its own separate row in an ideally structured dataset.
Let’s say there are 2000 pages, and an average of 3 trade names per drug (so 3 rows per drug to be inputted). Even if the entry for each row took 5 seconds (which I think is being pretty generous to how quickly someone could do this) this would mean a human, working nonstop, would take about 16 hours to do all of this. Those are valuable hours they could instead be using to sleep or create shareholder value. I couldn’t stand for this. Saloni deserves better.
Large language models are perfect for a repetitive task like this. Below I’ll describe how to tackle this (and similar) problems, and provide code which can be used for this purpose.
The approach:
We can break this whole problem down into 3 simple steps.
Step 1: we need to process the pdfs into text. In doing this, we need to (i) ensure that the order of content within the pdfs is maintained (i.e., content should be extracted in the order it appears visually in each book), and (ii) trim unnecessary parts of the book from the text. This latter step is important for reducing the number of input tokens, which will both keep the LLM on-task and reduce the cost of calling the OpenAI API. In particular, I noticed that all of the relevant information Saloni needed appeared before the lengthy “Manufacturing Process” section of each entry, meaning we could trim that section out completely for each drug. This was good news, because some of those descriptions are… really, really long.
Step 2: After this, we need to get our LLM running. To do this, we specify a prompt for the LLM which includes the extracted text from the step above. Additionally, we would like the output of the LLM to be deterministic (i.e., given the same input, the output should be identical across calls, which ensures reproducibility). Not everyone may know this, but effectively deterministic output can be achieved very simply by setting the temperature hyperparameter to 0. We also need to ensure that the LLM output is in a consistent format, so that it can be combined with the output from other iterations.
Step 3: Lastly, we need to iterate this process over every pdf! So…let’s do that. The code below describes this process.
The code:
Step 0: Import dependencies
Here I import the Python libraries we will need for this:
- fitz for parsing the pdfs
- re for the easy application of regular expressions for text parsing/removal
- openai.OpenAI for accessing ChatGPT’s API
- google.colab.userdata for accessing secret keys within the Colab notebook (aka, where I have stored my API key)
- pandas for handling LLM output and exporting dataframes
- os for handling file directory stuff
- io.StringIO to help parse the output from the LLM into a format readable for pandas
All of the above modules, except for fitz, are available in the Google Colab environment already. For those unfamiliar with Colab: to install fitz, you’ll need to run the commented-out line !pip install pymupdf below on your first run of this code (and again on the first run of any new Colab session).
In the section below I also initialise the OpenAI client using my API key. If you want to use this yourself, you’ll need to either ensure you have an API key stored as a Colab secret called openai_api, or replace userdata.get('openai_api') with your API key as a string (I would not recommend exposing your API key like this, though!).
#!pip install pymupdf # once this has been run in your session once, you can comment it out
import re
from openai import OpenAI
import fitz # this is just an alias for PyMuPDF, I don't know why
from google.colab import userdata
import pandas as pd
import os
from io import StringIO
# initialise openAI client with secret API key (sorry, you can't have mine, I'm broke already)
openai_client = OpenAI(
    api_key = userdata.get('openai_api'),
    timeout = 900.0 # we set a pretty long timeout (15 mins) since some of these books take some time
)
Step 1: Define text processing and data extraction function
One important further consideration not mentioned above: pdfs are weird. If you just feed them into fitz, you’ll often find odd behavior where text is not rendered in the order that you see it visually in the pdf, which leads to some problematic and undesirable output. For example, relevant text information about the drug might be rendered in the middle of a paragraph in the “Manufacturing Process” section. Since we plan to filter out the Manufacturing Process section, we would lose relevant data in this example. There is, fortunately, a simple solution: we just iterate over each page of the pdf and extract information in blocks, ordered by their coordinates on the pdf page. This keeps everything nice and orderly, and preserves the validity of any heuristics for extraction we might come up with based on the visual appearance of the pdf!
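To make the block idea concrete, here’s a minimal sketch of what fitz actually hands back for a single page (the file name is just a hypothetical placeholder): each block is a tuple of the form (x0, y0, x1, y1, text, block_no, block_type), which is why the function below sorts on block[1] (the top y coordinate) and joins together block[4] (the text).

# A minimal sketch of what fitz's "blocks" look like, using a hypothetical file name.
# Each block is a tuple (x0, y0, x1, y1, text, block_no, block_type).
doc = fitz.open("example_book_A.pdf")  # placeholder path, just for illustration
blocks = doc[0].get_text("blocks")
for block in sorted(blocks, key=lambda b: (b[1], b[0]))[:5]:
    print(round(block[1]), repr(block[4][:60]))  # y position and a snippet of the block's text

The full parsing function then looks like this: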
def parse_text_for_llm(pdf_path):
    """
    Processes a single PDF file:
    - Reads and combines text from the PDF in order.
    - Removes unwanted sections between "Manufacturing Process" and "References", which saves us a bunch of tokens.
    """
    # Load the PDF file
    doc = fitz.open(pdf_path)

    # Initialize list to store ordered text blocks
    all_text_blocks = []

    # Extract and order text blocks from all pages
    for page_num in range(doc.page_count):
        page = doc[page_num]
        blocks = page.get_text("blocks") # Extract text as blocks with coordinates

        # Sort blocks by their vertical (y) position for correct reading order
        blocks.sort(key=lambda block: (block[1], block[0])) # Sort by y, then x coordinate

        # Append sorted text from blocks to list
        page_text = " ".join(block[4] for block in blocks)
        all_text_blocks.append(page_text)

    # Combine all ordered text into a single string
    full_text = " ".join(all_text_blocks)

    # Regex pattern to match and remove unwanted sections
    unwanted_section_pattern = re.compile(r"(Manufacturing Process)(.*?)(?=References)", re.DOTALL | re.IGNORECASE)

    # Remove all instances of unwanted sections within the combined text
    cleaned_text = re.sub(unwanted_section_pattern, r"\1\n", full_text)
    cleaned_text = cleaned_text.replace(";", ",") # remove semi-colons so that they can function as separators in the csv

    return cleaned_text
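If you want to check that the trimming behaves as expected before spending any API credits, a quick look at the output for one file is enough (the path below is a hypothetical placeholder; swap in one of your own pdfs):

# Optional sanity check on a single (hypothetical) file before calling the API
sample_text = parse_text_for_llm("/content/example_book_A.pdf")  # placeholder path
print(f"{len(sample_text)} characters after trimming")
print(sample_text[:500])  # should read in visual order, with the Manufacturing Process bodies trimmed out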
Step 2: Call the LLM
We first define the prompt that the LLM will receive. Then we define a function which takes the cleaned text as input, combines it with the pre-defined prompt, and feeds this to the LLM. The prompt specifies that the LLM output should be a semi-colon-separated, csv-friendly table. Finally, we do a little bit of extra cleaning and feed the model output to pandas to turn it into a nice, exportable csv, which is then also saved in the Colab environment.
# Prepare prompt for the LLM
prompt = (
    f"Below is the content of a book containing information about different drugs. "
    f"Each section starts with the drug name, which can be identified since it will be in all caps, and followed immediately by the heading 'Therapeutic Function'. "
    f"Please extract the following information for each drug:\n"
    f"1. Drug name, 2. Therapeutic Function, 3. Trade Name, 4. Country, and 5. Year Introduced.\n"
    f"NEVER extract the manufacturer.\n"
    f"Many drugs will have multiple trade names, countries, and years of introduction; document all.\n"
    f"Output the extracted text in a csv table using semi-colons as separators with the above headings, allowing for multiple rows per drug when there are multiple trade names/countries/years.\n"
    f"Your output should begin with the first column name and with NO TRAILING CHARACTERS or text before or after the table.\n"
    f"Ensure that each row has exactly 5 columns, which correspond to the 5 column names above.\n"
    f"Do not return any commentary or content beyond the table itself. Ensure the table is in a format easily exportable to a csv and uses semi-colons as the separator.\n"
    f"You can find the book text below:\n"
)
def get_llm_response(input_text, prompt, pdf_path, save_path):
    """
    Calls the LLM, processes and returns its output as a csv:
    - Sends the cleaned text to the LLM for extraction.
    - Parses the GPT output and saves it to a CSV file.
    """
    # create the prompt for the model
    model_prompt = (f"{prompt}" f"{input_text}")

    # Call the OpenAI API to run the data extraction on the cleaned text
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        temperature=0, # I set temp to zero to ensure output is deterministic and replicable!
        messages=[{"role": "user", "content": model_prompt}]
    )

    # extract the main output of the LLM
    output = response.choices[0].message.content

    # for some reason the output comes with some stray text characters that mess up csv formatting; this removes them
    output_trimmed = re.sub(r"```(?:csv|python)?\n?", "", output).strip()

    # converts output to a csv; note that bad lines currently are skipped!
    data = pd.read_csv(StringIO(output_trimmed), delimiter=";", on_bad_lines='warn')

    # Save the data to a CSV
    # Generate a CSV file name based on the PDF file name
    base_name = os.path.splitext(os.path.basename(pdf_path))[0]
    csv_filename = f"{save_path}/{base_name}_extracted.csv"
    data.to_csv(csv_filename, sep=";", index=False)

    # give a little bit of feedback during the process to prove it's not stuck
    print(f"Processed {pdf_path} and saved output to {csv_filename}")
Step 3: Iterate the process over the pdfs
We then run the two functions defined above over the list of pdfs. When I did this myself, I pointed the script to folders in my linked Google Drive; however, below the code instead looks for the files in the Files section of Colab. Note that files stored in Files are only present for the duration of the runtime (hence why I used Drive in my case).
For illustration, the below code also creates the relevant directories and populates them with some example files, so you can see the process in action.
In the event that the code gets stopped and restarted, I also add some logic that prevents already-processed files from being reprocessed.
# Specify the folder path where you want to access the books from
# this was my path but obviously yours will differ!
folder_path = '/content/drive/MyDrive/books_to_be_processed'
save_location = '/content/drive/MyDrive/books_info_extracted_blog'

# List all files to be processed
pdf_files = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f))]

# List already processed files, removing "_extracted.csv" to match original PDF names
# this was because I did some of this iteratively for boring and dumb reasons
already_processed_files = [os.path.splitext(f.replace("_extracted", ""))[0] for f in os.listdir(save_location) if f.endswith(".csv")]

# Filter out entries in pdf_files that are in already_processed_files
incomplete_pdf_files = [f for f in pdf_files if os.path.splitext(os.path.basename(f))[0] not in already_processed_files]

# Process each unprocessed PDF file
for pdf_file in incomplete_pdf_files:
    # Run Step 1
    step_1_cleaned_text = parse_text_for_llm(pdf_file)

    # Run Step 2
    get_llm_response(input_text = step_1_cleaned_text,
                     prompt = prompt,
                     pdf_path = pdf_file,
                     save_path = save_location)
…and now we just combine the above outputs into a single dataframe!
# List of files to be combined
# using the location save_location as defined above!
files_to_be_combined = [
    os.path.join(save_location, f)
    for f in os.listdir(save_location) if os.path.isfile(os.path.join(save_location, f))
]

# empty list to store dataframes
dataframes = []

# Loop through each file and read it as a pandas dataframe
for file in files_to_be_combined:
    df = pd.read_csv(file, sep=";") # Assuming CSV uses semi-colons as separators
    dataframes.append(df)

# Combine all dataframes into a single dataframe
combined_df = pd.concat(dataframes, ignore_index=True)

# Save the combined dataframe to a new CSV
output_csv_path = os.path.join(save_location, "combined_extracted_data.csv")
combined_df.to_csv(output_csv_path, sep=";", index=False)
Wrapping up
…and that’s it! A really simple, easily modifiable workflow for extracting unstructured information from pdfs and converting it into structured data, using a combination of systematic pdf text extraction and large language models. Obviously this is only the first step, and there need to be systematic checks for data quality and accuracy in the extractions (just as we would need if a human did this). We can in principle use another call to our LLM to do a quick, high-level sanity check on a few dimensions: for example, by making sure that all entries in the “Country” column are in fact countries:
# get the full csv table
# again using the save_location directory defined above
full_data = pd.read_csv(os.path.join(save_location, "combined_extracted_data.csv"), sep = ";")

# The data is too big for one call to the LLM, so let's just sample 20% of it
# we could also split it all up, but this is just a toy example
# note: pandas' .sample() doesn't read the random module's seed, so we pass random_state directly for reproducibility
sampled_data = full_data.sample(frac=0.2, random_state=42)

# convert the dataframe to a string since LLMs require text data
data_as_string = sampled_data.to_string()
country_validation_prompt = (
    f"Below is text from a csv with 5 columns: 1. Drug name, 2. Therapeutic Function, 3. Trade Name, 4. Country, and 5. Year Introduced.\n"
    f"The country column should contain either a dash (to indicate no data) or the name of a country.\n"
    f"Please go through each row of this text and identify any instances where the entry for the 'Country' column appears erroneous; that is, where its content is neither a country nor a dash indicating no data.\n"
    f"In such cases, please tell me the row number where the erroneous entry can be found.\n"
    f"Here is the text:\n"
    f"{data_as_string}"
)

# Call the OpenAI API to run the simple validation
validation_response = openai_client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[{"role": "user", "content": country_validation_prompt}]
)

# extract the main text output of the LLM
validation_output = validation_response.choices[0].message.content
Sample output I got from this on some test data (I also ran this on the actual data, and there were about 10 actual errors; but by that point I had already made the meme below and I wanted to share it):
The following rows contain entries in the 'Country' column that appear erroneous:
- Row 39: "Switz." should likely be "Switzerland."
- Row 462: There's "S. Africa," which is a common abbreviation for "South Africa."
- Row 467: "Switz." appears again and likely represents "Switzerland."
- Row 715: "Monte Carlo" is not a country; it's an area within Monaco.
- Row 762: "S. Africa" is again used for "South Africa."
- Row 763: "E. Germany" refers to the former East Germany,
which no longer exists as a country.
- Row 928: "Switz." appears again as likely meant to be "Switzerland."
These instances reflect abbreviations or geographical areas
rather than the official or recognized country name.
Of course, it goes without saying that some human-driven validation is needed; in the same manner it would be if a human had done the extraction manually. But doing this initial step with an LLM is much quicker than a human’s manual extraction; it took about an hour for the code above to process all 25 pdfs, so at least 16 times as quick as a human by my back-of-the-envelope calculations. And this workflow can easily be adapted for other problems (e.g., I am currently using a modified version of it for the extraction of numeric values from papers).
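To give a sense of how little needs to change when adapting the workflow, here’s a hedged sketch of what a swapped-in prompt might look like for that numeric-extraction use case; the column names are invented purely for illustration, and everything else (the pdf parsing, get_llm_response, the iteration) stays exactly the same:

# Hypothetical prompt variant for pulling numeric values out of papers; the column names are made up for illustration.
numeric_extraction_prompt = (
    f"Below is the text of a scientific paper.\n"
    f"Please extract every reported sample size and effect size.\n"
    f"Output the extracted values in a csv table using semi-colons as separators, "
    f"with the columns: Study; Outcome; Sample Size; Effect Size.\n"
    f"Return only the table, with no text before or after it.\n"
    f"You can find the paper text below:\n"
)
# The rest of the workflow is unchanged: just pass numeric_extraction_prompt to get_llm_response().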
And best of all, Saloni can get back to making more cool charts!