As part of my PhD, I work with neuroimaging data that lends itself to a specific kind of modeling. But to stay sharp (and, honestly, to keep things interesting), I’ve been trying to experiment with other types of data and modeling techniques in my downtime. At worst, I gain some experience with new tools. At best, I learn something that can feed back into my research project.
This post is about one of those side projects: an attempt to use large language models (LLMs) to extract biomedical knowledge from unstructured text and turn it into a graph-based representation of biochemical pathways.
This idea has roots tracing back to my undergrad days, but it really kicked into gear when I came across the Drug Review dataset in the UCI Machine Learning Repository. The dataset is a collection of patient-written drug reviews, complete with ratings and reported side effects. Suffice it to say, though, the dataset caught my eye for a completely different reason than its original purpose of supporting sentiment classification research.
In this post, I’ll walk through what motivated the project, what I built and learned along the way, and what I’d explore next if I had more time.
The Backstory
Nearly ten years ago, two of the most foundational courses I took in undergrad, Biochemistry and Neurochemical Foundations of Behavior, sparked a long-standing interest in how molecules interact to shape biology and behavior. What fascinated me most was the ligand-receptor relationship: how a single molecule (ligand) binding to a protein (receptor) could kick off a cascade of effects.
Mapping out signaling pathways by hand became my way of learning in those classes. I needed to trace the molecule and protein interactions to really understand class lectures and papers we read as assignments. The maps tended to be straightforward for each lecture or paper, but when everything needed to come together for exam prep, the maps started to look like messy, tangled webs. The same proteins and molecules popped up across different systems, their roles overlapping and intertwining. The concept of a “side effect” really started to make more sense, not as an exception, but as an inevitable consequence of shared pathways.
As I moved forward in my education and career, learning more molecules and more pathways brought increasing complexity. A part of me started to wonder if this could all be organized somehow. Could we build a structured, centralized map of these interactions, one that made the connections between molecules, proteins, and phenotypes easier to visualize and comprehend?
That’s when graph theory began to resonate with me. I’ve always thought in terms of relationships, whether it was ideas, molecules, or systems. Graphs gave me a language to formalize that intuition. I started imagining biochemical pathways not as isolated diagrams in a textbook, but as a vast, interconnected graph: ligands, receptors, and relationships between the two, all as nodes and edges in a living, dynamic network.
As intriguing as that idea was to me, I was content to let it stay theoretical in my head. Drawing out a simple signaling pathway from one lecture or one paper was already a slow, painstaking process for me. Now add in textbooks full of these well-established pathways, plus the steady stream of new papers that expanded and challenged our notion of biochemical pathways. The idea of organizing all of it felt impossible. Fascinating, yes, but not practical.
This was all before LLMs. Since ChatGPT turned everything upside down in late 2022, I, like many others, have been captivated by what generative AI makes possible. Every so often, I think about the sheer volume of scientific writing that could be mined, not just for summaries or explanations, but also for structured knowledge. What if we could convert these written texts into machine-readable data?
While LLMs seemed to solve the problem of organizing the data, I often told myself it would be too much work to put together even a small dataset worth demoing, given everything else I had going on in life. But when I stumbled across the Drug Review dataset, it felt like enough pieces were finally in place: a small, messy, but promising starting point for building something that could finally scratch an itch a decade-plus in the making.
Exploring the Dataset
Most of my PhD has involved processing and cleaning neuroimaging data, so I was thrilled to find that the Drug Review dataset comes as a neat, structured spreadsheet. With just eight features and nearly 4,000 drug reviews, it didn’t take long to get a handle on the data.
For each drug (urlDrugName), reviewers provided the primary reason they used it (condition), along with self-assessments of their experience. This included ratings for overall satisfaction (rating), effectiveness at treating the original condition (effectiveness), and the severity of side effects (sideEffects). There were also several free-text fields such as benefitsReview, sideEffectsReview, and commentsReview, where the reviewers provided their subjective experiences about the drug.
These free-text fields were the part that really piqued my interest. They resembled the scientific papers and textbooks I had drawn on to create biochemical pathway maps in undergrad, albeit with a less technical feel. Consider the following rows (shown here in dictionary format for easier viewing):
{
"reviewID": 1043,
"urlDrugName": "Vyvanse",
"rating": 9,
"effectiveness": "Highly Effective",
"sideEffects": "Mild Side Effects",
"condition": "add",
"benefitsReview": "My mood has noticably improved, I have more energy, experience better sleep and digestion.",
"sideEffectsReview": "a few experiences of nausiea, heavy moodswings on the days I do not take it, decreased appetite, and some negative affect on my short-term memory.",
"commentsReview": "I had began taking 20mg of Vyvanse for three months and was surprised to find that such a small dose affected my mood so effectively. When it came to school work though I found that I needed the 30mg to increase my level of focus (and have been on it for a month since). I had not experienced decreased appetite until about a month into taking the 20mg. I find that the greatest benefit of Vyvanse for me is that it tends to stabalize my mood on a daily basis and lessens any bouts of anxiety and depression that i used to face before I was perscribed."
}
{
"reviewID": 4109,
"urlDrugName": "Chantix",
"rating": 10,
"effectiveness": "Highly Effective",
"sideEffects": "Mild Side Effects",
"condition": "smoking",
"benefitsReview": "I quit smoking, minimized desire to drink alcohol, suppressed appetite, stabilized mood fluctuations.",
"sideEffectsReview": "I would get an upset stomach if I didn't eat at the time I took the medication.",
"commentsReview": "Chantix somehow seems to suppress the pleasure part of the brain. Before you realize it... you will have gone a whole day or maybe days/weeks without even thinking about a cigarette subconcsiously and when you do smoke a cigarette there is little to no euphoria or any other sensation for that matter, its just the taste of smoke and nicotine. I also found it suppressed my appetite and my desire to drink, as well."
}
In the first review, we can see that Vyvanse was used to treat ADD, but we also learn about additional effects: improved energy, sleep, and digestion; nausea; mood stabilization; appetite suppression (emerging after a month); and short-term memory issues. In the second review, we learn that Chantix was used for smoking cessation, yet it also produces effects like reduced appetite, mood stabilization, and even a dampening of reward circuitry.
These user reviews hint at overlapping downstream effects between the two drugs. Clinicians could likely explain these similarities based on known mechanisms for each disease and drug. But as someone with a less trained eye, I was struck by how these connections emerged from just two narratives. It made me wonder whether a broader analysis that integrates user experiences with scientific reports and case studies could surface novel, less obvious patterns. These might not yet be formalized in clinical frameworks, but they could still reflect meaningful pharmacological or physiological relationships.
This is what excites me about the dataset and the project: we can start with unstructured text from user reviews to build a knowledge graph, with the potential to expand this approach to include peer-reviewed papers, case reports, and other biomedical sources.
Converting Unstructured Text to Nodes and Edges
Given that the data was largely clean and ready for analysis (perks of using a previously-published dataset!), the first step was loading the data into a Pandas DataFrame.
import pandas as pd

# Load the raw training split of the Drug Review dataset (tab-separated)
df = pd.read_csv('../projects/drugLibTrain_raw.tsv', sep='\t')

# The unnamed index column holds the review identifier; give it a proper name
df.rename(columns={"Unnamed: 0": "reviewID"}, inplace=True)
Before jumping straight to LLMs, I considered a saying I’ve repeated often throughout my PhD: “to a hammer, everything looks like a nail.” Rather than reaching for the heaviest tool, I first wanted to see whether simpler ones existed that might handle this task.
This led me to Named Entity Recognition (NER), a well-established technique in natural language processing (NLP) for identifying and classifying entities in unstructured text. Biomedical NLP tools like BioBERT and SciSpaCy provide models trained to extract key elements like drug names, conditions, and side effects. And because many of these clinical and scientific models are pre-trained and publicly available, I could skip model development and rely on validated outputs.
However, many of these specialized NER models were trained on scientific text, which tends toward formal academic and medical language. This posed a challenge for the Drug Review dataset, which contains plain-language text from users with a much broader, less specialized vocabulary. While the NER models excelled at extracting drug names and symptoms, there was a risk that they would miss the relationships between these entities. Since that relational context is essential for building a meaningful graph, I needed a method capable of extracting both entities and their connections.
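For illustration, here’s a minimal sketch of what that route looks like, assuming SciSpaCy and its pre-trained en_ner_bc5cdr_md model (trained on the BC5CDR corpus, which labels chemicals and diseases) are installed. It surfaces entity spans and labels, but says nothing about how those entities relate to one another:

import spacy

# Assumes: pip install scispacy plus the en_ner_bc5cdr_md model package
nlp = spacy.load("en_ner_bc5cdr_md")

# One of the benefitsReview fields from the dataset
text = ("I quit smoking, minimized desire to drink alcohol, "
        "suppressed appetite, stabilized mood fluctuations.")

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)   # entity span and its label (CHEMICAL or DISEASE)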
Using LLMs via Ollama
This led me to pivot toward LLMs. While they are more computationally demanding, LLMs offer a flexibility in determining relations between entities that the NER models couldn’t match. Unlike NER, which focuses on classifying entities, LLMs can generate variable outputs based on highly specific prompts with detailed instructions. This allowed me not only to extract entities but also to identify the relationships between them, which was critical for the knowledge network I aimed to build. By tailoring the prompts to the data’s nuances, I could guide the model to produce richer, more complex outputs.
To make this shift work in Python, though, I needed a way to interact with an LLM programmatically. Until this point, my use of LLMs had been limited to web tools like ChatGPT or Gemini. I knew APIs existed, but they came with trade-offs I wanted to avoid: cloud dependency, usage limits, recurring costs, and less control over where data goes. Even though this project didn’t involve sensitive information, I liked the idea of building with privacy and flexibility in mind from the start.
That’s when I discovered Ollama, an open-source platform that runs LLMs locally. This was a game changer for me. It let me interact with LLMs programmatically, automate workflows in Python, and avoid sending data to external servers. I could spin up lightweight models for development on my M2 MacBook Air and easily switch to heavier models later on more powerful machines. That flexibility made it easier to iterate, test, and scale, all while keeping the entire pipeline on my own hardware.
After installing Ollama and downloading the model, I tested my setup with the following script:
import json
import requests

def query_llm(prompt, model="llama3.2"):
    """Send a prompt to the local Ollama server and return the parsed JSON object."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt},
        stream=True,
    )

    # Ollama streams its answer as newline-delimited JSON chunks; stitch them together
    response_text = ""
    for chunk in response.iter_lines():
        if chunk:
            data = json.loads(chunk.decode("utf-8"))
            response_text += data.get("response", "")

    try:
        # Find the first and last curly braces to extract the JSON object
        start = response_text.find("{")
        end = response_text.rfind("}") + 1
        return json.loads(response_text[start:end])
    except Exception as e:
        print("Failed to parse JSON:", e)
        print("LLM output was:", response_text)
        return None

# Sanity check: ask for a fixed JSON object before involving any real data
test_prompt = (
    "Please respond **only** with a valid JSON object in this exact format:\n"
    '{\n  "message": "hello"\n}\n'
    "Do not include any explanation or extra text. Only output the JSON object."
)
print(query_llm(test_prompt))
Before even touching the data, I learned how important a good prompt is for getting consistent results. My initial prompt was simply “Return a JSON with ‘message’ as the key and ‘hello’ as the value”, but I got far more output than I expected. This turned out to be a great lesson in the importance of specificity in prompting. I iterated on the prompt until the model reliably returned {"message": "hello"} and nothing else. Once this test prompt was working, I wrote the code to parse the LLM output for the graph database, then turned my attention to writing a useful prompt for converting the drug reviews into graph nodes and edges.
Engineering the Right Prompt
Most of my time on this project went to prompt engineering. This is where I really felt the double-edged nature of LLMs: they offer tremendous flexibility, but that same flexibility required me to be precise and clear about the outputs I wanted. Even before this project, I knew that LLMs rarely produce the same answer twice, even when given the same prompt. As I experimented with more detailed prompts, I quickly realized that even small changes in phrasing or word choice could lead to drastically different results, making it hard to strike the right balance between precision, flexibility, and reproducibility. To improve my results, I read up on prompting strategies and learned about zero-shot and few-shot prompting, specifically how providing examples of acceptable outputs can guide the model more effectively. This highlighted one of the key drawbacks of LLMs: they are powerful, but consistent results demand careful tuning of every prompt. Ultimately, I settled on the following prompt, which gave useful results with an acceptable level of consistency.
# Inside the extraction loop (one LLM call per review), pull the fields used in the prompt
reviewID = row['reviewID']
drug = row['urlDrugName']
condition = row['condition']
benefits = row['benefitsReview']
sideeffects = row['sideEffectsReview']
comments = row['commentsReview']
prompt = f"""
You are an expert biomedical knowledge graph builder.
Given the following information from a drug review:
- Review ID: "{reviewID}"
- Drug name: "{drug}"
- Condition: "{condition}"
- Benefits: "{benefits}"
- Side Effects: "{sideeffects}"
- Other comments: "{comments}"
Your task:
1. Identify all relevant entities (such as the drug, condition, effects, or other key concepts).
- When multiple effects (primary or side effects) are presented, you should do what you can to make them individual entities. In other words, "and" should be avoided as much as possible when generating an entity.
2. Identify the relationships between these entities (e.g., 'side_effect_of', 'interacts_with'). Use the following (and only create new ones if absolutely necessary):
- "treats" : i.e. A treats B
- "causes" : i.e. A causes B.
- "side_effect_of" : i.e. A is side effect of B
- "inhibits" : i.e. A inhibits B
- "activates" : i.e. A activates B
- "modulates" : i.e. A modulates B
- "interacts_with" : i.e. A interacts with B
- "part_of" : i.e. A is part of B.
- "regulates" : i.e. A regulates B
- "expresses" : i.e. A expresses B
- "associated_with" : i.e. A is associated with B
3. Output ONLY a single valid JSON object with two lists:
- "nodes": each with "id" (the entity text), "type" (e.g., Drug, Condition, Effect).
- "edges", with the following:
- "source" (id of the source node)
- "target" (id of the target node)
- "relation" (the relationship type)
- an optional "attribute" object (e.g., {{"strength": "high"}}) if relevant.
- "evidence" (should be this: {reviewID})
Example format:
{{"nodes": [
{{"id": "Aspirin", "type": "Drug"}},
{{"id": "Headache", "type": "Condition"}},
{{"id": "Stomach pain", "type": "Effect"}}
],
"edges": [
{{"source": "Aspirin",
"target": "Headache",
"relation": "treats",
"attribute": {{"effectiveness": "high"}},
"evidence": "12345"}},
{{"source": "Aspirin",
"target": "Stomach pain",
"relation": "causes",
"attribute": {{"strength": "high"}},
"evidence": "12345"}}
]
}}
If no attribute is available, omit the "attribute" field.
Do not include any explanation or extra text. Output ONLY the JSON.
"""
The goal of this prompt was to extract the most meaningful components of each drug review in a way that could be structured into nodes and edges for a knowledge graph. To start, I defined the nodes as categories like “drug,” “condition,” and “effect,” and also defined several edge types, such as “treats,” “causes,” and “side_effect_of.” This gave the LLM a clear starting point for classification. I also included the reviewID as an evidence field so that each relationship could be traced back to its original source, laying the groundwork for future trust scoring or filtering.
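To tie the pieces together, here’s a rough sketch of what the full extraction loop can look like, assuming the query_llm() helper from the test script above and a hypothetical build_prompt(row) wrapper around the f-string prompt just shown. Nodes are de-duplicated by id across reviews, and the results are written out as simple CSVs for the graph step below.

import csv

all_nodes = {}   # keyed by entity id so repeated mentions collapse into one node
all_edges = []

for _, row in df.iterrows():
    # build_prompt(row) is a hypothetical helper wrapping the f-string prompt above
    graph_json = query_llm(build_prompt(row))
    if graph_json is None:
        continue   # skip reviews whose output could not be parsed as JSON

    for node in graph_json.get("nodes", []):
        all_nodes[node["id"]] = node
    all_edges.extend(graph_json.get("edges", []))

# Write simple CSVs that the graph database import step can pick up
with open("nodes.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "type"], extrasaction="ignore")
    writer.writeheader()
    writer.writerows(all_nodes.values())

with open("edges.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["source", "target", "relation", "evidence"], extrasaction="ignore")
    writer.writeheader()
    writer.writerows(all_edges)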
As an aside, I also experimented with parallelizing LLM calls to speed things up, similar to how I’ve used Python’s multiprocessing library for other compute-heavy tasks. Unfortunately, that didn’t yield much improvement, likely due to hardware limitations (I was running everything on an M2 MacBook Air). In the end, I had to accept running queries one at a time, which is ultimately fine for a prototype like this.
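For what it’s worth, the attempt looked roughly like the sketch below, again assuming the build_prompt and query_llm helpers from the earlier sketches:

from multiprocessing import Pool

if __name__ == "__main__":
    # One prompt per review, farmed out across worker processes
    prompts = [build_prompt(row) for _, row in df.iterrows()]
    with Pool(processes=4) as pool:
        graphs = pool.map(query_llm, prompts)
    # In practice this gave little speedup on my M2 MacBook Air; the hardware,
    # not the Python loop, appeared to be the limiting factor.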
Visualizing the Graph using Neo4j
Once the nodes and edges were generated, I realized I was stepping into unfamiliar territory. While I had a solid grasp of the basics of graph theory, working with graph databases was an entirely new challenge. The tools and concepts involved were different from the relational and document databases I had used in the past, and I quickly recognized that I needed to take a step back before diving into any analyses.
Fortunately, I now had a dataset I could use to learn as I went. I decided to take some time to get comfortable navigating the tools, which would later allow me to do more in-depth analyses with the graph. After evaluating several graph database options, I chose Neo4j for its user-friendly interface, though I’m open to exploring other platforms in the future.
After installing Neo4j Desktop, I got to work setting up a project, importing the nodes and edges CSVs, and generating Cypher queries until I eventually got to something that resembled my initial vision. I generated a graph with different colors representing the primary nodes of interest (drugs, conditions, and effects), and the different relationships between them clearly mapped. While I’ve initialized the graph and explored some of its basic structure, I’ll likely save deeper analysis and insights for a future blog post. For now, seeing the structure come to life in a visual format is rewarding enough: it made the abstract relationships feel concrete, and it gave me a clearer sense of what kinds of insights might actually be possible.
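The actual import went through Neo4j Desktop’s CSV import tooling, but for completeness, here is a minimal sketch of what an equivalent programmatic load could look like with the official neo4j Python driver. The connection URI and credentials are placeholders, and the schema (a single :Entity label plus generic :RELATION edges carrying the relation name as a property, since Cypher can’t parameterize relationship types) is a simplifying assumption rather than the exact model I set up in Neo4j Desktop.

import pandas as pd
from neo4j import GraphDatabase

nodes = pd.read_csv("nodes.csv")   # columns: id, type
edges = pd.read_csv("edges.csv")   # columns: source, target, relation, evidence

# Placeholder connection details for a local Neo4j instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # MERGE keeps entities unique even when they appear in many reviews
    for _, n in nodes.iterrows():
        session.run(
            "MERGE (e:Entity {id: $id}) SET e.type = $type",
            id=n["id"], type=n["type"],
        )
    # Store the relation name and originating review as edge properties
    for _, e in edges.iterrows():
        session.run(
            "MATCH (a:Entity {id: $source}), (b:Entity {id: $target}) "
            "MERGE (a)-[r:RELATION {relation: $relation}]->(b) "
            "SET r.evidence = $evidence",
            source=e["source"], target=e["target"],
            relation=e["relation"], evidence=str(e["evidence"]),
        )

driver.close()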
Looking Ahead
This was a fun project that brought together several strands of curiosity, from biochemical networks to language models to graph theory. Even though I’m leaving it in a prototype state for now, the process revealed just how much potential (and complexity) there is in modeling unstructured biomedical text as a structured knowledge graph.
There’s still a lot of work to make this usable at scale. Data quality is the most immediate concern. While the LLM did a solid job generating nodes and edges, the output still needs careful vetting and QC to ensure nothing was missed or erroneously generated. If I were to keep developing this, I’d prioritize precision (making sure the entities and relationships we extract are correct) over recall (capturing every possible entity, at the risk of including wrong ones). In a domain as sensitive as biomedical effects, I believe an incorrect classification is more problematic than failing to classify something at all.
There’s also room to experiment with different models. I used a lightweight 3 billion parameter LLM to keep things efficient on my local machine, but I’m curious how a larger model with more capacity for nuance and ambiguity might improve the quality of the graph. The trade-off in speed and compute might be worth it for richer, more accurate relationships.
And then there’s the graph itself. I’ve only scratched the surface of what’s possible with Neo4j and Cypher, but even in this early stage, I can already see the kinds of questions it might help answer: Which drugs share similar downstream effects? Are certain conditions disproportionately linked to specific side effects? By querying the graph, I could trace shared outcomes across different treatments, identify side effects that recur across drug classes, or uncover clusters of conditions based on overlapping treatment mechanisms. With more time, I’d love to explore these questions or even build a lightweight interface to let others do the same. These kinds of insights could reveal unexpected connections or spark research questions that might otherwise go unnoticed.
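Sticking with the simplified :Entity/:RELATION schema assumed in the import sketch above, the first of those questions could be asked with something like this (the relation and type names come straight from the extraction prompt):

# Which pairs of drugs are reported to cause the same effects?
# Reuses the `driver` connection and schema assumed in the import sketch.
shared_effects_query = """
MATCH (d1:Entity {type: 'Drug'})-[:RELATION {relation: 'causes'}]->(e:Entity {type: 'Effect'})
      <-[:RELATION {relation: 'causes'}]-(d2:Entity {type: 'Drug'})
WHERE d1.id < d2.id
RETURN d1.id AS drug_a, d2.id AS drug_b, collect(DISTINCT e.id) AS shared_effects
ORDER BY size(shared_effects) DESC
LIMIT 10
"""

with driver.session() as session:
    for record in session.run(shared_effects_query):
        print(record["drug_a"], record["drug_b"], record["shared_effects"])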
Beyond scale, there’s significant potential to expand both the scope and depth of the graph. Currently, it relies solely on patient reviews, which are valuable but inherently subjective data. Incorporating peer-reviewed literature, clinical trial reports, or case studies would create a much more comprehensive and reliable graph. To support this, the prototype includes an evidence field for each relationship, tracking its source and frequency. While this currently reflects user reviews, it could just as easily reference citations from published research. A single mention in a review might be a curiosity, but consistent findings across trials or meta-analyses would carry far more weight. Over time, this approach could highlight high-confidence subgraphs and identify weaker connections that need further validation.
The beauty of this approach is its flexibility. Thanks to LLMs, we can adjust prompts to handle different data types while still generating structured outputs. That adaptability is key, whether scaling up with new sources or refining the precision of the relationships we extract. While this prototype focused on a domain I know well, the underlying method is transferable. With the right prompts and inputs, it could apply just as easily to scientific literature, environmental data, or historical archives, anywhere relationships are buried in messy text and worth making explicit.
Final Thoughts
Building this knowledge graph prototype has been incredibly rewarding, both personally and professionally. It’s allowed me to bring together a range of tools and techniques I’ve picked up over the last decade. While the project is still in its early stages, it’s no longer just an idea: it’s a tangible step forward in demonstrating how unstructured biomedical knowledge can be structured and visualized using generative AI tools.
Looking ahead, I think the combination of LLMs and graph databases holds real promise, especially for biomedical sciences. LLMs can extract structured data from free-form text with flexibility and scale, while graph databases offer powerful ways to explore and reason about that data. As both technologies mature, I see growing potential to turn the overwhelming messiness of scientific writing into something more searchable, visual, and usable, whether in biology, medicine, or beyond.