Agentic coding is everywhere right now, but for a while I watched from a distance. I would say I was already a native AI user; many of my day-to-day habits had shifted, like swapping Stack Overflow threads and Reddit searches for ChatGPT queries. Still, I wasn’t convinced agents would materially change how I wrote code. That changed once I had access to a capable agentic model and enough compute to actually push it. What began as curiosity quickly turned into adoption, and today agentic coding is a core part of my daily workflow. In this post, I want to reflect on that transition, the lessons it’s forced me to learn, and how I’m thinking about growing with these tools rather than around them.
From the Sidelines into the (Virtual) Arena
What finally drew me into agentic coding is a story in itself. My drafts folder for blog posts has two or three different iterations of how fantasy football kickstarted my research career and data interests. Maybe one day I’ll finally publish one of those posts, but for now, I’ll spoil the plot a bit by saying the game has long been my sandbox for learning programming. From basic Python to more advanced work like connecting to APIs and scraping the web, fantasy football has served as a durable, evolving project that helps me learn and advance my programming skills.
Naturally, as with much of my programming journey, fantasy football also became my first real foray into agentic coding.
Before the start of the last fantasy football season, I had introduced a custom playoff format for the leagues I manage and organize. All the Python scripts I wrote to handle the midseason competitions that would influence the playoffs were working smoothly, so I felt pretty good about my playoff code. However, a week before the start of the playoffs, I realized there would be a problem: I was going to be on vacation for the championship weeks without access to my personal computer, which had all the scripts I needed to organize the playoffs. Hosting a website felt like the obvious answer, but my web development experience was limited. After a few hours brainstorming solutions, it finally clicked: this was the perfect opportunity to try agentic coding. I already had a working Python codebase that did exactly what I needed; the challenge was refactoring it into JavaScript and wrapping it in some HTML and CSS.
I’m a little embarrassed to admit now that my next few searches were things like “how to agentically code” and “where do I prompt an agent.” But a few hours later, I had written my first set of instructions and set up the agentic environment. Once I hit send, the only word that could describe what happened next is magic.
Within 10 minutes, I watched an agent take my Python logic and refactor it into a functional prototype using JavaScript, HTML, and CSS, all languages I have basic familiarity with but haven’t used enough to build something from scratch. I’d always thought about building a website based on my Python codebase, so seeing one come together felt like scratching a decade-long itch for my fantasy football leagues.
But that feeling didn’t last long. As soon as I moved from watching the agent work to actually testing the site myself, the illusion started to crack. Buttons didn’t behave the way I expected, edge cases appeared out of nowhere, and logic wasn’t properly implemented. My inner skeptic, the one burned by a thousand past coding bugs, quietly took the wheel.
There were glitches in the standings and logic errors that I assumed a faithful refactor would have avoided. But the debugging process turned out to be just as surprising as the initial “magic.” It became a mind-meld between me and the agent, a cycle of me finding a bug, suggesting a fix, and watching the code evolve. For the first time, I felt the gap between my intent and the execution finally closing. I no longer had to trade my momentum for a syntax error; I could stay focused on the logic while the agent handled translating it into working code.
Before long, I had what I wanted. I deployed the site before the playoffs started, and throughout championship week it updated standings and results in real time. No laptop, no manual scripts, no late-night interventions. Each league ran more smoothly because of it, and for the first time, the system I’d imagined actually existed in the wild. That was the moment I realized I’d unlocked a new way of building.
Moving the Chains from Fantasy to Research
While the fantasy football project was a useful proof of concept, the real value of agentic coding has shown up in the everyday, data-intensive tasks of my work. Lately, the biggest gains haven’t come from writing complex algorithms, but from automating repetitive, high-friction work, such as quality checking and data validation.
For anyone who has spent hours manually clicking through neuroimaging data, the browser is an incredibly intuitive environment for viewing and navigating images. Quality control in imaging is, at its core, visual: checking whether the boundaries of a structure were properly captured, and whether a segmentation is over-traced, under-traced, or missing entirely.
Recently, I faced the task of quality-checking 1,500 segmentation masks. In practice, that meant opening each image, inspecting the boundaries slice by slice, then switching to a spreadsheet to record whether a given subject passed, failed, or needed to be revisited. Each decision required carefully lining up subject IDs, slice numbers, and notes. The work was monotonous, cognitively fatiguing, and potentially error-prone. One of my bigger frustrations was constantly context switching between opening and reviewing images, assigning QC status, writing notes, and maintaining a correct database format. Cutting any part of this cognitive load would make the process much easier.
In the past, QCing this many images would have taken a single person 7 to 10 days, or a small team of reviewers 3 to 5 days (assuming they could put aside all other tasks and focus solely on this). Building a custom web-based tool to automate some of the work had always crossed my mind, but the math never worked out. Even in a best-case scenario, learning enough web development to build something usable would have taken days, followed by days of QC itself. The time investment simply wasn’t worth it.
With agentic coding, that calculus flipped. I built a browser-based QC tool in about an hour: an image viewer paired with a small set of buttons (later mapped to keyboard keys!) that automatically logged pass, fail, or return decisions directly to a master spreadsheet. The impact was immediate. Because failures were relatively rare, I could review far more subjects in a row without breaking flow. No more counting rows, no double-checking IDs, no context switching just to save a decision. The tool handled the bookkeeping, and I stayed focused on the science.
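The post doesn’t include the tool’s code, so here is a minimal sketch of the bookkeeping logic it describes, written in Python rather than the browser-side JavaScript the actual tool used. The key mapping, function name, and column names are all hypothetical, just to show the shape of “press a key, log a decision”:

```python
import csv
from pathlib import Path

# Hypothetical key-to-decision mapping, mirroring the keyboard
# shortcuts described above: pass, fail, or return for re-review.
KEYMAP = {"p": "pass", "f": "fail", "r": "return"}

def record_decision(log_path, subject_id, key, note=""):
    """Append one QC decision to the master spreadsheet (a CSV file).

    The reviewer only presses a key; the subject ID, decision, and
    note are written out automatically, so there is no manual
    row-counting or ID double-checking.
    """
    decision = KEYMAP.get(key.lower())
    if decision is None:
        raise ValueError(f"Unmapped key: {key!r}")
    is_new_file = not Path(log_path).exists()
    with open(log_path, "a", newline="") as f:
        writer = csv.writer(f)
        if is_new_file:
            writer.writerow(["subject_id", "decision", "note"])
        writer.writerow([subject_id, decision, note])
    return decision
```

The point of the design is that the reviewer’s attention stays on the image: every decision lands in a consistent format without touching the spreadsheet directly.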
I completed the entire QC process in roughly 10 work-hours and ran corrective follow-ups that same night. Just as importantly, the faster turnaround changed how I thought about iteration. I could address failures immediately, recover usable data, and ultimately include a larger sample size in my analysis than I otherwise would have. A task that once looked like a week of tedious work became something I could finish, and finish correctly, within two days.
Crucially, agentic coding didn’t replace my judgment. I still decided whether an algorithm succeeded or failed, and what to do with those outcomes. What it removed was a thick layer of friction between intent and execution. The tool itself was simple, but it fundamentally changed what felt feasible.
I’ve slowly gained confidence building more complex tools with agentic coding. A recent project involved creating a tool for data entry and validation in clinical studies. Over the years, I’ve manually entered data for countless RA jobs, carefully matching fields from subject to subject, and sometimes across multiple studies. These spreadsheets often have machine-readable column names, nonintuitive formats for data entry, and hundreds of fields per subject. Mistakes can creep in anywhere, and even small errors compound downstream during analysis. Spreadsheets with conditional formatting can get the job done, but they’re rarely optimal and often tedious to maintain. I’d long imagined browser-based forms that could enforce validation rules automatically, making data entry faster, more accurate, and even a little more engaging. With agentic coding, that vision became feasible.
I was able to build those forms in a fraction of the time it would have taken to learn a full web development stack. Now, data entry checks itself in real time: fields validate as I go, inconsistencies are flagged immediately, and the system keeps everything aligned across subjects. What used to be a boring, error-prone chore has become intuitive and efficient, another example of turning a high-friction workflow into something manageable, and even enjoyable.
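To make the idea of self-checking data entry concrete, here is a small sketch of per-field validation rules like the ones those forms enforce. The field names and rules are hypothetical examples, not the actual study’s schema, and it’s written in Python for brevity (the real forms run the equivalent checks in the browser):

```python
import re

# Hypothetical validation rules: each field maps to a check that
# returns True for a valid value. A real schema would have hundreds
# of fields like these.
RULES = {
    "subject_id": lambda v: bool(re.fullmatch(r"SUB-\d{4}", v)),
    "age": lambda v: v.isdigit() and 18 <= int(v) <= 90,
    "visit": lambda v: v in {"baseline", "month6", "month12"},
}

def validate_entry(entry):
    """Return a dict mapping each failing field to an error message.

    In the browser forms, the same checks run as you type, so
    inconsistencies are flagged immediately instead of surfacing
    later during analysis.
    """
    errors = {}
    for field, check in RULES.items():
        value = entry.get(field, "")
        if not check(value):
            errors[field] = f"invalid value: {value!r}"
    return errors
```

For example, an entry with a malformed subject ID and an out-of-range age would be flagged on both fields the moment it’s typed, while a clean entry passes silently.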
Establishing a New Baseline
Agentic coding is a recent development, but it addresses a problem I have felt throughout my career. I noticed early on that specific breakthroughs in my programming, like replacing a slow for-loop with a vectorized matrix operation or moving from chaining Terminal commands to using native Python libraries, immediately accelerated my research. These moments of growth made me wonder how much faster I could have moved if I had known those better ways of working from the start. I have had many ideas stall because I simply did not know the right way to build them. Even when I could get code to work, I often spent my days fighting the secondary battles of validating and scaling it. As much as I would love to pause and advance my coding fundamentals, it is rarely justifiable when I am trying to implement a one-off tool for a larger research project, leaving me stuck in the weeds of implementation when I would much rather be focused on the science.
To me, agentic coding provides a way to bridge that gap. It turns the “how do I even start” moments into a prototype I can actually refine. When things break, it helps me narrow down where the problem might be, turning a full day of debugging into a cooperative troubleshooting session. Instead of having to find both the problem and the fix from scratch, I can focus on verifying the solution, which frees up more energy for the research. By handling the boilerplate, the tool ensures implementation is no longer a bottleneck. I can finally explore ideas that used to be too technically expensive to start, effectively raising the floor for what is possible.
I feel much less restricted by the gaps in my own coding abilities now. Beyond just finishing a task, I am using the agentic outputs to learn coding in a context that actually matters to my research. Every time I pick up a new concept from an agent, I can move onto the next idea that much faster. The hope is that as the tools get better, my coding (and prompting) gets better too, creating a compound effect on the quality and impact of my work. As someone with a grip on science but an informal background in coding, this is a game changer.
To think, this is the impact of agentic coding on just one researcher. It’s hard not to get ahead of myself thinking about how much the greater research world will benefit from this tech. Agentic coding acts as a great equalizer, democratizing high-level execution not just for research, but for any field where technical friction stalls good ideas.
Of course, the “magic” isn’t a replacement for expertise. If anything, it demands more of it. We still have to know what a good result looks like, how to spot a logic hallucination, and how to verify that the data isn’t being quietly mangled behind a sleek UI. The agent provides the momentum, but the researcher still has to provide the guardrails.
There’s a special kind of energy that comes with knowing the distance between a “what if” and a working prototype is now just a few prompts away. I feel like I’m no longer negotiating with my own limitations. Now, I’m just looking for the next question to answer.