Avoid depending on external and ever-changing APIs for your knowledge graph-based chatbot
Large language models like ChatGPT have a knowledge cutoff date beyond which they are not aware of events that happened later. Instead of fine-tuning models with more recent information, the current trend is to provide the LLM with additional external context at query time. I have written a couple of blog posts on this approach, ranging from a context-aware knowledge graph-based bot to a bot that can read through a company’s resources to answer questions. However, I have used OpenAI’s large language models in all of the examples so far.
While OpenAI’s official position is that they don’t use users’ data to improve their models, there are stories like the one about Samsung employees leaking top-secret data by inputting it into ChatGPT. If I were dealing with top-secret, proprietary information, I would stay on the safe side and not share that information with OpenAI. Luckily, new open-source LLM models are popping up every day.
I have tested many open-source LLM models on their ability to generate Cypher statements. Some of them have a basic understanding of Cypher syntax. However, I haven’t found any models reliably generating Cypher statements based on provided examples or graph schema. So, the only solution was to fine-tune an open-sourced LLM model to generate Cypher statements reliably.
I have never fine-tuned any NLP model, let alone an LLM. Therefore, I had to find a simple way to get started without first obtaining a Ph.D. in machine learning. Luckily, I stumbled upon H2O’s LLM Studio tool, released just a couple of days ago, which provides a graphical interface for fine-tuning LLM models. I was delighted to discover that fine-tuning an LLM no longer required me to write any code or long bash commands. With just a few mouse clicks, I would be able to complete the task.
All the code of this blog post is available on GitHub.
Preparing a training dataset
First, I had to learn how the training dataset should be structured. I examined their tutorial notebook and discovered that the tool could handle training data provided as a CSV file, where the first column includes user prompts, and the second column contains desired LLM responses.
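To make the expected format concrete, here is a minimal sketch of how such a CSV could be put together with pandas. The column names instruction and output, as well as the example rows, are purely illustrative; LLM Studio lets you map the prompt and answer columns during import.
import pandas as pd

# Illustrative rows only; the real dataset contains around 200 question/Cypher pairs
examples = [
    {
        "instruction": "How many movies did Tom Hanks appear in?",
        "output": "MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie) RETURN count(m)",
    },
    {
        "instruction": "Which genres does Toy Story belong to?",
        "output": "MATCH (m:Movie {title: 'Toy Story'})-[:IN_GENRE]->(g:Genre) RETURN g.name",
    },
]

pd.DataFrame(examples).to_csv("train.csv", index=False)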
Ok, that’s easy enough. Now I just had to produce the training examples. I decided that 200 is a good number of training examples. However, I am way too lazy to write 200 Cypher statements manually. Therefore, I employed GPT-4 to do the job for me. The code can be found here:
The movie recommendation dataset is baked into GPT-4, so it can generate good enough examples. However, some examples are slightly off and don’t fit the graph schema. So, if I were fine-tuning an LLM for commercial use, I would use GPT-4 to generate Cypher statements and then walk through manually to validate them. Additionally, I would want to ensure that the validation set contains no examples from the training set.
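The actual generation notebook is linked above; purely as an illustration of the idea, a sketch along these lines, using the openai package as it existed at the time and a made-up prompt wording, could produce the raw question/Cypher pairs:
import openai

# Assumes openai.api_key is already configured
# Hypothetical system prompt; the linked notebook may phrase this differently
system = (
    "You generate training data for a text-to-Cypher model. "
    "The graph schema is (:Person)-[:ACTED_IN]->(:Movie)-[:IN_GENRE]->(:Genre). "
    "Return pairs of a natural language question and the matching Cypher statement."
)

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": "Generate 10 question and Cypher statement pairs as CSV rows."},
    ],
)
print(response["choices"][0]["message"]["content"])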
I have also tested whether a prefix like “Create a Cypher statement for the following question” is needed in the instructions. It seems that some models, like EleutherAI/pythia-12b-deduped, need the prefix; otherwise, they fail miserably. On the other hand, facebook/opt-13b did a solid job even without the prefix.
To be able to compare all models with the same dataset, I used a dataset that adds a prefix “Create a Cypher statement for the following question:” to the instructions section of the dataset.
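Prepending the prefix is a one-liner once the dataset is loaded as a dataframe; the instruction column name again comes from the sketch above and is only an assumption.
import pandas as pd

df = pd.read_csv("train.csv")
# Prepend the same prefix to every question so all models see identical instructions
df["instruction"] = "Create a Cypher statement for the following question: " + df["instruction"]
df.to_csv("train.csv", index=False)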
H2O LLM Studio installation
H2O LLM Studio can be installed in two simple steps. First, we have to set up a Python 3.10 environment if it is missing; the steps to install Python 3.10 are described in their GitHub repository. Once the Python 3.10 environment is ready, we simply clone the repository and install dependencies with the make install command. After the installation, we can run LLM Studio with the make wave command and open the graphical interface in our favourite browser at localhost:10101.
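For reference, assuming the Python 3.10 environment is already in place, the whole installation boils down to a handful of commands (check the project README for the authoritative, up-to-date steps):
git clone https://github.com/h2oai/h2o-llmstudio.git
cd h2o-llmstudio
make install
make wave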
Import dataset
First, we have to import the dataset that will be used to fine-tune an LLM. You can download the one I used if you don’t want to create your own dataset. Note that it is not curated, and some examples do not fit the movie recommendation graph schema. However, it is a great start for getting to know the tool. We can import CSV files using the drag-and-drop interface.
It is a bit counter-intuitive, but we have to upload the training and validation sets separately. Let’s say we first upload the training set. Then, when we upload the validation set, we have to use the merge datasets option so that we have both the training and validation sets in the same dataset.
The final dataset should have both training and validation dataframes present.
I’ve since learned that you can also upload a ZIP file containing both the training and validation sets to avoid uploading the files separately.
Create experiment
Now that everything is ready, we can go ahead and fine-tune an LLM model. If we click on the Create Experiment tab, we are presented with the fine-tuning options. The most important settings are the dataset used for training and the LLM backbone; I have also increased the epoch count in my experiments. I left the other parameters at their defaults, as I have no idea what they do. We can choose from 13 LLM models:
Note that the higher the parameter count, the more GPU RAM we require for finetuning and inference. For example, I ran out of memory using a 40GB GPU when trying to finetune an LLM model with 20B parameters. On the other hand, we expect that the higher the parameter count of an LLM, the better the results. I would say that we require about 5GB of GPU RAM for smaller LLMs like pythia-1b and up to 40GB GPU for opt-13b models. Once we set the desired parameters, we can run the experiment with a single click. For the most part, the finetuning process was relatively fast using an Nvidia A100 40GB.
Most models were trained in less than 30 minutes using 15 epochs. The nice thing about the LLM Studio is that it produces a dashboard to inspect the training results.
Not only that, but we can also chat with the model in the graphical interface.
Export models to HuggingFace repository
As if H2O LLM Studio weren’t cool enough already, it also allows exporting fine-tuned models to HuggingFace with a single click.
The ability to export a model to the HuggingFace repository with a single click allows us to use the model anywhere in our workflows as easily as possible. I have exported a small finetuned pythia-1b model that can run in Google Colab to demonstrate how to use it with the transformers library.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Use a GPU if one is available
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the fine-tuned model and its tokenizer from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("tomasonjo/movie-generator-small")
model = AutoModelForCausalLM.from_pretrained("tomasonjo/movie-generator-small").to(
    device
)

prefix = "\nCreate a Cypher statement to answer the following question:"


def generate_cypher(prompt):
    # LLM Studio expects the user prompt to end with the <|endoftext|> token
    inputs = tokenizer(
        f"{prefix}{prompt}<|endoftext|>", return_tensors="pt", add_special_tokens=False
    ).to(device)
    tokens = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.3,
        repetition_penalty=1.2,
        num_beams=4,
    )[0]
    # Strip the prompt tokens and decode only the generated Cypher statement
    tokens = tokens[inputs["input_ids"].shape[1]:]
    return tokenizer.decode(tokens, skip_special_tokens=True)
The LLM Studio uses a special <|endoftext|> character that must be added to the end of the user prompt in order for the model to work correctly. Therefore, we must do the same when using the finetuned model with the transformers library. Other than that, there is nothing else that needs to be done. We can now use the model to generate Cypher statements.
generate_cypher("How many movies did Tom Hanks appear in?")
#MATCH (d:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie)
#RETURN {movie: m.title} AS resultgenerate_cypher("When was Toy Story released?")
#MATCH (m:Movie {title: 'When'})-[:IN_GENRE]->(g:Genre)
#RETURN {genre: g.name} AS result
I deliberately showed one valid and one invalid generated Cypher statement to make the point that smaller models might be good enough for demos, where the prompts can be predefined, but you probably wouldn’t want to use them in production. However, using bigger models comes at a price. For example, to run models with 12B parameters, we need at least a 24 GB GPU, while the 20B parameter models require GPUs with 48 GB of memory.
Summary
Finetuning open-source LLMs allows us to break free of the OpenAI dependency. Although GPT-4 works better, especially in a conversational setting where follow-up questions could be asked, we can still keep our top-secret data to ourselves. I tested multiple models while writing this blog post, except for 20B models, due to GPU memory issues. I can confidently say that you could finetune a model to generate Cypher statements good enough for a production setting. One thing to note is that follow-up questions, where the model has to rely on previous dialogue to understand the context of the question, don’t seem to be functioning at the moment. Therefore, we are limited to single-step queries, where we need to provide the whole context in a single prompt. However, since the development of open-source LLMs is exploding, I am excited about what’s to come next.
Till then, try out the H2O LLM Studio if you want to finetune an LLM to fit your personal or company’s needs with only a few mouse clicks.