
As we may have mentioned in our previous social media posts, Ouro has a monthly training day called “Kullanhuuhdonta” as part of our company culture. During these days, we gather at an interesting location and teach each other new topics or skills. We also occasionally invite outside experts to give lectures. The topics covered can range from the ethics of artificial intelligence to programming language syntax and memory management models.
This month, we met on the island of Lonna, just off the coast of Helsinki. Our agenda covered Large Language Models and Retrieval-Augmented Generation (RAG), focusing on augmenting a commonly used LLM with our own data set.
The main focus of the day was to familiarize ourselves with the basics of RAG at a practical level. Fortunately, Python has an amazing repertoire of libraries available that will help us focus on the “doing” part without requiring a deep theoretical background. This is one of the main purposes of our monthly Kullanhuuhdontas – to introduce new ideas and tools to everyone. If anyone finds the topic interesting, Ouro will support their journey in deepening their theoretical background.
We split our day into four parts:
First, we decided on the specialized data that each of us would use as a source material. Then, we gathered the documents and preprocessed them using the Langchain Community libraries, specifically the DirectoryLoader. This allowed us to combine all the files into one large data structure, which was then ready to be fed into the next step.
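To make the loading step concrete without pulling in the whole library, here is a minimal stand-in for what DirectoryLoader does, written with only the standard library: walk a directory, read every matching file, and collect everything into one list of records. (The function name `load_directory` and the dict shape are our own illustration, not the LangChain API.)

```python
# Minimal stand-in for LangChain's DirectoryLoader: recursively read
# every file matching a glob pattern into one list of records.
from pathlib import Path

def load_directory(root, pattern="*.txt"):
    """Read every file under `root` matching `pattern` into memory."""
    documents = []
    for path in sorted(Path(root).rglob(pattern)):
        documents.append({
            "source": str(path),                        # where the text came from
            "text": path.read_text(encoding="utf-8"),   # the raw content
        })
    return documents
```

The real DirectoryLoader additionally picks a parser per file type (PDF, HTML, and so on), but the result is the same: one large data structure holding all the source material.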
Next, it was time to manipulate that large data structure. One big blob of data is of little use on its own, so we broke it down into smaller, more manageable chunks. The Langchain text_splitter module offers the RecursiveCharacterTextSplitter class for automatically splitting data into chunks of a predetermined size and overlap. Both properties are essential for feeding the data to the LLM later.
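The actual RecursiveCharacterTextSplitter is smarter than this (it tries to split on paragraph, sentence, and word boundaries before falling back to raw characters), but its core idea of fixed-size chunks with an overlap, so that context is not lost at chunk boundaries, can be sketched in a few lines of plain Python:

```python
# Sketch of overlapping chunking: each chunk shares its first `overlap`
# characters with the tail of the previous chunk.
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split `text` into chunks of at most `chunk_size` characters,
    each overlapping the previous chunk by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

For example, `chunk_text("abcdefghij", chunk_size=4, overlap=2)` yields `["abcd", "cdef", "efgh", "ghij"]`: every boundary appears in two chunks, so a sentence cut in half by one chunk is still intact in its neighbor.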
The third and final data-preparation step was to embed the chunks into a vector database and store them for later use. Once again, the Langchain Community libraries came to the rescue: the Chroma module takes all the previously prepared chunks and runs them through some OpenAIEmbeddings() magic. This magic is essentially about the relatedness of text strings, which is what makes retrieval work.
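To demystify the magic a little: an embedding store maps each text to a vector and ranks stored texts by their similarity to a query vector. The toy sketch below uses bag-of-words vectors and cosine similarity; real embeddings such as OpenAIEmbeddings capture meaning rather than just shared words, but the retrieval mechanics are the same. (The helper names here are our own illustration, not the Chroma API.)

```python
# Toy vector retrieval: embed texts as word-count vectors and rank
# them by cosine similarity to the query.
import math
from collections import Counter

def embed(text):
    """Bag-of-words 'embedding': word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def top_chunks(query, chunks, k=3):
    """Return the k stored chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

Swap the word-count vectors for learned embedding vectors and the sorted list for an indexed database, and you have the essence of what Chroma stores and searches.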
After completing all the preparation work, it was time for a coffee break before delving into the main topic of RAG: using specialized data for making queries. This turned out to be less magical than expected. In simple terms, we took a question from the user, applied it to the Chroma database, and received back relevant chunks of information. Then, we formulated a query prompt for OpenAI, which was roughly as follows:
"""
Answer the question based only on the following context:
- first data chunk related to the question
- second data chunk related to the question
- third data chunk related to the question
---
Answer the question based on the above context: "This is my question related to the specialized data"
"""
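Building such a prompt from the retrieved chunks is plain string formatting. The sketch below mirrors the template above; the function name `build_prompt` is our own:

```python
# Assemble the RAG query prompt from the user's question and the
# chunks retrieved from the vector database.
def build_prompt(question, chunks):
    """Format retrieved chunks and the question into one prompt."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return (
        "Answer the question based only on the following context:\n"
        f"{context}\n"
        "---\n"
        f"Answer the question based on the above context: {question}"
    )
```

The resulting string is sent to the chat model as a single message; the "based only on the following context" instruction is what steers the model toward our data instead of its built-in knowledge.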
And that’s it! The key to implementing RAG is to create enough chunks of appropriate size and to formulate a prompt for the LLM that includes the relevant chunks. The chat model then uses its own language-parsing magic to process the prompt and, hopefully, returns a sensible answer. Essentially, all of these steps can be done manually with any LLM that provides a chat interface. However, as seen above, the biggest task is preparing the relevant data and storing it for future use. The rest is surprisingly easy, thanks to the LLM’s ability to parse and formulate text.
The author is Ouro’s Senior Developer Matti Kärki.
