Evaluation Module

Estimated reading: 7 minutes

Why to evaluate?

Every LLM can hallucinate, or produce responses that are not entirely accurate or truthful. This is because LLMs are trained on vast amounts of text data, which can include biased or incorrect information. Additionally, LLMs can generate responses based on patterns in the data that may not reflect reality. Even though our model achieves factual accuracy about 90% to 100%, your input data or dataset may cause the model to hallucinate due to lack of context or factual information.

Hallucinations can occur in several ways. For example, an LLM may generate responses that are plausible but not entirely accurate. Or, it may generate responses that are completely false but difficult to identify as such. In some cases, an LLM may generate responses that are intentionally misleading or malicious.

To mitigate the risk of hallucinations, LLMs need to be rigorously evaluated and tested on a variety of prompts and scenarios. If you are updating your dataset, it is strongly recommended that you also evaluate the accuracy of Kaila’s responses after the agent has been retrained.

How it works?

For a simple evaluation of how your new or existing Agent that is trained on your data performs, we have prepared a great Evaluation Module that will help you to find out in a very short time how the Agent behaves in different situations and how accurately it answers different questions related to your knowledge or data.

The evaluation module works in principle by comparing a predicted answer from a model against ground truth answers written – ideally – by a human. We then compare these pairs of answers using sentence similarity and the output is the percentage similarity of the predicted answers to the ground truth answers.

Similarity is rough

Please keep in mind that the resulting similarity is only indicative. In some cases, the model may rate a predicted answer as “Poor” or “Good” even though the predicted answer is absolutely correct. This happens in cases where the resulting predicted answer is longer than the ground truth answer or the predicted answer is stated in completely different words but factually and contextually this answer is correct. The correct way to evaluate is by using your eyes and your mind.

So let’s dive in! Find Evaluation Module in your workspace in the left vertical menu under Knowledge/ Evaluations section.

Prepare the Ground Truths

As a first step in the evaluation process, prepare the list of questions and ground truth answers that are based on your dataset. What’s the ground truth answer you ask? It’s simple the one and only right answer to the question related to the dataset.

You can either start typing answers and GT answers directly in the Evaluation Module in your workspace of the Kaila Studio, or you can prepare it in a table a save as a CSV file.

Please note that you should always prepare questions and GT answers solely based on the dataset!

Module Interface

How to use the Evaluation Module

Click on the (+) button in the upper right corner and create a new evaluation set.
Name the evaluation set (for example after the dataset version).
Select the created agent that you want to evaluate from the drop-down list.
Upload the CSV file which includes list of questions and corresponding Ground truth answers, or click on the Add question button in the bottom right corner.

1. Importing the evaluation list from CSV

The easiest way how to upload multiple questions and answers into the Evaluation module is through a CSV file. In order to do that follow these steps:

Formatting your .csv

o upload Questions and GTs successfully, they must be formatted as question;groundtruthanswer; in .csv

Click on the Upload CSV button next to the Select agent drop down menu.
Select your .csv
Hit Upload
If the upload of the .cvs with your Q&As was successful, they should appear in the evaluation sheet

2. Creating the evaluation list manually

If you choose to type questions and GT answers one by one, click on the Add question button in the bottom right corner. A new line for answer and question will appear and follow these steps:

Type in the potential question users might ask Kaila based on the dataset into the Question field.
Type in the correct answer (Ground truth) to this question. You can either write it yourself or copy/paste it from the dataset.

Number of questions and aswers For the best evaluation of the dataset and the quality of the answers, it is recommended to prepare at least 50 questions and their corresponding correct answers. Of course, it always depends on the size of your dataset. As a general rule, the more the better.

3. Evaluation

If you have all your questions and GT answers ready and you are satisfied with them, it is time to start the evaluation itself. At this point, there is nothing easier than clicking on the “Evaluate” or “Create Evaluation” button (in case you are creating a completely new evaluation).

Evaluating, or getting predicted answers from Kaila can take anywhere from a few seconds to a few minutes, depending on the number of your Q&As. Predicted answers are retrieved sequentially one by one, so after a while you will start to see predicted answers for each Q&As together with a sentence similarity score.

The evaluation itself runs in the background so you can go back and check the results.

4. The results

Once the evaluation is complete, you can check the results and how your Agent is performing in terms of the questions asked.

Once the evaluation is complete, you can check the results and how your Agent is doing in terms of the pre-proposal questions. In case you are not satisfied with the results, you can supplement your data on which the Agent was trained or add the necessary context and run the evaluation again.