Prepare Your Training Data

The Basics

As you may have noticed, Kaila is built on complex mathematical functions, specifically neural models. Generative text models use these calculations to make predictions based on the relationships between words or sequences of words. Drawing on its training data, the model puts together relevant words to generate a response. For example, if the model has learned that the word “Hello” is often followed by “how are you?”, it will generate “Hello, how are you?” with high probability.

As technology advances, so do the capabilities of AI models like Kaila. AI models have come a long way, but they are not perfect and may still produce errors in their outputs. With continued development and improvement, they can provide increasingly accurate results. Even so, it is important to understand their limitations and to validate the results they produce before using them in important decision-making processes.
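To make the “Hello” example above concrete, here is a toy sketch in Python. It is not how Kaila is implemented: it only counts which word follows which in a tiny, made-up training text and turns those counts into probabilities. The training text and the function names are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

# Tiny made-up "training data"; a real model learns from vastly more text.
training_text = (
    "hello how are you . "
    "hello how are you . "
    "hello there . "
    "how are things ."
)


def train_bigrams(text):
    """Count how often each word is followed by each other word."""
    words = text.split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def next_word_probabilities(following, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    counts = following.get(word, Counter())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


model = train_bigrams(training_text)
probs = next_word_probabilities(model, "hello")

# Print the candidates for the word after "hello", most likely first.
for candidate, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"hello -> {candidate}: {p:.2f}")
```

In this toy text, “hello” is followed by “how” more often than by “there”, so “how” receives the higher probability. A word that never appears in the training text gets no predictions at all, which mirrors why the content of the training data matters so much.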

Kaila can only respond based on what it has been trained on and may not know about anything that is missing from its training data. That’s why preparing the training data properly is crucial for Kaila to respond as accurately and factually as possible. For more information on how to prepare the data, please refer to this article.

There is no data like data

Preparing the training data is the most important step in getting Kaila to work properly and give the most accurate answers possible. The data needs to be properly prepared, and it also needs to be the right type of document. Kaila generally works best with contextually linked data, meaning continuous text. For example, a spreadsheet of questions and answers is harder for Kaila to work with than a long, comprehensive block of text from which it can draw contextual links. This is why choosing the right data and topic is so important.

Which text documents are suitable?

Existing knowledge base content
Textbooks, books
Product or process manuals
User guides
FAQs
Communication channels (Slack, etc.)
Product sheets
Onboarding manuals (user or employees)

Which documents are inappropriate?*

Legal documents
Contracts (so-so)
Laws
Tables (Excels)

* We are constantly improving our product, so Kaila may handle the documents above better in the future. Kaila can already process these documents, but some of the output may be inaccurate or of low quality.

Prepare your training data

Once you have decided on a use case and chosen the right topic for your Kaila Agent, the next step is to collect and prepare the information and materials for your training data.

So, you may be wondering, what can such a text document look like?

For this example, we have selected text data from Wikipedia. The information we have chosen is about Prague, as we are from Prague and we love it. First, we visit the Wikipedia page about Prague. Then we manually copy the entire text, excluding pictures and references. (You can try it yourself!) After that, we open a plain text editor such as Notepad (Windows), TextEdit (Mac), or any code editor, and paste the text there.

No Words!

Please refrain from pasting text into editors such as Google Docs, Microsoft Word, or OpenOffice.

Copied and pasted text data from Wikipedia, ready for training, looks something like this:

In which format should the dataset be saved?

We recommend saving your datasets only in .txt format.
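If you prefer to script the clean-up and save steps described above, the following Python sketch shows one possible approach. The function names, the clean-up rules (removing bracketed citation markers such as [1], trimming whitespace, collapsing blank lines), and the choice of UTF-8 encoding are assumptions made for illustration, not Kaila requirements; adjust them to your own source material.

```python
import re
from pathlib import Path


def clean_pasted_text(raw):
    """Remove common copy-paste residue from text copied off a web page."""
    # Drop bracketed citation markers such as [12] or [citation needed].
    text = re.sub(r"\[(?:\d+|citation needed)\]", "", raw)
    # Replace non-breaking spaces with ordinary spaces.
    text = text.replace("\u00a0", " ")
    # Trim trailing whitespace on every line.
    text = "\n".join(line.rstrip() for line in text.splitlines())
    # Collapse runs of three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"


def save_dataset(text, path):
    """Write the dataset as a plain-text .txt file (UTF-8 is an assumption)."""
    path = Path(path)
    if path.suffix.lower() != ".txt":
        raise ValueError("Save datasets with the .txt extension")
    path.write_text(text, encoding="utf-8")
    return path


raw = (
    "Prague is the capital of the Czech Republic.[1]\n\n\n\n"
    "It lies on the Vltava river.[citation needed]  \n"
)
cleaned = clean_pasted_text(raw)
print(cleaned)
```

Calling `save_dataset(cleaned, "prague.txt")` would then write the cleaned text to a file named `prague.txt` (a hypothetical file name) in the current folder. Always read through the result yourself, since automatic clean-up can miss residue or remove something you wanted to keep.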

If I already have data collected, do I need to edit it?

Long story short – yes. And now the long story. Here are the ground rules for what needs to be done with the data:

Maintain a logical structure of content, with blocks (topic content) and topic areas (headings).
Merge headings and subheadings together. Put the heading, which would normally be in bold or H1, on the first line of the topic content and separate it with a colon from the rest of the content in the area (e.g.: How Kaila works: Kaila is a conversational AI assistant that allows...).
Use as little bulleted text made up of short sentences or single words as possible.
Use as few different definitions of the same subject as possible (e.g.: “Car: used to transport people from place A to place B” or “Car: is a device consisting of an engine, wheels”). It is better to combine the information, like this: “A car is a vehicle that has 4 wheels and transports people from place A to place B.”
If you have information like “CEO: Radim Tvrdon”, we recommend expanding it into a full sentence, such as “Radim Tvrdon is the CEO of WeBoard, the company that created Kaila.”
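Two of the rules above, merging a heading into its block with a colon and expanding terse facts into full sentences, can be sketched in Python as follows. The helper names and the sample section about Prague are hypothetical and only illustrate the idea; real content will usually need manual review as well.

```python
def merge_heading(heading, body):
    """Put the heading on the first line of its block, separated by a colon."""
    heading = heading.strip().rstrip(":")
    # Collapse line breaks and repeated spaces so the block reads as one paragraph.
    body = " ".join(body.split())
    return f"{heading}: {body}"


def expand_fact(label, value, context):
    """Turn a terse 'label: value' pair into a full sentence."""
    return f"{value} is the {label} of {context}."


# Hypothetical sample content: (heading, body) pairs.
sections = [
    (
        "Geography",
        "Prague lies on the Vltava river.\nIt is the capital of the Czech Republic.",
    ),
]

# One block per topic, separated by a blank line.
dataset = "\n\n".join(merge_heading(h, b) for h, b in sections)
fact = expand_fact("CEO", "Radim Tvrdon", "WeBoard, the company that created Kaila")

print(dataset)
print(fact)
```

The first helper keeps each topic in a single block that starts with its heading, and the second gives the model a complete sentence to learn from instead of an isolated label and value.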

Still feeling lost? No problem.

If you don’t know how to do this, we can try to remove the formatting for you in Kaila Studio. Please keep in mind, though, that our product is still at the MVP stage and removing formatting may not always work.

If you have more content you want to add to Kaila Studio, feel free to contact us at our support email and we will help you import all your data.

What about data I have in spreadsheet format? Kaila Studio can’t import spreadsheet formats yet, so please send the spreadsheet to our support email and we will be happy to import it into Kaila Studio for you.

Or try our ready-made dataset!

We have prepared a formatted and cleaned Apple History Dataset to show you how to format your data correctly.

How to create an AI Agent?

If you have followed and completed all the instructions above, now is the time to create your first AI Agent.

How to create AI Agent