The data annotation process
How to make sure you deliver the best model with the least effort
Context
As a Duco Adaptive IDP project manager, your aim is to understand the task at hand in detail, including all relevant terminology, document types and nuances.
If there is a manual process that you are replacing with Duco Adaptive IDP, we strongly urge you to spend time in the field with the data experts who are currently processing the documents. They can inform you of nuances, ambiguity or business knowledge that they have encountered before. This can save you a large amount of time.
Make sure you understand the basics of the project before you start:
- What is your end goal?
- What are you trying to extract or classify?
- Where does the raw data come from?
- Is it representative?
- Which business rules need to be applied on the output?
- How is the output used?
When the whole team clearly understands what the goal is, you can start with our nonlinear approach to data annotation and see what the possibilities are.
1. Exploration stage
We first need to get to know the data. We look into the data together with a data expert and ask a lot of questions. We have to completely understand what the goal of the extraction or classification is. After understanding the use case and analysing a representative set of documents, together with the customer we define the names of the entities and document types.
2. Annotation guidelines
In the next step we have to write exhaustive rules that we will use for annotating. As explained in the data annotation guidelines, good guidelines leave no room for interpretation and include examples.
Let's look at an example of annotation guidelines for the name of the buyer:
Annotate the whole name, e.g. John Smith or María Dolores Carmen Rodríguez Martínez, as one entity. Do not include titles (Mr, Mrs, Dhr …). If the buyer is a company, annotate the name of the company.
And for the invoice date:
Annotate the whole date including the year but without the day of the week, e.g. 04/10/2018. Annotate it in whichever format it appears: 2018-10-04, 4th of October 2018, 4. 10. 18, etc. Do not forget to annotate all occurrences in the relevant context.
The secret to creating good annotation guidelines is to anticipate the edge cases in the documents and to describe them in the guidelines. Some documents may contain "errors", and the guidelines need to say what to do with those documents.
Also include general guidelines: for example, whether to include trailing whitespace, punctuation marks, etc. You expect everyone to annotate in the same manner, so provide very clear instructions, even when something seems obvious.
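General rules like these are easy to check automatically once annotations exist. The following is a minimal sketch, not part of the product: the function name, the label name buyer_name and the title list are assumptions made for illustration, and the title list only covers the three titles named in the example guideline above.

```python
# Hypothetical title list; the example guideline leaves its own list open-ended.
TITLES = ("mr", "mrs", "dhr")


def check_span(label: str, text: str) -> list[str]:
    """Return the guideline violations found in one annotated span."""
    problems = []
    # General guideline: no leading or trailing whitespace in a span.
    if text != text.strip():
        problems.append("leading or trailing whitespace")
    # Buyer-name guideline: the span must not start with a title.
    if label == "buyer_name":
        words = text.split()
        first = words[0].rstrip(".").lower() if words else ""
        if first in TITLES:
            problems.append("title included in name")
    return problems


print(check_span("buyer_name", "Mr. John Smith"))
print(check_span("invoice_date", "04/10/2018"))
```

A check like this does not replace clear guidelines, but it can catch slips early, before many documents have been annotated.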
3. Annotating
Start small
After writing the annotation guidelines it is time to start annotating. And here the iterations begin. We advise you to first look at 10 documents and try to annotate them using the annotation guidelines. If at any point you have to make an additional decision on how to annotate a certain entity, add it to the annotation guidelines. If you notice that one of the guidelines doesn't make sense, now is the best time to adjust it.
Let's return for a moment to the example of entity extraction from invoices. When annotating the first couple of documents, we noticed that names sometimes span several lines. So we decided to add the following to the annotation guidelines:
If the name appears on two or more lines, annotate each line separately.
We also noticed that a date sometimes appears in the header or footer, but for different reasons: it may be the invoice date, or it may only be the date the document was printed. We decided to describe this edge case in the annotation guidelines:
Be careful, if a date appears in the header/footer due to the print, do not annotate it. Annotate dates in the header/footer only if the context explicitly mentions that the date is the invoice date.
It is important to know that if you change the annotation guidelines later on, in the middle of the annotation process, you may have to re-annotate all the documents annotated so far. That is why it is of the utmost importance to define unambiguous annotation guidelines.
Dry-run your guidelines
After annotating several documents without needing to update the guidelines, it is time for the next step. Ask a colleague or two to help you annotate around 50 documents. Check whether, after everyone has seen enough documents, the annotation guidelines are still clear and leave no room for interpretation.
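One way to check whether the guidelines leave room for interpretation is to let two people annotate the same documents and compare the results per entity. The sketch below is an illustration under our own assumptions: the span representation (document id, label, start offset, end offset) and the agreement measure (identical spans divided by all spans marked by either annotator) are our choices here, not a prescribed metric.

```python
from collections import defaultdict

# Assumed span representation: (doc_id, label, start, end).
Span = tuple[str, str, int, int]


def agreement_per_label(a: set[Span], b: set[Span]) -> dict[str, float]:
    """Per label: spans marked identically by both / spans marked by either."""
    both = defaultdict(int)
    either = defaultdict(int)
    for span in a | b:
        label = span[1]
        either[label] += 1
        if span in a and span in b:
            both[label] += 1
    return {label: both[label] / either[label] for label in either}


annotator_a = {
    ("doc1", "buyer_name", 0, 10),
    ("doc1", "invoice_date", 40, 50),
    ("doc2", "invoice_date", 5, 15),
}
annotator_b = {
    ("doc1", "buyer_name", 0, 10),
    ("doc1", "invoice_date", 40, 50),
    ("doc2", "invoice_date", 5, 20),  # different end offset than annotator A
}
scores = agreement_per_label(annotator_a, annotator_b)
```

In this example the two annotators fully agree on buyer_name but only on one of three distinct invoice_date spans, which suggests the date guideline is the one to revisit.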
Now that you clearly understand the domain, articulate it to the team of annotators repeatedly. If needed, provide training to improve the accuracy of annotations. If at any point the team of annotators cannot annotate a document based on the annotation guidelines, they need to let you know so you can update the guidelines. It is also important for them to regularly check the guidelines for possible changes.
How to divide work between multiple people
We have found that the best way to get accurate and consistent labels is to assign specific document types to specific owners. This has the following benefits:
- Consistency through broader exposure. Because one document type is handled by only one person, that person learns the nuances of that document type well.
- Speed. Every document type requires a warm-up period to get to know the label names, the shortcuts, the document layouts, the different ways of writing something, the annotation guidelines, and so on. Having dedicated people per document type maximises their efficiency.
- Unconscious consistency. Even if the responsible annotator is not aware of the unconscious and ambiguous choices they make, we can expect them to at least make those choices consistently.
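If document types differ a lot in volume, you may want to balance the workload while still keeping one owner per document type. The following is a small sketch of one possible way to do that, assuming you know roughly how many documents each type has. The function name, the annotator names and the counts are made up for illustration.

```python
import heapq


def assign_owners(doc_counts: dict[str, int], annotators: list[str]) -> dict[str, str]:
    """Give every document type exactly one owner, largest types first,
    always picking the annotator with the smallest workload so far."""
    heap = [(0, name) for name in annotators]
    heapq.heapify(heap)
    owners = {}
    by_volume = sorted(doc_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for doc_type, count in by_volume:
        load, name = heapq.heappop(heap)
        owners[doc_type] = name
        heapq.heappush(heap, (load + count, name))
    return owners


owners = assign_owners(
    {"invoice": 500, "receipt": 300, "contract": 250},
    ["ana", "ben"],
)
print(owners)
```

Here one annotator owns the largest document type and the other owns the two smaller ones, so nobody has to switch between guidelines more than necessary.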
4. Optimisation of the annotation process
Train a first model
After the initial set of data is annotated, it is possible to train the first version of the model. When the training is finished, suggested annotation and review tasks are created, which include annotation suggestions. Performing these tasks makes the labelling process considerably faster: you only label the documents from which the model can learn the most, and many of the suggestions will already be correct, so you simply need to accept them. The more data you annotate, the more data the model is trained on, the more accurate the suggestions are and the more time the annotators save. Including a model early in the annotation process also allows you to easily determine when a model is ready to go to production.
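To give an intuition for "documents from which the model can learn the most": one common technique is to prioritise the documents where the model is least confident about its own suggestions. The sketch below illustrates that idea only. It is an assumption for the sake of explanation, not a description of how the tasks are actually selected, and the function name and confidence values are hypothetical.

```python
def least_confident(doc_confidences: dict[str, list[float]], k: int) -> list[str]:
    """Return the k document ids whose weakest suggestion has the lowest confidence."""

    def score(doc_id: str) -> float:
        confidences = doc_confidences[doc_id]
        # A document without any suggestion is treated as most uncertain.
        return min(confidences) if confidences else 0.0

    # Sort by uncertainty first, then by id to keep the order deterministic.
    return sorted(doc_confidences, key=lambda d: (score(d), d))[:k]


to_label = least_confident(
    {
        "doc1": [0.99, 0.97],
        "doc2": [0.95, 0.41],
        "doc3": [0.60, 0.88],
    },
    k=2,
)
print(to_label)
```

In this example doc1 can wait, because every suggestion on it is already very confident, while doc2 and doc3 are worth an annotator's time.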
Using the task module for problematic entities
However good your annotation guidelines are, you will sometimes find that annotations are inconsistent or incomplete, or need to change because of some new insight. In this case you can create a custom task, which allows you to quickly find the relevant subset of documents and iterate through them to correct the specific field.
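Conceptually, such a task boils down to a filter over the existing annotations. The sketch below shows the idea on plain Python data. It does not use the task module itself, and the Annotation class, the function name and the example rule are hypothetical. The rule flags invoice dates that include a day of the week, which the example guideline for dates does not allow.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Annotation:
    doc_id: str
    label: str
    text: str


def docs_to_review(
    annotations: list[Annotation],
    label: str,
    is_suspect: Callable[[str], bool],
) -> list[str]:
    """Return ids of documents that have a suspect annotation for the given label."""
    doc_ids = []
    for ann in annotations:
        if ann.label == label and is_suspect(ann.text) and ann.doc_id not in doc_ids:
            doc_ids.append(ann.doc_id)
    return doc_ids


DAYS = ("monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday")


def has_weekday(text: str) -> bool:
    return any(day in text.lower() for day in DAYS)


annotations = [
    Annotation("doc1", "invoice_date", "Thursday 04/10/2018"),
    Annotation("doc2", "invoice_date", "2018-10-04"),
    Annotation("doc3", "buyer_name", "Thursday Ltd"),
]
review = docs_to_review(annotations, "invoice_date", has_weekday)
print(review)
```

Only doc1 ends up in the review list: doc2 follows the guideline, and doc3 matches the rule but on a different entity.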
Conclusion
Data annotation can be a boring and time-consuming job, but it can also be efficient and fast if you know how to tackle the problem correctly.
As our customer, you decide how much you want to be involved in the annotation of your data. If you decide to annotate the data yourself, we will support you throughout the whole process. We offer data annotation workshops in which we find the best solutions for annotating your specific data, and we help with writing the annotation guidelines and performing quality checks once annotation has started. On the other hand, some customers want to outsource the whole task, and we take over not only the downstream ML tasks but also the whole process of data annotation.