Frequently asked questions
Which documents are taken into account for training?
All training documents with the status Processed are taken into account for training. Documents that you have sent from production to training are taken into account as well, once you have validated them and marked them as done. You can do this by going to the task module and creating a suggested production task; validating documents there changes their status from Input required to Processed. If you are interested in specific documents in the Input required status, you can filter on these while creating a custom task.
More details:
- about training in Train your models
- about creating tasks in Tasks
How many documents do I need to train a model?
Document classification
For document classification, you need at least five documents for each language for each document type (for which the recognition type is set to A.I. model), and you need a minimum of two document types. Documents for a certain language and document type are discarded if this minimum requirement isn't met. For example:
Document type 1:
- 7 NL documents
- 3 FR documents
- 5 EN documents
Document type 2:
- 5 NL documents
- 4 FR documents
- 3 EN documents
The FR documents will be removed, since there are fewer than five documents for each document type. The EN documents will also be removed: even though there are five EN documents of type 1, at least five documents of two distinct document types are needed, and there are only three of document type 2. As a consequence, the model will only learn to make predictions for NL documents. To include the other languages, more data for those languages is needed. At the end of the training, a warning is shown in the UI listing the documents that were removed.
In the scenario above but without the NL documents, the training would fail, since there would not be sufficient data for any of the languages.
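As a sketch of this selection rule, the following Python snippet (an illustration, not Metamaze's actual implementation) reproduces the example above: a (language, document type) group is kept only if it contains at least five documents and its language retains at least two document types.

```python
from collections import Counter

MIN_DOCS = 5        # minimum documents per (language, document type)
MIN_DOC_TYPES = 2   # minimum distinct document types per language

def select_training_groups(docs):
    """Keep a (language, document type) group only if it has at least
    MIN_DOCS documents and its language keeps at least MIN_DOC_TYPES
    document types. `docs` is a list of (language, document_type) pairs."""
    counts = Counter(docs)
    kept = []
    for language in {lang for lang, _ in docs}:
        valid_types = [doc_type for (lang, doc_type), n in counts.items()
                       if lang == language and n >= MIN_DOCS]
        if len(valid_types) >= MIN_DOC_TYPES:
            kept.extend((language, doc_type) for doc_type in valid_types)
    return kept

docs = ([("NL", "type 1")] * 7 + [("FR", "type 1")] * 3 + [("EN", "type 1")] * 5 +
        [("NL", "type 2")] * 5 + [("FR", "type 2")] * 4 + [("EN", "type 2")] * 3)
print(select_training_groups(docs))  # only the NL groups survive
```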
If the minimum required number of documents is present in the training data, the training will succeed, but the resulting model will not necessarily be accurate. The number of documents you need to obtain an accurate model depends on several factors, all related to the amount of variation in the training data:
- the number of languages: the more languages, the more data is needed
- the number of different document types: the more document types, the more data is needed. Try to have similar numbers of documents for each document type. We correct for class imbalance, but if one of your document types is very infrequent compared to the others (e.g. 5 vs. 500 examples), it will not be learned well.
- the similarity between document types: if two document types are very similar, the model will struggle to distinguish between them. More data can remedy this.
- the amount of variation within one single document type: some document types are a bit vague and contain a variety of different kinds of documents. For instance, some projects have a document type "Other" or "Irrelevant", which contains a bit of everything. Especially if these vague categories contain documents that are very similar to other document types, the model will struggle to learn them. More data can remedy this. It may also be necessary to split vague classes into several classes.
It is impossible to determine the exact number of documents needed to train a performant model, as each dataset is different. Data quality also matters more than data quantity: training a model on 100 well-labeled documents will give far better results than training a model on 1,000 badly labeled documents.
Page management
To train a page management model, you need, for each language, at least five documents with more than one page of the document type that requires page management. If there are fewer than five such documents for a certain language, all data for that language is discarded. As a consequence, the model will not be able to make predictions for that language.
The quality of the data is paramount: make sure that the pages of all the documents are in the correct order, and that all files are correctly split. If your documents have page numbers, double check that each document starts with page number one and that no other page in the document has page number one. Mark all incomplete documents as failed to make sure they are not included in the training data.
Entity extraction
To train an entity extraction model, at least two annotated documents are needed, and at least one entity has to be present on both documents. All entities that are not present on at least two documents are removed from the training data, and the model will not be able to predict those. Entities that are very infrequent compared to other entities are also removed: for instance, if the data contains 500 annotations for entity A but fewer than 10 for entity B, entity B is removed. We do this because we have experimentally found that very infrequent entities behave as noise for the frequent ones: the model struggles to learn all entities, even the frequent ones. This is less of an issue if all entities are infrequent (for instance because you have a very small dataset); in that case, usually all entities are kept. At the end of a training, an overview of which entities were removed from which documents, if any, is shown in the warnings. This information allows you to decide which data to add to the training data to improve your model.
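The sketch below illustrates the two removal rules. It is illustrative only: the 1:50 frequency ratio is an assumption made up for this example, not Metamaze's documented threshold.

```python
from collections import Counter

def kept_entities(annotations, min_docs=2, max_ratio=50):
    """Keep entities that occur on at least `min_docs` documents and are
    not extremely rare compared to the most frequent entity.

    `annotations` maps a document id to the list of entity names
    annotated on that document. The `max_ratio` threshold is an
    assumption for illustration."""
    doc_counts = Counter()     # number of documents an entity occurs on
    total_counts = Counter()   # total number of annotations per entity
    for entities in annotations.values():
        doc_counts.update(set(entities))
        total_counts.update(entities)

    most_frequent = max(total_counts.values(), default=0)
    return {entity for entity in total_counts
            if doc_counts[entity] >= min_docs                       # on at least two documents
            and total_counts[entity] * max_ratio >= most_frequent}  # not extremely rare
```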
It is possible to train performant entity extraction models with very little data, thanks to the Metamaze few-shot training pipeline. This pipeline is enabled automatically when there is very little training data; you do not need to do anything to use it.
If you only have simple entities, it is possible to train a model with an f1-score of more than 70% on 10 documents. If your data is very uniform (for instance if there is little variation in layout), even higher accuracies (>90% f1-score) can be obtained with 10 documents.
If your data contains composite entities, slightly more data is needed to obtain similar results. As a rule of thumb: the more complex the composite entities, the more data you will need. For instance, if you only have one type of composite, which occurs multiple times on each document (e.g. order lines on purchase orders), and the members of the composite entity are usually in the same order, it is possible to obtain an f1-score of more than 70% with 10 documents. The more different types of composites you have, the more data you will need to reach similar results. You will also need more data if the composite entities do not occur on every document, occur at most once per document, or contain a lot of variation themselves (a variable number of members, different orderings of members, etc.). Even on very hard datasets, it is possible to obtain a 50% f1-score with 30 to 50 documents. This allows you to speed up the labeling process by making use of model-assisted labeling and active learning.
When getting started with entity extraction, we recommend labeling 10 documents if you do not have any composite entities, and 30-50 if you do, before triggering the first training. Make sure this initial dataset is of high quality: the better the annotations, the better the initial model will perform (see Guidelines to annotate correctly). If you have more data available, make sure that it is uploaded and has a document type and language assigned before triggering the training. Thanks to Active Learning, the best next set of documents to label will be selected from this unlabeled data. The selected documents will already have predictions when you start labeling them, which speeds up the labeling process. Correct the predictions where needed, and add missing entities for about 50-100 new documents before triggering a new training. Repeat this process until your model is accurate enough to deploy in production.
Which documents are considered when training a document classification model?
All training documents that are not marked as 'Failed' or 'Input required' and for which both the language and the document type are set.
When should I trigger a new training?
To achieve the best results when using Metamaze, it is important to know when to trigger a new training. There are several scenarios when a new training should be triggered, and we have provided our recommendations below.
- For the first training:
- If only simple entities are being processed, 10 labeled documents should suffice.
- For composite entities, at least 30 labeled documents are recommended. However, the number of labeled documents required may vary based on the complexity of the documents.
- The default training mode should always be an incremental training, unless there are situations that block incremental training.
- If there is no improvement in model accuracy after 5 incremental trainings, trigger a full training.
- When more than 20% of the data has been removed:
- If the data was removed because it's bad data that the model should forget about, trigger a full training.
- If the data was removed for other reasons, such as data retention, and the model should not forget it, trigger an incremental training.
- When more than 10% of the documents have been changed, trigger a full training.
These recommendations are designed to help ensure that your models are trained effectively and efficiently.
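The recommendations above can be summarised in a short sketch. It is purely illustrative: the numbers encode the heuristics from this list, and the actual choice is made by you when triggering a training in the Metamaze UI.

```python
def choose_training_mode(incremental_runs_without_improvement,
                         fraction_removed, removed_is_bad_data,
                         fraction_changed):
    """Illustrative encoding of the recommendations above."""
    if incremental_runs_without_improvement >= 5:
        return "full"
    if fraction_removed > 0.20 and removed_is_bad_data:
        return "full"      # the model should forget the removed bad data
    if fraction_changed > 0.10:
        return "full"
    return "incremental"   # the default training mode

# Data removed for retention reasons should not be forgotten:
print(choose_training_mode(0, 0.25, removed_is_bad_data=False,
                           fraction_changed=0.05))  # -> incremental
```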
When do we use which entity type?
While setting up the entities for your project, you may be unsure which entity type to use.
The main differentiator is whether your information is an image or text. If it is text, you have the following options:
- The regular type is sufficient in all cases except dates and numbers.
- For a date, choose the date type. After extraction, dates are parsed to a common date format that you can specify (e.g. DDMMYYYY).
- If the information you would like to extract is numeric, choose the number type. Number allows you to choose a decimal or thousand separator; a parsing rule puts the extracted information in a specific decimal and/or thousand separator format.
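As a rough illustration of what these parsing rules do, here is a Python analogy (not the platform's implementation; note that Metamaze's DDMMYYYY format corresponds to Python's %d%m%Y):

```python
from datetime import datetime

# Date parsing: an extracted value is normalised to the configured
# output format (DDMMYYYY corresponds to "%d%m%Y" here).
extracted_date = "March 3, 2023"
parsed = datetime.strptime(extracted_date, "%B %d, %Y")
print(parsed.strftime("%d%m%Y"))  # -> 03032023

# Number parsing: the configured thousand and decimal separators are
# normalised (European-style separators in this example).
extracted_number = "1.234,56"
print(float(extracted_number.replace(".", "").replace(",", ".")))  # -> 1234.56
```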
What if a document was already annotated by someone?
Good news, you are done with annotating! You can update annotations done by someone else if needed. More details can be found in Annotation of training data.
How does parsing work?
Parsing is done after annotating or extracting the information you need. If you change the parsing rules in your project settings, the change is only applied to new annotations and extractions. Parsing is not taken into account for training.
Can I check the annotations done by a colleague?
Yes, you can do so by creating a 'custom task' and setting up the filter 'annotated by users'. You can select the colleague in the list. If you are looking for specific documents, you can also set up additional filters, such as date.
More details about creating a task can be found in Tasks.
When do we need to do what?
The steps to follow to make progress in your project are listed in Overview of Project Steps.
What browsers do you support?
We support Chromium based browsers (Google Chrome, Microsoft Edge, Brave, Opera, ...), Firefox and Safari.
I'm trying to adjust annotations but I get the error message "Overlap with another label".
Entities are not allowed to linearly overlap with each other: a word belonging to one entity cannot be contained inside another entity. This can happen when annotating data in columns and tables: sometimes you need to group words that are contained in the same column, but the OCR service reads them row by row, which causes words of other columns to be contained in the column you are annotating. If you get the overlap error when annotating data, click on the first and the last word of the entity you wish to label; the selected text will be highlighted. This makes it easier to see which other entity the overlap is with, and to decide whether you can solve the conflict. If you cannot solve it, mark the document as failed in the case of training data, as the model will not be able to learn it properly. For production data, you can manually enter the entities that you cannot label.
How long does it take to process an upload?
Processing time for a document depends on a number of factors:
- Total number of pages in your upload
- Amount of text in your upload
- How many ML models you use: OCR, document classification, entity extraction, enrichments
- Processing load and lag of the whole Metamaze platform.
Taking the above into account, on average it takes 2 minutes to process an upload.
Everything that is processed within 15 minutes is considered normal processing time.
How do I correctly configure page management?
The answer to this question depends on your use case. Even with the option "Train a page management A.I. model to merge/split documents automatically" enabled, you can obtain different processing flows, depending on your needs.
All possible scenarios are detailed below:
My project has multiple document types, and each uploaded file is a document
In this case, you need document classification, but not page management. Do not activate it in the project settings.
My project has multiple document types and each uploaded file can contain more than one document, but max. one document per document type
This happens for instance if one single uploaded file contains a payslip, a purchase contract, and a credit agreement, but never more than one of each type.
For this scenario, you need both document classification and page management. Enable the "Train a page management A.I. model to merge/split documents automatically" option when creating your project. However, do NOT train any page management models: splitting the uploads is straightforward and no model is needed for it, since during the document classification step of the processing pipeline, each page in the upload is assigned a document type, and all pages of the same type are subsequently combined into one document. If you do train the page management models, errors will inevitably be introduced in the processing pipeline, since AI models are rarely 100% accurate. These errors can easily be avoided by using the default page management flow, which simply groups pages of the same type into documents.
My project has multiple document types, each uploaded file can contain more than one document, and can have more than one document of the same type
This happens for instance if one single uploaded file contains three payslips, a purchase contract, and a credit agreement, and you need to know how many payslips there are in the upload.
For this scenario, you need both document classification and page management models. Enable the "Train a page management A.I. model to merge/split documents automatically" option when creating your project. Only train the page management models for those document types for which there can be more than one document in one single upload. Do not train the other page management models, as they are not needed. If you do train and deploy them, errors will be introduced in the processing pipeline, since AI models are rarely 100% accurate. If you do not train and deploy a model for a certain document type, the default page management flow is used: all pages of the same document type are grouped into one document.
My project has one document type and each file can contain more than one document, all of the same type
This happens for instance if one uploaded file can contain a multitude of invoices, and you need information to be extracted from each invoice separately.
For this scenario, you only need page management, no document classification. Train the page management model for your document type.
Auto-scaling and cold starts
Metamaze uses auto-scaling, which has the cold start characteristic. Models are unloaded automatically after 10 minutes of inactivity. If you do a new upload after unloading, the model needs to be loaded again, which can take up to 5 minutes. Depending on the available space on the cluster, a new node might need to be added, which can take up to 15 minutes; this depends only on the availability of Azure in their West Europe data center. Once the system is scaled, requests will go a lot faster.
If your use case requires faster or synchronous processing, it is possible to prevent downscaling so that the model is always available. If you believe that is necessary for your use case, please contact support@metamaze.eu to discuss options and pricing.
Buffering and queueing
Metamaze queues all incoming uploads automatically using a global FIFO principle (first in, first out). Due to the nature of our partitioning, however, the processing order of uploads is not guaranteed.
Downtime or other problems?
To check the current status of Metamaze, you can look at the public status page where you can also subscribe to updates about the status. If uploads are stuck for abnormally long (e.g. 1 hour), please contact support at support@metamaze.eu.
How can I get the output documents from Metamaze in UIPath?
Since there is no out-of-the-box integration with UIPath and our REST API output integration can't really be used for this, you will have to fetch the documents yourself.
You can use the following Metamaze API call: https://app.metamaze.eu/docs/index.html#tag/Regular-processing/operation/ProcessStatusAndRetrieveResultsGet
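A minimal polling sketch is shown below. The base URL, path, authentication header and status value are assumptions based on the linked API reference and the send-to-training path further down; verify them against the docs before use. The generous timeout leaves room for cold starts (see "Auto-scaling and cold starts" above).

```python
import time
import requests

def fetch_results(organisation_id, project_id, upload_id, api_token,
                  timeout_s=25 * 60, poll_interval_s=30):
    """Poll the process-status endpoint until the upload is processed.

    Base URL, path, auth header and status value are assumptions;
    check the linked API reference for the exact contract."""
    url = (f"https://app.metamaze.eu/api/organisations/{organisation_id}"
           f"/projects/{project_id}/upload/{upload_id}")
    headers = {"Authorization": f"Bearer {api_token}"}
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        response = requests.get(url, headers=headers, timeout=60)
        response.raise_for_status()
        body = response.json()
        if body.get("status") == "processed":  # assumed terminal status value
            return body
        time.sleep(poll_interval_s)
    raise TimeoutError(f"Upload {upload_id} was not processed within {timeout_s}s")
```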
How can I send documents of a failed upload to training?
There are two options: you can download the document and upload it manually in training, or you can use the following Metamaze API call: https://app.metamaze.eu/docs/index.html#tag/Regular-processing/paths/~1organisations~1%7BorganisationId%7D~1projects~1%7BprojectId%7D~1upload~1%7BuploadId%7D~1send-to-training/post
Make sure you set the "includeFailedDocuments" parameter when using the API call.
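A minimal sketch of that call is shown below. The base URL and auth header are the same assumptions as in the retrieval sketch above, and whether "includeFailedDocuments" is a body field or a query parameter should be verified against the API reference.

```python
import requests

def send_failed_upload_to_training(organisation_id, project_id, upload_id, api_token):
    """Send a failed upload to training, including its failed documents.

    The path follows the send-to-training operation linked above; the
    placement of "includeFailedDocuments" is an assumption."""
    url = (f"https://app.metamaze.eu/api/organisations/{organisation_id}"
           f"/projects/{project_id}/upload/{upload_id}/send-to-training")
    response = requests.post(
        url,
        headers={"Authorization": f"Bearer {api_token}"},
        json={"includeFailedDocuments": True},  # assumed to be a body field
        timeout=60,
    )
    response.raise_for_status()
    return response
```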
If you configure parsing settings with the new configurable parsing and have a date entity configured with an output date format of YYYY/MM/DD, does the configurable parser have an impact on the output date format?
No, parsing is only applied to the input (the entity value on the document). The output date format will be respected, in this case YYYY/MM/DD.
How is document classification done when using a document classification model and document types configured with regexes?
- First, documents are classified based on the document types configured to use regexes
- Once the regex document classification is done, the remaining documents are classified using the deployed document classification model
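Conceptually, the pipeline behaves like the following sketch (illustrative only; the regex patterns are hypothetical examples, as the real ones are configured per document type in the project settings):

```python
import re

# Hypothetical regex-based document types; in Metamaze these are
# configured in the project settings, not in code.
REGEX_DOC_TYPES = {
    "payslip": re.compile(r"\bpayslip\b", re.IGNORECASE),
    "invoice": re.compile(r"\binvoice\b", re.IGNORECASE),
}

def classify(document_text, model_predict):
    """Regex-configured document types are tried first; only when no
    regex matches is the deployed classification model consulted."""
    for doc_type, pattern in REGEX_DOC_TYPES.items():
        if pattern.search(document_text):
            return doc_type
    return model_predict(document_text)
```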
What is the difference between entity, composite & aggregated entity?
- Entity - contains text that has been extracted
- Composite - defines a group of entities related to each other
- Aggregated entity - similar to an entity, but also includes postprocessing combining several project settings:
  - minOccurrences
  - maxOccurrences
  - mergeOccurrences
  - parsing (date, number, ...)
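The following sketch illustrates how these settings could combine. The function signature and the order in which the settings are applied are assumptions for illustration, not the platform's implementation.

```python
def aggregate_entity(values, min_occurrences=1, max_occurrences=None,
                     merge_occurrences=False, parse=lambda value: value):
    """Hypothetical combination of the aggregated-entity settings."""
    parsed = [parse(value) for value in values]       # parsing (date, number, ...)
    if len(parsed) < min_occurrences:                 # minOccurrences
        raise ValueError("Too few occurrences of the entity")
    if max_occurrences is not None:                   # maxOccurrences
        parsed = parsed[:max_occurrences]
    if merge_occurrences:                             # mergeOccurrences
        return " ".join(str(value) for value in parsed)
    return parsed
```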
When is a document sent to training automatically?
Documents are sent to training if one of the following criteria is satisfied:
- The document type was manually updated AND the newly set document type has the recognition type set to AI model
  - Recognition types of "Regex" and "None" are ignored
- An entity annotation that was predicted by an AI model was updated, validated (e.g. accepted, but with a confidence lower than the threshold) or removed
  - Regex entity types and manual annotations are ignored
- A relation was created, updated, validated or removed
- Page management was performed manually at any step in the pipeline (only for projects that have page management enabled)
Can I look up documents based on text on a document?
Yes: you can use the filters "Entity" and "Entity value" to find documents based on text. For now, you can't search for arbitrary text in a document; you are limited to entity values.