Entities are the pieces of information you want to extract from documents, such as an employee name, street, house number, postal code, municipality or net wage.
The settings provide an overview of the entities per document type.
General OpenAI GPTx Instructions
When working with annotation-less models, you can provide extra general instructions to tailor the model's behaviour and improve its performance.
With an existing document type
If you have added an existing document type to your project, the entities associated with it will be loaded automatically. You can enable or disable them with the toggle on the right-hand side.
Clicking on an entity allows you to edit its settings in the third panel.
Part of the settings are managed by the document type owner. If your user account belongs to this organisation, you will be able to edit the settings; otherwise, you will have to contact the organisation that owns the document type.
With a new document type
When you have created a new document type, the entities list will initially be empty. You can start creating new entities by clicking on the "+ Create" button at the top or the "Create" button in the middle section.
Entity classes
After filling in the entity name, one of the following entity classes can be chosen:
- Text - A text entity is an entity that has a value in a textual form. When labelling documents, you will be able to select one or more words to indicate a value for this entity.
- Image - An image entity is for recognising objects such as handwritten text, signatures, ... Labelling an object in a document is done by drawing a rectangle around the object.
- Composite - A composite entity is a group of other entities, e.g. an order line consisting of different entities such as the product name, product number, quantity, price per unit, .... When creating a composite entity you can select the entities that belong to it in the next step.
- Paragraph - A paragraph entity is an entity that has a value in a textual form. Unlike the text entity, a paragraph entity is optimised for longer pieces of text. Do not label full pages or very long spans of text (multiple paragraphs) as paragraph entities; they are meant for labelling single paragraphs or a couple of lines of text. Note that paragraph entities cannot be added to a composite entity. If you need to extract very long spans of text or be able to link paragraph entities to other entities, please contact the Duco team.
- Regex - This is a special entity that is found by searching the document for a match with a regular expression you define. No AI model is used for this. When defining regex entities, keep in mind that Duco Adaptive IDP automatically adds spaces around all punctuation. There are some websites that can be helpful when creating a regular expression.
- Group [closed beta feature] - A group entity is a group of other entities, e.g. an order line consisting of different entities such as the product name, product number, quantity, price per unit, .... Unlike composites, groups have fewer restrictions. You can create a hierarchy of 2 levels or more (groups within groups), groups can overlap with each other, and groups can consist of entities spread across pages.
- Checkbox - A checkbox entity allows you to extract a piece of text together with its linked checkbox and the state of that checkbox. You can use this entity like a normal text entity by annotating the text next to a checkbox. Duco Adaptive IDP will automatically detect the checkbox next to it.
Only use group entities if the composite entities are restricting you. At the moment of writing, the composite entity has a higher accuracy and should be used when possible.
Groups and composites can't be used at the same time within the same document type. You either choose composites or groups.
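Because spaces are added around punctuation, a regular expression written against the original text may fail to match. The sketch below illustrates the idea only: `add_spaces_around_punctuation` is a hypothetical stand-in for the platform's preprocessing (its exact rules are not documented here), and the invoice number format is made up for the example.

```python
import re
import string

def add_spaces_around_punctuation(text: str) -> str:
    """Hypothetical approximation: pad every punctuation mark with spaces."""
    padded = re.sub(f"([{re.escape(string.punctuation)}])", r" \1 ", text)
    # Collapse the runs of whitespace that the padding may have produced.
    return re.sub(r"\s+", " ", padded).strip()

raw = "Invoice INV-2023/0042 dated 12.05.2023"
processed = add_spaces_around_punctuation(raw)

# Written against the original text: no longer matches after preprocessing.
strict = re.compile(r"INV-\d{4}/\d{4}")
# Allows optional whitespace around punctuation: matches both forms.
tolerant = re.compile(r"INV\s*-\s*\d{4}\s*/\s*\d{4}")

print(processed)
print(strict.search(processed))
print(tolerant.search(processed).group())
```

Allowing `\s*` around each punctuation mark keeps the pattern valid whether or not the spaces are present.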
Entity types
If you chose a text entity, you can also set an entity type. This type will be used for validating the value and converting it to a certain format. For example, if you choose the type 'date', Duco Adaptive IDP will validate the value found by the model for this entity and convert it to the format you define yourself. If Duco Adaptive IDP detects a value for this entity that is not a date (conversion and validation failed), the value will be put into the manual intervention module for checking (if this step is enabled).
There are different types for text entities:
- Regular - This is a text type entity. There is no validation or conversion to a particular format.
- Number - This is a numerical entity. Choose the desired input format for decimals and thousands.
- Date - This is an entity of the date type. Choose the desired date output format. For a complete list of supported format strings, see this link.
After choosing the appropriate entity type, you can optionally indicate to which composite entity it belongs below "part of composite".
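The validate-and-convert behaviour of the date and number types can be sketched roughly as follows. This is an illustration under assumptions, not the platform's implementation: the function names are hypothetical, and Python `strptime` format codes are used as a stand-in for whatever format strings the product supports.

```python
from datetime import datetime
from typing import List, Optional

def convert_date(value: str, input_formats: List[str], output_format: str) -> Optional[str]:
    """Return the date in output_format, or None if validation fails."""
    for fmt in input_formats:
        try:
            return datetime.strptime(value.strip(), fmt).strftime(output_format)
        except ValueError:
            continue  # try the next candidate input format
    return None  # not a date: would go to manual intervention if enabled

def convert_number(value: str, decimal_sep: str = ",", thousands_sep: str = ".") -> Optional[float]:
    """Return the number as a float, or None if validation fails."""
    cleaned = value.strip().replace(thousands_sep, "").replace(decimal_sep, ".")
    try:
        return float(cleaned)
    except ValueError:
        return None

print(convert_date("31/12/2023", ["%d/%m/%Y", "%d-%m-%Y"], "%Y-%m-%d"))
print(convert_date("net wage", ["%d/%m/%Y"], "%Y-%m-%d"))
print(convert_number("1.234,56"))
```

A `None` result corresponds to the failed conversion and validation case described above.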
Other entity settings
Next, you can configure the following settings:
- Remove punctuation - This setting removes punctuation from the extracted value. For instance, license plates such as 1-ABC-123 typically contain punctuation that you might want to remove.
  The following symbols are removed when you enable this functionality:
  .,\/#!$%\^&\*;:{}=\-_`~()
- Remove spaces - This setting removes redundant spaces from the extracted value if any are found.
- Data masking - This setting replaces the real value of the entity with generated fake values in the training data. Fake values can be generated as one of the following:
  - First Name
  - Last Name
  - Full name
  - Street address
  - Street name
  - Zip code
  - City
  - Country
  - Regular expression - Values will be generated based on a regular expression. There are some websites that can be helpful when creating a regular expression.
  - Number - Values will be generated between the min and max values with a given precision.
    - Min - Lower bound of the range for generated fake values
    - Max - Upper bound of the range for generated fake values
    - Precision - The number of digits after the decimal point
- Merge occurrences - This will merge all occurrences of the entity into a single field for the output.
- Required - This setting determines whether the entity is required to be identified. If a mandatory entity is not found by the entity extraction model, the document will end up in the validation queue.
- Occurrences - This is the minimum number of occurrences expected for this entity. The default is 1. When the entity is a child entity of a composite, the minimum number of occurrences is calculated for each composite separately.
  When the maximum number of occurrences is filled in, some extra logic is applied to the parsed values (if there is no parsed value, the value as extracted by OCR is used):
  - It checks all the occurrences of the entity.
  - If you set max occurrences to 1, it tries to keep a single value.
  - If all of the parsed values are the same, it keeps one value; no issue is created and the upload is not sent to human validation.
  Now suppose you had the following occurrences for your entity:
  1. 20655061
  2. 20655062
  3. 20655062
  In this case, a single value is still kept, because the logic picks the most frequent value, here '20655062'.
  Now suppose you had the following occurrences for your entity:
  1. 20655061
  2. 20655062
  3. 20655063
  In this case, the extra logic cannot decide between the 3 occurrences, so it creates an issue and the upload is sent to human validation.
- Color - The color of the entity as it will be indicated in a document. Click on the square to change the color.
  - An entity will be marked in the chosen color if the entity was tagged in the training/labeling module or the manual intervention module by a user. If the entity value is recognized by the entity extraction AI model, the same color will be displayed in a more transparent styling.
- Override threshold - Setting an individual threshold for an entity.
- OpenAI GPTx Instructions - Extra extraction instructions you can set if the annotation-less model doesn't give you good predictions. This field is only visible if you enable this feature in the project settings: #openai-gptx.
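Two of the settings above, Remove punctuation and the max occurrences logic, can be sketched in a few lines. This is a minimal illustration, not the platform's actual code: the function names are hypothetical, the regular expression is simply the documented symbol list wrapped in a character class, and the handling of ties and empty input is an assumption because only the two worked examples are documented.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple

# The documented symbol list, wrapped in a character class.
PUNCTUATION = re.compile(r"[.,\/#!$%\^&\*;:{}=\-_`~()]")

def remove_punctuation(value: str) -> str:
    return PUNCTUATION.sub("", value)

def resolve_single_occurrence(values: List[str]) -> Tuple[Optional[str], bool]:
    """Return (kept_value, needs_human_validation) for max occurrences = 1."""
    if not values:
        # Assumption: nothing to resolve; missing values are covered by
        # the Required and Occurrences settings instead.
        return None, False
    ranked = Counter(values).most_common()
    if len(ranked) == 1:
        return ranked[0][0], False  # all parsed values are identical
    if ranked[0][1] > ranked[1][1]:
        return ranked[0][0], False  # one value occurs strictly most often
    # Assumption: any tie for first place is treated as undecidable.
    return None, True

print(remove_punctuation("1-ABC-123"))
print(resolve_single_occurrence(["20655061", "20655062", "20655062"]))
print(resolve_single_occurrence(["20655061", "20655062", "20655063"]))
```

The second and third calls reproduce the two worked examples: the first keeps '20655062', the second cannot decide and flags the upload for human validation.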