Overview of the Pipeline
The Duco Adaptive IDP pipeline
Duco Adaptive IDP automates the processing of documents and e-mails in a range of formats, including PDF files, Word files, images, and scans. Every document that is processed goes through the following steps:
- Input - Ingesting the document
- OCR - Pre-processing of a document, converting images to text
- Document creation - splitting/merging pages via page management, recognizing document types via classification
- Extraction - Extracting the desired text fields (entities) on the document or e-mail
- Validation - Validation of business rules
- Enrichments - look-ups and validation of information via external sources or custom logic
- Output - to your desired system
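As a rough mental model, the steps above can be read as an ordered chain that every document passes through. The sketch below is purely illustrative: `Document`, `STEP_NAMES`, `make_step` and `run_pipeline` are hypothetical names and not part of any Duco Adaptive IDP API. The `skip` argument reflects that some steps are optional.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

# Step order as listed in this overview; the identifiers are hypothetical.
STEP_NAMES = [
    "input",
    "ocr",
    "document_creation",
    "extraction",
    "validation",
    "enrichments",
    "output",
]


@dataclass
class Document:
    name: str
    completed_steps: List[str] = field(default_factory=list)


def make_step(step_name: str) -> Callable[[Document], Document]:
    """Return a placeholder step that only records that it ran."""
    def step(doc: Document) -> Document:
        doc.completed_steps.append(step_name)
        return doc
    return step


def run_pipeline(doc: Document, skip: Iterable[str] = ()) -> Document:
    """Run every step in order, leaving out the steps named in `skip`."""
    skipped = set(skip)
    for step_name in STEP_NAMES:
        if step_name in skipped:
            continue
        doc = make_step(step_name)(doc)
    return doc
```

For example, `run_pipeline(Document("loan.pdf"), skip={"extraction"})` models a document type for which no entities are configured.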
Document Processing Flow
Input - Ingesting A Document
Duco Adaptive IDP supports a number of options to ingest data out of the box. For a full list of input integrations, see Input.
You can find more information on how to upload documents in Input - Ingesting A Document.
For a list of supported file types, see Supported File Formats.
OCR - Pre-processing Of A Document
OCR is performed on every document that is uploaded to Duco Adaptive IDP, converting all images to text. Preprocessing steps such as rotating pages are performed automatically.
E-mails are rendered as a thread, as you would see them in an e-mail application.
For more information about OCR, please see OCR.
Document Creation - Page Management & Document Classification
Page management is the process of splitting and merging the original pages of the files into documents. The following page management options are available:
- Treat each file in an upload as one document
- Treat each page in an upload as one document
- Merge all files in an upload into one document
- Train an AI model to split documents automatically
- Always go to human validation for splitting and merging.
Behavior of document classification when AI-based page management is enabled
When AI-based page management is enabled, document classification is performed before page management and on the page level. All pages of an upload are put in a sequence according to their original sort order. For every page, the document classification model will make a prediction on the type of document.
After that, page management is performed per document type.
When a document contains multiple languages, the page management model will never mix languages within one document. If you need documents that contain multiple languages, you can force a single default language in the Languages section of the project settings.
Behavior of document classification without AI-based page management
When options like "Treat each file in an upload as one document", "Treat each page in an upload as one document", or "Merge all files in an upload into one document" are used, page management is performed before document classification.
Document classification will happen on the document level instead of the page level. This is typically more accurate because there is more context for the model to make decisions.
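The difference between the two orderings can be sketched as follows. This is a simplified illustration, not the product's implementation: `classify` is a hypothetical keyword-based stand-in for a trained classification model, and grouping consecutive pages of the same predicted type is an assumption made here to keep the example small (a trained page management model would decide the actual document boundaries).

```python
from itertools import groupby
from typing import List, Tuple


def classify(text: str) -> str:
    """Hypothetical stand-in for a trained document classification model."""
    lowered = text.lower()
    if "invoice" in lowered:
        return "invoice"
    if "salary" in lowered:
        return "pay_slip"
    return "other"


def ai_page_management(pages: List[str]) -> List[Tuple[str, List[str]]]:
    """Classify every page first, then build documents per predicted type.

    Assumption: consecutive pages with the same predicted type form one
    document. Pages keep their original sort order.
    """
    labelled = [(classify(page), page) for page in pages]
    return [
        (doc_type, [page for _, page in group])
        for doc_type, group in groupby(labelled, key=lambda item: item[0])
    ]


def file_per_document(files: List[List[str]]) -> List[Tuple[str, List[str]]]:
    """Build documents first (one per file), then classify each whole document."""
    return [(classify(" ".join(pages)), pages) for pages in files]
```

In the second function the classifier sees the text of the whole document at once, which mirrors why document-level classification has more context to work with.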
Extraction - extracting text and visual information
In this step, information is extracted from each document. Each piece of extracted information is properly formatted, based on your format configurations.
In Duco Adaptive IDP, we call one piece of extracted information an Entity, which is why we often refer to this step as Entity Extraction.
Multiple types of entities are supported. To learn more about them you can check the Entity classes section of the entity project settings.
Entity extraction is an optional step. If no entities are configured for a given document type, this step will be skipped.
To learn more about entity extraction, you can start at the following pages
Validation - deciding if a human review is needed
Not all documents will be correct, and not all AI predictions will be correct. To guarantee the completeness of a document, it's important to configure a set of validation rules.
Simple validation settings
For entities, common validation options include:
- Whether an entity is required
- The minimum and maximum number of expected unique occurrences
- Whether a value can be parsed as a valid number or date
For more information about simple validation settings, see Entities.
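To make the simple validation options concrete, here is a minimal sketch of how such checks could be expressed. The `EntityRule` fields and the `validate_entity` function are hypothetical names chosen for this example; they do not reflect the actual configuration schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class EntityRule:
    name: str
    required: bool = False
    min_occurrences: int = 0
    max_occurrences: Optional[int] = None
    value_type: str = "text"        # "text", "number" or "date"
    date_format: str = "%Y-%m-%d"   # assumed format for this sketch


def parses(value: str, rule: EntityRule) -> bool:
    """Check whether the value can be parsed as the configured type."""
    try:
        if rule.value_type == "number":
            float(value)
        elif rule.value_type == "date":
            datetime.strptime(value, rule.date_format)
    except ValueError:
        return False
    return True


def validate_entity(rule: EntityRule, values: List[str]) -> List[str]:
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    unique = set(values)
    if rule.required and not unique:
        errors.append(f"{rule.name}: required entity is missing")
    if len(unique) < rule.min_occurrences:
        errors.append(f"{rule.name}: fewer than {rule.min_occurrences} unique occurrences")
    if rule.max_occurrences is not None and len(unique) > rule.max_occurrences:
        errors.append(f"{rule.name}: more than {rule.max_occurrences} unique occurrences")
    for value in values:
        if not parses(value, rule):
            errors.append(f"{rule.name}: cannot parse {value!r} as {rule.value_type}")
    return errors
```

A document for which `validate_entity` returns any errors would be a candidate for human review.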
Confidence-based validation of predictions
Entities, page management, and document classification that use AI models (all types except Regex) get a confidence score for every prediction. If the confidence score is lower than a predefined threshold, the document is sent to human validation.
To learn more about confidence-based validation, see Human validation.
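The threshold check can be pictured like this. It is only a sketch: the `Prediction` type, the `needs_human_validation` function and the threshold value of 0.8 are assumptions made for this example. Regex-based predictions carry no confidence score here, so they never trigger a review on confidence grounds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Prediction:
    kind: str                    # e.g. "entity", "classification", "page_management"
    value: str
    confidence: Optional[float]  # None for predictions without a score (Regex)


def needs_human_validation(predictions: List[Prediction],
                           threshold: float = 0.8) -> bool:
    """Return True if any scored prediction falls below the threshold."""
    return any(
        p.confidence is not None and p.confidence < threshold
        for p in predictions
    )
```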
Custom business rules
Business rules are used to validate the information extracted from the document through conditions you can create.
Duco Adaptive IDP provides all the necessary settings for creating different conditions that can be combined via boolean operators such as AND and OR. These conditions enable you to compare different elements with each other:
- The value of an entity extracted from a document, e.g. the net salary from the pay slip.
- How many times an entity is present in the document, e.g. two signatures must be present
- The number of pages of a document
- Metadata that comes from external data sources and is sent along with the document, e.g. when a customer enters their net wages in a web form in addition to uploading a loan application, that value can be sent along and compared with the net wages that Duco recognized on the pay slip.
- A previously set fixed value
- A regular expression
- …
Elements can be compared using comparison operators such as smaller than, larger than, or equal to. The outcome of the validation of business rules is sent along with the output of the document and is also visible in the 'production pipeline module'.
If you don't want to validate the extracted information with business rules, you can skip this step.
To learn more about business rules, see Business rules.
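The idea of comparing elements and combining the resulting conditions with AND and OR can be sketched in a few lines. All names below (`field_value`, `fixed`, `compare`, `all_of`, `any_of`, and the keys in the context dictionary) are hypothetical and only serve to illustrate the concept; business rules in Duco Adaptive IDP are configured through the product's settings, not written as code like this.

```python
import operator
from typing import Any, Callable, Dict

Context = Dict[str, Any]
Getter = Callable[[Context], Any]
Condition = Callable[[Context], bool]


def field_value(key: str) -> Getter:
    """Look up an entity value, occurrence count or metadata item by key."""
    return lambda ctx: ctx[key]


def fixed(value: Any) -> Getter:
    """A previously set fixed value."""
    return lambda ctx: value


def compare(left: Getter, op: Callable[[Any, Any], bool], right: Getter) -> Condition:
    """Compare two elements with an operator such as operator.lt or operator.eq."""
    return lambda ctx: op(left(ctx), right(ctx))


def all_of(*conditions: Condition) -> Condition:
    """Combine conditions with AND."""
    return lambda ctx: all(cond(ctx) for cond in conditions)


def any_of(*conditions: Condition) -> Condition:
    """Combine conditions with OR."""
    return lambda ctx: any(cond(ctx) for cond in conditions)


# Example rule: the net salary on the pay slip must equal the net wages
# from the web form, AND at least two signatures must be present.
loan_rule = all_of(
    compare(field_value("net_salary"), operator.eq, field_value("form_net_wages")),
    compare(field_value("signature_count"), operator.ge, fixed(2)),
)
```

Calling `loan_rule({"net_salary": 2450.5, "form_net_wages": 2450.5, "signature_count": 2})` evaluates the whole rule against one document's data.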
Enrichments
Data enrichments allow you to embed custom code, custom logic and additional data sources into your processing pipeline by integrating an API call to an external system. Rather than applying custom logic after Duco Adaptive IDP has finished extraction, enrichments run inside the pipeline, so you can perform human validation on their results within Duco Adaptive IDP.
Examples of when it makes sense to use an enrichment
- For external data lookup - for example, to find the matching supplier based on name, address, VAT number
- For intelligent decision making - for example, to classify if an order is "delivery" or "pickup"
- For applying business logic - for example, to check if an order can automatically be fulfilled based on lead time and stock level
- For data validation - for example, to validate an IBAN
- For embedding custom machine learning models - for example, sentence classification
- For custom parsing or standardization - for example, converting different Units of Measurement to a standard unit.
For more info regarding Enrichments, see Enrichments.
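As an illustration of the data validation use case, the sketch below shows the kind of logic an external enrichment service could run to validate an IBAN. It implements the standard mod-97 checksum only; it does not check country-specific lengths, and it says nothing about how the service is called from Duco Adaptive IDP. The function name `is_valid_iban` is an assumption made for this example.

```python
def is_valid_iban(raw: str) -> bool:
    """Check an IBAN with the mod-97 checksum (country lengths not verified)."""
    iban = raw.replace(" ", "").upper()
    # IBANs are ASCII alphanumeric and between 15 and 34 characters long.
    if not (iban.isascii() and iban.isalnum() and 15 <= len(iban) <= 34):
        return False
    # Move the country code and check digits to the end.
    rearranged = iban[4:] + iban[:4]
    # Replace letters by numbers: A=10, B=11, ..., Z=35 (digits stay as-is).
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1
```

An enrichment built around such a check could return the boolean result alongside the extracted value, so that a failed check can be reviewed during human validation.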
Output
When all steps have been completed, the result is sent to your own service, application or data source. Using the project settings, you can select the desired configuration to get the information into your system.
For a full list of output integrations, see Output.