Data Extractor

Extractor is software for converting documents from various presentation sources into structured data. The core of the service is a module that uses a set of predefined configuration files (stencils) to identify Unicode text input and parse it into a structured message. The output message structure is defined by the stencil, so a single running extractor core can produce different types of messages.

The extractor can also consume PDF, HTML and other formats by invoking import filters. A filter pipeline is a configurable chain of modifiers that converts the input document into Unicode text. Each filter chain has its own input MIME type and can execute multiple processing steps. For instance, the processing chain for HTML e-mail content would start with a basic HTML syntax check that adds missing elements, followed by a headless HTML browser for page layout and text extraction. Any command-line application can be used as an input filter.
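The filter chain idea can be illustrated with a minimal Python sketch. This is not the extractor's actual implementation: the FILTER_CHAINS registry, the demo MIME type "text/x-demo" and the run_filter_chain function are hypothetical names, and the two Python one-liners merely stand in for real command-line filters. The sketch assumes that each filter reads the document on standard input and writes its result to standard output.

```python
import subprocess
import sys

# Hypothetical registry: MIME type -> ordered list of filter commands.
# Each command reads the document on stdin and writes its result to stdout.
# The two Python one-liners stand in for real filters such as an HTML checker.
FILTER_CHAINS = {
    "text/x-demo": [
        [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().strip())"],
        [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"],
    ],
}


def run_filter_chain(mime_type: str, document: bytes) -> str:
    """Pipe the document through every filter registered for its MIME type."""
    data = document
    for command in FILTER_CHAINS[mime_type]:
        # check=True raises CalledProcessError if a filter exits with an error.
        completed = subprocess.run(command, input=data, capture_output=True, check=True)
        data = completed.stdout
    # Assumption: the last filter emits UTF-8 encoded text.
    return data.decode("utf-8")


if __name__ == "__main__":
    print(run_filter_chain("text/x-demo", b"  hello  "))
```

Because every step is an ordinary process, replacing a filter only means changing one command in the chain.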

More about the technical background and design considerations can be found on the architecture page.

Stencil Editor

Application

Stencil Editor is a desktop application written in C# with Windows Presentation Foundation (WPF). The Editor loads both the source document and the stencil into a single UI where the data extraction rules can be defined and tested against actual sample data. Each extraction rule can be executed separately, or the whole extractor can be run. The UI maintains a complete undo/redo stack for all stencil editing operations. The Stencil Editor setup includes filters for importing PDF, HTML and XLS files.

Stencil

Stencils are XML-based documents that contain one or more ruleset elements. Rulesets can be nested, in which case child rulesets serve as optional parts of the root-level ruleset. A valid ruleset contains one or more items. The set of items in a ruleset defines the structure of the data that the extractor service will extract from the input document. Each item has one or more matching patterns; multiple patterns serve as alternative searches.

The ruleset contains a collection of condition elements that define the document type. Multiple conditions are combined with logical operators, and the resulting Boolean value determines whether the stencil is applicable to a particular input.

Conditions and patterns use a regular expression language that allows flexible data matching. Regex is a well-established method for searching for patterns in textual data, so many people in the IT industry are already familiar with the syntax. The stencil concept builds upon regex but adds a structural layer that makes it possible to define complex extraction rules by combining multiple small and simple search expressions.

Publisher

The Stencil Editor includes a Publisher component as part of the stencil UI. The Publisher connects to a live extractor service that provides a REST API and manages the list of stencils available on that server. Once uploaded, new stencils become available immediately, optionally on multiple service instances backing a single REST API entry point. The Publisher also allows uploading metadata and attachments, so for instance the most common example files can be kept on the site along with the stencil.

Stencil tools

The Stencil Editor has convenience features such as a regular expression generator and a pattern library, which help build new stencils from existing, tested parts. In theory it is not possible to deduce a regular expression from a single sample value. Instead, the generator lets users create a list of regular expression parts that are tested against the sample data. When a suitable predefined construct is found, it is suggested as a new pattern.

The Windows version of the extractor core is installed along with the Stencil Editor to provide a preview of stencil execution for testing and development. Clicking the "Test" button in the Stencil Editor launches the extractor runtime as a background process and presents the JSON data retrieved from the sample content.

Deployment

The Stencil Editor installer can be downloaded for 64-bit Windows and includes the 64-bit Windows runtime (extractor.exe). A separate runtime tools package also includes the extractor command-line version.

The extractor runtime is deployed in a Docker container that runs a Linux operating system. A running instance of the extractor can be managed via the REST API or through the built-in management web page. The management interface allows adding stencils and observing execution logs. Each extractor invocation produces a log entry that includes a result code (none, partial, complete). Optionally, input files can be retained and the collected samples used for developing new stencils.

The extractor can also be invoked as a command-line application on Linux and Windows.

Stencil format

A stencil file contains part of the configuration for the extractor application, stored in XML format. When the extractor loads a stencil, it adds the rulesets from the stencil into the runtime configuration. The extraction process tests rulesets against the input document one by one, in the order in which they appear in the list. When a ruleset matches the document, the extractor applies the ruleset to the input data.

It is possible to control the order in which the extractor uses stencils. For that purpose there is an Order attribute at the root level of the stencil document. When Order is set to a positive integer, the value is the index of the stencil in the extractor's stencil list (or last, when the index is larger than the count of pre-existing stencils).

When Order is set to -1 (minus one, the default value), the stencil is added to the end of the list, and any stencils loaded later are added after it. Use -1 when the position of the stencil is not important.

To keep a stencil at the end of the list so that it is always applied after all other stencils, use the value -2 (minus two). A stencil with order -2 is always kept at the end of the list. When there are multiple stencils with order -2, their order relative to each other is implementation-specific.

A stencil contains one or more units called rulesets.
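
The ordering rules for the Order attribute can be sketched in a few lines of Python. The StencilList class and its methods are hypothetical and only illustrate the behaviour described above. The sketch makes two assumptions that the text does not confirm: the index is zero-based, and an index that is too large places the stencil last among the regular stencils but still before any stencil with order -2.

```python
class StencilList:
    """Keeps stencil names in the order implied by their Order attribute."""

    def __init__(self):
        self._regular = []  # stencils with Order >= 0 or Order == -1
        self._pinned = []   # stencils with Order == -2, always kept last

    def add(self, name: str, order: int = -1) -> None:
        if order == -2:
            self._pinned.append(name)
        elif order == -1:
            self._regular.append(name)
        elif order >= 0:
            # Assumption: zero-based index; values that are too large mean "last".
            self._regular.insert(min(order, len(self._regular)), name)
        else:
            raise ValueError(f"unsupported Order value: {order}")

    def names(self) -> list:
        """Return the effective evaluation order."""
        return self._regular + self._pinned


stencils = StencilList()
stencils.add("invoice")          # default order -1
stencils.add("fallback", -2)     # pinned to the end
stencils.add("receipt")          # appended before the pinned stencil
stencils.add("urgent", 0)        # explicit index
stencils.add("late", 99)         # index beyond the list, treated as last
print(stencils.names())
# ['urgent', 'invoice', 'receipt', 'late', 'fallback']
```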

Ruleset

A ruleset defines a number of rules for detecting a certain document type and for extracting data items from the document content. Rulesets can be nested, in which case child rulesets serve as optional parts of the root-level ruleset. To be valid, a ruleset must contain one or more items. The set of items in a ruleset defines the structure of the data that the extractor service will extract from the input document.

Constraints

The ruleset may contain constraint elements that define the match criteria for the ruleset. Multiple constraints are combined with logical operators, and the resulting Boolean value determines whether the ruleset is applied or skipped.

Constraint elements are used for detecting the document type. A ruleset without constraints is always applied to any type of input. When there is a large number of rulesets, a significant amount of processing may be needed to find the best match. Constraints make it possible to narrow down the number of rulesets that get applied to the input content.

Constraints can be nested and combined with the logical operators AND, OR, AND NOT, and OR NOT. This makes it possible to create complex match conditions. A constraint is similar to a pattern, except that it does not extract any data and has no format or data type attributes. The result of a constraint is a Boolean value, true or false. All constraints combined also produce a single Boolean value. When the matching process (applying all constraints) yields false, the extractor moves to the next available ruleset in the list. Otherwise, the items in the ruleset are evaluated.
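
As an illustration, nested constraints can be modelled as a small tree and reduced to one Boolean value. The Constraint class and the evaluate function below are hypothetical and do not reflect the stencil file schema. The sketch assumes that constraints are combined strictly left to right without operator precedence, and that the operator of the first constraint in a group contributes only its optional NOT.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Constraint:
    """A hypothetical constraint node: a regex or a group of nested constraints."""
    operator: str = "AND"    # AND, OR, AND NOT, OR NOT
    expression: str = ""     # regular expression; ignored when children exist
    children: list = field(default_factory=list)


def evaluate(constraints: list, text: str) -> bool:
    """Combine constraints left to right; an empty list always matches."""
    result = True
    for index, node in enumerate(constraints):
        if node.children:
            value = evaluate(node.children, text)
        else:
            value = re.search(node.expression, text) is not None
        if node.operator.endswith("NOT"):
            value = not value
        if index == 0:
            result = value   # nothing to combine with yet
        elif node.operator.startswith("AND"):
            result = result and value
        else:
            result = result or value
    return result


rules = [
    Constraint("AND", r"Invoice"),
    Constraint("AND NOT", r"Credit note"),
    Constraint("OR", children=[
        Constraint("AND", r"Rechnung"),
        Constraint("AND", r"\d{4}"),
    ]),
]
print(evaluate(rules, "Invoice 2020"))             # True
print(evaluate(rules, "Credit note for Invoice"))  # False
print(evaluate(rules, "Rechnung 2020"))            # True
```

Returning True for an empty list mirrors the rule that a ruleset without constraints is applied to any input.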

Items

An item is a named entity similar to an input field in an electronic form. It has a unique name, and its value is extracted from the content according to the match patterns that it contains. An item may have multiple match patterns, and each pattern may extract multiple values from a single document. Every value that pattern evaluation produces is added to the candidate list. When all patterns have been processed, the selection rule is applied to choose the appropriate value. The rule is one of the predefined types: first, last, largest, smallest, or all.
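
The selection step can be sketched as a single function over the candidate list. The select function is a hypothetical illustration. How the extractor compares values for largest and smallest is not stated in the text, so the sketch simply assumes the natural ordering of the candidate values (numeric for numbers, lexicographic for strings).

```python
def select(candidates: list, rule: str):
    """Apply a selection rule to the candidate values collected for one item."""
    if not candidates:
        return [] if rule == "all" else None
    if rule == "first":
        return candidates[0]
    if rule == "last":
        return candidates[-1]
    if rule == "largest":
        return max(candidates)   # assumption: natural ordering of the values
    if rule == "smallest":
        return min(candidates)
    if rule == "all":
        return list(candidates)
    raise ValueError(f"unknown selection rule: {rule}")


totals = [120.0, 99.5, 240.0]
print(select(totals, "first"))    # 120.0
print(select(totals, "largest"))  # 240.0
print(select(totals, "all"))      # [120.0, 99.5, 240.0]
```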

Patterns

An item may contain one or more pattern elements. The content of the pattern element is the regular expression that extracts the input string from the document. The expression may contain multiple capture groups, in which case the group number attribute selects the one that is treated as the input string. The group number is a 1-based index; the value 0 means that all text matched by the regular expression is used.

When the evaluation of the regular expression produces a non-empty result, the item records a new candidate value. Before it is recorded, the value may be normalized, depending on the data type and format.
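
The group number convention matches the one used by Python's re module, which makes it easy to illustrate. The extract_candidates function is a hypothetical helper, and the regular expression dialect of the extractor may differ from Python's in details.

```python
import re


def extract_candidates(expression: str, group: int, text: str) -> list:
    """Return the non-empty text captured by the selected group for every match."""
    values = []
    for match in re.finditer(expression, text):
        value = match.group(group)   # group 0 is the whole match
        if value:                    # skip empty or non-participating groups
            values.append(value)
    return values


sample = "Invoice No: 4711\nInvoice No: 4712"
print(extract_candidates(r"Invoice No: (\d+)", 1, sample))
# ['4711', '4712']
print(extract_candidates(r"Invoice No: (\d+)", 0, sample))
# ['Invoice No: 4711', 'Invoice No: 4712']
```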

The data type of the extracted value may be generic, text, date, or numeric. The generic type is used to extract values that do not have specific formatting rules but usually represent a single keyword (such as an invoice number or ZIP code). The text type is used for larger free-text items. Neither generic nor text has any formatting applied to the extracted value.

Date and numeric values are normalized according to their format and language attributes. The date value returned by the extractor is always in the locale-independent format yyyy-MM-dd, for example “2017-12-01”.

For instance, consider the input value “December 01, 2017”. When the pattern specifies the format “MMMM dd, yyyy” and the language “en”, parsing yields the standard “2017-12-01” from that input.

Regular expression

A regular expression is a sequence of characters that defines a search pattern. In the simplest case the pattern is nothing more than a string to search for in the content. Complex expressions, however, allow creating mini-parsers that can handle very complicated matches. The basic concepts of regular expressions are described in the Wikipedia article.

The Stencil Editor executes regular expressions on the sample document. When editing the expression property of a constraint or pattern, press Enter to execute it. The result, if any, is highlighted in the sample view. If multiple capture groups are defined, the active group is underlined. The active group is selected by specifying a non-zero group number.

NOTE: In the current version, date and numeric formats are not applied in the Stencil Editor preview. The "Test" function, however, runs the extractor command-line app in the background, so the test reflects the actual runtime result.

Date format

The extractor uses the format specification to normalize date values so that they can be stored in a uniform representation. The syntax of the date format itself is language-independent, but applying the format to input data depends on the language code set on the pattern element.

For example, the format "MMMM dd, yyyy" applied to the input string "March 12, 2020" requires the language code "en" so that the English month name is recognized. To parse German-language input, the language code must be set to "de". The same format then works with the input "März 12, 2020" but fails to parse "March 12, 2020".
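
The interaction of format and language code can be sketched in Python. This is a simplified illustration, not the extractor's parser: the normalize_date function and the MONTHS table are hypothetical, only the tokens MMMM, dd and yyyy are supported, and each token is expected to occur exactly once in the format.

```python
import re
from datetime import date

# Month names for the two languages used in the text; a real table would be larger.
MONTHS = {
    "en": ["January", "February", "March", "April", "May", "June", "July",
           "August", "September", "October", "November", "December"],
    "de": ["Januar", "Februar", "März", "April", "Mai", "Juni", "Juli",
           "August", "September", "Oktober", "November", "Dezember"],
}


def normalize_date(value: str, fmt: str, language: str):
    """Parse value with a tiny format subset (MMMM, dd, yyyy); None on failure."""
    months = MONTHS[language]
    tokens = {
        "MMMM": "(?P<month>" + "|".join(map(re.escape, months)) + ")",
        "dd": r"(?P<day>\d{1,2})",
        "yyyy": r"(?P<year>\d{4})",
    }
    # Split the format into known tokens and literal text, then build a regex.
    parts = re.split("(MMMM|dd|yyyy)", fmt)
    regex = "".join(tokens.get(part, re.escape(part)) for part in parts)
    match = re.fullmatch(regex, value.strip())
    if match is None:
        return None
    month = months.index(match.group("month")) + 1
    parsed = date(int(match.group("year")), month, int(match.group("day")))
    return parsed.isoformat()   # locale-independent yyyy-MM-dd


print(normalize_date("December 01, 2017", "MMMM dd, yyyy", "en"))  # 2017-12-01
print(normalize_date("März 12, 2020", "MMMM dd, yyyy", "de"))      # 2020-03-12
print(normalize_date("March 12, 2020", "MMMM dd, yyyy", "de"))     # None
```

The last call fails because the German month table does not contain "March", which is the behaviour the paragraph above describes.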

When the stencil cannot be structured with different rulesets for English and German input, another option is to define alternative patterns on the same item. Multiple patterns are applied in the order in which they are defined, and each pattern has its own language, format and extraction expression.

Date format used by extractor is described here and here

Numeric format

More about numeric format patterns can be found here

Language code

The language code is an attribute of the pattern element. The value stored in the stencil file is the ISO 639-1 two-letter code for the language. Processing of date and numeric formats depends on the language.