Document Composition

The following describes our back-end toolset for document processing. These components can be used to extract JSON or XML data from presentation formats, and to merge structured data with layout to produce various export formats for printing, archiving or distribution.

Our base components can be deployed in various ways and integrated with different systems using simple glue code or scripting. The backend core libraries are written mostly in C++ and are built and tested on both Linux and Windows.

Our runtime tools distribution contains two main command line applications - extract and merge. These applications are thin wrappers around several internal modules, set up for execution from the command line.

Document format

A Woodston document is an XML-based file format that combines vector and raster graphics with structure information. There are four different elements at the document root level: dictionary, design, data and resources. In the current implementation there is one of each, but future versions could store multiple alternative designs; for example, paginated and mobile views of the same document could be stored in a single file.
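
Assuming the element names match the four types described here, a minimal document skeleton might look like this (illustrative only, not the actual schema):

```xml
<document>
    <dictionary/>  <!-- arbitrary meta-data -->
    <design/>      <!-- fragments and layout rules -->
    <data/>        <!-- the data DOM -->
    <resources/>   <!-- images, scripts, fonts -->
</document>
```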

The dictionary type appears under the root but can also be encountered as part of other nested items like fragments and fields. The purpose of the dictionary element is to store arbitrary meta-data along with the document. Values in a dictionary are named entities that may contain a value or another dictionary. Nesting allows creating custom namespaces for storing complex structures. In the simplest case the dictionary may store the author's name, change date or notes, but it can also accommodate additional processing rules for workflows, rendering options and so on.
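
A hypothetical dictionary with a nested namespace could look like this (element and attribute names are illustrative, not the actual schema):

```xml
<dictionary>
    <item name="Author">J. Smith</item>
    <item name="Notes">Quarterly invoice template</item>
    <item name="Workflow">
        <!-- nested dictionary acting as a custom namespace -->
        <item name="RenderDPI">300</item>
    </item>
</dictionary>
```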

The design element contains the graphical parts of the document. The main building block of a document design is called a fragment. Fragments can be used as pages, invoice rows, table cells or as containers for other fragments. Container fragments may have optional layout rules attached which instruct the formatting engine how to place elements during the layout process.

The document has an optional data fork that declares a tree of nested data items. Each item in the data DOM has a name, type and value. The structure of the data tree defines the data interface of a document. The content of the data DOM can be imported and exported as XML or JSON, and it exists independently of the document graphics. Data items can optionally be bound to fields in the document, so the data DOM serves as the data source for presentation. Not all items in the data DOM need to be bound to presentation fields, so the DOM may store a more complex structure than is necessary for presentation. The same applies to the variable fields in the layout: some presented values may be calculated at runtime, modified by script or serve as user input areas.
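
A data fork with named, typed items could be sketched like this (the type attribute and its values are assumptions for illustration):

```xml
<data>
    <item name="Invoice" type="group">
        <item name="Number" type="string">123456</item>
        <item name="Total" type="number">100.00</item>
    </item>
</data>
```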

The document resource element stores images, scripts and fonts. Resources may be stored in the file as embedded streams or as references to external files.

Read more on the document format page.

Document layout

The document layout process starts by loading the document template in its initial state. This usually means that the data DOM is unpopulated, variable fields are blank or contain default values, and repeatable fragments have a single instance.

The data acquisition process retrieves values from a structured source like XML or JSON and populates the data DOM.

Once the data is loaded, the content (<design>) part gets populated. In this process, the original XML or JSON file that supplied the data is no longer used. Elements in the document design are bound to the data DOM using path statements. An element path is a string that describes the location of an element in the document. Data binding uses the path syntax to address elements in the data tree, e.g. /document/data/$invoice/$receiver/$mail. When a fragment in the document design is bound to a repeated element in the data, e.g. /document/data/$invoice/$row, the content populator will create multiple instances of that fragment. The populator does not consider spacing issues, overflow or line breaking. It simply instantiates all child elements and fills the fields with text content. Optional field formatting rules are applied at this stage as well.
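
A data tree matching the repeated-fragment example above could look like this (illustrative; the item notation follows the data DOM sample later in this page, and the assumption is that the $ prefix in paths maps to item names):

```xml
<data>
    <item name="invoice">
        <item name="receiver">
            <item name="mail">billing@example.com</item>
        </item>
        <item name="row">
            <item name="description">First row</item>
        </item>
        <item name="row">
            <item name="description">Second row</item>
        </item>
    </item>
</data>
```

A fragment bound to /document/data/$invoice/$row would be instantiated twice, once per row item.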

When the whole design content is populated, the document layout rules are applied. In this stage text formatting and line breaking happen and overflows get resolved. An overflow occurs when a container fragment holds more child elements than can fit into the available area with the specified layout method. For example, vertical top-to-bottom stacking places invoice rows into the container one by one, and when it runs out of space, the overflow method gets invoked. The overflow method may then duplicate the parent container fragment and transfer the overflow content - the child fragments that did not fit - into the cloned container. In simple terms, a new page is created and layout continues recursively.
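
The top-to-bottom stacking with overflow can be sketched as follows (an illustrative model, not the engine's actual code; the names and the fixed-height simplification are assumptions):

```javascript
// Illustrative sketch of top-to-bottom stacking with overflow.
// Each child has a fixed height; a container holds at most `capacity`
// units of height. When a child does not fit, the container is "cloned"
// (a new page) and layout continues with the remaining children.
function stackWithOverflow(children, capacity) {
    const containers = [];
    let current = { children: [], used: 0 };
    containers.push(current);
    for (const child of children) {
        if (current.used + child.height > capacity) {
            // overflow: clone the container and continue there
            current = { children: [], used: 0 };
            containers.push(current);
        }
        current.children.push(child);
        current.used += child.height;
    }
    return containers;
}

// Ten rows of height 30 on pages that hold 100 units: 3 + 3 + 3 + 1 rows.
const pages = stackWithOverflow(
    Array.from({ length: 10 }, (_, i) => ({ name: `row${i}`, height: 30 })),
    100
);
console.log(pages.length);             // 4
console.log(pages[0].children.length); // 3
```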

A completely formatted document has all elements in their final state, ready for output. The renderer loads a driver module and translates the content into graphical primitives like lines, rectangles, polygons, text and images. Each primitive ends up as a call to the abstract rendering interface, so the renderer on this level works the same way for every concrete output format. This allows recording the renderer output and replaying it at a later stage for multi-format output or for diagnostic purposes.
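
The record-and-replay idea can be illustrated with a small sketch (the interface and primitive names here are assumptions, not the actual driver API):

```javascript
// A recorder exposes the same abstract interface as a concrete driver,
// but stores each primitive call instead of drawing it. The recording
// can later be replayed into any number of drivers.
function makeRecorder() {
    const calls = [];
    const record = (op) => (...args) => calls.push({ op, args });
    return {
        line: record('line'),
        rect: record('rect'),
        text: record('text'),
        replay(driver) {
            for (const { op, args } of calls) driver[op](...args);
        },
    };
}

// A trivial "driver" that just counts primitives.
const counts = { line: 0, rect: 0, text: 0 };
const counter = {
    line: () => counts.line++,
    rect: () => counts.rect++,
    text: () => counts.text++,
};

const rec = makeRecorder();
rec.rect(0, 0, 595, 842);
rec.text(50, 50, 'Invoice');
rec.line(50, 60, 545, 60);
rec.replay(counter); // replay once into the counting driver
console.log(counts); // { line: 1, rect: 1, text: 1 }
```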

Scripting API

At each of the above steps, events are published to the JavaScript engine which runs the document scripts. This makes it possible for script code to modify the document and implement custom logic. The most common scenario is the calculation of additional values after the document data DOM is populated.

The document script is a JavaScript file that in its minimal form looks like this:

((merge) => {});

The merge context object provides logging facilities and access to the current document, and can be used for subscribing to events. The following script issues an information message (level 2) and subscribes to the "data populated" event when the script is loaded.

((merge) => {
    merge.log.info('extension loaded');
    merge.subscribe('document:data:populated', () => {
        merge.log.info('data populated');
    });
});

The events triggered from the merge process are the following:

  • document:loaded
  • document:data:loaded
  • document:data:populated
  • document:layout:processed

The rendering phase is not accessible to the script in this version. The rendering step may be added here, or it may eventually appear as part of another - process management - scripting model. The current scripting model deals with the document itself and does not consider processing steps outside the DOM. Such external events are, for instance, retrieval of the data and document files, sending the resulting PDF by email or posting to a REST API. They definitely deserve some attention; we just do not want to mix our apples with the oranges.

The following script traverses the document data DOM:

((merge) => {

    const reportAll = (items) => {

        merge.log.info('data items:');
        items.forEach((item) => {
            merge.log.info(` item.name: ${item.name}`);
        });
    };

    merge.subscribe('document:data:populated', () => {
        reportAll(merge.document.data.children);
    });
});

Let's consider this simple data DOM:

<data>
    <item name="InvoiceNumber" />
</data>

To modify data DOM values from the script after the XML data is loaded:

((merge) => {
    merge.subscribe('document:data:loaded', () => {
        merge.log.info(`InvoiceNumber = ${merge.document.data.$InvoiceNumber}`);
        merge.document.data.$InvoiceNumber = '123456';
        merge.log.info(`InvoiceNumber = ${merge.document.data.$InvoiceNumber}`);
    });
});

More about scripting can be found on the merge scripting reference page.

Output formats

The output drivers implemented in the current version include PDF, PostScript, SVG, Windows printing and the internal document format. The latter may be used to serialize the formatted output as an XML document which can be opened in UI tools, archived, rendered into PDF or PostScript, or printed.

Command Line Tools

While we are working on an integrated process setup and management system, the document composition tools can be executed as command line tools.

Extract

The extract command line application runs the input file through a filter pipeline to obtain Unicode text and processes the result with the extractor. Extracted data is written to the output as XML or JSON. Multiple stencil files can be used to configure the extractor.

Example:

extract.exe -in input.pdf -out output.xml -config .\stencils\*.stencil

Additional parameters:

    -in         [filename] input file
    -out        [filename] output file
    -mime       [mime] input mime type 
    -outmime    [mime] output mime type (application/xml | application/json)
    -config     [path] stencil file path, use pattern * for multiple files
    -log        [filename] name of log file
    -logappend  causes the log to be appended to existing file
    -loglevel   [1..5] where lower level shows more messages
    -bin        [folder] location of program data files
    -filters    [folder] location of filter files
    -wd         [folder] set working directory
    -dump       [folder] write input filter output (Unicode text) to file
    -version    displays version info

Merge

The merge command line application loads a document template and a data file. It populates the document's data DOM with values from XML or JSON, runs the layout process and renders the result using one of the drivers: PDF, PostScript, Windows printing or XML.

Example:

merge.exe -in input.shape -data data.xml -out output.ps

Additional parameters:

    -in         [filename] input file
    -out        [filename] output file
    -print      [name] of Windows printer, not supported on Linux
    -mime       [mime] output mime type 
    -split      splits pages into separate files
    -data       [filename] data file
    -datamime   [mime] data file mime type (application/xml | application/json)
    -fonts      [folder] font folder
    -log        [filename] log file
    -logappend  causes the log to be appended to existing file
    -loglevel   [1..5] where lower level shows more messages
    -filters    [folder] location of filter files
    -bin        [folder] location of program data files
    -wd         [folder] set working directory
    -version    displays version info

The MIME type parameters may be omitted. In that case automatic content type detection from the filename extension and content header is applied. Detection is not always possible, especially when the filename has no extension or the particular file format lacks a fixed header signature.
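
The extension-based part of detection could be modeled roughly like this (an illustrative sketch, not the actual detection logic; the mapping covers only a few of the types listed below):

```javascript
// Minimal sketch of content type detection by filename extension.
// Returns null when the extension is missing or unknown, which is
// when header-based detection (or an explicit MIME parameter)
// would be needed.
const mimeByExtension = {
    xml: 'application/xml',
    json: 'application/json',
    pdf: 'application/pdf',
    ps: 'application/postscript',
    png: 'image/png',
};

function detectMime(filename) {
    const match = /\.([^.\\/]+)$/.exec(filename);
    if (!match) return null; // no extension: detection not possible here
    return mimeByExtension[match[1].toLowerCase()] ?? null;
}

console.log(detectMime('data.xml')); // application/xml
console.log(detectMime('output'));   // null
```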

The following MIME types are currently recognized by content type detection. Note that detection here means the ability to recognize the file type, not the ability to use the file as input or output in any concrete scenario.

text/plain
text/html 
application/xml 
application/json 
application/pdf 
application/postscript 
application/raw 
application/rtf 
application/vnd.ms-excel 
image/wmf 
image/emf 
image/tiff 
image/jpeg 
image/png 
image/gif 
image/bmp 
image/svg+xml 
image/jp2 

The following types are used for internal formats and are specific to the application:

application/vnd.ws-doc
application/vnd.ws-stencil 
application/vnd.ws-extractor 
application/vnd.ws-settings 
application/vnd.ws-filter 
application/vnd.ws-flow
application/tcml