Document and Content Nodes

Understanding the way we build documents

In Kodexa, documents are represented in a generalized document structure consisting of a collection of metadata and a set of content nodes. This structure may be thought of as a rich tree model, with a root content node at the top and one or more child content nodes branching off beneath it. Each child content node contains some portion of the document's content. The tree structure lets us navigate within the document and maintain lineage between parent and child nodes.

The tree structure of the document may be represented as:

                      ┌───────────┐
                  ┌──▶│ metadata  │
                  │   └───────────┘
┌──────────────┐  │
│   document   │──┤                  ┌──────────┐   ┌──────────┐
└──────────────┘  │                ┌─▶   node   │──▶│   node   │
                  │                │ └──────────┘   └──────────┘
                  │   ┌──────────┐ │ ┌──────────┐
                  └──▶│   node   ├─┼─▶   node   │
                      └──────────┘ │ └──────────┘
                                   │ ┌──────────┐
                                   └─▶   node   │
                                     └──────────┘

While at a generic level everything can be thought of as a content node, we use the 'type' property on the content node to give meaning to the hierarchy. For example, we may set the type property to 'page' or 'line' to differentiate how each node is used:

                      ┌───────────┐
                  ┌──▶│ metadata  │
                  │   └───────────┘
┌──────────────┐  │
│   document   │──┤                  ┌──────────┐   ┌──────────┐
└──────────────┘  │                ┌─▶   page   │──▶│   line   │
                  │                │ └──────────┘   └──────────┘
                  │   ┌──────────┐ │ ┌──────────┐   ┌──────────┐
                  └──▶│   root   ├─┼─▶   page   │──▶│   line   │
                      └──────────┘ │ └──────────┘   └──────────┘
                                   │ ┌──────────┐   ┌──────────┐
                                   └─▶   page   │──▶│   line   │
                                     └──────────┘   └──────────┘

This matters because documents do not always share the same structure. For example, we might start with a document made up simply of characters and then, at a later step in the pipeline, restructure it into the form shown above (including words, etc.). The key point is that we can build documents that suit the structure we currently have.
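
The typed tree described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Kodexa API: the `Node` and `Doc` classes, the `build_paged_doc` helper, and the `walk` and `path` methods are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a content node (not the Kodexa class)."""
    type: str
    content: Optional[str] = None
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def add_child(self, child: "Node") -> "Node":
        child.parent = self  # keep lineage back to the parent
        self.children.append(child)
        return child

    def walk(self) -> Iterator["Node"]:
        """Yield this node and every descendant, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()

    def path(self) -> str:
        """Follow parent links up to the root and join the node types."""
        parts = []
        node: Optional["Node"] = self
        while node is not None:
            parts.append(node.type)
            node = node.parent
        return "/".join(reversed(parts))


@dataclass
class Doc:
    """Hypothetical document: a metadata dictionary plus one root node."""
    metadata: Dict[str, str] = field(default_factory=dict)
    root: Optional[Node] = None


def build_paged_doc(pages: List[List[str]]) -> Doc:
    """Build a root -> page -> line tree from lists of line text."""
    doc = Doc(metadata={"source": "example"}, root=Node(type="root"))
    for page_lines in pages:
        page = doc.root.add_child(Node(type="page"))
        for text in page_lines:
            page.add_child(Node(type="line", content=text))
    return doc


doc = build_paged_doc([["Hello", "World"], ["Second page"]])
lines = [n for n in doc.root.walk() if n.type == "line"]
print([n.content for n in lines])  # ['Hello', 'World', 'Second page']
print(lines[0].path())             # root/page/line
```

Because every node is the same kind of object and only the `type` value differs, a later pipeline step could regroup the same content under different node types without changing the model itself.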

Content Nodes

At the heart of the design is the content node. A content node has several core properties such as type, content, content_parts, children, and feature(s):

┌──────────────────────────────────────┐
│           ┌──────────────┐           │
│ node      │    parent    │           │
│           └──────────────┘           │
│ ┌──────────────┐  ┌────────────────┐ │
│ │     type     │  │                │ │
│ └──────────────┘  │ ┌────────────┐ │ │
│ ┌──────────────┐  │ │  feature   │ │ │
│ │   content    │  │ └────────────┘ │ │
│ └──────────────┘  │ ┌────────────┐ │ │      ┌───────────────────┐
│ ┌──────────────┐  │ │  feature   │ │ │      │                   │
│ │content_markup│  │ └────────────┘ │ │      │      feature      │
│ └──────────────┘  │ ┌────────────┐ │ │      │                   │
│ ┌──────────────┐  │ │  feature   │─┼─┼┐     │ ┌──────────────┐  │
│ │   children   │  │ └────────────┘ │ ││     │ │     type     │  │
│ └──────────────┘  └────────────────┘ ││     │ └──────────────┘  │
│         │                            │└────▶│ ┌──────────────┐  │
└─────────│────────────────────────────┘      │ │     name     │  │
          │     ┌──────────┐                  │ └──────────────┘  │
          └────▶│ ┌──────┐ │                  │ ┌──────────────┐  │
                │ │ node │ │                  │ │    value     │  │
                │ └──────┘ │                  │ └──────────────┘  │
                │ ┌──────┐ │                  └───────────────────┘
                │ │ node │ │
                │ └──────┘ │
                └──────────┘

type: The specific type of this content node.

parent: Each node is aware of its parent. Only the root content node of a document has no parent.

features: A collection of features (described in more detail in the next section).

content: The text representation of the content.

content_parts: An array version of the content, used to break the content up and intersperse it with the children so that we can tell where each child node fits into the content. It is present only when the structure allows child nodes to be embedded in the content at specific locations.

children: The child content nodes that roll up to this content node.
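
To make the properties above more concrete, here is a minimal sketch of a node with the same fields. It is not the Kodexa implementation: the `ContentNode` and `Feature` classes below are hypothetical stand-ins, `full_text` is an invented helper, and the encoding of `content_parts` (a string is literal text, an integer is an index into `children`) is an assumption made for this example.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Union


@dataclass
class Feature:
    """Hypothetical feature record with the three fields from the diagram."""
    type: str
    name: str
    value: Any


@dataclass(eq=False)
class ContentNode:
    """Hypothetical content node mirroring the properties listed above."""
    type: str
    content: Optional[str] = None
    # Assumed encoding: str = literal text, int = index into `children`.
    content_parts: List[Union[str, int]] = field(default_factory=list)
    features: List[Feature] = field(default_factory=list)
    children: List["ContentNode"] = field(default_factory=list, repr=False)
    parent: Optional["ContentNode"] = field(default=None, repr=False)

    def add_child(self, child: "ContentNode") -> "ContentNode":
        child.parent = self
        self.children.append(child)
        return child

    def full_text(self, separator: str = " ") -> str:
        """Rebuild the text, placing each child where content_parts puts it."""
        if self.content_parts:
            parts = [
                part if isinstance(part, str)
                else self.children[part].full_text(separator)
                for part in self.content_parts
            ]
        else:
            # No content_parts: own content first, then children in order.
            parts = [self.content] if self.content else []
            parts += [child.full_text(separator) for child in self.children]
        return separator.join(p for p in parts if p)


line = ContentNode(type="line")
word = line.add_child(ContentNode(type="word", content="quick"))
word.features.append(Feature(type="tag", name="adjective", value=True))
line.content_parts = ["The", 0, "brown fox"]

print(line.full_text())   # The quick brown fox
print(word.parent.type)   # line
```

When `content_parts` is empty, the sketch falls back to the node's own content followed by its children in order, which matches the case where child nodes are not embedded at specific positions in the text.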