Working with unstructured data is challenging for many reasons, the most obvious being the data's lack of formal structure. Every framework that processes unstructured data tries to apply some structure to it; however, this is difficult because unstructured data varies by type and by content. Imposing a structure requires these various types to be normalized in some way, and this must be done without losing fidelity.
These difficulties are compounded when the data will be passed to third-party models/functions for processing. Different providers are likely to require the data to be structured in different ways to meet their needs, not yours. Data structures end up depending on the original data source type, the normalized structure imposed by the processing framework, the needs of third-party tools, and any use-case-specific requirements.
Trying to meet all of these needs has traditionally led to one of two outcomes: overly simplistic normalized structures that lose important details, or overly rigid structures that are built to work with specific models/functions but can't be used more widely.
At Kodexa, our content model is called the Kodexa Document. It's a generalized data structure flexible enough to work with multiple sources of data (PDF, image, etc.) while also being rich enough to support the management of features and the application of tags.
Content nodes are the structures that give the Kodexa Document its flexibility. Documents are represented in a generalized structure consisting of a collection of metadata and a set of content nodes. This structure may be thought of as a rich tree model, with a root content node at the top and one or more child content nodes branching off as leaves. Each child content node contains some portion of the document's value. The tree structure lets us navigate between nodes and maintain lineage between parent and child nodes.
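The tree described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Kodexa API: the class names `ContentNode` and `Document`, the fields `node_type`, `content`, `features`, and `tags`, the helper methods `add_child`, `walk`, and `lineage`, and the sample file name are all assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Iterator, Optional


# eq=False keeps identity-based comparison, which avoids recursive
# equality checks through the parent/children links.
@dataclass(eq=False)
class ContentNode:
    """Hypothetical content node (illustrative, not the Kodexa class)."""

    node_type: str
    content: Optional[str] = None
    parent: Optional[ContentNode] = field(default=None, repr=False)
    children: list[ContentNode] = field(default_factory=list)
    features: dict[str, Any] = field(default_factory=dict)
    tags: set[str] = field(default_factory=set)

    def add_child(self, child: ContentNode) -> ContentNode:
        """Attach a child and record its parent so lineage is preserved."""
        child.parent = self
        self.children.append(child)
        return child

    def walk(self) -> Iterator[ContentNode]:
        """Yield this node and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()

    def lineage(self) -> list[ContentNode]:
        """Return the path from the root down to this node."""
        path: list[ContentNode] = []
        node: Optional[ContentNode] = self
        while node is not None:
            path.append(node)
            node = node.parent
        path.reverse()
        return path


@dataclass
class Document:
    """Hypothetical document: a metadata collection plus a root content node."""

    metadata: dict[str, Any] = field(default_factory=dict)
    root: Optional[ContentNode] = None


# Build a small tree: document -> page -> two lines.
doc = Document(metadata={"source": "invoice.pdf"})
doc.root = ContentNode("document")
page = doc.root.add_child(ContentNode("page"))
first_line = page.add_child(ContentNode("line", content="Invoice 0001"))
total_line = page.add_child(ContentNode("line", content="Total due: 42.00"))

# Features and tags live on the individual nodes.
total_line.features["font_size"] = 12
total_line.tags.add("total")

# Navigation: find tagged nodes, then trace one back to the root.
tagged = [node for node in doc.root.walk() if "total" in node.tags]
path_types = [node.node_type for node in total_line.lineage()]
```

Keeping a `parent` reference on every node is what makes lineage cheap to recover in this sketch: any node found by a search can be traced back to the page and document it came from without a second pass over the tree.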