JSON Representation

A Kodexa document usually lives in memory, since that is where all of its functionality is available. The document object can also be converted into a JSON representation that can be stored on disk; the JsonDocumentStore relies on this to act as a store for documents.

The JSON structure mirrors much of the document's standard structure: metadata, content, nodes, and features, as you would expect. Let's look at how a simple PDF is represented (this document was generated with the PDF parser and layout modules).

Starting with the first part of the JSON:

{
  "version": "1.0.0",
  "metadata": {
    "source_path": "example.pdf",
    "connector": "folder",
    "mime_type": [
      "application/pdf",
      null
    ],
    "connector_options": {
      "path": "/fileshare/",
      "file_filter": ".*.pdf"
    }
  },
  ....
}

First, the JSON records the version of the document format. This lets newer versions of Kodexa work out which format version was used to write the JSON (note that the document version is not the same thing as the Kodexa version).

Next comes the metadata, which tells you a little about the document. Here we can see the source path (the filename), the type of connector used, the MIME type, and the connector options (which in this case describe the folder that was used to find the document).
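
As a rough illustration of the fields described above, the sketch below reads the version and metadata using only Python's standard json module. It does not use the Kodexa API; the helper name summarize_header is hypothetical, and the JSON is a cut-down stand-in that contains only the keys shown in this section (the elided parts are left out so that it parses).

```python
import json

# Stand-in for the example above: only the "version" and "metadata"
# keys shown in this section are included (the elided part is omitted).
raw = """
{
  "version": "1.0.0",
  "metadata": {
    "source_path": "example.pdf",
    "connector": "folder",
    "mime_type": ["application/pdf", null],
    "connector_options": {"path": "/fileshare/", "file_filter": ".*.pdf"}
  }
}
"""

def summarize_header(doc):
    """Hypothetical helper: collect the version and main metadata fields."""
    metadata = doc.get("metadata", {})
    # In the example, mime_type is a two-element list; take the first entry.
    mime_type = metadata.get("mime_type") or [None]
    return {
        "version": doc.get("version"),
        "source_path": metadata.get("source_path"),
        "connector": metadata.get("connector"),
        "mime_type": mime_type[0],
        "connector_options": metadata.get("connector_options", {}),
    }

header = summarize_header(json.loads(raw))
print(header["version"], header["source_path"], header["mime_type"])
```

Using .get with defaults keeps the helper tolerant of documents that lack some of these keys, which seems prudent given that the format is versioned.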

The next part of the JSON holds the content structure, including all the features (tags are one kind of feature).

{
  ...
  "content_node": {
    "type": "root",
    "content": null,
    "features": [],
    "index": 0,
    "children": [
      {
        "type": "page",
        "content": "",
        "features": [
          {
            "name": "spatial:bbox",
            "value": [
              0,
              0,
              792,
              612
            ],
            "single": true
          }
        ],
        ....
}
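
To make the nesting concrete, here is a minimal sketch that walks a content node tree and lists every feature it finds, again using only the standard json module rather than the Kodexa API. The function name iter_features is hypothetical, and the JSON is a shortened, closed-off version of the example above: the page node carries only the fields shown here, and everything that was elided is omitted.

```python
import json

# Shortened version of the example above, closed off so that it parses.
raw = """
{
  "content_node": {
    "type": "root",
    "content": null,
    "features": [],
    "index": 0,
    "children": [
      {
        "type": "page",
        "content": "",
        "features": [
          {"name": "spatial:bbox", "value": [0, 0, 792, 612], "single": true}
        ]
      }
    ]
  }
}
"""

def iter_features(node, depth=0):
    """Hypothetical helper: yield (depth, node type, feature name, value)."""
    for feature in node.get("features", []):
        yield depth, node.get("type"), feature["name"], feature["value"]
    # Recurse into child nodes; nodes without children are treated as leaves.
    for child in node.get("children", []):
        yield from iter_features(child, depth + 1)

root = json.loads(raw)["content_node"]
found = list(iter_features(root))
for depth, node_type, name, value in found:
    print("  " * depth + f"{node_type}: {name} = {value}")
```

The walk is depth-first, so features are reported in the order their nodes appear in the JSON.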