The Document introduces three concepts that, when combined, provide the building blocks of our processing: selectors, tags, and features.
Selectors allow you to search the Document based on the Document's content, structure, or both. These are similar to XPath queries, but have been tailored to work with our content model. You will use selectors to identify areas of the Document to which you want to apply additional processing/extraction.
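To make the idea of selecting by structure, content, or both concrete, here is a minimal sketch. It does not use Kodexa's selector syntax or content model. Instead it uses the limited XPath support in Python's standard `xml.etree.ElementTree` as a stand-in, and the `select` helper and sample markup are hypothetical names invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Stand-in document for illustration; this is NOT Kodexa's content model.
markup = (
    "<html><body>"
    "<h1>Greeting</h1>"
    "<p>Hello world</p>"
    "<p>Goodbye</p>"
    "<div><p>Hello again</p></div>"
    "</body></html>"
)
root = ET.fromstring(markup)


def select(root, node_type, phrase=None):
    """Hypothetical helper: find descendants by type, optionally filtered by content."""
    # Structure: match every descendant element with the given tag name.
    matches = root.findall(f".//{node_type}")
    # Content: keep only the elements whose text contains the phrase.
    if phrase is not None:
        matches = [node for node in matches if phrase in (node.text or "")]
    return matches


all_paragraphs = select(root, "p")                     # structure only
hello_paragraphs = select(root, "p", phrase="Hello")   # structure and content

print(len(all_paragraphs))
print([p.text for p in hello_paragraphs])
```

The first call narrows by structure alone, while the second adds a content condition, which mirrors the "content, structure, or both" distinction described above.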
Adding tags to a Document is a way to "label" data within the Document. It allows you to easily mark parts of the structure, or the text within a node, with a specific meaning. For example, when processing an HTML file, you may want to add a tag named "Hi" to every node of type 'p' (paragraph) that contains the phrase "Hello". In a later processing step, you may select the nodes with the "Hi" tag and perform another action on them. Since processing steps can refer to previously applied tags, tags provide a powerful and flexible way to build up an understanding of the Document incrementally.
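The "Hi" example can be sketched as two processing steps over a toy node tree. This is only an illustration of the pattern: the `Node` class, its `tags` set, and the `walk` method are hypothetical and are not the Kodexa API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a document node (not the Kodexa class)."""
    node_type: str
    content: str = ""
    children: list = field(default_factory=list)
    tags: set = field(default_factory=set)

    def walk(self):
        # Yield this node, then every descendant, in document order.
        yield self
        for child in self.children:
            yield from child.walk()


doc = Node("body", children=[
    Node("h1", "Hello from the heading"),
    Node("p", "Hello world"),
    Node("p", "Goodbye"),
])

# Step 1: tag every 'p' node whose content contains "Hello".
for node in doc.walk():
    if node.node_type == "p" and "Hello" in node.content:
        node.tags.add("Hi")

# Step 2 (a later step): rely only on the tag, not on the original rule.
greetings = [node.content for node in doc.walk() if "Hi" in node.tags]
print(greetings)
```

The second step does not need to know how the "Hi" tag was assigned, which is what lets later steps build on earlier ones.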
Features are similar to tags, in that they can be added to nodes to provide additional information about the node or its contents. Features record more granular information than tags, such as spatial coordinates identified during parsing, or the entity type of each word in a node's content when performing named-entity recognition (NER).
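As a rough sketch of how features differ from tags, the toy node below stores structured values keyed by a feature type and name. The `Node` class, the `add_feature` and `get_feature` methods, the bounding box, and the entity labels are all assumptions made for this illustration, not values or signatures taken from Kodexa.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a document node (not the Kodexa class)."""
    content: str
    features: dict = field(default_factory=dict)

    def add_feature(self, feature_type, name, value):
        # Features are keyed by (type, name) and can hold any structured value.
        self.features[(feature_type, name)] = value

    def get_feature(self, feature_type, name, default=None):
        return self.features.get((feature_type, name), default)


node = Node("Ada Lovelace visited London")

# Spatial information that a parser might record (made-up coordinates).
node.add_feature("spatial", "bbox", (72.0, 540.5, 310.2, 556.0))

# One entity label per word, as an NER step might produce (made-up labels).
node.add_feature("ner", "entities", ["PERSON", "PERSON", "O", "LOCATION"])

words = node.content.split()
labelled = list(zip(words, node.get_feature("ner", "entities")))
print(labelled)
```

Where a tag only marks that a node has some meaning, a feature in this sketch carries the data itself, such as a coordinate tuple or a per-word label list.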
When you start solving problems with Kodexa, you will learn that the flexibility of the Document is your friend. It gives you a consistent way to work across use cases, and because the model and API are consistent, you can write reusable code that applies to many of them.