Connectors

Connectors are used to provide documents to a pipeline. For example:

pipeline = Pipeline(FolderConnector(path='/my_important_docs'))

When creating a Kodexa Document from plain text, or reading a JsonDocumentStore of fully formed Kodexa Documents, you do not need to add a parsing step to the pipeline. The pipeline can process these inputs into fully formed Kodexa Documents on its own.

When using connectors of other types (files, folders, URLs) to read non-Kodexa Documents, you need to add a pipeline step that parses each document so that it is fully formed. If you do not provide a parser in these cases, the pipeline still returns Kodexa Documents, but their content nodes contain no text; they carry only metadata describing the connector and the document source.

If the connector provides a file that needs to be parsed, it will return an empty content node in the document but it will set the following metadata:

source_path=<the path to the file, as understood in the context of the connector>
connector=<the name of the connector in the registry>
connector_option=<the options for that connector>
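As a rough illustration of the metadata listed above, the sketch below uses a plain Python dict as a stand-in for a document's metadata. The values, the dict representation, and the describe_source helper are assumptions made for illustration; the real Kodexa metadata object may be accessed differently.

```python
# Plain-dict stand-in for the metadata a connector sets on an unparsed document.
# All values below are hypothetical examples, not output from Kodexa.
metadata = {
    "source_path": "/my_important_docs/report.pdf",
    "connector": "folder",
    "connector_option": {"path": "/my_important_docs"},
}


def describe_source(metadata):
    """Return a one-line summary of where an unparsed document came from."""
    return f"{metadata['connector']}:{metadata['source_path']}"


print(describe_source(metadata))
```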

You can then retrieve that file from the connector at any time by using:

from kodexa import connectors
source = connectors.get_source(document)

Note that this returns either a temporary file (if the document had to be downloaded) or the original file, in both cases with the file pointer reset to the beginning. Close the file once you are finished with it so that any temporary file is deleted.
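The close-when-done pattern can be sketched without Kodexa installed. The example below uses a tempfile.NamedTemporaryFile from the Python standard library as a stand-in for the file-like object a connector might hand back; that the real object behaves the same way is an assumption, and read_and_release is a hypothetical helper name.

```python
import os
import tempfile


def read_and_release(source):
    """Read all bytes from a file-like source, then close it even on error."""
    try:
        return source.read()
    finally:
        source.close()


# Stand-in for a downloaded source: a temporary file that is deleted on close.
stand_in = tempfile.NamedTemporaryFile(suffix=".pdf")
stand_in.write(b"example bytes")
stand_in.seek(0)  # pointer reset to the beginning, as described above
temp_path = stand_in.name

data = read_and_release(stand_in)
print(len(data))
print(os.path.exists(temp_path))
```

Wrapping the read in try/finally means the file is closed, and a temporary file removed, even if reading raises an exception.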
