Tika - Reindexation
Crawl all files and extract content using Tika.

Apache Tika is an independent, open-source content extractor that supports a very wide range of file formats. It can also use OCR to extract text from images.
Although content extraction should normally happen in an event-based manner (when files are uploaded or modified), this job is useful when enabling Tika on an existing system, because it crawls all existing files and indexes them in one pass.
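
To make the extraction step concrete, here is a minimal sketch of how a client could ask a Tika server for the plain text of one file. It assumes the standard Tika Server `PUT /tika` endpoint and the default address `localhost:9998` from the parameters of this job. The helper names `build_tika_request` and `extract_text` are hypothetical, and this is not necessarily how the job itself talks to Tika.

```python
import urllib.request


def build_tika_request(server, data, content_type="application/octet-stream"):
    """Build a PUT request asking a Tika server to return plain text.

    `server` is a host:port string such as "localhost:9998" (assumed to be
    reachable over plain HTTP).
    """
    return urllib.request.Request(
        f"http://{server}/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": content_type},
    )


def extract_text(server, data):
    """Send the file bytes to Tika and return the extracted text."""
    request = build_tika_request(server, data)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

With a Tika server running, `extract_text("localhost:9998", open("report.pdf", "rb").read())` would return the text of the hypothetical file `report.pdf`.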
Parameters
Name | Type | Default | Mandatory | Description |
---|---|---|---|---|
TikaServer | text | localhost:9998 | true | Address of the tika service. |
Extensions | text | pdf\|doc\|docx\|html\|xls\|xlsx\|pptx\|key | true | Limit list of extensions to be analyzed |
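
As an illustration of what the `Extensions` value expresses, the sketch below treats it as a pipe-separated list of allowed extensions and checks a file name against it. The case-insensitive comparison and the helper name `has_allowed_extension` are assumptions made for this example; the node selector may apply different matching rules.

```python
import os


def has_allowed_extension(filename, extensions):
    """Return True if the file's extension is in the pipe-separated list."""
    # Split "pdf|doc|docx" into a set of lowercase extensions, ignoring blanks.
    allowed = {ext.strip().lower() for ext in extensions.split("|") if ext.strip()}
    # os.path.splitext returns (".pdf"-style) extensions; drop the leading dot.
    file_ext = os.path.splitext(filename)[1].lstrip(".").lower()
    return file_ext in allowed


default_extensions = "pdf|doc|docx|html|xls|xlsx|pptx|key"
print(has_allowed_extension("report.PDF", default_extensions))
print(has_allowed_extension("notes.txt", default_extensions))
```

Under these assumptions, `report.PDF` is selected and `notes.txt` is skipped.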
Trigger Type
Manual
JSON Representation
{
  "Label": "Tika - Reindexation||Crawl all files and extract content using Tika||mdi mdi-magnify",
  "Owner": "pydio.system.user",
  "Custom": true,
  "Actions": [
    {
      "ID": "actions.contents.tika",
      "NodesSelector": {
        "Query": {
          "SubQueries": [
            {
              "type_url": "type.googleapis.com/tree.Query",
              "value": "MAFSHXt7LkpvYlBhcmFtZXRlcnMuRXh0ZW5zaW9uc319"
            }
          ],
          "Operation": 1
        },
        "Label": "Select files with extension"
      },
      "Parameters": {
        "additionalMeta": "Content-Type",
        "compressContent": "true",
        "extractContent": "pydio-binaries/tika-{{.Node.Uuid}}.gz",
        "fieldname": "{\"@value\":\"Extension\"}",
        "serverAddress": "{{.JobParameters.TikaServer}}"
      },
      "ChainedActions": [
        {
          "ID": "actions.tree.meta",
          "Parameters": {
            "metaJSON": "{}"
          }
        }
      ]
    }
  ],
  "MaxConcurrency": 10,
  "Parameters": [
    {
      "Name": "TikaServer",
      "Description": "Address of the tika service.",
      "Value": "localhost:9998",
      "Mandatory": true,
      "Type": "text"
    },
    {
      "Name": "Extensions",
      "Description": "Limit list of extensions to be analyzed",
      "Value": "pdf|doc|docx|html|xls|xlsx|pptx|key",
      "Mandatory": true,
      "Type": "text"
    }
  ]
}
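
The `value` of the sub-query in the JSON above is a base64-encoded protobuf message of type `tree.Query`. The sketch below decodes it at the wire level, without the Pydio `.proto` definitions, to show that it carries the `Extensions` job parameter. It only handles the two wire types that occur in this payload (varint and length-delimited), and the helper names `read_varint` and `decode_fields` are hypothetical.

```python
import base64


def read_varint(buf, pos):
    """Decode a protobuf base-128 varint at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear marks the last byte
            return result, pos
        shift += 7


def decode_fields(raw):
    """Return (field_number, value) pairs for varint and string fields."""
    fields = []
    pos = 0
    while pos < len(raw):
        key, pos = read_varint(raw, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(raw, pos)
        elif wire_type == 2:  # length-delimited, read here as UTF-8 text
            length, pos = read_varint(raw, pos)
            value = raw[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((number, value))
    return fields


encoded = "MAFSHXt7LkpvYlBhcmFtZXRlcnMuRXh0ZW5zaW9uc319"
print(decode_fields(base64.b64decode(encoded)))
# prints [(6, 1), (10, '{{.JobParameters.Extensions}}')]
```

Field 10 holds the template `{{.JobParameters.Extensions}}`, which is how the selector labeled "Select files with extension" receives the job parameter. Mapping field numbers 6 and 10 to their names requires the `tree.Query` definition, which is not reproduced here.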