Lucene Search Engine


Identity Card

Status: Core
Plugin Label: Lucene Search Engine
Short Description: Zend_Search_Lucene implementation to index all files and search a whole workspace quickly.
Plugin Identifier: index.lucene
Author: Charles du Jeu
URL: docs/references/plugins/index/lucene
Dependencies: access.fs, access.smb, access.imap, access.swift, access.s3, access.inbox, access.demo, access.dropbox, access.webdav, access.sftp_psl, access.smbicewind, access.sftp, access.ftp

Documentation

This plugin uses the Zend_Search_Lucene library, a PHP implementation of Apache Lucene, to index files and provide an efficient search tool. To have a repository indexed, you must add the meta source "index.lucene" to it.

The plugin supports indexing of metadata, background indexing of huge folders (if the framework can be run in the background via the command line), and indexing of file contents for textual files (TXT, HTML). PDF indexing through a pdf-to-text conversion would be possible, but it is not implemented yet.
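As an illustration of the content-indexing rule, here is a minimal Python sketch of how a file could be selected for content parsing from its size and extension. It reuses the default values of the PARSE_CONTENT_MAX_SIZE, PARSE_CONTENT_HTML and PARSE_CONTENT_TXT plugin parameters documented under "Plugin parameters"; the function names are hypothetical and this is not the plugin's actual PHP code.

```python
import os

# Defaults copied from the plugin parameters; everything else is hypothetical.
PARSE_CONTENT_MAX_SIZE = 500000  # bytes
PARSE_CONTENT_HTML = "html,htm"
PARSE_CONTENT_TXT = "txt"


def _extensions(csv_list):
    """Turn a comma-separated extension list into a set of lowercase names."""
    return {item.strip().lower() for item in csv_list.split(",") if item.strip()}


def content_kind(filename, size_bytes,
                 max_size=PARSE_CONTENT_MAX_SIZE,
                 html_exts=PARSE_CONTENT_HTML,
                 txt_exts=PARSE_CONTENT_TXT):
    """Return "html" or "txt" if the content would be parsed, else None."""
    # Files bigger than the limit are skipped entirely.
    if size_bytes > max_size:
        return None
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    if ext in _extensions(html_exts):
        return "html"
    if ext in _extensions(txt_exts):
        return "txt"
    # Any other extension: only metadata would be indexed.
    return None
```

Under these assumptions, content_kind("notes.txt", 1200) selects the text parser, while a file above the size limit or with an unlisted extension is indexed by metadata only.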

The search results display a "Hit Score" that is provided by the search engine.

UNOCONV + XPDF INTEGRATION

If you can install the unoconv utility on your server, along with the headless OpenOffice or LibreOffice suite and the xpdf utility, the plugin will be able to extract and index textual content from office documents (Word, Excel, PowerPoint and all their closed or open-source variants).

Example package installation on CentOS: yum install unoconv openoffice.org-headless openoffice.org-writer openoffice.org-calc openoffice.org-impress xpdf
Or on Debian: apt-get install unoconv openoffice.org-headless openoffice.org-writer openoffice.org-calc openoffice.org-impress xpdf
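
To give an idea of what these tools do once installed, the following Python sketch builds the command lines for one possible extraction chain: unoconv converts an office document to PDF, then pdftotext (shipped with xpdf) dumps the PDF to plain text. The two-step chain, the function names and the paths are assumptions made for illustration; the source does not describe how the plugin actually invokes the binaries.

```python
import os
import subprocess


def extraction_commands(source, workdir, unoconv_path, pdftotext_path):
    """Build the commands that would turn `source` into a .txt file in `workdir`.

    Assumed chain (not confirmed by the plugin documentation):
    office document -> PDF via unoconv -> text via pdftotext.
    """
    base = os.path.splitext(os.path.basename(source))[0]
    text_out = os.path.join(workdir, base + ".txt")
    if source.lower().endswith(".pdf"):
        # PDFs only need the pdftotext step.
        return [[pdftotext_path, source, text_out]]
    pdf_out = os.path.join(workdir, base + ".pdf")
    return [
        # unoconv: -f selects the output format, -o the output file.
        [unoconv_path, "-f", "pdf", "-o", pdf_out, source],
        # pdftotext: input PDF followed by the output text file.
        [pdftotext_path, pdf_out, text_out],
    ]


def run_extraction(commands):
    """Run each command in order, raising if one of them fails."""
    for command in commands:
        subprocess.run(command, check=True)
```

For example, extraction_commands("/data/report.docx", "/tmp/idx", "/usr/bin/unoconv", "/usr/bin/pdftotext") returns the two commands to pass to run_extraction; the binary paths play the role of the UNOCONV and PDFTOTEXT plugin parameters.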

Plugin parameters

Label | Description | Type | Default
Parse Content Until * (PARSE_CONTENT_MAX_SIZE) | Skip content parsing and indexing for files bigger than this size (must be in bytes) | String | 500000
HTML files * (PARSE_CONTENT_HTML) | List of extensions to treat as HTML files and parse for content | String | html,htm
Text files * (PARSE_CONTENT_TXT) | List of extensions to treat as text files and parse for content | String | txt
Unoconv Path (UNOCONV) | Full path on the server to the 'unoconv' binary | String |
PdftoText Path (PDFTOTEXT) | Full path on the server to the 'pdftotext' binary | String |
Query Analyzer (QUERY_ANALYSER) | Analyzer used by Zend to parse the queries. Warning: the UTF8 analyzers require the PHP mbstring extension. | Select (utf8num_insensitive, utf8num_sensitive, utf8_insensitive, utf8_sensitive, textnum_insensitive, textnum_sensitive, text_insensitive, text_sensitive) | textnum_insensitive
Wildcard limitation (WILDCARD_LIMITATION) | For performance reasons, it is not recommended to use a wildcard as the very first character of a query string. Lucene recommends requiring at least 3 characters before a wildcard. You can still set it to 0 if your use cases require it. | Integer | 3
Auto-Wildcard (AUTO_WILDCARD) | Automatically append a * after the user query to make the search broader | Boolean | false
Hide 'My Shares' (HIDE_MYSHARES_SECTION) | Hide the My Shares section in the Orbit theme GUI. | Boolean | false
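
The interplay between WILDCARD_LIMITATION and AUTO_WILDCARD can be pictured with the short Python sketch below. It is only an interpretation of the two descriptions above: the function name, the error handling and the exact order of the two steps are assumptions, and the real checks are performed by the plugin and by Zend_Search_Lucene.

```python
def prepare_query(query, wildcard_limitation=3, auto_wildcard=False):
    """Return the query as it would be sent to the search engine.

    Hypothetical helper: appends a trailing * when auto_wildcard is set,
    then rejects any term whose first wildcard comes before
    `wildcard_limitation` regular characters.
    """
    query = query.strip()
    if not query:
        raise ValueError("empty query")
    # AUTO_WILDCARD: broaden the search by appending * to the user query.
    if auto_wildcard and not query.endswith(("*", "?")):
        query += "*"
    # WILDCARD_LIMITATION: minimum number of characters before a wildcard.
    for term in query.split():
        positions = [i for i, ch in enumerate(term) if ch in "*?"]
        if positions and positions[0] < wildcard_limitation:
            raise ValueError(
                "term %r needs at least %d characters before a wildcard"
                % (term, wildcard_limitation)
            )
    return query
```

With the defaults, "rep*" is accepted and "re*" is rejected; setting wildcard_limitation to 0 lets a query such as "*port" through, at the performance cost mentioned in the parameter description.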

Instance parameters

Label | Description | Type | Default
Index Content * (index_content) | Parses the file when possible and indexes its content (see plugin global options) | Boolean | false
Index Meta Fields (index_meta_fields) | Which additional fields to index and search | String |
Repository keywords (repository_specific_keywords) | If your workspace path is defined dynamically by specific keywords like AJXP_USER, or your own, mention them here. | String |