Extension:FileIndexer and some cool new tools

Today’s the day we start looking into Extension:FileIndexer as a way to make .pdf and “Office” format documents searchable in MediaWiki. By default only text documents are indexed in the wiki database, leaving everything else unsearchable (except by document name).


The extension is still in beta. The latest version, 0.4.5.03, was posted in March of this year. Reportedly it was tested successfully in October against MediaWiki 1.17, the latest version (see Compatibility).

The code for the extension can be found here.

Prerequisites include the following:

strings
iconv
xpdf (pdftotext)
antiword
catdoc (including xls2csv and catppt)

The first two are included with the system on Red Hat and Red Hat-derived distros. RPM packages for the latter three are available from third-party repositories (EPEL and rpmforge).
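
Before installing the extension it is worth confirming that all of these helpers are actually on the PATH. The following is a minimal sketch, not something the extension ships. The function name missing_tools is made up for illustration, and the binary names assume that xpdf installs its text extractor as pdftotext and that the catdoc package installs catdoc, xls2csv and catppt under those names.

```shell
#!/bin/sh
# Print the name of every listed program that is not found on PATH.
# Returns 1 if at least one is missing, 0 otherwise.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return $rc
}

# Assumed binary names for the prerequisites; adjust to your packages.
if ! missing=$(missing_tools strings iconv pdftotext antiword catdoc xls2csv catppt); then
    echo "Missing prerequisites:" $missing >&2
fi
```

If nothing is printed, every helper was found. Anything reported on stderr still needs to be installed before the indexer can handle that document type.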

One of the good things about trying out these kinds of extensions is that you discover useful tools that you’d never heard of. That was the case for me with catdoc and its companions xls2csv and catppt. While these would be of minimal use for rendering anything other than plain text presentations of data in those Office products, they’re just the thing for capturing text content so that it can be indexed and used by a search engine.
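
To get a feel for what these tools produce, you can run them by hand. Here is a small sketch that picks the catdoc-family tool by file extension and dumps the text to stdout. The function names office_text_tool and office_to_text are my own, and this is not necessarily how the extension itself calls the tools.

```shell
#!/bin/sh
# Print the catdoc-family tool that reads the given file, chosen by
# (case-insensitive) extension. Returns 1 for unsupported extensions.
office_text_tool() {
    name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$name" in
        *.doc) echo catdoc ;;
        *.xls) echo xls2csv ;;
        *.ppt) echo catppt ;;
        *) return 1 ;;
    esac
}

# Write the plain-text content of an Office file to stdout.
office_to_text() {
    tool=$(office_text_tool "$1") || {
        echo "unsupported file type: $1" >&2
        return 1
    }
    "$tool" "$1"
}
```

For example, office_to_text budget.xls | grep -i total would search the spreadsheet's text for a word (budget.xls is a hypothetical file). That plain text is exactly the kind of output a search index needs.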

It goes without saying that following the directions for installing this extension is essential to actually having it work. I can’t emphasize enough the importance of reading both the Extension and Discussion pages, including the archived Talk material from previous versions. The current maintainer, like many (most?) developers, is not a professional writer.

That said, I am not going to attempt to re-write or supplement in any detail what he’s provided. Given its status as beta software, the chances are good that the installation and configuration details will change, and I don’t want to commit anything to writing here that could confuse later readers.

A few things about the current version: (1) Study the Configuration section. All comments in the actual code are in German, so they’ll be useless as a guide to most of you; (2) Be sure to create Template:FileIndex; (3) Check and recheck file permissions on the NS_IMAGE directory (usually $IP/images), including any temp directory under it, to make sure your web server user has write privileges (I usually chown -R apache:staff and then chmod -R g+w).
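
For the permissions check in (3), a quick audit can save some head-scratching. This sketch lists any directory under a given root that lacks the group write bit, which is what chmod -R g+w is meant to set. The function name and the example path are assumptions; substitute your own $IP/images.

```shell
#!/bin/sh
# List directories under the given root (including the root itself)
# that do NOT have the group write bit (octal 020) set.
dirs_without_group_write() {
    find "$1" -type d ! -perm -020
}

# Example (hypothetical install path):
#   dirs_without_group_write /var/www/mediawiki/images
# Fix, as root, along the lines described above:
#   chown -R apache:staff /var/www/mediawiki/images
#   chmod -R g+w /var/www/mediawiki/images
```

Empty output means every directory is group-writable. It does not prove the web server user is in the owning group, so check that separately.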

I tried setting $wgFiRequestIndexCreationFile to something other than the default “/tmp”, but the extension ignored it (test setting was /data/tmp, which is a world-writable directory on a really big partition). As a result I’d recommend setting up a cron job to run a file cleanup script to sweep out accumulated temporary doc files left there by the indexer.
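
The cleanup script can be very small. Below is a rough sketch using find (the -maxdepth, -mmin and -delete options are available in GNU find, which is what Red Hat ships). The function name, the '*.tmp' pattern and the one-day age are placeholders of mine: look at what the indexer actually leaves in /tmp on your system and pick a pattern that matches only those files.

```shell
#!/bin/sh
# Delete regular files directly inside DIR whose names match PATTERN and
# that were last modified more than MINUTES minutes ago.
# Each deleted path is printed so that cron can mail a record of the sweep.
sweep_temp_files() {
    dir=$1
    pattern=$2
    minutes=$3
    find "$dir" -maxdepth 1 -type f -name "$pattern" -mmin "+$minutes" -print -delete
}

# Example call (placeholder pattern, files older than one day):
#   sweep_temp_files /tmp '*.tmp' 1440
```

Saved as a script with the example call uncommented, it could be run nightly from cron with an entry such as 0 3 * * * /usr/local/sbin/sweep-indexer-tmp.sh (a hypothetical path). Keep the pattern narrow, since /tmp is shared with everything else on the box.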

I added the recommended lines to LocalSettings.php and reset some additional parameters in FileIndexer_cfg.php to suit my usage. For example, in FileIndexer_cfg.php I changed the following:

$wgFiCheckSystem = true; # was false

After experimenting a bit I finally got down to testing the extension seriously. It was able to parse and index the content from a number of different document types, including .pdf and .doc, the most common in my world.

I’m still not sure whether creating indexes automatically on upload is a good idea in production. This is the default, which you can turn off by setting $wgFiCreateOnUploadByDefault to false, thus:

$wgFiCreateOnUploadByDefault = false;

A user would still be able to check the box on the upload form to have the document indexed, but they’d need to be deliberate about it. For now I’m going to make it optional and periodically use the form on the Special:Fileindexer page to create or update indexes for select kinds of documents (like .pdf and .doc).