More iconv-nience

In An iconv-nient truth, I described how switching from the system iconv to Perl’s Text::Iconv got me past a nagging problem with a particular dataset. The upshot was that when presented with a line of text containing something it couldn’t understand, iconv would throw an ugly error. Text::Iconv, by contrast, would take the cowardly route and skip the offending line altogether. At the time I noted that both behaviors had potential downstream unpleasantness, but that in the end getting a blank line was better than having a 15,000-record translation abort at line 3,000.

So here I am, a year and a half later, facing one of those downstream downsides.

It turns out that data on our brave colleagues in Lithuania was not getting reported out because my Text::Iconv routine was choking on a couple of characters in their street address. A test with the system iconv returned the dreaded “illegal input sequence at position XXXXXX” error.

After a little rooting around the Internets, I found a technique to get around the error with my RHEL (Red Hat Enterprise Linux) 5 system’s iconv.

iconv -f UTF-8 -t ISO-8859-15//TRANSLIT easteu.csv -o easteu.csv~

The //TRANSLIT switch, although not documented in RHEL’s man pages on iconv or iconv_open, does show up in section 3 of the GNU documentation for iconv_open. The fact that this information is missing from the standard documentation set has been noted before, and duly ignored (it does show up prominently in the Ubuntu Manpage entry for iconv). From the GNU doc:

When the string “//TRANSLIT” is appended to tocode, transliteration is activated. This means that when a character cannot be represented in the target character set, it can be approximated through one or several characters that look similar to the original character.

Which is exactly what the doctor ordered in this case.
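
To make the idea of "approximating" a character concrete, here is a minimal Python sketch of a similar fallback. It is not how iconv implements //TRANSLIT, and iconv's actual substitutions may differ; the function name encode_with_fallback is made up for illustration. The assumption is that an accented letter can be reduced to its base letter by Unicode decomposition, with anything still unrepresentable replaced by a question mark.

```python
import unicodedata


def encode_with_fallback(text, target="iso-8859-15"):
    """Encode text to `target`, approximating characters the charset lacks.

    Hypothetical helper for illustration only; it mimics the spirit of
    transliteration, not iconv's own tables.
    """
    out = bytearray()
    for ch in text:
        try:
            # Characters the target charset knows are encoded as-is.
            out += ch.encode(target)
        except UnicodeEncodeError:
            # Decompose the character (e.g. "č" becomes "c" plus a combining
            # caron) and drop the combining marks, keeping the base letter.
            base = "".join(
                c
                for c in unicodedata.normalize("NFKD", ch)
                if not unicodedata.combining(c)
            )
            # Anything that still cannot be encoded becomes "?".
            out += base.encode(target, errors="replace")
    return bytes(out)


if __name__ == "__main__":
    print(encode_with_fallback("Šiaulių"))
```

With the default target, "Š" survives because ISO-8859-15 includes it, while "ų" has no slot in that charset and falls back to a plain "u". Unlike a skipped line, the record still comes through, just slightly less accented.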