When importing or exporting large volumes of directory data that include international characters, it’s important to make sure UTF-8 encoding is used throughout.
My thanks to Greg Cranz for starting a conversation on this subject that led me to finally get it straight.
Back at the dawn of time before we were all one world and everyone lived and worked in their own computerized silos, IBM, Microsoft and others came up with a variety of (mostly incompatible) schemes for displaying international characters. Then Unicode came along and all our problems were solved. Well, not really.
First of all, many of those pre-Unicode schemes are still in use and the major software vendors continue to enable them for backward compatibility. To do otherwise would undoubtedly provoke the wrath of customers both big and small. Secondly, while UTF-8 is the dominant encoding scheme used on the Internet and in a lot of software, it competes with other schemes like UTF-16 that are favored by many.
The LDAPv3 specification requires directory string values to be UTF-8 encoded, so every LDAPv3 compliant directory expects string data in that form. As a result, it is important to make sure that whatever you import is compatible with that encoding.
Over the years Perl’s UTF-8 handling has gone from being a bolt-on to an integral part of the language. As a result, there are a number of pragmas and I/O layers available that make it easy to import and export UTF-8 encoded strings.
For example, you can instruct Perl to expect UTF-8 encoding when reading from or writing to files. The :utf8 I/O layer lets you do this in a simple and straightforward way:
open(FH, "<:utf8", $infile) or die $!;
open(FH1, ">:utf8", $outfile) or die $!;
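To show how those two open calls fit into a complete import/export step, here is a minimal sketch that copies a file line by line through :utf8 layers. The helper name copy_utf8 and the file names in the comment are made up for illustration, and it uses lexical filehandles rather than the barewords above.

```perl
use strict;
use warnings;

# Hypothetical helper: copy a text file line by line through :utf8 layers
# and return the number of characters (not bytes) that were copied.
sub copy_utf8 {
    my ($infile, $outfile) = @_;

    open(my $in,  "<:utf8", $infile)  or die "Cannot read $infile: $!";
    open(my $out, ">:utf8", $outfile) or die "Cannot write $outfile: $!";

    my $chars = 0;
    while (my $line = <$in>) {
        # With the :utf8 layer, length() counts characters, so an
        # accented letter counts once even though it is two bytes on disk.
        $chars += length $line;
        print {$out} $line;
    }

    close($in);
    close($out) or die "Cannot close $outfile: $!";
    return $chars;
}

# Example call (file names are placeholders):
# my $count = copy_utf8("export.ldif", "copy.ldif");
```

Because both handles carry the layer, strings inside the loop are character strings, which is what you want before doing any matching or trimming on attribute values.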
From the doc for PerlIO:
Declares that the stream accepts perl’s internal encoding of characters. (Which really is UTF-8 on ASCII machines, but is UTF-EBCDIC on EBCDIC machines.) This allows any character perl can represent to be read from or written to the stream. The UTF-X encoding is chosen to render simple text parts (i.e. non-accented letters, digits and common punctuation) human readable in the encoded file.
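There’s an important caution in the doc: a bare “:utf8” declaration will not validate that the data is UTF-8 encoded. To do that you need to use the “:encoding(utf8)” layer as described in the utf8 pragma doc. Note that Perl’s Encode documentation further distinguishes the lax “utf8” from the strict “UTF-8”, so “:encoding(UTF-8)” is the stricter choice.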
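If you would rather make the validation explicit than rely on a layer, one option is to read the raw bytes and run them through the strict decoder in the core Encode module. This is only a sketch under that assumption; the helper names is_valid_utf8 and invalid_utf8_lines are hypothetical, not something from the Perl docs.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if $bytes is a well-formed UTF-8 byte string.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # "UTF-8" (with the hyphen) selects Encode's strict decoder.
    # FB_CROAK makes decode() die on malformed input instead of
    # substituting characters; LEAVE_SRC keeps $bytes unmodified.
    my $ok = eval { decode("UTF-8", $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 1 : 0;
}

# Hypothetical helper: return the line numbers in $file that are not
# valid UTF-8. The file is read with :raw so no decoding happens first.
sub invalid_utf8_lines {
    my ($file) = @_;
    open(my $fh, "<:raw", $file) or die "Cannot read $file: $!";
    my @bad;
    while (my $line = <$fh>) {
        push @bad, $. unless is_valid_utf8($line);   # $. is the line number
    }
    close($fh);
    return @bad;
}
```

Running a check like this over an export before loading it into the directory gives you a list of suspect lines to inspect, rather than finding out from a failed import.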