Dealing with special characters in names

Most applications nowadays handle Unicode in text seamlessly.

If only that were true. A partial solution from one directory manager below.

For around 12 years I managed an LDAP directory service for a big multinational that held tens of thousands of active entries, and twice that in inactive ones.

In the early 2000s, when I started, we had around 200 names, mostly in Northern Europe, that contained special characters. Because we were a Sun shop at the time, we were running Sun’s Directory Server, which fortunately stored such data encoded in UTF-8 (even though the operating system defaulted to Latin-1).

Now, many years later, the number of active users with “special” names is just under 2,000: too many to ignore, and enough to warrant greater attention.

My early solution was to create custom attributes to store the decomposed (ASCII-ized) value of any givenname or sn that had special characters in it, and then add them to the user’s entry as additional attributes. Applications could then add these attributes to their search filters to find those entries using ASCII characters. This was done using a maintenance routine written in perl that ran every day as a batch job.
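To illustrate how an application might use the extra attributes, here is a minimal sketch of a filter builder. The attribute name corpdecompsn comes from my script; the helper names ldap_escape and surname_filter are made up for this example, and the escaping covers only the characters that are special in LDAP filter values.

```perl
use strict;
use warnings;

# Hypothetical helper: escape characters that are special in LDAP filter
# values (backslash, asterisk, parentheses, NUL) as backslash + hex code.
sub ldap_escape {
    my ($value) = @_;
    $value =~ s/([\\*()\x00])/sprintf("\\%02x", ord($1))/ge;
    return $value;
}

# Hypothetical helper: match either the stored surname or its ASCII-ized
# copy, so a user typing "Muller" still finds the entry for the accented name.
sub surname_filter {
    my ($typed) = @_;
    my $v = ldap_escape($typed);
    return "(&(objectclass=inetorgperson)(|(sn=$v)(corpdecompsn=$v)))";
}

print surname_filter("Muller"), "\n";
# prints (&(objectclass=inetorgperson)(|(sn=Muller)(corpdecompsn=Muller)))
```

The same pattern would apply to givenname and corpdecompgn.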

It turns out that this solution still works well today, but with huge improvements in perl’s UTF-8 handling, it was time to refactor my routines for creating and managing them.

My original script used Text::Unaccent::PurePerl, and was applied to all name strings indiscriminately. By using Test::utf8 in the new script I was able to make the script more efficient by only treating names with non-ASCII characters.
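The "only treat names with non-ASCII characters" check can also be sketched without any module at all. This is not what my script does (it calls is_within_ascii from Test::utf8); it is a plain regex stand-in, and the name has_non_ascii is made up for the example. It works the same whether the value is raw UTF-8 bytes or a decoded string, because both contain something above 0x7F.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the string holds any character or byte
# outside 7-bit ASCII, 0 otherwise (including for undef).
sub has_non_ascii {
    my ($s) = @_;
    return (defined($s) && $s =~ /[^\x00-\x7F]/) ? 1 : 0;
}

# The second value is a name with u-umlaut, written as raw UTF-8 bytes.
for my $name ("Smith", "M\xC3\xBCller") {
    print has_non_ascii($name) ? "decompose\n" : "skip\n";
}
```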

One factor I didn’t consider all those years ago was local differences in how certain characters are decomposed. For example, throughout most of Europe a ü is simply translated into “u”. But in Germany (DE), Austria (AT) and Switzerland (CH) most German speakers prefer “ue”. Someone could write a completely new perl module to replace Text::Unaccent::PurePerl that takes a country or language code as a parameter, but I decided to take the easy way out and write a wrapper routine around it. Name strings associated with people in DE, AT or CH pass through a “for” loop that does the German substitutions before going through Text::Unaccent::PurePerl’s more comprehensive unac_string method.
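The idea can be shown in isolation with core modules only. This sketch swaps in a different technique for the generic step: instead of Text::Unaccent::PurePerl it uses Unicode::Normalize to split letters from their accents and then drops the accents. It also assumes the name has already been decoded into Perl characters, which is not how my script handles the raw LDAP values. The function name ascii_name is made up for the example. Letters that have no accent decomposition (ß outside the German rule, ø, ł) pass through unchanged here, so this is not a full replacement for unac_string.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Digraphs preferred in German-speaking countries, keyed by code point.
my %german = (
    "\x{DC}" => 'UE', "\x{FC}" => 'ue',    # U-umlaut, upper and lower
    "\x{D6}" => 'OE', "\x{F6}" => 'oe',    # O-umlaut
    "\x{C4}" => 'AE', "\x{E4}" => 'ae',    # A-umlaut
    "\x{DF}" => 'ss',                      # sharp s
);

# Hypothetical helper: country-aware ASCII form of a decoded name.
sub ascii_name {
    my ($country, $name) = @_;
    if (defined $country && $country =~ /^(?:DE|AT|CH)$/i) {
        $name =~ s/([\x{C4}\x{D6}\x{DC}\x{E4}\x{F6}\x{FC}\x{DF}])/$german{$1}/g;
    }
    # Generic step: decompose to base letter + combining mark, drop the marks.
    my $plain = NFD($name);
    $plain =~ s/\p{Mn}//g;
    return $plain;
}

print ascii_name('DE', "M\x{FC}ller"), "\n";    # prints Mueller
print ascii_name('SE', "M\x{FC}ller"), "\n";    # prints Muller
```

Doing the country-specific substitution first matters: once the generic step has stripped the umlaut there is nothing left to tell “u” from “ü”.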

I moved all unaccent processing to its own subroutine to modularize the process, and added paged searching as an additional efficiency measure.

My complete script follows.

#!/usr/bin/perl
# decomp2usr.pl Script to decompose Unicode characters in user names and
# then push decomposed form of name back into LDAP. Creates 2 new attribs
# for every user, corpdecompgn and corpdecompsn.
# Originally created 1/14/02 by P Lembo, Refactored 4/21/15.

use strict;
use Net::LDAP;
use Net::LDAP::Entry;
use Net::LDAP::LDIF;
use Net::LDAP::Control::Paged;
use Net::LDAP::Constant qw( LDAP_CONTROL_PAGED );
use Test::utf8;
use Text::Unaccent::PurePerl;
use File::Copy;
use String::Util qw(trim);

my $HOME = $ENV{'HOME'};
my $BIN = "/usr/local/bin";
our ($dirUsr,$dirHost, $dirPass, $dirPort);

require "$HOME/etc/ldapapp.conf";

my $time = localtime();
my $errFile = "$HOME/data/logs/decomp2usr.log";
my $changefile = "$HOME/data/import/decompchg.ldif";
my $changefileName = "decompchg.ldif";

open LOGZ,">$errFile" or die $!;
print LOGZ "$time\tStart UserName Decomp Process\n";

decompnames();
update_ldap();
clean_up();

$time = localtime ();

print LOGZ "$time\tCompleted UserName Decomp Process\n";
close LOGZ;

sub decompnames	{

 open FH, ">$changefile" or die $!;

 my @attrs = qw(uid sn givenname corpdecompgn corpdecompsn c); # uid is read below for logging
 my $basedn = "ou=People,dc=corp,dc=com";
 my $query = "(\&(objectclass=inetorgperson)(givenname=*)(sn=*))";

 my $ldap = Net::LDAP->new($dirHost, port =>$dirPort);
 my $mesg = $ldap->start_tls(verify=>'none', sslversion =>'tlsv1');
 $mesg = $ldap->bind($dirUsr, password=> $dirPass) or die $!;
 my $page = Net::LDAP::Control::Paged->new( size => 1000) or die $!;
 my @args = (
    base => $basedn,
    scope => 'sub',
    filter => $query,
    attrs =>\@attrs,
    control => [ $page ],
);

 my $cookie;

 while (1) {
 
    $mesg = $ldap->search ( @args ) or die $!;

    while (my $entry = $mesg->shift_entry()) {

       my $dn = $entry->dn;
       my $uid = $entry->get_value('uid');
       my $sn = $entry->get_value('sn');
       my $givenname = $entry->get_value('givenname');
       my $corpdecompsn = $entry->get_value('corpdecompsn');
       my $corpdecompgn = $entry->get_value('corpdecompgn');
       my $c = $entry->get_value('c');
	   
       $sn = trim($sn);
       $givenname = trim($givenname);
 
       my $sntest = is_within_ascii($sn);
       my $gntest = is_within_ascii($givenname);

       if(($sntest eq '1')&&($gntest eq '1')) {
          next;
       }
       else {
         # If both first and last are non-ASCII
         if(($sntest eq '0')&&($gntest eq '0')) {
             my $dLname = cust_unac($c, $sn);
             my $dFname = cust_unac($c, $givenname);
             # Do both results match the existing values? If not, make change.
             if((lc($dLname) ne lc($corpdecompsn // ''))||(lc($dFname) ne lc($corpdecompgn // ''))) {
                 if(($dLname =~ /.+/)&&($dFname =~ /.+/)) {
                     print FH "dn: $dn\n";
                     print FH "changetype: modify\n";
                     print FH "replace: corpdecompsn\n";
                     print FH "corpdecompsn: $dLname\n";
                     print FH "-\n";
                     print FH "replace: corpdecompgn\n";
                     print FH "corpdecompgn: $dFname\n";
                     print FH "\n";

                     print LOGZ $uid . " full name " . $givenname . " " . $sn . " decomped name is " . $dFname . " " . $dLname, "\n";
                 } 
             }
             
         }
         # If last name is non-ASCII but first name is ASCII
         elsif(($sntest eq '0')&&($gntest eq '1')) {
             my $dLname = cust_unac($c, $sn);
             # Does result match existing? If not, make change.
             if(lc($dLname) ne lc($corpdecompsn // '')) {
                 if($dLname =~ /.+/) {
                     print FH "dn: $dn\n";
                     print FH "changetype: modify\n";
                     print FH "replace: corpdecompsn\n";
                     print FH "corpdecompsn: $dLname\n";
                     print FH "\n";

                     print LOGZ $uid . " last name " . $sn . " decompsn is " . $dLname, "\n";
                 }
             }

         }
         # If last name is ASCII but first name is non-ASCII
         elsif(($sntest eq '1')&&($gntest eq '0')) {
             my $dFname = cust_unac($c, $givenname);
             # Does result match existing? If not, make change.
             if(lc($dFname) ne lc($corpdecompgn // '')) {
                 if($dFname =~ /.+/) {
                     print FH "dn: $dn\n";
                     print FH "changetype: modify\n";
                     print FH "replace: corpdecompgn\n";
                     print FH "corpdecompgn: $dFname\n";
                     print FH "\n";

                     print LOGZ $uid . " first name " . $givenname . " decompgn is " . $dFname, "\n";
                 }
             }
 
         }
      } # if some non-ASCII
    } # while search

    $mesg->code and last;
    my ( $resp ) = $mesg->control ( LDAP_CONTROL_PAGED ) or last;
    $cookie = $resp->cookie or last;
    $page->cookie( $cookie );

 } # while paging

 if ($cookie) {
    $page->cookie($cookie);
    $page->size(0);
    $ldap->search( @args );
 }

 close FH;
 $ldap->unbind();

}

sub cust_unac {
 # Unaccent characters in names, according to special rules
 # for regions.
 my $loc = $_[0];
 my $name = $_[1];

 # Make exceptions for German-speaking countries
 if(defined $loc && $loc =~ /^(?:DE|AT|CH)$/i) {
    for($name) {
       s/Ü/UE/g;
       s/ü/ue/g;
       s/Ö/OE/g;
       s/ö/oe/g;
       s/Ä/AE/g;
       s/ä/ae/g;
       s/ß/ss/g;
    }
  }
  my $dname = unac_string("utf-8", $name);

  return($dname);

}

sub update_ldap {

    my $time = localtime();

	if (-s $changefile) {
    	print LOGZ "$time\tUpdating directory from change file\n";
    	system("$BIN/ldapmodify -X -h $dirHost -p $dirPort -D \"$dirUsr\" -w $dirPass -c -f $changefile >>$errFile 2>&1");
	}
	else {
           print LOGZ "$time\tNo changes today!\n";
	}
		

}

sub clean_up {
    
    my $time = localtime();
    print LOGZ "$time\tCleaning up files\n";

    # Define timestamp variables
    my($second,$minute,$hour,$day,$month,$year) = (localtime)[0,1,2,3,4,5];
    my $timestamp = sprintf("%04d%02d%02d%02d%02d%02d", $year + 1900, $month + 1,  $day, $hour, $minute, $second);

    if(-s $changefile) {
    	move("$changefile","$HOME/data/import/archive/$changefileName.$timestamp") or die $!; 
    }
    else {
        unlink($changefile);
    }
}


__END__;



About phil

My name is Phil Lembo. In my day job I’m an enterprise IT architect for a leading distribution and services company. The rest of my time I try to maintain a semi-normal family life in the suburbs of Raleigh, NC. E-mail me at philipATlembobrothersDOTcom. The opinions expressed here are entirely my own and not those of my employers, past, present or future (except where I quote others, who will need to accept responsibility for their own rants).