Cleaning up dirty data with perl

I considered putting this post on my general tech blog, onemoretech, but finally decided to put it here as something of great interest to other directory admins.

There’s this data feed I’ve been importing into my directory for almost 7 years. Over that period the import routine has had to be modified in one way or another a number of times, mostly to accomodate changing needs on the side of the ultimate consumers of the directory data.

Of late, however, a key pattern in the data has changed. Since it’s inception, the data source owner has insisted on storing generationqualifier information (e.g. JR, SR, III) in the surname field of each user record. Previously, this information was always preceeded by a comma, so that the last name field might read something like “SMITH, JR” or “SMITH, JR.” (there was no consistency with regard to the existence of the period). The presence of the comma in the field was a handy indicator that only required a simple “split” method to effectively separate the generationqualifier from the surname proper.

Now, unhappily, I have been seeing some last name values coming across with the generationqualifier separated by only a space, or two from the surname itself.

Dealing with this required a little thought and experimentation, but here is the final result.

 # Do split on Last Name
($sn, $generationqualifier) = split(/,/, $sn);
# Source is also entering gq values with just a space between the
# end of the surname and the gq, so we're going to have to regex
# match for this
if($sn =~ /sJR.??$|sSR.??$|sI.??$|sII.??$|sIII.??$/gi) {
      $sn =~ m/^(.*)s+(.*)$/i;
      $sn = $1;
      $generationqualifier =$2;
      print "$sn,$generationqualifiern";
# Just trim surplus whitespace in gq if any
for ($generationqualifier) {
for($sn) {