This is an improvement on an old post. After getting some particularly weird data I was forced to rework my original code and come up with something new.
Let me begin by admitting I don’t know exactly what was wrong with the data that came in. All I know is that it caused gedit to blow up and clearly contained some weird stuff that no telephone number should have.
Oh yeah, this is phone number data.
From Southern Europe, which in my world includes every country on the Mediterranean coast — and then some.
Here’s the regex I used to use:
The problem with this was that it allowed too many “non-telephone” characters, like colons (a red flag when you’re dealing with Unicode) and all those pesky Roman letters (you know, A-Z and a-z) that people like to annotate their phone number data with (for example “Fax: +33 0 00 …”, you get the idea).
Here’s the improved version I came up with after learning more about the ASCII Character Table:
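Basically, what this does is filter out every character that is not in the range hex 20 to hex 39 (“Space” to the numeral 9). So all the pluses (“+”) and any number of spaces, wherever people put them, are allowed, along with the actual digits. The range also covers the punctuation that sits between the space and the digits in the ASCII table, such as parentheses, hyphens, dots and slashes. The colon is hex 3A, one step past the 9, so it is out, and so are all the letters. None of the awful stuff that makes feed parsers and character translation tools (like iconv) choke gets through.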
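To make the range filter concrete, here is a rough Python sketch of the same idea. It is not the regex from this post, just one way to express "drop everything outside hex 20 to hex 39". The names `NON_PHONE_CHARS` and `clean_phone` are made up for illustration, and the final `strip()` is an extra assumption to tidy up spaces left behind at the edges.

```python
import re

# Matches any character outside hex 20 (space) through hex 39 (digit 9).
# Hypothetical name, chosen for this example only.
NON_PHONE_CHARS = re.compile(r"[^\x20-\x39]")


def clean_phone(raw: str) -> str:
    """Remove every character outside the hex 20 to hex 39 range."""
    # strip() is an added assumption: it trims spaces left at either end
    # once annotations such as "Fax:" have been removed.
    return NON_PHONE_CHARS.sub("", raw).strip()


print(clean_phone("Fax: +33 0 00"))  # prints "+33 0 00"
```

Because the character class is negated, the letters and the colon in "Fax:" are removed while the plus sign, the spaces and the digits survive. Tabs, non-breaking spaces and anything else outside ASCII are removed as well.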