
Let’s say two people each have their own list of client email addresses. They want to know how many email addresses they have in common, but they do not want to share the addresses themselves with each other, nor with a third party. How can they compare their lists?
The simplest way is to normalize the email addresses to lower (or upper) case, encrypt them with the same algorithm (a one-way hash, for example), and compare the encrypted values. Normalization is required because email addresses are, in practice, treated as case insensitive. Once the texts are encrypted, they can be compared without revealing the originals.
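The normalize, encrypt, and compare steps can be sketched in a few lines of Python. This is only a minimal illustration: it assumes SHA-256 from the standard `hashlib` module as the shared algorithm (a one-way hash rather than reversible encryption), and the function names and sample addresses are made up for the example.

```python
import hashlib


def normalize_email(address: str) -> str:
    # Trim surrounding whitespace and lower-case the whole address.
    return address.strip().lower()


def fingerprint(address: str) -> str:
    # Assumption: SHA-256 stands in for "the same algorithm" both sides agree on.
    return hashlib.sha256(normalize_email(address).encode("utf-8")).hexdigest()


def count_common(my_digests, their_digests) -> int:
    # Only digests are exchanged and compared, never the raw addresses.
    return len(set(my_digests) & set(their_digests))


# Hypothetical sample data for the two people.
first_list = ["Ann@Example.com", "bob@example.com", "carol@example.com"]
second_list = ["ANN@example.com", "dave@example.com", "Carol@Example.COM "]

first_digests = [fingerprint(a) for a in first_list]
second_digests = [fingerprint(a) for a in second_list]
common = count_common(first_digests, second_digests)
```

Each person would run `fingerprint` on their own list and hand over only the resulting digests.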
What if the two people in the previous example have phone numbers instead of email addresses? The basics are the same: normalize the phone numbers, encrypt the normalized numbers, and compare the encrypted values. The problem is that normalizing a phone number is not as easy as you might expect. Each country has its own rules for phone numbers, and from the digits alone you might not be able to tell which country a number belongs to. Although country calling codes exist, real data might not contain them. And even if we know which country each number belongs to, we still have to understand each country’s phone number rules to normalize the numbers properly.
So let’s narrow the problem down and focus on US phone numbers. To be honest, ‘US phone number’ is not quite the correct term. There is the North American Numbering Plan, or NANP, a telephone numbering plan for twenty-five regions in twenty countries, primarily in North America and the Caribbean. Naturally, phone numbers in the US follow it.
According to NANP, a phone number has the format ‘+1 NXX NXX-XXXX’, where N is a digit from 2 to 9 and X is a digit from 0 to 9. The rule is quite straightforward, so you might think such numbers don’t need to be normalized. But real-world data always surprises me. Below are some examples I’ve encountered. The actual digits have been masked with #.
- ### ###-#### : The +1 is omitted, which is common.
- (###) ### #### : Spaces, underscores, or parentheses are used instead of hyphens.
- 1 ### ###-#### : The leading plus sign is omitted.
- 001-###-###-#### : 001 is used instead of +1. 01 is also found.
- 1 1 ### ### #### : The leading +1 or 1 is duplicated.
- +1 (aaa) aaa ###-#### : The area code (the three digits right after the +1) is duplicated.
How can we normalize these?
A careful look at the above patterns makes it clear that characters other than digits can be ignored. Spaces, hyphens, underscores, and parentheses are nothing more than delimiters for human eyes. If we remove them, the following patterns remain:
- ##########
- 1##########
- 01##########
- 001##########
- 11##########
- 1aaaaaa#######
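Stripping the delimiters is a one-line regular expression in Python. In the sketch below, `strip_delimiters` is a hypothetical helper name, and the sample numbers are made-up stand-ins for the masked examples above.

```python
import re


def strip_delimiters(raw: str) -> str:
    # Drop everything that is not an ASCII digit: spaces, hyphens,
    # underscores, parentheses, and the plus sign.
    return re.sub(r"[^0-9]", "", raw)


# Made-up numbers, one per pattern in the list of examples.
samples = [
    "212 555-0123",
    "(212) 555 0123",
    "1 212 555-0123",
    "001-212-555-0123",
    "1 1 212 555 0123",
    "+1 (212) 212 555-0123",
]
stripped = [strip_delimiters(s) for s in samples]
```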
This makes the problem easier. By counting the digits in a phone number and inspecting its first few digits, we can tell which format it was written in. I’ve implemented this in Python and uploaded it as a Gist:
https://gist.github.com/iizs/61652564293d1a01cbe0d823237c6665
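To give a feel for the approach without opening the link, here is a minimal sketch of the same idea. It is not the code from the gist: `normalize_nanp` is a hypothetical name, the choice to return `None` for anything unrecognized is an assumption, and it only handles the six patterns listed above.

```python
import re
from typing import Optional

# NXX NXX-XXXX with delimiters removed: N is 2-9, X is 0-9.
NANP_NATIONAL = re.compile(r"[2-9][0-9]{2}[2-9][0-9]{6}")


def normalize_nanp(raw: str) -> Optional[str]:
    """Return '+1' followed by ten digits, or None if the input is not recognized."""
    digits = re.sub(r"[^0-9]", "", raw)  # keep digits only
    n = len(digits)

    if n == 10:                                        # ##########
        national = digits
    elif n == 11 and digits.startswith("1"):           # 1##########
        national = digits[1:]
    elif n == 12 and digits.startswith(("01", "11")):  # 01... or 11...
        national = digits[2:]
    elif n == 13 and digits.startswith("001"):         # 001##########
        national = digits[3:]
    elif n == 14 and digits.startswith("1") and digits[1:4] == digits[4:7]:
        national = digits[4:]                          # 1aaaaaa#######
    else:
        return None

    # Reject numbers that break the NXX NXX-XXXX rule.
    if NANP_NATIONAL.fullmatch(national) is None:
        return None
    return "+1" + national
```

With made-up numbers, `normalize_nanp("001-212-555-0123")` and `normalize_nanp("(212) 555 0123")` both give `"+12125550123"`, so their encrypted forms match as well.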