Phone numbers, regular expressions …

Hi,

I need to build a scraper that extracts contact details from a given web page, if any exist. That is, I pass a URL to the application, and it must scrape the page and return all matching records.

I have already completed emails, links, and contact forms.

But one important item remains: phone / fax numbers.

I think this is possible with preg_match_all, but I do not understand how to write a regular expression for it.

As far as I know, the USA phone number format is as follows. Please correct me if I am wrong.

xxx xxx xxxx or xxx-xxx-xxxx, where the first digit should not be 1; it should be 2 to 9. As an international number: 1-xxx-xxx-xxxx or 1 xxx xxx xxxx. I know some people may use brackets and other characters, but I think most people write phone numbers on web pages in one of these formats.

So, assuming the phone numbers on a web page are given in any of the above formats, is it possible to extract them and store them in an array using preg_match_all or some other function?

If so, could someone please show me how to encode these phone number formats as regular expressions?

Since there are several possible formats, several attempts (one per format) may be needed. Maybe someone has a genius regular expression that catches them all.

Thanks and Best Regards

4 Comments

@Charles, Mar 15, 2009: US phone numbers are more complicated than that, and there is no [i]correct[/i] format. At the base you have a three-digit exchange followed by four digits, typically written as xxx-xxxx. The last four digits are numeric, but the first two digits of the exchange can be alphabetic. For instance, a number might be VAlley-3-1234 but noted as VA3-1234. The letters and numbers are related, so you might also write that as 823-1234, or you could, as many do, decide that the numbers spell out a catchy word or phrase and use that.

Often we now prefix a three-digit area code. Until very recently you only needed the area code if you were calling from a different one, so a number that included it was traditionally rendered (567)VA3-1234. But now that the area code is becoming universally required, we are also seeing 567-823-1234.

To get outside of your zone, which is smaller than your area code, you have to first dial a 1. That's to let you know that you're going to pay extra for the call. So numbers that are always outside of the zone are typically noted with that 1: 1-900-WANT-SEX.
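The letter-to-digit relationship described above follows the standard telephone keypad (2=ABC, 3=DEF, and so on). As a minimal sketch, assuming a scraper wants to normalize lettered numbers before matching them, the conversion might look like this. The function name `lettersToDigits` is hypothetical, not something from this thread.

```php
<?php
// Hypothetical helper: replace keypad letters in a phone number with digits.
function lettersToDigits(string $number): string
{
    // Standard keypad: 2=ABC, 3=DEF, 4=GHI, 5=JKL, 6=MNO, 7=PQRS, 8=TUV, 9=WXYZ.
    // strtr() maps each character of the second argument to the character
    // at the same position in the third; everything else is left untouched.
    return strtr(
        strtoupper($number),
        'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
        '22233344455566677778889999'
    );
}

echo lettersToDigits('VA3-1234'), "\n";       // 823-1234
echo lettersToDigits('(567)VA3-1234'), "\n";  // (567)823-1234
```

Note that this converts every letter it sees, so it should only be applied to a string already believed to be a phone number, not to a whole page.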
@GUIR (author), Mar 15, 2009: :eek: It seems this needs AI ....

Thanks for your informative reply.
@Stevish, Mar 15, 2009: Having letters in phone numbers is rare... and even in the case of a clever phrase (555-COOL), they would generally include the number-only version too (555-2665).

Now, assuming that you only want numbers in 10-digit form (because if you only got 7 digits, you'd have no way of knowing what the area code was, and the number would be useless), a regular expression should do the trick. Here's what I came up with... It should match most US phone numbers displayed on the web, though it will err on the conservative side (it might miss a phone number now and then, but it shouldn't return anything that's not a phone number):

[CODE]/\(?\s?[2-9][0-9]{2}[(\s?\)?\s?)\-\.]{1,3}[0-9]{3}[\s\-\.]{1,3}[0-9]{4}/[/CODE]

Now I'll break it down, in case you want to make changes:

The entire expression is surrounded by delimiters, which in this case is the / character. That's just to denote the beginning and end of the expression. The next 3 characters are \(? where \ is the escape character (meaning interpret the next character "(" literally). The question mark means that the "(" character can be there 0 or 1 times. Next is the same thing with \s, which stands for a whitespace character (space, tab or line break). So far, we are looking for "(", " ", "( " or "". This is in case they used parentheses around the area code. Next we have [2-9][0-9]{2}. Each set of square brackets represents a set of characters, any one of which can match. The curly brackets tell us how many times that set of characters will repeat. So this means that there will be exactly one digit between 2 and 9, followed by exactly two digits between 0 and 9 (any digit).

Next we have a more complex set of brackets: [(\s?\)?\s?)\-\.]{1,3} Inside square brackets, parentheses and question marks lose their special meaning and are taken literally, so this is not a "substring" but a plain set that matches any one of the following characters: "(", ")", "?", a whitespace character, "-" or "." (Several of them are listed more than once, which changes nothing.) The {1,3} lets 1, 2 or 3 of these characters appear in a row, which covers separators such as ") ", ")- " and " - " and so accommodates phone numbers with dashes and spaces like 719 - 555 - 1234. Some people show phone numbers with periods instead of dashes or spaces, so the period (denoted by "\.") is included here.

The last part, [0-9]{3}[\s\-\.]{1,3}[0-9]{4}, is the most straightforward. We have exactly 3 of any digit, then 1 to 3 whitespace, dash or period characters, followed by exactly 4 of any digit.
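To tie the breakdown back to the original question (collecting the numbers into an array with preg_match_all), here is a minimal sketch. It assumes the pattern above with its escaping backslashes in place; the sample `$html` string and the variable names are made up for illustration, and a real scraper would pass in the fetched page instead.

```php
<?php
// The pattern from this post, written in single quotes so PHP keeps the backslashes.
$pattern = '/\(?\s?[2-9][0-9]{2}[(\s?\)?\s?)\-\.]{1,3}[0-9]{3}[\s\-\.]{1,3}[0-9]{4}/';

// Hypothetical page fragment standing in for the scraped HTML.
$html = '<p>Call (719) 555-1234 or fax 719.555.1235. Order #7195551234.</p>';

// Remove tags so only the visible text is searched.
$text = strip_tags($html);

// preg_match_all() returns the number of matches; $matches[0] holds the full matches.
$count = preg_match_all($pattern, $text, $matches);

// The optional \s? at the start can pick up a leading space, so trim each hit.
$phones = array_map('trim', $matches[0]);

print_r($phones);
```

With this sample, `$phones` ends up holding "(719) 555-1234" and "719.555.1235"; the unseparated order number is skipped because the pattern requires a separator between the digit groups.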

And for your info, I tested this expression with the following numbers:

(719) 555-1234 MATCHED

719 555-1234 MATCHED

719-555-1234 MATCHED

719.555.1234 MATCHED

(719) 555 - 1234 MATCHED

(719)- 555-1234 MATCHED

(719).555.1234 MATCHED

719 555 1234 MATCHED

019 555 1234 NO MATCH

1-719-555-1234 (only matched 719-555-1234)

7195551234 NO MATCH (Otherwise any 10 or more digit number would match).
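The list above can be re-checked with a short script. This is only a sketch of how such a test might be run, assuming the pattern with its backslashes in place; the `$samples` and `$results` names are illustrative.

```php
<?php
$pattern = '/\(?\s?[2-9][0-9]{2}[(\s?\)?\s?)\-\.]{1,3}[0-9]{3}[\s\-\.]{1,3}[0-9]{4}/';

// The sample numbers listed in this post.
$samples = [
    '(719) 555-1234', '719 555-1234', '719-555-1234', '719.555.1234',
    '(719) 555 - 1234', '(719)- 555-1234', '(719).555.1234', '719 555 1234',
    '019 555 1234', '1-719-555-1234', '7195551234',
];

$results = [];
foreach ($samples as $sample) {
    if (preg_match($pattern, $sample, $m)) {
        // Report a partial hit when the match is shorter than the sample.
        $found = trim($m[0]);
        $results[$sample] = ($found === $sample) ? 'MATCHED' : "only matched $found";
    } else {
        $results[$sample] = 'NO MATCH';
    }
    echo $sample, ' => ', $results[$sample], "\n";
}
```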

I hope that's what you were looking for. Let me know if you have any trouble.
@GUIR (author), Mar 16, 2009: Hi,

This is great. Thanks very much for your support. It picks up all the phone numbers printed in standard/common formats...

Best Regards