How To Extract Domain From A Url ?

@developer_web Sep 09.2020

PHP folks,

Q1.
Do you know a regex that extracts the domain name from a URL that a user types into a form field?

```
<form method="GET" name="submit_link" id="submit_link" action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>">
  <label for="url">Url</label>
  <input type="url" name="webpage_address" id="url" placeholder="Type your webpage address here …" required>
</form>
```

When users submit the URL of a page they want my web crawler to crawl, I need to extract the domain name from the submitted URL. The regex must work for all forms of URLs.
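Before any regex is involved, the submitted value can be validated and its host read with PHP's built-in `filter_var()` and `parse_url()`. The sketch below is only one possible approach; the function name `hostFromSubmittedUrl` and the choice to default to `http://` when the scheme is missing are assumptions, not something from the thread. It returns the full host (subdomains included), not yet the bare domain.

```php
<?php
// Hypothetical helper: returns the lowercased host of a user-submitted URL,
// or null when the input is not a usable http(s) URL.
function hostFromSubmittedUrl(string $input): ?string
{
    $input = trim($input);

    // Assumption: users often omit the scheme, so default to http://.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $input)) {
        $input = 'http://' . $input;
    }

    if (filter_var($input, FILTER_VALIDATE_URL) === false) {
        return null;
    }

    // Only crawl web URLs; reject ftp://, file:// and similar schemes.
    $scheme = strtolower((string) parse_url($input, PHP_URL_SCHEME));
    if (!in_array($scheme, ['http', 'https'], true)) {
        return null;
    }

    $host = parse_url($input, PHP_URL_HOST);
    return is_string($host) && $host !== '' ? strtolower($host) : null;
}

// Usage with the form field from the question:
// $host = hostFromSubmittedUrl($_GET['webpage_address'] ?? '');
```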

Q2.
And what URL format do download links come in?
Imagine a malicious user feeds my crawler the URL of a virus download page. I don't want the crawler working on such a page: if the submitted URL is a download link, the crawler should ignore it.
You know the kind. Sometimes you click a link in Google search results and the browser does not take you to a webpage. Instead, a file downloads automatically to your hard drive, or the browser prompts you to choose where to save it.
As soon as my crawler detects that the user submitted a download link, I want it to abort the crawl. I need to learn the download link format so I can teach the crawler not to crawl URLs that appear in that format.
I guess I must achieve this with regex. What do you say?
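One caveat on Q2: a download usually cannot be recognised from the URL text alone, because the server signals it in the response headers (`Content-Disposition: attachment`, or a `Content-Type` that is not HTML). A rough sketch of a header check follows, assuming the crawler first fetches the headers (for example with `get_headers($url, true)`). The function name `shouldSkipAsDownload` and the policy of skipping everything that does not declare HTML are assumptions for illustration.

```php
<?php
// Hypothetical helper: decides from response headers whether a crawler should
// skip a URL. $headers maps header names to values, e.g. the array returned
// by get_headers($url, true).
function shouldSkipAsDownload(array $headers): bool
{
    // Header names are case-insensitive, so normalise the keys.
    $headers = array_change_key_case($headers, CASE_LOWER);

    // After redirects a header can appear several times; use the last value.
    $last = function ($value): string {
        return is_array($value) ? (string) end($value) : (string) $value;
    };

    // An explicit "attachment" disposition means the browser would save a file.
    $disposition = $last($headers['content-disposition'] ?? '');
    if (stripos($disposition, 'attachment') !== false) {
        return true;
    }

    // Assumed policy: skip anything that does not declare an HTML media type,
    // including responses with no Content-Type at all.
    $type = strtolower(trim(explode(';', $last($headers['content-type'] ?? ''))[0]));
    return !in_array($type, ['text/html', 'application/xhtml+xml'], true);
}
```

A check like this runs after the request is made, so it complements URL validation but does not replace it.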

PHP

@developer_web (author) Sep 09.2020

$url = 'http://dave:[email protected]:9090/products?keywords=cell phone#samsung';

parse_url($url);

echo "scheme:"; echo $scheme = parse_url($url, PHP_URL_SCHEME); echo "<br>";

echo "user:"; echo $user = parse_url($url, PHP_URL_USER); echo "<br>";

echo "pass:"; echo $pass = parse_url($url, PHP_URL_PASS); echo "<br>";

echo "host:"; echo $host = parse_url($url, PHP_URL_HOST); echo "<br>";

echo "port:"; echo $port = parse_url($url, PHP_URL_PORT); echo "<br>";

echo "url_path:"; echo $url_path = parse_url($url, PHP_URL_PATH); echo "<br>";

echo "url_query:"; echo $url_query = parse_url($url, PHP_URL_QUERY); echo "<br>";

echo "url_fragment:"; echo $url_fragment = parse_url($url, PHP_URL_FRAGMENT); echo "<br>";
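The same components can also be read with a single `parse_url()` call, which returns an associative array. A minimal sketch, using a made-up example URL (the user name, password and host below are placeholders, not values from the thread):

```php
<?php
// Hypothetical example URL, chosen only to exercise every component.
$url = 'http://user:secret@www.example.com:9090/products?keywords=phone#samsung';

// One call returns an associative array; components that are absent from the
// URL are simply missing keys.
$parts = parse_url($url);

foreach (['scheme', 'user', 'pass', 'host', 'port', 'path', 'query', 'fragment'] as $key) {
    // The null coalescing operator avoids notices for missing components.
    echo $key, ': ', $parts[$key] ?? '(none)', "\n";
}
```

Note that `port` comes back as an integer while the other components are strings.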

QUESTION: Why do I not like using the above?

ANSWER: I found the answer here:

https://stackoverflow.com/questions/276516/parsing-domain-from-a-url

This is the answer they provided:

$url = 'http://www.google.com/dhasjkdas/sadsdds/sdda/sdads.html';

$parse = parse_url($url);

echo $parse['host']; // prints 'www.google.com'

Please consider replacing the accepted solution with the following:

parse_url() will always include any sub-domain(s), so this function doesn't parse domain names very well. Here are some examples:


 $url = 'http://www.google.com/dhasjkdas/sadsdds/sdda/sdads.html';
 $parse = parse_url($url);
 echo $parse['host']; // prints 'www.google.com'
 
 echo parse_url('https://subdomain.example.com/foo/bar', PHP_URL_HOST);
 // Output: subdomain.example.com
 
 echo parse_url('https://subdomain.example.co.uk/foo/bar', PHP_URL_HOST);
 // Output: subdomain.example.co.uk
 
 Instead, you may consider this pragmatic solution. It will cover many, but not all domain names -- for instance, lower-level domains such as 'sos.state.oh.us' are not covered.
 
 function getDomain($url) {
     // Host part of the URL, subdomains included
     $host = parse_url($url, PHP_URL_HOST);

     if(filter_var($host,FILTER_VALIDATE_IP)) {
         // IP address returned as domain
         return $host; //* or replace with null if you don't want an IP back
     }

     $domain_array = explode(".", str_replace('www.', '', $host));
     $count = count($domain_array);
     if( $count>=3 && strlen($domain_array[$count-2])==2 ) {
         // SLD (example.co.uk)
         return implode('.', array_splice($domain_array, $count-3,3));
     } else if( $count>=2 ) {
         // TLD (example.com)
         return implode('.', array_splice($domain_array, $count-2,2));
     }
 }
 
 // Your domains
 echo getDomain('http://google.com/dhasjkdas/sadsdds/sdda/sdads.html'); // google.com
 echo getDomain('http://www.google.com/dhasjkdas/sadsdds/sdda/sdads.html'); // google.com
 echo getDomain('http://google.co.uk/dhasjkdas/sadsdds/sdda/sdads.html'); // google.co.uk
 
 // TLD
 echo getDomain('https://shop.example.com'); // example.com
 echo getDomain('https://foo.bar.example.com'); // example.com
 echo getDomain('https://www.example.com'); // example.com
 echo getDomain('https://example.com'); // example.com
 
 // SLD
 echo getDomain('https://more.news.bbc.co.uk'); // bbc.co.uk
 echo getDomain('https://www.bbc.co.uk'); // bbc.co.uk
 echo getDomain('https://bbc.co.uk'); // bbc.co.uk
 
 // IP
 echo getDomain('https://1.2.3.45');  // 1.2.3.45

QA. Should I stick to the code you see just above or not?

@developer_web (author) Sep 09.2020

QB.

Remember, from these:

http://www.one.com

http://www.one.co.uk

These must be extracted as domains:

one.com

one.co.uk

So, which of the 11 snippets below are perfect?

Which of these do you prefer as suitable for extracting the domain name from a URL?

2.

 $domain = parse_url('http://' . str_replace(array('https://', 'http://'), '', $url), PHP_URL_HOST);

3.

 $tmp = explode("/", $url);
 $domain = $tmp[2];

4.

 $tmp = parse_url($url);
 $url = $tmp['host'];

5.

 if (preg_match('/https?:\/\/([^\/]+)\//i', $target_string, $matches)) {
 $domain = $matches[1];
 }

6.

 $regexp = '/.*\/\/([^\/:]+).*/';
 
 // www.stackoverflow.com
 echo preg_replace($regexp, '$1', 'http://www.stackoverflow.com/questions/ask');
 
 // google.de
 echo preg_replace($regexp, '$1', 'http://google.de/?q=hello');

7.

 http://([^/]+).*

8.

 if (preg_match('/http:\/\/([^\/]+)\//i', $target_string, $matches)) {
 $domain = $matches[1];
 }

9.

 preg_match('/(http(|s)):\/\/(.*?)\//si',  'http://www.example.com/page/?bla=123#!@#$%^&*()_+', $output);
 // $output[0] ------------>  http://www.example.com/

/*

Codes From: https://www.sitepoint.com/community/t/extract-domain-name-from-host-name/4355

*/

10.


 $string = "http://abc.acb.php.net/";
 preg_match('@^(?:http://)?([^/]+)@i', $string, $matches);
 $host = $matches[1];
 preg_match('/[^.]+\.[^.]+$/', $host, $matches);
 echo "domain name is: " . $matches[0] . "\n";

11.


 $rawurl = "http://asdf.sadf.abc.com/adsfkjl/adfs/adfs.html";
 $url = parse_url($rawurl);
 echo $url['host'];

12.


 $rawurl = "http://asdf.sadf.abc.com/adsfkjl/adfs/adfs.html";
 $url = parse_url($rawurl);
 
 $domain = preg_replace('#^(?:.+?\.)+(.+?\.(?:co\.uk|com|net))#', '$1', $url['host']);
 
 echo $domain;
 
 /*
 If you need to add support for more TLDs then add them after net in the last set of brackets: 
 "(?:co\.uk|com|net|org)"
 
 And remember to escape the dot if you’re adding 2nd level domain:
 
 "(?:co\.uk|com|net|org|ac\.uk)"
 */

So, which of the 11 snippets above are perfect?

NOTE:

Remember, from these:

http://www.one.com

http://www.one.co.uk

These must be extracted as domains (2nd level domain and top level domain):

one.com

one.co.uk

And not get extracted like these:

**www.one.com** (thumbs down for extracting 'www' subdomain)

**co.uk** (thumbs down for failing to extract the valid domain name 'one')
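One way to meet both requirements (drop `www`, keep `one` in front of `co.uk`) is to match the host against a list of known suffixes, longest candidate first. The sketch below uses a tiny hard-coded list purely for illustration. The constant `KNOWN_SUFFIXES` and the function `registrableDomain` are made-up names, and a real crawler would need the complete Public Suffix List or a library built on it.

```php
<?php
// Hypothetical, deliberately tiny suffix list; a real crawler would load the
// full Public Suffix List instead of hard-coding entries.
const KNOWN_SUFFIXES = ['co.uk', 'ac.uk', 'org.uk', 'com', 'net', 'org', 'de', 'uk'];

// Returns "name.suffix" for a URL, the bare IP for IP hosts, or null when the
// host is missing, is itself a suffix, or ends in a suffix not in the list.
function registrableDomain(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null;
    }
    $host = strtolower(rtrim($host, '.'));

    if (filter_var($host, FILTER_VALIDATE_IP)) {
        return $host;
    }
    if (in_array($host, KNOWN_SUFFIXES, true)) {
        return null; // e.g. "co.uk" alone has no registrable name
    }

    $labels = explode('.', $host);
    $count = count($labels);

    // Walk from the longest candidate suffix to the shortest, so that
    // "co.uk" is matched before "uk".
    for ($i = 1; $i < $count; $i++) {
        $suffix = implode('.', array_slice($labels, $i));
        if (in_array($suffix, KNOWN_SUFFIXES, true)) {
            return $labels[$i - 1] . '.' . $suffix;
        }
    }
    return null;
}

echo registrableDomain('http://www.one.com'), "\n";   // one.com
echo registrableDomain('http://www.one.co.uk'), "\n"; // one.co.uk
```

Unlike the length-based heuristic, this approach does not guess from label lengths, but it is only as good as the suffix list behind it.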
