
How to Extract the Domain from a URL?

PHP folks,

Q1.
Do you know a regex to extract the domain name from a URL that a user enters in a form field?

““
<form method="GET" name="submit_link" id="submit_link" action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>">
<label for="url">Url</label>
<input type="url" name="webpage_address" id="url" placeholder="Type your webpage address here …" required>

</form>
““

When users submit the URL of the page they want my web crawler to crawl, I need to extract the domain name from the submitted URL. The regex must work for all forms of URLs.
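
Before any domain extraction, the submitted value probably needs a basic sanity check. The following is a minimal sketch, not a definitive implementation: `hostFromSubmittedUrl()` is a hypothetical helper name, and it assumes only `http` and `https` URLs are acceptable to the crawler. It uses PHP's built-in `filter_var()` and `parse_url()` rather than a hand-written regex.

```php
<?php
// Hypothetical helper: returns the lowercased host of a submitted URL,
// or null when the input is not a usable http(s) URL.
function hostFromSubmittedUrl(string $input): ?string
{
    $input = trim($input);

    // FILTER_VALIDATE_URL rejects strings without a scheme, e.g. "example.com/page".
    if (filter_var($input, FILTER_VALIDATE_URL) === false) {
        return null;
    }

    // Assumption: the crawler only follows http and https links.
    $scheme = strtolower((string) parse_url($input, PHP_URL_SCHEME));
    if ($scheme !== 'http' && $scheme !== 'https') {
        return null;
    }

    $host = parse_url($input, PHP_URL_HOST);
    return is_string($host) && $host !== '' ? strtolower($host) : null;
}
```

With the form above, the call would look something like `hostFromSubmittedUrl($_GET['webpage_address'] ?? '')`. Note that this returns the full host, including any subdomain such as `www`.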

Q2.
And in what URL format do download links come?
Imagine a crooked user feeds my crawler the URL of a virus download page. I don't want our crawler working on such a page. Meaning, if the submitted URL is a download link, the crawler should ignore it.
You know how you sometimes find links in Google search results that, when clicked, do not take you to any webpage. Instead, a file is downloaded automatically to your hard drive, or your browser prompts you to choose where to save the file.
As soon as my crawler detects that the user submitted a download link, I want the crawler to abort the crawl. I must learn the download link format in order to teach the crawler not to crawl links/URLs that appear in that particular format.
I guess I must achieve this with regex. What do you say?
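
As far as I know, download links have no fixed URL format: the same-looking URL can return either a page or a file, and the server's response headers decide which. A minimal sketch of checking those headers instead of the URL text follows. `looksLikeDownload()` is a hypothetical helper, and the list of crawlable content types is an assumption about what this crawler wants.

```php
<?php
// Hypothetical helper: decides from raw response header lines whether a URL
// serves a file download rather than a crawlable HTML page.
function looksLikeDownload(array $headerLines): bool
{
    $contentType = null;
    foreach ($headerLines as $line) {
        if (!is_string($line) || strpos($line, ':') === false) {
            continue; // skip the status line, e.g. "HTTP/1.1 200 OK"
        }
        [$name, $value] = explode(':', $line, 2);
        $name  = strtolower(trim($name));
        $value = strtolower(trim($value));

        // "Content-Disposition: attachment" asks the browser to save the body as a file.
        if ($name === 'content-disposition' && strpos($value, 'attachment') === 0) {
            return true;
        }
        if ($name === 'content-type') {
            // Drop parameters such as "; charset=utf-8".
            $contentType = trim(explode(';', $value)[0]);
        }
    }

    // Assumption: only HTML is worth crawling; any other declared type is skipped.
    // A response with no Content-Type header is not treated as a download here.
    $crawlable = ['text/html', 'application/xhtml+xml'];
    return $contentType !== null && !in_array($contentType, $crawlable, true);
}
```

The header lines could come from PHP's `get_headers($url)`, which returns them in this shape. A server can still lie in its headers, so this is a filter and not a virus check.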

2 Comment(s)

@developer_web (author), Sep 09, 2020:

$url = 'http://dave:[email protected]:9090/products?keywords=cell phone#samsung';

parse_url($url);

echo "scheme:"; echo $scheme = parse_url($url, PHP_URL_SCHEME); echo "<br>";

echo "user:"; echo $user = parse_url($url, PHP_URL_USER); echo "<br>";

echo "pass:"; echo $pass = parse_url($url, PHP_URL_PASS); echo "<br>";

echo "host:"; echo $host = parse_url($url, PHP_URL_HOST); echo "<br>";

echo "port:"; echo $port = parse_url($url, PHP_URL_PORT); echo "<br>";

echo "url_path:"; echo $url_path = parse_url($url, PHP_URL_PATH); echo "<br>";

echo "url_query:"; echo $url_query = parse_url($url, PHP_URL_QUERY); echo "<br>";

echo "url_fragment:"; echo $url_fragment = parse_url($url, PHP_URL_FRAGMENT); echo "<br>";

QUESTION: Why do I not like using the above?

ANSWER: I found the answer here:

https://stackoverflow.com/questions/276516/parsing-domain-from-a-url

This is the answer they provided:

$url = 'http://www.google.com/dhasjkdas/sadsdds/sdda/sdads.html';

$parse = parse_url($url);

echo $parse['host']; // prints 'www.google.com'

echo parse_url('https://subdomain.example.com/foo/bar', PHP_URL_HOST);

// Output: subdomain.example.com

echo parse_url('https://subdomain.example.co.uk/foo/bar', PHP_URL_HOST);

// Output: subdomain.example.co.uk

Please consider replacing the accepted solution with the following:

parse_url() will always include any sub-domain(s), so this function doesn't parse domain names very well. Here are some examples:


$url = 'http://www.google.com/dhasjkdas/sadsdds/sdda/sdads.html';
$parse = parse_url($url);
echo $parse['host']; // prints 'www.google.com'

echo parse_url('https://subdomain.example.com/foo/bar', PHP_URL_HOST);
// Output: subdomain.example.com

echo parse_url('https://subdomain.example.co.uk/foo/bar', PHP_URL_HOST);
// Output: subdomain.example.co.uk

Instead, you may consider this pragmatic solution. It will cover many, but not all domain names -- for instance, lower-level domains such as 'sos.state.oh.us' are not covered.

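function getDomain($url) {
    // Extract the host first; parse_url() gives no usable host for malformed input
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null;
    }

    if (filter_var($host, FILTER_VALIDATE_IP)) {
        // IP address returned as domain
        return $host; //* or replace with null if you don't want an IP back
    }

    // Strip only a leading 'www.' label, not 'www.' anywhere in the host
    $domain_array = explode('.', preg_replace('/^www\./i', '', $host));
    $count = count($domain_array);
    if ($count >= 3 && strlen($domain_array[$count - 2]) == 2) {
        // SLD (example.co.uk)
        return implode('.', array_splice($domain_array, $count - 3, 3));
    } else if ($count >= 2) {
        // TLD (example.com)
        return implode('.', array_splice($domain_array, $count - 2, 2));
    }
    return null; // single-label host such as 'localhost'
}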

// Your domains
echo getDomain('http://google.com/dhasjkdas/sadsdds/sdda/sdads.html'); // google.com
echo getDomain('http://www.google.com/dhasjkdas/sadsdds/sdda/sdads.html'); // google.com
echo getDomain('http://google.co.uk/dhasjkdas/sadsdds/sdda/sdads.html'); // google.co.uk

// TLD
echo getDomain('https://shop.example.com'); // example.com
echo getDomain('https://foo.bar.example.com'); // example.com
echo getDomain('https://www.example.com'); // example.com
echo getDomain('https://example.com'); // example.com

// SLD
echo getDomain('https://more.news.bbc.co.uk'); // bbc.co.uk
echo getDomain('https://www.bbc.co.uk'); // bbc.co.uk
echo getDomain('https://bbc.co.uk'); // bbc.co.uk

// IP
echo getDomain('https://1.2.3.45'); // 1.2.3.45

QA. Should I stick to this latest code you see just above or not?
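
For hosts that the length-based heuristic misses, such as 'sos.state.oh.us', one alternative is to look the host up against a list of known public suffixes. The sketch below is only an illustration: `registrableDomain()` is a hypothetical helper, and `$sampleSuffixes` is a tiny made-up sample, not taken from any real suffix list, which would have thousands of entries.

```php
<?php
// Hypothetical helper: returns the suffix plus one label to its left,
// using a caller-supplied list of public suffixes.
function registrableDomain(string $host, array $suffixes): ?string
{
    $host = strtolower(rtrim($host, '.'));
    if (filter_var($host, FILTER_VALIDATE_IP)) {
        return $host; // IP addresses have no domain to extract
    }
    if (in_array($host, $suffixes, true)) {
        return null; // the host is itself a suffix, e.g. 'co.uk'
    }

    $labels = explode('.', $host);
    $count  = count($labels);

    // Try the longest candidate suffix first so 'co.uk' wins over 'uk'.
    for ($i = 1; $i < $count; $i++) {
        $suffix = implode('.', array_slice($labels, $i));
        if (in_array($suffix, $suffixes, true)) {
            return implode('.', array_slice($labels, $i - 1));
        }
    }
    return null; // suffix not in the list, or a single-label host
}

// Assumption: illustrative sample only.
$sampleSuffixes = ['com', 'uk', 'co.uk', 'us', 'oh.us', 'state.oh.us'];

echo registrableDomain('www.one.co.uk', $sampleSuffixes), "\n";
echo registrableDomain('sos.state.oh.us', $sampleSuffixes), "\n";
```

The trade-off is that the result is only as good as the suffix list passed in, so the list has to be kept up to date.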
@developer_web (author), Sep 09, 2020: QB.

Remember, from these:

http://www.one.com

http://www.one.co.uk

These must be extracted as domains:

one.com

one.co.uk

So, which of the 11 code snippets below are perfect?

Which of these do you prefer as suitable for extracting the domain name from a URL?

2.

$domain = parse_url('http://' . str_replace(array('https://', 'http://'), '', $url), PHP_URL_HOST);


3.

$tmp = explode("/", $url);
$domain = $tmp[2];


4.

$tmp = parse_url($url);
$url = $tmp['host'];


5.

if (preg_match('/https?:\/\/([^\/]+)\//i', $target_string, $matches)) {
$domain = $matches[1];
}



6.

$regexp = '/.*\/\/([^\/:]+).*/';

// www.stackoverflow.com
echo preg_replace($regexp, '$1', 'http://www.stackoverflow.com/questions/ask');

// google.de
echo preg_replace($regexp, '$1', 'http://google.de/?q=hello');


7.

http://([^/]+).*


8.

if (preg_match('/http:\/\/([^\/]+)\//i', $target_string, $matches)) {
$domain = $matches[1];
}


9.

preg_match('/(http(|s)):\/\/(.*?)\//si', 'http://www.example.com/page/?bla=123#!@#$%^&*()_+', $output);
// $output[0] ------------> http://www.example.com/



/*

Codes From: https://www.sitepoint.com/community/t/extract-domain-name-from-host-name/4355

*/


10.

$string = "http://abc.acb.php.net/";
preg_match('@^(?:http://)?([^/]+)@i', $string, $matches);
$host = $matches[1];
preg_match('/[^.]+\.[^.]+$/', $host, $matches);
echo "domain name is: " . $matches[0] . "\n";



11.

$rawurl = "http://asdf.sadf.abc.com/adsfkjl/adfs/adfs.html";
$url = parse_url($rawurl);
echo $url['host'];



12.

$rawurl = "http://asdf.sadf.abc.com/adsfkjl/adfs/adfs.html";
$url = parse_url($rawurl);

$domain = preg_replace('#^(?:.+?\.)+(.+?\.(?:co\.uk|com|net))#', '$1', $url['host']);

echo $domain;

/*
If you need to add support for more TLDs then add them after net in the last set of brackets:
"(?:co\.uk|com|net|org)"

And remember to escape the dot if you’re adding 2nd level domain:

"(?:co\.uk|com|net|org|ac\.uk)"
*/


So, which of the 11 code snippets above are perfect?

NOTE:

Remember, from these:

http://www.one.com

http://www.one.co.uk

These must be extracted as domains (2nd level domain and top level domain):

one.com

one.co.uk

And not get extracted like these:

**www.one.com** (thumbs down for extracting the 'www' subdomain)

**co.uk** (thumbs down for failing to extract the valid domain name 'one')
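
One way to compare candidates against the two required results is a small harness like the sketch below. It wraps three of the approaches in closures: the plain `parse_url()` host (as in 4 and 11), the last-two-labels regex from 10 applied to the `parse_url()` host (a simplification of that snippet), and the suffix regex from 12. The labels and the `$results` array are just names chosen for this illustration.

```php
<?php
// Candidate extractors, keyed by a descriptive label.
$candidates = [
    'parse_url host (4, 11)' => function (string $url): string {
        return (string) parse_url($url, PHP_URL_HOST);
    },
    'last two labels (10)' => function (string $url): string {
        $host = (string) parse_url($url, PHP_URL_HOST);
        return preg_match('/[^.]+\.[^.]+$/', $host, $m) ? $m[0] : '';
    },
    'suffix regex (12)' => function (string $url): string {
        $host = (string) parse_url($url, PHP_URL_HOST);
        return (string) preg_replace('#^(?:.+?\.)+(.+?\.(?:co\.uk|com|net))#', '$1', $host);
    },
];

// The two required extractions from the note above.
$expected = [
    'http://www.one.com'   => 'one.com',
    'http://www.one.co.uk' => 'one.co.uk',
];

$results = [];
foreach ($candidates as $name => $extract) {
    foreach ($expected as $url => $want) {
        $got = $extract($url);
        $results[$name][$url] = $got;
        printf("%-24s %-22s %-14s %s\n", $name, $url, $got, $got === $want ? 'ok' : 'FAIL');
    }
}
```

Of these three, only the suffix regex passes both cases, and only because 'com' and 'co.uk' are in its hard-coded list; any suffix missing from that list leaves the host unchanged.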