expression for pulling domains out of URLs



scrpnsanctuary
Novice

Jun 8, 2009, 2:33 PM



Hi there, I would like some help with this.

I have an expression that is meant to grab the domain (xxx.com) out of a URL (http://www.google.com/?query=blah).

Requirements:
1) Must be able to work if there is a port number in the URL (http://www.amazon.com:8010/somepage.html)
2) Needs to get the domain (something.com) part (preferably into 1 variable, but if it's in 2 I can join them)
3) Needs to work if the URL has nothing between '://' and the domain (http://somesite.com/page.html)
4) Needs to work if the URL has a lot of subdomains (http://some.domain.some.other.domain.site.com/page.htm)

My current expression sometimes doesn't work, but I'm not sure if it is one of the above conditions or something else.


Here's what I'm using right now:

Code
if ($url =~ /^.*?\:\/\/(.*?\.)??([^\.]+?\.[^\.]+?)[\/\:].*/)
{
    my ($pre_domain) = $1;
    my ($domain) = $2;

    ### More stuff ###
}




Thank you very much for your help.
----------
The vastness of what we know is only surpassed by the vastness of what we don't.

(This post was edited by scrpnsanctuary on Jun 9, 2009, 11:18 AM)
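
For readers following along: the expression above requires a '/' or ':' after the domain, so a URL such as http://google.com (nothing after the host) never matches, and nothing stops the domain group from running into the path when the host contains no dot. One possible refinement is sketched below. It is not from the thread; the names $domain_re and domain_of are made up for illustration, and it assumes Perl 5.10 or later for the // operator. It still takes "the last two labels" as the domain, so hosts like bbc.co.uk, bare IP addresses and IPv6 literals are not handled.

```perl
use strict;
use warnings;

# Hypothetical pattern, written with /x so each part can be labelled.
# The comments avoid sigils because variables interpolate even inside
# /x comments.
my $domain_re = qr{
    ^ [^:/?\#]+ ://               # scheme and the separator
    (?: [^/?\#\@]* \@ )?          # optional user:pass prefix
    ( (?: [^./?\#:]+ \. )* )      # capture 1: leading subdomains, may be empty
    ( [^./?\#:]+ \. [^./?\#:]+ )  # capture 2: last two labels of the host
    (?: : \d* )?                  # optional port
    (?= [/?\#] | \z )             # host ends at path, query, fragment or end
}x;

# Hypothetical helper: returns the two-label domain, or undef if no match.
sub domain_of {
    my ($url) = @_;
    return $url =~ $domain_re ? $2 : undef;
}

for my $url (qw(
    http://www.google.com/?query=blah
    http://www.amazon.com:8010/somepage.html
    http://somesite.com/page.html
    http://some.domain.some.other.domain.site.com/page.htm
    http://google.com
)) {
    printf "%-55s %s\n", $url, domain_of($url) // '(no match)';
}
```

Inside the if block from the original post, $1 would then hold the leading subdomains (with a trailing dot, or empty) and $2 the domain.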


KevinR
Veteran


Jun 8, 2009, 2:55 PM


Re: [scrpnsanctuary] expression for pulling domains out of URLs

Maybe look at URI::Split and use it or just look in the source code to see how the module parses URIs.
-------------------------------------------------


scrpnsanctuary
Novice

Jun 8, 2009, 3:19 PM


Re: [KevinR] expression for pulling domains out of URLs

Ahh, thank you.

I guess I could use that module, but I kinda want to refine my expression just as a mental exercise. :)


KevinR
Veteran


Jun 9, 2009, 9:59 AM


Re: [scrpnsanctuary] expression for pulling domains out of URLs

If I get a chance later today I will take a closer look at your regexp and your requirements and see if I can make a recommendation.
-------------------------------------------------


scrpnsanctuary
Novice

Jun 9, 2009, 10:20 AM


Re: [KevinR] expression for pulling domains out of URLs

Oh cool, thank you very much.


scrpnsanctuary
Novice

Jun 9, 2009, 10:52 AM


Re: [KevinR] expression for pulling domains out of URLs

I looked up URI::Split and found how they do it:


Code
sub uri_split {
    return $_[0] =~ m,(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?,;
}


### Example:
my ($scheme, $auth, $path, $query, $frag) = uri_split($uri);


It's a bit more than I need, but wow. I can dissect most of it but I need to look at it longer to understand it all.

(This post was edited by scrpnsanctuary on Jun 9, 2009, 10:54 AM)
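
To make the five return values concrete, here is a small self-contained sketch that copies the sub quoted above and runs it on one of the example URLs, with a query and fragment added for illustration. The second regex, which peels the host and port out of $auth, is an assumption of this sketch and not part of URI::Split.

```perl
use strict;
use warnings;

# Copied from the post above (the pattern quoted from URI::Split).
sub uri_split {
    return $_[0] =~ m,(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?,;
}

my ($scheme, $auth, $path, $query, $frag)
    = uri_split('http://www.amazon.com:8010/somepage.html?x=1#top');

print "scheme: $scheme\n";   # http
print "auth:   $auth\n";     # www.amazon.com:8010
print "path:   $path\n";     # /somepage.html
print "query:  $query\n";    # x=1
print "frag:   $frag\n";     # top

# The authority still carries the port (and any user:pass prefix), so one
# more match is needed to isolate the host. Hypothetical follow-up step:
my ($host, $port) = $auth =~ /^(?:.*\@)?([^:]*)(?::(\d*))?$/;
print "host:   $host\n";     # www.amazon.com
print "port:   $port\n";     # 8010
```

Splitting first and trimming afterwards keeps each expression short, at the cost of two matches instead of one.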


KevinR
Veteran


Jun 9, 2009, 11:07 AM


Re: [scrpnsanctuary] expression for pulling domains out of URLs

It is rather cryptic; even I have a hard time reading it, although I understand every construct it uses. There is a neat module that explains what a regular expression means:

http://search.cpan.org/~pinyan/YAPE-Regex-Explain-3.011/Explain.pm

See if you can get it installed and use it as needed.
-------------------------------------------------


(This post was edited by KevinR on Jun 9, 2009, 11:07 AM)
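
A minimal sketch of how the module is typically called, using its documented new(...)->explain interface. The variable name $pattern is made up here. Only the pattern itself is passed, as a plain string: if the m, ... , delimiters are included, they get explained as literal characters. The require is wrapped in eval so the script still runs where the module is not installed.

```perl
use strict;
use warnings;

# Pattern only: no "m," prefix and no trailing "," delimiter.
# Single quotes keep the backslash in \? intact.
my $pattern = '(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?';

# YAPE::Regex::Explain is a CPAN module, so load it only if it is available.
if (eval { require YAPE::Regex::Explain; 1 }) {
    print YAPE::Regex::Explain->new($pattern)->explain;
}
else {
    print "YAPE::Regex::Explain is not installed\n";
}
```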


scrpnsanctuary
Novice

Jun 9, 2009, 11:19 AM


Re: [KevinR] expression for pulling domains out of URLs

Got it:


Code
The regular expression:

(?-imsx:m,(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?,)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  m,                     'm,'
----------------------------------------------------------------------
  (?:                    group, but do not capture (optional
                         (matching the most amount possible)):
----------------------------------------------------------------------
    (                    group and capture to \1:
----------------------------------------------------------------------
      [^:/?#]+           any character except: ':', '/', '?',
                         '#' (1 or more times (matching the
                         most amount possible))
----------------------------------------------------------------------
    )                    end of \1
----------------------------------------------------------------------
    :                    ':'
----------------------------------------------------------------------
  )?                     end of grouping
----------------------------------------------------------------------
  (?:                    group, but do not capture (optional
                         (matching the most amount possible)):
----------------------------------------------------------------------
    //                   '//'
----------------------------------------------------------------------
    (                    group and capture to \2:
----------------------------------------------------------------------
      [^/?#]*            any character except: '/', '?', '#' (0
                         or more times (matching the most
                         amount possible))
----------------------------------------------------------------------
    )                    end of \2
----------------------------------------------------------------------
  )?                     end of grouping
----------------------------------------------------------------------
  (                      group and capture to \3:
----------------------------------------------------------------------
    [^?#]*               any character except: '?', '#' (0 or
                         more times (matching the most amount
                         possible))
----------------------------------------------------------------------
  )                      end of \3
----------------------------------------------------------------------
  (?:                    group, but do not capture (optional
                         (matching the most amount possible)):
----------------------------------------------------------------------
    \?                   '?'
----------------------------------------------------------------------
    (                    group and capture to \4:
----------------------------------------------------------------------
      [^#]*              any character except: '#' (0 or more
                         times (matching the most amount
                         possible))
----------------------------------------------------------------------
    )                    end of \4
----------------------------------------------------------------------
  )?                     end of grouping
----------------------------------------------------------------------
  (?:                    group, but do not capture (optional
                         (matching the most amount possible)):
----------------------------------------------------------------------
    #                    '#'
----------------------------------------------------------------------
    (                    group and capture to \5:
----------------------------------------------------------------------
      .*                 any character except \n (0 or more
                         times (matching the most amount
                         possible))
----------------------------------------------------------------------
    )                    end of \5
----------------------------------------------------------------------
  )?                     end of grouping
----------------------------------------------------------------------
  ,                      ','
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------



That actually makes sense to me after reading through it with a nice explanation, lol.


KevinR
Veteran


Jun 9, 2009, 11:46 AM


Re: [scrpnsanctuary] expression for pulling domains out of URLs

If you can understand that, it means you are going insane.

All really good programmers are insane, which explains why I am not too good.
-------------------------------------------------


(This post was edited by KevinR on Jun 9, 2009, 11:46 AM)