expression for pulling domains out of URLs

 



scrpnsanctuary
Novice

Jun 8, 2009, 2:33 PM

Post #1 of 9 (4286 views)
expression for pulling domains out of URLs

Hi there, I would like some help with this.

I have an expression that is supposed to grab the domain (xxx.com) out of a URL (http://www.google.com/?query=blah).

Requirements:
1) Must be able to work if there is a port number in the URL (http://www.amazon.com:8010/somepage.html)
2) Needs to get the domain (something.com) part (preferably into 1 variable, but if it's in 2 I can join them)
3) Needs to work if the URL has nothing (no subdomain) between '://' and the domain (http://somesite.com/page.html)
4) Needs to work if the URL has a lot of subdomains (http://some.domain.some.other.domain.site.com/page.htm)

My current expression sometimes doesn't work, but I'm not sure if it fails on one of the above conditions or on something else.


Here's what I'm using right now:

Code
if ($url =~ /^.*?\:\/\/(.*?\.)??([^\.]+?\.[^\.]+?)[\/\:].*/) 
{
    my $pre_domain = $1;
    my $domain     = $2;

    ### More stuff ###
}
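
One way to find out which condition breaks the expression is to run it against one URL per requirement, plus a couple of extra shapes. The sketch below is only an illustration: `domain_of` is a hypothetical helper name, and the last two URLs are assumed test cases that are not in the original list. Traced by hand, the four requirement URLs all produce the expected domain. The expression fails when nothing follows the host (it insists on a `/` or `:` after the domain), and it captures part of the path when the host has no dot but the path does.

```perl
use strict;
use warnings;

# The expression from the post, unchanged.
my $re = qr/^.*?\:\/\/(.*?\.)??([^\.]+?\.[^\.]+?)[\/\:].*/;

# Hypothetical helper: returns the second capture, or undef on no match.
sub domain_of {
    my ($url) = @_;
    return $url =~ $re ? $2 : undef;
}

my @urls = (
    'http://www.google.com/?query=blah',
    'http://www.amazon.com:8010/somepage.html',                # requirement 1
    'http://somesite.com/page.html',                           # requirement 3
    'http://some.domain.some.other.domain.site.com/page.htm',  # requirement 4
    'http://www.google.com',              # assumed case: nothing after the host
    'http://localhost/archive.tar/file',  # assumed case: dot only in the path
);

for my $url (@urls) {
    my $domain = domain_of($url);
    printf "%-55s => %s\n", $url, defined $domain ? $domain : 'NO MATCH';
}
```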




Thank you very much for your help.
----------
The vastness of what we know is only surpassed by the vastness of what we don't.

(This post was edited by scrpnsanctuary on Jun 9, 2009, 11:18 AM)


KevinR
Veteran


Jun 8, 2009, 2:55 PM

Post #2 of 9 (4276 views)
Re: [scrpnsanctuary] expression for pulling domains out of URLs

Maybe look at URI::Split and use it or just look in the source code to see how the module parses URIs.
-------------------------------------------------


scrpnsanctuary
Novice

Jun 8, 2009, 3:19 PM

Post #3 of 9 (4272 views)
Re: [KevinR] expression for pulling domains out of URLs

Ahh, thank you.

I guess I could use that module, but I kinda want to refine my expression just as a mental exercise.


KevinR
Veteran


Jun 9, 2009, 9:59 AM

Post #4 of 9 (4265 views)
Re: [scrpnsanctuary] expression for pulling domains out of URLs

If I get a chance later today I will take a closer look at your regexp and your requirements and see if I can make a recommendation.
-------------------------------------------------


scrpnsanctuary
Novice

Jun 9, 2009, 10:20 AM

Post #5 of 9 (4264 views)
Re: [KevinR] expression for pulling domains out of URLs

Oh cool, thank you very much.


scrpnsanctuary
Novice

Jun 9, 2009, 10:52 AM

Post #6 of 9 (4261 views)
Re: [KevinR] expression for pulling domains out of URLs

I looked up URI::Split and found how they do it:


Code
sub uri_split {
    return $_[0] =~ m,(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?,;
}


### Example:
my ($scheme, $auth, $path, $query, $frag) = uri_split($uri);


It's a bit more than I need, but wow. I can dissect most of it but I need to look at it longer to understand it all.
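
For the original goal (just the something.com part), the authority returned by that split can be trimmed further. This is a minimal sketch, not how URI::Split itself does it: `base_domain` is a hypothetical helper, and keeping only the last two labels is an assumption that gives the wrong answer for suffixes such as co.uk. It also does not try to handle IPv6 literals.

```perl
use strict;
use warnings;

# Same pattern as the uri_split quoted above.
sub uri_split {
    return $_[0] =~ m,(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?,;
}

# Hypothetical helper (not part of URI::Split): keep only the last two
# dot-separated labels of the host, e.g. "www.google.com" -> "google.com".
sub base_domain {
    my ($url) = @_;
    my (undef, $auth) = uri_split($url);
    return undef unless defined $auth && length $auth;

    (my $host = $auth) =~ s/^.*\@//;   # drop an optional "user:pass@" prefix
    $host =~ s/:\d*\z//;                # drop an optional ":port" suffix

    my @labels = split /\./, $host;
    return $host if @labels < 2;        # single-label host such as "localhost"
    return join '.', @labels[-2, -1];
}

for my $url (
    'http://www.amazon.com:8010/somepage.html',
    'http://somesite.com/page.html',
    'http://some.domain.some.other.domain.site.com/page.htm',
    'http://www.google.com',
) {
    print "$url => ", base_domain($url), "\n";
}
```

Because the split stops the authority at the first `/`, `?`, or `#`, a URL with nothing after the host or with dots in its path no longer confuses the extraction.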

(This post was edited by scrpnsanctuary on Jun 9, 2009, 10:54 AM)


KevinR
Veteran


Jun 9, 2009, 11:07 AM

Post #7 of 9 (4257 views)
Re: [scrpnsanctuary] expression for pulling domains out of URLs

It is rather cryptic. Even I have a hard time reading it, although I understand all the syntax it uses. There is a neat module that explains what regular expressions mean:

http://search.cpan.org/~pinyan/YAPE-Regex-Explain-3.011/Explain.pm

See if you can get it installed and use it as needed.
-------------------------------------------------


(This post was edited by KevinR on Jun 9, 2009, 11:07 AM)


scrpnsanctuary
Novice

Jun 9, 2009, 11:19 AM

Post #8 of 9 (4254 views)
Re: [KevinR] expression for pulling domains out of URLs

Got it:


Code
The regular expression:

(?-imsx:m,(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?,)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  m,                       'm,'
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    (                        group and capture to \1:
----------------------------------------------------------------------
      [^:/?#]+                 any character except: ':', '/', '?',
                               '#' (1 or more times (matching the
                               most amount possible))
----------------------------------------------------------------------
    )                        end of \1
----------------------------------------------------------------------
    :                        ':'
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    //                       '//'
----------------------------------------------------------------------
    (                        group and capture to \2:
----------------------------------------------------------------------
      [^/?#]*                  any character except: '/', '?', '#' (0
                               or more times (matching the most
                               amount possible))
----------------------------------------------------------------------
    )                        end of \2
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    [^?#]*                   any character except: '?', '#' (0 or
                             more times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    \?                       '?'
----------------------------------------------------------------------
    (                        group and capture to \4:
----------------------------------------------------------------------
      [^#]*                    any character except: '#' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )                        end of \4
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    #                        '#'
----------------------------------------------------------------------
    (                        group and capture to \5:
----------------------------------------------------------------------
      .*                       any character except \n (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )                        end of \5
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
  ,                        ','
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------



That actually makes sense to me after reading through it with a nice explanation, lol.
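
The same breakdown can be kept next to the pattern itself by writing it with the /x modifier, which lets whitespace and comments sit inside the expression. This is a sketch of that idea rather than the module's own source; `$uri_re` and the sample URL are made up for illustration. Note that the 'm,' and ',' nodes in the explanation above are the match delimiters from the module source, which ended up inside the pattern that was explained, so they are left out here.

```perl
use strict;
use warnings;

# The uri_split pattern, spread out with /x. The '#' characters are
# backslash-escaped so they are never mistaken for comment starts.
my $uri_re = qr{
    (?: ( [^:/?\#]+ ) : )?    # capture 1: scheme
    (?: // ( [^/?\#]* ) )?    # capture 2: authority (host and optional port)
    ( [^?\#]* )               # capture 3: path
    (?: \? ( [^\#]* ) )?      # capture 4: query
    (?: \# ( .* ) )?          # capture 5: fragment
}x;

my $url   = 'http://www.amazon.com:8010/somepage.html?x=1#top';
my @names = qw(scheme authority path query fragment);

if ( my @parts = $url =~ $uri_re ) {
    for my $i ( 0 .. $#names ) {
        printf "%-9s %s\n", $names[$i],
            defined $parts[$i] ? $parts[$i] : '(undef)';
    }
}
```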


KevinR
Veteran


Jun 9, 2009, 11:46 AM

Post #9 of 9 (4251 views)
Re: [scrpnsanctuary] expression for pulling domains out of URLs

If you can understand that, it means you are going insane.

All really good programmers are insane, which explains why I am not too good.
-------------------------------------------------


(This post was edited by KevinR on Jun 9, 2009, 11:46 AM)

 
 

