This is annoying me. I'm sure I'm f@#ing it up, please correct me.



rmarc
New User

Apr 4, 2014, 10:07 PM




Code
cat /tmp/stuff | perl -e '
my %urlhash;
while (<STDIN>) {
    my ($url) = $_ =~ /https?:\/\/(\S+)\//;
    print "$url\n";
    if ($urlhash{$url}) {
        $urlhash{$url}++;
    } else {
        $urlhash{$url} = 1;
    }
} '

ib.adnxs.com/ttj?ttjb=1&bdref=http%3A%2F%2Fnym1.b.adnxs.com%2Fif%3Fenc%3DuB6F61G4nj-4HoXrUbieP39qvHSTGMQ_uB6F61G4nj-4HoXrUbieP4lItL9PFAlrSeCjkBlH1Hsauj5TAAAAAFyWHwBAAgAAQAIAAAIAAABnK78AFycFAAAAAQBVU0QAVVNEAKAAWAImEwAADJ4AAgQCAQIAAIwAeyTbagAAAAA.%26cnd%3D%25217iWHmAjcib0BEOfW_AUYACCXzhQwADimphhABEjABFDcrH5YAGDJBGgAcAB4AIABAIgBAJABAZgBAaABAagBA7ABALkBuB6F61G4nj_BAbgehetRuJ4_yQEltGaNUfL7P9kBAAAAAAAA8D_gAQD1AQAAAAA.%26ccd%3D%2521rwYRPwjcib0BEOfW_AUYl84UIAQ.%26udj%3Duf%2528%2527a%2527%252C%2B54560%252C%2B1396619802%2529%253Buf%2528%2527r%2527%252C%2B12528487%252C%2B1396619802%2529%253B%26vpid%3D78%26apid%3D223406%26referrer%3Dhttp%253A%252F%252Fwww.chancese.com%252Findex.php%253Foption%253Dcom_content%2526view%253Darticle%2526id%253D1367%253AThe-Disadvantages-of-Working-as-a-Team%2526catid%253D27%2526Itemid%253D67%26ct%3D0%26dlo%3D1&id=2263489&cb=[CACHEBUSTER]&pubclick=http://nym1.b.adnxs.com/click?uB6F61G4nj-4HoXrUbieP39qvHSTGMQ_uB6F61G4nj-4HoXrUbieP4lItL9PFAlrSeCjkBlH1Hsauj5TAAAAAFyWHwBAAgAAQAIAAAIAAABnK78AFycFAAAAAQBVU0QAVVNEAKAAWAImEwAADJ4DAQQCAQIAAIwAfSQAawAAAAA./cnd=%21rwYRPwjcib0BEOfW_AUYl84UIAQ./referrer=http%3A%2F%2Fwww.chancese.com%2Findex.php%3Foption%3Dcom_content%26view%3Darticle%26id%3D1367%3AThe-Disadvantages-of-Working-as-a-Team%26catid%3D27%26Itemid%3D67


It works fine if I do this:

Code
cat /tmp/stuff | perl -e '
my %urlhash;
while (<STDIN>) {
    my ($url) = $_ =~ /https?:\/\/(\S+)\/t/;
    print "$url\n";
    if ($urlhash{$url}) {
        $urlhash{$url}++;
    } else {
        $urlhash{$url} = 1;
    }
} '
ib.adnxs.com

Here’s /tmp/stuff:


Code
173.234.12.237 - - [04/Apr/2014:08:56:58 -0500] "GET http://ib.adnxs.com/ttj?ttjb=1&bdref=http%3A%2F%2Fnym1.b.adnxs.com%2Fif%3Fenc%3DuB6F61G4nj-4HoXrUbieP39qvHSTGMQ_uB6F61G4nj-4HoXrUbieP4lItL9PFAlrSeCjkBlH1Hsauj5TAAAAAFyWHwBAAgAAQAIAAAIAAABnK78AFycFAAAAAQBVU0QAVVNEAKAAWAImEwAADJ4AAgQCAQIAAIwAeyTbagAAAAA.%26cnd%3D%25217iWHmAjcib0BEOfW_AUYACCXzhQwADimphhABEjABFDcrH5YAGDJBGgAcAB4AIABAIgBAJABAZgBAaABAagBA7ABALkBuB6F61G4nj_BAbgehetRuJ4_yQEltGaNUfL7P9kBAAAAAAAA8D_gAQD1AQAAAAA.%26ccd%3D%2521rwYRPwjcib0BEOfW_AUYl84UIAQ.%26udj%3Duf%2528%2527a%2527%252C%2B54560%252C%2B1396619802%2529%253Buf%2528%2527r%2527%252C%2B12528487%252C%2B1396619802%2529%253B%26vpid%3D78%26apid%3D223406%26referrer%3Dhttp%253A%252F%252Fwww.chancese.com%252Findex.php%253Foption%253Dcom_content%2526view%253Darticle%2526id%253D1367%253AThe-Disadvantages-of-Working-as-a-Team%2526catid%253D27%2526Itemid%253D67%26ct%3D0%26dlo%3D1&id=2263489&cb=[CACHEBUSTER]&pubclick=http://nym1.b.adnxs.com/click?uB6F61G4nj-4HoXrUbieP39qvHSTGMQ_uB6F61G4nj-4HoXrUbieP4lItL9PFAlrSeCjkBlH1Hsauj5TAAAAAFyWHwBAAgAAQAIAAAIAAABnK78AFycFAAAAAQBVU0QAVVNEAKAAWAImEwAADJ4DAQQCAQIAAIwAfSQAawAAAAA./cnd=%21rwYRPwjcib0BEOfW_AUYl84UIAQ./referrer=http%3A%2F%2Fwww.chancese.com%2Findex.php%3Foption%3Dcom_content%26view%3Darticle%26id%3D1367%3AThe-Disadvantages-of-Working-as-a-Team%26catid%3D27%26Itemid%3D67/clickenc= HTTP/1.0" 200 1458 
"http://nym1.b.adnxs.com/if?enc=uB6F61G4nj-4HoXrUbieP39qvHSTGMQ_uB6F61G4nj-4HoXrUbieP4lItL9PFAlrSeCjkBlH1Hsauj5TAAAAAFyWHwBAAgAAQAIAAAIAAABnK78AFycFAAAAAQBVU0QAVVNEAKAAWAImEwAADJ4AAgQCAQIAAIwAeyTbagAAAAA.&cnd=%217iWHmAjcib0BEOfW_AUYACCXzhQwADimphhABEjABFDcrH5YAGDJBGgAcAB4AIABAIgBAJABAZgBAaABAagBA7ABALkBuB6F61G4nj_BAbgehetRuJ4_yQEltGaNUfL7P9kBAAAAAAAA8D_gAQD1AQAAAAA.&ccd=%21rwYRPwjcib0BEOfW_AUYl84UIAQ.&udj=uf%28%27a%27%2C+54560%2C+1396619802%29%3Buf%28%27r%27%2C+12528487%2C+1396619802%29%3B&vpid=78&apid=223406&referrer=http%3A%2F%2Fwww.chancese.com%2Findex.php%3Foption%3Dcom_content%26view%3Darticle%26id%3D1367%3AThe-Disadvantages-of-Working-as-a-Team%26catid%3D27%26Itemid%3D67&ct=0&dlo=1" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; FunWebProducts; .NET CLR 1.1.4322)"



(This post was edited by Laurent_R on Apr 5, 2014, 8:27 AM)


Laurent_R
Veteran / Moderator

Apr 5, 2014, 8:22 AM


Re: [rmarc] This is annoying me. I'm sure I'm f@#ing it up, please correct me.

Hi,
I have heavily edited your post to add code tags, line breaks and indentation in the code, to try to make it more readable. Please use code tags next time you post here.

Even after reformatting the post, I still don't understand what you want. What is your problem? What is your question?


rmarc
New User

Apr 5, 2014, 8:30 AM


Re: [Laurent_R] This is annoying me. I'm sure I'm f@#ing it up, please correct me.

I'm just trying to parse an Apache log. In this case, I want to extract the hostname from the GET request, if there is one, and then count how often each hostname occurs.

It works for a lot of lines, but not for all of them, and I'm confused as to why. It seems that it's not matching the first slash after the hostname in cases like the one I posted.

R. Marc


BillKSmith
Veteran

Apr 5, 2014, 8:32 AM


Re: [rmarc] This is annoying me. I'm sure I'm f@#ing it up, please correct me.

I suspect that your problem is the greedy match. Try:

Code
my ($url) = $_ =~ /https?:\/\/(\S+?)\//;

Good Luck,
Bill
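
To see why the greedy quantifier is the likely culprit, here is a minimal sketch. The log line is a shortened, made-up stand-in for the one in /tmp/stuff, and the negated character class is an extra alternative, not something suggested in the thread.

```perl
use strict;
use warnings;

# A shortened, invented log fragment with the same shape as the problem line:
# several "/" characters inside one run of non-whitespace characters.
my $line = '"GET http://ib.adnxs.com/ttj?cb=1&pubclick=http://nym1.b.adnxs.com/click?x=1/clickenc= HTTP/1.0"';

# Greedy: \S+ first consumes the whole non-space run, then backtracks only
# as far as the LAST "/" in that run.
my ($greedy) = $line =~ m{https?://(\S+)/};

# Non-greedy: \S+? grows one character at a time and stops at the FIRST "/".
my ($lazy) = $line =~ m{https?://(\S+?)/};

# Alternative: a negated character class that can never cross a "/".
my ($negated) = $line =~ m{https?://([^/\s]+)};

print "greedy:  $greedy\n";
print "lazy:    $lazy\n";
print "negated: $negated\n";
```

With this sample, the greedy capture runs up to the slash before "clickenc=", while the other two stop at "ib.adnxs.com". The /t variant in the original post only appears to work because "/ttj" happens to be the only place in that URL where a literal slash is followed by a "t".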


rmarc
New User

Apr 5, 2014, 8:37 AM


Re: [BillKSmith] This is annoying me. I'm sure I'm f@#ing it up, please correct me.

Beautiful.

I would have thought an explicit match would override the greed, but I'm for what works.

Thanks.

R. Marc
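
For the counting step, a possibly tidier version is sketched below. It is only an illustration: the helper name count_hosts and the sample lines are invented here, and anchoring on the request method (so that a referrer URL is never picked up) is an assumption about what is wanted.

```perl
use strict;
use warnings;

# Hypothetical helper: count hostnames found in the request part of each line.
# Lines whose request has no http(s) URL are skipped.
sub count_hosts {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        # Assumption: the request looks like "GET http://host/...".
        # [^/\s]+ stops at the first "/" or whitespace, so only the host
        # (and port, if any) is captured.
        next unless $line =~ m{"[A-Z]+ https?://([^/\s]+)};
        $count{$1}++;    # a missing key starts at undef and becomes 1
    }
    return %count;
}

# Invented sample lines in roughly the combined log format.
my @sample = (
    '1.2.3.4 - - [04/Apr/2014:08:56:58 -0500] "GET http://ib.adnxs.com/ttj?id=1/x= HTTP/1.0" 200 1458',
    '1.2.3.4 - - [04/Apr/2014:08:57:01 -0500] "GET http://ib.adnxs.com/other HTTP/1.0" 200 10',
    '1.2.3.4 - - [04/Apr/2014:08:57:02 -0500] "GET /local/page HTTP/1.0" 200 10 "http://example.com/ref"',
);

my %count = count_hosts(@sample);

# Print the most frequent hosts first, ties broken alphabetically.
printf "%5d %s\n", $count{$_}, $_
    for sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
```

The if/else around the increment is not needed, since ++ on a missing hash key yields 1. Skipping non-matching lines also avoids printing and counting an undefined $url.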


FishMonger
Veteran / Moderator

Apr 5, 2014, 9:16 AM


Re: [rmarc] This is annoying me. I'm sure I'm f@#ing it up, please correct me.


Quote
I'm just trying to parse an apache log.


Rather than rolling your own parser, why not use one that has been tested by tens of thousands of people?

Apache::Log::Parser - Parser for Apache Log (common, combined, and any other custom styles by LogFormat).
http://search.cpan.org/~tagomoris/Apache-Log-Parser-0.02/lib/Apache/Log/Parser.pm