CGI/Perl Guide | Learning Center | Forums | Advertise | Login
Site Search: in

  Main Index MAIN
INDEX
Search Posts SEARCH
POSTS
Who's Online WHO'S
ONLINE
Log in LOG
IN

Home: Perl Programming Help: Regular Expressions:
Regex benchmark

 



anglaissam
Novice

Jul 21, 2013, 8:34 PM

Post #1 of 9 (14514 views)
Regex benchmark Can't Post

Hi, i have these two regex that i benchmarked. I was hoping someone could explain 'WHY' the first regex is faster than the second.


Code
regex1 = '$f =~ /.*\.htm$|\.html$/i' 
regex2 = '$f =~ /\.htm$|\.html$/i'


Note the first regex has the .* reference at the beginning.

Thanks for any help.


Laurent_R
Veteran / Moderator

Jul 22, 2013, 8:47 AM

Post #2 of 9 (14506 views)
Re: [anglaissam] Regex benchmark [In reply to] Can't Post

Yes, this is a bit surprising since the two regexes should be equivalent.

It might be interesting, however that you show against which strings you ran your benchmark.

As an additional comment, you could try to deparse your program to see what the compiled result looks like.


FishMonger
Veteran / Moderator

Jul 22, 2013, 10:14 AM

Post #3 of 9 (14504 views)
Re: [anglaissam] Regex benchmark [In reply to] Can't Post

The .* in the first regex tells it to skip over all of the chars and go to the end of the string and then backtrack one char at a time to try and match the pattern at the end.

The second regex first needs to check each char in the string to see if it's a '.' then once that is matched, it checks for each of the next chars one-by-one.

The difference in speed of match becomes greater as the length of the string increases.

The alternation used in each regex is unnecessary and adds inefficiency.

This would be better than the alternation.

Code
$f =~ /.*\.html?$/i;


It would be interesting to benchmark the regex against using index.


(This post was edited by FishMonger on Jul 22, 2013, 10:15 AM)


FishMonger
Veteran / Moderator

Jul 22, 2013, 10:17 AM

Post #4 of 9 (14502 views)
Re: [anglaissam] Regex benchmark [In reply to] Can't Post

It would be a good idea to pickup a copy of "Mastering Regular Expressions, 3rd Edition"

http://shop.oreilly.com/product/9780596528126.do


anglaissam
Novice

Jul 22, 2013, 12:14 PM

Post #5 of 9 (14496 views)
Re: [FishMonger] Regex benchmark [In reply to] Can't Post

Hi, thanks for your explanations. As for using:


Code
$f =~ /.*\.html?$/i;


I originally used the above regex, however, benchmarking on multiple strings of varying lengths showed that alternating the regex with an OR is 90%+ more efficient. For those who are interested below i am posting the entire code i used to benchmark everything.


Code
#!/usr/bin/perl 
use Benchmark qw(:all);
use strict;

my $results = timethese(1050000,
{
'Code1' => sub { &code1; },
'Code2' => sub { &code2; },
'Code3' => sub { &code3; },
},
'none'
);
cmpthese( $results ) ;
system("pause");
exit 0;
#------
sub code1 {
my $dir = "C:\Users\Admin\Documents\Ebooks\Unsorted\Temp\[C] - Fair's fair.html";
$dir =~ /\.htm$|\.html$/i
}
#------
sub code2 {
my $dir = "C:\Users\Admin\Documents\Ebooks\Unsorted\Temp\[C] - Fair's fair.html";
$dir =~ /.*\.htm$|\.html$/i
}
#------
sub code3 {
my $dir = "C:\Users\Admin\Documents\Ebooks\Unsorted\Temp\[C] - Fair's fair.html";
$dir =~ /.*\.html?$/i
}
#------



FishMonger
Veteran / Moderator

Jul 22, 2013, 12:54 PM

Post #6 of 9 (14492 views)
Re: [anglaissam] Regex benchmark [In reply to] Can't Post

I believe you're misreading the results. The order of the output from Benchmark is from slowest to fastest and the rate column shows the number of executions per second.

I enabled warnings to point out a side problem with your benchmark.


Code
D:\test>anglaissam.pl 
Unrecognized escape \A passed through at D:\test\Perl-3.pl line 18.
Unrecognized escape \D passed through at D:\test\Perl-3.pl line 18.
Unrecognized escape \T passed through at D:\test\Perl-3.pl line 18.
Unrecognized escape \A passed through at D:\test\Perl-3.pl line 23.
Unrecognized escape \D passed through at D:\test\Perl-3.pl line 23.
Unrecognized escape \T passed through at D:\test\Perl-3.pl line 23.
Unrecognized escape \A passed through at D:\test\Perl-3.pl line 28.
Unrecognized escape \D passed through at D:\test\Perl-3.pl line 28.
Unrecognized escape \T passed through at D:\test\Perl-3.pl line 28.
Rate Code2 Code1 Code3
Code2 41041/s -- -71% -96%
Code1 142605/s 247% -- -88%
Code3 1141304/s 2681% 700% --
Press any key to continue . . .


I then changed the test to use references to the subs rather than closures and here's that result.


Code
           Rate Code2 Code1 Code3 
Code2 42982/s -- -72% -98%
Code1 151253/s 252% -- -92%
Code3 1868327/s 4247% 1135% --
Press any key to continue . . .



(This post was edited by FishMonger on Jul 22, 2013, 12:55 PM)


anglaissam
Novice

Jul 22, 2013, 1:33 PM

Post #7 of 9 (14487 views)
Re: [FishMonger] Regex benchmark [In reply to] Can't Post

Can you post the altered benchmarking code you used for your results? Fixing the warnings was easy enough but, whenever i adjust my code to use references, the results do not come out as expected...meaning i'm not doing it right.

Either way, thanks for your help Fishmonger.


FishMonger
Veteran / Moderator

Jul 22, 2013, 1:34 PM

Post #8 of 9 (14485 views)
Re: [anglaissam] Regex benchmark [In reply to] Can't Post


Code
#!/usr/bin/perl 
use Benchmark qw(:all);
use strict;
use warnings;
my $results = timethese(1050000,
{
'Code1' => \&code1,
'Code2' => \&code2,
'Code3' => \&code3,
},
'none'
);
cmpthese( $results ) ;
system("pause");
exit 0;
#------
sub code1 {
my $dir = "C:\Users\Admin\Documents\Ebooks\Unsorted\Temp\[C] - Fair's fair.html";
$dir =~ /\.htm$|\.html$/i
}
#------
sub code2 {
my $dir = "C:\Users\Admin\Documents\Ebooks\Unsorted\Temp\[C] - Fair's fair.html";
$dir =~ /.*\.htm$|\.html$/i
}
#------
sub code3 {
my $dir = "C:\Users\Admin\Documents\Ebooks\Unsorted\Temp\[C] - Fair's fair.html";
$dir =~ /.*\.html?$/i
}



Laurent_R
Veteran / Moderator

Jul 22, 2013, 4:09 PM

Post #9 of 9 (14473 views)
Re: [FishMonger] Regex benchmark [In reply to] Can't Post

Hi,
looking at your results, I got the feeling that:
- having the .* clause at the start does reduce speed (compare code2 and code3), and that does make some sense to me, the .* is probably just doing useless work
- the "l?" quantifier is faster than the HTML$|HTM$ alternation, this is also making sense to me.

Therefore, I had the idea that the best solution might be the fourth one, not yet tested and, so to speak, comibining the best of two worlds, i.e.:


Code
#------  
sub code4 {
my $dir = "C:\Users\Admin\Documents\Ebooks\Unsorted\Temp\[C] - Fair's fair.html";
$dir =~ /\.html?$/i
}


I was going to ask you to try it, FishMonger, but I can do it as well. Just, please everyone, make sure that you don't compare FishMonger's timings with mine, they are on a different platform. Except from adding code4 and calling it in the benchmark section, I haven't changed anything else to the code. This is what I get:


Code
           Rate Code2 Code1 Code3 Code4 
Code2 39686/s -- -74% -98% -98%
Code1 151581/s 282% -- -92% -93%
Code3 1923077/s 4746% 1169% -- -17%
Code4 2317881/s 5741% 1429% 21% --

Given that Code1 is almost 4 times more efficient than Code2, I was expecting Code4 to bring a higher improvement over Code3 (certainly not 4 times, I have enough experience in these matters to know very well that that the same change made on a slow and on a fast program don't bring proportional improvements), but I still thought it would be higher, maybe 1.5 to 2 times better. The improvement is only 21%, which is far from unsignificant, but less that I expected.

Yet, this is a pattern that I know very well. I am considered as the leading expert in a proprietary language and have been spending probably 20% of my time on performance improvements (I prefer to work in Perl, but, well, I also have to do other things) over the last 3 years. But I run regularly into one problem. The first application I worked on when I arrived in my current work department needed 5 days to complete. I was asked to see if I could improve that. In just 3 or 4 days of work, I succeeded to reduce it to about 1.5 day running time. A huge improvement, very nice. I then worked on improving about 45 other applications that were considered to take too much time. Some, I improved considerably (by a factor of 25 in some cases), others only by 30% or 40%. Over the time, I also improved my capacity to improve performance, finding new ways of doing things. After doing this for about 18 months, the situation was very different. Some applications that were originally taking huge amount of time were no longer a problem, and some others were becoming the bottleneck. So, I was asked to make another performance improvement exercise on them. But the more you have improved something, the more it is difficult to find new improvements. So, typically, I divided the run time by a factor of 4 the first time. The next time I have to do that, I only succeeded to reduce the execution time by a factor of 2. Still quite good, but not as fine looking. And, in some cases, I have had to work for the 3rd time on something, and I am now getting only a 30% improvement and I don't see what else I could do. Well, sorry for this long off-topic digression, but it might be interesting to some people.

But, back to the original problem, the results confirm my guts feeling that the ".*" at start is only wasting time and that the quantifier on the "l" letter if better than the alternation.

 
 


Search for (options) Powered by Gossamer Forum v.1.2.0

Web Applications & Managed Hosting Powered by Gossamer Threads
Visit our Mailing List Archives