
Laurent_R
Veteran
/ Moderator
Jul 22, 2013, 4:09 PM
Post #9 of 9
Re: [FishMonger] Regex benchmark
Hi,

looking at your results, I get the feeling that:
- having the .* clause at the start does reduce speed (compare code2 and code3), which makes sense to me: the .* is probably just doing useless work;
- the "l?" quantifier is faster than the HTML$|HTM$ alternation, which also makes sense to me.

Therefore, I had the idea that the best solution might be a fourth one, not yet tested, which combines, so to speak, the best of both worlds, i.e.:
#------
sub code4 {
    my $dir = "C:\Users\Admin\Documents\Ebooks\Unsorted\Temp\[C] - Fair's fair.html";
    $dir =~ /\.html?$/i
}

I was going to ask you to try it, FishMonger, but I can do it just as well. Just, please, everyone, make sure that you don't compare FishMonger's timings with mine: they were obtained on different platforms. Apart from adding code4 and calling it in the benchmark section, I haven't changed anything else in the code. This is what I get:
           Rate  Code2  Code1  Code3  Code4
Code2   39686/s     --   -74%   -98%   -98%
Code1  151581/s   282%     --   -92%   -93%
Code3 1923077/s  4746%  1169%     --   -17%
Code4 2317881/s  5741%  1429%    21%     --

Given that Code1 is almost 4 times faster than Code2, I was expecting Code4 to bring a bigger improvement over Code3. Certainly not 4 times (I have enough experience in these matters to know very well that the same change made to a slow and to a fast program doesn't bring proportional improvements), but I still thought it would be higher, maybe 1.5 to 2 times better. The improvement is only 21%, which is far from insignificant, but less than I expected.

Yet, this is a pattern that I know very well. I am considered the leading expert in a proprietary language and have been spending probably 20% of my time on performance improvements (I prefer to work in Perl, but, well, I also have to do other things) over the last 3 years. But I regularly run into one problem.

The first application I worked on when I arrived in my current department needed 5 days to complete. I was asked to see if I could improve that. In just 3 or 4 days of work, I succeeded in reducing it to about 1.5 days of running time. A huge improvement, very nice. I then worked on improving about 45 other applications that were considered to take too much time. Some I improved considerably (by a factor of 25 in some cases), others only by 30% or 40%. Over time, I also improved my ability to improve performance, finding new ways of doing things.

After doing this for about 18 months, the situation was very different. Some applications that were originally taking a huge amount of time were no longer a problem, and some others were becoming the bottleneck. So, I was asked to carry out another performance improvement exercise on them. But the more you have improved something, the more difficult it is to find new improvements. So, typically, I divided the run time by a factor of 4 the first time.
The next time I had to do that, I only succeeded in reducing the execution time by a factor of 2. Still quite good, but not as impressive. And, in some cases, I have had to work on something for the third time, and I am now getting only a 30% improvement and I don't see what else I could do.

Well, sorry for this long off-topic digression, but it might be interesting to some people.

But, back to the original problem: the results confirm my gut feeling that the ".*" at the start is only wasting time and that the quantifier on the letter "l" is better than the alternation.
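
One way to see why the same change gives a 282% gain in one case and only 21% in the other is to convert the rates from the table into time per call. The sketch below does just that; the rates are copied from the table, while the helper name usec_per_call is made up for the illustration. Note that the per-call time includes the sub call and the string assignment, not only the regex match.

```perl
use strict;
use warnings;

# Rates (calls per second) copied from the benchmark table in this post.
my %rate = (
    Code1 => 151581,
    Code2 => 39686,
    Code3 => 1923077,
    Code4 => 2317881,
);

# Convert a rate into the time spent per call, in microseconds.
# (usec_per_call is a hypothetical helper, not part of the benchmark code.)
sub usec_per_call {
    my ($name) = @_;
    return 1_000_000 / $rate{$name};
}

# Time saved by going from the alternation to the "l?" quantifier,
# first with the leading ".*" (Code2 -> Code1), then without it (Code3 -> Code4).
my $saved_with_dotstar    = usec_per_call('Code2') - usec_per_call('Code1');
my $saved_without_dotstar = usec_per_call('Code3') - usec_per_call('Code4');

printf "%-6s %8.3f us/call\n", $_, usec_per_call($_) for sort keys %rate;
printf "saved with .*:    %.3f us/call\n", $saved_with_dotstar;
printf "saved without .*: %.3f us/call\n", $saved_without_dotstar;
```

With the leading ".*", the quantifier saves roughly 18.6 microseconds per call; without it, there is only about half a microsecond per call left to begin with, so the same change can save no more than about 0.09 microseconds.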
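
For anyone who wants to rerun the comparison on their own machine, here is a minimal sketch using the core Benchmark module. It rests on several assumptions: only code4 is quoted in this post, so the patterns for Code1 to Code3 are my reconstruction from the description (with or without the leading ".*", alternation or "l?" quantifier) and may differ from FishMonger's actual code. The path is also written with q{} and assigned once outside the subs, whereas the original assigns a double-quoted string inside each sub (where sequences such as \U and \E are interpreted by Perl), so the absolute numbers will not be comparable to the table.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Sample path from the thread; q{} keeps the backslashes literal
# (a deviation from the original double-quoted string).
my $path = q{C:\Users\Admin\Documents\Ebooks\Unsorted\Temp\[C] - Fair's fair.html};

# Hypothetical reconstructions of the four variants; only the Code4
# pattern is taken verbatim from this post.
my %variant = (
    Code1 => sub { $path =~ /.*\.html?$/i },           # leading .*, quantifier
    Code2 => sub { $path =~ /.*\.(?:HTML$|HTM$)/i },   # leading .*, alternation
    Code3 => sub { $path =~ /\.(?:HTML$|HTM$)/i },     # no .*, alternation
    Code4 => sub { $path =~ /\.html?$/i },             # no .*, quantifier
);

# Sanity check: every variant should accept the sample path.
for my $name (sort keys %variant) {
    die "$name failed to match\n" unless $variant{$name}->();
}

# Run each variant for at least 1 CPU second and print the comparison chart.
cmpthese(-1, \%variant);
```

Passing a negative count to cmpthese lets Benchmark choose the number of iterations itself, which avoids the "too few iterations" warning that a fixed count tends to trigger on the fastest variants.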