Oct 24, 2013, 3:15 PM
Post #4 of 5
Re: [Kate] Is oneline Perl actually faster than Shell+AWK?
[In reply to]
My questions are
(a) Is a "one-liner" Perl command not equivalent to a real Perl script?
(b) For data manipulation only, is AWK simply better than Perl?
(c) Would it be worth learning and writing C to do data manipulation? I know it would be very difficult, but it could save me a lot of time in the future, since no language can be faster than C?
It really depends on what you are doing. And you have to compare things that can really be compared.
The shell itself is much, much slower than Perl, no doubt about it, provided each is used correctly.
About 10 years ago, a colleague of mine and I had to filter about 100 million records in order to keep only those pertaining to the clients loaded into the test database (about 25% of the full client base). I wrote a Perl program that loaded 500,000 client IDs into a hash and then read the records, keeping only those of interest. My mistake was to assume that the 100 million records would come in one big file. In reality, the records arrived in tens of thousands of small files. My colleague took my Perl program and wrote a shell script that called it once for each incoming file, so that my program had to be compiled and had to load the 500,000 clients into memory for every single file, each of which held only a few hundred to a few thousand records.

The result was, predictably, devastatingly slow (it took hours). My colleague complained that my Perl program was slow. The program was not slow; it was massively misused. When I realized the situation, I changed my script to load the relevant clients into the hash only once, and then process all the files. With that simple change, filtering the records took only a couple of minutes.
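For what it's worth, the fixed approach can be sketched with the classic two-file awk idiom (the file names and data below are hypothetical): the keep-list is loaded into an in-memory hash exactly once, and every record file is then streamed through that same process.

```shell
# Hypothetical sample data: a client list and two small record files.
printf 'C001\nC003\n' > clients.txt
printf 'C001 order1\nC002 order2\nC003 order3\n' > records1.txt
printf 'C002 order4\nC001 order5\n' > records2.txt

# NR==FNR is true only while reading the first file (clients.txt):
# each client ID becomes a key in the array 'keep'. For every later
# file, a record is printed only if its first field is a known client.
awk 'NR==FNR { keep[$1]; next } $1 in keep' \
    clients.txt records1.txt records2.txt > kept.txt

cat kept.txt    # C001 order1 / C003 order3 / C001 order5
```

The equivalent Perl is a `%keep` hash filled once and then consulted while looping over all the input files. The point of the anecdote is exactly this: the hash must be built once per run, not once per input file.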
I think I can fairly claim to be an expert on performance issues with large datasets, and I know that you just can't make general rules.
About ten years ago, I benchmarked a Perl program against an awk program: Perl was at least 5 times faster. I ran a similar test recently (different versions of Perl and, much more importantly, a different implementation of awk, from a different vendor), and Perl was still faster, but by a much narrower margin (perhaps 25%). I do not think that Perl's performance decreased; rather, the recent awk implementation on AIX is obviously better than whatever I had tried ten years ago (most probably on Solaris or HP-UX at the time, but I don't remember for sure). Similarly, 10 years ago Perl was way faster than sed, yet I recently found cases of very simple scripts where the AIX implementation of sed turned out to be marginally faster than the equivalent Perl program (but only for very simple things).
However, generally speaking, and keeping in mind that I am working on extremely large files on a daily basis, Perl is really my tool of choice.
I could give you hundreds of examples, but I will give you just one, from yesterday and today. I was asked to help some colleagues who were using SQL and VB scripts on an Access database to process some moderately large data. The process was taking more than 24 hours to run, which was not acceptable in the context of the business process. I wrote a Perl program using hashes, working directly on the original flat files. My program took less than a minute to generate the required output.
Now, it is obviously possible to write a C program that will be faster than Perl. But you would first have to find libraries offering a good implementation of hashes and of the many other Perl facilities that I used. Even then, the C program would still be at least ten, possibly twenty, times longer (in lines of code) and probably much more difficult to debug. Where I spent two days writing the program in Perl, I might have needed a month (possibly more) for the same thing in C. And again, don't assume that a C program is by itself faster than a Perl program: that is true only for extremely well-written C, or for very specific CPU-intensive problems. I still use C once in a while for very specific issues, but Perl gives me more, at least 99.5% of the time.
And, BTW, probably more than 90% of the Human Genome Project was done in Perl. C programs would run faster on these huge amounts of data, but those C programs probably would not have been completed for another ten years. With Perl, the results are already there.
In brief, if you are really working on computationally intensive problems (finding the next Mersenne prime, multiplying matrices with millions of rows and columns, or preparing weather forecasts), C might be your best friend. For more common problems, Perl will give you far more, provided you know how to use it.