
Is oneline Perl actually faster than Shell+AWK?

 



Kate
New User

Oct 23, 2013, 9:31 PM

Post #1 of 5 (1011 views)
Is oneline Perl actually faster than Shell+AWK?

Dear Gurus,

I started learning Shell+AWK a year ago to do some data manipulation work on big data. Recently I read some articles on the internet saying that Perl is much faster than shell scripting, so I wrote the following two short scripts to compare their speed.

However, the Perl result is very disappointing: the Perl one-liner turned out to be even slower than the Shell + awk command.


### TASK & DATASET ###

The following scripts are designed to output the records with p.value < 0.05. They go through 50,000 txt files; each file contains 10,000 records with two columns, p.value and ID.

### AWK (took ~3 hours) ###

for i in file_num*.txt
do
    awk 'BEGIN{OFS="\t"}{if($1<=0.05) print $1,$2}' "$i" >> ./pvalue/output.txt
done

### PERL one-liner (took more than 5 hours) ###

for i in file_num*.txt
do
    perl -lane 'print "$F[0]\t$F[1]" if $F[0] <= 0.05;' "$i" >> ./pvalue/output.txt
done

#####################

My questions are:

(a) Is a Perl one-liner not equivalent to a real Perl script?

(b) For plain data manipulation, is AWK simply better than Perl?

(c) Would it be worth learning C and writing C code to do data manipulation? I know it would be very difficult, but could it save me a lot of time in the future, since no language can be faster than C?

Any suggestion or feedback will be deeply appreciated!!


Kate


(This post was edited by Kate on Oct 23, 2013, 9:39 PM)


FishMonger
Veteran / Moderator

Oct 24, 2013, 6:33 AM

Post #2 of 5 (994 views)
Re: [Kate] Is oneline Perl actually faster than Shell+AWK?

The start-up cost of the perl interpreter is greater than awk's. You need to rework your test so that the perl one-liner is not being run inside a shell loop.

I have not tested this, but give it a try outside of the shell script.

Code
perl -lane 'BEGIN{@ARGV=<file_num*.txt>} print "$F[0]\t$F[1]" if $F[0]<.05'
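
The BEGIN block lets perl do the file globbing itself, so the 50,000 file names never have to be expanded on the shell command line; they are placed in @ARGV, and the implicit -n loop then reads all of them within a single interpreter launch.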


Answers to your questions:
A: The interpreter start-up cost would be the same, but in nearly all cases a full script would be doing more things, or doing them in a different way.

B: This question is too vague. In some cases awk would be "better"; in other cases Perl would be "better".

C: It's always worth learning another language. The benefits and drawbacks of writing in C as opposed to Perl depend on the task being done and how quickly you need the code completed. It's much faster and easier to write Perl scripts than C programs.


(This post was edited by FishMonger on Oct 24, 2013, 6:36 AM)


FishMonger
Veteran / Moderator

Oct 24, 2013, 6:42 AM

Post #3 of 5 (992 views)
Re: [Kate] Is oneline Perl actually faster than Shell+AWK?

It was implicit in my comment, but just in case you didn't catch my meaning: in your test you were launching the perl interpreter 50,000 times, passing in one file at a time. In my example I'm launching the interpreter only once and passing in all 50,000 files at once.
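
If you prefer a standalone script over a one-liner, the single-launch version would look roughly like this (untested sketch; the file glob and output path simply mirror your post):

Code
#!/usr/bin/perl
# Untested sketch: same filtering as the one-liner, written as a standalone script.
use strict;
use warnings;

open my $out, '>>', './pvalue/output.txt' or die "Cannot open output file: $!";

for my $file (glob 'file_num*.txt') {
    open my $in, '<', $file or die "Cannot open $file: $!";
    while (<$in>) {
        chomp;
        my ($p, $id) = split;              # column 1 = p.value, column 2 = ID
        print {$out} "$p\t$id\n" if $p <= 0.05;
    }
    close $in;
}

close $out;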


(This post was edited by FishMonger on Oct 24, 2013, 6:43 AM)


Laurent_R
Veteran / Moderator

Oct 24, 2013, 3:15 PM

Post #4 of 5 (979 views)
Re: [Kate] Is oneline Perl actually faster than Shell+AWK?


In Reply To

My questions are:

(a) Is a Perl one-liner not equivalent to a real Perl script?

(b) For plain data manipulation, is AWK simply better than Perl?

(c) Would it be worth learning C and writing C code to do data manipulation? I know it would be very difficult, but could it save me a lot of time in the future, since no language can be faster than C?


It really depends on what you are doing. And you have to compare things that can really be compared.

The shell is really much, much slower than Perl, no doubt about it, provided each tool is used correctly.

About 10 years ago, a colleague of mine and I had to filter about 100 million records to keep only those pertaining to the clients loaded into the test database (about 25% of the full client base). I wrote a Perl program that loaded 500,000 clients into a hash and then read the records, keeping only those of interest. My mistake was to assume that the 100 million records would arrive in one big file; in reality, they came in tens of thousands of small files. My colleague took my Perl program and wrote a shell script that called it for each incoming file, so my program had to be compiled and had to load the 500,000 clients into memory for every file, each of which held only a few hundred to a few thousand records. The result was (obviously) devastatingly slow, taking hours, and my colleague complained that my Perl program was slow. The program was not slow; it was just massively misused. When I realized what was going on, I changed my Perl script to load the relevant customers into the hash only once and then process all the files. Once that simple change was made, filtering the records took only a couple of minutes.
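
Something along these lines, to make the pattern concrete (untested sketch; all file names, paths and the record layout are made up for illustration):

Code
#!/usr/bin/perl
# Load the client keys once, then stream every incoming file through the same hash.
use strict;
use warnings;

my %wanted;
open my $clients, '<', 'test_clients.txt' or die "Cannot open client list: $!";
while (<$clients>) {
    chomp;
    $wanted{$_} = 1;                     # ~500,000 client IDs end up in the hash
}
close $clients;

for my $file (glob 'incoming/*.dat') {
    open my $in, '<', $file or die "Cannot open $file: $!";
    while (<$in>) {
        my ($client_id) = split /\t/;    # assume the client ID is the first field
        print if exists $wanted{$client_id};
    }
    close $in;
}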

I think I can claim to be a real expert on performance issues with large datasets, and I know that you just can't make general rules.

About ten years ago, I benchmarked a Perl program against an awk program; Perl was at least 5 times faster. I ran a similar test recently (a different version of Perl and, much more importantly, a different implementation of awk, not from the same vendor), and Perl was still faster, but by a much narrower margin (perhaps only 25%). I do not think that Perl's performance decreased; the recent awk implementation on AIX was simply better than whatever I had tried ten years ago (most probably on Solaris or HP-UX, but I don't remember for sure). Similarly, 10 years ago Perl was way faster than sed, yet I recently found cases of very simple scripts where the AIX implementation of sed turned out to be marginally faster than the equivalent Perl program (but only for very simple tasks).

However, generally speaking, and keeping in mind that I am working on extremely large files on a daily basis, Perl is really my tool of choice.

I could give you hundreds of examples, but I will give just one from yesterday and today. I was asked to help some colleagues who were using SQL and VB scripts on an Access database to process some moderately large data. The process was taking more than 24 hours to run, which was not acceptable in the context of the business process. I wrote a Perl program using hashes and working directly on the original flat files; it took less than a minute to generate the required output data.

Now, it is obviously possible to write a C program that will be faster than Perl, but you would first have to find libraries offering a good implementation of hashes and of the many other Perl features I used. Even then, the C program would still be at least ten or possibly twenty times longer (in lines of code) and probably much more difficult to debug. So where I spent two days writing the program in Perl, I might have needed a month (possibly more) to do the same thing in C. And don't assume that a C program is inherently faster than a Perl program; that is only true for extremely well written C programs, or for very specific CPU-intensive problems where C shines. I still use C once in a while for very specific issues, but Perl gives me more, at least 99.5% of the time.

And, BTW, probably more than 90% of the Human Genome Project was done in Perl. C programs would run faster on those huge amounts of data, but those C programs probably would not have been completed for another ten years. With Perl, the results are already there.

In brief, if you are really working on computationally intensive problems (finding the next Mersenne prime, multiplying matrices with millions of rows and columns, or preparing weather forecasts), C might be your best friend. For more common problems, Perl will give you far more, provided you know how to use it.


Kate
New User

Oct 24, 2013, 10:13 PM

Post #5 of 5 (971 views)
Re: [Laurent_R] Is oneline Perl actually faster than Shell+AWK?

Thank you FishMonger & Laurent_R!!
Your quick, professional feedback has stopped me from endlessly trying things.

So my program was starting Perl 50,000 times, and that's why the performance was not as good as I expected. I am learning how to use @ARGV now so that I can feed in all the files at once...
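
A tiny, untested illustration of that @ARGV behaviour (the threshold and column layout just mirror the files described in the first post, and the script name is made up):

Code
#!/usr/bin/perl
# File names given on the command line land in @ARGV, and the <> operator
# reads every line of every file listed there, all in one perl launch.
use strict;
use warnings;

warn scalar(@ARGV), " files to process\n";

while (<>) {
    my ($p, $id) = split;
    print "$p\t$id\n" if $p <= 0.05;
}

# Usage: perl filter.pl file_num*.txt >> ./pvalue/output.txt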

Moreover, it seems that Perl can be very powerful for handling large datasets. I used to use R for input/output and data manipulation, but it is no longer enough now that I have to deal with larger datasets in my work (bioinformatics). I think I will start with Perl hash tables, but probably won't dig too deep into C at this stage.

I am very lucky to have found this forum for Perl beginners.
Thanks again for your kind assistance.

Kate


(This post was edited by Kate on Oct 24, 2013, 10:15 PM)

 
 

