Help on zip streaming??

 



bhkpop
Novice

Jan 31, 2009, 12:34 AM

Post #1 of 15 (3550 views)
Help on zip streaming??

Hi guys,
I hope you can enlighten me.
I have a big zip file that I extract manually. My program then traverses the extracted folder and reads each file.
When I compared notes with a friend, his program traversed and read those files much faster. He was using Java and I was using Perl. It turns out that he reads the zip directly through an input stream; in other words, he doesn't extract it.

I searched for a Perl library that can read a zip and traverse its members, and found IO::Uncompress::Unzip. I didn't add any processing; I just ran the example from the library documentation against my zip file. However, it is still far slower than my friend's program, even though his reads the zip stream and also parses the files inside.
For your info: he can do it in 11 seconds, while the library I used (without any processing) takes 30+ seconds. I've searched and googled but finally decided to ask here.
Btw, uncompressing is not needed.

Any help is really appreciated.
Thanks
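
For readers following along, the kind of member-by-member read described above might look roughly like the sketch below. It follows the loop pattern from the IO::Uncompress::Unzip documentation; the helper name scan_zip and the byte counting are illustrative assumptions, not the code bhkpop actually ran, and nothing here implies it will match the Java timings.

```perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw($UnzipError);

# Hypothetical helper: read every member of a zip archive as a stream,
# without extracting anything to disk.
# Returns a hash ref mapping member name => number of uncompressed bytes read.
sub scan_zip {
    my ($zipfile) = @_;
    my %bytes_for;

    my $u = IO::Uncompress::Unzip->new($zipfile)
        or die "Cannot open $zipfile: $UnzipError\n";

    my $status;
    for ($status = 1; $status > 0; $status = $u->nextStream()) {
        my $name = $u->getHeaderInfo()->{Name};

        my $buffer;
        my $total = 0;
        while (($status = $u->read($buffer)) > 0) {
            $total += length $buffer;    # real processing would go here
        }
        last if $status < 0;             # stop on a read error
        $bytes_for{$name} = $total;
    }
    die "Error reading $zipfile: $UnzipError\n" if $status < 0;

    return \%bytes_for;
}

# Usage: perl scan_zip.pl archive.zip
if (@ARGV) {
    my $sizes = scan_zip($ARGV[0]);
    printf "%8d  %s\n", $sizes->{$_}, $_ for sort keys %$sizes;
}
```

Because the members are read in archive order through a single handle, this approach works on non-seekable input, at the cost of decompressing every member it passes over.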


KevinR
Veteran


Jan 31, 2009, 12:59 AM

Post #2 of 15 (3549 views)
Re: [bhkpop] Help on zip streaming??

Have you tried both codes on the same computer?
-------------------------------------------------


bhkpop
Novice

Jan 31, 2009, 1:21 AM

Post #3 of 15 (3548 views)
Re: [KevinR] Help on zip streaming??


In Reply To
Have you tried both codes on the same computer?

Hi, thanks for the reply.
The answer to your question is yes, and it does not show any significant difference.
I'm still using the version that traverses the extracted zip, which takes a long time to process.
I'm trying to use Archive::Zip, however I read on CPAN that


Quote
readFromFileHandle( $fileHandle, $filename )

Read zipfile headers from an already-opened file handle, appending new members. Does not close the file handle. Returns AZ_OK or error code. Note that this requires a seekable file handle; reading from a stream is not yet supported.


Any idea on handling that? Thank you.


(This post was edited by bhkpop on Jan 31, 2009, 1:23 AM)


FishMonger
Veteran / Moderator

Jan 31, 2009, 4:56 AM

Post #4 of 15 (3544 views)
Re: [bhkpop] Help on zip streaming??

Since you haven't shown us any code, it's impossible for us to say why your script is running slow.

You need to profile your script to see where it spends most of its time.

Devel::Profile
http://search.cpan.org/~jaw/Devel-Profile-1.05/Profile.pm


bhkpop
Novice

Feb 1, 2009, 2:48 AM

Post #5 of 15 (3535 views)
Re: [FishMonger] Help on zip streaming??

This is the latest code that I used. However, it takes forever, even longer than just crawling the extracted files.

I deleted the code to avoid confusion; I have uploaded it as an attachment in a later post instead.

Thank you in advance for the reply.


(This post was edited by bhkpop on Feb 2, 2009, 6:13 AM)


FishMonger
Veteran / Moderator

Feb 1, 2009, 8:17 AM

Post #6 of 15 (3528 views)
Re: [bhkpop] Help on zip streaming??

How many files are in the archive?

How many of those files are .txt files?

What is the average size of each of those files?

What do you need to do with those files?

Your usage of Benchmark is giving you the runtime of the bulk of the script, which doesn't help to narrow down the problem. Timing each step would be better, but still may not be enough.

The Benchmark module is best used to compare the runtime of different sets of code that accomplish the same thing, so you can determine which is more efficient.

If you simply want to time sections of code, it would be better to use Time::HiRes instead of Benchmark.
http://search.cpan.org/~jhi/Time-HiRes-1.9719/HiRes.pm

Have you tried to profile the script with Devel::Profile as I previously recommended?

Your if block to check if the code is running is pointless because it will never evaluate to true.
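
As an illustration of the per-step timing suggested above, a minimal Time::HiRes sketch could look like this. The helper time_step and the two stand-in stages are hypothetical; in the real script the stages would be things like opening the archive, reading members, and tokenizing.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: run a code ref, print how long it took, and
# return (elapsed wall-clock seconds, whatever the code ref returned).
sub time_step {
    my ($label, $code) = @_;
    my $t0      = [gettimeofday];
    my @result  = $code->();
    my $elapsed = tv_interval($t0);      # seconds since $t0, as a float
    printf "%-12s %.3f s\n", $label, $elapsed;
    return ($elapsed, @result);
}

# Stand-in stages; replace the sub bodies with the real work being measured.
my ($t_build, @numbers) = time_step('build list', sub { map { $_ * 2 } 1 .. 100_000 });
my ($t_sum,   $sum)     = time_step('sum list',   sub { my $s = 0; $s += $_ for @numbers; $s });
```

Timing each stage separately shows which one dominates, which a single total for the whole script cannot.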


bhkpop
Novice

Feb 1, 2009, 8:20 PM

Post #7 of 15 (3519 views)
Re: [FishMonger] Help on zip streaming??


In Reply To
How many files are in the archive?

How many of those files are .txt files?

What is the average size of each of those files?

What do you need to do with those files?

Your usage of Benchmark is giving you the runtime of the bulk of the script, which doesn't help to narrow down the problem. Timing each step would be better, but still may not be enough.

The Benchmark module is best used to compare the runtime of different sets of code that accomplish the same thing, so you can determine which is more efficient.

If you simply want to time sections of code, it would be better to use Time::HiRes instead of Benchmark.
http://search.cpan.org/~jhi/Time-HiRes-1.9719/HiRes.pm

Have you tried to profile the script with Devel::Profile as I previously recommended?

Your if block to check if the code is running is pointless because it will never evaluate to true.



It is 46 MB of text files, 1-48 KB each.
There are about 37526 text files. I need to parse them and then build a list of tokenized words from the content of the text tag.
I'm trying the Time::HiRes module you proposed as I reply to your post.
I will post here about anything I don't understand.
Sorry, but I don't really understand what you mean in your last sentence. Btw, thank you for your guidance.
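
The parsing task described here (pull the content of the text tag out of each file, then split it into words) might be sketched as follows. The thread never shows the file format, so the <text>...</text> markup, the lowercasing, and the helper name tokenize_text_tag are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lowercase word tokens found inside
# every <text>...</text> block of a document string.
# Assumes simple, non-nested tags; matching is case-insensitive.
sub tokenize_text_tag {
    my ($content) = @_;
    my @tokens;
    while ($content =~ m{<text>(.*?)</text>}gis) {
        my $body = $1;                           # copy before the next match
        push @tokens, map { lc $_ } $body =~ /(\w+)/g;
    }
    return @tokens;
}

# Made-up sample document, not taken from the real data set.
my $sample = "<doc><docno>X1</docno><text>Hello zip world.\nHello again!</text></doc>";
my @tokens = tokenize_text_tag($sample);
print join(' ', @tokens), "\n";
```

The same function could be applied to the contents of a file on disk or to a member read from the archive, so the tokenizing cost can be measured separately from the file access cost.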


(This post was edited by bhkpop on Feb 2, 2009, 5:15 AM)


FishMonger
Veteran / Moderator

Feb 1, 2009, 10:02 PM

Post #8 of 15 (3513 views)
Re: [bhkpop] Help on zip streaming??


Quote
Sorry, but I don't really understand what you mean in your last sentence.

In Reply To
Your if block to check if the code is running is pointless because it will never evaluate to true.




Code
$i=0;  
$j=1;
if ($i==($j*1000)) {

Since $i starts out as zero and is never incremented, and $j starts out as 1 and is incremented at each iteration, when will $i (zero) ever be equal to $j * 1000 (a number greater than zero)? Answer: NEVER!

Having nearly 40,000 files to process with an average of less than 200 words each is a clear indication of a poorly designed storage scheme.

Please post the testing.pl script that you tested so I can see what it's doing and post the prof.out file that the Devel::Profile created.
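
For comparison, a counter check that does fire might look like the sketch below. It assumes the intent of the original if block was to report progress every 1000 files; the helper name count_with_progress is hypothetical, and a modulus test replaces the separate $j multiplier.

```perl
use strict;
use warnings;

# Hypothetical helper: walk a list, report progress once every $step items,
# and return how many progress reports were printed.
sub count_with_progress {
    my ($items, $step) = @_;
    my $seen    = 0;
    my $reports = 0;
    for my $item (@$items) {
        $seen++;                         # the increment the posted snippet lacked
        if ($seen % $step == 0) {
            print "processed $seen items\n";
            $reports++;
        }
    }
    return $reports;
}

# 3766 matches the member count of the test archive mentioned in the thread.
my $reports = count_with_progress([1 .. 3766], 1000);
```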


bhkpop
Novice

Feb 2, 2009, 5:08 AM

Post #9 of 15 (3503 views)
Re: [FishMonger] Help on zip streaming??

It is purely my mistake. There actually is an $i++. When I posted the code here I deleted a commented-out line, and accidentally deleted that line too, even though it is part of my script.

I've tried two methods to retrieve the txt files: File::List and Archive::Zip. The difference is reading extracted data versus zipped data.
I guessed zip streaming would be faster than reading every extracted file. I used Archive::Zip but can't seem to make it as fast as I think it should be.

About testing.pl, that was my mistake. I posted the wrong profile.
I have edited the post and attached the correct one.
Here I didn't use the full data set; I zipped only some of the files (4 MB of files).
-----------------------------------------------------------
C:\tmp>perl -d:Profile unzipper02.pl
Extracting ziptest03.zip (3766)...
extract: 293.752 wallclock secs (247.73 usr + 42.86 sys = 290.59 CPU))
-----------------------------------------------------------


(This post was edited by bhkpop on Feb 2, 2009, 6:09 AM)
Attachments: prof.out (44.7 KB)
  unzipper02.pl (0.79 KB)


FishMonger
Veteran / Moderator

Feb 2, 2009, 6:06 AM

Post #10 of 15 (3498 views)
Re: [bhkpop] Help on zip streaming??

I'm trying my best to help you but it's impossible to do so when you don't provide the proper info.


Quote
About testing.pl, that was my mistake. I posted the wrong profile.
I have edited the post and attached the correct one.


Where did you post it? It's not in this forum?

Posting the total execution time doesn't help in troubleshooting slowness of the script and since the script you ran is different from the one you posted, I don't even know exactly what you're doing in the script.


bhkpop
Novice

Feb 2, 2009, 6:11 AM

Post #11 of 15 (3496 views)
Re: [FishMonger] Help on zip streaming??


In Reply To
I'm trying my best to help you but it's impossible to do so when you don't provide the proper info.


Quote
About testing.pl, that was my mistake. I posted the wrong profile.
I have edited the post and attached the correct one.


Where did you post it? It's not in this forum?

Posting the total execution time doesn't help in troubleshooting slowness of the script and since the script you ran is different from the one you posted, I don't even know exactly what you're doing in the script.


Thank you for your response.
I've uploaded both the script and the profile output as attachments just now.
Actually, the script I asked you about is just a small part, which parses the content of the <text> tag from a collection of txt files.
After that I still have further processing to do, but I haven't written it yet because, as you know, I'm having trouble with this first step. If you have anything that can enlighten me, please do let me know, FishMonger.


(This post was edited by bhkpop on Feb 2, 2009, 6:17 AM)


FishMonger
Veteran / Moderator

Feb 2, 2009, 8:26 AM

Post #12 of 15 (3486 views)
Re: [bhkpop] Help on zip streaming??

Is there a particular reason you want to access the files in this manner rather than unzipping the archive once and working with the unzipped files?

Can you post the zip file so I can run a few tests?


bhkpop
Novice

Feb 2, 2009, 10:16 AM

Post #13 of 15 (3483 views)
Re: [FishMonger] Help on zip streaming??

My friend uses Java to do this parsing, and he did it with an input stream library. The result is super fast: he accessed the zip file and tokenized it within 11 seconds.
Actually, I have already coded the version that works on the extracted files, and the result is not nearly that fast (maybe because I'm such a newbie in Perl, though I used C in my college years).
I wanted to try accessing the zip directly in Perl, hoping it could be as fast or even faster, because I know that Perl is simple and powerful, yet confusing (currently, for me ^^v).

I have tried everything, and I read that reading through a zip file as a stream with the Archive::Zip library is not doable at this moment.
If you think that accessing files/members inside a zip file in Perl would take longer than crawling through a directory, please let me know and we should end this, so I can improve my other code and you won't have to waste your time helping me with this code ^^!
I don't want you to go to all that trouble for nothing, since you've been so helpful.

That said, I'm still going to upload the zip file for you (just in case you still need it).


Code
http://rapidshare.com/files/193011052/ziptest03.zip


Thanks again..


(This post was edited by bhkpop on Feb 2, 2009, 10:19 AM)


FishMonger
Veteran / Moderator

Feb 2, 2009, 1:33 PM

Post #14 of 15 (3477 views)
Re: [bhkpop] Help on zip streaming??

Since Archive::Zip only supports streaming on writing, not reading, it would be far more efficient to unzip the files before processing their data.

Here's a benchmark test which shows that accessing the unzipped files with File::Find is much faster (over 3,000% faster) than accessing them inside the zip archive. The File::Find approach only takes a couple of seconds to access all 3,766 files.

Code
#!/usr/bin/perl

use strict;
use warnings;
use File::Find;
use Archive::Zip qw(:ERROR_CODES);
use Archive::Zip::MemberRead;
use Benchmark qw(:all);

# Compare opening every .txt file on disk against opening it inside the zip.
# A negative count tells cmpthese to run each sub for at least 50 CPU seconds.
cmpthese(-50, {
    'fishmonger' => \&file_find,
    'bhkpop'     => \&archive_zip,
});

# Walk the already-unzipped tree with File::Find.
sub file_find {
    my $t0 = Benchmark->new;
    find(\&process_files, 'c:/test/bhkpop');

    my $t1 = Benchmark->new;
    my $extract_time = timestr(timediff($t1, $t0));
    print "file_find extract: $extract_time\n";
}

# Open each .txt member through Archive::Zip::MemberRead.
sub archive_zip {
    my $t0 = Benchmark->new;
    my $zipfile = "c:/test/bhkpop/tdt3.zip";    # this is 46mb of files
    my $zip = Archive::Zip->new();
    $zip->read($zipfile) == AZ_OK or die "Error reading $zipfile\n";

    for my $member ($zip->membersMatching('\.txt$')) {
        my $fh = Archive::Zip::MemberRead->new($zip, $member->fileName());
        $fh->close();
    }

    my $t1 = Benchmark->new;
    my $extract_time = timestr(timediff($t1, $t0));
    print "archive_zip extract: $extract_time\n";
}

# File::Find callback: open (but do not read) each .txt file.
sub process_files {
    return unless /\.txt$/;
    open my $FH, '<', $_ or die "Can't open $File::Find::name $!";
    close $FH;
}



bhkpop
Novice

Feb 6, 2009, 4:13 AM

Post #15 of 15 (3433 views)
Re: [FishMonger] Help on zip streaming??

Hi FishMonger,
Sorry for the late reply; I've been busy with other things.
I have tried the code and decided not to debug the zip problem any further.
Like you said, almost 3000% faster ^^!
I will post a new thread if I run into any new problems with the new method. I hope you will still be willing to help.
Thanks fishmonger.


(This post was edited by bhkpop on Feb 6, 2009, 4:13 AM)

 
 

