Grepping a large number of keywords in a large number of files

millan
New User

May 16, 2013, 11:38 PM

Post #1 of 7 (1964 views)
Grepping a large number of keywords in a large number of files

I have the following requirement:

1> around 500 files with the extensions .rex, .fmt, and .pld, whose names are stored in @onlyFiles
2> around 8000 keywords, stored in the array @keyword

I want to search for all 8000 keywords in all of the files. If any keyword is found in a file, the file's extension should change from .rex to .txt, .fmt to .fmg, and so on, and the file name should be put into the database.

But it is taking too long. Could someone please help me fine-tune the code below so that it runs faster? Or could I use multithreading for this, and how would I do that? (I have put a rough sketch of what I was thinking after my current code.)

Thank you in advance.

Code
foreach my $curFile (@onlyFiles) {
    foreach my $row (@keyword) {
        # record files that cannot be opened and skip them
        open( FILE, $curFile ) or do {
            push( @errorOpenFile, $curFile );
            next;
        };

        if ( grep { /$row/ } <FILE> ) {
            my $filename = basename $curFile;
            my $out1     = substr( $filename, -3 );

            if ( $out1 eq "rex" ) {
                chop($filename);
                chop($filename);
                chop($filename);
                $filename = $filename . "txt";
            }
            elsif ( $out1 eq "fmt" ) {
                chop($filename);
                chop($filename);
                chop($filename);
                $filename = $filename . "fmg";
            }
            elsif ( $out1 eq "pld" ) {
                chop($filename);
                chop($filename);
                chop($filename);
                $filename = $filename . "pl";
            }
            else {
                $filename = $filename . "sh";
            }

            my $sth_det1 = $dbh->prepare($FILE_MATCH);
            $sth_det1->bind_param( 1, $filename );
            $sth_det1->execute();
        }
        close(FILE);
    }
}
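
For the multithreading part, this rough sketch is what I had in mind, using the Parallel::ForkManager module to search the files in parallel (untested; the number of parallel processes is just a guess):

Code
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(4);    # up to 4 children at once

foreach my $curFile (@onlyFiles) {
    my $pid = $pm->start and next;    # parent moves on to the next file
    # ... open $curFile, search the keywords, update the database ...
    # note: each child would need its own $dbh database connection
    $pm->finish;                      # child exits when done
}
$pm->wait_all_children;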


Kenosis
User

May 17, 2013, 11:32 AM

Post #2 of 7 (1945 views)
Re: [millan] Grepping a large number of keywords in a large number of files

Here's one option:

Code
use strict;
use warnings;
use Data::Dumper;

my @onlyFiles  = qw/words.rex print.fmt results.rex/;
my %extensions = ( rex => 'txt', fmt => 'fmg' );
my %keywords   = map { lc $_ => 1 } qw/these are my Keywords/;

FILE: for my $i ( 0 .. @onlyFiles - 1 ) {
    open my $fh, '<', $onlyFiles[$i] or die $!;

    while (<$fh>) {
        # split the line into words; match if any word is a keyword
        if ( grep $keywords{lc $_}, split ) {
            close $fh;
            $onlyFiles[$i] =~ s/.+\.\K([^.]+)$/$extensions{$1} || $1/e;
            next FILE;
        }
    }

    close $fh;
}

print Dumper \@onlyFiles;

Output from my testing:

Code
$VAR1 = [
          'words.txt',
          'print.fmg',
          'results.rex'
        ];

Of course you'll want to remove:

Code
my @onlyFiles  = qw/words.rex print.fmt results.rex/;

and replace

Code
qw/these are my keywords/

with

Code
@keyword

Additionally, remove use Data::Dumper, as that module was only used for testing purposes.

The script creates a hash of your keywords. It then reads all of the files' lines, splitting each line into "words." The grep checks whether any one of those words is a keyword. (This check is case-insensitive; remove the two lcs if you need it to be case-sensitive.) If a keyword is found, the file is closed, the file's extension is changed in the array, and the next file is processed.
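
To make the word lookup concrete, here's a small illustration (the sample line and keywords are made up):

Code
my %keywords = map { lc $_ => 1 } qw/these are my Keywords/;

my $line = 'Are THESE the files we want?';
my @hits = grep { $keywords{lc $_} } split ' ', $line;
# @hits is ('Are', 'THESE'): lc on both sides makes the match
# case-insensitive. Note that split leaves punctuation attached,
# so 'want?' would not match a keyword 'want'.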

Hope this helps!


(This post was edited by Kenosis on May 17, 2013, 11:48 AM)


Laurent_R
Enthusiast / Moderator

May 17, 2013, 3:22 PM

Post #3 of 7 (1935 views)
Re: [millan] Grepping a large number of keywords in a large number of files

The hash solution proposed by Kenosis is probably good. You have to try it and see if it is fast enough.

I just want to comment on your code.


Code
if ( grep { /$row/ } <FILE> ) {

    $filename = basename $curFile;
    $out1     = substr( $filename, -3 );

    if ( $out1 eq "rex" ) {
        chop($filename);
        chop($filename);
        chop($filename);
        $filename = $filename . "txt";
    }
    elsif ( $out1 eq "fmt" ) {
        chop($filename);
        chop($filename);
        chop($filename);
        $filename = $filename . "fmg";
    }
    elsif ( $out1 eq "pld" ) {
        chop($filename);
        chop($filename);
        chop($filename);
        $filename = $filename . "pl";
    }


This is very inefficient coding (not in terms of performance, where it probably has no measurable impact, but in terms of the number of useless code lines you are writing).

As a starting point, in

Code
elsif ( $out1 eq "fmt" ) {
    chop($filename);
    chop($filename);
    chop($filename);
    $filename = $filename . "fmg";
}

why do you remove three characters from the name when you really only need to change the last one?

Since you are already using the File::Basename module, you could get the suffix from that same module, rather than using the substr function.
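
For example (a quick untested sketch; File::Basename's fileparse takes a list of suffix patterns):

Code
use File::Basename;

# split the path into name, directory and suffix in one call
my ( $name, $path, $suffix ) = fileparse( $curFile, qr/\.[^.]*/ );
# for "/some/dir/report.rex": $name   is "report",
#                             $path   is "/some/dir/",
#                             $suffix is ".rex"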

Also, in such a simple case, you could reduce the whole code quoted above to just three substitution lines, something like this:


Code
if ( grep { /$row/ } <FILE> ) {
    $filename =~ s/\.rex$/.txt/;
    $filename =~ s/\.fmt$/.fmg/;
    $filename =~ s/\.pld$/.pl/;
}



millan
New User

May 17, 2013, 6:48 PM

Post #4 of 7 (1928 views)
Re: [Kenosis] Grepping a large number of keywords in a large number of files

Thank you, Kenosis, for this prompt reply.
Can you please explain these things in the code?

1> What does lc $_ do in this code?

2> $onlyFiles[$i] =~ s/.+\.\K([^.]+)$/$extensions{$1} || $1/e


recruiter
User

May 17, 2013, 8:14 PM

Post #5 of 7 (1922 views)
Re: [millan] Grepping a large number of keywords in a large number of files

Another suggestion would be to throw away your array list of filenames and use File::Find::Rule to build it:


Code
use File::Find::Rule;

my $dir = '/path/to/your/files';

my @onlyfiles = File::Find::Rule->file
                                ->name(qw(*.rex *.fmt *.pld))
                                ->in($dir);



BillKSmith
Veteran

May 18, 2013, 1:51 PM

Post #6 of 7 (1899 views)
Re: [millan] Grepping a large number of keywords in a large number of files

Here is one more method to consider. I expect 'any' to be faster than grep because it can terminate as soon as it finds one match. By using an array of regexes, they only have to be compiled once. Even more time could probably be saved by combining groups of keywords (joined with '|') in each regex; there is a rough sketch of that after the code. Slurping the files should be faster than reading line-by-line.


Code
use strict;
use warnings;
use File::Basename;
use List::MoreUtils qw(any);

my @onlyFiles = qw/words.rex print.fmt results.rex/;
my @keywords  = map { qr/$_/ } qw/these are my Keywords/;
my @errorOpenFiles;
my %hash = ( rex => 'txt', fmt => 'fmg', pld => 'pl' );

foreach my $curFile (@onlyFiles) {
    my $status = open my $FILE, '<', $curFile;
    unless ($status) {    # record the file and move on if the open failed
        push( @errorOpenFiles, $curFile );
        next;
    }
    my $content;
    { local $/ = undef; $content = <$FILE>; }    # slurp the whole file
    close $FILE;
    my $filename = basename $curFile;
    if ( any { $content =~ $_ } @keywords ) {
        unless ( $filename =~ s/(rex|fmt|pld)$/$hash{$1}/e ) {
            $filename =~ s/(...)$/sh/;    # fall back to the .sh rule
        }
        # keep the renamed file in its original directory
        rename $curFile, dirname($curFile) . '/' . $filename;
    }
}

if (@errorOpenFiles) {
    print STDERR "The following files were not processed due to open errors\n";
    local $, = "\n";
    print STDERR @errorOpenFiles, "\n";
}
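
A rough sketch of the keyword-combining idea (untested; the chunk size of 500 is an arbitrary choice):

Code
# Build a handful of large alternation patterns instead of one
# pattern per keyword; fewer regexes means fewer passes per file.
my @kw = @keyword;    # work on a copy so @keyword is preserved
my @patterns;
while ( my @chunk = splice @kw, 0, 500 ) {
    my $alt = join '|', map { quotemeta } @chunk;
    push @patterns, qr/$alt/;
}

# then, against the slurped content:
# if ( any { $content =~ $_ } @patterns ) { ... }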

Good Luck,
Bill


Kenosis
User

May 18, 2013, 3:41 PM

Post #7 of 7 (1897 views)
Re: [millan] Grepping a large number of keywords in a large number of files

You're most welcome, millan!

lc $_ converts the contents of $_ to all lowercase; those contents come either from your list of keywords or from a file's line being split. This was done so that case is not significant in your search for keywords. If case is significant, remove both lcs.

The substitution:

Code
$onlyFiles[$i] =~ s/.+\.\K([^.]+)$/$extensions{$1} || $1/e

Reading it piece by piece:

- $onlyFiles[$i] - an element of the files array
- s/ - begin substitution
- .+ - match all (almost)
- \. - the dot before the extension
- \K - keep all before this point
- ([^.]+)$ - capture all that's not a period, up to the end of the string
- $extensions{$1} - the captured extension used as the key to get the corresponding value from the hash of extensions
- || $1 - keep the original extension, just in case one is captured that's not in the hash
- e - execute the code in the substitution
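
A quick demonstration of what \K does (the file name is made up):

Code
my $file = 'report.final.rex';
$file =~ s/.+\.\K([^.]+)$/txt/;
# $file is now 'report.final.txt': everything matched before \K is
# kept as-is, so only the trailing extension is replaced.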


 
 

