A challenge: Email Extractor from all directories to text file???

 



BrightNail
Novice

Apr 21, 2002, 7:46 PM

Post #1 of 10 (1921 views)

Hey all,

I have a dynamic, self-running website, and in many of its directories people have put their email addresses. Is there a way to run a script that goes through ALL the directories under the root directory, reads all the ".html" pages in every directory and sub-directory (not just one directory), extracts the email addresses, and writes them all to ONE text file without repeating any address?

I see this as rather difficult. Any ideas?


uri
Thaumaturge

Apr 21, 2002, 9:04 PM

Post #2 of 10 (1919 views)
Re: [BrightNail] A challenge: Email Extractor from all directories to text file???

trivial. use File::Find and for each html file found, slurp it in, grab all the emails with a regex and store them in a hash. then print the hash keys to the file.

under 10 lines of code max.
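
The steps described above (walk the tree, slurp each HTML file, match addresses, keep them as hash keys) might be sketched roughly as follows. This is only an illustration: the helper name collect_emails, the address pattern, and the choice to lowercase addresses before deduplicating are assumptions, not something given in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: collect unique addresses from every .html/.htm file
# under $root. The pattern is deliberately simple (an assumption), not a
# full RFC-compliant address matcher.
sub collect_emails {
    my ($root) = @_;
    my %seen;    # hash keys are unique, so repeats collapse automatically
    find(sub {
        # File::Find sets $_ to the file name and chdirs into its directory.
        return unless -f $_ && /\.html?\z/i;
        open(my $fh, '<', $_)
            or do { warn "Cannot read $File::Find::name: $!\n"; return };
        my $text = do { local $/; <$fh> };    # slurp the whole file
        close($fh);
        return unless defined $text;          # empty file
        $seen{lc $1} = 1 while $text =~ /([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/g;
    }, $root);
    return sort keys %seen;
}

print "$_\n" for collect_emails(shift(@ARGV) || '.');
```

Run with a start directory as the only argument; without one it scans the current directory.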


Jean
User


Apr 21, 2002, 11:18 PM

Post #3 of 10 (1913 views)
Re: [BrightNail] A challenge: Email Extractor from all directories to text file???

Doesn't seem like a challenge to me either ;-) ...



You'll have to improve the email regexp that is passed to the function (grabs too much junk), but in general it works...



[perl]
#!/usr/bin/perl -w
# ------------------------------------------------------------------------
use strict;

my $StartDir = shift(@ARGV) || '.';
my %data;    # Found addresses are stored as hash keys, so repeats collapse.

sub ScanDir($$;$);    # Predeclared because ScanDir calls itself.

############################################################################
sub ScanFile($$)
############################################################################
{
    my $file = shift(@_);
    my $expr = shift(@_);

    # Progress goes to STDERR so that STDOUT holds only the addresses.
    print STDERR "FILE: $file\n";
    open(my $fh, '<', $file) or die "Error opening file $file - $!\n";
    while (my $line = <$fh>) {
        while ($line =~ /($expr)/g) {
            $data{$1} = 1;
        }
    }
    close($fh);
}

############################################################################
sub ScanDir($$;$)
############################################################################
{
    my $dir      = shift(@_);
    my $filetype = shift(@_);
    my $expr     = shift(@_) || '.*';

    my @dirs;

    # Scan the passed dir.
    my $dh;
    if (!opendir($dh, $dir)) {
        warn "Unable to open directory $dir - $!\n";
        return;
    }

    # defined() keeps a file named "0" from ending the loop early.
    while (defined(my $file = readdir($dh))) {
        next if $file =~ /^\.{1,2}$/;    # Do nothing in case of '.' or '..'
        # Add path to the found file name.
        $file = ($dir =~ /\/$/) ? $dir . $file : "$dir/$file";
        if (-d $file) {    # Take care of subdirs.
            push(@dirs, $file);
        }
        elsif ((-f $file) && ($file =~ /$filetype/)) {
            ScanFile($file, $expr);
        }
    }
    closedir($dh);

    for my $subdir (@dirs) {
        # Call the function recursively in order to scan found subdirs.
        ScanDir($subdir, $filetype, $expr);
    }
}

############################################################################
# main()
############################################################################
{
    # [root dir] [file type regex] [email regex]
    # The file type regex is anchored so that e.g. "page.html.bak" is skipped.
    ScanDir($StartDir, '.+\.(html|htm)$', '[^\s]+@[^\s.]+\.[^\s]+');
    for my $key (keys %data) {
        print "$key\n";
    }
    print STDERR "Done.\n";
}
[/perl]
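
To see what "grabs too much junk" can mean in practice, here is a small comparison of the pattern passed to ScanDir with a tighter one. The tighter pattern and the helper name matches are assumptions made for illustration; real-world address matching is messier than either pattern.

```perl
use strict;
use warnings;

# Both patterns are compiled with qr// so they can be passed around the
# same way ScanDir passes $expr.
my $loose  = qr/[^\s]+\@[^\s.]+\.[^\s]+/;          # pattern from the post
my $strict = qr/[\w.+-]+\@[\w-]+(?:\.[\w-]+)+/;    # assumed tighter pattern

# Hypothetical helper: return every match of $re in $text.
sub matches {
    my ($re, $text) = @_;
    my @found;
    push @found, $1 while $text =~ /($re)/g;
    return @found;
}

my $line = '<a href="mailto:jean@example.com">mail me</a>';
print "loose:  $_\n" for matches($loose,  $line);
print "strict: $_\n" for matches($strict, $line);
```

The loose pattern accepts any run of non-whitespace characters around the @, so surrounding HTML markup ends up in the match; the tighter one limits both sides to characters that commonly appear in addresses.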


Jean Spector
SQA Engineer @ Exanet
jean.spector@softhome.net


There are only 10 types of people in the world -
Those who understand binary, and those who don't.


mhx
Enthusiast / Moderator

Apr 21, 2002, 11:20 PM

Post #4 of 10 (1912 views)
Re: [uri] A challenge: Email Extractor from all directories to text file???


In Reply To
under 10 lines of code max.


Indeed... ;-)


Code
#!/usr/bin/perl -w
use IO::File;
use File::Find;
use Email::Find;
find( sub { /^.*\.html\z/s and
            Email::Find->new( sub { $addr{$_[1]}++ } )
                       ->find( do {local $/; \IO::File->new($_)->getline} )
          }, '/path/to/html' );
print "$_\n" for sort keys %addr;


-- mhx

At last with an effort he spoke, and wondered to hear his own words, as if some other will was using his small voice. "I will take the Ring," he said, "though I do not know the way."

-- Frodo



uri
Thaumaturge

Apr 21, 2002, 11:43 PM

Post #5 of 10 (1909 views)
Re: [mhx] A challenge: Email Extractor from all directories to text file???

and you can lose the IO::File line by using this trick to
slurp in files:

(assuming your surrounding code)


Code
 
do{ local( @ARGV, $/ ) = $_ ; <> }
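
For anyone puzzling over the one-liner above, here is the same idiom wrapped in a small sub and exercised on a temporary file. The helper name slurp and the temporary file are assumptions added for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper wrapping the idiom: return the full contents of $path.
sub slurp {
    my ($path) = @_;
    # local() restores @ARGV and $/ when the block exits. The list
    # assignment puts $path into @ARGV and leaves $/ undefined, so the
    # <> read returns the whole file in one piece.
    return do { local (@ARGV, $/) = $path; <> };
}

# Demonstration on a throwaway file.
my ($fh, $name) = tempfile(UNLINK => 1);
print $fh "line one\nline two\n";
close($fh);

my $text = slurp($name);
print $text;
```

The idiom relies on <> reading from the files named in @ARGV, which is why no explicit open is needed.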


BrightNail
Novice

Apr 22, 2002, 12:08 AM

Post #6 of 10 (1906 views)
Re: [Jean] A challenge: Email Extractor from all directories to text file???

Damn, did you just whip that up? I don't get it... a "QA engineer"? Shouldn't you be a programmer? Ahahahah, I don't understand half your code...

Man oh man, what are you passing in that first subroutine... ($$;$)? I don't understand that...

Geesh, I have a lot to learn. Thanks EVERYONE, I REALLY APPRECIATE IT!


Jean
User


Apr 22, 2002, 12:38 AM

Post #7 of 10 (1905 views)
Re: [BrightNail] A challenge: Email Extractor from all directories to text file???

Hey BrightNail,

First of all, there are people like mhx who solve the same problem in a tenth of the lines ;-) and I don't understand all of it either :-). The only thing I can say in my defense is that I prefer to write all the code myself: you learn more that way (and, of course, leave more room for error).

Regarding the ($$;$): perlguru is a fine place to ask questions ;-)

Declaring a sub with parentheses gives it a prototype, which tells Perl how many arguments the sub accepts and of what kind. Anything after the semicolon is optional, so ($$;$) means that the sub accepts 2 or 3 scalars. Say, (\%;$$$) declares a sub that accepts a hash (passed as a reference) and, optionally, up to 3 scalars. A plain % or @ swallows all the remaining arguments, so it only makes sense as the last item of a prototype.

Oh, sorry, now I understand: if you're talking about the first sub ScanDir($$;$); it's not a call, it's a sub predeclaration. ScanDir is a recursive sub, i.e. it calls itself for every subdirectory found, and Perl has to see the prototype before that call in order to check it. So you predeclare the sub; just make sure the prototype is identical to the one in the actual sub definition...

Hope this helps, pal :-) :-) :-)
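
A short sketch of how a ($$;$) prototype behaves may help here. The sub name describe is made up for the example; only the prototype itself comes from the script discussed in this thread.

```perl
use strict;
use warnings;

sub describe($$;$);    # predeclaration: later calls are checked against ($$;$)

sub describe($$;$) {
    my ($dir, $type, $expr) = @_;
    $expr = '.*' unless defined $expr;    # the third argument is optional
    return "$dir | $type | $expr";
}

print describe('/tmp', 'html'), "\n";               # two arguments
print describe('/tmp', 'html', '\S+@\S+'), "\n";    # three arguments

# describe('/tmp');    # would not compile: not enough arguments

# Each "$" slot imposes scalar context, so an array passed there
# contributes its element count rather than its elements.
my @list = ('a', 'b', 'c');
print describe(@list, 'x'), "\n";
```

The last call is the usual surprise with prototypes: @list is not flattened into three arguments, which is one reason many Perl programmers use prototypes sparingly.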



(This post was edited by Jean on Apr 22, 2002, 12:38 AM)


mhx
Enthusiast / Moderator

Apr 22, 2002, 3:29 AM

Post #8 of 10 (1902 views)
Re: [uri] A challenge: Email Extractor from all directories to text file???


Code
do{ local( @ARGV, $/ ) = $_ ; <> }


Almost, since find wants a string ref. However, the trick is pretty cool and quite obvious once you've seen it! Not a day goes by that I don't learn something new about Perl :-)

So this should be the final 8-line version:


Code
#!/usr/bin/perl -w
use File::Find;
use Email::Find;
find( sub { /^.*\.html\z/s and
            Email::Find->new( sub {$addr{$_[1]}++} )
                       ->find( do {local(@ARGV,$/)=$_; \<>} )
          }, '/path/to/html' );
print "$_\n" for sort keys %addr;





BrightNail
Novice

Apr 23, 2002, 9:20 PM

Post #9 of 10 (1892 views)
Re: [mhx] A challenge: Email Extractor from all directories to text file???

hey mhx....

It looks like you have it printing to the "monitor" screen, a la DOS mode.

If I wanted to write it to a text file, would I just go like this at the end there:

open(OUTF,">>somefile.txt") or dienice("Couldn't process data: $!");
flock(OUTF,2);
seek(OUTF,0,0);
print OUTF "$_\n" for sort keys %addr;
close(OUTF);


hmm, doesn't seem right to me... :-(


mhx
Enthusiast / Moderator

Apr 23, 2002, 10:08 PM

Post #10 of 10 (1891 views)
Re: [BrightNail] A challenge: Email Extractor from all directories to text file???

If I wanted the output to go to a file, I'd redirect the output directly from the shell. Say the script is called mailex.pl, then it would just be:


Code
perl mailex.pl >email.txt


This would redirect all output into the file email.txt.
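
If the file has to be written from inside the script instead, one possible sketch is below. The helper name write_addresses, the sample %addr contents, and the temporary directory are assumptions for the demonstration. It opens the file with '>' so each run replaces the old contents; opening with '>>' would append, and repeated runs would then write the same addresses again.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: write each address on its own line to $path,
# replacing any previous contents of the file.
sub write_addresses {
    my ($path, @addresses) = @_;
    open(my $out, '>', $path) or die "Cannot write $path: $!\n";
    print {$out} "$_\n" for sort @addresses;
    close($out) or die "Cannot close $path: $!\n";
    return scalar @addresses;
}

# Stand-in for the %addr hash built by the scripts in this thread.
my %addr = ('b@example.com' => 1, 'a@example.com' => 2);

# The demonstration writes into a throwaway directory.
my $path = File::Spec->catfile(tempdir(CLEANUP => 1), 'email.txt');
write_addresses($path, keys %addr);
```

In a real script the call would simply be write_addresses('email.txt', keys %addr); placed where the final print statement is now.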



 
 

