Keep 1 email and delete all of the same email in large database ? Need Help

 



kiho
Deleted

Jul 3, 2000, 8:17 PM

Post #1 of 7 (664 views)

Hello!

I'm stuck at work. If anyone knows how to do this, please help me.

My boss gave me a large email database that looks like this:

file name: email.txt

abc@mail.com
123@yaho.com
abc@mail.com
cd@hot.com
df@cool.com
abc@mail.com
....

The problem is that some lines contain the same email (abc@mail.com).

I cannot review and delete all of them by hand.

I'm looking for a Perl script (or JavaScript, Visual Basic, VBA, or anything else) that can do it automatically. I mean keep just one copy of each email and delete all the duplicates.

Please help me keep my job - Thank you so much.



Kanji
User / Moderator

Jul 3, 2000, 8:33 PM

Post #2 of 7 (664 views)

This really isn't a perl problem, and you should learn not to count on perl as your only tool.

sort -fuo email.txt email.txt should work if you're on a UNIX system or a Win32 box with Cygwin installed, although be sure to read the help first so you know what those options do.

If you must do it in perl, then a one-liner like perl -pi.bak -le 'print unless $seen{$_}++' email.txt should do the trick.


perlkid
stranger

Jul 3, 2000, 11:04 PM

Post #3 of 7 (664 views)

 
Email me the file and I'll do it for you. :)

I wrote a script that's a part of my admin panel that will remove duplicates.

tony@olbis.com

perlkid


DrZed
User

Jul 4, 2000, 12:05 AM

Post #4 of 7 (664 views)

....unless $seen{$_}++



Very clever. I love stuff like that.

Dr. Zed


simon
Deleted

Jul 4, 2000, 1:00 PM

Post #5 of 7 (664 views)

I am a newbie, so sorry if this sounds stupid, but the variable $seen is not defined anywhere, so wouldn't that give an error? Or is it a special scalar variable?
-Simon N


perlkid
stranger

Jul 4, 2000, 1:29 PM

Post #6 of 7 (664 views)

 
No.

I have a piece of code all written out so I'll just give that to you.

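#########
open(INPUT, "</path/to/file/to/filter.db") or die "Can't read input: $!";
open(OUTPUT, ">/path/to/file/to/print/results.db") or die "Can't write output: $!";
while (<INPUT>) {
    chomp;
    next if /^$/;                      # skip blank lines
    ($field1, $field2) = split(/:/);   # records look like field1:field2
    next if exists($seen{$field2});    # already printed this field 2 value
    print OUTPUT "$field1:$field2\n";
    $seen{$field2} = "gotcha";         # remember it for later lines
}
close(INPUT);
close(OUTPUT);
#########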

This will filter out duplicates in field 2. Just point it to the proper files.
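
The same field-based filtering can be sketched as a reusable subroutine. This is only an illustration: dedupe_by_field and the sample records are made up for the example, and unlike the script above it keeps the whole record rather than reprinting just the first two fields.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# dedupe_by_field is a hypothetical helper, not part of the original post.
# It keeps the first record for each distinct value in the chosen
# colon-separated field (0-based index) and drops the later ones.
sub dedupe_by_field {
    my ($index, @records) = @_;
    my %seen;
    my @kept;
    for my $record (@records) {
        next if $record eq '';              # skip blank lines
        my @fields = split /:/, $record;
        my $key = $fields[$index];
        next unless defined $key;           # record has too few fields
        push @kept, $record unless $seen{$key}++;
    }
    return @kept;
}

# Made-up sample records in name:email form.
my @records = (
    'alice:abc@mail.com',
    'bob:123@yaho.com',
    'carol:abc@mail.com',
);

# Field index 1 is the second field, matching "field 2" above.
print "$_\n" for dedupe_by_field(1, @records);
```

With the sample data, the third record is dropped because its second field repeats the first record's.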

Hope That helps,

perlkid


Kanji
User / Moderator

Jul 4, 2000, 3:49 PM

Post #7 of 7 (664 views)

quote: simon says...
I am a newbie so please sorry if it sounds stupid but, the variables $seen is not defined to anything, so wouldn't that give an error or is that a special scalar variable?

There is some 'magic' going on, but it's not where you think it is.

First off, there's actually a typo in my one liner that I just noticed, as it should be perl -ni.bak -le 'print unless $seen{$_}++' file (I wrote -pi when I meant -ni).

For case insensitivity (so kanji is considered the same as KANJI), that $_ should be modified to lc( $_ ).

Anyhoo, the breakdown of the line is ...

perl
... invokes perl.

-n
... opens up the files in @ARGV, and places their contents in a while (<FILE> ) loop (actually it's while (<> )).

-i.bak
... edits the file 'in-place' by renaming the original file to file.bak, and placing all script output back into the original filename.

-l (that's an ell, not a one)
... enables line processing, which in this instance, automatically chomps newlines to get rid of them, and appends newlines to the end of every print I do.

In retrospect, this probably isn't needed, but I do it out of habit for one-liners.

-e
... tells perl to execute the following commands instead of reading them from a script.

print unless $seen{$_}++
... because the contents of the files are being read in a while (<> ), each line is assigned to the special variable $_, which many perl functions default to using if you don't supply another argument (ie, chomp; is the same as chomp($_);).

As we iterate through the while loop, we build up a hash called %seen by incrementing the value stored under a key equal to the line. Note that $seen{$_} is an element of that hash, not a scalar called $seen, and perl creates it the first time it is used.

On a later pass, if the key already exists (because we've seen it before :-), $seen{$_}++ returns a true value, so the unless skips the print and we don't produce any duplicates.
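
To make the counting in %seen visible, here is a small sketch that runs the same test over an in-memory list instead of a file. The sample addresses are borrowed from the first post; @kept and $count are names chosen just for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data mirroring the addresses from the first post.
my @lines = qw(abc@mail.com 123@yaho.com abc@mail.com cd@hot.com abc@mail.com);

my %seen;    # starts empty; keys come into existence on first use
my @kept;

for my $line (@lines) {
    # $seen{$line}++ returns the OLD count, then adds one.
    # First sighting: the old count is undef (false), so the line is kept.
    # Later sightings: the old count is 1, 2, ... (true), so it is skipped.
    push @kept, $line unless $seen{$line}++;
}

print "$_\n" for @kept;

my $count = $seen{'abc@mail.com'};
print "abc\@mail.com was seen $count times\n";
```

Each address is printed once, in the order it first appeared, and the final line reports that abc@mail.com was seen 3 times.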

Verbosely, that one-liner is equivalent to the following script (plus error checking)...

code:

#!/usr/bin/perl
use strict;
use warnings;

foreach my $file ( @ARGV ) {
    # Save the original ...
    rename( $file, "$file.bak" ) or die "Can't rename $file: $!";

    open( my $old, '<', "$file.bak" ) or die "Can't read $file.bak: $!";
    open( my $new, '>', $file )       or die "Can't write $file: $!";

    my %seen;

    # Read a line from the original into $line
    while ( my $line = <$old> ) {
        # Print it out to the new file if we haven't seen it
        print {$new} $line unless $seen{$line};

        # Make a note that we've seen it
        $seen{$line}++;
    }

    close $old;
    close $new or die "Can't close $file: $!";
}
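
The earlier note about lc( $_ ) can be sketched the same way. In one-liner form that would presumably be perl -ni.bak -le 'print unless $seen{ lc $_ }++' email.txt; the script below is a minimal illustration on made-up addresses, keying the hash on the lower-cased line while keeping the spelling that appeared first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample input; the same address appears in three spellings.
my @lines = (
    'Kanji@example.com',
    'KANJI@example.com',
    'simon@example.com',
    'kanji@example.com',
);

my %seen;
my @kept;

for my $line (@lines) {
    # Key on the lower-cased line, but keep the line as first written.
    push @kept, $line unless $seen{ lc $line }++;
}

print "$_\n" for @kept;
```

Only Kanji@example.com and simon@example.com survive, because the other two spellings lower-case to a key that is already in %seen.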

 
 

