Removal of duplicated element in array with order.

 



Wazezu
Novice

May 6, 2013, 8:19 AM

Post #1 of 23 (704 views)
Removal of duplicated element in array with order.

The following is the code for removing duplicated data, but it doesn't work.

The output is exactly the same as the input with duplicated elements.

Many thanks.


Code
#$1 is the matched regex, including duplicated elements. 

#$1 contains
#dmit@sp.com
#ems@es.com
#dew@es.com
#dmit@sp.com
#erg@es.com

my $match = $1."\n";

my @match_to_array = split("(\n)",$match);

my %seen = ();
my @r = ();

foreach my $a (@match_to_array) {
unless ($seen{$a}) {
push @r, $a;
$seen{$a}++;
}
}
print @r;



(This post was edited by Wazezu on May 6, 2013, 8:19 AM)


Laurent_R
Enthusiast / Moderator

May 6, 2013, 8:58 AM

Post #2 of 23 (701 views)
Re: [Wazezu] Removal of duplicated element in array with order.

Hi,

Can you show the content of the @match_to_array array (or of the @r array)?

I suspect that your split might return just one element, i.e. not really split your data.

It would probably be better to split with the following code:


Code
my @match_to_array = split /\n/, $match;


or even:


Code
my @match_to_array = split /^/, $match;



Wazezu
Novice

May 6, 2013, 9:05 AM

Post #3 of 23 (700 views)
Re: [Laurent_R] Removal of duplicated element in array with order.

Output of @match_to_array using my @match_to_array = split /\n/, $match;


Code
dmit@sp.comems@es.comdew@es.comdmit@sp.comerg@es.com




(This post was edited by Wazezu on May 6, 2013, 9:06 AM)


Laurent_R
Enthusiast / Moderator

May 6, 2013, 10:12 AM

Post #4 of 23 (690 views)
Re: [Wazezu] Removal of duplicated element in array with order.

So it works this time?


Laurent_R
Enthusiast / Moderator

May 6, 2013, 10:39 AM

Post #5 of 23 (688 views)
Re: [Wazezu] Removal of duplicated element in array with order.

I could not test it for my previous messages but, as I said, I strongly suspected that the split did not work properly.

Consider this session under the debugger:


Code
  DB<1> $c = "foo\nbar\nbaz\n" 

DB<2> print $c
foo
bar
baz

DB<3> @d = split("(\n)",$c);

DB<4> x @d
0 'foo'
1 '
'
2 'bar'
3 '
'
4 'baz'
5 '
'
DB<5> @d = split /\n/, $c;

DB<6> x @d
0 'foo'
1 'bar'
2 'baz'

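As you can see, your split syntax does not yield what you expect (the capturing parentheses in the pattern make split return the newline separators as extra elements), while the syntax I proposed does exactly what you need. I think your program will work with this simple change.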
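
The debugger session above can also be reproduced as a small script. This is only a sketch with made-up data (the $text value and the array names are illustrative, not taken from the original program); it counts the elements that each form of split returns.

```perl
use strict;
use warnings;

# Illustrative data, same shape as in the debugger session.
my $text = "foo\nbar\nbaz\n";

# A string used as the pattern is compiled as a regex, so "(\n)" contains
# a capture group. split then returns each captured separator as an
# extra element of the result.
my @with_capture = split "(\n)", $text;

# Without the capture group only the fields themselves are returned.
my @fields = split /\n/, $text;

print scalar(@with_capture), " elements with the capture group\n";
print scalar(@fields), " elements without it\n";
```

The separator elements do not break the duplicate removal by themselves, but they do end up in the output as extra newlines.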


BillKSmith
Veteran

May 6, 2013, 10:52 AM

Post #6 of 23 (688 views)
Re: [Wazezu] Removal of duplicated element in array with order.

Here is a complete program that I hope will be useful. I did have to guess the format of the raw data. I make the @raw_data array by splitting the string $raw_data on newlines. The function map extracts the email addresses and stores them in the array @match_to_array. Map is used again to remove the duplicates and store the result in the array @r.

Newlines are eliminated by the split and never restored. The special variable $, (the output field separator) is used to supply newlines between the elements when print outputs @r.


Code
use strict; 
use warnings;
my $email = qr/\w+\@\w+\.com/;
my $raw_data = <<"END_RAW_DATA";
dmit\@sp.com
ems\@es.com
dew\@es.com
dmit\@sp.com
erg\@es.com
END_RAW_DATA

my @raw_data = split /\n/, $raw_data;

my @match_to_array = map {/($email)/; $1} @raw_data;

my %seen;
my @r = map {$seen{$_}++ ? () : $_} @match_to_array;

{local $, = "\n"; print @r, "\n";}
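
As a side note on the separator variables, here is a minimal sketch (the sample addresses and the in-memory output handle are assumptions made for the illustration). It shows that $, is applied between the arguments of print, while $" is applied when an array is interpolated into a double-quoted string.

```perl
use strict;
use warnings;

# Illustrative data only.
my @r = ('dmit@sp.com', 'ems@es.com');

# $, (output field separator) is inserted between the arguments of print.
# The output is captured in a scalar through an in-memory handle.
my $with_ofs = '';
{
    local $, = "\n";
    open my $out, '>', \$with_ofs or die "Cannot open in-memory handle: $!";
    print {$out} @r;
    close $out;
}

# $" (list separator) is inserted between the elements of an array
# that is interpolated into a double-quoted string.
my $interpolated;
{
    local $" = "\n";
    $interpolated = "@r";
}

print $with_ofs eq $interpolated ? "same text\n" : "different text\n";
```

Both variables are localized inside a bare block so that the change does not leak into the rest of the program.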

Good Luck,
Bill


Wazezu
Novice

May 6, 2013, 6:33 PM

Post #7 of 23 (680 views)
Re: [Laurent_R] Removal of duplicated element in array with order.

No =/. After changing the split code like you recommended, the output is still the same.

Could this whole issue be due to my regex ?

I am extracting email addresses from an HTML file.


Code
/(\w[-.\w]*\@[a-z0-9]+(\.[-a-z0-9]+)*\.(com|edu|info|net|org|gov))/



(This post was edited by Wazezu on May 6, 2013, 6:39 PM)


Wazezu
Novice

May 6, 2013, 6:46 PM

Post #8 of 23 (674 views)
Re: [BillKSmith] Removal of duplicated element in array with order.

I tried the code, but there is an error: "Can't find string terminator END_RAW_DATA anywhere before EOF".

I am using Perl v5.10.1.

My raw data is an HTML file with some email addresses in it. I used Regex to extract the email addresses from the HTML file.

After extraction, the extracted emails are originally in this format:


Code
 dmit@sp.comems@es.comdew@es.comdmit@sp.comerg@es.com

.

I tried to add a "\n" after each email address to separate each of them via $match = $1."\n";.

Hopefully the info helps.


FishMonger
Veteran / Moderator

May 6, 2013, 8:56 PM

Post #9 of 23 (662 views)
Re: [Wazezu] Removal of duplicated element in array with order.

You should show us your code that extracts the email addresses.


BillKSmith
Veteran

May 6, 2013, 9:28 PM

Post #10 of 23 (659 views)
Re: [Wazezu] Removal of duplicated element in array with order.

I posted a complete working program. Your error indicates that you changed it incorrectly. I cannot explain the error without seeing your code.

I have now modified that program to start with your extracted data. I replaced my regex with yours. (I did change your parentheses to non-capturing parentheses.)
My previous comments about newlines still stand.



Code
use strict; 
use warnings;
my $email
= qr{\w[-.\w]*\@[a-z0-9]+(?:\.[-a-z0-9]+)*\.(?:com|edu|info|net|org|gov)};

my $extracted = 'dmit@sp.com ems@es.com dew@es.com dmit@sp.com erg@es.com';

my @match_to_array = $extracted =~ m/($email)/g;

my %seen;
my @r = map {$seen{$_}++ ? () : $_} @match_to_array;

{local $, = "\n"; print @r, "\n";}


OUTPUT:

Code
dmit@sp.com 
ems@es.com
dew@es.com
erg@es.com


I expect that this code can extract @match_to_array directly from the html, but I have no data to try it.
Good Luck,
Bill


Wazezu
Novice

May 7, 2013, 1:24 AM

Post #11 of 23 (650 views)
Re: [BillKSmith] Removal of duplicated element in array with order.

I merged the code you provided into the relevant part of my code. Perhaps there's an error?

The output still consists of duplicates.

I attached the test html file below.


Code
 
#!/usr/bin/perl -w

use strict;
use warnings;
use Cwd;

foreach my $argnum (0 .. $#ARGV) {

if ($ARGV[$argnum] eq "-ft"){

my $perl_path = cwd;

if(-e 'testht.html') {

open(OPENFILE, "$perl_path/testht.html") or die "Unable to open file";
}

my @email = <OPENFILE>;
close OPENFILE;

foreach my $email (@email){

if ($email =~ /(\w[-.\w]*\@[a-z0-9]+(\.[-a-z0-9]+)*\.(com|edu|info|net|org|gov))/){
my $match = "$1";
#print "$match\n";

my @raw_data = split /\n/, $match;

my @match_to_array = map {/($email)/; $1} @raw_data;

my %seen;
my @r = map {$seen{$_}++ ? () : $_} @match_to_array;

{local $, = "\n"; print @r, "\n";}

} # end of if statement

} # end of foreach


} # end of elsif -ft

}



(This post was edited by Wazezu on May 7, 2013, 1:24 AM)
Attachments: testht.html (74 B)


Laurent_R
Enthusiast / Moderator

May 7, 2013, 2:38 AM

Post #12 of 23 (645 views)
Re: [Wazezu] Removal of duplicated element in array with order.


In Reply To
No =/. After changing the split code like you recommended, the output is still the same.

Could this whole issue be due to my regex ?

I am extracting email addresses from an HTML file.


Code
/(\w[-.\w]*\@[a-z0-9]+(\.[-a-z0-9]+)*\.(com|edu|info|net|org|gov))/



Hi,

I am posting here again what I answered to your cross-post on PerlMonks.

I do not know what you changed, but taking your original program, changing the split pattern to /\n/ (and adding a few prints just to show the content of the various variables) shows that this works perfectly.




Code
my $match =  "foo1\nfoo2\nfoo3\nfoo4\nfoo5\nfoo3\nfoo2"; 
print '$match = ', "\n", $match, "\n\n";
my @match_to_array = split /\n/, $match;
print '@match_to_array = ', "@match_to_array \n\n";
my %seen = ();
my @r = ();

foreach my $a (@match_to_array) {
unless ($seen{$a}) {
push @r, $a;
$seen{$a}++;
}
}
print '@r = ', "@r";


This is now the output:



Code
$ perl  remove_duplicates.pl 
$match =
foo1
foo2
foo3
foo4
foo5
foo3
foo2

@match_to_array = foo1 foo2 foo3 foo4 foo5 foo3 foo2

@r = foo1 foo2 foo3 foo4 foo5


The duplicate values (foo3 and foo2) have been duly removed.


(This post was edited by Laurent_R on May 7, 2013, 3:45 AM)


BillKSmith
Veteran

May 7, 2013, 6:23 AM

Post #13 of 23 (634 views)
Re: [Wazezu] Removal of duplicated element in array with order.

The following code block shows the minimum changes to your code to make it "work".
Bold means added code.
italic means change.
strikeout means remove.

Code
#!/usr/bin/perl -w
use strict;
use warnings;
use Cwd;

my $EMAIL
= qr{\w[-.\w]*\@[a-z0-9]+(?:\.[-a-z0-9]+)*\.(?:com|edu|info|net|org|gov)};

foreach my $argnum ( 0 .. $#ARGV ) {
if ( $ARGV[$argnum] eq "-ft" ) {
my $perl_path = cwd;
if ( -e 'testht.html' ) {
open( OPENFILE, "$perl_path/testht.html" )
or die "Unable to open file";
}
my @email = <OPENFILE>;
close OPENFILE;

foreach my $email (@email) {
if ( $email
=~ /(\w[-.\w]*\@[a-z0-9]+(\.[-a-z0-9]+)*\.(com|edu|info|net|org|gov))/
)
{
my $match = "$1";

#print "$match\n";
my @raw_data = split /\n/, $match;

my @match_to_array = map { /($EMAIL)/; $1 } @email;
my %seen;
my @r = map { $seen{$_}++ ? () : $_ } @match_to_array;
{ local $, = "\n"; print @r, "\n"; }

} # end of if statement
} # end of foreach

} # end of elsif -ft
}


We had conflicting use of the variable $email. I changed mine to upper case.

You do not need the foreach loop. The map function (refer: perldoc -f map) does all the looping itself.
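
To illustrate the point about dropping the foreach loop, here is a rough sketch of the whole task done in one pass over the file content. The $html text is a hypothetical stand-in for testht.html (the real attachment is not shown in this thread), and an in-memory handle replaces the real file so that the example is self-contained.

```perl
use strict;
use warnings;

my $EMAIL
    = qr{\w[-.\w]*\@[a-z0-9]+(?:\.[-a-z0-9]+)*\.(?:com|edu|info|net|org|gov)};

# Hypothetical stand-in for the content of testht.html.
my $html = <<'END_HTML';
<p>dmit@sp.com</p>
<p>ems@es.com, dew@es.com</p>
<p>dmit@sp.com</p>
<p>erg@es.com</p>
END_HTML

# Read the whole input at once (an in-memory handle replaces the real file).
open my $fh, '<', \$html or die "Unable to open input: $!";
my $content = do { local $/; <$fh> };
close $fh;

# In list context, m//g returns every match found in the string.
my @match_to_array = $content =~ m/($EMAIL)/g;

# Keep only the first occurrence of each address, preserving the order.
my %seen;
my @r = grep { !$seen{$_}++ } @match_to_array;

print "$_\n" for @r;
```

The important difference from the per-line loop is that %seen and @r are filled once for the whole file, so an address found on a later line can be recognized as a duplicate of one found earlier.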
Good Luck,
Bill


Wazezu
Novice

May 7, 2013, 6:23 AM

Post #14 of 23 (634 views)
Re: [Laurent_R] Removal of duplicated element in array with order.

Thanks, your code works, but when I port your code over to my code, it doesn't work. I guess it is due to the '@' in the email addresses causing issues. I am sorry for my poor porting skills, sincerely. =/

For instance, '@' has to be escaped to work in the code you provided.

Code
my $match = "dmit\@sp.com\nems\@es.com\ndew\@es.com\ndmit\@sp.com";
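
For reference, the escaping is only needed for an @ written inside a double-quoted string in the source code, where Perl would otherwise try to interpolate an array. A small sketch with made-up values (the in-memory handle stands in for a real file):

```perl
use strict;
use warnings;

# Double quotes interpolate, so a literal @ must be escaped.
my $double = "dmit\@sp.com";

# Single quotes do not interpolate, so no escape is needed.
my $single = 'dmit@sp.com';

# Text read from a handle is data, not source code: nothing in it is
# interpolated. $file_text stands in for the content of a file.
my $file_text = "ems\@es.com\n";
open my $in, '<', \$file_text or die "Cannot open in-memory handle: $!";
my $line = <$in>;
close $in;
chomp $line;

print "$double\n$single\n$line\n";
```

So addresses extracted from a file with a regex never need any escaping; only string literals typed into the program do.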


My code
1. opens a HTML file,
2. extracts email addresses from it to a $ variable,
3. converts it to @ variable
4. and then attempts to remove the duplicated elements.

Perhaps there are some issues in the conversion to arrays, like you mentioned earlier.

My modified code with your code in it. ( It has problems with variable scoping, can't find a way to print the $match properly. )
http://pastebay.net/1208653

On a side note, I apologize for the cross-post if it offends you in any way. I am hoping for more views to solve this issue.


(This post was edited by Wazezu on May 7, 2013, 6:25 AM)


FishMonger
Veteran / Moderator

May 7, 2013, 6:27 AM

Post #15 of 23 (631 views)
Re: [Wazezu] Removal of duplicated element in array with order.

Try this:

Code
#!/usr/bin/perl 

use strict;
use warnings;
use Email::Address;
use Data::Dumper;

my $html = do { local $/; <DATA> };
my @addresses = Email::Address->parse($html);

foreach my $address ( @addresses ) {
print $address->format, $/;
}


__DATA__
dmit@sp.com
ems@es.com
dew@es.com
dmit@sp.com
erg@es.com


Output:
c:\testing>get_addresses.pl
dmit@sp.com
ems@es.com
dew@es.com
dmit@sp.com
erg@es.com

Removing the dups is left up to the reader. (hint, use a hash)
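
One possible way to follow the hint, given as a sketch only: the plain strings below stand in for the values returned by $address->format, and the names @formatted and @unique are made up for the illustration.

```perl
use strict;
use warnings;

# Plain strings standing in for what $address->format returns.
my @formatted = qw(dmit@sp.com ems@es.com dew@es.com dmit@sp.com erg@es.com);

my %seen;      # address => number of times it has been met so far
my @unique;    # first occurrences, in their original order
foreach my $address (@formatted) {
    next if $seen{$address}++;    # already recorded, skip it
    push @unique, $address;
}

print "$_\n" for @unique;
```

The hash is only used for the membership test; the order comes from the array, which is why the first occurrence of each address keeps its position.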


(This post was edited by FishMonger on May 7, 2013, 6:30 AM)


FishMonger
Veteran / Moderator

May 7, 2013, 6:33 AM

Post #16 of 23 (628 views)
Re: [Wazezu] Removal of duplicated element in array with order.

I'm assuming your real html file is actually an html file, and not simply a list of addresses on individual lines. Otherwise jumping through all these hoops is absurd.


FishMonger
Veteran / Moderator

May 7, 2013, 6:35 AM

Post #17 of 23 (627 views)
Re: [Wazezu] Removal of duplicated element in array with order.


Quote

Code
foreach my $argnum (0 .. $#ARGV) {  

if ($ARGV[$argnum] eq "-ft"){



Don't do that. Use the Getopt::Long module.

http://search.cpan.org/search?query=Getopt%3A%3ALong&mode=all
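
A minimal sketch of what that could look like for the -ft switch. The argument list and the $from_file variable are hypothetical; GetOptionsFromArray is used here only so that the example does not depend on the real command line, and a script would normally call GetOptions on @ARGV.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical command line; a real script would parse @ARGV instead.
my @args = ('-ft', 'extra.txt');

# 'ft' declares a boolean switch; with the default configuration it can
# be given with a single dash, as in the original program.
my $from_file = 0;
GetOptionsFromArray(\@args, 'ft' => \$from_file)
    or die "Usage: $0 [-ft]\n";

print "-ft was ", ($from_file ? "given" : "not given"), "\n";
print "remaining arguments: @args\n";
```

Recognized options are removed from the array, so whatever is left can be treated as ordinary arguments.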


Laurent_R
Enthusiast / Moderator

May 7, 2013, 10:16 AM

Post #18 of 23 (611 views)
Re: [Wazezu] Removal of duplicated element in array with order.


In Reply To

On a side note, I apologize for the cross-post if it offends you in any way. I am hoping for more views to solve this issue.


Don't worry, I certainly don't feel offended by that and I don't particularly care, but it is usually considered poor practice, because you are mobilizing more people on your problem, and people on one forum often don't see what has been written on the other.

What I did not like, though, is that you presented a program that does not work, I gave you the correction that is sufficient to make it work, and then you said that it does not work because you are actually using another program (which you did not say, at least not immediately). But again, don't worry, I don't feel offended either; it is just that you are making us less efficient in helping you.


Wazezu
Novice

May 8, 2013, 12:55 AM

Post #19 of 23 (599 views)
Re: [BillKSmith] Removal of duplicated element in array with order.

Thanks a lot for the help.
I modified the relevant parts of code which you requested to change. It prints out a blank newline.

I can't understand the line my @r = map {$seen{$_}++ ? () : $_} @match_to_array; .

Are the ? and : regex operators ?


Code
        my @match_to_array = map { /($EMAIL)/; @email }  
my %seen;
my @r = map {$seen{$_}++ ? () : $_} @match_to_array;

{ local $, = "\n"; print @r, "\n"; }



Laurent_R
Enthusiast / Moderator

May 8, 2013, 1:24 AM

Post #20 of 23 (595 views)
Re: [Wazezu] Removal of duplicated element in array with order.

Hi,


Code
 my @r = map {$seen{$_}++ ? () : $_} @match_to_array;


The map function takes each element of @match_to_array, aliases it to $_, applies the code between the curlies to $_, and returns the list of results into the @r array.

The code within the curlies checks whether the $seen{$_} hash element is true; if it is true (the element has already been seen), it returns an empty list, and if not it returns the element. At the same time, ++ adds 1 to the hash element, so that if it was false (undefined, treated as 0), it will be true next time.

This is more or less equivalent to this less concise code:


Code
foreach my $element (@match_to_array) {
push @r, $element unless $seen{$element};
$seen{$element}++;
}
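
On the question about ? and : they are not regex operators: together they form Perl's conditional operator, CONDITION ? VALUE_IF_TRUE : VALUE_IF_FALSE. The sketch below, with made-up data and variable names, puts the map form next to the same decision written as an explicit if/else.

```perl
use strict;
use warnings;

# Illustrative data only.
my @match_to_array = qw(foo1 foo2 foo3 foo2 foo1);

# CONDITION ? VALUE_IF_TRUE : VALUE_IF_FALSE is the conditional operator.
# Here the block yields the empty list for an element that was seen
# before, and the element itself otherwise.
my %seen_map;
my @from_map = map { $seen_map{$_}++ ? () : $_ } @match_to_array;

# The same decision written with an explicit if/else.
my %seen_loop;
my @from_loop;
foreach my $element (@match_to_array) {
    if ( $seen_loop{$element}++ ) {
        # seen before: contribute nothing, like the empty list ()
    }
    else {
        push @from_loop, $element;
    }
}

print "map:  @from_map\n";
print "loop: @from_loop\n";
```

Separate hashes are used for the two versions so that the first pass does not mark everything as already seen for the second.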



Wazezu
Novice

May 8, 2013, 4:57 AM

Post #21 of 23 (593 views)
Re: [Laurent_R] Removal of duplicated element in array with order.

Thanks for the clarification, much appreciated.


BillKSmith
Veteran

May 8, 2013, 5:55 AM

Post #22 of 23 (589 views)
Re: [Wazezu] Removal of duplicated element in array with order.

I think that the blank line comes from the blank line in the data.


Quote
I can't understand the line my @r = map {$seen{$_}++ ? () : $_} @match_to_array; .


This is more complex than necessary. I should have used:

Code
@r = grep  {!$seen{$_}++} @match_to_array;


Laurent's explanation is very good. The book "Perl Cookbook" has a detailed discussion of this Perl idiom in the recipe "Extracting Unique Elements from a List."
Good Luck,
Bill


Laurent_R
Enthusiast / Moderator

May 8, 2013, 9:17 AM

Post #23 of 23 (583 views)
Re: [BillKSmith] Removal of duplicated element in array with order.

Yes, Bill, I was a bit surprised that you used map instead of grep in that specific construct, but I did not raise any objection since it also works fine. But grep clearly allows a slightly simpler syntax in that specific case.

 
 

