Find duplicates using two conditions

 



joy_stat
New User

Jun 12, 2014, 2:34 PM

Post #1 of 4 (5694 views)
Find duplicates using two conditions

 
I am working with a genetic data file and searching for duplicates. It is a tab-delimited file with 6 columns.

The conditions I am looking for are the following:
a) look at column 2 (SNP names) and list the names that are duplicated
and
b) look for duplicate positions (column 4) and, among the SNP names sharing a position, print the one that does not start with "rs", so that it can be excluded in the following step.

So, for the following file,
###
1 rs3121607 0 12648866 G T
1 rs3121607 0 12648867 G T
1 rs72670467 0 68702241 T C
1 rs12041665 0 68705985 C T
1 chr1:68705985 0 68705985 C T
1 rs10493440 0 68711839 G A
1 rs11209236 0 68714191 T A
###
the resulting text file will contain these two SNPs:

rs3121607
chr1:68705985

I was able to do the (a) part but have no clue how to do both in a single piece of code.

Any help is appreciated.

Thanks,


Laurent_R
Veteran / Moderator

Jun 12, 2014, 3:55 PM

Post #2 of 4 (5630 views)
Re: [joy_stat] Find duplicates using two conditions

Sorry, I do not understand what you need. If you were able to do the (a) part, please show the code that does it; that should help us understand what you want.


joy_stat
New User

Jun 12, 2014, 4:07 PM

Post #3 of 4 (5620 views)
Re: [Laurent_R] Find duplicates using two conditions

I have attached the code. Thanks.
Attachments: findDups.pl (0.25 KB)


Kenosis
User

Jun 12, 2014, 4:12 PM

Post #4 of 4 (5616 views)
Re: [joy_stat] Find duplicates using two conditions

Perhaps the following will be helpful:


Code
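use strict;
use warnings;

my ( %col2, %col4 );

# Count each SNP name (column 2) and group SNP names by position (column 4).
while (<DATA>) {
    my @fields = split;
    next unless @fields >= 4;    # skip blank or short lines
    $col2{ $fields[1] }++;
    push @{ $col4{ $fields[3] } }, $fields[1];
}

# (a) SNP names that appear on more than one line.
for ( sort keys %col2 ) {
    print "$_\n" if $col2{$_} > 1;
}

# (b) For positions shared by more than one line, print the names not starting with "rs".
for ( grep { @{ $col4{$_} } > 1 } sort keys %col4 ) {
    print "$_\n" for grep { !/^rs/ } @{ $col4{$_} };
}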

__DATA__
1 rs3121607 0 12648866 G T
1 rs3121607 0 12648867 G T
1 rs72670467 0 68702241 T C
1 rs12041665 0 68705985 C T
1 chr1:68705985 0 68705985 C T
1 rs10493440 0 68711839 G A
1 rs11209236 0 68714191 T A


Output:

Code
rs3121607 
chr1:68705985


For your (b) spec, a hash of arrays (HoA) was used, where each key is a column 4 entry (the position) and each value is a reference to an array of the column 2 entries (SNP names) found at that position. It wasn't clear to me whether "duplicate" strictly meant exactly two, or more than one, hence the HoA.
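
To make the structure concrete, here is a minimal sketch that builds the same kind of HoA and prints what it holds. The @rows array is just a subset of the sample data, used here for illustration; it is not part of the script above.

```perl
use strict;
use warnings;

# A few sample rows from the thread; columns are whitespace-separated.
my @rows = (
    "1 rs12041665 0 68705985 C T",
    "1 chr1:68705985 0 68705985 C T",
    "1 rs10493440 0 68711839 G A",
);

# Hash of arrays: position (column 4) => [ SNP names (column 2) ].
my %col4;
for my $row (@rows) {
    my @fields = split ' ', $row;
    push @{ $col4{ $fields[3] } }, $fields[1];
}

# Show each position with the names recorded for it (sorted for stable output).
for my $pos ( sort { $a <=> $b } keys %col4 ) {
    print "$pos => ", join( ", ", @{ $col4{$pos} } ), "\n";
}
# Prints:
# 68705985 => rs12041665, chr1:68705985
# 68711839 => rs10493440
```

A position whose array holds more than one name is a duplicate position, however many lines share it.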

The (b) spec is printed by obtaining the keys of the col4 hash, grepping for those whose arrays have two or more elements, and printing only the array elements that don't begin with "rs".
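
If the output order matters, or the check needs to be reused on lines read from a file, the same logic could be wrapped in a subroutine. The following is only a sketch under assumptions: snps_to_exclude is a hypothetical helper name, and returning names in first-seen order without repeats is an assumption rather than something stated in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: takes data lines, returns SNP names to exclude
# in first-seen order, each name at most once.
sub snps_to_exclude {
    my @lines = @_;
    my ( %name_count, %names_at, @name_order, @pos_order );

    for my $line (@lines) {
        my @fields = split ' ', $line;
        next unless @fields >= 4;    # skip blank or short lines
        my ( $name, $pos ) = @fields[ 1, 3 ];
        push @name_order, $name unless $name_count{$name}++;
        push @pos_order,  $pos  unless $names_at{$pos};
        push @{ $names_at{$pos} }, $name;
    }

    my ( %seen, @result );

    # (a) names that occur on more than one line
    for my $name ( grep { $name_count{$_} > 1 } @name_order ) {
        push @result, $name unless $seen{$name}++;
    }

    # (b) non-"rs" names at positions shared by more than one line
    for my $pos ( grep { @{ $names_at{$_} } > 1 } @pos_order ) {
        for my $name ( grep { !/^rs/ } @{ $names_at{$pos} } ) {
            push @result, $name unless $seen{$name}++;
        }
    }
    return @result;
}

# Sample data from the original question.
my @sample = (
    "1 rs3121607 0 12648866 G T",
    "1 rs3121607 0 12648867 G T",
    "1 rs72670467 0 68702241 T C",
    "1 rs12041665 0 68705985 C T",
    "1 chr1:68705985 0 68705985 C T",
    "1 rs10493440 0 68711839 G A",
    "1 rs11209236 0 68714191 T A",
);

print "$_\n" for snps_to_exclude(@sample);
```

To run it on a real file, the lines could be read with an ordinary open and passed to the subroutine in place of @sample.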

Hope this helps!


(This post was edited by Kenosis on Jun 12, 2014, 4:14 PM)

 
 

