[SOLVED] Merges 2 text files under few conditions

 



Thalakos
Novice

Apr 4, 2013, 2:06 PM

Post #1 of 11 (932 views)

Hi all,

I have two text files, file_a.txt and file_b.txt, that look like this:

file_a:

Code
hsa-miR-199a
hsa-miR-222
hsa-miR-222
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7a
hsa-let-7b
hsa-let-7b
hsa-let-7b
hsa-let-7b
hsa-let-7c
hsa-let-7c
hsa-let-7c
hsa-let-7c
hsa-let-7c
hsa-let-7c
hsa-let-7d
hsa-let-7d
hsa-let-7d
hsa-let-7d
hsa-let-7e
hsa-let-7e
hsa-let-7e
hsa-let-7e
hsa-let-7e
hsa-let-7f
hsa-let-7f
hsa-let-7f
hsa-let-7f
hsa-let-7f
hsa-let-7f
hsa-let-7f
hsa-let-7f
hsa-let-7f
hsa-let-7f
hsa-let-7f
hsa-let-7f
hsa-let-7f
hsa-let-7f
hsa-let-7g
hsa-let-7g
....
line cut


file_b:

Code
hsa-let-7a	KRAS , HMGA2 , integrin beta(3) , caspase-3 , PRDM1/Blimp-1 , HMGA2 , IGF-II , HMGA2 , HMGA2 , RAS , BCL2 , RAS , MYC , CDC25A , CDK6 , NF2 , c-myc , RAS , RAS , NIRF 
hsa-let-7b Cdc34 , Dicer , KRAS , CCND1 , CDC25A , CDK6 , HMGA2
hsa-let-7c HMGA2 , HMGA2 , HMGA2 , BCL2 , RAS , CDC25A , CDK6 , RAS
hsa-let-7d KRAS , HMGA2 , BCL2 , RAS , CDC25A , CDK6
hsa-let-7d BDNF , D3R
hsa-let-7e HMGA2
hsa-let-7g KRAS , HMGA2 , Ras , HMGA2 , CDC25A , CDK6
hsa-miR-1 c-Met , calmodulin , Gata4 , Mef2a , BCL2 , Gata4 , calmodulin , Mef2a , C/EBPa , FoxP1 , HDAC4 , MET , HCN4 , FoxP1 , HDAC4 , MET , Cdk9 , fibronectin , RasGAP , Rheb , MEF-2 , nAChR , GAJ1 , KCNJ2 , HSP60 , HSP70 , Hand2 , Kir2.1
hsa-miR-100 Plk1
hsa-miR-101 EZH2 , EZH2 , Mcl-1 , FOS , EZH2 , FOS , ATXN1 , MYCN , Ezh2
hsa-miR-101b ATXN1 , STC1
hsa-miR-106a IL-10 , E2F1 , Mylip
hsa-miR-106b p21 , APP , Itch , E2F1 , E2F1 , PCAF
hsa-miR-107 PLAG1 , BACE1
hsa-miR-10b HOXD10 , PPAR-alpha
hsa-miR-1-2 Hand2 , Irx5 , Kcnd2
hsa-miR-122 Bcl-w , ADAM-10 , SRF , Igf1R
hsa-miR-122a CCNG1 , CCNG1 , AMPK
hsa-miR-124 BDNF , D3R , Sox9
hsa-miR-124a Rb , IkappaBzeta , CDK6 , CDK6 , CDK6 , CDK6
hsa-miR-199 ET-1
hsa-miR-199a IKK-beta
hsa-miR-199a* Smad1 , ERK2 , MET
hsa-miR-222 p27 , p27 , p27 , p57 , MMP1 , SOD2 , Bim , CDKN1B/p27/Kip1 , KIT , c-KIT , p27(Kip1) , p27(Kip1) , ERalpha , CDKN1C/p57 , CDKN1B/p27/Kip1 , c-KIT , KIT , CDKN1B/p27/Kip1 , CDKN1B/p27/Kip1 , KIT , p27


File A has a single column in which the entries from column 1 of file B appear multiple times.
I need a new file C in which every entry of file A, repeats included, is paired with the corresponding column 2 entry from file B.
It should look like this:

file_c.txt:

Code
hsa-miR-199a	IKK-beta
hsa-miR-222 p27 , p27 , p27 , p57 , MMP1 , SOD2 , Bim , CDKN1B/p27/Kip1 , KIT , c-KIT , p27(Kip1) , p27(Kip1) , ERalpha , CDKN1C/p57 , CDKN1B/p27/Kip1 , c-KIT , KIT , CDKN1B/p27/Kip1 , CDKN1B/p27/Kip1 , KIT , p27
hsa-miR-222 p27 , p27 , p27 , p57 , MMP1 , SOD2 , Bim , CDKN1B/p27/Kip1 , KIT , c-KIT , p27(Kip1) , p27(Kip1) , ERalpha , CDKN1C/p57 , CDKN1B/p27/Kip1 , c-KIT , KIT , CDKN1B/p27/Kip1 , CDKN1B/p27/Kip1 , KIT , p27
hsa-let-7a KRAS , HMGA2 , integrin beta(3) , caspase-3 , PRDM1/Blimp-1 , HMGA2 , IGF-II , HMGA2 , HMGA2 , RAS , BCL2 , RAS , MYC , CDC25A , CDK6 , NF2 , c-myc , RAS , RAS , NIRF
hsa-let-7a KRAS , HMGA2 , integrin beta(3) , caspase-3 , PRDM1/Blimp-1 , HMGA2 , IGF-II , HMGA2 , HMGA2 , RAS , BCL2 , RAS , MYC , CDC25A , CDK6 , NF2 , c-myc , RAS , RAS , NIRF
hsa-let-7a KRAS , HMGA2 , integrin beta(3) , caspase-3 , PRDM1/Blimp-1 , HMGA2 , IGF-II , HMGA2 , HMGA2 , RAS , BCL2 , RAS , MYC , CDC25A , CDK6 , NF2 , c-myc , RAS , RAS , NIRF
hsa-let-7a KRAS , HMGA2 , integrin beta(3) , caspase-3 , PRDM1/Blimp-1 , HMGA2 , IGF-II , HMGA2 , HMGA2 , RAS , BCL2 , RAS , MYC , CDC25A , CDK6 , NF2 , c-myc , RAS , RAS , NIRF
hsa-let-7a KRAS , HMGA2 , integrin beta(3) , caspase-3 , PRDM1/Blimp-1 , HMGA2 , IGF-II , HMGA2 , HMGA2 , RAS , BCL2 , RAS , MYC , CDC25A , CDK6 , NF2 , c-myc , RAS , RAS , NIRF
hsa-let-7a KRAS , HMGA2 , integrin beta(3) , caspase-3 , PRDM1/Blimp-1 , HMGA2 , IGF-II , HMGA2 , HMGA2 , RAS , BCL2 , RAS , MYC , CDC25A , CDK6 , NF2 , c-myc , RAS , RAS , NIRF
hsa-let-7a KRAS , HMGA2 , integrin beta(3) , caspase-3 , PRDM1/Blimp-1 , HMGA2 , IGF-II , HMGA2 , HMGA2 , RAS , BCL2 , RAS , MYC , CDC25A , CDK6 , NF2 , c-myc , RAS , RAS , NIRF
hsa-let-7a KRAS , HMGA2 , integrin beta(3) , caspase-3 , PRDM1/Blimp-1 , HMGA2 , IGF-II , HMGA2 , HMGA2 , RAS , BCL2 , RAS , MYC , CDC25A , CDK6 , NF2 , c-myc , RAS , RAS , NIRF
hsa-let-7a KRAS , HMGA2 , integrin beta(3) , caspase-3 , PRDM1/Blimp-1 , HMGA2 , IGF-II , HMGA2 , HMGA2 , RAS , BCL2 , RAS , MYC , CDC25A , CDK6 , NF2 , c-myc , RAS , RAS , NIRF
hsa-let-7a KRAS , HMGA2 , integrin beta(3) , caspase-3 , PRDM1/Blimp-1 , HMGA2 , IGF-II , HMGA2 , HMGA2 , RAS , BCL2 , RAS , MYC , CDC25A , CDK6 , NF2 , c-myc , RAS , RAS , NIRF
hsa-let-7a KRAS , HMGA2 , integrin beta(3) , caspase-3 , PRDM1/Blimp-1 , HMGA2 , IGF-II , HMGA2 , HMGA2 , RAS , BCL2 , RAS , MYC , CDC25A , CDK6 , NF2 , c-myc , RAS , RAS , NIRF
hsa-let-7a KRAS , HMGA2 , integrin beta(3) , caspase-3 , PRDM1/Blimp-1 , HMGA2 , IGF-II , HMGA2 , HMGA2 , RAS , BCL2 , RAS , MYC , CDC25A , CDK6 , NF2 , c-myc , RAS , RAS , NIRF
hsa-let-7a KRAS , HMGA2 , integrin beta(3) , caspase-3 , PRDM1/Blimp-1 , HMGA2 , IGF-II , HMGA2 , HMGA2 , RAS , BCL2 , RAS , MYC , CDC25A , CDK6 , NF2 , c-myc , RAS , RAS , NIRF
.....
line cut

Do you guys have any idea how to do this automatically?

Thanks a lot in advance,
Giorgio


(This post was edited by Thalakos on Apr 5, 2013, 9:49 PM)


FishMonger
Veteran / Moderator

Apr 4, 2013, 2:39 PM

Post #2 of 11 (928 views)
Re: [Thalakos] Merges 2 text files under few conditions

Are you saying you want each line in file_b.txt repeated by the number of times the column 1 is found in file_a.txt?

Load file_a.txt in a hash which keeps a running total of each value and use that value as a multiplier when outputting each row in file_b.txt
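
The idea above can be sketched on a small scale as follows. This is only an illustration: the sub name `repeat_rows` and the in-memory sample arrays are assumptions standing in for the real files, and rows whose ID never appears in file_a are dropped here because their multiplier is 0.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for the two files (not the poster's real data).
my @file_a = ('hsa-let-7b', 'hsa-let-7b', 'hsa-let-7e');
my @file_b = ("hsa-let-7b\tCdc34 , Dicer\n", "hsa-let-7e\tHMGA2\n");

# Repeat each file_b row once per occurrence of its ID in file_a.
sub repeat_rows {
    my ($ids, $rows) = @_;

    my %count;
    $count{$_}++ for @$ids;              # running total per ID

    my $out = '';
    for my $row (@$rows) {
        my ($id) = split /\t/, $row;     # column 1 is the ID
        # "x" is the string repetition operator; 0 repeats gives ''.
        $out .= $row x ( $count{$id} || 0 );
    }
    return $out;
}

print repeat_rows( \@file_a, \@file_b );
```

With the sample data, the hsa-let-7b row is printed twice and the hsa-let-7e row once.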


(This post was edited by FishMonger on Apr 4, 2013, 2:39 PM)


Thalakos
Novice

Apr 4, 2013, 2:51 PM

Post #3 of 11 (924 views)
Re: [FishMonger] Merges 2 text files under few conditions


In Reply To
Are you saying you want each line in file_b.txt repeated by the number of times the column 1 is found in file_a.txt?

Load file_a.txt in a hash which keeps a running total of each value and use that value as a multiplier when outputting each row in file_b.txt


Exactly, and obviously the lines have to correspond. Unfortunately I'm a complete newbie at programming, so what you described is pretty much beyond me.


Thalakos
Novice

Apr 4, 2013, 4:00 PM

Post #4 of 11 (913 views)
Re: [Thalakos] Merges 2 text files under few conditions

Can anyone help me with this, please?


FishMonger
Veteran / Moderator

Apr 5, 2013, 6:38 AM

Post #5 of 11 (889 views)
Re: [Thalakos] Merges 2 text files under few conditions


Code
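#!/usr/bin/perl

use strict;
use warnings;

my %ID;    # ID => number of times it appears in file_a.txt

open my $a_fh, '<', 'file_a.txt' or die "failed to open file_a.txt: $!";
open my $b_fh, '<', 'file_b.txt' or die "failed to open file_b.txt: $!";

# Count the occurrences of each ID in file_a.txt.
while ( my $id = <$a_fh> ) {
    chomp $id;
    $ID{$id}++;
}
close $a_fh;

# Print each row of file_b.txt once per occurrence of its ID in file_a.txt.
while ( my $line = <$b_fh> ) {
    my ($id, $genes) = split /\t/, $line;

    if ( exists $ID{$id} ) {
        print $line x $ID{$id};
    }
    else {
        # ID not listed in file_a.txt: the row is still printed once.
        print $line;
    }
}
close $b_fh;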



Thalakos
Novice

Apr 5, 2013, 9:44 AM

Post #6 of 11 (885 views)
Re: [FishMonger] Merges 2 text files under few conditions

Thank you so much, the script works great.



Do you guys know how to modify the script so that it outputs a blank entry when an ID in file_a.txt is not present in file_b.txt, so that I don't lose that information?

I mean something like this:


Code
hsa-let-7a	KRAS , HMGA2 , integrin beta(3) , caspase-3 , PRDM1/Blimp-1 , HMGA2 , IGF-II , HMGA2 , HMGA2 , RAS , BCL2 , RAS , MYC , CDC25A , CDK6 , NF2 , c-myc , RAS , RAS , NIRF  
hsa-let-7a KRAS , HMGA2 , integrin beta(3) , caspase-3 , PRDM1/Blimp-1 , HMGA2 , IGF-II , HMGA2 , HMGA2 , RAS , BCL2 , RAS , MYC , CDC25A , CDK6 , NF2 , c-myc , RAS , RAS , NIRF
hsa-let-7a KRAS , HMGA2 , integrin beta(3) , caspase-3 , PRDM1/Blimp-1 , HMGA2 , IGF-II , HMGA2 , HMGA2 , RAS , BCL2 , RAS , MYC , CDC25A , CDK6 , NF2 , c-myc , RAS , RAS , NIRF
hsa-present-only-in-file_a.txt (blank line)


I'm asking because I can't lose that info. In the end I want the same list as in file_a.txt, with each ID paired with the genes taken from file_b, or a blank where file_b has no association.
I hope that was clear!

Thanks in advance,
Giorgio


(This post was edited by Thalakos on Apr 5, 2013, 9:48 AM)
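
One way to read the request above is as a lookup: load file_b into a hash keyed by ID, then walk file_a in its original order. This is a different technique from the counting scripts in this thread, sketched under assumptions: the sub name `annotate_ids`, the sample arrays, and the ID `hsa-miR-999` are hypothetical, and joining the gene lists of an ID that occurs twice in file_b (as hsa-let-7d does) with " , " is a choice made for this sketch, not something the poster asked for.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return one output line per entry of @$ids, in the same order.
sub annotate_ids {
    my ($ids, $rows) = @_;

    # Build the lookup table: ID => gene list from file_b.
    my %genes_for;
    for my $row (@$rows) {
        my $line = $row;
        chomp $line;
        # Split at the first run of whitespace only (tab, space, or both).
        my ($id, $genes) = split /\s+/, $line, 2;
        next unless defined $id && length $id;
        $genes = '' unless defined $genes;
        $genes =~ s/\s+$//;                      # drop trailing whitespace
        $genes_for{$id} = exists $genes_for{$id}
            ? "$genes_for{$id} , $genes"         # ID seen before: append
            : $genes;
    }

    my @out;
    for my $raw (@$ids) {
        ( my $id = $raw ) =~ s/^\s+|\s+$//g;     # trim stray whitespace
        push @out, exists $genes_for{$id}
            ? "$id\t$genes_for{$id}\n"
            : "$id\n";                           # no association: ID alone
    }
    return @out;
}

# Hypothetical sample data; 'hsa-miR-999' is made up to show the blank case.
my @file_a = ( 'hsa-let-7d', 'hsa-miR-999', 'hsa-let-7d ' );
my @file_b = ( "hsa-let-7d\tKRAS , HMGA2\n", "hsa-let-7d \tBDNF , D3R\n" );
print annotate_ids( \@file_a, \@file_b );
```

Because the output is driven by file_a, it has exactly as many lines as file_a, which is the property asked for above.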


FishMonger
Veteran / Moderator

Apr 5, 2013, 12:13 PM

Post #7 of 11 (871 views)
Re: [Thalakos] Merges 2 text files under few conditions

There are a couple of possible approaches. One would be to delete the items from the hash as you process "file_b.txt"; afterwards, if anything remains in the hash, output it as needed.
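
A minimal sketch of that approach, under some assumptions: the sub name `merge_with_leftovers` and its array arguments are hypothetical stand-ins for the files, and the deletions are made on a copy of the count hash (`%unmatched`) so that an ID listed twice in file_b still finds its count on the second row.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub merge_with_leftovers {
    my ($ids, $rows) = @_;

    my %count;
    $count{$_}++ for @$ids;       # occurrences of each ID in file_a
    my %unmatched = %count;       # entries are deleted as IDs turn up in file_b

    my $out = '';
    for my $row (@$rows) {
        my ($id) = split /\t/, $row;
        next unless exists $count{$id};   # this sketch skips rows not in file_a
        $out .= $row x $count{$id};
        delete $unmatched{$id};
    }

    # Whatever remains never appeared in file_b: emit the bare ID,
    # once per occurrence in file_a, so no line of file_a is lost.
    for my $id ( sort keys %unmatched ) {
        $out .= "$id\n" x $unmatched{$id};
    }
    return $out;
}

print merge_with_leftovers(
    [ 'hsa-let-7e', 'hsa-let-7e', 'hsa-let-7f' ],
    [ "hsa-let-7e\tHMGA2\n" ],
);
```

The leftover IDs are repeated by their count; printing each leftover key only once would make the output shorter than file_a.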


Chris Charley
User

Apr 5, 2013, 12:52 PM

Post #8 of 11 (865 views)
Re: [Thalakos] Merges 2 text files under few conditions

As FishMonger suggested, you could make two changes to your script.

Change:
print $line x $ID{$id};

To:
print $line x delete $ID{$id};

And add this line at the end of your script:
print "$_\n" for keys %ID;


Thalakos
Novice

Apr 5, 2013, 1:25 PM

Post #9 of 11 (857 views)
Re: [Chris Charley] Merges 2 text files under few conditions

Thank you guys! Unfortunately the output is incomplete.
file_a.txt has 2026 lines, but the output has only 926 lines.
I need all 2026 lines (each with its associated genes, or a blank if no association is reported).
So that approach probably does not work; maybe another simple change to the script could do the job?


Code
#!/usr/bin/perl

use strict;
use warnings;

my %ID;

open my $a_fh, '<', 'file_a.txt' or die "failed to open file_a.txt: $!";
open my $b_fh, '<', 'file_b.txt' or die "failed to open file_b.txt: $!";

while ( my $id = <$a_fh> ) {
    chomp $id;
    $ID{$id}++;
}
close $a_fh;

while ( my $line = <$b_fh> ) {
    my ($id, $genes) = split /\t/, $line;

    if ( exists $ID{$id} ) {
        print $line x delete $ID{$id};
    }
    else {
        print $line;
    }
}
close $b_fh;
print "$_\n" for keys %ID;



(This post was edited by Thalakos on Apr 5, 2013, 1:27 PM)


Chris Charley
User

Apr 5, 2013, 5:50 PM

Post #10 of 11 (842 views)
Re: [Thalakos] Merges 2 text files under few conditions

Yes, you are correct: the code I posted doesn't produce the correct results. In addition, one of the lines in file_b, 'hsa-let-7d BDNF , D3R', had a space AND a tab, so it didn't parse correctly. Once the space was removed, leaving only the tab, it parsed correctly. I'm attaching the result of the run (o33.txt) and a reconciliation (file_c.txt), which shows the code now produces the desired results.

Code
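#!/usr/bin/perl
use strict;
use warnings;

my %ID;    # ID => number of times it appears in file_a.txt

open my $a_fh, '<', 'file_a.txt' or die "failed to open file_a.txt: $!";
open my $b_fh, '<', 'file_b.txt' or die "failed to open file_b.txt: $!";

while ( my $id = <$a_fh> ) {
    chomp $id;
    $ID{$id}++;
}
close $a_fh or die $!;

my %seen;    # IDs from file_a.txt that were found in file_b.txt
while ( my $line = <$b_fh> ) {
    my ($id, $genes) = split /\t/, $line;

    if ( exists $ID{$id} ) {
        print $line x $ID{$id};
        $seen{$id}++;
    }
    else {
        print $line;
    }
}
close $b_fh or die $!;

# Drop every ID that was matched, leaving only those absent from file_b.txt.
delete @ID{ keys %seen };

# Print each unmatched ID on its own line, once per occurrence in file_a.txt.
print "$_\n" x $ID{$_} for keys %ID;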

Make sure that every line in file_b is tab-separated, or you won't get accurate results. If you can't be sure, then split on whitespace instead of tabs:

Code
 my($id, $genes) = split /\s+/, $line, 2;
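
To see why the limit of 2 matters, here is a small sketch; the helper name `split_row` and the sample lines are assumptions made for illustration. The limit stops the split after the first run of whitespace, so the spaces inside the gene list survive, and `\s+` treats a stray space before the tab as part of the separator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one file_b row into (ID, gene list) at the first run of whitespace.
sub split_row {
    my ($line) = @_;
    chomp $line;
    my ($id, $genes) = split /\s+/, $line, 2;
    return ($id, $genes);
}

# Hypothetical sample rows: tab, single space, and space followed by tab.
for my $line ( "hsa-let-7e\tHMGA2\n",
               "hsa-let-7d BDNF , D3R\n",
               "hsa-let-7d \tBDNF , D3R\n" ) {
    my ($id, $genes) = split_row($line);
    print "[$id] [$genes]\n";
}
```

With `split /\t/` the third row would yield the ID 'hsa-let-7d ' with a trailing space, which does not match the hash key 'hsa-let-7d'.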



(This post was edited by Chris Charley on Apr 5, 2013, 5:52 PM)
Attachments: o33.txt (6.78 KB)
  file_c.txt (0.12 KB)


Thalakos
Novice

Apr 5, 2013, 9:48 PM

Post #11 of 11 (834 views)
Re: [Chris Charley] Merges 2 text files under few conditions

Awesome! It works great now.

Thanks a lot for your help, I appreciate it.

~Giorgio

 
 

