CGI/Perl Guide | Learning Center | Forums | Advertise | Login
Site Search: in

  Main Index MAIN
INDEX
Search Posts SEARCH
POSTS
Who's Online WHO'S
ONLINE
Log in LOG
IN

Home: Perl Programming Help: Beginner:
The hardest part!

 



CiTi
Novice

Jan 5, 2016, 7:46 AM

Post #1 of 16 (4526 views)
The hardest part! Can't Post

Hi everybody,
I'm using Perl for the first time but I had some problems...
My issue is:
I had three files with PDB format which contains aminoacid residues inner (files.pdb from Protein Data Bank, is a text file) like these:
##############################################################
ATOM 38 CB LEU A 7 -14.218 84.172 36.911 1.00 31.63 C
ATOM 39 CG LEU A 7 -15.382 83.292 36.448 1.00 38.00 C
ATOM 40 CD1 LEU A 7 -16.128 82.734 37.654 1.00 36.66 C
ATOM 41 CD2 LEU A 7 -16.319 84.099 35.564 1.00 42.94 C
ATOM 42 N VAL A 8 -12.451 81.170 37.190 1.00 28.11 N
ATOM 43 CA VAL A 8 -11.831 80.100 36.418 1.00 24.30 C
ATOM 44 C VAL A 8 -12.853 79.017 36.100 1.00 22.10 C
ATOM 45 O VAL A 8 -13.571 78.551 36.981 1.00 20.96 O
ATOM 49 N ASN A 9 -12.940 78.652 34.826 1.00 30.21 N
ATOM 50 CA ASN A 9 -13.868 77.619 34.393 1.00 31.33 C
ATOM 51 C ASN A 9 -13.094 76.309 34.273 1.00 32.01 C
ATOM 52 O ASN A 9 -12.256 76.149 33.382 1.00 37.54 O
############################################################
Now.... They are very similar and I should to merge it to obtain a single file.pdb which will contain only the commons residues.
Someone can Help me... I don't how make a damn script.
THANKS SO MUCH!!!


Laurent_R
Veteran / Moderator

Jan 5, 2016, 10:16 AM

Post #2 of 16 (4517 views)
Re: [CiTi] The hardest part! [In reply to] Can't Post

We for sure want to help you, and will do so, but this site is not a free code-writing service. It is a site about leaning Perl and getting help from more experienced Perl developers when something is too difficult.

In other words, you have to show that you are making some efforts at doing it yourself, and we will help you finding and fixing errors.

I suggest that you start with a good Perl tutorial to learn the basics of the language. You'll need to learn how to open a file in read mode, read contents of it line by line, open another file in write mode and print data to is. You'll also probably need to learn how to store data in a Perl data structure in memory, such as an array or a hash, and how to retrieve content from it. All these things are likely to be in the first 3 or 4 chapters of a good tutorial. So just a few hours of leaning should get you there.

What you want to do is really not difficult, so you don't need to learn the advanced features of Perl, but just the basic features.

Please show what you've tried and we will gladly help you forward.


CiTi
Novice

Jan 5, 2016, 10:26 AM

Post #3 of 16 (4516 views)
Re: [Laurent_R] The hardest part! [In reply to] Can't Post

I'm so sorry, i make a lot of tutorial and I know the basics of Perl but I don't specified this... Laugh.
So, I'm going to explain what have I done until now, soon as possible. Thanks so much and "Perl You Soon" Wink]


CiTi
Novice

Jan 8, 2016, 9:28 AM

Post #4 of 16 (4491 views)
Re: [CiTi] The hardest part! [In reply to] Can't Post

Hi everybody,
this is my code, with a lots of issues...I think so!
Why I can't obtain the OUTPUT file
pocket_final.pdb ?


Code
#!/usr/bin/perl  

use strict;
use warnings;
use Cwd;
use feature 'say';

my $cwd = getcwd();
my $site_1 = "$cwd/pock_1.pdb";
my $site_2 = "$cwd/pock_2.pdb";
my $site_final = "$cwd/pocket_final.pdb";

open my $FH1, "<", $site_1 or die "cannot open $site_1 $!";
open my $FH2, "<", $site_2 or die "cannot open $site_2 $!";
open my $out_fh, '>', $site_final or die "could not open $site_final <$!>";

my %model_1 = ( site_1 => "$site_1");
my %model_2 = ( site_2 => "$site_2");

my @site_com;
push @site_com, \%model_1;
push @site_com, \%model_2;
say ( \@site_com );

Thanks....Please don't kill me Unsure


Chris Charley
User

Jan 8, 2016, 12:06 PM

Post #5 of 16 (4479 views)
Re: [CiTi] The hardest part! [In reply to] Can't Post

Hello Citi

Your code does not produce any results and you are using hashes wrong. What part of the lines makes them related?

ATOM 38 CB LEU A 7 -14.218 84.172 36.911 1.00 31.63 C
ATOM 39 CG LEU A 7 -15.382 83.292 36.448 1.00 38.00 C
ATOM 40 CD1 LEU A 7 -16.128 82.734 37.654 1.00 36.66 C
ATOM 41 CD2 LEU A 7 -16.319 84.099 35.564 1.00 42.94 C

It would help if you posted some samples from file 1, 2 and 3 and show what the output would be for the common residues.


CiTi
Novice

Jan 8, 2016, 5:39 PM

Post #6 of 16 (4463 views)
Re: [Chris Charley] The hardest part! [In reply to] Can't Post

Thanks Chris for reply...
I will write you the ideal situation:
......
File_1.pdb
ATOM 1 CB LEU A 7 -14.218 84.172 36.911 1.00 31.63 C
ATOM 2 CG LEU A 7 -15.382 83.292 36.448 1.00 38.00 C
ATOM 3 CD1 LEU A 7 -16.128 82.734 37.654 1.00 36.66 C
ATOM 4 CD2 LEU A 7 -16.319 84.099 35.564 1.00 42.94 C
ATOM 9 N ASN A 9 -12.940 78.652 34.826 1.00 30.21 N
ATOM 10 CA ASN A 9 -13.868 77.619 34.393 1.00 31.33 C
ATOM 11 C ASN A 9 -13.094 76.309 34.273 1.00 32.01 C
ATOM 12 O ASN A 9 -12.256 76.149 33.382 1.00 37.54 O
######################################################
File_2.pdb
ATOM 5 N VAL A 8 -12.451 81.170 37.190 1.00 28.11 N
ATOM 6 CA VAL A 8 -11.831 80.100 36.418 1.00 24.30 C
ATOM 7 C VAL A 8 -12.853 79.017 36.100 1.00 22.10 C
ATOM 8 O VAL A 8 -13.571 78.551 36.981 1.00 20.96 O
ATOM 9 N ASN A 9 -12.940 78.652 34.826 1.00 30.21 N
ATOM 10 CA ASN A 9 -13.868 77.619 34.393 1.00 31.33 C
ATOM 11 C ASN A 9 -13.094 76.309 34.273 1.00 32.01 C
ATOM 12 O ASN A 9 -12.256 76.149 33.382 1.00 37.54 O
######################################################
File_3.pdb
ATOM 1 CB LEU A 7 -14.218 84.172 36.911 1.00 31.63 C
ATOM 2 CG LEU A 7 -15.382 83.292 36.448 1.00 38.00 C
ATOM 3 CD1 LEU A 7 -16.128 82.734 37.654 1.00 36.66 C
ATOM 4 CD2 LEU A 7 -16.319 84.099 35.564 1.00 42.94 C
ATOM 5 N VAL A 8 -12.451 81.170 37.190 1.00 28.11 N
ATOM 6 CA VAL A 8 -11.831 80.100 36.418 1.00 24.30 C
ATOM 7 C VAL A 8 -12.853 79.017 36.100 1.00 22.10 C
ATOM 8 O VAL A 8 -13.571 78.551 36.981 1.00 20.96 O
####################################################->
FINAL_FILE.pdb
ATOM 1 CB LEU A 7 -14.218 84.172 36.911 1.00 31.63 C
ATOM 2 CG LEU A 7 -15.382 83.292 36.448 1.00 38.00 C
ATOM 3 CD1 LEU A 7 -16.128 82.734 37.654 1.00 36.66 C
ATOM 4 CD2 LEU A 7 -16.319 84.099 35.564 1.00 42.94 C
ATOM 5 N VAL A 8 -12.451 81.170 37.190 1.00 28.11 N
ATOM 6 CA VAL A 8 -11.831 80.100 36.418 1.00 24.30 C
ATOM 7 C VAL A 8 -12.853 79.017 36.100 1.00 22.10 C
ATOM 8 O VAL A 8 -13.571 78.551 36.981 1.00 20.96 O
ATOM 9 N ASN A 9 -12.940 78.652 34.826 1.00 30.21 N
ATOM 10 CA ASN A 9 -13.868 77.619 34.393 1.00 31.33 C
ATOM 11 C ASN A 9 -13.094 76.309 34.273 1.00 32.01 C
ATOM 12 O ASN A 9 -12.256 76.149 33.382 1.00 37.54 O
......
As shown above the FINAL_FILE will contain only the common residues of File_1, File_2 and File_3.
The code I shown it's not capable to check and merge the files.
Did you understand what I want to said? Eventually I could write better.


Laurent_R
Veteran / Moderator

Jan 9, 2016, 6:10 AM

Post #7 of 16 (4457 views)
Re: [CiTi] The hardest part! [In reply to] Can't Post

Hi,
none of the data in your final file is common to the three input files.

So, I would assume that you want a record to be in the final file if it is present in at least any two of the input files. Is this correct? Or is it sufficient for one record to be in at least one of the input files to find its way into the final file?

Either way, the starting point would probably to load each file into a hash, probably storing all the records into the same hash would be sufficient from what I understand from your requirements.

In this hash, the key could be the full record, and the value a counter saying how many times you have seen that record. Once this is done, just read the hash again and print out what you need depending on the value of the counter.

Something like this:


Code
my %records; 

open my $IN, "<", $file1 or die "cannot open $file1 $!";
while (<$IN>) {
$records{$_}++;
}
close $IN;

# same code as above for $file2 and $file3

#once everything is loaded into %records:
for my $rec (sort keys $records) {
print $rec if $records{$rec} > 1; # record seen at least twice
}



(This post was edited by Laurent_R on Jan 9, 2016, 6:12 AM)


CiTi
Novice

Jan 9, 2016, 9:14 AM

Post #8 of 16 (4444 views)
Re: [Laurent_R] The hardest part! [In reply to] Can't Post

Thank you so much Laurent for reply...
In my case is it sufficient for one record to be in at least one of the input files to find its way into the final file.
So, using the code you wrote, I just type in the command line:

#### ./myscript.pl file_1.pdb file_2.pdb file_3.pdb > record.pdb ####

I had modified the code in this way:




Code
#!/usr/bin/perl 
use strict;
use warnings;

my %records;

open my $IN, "<", $file1 or die "cannot open $file1 $!";
while (<$IN>) {
$records{$_}++;
}
close $IN;

open my $IN, "<", $file2 or die "cannot open $file2 $!";
while (<$IN>) {
$records{$_}++;
}
close $IN;

open my $IN, "<", $file3 or die "cannot open $file3 $!";
while (<$IN>) {
$records{$_}++;
}
close $IN;

for my $rec (sort keys $records) {
print $rec if $records{$rec} > 1;
}


.....but, this is the error Frown :


Code
"my" variable $IN masks earlier declaration in same scope at ./ss4.pl line 13. 
"my" variable $IN masks earlier declaration in same scope at ./ss4.pl line 19.
Global symbol "$file1" requires explicit package name at ./ss4.pl line 7.
Global symbol "$file1" requires explicit package name at ./ss4.pl line 7.
Global symbol "$file2" requires explicit package name at ./ss4.pl line 13.
Global symbol "$file2" requires explicit package name at ./ss4.pl line 13.
Global symbol "$file3" requires explicit package name at ./ss4.pl line 19.
Global symbol "$file3" requires explicit package name at ./ss4.pl line 19.
Global symbol "$records" requires explicit package name at ./ss4.pl line 25.
Execution of ./ss4.pl aborted due to compilation errors.


What's wrong?
Thank you again...


Chris Charley
User

Jan 9, 2016, 2:51 PM

Post #9 of 16 (4430 views)
Re: [CiTi] The hardest part! [In reply to] Can't Post

For the first 2 errors, you can only declare my $IN one time, when you first declare it.

You can't use files to open in your program just because you passed them to the command line. You didn't even spell them correctly if you thought that could work.

$records is not the same as %records. That is the last error listed.


Laurent_R
Veteran / Moderator

Jan 9, 2016, 4:02 PM

Post #10 of 16 (4423 views)
Re: [CiTi] The hardest part! [In reply to] Can't Post

If you run the program this way:

Code
 ./myscript.pl file_1.pdb file_2.pdb file_3.pdb > record.pdb


Then you just need to retrieve the file names at the start of the program.

If you want to print a record that appears even only once, then you can remove the "if $records{$rec} > 1" in the main loop, as the hash will contain only eligible records anyway.

Adding the needed corrections given by Chris, you could get to something like this:


Code
#!/usr/bin/perl  
use strict;
use warnings;

my ($file1, $file2, $file3) = @ARGV;

my %records;

open my $IN, "<", $file1 or die "cannot open $file1 $!";
while (<$IN>) {
$records{$_}++;
}
close $IN;

open $IN, "<", $file2 or die "cannot open $file2 $!";
while (<$IN>) {
$records{$_}++;
}
close $IN;

open $IN, "<", $file3 or die "cannot open $file3 $!";
while (<$IN>) {
$records{$_}++;
}
close $IN;

for my $rec (sort keys %records) {
print $rec; # the conditional "if $records{$rec} > 1" is not needed
}


Please note that this code is explicitly saying everything that it does, and that's better for maximal clarity.

It could be made much shorter using some shortcuts provided by Perl, for example something like this (untested):


Code
#!/usr/bin/perl  
use strict;
use warnings;

my %records;
$records{$_} = 1 while <>; # will read all files whose name are passed on the command line
print for sort keys %records;


But I would advise you against using such shortcuts for the time being. Avoid them until you master the language much more. Also, it is less robust: for example, if one of the files passed to the program on the command line does not exist, the program will still happily run to its end (it will display a warning, but you might not notice).


CiTi
Novice

Jan 11, 2016, 1:26 AM

Post #11 of 16 (4389 views)
Re: [Laurent_R] The hardest part! [In reply to] Can't Post

Hi dears,
the code works awesome!
Thanks so much for help... I'll try to modify it for some problems related to the files.
I mean, in some case the same residues (in two or more different files) had the same ATOM NUMBER, the same RESIDUES NUMBER but different COORDINATES. So, the program let keep the residue twice if it is present with different coordinates.

############################################
File_1.pdb

#Atom Numb.#Res. Numb.###______X______Y______Z
ATOM___21__N__TRP_A__5_____8.422__-0.476__10.341
############################################
File_2.pdb

#Atom Numb.#Res. Numb.###______X______Y______Z
ATOM___21__N__TRP_A__5_____8.133__-0.355__11.200
############################################

In this case the two strings are obviously different for the program. But, I need just only one residue.
-> Can the program ignore these coordinate?

THANKS GUYS!!!! Wink


BillKSmith
Veteran

Jan 11, 2016, 11:18 AM

Post #12 of 16 (4372 views)
Re: [CiTi] The hardest part! [In reply to] Can't Post

As I understand your new requirement, you want to merge the three files, omitting a line if a previous line contains the same atom_number and residue_number. The following code works for your sample data as I cut/pasted it from your post.

Code
use strict; 
use warnings;
my %residue_data;
while (<>) {
push @{$residue_data{id($_)}}, $_;
}
print $residue_data{$_}[0] foreach sort keys %residue_data;
exit 0;

sub id {
my $record = shift;
my ($atom_number, $residues_number) = (split /\s+/, $record)[1,5];
return sprintf( "%3d%3d", $atom_number, $residues_number );
}


I am not sure how to extract the atom_number and residues_number in general. Your sample data does not
conform to the latest spec for ATOM records in a pdb file. (refer to page 187 in ftp://ftp.wwpdb.org/pub/pdb/doc/format_descriptions/Format_v33_Letter.pdf
Good Luck,
Bill


CiTi
Novice

Jan 11, 2016, 1:09 PM

Post #13 of 16 (4365 views)
Re: [BillKSmith] The hardest part! [In reply to] Can't Post

Thanks Bill...
I'm using your code but it gave some errors.
The code that Laurent wrote works well if I delete the coordinate columns.
I uploaded pdb files that i'm using. So you can check if there's somethings wrong.
Attachments: pocket_3ks3.pdb (10.8 KB)
  pocket_3t5u.pdb (12.4 KB)
  pocket_3t5z.pdb (11.7 KB)


BillKSmith
Veteran

Jan 11, 2016, 4:15 PM

Post #14 of 16 (4356 views)
Re: [CiTi] The hardest part! [In reply to] Can't Post

Your attached files do conform to the spec I referenced in my previous post. Replace the last two lines of the function 'id' with:

Code
    my ($atom_number, $residues_number) = unpack '@6A5@22A4', $record; 
return $atom_number . $residues_number;
}


Note:
The unpack format is derived directly from the spec, not inferred from the samples, therefore It should work for any valid ATOM record.

Each file ends with an "END" record which you did not mention. You can take care of these yourself. Use the spec to be sure that you get it right!
Good Luck,
Bill


CiTi
Novice

Jan 13, 2016, 6:52 AM

Post #15 of 16 (4336 views)
Re: [BillKSmith] The hardest part! [In reply to] Can't Post

Dear Bill,
you're right! I showed wrong samples. I'm sorry...

I'm using now the following code:


Code
#!/usr/bin/perl   
use strict;
use warnings;

my ($file1, $file2, $file3) = @ARGV;

my %residue_data;
while (<>) {
push @{$residue_data{id($_)}}, $_;
}
print $residue_data{$_}[0] foreach sort keys %residue_data;
exit 0;

&id;

sub id {
my $record = shift;
my ($atom_number, $residues_number) = unpack '@6xA5@22xA4', $record;
return $atom_number . $residues_number;
}


....but, the output file don't give to me just the specified column with the 'unpack' function.
What I wrong?


BillKSmith
Veteran

Jan 13, 2016, 8:44 AM

Post #16 of 16 (4329 views)
Re: [CiTi] The hardest part! [In reply to] Can't Post

I expect to see some effort on your part. I have already done "The hardest part". Unpack extracts the two numbers from ATOM records of valid .pdb files and uses them to organize the records in memory. You must handle other record types yourself. I told you the script was not complete!
Good Luck,
Bill

 
 


Search for (options) Powered by Gossamer Forum v.1.2.0

Web Applications & Managed Hosting Powered by Gossamer Threads
Visit our Mailing List Archives