Bioinformatics task

 



yaroba
Novice

May 19, 2013, 11:42 AM

Post #1 of 16 (1057 views)

Hello,

I am writing a program that reads data (codons) from a .txt file and converts it to amino acids. But I am having a problem with a situation like this:
UUUAGUGGAUU
U

The result should be: FSGF. But I don't know how to join the U on the second line to the end of the first line, so that the three U's together are converted to F.

Can anyone help me?


recruiter
User

May 19, 2013, 2:01 PM

Post #2 of 16 (1051 views)
Re: [yaroba] Bioinformatics task

Is the 'U' on the next line by itself? It would help if you post your code where you're having trouble, what exactly your text file consists of and looks like and what you have already tried.


(This post was edited by hwnd on May 19, 2013, 2:17 PM)


BillKSmith
Veteran

May 19, 2013, 3:18 PM

Post #3 of 16 (1040 views)
Re: [yaroba] Bioinformatics task

Most of us are not biologists. You would get far more answers if you would express your problem as a text manipulation problem rather than as a biology problem. I have no idea what you mean by 'add up' or 'convert to F' and only the vaguest idea of what an amino acid is.
Good Luck,
Bill


Laurent_R
Veteran / Moderator

May 19, 2013, 3:59 PM

Post #4 of 16 (1038 views)
Re: [yaroba] Bioinformatics task

I am not a biologist either, and my last biology course was in 1974, at a time when nobody even dreamed of ever decoding the human genome (and nobody even dreamed of having a computer on her or his desktop), so I agree with those who answered before me. Please state your problem in terms of data manipulation and processing, so that non-biologists can help you.

From what you said, however, I understand that your data forms patterns and that the patterns you are searching for sometimes span a line break and are then not recognized. Is my understanding correct? I can think of at least half a dozen solutions or workaround tricks for this, but we need more information to figure out which are relevant to your problem and which are not possible.

You should provide the code you are using (or a sufficient subset of your code displaying the problem), an idea of the size of the files you are working on (some of the best solutions for a 10 MB file will not be suitable for a file of 2 GB or more), a slightly larger sample of the data you are working on, and an explanation of the data processing rules that need to be applied.


yaroba
Novice

May 19, 2013, 4:29 PM

Post #5 of 16 (1034 views)
Re: [hwnd] Bioinformatics task

The main point of this program is that every group of 3 letters (for example UUU) is converted to one letter (F).
I found a site with a lot of useful info: http://sinsixx.com/tutorials/Beginning%20Perl%20for%20Bioinformatics/75.htm

By the way, here is my code:

# The die message "nepavyko nuskaityti failo" means "failed to read the file"
open(DUOMENYS, "<seka.txt") or die "nepavyko nuskaityti failo: $!";

$DNR = <DUOMENYS>;
$DNR =~ s/\s//g;
$ilgis = length($DNR);
$ilgis -= ($ilgis % 3);
$DNR = substr($DNR, 0, $ilgis);
$poz = 0;

while($poz < length($DNR)){
$nukleo = substr($DNR, $poz, 3);

if ($nukleo =~ /UU[UC]/) {$aminas = "F"} # Phenylalanine
elsif ($nukleo =~ /(UU[AG])|(CU[ACGU])/) {$aminas = "L"} # Leucine
elsif ($nukleo =~ /(UC[ACGU])|(AG[UC])/) {$aminas = "S"} # Serine
elsif ($nukleo =~ /UA[UC]/) {$aminas = "Y"} # Tyrosine
elsif ($nukleo =~ /(UA[AG])|UGA/) {$aminas = "_"} # STOP
elsif ($nukleo =~ /UG[UC]/) {$aminas = "C"} # Cysteine
elsif ($nukleo =~ /UGG/) {$aminas = "W"} # Tryptophan
elsif ($nukleo =~ /CC[ACGU]/) {$aminas = "P"} # Proline
elsif ($nukleo =~ /CA[UC]/) {$aminas = "H"} # Histidine
elsif ($nukleo =~ /CA[AG]/) {$aminas = "Q"} # Glutamine
elsif ($nukleo =~ /(CG[ACGU])|(AG[AG])/) {$aminas = "R"} # Arginine
elsif ($nukleo =~ /AU[UCA]/) {$aminas = "I"} # Isoleucine
elsif ($nukleo =~ /AUG/) {$aminas = "M"} # Methionine
elsif ($nukleo =~ /AC[ACGU]/) {$aminas = "T"} # Threonine
elsif ($nukleo =~ /AA[UC]/) {$aminas = "N"} # Asparagine
elsif ($nukleo =~ /AA[AG]/) {$aminas = "K"} # Lysine
elsif ($nukleo =~ /GU./) {$aminas = "V"} # Valine
elsif ($nukleo =~ /GC[ACGU]/) {$aminas = "A"} # Alanine
elsif ($nukleo =~ /GA[UC]/) {$aminas = "D"} # Aspartic acid
elsif ($nukleo =~ /GA[AG]/) {$aminas = "E"} # Glutamic acid
elsif ($nukleo =~ /GG[ACGU]/) {$aminas = "G"} # Glycine
else {$aminas = "X"} # unknown

$baltymas .= $aminas;

$poz += 3;
}

close(DUOMENYS);

open(REZULTATAI, ">isvedimas.txt");
print REZULTATAI $baltymas;
close(REZULTATAI);


(This post was edited by yaroba on May 19, 2013, 5:11 PM)
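As a side note on the codon-to-letter step described in this post, the same mapping could also be held in a hash instead of a chain of regex tests. The following is only a minimal sketch, not the poster's method: the names `%codon_to_amino` and `translate` are made up for illustration, and the 64-letter string is the standard genetic code written in U, C, A, G order with `_` marking STOP, as in the code above.

```perl
use strict;
use warnings;

# Standard genetic code, ordered by first, second, third base over U, C, A, G.
# "_" marks a STOP codon, matching the convention used in the thread.
my @bases  = qw(U C A G);
my $aminos = 'FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';

# Build the lookup table: 4 x 4 x 4 = 64 codons (hypothetical helper names).
my %codon_to_amino;
my $index = 0;
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            $codon_to_amino{"$first$second$third"} = substr($aminos, $index++, 1);
        }
    }
}

# Translate a whitespace-free RNA string; an incomplete trailing codon is ignored
# and an unrecognized codon becomes "X".
sub translate {
    my ($rna) = @_;
    my $protein = '';
    for my $codon ($rna =~ /(.{3})/g) {
        $protein .= exists $codon_to_amino{$codon} ? $codon_to_amino{$codon} : 'X';
    }
    return $protein;
}

print translate('UUUAGUGGAUUU'), "\n";   # prints FSGF
```

A hash lookup avoids testing up to 21 patterns per codon, but it does not by itself solve the line-break problem asked about in this thread.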


recruiter
User

May 19, 2013, 4:36 PM

Post #6 of 16 (1030 views)
Re: [yaroba] Bioinformatics task

OK, my question is: how are the lines formatted in input2.txt?


yaroba
Novice

May 19, 2013, 4:52 PM

Post #7 of 16 (1021 views)
Re: [hwnd] Bioinformatics task

There are no specific requirements for line format. Just put one letter after another :)


recruiter
User

May 19, 2013, 5:03 PM

Post #8 of 16 (1015 views)
Re: [yaroba] Bioinformatics task

Attach your input2.txt file so I can clearly see what we're working with.


yaroba
Novice

May 19, 2013, 5:12 PM

Post #9 of 16 (1009 views)
Re: [hwnd] Bioinformatics task

Input contains:
UUUAGUGGAUU
U

Output contains :
FSGF

Both files are .txt files.

Also, take a look at my code; I changed it a bit.


(This post was edited by yaroba on May 19, 2013, 5:12 PM)


recruiter
User

May 19, 2013, 5:20 PM

Post #10 of 16 (1006 views)
Re: [yaroba] Bioinformatics task

OK, so you just need the following 'U' appended to that line?

Try replacing

Code
$DNR =~ s/\s//g;



With

Code
$DNR =~ s/\r\n$//g if $. % 2;



yaroba
Novice

May 19, 2013, 5:32 PM

Post #11 of 16 (991 views)
Re: [hwnd] Bioinformatics task

I need my program to read a line and produce one output letter for every three input letters; if a line is one or two letters short, it should take the missing letters from the next line. By the way, your code does not work :/ I should get F, but I get X.

Nonetheless, thanks for your willingness to help me:)
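One plausible reading of the X result, assuming seka.txt uses plain "\n" line endings (the thread does not say): the suggested `s/\r\n$//g` finds no "\r\n" to remove, so the newline stays in the string and the last three-character chunk becomes "UU\n", which matches none of the codon patterns. The second line of the file is still never read. A small sketch of that effect follows; the variable names are illustrative only, and the `if $. % 2` condition is left out because it is true for the first line anyway.

```perl
use strict;
use warnings;

# First line as it would be read from a file with Unix ("\n") line endings.
# This is an assumption; with "\r\n" endings the substitution would match.
my $line = "UUUAGUGGAUU\n";

# The substitution suggested above: there is no "\r" here, so nothing is removed.
(my $dnr = $line) =~ s/\r\n$//g;

# Cut into three-character chunks the same way the original loop does.
my @chunks;
for (my $pos = 0; $pos < length $dnr; $pos += 3) {
    push @chunks, substr($dnr, $pos, 3);
}

# Show the chunks with the newline made visible.
for my $chunk (@chunks) {
    (my $visible = $chunk) =~ s/\n/\\n/g;
    print "[$visible]\n";
}
# prints [UUU], [AGU], [GGA], [UU\n], one per line
```

Under that assumption the first three chunks translate to F, S and G, and the fourth falls through to the "unknown" branch.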


recruiter
User

May 19, 2013, 5:38 PM

Post #12 of 16 (989 views)
Re: [yaroba] Bioinformatics task

OK, well, you clearly didn't state exactly that earlier. With your input data, is it always going to be just one letter on every other line?


yaroba
Novice

May 19, 2013, 5:45 PM

Post #13 of 16 (983 views)
Re: [hwnd] Bioinformatics task

Yes, it is going to be one letter in the output for every three in the input. UUU in the input should be F in the output, and so on.


Laurent_R
Veteran / Moderator

May 20, 2013, 2:45 AM

Post #14 of 16 (948 views)
Re: [yaroba] Bioinformatics task

If your file is not too large, then you can probably remove all end-of-line characters prior to parsing.
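A minimal sketch of that idea, assuming the whole file fits comfortably in memory: read everything in one go, delete all whitespace (which includes the line breaks), then cut the result into groups of three. An in-memory string stands in for seka.txt here so the example is self-contained; the variable names are illustrative.

```perl
use strict;
use warnings;

# Stand-in for the contents of the input file from the thread.
my $sample = "UUUAGUGGAUU\nU\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";

# Undefining $/ locally makes <$fh> return the whole file as one string.
my $dnr = do { local $/; <$fh> };
close $fh;

# Remove every whitespace character, including "\n" and "\r".
$dnr =~ s/\s+//g;

# Split into complete codons; a trailing group of one or two letters is dropped.
my @codons = $dnr =~ /(.{3})/g;
print "@codons\n";   # prints UUU AGU GGA UUU
```

With a real file, the `open` line would take the file name instead of a reference to a string; the rest stays the same.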


BillKSmith
Veteran

May 20, 2013, 4:37 AM

Post #15 of 16 (945 views)
Re: [yaroba] Bioinformatics task

Is this what you need? I believe that one line answers your question.

Code
$DNR .= <DATA> if $ilgis % 3;


I have made several other changes to your script.

  • Added use strict and use warnings.

  • Changed the logic to make it easier to read.

  • Changed the parentheses in the regexes to be non-capturing.

  • Changed the opens to the three-argument form and changed the file handle to a lexical variable.

  • Temporarily changed file I/O to DATA and STDOUT to demonstrate the other changes.



Code
use strict;
use warnings;
#open(my $DUOMENYS, '<', 'seka.txt') or die "nepavyko nuskaityti failo: $!";
#my $DNR = <$DUOMENYS>;
my $DNR = <DATA>;
$DNR =~ s/\s//g;
my $ilgis = length($DNR);
# If the first line does not end on a codon boundary, append the next line.
#$DNR .= <$DUOMENYS> if $ilgis % 3;
$DNR .= <DATA> if $ilgis % 3;
$ilgis = length($DNR);
$ilgis -= ($ilgis % 3);
$DNR = substr($DNR, 0, $ilgis);
my $poz = 0;
my $baltymas;
while($poz < length($DNR)){
    local $_ = substr($DNR, $poz, 3);
    my $aminas = m/UU[UC]/ ? 'F' : # Phenylalanine
    m/(?:UU[AG])|(?:CU[ACGU])/ ? 'L' : # Leucine
    m/(?:UC[ACGU])|(?:AG[UC])/ ? 'S' : # Serine
    m/UA[UC]/ ? 'Y' : # Tyrosine
    m/(?:UA[AG])|UGA/ ? '_' : # STOP
    m/UG[UC]/ ? 'C' : # Cysteine
    m/UGG/ ? 'W' : # Tryptophan
    m/CC[ACGU]/ ? 'P' : # Proline
    m/CA[UC]/ ? 'H' : # Histidine
    m/CA[AG]/ ? 'Q' : # Glutamine
    m/(?:CG[ACGU])|(?:AG[AG])/ ? 'R' : # Arginine
    m/AU[UCA]/ ? 'I' : # Isoleucine
    m/AUG/ ? 'M' : # Methionine
    m/AC[ACGU]/ ? 'T' : # Threonine
    m/AA[UC]/ ? 'N' : # Asparagine
    m/AA[AG]/ ? 'K' : # Lysine
    m/GU./ ? 'V' : # Valine
    m/GC[ACGU]/ ? 'A' : # Alanine
    m/GA[UC]/ ? 'D' : # Aspartic acid
    m/GA[AG]/ ? 'E' : # Glutamic acid
    m/GG[ACGU]/ ? 'G' : # Glycine
    'X' ; # unknown
    $baltymas .= $aminas;
    $poz += 3;
}
#close $DUOMENYS;

#open my $REZULTATAI, '>', 'isvedimas.txt';
#print {$REZULTATAI} $baltymas;
print $baltymas;
#close $REZULTATAI;
__DATA__
UUUAGUGGAUU
U


OUTPUT:

Code
FSGF

Good Luck,
Bill
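The script above appends a single following line. For input where codons may be split across many lines, or for files too large to read at once (the size concern raised earlier in the thread), one possible generalization is to read line by line and carry the leftover letters over to the next line. This is only a sketch under those assumptions, not part of the posted answer; `each_codon` and its callback argument are hypothetical names.

```perl
use strict;
use warnings;

# Call $callback once for every complete codon read from $fh, carrying
# incomplete codons across line breaks. Returns any leftover letters (0-2).
sub each_codon {
    my ($fh, $callback) = @_;
    my $carry = '';
    while (my $line = <$fh>) {
        $line =~ s/\s+//g;          # drop the line ending and any stray whitespace
        $carry .= $line;
        # Four-argument substr removes the first three letters and returns them.
        $callback->(substr($carry, 0, 3, '')) while length($carry) >= 3;
    }
    return $carry;
}

# In-memory stand-in for a file whose codons are split across several lines.
my $sample = "UUUAGUGGAUU\nU\nGGACU\nUAA\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";

my @codons;
my $leftover = each_codon($fh, sub { push @codons, $_[0] });
close $fh;

print "@codons\n";                 # prints UUU AGU GGA UUU GGA CUU
print "leftover: '$leftover'\n";   # prints leftover: 'AA'
```

Only the current line and at most two carried letters are held in memory, so in a real run the callback could translate each codon and write the result out immediately instead of collecting codons in an array.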


yaroba
Novice

May 20, 2013, 5:38 AM

Post #16 of 16 (943 views)
Re: [BillKSmith] Bioinformatics task

Thank you very much, Bill! :)

     
     

