CGI/Perl Guide | Learning Center | Forums | Advertise | Login
Site Search: in

  Main Index MAIN
INDEX
Search Posts SEARCH
POSTS
Who's Online WHO'S
ONLINE
Log in LOG
IN

Home: Perl Programming Help: Intermediate:
DNA Sequence Count Perl Program HELP!

 

First page Previous page 1 2 Next page Last page  View All


Chris Charley
User

Oct 27, 2012, 11:02 AM

Post #26 of 34 (6340 views)
Re: [Laurent_R] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Laurent, your code is somewhat less dense than the BioPerl one. :-)

A problem (?) might be that it is counting newline characters in the window (by looking at his sample data snapshot earlier in this thread). Don't know if this could significantly affect the results. The newline issue is why I posted the BioPerl solution as it doesn't count newline characters.

Good catch on the division by zero error likelyhood. My solution didn't check for illegal division by zero, so would have to be amended.

Just some thoughts.


(This post was edited by Chris Charley on Oct 27, 2012, 11:19 AM)


Laurent_R
Veteran / Moderator

Oct 27, 2012, 11:47 AM

Post #27 of 34 (6332 views)
Re: [Chris Charley] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Hi Chris,

Thanks for your comments.

I usually don't try to make code too concise on forums, because this could make it more difficult to understand to the poster.

In the initial discussions I had with the OP, he or she told me that that there was no new lines, the file consisted only of the four letters representing bases. So I assumed there is no new line.

Having said that, if a Perl module is doing just that, I would definitely recommand using it.


Chris Charley
User

Oct 27, 2012, 1:13 PM

Post #28 of 34 (6311 views)
Re: [Laurent_R] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

His attachment in reply to your asking him about the newlines was all I could find. But, the only fasta files I've seen had newlines in the data. But, that looks like a unix type terminal window and I don't know anything about them.
The data may be without newlines.

The lines don't extend to the right side of the terminal window.

I'm not an expert with fasta files. I've just picked up bits and pieces by reading threads on BioPerl issues.


(This post was edited by Chris Charley on Oct 27, 2012, 1:46 PM)


Laurent_R
Veteran / Moderator

Oct 27, 2012, 2:08 PM

Post #29 of 34 (6279 views)
Re: [Chris Charley] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

To tell the truth, I had not seen the attachment, but only the text of the message.

And the attachment does not say too much about the real file, we don't know how the attachement was generated and on which terminal, there may be carriage returns or line feeds, or not, we don't really know.

I have no knowledge on DNA sequences, but I would tend to think that the presence or absence of new lines in the file would not change significantly the G-C skew.

Again, you pointed to some existing Perl modules which I did not know anything about. If they do what is required here, it is certainly a better option, because they have most certainly been tested thouroughly with real data much better than what I could do with my pseudo-DNA files (HTML files in my earlier posts, Perl code files in my most recent ones).


Laurent_R
Veteran / Moderator

Oct 28, 2012, 3:02 AM

Post #30 of 34 (6132 views)
Re: [Chris Charley] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Hi,

I just made a test with Chris Charley's DNA sequence file. Contrary to what I thought and wrote earlier, the end of line characters alter the computed skew quite significantly.

Example of a few windows on the DNA file as posted by Chris:


Code
Pass 1 Window Number 1 : there are 424 guanine and 443 cytosine in this window. Skew value is: -0.0989374637120816 
Pass 1 Window Number 2 : there are 451 guanine and 369 cytosine in this window. Skew value is: 0.542005420054201
Pass 1 Window Number 3 : there are 363 guanine and 359 cytosine in this window. Skew value is: 0.0308644356823741


I then removed the new line/carriage return characters from the same file:


Code
$ perl -i -pe 's/\r\n//g' test.txt


And, running the same test with the new file without the carriage returns:


Code
Pass 1 Window Number 1 : there are 437 guanine and 469 cytosine in this window. Skew value is: -0.150618713433777 
Pass 1 Window Number 2 : there are 467 guanine and 375 cytosine in this window. Skew value is: 0.582739509105305
Pass 1 Window Number 3 : there are 395 guanine and 389 cytosine in this window. Skew value is: 0.0393473584806673


So, Kevone, you should remove these new line/carriage return characters from your file if you are going to use my suggested program (you can do it with my Perl one-liner above, but you may have to change it depending on whether you are on Unix or Windows).

Another point is that I found a bug in my calculation of the skew: in the formula, I use $window_size for the total number of nucleobases analyzed, whereas I should have used the actual number of bytes read, otherwise the calculated skew is wrong whenever I reach the end of the file (i.e. for each last window of each pass).

So the line calculating the skew is now as follows:


Code
		my $skew_value = ($size_read / $nr_C)*($nr_G - $nr_C)/($nr_G + $nr_C);


Without that change, I had the following result for the first pass:



Code
Pass 1 Window Number 1 : there are 437 guanine and 469 cytosine in this window. Skew value is: -0.150618713433777 
Pass 1 Window Number 2 : there are 467 guanine and 375 cytosine in this window. Skew value is: 0.582739509105305
Pass 1 Window Number 3 : there are 395 guanine and 389 cytosine in this window. Skew value is: 0.0393473584806673
Pass 1 Window Number 4 : there are 467 guanine and 416 cytosine in this window. Skew value is: 0.277680982663995
Pass 1 Window Number 5 : there are 486 guanine and 436 cytosine in this window. Skew value is: 0.248761169376505
Pass 1 Window Number 6 : there are 509 guanine and 506 cytosine in this window. Skew value is: 0.0116824704530851
Pass 1 Window Number 7 : there are 546 guanine and 534 cytosine in this window. Skew value is: 0.0416146483562214
Pass 1 Window Number 8 : there are 552 guanine and 524 cytosine in this window. Skew value is: 0.0993217741706632
Pass 1 Window Number 9 : there are 507 guanine and 555 cytosine in this window. Skew value is: -0.162874739145926
Pass 1 Window Number 10 : there are 194 guanine and 217 cytosine in this window. Skew value is: -0.515770235572449
Pass 2 Window Number 1 : there are 421 guanine and 482 cytosine in this window. Skew value is: -0.280301254922504
Pass 2 Window Number 2 : there are 454 guanine and 358 cytosine in this window. Skew value is: 0.660483804386713


With this correction, I have now this:


Code
Pass 1 Window Number 1 : there are 437 guanine and 469 cytosine in this window. Skew value is: -0.150618713433777 
Pass 1 Window Number 2 : there are 467 guanine and 375 cytosine in this window. Skew value is: 0.582739509105305
Pass 1 Window Number 3 : there are 395 guanine and 389 cytosine in this window. Skew value is: 0.0393473584806673
Pass 1 Window Number 4 : there are 467 guanine and 416 cytosine in this window. Skew value is: 0.277680982663995
Pass 1 Window Number 5 : there are 486 guanine and 436 cytosine in this window. Skew value is: 0.248761169376505
Pass 1 Window Number 6 : there are 509 guanine and 506 cytosine in this window. Skew value is: 0.0116824704530851
Pass 1 Window Number 7 : there are 546 guanine and 534 cytosine in this window. Skew value is: 0.0416146483562214
Pass 1 Window Number 8 : there are 552 guanine and 524 cytosine in this window. Skew value is: 0.0993217741706632
Pass 1 Window Number 9 : there are 507 guanine and 555 cytosine in this window. Skew value is: -0.162874739145926
Pass 1 Window Number 10 : there are 194 guanine and 217 cytosine in this window. Skew value is: -0.219202350118291
Pass 2 Window Number 1 : there are 421 guanine and 482 cytosine in this window. Skew value is: -0.280301254922504
Pass 2 Window Number 2 : there are 454 guanine and 358 cytosine in this window. Skew value is: 0.660483804386713


As it can be seen, the computed skew values are the same for every window, except the last window of each pass (window number 10 in the above example). So I think you should probably apply the same correction.

EDIT: I checked Chris Charley's program, it has the same bug in the skew calculation (since Chris probably reused my formula). So you should modify his program if it is the one you are going to use.


(This post was edited by Laurent_R on Oct 28, 2012, 3:08 AM)


Chris Charley
User

Oct 28, 2012, 9:01 AM

Post #31 of 34 (6116 views)
Re: [Laurent_R] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Here is a portion of the sample input fasta file and I want to explain what it contains.


Code
>NM_198399 1 
aacagattttaactctgaaaagccatttccagtgtctatagactattgtg
agcctggagaagtagcatttagttgggatagcttcactagagctgcctgc
caaagacttccttccacaggatcttgtcgcaccagcaactgacaggagct
tgggagctcgggagcttgggagagggcttatgtttttaataatgtagctg
tcagttcgaagcctggaaatgttgaccctcaaagggcataaaatcttgtt
attttaatttgcatctgggagaatgtctgagcaaggagacctgaatcagg
caatagcagaggaaggagggactgagcaggagacggccactccagagaac
ggcattgttaaatcagaaagtctggatgaagaggagaaactggaactgca
gaggcggctggaggctcagaatcaagaaagaagaaaatccaagtcaggag
caggaaaaggtaaactgactcgcagccttgctgtctgtgaggaatcttct
gccagaccaggaggtgaaagtcttcaggatcagactctctgaaaactgca
aatggaaaggaattcaaaagaatttagattaaaagttaaataaaaagtag
gcacagtagtgctgaattttcctcaaaggctctcttttgataaggctgaa
ccaaatataatcccaagtatcctctctccttccttgttggagatgtctta
cctctcagctccccaaaatgcacttgcctataagaaacacaattgctggt
tcatatgaaacttaggaaatagtgaataaggtgcatttaactttggagaa
atacttttatggctttggtggagatttctcaatactgcaaaagttgtcca
gaaatgaatctgagctgatggtgactttaagttaatattattaatatatc
actgcatatttttacccttatttttgctccttacagcaagattagtaggt
tataaaaatttaaatttaaacaaaattatttcatgacaaaatgggaaact
tcacatcatacttatttttgtttgcctttcaggcatcatattagctttta
taaaaaatggtcttgctgctgaaattgtacttattttatcagaggctggg
tgcagtcaagacaaaagtaaaatggtttacctgagcccaggggagggaaa
attgattaagatatcattatttttgtttggtttggttttgcttttttcct
cttactttaattgaaatactctgaattcccctcatggaaacagagagcat
tgagagcactttctttaaaaggaccaaaaataaattcctaatagattttg
tcctaagagagtgtttttttttctagcatcattttctttacatgccactc
atgtcataaggcatggacaggctatctttcagtggccattactatgtttc
gtacacatgctttattttacttgggctctgagaaatgtgtggctttcctt
cagcattttatttgtgcttctctttttaatggagattgaaaagggagaat
aatgtgaatatcacggcttatattattaaatgttgattgatggcttgtaa
tgtactgcacacaatatatgttaactctgcagaatgacagaccctgggag
aagtaatgccccagttgtcccccactcctaatgccaggcagagaaggaca
gcctttatagacttaatctgctttttgtcccatttgacaaggtaccagga
ggaaattttttaagggatcaactgtatcacagtgcccactctggacctaa
gtctagtgtatccatacaattggtgcagagaaataaggtgtaaatggtgc
tttgttcctgctggttccaagctcagaaaccaagactagctttgtaggag
agaatgagagcctgcaagcctctctttggattggctgaggagtggtggga
gcagggggttgatagaaaacatccagacacacatataagcaagtggccgt
gctacctttttagagaataaagaaacagacttttgagtttatatgcaatg
ccttcattaggtaccaccggcacttacaaaatgtgcggactgaatcccag
agaacactggcagatgtatacagtatatggattgtatcgcttccccaatg
tttgtaaattcacagtatttggaaaactgccttcattttccagtgtggga
aaaactcttgctacctgtattacttgatctcagacccatacctgatggtt
cagtctgtccttaagttaaaagaattttgcttttctaatgttatactatt
tacctgtcagtgtattactgcaacttgaatcactcttttactgttgttgg
atataaacttatcctgtaccaatgtatttattaacacttgtattttatta
ttgagcatatcaataaaaatattaaaaaataacagattgttttttaccaa
aaaaaaaaaaaaa
>NR_026816 1
caacccactctctgtgctatgacttcattactctttcccagcccagccct
gggcaagccccttacgaagtctcaggctacctggatgaccaccctttctt
atgatgctgcaaggagggcaggtgggcagagccccgtgcatcctgggctc
aggccagggacccaagagcttgggagaagctggttctcagactgaaggcc
agagcccagcaccttgtcaccatcccggggagcatcatggcacacaacaa
ccagagccaaggctacagctagagagttgactcctctatttgagattgac
aggcctcggaagtcaaaataagtggtttcctagaccgggtcgagagcaag
tctctattggtcccaactgagttttttcagctggtttttcaaccaaacag
cacctcatctcccagtgaggggaagggaaggctgggctgagagcagcaag
gctgctcatctcacctctccccacccagccatgccagccgcctcacctgg
tggggagaggtgggcctcacctgggtcccctggcagtgctctgtgaaggg
tcttgacattgcactgtaataataaaggtgtgtgtgaagtatcaaaaaaa
>NR_027917 1
atgaagatgattgagcagcacaatcaggaatacagggaagggaaacacag
cttcacaatggccatgaacgcctttggagaaatgaccagtgaagaattca
ggcaggtggtgaatggctttcaaaaccagaagcacaggaaggggaaagtg
ctccaggaacctctgcttcatgacatccgcaaatctgtggattggagaga
gaaaggctacgtgactcctgtgaaggatcagtgcagctggggctctgtaa
ggacagatgttaggaaaactgagaaactagtttcactgagtgtgcagacc
tggtggactgctctaggcttcaaggcaatgttggctgcatttttggagaa
ccattattttgcttccagtatgttgccgacaatggaggcctggactctga
ggaatccttttcatatgaagaaaagctctggagactggaaagtccaaggt
cacagaggtgcatctggtgagagccttcttgctagtggggaatctcagca
gagtcctgaggtggcacagtattctgggaagcatcaagtgcagtgtcatc
ttatcgaggaggctctgcagatgctaagtggtggggatgaggatcacgat
gaagacaaatggccccatgacatgaggaatcatctggctggagaggccca
ggtgtag
>NR_002777 3
cttgtcctttcagaagatcagagacaagtgatatctgtgccaatttggcc
ttttcagtgttataattatggtgtcttgggatcccaatatttctcctaat
gtttccctgatgtgatactttgagagcccaggatgccagtacaataattg
aaattcacaaatgtctggtatcttgtccctcgtgccccatatattatctg
tggtttcggagagctcacttgtctcttatcttcagaaatgacagcacatg
aaatgttgtttggagccactgtcacatcaactgtagaaaaattaacaggt
cagctaagggatataatgtaactttatttgtgatatgagagaaatcttga
taaagacttgagagaaaactgggaggaaccttgtttagaagttataagga
ggggtaagttatgtgtgtcttggaaggagaatcataaatcttaaaacatg
agcctaatagagaacataaaattctaaaagataaagataataataatgat
aagccgcagggtggcttatgataatgtgacttctccttaccccagtagcg
tcggacatctgtcagctctgaaatgataaaaatgcacaatattgaataca
aacaaaggagtcagcactgaaattcattttctctccagattagggaaaga
gtaggtatgccctatggtagggcagtaaattgctgaatgatgagatgaaa
cagccacctagccatttcccattaaatataatcccatcagcagcagacaa
tatctatcctcccctatcccctctatccatatttggaaactgcaccctct
tccctatttagcaccctaacaccacttgaattccataaccctgttgttga
tctagctctcctcacctctaaacacttctagcattcctttcagatcagga
gctcgaaacactctcctttgattttttggaaaagtttctggcttcttcaa
ggtcacgttctccgtcctaagaattaaaaaaaaaaaaaaaaacttccaaa
cctttgaccttgtgtccgtggaacacccctgacttcctatcatttcaacc
cattgaggcacttgaactctcttcttggggatcctgagaagggagagtgc
aaactcttgaccctggaggcaaacaaaatgttctcatgtttgccttccca
cttactttctgtgagaacgtgggaagatcttaacctctcagaagcacagt
ttcttccttctaaaatgaaataattaacctctccctgtctacattcttaa
actcataggacataaaaaaaaaaaaaa

The lines beginning with '>whatever something else' give you the 'id', (whatever), for the sequence and the description, (something else). In the sample above, there are ids like NR_002777 and a description like 3. The following lines contain the sequence itself, (cttgtcctttcagaagatcagag...).

So, there is more than 1 sequence in the sample file.

You shouldn't include these header lines in the window. Things start to get messy when trying to manually parse the dna sequences. It is probably safer to use the modules made for this purpose. (I don't know how many sequences he had for his problem).

I noticed a difference in the way you and I generated the windows. I started at the beginning and slid the window 100 chars to the right and generated the skew value, sliding to the end of the sequence. Your solution gets the windows from start to finish, then slides 200 offset and redoes the file again from start to finish. The number slid, (my 100 or your 200), is not the point, (I could adjust to 200 or whatever size offset). It's the different approachs taken.

I made the adjustment to the skew calculation you poointed out, using $size_read rather than the $window_size. The 2 changed programs are listed earlier in this thread.


Laurent_R
Veteran / Moderator

Oct 28, 2012, 11:19 AM

Post #32 of 34 (6109 views)
Re: [Chris Charley] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Hi Chris,

I had never done any work or anything on DNA sequences before, and I clearly said from the beginning that my knowledge is close to 0 on these matters.

Among other things, I did not know about headers and I did not notice that there were such headers in you file. I think Kevone said that his or her file contained only a stream of ACGT letters, so I assume it had not headers.

Of course it would not be complicated to remove the headers just as I have removed carriage return and new line characters, but I agree (and already wrote) that, in general, it is better to use existing modules if they are fit for the task at hand. In this particular case, the initial requirement was so simple that it could be done in a dozen lines of code, so I suggested some code without modules (I had no idea about what existing modules can or cannot do, as I said, this is really not my field).

As for the sliding window, I thought of at least 3 different methods to implement it, the one I used, something very similar to the one you used and, since the window size is a multiple of the slide factor, a third one where it would cut the file in segments of 200 bases and store the numbers of C and G for each segment in an array (or 2 arrays, an array of hashes, or whatever), and then do all the calculations on this array. This last method would probably be the best in terms of performance, but performance is not an issue with a mere 5 million nucleobases and the program might be a bit more complex. I chose the approach I used simply because it seemed to be the one that required the minimal amount of changes to the code I had already written.


Chris Charley
User

Oct 28, 2012, 1:58 PM

Post #33 of 34 (6104 views)
Re: [Laurent_R] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Yes Laurent, I realize you said earlier that it might be better to use the Bio::... modules for this problem. I guess it slipped my mind.

Haven't heard from the original poster as to whether any of these schemes will work however, so maybe its best to let this thread rest until we hear from him.

There are many things I don't know about genomic problems. There is a BLAST, reverse compliment, translate and several other things about genomes and I don't know what they are or what Bio modules to use to get them. (Well, reverse compliment and translate methods are in Bio::Seq.)


Kevone07
Novice

Oct 29, 2012, 2:09 PM

Post #34 of 34 (6074 views)
Re: [Laurent_R] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Honestly, you do an amazing job at what you are able to do with perl. Just by looking at the what you collaborated together, I'm able to tweak it just a little to make it work perfect. I just want to confirm it with you on my outlook on the file by interpreting it. So, what this program is doing is reading the input file, going through a bunch of loops with certain data types, counting the specific letters contained in the input file, inserting in the skew function, making the first recorded data in the test_results, going through this whole loop again with a slided window of +100 and so and so until it reaches the total amount of about 5,000,000 bp (I tweaked the program to go to somewhere around that amount). I just need to make sure that I get the correct slided window because I'll need to take the average of the 5 windows skew in-order to get clear up the noise on my graph. I'll show you what I ended up getting with the tweak. I'm going to send you a private message of my results.

First page Previous page 1 2 Next page Last page  View All
 
 


Search for (options) Powered by Gossamer Forum v.1.2.0

Web Applications & Managed Hosting Powered by Gossamer Threads
Visit our Mailing List Archives