CGI/Perl Guide | Learning Center | Forums | Advertise | Login
Site Search: in

  Main Index MAIN
INDEX
Search Posts SEARCH
POSTS
Who's Online WHO'S
ONLINE
Log in LOG
IN

Home: Perl Programming Help: Intermediate: Re: [Laurent_R] DNA Sequence Count Perl Program HELP!: Edit Log



Laurent_R
Veteran / Moderator

Oct 27, 2012, 8:57 AM


Views: 7798
Re: [Laurent_R] DNA Sequence Count Perl Program HELP!

As I said previously, using existing biology modules is probably better if they give you what you want.

Having said that, this is what I can come up without using such modules:


Code
#!/usr/bin/perl  

use strict;
use warnings;

my $window_size = 2000;

my $in_file = "test.txt";
my $out_file = "test_results.txt";

my $pass = 1;
open (my $out_fh, ">", $out_file) or die "cannot open $out_file $! \n";

my $offset = 0;
while ($offset < 2000) {
open (my $in_fh, "<", $in_file) or die "cannot open $in_file $! \n";
my $discarded_data = read $in_fh, (my $unused_line), $offset;
my $window_nr = 0;
while (my $size_read = read $in_fh, my $window, $window_size) {
$window_nr++;
my $nr_G = () = $window =~ /g/gi;
my $nr_C = () = $window =~ /c/gi;
print $out_fh "skew undetermined for pass $pass and Window $window_nr\n" and next unless $nr_C; #avoid division by 0 error if $nr_C == 0
my $skew_value = ($window_size / $nr_C)*($nr_G - $nr_C)/($nr_G + $nr_C);
print $out_fh "Pass $pass Window Number $window_nr : there are $nr_G guanine and $nr_C cytosine in this window. Skew value is: $skew_value\n";
}
close $in_fh;
$offset += 200;
$pass ++;
}
close $out_fh;


This gives me the following output:


Code
Pass 1 Window Number 1 : there are 28 guanine and 73 cytosine in this window. Skew value is: -12.206700122067 
Pass 1 Window Number 2 : there are 20 guanine and 81 cytosine in this window. Skew value is: -14.9126023713482
Pass 1 Window Number 3 : there are 15 guanine and 24 cytosine in this window. Skew value is: -19.2307692307692
Pass 2 Window Number 1 : there are 29 guanine and 75 cytosine in this window. Skew value is: -11.7948717948718
Pass 2 Window Number 2 : there are 19 guanine and 80 cytosine in this window. Skew value is: -15.4040404040404
Pass 2 Window Number 3 : there are 13 guanine and 20 cytosine in this window. Skew value is: -21.2121212121212
Pass 3 Window Number 1 : there are 29 guanine and 86 cytosine in this window. Skew value is: -11.5267947421638
Pass 3 Window Number 2 : there are 19 guanine and 71 cytosine in this window. Skew value is: -16.2754303599374
Pass 3 Window Number 3 : there are 12 guanine and 16 cytosine in this window. Skew value is: -17.8571428571429
Pass 4 Window Number 1 : there are 26 guanine and 86 cytosine in this window. Skew value is: -12.4584717607973
Pass 4 Window Number 2 : there are 17 guanine and 68 cytosine in this window. Skew value is: -17.6470588235294
Pass 4 Window Number 3 : there are 11 guanine and 13 cytosine in this window. Skew value is: -12.8205128205128
Pass 5 Window Number 1 : there are 25 guanine and 91 cytosine in this window. Skew value is: -12.5047366426677
Pass 5 Window Number 2 : there are 16 guanine and 61 cytosine in this window. Skew value is: -19.1611667021503
Pass 5 Window Number 3 : there are 9 guanine and 12 cytosine in this window. Skew value is: -23.8095238095238
Pass 6 Window Number 1 : there are 16 guanine and 96 cytosine in this window. Skew value is: -14.8809523809524
...


(Again, the program was run on a test file that is not a DNA sequence, so the values obtained are completely unsignificant.)

Asides from the changes due to your sliding window requirement, I made an additional change, because I had a hidden bug. This change is the addition of the following line:


Code
print $out_fh "skew undetermined for pass $pass and Window $window_nr\n" and next unless $nr_C; #avoid division by 0 error if $nr_C == 0


There was the possibility of having a division by 0 error if the sequence did not contain any C. This was relatively unlikely in the original version, but it becomes more likely with the sliding window rule, as the last chunk read in the file might have only 2 or 3 nucleobases and no C (because of the sliding window rule, it becomes more likely that there be one pass where the last chunk has very few nucleobases).

This leads me to another possible problem that I did not see before and discovered because of the sliding window implementation: when the last segment of data read during one pass has much less than 2000 nucleobases (say 1 to 10), its skew is possibly completely unsignificant. For example, in my above example, during pass 6, the last chunk of data gives me the following result:


Code
Pass 6 Window Number 3 : there are 7 guanine and 7 cytosine in this window. Skew value is: 0


Another (opposite) example is in the last chunk of data read in my last pass:


Code
Pass 9 Window Number 3 : there are 1 guanine and 3 cytosine in this window. Skew value is: -333.333333333333


In both cases, the calculation of the skew is correct for the data being read, but, because we are reading the last chunk of data for a given pass, we are most probably reading too little of the data for the calculated skew to be significant. So, maybe, the last chunk of data of a given pass should be discarded from the skew calculation when the number of nucleobases in the chunk is less that a given value.

Again, I have extremely little knowledge in genetics, the algorithms using these skew values may take this issue into consideration, I just want to inform you that there may be something wrong here.


(This post was edited by Laurent_R on Oct 27, 2012, 11:54 AM)


Edit Log:
Post edited by Laurent_R (Veteran) on Oct 27, 2012, 11:54 AM


Search for (options) Powered by Gossamer Forum v.1.2.0

Web Applications & Managed Hosting Powered by Gossamer Threads
Visit our Mailing List Archives