Laurent_R
Veteran
/ Moderator
Oct 28, 2012, 3:02 AM
Views: 9943

Re: [Chris Charley] DNA Sequence Count Perl Program HELP!



Hi, I just made a test with Chris Charley's DNA sequence file. Contrary to what I thought and wrote earlier, the end of line characters alter the computed skew quite significantly. Example of a few windows on the DNA file as posted by Chris:
Pass 1 Window Number 1 : there are 424 guanine and 443 cytosine in this window. Skew value is: 0.0989374637120816 Pass 1 Window Number 2 : there are 451 guanine and 369 cytosine in this window. Skew value is: 0.542005420054201 Pass 1 Window Number 3 : there are 363 guanine and 359 cytosine in this window. Skew value is: 0.0308644356823741 I then removed the new line/carriage return characters from the same file:
$ perl i pe 's/\r\n//g' test.txt And, running the same test with the new file without the carriage returns:
Pass 1 Window Number 1 : there are 437 guanine and 469 cytosine in this window. Skew value is: 0.150618713433777 Pass 1 Window Number 2 : there are 467 guanine and 375 cytosine in this window. Skew value is: 0.582739509105305 Pass 1 Window Number 3 : there are 395 guanine and 389 cytosine in this window. Skew value is: 0.0393473584806673 So, Kevone, you should remove these new line/carriage return characters from your file if you are going to use my suggested program (you can do it with my Perl oneliner above, but you may have to change it depending on whether you are on Unix or Windows). Another point is that I found a bug in my calculation of the skew: in the formula, I use $window_size for the total number of nucleobases analyzed, whereas I should have used the actual number of bytes read, otherwise the calculated skew is wrong whenever I reach the end of the file (i.e. for each last window of each pass). So the line calculating the skew is now as follows:
my $skew_value = ($size_read / $nr_C)*($nr_G  $nr_C)/($nr_G + $nr_C); Without that change, I had the following result for the first pass:
Pass 1 Window Number 1 : there are 437 guanine and 469 cytosine in this window. Skew value is: 0.150618713433777 Pass 1 Window Number 2 : there are 467 guanine and 375 cytosine in this window. Skew value is: 0.582739509105305 Pass 1 Window Number 3 : there are 395 guanine and 389 cytosine in this window. Skew value is: 0.0393473584806673 Pass 1 Window Number 4 : there are 467 guanine and 416 cytosine in this window. Skew value is: 0.277680982663995 Pass 1 Window Number 5 : there are 486 guanine and 436 cytosine in this window. Skew value is: 0.248761169376505 Pass 1 Window Number 6 : there are 509 guanine and 506 cytosine in this window. Skew value is: 0.0116824704530851 Pass 1 Window Number 7 : there are 546 guanine and 534 cytosine in this window. Skew value is: 0.0416146483562214 Pass 1 Window Number 8 : there are 552 guanine and 524 cytosine in this window. Skew value is: 0.0993217741706632 Pass 1 Window Number 9 : there are 507 guanine and 555 cytosine in this window. Skew value is: 0.162874739145926 Pass 1 Window Number 10 : there are 194 guanine and 217 cytosine in this window. Skew value is: 0.515770235572449 Pass 2 Window Number 1 : there are 421 guanine and 482 cytosine in this window. Skew value is: 0.280301254922504 Pass 2 Window Number 2 : there are 454 guanine and 358 cytosine in this window. Skew value is: 0.660483804386713 With this correction, I have now this:
Pass 1 Window Number 1 : there are 437 guanine and 469 cytosine in this window. Skew value is: 0.150618713433777 Pass 1 Window Number 2 : there are 467 guanine and 375 cytosine in this window. Skew value is: 0.582739509105305 Pass 1 Window Number 3 : there are 395 guanine and 389 cytosine in this window. Skew value is: 0.0393473584806673 Pass 1 Window Number 4 : there are 467 guanine and 416 cytosine in this window. Skew value is: 0.277680982663995 Pass 1 Window Number 5 : there are 486 guanine and 436 cytosine in this window. Skew value is: 0.248761169376505 Pass 1 Window Number 6 : there are 509 guanine and 506 cytosine in this window. Skew value is: 0.0116824704530851 Pass 1 Window Number 7 : there are 546 guanine and 534 cytosine in this window. Skew value is: 0.0416146483562214 Pass 1 Window Number 8 : there are 552 guanine and 524 cytosine in this window. Skew value is: 0.0993217741706632 Pass 1 Window Number 9 : there are 507 guanine and 555 cytosine in this window. Skew value is: 0.162874739145926 Pass 1 Window Number 10 : there are 194 guanine and 217 cytosine in this window. Skew value is: 0.219202350118291 Pass 2 Window Number 1 : there are 421 guanine and 482 cytosine in this window. Skew value is: 0.280301254922504 Pass 2 Window Number 2 : there are 454 guanine and 358 cytosine in this window. Skew value is: 0.660483804386713 As it can be seen, the computed skew values are the same for every window, except the last window of each pass (window number 10 in the above example). So I think you should probably apply the same correction. EDIT: I checked Chris Charley's program, it has the same bug in the skew calculation (since Chris probably reused my formula). So you should modify his program if it is the one you are going to use.
(This post was edited by Laurent_R on Oct 28, 2012, 3:08 AM)
