CGI/Perl Guide | Learning Center | Forums | Advertise | Login
Site Search: in

  Main Index MAIN
INDEX
Search Posts SEARCH
POSTS
Who's Online WHO'S
ONLINE
Log in LOG
IN

Home: Perl Programming Help: Intermediate:
DNA Sequence Count Perl Program HELP!

 



Kevone07
Novice

Oct 20, 2012, 1:14 PM

Post #1 of 34 (7970 views)
DNA Sequence Count Perl Program HELP! Can't Post

HI,
I been trying to make this program work for a good week and I have been with perl for longer than 6 weeks just having a hard time creating certain programs. This program I have make it able to read a file that has about 5,230,000 DNA sequences which I am suppose to find the count of G and C then use a equation to calculate the GC-skew. The equation is (G-C)/(G+C) x W (window)/C (length). The window is suppose to be 2000 bp and character length suppose to contain 5,230,000 bp. Its mainly meant for the program to determine the terminus sequences for a specific species. Is there anyone that has some idea to how to start this because I been spending too much time on it and freaking out. I would really appreciate it and, if you need me to explain it clear, Ill do my best. Please help & thanks in advance.


Laurent_R
Veteran / Moderator

Oct 20, 2012, 1:34 PM

Post #2 of 34 (7967 views)
Re: [Kevone07] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

I would love to help you, but I haven't understood what you want to do.

May be the fact that I am a relatively good IT specialist, but I know nothing (or extremly little) about biology.


Kevone07
Novice

Oct 20, 2012, 4:22 PM

Post #3 of 34 (7962 views)
Re: [Laurent_R] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

I need help getting a count of finding the amount of "G" and "C" in a whole sequence of about 5,230,000 bp which I can go through this step by step but there are checkpoints in reading the sequence of a window of 2,000 bp. So first, I need to get a perl program to read the 2,000 bp of 5,230,000 bp finding how many "G" & "C" show up in that length (2,000 bp) which it will record and start again where it left of at the first checkpoint (2,000 bp to 4,000 bp). Again the same letters are counted and recorded. This keeps on going till the program reaches the end of the loop which is 5,230,000 bp character length . Once the results are figured out of the amount of "G" and "C" in the DNA sequence (research data saved in a file) for each 2,000 bp window. After completed the results, these numbers have to go in a GC-skew which will be calculated in a formula (G-C)/(G+C) x (W/C) for each 2,000 bp window (W) of 5,230,000 bp character length (C). These calculations only calculate for each 2,000 bp which will determine the terminus sequences for the specie genetic code I'm helping someone with there research. If you still don't understand, Ill explain it little by little bit. I tried to write it but I just get too mind boggled and get no where. I would appreciate your help and would save me a lot of time for other classes. I spent days trying to figure this out without any errors which was very impossible for me. Please respond back asap and thanks in advance! Much appreciate it!


Kevone07
Novice

Oct 20, 2012, 7:19 PM

Post #4 of 34 (7956 views)
Re: [Kevone07] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

The DNA Sequence is located on a stored file so I would start off with:

open <INFILE> ".txt"
while (<INFILE>)....

Im confused on where to state the variables and how to state them to go with what I'm trying to do. Do I use an @arr in a %hash or...


wickedxter
User

Oct 20, 2012, 8:09 PM

Post #5 of 34 (7953 views)
Re: [Kevone07] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post


Code
 
open my $fh ,'>','/dir/to/file.txt';

while (my $line = <$fh>){
#this reads the file one line(row) at a time starting at the top of the file and its assigned to $line

}
close $fh;



Laurent_R
Veteran / Moderator

Oct 21, 2012, 2:59 AM

Post #6 of 34 (7948 views)
Re: [wickedxter] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

OK, Kevone, now I start to understand better. Can you provide a short sample of your DNA file, so that we understand better how to look for the relevant nucleobases in your data.


In Reply To

Code
open my $fh ,'>','/dir/to/file.txt'; 
# ...



It should ideally be:


Code
open my $fh ,'>','/dir/to/file.txt' or die "could not open file.txt $! \n"; 
# ...



Kevone07
Novice

Oct 21, 2012, 10:33 AM

Post #7 of 34 (7857 views)
Re: [Laurent_R] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

An example of this file would contain, AGTCATTCAAGACAGGTAA... I need the program to read every 2,000 of these leters counting the "G" and "C" then putting it in the equation for calculation that I provided. This is how it would look like if I put the details with what you provided: $fh ,'>','/home/share/bmeg/bacillus_spp/QMB1551_chromosome.seq' or die "could not open file.txt $! \n";


Laurent_R
Veteran / Moderator

Oct 21, 2012, 10:57 AM

Post #8 of 34 (7855 views)
Re: [Kevone07] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Is your file organized in lines (separated by new line or cariage return characters) or is it just consisting of a stream of only the letters corresponding to the four nucleobases?

This has some importance on the way to best read the file.


Kevone07
Novice

Oct 21, 2012, 11:04 AM

Post #9 of 34 (7852 views)
Re: [Laurent_R] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

I posted a picture of somewhat is contained in the file.


(This post was edited by Kevone07 on Oct 21, 2012, 11:45 AM)
Attachments: Untitled.png (159 KB)


Laurent_R
Veteran / Moderator

Oct 21, 2012, 2:15 PM

Post #10 of 34 (7841 views)
Re: [Kevone07] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Allright, your file is just a pure sequence of the letters corresponding to the four nucleobases, I understand.

This requires probably a specific approach. Probably a binary slurp of a 2,000 characters and analysis of the number of the C or G components, something not really looking as what Perl is usually very good at (processing successive test lines). But I am pretty sure that what you need can be done efficiently (within a few seconds).

Having said that, I need a bit of thinking to try to figure out the best way to do what you need, but I am pretty sure it can be done efficiently.


Kevone07
Novice

Oct 21, 2012, 8:05 PM

Post #11 of 34 (7832 views)
Re: [Laurent_R] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Sounds like i have a lot more to learn and Thank you for putting in the time to help me, really do appreciate! I have to get something in to show it to the person I'm helping by Tuesday. I'll try to help too as much as possible. Also, any other questions you need I'll answer ASAP.


Laurent_R
Veteran / Moderator

Oct 22, 2012, 1:58 AM

Post #12 of 34 (7817 views)
Re: [Kevone07] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Hi Kevone,

here is a starting point. What my program below does is to read a file by windows of 2000 bytes, count the number of C and G in each window and print the information:


Code
#!/usr/bin/perl 

use strict;
use warnings;

my $window_size = 2000;

my $file = "test.txt";
my $window_nr = 0;

open (my $fh, "<", $file) or die "cannot open $file $! \n";
while (my $size_read = read $fh, my $window, $window_size) {
$window_nr++;
my $nr_G = () = $window =~ /g/gi;
my $nr_C = () = $window =~ /c/gi;
print "Window Number $window_nr : there are $nr_G guanine and $nr_C cytosine in this window\n";
}


This is the output:

Code
perl test_read.pl 
Window Number 1 : there are 14 guanine and 50 cytosine in this window
Window Number 2 : there are 23 guanine and 40 cytosine in this window
Window Number 3 : there are 18 guanine and 57 cytosine in this window
Window Number 4 : there are 11 guanine and 47 cytosine in this window
Window Number 5 : there are 12 guanine and 70 cytosine in this window
Window Number 6 : there are 9 guanine and 50 cytosine in this window
Window Number 7 : there are 10 guanine and 56 cytosine in this window
Window Number 8 : there are 12 guanine and 50 cytosine in this window
Window Number 9 : there are 11 guanine and 53 cytosine in this window
Window Number 10 : there are 4 guanine and 47 cytosine in this window


Note: the number of guanine and cytosine nucleobases is low because I ran the test not on a DNA sequence file (I don't have any), but on a plain HTML file that happened to be in my test directory.

What still needs to be done is the computing of the ratio and storing the individual results in an array, or something like that.

Please try it and let me know if this fits parts of your needs. Also let me know if you need explanations on the above.


Laurent_R
Veteran / Moderator

Oct 22, 2012, 7:24 AM

Post #13 of 34 (7807 views)
Re: [Kevone07] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Hi,

looking at your requirement, it seems that the only missing part in my earlier post was the calculation of skew. So, here is the complete script:


Code
#!/usr/bin/perl 

use strict;
use warnings;

my $window_size = 2000;

my $file = "test.txt";
my $window_nr = 0;

open (my $fh, "<", $file) or die "cannot open $file $! \n";
while (my $size_read = read $fh, my $window, $window_size) {
$window_nr++;
my $nr_G = () = $window =~ /g/gi;
my $nr_C = () = $window =~ /c/gi;
my $skew_value = ($window_size / $nr_C)*($nr_G - $nr_C)/($nr_G + $nr_C);
print "Window Number $window_nr : there are $nr_G guanine and $nr_C cytosine in this window. Skew value is: $skew_value\n";
}
close $fh;


This gives the following output:


Code
Window Number 1 : there are 14 guanine and 50 cytosine in this window. Skew value is: -22.5 
Window Number 2 : there are 23 guanine and 40 cytosine in this window. Skew value is: -13.4920634920635
Window Number 3 : there are 18 guanine and 57 cytosine in this window. Skew value is: -18.2456140350877
Window Number 4 : there are 11 guanine and 47 cytosine in this window. Skew value is: -26.4123257520176
Window Number 5 : there are 12 guanine and 70 cytosine in this window. Skew value is: -20.2090592334495
Window Number 6 : there are 9 guanine and 50 cytosine in this window. Skew value is: -27.7966101694915
Window Number 7 : there are 10 guanine and 56 cytosine in this window. Skew value is: -24.8917748917749
Window Number 8 : there are 12 guanine and 50 cytosine in this window. Skew value is: -24.5161290322581
Window Number 9 : there are 11 guanine and 53 cytosine in this window. Skew value is: -24.7641509433962
Window Number 10 : there are 4 guanine and 47 cytosine in this window. Skew value is: -35.8781810596579


Of course, you may want to produce a different output (in particular, you may want to round off the skew), but you did not provide enough information for that.

Note that the results show heavily skewed data, but remember I ran it on a HTML text file, not on a DNA sequence. Since English has many more "C" than "G", the calculated skew is meaningless.


Kevone07
Novice

Oct 22, 2012, 4:18 PM

Post #14 of 34 (7790 views)
Re: [Laurent_R] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Seriously, I don't know how to thank you! I know I'm going to need future help so would it be poss. helping me when needed and I'll donate some money or something in return??

Also, I just want to make sure. Would the output file go something like this?

Code
 #!/usr/bin/perl  

use strict;
use warnings;

my $window_size = 2000;

my $file = "test.txt";
my $window_nr = 0;
my $outfile = "results.txt"

open (my $fh, "<", $file) or die "cannot open $file $! \n";
while (my $size_read = read $fh, my $window, $window_size) {
$window_nr++;
my $nr_G = () = $window =~ /g/gi;
my $nr_C = () = $window =~ /c/gi;
my $skew_value = ($window_size / $nr_C)*($nr_G - $nr_C)/($nr_G + $nr_C);
print "Window Number $window_nr : there are $nr_G guanine and $nr_C cytosine in this window. Skew value is: $skew_value\n";
}
close $fh;
open (OUTFILE, ">> $outfile") or die "ERROR: opening $outfile\n";

would I make a foreach under this or just place the while under here?


Kevone07
Novice

Oct 22, 2012, 5:24 PM

Post #15 of 34 (7787 views)
Re: [Kevone07] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Forget about that question but is this way familiar. I found it in google.

You can forward the output into an outfile using the > on the command line:
project1.pl > project1_output.txt I tried using that as the command in Linux but I guess its meant for perl. I'm not sure, would you know?


Laurent_R
Veteran / Moderator

Oct 23, 2012, 12:15 AM

Post #16 of 34 (7779 views)
Re: [Kevone07] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

I am not sure I understand your questions.

The easiest way to capture the output is to redirect it to a file using the '>' operator under Linux or Unix (I haven,'t tried, but I think this would work even under Dos).

Otherwise, you could just open an output file and write to this file, I can give you the syntax, but will do that later (it is not very practical for me right now, as I am in public transportation for the moment).


Laurent_R
Veteran / Moderator

Oct 23, 2012, 1:26 AM

Post #17 of 34 (7776 views)
Re: [Laurent_R] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

The same program with the opening of an output file:


Code
#!/usr/bin/perl 

use strict;
use warnings;

my $window_size = 2000;

my $in_file = "test.txt";
my $out_file = "test_results.txt";

my $window_nr = 0;

open (my $in_fh, "<", $in_file) or die "cannot open $in_file $! \n";
open (my $out_fh, ">", $out_file) or die "cannot open $out_file $! \n";

while (my $size_read = read $in_fh, my $window, $window_size) {
$window_nr++;
my $nr_G = () = $window =~ /g/gi;
my $nr_C = () = $window =~ /c/gi;
my $skew_value = ($window_size / $nr_C)*($nr_G - $nr_C)/($nr_G + $nr_C);
print $out_fh "Window Number $window_nr : there are $nr_G guanine and $nr_C cytosine in this window. Skew value is: $skew_value\n";
}
close $in_fh;
close $out_fh


And the content of the output file:


Code
cat test_results.txt 
Window Number 1 : there are 14 guanine and 50 cytosine in this window. Skew value is: -22.5
Window Number 2 : there are 23 guanine and 40 cytosine in this window. Skew value is: -13.4920634920635
Window Number 3 : there are 18 guanine and 57 cytosine in this window. Skew value is: -18.2456140350877
Window Number 4 : there are 11 guanine and 47 cytosine in this window. Skew value is: -26.4123257520176
Window Number 5 : there are 12 guanine and 70 cytosine in this window. Skew value is: -20.2090592334495
Window Number 6 : there are 9 guanine and 50 cytosine in this window. Skew value is: -27.7966101694915
Window Number 7 : there are 10 guanine and 56 cytosine in this window. Skew value is: -24.8917748917749
Window Number 8 : there are 12 guanine and 50 cytosine in this window. Skew value is: -24.5161290322581
Window Number 9 : there are 11 guanine and 53 cytosine in this window. Skew value is: -24.7641509433962
Window Number 10 : there are 4 guanine and 47 cytosine in this window. Skew value is: -35.8781810596579



Kevone07
Novice

Oct 23, 2012, 8:47 AM

Post #18 of 34 (7766 views)
Re: [Laurent_R] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

That's a good way of putting it. I figured out what I was trying to do which was execute a command in linux to give me the output results in a txt file of the perl program I was running which meant I had to use " perl project1.pl > project1_output_results.txt " in the linux server. Thanks for the help, seriously! I may have more questions, I'll see how things go on this project.


Kevone07
Novice

Oct 25, 2012, 6:11 PM

Post #19 of 34 (7699 views)
Re: [Kevone07] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Question: How would I make the initial bp count restart, after reaching the first 2000bp, to 100 bp for the second window initial count ranging still in 2000bp ( 100bp + 2000bp = 2100bp). I was thinking of just adding " + 100 " to the window size but it will count all the "G" and "C" from 0 to 100 and so.. instead of just counting 100bp to 2100bp than going to 200 to 2200bp and being consistent till reaching the total bp of 5097129. I know I aint making much sense but maybe you will understand. If not, just let me know and Ill try to explain it better. Thanks in advance!


Kevone07
Novice

Oct 25, 2012, 6:13 PM

Post #20 of 34 (7698 views)
Re: [Kevone07] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Writing in perl gets me really confused and lose my train of thought. Sorry about anything that doesnt make sense


Laurent_R
Veteran / Moderator

Oct 25, 2012, 11:27 PM

Post #21 of 34 (7680 views)
Re: [Kevone07] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Do you want the count window to go like this:

- 1 - 2000
- 2001- 2100
- 2101- 4100
- 4101 - 4201?

Or something else?


Chris Charley
User

Oct 26, 2012, 8:17 AM

Post #22 of 34 (7672 views)
Re: [Kevone07] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Maybe the attached files will give you an idea. t33.pl is the program file, fasta_dat.txt is the data file and junk.txt is the output from the program.

I used a smaller window size because my sample fasta file sequences weren't as long as the ones you have and the skip is smaller too. You would just need to make those adjustments for your program.

Look it over and ask any questions.

Hope this helps.

Update: Made a change to t33.pl

if ($begin_pos + $window_size > $length)

to

if ($begin_pos + $window_size >= $length)

Update2: Just to show a possible output ordered by skew, (uses SQLite database, pretty simple). File is bio_stat_db.pl.


Update3: Made adjustments to t33.pl and bio_stat_db.pl. Changed from $window_size to $size_read for skew calculations. Laurent caught this error (later in this thread). Output is now at the end of the 2 program files (after the __END__ token).


(This post was edited by Chris Charley on Oct 29, 2012, 8:20 AM)
Attachments: fasta_dat.txt (19.4 KB)
  bio_stat_db.pl (4.90 KB)
  t33.pl (9.98 KB)


Kevone07
Novice

Oct 26, 2012, 1:57 PM

Post #23 of 34 (7642 views)
Re: [Laurent_R] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

The initial count range of 'G' & 'C' go up +100 everytime after the first initial count at 1. So it would look something like this:

-1 - 2000

-100 - 2100

-200 - 2200

-300 - 2300 etc.. till hits the total bp of about 5,000,000

I figured I didn't explain it well enough but I think you have a clear idea of it now. I wish I was better at writting in perl, too difficult from where I started at..


Laurent_R
Veteran / Moderator

Oct 26, 2012, 2:28 PM

Post #24 of 34 (7639 views)
Re: [Kevone07] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Allright, you want a kind of sliding windows. I think I understood what you want.

But it is very late for me now, I hope I'll be able to give you an answer tomorrow.

I should add that Chris Charley gave a track which might be much better: using existing modules that appear to do exactly the type of calculations you need. I must say that I asked myself the question before posting previous posts: "but, isn't there already a module to do just that?" Personally, I have never used Perl for biology, so I have no idea about the possibilities offered by such modules. But since this is your field, I would suggest that you investigate thoroughly before reinventing the wheel.


Laurent_R
Veteran / Moderator

Oct 27, 2012, 8:57 AM

Post #25 of 34 (7584 views)
Re: [Laurent_R] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

As I said previously, using existing biology modules is probably better if they give you what you want.

Having said that, this is what I can come up without using such modules:


Code
#!/usr/bin/perl  

use strict;
use warnings;

my $window_size = 2000;

my $in_file = "test.txt";
my $out_file = "test_results.txt";

my $pass = 1;
open (my $out_fh, ">", $out_file) or die "cannot open $out_file $! \n";

my $offset = 0;
while ($offset < 2000) {
open (my $in_fh, "<", $in_file) or die "cannot open $in_file $! \n";
my $discarded_data = read $in_fh, (my $unused_line), $offset;
my $window_nr = 0;
while (my $size_read = read $in_fh, my $window, $window_size) {
$window_nr++;
my $nr_G = () = $window =~ /g/gi;
my $nr_C = () = $window =~ /c/gi;
print $out_fh "skew undetermined for pass $pass and Window $window_nr\n" and next unless $nr_C; #avoid division by 0 error if $nr_C == 0
my $skew_value = ($window_size / $nr_C)*($nr_G - $nr_C)/($nr_G + $nr_C);
print $out_fh "Pass $pass Window Number $window_nr : there are $nr_G guanine and $nr_C cytosine in this window. Skew value is: $skew_value\n";
}
close $in_fh;
$offset += 200;
$pass ++;
}
close $out_fh;


This gives me the following output:


Code
Pass 1 Window Number 1 : there are 28 guanine and 73 cytosine in this window. Skew value is: -12.206700122067 
Pass 1 Window Number 2 : there are 20 guanine and 81 cytosine in this window. Skew value is: -14.9126023713482
Pass 1 Window Number 3 : there are 15 guanine and 24 cytosine in this window. Skew value is: -19.2307692307692
Pass 2 Window Number 1 : there are 29 guanine and 75 cytosine in this window. Skew value is: -11.7948717948718
Pass 2 Window Number 2 : there are 19 guanine and 80 cytosine in this window. Skew value is: -15.4040404040404
Pass 2 Window Number 3 : there are 13 guanine and 20 cytosine in this window. Skew value is: -21.2121212121212
Pass 3 Window Number 1 : there are 29 guanine and 86 cytosine in this window. Skew value is: -11.5267947421638
Pass 3 Window Number 2 : there are 19 guanine and 71 cytosine in this window. Skew value is: -16.2754303599374
Pass 3 Window Number 3 : there are 12 guanine and 16 cytosine in this window. Skew value is: -17.8571428571429
Pass 4 Window Number 1 : there are 26 guanine and 86 cytosine in this window. Skew value is: -12.4584717607973
Pass 4 Window Number 2 : there are 17 guanine and 68 cytosine in this window. Skew value is: -17.6470588235294
Pass 4 Window Number 3 : there are 11 guanine and 13 cytosine in this window. Skew value is: -12.8205128205128
Pass 5 Window Number 1 : there are 25 guanine and 91 cytosine in this window. Skew value is: -12.5047366426677
Pass 5 Window Number 2 : there are 16 guanine and 61 cytosine in this window. Skew value is: -19.1611667021503
Pass 5 Window Number 3 : there are 9 guanine and 12 cytosine in this window. Skew value is: -23.8095238095238
Pass 6 Window Number 1 : there are 16 guanine and 96 cytosine in this window. Skew value is: -14.8809523809524
...


(Again, the program was run on a test file that is not a DNA sequence, so the values obtained are completely unsignificant.)

Asides from the changes due to your sliding window requirement, I made an additional change, because I had a hidden bug. This change is the addition of the following line:


Code
print $out_fh "skew undetermined for pass $pass and Window $window_nr\n" and next unless $nr_C; #avoid division by 0 error if $nr_C == 0


There was the possibility of having a division by 0 error if the sequence did not contain any C. This was relatively unlikely in the original version, but it becomes more likely with the sliding window rule, as the last chunk read in the file might have only 2 or 3 nucleobases and no C (because of the sliding window rule, it becomes more likely that there be one pass where the last chunk has very few nucleobases).

This leads me to another possible problem that I did not see before and discovered because of the sliding window implementation: when the last segment of data read during one pass has much less than 2000 nucleobases (say 1 to 10), its skew is possibly completely unsignificant. For example, in my above example, during pass 6, the last chunk of data gives me the following result:


Code
Pass 6 Window Number 3 : there are 7 guanine and 7 cytosine in this window. Skew value is: 0


Another (opposite) example is in the last chunk of data read in my last pass:


Code
Pass 9 Window Number 3 : there are 1 guanine and 3 cytosine in this window. Skew value is: -333.333333333333


In both cases, the calculation of the skew is correct for the data being read, but, because we are reading the last chunk of data for a given pass, we are most probably reading too little of the data for the calculated skew to be significant. So, maybe, the last chunk of data of a given pass should be discarded from the skew calculation when the number of nucleobases in the chunk is less that a given value.

Again, I have extremely little knowledge in genetics, the algorithms using these skew values may take this issue into consideration, I just want to inform you that there may be something wrong here.


(This post was edited by Laurent_R on Oct 27, 2012, 11:54 AM)


Chris Charley
User

Oct 27, 2012, 11:02 AM

Post #26 of 34 (6444 views)
Re: [Laurent_R] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Laurent, your code is somewhat less dense than the BioPerl one. :-)

A problem (?) might be that it is counting newline characters in the window (by looking at his sample data snapshot earlier in this thread). Don't know if this could significantly affect the results. The newline issue is why I posted the BioPerl solution as it doesn't count newline characters.

Good catch on the division by zero error likelyhood. My solution didn't check for illegal division by zero, so would have to be amended.

Just some thoughts.


(This post was edited by Chris Charley on Oct 27, 2012, 11:19 AM)


Laurent_R
Veteran / Moderator

Oct 27, 2012, 11:47 AM

Post #27 of 34 (6436 views)
Re: [Chris Charley] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Hi Chris,

Thanks for your comments.

I usually don't try to make code too concise on forums, because this could make it more difficult to understand to the poster.

In the initial discussions I had with the OP, he or she told me that that there was no new lines, the file consisted only of the four letters representing bases. So I assumed there is no new line.

Having said that, if a Perl module is doing just that, I would definitely recommand using it.


Chris Charley
User

Oct 27, 2012, 1:13 PM

Post #28 of 34 (6415 views)
Re: [Laurent_R] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

His attachment in reply to your asking him about the newlines was all I could find. But, the only fasta files I've seen had newlines in the data. But, that looks like a unix type terminal window and I don't know anything about them.
The data may be without newlines.

The lines don't extend to the right side of the terminal window.

I'm not an expert with fasta files. I've just picked up bits and pieces by reading threads on BioPerl issues.


(This post was edited by Chris Charley on Oct 27, 2012, 1:46 PM)


Laurent_R
Veteran / Moderator

Oct 27, 2012, 2:08 PM

Post #29 of 34 (6383 views)
Re: [Chris Charley] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

To tell the truth, I had not seen the attachment, but only the text of the message.

And the attachment does not say too much about the real file, we don't know how the attachement was generated and on which terminal, there may be carriage returns or line feeds, or not, we don't really know.

I have no knowledge on DNA sequences, but I would tend to think that the presence or absence of new lines in the file would not change significantly the G-C skew.

Again, you pointed to some existing Perl modules which I did not know anything about. If they do what is required here, it is certainly a better option, because they have most certainly been tested thouroughly with real data much better than what I could do with my pseudo-DNA files (HTML files in my earlier posts, Perl code files in my most recent ones).


Laurent_R
Veteran / Moderator

Oct 28, 2012, 3:02 AM

Post #30 of 34 (6236 views)
Re: [Chris Charley] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Hi,

I just made a test with Chris Charley's DNA sequence file. Contrary to what I thought and wrote earlier, the end of line characters alter the computed skew quite significantly.

Example of a few windows on the DNA file as posted by Chris:


Code
Pass 1 Window Number 1 : there are 424 guanine and 443 cytosine in this window. Skew value is: -0.0989374637120816 
Pass 1 Window Number 2 : there are 451 guanine and 369 cytosine in this window. Skew value is: 0.542005420054201
Pass 1 Window Number 3 : there are 363 guanine and 359 cytosine in this window. Skew value is: 0.0308644356823741


I then removed the new line/carriage return characters from the same file:


Code
$ perl -i -pe 's/\r\n//g' test.txt


And, running the same test with the new file without the carriage returns:


Code
Pass 1 Window Number 1 : there are 437 guanine and 469 cytosine in this window. Skew value is: -0.150618713433777 
Pass 1 Window Number 2 : there are 467 guanine and 375 cytosine in this window. Skew value is: 0.582739509105305
Pass 1 Window Number 3 : there are 395 guanine and 389 cytosine in this window. Skew value is: 0.0393473584806673


So, Kevone, you should remove these new line/carriage return characters from your file if you are going to use my suggested program (you can do it with my Perl one-liner above, but you may have to change it depending on whether you are on Unix or Windows).

Another point is that I found a bug in my calculation of the skew: in the formula, I use $window_size for the total number of nucleobases analyzed, whereas I should have used the actual number of bytes read, otherwise the calculated skew is wrong whenever I reach the end of the file (i.e. for each last window of each pass).

So the line calculating the skew is now as follows:


Code
		my $skew_value = ($size_read / $nr_C)*($nr_G - $nr_C)/($nr_G + $nr_C);


Without that change, I had the following result for the first pass:



Code
Pass 1 Window Number 1 : there are 437 guanine and 469 cytosine in this window. Skew value is: -0.150618713433777 
Pass 1 Window Number 2 : there are 467 guanine and 375 cytosine in this window. Skew value is: 0.582739509105305
Pass 1 Window Number 3 : there are 395 guanine and 389 cytosine in this window. Skew value is: 0.0393473584806673
Pass 1 Window Number 4 : there are 467 guanine and 416 cytosine in this window. Skew value is: 0.277680982663995
Pass 1 Window Number 5 : there are 486 guanine and 436 cytosine in this window. Skew value is: 0.248761169376505
Pass 1 Window Number 6 : there are 509 guanine and 506 cytosine in this window. Skew value is: 0.0116824704530851
Pass 1 Window Number 7 : there are 546 guanine and 534 cytosine in this window. Skew value is: 0.0416146483562214
Pass 1 Window Number 8 : there are 552 guanine and 524 cytosine in this window. Skew value is: 0.0993217741706632
Pass 1 Window Number 9 : there are 507 guanine and 555 cytosine in this window. Skew value is: -0.162874739145926
Pass 1 Window Number 10 : there are 194 guanine and 217 cytosine in this window. Skew value is: -0.515770235572449
Pass 2 Window Number 1 : there are 421 guanine and 482 cytosine in this window. Skew value is: -0.280301254922504
Pass 2 Window Number 2 : there are 454 guanine and 358 cytosine in this window. Skew value is: 0.660483804386713


With this correction, I have now this:


Code
Pass 1 Window Number 1 : there are 437 guanine and 469 cytosine in this window. Skew value is: -0.150618713433777 
Pass 1 Window Number 2 : there are 467 guanine and 375 cytosine in this window. Skew value is: 0.582739509105305
Pass 1 Window Number 3 : there are 395 guanine and 389 cytosine in this window. Skew value is: 0.0393473584806673
Pass 1 Window Number 4 : there are 467 guanine and 416 cytosine in this window. Skew value is: 0.277680982663995
Pass 1 Window Number 5 : there are 486 guanine and 436 cytosine in this window. Skew value is: 0.248761169376505
Pass 1 Window Number 6 : there are 509 guanine and 506 cytosine in this window. Skew value is: 0.0116824704530851
Pass 1 Window Number 7 : there are 546 guanine and 534 cytosine in this window. Skew value is: 0.0416146483562214
Pass 1 Window Number 8 : there are 552 guanine and 524 cytosine in this window. Skew value is: 0.0993217741706632
Pass 1 Window Number 9 : there are 507 guanine and 555 cytosine in this window. Skew value is: -0.162874739145926
Pass 1 Window Number 10 : there are 194 guanine and 217 cytosine in this window. Skew value is: -0.219202350118291
Pass 2 Window Number 1 : there are 421 guanine and 482 cytosine in this window. Skew value is: -0.280301254922504
Pass 2 Window Number 2 : there are 454 guanine and 358 cytosine in this window. Skew value is: 0.660483804386713


As it can be seen, the computed skew values are the same for every window, except the last window of each pass (window number 10 in the above example). So I think you should probably apply the same correction.

EDIT: I checked Chris Charley's program, it has the same bug in the skew calculation (since Chris probably reused my formula). So you should modify his program if it is the one you are going to use.


(This post was edited by Laurent_R on Oct 28, 2012, 3:08 AM)


Chris Charley
User

Oct 28, 2012, 9:01 AM

Post #31 of 34 (6220 views)
Re: [Laurent_R] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Here is a portion of the sample input fasta file and I want to explain what it contains.


Code
>NM_198399 1 
aacagattttaactctgaaaagccatttccagtgtctatagactattgtg
agcctggagaagtagcatttagttgggatagcttcactagagctgcctgc
caaagacttccttccacaggatcttgtcgcaccagcaactgacaggagct
tgggagctcgggagcttgggagagggcttatgtttttaataatgtagctg
tcagttcgaagcctggaaatgttgaccctcaaagggcataaaatcttgtt
attttaatttgcatctgggagaatgtctgagcaaggagacctgaatcagg
caatagcagaggaaggagggactgagcaggagacggccactccagagaac
ggcattgttaaatcagaaagtctggatgaagaggagaaactggaactgca
gaggcggctggaggctcagaatcaagaaagaagaaaatccaagtcaggag
caggaaaaggtaaactgactcgcagccttgctgtctgtgaggaatcttct
gccagaccaggaggtgaaagtcttcaggatcagactctctgaaaactgca
aatggaaaggaattcaaaagaatttagattaaaagttaaataaaaagtag
gcacagtagtgctgaattttcctcaaaggctctcttttgataaggctgaa
ccaaatataatcccaagtatcctctctccttccttgttggagatgtctta
cctctcagctccccaaaatgcacttgcctataagaaacacaattgctggt
tcatatgaaacttaggaaatagtgaataaggtgcatttaactttggagaa
atacttttatggctttggtggagatttctcaatactgcaaaagttgtcca
gaaatgaatctgagctgatggtgactttaagttaatattattaatatatc
actgcatatttttacccttatttttgctccttacagcaagattagtaggt
tataaaaatttaaatttaaacaaaattatttcatgacaaaatgggaaact
tcacatcatacttatttttgtttgcctttcaggcatcatattagctttta
taaaaaatggtcttgctgctgaaattgtacttattttatcagaggctggg
tgcagtcaagacaaaagtaaaatggtttacctgagcccaggggagggaaa
attgattaagatatcattatttttgtttggtttggttttgcttttttcct
cttactttaattgaaatactctgaattcccctcatggaaacagagagcat
tgagagcactttctttaaaaggaccaaaaataaattcctaatagattttg
tcctaagagagtgtttttttttctagcatcattttctttacatgccactc
atgtcataaggcatggacaggctatctttcagtggccattactatgtttc
gtacacatgctttattttacttgggctctgagaaatgtgtggctttcctt
cagcattttatttgtgcttctctttttaatggagattgaaaagggagaat
aatgtgaatatcacggcttatattattaaatgttgattgatggcttgtaa
tgtactgcacacaatatatgttaactctgcagaatgacagaccctgggag
aagtaatgccccagttgtcccccactcctaatgccaggcagagaaggaca
gcctttatagacttaatctgctttttgtcccatttgacaaggtaccagga
ggaaattttttaagggatcaactgtatcacagtgcccactctggacctaa
gtctagtgtatccatacaattggtgcagagaaataaggtgtaaatggtgc
tttgttcctgctggttccaagctcagaaaccaagactagctttgtaggag
agaatgagagcctgcaagcctctctttggattggctgaggagtggtggga
gcagggggttgatagaaaacatccagacacacatataagcaagtggccgt
gctacctttttagagaataaagaaacagacttttgagtttatatgcaatg
ccttcattaggtaccaccggcacttacaaaatgtgcggactgaatcccag
agaacactggcagatgtatacagtatatggattgtatcgcttccccaatg
tttgtaaattcacagtatttggaaaactgccttcattttccagtgtggga
aaaactcttgctacctgtattacttgatctcagacccatacctgatggtt
cagtctgtccttaagttaaaagaattttgcttttctaatgttatactatt
tacctgtcagtgtattactgcaacttgaatcactcttttactgttgttgg
atataaacttatcctgtaccaatgtatttattaacacttgtattttatta
ttgagcatatcaataaaaatattaaaaaataacagattgttttttaccaa
aaaaaaaaaaaaa
>NR_026816 1
caacccactctctgtgctatgacttcattactctttcccagcccagccct
gggcaagccccttacgaagtctcaggctacctggatgaccaccctttctt
atgatgctgcaaggagggcaggtgggcagagccccgtgcatcctgggctc
aggccagggacccaagagcttgggagaagctggttctcagactgaaggcc
agagcccagcaccttgtcaccatcccggggagcatcatggcacacaacaa
ccagagccaaggctacagctagagagttgactcctctatttgagattgac
aggcctcggaagtcaaaataagtggtttcctagaccgggtcgagagcaag
tctctattggtcccaactgagttttttcagctggtttttcaaccaaacag
cacctcatctcccagtgaggggaagggaaggctgggctgagagcagcaag
gctgctcatctcacctctccccacccagccatgccagccgcctcacctgg
tggggagaggtgggcctcacctgggtcccctggcagtgctctgtgaaggg
tcttgacattgcactgtaataataaaggtgtgtgtgaagtatcaaaaaaa
>NR_027917 1
atgaagatgattgagcagcacaatcaggaatacagggaagggaaacacag
cttcacaatggccatgaacgcctttggagaaatgaccagtgaagaattca
ggcaggtggtgaatggctttcaaaaccagaagcacaggaaggggaaagtg
ctccaggaacctctgcttcatgacatccgcaaatctgtggattggagaga
gaaaggctacgtgactcctgtgaaggatcagtgcagctggggctctgtaa
ggacagatgttaggaaaactgagaaactagtttcactgagtgtgcagacc
tggtggactgctctaggcttcaaggcaatgttggctgcatttttggagaa
ccattattttgcttccagtatgttgccgacaatggaggcctggactctga
ggaatccttttcatatgaagaaaagctctggagactggaaagtccaaggt
cacagaggtgcatctggtgagagccttcttgctagtggggaatctcagca
gagtcctgaggtggcacagtattctgggaagcatcaagtgcagtgtcatc
ttatcgaggaggctctgcagatgctaagtggtggggatgaggatcacgat
gaagacaaatggccccatgacatgaggaatcatctggctggagaggccca
ggtgtag
>NR_002777 3
cttgtcctttcagaagatcagagacaagtgatatctgtgccaatttggcc
ttttcagtgttataattatggtgtcttgggatcccaatatttctcctaat
gtttccctgatgtgatactttgagagcccaggatgccagtacaataattg
aaattcacaaatgtctggtatcttgtccctcgtgccccatatattatctg
tggtttcggagagctcacttgtctcttatcttcagaaatgacagcacatg
aaatgttgtttggagccactgtcacatcaactgtagaaaaattaacaggt
cagctaagggatataatgtaactttatttgtgatatgagagaaatcttga
taaagacttgagagaaaactgggaggaaccttgtttagaagttataagga
ggggtaagttatgtgtgtcttggaaggagaatcataaatcttaaaacatg
agcctaatagagaacataaaattctaaaagataaagataataataatgat
aagccgcagggtggcttatgataatgtgacttctccttaccccagtagcg
tcggacatctgtcagctctgaaatgataaaaatgcacaatattgaataca
aacaaaggagtcagcactgaaattcattttctctccagattagggaaaga
gtaggtatgccctatggtagggcagtaaattgctgaatgatgagatgaaa
cagccacctagccatttcccattaaatataatcccatcagcagcagacaa
tatctatcctcccctatcccctctatccatatttggaaactgcaccctct
tccctatttagcaccctaacaccacttgaattccataaccctgttgttga
tctagctctcctcacctctaaacacttctagcattcctttcagatcagga
gctcgaaacactctcctttgattttttggaaaagtttctggcttcttcaa
ggtcacgttctccgtcctaagaattaaaaaaaaaaaaaaaaacttccaaa
cctttgaccttgtgtccgtggaacacccctgacttcctatcatttcaacc
cattgaggcacttgaactctcttcttggggatcctgagaagggagagtgc
aaactcttgaccctggaggcaaacaaaatgttctcatgtttgccttccca
cttactttctgtgagaacgtgggaagatcttaacctctcagaagcacagt
ttcttccttctaaaatgaaataattaacctctccctgtctacattcttaa
actcataggacataaaaaaaaaaaaaa

The lines beginning with '>whatever something else' give you the 'id', (whatever), for the sequence and the description, (something else). In the sample above, there are ids like NR_002777 and a description like 3. The following lines contain the sequence itself, (cttgtcctttcagaagatcagag...).

So, there is more than 1 sequence in the sample file.

You shouldn't include these header lines in the window. Things start to get messy when trying to manually parse the dna sequences. It is probably safer to use the modules made for this purpose. (I don't know how many sequences he had for his problem).

I noticed a difference in the way you and I generated the windows. I started at the beginning and slid the window 100 chars to the right and generated the skew value, sliding to the end of the sequence. Your solution gets the windows from start to finish, then slides 200 offset and redoes the file again from start to finish. The number slid, (my 100 or your 200), is not the point, (I could adjust to 200 or whatever size offset). It's the different approachs taken.

I made the adjustment to the skew calculation you poointed out, using $size_read rather than the $window_size. The 2 changed programs are listed earlier in this thread.


Laurent_R
Veteran / Moderator

Oct 28, 2012, 11:19 AM

Post #32 of 34 (6213 views)
Re: [Chris Charley] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Hi Chris,

I had never done any work or anything on DNA sequences before, and I clearly said from the beginning that my knowledge is close to 0 on these matters.

Among other things, I did not know about headers and I did not notice that there were such headers in you file. I think Kevone said that his or her file contained only a stream of ACGT letters, so I assume it had not headers.

Of course it would not be complicated to remove the headers just as I have removed carriage return and new line characters, but I agree (and already wrote) that, in general, it is better to use existing modules if they are fit for the task at hand. In this particular case, the initial requirement was so simple that it could be done in a dozen lines of code, so I suggested some code without modules (I had no idea about what existing modules can or cannot do, as I said, this is really not my field).

As for the sliding window, I thought of at least 3 different methods to implement it, the one I used, something very similar to the one you used and, since the window size is a multiple of the slide factor, a third one where it would cut the file in segments of 200 bases and store the numbers of C and G for each segment in an array (or 2 arrays, an array of hashes, or whatever), and then do all the calculations on this array. This last method would probably be the best in terms of performance, but performance is not an issue with a mere 5 million nucleobases and the program might be a bit more complex. I chose the approach I used simply because it seemed to be the one that required the minimal amount of changes to the code I had already written.


Chris Charley
User

Oct 28, 2012, 1:58 PM

Post #33 of 34 (6208 views)
Re: [Laurent_R] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Yes Laurent, I realize you said earlier that it might be better to use the Bio::... modules for this problem. I guess it slipped my mind.

Haven't heard from the original poster as to whether any of these schemes will work however, so maybe its best to let this thread rest until we hear from him.

There are many things I don't know about genomic problems. There is a BLAST, reverse compliment, translate and several other things about genomes and I don't know what they are or what Bio modules to use to get them. (Well, reverse compliment and translate methods are in Bio::Seq.)


Kevone07
Novice

Oct 29, 2012, 2:09 PM

Post #34 of 34 (6178 views)
Re: [Laurent_R] DNA Sequence Count Perl Program HELP! [In reply to] Can't Post

Honestly, you do an amazing job at what you are able to do with perl. Just by looking at the what you collaborated together, I'm able to tweak it just a little to make it work perfect. I just want to confirm it with you on my outlook on the file by interpreting it. So, what this program is doing is reading the input file, going through a bunch of loops with certain data types, counting the specific letters contained in the input file, inserting in the skew function, making the first recorded data in the test_results, going through this whole loop again with a slided window of +100 and so and so until it reaches the total amount of about 5,000,000 bp (I tweaked the program to go to somewhere around that amount). I just need to make sure that I get the correct slided window because I'll need to take the average of the 5 windows skew in-order to get clear up the noise on my graph. I'll show you what I ended up getting with the tweak. I'm going to send you a private message of my results.

 
 


Search for (options) Powered by Gossamer Forum v.1.2.0

Web Applications & Managed Hosting Powered by Gossamer Threads
Visit our Mailing List Archives