CGI/Perl Guide | Learning Center | Forums | Advertise | Login
Site Search: in

  Main Index MAIN
INDEX
Search Posts SEARCH
POSTS
Who's Online WHO'S
ONLINE
Log in LOG
IN

Home: Perl Programming Help: Beginner:
My First real script (version 2 troubles)

 



orion
Novice

Jun 20, 2011, 12:18 PM

Post #1 of 15 (1374 views)
My First real script (version 2 troubles) Can't Post

i am trying to write a script to scan text files, extract data from them
scan that data for sets of a's followed by t's just a's or just t's
that are between 4 and 7 characters long
then output the start and end points of each of those runs
suffice to say it does not work, and i don't know why
here is my code so far:
#!/usr/bin/perl -w
use strict;
use warnings;


my @genome;
my $input_file;
my $line;
my $output;
my $genome;


open FILE, "~/test.gbk;" or die("could not open $input_file");

while (my $line = <FILE>) {
if ($line =~ [\d|\s]{8}\d \D{10} \D{10} \D{10} \D{10})
push (@genome, $line)}


open(MYOUTFILE, ">filename.out") or die("could not open $output");
print MYOUTFILE "Start:End:Atract:angle";


# removes spaces from array
$genome = @genome;
$genome =~ s/\s|\d//g;
# end of




while ($genome =~ /a+t?|t+/g) {

if ($+[0] - $-[0] == (4|7)) {
print MYOUTFILE "$-[0]:$+[0]:15/n";
}
if ($+[0] - $-[0] == (5|6)) {
print MYOUTFILE "$-[0]:$+[0]:20/n";
}

# the regex for the patern will be a+t?|t+


(This post was edited by orion on Jun 24, 2011, 10:23 AM)


rovf
Veteran

Jun 21, 2011, 1:01 AM

Post #2 of 15 (1359 views)
Re: [orion] My First real script (help needed) [In reply to] Can't Post

Don't just say "it does not work". Explain the expected output, and the output you get instead. Also use code tags around your code. Code of more than a few lines is hard to read if formatted like normal text.

BTW, is there a particular reason why you use in your regexp


Quote
\D{10} \D{10} \D{10} \D{10}


instead of the simpler


Code
\D{40}


?


mino
Novice

Jun 21, 2011, 2:18 AM

Post #3 of 15 (1354 views)
Re: [orion] My First real script (help needed) [In reply to] Can't Post

There are several errors in your code which you should fix first of all, among others:


Code
open FILE, "~/test.gbk;" or die("could not open $input_file");


Is the file really called "test.gbk;"? Also, $input_file is not defined anywhere.


Code
while (my $line = <FILE>) { 
if ($line =~ [\d|\s]{8}\d \D{10} \D{10} \D{10} \D{10})
push (@genome, $line)}


The regex must be enclosed by slashes. There's no code block following the "if" statement.


Code
while ($genome =~ /a+t?|t+/g) {


The while loop doesn't have an ending }.

Since you use warnings and strictures, start by fixing the errors you get. Then post back
here if it still doesn't work.


orion
Novice

Jun 22, 2011, 7:37 AM

Post #4 of 15 (1337 views)
Re: [mino] My First real script (help needed) [In reply to] Can't Post

sorry bout the poor post

to answer questions
for now yes the input file is test.gbk
the file is a mock genome file that I'm using to make sure the script works before i bring in the real ones.
the reason that the first regex is not one piece is because there should be spaces in between each ten chars.
i continued to fix errors until no more came but it still doesn't output correctly.

Code
#!/usr/bin/perl -w 
use strict;
use warnings;


my @genome;
my $input_file;
my $line;
my $output;
my $genome;



#$input_file = ~/test.gbk;
open(my $fh, "<", "test.gbk")
or die "cannot open < test.gbk: $!";

while (my $line = <$fh>) {
if ($line =~ /[\d|\s]{8}\d \D{10} \D{10} \D{10} \D{10}/) {
push (@genome, $line);
}
}


open(MYOUTFILE, ">-test.out") or die("could not open test.out");
print MYOUTFILE "Start:End:angle";


# removes spaces from array
$genome = @genome;
$genome =~ s/\s|\d//g;
# end of




while ($genome =~ /a+t?|t+/g) {

if ($+[0] - $-[0] == (4|7)) {
print MYOUTFILE "$-[0]:$+[0]:15/n";
}
if ($+[0] - $-[0] == (5|6)) {
print MYOUTFILE "$-[0]:$+[0]:20/n";
}


}

the input is a file that looks something like

Code
ORIGIN       
1 ttacaaataa gggactcggt caccccctga aacgcccaga ccgaaaaccc taaaccgaaa
61 cctaaaacct aaaaccctaa gcctaaacct agacctaaaa cctaaaaccc ccttaaaccc
121 ctaaacctaa acctaaacct aaacccctaa acctaggccc taaacctaaa cgccctaaac
181 ctaaaaccta aaacctgaaa cctaaaccta aaccctgaac ctaaaaaacc taaaacctaa

and for this file the output should be

Code
Start:End:angle 
5:8:15
54:57:15
etc.

but at the moment the ouput just is
Start:end:angle



mino
Novice

Jun 22, 2011, 9:22 AM

Post #5 of 15 (1330 views)
Re: [orion] My First real script (help needed) [In reply to] Can't Post


Quote
yes the input file is test.gbk


Seems more likely, your original code had the filename "test.gbk;" (with the semicolon). Smile

So, I fixed a few errors in your code.


Code
$genome = @genome;


This will return the number of elements of @genome and store it in $genome.
Read up on scalar vs. list context.
Since you don't use @genome for anything really, I removed it altogether.


Code
if ($+[0] - $-[0] == (4|7)) {


Won't work, you have to spell out the alternatives.

The following is the revised code.


Code
#!/usr/bin/perl  
use strict;
use warnings;

my $input_file;
my $genome;

#$input_file = ~/test.gbk;
open( my $fh, "<", "test.gbk" )
or die "cannot open < test.gbk: $!";

while ( my $line = <$fh> ) {
if ( $line =~ /[\d|\s]{8}\d \D{10} \D{10} \D{10} \D{10}/ ) {
$genome .= $line;
}
}

open( MYOUTFILE, ">-test.out" ) or die("could not open test.out");
print MYOUTFILE "Start:End:angle\n";

while ( $genome =~ /a+t?|t+/g ) {
my $offset = $+[0] - $-[0];

if ( $offset == 4 or $offset == 7 ) {
print MYOUTFILE "$-[0]:$+[0]:15\n";
}
elsif ( $offset == 5 or $offset == 6 ) {
print MYOUTFILE "$-[0]:$+[0]:20\n";
}
}



(This post was edited by mino on Jun 22, 2011, 9:23 AM)


orion
Novice

Jun 22, 2011, 10:28 AM

Post #6 of 15 (1325 views)
Re: [mino] My First real script (help needed) [In reply to] Can't Post

Thanks for all the help
but before i read yours i figured i would throw more regex at it

i also added a bit more to the output
thank you very much for all the help

Code
#!/usr/bin/perl -w 
use strict;
use warnings;


my @genome;
my $input_file;
my $line;
my $output;
my $genome;
my $d;
my $track;
my $s;
my $e;
#$input_file = ~/test.gbk;
open(my $fh, "<", "test.gbk")
or die "cannot open < test.gbk: $!";

while (my $line = <$fh>) {
if ($line =~ /[\d|\s]{8}\d\s\D{10}\s\D{10}\s\D{10}\s\D{10}/) {
push (@genome, $line);

}
}


open(MYOUTFILE, ">test.out") or die("could not open test.out");
print MYOUTFILE "Start:End:Atract:angle\n";


# removes spaces from array
$genome = "@genome";
$genome =~ s/\s|\d//g;
# end of



# prints output file/pushes data onto arrays
while ($genome =~ /a+t?|t+/g) {
$d = $+[0] - $-[0];
$track = $&;
$s = $-[0];
$e = $+[0];

if ($d =~ m/4|7/) {
print MYOUTFILE "$s:$e:$track:15\n";
}

if ($d =~ m/5|6/) {
print MYOUTFILE "$s:$e:$track:20\n";
}
}



mino
Novice

Jun 22, 2011, 10:32 AM

Post #7 of 15 (1323 views)
Re: [orion] My First real script (help needed) [In reply to] Can't Post


Code
if ($d =~ m/4|7/) {


Beware, this will also match e.g. 14 or 17.


orion
Novice

Jun 22, 2011, 10:41 AM

Post #8 of 15 (1319 views)
Re: [mino] My First real script (help needed) [In reply to] Can't Post

how could i fix that?


mino
Novice

Jun 22, 2011, 10:50 AM

Post #9 of 15 (1318 views)
Re: [orion] My First real script (help needed) [In reply to] Can't Post

Well, you can add anchors:


Code
if ($d =~ m/^(4|7)$/) {


But personally I would rather use numerical match instead of string match:


Code
if ( $d == 4 or $d == 7 ) {



harshmane
Novice

Jun 23, 2011, 10:51 PM

Post #10 of 15 (1298 views)
Re: [mino] My First real script (help needed) [In reply to] Can't Post

i want to know the use of $+[0] & $-[0].


orion
Novice

Jun 24, 2011, 10:17 AM

Post #11 of 15 (1288 views)
Re: [harshmane] My First real script (help needed) [In reply to] Can't Post

they are automatically stored variables
that print the location of the start and end of the regular expression match


orion
Novice

Jun 24, 2011, 10:22 AM

Post #12 of 15 (1287 views)
Re: [orion] My First real script (help needed) [In reply to] Can't Post

i have come to a new problem now though
something is clearly wrong with my sum loops
the terminal output is:
Use of uninitialized value in addition (+) at atract.pl line 98, <$fh> line 136.
Use of uninitialized value in addition (+) at atract.pl line 106, <$fh> line 136.

and like a million of those



Code
#!/usr/bin/perl -w  
use Math::Trig;
use strict;
use warnings;
########set both start and end to 0 to scan full file##########
my $start; $start = 0; # starts scanning at
my $end; $end = 0; # ends scanning at
my $bendmin; $bendmin = 45; # minimum bend
my $Mtract; $Mtract = 99; # maximum number of a-tracts in a scan
my $ecut;
my @genome;
my $genome;
my $length;
my $Wi;my $X;my $Y; #for math
my @X;my @Y; my @X1; my @Y1; my @X2; my @Y2;
my $pop;
my $k;my $ysum;my $xsum;

#opens input
open(my $fh, "<", "input.gbk") or die "cannot open < test.gbk: $!";

#Depricated output
#open(MYOUTFILE, ">test.out") or die "could not open test.out";
#print MYOUTFILE "Start:End:A-tract:Xi:Yi:Angle\n";

#imports file removes spaces and digits
while (my $line = <$fh>) {
if ($line =~ /[\d|\s]{8}\d\s\D{10}\s\D{10}\s\D{10}\s\D{10}/) {
push (@genome, $line);
}
}
$genome = "@genome";
$genome =~ s/\s|\d//g;

#cuts the genome at the start and end points.
$end -= 1;
$start -= 1;
my $size; $size = length($genome);
$ecut = $size - $end;

if ($ecut + $start < $size) {
if ($end > 0) {
$genome = reverse;
$genome =~ s/\w{$ecut}//;
$genome = reverse;}
if ($start > 0) {
$genome =~ s/\w{$start}//;
} else { $start = 0 ;}
}



# searches $genome, does math, prints output.
while ($genome =~ /(?i)a+t?|(?i)t+/g) {
$length = $+[0] - $-[0];
if ( $length == 4 or $length == 5 or $length == 6 or $length == 7 ) {
$Wi = ((($-[0] + $start) + ($+[0] + $start)) / 3.18309886); #10.5 period
$Y = sin $Wi;
$X = cos $Wi;


if ( $length == 4 or $length == 7 ) {
$X *= 0.261799388;
$Y *= 0.261799388;
}

if ( $length == 5 or $length == 6 ) {
$X *= 0.34906585;
$Y *= 0.34906585;
}
$X = rad2deg($X);
$Y = rad2deg($Y);
push (@X, $X);
push (@Y, $Y);
}
}

#stores backups
@X2 = @X;
@Y2 = @Y;

#sub add adds the a-tracts together untill they have reached the max number of a-tracts in a curve
#it prints when the curve is larger then the bend min
sub add {
$pop = shift (@X);
push(@X1, $pop);
$pop = shift (@Y);
push(@Y1, $pop);
$k = 0;
my $n; $n = 0;


while ( $n + 1 < $Mtract ) {
$pop = shift (@X);
push(@X1, $pop);
$xsum = 0;
foreach (@X1) {
$xsum += $_;}
$xsum *= $xsum;


$pop = shift (@Y);
push(@Y1, $pop);
$ysum = 0;
foreach (@Y1) {
$ysum += $_;}
$ysum *= $ysum;
$k = sqrt ($xsum + $ysum);
$n += 1;

if ( $k <= $bendmin ) {
print "degrees: $k\n";
}
}


#this bit is so that if i run &add multiple times (which my end result will) it will advance
shift(@X1);
shift(@Y1);

while ($n > 0) {
$pop = pop(@X1);
unshift (@X, $pop);
$pop = pop(@Y1);
unshift (@Y, $pop);
$n -= 1;
}
}

&add;



(This post was edited by orion on Jun 24, 2011, 10:31 AM)


BillKSmith
Veteran

Jun 24, 2011, 1:37 PM

Post #13 of 15 (1280 views)
Re: [orion] My First real script (help needed) [In reply to] Can't Post

Is there a blank line (possibly at the end) in your data file? You probably should test for it.


Code
while (my $line = <$fh>) {  

next if $line =~ /^\s*$/;



Good Luck,
Bill


orion
Novice

Jun 27, 2011, 7:36 AM

Post #14 of 15 (1272 views)
Re: [BillKSmith] My First real script (help needed) [In reply to] Can't Post

that doesn't seam to be the problem
i tried switching the addition to summation but that didn't help either


BillKSmith
Veteran

Jun 28, 2011, 1:33 PM

Post #15 of 15 (1255 views)
Re: [orion] My First real script (help needed) [In reply to] Can't Post

You problem is not in your arithmetic. You are not even creating the genome string. It would be easier to delete everything except the symbols a,c,g, and t.


Code
  

do{

open( my $fh, "<", "input.gbk" ) or die "cannot open < test.gbk: $!";

$INPUT_RECORD_SEPARATOR = undef;

$genome = <$fh>;

close $fh;

};

$genome =~ s/[^agct]//g;






Your first if is only executed once, and always fails.

The regular expression in your while statement always fails to match. The loop is never executed. If it did match, the loop would never terminate.

It is still not clear to me what you are trying to do. Consider the contrived genome “aaaaaaattttttt” . I count 38 subsequences that appear to meet your conditions. (Eight with 7 bases each, nine with 6 bases each, ten with 5 bases each, and eleven with 4 bases each.) Do you intend to find the start and end of each of them? If not, which ones? Are you sure that this explosion of sequences can never occur in nature? Explain.




Good Luck,
Bill

 
 


Search for (options) Powered by Gossamer Forum v.1.2.0

Web Applications & Managed Hosting Powered by Gossamer Threads
Visit our Mailing List Archives