Home: Perl Programming Help: Beginner:
My First real script (version 2 troubles)



orion
Novice

Jun 20, 2011, 12:18 PM


Views: 3796
My First real script (version 2 troubles)

i am trying to write a script to scan text files, extract data from them
scan that data for sets of a's followed by t's just a's or just t's
that are between 4 and 7 characters long
then output the start and end points of each of those runs
suffice to say it does not work, and i don't know why
here is my code so far:
#!/usr/bin/perl -w
use strict;
use warnings;


my @genome;
my $input_file;
my $line;
my $output;
my $genome;


open FILE, "~/test.gbk;" or die("could not open $input_file");

while (my $line = <FILE>) {
if ($line =~ [\d|\s]{8}\d \D{10} \D{10} \D{10} \D{10})
push (@genome, $line)}


open(MYOUTFILE, ">filename.out") or die("could not open $output");
print MYOUTFILE "Start:End:Atract:angle";


# removes spaces from array
$genome = @genome;
$genome =~ s/\s|\d//g;
# end of




while ($genome =~ /a+t?|t+/g) {

if ($+[0] - $-[0] == (4|7)) {
print MYOUTFILE "$-[0]:$+[0]:15/n";
}
if ($+[0] - $-[0] == (5|6)) {
print MYOUTFILE "$-[0]:$+[0]:20/n";
}

# the regex for the patern will be a+t?|t+


(This post was edited by orion on Jun 24, 2011, 10:23 AM)


rovf
Veteran

Jun 21, 2011, 1:01 AM


Views: 3781
Re: [orion] My First real script (help needed)

Don't just say "it does not work". Explain the expected output, and the output you get instead. Also use code tags around your code. Code of more than a few lines is hard to read if formatted like normal text.

BTW, is there a particular reason why you use in your regexp


Quote
\D{10} \D{10} \D{10} \D{10}


instead of the simpler


Code
\D{40}


?


mino
Novice

Jun 21, 2011, 2:18 AM


Views: 3776
Re: [orion] My First real script (help needed)

There are several errors in your code which you should fix first of all, among others:


Code
open FILE, "~/test.gbk;" or die("could not open $input_file");


Is the file really called "test.gbk;"? Also, $input_file is not defined anywhere.


Code
while (my $line = <FILE>) { 
if ($line =~ [\d|\s]{8}\d \D{10} \D{10} \D{10} \D{10})
push (@genome, $line)}


The regex must be enclosed by slashes. There's no code block following the "if" statement.


Code
while ($genome =~ /a+t?|t+/g) {


The while loop doesn't have an ending }.

Since you use warnings and strictures, start by fixing the errors you get. Then post back
here if it still doesn't work.


orion
Novice

Jun 22, 2011, 7:37 AM


Views: 3759
Re: [mino] My First real script (help needed)

sorry bout the poor post

to answer questions
for now yes the input file is test.gbk
the file is a mock genome file that I'm using to make sure the script works before i bring in the real ones.
the reason that the first regex is not one piece is because there should be spaces in between each ten chars.
i continued to fix errors until no more came but it still doesn't output correctly.

Code
#!/usr/bin/perl -w 
use strict;
use warnings;


my @genome;
my $input_file;
my $line;
my $output;
my $genome;



#$input_file = ~/test.gbk;
open(my $fh, "<", "test.gbk")
or die "cannot open < test.gbk: $!";

while (my $line = <$fh>) {
if ($line =~ /[\d|\s]{8}\d \D{10} \D{10} \D{10} \D{10}/) {
push (@genome, $line);
}
}


open(MYOUTFILE, ">-test.out") or die("could not open test.out");
print MYOUTFILE "Start:End:angle";


# removes spaces from array
$genome = @genome;
$genome =~ s/\s|\d//g;
# end of




while ($genome =~ /a+t?|t+/g) {

if ($+[0] - $-[0] == (4|7)) {
print MYOUTFILE "$-[0]:$+[0]:15/n";
}
if ($+[0] - $-[0] == (5|6)) {
print MYOUTFILE "$-[0]:$+[0]:20/n";
}


}

the input is a file that looks something like

Code
ORIGIN       
1 ttacaaataa gggactcggt caccccctga aacgcccaga ccgaaaaccc taaaccgaaa
61 cctaaaacct aaaaccctaa gcctaaacct agacctaaaa cctaaaaccc ccttaaaccc
121 ctaaacctaa acctaaacct aaacccctaa acctaggccc taaacctaaa cgccctaaac
181 ctaaaaccta aaacctgaaa cctaaaccta aaccctgaac ctaaaaaacc taaaacctaa

and for this file the output should be

Code
Start:End:angle 
5:8:15
54:57:15
etc.

but at the moment the ouput just is
Start:end:angle



mino
Novice

Jun 22, 2011, 9:22 AM


Views: 3752
Re: [orion] My First real script (help needed)


Quote
yes the input file is test.gbk


Seems more likely, your original code had the filename "test.gbk;" (with the semicolon). Smile

So, I fixed a few errors in your code.


Code
$genome = @genome;


This will return the number of elements of @genome and store it in $genome.
Read up on scalar vs. list context.
Since you don't use @genome for anything really, I removed it altogether.


Code
if ($+[0] - $-[0] == (4|7)) {


Won't work, you have to spell out the alternatives.

The following is the revised code.


Code
#!/usr/bin/perl  
use strict;
use warnings;

my $input_file;
my $genome;

#$input_file = ~/test.gbk;
open( my $fh, "<", "test.gbk" )
or die "cannot open < test.gbk: $!";

while ( my $line = <$fh> ) {
if ( $line =~ /[\d|\s]{8}\d \D{10} \D{10} \D{10} \D{10}/ ) {
$genome .= $line;
}
}

open( MYOUTFILE, ">-test.out" ) or die("could not open test.out");
print MYOUTFILE "Start:End:angle\n";

while ( $genome =~ /a+t?|t+/g ) {
my $offset = $+[0] - $-[0];

if ( $offset == 4 or $offset == 7 ) {
print MYOUTFILE "$-[0]:$+[0]:15\n";
}
elsif ( $offset == 5 or $offset == 6 ) {
print MYOUTFILE "$-[0]:$+[0]:20\n";
}
}



(This post was edited by mino on Jun 22, 2011, 9:23 AM)


orion
Novice

Jun 22, 2011, 10:28 AM


Views: 3747
Re: [mino] My First real script (help needed)

Thanks for all the help
but before i read yours i figured i would throw more regex at it

i also added a bit more to the output
thank you very much for all the help

Code
#!/usr/bin/perl -w 
use strict;
use warnings;


my @genome;
my $input_file;
my $line;
my $output;
my $genome;
my $d;
my $track;
my $s;
my $e;
#$input_file = ~/test.gbk;
open(my $fh, "<", "test.gbk")
or die "cannot open < test.gbk: $!";

while (my $line = <$fh>) {
if ($line =~ /[\d|\s]{8}\d\s\D{10}\s\D{10}\s\D{10}\s\D{10}/) {
push (@genome, $line);

}
}


open(MYOUTFILE, ">test.out") or die("could not open test.out");
print MYOUTFILE "Start:End:Atract:angle\n";


# removes spaces from array
$genome = "@genome";
$genome =~ s/\s|\d//g;
# end of



# prints output file/pushes data onto arrays
while ($genome =~ /a+t?|t+/g) {
$d = $+[0] - $-[0];
$track = $&;
$s = $-[0];
$e = $+[0];

if ($d =~ m/4|7/) {
print MYOUTFILE "$s:$e:$track:15\n";
}

if ($d =~ m/5|6/) {
print MYOUTFILE "$s:$e:$track:20\n";
}
}



mino
Novice

Jun 22, 2011, 10:32 AM


Views: 3745
Re: [orion] My First real script (help needed)


Code
if ($d =~ m/4|7/) {


Beware, this will also match e.g. 14 or 17.


orion
Novice

Jun 22, 2011, 10:41 AM


Views: 3741
Re: [mino] My First real script (help needed)

how could i fix that?


mino
Novice

Jun 22, 2011, 10:50 AM


Views: 3740
Re: [orion] My First real script (help needed)

Well, you can add anchors:


Code
if ($d =~ m/^(4|7)$/) {


But personally I would rather use numerical match instead of string match:


Code
if ( $d == 4 or $d == 7 ) {



harshmane
Novice

Jun 23, 2011, 10:51 PM


Views: 3720
Re: [mino] My First real script (help needed)

i want to know the use of $+[0] & $-[0].


orion
Novice

Jun 24, 2011, 10:17 AM


Views: 3710
Re: [harshmane] My First real script (help needed)

they are automatically stored variables
that print the location of the start and end of the regular expression match


orion
Novice

Jun 24, 2011, 10:22 AM


Views: 3709
Re: [orion] My First real script (help needed)

i have come to a new problem now though
something is clearly wrong with my sum loops
the terminal output is:
Use of uninitialized value in addition (+) at atract.pl line 98, <$fh> line 136.
Use of uninitialized value in addition (+) at atract.pl line 106, <$fh> line 136.

and like a million of those



Code
#!/usr/bin/perl -w  
use Math::Trig;
use strict;
use warnings;
########set both start and end to 0 to scan full file##########
my $start; $start = 0; # starts scanning at
my $end; $end = 0; # ends scanning at
my $bendmin; $bendmin = 45; # minimum bend
my $Mtract; $Mtract = 99; # maximum number of a-tracts in a scan
my $ecut;
my @genome;
my $genome;
my $length;
my $Wi;my $X;my $Y; #for math
my @X;my @Y; my @X1; my @Y1; my @X2; my @Y2;
my $pop;
my $k;my $ysum;my $xsum;

#opens input
open(my $fh, "<", "input.gbk") or die "cannot open < test.gbk: $!";

#Depricated output
#open(MYOUTFILE, ">test.out") or die "could not open test.out";
#print MYOUTFILE "Start:End:A-tract:Xi:Yi:Angle\n";

#imports file removes spaces and digits
while (my $line = <$fh>) {
if ($line =~ /[\d|\s]{8}\d\s\D{10}\s\D{10}\s\D{10}\s\D{10}/) {
push (@genome, $line);
}
}
$genome = "@genome";
$genome =~ s/\s|\d//g;

#cuts the genome at the start and end points.
$end -= 1;
$start -= 1;
my $size; $size = length($genome);
$ecut = $size - $end;

if ($ecut + $start < $size) {
if ($end > 0) {
$genome = reverse;
$genome =~ s/\w{$ecut}//;
$genome = reverse;}
if ($start > 0) {
$genome =~ s/\w{$start}//;
} else { $start = 0 ;}
}



# searches $genome, does math, prints output.
while ($genome =~ /(?i)a+t?|(?i)t+/g) {
$length = $+[0] - $-[0];
if ( $length == 4 or $length == 5 or $length == 6 or $length == 7 ) {
$Wi = ((($-[0] + $start) + ($+[0] + $start)) / 3.18309886); #10.5 period
$Y = sin $Wi;
$X = cos $Wi;


if ( $length == 4 or $length == 7 ) {
$X *= 0.261799388;
$Y *= 0.261799388;
}

if ( $length == 5 or $length == 6 ) {
$X *= 0.34906585;
$Y *= 0.34906585;
}
$X = rad2deg($X);
$Y = rad2deg($Y);
push (@X, $X);
push (@Y, $Y);
}
}

#stores backups
@X2 = @X;
@Y2 = @Y;

#sub add adds the a-tracts together untill they have reached the max number of a-tracts in a curve
#it prints when the curve is larger then the bend min
sub add {
$pop = shift (@X);
push(@X1, $pop);
$pop = shift (@Y);
push(@Y1, $pop);
$k = 0;
my $n; $n = 0;


while ( $n + 1 < $Mtract ) {
$pop = shift (@X);
push(@X1, $pop);
$xsum = 0;
foreach (@X1) {
$xsum += $_;}
$xsum *= $xsum;


$pop = shift (@Y);
push(@Y1, $pop);
$ysum = 0;
foreach (@Y1) {
$ysum += $_;}
$ysum *= $ysum;
$k = sqrt ($xsum + $ysum);
$n += 1;

if ( $k <= $bendmin ) {
print "degrees: $k\n";
}
}


#this bit is so that if i run &add multiple times (which my end result will) it will advance
shift(@X1);
shift(@Y1);

while ($n > 0) {
$pop = pop(@X1);
unshift (@X, $pop);
$pop = pop(@Y1);
unshift (@Y, $pop);
$n -= 1;
}
}

&add;



(This post was edited by orion on Jun 24, 2011, 10:31 AM)


BillKSmith
Veteran

Jun 24, 2011, 1:37 PM


Views: 3702
Re: [orion] My First real script (help needed)

Is there a blank line (possibly at the end) in your data file? You probably should test for it.


Code
while (my $line = <$fh>) {  

next if $line =~ /^\s*$/;



Good Luck,
Bill


orion
Novice

Jun 27, 2011, 7:36 AM


Views: 3694
Re: [BillKSmith] My First real script (help needed)

that doesn't seam to be the problem
i tried switching the addition to summation but that didn't help either


BillKSmith
Veteran

Jun 28, 2011, 1:33 PM


Views: 3677
Re: [orion] My First real script (help needed)

You problem is not in your arithmetic. You are not even creating the genome string. It would be easier to delete everything except the symbols a,c,g, and t.


Code
  

do{

open( my $fh, "<", "input.gbk" ) or die "cannot open < test.gbk: $!";

$INPUT_RECORD_SEPARATOR = undef;

$genome = <$fh>;

close $fh;

};

$genome =~ s/[^agct]//g;






Your first if is only executed once, and always fails.

The regular expression in your while statement always fails to match. The loop is never executed. If it did match, the loop would never terminate.

It is still not clear to me what you are trying to do. Consider the contrived genome “aaaaaaattttttt” . I count 38 subsequences that appear to meet your conditions. (Eight with 7 bases each, nine with 6 bases each, ten with 5 bases each, and eleven with 4 bases each.) Do you intend to find the start and end of each of them? If not, which ones? Are you sure that this explosion of sequences can never occur in nature? Explain.




Good Luck,
Bill