CGI/Perl Guide | Learning Center | Forums | Advertise | Login
Site Search: in

  Main Index MAIN
INDEX
Search Posts SEARCH
POSTS
Who's Online WHO'S
ONLINE
Log in LOG
IN

Home: Perl Programming Help: Beginner:
Help outputting first and last positions of blocks of the same type

 



GeneticsGirl
Novice

Aug 22, 2012, 1:42 PM

Post #1 of 13 (3630 views)
Help outputting first and last positions of blocks of the same type Can't Post

Hello,

I'm fairly new to programming - my field is genetics. I'm looking for help in how to write code to solve my problem below...

I have a file with 4 columns:
column 1: a position on a chromosome
column 2: allele 1
column 3: allele 2
column 4: classification type (2 types: SAME, DIFF)

A sample of my file is as follows:
1457 G G SAME
1979 G G SAME
2056 T T SAME
3091 A A SAME
3562 A G DIFF
3778 A A SAME
4124 T T SAME
4229 C T DIFF
4571 A G DIFF
5019 A C DIFF
5114 C C SAME
6291 T T SAME
6414 C C SAME
6553 C C SAME
6941 G G SAME

What I want to do is for each block of either SAME or DIFF, get the positions of the last and first element in that block. It would be great to also get a count of how many "SAME"s or "DIFF"s fall in the range as well.

So for the sample file, I'd like to get an output that would look like this:

Column 1: First position of block (of same type)
Column 2: Last position of block (of same type)
Column 3: Number of occurrences within block (of same type)
Column 4: type (SAME or DIFF)

Sample output:
1457 3091 4 SAME
3562 3562 1 DIFF
3778 4124 2 SAME
4229 5019 3 DIFF
5114 6941 5 SAME

Not sure where to start - either with a while loop, or using until...I'm very new to using any programming language.
Any help with this would be much appreciated.

Regards,

GeneticsGirl


BillKSmith
Veteran

Aug 22, 2012, 3:58 PM

Post #2 of 13 (3618 views)
Re: [GeneticsGirl] Help outputting first and last positions of blocks of the same type [In reply to] Can't Post

The first and last blocks can get tricky. For your first attempt, ignore them and write a loop that does everything else correctly. For your second attempt, we can help you with the two special cases. When that is working, you will have a working prototype of your program that we all understand. We can then discuss ways to make it clearer, faster, and more robust.

This is what the main loop of version 0 has to do:

Read and parse a line.

Decide if it is the start of a new block.

If it is not:
Update last position with new position.
Increment the record count.

otherwise:
Print the data that you have for the previous block.
Set first_position and last_position to current position
Set type from the new record
set count = 1.
Return to read next line.
Good Luck,
Bill


Laurent_R
Veteran / Moderator

Aug 22, 2012, 11:17 PM

Post #3 of 13 (3609 views)
Re: [GeneticsGirl] Help outputting first and last positions of blocks of the same type [In reply to] Can't Post

In the loop to read your file, you need to memorize what the last line was, to that you can figure out reading the current line whether the last line was the end of a block.

It should not be too complicated.


GeneticsGirl
Novice

Aug 23, 2012, 12:27 PM

Post #4 of 13 (3598 views)
Re: [BillKSmith] Help outputting first and last positions of blocks of the same type [In reply to] Can't Post

Hi Bill,

Thanks so much for the help and advice! Below is my first crack at the code - I feel I can parse the lines fine, but not sure how to tackle what you say as deciding on new blocks. I know that my loop for each line at the bottom is not working - it prints everything together, not in blocks.

Do you know of a link or info for me to look at on the web that might help explain as you had mentioned, finding the start of new blocks? Also how to set positions?



open (FILE, 'haplotypes.txt') || die "Unable to open file";

my $pos;
my $V_GT;
my $I_GT;
my $type;
my @array;
my @pos;
my @V_GT;
my @I_GT;
my @type;
my @lines;
my @temp;
$count = 0;

while (<FILE>){
my $line = $_;
chomp $line;

@array = split (/\s/, $line);

foreach my $array(@array){
$pos = $array[0];
$V_GT = $array[1];
$I_GT = $array[2];
$type = $array[3];
}
push @lines, $line;
push @pos, $pos;
push @V_GT, $V_GT;
push @I_GT, $I_GT;
push @type, $type;
}

for $x (0..$#lines){
if ($type[$x] eq $type[$x+1]){
push (@temp, $type[$x], $pos[$x]);
$count++;
print "@temp\n";
}
}


Regards,

Heather
GeneticsGirl


Laurent_R
Veteran / Moderator

Aug 23, 2012, 1:37 PM

Post #5 of 13 (3591 views)
Re: [GeneticsGirl] Help outputting first and last positions of blocks of the same type [In reply to] Can't Post

Hi Heather,

a few things that will make your code simpler.

To start with, use the following pragmas:


Code
use strict; 
use warnings;


That will give you a lot of indications on possibly wrong code.

Second, your long list of declarations could be simplified:


Code
my $pos; 
my $V_GT;
my $I_GT;
my $type;
# ...
}

into:


Code
my ( $pos,  $V_GT,  $I_GT, $type @array, ...};


This:

Code
while (<FILE>){ 
my $line = $_;
chomp $line;

@array = split (/\s/, $line);


could be rewritten as follows:


Code
while (<FILE>){ 
chomp;
@array = split;


@array is a poor name for an array. Try instead to use a name that correspond to the contents.

I do not have time right now to suggest a proper algorithm, but if that can wait for the weekend, I guess I could supply you with the basis of what you need.


(This post was edited by Laurent_R on Aug 23, 2012, 1:41 PM)


BillKSmith
Veteran

Aug 23, 2012, 3:10 PM

Post #6 of 13 (3585 views)
Re: [GeneticsGirl] Help outputting first and last positions of blocks of the same type [In reply to] Can't Post

You are over complicating the problem. Only four variables need file scope ($first_position, $last_position, block_size, and $type). Inside the loop, the only significant variables are $new_position and $new_type. There is no need for any arrays. It probably is a good idea to use $line rather than $_.

A new block is starting if $new_type is not equal to $type.

If you add these hints to my previous pseudo code, you will receive a warning about uninitialized variables while processing the first line. You will not print the last block. We will fix these problems later. Just get the loop working for the normal cases. Laurent's advice on style will help you in the long run.
Good Luck,
Bill


GeneticsGirl
Novice

Aug 23, 2012, 8:21 PM

Post #7 of 13 (3575 views)
Re: [Laurent_R] Help outputting first and last positions of blocks of the same type [In reply to] Can't Post

Hi Laurent_R,

Great - thank you for the simplifications - it cleans up my code quite a bit!

I'm working on what you said in your previous post about memorizing the last line to see whether it was the end of a block...again, I'm very new to programming, so I'm working on it! Will probably take me to the weekend to get even this to work!

Heather
GeneticsGirl


GeneticsGirl
Novice

Aug 23, 2012, 8:26 PM

Post #8 of 13 (3571 views)
Re: [BillKSmith] Help outputting first and last positions of blocks of the same type [In reply to] Can't Post

Hi Bill,

So I would have to push the current $type into a new variable? I will have to think about how to do this without arrays.

As Laurent said, I need to figure out the current/last position coding, which I'm trying to learn via Google help. :-)

Heather
GeneticsGirl


wickedxter
User

Aug 24, 2012, 2:24 AM

Post #9 of 13 (3555 views)
Re: [GeneticsGirl] Help outputting first and last positions of blocks of the same type [In reply to] Can't Post

This might help get you closer its not where it should be but close.


Code
use strict; 
use warnings;

my $new_data;
my $counter = 1;
my $class_watch;
my $inbetween = 1;

my $line_before;

while (<DATA>){
chomp;
my ($chrom_pos, $all1, $all2, $class) = split /\s/;

#if the class changes save the $chrom_pos and the line before
if($class ne $class_watch ){

$new_data .= "$chrom_pos\n" if $inbetween <= 2;
$new_data .= "$line_before \n" if $inbetween > 1;

#reset counter data
$inbetween = 1;
$class_watch = $class;
$line_before = $chrom_pos;
}else {
#if the class is the same keep track
$inbetween++;
$line_before = $chrom_pos;
$class_watch = $class;
}
#line counter
$counter++;
}

print $new_data;

__DATA__
1457 G G SAME
1979 G G SAME
2056 T T SAME
3091 A A SAME
3562 A G DIFF
3778 A A SAME
4124 T T SAME
4229 C T DIFF
4571 A G DIFF
5019 A C DIFF
5114 C C SAME
6291 T T SAME
6414 C C SAME
6553 C C SAME
6941 G G SAME
EOF


This output's :
1457
3091
3778
4229
4124
5019
6941


(This post was edited by wickedxter on Aug 24, 2012, 2:26 AM)


BillKSmith
Veteran

Aug 24, 2012, 8:48 AM

Post #10 of 13 (3531 views)
Re: [GeneticsGirl] Help outputting first and last positions of blocks of the same type [In reply to] Can't Post

Let me review. A block consists of four values (first position, last position, occurrence count, and classification type.) When we have these four values, we print them and initialize a new block.

I doubt that google will help you know when the block is complete. You already told us in your first post that a block consists of consecutive lines with the same classification type.

A line contains only two useful values (Chromosome position and classification type.) For each line, if the classification type of the line is the same as the classification type of the current block, this line belongs to that block. Replace the final position in the block with the chromosome position from the line and increment the occurrence count of the block. Otherwise, the block is complete - print it and then initialize a new block. Set both the first and last positions in the block to the new chromosome position, set the occurrence count to one, and the classification type to the value from the line.

You will have a problem with the first line because the block is not initialized and the last line will never get printed. Do not worry, we will fix this later.

In my version, I store the block as a hash with four keys. That hash is the only file-scope variable.
Good Luck,
Bill


Laurent_R
Veteran / Moderator

Aug 25, 2012, 4:06 AM

Post #11 of 13 (3501 views)
Re: [BillKSmith] Help outputting first and last positions of blocks of the same type [In reply to] Can't Post

Hi

this is a quick try:


Code
use strict;  
use warnings;

#process the first line separatly, this duplicates the code a bit but simplifies the algorithm
my $previous_line = <DATA>;
my $same = 0;
$same = 1 if $previous_line =~ /SAME/;
my $block_start = $1 if $previous_line =~ /^(\d+) /;
my $count = 1;
my $previous_same = $same;

while (my $line = <DATA>) {
$same = 0;
$same = 1 if $line =~ /SAME/;
if ($previous_same != $same or eof) {
#we are at the beginning of a new block, we need first to complete the processing of the previous one
my $block_end = $1 if $previous_line =~ /^(\d+) /;
my $block_type = $previous_same ? "SAME" : "DIFF";
my $output = "$block_start $block_end $count $block_type \n";
print $output;
#reinit of variables
$count = 1;
$block_start = $1 if $line =~ /^(\d+) /;
$previous_same = $same;
}
else {

#we are within in a block; just count
$count ++;
}
$previous_line = $line;
}

__DATA__
1457 G G SAME
1979 G G SAME
2056 T T SAME
3091 A A SAME
3562 A G DIFF
3778 A A SAME
4124 T T SAME
4229 C T DIFF
4571 A G DIFF
5019 A C DIFF
5114 C C SAME
6291 T T SAME
6414 C C SAME
6553 C C SAME
6941 G G SAME


The output (on the data placed at the end of the program after the __DATA__ tag):



Code
1457 3091 4 SAME 
3562 3562 1 DIFF
3778 4124 2 SAME
4229 5019 3 DIFF
5114 6553 4 SAME


As you can see, there is still a small bug, the last line is not processed correctly, but I don't have time right now to fix it. I still post it, since it gives you a good start.

I'll try to come back to it later.


(This post was edited by Laurent_R on Aug 25, 2012, 7:45 AM)


BillKSmith
Veteran

Aug 25, 2012, 7:44 AM

Post #12 of 13 (3494 views)
Re: [GeneticsGirl] Help outputting first and last positions of blocks of the same type [In reply to] Can't Post

Here is the code which I have been describing. Although it looks much different, it works the same as Laurent's. (Except that the last position is updated correctly.) Every line (even the first) is assumed to be the last line of a block until proven false.

Data for the current Block is stored in a hash %block. In each section of the code, a single assignment statement (using hash slices) updates the block.

The do block around the initialization is not necessary. Its purpose it to limit the scope of the two temporary variables ($position and $class) used in the initialization. Note that the two 'my' variables with the same names that are used inside the loop are not the same two variables. Their scope is limited to the loop.

Refer to perldoc -f undef for an example of this use of undef.


The print statements are idiomatic. Hash slice notation is used to force the order of the values. The special variable $OUTPUT_FIELD_SEPARATOR (default: single space) is used implicitly to separate the values. (Thanks to FishMonger for suggesting this idiom in a recent unrelated post)

The print statement after the loop is needed to print the final block.

Code
use strict; 
use warnings;
my %block;

do {
# Initialize first block
$_ = <DATA>;
my ($position, undef, undef, $class) = split;
@block{'first_position', 'last_position', 'occur', 'class' }
= ( $position, $position, 1, $class );
};

while (<DATA>) {
my ($position, undef, undef, $class) = split;
if ($class eq $block{class}) {
# Update current block
@block{'last_position', 'occur' }
= ( $position, $block{occur}+1);
}
else {
# Print previous block
print "@block{'first_position', 'last_position', 'occur', 'class'}\n";
# Initialize new block
@block{'first_position', 'last_position', 'occur', 'class' }
= ( $position, $position, 1, $class );
}
}

# Print final block
print "@block{'first_position', 'last_position', 'occur', 'class'}\n";



__DATA__
1457 G G SAME
1979 G G SAME
2056 T T SAME
3091 A A SAME
3562 A G DIFF
3778 A A SAME
4124 T T SAME
4229 C T DIFF
4571 A G DIFF
5019 A C DIFF
5114 C C SAME
6291 T T SAME
6414 C C SAME
6553 C C SAME
6941 G G SAME

Good Luck,
Bill


GeneticsGirl
Novice

Aug 27, 2012, 10:44 AM

Post #13 of 13 (3436 views)
Re: [BillKSmith] Help outputting first and last positions of blocks of the same type [In reply to] Can't Post

Of course the code worked great for me! This is saving me a lot of time in my research - I'm so very grateful to everyone for their posts and suggestions - I'm picking up a lot from the snippets of code that have been put on here. Thank you Bill, Laurent and wickedxter - this has been a wonderful first experience for me entering a programming forum! So glad to have found somewhere I can go for help and advice as I'm trying to apply programs to speed up my research!

Thanks again!

Heather
GeneticsGirl

 
 


Search for (options) Powered by Gossamer Forum v.1.2.0

Web Applications & Managed Hosting Powered by Gossamer Threads
Visit our Mailing List Archives