Split table,Count and substitution character

 



Giffredo
Novice

Mar 21, 2014, 4:55 AM

Post #1 of 16 (16752 views)
Split table,Count and substitution character

Hi, I'm a beginner in the programming world. I'm studying Perl to write a script that I need in order to continue my work.

Problem: I have a table, and in the third column I have the characters "." or "," (the number and type vary a lot from line to line). I'd like to count how many dots and commas are present and replace them with those counts.

Ex. I have this "....,..,,,,,.." and I want "8 6"

Can someone give me a suggestion?


Laurent_R
Veteran / Moderator

Mar 21, 2014, 11:07 AM

Post #2 of 16 (16747 views)
Re: [Giffredo] Split table,Count and substitution character

The substitution operator returns the number of matches. You can have something like this (I cannot test anything right now):

Code
$_ = "....,..,,,,,.." ; 
my $nr_dots = s/\././g;
my $nr_commas = s/,/,/g;
# use $nr_dots and $nr_commas for whatever you need.



Laurent_R
Veteran / Moderator

Mar 21, 2014, 11:36 AM

Post #3 of 16 (16744 views)
Re: [Giffredo] Split table,Count and substitution character

Actually, you could also use the tr/// transliteration operator, which will be slightly faster (if that matters). This is an example demonstrated under the perl debugger:


Code
  DB<1> $_ = "....,..,,,,,..";
  DB<2> $nr_dots = tr/././;
  DB<3> print $nr_dots
8
  DB<4> $nr_commas = tr/,/,/;
  DB<5> p $nr_commas
6
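For anyone who prefers a script to a debugger session, the same counting can be sketched as a small subroutine. The name count_dots_commas is made up for illustration; it works on a copy of the string, and tr with an empty replacement list only counts, so nothing is modified.

```perl
use strict;
use warnings;

# Hypothetical helper: count '.' and ',' in a string without changing it.
sub count_dots_commas {
    my ($str) = @_;
    my $dots   = ($str =~ tr/.//);   # tr returns the number of matched characters
    my $commas = ($str =~ tr/,//);
    return ($dots, $commas);
}

my ($dots, $commas) = count_dots_commas("....,..,,,,,..");
print "$dots $commas\n";   # prints "8 6"
```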



Giffredo
Novice

Mar 22, 2014, 6:13 AM

Post #4 of 16 (16728 views)
Re: [Laurent_R] Split table,Count and substitution character

OK, thanks. I knew how to use the s command (I didn't know it was possible to use tr; maybe that is a good idea), but my problem is applying it to the table. I'll explain better.

My example is:

chrM 136 A 6 ,,..,, 896774
chrM 137 A 6 ,,,,,,.. ?=@===
chrM 138 A 6 ,,..,,,, ?=<=8<
chrM 139 C 6 ,,,..,,, 887801
chrM 140 C 6 ..,,, @>==9;
chrM 141 C 6 ,,,,,, ><==9;
chrM 142 T 6 ,,,,,, 747701

So I have to use s or tr only on the 6th column. That is my problem, and above all I want the script to work on all the lines.


FishMonger
Veteran / Moderator

Mar 22, 2014, 7:19 AM

Post #5 of 16 (16710 views)
Re: [Giffredo] Split table,Count and substitution character

I'm not sure I understand completely what you want to do.

As I understand it:
a) You have a file with 6 space-separated fields
b) You need to count the number of dots and commas in the 5th field
c) You need to concatenate those 2 values and put the result in the 6th (last) field


Code
#!/usr/bin/perl 

use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1, sep_char => ' ', eol => "\n" });

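while (my $line = $csv->getline(*DATA)) {
    my %cnt;
    # tr returns how many characters it matched; the field is left unchanged
    $cnt{'.'} = $line->[4] =~ tr/././;
    $cnt{','} = $line->[4] =~ tr/,/,/;
    # overwrite the 6th field with the two counts joined together
    $line->[5] = $cnt{'.'} . $cnt{','};
    $csv->print(*STDOUT, $line);
}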


__DATA__
chrM 136 A 6 ,,.......,, 896774
chrM 137 A 6 ,,,,,,.. ?=@===
chrM 138 A 6 ,,..,,,, ?=<=8<
chrM 139 C 6 ,,,..,,, 887801
chrM 140 C 6 ..,,, @>==9;
chrM 141 C 6 ,,,,,, ><==9;
chrM 142 T 6 ,,,,,, 747701



Results

Code
c:\test>test.pl 
chrM 136 A 6 ,,.......,, 74
chrM 137 A 6 ,,,,,,.. 26
chrM 138 A 6 ,,..,,,, 26
chrM 139 C 6 ,,,..,,, 26
chrM 140 C 6 ..,,, 23
chrM 141 C 6 ,,,,,, 06
chrM 142 T 6 ,,,,,, 06



Laurent_R
Veteran / Moderator

Mar 22, 2014, 3:44 PM

Post #6 of 16 (16610 views)
Re: [Giffredo] Split table,Count and substitution character

Well, the first step is obviously to isolate the field on which you need to work. Only then can you count the characters that you need. But as far as I can tell, you haven't given enough information about how to get to the right field, if that is what you are looking for. It seems likely that you want to split the input data on spaces, but you haven't really said so.

Please provide a sample of your input file and let us know what you need from it.
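To make the "isolate the field first" step concrete, here is a minimal sketch. It assumes the fields really are separated by whitespace and that the dots and commas sit in the 5th field, as in the sample posted earlier in the thread; the variable names are only illustrative.

```perl
use strict;
use warnings;

# One line copied from the sample posted above.
my $line = 'chrM 136 A 6 ,,..,, 896774';

# split ' ' splits on runs of whitespace and ignores leading whitespace.
my @fields = split ' ', $line;

# Perl arrays are 0-based, so the 5th field is index 4.
my $bases = $fields[4];

# Count only; tr with an empty replacement list leaves $bases unchanged.
my $dots   = ($bases =~ tr/.//);
my $commas = ($bases =~ tr/,//);
print "field: $bases  dots: $dots  commas: $commas\n";
```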


(This post was edited by Laurent_R on Mar 22, 2014, 3:47 PM)


Giffredo
Novice

Mar 23, 2014, 2:30 AM

Post #7 of 16 (16467 views)
Re: [Laurent_R] Split table,Count and substitution character

Very helpful Laurent_R!!!! The output is very similar to what I hoped for:

chrM 136 A 6 7:4
chrM 137 A 6 2:6
chrM 138 A 6 2:6
chrM 139 C 6 2:6
chrM 140 C 6 2:3
chrM 141 C 6 0:6
chrM 142 T 6 0:6

I posted a small extract of my input (the extract covers 136 to 142; my whole file runs from 1 to about 40000). Anyway, I pasted exactly what I copied from Unix. My intent is only to count the "." and "," because I need the numbers, not the runs of . or , to do a statistical test.

Can you add some comments to the code so that I can learn from it? For example, which part is the setup, which part isolates the right column, etc.


FishMonger
Veteran / Moderator

Mar 23, 2014, 8:42 AM

Post #8 of 16 (16394 views)
Re: [Giffredo] Split table,Count and substitution character

Is this last sample the input data or desired output?

You've posted multiple differing samples of input and haven't clearly explained which fields you need to keep, which ones you don't, and which one holds the data you want counted. That makes it difficult for us to provide the proper solution.


(This post was edited by FishMonger on Mar 23, 2014, 8:42 AM)


Giffredo
Novice

Mar 23, 2014, 10:49 AM

Post #9 of 16 (16357 views)
Re: [FishMonger] Split table,Count and substitution character

Sorry, I made a mistake because I got confused by the mail replies.

Anyway,

Input:

chrM 136 A 6 ,,.......,, 896774
chrM 137 A 6 ,,,,,,.. ?=@===
chrM 138 A 6 ,,..,,,, ?=<=8<
chrM 139 C 6 ,,,..,,, 887801
chrM 140 C 6 ..,,, @>==9;
chrM 141 C 6 ,,,,,, ><==9;
chrM 142 T 6 ,,,,,, 747701

Output that I would like:

chrM 136 A 6 7:4
chrM 137 A 6 2:6
chrM 138 A 6 2:6
chrM 139 C 6 2:6
chrM 140 C 6 2:3
chrM 141 C 6 0:6
chrM 142 T 6 0:6

(In the output I want one number for "." and one number for ",", separated by something. The symbols and numbers that appear in the last column of the input are not important to me.)

I am a biologist, and the input comes from samtools:
http://samtools.sourceforge.net/pileup.shtml

FishMonger, you wrote good code that fits my needs. But now I would like to understand it; if you have time to explain how it works, I will be grateful.

I hope it is clearer now.


Chris Charley
User

Mar 23, 2014, 2:23 PM

Post #10 of 16 (16300 views)
Re: [Giffredo] Split table,Count and substitution character

Hello Giffredo,

I will try to walk through FishMonger's code.


Code
use Text::CSV_XS; 

my $csv = Text::CSV_XS->new({ binary => 1, sep_char => ' ', eol => "\n" });


This loads the Text::CSV_XS module and creates a Text::CSV_XS object to parse and print the input lines. Text::CSV_XS splits each line on a single space on input (sep_char => ' ') and also uses a space to separate the columns on output (print).


Code
while (my $line = $csv->getline(*DATA))

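Here the program reads a line (from the *DATA filehandle; your program would use a filehandle for your input file instead) and parses it. getline returns a reference to an array holding the fields of that line, which is why the fields are accessed as $line->[4] and so on.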


Code
$cnt{'.'} = $line->[4] =~ tr/././;
$cnt{','} = $line->[4] =~ tr/,/,/;


This code uses the transliteration (tr) operator to count the periods and commas in the column $line->[4], which is the 5th field. The counts are stored in the %cnt hash.


Code
$line->[5] = $cnt{'.'} . $cnt{','};


This sets the 6th array item to the two counts joined together (array indexes start at 0, so index 5 is the 6th item).


Code
$csv->print(*STDOUT, $line);


This code uses the Text::CSV object to print (to the screen). To print to an output file instead, you would replace *STDOUT with your output filehandle.

This code gives the results:

Code
chrM 136 A 6 ,,.......,, 74  
chrM 137 A 6 ,,,,,,.. 26
chrM 138 A 6 ,,..,,,, 26
chrM 139 C 6 ,,,..,,, 26
chrM 140 C 6 ..,,, 23
chrM 141 C 6 ,,,,,, 06
chrM 142 T 6 ,,,,,, 06


To get your desired output, the code would need 1 or 2 minor changes.

Change

Code
$line->[5] = $cnt{'.'} . $cnt{','};

To

Code
splice @$line, 4, 2, "$cnt{'.'}:$cnt{','}";
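As a small illustration of what that splice call does, here is the same operation on a plain array rather than on the array reference returned by getline. The array contents and the counts are copied from the first sample row; the variable names are only for the example.

```perl
use strict;
use warnings;

# Fields of the first sample row, and the counts for its 5th field.
my @line = ('chrM', 136, 'A', 6, ',,.......,,', '896774');
my %cnt  = ('.' => 7, ',' => 4);

# Starting at index 4, remove 2 elements and insert one string in their place.
splice @line, 4, 2, "$cnt{'.'}:$cnt{','}";

print join(' ', @line), "\n";   # prints "chrM 136 A 6 7:4"
```

With an array reference, as in FishMonger's code, the array is written @$line instead of @line; the rest of the call is the same.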



(This post was edited by Chris Charley on Mar 23, 2014, 4:02 PM)


Giffredo
Novice

Mar 24, 2014, 2:30 AM

Post #11 of 16 (16111 views)
Re: [Chris Charley] Split table,Count and substitution character

Hello Chris!

thx!

Can you explain this part in more detail:

while (my $line = $csv->getline(*DATA))

I didn't understand the use of getline.


Chris Charley
User

Mar 24, 2014, 12:24 PM

Post #12 of 16 (15958 views)
Re: [Giffredo] Split table,Count and substitution character

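That reads the file one line at a time. Each call to getline reads the next line and returns its fields; when there are no more lines it returns undef, which ends the while loop.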

Just to show how to do it without the Text::CSV module.

Code
#!/usr/bin/perl 
use strict;
use warnings;

open my $in, "<", 'junk.txt' or die $!;
open my $out, ">", 'new.txt' or die $!;

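while (<$in>) {
    my @line = split;    # split the current line ($_) on whitespace
    my %cnt;
    # count dots and commas in the 5th field (index 4)
    $cnt{'.'} = $line[4] =~ tr/.//;
    $cnt{','} = $line[4] =~ tr/,//;
    # replace fields 5 and 6 with a single "dots:commas" field
    splice @line, 4, 2, "$cnt{'.'}:$cnt{','}";
    print $out join(" ", @line), "\n";
}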
close $in or die $!;
close $out or die $!;

__END__
contents of 'junk.txt'

chrM 136 A 6 ,,.......,, 896774
chrM 137 A 6 ,,,,,,.. ?=@===
chrM 138 A 6 ,,..,,,, ?=<=8<
chrM 139 C 6 ,,,..,,, 887801
chrM 140 C 6 ..,,, @>==9;
chrM 141 C 6 ,,,,,, ><==9;
chrM 142 T 6 ,,,,,, 747701

contents of 'new.txt'

chrM 136 A 6 7:4
chrM 137 A 6 2:6
chrM 138 A 6 2:6
chrM 139 C 6 2:6
chrM 140 C 6 2:3
chrM 141 C 6 0:6
chrM 142 T 6 0:6

The docs for splice are at http://perldoc.perl.org/functions/splice.html


Giffredo
Novice

Mar 25, 2014, 4:56 AM

Post #13 of 16 (15749 views)
Re: [Chris Charley] Split table,Count and substitution character

OK, your code works well (in fact, Unix told me that I don't have the Text::CSV module).

Can you explain the code to me, and in particular this line:

open my $out, ">", 'new.txt' or die $!;

I'd like to be able to change the code as new needs come up.
For example, with this input:

chrM 982 A 14 .....+2TG.......,, 69=9;<><5<==79
chrM 983 G 14 ............,, 69<95;>=3=:879
chrM 984 T 14 ............,, 84<9=<>=3=<:.:
chrM 985 A 14 ............,, 84=7=:<=6=<857
chrM 986 G 15 ............,,, 74<7;<<>5=<<.1:
chrM 987 A 15 ............,,, 7:<6<=<=9@<<204
chrM 988 A 14 ............,, 4285586528787:
chrM 989 G 14 ............,, 97<;8><;9=<<76
chrM 990 A 15 .....-1C....-1C...,,, 76<=5=;<9=<<4.7
chrM 991 C 14 ....*..*...,,, 36=;0=96=:<4=.
chrM 992 T 13 ...........,, 36=<0=96=;;07
chrM 993 T 12 .........,,, 09/74/6506>.
chrM 994 A 13 ..........,,, 36>5<958;87;.
chrM 995 T 14 ...........,,, 3<;5<7;57=37<.
chrM 996 A 13 ..........,,, 9;<<8<98:56<.
chrM 997 G 13 ..........,,, 9:5;7<5:838<7
chrM 998 A 13 ..........,,, 9:5<==5:8=845
chrM 999 T 13 ..........,,, ;;5;=?4:7<631
chrM 1000 T 12 .........,,, 60/4585.786;
chrM 1001 A 13 ..........,,, ;74::=297<212
chrM 1002 A 13 ..........,,, ?74:;<2=7;?<<
chrM 1003 A 13 ..........,,, ?74::<2=7;?<<
chrM 1004 A 10 .......,,, 2.301/5@;<
chrM 1005 T 13 ..........,,, 644:7;1:6;766
chrM 1006 T 12 .........,,, 2567/37389;;
chrM 1007 C 13 ....$.$.$....,,, 6<;=5:3:8B9;;
chrM 1008 T 10 .......,,, 8=;9<7>;78
chrM 1009 C 10 .......,,, 8>;7>9B4/0
chrM 1010 C 9 ...$.$.$.$,,, 861159:56
chrM 1011 A 8 ...,,,^S.^S. .=;5/0?=

How can I count the "." and "," while keeping the symbols in the middle? For example, in the first row I want to keep the +2TG.

That is, transform the row into something like:

chrM 982 A 14 12:2 +2TG
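One possible way to get that kind of output, sketched under assumptions: every character in the 5th field that is not a dot or a comma is treated as a "symbol to keep", with no attempt to interpret the pileup notation, and the last column is dropped as in the earlier output. The helper name summarize_field is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns "dots:commas" plus whatever characters remain
# once the dots and commas are removed.
sub summarize_field {
    my ($field) = @_;
    my $dots   = ($field =~ tr/.//);   # count only, string unchanged
    my $commas = ($field =~ tr/,//);
    (my $rest = $field) =~ tr/.,//d;   # copy, then delete dots and commas
    return ("$dots:$commas", $rest);
}

my @line = split ' ', 'chrM 982 A 14 .....+2TG.......,, 69=9;<><5<==79';
my ($counts, $rest) = summarize_field($line[4]);

# Replace fields 5 and 6 with the counts, plus the leftover symbols if any.
splice @line, 4, 2, $counts, (length($rest) ? $rest : ());
print join(' ', @line), "\n";   # prints "chrM 982 A 14 12:2 +2TG"
```

Rows with nothing but dots and commas come out exactly as before (for example "chrM 983 G 14 12:2"), because the empty leftover string is not added as a field.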


FishMonger
Veteran / Moderator

Mar 25, 2014, 6:40 AM

Post #14 of 16 (15721 views)
Re: [Giffredo] Split table,Count and substitution character

I'm not a biologist so I can't begin to understand the meaning of your data but before we spend a lot of time parsing your data, you may want to look over the related CPAN modules to see if they have already solved this parsing issue.

http://search.cpan.org/search?query=pileup&mode=all


Giffredo
Novice

Mar 25, 2014, 7:34 AM

Post #15 of 16 (15708 views)
Re: [FishMonger] Split table,Count and substitution character

It is a useful link if someone wants an alternative way (without using samtools) to do mapping, alignment, BLAST, etc., but not for my problem, which is not a biological problem (I want to change and group data in a very big table, that's all).
I would like to understand the code; it is not necessary for someone to give me the result. I need this Perl script, but I am more interested in getting into the Perl mindset so that I can solve my problems by myself.


Laurent_R
Veteran / Moderator

Mar 25, 2014, 11:16 AM

Post #16 of 16 (15642 views)
Re: [Giffredo] Split table,Count and substitution character


In Reply To
Can you explain the code to me, and in particular this line:

open my $out, ">", 'new.txt' or die $!;


This opens the file "new.txt" for output (">" means for output; if the file already exists, its contents are overwritten). The $out variable is the filehandle with which you'll be able to write data into this file. The "or die" means that the program will abort if it turns out to be impossible to open the file, and "$!" will give the reason why opening the file failed.
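A short sketch of the whole cycle may help: open for writing, print through the filehandle, close, then open the same file for reading. The scratch directory from File::Temp is an assumption added here so the example does not overwrite a real new.txt; in your own script you would just use your file name.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Assumption: write into a temporary directory that is removed on exit.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/new.txt";

# ">" creates the file, or empties it first if it already exists.
open my $out, '>', $file or die "Cannot open $file for writing: $!";
print $out "chrM 136 A 6 7:4\n";
close $out or die "Cannot close $file: $!";

# "<" opens the same file for reading, to check what was written.
open my $in, '<', $file or die "Cannot open $file for reading: $!";
my $first_line = <$in>;
close $in or die "Cannot close $file: $!";

print $first_line;   # prints "chrM 136 A 6 7:4"
```

Putting the file name into the die message, next to $!, makes it easier to see which open failed when a script handles several files.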

But I think you should grab a good tutorial or, perhaps better, buy a book on learning Perl, such as "Learning Perl", published by O'Reilly.

 
 

