Extracting Data from a File and Tabulating It

 



manchester
New User

Feb 25, 2013, 5:38 PM

Post #1 of 7 (742 views)
Extracting Data from a File and Tabulating It

Hello friends,

First-timer here. I am trying to write a script that will extract data from four files and put it together in a table. Each of the four INPUT files contains names of specific things with a number next to each. I want to extract the names from the four files one by one and put them in a column, then create a new column for each file with the corresponding number on each row.

For example, my four files are: ONE, TWO, THREE, FOUR

ONE:
alpha 3
bravo 2
charlie 1

TWO:
alpha 6
charlie 9
delta 2

THREE:
bravo 2
delta 4

FOUR:
charlie 4
echo 1

Thus, what I am trying to do is to get a final result that looks sort of like this:

NAME ONE TWO THREE FOUR
===========================
alpha .....3 ......6 ...........0 .........0
bravo .....2 ......0 ...........2 .........0
charlie ....1 ....9 ............0 .........4
delta .....0 ......2 ...........4 .........0
echo .....0 ......0 ...........0 .........1


I am unsure how to go about extracting the names from the files, printing the corresponding numbers in the same row, and putting in a zero where a name does not occur in a file.

Thank you for your time.


BillKSmith
Veteran

Feb 25, 2013, 6:42 PM

Post #2 of 7 (727 views)
Re: [manchester] Extracting Data from a File and Tabulating It

Use a regex to parse each line (alternatively, split on whitespace). Store the data in a hash of arrays (see perldoc perldsc). Use each name as a hash key; the corresponding value is a reference to an array holding that name's numbers. Use the file number as the array index.

Change undefined array elements to zero, then print your data structure.
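
The outline above could be sketched roughly as follows. This is only an illustration, not Bill's code: the sub name tabulate is made up here, and in-memory lists of lines stand in for the four files so that the sketch runs without them. It assumes Perl 5.10 or later for the //= operator.

```perl
use strict;
use warnings;

# Hypothetical helper: build a hash of arrays from per-file line lists.
# $sources is an array ref; element N holds the lines of file number N.
sub tabulate {
    my ($sources) = @_;
    my %counts;    # name => [ count in file 0, count in file 1, ... ]
    for my $file_num ( 0 .. $#$sources ) {
        for my $line ( @{ $sources->[$file_num] } ) {
            my ( $name, $count ) = split ' ', $line;    # split on whitespace
            next unless defined $count;                 # skip blank or short lines
            $counts{$name}[$file_num] = $count;         # file number is the array index
        }
    }
    # Change undefined array elements to zero.
    for my $row ( values %counts ) {
        $row->[$_] //= 0 for 0 .. $#$sources;
    }
    return \%counts;
}

# Sample data from the first post, standing in for files ONE..FOUR.
my @sources = (
    [ 'alpha 3', 'bravo 2',   'charlie 1' ],
    [ 'alpha 6', 'charlie 9', 'delta 2' ],
    [ 'bravo 2', 'delta 4' ],
    [ 'charlie 4', 'echo 1' ],
);

my $counts = tabulate( \@sources );
for my $name ( sort keys %$counts ) {
    print join( "\t", $name, @{ $counts->{$name} } ), "\n";
}
```

Reading real files would only change where the lines come from; the structure of the hash stays the same.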

Give it a try. Show us what you have.
Good Luck,
Bill


Chris Charley
User

Feb 25, 2013, 7:29 PM

Post #3 of 7 (724 views)
Re: [manchester] Extracting Data from a File and Tabulating It

I found using a hash of hashes data structure worked best for me. I made the assumption that any name occurred only once in any file.

Code
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw/ max /;

my @files = qw/ o33.txt o44.txt o55.txt o66.txt /;

# Hash of hashes: $data{name}{file} = count
my %data;
for my $file (@files) {
    open my $fh, "<", $file or die "Unable to open '$file'. $!";
    while (<$fh>) {
        my ($name, $count) = split;    # split $_ on whitespace
        $data{$name}{$file} = $count;
    }
    close $fh or die "Unable to close '$file'. $!";
}

my @names = sort keys %data;

# Column widths: the longest name or file name, plus one space
my $name_len = 1 + max map length, @names;
my $file_len = 1 + max map length, @files;

my $format = "%-${name_len}s" . "%${file_len}s" x @files . "\n";

printf $format, 'Name', @files; # print header

for my $name (@names) {
    # A name missing from a file gets 0
    printf $format, $name, map $data{$name}{$_} || 0, @files;
}


This produced this output.

Code
   
C:\Old_Data\perlp>perl t11.pl
Name     o33.txt o44.txt o55.txt o66.txt
alpha          3       6       0       0
bravo          2       0       2       0
charlie        1       9       0       4
delta          0       2       4       0
echo           0       0       0       1


Here is the structure of the hash of hashes.

Code
$VAR1 = {
    'delta' => {
        'o44.txt' => '2',
        'o55.txt' => '4'
    },
    'alpha' => {
        'o44.txt' => '6',
        'o33.txt' => '3'
    },
    'bravo' => {
        'o33.txt' => '2',
        'o55.txt' => '2'
    },
    'charlie' => {
        'o66.txt' => '4',
        'o44.txt' => '9',
        'o33.txt' => '1'
    },
    'echo' => {
        'o66.txt' => '1'
    }
};



(This post was edited by Chris Charley on Feb 26, 2013, 5:33 AM)


Kenosis
User

Feb 25, 2013, 9:12 PM

Post #4 of 7 (712 views)
Re: [manchester] Extracting Data from a File and Tabulating It

I also used the same hash structure as Chris Charley, but the following is run from the command line and uses Text::Table:

Code
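use strict;
use warnings;
use Text::Table;

# Save the file names from @ARGV; the other variables start out empty
my ( @files, %names, @row, @rows ) = @ARGV;

# <> reads every file named on the command line; $ARGV is the current file
while (<>) {
    $names{$1}{$ARGV} = $2 if /(\w+)\s+(\w+)/;
}

# Build one row per name: the name, then its count in each file (0 if absent)
for my $name ( sort keys %names ) {
    push @row, $name;
    push @row, $names{$name}{$_} // 0 for @files;
    push @rows, [ splice @row, 0, @row ];
}

my $tb = Text::Table->new( 'Name', @files );
$tb->load(@rows);
print $tb;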

Usage: perl script.pl ONE TWO THREE FOUR [>outFile]

Output:

Code
Name    ONE TWO THREE FOUR 
alpha 3 6 0 0
bravo 2 0 2 0
charlie 1 9 0 4
delta 0 2 4 0
echo 0 0 0 1

The file names are in @ARGV and are saved in @files for later use. The while (<>) notation reads all the files consecutively, and the current file's name is in $ARGV. A regex captures each line's name and number. Text::Table takes an array of arrays (AoA) for the table rows, so the for loop builds one. The // "defined-or" operator returns either the number associated with the name/file or 0. The square brackets [ ] denote an anonymous array, and the splice inside them moves all the elements out of @row to become the anonymous array's elements. Finally, a Text::Table object is created and initialized with the heading information, the row data is loaded, and the table is printed.
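
The two idioms described above, the // default and the splice into an anonymous array, can be tried in isolation. A small sketch with made-up data (the %count hash and its values are assumptions for illustration, and // needs Perl 5.10 or later):

```perl
use strict;
use warnings;

my %count = ( alpha => 3, bravo => 0 );

# "//" falls back to the right-hand side only when the left side is undefined.
my $present = $count{alpha}   // 0;    # key exists, value kept
my $zero    = $count{bravo}   // 0;    # value 0 is defined, so it is kept
my $missing = $count{charlie} // 0;    # no such key, so the default is used

# splice @row, 0, @row removes every element of @row and returns them,
# so the brackets wrap the old contents in a new anonymous array
# and @row is left empty for the next pass.
my @row = ( 'alpha', 3, 6, 0, 0 );
my @rows;
push @rows, [ splice @row, 0, @row ];

print "elements left in \@row: ", scalar(@row), "\n";
print "first saved row: @{ $rows[0] }\n";
```

Pushing [ @row ] and then clearing @row would have the same effect; the splice simply does both in one step.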


(This post was edited by Kenosis on Feb 25, 2013, 9:56 PM)


manchester
New User

Feb 28, 2013, 8:32 AM

Post #5 of 7 (675 views)
Re: [BillKSmith] Extracting Data from a File and Tabulating It

Thank you for your help. The output doesn't seem to include all the files I input. This is what I have:

Code
#/usr/bin/perl -w; 
use strict;

my %microRNA_read_count_1 = ( );
my %microRNA_read_count_2 = ( );
my %microRNA_read_count_3 = ( );
my %microRNA_read_count_4 = ( );
# defines the hash to store microRNA and corresponding read counts from each of the four files

open(INFILE,$ARGV[0]);
# Opens the input file, specified in the first argument
for(<INFILE>){
# Reads the input file line by line
my @row = split(/\t/,$_);
#Splits the row by tab and stores it in an array
my $mir = $row[0];
# defines the scalar with the microRNA names
my $count = $row[1];
# defines the scalar with the read counts
$microRNA_read_count_1{$mir} = $count;
#puts the microRNA and corresponding read counts and stores it in the hash
}
close INFILE;
# Close the input file
# ========================================
open(INFILE,$ARGV[1]);
# Opens the input file, specified in the first argument
for(<INFILE>){
# Reads the input file line by line
my @row = split(/\t/,$_);
# Splits the row by space and stores it in an array
my $mir = $row[0];
# defines the scalar with the microRNA names
my $count = $row[1];
# defines the scalar with the read counts
$microRNA_read_count_2{$mir} = $count;
# puts the microRNA and corresponding read counts in the hash
}
close INFILE;
# Close the input file
# ========================================
open(INFILE,$ARGV[2]);
# Opens the input file, specified in the first argument
for(<INFILE>){
# Reads the input file line by line
my %microRNA_read_count_3 = ( );
# defines the hash to store microRNA and corresponding read counts
my @row = split(/\t/,$_);
#Splits the row by space and stores it in an array
my $mir = $row[0];
# defines the scalar with the microRNA names
my $count = $row[1];
# defines the scalar with the read counts
$microRNA_read_count_3{$mir} = $count;
#puts the microRNA and corresponding read counts in hash
}
close INFILE;
# Close the input file
# ========================================
open(INFILE,$ARGV[3]);
# Opens the input file, specified in the first argument
for(<INFILE>){
# Reads the input file line by line
my @row = split(/\t/,$_);
#Splits the row by space and stores it in an array
my $mir = $row[0];
# defines the scalar with the microRNA names
my $count = $row[1];
# defines the scalar with the read counts
$microRNA_read_count_4{$mir} = $count;
#puts the microRNA and corresponding read counts in hash
}
close INFILE;
# Close the input file
# ========================================
open(INFILE,$ARGV[4]);
# Opens the input file, specified in the first argument
for(<INFILE>){
# Reads the input file line by line
chomp $_;
print $_ ."\t". $microRNA_read_count_1{$_} ."\t". $microRNA_read_count_2{$_} ."\t". $microRNA_read_count_3{$_} ."\t". $microRNA_read_count_4{$_} ."\n";
}
close INFILE;
# Close the input file
exit 0;
# Exits the program


Is my syntax wrong?

Thanks in advance. I really appreciate any advice you may offer.


BillKSmith
Veteran

Feb 28, 2013, 12:38 PM

Post #6 of 7 (662 views)
Re: [manchester] Extracting Data from a File and Tabulating It

In your sample files, the separator was a space not a tab.
Other than that, you are parsing correctly.

You should only have one hash. It should be a hash of arrays.

You should process the files in a loop. It is poor practice to repeat the code for each one.

In your original post, there were only four input files. Your new code reads five. I do not understand the purpose of the additional file. (It is clearly much different from the others.)

The following code solves your original problem. The main loop should work for any valid data. The display routine is tailored to the posted data. Data::Dumper is included to show output data that does not display properly. You may have to change the \s in the parser's regex to a \r.

Code
use strict;
use warnings;
use Carp;
use Data::Dumper;
my @FILE_NAMES;
# Set Default file names
$FILE_NAMES[0] = $ARGV[0] // 'ONE';
$FILE_NAMES[1] = $ARGV[1] // 'TWO';
$FILE_NAMES[2] = $ARGV[2] // 'THREE';
$FILE_NAMES[3] = $ARGV[3] // 'FOUR';

my %microRNA_read_count;
my $file_num = -1;
CHOOSE_FILE:
foreach my $file_name (@FILE_NAMES) {
    $file_num++;
    open my $INPUT_FILE, '<', $file_name or die "Cannot open $file_name:$!";
    STORE_LINES:
    while (my $line = <$INPUT_FILE>) {
        my ($mir, $count) = parse( $line, $file_name );
        if (!exists $microRNA_read_count{$mir}) {
            $microRNA_read_count{$mir} = [(0) x @FILE_NAMES]; # Set defaults
        }
        $microRNA_read_count{$mir}[$file_num] = $count; # Store the line
    }
    close $INPUT_FILE;
}
print Dumper \%microRNA_read_count;
display( \@FILE_NAMES, \%microRNA_read_count );


sub parse {
    my ($line, $file_name) = @_;
    my ($mir, $count) = $line =~ m/^(\w+) \s (\d+) \s* $/xms;
    if (!defined $mir){ # validate line
        croak "Invalid input line $. of file $file_name\n";
    }
    return $mir, $count;
}

sub display{
    my($file_names_ref, $microRNA_read_count_ref) = @_;
    printf "\n\nNAME %3s %3s %5s %4s\n", @$file_names_ref;
    printf '='x (9+4*@$file_names_ref) . "\n";
    while (my ($mir, $counts_ref) = each %$microRNA_read_count_ref) {
        $mir .= '.' x (7-length($mir)); # Pad all names to 7 characters
        printf "%-7s"."....%1d" x @$file_names_ref . "\n",
            $mir,
            @$counts_ref;
    }
}
__END__

Good Luck,
Bill


Chris Charley
User

Feb 28, 2013, 3:57 PM

Post #7 of 7 (654 views)
Re: [manchester] Extracting Data from a File and Tabulating It

I 'cleaned up' your code - there was only 1 obvious error I saw.

my %microRNA_read_count_3 = ( );

That line creates a new, empty hash on every pass through the loop. It hides the hash declared at the top of the script, so the counts from the third file never reach the hash you print from. Just eliminate that line.
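
The effect of a my declaration inside the loop can be seen in a small stand-alone sketch. The two sample records and the hash names %shadowed and %kept are made up for illustration. Note that in this sketch the outer hash ends up with no entries at all, because the hash declared inside the loop is a separate variable.

```perl
use strict;
use warnings;

my @lines = ( "alpha\t3", "bravo\t2" );    # made-up sample records

# With "my" inside the loop: a new hash is created on every pass and
# hides the outer one, which therefore never receives any data.
my %shadowed;
for my $line (@lines) {
    my %shadowed = ();                     # the problem line
    my ( $mir, $count ) = split /\t/, $line;
    $shadowed{$mir} = $count;              # stored in the inner hash only
}

# Without it: every record lands in the single outer hash.
my %kept;
for my $line (@lines) {
    my ( $mir, $count ) = split /\t/, $line;
    $kept{$mir} = $count;
}

print "with inner my:    ", scalar( keys %shadowed ), " names\n";
print "without inner my: ", scalar( keys %kept ),     " names\n";
```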

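Instead of a for loop, it is better to use a while loop to read a file: for (<INFILE>) reads the whole file into a list before looping, whereas while (<INFILE>) reads one line at a time. See your code below with those changes.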

Just to repeat what Bill Smith said, you should probably not use 4 hashes - just 1 and possibly use an array as he described.

Code
#!/usr/bin/perl
use strict;
use warnings;

my %microRNA_read_count_1 = ( );
my %microRNA_read_count_2 = ( );
my %microRNA_read_count_3 = ( );
my %microRNA_read_count_4 = ( );

open(INFILE,$ARGV[0]) or die $!;
while (<INFILE>){
    chomp;
    my @row = split /\t/;
    my $mir = $row[0];
    my $count = $row[1];
    $microRNA_read_count_1{$mir} = $count;
}
close INFILE or die $!;

open(INFILE,$ARGV[1]) or die $!;
while (<INFILE>){
    chomp;
    my @row = split /\t/;
    my $mir = $row[0];
    my $count = $row[1];
    $microRNA_read_count_2{$mir} = $count;
}
close INFILE or die $!;

open(INFILE,$ARGV[2]) or die $!;
while (<INFILE>){
    chomp;
    my @row = split /\t/;
    my $mir = $row[0];
    my $count = $row[1];
    $microRNA_read_count_3{$mir} = $count;
}
close INFILE or die $!;

open(INFILE,$ARGV[3]) or die $!;
while (<INFILE>){
    chomp;
    my @row = split /\t/;
    my $mir = $row[0];
    my $count = $row[1];
    $microRNA_read_count_4{$mir} = $count;
}
close INFILE or die $!;

open(INFILE,$ARGV[4]) or die $!;
while (<INFILE>){
    # Reads the input file line by line
    chomp;
    #print $_ ."\t". $microRNA_read_count_1{$_} ."\t". $microRNA_read_count_2{$_} ."\t". $microRNA_read_count_3{$_} ."\t". $microRNA_read_count_4{$_} ."\n";
    print join("\t", $_,
        $microRNA_read_count_1{$_} || 0,
        $microRNA_read_count_2{$_} || 0,
        $microRNA_read_count_3{$_} || 0,
        $microRNA_read_count_4{$_} || 0), "\n";
}
close INFILE or die $!;

Expressions like $microRNA_read_count_2{$_} || 0 evaluate to zero when that name was not found in the hash you created.
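
The four near-identical blocks could also be folded into one loop over the file arguments, in line with the advice not to repeat the code per file. A rough sketch, assuming tab-separated input and the same argument order as the code above (four data files, then the file of names). The sub name load_counts is made up for this example, it keeps one hash per file in an array rather than a single hash of arrays, and it uses a three-argument open with a lexical filehandle instead of the bareword INFILE.

```perl
use strict;
use warnings;

# Hypothetical helper: read one tab-separated file into a hash ref
# of name => count.
sub load_counts {
    my ($path) = @_;
    my %table;
    open my $fh, '<', $path or die "Cannot open '$path': $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $mir, $count ) = split /\t/, $line;
        next unless defined $count;    # skip blank or malformed lines
        $table{$mir} = $count;
    }
    close $fh or die "Cannot close '$path': $!";
    return \%table;
}

# Usage as in the code above: four data files, then the file of names.
if ( @ARGV == 5 ) {
    my $names_file = pop @ARGV;
    my @tables     = map { load_counts($_) } @ARGV;    # one hash per data file

    open my $names, '<', $names_file or die "Cannot open '$names_file': $!";
    while ( my $name = <$names> ) {
        chomp $name;
        print join( "\t", $name, map { $_->{$name} || 0 } @tables ), "\n";
    }
    close $names or die "Cannot close '$names_file': $!";
}
```

Adding a fifth data file then means passing one more argument rather than copying another block.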


(This post was edited by Chris Charley on Feb 28, 2013, 8:07 PM)

 
 

