Reading multiple files and pulling certain columns

 



davidcassidy22
Novice

May 28, 2015, 11:20 AM

Post #1 of 16 (6539 views)

Hi all, Perl newbie here. I need some help with reading multiple files and ultimately printing certain columns from those files into a new file. What I have is:

file 1:
x1 y1 z1
x2 y2 z2
x3 y3 z3
x4 y4 z4

file 2:
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4

What I want my program to do is produce something like:

file 3:
a1 z1 x1 y1
a2 z2 x2 y2
a3 z3 x3 y3
a4 z4 x4 y4

My trouble is that I can read a single file and print the columns I want, but when I then open the second file and print the other columns, the data ends up below the first column instead of next to it. So I am thinking that I have to read multiple files simultaneously?

I'm getting something like this:

file 3:
a1
a2
a3
a4
z1 x1 y1
z2 x2 y2
z3 x3 y3
z4 x4 y4

Any help is greatly appreciated!


FishMonger
Veteran / Moderator

May 28, 2015, 11:45 AM

Post #2 of 16 (6537 views)
Re: [davidcassidy22] Reading multiple files and pulling certain columns

Post your code.


davidcassidy22
Novice

May 28, 2015, 12:04 PM

Post #3 of 16 (6535 views)
Re: [FishMonger] Reading multiple files and pulling certain columns

I worked on it a bit more and seemed to make it worse.


Code
#!/usr/bin/perl -w 
use strict;

=cut
my$ir_path = "\/home\/master\/files";
my$intab = "$ir_path\/pre\_irsa\/MSXlist\.txt";
my$intab2 = "$ir_path\/data\/msxmaster\_2mass\_data\.txt";
my$outtab = "$ir_path\/data\/master\.txt";

die " FILE $intab NOT FOUND\!\n" if (! -f $intab) ;
die " FILE $intab2 NOT FOUND\!\n" if (! -f $intab2);

unlink ("$outtab") if (-e $outtab);

open INT , "$intab" or die "Cannot open file $intab" ;
open INT2, "$intab2" or die "Cannot open file $intab2";
open OUTT, ">$outtab" or die "Cannot open file $outtab";
my $kk=0;
my @name;
while (<INT>) {
next if (! m/^\s+/);
$name[$kk] = (split /\s/)[0];
#print "$name[$kk]\n";
$kk=$kk+1;
}

my $nn=0;
my @ra;
my @dec;
my @l;
my @b;
my @j;
my @h;
my @k;
my @jh;
my @hk;
my @jk;
while (<INT2>) {
next if (! m/^\s+/);
($ra[$nn],$dec[$nn],$l[$nn],$b[$nn],$j[$nn],$h[$nn],$k[$nn]) = (split /\s/)[7,8,34,35,13,17,21];
#print "$l[$nn]\n";
$jh[$nn] = $j[$nn]-$h[$nn];
$hk[$nn] = $h[$nn]-$k[$nn];
$jk[$nn] = $j[$nn]-$k[$nn];
$jh[$nn] = substr($jh[$nn],0,8);
$hk[$nn] = substr($hk[$nn],0,8);
$jk[$nn] = substr($jk[$nn],0,8);
$nn=$nn+1;
}

for (my $ii = 0;$ii<32067; $ii++) {
printf OUTT "%-10s %-10s %-10s %-10s %-10s %-10s %-10s %-10s %-10s %-10s %-10s\n", $name[$ii],$ra[$ii],$dec[$ii],$l[$ii],$b[$ii],$j[$ii],$h[$ii],$k[$ii],$jh[$ii],$hk[$ii],$jk[$ii];
}


close OUTT;
close INT;
close INT2;
print "Done\, woot woot\n";
exit(0);


It isn't grabbing some of the columns and it's printing them in wild areas now haha


FishMonger
Veteran / Moderator

May 28, 2015, 12:28 PM

Post #4 of 16 (6528 views)
Re: [davidcassidy22] Reading multiple files and pulling certain columns


Quote
It isn't grabbing some of the columns and it's printing them in wild areas now haha

That's a very poor problem statement.

Which columns is it not grabbing?
What do you mean by "wild areas"?

Please post sample lines from both input files that demonstrate the problem so that I can test your script.

FYI, your script is poorly written. I'll point out some of the issues as we proceed.


davidcassidy22
Novice

May 28, 2015, 12:48 PM

Post #5 of 16 (6525 views)
Re: [FishMonger] Reading multiple files and pulling certain columns

As I stated in the beginning of the thread, I'm not very well versed in perl, but I suppose I could have worded that better. It is getting correct values for some lines, but wrong values most of the time. Some columns of data aren't even showing up which makes me think that it is not grabbing them at all.

Here are a few lines from each file.

File 1:

Code
 
ad3a-00001; RA=17; Equatorial;J2000; 17:28:17.928; -34:59:35.88; LSRK;Radio; 0.0;## ad3a-00001 17:28:17.928 -34:59:35.88 G 352.9196 -00.1587 MSX AD 0.30157
ad3a-00002; RA=17; Equatorial;J2000; 17:38:54.912; -34:59:28.68; LSRK;Radio; 0.0;## ad3a-00002 17:38:54.912 -34:59:28.68 G 354.1036 -01.9825 MSX AD -0.23049
ad3a-00003; RA=17; Equatorial;J2000; 17:28:32.232; -34:59:18.96; LSRK;Radio; 0.0;## ad3a-00003 17:28:32.232 -34:59:18.96 G 352.9505 -00.1967 MSX AD 0.16257
ad3a-00004; RA=17; Equatorial;J2000; 17:31:06.456; -34:59:18.96; LSRK;Radio; 0.0;## ad3a-00004 17:31:06.456 -34:59:18.96 G 353.2409 -00.6358 MSX AD 0.37855
ad3a-00005; RA=17; Equatorial;J2000; 17:11:04.392; -34:59:05.28; LSRK;Radio; 0.0;## ad3a-00005 17:11:04.392 -34:59:05.28 G 350.9055 02.7381 MSX AD -0.26616
ad3a-00006; RA=17; Equatorial;J2000; 17:46:50.208; -34:58:51.60; LSRK;Radio; 0.0;## ad3a-00006 17:46:50.208 -34:58:51.60 G 354.9645 -03.3584 MSX AD -0.34597
ad3a-00007; RA=17; Equatorial;J2000; 17:29:51.816; -34:58:43.68; LSRK;Radio; 0.0;## ad3a-00007 17:29:51.816 -34:58:43.68 G 353.1090 -00.4177 MSX AD 0.07159


File 2:

Code
 
1 1.351956 -122.178809 262.074700 -34.993300 5.000000 262.074312 -34.993500 0.06 0.06 90 17281783-3459366 13.260 0.018 0.021 46.6 9.327 0.020 0.021 547.1 7.219 0.017 0.022 3046.6 AAA 221 111 000 665556 11.5 187563838 0 0 352.919 -0.159 3.9330 2.1080 6.0410
2 1.239526 -75.019874 264.728800 -34.991300 5.000000 264.728394 -34.991211 0.29 0.29 90 17385481-3459283 4.199 0.194 0.194 176941.6 2.888 0.192 0.192 205925.4 2.273 0.246 0.246 212432.0 CCD 333 111 d00 060505 23.2 163209034 0 0 354.104 -1.982 1.3110 0.6150 1.9260
3 1.659672 -125.696439 262.134300 -34.988600 5.000000 262.133843 -34.988869 0.06 0.06 90 17283212-3459199 14.215 0.031 0.033 19.3 10.621 0.020 0.021 172.0 8.439 0.019 0.021 924.3 AAA 222 111 000 666655 6.3 187600017 0 0 352.950 -0.197 3.5940 2.1820 5.7760
4 1.822914 -50.951454 262.776900 -34.988600 5.000000 262.776420 -34.988281 0.08 0.08 90 17310634-3459178 17.407 null null null 12.735 0.022 0.023 18.0 9.860 0.023 0.024 196.1 UAA 022 011 000 005555 9.4 188004848 0 0 353.241 -0.635 - 2.8750 -
5 0.334036 -27.903496 257.768300 -34.984800 5.000000 257.768247 -34.984718 0.08 0.08 90 17110437-3459049 6.166 0.017 0.024 55722.4 4.889 0.051 0.053 88778.9 4.204 0.033 0.036 124399.9 AEE 111 111 00d 662615 15.2 20623640 0 0 350.906 2.738 1.2770 0.6850 1.9620



FishMonger
Veteran / Moderator

May 28, 2015, 12:56 PM

Post #6 of 16 (6521 views)
Re: [davidcassidy22] Reading multiple files and pulling certain columns

Does file 2 actually have those leading spaces on each line, or was that an error when posting?


davidcassidy22
Novice

May 28, 2015, 1:09 PM

Post #7 of 16 (6517 views)
Re: [FishMonger] Reading multiple files and pulling certain columns

It actually has the spacing in the front of each line.


FishMonger
Veteran / Moderator

May 28, 2015, 1:15 PM

Post #8 of 16 (6513 views)
Re: [davidcassidy22] Reading multiple files and pulling certain columns

I'm tied up on one of my projects, but will run a test a little later when I have some free time.


FishMonger
Veteran / Moderator

May 28, 2015, 6:28 PM

Post #9 of 16 (6505 views)
Re: [davidcassidy22] Reading multiple files and pulling certain columns

You haven't provided any info on how the output you're receiving differs from what you expect, but I've made the minimum number of changes to your script to produce the output I think you wanted. It still needs a lot of additional changes to bring it up to the quality of code I would feel comfortable using.


Code
#!/usr/bin/perl 

use strict;
use warnings;

my$ir_path = "/home/master/files";
my$intab = "$ir_path/pre_irsa/MSXlist.txt";
my$intab2 = "$ir_path/data/msxmaster_2mass_data.txt";
my$outtab = "$ir_path/data/master.txt";

die " FILE $intab NOT FOUND\!\n" if (! -f $intab) ;
die " FILE $intab2 NOT FOUND\!\n" if (! -f $intab2);

unlink ("$outtab") if (-e $outtab);

open INT , "$intab" or die "Cannot open file $intab" ;
open INT2, "$intab2" or die "Cannot open file $intab2";
open OUTT, ">$outtab" or die "Cannot open file $outtab";

my @name;
while (<INT>) {
next if /^\s+$/; # skip lines that contain only whitespace
push @name, (split /\s/)[0]; # keep the first field of each line
}

my $nn=0;
my @ra;
my @dec;
my @l;
my @b;
my @j;
my @h;
my @k;
my @jh;
my @hk;
my @jk;

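while (<INT2>) {
next if /^\s*$/; # skip blank lines
# split with no arguments splits $_ on runs of whitespace and ignores leading whitespace
($ra[$nn],$dec[$nn],$l[$nn],$b[$nn],$j[$nn],$h[$nn],$k[$nn]) = (split)[6,7,33,34,12,16,20];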
$jh[$nn] = $j[$nn]-$h[$nn];
$hk[$nn] = $h[$nn]-$k[$nn];
$jk[$nn] = $j[$nn]-$k[$nn];
$jh[$nn] = substr($jh[$nn],0,8);
$hk[$nn] = substr($hk[$nn],0,8);
$jk[$nn] = substr($jk[$nn],0,8);
$nn=$nn+1;
}

# one output row per data line actually read, rather than a hard-coded count
for my $i (0 .. $nn - 1) {
printf OUTT "%-12s %12s %12s %12s %12s %12s %12s %12s %12s %12s %12s\n", $name[$i],$ra[$i],$dec[$i],$l[$i],$b[$i],$j[$i],$h[$i],$k[$i],$jh[$i],$hk[$i],$jk[$i];
}


close OUTT;
close INT;
close INT2;
print "Done\, woot woot\n";
exit(0);
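
The post above says the script still needs further changes. A rough sketch of what some of them might look like follows; it is not FishMonger's own version. It uses lexical filehandles with three-argument open, reads both inputs in step instead of filling eleven parallel arrays, and formats the differences with sprintf rather than substr (which rounds instead of truncating). The helper names `next_nonblank`, `diff4` and `merge_columns` and the '-' placeholder for non-numeric values are assumptions made for illustration. The sample data are the first two rows posted earlier in the thread, with the leading space on the file 2 lines restored as described in post #7, and they are read from in-memory strings so that the example is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# 0-based field positions after a whitespace split, as in the script above:
# ra, dec, l, b, j, h, k
my @WANTED = (6, 7, 33, 34, 12, 16, 20);

# Return the next line that contains something other than whitespace,
# or undef at end of file.
sub next_nonblank {
    my ($fh) = @_;
    while (defined(my $line = <$fh>)) {
        return $line if $line =~ /\S/;
    }
    return;
}

# Difference to 4 decimals, or '-' (an assumed placeholder) for non-numeric input.
sub diff4 {
    my ($x, $y) = @_;
    return '-' unless looks_like_number($x) && looks_like_number($y);
    return sprintf '%.4f', $x - $y;
}

# Pair up the non-blank lines of both inputs and write one row per pair.
# Returns the number of rows written.
sub merge_columns {
    my ($names_fh, $data_fh, $out_fh) = @_;
    my $rows = 0;
    while (defined(my $name_line = next_nonblank($names_fh))) {
        my $data_line = next_nonblank($data_fh);
        last unless defined $data_line;    # data ran out first: stop
        my ($name) = split ' ', $name_line;
        my ($ra, $dec, $glon, $glat, $j, $h, $k)
            = (split ' ', $data_line)[@WANTED];
        printf {$out_fh} '%-12s' . (' %12s' x 10) . "\n",
            $name, $ra, $dec, $glon, $glat, $j, $h, $k,
            diff4($j, $h), diff4($h, $k), diff4($j, $k);
        $rows++;
    }
    return $rows;
}

# Sample input copied from the thread.
my $names = <<'END';
ad3a-00001; RA=17; Equatorial;J2000; 17:28:17.928; -34:59:35.88; LSRK;Radio; 0.0;## ad3a-00001 17:28:17.928 -34:59:35.88 G 352.9196 -00.1587 MSX AD 0.30157
ad3a-00002; RA=17; Equatorial;J2000; 17:38:54.912; -34:59:28.68; LSRK;Radio; 0.0;## ad3a-00002 17:38:54.912 -34:59:28.68 G 354.1036 -01.9825 MSX AD -0.23049
END
my $data = <<'END';
 1 1.351956 -122.178809 262.074700 -34.993300 5.000000 262.074312 -34.993500 0.06 0.06 90 17281783-3459366 13.260 0.018 0.021 46.6 9.327 0.020 0.021 547.1 7.219 0.017 0.022 3046.6 AAA 221 111 000 665556 11.5 187563838 0 0 352.919 -0.159 3.9330 2.1080 6.0410
 2 1.239526 -75.019874 264.728800 -34.991300 5.000000 264.728394 -34.991211 0.29 0.29 90 17385481-3459283 4.199 0.194 0.194 176941.6 2.888 0.192 0.192 205925.4 2.273 0.246 0.246 212432.0 CCD 333 111 d00 060505 23.2 163209034 0 0 354.104 -1.982 1.3110 0.6150 1.9260
END

# In-memory filehandles stand in for the real files.
open my $names_fh, '<', \$names or die "Cannot read names: $!";
open my $data_fh,  '<', \$data  or die "Cannot read data: $!";
my $out = '';
open my $out_fh, '>', \$out or die "Cannot write output: $!";

my $rows = merge_columns($names_fh, $data_fh, $out_fh);
close $out_fh or die "Cannot close output: $!";
print $out;
```

To run this against real files, the references to `$names`, `$data` and `$out` in the three open calls would be replaced by the file paths.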



davidcassidy22
Novice

May 29, 2015, 11:54 AM

Post #10 of 16 (6473 views)
Re: [FishMonger] Reading multiple files and pulling certain columns

Wow, thank you so much for the help. I noticed my newbie mistake by not starting at zero when counting columns. Can you explain in words what these are:


Quote
/^\s+$/;



Quote
/^\s*$/;


Syntax always confuses me. I'm not a computer science person, but I'd like to improve my programming skills. Are there any good books or other resources with exercises to practice on? What would be your advice to someone at my level?
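
On the column-counting point: list slices in Perl are indeed 0-based, but the way the line is split matters just as much. The small sketch below, using a made-up line, compares `split /\s/` (used in the original script) with a plain whitespace split (used in the corrected one). With a single leading space on each line, as reported for file 2, `split /\s/` yields one empty field first, which would be consistent with the indices shifting by one between the two scripts.

```perl
use strict;
use warnings;

# A made-up line with leading spaces and a run of spaces between fields.
my $line = "  a1 b1   c1\n";

# /\s/ splits at every single whitespace character, so leading whitespace
# and runs of spaces produce empty fields.
my @single = split /\s/, $line;

# ' ' (also the default when split is called with no arguments) skips
# leading whitespace and treats a run of whitespace as one separator.
my @awk = split ' ', $line;

print scalar(@single), ' fields: ', join('|', @single), "\n";  # 7 fields: ||a1|b1|||c1
print scalar(@awk),    ' fields: ', join('|', @awk),    "\n";  # 3 fields: a1|b1|c1

# A list slice picks fields by 0-based index.
my ($first, $third) = (split ' ', $line)[0, 2];
print "$first $third\n";   # a1 c1
```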


Laurent_R
Veteran / Moderator

May 29, 2015, 1:16 PM

Post #11 of 16 (6463 views)
Re: [davidcassidy22] Reading multiple files and pulling certain columns

This is looking for empty lines:


Code
/^\s+$/;


^ : beginning of line
\s+ : one or more whitespace characters (spaces, tabs, newlines)
$ : end of line

So any line matching this pattern contains nothing but whitespace, and at least one whitespace character.

Code
/^\s*$/;


Almost the same thing:
^ : beginning of line
\s* : zero or more whitespace characters
$ : end of line

So any line matching this pattern contains nothing but whitespace, or nothing at all.
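
A quick way to see the difference is to try both patterns on a few strings. The sample strings below are made up for illustration. Note that a line read from a file normally still ends in a newline, which itself counts as whitespace, so for such lines the two patterns behave the same; they differ only for a completely empty string.

```perl
use strict;
use warnings;

# empty string, bare newline, spaces, tab + space, and a line with content
my @samples = ('', "\n", "   \n", "\t \n", " x \n");

my (@plus, @star);
for my $s (@samples) {
    push @plus, ($s =~ /^\s+$/ ? 1 : 0);   # at least one whitespace character
    push @star, ($s =~ /^\s*$/ ? 1 : 0);   # zero or more whitespace characters

    # Make the whitespace visible before printing.
    (my $shown = $s) =~ s/\n/\\n/g;
    $shown =~ s/\t/\\t/g;
    printf "%-10s \\s+ : %-3s  \\s* : %s\n",
        "'$shown'", ($plus[-1] ? 'yes' : 'no'), ($star[-1] ? 'yes' : 'no');
}
```

Only the first sample (the empty string) gives different answers: it matches `/^\s*$/` but not `/^\s+$/`. The last sample matches neither.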


davidcassidy22
Novice

Jun 1, 2015, 7:43 AM

Post #12 of 16 (6421 views)
Re: [Laurent_R] Reading multiple files and pulling certain columns

Awesome, thank you so much, I really appreciate it.


aaron_baugher
Novice

Jun 2, 2015, 9:47 PM

Post #13 of 16 (6404 views)
Re: [davidcassidy22] Reading multiple files and pulling certain columns

Why not open both files and read through them line-by-line simultaneously? Using file1 and file2 from your first example:


Code
#!/usr/bin/env perl 
use 5.010; use strict; use warnings;

open my $f1, '<', 'file1' or die $!;
open my $f2, '<', 'file2' or die $!;
while(my $l1 = <$f1>){
my $l2 = <$f2>;
my @a1 = split ' ', $l1;
my @a2 = split ' ', $l2;
say join ' ', $a2[0], @a1[2,0,1];
}



Laurent_R
Veteran / Moderator

Jun 3, 2015, 1:47 PM

Post #14 of 16 (6393 views)
Re: [aaron_baugher] Reading multiple files and pulling certain columns

Hmm, that's OK if you know for sure beforehand that both files have exactly the same structure. If not, you need a more complicated algorithm to read both files in parallel (assuming they are sorted by the same criteria).


aaron_baugher
Novice

Jun 3, 2015, 5:04 PM

Post #15 of 16 (6391 views)
Re: [Laurent_R] Reading multiple files and pulling certain columns

True. I assumed from his description that the files have the same number of lines, but if not, he'd need a few more lines of code to deal with that, depending on how he wants that handled (throw an error, drop the extra lines, or use the extra lines).
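
As a sketch of the first of those options (throw an error), the lockstep loop could check both handles on every iteration. The function name `merge_lockstep` is made up for illustration, and the sample data are file 1 and file 2 from the first post, held in in-memory strings so the example runs on its own.

```perl
use strict;
use warnings;

# Read two handles line by line in step; die if one ends before the other.
sub merge_lockstep {
    my ($fh1, $fh2) = @_;
    my @rows;
    while (1) {
        my $l1 = <$fh1>;
        my $l2 = <$fh2>;
        last if !defined $l1 && !defined $l2;    # both ended together
        die "file1 has fewer lines than file2\n" unless defined $l1;
        die "file2 has fewer lines than file1\n" unless defined $l2;
        my @a1 = split ' ', $l1;
        my @a2 = split ' ', $l2;
        push @rows, join ' ', $a2[0], @a1[2, 0, 1];
    }
    return @rows;
}

my $file1 = "x1 y1 z1\nx2 y2 z2\nx3 y3 z3\nx4 y4 z4\n";
my $file2 = "a1 b1 c1\na2 b2 c2\na3 b3 c3\na4 b4 c4\n";

open my $f1, '<', \$file1 or die $!;
open my $f2, '<', \$file2 or die $!;
my @rows = merge_lockstep($f1, $f2);
print "$_\n" for @rows;    # a1 z1 x1 y1 ... a4 z4 x4 y4

# Same call with a second file that is too short.
my $short = "a1 b1 c1\n";
open $f1, '<', \$file1 or die $!;
open my $f3, '<', \$short or die $!;
my @partial = eval { merge_lockstep($f1, $f3) };
my $error = $@;
print "error: $error";     # error: file2 has fewer lines than file1
```

Dropping the extra lines instead would only need the two `die` statements replaced by `last`.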


Laurent_R
Veteran / Moderator

Jun 4, 2015, 1:19 AM

Post #16 of 16 (6379 views)
Re: [aaron_baugher] Reading multiple files and pulling certain columns

I work in a data quality department, and we quite often have this type of work to do: comparing CSV files from two different applications of the same information system that are supposed to cover the same customer base (35 million customers). With a bit of over-simplification, we usually follow a two-step process. We first read file A and file B in parallel and remove from the main files what we call orphans, i.e. records that are in A and not in B, or in B and not in A (on the basis of a common key by which both files are sorted). Orphans are stored in separate files for further action.

Then, once this is done, we know for sure that the main orphan-less files have exactly the same structure (and same number of records) and we can read the files in parallel simply with the method you described or something similar in order to compare the individual fields of each customer record and produce files containing inconsistencies which, together with the orphan files, will be the basis for producing corrections for one of the applications.

We are doing that often enough that I have written a module with functions taking care of the various steps, so that the main program might end up being quite short.
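
The module mentioned above is not shown in the thread. Purely as an illustration of the first step (separating orphans), here is a minimal sketch under stated assumptions: each record is a line of the form "key,rest", both inputs are sorted by key in ascending string order, and keys are unique within each file. The function names, the record layout and the sample data are all made up, and results are collected in memory rather than written to separate files.

```perl
use strict;
use warnings;

# Assumed record layout: "key,rest-of-record".
sub key_of {
    my ($record) = @_;
    return (split /,/, $record, 2)[0];
}

# Read one record without its trailing newline; undef at end of file.
sub next_record {
    my ($fh) = @_;
    my $line = <$fh>;
    chomp $line if defined $line;
    return $line;
}

# Walk both sorted inputs in parallel, always advancing the side with the
# smaller key. Equal keys are a match; anything else is an orphan.
sub split_orphans {
    my ($fh_a, $fh_b) = @_;
    my %out = (matched => [], orphans_a => [], orphans_b => []);
    my $rec_a = next_record($fh_a);
    my $rec_b = next_record($fh_b);
    while (defined $rec_a && defined $rec_b) {
        my $order = key_of($rec_a) cmp key_of($rec_b);
        if ($order == 0) {
            push @{ $out{matched} }, [$rec_a, $rec_b];
            $rec_a = next_record($fh_a);
            $rec_b = next_record($fh_b);
        }
        elsif ($order < 0) {               # key only in A
            push @{ $out{orphans_a} }, $rec_a;
            $rec_a = next_record($fh_a);
        }
        else {                             # key only in B
            push @{ $out{orphans_b} }, $rec_b;
            $rec_b = next_record($fh_b);
        }
    }
    # Whatever is left on one side after the other ends is an orphan too.
    while (defined $rec_a) {
        push @{ $out{orphans_a} }, $rec_a;
        $rec_a = next_record($fh_a);
    }
    while (defined $rec_b) {
        push @{ $out{orphans_b} }, $rec_b;
        $rec_b = next_record($fh_b);
    }
    return \%out;
}

my $file_a = "001,Paris\n002,Lyon\n004,Nice\n";
my $file_b = "001,Paris\n003,Lille\n004,Nizza\n";
open my $fh_a, '<', \$file_a or die $!;
open my $fh_b, '<', \$file_b or die $!;

my $result = split_orphans($fh_a, $fh_b);
print "orphans in A: @{ $result->{orphans_a} }\n";   # orphans in A: 002,Lyon
print "orphans in B: @{ $result->{orphans_b} }\n";   # orphans in B: 003,Lille

# Second step: the matched pairs now line up one to one and can be compared.
my @inconsistent = grep { $_->[0] ne $_->[1] } @{ $result->{matched} };
print "inconsistent: $_->[0] vs $_->[1]\n" for @inconsistent;
```

With the sample data the only inconsistency reported is key 004 (Nice versus Nizza).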

 
 

