
Perl Programming Help: Beginner: parse tab delimited file and write to new files
Norwegian_Wood
New User

Jun 29, 2013, 12:51 PM

Post #1 of 7 (617 views)
parse tab delimited file and write to new files

 
Hi,
I have a very large tab-delimited file with around 3000 fields and a header row. It is in the format ID, start, end, name1, name2, name3 ... up to name3000. My aim is to parse this file into 3000 separate files. Each file should have the first 3 fields of the source file (ID, start, end) plus the next consecutive name field, starting at name1; the next file should have ID, start, end, name2, and so on up to name3000.

I'm new to Perl, but I assume I should read and write the first 3 columns to each new text file, with the fourth column being a variable: loop over column numbers 4 to 3000 and pass each one to a $column4 variable each time a new file is created.

All I have been able to do so far is read the first column of the text file with the Text::CSV module.

Any help would be appreciated.

This is my script so far:
Code
use strict;
use warnings;
use Text::CSV;

my $file = 'MyTextFile.txt';
my $csv  = Text::CSV->new( { sep_char => "\t" } );

open my $fh, '<', $file or die "Can't open $file: $!\n";

while (<$fh>) {
    if ( $csv->parse($_) ) {
        my @column = $csv->fields();
        print "$column[0]\n";    # first field (Perl arrays are 0-indexed)
    }
    else {
        my $err = $csv->error_input;
        print "ERROR: Failed to parse line: $err";
    }
}

close $fh;



FishMonger
Veteran / Moderator

Jun 29, 2013, 1:52 PM

Post #2 of 7 (607 views)
Re: [Norwegian_Wood] parse tab delimited file and write to new files [In reply to]

The format of your source file tells me that the process that generates that file was not thought out very well. Why take it a step further by creating 3000 nearly identical files?

It appears to me that you have an XY problem.

Here's your requested solution.

Code
if ( $csv->parse($_) ) {
    my ($id, $start, $end, @columns) = $csv->fields();

    # one output line per name column, each keyed by ID/start/end
    while (@columns) {
        print join("\t", $id, $start, $end, shift @columns), "\n";
    }
}


I left out the part that opens/creates each file because you didn't provide enough info regarding how they are to be named. That open statement would be placed inside the while loop.
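
If, just for illustration, the output files were simply named name1.txt .. name3000.txt (an assumption, since you haven't said how they should be named), the open could be wired in like this (untested sketch):

Code
# Untested sketch -- assumes output files named name1.txt .. name3000.txt
if ( $csv->parse($_) ) {
    my ($id, $start, $end, @columns) = $csv->fields();

    my $n = 0;
    while (@columns) {
        $n++;
        # open in append mode so every input line adds to the same file
        open my $out, '>>', "name$n.txt"
            or die "Can't open name$n.txt: $!";
        print $out join("\t", $id, $start, $end, shift @columns), "\n";
        close $out;
    }
}

In practice you'd want to open each of the 3000 files once, up front, and reuse the handles; re-opening them on every input line will be slow.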


(This post was edited by FishMonger on Jun 29, 2013, 1:56 PM)


Norwegian_Wood
New User

Jun 29, 2013, 3:04 PM

Post #3 of 7 (601 views)
Re: [FishMonger] parse tab delimited file and write to new files [In reply to]

Thanks very much for your help.
I have tested your code and it seems to be working.

Sadly, I have no control over the source file; it would be easier if I received it in a more convenient format. The reason I would like to split it is that I then want to import the pieces into MySQL as tables, with the name from the header as the table name, which will then be queried. This will be much more efficient if done by joining multiple tables on the primary key rather than importing the current source file into one huge table. (Only 10-20 tables will be queried at a time, according to the user request.)


So ideally I would like to take the header name from the fourth field of each file as the new file name. This can be done at the file-split stage or when I import the files into MySQL, which I am more familiar with.
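
Just to show the naming I'm after, something like this is what I imagine (untested; I'm assuming the header row can be read first and the handle is the already-open $fh from your code):

Code
# Untested sketch -- name each output file after its header field
my $header = <$fh>;
$csv->parse($header) or die "Can't parse header row";
my (undef, undef, undef, @names) = $csv->fields();    # name1 .. name3000

while (<$fh>) {
    next unless $csv->parse($_);
    my ($id, $start, $end, @columns) = $csv->fields();
    for my $i (0 .. $#columns) {
        open my $out, '>>', "$names[$i].txt"
            or die "Can't open $names[$i].txt: $!";
        print $out join("\t", $id, $start, $end, $columns[$i]), "\n";
        close $out;
    }
}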





FishMonger
Veteran / Moderator

Jun 29, 2013, 3:48 PM

Post #4 of 7 (597 views)
Re: [Norwegian_Wood] parse tab delimited file and write to new files [In reply to]

So, you plan on creating 3000 tables? That's as bad, if not worse.

If you provide more details on what you need to accomplish, not how you think you should implement your possible solution, we should be able to recommend better methods.


hwnd
User

Jun 29, 2013, 4:18 PM

Post #5 of 7 (591 views)
Re: [FishMonger] parse tab delimited file and write to new files [In reply to]

Regarding your header row:


Code
ID   start   end   name1   name2   name3



I'm curious: what does your data look like on the line under the header?


Norwegian_Wood
New User

Jun 30, 2013, 3:55 AM

Post #6 of 7 (581 views)
Re: [hwnd] parse tab delimited file and write to new files [In reply to]

Hi,

The data under the header looks like this, with ID, start, and end as integers; each name field holds a long decimal value.

Code
chr22   50163501   50164000   0.0216543833380203   0.161245861239383   0.146899443524876


So for each file I am looking for output like this:

Code
ID      Start   End     name1
chr22   1       500     0.0216543833380203
chr22   251     750     0.0374431201561708

I am creating a PHP application where the user will provide a list of up to 20 name fields; these are converted to an array, and those columns are selected from the tables in the MySQL database with a dynamic query.

The reason I chose to split the large file by name is that MySQL (or any database) generally does not hold rows larger than around 8,000 bytes, and my source file has rows much larger than this. Also, querying a table with much more than 50 columns starts to get very inefficient; it's much less resource-intensive to create smaller tables and query them using a join on a related key. In order to do this I need to have the source data normalized before I import it.
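
For example, with hypothetical table names matching the header fields, I picture the dynamic query being built roughly like this (a sketch only; the table and column names are assumptions):

Code
# Hypothetical sketch -- assumes one table per name, each with (id, start, end, value)
my @wanted = qw(name1 name2);    # up to 20 names taken from the user request

my ($first, @rest) = @wanted;
my $cols  = join ', ', map { "$_.value AS $_" } @wanted;
my $joins = join ' ',  map { "JOIN $_ USING (id, start, end)" } @rest;

my $sql = "SELECT $first.id, $first.start, $first.end, $cols "
        . "FROM $first $joins";

Though the application itself is PHP, the same string-building idea applies there.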

I have been trying to play around with your code to print to multiple text files within the while loop, with no success; I'm very new to Perl. The file name is not so important at this stage. They can just be name1 ... name3000.


Regards


FishMonger
Veteran / Moderator

Jun 30, 2013, 8:30 AM

Post #7 of 7 (565 views)
Re: [Norwegian_Wood] parse tab delimited file and write to new files [In reply to]

Saying you're having "no success" is a very poor problem description and doesn't give us any info on what you're doing or the results you're getting.

What have you tried? You need to post your code.

What were the results?

How do the results differ from what you expected?

Did you receive any errors/warnings and if so what were they?

When you say you're trying to print to multiple files within the loop, are you saying you're trying to write to 3000 different files on each iteration of the loop?


Quote
I need to have the source data normalized before I import it.

That does not require creating 3000 separate files which then need to be parsed when inserting the data into the database.

You normalize the database structure/schema before anything else. Then, as you parse each line of your source data, you either insert each record as needed or, if the record needs to be built up from multiple lines, you'd store the data in a proper data structure, such as a hash, and do the insert(s) once the record has all of its data points.

BTW, creating 3000 tables with the exact same structure is far from being normalized.
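
Rough, untested sketch of what I mean, assuming a single tall table such as measurements(id, start, end, name, value); adjust to whatever your real schema ends up being:

Code
# Untested sketch -- one normalized table instead of 3000 files/tables
# assumes $fh and $csv are set up as in your earlier script
use DBI;

# placeholder DSN and credentials
my $dbh = DBI->connect('dbi:mysql:mydb', 'user', 'pass', { RaiseError => 1 });
my $sth = $dbh->prepare(
    'INSERT INTO measurements (id, start, end, name, value)
     VALUES (?, ?, ?, ?, ?)'
);

my @names;
while (<$fh>) {
    next unless $csv->parse($_);
    my ($id, $start, $end, @values) = $csv->fields();
    if ($. == 1) {              # header row: remember the name fields
        @names = @values;
        next;
    }
    $sth->execute($id, $start, $end, $names[$_], $values[$_])
        for 0 .. $#values;
}

One insert statement, one table, and the per-name lookups you described become a simple WHERE name IN (...) filter.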


(This post was edited by FishMonger on Jun 30, 2013, 8:32 AM)

 
 

