Sorting big files

 



mmcw2201
User

Feb 10, 2002, 5:26 AM

Post #1 of 20 (7633 views)
Sorting big files

I am using this code to sort:

$datafile = "path/to/flat/database/file";


The Datafile contains a flat database like this:

ID|Name|Category|Title|Image|Description|Price|Taxable
0028|KINKS|10|Candy From Mr.Dandy.|geenidee.gif|label:<br>tracklist:<br>|30.00|1|
0050|BISHOPS|1|Live!|geenidee.gif|label:<br>tracklist:<br>|40.00|1|
0051|CHURCH|2|Temperature Drop In Downtown|geenidee.gif|label:<br>tracklist:<br>|40.00|1|

But the actual file contains about 9,000 lines!

open(DATA, "$datafile") or die "Can't open $datafile: $!";
@{$r_data} = <DATA>;
close(DATA);

# Set values if values are empty
$r_in->{'row'} = 0 unless ($r_in->{'row'});
if ($r_in->{'row'}) {
    $r_in->{'type'}  = "a" unless ($r_in->{'type'});
    $r_in->{'order'} = "a" unless ($r_in->{'order'});
}
else {
    $r_in->{'type'}  = "n" unless ($r_in->{'type'});
    $r_in->{'order'} = "a" unless ($r_in->{'order'});
}

$r_data = sort_data($r_in->{'row'},$r_in->{'type'},$r_in->{'order'},$r_data);

#########################################################################
# #
# subroutine sort_data #
# Subroutine that does the actual sort. #
# Accepts 4 params, viz.: #
# 1. The column number to sort on. Column numbers start from 0. #
# 2. The type of sort: numeric or alphabetic. Default is alphabetic. #
# 3. The order of sort. Default order is ascending. #
# 4. The reference to the array that needs to be sorted. #
#########################################################################

sub sort_data {

    my ($row, $type, $sort_order, $r_data) = @_;
    my (@array);

    if ($type eq "n") {
        if ($sort_order eq "d") {
            @array = map { (split('<->', $_))[1] }
                     reverse sort { $a <=> $b }
                     map { join('<->', lc((split('\|', $_))[$row]), $_) }
                     @{$r_data};
        }
        else {
            @array = map { (split('<->', $_))[1] }
                     sort { $a <=> $b }
                     map { join('<->', lc((split('\|', $_))[$row]), $_) }
                     @{$r_data};
        }
    }
    else {
        if ($sort_order eq "d") {
            @array = map { (split('<->', $_))[1] }
                     reverse sort { $a cmp $b }
                     map { join('<->', lc((split('\|', $_))[$row]), $_) }
                     @{$r_data};
        }
        else {
            @array = map { (split('<->', $_))[1] }
                     sort { $a cmp $b }
                     map { join('<->', lc((split('\|', $_))[$row]), $_) }
                     @{$r_data};
        }
    }

    return (\@array);
}
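For clarity, a direct call with literal arguments would look like this (field 6 is Price in the sample data above); this just illustrates the four parameters described in the comment block, it is not part of the original script:

Code
# Sort numerically, descending, on field 6 (Price); fields count from 0.
my $sorted_aref = sort_data(6, 'n', 'd', $r_data);
print @{$sorted_aref};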


This code will work with small flat databases.

With big flat databases it will not work, because Perl won't finish in time. You get a timeout error.

I searched the internet and found a module called File::Sort. Should this module work?? Or is there another way to easily sort big flat databases? It has to work on NT and UNIX!!!

How do I implement the File::Sort module in a script like the one I used above?

And is it possible to use this module without installing it? I cannot install modules myself and my provider won't do it for me!! Some modules can be uploaded to a certain directory and you can link to them using something like this:

# Set module
use lib 'Path/to/Modules/Directory';
use File::Sort;
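For reference, the use lib approach only finds the module if the file sits in a directory layout that matches its package name. A minimal sketch, with a made-up path:

Code
# Directory layout (path is hypothetical):
#   /home/user/cgi-bin/Modules/File/Sort.pm   <- unpacked from the File-Sort distribution
use lib '/home/user/cgi-bin/Modules';   # tell Perl where to look first
use File::Sort;                          # now resolves to Modules/File/Sort.pm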

Can someone help me??


freddo
User

Feb 10, 2002, 5:49 AM

Post #2 of 20 (7628 views)
Re: [mmcw2201] Sorting big files

Hello,


In Reply To
And is it possible to use this module without installing it. I can not install modules myself and my provider won't do it for me!!



If it is this File::Sort, just unpack Sort.pm into your script directory, and use Sort; in your script to load the module. Use perldoc Sort.pm to check the docs.

I hope this helps a bit
freddo
;---


mmcw2201
User

Feb 10, 2002, 5:56 AM

Post #3 of 20 (7623 views)
Re: [freddo] Sorting big files

I did read the code but did not understand it.

I want to use the sort module the same way I used the subroutine.

# subroutine sort_data #
# Subroutine that does the actual sort. #
# Accepts 4 params, viz.: #
# 1. The column number to sort on. Column numbers start from 0. #
# 2. The type of sort: numeric or alphabetic. Default is alphabetic. #
# 3. The order of sort. Default order is ascending. #
# 4. The reference to the array that needs to be sorted. #


Can you help me??


mmcw2201
User

Feb 10, 2002, 7:10 AM

Post #4 of 20 (7615 views)
Re: [freddo] Sorting big files

I tried your suggestion:

use Sort;

sort_file($r_in->{'datafile'});

But I get the error:

Undefined subroutine &main::sort_file called at /home/cgi-bin/Shop/Lib/shop_list_category.pl line 46.

What is wrong??


Jasmine
Administrator / Moderator

Feb 10, 2002, 3:16 PM

Post #5 of 20 (7603 views)
Re: [mmcw2201] Sorting big files

It looks like you need to explicitly import the sort_file subroutine by using:


Code
use File::Sort qw(sort_file);

instead of

Code
use Sort;



Jasmine


mmcw2201
User

Feb 11, 2002, 3:23 AM

Post #6 of 20 (7592 views)
Re: [Jasmine] Sorting big files

I think that works now.

But what do I now enter as the argument for

sort_file(????????);

Michel


mmcw2201
User

Feb 11, 2002, 3:36 AM

Post #7 of 20 (7589 views)
Re: [Jasmine] Sorting big files

I tried this:


Code
  

use lib $cgidir.'/Modules'; # Path to Modules dir
use File::Sort qw(sort_file);
sort_file({k => 2, I => $r_in->{'datafile'}, -t => '|'});
open(DATA, "$r_in->{'datafile'}");
@{$r_data} = <DATA>;
close(DATA);



mmcw2201
User

Feb 11, 2002, 3:46 AM

Post #8 of 20 (7587 views)
Re: [Jasmine] Sorting big files

Without success. I get an error:

malformed header from script. Bad header=ID|Name|Category|Image|Descrip

What am I doing wrong??


Paul
Enthusiast

Feb 11, 2002, 8:50 AM

Post #9 of 20 (7581 views)
Re: [mmcw2201] Sorting big files

You need to print a header.

print "Content-type: text/html\n\n";

With the other code snippet you were getting errors from, you needed:


Code
use Sort;  

Sort::sort_file($r_in->{'datafile'});



mmcw2201
User

Feb 11, 2002, 9:44 AM

Post #10 of 20 (7575 views)
Re: [WiredON.net] Sorting big files

But how do I sort the file and have the following options:

# subroutine sort_data #
# Subroutine that does the actual sort. #
# Accepts 4 params, viz.: #
# 1. The column number to sort on. Column numbers start from 0. #
# 2. The type of sort: numeric or alphabetic. Default is alphabetic. #
# 3. The order of sort. Default order is ascending. #
# 4. The reference to the array that needs to be sorted. #


What do I have to give as argument???

What code do I have to add to my script???


mmcw2201
User

Feb 11, 2002, 9:49 AM

Post #11 of 20 (7570 views)
Re: [WiredON.net] Sorting big files

I want to sort the data in the file: test.txt

ID|Name|Category|Title|Image|Description|Price|Taxable
0028|KINKS|10|Candy From Mr.Dandy.|geenidee.gif|label:<br>tracklist:<br>|30.00|1|
0050|BISHOPS|1|Live!|geenidee.gif|label:<br>tracklist:<br>|40.00|1|
0051|CHURCH|2|Temperature Drop In Downtown|geenidee.gif|label:<br>tracklist:<br>|40.00|1|


I want the option to sort on the:


first field: ID

second field: Name, etc.


I want to have the option to sort ascending and descending.

I want to sort numerically and alphabetically.

How to do that using this module?
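Not a definitive answer, but assuming File::Sort's sort_file really takes the sort(1)-style option keys shown in its synopsis (I for input, o for output, k for the key spec, -t for the field separator), a call for this file might look roughly like this. The output file names and module path are made up for illustration:

Code
#!/usr/bin/perl -w
use strict;
use lib '/path/to/Modules';        # hypothetical path to the uploaded module
use File::Sort qw(sort_file);

# Sort test.txt on the second field (Name), alphabetically, ascending.
# Note: in the 'k' key spec, fields are numbered from 1, not 0.
sort_file({
    I  => 'test.txt',
    o  => 'test_by_name.txt',      # hypothetical output file
    k  => '2,2',
    -t => '|',
});

# Sort on the seventh field (Price), numerically ('n'), descending ('r').
sort_file({
    I  => 'test.txt',
    o  => 'test_by_price.txt',     # hypothetical output file
    k  => '7,7rn',
    -t => '|',
});

One thing to watch for: the header line (ID|Name|Category|...) gets sorted along with the data unless it is removed from the file first.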


Jasmine
Administrator / Moderator

Feb 11, 2002, 11:51 AM

Post #12 of 20 (7564 views)
Re: [mmcw2201] Sorting big files

Note that throwing 9,000 lines into memory for sorting will probably make your program crawl (and maybe time out, as you already experienced), regardless of whether or not you're using a module. Modules are just a group of specially written and grouped functions that exist outside of your main program. The effect on the machine that's running the code is the same, with the exception that modules generally are tested and optimized.

File::Sort performs a "Sort a file or merge sort multiple files". Are you sure this is what you want/need? Perhaps Sort::Fields (http://search.cpan.org/search?dist=Sort-Fields) may be closer to what you need. Docs are here: http://search.cpan.org/doc/JNH/Sort-Fields-0.90/Fields.pm

With a flat file of 9,000 lines, I'd urge you to consider MySQL or another database program. Any sorting algorithm on that much flat-file data (considering that the entire file will be slurped into memory and then sorted record by record) will be slow.

The code that you wrote in your first post of this thread looks fine... if it times out, it just means that the function you are asking for is taking an exorbitant amount of time.
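For comparison, here is roughly what the Sort::Fields route would look like, assuming its fieldsort() interface matches the docs linked above (an optional split pattern, then an arrayref of 1-based field specs, with 'n' meaning numeric and a leading '-' meaning descending). The file name is just the sample from earlier in the thread:

Code
#!/usr/bin/perl -w
use strict;
use lib '/path/to/Modules';     # hypothetical path to the uploaded module
use Sort::Fields;               # exports fieldsort()

# This still slurps the whole file -- it only changes how the sort is
# written, not how much memory it uses.
open(FILE, 'test.txt') or die "Can't open test.txt: $!";
my @lines = <FILE>;
close(FILE);
my $header = shift @lines;      # keep the ID|Name|... header out of the sort

# Field 2 (Name), alphabetic, ascending:
my @by_name = fieldsort '\|', [2], @lines;

# Field 7 (Price), numeric, descending:
my @by_price_desc = fieldsort '\|', ['-7n'], @lines;

print $header, @by_name;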


mmcw2201
User

Feb 11, 2002, 10:10 PM

Post #13 of 20 (7554 views)
Re: [Jasmine] Sorting big files

In some other forum they told me to use the File::Sort module to get around a bug in Perl's sort function???


AndyNewby
Novice

Feb 12, 2002, 3:59 AM

Post #14 of 20 (7550 views)
Re: [mmcw2201] Sorting big files

I would seriously not consider doing that if I were you.

1) You may find you get thrown off your host ;-)
2) It will slow down the rest of your site.
3) Why not use MySQL? There are features in MySQL that let you do exactly what you are asking, and it is much faster and safer :-)

Andy
webmaster@ace-installer.com
http://www.ace-installer.com


mmcw2201
User

Feb 15, 2002, 10:01 AM

Post #15 of 20 (7534 views)
Re: [AndyNewby] Sorting big files

Thank you for your advice.

The problem is that I cannot use MySQL and I do not know MySQL!


gregarios
stranger

Feb 23, 2002, 10:34 PM

Post #16 of 20 (7510 views)
Re: [mmcw2201] Sorting big files

Hmm, after reading these posts I'm left wondering: how big is too big for a flat-file DB?

I'm running a site that has a similar DB to the one discussed here. It is only 830 lines long at the moment, but could grow to the 9,000+ lines that are being discussed here. I do very similar things to what is being attempted, sorting the contents and then displaying the sorted output.

Here is the thing... My script reads the entire file into memory, then picks out any number of entries as verified by a pattern match, then sorts alphabetically, weeds out the duplicates if necessary, then displays the output. Even though it has over 800 lines, the script never takes more than 1 second to run. Will it start having the same timeout problems if and when it grows past 9,000 lines? Right now the file size is only 300K as read on disk.

Also, could there be a difference between Unix and NT in how they handle large files?

Greg J Piper
MacPiCkS (http://www.macpicks.com)



Jasmine
Administrator / Moderator

Feb 24, 2002, 10:43 AM

Post #17 of 20 (7501 views)
Re: [gregarios] Sorting big files


In Reply To
My script reads the entire file into memory, then picks out any number of entries as verified by a pattern match, then sorts alphabetically, weeds out the duplicates if necessary, then displays the output.


It sounds like you don't need to slurp the file into memory. You can go through each line of the file one by one, keep only the lines that match, throw the rest away, perform your functions, sort, then output. Example:

[perl]#!/usr/bin/perl -w

use strict;


# initialize the matches hashref. i prefer hashrefs over hashes because they
# can easily be thrown around throughout programs. it's a good idea to always
# append _href to hash ref names or _aref for array ref names -- this way,
# you always know what reference type you're working on.

my $matches_href = {};

open( FILE, "<test.db" ) or die $!;


# instead of my @db = <FILE>, using while ( <FILE> ) will read one line at
# a time, saving the overhead of keeping possibly large files in
# memory.

while ( <FILE> ){
    chomp;

    # here, grab the id number and throw the rest of the line into an array.
    # this assumes that the id number is the first element of the line :)

    my ( $id, @row ) = split( /\|/ );

    # this line checks the search criteria, and if it succeeds, it adds the
    # line to the matches href. Note that $id is the key and the value is an
    # array ref which contains the remainder of the matched line. If you
    # don't use an array ref (if you add @row instead of \@row), you will
    # receive "Can't use ('something') as an ARRAY ref" errors later on in
    # your program.

    $matches_href->{ $id } = \@row if $row[0] =~ /Perl/ && $id;
}

close FILE or die $!;


# sort the records based on the first element of the arrayref. if you wanted
# to sort it based on the id instead (remember that the id was the key, and
# isn't in the arrayref), you could use this instead:
# my @sorted_ids = sort { $a <=> $b } keys %$matches_href;
# before you sort, you can do whatever else you want to weed out duplicates.

my @sorted_ids = sort {
    lc $matches_href->{$a}[0] cmp lc $matches_href->{$b}[0]
} keys %$matches_href;


# print the sorted list of first elements.

print "$matches_href->{ $_ }[0]\n" foreach @sorted_ids;
[/perl]


mmcw2201
User

Feb 26, 2002, 9:36 AM

Post #18 of 20 (7481 views)
Re: [Jasmine] Sorting big files

Thank you for your answer, but I do not understand the line:


Code
 $matches_href->{ $id } = \@row if $row[0] =~ /Perl/ && $id; 



I want to have the possibility to sort all records in the flat database on the first, second, third, etc. field, ascending or descending.

Can this be done with this code also??
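One way to do it, building on Jasmine's line-by-line approach rather than the original slurp, is to store the chosen field as a sort key next to each whole line and pick the comparison at sort time. This is only a sketch; the file name and the default settings are placeholders:

Code
#!/usr/bin/perl -w
use strict;

my $datafile = 'test.txt';      # placeholder file name
my $field    = 1;               # 0 = ID, 1 = Name, 2 = Category, ...
my $type     = 'a';             # 'a' = alphabetic, 'n' = numeric
my $order    = 'a';             # 'a' = ascending,  'd' = descending

my @records;
open(FILE, $datafile) or die "Can't open $datafile: $!";
my $header = <FILE>;            # keep the header line out of the sort
while (my $line = <FILE>) {
    chomp $line;
    my @fields = split /\|/, $line;
    # store the sort key alongside the whole line so we only split once
    push @records, [ lc($fields[$field]), $line ];
}
close(FILE);

my @sorted = ($type eq 'n')
    ? sort { $a->[0] <=> $b->[0] } @records
    : sort { $a->[0] cmp $b->[0] } @records;
@sorted = reverse @sorted if $order eq 'd';

print $header;
print "$_->[1]\n" for @sorted;

As for the line you asked about: it is just a filter. It keeps a record only when the first field after the ID matches /Perl/ and the ID is non-empty; in the sketch above no filtering is done, so every record is sorted.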


ka0osk
Novice

Mar 4, 2002, 6:54 AM

Post #19 of 20 (7457 views)
Re: [mmcw2201] Sorting big files

You can probably stop the timeouts by sending dummy HTML every 500 lines or so (HTML comments should work). I found that after about 4,000 lines or so, the sorts can time out the browser. Writing your own manual database based on flat files is kind of like re-inventing the wheel. MySQL is probably the best way to go, and well worth looking into.
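A rough illustration of that dummy-HTML trick: unbuffer STDOUT and emit an HTML comment every few hundred records so the browser keeps receiving data while the slow part runs. The file name and the 500-record interval are just placeholders:

Code
#!/usr/bin/perl -w
use strict;

$| = 1;                                  # unbuffer STDOUT
print "Content-type: text/html\n\n";

my @lines;
open(FILE, 'test.txt') or die "Can't open test.txt: $!";
while (my $line = <FILE>) {
    push @lines, $line;
    print "<!-- still working -->\n" unless @lines % 500;
}
close(FILE);

my @sorted = sort @lines;                # the slow part
print "<!-- sort finished -->\n";
print "<pre>@sorted</pre>\n";

Note that this only keeps the connection alive; a single huge in-memory sort can still stall the browser between comments.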

Even if you get the thing to not time out, you still eat up processor time. If volume goes up, you will be right back where you were! If you can split the DB into smaller DBs based on some criteria, you might get away with it.

Think about MySQL... It's easier than you think. Buy the O'Reilly book.

John Step


gregarios
stranger

Feb 25, 2004, 9:17 PM

Post #20 of 20 (5911 views)
Re: [Jasmine] Sorting big files

Thanks for your input, Jasmine... But after some experimentation I seem to have come to the conclusion that my "slurping" of the whole file is actually more efficient for me. Reason: I have to output nearly the entire file anyway after processing the data, so slurping it all into memory tends to let it output faster. :-)

 
 

