iterate over millions hash entries

 



skualito92
New User

Apr 12, 2017, 2:09 AM

Post #1 of 9 (3279 views)
iterate over millions hash entries

Hello,

I have a simple hash with millions of entries. I need to iterate over all of them to assign a value to each key, and then sort the hash by value.

Currently it takes more than a minute to process. Do you have another approach that would do the same thing more efficiently?

Thanks in advance for your help.

Regards.


FishMonger
Veteran / Moderator

Apr 12, 2017, 6:40 AM

Post #2 of 9 (3272 views)
Re: [skualito92] iterate over millions hash entries [In reply to]

Use a database?

We can't really answer that question without seeing your code and sample data.

The only suggestion I can make at this point is for you to profile your script to see where it's spending its time, and then focus on optimizing the slow parts, which may or may not be the section where you think the problem is located.

Devel::NYTProf - Powerful fast feature-rich Perl source code profiler
https://metacpan.org/pod/Devel::NYTProf


(This post was edited by FishMonger on Apr 12, 2017, 6:41 AM)


BillKSmith
Veteran

Apr 12, 2017, 6:41 AM

Post #3 of 9 (3271 views)
Re: [skualito92] iterate over millions hash entries [In reply to]

I do not understand your question. It is not possible to 'sort the hash' (hashes are unordered). You imply that you already have a working program. Please post sample code that demonstrates (with a small hash) what you are doing with the large one. Your sample should be a complete, error-free (be sure to use strict and warnings), executable Perl program, and should include a small but otherwise realistic sample of your data. Do not post your complete production code; abstract just enough to demonstrate your problem.
Good Luck,
Bill


skualito92
New User

Apr 12, 2017, 7:12 AM

Post #4 of 9 (3267 views)
Re: [FishMonger] iterate over millions hash entries [In reply to]


In Reply To
Use a database?

We can't really answer that question without seeing your code and sample data.

The only suggestion I can make at this point is for you to profile your script to see where it's spending its time, and then focus on optimizing the slow parts, which may or may not be the section where you think the problem is located.

Devel::NYTProf - Powerful fast feature-rich Perl source code profiler
https://metacpan.org/pod/Devel::NYTProf


My code uses a simple key/value hash: the keys are strings and the values are numbers. I then sort the keys by their numeric values.

example:

my %hash = (
    'foo' => 10,
    'bar' => 30,
);

It just takes a lot of time because I have millions of keys. Do you think using a DB would be more efficient? I will give it a try with SQLite...

I already profiled the code, and it shows that it spends all its time in the loop over the hash.


(This post was edited by skualito92 on Apr 12, 2017, 7:13 AM)
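For reference, the sort being described presumably looks something like the sketch below (using the thread's sample data; the per-key assignment is a placeholder, since the OP hasn't shown what value is computed). Note that the sort itself is O(n log n), so millions of keys will always carry some cost.

```perl
use strict;
use warnings;

my %hash = ('foo' => 10, 'bar' => 30);

# assign/adjust a value for each key (placeholder computation)...
$hash{$_} += 1 for keys %hash;

# ...then order the keys by their numeric values
my @sorted_keys = sort { $hash{$a} <=> $hash{$b} } keys %hash;

print "@sorted_keys\n";   # foo bar
```

Each comparison in the sort performs two hash lookups (`$hash{$a}` and `$hash{$b}`), which is part of what the later posts in this thread try to optimize away.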


BillKSmith
Veteran

Apr 12, 2017, 8:36 AM

Post #5 of 9 (3254 views)
Re: [skualito92] iterate over millions hash entries [In reply to]

It should help some to 'preallocate' the hash (refer to the 'keys' function in perlfunc). This would be a trivial change, but you may have to experiment some to find the optimal value.

UPDATE: Test cases which I have generated suggest that this technique has little, if any, value.
Good Luck,
Bill

(This post was edited by BillKSmith on Apr 12, 2017, 3:40 PM)
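For reference, the preallocation technique mentioned above looks like this (a sketch; the bucket count of 2_000_000 is an assumption you would tune to your actual key count). Assigning to `keys(%hash)` sizes the bucket array up front, so Perl does not have to grow and rehash it incrementally as keys are added.

```perl
use strict;
use warnings;

my %hash;
keys(%hash) = 2_000_000;   # preallocate buckets before the fill loop

# fill loop (toy data standing in for the real millions of entries)
$hash{"key$_"} = $_ for 1 .. 100;
```

As Bill's update notes, his test cases suggest the gain from this is small in practice.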


skualito92
New User

Apr 13, 2017, 2:52 AM

Post #6 of 9 (3231 views)
Re: [BillKSmith] iterate over millions hash entries [In reply to]


In Reply To
It should help some to 'preallocate' the hash (refer to the 'keys' function in perlfunc). This would be a trivial change, but you may have to experiment some to find the optimal value.

UPDATE: Test cases which I have generated suggest that this technique has little, if any, value.


I tried preallocating the hash, but the gain was nearly nil.

I found something interesting on Rosetta Code, a structure like this:


Code
my @people = (['joe', 120], ['foo', 31], ['bar', 51]);
@people = sort { $a->[1] <=> $b->[1] } @people;


The gain with this is about 10%. Not bad, but I want more ;)
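Building that array-of-arrays from the original hash and sorting it might look like the sketch below (using the thread's sample data plus a hypothetical 'baz' entry). The speedup comes from each comparison in the sort doing two array-element lookups instead of two hash lookups.

```perl
use strict;
use warnings;

my %hash = ('foo' => 10, 'bar' => 30, 'baz' => 51);

# flatten the hash into [key, value] pairs once...
my @pairs = map { [ $_, $hash{$_} ] } keys %hash;

# ...then sort the pairs numerically by value
@pairs = sort { $a->[1] <=> $b->[1] } @pairs;

print join(' ', map { $_->[0] } @pairs), "\n";   # foo bar baz
```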


FishMonger
Veteran / Moderator

Apr 13, 2017, 7:04 AM

Post #7 of 9 (3223 views)
Re: [skualito92] iterate over millions hash entries [In reply to]

You could invert the hash so that the keys are the numbers and the values are array refs of your current keys.

For example, instead of this hash structure:

Code
my %hash = (
    'foo' => 10,
    'bar' => 30,
    'baz' => 10,
);


You'd have:

Code
my %hash = (
    10 => [ 'foo', 'baz' ],
    30 => [ 'bar' ],
);


Then do a normal/typical sort on the keys.
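A sketch of that inversion, using the sample hash above; the final sort then runs over the distinct values only, which is far fewer elements when many keys share a value:

```perl
use strict;
use warnings;

my %hash = ('foo' => 10, 'bar' => 30, 'baz' => 10);

# invert: value => [ keys that share it ]
my %inverted;
push @{ $inverted{ $hash{$_} } }, $_ for keys %hash;

# a plain numeric sort over the (fewer) distinct values
for my $value (sort { $a <=> $b } keys %inverted) {
    print "$value => @{ $inverted{$value} }\n";
}
```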


skualito92
New User

Apr 13, 2017, 7:15 AM

Post #8 of 9 (3220 views)
Re: [FishMonger] iterate over millions hash entries [In reply to]


In Reply To
You could invert the hash so that the keys are the numbers and the values are array refs of your current keys.

For example, instead of this hash structure:

Code
my %hash = (
    'foo' => 10,
    'bar' => 30,
    'baz' => 10,
);


You'd have:

Code
my %hash = (
    10 => [ 'foo', 'baz' ],
    30 => [ 'bar' ],
);


Then do a normal/typical sort on the keys.


I don't understand why inverting the hash would accelerate the iteration and sorting, except that it could reduce the number of keys to sort when several keys share the same value; but in my program that will not be the case.


(This post was edited by skualito92 on Apr 13, 2017, 7:17 AM)


FishMonger
Veteran / Moderator

Apr 13, 2017, 7:49 AM

Post #9 of 9 (3211 views)
Re: [skualito92] iterate over millions hash entries [In reply to]

If the values are not unique, then the inverted structure will sort faster; and if the values are unique, then you don't need the array refs or any sub-sorting at all.

So far, you've been unwilling to show your code or give us a clear understanding of your data (such as whether the values are unique), so we can only make rough guesses about possible options.

It's possible that the iteration and sorting are not what's causing the slowdown. Maybe it's the outputting of the data during that iteration; I/O is typically the slowest part of a process.

It's also possible that a processing time of 1 minute is quite reasonable for the amount of data you're dealing with, and that tweaking the code will only give you a small optimization.
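To illustrate the I/O point, here is a sketch (not the OP's code, and with toy data) contrasting a print call per key with accumulating the output and writing it once; the single buffered write is typically much cheaper per line:

```perl
use strict;
use warnings;

my %hash = map { ("key$_" => $_) } 1 .. 1000;

my @sorted = sort { $hash{$a} <=> $hash{$b} } keys %hash;

# slow pattern: one print call per key inside the loop
# print "$_\t$hash{$_}\n" for @sorted;

# faster: build the output in memory, then write it in one call
print join('', map { "$_\t$hash{$_}\n" } @sorted);
```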

 
 

