CGI/Perl Guide | Learning Center | Forums | Advertise | Login
Site Search: in

  Main Index MAIN
INDEX
Search Posts SEARCH
POSTS
Who's Online WHO'S
ONLINE
Log in LOG
IN

Home: Perl Programming Help: Advanced:
How to speed up search on array?

 



bulrush
User

Aug 30, 2016, 3:49 AM

Post #1 of 18 (8138 views)
How to speed up search on array? Can't Post

I have a program where I have to search through a huge array with 3,000,000 items and return every matching item in @prkeys. Each time the program runs I'm running this search routine 3000+ times. It can take me 90+ minutes to run the whole program. Here's the code that is slowing me down.

Code
# Regex will be: MODEL.+OLDPRICE 
$k=$basemodel.'.+'.$oldprice;
# Find all models that start with key and end with price.
@k=grep(/^$k/, @prkeys);

The intention of this code is to make a fuzzy match looking for strings that begin with a model number and end with an price. This is my last resort to find a match on a model, and it must be a fuzzy match of some type, although I could split the search into 2 parts, but it seems both parts would be a regex and splitting it into 2 parts would slow me down even more.

  1. @prkeys is an array that contains 3,000,000+ items.
  2. @prkeys is all in memory.
  3. Each string in @prkeys is 5-30 characters long.
  4. I must return every item that matches in @prkeys.
  5. Each time I search for $k, $k starts with a model number, and ends with a price. So the regex looks like: <code>/MODEL.+PRICE/</code>
  6. Because of the bad data the customer gives us I do have to use this search method of searching 3,000,000 strings.
  7. This OS is a virtual machine and there are other VMs on this physical server, and I suspect the other VMs are also slowing me down. I cannot move the VM to another physical server, so I must address speed in the code itself.
  8. I have a test program to test read speed but I have no other ideas how to speed this up. Speed normally is not an issue for me.

Questions

  1. How can I speed this up? Each time this one line runs it takes about 2 seconds. That's 6000 seconds just for this one line only, not counting any other overhead and processing for the rest of the program.
  2. Will I have to use another data structure to search for all this data?

Thank you for your help. I normally don't have to do such searching on a huge dataset.

I will post a link to the huge file and a test program for you to use shortly.

Here's the link to the data file and test program. It's about 800mb. https://www.dropbox.com/s/x4qrsmy06zdcc57/bigdata.zip?dl=0
-----


(This post was edited by bulrush on Aug 30, 2016, 10:07 AM)


Laurent_R
Veteran / Moderator

Aug 30, 2016, 3:59 AM

Post #2 of 18 (8137 views)
Re: [bulrush] How to speed up search on array? [In reply to] Can't Post

You don't give enough details, and we don't know the shape of your data, but using a hash of arrays (instead of a simple array) is likely to speed up considerably your search.

Failing that, the other option is to sort the data and go for some form of binary search.


bulrush
User

Aug 30, 2016, 4:07 AM

Post #3 of 18 (8135 views)
Re: [Laurent_R] How to speed up search on array? [In reply to] Can't Post

I posted a link to the file and test program in my first message, that should give you more details.

I don't even know how to create or search through a hash of arrays. How would I do that?
-----


Laurent_R
Veteran / Moderator

Aug 30, 2016, 5:54 AM

Post #4 of 18 (8128 views)
Re: [bulrush] How to speed up search on array? [In reply to] Can't Post

When you read your file, split the data between the key and prices. Store each price in an array ref within an hash entry with that key. Then, when looking up, find the hash entry for the key, and just look into the array ref stored there.

Example. Suppose you have this input data:


Code
key1 price1 
key1 price2
key2 price1
key2 price3
key2 price4


For this small dataset, you would construct a hash with two entries, for key1 and key2, and the hash values would be the price lists:


Code
( key1 => (price1, price2), 
key2 => (price1, price3, price4)
)

Then, when you are doing the lookup, if you have key1, you need to grep among an arrayref containing only two values; if you have key2, you grep among three values. You're no longer parsing the entire dataset each time (because access to a hash is almost immediate).

Note that you could also possibly use a hash of hashes, that would be even quicker, but whether you can do it depends on the details of your data.

Some quick untested pseudo-code for constructing the hash of arrays (making some simple assumptions about your data, you'll have to adjust to their format):


Code
my %hash; 
while (my $line = <$IN>) {
chomp $line;
my ($key, $val) = split / /, $line;
push @{$hash{$key}}, $val;
}


This should give you an hash of arrays essentially looking like what I described above.

Then, some also untested code for retrieving the data from the hash:


Code
while (my $line = <$IN>) { 
chomp $line;
my ($key, $val) = split / /, $line;
if (defined $hash{$key}) {
my @result = grep /$val/, @{$hash{$key}};
print "Found it!: @result \n" if @result;
}
}



bulrush
User

Aug 30, 2016, 6:11 AM

Post #5 of 18 (8125 views)
Re: [Laurent_R] How to speed up search on array? [In reply to] Can't Post

Thank you Laurent, I think you understand what I'm trying to do. My data in @prkeys would look something like this:


Code
H1234-LA-55<tab>190 
H2255-S-6788<tab>200
H1234-SA-55<tab>185
H2255-S-6788<tab>210
H1234<tab>185
H1234-BB-23<tab>185
H1234-ZZ-55<tab>185
(And another 3 million records)


So I would be looking for "H1234.+190" as a regexp.

I haven't had time to look at the code but I think your idea will work.

I use a subroutine to pass in 2 parameters, the base model and the price I'm looking for and I changed how I deal with those. So my code now looks like:


Code
sub guessprice 
{my($base,$oldpr)=@_;

$k='^'.$_[0]; # base model as key
if (length($_[1])>0)
{
$k.='.+'.$_[1]; # Append old price
}
@k=grep(/$k/, @prkeys); # Find all models that start with key.
# ...


Because guessprice() is called 3000+ times, using $_[n] instead of parameter names cuts time by about half, which is amazing. I'm thinking I don't even need this line either: "my($base,$oldpr)=@_;" One website said Perl does a lot of excessive string copying.

This OS is a Virtual Machine and it shares space with other VMs on a single physical server, and I think other VMs are also slowing this program down, but I cannot fix that part.
-----


bulrush
User

Aug 30, 2016, 9:06 AM

Post #6 of 18 (8123 views)
Re: [bulrush] How to speed up search on array? [In reply to] Can't Post

Oops, I spoke too soon. If I use your hash method, I actually am looking for a key that begins with my model number, i.e a partial match on the hash key. So your method doesn't quite work.
-----


FishMonger
Veteran / Moderator

Aug 30, 2016, 9:10 AM

Post #7 of 18 (8122 views)
Re: [bulrush] How to speed up search on array? [In reply to] Can't Post

My company blocks file sharing sites like dropbox so I can't see your files. It's best to post those files here as attachments.

Based on your description of the problem, my suggestion would be to load the data into a db and use sql queries to do the searching.


(This post was edited by FishMonger on Aug 30, 2016, 9:10 AM)


Laurent_R
Veteran / Moderator

Aug 30, 2016, 9:25 AM

Post #8 of 18 (8118 views)
Re: [bulrush] How to speed up search on array? [In reply to] Can't Post


In Reply To
Oops, I spoke too soon. If I use your hash method, I actually am looking for a key that begins with my model number, i.e a partial match on the hash key. So your method doesn't quite work.


Just keep for the hash key the part of the model string that you need for your match and/or lookup.


bulrush
User

Aug 30, 2016, 9:26 AM

Post #9 of 18 (8117 views)
Re: [FishMonger] How to speed up search on array? [In reply to] Can't Post

I can't attach the file because it's more than the site limit of 250kb. It's more like 800mb.
-----


(This post was edited by bulrush on Aug 30, 2016, 9:54 AM)


FishMonger
Veteran / Moderator

Aug 30, 2016, 9:33 AM

Post #10 of 18 (8114 views)
Re: [bulrush] How to speed up search on array? [In reply to] Can't Post

You don't need nor should upload 800Mb files. You should reduce your examples down to a small but reasonable enough size to demonstrate the problem.

As I mentioned in your other thread, profiling should be your first step. Then write short test scripts on the sections that need optimization which can be used to benchmark different approaches in doing those portions.


bulrush
User

Aug 30, 2016, 9:52 AM

Post #11 of 18 (8106 views)
Re: [FishMonger] How to speed up search on array? [In reply to] Can't Post

Profiling is not needed here as I already know which line takes the longest time in my search routine. This line is the problem:

Code
@k=grep(/^$k/, @prkeys);

Since I'm looking for a partial key, a hash will not work here. So that's why I went with a regex.

I don't know of any data structures that will work with my problem, hence my post here.
-----


(This post was edited by bulrush on Aug 30, 2016, 9:55 AM)


FishMonger
Veteran / Moderator

Aug 30, 2016, 10:47 AM

Post #12 of 18 (8099 views)
Re: [bulrush] How to speed up search on array? [In reply to] Can't Post

How did you determine that is the statement that needs to be optimized?

There's not much you can do to optimize that 1 statement. The bigger issue you should be looking at is the grepping of that array 3,000+ times. You should look into better ways to do that loop rather than focusing on that 1 grep statement.

Try the sql approach as a possibility.


(This post was edited by FishMonger on Aug 30, 2016, 10:47 AM)


Laurent_R
Veteran / Moderator

Aug 30, 2016, 2:59 PM

Post #13 of 18 (8087 views)
Re: [bulrush] How to speed up search on array? [In reply to] Can't Post


In Reply To
Since I'm looking for a partial key, a hash will not work here. So that's why I went with a regex.


I don't think that's true. As I already said (but perhaps you did not see my post because you posted at almost the same time) , given your data set, it seems that a hash will do if you reduce the hash key to the partial data key. Your regex (or its first part) isn't doing more than keeping only part of your data key; you can just to the same for your hash key.

Don't say nay if you don't have a really good reason to do so. From everything I have seen about your data, it should just work. Try it, your program will probably run so much incredibly faster that you'll find it hard to believe (you might even end up thinking the program died, whereas it will have completed its task).


BillKSmith
Veteran

Aug 31, 2016, 8:04 AM

Post #14 of 18 (8076 views)
Re: [bulrush] How to speed up search on array? [In reply to] Can't Post

You should gain some improvement by simply pre-compiling your regex.

Code
my $basemodel_re; 
my $oldprice_re;
#...
$basemodel_re = qr/$basemodel/;
#...
$oldprice_re = qr/$oldprice/;
#...
$k = qr/^$basemodel_re.+$oldprice_re/;
@k = grep(/$k/, @prkeys);


Place the new declarations in the smallest possible scope. Place the new definitions immediately after the definitions of $basemode and $oldprice.

Laurent's suggestion would make a huge difference, but this quick fix may be "good enough".
Good Luck,
Bill


bulrush
User

Aug 31, 2016, 8:20 AM

Post #15 of 18 (8074 views)
Re: [FishMonger] How to speed up search on array? [In reply to] Can't Post


Quote
How did you determine that is the statement that needs to be optimized?


By stepping through the code. I can watch that it actually takes 3 seconds to execute that one line once. I don't actually need to profile code in a complex way if all other areas take a fraction of a second to execute and this one line takes 3 seconds. You don't need a Corvette just to go get groceries.

I actually ended up using a hash of arrays. The key to the hash is the first 4 ctrs of each model, so the resulting array in that hash value is very limited, so my grep looks at a much smaller chunk of data.

It works great and my program, with all other improvements takes only 10% as long as it used it.
-----


bulrush
User

Aug 31, 2016, 8:22 AM

Post #16 of 18 (8072 views)
Re: [BillKSmith] How to speed up search on array? [In reply to] Can't Post

Bill,

From http://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators:

Quote
This operator quotes (and possibly compiles) its STRING as a regular expression.


So does qr compile the regex every time if it needs it?
-----


(This post was edited by bulrush on Aug 31, 2016, 8:25 AM)


BillKSmith
Veteran

Aug 31, 2016, 3:16 PM

Post #17 of 18 (8060 views)
Re: [bulrush] How to speed up search on array? [In reply to] Can't Post

Good question. I cannot find an answer. I have always assumed that qr is the same as all other regex operators. It always recompiles if it contains a variable. The compilation is much faster if the content of the variable is pre-compiled code. Under this assumption, we control where the compilation is done by where we place the qr. We gain speed when we move compilation out of a loop.
Good Luck,
Bill


FishMonger
Veteran / Moderator

Aug 31, 2016, 4:45 PM

Post #18 of 18 (8058 views)
Re: [bulrush] How to speed up search on array? [In reply to] Can't Post

The answer is a little further down that section of the documentation. If you use the qr() operator prior to the loop, it will not recompile the regex when the var is used inside the loop.

 
 


Search for (options) Powered by Gossamer Forum v.1.2.0

Web Applications & Managed Hosting Powered by Gossamer Threads
Visit our Mailing List Archives