
JADobson
New User
Nov 19, 2010, 7:55 AM
Post #1 of 2
(179 views)
Advice with Function Call From Loop
Hey,

Firstly, a bit of info on what I'm trying to do. I'm attempting to write a script which finds words from several lists (one with over 20,000 entries) in a string. I'll be honest from the start: this is for a university project, so I'm not looking for answers, just pointers and advice.

My current approach (probably not very efficient, but it works) is to check whether each list entry occurs as a substring of the original string. So, say I have the list (purely an example):

    black car
    white car
    green car
    red car
    ...

And the string:

    "my friend drives a bright red car"

It would attempt to find the substring "black car", then the substring "white car", and so on until it gets to "red car", which is a match. (If anybody has any suggestions for a different, more efficient approach to this lookup, please let me know.)

Anyway, that all works fine. However, I'm required to strip all punctuation and format both the original string and the list entries in a certain way before attempting to find an entry in the original string. Originally I had something like:

    while (my $crLine = <CRFILE>) {
        $crLine = &formatString($crLine);
        if ($OriginalString =~ m/$crLine/i) {
            ...

After reading up on function calls in Perl, it turns out that calling a function from within a loop 20,000+ times is a terrible idea where optimization is concerned. I'm struggling to find an alternative approach that doesn't involve duplicating the code within the formatString subroutine 4 or so times. Maybe I'm just being braindead.

I thought about writing a subroutine which took a file handle (or a reference to one) as a parameter and did the formatting and lookup within that function, but the lookup for each list is different (some are looking for matches, some are replacing text and others are simply removing text). Any ideas?

The formatString subroutine basically performs a bunch of regex operations on the passed string:

    $cwLine =~ tr/\-/ /;
    $cwLine =~ tr/A-Z/a-z/;
    $cwLine =~ s/[^a-zA-Z\s]|\s+$|^\s+//g;
    $cwLine =~ s/$/ /;
    $cwLine =~ s/^/ /;

Optimization is my biggest requirement here. If you see anything inefficient in what I've described, I'd greatly appreciate some pointers.

Execution-wise, the formatString subroutine approach performs the lookup on a string against 3 lists (with a total of roughly 30,000 entries) in approximately 4.5 seconds. Removing the subroutine and duplicating the contained code takes that time down to around 1.2 seconds. Am I going to have to deal with duplicated code for the sake of optimization, or is there an alternative approach I could consider?

Thanks for your time,
James
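
P.S. In case it helps anyone reproduce the setup, here is a minimal self-contained sketch of the approach described above. It is only an illustration under some assumptions: `format_string` is a hypothetical stand-in for my formatString sub built from the five operations quoted above, the list is the four-entry example rather than the real files, and the test string is made up. `\Q...\E` is added around the interpolated entry as a precaution; after formatting, the entries contain only letters and spaces.

```perl
use strict;
use warnings;

# Hypothetical stand-in for formatString: same five operations as quoted above.
sub format_string {
    my ($s) = @_;
    $s =~ tr/\-/ /;                      # hyphens become spaces
    $s =~ tr/A-Z/a-z/;                   # lowercase everything
    $s =~ s/[^a-zA-Z\s]|\s+$|^\s+//g;    # drop non-letters and leading/trailing whitespace
    $s =~ s/$/ /;                        # pad with one trailing space
    $s =~ s/^/ /;                        # pad with one leading space
    return $s;
}

# Example data only; the real lists are read line by line from files.
my @entries  = ("black car", "white car", "green car", "red car");
my $original = format_string("My friend drives a bright red-car!");

# One sub call per list entry, as in the original loop.
my @matches;
for my $entry (@entries) {
    my $needle = format_string($entry);
    push @matches, $entry if $original =~ m/\Q$needle\E/;
}

print "matched: $_\n" for @matches;
```

The padding spaces on both sides are what stop an entry from matching in the middle of a longer word.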