Jun 7, 2006, 3:37 PM
Post #1 of 3
Processing Large files, Regular Expression taking long
I have to filter out some lines from the log files generated by our webserver.
Each file is 35+ MB (around 355,000 lines), and there are several such files.
I need to eliminate every line that has '_BODY' or '_BODY_text' in it, with one exception: a line should be kept when the match occurs only in the query parameters.
Eliminate lines like:
172.001.16.87 - - [31/Jan/2006:19:14:53 -0700] "GET /ptrusts/11/dynamic/datapages/HWEligibility_BODY.jhtml HTTP/1.1" 200 7722
172.696.61.87 - - [31/Mar/2006:19:19:07 -0700] "GET /ptrusts/58/dynamic/datapages/Eligibility_BODY_text.jhtml HTTP/1.1" 200 7722
and DO NOT eliminate lines like:
184.108.40.206 - - [01/Feb/2006:08:03:55 -0800] "POST /common/profile/dynamic/PRLogin.jhtml?_DARGS=/common/profile/dynamic/PRLogin_BODY.jhtml.4 HTTP/1.1" 302 12481
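Stated as code, the rule above might look like the following. This is only a minimal sketch in Python, not the expression I am actually using. It assumes the request always appears as a quoted "METHOD URL HTTP/x.y" field, and the names REQUEST_RE and should_eliminate are made up for illustration. Since '_BODY_text' contains '_BODY', a single substring check on the part of the URL before the '?' covers both cases.

```python
import re

# Assumption: every log line carries the request as "METHOD URL HTTP/x.y".
# REQUEST_RE and should_eliminate are hypothetical names for this sketch.
REQUEST_RE = re.compile(r'"[A-Z]+ (\S+) HTTP/[\d.]+"')


def should_eliminate(line: str) -> bool:
    """Return True when '_BODY' occurs in the URL path, ignoring the query string."""
    match = REQUEST_RE.search(line)
    if match is None:
        # Lines without a recognizable request field are kept.
        return False
    # Drop everything from the first '?' onward, so query parameters are ignored.
    path = match.group(1).split("?", 1)[0]
    # '_BODY_text' contains '_BODY', so one check handles both patterns.
    return "_BODY" in path
```

With the three sample lines above, this returns True for the first two and False for the POST line, because there '_BODY' appears only after the '?'.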
To achieve this I am using the regular expression:
This seems to work, but it takes a really long time (approximately 15 minutes) to process the 355,000 lines.
(A note on how the file is processed: I read the entire log file into a string and apply the regular expression to that string.)
In general it's not a problem to have the script run for 15 minutes. But I have more filters of this type (around 10) to apply, in which case it will take 10 * 15 = 150 minutes, which is not acceptable.
Is there any way to improve my regular expression, or a better way to process really large files (more than 35 MB in size)?
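For comparison, here is a rough sketch of the alternative to reading the whole file into one string: stream the file one line at a time and check all the filters in a single pass, so ten filters do not mean ten passes over 35 MB. It is written in Python as an illustration only. The names filter_lines and filter_file are hypothetical, and each filter is assumed to be a function that takes a line and returns True when the line should be dropped. I have not measured whether this is actually faster on my logs.

```python
from typing import Callable, Iterable, Iterator

# A filter is assumed to be any function: line -> True if the line should be dropped.
LineFilter = Callable[[str], bool]


def filter_lines(lines: Iterable[str], filters: Iterable[LineFilter]) -> Iterator[str]:
    """Yield only the lines that none of the filters reject, in one pass."""
    active = list(filters)
    for line in lines:
        # any() stops at the first filter that rejects the line.
        if not any(reject(line) for reject in active):
            yield line


def filter_file(src_path: str, dst_path: str, filters: Iterable[LineFilter]) -> None:
    """Stream src_path to dst_path without holding the whole file in memory."""
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        # Iterating over the file object reads it line by line.
        dst.writelines(filter_lines(src, filters))
```

Because each line is short, a pattern applied per line has much less text to scan or backtrack over than one applied to a single 35 MB string.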