
Zhris
Enthusiast
Feb 27, 2018, 6:36 PM
Post #8 of 10
Re: [ewh006] File Comparison (2 files): Matching Occurences
Hi,
"What I have right now is counting instances but only within that one file (EMAIL)."
Your code as it stands simply tallies each line in the email, effectively counting duplicate lines. This is not what you want: you want to work with "words", not lines.
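To illustrate the difference between tallying lines and tallying words, here is a minimal sketch. The sample text in @email_data and the sub names tally_lines and tally_words are made up for illustration and are not taken from your script:

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for the contents of the email file.
my @email_data = (
    "Please verify your account\n",
    "Please verify your account\n",
    "Your account is suspended\n",
);

# Tally whole lines: this can only detect duplicate lines.
sub tally_lines {
    my @lines = @_;
    my %tally;
    for my $line (@lines) {
        chomp( my $copy = $line );
        $tally{$copy}++;
    }
    return %tally;
}

# Tally individual words, ignoring case: this is what keyword matching needs.
sub tally_words {
    my @lines = @_;
    my %tally;
    for my $line (@lines) {
        $tally{ lc $_ }++ for $line =~ /(\w+)/g;
    }
    return %tally;
}

my %by_line = tally_lines(@email_data);
my %by_word = tally_words(@email_data);

printf "%-31s %s\n", $_, $by_line{$_} for sort keys %by_line;
print "\n";
printf "%-31s %s\n", $_, $by_word{$_} for sort keys %by_word;
```

The first tally is keyed by entire lines, so a keyword can never be looked up in it. The second is keyed by lowercased words, which is the form you can compare against a keyword list.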
"My main problem is comparing the two files and matching the KEYWORDS vs. instances of those keywords found in the EMAIL. How do I make it take the email and compare it to matches within the KEYWORDS file?"
We have provided a couple of solutions to this, and separately some code to help you parse your keywords file now that we have seen its format. I understand they aren't particularly beginner-friendly solutions, but I was hoping you might be able to apply some of their features in your own code so we could work from there. You have your own style that you are set on using, so I will stick to it so as not to confuse things further. Using your current code as a basis, and trying to keep the process and notation straightforward, here is a working solution that you should be able to follow:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use autodie;

    #####################################################################################################################
    #Read Phishing Terms Into Array
    my $phishing_terms = 'C:\Users\HOUSTONE\Desktop\Perl\10X-PHISHING-TERMS.txt';
    open INFILE, "$phishing_terms";
    my @phish_data = <INFILE>;
    close INFILE ;

    #####################################################################################################################
    #Read EMAIL Into Array
    my $sample_email = 'C:\Users\HOUSTONE\DESKTOP\Perl\Sample_Email.txt';
    open INFILE, "$sample_email";
    my @email_data = <INFILE>;
    close INFILE;

    #####################################################################################################################
    #Matching Keywords
    my %counts;
    foreach my $str (@phish_data) {
        #The keyword is the second whitespace-separated field on each line
        ( undef, my $keyword ) = split( ' ', $str, 3 );
        #Skip blank or malformed lines that have no keyword field
        next unless defined $keyword && length $keyword;
        my $count = 0;
        for my $line (@email_data) {
            #Count every case-insensitive match of the keyword in this line
            my $count_b = () = $line =~ /($keyword)/ig;
            $count = $count + $count_b;
        }
        if ( $count > 0 ) {
            $counts{$keyword} = $count;
        }
    }
    foreach my $str (sort keys %counts) {
        printf "%-31s %s\n", $str, $counts{$str};
    }
    #####################################################################################################################

It's not without its limitations: it's not as efficient as it could be, the regular expression could be improved to support word boundaries (assuming you don't want to match keywords inside words), etc. For reference, here is a solution closer to the approach I would take, which builds a hash of keywords and then uses it to build a second hash of matching keywords:

    use strict;
    use warnings;
    use autodie;
    use Data::Dumper;

    my $filepath_keywords = 'C:\Users\chris\Desktop\Unorganized\phish\keywordsb.txt';
    my $filepath_email = 'C:\Users\chris\Desktop\Unorganized\phish\emails\four.txt';

    open my $filehandle_keywords, '<', $filepath_keywords;
    open my $filehandle_email, '<', $filepath_email;

    # Build a lookup hash of lowercased keywords: the first non-space run after ")"
    # on each line. Lines that do not match are skipped.
    my %keywords = map { /\)\s*(\S+)/ ? ( lc($1) => 1 ) : () } <$filehandle_keywords>;

    # Slurp the whole email in one read, then count each word that is a known keyword.
    local $/;
    my %keywords_b;
    $keywords_b{lc $_}++ for grep { exists $keywords{lc $_} } <$filehandle_email> =~ /(\w+)/g;

    close $filehandle_keywords;
    close $filehandle_email;

    print Dumper \%keywords_b;

Regards,
Chris
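
P.S. On the word-boundary limitation mentioned above, here is a minimal sketch of how the match in the first script could be tightened. The sub name count_keyword and the sample lines are hypothetical, and it assumes your keywords begin and end with word characters (letters, digits or underscore), since \b only anchors next to those:

```perl
use strict;
use warnings;

# Count case-insensitive, whole-word occurrences of $keyword in @lines.
# \Q...\E escapes any regex metacharacters inside the keyword,
# and \b requires a word boundary on each side of it.
sub count_keyword {
    my ( $keyword, @lines ) = @_;
    my $regex = qr/\b\Q$keyword\E\b/i;
    my $count = 0;
    for my $line (@lines) {
        my $matches = () = $line =~ /$regex/g;
        $count += $matches;
    }
    return $count;
}

# Hypothetical sample lines standing in for the email contents.
my @email_data = (
    "Urgent: your account needs urgent action\n",
    "Accounting will contact you\n",
);

for my $keyword (qw(account urgent)) {
    printf "%-31s %s\n", $keyword, count_keyword( $keyword, @email_data );
}
```

With the boundaries in place, "account" is no longer counted inside "Accounting". Whether that is the behaviour you want depends on your keyword list, so treat it as one option rather than the fix.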
(This post was edited by Zhris on Feb 27, 2018, 6:53 PM)