File Comparison (2 files): Matching Occurrences

 



ewh006
New User

Feb 25, 2018, 5:47 PM

Post #1 of 10 (6654 views)
File Comparison (2 files): Matching Occurrences

Good Evening,

I am new to Perl, but I do understand basic Linux operating system stuff.

So far with Perl, I've got a simple understanding of Variables, Lists/Arrays, and File I/O.

I am currently looking for a way to write the following in Perl, with the file names passed as positional parameters or even hard-coded within the script itself:

1) I want to take a text file of common phishing keywords and compare it to an email (for right now I'm just using a text file as an example email to get going).

2) I want to match any occurrences of the keywords (phishing-keywords file) in the email (email.txt) and print the matches and the number of occurrences.


The output doesn't need to be anything specific, just simple, easy to read and understand.

If anyone could help me get started etc. I would greatly appreciate it.

Thanks!!!


Zhris
Enthusiast

Feb 25, 2018, 8:30 PM

Post #2 of 10 (6646 views)
Re: [ewh006] File Comparison (2 files): Matching Occurrences

Hi,

This sounds like a homework question, apologies if it isn't.

- open the keywords file and read it into a hash.
- read each email file in the emails directory.
- open each email file and split it into individual words, or use a regular expression.
- test if each individual word exists in the keywords hash.
- print the result.

Here is an example of the approach I would take. It is perhaps beyond the complexity you require because of its heavy use of Path::Tiny, map and grep, so use it as a basis.


Code
use strict; 
use warnings;
use Path::Tiny;
$| = 1;
$, = "\n";
$\ = "\n\n";

my $dirpath_root = path( $0 )->parent;
my $filepath_keywords = $dirpath_root->child( 'keywords.txt' );
my $dirpath_emails = $dirpath_root->child( 'emails' );

# convert lines from keywords file into hash.
my %keywords = map { $_ => 1 } $filepath_keywords->lines( { chomp => 1 } );

# iterate over each email file in emails directory.
for my $filepath_email ( $dirpath_emails->children( qr/\.txt$/ ) )
{
    # read email file and push matching keywords into array.
    my @matches = grep { exists $keywords{$_} } map { /(\w+)/g } $filepath_email->slurp;

    # print email file and matching keywords.
    print $filepath_email, @matches;
}


Regards,

Chris


(This post was edited by Zhris on Feb 25, 2018, 8:32 PM)


BillKSmith
Veteran

Feb 25, 2018, 8:52 PM

Post #3 of 10 (6639 views)
Re: [ewh006] File Comparison (2 files): Matching Occurrences

There are two possible approaches (with several variations of each). In the first, you search the email for each 'word' in the text file. In the second, you search the text file for each word in the email. I cannot recommend either one without knowing more about your problem.

  • How many words are in your text file? Are you concerned about processing speed?

  • How do you define 'word' in the email? Is it any contiguous string of characters that matches an entry in the file? Is it a string of characters surrounded by whitespace? Anything else? Must we be concerned with case? Word-wrap? Plurals? Tense? ...

  • Do you plan to extract the text of the email before you start?

  • Does your email contain non-ASCII characters?
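
To make the "how do you define a word" question concrete, here is a minimal sketch that counts the same keyword under four different definitions. The sample sentence and the keyword are made up for illustration and are not taken from the poster's files.

```perl
use strict;
use warnings;

# Hypothetical sample text and keyword, chosen only to illustrate the questions above.
my $email   = "Your Account is locked. Check all accounts, then verify your account.";
my $keyword = 'account';

my %count;

# Any contiguous run of characters, case-sensitive: hits "accounts" and "account".
$count{substring} = () = $email =~ /\Q$keyword\E/g;

# Same, ignoring case: also hits "Account".
$count{substring_nocase} = () = $email =~ /\Q$keyword\E/gi;

# Whole words only (\b is a word boundary), ignoring case: "accounts" no longer counts.
$count{whole_word} = () = $email =~ /\b\Q$keyword\E\b/gi;

# Whitespace-delimited tokens compared exactly: "account." keeps its full stop and is missed.
$count{whitespace_token} = grep { lc $_ eq $keyword } split ' ', $email;

printf "%-17s %d\n", $_, $count{$_} for sort keys %count;
```

The four counts differ on this one sentence, which is why the definition needs to be settled before choosing an approach.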



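UPDATE:

    Chris has provided an example of the second method (each word in the email is looked up in the keyword list). Here is an example of the first (each keyword is searched for in the email):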


    Code
    C:\Users\Bill\forums\guru>type ewh006.pl 
    use strict;
    use warnings;

    my $email = \do{ my $mail = << 'END_EMAIL' };
    Here is some
    arbitrary text. It
    really does not matter much
    what it says -its just email.
    END_EMAIL

    my $text = \do{ my $word_list = << 'END_TEXT' };
    some
    text
    fum
    email
    foo
    END_TEXT


    open my $TEXT, '<', $text or die "Cannot open text file:$!";
    my @words = <$TEXT>;
    chomp @words;
    my $re = join '|', @words;
    my $regex = qr/$re/;


    open my $EMAIL, '<', $email or die "Cannot open email:$!";
    my $body = do{ local $/ = undef; <$EMAIL>};

    my @matched_words = $body =~ /($regex)/g;

    do{local $, = "\n"; print @matched_words;};


    C:\Users\Bill\forums\guru>perl ewh006.pl
    some
    text
    email
    C:\Users\Bill\forums\guru>

    Good Luck,
    Bill

    (This post was edited by BillKSmith on Feb 26, 2018, 12:06 PM)


    ewh006
    New User

    Feb 26, 2018, 4:18 PM

    Post #4 of 10 (6619 views)
    Re: [BillKSmith] File Comparison (2 files): Matching Occurrences

    Good Evening,

    Thanks for getting back Chris and Bill!

    I am quite new to Perl, although I have some prior experience with Python and Bash. However, your methods seem a bit complex for my level of knowledge.

    I have gotten started based on reviewing your replies to my post and have been trying a different, maybe simpler, route.

    So far I have read the PHISHING TERMS (shown in the screenshot) into an array. I now need to figure out a method for using that array to find matches in the Sample_Email.txt file (also shown in the screenshot). As for the PHISHING TERMS document, I may shorten that list to just one instance of each word, instead of the variations, because I'm not sure I even need them.

    Basically,

    1. Read in all or one line from the email

    2. Split

    3. Compare the words to the phishing terms array

    4. Increment a counter

    5. Move to the next line (email)
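
The five steps above can be sketched in plain Perl roughly as follows. This is only an illustration: the term list and email lines are hypothetical stand-ins for the two files, and the terms are stored in a hash (rather than an array) so that each comparison is a single lookup.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the two files; in practice these would be read with open().
my @phish_terms = qw(account password verify urgent);
my @email_lines = (
    "URGENT: please verify your account today.\n",
    "Reply with your password to keep the account open.\n",
);

# Put the terms in a hash so each comparison is a single lookup.
my %is_term = map { lc($_) => 1 } @phish_terms;

my %count;
for my $line (@email_lines) {                        # steps 1 and 5: one line at a time
    for my $token ( split ' ', $line ) {             # step 2: split on whitespace
        ( my $word = lc $token ) =~ s/^\W+|\W+$//g;  # trim punctuation such as "today."
        $count{$word}++ if $is_term{$word};          # steps 3 and 4: compare, then count
    }
}

printf "%-10s %d\n", $_, $count{$_} for sort keys %count;
```

Trimming the punctuation is one possible choice; whether "account." should count as "account" is exactly the kind of question Bill raises earlier in the thread.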

    If there is an easier way using grep or AWK that would be nice too.

    Thanks for the help, much appreciated!!!


    (This post was edited by ewh006 on Feb 26, 2018, 4:20 PM)
    Attachments: Screenshot_1.png (100 KB)


    Zhris
    Enthusiast

    Feb 26, 2018, 9:38 PM

    Post #5 of 10 (6609 views)
    Re: [ewh006] File Comparison (2 files): Matching Occurrences

    Hi,

    You're welcome.

    I see from your screenshot that you have each line from your keywords file in the @data array. The array will contain data you don't need, so your next step is to extract what you do need. You don't need each case variation of each word; it is probably enough to use the i modifier on a regular expression later, so that words match regardless of case.


    Code
    use strict; 
    use warnings;
    use Cwd;
    use Data::Dumper;

    my $filepath_keywords = getcwd . '/keywordsb.txt';

    open my $filehandle_keywords, '<', $filepath_keywords or die "cannot open '$filepath_keywords': $!";
    my @keywords = map { /\)\s*(\w+)/ } <$filehandle_keywords>;
    close $filehandle_keywords;

    print Dumper \@keywords;


    This is very similar to what you have done so far, with a couple of improvements.
    - It uses the recommended three-argument form of open, with a lexically scoped filehandle variable and an error check.
    - It filters each line with map to extract what you need ( the first word that follows ") " ).

    Check screenshot for output.
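
Since the screenshot is not reproduced here, the following sketch shows what the map filter does to a few sample lines. The "N) word Word WORD" layout is an assumption inferred from the regular expression; the actual file contents are only visible in the attachment.

```perl
use strict;
use warnings;

# Hypothetical lines in the assumed "N) word Word WORD" layout.
my @lines = (
    "1) account Account ACCOUNT\n",
    "2) password Password PASSWORD\n",
    "\n",                               # blank line: no match, so map returns nothing for it
    "3) verify Verify VERIFY\n",
);

# In list context a successful match returns its captures,
# and a failed match returns an empty list.
my @keywords = map { /\)\s*(\w+)/ } @lines;

print "$_\n" for @keywords;
```

Because a failed match contributes nothing to the list, blank or malformed lines are dropped without any extra code.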

    I suggest you read through Bill's solution carefully and understand what it is doing at each step. It is a general solution to your overall task, and further requirements can easily be fitted in. Consider each of his questions too; they are important in deciding which approach is best for your needs, including any fine-tuning necessary to make the code production ready.

    The five steps you provided are the right train of thought. Reading the entire email in (slurping) makes it easier to process in your case, particularly with such small files. You'll find splitting won't be as accurate as using a regular expression: take, for example, words that end in a full stop. Splitting isn't even a necessary step with Bill's solution. Perl has an excellent loop system that avoids the need for explicit counters to iterate over data in typical cases.

    There are of course solutions that incorporate grep and/or awk, though Perl is just as capable, and this is a great little task to help you develop your Perl skills.

    Regards,

    Chris


    (This post was edited by Zhris on Feb 26, 2018, 10:04 PM)
    Attachments: output.png (67.5 KB)


    BillKSmith
    Veteran

    Feb 27, 2018, 7:27 AM

    Post #6 of 10 (6596 views)
    Re: [ewh006] File Comparison (2 files): Matching Occurrences

    It is hard to add to Chris's excellent reply. However, I can point out that your example does not contain any of the special cases that I mentioned before. Both solutions should give the same result for this example. It is important to consider all of those cases and create examples which test all the ones which matter. You do not want the pointy-haired boss to be compromised because you ignored a possibility. This is especially bad if it is difficult or impossible to fix in the implementation you have chosen.
    Good Luck,
    Bill


    ewh006
    New User

    Feb 27, 2018, 12:35 PM

    Post #7 of 10 (6582 views)
    Re: [Zhris] File Comparison (2 files): Matching Occurrences

    I've gained a little knowledge through your and Bill's posts, and I appreciate it. I still have not come up with a solution, however.

    My main problem is comparing the two files and matching the KEYWORDS vs. instances of those keywords found in the EMAIL.

    What I have right now is counting instances but only within that one file (EMAIL). How do I make it take the email and compare it to matches within the KEYWORDS file?


    Code
    #!/usr/bin/perl 
    #use strict;
    #use warnings;
    #use autodie;

    #####################################################################################################################

    #Read Phishing Terms Into Array

    $phishing_terms = 'C:\Users\HOUSTONE\Desktop\Perl\10X-PHISHING-TERMS.txt';
    open INFILE, "$phishing_terms";
    @phish_data = <INFILE>;
    close INFILE ;

    #####################################################################################################################

    #Read EMAIL Into Array

    $sample_email = 'C:\Users\HOUSTONE\DESKTOP\Perl\Sample_Email.txt';
    open INFILE, "$sample_email";
    @email_data = <INFILE>;
    close INFILE;

    #####################################################################################################################

    #Matching Keywords

    my @strings = @email_data;

    my %count;

    foreach my $str (@strings) {
        $count{$str}++;
    }

    foreach my $str (sort keys %count) {
        printf "%-31s %s\n", $str, $count{$str};
    }



    #####################################################################################################################



    (This post was edited by ewh006 on Feb 27, 2018, 12:38 PM)


    Zhris
    Enthusiast

    Feb 27, 2018, 6:36 PM

    Post #8 of 10 (6568 views)
    Re: [ewh006] File Comparison (2 files): Matching Occurrences

    Hi,


    Quote
    What I have right now is counting instances but only within that one file (EMAIL).


    Your code as it stands simply tallies each line in the email, effectively counting duplicate lines. This is obviously not what you want: you want to work with "words", not lines.
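
As a small illustration of the difference, the sketch below tallies the same data once by line and once by word. The three-line email is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical three-line email with one repeated line.
my @email_data = ( "verify your account\n", "click here\n", "verify your account\n" );

# Keyed by whole line, as in the posted code: this counts repeated lines.
my %line_count;
$line_count{$_}++ for @email_data;

# Keyed by word: split each line first, then count the pieces.
my %word_count;
for my $line (@email_data) {
    $word_count{ lc $_ }++ for split ' ', $line;
}

print scalar( keys %line_count ), " distinct lines, ",
      scalar( keys %word_count ), " distinct words\n";
printf "%-10s %d\n", $_, $word_count{$_} for sort keys %word_count;
```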


    Quote
    My main problem is comparing the two files and matching the KEYWORDS vs. instances of those keywords found in the EMAIL.

    How do I make it take the email and compare it to matches within the KEYWORDS file?


    We have provided a couple of solutions to this and, separately, code to help you parse your keywords file now that we have seen its format. I understand they aren't particularly beginner-friendly solutions, but I was hoping you might be able to apply some of their features in your own code so that we could work from there.

    You have your own style that you are set on using, and I will stick to it so as not to confuse things further. Using your current code as a basis, and trying to keep the process and notation straightforward, here is a working solution that you should be able to follow:


    Code
    #!/usr/bin/perl  
    use strict;
    use warnings;
    use autodie;

    #####################################################################################################################

    #Read Phishing Terms Into Array

    my $phishing_terms = 'C:\Users\HOUSTONE\Desktop\Perl\10X-PHISHING-TERMS.txt';
    open INFILE, "$phishing_terms";
    my @phish_data = <INFILE>;
    close INFILE ;

    #####################################################################################################################

    #Read EMAIL Into Array

    my $sample_email = 'C:\Users\HOUSTONE\DESKTOP\Perl\Sample_Email.txt';
    open INFILE, "$sample_email";
    my @email_data = <INFILE>;
    close INFILE;

    #####################################################################################################################

    #Matching Keywords

    my %counts;

    foreach my $str (@phish_data) {
        ( undef, my $keyword ) = split( ' ', $str, 3 );

        my $count = 0;
        for my $line (@email_data) {
            my $count_b = () = $line =~ /($keyword)/ig;
            $count = $count + $count_b;
        }

        if ( $count > 0 ) {
            $counts{$keyword} = $count;
        }
    }

    foreach my $str (sort keys %counts) {
        printf "%-31s %s\n", $str, $counts{$str};
    }



    #####################################################################################################################


    It's not without its limitations. It's not as efficient as it could be, the regular expression could be improved to support word boundaries (assuming you don't want to match keywords inside words), etc.

    For reference purposes, here is a solution on par with the approach I might take: build a hash of keywords, then use that to build a second hash of matching keywords:


    Code
    use strict; 
    use warnings;
    use autodie;
    use Data::Dumper;

    my $filepath_keywords = 'C:\Users\chris\Desktop\Unorganized\phish\keywordsb.txt';
    my $filepath_email = 'C:\Users\chris\Desktop\Unorganized\phish\emails\four.txt';

    open my $filehandle_keywords, '<', $filepath_keywords;
    open my $filehandle_email, '<', $filepath_email;

    my %keywords = map { /\)\s*(\S+)/; lc $1 => 1 } <$filehandle_keywords>;
    local $/;
    my %keywords_b;
    $keywords_b{lc $_}++ for grep { exists $keywords{lc $_} } <$filehandle_email> =~ /(\w+)/g;

    close $filehandle_keywords;
    close $filehandle_email;

    print Dumper \%keywords_b;


    Regards,

    Chris


    (This post was edited by Zhris on Feb 27, 2018, 6:53 PM)


    ewh006
    New User

    Feb 27, 2018, 8:14 PM

    Post #9 of 10 (6551 views)
    Re: [Zhris] File Comparison (2 files): Matching Occurrences

    Awesome. Quick question: I assume the output of '938' is the whitespace? I obviously don't want to count that.

    Thanks, again,

    Eric


    (This post was edited by ewh006 on Feb 27, 2018, 8:15 PM)
    Attachments: whitespacequestion.png (10.6 KB)


    Zhris
    Enthusiast

    Feb 27, 2018, 11:32 PM

    Post #10 of 10 (6542 views)
    Re: [ewh006] File Comparison (2 files): Matching Occurrences

    Hi,

    That sounds correct. I'm guessing your keywords/phishing terms file has empty lines in it. With warnings enabled, I would have thought it would complain about an uninitialized variable. If this assumption is correct, you can skip empty lines using something along the lines of next if $str =~ /^\s*$/ at the top of the loop over the phishing terms (where $str is the current line of the terms file). Otherwise, attach your keywords and email files and we can test your actual raw data.
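
One plausible explanation for a large number like 938, sketched below with made-up data: for a blank line, split yields no second field, so $keyword is undefined, the pattern becomes /()/, and that matches the empty string at every position of every email line. The sketch skips blank lines and, separately, counts what the empty pattern would have matched.

```perl
use strict;
use warnings;

# Hypothetical data: a terms list containing a blank line, and a two-line email.
my @phish_data = ( "1) account Account\n", "\n", "2) verify Verify\n" );
my @email_data = ( "Verify your account.\n", "Account locked.\n" );

my %counts;
for my $str (@phish_data) {
    next if $str =~ /^\s*$/;               # skip blank lines before splitting

    ( undef, my $keyword ) = split ' ', $str, 3;
    next unless defined $keyword;          # skip lines with no second field

    my $count = 0;
    for my $line (@email_data) {
        my $hits = () = $line =~ /\Q$keyword\E/ig;
        $count += $hits;
    }
    $counts{$keyword} = $count if $count > 0;
}

# What an empty pattern would have counted: one match per position in each line.
my $empty_matches = 0;
for my $line (@email_data) {
    my $hits = () = $line =~ /()/g;
    $empty_matches += $hits;
}

printf "%-31s %s\n", $_, $counts{$_} for sort keys %counts;
print "an empty pattern would have matched $empty_matches times\n";
```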

    P.S. If and when you come to processing multiple emails, the code will definitely need a rework; it really needs to pre-process the keywords beforehand, as our other solutions do. Feel free to ask for further help if you get stuck.
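
A rough sketch of what pre-processing the keywords might look like: the pattern is built once and then reused for every email. The keyword list, the email names and their contents are hypothetical, and whole-word, case-insensitive matching is an assumption rather than a stated requirement.

```perl
use strict;
use warnings;

# Hypothetical keywords and emails; real code would read these from files.
my @keywords = qw(account password verify);
my %emails   = (
    'one.txt' => "Please verify your account.\nVerify now!\n",
    'two.txt' => "Lunch at noon?\n",
);

# Build the pattern once, outside the email loop, so every email reuses it.
# quotemeta escapes any regex metacharacters that a keyword might contain.
my $alternation = join '|', map { quotemeta($_) } @keywords;
my $regex       = qr/\b($alternation)\b/i;

my %counts;    # $counts{email name}{keyword} = number of occurrences
for my $name ( sort keys %emails ) {
    $counts{$name} = {};
    $counts{$name}{ lc $1 }++ while $emails{$name} =~ /$regex/g;
}

for my $name ( sort keys %counts ) {
    print "$name\n";
    printf "  %-10s %d\n", $_, $counts{$name}{$_} for sort keys %{ $counts{$name} };
}
```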

    Chris


    (This post was edited by Zhris on Feb 27, 2018, 11:42 PM)

     
     

