
Home: Perl Programming Help: Beginner:
remove similar files

 



orange
User

Jan 19, 2017, 2:43 AM

Post #1 of 7 (3698 views)
remove similar files

I have a directory with a large number of small text files. I would like to remove all files (but one) that are more than, say, 90% similar to each other, using "String::Similarity".
I've managed to write a program that takes one file (keeps it), compares it to all the others, and deletes the similar ones. But it's slow, and more importantly it doesn't keep the best file.
The best file would be the one that is similar to the most others, or something like that, so that in the end I have the smallest number of files, each fairly unique.
Um, can you please give me some advice?


Laurent_R
Veteran / Moderator

Jan 19, 2017, 6:19 AM

Post #2 of 7 (3694 views)
Re: [orange] remove similar files [In reply to]

Hi,

Concerning speed: depending on your input data, one thing you could do is start by comparing the sizes of the input files, and launch the similarity function on a pair of files only when their size difference is below a given upper bound (probably calculated as a ratio of the two file sizes). Computing file-size ratios is likely to be much faster than running the similarity function, and this can speed up your program considerably if it filters out a significant number of the string-similarity calculations.
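A minimal sketch of such a pre-filter, assuming a similarity score of the 2*matches/(len1+len2) form (which is how fstrcmp-style scores behave); the function name is made up for illustration:

```perl
use strict;
use warnings;

# Decide from lengths alone whether a pair can possibly reach $limit.
sub worth_comparing {
    my ($len_a, $len_b, $limit) = @_;
    # With a 2*matches/(len_a+len_b)-style score, the best achievable
    # similarity of two strings is 2*min/(min+max); skip the pair if
    # even that upper bound falls below $limit.
    my ($min, $max) = $len_a < $len_b ? ($len_a, $len_b) : ($len_b, $len_a);
    return 2 * $min / ($min + $max) >= $limit;
}

print worth_comparing(95, 100, 0.9) ? "compare\n" : "skip\n";   # compare
print worth_comparing(50, 100, 0.9) ? "compare\n" : "skip\n";   # skip
```

Only pairs that pass this cheap length test would then be handed to the expensive similarity call.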

Also, just in case, have you used the limit argument of the similarity function? According to the documentation:


Quote
You can add an optional argument $limit (default 0) that gives the minimum similarity the two strings must satisfy. similarity stops analyzing the string as soon as the result drops below the given limit, in which case the result will be invalid but lower than the given $limit. You can use this to speed up the common case of searching for the most similar string from a set by specifying the maximum similarity found so far.
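For example, when scanning a set for the best match, you can feed the best score found so far back in as $limit so that similarity can bail out early on hopeless candidates. A sketch using the module's documented three-argument form ($target and @candidates are made-up sample data):

```perl
use strict;
use warnings;
use String::Similarity;

my $target     = 'hello word';
my @candidates = ('hello world', 'hello there', 'goodbye');

my ($best, $best_score) = (undef, 0);
for my $cand (@candidates) {
    # Pass the best score so far: similarity() stops analyzing as soon
    # as the result can no longer beat it.
    my $score = similarity($target, $cand, $best_score);
    if ($score > $best_score) {
        ($best, $best_score) = ($cand, $score);
    }
}
print "best match: $best\n";
```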


Concerning your second question, it is phrased too vaguely to permit a definite answer, and we would need many more details. This is certainly not easy, and it might even be an NP-hard problem, i.e. one that cannot be solved exactly in reasonable time once the number of data samples becomes even moderately large.

If your input files can easily be split into distinct groups, then sorting the groups in accordance with their similarity might be a possible solution.

But we would need to know much more about the shape of your input, the number of samples, etc., as well as about the algorithm and code you are presently using.


(This post was edited by Laurent_R on Jan 19, 2017, 6:20 AM)


orange
User

Jan 19, 2017, 12:24 PM

Post #3 of 7 (3684 views)
Re: [Laurent_R] remove similar files [In reply to]

OK, thanks. I'll load everything into RAM, since the files are small, and use those optimizations.


orange
User

Jan 24, 2017, 2:49 AM

Post #4 of 7 (3635 views)
Re: [orange] remove similar files [In reply to]

I'm currently using this:

Code
#!/usr/bin/perl

use strict;
use warnings;

use String::Similarity;
use File::Find::Rule;

my $limit = 0.9;

my @files = File::Find::Rule->file()->name(qr/.*$/i)->maxdepth(4)->in('lha_uniq/bad/'); # ->size('>41')
if ( !@files ) { print "no files found!\n"; exit; }

# Slurp every file into memory as [ name, content ] pairs.
my @allfiles;
for my $filename (@files) {
    local $/;
    open my $fh, '<', $filename or die "Can't open $filename: $!\n";
    binmode $fh;
    push @allfiles, [ $filename, <$fh> ];
    close $fh;
}

# Longest content first.
@allfiles = sort { length $b->[1] <=> length $a->[1] } @allfiles;

my ( @allpoints, @similars_all );
for ( my $i = $#allfiles; $i > -1; $i-- ) {
    my @similars;
    my $log = 20 * log( length $allfiles[$i][1] ); # length tolerance

    for ( my $i2 = $#allfiles; $i2 > -1; $i2-- ) {
        next if $allfiles[$i][0] eq $allfiles[$i2][0]; # don't compare a file to itself
        next if abs( length( $allfiles[$i2][1] ) - length( $allfiles[$i][1] ) ) > $log;
        my $similarity = similarity( $allfiles[$i][1], $allfiles[$i2][1], $limit );
        if ( $similarity > $limit ) {
            push @similars, $allfiles[$i2][0];
            splice @allfiles, $i2, 1;  # drop the near-duplicate
            $i-- if $i2 < $i;          # keep $i pointing at the same file
        }
    } # next file2

    push @similars_all, @similars;
    push @allpoints, [ $allfiles[$i][0], scalar @similars, \@similars ];
} # next file


How can I make it faster? Thanks.


BillKSmith
Veteran

Jan 24, 2017, 9:16 PM

Post #5 of 7 (3619 views)
Re: [orange] remove similar files [In reply to]

I suspect that similarity is symmetric (i.e. similarity($string1, $string2) == similarity($string2, $string1)). If so, you could halve your processing time by only doing the tests one way.
Good Luck,
Bill


Laurent_R
Veteran / Moderator

Jan 24, 2017, 11:15 PM

Post #6 of 7 (3611 views)
Re: [orange] remove similar files [In reply to]

Did you try to profile your code? My gut feeling is that you'll find your program spends 95% or more of its time in the similarity function, but you should really run a profiler on your code, with meaningful data, to find out.

Once you know where the time is spent, you can concentrate on the optimizations likely to yield something.
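One common way to do that (an assumption on my part, since the thread doesn't name a profiler) is Devel::NYTProf from CPAN; dedup.pl below stands in for your script:

```shell
# Profile the script (Devel::NYTProf must be installed from CPAN)
perl -d:NYTProf dedup.pl

# Turn the resulting nytprof.out into browsable HTML, broken down
# per subroutine and per line
nytprofhtml --open
```

The per-subroutine report will show directly how much time is spent inside the similarity call versus your own loop code.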


orange
User

Jan 25, 2017, 11:53 PM

Post #7 of 7 (3589 views)
Re: [Laurent_R] remove similar files [In reply to]

Thanks.

@Bill
I've changed the inner loop to

Code
  for (my $i2 = $i-1; $i2 > -1; $i2--) { 
...
}

There is some speed improvement; the output seems to be the same.

@Laurent

String::Similarity::fstrcmp uses more than 99% time.


(This post was edited by orange on Jan 26, 2017, 12:21 AM)

 
 

