remove similar files



Jan 19, 2017, 2:43 AM

Post #1 of 7 (4019 views)
remove similar files

I have a directory with a large number of small text files. I would like to remove all (but one) of the files that are similar to each other by, say, more than 90%, using "String::Similarity".
I've managed to write a program that takes one file (keeps it), compares it to all the others and deletes the similar ones. But it's slow, and more importantly it doesn't keep the best file.
The best file would be the one which is similar to the most others, or something like that, so that in the end I am left with the smallest number of files which are fairly unique.
Can you please give me some advice?

Veteran / Moderator

Jan 19, 2017, 6:19 AM

Post #2 of 7 (4015 views)
Re: [orange] remove similar files [In reply to]


Concerning speed: depending on your input data, one thing you could do is start by comparing the sizes of the input files, and launch the similarity function on a pair of files only when their size difference is below a given upper bound (calculated, probably, as a ratio between the file sizes). Comparing file sizes is much faster than running the similarity function, so this can speed up your program considerably if it filters out a significant number of the string similarity calculations.
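For illustration, here is a minimal sketch of such a prefilter. The worth_comparing helper is my own invention, and it assumes that the underlying fstrcmp score is 2 * matches / (len_a + len_b); check that against your data before relying on it:

# Sketch only: decide from the lengths alone whether a pair of strings
# can possibly reach $limit. The number of matches is at most
# min(len_a, len_b), so the best possible score for the pair is
# 2 * min / (min + max); if even that is below $limit, the expensive
# similarity() call can be skipped.
sub worth_comparing {
    my ( $len_a, $len_b, $limit ) = @_;
    my ( $min, $max ) = $len_a < $len_b ? ( $len_a, $len_b ) : ( $len_b, $len_a );
    return $max == 0 || 2 * $min / ( $min + $max ) >= $limit;
}

# Usage: only run the expensive comparison when the lengths allow it.
# $sim = similarity( $a, $b, $limit )
#     if worth_comparing( length $a, length $b, $limit );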

Also, just in case, have you used the limit argument of the similarity function? According to the documentation:

You can add an optional argument $limit (default 0) that gives the minimum similarity the two strings must satisfy. similarity stops analyzing the string as soon as the result drops below the given limit, in which case the result will be invalid but lower than the given $limit. You can use this to speed up the common case of searching for the most similar string from a set by specifying the maximum similarity found so far.
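In other words, something along these lines (a sketch of the documented technique; the function and variable names are mine):

use String::Similarity;

# Find the most similar string in a set, passing the best score found
# so far as the $limit argument so similarity() can bail out early.
sub most_similar {
    my ( $needle, @candidates ) = @_;
    my ( $best, $best_score ) = ( undef, 0 );
    for my $candidate (@candidates) {
        my $score = similarity( $needle, $candidate, $best_score );
        # Results below $best_score are invalid but guaranteed lower,
        # so they can safely be discarded.
        ( $best, $best_score ) = ( $candidate, $score ) if $score > $best_score;
    }
    return ( $best, $best_score );
}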

Concerning your second question, it is phrased too vaguely to permit a definite answer; we would need many more details. This is certainly not easy, and it might even be an NP-hard problem, i.e. one that cannot be solved exactly in reasonable time once the number of data samples becomes even moderately large.

If your input files can easily be split into distinct groups, then sorting within the groups according to similarity might be a possible solution.

But we would need to know much more about the shape of your input, the number of samples, etc., as well as about the algorithm and code you are presently using.

(This post was edited by Laurent_R on Jan 19, 2017, 6:20 AM)


Jan 19, 2017, 12:24 PM

Post #3 of 7 (4005 views)
Re: [Laurent_R] remove similar files [In reply to]

OK, thanks. I'll load everything into RAM, as the files are small, and use those optimizations.


Jan 24, 2017, 2:49 AM

Post #4 of 7 (3956 views)
Re: [orange] remove similar files [In reply to]

I'm currently using this:


use strict;
use warnings;

use String::Similarity;
use File::Find::Rule;
use Data::Dumper; # print Dumper($foo, $bar);

my $limit = 0.9;
my ( @allpoints, @similars_all );

my @files = File::Find::Rule->file()->name(qr/.*$/i)->maxdepth(4)->in('lha_uniq/bad/'); # ->size('>41')
if ( !@files ) { print "no files found!\n"; exit; }

# Slurp every file into memory as [ name, contents ] pairs.
my @allfiles;
for my $filename (@files) {
    local $/;
    open my $fh, '<', $filename or die "Can't open $filename: $!\n";
    binmode $fh;
    push @allfiles, [ $filename, <$fh> ];
    close $fh;
}

# Longest contents first, so the size cut-off below stays meaningful.
@allfiles = sort { length $b->[1] <=> length $a->[1] } @allfiles;

for ( my $i = $#allfiles; $i > -1; $i-- ) {
    my @similars;
    my ( $name, $text ) = @{ $allfiles[$i] };
    my $max_delta = 20 * log( length $text ); # size-difference cut-off

    for ( my $i2 = $#allfiles; $i2 > -1; $i2-- ) {
        next if $allfiles[$i2][0] eq $name; # skip self
        next if abs( length( $allfiles[$i2][1] ) - length $text ) > $max_delta;
        my $similarity = similarity( $text, $allfiles[$i2][1], $limit );
        if ( $similarity > $limit ) {
            push @similars, $allfiles[$i2][0];
            splice @allfiles, $i2, 1; # drop the near-duplicate
            $i-- if $i2 < $i;         # splice shifted the current element down
        }
    } # next file2

    push @similars_all, @similars; # print Dumper \@similars; exit;
    push @allpoints, [ $name, scalar @similars, \@similars ];
} # next file

How can I make it faster? Thanks.


Jan 24, 2017, 9:16 PM

Post #5 of 7 (3940 views)
Re: [orange] remove similar files [In reply to]

I suspect that similarity is symmetric (i.e. similarity($string1, $string2) == similarity($string2, $string1)). If so, you could halve your processing time by only doing the tests one way.
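Something like this, reusing the @allfiles and $limit names from your code (a sketch, assuming the scores really are symmetric):

# Visit each unordered pair exactly once: the inner index starts just
# above the outer one, so ($i, $j) and ($j, $i) are never both tested.
my @similar_pairs;
for my $i ( 0 .. $#allfiles - 1 ) {
    for my $j ( $i + 1 .. $#allfiles ) {
        my $score = similarity( $allfiles[$i][1], $allfiles[$j][1], $limit );
        push @similar_pairs, [ $allfiles[$i][0], $allfiles[$j][0], $score ]
            if $score > $limit;
    }
}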
Good Luck,

Veteran / Moderator

Jan 24, 2017, 11:15 PM

Post #6 of 7 (3932 views)
Re: [orange] remove similar files [In reply to]

Did you try to profile your code? My gut feeling is that you'll find your program spending 95% or more of its time in the similarity function, but you should really run a profiler on your code with meaningful data to find out.

Once you know where the time is spent, you can concentrate on optimizations likely to yield something.
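For example, with Devel::NYTProf from CPAN (assuming your script is called dedup.pl; the name is just an illustration):

# Collect profiling data while the script runs:
perl -d:NYTProf dedup.pl

# Turn the collected data into an HTML report:
nytprofhtml --open

The report breaks the run time down per subroutine and per line, so you can see exactly where the time goes.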


Jan 25, 2017, 11:53 PM

Post #7 of 7 (3910 views)
Re: [Laurent_R] remove similar files [In reply to]


I've changed the inner loop to

  for (my $i2 = $i-1; $i2 > -1; $i2--) { 

There is some speed improvement, and the output seems to be the same.


String::Similarity::fstrcmp uses more than 99% of the time.

(This post was edited by orange on Jan 26, 2017, 12:21 AM)

