CGI/Perl Guide | Learning Center | Forums | Advertise | Login
Site Search: in

  Main Index MAIN
INDEX
Search Posts SEARCH
POSTS
Who's Online WHO'S
ONLINE
Log in LOG
IN

Home: Perl Programming Help: Advanced:
Digest::MD5::digest woes

 



1arryb
User

Feb 23, 2009, 9:28 AM

Post #1 of 16 (7913 views)
Digest::MD5::digest woes Can't Post

Hi all,

I have a data warehousing application that identifies unique files by the MD5 of their content. I am using Digest::MD5 to calculate the digests. The database uses the binary digest (as returned by Digest::MD5::digest(), but I'd like the hex string too for logging.

I think I have discovered a bug in Digest::MD5. If I call its digest() method, subsequent calls to hexdigest() return a bogus value. The test program below illustrates this.

Please check my work before I file a bug report. In the meantime, I suppose I can workaround the problem by getting the hex string first.

On a related note, It would be great if I could independently check the values returned by Digest::MD5::digest(). I have some old code that purports to convert digests to strings, but it just does 'sprintf("%x", $digest)'. This doesn't seem to work anymore, if it ever did. I've been looking around at various techniques and haven't found an answer that produces the same string as md5sum. Now I'm getting worried that the hash itself may be wrong. What's the right way?

I have Perl 5.10 and the latest (2.38) version of Digest::MD5.
I've tried this on an x86_64 machine running Linux FC9 and on an x86 box running Windows XP and cygwin.

UPDATE
*deleted*

UPDATE2
I thought the problem transcended Digest::MD5 instances but I forgot to reinitialize the file handle. Never mind.

Thanks,

Larry


Code
#!/usr/bin/perl 

use Digest::MD5;

my $file = $ARGV[0];
open (IN, "<$file") or die "can't open $file";
# get a reference digest string.
my $refhex = `md5sum $file`;
chomp($refhex);
print "md5sum: $refhex\n";
my $md5obj = Digest::MD5->new;
$md5obj->addfile(*IN);
# Get the digest hex string from Digest::MD5.
my $hexdigest = $md5obj->hexdigest;
print "hexdig1: $hexdigest\n";
# Get the binary digest from Digest::MD5
print "[calling Digest::MD5::digest]\n";
my $digest1 = $md5obj->digest;
$hexdigest = $md5obj->hexdigest;
print "hexdig2: $hexdigest\n";
# but the digest method seems to return the same value each time.
my $digest2 = $md5obj->digest;
print "[internal digest is stable call to call]\n" if $digest1 == $digest2;



(This post was edited by 1arryb on Feb 23, 2009, 10:00 AM)


FishMonger
Veteran / Moderator

Feb 23, 2009, 9:45 AM

Post #2 of 16 (7908 views)
Re: [1arryb] Digest::MD5::digest woes [In reply to] Can't Post

I have not run your test script, but 1 issue (there are several) I see is that you're passing a typeglob instead of a filehandle.


1arryb
User

Feb 23, 2009, 9:52 AM

Post #3 of 16 (7905 views)
Re: [FishMonger] Digest::MD5::digest woes [In reply to] Can't Post

Just following the Digest::MD5 perldoc. I'll try a real filehandle and get back to you. Thx.


1arryb
User

Feb 23, 2009, 10:01 AM

Post #4 of 16 (7903 views)
Re: [1arryb] Digest::MD5::digest woes [In reply to] Can't Post

FishMonger,

Tried it using a IO::File handle. No difference.

Larry


FishMonger
Veteran / Moderator

Feb 23, 2009, 10:53 AM

Post #5 of 16 (7892 views)
Re: [1arryb] Digest::MD5::digest woes [In reply to] Can't Post


Code
#!/usr/bin/perl 

use strict;
use warnings;

use Digest::MD5;

my $file = $ARGV[0];
open (my $IN, "<$file") or die "can't open $file";
binmode $IN;

# get a reference digest string.
my $refhex = `md5sum $file`;
chomp($refhex);
print "md5sum: $refhex\n";

my $md5obj = Digest::MD5->new;
$md5obj->addfile($IN);

# Get the digest hex string from Digest::MD5.
my $hexdigest1 = $md5obj->hexdigest;
print "hexdig1: $hexdigest1\n";

# Get the binary digest from Digest::MD5
print "[calling Digest::MD5::digest]\n";
my $digest1 = $md5obj->digest;
print "digest1: $digest1\n";


seek $IN, 0,0;
$md5obj->reset;
$md5obj->addfile($IN);

my $hexdigest2 = $md5obj->hexdigest;
print "hexdig2: $hexdigest2\n";

# but the digest method seems to return the same value each time.
my $digest2 = $md5obj->digest;
print "digest2: $digest2\n";
print "[internal digest is stable call to call]\n" if $digest1 eq $digest2;



(This post was edited by FishMonger on Feb 23, 2009, 10:54 AM)


KevinR
Veteran


Feb 23, 2009, 10:54 AM

Post #6 of 16 (7892 views)
Re: [1arryb] Digest::MD5::digest woes [In reply to] Can't Post

Why are you doing this twice?


Code
my $hexdigest = $md5obj->hexdigest;  
.
.
.
.
$hexdigest = $md5obj->hexdigest;


Turn on warnings when you run your script:


Code
use warnings;


Edit:

Looks like Fish may have figured it out.
-------------------------------------------------


(This post was edited by KevinR on Feb 23, 2009, 10:55 AM)


1arryb
User

Feb 23, 2009, 12:01 PM

Post #7 of 16 (7889 views)
Re: [FishMonger] Digest::MD5::digest woes [In reply to] Can't Post

Hi FishMonger,

First of all, thanks a lot for replying! Now, onto the meat of your comments:

1. Cleaner test code. Ok, I'll try to do better in future posts.
2. use warnings - yes, good idea.
3. my $IN vs *IN, cool, but irrelevant.
4. use binmode. Agree, safer. Turns out to be Irrelevant for linux/cygwin though.
5. seek to file start before recalculating. Yes, it works, but it implies that Digest::MD5 needs to read all of the bytes for each call to *digest(). That's fine for passwords, but not so good for calculating the digest of large file contents, as I'm doing.

I guess the bottom line is that I (naively, as it turns out) thought that Digest::MD5 was caching the digest (or content) at addfile() then using the cached representation for subsequent calls to *digest(). I will recode accordingly. Perhaps something like this:


Code
#!/usr/bin/perl 

use strict;
use warnings;
use Digest::MD5;

# Calculate the md5 sum for a file.
# Note, $d and $s are references to the returned binary and hex string digests, respectively.
#
sub getFileDigest {
my ($f, $d, $s) = @_;

open (my $fh, "<$f") or return -1;
binmode($fh);

my $md5 = Digest::MD5->new;
$md5->addfile($fh);

$$d = $md5->digest;

# Don't use hexdigest() after digest().
$$s = '';
map { $$s .= sprintf("%02x", $_) } (unpack("C*", $$d));
close($fh);

return 0;
}

my $file = $ARGV[0];
my ($digest, $digestStr) = '';

if ( getFileDigest($file, \$digest, \$digestStr) ) {
print "Oops! getFileDigest() failed.\n";
} else {
# Reference MD5.
my $tStr = `md5sum $file`;
$tStr =~ tr/\r\n//d;
$tStr =~ s/ .*//;
print "md5sum: $tStr\n";

# Digest::MD5->digest, converted to a string inside the function.
print "hexdigest: $digestStr\n";

# Same as the above, just checking that the binary digest made it
# back to the main program ok.
$tStr = '';
map { $tStr .= sprintf("%02x", $_) } (unpack("C*", $digest));
print "digest: $tStr\n";
}



FishMonger
Veteran / Moderator

Feb 23, 2009, 4:43 PM

Post #8 of 16 (7878 views)
Re: [1arryb] Digest::MD5::digest woes [In reply to] Can't Post

When I posted the adjusted code I decided not to go over the details and I didn't make all of the changes that I thought it needed. Basically what I'm talking about is writing code that not only works but is clean, efficient and overall good quality.

Your updated script is a bit obfuscated and does a few odd things.

You name your sub getFileDigest and call it it boolean context but if it fails (or should I say if the open call fails), you return a true value and if it succeeds, you return a false value. That's backwards from what it should do.

When an open call fails, you should include the reason. It is also better to use the 3 arg form of open.

Why are you applying the exact same map/unpack statement against the same var?

The use of `md5sum $file` only serves to make the script less portable.

The use of binmode is recommended in the module's doc and altho it doesn't give a clear reason part of the reason would be for portability.

I agree that seeking to the beginning of the file and recalculate the digest is inefficient. I'd need to run a few tests, but there probably is a way to run both hexdigest() and digest() without using seek and a second addfile.

Here's a cleaned up version that accomplishes the exact same thing as your script.


Code
#!/usr/bin/perl 

use strict;
use warnings;
use Digest::MD5;

my $file = $ARGV[0] || die "Usage: $0 <filename>";

open my $FH, '<', $file or die "can't open '$file' $!";
binmode $FH;

my $md5sum = (split(/\s+/, `md5sum $file`))[0];

my $md5 = Digest::MD5->new;
$md5->addfile($FH);

my $digest = $md5->digest;
my $digestStr1;
map { $digestStr1 .= sprintf("%02x", $_) } (unpack("C*", $digest));

my $digestStr2 = $digestStr1;

print "digest: $digest\n",
"md5sum: $md5sum\n",
"hexdigest: $digestStr1\n",
"digeststr: $digestStr2\n";



FishMonger
Veteran / Moderator

Feb 23, 2009, 6:52 PM

Post #9 of 16 (7874 views)
Re: [1arryb] Digest::MD5::digest woes [In reply to] Can't Post

The best solution is to use the functional interface instead of the OO interface.

Code
#!/usr/bin/perl 

use strict;
use warnings;
use Digest::MD5 qw< md5 md5_hex md5_base64 >;

@ARGV == 2 or die "Usage: $0 <file1> <file2>";
my $file1 = $ARGV[0];
my $file2 = $ARGV[1];

open my $FH1, '<', $file1 or die "can't open '$file1' $!";
binmode $FH1;

open my $FH2, '<', $file2 or die "can't open '$file2' $!";
binmode $FH2;

for (1..5) {
my $md5digest = md5($FH1);
my $md5hex = md5_hex($FH1);
my $md5base64 = md5_base64($FH1);
print "digest: $md5digest\n",
"hex: $md5hex\n",
"base64: $md5base64\n\n";
}

for (1..5) {
my $md5digest = md5($FH2);
my $md5hex = md5_hex($FH2);
my $md5base64 = md5_base64($FH2);
print "digest: $md5digest\n",
"hex: $md5hex\n",
"base64: $md5base64\n\n";
}



1arryb
User

Feb 24, 2009, 7:08 AM

Post #10 of 16 (7865 views)
Re: [FishMonger] Digest::MD5::digest woes [In reply to] Can't Post

Hi FishMonger,

All your comments are appropos; however, the getFileDigest subroutine is part of a larger module. The driver program is throw-away code I whipped up for this forum thread only.

I think you are treading perilously close to a religious argument w.r.t. whether methods should fail with true or false Laugh In this case, I stuck to the convention used elsewhere in my application.

Agree that there's really no point in the Digest::MD5 OO interface for my use case. In fact, one could say that my problems with this module started with (erroneous) assumptions about the stateful-ness of the object.

Thanks again for your help.


1arryb
User

Feb 24, 2009, 7:50 AM

Post #11 of 16 (7861 views)
Re: [FishMonger] Digest::MD5::digest woes [In reply to] Can't Post

BTW, this program doesn't work Shocked. In order to use the functional interface to Digest::MD5, you have to pass the file content, not the file handle, as in:


Code
... 

undef $/;
my $content = <$FH1>;
$/ = "\n";
my $md5digest = md5($content);

...


That makes the OO I/F mandatory for me, since it appears to stream the content and I'm dealing with very large files here.

Cheers,

Larry


FishMonger
Veteran / Moderator

Feb 24, 2009, 8:08 AM

Post #12 of 16 (7858 views)
Re: [1arryb] Digest::MD5::digest woes [In reply to] Can't Post


Quote
I think you are treading perilously close to a religious argument w.r.t. whether methods should fail with true or false


Not at all. But the return value should reflect the context in which the sub is being used. If the the sub was called FailedToGetFileDigest then your return values would make sense. But returning true when you mean false and false when you mean true only adds confusion to anyone else that needs to work with the code. At the very least a comment should be added to clarify the meaning of the return value.


Quote
BTW, this program doesn't work Shocked. In order to use the functional interface to Digest::MD5, you have to pass the file content, not the file handle


Wrong.


Quote
[root@localhost ~]# cat fish.pl
#!/usr/bin/perl

use strict;
use warnings;
use Digest::MD5 qw< md5 md5_hex md5_base64 >;

@ARGV == 2 or die "Usage: $0 <file1> <file2>";
my $file1 = $ARGV[0];
my $file2 = $ARGV[1];

open my $FH1, '<', $file1 or die "can't open '$file1' $!";
binmode $FH1;

open my $FH2, '<', $file2 or die "can't open '$file2' $!";
binmode $FH2;

print "md5 digest values for $file1 and $file2\n\n";


for (1..5) {
my $md5digest = md5($FH1);
my $md5hex = md5_hex($FH1);
my $md5base64 = md5_base64($FH1);
print "digest: $md5digest\n",
"hex: $md5hex\n",
"base64: $md5base64\n\n";
}

for (1..5) {
my $md5digest = md5($FH2);
my $md5hex = md5_hex($FH2);
my $md5base64 = md5_base64($FH2);
print "digest: $md5digest\n",
"hex: $md5hex\n",
"base64: $md5base64\n\n";
}

[root@localhost ~]# ./fish.pl /etc/passwd /etc/shadow
md5 digest values for /etc/passwd and /etc/shadow

digest: );dA"8
hex: 112915d8c23bc96441f5e20f22023880
base64: ESkV2MI7yWRB9eIPIgI4gA

digest: );dA"8
hex: 112915d8c23bc96441f5e20f22023880
base64: ESkV2MI7yWRB9eIPIgI4gA

digest: );dA"8
hex: 112915d8c23bc96441f5e20f22023880
base64: ESkV2MI7yWRB9eIPIgI4gA

digest: );dA"8
hex: 112915d8c23bc96441f5e20f22023880
base64: ESkV2MI7yWRB9eIPIgI4gA

digest: );dA"8
hex: 112915d8c23bc96441f5e20f22023880
base64: ESkV2MI7yWRB9eIPIgI4gA

digest: vnRb7O
hex: 766ebe52628fac8837149bdac6ffd24f
base64: dm6+UmKPrIg3FJvaxv/STw

digest: vnRb7O
hex: 766ebe52628fac8837149bdac6ffd24f
base64: dm6+UmKPrIg3FJvaxv/STw

digest: vnRb7O
hex: 766ebe52628fac8837149bdac6ffd24f
base64: dm6+UmKPrIg3FJvaxv/STw

digest: vnRb7O
hex: 766ebe52628fac8837149bdac6ffd24f
base64: dm6+UmKPrIg3FJvaxv/STw

digest: vnRb7O
hex: 766ebe52628fac8837149bdac6ffd24f
base64: dm6+UmKPrIg3FJvaxv/STw



1arryb
User

Feb 24, 2009, 8:26 AM

Post #13 of 16 (7854 views)
Re: [FishMonger] Digest::MD5::digest woes [In reply to] Can't Post

Easy does it, Fish. I didn't say your program didn't run, just that it doesn't work. Try:


Code
use strict; 
use warnings;
use Digest::MD5 qw< md5 md5_hex md5_base64 >;

@ARGV == 1 or die "Usage: $0 <file1>";
my $file1 = $ARGV[0];

open my $FH1, '<', $file1 or die "can't open '$file1' $!";
binmode $FH1;

for (1..5) {
my $refdigest = `md5sum $file1`;
$refdigest = (split(/\s+/, $refdigest))[0];

my $md5digest = md5($FH1);
my $md5hex = md5_hex($FH1);
my $md5base64 = md5_base64($FH1);
print "refdigest: $refdigest\n",
"digest: $md5digest\n",
"hex: $md5hex\n",
"base64: $md5base64\n\n";
}


If you run the program, note that the "refdigest" and "hex" values don't match.

Respectfully,

Larry


FishMonger
Veteran / Moderator

Feb 24, 2009, 9:31 AM

Post #14 of 16 (7850 views)
Re: [1arryb] Digest::MD5::digest woes [In reply to] Can't Post

Well, it looks like I was mistaken, it appears that the functional interface only works on the first line when passed the filehandle.

Here's an updated version where all 3 approaches (functional, OO, and system call) return the same hex value. The only thing left to do is benchmark it to see which is the most efficient.


Code
#!/usr/bin/perl 

use strict;
use warnings;
use Digest::MD5 qw< md5 md5_hex md5_base64 >;

@ARGV or die "Usage: $0 <file>";
my $file = $ARGV[0];
my $size = -s $file;

print "md5 values for $file with a size of $size\n\n";

my $refdigest = `md5sum $file`;
$refdigest = (split(/\s+/, $refdigest))[0];

open my $FH, '<', $file or die "can't open '$file' $!";
binmode $FH;

my $content = join '', <$FH>;
seek $FH, 0,0;

my $md5 = Digest::MD5->new;
$md5->addfile($FH);
my $digest = $md5->digest;
my $digestStr1;
map { $digestStr1 .= sprintf("%02x", $_) } (unpack("C*", $digest));

my $md5digest = md5($content);
my $md5hex = md5_hex($content);
my $md5base64 = md5_base64($content);

print "md5sum: $refdigest\n";
print "OO digest: $digestStr1\n";
print "hex: $md5hex\n";
print "digest: $md5digest\n";
print "base64: $md5base64\n\n";

[root@asterisk099b ~]# time ./fish.pl /var/log/asterisk/messages
md5 values for /var/log/asterisk/messages with a size of 636994435

md5sum: b14a30772307956c11a6c65432258e0a
OO digest: b14a30772307956c11a6c65432258e0a
hex: b14a30772307956c11a6c65432258e0a
digest: J0w#lT2%

base64: sUowdyMHlWwRpsZUMiWOCg


real 0m55.237s
user 0m27.008s
sys 0m4.604s



1arryb
User

Feb 24, 2009, 10:47 AM

Post #15 of 16 (7846 views)
Re: [FishMonger] Digest::MD5::digest woes [In reply to] Can't Post

OK, I'll bite. I stripped out all non-essential stuff from your program, parameterized it for method (oo or functional) and put the guts in a loop on a glob.

I then pointed this program at a directory containing 222 text files whose contents totalled 543kb. After running it a few times to populate the system file cache, the program took about 3-3/4 seconds to iterate over the directory on my laptop. The times differed by at most .03 second regardless of method.

So, to summarize this thread:

1. The functional interface requires your program to read the file into memory and pass it each time you use the interface. Therefore, contents which can be processed by the functional interface are limited to the available RAM.

2. The functional interface is idempotent, in that you get the same results for the same parameter, call after call, with no special treatment required.

3. The oo interface streams the data from the file handle you pass as a parameter. This makes the oo interface more appropriate for summing large files.

4. The oo interface is sensitive to the state of the passed file handle. Idempotent results are only achievable if the state of the file handle is reset between calls (e.g., seek $fh, 0,0).

5. There is no significant difference in execution speed between the functional and oo interfaces.

This was fun!

Cheers,

Larry


Code
#!/usr/bin/perl 

use strict;
use warnings;
use Digest::MD5 qw< md5 md5_hex md5_base64 >;
use Getopt::Std;

my $usage = "Usage: $0 <dir> oo|func";

unless(scalar(@ARGV) == 2) {
print "$usage\n";
exit -1;
}

my $dir = $ARGV[0];
my $method = $ARGV[1];

unless ( -d $dir ) {
print "dir \"$dir\" does not exist\n";
exit -1;
}
unless ( $method eq 'oo' or $method eq 'func' ) {
print "method must be one of 'oo' or 'func'\n";
print "$usage\n";
exit -1;
}

for my $file (glob("$dir/*")) {
my $size = -s $file;

print "md5 values for $file with a size of $size using $method method\n\n";


open my $FH, '<', $file or die "can't open '$file' $!";
binmode $FH;

my ($digest, $digesthex);

if ( $method eq 'func' ) {
my $content = join '', <$FH>;
$digest = md5($content);
$digesthex = md5_hex($content);
} else {

my $md5 = Digest::MD5->new;
$md5->addfile($FH);
$digest = $md5->digest;
$digesthex = '';
map { $digesthex .= sprintf("%02x", $_) } (unpack("C*", $digest));
}
close $FH;

print "digesthex: $digesthex\n";
print "digest: $digest\n";
}



FishMonger
Veteran / Moderator

Feb 24, 2009, 11:40 AM

Post #16 of 16 (7843 views)
Re: [1arryb] Digest::MD5::digest woes [In reply to] Can't Post

Try this benchmark test. I put the open call inside the subs because I think that might be a more far comparison with the system call.

Also, keep in mind that using join when assigning $content is not the most efficient method.


Code
#!/usr/bin/perl 

use strict;
use warnings;
use Digest::MD5 qw<md5_hex>;
use Benchmark qw <:all>;

@ARGV == 1 or die "Usage: $0 <file>";
my $file = $ARGV[0];


cmpthese(-1,
{
'md5sum' => \&md5sum,
'functional' => \&functional,
'OO' => \&OO,
}
);

sub md5sum {
my $refdigest = `md5sum $file`;
$refdigest = (split(/\s+/, $refdigest))[0];
}

sub functional {
open my $FH, '<', $file or die "can't open '$file' $!";
binmode $FH;
my $content = join '', <$FH>;
my $md5hex = md5_hex($content);
}

sub OO {
open my $FH, '<', $file or die "can't open '$file' $!";
binmode $FH;
my $md5 = Digest::MD5->new;
$md5->addfile($FH);
my $digest = $md5->digest;
my $digestStr1;
map { $digestStr1 .= sprintf("%02x", $_) } (unpack("C*", $digest));
}


 
 


Search for (options) Powered by Gossamer Forum v.1.2.0

Web Applications & Managed Hosting Powered by Gossamer Threads
Visit our Mailing List Archives