Still Slow After Fork
A simple paradigm for speeding up code is to use multiple processes created by fork. Unfortunately this doesn't always work as we can see in the following example.
Calculating Tons of MD5
A simple case would be brute forcing MD5. Using Math::Cartesian::Product it's easy to generate all the 5 letter combinations and calculate an MD5 sum from each of them.
If we were searching for the sources of the following MD5s:
5d41402abc4b2a76b9719d911017c592
3899dcbab79f92af727c2190bbd8abc5
9aeaed51f2b0f6680c4ed4b07fb1a83c
96e25e3c6373262faae824dc5a16cce1
We could use the following perl program to brute force them:
use v5.18;
use strict;
use warnings;
use Data::Dumper;
use Digest::MD5 qw/md5_hex/;
use Math::Cartesian::Product;
my %lookup = map { ($_ =~ s/\R$//r, undef) } <DATA>;
sub tryword {
my ($word) = @_;
if (exists $lookup{md5_hex($word)}) {
$lookup{md5_hex($word)} = $word;
print "Found: ${word} = ", md5_hex($word), "\n";
}
return;
}
my $a = [('a'..'z')];
cartesian {tryword(join('', @_))} ($a) x 5;
__DATA__
5d41402abc4b2a76b9719d911017c592
3899dcbab79f92af727c2190bbd8abc5
9aeaed51f2b0f6680c4ed4b07fb1a83c
96e25e3c6373262faae824dc5a16cce1
And the output (including time):
$ time crackmd5.pl
Found: hello = 5d41402abc4b2a76b9719d911017c592
Found: ninja = 3899dcbab79f92af727c2190bbd8abc5
Found: troll = 9aeaed51f2b0f6680c4ed4b07fb1a83c
Found: xoxox = 96e25e3c6373262faae824dc5a16cce1
real 0m19.226s
user 0m18.717s
sys 0m0.129s
(Not) Speeding Up
Now let's speed up the calculation using fork. For brute forcing it's easy to split the work by first letter, having each process try all the combinations with a given first letter. I'll use Forks::Queue to queue the tasks and have each process get items from the queue until it's empty (i.e. no more letters to calculate).
Here's the code:
use v5.18;
use strict;
use warnings;
use Data::Dumper;
use Digest::MD5 qw/md5_hex/;
use Math::Cartesian::Product;
use Forks::Queue;
my $fq = Forks::Queue->new( impl => 'File', style => 'fifo' );
my %lookup;
%lookup = map { ($_ =~ s/\R$//r, undef) } <DATA>;
sub handle_job {
while(my $first_letter = $fq->get) {
my $a = [('a'..'z')];
my @l = cartesian {tryword(join('', $first_letter, @_)); 0} ($a) x 4;
}
}
sub tryword {
my ($word) = @_;
my $md5 = md5_hex($word);
if (exists $lookup{$md5}) {
warn "$word = ", md5_hex($word), "\n";
}
return;
}
my @children;
sub fork_worker {
my $cid = fork();
if($cid == 0) {
handle_job();
exit(0);
}
push @children, $cid
}
$fq->put($_) for ('a'..'z');
$fq->end;
fork_worker();
fork_worker();
waitpid($_, 0) for @children;
__DATA__
5d41402abc4b2a76b9719d911017c592
3899dcbab79f92af727c2190bbd8abc5
9aeaed51f2b0f6680c4ed4b07fb1a83c
96e25e3c6373262faae824dc5a16cce1
And here's our daily surprise - This code runs just as slow as the previous version:
hello = 5d41402abc4b2a76b9719d911017c592
ninja = 3899dcbab79f92af727c2190bbd8abc5
troll = 9aeaed51f2b0f6680c4ed4b07fb1a83c
xoxox = 96e25e3c6373262faae824dc5a16cce1
real 0m18.449s
user 0m35.971s
sys 0m0.128s
Further Observations
I suspect the reason for the above lies in how Digest::MD5 works. Perhaps there's some shared synchronized data the XS module is accessing for some reason, or perhaps just running XS code takes its toll.
Moving this to Digest::Perl::MD5 is able to speed up, but as the starting position there is so much slower I doubt anyone would take it.
When calculating MD5 in pure perl we get:
5 letters, pure perl, 2 process
real 3m15.244s
user 6m16.292s
sys 0m1.725s
5 letters, pure perl, single process
real 6m15.383s
user 5m43.227s
sys 0m4.260s
Which makes sense but won't help you calculate MD5s any faster.
My conclusion here is don't take anything for granted and always check when it comes to performance.
And if any of you has a good idea what slows down this calculation or how to speed it up a comment is appreciated.