Still Slow After Fork

in perl

A simple paradigm for speeding up code is to use multiple processes created by fork. Unfortunately this doesn't always work as we can see in the following example.

Calculating Tons of MD5

A simple case would be brute forcing MD5. Using Math::Cartesian::Product it's easy to generate all the 5 letter combinations and calculate an MD5 sum from each of them.

If we were searching for the sources of the following MD5s:

5d41402abc4b2a76b9719d911017c592
3899dcbab79f92af727c2190bbd8abc5
9aeaed51f2b0f6680c4ed4b07fb1a83c
96e25e3c6373262faae824dc5a16cce1

We could use the following perl program to brute force them:

use v5.18;
use strict;
use warnings;

use Data::Dumper;
use Digest::MD5 qw/md5_hex/;
use Math::Cartesian::Product;
my %lookup = map { ($_ =~ s/\R$//r, undef) } <DATA>;

sub tryword {
  my ($word) = @_;
  if (exists $lookup{md5_hex($word)}) {
    $lookup{md5_hex($word)} = $word;    
    print "Found: ${word} = ", md5_hex($word), "\n";
  }
  return;
}

my $a = [('a'..'z')];
cartesian {tryword(join('', @_))} ($a) x 5;

__DATA__
5d41402abc4b2a76b9719d911017c592
3899dcbab79f92af727c2190bbd8abc5
9aeaed51f2b0f6680c4ed4b07fb1a83c
96e25e3c6373262faae824dc5a16cce1

And the output (including time):

$ time crackmd5.pl
Found: hello = 5d41402abc4b2a76b9719d911017c592
Found: ninja = 3899dcbab79f92af727c2190bbd8abc5
Found: troll = 9aeaed51f2b0f6680c4ed4b07fb1a83c
Found: xoxox = 96e25e3c6373262faae824dc5a16cce1

real	0m19.226s
user	0m18.717s
sys	0m0.129s

(Not) Speeding Up

Now let's speed up the calculation using fork. For brute forcing it's easy to split the work by first letter, having each process try all the combinations with a given first letter. I'll use Forks::Queue to queue the tasks and have each process get items from the queue until it's empty (i.e. no more letters to calculate).

Here's the code:

use v5.18;
use strict;
use warnings;

use Data::Dumper;
use Digest::MD5 qw/md5_hex/;
use Math::Cartesian::Product;
use Forks::Queue;
my $fq = Forks::Queue->new( impl => 'File', style => 'fifo' );

my %lookup;
%lookup = map { ($_ =~ s/\R$//r, undef) } <DATA>;

sub handle_job {
  while(my $first_letter = $fq->get) {
    my $a = [('a'..'z')];
    my @l = cartesian {tryword(join('', $first_letter, @_)); 0} ($a) x 4;
  }
}

sub tryword {
  my ($word) = @_;
  my $md5 = md5_hex($word);
  if (exists $lookup{$md5}) {
    warn "$word = ", md5_hex($word), "\n";
  }
  return;
}

my @children;
sub fork_worker {
  my $cid = fork();
  if($cid == 0) {
    handle_job();
    exit(0);
  }

  push @children, $cid
}

$fq->put($_) for ('a'..'z');
$fq->end;

fork_worker();
fork_worker();

waitpid($_, 0) for @children;

__DATA__
5d41402abc4b2a76b9719d911017c592
3899dcbab79f92af727c2190bbd8abc5
9aeaed51f2b0f6680c4ed4b07fb1a83c
96e25e3c6373262faae824dc5a16cce1

And here's our daily surprise - This code runs just as slow as the previous version:

hello = 5d41402abc4b2a76b9719d911017c592
ninja = 3899dcbab79f92af727c2190bbd8abc5
troll = 9aeaed51f2b0f6680c4ed4b07fb1a83c
xoxox = 96e25e3c6373262faae824dc5a16cce1

real	0m18.449s
user	0m35.971s
sys	0m0.128s

Further Observations

I suspect the reason for the above lies in how Digest::MD5 works. Perhaps there's some shared synchronized data the XS module is accessing for some reason, or perhaps just running XS code takes its toll.

Moving this to Digest::Perl::MD5 is able to speed up, but as the starting position there is so much slower I doubt anyone would take it.

When calculating MD5 in pure perl we get:

5 letters, pure perl, 2 process
real	3m15.244s
user	6m16.292s
sys	0m1.725s

5 letters, pure perl, single process
real	6m15.383s
user	5m43.227s
sys	0m4.260s

Which makes sense but won't help you calculate MD5s any faster.

My conclusion here is don't take anything for granted and always check when it comes to performance.

And if any of you has a good idea what slows down this calculation or how to speed it up a comment is appreciated.

Comments