Writing Simple Filters In Perl, Python and Ruby

File input operations in Perl, Python and Ruby feel almost as if they were written by the same person. In this post I'd like to examine their similarities and differences, and hopefully learn something about the philosophy of the languages themselves.

Our Task

A Unix filter is a command that reads its input from stdin and writes its result to stdout, though most filters also accept input from files passed in as command line arguments. Take wc for example:

$ ls | wc
$ wc /etc/passwd
$ wc /etc/passwd /etc/shells

All three invocations work, and each causes wc to count the characters, words and lines in its input.
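To see what a filter has to do when the language gives no help, here is a minimal sketch in Python of the stdin-or-files dispatch written by hand. The names input_lines and count are mine and purely illustrative, and the counting only approximates wc (it counts characters of decoded text, not bytes):

```python
import io
import sys


def input_lines(paths, stdin=None):
    """Yield every line from the named files, or from stdin when no names are given."""
    if not paths:
        yield from (stdin if stdin is not None else sys.stdin)
        return
    for path in paths:
        with open(path) as handle:
            yield from handle


def count(lines):
    """Return (lines, words, characters), in the spirit of wc."""
    n_lines = n_words = n_chars = 0
    for line in lines:
        n_lines += 1
        n_words += len(line.split())
        n_chars += len(line)
    return n_lines, n_words, n_chars


# Demo on an in-memory stand-in for stdin, so the sketch runs without any files.
sample = io.StringIO("one two\nthree\n")
print(count(input_lines([], stdin=sample)))
```

A real filter would call count(input_lines(sys.argv[1:])) instead of using the in-memory sample. This is exactly the boilerplate that the constructs discussed in the rest of the post take off our hands.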

Perl, Python and Ruby all offer language constructs that help us build programs taking line-based input from stdin or from files named on the command line. We'll build a simple filter in all three languages that prints the name of each file passed in, followed by its numbered content.

The expected output of the program is:

$ perl readall.pl /etc/shells
----------
/etc/shells
----------
1: # List of acceptable shells for chpass(1).
2: # Ftpd will not allow users to connect who are not using
3: # one of these shells.
4: 
5: /bin/bash
6: /bin/csh
7: /bin/ksh
8: /bin/sh
9: /bin/tcsh
10: /bin/zsh
11: /usr/local/bin/zsh

Perl: simple and direct

use strict;
use warnings;

my $print_next_header = 1;

sub print_header {
  print '-' x 10, "\n";
  print $ARGV, "\n";
  print '-' x 10, "\n";
}

while(<>) {
  if ($print_next_header) {
    print_header;
    $print_next_header = 0;
  }

  $print_next_header = 1 if eof;

  print $., ": ", $_;
}

The Perl implementation is simple and straightforward. The special readline operator <> reads a line from either STDIN or the files passed in as command line arguments and returns it. When no more lines remain it returns undef, which ends the while loop.

The loop uses the eof function to check whether the current file is finished. The drawback is that it requires a flag variable, $print_next_header, to remember that the next line read starts a new file.

The program uses three magic globals: $ARGV, $. and $_. Full documentation of all of Perl's magic globals is available with perldoc perlvar. In our case $ARGV is the current file name, $. is the cumulative line number and $_ is the current line.

Using globals everywhere means we don't have to pass the file name as an argument to the print_header function, which is nice in small programs but can lead to confusion in larger ones.

Python: fileinput module

Python has a standard library module called fileinput which does exactly what the above Perl code does, but with cleaner code and tons of extra features. Here's the same program in Python:

import fileinput
import sys

def header(filename):
    return "{0}\n{1}\n{0}\n".format("-" * 10, filename)

def text(index, line):
    return "{0}. {1}".format(index, line)

for line in fileinput.input():
    if fileinput.isfirstline():
        sys.stdout.write(header(fileinput.filename()))

    sys.stdout.write(text(fileinput.lineno(), line))

fileinput.input() returns an iterator over all lines in STDIN or in the files passed in as command line arguments. The fileinput module itself provides a lot of utility functions for getting information about the read operation, such as isfirstline(), which tells us if we're on the first line of a file; filename(), which returns the current file name; and lineno(), which yields the cumulative line number.
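To make "cumulative" concrete, here is a small sketch that compares lineno() with filelineno(), the fileinput function that restarts at 1 for every file. The helper name numbered and the files a.txt and b.txt are made up for the demo and written to a temporary directory:

```python
import fileinput
import os
import tempfile


def numbered(paths):
    """Return (file name, cumulative line number, per-file line number) for each line."""
    rows = []
    with fileinput.input(files=paths) as stream:
        for _line in stream:
            name = os.path.basename(stream.filename())
            rows.append((name, stream.lineno(), stream.filelineno()))
    return rows


# Create two throwaway files so the example does not depend on the local system.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, body in (("a.txt", "x\ny\n"), ("b.txt", "z\n")):
        path = os.path.join(tmp, name)
        with open(path, "w") as handle:
            handle.write(body)
        paths.append(path)
    rows = numbered(paths)

for row in rows:
    print(row)
```

The single line of b.txt comes out as cumulative line 3 but per-file line 1. If you'd rather have numbering restart for each file, filelineno() is the function to reach for.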

This is the same data that Perl's magic globals provide, but with a nicer interface. Everything is scoped, so we pass the relevant data to the header and text functions explicitly, which creates less confusion in larger programs.

You do need to know some object oriented concepts, and how iterators work, to understand the above code.

One major difference between the above and the Perl version is what happens when a name given as a command line argument is not a valid file name. For example, assume the program is called readall.py and no file named /foo/bar exists. What happens here?

$ python readall.py /foo/bar /etc/passwd
$ perl readall.pl /foo/bar /etc/passwd

It turns out the Perl version prints a warning that /foo/bar can't be opened and then prints all the lines from /etc/passwd. The Python version, on the other hand, raises an IOError (in Python 3, a FileNotFoundError, which is a subclass of OSError) and doesn't print anything.

There are several ways to get Python to behave like Perl in such a case, as discussed here. I think filtering out non-file arguments is the easiest one, although unlike Perl it skips missing files silently:

import fileinput
import sys
import os.path

def header(filename):
    return "{0}\n{1}\n{0}\n".format("-" * 10, filename)

def text(index, line):
    return "{0}. {1}".format(index, line)

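# Keep only the arguments that name existing files. A list (not a generator)
# is used so that an empty result makes fileinput fall back to stdin.
files = [f for f in sys.argv[1:] if os.path.isfile(f)]

for line in fileinput.input(files):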
    if fileinput.isfirstline():
        sys.stdout.write(header(fileinput.filename()))

    sys.stdout.write(text(fileinput.lineno(), line))
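The isfile filter drops bad arguments without a word, while Perl warns about them. Here is one possible sketch that gets closer to Perl by trying to open each argument and reporting failures on stderr. The function name openable and the wording of the message are my own choices, not something fileinput provides:

```python
import io
import sys


def openable(paths, err=None):
    """Return the paths that can be opened for reading, warning about the rest."""
    err = err if err is not None else sys.stderr
    kept = []
    for path in paths:
        if path == "-":
            # fileinput treats "-" as stdin, so leave it alone.
            kept.append(path)
            continue
        try:
            with open(path):
                pass
        except OSError as exc:
            print("Can't open {0}: {1}".format(path, exc.strerror), file=err)
        else:
            kept.append(path)
    return kept


# Demo: collect the warnings in memory instead of writing them to stderr.
log = io.StringIO()
print(openable(["/no/such/dir/file", "-"], err=log))
print(log.getvalue(), end="")
```

The result can be handed to fileinput.input() in place of the filtered list. One thing to keep in mind: when every argument is rejected, the list is empty and fileinput falls back to reading stdin.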

Ruby: Hello ARGF

Ruby's version of <> is called ARGF. Here's the code:

def header(filename)
  <<~END
    ----------
    #{filename}
    ----------
  END
end

def text(index, line)
  "#{index}. #{line}"
end

ARGF.each_line do |line|
  puts header(ARGF.filename) if ARGF.file.lineno == 1
  puts text(ARGF.lineno, line)
end

Missing files still raise an exception (Errno::ENOENT), so if you're expecting any you'll need to filter them out of ARGV before calling each_line.

Other than that, this code is almost a one-to-one copy of the Python version, with the following visible differences:

  1. Ruby's puts is smart enough not to add a newline if the line already ends in one (unlike Python's print, which adds a newline by default).
  2. Ruby's heredoc and string interpolation results in nicer code than Python's string format. Also I liked not having to use return in the last statement of the functions.
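On the Python side, the first point is why the listings use sys.stdout.write. A small sketch of the alternative, assuming you'd rather stay with print: pass end="" so it doesn't append a second newline. The function name emit is just for illustration:

```python
import io
import sys


def emit(lines, out=None):
    """Print lines that already end in a newline without doubling it."""
    out = out if out is not None else sys.stdout
    for line in lines:
        # end="" stops print() from appending its own newline.
        print(line, end="", file=out)


# Demo: capture the output in memory to show that no blank lines appear.
buffer = io.StringIO()
emit(["first\n", "second\n"], out=buffer)
print(repr(buffer.getvalue()))
```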

Both Ruby's enumerators and Python's iterators let an object decide how its iteration logic should work. And in both languages file read operations are performed through an object oriented interface: in Ruby it's ARGF and in Python it's fileinput.
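As a rough illustration of an object deciding its own iteration logic, here is a tiny Python class that wraps any iterable of lines and chooses to yield numbered pairs instead of bare lines. The class name Numbered is hypothetical and not part of fileinput:

```python
class Numbered:
    """Wrap an iterable of lines and yield (number, line) pairs when iterated."""

    def __init__(self, lines, start=1):
        self._lines = lines
        self._start = start

    def __iter__(self):
        # The object, not the caller, decides what each iteration step produces.
        number = self._start
        for line in self._lines:
            yield number, line
            number += 1


for number, line in Numbered(["first\n", "second\n"]):
    print("{0}: {1}".format(number, line), end="")
```

The for loop knows nothing about numbering; it simply asks the object for the next item, which is the same contract fileinput.input() and ARGF.each_line rely on.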

Afterthoughts

Writing the same program in three languages can shed some light on the differences in philosophy between them. Perl is direct and minimalistic. It uses magic global variables and a procedural mindset, which works well for small scripts and is very easy to understand for developers coming from C.

Both Python and Ruby take a more object oriented approach. Of the two, Python's syntax is more restrained, while Ruby is more willing to use extra characters and add new language constructs.
