http://www.dmst.aueb.gr/dds/pubs/trade/1997-login-Perl/html/shperl.html This is an HTML rendering of a working paper draft that led to a publication. The publication should always be cited in preference to this draft using the following reference:
|
People often need to decide whether to code the solution to a given task in Perl or as a shell script. Some factors that we consider are influenced by the answers to the questions below.
RBK: Of course, that requires fluency with each of the tools you'll be using. Even in the area of regular expressions, syntaxes and capabilities vary slightly among the various tools. I've pretty much revered to perl for just about everything that isn't a straightline set of running commands or simple pipes.
DDS: On the other hand, filter-type pipeline sequences can be succinctly expressed as a shell script; perl may require non-intuitive nested invocations of the grep function or the use of complicated control structures. Many problems can be easily solved with a sequence like:
grep .... | # Find the data we are interested in sed ... | # Convert it to the right format awk ... | # Calculate something sort ... | # Sort output pr # And paginateIn my mind the expression of this sequence in perl as sequential operations on records accumulated into arrays (for sorting) is not as natural as the shell counterpart.
RBK: My system's architecture pushes me in the other direction. Even with 32MB of RAM, perl should be sitting around in memory (along with the shells and the simpler tools like sed, grep, and sort). Amusingly enough, on my system perl requires 250KB of RAM while bash requires 1.35MB! Even if one swapped them both, that's less than half a second or so of real time if you're running on a modern UNIX-style system. Once the good parts are paged in, presumably the launch time would approach zero.
DDS: I aggree with RBK's point and plea for frugal tool implementations. Just because a feature can be added does not mean that is should be.
RBK: I no longer worry about the speed limitations of interpreted code vs. fast compiled code. It is a rare day that I translate programs into C for speed. Pentium 90s approach 100 MIPS; P-200s approach 200 MIPS. That's plenty of speed for 99% of what I do. I must admit that our bookkeeping software is finally starting to run a bit slow for my tastes (20 seconds to process 10,000 orders), but I'll upgrade the processor before re-writing the software. I can get 2x there and that's plenty.
I used to hate the overhead of lots of forks but even that has improved with the latest round of fast CPUs. I don't sweat that as much any more.
DDS: Having a fork (command execution) at the inner loop of a large data set processing program will yield to pathologicaly slow implementation. If find it difficult to persuade myself that this overhead is irrelevant. A quick experiment (grep implemented as a naive perl loop or shell script) on my system shows that the perl script is an order of magnitude slower than grep (1.37s vs 0.09s) and that the naive shell loop implementation taking 92.77s is more than three orders of magnitude slower than grep. According to Moore's law in six years (less than the age of the perl language) the processor technology provides the speedup needed for the perl script to match grep's current performance. On the other hand the shell script will match grep's current performance in 15 years (interestingly this is still less than the age of Unix).
RBK: The point of reading-all-into-RAM is well taken. This approach ultimately can not scale well if your data is growing (e.g., order management systems). On the other hand, I've been able to structure all my programs so that they don't keep everything in memory unless it's extremely advantageous for me to do so. I agree that the best way in perl to implement pipelines that are easy structured with standard Unix toolset commands is by using the `system' directive to call them.
DDS: It is certainly possible to keep perl's data off memory and still keeping the implementation simple by using database-mapped associative arrays (dbm, gdbm etc.). Furthermore, binding perl to a relational database management system completely redefines perl's applicability and performance over very large data sets. At that point however the implementation domain has been transformed into an RDBMS which just happens to be programmed in perl. <2>How complex is the data and its relationships? DDS: Shell programs together with standard Unix tools can at its best deal with relational (join) or ordering (tsort) operations on sequential record lists. Dealing with more complicated data structures such as graphs can become a nightmare for even the most dedicated shell programmer. Version 5 of Perl, on the other hand, provides references (pointers) and therefore the possibility to create data structures of arbitrary complexity. Looking up data and dealing with unstructured text or binary data is also more easily performed in Perl than using Unix commands in a shell script. If the data lookups are to be performed on orderly structured line-based records and can be collected and executed in a batch then using a shell script together with one of fgrep, comm, uniq or join could fit the bill. If this is not the case (consider for example converting references to hypertext links in a LaTeX document) then the associative arrays and advanced regular expressions of Perl are the tool for the job.
RBK: I'm particularly enamored of the control structures available in perl. I think they enable programmers to keep fewer `contexts' in their mind and concentrate on procedural programming just as they would in C or C++ (or whatever).
RBK: I don't think I agree that you can (in a reasonable amount of time) write a shell script conservatively enough to run unmodified on all Unix systems (though you can do it for `a large number'). Too many file locations, command locations, and other system dependencies (also known as `vendor enhancements') make this difficult. Perl on the other hand, has just one version and just one support organization. I like that.
RBK: I rarely use Perl 5. As before, the control structures and simple data structures that I use in perl provide me with more than 90% of what I need. I do have some scripts that grew over time to be overly complex and ultimately needed a rewrite. I think that's generic, though.
DDS: When dealing with code that other people have written I usually find perl program's easier to understand than corresponding shell script.
The programs are run with a single argument which gives a directory. Both programs provide as output a list of file pairs that contain exactly the same data. I needed such a tool in order to clean up a set of 19000 files occupying 500MB as a first step for further processing. A naive solution to this problem involves comparing every file against all others. This is an O(n!) algorithm; obviously unusable for most practical purposes. A more efficient approach would be to sort the files by their content (an O(n log n) process) and then print duplicate files from the list. Unfortunately sorting files by their content can still be expensive since every file comparison involves opening, reading, and closing the file; in my case about 850000 system calls. In addition, sorting file names by file content can not be easily performed with standard Unix tools. I decided to minimize file accesses by operating on the checksums of the file data instead of the files themselves. Checksums are integer numbers and can be easily sorted and manipulated. Identical files will have identical checksums, the opposite however is not true. For this reason after finding all files with identical checksums a final step compares those files byte by byte and prints only files that actually match.
The shell program (illustrated bellow) creates a list of files in the directory (find), converts that list into a list of checksums (xargs sum) and sorts them to bring identical checksums together (sort). This is then saved (tee) into a temporary file for further lookups. The invocation of uniq prints the first entry of each identical checksum file sequence which feeds data into the loop for determining the matched files. The loop contains two nested inner loops: the first one loops over all possibly identical files (e.g., a1, a2, a3, a4), the second one loops over the file coming after the first loop file up the the last one (e.g., first a2, a3, a4, then a3, a4, then a4). The body of the inner loop simply compares the files and prints their names if they are equal.
#!/bin/sh # # List duplicate files # TMP=/tmp/dup.$$ trap "rm -f $TMP ; exit 1" 1 2 3 15 # Look for candidates (having the same checksum) find $1 -type f -print | xargs sum | awk '{print $3, $1}' | sort +1 | tee $TMP | uniq -d -1 | while read NAME SUM do # List each file from the candidate set fgrep $SUM $TMP | awk '{print $1}' | while read FIRST do # List and process the remaining files of the set fgrep $SUM $TMP | sed -n '\*^'$FIRST'*,$p' | sed 1d | while read SECOND SUM2 do if cmp -s $FIRST $SECOND then echo $FIRST $SECOND fi done done done rm -f $TMPThe Perl script (illustrated below) operates on exactly the same principles. It first calls the find library function which in turn calls process for every entry in the specified directory. Process determines whether the entry is in fact a file and calculates its checksum using the unpack function. It then pushes the filename into the appropriate entry of the files associative array. The files associative array is indexed by the checksum number. Every entry contains a list of files that have the given checksum. After all files have been processed a loop goes through all elements of the associative arrays and two inner loops compare (as in the shell script) the files in pairs. Identical files are printed.
#!/usr/bin/perl # # List duplicate files # use File::Find; undef $/; # Call process for every found entry find(\&process, $ARGV[0]); # For every plain file found add its filename under its checksum hash sub process { return unless -f; open(IN, $_) || die; $sum = unpack("%32C*", <IN>) % 32767; close(IN); push @{$files{$sum}}, $File::Find::name; } # Go through all checksums comparing files that have the same sum foreach $files (values %files) { next if ($#$files == 0); for ($i = 0; $i <= $#$files; $i++) { for ($j = $i + 1; $j <= $#$files; $j++) { print "$$files[$i] $$files[$j]\n" if (same($$files[$i], $$files[$j])); } } } # Return true if two files are identical sub same { local($a, $b) = @_; open(A, $a) || die; open(B, $b) || die; return <A> eq <B>; }With the shell script coded in 36 lines and the Perl script in 42 both scripts are approximately the same size. With three nested pipelines feeding a loop and a temporary file for storing intermediate results the shell script appears to be stretching the limits of shell programming. The Perl script executed in about half the time of the shell script (810 user and 816 system seconds for Perl versus 1672 user and 1842 system seconds for the shell script); a notable, but not dramatic performance difference. On the other hand at the peak of its execution the shell script was running 10 processes with a combined resident set size (RSS) of 4892K (probably a lot less since four and two of them shared the same text image) whereas the Perl script had an RSS of 8064K. It is also important to note that the RSS of Perl was constantly rising as entries accumulated into the associative array whereas the RSS of the shell script remained almost constant over its execution. This means that the shell script can handle much larger set sizes without running out of swap space as the Perl script might (in the October 1996 issue of ;login: Andrew Hume presented the listing of a directory containing 80GB of data). In this example both implementations are reasonable approaches for solving the problem; I hope that the example illustrated the different coding styles and tradeoffs between the two scripting languages and that the initial paragraphs provided you with some insights on the issues that are important when deciding the implementation vehicle for your problem at hand.
RBK: I coincidentally encountered this same problem when I was trying to reduce the number of typefaces I have stored on disk. Many are identical. I wrote a program to store checksums in a `SUM' file in each directory (new directories appear from time to time) and another program to traverse those directories and SUM files to see which were identical. The Perl programs differed slightly from DDS's, not enough for it to be worthwhile to show them here.
[1] David G. Korn. Ksh - An extensible high level language. USENIX Very High Level Languages Symposium Proceedings, pp. 129-146. Santa Fe, October 1994.
[2] Theodore H. Romer et al., The structure and performance of interpreters. ACM Sigplan Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 50-159. Cambridge MA, October 1996.