README                                         19 Feb 2005

All the data is extracted from .c and .h files by the program
'ngram'.  Various scripts are available to process the output of
ngram to produce graphs (mkallgra.sh) and tables (mkalltab.sh).

Most of the support tools used are included with most Linux
distributions (i.e., awk, colrm, gzcat, and sort).  Some of the
figures and tables use data generated using commercially
available tools which are not included in this package.

The graphs are created using 'grap'.  This tool is part of the
groff/pic/tbl family, but is not packaged with many Linux
distributions.  You can download it from:
www.lunabase.org/~faber/Vault/software/grap
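
A quick way to check that the tools named above are on the PATH
before starting is sketched below.  The function name missing_tools
is hypothetical (it is not part of this package), and the tool list
simply repeats the names mentioned in this README; adjust it to match
what your system provides.

```shell
#!/bin/sh
# Hypothetical helper, not part of the distribution: print the name of
# every tool given as an argument that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Tools named in this README; nothing is printed if all are present.
missing_tools awk colrm gzcat sort grap
```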

The generated files containing the table data were automatically
inserted into the text of the book, which is then processed using
various tools (including latex) to generate a pdf file.  It would be
nice if there were some stripped-down system for generating a pdf, but
there isn't.  For the time being no tools are supplied for producing
pleasant-looking usage tables in ASCII text form.

Obtaining the raw data can take a while (it is mostly disk bound) and
will depend on the amount of source being processed.  Running ngram
without any options (i.e., using c_use.sh) on the source used for the
book takes 11 minutes on a 266MHz Pentium.  The mkallget.sh script
may take an order of magnitude longer to complete.

The top level script mkallget.sh invokes the scripts that execute
ngram with various arguments.  This generates various raw data files
from the source code.  Some of this raw data may need to be processed
again (generating other data files) before it is in a suitable final
form.  Setting GEN_DATA (in config.files) to 0 stops mkallgra.sh and
mkalltab.sh from generating these data files, but they continue to do
everything else they do (this saves time when tuning the formatting
of the graphs and tables).
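
The effect of GEN_DATA can be pictured with the following sketch.  It
is an assumption about how such a switch might be tested, not a copy
of the logic in mkallgra.sh or mkalltab.sh; the function name
should_gen_data and the default value of 1 are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: decide whether the data generation step runs.
# Assumes GEN_DATA has been set by config.files; unset is treated as 1.
should_gen_data() {
    [ "${GEN_DATA:-1}" != "0" ]
}

if should_gen_data; then
    echo "regenerating data files"
else
    echo "skipping data generation, formatting only"
fi
```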

There are data extraction scripts in various directories (these are
invariably called getxxxx.sh, where xxxx is a name relating to the
data they extract).  The mkgra.sh script in each directory extracts
the necessary data (creating .d files) and expands the .g files (to
.gra files).  The mktab.sh script in each directory extracts the
necessary data and writes it to a text file (one file per table).
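
To see which directories hold scripts following this naming
convention, something like the sketch below may help.  The function
name list_data_scripts is hypothetical and is not part of this
package; it relies only on the file names described above.

```shell
#!/bin/sh
# Hypothetical helper: list the getxxxx.sh, mkgra.sh and mktab.sh
# scripts found below the directory given as $1 (default: the
# current directory), in sorted order.
list_data_scripts() {
    find "${1:-.}" -type f \( -name 'get*.sh' -o -name 'mkgra.sh' \
        -o -name 'mktab.sh' \) | sort
}

# Example use, from the top of the distribution:
# list_data_scripts .
```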

If you are running RedHat Linux the distributed executables should
work (built under RedHat 9).  Otherwise they can be built by
executing:

bldall.sh

Starting with a new set of source files in the program directory,
the commands to execute are:

c_use.sh > c.cnt
h_use.sh > h.cnt
mkallget.sh
mkallgra.sh
mkalltab.sh
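
Since later steps depend on the output of earlier ones, it can be
convenient to stop at the first failure.  The wrapper below is a
minimal sketch of one way to do that; run_steps is a hypothetical
name, not a script supplied with this package, and it assumes the
scripts can be found in the same way as when typed by hand.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of the distribution: run each command
# in order and stop at the first one that fails.
run_steps() {
    for step in "$@"; do
        echo "running: $step"
        sh -c "$step" || { echo "failed: $step" >&2; return 1; }
    done
}

# Intended use (commented out here because it takes a long time):
# run_steps 'c_use.sh > c.cnt' 'h_use.sh > h.cnt' \
#           mkallget.sh mkallgra.sh mkalltab.sh
```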

Assuming you have installed grap, then the following:

cd ../diagrams
mkps.sh

will generate postscript and pdf files for the contents of the
diagrams directory (placed in the ps subdirectory).

The distributed directories are:

program -- this is the default directory used for 'find'ing .c
           and .h files.
           The top level directories in program are assumed to
	   denote various applications, or groups of programs.
	   This directory can simply consist of links to where the
	   source is actually held.

idents  -- Identifier related data and scripts.  By default the
	   various Levenshtein distance measurements are not generated.
	   It takes 4+ hours, on a 1.5GHz Pentium, to generate the
	   measurements for gcc (it is an N^2 algorithm).  Edit getall.sh
	   to change this default.

prepro  -- Preprocessor related data and scripts.

statements  -- Statement related data and scripts.  It takes 3+ hours,
	       on a 1.5GHz Pentium, to generate all the raw data
	       for gcc.

decls -- Declaration related data and scripts.

tables  -- Scripts for extracting data from various ngram generated
           files.  The extracted numbers are intended to appear in tables.

duplicates -- Duplicate lines are detected using simian.
	      A full evaluation version of this tool can be downloaded
	      from
	      www.redhillconsulting.com.au/products/simian

bldgra -- Build .gra and .d files from the raw data generated by
	  c_use.sh and h_use.sh.
	  The output is written to the directory diagrams.

bldtab -- Create the table information from the raw data generated by
	  c_use.sh and h_use.sh.
	  The output is written to the directory tables.

scripts -- Contains various general utility scripts.

diagrams  -- Holds the .gra and .d files generated by the various
	     mkgra.sh scripts.

tables -- The subdirectory tab_data holds the information generated
	  by the various mktab.sh scripts.
	  updTABLE.sh combines the table information and a stripped down
	  version of the book's text.  The tools to generate a pdf file
	  from this output are not part of the distribution.

thirdparty -- Various programs and scripts that might not be part
	      of an installation.

config.files -- Sets environment variables used by many shell scripts.


Derek Jones
derek@knosof.co.uk

