Computer scientist and compiler expert Alfred V. Aho is a man at the forefront of
computer science research. He has been involved in the development of programming
languages from his days working as the vice president of the Computing Sciences
Research Center at Bell Labs to his current position as Lawrence Gussman Professor
in the Computer Science Department at Columbia University.
As well as co-authoring the ‘Dragon’ book series, Aho was one of the three developers
of the AWK pattern matching language in the mid-1970s, along with Brian Kernighan
and Peter Weinberger.
Computerworld recently spoke to Professor Aho to learn more about the development
of AWK.
How did the idea/concept of the AWK language develop and come into practice?
As with a number of languages, it was born from the necessity to meet a need. As a researcher
at Bell Labs in the early 1970s, I found myself keeping track of budgets, and keeping track of
editorial correspondence. I was also teaching at a nearby university at the time, so I had to keep
track of student grades as well.
I wanted to have a simple little language in which I could write one- or two-line programs
to do these tasks. Brian Kernighan, a researcher next door to me at the Labs, also wanted to
create a similar language. We had daily conversations which culminated in a desire to create a
pattern-matching language suitable for simple data-processing tasks.
We were heavily influenced by grep, a popular string-matching utility on Unix, which had
been created in our research center. grep would search a file of text looking for lines matching
a pattern consisting of a limited form of regular expressions, and then print all lines in the file
that matched that regular expression.
We thought that we’d like to generalize the class of patterns to deal with numbers as well as
strings. We also thought that we’d like to have more computational capability than just printing
the line that matched the pattern.
So out of this grew AWK, a language based on the principle of pattern-action processing. It
was built to do simple data processing: the ordinary data processing that we routinely did on
a day-to-day basis. We just wanted to have a very simple scripting language that would allow
us, and people who weren’t very computer savvy, to be able to write throw-away programs for
routine data processing.
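As a rough illustration of what generalizing grep to numbers and computation can look like, the sketch below uses a made-up budget file (the item names and amounts are invented for this example, not taken from the interview). The pattern compares a field numerically, and the action does arithmetic instead of just printing the matching line.

```shell
# Hypothetical budget data: item name, then amount spent.
budget='travel 1200
books 300
equipment 4500'

# Pattern: second field greater than 1000. Action: add it to a running total.
# The END block runs once, after the last input line has been read.
result=$(printf '%s\n' "$budget" | awk '$2 > 1000 { total += $2 } END { print total }')
echo "$result"
```

grep could find the lines, but the comparison on a number and the running sum are the kind of extra capability described above.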
Were there any programs or languages that already had these functions at the time
you developed AWK?
Our original model was grep. But grep had a very limited form of pattern action processing,
so we generalized the capabilities of grep considerably. I was also interested at that time in
string pattern matching algorithms and context-free grammar parsing algorithms for compiler
applications. This means that you can see a certain similarity between what AWK does and
what the compiler construction tools lex and yacc do.
lex and yacc were tools that were built around string pattern matching algorithms that I
was working on: lex was designed to do lexical analysis and yacc syntax analysis. These tools
were compiler construction utilities which were widely used in Bell Labs, and later elsewhere,
to create all sorts of little languages. Brian Kernighan was using them to make languages for
typesetting mathematics and picture processing.
lex is a tool that looks for lexemes in input text. Lexemes are sequences of characters that
make up logical units. For example, a keyword like then in a programming language is a lexeme.
The character t by itself isn’t interesting, h by itself isn’t interesting, but the combination then
is interesting. One of the first tasks a compiler has to do is read the source program and group
its characters into lexemes.
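To make the idea of grouping characters into lexemes concrete, here is a minimal sketch written in AWK itself. It is only an illustration: lex generates a scanner from a specification of regular expressions, whereas this hand-written loop simply peels one lexeme at a time off the front of the line. The token names (KEYWORD, IDENT, NUMBER, SYMBOL) and the input line are assumptions made for the example.

```shell
result=$(printf '%s\n' 'if x1 then y = 42' | awk '
{
    line = $0
    while (line != "") {
        if (match(line, /^[ \t]+/)) {
            len = RLENGTH                      # whitespace separates lexemes
        } else if (match(line, /^[A-Za-z][A-Za-z0-9]*/)) {
            len = RLENGTH
            word = substr(line, 1, len)
            kind = (word == "if" || word == "then") ? "KEYWORD" : "IDENT"
            print kind, word
        } else if (match(line, /^[0-9]+/)) {
            len = RLENGTH
            print "NUMBER", substr(line, 1, len)
        } else {
            len = 1                            # any other single character
            print "SYMBOL", substr(line, 1, 1)
        }
        line = substr(line, len + 1)           # drop what was just consumed
    }
}')
echo "$result"
```

The letters t, h, e, n are never reported individually; only the whole word then comes out, tagged as a keyword.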
AWK was influenced by this kind of textual processing, but AWK was aimed at data-processing
tasks and it assumed very little background on the part of the user in terms of programming sophistication.
Can you provide Computerworld readers with a brief summary in your own words
of AWK as a language?
AWK is a language for processing files of text. A file is treated as a sequence of records, and by
default each line is a record. Each line is broken up into a sequence of fields, so we can think
of the first word in a line as the first field, the second word as the second field, and so on. An
AWK program is a sequence of pattern-action statements. AWK reads the input a line at a
time. A line is scanned for each pattern in the program, and for each pattern that matches, the
associated action is executed.
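The record-and-field model can be seen in a small sketch like the following. The two input lines are invented for the example. NR and NF are AWK's built-in variables for the current record number and the number of fields in it. Note that every pattern is tried against every line, so one line can trigger more than one action.

```shell
result=$(printf 'alpha beta gamma\ndelta epsilon\n' | awk '
NF == 3   { print "line " NR ": first field is " $1 }
/epsilon/ { print "line " NR " mentions epsilon" }
NR > 0    { print "line " NR " has " NF " fields" }
')
echo "$result"
```

Here the first line matches the first and third patterns, and the second line matches the second and third.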
A simple example should make this clear. Suppose we have a file in which each line is a
name followed by a phone number. Let’s say the file contains the line Naomi 1234. In the AWK
program the first field is referred to as $1, the second field as $2, and so on. Thus, we can create an
AWK program to retrieve Naomi’s phone number by simply writing $1 == "Naomi" {print $2}
which means if the first field matches Naomi, then print the second field. Now you’re an AWK
programmer! If you typed that program into AWK and presented it with the file that had names
and phone numbers, then it would print 1234 as Naomi’s phone number.
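For readers who want to try this, the sketch below runs that one-line program from a shell. The Naomi line comes from the example above; the second entry is made up so that there is something for the pattern to skip. The data is piped in to keep the example self-contained, but the program works the same way when given a file name.

```shell
# Phone list: the "Brian" entry is a hypothetical addition.
phones='Naomi 1234
Brian 5678'

result=$(printf '%s\n' "$phones" | awk '$1 == "Naomi" { print $2 }')
echo "$result"
```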
A typical AWK program would have several pattern-action statements. The patterns can
be Boolean combinations of strings and numbers; the actions can be statements in a C-like
programming language.
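A sketch of a Boolean pattern with a C-like action might look like this. The grade file (a name followed by two scores) and the pass mark of 60 are assumptions chosen for the example.

```shell
# Hypothetical grade data: student name, then two scores.
grades='Ann 91 85
Bob 58 72
Cy 40 70'

result=$(printf '%s\n' "$grades" | awk '
$2 < 60 || $3 < 60 {
    avg = ($2 + $3) / 2
    if (avg >= 60) {
        print $1, "passes on average", avg
    } else {
        print $1, "needs attention", avg
    }
}')
echo "$result"
```

The pattern combines two numeric tests with ||, and the action uses assignment and if/else much as a C program would.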
AWK became popular since it was one of the standard programs that came with every Unix
system.
What are you most proud of in the development of AWK?
AWK was developed by three people: me, Brian Kernighan and Peter Weinberger. Peter Weinberger
was interested in what Brian and I were doing right from the start. We had created a
grammatical specification for AWK but hadn’t yet created the full run-time environment. Weinberger
came along and said ‘hey, this looks like a language I could use myself,’ and within a week
he created a working run time for AWK. This initial form of AWK was very useful for writing
the data processing routines that we were all interested in but more importantly it provided an
evolvable platform for the language.
One of the most interesting parts of this project for me was that I got to know how Kernighan
and Weinberger thought about language design: it was a really enlightening process! With the
flexible compiler construction tools we had at our disposal, we very quickly evolved the language
to adopt new useful syntactic and semantic constructs. We spent a whole year intensely debating
what constructs should and shouldn’t be in the language.
Language design is a very personal activity and each person brings to a language the classes
of problems that they’d like to solve, and the manner in which they’d like them to be solved. I
had a lot of fun creating AWK, and working with Kernighan and Weinberger was one of the most
stimulating experiences of my career. I also learned, however, that I would not want to get into
a programming contest with either of them! Their programming abilities are formidable.
Interestingly, we did not intend the language to be used except by the three of us. But very
quickly we discovered lots of other people had the need for the routine kind of data processing
that AWK was good for. People didn’t want to write hundred-line C programs to do data
processing that could be done with a few lines of AWK, so lots of people started using AWK.
For many years AWK was one of the most popular commands on Unix, and today, even
though a number of other similar languages have come on the scene, AWK still ranks among
the top 25 or 30 most popular programming languages in the world. And it all began as a little
exercise to create a utility that the three of us would find useful for our own use.
How do you feel about AWK being so popular?
I am very happy that other people have found AWK useful. And not only did AWK attract
a lot of users, other language designers later used it as a model for developing more powerful
languages.
About 10 years after AWK was created, Larry Wall created a language called Perl, which was patterned after AWK and some other Unix commands. Perl is now one of the most popular
programming languages in the world. So not only was AWK popular when it was introduced but
it also stimulated the creation of other popular languages.
AWK has inspired many other languages as you’ve already mentioned: why do you
think this is?
What made AWK popular initially was its simplicity and the kinds of tasks it was built to do. It
has a very simple programming model. The idea of pattern-action programming is very natural
for people. We also made the language compatible with pipes in Unix. The actions in AWK are
really simple forms of C programs. You can write a simple action like {print $2} or you can
write a much more complex C-like program as an action associated with a pattern. Some Wall
Street financial houses used AWK when it first came out to balance their books because it was
so easy to write data-processing programs in AWK.
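Compatibility with pipes means an AWK program can sit in the middle of a pipeline, reading standard input and writing standard output for the next command. The sketch below uses an invented ledger (item, quantity, unit price) and the standard sort and head utilities; none of the data comes from the interview.

```shell
# AWK computes a total per line; sort and head then pick the smallest one.
result=$(printf 'widgets 3 2.50\ngadgets 10 1.00\nsprockets 1 9.99\n' |
    awk '{ printf "%s %.2f\n", $1, $2 * $3 }' |
    sort -k2,2n |
    head -n 1)
echo "$result"
```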
AWK turned a number of people into programmers because the learning curve for the language
was very shallow. Even today a large number of people continue to use AWK, saying languages
such as Perl have become too complicated. Some say Perl has become such a complex language
that it’s become almost impossible to understand the programs once they’ve been written.
Another advantage of AWK is that the language is stable. We haven’t changed it since the
mid 1980s. And there are also lots of other people who’ve implemented versions of AWK on
different platforms such as Windows.
How did you determine the order of initials in AWK?
This was not our choice. When our research colleagues saw the three of us in one or another’s
office, they’d walk by the open door and say ‘AWK! AWK!’ So, we called the language AWK
because of the good natured ribbing we received from our colleagues. We also thought it was a
great name, and we put the auk bird picture on the AWK book when we published it.
What did you learn from developing AWK that you still apply in your work today?
My research specialties include algorithms and programming languages. Many more people know
me for AWK as they’ve used it personally. Fewer people know me for my theoretical papers even
though they may be using the algorithms in them that have been implemented in various tools.
One of the nice things about AWK is that it incorporates efficient string pattern matching
algorithms that I was working on at the time we developed AWK. These pattern matching
algorithms are also found in other Unix utilities such as egrep and fgrep, two string-matching
tools I had written when I was experimenting with string pattern matching algorithms.
What AWK represents is a beautiful marriage of theory and practice. The best engineering is
often built on top of a sound scientific foundation. In AWK we have taken expressive notations
and efficient algorithms founded in computer science and engineered them to run well in practice.
I feel you gain wisdom by working with great people. Brian Kernighan is a master of useful
programming language design. His basic precept of language design is to keep a language simple,
so that a language is easy to understand and easy to use. I think this is great advice for any
language designer.
Have you had any surprises in the way that AWK has developed over the years?
One Monday morning I walked into my office to find a person from the Bell Labs micro-electronics
product division who had used AWK to create a multi-thousand-line computer-aided design
system. I was just stunned. I thought that no one would ever write an AWK program with more
than a handful of statements. But he had written a powerful CAD development system in AWK
because he could do it so quickly and with such facility. My biggest surprise is that AWK has
been used in many different applications that none of us had initially envisaged. But perhaps
that’s the sign of a good tool, as you use a screwdriver for many more things than turning screws.
Do you still work with AWK today?
Since it’s so useful for routine data processing I use it daily. For example, I use it whenever
I’m writing papers and books. Because it has associative arrays, I have a simple two-line AWK
program that translates symbolically named figures and examples into numerically encoded figures and examples; for instance, it translates Figure AWK-program into Figure 1.1. This AWK
program allows me to rearrange and renumber figures and examples at will in my papers and
books. I once saw a paper that had a 1000-line C program that had less functionality than these two lines
of AWK. The economy of expression you can get from AWK can be very impressive.
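The actual two-line program is not shown in the interview, so the following is only a guess at the general technique, not Professor Aho's code. It assumes that a symbolic name always follows the word Figure as a separate field, that figures are numbered in order of first mention, and that the chapter prefix is a hard-coded 1. An associative array (here called num, a name chosen for this sketch) maps each symbolic name to its number.

```shell
text='See Figure AWK-program for the code.
Figure grep-example shows the older tool.
Compare Figure AWK-program again.'

result=$(printf '%s\n' "$text" | awk '
{
    for (i = 1; i < NF; i++) {
        if ($i == "Figure") {
            name = $(i + 1)
            if (!(name in num)) {
                num[name] = "1." (++count)   # first mention gets the next number
            }
            $(i + 1) = num[name]             # replace the symbolic name in place
        }
    }
    print
}')
echo "$result"
```

Because the array is indexed by the symbolic name, moving a figure earlier or later in the text renumbers every reference to it automatically on the next run.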
How has being one of the three creators of AWK impacted your career?
As I said, many programmers know me for AWK, but the computer science research community
is much more familiar with my theoretical work. So I initially viewed the creation of AWK as a
learning experience and a diversion rather than part of my regular research activities. However,
the experience of implementing AWK has greatly influenced how I now teach programming
languages and compilers, and software engineering.
What I’ve noticed is that some scientists aren’t as well known for their primary field of research
by the world at large as they are for their useful tools. Don Knuth, for example, is one of the
world’s foremost computer scientists, a founder of the field of computer algorithms. However, he
developed a language for typesetting technical papers, called TeX. This wasn’t his main avenue
of research but TeX became very widely used throughout the world by many scientists outside of
computer science. Knuth was passionate about having a mathematical typesetting system that
could be used to produce beautiful looking papers and books.
Many other computer science researchers have developed useful programming languages as
a by-product of their main line of research as well. As another example, Bjarne Stroustrup
developed the widely used C++ programming language because he wanted to write network
simulators.
Would you do anything differently in the development of AWK looking back?
One of the things that I would have done differently is instituting rigorous testing as we started
to develop the language. We initially created AWK as a throw-away language, so we didn’t do
rigorous quality control as part of our initial implementation.
I mentioned to you earlier that there was a person who wrote a CAD system in AWK. The
reason he initially came to see me was to report a bug in the AWK compiler. He was very testy
with me saying I had wasted three weeks of his life, as he had been looking for a bug in his own
code only to discover that it was a bug in the AWK compiler! I huddled with Brian Kernighan
after this, and we agreed we really needed to do something differently in terms of quality control.
So we instituted a rigorous regression test for all of the features of AWK. From then on, any of
the three of us who put a new feature into the language first had to write a test for the new
feature.
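The interview does not describe what those tests looked like, so the following is only a sketch of the general idea of a regression test for a language feature: run a small program on a known input and compare the output with what is expected. The check helper, the test names, and the three cases are all hypothetical.

```shell
failures=0

# check NAME PROGRAM INPUT EXPECTED
# Runs one AWK program on one line of input and compares the output.
check() {
    actual=$(printf '%s\n' "$3" | awk "$2")
    if [ "$actual" = "$4" ]; then
        echo "ok   $1"
    else
        echo "FAIL $1: expected '$4', got '$actual'"
        failures=$((failures + 1))
    fi
}

check "field access"    '{ print $2 }'                     'a b c' 'b'
check "numeric pattern" '$1 > 10 { print "big" }'          '42'    'big'
check "assoc arrays"    '{ n[$1]++ } END { print n["x"] }' 'x'     '1'

echo "$failures failure(s)"
```

Adding a feature under this discipline means adding another check line first, so a later change that breaks the feature shows up as a failure instead of costing a user three weeks.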
I have been teaching the programming languages and compilers course at Columbia University
for many years. The course has a semester-long project in which students work in teams
of four or five to design their own innovative little language and to make a compiler for it.
Students coming into the course have never looked inside a compiler before, but in all the
years I’ve been teaching this course, never has a team failed to deliver a working compiler at the
end of the course. All of this is due to the experience I had in developing AWK with Kernighan
and Weinberger. In addition to learning the principles of language and compiler design, the
students learn good software engineering practices. Rigorous testing is something students do
from the start. The students also learn the elements of project management, teamwork, and
communication skills, both oral and written. So from that perspective AWK has significantly
influenced how I teach programming languages and compilers and software development.