File: gawk.info, Node: Top, Next: Foreword, Up: (dir)

General Introduction
********************

This file documents `awk', a program that you can use to select
particular records in a file and perform operations upon them.

This is Edition 3 of `GAWK: Effective AWK Programming: A User's
Guide for GNU Awk', for the 3.1.3 (or later) version of the GNU
implementation of AWK.

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation; with the
Invariant Sections being "GNU General Public License", the Front-Cover
Texts being (a) (see below), and with the Back-Cover Texts being (b)
(see below). A copy of the license is included in the section entitled
"GNU Free Documentation License".

a. "A GNU Manual"

b. "You have freedom to copy and modify this GNU Manual, like GNU
software. Copies published by the Free Software Foundation raise
funds for GNU development."

* Menu:

* Foreword:: Some nice words about this
Info file.
* Preface:: What this Info file is about; brief
history and acknowledgments.
* Getting Started:: A basic introduction to using
`awk'. How to run an `awk'
program. Command-line syntax.
* Regexp:: All about matching things using regular
expressions.
* Reading Files:: How to read files and manipulate fields.
* Printing:: How to print using `awk'. Describes
the `print' and `printf'
statements. Also describes redirection of
output.
* Expressions:: Expressions are the basic building blocks
of statements.
* Patterns and Actions:: Overviews of patterns and actions.
* Arrays:: The description and use of arrays. Also
includes array-oriented control statements.
* Functions:: Built-in and user-defined functions.
* Internationalization:: Getting `gawk' to speak your
language.
* Advanced Features:: Stuff for advanced users, specific to
`gawk'.
* Invoking Gawk:: How to run `gawk'.
* Library Functions:: A Library of `awk' Functions.
* Sample Programs:: Many `awk' programs with complete
explanations.
* Language History:: The evolution of the `awk'
language.
* Installation:: Installing `gawk' under various
operating systems.
* Notes:: Notes about `gawk' extensions and
possible future work.
* Basic Concepts:: A very quick introduction to programming
concepts.
* Glossary:: An explanation of some unfamiliar terms.
* Copying:: Your right to copy and distribute
`gawk'.
* GNU Free Documentation License:: The license for this Info file.
* Index:: Concept and Variable Index.

* History:: The history of `gawk' and
`awk'.
* Names:: What name to use to find `awk'.
* This Manual:: Using this Info file. Includes
sample input files that you can use.
* Conventions:: Typographical Conventions.
* Manual History:: Brief history of the GNU project and this
Info file.
* How To Contribute:: Helping to save the world.
* Acknowledgments:: Acknowledgments.
* Running gawk:: How to run `gawk' programs;
includes command-line syntax.
* One-shot:: Running a short throwaway `awk'
program.
* Read Terminal:: Using no input files (input from terminal
instead).
* Long:: Putting permanent `awk' programs in
files.
* Executable Scripts:: Making self-contained `awk'
programs.
* Comments:: Adding documentation to `gawk'
programs.
* Quoting:: More discussion of shell quoting issues.
* Sample Data Files:: Sample data files for use in the
`awk' programs illustrated in this
Info file.
* Very Simple:: A very simple example.
* Two Rules:: A less simple one-line example using two
rules.
* More Complex:: A more complex example.
* Statements/Lines:: Subdividing or combining statements into
lines.
* Other Features:: Other Features of `awk'.
* When:: When to use `gawk' and when to use
other things.
* Regexp Usage:: How to Use Regular Expressions.
* Escape Sequences:: How to write nonprinting characters.
* Regexp Operators:: Regular Expression Operators.
* Character Lists:: What can go between `[...]'.
* GNU Regexp Operators:: Operators specific to GNU software.
* Case-sensitivity:: How to do case-insensitive matching.
* Leftmost Longest:: How much text matches.
* Computed Regexps:: Using Dynamic Regexps.
* Locales:: How the locale affects things.
* Records:: Controlling how data is split into records.
* Fields:: An introduction to fields.
* Nonconstant Fields:: Nonconstant Field Numbers.
* Changing Fields:: Changing the Contents of a Field.
* Field Separators:: The field separator and how to change it.
* Regexp Field Splitting:: Using regexps as the field separator.
* Single Character Fields:: Making each character a separate field.
* Command Line Field Separator:: Setting `FS' from the command-line.
* Field Splitting Summary:: Some final points and a summary table.
* Constant Size:: Reading constant width data.
* Multiple Line:: Reading multi-line records.
* Getline:: Reading files under explicit program
control using the `getline' function.
* Plain Getline:: Using `getline' with no arguments.
* Getline/Variable:: Using `getline' into a variable.
* Getline/File:: Using `getline' from a file.
* Getline/Variable/File:: Using `getline' into a variable from a
file.
* Getline/Pipe:: Using `getline' from a pipe.
* Getline/Variable/Pipe:: Using `getline' into a variable from a
pipe.
* Getline/Coprocess:: Using `getline' from a coprocess.
* Getline/Variable/Coprocess:: Using `getline' into a variable from a
coprocess.
* Getline Notes:: Important things to know about
`getline'.
* Getline Summary:: Summary of `getline' Variants.
* Print:: The `print' statement.
* Print Examples:: Simple examples of `print' statements.
* Output Separators:: The output separators and how to change
them.
* OFMT:: Controlling Numeric Output With
`print'.
* Printf:: The `printf' statement.
* Basic Printf:: Syntax of the `printf' statement.
* Control Letters:: Format-control letters.
* Format Modifiers:: Format-specification modifiers.
* Printf Examples:: Several examples.
* Redirection:: How to redirect output to multiple files
and pipes.
* Special Files:: File name interpretation in `gawk'.
`gawk' allows access to inherited
file descriptors.
* Special FD:: Special files for I/O.
* Special Process:: Special files for process information.
* Special Network:: Special files for network communications.
* Special Caveats:: Things to watch out for.
* Close Files And Pipes:: Closing Input and Output Files and Pipes.
* Constants:: String, numeric and regexp constants.
* Scalar Constants:: Numeric and string constants.
* Nondecimal-numbers:: What are octal and hex numbers.
* Regexp Constants:: Regular Expression constants.
* Using Constant Regexps:: When and how to use a regexp constant.
* Variables:: Variables give names to values for later
use.
* Using Variables:: Using variables in your programs.
* Assignment Options:: Setting variables on the command-line and a
summary of command-line syntax. This is an
advanced method of input.
* Conversion:: The conversion of strings to numbers and
vice versa.
* Arithmetic Ops:: Arithmetic operations (`+', `-',
etc.)
* Concatenation:: Concatenating strings.
* Assignment Ops:: Changing the value of a variable or a
field.
* Increment Ops:: Incrementing the numeric value of a
variable.
* Truth Values:: What is ``true'' and what is ``false''.
* Typing and Comparison:: How variables acquire types and how this
affects comparison of numbers and strings
with `<', etc.
* Boolean Ops:: Combining comparison expressions using
boolean operators `||' (``or''),
`&&' (``and'') and `!' (``not'').
* Conditional Exp:: Conditional expressions select between two
subexpressions under control of a third
subexpression.
* Function Calls:: A function call is an expression.
* Precedence:: How various operators nest.
* Pattern Overview:: What goes into a pattern.
* Regexp Patterns:: Using regexps as patterns.
* Expression Patterns:: Any expression can be used as a pattern.
* Ranges:: Pairs of patterns specify record ranges.
* BEGIN/END:: Specifying initialization and cleanup
rules.
* Using BEGIN/END:: How and why to use BEGIN/END rules.
* I/O And BEGIN/END:: I/O issues in BEGIN/END rules.
* Empty:: The empty pattern, which matches every
record.
* Using Shell Variables:: How to use shell variables with
`awk'.
* Action Overview:: What goes into an action.
* Statements:: Describes the various control statements in
detail.
* If Statement:: Conditionally execute some `awk'
statements.
* While Statement:: Loop until some condition is satisfied.
* Do Statement:: Do specified action while looping until
some condition is satisfied.
* For Statement:: Another looping statement, that provides
initialization and increment clauses.
* Switch Statement:: Switch/case evaluation for conditional
execution of statements based on a value.
* Break Statement:: Immediately exit the innermost enclosing
loop.
* Continue Statement:: Skip to the end of the innermost enclosing
loop.
* Next Statement:: Stop processing the current input record.
* Nextfile Statement:: Stop processing the current file.
* Exit Statement:: Stop execution of `awk'.
* Built-in Variables:: Summarizes the built-in variables.
* User-modified:: Built-in variables that you change to
control `awk'.
* Auto-set:: Built-in variables where `awk'
gives you information.
* ARGC and ARGV:: Ways to use `ARGC' and `ARGV'.
* Array Intro:: Introduction to Arrays
* Reference to Elements:: How to examine one element of an array.
* Assigning Elements:: How to change an element of an array.
* Array Example:: Basic Example of an Array
* Scanning an Array:: A variation of the `for' statement. It
loops through the indices of an array's
existing elements.
* Delete:: The `delete' statement removes an
element from an array.
* Numeric Array Subscripts:: How to use numbers as subscripts in
`awk'.
* Uninitialized Subscripts:: Using Uninitialized variables as
subscripts.
* Multi-dimensional:: Emulating multidimensional arrays in
`awk'.
* Multi-scanning:: Scanning multidimensional arrays.
* Array Sorting:: Sorting array values and indices.
* Built-in:: Summarizes the built-in functions.
* Calling Built-in:: How to call built-in functions.
* Numeric Functions:: Functions that work with numbers, including
`int', `sin' and `rand'.
* String Functions:: Functions for string manipulation, such as
`split', `match' and
`sprintf'.
* Gory Details:: More than you want to know about `\'
and `&' with `sub', `gsub',
and `gensub'.
* I/O Functions:: Functions for files and shell commands.
* Time Functions:: Functions for dealing with timestamps.
* Bitwise Functions:: Functions for bitwise operations.
* I18N Functions:: Functions for string translation.
* User-defined:: Describes User-defined functions in detail.
* Definition Syntax:: How to write definitions and what they
mean.
* Function Example:: An example function definition and what it
does.
* Function Caveats:: Things to watch out for.
* Return Statement:: Specifying the value a function returns.
* Dynamic Typing:: How variable types can change at runtime.
* I18N and L10N:: Internationalization and Localization.
* Explaining gettext:: How GNU `gettext' works.
* Programmer i18n:: Features for the programmer.
* Translator i18n:: Features for the translator.
* String Extraction:: Extracting marked strings.
* Printf Ordering:: Rearranging `printf' arguments.
* I18N Portability:: `awk'-level portability issues.
* I18N Example:: A simple i18n example.
* Gawk I18N:: `gawk' is also internationalized.
* Nondecimal Data:: Allowing nondecimal input data.
* Two-way I/O:: Two-way communications with another
process.
* TCP/IP Networking:: Using `gawk' for network
programming.
* Portal Files:: Using `gawk' with BSD portals.
* Profiling:: Profiling your `awk' programs.
* Command Line:: How to run `awk'.
* Options:: Command-line options and their meanings.
* Other Arguments:: Input file names and variable assignments.
* AWKPATH Variable:: Searching directories for `awk'
programs.
* Obsolete:: Obsolete Options and/or features.
* Undocumented:: Undocumented Options and Features.
* Known Bugs:: Known Bugs in `gawk'.
* Library Names:: How to best name private global variables
in library functions.
* General Functions:: Functions that are of general use.
* Nextfile Function:: Two implementations of a `nextfile'
function.
* Assert Function:: A function for assertions in `awk'
programs.
* Round Function:: A function for rounding if `sprintf'
does not do it correctly.
* Cliff Random Function:: The Cliff Random Number Generator.
* Ordinal Functions:: Functions for using characters as numbers
and vice versa.
* Join Function:: A function to join an array into a string.
* Gettimeofday Function:: A function to get formatted times.
* Data File Management:: Functions for managing command-line data
files.
* Filetrans Function:: A function for handling data file
transitions.
* Rewind Function:: A function for rereading the current file.
* File Checking:: Checking that data files are readable.
* Empty Files:: Checking for zero-length files.
* Ignoring Assigns:: Treating assignments as file names.
* Getopt Function:: A function for processing command-line
arguments.
* Passwd Functions:: Functions for getting user information.
* Group Functions:: Functions for getting group information.
* Running Examples:: How to run these examples.
* Clones:: Clones of common utilities.
* Cut Program:: The `cut' utility.
* Egrep Program:: The `egrep' utility.
* Id Program:: The `id' utility.
* Split Program:: The `split' utility.
* Tee Program:: The `tee' utility.
* Uniq Program:: The `uniq' utility.
* Wc Program:: The `wc' utility.
* Miscellaneous Programs:: Some interesting `awk' programs.
* Dupword Program:: Finding duplicated words in a document.
* Alarm Program:: An alarm clock.
* Translate Program:: A program similar to the `tr'
utility.
* Labels Program:: Printing mailing labels.
* Word Sorting:: A program to produce a word usage count.
* History Sorting:: Eliminating duplicate entries from a
history file.
* Extract Program:: Pulling out programs from Texinfo source
files.
* Simple Sed:: A Simple Stream Editor.
* Igawk Program:: A wrapper for `awk' that includes
files.
* V7/SVR3.1:: The major changes between V7 and System V
Release 3.1.
* SVR4:: Minor changes between System V Releases 3.1
and 4.
* POSIX:: New features from the POSIX standard.
* BTL:: New features from the Bell Laboratories
version of `awk'.
* POSIX/GNU:: The extensions in `gawk' not in
POSIX `awk'.
* Contributors:: The major contributors to `gawk'.
* Gawk Distribution:: What is in the `gawk' distribution.
* Getting:: How to get the distribution.
* Extracting:: How to extract the distribution.
* Distribution contents:: What is in the distribution.
* Unix Installation:: Installing `gawk' under various
versions of Unix.
* Quick Installation:: Compiling `gawk' under Unix.
* Additional Configuration Options:: Other compile-time options.
* Configuration Philosophy:: How it's all supposed to work.
* Non-Unix Installation:: Installation on Other Operating Systems.
* Amiga Installation:: Installing `gawk' on an Amiga.
* BeOS Installation:: Installing `gawk' on BeOS.
* PC Installation:: Installing and Compiling `gawk' on
MS-DOS and OS/2.
* PC Binary Installation:: Installing a prepared distribution.
* PC Compiling:: Compiling `gawk' for MS-DOS, Windows32,
and OS/2.
* PC Using:: Running `gawk' on MS-DOS, Windows32 and
OS/2.
* PC Dynamic:: Compiling `gawk' for dynamic
libraries.
* Cygwin:: Building and running `gawk' for
Cygwin.
* VMS Installation:: Installing `gawk' on VMS.
* VMS Compilation:: How to compile `gawk' under VMS.
* VMS Installation Details:: How to install `gawk' under VMS.
* VMS Running:: How to run `gawk' under VMS.
* VMS POSIX:: Alternate instructions for VMS POSIX.
* Unsupported:: Systems whose ports are no longer
supported.
* Atari Installation:: Installing `gawk' on the Atari ST.
* Atari Compiling:: Compiling `gawk' on Atari.
* Atari Using:: Running `gawk' on Atari.
* Tandem Installation:: Installing `gawk' on a Tandem.
* Bugs:: Reporting Problems and Bugs.
* Other Versions:: Other freely available `awk'
implementations.
* Compatibility Mode:: How to disable certain `gawk'
extensions.
* Additions:: Making Additions To `gawk'.
* Adding Code:: Adding code to the main body of
`gawk'.
* New Ports:: Porting `gawk' to a new operating
system.
* Dynamic Extensions:: Adding new built-in functions to
`gawk'.
* Internals:: A brief look at some `gawk'
internals.
* Sample Library:: An example of new functions.
* Internal File Description:: What the new functions will do.
* Internal File Ops:: The code for internal file operations.
* Using Internal File Ops:: How to use an external extension.
* Future Extensions:: New features that may be implemented one
day.
* Basic High Level:: The high level view.
* Basic Data Typing:: A very quick intro to data types.
* Floating Point Issues:: Stuff to know about floating-point numbers.

To Miriam, for making me complete.

To Chana, for the joy you bring us.

To Rivka, for the exponential increase.

To Nachum, for the added dimension.

To Malka, for the new beginning.

File: gawk.info, Node: Foreword, Next: Preface, Prev: Top, Up: Top

Foreword
********

Arnold Robbins and I are good friends. We were introduced 11 years
ago by circumstances--and our favorite programming language, AWK. The
circumstances started a couple of years earlier. I was working at a new
job and noticed an unplugged Unix computer sitting in the corner. No
one knew how to use it, and neither did I. However, a couple of days
later it was running, and I was `root' and the one-and-only user. That
day, I began the transition from statistician to Unix programmer.

On one of many trips to the library or bookstore in search of books
on Unix, I found the gray AWK book, a.k.a. Aho, Kernighan and
Weinberger, `The AWK Programming Language', Addison-Wesley, 1988.
AWK's simple programming paradigm--find a pattern in the input and then
perform an action--often reduced complex or tedious data manipulations
to a few lines of code. I was excited to try my hand at programming in
AWK.

Alas, the `awk' on my computer was a limited version of the
language described in the AWK book. I discovered that my computer had
"old `awk'" and the AWK book described "new `awk'." I learned that
this was typical; the old version refused to step aside or relinquish
its name. If a system had a new `awk', it was invariably called
`nawk', and few systems had it. The best way to get a new `awk' was to
`ftp' the source code for `gawk' from `prep.ai.mit.edu'. `gawk' was a
version of new `awk' written by David Trueman and Arnold, and available
under the GNU General Public License.

(Incidentally, it's no longer difficult to find a new `awk'. `gawk'
ships with Linux, and you can download binaries or source code for
almost any system; my wife uses `gawk' on her VMS box.)

My Unix system started out unplugged from the wall; it certainly was
not plugged into a network. So, oblivious to the existence of `gawk'
and the Unix community in general, and desiring a new `awk', I wrote my
own, called `mawk'. Before I was finished I knew about `gawk', but it
was too late to stop, so I eventually posted to a `comp.sources'
newsgroup.

A few days after my posting, I got a friendly email from Arnold
introducing himself. He suggested we share design and algorithms and
attached a draft of the POSIX standard so that I could update `mawk' to
support language extensions added after publication of the AWK book.

Frankly, if our roles had been reversed, I would not have been so
open and we probably would have never met. I'm glad we did meet. He
is an AWK expert's AWK expert and a genuinely nice person. Arnold
contributes significant amounts of his expertise and time to the Free
Software Foundation.

This book is the `gawk' reference manual, but at its core it is a
book about AWK programming that will appeal to a wide audience. It is
a definitive reference to the AWK language as defined by the 1987 Bell
Labs release and codified in the 1992 POSIX Utilities standard.

On the other hand, the novice AWK programmer can study a wealth of
practical programs that emphasize the power of AWK's basic idioms: data
driven control-flow, pattern matching with regular expressions, and
associative arrays. Those looking for something new can try out
`gawk''s interface to network protocols via special `/inet' files.

The programs in this book make clear that an AWK program is
typically much smaller and faster to develop than a counterpart written
in C. Consequently, there is often a payoff to prototype an algorithm
or design in AWK to get it running quickly and expose problems early.
Often, the interpreted performance is adequate and the AWK prototype
becomes the product.

The new `pgawk' (profiling `gawk') produces program execution
counts. I recently experimented with an algorithm that, for n lines of
input, exhibited ~ C n^2 performance, while theory predicted ~ C n log n
behavior. A few minutes poring over the `awkprof.out' profile
pinpointed the problem to a single line of code. `pgawk' is a welcome
addition to my programmer's toolbox.

Arnold has distilled over a decade of experience writing and using
AWK programs, and developing `gawk', into this book. If you use AWK or
want to learn how, then read this book.

Michael Brennan
Author of `mawk'

File: gawk.info, Node: Preface, Next: Getting Started, Prev: Foreword, Up: Top

Preface
*******

Several kinds of tasks occur repeatedly when working with text files.
You might want to extract certain lines and discard the rest. Or you
may need to make changes wherever certain patterns appear, but leave
the rest of the file alone. Writing single-use programs for these
tasks in languages such as C, C++, or Pascal is time-consuming and
inconvenient. Such jobs are often easier with `awk'. The `awk'
utility interprets a special-purpose programming language that makes it
easy to handle simple data-reformatting jobs.
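
As a minimal sketch of the two tasks just described, the following
commands run `awk' on a small, made-up input. The sample text and the
words `error' and `colour' are assumptions chosen only for
illustration; they do not come from this Info file.

```shell
# Build a small sample input (hypothetical data, for illustration only).
sample='ok: disk check
error: disk full
ok: colour profile loaded
error: colour table missing'

# Task 1: extract certain lines and discard the rest.
# The pattern /^error/ selects records; with no action, awk prints them.
printf '%s\n' "$sample" | awk '/^error/'

# Task 2: change text wherever a pattern appears, leaving the rest alone.
# sub() edits the current record in place; the bare pattern 1 is always
# true, so every record (changed or not) is printed.
printf '%s\n' "$sample" | awk '{ sub(/colour/, "color") } 1'
```

Each task fits in a single short `awk' program, with no compile step
and no explicit code for opening or reading the input.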

The GNU implementation of `awk' is called `gawk'; it is fully
compatible with the System V Release 4 version of `awk'. `gawk' is
also compatible with the POSIX specification of the `awk' language.
This means that all properly written `awk' programs should work with
`gawk'. Thus, we usually don't distinguish between `gawk' and other
`awk' implementations.

Using `awk' allows you to:

* Manage small, personal databases

* Generate reports

* Validate data

* Produce indexes and perform other document preparation tasks

* Experiment with algorithms that you can adapt later to other
computer languages

In addition, `gawk' provides facilities that make it easy to:

* Extract bits and pieces of data for processing

* Sort data

* Perform simple network communications

This Info file teaches you about the `awk' language and how you can
use it effectively. You should already be familiar with basic system
commands, such as `cat' and `ls',(1) as well as basic shell facilities,
such as input/output (I/O) redirection and pipes.

Implementations of the `awk' language are available for many
different computing environments. This Info file, while describing the
`awk' language in general, also describes the particular implementation
of `awk' called `gawk' (which stands for "GNU awk"). `gawk' runs on a
broad range of Unix systems, from 80386 PC-based computers up
through large-scale systems, such as Crays. `gawk' has also been ported
to Mac OS X, MS-DOS, Microsoft Windows (all versions) and OS/2 PCs,
Atari and Amiga microcomputers, BeOS, Tandem D20, and VMS.

---------- Footnotes ----------

(1) These commands are available on POSIX-compliant systems, as well
as on traditional Unix-based systems. If you are using some other
operating system, you still need to be familiar with the ideas of I/O
redirection and pipes.

File: gawk.info, Node: History, Next: Names, Up: Preface

History of `awk' and `gawk'
===========================

Recipe For A Programming Language

1 part  `egrep'     1 part  `snobol'
2 parts `ed'        3 parts C

Blend all parts well using `lex' and `yacc'. Document minimally
and release.

After eight years, add another part `egrep' and two more parts C.
Document very well and release.

The name `awk' comes from the initials of its designers: Alfred V.
Aho, Peter J. Weinberger and Brian W. Kernighan. The original version
of `awk' was written in 1977 at AT&T Bell Laboratories. In 1985, a new
version made the programming language more powerful, introducing
user-defined functions, multiple input streams, and computed regular
expressions. This new version became widely available with Unix System
V Release 3.1 (SVR3.1). The version in SVR4 added some new features
and cleaned up the behavior in some of the "dark corners" of the
language. The specification for `awk' in the POSIX Command Language
and Utilities standard further clarified the language. Both the `gawk'
designers and the original Bell Laboratories `awk' designers provided
feedback for the POSIX specification.

Paul Rubin wrote the GNU implementation, `gawk', in 1986. Jay
Fenlason completed it, with advice from Richard Stallman. John Woods
contributed parts of the code as well. In 1988 and 1989, David
Trueman, with help from me, thoroughly reworked `gawk' for compatibility
with the newer `awk'. Circa 1995, I became the primary maintainer.
Current development focuses on bug fixes, performance improvements,
standards compliance, and occasionally, new features.

In May of 1997, Jürgen Kahrs felt the need for network access from
`awk', and with a little help from me, set about adding features to do
this for `gawk'. At that time, he also wrote the bulk of `TCP/IP
Internetworking with `gawk'' (a separate document, available as part of
the `gawk' distribution). His code finally became part of the main
`gawk' distribution with `gawk' version 3.1.

*Note Contributors::, for a complete list of those who made
important contributions to `gawk'.

File: gawk.info, Node: Names, Next: This Manual, Prev: History, Up: Preface

A Rose by Any Other Name
========================

The `awk' language has evolved over the years. Full details are
provided in *Note Language History::. The language described in this
Info file is often referred to as "new `awk'" (`nawk').

Because of this, many systems have multiple versions of `awk'. Some
systems have an `awk' utility that implements the original version of
the `awk' language and a `nawk' utility for the new version. Others
have an `oawk' version for the "old `awk'" language and plain `awk' for
the new one. Still others only have one version, which is usually the
new one.(1)

All in all, this makes it difficult for you to know which version of
`awk' you should run when writing your programs. The best advice I can
give here is to check your local documentation. Look for `awk', `oawk',
and `nawk', as well as for `gawk'. It is likely that you already have
some version of new `awk' on your system, which is what you should use
when running your programs. (Of course, if you're reading this Info
file, chances are good that you have `gawk'!)
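
If you also have a shell at hand, one quick complement to reading the
local documentation is to ask which of these names resolve to a command
on your search path. The following is only a sketch: it assumes a
POSIX-style shell that provides `command -v', the function name
`list_awk_names' is made up for this example, and the result tells you
what is installed, not which version of the language each program
implements.

```shell
# Report which of the usual awk names exist on the search path.
# list_awk_names is a hypothetical helper, not part of any awk distribution.
list_awk_names() {
    for name in awk oawk nawk gawk; do
        if path=$(command -v "$name" 2>/dev/null); then
            printf '%s: %s\n' "$name" "$path"
        else
            printf '%s: not found\n' "$name"
        fi
    done
}

list_awk_names
```

The function prints one line per name, so a missing program shows up
as `not found' rather than being silently skipped.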

Throughout this Info file, whenever we refer to a language feature
that should be available in any complete implementation of POSIX `awk',
we simply use the term `awk'. When referring to a feature that is
specific to the GNU implementation, we use the term `gawk'.

---------- Footnotes ----------

(1) Often, these systems use `gawk' for their `awk' implementation!

File: gawk.info, Node: This Manual, Next: Conventions, Prev: Names, Up: Preface

Using This Book
===============

The term `awk' refers to a particular program as well as to the
language you use to tell this program what to do. When we need to be
careful, we call the language "the `awk' language," and the program
"the `awk' utility." This Info file explains both the `awk' language
and how to run the `awk' utility. The term "`awk' program" refers to a
program written by you in the `awk' programming language.

Primarily, this Info file explains the features of `awk', as defined
in the POSIX standard. It does so in the context of the `gawk'
implementation. While doing so, it also attempts to describe important
differences between `gawk' and other `awk' implementations.(1) Finally,
any `gawk' features that are not in the POSIX standard for `awk' are
noted.

There are subsections labelled as *Advanced Notes* scattered
throughout the Info file. They add a more complete explanation of
points that are relevant, but not likely to be of interest on first
reading. All appear in the index, under the heading "advanced
features."

Most of the time, the examples use complete `awk' programs. In some
of the more advanced sections, only the part of the `awk' program that
illustrates the concept currently being described is shown.

While this Info file is aimed principally at people who have not been
exposed to `awk', there is a lot of information here that even the `awk'
expert should find useful. In particular, the description of POSIX
`awk' and the example programs in *Note Library Functions::, and in
*Note Sample Programs::, should be of interest.

*Note Getting Started::, provides the essentials you need to know to
begin using `awk'.

*Note Regexp::, introduces regular expressions in general, and in
particular the flavors supported by POSIX `awk' and `gawk'.

*Note Reading Files::, describes how `awk' reads your data. It
introduces the concepts of records and fields, as well as the `getline'
command. I/O redirection is first described here.

*Note Printing::, describes how `awk' programs can produce output
with `print' and `printf'.

*Note Expressions::, describes expressions, which are the basic
building blocks for getting most things done in a program.

*Note Patterns and Actions::, describes how to write patterns for
matching records, actions for doing something when a record is matched,
and the built-in variables `awk' and `gawk' use.

*Note Arrays::, covers `awk''s one-and-only data structure:
associative arrays. Deleting array elements and whole arrays is also
described, as well as sorting arrays in `gawk'.

*Note Functions::, describes the built-in functions `awk' and `gawk'
provide, as well as how to define your own functions.

*Note Internationalization::, describes special features in `gawk'
for translating program messages into different languages at runtime.

*Note Advanced Features::, describes a number of `gawk'-specific
advanced features. Of particular note are the abilities to have
two-way communications with another process, perform TCP/IP networking,
and profile your `awk' programs.

*Note Invoking Gawk::, describes how to run `gawk', the meaning of
its command-line options, and how it finds `awk' program source files.

*Note Library Functions::, and *Note Sample Programs::, provide many
sample `awk' programs. Reading them allows you to see `awk' solving
real problems.

*Note Language History::, describes how the `awk' language has
evolved from its first release to the present. It also describes how `gawk'
has acquired features over time.

*Note Installation::, describes how to get `gawk', how to compile it
under Unix, and how to compile and use it on different non-Unix
systems. It also describes how to report bugs in `gawk' and where to
get three other freely available implementations of `awk'.

*Note Notes::, describes how to disable `gawk''s extensions, as well
as how to contribute new code to `gawk', how to write extension
libraries, and some possible future directions for `gawk' development.

*Note Basic Concepts::, provides some very cursory background
material for those who are completely unfamiliar with computer
programming. Also centralized there is a discussion of some of the
issues surrounding floating-point numbers.

The *Note Glossary::, defines most, if not all, of the significant
terms used throughout the book. If you find terms that you aren't
familiar with, try looking them up here.

*Note Copying::, and *Note GNU Free Documentation License::, present
the licenses that cover the `gawk' source code and this Info file,
respectively.

---------- Footnotes ----------

(1) All such differences appear in the index under the entry
"differences in `awk' and `gawk'."

File: gawk.info, Node: Conventions, Next: Manual History, Prev: This Manual, Up: Preface

Typographical Conventions
=========================

This Info file is written using Texinfo, the GNU documentation
formatting language. A single Texinfo source file is used to produce
both the printed and online versions of the documentation. This minor
node briefly documents the typographical conventions used in Texinfo.

Examples you would type at the command line are preceded by the
common shell primary and secondary prompts, `$' and `>'. Output from
the command is preceded by the glyph "-|". This typically represents
the command's standard output. Error messages, and other output on the
command's standard error, are preceded by the glyph "error-->". For
example:

$ echo hi on stdout
-| hi on stdout
$ echo hello on stderr 1>&2
error--> hello on stderr

Characters that you type at the keyboard look `like this'. In
particular, there are special characters called "control characters."
These are characters that you type by holding down both the `CONTROL'
key and another key, at the same time. For example, a `Ctrl-d' is typed
by first pressing and holding the `CONTROL' key, next pressing the `d'
key and finally releasing both keys.

Dark Corners
............

Dark corners are basically fractal -- no matter how much you
illuminate, there's always a smaller but darker one.
Brian Kernighan

Until the POSIX standard (and `The Gawk Manual'), many features of
`awk' were either poorly documented or not documented at all.
Descriptions of such features (often called "dark corners") are noted
in this Info file with "(d.c.)". They also appear in the index under
the heading "dark corner."

As noted by the opening quote, though, any coverage of dark corners
is, by definition, incomplete.

File: gawk.info, Node: Manual History, Next: How To Contribute, Prev: Conventions, Up: Preface

The GNU Project and This Book
=============================

The Free Software Foundation (FSF) is a nonprofit organization
dedicated to the production and distribution of freely distributable
software. It was founded by Richard M. Stallman, the author of the
original Emacs editor. GNU Emacs is the most widely used version of
Emacs today.

The GNU(1) Project is an ongoing effort on the part of the Free
Software Foundation to create a complete, freely distributable,
POSIX-compliant computing environment. The FSF uses the "GNU General
Public License" (GPL) to ensure that their software's source code is
always available to the end user. A copy of the GPL is included for
your reference (*note Copying::). The GPL applies to the C language
source code for `gawk'. To find out more about the FSF and the GNU
Project online, see the GNU Project's home page (http://www.gnu.org).
This Info file may also be read from their web site
(http://www.gnu.org/manual/gawk/).

A shell, an editor (Emacs), highly portable optimizing C, C++, and
Objective-C compilers, a symbolic debugger, and dozens of large and
small utilities (such as `gawk') have all been completed and are
freely available. The GNU operating system kernel (the HURD) has been
released but is still in an early stage of development.

Until the GNU operating system is more fully developed, you should
consider using GNU/Linux, a freely distributable, Unix-like operating
system for Intel 80386, DEC Alpha, Sun SPARC, IBM S/390, and other
systems.(2) There are many books on GNU/Linux. One that is freely
available is `Linux Installation and Getting Started', by Matt Welsh.
GNU/Linux distributions are often available in computer stores or
bundled on CD-ROMs with books about Linux. (There are three other
freely available, Unix-like operating systems for 80386 and other
systems: NetBSD, FreeBSD, and OpenBSD. All are based on the 4.4-Lite
Berkeley Software Distribution, and they use recent versions of `gawk'
for their versions of `awk'.)

The Info file itself has gone through a number of previous editions.
Paul Rubin wrote the very first draft of `The GAWK Manual'; it was
around 40 pages in size. Diane Close and Richard Stallman improved it,
yielding a version that was around 90 pages long and barely described
the original, "old" version of `awk'.

I started working with that version in the fall of 1988. As work on
it progressed, the FSF published several preliminary versions (numbered
0.X). In 1996, Edition 1.0 was released with `gawk' 3.0.0. The FSF
published the first two editions under the title `The GNU Awk User's
Guide'.

This edition maintains the basic structure of Edition 1.0, but with
significant additional material, reflecting the host of new features in
`gawk' version 3.1. Of particular note is *Note Array Sorting::, as
well as *Note Bitwise Functions::, *Note Internationalization::, and
also *Note Advanced Features::, and *Note Dynamic Extensions::.

`GAWK: Effective AWK Programming' will undoubtedly continue to
evolve. An electronic version comes with the `gawk' distribution from
the FSF. If you find an error in this Info file, please report it!
*Note Bugs::, for information on submitting problem reports
electronically, or write to me in care of the publisher.

---------- Footnotes ----------

(1) GNU stands for "GNU's not Unix."

(2) The terminology "GNU/Linux" is explained in the *Note Glossary::.

File: gawk.info, Node: How To Contribute, Next: Acknowledgments, Prev: Manual History, Up: Preface

How to Contribute
=================

As the maintainer of GNU `awk', I am starting a collection of
publicly available `awk' programs. For more information, see
`ftp://ftp.freefriends.org/arnold/Awkstuff'. If you have written an
interesting `awk' program, or have written a `gawk' extension that you
would like to share with the rest of the world, please contact me.
Making things available on the Internet helps keep
the `gawk' distribution down to manageable size.

File: gawk.info, Node: Acknowledgments, Prev: How To Contribute, Up: Preface

Acknowledgments
===============

The initial draft of `The GAWK Manual' had the following
acknowledgments:

Many people need to be thanked for their assistance in producing
this manual. Jay Fenlason contributed many ideas and sample
programs. Richard Mlynarik and Robert Chassell gave helpful
comments on drafts of this manual. The paper `A Supplemental
Document for `awk'' by John W. Pierce of the Chemistry Department
at UC San Diego, pinpointed several issues relevant both to `awk'
implementation and to this manual, that would otherwise have
escaped us.

I would like to acknowledge Richard M. Stallman, for his vision of a
better world and for his courage in founding the FSF and starting the
GNU Project.

The following people (in alphabetical order) provided helpful
comments on various versions of this book, up to and including this
edition. Rick Adams, Nelson H.F. Beebe, Karl Berry, Dr. Michael
Brennan, Rich Burridge, Claire Cloutier, Diane Close, Scott Deifik,
Christopher ("Topher") Eliot, Jeffrey Friedl, Dr. Darrel Hankerson,
Michal Jaegermann, Dr. Richard J. LeBlanc, Michael Lijewski, Pat Rankin,
Miriam Robbins, Mary Sheehan, and Chuck Toporek.

Robert J. Chassell provided much valuable advice on the use of
Texinfo. He also deserves special thanks for convincing me _not_ to
title this Info file `How To Gawk Politely'. Karl Berry helped
significantly with the TeX part of Texinfo.

I would like to thank Marshall and Elaine Hartholz of Seattle and
Dr. Bert and Rita Schreiber of Detroit for large amounts of quiet
vacation time in their homes, which allowed me to make significant
progress on this Info file and on `gawk' itself.

Phil Hughes of SSC contributed in a very important way by loaning me
his laptop GNU/Linux system, not once, but twice, which allowed me to
do a lot of work while away from home.

David Trueman deserves special credit; he has done a yeoman job of
evolving `gawk' so that it performs well and without bugs. Although he
is no longer involved with `gawk', working with him on this project was
a significant pleasure.

The intrepid members of the GNITS mailing list, and most notably
Ulrich Drepper, provided invaluable help and feedback for the design of
the internationalization features.

Nelson Beebe, Martin Brown, Andreas Buening, Scott Deifik, Darrel
Hankerson, Isamu Hasegawa, Michal Jaegermann, Jürgen Kahrs, Pat Rankin,
Kai Uwe Rommel, and Eli Zaretskii (in alphabetical order) make up the
`gawk' "crack portability team." Without their hard work and help,
`gawk' would not be nearly the fine program it is today. It has been
and continues to be a pleasure working with this team of fine people.

David and I would like to thank Brian Kernighan of Bell Laboratories
for invaluable assistance during the testing and debugging of `gawk',
and for help in clarifying numerous points about the language. We
could not have done nearly as good a job on either `gawk' or its
documentation without his help.

Chuck Toporek, Mary Sheehan, and Claire Cloutier of O'Reilly &
Associates contributed significant editorial help for this Info file
for the 3.1 release of `gawk'.

I must thank my wonderful wife, Miriam, for her patience through the
many versions of this project, for her proofreading, and for sharing me
with the computer. I would like to thank my parents for their love,
and for the grace with which they raised and educated me. Finally, I
also must acknowledge my gratitude to G-d, for the many opportunities
He has sent my way, as well as for the gifts He has given me with which
to take advantage of those opportunities.

Arnold Robbins
Nof Ayalon
ISRAEL
March, 2001

File: gawk.info, Node: Getting Started, Next: Regexp, Prev: Preface, Up: Top

Getting Started with `awk'
**************************

The basic function of `awk' is to search files for lines (or other
units of text) that contain certain patterns. When a line matches one
of the patterns, `awk' performs specified actions on that line. `awk'
keeps processing input lines in this way until it reaches the end of
the input files.

Programs in `awk' are different from programs in most other
languages, because `awk' programs are "data-driven"; that is, you
describe the data you want to work with and then what to do when you
find it. Most other languages are "procedural"; you have to describe,
in great detail, every step the program is to take. When working with
procedural languages, it is usually much harder to clearly describe the
data your program will process. For this reason, `awk' programs are
often refreshingly easy to read and write.

When you run `awk', you specify an `awk' "program" that tells `awk'
what to do. The program consists of a series of "rules". (It may also
contain "function definitions", an advanced feature that we will ignore
for now. *Note User-defined::.) Each rule specifies one pattern to
search for and one action to perform upon finding the pattern.

Syntactically, a rule consists of a pattern followed by an action.
The action is enclosed in curly braces to separate it from the pattern.
Newlines usually separate rules. Therefore, an `awk' program looks
like this:

PATTERN { ACTION }
PATTERN { ACTION }
...
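
As a small illustration of this layout, the following sketch runs a
program with two rules over three made-up input lines. The input text
and both patterns are invented for this example; they are not among the
sample files used elsewhere in this Info file. The second rule uses a
comparison as its pattern, a construct that is covered later.

```shell
# Two rules, one per line; each has the form PATTERN { ACTION }.
out=$(printf 'apple 3\nbanana 7\ncherry 12\n' |
      awk '/an/   { print "matched an:", $1 }
           $2 > 5 { print "large:", $1 }')
printf '%s\n' "$out"
```

The `banana' line satisfies both patterns, so both actions run for it,
in the order in which the rules appear.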

* Menu:

* Running gawk:: How to run `gawk' programs; includes
command-line syntax.
* Sample Data Files:: Sample data files for use in the `awk'
programs illustrated in this Info file.
* Very Simple:: A very simple example.
* Two Rules:: A less simple one-line example using two
rules.
* More Complex:: A more complex example.
* Statements/Lines:: Subdividing or combining statements into
lines.
* Other Features:: Other Features of `awk'.
* When:: When to use `gawk' and when to use
other things.

File: gawk.info, Node: Running gawk, Next: Sample Data Files, Up: Getting Started

How to Run `awk' Programs
=========================

There are several ways to run an `awk' program. If the program is
short, it is easiest to include it in the command that runs `awk', like
this:

awk 'PROGRAM' INPUT-FILE1 INPUT-FILE2 ...

When the program is long, it is usually more convenient to put it in
a file and run it with a command like this:

awk -f PROGRAM-FILE INPUT-FILE1 INPUT-FILE2 ...

This minor node discusses both mechanisms, along with several
variations of each.

* Menu:

* One-shot:: Running a short throwaway `awk'
program.
* Read Terminal:: Using no input files (input from terminal
instead).
* Long:: Putting permanent `awk' programs in
files.
* Executable Scripts:: Making self-contained `awk' programs.
* Comments:: Adding documentation to `gawk'
programs.
* Quoting:: More discussion of shell quoting issues.

File: gawk.info, Node: One-shot, Next: Read Terminal, Up: Running gawk

One-Shot Throwaway `awk' Programs
---------------------------------

Once you are familiar with `awk', you will often type in simple
programs the moment you want to use them. Then you can write the
program as the first argument of the `awk' command, like this:

awk 'PROGRAM' INPUT-FILE1 INPUT-FILE2 ...

where PROGRAM consists of a series of PATTERNS and ACTIONS, as
described earlier.

This command format instructs the "shell", or command interpreter,
to start `awk' and use the PROGRAM to process records in the input
file(s). There are single quotes around PROGRAM so the shell won't
interpret any `awk' characters as special shell characters. The quotes
also cause the shell to treat all of PROGRAM as a single argument for
`awk', and allow PROGRAM to be more than one line long.

This format is also useful for running short or medium-sized `awk'
programs from shell scripts, because it avoids the need for a separate
file for the `awk' program. A self-contained shell script is more
reliable because there are no other files to misplace.
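
The following is a minimal sketch of that arrangement. The function
name `count_long' and the 20-character threshold are assumptions made
for illustration only. The `awk' program spans two lines inside a
single pair of single quotes, so the shell passes it to `awk' as one
argument:

```shell
# count_long: hypothetical helper that counts lines longer than 20 characters.
# With no file arguments, awk reads the standard input.
count_long() {
    awk 'length($0) > 20 { n++ }
         END { print n + 0, "long lines" }' "$@"
}

out=$(printf 'short\nthis line is certainly longer than twenty characters\n' |
      count_long)
echo "$out"
```

Adding zero to `n' in the `END' rule makes the program print `0' rather
than an empty string when no line is long enough.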

*Note Very Simple::, presents several short, self-contained programs.

File: gawk.info, Node: Read Terminal, Next: Long, Prev: One-shot, Up: Running gawk

Running `awk' Without Input Files
---------------------------------

You can also run `awk' without any input files. If you type the
following command line:

awk 'PROGRAM'

`awk' applies the PROGRAM to the "standard input", which usually means
whatever you type on the terminal. This continues until you indicate
end-of-file by typing `Ctrl-d'. (On other operating systems, the
end-of-file character may be different. For example, on OS/2 and
MS-DOS, it is `Ctrl-z'.)

As an example, the following program prints a friendly piece of
advice (from Douglas Adams's `The Hitchhiker's Guide to the Galaxy'),
to keep you from worrying about the complexities of computer programming
(`BEGIN' is a feature we haven't discussed yet):

$ awk "BEGIN { print \"Don't Panic!\" }"
-| Don't Panic!

This program does not read any input. The `\' before each of the
inner double quotes is necessary because of the shell's quoting
rules--in particular because it mixes both single quotes and double
quotes.(1)

This next simple `awk' program emulates the `cat' utility; it copies
whatever you type on the keyboard to its standard output (why this
works is explained shortly).

$ awk '{ print }'
Now is the time for all good men
-| Now is the time for all good men
to come to the aid of their country.
-| to come to the aid of their country.
Four score and seven years ago, ...
-| Four score and seven years ago, ...
What, me worry?
-| What, me worry?
Ctrl-d
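
The standard input does not have to be the keyboard. As a sketch, the
same program can read text that another command writes into a pipe; the
`printf' command and its two lines of text are just stand-ins for any
source of input. No `Ctrl-d' is needed here, because the input ends
when `printf' finishes:

```shell
# awk reads the pipe as its standard input and copies each line through.
out=$(printf 'Now is the time\nWhat, me worry?\n' | awk '{ print }')
printf '%s\n' "$out"
```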

---------- Footnotes ----------

(1) Although we generally recommend the use of single quotes around
the program text, double quotes are needed here in order to put the
single quote into the message.

File: gawk.info, Node: Long, Next: Executable Scripts, Prev: Read Terminal, Up: Running gawk

Running Long Programs
---------------------

Sometimes your `awk' programs can be very long. In this case, it is
more convenient to put the program into a separate file. In order to
tell `awk' to use that file for its program, you type:

awk -f SOURCE-FILE INPUT-FILE1 INPUT-FILE2 ...

The `-f' instructs the `awk' utility to get the `awk' program from
the file SOURCE-FILE. Any file name can be used for SOURCE-FILE. For
example, you could put the program:

BEGIN { print "Don't Panic!" }

into the file `advice'. Then this command:

awk -f advice

does the same thing as this one:

awk "BEGIN { print \"Don't Panic!\" }"

This was explained earlier (*note Read Terminal::). Note that you
don't usually need single quotes around the file name that you specify
with `-f', because most file names don't contain any of the shell's
special characters. Notice that in `advice', the `awk' program did not
have single quotes around it. The quotes are only needed for programs
that are provided on the `awk' command line.

If you want to identify your `awk' program files clearly as such,
you can add the extension `.awk' to the file name. This doesn't affect
the execution of the `awk' program but it does make "housekeeping"
easier.
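
Putting these pieces together, a session might look like the following
sketch. The scratch directory and the use of `mktemp' are assumptions
made so that the example does not overwrite any of your own files; the
file name `advice.awk' simply applies the `.awk' suggestion to the
`advice' file shown earlier.

```shell
dir=$(mktemp -d)    # scratch directory; assumes the mktemp utility exists

# The program file holds plain awk text, with no shell quoting around it.
cat > "$dir/advice.awk" <<'EOF'
BEGIN { print "Don't Panic!" }
EOF

out=$(awk -f "$dir/advice.awk")
echo "$out"
rm -r "$dir"
```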

File: gawk.info, Node: Executable Scripts, Next: Comments, Prev: Long, Up: Running gawk

Executable `awk' Programs
-------------------------

Once you have learned `awk', you may want to write self-contained
`awk' scripts, using the `#!' script mechanism. You can do this on
many Unix systems(1) as well as on the GNU system. For example, you
could update the file `advice' to look like this:

#! /bin/awk -f

BEGIN { print "Don't Panic!" }

After making this file executable (with the `chmod' utility), simply
type `advice' at the shell and the system arranges to run `awk'(2) as
if you had typed `awk -f advice':

$ chmod +x advice
$ advice
-| Don't Panic!

(We assume you have the current directory in your shell's search path
variable (typically `$PATH'). If not, you may need to type `./advice'
at the shell.)

Self-contained `awk' scripts are useful when you want to write a
program that users can invoke without their having to know that the
program is written in `awk'.

Advanced Notes: Portability Issues with `#!'
--------------------------------------------

Some systems limit the length of the interpreter name to 32
characters. Often, this can be dealt with by using a symbolic link.

You should not put more than one argument on the `#!' line after the
path to `awk'. It does not work. The operating system treats the rest
of the line as a single argument and passes it to `awk'. Doing this
leads to confusing behavior--most likely a usage diagnostic of some
sort from `awk'.

Finally, the value of `ARGV[0]' (*note Built-in Variables::) varies
depending upon your operating system. Some systems put `awk' there,
some put the full pathname of `awk' (such as `/bin/awk'), and some put
the name of your script (`advice'). Don't rely on the value of
`ARGV[0]' to provide your script name.
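
To see what your own system does, you can try a sketch like the
following. The script name `showargs' is hypothetical, and the example
assumes that `command -v awk' reports a full path that is usable on the
`#!' line and that the scratch directory allows executable files. The
value printed for element 0 is deliberately not shown, since it varies
from system to system:

```shell
dir=$(mktemp -d)    # scratch directory; assumes the mktemp utility exists

# Build a self-contained script whose #! line names this system's awk.
printf '#! %s -f\nBEGIN { for (i = 0; i < ARGC; i++) print i, ARGV[i] }\n' \
    "$(command -v awk)" > "$dir/showargs"
chmod +x "$dir/showargs"

# Element 0 depends on the system; elements 1 and 2 are the arguments.
out=$("$dir/showargs" one two)
printf '%s\n' "$out"
rm -r "$dir"
```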

---------- Footnotes ----------

(1) The `#!' mechanism works on Linux systems, systems derived from
the 4.4-Lite Berkeley Software Distribution, and most commercial Unix
systems.

(2) The line beginning with `#!' lists the full file name of an
interpreter to run and an optional initial command-line argument to
pass to that interpreter. The operating system then runs the
interpreter with the given argument and the full argument list of the
executed program. The first argument in the list is the full file name
of the `awk' program. The rest of the argument list contains either
options to `awk', or data files, or both.

File: gawk.info, Node: Comments, Next: Quoting, Prev: Executable Scripts, Up: Running gawk

Comments in `awk' Programs
--------------------------

A "comment" is some text that is included in a program for the sake
of human readers; it is not really an executable part of the program.
Comments can explain what the program does and how it works. Nearly all
programming languages have provisions for comments, as programs are
typically hard to understand without them.

In the `awk' language, a comment starts with the sharp sign
character (`#') and continues to the end of the line. The `#' does not
have to be the first character on the line. The `awk' language ignores
the rest of a line following a sharp sign. For example, we could have
put the following into `advice':

# This program prints a nice friendly message. It helps
# keep novice users from being afraid of the computer.
BEGIN { print "Don't Panic!" }

You can put comment lines into keyboard-composed throwaway `awk'
programs, but this usually isn't very useful; the purpose of a comment
is to help you or another person understand the program when reading it
at a later time.

*Caution:* As mentioned in *Note One-shot::, you can enclose small
to medium programs in single quotes, in order to keep your shell
scripts self-contained. When doing so, _don't_ put an apostrophe
(i.e., a single quote) into a comment (or anywhere else in your
program). The shell interprets the quote as the closing quote for the
entire program. As a result, usually the shell prints a message about
mismatched quotes, and if `awk' actually runs, it will probably print
strange messages about syntax errors. For example, look at the
following:

$ awk '{ print "hello" } # let's be cute'
>

The shell sees that the first two quotes match, and that a new
quoted object begins at the end of the command line. It therefore
prompts with the secondary prompt, waiting for more input. With Unix
`awk', closing the quoted string produces this result:

$ awk '{ print "hello" } # let's be cute'
> '
error--> awk: can't open file be
error--> source line number 1

Putting a backslash before the single quote in `let's' wouldn't help,
since backslashes are not special inside single quotes. The next
node describes the shell's quoting rules.
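
By way of contrast, here is a sketch of a command-line program whose
comments cause no trouble, because none of them contains an apostrophe.
The input lines are made up for the example. It also shows that a
comment may follow code on the same line:

```shell
# Every comment inside the quoted program is free of apostrophes.
out=$(printf 'a b\n\nc\n' | awk '
# Count the fields on non-blank lines.
NF > 0 { total += NF }   # a comment may follow code on the same line
END    { print total }   # let us be cute, without the apostrophe
')
echo "$out"
```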

File: gawk.info, Node: Quoting, Prev: Comments, Up: Running gawk

Shell-Quoting Issues
--------------------

For short to medium length `awk' programs, it is most convenient to
enter the program on the `awk' command line. This is best done by
enclosing the entire program in single quotes. This is true whether
you are entering the program interactively at the shell prompt, or
writing it as part of a larger shell script:

awk 'PROGRAM TEXT' INPUT-FILE1 INPUT-FILE2 ...

Once you are working with the shell, it is helpful to have a basic
knowledge of shell quoting rules. The following rules apply only to
POSIX-compliant, Bourne-style shells (such as `bash', the GNU
Bourne-Again Shell). If you use `csh', you're on your own.

* Quoted items can be concatenated with nonquoted items as well as
with other quoted items. The shell turns everything into one
argument for the command.

* Preceding any single character with a backslash (`\') quotes that
character. The shell removes the backslash and passes the quoted
character on to the command.

* Single quotes protect everything between the opening and closing
quotes. The shell does no interpretation of the quoted text,
passing it on verbatim to the command. It is _impossible_ to
embed a single quote inside single-quoted text. Refer back to
*Note Comments::, for an example of what happens if you try.

* Double quotes protect most things between the opening and closing
quotes. The shell does at least variable and command substitution
on the quoted text. Different shells may do additional kinds of
processing on double-quoted text.

Since certain characters within double-quoted text are processed
by the shell, they must be "escaped" within the text. Of note are
the characters `$', ``', `\', and `"', all of which must be
preceded by a backslash within double-quoted text if they are to
be passed on literally to the program. (The leading backslash is
stripped first.) Thus, the example seen in *Note Read Terminal::,
is applicable:

$ awk "BEGIN { print \"Don't Panic!\" }"
-| Don't Panic!

Note that the single quote is not special within double quotes.

* Null strings are removed when they occur as part of a non-null
command-line argument, while explicit null objects are kept.
For example, to specify that the field separator `FS' should be
set to the null string, use:

awk -F "" 'PROGRAM' FILES # correct

Don't use this:

awk -F"" 'PROGRAM' FILES # wrong!

In the second case, `awk' will attempt to use the text of the
program as the value of `FS', and the first file name as the text
of the program! This results in syntax errors at best, and
confusing behavior at worst.
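
The concatenation and double-quote rules above can be tried out with a
sketch like the following. The shell variable `name' and the text that
is printed are assumptions made purely for illustration:

```shell
name=Arthur    # hypothetical shell variable, for illustration only

# Three quoted pieces are concatenated into one argument for awk;
# the shell expands $name only in the double-quoted middle piece.
out1=$(awk 'BEGIN { print "Hello, '"$name"'" }')

# Inside double quotes, double quotes and dollar signs need a leading
# backslash in order to reach awk literally.
out2=$(awk "BEGIN { print \"Price: \$5\" }")

printf '%s\n' "$out1" "$out2"
```

In the second command, `awk' receives the program text
`BEGIN { print "Price: $5" }', in which the dollar sign is part of a
string constant and therefore is printed as is.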

Mixing single and double quotes is difficult. You have to resort to
shell quoting tricks, like this:

$ awk 'BEGIN { print "Here is a single quote <'"'"'>" }'
-| Here is a single quote <'>

This program consists of three concatenated quoted strings. The first
and the third are single-quoted, the second is double-quoted.

This can be "simplified" to:

$ awk 'BEGIN { print "Here is a single quote <'\''>" }'
-| Here is a single quote <'>

Judge for yourself which of these two is the more readable.

Another option is to use double quotes, escaping the embedded,
`awk'-level double quotes:

$ awk "BEGIN { print \"Here is a single quote <'>\" }"
-| Here is a single quote <'>

This option is also painful, because double quotes, backslashes, and
dollar signs are very common in `awk' programs.

If you really need both single and double quotes in your `awk'
program, it is probably best to move it into a separate file, where the
shell won't be part of the picture, and you can say what you mean.
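
Two further approaches, which the discussion above does not cover, are
sketched here for comparison. The first writes the single quote as the
octal escape sequence `\047' inside an `awk' string constant. The second
passes it in through a variable assigned with the `-v' option; the
variable name `sq' is an arbitrary choice. Treat both as suggestions to
try rather than as the recommended method:

```shell
# Octal escape: \047 is the single quote character in ASCII.
out1=$(awk 'BEGIN { print "Here is a single quote <\047>" }')

# Variable assignment: the shell sees the quote inside double quotes,
# and awk sees it as the value of the variable sq.
out2=$(awk -v sq="'" 'BEGIN { print "Here is a single quote <" sq ">" }')

printf '%s\n' "$out1" "$out2"
```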

File: gawk.info, Node: Sample Data Files, Next: Very Simple, Prev: Running gawk, Up: Getting Started

Data Files for the Examples
===========================

Many of the examples in this Info file take their input from two
sample data files. The first, `BBS-list', represents a list of
computer bulletin board systems together with information about those
systems. The second data file, called `inventory-shipped', contains
information about monthly shipments. In both files, each line is
considered to be one "record".

In the data file `BBS-list', each record contains the name of a
computer bulletin board, its phone number, the board's baud rate(s),
and a code for the number of hours it is operational. An `A' in the
last column means the board operates 24 hours a day. A `B' in the last
column means the board only operates on evening and weekend hours. A
`C' means the board operates only on weekends:

aardvark 555-5553 1200/300 B
alpo-net 555-3412 2400/1200/300 A
barfly 555-7685 1200/300 A
bites 555-1675 2400/1200/300 A
camelot 555-0542 300 C
core 555-2912 1200/300 C
fooey 555-1234 2400/1200/300 B
foot 555-6699 1200/300 B
macfoo 555-6480 1200/300 A
sdace 555-3430 2400/1200/300 A
sabafoo 555-2127 1200/300 C

The data file `inventory-shipped' represents information about
shipments during the year. Each record contains the month, the number
of green crates shipped, the number of red boxes shipped, the number of
orange bags shipped, and the number of blue packages shipped,
respectively. There are 16 entries, covering the 12 months of last year
and the first four months of the current year.

Jan 13 25 15 115
Feb 15 32 24 226
Mar 15 24 34 228
Apr 31 52 63 420
May 16 34 29 208
Jun 31 42 75 492
Jul 24 34 67 436
Aug 15 34 47 316
Sep 13 55 37 277
Oct 29 54 68 525
Nov 20 87 82 577
Dec 17 35 61 401

Jan 21 36 64 620
Feb 26 58 80 652
Mar 24 75 70 495
Apr 21 70 74 514

If you are reading this in GNU Emacs using Info, you can copy the
regions of text showing these sample files into your own test files.
This way you can try out the examples shown in the remainder of this
document. You do this by using the command `M-x write-region' to copy
text from the Info file into a file for use with `awk' (*Note
Miscellaneous File Operations: (emacs)Misc File Ops, for more
information). Using this information, create your own `BBS-list' and
`inventory-shipped' files and practice what you learn in this Info file.

If you are using the stand-alone version of Info, see *Note Extract
Program::, for an `awk' program that extracts these data files from
`gawk.texi', the Texinfo source file for this Info file.
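
If neither method is convenient, the files can also be typed in by
hand. The following sketch creates `BBS-list' with a shell
here-document and then uses a small `awk' program as a sanity check.
The scratch directory and the use of `mktemp' are assumptions made so
that the example does not overwrite any of your own files:

```shell
dir=$(mktemp -d)    # scratch directory; assumes the mktemp utility exists

cat > "$dir/BBS-list" <<'EOF'
aardvark 555-5553 1200/300 B
alpo-net 555-3412 2400/1200/300 A
barfly 555-7685 1200/300 A
bites 555-1675 2400/1200/300 A
camelot 555-0542 300 C
core 555-2912 1200/300 C
fooey 555-1234 2400/1200/300 B
foot 555-6699 1200/300 B
macfoo 555-6480 1200/300 A
sdace 555-3430 2400/1200/300 A
sabafoo 555-2127 1200/300 C
EOF

# Each record should have exactly four fields.
out=$(awk 'NF != 4 { bad++ }
           END { print NR, "records,", bad + 0, "malformed" }' "$dir/BBS-list")
echo "$out"
rm -r "$dir"
```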

File: gawk.info, Node: Very Simple, Next: Two Rules, Prev: Sample Data Files, Up: Getting Started

Some Simple Examples
====================

The following command runs a simple `awk' program that searches the
input file `BBS-list' for the character string `foo' (a grouping of
characters is usually called a "string"; the term "string" is based on
similar usage in English, such as "a string of pearls," or "a string of
cars in a train"):

awk '/foo/ { print $0 }' BBS-list

When lines containing `foo' are found, they are printed because
`print $0' means print the current line. (Just `print' by itself means
the same thing, so we could have written that instead.)

You will notice that slashes (`/') surround the string `foo' in the
`awk' program. The slashes indicate that `foo' is the pattern to
search for. This type of pattern is called a "regular expression",
which is covered in more detail later (*note Regexp::). The pattern is
allowed to match parts of words. There are single quotes around the
`awk' program so that the shell won't interpret any of it as special
shell characters.

Here is what this program prints:

$ awk '/foo/ { print $0 }' BBS-list
-| fooey 555-1234 2400/1200/300 B
-| foot 555-6699 1200/300 B
-| macfoo 555-6480 1200/300 A
-| sabafoo 555-2127 1200/300 C

In an `awk' rule, either the pattern or the action can be omitted,
but not both. If the pattern is omitted, then the action is performed
for _every_ input line. If the action is omitted, the default action
is to print all lines that match the pattern.

Thus, we could leave out the action (the `print' statement and the
curly braces) in the previous example and the result would be the same:
all lines matching the pattern `foo' are printed. By comparison,
omitting the `print' statement but retaining the curly braces makes an
empty action that does nothing (i.e., no lines are printed).
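
The three cases can be compared side by side with a sketch such as the
following, which uses three lines in the style of `BBS-list' as input.
The shell variable names are arbitrary:

```shell
input='fooey 555-1234
camelot 555-0542
macfoo 555-6480'

a=$(printf '%s\n' "$input" | awk '/foo/')        # no action: matches are printed
b=$(printf '%s\n' "$input" | awk '{ print }')    # no pattern: every line is printed
c=$(printf '%s\n' "$input" | awk '/foo/ { }')    # empty action: nothing is printed

printf 'a:\n%s\nb:\n%s\nc:\n%s\n' "$a" "$b" "$c"
```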

Many practical `awk' programs are just a line or two. Following is a
collection of useful, short programs to get you started. Some of these
programs contain constructs that haven't been covered yet. (The
description of the program will give you a good idea of what is going
on, but please read the rest of the Info file to become an `awk'
expert!) Most of the examples use a data file named `data'. This is
just a placeholder; if you use these programs yourself, substitute your
own file names for `data'. For future reference, note that there is
often more than one way to do things in `awk'. At some point, you may
want to look back at these examples and see if you can come up with
different ways to do the same things shown here:

* Print the length of the longest input line:

awk '{ if (length($0) > max) max = length($0) }
END { print max }' data

* Print every line that is longer than 80 characters:

awk 'length($0) > 80' data

The sole rule has a relational expression as its pattern and it
has no action--so the default action, printing the record, is used.

* Print the length of the longest line in `data':

expand data | awk '{ if (x < length()) x = length() }
END { print "maximum line length is " x }'

The input is processed by the `expand' utility to change tabs into
spaces, so the widths compared are actually the right-margin
columns.

* Print every line that has at least one field:

awk 'NF > 0' data

This is an easy way to delete blank lines from a file (or rather,
to create a new file similar to the old file but from which the
blank lines have been removed).

* Print seven random numbers from 0 to 100, inclusive:

awk 'BEGIN { for (i = 1; i <= 7; i++)
print int(101 * rand()) }'

* Print the total number of bytes used by FILES:

ls -l FILES | awk '{ x += $5 }
END { print "total bytes: " x }'

* Print the total number of kilobytes used by FILES:

ls -l FILES | awk '{ x += $5 }
END { print "total K-bytes: " (x + 1023)/1024 }'

* Print a sorted list of the login names of all users:

awk -F: '{ print $1 }' /etc/passwd | sort

* Count the lines in a file:

awk 'END { print NR }' data

* Print the even-numbered lines in the data file:

awk 'NR % 2 == 0' data

If you use the expression `NR % 2 == 1' instead, the program would
print the odd-numbered lines.
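
To try a few of these one-liners without creating a file named `data',
the input can be supplied on a pipe. The following is only a sketch:
the four-line sample and the shell variable names (`sample',
`nonblank', `even', `longest') are invented for illustration.

```shell
# Invented four-line sample; the second line is blank.
sample=$(printf 'one\n\nthree four\nfive\n')

# Keep only the lines that have at least one field.
nonblank=$(printf '%s\n' "$sample" | awk 'NF > 0')

# Of the three surviving lines, print the even-numbered one.
even=$(printf '%s\n' "$nonblank" | awk 'NR % 2 == 0')

# Length of the longest line in the sample.
longest=$(printf '%s\n' "$sample" | awk '{ if (length($0) > max) max = length($0) }
END { print max }')

printf '%s\n' "$nonblank"
echo "even-numbered line: $even"
echo "longest line: $longest"
```

Because the blank line is removed first, the even-numbered line of the
remaining input is `three four', which is also the longest line.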

File: gawk.info, Node: Two Rules, Next: More Complex, Prev: Very Simple, Up: Getting Started

An Example with Two Rules
=========================

The `awk' utility reads the input files one line at a time. For
each line, `awk' tries the patterns of each of the rules. If several
patterns match, then several actions are run in the order in which they
appear in the `awk' program. If no patterns match, then no actions are
run.

After processing all the rules that match the line (and perhaps
there are none), `awk' reads the next line. (However, *note Next
Statement::, and also *note Nextfile Statement::). This continues
until the program reaches the end of the file. For example, the
following `awk' program contains two rules:

/12/ { print $0 }
/21/ { print $0 }

The first rule has the string `12' as the pattern and `print $0' as the
action. The second rule has the string `21' as the pattern and also
has `print $0' as the action. Each rule's action is enclosed in its
own pair of braces.

This program prints every line that contains the string `12' _or_
the string `21'. If a line contains both strings, it is printed twice,
once by each rule.

This is what happens if we run this program on our two sample data
files, `BBS-list' and `inventory-shipped':

$ awk '/12/ { print $0 }
> /21/ { print $0 }' BBS-list inventory-shipped
-| aardvark 555-5553 1200/300 B
-| alpo-net 555-3412 2400/1200/300 A
-| barfly 555-7685 1200/300 A
-| bites 555-1675 2400/1200/300 A
-| core 555-2912 1200/300 C
-| fooey 555-1234 2400/1200/300 B
-| foot 555-6699 1200/300 B
-| macfoo 555-6480 1200/300 A
-| sdace 555-3430 2400/1200/300 A
-| sabafoo 555-2127 1200/300 C
-| sabafoo 555-2127 1200/300 C
-| Jan 21 36 64 620
-| Apr 21 70 74 514

Note how the line beginning with `sabafoo' in `BBS-list' was printed
twice, once for each rule.
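
The same effect can be reproduced without the sample data files. In
this sketch the three input lines are invented: the first contains only
`12', the second only `21', and the third (`121') contains both.

```shell
# Each rule is tried against every line, in program order.
out=$(printf 'a 12\nb 21\nc 121\n' | awk '/12/ { print $0 }
/21/ { print $0 }')

# The line "c 121" matches both patterns, so it appears twice.
printf '%s\n' "$out"
```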

File: gawk.info, Node: More Complex, Next: Statements/Lines, Prev: Two Rules, Up: Getting Started

A More Complex Example
======================

Now that we've mastered some simple tasks, let's look at what
typical `awk' programs do. This example shows how `awk' can be used to
summarize, select, and rearrange the output of another utility. It uses
features that haven't been covered yet, so don't worry if you don't
understand all the details:

ls -l | awk '$6 == "Nov" { sum += $5 }
END { print sum }'

This command prints the total number of bytes in all the files in the
current directory that were last modified in November (of any year).
(1) The `ls -l' part of this example is a system command that gives you
a listing of the files in a directory, including each file's size and
the date the file was last modified. Its output looks like this:

-rw-r--r-- 1 arnold user 1933 Nov 7 13:05 Makefile
-rw-r--r-- 1 arnold user 10809 Nov 7 13:03 awk.h
-rw-r--r-- 1 arnold user 983 Apr 13 12:14 awk.tab.h
-rw-r--r-- 1 arnold user 31869 Jun 15 12:20 awk.y
-rw-r--r-- 1 arnold user 22414 Nov 7 13:03 awk1.c
-rw-r--r-- 1 arnold user 37455 Nov 7 13:03 awk2.c
-rw-r--r-- 1 arnold user 27511 Dec 9 13:07 awk3.c
-rw-r--r-- 1 arnold user 7989 Nov 7 13:03 awk4.c

The first field contains read-write permissions, the second field
contains the number of links to the file, and the third field
identifies the owner of the file. The fourth field identifies the group
of the file. The fifth field contains the size of the file in bytes.
The sixth, seventh, and eighth fields contain the month, day, and time,
respectively, that the file was last modified. Finally, the ninth field
contains the name of the file.(2)

The `$6 == "Nov"' in our `awk' program is an expression that tests
whether the sixth field of the output from `ls -l' matches the string
`Nov'. Each time a line has the string `Nov' for its sixth field, the
action `sum += $5' is performed. This adds the fifth field (the file's
size) to the variable `sum'. As a result, when `awk' has finished
reading all the input lines, `sum' is the total of the sizes of the
files whose lines matched the pattern. (This works because `awk'
variables are automatically initialized to zero.)

After the last line of output from `ls' has been processed, the
`END' rule executes and prints the value of `sum'. In this example,
the value of `sum' is 80600.
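
To check the total without depending on the contents of a real
directory, the listing shown above can be fed to the same program on
standard input. This is a sketch; the shell variable names `listing'
and `sum' are arbitrary.

```shell
# The sample `ls -l' output from the text, stored in a shell variable.
listing='-rw-r--r-- 1 arnold user 1933 Nov 7 13:05 Makefile
-rw-r--r-- 1 arnold user 10809 Nov 7 13:03 awk.h
-rw-r--r-- 1 arnold user 983 Apr 13 12:14 awk.tab.h
-rw-r--r-- 1 arnold user 31869 Jun 15 12:20 awk.y
-rw-r--r-- 1 arnold user 22414 Nov 7 13:03 awk1.c
-rw-r--r-- 1 arnold user 37455 Nov 7 13:03 awk2.c
-rw-r--r-- 1 arnold user 27511 Dec 9 13:07 awk3.c
-rw-r--r-- 1 arnold user 7989 Nov 7 13:03 awk4.c'

# Add up field 5 (size) for lines whose field 6 (month) is "Nov".
sum=$(printf '%s\n' "$listing" | awk '$6 == "Nov" { sum += $5 }
END { print sum }')

echo "$sum"
```

The five November files contribute 1933 + 10809 + 22414 + 37455 + 7989
bytes.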

These more advanced `awk' techniques are covered in later sections
(*note Action Overview::). Before you can move on to more advanced
`awk' programming, you have to know how `awk' interprets your input and
displays your output. By manipulating fields and using `print'
statements, you can produce some very useful and impressive-looking
reports.

---------- Footnotes ----------

(1) In the C shell (`csh'), you need to type a semicolon and then a
backslash at the end of the first line; see *Note Statements/Lines::,
for an explanation. In a POSIX-compliant shell, such as the Bourne
shell or `bash', you can type the example as shown. If the command
`echo $path' produces an empty output line, you are most likely using a
POSIX-compliant shell. Otherwise, you are probably using the C shell
or a shell derived from it.

(2) On some very old systems, you may need to use `ls -lg' to get
this output.

File: gawk.info, Node: Statements/Lines, Next: Other Features, Prev: More Complex, Up: Getting Started

`awk' Statements Versus Lines
=============================

Most often, each line in an `awk' program is a separate statement or
separate rule, like this:

awk '/12/ { print $0 }
/21/ { print $0 }' BBS-list inventory-shipped

However, `gawk' ignores newlines after any of the following symbols
and keywords:

, { ? : || && do else

A newline at any other point is considered the end of the statement.(1)
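
As a small illustration of this rule, the following sketch breaks a
pattern after `&&', opens the action with `{' at the end of a line, and
breaks a `print' statement after a comma. The input lines are invented.

```shell
# Newlines after &&, { and , do not end the statement.
out=$(printf '5 apple\n15 pear\n25 fig\n' | awk '$1 > 10 &&
$1 < 20 {
    print $2,
          $1
}')

# Only the line whose first field lies between 10 and 20 is selected.
echo "$out"
```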

If you would like to split a single statement into two lines at a
point where a newline would terminate it, you can "continue" it by
ending the first line with a backslash character (`\'). The backslash
must be the final character on the line in order to be recognized as a
continuation character. A backslash is allowed anywhere in the
statement, even in the middle of a string or regular expression. For
example:

awk '/This regular expression is too long, so continue it\
on the next line/ { print $1 }'

We have generally not used backslash continuation in the sample programs
in this Info file. In `gawk', there is no limit on the length of a
line, so backslash continuation is never strictly necessary; it just
makes programs more readable. For this same reason, as well as for
clarity, we have kept most statements short in the sample programs
presented throughout the Info file. Backslash continuation is most
useful when your `awk' program is in a separate source file instead of
entered from the command line. You should also note that many `awk'
implementations are more particular about where you may use backslash
continuation. For example, they may not allow you to split a string
constant using backslash continuation. Thus, for maximum portability
of your `awk' programs, it is best not to split your lines in the
middle of a regular expression or a string.

*Caution:* _Backslash continuation does not work as described with
the C shell._ It works for `awk' programs in files and for one-shot
programs, _provided_ you are using a POSIX-compliant shell, such as the
Unix Bourne shell or `bash'. But the C shell behaves differently!
There, you must use two backslashes in a row, followed by a newline.
Note also that when using the C shell, _every_ newline in your awk
program must be escaped with a backslash. To illustrate:

% awk 'BEGIN { \
? print \\
? "hello, world" \
? }'
-| hello, world

Here, the `%' and `?' are the C shell's primary and secondary prompts,
analogous to the standard shell's `$' and `>'.

Compare the previous example to how it is done with a
POSIX-compliant shell:

$ awk 'BEGIN {
> print \
> "hello, world"
> }'
-| hello, world

`awk' is a line-oriented language. Each rule's action has to begin
on the same line as the pattern. To have the pattern and action on
separate lines, you _must_ use backslash continuation; there is no
other option.
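
For instance, assuming a POSIX-compliant shell (where a backslash
inside single quotes is passed to `awk' unchanged), a pattern and its
action can be placed on separate lines like this. The input is invented
for the sketch.

```shell
# The backslash is the last character on the line, so the rule
# continues onto the next line, where the action begins.
out=$(printf 'x 12\ny 34\n' | awk '/12/ \
    { print $1 }')

echo "$out"
```

Without the backslash, `/12/' alone would be a complete rule with the
default action, and the braces on the next line would form a second
rule with no pattern.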

Another thing to keep in mind is that backslash continuation and
comments do not mix. As soon as `awk' sees the `#' that starts a
comment, it ignores _everything_ on the rest of the line. For example:

$ gawk 'BEGIN { print "dont panic" # a friendly \
> BEGIN rule
> }'
error--> gawk: cmd. line:2: BEGIN rule
error--> gawk: cmd. line:2: ^ parse error

In this case, it looks like the backslash would continue the comment
onto the next line. However, the backslash-newline combination is never
even noticed because it is "hidden" inside the comment. Thus, the
`BEGIN' is noted as a syntax error.

When `awk' statements within one rule are short, you might want to
put more than one of them on a line. This is accomplished by
separating the statements with a semicolon (`;'). This also applies to
the rules themselves. Thus, the program shown at the start of this
minor node could also be written this way:

/12/ { print $0 } ; /21/ { print $0 }

*Note:* The requirement that states that rules on the same line must be
separated with a semicolon was not in the original `awk' language; it
was added for consistency with the treatment of statements within an
action.

---------- Footnotes ----------

(1) The `?' and `:' referred to here are the operators of the
three-operand conditional expression described in *Note Conditional Exp::. Splitting
lines after `?' and `:' is a minor `gawk' extension; if `--posix' is
specified (*note Options::), then this extension is disabled.

File: gawk.info, Node: Other Features, Next: When, Prev: Statements/Lines, Up: Getting Started

Other Features of `awk'
=======================

The `awk' language provides a number of predefined, or "built-in",
variables that your programs can use to get information from `awk'.
There are other variables your program can set as well to control how
`awk' processes your data.

In addition, `awk' provides a number of built-in functions for doing
common computational and string-related operations. `gawk' provides
built-in functions for working with timestamps, performing bit
manipulation, and for runtime string translation.

As we develop our presentation of the `awk' language, we introduce
most of the variables and many of the functions. They are defined
systematically in *Note Built-in Variables::, and *Note Built-in::.
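
As a brief preview, the sketch below reads two built-in variables that
carry information from `awk' (`NR', the record number, and `NF', the
number of fields), sets one that controls output (`OFS', the output
field separator), and calls the built-in function `length'. The input
lines are invented.

```shell
# Print "record-number, field-count, line-length" joined by OFS.
out=$(printf 'one two\nthree four five\n' |
      awk 'BEGIN { OFS = "-" } { print NR, NF, length($0) }')

printf '%s\n' "$out"
```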

File: gawk.info, Node: When, Prev: Other Features, Up: Getting Started

When to Use `awk'
=================

Now that you've seen some of what `awk' can do, you might wonder how
`awk' could be useful for you. By using utility programs, advanced
patterns, field separators, arithmetic statements, and other selection
criteria, you can produce much more complex output. The `awk' language
is very useful for producing reports from large amounts of raw data,
such as summarizing information from the output of other utility
programs like `ls'. (*Note More Complex::.)

Programs written with `awk' are usually much smaller than they would
be in other languages. This makes `awk' programs easy to compose and
use. Often, `awk' programs can be quickly composed at your terminal,
used once, and thrown away. Because `awk' programs are interpreted, you
can avoid the (usually lengthy) compilation part of the typical
edit-compile-test-debug cycle of software development.

Complex programs have been written in `awk', including a complete
retargetable assembler for eight-bit microprocessors (*note Glossary::,
for more information), and a microcode assembler for a special-purpose
Prolog computer. However, `awk''s capabilities are strained by tasks of
such complexity.

If you find yourself writing `awk' scripts of more than, say, a few
hundred lines, you might consider using a different programming
language. Emacs Lisp is a good choice if you need sophisticated string
or pattern matching capabilities. The shell is also good at string and
pattern matching; in addition, it allows powerful use of the system
utilities. More conventional languages, such as C, C++, and Java, offer
better facilities for system programming and for managing the complexity
of large programs. Programs in these languages may require more lines
of source code than the equivalent `awk' programs, but they are easier
to maintain and usually run more efficiently.

File: gawk.info, Node: Regexp, Next: Reading Files, Prev: Getting Started, Up: Top

Regular Expressions
*******************

A "regular expression", or "regexp", is a way of describing a set of
strings. Because regular expressions are such a fundamental part of
`awk' programming, their format and use deserve a separate major node.

A regular expression enclosed in slashes (`/') is an `awk' pattern
that matches every input record whose text belongs to that set. The
simplest regular expression is a sequence of letters, numbers, or both.
Such a regexp matches any string that contains that sequence. Thus,
the regexp `foo' matches any string containing `foo'. Therefore, the
pattern `/foo/' matches any input record containing the three
characters `foo' _anywhere_ in the record. Other kinds of regexps let
you specify more complicated classes of strings.

* Menu:

* Regexp Usage:: How to Use Regular Expressions.
* Escape Sequences:: How to write nonprinting characters.
* Regexp Operators:: Regular Expression Operators.
* Character Lists:: What can go between `[...]'.
* GNU Regexp Operators:: Operators specific to GNU software.
* Case-sensitivity:: How to do case-insensitive matching.
* Leftmost Longest:: How much text matches.
* Computed Regexps:: Using Dynamic Regexps.
* Locales:: How the locale affects things.

File: gawk.info, Node: Regexp Usage, Next: Escape Sequences, Up: Regexp

How to Use Regular Expressions
==============================

A regular expression can be used as a pattern by enclosing it in
slashes. Then the regular expression is tested against the entire text
of each record. (Normally, it only needs to match some part of the
text in order to succeed.) For example, the following prints the
second field of each record that contains the string `foo' anywhere in
it:

$ awk '/foo/ { print $2 }' BBS-list
-| 555-1234
-| 555-6699
-| 555-6480
-| 555-2127

Regular expressions can also be used in
matching expressions. These expressions allow you to specify the
string to match against; it need not be the entire current input
record. The two operators `~' and `!~' perform regular expression
comparisons. Expressions using these operators can be used as
patterns, or in `if', `while', `for', and `do' statements. (*Note
Statements::.) For example:

EXP ~ /REGEXP/

is true if the expression EXP (taken as a string) matches REGEXP. The
following example matches, or selects, all input records with the
uppercase letter `J' somewhere in the first field:

$ awk '$1 ~ /J/' inventory-shipped
-| Jan 13 25 15 115
-| Jun 31 42 75 492
-| Jul 24 34 67 436
-| Jan 21 36 64 620

So does this:

awk '{ if ($1 ~ /J/) print }' inventory-shipped

This next example is true if the expression EXP (taken as a
character string) does _not_ match REGEXP:

EXP !~ /REGEXP/

The following example matches, or selects, all input records whose
first field _does not_ contain the uppercase letter `J':

$ awk '$1 !~ /J/' inventory-shipped
-| Feb 15 32 24 226
-| Mar 15 24 34 228
-| Apr 31 52 63 420
-| May 16 34 29 208
...

When a regexp is enclosed in slashes, such as `/foo/', we call it a
"regexp constant", much like `5.27' is a numeric constant and `"foo"'
is a string constant.
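
The matching operators can also appear inside an action. The following
sketch uses `~' in an `if' statement; the three input lines are invented
and only loosely modeled on `inventory-shipped'.

```shell
# Test the first field of each record against the regexp constant /J/.
out=$(printf 'Jan 13\nFeb 15\nJun 31\n' | awk '{
    if ($1 ~ /J/)
        print $1, "matches"
    else
        print $1, "does not match"
}')

printf '%s\n' "$out"
```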

File: gawk.info, Node: Escape Sequences, Next: Regexp Operators, Prev: Regexp Usage, Up: Regexp

Escape Sequences
================

Some characters cannot be included literally in string constants
(`"foo"') or regexp constants (`/foo/'). Instead, they should be
represented with "escape sequences", which are character sequences
beginning with a backslash (`\'). One use of an escape sequence is to
include a double-quote character in a string constant. Because a plain
double quote ends the string, you must use `\"' to represent an actual
double-quote character as a part of the string. For example:

$ awk 'BEGIN { print "He said \"hi!\" to her." }'
-| He said "hi!" to her.

The backslash character itself is another character that cannot be
included normally; you must write `\\' to put one backslash in the
string or regexp. Thus, the string whose contents are the two
characters `"' and `\' must be written `"\"\\"'.

Backslash also represents unprintable characters such as TAB or
newline. While there is nothing to stop you from entering most
unprintable characters directly in a string constant or regexp constant,
they may look ugly.

The following table lists all the escape sequences used in `awk' and
what they represent. Unless noted otherwise, all these escape sequences
apply to both string constants and regexp constants:

`\\'
A literal backslash, `\'.

`\a'
The "alert" character, `Ctrl-g', ASCII code 7 (BEL). (This
usually makes some sort of audible noise.)

`\b'
Backspace, `Ctrl-h', ASCII code 8 (BS).

`\f'
Formfeed, `Ctrl-l', ASCII code 12 (FF).

`\n'
Newline, `Ctrl-j', ASCII code 10 (LF).

`\r'
Carriage return, `Ctrl-m', ASCII code 13 (CR).

`\t'
Horizontal TAB, `Ctrl-i', ASCII code 9 (HT).

`\v'
Vertical tab, `Ctrl-k', ASCII code 11 (VT).

`\NNN'
The octal value NNN, where NNN stands for 1 to 3 digits between
`0' and `7'. For example, the code for the ASCII ESC (escape)
character is `\033'.

`\xHH...'
The hexadecimal value HH, where HH stands for a sequence of
hexadecimal digits (`0'-`9', and either `A'-`F' or `a'-`f'). Like
the same construct in ISO C, the escape sequence continues until
the first nonhexadecimal digit is seen. However, using more than
two hexadecimal digits produces undefined results. (The `\x'
escape sequence is not allowed in POSIX `awk'.)

`\/'
A literal slash (necessary for regexp constants only). This
expression is used when you want to write a regexp constant that
contains a slash. Because the regexp is delimited by slashes, you
need to escape the slash that is part of the pattern, in order to
tell `awk' to keep processing the rest of the regexp.

`\"'
A literal double quote (necessary for string constants only).
This expression is used when you want to write a string constant
that contains a double quote. Because the string is delimited by
double quotes, you need to escape the quote that is part of the
string, in order to tell `awk' to keep processing the rest of the
string.

In `gawk', a number of additional two-character sequences that begin
with a backslash have special meaning in regexps. *Note GNU Regexp
Operators::.

In a regexp, a backslash before any character that is not in the
previous list and not listed in *Note GNU Regexp Operators::, means
that the next character should be taken literally, even if it would
normally be a regexp operator. For example, `/a\+b/' matches the three
characters `a+b'.

For complete portability, do not use a backslash before any
character not shown in the previous list.

To summarize:

* The escape sequences in the table above are always processed first,
for both string constants and regexp constants. This happens very
early, as soon as `awk' reads your program.

* `gawk' processes both regexp constants and dynamic regexps (*note
Computed Regexps::), for the special operators listed in *Note GNU
Regexp Operators::.

* A backslash before any other character means to treat that
character literally.
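
The following sketch exercises three of the cases described in this
section: the string `"\"\\"', the `\t' escape, and a backslash used to
take a regexp operator literally. The shell variable names and the
input lines are invented.

```shell
# A string constant containing a double quote followed by a backslash.
quoted=$(awk 'BEGIN { print "\"\\" }')

# \t stands for a horizontal TAB inside a string constant.
tabbed=$(awk 'BEGIN { print "a\tb" }')

# In a regexp, \+ is a literal plus sign, so only "a+b" is selected.
plus=$(printf 'a+b\naab\n' | awk '/a\+b/')

printf '%s\n' "$quoted" "$tabbed" "$plus"
```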

Advanced Notes: Backslash Before Regular Characters
---------------------------------------------------

If you place a backslash in a string constant before something that
is not one of the characters previously listed, POSIX `awk' purposely
leaves what happens as undefined. There are two choices:

Strip the backslash out
This is what Unix `awk' and `gawk' both do. For example, `"a\qc"'
is the same as `"aqc"'. (Because this is such an easy bug both to
introduce and to miss, `gawk' warns you about it.) Consider `FS =
"[ \t]+\|[ \t]+"', which is meant to use vertical bars surrounded by
whitespace as the field separator. There should be two backslashes
in the string: `FS = "[ \t]+\\|[ \t]+"'.

Leave the backslash alone
Some other `awk' implementations do this. In such
implementations, typing `"a\qc"' is the same as typing `"a\\qc"'.
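
A minimal sketch of the portable, two-backslash form of the field
separator follows. The input line is invented; the string constant
reaches the regexp matcher as `[ \t]+\|[ \t]+', in which `\|' is a
literal vertical bar.

```shell
# Split on a vertical bar surrounded by spaces or TABs.
out=$(printf 'left  |  middle | right\n' |
      awk 'BEGIN { FS = "[ \t]+\\|[ \t]+" } { print $2 }')

echo "$out"
```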

Advanced Notes: Escape Sequences for Metacharacters
---------------------------------------------------

Suppose you use an octal or hexadecimal escape to represent a regexp
metacharacter. (See *Note Regexp Operators::.) Does `awk' treat the
character as a literal character or as a regexp operator?

Historically, such characters were taken literally. (d.c.)
However, the POSIX standard indicates that they should be treated as
real metacharacters, which is what `gawk' does. In compatibility mode
(*note Options::), `gawk' treats the characters represented by octal
and hexadecimal escape sequences literally when used in regexp
constants. Thus, `/a\52b/' is equivalent to `/a\*b/'.

File: gawk.info, Node: Regexp Operators, Next: Character Lists, Prev: Escape Sequences, Up: Regexp

Regular Expression Operators
============================

You can combine regular expressions with special characters, called
"regular expression operators" or "metacharacters", to increase the
power and versatility of regular expressions.

The escape sequences described in *Note Escape Sequences::, are
valid inside a regexp. They are introduced by a `\' and are recognized
and converted into corresponding real characters as the very first step
in processing regexps.

Here is a list of metacharacters. All characters that are not escape
sequences and that are not listed in the table stand for themselves:

`\'
This is used to suppress the special meaning of a character when
matching. For example, `\$' matches the character `$'.

`^'
This matches the beginning of a string. For example, `^@chapter'
matches `@chapter' at the beginning of a string and can be used to
identify chapter beginnings in Texinfo source files. The `^' is
known as an "anchor", because it anchors the pattern to match only
at the beginning of the string.

It is important to realize that `^' does not match the beginning of
a line embedded in a string. The condition is not true in the
following example:

if ("line1\nLINE 2" ~ /^L/) ...

`$'
This is similar to `^', but it matches only at the end of a string.
For example, `p$' matches a record that ends with a `p'. The `$'
is an anchor and does not match the end of a line embedded in a
string. The condition in the following example is not true:

if ("line1\nLINE 2" ~ /1$/) ...

`.'
This matches any single character, _including_ the newline
character. For example, `.P' matches any single character
followed by a `P' in a string. Using concatenation, we can make a
regular expression such as `U.A', which matches any
three-character sequence that begins with `U' and ends with `A'.

In strict POSIX mode (*note Options::), `.' does not match the NUL
character, which is a character with all bits equal to zero.
Otherwise, NUL is just another character. Other versions of `awk'
may not be able to match the NUL character.

`[...]'
This is called a "character list".(1) It matches any _one_ of the
characters that are enclosed in the square brackets. For example,
`[MVX]' matches any one of the characters `M', `V', or `X' in a
string. A full discussion of what can be inside the square
brackets of a character list is given in *Note Character Lists::.

`[^ ...]'
This is a "complemented character list". The first character after
the `[' _must_ be a `^'. It matches any single character _except_ those
in the square brackets. For example, `[^awk]' matches any
character that is not an `a', `w', or `k'.

`|'
This is the "alternation operator" and it is used to specify
alternatives. The `|' has the lowest precedence of all the regular
expression operators. For example, `^P|[[:digit:]]' matches any
string that matches either `^P' or `[[:digit:]]'. This means it
matches any string that starts with `P' or contains a digit.

The alternation applies to the largest possible regexps on either
side.

`(...)'
Parentheses are used for grouping in regular expressions, as in
arithmetic. They can be used to concatenate regular expressions
containing the alternation operator, `|'. For example,
`@(samp|code)\{[^}]+\}' matches both `@code{foo}' and `@samp{bar}'.
(These are Texinfo formatting control sequences. The `+' is
explained further on in this list.)

`*'
This symbol means that the preceding regular expression should be
repeated as many times as necessary to find a match. For example,
`ph*' applies the `*' symbol to the preceding `h' and looks for
matches of one `p' followed by any number of `h's. This also
matches just `p' if no `h's are present.

The `*' repeats the _smallest_ possible preceding expression.
(Use parentheses if you want to repeat a larger expression.) It
finds as many repetitions as possible. For example, `awk
'/\(c[ad][ad]*r x\)/ { print }' sample' prints every record in
`sample' containing a string of the form `(car x)', `(cdr x)',
`(cadr x)', and so on. Notice the escaping of the parentheses by
preceding them with backslashes.

`+'
This symbol is similar to `*', except that the preceding
expression must be matched at least once. This means that `wh+y'
would match `why' and `whhy', but not `wy', whereas `wh*y' would
match all three of these strings. The following is a simpler way
of writing the last `*' example:

awk '/\(c[ad]+r x\)/ { print }' sample

`?'
This symbol is similar to `*', except that the preceding
expression can be matched either once or not at all. For example,
`fe?d' matches `fed' and `fd', but nothing else.

`{N}'
`{N,}'
`{N,M}'
One or two numbers inside braces denote an "interval expression".
If there is one number in the braces, the preceding regexp is
repeated N times. If there are two numbers separated by a comma,
the preceding regexp is repeated N to M times. If there is one
number followed by a comma, then the preceding regexp is repeated
at least N times:

`wh{3}y'
Matches `whhhy', but not `why' or `whhhhy'.

`wh{3,5}y'
Matches `whhhy', `whhhhy', or `whhhhhy', only.

`wh{2,}y'
Matches `whhy' or `whhhy', and so on.

Interval expressions were not traditionally available in `awk'.
They were added as part of the POSIX standard to make `awk' and
`egrep' consistent with each other.

However, because old programs may use `{' and `}' in regexp
constants, by default `gawk' does _not_ match interval expressions
in regexps. If either `--posix' or `--re-interval' is specified
(*note Options::), then interval expressions are allowed in
regexps.

For new programs that use `{' and `}' in regexp constants, it is
good practice to always escape them with a backslash. Then the
regexp constants are valid and work the way you want them to, using
any version of `awk'.(2)

In regular expressions, the `*', `+', and `?' operators, as well as
the braces `{' and `}', have the highest precedence, followed by
concatenation, and finally by `|'. As in arithmetic, parentheses can
change how operators are grouped.
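
The sketch below tries out three points from the list above: `^' does
not match after an embedded newline, `|' binds more loosely than
concatenation, and parentheses group alternatives. The input lines and
shell variable names are invented, and `[0-9]' is used here in place of
`[[:digit:]]'.

```shell
# ^ anchors at the start of the string, not at the start of "LINE 2".
anchor=$(awk 'BEGIN {
    if ("line1\nLINE 2" ~ /^L/) print "match"; else print "no match"
}')

# Reads as (^P) or ([0-9]): starts with P, or contains a digit anywhere.
alt=$(printf 'Perl\nawk 3\nsed\n' | awk '/^P|[0-9]/')

# Parentheses limit the alternation to the command name after the @.
grouped=$(printf '@code{foo}\n@samp{bar}\n@var{baz}\n' |
          awk '/@(samp|code)\{[^}]+\}/')

printf '%s\n' "$anchor" "$alt" "$grouped"
```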

In POSIX `awk' and `gawk', the `*', `+', and `?' operators stand for
themselves when there is nothing in the regexp that precedes them. For
example, `/+/' matches a literal plus sign. However, many other
versions of `awk' treat such a usage as a syntax error.

If `gawk' is in compatibility mode (*note Options::), POSIX
character classes and interval expressions are not available in regular
expressions.

---------- Footnotes ----------

(1) In other literature, you may see a character list referred to as
either a "character set", a "character class", or a "bracket
expression".

(2) Use two backslashes if you're using a string constant with a
regexp operator or function.

File: gawk.info, Node: Character Lists, Next: GNU Regexp Operators, Prev: Regexp Operators, Up: Regexp

Using Character Lists
=====================

Within a character list, a "range expression" consists of two
characters separated by a hyphen. It matches any single character that
sorts between the two characters, using the locale's collating sequence
and character set. For example, in the default C locale, `[a-dx-z]' is
equivalent to `[abcdxyz]'. Many locales sort characters in dictionary
order, and in these locales, `[a-dx-z]' is typically not equivalent to
`[abcdxyz]'; instead it might be equivalent to `[aBbCcDdxXyYz]', for
example. To obtain the traditional interpretation of bracket
expressions, you can use the C locale by setting the `LC_ALL'
environment variable to the value `C'.

To include one of the characters `\', `]', `-', or `^' in a
character list, put a `\' in front of it. For example:

[d\]]

matches either `d' or `]'.
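
Both points can be checked from the shell. In this sketch the input
lines are invented; `LC_ALL=C' is set for the first command only, so
that the ranges have their traditional meaning.

```shell
# In the C locale, [a-dx-z] is the same as [abcdxyz].
ranged=$(printf 'abc\nxyz\nmno\nABC\n' | LC_ALL=C awk '/^[a-dx-z]+$/')

# [d\]] matches either d or ]; the backslash keeps ] from closing the list.
bracket=$(printf 'add\na]b\nxyz\n' | awk '/[d\]]/')

printf '%s\n' "$ranged" "$bracket"
```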

This treatment of `\' in character lists is compatible with other
`awk' implementations and is also mandated by POSIX. The regular
expressions in `awk' are a superset of the POSIX specification for
Extended Regular Expressions (EREs). POSIX EREs are based on the
regular expressions accepted by the traditional `egrep' utility.

"Character classes" are a new feature introduced in the POSIX
standard. A character class is a special notation for describing lists
of characters that have a specific attribute, but the actual characters
can vary from country to country and/or from character set to character
set. For example, the notion of what is an alphabetic character
differs between the United States and France.

A character class is only valid in a regexp _inside_ the brackets of
a character list. Character classes consist of `[:', a keyword
denoting the class, and `:]'. Here are the character classes defined
by the POSIX standard.

`[:alnum:]' Alphanumeric characters.
`[:alpha:]' Alphabetic characters.
`[:blank:]' Space and TAB characters.
`[:cntrl:]' Control characters.
`[:digit:]' Numeric characters.
`[:graph:]' Characters that are both printable and visible. (A space is
printable but not visible, whereas an `a' is both.)
`[:lower:]' Lowercase alphabetic characters.
`[:print:]' Printable characters (characters that are not control
characters).
`[:punct:]' Punctuation characters (characters that are not letters,
digits, control characters, or space characters).
`[:space:]' Space characters (such as space, TAB, and formfeed, to name a
few).
`[:upper:]' Uppercase alphabetic characters.
`[:xdigit:]' Characters that are hexadecimal digits.

For example, before the POSIX standard, you had to write
`/[A-Za-z0-9]/' to match alphanumeric characters. If your character
set had other alphabetic characters in it, this would not match them,
and if your character set collated differently from ASCII, this might
not even match the ASCII alphanumeric characters. With the POSIX
character classes, you can write `/[[:alnum:]]/' to match the alphabetic
and numeric characters in your character set.
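
A short sketch of a character class in use follows. It assumes an `awk'
that implements the POSIX character classes (some older implementations
do not); the input lines are invented.

```shell
# Select the lines that contain at least one alphanumeric character.
out=$(printf 'abc123\n---\n!?\nx_1\n' | awk '/[[:alnum:]]/')

printf '%s\n' "$out"
```

Note the doubled brackets: the outer pair delimits the character list,
and `[:alnum:]' is the class named inside it.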

Two additional special sequences can appear in character lists.
These apply to non-ASCII character sets, which can have single symbols
(called "collating elements") that are represented with more than one
character. They can also have several characters that are equivalent for
"collating", or sorting, purposes. (For example, in French, a plain "e"
and a grave-accented "e`" are equivalent.) These sequences are:

Collating symbols
Multicharacter collating elements enclosed between `[.' and `.]'.
For example, if `ch' is a collating element, then `[[.ch.]]' is a
regexp that matches this collating element, whereas `[ch]' is a
regexp that matches either `c' or `h'.

Equivalence classes
Locale-specific names for a list of characters that are equal. The
name is enclosed between `[=' and `=]'. For example, the name `e'
might be used to represent all of "e," "e`," and "e'." In this
case, `[[=e=]]' is a regexp that matches any of `e', `e'', or `e`'.

These features are very valuable in non-English-speaking locales.

*Caution:* The library functions that `gawk' uses for regular
expression matching currently recognize only POSIX character classes;
they do not recognize collating symbols or equivalence classes.

File: gawk.info, Node: GNU Regexp Operators, Next: Case-sensitivity, Prev: Character Lists, Up: Regexp

`gawk'-Specific Regexp Operators
================================

GNU software that deals with regular expressions provides a number of
additional regexp operators. These operators are described in this
minor node and are specific to `gawk'; they are not available in other
`awk' implementations. Most of the additional operators deal with word
matching. For our purposes, a "word" is a sequence of one or more
letters, digits, or underscores (`_'):

`\w'
Matches any word-constituent character--that is, it matches any
letter, digit, or underscore. Think of it as shorthand for
`[[:alnum:]_]'.

`\W'
Matches any character that is not word-constituent. Think of it
as shorthand for `[^[:alnum:]_]'.

`\<'
Matches the empty string at the beginning of a word. For example,
`/\<away/' matches `away' but not `stowaway'.

`\>'
Matches the empty string at the end of a word. For example,
`/stow\>/' matches `stow' but not `stowaway'.

`\y'
Matches the empty string at either the beginning or the end of a
word (i.e., the word boundar*y*). For example, `\yballs?\y'
matches either `ball' or `balls', as a separate word.

`\B'
Matches the empty string that occurs between two word-constituent
characters. For example, `/\Brat\B/' matches `crate' but it does
not match `dirty rat'. `\B' is essentially the opposite of `\y'.

There are two other operators that work on buffers. In Emacs, a
"buffer" is, naturally, an Emacs buffer. For other programs, `gawk''s
regexp library routines consider the entire string to match as the
buffer. The operators are:

`\`'
Matches the empty string at the beginning of a buffer (string).

`\''
Matches the empty string at the end of a buffer (string).

Because `^' and `$' always work in terms of the beginning and end of
strings, these operators don't add any new capabilities for `awk'.
They are provided for compatibility with other GNU software.

In other GNU software, the word-boundary operator is `\b'. However,
that conflicts with the `awk' language's definition of `\b' as
backspace, so `gawk' uses a different letter. An alternative method
would have been to require two backslashes in the GNU operators, but
this was deemed too confusing. The current method of using `\y' for the
GNU `\b' appears to be the lesser of two evils.

The various command-line options (*note Options::) control how
`gawk' interprets characters in regexps:

No options
In the default case, `gawk' provides all the facilities of POSIX
regexps and the GNU regexp operators described in *Note Regexp
Operators::. However, interval expressions are not supported.

`--posix'
Only POSIX regexps are supported; the GNU operators are not special
(e.g., `\w' matches a literal `w'). Interval expressions are
allowed.

`--traditional'
Traditional Unix `awk' regexps are matched. The GNU operators are
not special, interval expressions are not available, nor are the
POSIX character classes (`[[:alnum:]]', etc.). Characters
described by octal and hexadecimal escape sequences are treated
literally, even if they represent regexp metacharacters.

`--re-interval'
Allow interval expressions in regexps, even if `--traditional' has
been provided. (`--posix' automatically enables interval
expressions, so `--re-interval' is redundant when `--posix' is
used.)

File: gawk.info, Node: Case-sensitivity, Next: Leftmost Longest, Prev: GNU Regexp Operators, Up: Regexp

Case Sensitivity in Matching
============================

Case is normally significant in regular expressions, both when
matching ordinary characters (i.e., not metacharacters) and inside
character sets. Thus, a `w' in a regular expression matches only a
lowercase `w' and not an uppercase `W'.

The simplest way to do a case-independent match is to use a character
list--for example, `[Ww]'. However, this can be cumbersome if you need
to use it often, and it can make the regular expressions harder to
read. There are two alternatives that you might prefer.

One way to perform a case-insensitive match at a particular point in
the program is to convert the data to a single case, using the
`tolower' or `toupper' built-in string functions (which we haven't
discussed yet; *note String Functions::). For example:

tolower($1) ~ /foo/ { ... }

converts the first field to lowercase before matching against it. This
works in any POSIX-compliant `awk'.
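
A complete command built around this rule might look like the following
sketch. The helper name `first_field_foo' and the sample input are
made up for illustration:

```shell
# first_field_foo is a hypothetical helper: it reads lines on standard
# input and prints the first field of each line whose first field
# contains "foo" in any mix of upper- and lowercase.
first_field_foo() {
    awk 'tolower($1) ~ /foo/ { print $1 }'
}

printf '%s\n' 'FOO bar' 'Food court' 'baz foo' | first_field_foo
```

The third line is skipped: its `foo' is in the second field, and only
`$1' is tested. The original field is printed unchanged, because
`tolower' returns a copy and does not modify `$1'.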

Another method, specific to `gawk', is to set the variable
`IGNORECASE' to a nonzero value (*note Built-in Variables::). When
`IGNORECASE' is not zero, _all_ regexp and string operations ignore
case. Changing the value of `IGNORECASE' dynamically controls the
case-sensitivity of the program as it runs. Case is significant by
default because `IGNORECASE' (like most variables) is initialized to
zero:

x = "aB"
if (x ~ /ab/) ... # this test will fail

IGNORECASE = 1
if (x ~ /ab/) ... # now it will succeed

In general, you cannot use `IGNORECASE' to make certain rules
case-insensitive and other rules case-sensitive, because there is no
straightforward way to set `IGNORECASE' just for the pattern of a
particular rule.(1) To do this, use either character lists or
`tolower'. However, one thing that only `IGNORECASE' lets you do is
dynamically turn case-sensitivity on or off for all the rules at once.

`IGNORECASE' can be set on the command line or in a `BEGIN' rule
(*note Other Arguments::; also *note Using BEGIN/END::). Setting
`IGNORECASE' from the command line is a way to make a program
case-insensitive without having to edit it.

Prior to `gawk' 3.0, the value of `IGNORECASE' affected regexp
operations only. It did not affect string comparison with `==', `!=',
and so on. Beginning with version 3.0, both regexp and string
comparison operations are affected by `IGNORECASE'.

Beginning with `gawk' 3.0, the equivalences between upper- and
lowercase characters are based on the ISO-8859-1 (ISO Latin-1)
character set. This character set is a superset of the traditional 128
ASCII characters, which also provides a number of characters suitable
for use with European languages.

The value of `IGNORECASE' has no effect if `gawk' is in
compatibility mode (*note Options::). Case is always significant in
compatibility mode.

---------- Footnotes ----------

(1) Experienced C and C++ programmers will note that it is possible,
using something like `IGNORECASE = 1 && /foObAr/ { ... }' and
`IGNORECASE = 0 || /foobar/ { ... }'. However, this is somewhat
obscure and we don't recommend it.

File: gawk.info, Node: Leftmost Longest, Next: Computed Regexps, Prev: Case-sensitivity, Up: Regexp

How Much Text Matches?
======================

Consider the following:

echo aaaabcd | awk '{ sub(/a+/, ""); print }'

This example uses the `sub' function (which we haven't discussed yet;
*note String Functions::) to make a change to the input record. Here,
the regexp `/a+/' indicates "one or more `a' characters," and the
replacement text is the empty string, `""'.

The input contains four `a' characters. `awk' (and POSIX) regular
expressions always match the leftmost, _longest_ sequence of input
characters that can match. Thus, all four `a' characters are replaced
with the empty string in this example:

$ echo aaaabcd | awk '{ sub(/a+/, ""); print }'
-| bcd

For simple match/no-match tests, this is not so important. But when
doing text matching and substitutions with the `match', `sub', `gsub',
and `gensub' functions, it is very important. *Note String Functions::,
for more information on these functions. Understanding this principle
is also important for regexp-based record and field splitting (*note
Records::, and also *note Field Separators::).
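
The two halves of the rule, "leftmost" and "longest," can be seen
separately with the `match' function, which records where the match
began and how long it was. This is only a sketch; `first_a_run' is a
hypothetical helper name:

```shell
# first_a_run is a hypothetical helper: it prints the start position
# (RSTART) and length (RLENGTH) of the text matched by /a+/ in its
# argument.
first_a_run() {
    printf '%s\n' "$1" | awk '{ match($0, /a+/); print RSTART, RLENGTH }'
}

first_a_run aaaabcd    # longest: all four a's, starting at position 1
first_a_run xaayaaaa   # leftmost: the run of two at position 2 wins
```

In the second call the later run of four `a' characters is longer, but
it is not chosen, because a match that starts earlier always takes
precedence.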

File: gawk.info, Node: Computed Regexps, Next: Locales, Prev: Leftmost Longest, Up: Regexp

Using Dynamic Regexps
=====================

The righthand side of a `~' or `!~' operator need not be a regexp
constant (i.e., a string of characters between slashes). It may be any
expression. The expression is evaluated and converted to a string if
necessary; the contents of the string are used as the regexp. A regexp
that is computed in this way is called a "dynamic regexp":

BEGIN { digits_regexp = "[[:digit:]]+" }
$0 ~ digits_regexp { print }

This sets `digits_regexp' to a regexp that describes one or more digits,
and tests whether the input record matches this regexp.

*Caution:* When using the `~' and `!~'
operators, there is a difference between a regexp constant enclosed in
slashes and a string constant enclosed in double quotes. If you are
going to use a string constant, you have to understand that the string
is, in essence, scanned _twice_: the first time when `awk' reads your
program, and the second time when it goes to match the string on the
lefthand side of the operator with the pattern on the right. This is
true of any string-valued expression (such as `digits_regexp', shown
previously), not just string constants.

What difference does it make if the string is scanned twice? The
answer has to do with escape sequences, and particularly with
backslashes. To get a backslash into a regular expression inside a
string, you have to type two backslashes.

For example, `/\*/' is a regexp constant for a literal `*'. Only
one backslash is needed. To do the same thing with a string, you have
to type `"\\*"'. The first backslash escapes the second one so that
the string actually contains the two characters `\' and `*'.
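
The two spellings can be compared side by side. In this sketch the
helper names are made up, and the `awk' programs are enclosed in single
quotes so that the shell passes the backslashes through unchanged:

```shell
# Both helpers are hypothetical; each reads standard input and prints
# the lines that contain a literal "*".
star_regexp_constant() { awk '$0 ~ /\*/'; }     # one backslash
star_string_constant() { awk '$0 ~ "\\*"'; }    # two backslashes

printf '%s\n' 'a*b' 'ab' | star_regexp_constant
printf '%s\n' 'a*b' 'ab' | star_string_constant
```

Both commands select the same line, `a*b'.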

Given that you can use both regexp and string constants to describe
regular expressions, which should you use? The answer is "regexp
constants," for several reasons:

* String constants are more complicated to write and more difficult
to read. Using regexp constants makes your programs less
error-prone. Not understanding the difference between the two
kinds of constants is a common source of errors.

* It is more efficient to use regexp constants. `awk' can note that
you have supplied a regexp and store it internally in a form that
makes pattern matching more efficient. When using a string
constant, `awk' must first convert the string into this internal
form and then perform the pattern matching.

* Using regexp constants is better form; it shows clearly that you
intend a regexp match.

Advanced Notes: Using `\n' in Character Lists of Dynamic Regexps
----------------------------------------------------------------

Some commercial versions of `awk' do not allow the newline character
to be used inside a character list for a dynamic regexp:

$ awk '$0 ~ "[ \t\n]"'
error--> awk: newline in character class [
error--> ]...
error--> source line number 1
error--> context is
error--> >>> <<<

But a newline in a regexp constant works with no problem:

$ awk '$0 ~ /[ \t\n]/'
here is a sample line
-| here is a sample line
Ctrl-d

`gawk' does not have this problem, and it isn't likely to occur
often in practice, but it's worth noting for future reference.

File: gawk.info, Node: Locales, Prev: Computed Regexps, Up: Regexp

Where You Are Makes A Difference
================================

Modern systems support the notion of "locales": a way to tell the
system about the local character set and language. The current locale
setting can affect the way regexp matching works, often in surprising
ways. In particular, many locales do case-insensitive matching, even
when you may have specified characters of only one particular case.

The following example uses the `sub' function, which does text
replacement (*note String Functions::). Here, the intent is to remove
trailing uppercase characters:

$ echo something1234abc | gawk '{ sub("[A-Z]*$", ""); print }'
-| something1234

This output is unexpected, since the `abc' at the end of
`something1234abc' should not normally match `[A-Z]*'. This result is
due to the locale setting (and thus you may not see it on your system).
There are two fixes. The first is to use the POSIX character class
`[[:upper:]]', instead of `[A-Z]'. The second is to change the locale
setting in the environment, before running `gawk', by using the shell
statements:

LANG=C LC_ALL=C
export LANG LC_ALL

The setting `C' forces `gawk' to behave in the traditional Unix
manner, where case distinctions do matter. You may wish to put these
statements into your shell startup file, e.g., `$HOME/.profile'.
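
Both fixes can be sketched as small shell helpers. The helper names are
invented for this example, and the second one sets `LC_ALL' for a
single command rather than exporting it for the whole session:

```shell
# Both helpers are hypothetical; each removes trailing uppercase
# letters from its argument.
strip_upper_class() {     # fix 1: POSIX character class
    printf '%s\n' "$1" | awk '{ sub(/[[:upper:]]*$/, ""); print }'
}
strip_upper_c_locale() {  # fix 2: range, with the locale forced to C
    printf '%s\n' "$1" | LC_ALL=C awk '{ sub(/[A-Z]*$/, ""); print }'
}

strip_upper_class    something1234abc   # lowercase tail is kept
strip_upper_c_locale something1234ABC   # uppercase tail is removed
```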

Similar considerations apply to other ranges. For example, `["-/]'
is perfectly valid in ASCII, but is not valid in many Unicode locales,
such as `en_US.UTF-8'. (In general, such ranges should be avoided;
either list the characters individually, or use a POSIX character class
such as `[[:punct:]]'.)

For the normal case of `RS = "\n"', the locale is largely irrelevant.
For other single byte record separators, using `LC_ALL=C' will give you
much better performance when reading records. Otherwise, `gawk' has to
make several function calls _per input character_ to find the record
terminator.

File: gawk.info, Node: Reading Files, Next: Printing, Prev: Regexp, Up: Top

Reading Input Files
*******************

In the typical `awk' program, all input is read either from the
standard input (by default, this is the keyboard, but often it is a
pipe from another command) or from files whose names you specify on the
`awk' command line. If you specify input files, `awk' reads them in
order, processing all the data from one before going on to the next.
The name of the current input file can be found in the built-in variable
`FILENAME' (*note Built-in Variables::).

The input is read in units called "records", and is processed by the
rules of your program one record at a time. By default, each record is
one line. Each record is automatically split into chunks called
"fields". This makes it more convenient for programs to work on the
parts of a record.

On rare occasions, you may need to use the `getline' command. The
`getline' command is valuable, both because it can do explicit input
from any number of files, and because the files used with it do not
have to be named on the `awk' command line (*note Getline::).

* Menu:

* Records:: Controlling how data is split into records.
* Fields:: An introduction to fields.
* Nonconstant Fields:: Nonconstant Field Numbers.
* Changing Fields:: Changing the Contents of a Field.
* Field Separators:: The field separator and how to change it.
* Constant Size:: Reading constant width data.
* Multiple Line:: Reading multi-line records.
* Getline:: Reading files under explicit program control
using the `getline' function.

File: gawk.info, Node: Records, Next: Fields, Up: Reading Files

How Input Is Split into Records
===============================

The `awk' utility divides the input for your `awk' program into
records and fields. `awk' keeps track of the number of records that
have been read so far from the current input file. This value is
stored in a built-in variable called `FNR'. It is reset to zero when a
new file is started. Another built-in variable, `NR', is the total
number of input records read so far from all data files. It starts at
zero, but is never automatically reset to zero.
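
The difference between the two counters shows up as soon as there is
more than one input file. The following is a sketch only; the helper
name `show_counts' and the file names `one' and `two' are made up for
the demonstration:

```shell
# show_counts is a hypothetical helper: for every record of the files
# named as arguments, it prints the file name, the per-file record
# number (FNR), and the running total (NR).
show_counts() {
    awk '{ print FILENAME, FNR, NR }' "$@"
}

# Demonstration with two throwaway files in a temporary directory.
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/one"
printf 'c\n'    > "$tmpdir/two"
(cd "$tmpdir" && show_counts one two)
rm -rf "$tmpdir"
```

The single record of `two' is reported with an `FNR' of 1 and an `NR'
of 3: `FNR' started over with the new file, while `NR' kept counting.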

Records are separated by a character called the "record separator".
By default, the record separator is the newline character. This is why
records are, by default, single lines. A different character can be
used for the record separator by assigning the character to the
built-in variable `RS'.

Like any other variable, the value of `RS' can be changed in the
`awk' program wit