UP | HOME

1 An AWK Tutorial

2 Getting started

  • Awk is a convenient and expressive programming language that can be applied to a wide variety of computing and data-manipulation tasks.
  • emp.data will be used to explore further
$ bat emp.data
Beth 4.0 0
Dan 3.75 0
Kathy 4.0 10
Mark 5.0 20
Mary 5.5 22
Susie 4.25 18
  • First column: Employee name
  • Second column: Pay rate in dollars per hour
  • Third column: Number of hours worked

2.1 Employees who worked more than zero hours

$3 > 0 { print $1, $2 * $3} 
Kathy 40
Mark 100
Mary 121
Susie 76.5

Note that the above command in a CLI will be invoked like this:

$ awk '$3 > 0 { print $1, $2 * $3}' emp.data
  • But in this document we will just concentrate on it's language. The part inside the quotes is the complete awk program. It consists of a single pattern action statement.
  • The pattern $3 > 0$ matches every input line in which the third column or field is greater than zero and executes the corresponding action.

2.2 Employees who didn't work

$3 == 0 { print $1 }
Beth
Dan

3 The Structure of an AWK Program

  • Each awk program is a sequence of one or more pattern action statements:
pattern { action }
pattern { action }
...
  • The basic operation of awk is to scan a sequence of input lines one after another, searching for lines that are matched by any of the patterns in the program. The precise meaning of the word "match" depends on the pattern.
  • Every input line is tested against each of the patterns in turn. For each pattern that matches, the corresponding action is performed.
  • If the pattern has no action, then each line that the pattern matches is printed:
$3 == 0
Beth 4.0 0
Dan 3.75 0
  • If there is an action with no pattern, then for each line the action is performed:
{ print $1 }
Beth
Dan
Kathy
Mark
Mary
Susie

4 Running an AWK Program

Several ways:

$ aws 'program' files
$ aws '$3 == 0 { print $1}' file1 file2

Or you can supply input via stdin:

$ awk 'program'

It accepts input till the end-of-file signal. (Ctrl-D on Unix). The stdin way of doing will be quite handy to experiment. If your awk program is long, you can pass the program explicitly:

$ awk -f progfile input files

5 Errors

Awk will report an error if there is an error in the code (duh!). Let's mistype a brace:

$ awk '$3 == 0 [ print $1 }' emp.dat
awk: line 1: syntax error at or near [
awk: line 1: extra '}'

6 Simple output

  • There are only two types of data in awk: numbers and string of characters.
  • Awk reads its input one line at a time and splits each line into fields, where, by default, a field is a sequence of characters that doesn't contain any blanks or tabs. The first field in the current input line is called $1, the second $2, and so forth. The entire line is called $0. The number of fields can vary from line to line.

6.1 Printing Every line

{ print }
Beth 4.0 0
Dan 3.75 0
Kathy 4.0 10
Mark 5.0 20
Mary 5.5 22
Susie 4.25 18

Or even this:

{ print $0 }
Beth 4.0 0
Dan 3.75 0
Kathy 4.0 10
Mark 5.0 20
Mary 5.5 22
Susie 4.25 18

6.2 Printing Certain Fields

{ print $1, $3 }
Beth 0
Dan 0
Kathy 10
Mark 20
Mary 22
Susie 18
  • Expressions separated by a comma in a print statement are, by default, separated by a single blank when they are printed.

6.3 NF, the Number of Fields

  • Awk counts the number of fields in the current input line and stores the count in a built-in variable called NF.
  • Any expression can be used after $ to denote a field number; the expression is evaluated and its numberic value is used as the field number.
{ print NF, $1, $NF }
3 Beth 0
3 Dan 0
3 Kathy 10
3 Mark 20
3 Mary 22
3 Susie 18

6.4 Computing and Printing

{ print $1, $2 * $3 }
Beth 0
Dan 0
Kathy 40
Mark 100
Mary 121
Susie 76.5

6.5 Printing Line Numbers

  • Awk provides another built-variable, called NR, that counts the number of lines read so far.
{ print NR, $0 }
1 Beth 4.0 0
2 Dan 3.75 0
3 Kathy 4.0 10
4 Mark 5.0 20
5 Mary 5.5 22
6 Susie 4.25 18

6.6 Printing Text in the Output

{ print "total pay for", $1, "is", $2 * $3 }
total pay for Beth is 0
total pay for Dan is 0
total pay for Kathy is 40
total pay for Mark is 100
total pay for Mary is 121
total pay for Susie is 76.5

7 Fancier Output

  • The print statement is meant for quick and easy output.
  • To format the output exactly the way you want it, you may have to use the printf statement.

7.1 Lining up Fields

printf(format, value_1, value_2, ... , value_n)
  • format: is a string that contains text to be printed verbatim, interspered with specifications of how each of the values is to be printed.
  • A specification is a % followed by a few characters that control the format of a value.
  • Thus, there must be as many % specifications in format as values to be printed.
{ printf("total pay for %s is $%.2f\n", $1, $2 * $3) }
total pay for Beth is $0.00
total pay for Dan is $0.00
total pay for Kathy is $40.00
total pay for Mark is $100.00
total pay for Mary is $121.00
total pay for Susie is $76.50
  • %s: Specifices that the first value is a string of characters.
  • %.2f: Specifices that the second value ($2*$3) is a number with 2 digits after decimal point.
  • \n: Newline
  • With printf, no blanks or newlines are produced automatically; you must create them yourself.
{ printf("%-8s $%6.2f\n",$1, $2 * $3)}
Beth     $  0.00
Dan      $  0.00
Kathy    $ 40.00
Mark     $100.00
Mary     $121.00
Susie    $ 76.50
  • %-8s Prints a name as a string of characters left justified in a field 8 characters wide.
  • %6.2f prints pay as a number with two digits after the decimal point, in a field 6 characters wide.

7.2 Sorting the Output

$ awk '{printf("%6.2f %s\n", $2 * $3, $0)}' emp.data | sort
0.0 Beth 4.0 0
0.0 Dan 3.75 0
100.0 Mark 5.0 20
121.0 Mary 5.5 22
40.0 Kathy 4.0 10
76.5 Susie 4.25 18

8 Selection

  • Awk patterns are good for selecting interesting lines from the input for further processing.
  • Since a pattern without an action prints all lines matching the pattern, many awk programs consist of nothing more than a single pattern.

8.1 Selection by Comparison

  • Employees who earn more than or equal to $5 per hour:
$2 >= 5
Mark 5.0 20
Mary 5.5 22

8.2 Selection by Computation

  • Employees whose total pay exceeds $50
$2 * $3 > 50 { printf("$%.2f for %s\n", $2 * $3, $1)}
$100.00 for Mark
$121.00 for Mary
$76.50 for Susie

8.3 Selection by Text Content

$1 == "Susie"
Susie 4.25   18
  • The text pattern can also be regular expressions

8.4 Combinations of Patterns

  • Logical operators can be used: &&, || and !
$2 >= 4
$3 >= 20
Beth 4.0 0
Kathy 4.0 10
Mark 5.0 20
Mark 5.0 20
Mary 5.5 22
Mary 5.5 22
Susie 4.25 18
!($2 < 4 && $3 < 20)
Beth 4.0 0
Kathy 4.0 10
Mark 5.0 20
Mary 5.5 22
Susie 4.25 18

8.5 Data validation

  • Awk is an excellent tool for checking that data has reasonable values and is in the right format, a task that is often called data validation.
  • Data validation is essentially negative: instead of printing lines with desirable properties, one prints lines that are suspicious.
NF != 3 { print $0, "number of files is not equal to 3"}

If there are no errors, there's no output.

$2 > 10 { print $0, "rate exceeds $10 per hour"}

8.6 BEGIN and END

  • The special pattern BEGIN matches before the first line of the first input ·file is read, and END matches after the last line of the last file has been processed.
BEGIN { print "NAME  RATE   HOURS"; print ""}
      { print }
NAME  RATE   HOURS

Beth  4.00   0
Dan   3.75   0
Kathy 4.00   10
Mark  5.00   20
Mary  5.50   22
Susie 4.25   18

9 Computing with AWK

  • An action is a sequence of statements separated by newlines or semicolons.
  • In awk, user-created variables are not declared.
$3 > 15 { emp = emp + 1}
END     { print emp, "employees worked more than 15 hours" }
3 employees worked more than 15 hours
  • Awk variables used as numbers begin life with the value of 0, so we didn't need to initialize emp above.

9.1 Computing Sums and Averages

  • Computing number of employees
END { print NR, "employees" }
6 employees
  • Compute average pay
    { pay = pay + $2 * $3 }
END { print NR, "employees" 
      print "total pay is", pay
      print "average pay is", pay/NR
    }
6 employees
total pay is 337.5
average pay is 56.25

9.2 Handling Text

  • Awk variables can hold strings of characters.
  • Variables used to store strings begin life holding the null string.
$2 > maxrate { maxrate = $2; maxemp = $1 }
END { print "highest hourly rate:", maxrate, "for", maxemp }
highest hourly rate: 5.50 for Mary
  • In the above awk code, the variable maxemp holds a string.

9.3 String Concatenation

  • Combine all the employee names into a single string:
    { names = names $1 " " }
END { print names }
Beth Dan Kathy Mark Mary Susie

9.4 Printing the Last Input Line

    { last = $0 }
END { print last }
Susie 4.25   18

9.5 Built-in functions

  • Arithematic functions for square roots, logarithms, random numbers etc.
  • length: Counts the number of characters in a string
{ print $1, length($1) }
Beth 4
Dan 3
Kathy 5
Mark 4
Mary 4
Susie 5

9.6 Counting Lines, Words and Characters

  • In the below program, for convenience we will treat each field as word.
    { nc = nc + length($0) + 1 } # 1 for newline
    { nw = nw + NF }
END { print NR, "lines", nw, "words", nc, "characters" }
6 lines 18 words 94 characters

10 Control-Flow Statements

10.1 If-Else Statement

  • Compute total and average pay of employees making more than $6 per hour.
$2 > 6 { n = n + 1; pay = pay + $2 * $3 }
END    { if (n > 0)
             print n, "employees total pay is", pay, "average pay is", pay/n
         else
             print "no employees are paid more than $6/hour"
       }
no employees are paid more than $6/hour

10.2 While Statement

{ i = 1
  while (i <= $3) {
    printf("\t%.2f\n", $1 * (1 + $2) ^ i)
    i = i + 1
  }
}

10.3 For Statement

{ for(i = 1; i <= $3 ; i = i + 1)
     printf("\t%.2f\n", $1 * (1 + $2) ^ i)
}

11 Arrays

  • Arrays are used for storing groups of related values.
  • Program to reverse each line
    { line[NR] = $0 }               # remember each input line
END { i = NR
      while (i > 0) {
        print line[i]
        i = i - 1
      }
}
Susie 4.25   18
Mary  5.50   22
Mark  5.00   20
Kathy 4.00   10
Dan   3.75   0
Beth  4.00   0

Same example with a for statement:

    { line[NR] = $0 }
END { for(i = NR; i > 0; i = i - 1)
           print line[i]
    }
Susie 4.25   18
Mary  5.50   22
Mark  5.00   20
Kathy 4.00   10
Dan   3.75   0
Beth  4.00   0

12 A Handful of Useful "One-liners"

  • Print total number of input lines
END { print NR }
6
  • Print the third input line
NR == 3
Kathy 4.00   10