AWK tutorial
last modified October 18, 2023
This is an AWK tutorial. It covers the basics of the AWK tool.
AWK
AWK is a pattern-directed scanning and processing language. An AWK program consists of a set of actions to be taken against streams of textual data. AWK extensively uses regular expressions. It is a standard feature of most Unix-like operating systems.
AWK was created at Bell Labs in 1977. Its name is derived from the family names of its authors: Alfred Aho, Peter Weinberger, and Brian Kernighan.
There are two major implementations of AWK: the traditional Unix AWK and the newer GAWK, which is the GNU Project's implementation of the AWK programming language. GAWK has several extensions to the original AWK.
AWK program
An AWK program consists of a sequence of pattern-action statements and optional function definitions. It processes text files. AWK is a line oriented language. It divides a file into lines called records. Each line is broken up into a sequence of fields. The fields are accessed by special variables: $1 reads the first field, $2 the second and so on. The $0 variable refers to the whole record.
The structure of an AWK program has the following form:
pattern { action }
The pattern is a test that is performed on each of the records. If the condition is met then the action is performed. Either pattern or action can be omitted, but not both. The default pattern matches each line and the default action is to print the record.
awk -f program-file [file-list]
awk 'program' [file-list]
An AWK program can be run in two basic ways: a) the program is read from a separate file, whose name follows the -f option; b) the program is specified on the command line, enclosed in quote characters.
AWK one-liners
AWK one-liners are simple one-shot programs run from the command line. Let us have the following text file:
storeroom
tree
cup
store
book
cloud
existence
ministerial
falcon
town
sky
top
bookworm
bookcase
war
We want to print all words in the words.txt file that are longer than five characters.
$ awk 'length($1) > 5 {print $0}' words.txt
storeroom
existence
ministerial
falcon
bookworm
bookcase
The AWK program is placed between two single quote characters. The first part is the pattern; we specify that the length of the record is greater than five. The length function returns the length of the string. The $1 variable refers to the first field of the record; in our case there is only one field per record. The action is placed between curly brackets.
$ awk 'length($1) > 5' words.txt
storeroom
existence
ministerial
falcon
bookworm
bookcase
As we have specified earlier, the action can be omitted. In such a case a default action is performed — printing of the whole record.
$ awk 'length($1) == 3' words.txt
cup
sky
top
war
We print all words that have three characters.
$ awk '!(length($1) == 3)' words.txt
storeroom
tree
store
book
cloud
existence
ministerial
falcon
town
bookworm
bookcase
With the ! operator, we can negate the condition; we print all
lines that do not have three characters.
$ awk '(length($1) == 3) || (length($1) == 4)' words.txt
tree
cup
book
town
sky
top
war
We combined two conditions with the || operator.
$ awk 'length($1) > 0 {print $1, "has", length($1), "chars"}' words.txt
storeroom has 9 chars
tree has 4 chars
cup has 3 chars
store has 5 chars
book has 4 chars
cloud has 5 chars
existence has 9 chars
ministerial has 11 chars
falcon has 6 chars
town has 4 chars
sky has 3 chars
top has 3 chars
bookworm has 8 chars
bookcase has 8 chars
war has 3 chars
This AWK command prints the length of each of the words. If we separate
print arguments with a comma, AWK adds a space character.
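The space that the comma produces is the value of the output field separator, the OFS variable, which we can change; a quick sketch:

```shell
# the comma in print emits OFS, which is a single space by default
echo "falcon sky" | awk 'BEGIN {OFS=" | "} {print $1, $2}'
# prints: falcon | sky
```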
$ grep book words.txt
book
bookworm
bookcase
$ grep -n book words.txt
5:book
13:bookworm
14:bookcase
The grep command is used for locating text patterns inside files.
$ awk '/book/ {print}' words.txt
book
bookworm
bookcase
$ awk '/book/ {print NR ":" $0}' words.txt
5:book
13:bookworm
14:bookcase
These are AWK equivalents of the above grep commands. The NR variable contains the current record (line) number.
Next we apply conditions on numbers.
Peter 89
Lucia 95
Thomas 76
Marta 67
Joe 92
Alex 78
Sophia 90
Alfred 65
Kate 46
We have a file with scores of students.
$ awk '$2 >= 90 { print $0 }' scores.txt
Lucia 95
Joe 92
Sophia 90
We print all students with scores 90+.
$ awk '$2 >= 90 { print }' scores.txt
Lucia 95
Joe 92
Sophia 90
If we omit an argument for the print function, the $0
is assumed.
$ awk '$2 >= 90' scores.txt
Lucia 95
Joe 92
Sophia 90
A missing { action } means print the matching line.
$ awk '{ if ($2 >= 90) print }' scores.txt
Lucia 95
Joe 92
Sophia 90
Instead of a pattern, we can also use an if condition in the action.
$ awk '{sum += $2} END { printf("The average score is %.2f\n", sum/NR) }' scores.txt
The average score is 77.56
This command calculates the average score. In the action block, we calculate
the sum of scores. In the END block, we print the average score.
We format the output with the built-in printf function. The
%.2f is a format specifier; each specifier begins with the
% character. The .2 is the precision -- the number of
digits after the decimal point. The f expects a floating point
value. The \n is not a part of the specifier; it is a newline
character. It prints a newline after the string is shown on the terminal.
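A few more common printf specifiers, shown in a small sketch; the values are chosen only for illustration:

```shell
# %d integer, %5d padded to width 5, %-5s left-aligned string, %.3f three decimal places
awk 'BEGIN { printf "%d|%5d|%-5s|%.3f\n", 7, 42, "cup", 3.14159 }'
# prints: 7|   42|cup  |3.142
```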
AWK working with pipes
AWK can receive input and send output to other commands via the pipe.
$ echo -e "1 2 3 5\n2 2 3 8" | awk '{print $(NF)}'
5
8
In this case, AWK receives output from the echo command.
It prints the value of the last field of each line.
$ awk -F: '$7 ~ /bash/ {print $1}' /etc/passwd | wc -l
3
Here, the AWK program sends data to the wc command via the pipe. In the AWK program, we find the users whose shell is bash. Their names are passed to the wc command, which counts them. In our case, there are three users using bash.
AWK fields
AWK reads files line by line. Each line or record can be separated into fields.
The FS variable stores the field separator, which is a space by
default.
$ ls -l
total 132
drwxr-xr-x 2 jano7 jano7    512 Feb 11 16:02 data
-rw-r--r-- 1 jano7 jano7 110211 Oct 12  2019 sid.jpg
-rw-r--r-- 1 jano7 jano7      5 Jul 22 20:21 some.txt
-rw-r--r-- 1 jano7 jano7    226 Apr 23 16:56 thermopylae.txt
-rw-r--r-- 1 jano7 jano7    365 Aug  4 10:22 users.txt
-rw-r--r-- 1 jano7 jano7     24 Jul 21 21:03 words.txt
-rw-r--r-- 1 jano7 jano7     30 Jul 22 21:20 words2.txt
We have these files in the current working directory.
$ ls -l | awk '{print $6 " " $9}'
Feb data
Oct sid.jpg
Jul some.txt
Apr thermopylae.txt
Aug users.txt
Jul words.txt
Jul words2.txt
We redirect the output of the ls command to AWK. We print the sixth
and ninth columns of the output.
John Doe, gardener, London, M, 11/23/1982
Jane Doe, teacher, London, F, 10/12/1988
Peter Smith, programmer, New York, M, 9/18/2000
Joe Brown, driver, Portland, M, 1/1/1976
Jack Smith, physician, Manchester, M, 2/27/1983
Lucy Black, accountant, Birmingham, F, 5/5/1998
Martin Porto, actor, Los Angeles, M, 4/30/1967
Sofia Harris, interpreter, Budapest, F, 8/18/1993
In the users.txt file, we have a few users. The fields are now
separated with a comma character.
$ awk -F, '{print $1 " is a(n)" $2}' users.txt
John Doe is a(n) gardener
Jane Doe is a(n) teacher
Peter Smith is a(n) programmer
Joe Brown is a(n) driver
Jack Smith is a(n) physician
Lucy Black is a(n) accountant
Martin Porto is a(n) actor
Sofia Harris is a(n) interpreter
We print the first and the second column of the file. We specify the field
separator with the -F option.
$ awk 'BEGIN {FS=","} {print $3}' users.txt
London
London
New York
Portland
Manchester
Birmingham
Los Angeles
Budapest
The field separator can be also set in the program. We set the FS
variable to comma inside the BEGIN block, which is executed once at the
beginning of the program execution.
$ awk 'BEGIN {FS=","} {print $3}' users.txt | uniq
London
New York
Portland
Manchester
Birmingham
Los Angeles
Budapest
We pass the output to the uniq command to get distinct values.
$ awk -F, '$4 ~ "F" {print $1}' users.txt
Jane Doe
Lucy Black
Sofia Harris
We print all females. We use the ~ operator to match against a
pattern.
$ awk '{print "The", NR". record has", length($0), "characters"}' users.txt
The 1. record has 41 characters
The 2. record has 40 characters
The 3. record has 47 characters
The 4. record has 40 characters
The 5. record has 47 characters
The 6. record has 47 characters
The 7. record has 46 characters
The 8. record has 49 characters
The command prints the number of characters for each record. The $0
stands for the whole line.
$ awk -F, '{print $NF, $(NF-1)}' users.txt
11/23/1982 M
10/12/1988 F
9/18/2000 M
1/1/1976 M
2/27/1983 M
5/5/1998 F
4/30/1967 M
8/18/1993 F
The $NF is the last field, the $(NF-1) is the second
last field.
$ awk -F, '{ if ($4 ~ "M") {m++} else {f++} } END {printf "users: %d\nmales: %d\nfemales: %d\n", m+f, m, f}' users.txt
users: 8
males: 5
females: 3
The command prints the number of users, males, and females. When the command becomes too complex, it is better to put it inside a file.
{
if ($4 ~ "M") {
m++
} else {
f++
}
}
END {
printf "users: %d\nmales: %d\nfemales: %d\n", m+f, m, f
}
The first block delimited by {} is executed for each line of the
file. We count all records that have and don't have M in the 4th field. The
numbers are stored in the m and f variables. The
END block is executed once at the end of the program. There we
print the number of users, males, and females. The printf function
allows us to create formatted strings.
$ awk -F, -f males_females.awk users.txt
users: 8
males: 5
females: 3
AWK reads the program from the file given after the -f option.
AWK regular expressions
Regular expressions are often applied on AWK fields. The ~ is the
regular expression match operator. It checks if a string matches the provided
regular expression.
$ awk '$1 ~ /^[b,c]/ {print $1}' words.txt
cup
book
cloud
bookworm
bookcase
In this command we print all the words that begin with the letter b or c. The regular expression is placed between two slash characters.
$ awk '$1 ~ /[e,n]$/ {print $1}' words.txt
tree
store
existence
falcon
town
bookcase
This command prints all words that end with e or n.
$ awk '$1 ~ /\<...\>/ {print $1}' words.txt
cup
sky
top
war
The command prints all words that have three characters. The dot (.) stands for any character and the \< \> characters are word boundaries.
$ awk '$1 ~ /\<...\>/ || $1 ~ /\<....\>/ {print $1}' words.txt
tree
cup
book
town
sky
top
war
We combine two conditions with the or (||) operator. The AWK command prints all words that have either three or four characters.
$ awk '$1 ~ /store|room|book/' words.txt
storeroom
store
book
bookworm
bookcase
With the alternation operator (|), we print fields that contain either of the specified words.
$ awk '$1 ~ /^book(worm|case)?$/' words.txt
book
bookworm
bookcase
Applying a subpattern with (), we print fields that match the words book, bookworm, or bookcase. The ? means that the subpattern may or may not be present.
The match is a built-in string manipulation function. It tests if
the given string contains a regular expression pattern. The first parameter is
the string, the second is the regex pattern. It is similar to the ~
operator.
$ awk 'match($0, /^[c,b]/)' words.txt
brown
craftsmanship
book
beautiful
computer
The program prints those lines that begin with c or b. The regular expression is placed between two slash characters.
The match function sets the RSTART variable;
it is the index of the start of the matching pattern.
$ awk 'match($0, /i/) {print $0 " has i character at " RSTART}' words.txt
craftsmanship has i character at 12
beautiful has i character at 6
existence has i character at 3
ministerial has i character at 2
The program prints those words that contain the i character. In addition, it prints the first occurrence of the character.
AWK built-in variables
AWK provides important built-in variables.
| Variable name | Description |
|---|---|
| FS | field separator (default space) |
| NF | number of fields in the current record |
| NR | current record/line number |
| $0 | whole line |
| $n | n-th field |
| FNR | current record number in the current file |
| RS | input record separator (default newline) |
| OFS | output field separator (default blank) |
| ORS | output record separator (default newline) |
| OFMT | output format for numbers (default %.6g) |
| SUBSEP | separates multiple subscripts (default \034) |
| ARGC | argument count |
| ARGV | array of arguments |
| FILENAME | the name of the current input file |
| RSTART | index of the start of the matching pattern |
| RLENGTH | the length of the string matched by the match function |
| CONVFMT | conversion format used when converting numbers (default %.6g) |
The table lists common AWK variables.
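For instance, the OFS and ORS variables shape the output; a quick sketch:

```shell
# OFS joins the printed fields; ORS terminates the record
echo "sky top war" | awk 'BEGIN {OFS=":"; ORS=";"} {print $1, $2, $3}'
# prints: sky:top:war;
```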
$ awk 'NR % 2 == 0 {print}' words.txt
tree
store
cloud
ministerial
town
top
bookcase
The above program prints every second record of the words.txt file. Taking NR modulo 2 selects the even-numbered lines.
Say we want to print the line numbers of the file.
$ awk '{print NR, $0}' words.txt
1 storeroom
2 tree
3 cup
4 store
5 book
6 cloud
7 existence
8 ministerial
9 falcon
10 town
11 sky
12 top
13 bookworm
14 bookcase
15 war
Again, we use the NR variable. We omit the pattern; therefore, the action is performed on each line. The $0 variable refers to the whole record.
$ echo -e "cup\nbill\ncoin" > words1.txt
$ echo -e "cloud\nbreath\nrank" > words2.txt
We create two text files with some words.
$ awk '{ print $1, "is at line", FNR, "in", FILENAME }' words1.txt words2.txt
cup is at line 1 in words1.txt
bill is at line 2 in words1.txt
coin is at line 3 in words1.txt
cloud is at line 1 in words2.txt
breath is at line 2 in words2.txt
rank is at line 3 in words2.txt
We print where each word is located; we include the line number and the
filename. The difference between the NR and FNR
variables is that the former counts lines in all files while the latter counts
lines always in the current file.
For the following example, we have this C source file.
 1 #include <stdio.h>
 2
 3 int main(void) {
 4
 5     char *countries[5] = { "Germany", "Slovakia", "Poland",
 6         "China", "Hungary" };
 7
 8     size_t len = sizeof(countries) / sizeof(*countries);
 9
10     for (size_t i=0; i < len; i++) {
11
12         printf("%s\n", countries[i]);
13     }
14 }
We have a source file with line numbers. Our task is to remove the numbers from the text.
$ awk '{print substr($0, 4)}' source.c
#include <stdio.h>

int main(void) {

    char *countries[5] = { "Germany", "Slovakia", "Poland",
        "China", "Hungary" };

    size_t len = sizeof(countries) / sizeof(*countries);

    for (size_t i=0; i < len; i++) {

        printf("%s\n", countries[i]);
    }
}
We use the substr function. It prints a substring from the given
string. We apply the function on each line, skipping the first three characters.
In other words, we print each record from the fourth character till its end.
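The behaviour of substr can be sketched on a simple string:

```shell
# with two arguments, substr takes from the position to the end;
# the optional third argument limits the length
awk 'BEGIN { print substr("bookcase", 5); print substr("bookcase", 1, 4) }'
# prints: case
# prints: book
```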
$ awk '{print substr($0, 4) >> "source2.c"}' source.c
We redirect the output to a new file.
The NF is the number of fields in the current record.
2 3 1 34 21 12
43 21 11 2 11 33 12
43 72 91 90 32 14
34 87 22 12 75 2 42 13
75 23 1 42 41 94 4 32
2 1 6 2 1 3 1 4
53 13 52 84 14 14 63
3 2 5 76 31 45
We have a file of values.
$ awk 'NF == 6' values.txt
2 3 1 34 21 12
43 72 91 90 32 14
3 2 5 76 31 45
We print records that have six fields.
$ awk '{print "line", NR, "has", NF, "values"}' values.txt
line 1 has 6 values
line 2 has 7 values
line 3 has 6 values
line 4 has 8 values
line 5 has 8 values
line 6 has 8 values
line 7 has 7 values
line 8 has 6 values
This command prints the number of values for each line.
{
for (i = 1; i<=NF; i++) {
sum += $i
}
print "line", NR, "sum:", sum
sum = 0
}
The program calculates the sum of values for each line.
for (i = 1; i<=NF; i++) {
sum += $i
}
This is a classic for loop. We go through each of the fields in the record and
add the value to the sum variable. The += is a
compound addition operator.
$ awk -f calc_sum.awk values.txt
line 1 sum: 73
line 2 sum: 133
line 3 sum: 342
line 4 sum: 287
line 5 sum: 312
line 6 sum: 20
line 7 sum: 293
line 8 sum: 162
The following is an alternative solution using the split function.
{
split($0, vals)
for (idx in vals) {
sum += vals[idx]
}
print "line", NR, "sum:", sum
sum = 0
}
The program calculates the sum of values for each line.
split($0, vals)
The split function splits the given string into an array; the
default separator for elements of the record is FS.
for (idx in vals) {
sum += vals[idx]
}
We go through the array and calculate the sum. In each of the loop cycles, the
idx variable is set to the current index of the array.
$ awk -f calc_sum2.awk values.txt
line 1 sum: 73
line 2 sum: 133
line 3 sum: 342
line 4 sum: 287
line 5 sum: 312
line 6 sum: 20
line 7 sum: 293
line 8 sum: 162
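The separator can also be passed to split explicitly as a third argument; the function returns the number of elements created. A small sketch:

```shell
# split on a colon instead of the default FS
awk 'BEGIN { n = split("sky:top:war", a, ":"); print n, a[2] }'
# prints: 3 top
```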
BEGIN and END blocks
BEGIN and END are blocks that are executed before and
after all records have been read. These two keywords are followed by curly
brackets where we specify statements to be executed.
$ awk 'BEGIN { print "Unix time: ", systime()}'
Unix time: 1628156179
The BEGIN block is executed before the first input line is processed. We print the Unix time, utilizing the systime function, which is a GAWK extension.
$ awk 'BEGIN { print "Today is", strftime("%Y-%m-%d") }'
Today is 2021-08-05
The program prints the current date. The strftime is a GAWK
extension.
$ echo "1,2,3,4,5" | awk '{ split($0,a,",");for (idx in a) {sum+=a[idx]} } END {print sum}'
15
The program splits the line into an array of numbers with the split
function. We go over the array elements and calculate their sum. In the
END block, we print the sum.
The Battle of Thermopylae was fought between an alliance of Greek city-states,
led by King Leonidas of Sparta, and the Persian Empire of Xerxes I over the
course of three days, during the second Persian invasion of Greece.
We want to count the number of lines, words, and characters in the file.
$ wc thermopylae.txt
4 38 226 thermopylae.txt
To count the number of lines, words, and characters in a file, we have the
wc command.
{
words += NF
chars += length + 1 # include newline character
}
END { print NR, words, chars }
The first part of the program is executed for each line of the file. The
END block is run at the end of the program.
$ awk -f count_words.awk thermopylae.txt
4 38 226
$ cat words.txt
brown
tree
craftsmanship
book
beautiful
existence
ministerial
computer
town
$ cat words2.txt
pleasant
curly
storm
hering
immune
We want to know the total number of lines in these two files.
$ awk 'END {print NR}' words.txt words2.txt
14
We pass two files to the AWK program. AWK sequentially processes the
file names received on the command line.
The block following the END keyword is executed
at the end of the program; we print the NR variable
which holds the line number of the last processed line.
$ awk 'BEGIN {srand()} {lines[NR] = $0} END { r=int(rand()*NR + 1); print lines[r]}' words.txt
tree
The above program prints a random line from the words.txt file.
The srand function seeds the random number generator.
The function has to be executed only once. In the main part of the program,
we store the current record into the lines array.
In the end, we compute a random number between 1 and NR
and print the randomly chosen line from the array structure.
AWK playing with words dictionary
In the following examples, we create a couple of AWK programs that work with an English dictionary. On Unix systems, a dictionary is located in the /usr/share/dict/words file.
$ awk 'length($1) == 10 { n++ } END {print n}' /usr/share/dict/words
30882
This command prints the number of words in the given dictionary that have ten
characters. In the action block, we increase the n variable for
each match. In the END block, we print the final number.
$ awk '$1 ~ /^w/ && length($1) == 4 { n++; if (n<15) {print} else {exit} }' /usr/share/dict/words
waag
waar
wabe
wace
wack
wade
wadi
waeg
waer
waff
waft
wage
waif
waik
This command prints the first fourteen words that begin with 'w' and have four letters. The exit statement terminates the AWK program.
$ awk '$1 ~ /^w.*r$/ { n++; if (n<15) {print} } END {print n}' /usr/share/dict/words
waar
wabber
wabster
wacker
wadder
waddler
wader
wadmaker
wadsetter
waer
wafer
waferer
wafermaker
wafter
417
The command prints the first fourteen words that begin with 'w' and end in 'r'. In the end, it prints the total number of such words in the file.
A palindrome is a word, number, phrase, or other sequence of characters which reads the same backward as forward, such as madam or racecar.
{
for (i=length($0); i!=0; i--) {
r = r substr($0, i, 1)
}
if (length($0) > 1 && $0 == r) {
print
n++
}
r = ""
}
END {
printf "There are %d palindromes\n", n
}
The program finds all palindromes. The algorithm is simple: the original word must be equal to its reversed form.
for (i=length($0); i!=0; i--) {
r = r substr($0, i, 1)
}
Using a for loop, we reverse the given string. The substr function returns a substring; the first parameter is the string, the second is the beginning position, and the last is the length of the substring. To concatenate strings in AWK, we simply separate them by a space character.
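Concatenation by juxtaposition can be sketched briefly:

```shell
# adjacent string expressions are concatenated
awk 'BEGIN { a = "book"; b = "case"; print a b }'
# prints: bookcase
```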
if (length($0) > 1 && $0 == r) {
print
n++
}
The length of the word must be greater than 1; we don't count single letters
as palindromes. If the reversed word is equal to the original word, we print it
and increase the n variable.
r = ""
We reset the r variable.
END {
printf "There are %d palindromes\n", n
}
In the end, we print the number of palindromes in the file.
{
if (length($0) == 1) {next}
rev = reverse($0)
if ($0 == rev) {
print
n++
}
}
END {
printf "There are %d palindromes\n", n
}
function reverse(word) {
r = ""
for (i=length(word); i!=0; i--) {
r = r substr(word, i, 1)
}
return r
}
To improve the readability of the program, we create a custom
reverse function.
AWK ARGC and ARGV variables
Next, we work with ARGC and ARGV variables.
$ awk 'BEGIN { print ARGC, ARGV[0], ARGV[1]}' words.txt
2 awk words.txt
The program prints the number of arguments of the AWK program and the first
two arguments. ARGC is the number of command line arguments;
in our case there are two arguments including the AWK itself.
ARGV is an array of command line arguments. The array is indexed
from 0 to ARGC - 1.
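We can iterate the whole ARGV array; the exit statement in BEGIN keeps AWK from trying to read the file operands, so the file names here are only illustrative:

```shell
# ARGV[0] is always "awk"; the file operands follow
awk 'BEGIN { for (i = 0; i < ARGC; i++) print i, ARGV[i]; exit }' first.txt second.txt
# prints: 0 awk
# prints: 1 first.txt
# prints: 2 second.txt
```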
FS is an input field separator, a space by default. NF
is the number of fields in the current input record.
For the following program, we use this file:
$ cat values
2, 53, 4, 16, 4, 23, 2, 7, 88
4, 5, 16, 42, 3, 7, 8, 39, 21
23, 43, 67, 12, 11, 33, 3, 6
We have three lines of comma-separated values.
BEGIN {
    FS=","
    max = 0
    min = 10**10
    sum = 0
    cnt = 0
    avg = 0
}

{
    for (i=1; i<=NF; i++) {
        sum += $i
        cnt++
        if (max < $i) {
            max = $i
        }
        if (min > $i) {
            min = $i
        }
        printf("%d ", $i)
    }
}

END {
    avg = sum / cnt
    printf("\n")
    printf("Min: %d, Max: %d, Sum: %d, Average: %.2f\n", min, max, sum, avg)
}

The program counts the basic statistics from the provided values.

FS=","

The values in the file are separated by the comma character; therefore, we set the FS variable to the comma character.

max = 0
min = 10**10
sum = 0
cnt = 0
avg = 0

We define default values for the maximum, minimum, sum, value count, and average. AWK variables are dynamic; their values are either floating-point numbers or strings, or both, depending upon how they are used.

{
    for (i=1; i<=NF; i++) {
        sum += $i
        cnt++
        if (max < $i) {
            max = $i
        }
        if (min > $i) {
            min = $i
        }
        printf("%d ", $i)
    }
}

In the main part of the script, we go through each line and calculate the maximum, minimum, and sum of the values; the cnt variable counts all processed values. The NF variable is used to determine the number of values per line.

END {
    avg = sum / cnt
    printf("\n")
    printf("Min: %d, Max: %d, Sum: %d, Average: %.2f\n", min, max, sum, avg)
}

In the end part of the script, we calculate the average and print the calculations to the console. We divide the sum by cnt, the total number of values; NF would not work here, since in the END block it only holds the field count of the last record.

$ awk -f stats.awk values
2 53 4 16 4 23 2 7 88 4 5 16 42 3 7 8 39 21 23 43 67 12 11 33 3 6
Min: 2, Max: 88, Sum: 542, Average: 20.85
The FS variable can be specified as a command line option with
the -F flag.
$ awk -F: '{print $1, $7}' /etc/passwd | head -7
root /bin/bash
daemon /usr/sbin/nologin
bin /usr/sbin/nologin
sys /usr/sbin/nologin
sync /bin/sync
games /usr/sbin/nologin
man /usr/sbin/nologin
The example prints the first (the user name) and the seventh field (user's shell)
from the system /etc/passwd file. The head command is used
to print only the first seven lines. The data in the /etc/passwd file
is separated by a colon. So the colon is given to the -F option.
The RS is the input record separator, by default a newline.
$ echo "Jane 17#Tom 23#Mark 34" | awk 'BEGIN {RS="#"} {print $1, "is", $2, "years old"}'
Jane is 17 years old
Tom is 23 years old
Mark is 34 years old
In the example, the records are separated by the # character. The RS variable is used to split the input into records. AWK can receive input from other commands like echo.
AWK GET request
AWK can make HTTP requests. We use the getline command and the special /inet/tcp/0/ file; networking via these special files is a GAWK extension.
BEGIN {
site = "webcode.me"
server = "/inet/tcp/0/" site "/80"
print "GET / HTTP/1.0" |& server
print "Host: " site |& server
print "\r\n\r\n" |& server
while ((server |& getline line) > 0 ) {
content = content line "\n"
}
close(server)
print content
}
The program makes a GET request to the webcode.me page and reads its response.
print "GET / HTTP/1.0" |& server
The |& operator starts a coprocess, which allows a two-way
communication.
while ((server |& getline line) > 0 ) {
content = content line "\n"
}
With the getline function, we read the response from the server.
Passing variables to AWK
AWK has the -v option which is used to assign values to variables.
$ awk -v today=$(date +%Y-%m-%d) 'BEGIN { print "Today is", today }'
Today is 2021-08-05
We pass the output of the date command to the today
variable, which can be then accessed in the AWK program.
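Any value can be assigned this way; a quick sketch with a made-up limit variable:

```shell
# -v assigns the value before the program starts, so it is visible in BEGIN
awk -v limit=5 'BEGIN { print limit * 2 }'
# prints: 10
```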
The Battle of Thermopylae was fought between an alliance of Greek city-states,
led by King Leonidas of Sparta, and the Persian Empire of Xerxes I over the
course of three days, during the second Persian invasion of Greece.
{
for (i=1; i<=NF; i++) {
field = $i
if (field ~ word) {
c = index($0, field)
print NR "," c, $0
next
}
}
}
The example simulates the grep utility. It finds the provided word and prints its line number and starting index. (The program finds only the first occurrence of the word on each line.) The word variable is passed to the program using the -v option.
$ awk -v word=the -f mygrep.awk thermopylae.txt
2,37 led by King Leonidas of Sparta, and the Persian Empire of Xerxes I over the
3,30 course of three days, during the second Persian invasion of Greece.
We looked for the word "the" in the thermopylae.txt file.
Word frequency
Next, we calculate the word frequency of the Bible.
$ wget https://raw.githubusercontent.com/janbodnar/data/main/the-king-james-bible.txt
We download the King James Bible.
$ file the-king-james-bible.txt
the-king-james-bible.txt: UTF-8 Unicode (with BOM) text

When we examine the text, we can see that it is UTF-8 Unicode text with a byte order mark (BOM). The BOM must be taken into account in the AWK program.
{
if (NR == 1) {
sub(/^\xef\xbb\xbf/,"")
}
gsub(/[,;!()*:?.]*/, "")
for (i = 1; i <= NF; i++) {
if ($i ~ /^[0-9]/) {
continue
}
w = $i
words[w]++
}
}
END {
for (idx in words) {
print idx, words[idx]
}
}
We calculate how many times a word is present in the book.
if (NR == 1) {
sub(/^\xef\xbb\xbf/,"")
}
From the first line, we remove the BOM character. If we did not remove the BOM, the very first word (The in our case) would contain it and would thus be counted as a distinct word.
gsub(/[,;!()*:?.]*/, "")
From the current record, we remove punctuation characters such as comma and colon. Otherwise, text such as the and the, would be considered as two distinct words.
The gsub function globally replaces the given regular expression
with the specified string; since the string is empty, it means that they are
deleted. If the string where the substitutions should take place is not
specified, the $0 is assumed.
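The return value of gsub is the number of substitutions made, which can be sketched as follows:

```shell
# gsub modifies $0 in place and returns how many replacements it made
echo "the end, the start." | awk '{ n = gsub(/[,.]/, ""); print n, $0 }'
# prints: 2 the end the start
```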
for (i = 1; i <= NF; i++) {
if ($i ~ /^[0-9]/) {
continue
}
w = $i
words[w]++
}
In the for loop, we go over the fields of the current line. The Bible text is preceded with verse numbers; these we do not want to include. So if the first field begins with a digit, we skip the current cycle with the continue statement. The words is an associative array; each index is a word from the text and the corresponding value is its frequency. Each time a word is encountered, its value is incremented.
END {
for (idx in words) {
print idx, words[idx]
}
}
In the end, we go through the words and print their indexes (words) and values (frequencies).
$ awk -f word_freq.awk the-king-james-bible.txt > bible_words.txt
We run the program and redirect the output to the file.
$ sort -nr -k 2 bible_words.txt | head
the 62103
and 38848
of 34478
to 13400
And 12846
that 12576
in 12331
shall 9760
he 9665
unto 8942
We sort the data and print the first ten most frequent words.
PROCINFO is a special, built-in array which can influence the AWK
program. For instance, it can determine the way the array is traversed. It is
a GAWK extension.
{
if (NR == 1) {
sub(/^\xef\xbb\xbf/,"")
}
gsub(/[,;!()*:?.]*/, "")
for (i = 1; i <= NF; i++) {
if ($i ~ /^[0-9]/) {
continue
}
w = $i
words[w]++
}
}
END {
PROCINFO["sorted_in"] = "@val_num_desc"
for (idx in words) {
print idx, words[idx]
}
}
With PROCINFO["sorted_in"] = "@val_num_desc", we traverse the array
by comparing the values in descending order.
$ awk -f freq_top.awk the-king-james-bible.txt | head
the 62103
and 38848
of 34478
to 13400
And 12846
that 12576
in 12331
shall 9760
he 9665
unto 8942
Spell checking
We create an AWK program for spell checking.
BEGIN {
count = 0
i = 0
while (getline myword <"/usr/share/dict/words") {
dict[i] = myword
i++
}
}
{
for (i=1; i<=NF; i++) {
field = $i
if (match(field, /[[:punct:]]$/)) {
field = substr(field, 1, RSTART-1)
}
mywords[count] = field
count++
}
}
END {
for (w_i in mywords) {
for (w_j in dict) {
if (mywords[w_i] == dict[w_j] ||
tolower(mywords[w_i]) == dict[w_j]) {
delete mywords[w_i]
}
}
}
for (w_i in mywords) {
if (mywords[w_i] != "") {
print mywords[w_i]
}
}
}
The script compares the words of the provided text file against
a dictionary. Under the standard /usr/share/dict/words
path we can find an English dictionary; each word is on
a separate line.
BEGIN {
count = 0
i = 0
while (getline myword <"/usr/share/dict/words") {
dict[i] = myword
i++
}
}
Inside the BEGIN block, we read the words from the dictionary into the dict array. The getline var < file form reads a record from the given file; here the record is stored in the myword variable.
{
for (i=1; i<=NF; i++) {
field = $i
if (match(field, /[[:punct:]]$/)) {
field = substr(field, 1, RSTART-1)
}
mywords[count] = field
count++
}
}
In the main part of the program, we place the words of
the file that we are spell checking into the mywords
array. We remove any punctuation marks (like commas or dots) from
the endings of the words.
END {
for (w_i in mywords) {
for (w_j in dict) {
if (mywords[w_i] == dict[w_j] ||
tolower(mywords[w_i]) == dict[w_j]) {
delete mywords[w_i]
}
}
}
...
}
We compare the words from the mywords array against the dictionary
array. If the word is in the dictionary, it is removed with the
delete command. Words that begin a sentence start with an uppercase
letter; therefore, we also check for a lowercase alternative utilizing the
tolower function.
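The tolower function, along with its counterpart toupper, can be sketched quickly:

```shell
# case-conversion functions leave non-letters untouched
awk 'BEGIN { print tolower("Falcon"), toupper("sky") }'
# prints: falcon SKY
```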
for (w_i in mywords) {
if (mywords[w_i] != "") {
print mywords[w_i]
}
}
Remaining words have not been found in the dictionary; they are printed to the console.
$ awk -f spellcheck.awk text
consciosness
finaly
We have run the program on a text file; we have found two misspelled words. Note that the program takes some time to finish.
Rock-paper-scissors
Rock-paper-scissors is a popular hand game in which each player simultaneously forms one of three shapes with an outstretched hand. We create this game in AWK.
# This program creates a rock-paper-scissors game.
BEGIN {
srand()
opts[1] = "rock"
opts[2] = "paper"
opts[3] = "scissors"
do {
print "1 - rock"
print "2 - paper"
print "3 - scissors"
print "9 - end game"
ret = getline < "-"
if (ret == 0 || ret == -1) {
exit
}
val = $0
if (val == 9) {
exit
} else if (val != 1 && val != 2 && val != 3) {
print "Invalid option"
continue
} else {
play_game(val)
}
} while (1)
}
function play_game(val) {
r = int(rand()*3) + 1
print "I have " opts[r] " you have " opts[val]
if (val == r) {
print "Tie, next throw"
return
}
if (val == 1 && r == 2) {
print "Paper covers rock, you lose"
} else if (val == 2 && r == 1) {
print "Paper covers rock, you win"
} else if (val == 2 && r == 3) {
print "Scissors cut paper, you lose"
} else if (val == 3 && r == 2) {
print "Scissors cut paper, you win"
} else if (val == 3 && r == 1) {
print "Rock blunts scissors, you lose"
} else if (val == 1 && r == 3) {
print "Rock blunts scissors, you win"
}
}
We play the game against the computer, which chooses its options randomly.
srand()
We seed the random number generator with the srand function.
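Called without an argument, srand seeds the generator from the current time, so every run differs. Passing a fixed seed makes the sequence reproducible, which is handy for testing; a small sketch:

```shell
awk 'BEGIN {
    srand(42)      # a fixed seed makes the sequence reproducible
    a = rand()
    srand(42)      # reseeding with the same value restarts the sequence
    b = rand()
    if (a == b)
        print "same sequence"
    else
        print "different sequence"
}'
```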
opts[1] = "rock"
opts[2] = "paper"
opts[3] = "scissors"
The three options are stored in the opts array.
do {
print "1 - rock"
print "2 - paper"
print "3 - scissors"
print "9 - end game"
...
The cycle of the game is controlled by the do-while loop.
First, the options are printed to the terminal.
ret = getline < "-"
if (ret == 0 || ret == -1) {
exit
}
val = $0
Our choice is read from standard input with the getline command;
the record lands in $0 and is copied into the val variable.
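The special file name "-" stands for standard input, at least in gawk and mawk; portability to other awks is an assumption worth checking. When getline is used without a variable, the record lands in $0, exactly as in the game loop:

```shell
echo 7 | awk 'BEGIN {
    # "-" names standard input; the record is stored in $0
    ret = getline < "-"
    if (ret > 0) {
        print "you typed: " $0
    }
}'
```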
if (val == 9) {
exit
} else if (val != 1 && val != 2 && val != 3) {
print "Invalid option"
continue
} else {
play_game(val)
}
We exit the program if we choose option 9. If the value is outside the printed
menu options, we print an error message and start a new loop with the
continue command. If we have chosen one of the three options
correctly, we call the play_game function.
r = int(rand()*3) + 1
A random value from 1..3 is chosen with the rand function. This is
the choice of the computer.
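rand returns a floating-point value r with 0 <= r < 1, so int(rand()*3) yields 0, 1, or 2, and adding 1 shifts the result into 1..3. The scaling can be checked quickly:

```shell
awk 'BEGIN {
    srand(7)   # any fixed seed; the exact values do not matter here
    ok = 1
    for (i = 0; i < 1000; i++) {
        r = int(rand() * 3) + 1
        if (r < 1 || r > 3) ok = 0
    }
    if (ok)
        print "all 1000 draws fall in 1..3"
    else
        print "out of range"
}'
```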
if (val == r) {
print "Tie, next throw"
return
}
In case both players choose the same option, there is a tie. We return from the function and a new loop is started.
if (val == 1 && r == 2) {
print "Paper covers rock, you lose"
} else if (val == 2 && r == 1) {
...
We compare the chosen values of the players and print the result to the console.
$ awk -f rock_scissors_paper.awk
1 - rock
2 - paper
3 - scissors
9 - end game
1
I have scissors you have rock
Rock blunts scissors, you win
1 - rock
2 - paper
3 - scissors
9 - end game
3
I have paper you have scissors
Scissors cut paper, you win
1 - rock
2 - paper
3 - scissors
9 - end game
A sample run of the game.
Marking keywords
In the following example, we mark Java keywords in a source file.
# the program adds tags around Java keywords
# it works on keywords that are separate words
BEGIN {
# load java keywords
i = 0
while ((getline kwd <"javakeywords2") > 0) {
keywords[i] = kwd
i++
}
}
{
mtch = 0
ln = ""
space = ""
# calculate the beginning space
if (match($0, /[^[:space:]]/)) {
if (RSTART > 1) {
space = sprintf("%*s", RSTART-1, "")
}
}
# add the space to the line
ln = ln space
for (i=1; i <= NF; i++) {
field = $i
# go through keywords
for (w_i in keywords) {
kwd = keywords[w_i]
# check if a field is a keyword
if (field == kwd) {
mtch = 1
}
}
# add tags to the line
if (mtch == 1) {
ln = ln "<kwd>" field "</kwd> "
} else {
ln = ln field " "
}
mtch = 0
}
print ln
}
The program adds <kwd> and </kwd> tags around each keyword that it recognizes. This is a basic example: it works only on keywords that stand as separate words and does not handle more complicated structures.
# load java keywords
i = 0
while ((getline kwd <"javakeywords2") > 0) {
keywords[i] = kwd
i++
}
We load Java keywords from a file; each keyword is on a separate line.
The keywords are stored in the keywords array.
# calculate the beginning space
if (match($0, /[^[:space:]]/)) {
if (RSTART > 1) {
space = sprintf("%*s", RSTART-1, "")
}
}
Using a regular expression, we find the first non-whitespace
character of the line. The space variable is then a string of
spaces whose width equals the line's indentation; it is computed
in order to preserve the indentation of the program.
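The sprintf("%*s", n, "") idiom pads an empty string to a given field width, producing a run of n spaces. The dynamic * width is supported by gawk (and mawk), though not guaranteed by every awk; a quick sketch:

```shell
awk 'BEGIN {
    s = sprintf("%*s", 4, "")   # an empty string padded to width 4
    print length(s)
}'
```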
# add the space to the line
ln = ln space
The space is appended to the ln variable. In AWK, strings are
concatenated simply by writing them next to each other; there is
no explicit concatenation operator.
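Concatenation by juxtaposition can be demonstrated in a couple of lines:

```shell
awk 'BEGIN {
    a = "book"
    b = "worm"
    c = a b        # adjacent values are concatenated
    print c
}'
```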
for (i=1; i <= NF; i++) {
field = $i
...
}
We go through the fields of the current line; the field in question
is stored in the field variable.
# go through keywords
for (w_i in keywords) {
kwd = keywords[w_i]
# check if a field is a keyword
if (field == kwd) {
mtch = 1
}
}
In a for loop, we go through the Java keywords and check if a field is a Java keyword.
# add tags to the line
if (mtch == 1) {
ln = ln "<kwd>" field "</kwd> "
} else {
ln = ln field " "
}
If there is a keyword, we attach the tags around the keyword; otherwise we just append the field to the line.
print ln
The constructed line is printed to the console.
$ awk -f markkeywords2.awk program.java
<kwd>package</kwd> com.zetcode;
<kwd>class</kwd> Test {
<kwd>int</kwd> x = 1;
<kwd>public</kwd> <kwd>void</kwd> exec1() {
System.out.println(this.x);
System.out.println(x);
}
<kwd>public</kwd> <kwd>void</kwd> exec2() {
<kwd>int</kwd> z = 5;
System.out.println(x);
System.out.println(z);
}
}
<kwd>public</kwd> <kwd>class</kwd> MethodScope {
<kwd>public</kwd> <kwd>static</kwd> <kwd>void</kwd> main(String[] args) {
Test ts = <kwd>new</kwd> Test();
ts.exec1();
ts.exec2();
}
}
A sample run on a small Java program.
This was the AWK tutorial.