Awk is a useful language for processing text files whose contents are separated by some kind of delimiter. By default it treats each newline-terminated line as a record and splits it into fields on whitespace. Gawk is the GNU implementation of the awk language and includes extra features. If you have to parse or format a large text dataset, gawk is probably one of the most efficient ways to do it. Awk/gawk scripts can be very compact and are often used inline in bash commands to process or format output on the fly. This post refers to gawk, but most of these commands are basic enough to be interchangeable with any flavor of awk, like mawk or tawk.
The basic syntax of a gawk program is a condition followed by the actions to take when that condition is true, for example reformatting the line.
condition { actions }
Either the action or the condition can be left off. If you don't specify an action, gawk prints the lines ($0) where the condition is true, and if the condition is left off, the action is executed on each line.
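As a small sketch of both shorthand forms, the commands below run three made-up lines through a condition-only program and an action-only program. Plain awk is used so the snippet runs with whichever flavor is installed; gawk can be substituted. The sample input and the variable names are placeholders.

```shell
# Hypothetical sample input: three lines of "name count".
input='apple 3
banana 12
cherry 7'

# Condition only: lines whose second field is greater than 5 are printed as-is.
condition_only=$(printf '%s\n' "$input" | awk '$2 > 5')

# Action only: the action runs on every line, here printing just the first field.
action_only=$(printf '%s\n' "$input" | awk '{ print $1 }')

printf '%s\n' "$condition_only"
printf '%s\n' "$action_only"
```

The first command prints the banana and cherry lines unchanged, and the second prints only the three names.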
gawk '!exists[$0]++' fileName
When gawk reads the file, the whole line is available in the $0 variable. The expression uses the line as a key in the exists array. The first time a line is seen its count is still zero, so !exists[$0] is true, and the ++ then increments the count. For each additional entry (duplicate) the count is already nonzero, so the condition is false. Since the default action is to print every line that matches the condition, this trims the duplicates out of a file by printing each line once and skipping the repeats.
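To see the idiom at work, here is a minimal sketch on invented input, next to a longer spelling of the same logic. It uses plain awk, which gawk can replace; the sample colors and variable names are only for illustration.

```shell
# Hypothetical sample: five lines, two of them repeated.
sample='red
blue
red
green
blue'

# The compact idiom: print a line only the first time it is seen.
deduped=$(printf '%s\n' "$sample" | awk '!exists[$0]++')

# The same logic written out: print when the count is still zero, then count the line.
spelled_out=$(printf '%s\n' "$sample" | awk '{ if (exists[$0] == 0) print $0; exists[$0]++ }')

printf '%s\n' "$deduped"
```

Both commands keep the first occurrence of each line and preserve the original order, which a plain sort | uniq would not.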
Gawk splits each input line into fields, which makes it simple to format tabular data such as the output of some bash commands.
The command ps aux outputs a table of all the running processes on my machine. In Ubuntu, running ps aux | head -1 outputs the table headers.
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
When awk processes each line, $0 will be the whole line, the USER column will be available as $1, the PID as $2 and COMMAND as $11. Because fields are split on whitespace, a command with arguments continues into $12 and beyond.
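The field numbering can be checked on a single row without running ps at all. The row below is invented for illustration, and the snippet uses plain awk (gawk behaves the same here).

```shell
# Hypothetical ps aux row; every value is made up.
row='alice 4242 0.0 0.1 12345 6789 pts/0 S 10:15 0:00 /bin/bash myRunningScript.sh'

# USER, PID and the first word of COMMAND.
fields=$(printf '%s\n' "$row" | awk '{ print $1, $2, $11 }')

# NF holds the number of fields; the script name is a 12th field here.
field_count=$(printf '%s\n' "$row" | awk '{ print NF }')

printf '%s\n' "$fields"
printf '%s\n' "$field_count"
```

Here $11 is only /bin/bash, while the script name that follows it lands in $12.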
ps aux | gawk '/myRunningScript.sh/ { print $2 }'
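This yields the PIDs of the processes matching the pattern, which can be useful to find and kill a process. It only works well when the search term is really unique, like a custom script name. Keep in mind that the gawk command line contains the search term too, so its own PID can show up in the list.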
kill $(ps aux | gawk '/myRunningScript.sh/ {print $2}')
I wouldn't recommend killing processes automatically if the search term is short or ambiguous. Make sure you know what's getting shut down.
When debugging requests, it's very helpful to be able to pick out the log lines that come from your own IP address and see some basic information from the server.
tail -f /apache/log/access.log | gawk '/IP_ADDRESS/ { print $3,$4,$5,$9 }'
This follows the end of the log file as entries are added and, for each entry that matches the IP address, displays the user id, date/time and request status code.
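The field positions can be tried offline on a couple of invented log entries. The sketch below assumes Apache's combined log format and compares the first field to the address exactly instead of using a regex; this is a variation, not the command above, and it avoids the dots acting as wildcards. The addresses, user name and variable names are hypothetical. It uses plain awk, which gawk can replace.

```shell
# Hypothetical access log entries in combined log format.
log='203.0.113.7 - alice [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"
198.51.100.9 - - [10/Oct/2023:13:56:01 +0000] "GET /about HTTP/1.1" 404 512 "-" "curl/8.0"'

# -v passes the address in as an awk variable; $1 is the client address.
matched=$(printf '%s\n' "$log" | awk -v ip='203.0.113.7' '$1 == ip { print $3, $4, $5, $9 }')

printf '%s\n' "$matched"
```

Only the first entry matches, and the user id, the two date/time fields and the status code are printed for it.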
gawk -F\" '{print $6}' /apache/log/access.log | sort | uniq -c | sort -nr
This example extracts the user agent from each log line and displays a unique list of the agents present in the logs, with the number of hits for each. It can be adjusted to any field in the log.
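As a variation, the counting can also be done inside awk with an array, leaving only the final ordering to sort. This is a sketch on invented log entries in combined log format, not a replacement for the pipeline above; the hits array and the sample data are hypothetical. It uses plain awk, which gawk can replace.

```shell
# Hypothetical log entries: two requests from one user agent, one from another.
log='203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 2326 "-" "Mozilla/5.0"
198.51.100.9 - - [10/Oct/2023:13:56:01 +0000] "GET /about HTTP/1.1" 404 512 "-" "curl/8.0"
203.0.113.7 - - [10/Oct/2023:13:57:12 +0000] "GET /faq HTTP/1.1" 200 1024 "-" "Mozilla/5.0"'

# Splitting on double quotes makes the user agent field 6.
# hits[] counts each agent; the END block prints "count agent" pairs.
agent_hits=$(printf '%s\n' "$log" |
  awk -F'"' '{ hits[$6]++ } END { for (agent in hits) print hits[agent], agent }' |
  sort -nr)

printf '%s\n' "$agent_hits"
```

The for (agent in hits) loop visits keys in no particular order, which is why the output is still piped through sort -nr.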
The condition { actions } pattern can also be chained. There are two special predefined conditions, BEGIN and END, and the rules that use them are known as the BEGIN and END blocks.
BEGIN { actions } ; condition { actions } ; END { actions }
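A program can use these as many times as needed. BEGIN blocks run once before any input is read, and END blocks run once after all the input has been processed. By convention BEGIN blocks go at the start of the script and END blocks at the end. Multiple blocks of the same kind execute in order of appearance.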
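A minimal sketch of the three kinds of rules working together, summing a column of made-up numbers. It uses plain awk, which gawk can replace; the total variable is just an illustrative name.

```shell
# Hypothetical input: one number per line.
summary=$(printf '5\n10\n20\n' | awk '
  BEGIN { print "start" }                        # runs once, before any input
  { total += $1 }                                # runs for every line
  END { print "lines:", NR, "total:", total }    # runs once, after the last line
')

printf '%s\n' "$summary"
```

NR is the built-in count of records read so far, so in the END block it holds the total number of lines.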
ORS is a built-in variable in gawk. It is the output record separator, the string that print writes after each record. By default ORS is set to a newline but it can be changed to another string.
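For instance, setting ORS in a BEGIN block turns a list of lines into a single comma-separated line. This is a sketch on invented input, using plain awk (gawk behaves the same).

```shell
# Join three hypothetical lines with ", " instead of a newline.
joined=$(printf 'one\ntwo\nthree\n' | awk 'BEGIN { ORS = ", " } { print }')

# ORS is written after every record, including the last one,
# so the result ends with a trailing ", ".
printf '[%s]\n' "$joined"
```

The brackets in the last line are only there to make the trailing separator visible.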
The RS variable is the input record separator. By default this variable is set to a newline but it can be set to one or more characters.
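As a quick sketch, setting RS to a semicolon makes awk treat each semicolon-delimited chunk as its own record. The input is invented and deliberately has no trailing newline, so the last record stays clean. Plain awk is used here; gawk can replace it.

```shell
# Hypothetical input: three values separated by semicolons, no newline at the end.
records=$(printf 'alpha;beta;gamma' | awk 'BEGIN { RS = ";" } { print NR ": " $0 }')

printf '%s\n' "$records"
```

Each chunk is numbered with NR, which shows that awk read three records even though the input is a single line.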
The field separator in gawk can be set with the FS variable. Field separators are core functionality and will probably vary a lot depending on the data you're working with. Gawk can use either a single character or a regular expression to split each record into the pieces you want to work with. The FS variable can be set in a BEGIN block.
gawk 'BEGIN { FS = "|" } ; condition { actions }' fileName
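Filling in the template with a concrete condition-free action, here is a sketch of both kinds of separator on invented data: a literal pipe character and a regular expression. It uses plain awk, which gawk can replace; the sample records and variable names are hypothetical.

```shell
# Single-character separator: print the second pipe-delimited field.
second=$(printf 'alice|admin|active\nbob|guest|locked\n' |
  awk 'BEGIN { FS = "|" } { print $2 }')

# Regular expression separator: one or more commas or semicolons.
regex_fields=$(printf 'a,b;;c\n' |
  awk 'BEGIN { FS = "[,;]+" } { print $1, $2, $3 }')

printf '%s\n' "$second"
printf '%s\n' "$regex_fields"
```

With the regular expression, the run of two semicolons counts as one separator, so the line still splits into exactly three fields.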