Gawk - Useful Programs and Things to Know

January 2, 2014
Tags: gawk, awk, mawk, tawk, sed

Awk is a useful language for processing text files whose contents are split up by some kind of delimiter. By default it treats each line as a record and splits the line into fields on whitespace. Gawk is the GNU implementation of the awk language and includes some extra features. If you have to parse or format a large text dataset, gawk is probably one of the most efficient ways to do it. Awk/gawk scripts can be very compact and are often used inline in bash commands to process or format output on the fly. This post refers to gawk, but most of these commands are basic enough to work in any flavor of awk, like mawk or tawk.

The basic syntax of a gawk program is a condition followed by the actions to take when a line matches it, for example printing or reformatting the line.

  
condition { actions }
  

Either the action or the condition can be left off. If you don't specify an action, gawk prints every line ($0) where the condition is true. If the condition is left off, the action is executed on every line.
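
Here is a minimal sketch of both shorthand forms, run against a small made-up input instead of a real file. The variable names are just for illustration, and plain awk is used so the snippet should run under gawk, mawk or any POSIX awk.

```shell
# Made-up sample input: a name and a number on each line.
input=$(printf 'apple 3\nbanana 12\ncherry 7\n')

# Condition only: print every line whose second field is greater than 5.
condition_only=$(printf '%s\n' "$input" | awk '$2 > 5')

# Action only: run the action on every line, here printing the first field.
action_only=$(printf '%s\n' "$input" | awk '{ print $1 }')

printf '%s\n---\n%s\n' "$condition_only" "$action_only"
```

The first command prints the banana and cherry lines in full, the second prints just the three names.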

Removing Duplicates

  
gawk '!exists[$0]++' fileName
  

When gawk reads the file, the whole current line is available in the $0 variable. The condition uses the line as a key in the exists array. exists[$0]++ evaluates to the number of times the line has been seen so far and then increments that count, so the first time a line appears the value is 0 and !exists[$0]++ is true. For every later duplicate the count is already non-zero, so the condition is false. Since the default action is to print the lines that match the condition, this prints each distinct line once, in its original order, and skips the subsequent duplicates.
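
The same idiom works on any key, not just the whole line. A small sketch, assuming made-up input where the first field is an ID, that keeps only the first line seen for each ID (the array name seen is arbitrary):

```shell
# Made-up sample: the first field is an ID that repeats.
input=$(printf '1 alice\n2 bob\n1 alice-again\n3 carol\n2 bob-again\n')

# seen[$1]++ evaluates to the previous count (0 on first sight) and then
# increments it, so !seen[$1]++ is true only the first time an ID appears.
first_per_id=$(printf '%s\n' "$input" | awk '!seen[$1]++')

printf '%s\n' "$first_per_id"
```

Only the alice, bob and carol lines come out, in the order they first appeared.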

Searching Processes

Gawk splits each input line into fields, which makes it simple to format tabular data like the output of many shell commands.


The command ps aux outputs a table of all the running processes on my machine. In Ubuntu, running ps aux | head -1 outputs the table headers.

  
USER  PID  %CPU  %MEM  VSZ  RSS  TTY  STAT  START  TIME  COMMAND
  

When awk processes each line, $0 will be the whole line, the USER column will be available as $1, the PID as $2 and COMMAND as $11. If the command has arguments, they continue in $12 and beyond.
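
To make the field numbering concrete, here is a sketch that runs on a made-up table in the same column layout as ps aux instead of live process data. It also uses the built-in NR variable (the current line number) to skip the header row.

```shell
# Made-up sample in the same column layout as `ps aux`.
sample=$(printf '%s\n' \
  'USER  PID  %CPU  %MEM  VSZ  RSS  TTY  STAT  START  TIME  COMMAND' \
  'root  1  0.0  0.1  1000  500  ?  Ss  09:00  0:01  /sbin/init' \
  'bob  4242  1.5  2.0  9000  4000  pts/0  S+  09:05  0:10  vim')

# NR > 1 skips the header; $2 is the PID column and $11 the COMMAND column.
pid_and_command=$(printf '%s\n' "$sample" | awk 'NR > 1 { print $2, $11 }')

printf '%s\n' "$pid_and_command"
```

The comma in print $2, $11 joins the two fields with a single space, no matter how many spaces separated them in the input.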

  
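ps aux | gawk '/[m]yRunningScript\.sh/ { print $2 }'
  

This yields the PIDs of the processes matching the pattern, which can be useful to find and kill a process. Writing the first letter as [m] keeps the gawk process itself from matching, since its own command line contains the pattern text, and \. matches a literal dot. Only do this when the search term is really unique, like a custom script name.

  
kill $(ps aux | gawk '/[m]yRunningScript\.sh/ { print $2 }')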
  

I wouldn't recommend killing processes automatically if the search term is short or generic. Make sure you know what's getting shut down.

Follow, Filter and Format Apache Logs by IP Address

Follow the log and display basic request info.

When debugging requests, it's very helpful to be able to filter the log down to lines from your own IP address and see some basic information from the server.

  
tail -f /apache/log/access.log | gawk '/IP_ADDRESS/ { print $3,$4,$5,$9 }' 
  

This follows the end of the log file as entries are added and prints any that match the IP address. It displays the user ID, date/time and response status code.

Sorting and counting request information by column.

  
gawk -F\" '{print $6}' /apache/log/access.log | sort | uniq -c | sort -nr
  

This example extracts the user agent from each entry and displays a list of the unique user agents present in the log with the number of hits for each, most frequent first. It can be adjusted to any field in the log.
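
To see why the user agent ends up in $6, here is a sketch of the same pipeline on a few made-up log lines. It assumes the Apache combined log format, where the request, referer and user agent are each wrapped in double quotes. -F'"' is just another way of writing -F\".

```shell
# Made-up entries in Apache combined log format.
log=$(printf '%s\n' \
  '10.0.0.1 - - [02/Jan/2014:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "curl/7.30"' \
  '10.0.0.2 - - [02/Jan/2014:10:00:01 +0000] "GET /missing HTTP/1.1" 404 128 "-" "Mozilla/5.0"' \
  '10.0.0.1 - - [02/Jan/2014:10:00:02 +0000] "GET /about HTTP/1.1" 200 2048 "-" "curl/7.30"')

# Splitting on double quotes: $2 is the request, $4 the referer and
# $6 the user agent. The odd-numbered fields are the text between them.
user_agents=$(printf '%s\n' "$log" |
  awk -F'"' '{ print $6 }' | sort | uniq -c | sort -nr)

printf '%s\n' "$user_agents"
```

Changing $6 to $2 or $4 would count requests or referers instead.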

Useful Built-in Variables

BEGIN and END

The basic condition { actions } format can also be chained, so a program can contain several of these rules. There are two special preset conditions, BEGIN and END, also known as the BEGIN and END blocks.

  
BEGIN { actions } ; condition { actions } ; END { actions }
  

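A program can use these as many times as needed. BEGIN blocks run once before any input is read and END blocks run once after all input has been processed, so by convention BEGIN blocks go at the start of the script and END blocks at the end. Multiple blocks of the same kind execute in order of appearance.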
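
A minimal sketch of the three kinds of block working together, using made-up input with one number per line (the variable names sum and report are arbitrary):

```shell
# Made-up input: one number per line.
report=$(printf '4\n8\n15\n' | awk '
  BEGIN { print "values:" }       # runs once, before any input is read
  { sum += $1; print "  " $1 }    # runs for every input line
  END { print "total: " sum }     # runs once, after the last line
')

printf '%s\n' "$report"
```

This prints a heading, the three values indented, and then total: 27.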

ORS - Output Record Separator.

ORS is a built-in variable in gawk that holds the output record separator, the string that print writes after each record. By default ORS is set to a newline, but it can be changed to any other string.
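
For example, setting ORS in a BEGIN block joins lines together. This sketch uses made-up input:

```shell
# With ORS set to ", ", print ends each record with a comma and a space
# instead of a newline.
joined=$(printf 'red\ngreen\nblue\n' | awk 'BEGIN { ORS = ", " } { print }')

printf '%s\n' "$joined"
```

Note that the separator is also written after the last record, so the result ends with a trailing ", ".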

RS - Record Separator.

The RS variable is the input record separator. By default this variable is set to a newline, but it can be set to another character. In gawk, an RS longer than one character is treated as a regular expression.
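
A small sketch with made-up input, splitting records on semicolons instead of newlines. NR is the built-in record counter.

```shell
# Each semicolon-separated chunk becomes one record.
records=$(printf 'alpha;beta;gamma' | awk 'BEGIN { RS = ";" } { print NR ": " $0 }')

printf '%s\n' "$records"
```

The input here has no trailing newline on purpose. With RS changed, a final newline would become part of the last record.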

FS - Field Separator.

The field separator in gawk can be set with the FS variable. Field separators are core functionality and will probably vary a lot depending on the data you're working with. Gawk can use either a single character or a regular expression to split each record into the fields you work with. The FS variable can be set in a BEGIN block.

  
gawk 'BEGIN { FS = "|" } ; condition { actions }' fileName
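
Filling in the placeholders, here is a sketch with made-up pipe-delimited data, followed by a regular expression separator. The field layout (name, age, role) is an assumption for the example.

```shell
# Single-character separator: print the name of every row whose role is admin.
admins=$(printf 'alice|30|admin\nbob|25|user\n' |
  awk 'BEGIN { FS = "|" } $3 == "admin" { print $1 }')

# Regular expression separator: a comma with optional spaces around it.
second=$(printf 'a , b,c\n' | awk 'BEGIN { FS = " *, *" } { print $2 }')

printf '%s\n%s\n' "$admins" "$second"
```

The -F command line option sets the same variable, so awk -F'|' is equivalent to the BEGIN block in the first command.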