Web analytics from CloudFront access logs with GoAccess


First, turn on logging in your CloudFront distribution settings to get logs written to an S3 bucket. Then, use the AWS command line interface1 to sync the logs to a local directory:2

/usr/local/bin/aws s3 sync s3://jeffkreeftmeijer.com-log-cf logs

With the logs downloaded, pass them to GoAccess to produce a 28-day HTML3 report:

find logs -name "*.gz" | \
    xargs gzcat | \
    grep --invert-match --file=exclude.txt | \
    /usr/local/bin/goaccess \
        --log-format CLOUDFRONT \
        --date-format CLOUDFRONT \
        --time-format CLOUDFRONT \
        --ignore-crawlers \
        --ignore-status=301 \
        --ignore-status=302 \
        --keep-last=28 \
        --output index.html

This pipeline consists of four commands piped together:

find logs -name "*.gz"

Produce a list of all files in the logs directory. With this many files, passing a directory glob to gunzip directly would fail with an “argument list too long” error, as the list of filenames exceeds the system’s ARG_MAX limit:

gunzip logs/*.gz
zsh: argument list too long: gunzip
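The exact limit differs per system and can be inspected with getconf:

```shell
# Print the maximum combined length of command-line arguments
# and environment variables, in bytes:
getconf ARG_MAX
```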

xargs gzcat

xargs takes the output from find—which outputs a stream of filenames delimited by newlines—and calls the gzcat utility for every line by appending it to the passed command. Essentially, this runs the gzcat command for every file in the logs directory.
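The appending behavior is easy to see by substituting echo for the real command:

```shell
# xargs appends the newline-delimited input to the given command:
printf 'a.gz\nb.gz\n' | xargs echo gzcat
# prints: gzcat a.gz b.gz
```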

gzcat is an alias for gzip --decompress --stdout, which decompresses gzipped files and prints the output to the standard output stream.

grep --invert-match --file=exclude.txt
grep takes the input stream and filters out all log lines that match a line in the exclude file (exclude.txt). The exclude file is a list of words that are ignored when producing the report4.
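For example, with the two-line exclude file from footnote 4, a HEAD request and a feed hit are dropped while a regular page view passes through:

```shell
# A two-line exclude file, matching footnote 4:
printf 'HEAD\nfeed.xml\n' > exclude.txt

# Only lines matching neither pattern survive:
printf 'GET /index.html\nHEAD /index.html\nGET /feed.xml\n' | \
    grep --invert-match --file=exclude.txt
# prints: GET /index.html
```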
goaccess --log-format CLOUDFRONT --date-format CLOUDFRONT --time-format CLOUDFRONT --ignore-crawlers --ignore-status=301 --ignore-status=302 --keep-last=28 --output index.html
goaccess reads the decompressed logs to generate a report with the following options:
--log-format CLOUDFRONT --date-format CLOUDFRONT --time-format CLOUDFRONT
Use CloudFront’s log, date and time formats to parse the log lines.
--ignore-crawlers --ignore-status=301 --ignore-status=302
Ignore crawlers and redirects.
--keep-last=28
Use the last 28 days to build the report.
--output index.html
Output an HTML report to a file named index.html.

To sync the logs and generate a new report, run the sync.sh and html.sh scripts in a cron job every night at midnight:

echo '0 0 * * * ~/stats/sync.sh && ~/stats/html.sh' | crontab
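The two scripts aren’t shown here; a minimal sketch, assuming the bucket, paths, and options from the examples above, could write them out like this:

```shell
# Hypothetical sketches of the two scripts referenced above, assuming
# the bucket and paths from the earlier examples.
mkdir -p ~/stats

# sync.sh downloads new log files from the S3 bucket:
cat > ~/stats/sync.sh <<'EOF'
#!/bin/sh
/usr/local/bin/aws s3 sync s3://jeffkreeftmeijer.com-log-cf ~/stats/logs
EOF

# html.sh regenerates the 28-day report from the synced logs:
cat > ~/stats/html.sh <<'EOF'
#!/bin/sh
cd ~/stats
find logs -name "*.gz" | \
    xargs gzcat | \
    grep --invert-match --file=exclude.txt | \
    /usr/local/bin/goaccess \
        --log-format CLOUDFRONT \
        --date-format CLOUDFRONT \
        --time-format CLOUDFRONT \
        --ignore-crawlers \
        --ignore-status=301 \
        --ignore-status=302 \
        --keep-last=28 \
        --output index.html
EOF

chmod +x ~/stats/sync.sh ~/stats/html.sh
```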

1

On a Mac, install it with Homebrew:

brew install awscli
2

Running the aws s3 sync command on an empty local directory took me two hours and produced a 2.1 GB directory of .gz files for roughly 3 years of logs. Updating the logs by running the same command takes about five minutes.

Since I’m only interested in the stats for the last 28 days, it would make sense to only download the last 28 days of logs to generate the report. However, AWS’s command line tool doesn’t support filters like that.

One thing that does work is using both the --exclude and --include options to include only the logs for the current month:

/usr/local/bin/aws s3 sync --exclude "*" --include "*2021-07-*" s3://jeffkreeftmeijer.com-log-cf ~/stats/logs

While this still loops over all files, it won’t download anything outside of the selected month.

The command accepts the --include option multiple times, so it’s possible to select multiple months this way. One could, theoretically, write a script that finds the current year and month, then downloads the logs for that month and the month before it to produce a 28-day report.
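Such a script might look like the sketch below. The previous-month calculation is the only tricky part, since BSD date (macOS) and GNU date disagree on the flag, so it tries both; the script prints the resulting command instead of running it:

```shell
#!/bin/sh
# Hypothetical sketch: build a sync command that covers the current and
# the previous month, enough for a 28-day reporting window.
current=$(date +%Y-%m)
# BSD date (macOS) takes -v-1m; GNU date takes --date='1 month ago'
previous=$(date -v-1m +%Y-%m 2>/dev/null || date --date='1 month ago' +%Y-%m)

set -- /usr/local/bin/aws s3 sync \
    --exclude "*" \
    --include "*${current}-*" \
    --include "*${previous}-*" \
    s3://jeffkreeftmeijer.com-log-cf ~/stats/logs

# Print the command for inspection; run "$@" instead of echo to execute it.
echo "$@"
```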

3

GoAccess generates JSON and CSV files when passed a filename with a .json or .csv extension, respectively. To generate the 28-day report in CSV format:

find logs -name "*.gz" | \
    xargs gzcat | \
    grep --invert-match --file=exclude.txt | \
    /usr/local/bin/goaccess \
        --log-format CLOUDFRONT \
        --date-format CLOUDFRONT \
        --time-format CLOUDFRONT \
        --ignore-crawlers \
        --ignore-status=301 \
        --ignore-status=302 \
        --keep-last=28 \
        --output stats.csv
4

My exclude.txt currently consists of the HEAD HTTP request method and the path to the feed file:

HEAD
feed.xml