First, turn on logging in your CloudFront distribution settings to get logs written to an S3 bucket. Then, use the AWS command line interface[1] to sync the logs to a local directory[2]:
aws s3 sync s3://jeffkreeftmeijer.com-log-cf logs
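To get an idea of how many log files the bucket holds, list it with aws s3 ls and count the lines:

aws s3 ls s3://jeffkreeftmeijer.com-log-cf | wc -l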
With the logs downloaded, pass them to GoAccess to produce a 28-day HTML[3] report:
find logs -name "*.gz" | \
  xargs zcat | \
  grep --invert-match --file=exclude.txt | \
  goaccess \
  --log-format "%d\\t%t\\t%^\\t%b\\t%h\\t%m\\t%^\\t%r\\t%s\\t%R\\t%u\\t%^" \
  --date-format CLOUDFRONT \
  --time-format CLOUDFRONT \
  --ignore-crawlers \
  --ignore-status=301 \
  --ignore-status=302 \
  --keep-last=28 \
  --output index.html
This command consists of four commands piped together:
find logs -name "*.gz"
    Produce a list of all files in the logs directory. Because of the number of files in the logs directory, passing a directory glob to gunzip directly would result in an "argument list too long" error, because the list of filenames exceeds the ARG_MAX configuration (a quick way to inspect that limit is shown after this list):

    gunzip logs/*.gz
    zsh: argument list too long: gunzip

xargs zcat
    xargs takes the output from find (a stream of filenames delimited by newlines) and calls the zcat utility for every line by appending it to the passed command. Essentially, this runs the zcat command for every file in the logs directory. zcat is an alias for gzip --decompress --stdout, which decompresses gzipped files and prints the output to the standard output stream.

grep --invert-match --file=exclude.txt
    grep takes the input stream and filters out all log lines that match a line in the exclude file (exclude.txt). The exclude file is a list of words that are ignored when producing the report.[4]
goaccess …
    The decompressed logs get piped to goaccess to generate a report with the following options:

    --log-format "%d\\t%t\\t%^\\t%b\\t%h\\t%m\\t%^\\t%r\\t%s\\t%R\\t%u\\t%^"
        Use CloudFront's log format.[5]

    --date-format CLOUDFRONT --time-format CLOUDFRONT
        Use CloudFront's date and time formats to parse the log lines.

    --ignore-crawlers --ignore-status=301 --ignore-status=302
        Ignore crawlers and redirects.

    --keep-last=28
        Use the last 28 days to build the report.

    --output index.html
        Output an HTML report to a file named index.html.
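As mentioned above, getconf shows the limit the glob expansion runs into, and comparing it to the size of the filename list find produces shows why the pipeline avoids the error (getconf and wc are standard utilities):

getconf ARG_MAX                  # the limit, in bytes, on arguments passed to a new process
find logs -name "*.gz" | wc -c   # the total size of the filename list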
To sync the logs and generate a new report every night, add the commands to a Makefile (recipe lines are indented with tabs):
.PHONY: html sync

html:
	find logs -name "*.gz" | \
	xargs zcat | \
	grep --invert-match --file=exclude.txt | \
	goaccess \
	--log-format "%d\\t%t\\t%^\\t%b\\t%h\\t%m\\t%^\\t%r\\t%s\\t%R\\t%u\\t%^" \
	--date-format CLOUDFRONT \
	--time-format CLOUDFRONT \
	--ignore-crawlers \
	--ignore-status=301 \
	--ignore-status=302 \
	--keep-last=28 \
	--output index.html

sync:
	aws s3 sync s3://jeffkreeftmeijer.com-log-cf logs
Then, run the sync and html tasks in a cron job every night at midnight:
echo '0 0 * * * make --directory=~/stats/ sync html' | crontab
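Because piping into crontab replaces the user's entire crontab rather than appending to it, any existing entries need to be included in the echo. To check that the job was installed:

crontab -l
0 0 * * * make --directory=~/stats/ sync html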
1. On a Mac, use Homebrew:

brew install awscli

Then, set up the AWS credentials to access the logs bucket:

aws configure
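Running it prompts for an access key pair and defaults, which are stored under ~/.aws/; the values below are placeholders:

AWS Access Key ID [None]: AKIA…
AWS Secret Access Key [None]: …
Default region name [None]: us-east-1
Default output format [None]: json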
2. Running the aws s3 sync command on an empty local directory took me two hours and produced a 2.1 GB directory of .gz files for roughly 3 years of logs. Updating the logs by running the same command takes about five minutes.

Since I'm only interested in the stats for the last 28 days, it would make sense to only download the last 28 days of logs to generate the report. However, AWS's command line tool doesn't support filters like that.

One thing that does work is using both the --exclude and --include options to include only the logs for the current month:

aws s3 sync --exclude "*" --include "*2021-07-*" s3://jeffkreeftmeijer.com-log-cf ~/stats/logs

While this still loops over all files, it won't download anything outside of the selected month.
The command accepts the --include option multiple times, so it's possible to select multiple months like this. One could, theoretically, write a script that finds the current year and month, then downloads the logs matching that month and the month before it to produce a 28-day report.
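A minimal sketch of that idea, assuming GNU date (on macOS, date -v-1m +%Y-%m produces the previous month instead):

this_month=$(date +%Y-%m)
# Anchor on the 15th so subtracting a month never skips a short month.
last_month=$(date -d "$(date +%Y-%m)-15 -1 month" +%Y-%m)
aws s3 sync \
  --exclude "*" \
  --include "*$this_month-*" \
  --include "*$last_month-*" \
  s3://jeffkreeftmeijer.com-log-cf ~/stats/logs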
3. GoAccess generates JSON and CSV files when passing a filename with a .json or .csv extension, respectively. To generate the 28-day report in CSV format:

find logs -name "*.gz" | \
  xargs zcat | \
  grep --invert-match --file=exclude.txt | \
  goaccess \
  --log-format "%d\\t%t\\t%^\\t%b\\t%h\\t%m\\t%^\\t%r\\t%s\\t%R\\t%u\\t%^" \
  --date-format CLOUDFRONT \
  --time-format CLOUDFRONT \
  --ignore-crawlers \
  --ignore-status=301 \
  --ignore-status=302 \
  --keep-last=28 \
  --output stats.csv
4. My exclude.txt currently consists of the HEAD HTTP request type and the path to the feed file:

HEAD
feed.xml
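With those two patterns, grep --invert-match --file=exclude.txt drops every line that matches either one. A quick check with made-up input lines:

printf 'GET /index.html 200\nGET /feed.xml 200\nHEAD /index.html 200\n' | \
  grep --invert-match --file=exclude.txt

Only the first line survives: the second matches feed.xml and the third matches HEAD.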
5. Initially, the value for the --log-format option was CLOUDFRONT, which points to a predefined log format. However, that internal value changed, which broke the script when updating GoAccess, producing errors like this:

==77953== FILE: -
==77953== Parsed 10 lines producing the following errors:
==77953==
==77953== Token for '%H' specifier is NULL.
==77953== Token for '%H' specifier is NULL.
==77953== Token for '%H' specifier is NULL.
==77953== Token for '%H' specifier is NULL.
==77953== Token for '%H' specifier is NULL.
==77953== Token for '%H' specifier is NULL.
==77953== Token for '%H' specifier is NULL.
==77953== Token for '%H' specifier is NULL.
==77953== Token for '%H' specifier is NULL.
==77953== Token for '%H' specifier is NULL.
==77953==
==77953== Format Errors - Verify your log/date/time format
I haven’t been able to find out what the problem is, so I’ve reverted to the old log format for the time being. I suspect the newly introduced log format in GoAccess no longer matches the old log lines from 2017.
An example of a log line that doesn’t match the new log format:
2017-09-17 09:16:00 CDG50 573 132.166.177.54 GET d2xkchmcg9g2pt.cloudfront.net /favicon.ico 301 - Mozilla/5.0%2520(X11;%2520Linux%2520x86_64)%2520KHTML/5.37.0%2520(like%2520Gecko)%2520Konqueror/5%2520KIO/5.37 - - Redirect QgyNFmDkLiZ23dKCu9ozmQFWrGY407bHn9VRlWzhp9KjyCe3b0b4WQ== jeffkreeftmeijer.com http 425 0.000 - - - Redirect HTTP/1.1