First, turn on logging in your CloudFront distribution settings to get logs written to an S3 bucket. Then, use the AWS command line interface[1] to sync the logs to a local directory[2]:
/usr/local/bin/aws s3 sync s3://jeffkreeftmeijer.com-log-cf logs
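To get a quick look at the log objects before syncing everything, list the bucket's contents:

/usr/local/bin/aws s3 ls s3://jeffkreeftmeijer.com-log-cf | head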
With the logs downloaded, pass them to GoAccess to produce a 28-day HTML[3] report:
find logs -name "*.gz" | \ xargs gzcat | \ grep --invert-match --file=exclude.txt | \ /usr/local/bin/goaccess \ --log-format CLOUDFRONT \ --date-format CLOUDFRONT \ --time-format CLOUDFRONT \ --ignore-crawlers \ --ignore-status=301 \ --ignore-status=302 \ --keep-last=28 \ --output index.html
This pipeline consists of four commands:
find logs -name "*.gz"
Produce a list of all files in the logs directory. Because of the number of files in that directory, passing a glob to gunzip directly would result in an “argument list too long” error, as the list of filenames exceeds the maximum argument list length:
zsh: argument list too long: gunzip
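The exact limit differs per system; it can be inspected with getconf:

getconf ARG_MAX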
xargs gzcat
xargs takes the output from find, a stream of filenames delimited by newlines, and calls the gzcat utility with those filenames appended to the passed command, batching them into as many invocations as fit under the argument-length limit. Essentially, this runs the gzcat command for every file in the logs directory. gzcat is an alias for gzip --decompress --stdout, which decompresses gzipped files and prints the output to the standard output stream.
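To preview the invocation xargs builds without decompressing anything, prepend echo to the command (on Linux, gzcat is typically available as zcat):

find logs -name "*.gz" | xargs echo gzcat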
grep --invert-match --file=exclude.txt
grep takes the input stream and filters out all log lines that match a line in the exclude file (exclude.txt). The exclude file is a list of words that are ignored when producing the report[4].
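To check which lines the exclude file would drop, run the same pipeline without --invert-match and sample the output:

find logs -name "*.gz" | xargs gzcat | grep --file=exclude.txt | head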
goaccess --log-format CLOUDFRONT --date-format CLOUDFRONT --time-format CLOUDFRONT --ignore-crawlers --ignore-status=301 --ignore-status=302 --keep-last=28 --output index.html
goaccess reads the decompressed, filtered logs and generates a report with the following options:
--log-format CLOUDFRONT --date-format CLOUDFRONT --time-format CLOUDFRONT
- Use CloudFront’s log, date, and time formats to parse the log lines.
--ignore-crawlers --ignore-status=301 --ignore-status=302
- Ignore crawlers and redirects.
--keep-last=28
- Use the last 28 days to build the report.
--output index.html
- Output an HTML report to a file named index.html.
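GoAccess writes the report as a single self-contained HTML file, so it can be opened straight from the filesystem (open is macOS-specific; on Linux, use xdg-open):

open index.html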
To sync the logs and generate a new report, run the sync.sh and html.sh scripts in a cron job every night at midnight:
echo '0 0 * * * ~/stats/sync.sh && ~/stats/html.sh' | crontab
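Note that piping to crontab like this replaces the user’s existing crontab. The scripts themselves aren’t shown here; a minimal sketch, assuming each one simply wraps the corresponding command from above:

#!/bin/sh
# ~/stats/sync.sh (sketch): download new CloudFront logs
/usr/local/bin/aws s3 sync s3://jeffkreeftmeijer.com-log-cf ~/stats/logs

#!/bin/sh
# ~/stats/html.sh (sketch): regenerate the 28-day HTML report
cd ~/stats || exit 1
find logs -name "*.gz" | \
  xargs gzcat | \
  grep --invert-match --file=exclude.txt | \
  /usr/local/bin/goaccess \
    --log-format CLOUDFRONT --date-format CLOUDFRONT --time-format CLOUDFRONT \
    --ignore-crawlers --ignore-status=301 --ignore-status=302 \
    --keep-last=28 --output index.html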
[1] On a Mac, use Homebrew:
brew install awscli
[2] The first aws s3 sync run against an empty local directory took me two hours and produced a 2.1 GB directory of .gz files for roughly three years of logs. Updating the logs by running the same command again takes about five minutes.
Since I’m only interested in the stats for the last 28 days, it would make sense to only download the last 28 days of logs to generate the report. However, AWS’s command line tool doesn’t support filters like that.
One thing that does work is using both the --exclude and --include options to include only the logs for the current month:
/usr/local/bin/aws s3 sync --exclude "*" --include "*2021-07-*" s3://jeffkreeftmeijer.com-log-cf ~/stats/logs
While this still loops over all files, it won’t download anything outside of the selected month.
The command accepts the --include option multiple times, so it’s possible to select multiple months this way. One could, theoretically, write a script that finds the current year and month, then downloads the logs matching that month and the month before it to produce a 28-day report.
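A sketch of that script, assuming BSD/macOS date with a GNU date fallback (the month arithmetic is naive about month-end edge cases):

#!/bin/sh
# Sync only the logs for the current and the previous month.
this_month=$(date +%Y-%m)
# date -v-1m is BSD/macOS; GNU date uses --date='1 month ago'.
last_month=$(date -v-1m +%Y-%m 2>/dev/null || date --date='1 month ago' +%Y-%m)
/usr/local/bin/aws s3 sync \
  --exclude "*" \
  --include "*${this_month}-*" \
  --include "*${last_month}-*" \
  s3://jeffkreeftmeijer.com-log-cf ~/stats/logs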
[3] GoAccess generates JSON and CSV files when passed a filename with a .json or .csv extension, respectively. To generate the 28-day report in CSV format:
find logs -name "*.gz" | \ xargs gzcat | \ grep --invert-match --file=exclude.txt | \ /usr/local/bin/goaccess \ --log-format CLOUDFRONT \ --date-format CLOUDFRONT \ --time-format CLOUDFRONT \ --ignore-crawlers \ --ignore-status=301 \ --ignore-status=302 \ --keep-last=28 \ --output stats.csv
[4] The exclude file (exclude.txt) currently consists of the HEAD HTTP request type and the path to the feed file.
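A sketch of such a file (the /feed.xml path is a guess, not the actual contents):

HEAD
/feed.xml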