First, turn on logging in your CloudFront distribution settings to get logs written to an S3 bucket. Then, use the AWS command line interface[1] to sync the logs to a local directory:[2]
aws s3 sync s3://jeffkreeftmeijer.com-log-cf logs
With the logs downloaded, generate a report by passing the logs to GoAccess to produce a 28-day HTML[3] report:
find logs -name "*.gz" | \
  xargs zcat | \
  grep --invert-match --file=exclude.txt | \
  goaccess \
  --log-format "%d\\t%t\\t%^\\t%b\\t%h\\t%m\\t%^\\t%r\\t%s\\t%R\\t%u\\t%^" \
  --date-format CLOUDFRONT \
  --time-format CLOUDFRONT \
  --ignore-crawlers \
  --ignore-status=301 \
  --ignore-status=302 \
  --keep-last=28 \
  --output index.html
This command consists of four commands piped together:
find logs -name "*.gz"
    Produce a list of all files in the logs directory. Because of the number of files in the logs directory, passing a directory glob to gunzip directly would result in an "argument list too long" error, because the list of filenames exceeds the ARG_MAX configuration (see the illustration after this list):

    gunzip logs/*.gz
    zsh: argument list too long: gunzip

xargs zcat
    xargs takes the output from find (a stream of filenames delimited by newlines) and calls the zcat utility with those filenames appended as arguments, batched to stay below the ARG_MAX limit. Essentially, this runs the zcat command for every file in the logs directory. zcat is an alias for gzip --decompress --stdout, which decompresses gzipped files and prints the output to the standard output stream.

grep --invert-match --file=exclude.txt
    grep takes the input stream and filters out all log lines that match a line in the exclude file (exclude.txt). The exclude file is a list of words that are ignored when producing the report.[4]

goaccess …
    The decompressed logs get piped to goaccess to generate a report with the following options:

    --log-format "%d\\t%t\\t%^\\t%b\\t%h\\t%m\\t%^\\t%r\\t%s\\t%R\\t%u\\t%^"
        Use CloudFront's log format.[5]
    --date-format CLOUDFRONT --time-format CLOUDFRONT
        Use CloudFront's date and time formats to parse the log lines.
    --ignore-crawlers --ignore-status=301 --ignore-status=302
        Ignore crawlers and redirects.
    --keep-last=28
        Use the last 28 days to build the report.
    --output index.html
        Output an HTML report to a file named index.html.
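Both the limit and xargs' batching are easy to see directly. As a quick illustration (not part of the original pipeline): getconf ARG_MAX prints the maximum combined size of arguments and environment variables a single command may receive, and capping xargs at two arguments per invocation shows how it splits a long file list across multiple commands:

# Print the kernel's argument-size limit that the glob ran into:
getconf ARG_MAX

# Force batches of at most two arguments; echo makes this a dry run,
# printing each zcat command instead of executing it:
printf '%s\n' one.gz two.gz three.gz four.gz five.gz | xargs -n 2 echo zcat
# zcat one.gz two.gz
# zcat three.gz four.gz
# zcat five.gz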
To sync the logs and generate a new report every night, add the commands to a Makefile.
.PHONY: html sync

html:
	find logs -name "*.gz" | \
	xargs zcat | \
	grep --invert-match --file=exclude.txt | \
	goaccess \
	--log-format "%d\\t%t\\t%^\\t%b\\t%h\\t%m\\t%^\\t%r\\t%s\\t%R\\t%u\\t%^" \
	--date-format CLOUDFRONT \
	--time-format CLOUDFRONT \
	--ignore-crawlers \
	--ignore-status=301 \
	--ignore-status=302 \
	--keep-last=28 \
	--output index.html

sync:
	aws s3 sync s3://jeffkreeftmeijer.com-log-cf logs
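With the Makefile in place, both tasks can also be run by hand; make accepts multiple targets and runs them in the order given:

make sync html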
Then, run the sync and html tasks in a cron job every night at midnight:
echo '0 0 * * * make --directory=~/stats/ sync html' | crontab
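Note that piping into crontab replaces any previously installed crontab. If other jobs are already set up, a common idiom (not from the original post) is to print the current crontab and append the new entry:

# crontab -l prints the existing entries (stderr silenced in case there
# are none yet); crontab - installs the combined result from stdin.
(crontab -l 2>/dev/null; echo '0 0 * * * make --directory=~/stats/ sync html') | crontab -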
[1] On a Mac, use Homebrew:

brew install awscli

Then, set up the AWS credentials to access the logs bucket:

aws configure
[2] Running the aws s3 sync command on an empty local directory took me two hours and produced a 2.1 GB directory of .gz files for roughly three years of logs. Updating the logs by running the same command takes about five minutes.

Since I'm only interested in the stats for the last 28 days, it would make sense to only download the last 28 days of logs to generate the report. However, AWS's command line tool doesn't support filters like that.

One thing that does work is using both the --exclude and --include options to include only the logs for the current month:

aws s3 sync --exclude "*" --include "*2021-07-*" s3://jeffkreeftmeijer.com-log-cf ~/stats/logs

While this still loops over all files, it won't download anything outside of the selected month. The command accepts the --include option multiple times, so it's possible to select multiple months this way. One could, theoretically, write a script that finds the current year and month, then downloads the logs matching that month and the month before it to produce a 28-day report; a sketch of such a script follows below.
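Such a script might look something like this (a sketch, not from the original post; it assumes GNU date for the relative-date arithmetic, so on macOS the last_month line would be date -v-1m +%Y-%m instead):

#!/bin/sh
# Download only the logs for the current and previous month, which is
# enough history to cover a 28-day report.
this_month=$(date +%Y-%m)
last_month=$(date --date='1 month ago' +%Y-%m)  # GNU date

aws s3 sync \
  --exclude "*" \
  --include "*$this_month-*" \
  --include "*$last_month-*" \
  s3://jeffkreeftmeijer.com-log-cf logs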
[3] GoAccess generates JSON and CSV files when passing a filename with a .json or .csv extension, respectively. To generate the 28-day report in CSV format:

find logs -name "*.gz" | \
  xargs zcat | \
  grep --invert-match --file=exclude.txt | \
  goaccess \
  --log-format "%d\\t%t\\t%^\\t%b\\t%h\\t%m\\t%^\\t%r\\t%s\\t%R\\t%u\\t%^" \
  --date-format CLOUDFRONT \
  --time-format CLOUDFRONT \
  --ignore-crawlers \
  --ignore-status=301 \
  --ignore-status=302 \
  --keep-last=28 \
  --output stats.csv
[4] My exclude.txt currently consists of the HEAD HTTP request type and the path to the feed file:

HEAD
feed.xml
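Since grep treats each line in the file passed to --file as a separate pattern, any log line containing one of these words is dropped. A quick check with some made-up request lines:

printf 'GET /index.html\nHEAD /index.html\nGET /feed.xml\n' | \
  grep --invert-match --file=exclude.txt
# GET /index.html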
[5] Initially, the value for the --log-format option was CLOUDFRONT, which points to a predefined log format. However, that internal value changed, which broke the script when updating GoAccess, producing errors like this:

==77953== FILE: -
==77953== Parsed 10 lines producing the following errors:
==77953==
==77953== Token for '%H' specifier is NULL.
==77953== Token for '%H' specifier is NULL.
==77953== Token for '%H' specifier is NULL.
==77953== Token for '%H' specifier is NULL.
==77953== Token for '%H' specifier is NULL.
==77953== Token for '%H' specifier is NULL.
==77953== Token for '%H' specifier is NULL.
==77953== Token for '%H' specifier is NULL.
==77953== Token for '%H' specifier is NULL.
==77953== Token for '%H' specifier is NULL.
==77953==
==77953== Format Errors - Verify your log/date/time format
I haven't been able to find out what the problem is, so I've reverted to the old log format for the time being. I suspect the newly introduced log format in GoAccess no longer matches the old log lines from 2017.
An example of a log line that doesn’t match the new log format:
2017-09-17 09:16:00 CDG50 573 132.166.177.54 GET d2xkchmcg9g2pt.cloudfront.net /favicon.ico 301 - Mozilla/5.0%2520(X11;%2520Linux%2520x86_64)%2520KHTML/5.37.0%2520(like%2520Gecko)%2520Konqueror/5%2520KIO/5.37 - - Redirect QgyNFmDkLiZ23dKCu9ozmQFWrGY407bHn9VRlWzhp9KjyCe3b0b4WQ== jeffkreeftmeijer.com http 425 0.000 - - - Redirect HTTP/1.1
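To check whether a given format matches lines like the one above, a few sample lines can be piped straight into goaccess, which prints parse errors like the ones shown earlier (a general debugging approach, not from the original post; the log filename is illustrative):

# Test the predefined CLOUDFRONT format against ten sample log lines;
# goaccess reads from standard input when no file is given, and writes
# a throwaway report we can ignore.
zcat logs/EXAMPLE.2017-09-17.gz | head | \
  goaccess \
  --log-format CLOUDFRONT \
  --date-format CLOUDFRONT \
  --time-format CLOUDFRONT \
  --output /tmp/format-test.html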