Free SEO Diagnostics Tool: Log File Parsing Script
A cornerstone of search engine optimization is the analysis of server log files. Analytics tools are essential, but we need direct access to the raw logs to study bot behaviour and to spot the errors, patterns and potential crawling issues that GUI analytics can’t show us.
We use a number of tools to diagnose crawling and indexing patterns. Google’s webmaster console is excellent, but in my experience (especially with large sites) crawling errors are shown as indicators rather than comprehensive reports – a good heads-up that deeper analysis is needed. That’s where command line scripts come into the picture. They can parse huge log files for critical information, such as search engine spider behaviour, or error codes like 500s that may signal coding issues. They can also pull out data on how individual pages or specific categories are being crawled, for example by using patterns to match crawler activity beneath a specific sub-domain or directory.
When working with lots of logged data, a fast script that pulls out criteria we assign and then outputs it in a format that’s easy to read is mighty handy.
On that note, I’d like to announce a free log parsing script we’re offering to the SEO community. It’s a simplified version of the scripts we use internally for our clients. Please use it, abuse it, improve it and pass it around freely. We only ask that you credit AudetteMedia and its author, Edward Arenberg, with the script. I’d also like to thank my friend and colleague Aaron Shear for the inspiration.
What is it?
logfilt is a command line tool for Unix (and Unix-based systems such as OS X). It allows us to quickly parse out key criteria from log files, in an efficient manner, using simple commands. Because it’s written in sed and nawk, it’s fast and handles large log files quite easily. It’s a very basic, elegant and efficient script. Plus it’s fun to use!
How do I install it?
You can download your preferred package below. Both nawk (default) and awk versions are included in the zip file.
Depending on how you save logfilt on your system, you may need to make it a Unix executable file. Either copy and paste the script below into a new file saved as logfilt, or just right-click and choose “Save as…”. Then open a terminal window and run the following command on the file:
chmod +x logfilt
This should make it an executable program.
How do I use it?
You can always access logfilt help by simply running:
logfilt -h
The full command syntax is as follows:
logfilt [options] file1 file2…
-H Host Filter (regexp)
-R Return Code Filter
-L Limit Number of Host Matches
-h Access this Help
Here’s how to use logfilt to analyze your log files:
Unless you’ve opted to install logfilt on a remote server, go ahead and download the log files to your computer and run the script locally. To combine your logs, issue the following:
Create a new file:
Combine all the logs in the directory:
cat * > access_log
Finally, set logfilt loose.
Alternatively, you can keep your log files separate and analyze them by date rather than combining them (if you do combine them with cat, keep them in chronological order). Run a cache:domain.com query at Google to find out when Googlebot last crawled and indexed your site. Then pull the log files from before and after the cache date and run diagnostics on those. You should be able to capture a view of Googlebot’s behaviour on your domain.
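If you do combine logs, the order of the files matters, and so does where the output lands. Here is a minimal sketch of one way to do it, assuming you pass the files oldest first. The helper name combine_logs and the file names in the example are made up for illustration and are not part of logfilt.

```shell
# Sketch: concatenate log files in the order given and write the result
# to a file you name explicitly, so the output is never read back in as
# one of the inputs.
# combine_logs is a hypothetical helper, not part of logfilt.
combine_logs() {
    out=$1
    shift
    # The remaining arguments are concatenated in the order given,
    # so list the oldest log first.
    cat -- "$@" > "$out"
}

# Example (file names are placeholders):
# combine_logs /tmp/access_log log-Jul-15 log-Jul-16 log-Jul-17
```

Listing the files explicitly is a little more typing than a wildcard, but a wildcard sorts alphabetically, which may not match the dates in your file names.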
Where are the bots?
Googlebot lives at a variety of IP blocks. Lately I’ve seen it arriving from 66.249.xx.xx, but Google doesn’t appear to own that entire class B address range. Apparently Google has reserved a large number of blocks but isn’t using them all. If you have a good resource for finding up-to-date IP information for search spiders, please let me know in the comments.
You can use regular expressions to pull bots out once you have their IP ranges. Here are some of the ranges Google has reserved for Googlebots:
18.104.22.168 – 22.214.171.124
126.96.36.199 – 188.8.131.52
184.108.40.206 – 220.127.116.11
18.104.22.168 – 22.214.171.124
Lots of ranges there… one way to pull out the bots is to use regular expressions like the following:
Note: if you’re using the default script with nawk, you won’t need to make use of the regular expressions below.
These should cover the 4 ranges listed above. Right now we’ve (mostly) seen one GoogleBot address for a long time (126.96.36.199).
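As a rough illustration of matching by address, here is a sketch that keeps only the lines whose client address begins with a given prefix, such as the 66.249 block mentioned earlier. It assumes your logs record raw IPs (not reverse lookups) and that the address is the first field on each line, as in the common Apache formats. The helper name filter_by_ip_prefix is made up for this example and is not part of logfilt, and a single prefix is not a complete list of any spider’s addresses.

```shell
# Sketch: print log lines whose client address starts with a dotted prefix.
# Assumes the client IP is the first thing on each line.
# filter_by_ip_prefix is a hypothetical helper, not part of logfilt.
filter_by_ip_prefix() {
    prefix=$1
    shift
    # Escape the dots so they match literally, anchor at the start of the
    # line, and require a dot after the prefix so 66.249 cannot match 66.2499.
    pattern="^$(printf '%s' "$prefix" | sed 's/\./\\./g')\\."
    # -h leaves file names out of the output when several logs are given.
    grep -hE -- "$pattern" "$@"
}

# Example (file name is a placeholder):
# filter_by_ip_prefix 66.249 access_log | less
```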
Why it’s useful
This script gives you data output chronologically for the parameters you specify. You can specify a search spider such as Googlebot or Slurp and see its crawling patterns. A well-behaved spider should land on the robots.txt file with a 200 status code to initiate its crawl.
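To spot-check that robots.txt behaviour by hand, something like the following sketch could work. It assumes the Apache combined log format, where the requested path is the 7th whitespace-separated field and the status code is the 9th, and it matches the spider name anywhere on the line. The helper name first_request is made up for this example and is not part of logfilt.

```shell
# Sketch: print the path and status code of the first request in the log
# that mentions a given spider name (case-insensitive).
# Assumes combined log format: path is field 7, status is field 9.
# first_request is a hypothetical helper, not part of logfilt.
first_request() {
    agent=$1
    shift
    awk -v agent="$agent" '
        BEGIN { agent = tolower(agent) }
        # Stop at the first line that contains the spider name.
        index(tolower($0), agent) { print $7, $9; exit }
    ' "$@"
}

# Example (file name is a placeholder):
# first_request googlebot access_log
```

If the logs are in chronological order and the spider is well behaved, you would expect to see /robots.txt and 200 here.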
You can also use this script to output all 404 errors by user agent (or leave the segment off and see all 404s), or any status code you’re concerned with (does a client have a lot of 302s? Use this to output a list of all 302 status codes returned, segmented by user agent). Another cool application is to view a snapshot of pages crawled under a specific sub-directory.
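For a feel of what this kind of filtering involves, here is a small awk sketch that lists the paths under one directory that returned a given status code. This is not how logfilt itself is written; it is an illustration that assumes the Apache combined log format (path in field 7, status in field 9). The helper name requests_under and the /blog/ directory are made up for the example.

```shell
# Sketch: list requested paths under a directory that returned a given
# status code. Assumes combined log format: path is field 7, status is 9.
# requests_under is a hypothetical helper, not part of logfilt.
requests_under() {
    code=$1
    dir=$2
    shift 2
    awk -v code="$code" -v dir="$dir" '
        # index(...) == 1 means the path begins with the directory.
        $9 == code && index($7, dir) == 1 { print $7 }
    ' "$@"
}

# Example (directory and file name are placeholders):
# requests_under 404 /blog/ access_log | sort | uniq -c
```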
Can I see Examples?
Here are some examples of how to use logfilt for SEO diagnostics (using log files with reverse lookups by IP enabled):
To find all 500 errors for any user agent:
% logfilt -R 500 access_log
To find all the 301s that GoogleBot sees (or, substituting the bracketed alternatives, MSN or Yahoo) and output them to a new document called 301.txt:
% logfilt -H googlebot [msn] [yahoo] -R 301 access_log > 301.txt
To use your cache date to pull crawl data for GoogleBot (assuming your cache was July 16), bookended with before and after logs and piped to the pagination command less, issue:
% logfilt -H googlebot log-Jul-15 log-Jul-16 log-Jul-17 | less
Other examples for you to play around with:
% logfilt -H google logfile.txt -R 302 logfile2.txt
% logfilt -H "google|yahoo" logfile.txt logfile2.txt logfile3.txt
% logfilt logfile.txt -H googlebot -L 250
Here’s a screenshot of the output. The command issued was:
logfilt -H yahoo access_log | less
Note that on this server, reverse lookups are enabled for IPs.
Share the love
This tool is given as-is, with no guarantees of course. I think you’ll find it pretty useful – especially when you need to crunch a few gigs’ worth of log files. Please use it and share it however you like. All we ask is that you credit AudetteMedia and Edward Arenberg for the work involved in creating it. A link to this post would be a great thank you!
Download the Script
Both nawk and awk versions are included in the zip file.
You can either right-click and choose “Save as…” for these, or simply click them to copy and paste the scripts. If you do the latter, follow the guidelines above for creating a Unix executable.
Have fun and let me know your thoughts!