
Free SEO Diagnostics Tool: Log File Parsing Script

A cornerstone of search engine optimization is the analysis of server log files. Analytics tools are essential, but we also need direct access to the raw logs to study bot behaviour and to spot errors, patterns and potential crawling issues that GUI analytics can’t surface.

We use a number of tools to diagnose crawling and indexing patterns. Google’s webmaster console is excellent, but in my experience (especially with large sites) crawling errors are shown as indicators rather than comprehensive reports – good as a heads-up that deeper analysis is needed. That’s where command line scripts come into the picture. They can parse huge log files for critical information, such as search engine spider behaviour, or error codes like 500s that may signal coding issues. They can also pull out data on how individual pages or specific categories are being crawled, for example by using patterns to match crawler activity beneath a specific sub-domain or directory.
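To give a flavour of the kind of one-liner logfilt replaces, here’s a minimal sketch that counts the status codes returned to Googlebot, assuming Apache combined-format logs (where the status code is the ninth whitespace-separated field and the user agent sits at the end of each line):

awk '/Googlebot/ { codes[$9]++ } END { for (c in codes) print c, codes[c] }' access_log

Handy, but fiddly to retype and tweak – which is exactly what logfilt’s flags are for.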

When working with lots of logged data, a fast script that pulls out the criteria we assign and outputs them in an easy-to-read format is mighty handy.

On that note, I’d like to announce a free log parsing script we’re offering to the SEO community. It’s a simplified version of the scripts we use internally for our clients. Please use it, abuse it, improve it and pass it around freely. We only ask that you credit AudetteMedia and its author, Edward Arenberg, with the script. I’d also like to thank my friend and colleague Aaron Shear for the inspiration.

What is it?

logfilt is a command line tool for Unix (and Unix-based systems such as OS X). It lets us quickly parse key criteria out of log files using simple commands. Because it’s written in sed and nawk, it’s fast and handles large log files with ease. It’s a very basic, elegant and efficient script. Plus it’s fun to use!

How do I install it?

You can download your preferred package below. Both nawk (default) and awk versions are included in the zip file.

Depending on how you save logfilt on your system, you may need to make it a Unix executable file. Either copy and paste the script below into a new file saved as logfilt, or just right-click and choose “Save as…”. Then open a terminal window and run the following command on the file:

chmod +x logfilt

This should make it an executable program.
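Optionally, if you’d like to run it from any directory, move it somewhere on your PATH (this assumes you have sudo rights and that /usr/local/bin is on your PATH):

sudo mv logfilt /usr/local/bin/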

How do I use it?

We’re also offering an awk version of logfilt. nawk is preferable to awk in this application because the latter doesn’t match against variables, only static text. However, OS X doesn’t ship with nawk by default, so if you’re on a Mac like us, use the awk version instead.

You can always access the logfilt help by simply running:

logfilt -h

The full command syntax is as follows:

logfilt [options] file1 file2…

Options

-H Host Filter (regexp)
-R Return Code Filter
-L Limit Number of Host Matches
-h Access this Help

Here’s how to use logfilt to analyze your log files:

Unless you’ve opted to install logfilt on a remote server, go ahead and download the log files to your computer and run the script locally. To combine your logs, change into the directory that holds them and issue the following:

cat * > access_log

There’s no need to create access_log ahead of time – the redirect does that for you – and if your log file names don’t sort chronologically, list them explicitly in date order rather than relying on the wildcard.

Finally, set logfilt loose.
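For example, a quick first pass over the combined file might look like this (a sketch, assuming your server does reverse DNS lookups so the host field contains googlebot.com hostnames – if not, filter by IP range as described below):

logfilt -H googlebot access_log | less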


Alternatively, rather than combining your log files, you can isolate and analyze them separately by date (if you do concatenate several, keep them in chronological order). Run a cache:domain.com query at Google to find out when Googlebot last crawled and cached your site, then pull the log files from just before and after the cache date and run diagnostics on those. You should be able to capture a good view of Googlebot’s behaviour on your domain.

Where are the bots?

Googlebot crawls from a variety of IP blocks. Lately I’ve seen it arriving from 66.249.xx.xx, though Google doesn’t appear to own that entire class-B address range. Apparently Google has reserved a large number of blocks but isn’t using them all. If you have a good resource for finding up-to-date IP information for search spiders, please let me know in the comments.
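One reliable way to confirm that a given address really is Googlebot, whatever block it arrives from, is a reverse DNS lookup followed by a forward lookup on the returned name: genuine Googlebot IPs resolve to googlebot.com hostnames, which in turn resolve back to the same IP. A quick sketch using the address mentioned further down (the hostname in the comment is the expected pattern, not captured output):

host 66.249.73.46   # should return something like crawl-66-249-73-46.googlebot.com
host crawl-66-249-73-46.googlebot.com   # should resolve back to 66.249.73.46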

You can use regular expressions to pull bots out once you have their IP ranges. Here are some of the ranges Google has reserved for Googlebots:

64.68.88.0 – 64.68.95.255
64.233.160.0 – 64.233.191.255
66.102.0.0 – 66.102.15.255
66.249.64.0 – 66.249.79.255

Lots of ranges there… one way to pull out the bots is to use regular expressions like the following:

Note: if you’re using the default script with nawk, you won’t need to make use of the regular expressions below.

“^64.68.(8[8-9]|9[0-5])”
“^64.233.1([6-8][0-9]|9[0-1])”
“^66.102.([0-9]|1[0-5])\.”
“^66.249.(6[4-9]|7[0-9])”

Quick regexp tutorial: the () groups terms, [] gives you a set of options, “-” gives a character (or number) range, and | is the “or” operator. The first regexp above reads: at the beginning of the entry, look for 64.68. followed by an 8 with an 8 or 9 after it, or a 9 with a 0 through 5 after it. Thanks to Ed Arenberg for this explanation.

These should cover the four ranges listed above. For quite a while now we’ve (mostly) seen a single Googlebot address (66.249.73.46).
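To put one of those patterns to work (a sketch, assuming your logs record raw IPs rather than reverse-resolved hostnames, and using the quoted-regexp form of -H shown in the examples below):

logfilt -H "^66.249.(6[4-9]|7[0-9])" access_log | less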

Why it’s useful

This script outputs data chronologically for the parameters you specify. You can specify a search spider such as Googlebot or Slurp and see its crawling patterns. A well-behaved spider should request the robots.txt file, and receive a 200 status code, before initiating its crawl.
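A quick way to check that behaviour (a sketch, assuming reverse lookups are enabled so -H googlebot matches, and that logfilt echoes the matching log lines):

logfilt -H googlebot access_log | grep robots.txt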

You can also use this script to output all 404 errors by user agent (or leave the segment off and see all 404s), or any status code you’re concerned with (does a client have a lot of 302s? Use this to output a list of all 302 status codes returned, segmented by user agent). Another cool application is to view a snapshot of pages crawled under a specific sub-directory.
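A couple of sketches along those lines (the /blog/ path is just a placeholder – substitute a sub-directory of your own):

logfilt -H slurp -R 404 access_log
logfilt -R 302 access_log
logfilt -H googlebot access_log | grep "GET /blog/"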

Can I see examples?

Here are some examples of how to use logfilt for SEO diagnostics (using log files with reverse lookups by IP enabled):

To find all 500 errors for any user agent:

% logfilt -R 500 access_log

To find all the 301s that Googlebot, MSNbot, or Yahoo’s Slurp sees and output them to a new file called 301.txt (the bracketed terms in the command are alternatives – substitute msn or yahoo for googlebot as needed):

% logfilt -H googlebot [msn] [yahoo] -R 301 access_log > 301.txt

To use your cache date to pull crawl data for Googlebot (assuming your cache date was July 16), bookended by the logs from the day before and the day after and piped to the pager less, issue:

% logfilt -H googlebot log-Jul-15 log-Jul-16 log-Jul-17 | less

Other examples for you to play around with:

% logfilt -H google logfile.txt -R 302 logfile2.txt
% logfilt -H “google|yahoo” logfile.txt logfile2.txt logfile3.txt
% logfilt logfile.txt -H googlebot -L 250

Here’s a screenshot of the output. The command issued was:

logfilt -H yahoo access_log | less

Note that on this server, reverse lookups are enabled for IPs.

Share the love

This tool is given as-is (and no guarantees of course). I think you’ll find it pretty useful – especially when you need to crunch a few gigs worth of log files. Please use it and share it however you like. All we ask is that you credit AudetteMedia and Edward Arenberg for the work involved in creating it. A link to this post would be a great thank you!

Download the Script

Both nawk and awk versions are included in the zip file.

You can either right-click and choose “Save as…” for these, or simply click them to copy and paste the scripts. If you do the latter, follow the guidelines above for creating a Unix executable.

logfilt zip file

Have fun and let me know your thoughts!

  • Adam Audette
    Adam Audette is the Chief Knowledge Officer of RKG.
  • Comments
    15 Responses to “Free SEO Diagnostics Tool: Log File Parsing Script”
    1. Really cool Adam. Thanks for sharing.

    2. Adam Audette says:

      @Brian you’re most welcome, glad you like it.

      Here’s the post on Sphinn: http://sphinn.com/story/59457

      I really need to get a sphinn badge on our site. Added to my to-do’s….

    3. Hi Adam,

      Pretty neat tool you got here. Thanks for sharing.

      Have you checked iplists.com for search engine IPs? I think they have a well-updated IP database.

      Probably not exactly what you are looking for, but you can also check out a script that we just finished named LogZ Free Log Analyzer for User-Agent and IP Detection. It’s aimed more at blocking scrapers and bad bots that can affect your overall SERPs.

    4. Adam Audette says:

      @Augusto thanks for the tip to check out http://www.iplists.com, just what I was looking for. LogZ looks very cool – well done. Thanks for sharing.

    5. seo pixy says:

      Thanks a lot for sharing this great tool and for the tutorial, it was really helpful:)

    6. Adam, you and Aaron are truly some advanced SEO techs. Keep up the great work!

    7. Up to date IPs? Sounds like you need to get Fantomaster’s product. Fantomaster.com

    8. Voz says:

      How can I run the tool on a Microsoft Windows machine where I have access to the log files locally?

    9. Adam Audette says:

      @Voz you’ll have to comb through it manually or build your own script, or even better access a shell account on Unix and install logfilt there.

    10. Sen Hu says:

      Excellent for UNIX.

      On Windows, I use biterscripting. I have downloaded it free from their website. You can too. There is a sample script for log parsing at http://biterscripting.com/Download/SS_WebLogParser.txt .

      Sen

    11. seo company says:

      Thanks for the SEO parsing tool. Just what I was looking for.

    12. Vahe says:

      Hi Adam,

      More of a basic question – how do you approach clients to ask them to regularly obtain their log analysis files?

      Thanks,

      Vahe

    13. Adam Audette says:

      Few ways you can do it. Easy way is to set up a cron job to upload them weekly to something like Amazon AWS.
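
      For anyone who wants to try that, here’s a minimal sketch of a weekly cron entry (the paths and bucket name are hypothetical, and the aws command assumes Amazon’s command line tools are installed and configured):

      0 3 * * 0 /usr/bin/aws s3 cp /var/log/apache2/access.log s3://your-bucket/logs/access-$(date +\%F).log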

    14. Vahe says:

      Any recommended resources I can follow up on for setting up a cron job? I haven’t done this before. Also, would you be able to get it uploaded to a storage service like Dropbox?

    15. Adam Audette says:

      You could potentially use any cloud-based storage service, yes. The details of setting up automated scripts are beyond the scope of this article. Check this out for starters: http://drupal.org/node/23714 (note: this is specific to Drupal, but cron itself is a Unix utility).

      Best of luck, Vahe!