Robots.txt Best Practices for SEO

Application of the robots.txt file typically falls into two opposing schools of thought: either its entries are taken for granted as rote but necessary directives for search engines to follow (and required by CMSs), or there is an omnipresent fear (rightly placed) of adding any entry to the file lest it block search engine access to something critical on the site.

[Image: robots.txt talking robot. Image credit: robotstxt.org]

What’s missing in this polarized way of thinking is a middle way that uses robots.txt for the good of the SEO campaign.

Many robots.txt best practices are well established, and yet we continue to see incorrect information spread in prominent places, such as this recent article on SEW. There are several points stated in the piece that are either fundamentally wrong, or that we strongly disagree with.

How To Use The Robots.txt File For SEO

There are several best practices that should first be covered:

  • As a general rule, the robots.txt file should never be used to handle duplicate content. There are better ways.
  • Disallow statements within the robots.txt file are hard directives, not hints, and should be thought of as such. Directives here are akin to using a sledgehammer.
  • No equity will be passed through URLs blocked by robots.txt. Keep this in mind when dealing with duplicate content (see above).
  • Using robots.txt to disallow URLs will not prevent them from appearing in Google’s search results (see below for details).
  • When Googlebot is specified as its own user agent, it ignores the rules listed under the wildcard (*) user agent, regardless of the order the groups appear in, and follows only the rules in its own group. For example, this Disallow directive applies to all user agents:

User-Agent: *
Disallow: /

  • However, in this example, all other user agents are blocked from the entire site, while Googlebot follows only its own group and is blocked only from /cgi-bin/:

User-Agent: *
Disallow: /
User-Agent: Googlebot
Disallow: /cgi-bin/

  • Use care when disallowing content. The following syntax will block the directory /folder-of-stuff/ and everything located within it (including subfolders and the assets they contain):

Disallow: /folder-of-stuff/

  • Limited pattern matching is supported (full regular expressions are not). This means you can use wildcards to block all content with a specific extension, such as the following directive, which will block PowerPoint files:

Disallow: /*.ppt$

[Image: robots.txt is a sledgehammer]

  • Always remember that robots.txt is a sledgehammer and is not subtle. There are often other tools at your disposal that can do a better job of influencing how search engines crawl, such as the parameter handling tools within Google and Bing Webmaster Tools, the meta robots tag, and the x-robots-tag response header (an example of the latter follows below).
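
Of those alternatives, the x-robots-tag is probably the least familiar: it carries the same instructions as the meta robots tag, but in the HTTP response header, which makes it usable for PDFs and other non-HTML assets. A generic illustration (not taken from any particular site):

X-Robots-Tag: noindex

Unlike a robots.txt disallow, this still lets Googlebot crawl the URL itself.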

Setting A Few Facts Straight

Let’s correct a few statements the previously cited SEW article stumbled on.

Wrong:

“Stop the search engines from indexing certain directories of your site that might include duplicate content. For example, some websites have “print versions” of web pages and articles that allow visitors to print them easily. You should only allow the search engines to index one version of your content.”

Using robots.txt for duplicate content is almost always bad advice. Rel canonical is your best friend here, and there are other methods. The example given is especially important: publishers with print versions should always use rel canonical to pass equity properly, as these often get shared and linked to by savvy users.
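
As a sketch of that print-version scenario (the URL here is hypothetical), the printer-friendly page would carry a canonical link element in its <head> pointing back to the primary article:

<link rel="canonical" href="http://www.example.com/articles/robots-txt-best-practices/">

This consolidates links earned by the print URL onto the main version without hiding anything from the crawler.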

Wrong:

“Don’t use comments in your robots.txt file.”

You should absolutely use comments in your robots.txt file; there is no reason not to. In fact, comments here can be quite useful, much like commenting source code. Do it!

# The use of robots or other automated means to access the Adobe site
# without the express permission of Adobe is strictly prohibited.
# Details about Googlebot available at: http://www.google.com/bot.html
# The Google search engine can see everything
User-agent: gsa-crawler-www
Disallow: /events/executivecouncil/
Disallow: /devnet-archive/
Disallow: /limited/
Disallow: /special/
# Adobe's SEO team rocks

Wrong:

“There’s no “/allow” command in the robots.txt file, so there’s no need to add it to the robots.txt file.”

There is a well-documented Allow directive for robots.txt. This can be quite useful, for example, if you want to disallow URLs based on a matched pattern but allow a subset of those URLs. The example given by Google is:

User-agent: *
Allow: /*?$
Disallow: /*?

… where any URL that ends with a ? is crawled (Allow), and any URL with a ? somewhere in the path or parameters is not (Disallow). To be fair, this is an advanced case where something like Webmaster Tools may work better, but having this type of constraint is helpful when you need it. Allow is most definitely ‘allowed’ here.
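
To make the precedence concrete, consider two hypothetical URLs (Google applies the most specific matching rule, and prefers Allow when rules conflict):

http://www.example.com/widgets?            crawled  (Allow: /*?$ matches and is the more specific rule)
http://www.example.com/widgets?color=blue  blocked  (only Disallow: /*? matches)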

Robots.txt and Suppressed Organic Results

Blocked content can still appear in search results, leading to a poor user experience in some cases. When Googlebot is blocked from a particular URL, it has no way of accessing the content. When links point to that content, the URL is often displayed in the index without snippet or title information. It becomes a so-called “suppressed listing” in organic search.

[Image: URLs blocked with robots.txt in Google's index]

One important note: while robots.txt will create these undesirable suppressed listings, use of meta robots noindex will keep URLs from appearing in the index entirely, even when links point to them (astute readers will note this is because meta noindex URLs are still crawled). However, using either method (meta noindex or robots.txt disallow) creates a wall that prevents the passing of link equity and anchor text. It is effectively a PageRank dead end.
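
For reference, the page-level tag discussed above is a single line in the document’s <head>; the follow value is the default and is included here only to make the intent explicit:

<meta name="robots" content="noindex, follow">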

Common Gotchas with Robots.txt

  • As described above, if the user-agent Googlebot is specified, Googlebot ignores the generic (*) rules in the file and follows only the directives in its own group.
  • Limited pattern matching is supported: wildcards (*) and the end-of-URL anchor ($) will work, but other regular expression syntax (such as ^) will not.
  • Ensure CSS files are not blocked in robots.txt. For similar reasons, JavaScript assets that assist in rendering rich content should also be omitted from disallow statements, as blocking these can cause problems with snippet previews (see the sketch after this list).
  • It may sound obvious, but exclude content carefully. This directive will block the folder “stuff” and everything beneath it (note trailing slash):

Disallow: /folder/stuff/

  • Verify your syntax with a robots.txt testing tool. Sadly, Google will remove the robots.txt tool from within Webmaster Tools. This is a bit of a loss, as it has been a quick and handy way to double-check syntax before pushing robots.txt changes live.
  • Remember that adding disallow statements to a robots.txt file does not remove content. It simply blocks access to spiders. Oftentimes when there is content that you want removed, it’s best to use a meta noindex and wait for the next crawl.
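
Relating to the CSS and JavaScript point above, if a CMS template forces a broad disallow on an assets directory, explicit Allow rules can carve the rendering assets back out. A minimal sketch, assuming a hypothetical /assets/ path:

User-agent: *
Disallow: /assets/
Allow: /assets/*.css$
Allow: /assets/*.js$

Google resolves this kind of conflict in favor of the Allow rules here, since they are the more specific match; other crawlers may interpret Allow differently, so test before relying on it.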

Adam Audette is the Chief Knowledge Officer of RKG.

Comments

16 Responses to “Robots.txt Best Practices for SEO”
    1. Nick Roshon says:

      Great post, Adam! Nice to see you take on some common misconceptions and clear the air.

      It might be helpful to share your recommendation on mobile sites & robots.txt too – I often see that folks will recommend blocking desktop content to googlebot-mobile and block mobile content to regular googlebot via robots.txt for SEO purposes, but personally I don’t think this is a very good strategy as it limits Google’s ability to find & classify that content appropriately via its various crawlers. Thoughts?

    2. Adam Audette says:

      Nick, that’s a great point. We often see the same thing, especially with regards to sites blocking Googlebot from the mobile version. It was a bad idea a year ago, and it’s even worse of a practice today considering the great strides Google has made with mobile SEO. Ben Goodsell has an excellent rundown on handling mobile for SEO that is pertinent to this topic: http://searchenginewatch.com/article/2188344/The-New-Mobile-SEO-Strategy

    3. Suranga says:

      Excellent post, Adam! Cleared many doubts I had on robots.txt and the noindex meta tag. Thanks for sharing.

    4. Holly says:

      Thank you so much for this article. I’ve been trying to figure this robots.txt thing out for several weeks now. With over 40,000 users, Squarespace (a hosted blogging/website platform) uses many of the “wrong” things you mention in EVERY single site (e.g. Disallow: categories; Disallow: css, etc.). In fact, I think there’s always at least 10 disallows listed. To add insult to injury, they don’t escape the disallowed URLs, so it can really mess with your site if you happen to start other pages with the same word/phrases. Unfortunately, we don’t have the ability to add meta robots tags to individual pages (at least not noindex, follow) or canonical links. The image of the description-less SERPs is exactly what happened in my “case study” in a post I have written on the problem. http://www.squarespaceplugins.com/squarespace-case-study-okaygeek-seo-categories-robots/ Can you offer any alternatives if you don’t have the ability to add the tags and canonical links you suggest? We can change the robots.txt if we want, and that’s the only thing I’ve been able to think of that could make a difference.

    5. Adam Audette says:

      Holly, since the correct way to handle blog category URLs is usually ‘meta noindex, follow’ I would recommend really pushing for that functionality in the CMS. I don’t know about your specific limitations, but as an additional meta tag field, adding it doesn’t usually mean lots of time and resources (since they’re probably already supporting other meta fields, right?).

      If you can modify the response codes you could always use x-robots-tag but I have a hunch this wouldn’t be possible if you can’t add a new meta field. Handy if you can implement it: https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag

    6. Mary Kay Lofurno says:

      Nice post, very refreshing. Loved the point on not using robots.txt to fix duplicate content. Agree that citation at the page & author level has become important.

      Thanks, Mary Kay

    7. Adam,

      Great article, always good to see topics that cause confusion to be explained in a simple way.

      One correction to your article I think you should consider is regarding the use of meta robots noindex & it being a PageRank sinkhole. PageRank flows into URLs that are blocked using robots.txt and/or excluded by meta robots noindex. However, PageRank only flows out of URLs that aren’t blocked using robots.txt (but can still be excluded using meta robots noindex), because Google is able to crawl the URL to determine the links which PageRank should flow through.

      The above was covered directly in an Eric Enge & Matt Cutts interview in 2007. I’m not aware of Google making a public statement to contradict that interview since then, do you know of one?

      One scenario from the above that isn’t clear is what happens if a URL has been indexed and is later blocked using robots.txt. For quite a while after the block is put in place, Google will continue to show the SERP snippet based on what they were last able to crawl. Do the outgoing links on that page stop passing PageRank when Google first discovers the robots.txt block, when Google stops showing the old cached snippet, or at some other time?

      Regards,
      Al.

    8. Adam Audette says:

      Alistair – fantastic comments, thank you. I’m not sure we can say if PageRank flows into URLs blocked with robots.txt. We haven’t done any tests to ascertain if that’s the case. As a completely ‘blocked’ URL, I’m not totally sure if PR could be passed correctly. Remember, Google knows nothing about those URLs. All it knows is about the links that point to those URLs. I don’t believe a URL could accumulate PR without Google being able to access it and place it in the index.

      However, you are correct that URLs annotated with meta noindex most certainly can accumulate PageRank. Now, how that PR and anchor text (and other signals) are passed or flowed out of ‘noindex, follow’ URLs is a bit of an open question. There’s no reason noindex’d URLs should NOT flow PR. This is precisely why we recommend noindex, follow for our “Classic” pagination technique documented here: http://searchengineland.com/the-latest-greatest-on-seo-pagination-114284

      But remember, the link flow from those noindex’d URLs is a bit of a second order effect.

      As to your last scenario, to me all this is simplified by thinking about robot access. How can outgoing links on a URL pass anything if Googlebot can’t access the page?

    9. Adam,

      I think you’ll find that PageRank flows into a robots.txt blocked URL just the same as normal. This is why Google is able to show search results for URLs that have been blocked. While the search result is inferior because they can’t crawl the URL, Google use other signals such as link text pointing to that URL to help form the search result.

      Al.

    10. Nikolas says:

      Hi Adam – thank you for that concise overview on robots.txt! For testing purposes you might want to check out the roboxt! Firefox Add-On that I wrote. It shows whether the current URL is blocked via robots.txt.

      The Add-On at Mozilla: https://addons.mozilla.org/en-US/firefox/addon/roboxt/
      And a short manual in my blog: http://nikolassv.de/roboxt-en/ (in English)

    11. Adam Audette says:

      Alistair – we’re planning a few tests to see if your hypothesis holds water.

      Nikolas – thanks for the tool. I’ll check it out!

    12. Dan Richardson says:

      Good article. Worth adding something on the location of robots.txt? i.e. the root.
      Seen a few times people adding it to the wrong folder. thx

    13. Nice one, Adam. Don’t forget to mention, in your robots.txt, the location of the sitemap.xml file. And a list of disallowed bots such as xenu would also come in handy :)

    14. Adam Audette says:

      Ovidiu – great points, thanks.

    Trackbacks
    Check out what others are saying...
    1. [...] 1) Disallow all in robots.txt – Possibly the most horrifying specter in all of SEO.  Believe it or not we’ve seen this in the wild, and it wasn’t pretty.  The fix is pretty easy of course, granted you’re able to quickly figure out why your indexed pages have suddenly dropped to 0.  As with all things, moderation is key when it comes to disallow statements, so be careful.  If you need a refresher on how to properly use the robots.txt file, check out “Robots.txt Best Practices for SEO”. [...]

    2. [...] the sledgehammer of SEO, disallow rules here will put a brick wall between your content and Googlebot. This can be a very [...]