THE RKGBLOG

GWT: Fact Finder or Over Eager Beaver?

Bizarro GWT URLs

Many SEOs have mined data from Google Webmaster Tools (GWT) over the years; it’s a vital, constantly changing part of our daily analysis work. Without GWT, we would be working far more in the dark, with much less information at our fingertips. And Google, thankfully, has kept GWT from stagnating, both by adding new functionality (URL Parameters, Index Status and Structured Data charts) and by removing or relocating existing functionality (the Site Latency chart recently moved to Google Analytics).

One area we often look to when assessing the health of a site is the “Crawl Errors” section of GWT. Extremely valuable information can be obtained here, including possible systemic issues, specters of website iterations past, and current, pertinent troubles that need to be addressed. Furthermore, Google recently added the ability to see the incoming links that point to a page returning a status code other than 200. While this data is very useful for recovering lost external link equity, it also raises some interesting questions about where Google finds those links.

After looking at the “Other” section on a client report recently, we inquired about the origin of some links. The client, however, had never seen those kinds of URLs on their site.

GWT Crawl Errors

As SEOs, we can fix a problem when we’re able to see where links originate, especially when a client links to pages from the past. In this example, all the links were coming from the same directory.

Linked From URLs in GWT crawl error report
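Spotting that the links share a directory is easy to do by eye on a short report, but a small script helps on longer ones. The following is a minimal sketch, assuming the “Linked from” URLs have been copied out of the report into a list; the function names and sample URLs are hypothetical and are not part of GWT.

```python
from collections import Counter
from urllib.parse import urlsplit


def directory_of(url):
    """Return host plus the path up to the last slash, e.g. 'example.com/forum/'."""
    parts = urlsplit(url)
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return parts.netloc + directory


def count_by_directory(linked_from_urls):
    """Count linking URLs per directory, most common first."""
    return Counter(directory_of(u) for u in linked_from_urls).most_common()


# Hypothetical "Linked from" values, for illustration only.
linked_from = [
    "http://forum.example.com/threads/1234.html",
    "http://forum.example.com/threads/5678.html",
    "http://blog.example.org/2012/post.html",
]

for directory, count in count_by_directory(linked_from):
    print(count, directory)
```

A directory that dominates the count is a good candidate for a single systemic cause, such as a template or script on the linking site.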

We figured the solution was simple: go to the forum page, find the offending links to the client site, report them to the forum and request they update the links to the proper URLs. Oddly, we were unable to find an href="<url>" attribute containing the URL shown in the report. To find anything that resembled these links, we had to look at the JavaScript in the page source:

Javascript source for a link

Notice how the c = `US`; in the JavaScript snippet lines up with the abnormal links we saw in the GWT report. This pattern appears nowhere else on the page but, fortunately, we found many other examples of it elsewhere, repeated over and over again.

Google doesn’t create these links solely from JavaScript; it tries to cobble a link together from anything that could “look” like a URL outside of an <a> tag:

<a href="DOMAIN.com">XXX XXX Indoor Planters Organize It All Planters | COMPANY</a></b><br />Free shipping on orders over $75! COMPANY carries many XXX XXX indoor planters organize it all planters<BR /><span>www.domain.com/XXX-XXX-indoor-planters-organize-it-all…</span>

GWT picked up the text between the <span> tags (on-page content, not an actual link) as the source of the URL. In the report, it displayed as:

/XXX-XXX-indoor-planters-organize-it-all…

Again, in this specific example, the same issue occurred many times: the URL never appeared in an actual link to the page, only as text outside the <a> tags.
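To check a linking page for this kind of text-only URL yourself, something like the following can help. It is a rough sketch using Python’s standard library; the `URL_LIKE` pattern and the helper names are our own assumptions for illustration, and this is not how Google actually detects links.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern for text that "looks" like a URL; not Google's actual rule.
URL_LIKE = re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)


class TextUrlFinder(HTMLParser):
    """Collects real <a href> values and URL-like strings found in page text."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.text_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

    def handle_data(self, data):
        # Called for text between tags (including anchor text).
        self.text_urls.extend(URL_LIKE.findall(data))


def find_text_only_urls(html):
    """Return URL-like text snippets that do not match any href on the page."""
    finder = TextUrlFinder()
    finder.feed(html)
    finder.close()
    return [u for u in finder.text_urls if u not in finder.hrefs]


sample = (
    '<p>see www.example.com/old-page for details</p>'
    '<a href="http://example.com/">home</a>'
)
print(find_text_only_urls(sample))
```

Any URL this returns exists on the page only as text, which is the kind of string that can end up in the “Linked from” data without a matching <a> tag.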

The takeaway is that links can show up as regular content with no <a> tag in the vicinity:

Crawl error linked from report

Source code for a GWT URL error

In fact, we were tipped off about this nearly two years ago in Google’s response to a question about “bizarro URLs that never existed.”

Q: Most of my 404s are for bizarro URLs that never existed on my site. What’s up with that? Where did they come from?
A: If Google finds a link somewhere on the web that points to a URL on your domain, it may try to crawl that link, whether any content actually exists there or not; and when it does, your server should return a 404 if there’s nothing there to find. These links could be caused by someone making a typo when linking to you, some type of misconfiguration (if the links are automatically generated, e.g. by a CMS), or by Google’s increased efforts to recognize and crawl links embedded in JavaScript or other embedded content (emphasis added); or they may be part of a quick check from our side to see how your server handles unknown URLs, to name just a few. If you see 404s reported in Webmaster Tools for URLs that don’t exist on your site, you can safely ignore them. We don’t know which URLs are important to you vs. which are supposed to 404, so we show you all the 404s we found on your site and let you decide which, if any, require your attention.

Ultimately, Google is trying to put together anything it thinks “might” be a link, whether in an <a> tag, JavaScript, plain text, etc. Google states that 404 pages aren’t a big deal and are a natural part of the web. But we believe these links can offer really good insight into a client’s past linking practices, problems with how code is represented on linking sites, and systemic problems within a client’s internal linking. SEOs who understand that these links can exist, and how best to explain their existence, can help their clients understand this head-scratcher of a phenomenon as well.

Special thanks to RKG’s Craig Zagurski for editing my posts and making them understandable to human eyeballs.
