Here are 9 juicy takeaways from Joachim Kupke's presentation at SMX East in NYC this month. Overall it was a terrific conference, other than the cursed Javits Center constantly causing issues with the wifi (or freezing us, or creating AV headaches). Danny Sullivan (the conference organizer, for those living under rocks and things) repeatedly said things like, "Javits sucks!" and "blame Javits, don't blame us!". We blame Javits, Danny.
Notably absent from SMX East this year were regular search engine reps like Matt Cutts and Nathan Buggia, but it was great getting to hear from lesser known Google and Microsoft folks like Maile Ohye (Google) and Sasi Parthasarathy (Bing). As an SEO, I'm particularly interested in what the search engines have to say about specific technical issues such as indexation, duplicate content, crawling and redirects, and this conference had a couple of great sessions where a lot of that information was discussed.
There were a few surprises (elaborated below) and a couple of new announcements, but overall the information shared by Joachim and the other search reps was very specific and likely too subtle for anyone outside the 'inner realms' of search engine optimization. I love me some inner realms.
Let's get to it -- here are my 9 SEO takeaways from Joachim's contributions at SMX East.
Joachim Kupke's Presentation on Duplicate Content
Joachim is on the indexing team at Google, and shared some juicy tidbits on how Google handles duplicate content, but also shared a lot of insights into how Google 'sees' the web and indexes URLs. Here are the points that stood out to me.
1. Impressions & Clicks
Joachim repeatedly used the terms 'impressions' and 'clicks' in the context of a URL in Google's index. He mentioned that if they see a URL with very few impressions (or none), it will likely take very long to be updated in the index (no surprise there). However, URLs with a lot of impressions and clicks (or on domains that are important and crawled frequently) will be updated quickly. This makes sense, but it's interesting to hear a search engineer reinforce these things. Those 301s or noindex tags on some pages that aren't being re-crawled and updated in Google? Probably because they're very low priority for the engine (yet another reason why big brands rule in SEO).
2. Infrastructure for Handling Duplicate Content
Google is said to have "a ton of infrastructure for duplicate elimination," some of which includes:
- Detection of recurring URL patterns
- The contents of a page
- The link canonical tag (if all else fails)
Of note: Google recognizes patterns within URLs and rules out certain parameters as the cause of duplication. (They recently released the parameter removal tool in Webmaster Tools, which might be a webmaster-facing outcome of this internal capability.)
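To make the parameter idea concrete, here is a minimal sketch of how duplicate URLs could be grouped once content-neutral parameters are dropped. This is not Google's method, which wasn't described in detail; the IGNORED_PARAMS list, the function names and the example.com URLs are all assumptions for illustration.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters assumed not to change what the page shows.
IGNORED_PARAMS = {"sessionid", "utm_source", "utm_medium", "ref"}


def normalize(url):
    """Drop ignored parameters and sort the rest so equivalent URLs compare equal."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    # Lowercase the host, rebuild the query string, and discard the fragment.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def group_duplicates(urls):
    """Map each normalized URL to the list of raw URLs that collapse into it."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return dict(groups)
```

Feeding a crawl list through group_duplicates() shows which URLs differ only by tracking or session parameters, which are the obvious candidates for a canonical tag or for the Webmaster Tools parameter setting.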
What do they mean by "the contents of a page"? To me this was most interesting, read on...
3. Historical Record of URLs
Google keeps a sort of Archive.org of the web with older versions of content (not really like that at all, but you get the idea: a historical record of pages), for the ability to compare the most recently-crawled version with an earlier version. The contents that change can be subtracted from things that don't change within a site. This may also give Google the ability to ascertain where global elements, shingles, and content stubs appear within a site separately from definitive, unique and changing content.
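For a rough feel of how two crawled versions of a page could be compared, here is a small sketch using word shingles (overlapping runs of words). It is only an illustration of the general idea; Joachim did not describe Google's actual approach, and the function names and the shingle size of 3 are my assumptions.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams ("shingles") in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def split_stable_and_changed(old_text, new_text, size=3):
    """Compare an older crawl of a page with a newer one.

    Returns (stable, changed): shingles present in both versions, and
    shingles that appear only in the newer version.
    """
    old = shingles(old_text, size)
    new = shingles(new_text, size)
    return old & new, new - old
```

Shingles that stay put across versions (or across many pages of one site) look like navigation and other global elements, while the ones that change are more likely the unique content.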
4. Google + rel=canonical = Love
Google loves the rel=canonical link tag (technically a link element, not a meta tag). It has been, in Joachim's words, "tremendously successful" and has seen exponential adoption on the web. They are treating this tag very seriously: it is a "strong hint," as Maile Ohye told us at SMX Advanced in June of this year. This was reinforced by both Maile and Joachim at SMX East. It has "huge impact" on Google's canonicalization decisions: 2 out of 3 times, rel=canonical alters the organic decision. This is big, folks.
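If you want to audit which canonical target a page actually declares, a short script using only Python's standard library can pull it out of the HTML. This is a minimal sketch; the class and function names are mine, and it simply returns the first rel=canonical link it finds.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, e.g. rel="canonical nofollow".
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

Running find_canonical() over a sample of templates is a quick way to spot pages that declare no canonical at all, or that point somewhere unexpected.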
5. 302s are Just Fine for Canonical Targets
302 redirects are fine canonical targets. This was explained at least twice by Joachim, and actually has 2 parts:
- Because of an internal method for handling the trailing slash on URLs, Google needs to have (and recommends all web developers deploy) a trailing slash on canonical targets and internal links. Without the trailing slash, Google will actually add the slash and update the URL in its index. Now, I've found multiple examples of pages where this doesn't happen, but Joachim was pretty firm that it's a web problem in general that Google is forced to work around.
- The takeaway is that you should always add the trailing slash to the absolute URL in the canonical target. If you don't, Google will add it anyway, but adding it proactively should speed up server response times (which may have impact on very large sites).
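Here is a minimal sketch of adding the trailing slash to canonical targets before they are written into templates. The rule it uses (skip anything whose last path segment contains a dot, on the guess that it is a file) is my assumption, not something Joachim specified, and with_trailing_slash is a made-up helper name.

```python
from urllib.parse import urlsplit, urlunsplit


def with_trailing_slash(url):
    """Append a trailing slash to directory-style URLs; leave file-like URLs alone."""
    parts = urlsplit(url)
    path = parts.path or "/"  # a bare host such as http://example.com gets "/"
    last_segment = path.rsplit("/", 1)[-1]
    looks_like_file = "." in last_segment  # e.g. red.html, feed.xml
    if not path.endswith("/") and not looks_like_file:
        path += "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

Only apply something like this if the slashed URL actually resolves on your server; otherwise you have just pointed your canonical at a redirect or a 404.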
6. How 302 Canonical Targets Could be Abused
302 redirects are fine as canonical targets. Yes, I know I just repeated myself. Here's the interesting part for SEOs: if 302s are OK to use here, I can think of a way to use the rel=canonical tag for SEO purposes without doing any heavy lifting on URL structure. How? Read on for a theoretical example:
A site with very poor URL structure (how about this example) would like to improve URLs for SEO and usability reasons. However, the developers are swamped, the technical platform is wonky, they don't have enough money for quality SEO, or they simply don't believe it matters that much to change.
An SEO comes to them with the following proposition:
- Create a table with search-friendly URL versions of every URL to be improved.
- Add these search-friendly URLs as rel=canonical targets in source code.
- 302 the canonical target to the existing (crappy) URL on the site.
- Presto! Pretty URLs in search results.
If the canonical tag acts like a 301 and updates URLs in search indices (which it does), and the target canonical URL redirects with a temporary 302 which doesn't force an update (which it does), then the pretty (and pseudo) URL in the link canonical target will stay in search indices while the ugly, parameter-ridden but non-pseudo URL will act as a temporary page (to spiders). Get it? Interesting. (See the crude flow chart to the right for a visual of this.)
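The moving parts of this theoretical setup fit in a few lines. The sketch below is purely illustrative: the URL table, the example.com host and both function names are hypothetical, and a real site would do the redirect in its web server or framework rather than in a helper like respond().

```python
SITE = "http://www.example.com"  # hypothetical host

# Step 1: table of search-friendly URLs mapped to the existing (ugly) ones.
PRETTY_TO_REAL = {
    "/shoes/red-sneakers/": "/product.php?cat=12&id=9931",
}
REAL_TO_PRETTY = {real: pretty for pretty, real in PRETTY_TO_REAL.items()}


def canonical_tag(real_path):
    """Step 2: the tag to print in the <head> of the existing page."""
    pretty = REAL_TO_PRETTY.get(real_path)
    if pretty is None:
        return ""  # no pretty version defined, so emit nothing
    return '<link rel="canonical" href="%s%s" />' % (SITE, pretty)


def respond(request_path):
    """Step 3: answer requests for a pretty URL with a temporary redirect."""
    real = PRETTY_TO_REAL.get(request_path)
    if real is not None:
        return 302, {"Location": SITE + real}
    return 200, {}
```

The existing page advertises the pretty URL as its canonical, and anyone (or any spider) requesting the pretty URL gets bounced back to the existing page with a 302.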
No, I have no plans to start trying this, but I do know of at least one major ecommerce site doing the practice (I think unintentionally) and it's been working fine since June.
7. Don't Disallow Your Duplicate Content (?)
Google says "please do not use Disallow: directives in robots.txt to annotate duplicate content." Content Google can't get to, Google can't know about, and they don't like that. Their preference seems to be "put it all out there" and we can decide what's best, and anytime content is excluded from search engines they lose that ability. My personal preference is to take more control, not less, but I understand the thinking behind this and why they'd want to say this.
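To see why a Disallow rule hides duplicates from the engine, here is a small sketch using Python's standard robots.txt parser. The robots.txt contents and the /print/ path are made-up examples; the point is simply that a well-behaved crawler never fetches the blocked page, so it never sees its content or any canonical tag on it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks printer-friendly duplicates.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""


def can_crawl(robots_txt, url, agent="*"):
    """Return True if the given user agent may fetch url under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

With these rules, can_crawl(ROBOTS_TXT, "http://www.example.com/print/shoes") comes back False, which is exactly the situation Google is asking webmasters to avoid for duplicate content.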
8. Indexing May Take Very Long for "Unpopular" URLs
Joachim stated that indexing takes time (as I mentioned previously), but especially for "obscure or unpopular" URLs. And while indexing takes time, cleaning up an "existing part of the index" takes an even longer period of time. There are of course ways to issue a crawl from Google (which is separate from an index update, of course), but by and large lesser-known sites don't get the same love popular sites do.
9. Cross-domain Support for the Link Canonical
Google will soon be bringing cross-domain support for the canonical tag. This is fairly huge. Yahoo! and Bing both said that they're still working on simply supporting rel=canonical at all.
Other great stuff at SMX East
There was plenty of other great stuff, too, especially David Mihm, Will Scott, Andrew Shotland, Mike Blumenthal and Mary Bowling on Local SEO Ranking Factors. Local is such an exciting area right now for search marketers, and this crew brought together an amazing session. It got me so fired up that I came back very excited to delve deeper into local.
Update: Laurent Bourrelly (@laurent8) has graciously translated this post into French and posted it here: Redirection 302, contenu dupliqué et autres infos sur l'indexation. Thanks, Laurent!