This article will explore the basic concepts of designing optimized site architectures for efficient spidering by search engines. Building an easily spidered site has ramifications in how pages, sections of a site, and entire domains are topically understood and categorized by bots, which influences indexing and rankings.
While search engine optimization concerns are the focus here, there are many different applications of information architecture (IA) that go beyond search engines. IA overlaps with several other disciplines, including navigability, user experience, and interface design. It's very hard to speak categorically about this subject, because how IA is applied to a site is based largely on business goals, the site infrastructure, user testing, and the whims of the people involved (for real).
At its most fundamental, however, information architecture is about organizing digital inventories so they're easily understood by robots and human beings.
I normally wouldn't want to focus only on the SEO side of site architecture, because it leaves out the end user (and so much more). When providing site architecture recommendations, we work with interface designers and (sometimes) a usability engineer, in addition to the development team. We'll have the results of their usability testing and can balance that with SEO goals. This topic is complex. Like SEO, information architecture seeps into every aspect of web production.
Recently Magnus Brättemark posted a question in the LED Digest about site architecture for multilingual sites. I'm going to incorporate some of my response to that post here since this is a frequent topic of discussion. It also speaks to the complexities inherent in multi-national site architectures.
Finally, we'll touch on findability (again) and how it (perhaps) takes us into the realms of IA in a way that SEO simply cannot. I'm still not convinced findability is the Holy Grail, but I'm becoming more interested in its role, especially how it dovetails with other important factors that make up the web ecology: from information retrieval to usability.
Enough rambling. Let's get started!
What is Information Architecture? Definitions
"We shape our buildings; thereafter they shape us." - Winston Churchill
in•for•ma•tion ar•chi•tec•ture n.
- 1. The structural design of shared information environments.
- 2. The combination of organization, labeling, search, and navigation systems within web sites and intranets.
- 3. The art and science of shaping information products and experiences to support usability and findability.
- 4. An emerging discipline and community of practice focused on bringing principles of design and architecture to the digital landscape.
Four unique definitions, all of them rather hard to hold on to. This encapsulates IA and indicates how theoretical it is. However, it also shows us its flexibility and kinship to SEO: we can define it any number of ways. In its best form, SEO has much in common with IA, which is why most of the best search marketers are deeply skilled in site architecture.
A simpler definition for our purposes:
Information architecture is the semantic structure and organization of digital inventories.
With SEO, we care primarily about delivering relevant content to the spiders in a format that they can easily digest and understand, but we also care about making it usable, credible, attractive, and high-quality. SEO, like IA, has to be part of every aspect of web production, from the initial strategic planning phase to the ongoing preservation of rankings and expansion of content.
We don't create sites for search engines; we create sites for people. Balancing the needs of a spider with the needs of your visitor is critical.
While SEO (and IA for that matter) is not necessarily about design, a deep understanding of usability and interface design principles, with empirical data from testing, will pay large dividends (read: cold hard cash) from improvements in relevance, conversion rates and meritocratic sharing by site visitors. It's all tied together.
Pieces of the IA Puzzle
A site's architecture is built by domains, sections, categories, pages and media (to name but a few). A description of each one of these follows:
- Domains: The registered domain name (such as domain.com), which can have within it multiple sub-domains.
- Sections: These represent the organizational hubs where categories (and sometimes other sections) are located.
- Categories: These represent organizational reference points for pages and media (and sometimes other categories).
- Pages: Web documents in whatever language or platform is used (XHTML, PHP, ASP, etc.), either static or dynamic (or a combination).
- Media: Images, videos, documents (such as PDFs), sound files, etc.
Of course, this is a simplified treatment of a site's structure, but it's accurate enough for our purposes.
Optimized IA: Domains
Domain names are a critical asset that communicates volumes to users early in the searching process. We see your domain name in the SERPs, we see it in print, we hear it spoken. A good domain name can literally make (or break) a site - it's critically important from a marketing perspective.
From an architectural perspective, there are a number of concerns we need to keep in mind. The domain name is the foundation that supports the entire web property. Take care that the following best practices are built into this foundation:
• Semantic Value: a domain with your primary keyword is a very good thing, but it shouldn't be the total focus. It's likely a keyword in the domain (and to a lesser extent, the URL) will score additional relevance points, provided the topical theme of the site matches closely. But it also needs to be short and memorable (or long and memorable, if you can get away with it). It needs to be easy to share with others vocally, and it should reflect your market position or brand. Beware of too many dashes in the domain, as these tend to lower credibility since they've been abused so heavily by spammers. Generally speaking, shorter domains tend to raise credibility because of their higher value and scarcity.
• Canonicalization: with the increased sophistication of search engines (especially Google), concerns about duplicate content will become less and less pronounced. But it's still important, and always a best practice, to rewrite URLs so a "www" and root domain don't both display the same content. With Google, it's likely all you'll need to do is specify a preference between the canonicals in your Webmaster Tools console. However, ensure this is built into your site architecture so other search engines don't hiccup on the duplicates. This also helps ensure consistency among backlinks, since it controls what versions of a URL are likely to be found (and cited), and it simplifies internal linking.
• Additional Canonicalization Concerns: besides basic URL canonicalization, there are other scenarios where a site can get in trouble. For example, older versions of IIS use a 302 redirect to add a trailing slash to URLs entered without one. This is easily solved with something like the ISAPI rewrite tool.
You'll also want to ensure there's consistency in the internal linking of your site. Some content management systems (CMS) such as Joomla! will create multiple versions of pages and link to them with multiple URLs within the site. The Joomla! issue is especially bad with their frontpage treatment, which can create dozens of different home pages, each of which gets linked from various sections of the site. Ensure you link to your pages using a standardized rule, and stick to it.
• Crawling Errors: within Google Webmaster Tools, you can monitor your site for crawling errors and export any results to CSV for analysis and repair. You should also periodically crawl your site with a tool like Xenu (or something more powerful, but you'll have to roll-your-own) to verify link integrity within your site and more.
Excessive 404 errors can cause ranking penalties at Google, so it's something you'll want to monitor.
• Redirects: there is an art and science to properly redirecting expired, non-existent, or relocated pages. In general, you'll want to make wide use of 301 permanent redirects in cases where a) the page has moved to a new location, b) visitors are likely to be confused or frustrated by a 404 error page, c) another page closely matches the content of the expired or deleted page. 302 temporary redirects are also used widely, but don't pass PageRank the way 301s do, and should only be used in specific cases (such as browser look-ups). There are good uses of redirects, and there are bad uses. We'll talk about some of each in a future article.
For now, just remember the golden rule: is it good for the visitor? In many cases, 301'ing an entire site is NOT good for the visitor, it's good for the site owner hoping to cash in on accumulated PageRank (and thus an entire strategy has been built up to acquire sites and redirect them for these purposes). In general, permanently redirecting a site in whole may not have as much benefit as you'd think it would. Sometimes it's best to test by redirecting a few pages first, and sometimes it's best to leave the site in place for a specific duration of time (6 months, a year) with a message about future changes. And yes, sometimes it's best to leave the site intact and build it out separately.
• Domain Registration: registering your domain name for the maximum duration (10 years) may earn additional quality points from Google, which is a registrar for quality control reasons (or big brother snooping, depending on whom you ask).
• Multilingual Domain Structure: there are a few ways to handle sites that are created for different countries and languages. The first and best method is to create country-specific TLDs with unique sites and content. Localization is critical - ensure your Spain site has language specific to Spain, and not Mexico. The second method, inferior to the first, is to create sub-domains for each language version. The third method, least desirable of all, is to create a directory structure serving each language. In the case of the latter, make sure you name your directories using the language of the country you're serving. There's nothing more annoying than being a native of Germany and having to navigate to www.domain.com/german/index.html instead of www.domain.com/deutsch/index.html.
The first method has many benefits, including the ability to authenticate the domain in Google Webmaster Tools and specify the geo-location. International TLDs will be indexed and listed in language-specific versions of search engines and regional directories, whereas many sub-domain and directory localization strategies will not.
To bring it all together, serve an international hub page (or small site) with a version selector for each country-specific domain. This helps with spidering and enables users to individually select the version of your site they're interested in using. A browser detection script gives your visitors less control - don't rely on the browser language to serve the language version. This also does away with the 302 redirect commonly deployed for browser look-ups.
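The canonicalization rule described above (one host form per page, enforced with a 301) can be sketched in a few lines of Python. This is only an illustration: the host names, the choice of "www" as the preferred form, and the function names are assumptions, and a real site would normally do this in its server rewrite rules rather than in application code.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the "www" form was chosen as the canonical host.
PREFERRED_HOST = "www.domain.com"
ALIAS_HOSTS = {"domain.com", "www.domain.com"}


def canonical_url(url):
    """Return the canonical form of url: preferred host, non-empty path.

    The scheme and query string are kept; the fragment is dropped because
    it is never sent to the server.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host in ALIAS_HOSTS:
        host = PREFERRED_HOST
    path = parts.path or "/"
    return urlunsplit((parts.scheme, host, path, parts.query, ""))


def redirect_for(url):
    """Return (301, target) if url is not canonical, otherwise None."""
    target = canonical_url(url)
    if target != url:
        return (301, target)
    return None
```

Calling `redirect_for("http://domain.com/widgets/")` yields a permanent redirect to the "www" version, while the already-canonical URL yields `None`, so internal links and backlinks converge on a single version of each page.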
Optimized IA: Sections & Categories
Now we get into sections and categories, the pillars and columns of our structure. You'll hear these areas of a site referred to in lots of different ways - as hubs, doorways and hallways for instance - but the basic idea is the same. These are the areas of a site that bridge the root domain with key individual pages of content. Or more precisely, sections and categories provide entry points into deeper content that allow for comprehensive crawling of a site's hierarchy.
Since the root domain on a site won't rank for nearly as many terms as its sub-pages, these represent the spidering gateway into your money pages. As you build an optimized information architecture, remember that these interior sections of a site feed much of the ranking power of the domain, and represent the bulk of its potential traffic. This is long-tail paradise.
We mentioned the idea of topical themes above in the section on domain names. Themes are an important aspect of IA and govern things like applying keyword research to labeling and navigation, but they also dictate strategies for assembling sections and categories topically. Site hierarchies can be represented visually in something like Visio or OmniGraffle, and they can get pretty complex.
There are plenty of exceptions to this rule, but in general search engines weigh content directly below the root domain with more value than deeper pages. Think about it: pages one level deep tend to be pretty important. That's why, after all these years and the continued sophistication of search algorithms, it can still be effective to create static HTML pages and publish them in the root directory.
When thinking about how to lay out your site hierarchy, consider taking your core keyword list and chunking it into groups. These groups will represent the basic sections of the site, and each one should be optimized with keyword messaging. Below these sections are the finer category keyword sets you're targeting, with pages (or more categories) within those. And so on. You may need to only go two or three levels deep, or you may need to get much more complex. The construction of site hierarchy is strongly dependent on the market and business goals (and the SEO benefits of specific keyword markets).
When assembling your keyword list to build the sections of your site, use the Google AdWords keyword tool which automatically creates good keyword groupings. You may need to filter some of the results, which you can dump directly into a file for compilation and research with other keyword lists.
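To make the chunking idea concrete, here is a minimal Python sketch that turns keyword groups into a two-level section and category structure. The sample keywords, the `slugify` and `build_hierarchy` helpers, and the hyphenated URL style are all assumptions made for illustration; your own groupings would come out of your keyword research.

```python
import re


def slugify(phrase):
    """Lowercase a phrase and join its words with hyphens for use in a URL."""
    return re.sub(r"[^a-z0-9]+", "-", phrase.lower()).strip("-")


def build_hierarchy(keyword_groups):
    """Map {section keyword: [category keywords]} to a flat list of URL paths.

    Each section sits one level below the root; each category sits
    one level below its section.
    """
    paths = []
    for section, categories in keyword_groups.items():
        section_path = "/" + slugify(section) + "/"
        paths.append(section_path)
        for category in categories:
            paths.append(section_path + slugify(category) + "/")
    return paths


# Hypothetical keyword groups for a footwear site.
groups = {
    "Running Shoes": ["Trail Running Shoes", "Womens Running Shoes"],
    "Hiking Boots": ["Waterproof Hiking Boots"],
}
paths = build_hierarchy(groups)
```

Here `paths` starts with `/running-shoes/` followed by its two category paths, then `/hiking-boots/` and its category. Deeper levels would follow the same pattern if the market calls for them.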
A post about creating site themes written by Brett Tabke in early 2001 visualizes this arrangement quite well. Without getting into internal linking strategies (which we'll cover below), the idea is to funnel spiders from the keyword-themed sections downward to well-targeted interior pages. Then, instead of linking across from section to section (and category to category), you link vertically to and from the keyword-matched theme pages. This strategy was developed long before nofollow was in the common SEO vernacular, but maximizes internal PageRank in much the same way by controlling how spiders crawl through a site following its links.
There's no right way to set up a site's hierarchy, and a lot depends on the site's size. Zappos is going to have a far different strategy than Bastyr for example. But the basic concept is to build core section and category themes that funnel spiders (and PageRank) to deeper pages. We'll cover this in more detail below, in the section on internal linking.
Optimized IA: Pages
Standards-compliant and clean code has never been more important. As the web evolves, search engines will become less patient with messy, broken markup. Imagine a web where high-quality content is no longer a scarcity (we're getting there); where standards-compliant code is the rule rather than the exception (nope, not there yet); and where websites will be counted in the trillions rather than billions (we there yet?). By creating clean code and semantically optimized pages, you're helping spiders to quickly crawl and understand a page.
This is the area of SEO that's becoming the first real standard. It's not complicated to build well-optimized pages. Here are the basics.
• Semantic Structure: good semantic markup is a must for spiders. It enables efficient crawling and indexing of pages by serving content in a form that's easy for spiders to understand. The basics of W3C compliant code should be followed: relevant and optimized title tags kept under 70 characters, descriptive meta descriptions, relevant header tags that echo or subtly modify the title tag and then narrow the focus with subsequent tags, and wise use of bulleted and numbered lists, bolding, and emphasis. Be sure to do away with deprecated tags.
• Metadata: a word about the meta keyword tag: Google still uses this for targeting contextual ads in AdSense (apparently), and there's a strong chance it's still used by Yahoo! for their search algorithm (but probably not much). Feel free to add 3 or 4 keyword modifiers here, but don't put much time into it. Often the meta keyword tag is abused by design and developer teams. If there are problems with departments in your company misusing it, remove the keyword field altogether in your CMS. As Bob Porter from Office Space says,
We always like to avoid confrontation, whenever possible. Problem is solved from your end.
• Standards: if possible, design your code to conform to W3C standards. Standards compliant code is cool, sure, but it also aids search crawlers and can help with SEO efforts. How much? Hard to say. I feel strongly that standards-compliant code is a high water mark we should all strive to achieve. Is it critical? Definitely not. Is it professional? Absolutely.
• Accessibility: search engines care about accessibility, so be sure to include descriptive alternate text attributes on all images. This allows users with screen readers to understand what the pages are about, but it also gives you an additional field to populate with text. Provided the text is relevant to the image and the page's content, it can help in SEO efforts.
• Content: natural, high-quality writing is best. Avoid keyword-stuffed copy, it turns people off and probably isn't as effective as natural writing anyway. Use keyword modifiers in the copy, if it makes sense and flows, to help draw in long-tail searches.
• Orphan Pages: since crawlers like Xenu work by following links, they won't be able to locate any orphan pages (pages not linked internally anywhere on your site). To find these pages, you'll need to use a custom Java or CGI script. Within Google Webmaster Tools, you may try looking at the internal link report for any pages with 5 links or fewer. If pages you care about are only being linked to once or twice, do something about it! Generally speaking, more internal links into a page give it more importance and potential ranking power.
• Media: as I already mentioned, descriptive ALT attributes should be on image files. Images should also (ideally) have descriptive file names and relevant text surrounding them or near them. Videos should have relevant keywords in their titles, be transcribed with text on the page, and have their metadata information filled out properly.
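Two of the page-level checks above, title length and missing ALT text, are easy to automate. The following is a rough sketch using only Python's standard library; the class and function names are made up for this example. It treats an empty `alt=""` the same as a missing one, which is a simplification.

```python
from html.parser import HTMLParser


class PageAudit(HTMLParser):
    """Collect the page title and the src of every image lacking ALT text."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.images_missing_alt = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "img" and not (attrs.get("alt") or "").strip():
            self.images_missing_alt.append(attrs.get("src", ""))

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def audit(html, max_title_length=70):
    """Return a small report for one HTML document."""
    parser = PageAudit()
    parser.feed(html)
    parser.close()
    title = parser.title.strip()
    return {
        "title": title,
        # "Under 70 characters" means 70 or more is flagged.
        "title_too_long": len(title) >= max_title_length,
        "images_missing_alt": parser.images_missing_alt,
    }
```

Run `audit()` over the HTML of each page your crawler fetches and review anything it flags. It makes no judgment about whether the title or ALT text is actually relevant, which still needs a human eye.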
Optimized IA: Internal Linking
Creating a well-optimized internal linking strategy is an art. Think about the factors involved using Google's algorithm as the example:
- 1. Each site and each page has a certain amount of PageRank. We have no idea how much that is. We don't even really know what PageRank is!
- 2. We have no real idea about how much PageRank a site or page has - we can only guess.
- 3. We have no idea how much PageRank fluctuates.
With the above points made, there are some things you can do. As always, take great care you know what you're getting into before deploying nofollow on a site. You could be doing far more damage than you know.
• Basic Linking: ideally you'll be able to link to all the important sections from the home page. The main index page will likely possess the most PageRank, and is the natural entry point for search spiders. Providing plain text links to each important section of the site is critical. You'll also want to create an HTML sitemap as an additional spider crawling area, and have it linked directly off the home page (and sitewide).
• Deep Linking: link to important sub-categories and specific pages from the home page, or directly under the home page if that's not possible.
• Anchor Text: anchor text matters with internal links too. Ensure you link to keyword-optimized pages with the relevant anchor text. You want your money phrase in the link text and on the page it links to. Sounds simple, but you'd be amazed at how often this is messed up! People tend to think, "if I link to this page with the right anchor text I'm done" and forget that the page has to be keyword optimized too. It's a head-slapper, I know.
• Nofollow: for overhead pages that don't need to rank, add the rel='nofollow' attribute to the internal links that point to them. You may also nofollow pages that require a login, such as the shopping cart and account profile links. For pages with over 100 links, you'll want to carefully sculpt with nofollow to concentrate the amount of PageRank flowing off the page. Dan Thies has a fantastic explanation of PageRank sculpting:
If you think of every link on your site as a valve that pushes some PageRank on to the next page, nofollow simply lets you turn some valves off. This increases the amount of PageRank flowing through the remaining links.
• Link Thresholds: any page with over 150 links is a waste in terms of usability and internal linking, so split your sitemaps into multiples if you have thousands of pages. In general, fewer links on a page mean more PageRank is available for the links present. Keep this in mind with your homepage, because that tends to have the most PageRank to spend and we like to use it for linking to everything under the sun.
• Advanced Bot Herding: first a warning: don't implement major nofollow sculpting unless you know what you're doing. Thies calls this technique the third level push. The strategy uses the following general methodology to flow more PageRank to deeper pages:
2. Second tier pages (what we've discussed as sections and/or categories) that have links to each other (other second tier pages) and the home page (upwards in the site hierarchy) are given nofollows. This allows more PageRank to flow deeper to third tier pages.
3. Third tier pages that link upwards to second-tier pages have those links nofollowed. This gives them more PageRank to pass along the third tier.
Halfdeck applies this method slightly differently and also cites some useful explanatory quotes from Thies. Halfdeck also explains advanced linking strategies such as paired and circular linking on the second and third tier.
The basic idea behind this method is to force more PageRank to flow downwards from the home page to deeper pages of the site. With more PageRank granted, these deeper pages will be indexed by Google (the engine this technique specifically targets), thus giving you more money pages in the main index.
You'll want to make use of Halfdeck's PageRankBot tool, which can provide a shortcut to diagnosing PageRank leaks and making smart use of nofollow. Here's detailed information about using this tool. You'll need some basic geek chops to get this installed, but it's well worth the effort!
For the sake of completion, some other terms you'll hear this technique referred to are siloing from Bruce Clay's team, and dynamic linking from Dan Thies. That should take care of your reading material for the next week or so.
There are normally far more important steps to take on a site than manipulating internal PageRank. Most sites will benefit by implementing a basic nofollow strategy for overhead pages, but that's as far as you'll probably need to go. Techniques like the third level push should be reserved for advanced SEO when indexing (and ranking) goals have been largely achieved, or for sites without many external links pointing at deeper pages and an imbalance of PageRank on the root domain. By and large it's far more important (and effective) to work on adding great content to a site's deep pages and building PageRank that way.
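The valve analogy quoted earlier can be put into rough numbers with a toy model. The sketch below assumes that a page splits its passable PageRank evenly among its followed links and that nofollowed links receive none, which is the premise of the sculpting argument rather than anything confirmed about how a search engine behaves. The function name, the sample links, and the 0.85 damping factor (borrowed from the original PageRank paper) are assumptions.

```python
def pagerank_per_link(page_rank, links, damping=0.85):
    """Split a page's passable PageRank evenly across its followed links.

    links is a list of (url, nofollow) tuples. Under this toy model,
    nofollowed links are simply left out of the split.
    """
    followed = [url for url, nofollow in links if not nofollow]
    if not followed:
        return {}
    share = page_rank * damping / len(followed)
    return {url: share for url in followed}


# Hypothetical home page linking to two sections and two overhead pages.
all_followed = [
    ("/running-shoes/", False),
    ("/hiking-boots/", False),
    ("/privacy/", False),
    ("/cart/", False),
]
sculpted = [
    ("/running-shoes/", False),
    ("/hiking-boots/", False),
    ("/privacy/", True),
    ("/cart/", True),
]
before = pagerank_per_link(1.0, all_followed)
after = pagerank_per_link(1.0, sculpted)
```

In this model each section's share doubles once the two overhead links are nofollowed, because two links now divide what four divided before. Treat the numbers as a way to reason about the valve idea, not as a measurement.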
Optimized IA: Final Considerations
Below is a collection of assorted recommendations that I haven't mentioned yet.
• Keywords in URLs: use keywords in your folder paths and filenames, and hyphens to separate them. When all other factors are equal, a relevancy score can be won by having additional semantic value in the URL.
• XML Sitemaps: you may consider using an XML sitemap. I haven't found any major benefit except when moving a site or trying to get a substantial number of pages (over 25,000) crawled (but not necessarily indexed - they don't appear to give any advantage here). You can add the sitemap directive to your robots.txt file. Make sure your XML structure is clean and you haven't included URL errors or pages you don't want crawled.
• 404 Error Pages: custom 404 error pages give visitors confidence in a site, add credibility, and help to keep them on your site. There are a number of best practices to take, such as:
1. keep choices to a minimum. Treat your error page like a simplified landing page - don't overwhelm a user with choices. They're lost, they need a minimum of feedback to get them back on track.
2. put clear call-to-action links to the sitemap and main categories of your site.
3. put a search box on the 404 page.
4. ensure your 404 returns the correct server header code.
5. consider asking for feedback on what page they were trying to find; you won't get many takers, but some will let you know. You can use that information to redirect non-existent pages or find other errors in the site.
• Excluding Content: use the robots.txt file to exclude content and sections of your site from robots. For specific pages, you can also add a <meta name="robots" content="noindex, nofollow"> tag to the head section to exclude that page. There are many other combinations as well. In general, you'll want to ensure administrative sections of the site are excluded from all bots, and content sections are left spiderable. It used to be a common practice to exclude the /images directory in robots.txt, but I normally don't recommend doing so. Google Image search has the capability to send significant traffic to your site, and with blended search results images are even showing up in the web results.
• No ODP: if you've been lucky enough to get a listing in DMOZ, you'll want to add the <meta name="robots" content="noodp"> tag to force Google, MSN and Yahoo! to ignore the title and description summaries in your ODP listings. All three search engines now should properly behave in this regard. Google has famously used all sorts of combinations of ODP entries in various titles and snippets, so this can be an important step to take if your site isn't optimized in Google results.
• Other Meta Tags: get rid of any extraneous meta tags, such as <meta name="robots" content="Index, Follow">. It's useless.
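Before publishing a robots.txt file, it's worth checking that it excludes what you intend and nothing more. Here is a small sketch using Python's standard `urllib.robotparser` module. The robots.txt contents, the directory names, and the domain are invented for the example, and `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: administrative areas excluded, content left
# spiderable, and the XML sitemap declared with the Sitemap directive.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/

Sitemap: http://www.domain.com/sitemap.xml
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network call."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Administrative pages should be blocked for every bot.
admin_allowed = rules.can_fetch("Googlebot", "http://www.domain.com/admin/login")
# The images directory is deliberately left open to crawlers.
images_allowed = rules.can_fetch("Googlebot", "http://www.domain.com/images/shoe.jpg")
sitemaps = rules.site_maps()
```

With this file, `admin_allowed` is `False` and `images_allowed` is `True`. This only tests the rules as Python's parser reads them; individual search engines may interpret edge cases differently.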
Conclusions: How Findability Fits In
This article is about information architecture from an SEO perspective. It mostly leaves out the concepts of usability and design (on purpose). Using the strategies listed here will help sites get crawled quicker, and give search engines a more accurate understanding of what a site is about. Ultimately (provided off-page factors are covered well), that will lead to more pages in the search indexes and higher rankings.
So what about findability, which I alluded to in the beginning of the article? (I know... way up there near the top.) Well, I believe findability is important where the promotion of user-centric and relevant content in the SERPs is the primary focus, rather than commercial intent. Companies that can marry a user-centric relevance with marketing goals will have a distinct advantage. The challenge (and the upside) is bringing these SEO recommendations into the process and balancing them with usability, branding, conversion goals and design. It's fantastically complex, but it can be done.