THE RKGBLOG

Content Spam at (at least) 8%?

I enjoy stumbling onto new things, so I changed my default Firefox homepage from Google’s personalized homepage to Yahoo’s random-URL redirect (random.yahoo.com/bin/ryl) just to shake things up. After randomly hitting content spam pages (made-for-AdSense, or MFA) a few times when opening the browser in the morning, I began to wonder about their prevalence. After all, the web is a huge haystack, and those bogus pages must be occasional needles, right?

Curious, I tried 50 random pages from random.yahoo.com/bin/ryl. I’m assuming (big assumption) that Y! isn’t filtering all that much, save for language. That’s my guess because (a) all the results were in English, (b) three of the 50 were broken links, and (c) three of the 50 were porn sites.

Of the 50, four were clearly junk pages designed solely to generate search revenue. These four URLs were all concatenations of two common dictionary words which didn’t make much sense together, clearly suggesting they were purchased by a bot. (The most amusing of the four was dochunter.com, which can’t seem to decide if the page is about hunting moose, choosing an MD, or, gasp, hunting doctors.)

This survey is decidedly unscientific, is based on a tiny sample, and depends critically on the randomness of random.yahoo.com/bin/ryl, which isn’t known.
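To put a rough number on how tiny the sample is, here is a minimal sketch that computes a Wilson score interval for four junk pages out of 50. It assumes the 50 pages were independent random draws, which is exactly the part that isn't known, and the function name `wilson_interval` is just an illustrative choice.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Return the (low, high) Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return center - half_width, center + half_width


# Assumption: the 50 pages are independent, uniformly random samples of the web.
low, high = wilson_interval(4, 50)
print(f"observed rate: {4 / 50:.0%}")
print(f"approx. 95% interval: {low:.1%} to {high:.1%}")
```

Under that assumption the interval runs from roughly 3% to 19%, so the sample is consistent with anything from "uncommon" to "nearly one page in five."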

But still, 4 in 50 is 8% — that is amazingly high, in my opinion. The web is well over 11.5 billion pages (that estimate is over 18 months stale) — 8% of 11.5b is over 900 million junk pages.

Even if this estimate is off on the high side by an order of magnitude, that still suggests roughly 100 million bogus content pages siphoning value from advertisers to spammers. Scary.
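For anyone who wants to check the back-of-the-envelope math, here is a short sketch of the extrapolation. It uses the 11.5 billion figure from the post as a lower bound on the size of the web; the variable names are only illustrative.

```python
SAMPLE_SIZE = 50
JUNK_PAGES = 4
WEB_PAGES = 11_500_000_000  # the post's stale lower bound on the size of the web

# Scale the sample rate (4 in 50) up to the whole web, using integer math.
estimated_junk = WEB_PAGES * JUNK_PAGES // SAMPLE_SIZE

# Discount by an order of magnitude in case the sample overstates the rate.
conservative_junk = estimated_junk // 10

print(f"straight extrapolation: {estimated_junk:,} pages")
print(f"discounted tenfold:     {conservative_junk:,} pages")
```

The straight extrapolation gives 920 million pages, and the tenfold discount gives 92 million, which is on the order of 100 million once you remember that 11.5 billion is itself an underestimate.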

  • Alan Rimm-Kaufman
    Alan Rimm-Kaufman founded the Rimm-Kaufman Group...