When Google’s Panda Rewards Content Theft

If you’re a webmaster who cares about search engines, you have probably heard about the Google “Panda” Update. Here’s what the update was about, in Google’s own terms:

“Google depends on the high-quality content created by wonderful websites around the world, and we do have a responsibility to encourage a healthy web ecosystem. Therefore, it is important for high-quality sites to be rewarded, and that’s exactly what this change does.” (Google)

As usual, with each update, some sites go up, and some go down. There is usually an uproar in the various webmaster forums as well. Traffic fluctuates, and we usually don’t pay too much attention to this. However, for the first time in our 5 years as a web publisher, we have noticed some very bizarre behaviors that other site owners should look into.

Context

If you are new to this site because you are a webmaster in search of information about Google’s “Panda” update and potential issues and pitfalls, welcome!

Ubergizmo is a popular independent gadget site that reports about consumer electronics and popular technology on a daily basis. We’ve been a Webby Award Nominee, a PCMag Top 100 and occasionally we land on TV in the USA (ABC7) or abroad (3Sat/Germany).

We’re geeks, so we’re not great at talking about ourselves, but SF reporter David Weir is right on target in his “Ubergizmo Makes Gadgets Accessible to Non-Geeks (and Fashion Accessible to Geeks!)” article on Ubergizmo.

Search 101

Before we start, we need you to understand a couple of basic things. Search engines are mainly designed to match a search phrase with web page content. A search on a specific phrase should easily find the page that contains it. If more than one page contains the same text, the page deemed by the search engine to be the “most relevant” (normally, the originator of the content) is ranked at the top.

This is an easy way for webmasters to see who’s copying content, and to make sure that search engines can differentiate between original and copied/cloned/stolen content.

With “Panda”, SPAM sites can beat original content

As bloggers, we often refer to previous articles, and we have this habit of searching for an older post or checking on who is syndicating our content. After the second Panda update, we could not find many of our own posts with a search on the title or a specific phrase from the article. Instead, what we found was that sites that *steal* our content do rank higher in the Google search results, while our own post archive could not even be found. We immediately knew that something was very, very wrong.

If you have your own site, do this: copy the first line (or the title) of a post, and search for that phrase. If this phrase is unique enough, your article will appear first (if not, you will simply compete with others). Here’s an example using content from SearchEngineLand:

As expected, a search on a unique phrase shows the original author at the top

You can see that the original post comes up first. Then an archive page from the same site, then other sites that copy/syndicate their content. Everything seems normal – that’s how things should be.
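The check described above can be sketched in a few lines of Python. This is only an illustration, not a tool we used: it assumes you have copied the result URLs by hand from a search on your title or first sentence, and the function names (`host_of`, `belongs_to`, `original_rank`) and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercase host of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def belongs_to(url, domain):
    """True if the URL is on the given domain or one of its subdomains."""
    host = host_of(url)
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


def original_rank(result_urls, own_domain):
    """Return (rank, urls_ranked_above).

    rank is the 1-based position of the first result on own_domain,
    or None if the domain does not appear in the list at all.
    """
    for index, url in enumerate(result_urls):
        if belongs_to(url, own_domain):
            return index + 1, result_urls[:index]
    return None, list(result_urls)


# Hypothetical result list, pasted in the order the search engine showed it.
results = [
    "http://scraper-one.example/macbook-air-ssd",
    "http://scraper-two.example/2011/04/macbook-air.html",
    "http://www.ubergizmo.com/2011/04/macbook-air-ssd/",
]

rank, ahead = original_rank(results, "ubergizmo.com")
print("Original ranks at position:", rank)
print("Pages ranked above the original:", len(ahead))
```

If the rank is 1, things look like the SearchEngineLand example. If it is lower, or `None`, the list of URLs ranked above yours is the hard data worth keeping.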

However, if we try to find our iPad 2 review using the first phrase, here’s what we get (full-size screenshot):

Many sites copy our content, some are legitimate syndicators (TreeHugger), others are spam sites

Above image: what you are looking at are sites that either syndicate our content (with a link to tell Google where the original article is), or sites that simply steal our content and fool Google into thinking that they have original content. i-nooz.com is a good example of content theft. It’s completely automated, so if you scroll down the page, you will even see the Twitter info from the original review.

Let’s try with something else. Yesterday we posted about a Macbook Air SSD upgrade. We’re trying to find it by searching for its title: “Macbook Air gets a 25% boost with new SSD”. And here are the results:

Full-size screenshot: http://tinyurl.com/3rv8fty

That’s our article, but again, the top results are sites that copy our content (Hubert’s name is even in the byline of polaris-website.com! Nice…). Some do link back, many don’t link or credit. You would think that even a handful of links would allow Google to realize that Ubergizmo.com is the originator of that content. This is obviously not the case.

Many offenders are using Blogspot (a Google service) hosting and Adsense (Google’s advertising network) to host spam for free, and make “free” money with our content. They are even using our bandwidth by “hotlinking” the photos directly from our servers. Yet, Google’s new “Panda” algorithm believes that those pages should rank higher than ours. It’s a strange world. Some say that Google is in fact creating more “black hats” (black hat refers to those who use shady tactics to gain search rankings).
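Hotlinking leaves traces in the web server’s access log: an image request arrives with a Referer header pointing at someone else’s page. Here is a rough sketch of how one could count those referring hosts. It assumes the common Apache/nginx “combined” log format, and the function name `hotlinking_hosts`, the sample log lines and the host `scraper.example` are all made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Matches the request and Referer fields of a "combined" format log line
# (assumed format; adjust the pattern if your server logs differently).
LOG_PATTERN = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "(?P<referer>[^"]*)"'
)
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def hotlinking_hosts(log_lines, own_domain):
    """Count image requests per external referring host."""
    counts = Counter()
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue
        path = match.group("path").split("?", 1)[0].lower()
        if not path.endswith(IMAGE_EXTENSIONS):
            continue  # only image requests are of interest here
        host = (urlparse(match.group("referer")).hostname or "").lower()
        if not host:
            continue  # empty or "-" referer: direct request, cannot tell
        if host == own_domain or host.endswith("." + own_domain):
            continue  # our own pages embedding our own images
        counts[host] += 1
    return counts


# Hypothetical log lines, for illustration only.
sample_log = [
    '203.0.113.5 - - [20/Apr/2011:10:00:00 +0000] "GET /images/ipad2.jpg HTTP/1.1" 200 51234 "http://scraper.example/ipad-2-review" "Mozilla/5.0"',
    '203.0.113.6 - - [20/Apr/2011:10:00:01 +0000] "GET /images/ipad2.jpg HTTP/1.1" 200 51234 "http://www.ubergizmo.com/ipad-2-review/" "Mozilla/5.0"',
    '203.0.113.7 - - [20/Apr/2011:10:00:02 +0000] "GET /images/ipad2.jpg HTTP/1.1" 200 51234 "-" "Mozilla/5.0"',
    '203.0.113.8 - - [20/Apr/2011:10:00:03 +0000] "GET /images/tegra2.png?v=2 HTTP/1.1" 200 20480 "http://scraper.example/tegra-2" "Mozilla/5.0"',
    '203.0.113.9 - - [20/Apr/2011:10:00:04 +0000] "GET /index.html HTTP/1.1" 200 1024 "http://scraper.example/" "Mozilla/5.0"',
]

for host, hits in hotlinking_hosts(sample_log, "ubergizmo.com").most_common():
    print(host, hits)
```

Legitimate syndicators and feed readers can also show up in such a list, so it is a starting point for a manual review rather than proof of theft.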

Note that even sites specifically targeted by “Panda” don’t have this problem. Suite101.com is an interesting example (note that suite101 is not an automated spam site). As you can see, a copy was properly ranked in second place.

Suite101 shows up as result #1 for its own content. Full screenshot: http://tinyurl.com/3ebgu8w

So, what happened?

It’s impossible to know for sure what happened. From these tests, it seems that Google’s algorithm cannot recognize Ubergizmo.com as the original source of our articles, and is treating our site as an automated spam site — yes, that sucks.

We get it: for an algorithm, it’s really hard to figure out who is the real author of the content, but it was working fine before, so we can only conclude that the “Panda” update has a hand in this.

Additionally, we’re in Google News, so that could be one more signal that our site does not steal/copy content. Heck, even Matt Cutts, the head of Google’s Web Spam team, linked to Ubergizmo from his blog a few months ago (search for “Looks like that did happen in 2009” in that post). Surely, that’s a sign that we’re not a spam site. Right?

But here we are, our content treated worse than spam and ranking below those hundreds of automated content-stealing sites. As technologists, we understand how difficult the task is, but Google got it right before. And as far as we can tell, Google is giving our whole site a “black eye” (like a Panda?), so even new articles immediately rank lower than spam. Unfortunately, this problem has been reported many times in the Google Webmaster forums.

In the real world

We’ve kept the best for the end. First, open this screenshot. The (slightly) “synthetic” searches are great to prove a point, but for webmasters out there, what does this mean in the real world, with a real search? Here’s a telling example: Hubert happens to know a bit about computer graphics and chips that NVIDIA makes, so he spent a good chunk of time writing an overview of the mobile chip called Tegra 2.

If we search for “Tegra 2 Overview”, we rank a good 10 positions below a spam site that stole our content. Better yet: that spam site called “techgeer.com” is still rising in the SERPs, while original content sinks deeper. This shows that Panda has real implications for good content, in the real world.

Now what?

Ironically, back in 2006 Google itself said “not to worry” about scrapers:

Don’t worry be happy: Don’t fret too much about sites that scrape (misappropriate and republish) your content. Though annoying, it’s highly unlikely that such sites can negatively impact your site’s presence in Google. If you do spot a case that’s particularly frustrating, you are welcome to file a DMCA request to claim ownership of the content and have us deal with the rogue site. (Google)

If you think that you have been affected, you can gather *hard data* to see if there’s something obviously wrong like what we’ve shown above. Then try to get the word out and provide some feedback to Google via the Web Master site. It’s probably the best way to make them aware of potential pitfalls with the new algorithm. You can also look at “Panda attack” survivor tales and decide for yourself if you need to take action on your site.

Now, you won’t know if someone looks at (or cares about) those glitches. Google says “We’re Working to Help Good Sites Caught by Spam Cleanup”, but in reality, things aren’t quite as easy as this phrase would suggest. At the moment, webmasters can post in this thread of the webmaster forum, but you can imagine how flooded it is right now (1800+ posts). It is monitored by Google employees, who will relay the information to the Google Search Team and the Web Spam team. In time, they *might* issue a fix/tweak, if it’s for the greater good of the web. If not…

If you’re a legitimate web publisher, consider yourself “collateral damage” of Google’s war on spam and try to “guess” what could have caused this (it’s always “your fault”, says the robot). Be aware that there are “manual” penalties, which often follow a complaint, and “algorithmic” penalties that cannot be lifted by a Google employee, even after a request for reconsideration.

We’re lucky to have a great community behind us in times like this when automation fails. Unfortunately, not everyone is as fortunate.

Update 4/20: we’ve received confirmation from Google that we are complying with the Google Webmaster Guidelines, so there is no manual penalty that the Google Webspam team can lift. This is therefore a purely algorithmic problem (that may or may not be fixed):

We reviewed your site and found no manual actions by the webspam team that might affect your site’s ranking in Google. There’s no need to file a reconsideration request for your site, because any ranking issues you may be experiencing are not related to a manual action taken by the webspam team.

Of course, there may be other issues with your site that affect your site’s ranking. Google’s computers determine the order of our search results using a series of formulas known as algorithms. We make hundreds of changes to our search algorithms each year, and we employ more than 200 different signals when ranking pages. As our algorithms change and as the web (including your site) changes, some fluctuation in ranking can happen as we make updates to present the best results to our users.

If you’ve experienced a change in ranking which you suspect may be more than a simple algorithm change, there are other things you may want to investigate as possible causes, such as a major change to your site’s content, content management system, or server architecture. For example, a site may not rank well if your server stops serving pages to Googlebot, or if you’ve changed the URLs for a large portion of your site’s pages. This article has a list of other potential reasons your site may not be doing well in search.

We did not make any major change since December 2010, and as of late March, the spammers were still kept at bay by Google.

Update 5/07: Google has published a new blog post titled “More guidance on building high-quality sites”. No new ranking signal is exposed, but Google tries to tell you what they’re aiming at. Most of it repeats previous advice, with some additions. So far, the comments on this post are fairly negative, but you can add your own feedback, positive or not.
