No doubt at some point you have done a search in Google, clicked on an attractive result, and run straight into a frustrating wall: the article or page in question requires a subscription! 😉 As users we all find this annoying, and the last thing we want is yet another username and password to remember. But as a content provider, it's an excellent business move. Premium/paid content is a fine monetization strategy for anyone with content good enough to sell.
It also brings up an interesting question for SEO. How exactly does Google index paid content?
I got this email from my loyal reader Wing Yew:
Hamlet,
I've read your blog since the day you launched. That said, I can
completely appreciate if you don't have time to respond to this
message or post a blog about it. On the off chance you do know an
answer, I knew I had to ask.
Question: How do you have google/yahoo/msn spider password protected
content? I know that SEOMoz does it with their premium content, but
I'm not sure how. I'm rather desperately seeking out a hard and fast
answer… and I know of no better person to whom to go.
for His renown,
Wing Yew
Saying that I've been extremely busy lately is an understatement, but how can I say no to a loyal reader who has been following my blog from day one? Thanks for your support, Wing! Letting search engines index paid content is not only a good idea, it is also a very clever one.
Activate cloaking device
In order to do this you need to use cloaking. Before you panic and run for the hills, convinced that this is black-hat stuff that will get you penalized, know that Google does not penalize every type of cloaking. It is all about the intention. Let me explain the main concept and then dive into the technical details.
Your paid or sensitive content can be protected by your web server or by a web application. Let's call it the gatekeeper. The gatekeeper is responsible for asking for credentials anytime a visitor lands on a protected page. It validates the credentials and, assuming they are good, allows access to the page.
In this case we need to make the gatekeeper a little bit smarter by teaching it how to distinguish search engine spiders from regular users. The gatekeeper should still ask for passwords from any web surfer, but it should not ask search engine spiders for credentials. This is where cloaking comes in.
As I explained before, cloaking means presenting different content to crawlers than we show regular users. Traditionally I have used two detection strategies: by user agent or by IP address. The first has the code check whether the HTTP_USER_AGENT server variable contains a bot identification string (e.g. Googlebot, Yahoo Slurp, etc.); the second checks the requestor's IP against a list of known bots. You can get such an IP list from http://iplists.com/, and a list of search engine user agents from http://www.user-agents.org/.
Both approaches are relatively simple, but they have flaws and are not difficult to exploit if an advanced user wants access to paid content for free. The user agent can be forged; there is, for example, a Firefox extension that makes the gatekeeper think the visitor is a search engine robot simply by sending a search engine user agent instead of the browser's. The IP-list method is stronger, but maintaining an accurate, up-to-date list of bot IP addresses is extremely difficult and time consuming.
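To make this concrete, here is roughly what both traditional checks look like in Python (a sketch only; the user-agent strings and the IP shown are illustrative placeholders, not authoritative lists):

    # Traditional detection: user-agent substring match plus a static IP allowlist.
    # Both lists below are placeholders; real data would come from the sites linked above.
    BOT_USER_AGENTS = ('googlebot', 'slurp', 'msnbot')
    BOT_IPS = {'66.249.66.1'}  # normally loaded from a maintained IP list

    def looks_like_bot(user_agent, remote_addr):
        ua = user_agent.lower()
        return any(bot in ua for bot in BOT_USER_AGENTS) or remote_addr in BOT_IPS

The first test is trivially spoofed and the second goes stale as the engines add crawler addresses, which is exactly why we need something better.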
Here is a better strategy
Let's use a method I've discussed before to protect against CGI hijackers. The method is not infallible, but it is extremely powerful for our purposes here. Here are the steps; a short sketch follows the list:
1. Do a simple user agent detection as explained above.
2. In order to detect fake robots, we use reverse-forward DNS detection. We only do this check if the requestor has been identified as a known search engine robot in step 1. Making two DNS requests for every single request will definitely slow your server down and we don't want that.
3. Once the code confirms that the requestor is a search engine, we allow the robot to access the paid content.
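Here is a minimal sketch of the three steps in Python. The user-agent signatures and the reverse-DNS suffixes are my assumptions for illustration; verify against each engine's documentation which host names its crawlers actually resolve to.

    import socket

    # Assumed mapping of user-agent signatures to the DNS suffixes their
    # crawlers are expected to reverse-resolve to (illustrative values only).
    KNOWN_BOTS = {
        'googlebot': ('.googlebot.com', '.google.com'),
        'slurp': ('.crawl.yahoo.net',),
        'msnbot': ('.search.msn.com',),
    }

    def is_verified_crawler(user_agent, remote_addr):
        # Step 1: cheap user-agent check; ordinary visitors stop here.
        ua = user_agent.lower()
        suffixes = None
        for signature, valid_suffixes in KNOWN_BOTS.items():
            if signature in ua:
                suffixes = valid_suffixes
                break
        if suffixes is None:
            return False  # not claiming to be a bot: ask for credentials as usual

        # Step 2: reverse-forward DNS check, done only for requests that claim to be a bot.
        try:
            host, _, _ = socket.gethostbyaddr(remote_addr)  # reverse (PTR) lookup
            forward_ip = socket.gethostbyname(host)         # forward (A) lookup
        except socket.error:
            return False

        # Step 3: the forward lookup must point back to the same IP, and the host
        # name must belong to the engine the user agent claims to be.
        return forward_ip == remote_addr and host.lower().endswith(suffixes)

The gatekeeper simply calls is_verified_crawler() with the request's HTTP_USER_AGENT and REMOTE_ADDR values: if it returns True it serves the paid page, otherwise it asks for credentials as usual.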
A word of caution
It is wise to prevent the search engines from caching the paid content. Clever users will simply go back to the search results and read the content from the engine's cache. I see a lot of sites that implement this type of cloaking, yet forget to prevent the search engines from caching the protected content.
As regular readers know, this is as simple as setting the robots meta tag to “noarchive.” Alternatively, you can send the X-Robots-Tag HTTP header with the value “noarchive,” but only Google supports that header at the moment.
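In a plain CGI-style script the directive can be emitted like this (a sketch; frameworks provide their own ways to set headers and templates):

    # Send the noarchive hints before any body output.
    print("Content-Type: text/html")
    print("X-Robots-Tag: noarchive")  # header form: only Google honors it right now
    print()  # a blank line ends the headers
    print('<html><head><meta name="robots" content="noarchive"></head>')
    print('<body>...protected article served to the verified crawler...</body></html>')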
Now to the technical details
If the gatekeeper is the web server itself, using HTTP authentication, you can use mod_rewrite to set up rules that identify the bot and return the status code 401 (Authorization Required) when the requestor is not a search engine. Doing more advanced detection with this type of gatekeeper deserves a post of its own.
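As a rough server-level illustration, here is one common way to exempt crawlers from HTTP Basic authentication in Apache 2.2. Note that this sketch uses mod_setenvif with Satisfy Any rather than mod_rewrite, and the user-agent patterns are placeholders; it only covers step 1, so pair it with the DNS verification described below.

    # .htaccess sketch: challenge everyone except requests whose user agent
    # matches a known crawler (patterns are illustrative only).
    SetEnvIfNoCase User-Agent (Googlebot|Slurp|msnbot) known_crawler
    AuthType Basic
    AuthName "Premium content"
    AuthUserFile /path/to/.htpasswd
    Require valid-user
    Order Allow,Deny
    Allow from env=known_crawler
    Satisfy Any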
Robot detection by user agent or IP address
The code simply needs to check a couple of variables that are set by the web server for every request. These are HTTP_USER_AGENT and REMOTE_ADDR.
Most scripting languages, such as Python, PHP and Ruby, have a class or module named CGI that provides access to these variables. If you are using a framework such as Django, Ruby on Rails or CakePHP, look in the relevant documentation to see how you can access and modify the HTTP headers from your controller or view. Keep in mind that any code that modifies headers needs to run before any other code that sends output to the browser.
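For example, a bare Python CGI script can read the variables and decide whether to challenge the visitor before printing anything else (a sketch; in a framework you would read the request object and set response headers instead):

    import os

    # CGI exposes the request metadata as environment variables.
    user_agent = os.environ.get('HTTP_USER_AGENT', '')
    remote_addr = os.environ.get('REMOTE_ADDR', '')

    # Headers, including the 401 challenge, must go out before any body output.
    if 'googlebot' in user_agent.lower():  # placeholder check; use the full verification above
        print("Status: 200 OK")
    else:
        print("Status: 401 Authorization Required")
        print('WWW-Authenticate: Basic realm="Premium content"')
    print("Content-Type: text/html")
    print()  # blank line: end of the headers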
Reverse forward DNS
To do this type of detection you need to query your DNS cache or name server. The low-level way is to call the C system functions available in any BSD-based TCP/IP implementation: gethostbyaddr() and gethostbyname(). With the first call your script does the reverse lookup, providing the IP address obtained from the server variable REMOTE_ADDR and getting back a host name. With the second call your script passes that host name back to the DNS to confirm that it does indeed resolve to the same IP. For all this to work, it is very important that the search engines maintain accurate forward (A) and reverse (PTR) DNS records for all their crawler IPs. It is also very important that you have a solid DNS cache if your site receives a lot of traffic.
Most web developers are not big fans of C (I don't think it is that bad), but it is good to know that those APIs have been ported and are accessible as functions or methods in any modern scripting language such as PHP, Python, Ruby and Perl.
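Because each verification costs two DNS round trips, it also helps to cache the verdict per IP address so each crawler address is resolved only once. A small sketch, with an assumed suffix list:

    import socket

    _verdicts = {}  # remote IP -> True/False, so each address is resolved only once

    def crawler_dns_ok(remote_addr,
                       suffixes=('.googlebot.com', '.crawl.yahoo.net', '.search.msn.com')):
        if remote_addr not in _verdicts:
            try:
                host, _, _ = socket.gethostbyaddr(remote_addr)   # reverse (PTR) lookup
                ok = (socket.gethostbyname(host) == remote_addr  # forward (A) lookup
                      and host.lower().endswith(suffixes))
            except socket.error:
                ok = False
            _verdicts[remote_addr] = ok
        return _verdicts[remote_addr]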
Jez
September 3, 2007 at 1:16 am
Hi Hamlet, Are you saying that by setting the user agent in Curl I could spider and rip SEOMOZ premium content? ;-) Jez
Jez
September 3, 2007 at 1:19 am
On a related note, how do you maintain the lists of IP's for Google and other SE's? Outdated IP lists seem to be the risk with any form of cloaking...
Hamlet Batista
September 3, 2007 at 6:31 am
Jez - With my updated strategy you don't need to maintain IP lists. Please read it again carefully. You simply check the user agent, and if it is XXBot you do a forward-reverse DNS lookup to confirm it is indeed that bot. No need to check a list of IPs. Well, maybe to improve performance you'd want to cache the IPs of confirmed bots ;-)
randfish
September 3, 2007 at 6:05 am
Jez - actually, we don't cloak. Neither search engines, nor humans have access to the premium content unless they're logged into a premium account. Instead, we show non-logged-in users and bots a page with a summary of the content and an outline (for the guides).
Hamlet Batista
September 3, 2007 at 6:27 am
Rand - Thanks for your comment. I was about to say the same. I checked the cached versions of your premium article pages and they ask for credentials.
David Hopkins
September 3, 2007 at 10:13 am
Again, any hope of being a smart ass was dashed as I read further through the article. I think you being a programmer (I’m sure you said Perl was your favourite?) is an important attribute of your SEO knowledge. Most SEOs don’t seem to have any programming knowledge. They are quite happy to talk about 301 redirects, but how many actually know what an HTTP header is?

They have a pretty bullet-proof setup over there at SEOmoz. I also had a sniff around and found it watertight. I have had quite a few adventures with the curl libraries – all strictly legitimate, I should add. The only flaw that comes to mind is the possibility of using proxies. I am not sure what you are doing at stage 2? Are you checking against a list? As you may have guessed, that list I sent over was the product of curl – I am adding to it from further sources at the moment.

As for my idea of scoping out domains that you can buy for a few dollars, it’s been pretty unsuccessful. I got so excited at seeing a domain with 20,000 links available that I bought it without hesitation, only to find out that the links came from a handful of domains. Thankfully it only cost $9. Although I’ve since picked up a genuine PR5 domain for $9 and a three-letter .com for $150. I have no idea what to do with them though.

P.S. I’m impressed with RankSense. Like yourself I’m really busy and hope to get a better look at it later. I’ll give you some linkage when the time is right.
Hamlet Batista
September 3, 2007 at 2:55 pm
David, aka Mutiny, glad to see you using your real name.

“I think you being a programmer (I’m sure you said Perl was your favourite?) is an important attribute of your SEO knowledge. Most SEOs don’t seem to have any programming knowledge. They are quite happy to talk about 301 redirects, but how many actually know what an HTTP header is?”

My favorite is Python. You could say that I have an unfair advantage, as many things in SEO are highly technical. I still need to play catch-up with the marketing aspect, though.

“I have no idea what to do with them though.”

Although I’ve bought domains in the past, I think the whole domaining thing is a little bit overrated. I prefer to buy sites where I can see profit potential beforehand. Before spending a dime, make sure you have a plan to make it back. ;-)
David Hopkins
September 4, 2007 at 8:40 am
Unfortunately they don't correlate on your top commenters. :(
Hamlet Batista
September 6, 2007 at 5:37 am
David, if you want I can change your site name to your real name in all your comments.
MB Web Design
September 3, 2007 at 8:31 am
Nice try - just give those good people at SEOmoz your money :p
egorych
September 4, 2007 at 9:58 am
Very interesting, really. I've translated your article into Russian. Good job.
Hamlet Batista
September 4, 2007 at 3:01 pm
egorych - Yes, I noticed that. I used Google Translate to understand it. Thanks for the translation. I thought that was some new kind of automated scraping.
Advanced cloaking: how robots index paid content.
October 19, 2007 at 6:10 pm
[...] Original article: Advanced Cloaking Technique: How to feed password-protected content to search engine spiders [...]