It's been a while since I posted some juicy source code. This time, I am going to explain the infamous black hat technique known as cloaking with some basic PHP code.
While most people think of cloaking as evil (and using it carelessly is asking for a search engine penalty), there are circumstances where it is perfectly legitimate and reasonable to use it.
From Google quality guidelines:
Make pages for users, not for search engines. Don't deceive your users or present different content to search engines than you display to users, which is commonly referred to as "cloaking."
What is cloaking?
It is the use of clever, dynamic code to present different content to search engines than is presented to users. Black hats use it to serve optimized (keyword-stuffed) content to search engine spiders and sales/affiliate pages to users. Using it this way is very risky, as search engine quality engineers can identify it easily once it is reported.
Yesterday, Shoemoney reported on his experience typing some technical questions into Google and finding links to the answers on Experts Exchange. The interesting thing is that part of the answer is visible on the SERPs (search engine result pages), but once you land on the website you are presented with a login/subscription screen. You have probably experienced something similar with the New York Times Online and some of the other subscription news sites as well. They provide the real content to the search bots (in order to get the search referrals), and a subscription screen to the user. These are legitimate uses of cloaking. Note that they are not trying to manipulate rankings; they are simply trying to increase their sign-ups.
The clever Jeremy figured it out by using the Google cache. He did not have to register with Experts Exchange, and received access to the full content 🙂
That is exactly how your competitors and the search quality engineers can tell you are cloaking. In order to avoid this, you only need to use this meta tag on the cloaked pages:
<meta name="robots" content="noarchive">
This tells search engines not to keep a cached copy of the page, removing the evidence that you are cloaking.
Now to the best part.
How can you implement cloaking in your pages?
Depending on the detection method, you can cloak using two techniques: detecting the robot's user agent or detecting the robot's IP address.
Detecting the robot's user agent. The dynamic code checks the HTTP_USER_AGENT header that the web server passes along. If the user agent matches a known robot, it displays the content to be cloaked; otherwise, it displays the page intended for the user.
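The user-agent check can be sketched in a few lines of PHP. This is only a sketch: the function name, the handful of bot substrings, and the echoed pages are illustrative placeholders; a real deployment would use a full list like the one at user-agents.org.

```php
<?php
// A minimal sketch of user-agent cloaking. The bot substrings and the
// echoed pages below are placeholders, not a production list.
function is_search_bot_ua($user_agent) {
    $bot_signatures = array('Googlebot', 'Slurp', 'msnbot', 'Teoma');
    foreach ($bot_signatures as $signature) {
        // Case-insensitive substring match against the visitor's user agent.
        if (stripos($user_agent, $signature) !== false) {
            return true;
        }
    }
    return false;
}

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (is_search_bot_ua($ua)) {
    echo "keyword-optimized content for the spider";
} else {
    echo "regular sales page for the visitor";
}
```

In practice you would include two different page templates rather than echoing strings, but the branching logic is the same.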
Detecting the search engine robot's IP. The dynamic code checks the REMOTE_ADDR value that the web server passes along. If the IP address matches a known robot, it reveals the content to be cloaked; otherwise, it displays the page intended for the user. You can compile the user agent and IP lists by studying your log files (look for hits to robots.txt), or you can use the lists compiled by other webmasters:
http://www.user-agents.org/ List of Search Engine Robots' User Agents
http://iplists.com/ List of Search Engine Robots' IP addresses
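IP detection works the same way, keyed off REMOTE_ADDR instead. Here is a minimal sketch, assuming you match against IP prefixes; the addresses below are reserved documentation ranges, not real crawler addresses, so substitute a list from iplists.com or your own logs.

```php
<?php
// A minimal sketch of IP-based cloaking. The prefixes below are reserved
// documentation addresses (RFC 5737), NOT real search engine ranges --
// replace them with a list from iplists.com or your own log files.
function is_search_bot_ip($ip, $bot_prefixes) {
    foreach ($bot_prefixes as $prefix) {
        // Prefix match: '192.0.2.' covers the whole 192.0.2.x range.
        if (strpos($ip, $prefix) === 0) {
            return true;
        }
    }
    return false;
}

$bot_prefixes = array('192.0.2.', '198.51.100.');

$visitor_ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
if (is_search_bot_ip($visitor_ip, $bot_prefixes)) {
    echo "full article text for the spider";
} else {
    echo "subscription screen for the visitor";
}
```

IP detection is generally harder to fool than user-agent detection: anyone can fake a user agent string, but not the source address of an established TCP connection.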
Here is the PHP source code. Enjoy!
cloaking.zip
Paul Montwill
June 29, 2007 at 5:45 am
Is it possible to build a script that will pretend to be a search engine bot and collect the information that should only be available after logging in? I am just wondering about security of this script...
Hamlet Batista
June 29, 2007 at 10:41 am
Not sure what you are asking. A script that collects information after you are logged in is perfectly possible. The only way to prevent this is with a CAPTCHA.
eTown Landlord
July 1, 2007 at 11:51 am
good article bro. Nice island place too. I have just started using a little cloaking. I got it working in asp.net and use iplist.com for my info. I was wondering if you dynamically scrape iplist to update your list of ips? I was thinking about it but didn't. I would of course tell him i'm doing it b/c he specifically asks for it.
Hamlet Batista
July 1, 2007 at 6:26 pm
eTown, Thanks for your comment. iplists.com is a good place to start, but you can easily create your own lists and those you should trust more. Look in your log files for hits coming from search engine robots.
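Building your own list from the log files can be automated. Here is a sketch of that idea, assuming an Apache common/combined-format access log where the client IP is the first field on each line; the function name and log format are assumptions.

```php
<?php
// Sketch: collect unique client IPs that requested robots.txt from an
// access log. Assumes the Apache common/combined log format, where the
// first space-separated field on each line is the client IP.
function collect_bot_ips($log_file) {
    $ips = array();
    foreach (file($log_file) as $line) {
        if (strpos($line, 'robots.txt') !== false) {
            $fields = explode(' ', $line);
            $ips[$fields[0]] = true;  // keyed by IP to deduplicate
        }
    }
    return array_keys($ips);
}
```

Well-behaved crawlers fetch robots.txt before spidering a site, so those hits are a good starting point for an IP list you can trust.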
Advanced Cloaking Technique: How to feed password-protected content to search engine spiders » Hamlet Batista dot Com
September 3, 2007 at 1:02 am
[...] I explained before, cloaking is presenting different content to crawlers than we show regular users. Traditionally I [...]