I promised everybody that I’d be posting my presentation slides from my talk at the SMX Advanced Bot Herding panel, so here they are!
First, let me say that I was very excited to speak at a major search marketing conference, and I can say with confidence that all the traveling was definitely worth it. My only regret is that I did not get to finish my presentation. This was the first time I had spoken publicly, and as an inexperienced speaker I wasn't even watching the timer. My apologies to all those in attendance. 🙂 Frankly, I do think speakers should be allowed a little more time at SMX Advanced, as you really need time to lay the groundwork before delving deeply into these sorts of topics.
For those who couldn't make it, let me summarize the key takeaways from my talk and put them into context with Google's recent post on Webmaster Central:
Here is how Google's post defines it: "Cloaking: Serving different content to users than to Googlebot. This is a violation of our webmaster guidelines. If the file that Googlebot sees is not identical to the file that a typical user sees, then you're in a high-risk category. A program such as md5sum or diff can compute a hash to verify that two different files are identical."
Basically, Google says that geolocation and IP delivery (when used for geolocation purposes) are fine as long as you present the same content to Googlebot as you would present to a user coming from the same region. Altering the content the robot sees puts you in "a high-risk category." Google is so strict that it suggests using a checksum program to make sure you are delivering the same content. Obviously, it doesn't matter whether or not your intention is to improve the crawling and indexing of your site.
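To make that concrete, here is a minimal sketch of the kind of check Google is hinting at: fetch the same URL once as a regular browser and once as Googlebot, then compare MD5 checksums. The URL and user-agent strings are placeholders I made up; this is my own illustration, not code from Google's post.

```python
import hashlib
import urllib.request

# Placeholder URL and user-agent strings -- substitute your own.
URL = "http://www.example.com/"

USER_AGENTS = {
    "browser": "Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20080404 Firefox/2.0.0.14",
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}

def md5_of_page(url, user_agent):
    """Fetch a URL with the given User-Agent and return the MD5 hex digest of the body."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return hashlib.md5(response.read()).hexdigest()

checksums = {name: md5_of_page(URL, ua) for name, ua in USER_AGENTS.items()}
print(checksums)

if len(set(checksums.values())) == 1:
    print("Same content served to both user agents.")
else:
    print("Different content served -- the 'high-risk category' Google warns about.")
```

As a couple of the comments below point out, an exact hash comparison like this breaks down as soon as the page contains a timestamp, rotating content, or anything else dynamic, which is part of why I'm skeptical of the checksum advice.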
Why would you want to cloak anyway?
Let’s talk about the key scenarios I discussed in my speech:
– Search-unfriendly content management systems. According to Google, if you are using a proprietary CMS that does not give you the flexibility to make URLs search-engine friendly, uses cookie-based session IDs, or does not support unique titles and descriptions, you need to replace it with a newer one. Using a reverse proxy that cloaks to fix those issues is a "bad idea" (see the first sketch after this list). Again: easy for Google, hard for the customer.
– Rich media sites. If you use Scalable Inman Flash Replacement (sIFR), SWFObject, JavaScript, or CSS to render rich media content to the user and regular text to the search engine, then you are fine, because the checksums will be the same.
– Content behind forms. Google is experimenting with a bot that will try to pull content from simple forms by issuing HTTP GET requests with the values listed in the HTML (see the second sketch after this list).
– Free and paid content. Google recommends we register our premium content with Google News' First Click Free program. The idea is that you give searchers the first page of your content for free, and they have to register for the rest (see the third sketch after this list). This is very practical for newspapers that have resorted to cloaking in the past. I do see a problem with this technique for sites like SEOmoz, where some of the premium pages are guides that cost money. If SEOmoz signed up for this service, I would be able to pull all the guides by guessing search terms that would bring them up in the results.
– Site structure improvements
– Alternative to PageRank sculpting via "nofollow." I explained a clever technique where you cloak a different link path to robots than you present to regular users. The link path for users should be focused on ease of navigation, while the link path for robots should be focused on ease of crawling and deeper index penetration (see the fourth sketch after this list). This is very practical but not really mandatory.
– According to the post we don’t need to worry about this. Some good news at last!
– A/B and multivariate testing. This is a very interesting case, and I would have liked Google to cover it in the Webmaster Central post. Search engine robots don't take part in these JavaScript-based experiments because they don't execute JavaScript, yet many users will see a different version of the page than the robot sees. JavaScript-based cloaking will produce the same checksum for the page the bot sees and the page the user sees. I'm sure some clever black hats are taking advantage of this to do "approved" cloaking. 🙂
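First, the CMS scenario. What a reverse-proxy layer in front of an ugly CMS mostly does is rewrite URLs: drop session IDs and map query strings to clean, crawlable paths. Here is a minimal sketch of just that mapping step; the parameter names (PHPSESSID, cat, id) and the clean URL pattern are assumptions about a hypothetical CMS, not any particular product.

```python
from urllib.parse import urlparse, parse_qs

def clean_url(ugly_url: str) -> str:
    """Map an ugly, session-laden CMS URL to a search-friendly path.

    Purely illustrative: the parameter names (PHPSESSID, cat, id) and the
    clean URL pattern are assumptions about a hypothetical CMS.
    """
    parsed = urlparse(ugly_url)
    params = parse_qs(parsed.query)
    params.pop("PHPSESSID", None)             # drop the session ID entirely
    category = params.get("cat", ["page"])[0]
    item = params.get("id", ["index"])[0]
    return f"/{category}/{item}/"

print(clean_url("/store/item.php?PHPSESSID=a1b2c3&cat=widgets&id=42"))
# -> /widgets/42/
```

As I read Google's post, the objection kicks in when the proxy goes beyond this kind of cleanup and ends up serving the bot a different file than users get.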
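Second, the forms scenario. Here is a rough sketch of how a crawler could pull the values listed in a simple GET form and turn them into crawlable URLs. The form markup, field names, and base URL are all hypothetical; I have no inside knowledge of how Google's experimental bot actually works.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode, urljoin

# Hypothetical form a crawler might encounter on a page.
SAMPLE_HTML = """
<form action="/search" method="get">
  <select name="category">
    <option value="shoes">Shoes</option>
    <option value="hats">Hats</option>
  </select>
  <select name="color">
    <option value="red">Red</option>
    <option value="blue">Blue</option>
  </select>
</form>
"""

class FormParser(HTMLParser):
    """Collect the form action and the option values for each <select> field."""
    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}        # field name -> list of values
        self._current = None    # name of the <select> being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action", "")
        elif tag == "select":
            self._current = attrs.get("name")
            self.fields[self._current] = []
        elif tag == "option" and self._current:
            self.fields[self._current].append(attrs.get("value", ""))

    def handle_endtag(self, tag):
        if tag == "select":
            self._current = None

parser = FormParser()
parser.feed(SAMPLE_HTML)

# Build one GET URL per combination of the values listed in the HTML.
base = "http://www.example.com/"
names = list(parser.fields)
for combo in product(*(parser.fields[n] for n in names)):
    query = urlencode(dict(zip(names, combo)))
    print(urljoin(base, parser.action) + "?" + query)
```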
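Third, First Click Free. The server-side logic boils down to something like the function below: Googlebot and visitors clicking through from a Google results page get the full article, everyone else gets the teaser and a registration prompt. This is my own simplified sketch, with placeholder header names and return values, not Google's specification.

```python
from urllib.parse import urlparse

def serve_article(request_headers, full_article, teaser_plus_signup):
    """Return the full article to Googlebot and to visitors arriving from a
    Google results page; everyone else gets the teaser and a signup prompt.

    A deliberately simplified sketch -- the header names and the two content
    arguments are placeholders.
    """
    user_agent = request_headers.get("User-Agent", "").lower()
    referrer = request_headers.get("Referer", "")

    netloc = urlparse(referrer).netloc.lower()
    came_from_google = netloc == "google.com" or netloc.endswith(".google.com")
    is_googlebot = "googlebot" in user_agent

    if is_googlebot or came_from_google:
        return full_article
    return teaser_plus_signup
```

The weakness I mentioned for a site like SEOmoz falls straight out of this logic: anyone who can guess a query that ranks one of the premium guides takes the came_from_google branch and reads the guide for free.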
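Finally, the site-structure idea. The "different link path" technique reduces to a user-agent check like the sketch below, which hands crawlers a flatter, deeper-linking menu than the one human visitors see. The crawler tokens and navigation snippets are invented for illustration; whether this counts as acceptable is exactly the question Google's post raises.

```python
# A deliberately simplified sketch of serving one navigation block to people
# and a flatter, deeper link path to crawlers. The crawler tokens and menus
# are invented for illustration.

CRAWLER_TOKENS = ("googlebot", "slurp", "msnbot")

NAV_FOR_USERS = """
<ul class="nav">
  <li><a href="/products/">Products</a></li>
  <li><a href="/support/">Support</a></li>
</ul>
"""

NAV_FOR_ROBOTS = """
<ul class="nav">
  <li><a href="/products/widgets/blue-widget.html">Blue widget</a></li>
  <li><a href="/products/widgets/red-widget.html">Red widget</a></li>
  <li><a href="/support/install-guide.html">Install guide</a></li>
</ul>
"""

def navigation_for(user_agent: str) -> str:
    """Return the crawl-oriented link path for robots and the simpler menu
    for human visitors."""
    ua = user_agent.lower()
    if any(token in ua for token in CRAWLER_TOKENS):
        return NAV_FOR_ROBOTS
    return NAV_FOR_USERS

print(navigation_for("Mozilla/5.0 (compatible; Googlebot/2.1)"))
```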
Google = Romulans
Just like the Romulans from Star Trek, Google doesn’t want cloaking technology in the hands of everyone. I didn’t get to talk about this in my presentation, but let me speculate as to why Google is drawing such a hard line on cloaking: Simply put, it is the easiest, cheapest and most scalable solution for them.
1. As a developer, I can tell you that running checksums on the content presented to Googlebot vs. the content presented to their cloaking-detection bots is the easiest and most scalable way for them to detect cloaking.
2. Similar to the problem with paid links, it is easier to let us do all the work of labeling our sites so they can detect the bad guys, rather than dedicating a huge amount of resources to solving such problems.
Enjoy the slides and feel free to ask any questions. If you were there at SMX Advanced and watched me present, please let me know your honest comments. Criticism can only help me improve. Let me know what you think of the slides, too. Originally, I had planned to use more graphics than text, but ultimately I thought that the advanced audience would appreciate the added information.
Adam Audette
June 8, 2008 at 4:39 pm
Nice job with this Hamlet. It was great chatting with you at the conference, you've got some killer ideas and skills. See you at the next big show I hope.
Seomaniac
June 9, 2008 at 4:38 am
Hi Hamlet, Very interesting piece, presentation included. The technical bit was out of my league unfortunately, but I do get the general direction, and I thank you for showing me those options to pursue and expand my knowledge. Thumbs Up!
Levon
June 11, 2008 at 1:30 pm
Nice post. I recently posted a question about something similar. However, I am not so sure about this md5 thing you mentioned. If they pull a page twice using a different user agent, IP, etc., the two results can be different but not cloaked. For example, if the site displays the time, shows some icon based on the user agent or location, rotates content, and probably a few other things. Maybe you have an answer for the question I asked in Google Groups... A lot of people seem hell-bent on getting their homepage to rank for everything rather than their subpages. I think this is a bad idea because you'll have a load of weakling subpages and your homepage cannot be targeted for X different products. If this is the case, what would Google think of me redirecting people who have searched for certain keywords to the page of my choice? For example, if someone searches for Matt Cutts Voodoo Dolls, rather than them coming to the home page, could I not send them to my subpage for Google Engineer Voodoo Dolls?
Hamlet Batista
June 12, 2008 at 7:05 am
Thanks, Bastian. I look forward to meeting you at one of the search conferences. Again, changing/replacing a CMS (especially if it is an expensive one) is not always practical. Levon - You hit the nail on the head. I personally don't think the checksum is a good idea. Thanks for mentioning those practical examples. Google doesn't have a problem with you presenting different content to different people. Their problem is when you present different content to their robots. I've seen John Battelle doing this on his blog.
Levon
June 12, 2008 at 4:14 pm
Thanks for this. I was told otherwise by the folks in Google Groups.
u know who
June 13, 2008 at 4:33 am
Very happy to see that you posted this. Am at the office, so will be back to review tonight. Constructive feedback: make fewer points and use fewer slides. In particular, you told us what you would say 2-3 times; summarize that quickly in the intro and you're good. Check out Guy Kawasaki on presenting - good tips.
Andy
June 16, 2008 at 12:43 am
I had the same thoughts as Levon. I think they must have developed a method to give a similarity figure between page views rather than doing an exact match. Easy parameters to compare would be ratio of text to code, number of images, file size etc. I think a checksum can easily be made to be the same by dynamically adding extra characters in say a comment tag.