I recently worked on an enterprise-level client’s non-SEO-related project where the goal was to determine whether their new product:
1) Was doing anything that could be considered black hat.
2) Was providing any SEO benefit for their clients.
The problem you face with projects like this is that Google doesn’t provide enough information, and you cannot post corner-case questions like this in public webmaster forums. To do so would violate your NDA and potentially reveal your client’s intellectual property. So, what option do you have left? Well, you set up a honeypot!
A honeypot is a term that comes from the information security industry. Honeypots are a set of files that, to an automated program, appear like regular files, but they allow for the monitoring and “capturing” of specific viruses, e-mail harvesters, etc. In our case, we set up a honeypot with the purpose of detecting and tracking search engine bot behavior in specific circumstances. We also wanted to track the outcome (positive, neutral or negative) in the search engine results pages (SERPs).
Let me walk you through a few ways you can learn advanced SEO by using a honeypot.
Goals of the honeypot
First, let’s define the goals in terms of questions for which we don’t have public answers. Here are some interesting questions you and I might have:
1. Which search bots support the if-modified-since and/or the if-unmodified-since headers?
2. Is Googlebot really a headless browser?
3. Which search bots crawl AJAX URLs? Which ones support Google’s AJAX crawling scheme?
4. Does Google follow links inside PDFs? Do they count for indexation and rankings?
5. Does the in-page canonical tag carry more weight than the canonical link header?
Add your own questions to this list. For the purpose of this post, I’m going to explain how to go about answering the first question. The recent work I did for a client was related to AJAX-style fragment URLs. Unfortunately, I can’t share any details.
Setting up the Honeypot
The first thing you need to do is understand the problem really well. In our case, if-modified-since is a header that browsers and bots can send to a webserver, and the webserver will avoid resending a resource (image, document, page, video, etc.) if it hasn’t changed since the last time it was requested. The primary goal is to save bandwidth.
If-unmodified-since does the opposite: the server fulfills the request only if the resource hasn’t changed since the given date, and returns 412 Precondition Failed otherwise.
There is a technical protocol that HTTP clients and servers must obey, and a typical conversation looks like this:
CLIENT/BOT Request:
[RAW]GET / HTTP/1.1
Host: hamletbatista.com
If-Modified-Since: Thu, 26 Jan 2012 17:32:59 GMT[/RAW]
SERVER Response:
[RAW]HTTP/1.1 304 Not modified
Date: Thu, 26 Jan 2012 17:32:59 GMT[/RAW]
CLIENT/BOT Request:
[RAW]GET / HTTP/1.1
Host: hamletbatista.com
If-Unmodified-Since: Thu, 26 Jan 2012 17:32:59 GMT[/RAW]
SERVER Response:
[RAW]
HTTP/1.1 412 Precondition failed
[/RAW]
You can learn more about this here.
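To make the decision rule concrete, here is a minimal sketch (written in modern Python; the function name and signature are my own, not part of the honeypot) of the logic a server applies to these two conditional headers:

```python
from email.utils import parsedate_to_datetime
from datetime import datetime, timezone

def conditional_status(last_modified, if_modified_since=None,
                       if_unmodified_since=None):
    """Status code a server should return for a conditional GET."""
    if if_modified_since is not None:
        # Not modified since the client's date: skip the body, answer 304
        if last_modified <= parsedate_to_datetime(if_modified_since):
            return 304
    if if_unmodified_since is not None:
        # Modified since the client's date: the precondition fails, 412
        if last_modified > parsedate_to_datetime(if_unmodified_since):
            return 412
    return 200  # send the full resource

# The page was last modified exactly at the date the client sent
lm = datetime(2012, 1, 26, 17, 32, 59, tzinfo=timezone.utc)
print(conditional_status(lm, if_modified_since="Thu, 26 Jan 2012 17:32:59 GMT"))    # 304
print(conditional_status(lm, if_unmodified_since="Wed, 25 Jan 2012 00:00:00 GMT"))  # 412
```

Notice that 304 and 412 can only appear when the client actually sent one of these headers, which is exactly what makes them useful markers in a log.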
The most common way to follow these conversations between servers and bots is to set up and analyze traffic logs. However, the typical traffic log format does not store the ‘if-modified-since’ header. Sometimes it is practical to set up a custom log to track this information, but other times it isn’t.
Here is what a typical log entry looks like for a valid Googlebot request:
[RAW]
66.249.67.9 – – [26/Jan/2011:02:29:32 -0500] “GET / HTTP/1.1” 200 157 “-” “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”
[/RAW]
Getting the answers
One simple alternative is to look at the response code. For a request that includes the ‘if-modified-since’ header, the web server will return status code 200 if the page has changed, and status code 304 if it hasn’t. Conversely, it will return 412 if the client sent an ‘if-unmodified-since’ header and the resource has changed.
Because 200 can be returned even when the ‘if-modified-since’/’if-unmodified-since’ headers are not sent, the most reliable way to tell whether a request included the header we want to check is to track responses that returned 304 (nothing changed) or 412 (something changed).
You also want to make sure your web server supports the corresponding headers. You can use Firebug for this.
As you have probably guessed by now, it is easy to check whether Googlebot supports this header by scanning the traffic log for entries coming from Googlebot and seeing if the responses include the 304 or 412 status codes.
I wrote a simple log parsing script in Python to look for response codes 304 or 412 and see if any entry came from Googlebot. To make it work, you will need the excellent Python log parser, apachelog.
[python]
import apachelog, glob, sys

# Combined log format, with X-Forwarded-For in place of the client IP
# (the blog sits behind CloudFlare's reverse proxy)
format = r'%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"'
files = glob.glob("access_log*")
p = apachelog.parser(format)
for log in files:
    for line in open(log):
        try:
            data = p.parse(line)
            status = data['%>s']
            ua = data['%{User-agent}i']
            rq = data['%r']
            referrer = data['%{Referer}i']
            if '/feed/' not in rq and (status == '304' or status == '412'):
                #print referrer
                print rq
                print ua
                print status
        except:
            #sys.stderr.write("Unable to parse %s" % line)
            pass
[/python]
Here is the partial output:
[RAW]GET /feed/ HTTP/1.1
Netvibes (http://www.netvibes.com/; 58 subscribers; feedID: 1503582)
304
GET /feed/ HTTP/1.1
Alltop/1.1
304
GET /feed/ HTTP/1.1
Xianguo.com 1 Subscribers
304
GET /feed/ HTTP/1.1
Alltop/1.1
304
GET /feed/ HTTP/1.1
Alltop/1.1
304
GET /feed/ HTTP/1.1
Alltop/1.1
304
GET /feed/ HTTP/1.1
NetNewsWire/3.2.15 (Mac OS X; http://netnewswireapp.com/mac/; gzip-happy)
304[/RAW]
All entries came from newsreaders and related bots. There wasn’t a single entry from Googlebot or any other search bot.
Conclusion: No evidence of support.
I know I said I would only cover one example, but I feel like I need to give you a little bit more to get you really excited about this stuff.
Let’s say you didn’t think of looking at response codes to track ‘if-modified-since’, or you need to track which search bots support the canonical link header, or you want to know whether Googlebot requests compression. To track any of these easily, you need to log extra header information that is not part of the typical log setup.
This is how you do it:
- You create a separate log file so you don’t mess up the ability to use log analysis tools that rely on standard log formats.
- You filter this separate log so it only records the traffic you want to track. In our case, we want to track search bot traffic.
- You change the log format so it records the additional fields.
Here is the partial configuration I used to perform the tests for this post:
[RAW]SetEnvIf User-Agent ".*Googlebot/2.1.*" gbot
LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\" \"%{Accept-encoding}i\"" proxy2
# I use CloudFlare to speed up this blog, so I need to record the X-Forwarded-For instead of the reverse proxy IP address
CustomLog "|/usr/sbin/rotatelogs -l /var/www/hamletbatista/logs/googlebot_log.%Y-%m-%d 86400" proxy2 env=gbot[/RAW]
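Once a custom log like this is in place, checking something like compression support becomes a simple filter. Here is a sketch (the regular expression and function are my own, written to match the ‘proxy2’ format above) of how you could parse a line of that log and see whether the bot asked for gzip:

```python
import re

# Matches the custom "proxy2" log format defined above:
# XFF-IP %l %u [time] "request" status bytes "referer" "user-agent" "accept-encoding"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)" "(?P<accept_encoding>[^"]*)"$'
)

def bot_requests_gzip(line):
    """True if the logged request asked for gzip compression."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # line doesn't match the custom format
    return 'gzip' in m.group('accept_encoding').lower()

sample = ('66.249.67.9 - - [26/Jan/2012:02:29:32 -0500] "GET / HTTP/1.1" '
          '200 157 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
          '+http://www.google.com/bot.html)" "gzip,deflate"')
print(bot_requests_gzip(sample))  # True
```

The same pattern works for any extra header you add to the LogFormat line: capture it in its own quoted field and check for the value you care about.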
You don’t need to wait for Googlebot to come to the site to test your honeypot. You can use Google Webmaster Tools’ ‘Fetch as Googlebot,’ and Googlebot will come right away. The main difference I’ve seen using this method is that if you provide a URL with a redirect, Googlebot won’t follow it. The regular Googlebot crawler, however, will.
This post is just scratching the surface of all the possible insights you can gain by setting up honeypots to answer your more complex technical SEO questions. If you use this approach and get some really useful results, please make sure to share them in the comments.
Adam Audette
January 30, 2012 at 8:28 am
Hamlet, this is terrific stuff. We're doing more and more SEO testing these days. Recently the team set up a little honey pot to see if Googlebot follows links within iframe elements (based on this discussion: <a href="http://www.seroundtable.com/google-iframe-link-14558.html" rel="nofollow">http://www.seroundtable.com/google-iframe-link-14...</a>. It's always fascinating to see the results of these little science experiments. I definitely plan to keep pushing the tests, they can be pretty illuminating. Great job on this piece!
Hamlet Batista
January 30, 2012 at 8:28 am
I just realized that I forgot to loop over all the log files and also forgot to share the results of the second test :( I'll update the post to reflect this later Today
Hamlet Batista
January 30, 2012 at 8:38 am
Thanks, Adam! I agree. We can unlock so many hidden and valuable insights by performing SEO experiments like the one you and I have been doing. BTW, I'm just getting 're-started' here ;)
Devin
January 30, 2012 at 4:54 pm
Good to see you posting again! Missed your SEO knowledge
Amanda
January 31, 2012 at 6:19 am
Excellent post. Thanks!
Dennis Miedema
February 8, 2012 at 9:34 am
Great post! My new goal for 2012: get as much bees uhh sorry bots stuck in my honey as possible haha. I do have one question though: what if? What if your site does not represent the typical dataset, what if the query deserves freshness, what if... a single result isn't representative of the keyword query market, hell, of what Google will do a day from now since it always tweaks the ranking factors in minor ways?
Candice Medina
February 26, 2012 at 7:19 am
Great stuff Hamlet! Thank you for sharing this knowledge. I'm planning to make money from performing SEO tests and this post will really help.
Anil
March 26, 2012 at 11:43 pm
Thanks for sharing. I just started my career in SEO and going through articles like yours really helps.
Fren Dee Bee
April 25, 2012 at 10:18 am
Observing the behavior of Googlebot is really an advanced SEO technique. I have done on-page and off-page very often, but not yet on this level. Thanks Hamlet for sharing such a valuable info.