We have discussed before how to control Googlebot via robots.txt and robots meta tags. Both methods have limitations. With robots.txt you can block the crawling of any page or directory, but you cannot control indexing, caching or snippets. With the robots meta tag you can control indexing, caching and snippets, but only for HTML files, as the tag is embedded in the files themselves. You have no granular control over binary and other non-HTML files.
Until now. Google recently introduced another clever solution to this problem: you can now specify robots meta directives via an HTTP header. The new header is X-Robots-Tag, and it behaves like, and supports the same directives as, the regular robots meta tag: index/noindex, archive/noarchive, snippet/nosnippet and the new unavailable_after directive. This new technique makes it possible to have granular control over indexing, caching and other functions for any page on your website, no matter what type of content it is: PDF, Word document, Excel file, zip file, etc. This is all possible because we use an HTTP header instead of a meta tag. For non-technical readers, let me use an analogy to explain this better.
A web crawler basically behaves very much like a web browser: it requests pages hosted on web servers using a communications protocol called the Hypertext Transfer Protocol (HTTP). Each HTTP request and response has two elements: 1) the headers and 2) the content (a web page, for example). Think of each request/response like an e-mail, where the headers are the envelope that contains, among other things, the address of the requested page or the status of the request.
Here are a couple of examples of what an HTTP request and response look like. You normally don't see this, but it is a routine conversation your browser has every time you request a page.
Request->
GET / HTTP/1.1
Host: hamletbatista.com
User-Agent: Mozilla/5.0 … Gecko/20070713 Firefox/2.0.0.5
Connection: close
Response->
HTTP/1.1 200 OK
Date: Wed, 01 Aug 2007 00:41:47 GMT
Server: Apache
X-Robots-Tag: index,archive
X-Powered-By: PHP/5.0.3
X-Pingback: http://hamletbatista.com/xmlrpc.php
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
There are many standard headers, and the beauty of the HTTP protocol is that you can define your own proprietary headers. You only need to make sure they start with X- to avoid name collisions with future standard headers. This is the approach Google took, and it is a wise one.
How can you implement this?
This is the interesting part. You know I love this.
The simplest way to add the header is to have all your pages written in a dynamic language, such as PHP, and include one line of code at the top that sets the X-Robots-Tag header. For example:
<?php header('X-Robots-Tag: index,archive'); ?>
For this to work, that line needs to be at the very top of the dynamic page, before any output is sent to the browser.
Unfortunately, this strategy does not help us much, as we want to add the headers to non-text files, like PDFs, Word documents, and so on. I think I have a better solution.
Using Apache's mod_headers and mod_setenvif, we can control which files we add the header to as easily as we do with mod_rewrite for controlling redirects. Here is the trick.
SetEnvIf Request_URI "\.pdf$" is_pdf=yes
Header add X-Robots-Tag "index, noarchive" env=is_pdf
The first line sets an environment variable if the requested file is a PDF. SetEnvIf can check the request URI or any request header, and we can use any regular expression to match the files we want to add the header to.
The second line adds the header only if the environment variable is_pdf (you can name the variable anything you want) is set. We can add these rules to our .htaccess file. And voilà: we can now control very easily which files we add the header to.
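As a rough sketch, you can extend the same idea to several file types at once; the extension list and the is_binary variable name below are just illustrative, and SetEnvIfNoCase is used so the match ignores case:
SetEnvIfNoCase Request_URI "\.(pdf|doc|xls|zip)$" is_binary=yes
Header add X-Robots-Tag "index, noarchive" env=is_binary
As with the PDF example, these lines can go in your .htaccess file, provided mod_setenvif and mod_headers are enabled on the server.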
There are a lot of real-world uses for this technique. Let's say you offer a free PDF e-book on your site, but users have to subscribe to your feed to get it. It is very likely that Google will be able to reach the file, and smart visitors will pull the e-book from the Google cache to avoid subscribing. One way to avoid this is to let Google index the file but not serve a cached copy: index, noarchive. This is not something you can control with robots.txt, and we can't use robots meta tags because the e-book is a PDF file.
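For that e-book scenario specifically, a minimal sketch could target just the one file; the /downloads/free-ebook.pdf path and the is_ebook variable name are made up for illustration:
SetEnvIf Request_URI "/downloads/free-ebook\.pdf$" is_ebook=yes
Header add X-Robots-Tag "index, noarchive" env=is_ebook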
This is only one example, but I am sure users out there have plenty of other practical applications for this. Please share some other uses you can think of.
Sebastian
August 1, 2007 at 10:18 pm
You can go a step further and rewrite PDF requests to a PHP script where you can handle the values of X-Robots-Tag per file, or serve a shortened version to bots and lurkers and the full version to members ... <a href="http://sebastianx.blogspot.com/2007/07/handling-googles-neat-x-robots-tag.html" rel="nofollow">http://sebastianx.blogspot.com/2007/07/handling-g...</a> HTH Sebastian
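To make this idea a bit more concrete, here is a rough sketch of one way it could look; the serve-pdf.php name, the files/ directory and the per-file directive list are all hypothetical, and your rewrite rules may differ. First, a rewrite rule in .htaccess sends PDF requests to the script:
RewriteEngine On
RewriteRule ^files/(.+\.pdf)$ /serve-pdf.php?file=$1 [L,QSA]
And the script itself might look like this:
<?php
// serve-pdf.php - sketch of per-file X-Robots-Tag handling (hypothetical file names)
$directives = array(
    'free-ebook.pdf'   => 'index, noarchive',
    'members-only.pdf' => 'noindex, nofollow',
);
$file = isset($_GET['file']) ? basename($_GET['file']) : ''; // strip any path components
$path = dirname(__FILE__) . '/files/' . $file;
if (!isset($directives[$file]) || !is_file($path)) {
    header('HTTP/1.1 404 Not Found');
    exit;
}
header('X-Robots-Tag: ' . $directives[$file]);   // per-file robots directives
header('Content-Type: application/pdf');
header('Content-Length: ' . filesize($path));
readfile($path);
?>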
Hamlet Batista
August 2, 2007 at 6:22 am
Sebastian - Good job!
Mark
August 25, 2007 at 9:02 pm
I track traffic from my RSS feeds by appending tracking codes to the URLs. These pages are currently being indexed by Google, hence I have a duplicate content issue. I want to prevent the URLs in my RSS feed (the ones that have the tracking codes) from appearing in the SERPs. Do you know whether adding "X-Robots-Tag: noindex,nofollow" to my .rss pages will fix this problem? Is there any other method that you'd recommend? Regards, Mark
Hamlet Batista
August 27, 2007 at 4:45 am
Mark - Technically you don't add the X-Robots-Tag to the pages themselves; your web server or some other script needs to add it to the HTTP headers. If you only want to block access to your RSS feeds, deny access to them via your robots.txt file.
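As a rough sketch of those two options (assuming the feeds really do end in .rss and live under a /feed/ path, both assumptions just for illustration), you could either send the header from .htaccess:
SetEnvIf Request_URI "\.rss$" is_feed=yes
Header add X-Robots-Tag "noindex, nofollow" env=is_feed
or block crawling of the feeds entirely in robots.txt:
User-agent: *
Disallow: /feed/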
Noobliminal
September 13, 2007 at 6:40 am
What if ... I told you this has another use, one that can take link exchange back to the stone age? This is your white-hat use; check out my dark-side use. Fortunately it's not accessible to everyone, but still... Regards.
Hamlet Batista
September 13, 2007 at 7:54 am
Scott - I am afraid I agree with you. This is a very clever and effective black hat trick, and there is no way to detect it or protect against it.
Noobliminal
September 13, 2007 at 8:01 am
I'm clueless as to where you came up with the Scott name :) Just call me <strong>Noobliminal</strong> or Claude ... <em>PS: I'm 100% sure you don't know me, so if you take me for someone else ... :)</em>
Hamlet Batista
September 13, 2007 at 10:30 am
Claude - That is why it is good to include an about page on your blog. I hit the home page and it seems I picked the name from one of your random quotes. Sorry :-)
Noobliminal
September 13, 2007 at 10:44 am
You killed me... I almost spilled my Red Bull. It's a good thing you didn't call me Plato ;) (There are many quotes from him too!) I'm working on the about page (the big photo of me). There are so many things to say ... it's difficult. PS: You made my day! Thanks.
Robots.txt, meta-robots, nofollow ja hakukoneet | Nettibisnes.Info
November 13, 2007 at 8:07 am
[...] Robots meta tags in non-HTML files – the article on the official Google blog introduces not only the unavailable_after parameter but also the X-Robots-Tag directive, which can be used to influence the indexing of, for example, video, image, audio and PDF files. Sebastian and Hamlet Batista also explore the topic in more depth. [...]