First, let me thank my beloved reader, SEO Blog.
Thanks to him I got a really nice bump in traffic and several new RSS subscribers.
It is really funny how people who don't know you start questioning your knowledge, calling you names, and so on. I am glad that I don't take things personally. For me it was a great opportunity to get my new blog some exposure.
I did not intentionally try to be controversial. I ran a backlink check on John's site and reported the interesting results I found. I am still more inclined to believe that my theory has more grounds than SEO Blog's. Please keep reading to learn why.
His theory is that John fixed the problem by making some substantial changes to his robots.txt file. I am really glad that he finally decided to dig for evidence. That is far more professional than calling people you don't know names.
I carefully compared both robots.txt files, and here is what John removed in the new version:
# Disallow all monthly archive pages
Disallow: /2005/12
Disallow: /2006/01
Disallow: /2006/02
Disallow: /2006/03
Disallow: /2006/04
Disallow: /2006/05
Disallow: /2006/06
Disallow: /2006/07
Disallow: /2006/08
Disallow: /2006/09
Disallow: /2006/10
Disallow: /2006/11
Disallow: /2006/12
Disallow: /2007/01
Disallow: /2007/02
Disallow: /2007/03
Disallow: /2007/04
Disallow: /2007/05

# The Googlebot is the main search bot for google
User-agent: Googlebot

# Disallow all files ending with these extensions
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.gz$
Disallow: /*.wmv$
Disallow: /*.tar$
Disallow: /*.tgz$
Disallow: /*.cgi$
Disallow: /*.xhtml$

# Disallow Google from parsing individual post feeds and trackbacks..
Disallow: */feed/
Disallow: */trackback/

# Disallow all files with ? in url
Disallow: /*?*
Disallow: /*?

# Disallow all archived monthlies
Disallow: /2006/0*
Disallow: /2007/0*
Disallow: /2005/1*
Disallow: /2006/1*
Disallow: /2007/1*
In English, this means he is now letting Google crawl and index his archived articles, dynamic pages, and files ending in ".php", ".js", ".inc", ".css", and so on. Note that in neither robots.txt file does John prevent the crawler from accessing his home page or his regular posts. WordPress runs on PHP, but regular posts and the home page can be reached at URLs that don't end in ".php".
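To see what those wildcard rules actually match, here is a rough sketch in Python. This is my own approximation of how Googlebot treats the "*" and "$" extensions; it ignores Allow directives and longest-match precedence, so take it as illustrative only:

```python
import re

def rule_to_regex(pattern):
    """Convert a robots.txt Disallow pattern (with Googlebot's '*' and '$'
    extensions) into an anchored regular expression."""
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        # '$' at the end of a pattern anchors the match to the end of the URL
        regex = regex[:-2] + "$"
    return re.compile("^" + regex)

def blocked(path, disallow_rules):
    """Return True if any Disallow rule matches the URL path."""
    return any(rule_to_regex(r).match(path) for r in disallow_rules)

# A few of the rules John removed
rules = ["/*.php$", "/*?", "/2007/0*", "*/trackback/"]

print(blocked("/index.php", rules))           # True: ends with .php
print(blocked("/2007/05/some-post/", rules))  # True: monthly archive
print(blocked("/make-money-online/", rules))  # False: a regular post
```

As the last line shows, a regular WordPress permalink without a date or extension slips past every one of those rules, which is why the home page and normal posts were never blocked.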
If this was the change that fixed the problem, it might be because removing those internal pages from the spider's view weakened his internal link structure. His claim is not without merit.
Now, here is one tiny detail that my friend is missing. To prove his point, he used Google's cache to show the older version of the robots.txt file. If Google still has that version in its cache, what makes him think Google is already using the new one? Google should be caching the new version, not the old one. That is why I am still not convinced that this is the reason for the fix.
John says he is not telling, because a reader warned that Google might change their algorithm and drop him again. How do the changes John made to his robots.txt file have anything to do with algorithm changes? I am just curious.
In reality, we can theorize all we want, but the only ones who can tell for sure are the folks at the Googleplex. John probably tried many different things, and one or several of them worked. He is probably not even sure which one did.
How did I learn SEO?
SEO Blog suggests I visit his forum to learn SEO. Here is the problem with that: I am a technical guy; I cannot take gut feelings or opinions as truth. I do visit some forums and blogs every now and then, but in my experience the noise-to-signal ratio is too high. I prefer to learn and get my insights from the source: search engine research papers, search engine representatives' blogs, and my own experiments.
I learned SEO back in 2002 when I read this paper. Back then, nobody was even talking about Google bombs, anchor text, and the like. Read the paper; it is all there.
Jez
June 7, 2007 at 3:04 pm
Hi Hamlet, I have been following this issue too and think you have made a bit of an error... As you know, the cache is always a few days old, but the robots.txt file is analysed on the day of the crawl, in "real time". If Google never let go of the cached file, how would it ever crawl the site again??? The actual crawl runs ahead of the cache, but you already know this...

One thing you may not have seen is this post on JC: <a href="http://www.johnchow.com/getting-out-of-the-google-supplemental-index/" rel="nofollow">http://www.johnchow.com/getting-out-of-the-google...</a> A few days earlier the robots.txt file was changed for the reasons outlined in the above post... Give it a few days for the denied pages to be dropped, a couple of days for users to report the drop in the SERPs, and the timing is about right for John's "google ban". Then the latest robots.txt file reverses what had been done, re-allows the supplemental pages, and things return to normal. <b>What we should have checked was whether the supplemental pages were back in the cache.</b>

I think JC made a blunder in blocking his supplemental pages, simple as that. Does anyone really believe Google would change their algorithm because of John Chow!!!! I think you have to bear in mind that JC survives on hype, spin and reader manipulation; that's what his site exemplifies. I think he has created a lot of buzz and mystery out of his own %$£% up... that's what he is good at.
Hamlet Batista
June 7, 2007 at 3:40 pm
Jez, thanks for your comment. I am really glad to have experts visiting my blog. Please note that I did not rule out the robots.txt changes as the solution to the problem. <blockquote>If this was the change that fixed the problem, it might be because removing those internal pages from the spider's view weakened his internal link structure. His claim is not without merit.</blockquote> I am not sure I follow part of your conclusions. <blockquote>… as you know the cache is always a few days old, but the robots.txt will be analysed on the day of the crawl, in “real time”. If Google never let go of the cached file, how would it ever crawl the site again???</blockquote> The robots.txt file is "analyzed" (parsed) in real time, but the results of that need to be reflected when the index is updated. Search engines first crawl and then index pages. Dropping pages implies a modification to the index (as a result of a crawl). To me, the pages in the cache are the pages that are affecting the current index. I might be wrong, but I would need some research papers to tell me otherwise.

Again, I am not ruling out the robots.txt change as the solution to his problem. At one moment, I thought he had blocked access to his regular posts by misusing the wildcards, e.g. Disallow: /2007/1*. My blog includes the date in normal post URLs, but I checked his and it doesn't.
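To make that last point concrete, here is a quick sketch. The matcher below mimics only the "*" wildcard (no Allow rules, no precedence), and the permalinks are made-up examples, not John's actual URLs:

```python
import re

def matches(pattern, path):
    """True if a robots.txt Disallow pattern (treating '*' as any
    sequence of characters) matches the start of the URL path."""
    regex = "^" + re.escape(pattern).replace(r"\*", ".*")
    return re.match(regex, path) is not None

rule = "/2007/1*"

# A date-based permalink (my blog's style) would have been blocked:
print(matches(rule, "/2007/12/my-post/"))    # True

# A name-only permalink (John's style) is unaffected:
print(matches(rule, "/make-money-online/"))  # False
```

That is why the wildcard rules could wipe out his monthly archives without ever touching the posts themselves.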
Jez
June 8, 2007 at 12:43 am
Hi Hamlet, sorry if I did not read your post thoroughly enough... Your points are interesting; it could well have been the anchor text, or perhaps a mix of anchor text and robots.txt changes. I thought for some time that JC should have asked users to use a mix of different link texts. If it had been me, I would have asked users to collect the link text from a dynamic page that rotated a number of different permutations.... As for being an expert, far from it; I am here to learn ;-)
Jez
June 8, 2007 at 12:45 am
Oh yes, the point I was trying to make about the cache was that although the old robots.txt file was still cached, it is possible that the re-allowed pages had already been re-indexed. I did not explain myself well....
Hamlet Batista
June 8, 2007 at 7:11 am
<blockquote>If it had been me, I would have asked users to collect the link text from a dynamic page that rotated a number of different permutations….</blockquote> That doesn't sound like newbie stuff to me :-) I am working on a post where I am dissecting Google's original paper. Hopefully we all can learn something valuable from it.
Jez
June 8, 2007 at 10:45 am
Hi Hamlet, I am no stranger to code; I work in that field, but most of my experience has been on intranets... I currently manage a large installation (9 instances) of moodle.org for a university... but there is no SEO requirement for this work... SEO is something I am interested in learning more about...
Hamlet Batista
June 8, 2007 at 11:42 am
Jez, I'm glad to have other developers visiting my blog. Hopefully you can put some of the code in my posts to work. I appreciate any feedback. It's amazing how we can find open source code for pretty much everything. Moodle.org looks very interesting.
Jez
June 10, 2007 at 12:47 am
If I get time ;-) I notice some of it uses Python; it's been a long time since I've used Python, but I may have a play with it. Jez
Siaar
January 20, 2009 at 6:42 am
Really, it is working: I made some changes to my blog using the same techniques you wrote about above. And I would like to purchase the software you added at the bottom of your article.... Siaar of Siaar Group