First, let me thank my beloved reader, SEO Blog.
Thanks to him I got a really nice bump in traffic and several new RSS subscribers.
It is really funny how people who don't know you start questioning your knowledge, calling you names, and so on. I am glad that I don't take things personally. For me it was a great opportunity to get my new blog some exposure.
I did not intentionally try to be controversial. I ran a backlink check on John's site and reported the interesting results I found. I am still more inclined to believe that my theory has more grounds than SEO Blog's. Please keep reading to learn why.
His theory is that John fixed the problem by making some substantial changes to his robots.txt file. I am really glad that he finally decided to dig for evidence. That is far more professional than calling people you don't know names.
I carefully compared both robots.txt files, and here is what John removed in the new version:
# Disallow all monthly archive pages
Disallow: /2005/12
Disallow: /2006/01
Disallow: /2006/02
Disallow: /2006/03
Disallow: /2006/04
Disallow: /2006/05
Disallow: /2006/06
Disallow: /2006/07
Disallow: /2006/08
Disallow: /2006/09
Disallow: /2006/10
Disallow: /2006/11
Disallow: /2006/12
Disallow: /2007/01
Disallow: /2007/02
Disallow: /2007/03
Disallow: /2007/04
Disallow: /2007/05

# The Googlebot is the main search bot for google
User-agent: Googlebot

# Disallow all files ending with these extensions
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.gz$
Disallow: /*.wmv$
Disallow: /*.tar$
Disallow: /*.tgz$
Disallow: /*.cgi$
Disallow: /*.xhtml$

# Disallow Google from parsing individual post feeds and trackbacks..
Disallow: */feed/
Disallow: */trackback/

# Disallow all files with ? in url
Disallow: /*?*
Disallow: /*?

# Disallow all archived monthlies
Disallow: /2006/0*
Disallow: /2007/0*
Disallow: /2005/1*
Disallow: /2006/1*
Disallow: /2007/1*
In English, this means he is now letting Google crawl and index his archived articles, dynamic pages, and files ending in ".php", ".js", ".inc", ".css", and so on. Note that in neither robots.txt file does John prevent the crawler from accessing his home page or his regular posts. WordPress runs on PHP, but regular posts and the home page can be reached at URLs that don't end in ".php".
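To see what those wildcard rules actually match, here is a rough sketch in Python. This is my own approximation of how Googlebot treats the "*" and "$" extensions; it ignores Allow directives and longest-match precedence, so take it as illustrative only:

```python
import re

def rule_to_regex(pattern):
    """Convert a robots.txt Disallow pattern (with Googlebot's '*' and '$'
    extensions) into an anchored regular expression."""
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        # '$' at the end of a pattern anchors the match to the end of the URL
        regex = regex[:-2] + "$"
    return re.compile("^" + regex)

def blocked(path, disallow_rules):
    """Return True if any Disallow rule matches the URL path."""
    return any(rule_to_regex(r).match(path) for r in disallow_rules)

# A few of the rules John removed
rules = ["/*.php$", "/*?", "/2007/0*", "*/trackback/"]

print(blocked("/index.php", rules))           # True: ends with .php
print(blocked("/2007/05/some-post/", rules))  # True: monthly archive
print(blocked("/make-money-online/", rules))  # False: a regular post
```

As the last line shows, a regular WordPress permalink without a date or extension slips past every one of those rules, which is why the home page and normal posts were never blocked.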
If this was the change that fixed the problem, it might be because removing those internal pages from the spider's view weakened his internal link structure. His claim is not without merit.
Now, here is one tiny detail that my friend is missing. To prove his point, he used Google's cache to show the older version of the robots.txt file. If Google still has that version in its cache, what makes him think Google is already using the new one? Google should be caching the new version, not the old one. That is why I am still not convinced that this is the reason for the fix.
John says he is not telling, because a reader warned that Google might change their algorithm and drop him again. How do the changes John made to his robots.txt file have anything to do with algorithm changes? I am just curious.
In reality, we can theorize all we want, but the only ones who can tell for sure are the folks at the Googleplex. John probably tried many different things, and one or several of them worked. He is probably not even sure which one did.
How did I learn SEO?
SEO Blog suggests I visit his forum to learn SEO. Here is the problem with that: I am a technical guy; I cannot take gut feelings or opinions as truth. I do visit some forums and blogs every now and then, but in my experience the noise-to-signal ratio is too high. I prefer to learn and get my insights from the source: search engine research papers, search engine representatives' blogs, and my own experiments.
I learned SEO back in 2002 when I read this paper. Back then, nobody was even talking about Google bombs, anchor text, and the like. Read the paper; it is all there.
Jez
June 7, 2007 at 3:04 pm
Hi Hamlet, I have been following this issue too and think you have made a bit of an error... As you know, the cache is always a few days old, but the robots.txt file is analysed on the day of the crawl, in "real time". If Google never let go of the cached file, how would it ever crawl the site again??? The actual crawl runs ahead of the cache, but you already know this...

One thing you may not have seen is this post on JC: <a href="http://www.johnchow.com/getting-out-of-the-google-supplemental-index/" rel="nofollow">http://www.johnchow.com/getting-out-of-the-google...</a> A few days earlier the robots.txt file was changed for the reasons outlined in the above post... Give it a few days for the denied pages to be dropped, a couple of days for users to report the drop in the SERPs, and the timing is about right for John's "google ban". Then the latest robots.txt file reverses what had been done, re-allows the supplemental pages, and things return to normal. <b>What we should have checked was whether the supplemental pages were back in the cache.</b>

I think JC made a blunder in blocking his supplemental pages, simple as that. Does anyone really believe Google would change their algorithm because of John Chow!!!! I think you have to bear in mind that JC survives on hype, spin and reader manipulation; that's what his site exemplifies. I think he has created a lot of buzz and mystery out of his own %$£% up... that's what he is good at.
Hamlet Batista
June 7, 2007 at 3:40 pm
Jez, thanks for your comment. I am really glad to have experts visiting my blog. Please note that I did not rule out the robots.txt changes as the solution to the problem. <blockquote>If this was the change that fixed the problem, it might be because removing those internal pages from the spider's view weakened his internal link structure. His claim is not without merit.</blockquote> I am not sure I follow part of your conclusions. <blockquote>… as you know the cache is always a few days old, but the robots.txt will be analysed on the day of the crawl, in “real time”. If Google never let go of the cached file, how would it ever crawl the site again???</blockquote> The robots.txt file is "analyzed" (parsed) in real time, but the results of that need to be reflected when the index is updated. Search engines first crawl and then index pages. Dropping pages implies a modification to the index (as a result of a crawl). To me, the pages in the cache are the pages that are affecting the current index. I might be wrong, but I would need some research papers to tell me otherwise.

Again, I am not ruling out the robots.txt change as the solution to his problem. At one moment, I thought he had blocked access to his regular posts by misusing the wildcards, e.g. Disallow: /2007/1*. My blog includes the date in normal post URLs, but I checked his and it doesn't.
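To make that last point concrete, here is a quick sketch. The matcher below mimics only the "*" wildcard (no Allow rules, no precedence), and the permalinks are made-up examples, not John's actual URLs:

```python
import re

def matches(pattern, path):
    """True if a robots.txt Disallow pattern (treating '*' as any
    sequence of characters) matches the start of the URL path."""
    regex = "^" + re.escape(pattern).replace(r"\*", ".*")
    return re.match(regex, path) is not None

rule = "/2007/1*"

# A date-based permalink (my blog's style) would have been blocked:
print(matches(rule, "/2007/12/my-post/"))    # True

# A name-only permalink (John's style) is unaffected:
print(matches(rule, "/make-money-online/"))  # False
```

That is why the wildcard rules could wipe out his monthly archives without ever touching the posts themselves.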
Jez
June 8, 2007 at 12:43 am
Hi Hamlet, sorry if I did not read your post thoroughly enough... Your points are interesting; it could well have been the anchor text, or perhaps a mix of anchor text and robots.txt changes. I thought for some time that JC should have asked users to use a mix of different link texts. If it had been me, I would have asked users to collect the link text from a dynamic page that rotated a number of different permutations.... As for being an expert, far from it; I am here to learn ;-)
Jez
June 8, 2007 at 12:45 am
Oh yes, the point I was trying to make about the cache was that although the old robots.txt file was still cached, it is possible that the re-allowed pages had already been re-indexed. I did not explain myself well....
Hamlet Batista
June 8, 2007 at 7:11 am
<blockquote>If it had been me, I would have asked users to collect the link text from a dynamic page that rotated a number of different permutations….</blockquote> That doesn't sound like newbie stuff to me :-) I am working on a post where I am dissecting Google's original paper. Hopefully we all can learn something valuable from it.
Jez
June 8, 2007 at 10:45 am
Hi Hamlet, I am no stranger to code; I work in that field, but most of my experience has been on intranets... I currently manage a large installation (9 instances) of moodle.org for a university... but there is no SEO requirement for this work... SEO is something I am interested in learning more about...
Hamlet Batista
June 8, 2007 at 11:42 am
Jez, I'm glad to have other developers visiting my blog. Hopefully you can put some of the code in my posts to work. I appreciate any feedback. It's amazing how we can find open source code for pretty much everything. Moodle.org looks very interesting.
Jez
June 10, 2007 at 12:47 am
If I get time ;-) I notice some of it uses Python; it's been a long time since I've used Python, but I may have a play with it. Jez
Siaar
January 20, 2009 at 6:42 am
Really, it is working: I made some changes to my blog using the same techniques you wrote about above. And I would like to purchase the software you added at the bottom of your article.... Siaar of Siaar Group