Why should I care about a Robots.txt file? If I don't have
one, all robots are welcome to crawl my site.
That's what I thought.
I recently learned that Google Adsense
was not crawling my site due to the absence
of a Robots.txt file. Now I have a motive
to learn more.
The construction of a Robots.txt file is rather
interesting. If you don't mention a robot
by name, Robots.txt has no effect. That's
the first and last thing to remember about
Robots.txt.
Here's another wrinkle on mentioning a robot
by name in Robots.txt. You can do it with
a wildcard, and the wildcard is an asterisk.
If you use an asterisk, you are mentioning
every possible robot at once.
It took me a while to get the hang of this.
However, it does make sense. It works a
little bit like the law does in a free
society. You are at liberty to do something
unless prohibited by the law.
Here's an example:
User-agent: *
Disallow: /tmp
Disallow: /logs
In the above example, the User-agent line
specifies which user agents (robots) are not
going to be allowed to do something. In this
case, it is all robots, because the asterisk
stands for every user agent.
The two Disallow lines tell the robots
specified (all robots, in this case)
where they cannot go.
This is all much simpler than it might appear. We
are really working with just two keywords:
- User-agent
- Disallow
Since we are only working with two keywords, the
only thing we can do with Robots.txt is prevent
a robot, or a group of robots, from visiting a
certain portion of our website. It's that simple.
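For instance, here is what it looks like to keep a
single named robot out of a single folder. Googlebot
is Google's main crawler; the folder /private is just
a made-up example:
User-agent: Googlebot
Disallow: /private
Every other robot, and every other folder, is left
alone, because nothing else is mentioned.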
Note that we could potentially stop all user agents
from visiting our entire website:
User-agent: *
Disallow: /
Because we are starting at the root of the website
when we use the Disallow keyword, everything
from the root on down is disallowed. In other words,
there is not a single robot in the world that is allowed
to visit a single page of our website. The whole thing
is disallowed.
How do we specify the root of the website? With a
single slash. A single slash means start at the
root folder and include all folders below
the root folder. All folders below the root folder
are, quite simply, all folders. In other words, a
single slash encompasses every folder on your website
that is visible to the public.
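To see the difference a path makes, compare the single
slash with a slash followed by a folder name. The folder
/images here is only an example:
User-agent: *
Disallow: /
keeps every robot out of everything, while
User-agent: *
Disallow: /images
keeps every robot out of the /images folder (and anything
inside it) but leaves the rest of the site alone.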
That's how you disallow robots. How do you allow
them?
Perhaps the hardest thing to get the hang of is that
you allow something by not mentioning it at all. In
other words, there is no way to say, "Be sure to
visit my website!" Since you can only disallow
things, you can only welcome robots by not mentioning
them.
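Here's a small sketch of allowing by silence. The robot
name BadBot is made up for the sake of the example:
User-agent: BadBot
Disallow: /
This file shuts BadBot out of everything and says nothing
about anyone else. Every robot that isn't named BadBot is
welcome everywhere, simply because it was never mentioned.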
I would imagine that for people who are used to giving
invitations to parties, this might be difficult to grasp.
How do you invite someone to a party by not mentioning
them?
If your website is a party, and a specific robot is a
party-goer, you invite the robot to the party by saying
nothing. On a human level, this may seem weird. However,
on a machine level, this works very very well.
If you can't invite anyone specifically to the party, how
do you issue a party invitation? Again, the answer is
simple. The existence of your website, and the fact
that other websites link to it, is your party invitation.
By putting something on the web, you are inviting everyone
to look at it, unless otherwise specified.
This makes more sense if you think of the web as a public
forum that anyone can participate in. There's no need to
invite anyone to the party because they are already
invited.
Here's a good lesson that can come of all of this. Don't
make trouble where there is none. Don't invite a robot
to visit your website unless you have to.
Now here's an odd twist on all of this. Some robots want
an explicit invitation even though it would seem that none
is required. Here's how to invite the robot for Google
Adsense:
User-agent: Mediapartners-Google
Disallow:
Frankly, I'm a bit confused by all of this. I don't understand
why Google Adsense seems to want a specific invitation
when none should be necessary. I'm not an expert on Robots.txt.
I'm just trying to learn.
Why do they need this line? Why do you need a Robots.txt file
at all to participate in Google Adsense? I'm really not
sure. If it sounds like I'm confused, it is because I am.
In any case, the invitation, or the seeming invitation, is
accomplished by saying absolutely nothing after Disallow.
If you disallow nothing, you are allowing everything. Weird
logic, isn't it?
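The contrast is easier to see side by side:
User-agent: Mediapartners-Google
Disallow:
lets that robot go everywhere, while
User-agent: Mediapartners-Google
Disallow: /
would shut it out of the entire site. The only difference
is the single slash.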
OK. I'm back and I've learned more. I read somewhere a few
days ago that a blocked URL is nothing to worry about if it
is Google blocking its own cache. This is precisely what is
happening in my case.
Google provides a diagnostics page for its Adsense customers
and it turns out that Google is giving me a false diagnostic.
Basically Google is saying that it is blocking URLs that mention
Google's own cache. This is not helpful as this is a false alarm.
As it turns out, I really don't need a robots.txt file. I may
erase mine, since I'm not trying to block access to my
site, and blocking access is the whole purpose of a
robots.txt file.
I ended up learning more about robots.txt than I really
needed to know.
I've also learned an old lesson all over again. Don't believe
everything you read. Why Google Adsense wants me to
create a robots.txt file when I don't need one is not
at all clear.
Here's the webpage where Google Adsense tells you to
create a robots.txt file:
Why is my URL showing up as blocked?
At some point, I may comment on this page on the
page itself. By the time you read this, the bad
advice that Google seems to be offering may have
changed.
Ed Abbott