Over the summer, I had the chance to work on an iCalendar feed for the calendar app in the Table. It was one of the most challenging projects I have worked on to date, and for the most part, I had a blast. However, I ran into several issues while testing my implementation with calendar clients. Many of them are not very interesting, but one took me far longer to figure out than I’d care to admit: getting the feed to work with Google Calendar. Google was simply unable to access the feed, leading me down an annoying road of server configuration tweaking, DNS testing, and finally, just waiting to see if their reader was lagging for some reason1.
The solution was much simpler than any of those things. It turns out that Google’s calendar reader requires you to have a robots.txt file present in order to access your iCalendar feed. Unfortunately, they provide no debug information to help you realize this is the issue. Their help is also rather, um, unhelpful.
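For anyone hitting the same wall: the fix is just to serve a robots.txt that doesn’t block anything. A fully permissive one is two lines (an empty Disallow means “nothing is off limits”):

```
# Served at /robots.txt — allow every crawler to fetch everything
User-agent: *
Disallow:
```

You could also scope it to `User-agent: Googlebot` if you only care about Google’s reader, but the wildcard version is the simplest thing that satisfies their check.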
“Your robots.txt file might prohibit the Googlebot entirely; it might prohibit access to the directory in which this URL is located; or it might prohibit access to the URL specifically. Often, this is not an error. You may have specifically set up a robots.txt file to prevent us from crawling this URL. If that is the case, there’s no need to fix this; we will continue to respect robots.txt for this file.”
This phrasing suggests that I am actively preventing them from accessing the feed, when in reality, I don’t even have a robots.txt file in place. The Table is a private application that currently has no public pages aside from the calendar feed, so we never went out of our way to implement one.
If I take a step back from the solution, I think this is a misapplication of a robots.txt file. The stated purpose2 of a robots.txt file is to “give instructions to visiting Web robots, most importantly what areas of the site are to be avoided”. It goes on to define robots as “Web client programs that automatically traverse the Web’s hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.”
According to that definition, Google Calendar is not a crawler - it is a program accessing a resource explicitly provided by an end user. Even if we grant it the classification of a robot, though, the absence of a robots.txt file should imply consent from the site owner for the reader to access the feed. This may be the only case I can think of where Google is over-applying privacy!