2007-08-13
Using the Blogger Data APIs to Fix Markup Errors
A few days ago, I clicked the Validate HTML link down there on the left for the first time in years, and discovered some carelessness in the HTML of my blog posts. I write these posts as HTML in Emacs before copying them across into the Blogger posting form. The Emacs HTML mode features syntax coloring that makes it easy to see when characters need escaping as character entity references. Except, that is, when those characters appear in attribute values: attribute values appear in a uniform shade of pink. The result is that when writing several posts, I copied URIs containing ampersands into href attributes, and didn't notice that the ampersands needed replacing with &amp;amp; entities. And this is the issue that the HTML Validator flagged. Browsers are able to handle these attributes without problems, but this is not something I could leave alone. And fixing the problematic posts is a perfect opportunity to play with the Blogger Data APIs!
I started with the Python client libraries provided by Google, and working from the developer's guide and sample code, I was quickly able to come up with a fairly clean 50-line Python program to do the job. Modifying that program to correct other simple markup errors should be straightforward.
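The markup fix itself needs nothing beyond the standard library. What follows is a minimal sketch of that step, not the program described above: the function name escape_href_ampersands and both regular expressions are made up for illustration, and it assumes it is handed the post's HTML after the feed's XML escaping has been undone.

```python
import re

# An href attribute with a single- or double-quoted value.
HREF_RE = re.compile(r'''(\bhref\s*=\s*)(["'])(.*?)\2''',
                     re.IGNORECASE | re.DOTALL)

# An ampersand that does not already begin a character reference
# such as &amp;, &#38; or &#x26;.
BARE_AMP_RE = re.compile(
    r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)')


def escape_href_ampersands(html):
    """Replace bare ampersands inside href attribute values with &amp;."""
    def fix(match):
        prefix, quote, value = match.groups()
        return prefix + quote + BARE_AMP_RE.sub('&amp;', value) + quote
    return HREF_RE.sub(fix, html)


if __name__ == '__main__':
    sample = '<a href="http://example.com/?a=1&b=2">AT&T</a>'
    print(escape_href_ampersands(sample))
    # prints: <a href="http://example.com/?a=1&amp;b=2">AT&T</a>
```

Because ampersands that already start a character reference are skipped, running the function twice gives the same result as running it once. The price of that is that a query string which happens to contain something like &lt; followed by a semicolon is left alone, so a regular expression is only a rough substitute for a real HTML parser here.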
Draft blog entries came in useful for testing this program. The draft entries are marked in the feed by the app:draft element from the Atom Publishing Protocol. So it was easy to restrict the program to update only draft entries, then create a draft entry containing my test cases, and review the changes made by the program to that draft entry through the Blogger UI. When I was ready to entrust it with my precious blog posts, I removed that draft check. This approach was quicker than creating a whole new test blog just to hold test entries.
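For illustration, here is roughly what such a draft check looks like when applied to the raw Atom XML with the standard library, rather than through the GData client objects my program used. The helper names are hypothetical, and the namespace URIs are an assumption: the published Atom Publishing Protocol (RFC 5023) uses http://www.w3.org/2007/app, while feeds based on earlier drafts of the protocol used http://purl.org/atom/app#, so the sketch accepts either.

```python
import xml.etree.ElementTree as ET

ATOM_NS = 'http://www.w3.org/2005/Atom'
# Assumption: accept both the RFC 5023 namespace and the pre-RFC draft one.
APP_NAMESPACES = ('http://www.w3.org/2007/app', 'http://purl.org/atom/app#')


def is_draft(entry):
    """True if the entry has <app:control><app:draft>yes</app:draft>."""
    for ns in APP_NAMESPACES:
        draft = entry.find('{%s}control/{%s}draft' % (ns, ns))
        if draft is not None and (draft.text or '').strip().lower() == 'yes':
            return True
    return False


def draft_entries(feed_xml):
    """Return the entry elements of an Atom feed that are marked as drafts."""
    feed = ET.fromstring(feed_xml)
    return [e for e in feed.findall('{%s}entry' % ATOM_NS) if is_draft(e)]


SAMPLE_FEED = """\
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:app="http://purl.org/atom/app#">
  <entry>
    <title>Published post</title>
  </entry>
  <entry>
    <title>Test cases</title>
    <app:control><app:draft>yes</app:draft></app:control>
  </entry>
</feed>
"""

if __name__ == '__main__':
    for entry in draft_entries(SAMPLE_FEED):
        print(entry.find('{%s}title' % ATOM_NS).text)
```

Going live then amounts to iterating over every entry instead of only those returned by draft_entries.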
I do have some small gripes with the Python GData client library. Google provides reasonable howto-style documentation describing how to manipulate Blogger data via the library. But when you move away from the examples in that documentation to do something a bit different, you need reference documentation to guide you. In particular, I was wondering whether my code needed to do anything to support the different entry content types from the Atom spec; the answer seems to be that Blogger currently supports only type='html', with the escaped HTML markup that entails. The Python code for the library does not provide ready answers, because it mostly consists of hundreds of lines of boilerplate to convert between the GData protocol XML structures and the corresponding Python objects (an opportunity for some metaprogramming, I wonder?). So there is a gap for more systematic documentation describing the correspondence between the Python API and the GData protocol. Such documentation doesn't need to be verbose: for example, the documentation for the Universal Feed Parser is very good at concisely defining how its Python objects correspond to the XML structure of RSS and Atom feeds.