1.5 Calais Update
At this point, the API is a little buggy, but I've managed to get a Python script to work. If there's enough interest, I may create a more full-featured script that you can use to generate metadata on your own.
In addition to the main Calais Web site, you can also get information about how to access the Calais API at the following address: http://api.opencalais.com/enlighten/calais.asmx?op=Enlighten
You need to pass three values to the API to get it to work: licenseID, content and paramsXML. There's a test form on the page I just mentioned that lets you test the values.
The API's address is: http://api.opencalais.com/enlighten/calais.asmx/Enlighten. You can use SOAP, but I think the easiest way to access the API is to use an HTTP POST request, with the values of licenseID, content and paramsXML passed as a URL-encoded string in the body of the POST.
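As a rough illustration of the POST approach, here is a minimal sketch using only Python 3's standard library. The function name build_calais_request is hypothetical, and the explicit Content-Type header is an assumption about what the service expects; only the URL and the three field names come from the description above. The request is built but not sent, so the sketch runs offline.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint quoted in the text above.
CALAIS_URL = "http://api.opencalais.com/enlighten/calais.asmx/Enlighten"


def build_calais_request(license_id, content, params_xml):
    """Build (but do not send) a POST request carrying the three values.

    Hypothetical helper for illustration only.
    """
    # urlencode() percent-escapes each value and joins them as key=value&...
    body = urlencode({
        "licenseID": license_id,
        "content": content,
        "paramsXML": params_xml,
    }).encode("utf-8")
    # Supplying data= makes urllib issue a POST rather than a GET.
    return Request(
        CALAIS_URL,
        data=body,
        # Assumption: the service accepts standard form encoding.
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

To actually send it, pass the returned object to urllib.request.urlopen() and read the response body.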
Here's a sample of the paramsXML value that I used:
<c:params xmlns:c="http://s.opencalais.com/1/pred/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <c:processingDirectives c:contentType="text/html" c:outputFormat="xml/rdf"></c:processingDirectives>
  <c:userDirectives c:allowDistribution="false" c:allowSearch="false" c:externalID="WhateverYouWant" c:submitter="TheNameOfYourApplication"></c:userDirectives>
  <c:externalMetadata></c:externalMetadata>
</c:params>
It needs to be customized for what you want to do. First, c:contentType must match the type of the content you are sending to Calais: text/html, text/txt or text/xml. Next, set c:externalID to a globally unique identifier for the story you are posting to the service. Finally, set c:submitter to the name of the application that is doing the submitting. The maximum length of the paramsXML value is 16,000 characters.
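One way to handle that customization is to fill in the three values from code. The sketch below is an assumption-laden example, not the script mentioned in this post: build_params_xml is a hypothetical name, and it uses quoteattr from Python's standard library so that quotes or ampersands in the values can't break the XML.

```python
from xml.sax.saxutils import quoteattr

# Content types listed in the text above.
ALLOWED_CONTENT_TYPES = ("text/html", "text/txt", "text/xml")


def build_params_xml(content_type, external_id, submitter):
    """Return a paramsXML string with the three customizable values set.

    Hypothetical helper for illustration only.
    """
    if content_type not in ALLOWED_CONTENT_TYPES:
        raise ValueError("unsupported content type: %r" % content_type)
    # quoteattr() escapes the value and adds the surrounding quotes itself.
    return (
        '<c:params xmlns:c="http://s.opencalais.com/1/pred/" '
        'xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
        '<c:processingDirectives c:contentType={ctype} '
        'c:outputFormat="xml/rdf"></c:processingDirectives>'
        '<c:userDirectives c:allowDistribution="false" '
        'c:allowSearch="false" c:externalID={ext_id} '
        'c:submitter={submitter}></c:userDirectives>'
        '<c:externalMetadata></c:externalMetadata>'
        '</c:params>'
    ).format(
        ctype=quoteattr(content_type),
        ext_id=quoteattr(external_id),
        submitter=quoteattr(submitter),
    )
```

For the external ID, something like str(uuid.uuid4()) is one easy way to get a value that is unique for each story.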
The content can be no longer than 100,000 characters and it must be one of the following types: text/html, text/txt or text/xml. The type must match the type encoded in the paramsXML value (see above). At this point (February 2008), the API is very finicky and will generate errors if there's even a minor problem. If you use text/xml as the format, then you can use the following tags:
title, headline, header, body, description, content, date, datetime, dateandtime, pubdate
Finally, the content must be in English (for now).
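Since the API is unforgiving, it may help to check these constraints locally before posting. The following is a minimal sketch; check_content is a hypothetical name, and it only covers the length and content-type rules described above. It makes no attempt to verify that the text is in English.

```python
# Limits and types as stated in the text above.
MAX_CONTENT_LENGTH = 100000
ALLOWED_CONTENT_TYPES = ("text/html", "text/txt", "text/xml")


def check_content(content, content_type, params_content_type):
    """Return a list of problems found; an empty list means none were found.

    Hypothetical helper for illustration only. params_content_type is the
    c:contentType value placed in paramsXML.
    """
    problems = []
    if content_type not in ALLOWED_CONTENT_TYPES:
        problems.append("unsupported content type: %s" % content_type)
    if content_type != params_content_type:
        problems.append("content type does not match paramsXML")
    if len(content) > MAX_CONTENT_LENGTH:
        problems.append("content longer than %d characters" % MAX_CONTENT_LENGTH)
    return problems
```

Calling check_content(story, "text/html", "text/html") before building the POST gives you a readable list of problems instead of a failed request.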
A sample Python script that pulls all of this together can be downloaded here: calais-snippet.py