1.6 Reuters Calais Semantic Web API

By Mark Choate
Last modified: 2008-02-14 13:22:29

Reuters Calais is a Web API that generates metadata for news articles by analyzing each story and extracting the important information about its subject. It uses RDF, which you can read about in "RDF: A Three Minute Summary"; "More about Metadata" discusses the need for metadata and why it is important for newspapers. The current version is intended to work with business and economics news, and therefore identifies a rather small set of terms.

When Calais analyzes text, it identifies entities, events and facts. An entity is a thing that falls into one of the following categories:

  • City

  • Company

  • Continent

  • Country

  • IndustryTerm

  • MoneyAmount

  • Organization

  • Person

  • ProvinceOrState

  • Region

  • URL

The events and facts that Calais looks for are:

  • Acquisition

  • Alliance

  • Bankruptcy

  • BusinessRelation

  • Buybacks

  • CompanyEarningsAnnouncement

  • CompanyEarningsGuidance

  • CompanyInvestment

  • CompanyLegalIssues

  • JointVenture

  • ManagementChange

  • Merger

  • PersonPolitical

  • PersonPoliticalPast

  • PersonProfessional

  • PersonProfessionalPast

  • StockSplit

As you can see, this is a relatively limited set of "facts", and they apply only to the business world.

So what is the point? The API does a fairly sophisticated analysis of the text, so it can identify entities, events and facts even when articles use different words to describe them. Not only does it identify acquisitions, it also identifies which company is doing the acquiring and which one is being acquired. Likewise, it can link a person to an event even when the text refers to that person only by a pronoun (he/she): it analyzes the context and works out the name of the person being referred to.
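To make the idea of roles concrete, here is a minimal sketch of what a structured acquisition record could look like. The `Acquisition` class, its field names, the two sentences and the company name "ExampleCo" are all invented for illustration; this is not the Calais schema, and no text analysis is performed. The point is only that two differently worded sentences can end up as the same record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Acquisition:
    """Hypothetical record for an acquisition event; not the Calais schema."""
    acquirer: str  # company doing the acquiring
    acquired: str  # company being acquired


# Two differently worded sentences, invented for illustration.
sentence_a = "IBM said it will buy ExampleCo."
sentence_b = "ExampleCo has agreed to be taken over by IBM."

# The mapping below is written by hand to show the target shape of the
# metadata; a real service would derive it by analyzing the text.
extracted = {
    sentence_a: Acquisition(acquirer="IBM", acquired="ExampleCo"),
    sentence_b: Acquisition(acquirer="IBM", acquired="ExampleCo"),
}

# Both sentences describe the same event with the same roles.
same_event = extracted[sentence_a] == extracted[sentence_b]
print(same_event)
```

Because the roles are explicit fields, a program can tell the acquirer from the acquired company without re-reading the sentence.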

Most importantly, because entities, events and facts are identified and encoded in a standard way that a computer can process and manipulate, you can more readily find multiple stories about the same entity, or about the same event or fact as it relates to a particular entity.

For example, I could search for stories about acquisitions made by IBM. I could also search for stories about joint ventures in Raleigh, North Carolina. This is surprisingly hard to do with any accuracy without this kind of metadata, and making it possible goes a long way toward making your content more useful to information seekers.
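The two searches just described can be sketched over a small in-memory collection of stories. The record layout, the function names and the companies "ExampleCo" and "OtherCo" are assumptions made up for this example; they do not reflect how Calais stores or returns its metadata.

```python
# Hypothetical story records; field names are illustrative only.
stories = [
    {
        "id": 1,
        "events": [{"type": "Acquisition", "acquirer": "IBM", "acquired": "ExampleCo"}],
        "entities": {"Company": ["IBM", "ExampleCo"]},
    },
    {
        "id": 2,
        "events": [{"type": "JointVenture"}],
        "entities": {"City": ["Raleigh"], "ProvinceOrState": ["North Carolina"]},
    },
    {
        # IBM is mentioned here, but it is not the company doing the acquiring.
        "id": 3,
        "events": [{"type": "Acquisition", "acquirer": "OtherCo", "acquired": "ExampleCo"}],
        "entities": {"Company": ["OtherCo", "ExampleCo", "IBM"]},
    },
]


def acquisitions_by(stories, company):
    """Return ids of stories in which `company` is the acquirer."""
    return [
        s["id"]
        for s in stories
        if any(
            e["type"] == "Acquisition" and e.get("acquirer") == company
            for e in s["events"]
        )
    ]


def events_in(stories, event_type, city, state):
    """Return ids of stories with an event of `event_type` tagged with a city and state."""
    return [
        s["id"]
        for s in stories
        if any(e["type"] == event_type for e in s["events"])
        and city in s["entities"].get("City", [])
        and state in s["entities"].get("ProvinceOrState", [])
    ]


print(acquisitions_by(stories, "IBM"))
print(events_in(stories, "JointVenture", "Raleigh", "North Carolina"))
```

Story 3 shows why a plain keyword search falls short: it mentions IBM and an acquisition, yet IBM is not the acquirer, so only the role information keeps it out of the first result.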

Calais is owned by Reuters, and Reuters keeps a copy of the metadata that the service produces, so using the service benefits Reuters. This is a very Google-like strategy: by exposing an API that provides a very useful service, Reuters is able to aggregate important data that will make its product more useful in the future.

I ran an earlier version of this article through the API, and it extracted the following information:

IndustryTerm: Web API
Company: IBM, Reuters, Google
ProvinceOrState: North Carolina
City: Raleigh

While this is a rather simple example, you can see how it identified the industry term "Web API", and that the story refers to three different companies, as well as the state of North Carolina and the city of Raleigh.
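If you wanted to work with a summary like the one above in a program, a small parser is enough. The sketch below assumes the simple "Type: value, value" layout used in this article's listing, which is my own summary and not the raw response from the service; the function name `parse_extraction` is hypothetical.

```python
def parse_extraction(text):
    """Parse 'Type: value, value' lines into a dict mapping type to a list of values."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        label, _, values = line.partition(":")
        result[label.strip()] = [v.strip() for v in values.split(",") if v.strip()]
    return result


# The listing from this article, copied as plain text.
summary = """\
IndustryTerm: Web API
Company: IBM, Reuters, Google
ProvinceOrState: North Carolina
City: Raleigh
"""

parsed = parse_extraction(summary)
print(parsed["Company"])
```

Splitting on commas works for this listing, but it would break on a value that itself contains a comma, so treat it as a starting point only.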