Tagging with Calais
What is Calais?
Calais is a web service provided by Thomson Reuters that analyzes documents and automatically identifies and extracts semantic metadata/tags from your content, as well as related facts, events, and more.
Many services can tell you that IBM was mentioned in your content, but no other service identifies that IBM is an Organization, then disambiguates all the various references to IBM (International Business Machines, etc.), and finally tells you with a scoring mechanism that your content is more about IBM than any other term identified.
Your OpenPublish installation includes a suite of modules we've developed called the Calais Collection, which integrates Thomson Reuters' Calais web service into the Drupal platform.
How do I set up Calais for my OpenPublish site?
First, you'll need an API key if you haven't taken care of that already.
Next, open your Admin Toolbar and navigate to Site configuration >> Calais configuration.
Calais API Settings
You'll land on the "Calais API Settings" tab by default, which contains a field for your API key. Enter it and save your changes.
Before delving into the specifics of all of the various configuration options, let's take a few steps back and talk about how Calais is used on a general level.
How do I use Calais?
Whenever you create or update a node, its contents are sent to Calais for processing, and the tags Calais returns are applied to the node as taxonomy terms in one of Calais' many vocabularies.
Each term that is returned for your node also has a node-specific relevancy score associated with it, which indicates the strength of the relationship between your node and the tag applied to it.
Viewing Calais Tags Applied to a Node
When viewing a node, you should see a "Calais" tab next to the traditional "View" and "Edit" tabs.
Click it to reveal a list of the Calais terms that have been suggested for your node, and those which have been applied.
The Calais terms that are listed in the input fields are the ones that were applied to your node.
The terms listed in the help text are the ones that were suggested by Calais but did not meet the minimum relevancy threshold requirement that must be satisfied before a term is automatically applied (learn how this can be adjusted in "Advanced Calais Configuration" below).
- Terms shown in a larger font size are more relevant to this particular node than terms shown in a smaller one.
- You can apply/unapply these suggested terms by clicking on the term names in the help text.
You can remove any applied terms from your node by clearing them out of the fields provided. If you were expecting a certain term to be suggested but it was not, you can try typing it in manually - if that term already exists in Drupal, it will be suggested via an autocomplete mechanism.
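The split between applied and suggested terms described above can be sketched in a few lines. This is only an illustration of the idea, not the Calais module's own code: the function name, the 0-to-1 score scale, and the sample scores are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def split_by_threshold(
    scores: Dict[str, float], threshold: float
) -> Tuple[List[str], List[str]]:
    """Split scored terms into (applied, suggested), most relevant first.

    Hypothetical helper: terms scoring at or above the threshold count as
    applied, and the rest are only suggested.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    applied = [term for term, score in ranked if score >= threshold]
    suggested = [term for term, score in ranked if score < threshold]
    return applied, suggested


# Made-up scores, for illustration only.
example_scores = {"IBM": 0.82, "Armonk": 0.31, "Technology": 0.55}
applied, suggested = split_by_threshold(example_scores, threshold=0.5)
print(applied)    # ['IBM', 'Technology']
print(suggested)  # ['Armonk']
```

Raising the threshold moves terms from the applied list to the suggested list; lowering it does the opposite.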
What OpenPublish features are made possible by Calais?
The metadata extracted by Calais can be used for a wide variety of features and applications. Here are some specific examples from OpenPublish.
Beneath an Article in OpenPublish, you will see a list of Related Terms which includes all of the Calais terms that were applied to that node.
More Like This
The More Like This modules allow you to display related content from your site (or from around the web) on your node pages, using the Calais terms you've deemed to be most relevant.
- MLT - Taxonomy (related nodes on your site)
- MLT - Flickr
- MLT - Google Video
- MLT - Yahoo! BOSS
For more specifics, check out our documentation on Related Content with More Like This.
Advanced Calais Configuration
This section covers additional fine-tuning and configuration options available at Site configuration >> Calais configuration.
Be sure to save your changes before toggling to another tab.
Calais Node Settings
In Global, you can choose which Calais entities you would like to use for your implementation. A separate vocabulary will be created in your Taxonomy for each of the enabled entities.
Beneath the Global settings, you can configure Calais for each of your content types:
- Calais Processing: Determines if these nodes are analyzed via Calais, and if so, how that process is implemented. It ranges from not processing at all, to automatically processing on every update. Select the option that is most appropriate for your site and the level of involvement required.
- Allow Calais Searching: Overrides the setting at the API level. Indicates whether Calais can perform future searches on the extracted metadata.
- Allow Calais Distribution: Overrides the setting at the API level. Indicates whether the extracted metadata can be distributed by Calais.
- Relevancy Threshold: Limits the entity terms that are displayed or automatically associated with a node to those whose relevance is equal to or greater than the threshold.
- Use Calais Global Entity Defaults: When selected, the vocabularies chosen in the Global Calais Entities section will be used for this content type. You can, however, override the associated vocabularies for this particular content type.
Calais Tag Modifications
- Calais Blacklist: Enter a list of terms that you want to prevent from being suggested moving forward. If these terms already exist in your taxonomy, you'll need to delete them in order to dissociate them from nodes they've already been applied to.
- Calais Rename: Enter a list of terms that you'd like to have renamed, in the format OldName=NewName.
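To make the effect of the two settings above concrete, here is a minimal sketch of how a blacklist and a set of OldName=NewName rules could be applied to a list of returned terms. It is not the module's implementation. The function names, the one-entry-per-line layout, the case-insensitive blacklist match, and the order of operations (blacklist first, then rename) are all assumptions.

```python
from typing import Dict, Iterable, List, Set


def parse_renames(text: str) -> Dict[str, str]:
    """Parse 'OldName=NewName' entries, assuming one entry per line."""
    renames: Dict[str, str] = {}
    for line in text.splitlines():
        old, sep, new = line.partition("=")
        old, new = old.strip(), new.strip()
        if sep and old and new:  # skip blank or malformed lines
            renames[old] = new
    return renames


def filter_terms(
    terms: Iterable[str], blacklist: Set[str], renames: Dict[str, str]
) -> List[str]:
    """Drop blacklisted terms, then rename whatever is left."""
    # Case-insensitive blacklist matching is an assumption of this sketch.
    blocked = {term.lower() for term in blacklist}
    return [renames.get(t, t) for t in terms if t.lower() not in blocked]


rules = parse_renames("International Business Machines=IBM\nNYC=New York City")
print(filter_terms(["International Business Machines", "NYC", "Widget"], {"widget"}, rules))
# ['IBM', 'New York City']
```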
Calais Bulk Processing
If your site has a large number of nodes, you can push them through Calais in batches using the options available here.
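The general idea of batch processing can be sketched as splitting a long list of node IDs into fixed-size groups and handling one group at a time. The helper name, the batch size, and the sample IDs below are hypothetical, and the actual options on this tab may work differently.

```python
from typing import Iterator, List, Sequence


def batches(node_ids: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive groups of at most batch_size node IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(node_ids), batch_size):
        yield list(node_ids[start:start + batch_size])


# Seven made-up node IDs processed three at a time.
for batch in batches(list(range(1, 8)), batch_size=3):
    print(batch)  # [1, 2, 3], then [4, 5, 6], then [7]
```

Working in small groups keeps any single request short, which matters when each node has to make a round trip to an external web service.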