.../data/salience/categories

This directory contains files that drive category assignment, which is based on an extraction of Wikipedia's content categorization. The default data/salience/categories directory contains the following file that users can access and customize:

taxonomy.dat

A data file of categories and subcategories

Other files that reside within data/salience/categories cannot be customized by users.

taxonomy.dat

The taxonomy.dat file in data/salience/categories defines the taxonomy of categories and subcategories used for document category assignment.

How Categories are Found

The Concept Matrix represents words and phrases as a distribution across Wikipedia articles. Two words are similar if they are both important words in many of the same articles.
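The idea of comparing terms by the articles in which they are important can be illustrated with a small sketch. This is not the Concept Matrix implementation: the term vectors, article titles, weights, and the choice of cosine similarity are all assumptions made for illustration.

```python
import math

# Hypothetical toy data: each term maps to {article title: importance weight}.
# The titles and weights are invented; they are not taken from the Concept Matrix.
TERM_VECTORS = {
    "guitar": {"Rock music": 0.9, "Musical instrument": 0.8, "Jazz": 0.4},
    "drums": {"Rock music": 0.8, "Musical instrument": 0.7, "Jazz": 0.5},
    "enzyme": {"Biochemistry": 0.9, "Protein": 0.7},
}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Terms that are important in the same articles score high.
guitar_drums = cosine_similarity(TERM_VECTORS["guitar"], TERM_VECTORS["drums"])
# Terms with no articles in common score zero.
guitar_enzyme = cosine_similarity(TERM_VECTORS["guitar"], TERM_VECTORS["enzyme"])
```

In this toy data, "guitar" and "drums" share all three articles and score close to 1, while "guitar" and "enzyme" share none and score 0.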

Wikipedia also classifies many articles into categories. For autocategories, we selected a few thousand interesting category pages; a document is matched to a category if it contains many terms that are important to the articles in that category.
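The matching step can be pictured with a minimal sketch. The category names, term sets, whitespace tokenization, and the min_hits threshold below are hypothetical; the source does not describe how the engine actually scores or thresholds matches.

```python
# Hypothetical toy data: important terms per category, invented for illustration.
CATEGORY_TERMS = {
    "Music": {"guitar", "drums", "album", "concert"},
    "Biology": {"enzyme", "protein", "cell", "genome"},
}


def match_categories(document_text, category_terms, min_hits=2):
    """Return (category, hit count) pairs for categories whose important
    terms appear at least min_hits times among the document's distinct
    tokens, sorted by hit count in descending order."""
    # Naive tokenization (lowercase, split on whitespace) is an assumption.
    tokens = set(document_text.lower().split())
    scores = {}
    for category, terms in category_terms.items():
        hits = len(tokens & terms)
        if hits >= min_hits:
            scores[category] = hits
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


doc = "The band released a new album and the guitar solo at the concert was loud"
matches = match_categories(doc, CATEGORY_TERMS)
```

Here the document contains three of the "Music" terms and none of the "Biology" terms, so only "Music" is returned. Counting distinct term hits is the simplest possible stand-in for "contains many terms that are important"; a weighted score would be a natural refinement.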