This directory contains files that drive category assignment through an extraction of Wikipedia's content categorization. The default
data/salience/categories directory contains the following files that can be accessed and customized by users:
taxonomy.dat: A data file of categories and subcategories
Other files that reside within data/salience/categories cannot be customized by users.
The taxonomy.dat file in data/salience/categories provides the taxonomy of categories and subcategories used for document category assignment.
The Concept Matrix represents words and phrases as a distribution across Wikipedia articles. Two words are similar if they are both important words in many of the same articles.
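The idea of comparing words by the articles they are important in can be illustrated with a minimal sketch. It assumes each word is stored as a sparse vector of per-article importance weights and uses cosine similarity as the comparison. Both the weights and the choice of cosine similarity are assumptions made for illustration, not a description of how the Concept Matrix is actually computed.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors keyed by article title."""
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical importance weights of each word across a few articles.
concept_matrix = {
    "guitar": {"Rock music": 0.9, "Musical instrument": 0.8, "Jazz": 0.3},
    "drums": {"Rock music": 0.8, "Musical instrument": 0.7, "Jazz": 0.4},
    "tax": {"Economics": 0.9, "Government": 0.6},
}

related = cosine_similarity(concept_matrix["guitar"], concept_matrix["drums"])
unrelated = cosine_similarity(concept_matrix["guitar"], concept_matrix["tax"])
```

In this toy data, "guitar" and "drums" are important in the same articles and score close to 1, while "guitar" and "tax" share no articles and score 0.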
Wikipedia also classifies many articles into categories. For autocategories, we selected a few thousand interesting category pages and match a document to a category if the document contains many terms that are important to the articles in that category.
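The matching rule described above can be sketched as a simple term-overlap count. This is an illustrative approximation only: the function names, the example term sets, and the `min_matches` threshold are hypothetical, and the real autocategory scoring is not specified here.

```python
def score_categories(document_terms, category_terms):
    """Count, per category, how many distinct document terms are important to it."""
    doc = {term.lower() for term in document_terms}
    return {cat: len(doc & terms) for cat, terms in category_terms.items()}


def assign_categories(document_terms, category_terms, min_matches=2):
    """Return categories with at least min_matches shared terms, best first."""
    scores = score_categories(document_terms, category_terms)
    matched = [cat for cat, score in scores.items() if score >= min_matches]
    return sorted(matched, key=lambda cat: (-scores[cat], cat))


# Hypothetical sets of terms that are important to the articles in each category.
category_terms = {
    "Music": {"guitar", "drums", "album", "concert"},
    "Finance": {"tax", "bank", "interest", "loan"},
}

document = "The band played guitar and drums at the concert".split()
assigned = assign_categories(document, category_terms)
```

Raising `min_matches` makes assignment stricter, which mirrors the requirement that a document contain many category terms rather than a single incidental one.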