CRM114 - Hyperspace Classification

There is a new classification for CRM114 that appears to be the fastest and possible the most accurate. Quote from Bill below:

“Hyperspace classification is a very different kind of classifier. Instead of grouping all members of a class of texts together and looking for common features, hyperspace classification keeps each text separate. Each text, known or unknown, represents a single point in a multidimensional hyperspace of roughly 4 billion dimensions. The classifier understands these points in hyperspace and measures the distance between the point representing the unknown text and the points representing each of the known texts. This distance is calculated as the square root of the square of the number of dimensions that appeared in both documents divided by the product of the dimensions appearing in the unknown text only times the number of dimensions appearing in the known text only (if is not specified, then each repeated feature counts as an additional unit displacement in that particular dimension of the hyperspace)

Every known text generates one such distance to the unknown text (thus, a single document yields only one hyperspace feature, instead of the hundreds or thousands of features in the Markovian, OSB, or Winnow classifiers). The distances are then merged with a combined radiance model. An astronomical analogy is perhaps the best way to understand thiseach known document is a star, shining in hyperspace. The position of each star depends on what feature phrases the document contained, and the brightness of the star depends on the number of feature phrases that the known document shares with the incoming unknown document (more precisely, it’s the number of shared feature phrases).

Each set of documents (the good examples, or the spam examples) forms a galaxy of these stars, shining in the fourbilliondimension hyperspace. To determine which class our unknown document belongs, we just see which galaxy shines more brightly on the star of our unknown document. If the “good galaxy” shines more light onto our unknown text’s position in hyperspace, the unknown text is assumed “good”, and if the “spam galaxy” shines more light, then the unknown is assumed to be “spam”.

This sounds like a lot of work, but hyperspace classification is surprisingly fast (faster than any other classifier by 4 or more times), as accurate as any other classifier except OSBF, and uses very little disk space (less than any other classifier, by as much as a factor of 40!). Depending on the corpus, and the training method, hyperspace can sometimes even beat OSBF in accuracy. At this writing, the classifier is considered experimental only because the ondisk format of the hyperspace point sets is likely to change to use a little more space to get more speed and accuracy.

An advantage of the hyperspace classifier is that convergence is guaranteed; in fact, it’s rarely even necessary to train more than once. This is because hyperspatial classification is not a linear classifier. For example, a hyperspace classifier can learn an “unlearnable” (to a linear classifier) checkerboard in O(n) examples, where N is the number of distinct squares on the checkerboard.

The hyperspace classifier is activated with the flag. You can use the flag with it, although the effect is not very large, it does decrease the disk space requirement even further. The hyperspace algorithm uses the OSB feature set, so using the feature set flag works, but cuts accuracy significantly (and is therefore not recommended). As a somewhat experimental classifier, is not yet implemented, and is ignored.

Hyperspace classification is best trained SSTTT, with a very thin margin (only 0.5 units of pR needed; this means only errors and texts with nearly completely ambiguous results need to be trained. Hyperspace does work with pure TOE training, but loses significant accuracy (it does keep its speed and diskspace advantages). “

Related Posts

Leave a Reply

You must be logged in to post a comment.