Category utility

Category utility is an information‑theoretic measure used to evaluate the quality of a partition of objects into categories (or clusters). Originally introduced by Gluck and Corter (1985) and later refined by Corter and Gluck (1992), the metric quantifies the predictive advantage gained by an observer who knows the category labels of instances compared with an observer who does not.

Definition

In a probabilistic formulation, let

  • C = { c₁, …, cₚ } be a set of p categories, with prior probabilities p(cⱼ).
  • F = { f₁, …, fₙ } be a set of n categorical features, each feature fᵢ taking one of m possible values.
  • p(fᵢₖ | cⱼ) denote the conditional probability that feature fᵢ has value k given category cⱼ, and p(fᵢₖ) the marginal probability of that value.

The category utility (CU) of the partition (C, F) is

$$ \text{CU}(C,F)=\frac{1}{p}\sum_{c_j\in C}p(c_j)\Bigg[\sum_{f_i\in F}\sum_{k=1}^{m}p(f_{ik}\mid c_j)^2\;-\;\sum_{f_i\in F}\sum_{k=1}^{m}p(f_{ik})^2\Bigg]. $$

The term inside the brackets represents the expected number of feature values that can be correctly guessed by an observer using a probability‑matching strategy when the category label is known, minus the expected number of correct guesses when the label is unknown. The factor 1/p averages this advantage over the p categories, penalizing partitions that inflate their score by splitting the data into many small categories; in this sense it acts as a simple regularizer against over‑fitting.
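The quadratic form above can be computed directly from empirical frequencies. The sketch below is a minimal illustration, not a reference implementation; the function name `category_utility` and the list‑of‑dicts data layout are assumptions made for this example.

```python
from collections import Counter, defaultdict

def category_utility(labels, rows):
    """Quadratic-form CU of a labelling of categorical data.

    labels: one cluster label per instance
    rows:   one dict per instance, mapping feature name -> value
    """
    n = len(rows)
    clusters = defaultdict(list)
    for label, row in zip(labels, rows):
        clusters[label].append(row)
    features = rows[0].keys()

    # Marginal term: sum over features of sum_k p(f_ik)^2
    marginal = 0.0
    for f in features:
        counts = Counter(r[f] for r in rows)
        marginal += sum((c / n) ** 2 for c in counts.values())

    # Weighted within-category term, averaged over the p categories
    cu = 0.0
    for members in clusters.values():
        p_c = len(members) / n
        within = 0.0
        for f in features:
            counts = Counter(r[f] for r in members)
            within += sum((c / len(members)) ** 2 for c in counts.values())
        cu += p_c * (within - marginal)
    return cu / len(clusters)
```

On a toy data set of four instances with one binary feature (two "red", two "blue"), a partition that separates the colours scores 0.25, while an uninformative partition that mixes them scores 0, matching the intuition that CU rewards predictive categories.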

An equivalent information‑theoretic expression (for binary features) is

$$ \text{CU}(C,F)=\Bigl[p(c)\sum_{i}p(f_i\mid c)\log p(f_i\mid c)+p(\bar c)\sum_{i}p(f_i\mid\bar c)\log p(f_i\mid\bar c)\Bigr] -\sum_{i}p(f_i)\log p(f_i), $$

where the bracketed terms give the (negative) expected coding cost of the features when the category is known, and the final term gives the corresponding cost when it is unknown; the difference is the reduction in coding cost attributable to the categories. The value is non‑negative, and higher values indicate a more "useful" categorisation.

Relationship to Other Measures

Category utility is closely related to the mutual information between the category variable C and the feature set F: in the logarithmic form above, CU is exactly the mutual information I(C; F), while the quadratic form replaces log‑probabilities with probabilities and yields a closely analogous quantity. It also shares conceptual ground with the information gain used in decision‑tree learning, as both assess the reduction in uncertainty provided by a partition.
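For a single binary feature, the equivalence between the logarithmic form and mutual information can be checked numerically. The snippet below is a hand‑rolled illustration on an assumed joint distribution; the labels `"c"`, `"nc"`, `"f"`, `"nf"` are purely illustrative.

```python
import math

# Assumed joint distribution p(c, f) for a binary category and binary feature
joint = {("c", "f"): 0.4, ("c", "nf"): 0.1,
         ("nc", "f"): 0.1, ("nc", "nf"): 0.4}

p_c = {c: sum(v for (cc, _), v in joint.items() if cc == c) for c in ("c", "nc")}
p_f = {f: sum(v for (_, ff), v in joint.items() if ff == f) for f in ("f", "nf")}

# Mutual information I(C; F) from the joint distribution
mi = sum(v * math.log(v / (p_c[c] * p_f[f])) for (c, f), v in joint.items())

# Log-form category utility: conditional coding term minus marginal coding term
cu_log = sum(joint[(c, f)] * math.log(joint[(c, f)] / p_c[c])
             for (c, f) in joint)              # = sum_c p(c) sum_f p(f|c) log p(f|c)
cu_log -= sum(p_f[f] * math.log(p_f[f]) for f in p_f)

assert abs(mi - cu_log) < 1e-9  # the two expressions agree
```

The agreement is exact because the log‑form CU is H(F) − H(F | C), which is the definition of I(C; F).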

Applications

The metric is a central component of conceptual clustering algorithms, which aim not only to group similar instances but also to produce a descriptive characterization of each cluster. Notable systems that employ category utility include:

  • Cobweb – an incremental, hierarchical clustering algorithm that repeatedly merges or splits clusters to maximize CU.
  • Later COBWEB variants (e.g., COBWEB/3) and other model‑based clustering approaches.
  • Modern implementations that adapt CU for purely categorical data sets, often as a fitness function in greedy or evolutionary clustering heuristics.
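The way such systems use CU as an objective can be sketched with a deliberately simplified greedy procedure: each new instance is placed in whichever existing cluster (or a fresh singleton) maximizes the CU of the resulting partition. This is not Fisher's full COBWEB, which is hierarchical and also considers merging and splitting; the function names below are illustrative assumptions.

```python
from collections import Counter

def cu(partition, seen):
    """Quadratic-form category utility of a partition of the rows in `seen`."""
    n = len(seen)
    feats = seen[0].keys()
    marginal = sum(sum((c / n) ** 2 for c in Counter(r[f] for r in seen).values())
                   for f in feats)
    score = 0.0
    for members in partition:
        within = sum(sum((c / len(members)) ** 2
                         for c in Counter(r[f] for r in members).values())
                     for f in feats)
        score += (len(members) / n) * (within - marginal)
    return score / len(partition)

def greedy_cluster(rows):
    """Place each instance in the existing or new cluster that maximizes CU."""
    clusters = [[rows[0]]]
    seen = [rows[0]]
    for row in rows[1:]:
        seen.append(row)
        options = []
        for i in range(len(clusters)):            # host in each existing cluster
            trial = [list(c) for c in clusters]
            trial[i].append(row)
            options.append(trial)
        options.append([list(c) for c in clusters] + [[row]])  # or start a new one
        clusters = max(options, key=lambda t: cu(t, seen))
    return clusters
```

Run on the four‑instance red/blue example from above, this procedure recovers the colour‑separating partition, because any mixed placement lowers the CU score.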

In machine‑learning textbooks (e.g., Witten & Frank, 2005) and research surveys, category utility is presented as a standard objective function for evaluating clustering quality, especially when the data consist of discrete attributes.

Interpretation

Intuitively, a high category utility indicates that:

  1. Objects within the same category tend to share feature values (high intra‑category similarity).
  2. Objects from different categories tend to differ in their feature values (high inter‑category dissimilarity).

Thus CU balances the desire for homogeneous clusters against the need for informative, discriminative categories.

References

  • Gluck, M. A., & Corter, J. E. (1985). "Information, uncertainty, and the utility of categories." Proceedings of the Seventh Annual Conference of the Cognitive Science Society.
  • Corter, J. E., & Gluck, M. A. (1992). "Explaining basic categories: Feature predictability and information." Psychological Bulletin, 111(2), 291–303.
  • Fisher, D. H. (1987). "Knowledge acquisition via incremental conceptual clustering." Machine Learning, 2(2), 139–172.
  • Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques.

These sources provide the formal derivations, experimental evaluations, and broader context for the use of category utility in artificial intelligence and data analysis.
