Nearest centroid classifier


In machine learning, a nearest centroid classifier or nearest prototype classifier is a classification model that assigns to each observation the label of the class whose training samples have the mean (centroid) closest to that observation.

When applied to text classification using tf*idf vectors to represent documents, the nearest centroid classifier is known as the Rocchio classifier because of its similarity to the Rocchio algorithm for relevance feedback.[1]
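As an illustration of the text setting, the following minimal sketch builds tf*idf vectors for a tiny corpus and classifies a new document by its nearest class centroid. The corpus, the labels, the function names, the weighting variant (raw term count times log(N/df)) and the use of Euclidean distance are all assumptions made for the example, not details taken from the cited source.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: (label, tokenized document) pairs.
train = [
    ("sports", "goal match team".split()),
    ("sports", "team win match".split()),
    ("cooking", "recipe oven bake".split()),
    ("cooking", "bake bread recipe".split()),
]


def build_idf(docs):
    """Inverse document frequency log(N / df) for every training term."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def vectorize(doc, vocab, idf):
    """Dense tf*idf vector over the training vocabulary; unseen terms are ignored."""
    tf = Counter(doc)
    return [tf[term] * idf[term] for term in vocab]


idf = build_idf([doc for _, doc in train])
vocab = sorted(idf)

# Group the training vectors by class, then average them component-wise.
by_class = defaultdict(list)
for label, doc in train:
    by_class[label].append(vectorize(doc, vocab, idf))

centroids = {
    label: [sum(column) / len(vectors) for column in zip(*vectors)]
    for label, vectors in by_class.items()
}


def classify(doc):
    """Return the label of the centroid nearest to the document's tf*idf vector."""
    x = vectorize(doc, vocab, idf)
    return min(centroids, key=lambda label: math.dist(centroids[label], x))


print(classify("team match score".split()))
```

Practical systems often length-normalize the vectors or compare by cosine similarity instead; the sketch omits this to stay short.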

An extended version of the nearest centroid classifier, based on shrunken centroids, has found applications in the medical domain, specifically the classification of tumors.[2]


  • Training procedure: given labeled training samples \{(\vec{x}_1, y_1), \dots, (\vec{x}_n, y_n)\} with class labels y_i \in \mathbf{Y}, compute the per-class centroids \vec{\mu}_l = \frac{1}{|C_l|} \sum_{i \in C_l} \vec{x}_i, where C_l is the set of indices of samples belonging to class l \in \mathbf{Y}.
  • Prediction function: the class assigned to an observation \vec{x} is \hat{y} = {\arg\min}_{l \in \mathbf{Y}} \|\vec{\mu}_l - \vec{x}\|.
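The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `fit_centroids` and `predict`, the toy data, and the choice of the Euclidean norm for \|\cdot\| are assumptions made for the example.

```python
import math
from collections import defaultdict


def fit_centroids(samples, labels):
    """Training: return {label: centroid}, the per-class mean of the samples."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        for j, value in enumerate(x):
            sums[y][j] += value
        counts[y] += 1
    return {y: [s / counts[y] for s in total] for y, total in sums.items()}


def predict(centroids, x):
    """Prediction: return the label whose centroid is nearest to x.

    Assumes Euclidean distance; ties go to the class seen first in training.
    """
    return min(centroids, key=lambda label: math.dist(centroids[label], x))


# Hypothetical toy data: two classes in the plane.
X = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (6.0, 5.0), (7.0, 6.0), (6.0, 7.0)]
y = ["a", "a", "a", "b", "b", "b"]

centroids = fit_centroids(X, y)
print(predict(centroids, (2.0, 2.0)))  # prints "a"
print(predict(centroids, (5.0, 5.0)))  # prints "b"
```

Training takes a single pass over the data, and prediction compares the observation against one vector per class, so the cost of classifying a point does not grow with the number of training samples.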

References


  1. Manning, Christopher; Raghavan, Prabhakar; Schütze, Hinrich (2008). "Vector space classification". Introduction to Information Retrieval. Cambridge University Press.
  2. Tibshirani, Robert; Hastie, Trevor; Narasimhan, Balasubramanian; Chu, Gilbert (2002). "Diagnosis of multiple cancer types by shrunken centroids of gene expression". Proceedings of the National Academy of Sciences 99 (10). doi:10.1073/pnas.082099299.