# User:Qymwill/Draft of the page for C-1SVM algorithm

In machine learning, competing one-class support vector machines (C-1SVMs) are supervised learning models designed for binary and multiclass classification. The algorithm trains multiple one-class support vector machines, each modeling the distribution of samples from one class, so that the models compete with one another. A C-1SVM is neither a binary SVM nor a single one-class SVM; it is a set of competing one-class SVMs. For binary classification, two one-class support vector machines are trained and maintained.

Given a set of training examples and their corresponding labels, a C-1SVM training algorithm builds several one-class SVM models that label new examples cooperatively. Based on the maximum-margin hyperplane idea used in the binary SVM, each model tries to find a hyperplane that encloses as many examples of its target class as possible. A new example is assigned to a category according to which hyperplane encloses it.

Combining one-class support vector machines with online learning, the C-1SVM was first introduced by Minglun Gong and Li Cheng as a support vector method for foreground segmentation of live videos[1] in computer vision and image processing.

## Algorithm

The C-1SVM algorithm involves two parts: training and classification. Training is an iterative process: given one-class SVM models partially trained on the examples received so far, each new labeled example is used to update the one-class SVM model of the corresponding class. The process repeats until the one-class SVM models no longer change. Classification is achieved by defining a score function for each class; a new example is assigned to the class with the maximum score.

### One Class Support Vector Machine

The one-class SVM was proposed by Bernhard Schölkopf, John C. Platt, John Shawe-Taylor and Alex J. Smola.[2] It is an unsupervised learning method that learns a decision function for novelty detection: classifying new data as similar or different to the training set. The C-1SVM algorithm trains one one-class SVM per class; each trained model determines whether new examples belong to its corresponding class. The trained one-class SVM models then make classifications cooperatively.

### On-line Learning

Training a binary SVM on a large set of examples is classical offline learning, because the quadratic objective function that the SVM minimizes does not change once training begins. Previous research[3] has shown that a similar solution can be obtained with an online learning scheme, in which training examples are repeatedly fed to an online learner in a fixed sequence. C-1SVM uses online learning because it produces a partially trained model at every point during training, which the next received example can immediately use. These partially trained models are gradually refined towards the final solution, and the quicker feedback they provide in the loop lets online training converge faster than offline learning.

The online learner in the C-1SVM training algorithm follows the one proposed by Li Cheng, S. V. N. Vishwanathan, Dale Schuurmans, Shaojun Wang and Terry Caelli.[4] Denote the training examples in a time sequence as ${\displaystyle (x_{t})_{t=1}^{T}}$ and their corresponding labels as ${\displaystyle (y_{t})_{t=1}^{T}}$, where ${\displaystyle T}$ is the number of training examples. Let ${\displaystyle f_{t}(\cdot )}$ be the score function at step ${\displaystyle t}$ and ${\displaystyle S_{i}(y)}$ the ${\displaystyle i}$-th support vector of class ${\displaystyle y}$. The kernel function used in online learning is denoted ${\displaystyle k(\cdot ,\cdot )}$. When a new labeled example ${\displaystyle x_{t}}$ is received, the score function is defined as

${\displaystyle f_{t}(x_{t})=\sum _{i}\min(\alpha _{i},C)\,\chi (S_{i}(y_{t})\neq x_{t})\,k(S_{i}(y_{t}),x_{t})}$

The weight of ${\displaystyle x_{t}}$ is updated as

${\displaystyle \alpha _{t}={\frac {\gamma -f_{t}(x_{t})}{k(x_{t},x_{t})}}}$

where ${\displaystyle \gamma }$ is the margin, often set to ${\displaystyle 1}$; ${\displaystyle C>0}$ is the cut-off value that bounds the weight of ${\displaystyle x_{t}}$; and ${\displaystyle \chi (\cdot )}$ is an indicator function with ${\displaystyle \chi (true)=1}$ and ${\displaystyle \chi (false)=0}$. If ${\displaystyle \alpha _{t}>0}$, ${\displaystyle x_{t}}$ is added to the corresponding support vector set; otherwise, if ${\displaystyle x_{t}}$ is already in the support vector set, it is removed.

A commonly used kernel function is:

• Gaussian kernel: ${\displaystyle k(x,y)=e^{-\left|\left|x-y\right|\right|^{2}/(2\sigma ^{2})}}$, where ${\displaystyle \sigma ^{2}}$ is the variance.

When receiving an unlabeled example ${\displaystyle x_{new}}$, C-1SVM labels it as the class with the maximum score, computed with the score function defined above.
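As a concrete sketch of the online update rule above (the function and variable names are illustrative, not from the paper), the score and weight update with a Gaussian kernel can be written as:

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def score(x, support_vectors, C, kernel=gaussian_kernel):
    """f(x): kernel sum over the class's support vectors, with each
    weight capped at C and x itself excluded (the indicator chi)."""
    return sum(min(alpha, C) * kernel(s, x)
               for s, alpha in support_vectors if s != x)

def online_update(x, support_vectors, C=0.5, gamma=1.0,
                  kernel=gaussian_kernel):
    """One online step: compute alpha_t, then insert or remove x."""
    alpha = (gamma - score(x, support_vectors, C, kernel)) / kernel(x, x)
    in_set = any(s == x for s, _ in support_vectors)
    if alpha > 0 and not in_set:
        support_vectors.append((x, alpha))
    elif alpha <= 0 and in_set:
        support_vectors[:] = [(s, a) for s, a in support_vectors if s != x]
    return alpha
```

Here each support vector set is a plain list of `(vector, weight)` pairs; the default values of `C`, `gamma` and `sigma` match those used in the worked example in the next section.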

## Example

Consider binary classification with two samples in each class: ${\displaystyle class{\text{ }}1:x_{1}=(1,1),x_{2}=(2,2)}$ and ${\displaystyle class{\text{ }}2:x_{3}=(-1,-1),x_{4}=(-2,-2)}$. The samples are input in the sequence ${\displaystyle x_{1},x_{2},x_{3},x_{4},x_{1},x_{2},x_{3},x_{4},\dots }$. Both support vector sets are initially empty. In this example, the cut-off is ${\displaystyle C=0.5}$, the margin is ${\displaystyle \gamma =1}$, and the Gaussian kernel with standard deviation ${\displaystyle \sigma =1.0}$ is used. The algorithm works as follows:

Step 1: receive ${\displaystyle x_{1}}$, since the support vector set of the first class is empty, the score function ${\displaystyle f_{1}(x_{1})=0}$, then ${\displaystyle \alpha _{1}=1>0}$, and the support vector set ${\displaystyle S(1)}$ becomes ${\displaystyle \{(x_{1},\alpha _{1})\}}$.
Step 2: receive ${\displaystyle x_{2}}$, compute ${\displaystyle f_{2}(x_{2})=0.184}$, then ${\displaystyle \alpha _{2}=0.816>0}$, and the support vector set ${\displaystyle S(1)}$ becomes ${\displaystyle \{(x_{1},\alpha _{1}),(x_{2},\alpha _{2})\}}$.
Step 3: receive ${\displaystyle x_{3}}$, ${\displaystyle f_{3}(x_{3})=0}$ and ${\displaystyle \alpha _{3}=1>0}$, the support vector set ${\displaystyle S(2)}$ becomes ${\displaystyle \{(x_{3},\alpha _{3})\}}$.
Step 4: receive ${\displaystyle x_{4}}$, compute ${\displaystyle f_{4}(x_{4})=0.184}$, then ${\displaystyle \alpha _{4}=0.816>0}$, and the support vector set ${\displaystyle S(2)}$ becomes ${\displaystyle \{(x_{3},\alpha _{3}),(x_{4},\alpha _{4})\}}$.

Continuing with the same input sequence, repeat the above four steps until the two support vector sets no longer change.
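The four steps above can be traced with a short self-contained script (a sketch; the helper names are illustrative):

```python
import math

C, gamma, sigma = 0.5, 1.0, 1.0

def k(x, y):
    """Gaussian kernel with standard deviation sigma."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y))
                    / (2.0 * sigma ** 2))

S = {1: [], 2: []}   # support vector sets of the two classes

def receive(x, label):
    """One step of the trace: score x against its class's support
    vectors, compute its weight, and insert it if the weight is positive."""
    f = sum(min(a, C) * k(s, x) for s, a in S[label] if s != x)
    alpha = (gamma - f) / k(x, x)
    if alpha > 0 and all(s != x for s, _ in S[label]):
        S[label].append((x, alpha))
    return f, alpha

receive((1, 1), 1)    # step 1: f = 0,       alpha = 1
receive((2, 2), 1)    # step 2: f ≈ 0.184,   alpha ≈ 0.816
receive((-1, -1), 2)  # step 3: f = 0,       alpha = 1
receive((-2, -2), 2)  # step 4: f ≈ 0.184,   alpha ≈ 0.816
```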

For classification, when an unlabeled sample ${\displaystyle x_{new}}$ arrives, use the score function defined above to compute ${\displaystyle score[1]}$ from the trained support vectors in ${\displaystyle S(1)}$ and ${\displaystyle score[2]}$ from those in ${\displaystyle S(2)}$. If ${\displaystyle score[1]\geq score[2]}$, label ${\displaystyle x_{new}}$ as ${\displaystyle class{\text{ }}1}$; otherwise label it as ${\displaystyle class{\text{ }}2}$.

## Pseudocode

The algorithm includes two parts: training and classification.

### Training

Input: training examples ${\displaystyle x}$ and labels ${\displaystyle y}$, total class number ${\displaystyle n}$

Output: support vectors of each class and corresponding weights

1  procedure C-1SVMtraining(${\displaystyle x,y,n}$)
2  for ${\displaystyle i\leftarrow 1{\text{ to }}n}$
3      allocate a set ${\displaystyle S(i)}$ to store support vectors for the ${\displaystyle i}$-th class
4  while there are changes for support vector sets do
5     for ${\displaystyle t\leftarrow 1{\text{ to }}T}$
6         receive an example ${\displaystyle x_{t}}$ and its label ${\displaystyle y_{t}}$
7         compute score ${\displaystyle f_{t}(x_{t})\leftarrow \sum _{i}\min(\alpha _{i},C)\,\chi (S_{i}(y_{t})\neq x_{t})\,k(S_{i}(y_{t}),x_{t})}$
8         update weight ${\displaystyle \alpha _{t}\leftarrow {\frac {\gamma -f_{t}(x_{t})}{k(x_{t},x_{t})}}}$
9         if there is a vector in ${\displaystyle S(y_{t})}$ equal to ${\displaystyle x_{t}}$
10           if ${\displaystyle \alpha _{t}\leq 0}$
11              delete ${\displaystyle (x_{t},\alpha _{t})}$ from ${\displaystyle S(y_{t})}$
12        else
13           if ${\displaystyle \alpha _{t}>0}$
14              push ${\displaystyle (x_{t},\alpha _{t})}$ into ${\displaystyle S(y_{t})}$
15 return ${\displaystyle S(1),S(2),\dots ,S(n)}$
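The training procedure above can be sketched as a runnable Python function (names and the convergence safeguard `max_rounds` are illustrative, not from the paper; as in the pseudocode, the weight of an already-stored support vector is not modified):

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def c1svm_training(xs, ys, n, C=0.5, gamma=1.0, kernel=gaussian_kernel,
                   max_rounds=100):
    """Train one support vector set per class; repeat passes over the
    labeled examples until no set changes (or max_rounds is reached)."""
    S = {i: [] for i in range(1, n + 1)}  # class -> list of (vector, weight)
    for _ in range(max_rounds):
        changed = False
        for x, y in zip(xs, ys):
            sv = S[y]
            # Score x against its class, excluding x itself (indicator chi).
            f = sum(min(a, C) * kernel(s, x) for s, a in sv if s != x)
            alpha = (gamma - f) / kernel(x, x)
            present = any(s == x for s, _ in sv)
            if present and alpha <= 0:
                S[y] = [(s, a) for s, a in sv if s != x]
                changed = True
            elif not present and alpha > 0:
                sv.append((x, alpha))
                changed = True
        if not changed:
            break
    return S
```

On the worked example from the previous section, `c1svm_training([(1, 1), (2, 2), (-1, -1), (-2, -2)], [1, 1, 2, 2], 2)` converges after the second pass with two support vectors per class.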


### Classification

Input: trained support vector sets ${\displaystyle S}$, new unlabeled example ${\displaystyle x_{new}}$, total class number ${\displaystyle n}$

Output: label ${\displaystyle y_{new}}$

1 procedure C-1SVMclassification(${\displaystyle S,x_{new},n}$)
2 for ${\displaystyle i\leftarrow 1{\text{ to }}n}$
3     ${\displaystyle score[i]\leftarrow \sum _{j}\min(\alpha _{j},C)\,\chi (S_{j}(i)\neq x_{new})\,k(S_{j}(i),x_{new})}$
4 ${\displaystyle y_{new}\leftarrow \arg \max _{i\in \{1,2,\dots ,n\}}score[i]}$


The support vector set of each class can be implemented as a first-in-first-out queue.
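A matching sketch of the classification procedure, storing the support vector sets as FIFO queues as noted above (names are illustrative, and the weights below are the ones obtained in the worked example):

```python
import math
from collections import deque

def gaussian_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def c1svm_classify(S, x_new, C=0.5, kernel=gaussian_kernel):
    """Compute score[i] for every class i and return the arg-max label."""
    scores = {label: sum(min(alpha, C) * kernel(s, x_new)
                         for s, alpha in sv)
              for label, sv in S.items()}
    return max(scores, key=scores.get)

# Support vector sets stored as FIFO queues (deque).
S = {1: deque([((1, 1), 1.0), ((2, 2), 0.816)]),
     2: deque([((-1, -1), 1.0), ((-2, -2), 0.816)])}
print(c1svm_classify(S, (1.5, 1.5)))  # -> 1
print(c1svm_classify(S, (-3, -3)))    # -> 2
```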

## Binary SVMs vs C-1SVMs

Fig. 1. Comparison between a binary SVM and C-1SVMs. White dots represent the training examples of the first class and black dots represent the training examples of the second class, while red dot denotes an unlabeled example.

For binary classification, the binary SVM algorithm trains a single SVM model, giving one hyperplane. The C-1SVM algorithm instead trains two one-class SVMs. This returns two hyperplanes that enclose the training examples more tightly, which helps detect and handle ambiguous cases.

A detailed comparison is given in Fig. 1. White dots denote the first class and black dots the second class. The line between the white and black dots is the hyperplane trained by the binary SVM; the circles enclosing the white or black dots are the hyperplanes trained by the C-1SVMs. In case (a), the binary SVM classifies the red dot as the first class, whereas the C-1SVMs classify it as unknown because neither one-class hyperplane includes it as an inlier. In case (b), because the red dot lies very close to the binary SVM's hyperplane, the binary SVM cannot label it with high confidence, whereas the C-1SVMs correctly label it as the second class.

## Complexity

Training: denote the total number of support vectors in the final solution as ${\displaystyle m}$. In the worst case, each pass of the loop at line 5 of the procedure C-1SVMtraining adds or deletes only one support vector, so at most ${\displaystyle m}$ passes are required for convergence. Each pass processes ${\displaystyle T}$ examples, so the running time is ${\displaystyle O(mT)}$. If all of the examples are support vectors in the worst case, ${\displaystyle O(T)}$ space is required to store them.

Classification: scores of all classes are computed, so the running time and the required space are both ${\displaystyle \Theta (n)}$.

## Applications

Like the SVM, C-1SVMs can be applied to a wide range of classification problems in computer vision, data mining, and speech recognition. For example, Gong and Cheng used the C-1SVM algorithm for foreground segmentation of live videos[1], a computer vision problem: they trained two one-class SVMs, one for foreground pixels and one for background pixels, and labeled each new unlabeled pixel as foreground or background using the two models cooperatively.

## References

1. ^ a b Gong, Minglun; Cheng, Li (2011). "Foreground segmentation of live videos using locally competing 1SVMs" (PDF). IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
2. ^ Schölkopf, Bernhard; Platt, John C.; Shawe-Taylor, John; Smola, Alex J. (2001). "Estimating the support of a high-dimensional distribution". Neural Computation, Vol. 13: 1443–1472.
3. ^ Bottou, Leon; Bousquet, Olivier (2008). "The Tradeoffs of Large Scale Learning". Advances in Neural Information Processing Systems.
4. ^ Cheng, Li; Vishwanathan, S.V. N.; Schuurmans, Dale; Wang, Shaojun; Caelli, Terry (2007). "Implicit Online Learning with Kernels". Advances in Neural Information Processing Systems.