User:Qwertyus/Logistic regression

From Wikipedia, the free encyclopedia
This is a set of notes to be incorporated into either logistic regression or multinomial logistic regression. We should also merge the stub Entropy maximization somewhere.

Logistic regression (LR) is a popular class of models used in machine learning for probability estimation and classification.[1] LR models are the basic type of log-linear models; they are also known as (conditional) maximum entropy classifiers, especially in natural language processing.

The basic, binomial logistic regression model can be used as a binary classifier by computing the probability of one class and predicting that class if and only if the LR model predicts a value greater than ½. Multiclass classification with K classes can be performed either by means of multinomial logistic regression (MLR) or by using K binary LR classifiers, predicting the class for which the highest probability is predicted (the "one-vs.-rest" construction).[2]
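The two uses described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the parameter values in `models` are hypothetical, and each binary model is assumed to compute the logistic sigmoid of a linear score.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba_binary(weights, bias, x):
    """Probability of the positive class under a binary LR model."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(score)

def predict_binary(weights, bias, x):
    """Predict the positive class iff its probability exceeds 1/2."""
    return predict_proba_binary(weights, bias, x) > 0.5

def predict_one_vs_rest(models, x):
    """models holds one (weights, bias) pair per class.

    Returns the index of the class whose binary classifier
    reports the highest probability.
    """
    probs = [predict_proba_binary(w, b, x) for w, b in models]
    return max(range(len(probs)), key=probs.__getitem__)

# Hypothetical parameters for K = 3 classes and two features.
models = [([1.0, -1.0], 0.0), ([-1.0, 1.0], 0.0), ([0.5, 0.5], -2.0)]
x = [2.0, 0.5]
predicted_class = predict_one_vs_rest(models, x)
```

Note that the K probabilities produced by one-vs.-rest come from independently trained models, so they need not sum to one.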

Model and prediction function

Logistic regression is a parametric model, described by a coefficient matrix β of dimension m×K (number of features times number of classes) and an intercept vector α of length K.[3] In the binary case, β can be reduced to a vector instead of an m×2 matrix and α to a scalar: when the probability of one of the classes is p, the probability of the other class is 1 - p, so storing the parameters for one of the classes suffices.
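A short derivation shows why one set of parameters is enough. It is a sketch under assumed notation: the class labels 0 and 1 and the subscripted symbols β_0, β_1, α_0, α_1 (one coefficient vector and intercept per class) are introduced here for illustration. Starting from the two-class case of the normalized exponential form given later in this section,

P(y=1|\mathbf{x}) = \frac{\exp(\beta_1 \cdot \mathbf{x} + \alpha_1)}{\exp(\beta_0 \cdot \mathbf{x} + \alpha_0) + \exp(\beta_1 \cdot \mathbf{x} + \alpha_1)} = \frac{1}{1 + \exp(-((\beta_1 - \beta_0) \cdot \mathbf{x} + (\alpha_1 - \alpha_0)))}

Only the differences β_1 - β_0 and α_1 - α_0 affect the predicted probabilities, so a single coefficient vector and a single scalar intercept describe the binary model completely.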

Given an observation (also called instance or event), represented by a feature vector x, to be classified as belonging to one of K classes, logistic regression can be used to predict the probability that it belongs to class k. In the binary case, with the classes labelled 0 and 1, the probability of class 1 is

P(y=1|\mathbf{x}) = \sigma(\beta \cdot \mathbf{x} + \alpha)

where y represents the output class and σ is the logistic sigmoid, σ(t) = 1/(1 + exp(-t)). In the case of multinomial LR, the softmax function is used instead of the logistic function. With β_k denoting the column of β for class k and α_k the corresponding entry of α, this gives a discrete probability distribution over all classes:

P(y=k|\mathbf{x}) = \frac{\exp(\beta_k \cdot \mathbf{x} + \alpha_k)}{\sum_{j=1}^{K} \exp(\beta_j \cdot \mathbf{x} + \alpha_j)}

The denominator in this fraction is a normalization factor (it depends on x, but not on the class) conventionally denoted Z, leading to the shorthand \textstyle\tfrac{1}{Z} \exp(\beta_k \cdot \mathbf{x} + \alpha_k).[4] The softmax output, or the individual probabilities computed in a one-vs.-rest procedure, can be turned into a classifier by computing the probability of each class and predicting the most likely one:

\hat y = \arg\max_{1 \le k \le K} P(y=k|\mathbf{x})

In the binary case, this reduces to determining whether the probability of y=1 is greater than ½. The full probability computation is not needed to make discrete predictions, though: since Z is the same for every class and the exponential function is monotonically increasing,

\arg\max_{1 \le k \le K} P(y=k|\mathbf{x}) = \arg\max_{1 \le k \le K} (\beta_k \cdot \mathbf{x} + \alpha_k)

so logistic regression's decision function has the same linear form as that of a perceptron or linear SVM.
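The prediction functions of this section can be illustrated with a small Python sketch. The parameter values are hypothetical, and for convenience the coefficients are stored as a list of K per-class weight vectors (the transpose of the matrix layout described above), with one intercept per class as an assumption of the sketch.

```python
import math

def linear_scores(beta, alpha, x):
    """One linear score per class.

    beta: list of K weight vectors, one per class.
    alpha: list of K intercepts.
    """
    return [sum(b * xi for b, xi in zip(beta_k, x)) + alpha_k
            for beta_k, alpha_k in zip(beta, alpha)]

def softmax(scores):
    """Normalized exponentials of the scores.

    Subtracting the maximum score first leaves the result unchanged
    (the common factor cancels) but avoids overflow in exp.
    """
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)  # normalization factor, up to the cancelled common factor
    return [e / z for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical model with K = 3 classes and two features.
beta = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
alpha = [0.0, 0.5, 0.0]
x = [1.0, 2.0]

scores = linear_scores(beta, alpha, x)
probs = softmax(scores)

# The most probable class is also the class with the highest linear score.
same_prediction = argmax(probs) == argmax(scores)
```

When only the predicted label is needed, the call to `softmax` can be skipped entirely and `argmax(scores)` used directly.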

Comparison to the naive Bayes classifier

Logistic regression is related to the much simpler naive Bayes classifier (NB) in the sense that NB and LR form a generative-discriminative pair.[5] They belong to the same family of models (have the same number and types of parameters), but are trained according to different criteria: while NB optimizes the joint probability P(x,y) of sample/label pairs, logistic regression is optimized directly for the conditional probability P(y|x) of a label given a sample.
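One way to see that the two models share a functional form is a short derivation for the special case of naive Bayes with two classes and binary features. This restriction is an illustrative assumption, and the symbols π, θ, w and b are introduced here rather than taken from the cited sources. Writing π = P(y=1), θ_{i1} = P(x_i=1|y=1) and θ_{i0} = P(x_i=1|y=0), Bayes' rule together with the conditional independence assumption gives the log-odds

\log \frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})} = \log \frac{\pi}{1-\pi} + \sum_{i=1}^{m} \left[ x_i \log \frac{\theta_{i1}}{\theta_{i0}} + (1 - x_i) \log \frac{1-\theta_{i1}}{1-\theta_{i0}} \right] = \mathbf{w} \cdot \mathbf{x} + b

with

w_i = \log \frac{\theta_{i1} (1-\theta_{i0})}{\theta_{i0} (1-\theta_{i1})}, \qquad b = \log \frac{\pi}{1-\pi} + \sum_{i=1}^{m} \log \frac{1-\theta_{i1}}{1-\theta_{i0}}

so that P(y=1|\mathbf{x}) = \sigma(\mathbf{w} \cdot \mathbf{x} + b), the same form as binary logistic regression. The difference lies in how the parameters are obtained: naive Bayes sets w and b through estimates of π and θ, whereas logistic regression fits them directly.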


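The common way of training (fitting) a logistic regression model is to maximize the conditional log-likelihood of the training labels given the training samples. This is equivalent to maximizing the conditional entropy of the model subject to the constraint that the model's expected feature values match those observed in the training data, hence the name maximum entropy classifier.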
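For concreteness, the objective can be written out for the binary case. This is a sketch under assumed notation: the training set size N, the labels y_i ∈ {0, 1} and the shorthand p_i = \sigma(\beta \cdot \mathbf{x}_i + \alpha) are introduced here.

\ell(\beta, \alpha) = \sum_{i=1}^{N} \log P(y_i|\mathbf{x}_i) = \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]

Its gradient has a simple form, the difference between observed labels and predicted probabilities, weighted by the features:

\frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{N} (y_i - p_i) \mathbf{x}_i, \qquad \frac{\partial \ell}{\partial \alpha} = \sum_{i=1}^{N} (y_i - p_i)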

Many different training algorithms have been developed for LR and MLR models. Among the older algorithms are generalized iterative scaling (GIS) and improved iterative scaling (IIS). More modern approaches usually employ L-BFGS[6] (or, in the L1-regularized case, OWL-QN) or stochastic gradient descent, though other algorithms may be used.[7]
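As an illustration of the simplest of these approaches, the following Python sketch fits a binary LR model by plain stochastic gradient ascent on the conditional log-likelihood. It is a toy under stated assumptions, not a production trainer: there is no regularization, the step size `lr`, the number of `epochs` and the data set are arbitrary choices, and all function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(beta, alpha, x):
    """P(y=1|x) for coefficient vector beta and scalar intercept alpha."""
    return sigmoid(sum(b * xi for b, xi in zip(beta, x)) + alpha)

def log_likelihood(X, y, beta, alpha):
    """Conditional log-likelihood of labels y (0 or 1) given samples X."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(beta, alpha, xi)
        total += math.log(p) if yi == 1 else math.log(1.0 - p)
    return total

def sgd_fit(X, y, epochs=200, lr=0.1, seed=0):
    """Fit binary LR by stochastic gradient ascent, one example per update."""
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])
    alpha = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            # Per-example gradient of the log-likelihood w.r.t. the score.
            err = y[i] - predict_proba(beta, alpha, X[i])
            beta = [b + lr * err * xi for b, xi in zip(beta, X[i])]
            alpha += lr * err
    return beta, alpha

# Toy one-feature data set: negative inputs are class 0, positive are class 1.
X = [[-1.5], [-0.5], [0.5], [1.5]]
y = [0, 0, 1, 1]
beta, alpha = sgd_fit(X, y)
```

Because this toy data set is linearly separable and the objective is unregularized, the coefficients keep growing with more epochs; practical implementations usually add a regularization term or stop early.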


See also


  1. "In the terminology of statistics, this model is known as logistic regression, although it should be emphasized that this is a model for classification rather than regression." (Bishop, p. 205)
  2. Fan et al.
  3. α and β may be combined into a single matrix (called w by Bishop) under the assumption that each feature vector x contains an extra "always-on" feature with constant value one, whose weights then play the role of α.
  4. Jurafsky and Martin, p. 235.
  5. Ng and Jordan 2001.
  6. Malouf 2002.
  7. Lin 2008.