This is not a Wikipedia article: it is an individual user's work-in-progress page, and may be incomplete and/or unreliable.
- This is a set of notes to be incorporated into either logistic regression or multinomial logistic regression. We should also merge the stub Entropy maximization somewhere.
Logistic regression (LR) is a popular class of models used in machine learning for probability estimation and classification. LR models are the basic type of log-linear models; they are also known as (conditional) maximum entropy classifiers, especially in natural language processing.
The basic, binomial logistic regression model can be used as a binary classifier by computing the probability of one class and predicting that class whenever the estimated probability exceeds ½. Multiclass classification with K classes can be performed either by means of multinomial logistic regression (MLR) or by using K binary LR classifiers and choosing the class whose classifier assigns the highest probability (the "one-vs.-rest" construction).
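The one-vs.-rest construction can be sketched in a few lines of Python. This is an illustrative sketch only: the parameter values used in the usage example below are made up, not fitted, and the helper names (`binary_lr_prob`, `one_vs_rest_predict`) are invented for this example.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, mapping a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_lr_prob(alpha, beta, x):
    """P(y=1 | x) under a single binary LR model with intercept alpha
    and coefficient vector beta."""
    return sigmoid(alpha + sum(b * xi for b, xi in zip(beta, x)))

def one_vs_rest_predict(models, x):
    """models: list of (alpha_k, beta_k) pairs, one binary LR model per class.
    Predict the class whose model assigns the highest probability."""
    probs = [binary_lr_prob(a, b, x) for a, b in models]
    return max(range(len(models)), key=lambda k: probs[k])
```

For example, with three hand-picked models, `one_vs_rest_predict(models, x)` simply returns the index of the classifier that is most confident that its class fired.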
Model and prediction function
Logistic regression is a parametric model, described by a coefficient matrix β of dimension m×K (number of features times number of classes) and an intercept vector α of length K. In the binary case, β can be reduced from an m×2 matrix to a single vector of length m, and α to a scalar: when the probability of one class is p, the probability of the other class is 1 − p, so storing the parameters for one of the classes suffices.
Given an observation (also called an instance or event), represented by a feature vector x, to be classified as belonging to one of K classes, logistic regression can be used to predict the probability that it belongs to class k. In the binary case, this probability is computed by evaluating

    P(y = 1 | x) = σ(α + β·x) = 1 / (1 + exp(−(α + β·x)))

where y represents the output class and σ is the logistic sigmoid. In the case of multinomial LR, the softmax function should be used instead of the logistic function, giving a discrete probability distribution over all classes:

    P(y = k | x) = exp(α_k + β_k·x) / Σ_{k′=1}^{K} exp(α_{k′} + β_{k′}·x)
The denominator in this fraction is a normalization factor (it depends on x, but not on the class), conventionally denoted Z, leading to the shorthand

    P(y = k | x) = (1/Z) exp(α_k + β_k·x)

The value of the softmax, or the individual probabilities computed in a one-vs.-rest procedure, can be turned into a classification procedure by computing the probability of each class, then predicting the most likely class:

    ŷ = argmax_k P(y = k | x)
In the binary case, this reduces to determining whether the probability of y = 1 is greater than ½. The full probability computation is not needed to make discrete predictions, though: since σ is monotonically increasing and σ(0) = ½,

    P(y = 1 | x) > ½  if and only if  α + β·x > 0

so it suffices to check the sign of the linear score.
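The prediction rules above can be sketched as follows. The function names are invented for this example, and the max-shift inside the softmax is a standard numerical-stability trick, not part of the model definition.

```python
import math

def softmax_probs(alphas, betas, x):
    """P(y=k | x) under multinomial LR: exponentiate each class score
    and normalize by Z (scores are shifted by their max for numerical
    stability; this leaves the probabilities unchanged)."""
    scores = [a + sum(b * xi for b, xi in zip(bs, x))
              for a, bs in zip(alphas, betas)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)  # the normalization factor Z
    return [e / z for e in exps]

def predict(alphas, betas, x):
    """Classification: return the most probable class."""
    probs = softmax_probs(alphas, betas, x)
    return max(range(len(probs)), key=lambda k: probs[k])

def binary_predict(alpha, beta, x):
    """Binary shortcut: P(y=1|x) > 1/2 iff the linear score is positive,
    so no sigmoid evaluation is needed."""
    return 1 if alpha + sum(b * xi for b, xi in zip(beta, x)) > 0 else 0
```

Note that `binary_predict` never evaluates the sigmoid, illustrating that discrete prediction only requires the sign of the linear score.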
Comparison to the naive Bayes classifier
Logistic regression is related to the much simpler naive Bayes classifier (NB) in the sense that NB and LR form a generative-discriminative pair. They belong to the same family of models (have the same number and types of parameters), but are trained according to different criteria: while NB optimizes the joint probability P(x,y) of sample/label pairs, logistic regression is optimized directly for the conditional probability P(y|x) of a label given a sample.
Training
The common way of training (fitting) a logistic regression model is to maximize the conditional log-likelihood of the training data under the model. For log-linear models, this is equivalent to finding the maximum-entropy conditional distribution among those whose expected feature values match the empirical averages observed in the training data, hence the name maximum entropy classifier.
Many different training algorithms have been developed for LR and MLR models. Among the older algorithms are generalized iterative scaling (GIS) and improved iterative scaling (IIS). More modern approaches usually employ L-BFGS (or, in the L1-regularized case, OWL-QN) or stochastic gradient descent, though other algorithms may be used.
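Stochastic gradient ascent on the conditional log-likelihood can be sketched for the binary case as follows. This is a minimal illustration, not a production trainer (no regularization, fixed learning rate, fixed epoch count); the function name `sgd_train` and the hyperparameter values are assumptions for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_train(data, n_features, lr=0.1, epochs=100):
    """Maximize the conditional log-likelihood of binary LR by stochastic
    gradient ascent. For an example (x, y), the gradient of log P(y|x)
    is (y - p) for the intercept alpha and (y - p) * x_i for each
    coefficient beta_i, where p = P(y=1 | x) under the current model."""
    alpha, beta = 0.0, [0.0] * n_features
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(alpha + sum(b * xi for b, xi in zip(beta, x)))
            g = y - p  # residual; zero when the model is exactly right
            alpha += lr * g
            beta = [b + lr * g * xi for b, xi in zip(beta, x)]
    return alpha, beta
```

On a tiny separable dataset the learned parameters quickly place the decision boundary between the two classes.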
- A multilayer perceptron with softmax activation function can be viewed as a stacked network of LR models.
- Maximum-entropy Markov models and conditional random fields are extensions of logistic regression for structured prediction.
Notes
- "In the terminology of statistics, this model is known as logistic regression, although it should be emphasized that this is a model for classification rather than regression." (Bishop, p. 205)
- Fan et al.
- α and β may be combined into a single matrix (called w by Bishop) under the assumption that each feature vector x contains an extra "always-on" feature with constant value one that "triggers" β.
- Jurafsky and Martin, p. 235.
- Ng and Jordan 2001.
- Malouf 2002.
- Lin 2008.
References
- Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer.
- Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang and Chih-Jen Lin (2008). LIBLINEAR: a library for large linear classification. J. Machine Learning Research 9:1871–1874.
- Trevor Hastie, Robert Tibshirani and Jerome Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer.
- Daniel Jurafsky and James H. Martin (2009). Speech and Language Processing. Prentice Hall, pp. 227–241.
- Chih-Jen Lin, Ruby C. Weng and S. Sathiya Keerthi (2008). Trust region Newton method for large-scale logistic regression. J. Machine Learning Research 9:627–650.
- Robert Malouf (2002). A comparison of algorithms for maximum entropy parameter estimation. In Proc. Sixth Conf. on Natural Language Learning (CoNLL-2002), pp. 49–55.
- Robert Malouf (2010). Maximum entropy models. In Alex Clark, Chris Fox, and Shalom Lappin (eds.), Handbook of Computational Linguistics and Natural Language Processing. Wiley Blackwell, pp. 133–155.
- Andrew Ng and Michael I. Jordan (2001). On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. NIPS.