
Logistic regression: Difference between revisions

From Wikipedia, the free encyclopedia
Repairing links to disambiguation pages - You can help! - Degree of freedom
No edit summary
(18 intermediate revisions by the same user not shown)
Line 1: Line 1:
{{Regression bar}}
{{Regression bar}}
In [[statistics]], '''logistic regression''' is a type of [[regression analysis]] used for predicting the outcome of a [[categorical variable|categorical]] (a variable that can take on a limited number of categories) [[dependent variable|criterion variable]] based on one or more predictor variables. Logistic regression can be bi- or multinomial. '''Binomial''' or '''binary logistic regression''' refers to the instance in which the criterion can take on only two possible outcomes (e.g., "dead" vs. "alive", "success" vs. "failure", or "yes" vs. "no"). [[multinomial logit|Multinomial logistic regression]] refers to the instance in which the criterion can take on three or more possible outcomes (e.g., "better" vs. "no change" vs. "worse"). Generally, the criterion is coded as "0" and "1" in binary logistic regression as it leads to the most straightforward interpretation.<ref>{{cite book|last=Lemeshow|first=David W. Hosmer, Stanley|title=Applied logistic regression|year=2000|publisher=Wiley|location=New York|isbn=0471356328|edition=2nd ed.}}</ref> The target group (referred to as a "case") is usually coded as "1" and the reference group (referred to as a "noncase") as "0". The [[binomial distribution]] has a mean equal to the proportion of cases, denoted ''P'', and a [[variance]] equal to the product of the proportions of cases and noncases, ''PQ'', wherein ''Q'' is equal to the proportion of noncases or 1 - ''P''.<ref>{{cite book|first=Jacob Cohen|title=Applied multiple regression/correlation analysis for the behavioral sciences|publisher=Erlbaum|location=Mahwah, NJ [u.a.]|isbn=9780805822236|edition=3. ed.}}</ref> Accordingly, the [[standard deviation]] is simply the square root of ''PQ''. Logistic regression is used to predict the [[odds]] of being a case based on the predictor(s). The odds are defined as the probability of a case divided by the probability of a noncase.
The [[odds ratio]] is the primary measure of effect size in logistic regression and is computed to compare the odds that membership in one group will lead to a case outcome with the odds that membership in some other group will lead to a case outcome. The odds ratio (denoted OR) is simply the odds of being a case for one group divided by the odds of being a case for another group. An odds ratio of one indicates that the odds of a case outcome are equally likely for both groups under comparison. The further the odds deviate from one, the stronger the relationship. The odds ratio has a floor of zero but no ceiling (upper limit) - theoretically, the odds ratio can increase infinitely.<ref>{{cite book|last=Lemeshow|first=David W. Hosmer, Stanley|title=Applied logistic regression|year=2000|publisher=Wiley|location=New York|isbn=0471356328|edition=2nd ed.}}</ref>
In [[statistics]], '''logistic regression''' (sometimes called the '''logistic model''' or '''[[logit]] model''') is a type of [[regression analysis]] used for predicting the outcome of a [[binary variable|binary]] [[dependent variable]] (a variable which can take only two possible outcomes, e.g. "yes" vs. "no" or "success" vs. "failure") based on one or more [[independent variable|predictor variable]]s. Logistic regression attempts to model the [[probability]] of a "yes/success" outcome using a [[linear function]] of the predictors. Specifically, the [[log-odds]] of success (the [[logit]] of the probability) is fit to the predictors using [[linear regression]]. Logistic regression is one type of [[discrete choice]] model, which in general predict [[categorical variable|categorical]] dependent variables — either binary or multi-way.


Like other forms of [[regression analysis]], logistic regression makes use of one or more predictor variables that may be either [[continuous variable|continuous]] or [[categorical variable|categorical]]. Also, like other linear regression models, the [[expected value]] (average value) of the response variable is fit to the predictors; the expected value of a [[Bernoulli distribution]] is simply the probability of success. Unlike ordinary linear regression, however, logistic regression is used for predicting binary outcomes ([[Bernoulli trial]]s) rather than continuous outcomes, and models a transformation of the expected value as a linear function of the predictors, rather than the expected value itself.
Like other forms of regression analysis, logistic regression makes use of one or more predictor variables that may be either [[continuous]] or [[categorical]]. Also, like other linear regression models, the [[expected value]] (average value) of the response variable is fit to the predictors - the expected value of a [[Bernoulli distribution]] is simply the [[probability]] of a case. In other words, in logistic regression the base rate of a case for the null model (the model without any predictors or the intercept-only model) is fit to the model including one or more predictors. Unlike ordinary linear regression, however, logistic regression is used for predicting binary outcomes ([[Bernoulli trials]]) rather than continuous outcomes. Given this difference, it is necessary that logistic regression take the [[natural logarithm]] of the odds (referred to as the [[logit]] or [[log-odds]]) to create a continuous criterion. The logit of success is then fit to the predictors using regression analysis. The results of the logit, however, are not intuitive, so the logit is converted back to the odds via the [[exponential function]] or the inverse of the natural logarithm. Therefore, although the observed variables in logistic regression are [[categorical]], the predicted scores are actually modelled as a continuous variable (the logit). The logit is referred to as the ''link function'' in logistic regression - although the output in logistic regression is binomial and displayed in a [[contingency table]], the logit is an underlying continuous criterion upon which linear regression is conducted. <ref>{{cite book|last=Lemeshow|first=David W. Hosmer, Stanley|title=Applied logistic regression|year=2000|publisher=Wiley|location=New York|isbn=0471356328|edition=2nd ed.}}</ref>


For example, logistic regression might be used to predict whether a patient has a given disease (e.g. [[diabetes]]), based on observed characteristics of the patient (age, gender, [[body mass index]], results of various [[blood test]]s, etc.). Another example might be to predict whether a voter will vote Democratic or Republican, based on age, income, gender, race, state of residence, votes in previous elections, etc. Logistic regression is used extensively in numerous disciplines: the medical and social sciences fields, [[natural language processing]], marketing applications such as prediction of a customer's propensity to purchase a product or cease a subscription, etc.
For example, logistic regression might be used to predict whether a patient has a given disease (e.g. [[diabetes]]), based on observed characteristics of the patient (age, gender, [[body mass index]], results of various [[blood test]]s, etc.). Another example might be to predict whether a voter will vote Democratic or Republican, based on age, income, gender, race, state of residence, votes in previous elections, etc. Logistic regression is used extensively in numerous disciplines: the medical and social sciences fields, [[natural language processing]], marketing applications such as prediction of a customer's propensity to purchase a product or cease a subscription, etc. In each of these instances, a logistic regression model would compute the relevant [[odds]] for each predictor or interaction term, take the [[natural logarithm]] of the odds (compute the [[logit]]), conduct a linear regression analysis on the predicted values of the [[logit]], and then take the [[exponential function]] of the [[logit]] to compute the [[odds ratio]].
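
The chain of quantities described above (probability, odds, logit, odds ratio) can be sketched numerically. The following is a minimal illustration, not a fitted model: the two group probabilities are hypothetical values chosen for the example, and the function names are introduced here only for illustration.

```python
import math

def odds(p):
    """Odds of a case: probability of a case divided by probability of a noncase."""
    return p / (1.0 - p)

def logit(p):
    """Natural logarithm of the odds (the log-odds)."""
    return math.log(odds(p))

def odds_ratio(p_group_a, p_group_b):
    """Odds of a case in group A divided by the odds of a case in group B."""
    return odds(p_group_a) / odds(p_group_b)

# Hypothetical probabilities of a case in two groups (assumed for illustration).
p_a, p_b = 0.75, 0.50

print("odds A:", odds(p_a))
print("odds B:", odds(p_b))
print("odds ratio A vs. B:", odds_ratio(p_a, p_b))

# A difference of logits, once exponentiated, gives the same odds ratio.
print("exp(logit A - logit B):", math.exp(logit(p_a) - logit(p_b)))
```

An odds ratio of one would indicate equal odds in both groups; here the odds in group A are larger than in group B.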


==Introduction==
==Introduction==
Both linear and logistic regression analyses compare the observed values of the criterion with the predicted values with and without the variable(s) in question in order to determine if the model that includes the variable(s) more accurately predicts the outcome than the model without that variable (or set of variables).<ref>{{cite book|last=Lemeshow|first=David W. Hosmer, Stanley|title=Applied logistic regression|year=2000|publisher=Wiley|location=New York|isbn=0471356328|edition=2nd ed.}}</ref> Given that both analyses are guided by the same goal, why is it that logistic regression is needed for analyses with a dichotomous criterion? Why is [[linear regression]] inappropriate to use with a dichotomous criterion? There are several reasons why it is inappropriate to conduct [[linear regression]] on a dichotomous criterion. First, it violates the assumption of linearity. The linear regression line is the expected value of the criterion given the predictor(s) and is equal to the intercept (the value of the criterion when the predictor(s) are equal to zero) plus the product of the regression coefficient and some given value of the predictor plus some error term - this implies that it is possible for the expected value of the criterion given the value of the predictor to take on any value as the predictor(s) ranges from <math>(-\infty,+\infty)</math>; however, this is not the case with a dichotomous criterion.<ref>{{cite book|last=Lemeshow|first=David W. Hosmer, Stanley|title=Applied logistic regression|year=2000|publisher=Wiley|location=New York|isbn=0471356328|edition=2nd ed.}}</ref> The conditional mean of a dichotomous criterion must be greater than or equal to zero and less than or equal to one, thus, the distribution is not linear but [[sigmoid]] or S-shaped. <ref>{{cite book|last=Lemeshow|first=David W. 
Hosmer, Stanley|title=Applied logistic regression|year=2000|publisher=Wiley|location=New York|isbn=0471356328|edition=2nd ed.}}</ref> As the predictors approach <math>(-\infty)</math> the criterion asymptotes at zero and as the predictors approach <math>(+\infty)</math> the criterion asymptotes at one. Linear regression disregards this information and it becomes possible for the criterion to take on probabilities less than zero and greater than one although such values are not theoretically permissible.<ref>{{cite book|first=Jacob Cohen|title=Applied multiple regression/correlation analysis for the behavioral sciences|publisher=Erlbaum|location=Mahwah, NJ [u.a.]|isbn=9780805822236|edition=3. ed.}}</ref> Furthermore, there is no straightforward interpretation of such values.
Logistic regression is a [[generalized linear model]], specifically a type of [[binomial regression]]. It is often compared with [[probit model|probit regression]], the other main type of binomial regression, which transforms the probability using the [[probit function]] (the [[quantile function]] of the [[normal distribution]]) rather than the [[logit function]]. Both functions have a similar shape, and both serve to transform the limited range of a probability, restricted to the range <math>[0,1]</math>, into the full range <math>(-\infty,+\infty)</math>, which makes the transformed value more suitable for fitting using a linear function. The effect of both functions is to transform the middle of the probability range (near 50%) more or less linearly, while stretching out the extremes (near 0% or 100%) [[exponential growth|exponentially]].


Second, conducting linear regression with a dichotomous criterion violates the assumption that the error term is [[homoscedasticity|homoscedastic]].<ref>{{cite book|last=Menard|first=Scott|title=Applied logistic regression analysis|year=2002|publisher=Sage|location=Thousand Oaks, Calif [u.a.]|isbn=978-0761922087|edition=2. ed.}}</ref> [[Homoscedasticity]] is the assumption that variance in the criterion is constant at all levels of the predictor(s). This assumption will always be violated when one has a criterion that is distributed binomially. Consider the variance formula: ''e'' = ''PQ'', wherein ''P'' is equal to the proportion of "1's" or "cases" and ''Q'' is equal to (1 - ''P''), the proportion of "0's" or "noncases" in the distribution. Given that there are only two possible outcomes in a binomial distribution, one can determine the proportion of "noncases" from the proportion of "cases" and vice versa. Likewise, one can also determine the variance of the distribution from either the proportion of "cases" or "noncases". Because the proportion of cases changes with the predictor, the variance is not independent of the predictor: the error term is not [[homoscedastic]], but [[heteroscedastic]], meaning that the variance is not equal at all levels of the predictor. The variance is greatest when the proportion of cases equals .5: ''e'' = ''PQ'' = .5(1 - .5) = .5(.5) = .25. As the proportion of cases approaches the extremes, however, error approaches zero. For example, when the proportion of cases equals .99, there is almost zero error: ''e'' = ''PQ'' = .99(1 - .99) = .99(.01) = .0099. Therefore, error or variance in the criterion is not independent of the predictor variable(s).
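
The dependence of the variance on the proportion of cases can be checked directly. The short sketch below evaluates ''PQ'' at a few proportions; the function name and the chosen proportions are assumptions made for illustration only.

```python
def bernoulli_variance(p):
    """Variance of a 0/1 criterion with proportion of cases p: P * Q, where Q = 1 - P."""
    return p * (1.0 - p)

# Proportions of cases chosen for illustration.
for p in (0.01, 0.25, 0.50, 0.75, 0.99):
    print(f"P = {p:.2f}  variance PQ = {bernoulli_variance(p):.4f}")
```

The variance peaks at ''P'' = .5 and shrinks toward zero at either extreme, which is why a constant error variance cannot hold for a dichotomous criterion.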
This agrees with the intuition that, for example, if a certain amount of lobbying causes a given senator to increase his/her probability of voting in favor of a bill from 50% to 75% (a change of 25 percentage points), the same amount of lobbying applied to an already highly favorable senator might change his/her probability from 90% to 95% (a change of 5 percentage points) rather than 90% to a nonsensical 115%; when applied to an even more favorable senator, the probability might go from 95% to 97.5% (a change of 2.5 percentage points); and when applied to an extremely unfavorable senator, it might change the probability from 5% to 10% (a change of 5 percentage points) or 10% to 20% (a change of 10 percentage points). This shows that it is unreasonable to directly model a probability using a linear function, since this implies that a given change in a predictor variable always causes the same absolute change in the response variable regardless of the previous value of the variable. Rather, it appears that a given change in a predictor causes a proportional change in the distance of the response probability from either extreme (0% or 100%): in the above example, in all cases the probability moved either twice as close to, or twice as far from, one of the extremes. The logit and probit transformations produce approximately this behavior.


Third, conducting linear regression with a dichotomous variable violates the assumption that error is normally distributed because the criterion has only two values.<ref>{{cite book|first=Jacob Cohen|title=Applied multiple regression/correlation analysis for the behavioral sciences|publisher=Erlbaum|location=Mahwah, NJ [u.a.]|isbn=9780805822236|edition=3. ed.}}</ref> Given that a dichotomous criterion violates these assumptions of linear regression, conducting linear regression with a dichotomous criterion may lead to errors in inference and at the very least, interpretation of the outcome will not be straightforward.<ref>{{cite book|first=Jacob Cohen|title=Applied multiple regression/correlation analysis for the behavioral sciences|publisher=Erlbaum|location=Mahwah, NJ [u.a.]|isbn=0805822232|edition=3. ed.}}</ref>
It is important to note that, although the observed outcomes of the response variables are [[categorical variable|categorical]] — simple "yes" or "no" outcomes — logistic regression actually models a [[continuous variable]] (the [[probability]] of "yes"). This probability is a [[latent variable]] that is assumed to generate the observed yes/no outcomes. At its heart, this is conceptually similar to ordinary [[linear regression]], which predicts the unobserved [[expected value]] of the outcome (e.g. the average income, height, etc.), which in turn generates the observed value of the outcome (which is likely to be somewhere near the average, but may differ by an "error" term). The difference is that for a simple [[normal distribution|normally distributed]] continuous variable, the average (expected) value and observed value are measured with the same units. Thus it is convenient to conceive of the observed value as simply the expected value plus some error term, and often to blur the difference between the two. For logistic regression, however, the expected value and observed value are different types of values (continuous vs. discrete), and visualizing the observed value as expected value plus error does not work. As a result, the distinction between expected and observed value must always be kept in mind.

Given the shortcomings of the linear regression model for dealing with a dichotomous criterion, it is necessary to use some other analysis. Besides logistic regression, there is at least one additional alternative analysis for dealing with a dichotomous criterion - [[discriminant function analysis]]. Like logistic regression, [[discriminant function analysis]] is a technique in which a set of predictors is used to determine group membership. There are two problems with [[discriminant function analysis]], however: first, like linear regression, [[discriminant function analysis]] may produce probabilities greater than one or less than zero, even though such probabilities are theoretically inadmissible. In addition, [[discriminant function analysis]] assumes that the predictor variables are normally distributed.<ref>{{cite book|last=Howell|first=David C.|title=Statistical methods for psychology|year=2010|publisher=Thomson Wadsworth|location=Belmont, CA|isbn=9780495597841|edition=7th ed.}}</ref> Logistic regression neither produces probabilities that lie below zero or above one, nor imposes restrictive normality assumptions on the predictors.

Logistic regression is a [[generalized linear model]], specifically a type of [[binomial regression]]. The logit transformation used in logistic regression maps the limited range of a probability, restricted to <math>[0,1]</math>, onto the full range <math>(-\infty,+\infty)</math>, which makes the transformed value more suitable for fitting using a linear function. The effect of the logit function is to transform the middle of the probability range (near 50%) more or less linearly, while stretching out the extremes (near 0% or 100%) [[exponential growth|exponentially]]. This is because in the middle of the probability range, one expects a relatively linear function; it is towards the extremes that the regression line begins to curve as it approaches its asymptotes, hence the sigmoidal distribution (see Figure 1). In essence, when conducting logistic regression, one is transforming the probability of a case outcome into the odds of a case outcome and taking the [[natural logarithm]] of the odds to create the [[logit]]. Using the odds as the criterion improves on using the probability because the odds have no fixed upper limit; however, the odds are still limited in that they have a fixed lower limit of zero, and their values do not tend to be normally distributed or linearly related to the predictors. Hence, it is necessary to take the [[natural logarithm]] of the [[odds]] to remedy these limitations.

The [[natural logarithm]] of some value ''Y'' (the criterion) is the power to which the base ''e'' must be raised to produce ''Y''. [[Euler's number]], ''e'', is a mathematical constant equal to about 2.71828. A simple example of this relationship is the case in which ''Y'' = 2.71828 or ''e''. When ''Y'' = 2.71828, ln(''Y'') = 1, because ''Y'' equals ''e'' in this instance, so ''e'' need only be raised to the power of 1 to equal itself. In other words, 1 is the power to which the base, ''e'', must be raised to equal ''Y'' (2.71828). Given that the logit is not generally interpreted and that the inverse of the [[natural logarithm]], the [[exponential function]], is applied to the [[logit]] and interpreted instead, it is also helpful to examine this function (denoted <math>e^{Y}</math>). To illustrate the relationship between the [[exponential function]] and the [[natural logarithm]], consider exponentiating the result of the natural logarithm above. There it was evident that the natural logarithm of 2.71828 was equal to 1. Here, if one exponentiates 1, the result is 2.71828; thus, the exponential function is the inverse of the natural logarithm. The [[logit]] can be thought of as a [[latent]] continuous variable that is fit to the predictors analogous to the manner in which a continuous criterion is fit to the predictors in [[linear regression]] analysis. After the criterion (the logit) is fit to the predictors, the result is [[exponential function|exponentiated]], converting the unintuitive logit back into the easily interpretable odds. It is important to note that the [[probability]], [[odds]], and [[logit]] all provide the same information. A probability of .5 is equal to odds of 1 and a logit of 0; all three values indicate that "case" and "noncase" outcomes are equally likely.
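
The inverse relationship between the natural logarithm and the exponential function, and the equivalence of probability, odds, and logit, can be verified with a few lines of code. This is a sketch with illustrative helper names that are not part of any cited source.

```python
import math

def prob_to_logit(p):
    """Probability -> odds -> natural logarithm of the odds."""
    return math.log(p / (1.0 - p))

def logit_to_odds(z):
    """Exponentiating the logit recovers the odds."""
    return math.exp(z)

def odds_to_prob(o):
    """Odds -> probability."""
    return o / (1.0 + o)

# ln(e) = 1 and exp(1) = e: the two functions undo each other.
print(math.log(math.e), math.exp(1.0))

# A probability of .5 corresponds to odds of 1 and a logit of 0.
z = prob_to_logit(0.5)
print(z, logit_to_odds(z), odds_to_prob(logit_to_odds(z)))

# Round trip for another (arbitrary) probability.
p = 0.8
print(odds_to_prob(logit_to_odds(prob_to_logit(p))))
```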

It is also important to note that, although the observed outcomes of the response variables are [[categorical variable|categorical]] — simple "yes" or "no" outcomes — logistic regression actually models a [[continuous variable]] (the [[probability]] of "yes"). This probability is a [[latent variable]] that is assumed to generate the observed yes/no outcomes. At its heart, this is conceptually similar to ordinary [[linear regression]], which predicts the unobserved [[expected value]] of the outcome (e.g. the average income, height, etc.), which in turn generates the observed value of the outcome (which is likely to be somewhere near the average, but may differ by an "error" term). The difference is that for a simple [[normal distribution|normally distributed]] continuous variable, the average (expected) value and observed value are measured with the same units. Thus it is convenient to conceive of the observed value as simply the expected value plus some error term, and often to blur the difference between the two. For logistic regression, however, the expected value and observed value are different types of values (continuous vs. discrete), and visualizing the observed value as expected value plus error does not work. As a result, the distinction between expected and observed value must always be kept in mind.


== Definition ==
== Definition ==
[[Image:Logistic-curve.svg|thumb|320px|right|Figure 1. The logistic function, with ''z'' on the horizontal axis and ''&fnof;''(''z'') on the vertical axis]]
[[Image:Logistic-curve.svg|thumb|320px|right|Figure 1. The logistic function, with <math>\beta_0 + \beta_1 X_1 + e</math> on the horizontal axis and <math>\pi(x)</math> on the vertical axis]]
An explanation of logistic regression begins with an explanation of the [[logistic function]], which, like probabilities, always takes on values between zero and one:
An explanation of logistic regression begins with an explanation of the [[logistic function]], which, like probabilities, always takes on values between zero and one:


:<math>f(z) = \frac{e^{z}}{e^{z} + 1} \! = \frac{1}{1 + e^{-z}} \! </math>
:<math>\pi(x) = \frac{e^{(\beta_0 + \beta_1 X_1 + e)}} {e^{(\beta_0 + \beta_1 X_1 + e)} + 1} = \frac {1} {e^{-(\beta_0 + \beta_1 X_1 + e)} + 1}</math>

AND

:<math>g(x) = \ln \frac {\pi(x)} {1 - \pi(x)} = \beta_0 + \beta_1 X_1 + e</math>

AND

:<math>\frac{\pi(x)} {1 - \pi(x)} = e^{(\beta_0 + \beta_1 X_1 + e)}</math><ref>{{cite book|last=Lemeshow|first=David W. Hosmer, Stanley|title=Applied logistic regression|year=2000|publisher=Wiley|location=New York|isbn=0471356328|edition=2nd ed.}}</ref>

A graph of the function is shown in Figure 1. The input is <math>\beta_0 + \beta_1 X_1 + e</math> and the output is <math>\pi(x)</math>. The logistic function is useful because it can take as an input any value from negative infinity to positive infinity, whereas the output is confined to values between 0 and 1. In the above equations, ''g''(''x'') refers to the logit function of some given predictor ''X'', ln denotes the [[natural logarithm]], <math>\pi(x)</math> is the probability of being a case, <math>\beta_0</math> is the [[intercept]] from the linear regression equation (the value of the criterion when the predictor is equal to zero), <math>\beta_1 X_1</math> is the regression coefficient multiplied by some value of the predictor, base ''e'' denotes the [[exponential function]] and ''e'' in the linear regression equation denotes the error term. The first formula illustrates that the probability of being a case is equal to the exponential function of the linear regression equation divided by one plus that same quantity. This is important in that it shows that the input of the logistic regression equation (the linear regression equation) can vary from negative to positive infinity and yet the output will vary between zero and one. The second equation illustrates that the [[logit]] (i.e., log-odds or natural logarithm of the odds) is equivalent to the linear regression equation. Likewise, the third equation illustrates that the odds of being a case is equivalent to the [[exponential function]] of the linear regression equation. This illustrates how the [[logit]] serves as a link function between the odds and the linear regression equation. Given that the logit varies from <math>(-\infty,+\infty)</math>, it provides an adequate criterion upon which to conduct linear regression, and the logit is easily converted back into the odds.<ref>{{cite book|last=Lemeshow|first=David W. Hosmer, Stanley|title=Applied logistic regression|year=2000|publisher=Wiley|location=New York|isbn=0471356328|edition=2nd ed.}}</ref>
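
The bounded output of the logistic function can be illustrated with a small sketch. The coefficient values below are hypothetical, the function names are chosen for illustration, and the error term is omitted so that the code computes only the systematic part <math>\beta_0 + \beta_1 X_1</math>.

```python
import math

def logistic(z):
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predicted_probability(x, beta0, beta1):
    """Probability of being a case for predictor value x (error term omitted)."""
    return logistic(beta0 + beta1 * x)

# Hypothetical coefficients, assumed only for this illustration.
beta0, beta1 = -3.0, 1.5

for x in (-10.0, 0.0, 2.0, 4.0, 10.0):
    print(f"x = {x:6.1f}  pi(x) = {predicted_probability(x, beta0, beta1):.6f}")
```

However large or small the input becomes, the output stays between zero and one, and an input of zero maps to a probability of .5.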

This is where it becomes extremely sensible to use reference cell coding ("0" = non case, "1" = case). With this coding scheme the odds ratio is equal to the exponential function of the regression coefficient.<ref>{{cite book|last=Lemeshow|first=David W. Hosmer, Stanley|title=Applied logistic regression|year=2000|publisher=Wiley|location=New York|isbn=0471356328|edition=2nd ed.}}</ref>

<math>OR = \frac{\left( \frac{e^{\beta_0 + \beta_1}} {1 + e^{\beta_0 + \beta_1}} \right) \div \left( \frac{1} {1 + e^{\beta_0 + \beta_1}} \right)} {\left( \frac{e^{\beta_0}} {1 + e^{\beta_0}} \right) \div \left( \frac{1} {1 + e^{\beta_0}} \right)}</math>

<math> = \frac{e^{\beta_0 + \beta_1}} {e^{\beta_0}}</math>

<math> = e^{(\beta_0 + \beta_1) - \beta_0}</math>

<math> = e^{\beta_1}</math><ref>{{cite book|last=Lemeshow|first=David W. Hosmer, Stanley|title=Applied logistic regression|year=2000|publisher=Wiley|location=New York|isbn=0471356328|edition=2nd ed.}}</ref>

Therefore, when one uses a reference coding scheme, the exponentiation of the regression coefficient is the odds ratio and no further calculations are necessary.
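
The identity can also be checked numerically. In the sketch below the coefficients are hypothetical values and the predictor is coded 0 (reference group) or 1 (target group); the odds ratio computed from the two predicted probabilities should match the exponentiated coefficient.

```python
import math

# Hypothetical coefficients, assumed only for this check.
beta0, beta1 = -1.2, 0.8

def probability(x):
    """Predicted probability of a case for a predictor coded 0 or 1."""
    z = beta0 + beta1 * x
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Probability of a case divided by probability of a noncase."""
    return p / (1.0 - p)

# Odds for x = 1 divided by odds for x = 0.
odds_ratio = odds(probability(1)) / odds(probability(0))

print("odds ratio from probabilities:", odds_ratio)
print("exp(beta1):                   ", math.exp(beta1))
```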

== Model Fitting ==

===Maximum Likelihoods===

In linear regression one uses an analytical solution to estimate regression coefficients by finding those values that minimize the sum of squared residuals (error variance).<ref>{{cite book|first=Jacob Cohen|title=Applied multiple regression/correlation analysis for the behavioral sciences|publisher=Erlbaum|location=Mahwah, NJ [u.a.]|isbn=0805822232|edition=3. ed.}}</ref> In other words, there is a series of computations that one can make to derive a solution. In logistic regression there is no set of equations from which one can derive a solution - an analytical solution does not exist. Instead, logistic regression uses the [[maximum likelihood]] procedure to estimate the coefficients that maximize the likelihood of the regression coefficients given the predictors and criterion. <ref>{{cite book|last=Menard|first=Scott|title=Applied logistic regression analysis|year=2002|publisher=Sage|location=Thousand Oaks, Calif [u.a.]|isbn=9780761922087|edition=2. ed.}}</ref> Unlike analytical solutions wherein it is possible to solve directly for the coefficients, the [[maximum likelihood]] solution is an iterative process that begins with a tentative solution, revises it slightly to see if it can be improved, and repeats this process until improvement is minute, at which point the model is said to have converged.<ref>{{cite book|last=Menard|first=Scott|title=Applied logistic regression analysis|year=2002|publisher=Sage|location=Thousand Oaks, Calif [u.a.]|isbn=9780761922087|edition=2. ed.}}</ref> What this means is that the [[maximum likelihood]] procedure has found a solution that maximizes the likelihood of the coefficients given the predictor(s) and criterion.
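
The iterative character of the procedure can be sketched for a model with an intercept and a single predictor. The text above does not name a specific algorithm; the sketch assumes the Newton-Raphson method, which is one common choice, and uses a small made-up data set. The function name, tolerance, and iteration limit are likewise assumptions for illustration.

```python
import math

def fit_logistic(xs, ys, tol=1e-8, max_iter=50):
    """Estimate (b0, b1) by Newton-Raphson on the log-likelihood (an assumed algorithm)."""
    b0, b1 = 0.0, 0.0  # tentative starting solution
    for iteration in range(1, max_iter + 1):
        g0 = g1 = 0.0           # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0   # entries of the (negated) Hessian
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the 2x2 system for the Newton step.
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        # Stop once the revision to the coefficients is minute.
        if max(abs(step0), abs(step1)) < tol:
            return b0, b1, iteration
    raise RuntimeError("model did not converge")

# Made-up data with overlapping outcomes (not completely separated).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1, n_iter = fit_logistic(xs, ys)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, iterations = {n_iter}")
```

Each pass revises the tentative coefficients, and the loop ends when a further revision would change them by less than the tolerance. If the iteration limit is reached first, the sketch raises an error rather than returning unreliable coefficients.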

In some instances the model may not reach convergence. When a model does not converge this indicates that the coefficients are not reliable as the model never reached a final solution. Lack of convergence may result from a number of problems: having a large ratio of predictors to cases, [[multicollinearity]], [[sparse matrix|sparseness]], or complete separation. Although not a precise number, as a general rule of thumb, logistic regression models require a minimum of 10 cases per variable.<ref>{{cite journal|last=Peduzzi|first=P|coauthors=Concato, J, Kemper, E, Holford, TR, Feinstein, AR|title=A simulation study of the number of events per variable in logistic regression analysis.|journal=Journal of clinical epidemiology|date=1996 Dec|volume=49|issue=12|pages=1373-9|pmid=8970487}}</ref> Having a large proportion of variables to cases results in an overly conservative Wald statistic (discussed below) and can lead to nonconvergence. Multicollinearity refers to unacceptably high correlations between predictors. As multicollinearity increases, coefficients remain unbiased but standard errors increase and the likelihood of model convergence decreases.<ref>{{cite book|last=Menard|first=Scott|title=Applied logistic regression analysis|year=2002|publisher=Sage|location=Thousand Oaks, Calif [u.a.]|isbn=9780761922087|edition=2. ed.}}</ref> To detect multicollinearity amongst the predictors, one can conduct a linear regression analysis with the predictors of interest for the sole purpose of examining the tolerance statistic<ref>{{cite book|last=Menard|first=Scott|title=Applied logistic regression analysis|year=2002|publisher=Sage|location=Thousand Oaks, Calif [u.a.]|isbn=9780761922087|edition=2. ed.}}</ref> used to assess whether multicollinearity is unacceptably high. Sparseness in the data refers to having a large proportion of empty cells (cells with zero counts). Zero cell counts are particularly problematic with categorical predictors. 
With continuous predictors, the model can infer values for the zero cell counts, but this is not the case with categorical predictors. The reason the model will not converge with zero cell counts for categorical predictors is because the natural logarithm of zero is an undefined value, so final solutions to the model cannot be reached. To remedy this problem, researchers may collapse categories in a theoretically meaningful way or may consider adding a constant to all cells.<ref>{{cite book|last=Menard|first=Scott|title=Applied logistic regression analysis|year=2002|publisher=Sage|location=Thousand Oaks, Calif [u.a.]|isbn=9780761922087|edition=2. ed.}}</ref> Another numerical problem that may lead to a lack of convergence is complete separation, which refers to the instance in which the predictors perfectly predict the criterion - all cases are accurately classified. In such instances, one should reexamine the data, as there is likely some kind of error.<ref>{{cite book|last=Lemeshow|first=David W. Hosmer, Stanley|title=Applied logistic regression|year=2000|publisher=Wiley|location=New York|isbn=0471356328|edition=2nd ed.}}</ref>

=== Deviance and Likelihood Ratio Tests ===

In linear regression analysis, one is concerned with partitioning variance via the [[sum of squares]] calculations - variance in the criterion is essentially divided into variance accounted for by the predictors and residual variance. In logistic regression analysis, deviance is used in lieu of sum of squares calculations.<ref>{{cite book|first=Jacob Cohen|title=Applied multiple regression/correlation analysis for the behavioral sciences|publisher=Erlbaum|location=Mahwah, NJ [u.a.]|isbn=9780805822236|edition=3. ed.}}</ref> Deviance is analogous to the sum of squares calculations in linear regression<ref>{{cite book|last=Lemeshow|first=David W. Hosmer, Stanley|title=Applied logistic regression|year=2000|publisher=Wiley|location=New York|isbn=0471356328|edition=2nd ed.}}</ref> and is a measure of the lack of fit to the data in a logistic regression model.<ref>{{cite book|first=Jacob Cohen|title=Applied multiple regression/correlation analysis for the behavioral sciences|publisher=Erlbaum|location=Mahwah, NJ [u.a.]|isbn=9780805822236|edition=3. ed.}}</ref> Deviance is calculated by comparing a given model with the saturated model - a model with a theoretically perfect fit.<ref>{{cite book|last=Lemeshow|first=David W. Hosmer, Stanley|title=Applied logistic regression|year=2000|publisher=Wiley|location=New York|isbn=0471356328|edition=2nd ed.}}</ref> This computation is called the [[likelihood ratio test]]:

<math> D = -2\ln \frac{\text{likelihood of the fitted model}} {\text{likelihood of the saturated model}}</math><ref>{{cite book|last=Lemeshow|first=David W. Hosmer, Stanley|title=Applied logistic regression|year=2000|publisher=Wiley|location=New York|isbn=0471356328|edition=2nd ed.}}</ref>

In the above equation ''D'' represents the deviance and ln represents the [[natural logarithm]]. The [[likelihood ratio]] (the ratio of the likelihood of the fitted model to the likelihood of the saturated model) lies between zero and one, so its natural logarithm is negative; multiplying that logarithm by negative two produces a nonnegative value with an approximate [[chi-square]] distribution.<ref>{{cite book|last=Lemeshow|first=David W. Hosmer, Stanley|title=Applied logistic regression|year=2000|publisher=Wiley|location=New York|isbn=0471356328|edition=2nd ed.}}</ref> Smaller values indicate better fit, as the fitted model deviates less from the saturated model. When assessed upon a chi-square distribution, nonsignificant chi-square values indicate very little unexplained variance and thus good model fit. Conversely, a significant [[chi-square]] value indicates that a significant amount of the variance is unexplained. Two measures of deviance are particularly important in logistic regression: null deviance and model deviance. The null deviance represents the difference between a model with only the intercept and no predictors and the saturated model. The model deviance represents the difference between a model with at least one predictor and the saturated model.<ref>{{cite book|first=Jacob Cohen|title=Applied multiple regression/correlation analysis for the behavioral sciences|publisher=Erlbaum|location=Mahwah, NJ [u.a.]|isbn=9780805822236|edition=3. ed.}}</ref> In this respect, the null model provides a baseline upon which to compare predictor models. Therefore, to assess the contribution of a predictor or set of predictors, one can subtract the model deviance from the null deviance and assess the difference on a chi-square distribution with degrees of freedom equal to the number of predictors added (one [[degrees of freedom|degree of freedom]] for a single predictor).<ref>{{cite book|last=Lemeshow|first=David W. Hosmer, Stanley|title=Applied logistic regression|year=2000|publisher=Wiley|location=New York|isbn=0471356328|edition=2nd ed.}}</ref> If the model deviance is significantly smaller than the null deviance then one can conclude that the predictor or set of predictors significantly improved model fit. This is analogous to the ''F''-test used in linear regression analysis to assess the significance of prediction.<ref>{{cite book|first=Jacob Cohen|title=Applied multiple regression/correlation analysis for the behavioral sciences|publisher=Erlbaum|location=Mahwah, NJ [u.a.]|isbn=9780805822236|edition=3. ed.}}</ref>
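
The comparison of null and model deviance can be sketched in Python as follows. The outcomes and the "fitted" probabilities are hypothetical values chosen for illustration (no model is actually estimated here), and the sketch assumes ungrouped 0/1 data, for which the saturated model has a likelihood of one.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 outcomes y given probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

def deviance(y, p):
    # For ungrouped 0/1 data the saturated model has log-likelihood 0,
    # so D = -2 ln(L_fitted / L_saturated) reduces to -2 ln(L_fitted).
    return -2.0 * log_likelihood(y, p)

# Hypothetical outcomes and probabilities, treated as if they came from a fitted model.
y = [0, 0, 0, 1, 0, 1, 1, 1]
p_model = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
base_rate = sum(y) / len(y)
p_null = [base_rate] * len(y)  # the intercept-only model predicts the base rate

d_null = deviance(y, p_null)
d_model = deviance(y, p_model)
lr_chi2 = d_null - d_model

# Upper-tail chi-square probability for 1 degree of freedom (one added predictor).
p_value = math.erfc(math.sqrt(lr_chi2 / 2.0))
print(round(d_null, 3), round(d_model, 3), round(lr_chi2, 3), round(p_value, 4))
```

The closed form using <code>math.erfc</code> holds only for one degree of freedom; with several added predictors a general chi-square routine would be needed.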


=== Pseudo-R<sup>2</sup>s===
A graph of the function is shown in figure 1. The input is ''z'' and the output is ''ƒ''(''z''). The logistic function is useful because it can take as an input any value from negative infinity to positive infinity, whereas the output is confined to values between 0 and 1. The variable ''z'' represents the exposure to some set of independent variables, while ''ƒ''(''z'') represents the probability of a particular outcome, given that set of explanatory variables. The variable ''z'' is a measure of the total contribution of all the independent variables used in the model and is known as the [[logit]].


In linear regression, the squared multiple correlation, ''R''<sup>2</sup>, is used to assess goodness of fit as it represents the proportion of variance in the criterion that is explained by the predictors.<ref>{{cite book|first=Jacob Cohen|title=Applied multiple regression/correlation analysis for the behavioral sciences|publisher=Erlbaum|location=Mahwah, NJ [u.a.]|isbn=9780805822236|edition=3. ed.}}</ref> In logistic regression analysis, there is no agreed-upon analogous measure, but there are several competing measures, each with limitations.<ref>{{cite book|first=Jacob Cohen|title=Applied multiple regression/correlation analysis for the behavioral sciences|publisher=Erlbaum|location=Mahwah, NJ [u.a.]|isbn=9780805822236|edition=3. ed.}}</ref> Three of the most commonly used indices are examined on this page, beginning with the likelihood ratio ''R''<sup>2</sup>, ''R''<sup>2</sup><sub>''L''</sub>:
The variable ''z'' is usually defined as
:<math>z=\beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \cdots + \beta_kx_k,</math>


:<math>R^2_L = \frac{D_{null} - D_{model}} {D_{null}}</math><ref>{{cite book|first=Jacob Cohen|title=Applied multiple regression/correlation analysis for the behavioral sciences|publisher=Erlbaum|location=Mahwah, NJ [u.a.]|isbn=9780805822236|edition=3. ed.}}</ref>
where <math>\beta_0</math> is called the "[[y-intercept|intercept]]" and <math>\beta_1</math>, <math>\beta_2</math>, <math>\beta_3</math>, and so on, are called the "[[regression coefficient]]s" of <math>x_1</math>, <math>x_2</math>, <math>x_3</math> respectively. The intercept is the value of ''z'' when the values of all independent variables are zero (e.g. the value of ''z'' in someone with no risk factors). Each of the regression coefficients describes the size of the contribution of that risk factor. A positive regression coefficient means that the explanatory variable increases the probability of the outcome, while a negative regression coefficient means that the variable decreases the probability of that outcome; a large regression coefficient means that the risk factor strongly influences the probability of that outcome, while a near-zero regression coefficient means that the risk factor has little influence on the probability of that outcome.


This is the most analogous index to the squared multiple correlation in linear regression.<ref>{{cite book|last=Menard|first=Scott|title=Applied logistic regression analysis|year=2002|publisher=Sage|location=Thousand Oaks, Calif [u.a.]|isbn=9780761922087|edition=2. ed.}}</ref> It represents the proportional reduction in the deviance wherein the deviance is treated as a measure of variation analogous but not identical to the [[variance]] in [[linear regression]] analysis.<ref>{{cite book|last=Menard|first=Scott|title=Applied logistic regression analysis|year=2002|publisher=Sage|location=Thousand Oaks, Calif [u.a.]|isbn=9780761922087|edition=2. ed.}}</ref> One limitation of the [[likelihood ratio]] ''R''<sup>2</sup> is that it is not monotonically related to the odds ratio,<ref>{{cite book|first=Jacob Cohen|title=Applied multiple regression/correlation analysis for the behavioral sciences|publisher=Erlbaum|location=Mahwah, NJ [u.a.]|isbn=9780805822236|edition=3. ed.}}</ref> meaning that it does not necessarily increase as the odds ratio increases and does not necessarily decrease as the odds ratio decreases.
Logistic regression is a useful way of describing the relationship between one or more independent variables (e.g., age, sex, etc.) and a binary response variable, expressed as a probability, that has only two values, such as having cancer ("has cancer" or "doesn't have cancer") <!-- previous example (death) is not as instructive; 'death' is easily verifiable and needs no probability-->.


The Cox and Snell ''R''<sup>2</sup> is an alternative index of goodness of fit related to the ''R''<sup>2</sup> value from linear regression. The Cox and Snell index is problematic as its maximum value is .75, when the [[variance]] is at its maximum (.25). The Nagelkerke ''R''<sup>2</sup> provides a correction to the Cox and Snell ''R''<sup>2</sup> so that the maximum value is equal to one. Nevertheless, the Cox and Snell and likelihood ratio ''R''<sup>2</sup>s show greater agreement with each other than either does with the Nagelkerke ''R''<sup>2</sup>.<ref>{{cite book|first=Jacob Cohen|title=Applied multiple regression/correlation analysis for the behavioral sciences|publisher=Erlbaum|location=Mahwah, NJ [u.a.]|isbn=9780805822236|edition=3. ed.}}</ref> Of course, this might not be the case for values exceeding .75 as the Cox and Snell index is capped at this value. The likelihood ratio ''R''<sup>2</sup> is often preferred to the alternatives as it is most analogous to ''R''<sup>2</sup> in [[linear regression]], is independent of the base rate (both Cox and Snell and Nagelkerke ''R''<sup>2</sup>s increase as the proportion of cases increases from 0 to .5) and varies between 0 and 1.
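
The three indices can be compared numerically with a short Python sketch. The formulas used for the Cox and Snell and Nagelkerke indices are their standard definitions in terms of deviance, which are not spelled out in the text above; the deviances and sample size are hypothetical, and ungrouped 0/1 outcomes are assumed.

```python
import math

def pseudo_r2(d_null, d_model, n):
    """Return (likelihood ratio, Cox and Snell, Nagelkerke) pseudo-R^2 values.

    Assumes ungrouped 0/1 outcomes, so each deviance equals -2 * log-likelihood.
    """
    r2_l = (d_null - d_model) / d_null
    r2_cs = 1.0 - math.exp(-(d_null - d_model) / n)
    r2_cs_max = 1.0 - math.exp(-d_null / n)  # upper bound of Cox and Snell
    r2_n = r2_cs / r2_cs_max                 # Nagelkerke rescales to a maximum of 1
    return r2_l, r2_cs, r2_n

# Hypothetical sample: n = 100 observations with a base rate of .5.
n = 100
d_null = -2.0 * n * math.log(0.5)
r2_l, r2_cs, r2_n = pseudo_r2(d_null, d_model=100.0, n=n)

# A perfectly fitting model (deviance 0) shows the Cox and Snell ceiling.
_, cs_ceiling, _ = pseudo_r2(d_null, d_model=0.0, n=n)
print(round(r2_l, 3), round(r2_cs, 3), round(r2_n, 3), round(cs_ceiling, 3))
```

With a base rate of .5 the ceiling evaluates to .75, which is the cap on the Cox and Snell index mentioned above.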
== Sample size-dependent efficiency ==
Logistic regression tends to systematically overestimate odds ratios or beta coefficients when the sample size is less than about 500. With increasing sample size, the magnitude of overestimation diminishes and the estimated odds ratio asymptotically approaches the true population value. In a single study, overestimation due to small sample size might not have any relevance for the interpretation of the results, since it is much lower than the standard error of the estimate. However, if a number of small studies with systematically overestimated effects are pooled together without consideration of this effect, an effect may be perceived when in reality it does not exist.<ref>Nemes S, Jonasson JM, Genell A, Steineck G. 2009 Bias in odds ratios by logistic regression modelling and sample size. BMC Medical Research Methodology 9:56 [http://www.biomedcentral.com/1471-2288/9/56 BioMedCentral]</ref>


A word of caution is in order when interpreting pseudo-''R''<sup>2</sup> statistics. These indices of fit are referred to as ''pseudo'' ''R''<sup>2</sup> because they do not represent the proportionate reduction in error as the ''R''<sup>2</sup> in [[linear regression]] does.<ref>{{cite book|first=Jacob Cohen|title=Applied multiple regression/correlation analysis for the behavioral sciences|publisher=Erlbaum|location=Mahwah, NJ [u.a.]|isbn=9780805822236|edition=3. ed.}}</ref> Linear regression assumes [[homoscedasticity]], that is, that the error variance is the same for all values of the criterion. Logistic regression will always be [[heteroscedastic]] - the error variances differ for each value of the predicted score. For each value of the predicted score there would be a different value of the proportionate reduction in error. Therefore, it is inappropriate to think of ''R''<sup>2</sup> as a proportionate reduction in error in a universal sense in logistic regression.<ref>{{cite book|first=Jacob Cohen|title=Applied multiple regression/correlation analysis for the behavioral sciences|publisher=Erlbaum|location=Mahwah, NJ [u.a.]|isbn=9780805822236|edition=3. ed.}}</ref>
A minimum of 10 events per independent variable has been recommended.<ref>{{cite journal|author=Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR|title=A simulation study of the number of events per variable in logistic regression analysis|journal=J Clin Epidemiol|year=1996|volume=49|issue=12|pages=1373&ndash;9|pmid=8970487}}</ref><ref>{{cite book|author=Agresti A|title=An Introduction to Categorical Data Analysis|chapter=Building and applying logistic regression models|year=2007|publisher=Wiley|location=Hoboken, New Jersey|page=138|isbn=978-0-471-22618-5}}</ref> For example, in a study where death is the outcome of interest, and 50 of 100 patients die, the maximum number of independent variables the model can support is&nbsp;50/10&nbsp;=&nbsp;5.


== Example ==
== Coefficients ==


After fitting the model, it is likely that researchers will want to examine the contribution of individual predictors. To do so, they will want to examine the regression coefficients. In linear regression, the regression coefficients represent the change in the criterion for each unit change in the predictor.<ref>{{cite book|first=Jacob Cohen|title=Applied multiple regression/correlation analysis for the behavioral sciences|publisher=Erlbaum|location=Mahwah, NJ [u.a.]|isbn=9780805822236|edition=3. ed.}}</ref> In logistic regression, however, the regression coefficients represent the change in the logit for each unit change in the predictor. Given that the logit is not intuitive, researchers are likely to focus instead on the [[exponential function]] of the regression coefficient - the odds ratio (see [[logistic regression#definition|definition]]). In linear regression, the significance of a regression coefficient is assessed by computing a ''t''-test. In logistic regression, there are a couple of different tests designed to assess the significance of an individual predictor, most notably the [[likelihood ratio test]] and the [[Wald Test|Wald statistic]].
The application of a logistic regression may be illustrated using a fictitious example of death from heart disease. This simplified model uses only three risk factors (age, sex, and blood cholesterol level) to predict the 10-year risk of death from heart disease. These are the parameters that the data fit:
:<math>\beta_0=-5.0 \text{ (the intercept)}</math>
:<math>\beta_1=+2.0</math>
:<math>\beta_2=-1.0</math>
:<math>\beta_3=+1.2</math>
:<math>x_1=\text{ age in years, above 50}</math>
:<math>x_2=\text{ sex, where 0 is male and 1 is female}</math>
:<math>x_3=\text{ cholesterol level, in mmol/L above 5.0}</math>


=== Likelihood Ratio Test ===
The model can hence be expressed as


The [[likelihood ratio test]] discussed above to assess model fit is also the recommended procedure to assess the contribution of individual predictors to a given model.<ref>{{cite book|first=Jacob Cohen|title=Applied multiple regression/correlation analysis for the behavioral sciences|publisher=Erlbaum|location=Mahwah, NJ [u.a.]|isbn=9780805822236|edition=3. ed.}}</ref><ref>{{cite book|last=Lemeshow|first=David W. Hosmer, Stanley|title=Applied logistic regression|year=2000|publisher=Wiley|location=New York|isbn=0471356328|edition=2nd ed.}}</ref><ref>{{cite book|last=Menard|first=Scott|title=Applied logistic regression analysis|year=2002|publisher=Sage|location=Thousand Oaks, Calif [u.a.]|isbn=9780761922087|edition=2. ed.}}</ref> In the case of a single predictor model, one simply compares the deviance of the predictor model with that of the null model on a chi-square distribution with a single degree of freedom. If the predictor model has a significantly smaller deviance, then one can conclude that the predictor significantly predicts the criterion. Given that some common statistical packages (e.g., SAS, SPSS) do not provide likelihood ratio test statistics, it can be more difficult to assess the contribution of individual predictors in the multiple logistic regression case. To assess the contribution of individual predictors one can enter the predictors hierarchically, comparing each new model with the previous to determine the contribution of each predictor.<ref>{{cite book|first=Jacob Cohen|title=Applied multiple regression/correlation analysis for the behavioral sciences|publisher=Erlbaum|location=Mahwah, NJ [u.a.]|isbn=9780805822236|edition=3. ed.}}</ref>
:<math>\text{risk of death} = \frac{1}{1+e^{-z}} \text{, where } z=-5.0 +2.0x_1 -1.0x_2 + 1.2x_3.</math>


=== Wald Statistic ===
In this model, increasing age is associated with an increasing risk of death from heart disease (''z'' goes up by 2.0 for every year over the age of 50), female sex is associated with a decreased risk of death from heart disease (''z'' goes down by 1.0 if the patient is female), and increasing cholesterol is associated with an increasing risk of death (''z'' goes up by 1.2 for each 1&nbsp;mmol/L increase in cholesterol above 5&nbsp;mmol/L).


Alternatively, when assessing the contribution of individual predictors in a given model, one may examine the significance of the [[Wald Test|Wald statistic]]. The [[Wald Test|Wald statistic]], analogous to the ''t''-test in linear regression, is used to assess the significance of coefficients. The [[Wald Test|Wald statistic]] is the ratio of the square of the regression coefficient to the square of the standard error of the coefficient and is asymptotically distributed as a chi-square distribution.<ref>{{cite book|last=Menard|first=Scott|title=Applied logistic regression analysis|year=2002|publisher=Sage|location=Thousand Oaks, Calif [u.a.]|isbn=9780761922087|edition=2. ed.}}</ref>
We wish to use this model to predict a particular subject's risk of death from heart disease: he is 50 years old and his cholesterol level is 7.0&nbsp;mmol/L. The subject's risk of death is therefore


: <math> \frac{1}{1+e^{-z}} \text{, where } z=-5.0 + (+2.0)(50-50) + (-1.0)0 + (+1.2)(7.0-5.0).</math>
<math>W_j = \frac{B^2_j} {SE^2_{B_j}}</math>


Although several statistical packages (e.g., SPSS, SAS) report the [[Wald Test|Wald statistic]] to assess the contribution of individual predictors, the [[Wald Test|Wald statistic]] is not without limitations. When the regression coefficient is large, the standard error of the regression coefficient also tends to be large, which increases the probability of [[Type I and Type II errors|Type-II error]]. The [[Wald Test|Wald statistic]] also tends to be biased when data are sparse.<ref>{{cite book|first=Jacob Cohen|title=Applied multiple regression/correlation analysis for the behavioral sciences|publisher=Erlbaum|location=Mahwah, NJ [u.a.]|isbn=9780805822236|edition=3. ed.}}</ref>
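
A minimal Python sketch of the Wald statistic follows. The coefficient and standard error are hypothetical, and the function name <code>wald_test</code> is chosen for illustration; the p-value uses the closed form for a chi-square distribution with one degree of freedom.

```python
import math

def wald_test(b, se):
    """Wald chi-square statistic W = b^2 / se^2 and its 1-df upper-tail p-value."""
    w = (b / se) ** 2
    p_value = math.erfc(math.sqrt(w / 2.0))
    return w, p_value

# Hypothetical regression coefficient and standard error.
w, p = wald_test(b=1.2, se=0.5)
print(round(w, 2), round(p, 4))
```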
This means that by this model, the subject's risk of dying from heart disease in the next 10 years is 0.07 (or 7%).
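
The arithmetic of this fictitious example can be checked with a short Python sketch. The function name <code>risk_of_death</code> and its argument names are illustrative; the coefficients are those listed for the example model.

```python
import math

def risk_of_death(age, female, cholesterol):
    """10-year risk of death under the fictitious model of the example."""
    x1 = age - 50            # age in years above 50
    x2 = 1 if female else 0  # sex, where 0 is male and 1 is female
    x3 = cholesterol - 5.0   # cholesterol level in mmol/L above 5.0
    z = -5.0 + 2.0 * x1 - 1.0 * x2 + 1.2 * x3
    return 1.0 / (1.0 + math.exp(-z))

# The subject of the example: a 50-year-old man with a cholesterol level of 7.0 mmol/L.
risk = risk_of_death(age=50, female=False, cholesterol=7.0)
print(round(risk, 2))  # 0.07
```

Here ''z'' evaluates to -2.6, which the logistic function maps to a risk of about 0.07.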


== Formal mathematical specification ==
Line 483: Line 520:
| year = 1991
| isbn = 978-0-8247-8587-1 }}
*{{cite book
| last = Cohen
| first = Jacob
| coauthors = Patricia Cohen, Steven G. West, Leona S. Aiken
| title = Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, 3rd ed.
| publisher = New York: Routledge
| year = 2003
| isbn = 978-0-8058-2223-6 }}
*{{cite book
| last = Greene
Line 505: Line 550:
| year = 2000
| isbn = 0-471-35632-8 }}
*{{cite book
| last = Howell
| first = David C.
| title = Statistical Methods for Psychology, 7th ed.
| publisher = Belmont, CA; Thomson Wadsworth
| year = 2010
| isbn = 978-0-495-59786-5 }}
*{{cite book
| last = Menard
| first = Scott W.
| title = Applied Logistic Regression, 2nd ed.
| publisher = Thousand Oaks; SAGE
| year = 2002
| isbn = 9780761922087 }}
*{{cite journal
| last = Peduzzi
| first = P.
| coauthors = J. Concato, E. Kemper, T.R. Holford, A.R. Feinstein
| title = "A simulation study of the number of events per variable in logistic regression analysis"
| journal = Journal of clinical epidemiology
| volume = 49 (12)
| pages = 1373 - 1379
| year = 1996
| PMID = 8970487 }}


==External links==

Revision as of 00:10, 28 April 2012

In statistics, logistic regression is a type of regression analysis used for predicting the outcome of a categorical criterion variable (a variable that can take on a limited number of categories) based on one or more predictor variables. Logistic regression can be bi- or multinomial. Binomial or binary logistic regression refers to the instance in which the criterion can take on only two possible outcomes (e.g., "dead" vs. "alive", "success" vs. "failure", or "yes" vs. "no"). Multinomial logistic regression refers to the instance in which the criterion can take on three or more possible outcomes (e.g., "better" vs. "no change" vs. "worse"). Generally, the criterion is coded as "0" and "1" in binary logistic regression as it leads to the most straightforward interpretation.[1] The target group (referred to as a "case") is usually coded as "1" and the reference group (referred to as a "noncase") as "0". The binomial distribution has a mean equal to the proportion of cases, denoted P, and a variance equal to the product of the proportions of cases and noncases, PQ, wherein Q is equal to the proportion of noncases or 1 - P.[2] Accordingly, the standard deviation is simply the square root of PQ. Logistic regression is used to predict the odds of being a case based on the predictor(s). The odds are defined as the probability of a case divided by the probability of a noncase. The odds ratio is the primary measure of effect size in logistic regression and is computed to compare the odds that membership in one group will lead to a case outcome with the odds that membership in some other group will lead to a case outcome. The odds ratio (denoted OR) is simply the odds of being a case for one group divided by the odds of being a case for another group. An odds ratio of one indicates that the odds of a case outcome are equal for both groups under comparison. The further the odds ratio deviates from one, the stronger the relationship. The odds ratio has a floor of zero but no ceiling (upper limit) - theoretically, the odds ratio can increase infinitely.[3]

Like other forms of regression analysis, logistic regression makes use of one or more predictor variables that may be either continuous or categorical. Also, like other linear regression models, the expected value (average value) of the response variable is fit to the predictors - the expected value of a Bernoulli distribution is simply the probability of a case. In other words, in logistic regression the base rate of a case for the null model (the model without any predictors or the intercept-only model) is fit to the model including one or more predictors. Unlike ordinary linear regression, however, logistic regression is used for predicting binary outcomes (Bernoulli trials) rather than continuous outcomes. Given this difference, it is necessary that logistic regression take the natural logarithm of the odds (referred to as the logit or log-odds) to create a continuous criterion. The logit of success is then fit to the predictors using regression analysis. The results of the logit, however, are not intuitive, so the logit is converted back to the odds via the exponential function or the inverse of the natural logarithm. Therefore, although the observed variables in logistic regression are categorical, the predicted scores are actually modelled as a continuous variable (the logit). The logit is referred to as the link function in logistic regression - although the output in logistic regression is binomial and displayed in a contingency table, the logit is an underlying continuous criterion upon which linear regression is conducted. [4]

For example, logistic regression might be used to predict whether a patient has a given disease (e.g. diabetes), based on observed characteristics of the patient (age, gender, body mass index, results of various blood tests, etc.). Another example might be to predict whether a voter will vote Democratic or Republican, based on age, income, gender, race, state of residence, votes in previous elections, etc. Logistic regression is used extensively in numerous disciplines: the medical and social sciences fields, natural language processing, marketing applications such as prediction of a customer's propensity to purchase a product or cease a subscription, etc. In each of these instances, a logistic regression model would compute the relevant odds for each predictor or interaction term, take the natural logarithm of the odds (compute the logit), conduct a linear regression analysis on the predicted values of the logit, and then take the exponential function of the logit to compute the odds ratio.

Introduction

Both linear and logistic regression analyses compare the observed values of the criterion with the predicted values with and without the variable(s) in question in order to determine if the model that includes the variable(s) more accurately predicts the outcome than the model without that variable (or set of variables).[5] Given that both analyses are guided by the same goal, why is it that logistic regression is needed for analyses with a dichotomous criterion? Why is linear regression inappropriate to use with a dichotomous criterion? There are several reasons why it is inappropriate to conduct linear regression on a dichotomous criterion. First, it violates the assumption of linearity. The linear regression line is the expected value of the criterion given the predictor(s) and is equal to the intercept (the value of the criterion when the predictor(s) are equal to zero) plus the product of the regression coefficient and some given value of the predictor plus some error term - this implies that it is possible for the expected value of the criterion given the value of the predictor to take on any value as the predictor(s) range from negative infinity to positive infinity; however, this is not the case with a dichotomous criterion.[6] The conditional mean of a dichotomous criterion must be greater than or equal to zero and less than or equal to one; thus, the distribution is not linear but sigmoid or S-shaped.[7] As the predictors approach negative infinity the criterion asymptotes at zero, and as the predictors approach positive infinity the criterion asymptotes at one. Linear regression disregards this information and it becomes possible for the criterion to take on probabilities less than zero and greater than one although such values are not theoretically permissible.[8] Furthermore, there is no straightforward interpretation of such values.

Second, conducting linear regression with a dichotomous criterion violates the assumption that the error term is homoscedastic.[9] Homoscedasticity is the assumption that variance in the criterion is constant at all levels of the predictor(s). This assumption will always be violated when one has a criterion that is distributed binomially. Consider the variance formula: e = PQ, wherein P is equal to the proportion of "1's" or "cases" and Q is equal to (1 - P), the proportion of "0's" or "noncases" in the distribution. Given that there are only two possible outcomes in a binomial distribution, one can determine the proportion of "noncases" from the proportion of "cases" and vice versa. Likewise, one can also determine the variance of the distribution from either the proportion of "cases" or "noncases". That is to say that the variance is not independent of the predictor - the error term is not homoscedastic, but heteroscedastic, meaning that the variance is not equal at all levels of the predictor. The variance is greatest when the proportion of cases equals .5: e = PQ = .5(1 - .5) = .5(.5) = .25. As the proportion of cases approaches the extremes, however, error approaches zero. For example, when the proportion of cases equals .99, there is almost zero error: e = PQ = .99(1 - .99) = .99(.01) = .0099. Therefore, error or variance in the criterion is not independent of the predictor variable(s).
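
The dependence of the variance on the proportion of cases can be illustrated with a few lines of Python; the function name and the proportions chosen are for illustration only.

```python
def bernoulli_variance(p):
    """Variance PQ of a 0/1 criterion whose proportion of cases is p."""
    return p * (1.0 - p)

# The variance peaks at p = .5 and shrinks toward zero at the extremes.
for p in (0.5, 0.9, 0.99):
    print(p, round(bernoulli_variance(p), 4))
```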

Third, conducting linear regression with a dichotomous variable violates the assumption that error is normally distributed because the criterion has only two values.[10] Given that a dichotomous criterion violates these assumptions of linear regression, conducting linear regression with a dichotomous criterion may lead to errors in inference and at the very least, interpretation of the outcome will not be straightforward.[11]

Given the shortcomings of the linear regression model for dealing with a dichotomous criterion, it is necessary to use some other analysis. Besides logistic regression, there is at least one additional alternative analysis for dealing with a dichotomous criterion - discriminant function analysis. Like logistic regression, discriminant function analysis is a technique in which a set of predictors is used to determine group membership. There are two problems with discriminant function analysis, however: first, like linear regression, discriminant function analysis may produce probabilities greater than one or less than zero, even though such probabilities are theoretically inadmissible. In addition, discriminant function analysis assumes that the predictor variables are normally distributed.[12] Logistic regression neither produces probabilities that lie below zero or above one, nor imposes restrictive normality assumptions on the predictors.

Logistic regression is a generalized linear model, specifically a type of binomial regression. Logistic regression serves to transform the limited range of a probability, restricted to the range from zero to one, into the full range from negative infinity to positive infinity, which makes the transformed value more suitable for fitting using a linear function. The effect of this transformation is to map the middle of the probability range (near 50%) more or less linearly, while stretching out the extremes (near 0% or 100%) exponentially. This is because in the middle of the probability range, one expects a relatively linear function - it is towards the extremes that the regression line begins to curve as it approaches asymptote; hence, the sigmoidal distribution (see Figure 1). In essence, when conducting logistic regression, one is transforming the probability of a case outcome into the odds of a case outcome and taking the natural logarithm of the odds to create the logit. The odds as a criterion provides an improvement over probability as the criterion as the odds has no fixed upper limit; however, the odds is still limited in that it has a fixed lower limit of zero and its values do not tend to be normally distributed or linearly related to the predictors. Hence, it is necessary to take the natural logarithm of the odds to remedy these limitations.

The natural logarithm of some value Y (the criterion) is the power to which the base, e, must be raised to produce Y. Euler's number or e is a mathematical constant equal to about 2.71828. An excellent example of this relationship is when Y = 2.71828 or e. When Y = 2.71828, ln(Y) = 1, because Y equals e in this instance, so e must only be raised to the power of 1 to equal itself. In other words, 1 is the power to which the base, e, must be raised to equal Y (2.71828). Given that the logit is not generally interpreted and that the inverse of the natural logarithm, the exponential function of the logit, is generally interpreted instead, it is also helpful to examine this function. To illustrate the relationship between the exponential function and the natural logarithm, consider the exponentiation of the result of the natural logarithm above. There it was evident that the natural logarithm of 2.71828 was equal to 1. Here, if one exponentiates 1, the result is 2.71828; thus, the exponential function is the inverse of the natural logarithm. The logit can be thought of as a latent continuous variable that is fit to the predictors analogous to the manner in which a continuous criterion is fit to the predictors in linear regression analysis. After the criterion (the logit) is fit to the predictors the result is exponentiated, converting the unintuitive logit back into the easily interpretable odds. It is important to note that the probability, the odds, and the logit all provide the same information. A probability of .5 is equal to odds of 1 and a logit of 0 - all three values indicate that "case" and "noncase" outcomes are equally likely.
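
The equivalence of probability, odds, and logit can be sketched in Python as follows; the function names are chosen for illustration.

```python
import math

def odds(p):
    """Odds of a case: probability of a case over probability of a noncase."""
    return p / (1.0 - p)

def logit(p):
    """Natural logarithm of the odds."""
    return math.log(odds(p))

def inverse_logit(x):
    """Convert a logit back into a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# A probability of .5 corresponds to odds of 1 and a logit of 0.
print(odds(0.5), logit(0.5))

# Exponentiating the logit recovers the odds, and the inverse logit recovers the probability.
p = 0.8
print(round(math.exp(logit(p)), 6), round(inverse_logit(logit(p)), 6))
```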

It is also important to note that, although the observed outcomes of the response variables are categorical — simple "yes" or "no" outcomes — logistic regression actually models a continuous variable (the probability of "yes"). This probability is a latent variable that is assumed to generate the observed yes/no outcomes. At its heart, this is conceptually similar to ordinary linear regression, which predicts the unobserved expected value of the outcome (e.g. the average income, height, etc.), which in turn generates the observed value of the outcome (which is likely to be somewhere near the average, but may differ by an "error" term). The difference is that for a simple normally distributed continuous variable, the average (expected) value and observed value are measured with the same units. Thus it is convenient to conceive of the observed value as simply the expected value plus some error term, and often to blur the difference between the two. For logistic regression, however, the expected value and observed value are different types of values (continuous vs. discrete), and visualizing the observed value as expected value plus error does not work. As a result, the distinction between expected and observed value must always be kept in mind.

Definition

Figure 1. The logistic function, with $\beta_0 + \beta_1 X$ on the horizontal axis and $\pi(X)$ on the vertical axis

An explanation of logistic regression begins with an explanation of the logistic function, which, like probabilities, always takes on values between zero and one:

$$\pi(X) = \frac{e^{\beta_0 + \beta_1 X}}{e^{\beta_0 + \beta_1 X} + 1} = \frac{1}{e^{-(\beta_0 + \beta_1 X)} + 1},$$

and

$$g(X) = \ln\frac{\pi(X)}{1 - \pi(X)} = \beta_0 + \beta_1 X,$$

and

$$\frac{\pi(X)}{1 - \pi(X)} = e^{\beta_0 + \beta_1 X}.$$

[13]

A graph of the function is shown in figure 1. The input is $\beta_0 + \beta_1 X$ and the output is $\pi(X)$. The logistic function is useful because it can take as an input any value from negative infinity to positive infinity, whereas the output is confined to values between 0 and 1. In the above equations, g(X) refers to the logit function of some given predictor X, ln denotes the natural logarithm, $\pi(X)$ is the probability of being a case, $\beta_0$ is the intercept from the linear regression equation (the value of the criterion when the predictor is equal to zero), $\beta_1 X$ is the regression coefficient multiplied by some value of the predictor, and base e denotes the exponential function. The first formula illustrates that the probability of being a case is equal to the odds (the exponential function of the linear regression equation) divided by one plus the odds. This is important in that it shows that the input of the logistic regression equation (the linear regression equation) can vary from negative to positive infinity and yet the output will vary between zero and one. The second equation illustrates that the logit (i.e., log-odds or natural logarithm of the odds) is equivalent to the linear regression equation. Likewise, the third equation illustrates that the odds of being a case are equivalent to the exponential function of the linear regression equation. This illustrates how the logit serves as a link function between the probability and the linear regression equation. Given that the logit varies from $-\infty$ to $+\infty$, it provides an adequate criterion upon which to conduct linear regression, and the logit is easily converted back into the odds.[14]

This is where it becomes extremely sensible to use reference cell coding ("0" = non case, "1" = case). With this coding scheme the odds ratio is equal to the exponential function of the regression coefficient.[15]

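$$OR = \frac{\operatorname{odds}(X = 1)}{\operatorname{odds}(X = 0)} = \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1}$$

[16]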

Therefore, when one uses a reference coding scheme, the exponentiation of the regression coefficient is the odds ratio and no further calculations are necessary.
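
A short numerical check may make this concrete. The sketch below assumes a single predictor coded 0 or 1 and uses made-up coefficient values (`b0`, `b1`) purely for illustration: it computes the odds in each group from the logistic function and compares their ratio with the exponentiated coefficient.

```python
import math

def logistic_probability(b0, b1, x):
    """Probability of being a case: 1 / (1 + exp(-(b0 + b1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -1.5, 0.8

odds_reference = odds(logistic_probability(b0, b1, 0))   # predictor coded 0
odds_comparison = odds(logistic_probability(b0, b1, 1))  # predictor coded 1
odds_ratio = odds_comparison / odds_reference

# The two printed values agree up to floating-point rounding.
print(odds_ratio, math.exp(b1))
```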

Model Fitting

Maximum Likelihoods

In linear regression one uses an analytical solution to estimate regression coefficients by finding those values that minimize the sum of squared residuals (error variance).[17] In other words, there is a series of computations that one can make to derive a solution. In logistic regression there is no set of equations from which one can derive a solution - an analytical solution does not exist. Instead, logistic regression uses the maximum likelihood procedure to estimate the coefficients that maximize the likelihood of the regression coefficients given the predictors and criterion. [18] Unlike analytical solutions wherein it is possible to solve directly for the coefficients, the maximum likelihood solution is an iterative process that begins with a tentative solution, revises it slightly to see if it can be improved, and repeats this process until improvement is minute, at which point the model is said to have converged.[19] What this means is that the maximum likelihood procedure has found a solution that maximizes the likelihood of the coefficients given the predictor(s) and criterion.
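
The iterative procedure can be sketched with Newton-Raphson steps for a model with an intercept and one predictor. This is a minimal illustration under stated assumptions, not the routine used by any particular statistical package: the data are made up, the starting values are zero, and the stopping rule (largest coefficient change below `tol`) is one of several reasonable choices.

```python
import math

def predicted_probability(b0, b1, x):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def fit_logistic(xs, ys, max_iter=25, tol=1e-8):
    """Return (b0, b1, log_likelihood, iterations_used) for a one-predictor model."""
    b0, b1 = 0.0, 0.0  # tentative starting solution
    for iteration in range(1, max_iter + 1):
        # Accumulate the gradient of the log-likelihood and the
        # entries of the (negated) Hessian at the current coefficients.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = predicted_probability(b0, b1, x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the 2x2 linear system for the Newton step.
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tol:
            break  # improvement is minute: the model has converged
    log_lik = 0.0
    for x, y in zip(xs, ys):
        p = predicted_probability(b0, b1, x)
        log_lik += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return b0, b1, log_lik, iteration

# Hypothetical data (not perfectly separated).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
print(fit_logistic(xs, ys))
```

At convergence the gradient of the log-likelihood is essentially zero, which is the condition the maximum likelihood estimates must satisfy.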

In some instances the model may not reach convergence. When a model does not converge, the coefficients are not reliable, as the model never reached a final solution. Lack of convergence may result from a number of problems: having a large ratio of predictors to cases, multicollinearity, sparseness, or complete separation. Although not a precise number, as a general rule of thumb, logistic regression models require a minimum of 10 cases per variable.[20] Having a large ratio of variables to cases results in an overly conservative Wald statistic (discussed below) and can lead to nonconvergence. Multicollinearity refers to unacceptably high correlations between predictors. As multicollinearity increases, coefficients remain unbiased but standard errors increase and the likelihood of model convergence decreases.[21] To detect multicollinearity amongst the predictors, one can conduct a linear regression analysis with the predictors of interest for the sole purpose of examining the tolerance statistic[22] used to assess whether multicollinearity is unacceptably high. Sparseness in the data refers to having a large proportion of empty cells (cells with zero counts). Zero cell counts are particularly problematic with categorical predictors. With continuous predictors, the model can infer values for the zero cell counts, but this is not the case with categorical predictors. The reason the model will not converge with zero cell counts for categorical predictors is that the natural logarithm of zero is undefined, so final solutions to the model cannot be reached. To remedy this problem, researchers may collapse categories in a theoretically meaningful way or may consider adding a constant to all cells.[23] Another numerical problem that may lead to a lack of convergence is complete separation, which refers to the instance in which the predictors perfectly predict the criterion, so that all cases are accurately classified. In such instances, one should reexamine the data, as there is likely some kind of error.[24]

Deviance and Likelihood Ratio Tests

In linear regression analysis, one is concerned with partitioning variance via the sum of squares calculations - variance in the criterion is essentially divided into variance accounted for by the predictors and residual variance. In logistic regression analysis, deviance is used in lieu of sum of squares calculations.[25] Deviance is analogous to the sum of squares calculations in linear regression[26] and is a measure of the lack of fit to the data in a logistic regression model.[27] Deviance is calculated by comparing a given model with the saturated model - a model with a theoretically perfect fit.[28] This computation is called the likelihood ratio test:

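$$D = -2 \ln \frac{\text{likelihood of the fitted model}}{\text{likelihood of the saturated model}}$$

[29]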

In the above equation D represents the deviance and ln represents the natural logarithm. The likelihood ratio (the ratio of the likelihood of the fitted model to that of the saturated model) is at most one, so its natural logarithm is zero or negative; multiplying the logarithm by negative two produces a non-negative value with an approximate chi-square distribution.[30] Smaller values indicate better fit as the fitted model deviates less from the saturated model. When assessed upon a chi-square distribution, nonsignificant chi-square values indicate very little unexplained variance and thus, good model fit. Conversely, a significant chi-square value indicates that a significant amount of the variance is unexplained. Two measures of deviance are particularly important in logistic regression: null deviance and model deviance. The null deviance represents the difference between a model with only the intercept and no predictors and the saturated model, and the model deviance represents the difference between a model with at least one predictor and the saturated model.[31] In this respect, the null model provides a baseline upon which to compare predictor models. Therefore, to assess the contribution of a predictor or set of predictors, one can subtract the model deviance from the null deviance and assess the difference on a chi-square distribution with degrees of freedom equal to the number of predictors added (one degree of freedom for a single predictor).[32] If the model deviance is significantly smaller than the null deviance then one can conclude that the predictor or set of predictors significantly improved model fit. This is analogous to the F-test used in linear regression analysis to assess the significance of prediction.[33]
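
The comparison of null and model deviance can be sketched as follows. The example rests on several assumptions: the data are made up; the outcomes are individual 0/1 observations, for which the saturated model has a likelihood of 1 so that the deviance reduces to −2 times the log-likelihood; and the single predictor is binary, in which case the maximum likelihood fitted probabilities are simply the group proportions.

```python
import math

def log_likelihood(ys, ps):
    """Bernoulli log-likelihood of outcomes ys under fitted probabilities ps."""
    return sum(y * math.log(p) + (1 - y) * math.log(1.0 - p) for y, p in zip(ys, ps))

def deviance(ys, ps):
    """D = -2 ln(L_fitted / L_saturated), with L_saturated = 1 for 0/1 data."""
    return -2.0 * log_likelihood(ys, ps)

def chi2_upper_tail_1df(x):
    """P(chi-square with 1 degree of freedom > x), via the normal distribution."""
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical data: 10 subjects with x = 0 (2 cases), 10 with x = 1 (7 cases).
ys = [1] * 2 + [0] * 8 + [1] * 7 + [0] * 3
null_ps = [9 / 20] * 20             # intercept only: overall proportion of cases
model_ps = [0.2] * 10 + [0.7] * 10  # one binary predictor: group proportions

null_deviance = deviance(ys, null_ps)
model_deviance = deviance(ys, model_ps)
lr_statistic = null_deviance - model_deviance  # one predictor added: 1 df
p_value = chi2_upper_tail_1df(lr_statistic)
print(null_deviance, model_deviance, lr_statistic, p_value)
```

With more than one predictor added, the same difference in deviance would be referred to a chi-square distribution with correspondingly more degrees of freedom, which this one-degree-of-freedom helper does not cover.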

Pseudo-R2s

In linear regression the squared multiple correlation, R2 is used to assess goodness of fit as it represents the proportion of variance in the criterion that is explained by the predictors.[34] In logistic regression analysis, there is no agreed upon analogous measure, but there are several competing measures each with limitations.[35] Three of the most commonly used indices are examined on this page beginning with the likelihood ratio R2, R2L:

$$R^2_L = \frac{D_\text{null} - D_\text{model}}{D_\text{null}}$$

[36]

This is the most analogous index to the squared multiple correlation in linear regression.[37] It represents the proportional reduction in the deviance wherein the deviance is treated as a measure of variation analogous but not identical to the variance in linear regression analysis.[38] One limitation of the likelihood ratio R2 is that it is not monotonically related to the odds ratio,[39] meaning that it does not necessarily increase as the odds ratio increases and does not necessarily decrease as the odds ratio decreases.

The Cox and Snell R2 is an alternative index of goodness of fit related to the R2 value from linear regression. The Cox and Snell index is problematic as its maximum value is .75, when the variance is at its maximum (.25). The Nagelkerke R2 provides a correction to the Cox and Snell R2 so that the maximum value is equal to one. Nevertheless, the Cox and Snell and likelihood ratio R2s show greater agreement with each other than either does with the Nagelkerke R2.[40] Of course, this might not be the case for values exceeding .75 as the Cox and Snell index is capped at this value. The likelihood ratio R2 is often preferred to the alternatives as it is most analogous to R2 in linear regression, is independent of the base rate (both Cox and Snell and Nagelkerke R2s increase as the proportion of cases increases from 0 to .5) and varies between 0 and 1.

A word of caution is in order when interpreting pseudo-R2 statistics. The reason these indices of fit are referred to as pseudo R2 is because they do not represent the proportionate reduction in error as the R2 in linear regression does.[41] Linear regression assumes homoscedasticity, that the error variance is the same for all values of the criterion. Logistic regression will always be heteroscedastic - the error variances differ for each value of the predicted score. For each value of the predicted score there would be a different value of the proportionate reduction in error. Therefore, it is inappropriate to think of R2 as a proportionate reduction in error in a universal sense in logistic regression.[42]
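
The three indices can be computed from the null and model deviances. The sketch below uses the usual textbook definitions of the Cox and Snell and Nagelkerke indices, which are not written out in this article, and assumes individual 0/1 outcomes so that each deviance equals −2 times the corresponding log-likelihood. The function name and input values are illustrative.

```python
import math

def pseudo_r2(null_deviance, model_deviance, n):
    """Return (likelihood ratio R2, Cox and Snell R2, Nagelkerke R2).

    Assumes deviance = -2 * log-likelihood (individual 0/1 outcomes)
    and n observations.
    """
    r2_l = (null_deviance - model_deviance) / null_deviance
    # Cox and Snell: 1 - (L_null / L_model) ** (2 / n)
    r2_cs = 1.0 - math.exp((model_deviance - null_deviance) / n)
    # Largest value Cox and Snell can reach: 1 - L_null ** (2 / n)
    r2_cs_max = 1.0 - math.exp(-null_deviance / n)
    r2_n = r2_cs / r2_cs_max
    return r2_l, r2_cs, r2_n

# Hypothetical deviances for a sample of 20 observations.
print(pseudo_r2(27.53, 22.23, 20))

# With half of the observations being cases, the null deviance is 2 * n * ln(2),
# and a perfectly fitting model gives a Cox and Snell value of .75.
n = 100
print(pseudo_r2(2 * n * math.log(2), 0.0, n))
```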

Coefficients

After fitting the model, it is likely that researchers will want to examine the contribution of individual predictors. To do so, they will want to examine the regression coefficients. In linear regression, the regression coefficients represent the change in the criterion for each unit change in the predictor.[43] In logistic regression, however, the regression coefficients represent the rate of change in the logit for each unit change in the predictor. Given that the logit is not intuitive, researchers are likely to focus on a predictor's effect on the exponential function of the regression coefficient - the odds ratio (see definition). In linear regression, the significance of a regression coefficient is assessed by computing a t-test. In logistic regression, there are a couple of different tests designed to assess the significance of an individual predictor, most notably, the likelihood ratio test and the Wald statistic.

Likelihood Ratio Test

The likelihood ratio test discussed above to assess model fit is also the recommended procedure to assess the contribution of individual predictors to a given model.[44][45][46] In the case of a single predictor model, one simply compares the deviance of the predictor model with that of the null model on a chi-square distribution with a single degree of freedom. If the predictor model has a significantly smaller deviance, then one can conclude that the predictor significantly predicts the criterion. Given that some common statistical packages (e.g., SAS, SPSS) do not provide likelihood ratio test statistics, it can be more difficult to assess the contribution of individual predictors in the multiple logistic regression case. To assess the contribution of individual predictors one can enter the predictors hierarchically, comparing each new model with the previous to determine the contribution of each predictor.[47]

Wald Statistic

Alternatively, when assessing the contribution of individual predictors in a given model, one may examine the significance of the Wald statistic. The Wald statistic, analogous to the t-test in linear regression, is used to assess the significance of coefficients. The Wald statistic is the ratio of the square of the regression coefficient to the square of the standard error of the coefficient and is asymptotically distributed as a chi-square distribution.[48]

Although several statistical packages (e.g., SPSS, SAS) report the Wald statistic to assess the contribution of individual predictors, the Wald statistic is not without limitations. When the regression coefficient is large, the standard error of the regression coefficient also tends to be large, increasing the probability of Type-II error. The Wald statistic also tends to be biased when data are sparse.[49]
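
As a small illustration, the Wald statistic and its p-value on a chi-square distribution with one degree of freedom can be computed from a coefficient and its standard error. The function name and input values are hypothetical; in practice both numbers would come from a fitted model.

```python
import math

def wald_test(coefficient, standard_error):
    """Return (Wald statistic, p-value on a chi-square with 1 degree of freedom)."""
    wald = (coefficient ** 2) / (standard_error ** 2)
    # A chi-square variable with 1 df is a squared standard normal variable,
    # so its upper-tail probability is erfc(sqrt(wald / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value

# Hypothetical estimate: coefficient 0.8 with standard error 0.35.
print(wald_test(0.8, 0.35))
```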

Formal mathematical specification

There are various equivalent specifications of logistic regression, which fit into different types of more general models. These different specifications allow for different sorts of useful generalizations.

Setup

The basic setup of logistic regression is the same as for standard linear regression.

It is assumed that we have a series of N observed data points. Each data point i (ranging from 1 to N) consists of a set of m explanatory variables x1,i ... xm,i (aka independent variables, predictor variables, features, etc.), and an associated binary-valued outcome Yi (aka dependent variable, response variable), i.e. it can assume only the two possible values 0 (often meaning "no" or "failure") or 1 (often meaning "yes" or "success"). The explanatory variables and outcome typically represent observed properties of the data points. The goal of logistic regression is to explain the relationship between the explanatory variables and the outcome, so that the outcome can be correctly predicted for a new data point for which only the explanatory variables are available.

Some examples:

  • The observed outcomes are the presence or absence of a given disease (e.g. diabetes) in a set of patients, and the explanatory variables might be characteristics of the patients thought to be pertinent (sex, race, age, blood pressure, body-mass index, etc.).
  • The observed outcomes are the votes (e.g. Democratic or Republican) of a set of people in an election, and the explanatory variables are the demographic characteristics of each person (e.g. sex, race, age, income, etc.). In such a case, one of the two outcomes is arbitrarily coded as 1, and the other as 0.

As in linear regression, the outcomes Yi are assumed to be random variables, but the explanatory variables x1,i ... xm,i are not.

The explanatory variables

As shown in the examples above, the explanatory variables may be of any type: real-valued, binary, categorical, etc. The main distinction is between continuous variables (e.g. income, age, blood pressure, etc.) and discrete variables (e.g. sex, race, political party, etc.). Discrete variables referring to more than two possible choices are typically coded using dummy variables (or indicator variables), i.e. separate explanatory variables taking the value 0 or 1 are created for each possible value of the discrete variable, with a 1 meaning "variable does have the given value" and a 0 meaning "variable does not have the given value". For example, a four-way discrete variable of blood type with the possible values "A, B, AB, O" would be converted to four separate two-way dummy variables, "is-A, is-B, is-AB, is-O", where only one of them has the value 1 and all the rest have the value 0. This allows for separate regression coefficients to be matched for each possible value of the discrete variable. (Note that in a case like this, only three of the four dummy variables are independent of each other, in the sense that once the values of three of the variables are known, the fourth is automatically determined. Thus, it is only necessary to encode three of the four possibilities as dummy variables. This also means that when all four possibilities are encoded together with an intercept, the overall model is not identifiable in the absence of additional constraints such as a regularization constraint. In practice, the problem is avoided either by omitting one dummy variable as a reference category or by fitting the model with a regularization constraint.)
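
The blood-type example can be sketched in Python. The helper `dummy_code` is a hypothetical name introduced here; it produces either all four indicator columns or, when a reference category is named, only the three independent ones.

```python
def dummy_code(values, categories, reference=None):
    """Return (column_names, rows) of 0/1 indicators, one row per value.

    If `reference` is given, that category gets no column, so an
    observation in the reference category is coded as all zeros.
    """
    columns = [c for c in categories if c != reference]
    rows = [[1 if value == c else 0 for c in columns] for value in values]
    return columns, rows

blood_types = ["A", "O", "AB"]
categories = ["A", "B", "AB", "O"]

# Full coding: four columns, exactly one 1 per row.
print(dummy_code(blood_types, categories))

# Reference coding with "O" omitted: three columns, "O" is the all-zero row.
print(dummy_code(blood_types, categories, reference="O"))
```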

The outcomes

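Formally, the outcomes Yi are described as being Bernoulli-distributed data, where each outcome is determined by an unobserved probability pi that is specific to the outcome at hand, but related to the explanatory variables. This can be expressed in any of the following equivalent forms:

$$\begin{aligned}
Y_i \mid x_{1,i}, \ldots, x_{m,i} \ &\sim \operatorname{Bernoulli}(p_i) \\
\operatorname{E}[Y_i \mid x_{1,i}, \ldots, x_{m,i}] &= p_i \\
\Pr(Y_i = y_i \mid x_{1,i}, \ldots, x_{m,i}) &= \begin{cases} p_i & \text{if } y_i = 1 \\ 1 - p_i & \text{if } y_i = 0 \end{cases} \\
\Pr(Y_i = y_i \mid x_{1,i}, \ldots, x_{m,i}) &= p_i^{y_i} (1 - p_i)^{1 - y_i}
\end{aligned}$$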

The meanings of these four lines are:

  1. The first line expresses the probability distribution of each Yi: Conditioned on the explanatory variables, it follows a Bernoulli distribution parameterized by pi, the probability of the outcome of 1 ("success", "yes", etc.) for trial i. As noted above, each separate trial has its own probability of success, just as each trial has its own explanatory variables. The probability of success pi is not observed, only the outcome of an individual Bernoulli trial using that probability.
  2. The second line expresses the fact that the expected value of each Yi is equal to the probability of success pi, which is a general property of the Bernoulli distribution. In other words, if you were to run a large number of Bernoulli trials using the same probability of success pi, coding each success a 1 and each failure a 0 as is standard, and then take the average of all those 1's and 0's, the result you'd get would be close to pi. This is because doing an average this way simply computes the proportion of successes seen, which we expect to converge to the underlying probability of success.
  3. The third line writes out the probability mass function of the Bernoulli distribution, specifying the probability of seeing each possible outcome (there are only two).
  4. The fourth line is another way of writing the probability mass function, which avoids having to write out separate cases and is more convenient for certain types of calculations. This relies on the fact that Yi can take only the value 0 or 1. In each case, one of the exponents will be 1, "choosing" the value under it, while the other is 0, "canceling out" the value under it. Hence, the outcome is either pi or 1 - pi, as in the previous line.
Linear predictor function

The basic idea of logistic regression is to use the mechanism already developed for linear regression by modeling the probability pi using a linear predictor function, i.e. a linear combination of the explanatory variables and a set of regression coefficients that are specific to the model at hand but the same across all trials. The linear predictor function for a particular data point i is written as:

$$f(i) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i},$$

where $\beta_0, \ldots, \beta_m$ are regression coefficients indicating the relative effect of a particular explanatory variable on the outcome.

The model is usually put into a more compact form as follows:

  • The regression coefficients β0, β1, ..., βm are grouped into a single vector β of size m+1.
  • For each data point i, an additional explanatory pseudo-variable x0,i is added, with a fixed value of 1, corresponding to the intercept coefficient β0.
  • The resulting explanatory variables x0,i, x1,i, ..., xm,i are then grouped into a single vector Xi of size m+1.

This makes it possible to write the linear predictor function as follows:

$$f(i) = \boldsymbol\beta \cdot \mathbf{X}_i,$$

using the notation for a dot product between two vectors.
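
The compact form can be sketched in Python: prepend the constant pseudo-variable 1 to the explanatory variables and take a dot product with the coefficient vector. The function name and the numbers are illustrative.

```python
def linear_predictor(beta, x):
    """Dot product of beta = [b0, b1, ..., bm] with [1, x1, ..., xm]."""
    x_with_intercept = [1.0] + list(x)  # pseudo-variable x0 = 1 for the intercept
    if len(beta) != len(x_with_intercept):
        raise ValueError("beta must have one more entry than x")
    return sum(b * xi for b, xi in zip(beta, x_with_intercept))

# Hypothetical coefficients and one data point with two explanatory variables.
print(linear_predictor([1.0, 2.0, -1.0], [3.0, 4.0]))  # 1 + 2*3 - 1*4 = 3.0
```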

As a generalized linear model

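The particular model used by logistic regression, which distinguishes it from standard linear regression and from other types of regression analysis used for binary-valued outcomes, is the way the probability of a particular outcome is linked to the linear predictor function:

$$\operatorname{logit}(\operatorname{E}[Y_i \mid x_{1,i}, \ldots, x_{m,i}]) = \operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i}$$

Written using the more compact notation described above, this is:

$$\operatorname{logit}(\operatorname{E}[Y_i \mid \mathbf{X}_i]) = \operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \boldsymbol\beta \cdot \mathbf{X}_i$$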

This formulation expresses logistic regression as a type of generalized linear model, which predicts variables with various types of probability distributions by fitting a linear predictor function of the above form to some sort of arbitrary transformation of the expected value of the variable.

The intuition for transforming the probability using the logit function (the natural log of the odds) was explained above. It also has the practical effect of converting the probability (which is bounded to be between 0 and 1) to a variable that ranges over $(-\infty, +\infty)$, thereby matching the potential range of the linear prediction function on the right side of the equation.

Note that both the probabilities pi and the regression coefficients are unobserved, and the means of determining them is not part of the model itself. They are typically determined by some sort of optimization procedure, e.g. maximum likelihood estimation, that finds values that best fit the observed data (i.e. that give the most accurate predictions for the data already observed), usually subject to regularization conditions that seek to exclude unlikely values, e.g. extremely large values for any of the regression coefficients. The use of a regularization condition is equivalent to doing maximum a posteriori (MAP) estimation, an extension of maximum likelihood. (Regularization is most commonly done using a squared regularizing function, which is equivalent to placing a zero-mean Gaussian prior distribution on the coefficients, but other regularizers are also possible.) Whether or not regularization is used, it is usually not possible to find a closed-form solution; instead, an iterative numerical method must be used, such as iteratively reweighted least squares (IRLS) or, more commonly these days, a quasi-Newton method such as the L-BFGS method.

The interpretation of the βj parameter estimates is as the additive effect on the log of the odds for a unit change in the jth explanatory variable. In the case of a dichotomous explanatory variable, for instance gender, $e^{\beta_j}$ is the estimate of the odds ratio of having the outcome for, say, males compared with females.

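An equivalent formula uses the inverse of the logit function, which is the logistic function, i.e.:

$$\operatorname{E}[Y_i \mid \mathbf{X}_i] = p_i = \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) = \frac{1}{1 + e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}$$

The formula can also be written (somewhat awkwardly) as a probability distribution (specifically, using a probability mass function):

$$\Pr(Y_i = y_i \mid \mathbf{X}_i) = p_i^{y_i} (1 - p_i)^{1 - y_i} = \left(\frac{1}{1 + e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{y_i} \left(1 - \frac{1}{1 + e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{1 - y_i}$$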

As a latent-variable model

The above model has an equivalent formulation as a latent-variable model. This formulation is common in the theory of discrete choice models, and makes it easier to extend to certain more complicated models with multiple, correlated choices, as well as to compare logistic regression to the closely related probit model.

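Imagine that, for each trial i, there is a continuous latent variable Yi* (i.e. an unobserved random variable) that is distributed as follows:

$$Y_i^\ast = \boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon,$$

where

$$\varepsilon \sim \operatorname{Logistic}(0, 1),$$

i.e. the latent variable can be written directly in terms of the linear predictor function and an additive random error variable that is distributed according to a standard logistic distribution.

Then Yi can be viewed as an indicator for whether this latent variable is positive:

$$Y_i = \begin{cases} 1 & \text{if } Y_i^\ast > 0, \text{ i.e. } -\varepsilon < \boldsymbol\beta \cdot \mathbf{X}_i, \\ 0 & \text{otherwise.} \end{cases}$$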

The choice of modeling the error variable specifically with a standard logistic distribution, rather than a general logistic distribution with the location and scale set to arbitrary values, seems restrictive, but in fact it is not. It must be kept in mind that we can choose the regression coefficients ourselves, and very often can use them to offset changes in the parameters of the error variable's distribution. For example, a logistic error-variable distribution with a non-zero location parameter μ (which sets the mean) is equivalent to a distribution with a zero location parameter, where μ has been added to the intercept coefficient. Both situations produce the same value for Yi* regardless of settings of explanatory variables. Similarly, an arbitrary scale parameter s is equivalent to setting the scale parameter to 1 and then dividing all regression coefficients by s. In the latter case, the resulting value of Yi* will be smaller by a factor of s than in the former case, for all sets of explanatory variables — but critically, it will always remain on the same side of 0, and hence lead to the same Yi choice.

(Note that this predicts that the irrelevancy of the scale parameter may not carry over into more complex models where more than two choices are available.)

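It turns out that this formulation is exactly equivalent to the preceding one, phrased in terms of the generalized linear model and without any latent variables. This can be shown as follows, using the fact that the cumulative distribution function (CDF) of the standard logistic distribution is the logistic function, which is the inverse of the logit function, i.e.

$$\Pr(\varepsilon < x) = \operatorname{logit}^{-1}(x) = \frac{1}{1 + e^{-x}}$$

Then:

$$\begin{aligned}
\Pr(Y_i = 1 \mid \mathbf{X}_i) &= \Pr(Y_i^\ast > 0 \mid \mathbf{X}_i) \\
&= \Pr(\boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon > 0) \\
&= \Pr(\varepsilon > -\boldsymbol\beta \cdot \mathbf{X}_i) \\
&= \Pr(\varepsilon < \boldsymbol\beta \cdot \mathbf{X}_i) && \text{(because the logistic distribution is symmetric)} \\
&= \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) \\
&= p_i
\end{aligned}$$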

This formulation — which is standard in discrete choice models — makes clear the relationship between logistic regression (the "logit model") and the probit model, which uses an error variable distributed according to a standard normal distribution instead of a standard logistic distribution. Both the logistic and normal distributions are symmetric with a basic unimodal, "bell curve" shape. The only difference is that the logistic distribution has somewhat heavier tails, which means that it is less sensitive to outlying data (and hence somewhat more robust to model mis-specifications or erroneous data).
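
The latent-variable formulation can be checked by simulation. The sketch below assumes a fixed, made-up value of the linear predictor, draws standard logistic errors by inverse-CDF sampling, and compares the share of positive latent values with the logistic function of the linear predictor. The `scale` argument multiplies the whole latent variable by a positive constant, which never changes its sign and therefore never changes the outcome.

```python
import math
import random

def logistic(z):
    """CDF of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-z))

def sample_standard_logistic(rng):
    """Inverse-CDF sampling: ln(u / (1 - u)) with u uniform on (0, 1)."""
    u = rng.random()
    while u == 0.0:  # random() can return 0.0, where the log is undefined
        u = rng.random()
    return math.log(u / (1.0 - u))

def simulated_success_rate(linear_predictor, scale=1.0, n_draws=100_000, seed=0):
    """Share of draws in which scale * (linear_predictor + error) is positive."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        y_star = scale * (linear_predictor + sample_standard_logistic(rng))
        if y_star > 0:
            hits += 1
    return hits / n_draws

# Hypothetical linear predictor value of 0.7.
print(simulated_success_rate(0.7), logistic(0.7))
```

The simulated rate should be close to, but not exactly equal to, the logistic function value, since it is a Monte Carlo estimate.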

As a two-way latent-variable model

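Yet another formulation uses two separate latent variables:

$$\begin{aligned}
Y_i^{0\ast} &= \boldsymbol\beta_0 \cdot \mathbf{X}_i + \varepsilon_0 \\
Y_i^{1\ast} &= \boldsymbol\beta_1 \cdot \mathbf{X}_i + \varepsilon_1
\end{aligned}$$

where

$$\varepsilon_0 \sim \operatorname{EV}_1(0, 1), \qquad \varepsilon_1 \sim \operatorname{EV}_1(0, 1),$$

where EV1(0,1) is a standard type-1 extreme value distribution: i.e. each error variable has the probability density function

$$f(x) = e^{-x} e^{-e^{-x}}$$

Then

$$Y_i = \begin{cases} 1 & \text{if } Y_i^{1\ast} > Y_i^{0\ast}, \\ 0 & \text{otherwise.} \end{cases}$$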

This model has a separate latent variable and a separate set of regression coefficients for each possible outcome of the dependent variable. The reason for this separation is that it makes it easy to extend logistic regression to multi-outcome categorical variables, as in the multinomial logit model. In such a model, it is natural to model each possible outcome using a different set of regression coefficients. It is also possible to motivate each of the separate latent variables as the theoretical utility associated with making the associated choice, and thus motivate logistic regression in terms of utility theory. (In terms of utility theory, a rational actor always chooses the choice with the greatest associated utility.) This is in fact the approach taken by economists when formulating discrete choice models, because it both provides a theoretically strong foundation and facilitates intuitions about the model, which in turn makes it easy to consider various sorts of extensions. (See the example below.)

The choice of the type-1 extreme value distribution seems fairly arbitrary, but it makes the mathematics work out, and it may be possible to justify its use through rational choice theory.

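It turns out that this model is equivalent to the previous model, although this seems non-obvious, since there are now two sets of regression coefficients and error variables, and the error variables have a different distribution. In fact, this model reduces directly to the previous one with the following substitutions:

$$\boldsymbol\beta = \boldsymbol\beta_1 - \boldsymbol\beta_0, \qquad \varepsilon = \varepsilon_1 - \varepsilon_0$$

An intuition for this comes from the fact that, since we choose based on the maximum of two values, only their difference matters, not the exact values, and this effectively removes one degree of freedom. Another critical fact is that the difference of two type-1 extreme-value-distributed variables follows a logistic distribution, i.e.

$$\varepsilon = \varepsilon_1 - \varepsilon_0 \sim \operatorname{Logistic}(0, 1).$$

We can demonstrate the equivalence as follows:

$$\begin{aligned}
\Pr(Y_i = 1 \mid \mathbf{X}_i) &= \Pr(Y_i^{1\ast} > Y_i^{0\ast} \mid \mathbf{X}_i) \\
&= \Pr(Y_i^{1\ast} - Y_i^{0\ast} > 0 \mid \mathbf{X}_i) \\
&= \Pr(\boldsymbol\beta_1 \cdot \mathbf{X}_i + \varepsilon_1 - (\boldsymbol\beta_0 \cdot \mathbf{X}_i + \varepsilon_0) > 0) \\
&= \Pr((\boldsymbol\beta_1 - \boldsymbol\beta_0) \cdot \mathbf{X}_i + (\varepsilon_1 - \varepsilon_0) > 0) \\
&= \Pr(\boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon > 0) \\
&= \Pr(\varepsilon > -\boldsymbol\beta \cdot \mathbf{X}_i) \\
&= \Pr(\varepsilon < \boldsymbol\beta \cdot \mathbf{X}_i) \\
&= \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) = p_i
\end{aligned}$$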
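
The same equivalence can be checked numerically. The sketch below assumes two made-up systematic utilities (standing in for the two linear predictors), adds independent standard type-1 extreme value errors drawn by inverse-CDF sampling, and compares how often outcome 1 has the larger latent value with the logistic function of the difference in utilities.

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_standard_gumbel(rng):
    """Standard type-1 extreme value draw: -ln(-ln(u)) with u uniform on (0, 1)."""
    u = rng.random()
    while u == 0.0:  # random() can return 0.0, where ln(u) is undefined
        u = rng.random()
    return -math.log(-math.log(u))

def share_choosing_outcome_1(utility_0, utility_1, n_draws=100_000, seed=0):
    """Share of draws in which the latent variable for outcome 1 is the larger one."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        y0_star = utility_0 + sample_standard_gumbel(rng)
        y1_star = utility_1 + sample_standard_gumbel(rng)
        if y1_star > y0_star:
            hits += 1
    return hits / n_draws

# Hypothetical utilities: only their difference (0.8) should matter.
print(share_choosing_outcome_1(0.2, 1.0), logistic(1.0 - 0.2))
```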

Example

As an example, consider a province-level election where the choice is between a right-of-center party, a left-of-center party, and a secessionist party (e.g. the Parti Québécois, which wants Quebec to secede from Canada), whose primary platform is one of secession and which has no strong views on other issues. We would then use three latent variables, one for each choice. In accordance with utility theory, we can interpret the latent variables as expressing the utility that results from making each of the choices. We can also interpret the regression coefficients as indicating the strength that the associated factor (i.e. explanatory variable) has in contributing to the utility or, more correctly, the amount by which a unit change in an explanatory variable changes the utility of a given choice. A voter might expect that the right-of-center party would lower taxes, especially on rich people. This would give low-income people no benefit, i.e. no change in utility (since they usually don't pay taxes); would cause moderate benefit (i.e. somewhat more money, or moderate utility increase) for middle-income people; and would cause significant benefits for high-income people. On the other hand, the left-of-center party might be expected to raise taxes and offset it with increased welfare and other assistance for the lower and middle classes. This would cause significant positive benefit to low-income people, perhaps weak benefit to middle-income people, and significant negative benefit to high-income people. Finally, the secessionist party would take no direct actions on the economy, but simply secede. A low-income or middle-income voter might expect basically no clear utility gain or loss from this, but a high-income voter might expect negative utility, since he or she is likely to own companies, which will have a harder time doing business in such an environment and probably lose money.

These intuitions can be expressed as follows:

Estimated strength of regression coefficient for different outcomes (party choices) and different values of explanatory variables

                 Center-right    Center-left    Secessionist
High-income      strong +        strong -       strong -
Middle-income    moderate +      weak +         none
Low-income       none            strong +       none

This clearly shows that

  1. Separate sets of regression coefficients need to exist for each choice. When phrased in terms of utility, this can be seen very easily. Different choices have different effects on net utility; furthermore, the effects vary in complex ways that depend on the characteristics of each individual, so there need to be separate sets of coefficients for each characteristic, not simply a single extra per-choice characteristic.
  2. Even though income is a continuous variable, its effect on utility is too complex for it to be treated as a single variable. Either it needs to be directly split up into ranges, or higher powers of income need to be added so that polynomial regression on income is effectively done.

As a "log-linear" model

Yet another formulation combines the two-way latent variable formulation above with the original formulation higher up without latent variables, and in the process provides a link to one of the standard formulations of the multinomial logit.

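Here, instead of writing the logit of the probabilities pi as a linear predictor, we separate the linear predictor into two, one for each of the two outcomes:

$$\begin{aligned}
\ln \Pr(Y_i = 0) &= \boldsymbol\beta_0 \cdot \mathbf{X}_i - \ln Z \\
\ln \Pr(Y_i = 1) &= \boldsymbol\beta_1 \cdot \mathbf{X}_i - \ln Z
\end{aligned}$$

Note that two separate sets of regression coefficients have been introduced, just as in the two-way latent variable model, and the two equations appear in a form that writes the logarithm of the associated probability as a linear predictor, with an extra term $-\ln Z$ at the end. This term, as it turns out, serves as the normalizing factor ensuring that the result is a distribution. This can be seen by exponentiating both sides:

$$\begin{aligned}
\Pr(Y_i = 0) &= \frac{1}{Z} e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} \\
\Pr(Y_i = 1) &= \frac{1}{Z} e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}
\end{aligned}$$

In this form it is clear that the purpose of Z is to ensure that the resulting distribution over Yi is in fact a probability distribution, i.e. it sums to 1. This means that Z is simply the sum of all un-normalized probabilities, and by dividing each probability by Z, the probabilities become "normalized". That is:

$$Z = e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}$$

and the resulting equations are

$$\begin{aligned}
\Pr(Y_i = 0) &= \frac{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} \\
\Pr(Y_i = 1) &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}
\end{aligned}$$

Or generally:

$$\Pr(Y_i = c) = \frac{e^{\boldsymbol\beta_c \cdot \mathbf{X}_i}}{\sum_h e^{\boldsymbol\beta_h \cdot \mathbf{X}_i}}$$

This shows clearly how to generalize this formulation to more than two outcomes, as in multinomial logit.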

Now, how can we prove that this is equivalent to the previous model? Keep in mind that the above model is overspecified, in that Pr(Yi = 0) and Pr(Yi = 1) cannot be independently specified: rather Pr(Yi = 0) + Pr(Yi = 1) = 1, so knowing one automatically determines the other. As a result, the model is nonidentifiable, in that multiple combinations of β0 and β1 will produce the same probabilities for all possible explanatory variables. In fact, it can be seen that adding any constant vector to both of them will produce the same probabilities:

As a result, we can simplify matters, and restore identifiability, by picking an arbitrary value for one of the two vectors. We choose to set β0 = 0. Then,

and so

which shows that this formulation is indeed equivalent to the previous formulation. (As in the two-way latent variable formulation, any settings where β = β1 − β0 will produce equivalent results.)
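The equivalence argued above can be checked numerically. The following is a minimal sketch, not part of the original derivation: the coefficient vectors, the explanatory vector, and the function names are all made up for illustration. It normalizes the exponentiated linear predictors by their sum Z, confirms that adding the same constant vector to both coefficient vectors leaves the probabilities unchanged, and confirms that the probability of outcome 1 matches the ordinary logistic model with β = β1 − β0.

```python
import math

def log_linear_probs(betas, x):
    """Normalized outcome probabilities: exp(beta_k . x) / Z for each outcome k."""
    scores = [sum(b * v for b, v in zip(beta, x)) for beta in betas]
    shift = max(scores)  # subtracting the max avoids overflow and cancels in the ratio
    unnormalized = [math.exp(s - shift) for s in scores]
    z = sum(unnormalized)  # normalizing factor (Z up to the common factor exp(shift))
    return [u / z for u in unnormalized]

def logistic_prob(beta, x):
    """Pr(Y = 1) under the formulation with a single coefficient vector."""
    return 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))

# All numbers below are hypothetical illustration values.
x = [1.0, 2.5, -0.7]       # explanatory vector (first entry plays the role of the intercept)
beta0 = [0.3, -1.2, 0.8]   # coefficients for outcome 0
beta1 = [-0.5, 0.4, 1.1]   # coefficients for outcome 1
c = [2.0, -3.0, 0.25]      # arbitrary constant vector added to both

p = log_linear_probs([beta0, beta1], x)
p_shifted = log_linear_probs(
    [[b + k for b, k in zip(beta0, c)], [b + k for b, k in zip(beta1, c)]], x
)
beta_diff = [b1 - b0 for b0, b1 in zip(beta0, beta1)]
p_logistic = logistic_prob(beta_diff, x)
```

Because `log_linear_probs` accepts any number of coefficient vectors, the same function also covers the multi-outcome (multinomial logit) case mentioned above.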

Note that most treatments of the multinomial logit model start out either by extending the "log-linear" formulation presented here or the two-way latent variable formulation presented above, since both clearly show the way that the model could be extended to multi-way outcomes. In general, the presentation with latent variables is more common in econometrics and political science, where discrete choice models and utility theory reign, while the "log-linear" formulation here is more common in computer science, e.g. machine learning and natural language processing.

As a single-layer perceptron

The model has an equivalent formulation: pi = 1 / (1 + e^−(β0 + β1 x1,i + ⋯ + βk xk,i)).

This functional form is commonly called a single-layer perceptron or single-layer artificial neural network. A single-layer neural network computes a continuous output instead of a step function. The derivative of pi with respect to X = x1...xk is computed from the general form:

where f(X) is an analytic function in X. With this choice, the single-layer neural network is identical to the logistic regression model. This function has a continuous derivative, which allows it to be used in backpropagation. This function is also preferred because its derivative is easily calculated:
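For the logistic function y = 1 / (1 + e^−t), the derivative with respect to t is y(1 − y), which is the property that makes it convenient for backpropagation. The sketch below (function names are illustrative, not taken from any library) compares that closed form against a central finite difference at a few arbitrary points.

```python
import math

def logistic(t):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-t))

def logistic_derivative(t):
    """Closed-form derivative: y * (1 - y), where y = logistic(t)."""
    y = logistic(t)
    return y * (1.0 - y)

def numerical_derivative(func, t, h=1e-6):
    """Central finite-difference approximation of func'(t)."""
    return (func(t + h) - func(t - h)) / (2.0 * h)

points = [-3.0, -0.5, 0.0, 1.2, 4.0]  # arbitrary test points
max_gap = max(
    abs(logistic_derivative(t) - numerical_derivative(logistic, t)) for t in points
)
```

When the argument is itself a function f(X) of the inputs, the chain rule gives the derivative with respect to X as y(1 − y) times the derivative of f.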

In terms of binomial data

A closely related model assumes that each i is associated not with a single Bernoulli trial but with ni independent identically distributed trials, where the observation Yi is the number of successes observed (the sum of the individual Bernoulli-distributed random variables), and hence follows a binomial distribution:

An example of this distribution is the fraction of seeds (pi) that germinate after ni are planted.

In terms of expected values, this model is expressed as follows:

so that

Or equivalently:

This model can be fit using the same sorts of methods as the above more basic model.
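One way to see why the same fitting methods apply: the binomial log-likelihood of the grouped counts differs from the Bernoulli log-likelihood of the individual trials only by the logarithms of the binomial coefficients, which do not depend on the regression coefficients. The sketch below illustrates this with made-up seed-germination counts; the data, coefficient values, and function names are assumptions for illustration only.

```python
import math

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def linear_predictor(beta, x):
    return sum(b * v for b, v in zip(beta, x))

def binomial_loglik(beta, groups):
    """Log-likelihood of grouped data: each group is (x, n trials, y successes)."""
    total = 0.0
    for x, n, y in groups:
        p = logistic(linear_predictor(beta, x))
        total += math.log(math.comb(n, y)) + y * math.log(p) + (n - y) * math.log(1.0 - p)
    return total

def bernoulli_loglik(beta, records):
    """Log-likelihood of the same data expanded into individual 0/1 trials."""
    total = 0.0
    for x, outcome in records:
        p = logistic(linear_predictor(beta, x))
        total += math.log(p) if outcome == 1 else math.log(1.0 - p)
    return total

# Hypothetical data: (covariates incl. intercept, seeds planted, seeds germinated).
groups = [([1.0, 0.2], 20, 6), ([1.0, 0.5], 25, 14), ([1.0, 0.9], 30, 24)]

# Expand each group into one record per seed.
records = []
for x, n, y in groups:
    records += [(x, 1)] * y + [(x, 0)] * (n - y)

# The gap between the two log-likelihoods should equal this beta-free constant.
constant = sum(math.log(math.comb(n, y)) for _, n, y in groups)
beta_a = [-1.0, 2.0]   # two arbitrary coefficient settings
beta_b = [0.4, -0.3]
gap_a = binomial_loglik(beta_a, groups) - bernoulli_loglik(beta_a, records)
gap_b = binomial_loglik(beta_b, groups) - bernoulli_loglik(beta_b, records)
```

Since the gap is the same for every coefficient vector, maximizing either log-likelihood yields the same estimates.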

Bayesian logistic regression

Comparison of logistic function with a scaled inverse probit function (i.e. the CDF of the normal distribution), comparing σ(x) vs. Φ(√(π/8) x), which makes the slopes the same at the origin. This shows the heavier tails of the logistic distribution.

In a Bayesian statistics context, prior distributions are normally placed on the regression coefficients, usually in the form of Gaussian distributions. Unfortunately, the Gaussian distribution is not the conjugate prior of the likelihood function in logistic regression; in fact, the likelihood function is not an exponential family and thus does not have a conjugate prior at all. As a result, the posterior distribution is difficult to calculate, even using standard simulation algorithms (e.g. Gibbs sampling).

There are various possibilities:

  • Don't do a proper Bayesian analysis, but simply compute a maximum a posteriori point estimate of the parameters. This is common, for example, in "maximum entropy" classifiers in machine learning.
  • Use a more general approximation method such as Metropolis-Hastings.
  • Use a latent variable model and approximate the logistic distribution using a more tractable distribution, e.g. a Student's t-distribution or a mixture of normal distributions.
  • Do probit regression instead of logistic regression. This is actually a special case of the previous situation, using a normal distribution in place of a Student's t, mixture of normals, etc. This will be less accurate but has the advantage that probit regression is extremely common, and a ready-made Bayesian implementation may already be available.
  • Use the Laplace approximation of the posterior distribution. This approximates the posterior with a Gaussian distribution. This is not a terribly good approximation, but it suffices if all that is desired is an estimate of the posterior mean and variance. In such a case, an approximation scheme such as variational Bayes can be used.
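As a rough illustration of the first and last options, the sketch below finds the maximum a posteriori estimate for a one-coefficient model (no intercept) with a zero-mean Gaussian prior, using Newton's method, and then forms the Laplace approximation by taking the negative inverse of the second derivative of the log-posterior at the mode as the approximate posterior variance. The data, the prior variance, and all function names are assumptions chosen for illustration; a real analysis would normally use several coefficients and an established library.

```python
import math

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def gradient(beta, xs, ys, prior_var):
    """First derivative of the log-posterior (log-likelihood plus Gaussian log-prior)."""
    return sum(x * (y - logistic(beta * x)) for x, y in zip(xs, ys)) - beta / prior_var

def hessian(beta, xs, prior_var):
    """Second derivative of the log-posterior; always negative here."""
    curvature = sum(x * x * logistic(beta * x) * (1.0 - logistic(beta * x)) for x in xs)
    return -curvature - 1.0 / prior_var

def map_with_laplace(xs, ys, prior_var, iterations=50):
    """Return (MAP estimate, Laplace-approximation variance) for a scalar coefficient."""
    beta = 0.0
    for _ in range(iterations):
        beta -= gradient(beta, xs, ys, prior_var) / hessian(beta, xs, prior_var)
    return beta, -1.0 / hessian(beta, xs, prior_var)

# Hypothetical, perfectly separable data: the maximum-likelihood estimate would
# diverge, but the prior keeps the MAP estimate finite.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
prior_var = 4.0

beta_map, laplace_var = map_with_laplace(xs, ys, prior_var)
```

The pair `(beta_map, laplace_var)` defines the Gaussian that the Laplace approximation substitutes for the true posterior.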

Gibbs sampling with an approximating distribution

As shown above, logistic regression is equivalent to a latent variable model with an error variable distributed according to a standard logistic distribution. The overall distribution of the latent variable is also a logistic distribution, with the mean equal to β · Xi (i.e. the fixed quantity added to the error variable). This model considerably simplifies the application of techniques such as Gibbs sampling. However, sampling the regression coefficients is still difficult, because of the lack of conjugacy between the normal and logistic distributions. Changing the prior distribution over the regression coefficients is of no help, because the logistic distribution is not in the exponential family and thus has no conjugate prior.

One possibility is to use a more general Markov chain Monte Carlo technique, such as Metropolis-Hastings, which can sample arbitrary distributions. Another possibility, however, is to replace the logistic distribution with a similar-shaped distribution that is easier to work with using Gibbs sampling. In fact, the logistic and normal distributions have a similar shape, and thus one possibility is simply to have normally-distributed errors. Because the normal distribution is conjugate to itself, sampling the regression coefficients becomes easy. In fact, this model is exactly the model used in probit regression.

However, the normal and logistic distributions differ in that the logistic has heavier tails. As a result, logistic regression is more robust to inaccuracies in the underlying model (which are inevitable, in that the model is essentially always an approximation) or to errors in the data. Probit regression loses some of this robustness.

Another alternative is to use errors distributed as a Student's t-distribution. The Student's t-distribution has heavy tails, and is easy to sample from because it is the compound distribution of a normal distribution with variance distributed as an inverse gamma distribution. In other words, if a normal distribution is used for the error variable, and another latent variable, following an inverse gamma distribution, is added corresponding to the variance of this error variable, the marginal distribution of the error variable will follow a Student's t-distribution. Because of the various conjugacy relationships, all variables in this model are easy to sample from.
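The compound construction described above can be sketched with the standard library alone. In this illustrative snippet (the function name and parameter values are assumptions), the variance is drawn from an inverse gamma distribution by taking the reciprocal of a gamma draw, and the error is then drawn from a zero-mean normal distribution with that variance. The sample variance is compared with the theoretical variance of the scaled Student's t-distribution, s²ν/(ν − 2).

```python
import random
import statistics

def sample_student_t(nu, scale, rng):
    """Draw from a Student's t (nu degrees of freedom, given scale) as a normal scale mixture."""
    # precision ~ Gamma(shape = nu/2, rate = nu * scale^2 / 2);
    # gammavariate takes shape and *scale* (the reciprocal of the rate).
    precision = rng.gammavariate(nu / 2.0, 2.0 / (nu * scale * scale))
    variance = 1.0 / precision  # hence variance ~ Inverse-Gamma(nu/2, nu * scale^2 / 2)
    return rng.gauss(0.0, variance ** 0.5)

rng = random.Random(0)   # fixed seed so the run is repeatable
nu, scale = 9.0, 1.0     # illustrative parameter values
draws = [sample_student_t(nu, scale, rng) for _ in range(100_000)]

sample_variance = statistics.pvariance(draws)
theoretical_variance = scale * scale * nu / (nu - 2.0)
```

In a Gibbs sampler the same two conditional draws would be interleaved with updates of the regression coefficients, which is what makes every step a draw from a standard distribution.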

The Student's t-distribution that best approximates a standard logistic distribution can be determined by matching the moments of the two distributions. The Student's t-distribution has three parameters, and since the skewness of both distributions is always 0, the first four moments can all be matched, using the following equations:

This yields the following values:

The following graphs compare the standard logistic distribution with the Student's t-distribution that matches the first four moments using the above-determined values, as well as the normal distribution that matches the first two moments. Note how much closer the Student's t-distribution agrees, especially in the tails. Beyond about two standard deviations from the mean, the logistic and normal distributions diverge rapidly, but the logistic and Student's t-distributions don't start diverging significantly until more than 5 standard deviations away.
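The moment matching can be reproduced from standard facts about the two distributions, stated here as assumptions of the sketch: the standard logistic distribution has mean 0, variance π²/3 and excess kurtosis 6/5, while a Student's t-distribution with location μ, scale s and ν degrees of freedom has mean μ, variance s²ν/(ν − 2) for ν > 2, and excess kurtosis 6/(ν − 4) for ν > 4. Solving these for the three parameters:

```python
import math

# Moments of the standard logistic distribution.
logistic_mean = 0.0
logistic_variance = math.pi ** 2 / 3.0
logistic_excess_kurtosis = 6.0 / 5.0

# Match the mean: both distributions are symmetric, so the location is the mean.
mu = logistic_mean

# Match the excess kurtosis: 6 / (nu - 4) = 6/5, so nu = 9.
nu = 4.0 + 6.0 / logistic_excess_kurtosis

# Match the variance: s^2 * nu / (nu - 2) = pi^2 / 3, so s is approximately 1.60.
s = math.sqrt(logistic_variance * (nu - 2.0) / nu)
```

With these values the skewness (0 for both distributions) matches automatically, so all of the first four moments agree.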

(Another possibility, also amenable to Gibbs sampling, is to approximate the logistic distribution using a mixture density of normal distributions.)

Comparison of logistic and approximating distributions (t, normal).
Tails of distributions.
Further tails of distributions.
Extreme tails of distributions.

Extensions

There are many extensions:

  • Multinomial logistic regression (or multinomial logit) handles the case of a multi-way categorical dependent variable (with unordered values, also called "classification"). Note that the general case of having dependent variables with more than two values is termed polytomous regression.
  • Ordered logistic regression (or ordered logit) handles ordinal dependent variables (ordered values).
  • Mixed logit is an extension of multinomial logit that allows for correlations among the choices of the dependent variable.
  • An extension of the logistic model to sets of interdependent variables is the conditional random field.

Model accuracy

One way to test for errors in models created by stepwise regression is to assess the model against a set of data that was not used to create it, rather than relying on the model's F-statistic, significance, or multiple R.[50] This class of techniques is called cross-validation.

Accuracy is measured as correctly classified records in the holdout sample.[51] There are four possible classifications:

  1. prediction of 0 when the holdout sample has a 0 (True Negative/TN)
  2. prediction of 0 when the holdout sample has a 1 (False Negative/FN)
  3. prediction of 1 when the holdout sample has a 0 (False Positive/FP)
  4. prediction of 1 when the holdout sample has a 1 (True Positive/TP)

These classifications are used to measure precision and recall: precision = TP / (TP + FP) and recall = TP / (TP + FN).

The percentage of correctly classified observations in the holdout sample is referred to as the assessed model accuracy. Accuracy can additionally be expressed as the model's ability to correctly classify 0, or its ability to correctly classify 1, in the holdout dataset. The holdout model assessment method is particularly valuable when data are collected in different settings (e.g., at different times or places) or when models are assumed to be generalizable.
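The four classifications and the measures derived from them can be tallied as in the following sketch. The holdout labels and predictions are invented for illustration, and the function name is hypothetical; precision and recall use their usual definitions, TP / (TP + FP) and TP / (TP + FN).

```python
def confusion_counts(actual, predicted):
    """Count (TN, FN, FP, TP) for 0/1 holdout labels and 0/1 predictions."""
    tn = fn = fp = tp = 0
    for a, p in zip(actual, predicted):
        if p == 0 and a == 0:
            tn += 1   # predicted 0, holdout has 0
        elif p == 0 and a == 1:
            fn += 1   # predicted 0, holdout has 1
        elif p == 1 and a == 0:
            fp += 1   # predicted 1, holdout has 0
        else:
            tp += 1   # predicted 1, holdout has 1
    return tn, fn, fp, tp

# Hypothetical holdout sample and model predictions.
actual = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
predicted = [0, 0, 1, 1, 1, 1, 0, 1, 1, 0]

tn, fn, fp, tp = confusion_counts(actual, predicted)
precision = tp / (tp + fp)                  # share of predicted 1s that are truly 1
recall = tp / (tp + fn)                     # share of true 1s that were predicted 1
accuracy = (tp + tn) / (tn + fn + fp + tp)  # share of all records classified correctly
```

The ability to correctly classify 0 can be computed in the same way as `tn / (tn + fp)`.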

See also

References

  1. ^ Hosmer, David W.; Lemeshow, Stanley (2000). Applied logistic regression (2nd ed.). New York: Wiley. ISBN 0471356328.
  2. ^ Cohen, Jacob. Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum. ISBN 9780805822236.
  3. ^ Hosmer, David W.; Lemeshow, Stanley (2000). Applied logistic regression (2nd ed.). New York: Wiley. ISBN 0471356328.
  4. ^ Hosmer, David W.; Lemeshow, Stanley (2000). Applied logistic regression (2nd ed.). New York: Wiley. ISBN 0471356328.
  5. ^ Hosmer, David W.; Lemeshow, Stanley (2000). Applied logistic regression (2nd ed.). New York: Wiley. ISBN 0471356328.
  6. ^ Hosmer, David W.; Lemeshow, Stanley (2000). Applied logistic regression (2nd ed.). New York: Wiley. ISBN 0471356328.
  7. ^ Hosmer, David W.; Lemeshow, Stanley (2000). Applied logistic regression (2nd ed.). New York: Wiley. ISBN 0471356328.
  8. ^ Cohen, Jacob. Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum. ISBN 9780805822236.
  9. ^ Menard, Scott (2002). Applied logistic regression analysis (2nd ed.). Thousand Oaks, Calif.: Sage. ISBN 978-0761922087.
  10. ^ Cohen, Jacob. Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum. ISBN 9780805822236.
  11. ^ Cohen, Jacob. Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum. ISBN 0805822232.
  12. ^ Howell, David C. (2010). Statistical methods for psychology (7th ed.). Belmont, CA: Thomson Wadsworth. ISBN 9780495597841.
  13. ^ Hosmer, David W.; Lemeshow, Stanley (2000). Applied logistic regression (2nd ed.). New York: Wiley. ISBN 0471356328.
  14. ^ Hosmer, David W.; Lemeshow, Stanley (2000). Applied logistic regression (2nd ed.). New York: Wiley. ISBN 0471356328.
  15. ^ Hosmer, David W.; Lemeshow, Stanley (2000). Applied logistic regression (2nd ed.). New York: Wiley. ISBN 0471356328.
  16. ^ Hosmer, David W.; Lemeshow, Stanley (2000). Applied logistic regression (2nd ed.). New York: Wiley. ISBN 0471356328.
  17. ^ Cohen, Jacob. Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum. ISBN 0805822232.
  18. ^ Menard, Scott (2002). Applied logistic regression analysis (2nd ed.). Thousand Oaks, Calif.: Sage. ISBN 9780761922087.
  19. ^ Menard, Scott (2002). Applied logistic regression analysis (2nd ed.). Thousand Oaks, Calif.: Sage. ISBN 9780761922087.
  20. ^ Peduzzi, P. (December 1996). "A simulation study of the number of events per variable in logistic regression analysis". Journal of clinical epidemiology. 49 (12): 1373–9. PMID 8970487.
  21. ^ Menard, Scott (2002). Applied logistic regression analysis (2nd ed.). Thousand Oaks, Calif.: Sage. ISBN 9780761922087.
  22. ^ Menard, Scott (2002). Applied logistic regression analysis (2nd ed.). Thousand Oaks, Calif.: Sage. ISBN 9780761922087.
  23. ^ Menard, Scott (2002). Applied logistic regression analysis (2nd ed.). Thousand Oaks, Calif.: Sage. ISBN 9780761922087.
  24. ^ Hosmer, David W.; Lemeshow, Stanley (2000). Applied logistic regression (2nd ed.). New York: Wiley. ISBN 0471356328.
  25. ^ Cohen, Jacob. Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum. ISBN 9780805822236.
  26. ^ Hosmer, David W.; Lemeshow, Stanley (2000). Applied logistic regression (2nd ed.). New York: Wiley. ISBN 0471356328.
  27. ^ Cohen, Jacob. Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum. ISBN 9780805822236.
  28. ^ Hosmer, David W.; Lemeshow, Stanley (2000). Applied logistic regression (2nd ed.). New York: Wiley. ISBN 0471356328.
  29. ^ Hosmer, David W.; Lemeshow, Stanley (2000). Applied logistic regression (2nd ed.). New York: Wiley. ISBN 0471356328.
  30. ^ Hosmer, David W.; Lemeshow, Stanley (2000). Applied logistic regression (2nd ed.). New York: Wiley. ISBN 0471356328.
  31. ^ Cohen, Jacob. Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum. ISBN 9780805822236.
  32. ^ Hosmer, David W.; Lemeshow, Stanley (2000). Applied logistic regression (2nd ed.). New York: Wiley. ISBN 0471356328.
  33. ^ Cohen, Jacob. Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum. ISBN 9780805822236.
  34. ^ Cohen, Jacob. Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum. ISBN 9780805822236.
  35. ^ Cohen, Jacob. Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum. ISBN 9780805822236.
  36. ^ Cohen, Jacob. Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum. ISBN 9780805822236.
  37. ^ Menard, Scott (2002). Applied logistic regression analysis (2nd ed.). Thousand Oaks, Calif.: Sage. ISBN 9780761922087.
  38. ^ Menard, Scott (2002). Applied logistic regression analysis (2nd ed.). Thousand Oaks, Calif.: Sage. ISBN 9780761922087.
  39. ^ Cohen, Jacob. Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum. ISBN 9780805822236.
  40. ^ Cohen, Jacob. Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum. ISBN 9780805822236.
  41. ^ Cohen, Jacob. Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum. ISBN 9780805822236.
  42. ^ Cohen, Jacob. Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum. ISBN 9780805822236.
  43. ^ Cohen, Jacob. Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum. ISBN 9780805822236.
  44. ^ Cohen, Jacob. Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum. ISBN 9780805822236.
  45. ^ Hosmer, David W.; Lemeshow, Stanley (2000). Applied logistic regression (2nd ed.). New York: Wiley. ISBN 0471356328.
  46. ^ Menard, Scott (2002). Applied logistic regression analysis (2nd ed.). Thousand Oaks, Calif.: Sage. ISBN 9780761922087.
  47. ^ Cohen, Jacob. Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum. ISBN 9780805822236.
  48. ^ Menard, Scott (2002). Applied logistic regression analysis (2nd ed.). Thousand Oaks, Calif.: Sage. ISBN 9780761922087.
  49. ^ Cohen, Jacob. Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum. ISBN 9780805822236.
  50. ^ Jonathan Mark and Michael A. Goldberg (2001). Multiple Regression Analysis and Mass Assessment: A Review of the Issues. The Appraisal Journal, Jan. pp. 89–109.
  51. ^ Mayers, J.H. and Forgy, E.W. (1963). The Development of numerical credit evaluation systems. Journal of the American Statistical Association, Vol. 58, Issue 303 (Sept), pp. 799–806.
  • Agresti, Alan. (2002). Categorical Data Analysis. New York: Wiley-Interscience. ISBN 0-471-36093-7.
  • Amemiya, T. (1985). Advanced Econometrics. Harvard University Press. ISBN 0-674-00560-0.
  • Balakrishnan, N. (1991). Handbook of the Logistic Distribution. Marcel Dekker, Inc. ISBN 978-0-8247-8587-1.
  • Cohen, Jacob (2003). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, 3rd ed. New York: Routledge. ISBN 978-0-8058-2223-6.
  • Greene, William H. (2003). Econometric Analysis, fifth edition. Prentice Hall. ISBN 0-13-066189-9.
  • Hilbe, Joseph M. (2009). Logistic Regression Models. Chapman & Hall/CRC Press. ISBN 978-1-4200-7575-5.
  • Hosmer, David W.; Lemeshow, Stanley (2000). Applied Logistic Regression, 2nd ed. New York; Chichester, Wiley. ISBN 0-471-35632-8.
  • Howell, David C. (2010). Statistical Methods for Psychology, 7th ed. Belmont, CA; Thomson Wadsworth. ISBN 978-0-495-59786-5.
  • Menard, Scott W. (2002). Applied Logistic Regression Analysis, 2nd ed. Thousand Oaks; SAGE. ISBN 9780761922087.
  • Peduzzi, P. (1996). "A simulation study of the number of events per variable in logistic regression analysis". Journal of clinical epidemiology. 49 (12): 1373–1379. PMID 8970487.