For theoretical purposes, it is often convenient to characterize radiation in terms of photons rather than energy. The energy E of a photon is given by the Planck relation

E = h\nu = \frac{hc}{\lambda}
where E is the energy per photon, h is Planck's constant, c is the speed of light, ν is the frequency of the radiation and λ is the wavelength. A spectral radiative quantity in terms of energy, J_E(λ), is converted to its quantal form J_Q(λ) by dividing by the energy per photon:

J_Q(\lambda) = \frac{J_E(\lambda)}{hc/\lambda} = \frac{\lambda}{hc}\,J_E(\lambda)
For example, if J_E(λ) is spectral radiance with units of watts/m²/sr/m, then the quantal equivalent J_Q(λ) characterizes that radiation with units of photons/sec/m²/sr/m.
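A minimal numerical sketch of this conversion (the function name and example values below are illustrative, not part of the original text):

```python
h = 6.62607015e-34   # Planck's constant, J*s
c = 2.99792458e8     # speed of light, m/s

def energy_to_quantal(wavelength_m, J_E):
    """Convert an energy-based spectral quantity J_E(lambda), e.g. spectral
    radiance in W/m^2/sr/m, to its quantal form J_Q(lambda) in
    photons/sec/m^2/sr/m by dividing by the photon energy h*c/lambda."""
    photon_energy = h * c / wavelength_m     # joules per photon
    return J_E / photon_energy

# 1 W/m^2/sr/m of spectral radiance at 550 nm:
print(energy_to_quantal(550e-9, 1.0))        # about 2.8e18 photons/sec/m^2/sr/m
```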
If C_{Eλi}(λ) (i = 1, 2, 3) are the three energy-based color matching functions for a particular color space (the LMS color space for the purposes of this article), then the tristimulus values may be expressed in terms of the quantal radiative quantity by:

C_i = \int C_{E\lambda i}(\lambda)\,J_E(\lambda)\,d\lambda = hc\int \frac{C_{E\lambda i}(\lambda)}{\lambda}\,J_Q(\lambda)\,d\lambda
Define the quantal color matching functions:

C_{Q\lambda i}(\lambda) = \frac{\lambda_{\max i}}{\lambda}\,\frac{C_{E\lambda i}(\lambda)}{C_{E\lambda i}(\lambda_{\max i})}
where λ_{max i} is the wavelength at which C_{Eλi}(λ)/λ is maximized. Define the quantal tristimulus values:

Q_i = \int C_{Q\lambda i}(\lambda)\,J_Q(\lambda)\,d\lambda
Note that, as with the energy-based functions, the peak value of C_{Qλi}(λ) will be equal to unity. Using the above equation for the energy tristimulus values C_i,

C_i = \frac{hc\,C_{E\lambda i}(\lambda_{\max i})}{\lambda_{\max i}}\,Q_i
For the LMS color space, λ_{max i} = {566, 541, 441} nm, and the corresponding conversion factors hc C_{Eλi}(λ_{max i})/λ_{max i} follow from the relation above.
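The construction of a quantal color matching function and the corresponding tristimulus integral can be sketched as follows; the Gaussian-shaped stand-in for an energy-based CMF and all numerical values are assumptions for illustration only, not real LMS data:

```python
import numpy as np

h, c = 6.62607015e-34, 2.99792458e8        # Planck's constant (J*s), speed of light (m/s)
lam = np.linspace(380e-9, 780e-9, 2001)    # wavelength grid, m
dlam = lam[1] - lam[0]

# Stand-in energy-based CMF: a Gaussian with unit peak near 570 nm.
# (A real LMS cone fundamental would be tabulated data.)
C_E = np.exp(-0.5 * ((lam - 570e-9) / 50e-9) ** 2)

# Quantal CMF: proportional to C_E(lambda)/lambda, renormalized to unit peak.
ratio = C_E / lam
i_max = np.argmax(ratio)
lam_max = lam[i_max]                        # wavelength maximizing C_E(lambda)/lambda
C_Q = ratio / ratio[i_max]
print("quantal peak wavelength: %.1f nm" % (lam_max * 1e9))

# Tristimulus value of a spectrally flat radiance, computed both ways.
J_E = np.ones_like(lam)                     # W/m^2/sr/m (illustrative)
J_Q = lam * J_E / (h * c)                   # photons/sec/m^2/sr/m
C_energy = np.sum(C_E * J_E) * dlam                          # energy-based integral
Q = np.sum(C_Q * J_Q) * dlam                                 # quantal integral
print(C_energy, (h * c * C_E[i_max] / lam_max) * Q)          # the two agree
```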
A more general proof
Suppose we are given four equal-length lists of field elements n_i, z_i, n'_i and z'_i, from which we may define the derived quantities below. The n_i and z_i will be called the parent population numbers and characteristics associated with each index i. Likewise, the n'_i and z'_i will be called the child population numbers and characteristics. (Equivalently, we could have been given n_i, z_i, w_i and z'_i, with n'_i = w_i n_i, where w_i is referred to as the fitness associated with index i.) Define the parent and child population totals:
n = \sum_i n_i

n' = \sum_i n'_i
and the probabilities (or frequencies)[1]:
q_i = \frac{n_i}{n}

q'_i = \frac{n'_i}{n'}
Note that these are of the form of probability mass functions in that \sum_i q_i = \sum_i q'_i = 1; in fact, q_i is the probability that a random individual drawn from the parent population has characteristic z_i, and likewise q'_i for the child population. Define the fitnesses:

w_i \equiv \frac{n'_i}{n_i}
The average of any list x_i with respect to a set of frequencies q_i is given by:

E(x_i) = \sum_i q_i x_i
so the average characteristics are defined as:
\bar{z} \equiv \sum_i q_i z_i

\bar{z}' \equiv \sum_i q'_i z'_i
and the average fitness is:

\bar{w} \equiv \sum_i q_i w_i
A simple theorem can be proved:

q_i w_i = \frac{n_i}{n}\,\frac{n'_i}{n_i} = \frac{n'_i}{n'}\,\frac{n'}{n} = q'_i\,\frac{n'}{n}
so that:

\bar{w} = \sum_i q_i w_i = \sum_i q'_i\,\frac{n'}{n} = \frac{n'}{n}
and

q_i w_i = q'_i\,\bar{w}
The covariance of w_i and z_i is defined by:

\operatorname{cov}(w_i, z_i) \equiv E(w_i z_i) - E(w_i)E(z_i) = \sum_i q_i w_i z_i - \bar{w}\,\bar{z}
Defining \Delta z_i \equiv z'_i - z_i, the expectation value of w_i \Delta z_i is

E(w_i\,\Delta z_i) = \sum_i q_i w_i (z'_i - z_i) = \sum_i q_i w_i z'_i - \sum_i q_i w_i z_i
The sum of the two terms is:

\operatorname{cov}(w_i, z_i) + E(w_i\,\Delta z_i) = \sum_i q_i w_i z_i - \bar{w}\,\bar{z} + \sum_i q_i w_i z'_i - \sum_i q_i w_i z_i = \sum_i q_i w_i z'_i - \bar{w}\,\bar{z}
Using the above-mentioned simple theorem, the sum becomes

\operatorname{cov}(w_i, z_i) + E(w_i\,\Delta z_i) = \sum_i q'_i\,\bar{w}\,z'_i - \bar{w}\,\bar{z} = \bar{w}\,\bar{z}' - \bar{w}\,\bar{z} = \bar{w}\,\Delta\bar{z}
where

\Delta\bar{z} \equiv \bar{z}' - \bar{z}.

This is the Price equation: \bar{w}\,\Delta\bar{z} = \operatorname{cov}(w_i, z_i) + E(w_i\,\Delta z_i).
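A short numerical check of this identity, using the definitions above (the population lists are arbitrary illustrative values, not from the original text):

```python
import numpy as np

# Illustrative parent/child population data (arbitrary values)
n  = np.array([10.0, 20.0, 30.0])   # parent numbers n_i
z  = np.array([1.0, 2.0, 3.0])      # parent characteristics z_i
n2 = np.array([15.0, 18.0, 45.0])   # child numbers n'_i
z2 = np.array([1.1, 2.5, 2.9])      # child characteristics z'_i

w = n2 / n                          # fitnesses w_i
q, q2 = n / n.sum(), n2 / n2.sum()  # frequencies q_i, q'_i

w_bar  = np.dot(q, w)               # average fitness (equals n'/n)
z_bar  = np.dot(q, z)               # average parent characteristic
z2_bar = np.dot(q2, z2)             # average child characteristic

cov_wz = np.dot(q, w * z) - w_bar * z_bar    # cov(w_i, z_i)
E_w_dz = np.dot(q, w * (z2 - z))             # E(w_i * Delta z_i)

# Price equation: w_bar * (z2_bar - z_bar) == cov(w, z) + E(w * Delta z)
print(w_bar * (z2_bar - z_bar), cov_wz + E_w_dz)
```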
Logistic regression and maximum entropy
Of all the functional forms used for estimating the probabilities of a particular categorical outcome by maximizing the likelihood function (e.g. probit regression), the logistic regression solution is unique in that it is a maximum entropy solution.[2]
In order to show this, we use the method of Lagrange multipliers. The Lagrangian is equal to the entropy plus the sum of the products of Lagrange multipliers times various constraint expressions. The general multinomial case will be considered, since the proof is not made that much simpler by considering simpler cases. Equating the derivatives of the Lagrangian with respect to the various probabilities to zero yields a functional form for those probabilities that corresponds to the one used in logistic regression.[2]
As in the above section on multinomial logistic regression, we will consider M + 1 explanatory variables denoted x_m, which include x_0 = 1. There will be a total of K data points, indexed by k = {1, 2, ..., K}, and the data points are given by x_{mk} and y_k. The x_{mk} will also be represented as an (M+1)-dimensional vector \boldsymbol{x}_k = \{x_{0k}, x_{1k}, \dots, x_{Mk}\}. There will be N + 1 possible values of the categorical variable y, ranging from 0 to N.
Let p_n(\boldsymbol{x}) be the probability, given the explanatory variable vector \boldsymbol{x}, that the outcome will be y = n. Define p_{nk} = p_n(\boldsymbol{x}_k), which is the probability that the categorical outcome is n for the k-th measurement.
The Lagrangian will be expressed as a function of the probabilities p_{nk} and will be maximized by equating the derivatives of the Lagrangian with respect to these probabilities to zero. An important point is that the probabilities are treated equally, and the fact that they sum to unity is part of the Lagrangian formulation rather than being assumed from the beginning.
The first contribution to the Lagrangian is the entropy:

\mathcal{L}_{ent} = -\sum_{k=1}^{K}\sum_{n=0}^{N} p_{nk}\ln(p_{nk})
The log-likelihood is:

\ell = \sum_{k=1}^{K}\sum_{n=0}^{N} \Delta(n, y_k)\ln(p_{nk})

where \Delta(n, y_k) equals 1 if y_k = n and 0 otherwise.
Assuming the multinomial logistic function, the derivative of the log-likelihood with respect to the beta coefficients was found to be:

\frac{\partial \ell}{\partial \beta_{nm}} = \sum_{k=1}^{K}\big(\Delta(n, y_k) - p_{nk}\big)\,x_{mk}
A very important point here is that this expression is (remarkably) not an explicit function of the beta coefficients. It is only a function of the probabilities p_{nk} and the data. Rather than being specific to the assumed multinomial logistic case, it is taken to be a general statement of the condition at which the log-likelihood is maximized and makes no reference to the functional form of p_{nk}. There are then (M+1)(N+1) fitting constraints, and the fitting constraint term in the Lagrangian is then (since each constraint expression equals zero, its overall sign is a matter of convenience):

\mathcal{L}_{fit} = \sum_{n=0}^{N}\sum_{m=0}^{M}\lambda_{nm}\sum_{k=1}^{K}\big(p_{nk} - \Delta(n, y_k)\big)\,x_{mk}
where the λ_{nm} are the appropriate Lagrange multipliers. There are K normalization constraints which may be written:

\sum_{n=0}^{N} p_{nk} = 1
so that the normalization term in the Lagrangian is:

\mathcal{L}_{norm} = \sum_{k=1}^{K}\alpha_k\left(1 - \sum_{n=0}^{N} p_{nk}\right)
where the α_k are the appropriate Lagrange multipliers. The Lagrangian is then the sum of the above three terms:

\mathcal{L} = \mathcal{L}_{ent} + \mathcal{L}_{fit} + \mathcal{L}_{norm}
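For concreteness, the three contributions can be evaluated numerically for arbitrary candidate probabilities and data; the array names and random values below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, N = 6, 2, 3                      # K data points, M+1 variables, N+1 outcomes

x = np.hstack([np.ones((K, 1)), rng.normal(size=(K, M))])   # x_{mk}, with x_0 = 1
y = rng.integers(0, N + 1, size=K)                          # outcomes y_k in {0..N}
delta = np.eye(N + 1)[y].T                                  # Delta(n, y_k), shape (N+1, K)

p = rng.dirichlet(np.ones(N + 1), size=K).T   # candidate p_{nk}; each column sums to 1
lam = rng.normal(size=(N + 1, M + 1))         # Lagrange multipliers lambda_{nm}
alpha = rng.normal(size=K)                    # Lagrange multipliers alpha_k

L_ent  = -np.sum(p * np.log(p))                        # entropy term
L_fit  = np.sum(lam * ((p - delta) @ x))               # fitting term, (p - Delta) convention
L_norm = np.sum(alpha * (1.0 - p.sum(axis=0)))         # normalization term
print(L_ent, L_fit, L_norm, L_ent + L_fit + L_norm)
```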
Setting the derivative of the Lagrangian with respect to one of the probabilities to zero yields:

\frac{\partial \mathcal{L}}{\partial p_{n'k'}} = 0 = -\ln(p_{n'k'}) - 1 + \sum_{m=0}^{M}\lambda_{n'm}\,x_{mk'} - \alpha_{k'}
Using the more condensed vector notation:

\sum_{m=0}^{M}\lambda_{nm}\,x_{mk} = \boldsymbol{\lambda}_n \cdot \boldsymbol{x}_k
and dropping the primes on the n and k indices, and then solving for p_{nk}, yields:

p_{nk} = e^{\boldsymbol{\lambda}_n \cdot \boldsymbol{x}_k} / Z_k
where:

Z_k = e^{1+\alpha_k}
Imposing the normalization constraint, we can write the probabilities as:

p_{nk} = \frac{e^{\boldsymbol{\lambda}_n \cdot \boldsymbol{x}_k}}{\sum_{u=0}^{N} e^{\boldsymbol{\lambda}_u \cdot \boldsymbol{x}_k}}
If we substitute this expression back into the log-likelihood expression and maximize it with respect to the λ_{mn} in order to find the appropriate λ_{mn} for our data, we will find that the maximum is not attained at a single point but rather on an (M+1)-dimensional subspace of the (M+1)(N+1)-dimensional space of the λ_{mn}. In other words, there are infinitely many equally valid choices of the λ_{mn}. We can fix this freedom in any number of ways, and the method chosen in the multinomial logistic regression section above was to set λ_{m0} = 0 (which are M+1 in number) and to identify the beta coefficients as β_{mn} = λ_{mn} for all n except n = 0. This recovers the results from that section.
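A small sketch of the resulting functional form and of the degeneracy just described (all names and values are illustrative assumptions):

```python
import numpy as np

def probs(lam, x):
    """p_n(x) = exp(lam_n . x) / sum_u exp(lam_u . x): softmax over the N+1 outcomes."""
    s = lam @ x                        # one score per outcome
    e = np.exp(s - s.max())            # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(1)
N, M = 3, 2
lam = rng.normal(size=(N + 1, M + 1))  # lambda_{nm}
x = np.array([1.0, 0.5, -1.2])         # x_0 = 1 plus two explanatory variables

# Degeneracy: adding the same vector c to every lambda_n leaves the probabilities
# unchanged, which is why lambda_{m0} can be set to zero without loss of generality.
c = rng.normal(size=M + 1)
print(probs(lam, x))
print(probs(lam + c, x))               # identical probabilities
print(probs(lam - lam[0], x))          # the lambda_0 = 0 ("beta") parameterization
```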
In machine learning applications where logistic regression is used for binary classification, the MLE minimizes the cross-entropy loss function.
Logistic regression is an important machine learning algorithm. The goal is to model the probability of a random variable Y being 0 or 1 given experimental data.[3]
Consider a generalized linear model function parameterized by θ,

h_\theta(X) = \frac{1}{1 + e^{-\theta^T X}} = \Pr(Y = 1 \mid X; \theta)
Therefore,

\Pr(Y = 0 \mid X; \theta) = 1 - h_\theta(X)
and since Y \in \{0, 1\}, we see that \Pr(y \mid X; \theta) is given by

\Pr(y \mid X; \theta) = h_\theta(X)^{y}\,(1 - h_\theta(X))^{1-y}.

We now calculate the likelihood function, assuming that all the observations in the sample are independently Bernoulli distributed:

L(\theta \mid y; x) = \Pr(Y \mid X; \theta) = \prod_i \Pr(y_i \mid x_i; \theta) = \prod_i h_\theta(x_i)^{y_i}\,\big(1 - h_\theta(x_i)\big)^{1 - y_i}
Typically, the log likelihood is maximized,

N^{-1}\log L(\theta \mid y; x) = N^{-1}\sum_{i=1}^{N}\log \Pr(y_i \mid x_i; \theta)
which is maximized using optimization techniques such as gradient descent.
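A minimal sketch of this fitting procedure, assuming synthetic Bernoulli data and plain gradient descent on the average negative log-likelihood (all names and values are illustrative):

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=5000):
    """Fit theta by gradient descent on the average negative log-likelihood
    (equivalently, gradient ascent on the log-likelihood). X includes a
    leading column of ones for the intercept."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ theta))     # h_theta(x_i) for every data point
        grad = X.T @ (p - y) / len(y)            # gradient of the average negative log-likelihood
        theta -= lr * grad
    return theta

# Synthetic data drawn from a known model (illustrative values)
rng = np.random.default_rng(0)
X = np.hstack([np.ones((1000, 1)), rng.normal(size=(1000, 2))])
true_theta = np.array([-0.5, 2.0, -1.0])
y = (rng.random(1000) < 1.0 / (1.0 + np.exp(-X @ true_theta))).astype(float)

print(fit_logistic(X, y))    # should be close to true_theta
```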
Assuming the (x, y) pairs are drawn uniformly from the underlying distribution, then in the limit of large N,

\lim_{N\to+\infty} N^{-1}\sum_{i=1}^{N}\log \Pr(y_i \mid x_i; \theta)
= \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}}\Pr(X = x, Y = y)\log \Pr(Y = y \mid X = x; \theta)
= \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}}\Pr(X = x, Y = y)\left[\log\frac{\Pr(Y = y \mid X = x; \theta)}{\Pr(Y = y \mid X = x)} + \log \Pr(Y = y \mid X = x)\right]
= -D_{KL}(Y \parallel Y_\theta) - H(Y \mid X)
where H(Y \mid X) is the conditional entropy and D_{KL} is the Kullback–Leibler divergence. This leads to the intuition that by maximizing the log-likelihood of a model, you are minimizing the KL divergence of your model from the maximal entropy distribution. Intuitively, this searches for the model that makes the fewest assumptions in its parameters.
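A quick numerical illustration of this decomposition on a small discrete example (all distributions below are made up for illustration, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)

# X takes 3 values, Y is binary.
px   = np.array([0.2, 0.5, 0.3])     # Pr(X = x)
p_y1 = np.array([0.1, 0.6, 0.9])     # true Pr(Y = 1 | X = x)
q_y1 = np.array([0.2, 0.5, 0.8])     # model Pr(Y = 1 | X = x; theta), arbitrary

# Empirical average log-likelihood of the model over a large sample
n = 200_000
x = rng.choice(3, size=n, p=px)
y = (rng.random(n) < p_y1[x]).astype(float)
avg_loglik = np.mean(y * np.log(q_y1[x]) + (1 - y) * np.log(1 - q_y1[x]))

# Conditional entropy H(Y|X) and conditional KL divergence D_KL(p || q)
def H2(p):
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))   # binary entropy in nats

H_cond = np.sum(px * H2(p_y1))
KL = np.sum(px * (p_y1 * np.log(p_y1 / q_y1)
                  + (1 - p_y1) * np.log((1 - p_y1) / (1 - q_y1))))

print(avg_loglik, -(H_cond + KL))    # the two values nearly coincide for large n
```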