For theoretical purposes, it is often convenient to characterize radiation in terms of photons rather than energy. The energy E of a photon is given by the Planck relation

E = h\nu = \frac{hc}{\lambda}
where E is the energy per photon, h is Planck's constant, c is the speed of light, ν is the frequency of the radiation and λ is the wavelength. A spectral radiative quantity in terms of energy, J_E(λ), is converted to its quantal form J_Q(λ) by dividing by the energy per photon:

J_Q(\lambda) = \frac{J_E(\lambda)}{hc/\lambda} = \frac{\lambda}{hc}\,J_E(\lambda)
For example, if J_E(λ) is spectral radiance with units of watts/m²/sr/m, then the quantal equivalent J_Q(λ) characterizes that radiation with units of photons/sec/m²/sr/m.
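A minimal numerical sketch of this conversion (the function name and example values below are illustrative, not part of the original text):

```python
h = 6.62607015e-34   # Planck's constant, J*s
c = 2.99792458e8     # speed of light, m/s

def energy_to_quantal(wavelength_m, J_E):
    """Convert an energy-based spectral quantity J_E(lambda), e.g. spectral
    radiance in W/m^2/sr/m, to its quantal form J_Q(lambda) in
    photons/sec/m^2/sr/m by dividing by the photon energy h*c/lambda."""
    photon_energy = h * c / wavelength_m     # joules per photon
    return J_E / photon_energy

# 1 W/m^2/sr/m of spectral radiance at 550 nm:
print(energy_to_quantal(550e-9, 1.0))        # about 2.8e18 photons/sec/m^2/sr/m
```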
If C_{Eλi}(λ) (i = 1, 2, 3) are the three energy-based color matching functions for a particular color space (the LMS color space for the purposes of this article), then the tristimulus values may be expressed in terms of the quantal radiative quantity by:

C_i = \int C_{E\lambda i}(\lambda)\,J_E(\lambda)\,d\lambda = hc\int \frac{C_{E\lambda i}(\lambda)}{\lambda}\,J_Q(\lambda)\,d\lambda
Define the quantal color matching functions:

C_{Q\lambda i}(\lambda) = \frac{\lambda_{\max i}}{\lambda}\,\frac{C_{E\lambda i}(\lambda)}{C_{E\lambda i}(\lambda_{\max i})}
where λ_{max i} is the wavelength at which C_{Eλi}(λ)/λ is maximized. Define the quantal tristimulus values:

Q_i = \int C_{Q\lambda i}(\lambda)\,J_Q(\lambda)\,d\lambda
Note that, as with the energy-based functions, the peak value of C_{Qλi}(λ) will be equal to unity. Using the above equation for the energy tristimulus values C_i,

C_i = \frac{hc\,C_{E\lambda i}(\lambda_{\max i})}{\lambda_{\max i}}\,Q_i
For the LMS color space, λ_{max i} = {566, 541, 441} nm, and the corresponding conversion factors hc C_{Eλi}(λ_{max i})/λ_{max i} follow from the relation above.
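The construction of a quantal color matching function and the corresponding tristimulus integral can be sketched as follows; the Gaussian-shaped stand-in for an energy-based CMF and all numerical values are assumptions for illustration only, not real LMS data:

```python
import numpy as np

h, c = 6.62607015e-34, 2.99792458e8        # Planck's constant (J*s), speed of light (m/s)
lam = np.linspace(380e-9, 780e-9, 2001)    # wavelength grid, m
dlam = lam[1] - lam[0]

# Stand-in energy-based CMF: a Gaussian with unit peak near 570 nm.
# (A real LMS cone fundamental would be tabulated data.)
C_E = np.exp(-0.5 * ((lam - 570e-9) / 50e-9) ** 2)

# Quantal CMF: proportional to C_E(lambda)/lambda, renormalized to unit peak.
ratio = C_E / lam
i_max = np.argmax(ratio)
lam_max = lam[i_max]                        # wavelength maximizing C_E(lambda)/lambda
C_Q = ratio / ratio[i_max]
print("quantal peak wavelength: %.1f nm" % (lam_max * 1e9))

# Tristimulus value of a spectrally flat radiance, computed both ways.
J_E = np.ones_like(lam)                     # W/m^2/sr/m (illustrative)
J_Q = lam * J_E / (h * c)                   # photons/sec/m^2/sr/m
C_energy = np.sum(C_E * J_E) * dlam                          # energy-based integral
Q = np.sum(C_Q * J_Q) * dlam                                 # quantal integral
print(C_energy, (h * c * C_E[i_max] / lam_max) * Q)          # the two agree
```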
A more general proof
Suppose we are given four equal-length lists of field elements n_i, z_i, n'_i and z'_i, from which we may define the derived quantities below. The n_i and z_i will be called the parent population numbers and characteristics associated with each index i. Likewise, the n'_i and z'_i will be called the child population numbers and characteristics. (Equivalently, we could have been given n_i, z_i, w_i and z'_i, with n'_i = w_i n_i, where w_i is referred to as the fitness associated with index i.) Define the parent and child population totals:
n = \sum_i n_i

n' = \sum_i n'_i
and the probabilities (or frequencies)[1]:
q_i = \frac{n_i}{n}

q'_i = \frac{n'_i}{n'}
Note that these are of the form of probability mass functions in that \sum_i q_i = \sum_i q'_i = 1; in fact, q_i is the probability that a random individual drawn from the parent population has characteristic z_i, and likewise q'_i for the child population. Define the fitnesses:

w_i \equiv \frac{n'_i}{n_i}
The average of any list x_i with respect to a set of frequencies q_i is given by:

E(x_i) = \sum_i q_i x_i
so the average characteristics are defined as:
\bar{z} \equiv \sum_i q_i z_i

\bar{z}' \equiv \sum_i q'_i z'_i
and the average fitness is:

\bar{w} \equiv \sum_i q_i w_i
A simple theorem can be proved:

q_i w_i = \frac{n_i}{n}\,\frac{n'_i}{n_i} = \frac{n'_i}{n'}\,\frac{n'}{n} = q'_i\,\frac{n'}{n}
so that:

\bar{w} = \sum_i q_i w_i = \sum_i q'_i\,\frac{n'}{n} = \frac{n'}{n}
and

q_i w_i = q'_i\,\bar{w}
The covariance of w_i and z_i is defined by:

\operatorname{cov}(w_i, z_i) \equiv E(w_i z_i) - E(w_i)E(z_i) = \sum_i q_i w_i z_i - \bar{w}\,\bar{z}
Defining \Delta z_i \equiv z'_i - z_i, the expectation value of w_i \Delta z_i is

E(w_i\,\Delta z_i) = \sum_i q_i w_i (z'_i - z_i) = \sum_i q_i w_i z'_i - \sum_i q_i w_i z_i
The sum of the two terms is:

\operatorname{cov}(w_i, z_i) + E(w_i\,\Delta z_i) = \sum_i q_i w_i z_i - \bar{w}\,\bar{z} + \sum_i q_i w_i z'_i - \sum_i q_i w_i z_i = \sum_i q_i w_i z'_i - \bar{w}\,\bar{z}
Using the above-mentioned simple theorem, the sum becomes

\operatorname{cov}(w_i, z_i) + E(w_i\,\Delta z_i) = \sum_i q'_i\,\bar{w}\,z'_i - \bar{w}\,\bar{z} = \bar{w}\,\bar{z}' - \bar{w}\,\bar{z} = \bar{w}\,\Delta\bar{z}
where

\Delta\bar{z} \equiv \bar{z}' - \bar{z}.

This is the Price equation: \bar{w}\,\Delta\bar{z} = \operatorname{cov}(w_i, z_i) + E(w_i\,\Delta z_i).
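A short numerical check of this identity, using the definitions above (the population lists are arbitrary illustrative values, not from the original text):

```python
import numpy as np

# Illustrative parent/child population data (arbitrary values)
n  = np.array([10.0, 20.0, 30.0])   # parent numbers n_i
z  = np.array([1.0, 2.0, 3.0])      # parent characteristics z_i
n2 = np.array([15.0, 18.0, 45.0])   # child numbers n'_i
z2 = np.array([1.1, 2.5, 2.9])      # child characteristics z'_i

w = n2 / n                          # fitnesses w_i
q, q2 = n / n.sum(), n2 / n2.sum()  # frequencies q_i, q'_i

w_bar  = np.dot(q, w)               # average fitness (equals n'/n)
z_bar  = np.dot(q, z)               # average parent characteristic
z2_bar = np.dot(q2, z2)             # average child characteristic

cov_wz = np.dot(q, w * z) - w_bar * z_bar    # cov(w_i, z_i)
E_w_dz = np.dot(q, w * (z2 - z))             # E(w_i * Delta z_i)

# Price equation: w_bar * (z2_bar - z_bar) == cov(w, z) + E(w * Delta z)
print(w_bar * (z2_bar - z_bar), cov_wz + E_w_dz)
```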
Logistic regression and maximum entropy
Of all the functional forms used for estimating the probabilities of a particular categorical outcome by maximizing the likelihood function (e.g. probit regression), the logistic regression solution is unique in that it is a maximum entropy solution.[2]
In order to show this, we use the method of Lagrange multipliers. The Lagrangian is equal to the entropy plus the sum of the products of Lagrange multipliers times various constraint expressions. The general multinomial case will be considered, since the proof is not made that much simpler by considering simpler cases. Equating the derivatives of the Lagrangian with respect to the various probabilities to zero yields a functional form for those probabilities that corresponds to the one used in logistic regression.[2]
As in the above section on multinomial logistic regression, we will consider M + 1 explanatory variables denoted x_m, which include x_0 = 1. There will be a total of K data points, indexed by k = {1, 2, ..., K}, and the data points are given by x_{mk} and y_k. The x_{mk} will also be represented as an (M+1)-dimensional vector \boldsymbol{x}_k = \{x_{0k}, x_{1k}, \dots, x_{Mk}\}. There will be N + 1 possible values of the categorical variable y, ranging from 0 to N.
Let p_n(\boldsymbol{x}) be the probability, given the explanatory variable vector \boldsymbol{x}, that the outcome will be y = n. Define p_{nk} = p_n(\boldsymbol{x}_k), which is the probability that the categorical outcome is n for the k-th measurement.
The Lagrangian will be expressed as a function of the probabilities p_{nk} and will be maximized by equating the derivatives of the Lagrangian with respect to these probabilities to zero. An important point is that the probabilities are treated equally, and the fact that they sum to unity is part of the Lagrangian formulation rather than being assumed from the beginning.
The first contribution to the Lagrangian is the entropy:

\mathcal{L}_{ent} = -\sum_{k=1}^{K}\sum_{n=0}^{N} p_{nk}\ln(p_{nk})
The log-likelihood is:

\ell = \sum_{k=1}^{K}\sum_{n=0}^{N} \Delta(n, y_k)\ln(p_{nk})

where \Delta(n, y_k) equals 1 if y_k = n and 0 otherwise.
Assuming the multinomial logistic function, the derivative of the log-likelihood with respect to the beta coefficients was found to be:

\frac{\partial \ell}{\partial \beta_{nm}} = \sum_{k=1}^{K}\big(\Delta(n, y_k) - p_{nk}\big)\,x_{mk}
A very important point here is that this expression is (remarkably) not an explicit function of the beta coefficients. It is only a function of the probabilities p_{nk} and the data. Rather than being specific to the assumed multinomial logistic case, it is taken to be a general statement of the condition at which the log-likelihood is maximized and makes no reference to the functional form of p_{nk}. There are then (M+1)(N+1) fitting constraints, and the fitting constraint term in the Lagrangian is then (since each constraint expression equals zero, its overall sign is a matter of convenience):

\mathcal{L}_{fit} = \sum_{n=0}^{N}\sum_{m=0}^{M}\lambda_{nm}\sum_{k=1}^{K}\big(p_{nk} - \Delta(n, y_k)\big)\,x_{mk}
where the λ_{nm} are the appropriate Lagrange multipliers. There are K normalization constraints which may be written:

\sum_{n=0}^{N} p_{nk} = 1
so that the normalization term in the Lagrangian is:

\mathcal{L}_{norm} = \sum_{k=1}^{K}\alpha_k\left(1 - \sum_{n=0}^{N} p_{nk}\right)
where the α_k are the appropriate Lagrange multipliers. The Lagrangian is then the sum of the above three terms:

\mathcal{L} = \mathcal{L}_{ent} + \mathcal{L}_{fit} + \mathcal{L}_{norm}
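For concreteness, the three contributions can be evaluated numerically for arbitrary candidate probabilities and data; the array names and random values below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, N = 6, 2, 3                      # K data points, M+1 variables, N+1 outcomes

x = np.hstack([np.ones((K, 1)), rng.normal(size=(K, M))])   # x_{mk}, with x_0 = 1
y = rng.integers(0, N + 1, size=K)                          # outcomes y_k in {0..N}
delta = np.eye(N + 1)[y].T                                  # Delta(n, y_k), shape (N+1, K)

p = rng.dirichlet(np.ones(N + 1), size=K).T   # candidate p_{nk}; each column sums to 1
lam = rng.normal(size=(N + 1, M + 1))         # Lagrange multipliers lambda_{nm}
alpha = rng.normal(size=K)                    # Lagrange multipliers alpha_k

L_ent  = -np.sum(p * np.log(p))                        # entropy term
L_fit  = np.sum(lam * ((p - delta) @ x))               # fitting term, (p - Delta) convention
L_norm = np.sum(alpha * (1.0 - p.sum(axis=0)))         # normalization term
print(L_ent, L_fit, L_norm, L_ent + L_fit + L_norm)
```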
Setting the derivative of the Lagrangian with respect to one of the probabilities to zero yields:

\frac{\partial \mathcal{L}}{\partial p_{n'k'}} = 0 = -\ln(p_{n'k'}) - 1 + \sum_{m=0}^{M}\lambda_{n'm}\,x_{mk'} - \alpha_{k'}
Using the more condensed vector notation:

\sum_{m=0}^{M}\lambda_{nm}\,x_{mk} = \boldsymbol{\lambda}_n \cdot \boldsymbol{x}_k
and dropping the primes on the n and k indices, and then solving for p_{nk}, yields:

p_{nk} = e^{\boldsymbol{\lambda}_n \cdot \boldsymbol{x}_k} / Z_k
where:

Z_k = e^{1+\alpha_k}
Imposing the normalization constraint, we can write the probabilities as:

p_{nk} = \frac{e^{\boldsymbol{\lambda}_n \cdot \boldsymbol{x}_k}}{\sum_{u=0}^{N} e^{\boldsymbol{\lambda}_u \cdot \boldsymbol{x}_k}}
If we substitute this expression back into the log-likelihood expression and maximize it with respect to the λ_{mn} in order to find the appropriate λ_{mn} for our data, we will find that the maximum is not attained at a single point but rather on an (M+1)-dimensional subspace of the (M+1)(N+1)-dimensional space of the λ_{mn}. In other words, there are infinitely many equally valid choices of the λ_{mn}. We can fix this freedom in any number of ways, and the method chosen in the multinomial logistic regression section above was to set λ_{m0} = 0 (which are M+1 in number) and to identify the beta coefficients as β_{mn} = λ_{mn} for all n except n = 0. This recovers the results from that section.
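A small sketch of the resulting functional form and of the degeneracy just described (all names and values are illustrative assumptions):

```python
import numpy as np

def probs(lam, x):
    """p_n(x) = exp(lam_n . x) / sum_u exp(lam_u . x): softmax over the N+1 outcomes."""
    s = lam @ x                        # one score per outcome
    e = np.exp(s - s.max())            # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(1)
N, M = 3, 2
lam = rng.normal(size=(N + 1, M + 1))  # lambda_{nm}
x = np.array([1.0, 0.5, -1.2])         # x_0 = 1 plus two explanatory variables

# Degeneracy: adding the same vector c to every lambda_n leaves the probabilities
# unchanged, which is why lambda_{m0} can be set to zero without loss of generality.
c = rng.normal(size=M + 1)
print(probs(lam, x))
print(probs(lam + c, x))               # identical probabilities
print(probs(lam - lam[0], x))          # the lambda_0 = 0 ("beta") parameterization
```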
In machine learning applications where logistic regression is used for binary classification, the MLE minimizes the cross-entropy loss function.
Logistic regression is an important machine learning algorithm. The goal is to model the probability of a random variable Y being 0 or 1 given experimental data.[3]
Consider a generalized linear model function parameterized by θ,

h_\theta(X) = \frac{1}{1 + e^{-\theta^T X}} = \Pr(Y = 1 \mid X; \theta)
Therefore,

\Pr(Y = 0 \mid X; \theta) = 1 - h_\theta(X)
and since Y \in \{0, 1\}, we see that \Pr(y \mid X; \theta) is given by

\Pr(y \mid X; \theta) = h_\theta(X)^{y}\,(1 - h_\theta(X))^{1-y}.

We now calculate the likelihood function, assuming that all the observations in the sample are independently Bernoulli distributed:

L(\theta \mid y; x) = \Pr(Y \mid X; \theta) = \prod_i \Pr(y_i \mid x_i; \theta) = \prod_i h_\theta(x_i)^{y_i}\,\big(1 - h_\theta(x_i)\big)^{1 - y_i}
Typically, the log likelihood is maximized,

N^{-1}\log L(\theta \mid y; x) = N^{-1}\sum_{i=1}^{N}\log \Pr(y_i \mid x_i; \theta)
which is maximized using optimization techniques such as gradient descent.
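A minimal sketch of this fitting procedure, assuming synthetic Bernoulli data and plain gradient descent on the average negative log-likelihood (all names and values are illustrative):

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=5000):
    """Fit theta by gradient descent on the average negative log-likelihood
    (equivalently, gradient ascent on the log-likelihood). X includes a
    leading column of ones for the intercept."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ theta))     # h_theta(x_i) for every data point
        grad = X.T @ (p - y) / len(y)            # gradient of the average negative log-likelihood
        theta -= lr * grad
    return theta

# Synthetic data drawn from a known model (illustrative values)
rng = np.random.default_rng(0)
X = np.hstack([np.ones((1000, 1)), rng.normal(size=(1000, 2))])
true_theta = np.array([-0.5, 2.0, -1.0])
y = (rng.random(1000) < 1.0 / (1.0 + np.exp(-X @ true_theta))).astype(float)

print(fit_logistic(X, y))    # should be close to true_theta
```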
Assuming the (x, y) pairs are drawn uniformly from the underlying distribution, then in the limit of large N,

\lim_{N\to+\infty} N^{-1}\sum_{i=1}^{N}\log \Pr(y_i \mid x_i; \theta)
= \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}}\Pr(X = x, Y = y)\log \Pr(Y = y \mid X = x; \theta)
= \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}}\Pr(X = x, Y = y)\left[\log\frac{\Pr(Y = y \mid X = x; \theta)}{\Pr(Y = y \mid X = x)} + \log \Pr(Y = y \mid X = x)\right]
= -D_{KL}(Y \parallel Y_\theta) - H(Y \mid X)
where H(Y \mid X) is the conditional entropy and D_{KL} is the Kullback–Leibler divergence. This leads to the intuition that by maximizing the log-likelihood of a model, you are minimizing the KL divergence of your model from the maximal entropy distribution. Intuitively, this searches for the model that makes the fewest assumptions in its parameters.
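A quick numerical illustration of this decomposition on a small discrete example (all distributions below are made up for illustration, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)

# X takes 3 values, Y is binary.
px   = np.array([0.2, 0.5, 0.3])     # Pr(X = x)
p_y1 = np.array([0.1, 0.6, 0.9])     # true Pr(Y = 1 | X = x)
q_y1 = np.array([0.2, 0.5, 0.8])     # model Pr(Y = 1 | X = x; theta), arbitrary

# Empirical average log-likelihood of the model over a large sample
n = 200_000
x = rng.choice(3, size=n, p=px)
y = (rng.random(n) < p_y1[x]).astype(float)
avg_loglik = np.mean(y * np.log(q_y1[x]) + (1 - y) * np.log(1 - q_y1[x]))

# Conditional entropy H(Y|X) and conditional KL divergence D_KL(p || q)
def H2(p):
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))   # binary entropy in nats

H_cond = np.sum(px * H2(p_y1))
KL = np.sum(px * (p_y1 * np.log(p_y1 / q_y1)
                  + (1 - p_y1) * np.log((1 - p_y1) / (1 - q_y1))))

print(avg_loglik, -(H_cond + KL))    # the two values nearly coincide for large n
```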