# Scoring rule

In decision theory, a score function, or scoring rule, measures the accuracy of probabilistic predictions. It is applicable to tasks in which predictions must assign probabilities to a set of mutually exclusive discrete outcomes. The set of possible outcomes can be either binary or categorical in nature, and the probabilities assigned to this set of outcomes must sum to one (where each individual probability is in the range of 0 to 1). A score can be thought of as either a measure of the "calibration" of a set of probabilistic predictions, or as a "cost function" or "loss function".

If a cost is levied in proportion to a proper scoring rule, the minimal expected cost corresponds to reporting the true set of probabilities. Proper scoring rules are used in meteorology, finance, and pattern classification where a forecaster or algorithm will attempt to minimize the average score to yield refined, calibrated probabilities (i.e. accurate probabilities). Various scoring rules have also been used to assess the predictive accuracy of football forecast models.[1]

## Example application of scoring rules

The logarithmic rule

An example of probabilistic forecasting is in meteorology where a weather forecaster may give the probability of rain on the next day. One could note the number of times that a 25% probability was quoted, over a long period, and compare this with the actual proportion of times that rain fell. If the actual percentage was substantially different from the stated probability we say that the forecaster is poorly calibrated. A poorly calibrated forecaster might be encouraged to do better by a bonus system. A bonus system designed around a proper scoring rule will incentivize the forecaster to report probabilities equal to his personal beliefs.[2]

In addition to simple case of a binary decision, such as assigning probabilities to 'rain' or 'no rain', scoring rules may be used for multiple classes, such as 'rain', 'snow', or 'clear'.

The image to the right shows an example of a scoring rule, the logarithmic scoring rule, as a function of the probability reported for the event that actually occurred. One way to use this rule would be as a cost based on the probability that a forecaster or algorithm assigns, then checking to see which event actually occurs.

## Proper scoring rules

Expected value of Logarithmic rule, when Event 1 is expected to occur with probability of 0.8

A probabilistic forecaster or algorithm will return a Probability vector r with a probability for each of the i outcomes. One usage of a scoring function could be to give a reward of $S(\mathbf{r},i)$ if the ith event occurs. If a proper scoring rule is used, then the highest expected reward is obtained by reporting the true probability distribution. The use of a proper scoring rule encourages the forecaster to be honest to maximize the expected reward.

A scoring rule is strictly proper if it is uniquely optimized by the true probabilities. Optimized in this case will correspond to maximization for the quadratic, spherical, and logarithmic rules but minimization for the Brier Score. This can be seen in the image at right for the logarithmic rule. Here, Event 1 is expected to occur with probability of 0.8, and the expected score (or reward) is shown as a function of the reported probability. The way to maximize the expected reward is to report the actual probability of 0.8 as all other reported probabilities will yield a lower expected score. This property holds because the logarithmic score is proper.

### Examples of proper scoring rules

There are an infinite number of scoring rules, including entire parameterized families of proper scoring rules. The ones shown below are simply popular examples.

#### Logarithmic scoring rule

The logarithmic scoring rule is a local strictly proper scoring rule. This is also the negative of surprisal, which is commonly using a scoring criteria in Bayesian Inference; the goal is to minimize expected surprisal. This scoring rule has strong foundations in information theory.

$L(\mathbf{r},i) = \ln(r_i)$

That is, a prediction of 80% or 0.8 which proved true (good) would receive a score of ln(0.8) = -0.22, while the same prediction which proved false (bad) would receive a score of the right prediction 20%: ln(1-0.8) = ln(0.2) = -1.6. The goal of a forecaster is to maximize his score and for the score to be as large as possible, and -0.22 is indeed larger than -1.6.

If one treats the truth or falsity of the prediction as a variable x which is 1 or 0 respectively, and the expressed probability as p, then one could write the logarithmic scoring rule as x*log(p) + (1-x)*log(1-p).

Since strictly proper scoring rules remain strictly proper under linear transformation

$L(\mathbf{r},i) = \log_b(r_i)$ is strictly proper for all $b>0$

The quadratic scoring rule is a strictly proper scoring rule

$Q(\mathbf{r},i) = 2r_i - \mathbf{r}\cdot \mathbf{r} = 2r_i -\sum_{j=1}^C r_j^2$

where $r_i$ is the probability assigned to the correct answer.

The Brier score, originally proposed by Glenn W. Brier in 1950,[3] can be obtained by an affine transform from the quadratic scoring rule.

$B(\mathbf{r},i) = \sum_{j=1}^C (y_j-r_j)^2$

Where $y_j = 1$ when the jth event is correct and $y_j = 0$ otherwise and C is the number of classes.

An important difference between these two rules is that a forecaster should strive to maximize the quadratic score yet minimize the Brier score. This is due to a negative sign in the linear transformation between them.

#### Spherical scoring rule

The spherical scoring rule is also a strictly proper scoring rule

$S(\mathbf{r},i) = \frac{r_i}{\lVert \mathbf{r} \rVert} = \frac{r_i}{\sqrt{r_1^2 + \cdots + r_c^2}}$

### Comparison of proper scoring rules

Shown below on the left is a graphical comparison of the Logarithmic, Quadratic, and Spherical scoring rules for a binary classification problem. The x-axis indicates the reported probability for the event that actually occurred.

It is important to note that each of the scores have diffent magnitudes and locations. The magnitude differences are not relevant however as scores remain proper under affine transformation. Therefore, to compare different score it is necessary to move them to a common scale. A reasonable choice of normalization is shown at the picture on the right where all scores intersect the points (0.5,0) and (1,1). This ensures that they yield 0 for a uniform distribution (two probabilities of 0.5 each), reflecting no cost or reward for reporting what is often the baseline distribution. All normalized scores below also yield 1 when the true class is assigned a probability of 1.

 Score of a binary classification for the true class showing logarithmic (blue), spherical (green), and quadratic (red) Normalized score of a binary classification for the true class showing logarithmic (blue), spherical (green), and quadratic (red)

## Characteristics

### Positive-affine transformation

A strictly proper scoring rule, whether binary or multiclass, after a positive-affine transformation remains a strictly proper scoring rule.[2] That is, if $S(\mathbf{r},i)$ is a strictly proper scoring rule then $a+bS(\mathbf{r},i)$ with $b>0$ is also a strictly proper scoring rule.

### Locality

A proper scoring rule is said to be local if its value depends only on the probability $r_i$. All binary scores are local because the probability assigned to the event that did not occur is directly producible as $1-r_i$.

The logarithmic scoring rule is an example of a strictly proper local scoring rule.

### Decomposition

The expectation value of a proper scoring rule $S$ can be decomposed into the sum of three components, called uncertainty, reliability, and resolution,[4][5] which characterize different attributes of probabilistic forecasts:

$E(S) = UNC + REL - RES.$

If a score is proper and negatively oriented (such as the Brier Score), all three terms are positive definite. The uncertainty component is equal to the expected score of the forecast which constantly predicts the average event frequency. The reliability component penalizes poorly calibrated forecasts, in which the predicted probabilities do not coincide with the event frequencies. Resolution rewards probabilities that are close to one whenever the event happens, and which are close to zero if the event does not happen.

The equations for the individual components depend on the particular scoring rule. For the Brier Score, they are given by

$UNC = \bar{x}(1-\bar{x})$
$REL = E(p-\pi(p))^2$
$RES = E(\pi(p)-\bar{x})^2$

where $\bar{x}$ is the average probability of occurrence of the binary event $x$, and $\pi(p)$ is the conditional event probability, given $p$, i.e. $\pi(p) = P(x=1\mid p)$

## References

1. ^ Constantinou, Anthony; Fenton, N. (2012). "Solving the Problem of Inadequate Scoring Rules for Assessing Probabilistic Football Forecast Models.". Journal of Quantitative Analysis in Sports. 8 (1): Article 1. Retrieved 25 March 2014.
2. ^ a b Bickel, E.J. (2007). "Some Comparisons among Quadratic, Spherical, and Logarithmic Scoring Rules". Decision Analysis 4 (2): 49–65. doi:10.1287/deca.1070.0089.
3. ^ Brier, G.W. (1950). "Verification of forecasts expressed in terms of probability". Monthly weather review 78: 1–3.
4. ^ Murphy, A.H. (1973). "A new vector partition of the probability score". Journal of Applied Meteorology 12: 595–600. doi:10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.
5. ^ Bröcker, J. (2009). "Reliability, sufficiency, and the decomposition of proper scores". Quarterly Journal of the Royal Meteorological Society 135 (643): 1512–1519. doi:10.1002/qj.456.