Structured expert judgment: the classical model

From Wikipedia, the free encyclopedia

Expert Judgment (EJ) denotes a wide variety of techniques ranging from a single undocumented opinion, through preference surveys, to formal elicitation with external validation of expert probability assessments. Recent references are [1] and [2]. In the nuclear safety area, Rasmussen [3] formalized EJ by documenting all steps in the expert elicitation process for scientific review. This made visible wide spreads in expert assessments and raised questions regarding the validation and synthesis of expert judgments. The nuclear safety community later adopted expert judgment techniques underpinned by external validation [4]. Empirical validation is the hallmark of science, and forms the centerpiece of the classical model of probabilistic forecasting [5]. A European Network coordinates workshops. Application areas include nuclear safety, investment banking, volcanology, public health, ecology, engineering, climate change and aeronautics/aerospace. For a survey of applications through 2006, see [6]; [7] and [8] give exhortatory overviews. A recent large scale implementation by the World Health Organization is described in [9] and [10]. A long running application at the Montserrat Volcano Observatory is described in [11], [12] and [13].

The classical model scores expert performance in terms of statistical accuracy (sometimes called calibration) and informativeness [14]. These terms should not be confused with “accuracy and precision”: accuracy “is a description of systematic errors” while precision “is a description of random errors”. In the classical model, statistical accuracy is measured as the p-value, or probability with which one would falsely reject the hypothesis that an expert’s probability assessments were statistically accurate. A low value (near zero) means it is very unlikely that the discrepancy between an expert’s probability statements and observed outcomes arose by chance.
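The p-value can be illustrated with a minimal sketch. It assumes the common elicitation format in which each expert states 5%, 50% and 95% quantiles for every calibration variable, so that realizations fall into four inter-quantile bins with theoretical probabilities (0.05, 0.45, 0.45, 0.05). It also assumes the usual likelihood-ratio form of the test, in which 2N times the relative information of the empirical bin frequencies with respect to the theoretical ones is compared to a chi-square distribution with three degrees of freedom. The function names and the two experts are hypothetical and are not taken from any particular software package.

```python
import math

# Theoretical bin probabilities for 5%, 50% and 95% quantile assessments
# (an assumption about the elicitation format).
THEORETICAL = (0.05, 0.45, 0.45, 0.05)


def relative_information(s, p):
    """Shannon relative information I(s; p) = sum_i s_i * ln(s_i / p_i)."""
    return sum(si * math.log(si / pi) for si, pi in zip(s, p) if si > 0)


def chi2_sf_3df(x):
    """Upper tail probability of a chi-square variable with 3 degrees of freedom."""
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def statistical_accuracy(bin_counts):
    """p-value of the hypothesis that realizations fall into the four
    inter-quantile bins with the theoretical probabilities."""
    n = sum(bin_counts)
    empirical = [c / n for c in bin_counts]
    statistic = 2 * n * relative_information(empirical, THEORETICAL)
    return chi2_sf_3df(statistic)


# Two hypothetical experts, each assessed on 20 calibration variables.
# Counts are realizations below the 5% quantile, between 5% and 50%,
# between 50% and 95%, and above the 95% quantile.
well_calibrated = statistical_accuracy([1, 9, 9, 1])   # frequencies match exactly
overconfident = statistical_accuracy([8, 2, 2, 8])     # too many tail surprises
print(well_calibrated, overconfident)
```

The first expert's empirical frequencies coincide with the theoretical ones, so the score is at its maximum of 1; the second expert's intervals are too narrow, and the score is close to zero.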
Informativeness is measured as Shannon relative information (or Kullback–Leibler divergence) with respect to an analyst-supplied background measure. Shannon relative information is used because it is scale invariant, tail insensitive, slow, and familiar. Parenthetically, measures with physical dimensions, such as the standard deviation or the width of prediction intervals, raise serious problems, as a change of units (meters to kilometers) would affect some variables but not others. The product of statistical accuracy and informativeness for each expert is that expert’s combined score. With an optimal choice of a statistical accuracy threshold beneath which experts receive zero weight, the combined score is a long run “strictly proper scoring rule”: an expert achieves their long run maximal expected score by, and only by, stating their true beliefs. The classical model derives Performance Weighted (PW) combinations. These are compared with Equally Weighted (EW) combinations, and recently with Harmonically Weighted (HW) combinations, as well as with individual expert assessments.
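A rough sketch of the information score and of performance weights follows. It rests on several assumptions that are not stated in the text: experts give 5%, 50% and 95% quantiles, the background measure is uniform on an analyst-chosen range [lo, hi], the expert's probability mass is spread uniformly between consecutive quantiles, and the threshold `alpha` is a fixed hypothetical value rather than an optimized one. In practice the information score would also be averaged over all calibration variables. All names and numbers are illustrative.

```python
import math

PROBS = (0.05, 0.45, 0.45, 0.05)  # mass between consecutive 5/50/95% quantiles


def information_score(quantiles, lo, hi):
    """Relative information of an expert's distribution with respect to a
    uniform background measure on the range [lo, hi]."""
    edges = (lo, *quantiles, hi)
    total = hi - lo
    score = 0.0
    for p, left, right in zip(PROBS, edges, edges[1:]):
        background = (right - left) / total  # background mass of this bin
        score += p * math.log(p / background)
    return score


def performance_weights(accuracy, information, alpha):
    """Normalized weights proportional to accuracy * information, with zero
    weight for experts whose statistical accuracy is below alpha."""
    combined = [a * i if a >= alpha else 0.0 for a, i in zip(accuracy, information)]
    norm = sum(combined)
    if norm == 0:
        raise ValueError("no expert passes the accuracy threshold")
    return [c / norm for c in combined]


# Scale invariance: the same assessment in meters and in kilometers.
score_m = information_score((2.0, 5.0, 9.0), 0.0, 20.0)
score_km = information_score((0.002, 0.005, 0.009), 0.0, 0.020)

# Three hypothetical experts; the second falls below the threshold.
weights = performance_weights(
    accuracy=[0.4, 0.001, 0.2],
    information=[1.0, 2.5, 1.5],
    alpha=0.05,
)
print(score_m, score_km, weights)
```

The two information scores agree up to rounding, because only ratios of interval lengths enter the calculation. The second expert is the most informative of the three but receives zero weight because of poor statistical accuracy.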

While some mathematicians and decision analysts regard combining expert judgments as a mathematical problem, the classical model regards expert combination as more akin to an engineering problem. A bicycle obeys Newton's Laws but does not follow from them. It is designed to optimize performance under constraints. Similarly, expert judgment combination is viewed as a tool for enabling rational consensus by optimizing performance measures under mathematical and decision theoretic constraints. The theory of rational consensus [5] is summarized in [15] and [16].

Real expert judgment studies differ in many ways from research or academic exercises. The experts are typically recruited in a traceable peer nomination process based on their knowledge of and engagement with the subject of the study; they may receive remuneration. In all cases, experts’ reasoning is documented, and their names and affiliations are part of the reporting. However, to encourage candid judgments, individuals’ responses are not exchanged within the group and association of names with assessments is not reported in the open literature, but is preserved to enable peer review by the problem owner.

Elicitations typically last several hours; the elicitation protocol is formalized and is part of the public reporting. Elicitation styles differ among practitioners, including face-to-face interviews, with or without plenary briefing and training, and "supervised plenary". Remote elicitation is rarely used, but some recent studies use online face-to-face tools.

Why validate?

Since experts are invoked when quantities of interest are uncertain, the goal of structured expert judgment is a defensible quantification of uncertainty. Confronted with uncertainty, society at large will always harken to prophets, oracles, pundits, blue ribbon panels, and crowd wisdom reputed to have performed well in the past. Scientists and engineers, in contrast, are typically averse to any methodology which eschews empirical validation. Most invocations of expert judgment do not attempt any form of validation, as if the predicate “expert” were validation enough. The classical model’s emphasis on validation is its distinguishing feature. Virtually all validation data with real experts and real applications (as opposed to academic exercises) has been generated by practitioners with the classical model.

One of the first studies with experienced and inexperienced experts [14] showed that expert performance on questions from their field of expertise was not predicted by their performance on “almanac questions”. Experienced and inexperienced experts performed similarly on questions outside their field, but the experienced experts were much better on questions from their field. Hence, validation must be based on assessments of uncertain quantities from the experts’ field, for which we know, or will know, the true values within the time frame of the study. Such quantities are called “calibration” or “seed” variables.

Finding good calibration variables is difficult, and requires a deep dive into the subject matter at hand. The quality of the calibration, and the performance on calibration variables, buttresses the credibility of the whole study. At the end of the day, the problem owner will ask: “if expert A has very good performance on the calibration variables, whereas expert B has very poor performance, am I going to ignore that difference?” If the owner’s answer is “yes” then the calibration variables have failed in their purpose and the effort has been for naught.

Figure 1: Number of experts assessing number of calibration variables, post-2006 data.
Figure 2: P-values of experts from post-2006 studies, arranged from best to worst.

Validation data

Figure 3: P-values of best (blue diamond) and second best (red square) experts, in terms of combined score. Diamonds and squares on the same vertical line belong to the same study. The thick orange horizontal line denotes the traditional 5% rejection threshold.

The best argument for validation of expert judgment is the expert judgment data itself. Whereas the pre-2006 data contains wide variations in numbers of experts and numbers of calibration variables, the 33 independent professionally contracted post-2006 elicitations are more uniform in design, better resourced, better documented and better lend themselves to aggregate presentation. The data comprise in total 320 experts. Figure 1 shows the distribution of experts over the number of assessed calibration variables.

The p-values are sensitive to the power of the statistical test, and hence to the number of calibration variables. These numbers are roughly comparable for experts in the post-2006 data. Figure 2 shows the p-values of all post-2006 experts, arranged from best to worst.
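The dependence on the number of calibration variables can be illustrated with a small sketch. It assumes 5%, 50% and 95% quantile assessments and a likelihood-ratio test (2N times the relative information of the empirical bin frequencies, compared to a chi-square distribution with three degrees of freedom); the counts are hypothetical. The same empirical frequencies are tested once on 10 variables and once on 40.

```python
import math

THEORETICAL = (0.05, 0.45, 0.45, 0.05)  # assumed 5/50/95% quantile format


def p_value(bin_counts):
    """Chi-square (3 d.o.f.) tail probability of 2 * N * relative information."""
    n = sum(bin_counts)
    info = sum((c / n) * math.log((c / n) / p)
               for c, p in zip(bin_counts, THEORETICAL) if c > 0)
    x = 2 * n * info
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Identical frequencies (20%, 30%, 30%, 20%) on 10 and on 40 variables.
p_10 = p_value([2, 3, 3, 2])      # stays above the 5% threshold
p_40 = p_value([8, 12, 12, 8])    # falls far below the 5% threshold
print(p_10, p_40)
```

With 10 variables the hypothetical expert would not be rejected at the 5% level; with 40 variables and the same frequencies the expert would be rejected decisively. Scores are therefore best compared among experts who assessed similar numbers of calibration variables.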

In this summary, 227 of the 320 experts have a statistical accuracy score less than 0.05, which is the traditional rejection threshold for simple hypothesis testing. Half of the experts score below 0.005, and roughly one third fall into the abysmal range below 0.0001. These numbers challenge the assumption that the predicate “expert” is a guarantee of quality with regard to uncertainty quantification.

There is, however, a bright side: 93 of the 320 experts would not be rejected as statistical hypotheses at the 5% level. Figure 3 shows the statistical accuracy of the best expert and second best expert for each of the 33 studies. “Best” is defined in terms of the individual’s combined score, which accounts for both statistical accuracy and informativeness, but is driven by statistical accuracy. The plot arrangement is from best to worst of the best experts.

25 of the 33 studies have at least one, and usually two or more, experts whose statistical accuracy is acceptable. Simply identifying those experts and relying on them would be a big improvement over un-validated expert judgment (spotting good performers without measuring performance is a fool’s errand) [17].

An oft heard suggestion is “Why not ask the experts to weight each other?” They often know each other or each other’s work, and they may well concur on whose opinion should weigh heaviest. A moment’s reflection counsels caution, however. Could an expert with poor performance identify those other experts who perform well? Data on this question is sparse, but in a few studies experts were asked to rate each other, and the “group mutual rankings” were negatively correlated with the rankings in terms of performance [18]. Reference [19] compared the performance of various weighting schemes, including “citation weights” based on experts’ citation numbers, and found performance comparable to equal weighting.

In-sample validation

Figure 4: Comparison of statistical accuracy (top) and combined scores (bottom) of 33 post-2006 expert judgment studies.

Data used to gauge expert performance can also be used to measure performance of combination schemes. With regard to PW these are in-sample comparisons, as the validation data is also used to initialize the combination model. Were PW not superior in-sample, there would be little point in conducting out-of-sample validation. Lichtendahl et al. [20] suggest that averaging experts' quantiles might be superior to EW. Averaging quantiles is easier to compute than averaging distributions, and is frequently employed by unwary practitioners. Averaging quantiles is mathematically equivalent to Harmonically Weighted (HW) combinations of distributions. Figure 4 shows the p-values and combined scores of PW, EW, and HW for the 33 post-2006 studies, arranged according to PW scores. EW has the best combined score in 3 cases, HW in 4 and PW in 26. In 18 cases (55%) the hypothesis that HW is statistically accurate would be rejected at the 0.05 level. In 8 cases rejection would be at the 0.001 level.
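The difference between averaging quantiles and averaging distributions can be seen in a small sketch. It assumes two hypothetical experts who disagree strongly, a hypothetical analyst-chosen range [LO, HI], and piecewise-linear cumulative distribution functions through the stated 5%, 50% and 95% quantiles. These are modelling choices made for illustration only.

```python
from bisect import bisect_right

PROBS = (0.0, 0.05, 0.50, 0.95, 1.0)  # cumulative probability at each knot


def make_cdf(quantiles, lo, hi):
    """Piecewise-linear CDF through (lo, q05, q50, q95, hi)."""
    xs = (lo, *quantiles, hi)

    def cdf(x):
        if x <= lo:
            return 0.0
        if x >= hi:
            return 1.0
        i = bisect_right(xs, x) - 1
        frac = (x - xs[i]) / (xs[i + 1] - xs[i])
        return PROBS[i] + frac * (PROBS[i + 1] - PROBS[i])

    return cdf


def pooled_quantile(cdfs, p, lo, hi, tol=1e-9):
    """Quantile of the equally weighted average of the CDFs, by bisection."""
    a, b = lo, hi
    while b - a > tol:
        mid = (a + b) / 2
        if sum(f(mid) for f in cdfs) / len(cdfs) < p:
            a = mid
        else:
            b = mid
    return (a + b) / 2


LO, HI = -5.0, 17.0                # hypothetical range for the variable
expert_a = (0.0, 1.0, 2.0)         # 5%, 50%, 95% quantiles
expert_b = (10.0, 11.0, 12.0)

# Averaging quantiles: the 90% interval keeps the experts' own width.
quantile_avg = [(qa + qb) / 2 for qa, qb in zip(expert_a, expert_b)]

# Averaging distributions (EW): the 90% interval spans both experts.
cdfs = [make_cdf(expert_a, LO, HI), make_cdf(expert_b, LO, HI)]
ew_05 = pooled_quantile(cdfs, 0.05, LO, HI)
ew_95 = pooled_quantile(cdfs, 0.95, LO, HI)
print(quantile_avg, (ew_05, ew_95))
```

In this example the quantile average has a 90% interval of width 2 located between the two experts, a region that both experts consider very unlikely, while the equally weighted distribution has a 90% interval more than 10 units wide that covers both experts' central assessments.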

Out-of-sample validation

Figure 5: Out of sample p-values (left) and combined scores (right) of PW and EW, aggregated over same sized training sets, as percentage of all calibration variables, and aggregated over the 33 post-2006 studies.
Figure 6: Ratios of combined scores of PW over EW for training sets sized at 80% of the calibration variables.

Since the variables of interest are rarely observed within the time frame of the studies, out-of-sample validation mostly reduces to cross validation, whereby the model is initialized on a subset of the calibration variables (training set) and scored on the complementary set (test set). The difficulty is in choosing the training/test set split. If the training set is small, then the ability to resolve expert performance is small and the PW of each training set poorly resembles the PW of the real study. If the test set is small, then the ability to resolve differences in combination schemes is small. That said, Eggstaff et al. [21] considered all splits of training/test sets, and showed that PW outperformed EW out-of-sample.

There is an out-of-sample penalty for PW’s statistical accuracy score. Figure 5 (left) shows out-of-sample statistical accuracy of PW and EW as function of training set size. Both scores increase due to loss of statistical power in the test set, but PW increases faster. As out-of-sample PW is better able to resolve expert performance it approaches the in-sample PW. The combined score (right) shows that out-of-sample dominance of PW grows with training set size.

With n calibration variables, the total number of splits (excluding the empty set and the entire set) is 2^n − 2, which quickly becomes unmanageable as n grows. Recent research suggests that using 80% of the calibration variables for the training set is a good compromise of competing interests. Figure 6 shows the ratios of combined scores for PW / EW per study, aggregated over all splits with 80% of the calibration variables in the training set.
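The counting argument can be checked with a short sketch. The function names are hypothetical, and rounding the training-set size to the nearest integer is an assumption made here for studies where 80% of n is not a whole number.

```python
from itertools import combinations
from math import comb


def count_all_splits(n):
    """Number of training sets that are neither empty nor the full set."""
    return 2 ** n - 2


def splits_with_training_fraction(n, fraction=0.8):
    """Yield (training, test) index tuples with a fixed training-set size."""
    k = round(fraction * n)  # assumption: round to the nearest integer
    indices = range(n)
    for training in combinations(indices, k):
        chosen = set(training)
        test = tuple(i for i in indices if i not in chosen)
        yield training, test


n = 10
all_splits = count_all_splits(n)                          # 1022
eighty_percent = list(splits_with_training_fraction(n))   # 45 splits
print(all_splits, len(eighty_percent), comb(n, 8))
```

For 10 calibration variables there are 1022 possible splits but only 45 with eight variables in the training set. Each training set would be used to compute performance weights, and the resulting combination would be scored on the remaining two variables.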

Performance weighting (PW) is demonstrably superior to simple combination schemes (EW, HW) that do not use performance data. However, PW places greater demands on the analyst, both with regard to generating meaningful calibration variables and explaining the methods and results. Finding competent and experienced analysts is the greatest bottleneck for applications. Refinement of performance measures, and improvements in cross-validation methods and software, would also be welcome.

Websites

All mathematical details, an overview of post-2006 applications, recent publications and links to data are available online.


  • EXCALIBUR (website) software for processing expert judgment data with the classical model. Freely available.


  1. Burgman, M.A. (2016). Trusting judgments; How to get the best out of experts. Cambridge: Cambridge University Press. p. 214.
  2. Ungar, L.; Mellers, B.; Satopää, V.; Tetlock, P.; Tetlock, J. (2012). The Good Judgment Project: A Large Scale Test of Different Methods of Combining Expert Predictions. AAAI Fall Symposium Series (RSS).
  3. Rasmussen, N.C. (1975). "Reactor safety study. An assessment of accident risks in U. S. commercial nuclear power plants". WASH-1400 (NUREG-75/014). doi:10.2172/7134131.
  4. Cooke, R.M. (2012). "Uncertainty Analysis Comes to Integrated Assessment Models for Climate Change…and Conversely". Climatic Change. 117 (3 2013): 467–479. doi:10.1038/nclimate2466.
  5. Cooke, R.M. (1991). Experts in Uncertainty; Opinion and Subjective Probability in Science. New York Oxford: Oxford University Press. p. 321.
  6. Cooke, R.M.; Goossens, L.H.J. (2008). "TU Delft Expert Judgment Data Base". Reliability Engineering & System Safety. 93 (5): 657–674.
  7. Aspinall, W.P. (2010). "A route to more tractable expert advice". Nature. 463: 294–295.
  8. Sutherland, W.J.; Burgman, M. (2015). "Use experts wisely". Nature. 526: 317–318. doi:10.1038/526317a.
  9. Aspinall, W.P.; Cooke, R.M.; Havelaar, A.H.; Hoffmann, S.; Hald, T. (1 March 2016). "Evaluation of a Performance-Based Expert Elicitation: WHO Global Attribution of Foodborne Diseases". PLOS ONE. 11: e0149817. doi:10.1371/journal.pone.0149817. PMC 4773223. PMID 26930595.
  10. Hald, T.; Aspinall, W.P.; Devleesschauwer, B.; Cooke, R.M.; Corrigan, T.; Havelaar, A.H.; Gibb, H.; Torgerson, P.; Kirk, M.; Angulo, F.; Lake, R.; Speybroeck, N.; Hoffmann, S. (2015). "World Health Organization estimates of the relative contributions of food to the burden of disease due to selected foodborne hazards: a structured expert elicitation". PLOS ONE. 11: e0145839. doi:10.1371/journal.pone.0145839. PMC 4718673. PMID 26784029.
  11. Aspinall, W.P.; Loughlin, S.C.; Michael, F.V.; Miller, A.D.; Norton, K.C.; Sparks, R.S.J.; Young, S.R. (2002). "The Montserrat Volcano Observatory: its evolution, organisation, role and activities". In Druitt, T.H.; Kokelaar, B.P. The eruption of Soufrière Hills Volcano, Montserrat, from 1995 to 1999. Geological Society, London, Memoir. pp. 71–92.
  12. Aspinall, W.P. (2006). "Structured elicitation of expert judgment for probabilistic hazard and risk assessment in volcanic eruptions". In Mader, H.M.; Coles, S.G.; Connor, C.B.; Connor, L.J. Statistics in Volcanology. London: IAVCEI. pp. 15–30.
  13. Wadge, G.; Aspinall, W.P. (2014). "A Review of Volcanic Hazard and Risk Assessments at the Soufrière Hills Volcano, Montserrat from 1997 to 2011". In Wadge, G.; Robertson, R.E.A.; Voight, B. The Eruption of Soufrière Hills Volcano, Montserrat, from 2000 to 2010: Geological Society Memoirs, Vol. 39. London: Geological Society. pp. 439–456.
  14. Cooke, R.M.; Mendel, M.; Thijs, W. (1988). "Calibration and Information in Expert Resolution". Automatica: 87–94. doi:10.1016/0005-1098(88)90011-8.
  15. Cooke, R.M.; Wittmann, M.E.; Rothlisberger, J.D.; Rutherford, E.S.; Zhang, H.; Mason, D.; Lodge, D.M. (2014). "Structured expert judgment to forecast species invasions: Bighead and Silver Carp in Lake Erie". Conservation Biology. 29: 187–197. doi:10.1111/cobi.12369.
  16. Cooke, R.M. (2015). "Messaging climate change uncertainty (with supplementary Online Material)". Nature Climate Change. 5: 8–10. doi:10.1038/nclimate2466.
  17. Flandoli, F.; Giorgi, E.; Aspinall, W.P.; Neri, A. (2011). "Comparison of a new expert elicitation model with the Classical Model, equal weights and single experts, using a cross-validation technique". Reliability Engineering and System Safety. 96 (7): 1292–1310. doi:10.1016/j.ress.2011.05.012.
  18. Cooke, R.M.; Aspinall, W.P. (2013). "Quantifying scientific uncertainty from expert judgement elicitation". In Hill, L.; Rougier, J.C.; Sparks, R.S.J. Risk and Uncertainty assessment in Natural Hazards. Cambridge University Press. pp. 64–99.
  19. Cooke, R.M.; ElSaadany, S.; Huang, X. (2008). "On the Performance of Social Network and Likelihood Based Expert Weighting Schemes". Reliability Engineering and System Safety. 93 (5): 745–756. doi:10.1016/j.ress.2007.03.017.
  20. Lichtendahl Jr., K.C.; Grushka-Cockayne, Y.; Winkler, R.L. (July 2013). "Is It Better to Average Probabilities or Quantiles?". Management Science. 59 (7): 1594–1611. doi:10.1287/mnsc.1120.1667. ISSN 1526-5501.
  21. Eggstaff, J.W.; Mazzuchi, T.A.; Sarkani, S. (2014). "The Effect of the Number of Seed Variables on the Performance of Cooke's Classical Model". Reliability Engineering and System Safety. 121: 72–82. doi:10.1016/j.ress.2013.07.015.