
Automatic item generation

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by 95.223.45.161 (talk) at 11:49, 14 November 2018. The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Automatic item generation (AIG) is a young field of study linking psychometrics with computer programming. It consists of the algorithm-controlled creation of items (the basic units used to test individuals) from a so-called item model (IM).[1] Instead of writing each item individually, computer algorithms generate families of items from a smaller set of parent IMs.[2][3] AIG is expected to reduce costs, since conventional item writing is expensive for large testing organizations.[4] It also greatly increases the number of items that can be produced in the time a human item writer would need; items can even be created instantly (on the fly) during computerized adaptive testing. Parallel test forms are easily created through AIG to reduce overexposure of any single group of items, thus enhancing test security. AIG is also expected to produce items across a wide range of difficulty levels, avoid construction errors, and improve the comparability of items thanks to a more systematic definition of the prototypical IM.[5][6]
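The template-driven idea behind an item model can be sketched as follows. The item model, its variable slots, and the word-problem content below are hypothetical illustrations of the general technique, not taken from any published generator:

```python
import itertools

# A hypothetical item model (IM): a stem template with variable slots.
# Instantiating every combination of slot values yields a "family" of
# items descended from this single parent IM.
ITEM_MODEL = {
    "stem": "A train travels {speed} km/h for {hours} hours. How far does it go?",
    "variables": {
        "speed": [60, 80, 100],
        "hours": [2, 3, 4],
    },
}

def generate_items(model):
    """Generate the full item family defined by one parent item model."""
    names = list(model["variables"])
    items = []
    for values in itertools.product(*(model["variables"][n] for n in names)):
        slots = dict(zip(names, values))
        items.append({
            "stem": model["stem"].format(**slots),
            "key": slots["speed"] * slots["hours"],  # correct answer in km
        })
    return items

items = generate_items(ITEM_MODEL)
print(len(items))          # 9 items from a single parent model
print(items[0]["stem"])
```

Real generators add constraints (e.g. excluding implausible slot combinations) and distractor generation on top of this basic template-instantiation step.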

General state of the art

Regarding research on AIG, Gierl, Lai, and Turner[7] presented new AIG methods applied to medical education testing, using an AIG program called the Item Generator (IGOR[8]) to create 1,248 multiple-choice medical knowledge items from a single IM. The cognitive model underlying the IM contained the knowledge, skills, and content required to make medical diagnoses. Arendasy, Sommer, and Mayr[9] reported the high psychometric quality of automatically generated verbal stimulus items after administering them to two samples of German- and English-speaking participants, respectively. The item sets administered to the two groups were based on a common set of interlanguage anchor items, which facilitated cross-lingual comparisons of performance. Holling, Bertling, and Zeuch[10] used probability theory to automatically generate mathematical word problems with predetermined expected difficulties. The items fit the Rasch[11] model, and their difficulties could be explained by the Linear Logistic Test Model (LLTM[12]) as well as by the random-effects LLTM. Holling, Blank, Kuchenbäcker, and Kuhn[13] conducted a similar study with statistical word problems, but without using AIG. Arendasy and his colleagues[14][15] presented studies on automatically generated algebra word problems and examined how a quality-control framework for AIG can affect the measurement quality of the items.
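The LLTM idea recurring in these studies can be illustrated with a short sketch: each item's Rasch difficulty b_i is decomposed into a weighted sum of the generation rules the item contains, b_i = Σ_k q_ik·η_k. The Q-matrix and rule difficulties below are made-up values for illustration only:

```python
import math

# Hypothetical Q-matrix: q_ik = 1 if item i (row) uses rule k (column).
Q = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]
eta = [-0.5, 0.8, 1.2]  # assumed rule ("basic parameter") difficulties

def lltm_difficulty(q_row, eta):
    """LLTM-predicted item difficulty from the item's rule composition."""
    return sum(q * e for q, e in zip(q_row, eta))

def rasch_prob(theta, b):
    """Rasch model: probability that a person with ability theta
    answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

difficulties = [lltm_difficulty(row, eta) for row in Q]
print([round(b, 2) for b in difficulties])   # [-0.5, 0.3, 1.5]
print(round(rasch_prob(0.0, difficulties[0]), 3))  # 0.622
```

In the cited studies the η parameters are estimated from response data rather than fixed in advance; a good LLTM fit means rule composition alone accounts for most of the variation in item difficulty.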

State of the art of the automatic generation of figural items

Four-rule-based figural analogy stem automatically generated with the IMak package (for more information, see Blum & Holling, 2018).

Figural AIG in particular is becoming popular. Nevertheless, not all known figural item generators have been validated with regard to psychometric quality; only those with reported psychometric validation are mentioned here.

The Item Maker (IMak) is an R package for plotting figural analogy items that is free to download. According to Blum and Holling,[16] the psychometric properties of 23 IMak-generated items were satisfactory, and item difficulty based on the generating rules could be predicted by means of the Linear Logistic Test Model (LLTM). Arendasy and his colleagues[17][18] studied gender differential item functioning (DIF) and gender differences in automatically generated mental rotation items. They manipulated item design features that had exhibited gender DIF in previous studies and showed that effect-size estimates of gender differences were compromised by different kinds of gender DIF attributable to specific item design features. Freund, Hofer, and Holling[19] automatically generated twenty-five 4×4 square matrix items with MatrixDeveloper[20] and administered them to 169 individuals. The items showed good Rasch model fit, and rule-based generation could explain item difficulty. Arendasy[21] used IRT principles to study possible violations of the psychometric quality of automatically generated visuospatial reasoning items. For this purpose, he presented two programs, the Figural Matrices Generator (GeomGen[22]) and the Endless Loop Generator (EsGen), and concluded that GeomGen was more suitable for AIG because it accounts for such violations during item generation. In parallel research using GeomGen, Arendasy and Sommer[23] found that varying only certain aspects of the perceptual organization of items could influence the performance of respondents at specific ability levels, and that it affected several psychometric quality indices. These results may call into question the unidimensionality assumption of figural matrix items in general.
The first known matrix item generator was designed by Embretson,[24][25] and her automatically generated items demonstrated good psychometric properties, as shown by Embretson and Reise.[26] She also proposed a model for adequate online item generation.

References

  1. ^ Gierl, M.J., & Haladyna, T.M. (2012). Automatic item generation, theory and practice. New York, NY: Routledge Chapman & Hall.
  2. ^ Glas, C.A.W., van der Linden, W.J., & Geerlings, H. (2010). Estimation of the parameters in an item-cloning model for adaptive testing. In W.J. van der Linden, & C.A.W. Glas (Eds.). Elements of adaptive testing (pp. 289-314). DOI: 10.1007/978-0-387-85461-8_15.
  3. ^ Gierl, M.J., & Lai, H. (2012). The role of item models in automatic item generation. International journal of testing, 12(3), 273-298. DOI: 10.1080/15305058.2011.635830.
  4. ^ Rudner, L. (2010). Implementing the graduate management admission test computerised adaptive test. In W.J. van der Linden, and C.A.W. Glas (Eds.). Elements of adaptive testing (pp. 151-165). DOI: 10.1007/978-0-387-85461-8_15.
  5. ^ Irvine, S. (2002). The foundations of item generation for mass testing. In S.H. Irvine, & P.C. Kyllonen (Eds.). Item generation for test development (pp. 3-34). Mahwah: Lawrence Erlbaum Associates.
  6. ^ Lai, H., Alves, C., & Gierl, M.J. (2009). Using automatic item generation to address item demands for CAT. In D.J. Weiss (Ed.), Proceedings of the 2009 GMAC Conference on Computerized Adaptive Testing. Web: www.psych.umn.edu/psylabs/CATCentral.
  7. ^ Gierl, M.J., Lai, H., & Turner, S.R. (2012). Using automatic item generation to create multiple-choice test items. Medical education, 46(8), 757-765. DOI: 10.1111/j.1365-2923.2012.04289.x
  8. ^ Gierl, M.J., Zhou, J., & Alves, C. (2008). Developing a taxonomy of item mode types to promote assessment engineering. J technol learn assess, 7(2), 1-51.
  9. ^ Arendasy, M.E., Sommer, M., & Mayr, F. (2011). Using automatic item generation to simultaneously construct German and English versions of a Word Fluency Test. Journal of cross-cultural psychology, 43(3), 464-479. DOI: 10.1177/0022022110397360.
  10. ^ Holling, H., Bertling, J.P., & Zeuch, N. (2009). Automatic item generation of probability word problems. Studies in educational evaluation, 35(2-3), 71-76.
  11. ^ Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press.
  12. ^ Fischer, G.H. (1973). The linear logistic test model as an instrument of educational research. Acta Psychologica, 37, 359-374. DOI: 10.1016/0001-6918(73)90003-6.
  13. ^ Holling, H., Blank, H., Kuchenbäcker, K., & Kuhn, J.T. (2008). Rule-based item design of statistical word problems: a review and first implementation. Psychology science quarterly, 50(3), 363-378.
  14. ^ Arendasy, M.E., Sommer, M., Gittler, G., & Hergovich, A. (2006). Automatic generation of quantitative reasoning items. A pilot study. Journal of individual differences, 27(1), 2-14. DOI: 10.1027/1614-0001.27.1.2.
  15. ^ Arendasy, M.E., & Sommer, M. (2007). Using psychometric technology in educational assessment: the case of a schema-based isomorphic approach to the automatic generation of quantitative reasoning items. Learning and individual differences, 17(4), 366-383. DOI: 10.1016/j.lindif.2007.03.005.
  16. ^ Blum, D., & Holling, H. (2018). Automatic generation of figural analogies with the IMak package. Frontiers in psychology, 9(1286), 1-13. DOI: 10.3389/fpsyg.2018.01286.
  17. ^ Arendasy, M.E., & Sommer, M. (2010). Evaluating the contribution of different item features to the effect size of the gender difference in three-dimensional mental rotation using automatic item generation. Intelligence, 38(6), 574-581. DOI:10.1016/j.intell.2010.06.004.
  18. ^ Arendasy, M.E., Sommer, M., & Gittler, G. (2010). Combining automatic item generation and experimental designs to investigate the contribution of cognitive components to the gender difference in mental rotation. Intelligence, 38(5), 506-512. DOI:10.1016/j.intell.2010.06.006.
  19. ^ Freund, P.A., Hofer, S., & Holling, H. (2008). Explaining and controlling for the psychometric properties of computer-generated figural matrix items. Applied psychological measurement, 32(3), 195-210. DOI: 10.1177/0146621607306972.
  20. ^ Hofer, S. (2004). MatrixDeveloper. Münster, Germany: Psychological Institute IV. Westfälische Wilhelms-Universität.
  21. ^ Arendasy, M. (2005). Automatic generation of Rasch-calibrated items: figural matrices test GEOM and Endless-Loops Test EC. International journal of testing, 5(3), 197-224.
  22. ^ Arendasy, M. (2002). GeomGen: Ein Itemgenerator für Matrizentestaufgaben [GeomGen: an item generator for matrix test items]. Vienna: Eigenverlag.
  23. ^ Arendasy, M.E., & Sommer, M. (2005). The effect of different types of perceptual manipulations on the dimensionality of automatic generated figural matrices. Intelligence, 33(3), 307-324. DOI: 10.1016/j.intell.2005.02.002.
  24. ^ Embretson, S.E. (1998). A cognitive design system approach to generating valid tests: application to abstract reasoning. Psychological methods, 3(3), 380-396.
  25. ^ Embretson, S.E. (1999). Generating items during testing: psychometric issues and models. Psychometrika, 64(4), 407-433.
  26. ^ Embretson, S.E., & Reise, S.P. (2000). Item Response Theory for psychologists. Mahwah: Lawrence Erlbaum Associates.