

Psychometric Properties of Psychological Assessment Measures


By

Dr. Bronwyn Celeste Fabrie
Criminologist and Psychologist, Germany




Contents


1. Introduction
2. Definition of key concepts
2.1 Different types of norms
2.2 Criterion-referenced tests
2.3 Psychological measures
2.4 Reliability and validity of psychological measures
3. Various types of norms
3.1 Mental age scales and grade equivalents
3.2 Percentiles
3.3 Stanines and sten scales
3.4 Deviation IQ
4. Criterion-referenced tests
4.1 Expectancy tables
5. Constructing a psychological measure
5.1 The planning phase
5.2 Format of the items
5.3 Item analysis phase: item difficulty value, item discrimination value, item-total correlation
6. Reliability of a psychological measure
6.1 Correlation coefficient
6.2 Statistical significance
6.3 Reliability coefficient
6.4 Observed score
6.5 True and error scores
6.6 True variance and error variance in a test score




7. Stability of a test – the advantages and limitations
7.1 Test-retest reliability
7.2 Alternate-form reliability
8. Internal consistency of a test – advantages and limitations
8.1 Split-half reliability
8.2 Kuder-Richardson 20 and Cronbach's Alpha
9. Validity of a psychological measure
9.1 Content validity
9.2 Criterion-related validity
9.3 Predictive validity
9.4 Concurrent validity
10. Major methods of establishing construct validity
10.1 Developmental changes
10.2 Correlations with other tests
10.3 Factor analysis
10.4 Internal consistency
10.5 Convergent and discriminant validation
10.6 Experimental interventions
11. Summary
12. Conclusion
References




1. Introduction


To understand what is meant by the psychometric properties of psychological assessment measures, it is necessary to consider the two component concepts separately. In other words, what do we mean when we speak about psychometric properties, and what is implied by the concept of psychological measures?

Psychometrics is, in essence, the study of mental traits and behavioural characteristics expressed in amounts or scores. Such tests can take the form of intelligence scales, or ratings of different attitudes against a specific population standard or norm. In short, psychometrics involves the assessment of human data according to known, specific standards applied by the researcher or scientist.

Without psychometric theory, it would be difficult to develop a reliable and valid psychological measure. It would be impossible to study human intelligence without comparing a person or group to a normative sample. Psychological assessment measures, on the other hand, involve five different processes: diagnosis, classification, planning of treatment, self-knowledge, and research and program evaluation (Gregory, 2000, p. 41).

The aim of this essay is to demonstrate the principles of psychometric theory and how these principles are integrated to explain how a psychological measure is developed. The following major concepts, together with their related values and tests, will be discussed: the different types of norms, criterion-referenced tests, psychological measures, and the reliability and validity of these measures, as well as the advantages and limitations of each test.


2. Definition of key concepts

Our discussion begins with an overview of the following concepts:

2.1 Different types of norms

Norms can be defined as a table of values against which an individual's raw score is compared with the performance of others in a particular group. Norms help us to make standard comparisons, whereby we can judge how much a person's score deviates from the average of the population or of a representative group in a sample (Rosnow & Rosenthal, 1999, p. 222).

2.2 Criterion-referenced tests

These tests are essentially the opposite of norm-referenced tests. Criterion-referenced tests are concerned mainly with the personal achievements of the tested person rather than with comparisons against the performance or abilities of external groups. They are, for example, well suited to assessing individual educational needs.

2.3 Psychological measures

To recapitulate: psychological measures, or assessment measures, involve quantification techniques. Psychological measurement is a dynamic process, forever changing yet retaining its original structure. A psychological measurement can be a sophisticated process with the following characteristics, as mentioned by Gregory (2000, p. 30):

- scores or categories
- behaviour samples
- norms or standards
- standardized procedures
- prediction of nontest behaviour

2.4 Reliability and validity of psychological measures

Test reliability can be defined as the degree to which a test consistently measures what it is supposed to measure. Unfortunately, problems such as random error can greatly influence the reliability of an instrument (discussed further under section 6).


Validity, in turn, often pertains to the contents of a test or measuring instrument. For instance, if a particular personality trait is being measured by a certain test, then it is expected that this test actually measures what it is supposed to measure; otherwise it is invalid.

3. Various types of norms

A "normed score" on its own means very little to test takers if they do not know what kind of norms they are being tested against. It is up to the tester to explain a test taker's raw score against the background of a representative population group. The following norms will be discussed briefly to demonstrate the importance of understanding a person's score on a norm-referenced test.

3.1 Mental age scales and grade equivalents

These two types of norms are both referred to as "developmental norms", but with a slight difference. Mental age scales encourage same-age test comparisons. In other words, the performance level of a child who is 10 years of age will be compared to the performance level of other children of the same age group. Age norms are a convenient way of testing children's developmental characteristics, which are dynamically changing in comparison to the more stable traits of adults.

Grade equivalents are quite similar to age norms. However, instead of checking the aptitude or ability of a similar age group, grade norms describe the standard of test performance for each individual grade represented in the normative sample. This means that school performance is measured against a normative sample from the same class or grade in the school, which is a convenient method of testing a child's academic performance against a similar grade equivalent (Gregory, 2000, p. 71).


3.2 Percentiles

A percentile is a relative measure that can range from 0 to 100. A percentile rank describes the percentage of scores in the sample that fall below a given score. For example, the 50th percentile (also referred to as the "median") is the point below which 50% of the scores fall and above which the other 50% lie (Rosnow & Rosenthal, 1999, p. 233).
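As a minimal sketch in Python (the norm-group scores are invented for illustration), a percentile rank simply reports where one raw score stands among the scores of the norm group:

    from scipy.stats import percentileofscore

    norm_group = [55, 60, 62, 64, 67, 70, 72, 75, 80, 88]  # hypothetical norm-group raw scores
    raw = 70
    pr = percentileofscore(norm_group, raw, kind="rank")  # % of scores at or below raw, ties averaged
    print(f"A raw score of {raw} falls at about the {pr:.0f}th percentile")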

3.3 Stanines and sten scales (standard scores)

Stanines transform test takers' raw scores onto a 1-9 point scale; the 10-unit sten scale recommended by Canfield (1951) is essentially a slight variation on the stanine scale. Both were useful devices for norming tests in the pre-computer age. Stanines always have a mean of 5 and a standard deviation of approximately 2. Scores are ranked from the lowest to the highest, with the bottom 4% of scores receiving a stanine of 1, and so forth (Gregory, 2000, p. 68).
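A rough sketch of the idea follows; it assumes the common linear approximation (rescaling z-scores to mean 5, SD 2) rather than the exact 4-7-12-17-20-17-12-7-4 percentage bands, and the helper name stanines is illustrative:

    import numpy as np

    def stanines(raw_scores):
        """Map raw scores to stanines: rescale z-scores to mean 5, SD 2, clip to 1-9."""
        raw_scores = np.asarray(raw_scores, dtype=float)
        z = (raw_scores - raw_scores.mean()) / raw_scores.std()
        return np.clip(np.round(2 * z + 5), 1, 9).astype(int)

    print(stanines([55, 60, 62, 64, 67, 70, 72, 75, 80, 88]))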

3.4 Deviation IQ

This norm is often used to express the scores obtained on intelligence tests. However, this kind of scale is often misinterpreted by inexperienced persons wanting an easy "labelling" system to describe a person's intelligence as above or below the marker of 100. It would be foolish to use a single IQ test as a final verdict on one's intelligence or character. Using obsolete tests, for example, can either inflate or deflate a participant's IQ score.
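A minimal sketch of the underlying arithmetic (the figures are invented; the scale SD of 15 is the Wechsler convention, and the helper name is illustrative):

    def deviation_iq(raw, norm_mean, norm_sd, scale_sd=15):
        """Rescale a raw score to the deviation-IQ metric (mean 100)."""
        z = (raw - norm_mean) / norm_sd
        return round(100 + scale_sd * z)

    # One SD above the norm-group mean maps to an IQ of 115.
    print(deviation_iq(raw=34, norm_mean=28, norm_sd=6))  # -> 115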

4. Criterion-referenced tests

A criterion-referenced test is the opposite of a norm-referenced test. Where the latter measures an individual's performance against a representative group, the former measures a person's mastery or nonmastery of a particular content domain, such as a task of dexterity or memory performance (Gregory, 2000, p. 74). An expectancy table is a good example of such a test.


4.1 Expectancy tables

This kind of table reflects, at a practical glance, the relationship between candidates' predictor results and a specific criterion. For example, expectancy tables chart the relationship between an individual's test scores and what he or she later accomplishes, whether following a certain career path or achieving good grades in an upcoming college entrance exam. Such tables also have limitations, however, mainly because they reflect the results of large representative groups, and hence the social or school standards of their time. Such tables, like most other norm tables, require constant updating and checking in order to deliver what they are supposed to: reliable and valid results (Gregory, 2000, p. 72).
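As a minimal sketch (in Python with pandas, and entirely hypothetical data), an expectancy table is essentially a cross-tabulation of predictor-score bands against later criterion outcomes:

    import pandas as pd

    # Hypothetical entrance-test bands and later first-year outcomes.
    df = pd.DataFrame({
        "test_band": ["low", "low", "mid", "mid", "mid", "high", "high", "high"],
        "passed":    [False, False, False, True, True, True, True, False],
    })
    # Rows: predictor band; columns: criterion outcome; cells: row proportions.
    table = pd.crosstab(df["test_band"], df["passed"], normalize="index")
    print(table)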


5. Constructing a psychological measure

5.1 The planning phase

This is probably the most important step before constructing a psychological measure. The planning phase involves very careful decision making. Just as an engineer plans every move and step before delivering a proposal for a new railway bridge, the tester planning a psychological measure must consider, for instance, the choice, format, and number of the items to be included in the measure.

5.2 Format of the items

Test construction is not simply a matter of throwing any kind of item into a common batch. It is crucial to decide what type of item format is required. There is no value in testing a personality trait with items that measure physical speed, such as running 500 metres in a certain time, and it would make no sense to test a preschooler's performance on an arithmetic test meant for an 8-year-old school child. The length of the measure should likewise be suitable for the particular group: it would be a waste of time to give someone with a major depressive disorder (and on strong medication) a test requiring 3 hours of heavy concentration.


In other words, test items can come in different formats and styles such as multiple-choice questionnaires, true-false items, forced-choice, closed/open-response and so forth.

It must be remembered, however, that item selection is never a perfect system; every assessment test involves some item measurement error. That is why careful consideration is applied to the planning and implementation of item selection from the beginning to the end stages, to keep measurement error as small as possible.

5.3 Item analysis phase

The item analysis phase involves three types of item statistics:

- item difficulty value
- item discrimination value
- item-total correlation

These item statistics help the researcher to choose the most suitable items for the final measure. It is always wise to adapt tests on a homogeneous basis, which means taking into account the demographic features of the test takers, such as age, sex, social and economic background, educational status, and, most importantly, cultural differences.

The item difficulty value is the proportion of examinees who answer a given test question correctly. If only a minimal percentage get it wrong, the item is clearly too easy and should be adjusted; the reverse is also true.

The discrimination value shows how well an item discriminates between those who obtain high and low scores on the complete test.

The item-total correlation is a point-biserial correlation (a variant of the Pearson r), which expresses the relationship between two variables. The higher the correlation between a single item and the total score, the better the item is considered with regard to internal consistency. In other words, a good measure should have homogeneous items with a high level of internal consistency.
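A minimal sketch of all three statistics on a toy matrix of scored responses (1 = correct, 0 = wrong; the data and the upper-half/lower-half split for discrimination are illustrative choices):

    import numpy as np

    # Rows are examinees, columns are items (invented responses).
    responses = np.array([
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 0],
        [0, 0, 0, 1],
    ])
    totals = responses.sum(axis=1)

    # Item difficulty: proportion of examinees answering each item correctly.
    difficulty = responses.mean(axis=0)

    # Discrimination: item difficulty in the top half minus the bottom half
    # of examinees, ranked by total score.
    order = np.argsort(totals)
    half = len(order) // 2
    discrimination = (responses[order[-half:]].mean(axis=0)
                      - responses[order[:half]].mean(axis=0))

    # Item-total (point-biserial) correlation, with the item removed
    # from the total to avoid inflating the coefficient.
    item_total = [np.corrcoef(responses[:, j], totals - responses[:, j])[0, 1]
                  for j in range(responses.shape[1])]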


6. Reliability of a psychological measure

Reliability, according to Gregory (2000), "expresses the relative influence of true and error scores on obtained test scores." To understand what reliability means, imagine a scale weighing a kilogram of grapes. The greengrocer weighs the grapes twice in a row and each time gets a slightly different reading, never quite the same as the first weighing. In other words, reliability is not an absolute measure; there will always be a slight inconsistency between the first measurement and the second, and such fluctuations are a matter of degree. Repeated results help the tester to confirm some degree of accuracy in scores, but this in turn means little without validity, which is discussed later in this essay.


6.1 Correlation coefficient

A correlation coefficient r takes values ranging from -1.00 to +1.00. A value of +1.00 indicates a perfect positive linear relationship between two sets of test results, and -1.00 a perfect negative one. A zero correlation occurs when two variables, such as height and reaction time, have no relationship to one another.

To assess the reliability of psychological test scores, the same test can be administered twice, in the test-retest method. We can then compare the variance in the obtained scores with the variance in the true scores.

6.2 Statistical significance

This goes beyond merely computing a correlation coefficient between two variables. The psychometrician is not interested only in a small sample of test persons, but wants to generalize the results to a larger population. The larger the sample size, the more readily a correlation reaches statistical significance; it is therefore better to aim for fewer errors by increasing the size and homogeneity of the sample. If a correlation is significant at the .01 level, we know that the probability of error is 1 in 100, which is a rather good estimate.
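A minimal simulation sketch of both points, with invented figures: each examinee has a stable true score, each administration adds fresh random error, and with a reasonably large sample the test-retest correlation is easily significant at the .01 level:

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    true = rng.normal(100, 15, size=200)       # stable trait (true scores)
    test1 = true + rng.normal(0, 5, size=200)  # first administration plus error
    test2 = true + rng.normal(0, 5, size=200)  # retest plus fresh error

    r, p = pearsonr(test1, test2)
    print(f"test-retest r = {r:.2f}, p = {p:.4g}")  # r near .90, p far below .01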


6.3 Reliability coefficient

The reliability coefficient is the proportion of true score variance (the factors that are consistent) in the total variance of the test results. In plain terms, the total variance is the sum of the true score variance (the stable attribute we are testing) and the error score variance (the errors of measurement).
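With the invented figures from the simulation above (true-score SD of 15, error SD of 5 per administration), the arithmetic works out as follows, matching the test-retest correlation of roughly .90 that the simulation prints:

    reliability = true variance / (true variance + error variance)
                = 15^2 / (15^2 + 5^2)
                = 225 / 250
                = 0.90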

6.4 Observed score

The observed or obtained score can be drastically altered by random events, that is, by measurement errors. To limit this problem, the researcher must reduce as many of these nuisance factors as possible in order to obtain a reliable measure. Classical test theory (Gregory, 2000, pp. 77-79) stipulates that a negative measurement error can result in an obtained score being much lower than the true score, while a positive measurement error can produce an obtained score higher than the true score. Either way, one student taking a specific knowledge test will come out better than another owing to unbalanced item selection or some other measurement error.

6.5 True and error scores

According to classical measurement theory, true and error scores are uncorrelated. True scores are hypothetical; they are never really known. It is the error scores, however, that give test developers headaches. Suppose the researcher sets out to measure a trait of nervousness and keeps obtaining a measurement of confidence: something is clearly inconsistent in this measure. It could be that the researcher has chosen unsuitable test items based on obsolete tests, or that the persons being tested are not suitable candidates. The fact remains that errors of measurement yield misleading observed scores, and if the same test were repeated the results would be inconsistent. Test construction should therefore be carefully planned from the start to prevent such measurement errors from creeping into the results.


6.6 True variance and error variance in a test score

Briefly, true variance reflects a more homogeneous, internally consistent set of items than error variance. Error variance results from poor content sampling, as can occur in alternate-form and split-half reliability, and from heterogeneity of the traits under observation. A high interitem consistency, on the other hand, indicates a more homogeneous measure with little inconsistency. For example, if the two halves of a test yield two different results, we speak of error variance: the half-tests are inconsistent with one another.

7. Stability of a test – the advantages and limitations

7.1 Test-retest reliability

In this kind of measurement, the same test is administered twice to the same group. The sample group is of a heterogeneous nature, representative of the general population. The idea is to correlate the two sets of scores to obtain a measure of reliability; the advantage of such a design is that the second score can be predicted from the first, provided the two correlate. There are limitations, however: sources of error variance such as experience, maturation, lengthy time spans between tests, illness, and so forth can all affect retest reliability (PSY498-8, p. 6).

7.2 Alternate-form reliability

Alternate forms of the same test are issued to the test persons, and the correlation between the two sets of scores is measured (much as in test-retest reliability). There is a difference between the two methods, however: the alternate-form method introduces item-sampling differences (error variance), which can limit reliability. Some students may cope very well with the items on the first form but do quite badly on the second, because its items are not identical to those of the first. Other limitations are the high cost of producing alternate forms of a test and the difficulty of constructing truly parallel forms (Gregory, 2000, p. 83).


8. Internal consistency of a test – advantages and limitations

Apart from alternate-form reliability and test-retest reliability, there are other methods of testing items for consistency: for example, split-half reliability, the Kuder-Richardson 20, and Cronbach's Alpha.

8.1 Split-half reliability

As the name implies, this method correlates two scores from a single test, achieved by "splitting" the test into equivalent halves. It sounds complicated, but it is actually quite an effective measure: if the scores on the two halves correlate strongly, then the scores on two complete tests from two different administrations should in principle show the same correlation (Gregory, 2000, p. 84). Internal consistency is thus estimated from only a single administration.

This method has advantages, such as the possibility of lengthening the test to produce more reliability or of covering a large behaviour domain. But there are also limitations on how one can "split" the items of a single test. One can try dividing the items into odd and even numbers, or separating easy and difficult items; this becomes a problem, however, when the test developer has to split drawings or comprehension texts.
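In practice the half-test correlation is usually "stepped up" with the Spearman-Brown formula, which is not named in the text above but is the standard companion to this method. A minimal sketch on simulated pass/fail items (all figures invented):

    import numpy as np

    rng = np.random.default_rng(1)
    ability = rng.normal(size=(100, 1))                              # 100 examinees
    scores = (ability + rng.normal(size=(100, 10)) > 0).astype(int)  # 10 dichotomous items

    odd = scores[:, 0::2].sum(axis=1)    # totals on the odd-numbered items
    even = scores[:, 1::2].sum(axis=1)   # totals on the even-numbered items
    r_half = np.corrcoef(odd, even)[0, 1]

    # Spearman-Brown step-up: estimated reliability of the full-length test.
    r_full = 2 * r_half / (1 + r_half)
    print(round(r_half, 2), round(r_full, 2))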


8.2 Kuder-Richardson 20 and Cronbach's Alpha

The Kuder-Richardson 20, or KR-20, formula (1937) is used to find the internal consistency of a single administration of one test, as discussed under the split-half procedure. What this formula actually does is score each test item 0 for wrong and 1 for right. When tests go beyond the scope of the KR-20 formula, as in the testing of heterogeneous items, we use coefficient alpha (Cronbach, 1951) instead. This formula is suitable, for example, for attitude scales on which test persons must rate their answers as strongly agree, disagree, and so forth (Gregory, 2000, p. 86).
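A minimal implementation sketch of coefficient alpha (the function name is illustrative; with dichotomous 0/1 items the same computation reduces to the KR-20):

    import numpy as np

    def cronbach_alpha(items):
        """items: examinee-by-item score matrix (0/1 entries give the KR-20)."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]                              # number of items
        item_var = items.var(axis=0, ddof=1).sum()      # sum of item variances
        total_var = items.sum(axis=1).var(ddof=1)       # variance of total scores
        return (k / (k - 1)) * (1 - item_var / total_var)

Applied to the simulated scores matrix from the previous sketch, it returns a value in the same neighbourhood as the stepped-up split-half estimate.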


9. Validity of a psychological measure

Validity can be described as the degree to which a measure does what it is supposed to do. In other words, the psychological measure should give a well-grounded indication of the correspondence between the trait being tested and the operational definition of the construct. Furthermore, the instrument must test, and only test, what it was designed to test: there is no use designing an instrument for an intelligence scale and then using the same measure to test "running speed" (Blanche & Durrheim, 2002, p. 83). The following validity procedures will be discussed:

- content validity
- criterion-related validity
- predictive validity
- concurrent validity

9.1 Content validity

Content validity is a suitable measure when testing for traits such as knowledge, as in an examination paper (Blanche et al., 2002, p. 85). This type of measure tests item samples drawn from a greater domain of content, which could be several textbooks covering one field or topic. It would be impossible to test an examinee on the entire contents of a particular subject, such as engineering, and time is normally limited in such tests.

Content validation sometimes runs into difficulties when abstract traits, such as personality and aptitudes, have to be tested. It is difficult to give an accurate test description of something like racism or morals, as these traits do not fit snugly between the pages of a subject book (Blanche et al., 2002, p. 85).

Face validity is another matter to consider. How does the test appear to others? Does it look too complicated, or does it have an unprofessional appearance? Face validity needs to be taken seriously if the measure is to be accepted by persons in authority, particularly from a legal and educational point of view (PSY498-8/102).


9.2 Criterion-related validity

Criterion-related validity is normally established by correlating a test with other similar tests or research. For example, a researcher who identifies a new form of "job mobbing" in corporate and industrial settings will compare previous studies in this field with his or her new findings. There are two types of validity measures for establishing criterion validity, namely predictive validity and concurrent validity.

9.3 Predictive validity

As the name implies, predictive validity helps to predict future events from existing scores, budgets, educational performance, and so forth. For example, future inflation rates can be predicted from present statistics on a country's economic performance in relation to the rest of the world. Concurrent validation, on the other hand, replaces predictive validity when it comes to making a present diagnosis of a pupil's immediate performance rather than of future events (PSY498-8/102).

9.4 Concurrent validity

This method is more suitable when testing abstract traits, as when someone suffers from an immediate problem of depression. The clinician can judge the patient's observable behaviour and cognitive performance and make a suitable diagnosis. It would be difficult to apply predictive validity in such a case: one cannot "predict" whether someone suffering from a dark mood one day is going to suffer from depression in the future. A positive feature of concurrent validity is that costs are kept to a minimum and results are normally immediate, compared with predictive validity.

10. Major methods of establishing construct validity

Examples of constructs or traits are technical and mechanical knowledge, running speed, frustration, reading and spelling abilities, and so forth. How do we measure such constructs? First, the researcher gathers as much data as possible on the particular trait, through observations, interrelationships with other behavioural or cognitive measures, and so forth.


Establishing construct validity is thus both a theoretical and an empirical undertaking. Several methods are discussed below.

10.1 Developmental changes

It is common knowledge that developmental changes take place between childhood and adulthood, which means that behaviour and cognitive abilities change as well, perhaps more rapidly in childhood than in later years, when they tend to "stabilize".

Age differentiation is also shaped by a specific culture, since different cultures have different child-rearing patterns and beliefs. The Piagetian ordinal scales, for example (the sequential patterning of development through schemas), trace the gradual growth of conceptual skills from early childhood to early adulthood. This is an example of the construct validation of ordinal scales across several developmental levels (PSY498-8/102).

10.2 Correlations with other tests

When a new test is correlated with established tests, the new test should not correlate so highly with them that it becomes redundant; a good mix lies between low and moderately high correlations, but no more. Conversely, it would be absurd to compare a new test of an intelligence factor with a similar test and discover later that the new test is actually measuring a personality disorder!

10.3 Factor analysis

This is a family of statistical techniques that many researchers adopt to explain the relationships between variables or constructs that correlate highly with one another, and to obtain a strictly parsimonious set of data. Factor analysis allows a multitude of major mental abilities to be examined, such as comprehension, memory, and number recognition, in contrast to more conservative instruments such as the Stanford-Binet tests (Gregory, 2000, p. 23). It has one primary goal: to produce a neat, comprehensible set of statistics by reducing an untidy multitude of test variables to a more efficient, economical set of common traits.
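A minimal sketch with invented data and scikit-learn (one of many possible tools): four subtests are generated from a single common ability factor, and the analysis recovers high loadings for the three subtests that actually depend on it:

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(2)
    g = rng.normal(size=(300, 1))  # one common ability factor
    # Three subtests load on g; the fourth is mostly noise.
    subtests = g * np.array([0.8, 0.7, 0.6, 0.1]) + rng.normal(size=(300, 4)) * 0.5

    fa = FactorAnalysis(n_components=1).fit(subtests)
    print(fa.components_.round(2))  # loadings: high, high, high, near zero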


10.4 Internal consistency

Briefly, internal consistency aims for significant item-test correlations, with each item pointing in the keyed direction. Another way of testing for internal consistency is to correlate subtest scores with the total score. Take, for instance, the factors of a typical intelligence test, such as reading ability, arithmetic, and spelling: all the subtest scores are added together to give a total test score. It is of course necessary that items be homogeneous in order to achieve good internal test consistency.

10.5 Convergent and discriminant validation

Convergent validation means that a test correlates highly with other tests or traits that share a common factor; such tests are normally run on a heterogeneous sample to test for convergence. At the same time, such a test should not correlate with variables from which it ought to differ: a test of vocabulary ability, for example, should not correlate with a test of arithmetic reasoning.

Discriminant validation is especially important for personality tests. It occurs when there is a non-correlation between two theoretically unrelated variables, such as popularity and intelligence; the correlation would be negligible, if present at all.

The multitrait-multimethod matrix (Campbell & Fiske, 1959) combines the assessment of two or more traits by two or more methods (Gregory, 2000, pp. 110-111). Such a matrix provides a rich source of data on convergent and discriminant validity, as well as on reliability.
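A hypothetical illustration of the layout (all figures invented): two traits, anxiety and sociability, each measured by two methods, say self-report and peer rating. Reliabilities sit on the diagonal in parentheses; the same trait measured by different methods (here .62 and .58) should correlate more highly than different traits measured by the same method:

                          Method 1              Method 2
                       Anxiety  Sociab.      Anxiety  Sociab.
    M1  Anxiety         (.89)
        Sociability      .21     (.85)
    M2  Anxiety          .62      .09         (.90)
        Sociability      .11      .58          .17    (.88)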


10.6 Experimental interventions

Any experimental intervention a researcher undertakes involves "control" of the test situation. This is done to "isolate" the common treatment factors and remove any unwanted interference that could invalidate the results. There are numerous research designs to choose from, such as the standard one-group pretest-posttest design for testing construct validation in a scholastic test.


There are also designs, such as the equivalent time series, that spread out over lengthy time periods (Neumann, 1997, pp. 183-197). Whatever design is chosen, there will always be a certain amount of experimental interference; the researcher seeks solutions to problems, or tries to find a better experimental method for testing different hypotheses, for present and future generations.


11. Summary

The goal of this essay was to explain what is meant by the psychometric properties of psychological assessment measures. The principles central to psychometric theory were discussed, namely the different types of norms, criterion-referenced tests, psychological measures, and the reliability and validity of these measures, together with the advantages and limitations of each test.


12. Conclusion

Psychometric testing of psychological measures is an extensive procedure. A number of processes are involved in assessing human data, and none of them can be carried out in a vacuum. Human traits are constructs that do not remain stable over time, which is why researchers are always testing and retesting their instruments against the dynamics of human behaviour. It is therefore safe to conclude that no test is a complete test. As this essay has demonstrated, there are always advantages and limitations to assessment measures; what works for one test may not necessarily work for another. Sometimes it is not merely a matter of degree whether a test measures what it is supposed to measure, but of how the test sample relates to the real world of people.


References

Durrheim, K. (2002). Quantitative measurement. In M. T. Blanche (Ed.), Research in practice (pp. 72-95). Cape Town: UCT Press.

Gregory, R. J. (2000). Psychological testing (3rd ed.). Illinois: Allyn and Bacon.

Neumann, W. L. (1997). Social research methods (3rd ed.). Needham Heights: Allyn & Bacon.

Rosnow, R. L., & Rosenthal, R. (1999). Beginning behavioral research (3rd ed.). New Jersey: Prentice Hall.

Tutorial Letter 102 for PSY498-8. (2003). Psychological assessment. Pretoria: Unisa Press. (Sections from pp. 3-24.)