Chapter 13 Interpreting Test Scores

13.1
FREQUENCY
DISTRIBUTION

Marks awarded by counting the number of
correct answers on a test script are known as raw marks. ’15 marks out of a
total of 20’ may appear a high mark to some, but in fact the statement is
virtually meaningless of its own. For example, the tasks set in the test may
have been extremely simple and 15 may be the lowest mark in a particular group
of scores.

TABLE 1		TABLE 2			TABLE 3
Testee	Mark	Testee	Mark	Rank	Mark	Tally	Frequency
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z	20 25 33 35 29 25 30 26 19 27 26 32 34 27 27 29 25 23 30 26 22 23 33 26 24 26	D M C W L G S E P J N O H K T X Z B F Q Y R V U A I	35 34 33 33 32 30 30 29 29 27 27 27 26 26 26 26 26 25 25 25 24 23 23 22 20 19	1 2 3.5 (or 3=) 3.5 (or 3=) 5 6.5 (or 6=) 6.5 (or 6=) 8.5 (or 8=) 8.5 (or 8=) 11 (or 10=) 11 (or 10=) 11 (or 10=) 15 (or 13=) 15 (or 13=) 15 (or 13=) 15 (or 13=) 15 (or 13=) 19 (or 18=) 19 (or 18=) 19 (or 18=) 21 22.5 (or 22=) 22.5 (or 22=) 24 25 26	40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15	/ / // / // // /// //// /// / // / / / Total	1 1 2 1 2 2 3 5 3 1 2 1 1 1 26

Conversely,
the test may have been extremely difficult, in which case 15 may well be a very
high mark. Numbers still exert a strange and powerful influence on our society,
but the shibboleth that 40 per cent should
always represent a pass mark is nevertheless both surprising and
disturbing.

The tables on the previous page contain
the imaginary scores of a group of 26 students on a particular test consisting
of 40 items. Table 1 conveys very little, but Table 2, containing the students’
scores in order of merit, shows a little more. Table 3 contains a frequency
distribution showing the number of students who obtained each mark awarded;
the strokes on the left of the numbers
(e.g.////) are called tallies
and are included simply to illustrate to
the method of counting the frequency of scores. Note that normally the
frequency list would have been compiled without the need for Tables 1 and 2;
consequently, as the range of highest and lowest marks would then not be known,
all the possible scores would be
listed and a record made of the number of students obtaining each score in the
scale (as shown in the example).

Note that where ties occur in Table 2,
two ways of rendering these are shown. The usual classroom practice is that
shown in the parentheses. Where statistical work is to be done on the ranks, it
is essential to record the average rank (e.g. testees J.N and O, each with the
same mark, occupy places 10, 11 and 12 in the list, averaging 11).

13.2
MEASURES
OF CENTRAL TENDENCY

13.2.1 Mode

The mode refers to the score which most
candidates obtained in this case it is 26, as five testees have scored this
mark.

13.2.2 Median

The median refers to the score gained by
the middle candidates in the order of merit in the case of the 26 students here
(as in all cases involving even numbers of testees), there can obviously be no
middle person and thus the score in the bottom half is taken as the median. The
median score in this case is also 26.

13.2.3 Mean

The mean score any test is arithmetical
average i.e. the sum of the separate scores divided by the total number of
testees. The mode, median, and mean are all measures of central tendency. The
mean is the most efficient measure of central tendency, but it is not always
appropriate.

In the following Table 4 and formula,
note that symbol x is used to denote
the score, N the number of the
testees, and m the mean. The symbol f denotes the frequency with
which a score occurs. The symbol Σ means the
sum of.

Table 4

x
f fx

35 x 1 35

34 x 1 34

m = Σ fx =
702 =27

N 26

33
x 2 66

32 x 1 32

30 x 2 60

29 x 2 58

27 x 3
81

26 x 5 130

25 x 3 75

24 x 1 24

23 x 2 46

22 x 1 22

20 x 1 20

19 x 1 19

Total = 702

= Σ fx

Note
that x = 702 is the total number of
items which the group of 26 students got right between them. Dividing by N = 26
(as the formula states), this obviously gives the average.

It will be observed that in this particular case there is a fairly close
correspondence between the mean (27) and the median (26). Such a close
correspondence is not always common and has occurred in this case because the
scores tend to cluster symmetrically around a central point.

13.3
MEASURES
OF DISPERSION

Whereas the previous section was
concerned with measures of central tendency, this section is related to the
range or spread of scores. The mean by itself enables us to describe an
individual student’s score by comparing it with the average set of scores obtained
by a group, but it tells us nothing at all about the highest and lowest scores
and the spread of marks.

13.3.1 Range

One simple way of measuring the spread
of marks is based on the difference between the highest and lowest scores.
Thus, if the highest score on a 50-item test is 43 and the lowest 21, the range
is from 21 to 43; i.e. 22, if the highest score, however, is only 39 and the
lowest 29, the range is 10. (Note that in both cases, the mean may be 32.) The
range of the 26 scores given in Section 13.1 is 35 – 19 = 16.

13.3.2 Standard deviation

The standard deviation (s.d.) is another
way of showing the spread of scores. It measures the degree to which the group
of scores deviates from the mean; in other words, it shows how all the scores
are spread out and thus gives a fuller
description of test scores than the range, which simply describes the gap
between the highest and lowest marks and ignores the information provided by
all the remaining scores. Abbreviations used for the standard deviation are
either s.d. or σ (the Greek letter
sigma) or s.

One
simple method of calculating s.d. is shown below:

s.d.
= ∑d²

N
is the number of scores and d the deviation of each score from the mean.
Thus, working from 26 previous results, we produced to:

1.
find out the amount by which each score deviates from the mean (d);

2.
square each result (d²);

3.
total all the results (Σd²);

4.
divide the total by number of testees (Σd²/N); and

5.
find the square root of this result (√Σd²/N);

Score Mean Deviation
(d) Square (d²)

(Step 1) 35 deviates from 27 by 8 (Step
2) 64

34 7 49

33 6 36

32 5 25

30 3 9

29 2 4

27 0 0

26 -1 1

26 -2 4

25 -2 4

24 -3 9

23 -4
16

23 -4 16

22 -5 25

20 -7 49

19 -8 64

702 (Step
3) Total = 432

(Step 4) s.d. = √ 432

(Step 5) s.d = √ 16.62 = 4.08

Note:
If deviations (d) are taken from the mean, their sum (taking account of the
minus sign) is zero + 42 – 42 = 0. This affords a useful check on the
calculations involved here.

A standard deviation of 4.08, for
example, shows a smaller spread of scores than, say, a standard deviation of
8.96. If the aim of the test is simply to determine which students have
mastered a particular program of work or are capable of carrying out certain
tasks in the target language, a standard deviation of 4.08 or any other
denoting a fairly narrow spread will be quite satisfactory provided it is
associated with a high average score. However, if the test aims at measuring
several levels of attainment and making line distinctions within the group (as
perhaps in a proficiency test), then a broad spread will be required.

Standard deviation is also useful for
providing information concerning characteristics of different groups. If, for
example, the standard deviation on a certain test is 4.08 for one class, but 8.96
on the same test for another class, then it can be inferred that the latter
class is far more heterogeneous than the former.

13.4.1 Item analysis

Earlier careful consideration of
objectives and the construction of any test was attempted. What is required now
is a knowledge of how far those objectives have been achieved by a particular
test. Unfortunately, too many teachers think that the test is finished once the
raw marks have been obtained. But this is far from the case, for the results
obtained from objective tests can be used to provide valuable information
concerning;

– the performance of the
students as a group, thus (in the case of class progress tests) informing the
teacher about the effectiveness of the teaching;

–
the performance of individual students; and

–
the performance of each of the items comprising the test

Information concerning the performance
of the students as a whole and of individual students is very important for
teaching purposes, especially as many test results can show not only the types
of errors most frequently made but also the actual reasons for the errors being
made. As shown in earlier chapters, the great merit of objective tests arises
from the fact that they can provide an insight into the mental processes of the
students by showing very clearly what choices have been made, thereby indicating
definite lines on which remedial work can be given..

The performance of the test items,
themselves, is of obvious importance in compiling future tests. Since a great
deal of time and effort are usually spent on the construction of good objective
items, most teachers and test constructors will be desirous of either using
them again without further changes or else adapting them for future use. It is
thus useful to identify those items which were answered correctly by the more
able students taking the test and badly by the less able students. The
identification of certain difficult items in the test, together with a
knowledge of the performance of the individual distractors in multiple-choice
items, can prove just as valuable in its implications for teaching as for
testing.

All items should be examined from the
point of view of (1) their difficulty level and (2) their level of
discrimination.

13.4.2 Item difficulty

The index
of difficulty (or facility value)
of an item simply shows how easy or difficult the particular item provided in
the test. the index of difficulty (FV) is generally expressed as the fraction
(or percentage) of the students who answered the item correctly. It is
calculated by using the formula:

FV =

R
represents the number of correct answers and N the number of students taking
the test. Thus, if 21 out of 26 students tested obtained the correct answer for
one of the items, that item would have an index of difficulty (or a facility
value) of .77 per cent.

FV
= = .77

In
this case, the particular item is a fairly easy one since 77 per cent of the
students taking the test answered it correctly. Although an average facility
value of .5 or 50 per cent may be desirable for many public achievement tests
and for a few progress tests (depending on the purpose for which one is
testing), the facility value of a large number of individual items will
vary considerably. While aiming for test
items with facility values falling between .4 and .6, many test constructors
may be prepared in practice to accept items with facility values between .3 and
.7. Clearly, however, a very easy item, on which 90 per cent of testees
obtained the correct answer, will not distinguish between above-average
students and below-average students as well as an item which only 60 per cent
of the testees answer correctly. On the other hand, the easy item will
discriminate amongst a group of below-average students; in other words, one
student with a low standard may show that he or she is better than another
student with a low standard through being given the opportunity to answer an
easy item. Similarly, a very difficult item, though failing to discriminate
among most students, will certainly separate the good student from the very good
student.

Note that it is possible for a test
consisting of items each with a facility value of approximately .5 to fail to
discriminate at all between the good and the poor students. If, for example,
half the items are answered correctly by the good students but correctly by the
poor students, then the items will work against one another and no
discrimination will be possible. The chances of such an extreme situation
occurring are very remote indeed; it is highly probable, however, that at least
one or two items in a test will work against one another in this way.

13.4.3 Item discrimination

The discrimination index of an item
indicates the extent to which the item discriminates between the testees,
separating the more able testees from the less able. The index of
discrimination (D) tells us whether those students who performed will on the
whole test tended to do well or badly on each item in the test. It is
presupposed that the total score on the test is a valid measure of the
student’s ability (i.e the good student tends to do well on the test as a whole
and the poor student badly). On this basis, the score on the whole test is
accepted as the criterion measure, and it thus becomes possible to separate the
‘good’ students from the ‘bad’ ones in performances on individual items. If the
‘good’ students tend to do well on an item (as shown by many of item doing so –
a frequency measure) and the ‘poor’ students badly on the same item, then the
item is a good one because it distinguishes the ‘good’ from the ‘bad’ in the
same way as the total score. This is the argument underlying the index of
discriminations.

There are various methods of obtaining
the index of discrimination; all involve
a comparison of those students who performed well on the whole test and those
who performed poorly on the whole test. However, while it is statistically most
efficient to compare the top 27½ per cent, it is enough for most purposes to
divide small samples (e.g. class scores on a progress test) into halves or
thirds. For most classroom purposes, the following procedure is recommended .

1 Arrange the scripts in rank order of total
score and divide into two groups of equal size (i.e. the top half and the
bottom half). If there is an odd number of scripts, dispense with one script
chosen at random.

2 Count the number of those candidates in the
upper group answering the first item correctly; then count the number of
lower-group candidates answering the item correctly.

3 Subtract the number of correct answers in
the lower group from the number of correct answers in the upper group; i.e.
find the difference in the proportion passing in the lower group.

4 Divide this difference by the total number
of candidates in one group:

Correct U – Correct L

D =

(D = Discrimination index; n = Number of
candidates in one group; U = Upper half and L = Lower half. The index D is
thus-the difference between the proportion passing the item in U and L.)

5 Proceed in this manner for each item.

The following item, which was taken from
a test administered to 40 students, produced the results shown:

I left Tokyo ………….. Friday morning.

A. in
B. on C. at
D. by

15 – 6

.45

D = = =

Such
an item with a discrimination index of .45 functions fairly effectively,
although clearly it does not discriminate as well as an item with an index of
.6 or .7. Discrimination indices can range from + 1 (= an item which
discriminates perfectly – i.e. it shows perfect correlation with the testees’
results on the whole test) through 0 (= an item which does not discriminate
wrong way). Thus, for example, if all 20 students in the uper group answered a
certain item correctly and all 20 students in the lower group got the wrong
answer, the item would have an index of discrimination of 1.0. If , on the
other hand, only 10 students in the upper group answered it correctly and
furthermore 10 students in the lower group also got correct answers, the
discrimination index would be 0. However, if none of the 20 students in the
upper group got a correct answer and all the 20 students in the lower group
answered it correctly, the item would
have a negative discrimination, shown by -1.0. It is highly inadvisable to use
again, or even to attempt to amend, any item showing negative discrimination.
Inspection of such an item usually shows something radically wrong with it.

Again, working from actual test results,
we shall now look at the performance of three items. The first of the following
items has a high index of discrimination; the second is a poor item with a low
discrimination index; and the third example is given as an illustration of a
poor item with negative discrimination.

1 High discrimination index:

NEARLY When ………………Jim ……..crossed…………..the
road, he

…………ran into
a car.

18 – 3

D = = = .75 FV = =
0.525

(The item is at the right level of
difficulty and discriminates well.)

2 Low discrimination index:

If you …………… the bell, the door would have
been opened.

A.
would ring C. would have rung

B. had rung D. were ringing

3 – 0

D = = .15 FV = = .075

(In
this case, the item discriminates poorly because it is too difficult for
everyone, both ‘good’ and ‘bad’.)

3 Negative discrimination index:

I don’t think anybody has seen him.

A. Yes, someone has.

B. Yes, no one has.

C.
Yes, none has.

D.
Yes, anyone has.

4 – 6

–2

10 40

D = = = .10 FV = = 0.25

(This item is too difficult and
discriminates in the wrong direction.)

What
has gone wrong with the third item above? Even at this stage and without counting the number of candidates who chose
each of the options, it is evident that the item was a trick item: in other
words, the item was far too ‘clever’, it is even conceivable that many native speakers
would select option B in reference to the correct option A. Items like this all
too often escape the attention of the test writer until an item analysis
actually focuses attention on them. (This is one excellent reason for
conducting an item analysis.)

Note that items with a very high facility
value fail to discriminate and thus generally show a low discrimination index.
The particular group of students who were given the following item had
obviously mastered the use of for and
since following the present perfect
continuous tense:

He’s been living in Berlin ……..1975.

19 – 19

D = = 0 FV = = 0.95

(The item is extremely easy for the
testees and has zero discrimination.)

13.4.4 Item difficulty and discrimination

Facility
values and discrimination indices are usually recorded together in tabular form
and calculated by similar procedures. Note again the formulae used:

(

Correct U + Correct L

FV =
or FV =

Correct U – Correct L

D =

The following table, compiled from the
results of the test referred to in the preceding paragraphs, shows how these
measures are recorded.

Item U L U+L FV U-L D

1 19 19 38 .95 0 0

2 13 16 29 .73 -3 -.15

3 20 12 32 .80 8 .40

4 18 3 21 .53 15 .75

5 15 6 21 .53 9 .45

6 16 15 31 .77 1 .05

7 17 8 25 .62 9 .45

8 13 4 17 .42 9 .45

9 4 6 10 .25 -2 -.10

10 10 4 14 .35 6 .30

11 18 13 31 .78 5 .25

12 12 2 14 .35 10 .50

13 14 6 20 .50 8 .40

14 5 1 6 .15 4 .20

15 7 1 8 .20 6 .30

16 3 0 3 .08 3 .15

Etc.

Item showing a discrimination index of
below .30 are of doubtful use since they fail to discriminate effectively.
Thus, on the results listed in the table above, only items 3,4,5,7,8,10,12,13
and 15 could be safely used in future tests without being rewritten. However,
many test writers would keep item 1 simply as a lead-in to put the students at
ease.

13.4.5 Extended answer analysis

It will often be important to scrutinize
items in greater detail, particularly in those cases where items have not
performed as expected. We shall want to know not only why these items have not
performed according to expectations but also why certain testees have failed to
answer a particular item correctly. Such tasks are reasonably simple and
straightforward to perform if the multiple-choice technique has been used in
the test.

In order to carry out a full item
analysis, or an extended answer analysis, a record should be made of the
different options chosen by each student in the upper group and then the
various options selected by the lower group.

If I were rich, I ………………work.

A. shan’t B. won’t C. wouldn’t D. didn’t

U L U+L

U + L

A. 1 4 5

B.
2       5           7                                 FV
=                   =                   = .45

U – L

C. 14 4
18

D.
3 7 10 D =
= = .50

(20) (20) (40)

The
item has a facility value of .45 and a discrimination index of .50 and appears to have functioned
efficiently; the distractors attract the poorer students but not the better
ones.

The performance of the following item
with a low discrimination index is of particular interest:

Mr
Watson wants to meet a friend in Singapore this year. He ……….him for
ten years.

A. knew B.
had known C. knows D.
has known

U L U+L

A. 7 3 10

B. 4 3 7 FV = .325

C. 1 9 10 D = .15

D. 8 5 13

(20) (20)
(40)

While
distractor C appears to be performing well, it is clear that distractors A and
B are attracting the wrong candidates (i.e. the better ones). On closer
scrutiny, it will be found that both of these options may be correct in certain
contexts: for example, a student may envisage a situation in which Mr Watson is going to visit a friend whom he had
known for ten years in England
but who now lives in Singapore,
e.g.

He knew him (well) for ten years (while
he lived England).

The
same justification applies for option B.

The next item should have functioned
efficiently but failed to do so; an examination of the testees’ answers leads
us to guess that possibly many had been taught to use the past perfect tense to
indicate an action in the past taking place before another action in the past.
Thus, while the results obtained from the previous item reflect on the item
itself, the results here possibly reflect on the teaching:

John F. Kennedy ………… born in 1917 and
died in 1963.

A. is B.
has been C. was D. had been

U L U+L

A. 14 8 22

B. 4 7 11 FV = .50

C. 2 5 7 D = .30

D. 0 0 0

(20) (20)
(40)

Similarly, the level of difficulty of
distractors C and D in the following item is far too low; a full item analysis
suggests only too strongly that they have been added simply to complete the
number of options required.

Wasn’t that your father over there?

A. Yes, he was. C. Yes, was he.

B.
Yes, it was. D. Yes, was
it.

U L U+L

A. 7 13 20

B. 13 7 20 FV = .50

C. 0 0 0 D = .30

D. 0 0 0

(20) (20)
(40)

The
item could be made slightly more difficult and thus improved by replacing
distractor C by Yes, he wasn’t and D
by Yes, it wasn’t. The item is still
imperfect, but the difficulty level of the distractors will probably correspond
more closely to the level of attainment being tested.

The purpose of obtaining test statistics
is to assist interpretation of item and test results in a way which is
meaningful and significant. Provided that such statistics lead the teacher or
test constructor to focus once again on the content
of the test, then item analysis is an extremely valuable exercise.

13.5
MODERATING

The importance of moderating classroom
tests as well as public examinations cannot be stressed too greatly. No matter
how experienced test writers are, they are usually so deeply involved in their
work that they become incapable of standing back and viewing the items with any
real degree of objectivity. There are bound to be many blind-spots in tests,
especially in the field of objective testing, where the items sometimes contain
only the minimum of context.

It is essential, therefore, that the
test writer submits the test for moderation to a colleague or, preferably, to a
number of colleagues. Achievement and proficiency tests of English administered
to a large test population are generally moderated by a board consisting of linguists, language teachers, a psychologist,
a statistician, etc. The purpose of such a board is to scrutinize as closely as
closely as possible not only each item comprising the test but also the test as
a whole, so that the most appropriate and efficient measuring instrument is
produced for the particular purpose at hand. In these cases, moderation is also
frequently concerned with the scoring of the test and with the evaluation of
the test results.

13.6
ITEM
CARDS AND BANKS

As must be very clear at this stage, the
construction of objective tests necessities taking a great deal of time and
trouble. Although the scoring of such tests is simple and straightforward,
further effort is then spent on the evaluation of each item and on improving
those items which do not perform satisfactorily. It seems somewhat illogical,
therefore, to dispense with test items once they have appeared in a test.

The best way of recording and storing
items (together with any relevant information) is by means of small cards. Only
one item is entered on each card; on the reverse side of the card information
derived from an item analysis is recorded: e.g. the facility value (FV), the
Index of Discrimination (D), and an
extended answers analysis (if carried out). After being arranged according to
the element or skill which they are intended to test, the items on the separate
cards are grouped according to difficulty level, the particular area tested,
etc. It is an easy task to arrange them for quick reference according to
whatever system is desired. Furthermore, the cards can be rearranged at any
later date.

Although it will obviously take
considerable time to build up an item bank consisting of a few hundred items,
such an item bank will prove of enormous value and will save the teacher a
great deal of time and trouble (or options within each item) being changed each
time. If there is concern about test security or if there is any other reason
indicating the need for new items, many of the existing items can be rewritten.
In such cases, the same options are
generally kept, but the context is changed so that one of the distractors now
becomes the correct option.
Multiple-choice items testing most area of the various language elements
and skills can be rewritten in this way, e.g.

(Grammar) I hope you ……. us your secret
soon.

A. told B. will tell C. have told D.
would tell

I
wish you ……. us your secret soon.

A. told B. will tell C. have told D.
would tell

(Vocabulary) Are you going to wear your best
…… for the party?

A. clothes B. clothing C. cloth D.
clothings

What kind of
…….. is your new suit made of?

A. clothes B. clothing C. cloth D.
clothings

(Phoneme beat bit beat

discrimination) beat beat bit

(Listening Student hears: Why are you going home?

comprehension) Student reads: A. At six o’clock.

B.
Yes, I am.

C.
To help my mother.

D.
By bus.

Student
hears: How are you going to David’s?

Student reads: A. At six o’clock.

B.
Yes, I am.

C.
To help my mother.

D.
By bus.

Two-thirds of the country’s (fuel, endevour, industry,
energy) comes from imported oil,
while the remaining one-third comes from coal. Moreover, soon the
country will have its first nuclear power station.

Two-thirds of the country’s (fuel, endevour, industry,
power) takes the form of imported oil, while the remaining one-third is
coal. However, everyone in the country was made to realize the importance
of coal during the recent miners’ strike, when many factories were forced
to close down.

(Reading

comprehension/

vocabulary)

Items rewritten in this way become new
items, and thus it will be necessary to collect facility values and
discrimination indices again.

Such examples serve to show ways of
making maximum use of the various types of test items which have been
constructed, administered and evaluated. In any case, however, the effort spent
on constructing tests of English as a second or foreign language is never
wasted since the insights provided into language behavior as well as into
language learning and teaching will always be invaluable in any situation
connected with either teaching or testing.

Chapter 13 Interpreting Test Scores

Chapter 9 Listening Comprehension Tests

The Sentence Patterns of Language

Teaching Grammar

Chapter 1 Introduction to English Language Testing

Definition of Morpheme, Allomorph, Root, and Base

Role Play, Games, and Songs

Similar Posts