Chapter 13 Interpreting Test Scores
FREQUENCY DISTRIBUTION
Marks awarded by counting the number of correct answers on a test script are
known as raw marks. ‘15 marks out of a total of 20’ may appear a high mark to
some, but in fact the statement is virtually meaningless on its own. For
example, the tasks set in the test may have been extremely simple and 15 may be
the lowest mark in a particular group of scores.
TABLE 1

Testee   Mark
A        20
B        25
C        33
D        35
E        29
F        25
G        30
H        26
I        19
J        27
K        26
L        32
M        34
N        27
O        27
P        29
Q        25
R        23
S        30
T        26
U        22
V        23
W        33
X        26
Y        24
Z        26

TABLE 2

Testee   Mark   Rank
D        35     1
M        34     2
C        33     3.5 (or 3=)
W        33     3.5 (or 3=)
L        32     5
G        30     6.5 (or 6=)
S        30     6.5 (or 6=)
E        29     8.5 (or 8=)
P        29     8.5 (or 8=)
J        27     11 (or 10=)
N        27     11 (or 10=)
O        27     11 (or 10=)
H        26     15 (or 13=)
K        26     15 (or 13=)
T        26     15 (or 13=)
X        26     15 (or 13=)
Z        26     15 (or 13=)
B        25     19 (or 18=)
F        25     19 (or 18=)
Q        25     19 (or 18=)
Y        24     21
R        23     22.5 (or 22=)
V        23     22.5 (or 22=)
U        22     24
A        20     25
I        19     26

TABLE 3

Mark   Tally   Frequency
40
39
38
37
36
35     /       1
34     /       1
33     //      2
32     /       1
31
30     //      2
29     //      2
28
27     ///     3
26     /////   5
25     ///     3
24     /       1
23     //      2
22     /       1
21
20     /       1
19     /       1
18
17
16
15
Total          26
Conversely, the test may have been extremely difficult, in which case 15 may
well be a very high mark. Numbers still exert a strange and powerful influence
on our society, but the shibboleth that 40 per cent should always represent a
pass mark is nevertheless both surprising and disturbing.
Tables 1, 2 and 3 show the imaginary scores of a group of 26 students on a
particular test consisting of 40 items. Table 1 conveys very little, but
Table 2, containing the students’ scores in order of merit, shows a little
more. Table 3 contains a frequency distribution showing the number of students
who obtained each mark awarded; the strokes on the left of the numbers
(e.g. ////) are called tallies and are included simply to illustrate the method
of counting the frequency of scores. Note that normally the frequency list
would have been compiled without the need for Tables 1 and 2; consequently, as
the range of highest and lowest marks would then not be known, all the possible
scores would be listed and a record made of the number of students obtaining
each score in the scale (as shown in the example).
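The tallying procedure can be sketched in a few lines of Python. This is only an illustration: the function name frequency_distribution and the variable names are assumptions introduced here, and the marks are those of testees A to Z in Table 1. Every possible mark from 40 down to 15 is listed, whether or not anyone obtained it.

```python
from collections import Counter

# Marks of testees A to Z, copied from Table 1.
marks = [20, 25, 33, 35, 29, 25, 30, 26, 19, 27, 26, 32, 34,
         27, 27, 29, 25, 23, 30, 26, 22, 23, 33, 26, 24, 26]


def frequency_distribution(scores, lowest, highest):
    """Return (mark, frequency) pairs for every possible mark, highest first."""
    counts = Counter(scores)
    return [(mark, counts[mark]) for mark in range(highest, lowest - 1, -1)]


table3 = frequency_distribution(marks, lowest=15, highest=40)

# Print one row per possible mark: the mark, one stroke per testee, the count.
for mark, freq in table3:
    print(f"{mark:>4}  {'/' * freq:<6} {freq or ''}")
```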
Note that where testees in Table 2 have obtained the same mark, two ways of
rendering their ranks are shown. The usual classroom practice is that shown in
the parentheses. Where statistical work is to be done on the ranks, it is
essential to record the average rank (e.g. testees J, N and O, each with the
same mark, occupy places 10, 11 and 12 in the list, averaging 11).
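Both ways of ranking tied scores can be worked out mechanically. The sketch below is an assumption about how one might do it, not a prescribed method; tied_ranks and its dictionary keys are hypothetical names. For each mark it records the average of the places occupied and the classroom form (the first place followed by "=").

```python
from collections import Counter

# Marks of testees A to Z, copied from Table 1.
marks = [20, 25, 33, 35, 29, 25, 30, 26, 19, 27, 26, 32, 34,
         27, 27, 29, 25, 23, 30, 26, 22, 23, 33, 26, 24, 26]


def tied_ranks(scores):
    """Map each mark to its average rank and its classroom rank."""
    counts = Counter(scores)
    ranks = {}
    place = 1  # next unoccupied place in the order of merit
    for mark in sorted(counts, reverse=True):
        n = counts[mark]
        first, last = place, place + n - 1
        ranks[mark] = {
            "average": (first + last) / 2,
            "classroom": f"{first}=" if n > 1 else str(first),
        }
        place += n
    return ranks


ranks = tied_ranks(marks)
print(ranks[27])  # testees J, N and O occupy places 10, 11 and 12
```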
MEASURES OF CENTRAL TENDENCY
The mode refers to the score which most candidates obtained: in this case it is
26, as five testees have scored this mark.
The median refers to the score gained by the middle candidate in the order of
merit. In the case of the 26 students here (as in all cases involving even
numbers of testees), there can obviously be no middle person and thus the score
halfway between the lowest score in the top half and the highest score in the
bottom half is taken as the median. The median score in this case is also 26.
The mean score of any test is the arithmetical average, i.e. the sum of the
separate scores divided by the total number of testees. The mode, median, and
mean are all measures of central tendency. The mean is the most efficient
measure of central tendency, but it is not always appropriate.
In the following calculation of the mean (Table 4), note that the symbol x is
used to denote the score, N the number of testees, and m the mean. The symbol f
denotes the frequency with which a score occurs. The symbol Σ means the sum of.
Table 4

 x        f      fx
35   x    1      35
34   x    1      34
33   x    2      66
32   x    1      32
30   x    2      60
29   x    2      58
27   x    3      81
26   x    5     130
25   x    3      75
24   x    1      24
23   x    2      46
22   x    1      22
20   x    1      20
19   x    1      19

Total = 702 = Σfx

m = Σfx / N = 702 / 26 = 27
Note that Σfx = 702 is the total number of items which the group of 26 students
got right between them. Dividing by N = 26 (as the formula states), this
obviously gives the average.
It will be observed that in this particular case there is a fairly close
correspondence between the mean (27) and the median (26). Such a close
correspondence is not always found; it has occurred in this case because the
scores tend to cluster symmetrically around a central point.
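The three measures can be checked with Python's standard statistics module. The sketch below is illustrative only; the variable names are assumptions, and the mean is also computed in the frequency-table form m = Σfx / N used in Table 4.

```python
from collections import Counter
from statistics import median, mode

# Marks of testees A to Z, copied from Table 1.
marks = [20, 25, 33, 35, 29, 25, 30, 26, 19, 27, 26, 32, 34,
         27, 27, 29, 25, 23, 30, 26, 22, 23, 33, 26, 24, 26]

N = len(marks)

# Frequency-table form of the mean: m = (sum of f * x) / N.
frequencies = Counter(marks)
sum_fx = sum(x * f for x, f in frequencies.items())
m = sum_fx / N

most_common_mark = mode(marks)   # the mark obtained by most testees
middle_mark = median(marks)      # halfway between the two middle marks

print(most_common_mark, middle_mark, sum_fx, m)
```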
MEASURES OF DISPERSION
Whereas the previous section was concerned with measures of central tendency,
this section is related to the range or spread of scores. The mean by itself
enables us to describe an individual student’s score by comparing it with the
average set of scores obtained by a group, but it tells us nothing at all about
the highest and lowest scores and the spread of marks.
One simple way of measuring the spread of marks is based on the difference
between the highest and lowest scores. Thus, if the highest score on a 50-item
test is 43 and the lowest 21, the range is from 21 to 43, i.e. 22. If the
highest score, however, is only 39 and the lowest 29, the range is 10. (Note
that in both cases, the mean may be 32.) The range of the 26 scores given in
Section 13.1 is 35 – 19 = 16.
The standard deviation (s.d.) is another way of showing the spread of scores.
It measures the degree to which the group of scores deviates from the mean; in
other words, it shows how all the scores are spread out and thus gives a fuller
description of test scores than the range, which simply describes the gap
between the highest and lowest marks and ignores the information provided by
all the remaining scores. Abbreviations used for the standard deviation are
either s.d. or σ (the Greek letter sigma) or s.
One simple method of calculating s.d. is shown below:

s.d. = √(Σd² / N)

where N is the number of scores and d the deviation of each score from the mean.
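Thus, working from the 26 previous results, we proceed to:
1. find out the amount by which each score deviates from the mean (d);
2. square each result (d²);
3. total all the results (Σd²);
4. divide the total by the number of testees (Σd²/N); and
5. find the square root of this result (√(Σd²/N)).

Score   Mean   Deviation (d)   Squared (d²)
35      27       8             64
34      27       7             49
33      27       6             36
33      27       6             36
32      27       5             25
30      27       3              9
30      27       3              9
29      27       2              4
29      27       2              4
27      27       0              0
27      27       0              0
27      27       0              0
26      27      -1              1
26      27      -1              1
26      27      -1              1
26      27      -1              1
26      27      -1              1
25      27      -2              4
25      27      -2              4
25      27      -2              4
24      27      -3              9
23      27      -4             16
23      27      -4             16
22      27      -5             25
20      27      -7             49
19      27      -8             64

Total = 432 = Σd²

Σd² / N = 432 / 26 = 16.62
s.d. = √16.62 = 4.08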
If deviations (d) are taken from the mean, their sum (taking account of the
minus sign) is zero: +42 – 42 = 0. This affords a useful check on the
calculations involved here.
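The same calculation can be followed step by step in Python. This is a minimal sketch with assumed variable names; it divides by N (not N - 1), as in the method described here, so it corresponds to what the standard library calls statistics.pstdev.

```python
from math import sqrt

# Marks of testees A to Z, copied from Table 1.
marks = [20, 25, 33, 35, 29, 25, 30, 26, 19, 27, 26, 32, 34,
         27, 27, 29, 25, 23, 30, 26, 22, 23, 33, 26, 24, 26]

N = len(marks)
mean = sum(marks) / N

# Range: the gap between the highest and lowest marks.
score_range = max(marks) - min(marks)

# Step 1: deviation of each score from the mean.
deviations = [x - mean for x in marks]
# Steps 2 and 3: square each deviation and total the squares.
sum_d_squared = sum(d * d for d in deviations)
# Steps 4 and 5: divide by the number of testees, then take the square root.
sd = sqrt(sum_d_squared / N)

# Check: positive and negative deviations should cancel out.
positive = sum(d for d in deviations if d > 0)
negative = sum(d for d in deviations if d < 0)

print(score_range, sum_d_squared, round(sd, 2), positive, negative)
```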
A standard deviation of 4.08, for example, shows a smaller spread of scores
than, say, a standard deviation of 8.96. If the aim of the test is simply to
determine which students have mastered a particular programme of work or are
capable of carrying out certain tasks in the target language, a standard
deviation of 4.08 or any other denoting a fairly narrow spread will be quite
satisfactory provided it is associated with a high average score. However, if
the test aims at measuring several levels of attainment and making fine
distinctions within the group (as perhaps in a proficiency test), then a broad
spread will be required.
Standard deviation is also useful for providing information concerning
characteristics of different groups. If, for example, the standard deviation on
a certain test is 4.08 for one class, but 8.96 on the same test for another
class, then it can be inferred that the latter class is far more heterogeneous
than the former.
objectives and the construction of any test was attempted. What is required now
is a knowledge of how far those objectives have been achieved by a particular
test. Unfortunately, too many teachers think that the test is finished once the
raw marks have been obtained. But this is far from the case, for the results
obtained from objective tests can be used to provide valuable information
concerning:
- the performance of the students as a group, thus (in the case of class
  progress tests) informing the teacher about the effectiveness of the teaching;
- the performance of individual students; and
- the performance of each of the items comprising the test.
Information concerning the performance of the students as a whole and of
individual students is very important for teaching purposes, especially as many
test results can show not only the types of errors most frequently made but
also the actual reasons for the errors being made. As shown in earlier
chapters, the great merit of objective tests arises from the fact that they can
provide an insight into the mental processes of the students by showing very
clearly what choices have been made, thereby indicating definite lines on which
remedial work can be given.
Information concerning the performance of the items themselves is of obvious
importance in compiling future tests. Since a great deal of time and effort is
usually spent on the construction of good objective items, most teachers and
test constructors will want either to use them again without further changes or
else to adapt them for future use. It is thus useful to identify those items
which were answered correctly by the more able students taking the test and
badly by the less able students. The identification of certain difficult items
in the test, together with a knowledge of the performance of the individual
distractors in multiple-choice items, can prove just as valuable in its
implications for teaching as for testing.
All items should be examined from the point of view of (1) their difficulty
level and (2) their level of discrimination.
The index of difficulty (or facility value) of an item simply shows how easy or
difficult the particular item proved in the test. The index of difficulty (FV)
is generally expressed as the fraction (or percentage) of the students who
answered the item correctly. It is calculated by using the formula:

FV = R / N

where R represents the number of correct answers and N the number of students
taking the test. Thus, if 20 out of 26 students tested obtained the correct
answer for one of the items, that item would have an index of difficulty (or a
facility value) of .77 or 77 per cent.

FV = 20 / 26 = .77
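As a quick check on the arithmetic, the formula FV = R / N can be written as a small Python function. The names facility_value and within_band are assumptions made for this sketch, and the default band of .3 to .7 is simply the range that the text says many test constructors accept in practice. A facility value of .77 corresponds, for instance, to 20 correct answers out of 26.

```python
def facility_value(correct, tested):
    """FV = R / N: the fraction of students answering the item correctly."""
    if tested <= 0:
        raise ValueError("the number of students tested must be positive")
    return correct / tested


def within_band(fv, low=0.3, high=0.7):
    """True if the facility value lies inside the accepted band."""
    return low <= fv <= high


fv = facility_value(20, 26)
print(f"FV = {fv:.2f}, within .3 to .7: {within_band(fv)}")
```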
In this case, the particular item is a fairly easy one since 77 per cent of the
students taking the test answered it correctly. Although an average facility
value of .5 or 50 per cent may be desirable for many public achievement tests
and for a few progress tests (depending on the purpose for which one is
testing), the facility value of a large number of individual items will
vary considerably. While aiming for test
items with facility values falling between .4 and .6, many test constructors
may be prepared in practice to accept items with facility values between .3 and
.7. Clearly, however, a very easy item, on which 90 per cent of testees
obtained the correct answer, will not distinguish between above-average
students and below-average students as well as an item which only 60 per cent
of the testees answer correctly. On the other hand, the easy item will
discriminate amongst a group of below-average students; in other words, one
student with a low standard may show that he or she is better than another
student with a low standard through being given the opportunity to answer an
easy item. Similarly, a very difficult item, though failing to discriminate
among most students, will certainly separate the good student from the very good
student.
It is possible for a test consisting of items each with a facility value of
approximately .5 to fail to discriminate at all between the good and the poor
students. If, for example, half the items are answered correctly by the good
students and incorrectly by the poor students, while the remaining items are
answered incorrectly by the good students but correctly by the poor students,
then the items will work against one another and no discrimination will be
possible. The chances of such an extreme situation occurring are very remote
indeed; it is highly probable, however, that at least one or two items in a
test will work against one another in this way.
The discrimination index of an item indicates the extent to which the item
discriminates between the testees, separating the more able testees from the
less able. The index of discrimination (D) tells us whether those students who
performed well on the whole test tended to do well or badly on each item in the
test. It is presupposed that the total score on the test is a valid measure of
the student’s ability (i.e. the good student tends to do well on the test as a
whole and the poor student badly). On this basis, the score on the whole test
is accepted as the criterion measure, and it thus becomes possible to separate
the ‘good’ students from the ‘bad’ ones in performances on individual items. If
the ‘good’ students tend to do well on an item (as shown by many of them doing
so – a frequency measure) and the ‘poor’ students badly on the same item, then
the item is a good one because it distinguishes the ‘good’ from the ‘bad’ in
the same way as the total score. This is the argument underlying the index of
discrimination.
There are various methods of obtaining the index of discrimination; all involve
a comparison of those students who performed well on the whole test and those
who performed poorly on the whole test. However, while it is statistically most
efficient to compare the top 27½ per cent with the bottom 27½ per cent, it is
enough for most purposes to divide small samples (e.g. class scores on a
progress test) into halves or thirds. For most classroom purposes, the
following procedure is recommended.
1. Arrange the scripts in rank order of total score and divide into two groups
   of equal size (i.e. the top half and the bottom half). If there is an odd
   number of scripts, dispense with one script chosen at random.
2. Count the number of those candidates in the upper group answering the first
   item correctly; then count the number of lower-group candidates answering
   the item correctly.
3. Subtract the number of correct answers in the lower group from the number of
   correct answers in the upper group; i.e. find the difference between the
   proportion passing in the upper group and the proportion passing in the
   lower group.
4. Divide this difference by the total number of candidates in one group:

D = (Correct U − Correct L) / n

(D = Discrimination index; n = Number of candidates in one group; U = Upper
half and L = Lower half. The index D is thus the difference between the
proportion passing the item in U and L.)
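The four steps of this procedure can be sketched in Python as follows. The function names split_halves and discrimination_index, and the (student, total score) representation of a script, are assumptions made for illustration only.

```python
import random


def split_halves(scripts, rng=random):
    """Rank (student, total_score) pairs and return (upper, lower) halves.

    If there is an odd number of scripts, one chosen at random is set aside.
    """
    scripts = list(scripts)
    if len(scripts) % 2:
        scripts.remove(rng.choice(scripts))
    ranked = sorted(scripts, key=lambda script: script[1], reverse=True)
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]


def discrimination_index(correct_upper, correct_lower, group_size):
    """D = (Correct U - Correct L) / n, where n is the size of one group."""
    if group_size <= 0:
        raise ValueError("group size must be positive")
    return (correct_upper - correct_lower) / group_size


# Step 1 on a tiny made-up class of four scripts.
upper, lower = split_halves([("A", 20), ("D", 35), ("I", 19), ("M", 34)])

# Steps 2 to 4 for an item answered correctly by 15 of the 20 upper-group
# candidates and by 6 of the 20 lower-group candidates.
d = discrimination_index(15, 6, 20)
print(upper, lower, round(d, 2))
```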
The following item, which was taken from a test administered to 40 students,
produced the results shown:
B. on C. at D. by
|
|
|
Such an item with a discrimination index of .45 functions fairly effectively,
although clearly it does not discriminate as well as an item with an index of
.6 or .7. Discrimination indices can range from +1 (= an item which
discriminates perfectly – i.e. it shows perfect correlation with the testees’
results on the whole test) through 0 (= an item which does not discriminate in
any way at all) to -1 (= an item which discriminates in entirely the wrong
way). Thus, for example, if all 20 students in the upper group answered a
certain item correctly and all 20 students in the lower group got the wrong
answer, the item would have an index of discrimination of 1.0. If, on the
other hand, only 10 students in the upper group answered it correctly and
furthermore 10 students in the lower group also got correct answers, the
discrimination index would be 0. However, if none of the 20 students in the
upper group got a correct answer and all the 20 students in the lower group
answered it correctly, the item would have a negative discrimination, shown by
-1.0. It is highly inadvisable to use again, or even to attempt to amend, any
item showing negative discrimination. Inspection of such an item usually shows
something radically wrong with it.
We shall now look at the performance of three items. The first of the following
items has a high index of discrimination; the second is a poor item with a low
discrimination index; and the third example is given as an illustration of a
poor item with negative discrimination.
road, he
a car.
|
|
|
0.525
difficulty and discriminates well.)
been opened.
would ring C. would have rung
|
||||
|
||||
(In this case, the item discriminates poorly because it is too difficult for
everyone, both ‘good’ and ‘bad’.)
Yes, none has.
Yes, anyone has.
|
|
|
discriminates in the wrong direction.)
What has gone wrong with the third item above? Even at this stage and without
counting the number of candidates who chose each of the options, it is evident
that the item was a trick item: in other words, the item was far too ‘clever’.
It is even conceivable that many native speakers would select option B in
preference to the correct option A. Items like this all too often escape the
attention of the test writer until an item analysis actually focuses attention
on them. (This is one excellent reason for conducting an item analysis.)
Items with a very high facility value fail to discriminate and thus generally
show a low discrimination index. The particular group of students who were
given the following item had obviously mastered the use of for and since
following the present perfect continuous tense:
|
|
testees and has zero discrimination.)
Facility values and discrimination indices are usually recorded together in
tabular form and calculated by similar procedures. Note again the formulae used:

D = (Correct U − Correct L) / n

FV = R / N   or   FV = (Correct U + Correct L) / 2n

The following table, compiled from the results of the test referred to in the
preceding paragraphs, shows how these measures are recorded.
Item    U    L   U+L    FV   U-L     D
  1    19   19    38   .95     0     0
  2    13   16    29   .73    -3  -.15
  3    20   12    32   .80     8   .40
  4    18    3    21   .53    15   .75
  5    15    6    21   .53     9   .45
  6    16   15    31   .77     1   .05
  7    17    8    25   .62     9   .45
  8    13    4    17   .42     9   .45
  9     4    6    10   .25    -2  -.10
 10    10    4    14   .35     6   .30
 11    18   13    31   .78     5   .25
 12    12    2    14   .35    10   .50
 13    14    6    20   .50     8   .40
 14     5    1     6   .15     4   .20
 15     7    1     8   .20     6   .30
 16     3    0     3   .08     3   .15
Etc.
Items showing a discrimination index of below .30 are of doubtful use since
they fail to discriminate effectively. Thus, on the results listed in the table
above, only items 3, 4, 5, 7, 8, 10, 12, 13 and 15 could be safely used in
future tests without being rewritten. However, many test writers would keep
item 1 simply as a lead-in to put the students at ease.
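The selection of reusable items can be reproduced from the U and L columns alone. The sketch below assumes 20 candidates in each group (40 in all) and uses hypothetical names (results, analyse, reusable); the .30 threshold is the one stated in the text.

```python
# (item number, correct in upper group, correct in lower group), from the table.
results = [
    (1, 19, 19), (2, 13, 16), (3, 20, 12), (4, 18, 3),
    (5, 15, 6), (6, 16, 15), (7, 17, 8), (8, 13, 4),
    (9, 4, 6), (10, 10, 4), (11, 18, 13), (12, 12, 2),
    (13, 14, 6), (14, 5, 1), (15, 7, 1), (16, 3, 0),
]
N_PER_GROUP = 20  # assumption: upper and lower halves of 40 students


def analyse(upper, lower, n):
    """Return (FV, D) for one item, given correct answers in U and L."""
    fv = (upper + lower) / (2 * n)
    d = (upper - lower) / n
    return fv, d


# Keep only items whose discrimination index is .30 or above.
reusable = [
    item for item, u, l in results
    if round(analyse(u, l, N_PER_GROUP)[1], 2) >= 0.30
]
print(reusable)
```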
It is often important to scrutinise items in greater detail, particularly in
those cases where items have not performed as expected. We shall want to know
not only why these items have not performed according to expectations but also
why certain testees have failed to answer a particular item correctly. Such
tasks are reasonably simple and straightforward to perform if the
multiple-choice technique has been used in the test.
In order to carry out a full item analysis, or an extended answer analysis, a
record should be made of the different options chosen by each student in the
upper group and then the various options selected by the lower group.
        U      L     U+L
A.      1      4       5
B.      2      5       7
C.     14      4      18
D.      3      7      10
      (20)   (20)    (40)

FV = (U + L) / 2n = 18 / 40 = .45
D = (U − L) / n = 10 / 20 = .50
The item has a facility value of .45 and a discrimination index of .50 and
appears to have functioned efficiently; the distractors attract the poorer
students but not the better ones.
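An extended answer analysis of this kind is easy to automate once the choices of each group have been counted. The sketch below is illustrative only: it assumes that the four rows of counts belong to options A to D in order and that C is the correct option, and the names choices, KEY and suspect_distractors are hypothetical.

```python
# Option -> (chosen by upper group, chosen by lower group).
choices = {"A": (1, 4), "B": (2, 5), "C": (14, 4), "D": (3, 7)}
KEY = "C"          # assumption: the option counted as correct
N_PER_GROUP = 20   # candidates in each of the upper and lower groups


def suspect_distractors(option_counts, key):
    """Distractors chosen by more upper-group than lower-group candidates."""
    return [
        option for option, (upper, lower) in option_counts.items()
        if option != key and upper > lower
    ]


upper_correct, lower_correct = choices[KEY]
fv = (upper_correct + lower_correct) / (2 * N_PER_GROUP)
d = (upper_correct - lower_correct) / N_PER_GROUP

print(fv, d, suspect_distractors(choices, KEY))
```

An empty list of suspect distractors means that every wrong option attracted more of the lower group than of the upper group.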
The performance of the following item with a low discrimination index is of
particular interest:
Mr Watson wants to meet a friend in Singapore this year. He .......... him for
ten years.
B. had known C. knows D. has known
(40)
While distractor C appears to be performing well, it is clear that distractors
A and B are attracting the wrong candidates (i.e. the better ones). On closer
scrutiny, it will be found that both of these options may be correct in certain
contexts: for example, a student may envisage a situation in which Mr Watson is
going to visit a friend whom he had known for ten years in England but who now
lives in Singapore,
e.g.
he lived England).
The same justification applies for option B.
The following item should have functioned efficiently but failed to do so; an
examination of the testees’ answers leads us to guess that possibly many had
been taught to use the past perfect tense to indicate an action in the past
taking place before another action in the past. Thus, while the results
obtained from the previous item reflect on the item itself, the results here
possibly reflect on the teaching:
died in 1963.
has been C. was D. had been
(40)
distractors C and D in the following item is far too low; a full item analysis
suggests only too strongly that they have been added simply to complete the
number of options required.
Yes, it was. D. Yes, was
it.
(40)
The item could be made slightly more difficult and thus improved by replacing
distractor C by Yes, he wasn’t and D by Yes, it wasn’t. The item is still
imperfect, but the difficulty level of the distractors will probably correspond
more closely to the level of attainment being tested.
The purpose of obtaining test statistics is to assist interpretation of item
and test results in a way which is meaningful and significant. Provided that
such statistics lead the teacher or test constructor to focus once again on the
content of the test, then item analysis is an extremely valuable exercise.
MODERATING
The importance of moderating classroom tests as well as public examinations
cannot be stressed too greatly. No matter how experienced test writers are,
they are usually so deeply involved in their work that they become incapable of
standing back and viewing the items with any real degree of objectivity. There
are bound to be many blind spots in tests, especially in the field of objective
testing, where the items sometimes contain only the minimum of context.
It is essential, therefore, that the test writer submits the test for
moderation to a colleague or, preferably, to a number of colleagues.
Achievement and proficiency tests of English administered to a large test
population are generally moderated by a board consisting of linguists, language
teachers, a psychologist, a statistician, etc. The purpose of such a board is
to scrutinize as closely as possible not only each item comprising the test but
also the test as a whole, so that the most appropriate and efficient measuring
instrument is produced for the particular purpose at hand. In these cases,
moderation is also frequently concerned with the scoring of the test and with
the evaluation of the test results.
ITEM CARDS AND BANKS
The construction of objective tests necessitates taking a great deal of time
and trouble. Although the scoring of such tests is simple and straightforward,
further effort is then spent on the evaluation of each item and on improving
those items which do not perform satisfactorily. It seems somewhat illogical,
therefore, to dispense with test items once they have appeared in a test.
One convenient way of recording and storing items (together with any relevant
information) is by means of small cards. Only one item is entered on each card;
on the reverse side of the card information derived from an item analysis is
recorded: e.g. the facility value (FV), the index of discrimination (D), and an
extended answer analysis (if carried out). After being arranged according to
the element or skill which they are intended to test, the items on the separate
cards are grouped according to difficulty level, the particular area tested,
etc. It is an easy task to arrange them for quick reference according to
whatever system is desired. Furthermore, the cards can be rearranged at any
later date.
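The same card system translates naturally into a small data structure. The following Python sketch is only one possible arrangement: the class name ItemCard, its field names and the arrange function are all assumptions made here, and the sample values are purely illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class ItemCard:
    """One test item (front of the card) with its analysis (reverse side)."""
    item_text: str
    element: str                # element or skill the item is intended to test
    facility_value: float       # FV
    discrimination: float       # D
    answer_analysis: dict = field(default_factory=dict)  # option -> (U, L)


def arrange(cards):
    """Group cards by element tested, then order them from easiest to hardest."""
    return sorted(cards, key=lambda card: (card.element, -card.facility_value))


# Illustrative cards only; the item texts are placeholders.
bank = [
    ItemCard("(item text)", "prepositions", 0.53, 0.45),
    ItemCard("(item text)", "tenses", 0.35, 0.50),
    ItemCard("(item text)", "prepositions", 0.80, 0.40),
]
ordered = arrange(bank)
for card in ordered:
    print(card.element, card.facility_value, card.discrimination)
```

Because the ordering is computed on demand, the cards can be rearranged under a different system simply by changing the sort key.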
Although it will obviously take considerable time to build up an item bank
consisting of a few hundred items, such an item bank will prove of enormous
value and will save the teacher a great deal of time and trouble later. The
same items can be used many times, the order of the items (or options within
each item) being changed each time. If there is concern about test security or
if there is any other reason indicating the need for new items, many of the
existing items can be rewritten. In such cases, the same options are generally
kept, but the context is changed so that one of the distractors now becomes the
correct option. Multiple-choice items testing most areas of the various
language elements and skills can be rewritten in this way, e.g.
soon.
would tell

wish you ……. us your secret soon.
would tell
…… for the party?
clothings

…….. is your new suit made of?
clothings

Yes, I am.
To help my mother.
By bus.

hears: How are you going to David’s?
Yes, I am.
To help my mother.
By bus.
|

Items rewritten in this way become new items, and thus it will be necessary to
collect facility values and discrimination indices again.
These are all ways of making maximum use of the various types of test items
which have been constructed, administered and evaluated. In any case, however,
the effort spent on constructing tests of English as a second or foreign
language is never wasted since the insights provided into language behavior as
well as into language learning and teaching will always be invaluable in any
situation connected with either teaching or testing.