An Interval Measure of Election Poll Accuracy

By Joseph Shipman, PhD, Director of Election Polling, SurveyUSA,
and Jay H. Leve, Editor, SurveyUSA
Abstract: Measures of election poll accuracy, starting with Mosteller's pioneering work in 1949, have always involved "point estimates" of error, using a single set of predicted vote percentages. A known drawback of any such measure is that scholars must either make an assumption about how to allocate "undecided" voters or choose to ignore them. A new measure is proposed that uses an "interval estimate" to account for every possible allocation of undecided voters, rather than ignoring them or allocating them arbitrarily. Advantages and disadvantages of this measure are discussed.
Contents

1. Six Pollsters In Search Of An Arbiter
2. The "Undecided" Issue
3. The Work Of Mosteller
4. Comparison Of 6 Mosteller Measures
5. Mapping Each Measure
6. Relative Accuracy Vs. Absolute Accuracy
7. Traugott's Measure
8. A New Concept: Interval Estimation
9. A New Measure: Median Spread Error
10. Advantages And Disadvantages
11. The "Undecided" Issue Revisited
12. Conclusion
1. Six Pollsters In Search Of An Arbiter
Consider an election between two candidates, Smith and Jones, with no write-ins allowed. The outcome of this election is a pair of numbers which sum to 100%, for example: Smith 60%, Jones 40%. This is about as simple as an election can get.
Now suppose that the day
before the election, six different pollsters release polls, all with the same
field period.
The 6 polls are as follows:

                Smith   Jones   Undecided
Pollster #1       60      32        8
Pollster #2       54      36       10
Pollster #3       61      36        3
Pollster #4       54      40        6
Pollster #5       56      36        8
Pollster #6       57      43        0

Actual Vote       60      40      n/a
The following graph plots each pollster's projection, with Smith on the X axis and Jones on the Y axis, and the actual outcome (Smith 60, Jones 40) at the center point of the graph.
Which poll was the most
accurate?
The point of this paper is:
this seemingly simple question is remarkably difficult to answer.
Let's ask the pollsters who
did the best. Here's what they say.
Pollster #1: “My poll was the most accurate. I was the only one
to get the winner's vote total exactly right.”
Pollster #2: “My poll was the most accurate. Smith won by a 3:2
margin, and only my poll got that ratio exactly right. If my undecideds are
taken out, the rest of my sample voted 60 to 40 for Smith, a bullseye.”
Pollster #3: “Well, you didn't take out undecideds, did you? My
poll was the most accurate: I was off by 1 point for Smith, and by 4 points for
Jones, so my average error was 2.5 points per candidate, which was better than
everyone else.”
Pollster #4: “Wait a minute. You shouldn't be looking at how many
points off you were, you should look at percentage error. I underestimated Smith
by 10% of his actual vote (he got 60% and I said 54%), and I was exactly right
on Jones, so my average error was 5% of each candidate's vote, which was better
than everyone else. My poll was the most accurate.”
Pollster #5: “Nobody cares about the individual candidate vote
predictions, they care about the margin of victory. Smith won by 20 points, and
I was the only one who said he'd win by 20, so my poll was the most accurate.”
Pollster #6: “You didn't really say he'd win by 20, because you didn't say what 8% of the voters were going to do – it’s only a 20-point margin if your undecideds happen to split 50-50. That's not a real prediction. I made
a real prediction, and I was within 3 points for both candidates,
and all the rest of you were further away for one or both candidates. My poll
was the most accurate.”
Hmmm.
2. The “Undecided” Issue
A big part of the difficulty
is accounting for “undecided” voters. If all the pollsters had reported
predictions for Smith and Jones that summed to 100%, the comparisons would be
easier, because all the points in the graph would fall on a single line rather
than being spread in two dimensions. But allocating undecided voters is not a
mathematical exercise; rather, it is subject to caprice and whim. With the benefit
of hindsight, Pollster 2 self-servingly recommends “proportional allocation,” awarding the undecideds to the candidates in proportion to their actual votes, while Pollster 5 self-servingly prefers “equal allocation,” where each candidate gets
half the “undecided” voters. But neither said so before the votes were counted.
This is not to deny that there is such a thing as an undecided voter, nor that predicting some elections involves greater uncertainty than others because voter preferences are less established. But for the purposes of evaluating the accuracy of polls, we must compare the polls with the actual election outcomes, where there are no “undecideds.”
It is possible to take the
attitude of Pollster #6, that vote predictions should sum to 100%. If the poll
detects a high degree of indecision among the potential voters, this can be
reported separately, in the same way a “Margin of Error” is reported as a
separate index of the reliability of a prediction. But as long as the usual
practice is to report “undecideds” as a subgroup of the electorate of a
particular size, measures of poll accuracy must deal with this.
3. The Work Of Mosteller
Following the 1948 presidential election, a commission was formed to study the failure of the polls to predict Truman's reelection. The resulting book, The Pre-election Polls of 1948: Report to the Committee on Analysis of Pre-election Polls and Forecasts, by Frederick Mosteller et al., was published by the Social Science Research Council (New York, 1949). In this book, eight different ways to measure the accuracy of a pre-election poll were proposed (for shorthand herein and going forward: “Mosteller 1” through “Mosteller 8”).[1]
Though it has been more than 50 years since Mosteller proposed his measures, they remain today the “default” method by which pre-election polls are evaluated.
The six Mosteller measures
which depend on predicted vote percentages are all “error measures,” where a
smaller score indicates a smaller error and, as such, a more accurate poll. The
six measures are defined as follows:
Mosteller 1: The difference in percentage points between the
winner's predicted and actual proportion of the total votes cast.
Mosteller 2: The difference in percentage points between the
winner's predicted and actual proportions of the votes received by the top two
candidates.
Mosteller 3: The average deviation in percentage points between
predicted and actual returns for each candidate (without regard to sign).
Mosteller 4: The average percentage error (averaging the deviations
from 100 percent of the ratio of predicted to actual proportion).
Mosteller 5: The (unsigned) difference of the oriented
differences between predicted and actual percentage results for the top two
candidates.
Mosteller 6: The maximum observed difference between predicted
and actual percentage results for any candidate.
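These six definitions can be made concrete with a short computation. The following is a minimal sketch for a two-candidate race, assuming the winner's figures are listed first; the function name is ours, not Mosteller's.

```python
def mosteller_errors(pred_win, pred_lose, act_win, act_lose):
    """Return the Mosteller 1..6 errors, in percentage points.

    Inputs are predicted and actual vote percentages, winner first.
    """
    # M1: winner's predicted vs actual share of the total vote
    m1 = abs(pred_win - act_win)
    # M2: winner's predicted vs actual share of the top-two vote
    m2 = abs(100 * pred_win / (pred_win + pred_lose)
             - 100 * act_win / (act_win + act_lose))
    # M3: average unsigned deviation per candidate
    m3 = (abs(pred_win - act_win) + abs(pred_lose - act_lose)) / 2
    # M4: average percentage error, relative to each candidate's actual vote
    m4 = (100 * abs(pred_win - act_win) / act_win
          + 100 * abs(pred_lose - act_lose) / act_lose) / 2
    # M5: unsigned difference of the predicted and actual spreads
    m5 = abs((pred_win - pred_lose) - (act_win - act_lose))
    # M6: largest deviation for any single candidate
    m6 = max(abs(pred_win - act_win), abs(pred_lose - act_lose))
    return m1, m2, m3, m4, m5, m6

# Pollster #3 from the example (Smith 61, Jones 36; actual 60-40):
errors = mosteller_errors(61, 36, 60, 40)
print(["%.2f" % e for e in errors])
# → ['1.00', '2.89', '2.50', '5.83', '5.00', '4.00']
```

The printed values match the M1 through M6 columns for Pollster #3 in the comparison tables below.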
4. Comparison Of 6 Mosteller Measures
Before we try to parse these definitions, it is helpful to see what the six error measures have to say about the polls discussed above. An asterisk marks the lowest error for each measure:

Pollster   Smith  Jones    M1      M2      M3      M4      M5      M6
P1           60     32    0.00*   5.22    4.00   10.00    8.00    8.00
P2           54     36    6.00    0.00*   5.00   10.00    2.00    6.00
P3           61     36    1.00    2.89    2.50*   5.83    5.00    4.00
P4           54     40    6.00    2.55    3.00    5.00*   6.00    6.00
P5           56     36    4.00    0.87    4.00    8.33    0.00*   4.00
P6           57     43    3.00    3.00    3.00    6.25    6.00    3.00*
Actual       60     40    0.00    0.00    0.00    0.00    0.00    0.00
The following chart ranks each pollster’s performance from best (a ranking of 1) to worst (a ranking of 6), using each measure. An asterisk marks the best pollster for each given measure:

Pollster   Smith  Jones   M1 Rank  M2 Rank  M3 Rank  M4 Rank  M5 Rank  M6 Rank
P1           60     32      1*       6        4        5        6        6
P2           54     36      5        1*       6        5        2        4
P3           61     36      2        4        1*       2        3        2
P4           54     40      5        3        2        1*       4        4
P5           56     36      4        2        4        4        1*       2
P6           57     43      3        5        2        3        4        1*
It is not hard to see that
the six Mosteller measures correspond respectively to the six pollsters, each
of whom described a standard by which his/her own poll was the most accurate.
Some other observations
about the six measures:
Measures 2 and 5 implicitly
allocate undecided voters: M2 proportionately, M5 equally.
The “actual election result”
always wins under measures 3, 4, and 6, but measures 1, 2, and 5 can give a “perfect”
score to a poll whose predicted numbers differed from the actual ones.
5. Mapping Each Measure
A better understanding of
how each measure “works” can be obtained by using “contour maps,” which show,
for each measure, families of curves representing polls which score the same by
that measure.
Contour maps follow:
For Mosteller 1, the contour
“curve” is a vertical line, because only the vote for Smith is relevant. The
point representing Poll 1 is directly below the point representing “actual
results,” and so has the lowest error and the best ranking.
For Mosteller 2, the curves
are lines representing constant vote ratios, all of which, if extended, would
pass through the point (0,0) which is outside of the illustrated range.
For Mosteller 3, the curves
are “diamonds,” shrinking in size so that only the point (60,40) gets a perfect
score.
Mosteller 4 is similar to
Mosteller 3, except that the “diamonds” are stretched. The horizontal and
vertical directions are no longer symmetrical, which corresponds to Mosteller
4's defining error as a proportion of the vote for each candidate rather than
of the overall total.
Mosteller 5’s curves of equivalent polls are straight lines, but instead of converging to a point as in Mosteller 2 they are parallel, all at 45-degree angles to the horizontal. By this measure, Poll 5 does the best because it is on the same line as the “actual” result.
Finally, Mosteller 6 shows
concentric squares rather than diamonds, because only the larger candidate
deviation counts, not the average.
6. Relative Accuracy Vs. Absolute Accuracy
There is a dichotomy here.
Measures 2 and 5 are concerned with getting the relative
performance of the two candidates correct. They implicitly allocate undecideds,
and don't penalize polls with high undecideds. Measures 1, 3, 4, and 6 look at absolute
differences between predicted and actual percentages for individual
candidates, and polls with higher proportions of undecideds will be more likely
to score poorly.
A good case can be made for
both types of measure. To reprise the essential claims from our Pollster Forum:
Pollster #5: “My poll was relatively perfect.”
Pollster #6: “My poll was absolutely the most
accurate.”
The drawbacks of the two
types of measure are also clear. Relative measures fail to give “extra
credit” to a poll that gets it exactly right – under Mosteller 5, Smith 56,
Jones 36 is just as good as Smith 60, Jones 40 even though 60 to 40 was how the
actual vote went. On the other hand, absolute measures don’t distinguish a poll in which the candidates are misestimated in the same direction from one in which they are misestimated in opposite directions, even though the implications for which candidate is “winning” may be quite different.
7. Traugott’s Measure
At the 2003 Annual Meeting
of the American Association for Public Opinion Research, in Nashville, the
paper A Review and Proposal for a New Measure of Poll Accuracy was
presented. The authors, Michael Traugott, Elizabeth Martin, and Courtney
Kennedy, defined their measure as the absolute value of the natural log of the “odds
ratio” between predicted and actual candidate percentages, expressed in
percentage units.
The “odds ratio” is a ratio
of ratios. Like the Mosteller 2 measure, it depends only on the relative
proportions of votes for the top 2 candidates, but it treats “predicted” and “actual”
data symmetrically, while the Mosteller 2 error measure will generally change
if the roles of “predicted” and “actual” numbers are reversed.
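The definition translates directly into a short calculation. The following is a minimal sketch, assuming the winner's figures are listed first; the function name is ours.

```python
import math

def traugott_error(pred_win, pred_lose, act_win, act_lose):
    """Traugott/Martin/Kennedy error: |ln(odds ratio)| in percentage units.

    The odds ratio compares the predicted winner/loser ratio
    to the actual winner/loser ratio.
    """
    odds_ratio = (pred_win / pred_lose) / (act_win / act_lose)
    return 100 * abs(math.log(odds_ratio))

# Pollster #1 (Smith 60, Jones 32) against the actual 60-40 result:
print(round(traugott_error(60, 32, 60, 40), 2))  # → 22.31
```

Note that swapping the "predicted" and "actual" arguments leaves the error unchanged, which is the symmetry property just described; Mosteller 2 lacks it.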
The contour graph for the
Traugott Measure is almost the same as for Mosteller 2:
Again, the curves of
equivalent polls are lines converging to the point (0,0). The differences are
subtle: although the order of the lines on each side of the “actual” point is
the same in both measures, the relative positions of equivalent lines on
opposite sides differ.
It turns out that, for the
purposes of distinguishing between more and less accurate polls, the Traugott
measure is almost identical to the Mosteller 2 measure. In the examples used in
this paper, the order is identical:
Pollster   Smith  Jones    M1      M2      M3      M4      M5      M6    Traugott
P1           60     32    0.00    5.22    4.00   10.00    8.00    8.00    22.31
P2           54     36    6.00    0.00    5.00   10.00    2.00    6.00     0.00
P3           61     36    1.00    2.89    2.50    5.83    5.00    4.00    12.19
P4           54     40    6.00    2.55    3.00    5.00    6.00    6.00    10.54
P5           56     36    4.00    0.87    4.00    8.33    0.00    4.00     3.64
P6           57     43    3.00    3.00    3.00    6.25    6.00    3.00    12.36
Actual       60     40    0.00    0.00    0.00    0.00    0.00    0.00     0.00
The following chart ranks
each pollster’s performance from best (a ranking of 1) to worst (a ranking of
6), using each measure:
Pollster   Smith  Jones   M1 Rank  M2 Rank  M3 Rank  M4 Rank  M5 Rank  M6 Rank  Traugott Rank
P1           60     32      1        6        4        5        6        6           6
P2           54     36      5        1        6        5        2        4           1
P3           61     36      2        4        1        2        3        2           4
P4           54     40      5        3        2        1        4        4           3
P5           56     36      4        2        4        4        1        2           2
P6           57     43      3        5        2        3        4        1           5
When poll predictions are rounded to the nearest whole percent, there are few cases where the Traugott measure and the Mosteller 2 measure give opposite answers. After allocating undecideds proportionately (which both M2 and Traugott assume), there are 99 possible rounded predictions for an election, ranging from 1–99 to 99–1 (the Traugott measure is undefined when one of the percentages is 0).

In the case of a 60-to-40 election, we compared all 99 × 99 = 9,801 pairs of possible predictions. There are no points in the graphed ranges where disagreement occurs. The closer an election is to 50-50, the fewer and more outlying are the cases where the Mosteller 2 and Traugott measures judge polls differently. For any possible election outcome (from Smith 99, Jones 1 to Smith 1, Jones 99), the Mosteller 2 and Traugott measures agree for at least 97% of the possible pairs of pollster predictions (considering only predictions rounded to the nearest whole percent).
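This near-equivalence can be checked directly on the six example polls: both measures order the pollsters identically. A minimal sketch (helper names are ours; both measures ignore undecideds by working with the top-two ratio, which is what proportional allocation amounts to):

```python
import math

ACTUAL = (60, 40)  # the actual Smith-Jones result in our running example

def m2_error(s, j):
    # Mosteller 2: winner's predicted vs actual share of the top-two vote
    a_s, a_j = ACTUAL
    return abs(100 * s / (s + j) - 100 * a_s / (a_s + a_j))

def traugott_error(s, j):
    # Traugott et al.: |ln(odds ratio)| in percentage units
    a_s, a_j = ACTUAL
    return 100 * abs(math.log((s / j) / (a_s / a_j)))

polls = {"P1": (60, 32), "P2": (54, 36), "P3": (61, 36),
         "P4": (54, 40), "P5": (56, 36), "P6": (57, 43)}

rank_m2 = sorted(polls, key=lambda p: m2_error(*polls[p]))
rank_tr = sorted(polls, key=lambda p: traugott_error(*polls[p]))
print(rank_m2 == rank_tr)  # → True
print(rank_m2)             # → ['P2', 'P5', 'P4', 'P3', 'P6', 'P1']
```

For these six polls the two orderings coincide exactly, matching the identical rank columns in the tables above.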
8. A New Concept: Interval Estimation
All the measures we have seen so far are “point estimates.” But a poll that reports “undecided” voters is saying that its predicted vote percentages are just lower bounds, declining to make a judgment on how the undecided votes will be allocated to the candidates; it is really predicting a range of possible outcomes.
It turns out that there is
one way to avoid making assumptions about how the undecideds will vote:
consider all the possible ways they might vote.
First, take Pollster #2, who predicted: Smith 54%, Jones 36%, Undecided 10%. The range of possible outcomes consistent with this prediction runs from [Smith 54%, Jones 46%] (all undecideds to Jones) to [Smith 64%, Jones 36%] (all undecideds to Smith).
At one end of these
possible ranges, the actual victory margin (or “spread”) of 20 percentage
points is underestimated by 12 points; at the other end, it is overestimated by
8 points. Somewhere in the middle is the 60 to 40 result that actually
occurred.
Depending on how undecideds
are treated, the poll’s error on the spread (which corresponds to the Mosteller
5 error in a poll with no undecideds) could be anywhere from 0 to 12 points.
Now, take Pollster #3, who predicted: Smith 61%, Jones 36%, Undecided 3%. The range of possible outcomes runs from [Smith 61%, Jones 39%] to [Smith 64%, Jones 36%].
The range of possible
errors is 2 points to 8 points (zero is not a possibility, because Smith was
overestimated and is still overestimated even when all undecideds are allocated
to Jones).
In all these cases, the
possible outcomes form a continuous range or “interval.” We would like to use
this entire interval to define a new error measure.
9. A New Measure: Median Spread Error
We define a new error
measure as follows:
The “Median Spread Error” of a poll is the average of the smallest and largest possible spread errors, considering all possible allocations of undecided voters.
In the case of Pollster #2,
we have seen that the range of possible errors is 0 to 12 points, so we take
the median: 6 points.
In the case of Pollster #3,
the range of possible errors is 2 to 8 points, so we take the median: 5 points.
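The computation can be sketched in a few lines. The achievable spreads form an interval, the smallest and largest spread errors occur at or within its endpoints, and the measure is their midpoint. The function name is ours, and this assumes a two-candidate race.

```python
def median_spread_error(pred_win, pred_lose, undecided, actual_spread):
    """Average of the smallest and largest possible spread errors,
    over all allocations of the undecided voters."""
    # Achievable spreads: all undecideds to the loser ... all to the winner.
    lo_spread = (pred_win - pred_lose) - undecided
    hi_spread = (pred_win - pred_lose) + undecided
    err_lo = abs(actual_spread - lo_spread)
    err_hi = abs(actual_spread - hi_spread)
    # If the actual spread lies inside the interval, the minimum error is 0;
    # the maximum error is always attained at one of the endpoints.
    if lo_spread <= actual_spread <= hi_spread:
        min_err = 0
    else:
        min_err = min(err_lo, err_hi)
    max_err = max(err_lo, err_hi)
    return (min_err + max_err) / 2

# Pollster #2 (54-36, 10 undecided) and #3 (61-36, 3 undecided); actual spread 20:
print(median_spread_error(54, 36, 10, 20))  # → 6.0
print(median_spread_error(61, 36, 3, 20))   # → 5.0
```

These reproduce the two worked examples above: error range 0 to 12 for Pollster #2, and 2 to 8 for Pollster #3.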
Here is a comparison of
this measure with the others:
Pollster   Smith  Jones    M1      M2      M3      M4      M5      M6    Traugott  Shipman
P1           60     32    0.00    5.22    4.00   10.00    8.00    8.00    22.31     8.00
P2           54     36    6.00    0.00    5.00   10.00    2.00    6.00     0.00     6.00
P3           61     36    1.00    2.89    2.50    5.83    5.00    4.00    12.19     5.00
P4           54     40    6.00    2.55    3.00    5.00    6.00    6.00    10.54     6.00
P5           56     36    4.00    0.87    4.00    8.33    0.00    4.00     3.64     4.00
P6           57     43    3.00    3.00    3.00    6.25    6.00    3.00    12.36     6.00
Actual       60     40    0.00    0.00    0.00    0.00    0.00    0.00     0.00     0.00
The corresponding rankings, from best (1) to worst (6):

Pollster   Smith  Jones   M1 Rank  M2 Rank  M3 Rank  M4 Rank  M5 Rank  M6 Rank  Traugott Rank  Shipman Rank
P1           60     32      1        6        4        5        6        6           6             6
P2           54     36      5        1        6        5        2        4           1             3
P3           61     36      2        4        1        2        3        2           4             2
P4           54     40      5        3        2        1        4        4           3             3
P5           56     36      4        2        4        4        1        2           2             1
P6           57     43      3        5        2        3        4        1           5             3
The Median Spread Error measure has the following key property: it behaves like a relative measure if there is no way to allocate undecideds that gets the election exactly right; but if there is, it behaves like an absolute measure and penalizes undecideds.
Here is the contour graph
for the Median Spread Error Measure:
In the lower-left quadrant, the graph resembles the Mosteller 6 graph of concentric squares. In the other three quadrants, the contour lines are identical to the ones for the Mosteller 5 measure.
The most accurate poll here
is Poll 5, so the Shipman measure agrees with Mosteller 5 on the winner in this
case. However, Mosteller 5 prefers Poll 2 (M5 error = 2 points) to Poll 3 (M5
error = 5 points), while the Median Spread Error Measure prefers Poll 3 to Poll
2 because of its lower “undecided” vote. The more polls differ in their
undecided votes, the more the Median Spread Error and Mosteller 5 measures
diverge.
10. Advantages And Disadvantages
The “Median Spread Error” represents a compromise between relative and absolute measures of accuracy. Because it is based on the spread error (the Mosteller 5 measure), it rewards getting the relative vote percentages of the candidates right. But it also rewards getting “absolute” percentages right, because even if the spread is correct, the existence of undecideds means the spread was not singled out as one definite prediction. As with the absolute measures M3, M4, and M6, the only way to get a perfect score is to get both candidates' vote percentages exactly right.
Thus, in our Smith 60%,
Jones 40% example, a pollster who predicts [Smith 59%, Jones 39%, Undecided 2%]
has done a materially better job of polling the contest than a competitor who
predicts [Smith 49%, Jones 29%, Undecided 22%] – even though Mosteller 5, the
measure most widely in use today, would assign both pollsters the identical
perfect score of “zero error” (20 minus 20 = 0).
One disadvantage of this
new Median Spread Error Measure is that it is more complicated to define and
calculate than some of the Mosteller measures. Another disadvantage is that, in
elections with more than two candidates, the measure cannot be defined if the
only data available to the scholar is the predicted vote percentages for the
top 2 candidates. To calculate the Median Spread Error measure, the specific
number of reported undecided voters must be known. This is not necessary to
calculate Mosteller’s measures.
11. The “Undecided” Issue Revisited
Is it the authors’ recommendation that pollsters allocate undecided voters in their final pre-election poll? No. Specifically, it is not.
The authors believe that allocating undecided voters on a poll-by-poll basis is an invitation to manipulate actual poll data to align with whatever hunch or intuition the pollster may have about the outcome of the contest. Further, it is an invitation for a pollster to change his/her results to be more in line with other pollsters’, so that the pollster’s final prediction is not an outlier.
Allocating “undecided” voters using a systematic, predefined formula that does not change poll-by-poll (e.g., always proportionately, or always equally among the top two candidates, or always 2/3 to the challenger and 1/3 to the incumbent if there is an incumbent) is less idiosyncratic, but still requires the pollster to make an arbitrary determination that is exogenous to the gathered poll data.
The authors believe that a relatively high number of undecided voters in a poll released close to an election is an indication that the questionnaire was not designed properly, and/or that the screening of voters was not conducted with enough rigor. Well-designed screening questions and well-written “who will you vote for” questions should, as a natural byproduct, produce lower undecideds in a final pre-election poll, all other things being equal. The solution is not, as some have recommended, for the pollster to make up numbers on election eve for the purpose of eliminating the undecideds, but rather to craft the survey instrument in such a way that it naturally results in fewer and fewer undecideds as the election draws near.
Pollsters who report results with comparatively few undecided voters in fact “stick their necks out” further than those who report larger numbers of undecideds. The Median Spread Error measure defined here rewards the pollster whose estimate is not just the most precise, but whose numbers leave him/her the least amount of wiggle room.
12. Conclusion
It is difficult to measure the accuracy of election polls, because assumptions must be made about how to allocate undecided voters. Mosteller (1949) defined eight accuracy measures, six of which are examined here; they can give divergent results, and there is a tension between “relative” and “absolute” measures. Contour graphs can illuminate how the accuracy measures work, and show that a new measure defined by Traugott et al. in 2003 does not evaluate election polls significantly differently from the Mosteller 2 measure. A new accuracy measure, “Median Spread Error,” considers the entire range of possible allocations of undecided voters. It avoids arbitrary assumptions about the allocation of undecideds and achieves a compromise between “relative” and “absolute” accuracy.
[1] Mosteller 7 error is calculated using sample size, and Mosteller 8 error is calculated using electoral votes. These two measures are not discussed in this paper.