An Interval Measure of Election Poll Accuracy

 

By

Joseph Shipman, PhD

Director of Election Polling

SurveyUSA

 

And

Jay H. Leve

Editor

SurveyUSA

 

Abstract: Measures of election poll accuracy, starting with Mosteller’s pioneering work in 1949, have always involved “point estimates” of error, using a single set of predicted vote percentages. A known drawback of any such measure is that scholars must either make an assumption about how to allocate “undecided” voters or ignore them altogether. A new measure is proposed which uses an “interval estimate” to account for every possible allocation of undecided voters, rather than ignoring them or allocating them arbitrarily. Advantages and disadvantages of this measure are discussed.

 

Contents

 

1.        Six Pollsters In Search Of An Arbiter

2.        The “Undecided” Issue

3.        The Work Of Mosteller

4.        Comparison Of 6 Mosteller Measures

5.        Mapping Each Measure

6.        Relative Accuracy Vs Absolute Accuracy

7.        Traugott’s Measure

8.        A New Concept: Interval Estimation

9.        A New Measure: Median Spread Error

10.     Advantages And Disadvantages

11.     The “Undecided” Issue Revisited

12.     Conclusion

 


1.       Six Pollsters In Search Of An Arbiter

 

Consider an election between two candidates, Smith and Jones, with no write-ins allowed. The outcome of this election is a pair of numbers which sum to 100%, for example:

 

Smith 60%, Jones 40%

This is about as simple as an election can get.

 

Now suppose that the day before the election, six different pollsters release polls, all with the same field period.

 

The 6 polls are as follows:

 

 

              Smith    Jones    Undecided
Pollster #1    60       32         8
Pollster #2    54       36        10
Pollster #3    61       36         3
Pollster #4    54       40         6
Pollster #5    56       36         8
Pollster #6    57       43         0

Actual Vote    60       40        n/a

 

The following graph plots each pollster’s projection, with Smith on the X axis and Jones on the Y axis, and the actual outcome (Smith 60, Jones 40) at the center-point of the graph.



Which poll was the most accurate?

 

The point of this paper is: this seemingly simple question is remarkably difficult to answer.

 

Let's ask the pollsters who did the best. Here's what they say.

 

Pollster #1: “My poll was the most accurate. I was the only one to get the winner's vote total exactly right.”

 

Pollster #2: “My poll was the most accurate. Smith won by a 3:2 margin, and only my poll got that ratio exactly right. If my undecideds are taken out, the rest of my sample voted 60 to 40 for Smith, a bullseye.”

 

Pollster #3: “Well, you didn't take out undecideds, did you? My poll was the most accurate: I was off by 1 point for Smith, and by 4 points for Jones, so my average error was 2.5 points per candidate, which was better than everyone else.”

 

Pollster #4: “Wait a minute. You shouldn't be looking at how many points off you were, you should look at percentage error. I underestimated Smith by 10% of his actual vote (he got 60% and I said 54%), and I was exactly right on Jones, so my average error was 5% of each candidate's vote, which was better than everyone else. My poll was the most accurate.”

 

Pollster #5: “Nobody cares about the individual candidate vote predictions, they care about the margin of victory. Smith won by 20 points, and I was the only one who said he'd win by 20, so my poll was the most accurate.”

 

Pollster #6: “You didn't really say he'd win by 20, because you didn't say what 8% of the voters were going to do – it’s only a 20-point margin if your undecideds happen to split 50-50. That's not a real prediction. I made a real prediction, and I was within 3 points for both candidates, and all the rest of you were further away for one or both candidates. My poll was the most accurate.”

 

Hmmm.

 

2.       The “Undecided” Issue

 

A big part of the difficulty is accounting for “undecided” voters. If all the pollsters had reported predictions for Smith and Jones that summed to 100%, the comparisons would be easier, because all the points in the graph would fall on a single line rather than being spread in two dimensions. But allocating undecided voters is not a mathematical exercise; rather, it is subject to caprice and whim. With the benefit of hindsight, Pollster 2 self-servingly recommends “proportional allocation,” awarding the undecideds to the candidates in proportion to their actual votes, while Pollster 5 self-servingly prefers “equal allocation,” where each candidate gets half the “undecided” voters. But neither said so before the votes were counted.

 

This is not to deny that undecided voters exist, nor that predicting some elections involves greater uncertainty than others because voter preferences are less established. But for the purposes of evaluating the accuracy of polls, we must compare the polls with the actual election outcomes, in which there are no “undecideds.”

 

It is possible to take the attitude of Pollster #6, that vote predictions should sum to 100%. If the poll detects a high degree of indecision among the potential voters, this can be reported separately, in the same way a “Margin of Error” is reported as a separate index of the reliability of a prediction. But as long as the usual practice is to report “undecideds” as a subgroup of the electorate of a particular size, measures of poll accuracy must deal with this.

 

3.       The Work Of Mosteller

 

Following the 1948 presidential election, a commission was formed to study the failure of polls to predict Truman's reelection. The resulting book, The Pre-election Polls of 1948: Report to the Committee on Analysis of Pre-Election Polls and Forecasts, by Frederick Mosteller et al., was published by the Social Science Research Council (New York, 1949). In this book, eight different ways to measure the accuracy of a pre-election poll were proposed (herein, “Mosteller 1” through “Mosteller 8”).[1]

 

Though it has been more than 50 years since Mosteller proposed his measures, they remain today the “default” method by which pre-election polls are evaluated.

 

The six Mosteller measures which depend on predicted vote percentages are all “error measures,” where a smaller score indicates a smaller error and, as such, a more accurate poll. The six measures are defined as follows:

 

Mosteller 1: The difference in percentage points between the winner's predicted and actual proportion of the total votes cast.

 

Mosteller 2: The difference in percentage points between the winner's predicted and actual proportions of the votes received by the top two candidates.

 

Mosteller 3: The average deviation in percentage points between predicted and actual returns for each candidate (without regard to sign).

 

Mosteller 4: The average percentage error (averaging the deviations from 100 percent of the ratio of predicted to actual proportion).

 

Mosteller 5: The (unsigned) difference of the oriented differences between predicted and actual percentage results for the top two candidates.

 

Mosteller 6: The maximum observed difference between predicted and actual percentage results for any candidate.
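For concreteness, the six definitions can be translated directly into code. The sketch below (function name and argument conventions are ours) assumes a two-candidate race with the winner listed first and undecideds excluded:

```python
def mosteller_errors(pred_w, pred_l, actual_w, actual_l):
    """Six Mosteller error measures for a two-candidate race.

    pred_w/pred_l: predicted percentages for the winner and loser;
    actual_w/actual_l: actual percentages (summing to 100).
    """
    m1 = abs(pred_w - actual_w)                          # winner's share of total vote
    m2 = abs(100 * pred_w / (pred_w + pred_l)            # winner's share of top-two vote
             - 100 * actual_w / (actual_w + actual_l))
    m3 = (abs(pred_w - actual_w) + abs(pred_l - actual_l)) / 2          # mean absolute deviation
    m4 = (abs(pred_w / actual_w - 1) + abs(pred_l / actual_l - 1)) / 2 * 100  # mean percentage error
    m5 = abs((pred_w - pred_l) - (actual_w - actual_l))  # spread (margin-of-victory) error
    m6 = max(abs(pred_w - actual_w), abs(pred_l - actual_l))            # worst candidate deviation
    return m1, m2, m3, m4, m5, m6

# Pollster #3 predicted Smith 61, Jones 36; the actual result was Smith 60, Jones 40
print([round(float(e), 2) for e in mosteller_errors(61, 36, 60, 40)])
# → [1.0, 2.89, 2.5, 5.83, 5.0, 4.0]
```

Running the function on each pollster's numbers reproduces the error table in the next section.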

 


 

4.       Comparison Of 6 Mosteller Measures

 

Before we try to parse these definitions, it is helpful to see what the six error measures have to say about the polls discussed above. An asterisk (*) marks the lowest error for each measure:

Pollster   Smith  Jones   M1 Error  M2 Error  M3 Error  M4 Error  M5 Error  M6 Error
P1          60     32      0.00*     5.22      4.00     10.00      8.00      8.00
P2          54     36      6.00      0.00*     5.00     10.00      2.00      6.00
P3          61     36      1.00      2.89      2.50*     5.83      5.00      4.00
P4          54     40      6.00      2.55      3.00      5.00*     6.00      6.00
P5          56     36      4.00      0.87      4.00      8.33      0.00*     4.00
P6          57     43      3.00      3.00      3.00      6.25      6.00      3.00*
Actual      60     40      0.00      0.00      0.00      0.00      0.00      0.00

 

The following chart ranks each pollster’s performance from best (a ranking of 1) to worst (a ranking of 6), using each measure. An asterisk (*) marks the best pollster for each measure:

Pollster   Smith  Jones   M1 Rank  M2 Rank  M3 Rank  M4 Rank  M5 Rank  M6 Rank
P1          60     32       1*       6        4        5        6        6
P2          54     36       5        1*       6        5        2        4
P3          61     36       2        4        1*       2        3        2
P4          54     40       5        3        2        1*       4        4
P5          56     36       4        2        4        4        1*       2
P6          57     43       3        5        2        3        4        1*

 

It is not hard to see that the six Mosteller measures correspond respectively to the six pollsters, each of whom described a standard by which his/her own poll was the most accurate.

 

Some other observations about the six measures:

 

Measures 2 and 5 implicitly allocate undecided voters: M2 proportionately, M5 equally.

 

The “actual election result” always wins under measures 3, 4, and 6, but measures 1, 2, and 5 can give a “perfect” score to a poll whose predicted numbers differed from the actual ones.

 

5.       Mapping Each Measure

 

A better understanding of how each measure “works” can be obtained by using “contour maps,” which show, for each measure, families of curves representing polls which score the same by that measure.

 

Contour maps follow:

 


 

 

 

For Mosteller 1, the contour “curve” is a vertical line, because only the vote for Smith is relevant. The point representing Poll 1 is directly below the point representing “actual results,” and so has the lowest error and the best ranking.


 

 

For Mosteller 2, the curves are lines representing constant vote ratios, all of which, if extended, would pass through the point (0,0) which is outside of the illustrated range.

 

 


 

For Mosteller 3, the curves are “diamonds,” shrinking in size so that only the point (60,40) gets a perfect score.

 

 


 

 

Mosteller 4 is similar to Mosteller 3, except that the “diamonds” are stretched. The horizontal and vertical directions are no longer symmetrical, which corresponds to Mosteller 4's defining error as a proportion of the vote for each candidate rather than of the overall total.

 

 


 

 

Mosteller 5’s curves of equivalent polls are straight lines, but instead of converging to a point as in Mosteller 2 they are parallel, all at 45 degree angles to the horizontal. By this measure, Poll 5 does the best because it is on the same line as the “actual” result.

 

 


 

Finally, Mosteller 6 shows concentric squares rather than diamonds, because only the larger candidate deviation counts, not the average.

 

 

 

 


6.       Relative Accuracy Vs Absolute Accuracy

 

There is a dichotomy here. Measures 2 and 5 are concerned with getting the relative performance of the two candidates correct. They implicitly allocate undecideds, and don't penalize polls with high undecideds. Measures 1, 3, 4, and 6 look at absolute differences between predicted and actual percentages for individual candidates, and polls with higher proportions of undecideds will be more likely to score poorly.

 

A good case can be made for both types of measure. To reprise the essential claims from our Pollster Forum:

 

Pollster #5: “My poll was relatively perfect.”

 

Pollster #6: “My poll was absolutely the most accurate.”

 

The drawbacks of the two types of measure are also clear. Relative measures fail to give “extra credit” to a poll that gets it exactly right – under Mosteller 5, Smith 56, Jones 36 is just as good as Smith 60, Jones 40 even though 60 to 40 was how the actual vote went. On the other hand, absolute measures don’t distinguish between a poll in which the candidates are mis-estimated in the same direction from one in which they were mis-estimated in opposite directions, even though the implications for which candidate is “winning” may be quite different.

 

7.       Traugott’s Measure

 

At the 2003 Annual Meeting of the American Association for Public Opinion Research, in Nashville, the paper A Review and Proposal for a New Measure of Poll Accuracy was presented. The authors, Michael Traugott, Elizabeth Martin, and Courtney Kennedy, defined their measure as the absolute value of the natural log of the “odds ratio” between predicted and actual candidate percentages, expressed in percentage units.

 

The “odds ratio” is a ratio of ratios. Like the Mosteller 2 measure, it depends only on the relative proportions of votes for the top 2 candidates, but it treats “predicted” and “actual” data symmetrically, while the Mosteller 2 error measure will generally change if the roles of “predicted” and “actual” numbers are reversed.
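In code (a minimal sketch; the function name is ours), with the same winner-first convention used above:

```python
import math

def traugott_error(pred_w, pred_l, actual_w, actual_l):
    """Absolute natural log of the odds ratio between predicted and
    actual top-two percentages, expressed in percentage units."""
    odds_ratio = (pred_w / pred_l) / (actual_w / actual_l)
    return 100 * abs(math.log(odds_ratio))

# Pollster #1 predicted Smith 60, Jones 32; the actual result was Smith 60, Jones 40
print(round(traugott_error(60, 32, 60, 40), 2))   # → 22.31
```

Note that swapping the predicted and actual pairs inverts the odds ratio, and the absolute value of the log is unchanged; this is the symmetry property that distinguishes the measure from Mosteller 2.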

 

The contour graph for the Traugott Measure is almost the same as for Mosteller 2:


 

 

 

Again, the curves of equivalent polls are lines converging to the point (0,0). The differences are subtle: although the order of the lines on each side of the “actual” point is the same in both measures, the relative positions of equivalent lines on opposite sides differ.

 


It turns out that, for the purposes of distinguishing between more and less accurate polls, the Traugott measure is almost identical to the Mosteller 2 measure. In the examples used in this paper, the order is identical:

 

Pollster   Smith  Jones   M1 Error  M2 Error  M3 Error  M4 Error  M5 Error  M6 Error  Traugott Error
P1          60     32      0.00      5.22      4.00     10.00      8.00      8.00        22.31
P2          54     36      6.00      0.00      5.00     10.00      2.00      6.00         0.00
P3          61     36      1.00      2.89      2.50      5.83      5.00      4.00        12.19
P4          54     40      6.00      2.55      3.00      5.00      6.00      6.00        10.54
P5          56     36      4.00      0.87      4.00      8.33      0.00      4.00         3.64
P6          57     43      3.00      3.00      3.00      6.25      6.00      3.00        12.36
Actual      60     40      0.00      0.00      0.00      0.00      0.00      0.00         0.00

 

The following chart ranks each pollster’s performance from best (a ranking of 1) to worst (a ranking of 6), using each measure:

 

Pollster   Smith  Jones   M1 Rank  M2 Rank  M3 Rank  M4 Rank  M5 Rank  M6 Rank  Traugott Rank
P1          60     32       1        6        4        5        6        6          6
P2          54     36       5        1        6        5        2        4          1
P3          61     36       2        4        1        2        3        2          4
P4          54     40       5        3        2        1        4        4          3
P5          56     36       4        2        4        4        1        2          2
P6          57     43       3        5        2        3        4        1          5

 

When poll predictions are rounded to the nearest whole percent, there are few cases where the Traugott measure and the Mosteller 2 measure give opposite answers. After allocating undecideds proportionately (which both M2 and Traugott assume) there are 99 possible rounded predictions for an election, ranging from 1-99 to 99-1 (the Traugott measure is undefined when one of the percentages is 0).

 

In the case of a 60 to 40 election, we compared all 99*99=9801 pairs of possible predictions. There are no points in the graphed ranges where disagreement occurs. The closer an election is to 50-50, the fewer and more outlying are the cases where the Mosteller 2 and Traugott measures judge polls differently. For the case of any possible election outcome (extending from Smith 99, Jones 1; to Smith 1, Jones 99) the Mosteller 2 and Traugott measures always agree for at least 97% of the possible pairs of pollster predictions (considering only predictions rounded to the nearest whole percent).

 

8.       A New Concept: Interval Estimation

 

All the measures we have seen so far are “point estimates.” But polls which have “undecided” voters are saying that the predicted vote percentages are just lower bounds, and declining to make a judgment on how the undecided votes will be allocated to the candidates, so they are really predicting a range of possible outcomes.

 

It turns out that there is one way to avoid making assumptions about how the undecideds will vote: consider all the possible ways they might vote.

 

First, take Pollster #2, who predicted: Smith 54%, Jones 36%, Undecided 10%.

 

The range of possible outcomes consistent with this prediction is:

 

Smith: from 54% to 64%; Jones: from 46% down to 36% (each possible outcome sums to 100%, so the predicted spread ranges from 8 to 28 points).

At one end of these possible ranges, the actual victory margin (or “spread”) of 20 percentage points is underestimated by 12 points; at the other end, it is overestimated by 8 points. Somewhere in the middle is the 60 to 40 result that actually occurred.

 

Depending on how undecideds are treated, the poll’s error on the spread (which corresponds to the Mosteller 5 error in a poll with no undecideds) could be anywhere from 0 to 12 points.
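That interval can be verified by brute force. The sketch below (names ours) allocates the undecided points to Smith one whole point at a time; a continuous allocation would fill in the intermediate values:

```python
def spread_error_range(smith, jones, undecided, actual_spread=20):
    """Smallest and largest spread errors over all whole-point
    allocations of the undecided voters."""
    errors = []
    for to_smith in range(undecided + 1):
        # Whatever Smith does not get of the undecideds goes to Jones
        spread = (smith + to_smith) - (jones + undecided - to_smith)
        errors.append(abs(spread - actual_spread))
    return min(errors), max(errors)

# Pollster #2: Smith 54, Jones 36, Undecided 10; actual spread 60 - 40 = 20
print(spread_error_range(54, 36, 10))   # → (0, 12)
```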

 

 

 

 

 


Now, take Pollster #3, who predicted: Smith 61%, Jones 36%, Undecided 3%.

 

The range of possible outcomes is:

 

Smith: from 61% to 64%; Jones: from 39% down to 36% (the predicted spread ranges from 22 to 28 points).

 

 

 

The range of possible errors is 2 points to 8 points (zero is not a possibility, because Smith was overestimated and is still overestimated even when all undecideds are allocated to Jones).

 

In all these cases, the possible outcomes form a continuous range or “interval.” We would like to use this entire interval to define a new error measure.


 

9.       A New Measure: Median Spread Error

 

We define a new error measure as follows:

 

The “Median spread error” of a poll is the average of the smallest and largest possible spread errors, considering all possible allocations of undecided voters.

 

In the case of Pollster #2, we have seen that the range of possible errors is 0 to 12 points, so we take the median: 6 points.

 

In the case of Pollster #3, the range of possible errors is 2 to 8 points, so we take the median: 5 points.
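The calculation generalizes as follows (a sketch under our naming; allocations are treated as continuous, so the smallest error is zero whenever the actual spread falls inside the possible interval):

```python
def median_spread_error(smith, jones, undecided, actual_spread):
    """Average of the smallest and largest possible spread errors over
    all possible allocations of the undecided voters."""
    low = (smith - jones) - undecided    # all undecideds go to the trailing candidate
    high = (smith - jones) + undecided   # all undecideds go to the leading candidate
    if low <= actual_spread <= high:
        smallest = 0                     # some allocation hits the actual spread exactly
    else:
        smallest = min(abs(low - actual_spread), abs(high - actual_spread))
    largest = max(abs(low - actual_spread), abs(high - actual_spread))
    return (smallest + largest) / 2

polls = {"P1": (60, 32, 8), "P2": (54, 36, 10), "P3": (61, 36, 3),
         "P4": (54, 40, 6), "P5": (56, 36, 8), "P6": (57, 43, 0)}
for name, (s, j, u) in polls.items():
    print(name, median_spread_error(s, j, u, actual_spread=20))
# P1 8.0, P2 6.0, P3 5.0, P4 6.0, P5 4.0, P6 6.0
```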

 

Here is a comparison of this measure with the others:

 

Pollster   Smith  Jones   M1 Error  M2 Error  M3 Error  M4 Error  M5 Error  M6 Error  Traugott Error  Shipman Error
P1          60     32      0.00      5.22      4.00     10.00      8.00      8.00        22.31           8.00
P2          54     36      6.00      0.00      5.00     10.00      2.00      6.00         0.00           6.00
P3          61     36      1.00      2.89      2.50      5.83      5.00      4.00        12.19           5.00
P4          54     40      6.00      2.55      3.00      5.00      6.00      6.00        10.54           6.00
P5          56     36      4.00      0.87      4.00      8.33      0.00      4.00         3.64           4.00
P6          57     43      3.00      3.00      3.00      6.25      6.00      3.00        12.36           6.00
Actual      60     40      0.00      0.00      0.00      0.00      0.00      0.00         0.00           0.00

 

Pollster   Smith  Jones   M1 Rank  M2 Rank  M3 Rank  M4 Rank  M5 Rank  M6 Rank  Traugott Rank  Shipman Rank
P1          60     32       1        6        4        5        6        6          6             6
P2          54     36       5        1        6        5        2        4          1             3
P3          61     36       2        4        1        2        3        2          4             2
P4          54     40       5        3        2        1        4        4          3             3
P5          56     36       4        2        4        4        1        2          2             1
P6          57     43       3        5        2        3        4        1          5             3

 

The Median Spread Error Measure has the following key properties:

 

 

 

This measure behaves like a relative measure if there is no way to allocate undecideds that gets the election exactly right; but if there is, it behaves like an absolute measure and penalizes undecideds.

 

Here is the contour graph for the Median Spread Error Measure:

 

 

In the lower-left quadrant, the graph resembles the Mosteller 6 graph of concentric squares. In the other three quadrants, the contour lines are identical to those for the Mosteller 5 measure.

 

The most accurate poll here is Poll 5, so the Shipman measure agrees with Mosteller 5 on the winner in this case. However, Mosteller 5 prefers Poll 2 (M5 error = 2 points) to Poll 3 (M5 error = 5 points), while the Median Spread Error Measure prefers Poll 3 to Poll 2 because of its lower “undecided” vote. The more polls differ in their undecided votes, the more the Median Spread Error and Mosteller 5 measures diverge.


10.    Advantages And Disadvantages

 

The “Median Spread Error” represents a compromise between relative and absolute measures of accuracy. Because it is based on the “Spread Error” (the Mosteller 5 measure), it rewards getting the relative vote percentages of the candidates right. But it also rewards getting “absolute” percentages right: even if the spread is correct, the presence of undecideds means the poll did not commit to that spread as a single prediction. As with the absolute measures M3, M4, and M6, the only way to get a perfect score is to get both candidates' vote percentages exactly right.

 

Thus, in our Smith 60%, Jones 40% example, a pollster who predicts [Smith 59%, Jones 39%, Undecided 2%] has done a materially better job of polling the contest than a competitor who predicts [Smith 49%, Jones 29%, Undecided 22%] – even though Mosteller 5, the measure most widely in use today, would assign both pollsters the identical perfect score of “zero error” (20 minus 20 = 0).
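The arithmetic behind that comparison can be spelled out directly (the helper name is ours and simply restates the Median Spread Error definition in terms of the predicted spread and the undecided share):

```python
def interval_midpoint_error(spread, undecided, actual=20):
    """Median Spread Error from a predicted spread and an undecided share.

    Possible spreads run from (spread - undecided) to (spread + undecided);
    the measure is the midpoint of the smallest and largest resulting
    errors against the actual spread."""
    low_err = abs((spread - undecided) - actual)
    high_err = abs((spread + undecided) - actual)
    smallest = 0 if (spread - undecided) <= actual <= (spread + undecided) \
        else min(low_err, high_err)
    return (smallest + max(low_err, high_err)) / 2

print(interval_midpoint_error(spread=20, undecided=2))    # first pollster  → 1.0
print(interval_midpoint_error(spread=20, undecided=22))   # second pollster → 11.0
```

Both polls score zero under Mosteller 5, but the measure proposed here separates them by an order of magnitude.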

 

One disadvantage of this new Median Spread Error Measure is that it is more complicated to define and calculate than some of the Mosteller measures. Another disadvantage is that, in elections with more than two candidates, the measure cannot be defined if the only data available to the scholar is the predicted vote percentages for the top 2 candidates. To calculate the Median Spread Error measure, the specific number of reported undecided voters must be known. This is not necessary to calculate Mosteller’s measures.

 

11.    The “Undecided” Issue Revisited

 

Is it the authors’ recommendation that pollsters allocate undecided voters in their final pre-election poll? No. Specifically, it is not.

 

The authors believe that allocating undecided voters on a poll-by-poll basis is an invitation to manipulate actual poll data to align with whatever hunch or intuition the pollster may have about the outcome of the contest. Further, it is an invitation for a pollster to change his/her results to be more in line with other pollsters, so that the pollster’s final prediction is not an outlier.

 

Allocating “undecided” voters using a systematic, pre-defined formula that does not change poll-by-poll (e.g. always proportionately, or always equally among the top 2 candidates, or always 2/3 to the challenger and 1/3 to the incumbent if there is an incumbent) is less idiosyncratic, but still requires the pollster to make an arbitrary determination that is exogenous to the gathered poll data.

 

The authors believe that a poll released close to an election with a relatively high number of undecided voters is an indication that the questionnaire was not designed properly, and/or that the screening of voters was not conducted with enough rigor. Well-designed screening questions and well-written “whom will you vote for” questions should, as a natural byproduct, produce fewer undecideds in a final pre-election poll, all other things being equal. The solution is not, as some have recommended, for the pollster to make up numbers on election eve for the purpose of eliminating the undecideds, but rather to craft the survey instrument in such a way that it naturally results in fewer and fewer undecideds as the election draws near.

 

Pollsters who report results with comparatively few undecided voters in fact “stick their neck out” further than those who report larger numbers of undecideds. The Median Spread Error measure defined here rewards the pollster whose estimate is not just the most precise, but whose numbers leave him/her the least amount of wiggle room.

 

12.    Conclusion

 

It is difficult to measure the accuracy of election polls, because assumptions must be made about how to allocate undecided voters. Mosteller (1949) defined eight accuracy measures, six of which are based on predicted vote percentages; these can give divergent results. There is a tension between “relative” and “absolute” measures. Contour graphs can illuminate how the accuracy measures work, and show that the new measure defined by Traugott et al. in 2003 does not evaluate election polls significantly differently from the Mosteller 2 measure. A new accuracy measure, “Median Spread Error,” considers the entire range of possible allocations of undecided voters. It avoids arbitrary assumptions about the allocation of undecideds and achieves a compromise between “relative” and “absolute” accuracy.



[1] Mosteller 7 error is calculated using sample size and Mosteller 8 error is calculated with electoral votes. These two measures are not discussed in this paper.