The Pollster site operated by the Huffington Post employs a thoughtful model for estimating a summary trendline from a wide variety of polls. It deals with house effects, discounted weighting of one-off polls, timing range, sample size differences, integration of regional effects, and relations between state and national polls. For all of these smart modifications, it seems that Pollster and other poll aggregation systems (like RCP) still need to address a rather fundamental measurement issue: a poll statistic is not always the same thing as the key population parameter that we want to estimate. Rather than modeling the trajectory of the poll statistic, we should strive to measure the rate at which voters prefer one candidate over the other. However, because respondents can declare 'undecided' or 'other', the poll statistic can differ dramatically from the preference rate.
Consider these recent polls from Colorado.
firm   dates        obama   romney   undecided   other
PPP    11/3-11/4    52      46       3           0
Lake   10/31-11/4   45      44       9           3
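To standardize, divide each candidate's poll percentage by the combined percentage of the two leading candidates, where A and B are the two candidates' poll percentages:
A / (A+B)    B / (A+B)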
This works out in the case of these two polls as:
firm   Obama A/(A+B)   Romney B/(A+B)   diff
PPP    .530612         .469388          .061224
Lake   .505618         .494382          .011236
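For readers who would rather script this than use a spreadsheet, here is a minimal Python sketch of the same arithmetic for the two Colorado polls. The function name vote_rates is my own choice for illustration and is not part of any Pollster tool.

```python
def vote_rates(a, b):
    """Return the two-candidate vote rates A/(A+B) and B/(A+B).

    a and b are the poll percentages for the two leading candidates;
    'undecided' and 'other' responses are simply left out of the denominator.
    """
    total = a + b
    if total <= 0:
        raise ValueError("the two candidates' percentages must sum to more than zero")
    return a / total, b / total


# Poll percentages (Obama, Romney) from the Colorado table above.
colorado_polls = {"PPP": (52, 46), "Lake": (45, 44)}

for firm, (obama, romney) in colorado_polls.items():
    obama_rate, romney_rate = vote_rates(obama, romney)
    print(f"{firm:5s} {obama_rate:.6f} {romney_rate:.6f} {obama_rate - romney_rate:.6f}")
```

The printed rows should match the standardized table: the PPP gap of 6 points becomes about 6.1 points, and the Lake gap of 1 point becomes about 1.1 points.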
Standardizing poll statistics into vote rates gets us closer to the rate at which voters prefer candidate A compared to candidate B. When unstandardized poll results are aggregated, interpretations end up conflating changes in vote rates with changes in the rates of 'undecided' or 'other' responses. This can lead to faulty interpretations. For instance, compare the two plots below (first plot reports vote rates, second is from the Pollster site). Both use the same polling data, but the top one plots vote rates between the top two candidates while the second reports poll percentages (along with the many model improvements mentioned earlier).
It is hard to compare these plots directly because the Pollster trendline is based on many more observation events, is smoothed nicely, and reports an X axis based on a consistent time metric. In contrast, my simplistic plot registers one unit of X as one poll. The main consequence of these defects is that comparisons of temporal trends are difficult because of the distortion inherent in the first plot. Despite that difficulty we can draw a couple of helpful lessons. Mainly, basing model values on the poll statistics rather than on some standardized measure seems to run the risk of getting both the size of the preference gap and the direction of the trends wrong. We can see that in the most recent weeks, where the gap in vote rate has increased slightly and the size of the gap is actually much larger than the raw poll data seems to imply. The poll data puts the race very close (a gap of about 0.6%), while the vote rate puts the gap at 1.9%.
These are very different lessons to derive from the same underlying data (state level polls, for the most part). The vote rate seems to suggest that Obama is ahead by about 2 percentage points and that his advantage is increasing. The poll statistics seem to suggest that the rates are nearly identical and Romney is gaining. If we take these plots as predictions, the vote rate seems to predict a win by about 2% in Colorado, while the poll statistics suggest a much closer race. Would a Pollster model based on standardized vote rates still register Colorado as a toss-up, or as leaning Obama?
You can create plots similar to the top plot too.
To create this one I simply followed these steps:
1. Go to the state page for the pollster data:
http://elections.huffingtonpost.com/pollster/2012-colorado-president-romney-vs-obama
2. Select more data until it maxes out.
3. Select the poll table, then copy and paste it into Excel
4. Reverse the order of the polls to put them in chronological order
5. Create a denominator by adding the poll percentages of the two candidates together
6. Divide each poll statistic by the denominator
7. Calculate some type of moving average of the polls (I used 7-poll averages)
8. Plot!
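Steps 4 through 7 can also be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the moving average is a trailing one that starts once a full window is available (the spreadsheet may handle the edges differently), and the only data included are the two Colorado polls quoted earlier, assumed to be listed newest first as on the Pollster page. With real data you would paste in the full poll list and use window=7.

```python
def two_candidate_rate(a, b):
    """Steps 5 and 6: candidate A's share of the two-candidate total."""
    return a / (a + b)


def moving_average(values, window=7):
    """Step 7: trailing moving average over `window` consecutive polls.

    Returns one value per full window, so the result has
    len(values) - window + 1 entries (empty if there are too few polls).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# (firm, obama, romney) as copied from the page; assumed newest first.
polls_newest_first = [("PPP", 52, 46), ("Lake", 45, 44)]

# Step 4: put the polls in chronological order.
polls = list(reversed(polls_newest_first))

# Steps 5 and 6: standardize each poll into Obama's vote rate.
obama_rates = [two_candidate_rate(obama, romney) for _, obama, romney in polls]

# Step 7: smooth. window=2 only because two polls are shown here.
smoothed = moving_average(obama_rates, window=2)
print(smoothed)
```

The smoothed series is what gets plotted in step 8, with one unit of X per poll; Romney's line is simply one minus each value.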
I pasted my data into a Google spreadsheet. Take a look if you are curious.
The second sheet of the Google spreadsheet holds Ohio poll data, and the plot is below. It is not really necessary to include lines for both candidates, but the symmetric values highlight the trends in the data. The growth in the vote rate gap is a little easier to observe than in the poll summary, but otherwise the current gaps are similar (Pollster summary difference = 3.3; vote rate difference = 3.28).
In the case of Ohio, key aspects of both plots are pretty similar: the current trend direction (a slight increase in Obama's edge) and the approximate size of that edge (3.3%). However, in Colorado both the trend and the gap are strikingly different in the two plots. It would be interesting to note which other state level cases show potentially influential errors related to the unstandardized rates. It would be even more interesting to see a comparison from Pollster between their current model and plots that start from vote rates rather than poll statistics.
The Pollster site has a long-standing tradition of depicting the raw data in the plots. However, it seems likely that the points, and any trendline based on those points, can lead to erroneous conclusions. In an email discussion, Dr. Simon Jackman indicated that he is well aware of the deficiencies of using raw poll data, but the legacy of earlier plots seems to be an institutional constraint at Pollster. Perhaps Pollster could institute an option that allows users to switch between the standardized and nonstandardized plots, similar to switching between linear and logged axes in other data visualizations (like Gapminder). Doing so might lead to a better understanding of poll trends and add to the numerate discussion of political and social data.