Monday, November 5, 2012

Aggregation sites should standardize vote rates from polls

     The Pollster site operated by The Huffington Post employs a thoughtful model for estimating a summary trendline from a wide variety of polls.  It deals with house effects, discounting of one-off polls, timing ranges, sample-size differences, integration of regional effects, and relations between state and national polls.  For all of these smart modifications, it seems that Pollster and other poll aggregation systems (like RCP) still need to address a rather fundamental measurement issue:  a poll statistic is not always the same thing as the key population parameter that we want to estimate.  Rather than modeling the trajectory of the poll statistic, we should strive to measure the rate at which voters prefer one candidate over the other.  Because respondents have the option to declare 'undecided' or 'other', the poll statistic can differ dramatically from the preference rate.

Consider these recent polls from Colorado.

firm    dates          Obama    Romney    Undecided    Other
PPP     11/3-11/4      52       46        3            0
Lake    10/31-11/4     45       44        9            3

     Based simply on the poll statistics, the comparison between these two surveys seems to show a pronounced 7-point jump for Obama (from 45 to 52).   This is not actually the case, though.  Rather than averaging across poll statistics, we should standardize values into vote rates before plotting trends or otherwise summarizing values across different polls.  One way is to divide the polling statistic for candidate A by the sum of the polling statistics for both candidates.  We can do the same for candidate B, although the two measures are symmetric.  This standardization prevents variation in the proportion of 'undecided' or 'other' responses from distracting from our focus on estimating the population parameter of interest.  It shows that the preference for Obama shifted from about 50.6 to 53.1, a much more modest change in preference.  Another way to think about the difference:  the raw 45 for Obama looks like a major drop, and Romney even appears to have polled higher than Obama in one of the two polls (46 is bigger than 45, after all).  But, in fact, voters preferred Obama in both polls, just by different margins.

  A / (A+B)   B / (A+B)

This works out in the case of these two polls as:

firm     Obama A/(A+B)     Romney B/(A+B)     diff
PPP      .530612           .469388            .061224
Lake     .505618           .494382            .011236
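The arithmetic behind this table is simple enough to check in a few lines of code.  Here is a minimal Python sketch using the two Colorado polls cited above (the firm names and numbers come straight from the tables; everything else is just illustration):

```python
# Standardize poll statistics into two-candidate vote rates:
# A / (A + B) for Obama, B / (A + B) for Romney.
polls = {
    "PPP":  {"obama": 52, "romney": 46},
    "Lake": {"obama": 45, "romney": 44},
}

rates = {}
for firm, p in polls.items():
    total = p["obama"] + p["romney"]   # 'undecided' and 'other' are excluded
    rates[firm] = {
        "obama":  p["obama"] / total,
        "romney": p["romney"] / total,
    }
    rates[firm]["diff"] = rates[firm]["obama"] - rates[firm]["romney"]

for firm, r in rates.items():
    print(f"{firm}: Obama {r['obama']:.6f}  Romney {r['romney']:.6f}  "
          f"diff {r['diff']:.6f}")
```

Running this reproduces the standardized values in the table, including the much narrower Lake margin.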

     Standardizing poll statistics into vote rates gets us closer to the rate at which voters prefer candidate A over candidate B.  When unstandardized poll results are fed into aggregation, interpretations conflate changes in vote rates with changes in rates of 'undecided' or 'other' responses.  This can lead to faulty conclusions.  For instance, compare the two plots below (the first reports vote rates; the second is from the Pollster site).  Both use the same polling data, but the first plots vote rates between the top two candidates while the second reports raw poll percentages (along with the many model improvements mentioned earlier).

    It is hard to compare these plots directly because the Pollster trendline is based on many more observation events, is smoothed nicely, and uses an X axis based on a consistent time metric.  In contrast, my simplistic plot registers one unit of X as one poll.  The main consequence of these defects is that comparisons of temporal trends are difficult because of the distortion inherent in the first plot.  Despite that difficulty, we can draw a couple of helpful lessons.  Mainly, basing model values on poll statistics rather than a standardized measure seems to run the risk of getting both the size of the preference gap and the direction of the trends wrong.  We can see this in the most recent weeks, where the gap in vote rate has increased slightly, and where the gap is actually much larger than the raw poll data implies.  The poll data puts the race very close (about 0.6%), while the vote rate puts the gap at 1.9%.
     These are very different lessons to derive from the same underlying data (state-level polls, for the most part).  The vote rate suggests that Obama is ahead by about 2 percentage points and that his advantage is increasing.  The poll statistics suggest that the rates are nearly identical and that Romney is gaining.   If we take these plots as predictions, the vote rate predicts a win by about 2% in Colorado, while the poll statistics suggest a much closer race.  Would a Pollster model based on standardized vote rates still register Colorado as a toss-up, or as leaning Obama?

You can create plots similar to the top plot yourself.

To create this one I simply followed these steps:

1.  Go to the state page for the Pollster data.
2.  Click 'more data' until the table maxes out.
3.  Select, copy, and paste into Excel.
4.  Reverse the order of the polls to put them in chronological order.
5.  Create a denominator by adding the poll statistics of the two candidates together.
6.  Divide each candidate's poll statistic by that denominator.
7.  Calculate some type of moving average of the resulting vote rates (I used a 7-poll moving average).
8.  Plot!
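Steps 4 through 7 can also be sketched in Python rather than Excel.  This is a rough sketch, not the exact spreadsheet workflow: the CSV path and the column names (`obama`, `romney`) are assumptions about how the copied table might be saved, so adjust them to match your export.

```python
# Sketch of steps 4-7: reverse to chronological order, standardize each
# poll into a two-candidate vote rate, then smooth with a moving average.
import csv

def vote_rates(rows):
    """Steps 5-6: divide each Obama statistic by the two-candidate total."""
    out = []
    for r in rows:
        a, b = float(r["obama"]), float(r["romney"])
        out.append(a / (a + b))
    return out

def moving_average(xs, window=7):
    """Step 7: a simple trailing moving average (shorter at the start)."""
    result = []
    for i in range(len(xs)):
        chunk = xs[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result

def trend_from_csv(path):
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    rows.reverse()                     # step 4: chronological order
    return moving_average(vote_rates(rows))  # steps 5-7; then plot (step 8)
```

The trailing window means the first few points average over fewer polls; a spreadsheet moving average behaves the same way at the edges.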

I pasted my data into a Google spreadsheet.  Take a look if you are curious.

     The second sheet of the Google spreadsheet contains Ohio poll data, and the plot is below.  It is not really necessary to include lines for both candidates, but the symmetric values highlight the trends in the data.  The growth in the vote rate gap is a little easier to observe than in the poll summary, but otherwise the current gaps are similar (Pollster summary difference = 3.3; vote rate difference = 3.28).

     The other major differences seem to be the timing and scale of the major fluctuations.   The timing difference is actually due to my hasty construction of the vote rate plot, where the polls are simply ordered and not spaced by the actual time elapsed.  This visually compresses the early period when polls were rare and stretches the recent weeks when polls are common.  Second, the peaks and valleys are more extreme in the standardized vote rate plot.  This difference is due either to the differences in the smoothing rules or to differences between the poll statistics and the vote rates.  Or, probably, some combination of both.
     In the case of Ohio, key aspects of both plots are pretty similar:  the current trend direction (slight increase in Obama's edge) and the approximate size of that edge (3.3%).  However, in Colorado both the trend and the gap are strikingly different in the two plots. It would be interesting to see which other state-level cases show potentially influential errors related to the unstandardized rates. It would be even more interesting to see a comparison from Pollster between their current model and plots that start from vote rates rather than poll statistics.
     The Pollster site has a long-standing tradition of depicting the raw data in its plots-- however, it seems likely that the points and any trendline based on those points can lead to erroneous conclusions.  In email discussion, Dr. Simon Jackman indicated that he is well aware of the deficiencies of using raw poll data, but the legacy of earlier plots seems to be an institutional constraint at Pollster.  Perhaps Pollster could offer an option that allows users to switch between the standardized and nonstandardized plots, similar to switching between linear and logged axes in other data visualizations (like Gapminder).  Doing so might lead to better understanding of poll trends and add to the numerate discussion of political and social data.