Can we predict how Ohio counties voted in the 2016 Presidential Election?

EDA

Question & Motivation - We investigate whether demographic factors such as metropolitan status, race, population, occupation, and average income can be used to predict how counties in Ohio voted in the 2016 presidential election. This question matters because, if election results can be predicted from these criteria, the same approach could offer insight into the 2020 presidential election, which is occurring soon and will have serious consequences not only for the country but for the global community as well.

Description of Data - The data that we have chosen to analyze comes from a public dataset published by Opendatasoft, which contains data on each county in Ohio. The turnout-by-county CSV dataset shows how many registered voters in each Ohio county voted in the election; turnout in the counties previewed below is roughly 66% or higher, which is above the national average. The American Community Survey provides metropolitan status, race, population, occupation, and average income data for each county. Politico’s election map shows the percentage of voters who voted for each presidential candidate by county, which can then be compared with the urban, suburban, or rural status of the county, as well as the other factors we want to analyze. The 2010 census also gives the racial composition of Ohio residents by county. Because no comprehensive demographic data was collected closer to 2016, this census data is several years old and may not accurately represent county demographics at the time of the 2016 election.

The dataset was collected from Opendatasoft. The data itself was hosted on GitHub by the user Deleetdk, who sourced it from the New York Times and a sociological paper. The electoral data, describing how each county voted, came from the New York Times. The data describing income, race, etc. came from a paper (“Inequality across US counties: an S factor analysis” by Emil O. W. Kirkegaard) hosted on OpenPsych, a website that hosts free academic articles. These data sources are freely available to the public and consist of self-reported, widely available information; demographic data of this kind is usually scrubbed of any Personally Identifiable Information. This means that no individual represented in the data can be identified, so nobody is exposed by being included in the dataset.

When deciding which visualizations were best suited to the data, we determined that choropleth maps would be the best method. Because counties have strict boundary lines, choropleth maps capture the data more accurately than heat maps, which have no distinct boundaries. We weighed our options as a group and agreed that this was the best course of action. We also used bar charts to display generalized demographic data, in order to provide further context for our analysis.

# Exploratory Data Analysis
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import geopandas as gpd
import folium
import json
import os
from branca.colormap import linear
import branca.colormap
import folium.plugins # The Folium Javascript Map Library
from folium.plugins import HeatMap
from folium.plugins import HeatMapWithTime
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score
from sklearn.feature_selection import RFE

ohio_geo = gpd.read_file('usa-2016-presidential-election-by-county.geojson')
# Use a context manager so the file handle is closed after loading
with open('usa-2016-presidential-election-by-county.geojson') as f:
    ohio_geo_json = json.load(f)

m = folium.Map([40.4173, -82.9071], zoom_start=7)

# This map shows the state of Ohio on a map of the United States, the state we will be focusing on, and outlines the different counties.
folium.GeoJson(ohio_geo).add_to(m)

m

# Here we are cleaning the data to calculate the difference between Republican votes in 2016 and 2012 to see how each county has shifted.
party_votes = ohio_geo.loc[:,['name_16', 'rep12_frac', 'rep16_frac']]
party_votes['rep_diff'] = party_votes['rep16_frac'] - party_votes['rep12_frac']
rep_diff = party_votes.loc[:,['name_16','rep_diff']]
rep_diff.head()

	name_16	rep_diff
0	Fayette	11.792030
1	Portage	6.557826
2	Stark	7.646732
3	Van Wert	7.063866
4	Brown	13.043855

colormap = linear.BuGn_03.scale(
    party_votes.rep_diff.min(),
    party_votes.rep_diff.max())

colormap

rep_diff_dict = rep_diff.set_index('name_16')['rep_diff']
rep_diff_dict.head()

name_16
Fayette     11.792030
Portage      6.557826
Stark        7.646732
Van Wert     7.063866
Brown       13.043855
Name: rep_diff, dtype: float64

folium.GeoJson(
    ohio_geo_json,
    name='rep_diff',
    style_function=lambda feature: {
        'fillColor': colormap(rep_diff_dict[feature['properties']['name_16']]),
        'color': 'black',
        'weight': 1,
        'dashArray': '5, 5',
        'fillOpacity': 0.9,
    }
).add_to(m)

folium.LayerControl().add_to(m)
colormap.caption = 'Difference in Voting Republican between 2016 and 2012 elections'
colormap.add_to(m)

m

The above map shows the difference between Republican voting percentages in 2016 and 2012. Lighter counties sit at the low end of the scale (the smallest shift toward the Republican candidate, or a shift away), and darker counties shifted most strongly toward the Republican candidate. The counties at the low end of the scale are urban counties, such as the ones surrounding Columbus and Cincinnati. What is interesting about the counties that shifted most toward the Republican candidate in 2016 is that many are clustered in the southeastern portion of the state. These counties are near cities in neighboring states that are known for the decline in manufacturing jobs, such as Pittsburgh, making them part of the Rust Belt. A strong pro-US manufacturing message may have appealed to these counties more in 2016 than in 2012.

afro = ohio_geo.loc[:, ['name_16', 'african_american_population']]
afro_pop_dict = afro.set_index('name_16')['african_american_population']
afrcolormap = linear.PuBu_07.scale(
    afro.african_american_population.min(),
    afro.african_american_population.max())

afrcolormap

m2 = folium.Map([40.4173, -82.9071], zoom_start=7)
folium.GeoJson(
    ohio_geo_json,
    name='afro_percentage_pop',
    style_function=lambda feature: {
        'fillColor': afrcolormap(afro_pop_dict[feature['properties']['name_16']]),
        'color': 'black',
        'weight': 1,
        'dashArray': '5, 5',
        'fillOpacity': 0.9,
    }
).add_to(m2)

folium.LayerControl().add_to(m2)
afrcolormap.caption = 'Proportion of County Population that is African American'
afrcolormap.add_to(m2)

m2

The above map shows the proportion of each county’s population that is African American. Unsurprisingly, urban counties have the largest African American populations, with the counties containing Cleveland, Cincinnati, Columbus, Dayton, and Toledo having the highest proportions by a large margin. The rest of the state’s counties have very small African American populations. Notably, the previously discussed southeastern portion of Ohio, the portion that shifted hard toward Republican candidates, has almost no African American residents. Given that African American voters tend to vote Democratic, there was little in these counties to stem the Republican tide in 2016.

turnout = pd.read_csv('turnoutbycounty.csv')
county_turnout = turnout.loc[:,['November 8, 2016 General Election Official Canvass\nCounty Level Voter Turnout Report', 'Unnamed: 5']]
county_turnout = county_turnout.drop([0, 1])
county_turnout = county_turnout.reset_index(drop = True)
county_turnout.rename(columns = {'November 8, 2016 General Election Official Canvass\nCounty Level Voter Turnout Report':'County', 'Unnamed: 5': 'Voter_Turnout_Percentage'}, inplace = True)
county_turnout['Voter_Turnout_Percentage'] = county_turnout['Voter_Turnout_Percentage'].map(lambda x: x.rstrip('%'))
county_turnout = county_turnout.astype({'Voter_Turnout_Percentage': 'float'})
county_turnout.head()

	County	Voter_Turnout_Percentage
0	Adams	68.36
1	Allen	68.67
2	Ashland	70.81
3	Ashtabula	68.94
4	Athens	66.15

voter_turnout_dict = county_turnout.set_index('County')['Voter_Turnout_Percentage']

colormap3 = linear.BuGn_07.scale(
    county_turnout.Voter_Turnout_Percentage.min(),
    county_turnout.Voter_Turnout_Percentage.max())

colormap3

m3 = folium.Map([40.4173, -82.9071], zoom_start=7)
folium.GeoJson(
    ohio_geo_json,
    name='Voter_Turnout_Percentage',
    style_function=lambda feature: {
        'fillColor': colormap3(voter_turnout_dict[feature['properties']['name_16']]),
        'color': 'black',
        'weight': 1,
        'dashArray': '5, 5',
        'fillOpacity': 0.9,
    }
).add_to(m3)

folium.LayerControl().add_to(m3)
colormap3.caption = 'Voter Turnout Percentage'
colormap3.add_to(m3)

m3

The above map shows the voter turnout percentage per county in the 2016 Presidential election, with darker counties having higher turnout and lighter counties having lower turnout. Curiously, many of the counties that surround counties containing cities have higher turnout. These presumably suburban counties, which contain a large share of white voters, could have contributed to Ohio’s Red shift in the 2016 election. The western and northern portions of Ohio also tend to have higher turnout. The southern tip of the state has some of the lowest turnout levels, including part of the section that shifted very hard to the right.

Modeling

We wish to predict what percentage of people in each of Ohio’s counties voted Republican in the 2016 election based on the 2012 election data. Because the 2016 results are known, we can use them as a benchmark for how accurate our model is. Since this is a regression problem, we test three types of regression models: OLS, Ridge, and LASSO. This helps us determine which provides the most accurate model and also helps with feature selection. Ridge shrinks the coefficients of less useful features toward zero, while LASSO can set them to exactly zero and so removes features it deems irrelevant. We use all three models in tandem with manual feature selection to ensure that our model is targeting the right features to maximize its accuracy.
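To illustrate the difference between the three models, here is a minimal sketch on synthetic data (an assumption for illustration, not the county dataset). Only the first two of ten features drive the target, and the alpha values are arbitrary choices. The sketch counts how many coefficients each fitted model sets to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data (assumption): only features 0 and 1 affect the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

demo_models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "LASSO": Lasso(alpha=0.5),
}

zero_counts = {}
for name, model in demo_models.items():
    model.fit(X_demo, y_demo)
    # Count the coefficients that the fit set to exactly zero
    zero_counts[name] = int(np.sum(model.coef_ == 0.0))
    print(name, "coefficients equal to zero:", zero_counts[name])
```

In a run like this, OLS and Ridge should keep every feature with a nonzero coefficient, while LASSO should zero out some of the uninformative ones. That is why LASSO can double as a feature selector and Ridge, strictly speaking, cannot.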

ohio_csv = pd.read_csv('usa-2016-presidential-election-by-county.csv', sep=';')
ohio_csv = ohio_csv.dropna(axis='columns') # Drop columns with any NaN values

sixteen = ohio_csv.filter(regex='16', axis=1)
eight = ohio_csv.filter(regex='08', axis=1)
dropped_sixteen = ohio_csv.drop(columns=sixteen.columns) # Drop columns with 2016 info
only_2k12 = dropped_sixteen.drop(columns=eight.columns) # Drop columns with 2008 info

final_data = only_2k12.select_dtypes(include='number').astype('float')
final_data.head()

	Fips	Precincts	Votes	Democrats 12 (Votes)	Republicans 12 (Votes)	Republicans 2012	Democrats 2012	Less Than High School Diploma	At Least High School Diploma	At Least Bachelors's Degree	...	Max Alc	Mixedness	reporting	Statecode Prev	total12	other12	Other12 Frac	Rep12 Frac2	Dem12 Frac2
0	39047.0	25.0	10817.0	4249.0	6620.0	59.974633	38.494292	17.8	82.2	13.1	...	0.000274	-0.216278	25.0	39.0	11038.0	169.0	0.015311	0.609072	0.390928
1	39133.0	130.0	73968.0	39453.0	35242.0	46.137935	51.650869	9.6	90.4	24.9	...	0.000466	-0.570721	130.0	39.0	76384.0	1689.0	0.022112	0.471812	0.528188
2	39151.0	284.0	170868.0	89432.0	88581.0	48.738899	49.207135	12.2	87.8	20.4	...	0.000072	-0.876549	284.0	39.0	181746.0	3733.0	0.020540	0.497610	0.502390
3	39161.0	39.0	13584.0	4029.0	9585.0	68.966758	28.989783	10.1	89.9	14.1	...	0.000590	-0.082522	39.0	39.0	13898.0	284.0	0.020435	0.704055	0.295945
4	39015.0	32.0	19139.0	7107.0	11916.0	61.448020	36.649134	20.0	80.0	9.8	...	0.000751	-0.012464	32.0	39.0	19392.0	369.0	0.019028	0.626400	0.373600

5 rows × 67 columns

The original county voting dataset was fairly large, containing election data for the 2008, 2012, and 2016 presidential elections. We were interested in creating a model to see if we could accurately predict the 2016 data, so we removed both the 2008 and 2016 data from the dataset. This ensures our model is based solely on the 2012 data. We also converted the numeric columns in our dataset to floats to ensure they would be usable.
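The column filtering can be seen on a small toy frame. This is a sketch with hypothetical column names modeled on the dataset's naming, not the real columns. Note that `DataFrame.filter(regex=...)` matches the pattern anywhere in a column name, so a pattern like '16' would also catch an unrelated column that happens to contain those digits. It may be worth printing the dropped columns to check.

```python
import pandas as pd

# Toy frame; the column names are hypothetical stand-ins
toy = pd.DataFrame({
    "Republicans 2008": [55.0],
    "Republicans 2012": [58.0],
    "Republicans 2016": [66.0],
    "rep16_frac": [0.66],
    "Median Age": [39.5],
})

# Select columns whose names contain "16" or "08" anywhere
cols_2016 = toy.filter(regex="16", axis=1).columns
cols_2008 = toy.filter(regex="08", axis=1).columns

# Drop both groups, leaving the 2012 and year-independent columns
kept = toy.drop(columns=cols_2016).drop(columns=cols_2008)

print("dropped:", list(cols_2016) + list(cols_2008))
print("kept:", list(kept.columns))
```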

X = final_data.copy()
y = sixteen.copy()['Republicans 2016']

# Partition data into training, validation, and test sets

np.random.seed(20)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.80, test_size=0.20)

X_train, X_validate, y_train, y_validate = train_test_split(X_train, y_train,
                                                    train_size=0.75, test_size=0.25)

def rmse(pred, actual):
    return np.sqrt(np.mean((pred - actual) ** 2))
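As a quick check on the two-step split above: taking 80% and then 75% of that leaves about 60% of the rows for training, with about 20% each for validation and testing. The sketch below assumes one row per county and uses 88 placeholder rows (Ohio has 88 counties). It passes `random_state` directly, which is an alternative to seeding NumPy's global generator. It also verifies the `rmse` helper on a case that can be worked by hand.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def rmse(pred, actual):
    return np.sqrt(np.mean((pred - actual) ** 2))

# 88 placeholder rows stand in for the counties (assumption: one row per county)
X_demo = np.arange(88).reshape(-1, 1)
y_demo = np.arange(88, dtype=float)

# First split: 80% train+validation, 20% test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.80, test_size=0.20, random_state=20)
# Second split: 75% of the remainder for training, 25% for validation
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tr, y_tr, train_size=0.75, test_size=0.25, random_state=20)

print("train / validation / test rows:", len(X_tr), len(X_va), len(X_te))

# Hand-checkable case: errors are 0 and 2, so RMSE = sqrt((0 + 4) / 2) = sqrt(2)
check = rmse(np.array([1.0, 3.0]), np.array([1.0, 1.0]))
print("rmse check:", check)
```

With so few rows in each partition, the validation and test scores rest on a small number of counties, so they can vary noticeably from one random split to another.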

# Let's look at OLS

## Create the lin_reg estimator and fit the model
## (the `normalize` argument was removed from recent scikit-learn releases; its default was False)
lin_reg = LinearRegression()
lin_model = lin_reg.fit(X_train, y_train)

lin_train_pred = lin_model.predict(X_train)

# Plot predicted vs. actual values on a scatter plot
plt.scatter(y_train, lin_train_pred)
plt.title('Linear Model (OLS) on Training')
plt.xlabel('actual value')
plt.ylabel('predicted value')
plt.show()

OLS on Training

# RMSE and R^2 score for OLS on training
rmse(lin_train_pred, y_train), r2_score(y_train, lin_train_pred)

(3.2746150488817636e-10, 1.0)

lin_valid_pred = lin_model.predict(X_validate)

# Plot predicted vs. actual values on a scatter plot
plt.scatter(y_validate, lin_valid_pred)
plt.title('Linear Model (OLS) on Validation')
plt.xlabel('actual value')
plt.ylabel('predicted value')
plt.show()

OLS on Validation

# RMSE and R^2 score for OLS on validation
rmse(lin_valid_pred, y_validate), r2_score(y_validate, lin_valid_pred)

(6.012555729854793, 0.5708065932106756)

# Cross validation using OLS
linear_cross_pred = cross_val_predict(lin_model, X_train, y_train, cv = 8)
rmse(linear_cross_pred, y_train), r2_score(y_train, linear_cross_pred)

(4.3480800464157925, 0.8413087153892228)

Looking at the OLS model’s predicted-versus-actual plot for the validation set, the points do not fall tightly along the diagonal, so this model does not represent the data well. The validation R² score is quite low at 0.57, while the training R² score is 1.0. In other words, the model reproduces the training counties almost exactly but explains much less of the variability in counties it has not seen, so we know that plain OLS isn’t the best fit for our data.
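One likely reason for the perfect training score is that the feature table has 67 numeric columns, while the training partition probably holds fewer rows than that (around 50, assuming one row per county). When OLS has at least as many coefficients as training rows, it can reproduce the training targets exactly even if the features carry no signal. The sketch below shows this on pure noise. The row and column counts are chosen to mimic that situation and are assumptions, not values read from the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Pure-noise data: by construction the target is unrelated to the features
rng = np.random.default_rng(1)
n_train, n_valid, n_features = 50, 20, 67
X_tr = rng.normal(size=(n_train, n_features))
y_tr = rng.normal(size=n_train)
X_va = rng.normal(size=(n_valid, n_features))
y_va = rng.normal(size=n_valid)

# More features than training rows, so OLS can interpolate the training targets
ols = LinearRegression().fit(X_tr, y_tr)
train_r2 = r2_score(y_tr, ols.predict(X_tr))
valid_r2 = r2_score(y_va, ols.predict(X_va))

print(f"train R^2 = {train_r2:.3f}, validation R^2 = {valid_r2:.3f}")
```

A training R² of essentially 1 on noise shows that this score says nothing about predictive power in such a setting. Only the held-out score is informative.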

# Let's look at Ridge

## Create the ridge_reg estimator and fit the model
ridge_reg = Ridge()
ridge_model = ridge_reg.fit(X_train, y_train)

ridge_train_pred = ridge_model.predict(X_train)

# Plot predicted vs. actual values on a scatter plot
plt.scatter(y_train, ridge_train_pred)
plt.title('Ridge Model on Training')
plt.xlabel('actual value')
plt.ylabel('predicted value')
plt.show()

Ridge on Training

# RMSE and R^2 score for Ridge on training
rmse(ridge_train_pred, y_train), r2_score(y_train, ridge_train_pred)

(0.6257856200026994, 0.9967129283337127)

ridge_valid_pred = ridge_model.predict(X_validate)

# Plot predicted vs. actual values on a scatter plot
plt.scatter(y_validate, ridge_valid_pred)
plt.title('Ridge Model on Validation')
plt.xlabel('actual value')
plt.ylabel('predicted value')
plt.show()

Ridge on Validation

# RMSE and R^2 score for Ridge on validation
rmse(ridge_valid_pred, y_validate), r2_score(y_validate, ridge_valid_pred)

(1.1586187414858822, 0.9840626573283578)

# Cross validation using Ridge
ridge_cross_pred = cross_val_predict(ridge_model, X_train, y_train, cv = 8)
rmse(ridge_cross_pred, y_train), r2_score(y_train, ridge_cross_pred)

(2.5026396055037914, 0.9474279828566161)

# Let's look at Lasso

## Create the lasso_reg estimator and fit the model
lasso_reg = Lasso(max_iter=10000)
lasso_model = lasso_reg.fit(X_train, y_train)

lasso_train_pred = lasso_model.predict(X_train)

# Plot predicted vs. actual values on a scatter plot
plt.scatter(y_train, lasso_train_pred)
plt.title('Lasso Model on Training')
plt.xlabel('actual value')
plt.ylabel('predicted value')
plt.show()

Lasso on Training

# RMSE and R^2 score for Lasso on training
rmse(lasso_train_pred, y_train), r2_score(y_train, lasso_train_pred)

(1.1995018934812, 0.9879229770899404)

lasso_valid_pred = lasso_model.predict(X_validate)

# Plot predicted vs. actual values on a scatter plot
plt.scatter(y_validate, lasso_valid_pred)
plt.title('Lasso Model on Validation')
plt.xlabel('actual value')
plt.ylabel('predicted value')
plt.show()

Lasso on Validation

# RMSE and R^2 score for Lasso on validation
rmse(lasso_valid_pred, y_validate), r2_score(y_validate, lasso_valid_pred)

(0.949760467716469, 0.9892906530231533)

# Cross validation using Lasso
lasso_cross_pred = cross_val_predict(lasso_model, X_train, y_train, cv = 8)
rmse(lasso_cross_pred, y_train), r2_score(y_train, lasso_cross_pred)

(2.4507955016226832, 0.9495835614618354)

Model Explanation - For our results, we tested each model on both the training set and the validation set in order to check that it was not overfit to the training data. OLS performed almost perfectly on the training set (RMSE of about 3.27e-10) but much worse on the validation set (RMSE of 6.013), meaning that overfitting was occurring. This made it the worst model of the set. The Ridge model performed well on the training set, with an RMSE of 0.626, though its performance worsened somewhat on the validation set (RMSE of 1.159). The LASSO model was our best performing model: although its training RMSE (1.200) is higher than the Ridge model’s, its validation RMSE (0.950) is lower than both its own training RMSE and the Ridge model’s validation RMSE, suggesting it is our most generalizable and least overfit model. The 8-fold cross-validation results point the same way, with LASSO (RMSE of 2.451) slightly ahead of Ridge (2.503) and well ahead of OLS (4.348). We will thus be moving on with our LASSO model.
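The comparison can also be collected in one place by looping over the models and recording training and validation RMSE side by side. The sketch below does this on synthetic stand-in data generated with `make_regression`. The sample and feature counts are assumptions picked so that OLS has as many features as training rows, and the numbers it prints are not the report's results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

def rmse(pred, actual):
    return np.sqrt(np.mean((pred - actual) ** 2))

# Synthetic stand-in data (assumption): 80 rows, 60 features, 5 informative
X_demo, y_demo = make_regression(
    n_samples=80, n_features=60, n_informative=5, noise=5.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

candidates = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(),
    "LASSO": Lasso(max_iter=10000),
}

# Store (training RMSE, validation RMSE) for each model
results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    results[name] = (rmse(model.predict(X_tr), y_tr),
                     rmse(model.predict(X_va), y_va))

for name, (train_rmse, valid_rmse) in results.items():
    print(f"{name:6s} train RMSE = {train_rmse:9.3f}   validation RMSE = {valid_rmse:9.3f}")
```

A large gap between the two columns for a given model is the overfitting signal discussed above. The model with the lowest validation RMSE is the natural one to carry forward.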

We therefore use our LASSO model to see how accurately voting can be predicted from county-level factors. The model predicts the percentage of voters in an Ohio county who vote Republican, based on features such as race, income, and the 2012 election results.

Legal, policy, and ethical implications, as well as future related areas of research - If demographic factors do influence voting patterns, a better understanding of how different demographic groups vote can help politicians campaign to voters. It could also help predict the outcome of some elections better. Policy could be affected both positively and negatively if this research is taken into consideration when making political decisions such as the redistricting of voting districts. Gerrymandering continues to be a huge problem in politics: districts are redrawn to reduce the impact of some voters’ ballots and to benefit the political party of those responsible for the redistricting, diluting the voice of those voters. Knowing how the concentration of voters of a certain demographic in certain areas predicts how counties will vote can also help identify the impact of redistricting. This is an interesting area of research that should be expanded on to help fix redistricting problems.

We trained our model on 2012 data to predict the percentage of voters in each Ohio county who voted Republican in the 2016 election. Likewise, to make predictions for the 2020 election, we will train it on the 2016 election data. We will not know how accurate those predictions are until the 2020 election occurs, but we are curious to find out!