Analyzing Regional Trends in Police Stop Data Across the United States
Names: Rishabh Parekh, Emmy Yu, Manu Prakasam
About the Data: Ethical Data Collection
Publicly-available Standardized Stop Data from: https://openpolicing.stanford.edu/data/
The Stanford Open Policing Project data are made available under the Open Data Commons Attribution License, which grants the following freedoms:
To Share: To copy, distribute and use the database.
To Create: To produce works from the database.
To Adapt: To modify, transform and build upon the database.
The Stanford Open Policing Project data come from the working paper:
E. Pierson, C. Simoiu, J. Overgoor, S. Corbett-Davies, D. Jenson, A. Shoemaker, V. Ramachandran, P. Barghouty, C. Phillips, R. Shroff, and S. Goel. (2019) “A large-scale analysis of racial disparities in police stops across the United States”.
The accompanying README file describes how the data were standardized and cleaned for privacy and security reasons. It also describes the meaning of each column.
Downloading data from The Stanford Open Policing Project:
Specifically, we will be looking at the datasets from San Francisco, New Orleans, and Pittsburgh.
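The code below assumes the standard pandas/plotting imports, which are not shown in this excerpt; a minimal set under that assumption:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns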
sfcrime = pd.read_csv('ca_san_francisco_2019_12_17.csv')
sfcrime.head()
raw_row_number | date | time | location | lat | lng | district | subject_age | subject_race | subject_sex | type | arrest_made | citation_issued | warning_issued | outcome | contraband_found | search_conducted | search_vehicle | search_basis | reason_for_stop | raw_search_vehicle_description | raw_result_of_contact_description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 869921 | 2014-08-01 | 00:01:00 | MASONIC AV & FELL ST | 37.773004 | -122.445873 | NaN | NaN | asian/pacific islander | female | vehicular | False | False | True | warning | NaN | False | False | NaN | Mechanical or Non-Moving Violation (V.C.) | No Search | Warning |
1 | 869922 | 2014-08-01 | 00:01:00 | GEARY&10TH AV | 37.780898 | -122.468586 | NaN | NaN | black | male | vehicular | False | True | False | citation | NaN | False | False | NaN | Mechanical or Non-Moving Violation (V.C.) | No Search | Citation |
2 | 869923 | 2014-08-01 | 00:15:00 | SUTTER N OCTAVIA ST | 37.786919 | -122.426718 | NaN | NaN | hispanic | male | vehicular | False | True | False | citation | NaN | False | False | NaN | Mechanical or Non-Moving Violation (V.C.) | No Search | Citation |
3 | 869924 | 2014-08-01 | 00:18:00 | 3RD ST & DAVIDSON | 37.746380 | -122.392005 | NaN | NaN | hispanic | male | vehicular | False | False | True | warning | NaN | False | False | NaN | Mechanical or Non-Moving Violation (V.C.) | No Search | Warning |
4 | 869925 | 2014-08-01 | 00:19:00 | DIVISADERO ST. & BUSH ST. | 37.786348 | -122.440003 | NaN | NaN | white | male | vehicular | False | True | False | citation | NaN | False | False | NaN | Mechanical or Non-Moving Violation (V.C.) | No Search | Citation |
nola_crime = pd.read_csv('la_new_orleans_2019_12_17.csv')
nola_crime.head()
raw_row_number | date | time | location | lat | lng | district | zone | subject_age | subject_race | subject_sex | officer_assignment | type | arrest_made | citation_issued | warning_issued | outcome | contraband_found | contraband_drugs | contraband_weapons | frisk_performed | search_conducted | search_person | search_vehicle | search_basis | reason_for_stop | vehicle_color | vehicle_make | vehicle_model | vehicle_year | raw_actions_taken | raw_subject_race | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2010-01-01 | 01:11:00 | NaN | NaN | NaN | 6 | E | 26.0 | black | female | 6th District | vehicular | False | False | False | NaN | NaN | NaN | NaN | False | False | False | False | NaN | TRAFFIC VIOLATION | BLACK | DODGE | CARAVAN | 2005.0 | NaN | BLACK |
1 | 9087 | 2010-01-01 | 01:29:00 | NaN | NaN | NaN | 7 | C | 37.0 | black | male | 7th District | vehicular | False | False | False | NaN | NaN | NaN | NaN | False | False | False | False | NaN | TRAFFIC VIOLATION | BLUE | NISSAN | MURANO | 2005.0 | NaN | BLACK |
2 | 9086 | 2010-01-01 | 01:29:00 | NaN | NaN | NaN | 7 | C | 37.0 | black | male | 7th District | vehicular | False | False | False | NaN | NaN | NaN | NaN | False | False | False | False | NaN | TRAFFIC VIOLATION | BLUE | NISSAN | MURANO | 2005.0 | NaN | BLACK |
3 | 267 | 2010-01-01 | 14:00:00 | NaN | NaN | NaN | 7 | I | 96.0 | black | male | 7th District | vehicular | False | False | False | NaN | NaN | NaN | NaN | False | False | False | False | NaN | TRAFFIC VIOLATION | GRAY | JEEP | GRAND CHEROKEE | 2003.0 | NaN | BLACK |
4 | 2 | 2010-01-01 | 02:06:00 | NaN | NaN | NaN | 5 | D | 17.0 | black | male | 5th District | NaN | False | False | False | NaN | NaN | NaN | NaN | False | False | False | False | NaN | CALL FOR SERVICE | NaN | NaN | NaN | NaN | NaN | BLACK |
pittcrime = pd.read_csv('pa_pittsburgh_2019_12_17.csv')
pittcrime.head()
raw_row_number | date | time | location | lat | lng | neighborhood | subject_age | subject_race | subject_sex | officer_id_hash | officer_age | officer_race | officer_sex | type | violation | arrest_made | citation_issued | warning_issued | outcome | contraband_found | frisk_performed | search_conducted | reason_for_stop | raw_zone | raw_object_searched | raw_race | raw_ethnicity | raw_zone_division | raw_evidence_found | raw_weapons_found | raw_nothing_found | raw_police_zone | raw_officer_race | raw_officer_zone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2008-01-01 | 00:14:00 | 351 S Negley Ave | 40.459466 | -79.932802 | NaN | 20.0 | white | male | 3bb3b1bd48 | 41.0 | NaN | NaN | pedestrian | NaN | False | False | False | NaN | NaN | NaN | False | Other | NaN | NaN | White | White | - | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 3 | 2008-01-01 | 00:14:00 | 376 Main St | 40.465868 | -79.955594 | NaN | 19.0 | white | male | 3bb3b1bd48 | 41.0 | NaN | NaN | pedestrian | NaN | False | False | False | NaN | NaN | NaN | False | Other | NaN | NaN | White | White | - | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2 | 2008-01-01 | 00:14:00 | Stamair Way & Baum Blvd | 40.456812 | -79.939041 | NaN | 16.0 | white | male | 3bb3b1bd48 | 41.0 | NaN | NaN | pedestrian | NaN | False | False | False | NaN | NaN | NaN | False | Other | - | NaN | White | White | - | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 4 | 2008-01-01 | 01:59:00 | N Braddock Ave & Thomas Blvd | 40.448873 | -79.893923 | NaN | 21.0 | NaN | male | b62aedb5bb | 29.0 | NaN | NaN | pedestrian | NaN | True | False | False | arrest | NaN | NaN | True | majorCrimes Other | - | person | Black | White | - | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 5 | 2008-01-01 | 14:50:00 | 2518 West Liberty Ave | 40.398780 | -80.026439 | NaN | 41.0 | white | male | 1ccb6bd45a | NaN | NaN | NaN | pedestrian | NaN | False | False | False | NaN | NaN | NaN | True | narcVice | - | person vehicle place | White | NaN | N/V | NaN | NaN | NaN | NaN | NaN | NaN |
Data Applicability:
Our research question assesses the relationship between race/socioeconomic background and the number of interactions with the police. The data give us the race of the individual and the location and time of each stop a police officer makes. Location can give us a general idea of an individual's socioeconomic background based on the neighborhood they are in.
Furthermore, using the race of the individual, we can find the total number of stops per race. We will also compare the number of stops for a given race with that race's population in the area, which can show whether there is a disproportionate rate of interaction between law enforcement and minorities. A rough sketch of this calculation follows below.
Using the Stanford Open Policing data, we will also compare three distinctly different cities: San Francisco, Pittsburgh, and New Orleans. This will allow us to see whether policing is similar or different across the nation, and will give us a more general picture of the relationship between minorities and law enforcement.
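As a rough sketch of the stop-rate comparison described above (the population shares below are illustrative placeholders, not actual census figures):
# Placeholder population shares by race for one city -- illustrative only, not census data
population_share = pd.Series({'white': 0.40, 'black': 0.05,
                              'hispanic': 0.15, 'asian/pacific islander': 0.34})
stop_share = sfcrime['subject_race'].value_counts(normalize=True)  # share of stops per race
disparity_ratio = (stop_share / population_share).dropna()         # values > 1 mean over-representation among stops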
Standardizing the Data:
Certain columns that are common across all cities (and lend themselves to interesting analyses) were selected and merged to create one DataFrame for easier analysis and plotting.
Each row in the cleaned data represents a stop, and coverage varies by location. Per the Open Policing README, certain fields were removed before public release due to privacy concerns, and all columns except raw_row_number, violation, disposition, location, officer_assignment, any city or state subgeography (i.e., county, beat, division, etc.), unit, and vehicle_{color, make, model, type} are digit-sanitized (each digit replaced with "-") for privacy reasons. For the purposes of our analysis, these columns were dropped from the original dataset.
Exploratory Data Analysis
#Select the columns common to all three cities and label each row with its city
common_cols = ["date", "time", "subject_age", "subject_race", "subject_sex", "type", "outcome", "citation_issued", "reason_for_stop"]
sf_crime_to_merge = sfcrime[common_cols].copy()
sf_crime_to_merge["city"] = "San Francisco"
pitt_crime_to_merge = pittcrime[common_cols].copy()
pitt_crime_to_merge["city"] = "Pittsburgh"
nola_crime_to_merge = nola_crime[common_cols].copy()
nola_crime_to_merge["city"] = "New Orleans"
#Exploratory Data Analysis DataFrame (DataFrame.append is deprecated, so pd.concat is used)
eda_merged = pd.concat([sf_crime_to_merge, pitt_crime_to_merge, nola_crime_to_merge])
eda_merged
date | time | subject_age | subject_race | subject_sex | type | outcome | citation_issued | reason_for_stop | city | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2014-08-01 | 00:01:00 | NaN | asian/pacific islander | female | vehicular | warning | False | Mechanical or Non-Moving Violation (V.C.) | San Francisco |
1 | 2014-08-01 | 00:01:00 | NaN | black | male | vehicular | citation | True | Mechanical or Non-Moving Violation (V.C.) | San Francisco |
2 | 2014-08-01 | 00:15:00 | NaN | hispanic | male | vehicular | citation | True | Mechanical or Non-Moving Violation (V.C.) | San Francisco |
3 | 2014-08-01 | 00:18:00 | NaN | hispanic | male | vehicular | warning | False | Mechanical or Non-Moving Violation (V.C.) | San Francisco |
4 | 2014-08-01 | 00:19:00 | NaN | white | male | vehicular | citation | True | Mechanical or Non-Moving Violation (V.C.) | San Francisco |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
512087 | 2017-12-31 | 00:48:00 | 28.0 | black | female | vehicular | arrest | False | TRAFFIC VIOLATION | New Orleans |
512088 | 2017-12-31 | 00:48:00 | 25.0 | black | male | vehicular | arrest | False | TRAFFIC VIOLATION | New Orleans |
512089 | 2017-12-31 | 00:48:00 | 23.0 | black | male | vehicular | arrest | False | TRAFFIC VIOLATION | New Orleans |
512090 | 2017-12-31 | 00:48:00 | 25.0 | black | male | vehicular | arrest | False | TRAFFIC VIOLATION | New Orleans |
512091 | 2017-12-31 | 12:49:00 | 37.0 | black | female | vehicular | citation | True | TRAFFIC VIOLATION | New Orleans |
1691720 rows × 10 columns
plt.figure(figsize=(12,6))
ax = sns.countplot(x="city", hue="subject_race", data=eda_merged)
plt.title('Frequency of Stops by Race')
plt.legend(loc = 'upper right')
plt.show()
From the figure above, it is clear that the demographic makeup differs significantly across our three target locations: San Francisco, Pittsburgh, and New Orleans. It is important to note that the frequency of stops by race has some correlation with the racial demographics of these cities. For example, there is a large spike in stops of Black subjects in New Orleans compared to both San Francisco and Pittsburgh. However, data from the Demographic Statistical Atlas show that New Orleans's population is 58.9% Black, compared to 5.4% in San Francisco and 24.3% in Pittsburgh.
#The lengths of the datasets differ by location, with San Francisco having the greatest number of recorded stops.
eda_merged.groupby('city').count()[['citation_issued']]
citation_issued | |
---|---|
city | |
New Orleans | 512092 |
Pittsburgh | 274555 |
San Francisco | 905070 |
eda_time = eda_merged.copy()
eda_time['date'] = pd.to_datetime(eda_time['date'])
eda_time['year'] = eda_time['date'].dt.year
eda_grouped = eda_time.groupby(['city','year']).count()[['citation_issued']].reset_index()
plt.figure(figsize=(15,8))
sns.countplot(x='city', hue='year', data=eda_time)
plt.title('Frequency of Stops by Year Across Locations')
plt.legend(loc = 'best')
plt.show()
From the bar chart above, it can be seen that the larger dataset sizes are not entirely due to a higher number of stops made by police in any single year. As seen in the DataFrame above, San Francisco has the largest number of recorded stops, but the SF data span 2007-2016, with 2016 being partial. The Pittsburgh data span 2008-2018, with partial data in both the first and last years. The New Orleans data cover 2010-2018, with 2018 partial because data collection ended before the end of the year. Therefore, for fair comparisons and aggregations across locations, the data will be limited to the years 2010-2015.
#taking only the stops from 2010-2015
inclusive_years = [2010, 2011, 2012, 2013, 2014, 2015]
eda_standard = eda_time[eda_time['year'].isin(inclusive_years)]
eda_standard.head()
date | time | subject_age | subject_race | subject_sex | type | outcome | citation_issued | reason_for_stop | city | year | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2014-08-01 | 00:01:00 | NaN | asian/pacific islander | female | vehicular | warning | False | Mechanical or Non-Moving Violation (V.C.) | San Francisco | 2014.0 |
1 | 2014-08-01 | 00:01:00 | NaN | black | male | vehicular | citation | True | Mechanical or Non-Moving Violation (V.C.) | San Francisco | 2014.0 |
2 | 2014-08-01 | 00:15:00 | NaN | hispanic | male | vehicular | citation | True | Mechanical or Non-Moving Violation (V.C.) | San Francisco | 2014.0 |
3 | 2014-08-01 | 00:18:00 | NaN | hispanic | male | vehicular | warning | False | Mechanical or Non-Moving Violation (V.C.) | San Francisco | 2014.0 |
4 | 2014-08-01 | 00:19:00 | NaN | white | male | vehicular | citation | True | Mechanical or Non-Moving Violation (V.C.) | San Francisco | 2014.0 |
plt.figure(figsize=(12,6))
ax = sns.countplot(x="city", hue="subject_sex", data=eda_standard)
plt.title('Frequency of Stops by Gender')
plt.legend(loc = 'upper right')
plt.show()
In this figure showing the frequency of stops by gender across the three cities, it is clear that although the SF dataset contains the most stops, the split of stops by gender is fairly consistent across the cities. Stops of males constitute more than half of the data in each location.
eda_merged['year'] = pd.DatetimeIndex(eda_merged["date"]).year
plt.figure(figsize=(12,6))
ax = sns.countplot(x="subject_race", hue="outcome", data=eda_merged.sort_values(by=['outcome', 'subject_race']))
plt.title('Type of Police Stop by Race: Total')
plt.legend(loc = 'upper left')
plt.show()
plt.figure(figsize=(12,6))
sf = eda_merged.loc[eda_merged["city"] == "San Francisco"]
ax = sns.countplot(x="subject_race", hue="outcome", data=sf.sort_values(by=['outcome', 'subject_race']))
plt.title('Type of Police Stop by Race: San Francisco')
plt.legend(loc = 'upper left')
plt.show()
Note: The San Francisco dataset does not have an “unknown” category option listed for the subject’s race.
According to 2018 US Census Bureau data, San Francisco County's population was 40% non-Hispanic White, 5.4% Hispanic White, 5.2% Black or African American, 34.3% Asian, 8.1% some other race, 0.3% Native American and Alaska Native, 0.2% Pacific Islander, and 6.5% two or more races. Based on this, it is interesting to note the relatively low count of Asian/Pacific Islander subjects stopped relative to their share of the population, and the relatively high counts of Black and Hispanic subjects stopped relative to their shares of the population.
plt.figure(figsize=(12,6))
pitt = eda_merged.loc[eda_merged["city"] == "Pittsburgh"]
ax = sns.countplot(x="subject_race", hue="outcome", data=pitt.sort_values(by=['outcome', 'subject_race']))
plt.title('Type of Police Stop by Race: Pittsburgh')
plt.legend(loc = 'upper left')
plt.show()
In Pittsburgh, the two races with the largest shares of stops are black and white. We see relatively similar trends for these two groups, with one small difference: black subjects receive more warnings than citations, while white subjects receive more citations than warnings. This could indicate a number of things, including that more black subjects are stopped unnecessarily or for smaller infractions. Additionally, given that Pittsburgh's population is roughly 65% white and 24% black (from the Demographic Statistical Atlas referenced earlier), we see a higher relative count of stops of black subjects than the city's demographics would predict under an even distribution of stops across races.
plt.figure(figsize=(12,6))
nola = eda_merged.loc[eda_merged["city"] == "New Orleans"]
ax = sns.countplot(x="subject_race", hue="outcome", data=nola.sort_values(by=['outcome', 'subject_race']))
plt.title('Type of Police Stop by Race: New Orleans')
plt.legend(loc = 'upper right')
plt.show()
An interesting difference between New Orleans and our other cities is its very high rate of arrests relative to citations and warnings. San Francisco has the lowest relative arrest rate, followed by Pittsburgh, while in New Orleans arrests occur at nearly the same rate as citations and warnings.
We then took a look to see if there were trends year over year.
plt.figure(figsize=(10,6))
sf_years = sf['year'].value_counts()
sns.lineplot(data=sf_years, label='SF')
pitt_years = pitt['year'].value_counts()
sns.lineplot(data=pitt_years, label='Pittsburgh')
nola_years = nola['year'].value_counts()
sns.lineplot(data=nola_years, label='New Orleans')
plt.ylim(ymin=0)
plt.xlabel("Year")
plt.ylabel("Count")
plt.title("Stops Over Time By Cities")
plt.legend()
plt.show()
We see some interesting trends in stop counts by year that vary by city. New Orleans fluctuates but trends downward, San Francisco mostly decreases over time, and Pittsburgh trends upward and then downward, resulting in an arc-like curve. We then took a deeper look at New Orleans to see whether time trends differed by race, but did not find any immediately striking results.
plt.figure(figsize=(10,6))
nola_black = nola.loc[nola["subject_race"] == "black"]
nola_black = nola_black['year'].value_counts()
sns.lineplot(data=nola_black, label="black")
plt.ylim(ymin=0)
nola_white = nola.loc[nola["subject_race"] == "white"]
nola_white = nola_white['year'].value_counts()
sns.lineplot(data=nola_white, label="white")
plt.ylim(ymin=0)
plt.xlabel("Year")
plt.ylabel("Count")
plt.title("Stops in New Orleans Over Time by Race")
plt.legend()
plt.show()
We then examined whether an analysis of time of day versus stop rates by race produced any interesting results.
plt.figure(figsize=(10,6))
with_time_sf = sfcrime.copy()
with_time_sf['subject_race'] = with_time_sf['subject_race'].replace(['asian/pacific islander', 'hispanic', 'black', 'other'], 'non-white')
with_time_sf['time'] = pd.to_datetime(with_time_sf['time'])
with_time_sf['time'] = with_time_sf['time'].dt.hour
sns.countplot(x='time', hue='subject_race', data=with_time_sf.sort_values(by='subject_race'))
plt.xticks(rotation=70)
plt.title('Count of Stops by Race Throughout Time of Day (SF)');
For the majority of the day in San Francisco, the ratio between the number of non-white subjects stopped and the number of white subjects stopped stays relatively constant with slightly more non-white subjects stopped. We then see that this gap widens between the hours of 10pm and midnight.
plt.figure(figsize=(10,6))
with_time_nola = nola_crime.copy()
with_time_nola['subject_race'] = with_time_nola['subject_race'].replace(['asian/pacific islander', 'hispanic', 'black', 'other', 'unknown'], 'non-white')
with_time_nola['time'] = pd.to_datetime(with_time_nola['time'])
with_time_nola['time'] = with_time_nola['time'].dt.hour
sns.countplot(x='time', hue='subject_race', data=with_time_nola.sort_values(by='subject_race'))
plt.xticks(rotation=70)
plt.title('Count of Stops by Race Throughout Time of Day (New Orleans)');
We see relatively similar ratios throughout the day in New Orleans.
plt.figure(figsize=(10,6))
with_time_pitt = pittcrime.copy()
with_time_pitt['subject_race'] = with_time_pitt['subject_race'].replace(['asian/pacific islander', 'hispanic', 'black', 'other', 'unknown'], 'non-white')
with_time_pitt['time'] = pd.to_datetime(with_time_pitt['time'])
with_time_pitt['time'] = with_time_pitt['time'].dt.hour
sns.countplot(x='time', hue='subject_race', data=with_time_pitt.sort_values(by='subject_race'))
plt.xticks(rotation=70)
plt.title('Count of Stops by Race Throughout Time of Day (Pittsburgh)');
In Pittsburgh, we see the most interesting results. Throughout the daytime hours, more white subjects are stopped. As it gets into the nighttime hours, however, a greater share of those stopped are non-white, and from midnight to 3am non-white subjects make up the majority of those stopped. In contrast to New Orleans and San Francisco, the number of people stopped at night is far lower than during the day.
From these graphs, we find that the relative stop rates for non-white versus white subjects change depending on the time of day in San Francisco and Pittsburgh. This difference in ratio by time of day could possibly indicate police bias and warrants further study to determine the cause.
Modeling
Our modeling and analysis are guided by appropriateness to the research question, clarity in explaining each model and the reasons for using it, and an explicit link between the modeling results and the conclusions we draw.
In particular, we seek to determine whether there is underlying discrimination (e.g., racial or gender-based) reflected in the frequency of stops in each of these cities. From the exploratory data analysis above, it is already clear that each of the selected cities (San Francisco, Pittsburgh, and New Orleans) has different demographics and distributions of stops.
Since our goal of identifying possible underlying discrimination in policing is a broad topic, here are some variables that we took into consideration:
- Severity of the Crime (i.e., whether the stop resulted in a warning, citation, or arrest, in order of increasing severity)
- Time of Day (i.e., when does the stop occur? Is there a distinction between stops occurring at night versus during the day?)
- Type of Stop (i.e., pedestrian, vehicular, etc.)
- Subject Age (i.e., the age of the person stopped)
To promote fair comparisons, the following considerations were taken into account during standardization/cleaning of the data:
- Based on the availability of data, only data from the years 2010-2015 will be used.
- Given the sheer volume of data, and since the total number of recorded stops varies across cities, a random sample of 200,000 stops will be taken from each city.
- “Unknown” or NaN values listed under subject race will be dropped from the datasets.
- Stop times will be converted to 24-hour time, with minutes expressed as fractions of the hour.
- One-hot encoding will be applied to the "outcome" column (three outcomes: arrest, citation, and warning) and the "type" column (two types: vehicular and pedestrian).
sfcrime = sfcrime[["date", "time", "subject_age", "subject_race", "subject_sex", "type", "outcome", "citation_issued", "warning_issued", 'arrest_made', "reason_for_stop"]]
# Drop rows with any missing values, then take a reproducible random sample of 200,000 stops
sf_crime = sfcrime.dropna()
sf_crime_resampled = sf_crime.sample(200000, random_state=30)  # random.seed() does not seed pandas' sampler
sf_crime_resampled[['citation_issued',"warning_issued", "arrest_made"]] = sf_crime_resampled[['citation_issued',"warning_issued", "arrest_made"]].astype(int)
sf_crime_resampled['vehicular'] = (sf_crime_resampled['type']=='vehicular').astype(int)
sf_crime_resampled['pedestrian'] = (sf_crime_resampled['type']=='pedestrian').astype(int)
sf_crime_resampled['time'] = pd.to_datetime(sf_crime_resampled['time'])
sf_crime_resampled['time'] = sf_crime_resampled['time'].dt.hour + sf_crime_resampled['time'].dt.minute/60
sf_crime_resampled.head()
date | time | subject_age | subject_race | subject_sex | type | outcome | citation_issued | warning_issued | arrest_made | reason_for_stop | vehicular | pedestrian | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
496292 | 2011-04-23 | 1.966667 | 30.0 | asian/pacific islander | male | vehicular | warning | 0 | 1 | 0 | Moving Violation | 1 | 0 |
697863 | 2013-09-05 | 1.500000 | 26.0 | white | male | vehicular | citation | 1 | 0 | 0 | Moving Violation | 1 | 0 |
543063 | 2011-10-04 | 23.833333 | 20.0 | asian/pacific islander | female | vehicular | warning | 0 | 1 | 0 | Moving Violation | 1 | 0 |
478772 | 2011-02-21 | 14.000000 | 26.0 | hispanic | male | vehicular | citation | 1 | 0 | 0 | Moving Violation | 1 | 0 |
183677 | 2008-05-29 | 12.216667 | 60.0 | white | male | vehicular | citation | 1 | 0 | 0 | Mechanical or Non-Moving Violation (V.C.) | 1 | 0 |
nola_crime = nola_crime[["date", "time", "subject_age", "subject_race", "subject_sex", "type", "outcome", "citation_issued", "warning_issued", 'arrest_made', "reason_for_stop"]]
# Drop rows with any missing values, then take a reproducible random sample of 200,000 stops
nola_crime = nola_crime.dropna()
nola_crime_resampled = nola_crime.sample(200000, random_state=30)  # random.seed() does not seed pandas' sampler
nola_crime_resampled[['citation_issued',"warning_issued", "arrest_made"]] = nola_crime_resampled[['citation_issued',"warning_issued", "arrest_made"]].astype(int)
nola_crime_resampled['vehicular'] = (nola_crime_resampled['type']=='vehicular').astype(int)
nola_crime_resampled['pedestrian'] = (nola_crime_resampled['type']=='pedestrian').astype(int)
nola_crime_resampled['time'] = pd.to_datetime(nola_crime_resampled['time'])
nola_crime_resampled['time'] = nola_crime_resampled['time'].dt.hour + nola_crime_resampled['time'].dt.minute/60
nola_crime_resampled.head()
date | time | subject_age | subject_race | subject_sex | type | outcome | citation_issued | warning_issued | arrest_made | reason_for_stop | vehicular | pedestrian | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
441600 | 2013-11-05 | 18.500000 | 23.0 | black | male | vehicular | arrest | 1 | 0 | 1 | TRAFFIC VIOLATION | 1 | 0 |
308926 | 2017-08-01 | 16.966667 | 22.0 | black | female | vehicular | citation | 1 | 0 | 0 | TRAFFIC VIOLATION | 1 | 0 |
327897 | 2017-08-14 | 1.733333 | 49.0 | black | female | vehicular | warning | 0 | 1 | 0 | TRAFFIC VIOLATION | 1 | 0 |
476436 | 2015-12-02 | 17.066667 | 27.0 | black | male | vehicular | warning | 0 | 1 | 0 | TRAFFIC VIOLATION | 1 | 0 |
464300 | 2017-11-22 | 23.900000 | 20.0 | black | male | vehicular | arrest | 0 | 0 | 1 | TRAFFIC VIOLATION | 1 | 0 |
pitt_crime = pittcrime[["date", "time", "subject_age", "subject_race", "subject_sex", "type", "outcome", "citation_issued", "warning_issued", 'arrest_made', "reason_for_stop"]]
# Keep only rows where both subject_race and subject_age are present, then inspect one of the remaining rows
pitt_crime = pitt_crime.dropna(subset=['subject_race', 'subject_age'])
pitt_crime = pitt_crime.sample()
pitt_crime
date | time | subject_age | subject_race | subject_sex | type | outcome | citation_issued | warning_issued | arrest_made | reason_for_stop | |
---|---|---|---|---|---|---|---|---|---|---|---|
47015 | 2018-02-20 | 20:59:00 | 46.0 | white | NaN | pedestrian | NaN | False | False | False | narcVice vehicleCodeViolation |
Upon further analysis of the Pittsburgh dataset, there were too many null values to use it. Since our chosen features include subject age and race, simply dropping the null values discards the majority of the dataset. The randomly sampled row above illustrates the problem: even a stop with both race and age recorded is still missing the outcome and other fields. If rows with null outcomes were also dropped, none of the stops in the Pittsburgh dataset would be viable for our analysis.
Therefore, our model and analysis will be limited to San Francisco and New Orleans.
Logistic Regression Classifier
Why logistic regression?
We incorporate logistic regression and naive Bayes into our analysis of the overall accuracy and fairness of police stops as they pertain to race. The inputs are both categorical and quantitative. We use logistic regression (with scikit-learn's default settings) as the baseline model. Since this analysis aims to perform classification, logistic regression is a natural fit: it models the probability that a binary event occurs. In our project, the model delineates a binary result (1 for white vs. 0 for non-white individuals); concretely, it estimates P(white | features) as a logistic function of a weighted sum of the features and classifies a stop as "white" when that probability exceeds 0.5. In essence, evidence of underlying racial disparity would show up as trends in the features that predict whether a police stop involved a white or non-white individual. A similar analysis could also be done based on gender. Logistic regression also lets us examine the relationship between the dependent variable and the independent (including nominal) variables.
To implement logistic regression, the subject_race column is transformed into a binary format: 1 for white vs. 0 for non-white individuals.
sf_crime_resampled['race_binary'] = (sf_crime_resampled['subject_race'] == 'white').astype(int)
sf_crime_resampled.head()
date | time | subject_age | subject_race | subject_sex | type | outcome | citation_issued | warning_issued | arrest_made | reason_for_stop | vehicular | pedestrian | race_binary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
496292 | 2011-04-23 | 1.966667 | 30.0 | asian/pacific islander | male | vehicular | warning | 0 | 1 | 0 | Moving Violation | 1 | 0 | 0 |
697863 | 2013-09-05 | 1.500000 | 26.0 | white | male | vehicular | citation | 1 | 0 | 0 | Moving Violation | 1 | 0 | 1 |
543063 | 2011-10-04 | 23.833333 | 20.0 | asian/pacific islander | female | vehicular | warning | 0 | 1 | 0 | Moving Violation | 1 | 0 | 0 |
478772 | 2011-02-21 | 14.000000 | 26.0 | hispanic | male | vehicular | citation | 1 | 0 | 0 | Moving Violation | 1 | 0 | 0 |
183677 | 2008-05-29 | 12.216667 | 60.0 | white | male | vehicular | citation | 1 | 0 | 0 | Mechanical or Non-Moving Violation (V.C.) | 1 | 0 | 1 |
nola_crime_resampled['race_binary'] = (nola_crime_resampled['subject_race'] == 'white').astype(int)
nola_crime_resampled.head()
date | time | subject_age | subject_race | subject_sex | type | outcome | citation_issued | warning_issued | arrest_made | reason_for_stop | vehicular | pedestrian | race_binary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
441600 | 2013-11-05 | 18.500000 | 23.0 | black | male | vehicular | arrest | 1 | 0 | 1 | TRAFFIC VIOLATION | 1 | 0 | 0 |
308926 | 2017-08-01 | 16.966667 | 22.0 | black | female | vehicular | citation | 1 | 0 | 0 | TRAFFIC VIOLATION | 1 | 0 | 0 |
327897 | 2017-08-14 | 1.733333 | 49.0 | black | female | vehicular | warning | 0 | 1 | 0 | TRAFFIC VIOLATION | 1 | 0 | 0 |
476436 | 2015-12-02 | 17.066667 | 27.0 | black | male | vehicular | warning | 0 | 1 | 0 | TRAFFIC VIOLATION | 1 | 0 | 0 |
464300 | 2017-11-22 | 23.900000 | 20.0 | black | male | vehicular | arrest | 0 | 0 | 1 | TRAFFIC VIOLATION | 1 | 0 | 0 |
To perform modeling and analysis on our dataset, the dataset will be split into a training, validation, and test set:
- Training set 70%
- Validation set 15%
- Test set 15%
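The modeling cells below use scikit-learn. The imports are not shown in this excerpt, so a minimal set is assumed here:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.metrics import confusion_matrix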
San Francisco
#SF features matrix
X_sf = sf_crime_resampled[['subject_age', 'time', 'citation_issued', "warning_issued", "arrest_made", 'vehicular', 'pedestrian']]
y_sf = sf_crime_resampled['race_binary']
# Train/Test Split
X_sf_train, X_sf_test, y_sf_train, y_sf_test = train_test_split(X_sf, y_sf, train_size = .70, test_size = .30)
# Train/Validation Split
X_sf_test, X_sf_validate, y_sf_test, y_sf_validate = train_test_split(X_sf_test, y_sf_test, train_size = .50, test_size = .50)
log_reg = LogisticRegression()
log_model = log_reg.fit(X_sf_train, y_sf_train)
log_pred = log_model.predict(X_sf_train)
log_reg_confusion_matrix = confusion_matrix(y_sf_train, log_pred)
log_model.score(X_sf_validate, y_sf_validate)
0.5872
New Orleans
#New Orleans features matrix
X_nola = nola_crime_resampled[['subject_age', 'time', 'citation_issued', "warning_issued", "arrest_made", 'vehicular', 'pedestrian']]
y_nola = nola_crime_resampled['race_binary']
# Train/Test Split
X_nola_train, X_nola_test, y_nola_train, y_nola_test = train_test_split(X_nola, y_nola, train_size = .70, test_size = .30)
# Train/Validation Split
X_nola_test, X_nola_validate, y_nola_test, y_nola_validate = train_test_split(X_nola_test, y_nola_test, train_size = .50, test_size = .50)
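The confusion-matrix plots below call a plot_confusion_matrix helper and a class_names list that are not defined in this excerpt. A minimal sketch is given here, assuming they mirror the classic scikit-learn documentation example:
import itertools
import numpy as np

class_names = ['non-white', 'white']  # 0 = non-white, 1 = white

def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion Matrix', cmap=plt.cm.Blues):
    # Optionally convert counts to row-wise proportions
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    # Annotate each cell with its count (or proportion)
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), horizontalalignment='center',
                 color='white' if cm[i, j] > thresh else 'black')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()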
log_reg = LogisticRegression()
log_model = log_reg.fit(X_nola_train, y_nola_train)
log_pred = log_model.predict(X_nola_train)
log_reg_confusion_matrix_nola = confusion_matrix(y_nola_train, log_pred)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(log_reg_confusion_matrix_nola, classes=class_names,
title='Confusion Matrix')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(log_reg_confusion_matrix_nola, classes=class_names, normalize=True,
title='Normalized Confusion Matrix')
log_model.score(X_nola_validate, y_nola_validate)
0.7420666666666667
Ridge Regression Classifier
Why Ridge Regression?
Ridge regression is least-squares regression with an L2 penalty. It is useful when the model suffers from multicollinearity or when the number of predictors is large relative to the number of observations (e.g., p > n), situations in which ordinary least squares becomes unreliable. It works by shrinking the coefficient estimates: the L2 ("squared") regularization adds a penalty equal to the squared magnitude of the coefficients, and a tuning parameter 𝜆 determines the strength of this penalty. The constraint on the estimators damps extreme variance and fluctuations; this sacrifices some training accuracy for a model that is likely to generalize better. In other words, ridge regression introduces enough bias to reduce variance, often yielding estimates closer to the true population values.
Here, we try to improve upon the logistic model. RidgeClassifier is a classifier variant of the Ridge regressor (provided by scikit-learn), and can be much faster than logistic regression. It is used below in place of the sklearn.linear_model.Ridge() model.
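As a sketch of how the penalty strength could be tuned (scikit-learn's alpha parameter plays the role of 𝜆; this tuning step is only an illustration and is not part of the baseline below):
# Pick the RidgeClassifier penalty strength on the validation set
best_alpha, best_score = None, -1
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    candidate = RidgeClassifier(alpha=alpha).fit(X_sf_train, y_sf_train)
    score = candidate.score(X_sf_validate, y_sf_validate)
    if score > best_score:
        best_alpha, best_score = alpha, score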
San Francisco
ridge = RidgeClassifier()
ridge_model = ridge.fit(X_sf_train, y_sf_train)
ridge_pred = ridge_model.predict(X_sf_train)
ridge_confusion_matrix_sf = confusion_matrix(y_sf_train, ridge_pred)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(ridge_confusion_matrix_sf, classes=class_names,
title='Confusion Matrix')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(ridge_confusion_matrix_sf, classes=class_names, normalize=True,
title='Normalized Confusion Matrix')
ridge_model.score(X_sf_validate, y_sf_validate)
0.5870333333333333
New Orleans
ridge_nola = RidgeClassifier()
ridge_nola_model = ridge_nola.fit(X_nola_train, y_nola_train)
ridge_nola_pred = ridge_nola_model.predict(X_nola_train)
ridge_confusion_matrix_nola = confusion_matrix(y_nola_train, ridge_nola_pred)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(ridge_confusion_matrix_nola, classes=class_names,
title='Confusion Matrix')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(ridge_confusion_matrix_nola, classes=class_names, normalize=True,
title='Normalized Confusion Matrix')
ridge_nola_model.score(X_nola_validate, y_nola_validate)
0.7420666666666667
One-Hot Encoding
What follows are our attempts at one-hot encoding reason_for_stop and using the resulting columns as additional features as a means of improving accuracy.
San Francisco
sf_crime_resampled_new = pd.concat([sf_crime_resampled,pd.get_dummies(sf_crime_resampled['reason_for_stop'], prefix='stopreason')],axis=1)
sf_crime_resampled_new
date | time | subject_age | subject_race | subject_sex | type | outcome | citation_issued | warning_issued | arrest_made | reason_for_stop | vehicular | pedestrian | race_binary | stopreason_Assistance to Motorist | stopreason_BOLO/APB/Warrant | stopreason_DUI Check | stopreason_MPC Violation | stopreason_MPC Violation|Moving Violation | stopreason_Mechanical or Non-Moving Violation (V.C.) | stopreason_Mechanical or Non-Moving Violation (V.C.)|DUI Check | stopreason_Mechanical or Non-Moving Violation (V.C.)|Moving Violation | stopreason_Moving Violation | stopreason_Moving Violation|Assistance to Motorist | stopreason_Moving Violation|DUI Check | stopreason_Moving Violation|Mechanical or Non-Moving Violation (V.C.) | stopreason_Moving Violation|NA | stopreason_Moving Violation|Traffic Collision | stopreason_Traffic Collision | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
496292 | 2011-04-23 | 1.966667 | 30.0 | asian/pacific islander | male | vehicular | warning | 0 | 1 | 0 | Moving Violation | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
697863 | 2013-09-05 | 1.500000 | 26.0 | white | male | vehicular | citation | 1 | 0 | 0 | Moving Violation | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
543063 | 2011-10-04 | 23.833333 | 20.0 | asian/pacific islander | female | vehicular | warning | 0 | 1 | 0 | Moving Violation | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
478772 | 2011-02-21 | 14.000000 | 26.0 | hispanic | male | vehicular | citation | 1 | 0 | 0 | Moving Violation | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
183677 | 2008-05-29 | 12.216667 | 60.0 | white | male | vehicular | citation | 1 | 0 | 0 | Mechanical or Non-Moving Violation (V.C.) | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
452716 | 2010-11-18 | 21.750000 | 44.0 | white | female | vehicular | citation | 1 | 0 | 0 | Moving Violation | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
119254 | 2007-11-04 | 21.866667 | 62.0 | white | male | vehicular | arrest | 0 | 0 | 1 | Moving Violation | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
356527 | 2009-12-21 | 11.833333 | 21.0 | other | male | vehicular | citation | 1 | 0 | 0 | Moving Violation | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
487440 | 2011-03-22 | 21.750000 | 23.0 | white | male | vehicular | citation | 1 | 0 | 0 | Mechanical or Non-Moving Violation (V.C.) | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
823151 | 2015-09-17 | 13.750000 | 40.0 | hispanic | male | vehicular | warning | 0 | 1 | 0 | Moving Violation | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
200000 rows × 29 columns
sf_crime_resampled_new.drop(['reason_for_stop'],axis=1, inplace=True)
# columns_we_want is not defined in this excerpt; reconstructed here as the numeric stop features plus the one-hot encoded stop-reason columns
columns_we_want = ['time', 'subject_age', 'citation_issued', 'warning_issued', 'arrest_made', 'vehicular', 'pedestrian'] + [c for c in sf_crime_resampled_new.columns if c.startswith('stopreason_')]
X_sf_new = sf_crime_resampled_new[columns_we_want]
y_sf_new = sf_crime_resampled_new['race_binary']
# Train/Test Split
X_sf_train, X_sf_test, y_sf_train, y_sf_test = train_test_split(X_sf_new, y_sf_new, train_size = .70, test_size = .30)
# Train/Validation Split
X_sf_test, X_sf_validate, y_sf_test, y_sf_validate = train_test_split(X_sf_test, y_sf_test, train_size = .50, test_size = .50)
log_reg = LogisticRegression()
log_model = log_reg.fit(X_sf_train, y_sf_train)
log_pred = log_model.predict(X_sf_train)
log_confusion_matrix_sf = confusion_matrix(y_sf_train, log_pred)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(log_confusion_matrix_sf, classes=class_names,
title='Confusion Matrix')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(log_confusion_matrix_sf, classes=class_names, normalize=True,
title='Normalized Confusion Matrix')
log_model.score(X_sf_validate, y_sf_validate)
0.5855
ridge = RidgeClassifier()
ridge_model = ridge.fit(X_sf_train, y_sf_train)
ridge_pred = ridge_model.predict(X_sf_train)
ridge_confusion_matrix_sf = confusion_matrix(y_sf_train, ridge_pred)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(ridge_confusion_matrix_sf, classes=class_names,
title='Confusion Matrix')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(ridge_confusion_matrix_sf, classes=class_names, normalize=True,
title='Normalized Confusion Matrix')
ridge_model.score(X_sf_validate, y_sf_validate)
0.5855
New Orleans
nola_crime_resampled_new = pd.concat([nola_crime_resampled,pd.get_dummies(nola_crime_resampled['reason_for_stop'], prefix='stopreason')],axis=1)
nola_crime_resampled_new
date | time | subject_age | subject_race | subject_sex | type | outcome | citation_issued | warning_issued | arrest_made | reason_for_stop | vehicular | pedestrian | race_binary | stopreason_SUSPECT PERSON | stopreason_SUSPECT VEHICLE | stopreason_TRAFFIC VIOLATION | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
441600 | 2013-11-05 | 18.500000 | 23.0 | black | male | vehicular | arrest | 1 | 0 | 1 | TRAFFIC VIOLATION | 1 | 0 | 0 | 0 | 0 | 1 |
308926 | 2017-08-01 | 16.966667 | 22.0 | black | female | vehicular | citation | 1 | 0 | 0 | TRAFFIC VIOLATION | 1 | 0 | 0 | 0 | 0 | 1 |
327897 | 2017-08-14 | 1.733333 | 49.0 | black | female | vehicular | warning | 0 | 1 | 0 | TRAFFIC VIOLATION | 1 | 0 | 0 | 0 | 0 | 1 |
476436 | 2015-12-02 | 17.066667 | 27.0 | black | male | vehicular | warning | 0 | 1 | 0 | TRAFFIC VIOLATION | 1 | 0 | 0 | 0 | 0 | 1 |
464300 | 2017-11-22 | 23.900000 | 20.0 | black | male | vehicular | arrest | 0 | 0 | 1 | TRAFFIC VIOLATION | 1 | 0 | 0 | 0 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
157539 | 2012-04-23 | 14.916667 | 69.0 | black | female | vehicular | warning | 0 | 1 | 0 | TRAFFIC VIOLATION | 1 | 0 | 0 | 0 | 0 | 1 |
26292 | 2018-01-18 | 12.633333 | 57.0 | black | male | vehicular | arrest | 1 | 0 | 1 | TRAFFIC VIOLATION | 1 | 0 | 0 | 0 | 0 | 1 |
332641 | 2011-08-18 | 16.816667 | 44.0 | white | male | vehicular | citation | 1 | 0 | 0 | TRAFFIC VIOLATION | 1 | 0 | 1 | 0 | 0 | 1 |
311926 | 2017-08-03 | 19.083333 | 61.0 | black | male | vehicular | warning | 0 | 1 | 0 | TRAFFIC VIOLATION | 1 | 0 | 0 | 0 | 0 | 1 |
329071 | 2015-08-15 | 18.233333 | 25.0 | black | male | vehicular | arrest | 1 | 0 | 1 | TRAFFIC VIOLATION | 1 | 0 | 0 | 0 | 0 | 1 |
200000 rows × 17 columns
nola_columns = list(nola_crime_resampled_new.columns)
nola_columns_desired = ['time',
'subject_age',
'citation_issued',
'warning_issued',
'arrest_made',
'vehicular',
'pedestrian',
'stopreason_SUSPECT PERSON',
'stopreason_SUSPECT VEHICLE',
'stopreason_TRAFFIC VIOLATION']
X_nola_new = nola_crime_resampled_new[nola_columns_desired]
y_nola_new = nola_crime_resampled_new['race_binary']
# Train/Test Split
X_nola_train, X_nola_test, y_nola_train, y_nola_test = train_test_split(X_nola_new, y_nola_new, train_size = .70, test_size = .30)
# Train/Validation Split
X_nola_test, X_nola_validate, y_nola_test, y_nola_validate = train_test_split(X_nola_test, y_nola_test, train_size = .50, test_size = .50)
log_reg = LogisticRegression()
log_model = log_reg.fit(X_nola_train, y_nola_train)
log_pred = log_model.predict(X_nola_train)
log_confusion_matrix_nola = confusion_matrix(y_nola_train, log_pred)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(log_confusion_matrix_nola, classes=class_names,
title='Confusion Matrix')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(log_confusion_matrix_nola, classes=class_names, normalize=True,
title='Normalized Confusion Matrix')
log_model.score(X_nola_validate, y_nola_validate)
0.7444
ridge_nola = RidgeClassifier()
ridge_nola_model = ridge_nola.fit(X_nola_train, y_nola_train)
ridge_nola_pred = ridge_nola_model.predict(X_nola_train)
confusion_matrix(y_nola_train, ridge_nola_pred)
Note: The RidgeClassifier did not change the confusion matrix for New Orleans.
ridge_nola_model.score(X_nola_validate, y_nola_validate)
0.7444333333333333
Modeling Conclusions
Numerical Results
In modeling the San Francisco dataset, the logistic regression and ridge classifier models (with and without the additional one-hot encoded features) both gave validation accuracies of roughly 58-59%.
In modeling the New Orleans dataset, the logistic and ridge classifier models scored validation accuracies of roughly 74%.
Evaluation Methods
Our two methods of evaluating results are confusion matrices and the model.score() function. In our confusion matrices, true negatives appear at (0,0), false negatives at (1,0), false positives at (0,1), and true positives at (1,1). True negatives are non-white subjects predicted to be non-white, false negatives are white subjects predicted to be non-white, false positives are non-white subjects predicted to be white, and true positives are white subjects predicted to be white. model.score() returns the mean accuracy on the given data and labels.
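For example, the four cells of a 2x2 confusion matrix can be unpacked directly; a short sketch using the New Orleans logistic regression confusion matrix computed above:
# Rows are true labels, columns are predicted labels (0 = non-white, 1 = white)
tn, fp, fn, tp = log_reg_confusion_matrix_nola.ravel()
accuracy = (tn + tp) / (tn + fp + fn + tp)  # the same quantity that model.score() reports on this data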
Interpretation of Results
We see higher accuracies for our New Orleans models because the models predict virtually all stops to be non-white. For the New Orleans dataset, this yields higher accuracy because the majority of the dataset (and of the city's population) is non-white. In San Francisco, on the other hand, a plurality of the population is white. Thus, our scores largely reflect the percentage of non-white subjects within each dataset. It is notable, however, that the percentage of non-white subjects within the dataset is higher than the overall percentage of people of color in New Orleans, suggesting that people of color are stopped at a higher rate than white people even after controlling for population share.
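A quick check consistent with this reading (not part of the original analysis) is to look at the distribution of predicted classes on a validation set:
# Share of validation-set predictions per class (0 = non-white, 1 = white)
pd.Series(log_model.predict(X_nola_validate)).value_counts(normalize=True)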
Reflecting on the Chosen Models
The accuracy of our predictive models for San Francisco is low: the model predicts correctly only a little more than half the time. While this may be a consequence of the chosen features or of the classifiers themselves, it should also be considered that the features may simply not correlate well with race. This would explain why adding additional features to the model did not increase accuracy by a significant amount; training on additional noise would, if anything, decrease accuracy. In other words, rather than there being a problem with the model, there may just be little to no underlying racial discrimination reflected in the features we chose to train the model on.
Potential Improvements
Oversampling: The datasets are imbalanced in the proportion of white to non-white subjects. Especially in the New Orleans dataset, there are considerably more non-white than white individuals in the sampled training set, which leads the models to predict subject_race as non-white almost exclusively. A potential solution is to oversample rows with white subjects in the training data so that the model becomes more likely to predict a white subject. Since one of the primary goals of model validation is to estimate how the model will perform on unseen data, it is critical to oversample only the training split (a sketch follows at the end of this section).
Bootstrapping: Given the imbalance in the datasets (especially New Orleans) between the white and non-white populations, another way to create more variability in the data would be to bootstrap (random sampling with replacement). In this way, the model could train on more data without requiring additional data collection.
Important null values: In standardizing the San Francisco and New Orleans datasets, all Null/NaN/None values were dropped. However, there may be underlying patterns in these dropped values. More exploratory data analysis could be performed to see why certain values fail to appear in the given datasets.
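A minimal sketch of the oversampling idea mentioned above, applied only to the New Orleans training split defined earlier:
# Oversample the minority class (white, race_binary == 1) in the training data only
train = X_nola_train.copy()
train['race_binary'] = y_nola_train
majority = train[train['race_binary'] == 0]
minority = train[train['race_binary'] == 1]
# Sample the minority class with replacement until the classes are balanced, then shuffle
minority_upsampled = minority.sample(len(majority), replace=True, random_state=30)
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=30)
X_balanced = balanced.drop(columns='race_binary')
y_balanced = balanced['race_binary']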