Analyzing Regional Trends in Police Stop Data Across the United States
Names: Rishabh Parekh, Emmy Yu, Manu Prakasam
About the Data: Ethical Data Collection
Publicly-available Standardized Stop Data from: https://openpolicing.stanford.edu/data/
The Stanford Open Policing Project data are made available under the Open Data Commons Attribution License, which grants the following freedoms:
To Share: To copy, distribute and use the database.
To Create: To produce works from the database.
To Adapt: To modify, transform and build upon the database.
The Stanford Open Policing Project data come from the working paper:
E. Pierson, C. Simoiu, J. Overgoor, S. Corbett-Davies, D. Jenson, A. Shoemaker, V. Ramachandran, P. Barghouty, C. Phillips, R. Shroff, and S. Goel. (2019) “A large-scale analysis of racial disparities in police stops across the United States”.
The accompanying README file describes how the data were standardized and cleaned for privacy and security reasons. It also describes the meaning of each column.
Downloading data from The Stanford Open Policing Project:
Specifically, we will be looking at the datasets from San Francisco, New Orleans, and Pittsburgh.
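The code below assumes the standard pandas/plotting imports, which are not shown in this excerpt; a minimal set under that assumption:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns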
sfcrime = pd.read_csv('ca_san_francisco_2019_12_17.csv')
sfcrime.head()
raw_row_number | date | time | location | lat | lng | district | subject_age | subject_race | subject_sex | type | arrest_made | citation_issued | warning_issued | outcome | contraband_found | search_conducted | search_vehicle | search_basis | reason_for_stop | raw_search_vehicle_description | raw_result_of_contact_description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 869921 | 2014-08-01 | 00:01:00 | MASONIC AV & FELL ST | 37.773004 | -122.445873 | NaN | NaN | asian/pacific islander | female | vehicular | False | False | True | warning | NaN | False | False | NaN | Mechanical or Non-Moving Violation (V.C.) | No Search | Warning |
1 | 869922 | 2014-08-01 | 00:01:00 | GEARY&10TH AV | 37.780898 | -122.468586 | NaN | NaN | black | male | vehicular | False | True | False | citation | NaN | False | False | NaN | Mechanical or Non-Moving Violation (V.C.) | No Search | Citation |
2 | 869923 | 2014-08-01 | 00:15:00 | SUTTER N OCTAVIA ST | 37.786919 | -122.426718 | NaN | NaN | hispanic | male | vehicular | False | True | False | citation | NaN | False | False | NaN | Mechanical or Non-Moving Violation (V.C.) | No Search | Citation |
3 | 869924 | 2014-08-01 | 00:18:00 | 3RD ST & DAVIDSON | 37.746380 | -122.392005 | NaN | NaN | hispanic | male | vehicular | False | False | True | warning | NaN | False | False | NaN | Mechanical or Non-Moving Violation (V.C.) | No Search | Warning |
4 | 869925 | 2014-08-01 | 00:19:00 | DIVISADERO ST. & BUSH ST. | 37.786348 | -122.440003 | NaN | NaN | white | male | vehicular | False | True | False | citation | NaN | False | False | NaN | Mechanical or Non-Moving Violation (V.C.) | No Search | Citation |
nola_crime = pd.read_csv('la_new_orleans_2019_12_17.csv')
nola_crime.head()
raw_row_number | date | time | location | lat | lng | district | zone | subject_age | subject_race | subject_sex | officer_assignment | type | arrest_made | citation_issued | warning_issued | outcome | contraband_found | contraband_drugs | contraband_weapons | frisk_performed | search_conducted | search_person | search_vehicle | search_basis | reason_for_stop | vehicle_color | vehicle_make | vehicle_model | vehicle_year | raw_actions_taken | raw_subject_race | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2010-01-01 | 01:11:00 | NaN | NaN | NaN | 6 | E | 26.0 | black | female | 6th District | vehicular | False | False | False | NaN | NaN | NaN | NaN | False | False | False | False | NaN | TRAFFIC VIOLATION | BLACK | DODGE | CARAVAN | 2005.0 | NaN | BLACK |
1 | 9087 | 2010-01-01 | 01:29:00 | NaN | NaN | NaN | 7 | C | 37.0 | black | male | 7th District | vehicular | False | False | False | NaN | NaN | NaN | NaN | False | False | False | False | NaN | TRAFFIC VIOLATION | BLUE | NISSAN | MURANO | 2005.0 | NaN | BLACK |
2 | 9086 | 2010-01-01 | 01:29:00 | NaN | NaN | NaN | 7 | C | 37.0 | black | male | 7th District | vehicular | False | False | False | NaN | NaN | NaN | NaN | False | False | False | False | NaN | TRAFFIC VIOLATION | BLUE | NISSAN | MURANO | 2005.0 | NaN | BLACK |
3 | 267 | 2010-01-01 | 14:00:00 | NaN | NaN | NaN | 7 | I | 96.0 | black | male | 7th District | vehicular | False | False | False | NaN | NaN | NaN | NaN | False | False | False | False | NaN | TRAFFIC VIOLATION | GRAY | JEEP | GRAND CHEROKEE | 2003.0 | NaN | BLACK |
4 | 2 | 2010-01-01 | 02:06:00 | NaN | NaN | NaN | 5 | D | 17.0 | black | male | 5th District | NaN | False | False | False | NaN | NaN | NaN | NaN | False | False | False | False | NaN | CALL FOR SERVICE | NaN | NaN | NaN | NaN | NaN | BLACK |
pittcrime = pd.read_csv('pa_pittsburgh_2019_12_17.csv')
pittcrime.head()
raw_row_number | date | time | location | lat | lng | neighborhood | subject_age | subject_race | subject_sex | officer_id_hash | officer_age | officer_race | officer_sex | type | violation | arrest_made | citation_issued | warning_issued | outcome | contraband_found | frisk_performed | search_conducted | reason_for_stop | raw_zone | raw_object_searched | raw_race | raw_ethnicity | raw_zone_division | raw_evidence_found | raw_weapons_found | raw_nothing_found | raw_police_zone | raw_officer_race | raw_officer_zone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2008-01-01 | 00:14:00 | 351 S Negley Ave | 40.459466 | -79.932802 | NaN | 20.0 | white | male | 3bb3b1bd48 | 41.0 | NaN | NaN | pedestrian | NaN | False | False | False | NaN | NaN | NaN | False | Other | NaN | NaN | White | White | - | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 3 | 2008-01-01 | 00:14:00 | 376 Main St | 40.465868 | -79.955594 | NaN | 19.0 | white | male | 3bb3b1bd48 | 41.0 | NaN | NaN | pedestrian | NaN | False | False | False | NaN | NaN | NaN | False | Other | NaN | NaN | White | White | - | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2 | 2008-01-01 | 00:14:00 | Stamair Way & Baum Blvd | 40.456812 | -79.939041 | NaN | 16.0 | white | male | 3bb3b1bd48 | 41.0 | NaN | NaN | pedestrian | NaN | False | False | False | NaN | NaN | NaN | False | Other | - | NaN | White | White | - | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 4 | 2008-01-01 | 01:59:00 | N Braddock Ave & Thomas Blvd | 40.448873 | -79.893923 | NaN | 21.0 | NaN | male | b62aedb5bb | 29.0 | NaN | NaN | pedestrian | NaN | True | False | False | arrest | NaN | NaN | True | majorCrimes Other | - | person | Black | White | - | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 5 | 2008-01-01 | 14:50:00 | 2518 West Liberty Ave | 40.398780 | -80.026439 | NaN | 41.0 | white | male | 1ccb6bd45a | NaN | NaN | NaN | pedestrian | NaN | False | False | False | NaN | NaN | NaN | True | narcVice | - | person vehicle place | White | NaN | N/V | NaN | NaN | NaN | NaN | NaN | NaN |
Data Applicability:
Our research question assesses the relationship between race/socioeconomic background and the number of interactions with the police. The data give us the race of the individual and the location and time of each stop a police officer makes. Location can give us a general idea of an individual's socioeconomic background based on the neighborhood they are in.
Furthermore, using the race of the individual, we can find the total number of stops per race. We will also compare the number of stops for a given race with that race's population in the area, which can show whether there is a disproportionate rate of interaction between law enforcement and minorities. A rough sketch of this calculation follows below.
Using the Stanford Open Policing data, we will also compare three distinctly different cities: San Francisco, Pittsburgh, and New Orleans. This will allow us to see whether policing is similar or different across the nation, and will give us a more general picture of the relationship between minorities and law enforcement.
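As a rough sketch of the stop-rate comparison described above (the population shares below are illustrative placeholders, not actual census figures):
# Placeholder population shares by race for one city -- illustrative only, not census data
population_share = pd.Series({'white': 0.40, 'black': 0.05,
                              'hispanic': 0.15, 'asian/pacific islander': 0.34})
stop_share = sfcrime['subject_race'].value_counts(normalize=True)  # share of stops per race
disparity_ratio = (stop_share / population_share).dropna()         # values > 1 mean over-representation among stops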
Standardizing the Data:
Certain columns that are common across all cities (and lend themselves to interesting analyses) were selected and merged to create one DataFrame for easier analysis and plotting.
Each row in the cleaned data represents a stop, and coverage varies by location. Per the Open Policing README, certain fields were removed before public release due to privacy concerns, and all columns except raw_row_number, violation, disposition, location, officer_assignment, any city or state subgeography (i.e., county, beat, division, etc.), unit, and vehicle_{color, make, model, type} are digit-sanitized (each digit replaced with "-") for privacy reasons. For the purposes of our analysis, these columns were dropped from the original dataset.
Exploratory Data Analysis
#Select the columns common to all three cities and label each row with its city
common_cols = ["date", "time", "subject_age", "subject_race", "subject_sex", "type", "outcome", "citation_issued", "reason_for_stop"]
sf_crime_to_merge = sfcrime[common_cols].copy()
sf_crime_to_merge["city"] = "San Francisco"
pitt_crime_to_merge = pittcrime[common_cols].copy()
pitt_crime_to_merge["city"] = "Pittsburgh"
nola_crime_to_merge = nola_crime[common_cols].copy()
nola_crime_to_merge["city"] = "New Orleans"
#Exploratory Data Analysis DataFrame (DataFrame.append is deprecated, so pd.concat is used)
eda_merged = pd.concat([sf_crime_to_merge, pitt_crime_to_merge, nola_crime_to_merge])
eda_merged
date | time | subject_age | subject_race | subject_sex | type | outcome | citation_issued | reason_for_stop | city | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2014-08-01 | 00:01:00 | NaN | asian/pacific islander | female | vehicular | warning | False | Mechanical or Non-Moving Violation (V.C.) | San Francisco |
1 | 2014-08-01 | 00:01:00 | NaN | black | male | vehicular | citation | True | Mechanical or Non-Moving Violation (V.C.) | San Francisco |
2 | 2014-08-01 | 00:15:00 | NaN | hispanic | male | vehicular | citation | True | Mechanical or Non-Moving Violation (V.C.) | San Francisco |
3 | 2014-08-01 | 00:18:00 | NaN | hispanic | male | vehicular | warning | False | Mechanical or Non-Moving Violation (V.C.) | San Francisco |
4 | 2014-08-01 | 00:19:00 | NaN | white | male | vehicular | citation | True | Mechanical or Non-Moving Violation (V.C.) | San Francisco |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
512087 | 2017-12-31 | 00:48:00 | 28.0 | black | female | vehicular | arrest | False | TRAFFIC VIOLATION | New Orleans |
512088 | 2017-12-31 | 00:48:00 | 25.0 | black | male | vehicular | arrest | False | TRAFFIC VIOLATION | New Orleans |
512089 | 2017-12-31 | 00:48:00 | 23.0 | black | male | vehicular | arrest | False | TRAFFIC VIOLATION | New Orleans |
512090 | 2017-12-31 | 00:48:00 | 25.0 | black | male | vehicular | arrest | False | TRAFFIC VIOLATION | New Orleans |
512091 | 2017-12-31 | 12:49:00 | 37.0 | black | female | vehicular | citation | True | TRAFFIC VIOLATION | New Orleans |
1691720 rows × 10 columns
plt.figure(figsize=(12,6))
ax = sns.countplot(x="city", hue="subject_race", data=eda_merged)
plt.title('Frequency of Stops by Race')
plt.legend(loc = 'upper right')
plt.show()
From the figure above, it is clear that the demographic makeup differs significantly across our three target locations: San Francisco, Pittsburgh, and New Orleans. It is important to note that the frequency of stops by race has some correlation with the racial demographics of these cities. For example, there is a large spike in stops of Black subjects in New Orleans compared to both San Francisco and Pittsburgh. However, data from the Demographic Statistical Atlas show that New Orleans's population is 58.9% Black, compared to 5.4% in San Francisco and 24.3% in Pittsburgh.
#The lengths of the datasets differ by location, with San Francisco having the greatest number of recorded stops.
eda_merged.groupby('city').count()[['citation_issued']]
citation_issued | |
---|---|
city | |
New Orleans | 512092 |
Pittsburgh | 274555 |
San Francisco | 905070 |
eda_time = eda_merged.copy()
eda_time['date'] = pd.to_datetime(eda_time['date'])
eda_time['year'] = eda_time['date'].dt.year
eda_grouped = eda_time.groupby(['city','year']).count()[['citation_issued']].reset_index()
plt.figure(figsize=(15,8))
sns.countplot(x='city', hue='year', data=eda_time)
plt.title('Frequency of Stops by Year Across Locations')
plt.legend(loc = 'best')
plt.show()
From the bar chart above, it can be seen that the larger dataset sizes are not entirely due to a higher number of stops made by police in any single year. As seen in the DataFrame above, San Francisco has the largest number of recorded stops, but the SF data span 2007-2016, with 2016 being partial. The Pittsburgh data span 2008-2018, with partial data in both the first and last years. The New Orleans data cover 2010-2018, with 2018 partial because data collection ended before the end of the year. Therefore, for fair comparisons and aggregations across locations, the data will be limited to the years 2010-2015.
#taking only the stops from 2010-2015
inclusive_years = [2010, 2011, 2012, 2013, 2014, 2015]
eda_standard = eda_time[eda_time['year'].isin(inclusive_years)]
eda_standard.head()
date | time | subject_age | subject_race | subject_sex | type | outcome | citation_issued | reason_for_stop | city | year | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2014-08-01 | 00:01:00 | NaN | asian/pacific islander | female | vehicular | warning | False | Mechanical or Non-Moving Violation (V.C.) | San Francisco | 2014.0 |
1 | 2014-08-01 | 00:01:00 | NaN | black | male | vehicular | citation | True | Mechanical or Non-Moving Violation (V.C.) | San Francisco | 2014.0 |
2 | 2014-08-01 | 00:15:00 | NaN | hispanic | male | vehicular | citation | True | Mechanical or Non-Moving Violation (V.C.) | San Francisco | 2014.0 |
3 | 2014-08-01 | 00:18:00 | NaN | hispanic | male | vehicular | warning | False | Mechanical or Non-Moving Violation (V.C.) | San Francisco | 2014.0 |
4 | 2014-08-01 | 00:19:00 | NaN | white | male | vehicular | citation | True | Mechanical or Non-Moving Violation (V.C.) | San Francisco | 2014.0 |
plt.figure(figsize=(12,6))
ax = sns.countplot(x="city", hue="subject_sex", data=eda_standard)
plt.title('Frequency of Stops by Gender')
plt.legend(loc = 'upper right')
plt.show()
In this figure showing the frequency of stops by gender across the three cities, it is clear that although the SF dataset contains the most stops, the split of stops by gender is fairly consistent across the cities. Stops of males constitute more than half of the data in each location.
eda_merged['year'] = pd.DatetimeIndex(eda_merged["date"]).year
plt.figure(figsize=(12,6))
ax = sns.countplot(x="subject_race", hue="outcome", data=eda_merged.sort_values(by=['outcome', 'subject_race']))
plt.title('Type of Police Stop by Race: Total')
plt.legend(loc = 'upper left')
plt.show()
plt.figure(figsize=(12,6))
sf = eda_merged.loc[eda_merged["city"] == "San Francisco"]
ax = sns.countplot(x="subject_race", hue="outcome", data=sf.sort_values(by=['outcome', 'subject_race']))
plt.title('Type of Police Stop by Race: San Francisco')
plt.legend(loc = 'upper left')
plt.show()
Note: The San Francisco dataset does not have an “unknown” category option listed for the subject’s race.
According to 2018 US Census Bureau data, San Francisco County's population was 40% non-Hispanic White, 5.4% Hispanic White, 5.2% Black or African American, 34.3% Asian, 8.1% some other race, 0.3% Native American and Alaska Native, 0.2% Pacific Islander, and 6.5% two or more races. Based on this, it is interesting to note the relatively low count of Asian/Pacific Islander subjects stopped relative to their share of the population, and the relatively high counts of Black and Hispanic subjects stopped relative to their shares of the population.
plt.figure(figsize=(12,6))
pitt = eda_merged.loc[eda_merged["city"] == "Pittsburgh"]
ax = sns.countplot(x="subject_race", hue="outcome", data=pitt.sort_values(by=['outcome', 'subject_race']))
plt.title('Type of Police Stop by Race: Pittsburgh')
plt.legend(loc = 'upper left')
plt.show()
In Pittsburgh, the two races with the largest shares of stops are black and white. We see relatively similar trends for these two groups, with one small difference: black subjects receive more warnings than citations, while white subjects receive more citations than warnings. This could indicate a number of things, including that more black subjects are stopped unnecessarily or for smaller infractions. Additionally, given that Pittsburgh's population is roughly 65% white and 24% black (from the Demographic Statistical Atlas referenced earlier), we see a higher relative count of stops of black subjects than the city's demographics would predict under an even distribution of stops across races.
plt.figure(figsize=(12,6))
nola = eda_merged.loc[eda_merged["city"] == "New Orleans"]
ax = sns.countplot(x="subject_race", hue="outcome", data=nola.sort_values(by=['outcome', 'subject_race']))
plt.title('Type of Police Stop by Race: New Orleans')
plt.legend(loc = 'upper right')
plt.show()
An interesting difference between New Orleans and our other cities is its very high rate of arrests relative to citations and warnings. San Francisco has the lowest relative arrest rate, followed by Pittsburgh, while in New Orleans arrests occur at nearly the same rate as citations and warnings.
We then took a look to see if there were trends year over year.
plt.figure(figsize=(10,6))
sf_years = sf['year'].value_counts()
sns.lineplot(data=sf_years, label='SF')
pitt_years = pitt['year'].value_counts()
sns.lineplot(data=pitt_years, label='Pittsburgh')
nola_years = nola['year'].value_counts()
sns.lineplot(data=nola_years, label='New Orleans')
plt.ylim(ymin=0)
plt.xlabel("Year")
plt.ylabel("Count")
plt.title("Stops Over Time By Cities")
plt.legend()
plt.show()
We see some interesting trends in stop counts by year that vary by city. New Orleans fluctuates but trends downward, San Francisco mostly decreases over time, and Pittsburgh trends upward and then downward, resulting in an arc-like curve. We then took a deeper look at New Orleans to see whether time trends differed by race, but did not find any immediately striking results.
plt.figure(figsize=(10,6))
nola_black = nola.loc[nola["subject_race"] == "black"]
nola_black = nola_black['year'].value_counts()
sns.lineplot(data=nola_black, label="black")
plt.ylim(ymin=0)
nola_white = nola.loc[nola["subject_race"] == "white"]
nola_white = nola_white['year'].value_counts()
sns.lineplot(data=nola_white, label="white")
plt.ylim(ymin=0)
plt.xlabel("Year")
plt.ylabel("Count")
plt.title("Stops in New Orleans Over Time by Race")
plt.legend()
plt.show()
We then examined whether an analysis of time of day versus stop rates by race produced any interesting results.
plt.figure(figsize=(10,6))
with_time_sf = sfcrime.copy()
with_time_sf['subject_race'] = with_time_sf['subject_race'].replace(['asian/pacific islander', 'hispanic', 'black', 'other'], 'non-white')
with_time_sf['time'] = pd.to_datetime(with_time_sf['time'])
with_time_sf['time'] = with_time_sf['time'].dt.hour
sns.countplot(x='time', hue='subject_race', data=with_time_sf.sort_values(by='subject_race'))
plt.xticks(rotation=70)
plt.title('Count of Stops by Race Throughout Time of Day (SF)');
For the majority of the day in San Francisco, the ratio between the number of non-white subjects stopped and the number of white subjects stopped stays relatively constant with slightly more non-white subjects stopped. We then see that this gap widens between the hours of 10pm and midnight.
plt.figure(figsize=(10,6))
with_time_nola = nola_crime.copy()
with_time_nola['subject_race'] = with_time_nola['subject_race'].replace(['asian/pacific islander', 'hispanic', 'black', 'other', 'unknown'], 'non-white')
with_time_nola['time'] = pd.to_datetime(with_time_nola['time'])
with_time_nola['time'] = with_time_nola['time'].dt.hour
sns.countplot(x='time', hue='subject_race', data=with_time_nola.sort_values(by='subject_race'))
plt.xticks(rotation=70)
plt.title('Count of Stops by Race Throughout Time of Day (New Orleans)');
We see relatively similar ratios throughout the day in New Orleans.
plt.figure(figsize=(10,6))
with_time_pitt = pittcrime.copy()
with_time_pitt['subject_race'] = with_time_pitt['subject_race'].replace(['asian/pacific islander', 'hispanic', 'black', 'other', 'unknown'], 'non-white')
with_time_pitt['time'] = pd.to_datetime(with_time_pitt['time'])
with_time_pitt['time'] = with_time_pitt['time'].dt.hour
sns.countplot(x='time', hue='subject_race', data=with_time_pitt.sort_values(by='subject_race'))
plt.xticks(rotation=70)
plt.title('Count of Stops by Race Throughout Time of Day (Pittsburgh)');
In Pittsburgh, we see the most interesting results. Throughout the daytime hours, more white subjects are stopped. As it gets into the nighttime hours, however, a greater share of those stopped are non-white, and from midnight to 3am non-white subjects make up the majority of those stopped. In contrast to New Orleans and San Francisco, the number of people stopped at night is far lower than during the day.
From these graphs, we find that the relative stop rates for non-white versus white subjects change depending on the time of day in San Francisco and Pittsburgh. This difference in ratio by time of day could possibly indicate police bias and warrants further study to determine the cause.
Modeling
Our modeling and analysis are guided by appropriateness to the research question, clarity in explaining each model and the reasons for using it, and an explicit link between the modeling results and the conclusions we draw.
In particular, we seek to determine whether there is underlying discrimination (e.g., racial or gender-based) reflected in the frequency of stops in each of these cities. From the exploratory data analysis above, it is already clear that each of the selected cities (San Francisco, Pittsburgh, and New Orleans) has different demographics and distributions of stops.
Since our goal of identifying possible underlying discrimination in policing is a broad topic, here are some variables that we took into consideration:
- Severity of the Crime (i.e., whether the stop resulted in a warning, citation, or arrest, in order of increasing severity)
- Time of Day (i.e., when does the stop occur? Is there a distinction between stops occurring at night versus during the day?)
- Type of Stop (i.e., pedestrian, vehicular, etc.)
- Subject Age (i.e., the age of the person stopped)
To promote fair comparisons, the following considerations were taken into account during standardization/cleaning of the data:
- Based on the availability of data, only data from the years 2010-2015 will be used.
- Given the sheer volume of data, and since the total number of recorded stops varies across cities, a random sample of 200,000 stops will be taken from each city.
- “Unknown” or NaN values listed under subject race will be dropped from the datasets.
- Stop times will be converted to 24-hour time, with minutes expressed as fractions of the hour.
- One-hot encoding will be applied to the "outcome" column (three outcomes: arrest, citation, and warning) and the "type" column (two types: vehicular and pedestrian).
sfcrime = sfcrime[["date", "time", "subject_age", "subject_race", "subject_sex", "type", "outcome", "citation_issued", "warning_issued", 'arrest_made', "reason_for_stop"]]
# Drop rows with any missing values, then take a reproducible random sample of 200,000 stops
sf_crime = sfcrime.dropna()
sf_crime_resampled = sf_crime.sample(200000, random_state=30)  # random.seed() does not seed pandas' sampler
sf_crime_resampled[['citation_issued',"warning_issued", "arrest_made"]] = sf_crime_resampled[['citation_issued',"warning_issued", "arrest_made"]].astype(int)
sf_crime_resampled['vehicular'] = (sf_crime_resampled['type']=='vehicular').astype(int)
sf_crime_resampled['pedestrian'] = (sf_crime_resampled['type']=='pedestrian').astype(int)
sf_crime_resampled['time'] = pd.to_datetime(sf_crime_resampled['time'])
sf_crime_resampled['time'] = sf_crime_resampled['time'].dt.hour + sf_crime_resampled['time'].dt.minute/60
sf_crime_resampled.head()
date | time | subject_age | subject_race | subject_sex | type | outcome | citation_issued | warning_issued | arrest_made | reason_for_stop | vehicular | pedestrian | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
496292 | 2011-04-23 | 1.966667 | 30.0 | asian/pacific islander | male | vehicular | warning | 0 | 1 | 0 | Moving Violation | 1 | 0 |
697863 | 2013-09-05 | 1.500000 | 26.0 | white | male | vehicular | citation | 1 | 0 | 0 | Moving Violation | 1 | 0 |
543063 | 2011-10-04 | 23.833333 | 20.0 | asian/pacific islander | female | vehicular | warning | 0 | 1 | 0 | Moving Violation | 1 | 0 |
478772 | 2011-02-21 | 14.000000 | 26.0 | hispanic | male | vehicular | citation | 1 | 0 | 0 | Moving Violation | 1 | 0 |
183677 | 2008-05-29 | 12.216667 | 60.0 | white | male | vehicular | citation | 1 | 0 | 0 | Mechanical or Non-Moving Violation (V.C.) | 1 | 0 |
nola_crime = nola_crime[["date", "time", "subject_age", "subject_race", "subject_sex", "type", "outcome", "citation_issued", "warning_issued", 'arrest_made', "reason_for_stop"]]
# Drop rows with any missing values, then take a reproducible random sample of 200,000 stops
nola_crime = nola_crime.dropna()
nola_crime_resampled = nola_crime.sample(200000, random_state=30)  # random.seed() does not seed pandas' sampler
nola_crime_resampled[['citation_issued',"warning_issued", "arrest_made"]] = nola_crime_resampled[['citation_issued',"warning_issued", "arrest_made"]].astype(int)
nola_crime_resampled['vehicular'] = (nola_crime_resampled['type']=='vehicular').astype(int)
nola_crime_resampled['pedestrian'] = (nola_crime_resampled['type']=='pedestrian').astype(int)
nola_crime_resampled['time'] = pd.to_datetime(nola_crime_resampled['time'])
nola_crime_resampled['time'] = nola_crime_resampled['time'].dt.hour + nola_crime_resampled['time'].dt.minute/60
nola_crime_resampled.head()
date | time | subject_age | subject_race | subject_sex | type | outcome | citation_issued | warning_issued | arrest_made | reason_for_stop | vehicular | pedestrian | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
441600 | 2013-11-05 | 18.500000 | 23.0 | black | male | vehicular | arrest | 1 | 0 | 1 | TRAFFIC VIOLATION | 1 | 0 |
308926 | 2017-08-01 | 16.966667 | 22.0 | black | female | vehicular | citation | 1 | 0 | 0 | TRAFFIC VIOLATION | 1 | 0 |
327897 | 2017-08-14 | 1.733333 | 49.0 | black | female | vehicular | warning | 0 | 1 | 0 | TRAFFIC VIOLATION | 1 | 0 |
476436 | 2015-12-02 | 17.066667 | 27.0 | black | male | vehicular | warning | 0 | 1 | 0 | TRAFFIC VIOLATION | 1 | 0 |
464300 | 2017-11-22 | 23.900000 | 20.0 | black | male | vehicular | arrest | 0 | 0 | 1 | TRAFFIC VIOLATION | 1 | 0 |
pitt_crime = pittcrime[["date", "time", "subject_age", "subject_race", "subject_sex", "type", "outcome", "citation_issued", "warning_issued", 'arrest_made', "reason_for_stop"]]
# Keep only rows where both subject_race and subject_age are present, then inspect one of the remaining rows
pitt_crime = pitt_crime.dropna(subset=['subject_race', 'subject_age'])
pitt_crime = pitt_crime.sample()
pitt_crime
date | time | subject_age | subject_race | subject_sex | type | outcome | citation_issued | warning_issued | arrest_made | reason_for_stop | |
---|---|---|---|---|---|---|---|---|---|---|---|
47015 | 2018-02-20 | 20:59:00 | 46.0 | white | NaN | pedestrian | NaN | False | False | False | narcVice vehicleCodeViolation |
Upon further analysis of the Pittsburgh dataset, there were too many null values to use it. Since our chosen features include subject age and race, simply dropping the null values discards the majority of the dataset. The randomly sampled row above illustrates the problem: even a stop with both race and age recorded is still missing the outcome and other fields. If rows with null outcomes were also dropped, none of the stops in the Pittsburgh dataset would be viable for our analysis.
Therefore, our model and analysis will be limited to San Francisco and New Orleans.
Logistic Regression Classifier
Why logistic regression?
We incorporate logistic regression and naive Bayes into our analysis of the overall accuracy and fairness of police stops as they pertain to race. The inputs are both categorical and quantitative. We use logistic regression (with scikit-learn's default settings) as the baseline model. Since this analysis aims to perform classification, logistic regression is a natural fit: it models the probability that a binary event occurs. In our project, the model delineates a binary result (1 for white vs. 0 for non-white individuals); concretely, it estimates P(white | features) as a logistic function of a weighted sum of the features and classifies a stop as "white" when that probability exceeds 0.5. In essence, evidence of underlying racial disparity would show up as trends in the features that predict whether a police stop involved a white or non-white individual. A similar analysis could also be done based on gender. Logistic regression also lets us examine the relationship between the dependent variable and the independent (including nominal) variables.
To implement logistic regression, the subject_race column is transformed into a binary format: 1 for white vs. 0 for non-white individuals.
sf_crime_resampled['race_binary'] = (sf_crime_resampled['subject_race'] == 'white').astype(int)
sf_crime_resampled.head()
date | time | subject_age | subject_race | subject_sex | type | outcome | citation_issued | warning_issued | arrest_made | reason_for_stop | vehicular | pedestrian | race_binary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
496292 | 2011-04-23 | 1.966667 | 30.0 | asian/pacific islander | male | vehicular | warning | 0 | 1 | 0 | Moving Violation | 1 | 0 | 0 |
697863 | 2013-09-05 | 1.500000 | 26.0 | white | male | vehicular | citation | 1 | 0 | 0 | Moving Violation | 1 | 0 | 1 |
543063 | 2011-10-04 | 23.833333 | 20.0 | asian/pacific islander | female | vehicular | warning | 0 | 1 | 0 | Moving Violation | 1 | 0 | 0 |
478772 | 2011-02-21 | 14.000000 | 26.0 | hispanic | male | vehicular | citation | 1 | 0 | 0 | Moving Violation | 1 | 0 | 0 |
183677 | 2008-05-29 | 12.216667 | 60.0 | white | male | vehicular | citation | 1 | 0 | 0 | Mechanical or Non-Moving Violation (V.C.) | 1 | 0 | 1 |
nola_crime_resampled['race_binary'] = (nola_crime_resampled['subject_race'] == 'white').astype(int)
nola_crime_resampled.head()
date | time | subject_age | subject_race | subject_sex | type | outcome | citation_issued | warning_issued | arrest_made | reason_for_stop | vehicular | pedestrian | race_binary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
441600 | 2013-11-05 | 18.500000 | 23.0 | black | male | vehicular | arrest | 1 | 0 | 1 | TRAFFIC VIOLATION | 1 | 0 | 0 |
308926 | 2017-08-01 | 16.966667 | 22.0 | black | female | vehicular | citation | 1 | 0 | 0 | TRAFFIC VIOLATION | 1 | 0 | 0 |
327897 | 2017-08-14 | 1.733333 | 49.0 | black | female | vehicular | warning | 0 | 1 | 0 | TRAFFIC VIOLATION | 1 | 0 | 0 |
476436 | 2015-12-02 | 17.066667 | 27.0 | black | male | vehicular | warning | 0 | 1 | 0 | TRAFFIC VIOLATION | 1 | 0 | 0 |
464300 | 2017-11-22 | 23.900000 | 20.0 | black | male | vehicular | arrest | 0 | 0 | 1 | TRAFFIC VIOLATION | 1 | 0 | 0 |
To perform modeling and analysis on our dataset, the dataset will be split into a training, validation, and test set:
- Training set 70%
- Validation set 15%
- Test set 15%
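The modeling cells below use scikit-learn. The imports are not shown in this excerpt, so a minimal set is assumed here:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.metrics import confusion_matrix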
San Francisco
#SF features matrix
X_sf = sf_crime_resampled[['subject_age', 'time', 'citation_issued', "warning_issued", "arrest_made", 'vehicular', 'pedestrian']]
y_sf = sf_crime_resampled['race_binary']
# Train/Test Split
X_sf_train, X_sf_test, y_sf_train, y_sf_test = train_test_split(X_sf, y_sf, train_size = .70, test_size = .30)
# Train/Validation Split
X_sf_test, X_sf_validate, y_sf_test, y_sf_validate = train_test_split(X_sf_test, y_sf_test, train_size = .50, test_size = .50)
log_reg = LogisticRegression()
log_model = log_reg.fit(X_sf_train, y_sf_train)
log_pred = log_model.predict(X_sf_train)
log_reg_confusion_matrix = confusion_matrix(y_sf_train, log_pred)
log_model.score(X_sf_validate, y_sf_validate)
0.5872
New Orleans
#New Orleans features matrix
X_nola = nola_crime_resampled[['subject_age', 'time', 'citation_issued', "warning_issued", "arrest_made", 'vehicular', 'pedestrian']]
y_nola = nola_crime_resampled['race_binary']
# Train/Test Split
X_nola_train, X_nola_test, y_nola_train, y_nola_test = train_test_split(X_nola, y_nola, train_size = .70, test_size = .30)
# Train/Validation Split
X_nola_test, X_nola_validate, y_nola_test, y_nola_validate = train_test_split(X_nola_test, y_nola_test, train_size = .50, test_size = .50)
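The confusion-matrix plots below call a plot_confusion_matrix helper and a class_names list that are not defined in this excerpt. A minimal sketch is given here, assuming they mirror the classic scikit-learn documentation example:
import itertools
import numpy as np

class_names = ['non-white', 'white']  # 0 = non-white, 1 = white

def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion Matrix', cmap=plt.cm.Blues):
    # Optionally convert counts to row-wise proportions
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    # Annotate each cell with its count (or proportion)
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), horizontalalignment='center',
                 color='white' if cm[i, j] > thresh else 'black')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()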
log_reg = LogisticRegression()
log_model = log_reg.fit(X_nola_train, y_nola_train)
log_pred = log_model.predict(X_nola_train)
log_reg_confusion_matrix_nola = confusion_matrix(y_nola_train, log_pred)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(log_reg_confusion_matrix_nola, classes=class_names,
title='Confusion Matrix')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(log_reg_confusion_matrix_nola, classes=class_names, normalize=True,
title='Normalized Confusion Matrix')
log_model.score(X_nola_validate, y_nola_validate)
0.7420666666666667
Ridge Regression Classifier
Why Ridge Regression?
Ridge regression is least-squares regression with an L2 penalty. It is useful when the model suffers from multicollinearity or when the number of predictors is large relative to the number of observations (e.g., p > n), situations in which ordinary least squares becomes unreliable. It works by shrinking the coefficient estimates: the L2 ("squared") regularization adds a penalty equal to the squared magnitude of the coefficients, and a tuning parameter 𝜆 determines the strength of this penalty. The constraint on the estimators damps extreme variance and fluctuations; this sacrifices some training accuracy for a model that is likely to generalize better. In other words, ridge regression introduces enough bias to reduce variance, often yielding estimates closer to the true population values.
Here, we try to improve upon the logistic model. RidgeClassifier is a classifier variant of the Ridge regressor (provided by scikit-learn), and can be much faster than logistic regression. It is used below in place of the sklearn.linear_model.Ridge() model.
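As a sketch of how the penalty strength could be tuned (scikit-learn's alpha parameter plays the role of 𝜆; this tuning step is only an illustration and is not part of the baseline below):
# Pick the RidgeClassifier penalty strength on the validation set
best_alpha, best_score = None, -1
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    candidate = RidgeClassifier(alpha=alpha).fit(X_sf_train, y_sf_train)
    score = candidate.score(X_sf_validate, y_sf_validate)
    if score > best_score:
        best_alpha, best_score = alpha, score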
San Francisco
ridge = RidgeClassifier()
ridge_model = ridge.fit(X_sf_train, y_sf_train)
ridge_pred = ridge_model.predict(X_sf_train)
ridge_confusion_matrix_sf = confusion_matrix(y_sf_train, ridge_pred)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(ridge_confusion_matrix_sf, classes=class_names,
title='Confusion Matrix')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(ridge_confusion_matrix_sf, classes=class_names, normalize=True,
title='Normalized Confusion Matrix')
ridge_model.score(X_sf_validate, y_sf_validate)
0.5870333333333333
New Orleans
ridge_nola = RidgeClassifier()
ridge_nola_model = ridge_nola.fit(X_nola_train, y_nola_train)
ridge_nola_pred = ridge_nola_model.predict(X_nola_train)
ridge_confusion_matrix_nola = confusion_matrix(y_nola_train, ridge_nola_pred)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(ridge_confusion_matrix_nola, classes=class_names,
title='Confusion Matrix')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(ridge_confusion_matrix_nola, classes=class_names, normalize=True,
title='Normalized Confusion Matrix')
ridge_nola_model.score(X_nola_validate, y_nola_validate)
0.7420666666666667
One-Hot Encoding
What follows are our attempts at one-hot encoding reason_for_stop and using the resulting columns as additional features as a means of improving accuracy.
San Francisco
sf_crime_resampled_new = pd.concat([sf_crime_resampled,pd.get_dummies(sf_crime_resampled['reason_for_stop'], prefix='stopreason')],axis=1)
sf_crime_resampled_new
date | time | subject_age | subject_race | subject_sex | type | outcome | citation_issued | warning_issued | arrest_made | reason_for_stop | vehicular | pedestrian | race_binary | stopreason_Assistance to Motorist | stopreason_BOLO/APB/Warrant | stopreason_DUI Check | stopreason_MPC Violation | stopreason_MPC Violation|Moving Violation | stopreason_Mechanical or Non-Moving Violation (V.C.) | stopreason_Mechanical or Non-Moving Violation (V.C.)|DUI Check | stopreason_Mechanical or Non-Moving Violation (V.C.)|Moving Violation | stopreason_Moving Violation | stopreason_Moving Violation|Assistance to Motorist | stopreason_Moving Violation|DUI Check | stopreason_Moving Violation|Mechanical or Non-Moving Violation (V.C.) | stopreason_Moving Violation|NA | stopreason_Moving Violation|Traffic Collision | stopreason_Traffic Collision | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
496292 | 2011-04-23 | 1.966667 | 30.0 | asian/pacific islander | male | vehicular | warning | 0 | 1 | 0 | Moving Violation | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
697863 | 2013-09-05 | 1.500000 | 26.0 | white | male | vehicular | citation | 1 | 0 | 0 | Moving Violation | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
543063 | 2011-10-04 | 23.833333 | 20.0 | asian/pacific islander | female | vehicular | warning | 0 | 1 | 0 | Moving Violation | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
478772 | 2011-02-21 | 14.000000 | 26.0 | hispanic | male | vehicular | citation | 1 | 0 | 0 | Moving Violation | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
183677 | 2008-05-29 | 12.216667 | 60.0 | white | male | vehicular | citation | 1 | 0 | 0 | Mechanical or Non-Moving Violation (V.C.) | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
452716 | 2010-11-18 | 21.750000 | 44.0 | white | female | vehicular | citation | 1 | 0 | 0 | Moving Violation | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
119254 | 2007-11-04 | 21.866667 | 62.0 | white | male | vehicular | arrest | 0 | 0 | 1 | Moving Violation | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
356527 | 2009-12-21 | 11.833333 | 21.0 | other | male | vehicular | citation | 1 | 0 | 0 | Moving Violation | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
487440 | 2011-03-22 | 21.750000 | 23.0 | white | male | vehicular | citation | 1 | 0 | 0 | Mechanical or Non-Moving Violation (V.C.) | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
823151 | 2015-09-17 | 13.750000 | 40.0 | hispanic | male | vehicular | warning | 0 | 1 | 0 | Moving Violation | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
200000 rows × 29 columns
sf_crime_resampled_new.drop(['reason_for_stop'],axis=1, inplace=True)
# columns_we_want is not defined in this excerpt; reconstructed here as the numeric stop features plus the one-hot encoded stop-reason columns
columns_we_want = ['time', 'subject_age', 'citation_issued', 'warning_issued', 'arrest_made', 'vehicular', 'pedestrian'] + [c for c in sf_crime_resampled_new.columns if c.startswith('stopreason_')]
X_sf_new = sf_crime_resampled_new[columns_we_want]
y_sf_new = sf_crime_resampled_new['race_binary']
# Train/Test Split
X_sf_train, X_sf_test, y_sf_train, y_sf_test = train_test_split(X_sf_new, y_sf_new, train_size = .70, test_size = .30)
# Train/Validation Split
X_sf_test, X_sf_validate, y_sf_test, y_sf_validate = train_test_split(X_sf_test, y_sf_test, train_size = .50, test_size = .50)
log_reg = LogisticRegression()
log_model = log_reg.fit(X_sf_train, y_sf_train)
log_pred = log_model.predict(X_sf_train)
log_confusion_matrix_sf = confusion_matrix(y_sf_train, log_pred)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(log_confusion_matrix_sf, classes=class_names,
title='Confusion Matrix')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(log_confusion_matrix_sf, classes=class_names, normalize=True,
title='Normalized Confusion Matrix')
log_model.score(X_sf_validate, y_sf_validate)
0.5855
ridge = RidgeClassifier()
ridge_model = ridge.fit(X_sf_train, y_sf_train)
ridge_pred = ridge_model.predict(X_sf_train)
ridge_confusion_matrix_sf = confusion_matrix(y_sf_train, ridge_pred)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(ridge_confusion_matrix_sf, classes=class_names,
title='Confusion Matrix')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(ridge_confusion_matrix_sf, classes=class_names, normalize=True,
title='Normalized Confusion Matrix')
ridge_model.score(X_sf_validate, y_sf_validate)
0.5855
New Orleans
nola_crime_resampled_new = pd.concat([nola_crime_resampled,pd.get_dummies(nola_crime_resampled['reason_for_stop'], prefix='stopreason')],axis=1)
nola_crime_resampled_new
date | time | subject_age | subject_race | subject_sex | type | outcome | citation_issued | warning_issued | arrest_made | reason_for_stop | vehicular | pedestrian | race_binary | stopreason_SUSPECT PERSON | stopreason_SUSPECT VEHICLE | stopreason_TRAFFIC VIOLATION | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
441600 | 2013-11-05 | 18.500000 | 23.0 | black | male | vehicular | arrest | 1 | 0 | 1 | TRAFFIC VIOLATION | 1 | 0 | 0 | 0 | 0 | 1 |
308926 | 2017-08-01 | 16.966667 | 22.0 | black | female | vehicular | citation | 1 | 0 | 0 | TRAFFIC VIOLATION | 1 | 0 | 0 | 0 | 0 | 1 |
327897 | 2017-08-14 | 1.733333 | 49.0 | black | female | vehicular | warning | 0 | 1 | 0 | TRAFFIC VIOLATION | 1 | 0 | 0 | 0 | 0 | 1 |
476436 | 2015-12-02 | 17.066667 | 27.0 | black | male | vehicular | warning | 0 | 1 | 0 | TRAFFIC VIOLATION | 1 | 0 | 0 | 0 | 0 | 1 |
464300 | 2017-11-22 | 23.900000 | 20.0 | black | male | vehicular | arrest | 0 | 0 | 1 | TRAFFIC VIOLATION | 1 | 0 | 0 | 0 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
157539 | 2012-04-23 | 14.916667 | 69.0 | black | female | vehicular | warning | 0 | 1 | 0 | TRAFFIC VIOLATION | 1 | 0 | 0 | 0 | 0 | 1 |
26292 | 2018-01-18 | 12.633333 | 57.0 | black | male | vehicular | arrest | 1 | 0 | 1 | TRAFFIC VIOLATION | 1 | 0 | 0 | 0 | 0 | 1 |
332641 | 2011-08-18 | 16.816667 | 44.0 | white | male | vehicular | citation | 1 | 0 | 0 | TRAFFIC VIOLATION | 1 | 0 | 1 | 0 | 0 | 1 |
311926 | 2017-08-03 | 19.083333 | 61.0 | black | male | vehicular | warning | 0 | 1 | 0 | TRAFFIC VIOLATION | 1 | 0 | 0 | 0 | 0 | 1 |
329071 | 2015-08-15 | 18.233333 | 25.0 | black | male | vehicular | arrest | 1 | 0 | 1 | TRAFFIC VIOLATION | 1 | 0 | 0 | 0 | 0 | 1 |
200000 rows × 17 columns
nola_columns = list(nola_crime_resampled_new.columns)
nola_columns_desired = ['time',
'subject_age',
'citation_issued',
'warning_issued',
'arrest_made',
'vehicular',
'pedestrian',
'stopreason_SUSPECT PERSON',
'stopreason_SUSPECT VEHICLE',
'stopreason_TRAFFIC VIOLATION']
X_nola_new = nola_crime_resampled_new[nola_columns_desired]
y_nola_new = nola_crime_resampled_new['race_binary']
# Train/Test Split
X_nola_train, X_nola_test, y_nola_train, y_nola_test = train_test_split(X_nola_new, y_nola_new, train_size = .70, test_size = .30)
# Train/Validation Split
X_nola_test, X_nola_validate, y_nola_test, y_nola_validate = train_test_split(X_nola_test, y_nola_test, train_size = .50, test_size = .50)
log_reg = LogisticRegression()
log_model = log_reg.fit(X_nola_train, y_nola_train)
log_pred = log_model.predict(X_nola_train)
log_confusion_matrix_nola = confusion_matrix(y_nola_train, log_pred)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(log_confusion_matrix_nola, classes=class_names,
title='Confusion Matrix')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(log_confusion_matrix_nola, classes=class_names, normalize=True,
title='Normalized Confusion Matrix')
log_model.score(X_nola_validate, y_nola_validate)
0.7444
ridge_nola = RidgeClassifier()
ridge_nola_model = ridge_nola.fit(X_nola_train, y_nola_train)
ridge_nola_pred = ridge_nola_model.predict(X_nola_train)
confusion_matrix(y_nola_train, ridge_nola_pred)
Note: The RidgeClassifier did not change the confusion matrix for New Orleans.
ridge_nola_model.score(X_nola_validate, y_nola_validate)
0.7444333333333333
Modeling Conclusions
Numerical Results
In modeling the San Francisco dataset, the logistic regression and ridge classifier models (with and without the additional one-hot encoded features) both gave validation accuracies of roughly 58-59%.
In modeling the New Orleans dataset, the logistic and ridge classifier models scored validation accuracies of roughly 74%.
Evaluation Methods
Our two methods of evaluating results are confusion matrices and the model.score() function. In our confusion matrices, true negatives appear at (0,0), false negatives at (1,0), false positives at (0,1), and true positives at (1,1). True negatives are non-white subjects predicted to be non-white, false negatives are white subjects predicted to be non-white, false positives are non-white subjects predicted to be white, and true positives are white subjects predicted to be white. model.score() returns the mean accuracy on the given data and labels.
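For example, the four cells of a 2x2 confusion matrix can be unpacked directly; a short sketch using the New Orleans logistic regression confusion matrix computed above:
# Rows are true labels, columns are predicted labels (0 = non-white, 1 = white)
tn, fp, fn, tp = log_reg_confusion_matrix_nola.ravel()
accuracy = (tn + tp) / (tn + fp + fn + tp)  # the same quantity that model.score() reports on this data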
Interpretation of Results
We see higher accuracies for our New Orleans models because the models predict virtually all stops to be non-white. For the New Orleans dataset, this yields higher accuracy because the majority of the dataset (and of the city's population) is non-white. In San Francisco, on the other hand, a plurality of the population is white. Thus, our scores largely reflect the percentage of non-white subjects within each dataset. It is notable, however, that the percentage of non-white subjects within the dataset is higher than the overall percentage of people of color in New Orleans, suggesting that people of color are stopped at a higher rate than white people even after controlling for population share.
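A quick check consistent with this reading (not part of the original analysis) is to look at the distribution of predicted classes on a validation set:
# Share of validation-set predictions per class (0 = non-white, 1 = white)
pd.Series(log_model.predict(X_nola_validate)).value_counts(normalize=True)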
Reflecting on the Chosen Models
The accuracy of our predictive models for San Francisco is low: the model predicts correctly only a little more than half the time. While this may be a consequence of the chosen features or of the classifiers themselves, it should also be considered that the features may simply not correlate well with race. This would explain why adding additional features to the model did not increase accuracy by a significant amount; training on additional noise would, if anything, decrease accuracy. In other words, rather than there being a problem with the model, there may just be little to no underlying racial discrimination reflected in the features we chose to train the model on.
Potential Improvements
Oversampling: The datasets are imbalanced in the proportion of white to non-white subjects. Especially in the New Orleans dataset, there are considerably more non-white than white individuals in the sampled training set, which leads the models to predict subject_race as non-white almost exclusively. A potential solution is to oversample rows with white subjects in the training data so that the model becomes more likely to predict a white subject. Since one of the primary goals of model validation is to estimate how the model will perform on unseen data, it is critical to oversample only the training split (a sketch follows at the end of this section).
Bootstrapping: Given the imbalance in the datasets (especially New Orleans) between the white and non-white populations, another way to create more variability in the data would be to bootstrap (random sampling with replacement). In this way, the model could train on more data without requiring additional data collection.
Important null values: In standardizing the San Francisco and New Orleans datasets, all Null/NaN/None values were dropped. However, there may be underlying patterns in these dropped values. More exploratory data analysis could be performed to see why certain values fail to appear in the given datasets.
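A minimal sketch of the oversampling idea mentioned above, applied only to the New Orleans training split defined earlier:
# Oversample the minority class (white, race_binary == 1) in the training data only
train = X_nola_train.copy()
train['race_binary'] = y_nola_train
majority = train[train['race_binary'] == 0]
minority = train[train['race_binary'] == 1]
# Sample the minority class with replacement until the classes are balanced, then shuffle
minority_upsampled = minority.sample(len(majority), replace=True, random_state=30)
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=30)
X_balanced = balanced.drop(columns='race_binary')
y_balanced = balanced['race_binary']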