
Analyzing Regional Trends in Police Stop Data Across the United States

Names: Rishabh Parekh, Emmy Yu, Manu Prakasam

About the Data: Ethical Data Collection

Publicly-available Standardized Stop Data from: https://openpolicing.stanford.edu/data/

The Stanford Open Policing Project data are made available under the Open Data Commons Attribution License, which allows the following uses:

To Share: To copy, distribute and use the database.
To Create: To produce works from the database.
To Adapt: To modify, transform and build upon the database.

The Stanford Open Policing Project data come from the working paper:
E. Pierson, C. Simoiu, J. Overgoor, S. Corbett-Davies, D. Jenson, A. Shoemaker, V. Ramachandran, P. Barghouty, C. Phillips, R. Shroff, and S. Goel. (2019) “A large-scale analysis of racial disparities in police stops across the United States”.

The accompanying README file describes how the data were standardized and cleaned for privacy and security reasons. Descriptions of the column meanings are also given.

Downloading data from The Stanford Open Policing Project:

Specifically, we will be looking at the datasets from San Francisco, New Orleans, and Pittsburgh.

import pandas as pd

sfcrime = pd.read_csv('ca_san_francisco_2019_12_17.csv')
sfcrime.head()
[sfcrime.head() output: first five rows, all vehicular stops from 2014-08-01, with columns raw_row_number, date, time, location, lat, lng, district, subject_age, subject_race, subject_sex, type, arrest_made, citation_issued, warning_issued, outcome, contraband_found, search_conducted, search_vehicle, search_basis, reason_for_stop, raw_search_vehicle_description, raw_result_of_contact_description]
nola_crime = pd.read_csv('la_new_orleans_2019_12_17.csv')
nola_crime.head()
[nola_crime.head() output: first five rows, stops from 2010-01-01, with columns raw_row_number, date, time, location, lat, lng, district, zone, subject_age, subject_race, subject_sex, officer_assignment, type, arrest_made, citation_issued, warning_issued, outcome, contraband/search/frisk flags, search_basis, reason_for_stop, vehicle_color, vehicle_make, vehicle_model, vehicle_year, raw_actions_taken, raw_subject_race]
pittcrime = pd.read_csv('pa_pittsburgh_2019_12_17.csv')
pittcrime.head()
[pittcrime.head() output: first five rows, pedestrian stops from 2008-01-01, with columns raw_row_number, date, time, location, lat, lng, neighborhood, subject_age, subject_race, subject_sex, officer_id_hash, officer_age, officer_race, officer_sex, type, violation, arrest_made, citation_issued, warning_issued, outcome, contraband_found, frisk_performed, search_conducted, reason_for_stop, and several raw_* fields]

Data Applicability:

Our research question assesses the relationship between race/socioeconomic background and the number of interactions with the police. With the given data, we have access to the race of the individual and the location and time of each stop a police officer makes. Location can give us a general idea of an individual’s socioeconomic background based on the neighborhood they are in.

Furthermore, using the race of the individual, we can find the total number of stops per race. We will also compare each race’s share of total stops with that race’s share of the area’s population. This can show us whether minorities have a disproportionate rate of interactions with law enforcement.

Using the Stanford Open Policing data, we will also compare the three distinctly different cities: San Francisco, Pittsburgh, and New Orleans. This will allow us to see if policing is similar or different across the nation, and will further give us a general idea of the relationship between minorities and law enforcement.

Standardizing the Data:

Certain columns that are common across all cities (and lend themselves to interesting analyses) were selected and merged to create one DataFrame for easier analysis and plotting.

Each row in the cleaned data represents a stop, and coverage varies by location. In the original Open Policing release, fields marked with an asterisk in the README were removed before publication due to privacy concerns. All columns except raw_row_number, violation, disposition, location, officer_assignment, any city or state subgeography (i.e. county, beat, division, etc.), unit, and vehicle_{color, make, model, type} are also digit-sanitized (each digit replaced with “-”) for privacy reasons. These columns were removed from the original dataset for the purpose of our analysis.

Exploratory Data Analysis

sf_crime_to_merge = sfcrime[["date", "time", "subject_age", "subject_race", "subject_sex", "type", "outcome", "citation_issued", "reason_for_stop"]].copy()
sf_crime_to_merge["city"] = "San Francisco"
pitt_crime_to_merge = pittcrime[["date", "time", "subject_age", "subject_race", "subject_sex", "type", "outcome", "citation_issued", "reason_for_stop"]].copy()
pitt_crime_to_merge["city"] = "Pittsburgh"
nola_crime_to_merge = nola_crime[["date", "time", "subject_age", "subject_race", "subject_sex", "type", "outcome", "citation_issued", "reason_for_stop"]].copy()
nola_crime_to_merge["city"] = "New Orleans"
#Exploratory Data Analysis DataFrame (DataFrame.append is deprecated; pd.concat does the same stacking)
eda_merged = pd.concat([sf_crime_to_merge, pitt_crime_to_merge, nola_crime_to_merge])
eda_merged
[eda_merged output: columns date, time, subject_age, subject_race, subject_sex, type, outcome, citation_issued, reason_for_stop, city; the first rows shown are San Francisco stops from 2014 and the last rows are New Orleans stops from 2017]

1691720 rows × 10 columns

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,6))
ax = sns.countplot(x="city", hue="subject_race", data=eda_merged)
plt.title('Frequency of Stops by Race')
plt.legend(loc = 'upper right')
plt.show()

png

From the figure above, it is clear that the demographic makeup differs significantly across our three target locations: San Francisco, Pittsburgh, and New Orleans. It is important to note that the frequency of stops by race correlates to some degree with the racial demographics of these cities. For example, there is a large spike in stops of Black subjects in New Orleans compared to both San Francisco and Pittsburgh; data from the Demographic Statistical Atlas show that New Orleans’s population is 58.9% Black, compared to 5.4% in San Francisco and 24.3% in Pittsburgh.
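
As a rough check on this point, each city's share of stops involving a Black subject can be compared with the Black share of its population. The sketch below hard-codes the Demographic Statistical Atlas figures quoted above purely for illustration; it is not part of the original notebook.

# Share of each city's stops involving a Black subject, divided by the Black share
# of the city's population (figures quoted above). Values above 1 indicate that
# Black subjects are over-represented in stops relative to their population share.
black_pop_share = pd.Series({'New Orleans': 0.589, 'San Francisco': 0.054, 'Pittsburgh': 0.243})
black_stop_share = eda_merged.groupby('city')['subject_race'].apply(lambda s: (s == 'black').mean())
print(black_stop_share / black_pop_share)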

#The dataset sizes differ by location; counting the non-null citation_issued entries per city shows that San Francisco has the most stop records.
eda_merged.groupby('city').count()[['citation_issued']]
               citation_issued
city
New Orleans             512092
Pittsburgh              274555
San Francisco           905070
eda_time = eda_merged.copy()
eda_time['date'] = pd.to_datetime(eda_time['date'])
eda_time['year'] = eda_time['date'].dt.year
eda_grouped = eda_time.groupby(['city','year']).count()[['citation_issued']].reset_index()

plt.figure(figsize=(15,8))
sns.countplot(x='city', hue='year', data=eda_time)
plt.title('Frequency of Stops by Year Across Locations')
plt.legend(loc = 'best')
plt.show()

png

From the bar chart above, it can be seen that the larger dataset sizes are not entirely due to a higher number of stops made by police; the cities’ data also cover different spans of years. As seen in the DataFrame above, San Francisco has the largest number of stop records. The SF data cover 2007-2016, with 2016 being a partial year. The Pittsburgh data cover 2008-2018, with partial data in both the first and last years. The New Orleans data cover 2010-2018 (2018 is partial, as data collection ended before the end of the year). Therefore, for fair comparisons and aggregations across locations, the data will be limited to the years 2010-2015.

#taking only the stops from 2010-2015
inclusive_years = [2010, 2011, 2012, 2013, 2014, 2015]
eda_standard = eda_time[eda_time['year'].isin(inclusive_years)]
eda_standard.head()
[eda_standard.head() output: same columns as eda_merged plus year; the first five rows are San Francisco stops from 2014-08-01]
plt.figure(figsize=(12,6))
ax = sns.countplot(x="city", hue="subject_sex", data=eda_standard)
plt.title('Frequency of Stops by Gender')
plt.legend(loc = 'upper right')
plt.show()

png

In this figure showing the frequency of stops by gender across the three cities, it is clear that although the SF dataset contains the largest number of stops, the split of stops by gender is fairly consistent across cities. Stops of males constitute more than half of the data in each location.
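
To back the statement above with exact numbers, a quick aggregation over the eda_standard frame from the previous cell (a sketch, not shown in the original) gives the male share of stops per city:

# Fraction of stops in each city where the stopped subject is male
male_share = (eda_standard.assign(is_male=eda_standard['subject_sex'] == 'male')
              .groupby('city')['is_male'].mean())
print(male_share)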

eda_merged['year'] = pd.DatetimeIndex(eda_merged["date"]).year
plt.figure(figsize=(12,6))
ax = sns.countplot(x="subject_race", hue="outcome", data=eda_merged.sort_values(by=['outcome', 'subject_race']))
plt.title('Type of Police Stop by Race: Total')
plt.legend(loc = 'upper left')
plt.show()

png

plt.figure(figsize=(12,6))
sf = eda_merged.loc[eda_merged["city"] == "San Francisco"]
ax = sns.countplot(x="subject_race", hue="outcome", data=sf.sort_values(by=['outcome', 'subject_race']))
plt.title('Type of Police Stop by Race: San Francisco')
plt.legend(loc = 'upper left')
plt.show()

png

Note: The San Francisco dataset does not have an “unknown” category option listed for the subject’s race.

According to the 2018 US Census Bureau, San Francisco County’s population was 40% Non-Hispanic White, 5.4% Hispanic White, 5.2% Black or African American, 34.3% Asian, 8.1% Some Other Race, 0.3% Native American and Alaskan Native, 0.2% Pacific Islander and 6.5% from two or more races. Against these figures, it is interesting to note the relatively low count of AAPI subjects stopped given their share of the population, and the relatively high counts of Black and Hispanic subjects stopped given theirs.

plt.figure(figsize=(12,6))
pitt = eda_merged.loc[eda_merged["city"] == "Pittsburgh"]
ax = sns.countplot(x="subject_race", hue="outcome", data=pitt.sort_values(by=['outcome', 'subject_race']))
plt.title('Type of Police Stop by Race: Pittsburgh')
plt.legend(loc = 'upper left')
plt.show()

png

In Pittsburgh, the two races with the most significant shares of stops are black and white. We see relatively similar trends for these two groups, with one small difference: black subjects receive more warnings than citations, while white subjects receive more citations than warnings. This could indicate a number of things, including that more black subjects are stopped unnecessarily or for smaller infractions. Additionally, given that Pittsburgh’s population is roughly 65% white and 24% black (from the Demographic Statistical Atlas referenced earlier), we see a higher relative count of stops of black subjects than the city’s demographics would predict under an even distribution of stops across race.

plt.figure(figsize=(12,6))
nola = eda_merged.loc[eda_merged["city"] == "New Orleans"]
ax = sns.countplot(x="subject_race", hue="outcome", data=nola.sort_values(by=['outcome', 'subject_race']))
plt.title('Type of Police Stop by Race: New Orleans')
plt.legend(loc = 'upper right')
plt.show()

png

An interesting difference between New Orleans and our other cities is its very high rate of arrests relative to citations and warnings. San Francisco has the lowest relative rate of arrests, followed by Pittsburgh, with New Orleans having an arrest rate nearly on par with its citation and warning rates.
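
One way to make this comparison concrete (a sketch layered on eda_merged, not part of the original notebook) is to look at each city's outcome mix as proportions:

# Proportion of stops in each city ending in a warning, citation, or arrest
# (stops with a missing outcome are excluded by value_counts)
outcome_mix = (eda_merged.groupby('city')['outcome']
               .value_counts(normalize=True)
               .unstack()
               .reindex(columns=['warning', 'citation', 'arrest']))
print(outcome_mix)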

We then took a look to see if there were trends year over year.

plt.figure(figsize=(10,6))
sf_years = sf['year'].value_counts()
sns.lineplot(data=sf_years, label='SF')
pitt_years = pitt['year'].value_counts()
sns.lineplot(data=pitt_years, label='Pittsburgh')
nola_years = nola['year'].value_counts()
sns.lineplot(data=nola_years, label='New Orleans')
plt.ylim(ymin=0)
plt.xlabel("Year")
plt.ylabel("Count")
plt.title("Stops Over Time By Cities")
plt.legend()
plt.show()

png

We see some interesting trends in stop counts by year that vary by city. New Orleans fluctuates but trends downward, San Francisco decreases over time for the most part, and Pittsburgh trends upward and then downward, producing an arc-like curve. We then took a deeper look at New Orleans to see whether the time trends differed by race, but did not find any immediately striking results.

plt.figure(figsize=(10,6))
nola_black = nola.loc[nola["subject_race"] == "black"]
nola_black = nola_black['year'].value_counts()
sns.lineplot(data=nola_black, label="black")
plt.ylim(ymin=0)
nola_white = nola.loc[nola["subject_race"] == "white"]
nola_white = nola_white['year'].value_counts()
sns.lineplot(data=nola_white, label="white")
plt.ylim(ymin=0)
plt.xlabel("Year")
plt.ylabel("Count")
plt.title("Stops in New Orleans Over Time by Race")
plt.legend()
plt.show()

png

We then examined whether an analysis of time of day versus stop rates by race produced any interesting results.

plt.figure(figsize=(10,6))
with_time_sf = sfcrime.copy()
with_time_sf['subject_race'] = with_time_sf['subject_race'].replace(['asian/pacific islander', 'hispanic', 'black', 'other'], 'non-white')
with_time_sf['time'] = pd.to_datetime(with_time_sf['time'])
with_time_sf['time'] = with_time_sf['time'].dt.hour
sns.countplot(x='time', hue='subject_race', data=with_time_sf.sort_values(by='subject_race'))
plt.xticks(rotation=70)
plt.title('Count of Stops by Race Throughout Time of Day (SF)');

png

For the majority of the day in San Francisco, the ratio between the number of non-white subjects stopped and the number of white subjects stopped stays relatively constant with slightly more non-white subjects stopped. We then see that this gap widens between the hours of 10pm and midnight.

plt.figure(figsize=(10,6))
with_time_nola = nola_crime.copy()
with_time_nola['subject_race'] = with_time_nola['subject_race'].replace(['asian/pacific islander', 'hispanic', 'black', 'other', 'unknown'], 'non-white')
with_time_nola['time'] = pd.to_datetime(with_time_nola['time'])
with_time_nola['time'] = with_time_nola['time'].dt.hour
sns.countplot(x='time', hue='subject_race', data=with_time_nola.sort_values(by='subject_race'))
plt.xticks(rotation=70)
plt.title('Count of Stops by Race Throughout Time of Day (New Orleans)');

png

We see relatively similar ratios throughout the day in New Orleans.

plt.figure(figsize=(10,6))
with_time_pitt = pittcrime.copy()
with_time_pitt['subject_race'] = with_time_pitt['subject_race'].replace(['asian/pacific islander', 'hispanic', 'black', 'other', 'unknown'], 'non-white')
with_time_pitt['time'] = pd.to_datetime(with_time_pitt['time'])
with_time_pitt['time'] = with_time_pitt['time'].dt.hour
sns.countplot(x='time', hue='subject_race', data=with_time_pitt.sort_values(by='subject_race'))
plt.xticks(rotation=70)
plt.title('Count of Stops by Race Throughout Time of Day (Pittsburgh)');

png

In Pittsburgh, we see the most interesting results. Throughout the daytime hours, more white subjects are stopped. However, into the nighttime a greater share of those stopped are non-white, and from midnight to 3am non-white subjects make up the majority of those stopped. In contrast to New Orleans and San Francisco, however, the number of people stopped in Pittsburgh is far lower at night than during the day.

From these graphs, we find that the relative rates of stops for non-white versus white subjects change depending on the time of day in San Francisco and Pittsburgh. This time-of-day dependence could possibly indicate police bias and warrants further study to determine the cause.
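
To quantify the time-of-day pattern described above, the hourly ratio of non-white to white stops can be computed directly; the sketch below reuses the with_time_sf frame built earlier for San Francisco and is illustrative rather than part of the original notebook.

# Stops per hour by the collapsed race label, then the non-white / white ratio
hourly_sf = with_time_sf.groupby(['time', 'subject_race']).size().unstack(fill_value=0)
print((hourly_sf['non-white'] / hourly_sf['white']).round(2))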

Modeling

Modeling and analysis will be evaluated based on appropriateness to question, clarity in explanation of model and the reasons for using it, and the exposition of the relationship between the team’s modeling efforts and the conclusions the team draws.

In particular, we seek to determine whether there is underlying discrimination (e.g., racial or gender-based) in the frequency of stops in each of these cities. From the exploratory data analysis performed above, it is already clear that each of the selected cities (San Francisco, Pittsburgh, and New Orleans) has different demographics and a different distribution of stops.

Since identifying possible underlying discrimination in policing is a broad goal, here are some variables that we took into consideration:

  • Severity of the Crime (i.e., whether the stop resulted in a warning, citation, or arrest, in order of increasing severity)
  • Time of Day (i.e., when does the stop occur? Is there a distinction between stops occurring at night versus during the day?)
  • Type of Stop (i.e., pedestrian, vehicular, etc.)
  • Subject Age (i.e., the age of the person stopped)

To promote fair comparisons, the following considerations were taken into account during standardization/cleaning of the data:

  • Based on the availability of data, only data from the years 2010-2015 will be used.
  • Given the sheer volume of data, and since the total number of recorded stops varies across cities, a random sample of 200,000 stops will be taken from each city.
  • “Unknown” or NaN values listed under subject race will be dropped from the datasets.
  • Times will be converted to a 24-hour decimal format, where minutes become fractions of the hour.
  • One-hot encoding will be applied to the “outcome” column for its 3 outcomes (arrest, citation, warning) and to the “type” column for its 2 types (vehicular, pedestrian).
sfcrime = sfcrime[["date", "time", "subject_age", "subject_race", "subject_sex", "type", "outcome", "citation_issued", "warning_issued", 'arrest_made', "reason_for_stop"]]
sf_crime = sfcrime.dropna()
# Sample 200,000 stops (random_state makes the sample reproducible)
sf_crime_resampled = sf_crime.sample(200000, random_state=30)
sf_crime_resampled[['citation_issued',"warning_issued", "arrest_made"]] = sf_crime_resampled[['citation_issued',"warning_issued", "arrest_made"]].astype(int)
sf_crime_resampled['vehicular'] = (sf_crime_resampled['type']=='vehicular').astype(int)
sf_crime_resampled['pedestrian'] = (sf_crime_resampled['type']=='pedestrian').astype(int)

# Convert "HH:MM:SS" strings to decimal hours (e.g. 14:30 becomes 14.5)
sf_crime_resampled['time'] = pd.to_datetime(sf_crime_resampled['time'])
sf_crime_resampled['time'] = sf_crime_resampled['time'].dt.hour + sf_crime_resampled['time'].dt.minute/60
sf_crime_resampled.head()
[sf_crime_resampled.head() output: columns date, time (now a decimal hour), subject_age, subject_race, subject_sex, type, outcome, citation_issued, warning_issued, arrest_made, reason_for_stop, vehicular, pedestrian]
nola_crime = nola_crime[["date", "time", "subject_age", "subject_race", "subject_sex", "type", "outcome", "citation_issued", "warning_issued", 'arrest_made', "reason_for_stop"]]
nola_crime = nola_crime.dropna()
# Sample 200,000 stops (random_state makes the sample reproducible)
nola_crime_resampled = nola_crime.sample(200000, random_state=30)
nola_crime_resampled[['citation_issued',"warning_issued", "arrest_made"]] = nola_crime_resampled[['citation_issued',"warning_issued", "arrest_made"]].astype(int)
nola_crime_resampled['vehicular'] = (nola_crime_resampled['type']=='vehicular').astype(int)
nola_crime_resampled['pedestrian'] = (nola_crime_resampled['type']=='pedestrian').astype(int)

# Convert "HH:MM:SS" strings to decimal hours
nola_crime_resampled['time'] = pd.to_datetime(nola_crime_resampled['time'])
nola_crime_resampled['time'] = nola_crime_resampled['time'].dt.hour + nola_crime_resampled['time'].dt.minute/60
nola_crime_resampled.head()
[nola_crime_resampled.head() output: same columns as the San Francisco sample above; the displayed rows are all vehicular stops of black subjects with reason_for_stop TRAFFIC VIOLATION]
pitt_crime = pittcrime[["date", "time", "subject_age", "subject_race", "subject_sex", "type", "outcome", "citation_issued", "warning_issued", 'arrest_made', "reason_for_stop"]]
# Keep only the rows where both subject_race and subject_age are present
pitt_crime = pitt_crime.dropna(subset=['subject_race', 'subject_age'])
pitt_crime = pitt_crime.sample()  # draw a single random row to inspect the remaining fields
pitt_crime
[pitt_crime sample output, row 47015: date 2018-02-20, time 20:59:00, subject_age 46.0, subject_race white, subject_sex NaN, type pedestrian, outcome NaN, citation_issued False, warning_issued False, arrest_made False, reason_for_stop narcVice vehicleCodeViolation]

Upon further analysis of the Pittsburgh dataset, there were too many null values to use it. Since our chosen features include subject age and race, simply dropping null values discards a majority of the dataset. Even the sampled row above, which does record both race and age, is missing the subject’s sex and the stop’s outcome; once rows with null outcomes are also dropped, essentially none of the Pittsburgh stops remain viable for our analysis.

Therefore, our model and analysis will be limited to San Francisco and New Orleans.

Logistic Regression Classifier

Why logistic regression?

We incorporated logistic regression (naive Bayes was also considered) into our analysis of the overall accuracy and fairness of police stops as they pertain to race. The inputs are both categorical and quantitative, and we chose an unregularized logistic regression as the baseline model. Since this analysis is a classification task, logistic regression is a natural fit: it models the probability that a binary event occurs. In our project, the model predicts a binary label (1 for white vs. 0 for non-white individuals); underlying racial disparities would therefore show up as feature patterns that predict whether the stopped individual is white or non-white. A similar analysis could also be done based on gender. Logistic regression additionally lets us examine the relationship between the dependent variable and the independent (including nominal) variables.
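
One detail worth flagging: scikit-learn's LogisticRegression applies L2 regularization by default, so the cells below are not literally unregularized. If a strictly no-regularization baseline is desired, the penalty can be disabled explicitly; the exact spelling depends on the scikit-learn version, so treat the following as a sketch.

from sklearn.linear_model import LogisticRegression

# In scikit-learn 0.21-1.1 the penalty is disabled with the string 'none';
# newer releases spell it penalty=None instead.
log_reg_unregularized = LogisticRegression(penalty='none', solver='lbfgs', max_iter=1000)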

To implement logistic regression, the subject_race column is transformed into a binary format: 1 for white vs. 0 for non-white individuals.

sf_crime_resampled['race_binary'] = (sf_crime_resampled['subject_race'] == 'white').astype(int)
sf_crime_resampled.head()
[sf_crime_resampled.head() output: previous columns plus race_binary (1 for white, 0 for non-white)]
nola_crime_resampled['race_binary'] = (nola_crime_resampled['subject_race'] == 'white').astype(int)
nola_crime_resampled.head()
[nola_crime_resampled.head() output: previous columns plus race_binary (1 for white, 0 for non-white)]

To perform modeling and analysis on our dataset, the dataset will be split into a training, validation, and test set:

  • Training set 70%
  • Validation set 15%
  • Test set 15%

San Francisco

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.metrics import confusion_matrix

#SF features matrix
X_sf = sf_crime_resampled[['subject_age', 'time', 'citation_issued', "warning_issued", "arrest_made", 'vehicular', 'pedestrian']]
y_sf = sf_crime_resampled['race_binary']
# Train/Test Split
X_sf_train, X_sf_test, y_sf_train, y_sf_test = train_test_split(X_sf, y_sf, train_size = .70, test_size = .30)

# Train/Validation Split
X_sf_test, X_sf_validate, y_sf_test, y_sf_validate = train_test_split(X_sf_test, y_sf_test, train_size = .50, test_size = .50)
log_reg = LogisticRegression()
log_model = log_reg.fit(X_sf_train, y_sf_train)
log_pred = log_model.predict(X_sf_train)
log_reg_confusion_matrix = confusion_matrix(y_sf_train, log_pred)
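
The confusion-matrix figures below (and in the New Orleans cells) rely on a plot_confusion_matrix helper and a class_names list that are not shown in this report. A helper along the lines of the classic scikit-learn documentation example would look roughly like the sketch below; it is a reconstruction of what the notebook presumably defines, not necessarily the exact code.

import itertools
import numpy as np
import matplotlib.pyplot as plt

# race_binary encodes 0 = non-white, 1 = white
class_names = ['non-white', 'white']

def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    """Draw a confusion matrix; optionally normalize each row to proportions."""
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment='center',
                 color='white' if cm[i, j] > thresh else 'black')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

# Plot the San Francisco confusion matrices (mirroring the New Orleans cell below)
plt.figure()
plot_confusion_matrix(log_reg_confusion_matrix, classes=class_names, title='Confusion Matrix')
plt.figure()
plot_confusion_matrix(log_reg_confusion_matrix, classes=class_names, normalize=True,
                      title='Normalized Confusion Matrix')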

png

png

log_model.score(X_sf_validate, y_sf_validate)
0.5872

New Orleans

#New Orleans features matrix
X_nola = nola_crime_resampled[['subject_age', 'time', 'citation_issued', "warning_issued", "arrest_made", 'vehicular', 'pedestrian']]
y_nola = nola_crime_resampled['race_binary']
# Train/Test Split
X_nola_train, X_nola_test, y_nola_train, y_nola_test = train_test_split(X_nola, y_nola, train_size = .70, test_size = .30)

# Train/Validation Split
X_nola_test, X_nola_validate, y_nola_test, y_nola_validate = train_test_split(X_nola_test, y_nola_test, train_size = .50, test_size = .50)
log_reg = LogisticRegression()
log_model = log_reg.fit(X_nola_train, y_nola_train)
log_pred = log_model.predict(X_nola_train)
log_reg_confusion_matrix_nola = confusion_matrix(y_nola_train, log_pred)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(log_reg_confusion_matrix_nola, classes=class_names,
                      title='Confusion Matrix')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(log_reg_confusion_matrix_nola, classes=class_names, normalize=True,
                      title='Normalized Confusion Matrix')

png

png

log_model.score(X_nola_validate, y_nola_validate)
0.7420666666666667

Ridge Regression Classifier

Why Ridge Regression?

Ridge regression is a form of least-squares regression with an L2 (“squared”) penalty, useful when the number of predictors is large relative to the number of observations (e.g., p > n) or when the predictors are multicollinear, both situations where ordinary linear least squares struggles. It works through a shrinkage estimator: a penalty equal to the squared magnitude of the coefficients, scaled by a tuning parameter 𝜆, pulls the coefficient estimates toward zero. Constraining the estimators in this way damps extreme variance and fluctuations; it sacrifices some training accuracy for a model that is likely to generalize better. In other words, ridge regression introduces a small amount of bias in exchange for lower variance, which tends to bring estimates closer to the true population values.

Here, we try to improve upon the logistic model. RidgeClassifier is a classifier variant of the Ridge regressor (provided by scikit-learn), and can be much faster than logistic regression. It is used below in place of the sklearn.linear_model.Ridge() model.
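
The tuning parameter 𝜆 described above corresponds to RidgeClassifier's alpha argument (default 1.0). The cells below keep that default; a simple way to tune it, sketched here as a hypothetical addition rather than something the notebook does, would be to sweep a few values and keep whichever scores best on the validation set.

# Sweep candidate penalty strengths and keep the best validation accuracy (SF data)
best_alpha, best_score = None, -1.0
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    candidate = RidgeClassifier(alpha=alpha).fit(X_sf_train, y_sf_train)
    score = candidate.score(X_sf_validate, y_sf_validate)
    if score > best_score:
        best_alpha, best_score = alpha, score
print(best_alpha, best_score)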

San Francisco

ridge = RidgeClassifier()
ridge_model = ridge.fit(X_sf_train, y_sf_train)
ridge_pred = ridge_model.predict(X_sf_train)
ridge_confusion_matrix_sf = confusion_matrix(y_sf_train, ridge_pred)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(ridge_confusion_matrix_sf, classes=class_names,
                      title='Confusion Matrix')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(ridge_confusion_matrix_sf, classes=class_names, normalize=True,
                      title='Normalized Confusion Matrix')

png

png

ridge_model.score(X_sf_validate, y_sf_validate)
0.5870333333333333

New Orleans

ridge_nola = RidgeClassifier()
ridge_nola_model = ridge_nola.fit(X_nola_train, y_nola_train)
ridge_nola_pred = ridge_nola_model.predict(X_nola_train)
ridge_confusion_matrix_nola = confusion_matrix(y_nola_train, ridge_nola_pred)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(ridge_confusion_matrix_nola, classes=class_names,
                      title='Confusion Matrix')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(ridge_confusion_matrix_nola, classes=class_names, normalize=True,
                      title='Normalized Confusion Matrix')

png

png

ridge_nola_model.score(X_nola_validate, y_nola_validate)
0.7420666666666667

One-Hot Encoding

What follows are our attempts at one-hot encoding reason_for_stop and using the resulting columns as additional features in an effort to improve accuracy.

San Francisco

sf_crime_resampled_new = pd.concat([sf_crime_resampled,pd.get_dummies(sf_crime_resampled['reason_for_stop'], prefix='stopreason')],axis=1)
sf_crime_resampled_new
[sf_crime_resampled_new output: the previous columns plus 16 one-hot stopreason_* columns (e.g. stopreason_Moving Violation, stopreason_Mechanical or Non-Moving Violation (V.C.), stopreason_DUI Check, stopreason_Traffic Collision)]

200000 rows × 29 columns

sf_crime_resampled_new.drop(['reason_for_stop'],axis=1, inplace=True)
# Feature list: the original numeric features plus the new one-hot stop-reason columns
columns_we_want = X_sf.columns.tolist() + [c for c in sf_crime_resampled_new.columns if c.startswith('stopreason_')]
X_sf_new = sf_crime_resampled_new[columns_we_want]
y_sf_new = sf_crime_resampled_new['race_binary']
# Train/Test Split
X_sf_train, X_sf_test, y_sf_train, y_sf_test = train_test_split(X_sf_new, y_sf_new, train_size = .70, test_size = .30)

# Train/Validation Split
X_sf_test, X_sf_validate, y_sf_test, y_sf_validate = train_test_split(X_sf_test, y_sf_test, train_size = .50, test_size = .50)
log_reg = LogisticRegression()
log_model = log_reg.fit(X_sf_train, y_sf_train)
log_pred = log_model.predict(X_sf_train)
log_confusion_matrix_sf = confusion_matrix(y_sf_train, log_pred)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(log_confusion_matrix_sf, classes=class_names,
                      title='Confusion Matrix')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(log_confusion_matrix_sf, classes=class_names, normalize=True,
                      title='Normalized Confusion Matrix')

png

png

log_model.score(X_sf_validate, y_sf_validate)
0.5855
ridge = RidgeClassifier()
ridge_model = ridge.fit(X_sf_train, y_sf_train)
ridge_pred = ridge_model.predict(X_sf_train)
ridge_confusion_matrix_sf = confusion_matrix(y_sf_train, ridge_pred)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(ridge_confusion_matrix_sf, classes=class_names,
                      title='Confusion Matrix')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(ridge_confusion_matrix_sf, classes=class_names, normalize=True,
                      title='Normalized Confusion Matrix')

png

png

ridge_model.score(X_sf_validate, y_sf_validate)
0.5855

New Orleans

nola_crime_resampled_new = pd.concat([nola_crime_resampled,pd.get_dummies(nola_crime_resampled['reason_for_stop'], prefix='stopreason')],axis=1)
nola_crime_resampled_new
[nola_crime_resampled_new output: the previous columns plus one-hot columns stopreason_SUSPECT PERSON, stopreason_SUSPECT VEHICLE, and stopreason_TRAFFIC VIOLATION]

200000 rows × 17 columns

nola_columns = list(nola_crime_resampled_new.columns)
nola_columns_desired = ['time',
 'subject_age',
 'citation_issued',
 'warning_issued',
 'arrest_made',
 'vehicular',
 'pedestrian',
 'stopreason_SUSPECT PERSON',
 'stopreason_SUSPECT VEHICLE',
 'stopreason_TRAFFIC VIOLATION']
X_nola_new = nola_crime_resampled_new[nola_columns_desired]
y_nola_new = nola_crime_resampled_new['race_binary']
# Train/Test Split
X_nola_train, X_nola_test, y_nola_train, y_nola_test = train_test_split(X_nola_new, y_nola_new, train_size = .70, test_size = .30)

# Train/Validation Split
X_nola_test, X_nola_validate, y_nola_test, y_nola_validate = train_test_split(X_nola_test, y_nola_test, train_size = .50, test_size = .50)
log_reg = LogisticRegression()
log_model = log_reg.fit(X_nola_train, y_nola_train)
log_pred = log_model.predict(X_nola_train)
log_confusion_matrix_nola = confusion_matrix(y_nola_train, log_pred)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(log_confusion_matrix_nola, classes=class_names,
                      title='Confusion Matrix')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(log_confusion_matrix_nola, classes=class_names, normalize=True,
                      title='Normalized Confusion Matrix')

png

png

log_model.score(X_nola_validate, y_nola_validate)
0.7444
ridge_nola = RidgeClassifier()
ridge_nola_model = ridge_nola.fit(X_nola_train, y_nola_train)
ridge_nola_pred = ridge_nola_model.predict(X_nola_train)
confusion_matrix(y_nola_train, ridge_nola_pred)

png

png

Note: The RidgeClassifier did not change the confusion matrix for New Orleans.

ridge_nola_model.score(X_nola_validate, y_nola_validate)
0.7444333333333333

Modeling Conclusions

Numerical Results

In modeling the San Francisco dataset, the logistic regression model and the ridge classifier (both before and after adding the one-hot encoded stop-reason features) gave validation accuracies of roughly 58.5-58.7%.

In modeling the New Orleans dataset, the logistic model and the ridge classifier scored validation accuracies of roughly 74.2-74.4%.

Evaluation Methods

Our two methods of evaluating our results are using confusion matrices and using the model.score() function. In our confusion matrix, we see our true negatives at (0,0) in the matrix, false negatives at (1,0), false positives at (0,1), and true positives at (1,1). True negatives are non-white subjects predicted to be non-white, false negatives are white subjects predicted to be non-white, false positives are non-white subjects predicted to be white, and true positives are white subjects predicted to be white. model.score() returns the mean accuracy on the given test data and labels.
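
For reference, those four cells can be pulled straight out of scikit-learn's confusion_matrix output (rows are true labels, columns are predicted labels); a small sketch for the San Francisco logistic model:

# With labels ordered [0, 1] = [non-white, white], ravel() returns TN, FP, FN, TP
tn, fp, fn, tp = log_reg_confusion_matrix.ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")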

Interpretation of Results

We see higher accuracies for our New Orleans models because the models predict nearly every subject to be non-white. For the New Orleans dataset this yields high accuracy because the majority of the dataset (and of the city’s population) is non-white, whereas in San Francisco a plurality of the population is white. Thus, our scores largely just reflect the percentage of non-white subjects within each dataset. It is notable, however, that the percentage of non-white subjects within the New Orleans dataset is higher than the overall percentage of people of color in the city, suggesting that people of color are stopped at a higher rate than white people even after controlling for population share.
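
A quick way to check this interpretation (a sketch, not part of the original notebook) is to compare each model's validation score with the accuracy of always predicting non-white:

# Accuracy of a trivial classifier that labels every stop as non-white (race_binary == 0)
print('New Orleans baseline:', (y_nola_validate == 0).mean())
print('San Francisco baseline:', (y_sf_validate == 0).mean())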

Reflecting on the Chosen Models

The accuracy of our predictive models for San Francisco is low: the models predict correctly only a little more than half the time. While this may be a consequence of the chosen features or of the classifiers themselves, it should also be considered that the features simply may not correlate well with race. That would explain why adding features did not increase accuracy by a meaningful amount; training on additional noise would, if anything, decrease it. In other words, rather than a problem with the model, there may simply be little to no signal of racial discrimination in the particular features we chose to train on.

Potential Improvements

  • Oversampling: An imbalance in the dataset can be seen in the proportion of white to non-white subjects. Especially in the New Orleans dataset, there are considerably more non-white than white individuals in the sampled training set, which leads the models to predict subject_race as non-white almost exclusively. A potential solution is to oversample rows with white subjects in our training data so that the model becomes more likely to predict a white subject. Since one of the primary goals of model validation is estimating how the model will perform on unseen data, oversampling correctly (only within the training split) is critical; a minimal sketch follows this list.

  • Bootstrapping: Given the imbalance between white and non-white populations in the datasets (especially New Orleans), another way to create more variability in the data would be to bootstrap (random sampling with replacement). In this way, the model could train on more data without requiring additional data collection.

  • Important null values: In standardizing the San Francisco and New Orleans datasets, all Null/NaN/None values were dropped. However, there may be underlying patterns in these dropped values. More exploratory data analysis could be performed to see why certain values fail to appear in the given datasets.
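
As a rough illustration of the oversampling idea above, the sketch below upsamples the white-subject rows of the New Orleans training split using scikit-learn's resample utility. The variable names (X_nola_train, y_nola_train) come from the splits created earlier; the rest is a hypothetical illustration rather than code we ran.

from sklearn.utils import resample

# Rebuild a single training frame so rows can be filtered by class
train = X_nola_train.copy()
train['race_binary'] = y_nola_train.values

majority = train[train['race_binary'] == 0]   # non-white subjects
minority = train[train['race_binary'] == 1]   # white subjects

# Sample the minority class with replacement until it matches the majority count
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=30)

balanced = pd.concat([majority, minority_upsampled])
X_bal = balanced.drop(columns='race_binary')
y_bal = balanced['race_binary']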