Analysis of Crime in Sanctuary vs Non-Sanctuary Cities
UC Berkeley, LEGALST 123 - Data, Prediction and Law (Spring 2020)
Claire Black, Hunter Bjazevich, Anna Gueorguieva, Alyssa Lau
Introduction to Research Question and Dataset:
Donald Trump has been on record saying that we must “end the sanctuary cities that have resulted in so many needless deaths. Cities that refuse to cooperate with federal authorities will not receive taxpayer dollars, and we will work with Congress to pass legislation to protect those jurisdictions that do assist federal authorities” (Trump immigration speech, Phoenix, AZ, August 2016).
Our goal is to test the claim that sanctuary cities cause “so many needless deaths” and whether President Trump’s proposal to freeze federal funding to these cities holds any merit by investigating whether sanctuary cities tend to have higher violent crime rates than non-sanctuary cities.
Our preliminary analysis serves to help us form a hypothesis and begins by looking at a single city and investigating violent crime rates both before and after that city received its sanctuary status. For this, we chose to analyze violent crime rates in Los Angeles, California, which became a sanctuary city when California Senate Bill 54 declared California a sanctuary state in 2017.
The importance of understanding and sorting different types of crime:
An analysis of crime rate changes using all types of crime may become convoluted and fail to provide a usable conclusion. In order to create conclusions with the most impact, we must first understand which types of crime are most relevant to our analysis.
By creating data frames that sort the top crimes pre-sanctuary city policy vs post-sanctuary city policy, we better understand the dataset and the human contexts from which it arose. These human contexts must be remembered throughout the analysis, as our social structure will impact the data we see. Some crimes are historically more reported than others due to social factors. For example, our data includes intimate partner rape. This is more likely to go unreported than forcible rape by a non-intimate partner due to the different contexts in which the crime occurs, and is something we should take into account when forming a conclusion.
Below are the first five rows of a table representing crime in LA from 2010-2019. It includes many features such as: the date the crime was reported, the date it occurred, the neighborhood, the type of crime, information about the victim, and the exact coordinates of the crime.
Index | DR_NO | Date Rptd | DATE OCC | TIME OCC | AREA | AREA NAME | Rpt Dist No | Part 1-2 | Crm Cd | Crm Cd Desc | Mocodes | Vict Age | Vict Sex | Vict Descent | Premis Cd | Premis Desc | Weapon Used Cd | Weapon Desc | Status | Status Desc | Crm Cd 1 | Crm Cd 2 | Crm Cd 3 | Crm Cd 4 | LOCATION | Cross Street | LAT | LON
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1307355 | 02/20/2010 12:00:00 AM | 02/20/2010 12:00:00 AM | 1350 | 13 | Newton | 1385 | 2 | 900 | VIOLATION OF COURT ORDER | 0913 1814 2000 | 48 | M | H | 501.0 | SINGLE FAMILY DWELLING | NaN | NaN | AA | Adult Arrest | 900.0 | NaN | NaN | NaN | 300 E GAGE AV | NaN | 33.9825 | -118.2695 |
1 | 11401303 | 09/13/2010 12:00:00 AM | 09/12/2010 12:00:00 AM | 45 | 14 | Pacific | 1485 | 2 | 740 | VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA... | 0329 | 0 | M | W | 101.0 | STREET | NaN | NaN | IC | Invest Cont | 740.0 | NaN | NaN | NaN | SEPULVEDA BL | MANCHESTER AV | 33.9599 | -118.3962 |
2 | 70309629 | 08/09/2010 12:00:00 AM | 08/09/2010 12:00:00 AM | 1515 | 13 | Newton | 1324 | 2 | 946 | OTHER MISCELLANEOUS CRIME | 0344 | 0 | M | H | 103.0 | ALLEY | NaN | NaN | IC | Invest Cont | 946.0 | NaN | NaN | NaN | 1300 E 21ST ST | NaN | 34.0224 | -118.2524 |
3 | 90631215 | 01/05/2010 12:00:00 AM | 01/05/2010 12:00:00 AM | 150 | 6 | Hollywood | 646 | 2 | 900 | VIOLATION OF COURT ORDER | 1100 0400 1402 | 47 | F | W | 101.0 | STREET | 102.0 | HAND GUN | IC | Invest Cont | 900.0 | 998.0 | NaN | NaN | CAHUENGA BL | HOLLYWOOD BL | 34.1016 | -118.3295 |
4 | 100100501 | 01/03/2010 12:00:00 AM | 01/02/2010 12:00:00 AM | 2100 | 1 | Central | 176 | 1 | 122 | RAPE, ATTEMPTED | 0400 | 47 | F | H | 103.0 | ALLEY | 400.0 | STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE) | IC | Invest Cont | 122.0 | NaN | NaN | NaN | 8TH ST | SAN PEDRO ST | 34.0387 | -118.2488 |
Some of the data needs to be cleaned. Since we are investigating crime rates before and after the passing of California SB 54, there are a number of features we do not need to include (such as victim sex), and some columns may pose problems if left as is. We also have to appropriately handle NaNs and 0s.
Some columns contain NaNs or 0s, and depending on the column, we will impute them with either a 0 or the column average. For example, victim ages listed as 0 will be replaced with the column average, while NaNs in columns like ‘Weapon Used Cd’ will be replaced with a 0, which we interpret as no weapon being used.
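As a rough sketch of this cleaning step (the file name, the exact list of dropped columns, and the way the average age is computed are our assumptions, based on the description above and the cleaned table below):

```python
import pandas as pd

# Load the raw LAPD crime data (file name is illustrative).
crime = pd.read_csv("Crime_Data_from_2010_to_2019.csv")

# Drop columns that are not needed for the pre/post SB 54 comparison
# (the exact list here is an assumption based on the cleaned table below).
crime = crime.drop(columns=["DR_NO", "Rpt Dist No", "Part 1-2", "Mocodes",
                            "Vict Sex", "Premis Desc", "Crm Cd 2", "Crm Cd 3",
                            "Crm Cd 4", "LOCATION", "Cross Street"])

# Parse the date columns and extract the year the crime occurred.
crime["Date Rptd"] = pd.to_datetime(crime["Date Rptd"])
crime["DATE OCC"] = pd.to_datetime(crime["DATE OCC"])
crime["Year"] = crime["DATE OCC"].dt.year

# A victim age of 0 is treated as missing and replaced with the column average
# (here computed over the non-zero ages).
mean_age = crime.loc[crime["Vict Age"] > 0, "Vict Age"].mean()
crime["Vict Age"] = crime["Vict Age"].replace(0, mean_age)

# Missing weapon codes/descriptions are read as "no weapon used".
crime["Weapon Used Cd"] = crime["Weapon Used Cd"].fillna(0)
crime["Weapon Desc"] = crime["Weapon Desc"].fillna(0)
```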
Below is a cleaned version of crime in LA which allows for easier data manipulation and analysis.
Index | Date Rptd | DATE OCC | Year | TIME OCC | AREA | AREA NAME | Crm Cd | Crm Cd Desc | Vict Age | Vict Descent | Premis Cd | Weapon Used Cd | Weapon Desc | Status | Status Desc | Crm Cd 1 | LAT | LON
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 2010-02-20 | 2010-02-20 | 2010 | 1350 | 13 | Newton | 900 | VIOLATION OF COURT ORDER | 48.000000 | H | 501.0 | 0.0 | 0 | AA | Adult Arrest | 900.0 | 33.9825 | -118.2695 |
1 | 2010-09-13 | 2010-09-12 | 2010 | 45 | 14 | Pacific | 740 | VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA... | 31.765902 | W | 101.0 | 0.0 | 0 | IC | Invest Cont | 740.0 | 33.9599 | -118.3962 |
2 | 2010-08-09 | 2010-08-09 | 2010 | 1515 | 13 | Newton | 946 | OTHER MISCELLANEOUS CRIME | 31.765902 | H | 103.0 | 0.0 | 0 | IC | Invest Cont | 946.0 | 34.0224 | -118.2524 |
3 | 2010-01-05 | 2010-01-05 | 2010 | 150 | 6 | Hollywood | 900 | VIOLATION OF COURT ORDER | 47.000000 | W | 101.0 | 102.0 | HAND GUN | IC | Invest Cont | 900.0 | 34.1016 | -118.3295 |
4 | 2010-01-03 | 2010-01-02 | 2010 | 2100 | 1 | Central | 122 | RAPE, ATTEMPTED | 47.000000 | H | 103.0 | 400.0 | STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE) | IC | Invest Cont | 122.0 | 34.0387 | -118.2488 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2114086 | 2019-03-28 | 2019-03-28 | 2019 | 400 | 6 | Hollywood | 648 | ARSON | 31.765902 | X | 706.0 | 506.0 | FIRE | IC | Invest Cont | 648.0 | 34.0962 | -118.3490 |
2114087 | 2019-08-15 | 2019-08-14 | 2019 | 1810 | 7 | Wilshire | 331 | THEFT FROM MOTOR VEHICLE - GRAND ($400 AND OVER) | 40.000000 | W | 101.0 | 0.0 | 0 | IC | Invest Cont | 331.0 | 34.0871 | -118.3732 |
2114088 | 2019-01-06 | 2019-01-06 | 2019 | 2100 | 20 | Olympic | 930 | CRIMINAL THREATS - NO WEAPON DISPLAYED | 46.000000 | B | 102.0 | 400.0 | STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE) | IC | Invest Cont | 930.0 | 34.0637 | -118.2870 |
2114089 | 2019-10-17 | 2019-10-16 | 2019 | 1800 | 17 | Devonshire | 420 | THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER) | 31.765902 | 0 | 101.0 | 0.0 | 0 | IC | Invest Cont | 420.0 | 34.2266 | -118.5085 |
2114090 | 2019-02-01 | 2019-02-01 | 2019 | 1615 | 8 | West LA | 330 | BURGLARY FROM VEHICLE | 33.000000 | W | 707.0 | 0.0 | 0 | IC | Invest Cont | 330.0 | 34.0420 | -118.4531 |
2114091 rows × 18 columns
1. Trends in violent crime in LA from 2010 to 2019:
President Trump posits that violent crime rates are inflated under systems of sanctuary cities. Because of this, we will focus on violent crime rather than all instances of crime, since certain categories, such as embezzlement, are irrelevant to this analysis.
The LA Times lists “violent crime” as homicide, rape, assault, and robbery. According to the FBI, nationally, the top violent crimes are aggravated assault (62.5% of violent crime), robbery (29.5%), forcible rape (6.8%), and murder (1.2%). So, we will isolate these categories to form a data frame for violent crime.
Source: https://ucr.fbi.gov/crime-in-the-u.s/2010/crime-in-the-u.s.-2010/violent-crime
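A sketch of how the violent-crime frame might be built from the cleaned data; the label list is an assumption drawn from the category names that appear later in this report, and the corresponding aggravated-assault labels would be added in the same way:

```python
# Isolate the violent-crime categories to build `violent_df`.
violent_labels = ["CRIMINAL HOMICIDE", "RAPE, FORCIBLE", "RAPE, ATTEMPTED",
                  "ROBBERY", "ATTEMPTED ROBBERY"]
violent = crime[crime["Crm Cd Desc"].isin(violent_labels)]

# Count violent crimes per year.
violent_df = violent.groupby("Year").size().to_frame("Violent Crime Count")
```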
Year | Violent Crime Count
---|---
2010 | 12270 |
2011 | 11285 |
2012 | 10293 |
2013 | 9073 |
2014 | 9356 |
2015 | 10605 |
2016 | 11955 |
2017 | 12455 |
2018 | 11825 |
2019 | 10851 |
The ‘violent_df’ dataframe above gives the counts of violent crimes per year from 2010 to 2019. Looking at the violent crime counts in a vacuum, there does not seem to be any discernible difference in the number of violent crimes as the counts look relatively uniform.
Below we have plotted several crime trends. The first plot gives a sense of crime trends from 2010 to 2019, while the second plot homes in on the crime rates from 2016 to 2019, the two years before and the two years after California SB 54 declared California a sanctuary state.
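A sketch of how the two plots could be generated from `violent_df` (plot styling here is illustrative):

```python
import matplotlib.pyplot as plt

# Violent crime counts per year, 2010-2019.
violent_df.plot(legend=False)
plt.xlabel("Year")
plt.ylabel("Violent crimes")
plt.title("Violent crime in Los Angeles, 2010-2019")
plt.show()

# Zoom in on the two years before and after SB 54.
violent_df.loc[2016:2019].plot(legend=False)
plt.xlabel("Year")
plt.ylabel("Violent crimes")
plt.title("Violent crime in Los Angeles, 2016-2019")
plt.show()
```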
Grouping the number of crimes per year, it does appear that the total number of violent crimes decreased substantially between 2017 and 2019, immediately after California became a sanctuary state. This could be for a variety of reasons, but taken in isolation, violent crime in Los Angeles actually decreased after the passing of this bill. This does not definitively disprove President Trump’s proposition, but it provides no support for the claim that sanctuary status increases violent crime rates.
2. Finding Top 5 Crimes of Pre and Post Senate Bill
The first table below represents the top 5 crimes occurring in LA in 2016 & 2017 (pre-sanctuary bill) and the second table represents the top 5 crimes occurring in LA in 2018 & 2019 (post-sanctuary bill). Counts in the post-sanctuary bill table do seem to decrease in comparison to the first table; however, there are many other factors that could have impacted this change, and thus we cannot draw any direct conclusions yet. A sketch of how these tables might be computed follows the second table.
Crime Type | Count Pre-Sanct
---|---
ROBBERY | 18576 |
ATTEMPTED ROBBERY | 2558 |
RAPE, FORCIBLE | 2466 |
CRIMINAL HOMICIDE | 576 |
RAPE, ATTEMPTED | 234 |
Crime Type | Count Post-Sanct
---|---
ROBBERY | 17397 |
ATTEMPTED ROBBERY | 2555 |
RAPE, FORCIBLE | 2040 |
CRIMINAL HOMICIDE | 513 |
RAPE, ATTEMPTED | 171 |
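A sketch of how these two tables might be computed from the violent-crime frame (the grouping into 2016-2017 and 2018-2019 follows the description above):

```python
# Top 5 violent crime types in the two years before and after SB 54.
pre = violent[violent["Year"].isin([2016, 2017])]
post = violent[violent["Year"].isin([2018, 2019])]

top5_pre = (pre["Crm Cd Desc"].value_counts().head(5)
            .rename_axis("Crime Type").to_frame("Count Pre-Sanct"))
top5_post = (post["Crm Cd Desc"].value_counts().head(5)
             .rename_axis("Crime Type").to_frame("Count Post-Sanct"))
```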
Hypothesis:
Our data frames illustrate that the top 5 crimes have not changed, nor has the relative frequency with which the top 5 crimes occur. Based on the preliminary comparisons between the ‘Count’ columns of pre-sanctuary-policy and post-sanctuary-policy crimes, there was a decrease in violent crime following the implementation of sanctuary policies in Los Angeles. This is not to say that the sanctuary declaration was responsible for the decrease in violent crime rates, since numerous factors can contribute to this, but it is indicative of the lack of evidence for sanctuary cities increasing crime rates. Using our above analysis of Los Angeles, we hypothesize that sanctuary city policies do not have any significant effect on violent crime rates, and thus believe President Trump’s proposition to withhold federal funding from sanctuary cities and states is misguided. Further in-depth analysis of this data, along with additional data, can provide evidence for this hypothesis.
Machine Learning Experiment
If sanctuary cities are actually an indicator of increased crime rates in a given city, theoretically one would be able to predict with accuracy whether or not a given city is a sanctuary city based on its violent crime rates. To investigate this, we are going to create several classification models to predict whether a city is a sanctuary city (1) or a non-sanctuary city (0).
We have compiled a long list of cities and their rates of violent crime spanning 2010-2015. This data was consolidated by the Marshall Project using the Uniform Crime Reporting (UCR) program, which falls under the umbrella of the United States Department of Justice.
We classified cities as sanctuary cities according to their sanctuary status listed by the Center For Immigration Studies. If a city was located in a sanctuary state or sanctuary county, that city was also listed as a sanctuary city. If a city was first declared a sanctuary city sometime between 2010 and 2015, the sanctuary status feature is coded as 0 for the years before the declaration and as 1 for the years after. The violent crime statistics we are using as features include the total violent crime rate, homicide rate, rape rate, and aggravated assault rate. We chose to use violent crime rates, rather than total violent crime counts, to account for the variance in population size across cities.
The data in the csv file below has already been cleaned.
Index | department_name | year | Sanctuary Status | total_pop | violent_per_100k | homs_per_100k | rape_per_100k | rob_per_100k | agg_ass_per_100k
---|---|---|---|---|---|---|---|---|---
0 | Albuquerque | 2010 | 0 | 545852 | 786.110521 | 7.694393 | 61.921546 | 172.207851 | 544.286730 |
1 | Albuquerque | 2011 | 0 | 551961 | 762.735048 | 6.884544 | 47.829466 | 180.809876 | 527.211162 |
2 | Albuquerque | 2012 | 0 | 553684 | 749.705608 | 7.404946 | 50.209145 | 197.224410 | 494.867108 |
3 | Albuquerque | 2013 | 0 | 558165 | 774.860480 | 6.628864 | 78.650578 | 187.399783 | 502.181255 |
4 | Albuquerque | 2014 | 1 | 558874 | 882.846581 | 5.367936 | 71.930346 | 247.103998 | 558.444300 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
398 | Wichita | 2011 | 0 | 384796 | 770.018399 | 6.496949 | 63.670101 | 127.340201 | 572.511149 |
399 | Wichita | 2012 | 0 | 386409 | 749.464945 | 5.952242 | 62.627941 | 128.361399 | 552.523363 |
400 | Wichita | 2013 | 0 | 386486 | 794.077923 | 3.881124 | 64.685396 | 120.832320 | 604.679083 |
401 | Wichita | 2014 | 0 | 387493 | 758.465314 | 5.419453 | 62.710810 | 128.002312 | 562.332739 |
402 | Wichita | 2015 | 0 | 389824 | 984.803398 | 6.926203 | 89.527582 | 188.033574 | 700.316040 |
403 rows × 9 columns
For our classification models, we will need to split the data into training, validation, and test sets. 80% of the data will go into training and 20% into testing. From there, 25% of the training data will be used as a validation set to test our models before applying them to the final test set.
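A sketch of this split using scikit-learn; the file name and the exact feature list are assumptions based on the columns shown above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Marshall Project / UCR city-level data (file name is illustrative).
cities = pd.read_csv("sanctuary_city_crime_rates.csv")

# Features are the violent crime rates; the label is the binary sanctuary status.
features = ["violent_per_100k", "homs_per_100k", "rape_per_100k",
            "rob_per_100k", "agg_ass_per_100k"]
X = cities[features]
y = cities["Sanctuary Status"]

# 80% train / 20% test, then 25% of the training data held out for validation
# (0.80 * 0.25 = 0.20, so the overall split is roughly 60/20/20).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)
```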
Discussion of Classification models - Linear SVM Classifier
The first classification model we will be using is a Linear SVM classifier like the one we saw in a guest lecture to reverse engineer the COMPAS algorithm, as well as in lab. If violent crime rates are truly predictive of whether or not a city is a sanctuary city, then our model, trained with features related to violent crime rates, should be able to accurately classify cities into “Sanctuary” and “Non-Sanctuary.”
There are certain advantages and disadvantages to using Linear SVM to classify our cities as either sanctuary or non-sanctuary. The first advantage, and the reason we chose to start with Linear SVM, is that the method is fairly simple for two-group classification problems like the one we are investigating. It fits a linear decision boundary (a hyperplane in the feature space): anything falling on one side of the boundary receives one classification, and anything on the other side receives the other. With SVM more generally, if there is no clear linear decision boundary, one can map the data into additional dimensions until a clear boundary emerges. However, in our case we will only be working with Linear SVM.
The obvious disadvantage is that this method may oversimplify the classification problem. The data may not be linearly separable, meaning no single linear boundary cleanly divides the two classes. Because of this, the model may not classify accurately or perform well when applied to new data. Another disadvantage stems from potentially long training times. We did not experience a problem with this, but if we begin working with larger datasets, especially ones where SVM requires higher-dimensional mappings, computing time could become an issue.
However, we find that Linear SVM is a good model to begin with for this specific problem. Below is the confusion matrix created by Linear SVM. This confusion matrix shows that 86% of the data was correctly predicted to be non-sanctuary (true-negatives) and 14% of the data was predicted to be non-sanctuary but was actually sanctuary (false-negatives).
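A minimal sketch of this step, assuming scikit-learn’s `LinearSVC` (the estimator choice and settings are assumptions, not necessarily the exact ones used here):

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

# Fit a linear SVM on the training set and evaluate on the validation set.
svm_model = LinearSVC(max_iter=10000)
svm_model.fit(X_train, y_train)

# Proportions over all validation examples (rows: true class, columns: predicted).
print(confusion_matrix(y_val, svm_model.predict(X_val), normalize="all"))
```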
Naive Bayes Classifier
The second model we will use is the Naive Bayes classifier, which assumes that all of the features are independent of each other and that each feature contributes equally to the outcome. This classifier uses Bayes’ Theorem to predict class membership probabilities for each piece of data.
Like Linear SVM, there are a number of advantages and disadvantages to using Naive Bayes classifiers. One advantage of the Naive Bayes classifier is that, unlike SVM, this technique works quite well with larger datasets and is generally much faster than SVM. Additionally, it also performs well on multi-group classification problems.
A disadvantage of using Naive Bayes is that, as noted above, the classifier assumes that all features are independent of one another. This is rarely the case, and in datasets where features overlap or depend on one another, the model may perform poorly.
Naive Bayes requires less training data, yet it is still a highly scalable algorithm. Given the size of our current dataset, we applied the Naive Bayes classifier to investigate how it compares to the simpler Linear SVM.
The Naive Bayes confusion matrix shows 54 true-negatives, 3 true-positives, 8 false-negatives and 16 false-positives (where a sanctuary city is represented to be a positive and a non-sanctuary city is a negative).
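A sketch of this step, assuming the Gaussian variant (`GaussianNB`), which suits continuous rate features:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

# Fit Gaussian Naive Bayes and evaluate on the validation set.
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

# Raw counts (rows: true class, columns: predicted class).
print(confusion_matrix(y_val, nb_model.predict(X_val)))
```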
Multinomial Logistic Regression
The final model we will use is multinomial logistic regression. We will follow the same steps as with the two previous classifiers. Logistic regression models the probability of a categorical outcome (here, sanctuary status) as a function of the independent variables.
The advantages of using a logistic regression classifier are that the model is very simple and time-effective when working with larger datasets. In the case of linearly separable data, such as our training data, the logistic regression classifier can be highly effective. As with Linear SVM, the corresponding disadvantage is oversimplification: the classifier can perform poorly when the relationships between the features and the outcome are non-linear.
We are using this technique largely to see how it compares to SVM and Naive Bayes.
Our final model, multinomial logistic regression, gives us 43 true-negatives, 5 false-negatives, 6 true-positives, and 27 false-positives.
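A sketch of this model, again using scikit-learn (the solver and iteration settings are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# With only two classes, the multinomial objective reduces to ordinary
# binary logistic regression.
logit_model = LogisticRegression(multi_class="multinomial",
                                 solver="lbfgs", max_iter=1000)
logit_model.fit(X_train, y_train)

print(confusion_matrix(y_val, logit_model.predict(X_val)))
```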
Overall, the SVM classifier resulted in the highest mean accuracy on the validation sets, but we suspect this is because of a class imbalance: sanctuary cities are much less common than non-sanctuary cities in our data. To remedy this, we placed 4x weight on the sanctuary city class in our models, since non-sanctuary cities were roughly four times more common than sanctuary cities, but the results largely stayed the same. Linear SVM was still the most accurate, so we decided to run the test set on that model. The confusion matrix below shows the result of running Linear SVM on our test set.
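A sketch of the re-weighting and the final test-set evaluation; the `class_weight` dictionary is one way to encode the 4x weighting described above:

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

# Weight sanctuary cities (class 1) roughly 4x to offset the class imbalance,
# then evaluate the chosen model on the held-out test set.
weighted_svm = LinearSVC(class_weight={0: 1, 1: 4}, max_iter=10000)
weighted_svm.fit(X_train, y_train)

print(confusion_matrix(y_test, weighted_svm.predict(X_test)))
```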
ROC Curve and AUC
Creating a ROC curve of our results and finding the area under the curve allows us to measure how well our classifiers have performed. A ROC curve (Receiver Operating Characteristic curve) is useful for determining thresholds in binary classifiers, and the Area Under the Curve (AUC) provides a measurement of predictive power. The ROC plots the true positive rate (True Positives / (True Positives + False Negatives)) on the y-axis and the false positive rate (False Positives / (False Positives + True Negatives)) on the x-axis. The ideal curve hugs the y-axis and then makes a sharp turn to follow the top of the plot, forming roughly a 90 degree angle. The closer the curve comes to a 45 degree line, the worse the model may be performing; a straight diagonal line of x = y shows no predictive skill.
Since our thresholds have already been determined and all tests run, we will be using AUC to measure predictive power. The higher the AUC, the more likely our classifier is performing well and classifying correctly. By applying scikit-learn functions we can create a ROC plot and find the AUC.
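A sketch of how the ROC curve and AUC could be computed for the SVM model, assuming the weighted model from the previous sketch; `LinearSVC` exposes `decision_function` rather than probabilities, so we use its scores:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Score the test set with the SVM's decision function.
scores = weighted_svm.decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

plt.plot(fpr, tpr, label=f"Linear SVM (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="No skill (AUC = 0.50)")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```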
Below is the ROC curve for our linear SVM model. The AUC is 0.5.
The second ROC curve is for multinomial logistic regression. It has an AUC of 0.64.
The final ROC curve is for our naive bayes model. It has an AUC of 0.65.
This ROC/AUC analysis shows that the Linear SVM model, despite being the best performing classifier in terms of accuracy, could not effectively predict whether a city is a sanctuary city based on its crime rates, providing more evidence for our hypothesis that sanctuary status plays no significant role in determining violent crime rates.
The Naive Bayes and Logistic Regression classifiers performed more effectively in this analysis, but it is also worth noting that overall these models did not predict status very accurately either.
Nonetheless, all of the above analyses illustrate the lack of sufficient evidence that sanctuary cities are more dangerous than non-sanctuary cities.
Conclusion / Summary & Write Up
Research Question we sought out to answer:
Are sanctuary cities generally more dangerous than non-sanctuary cities? Can we predict whether a city is a sanctuary city or a non-sanctuary city based on an investigation of their violent crime rates?
Many of our nation’s leaders place unfounded blame on various members of our society, especially immigrants. Various cities around the country have declared themselves sanctuary cities, meaning that they limit their cooperation with the federal government’s immigration enforcement. Many people argue that giving “sanctuary” to undocumented immigrants will increase violent crime in their neighborhoods and cities. Government leaders, most notably President Trump, have used this argument to try to end sanctuary cities by threatening to withhold federal funding, claiming that they lead to violent crime and cause many needless deaths. Our purpose was to investigate President Trump’s claim that sanctuary cities are generally more dangerous than non-sanctuary cities and whether his threat to withhold federal aid is justified or misguided.
The data we used:
The data we used to form our hypothesis was gathered from the Los Angeles Police Department and covers the years 2010 to 2019. According to the LA Times, when the bill came into effect in 2018, local police departments stopped engaging in joint operations with U.S. Immigration and Customs Enforcement (ICE) and no longer transferred immigrants with minor offenses to ICE custody.
In order to test our hypothesis we used data gathered by the Marshall Project from the Uniform Crime Reporting Statistics, a database under the U.S. Department of Justice. The data consists of a long list of cities, many of which are sanctuary cities and many of which are not, along with their violent crime rates. We added a column of binary values in which 1 represents a sanctuary city and 0 represents a non-sanctuary city.
Our hypothesis and how we formed it:
In order to form a hypothesis, we found it necessary to first investigate a single city and its crime rates before and after it declared itself a sanctuary city. We chose Los Angeles because it became a sanctuary city in 2017 with the passing of California Senate Bill 54, giving us available crime data prior to 2017 as well as after 2017, when police departments stopped cooperating with ICE. By filtering the crime statistics to only display instances of violent crime, we found that violent crime in Los Angeles did not increase, but actually decreased, between 2017 and 2019. Because this was only one city’s data, we did not find it justified to hypothesize that sanctuary cities decrease violent crime rates, but we felt this analysis was enough to support the hypothesis that sanctuary city status does not have any significant effect on rates of violent crime.
Testing our Hypothesis:
To test the hypothesis that sanctuary city status has no discernible effect on violent crime rates in a given city, we decided to use the machine learning techniques from class to build classification models that predict whether a city is a sanctuary city (1) or a non-sanctuary city (0). If we were able to predict whether a city was a sanctuary city solely based on its violent crime rates, with higher than expected accuracy and reasonable false positive rates, we would reject our hypothesis. If we found that we could not effectively predict a city’s sanctuary status based on its violent crime rates, we would retain our hypothesis.
We made three separate classification models: Linear SVM, Naive Bayes, and Logistic Regression. Because of the class imbalance due to more cities being non-sanctuary cities, we changed the ‘class_weight’ attribute within the sklearn functions to place more weight on sanctuary cities.
Explanation of our findings
On the test set, using our best performing model, the Linear SVM classifier, we were able to predict with about 84% accuracy whether a city is a sanctuary city or a non-sanctuary city.
Regardless of this fairly high success rate, there are some caveats to the results from our Linear SVM model. The model predicted 100% of the time that a city would be a non-sanctuary city, despite the larger class weight we placed on sanctuary cities. This means its accuracy largely reflects the fact that a city is much more likely to be a non-sanctuary city than a sanctuary city. Looking at the entire dataset, there are 330 instances of non-sanctuary cities and 73 instances of sanctuary cities, so about 82% of the rows are listed as non-sanctuary. Given how small the gap is between this baseline and our model’s accuracy, we conclude from these models that they provide no evidence of a meaningful difference in crime rates between sanctuary and non-sanctuary cities.
Additionally, the ROC curve and AUC show that our Linear SVM model does not have much predictive power. This is to be expected, as our data leans heavily towards the (0) classification regardless of the class_weight parameter, making it hard to classify correctly when the model is trained mostly on (0) examples. This lack of predictive power further suggests that there is no significant difference between crime rates in sanctuary and non-sanctuary cities, as violent crime rates do not seem to be a good indicator when classifying cities.
Because of the poor predictive power found from the ROC/AUC analysis on our Linear SVM model, we found it necessary to also apply this analysis to our other classifiers. We determined that Naive Bayes and Logistic Regression have more predictive power than the SVM model, but this should be taken with a grain of salt because these models were generally also much less accurate.
Conclusion:
No, it does not appear that President Trump’s claim that sanctuary cities are generally more dangerous than non-sanctuary cities holds merit, and his threat to withhold federal funding to these cities looks to be misguided.
If we performed our preliminary analysis on large cities other than Los Angeles, we would expect crime rates to remain largely unchanged by sanctuary status. If for some reason violent crime rates within a city increased after it declared sanctuary status, we believe this would be the result of confounding variables and not a direct effect of becoming a sanctuary city.
As we have concluded, sanctuary cities as a class are not more dangerous than non-sanctuary cities. This is not to say that a city classified as a sanctuary city cannot have higher violent crime rates, but those higher crime rates are most likely the result of outside factors, not of the sanctuary status itself.
Acknowledgment of Ethical, Social and Legal Complexities
Data as a tool to understand society:
Data does not exist in a vacuum, independent of varying legal, economic, and social factors. Our analysis project is not a showcase of Python functions, matrices, and graphs; it is an evaluation of and contribution to current discussions being held in our society through the perspective of data. Our society relies on data and our data relies on society. We make claims to be verified through data, just as we may form a claim after discovering interesting patterns in data.
Trump’s claim that sanctuary cities have increased violent crime could have serious legal and social implications. Public policies would most likely shift if Trump promoted a removal of sanctuary cities and social tensions might rise. Thus, does Trump’s claim have enough evidence to support all these changes? That is what our group attempts to answer. We have shown that violent crime rates are not a good indicator of whether or not a city is a sanctuary city. This brings into question: are the legal actions Trump wishes to take in order to weaken sanctuary policies worth it, if there actually seems to be no difference in crime? This question illustrates the interconnection between law and data. Without data, we cannot understand the impacts of certain laws on society. And while we must take data seriously, we cannot put blind faith into what it tells us. Data could have many pitfalls: it holds implicit bias, it could have been collected poorly, it could have a skewed representation for certain groups, etc. For example, one city may have a better system for detection and prevention of violent crime than another city - independent of whether or not it is sanctuary. There are endless possibilities for where data may not provide an accurate representation of what we aim to describe. Thus, we put our faith into the data we are using, but should not close off discussion of potential changes or pitfalls. Due to all these variabilities, we cannot with certainty say that Trump is wrong because of one data analysis project; this would discredit necessary social debate and policy discussion. Rather, we can provide our perspective into the pool of knowledge and hope that it provides evidence for the legal choice that creates the most positive change. Data and law should support one another in discussions of social change, without overpowering each other. Trump should use data in order to validate his claim, but data should also continue to consider outside social factors that may not be represented within the dataset.
Potential risks of using ML in the legal realm:
Our project creates a model which uses features of violent crime to predict whether a city is sanctuary or not. All cities in the dataset have a clear answer at the end: they are legally sanctuary or they are not. This allows for a clear-cut comparison of how our classifier performs; it either predicted correctly or it did not. However, if we were to attempt to classify on some other arbitrary label such as ‘dangerous city’ or ‘not dangerous city’, we would run into more issues. The definition of a ‘dangerous city’ is arbitrary and problematic. It brings up ethical issues of representation and power. Who has the power to claim a city is dangerous? Politicians? Data scientists? Citizens of that city? If a data scientist labels a city ‘dangerous’ in an analysis, this creates a representation of that city that its citizens might not want nor agree with. As data analysts, we must be careful of how we choose to manipulate the data and be aware of the consequences it may have on the groups that the data comes from. Our analysis attempts to remove arbitrary definitions that may be potentially harmful by using a city’s own choice of representation (sanctuary or not). But we are aware that this project could easily shift into an ethically sticky situation in which data analysts hold power over how a city is represented by creating our own classification bins.
Data is a double-edged sword. It can open up discussions of fairness and social good through the exposure of unfair practices and biases that arise explicitly in data. However, it can also produce conclusions that harm individuals or groups when all necessary social factors are not properly accounted for. Our legal institutions rely heavily on statistics and data in order to create laws and policies. We must be aware that our analysis does not exist independently from laws that have direct social implications. We must approach data analysis with empathy and awareness of the power it holds while using it to the best of our abilities in order to create positive social and legal changes.