Description: Using machine learning to predict blight violation compliance in Detroit – if a person or business is fined for not maintaining their property, will they pay the fine?

Techniques: using pandas DataFrames, feature analysis, feature engineering, machine learning

Tools: Python (pandas, numpy, scikit-learn, matplotlib), Jupyter notebooks

This work was the final project in an online Coursera class I took about machine learning offered by the University of Michigan, and is based on a Kaggle competition described here. I’ll go through some of the main steps I took below. You can find more detailed code at my GitHub page.

Reading and examining data

Read in data:

First, import some libraries and read in the data:

###### import commands
import pandas as pd
import numpy as np

#### read data
df=pd.read_csv('train.csv', encoding = 'ISO-8859-1', low_memory=False)

Take a look at the data. What types of features do we have?

agency_name                      Buildings, Safety Engineering & Env Department
inspector_name                                                  Sims, Martinzie
violator_name                                 INVESTMENT INC., MIDWEST MORTGAGE
violation_street_number                                                    2900
violation_street_name                                                     TYLER
violation_zip_code                                                          NaN
mailing_address_str_number                                                    3
mailing_address_str_name                                              S. WICKER
city                                                                    CHICAGO
state                                                                        IL
zip_code                                                                  60606
non_us_str_code                                                             NaN
country                                                                     USA
ticket_issued_date                                          2004-03-16 11:40:00
hearing_date                                                2005-03-21 10:30:00
violation_code                                                        9-1-36(a)
violation_description         Failure of owner to obtain certificate of comp...
disposition                                              Responsible by Default
fine_amount                                                                 250
admin_fee                                                                    20
state_fee                                                                    10
late_fee                                                                     25
discount_amount                                                               0
clean_up_cost                                                                 0
judgment_amount                                                             305
payment_amount                                                                0
balance_due                                                                 305
payment_date                                                                NaN
payment_status                                               NO PAYMENT APPLIED
collection_status                                                           NaN
grafitti_status                                                             NaN
compliance_detail                                   non-compliant by no payment
compliance                                                                    0
Name: 22056, dtype: object

Some of these aren’t included in the test dataset, including the compliance field, which is 0 when the person didn’t pay their fine, and 1 when they did. It turns out there are ~148,000 tickets that weren’t paid, and ~12,000 that were.

Add location information:

Now let’s add the location information that is provided in the latlons.csv and addresses.csv files. The lat/lon file has columns address, lat, and lon, and the address file has columns ticket_id and address, so they can be merged on the address column. This won’t be entirely accurate because of spelling mistakes, but it’s good enough for now. Then merge this new information with the full dataset using the ticket_id index.

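#### read latlon, addresses
add_df = pd.read_csv('addresses.csv')
latlon_df = pd.read_csv('latlons.csv')

#### merge latlon, addresses on addresses column, make sure it's consistent first
add_df['address'] = add_df['address'].str.upper()
latlon_df['address'] = latlon_df['address'].str.upper()
df_info = add_df.merge(latlon_df, on='address', how='left').set_index('ticket_id')

#### merge latlon info into main df (this assumes df is also indexed by ticket_id)
df=df.merge(df_info,left_index=True,right_index=True, how='left')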
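To get a feel for how much the spelling mismatches cost, one option is to pass indicator=True to the merge and count the rows that found no match. This is only a sketch on made-up toy frames: the addresses and coordinates below are invented for illustration, and the extra str.strip() call is my own assumption about what might help, not something from the original code.

```python
import pandas as pd

# Toy stand-ins for addresses.csv and latlons.csv (all values are invented)
add_df = pd.DataFrame({
    'ticket_id': [1, 2, 3],
    'address': ['2900 tyler, Detroit MI',
                '4311 central, Detroit MI',
                '1449 longfellow, Detroit MI'],
})
latlon_df = pd.DataFrame({
    'address': ['2900 TYLER, DETROIT MI', '4311 CENTRAL, DETROIT MI'],
    'lat': [42.39, 42.33],
    'lon': [-83.12, -83.14],
})

# Normalise case and surrounding whitespace before matching
for frame in (add_df, latlon_df):
    frame['address'] = frame['address'].str.upper().str.strip()

# indicator=True adds a '_merge' column recording where each row was found
merged = add_df.merge(latlon_df, on='address', how='left', indicator=True)
n_unmatched = (merged['_merge'] == 'left_only').sum()

print(merged[['ticket_id', 'lat', 'lon', '_merge']])
print('tickets without coordinates:', n_unmatched)
```

Rows marked left_only are tickets whose address never matched, so they end up with NaN for lat and lon.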

Here I make a plot of where these blight violations occur:

#### plot spatial data, clean version
# for playing around in jupyter
%matplotlib notebook
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.colors as colors

# define the map boundaries (xedges and yedges hold the lon and lat bin edges)

# np.histogram2d gives the number of observations within each 2D bin
# this throws a warning, I assume because a few of the lat/lon values are NaNs; ignoring for now
H, xedges, yedges = np.histogram2d(df['lon'], df['lat'], bins=(xedges,yedges))

fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(111)

xx, yy = np.meshgrid(xedges, yedges)
sc = ax.pcolormesh(xx, yy, np.transpose(H), norm=colors.LogNorm(vmin=1,vmax=100))
ax.set_aspect(1/np.cos(44 * np.pi / 180))
ax.set_title('All blight violations')
cbar=plt.colorbar(sc,ax=ax)
cbar.set_label("# of violations")

The colormap is in a logarithmic scale – since a few of the areas have a lot of violations, if we used a linear scale everything else would be washed out. It’s cool that you can make out some of the highways and streets in the city, but for now I’m not going to use the lat/lon information in the machine learning portion.
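As a rough illustration of why the log scale helps, here is a small numpy-only sketch of the binning step on synthetic points. Nothing here is the real data: the coordinates, the map boundaries, and the bin counts are all assumptions chosen for the example, and I drop the NaN coordinates explicitly rather than relying on the warning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic points: one dense cluster plus a sparse background (not real data)
lon = np.concatenate([rng.normal(-83.1, 0.005, 5000),
                      rng.uniform(-83.3, -82.9, 500)])
lat = np.concatenate([rng.normal(42.35, 0.005, 5000),
                      rng.uniform(42.25, 42.45, 500)])
lon[:10] = np.nan  # mimic tickets that have no coordinates

# Hypothetical map boundaries and bin edges
xedges = np.linspace(-83.3, -82.9, 41)   # 40 bins in longitude
yedges = np.linspace(42.25, 42.45, 21)   # 20 bins in latitude

# Keep only points where both coordinates are present
ok = ~(np.isnan(lon) | np.isnan(lat))
H, _, _ = np.histogram2d(lon[ok], lat[ok], bins=(xedges, yedges))

print(H.shape)  # (40, 20): first axis is lon, which is why H is transposed for pcolormesh
occupied = H[H > 0]
print('max count:', occupied.max(), 'median count:', np.median(occupied))
```

When the busiest bin holds far more points than a typical occupied bin, a linear colormap spends nearly all of its range on that one bin.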

Examine useful features:

Moving on to the other features, some stand out as potentially useful, including disposition, late_fee, fine_amount, and discount_amount. Some of these look like data leakage, since you wouldn’t necessarily know this information for tickets issued in the future. But they’re included in the test dataset provided, so I’ll use them. In the future I may attempt this machine learning problem without any of these questionable features.

Below is an example of how we can test different features to see how they might predict compliance, in this case for fine_amount:

## use crosstab to find the average non-compliance for different values of the chosen variable
var='fine_amount' # define the column to examine
ctNorm=pd.crosstab(df[var], df['compliance'], normalize='index', margins=True)
ctNorm['N']=df[(df['compliance']==1) | (df['compliance']==0)][var].value_counts() # some observations have null values for compliance because the violators were found not responsible
print(ctNorm[ctNorm['N']>100]) # just show the values that have over 100 observations to reduce noise

## plot the fine amount vs. the percent non-compliant
import matplotlib.pyplot as plt
%matplotlib notebook

maxind=ctNorm['N'].idxmax() # find the index that has the most observations

avgval=ctNorm.loc['All',0.0] # the average non-compliance for all violations

plt.ylabel('% non-compliant')
compliance        0.0       1.0        N
0.0          0.000000  1.000000    195.0
25.0         0.922351  0.077649   1378.0
50.0         0.909380  0.090620  20415.0
100.0        0.882877  0.117123  15488.0
125.0        0.949559  0.050441    793.0
200.0        0.900000  0.100000  12710.0
250.0        0.936899  0.063101  86798.0
300.0        0.971603  0.028397   3768.0
350.0        0.953125  0.046875    128.0
500.0        0.945938  0.054062   6918.0
750.0        0.938865  0.061135    229.0
1000.0       0.962538  0.037462   4965.0
1500.0       0.943182  0.056818    264.0
2500.0       0.970227  0.029773   1545.0
3500.0       0.979269  0.020731   3859.0
10000.0      0.994872  0.005128    195.0

It looks like the higher the fine amount, the higher the chance the person won’t pay their fine. In the plot, the most common fine amount is shown in red, and the average non-compliance % is shown as the dashed black line.

One more exploration of the data before we get to the machine learning (I did a lot more that you can check out on GitHub). If we examine the top violators, we see that Acorn Investment had the most violations, and that they never paid the fine:

var='violator_name' # switch the column being examined from fine_amount to the violator
ctNorm=pd.crosstab(df[var], df['compliance'], normalize='index', margins=True)
ctNorm['N']=df[(df['compliance']==1) | (df['compliance']==0)][var].value_counts()
print(ctNorm[ctNorm['N']>100].sort_values(by=['N'], ascending=False))
compliance                              0.0       1.0      N
INVESTMENT, ACORN                  1.000000  0.000000  624.0
INVESTMENT CO., ACORN              1.000000  0.000000  343.0
BANK, WELLS FARGO                  0.980237  0.019763  253.0
MILLER, JOHN                       0.994350  0.005650  177.0
STEHLIK, JERRY                     0.867089  0.132911  158.0
NEW YORK, BANK OF                  0.971831  0.028169  142.0
KRAMER, KEITH                      1.000000  0.000000  119.0
SNOW, GEORGE                       1.000000  0.000000  108.0
APARTMENTS, CARLTON                1.000000  0.000000  102.0
NATIONAL TRUST CO., DEUTSCHE BANK  0.980392  0.019608  102.0

A quick Google search led me to this article, which talks about the ‘king of Detroit blight’, who rents properties through a number of different companies including Acorn Investment. In a future version of this project I may create a new feature called is_bad_landlord, which uses the many different company names and mailing addresses to determine whether this owner is involved.

Teaching the machine

Define features to use:

Now we get into the machine learning part of this project. First I’ll define the features that I want to use to train the model.

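#### define features to use and clean model
usecols=['disposition', 'agency_name', 'late_fee','discount_amount', 'fine_amount','judgment_amount','TimeDiff','issue_month','hearing_month','hearing_day','hearing_dayofweek','issue_dayofweek','violation_description']
# keep only the chosen features and the target before dropping nulls;
# otherwise mostly-empty columns such as violation_zip_code would remove nearly every row
df=df[usecols+['compliance']].dropna() # delete any rows that have null values

#### split into X and y
X=df[usecols]
y=df['compliance']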
Check out the GitHub page to see how I added some of these features, including TimeDiff and the month and day features. TimeDiff is the difference between the hearing date and the ticket issue date, and the month and day features use pandas.to_datetime() to get numerical values from the dates. Some of these features are categorical variables, so I’ll use X_new=pd.get_dummies(X[usecols]) to convert them into multiple dummy indicator variables. For example, if a variable colors had the categories green, red, and blue, it would create the variables colors_green, colors_red, and colors_blue, with 0s and 1s in each column depending on which color category was there before.
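Here is a minimal sketch of those two steps on a pair of invented tickets. The column names follow the dataset, but the rows are made up, and measuring TimeDiff in whole days is my assumption for the example; the actual code on GitHub may use a different unit.

```python
import pandas as pd

# Two invented tickets; column names follow the dataset
toy = pd.DataFrame({
    'ticket_issued_date': ['2004-03-16 11:40:00', '2004-06-01 09:00:00'],
    'hearing_date': ['2005-03-21 10:30:00', '2004-06-15 13:30:00'],
    'disposition': ['Responsible by Default', 'Responsible by Admission'],
})

issued = pd.to_datetime(toy['ticket_issued_date'])
hearing = pd.to_datetime(toy['hearing_date'])

# Assumption: TimeDiff is counted in whole days between issue and hearing
toy['TimeDiff'] = (hearing - issued).dt.days
toy['issue_month'] = issued.dt.month
toy['hearing_dayofweek'] = hearing.dt.dayofweek  # Monday is 0

# One indicator column per category; numeric columns pass through unchanged
feature_cols = ['disposition', 'TimeDiff', 'issue_month', 'hearing_dayofweek']
X_toy = pd.get_dummies(toy[feature_cols])

print(toy[['TimeDiff', 'issue_month', 'hearing_dayofweek']])
print(X_toy.columns.tolist())
```

The disposition column is replaced by one column per category, named disposition_<category>, while the numeric columns are left alone.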

Split into test and training data sets:

Now we split the data into training and test sets. The test set is held out untouched and used only to evaluate how well the models trained on the training set perform.

#### split into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_new, y, random_state=0, test_size=.1)
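With classes this unbalanced, it can be worth checking that the small test set ends up with a similar share of compliant tickets as the training set. The sketch below uses synthetic labels with roughly the 12,000 to 148,000 ratio mentioned earlier. Note that stratify=y_toy is an option I am adding for illustration; it is not used in the split above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with about 7.5% positives (not the real data)
y_toy = np.array([1] * 75 + [0] * 925)
X_toy = np.arange(1000).reshape(-1, 1)

# stratify keeps the class proportions nearly equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=0, test_size=0.1, stratify=y_toy)

print(len(y_tr), len(y_te))
print('share of positives (train, test):', y_tr.mean(), y_te.mean())
```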

Preprocess data:

Before we start training the model, we should scale the data to reduce the effect of outliers. How much this helps depends on the learning algorithm: decision trees are insensitive to feature scaling, but many other algorithms work better with standardized data.

#### scale the data so things are nice
from sklearn.preprocessing import RobustScaler

# scaler = MinMaxScaler() # not as robust to outliers
scaler = RobustScaler(quantile_range=(25, 75))

X_train_scaled = scaler.fit_transform(X_train)
# we must apply the same scaling to the test set that we computed for the training set
X_test_scaled = scaler.transform(X_test)
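To see what "not as robust to outliers" means in practice, here is a small comparison on invented fine amounts with one extreme value. The numbers are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Invented fine amounts with a single extreme value
fines = np.array([[50.], [100.], [200.], [250.], [250.], [10000.]])

# MinMaxScaler maps the minimum to 0 and the maximum to 1
minmax = MinMaxScaler().fit_transform(fines).ravel()

# RobustScaler subtracts the median and divides by the interquartile range
robust = RobustScaler(quantile_range=(25, 75)).fit_transform(fines).ravel()

print(np.round(minmax, 3))
print(np.round(robust, 3))
```

With MinMaxScaler the single large fine squeezes the five typical values into a tiny interval near 0, whereas RobustScaler keeps them spread around 0 and leaves the outlier as a large value.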

An additional preprocessing step that could be done before training the model is resampling the data sets, since we have an unbalanced number of compliant vs. non-compliant observations. I tested both upsampling the minority class and downsampling the majority class, plus a combination of both. For more details, see the code on the GitHub page.
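As a hedged sketch of what upsampling and downsampling can look like, here is one way to do it with sklearn.utils.resample on a synthetic frame. This is not necessarily how the code on GitHub does it, and the frame below is invented. Resampling would normally be applied to the training set only, so the test set keeps the real class balance.

```python
import pandas as pd
from sklearn.utils import resample

# Synthetic training frame: 90 non-compliant rows and 10 compliant rows
train = pd.DataFrame({'fine_amount': range(100),
                      'compliance': [0] * 90 + [1] * 10})

majority = train[train['compliance'] == 0]
minority = train[train['compliance'] == 1]

# Upsample: draw minority rows with replacement until the classes match
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
upsampled = pd.concat([majority, minority_up])

# Downsample: keep a random subset of the majority, without replacement
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=0)
downsampled = pd.concat([majority_down, minority])

print(upsampled['compliance'].value_counts().to_dict())
print(downsampled['compliance'].value_counts().to_dict())
```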

Choose classifier:

Now we choose the classifier, from sklearn. Here I’m showing a GradientBoostingClassifier, but I also tested RandomForestClassifier and DecisionTreeClassifier.

from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(learning_rate=0.1, n_estimators=200)

Grid search over model parameters:

In order to test different input parameters for the classifier, I used GridSearchCV from sklearn.model_selection, changing the max_features and max_depth. This fits the model to the data using different potential parameters.

from sklearn.model_selection import GridSearchCV

#### define parameters for grid search
grid_values = {'max_features': ['sqrt','log2'], 'max_depth':[4,6,8]}

#### do grid search, then fit to scaled data
grid_clf_auc = GridSearchCV(clf, param_grid=grid_values, scoring='roc_auc')
grid_clf_auc.fit(X_train_scaled, y_train)
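For anyone who wants to try the grid search without the blight data, here is a self-contained sketch on a small synthetic problem. The dataset, the reduced number of trees, the smaller grid, and cv=3 are all assumptions made to keep the run short; they are not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic, unbalanced problem (not the blight data)
X_toy, y_toy = make_classification(n_samples=300, n_features=8,
                                   weights=[0.85, 0.15], random_state=0)

# Fewer trees and a smaller grid than in the post, to keep the run short
clf_toy = GradientBoostingClassifier(learning_rate=0.1, n_estimators=20,
                                     random_state=0)
grid_toy = {'max_features': ['sqrt', 'log2'], 'max_depth': [2, 4]}

# Every parameter combination is scored by cross-validated ROC AUC
search = GridSearchCV(clf_toy, param_grid=grid_toy, scoring='roc_auc', cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_)  # winning parameter combination
print(search.best_score_)   # mean cross-validated ROC AUC for that combination
```

After fitting, best_estimator_ holds a model refit on all of the data passed to fit, using the winning parameters.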

Evaluate models:

Once the grid search has been performed, we can see how well the best estimator performed:


This is the roc_auc score, which I’ll explain more in a bit. Essentially, a score of 1 would mean your model predicts the data perfectly, 0.5 would mean your model has no predictive value, and 0 would mean your model anti-predicts the data (the answer is always wrong). A score of 0.82 is pretty good in this case, but this is on the training data set. What is the score for the test data that the classifier hasn’t seen yet?
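Those three reference points can be checked on a tiny hand-made example. The labels and scores below are invented.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]

# Positives always get higher scores than negatives
perfect = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])

# Every ticket gets the same score, so the ranking carries no information
uninformative = roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5])

# Positives always get lower scores than negatives
inverted = roc_auc_score(y_true, [0.9, 0.8, 0.2, 0.1])

print(perfect, uninformative, inverted)  # 1.0 0.5 0.0
```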

from sklearn.metrics import roc_curve, auc

#### pull out the best estimator from the grid search
best_clf = grid_clf_auc.best_estimator_

#### calculate predicted probabilities with the chosen classifier
y_proba = best_clf.predict_proba(X_test_scaled)

#### calculate false and true positive rates for roc curve
fpr, tpr, _ = roc_curve(y_test, y_proba[:,1])
roc_auc = auc(fpr, tpr)

#### print out auc value for the test data
print(roc_auc)

What’s been done here with the roc_curve call is testing the true result (y_test) against the probabilistically predicted result (y_proba) from the test dataset, and giving us a false positive rate (fpr) and a true positive rate (tpr) for changing thresholds of the y_proba value. If we plot the tpr vs the fpr we get this:

The ideal position on this ROC (receiver operating characteristic) plot would be in the upper left corner: there we would correctly predict 100% of the compliant cases with zero false positives. The red line shows where we would be if we just randomly guessed. The AUC, or area under the curve, is the integral of the blue curve.
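To make the "area under the curve" concrete, here is a small sketch that computes the ROC points for a handful of invented predictions and integrates them by hand with the trapezoidal rule, next to sklearn's auc.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Invented labels and predicted probabilities of compliance
y_true = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.6, 0.7, 0.9])

# One (fpr, tpr) point for each threshold on y_score
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Trapezoidal rule by hand: strip width times mean strip height, summed
manual_auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)

print(np.column_stack([fpr, tpr]))
print(manual_auc, auc(fpr, tpr))
```

The result equals 8/9 here, which is also the fraction of (compliant, non-compliant) pairs in which the compliant ticket got the higher score.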

Finishing up:

At this point, the whole dataflow I’ve described could be applied to the given test dataset (or any future blight tickets issued) to come up with a probability of compliance. There are a number of ways the model could be improved, which I may discuss in the future. These include more testing of resampling of the data set, testing different machine learning classifiers, adding new features or massaging existing ones, and adding new data sets like those listed in the Kaggle description, such as building permits or parcel information.