Example usage

Here, we demonstrate how to use fasterrisk to generate sparse risk scoring systems:

Download and Read Sample Data

Imports

from fasterrisk.fasterrisk import RiskScoreOptimizer, RiskScoreClassifier
from fasterrisk.utils import download_file_from_google_drive
import os.path

import numpy as np
import pandas as pd
import time

Download Sample Data

from pathlib import Path
Path("../tests").mkdir(parents=True, exist_ok=True) # create the "../tests" directory if it doesn't exist

train_data_file_path = "../tests/adult_train_data.csv"
test_data_file_path = "../tests/adult_test_data.csv"

if not os.path.isfile(train_data_file_path):
    download_file_from_google_drive('1nuWn0QVG8tk3AN4I4f3abWLcFEP3WPec', train_data_file_path)
if not os.path.isfile(test_data_file_path):
    download_file_from_google_drive('1TyBO02LiGfHbatPWU4nzc8AndtIF-7WH', test_data_file_path)

Read Sample Data

train_df = pd.read_csv(train_data_file_path)
train_data = np.asarray(train_df)
X_train, y_train = train_data[:, 1:], train_data[:, 0]

test_df = pd.read_csv(test_data_file_path)
test_data = np.asarray(test_df)
X_test, y_test = test_data[:, 1:], test_data[:, 0]
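
The slicing above assumes that the label sits in the first CSV column and the features fill the remaining columns. A minimal sketch of the same split on a made-up array (the values are illustrative only and are not taken from the adult dataset):

```python
import numpy as np

# Hypothetical toy data laid out like the CSV files above:
# column 0 holds the label, the remaining columns hold binary features.
toy_data = np.array([
    [ 1, 0, 1, 1],
    [-1, 1, 0, 0],
    [-1, 0, 0, 1],
])

# Same slicing pattern as in the tutorial
X_toy, y_toy = toy_data[:, 1:], toy_data[:, 0]

print(X_toy.shape)  # (3, 3)
print(y_toy)        # [ 1 -1 -1]
```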

Train Risk Score Models

Create RiskScoreOptimizer and Perform Optimization

sparsity = 5
parent_size = 10

RiskScoreOptimizer_m = RiskScoreOptimizer(X = X_train, y = y_train, k = sparsity, parent_size = parent_size)
start_time = time.time()
RiskScoreOptimizer_m.optimize()
print("Optimization takes {:.2f} seconds.".format(time.time() - start_time))
Optimization takes 10.26 seconds.

Get Risk Score Models

multipliers, sparseDiversePool_beta0_integer, sparseDiversePool_betas_integer = RiskScoreOptimizer_m.get_models()
print("We generate {} risk score models from the sparse diverse pool".format(len(multipliers)))
We generate 50 risk score models from the sparse diverse pool
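
One way to choose among the models in the pool is to compare a loss value for each of them. The sketch below uses only NumPy and made-up numbers. It assumes that the three returned collections are parallel (one multiplier, one intercept, and one coefficient vector per model), that labels are coded as -1/+1, and that the score is passed through a logistic link; the helper `logistic_loss` is hypothetical and not part of fasterrisk.

```python
import numpy as np

# Hypothetical pool of 3 models over 4 binary features (made-up values)
multipliers = np.array([1.0, 2.0, 1.5])
intercepts = np.array([-1.0, -2.0, 0.0])
coefficients = np.array([
    [2.0, 0.0, 0.0, -1.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, -2.0, 0.0],
])

# Toy design matrix and labels in {-1, +1}
X = np.array([[1, 0, 0, 0], [0, 1, 1, 1], [0, 0, 1, 0], [1, 1, 0, 0]])
y = np.array([1, 1, -1, 1])

def logistic_loss(multiplier, intercept, coef, X, y):
    """Sum of log(1 + exp(-y * score)) with score = (intercept + X @ coef) / multiplier."""
    scores = (intercept + X @ coef) / multiplier
    return float(np.sum(np.logaddexp(0.0, -y * scores)))

losses = [logistic_loss(m, b0, b, X, y)
          for m, b0, b in zip(multipliers, intercepts, coefficients)]
best = int(np.argmin(losses))  # index of the model with the smallest loss
print(best)
```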

Access the first risk score model

model_index = 0 # first model
multiplier = multipliers[model_index]
intercept = sparseDiversePool_beta0_integer[model_index]
coefficients = sparseDiversePool_betas_integer[model_index]

Use the first risk score model to make predictions

RiskScoreClassifier_m = RiskScoreClassifier(multiplier, intercept, coefficients, X_train = X_train)
y_test_pred = RiskScoreClassifier_m.predict(X_test)
print("y_test are predicted to be {}".format(y_test_pred))
y_test are predicted to be [-1 -1 -1 ... -1 -1 -1]
y_test_pred_prob = RiskScoreClassifier_m.predict_prob(X_test)
print("The risk probabilities of having y_test to be +1 are {}".format(y_test_pred_prob))
The risk probabilities of having y_test to be +1 are [0.13308868 0.34872682 0.34872682 ... 0.04216029 0.34872682 0.04216029]
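
To make the link between an integer score and a risk probability concrete, here is a minimal NumPy sketch. It assumes the usual logistic form for risk scores, risk = sigmoid((intercept + score) / multiplier). The model values are made up, and thresholding at 0.5 to obtain -1/+1 labels is an assumption for illustration rather than a statement about the library's internals.

```python
import numpy as np

def risk_from_features(X, multiplier, intercept, coefficients):
    """Logistic risk: sigmoid((intercept + X @ coefficients) / multiplier)."""
    scores = X @ coefficients  # integer score of each sample
    return 1.0 / (1.0 + np.exp(-(intercept + scores) / multiplier))

# Hypothetical model (not one produced in this tutorial)
multiplier = 2.0
intercept = 4.0
coefficients = np.array([-3.0, -2.0, -3.0])

# Three toy samples with binary features; their scores are -8, -5, and 0
X = np.array([[1, 1, 1], [0, 1, 1], [0, 0, 0]])

probs = risk_from_features(X, multiplier, intercept, coefficients)
labels = np.where(probs >= 0.5, 1, -1)  # assumed decision rule

print(np.round(probs, 3))
print(labels)
```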

Additional Tutorial on Binarizing Continuous Features

If your data has continuous features, we recommend converting them to binary features as a preprocessing step, which makes the final model more interpretable. We use the public PIMA dataset to show how to do this.

Download the PIMA dataset

pima_original_data_file_path = "../tests/pima_original_data.csv"
if not os.path.isfile(pima_original_data_file_path):
    download_file_from_google_drive('184JhmJiSEUiBCo8ySAD8adDn_S9rjmjM', pima_original_data_file_path)

pima_original_data_df = pd.read_csv(pima_original_data_file_path)

X_original_df = pima_original_data_df.drop(columns="Outcome") # drop the Outcome column, which stores the y label for this binary classification problem

X_original_df
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age
0              6      148             72             35        0  33.6                     0.627   50
1              1       85             66             29        0  26.6                     0.351   31
2              8      183             64              0        0  23.3                     0.672   32
3              1       89             66             23       94  28.1                     0.167   21
4              0      137             40             35      168  43.1                     2.288   33
...          ...      ...            ...            ...      ...   ...                       ...  ...
763           10      101             76             48      180  32.9                     0.171   63
764            2      122             70             27        0  36.8                     0.340   27
765            5      121             72             23      112  26.2                     0.245   30
766            1      126             60              0        0  30.1                     0.349   47
767            1       93             70             31        0  30.4                     0.315   23

768 rows × 8 columns

Convert the dataframe with continuous features to a new dataframe with binary features

from fasterrisk.binarization_util import convert_continuous_df_to_binary_df

X_binarized_df = convert_continuous_df_to_binary_df(X_original_df)
X_binarized_df
Converting continuous features to binary features in the dataframe......
If a feature has more than 100 unqiue values, we pick the threasholds by selecting 100 quantile points. You can change the number of thresholds by passing another specified number: convert_continuous_df_to_binary_df(df, num_quantiles=50).
Finish converting continuous features to binary features......
Pregnancies<=0 Pregnancies<=1 Pregnancies<=2 Pregnancies<=3 Pregnancies<=4 Pregnancies<=5 Pregnancies<=6 Pregnancies<=7 Pregnancies<=8 Pregnancies<=9 ... Age<=62 Age<=63 Age<=64 Age<=65 Age<=66 Age<=67 Age<=68 Age<=69 Age<=70 Age<=72
0 0 0 0 0 0 0 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
1 0 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
2 0 0 0 0 0 0 0 0 1 1 ... 1 1 1 1 1 1 1 1 1 1
3 0 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
763 0 0 0 0 0 0 0 0 0 0 ... 0 1 1 1 1 1 1 1 1 1
764 0 0 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
765 0 0 0 0 0 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
766 0 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
767 0 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1

768 rows × 559 columns

You can then use X_binarized_df as your new design matrix and input to the FasterRisk algorithm!
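
To illustrate what threshold binarization does, here is a simplified sketch. It is not the library's implementation: the helper `binarize_by_thresholds` is hypothetical, it uses every unique value except the maximum as a threshold, and it ignores the quantile-based selection mentioned in the log message above.

```python
import numpy as np
import pandas as pd

def binarize_by_thresholds(df):
    """Toy binarization: one 'col<=t' indicator per unique value t, except the maximum."""
    columns = {}
    for col in df.columns:
        # 'col<=max' would be 1 for every row, so the largest value is skipped
        thresholds = np.sort(df[col].unique())[:-1]
        for t in thresholds:
            columns[f"{col}<={t}"] = (df[col] <= t).astype(int)
    return pd.DataFrame(columns)

# Made-up data with two continuous columns
toy_df = pd.DataFrame({"Age": [21, 35, 50], "BMI": [26.6, 33.6, 26.6]})
binary_df = binarize_by_thresholds(toy_df)

print(list(binary_df.columns))  # ['Age<=21', 'Age<=35', 'BMI<=26.6']
print(binary_df.to_numpy())
```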

Tutorial on Group-sparsity Constrained Model

We again use the PIMA dataset to illustrate how to produce scoring systems under a group-sparsity constraint.

Binarization with Group Information

X_binarized_df, featureIndex_to_groupIndex = convert_continuous_df_to_binary_df(X_original_df, get_featureIndex_to_groupIndex=True)
Converting continuous features to binary features in the dataframe......
If a feature has more than 100 unqiue values, we pick the threasholds by selecting 100 quantile points. You can change the number of thresholds by passing another specified number: convert_continuous_df_to_binary_df(df, num_quantiles=50).
Finish converting continuous features to binary features......

We still obtain the same preprocessed binary features:

X_binarized_df
Pregnancies<=0 Pregnancies<=1 Pregnancies<=2 Pregnancies<=3 Pregnancies<=4 Pregnancies<=5 Pregnancies<=6 Pregnancies<=7 Pregnancies<=8 Pregnancies<=9 ... Age<=62 Age<=63 Age<=64 Age<=65 Age<=66 Age<=67 Age<=68 Age<=69 Age<=70 Age<=72
0 0 0 0 0 0 0 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
1 0 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
2 0 0 0 0 0 0 0 0 1 1 ... 1 1 1 1 1 1 1 1 1 1
3 0 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
763 0 0 0 0 0 0 0 0 0 0 ... 0 1 1 1 1 1 1 1 1 1
764 0 0 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
765 0 0 0 0 0 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
766 0 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
767 0 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1

768 rows × 559 columns

However, we now also have a new variable, featureIndex_to_groupIndex, which stores the group index of each binary feature, that is, which continuous feature it was derived from.

featureIndex_to_groupIndex
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6,
       6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
       6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
       6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
       6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
       6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
       7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
       7, 7, 7, 7, 7, 7, 7, 7, 7])
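
A long index array like this is easier to read when summarized per group. The sketch below counts how many binary features belong to each original column, using a small made-up mapping; the variable names are hypothetical stand-ins for featureIndex_to_groupIndex and the original column names.

```python
import numpy as np

# Hypothetical mapping: 3 binary features derived from 2 original columns
original_columns = ["Age", "BMI"]
feature_to_group = np.array([0, 0, 1])

# Number of binary features in each group
counts = np.bincount(feature_to_group, minlength=len(original_columns))

for name, n in zip(original_columns, counts):
    print(f"{name}: {n} binary feature(s)")
```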

Train the model with the group-sparsity constraint

y_train = np.asarray(pima_original_data_df["Outcome"].values)
X_train = np.asarray(X_binarized_df)

sparsity = 5
group_sparsity = 2
parent_size = 10

RiskScoreOptimizer_m = RiskScoreOptimizer(X = X_train, y = y_train, k = sparsity, parent_size = parent_size,
                                          group_sparsity = group_sparsity,
                                          featureIndex_to_groupIndex = featureIndex_to_groupIndex)
start_time = time.time()
RiskScoreOptimizer_m.optimize()
print("Optimization takes {:.2f} seconds.".format(time.time() - start_time))
Optimization takes 2.02 seconds.

Print the First Model Card

multipliers, sparseDiversePool_beta0_integer, sparseDiversePool_betas_integer = RiskScoreOptimizer_m.get_models()

model_index = 0 # first model

multiplier = multipliers[model_index]
intercept = sparseDiversePool_beta0_integer[model_index]
coefficients = sparseDiversePool_betas_integer[model_index]

X_featureNames = list(X_binarized_df.columns)

RiskScoreClassifier_m = RiskScoreClassifier(multiplier, intercept, coefficients, X_train = X_train)
RiskScoreClassifier_m.reset_featureNames(X_featureNames)
RiskScoreClassifier_m.print_model_card()
The Risk Score is:
1.                    Glucose<=99.80000000000001     -3 point(s) |   ...
2.                                Glucose<=129.5     -2 point(s) | + ...
3.                               Glucose<=165.95     -3 point(s) | + ...
4.                       BMI<=27.316000000000003     -4 point(s) | + ...
5.                        BMI<=48.15999999999999     -4 point(s) | + ...
                                                           SCORE | =    
SCORE |  -16.0  |  -14.0  |  -13.0  |  -12.0  |  -11.0  |  -10.0  |  -9.0  |  -8.0  |
RISK  |   1.8% |   4.8% |   7.6% |  11.9% |  18.3% |  26.9% |  37.8% |  50.0% |
SCORE |  -7.0  |  -6.0  |  -5.0  |  -4.0  |  -3.0  |  -2.0  |   0.0  |
RISK  |  62.2% |  73.1% |  81.7% |  88.1% |  92.4% |  95.2% |  98.2% |

Indeed, the scoring system above uses only 2 groups (Glucose and BMI), as required by group_sparsity = 2.
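
The number of groups used by a model can also be checked programmatically. The following sketch uses made-up arrays and assumes that the coefficient vector is one-dimensional and aligned with the columns of the binarized design matrix.

```python
import numpy as np

# Hypothetical example: 6 binary features derived from 3 original columns
feature_to_group = np.array([0, 0, 1, 1, 2, 2])
coefficients = np.array([-3, -2, 0, 0, -4, 0])

selected = np.flatnonzero(coefficients)                  # indices of features with nonzero points
groups_used = np.unique(feature_to_group[selected])     # distinct groups among those features

print(len(selected), len(groups_used))  # 3 2
```

Under the same assumption, applying these two lines to coefficients and featureIndex_to_groupIndex from this tutorial should list the groups that appear in the model card.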