Example usage
Here, we demonstrate how to use fasterrisk to generate sparse risk scoring systems:
Download and Read Sample Data
Imports
from fasterrisk.fasterrisk import RiskScoreOptimizer, RiskScoreClassifier
from fasterrisk.utils import download_file_from_google_drive
import os.path
import numpy as np
import pandas as pd
import time
Download Sample Data
from pathlib import Path
Path("../tests").mkdir(parents=True, exist_ok=True) # create the "../tests" directory if it doesn't exist
train_data_file_path = "../tests/adult_train_data.csv"
test_data_file_path = "../tests/adult_test_data.csv"
if not os.path.isfile(train_data_file_path):
    download_file_from_google_drive('1nuWn0QVG8tk3AN4I4f3abWLcFEP3WPec', train_data_file_path)
if not os.path.isfile(test_data_file_path):
    download_file_from_google_drive('1TyBO02LiGfHbatPWU4nzc8AndtIF-7WH', test_data_file_path)
Read Sample Data
train_df = pd.read_csv(train_data_file_path)
train_data = np.asarray(train_df)
X_train, y_train = train_data[:, 1:], train_data[:, 0]
test_df = pd.read_csv(test_data_file_path)
test_data = np.asarray(test_df)
X_test, y_test = test_data[:, 1:], test_data[:, 0]
Train Risk Score Models
Create RiskScoreOptimizer and Perform Optimization
sparsity = 5
parent_size = 10
RiskScoreOptimizer_m = RiskScoreOptimizer(X = X_train, y = y_train, k = sparsity, parent_size = parent_size)
start_time = time.time()
RiskScoreOptimizer_m.optimize()
print("Optimization takes {:.2f} seconds.".format(time.time() - start_time))
Optimization takes 10.26 seconds.
Get Risk Score Models
multipliers, sparseDiversePool_beta0_integer, sparseDiversePool_betas_integer = RiskScoreOptimizer_m.get_models()
print("We generate {} risk score models from the sparse diverse pool".format(len(multipliers)))
We generate 50 risk score models from the sparse diverse pool
Access the first risk score model
model_index = 0 # first model
multiplier = multipliers[model_index]
intercept = sparseDiversePool_beta0_integer[model_index]
coefficients = sparseDiversePool_betas_integer[model_index]
Use the first risk score model to do prediction
RiskScoreClassifier_m = RiskScoreClassifier(multiplier, intercept, coefficients, X_train = X_train)
y_test_pred = RiskScoreClassifier_m.predict(X_test)
print("y_test are predicted to be {}".format(y_test_pred))
y_test are predicted to be [-1 -1 -1 ... -1 -1 -1]
y_test_pred_prob = RiskScoreClassifier_m.predict_prob(X_test)
print("The risk probabilities of having y_test to be +1 are {}".format(y_test_pred_prob))
The risk probabilities of having y_test to be +1 are [0.13308868 0.34872682 0.34872682 ... 0.04216029 0.34872682 0.04216029]
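The labels and probabilities printed above are consistent with a 0.5 cutoff: every probability shown is below 0.5 and every label shown is -1. The sketch below illustrates that relationship. It assumes that a label of +1 corresponds to a risk above 0.5, which the output suggests but does not prove; `labels_from_probabilities` is a hypothetical helper, not part of fasterrisk.

```python
def labels_from_probabilities(probs, cutoff=0.5):
    """Map risk probabilities to labels in {-1, +1} (hypothetical helper)."""
    return [1 if p > cutoff else -1 for p in probs]

# The six probabilities visible in the printed output above.
shown_probs = [0.13308868, 0.34872682, 0.34872682,
               0.04216029, 0.34872682, 0.04216029]
shown_labels = labels_from_probabilities(shown_probs)
print(shown_labels)
```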
Print the first model card
X_featureNames = list(train_df.columns[1:])
RiskScoreClassifier_m.reset_featureNames(X_featureNames)
RiskScoreClassifier_m.print_model_card()
The Risk Score is:
1. Age_22_to_29 -2 point(s) | ...
2. HSDiploma -2 point(s) | + ...
3. NoHS -4 point(s) | + ...
4. Married 4 point(s) | + ...
5. AnyCapitalGains 3 point(s) | + ...
SCORE | =
SCORE | -8.0 | -6.0 | -5.0 | -4.0 | -3.0 | -2.0 | -1.0 |
RISK | 0.1% | 0.4% | 0.7% | 1.2% | 2.3% | 4.2% | 7.6% |
SCORE | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 7.0 |
RISK | 13.3% | 22.3% | 34.9% | 50.0% | 65.1% | 77.7% | 92.4% |
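The SCORE and RISK rows of the card appear to follow a logistic curve: the risk is 50.0% at a score of 3 and moves by a fixed amount of log-odds per point. The sketch below fits that curve through two printed rows and evaluates it at other scores. It assumes the relationship is exactly logistic in the total score; `slope` and `offset` are names introduced here and are inferred from the printed card, not read from the fasterrisk model (under this assumption, `slope` would plausibly correspond to 1 / multiplier).

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-z))

# Two (score, risk) pairs read off the model card above.
score_a, risk_a = 3.0, 0.500
score_b, risk_b = 4.0, 0.651

# Fit logit(risk) = slope * score + offset through the two points.
slope = (logit(risk_b) - logit(risk_a)) / (score_b - score_a)
offset = logit(risk_a) - slope * score_a

def risk_from_score(score):
    """Risk implied by the fitted logistic curve (illustrative only)."""
    return sigmoid(slope * score + offset)

for score in (-8.0, 0.0, 2.0, 5.0, 7.0):
    print("score {:5.1f} -> risk {:5.1f}%".format(score, 100 * risk_from_score(score)))
```

Because the inputs are the rounded percentages from the card, the reconstructed risks can differ from the printed ones by about a tenth of a percentage point.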
Print Top 10 Model Cards from the Pool and their performance metrics
num_models = min(10, len(multipliers))
for model_index in range(num_models):
    # Select the model at this position in the sparse diverse pool
    multiplier = multipliers[model_index]
    intercept = sparseDiversePool_beta0_integer[model_index]
    coefficients = sparseDiversePool_betas_integer[model_index]
    RiskScoreClassifier_m = RiskScoreClassifier(multiplier, intercept, coefficients)
    RiskScoreClassifier_m.reset_featureNames(X_featureNames)
    RiskScoreClassifier_m.print_model_card()
    # Evaluate the model on the training and test sets
    train_loss = RiskScoreClassifier_m.compute_logisticLoss(X_train, y_train)
    train_acc, train_auc = RiskScoreClassifier_m.get_acc_and_auc(X_train, y_train)
    test_acc, test_auc = RiskScoreClassifier_m.get_acc_and_auc(X_test, y_test)
    print("The logistic loss on the training set is {}".format(train_loss))
    print("The training accuracy and AUC are {:.3f}% and {:.3f}".format(train_acc*100, train_auc))
    print("The test accuracy and AUC are {:.3f}% and {:.3f}\n".format(test_acc*100, test_auc))
The Risk Score is:
1. Age_22_to_29 -2 point(s) | ...
2. HSDiploma -2 point(s) | + ...
3. NoHS -4 point(s) | + ...
4. Married 4 point(s) | + ...
5. AnyCapitalGains 3 point(s) | + ...
SCORE | =
SCORE | -8.0 | -6.0 | -5.0 | -4.0 | -3.0 | -2.0 | -1.0 |
RISK | 0.1% | 0.4% | 0.7% | 1.2% | 2.3% | 4.2% | 7.6% |
SCORE | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 7.0 |
RISK | 13.3% | 22.3% | 34.9% | 50.0% | 65.1% | 77.7% | 92.4% |
The logistic loss on the training set is 9798.652346518873
The training accuracy and AUC are 82.575% and 0.862
The test accuracy and AUC are 81.787% and 0.856
The Risk Score is:
1. HSDiploma -2 point(s) | ...
2. NoHS -4 point(s) | + ...
3. Married 4 point(s) | + ...
4. WorkHrsPerWeek_lt_40 -2 point(s) | + ...
5. AnyCapitalGains 3 point(s) | + ...
SCORE | =
SCORE | -8.0 | -6.0 | -5.0 | -4.0 | -3.0 | -2.0 | -1.0 |
RISK | 0.1% | 0.4% | 0.7% | 1.3% | 2.5% | 4.4% | 7.9% |
SCORE | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 7.0 |
RISK | 13.7% | 22.7% | 35.1% | 50.0% | 64.9% | 77.3% | 92.1% |
The logistic loss on the training set is 9859.61575793142
The training accuracy and AUC are 82.333% and 0.860
The test accuracy and AUC are 81.849% and 0.854
The Risk Score is:
1. HSDiploma -3 point(s) | ...
2. NoHS -5 point(s) | + ...
3. JobManagerial 2 point(s) | + ...
4. Married 5 point(s) | + ...
5. AnyCapitalGains 3 point(s) | + ...
SCORE | =
SCORE | -8.0 | -6.0 | -5.0 | -3.0 | -2.0 | -1.0 | 0.0 |
RISK | 0.2% | 0.6% | 1.0% | 2.7% | 4.4% | 7.2% | 11.4% |
SCORE | 2.0 | 3.0 | 4.0 | 5.0 | 7.0 | 8.0 | 10.0 |
RISK | 26.4% | 37.5% | 50.0% | 62.5% | 82.3% | 88.6% | 95.6% |
The logistic loss on the training set is 9883.324461826953
The training accuracy and AUC are 82.268% and 0.860
The test accuracy and AUC are 81.511% and 0.854
The Risk Score is:
1. HSDiploma -3 point(s) | ...
2. NoHS -5 point(s) | + ...
3. Married 5 point(s) | + ...
4. WorkHrsPerWeek_geq_50 1 point(s) | + ...
5. AnyCapitalGains 3 point(s) | + ...
SCORE | =
SCORE | -8.0 | -7.0 | -5.0 | -4.0 | -3.0 | -2.0 | -1.0 | 0.0 |
RISK | 0.2% | 0.3% | 0.9% | 1.5% | 2.5% | 4.1% | 6.8% | 10.9% |
SCORE | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 8.0 | 9.0 |
RISK | 17.2% | 25.9% | 37.2% | 50.0% | 62.8% | 74.1% | 89.1% | 93.2% |
The logistic loss on the training set is 9895.728067750335
The training accuracy and AUC are 82.180% and 0.861
The test accuracy and AUC are 81.342% and 0.856
The Risk Score is:
1. Age_45_to_59 1 point(s) | ...
2. HSDiploma -2 point(s) | + ...
3. NoHS -5 point(s) | + ...
4. Married 4 point(s) | + ...
5. AnyCapitalGains 3 point(s) | + ...
SCORE | =
SCORE | -7.0 | -6.0 | -5.0 | -4.0 | -3.0 | -2.0 | -1.0 | 0.0 |
RISK | 0.2% | 0.4% | 0.7% | 1.2% | 2.0% | 3.5% | 5.9% | 9.9% |
SCORE | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 |
RISK | 16.0% | 24.9% | 36.5% | 50.0% | 63.5% | 75.1% | 84.0% | 90.1% |
The logistic loss on the training set is 9914.75974232043
The training accuracy and AUC are 80.656% and 0.863
The test accuracy and AUC are 80.052% and 0.856
The Risk Score is:
1. HSDiploma -3 point(s) | ...
2. NoHS -5 point(s) | + ...
3. Married 5 point(s) | + ...
4. AnyCapitalGains 3 point(s) | + ...
5. AnyCapitalLoss 2 point(s) | + ...
SCORE | =
SCORE | -8.0 | -6.0 | -5.0 | -3.0 | -2.0 | -1.0 | 0.0 |
RISK | 0.2% | 0.5% | 0.8% | 2.3% | 3.9% | 6.4% | 10.5% |
SCORE | 2.0 | 3.0 | 4.0 | 5.0 | 7.0 | 8.0 | 10.0 |
RISK | 25.5% | 36.9% | 50.0% | 63.1% | 83.3% | 89.5% | 96.1% |
The logistic loss on the training set is 9923.881690282931
The training accuracy and AUC are 82.180% and 0.857
The test accuracy and AUC are 81.342% and 0.852
The Risk Score is:
1. HSDiploma -2 point(s) | ...
2. ProfVocOrAS -1 point(s) | + ...
3. NoHS -4 point(s) | + ...
4. Married 3 point(s) | + ...
5. AnyCapitalGains 2 point(s) | + ...
SCORE | =
SCORE | -7.0 | -6.0 | -5.0 | -4.0 | -3.0 | -2.0 | -1.0 |
RISK | 0.1% | 0.1% | 0.3% | 0.8% | 1.7% | 3.7% | 8.0% |
SCORE | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
RISK | 16.4% | 30.7% | 50.0% | 69.3% | 83.6% | 92.0% |
The logistic loss on the training set is 9980.639483585337
The training accuracy and AUC are 82.172% and 0.856
The test accuracy and AUC are 81.235% and 0.849
The Risk Score is:
1. HSDiploma -2 point(s) | ...
2. NoHS -4 point(s) | + ...
3. Married 3 point(s) | + ...
4. NeverMarried -1 point(s) | + ...
5. AnyCapitalGains 2 point(s) | + ...
SCORE | =
SCORE | -7.0 | -6.0 | -5.0 | -4.0 | -3.0 | -2.0 | -1.0 |
RISK | 0.1% | 0.3% | 0.6% | 1.2% | 2.4% | 4.9% | 9.8% |
SCORE | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
RISK | 18.6% | 32.3% | 50.0% | 67.7% | 81.4% | 90.2% |
The logistic loss on the training set is 9988.041001585896
The training accuracy and AUC are 82.180% and 0.855
The test accuracy and AUC are 81.342% and 0.849
The Risk Score is:
1. HSDiploma -2 point(s) | ...
2. NoHS -4 point(s) | + ...
3. Married 4 point(s) | + ...
4. DivorcedOrSeparated 1 point(s) | + ...
5. AnyCapitalGains 2 point(s) | + ...
SCORE | =
SCORE | -6.0 | -5.0 | -4.0 | -3.0 | -2.0 | -1.0 | 0.0 |
RISK | 0.1% | 0.3% | 0.6% | 1.2% | 2.6% | 5.1% | 10.1% |
SCORE | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 |
RISK | 18.9% | 32.6% | 50.0% | 67.4% | 81.1% | 89.9% | 94.9% |
The logistic loss on the training set is 10000.803138904072
The training accuracy and AUC are 82.180% and 0.855
The test accuracy and AUC are 81.342% and 0.848
The Risk Score is:
1. HSDiploma -2 point(s) | ...
2. NoHS -4 point(s) | + ...
3. JobService -1 point(s) | + ...
4. Married 4 point(s) | + ...
5. AnyCapitalGains 3 point(s) | + ...
SCORE | =
SCORE | -7.0 | -6.0 | -5.0 | -4.0 | -3.0 | -2.0 | -1.0 | 0.0 |
RISK | 0.2% | 0.4% | 0.7% | 1.3% | 2.4% | 4.3% | 7.7% | 13.4% |
SCORE | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 |
RISK | 22.4% | 35.0% | 50.0% | 65.0% | 77.6% | 86.6% | 92.3% |
The logistic loss on the training set is 10000.838406643863
The training accuracy and AUC are 82.176% and 0.858
The test accuracy and AUC are 81.465% and 0.852
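For readers who want to see what the reported metrics measure, here is a minimal sketch of the standard logistic loss and accuracy for labels in {-1, +1}. It is not the fasterrisk implementation: `logistic_loss` and `accuracy` are hypothetical helpers, the toy scores and labels are made up, and the choice to sum the loss over samples (rather than average it) is an assumption suggested only by the size of the values printed above.

```python
import math

def logistic_loss(scores, labels):
    """Sum of log(1 + exp(-y * s)) over samples, with labels y in {-1, +1}."""
    return sum(math.log1p(math.exp(-y * s)) for s, y in zip(scores, labels))

def accuracy(scores, labels):
    """Fraction of samples whose score sign matches the label."""
    hits = sum(1 for s, y in zip(scores, labels) if (1 if s > 0 else -1) == y)
    return hits / len(labels)

# Toy real-valued scores (log-odds) and labels, made up for illustration.
toy_scores = [2.0, -1.0, 0.5, -3.0]
toy_labels = [1, -1, -1, -1]

print("logistic loss: {:.4f}".format(logistic_loss(toy_scores, toy_labels)))
print("accuracy: {:.1f}%".format(100 * accuracy(toy_scores, toy_labels)))
```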
Additional Tutorial on Binarizing Continuous Features
If your data has continuous features, we recommend converting them to binary features as a preprocessing step, which makes the final model more interpretable. We use the public PIMA dataset to show how.
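To make the idea concrete, the sketch below turns one continuous column into a set of `feature<=threshold` indicator columns. It is a dependency-free illustration, not the fasterrisk routine: `binarize_column` is a hypothetical helper and the thresholds are picked by hand, whereas the library chooses its own thresholds.

```python
def binarize_column(name, values, thresholds):
    """Return {"name<=t": [0/1 indicators]} for each threshold t."""
    return {
        "{}<={}".format(name, t): [1 if v <= t else 0 for v in values]
        for t in thresholds
    }

# Pregnancies counts from the first five rows of the PIMA data;
# the thresholds are chosen by hand for illustration.
pregnancies = [6, 1, 8, 1, 0]
binary_columns = binarize_column("Pregnancies", pregnancies, thresholds=[0, 1, 6, 8])

for column_name, indicators in binary_columns.items():
    print(column_name, indicators)
```

Each indicator column is monotone in the threshold: once a row satisfies `Pregnancies<=t`, it also satisfies every larger threshold.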
Download the PIMA dataset
pima_original_data_file_path = "../tests/pima_original_data.csv"
if not os.path.isfile(pima_original_data_file_path):
    download_file_from_google_drive('184JhmJiSEUiBCo8ySAD8adDn_S9rjmjM', pima_original_data_file_path)
pima_original_data_df = pd.read_csv(pima_original_data_file_path)
X_original_df = pima_original_data_df.drop(columns="Outcome") # drop the Outcome column, which stores the y label for this binary classification problem
X_original_df
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 |
768 rows × 8 columns
Convert the dataframe with continuous features to a new dataframe with binary features
from fasterrisk.binarization_util import convert_continuous_df_to_binary_df
X_binarized_df = convert_continuous_df_to_binary_df(X_original_df)
X_binarized_df
Converting continuous features to binary features in the dataframe......
If a feature has more than 100 unique values, we pick the thresholds by selecting 100 quantile points. You can change the number of thresholds by passing another specified number: convert_continuous_df_to_binary_df(df, num_quantiles=50).
Finish converting continuous features to binary features......
| | Pregnancies<=0 | Pregnancies<=1 | Pregnancies<=2 | Pregnancies<=3 | Pregnancies<=4 | Pregnancies<=5 | Pregnancies<=6 | Pregnancies<=7 | Pregnancies<=8 | Pregnancies<=9 | ... | Age<=62 | Age<=63 | Age<=64 | Age<=65 | Age<=66 | Age<=67 | Age<=68 | Age<=69 | Age<=70 | Age<=72 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 764 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 765 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 766 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 767 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
768 rows × 559 columns
You can then use X_binarized_df as your new design matrix and input to the FasterRisk algorithm!
Tutorial on Group-sparsity Constrained Model
We again use the PIMA dataset, this time to show how to produce scoring systems under a group-sparsity constraint.
Binarization with Group Information
X_binarized_df, featureIndex_to_groupIndex = convert_continuous_df_to_binary_df(X_original_df, get_featureIndex_to_groupIndex=True)
Converting continuous features to binary features in the dataframe......
If a feature has more than 100 unique values, we pick the thresholds by selecting 100 quantile points. You can change the number of thresholds by passing another specified number: convert_continuous_df_to_binary_df(df, num_quantiles=50).
Finish converting continuous features to binary features......
We still obtain the same preprocessed binary features:
X_binarized_df
| | Pregnancies<=0 | Pregnancies<=1 | Pregnancies<=2 | Pregnancies<=3 | Pregnancies<=4 | Pregnancies<=5 | Pregnancies<=6 | Pregnancies<=7 | Pregnancies<=8 | Pregnancies<=9 | ... | Age<=62 | Age<=63 | Age<=64 | Age<=65 | Age<=66 | Age<=67 | Age<=68 | Age<=69 | Age<=70 | Age<=72 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 764 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 765 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 766 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 767 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
768 rows × 559 columns
However, we now also have a new variable, featureIndex_to_groupIndex, which stores the group index of each binary feature. It tells us which continuous feature each binary feature was derived from.
featureIndex_to_groupIndex
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7])
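The array above assigns one group index per original column, in column order. The sketch below rebuilds such a mapping from column names alone, which can help when inspecting or sanity-checking the array. It relies on the `name<=threshold` naming convention visible in the binarized table and is not how the library computes the mapping; `group_index_from_names` and the toy column names are hypothetical.

```python
def group_index_from_names(column_names):
    """Assign one group index per original feature, in order of first appearance."""
    group_of_feature = {}
    mapping = []
    for column_name in column_names:
        original = column_name.split("<=")[0]  # text before the threshold
        if original not in group_of_feature:
            group_of_feature[original] = len(group_of_feature)
        mapping.append(group_of_feature[original])
    return mapping

# Toy column names following the "name<=threshold" pattern (made up).
toy_columns = ["Pregnancies<=0", "Pregnancies<=1", "Glucose<=85", "Glucose<=148", "Age<=31"]
toy_mapping = group_index_from_names(toy_columns)
print(toy_mapping)
```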
Train the model with the group-sparsity constraint
y_train = np.asarray(pima_original_data_df["Outcome"].values)
X_train = np.asarray(X_binarized_df)
sparsity = 5
group_sparsity = 2
parent_size = 10
RiskScoreOptimizer_m = RiskScoreOptimizer(X = X_train, y = y_train, k = sparsity, parent_size = parent_size,
                                          group_sparsity = group_sparsity,
                                          featureIndex_to_groupIndex = featureIndex_to_groupIndex)
start_time = time.time()
RiskScoreOptimizer_m.optimize()
print("Optimization takes {:.2f} seconds.".format(time.time() - start_time))
Optimization takes 2.02 seconds.
Print the First Model Card
multipliers, sparseDiversePool_beta0_integer, sparseDiversePool_betas_integer = RiskScoreOptimizer_m.get_models()
model_index = 0 # first model
multiplier = multipliers[model_index]
intercept = sparseDiversePool_beta0_integer[model_index]
coefficients = sparseDiversePool_betas_integer[model_index]
X_featureNames = list(X_binarized_df.columns)
RiskScoreClassifier_m = RiskScoreClassifier(multiplier, intercept, coefficients, X_train = X_train)
RiskScoreClassifier_m.reset_featureNames(X_featureNames)
RiskScoreClassifier_m.print_model_card()
The Risk Score is:
1. Glucose<=99.80000000000001 -3 point(s) | ...
2. Glucose<=129.5 -2 point(s) | + ...
3. Glucose<=165.95 -3 point(s) | + ...
4. BMI<=27.316000000000003 -4 point(s) | + ...
5. BMI<=48.15999999999999 -4 point(s) | + ...
SCORE | =
SCORE | -16.0 | -14.0 | -13.0 | -12.0 | -11.0 | -10.0 | -9.0 | -8.0 |
RISK | 1.8% | 4.8% | 7.6% | 11.9% | 18.3% | 26.9% | 37.8% | 50.0% |
SCORE | -7.0 | -6.0 | -5.0 | -4.0 | -3.0 | -2.0 | 0.0 |
RISK | 62.2% | 73.1% | 81.7% | 88.1% | 92.4% | 95.2% | 98.2% |
Indeed, the scoring system above draws on only 2 groups (Glucose and BMI), which matches group_sparsity = 2.
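The same check can be done programmatically by counting the distinct groups among the features with nonzero coefficients. The sketch below is a minimal illustration with made-up toy arrays; `count_groups_used` is a hypothetical helper, and with a trained model you would presumably pass `coefficients` and `featureIndex_to_groupIndex` instead.

```python
def count_groups_used(coefficients, feature_to_group):
    """Number of distinct groups among features with a nonzero coefficient."""
    return len({g for c, g in zip(coefficients, feature_to_group) if c != 0})

# Toy example: six binary features spread over groups 0, 1, 2, and 5.
toy_coefficients = [0, -3, -2, 0, -4, 0]
toy_feature_to_group = [0, 1, 1, 2, 5, 5]

groups_used = count_groups_used(toy_coefficients, toy_feature_to_group)
print("groups used:", groups_used)
```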