Example usage

Here, we demonstrate how to use fasterrisk to generate sparse risk scoring systems:

Download and Read Sample Data

Imports

from fasterrisk.fasterrisk import RiskScoreOptimizer, RiskScoreClassifier
from fasterrisk.utils import download_file_from_google_drive
import os.path

import numpy as np
import pandas as pd
import time
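
The imports above only work if the fasterrisk package is available in the active environment. As a minimal sketch, the check below tests whether a module can be found before the rest of the notebook runs. The helper name `is_installed` is hypothetical, and the install hint assumes the package is published on PyPI under the name `fasterrisk`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("fasterrisk"):
    # Assumption: the package is installable from PyPI under this name.
    print("fasterrisk not found; try `pip install fasterrisk`")
```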

Download Sample Data

from pathlib import Path
Path("../tests").mkdir(parents=True, exist_ok=True) # create the "../tests" directory if it doesn't exist

train_data_file_path = "../tests/adult_train_data.csv"
test_data_file_path = "../tests/adult_test_data.csv"

if not os.path.isfile(train_data_file_path):
    download_file_from_google_drive('1nuWn0QVG8tk3AN4I4f3abWLcFEP3WPec', train_data_file_path)
if not os.path.isfile(test_data_file_path):
    download_file_from_google_drive('1TyBO02LiGfHbatPWU4nzc8AndtIF-7WH', test_data_file_path)

Read Sample Data

train_df = pd.read_csv(train_data_file_path)
train_data = np.asarray(train_df)
X_train, y_train = train_data[:, 1:], train_data[:, 0]

test_df = pd.read_csv(test_data_file_path)
test_data = np.asarray(test_df)
X_test, y_test = test_data[:, 1:], test_data[:, 0]
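
The split above treats column 0 as the label and the remaining columns as features. The predictions printed later in this tutorial are -1/+1, which suggests the labels are encoded that way. If your own data stores labels as 0/1, a conversion along these lines may be needed. This is a sketch under that assumption; `to_plus_minus_one` is a hypothetical helper, not part of fasterrisk.

```python
import numpy as np


def to_plus_minus_one(y):
    """Map binary labels given as {0, 1} or {-1, +1} to {-1, +1} (hypothetical helper)."""
    y = np.asarray(y)
    values = set(np.unique(y).tolist())
    if values <= {-1, 1}:
        return y.astype(int)  # already in the expected encoding
    if values <= {0, 1}:
        return (2 * y - 1).astype(int)  # 0 -> -1, 1 -> +1
    raise ValueError(f"expected binary labels, got values {sorted(values)}")


print(to_plus_minus_one([0, 1, 1, 0]))
```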

Train Risk Score Models

Create RiskScoreOptimizer and Perform Optimization

sparsity = 5
parent_size = 10

RiskScoreOptimizer_m = RiskScoreOptimizer(X = X_train, y = y_train, k = sparsity, parent_size = parent_size)

start_time = time.time()
RiskScoreOptimizer_m.optimize()
print("Optimization takes {:.2f} seconds.".format(time.time() - start_time))

Optimization takes 10.26 seconds.

Get Risk Score Models

multipliers, sparseDiversePool_beta0_integer, sparseDiversePool_betas_integer = RiskScoreOptimizer_m.get_models()
print("We generate {} risk score models from the sparse diverse pool".format(len(multipliers)))

We generate 50 risk score models from the sparse diverse pool

Access the first risk score model

model_index = 0 # first model
multiplier = multipliers[model_index]
intercept = sparseDiversePool_beta0_integer[model_index]
coefficients = sparseDiversePool_betas_integer[model_index]
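
To make the indexing above concrete, here is a small sketch that uses stand-in arrays in place of the values returned by get_models(). It assumes the coefficient pool is a 2D array with one row per model, and checks that each model uses at most `sparsity` features. All numbers and the names `beta0_pool`, `betas_pool`, and `support` are made up for illustration.

```python
import numpy as np

# Stand-in values for illustration only; in the tutorial these come from get_models().
multipliers = np.array([1.6, 1.7])
beta0_pool = np.array([-3.0, -3.0])
betas_pool = np.array([
    [-2.0, -2.0, -4.0, 4.0, 3.0, 0.0, 0.0],
    [0.0, -2.0, -4.0, 4.0, 3.0, -2.0, 0.0],
])


def support(coefficients):
    """Indices of features that carry a nonzero number of points."""
    return np.flatnonzero(coefficients)


sparsity = 5
for i in range(len(multipliers)):
    idx = support(betas_pool[i])
    assert len(idx) <= sparsity  # each model should respect the sparsity budget
    print(f"model {i}: multiplier={multipliers[i]}, "
          f"intercept={beta0_pool[i]}, features={idx.tolist()}")
```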

Use the first risk score model to do prediction

RiskScoreClassifier_m = RiskScoreClassifier(multiplier, intercept, coefficients, X_train = X_train)

y_test_pred = RiskScoreClassifier_m.predict(X_test)
print("y_test are predicted to be {}".format(y_test_pred))

y_test are predicted to be [-1 -1 -1 ... -1 -1 -1]

y_test_pred_prob = RiskScoreClassifier_m.predict_prob(X_test)
print("The risk probabilities of having y_test to be +1 are {}".format(y_test_pred_prob))

The risk probabilities of having y_test to be +1 are [0.13308868 0.34872682 0.34872682 ... 0.04216029 0.34872682 0.04216029]

Print the first model card

X_featureNames = list(train_df.columns[1:])

RiskScoreClassifier_m.reset_featureNames(X_featureNames)
RiskScoreClassifier_m.print_model_card()

The Risk Score is:
1.            Age_22_to_29     -2 point(s) |   ...
2.               HSDiploma     -2 point(s) | + ...
3.                    NoHS     -4 point(s) | + ...
4.                 Married      4 point(s) | + ...
5.         AnyCapitalGains      3 point(s) | + ...
                                     SCORE | =    
SCORE |  -8.0  |  -6.0  |  -5.0  |  -4.0  |  -3.0  |  -2.0  |  -1.0  |
RISK  |   0.1% |   0.4% |   0.7% |   1.2% |   2.3% |   4.2% |   7.6% |
SCORE |   0.0  |   1.0  |   2.0  |   3.0  |   4.0  |   5.0  |   7.0  |
RISK  |  13.3% |  22.3% |  34.9% |  50.0% |  65.1% |  77.7% |  92.4% |
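
The SCORE-to-RISK rows of the card can be reproduced approximately with a logistic function. The sketch below assumes the risk has the form 1 / (1 + exp(-(score + intercept) / multiplier)). The intercept of -3 and the multiplier of roughly 1.6 are inferred from the printed card (a 50% risk at SCORE = 3, and the remaining entries), not values reported by the tutorial, and `risk_from_score` is a hypothetical helper.

```python
import math


def risk_from_score(score, intercept, multiplier):
    """Logistic risk for an integer score (assumed functional form)."""
    return 1.0 / (1.0 + math.exp(-(score + intercept) / multiplier))


# Values inferred from the printed card, not reported by the tutorial.
intercept, multiplier = -3, 1.6

for score in (-2, 0, 2, 3, 7):
    risk = risk_from_score(score, intercept, multiplier)
    print(f"SCORE {score:>3} -> RISK {100 * risk:.1f}%")
```

Because the multiplier is only an approximation, the computed risks agree with the card to about one decimal place of a percent rather than exactly.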

Print Top 10 Model Cards from the Pool and their performance metrics

num_models = min(10, len(multipliers))

for model_index in range(num_models):
    multiplier = multipliers[model_index]
    intercept = sparseDiversePool_beta0_integer[model_index]
    coefficients = sparseDiversePool_betas_integer[model_index]

    RiskScoreClassifier_m = RiskScoreClassifier(multiplier, intercept, coefficients)
    RiskScoreClassifier_m.reset_featureNames(X_featureNames)
    RiskScoreClassifier_m.print_model_card()

    train_loss = RiskScoreClassifier_m.compute_logisticLoss(X_train, y_train)
    train_acc, train_auc = RiskScoreClassifier_m.get_acc_and_auc(X_train, y_train)
    test_acc, test_auc = RiskScoreClassifier_m.get_acc_and_auc(X_test, y_test)

    print("The logistic loss on the training set is {}".format(train_loss))
    print("The training accuracy and AUC are {:.3f}% and {:.3f}".format(train_acc*100, train_auc))
    print("The test accuracy and AUC are are {:.3f}% and {:.3f}\n".format(test_acc*100, test_auc))

The Risk Score is:
1.            Age_22_to_29     -2 point(s) |   ...
2.               HSDiploma     -2 point(s) | + ...
3.                    NoHS     -4 point(s) | + ...
4.                 Married      4 point(s) | + ...
5.         AnyCapitalGains      3 point(s) | + ...
                                     SCORE | =    
SCORE |  -8.0  |  -6.0  |  -5.0  |  -4.0  |  -3.0  |  -2.0  |  -1.0  |
RISK  |   0.1% |   0.4% |   0.7% |   1.2% |   2.3% |   4.2% |   7.6% |
SCORE |   0.0  |   1.0  |   2.0  |   3.0  |   4.0  |   5.0  |   7.0  |
RISK  |  13.3% |  22.3% |  34.9% |  50.0% |  65.1% |  77.7% |  92.4% |
The logistic loss on the training set is 9798.652346518873
The training accuracy and AUC are 82.575% and 0.862
The test accuracy and AUC are are 81.787% and 0.856

The Risk Score is:
1.               HSDiploma     -2 point(s) |   ...
2.                    NoHS     -4 point(s) | + ...
3.                 Married      4 point(s) | + ...
4.    WorkHrsPerWeek_lt_40     -2 point(s) | + ...
5.         AnyCapitalGains      3 point(s) | + ...
                                     SCORE | =    
SCORE |  -8.0  |  -6.0  |  -5.0  |  -4.0  |  -3.0  |  -2.0  |  -1.0  |
RISK  |   0.1% |   0.4% |   0.7% |   1.3% |   2.5% |   4.4% |   7.9% |
SCORE |   0.0  |   1.0  |   2.0  |   3.0  |   4.0  |   5.0  |   7.0  |
RISK  |  13.7% |  22.7% |  35.1% |  50.0% |  64.9% |  77.3% |  92.1% |
The logistic loss on the training set is 9859.61575793142
The training accuracy and AUC are 82.333% and 0.860
The test accuracy and AUC are are 81.849% and 0.854

The Risk Score is:
1.               HSDiploma     -3 point(s) |   ...
2.                    NoHS     -5 point(s) | + ...
3.           JobManagerial      2 point(s) | + ...
4.                 Married      5 point(s) | + ...
5.         AnyCapitalGains      3 point(s) | + ...
                                     SCORE | =    
SCORE |  -8.0  |  -6.0  |  -5.0  |  -3.0  |  -2.0  |  -1.0  |   0.0  |
RISK  |   0.2% |   0.6% |   1.0% |   2.7% |   4.4% |   7.2% |  11.4% |
SCORE |   2.0  |   3.0  |   4.0  |   5.0  |   7.0  |   8.0  |  10.0  |
RISK  |  26.4% |  37.5% |  50.0% |  62.5% |  82.3% |  88.6% |  95.6% |
The logistic loss on the training set is 9883.324461826953
The training accuracy and AUC are 82.268% and 0.860
The test accuracy and AUC are are 81.511% and 0.854

The Risk Score is:
1.               HSDiploma     -3 point(s) |   ...
2.                    NoHS     -5 point(s) | + ...
3.                 Married      5 point(s) | + ...
4.   WorkHrsPerWeek_geq_50      1 point(s) | + ...
5.         AnyCapitalGains      3 point(s) | + ...
                                     SCORE | =    
SCORE |  -8.0  |  -7.0  |  -5.0  |  -4.0  |  -3.0  |  -2.0  |  -1.0  |   0.0  |
RISK  |   0.2% |   0.3% |   0.9% |   1.5% |   2.5% |   4.1% |   6.8% |  10.9% |
SCORE |   1.0  |   2.0  |   3.0  |   4.0  |   5.0  |   6.0  |   8.0  |   9.0  |
RISK  |  17.2% |  25.9% |  37.2% |  50.0% |  62.8% |  74.1% |  89.1% |  93.2% |
The logistic loss on the training set is 9895.728067750335
The training accuracy and AUC are 82.180% and 0.861
The test accuracy and AUC are are 81.342% and 0.856

The Risk Score is:
1.            Age_45_to_59      1 point(s) |   ...
2.               HSDiploma     -2 point(s) | + ...
3.                    NoHS     -5 point(s) | + ...
4.                 Married      4 point(s) | + ...
5.         AnyCapitalGains      3 point(s) | + ...
                                     SCORE | =    
SCORE |  -7.0  |  -6.0  |  -5.0  |  -4.0  |  -3.0  |  -2.0  |  -1.0  |   0.0  |
RISK  |   0.2% |   0.4% |   0.7% |   1.2% |   2.0% |   3.5% |   5.9% |   9.9% |
SCORE |   1.0  |   2.0  |   3.0  |   4.0  |   5.0  |   6.0  |   7.0  |   8.0  |
RISK  |  16.0% |  24.9% |  36.5% |  50.0% |  63.5% |  75.1% |  84.0% |  90.1% |
The logistic loss on the training set is 9914.75974232043
The training accuracy and AUC are 80.656% and 0.863
The test accuracy and AUC are are 80.052% and 0.856

The Risk Score is:
1.               HSDiploma     -3 point(s) |   ...
2.                    NoHS     -5 point(s) | + ...
3.                 Married      5 point(s) | + ...
4.         AnyCapitalGains      3 point(s) | + ...
5.          AnyCapitalLoss      2 point(s) | + ...
                                     SCORE | =    
SCORE |  -8.0  |  -6.0  |  -5.0  |  -3.0  |  -2.0  |  -1.0  |   0.0  |
RISK  |   0.2% |   0.5% |   0.8% |   2.3% |   3.9% |   6.4% |  10.5% |
SCORE |   2.0  |   3.0  |   4.0  |   5.0  |   7.0  |   8.0  |  10.0  |
RISK  |  25.5% |  36.9% |  50.0% |  63.1% |  83.3% |  89.5% |  96.1% |
The logistic loss on the training set is 9923.881690282931
The training accuracy and AUC are 82.180% and 0.857
The test accuracy and AUC are are 81.342% and 0.852

The Risk Score is:
1.               HSDiploma     -2 point(s) |   ...
2.             ProfVocOrAS     -1 point(s) | + ...
3.                    NoHS     -4 point(s) | + ...
4.                 Married      3 point(s) | + ...
5.         AnyCapitalGains      2 point(s) | + ...
                                     SCORE | =    
SCORE |  -7.0  |  -6.0  |  -5.0  |  -4.0  |  -3.0  |  -2.0  |  -1.0  |
RISK  |   0.1% |   0.1% |   0.3% |   0.8% |   1.7% |   3.7% |   8.0% |
SCORE |   0.0  |   1.0  |   2.0  |   3.0  |   4.0  |   5.0  |
RISK  |  16.4% |  30.7% |  50.0% |  69.3% |  83.6% |  92.0% |
The logistic loss on the training set is 9980.639483585337
The training accuracy and AUC are 82.172% and 0.856
The test accuracy and AUC are are 81.235% and 0.849

The Risk Score is:
1.               HSDiploma     -2 point(s) |   ...
2.                    NoHS     -4 point(s) | + ...
3.                 Married      3 point(s) | + ...
4.            NeverMarried     -1 point(s) | + ...
5.         AnyCapitalGains      2 point(s) | + ...
                                     SCORE | =    
SCORE |  -7.0  |  -6.0  |  -5.0  |  -4.0  |  -3.0  |  -2.0  |  -1.0  |
RISK  |   0.1% |   0.3% |   0.6% |   1.2% |   2.4% |   4.9% |   9.8% |
SCORE |   0.0  |   1.0  |   2.0  |   3.0  |   4.0  |   5.0  |
RISK  |  18.6% |  32.3% |  50.0% |  67.7% |  81.4% |  90.2% |
The logistic loss on the training set is 9988.041001585896
The training accuracy and AUC are 82.180% and 0.855
The test accuracy and AUC are are 81.342% and 0.849

The Risk Score is:
1.               HSDiploma     -2 point(s) |   ...
2.                    NoHS     -4 point(s) | + ...
3.                 Married      4 point(s) | + ...
4.     DivorcedOrSeparated      1 point(s) | + ...
5.         AnyCapitalGains      2 point(s) | + ...
                                     SCORE | =    
SCORE |  -6.0  |  -5.0  |  -4.0  |  -3.0  |  -2.0  |  -1.0  |   0.0  |
RISK  |   0.1% |   0.3% |   0.6% |   1.2% |   2.6% |   5.1% |  10.1% |
SCORE |   1.0  |   2.0  |   3.0  |   4.0  |   5.0  |   6.0  |   7.0  |
RISK  |  18.9% |  32.6% |  50.0% |  67.4% |  81.1% |  89.9% |  94.9% |
The logistic loss on the training set is 10000.803138904072
The training accuracy and AUC are 82.180% and 0.855
The test accuracy and AUC are are 81.342% and 0.848

The Risk Score is:
1.               HSDiploma     -2 point(s) |   ...
2.                    NoHS     -4 point(s) | + ...
3.              JobService     -1 point(s) | + ...
4.                 Married      4 point(s) | + ...
5.         AnyCapitalGains      3 point(s) | + ...
                                     SCORE | =    
SCORE |  -7.0  |  -6.0  |  -5.0  |  -4.0  |  -3.0  |  -2.0  |  -1.0  |   0.0  |
RISK  |   0.2% |   0.4% |   0.7% |   1.3% |   2.4% |   4.3% |   7.7% |  13.4% |
SCORE |   1.0  |   2.0  |   3.0  |   4.0  |   5.0  |   6.0  |   7.0  |
RISK  |  22.4% |  35.0% |  50.0% |  65.0% |  77.6% |  86.6% |  92.3% |
The logistic loss on the training set is 10000.838406643863
The training accuracy and AUC are 82.176% and 0.858
The test accuracy and AUC are are 81.465% and 0.852
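
For readers who want to see what the reported metrics measure, the sketch below computes a logistic loss and an accuracy by hand with NumPy. It assumes -1/+1 labels, a loss summed over samples, and a decision function of the form (intercept + x . coefficients) / multiplier. These are assumptions for illustration and not necessarily how compute_logisticLoss or get_acc_and_auc are implemented.

```python
import numpy as np


def logistic_loss(X, y, multiplier, intercept, coefficients):
    """Sum over samples of log(1 + exp(-y * f(x)))."""
    margin = y * (intercept + X @ coefficients) / multiplier
    # logaddexp(0, -m) equals log(1 + exp(-m)) and is stable for large |m|.
    return float(np.sum(np.logaddexp(0.0, -margin)))


def accuracy(X, y, multiplier, intercept, coefficients):
    """Fraction of samples whose predicted sign matches the -1/+1 label."""
    f = (intercept + X @ coefficients) / multiplier
    y_pred = np.where(f > 0, 1, -1)  # ties at f == 0 go to -1, an arbitrary choice
    return float(np.mean(y_pred == y))


# Toy data for illustration only.
X = np.array([[1, 0], [0, 1], [1, 1], [0, 0]], dtype=float)
y = np.array([1, -1, 1, -1])
beta = np.array([3.0, -2.0])

print(logistic_loss(X, y, 1.0, -1.0, beta))
print(accuracy(X, y, 1.0, -1.0, beta))
```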

Additional Tutorial on Binarizing Continuous Features

If your data has continuous features, we recommend converting them to binary features as a preprocessing step, which makes the final model more interpretable. We use the public PIMA dataset to show how to do this.

Download the PIMA dataset

pima_original_data_file_path = "../tests/pima_original_data.csv"
if not os.path.isfile(pima_original_data_file_path):
    download_file_from_google_drive('184JhmJiSEUiBCo8ySAD8adDn_S9rjmjM', pima_original_data_file_path)

pima_original_data_df = pd.read_csv(pima_original_data_file_path)

X_original_df = pima_original_data_df.drop(columns="Outcome") # drop the Outcome column, which stores the y label for this binary classification problem

X_original_df

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age
0	6	148	72	35	0	33.6	0.627	50
1	1	85	66	29	0	26.6	0.351	31
2	8	183	64	0	0	23.3	0.672	32
3	1	89	66	23	94	28.1	0.167	21
4	0	137	40	35	168	43.1	2.288	33
...	...	...	...	...	...	...	...	...
763	10	101	76	48	180	32.9	0.171	63
764	2	122	70	27	0	36.8	0.340	27
765	5	121	72	23	112	26.2	0.245	30
766	1	126	60	0	0	30.1	0.349	47
767	1	93	70	31	0	30.4	0.315	23

768 rows × 8 columns

Convert the dataframe with continuous features to a new dataframe with binary features

from fasterrisk.binarization_util import convert_continuous_df_to_binary_df

X_binarized_df = convert_continuous_df_to_binary_df(X_original_df)
X_binarized_df

Converting continuous features to binary features in the dataframe......
If a feature has more than 100 unqiue values, we pick the threasholds by selecting 100 quantile points. You can change the number of thresholds by passing another specified number: convert_continuous_df_to_binary_df(df, num_quantiles=50).
Finish converting continuous features to binary features......

	Pregnancies<=0	Pregnancies<=1	Pregnancies<=2	Pregnancies<=3	Pregnancies<=4	Pregnancies<=5	Pregnancies<=6	Pregnancies<=7	Pregnancies<=8	Pregnancies<=9	...	Age<=62	Age<=63	Age<=64	Age<=65	Age<=66	Age<=67	Age<=68	Age<=69	Age<=70	Age<=72
0	0	0	0	0	0	0	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1
1	0	1	1	1	1	1	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1
2	0	0	0	0	0	0	0	0	1	1	...	1	1	1	1	1	1	1	1	1	1
3	0	1	1	1	1	1	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1
4	1	1	1	1	1	1	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
763	0	0	0	0	0	0	0	0	0	0	...	0	1	1	1	1	1	1	1	1	1
764	0	0	1	1	1	1	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1
765	0	0	0	0	0	1	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1
766	0	1	1	1	1	1	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1
767	0	1	1	1	1	1	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1

768 rows × 559 columns

You can then use X_binarized_df as your new design matrix and pass it as input to the FasterRisk algorithm.
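
To illustrate the idea behind the binarized columns shown above, here is a minimal NumPy sketch of threshold binarization for a single column. It is not the library's implementation: `binarize_column` is a hypothetical helper, the quantile fallback mirrors the message printed by the converter, and dropping the largest threshold is a choice made in this sketch.

```python
import numpy as np


def binarize_column(values, name, num_quantiles=100):
    """Turn one numeric column into indicator features of the form `name<=t`."""
    values = np.asarray(values, dtype=float)
    thresholds = np.unique(values)
    if len(thresholds) > num_quantiles:
        # Too many distinct values: use quantile points as thresholds instead.
        thresholds = np.unique(np.quantile(values, np.linspace(0, 1, num_quantiles)))
    # `value <= max` is 1 for every row and carries no information, so drop it.
    thresholds = thresholds[:-1]
    columns = (values[:, None] <= thresholds[None, :]).astype(int)
    names = [f"{name}<={t:g}" for t in thresholds]
    return columns, names


# Pregnancies values from the first five rows of the PIMA table.
cols, names = binarize_column([6, 1, 8, 1, 0], "Pregnancies")
print(names)
print(cols)
```

Each row of `cols` is monotone: once a threshold is at least as large as the value, every later indicator is 1, which matches the pattern visible in the binarized table.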

Tutorial on Group-sparsity Constrained Model

We again use the PIMA dataset to illustrate how to produce scoring systems under a group-sparsity constraint.

Binarization with Group Information

X_binarized_df, featureIndex_to_groupIndex = convert_continuous_df_to_binary_df(X_original_df, get_featureIndex_to_groupIndex=True)

Converting continuous features to binary features in the dataframe......
If a feature has more than 100 unqiue values, we pick the threasholds by selecting 100 quantile points. You can change the number of thresholds by passing another specified number: convert_continuous_df_to_binary_df(df, num_quantiles=50).
Finish converting continuous features to binary features......

We still obtain the same preprocessed binary features:

X_binarized_df

	Pregnancies<=0	Pregnancies<=1	Pregnancies<=2	Pregnancies<=3	Pregnancies<=4	Pregnancies<=5	Pregnancies<=6	Pregnancies<=7	Pregnancies<=8	Pregnancies<=9	...	Age<=62	Age<=63	Age<=64	Age<=65	Age<=66	Age<=67	Age<=68	Age<=69	Age<=70	Age<=72
0	0	0	0	0	0	0	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1
1	0	1	1	1	1	1	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1
2	0	0	0	0	0	0	0	0	1	1	...	1	1	1	1	1	1	1	1	1	1
3	0	1	1	1	1	1	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1
4	1	1	1	1	1	1	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
763	0	0	0	0	0	0	0	0	0	0	...	0	1	1	1	1	1	1	1	1	1
764	0	0	1	1	1	1	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1
765	0	0	0	0	0	1	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1
766	0	1	1	1	1	1	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1
767	0	1	1	1	1	1	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1

768 rows × 559 columns

However, we now have a new variable, featureIndex_to_groupIndex, which stores the group index of each binary feature. It tells us which continuous feature each binary feature was derived from.

featureIndex_to_groupIndex

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6,
       6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
       6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
       6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
       6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
       6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
       7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
       7, 7, 7, 7, 7, 7, 7, 7, 7])
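
A group array like the one above can also be derived from the binarized column names. The sketch below splits each name on `<=` and numbers the source features in order of first appearance. This is an illustration of the mapping, not how the library builds it; `group_index_from_names` is a hypothetical helper and the feature names are shortened examples.

```python
import numpy as np


def group_index_from_names(feature_names, separator="<="):
    """Assign one integer group per source feature, in order of first appearance."""
    group_of = {}
    indices = []
    for name in feature_names:
        source = name.split(separator)[0]
        # setdefault assigns the next free index the first time a source is seen.
        indices.append(group_of.setdefault(source, len(group_of)))
    return np.array(indices)


names = ["Pregnancies<=0", "Pregnancies<=1", "Glucose<=99.8", "BMI<=27.3", "BMI<=48.2"]
groups = group_index_from_names(names)
print(groups)               # group index of each binary feature
print(np.bincount(groups))  # number of binary features per group
```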

Train the model with the group-sparsity constraint

y_train = np.asarray(pima_original_data_df["Outcome"].values)
X_train = np.asarray(X_binarized_df)

sparsity = 5
group_sparsity = 2
parent_size = 10

RiskScoreOptimizer_m = RiskScoreOptimizer(X = X_train, y = y_train, k = sparsity, parent_size = parent_size,
                                          group_sparsity = group_sparsity,
                                          featureIndex_to_groupIndex = featureIndex_to_groupIndex)

start_time = time.time()
RiskScoreOptimizer_m.optimize()
print("Optimization takes {:.2f} seconds.".format(time.time() - start_time))

Optimization takes 2.02 seconds.

Print the First Model Card

multipliers, sparseDiversePool_beta0_integer, sparseDiversePool_betas_integer = RiskScoreOptimizer_m.get_models()

model_index = 0 # first model

multiplier = multipliers[model_index]
intercept = sparseDiversePool_beta0_integer[model_index]
coefficients = sparseDiversePool_betas_integer[model_index]

X_featureNames = list(X_binarized_df.columns)

RiskScoreClassifier_m = RiskScoreClassifier(multiplier, intercept, coefficients, X_train = X_train)
RiskScoreClassifier_m.reset_featureNames(X_featureNames)
RiskScoreClassifier_m.print_model_card()

The Risk Score is:
1.                    Glucose<=99.80000000000001     -3 point(s) |   ...
2.                                Glucose<=129.5     -2 point(s) | + ...
3.                               Glucose<=165.95     -3 point(s) | + ...
4.                       BMI<=27.316000000000003     -4 point(s) | + ...
5.                        BMI<=48.15999999999999     -4 point(s) | + ...
                                                           SCORE | =    
SCORE |  -16.0  |  -14.0  |  -13.0  |  -12.0  |  -11.0  |  -10.0  |  -9.0  |  -8.0  |
RISK  |   1.8% |   4.8% |   7.6% |  11.9% |  18.3% |  26.9% |  37.8% |  50.0% |
SCORE |  -7.0  |  -6.0  |  -5.0  |  -4.0  |  -3.0  |  -2.0  |   0.0  |
RISK  |  62.2% |  73.1% |  81.7% |  88.1% |  92.4% |  95.2% |  98.2% |

Indeed, the scoring system above uses only 2 groups: all five binary features are derived from either Glucose or BMI.
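
The group count can also be checked programmatically instead of by reading the card. The sketch below collects the groups of all features with nonzero coefficients. It uses toy stand-in arrays, and `groups_used` is a hypothetical helper.

```python
import numpy as np


def groups_used(coefficients, feature_to_group):
    """Distinct group indices among features with nonzero coefficients."""
    active = np.flatnonzero(coefficients)
    return np.unique(np.asarray(feature_to_group)[active])


# Toy stand-ins: 6 binary features from 3 source columns, 3 features selected.
coefficients = np.array([0, -3, -2, 0, -4, 0])
feature_to_group = np.array([0, 1, 1, 1, 2, 2])

used = groups_used(coefficients, feature_to_group)
print(used)
assert len(used) <= 2  # group_sparsity = 2 in the tutorial
```

With the tutorial's variables, the equivalent call would be groups_used(coefficients, featureIndex_to_groupIndex).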