fasterrisk.binarization_util

Classes

BinBinarizer

Convert variables into binary variables based on percentile or user-defined thresholds.

Functions

`convert_continuous_df_to_binary_df`(df[, ...])	Convert a dataframe with continuous features to a dataframe with binary features by thresholding
`nan_onehot_single_column`(→ numpy.ndarray)

Module Contents

fasterrisk.binarization_util.convert_continuous_df_to_binary_df(df, max_num_thresholds_per_feature=100, sampling_weights='uniform', sampling_seed=0, get_featureIndex_to_groupIndex=False)

Convert a dataframe with continuous features to a dataframe with binary features by thresholding

Parameters:

df (pandas.DataFrame) – original dataframe where there are columns with continuous features
max_num_thresholds_per_feature (int, optional) – number of thresholds to sample for a column that has too many unique values to use them all, by default 100
sampling_weights (str, optional) – how to sample the thresholds from all unique values, by default ‘uniform’; alternatively, ‘weighted’ samples the thresholds according to the distribution of the unique values
sampling_seed (int, optional) – random seed for sampling, by default 0
get_featureIndex_to_groupIndex (bool, optional) – whether to return a numpy array that maps feature index to group index, by default False

Returns:

binarized_df – a new dataframe in which every column contains only 0/1 values

Return type:

pandas.DataFrame
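The thresholding described above can be sketched without any dependencies. This is a minimal illustration, not the library's implementation: plain Python lists stand in for dataframe columns, and the helper names `pick_thresholds` and `binarize_columns`, the `name<=threshold` column naming, and the reading of ‘weighted’ as "frequent values are more likely to be picked" are all assumptions made for this example.

```python
import random
from collections import Counter


def pick_thresholds(values, max_num_thresholds=100,
                    sampling_weights="uniform", seed=0):
    """Choose candidate thresholds from the unique values of one column."""
    values = list(values)
    # Drop the largest unique value: "x <= max" is 1 on every row.
    candidates = sorted(set(values))[:-1]
    if len(candidates) <= max_num_thresholds:
        return candidates

    rng = random.Random(seed)
    if sampling_weights == "uniform":
        picked = rng.sample(candidates, max_num_thresholds)
    else:
        # Assumed meaning of "weighted": values that occur more often are
        # more likely to be picked (weighted sampling without replacement).
        counts = Counter(values)
        ranked = sorted(candidates,
                        key=lambda v: rng.random() ** (1.0 / counts[v]),
                        reverse=True)
        picked = ranked[:max_num_thresholds]
    return sorted(picked)


def binarize_columns(columns, **kwargs):
    """Turn {name: [numbers]} into 0/1 columns plus a feature-to-group map."""
    binary = {}
    feature_to_group = []
    for group_index, (name, values) in enumerate(columns.items()):
        for t in pick_thresholds(values, **kwargs):
            # One binary feature per threshold: 1 where value <= threshold.
            binary[f"{name}<={t}"] = [1 if v <= t else 0 for v in values]
            # Every binary feature remembers which original column it came from.
            feature_to_group.append(group_index)
    return binary, feature_to_group


data = {"age": [25, 40, 40, 63], "bmi": [18.5, 22.0, 30.1, 22.0]}
binary, groups = binarize_columns(data)
print(list(binary))       # ['age<=25', 'age<=40', 'bmi<=18.5', 'bmi<=22.0']
print(binary["age<=40"])  # [1, 1, 1, 0]
print(groups)             # [0, 0, 1, 1]
```

The second return value plays the role of the feature-index-to-group-index mapping: binary features derived from the same original column share one group index.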

fasterrisk.binarization_util.nan_onehot_single_column(column: pandas.Series) → numpy.ndarray

class fasterrisk.binarization_util.BinBinarizer(whether_interval: bool = False, max_num_thresholds_per_feature: int = 100, sampling_weights: str = 'uniform', sampling_seed: int = 0, group_sparsity: bool = False)

Bases: sklearn.preprocessing._encoders._BaseEncoder

Convert variables into binary variables based on percentile or user-defined thresholds.

Parameters:

interval_width (int) – width of each interval, measured in percentiles. For instance, if interval_width=10, then each interval lies between the nth and the (n+10)th percentile
categorical_cols (list) – list of names of the categorical variables
whether_interval (bool) – whether to one-hot encode based on intervals or based on less-than thresholds, by default False (use less-than thresholds)
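The difference between the two encodings, and the way a percentile width translates into cut points, can be sketched as follows. This is an illustration under assumptions, not the class's actual code: the function names are hypothetical, and the choice of `statistics.quantiles` with `method="inclusive"` is only one of several ways to compute percentiles.

```python
from bisect import bisect_left
from statistics import quantiles


def percentile_thresholds(values, interval_width=10):
    """Cut points every `interval_width` percentiles (10 -> 9 cut points)."""
    return quantiles(values, n=100 // interval_width, method="inclusive")


def encode_less_than(value, thresholds):
    """Cumulative encoding: one bit per threshold, 1 where value <= threshold."""
    return [1 if value <= t else 0 for t in thresholds]


def encode_interval(value, thresholds):
    """One-hot encoding over (-inf, t0], (t0, t1], ..., (t_last, +inf)."""
    slot = bisect_left(thresholds, value)  # index of the interval holding value
    return [1 if i == slot else 0 for i in range(len(thresholds) + 1)]


thresholds = [30, 50, 70]
print(encode_less_than(45, thresholds))  # [0, 1, 1]
print(encode_interval(45, thresholds))   # [0, 1, 0, 0]

cuts = percentile_thresholds(list(range(101)), interval_width=10)
print(len(cuts))                         # 9
```

With less-than encoding several bits can be 1 at once, while interval encoding sets exactly one bit per value.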

whether_interval

group_sparsity

max_num_thresholds_per_feature

sampling_weights

sampling_seed

rng

fit(df: pandas.DataFrame) → None: fit the binarizer

transform(df: pandas.DataFrame) → tuple

Transform data using percentiles found in fitting

Parameters:

df (pandas.DataFrame) – data to transform

Returns:

transformed data, group sparsity index

Return type:

tuple
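The fit-then-transform pattern, where thresholds are learned once and then reused on new data, can be sketched with a toy class. `ToyBinarizer` is a hypothetical stand-in written for this example only; it uses dictionaries of lists instead of dataframes and takes every unique training value except the largest as a threshold, which is an assumption rather than the behaviour of `BinBinarizer`.

```python
class ToyBinarizer:
    """Hypothetical stand-in that mimics the fit/transform split."""

    def __init__(self):
        self.thresholds_ = {}

    def fit(self, columns):
        # Learn thresholds from the training data only.
        for name, values in columns.items():
            self.thresholds_[name] = sorted(set(values))[:-1]
        return self

    def transform(self, columns):
        # Apply the stored thresholds; nothing is re-learned here.
        binary, group_index = {}, []
        for group, (name, cuts) in enumerate(self.thresholds_.items()):
            for t in cuts:
                binary[f"{name}<={t}"] = [int(v <= t) for v in columns[name]]
                group_index.append(group)
        return binary, group_index


train = {"age": [20, 35, 50]}
test = {"age": [18, 40, 90]}

binarizer = ToyBinarizer().fit(train)
transformed, group_index = binarizer.transform(test)
print(transformed)  # {'age<=20': [1, 0, 0], 'age<=35': [1, 0, 0]}
print(group_index)  # [0, 0]
```

Because the thresholds come from the training data, the test set produces the same columns in the same order even though its values differ.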

fit_transform(df: pandas.DataFrame) → pandas.DataFrame: fit and transform on the same dataframe