fasterrisk.binarization_util

Classes

BinBinarizer

Binarize variables into binary features using percentile-based or user-defined thresholds.

Functions

convert_continuous_df_to_binary_df(df[, ...])

Convert a dataframe with continuous features to a dataframe with binary features by thresholding.

nan_onehot_single_column(column) → numpy.ndarray

Module Contents

fasterrisk.binarization_util.convert_continuous_df_to_binary_df(df, max_num_thresholds_per_feature=100, sampling_weights='uniform', sampling_seed=0, get_featureIndex_to_groupIndex=False)

Convert a dataframe with continuous features to a dataframe with binary features by thresholding.

Parameters:
  • df (pandas.DataFrame) – original dataframe where there are columns with continuous features

  • max_num_thresholds_per_feature (int, optional) – number of points we pick as thresholds if a column has too many unique values, by default 100

  • sampling_weights (str, optional) – how to sample the thresholds from all unique values, by default ‘uniform’; alternatively, ‘weighted’ samples the thresholds according to the distribution of the unique values

  • sampling_seed (int, optional) – random seed for sampling, by default 0

  • get_featureIndex_to_groupIndex (bool, optional) – whether to return a numpy array that maps feature index to group index, by default False

Returns:

binarized_df – a new dataframe in which every column takes only the values 0 and 1

Return type:

pandas.DataFrame
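
The thresholding idea described above can be sketched with pandas and NumPy alone. This is a minimal illustration and not the library's implementation: the helper name binarize_by_thresholds, the "feature<=threshold" column naming, the choice to drop the largest unique value, and the use of "x <= t" as the comparison are all assumptions made for the example. Only uniform sampling is shown.

```python
import numpy as np
import pandas as pd


def binarize_by_thresholds(df, max_num_thresholds_per_feature=100, seed=0):
    """Hypothetical helper: one 0/1 column per (feature, threshold) pair."""
    rng = np.random.default_rng(seed)
    columns = {}
    for name in df.columns:
        values = np.sort(df[name].dropna().unique())
        # Drop the maximum: "x <= max" is 1 for every non-missing row.
        thresholds = values[:-1]
        if len(thresholds) > max_num_thresholds_per_feature:
            # Too many candidates: sample uniformly without replacement.
            thresholds = np.sort(
                rng.choice(
                    thresholds,
                    size=max_num_thresholds_per_feature,
                    replace=False,
                )
            )
        for t in thresholds:
            # Assumed naming convention, for illustration only.
            columns[f"{name}<={t}"] = (df[name] <= t).astype(int)
    return pd.DataFrame(columns, index=df.index)


df = pd.DataFrame({"age": [22, 35, 35, 58], "bmi": [19.5, 24.0, 31.2, 27.8]})
binary_df = binarize_by_thresholds(df)
print(binary_df.columns.tolist())
```

In this sketch, "age" has three unique values and therefore yields two threshold columns, while "bmi" has four unique values and yields three.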

fasterrisk.binarization_util.nan_onehot_single_column(column: pandas.Series) → numpy.ndarray

class fasterrisk.binarization_util.BinBinarizer(whether_interval: bool = False, max_num_thresholds_per_feature: int = 100, sampling_weights: str = 'uniform', sampling_seed: int = 0, group_sparsity: bool = False)

Bases: sklearn.preprocessing._encoders._BaseEncoder

Binarize variables into binary features using percentile-based or user-defined thresholds.

Parameters:
  • interval_width (int) – width of the interval measured by percentiles. For instance, if interval_width=10, then each interval will be between nth and (n+10)th percentile

  • categorical_cols (list) – list of names for categorical variables

  • whether_interval (bool) – whether to one-hot encode by intervals or by less-than thresholds, by default False (less-than thresholds)
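
The difference between the interval encoding and the less-than encoding can be illustrated on a single feature. The sketch below is an assumption-laden example, not the class's actual code: encode_column is a hypothetical helper, and the boundary conventions ("x <= t" for less-than, "lo < x <= hi" for intervals) are chosen for illustration.

```python
import numpy as np
import pandas as pd


def encode_column(values, thresholds, whether_interval=False):
    """Hypothetical helper contrasting the two encodings for one feature."""
    values = pd.Series(values, name="x")
    thresholds = sorted(thresholds)
    out = {}
    if not whether_interval:
        # Less-than style: cumulative indicators, several can be 1 at once.
        for t in thresholds:
            out[f"x<={t}"] = (values <= t).astype(int)
    else:
        # Interval style: exactly one bin is 1 for each non-missing value.
        edges = [-np.inf] + thresholds + [np.inf]
        for lo, hi in zip(edges[:-1], edges[1:]):
            out[f"{lo}<x<={hi}"] = ((values > lo) & (values <= hi)).astype(int)
    return pd.DataFrame(out)


x = [1.0, 4.0, 7.0]
cumulative = encode_column(x, thresholds=[2.0, 5.0])
intervals = encode_column(x, thresholds=[2.0, 5.0], whether_interval=True)
print(cumulative)
print(intervals)
```

With the less-than encoding a small value switches on every indicator whose threshold lies above it, whereas with the interval encoding each row has a single active column.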

whether_interval
group_sparsity
max_num_thresholds_per_feature
sampling_weights
sampling_seed
rng
fit(df: pandas.DataFrame) → None

Fit the binarizer on df.

transform(df: pandas.DataFrame) → tuple

Transform data using percentiles found in fitting

Parameters:

df (pd.DataFrame) – data to transform

Returns:

transformed data, group sparsity index

Return type:

tuple
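
One plausible reading of the group sparsity index is an integer array with one entry per binary column, identifying the original feature that column was derived from. The sketch below builds such an array under that assumption. The column names and the "<=" naming convention are hypothetical, and the exact format returned by the library may differ.

```python
import pandas as pd

# Hypothetical binarized data: two columns from "age", one from "bmi".
binary_df = pd.DataFrame(
    {"age<=22": [1, 0], "age<=35": [1, 1], "bmi<=24.0": [1, 0]}
)

# Recover the source feature of each binary column from the assumed name format.
source_names = [col.split("<=")[0] for col in binary_df.columns]

# Number the features in order of first appearance; columns that share a
# source feature receive the same group index.
group_index, features = pd.factorize(pd.Index(source_names))

# Columns that belong to group 0 (the first original feature).
first_group_columns = binary_df.columns[group_index == 0].tolist()
print(group_index, list(features), first_group_columns)
```

An index of this kind lets a downstream model limit how many original features it uses, rather than how many binary columns.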

fit_transform(df: pandas.DataFrame) → pandas.DataFrame

Fit and transform on the same dataframe.