fasterrisk.binarization_util
Classes
BinBinarizer | Binarize variables into binary variables based on percentile or user-defined thresholds.
Functions
convert_continuous_df_to_binary_df | Convert a dataframe with continuous features to a dataframe with binary features by thresholding.
nan_onehot_single_column |
Module Contents
- fasterrisk.binarization_util.convert_continuous_df_to_binary_df(df, max_num_thresholds_per_feature=100, sampling_weights='uniform', sampling_seed=0, get_featureIndex_to_groupIndex=False)
Convert a dataframe with continuous features to a dataframe with binary features by thresholding
- Parameters:
df (pandas.DataFrame) – original dataframe where there are columns with continuous features
max_num_thresholds_per_feature (int, optional) – number of points we pick as thresholds if a column has too many unique values, by default 100
sampling_weights (str, optional) – how to sample the thresholds from all unique values, by default ‘uniform’; alternatively, ‘weighted’ samples the thresholds according to the distribution of the unique values
sampling_seed (int, optional) – random seed for sampling, by default 0
get_featureIndex_to_groupIndex (bool, optional) – whether to return a numpy array that maps feature index to group index, by default False
- Returns:
binarized_df – a new dataframe in which every column contains only 0/1 values
- Return type:
pandas.DataFrame
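The thresholding idea described above can be sketched as follows. This is an illustration only, not the library's implementation: the helpers `pick_thresholds` and `binarize_less_than`, the `<=` comparison, the column naming, and the way ‘weighted’ sampling is realized (probability proportional to how often each unique value occurs) are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def pick_thresholds(column, max_num_thresholds=100, sampling_weights="uniform", seed=0):
    """Hypothetical helper: choose candidate thresholds from a column's unique values."""
    values, counts = np.unique(column.dropna().to_numpy(), return_counts=True)
    if len(values) <= max_num_thresholds:
        return values  # few unique values: keep them all
    rng = np.random.default_rng(seed)
    # 'uniform' treats every unique value alike; 'weighted' favors frequent values
    # (an assumed reading of "according to the distribution of the unique values").
    p = None if sampling_weights == "uniform" else counts / counts.sum()
    picked = rng.choice(values, size=max_num_thresholds, replace=False, p=p)
    return np.sort(picked)


def binarize_less_than(df, **kwargs):
    """Hypothetical helper: one 0/1 column per (feature, threshold) pair."""
    columns = {}
    for name in df.columns:
        for t in pick_thresholds(df[name], **kwargs):
            columns[f"{name}<={t}"] = (df[name] <= t).astype(int)
    return pd.DataFrame(columns, index=df.index)


df = pd.DataFrame({"age": [25, 40, 40, 63], "bmi": [18.5, 22.0, 30.1, 27.3]})
binary_df = binarize_less_than(df, max_num_thresholds=2)
print(binary_df.shape)  # two thresholds for each of the two features
```

With the library itself, the equivalent call would follow the signature documented above, for example `convert_continuous_df_to_binary_df(df, max_num_thresholds_per_feature=2)`.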
- fasterrisk.binarization_util.nan_onehot_single_column(column: pandas.Series) → numpy.ndarray
- class fasterrisk.binarization_util.BinBinarizer(whether_interval: bool = False, max_num_thresholds_per_feature: int = 100, sampling_weights: str = 'uniform', sampling_seed: int = 0, group_sparsity: bool = False)
Bases:
sklearn.preprocessing._encoders._BaseEncoder
Binarize variables into binary variables based on percentile or user-defined thresholds.
- Parameters:
interval_width (int) – width of the interval measured by percentiles. For instance, if interval_width=10, then each interval will be between nth and (n+10)th percentile
categorical_cols (list) – list of names for categorical variables
whether_interval (bool) – whether to one-hot encode based on intervals or on less-than thresholds, by default False (use less-than thresholds)
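The difference between the interval encoding and the less-than encoding can be illustrated with a small sketch. The helper `encode_column`, its column names, and the half-open `(lo, hi]` bins are assumptions chosen for the example and may not match how BinBinarizer builds its columns.

```python
import numpy as np
import pandas as pd


def encode_column(column, thresholds, whether_interval=False):
    """Hypothetical helper contrasting the two encodings for one feature."""
    thresholds = np.sort(np.asarray(thresholds))
    out = {}
    if whether_interval:
        # One indicator per bin (lo, hi]; each value falls in exactly one bin.
        edges = np.concatenate(([-np.inf], thresholds, [np.inf]))
        for lo, hi in zip(edges[:-1], edges[1:]):
            out[f"{lo}<{column.name}<={hi}"] = ((column > lo) & (column <= hi)).astype(int)
    else:
        # One cumulative indicator per threshold; several can be 1 at once.
        for t in thresholds:
            out[f"{column.name}<={t}"] = (column <= t).astype(int)
    return pd.DataFrame(out, index=column.index)


age = pd.Series([25, 40, 63], name="age")
less_than = encode_column(age, [30, 50])
intervals = encode_column(age, [30, 50], whether_interval=True)
print(less_than.to_numpy().tolist())  # [[1, 1], [0, 1], [0, 0]]
print(intervals.to_numpy().tolist())  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

In the less-than form a small value switches on every threshold above it, while the interval form activates a single bin per row.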
- whether_interval
- group_sparsity
- max_num_thresholds_per_feature
- sampling_weights
- sampling_seed
- rng
- fit(df: pandas.DataFrame) → None
Fit the binarizer on the given dataframe
- transform(df: pandas.DataFrame) → tuple
Transform data using percentiles found in fitting
- Parameters:
df (pd.DataFrame) – data to transform
- Returns:
transformed data, group sparsity index
- Return type:
tuple
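A group sparsity index of the kind returned here maps each binary feature back to the original feature it came from. The sketch below shows one way such an index could be built and used; the column names, the parsing on `<=`, and the integer group numbering are hypothetical and not taken from the library.

```python
import numpy as np

# Hypothetical binarized column names, each prefixed by its source feature.
binary_columns = ["age<=30", "age<=50", "bmi<=20", "bmi<=25", "bmi<=30"]
source_features = [name.split("<=")[0] for name in binary_columns]

# Assign one integer group per source feature, in order of first appearance.
group_of = {}
group_index = np.array([group_of.setdefault(f, len(group_of)) for f in source_features])
print(group_index.tolist())  # [0, 0, 1, 1, 1]

# Example use: count how many original features a set of selected columns touches.
selected = [0, 1, 4]
groups_used = len(np.unique(group_index[selected]))
print(groups_used)  # 2
```

An index like this lets a downstream model limit the number of distinct original features it uses, rather than only the number of binary columns.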
- fit_transform(df: pandas.DataFrame) → pandas.DataFrame
Fit and transform on the same dataframe