Input/Output

Convert dataframe to one-hot matrix.

param df:

Input dataframe for which the rows are the features, and columns are the samples.

type df:

pd.DataFrame()

param dtypes:

Representation of the columns in the form of [‘cat’, ‘num’]. By default the dtype is determined based on the pandas dataframe.

type dtypes:

list of str or ‘pandas’, optional

param y_min:

Minimal number of samples that must be present in a group. All groups with fewer than y_min samples are labeled as _other_ and are not used in the enriching model. The default is None.

type y_min:

int [0..len(y)], optional
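The relabeling described for y_min can be illustrated with a small standalone sketch. This is not the library's implementation; the helper name relabel_rare is hypothetical and only mirrors the rule stated above (groups with fewer than y_min samples become _other_).

```python
from collections import Counter


def relabel_rare(values, y_min):
    """Relabel categories that occur fewer than y_min times as '_other_'.

    Hypothetical helper for illustration; not part of df2onehot.
    """
    counts = Counter(values)
    return [v if counts[v] >= y_min else "_other_" for v in values]


column = ["a", "a", "a", "b", "b", "c"]
relabeled = relabel_rare(column, y_min=2)
print(relabeled)  # ['a', 'a', 'a', 'b', 'b', '_other_']
```

Here only "c" occurs fewer than two times, so it is the only label that is replaced.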

param perc_min_num:

This parameter can be used to force variables to numeric when the fraction of unique non-zero values is above this percentage. The default is None. An alternative value is 0.8.

type perc_min_num:

float [None, 0..1], optional
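One plausible reading of perc_min_num is sketched below. The exact rule used by the library is not stated here, so the ratio definition (unique non-zero values divided by the number of values) and the helper name guess_dtype are assumptions made for illustration only.

```python
def guess_dtype(values, perc_min_num):
    """Return 'num' when the share of unique non-zero values exceeds perc_min_num.

    Hypothetical helper; the ratio definition is an assumption, not the
    library's documented rule.
    """
    if not values:
        return "cat"
    nonzero_unique = {v for v in values if v != 0}
    ratio = len(nonzero_unique) / len(values)
    return "num" if ratio > perc_min_num else "cat"


ages = [23, 35, 41, 58, 62, 19, 44, 37, 29, 51]   # all values differ
flags = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]            # a single non-zero value
print(guess_dtype(ages, 0.8))   # num
print(guess_dtype(flags, 0.8))  # cat
```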

param hot_only:

When True, the one-hot output matrix contains only the categorical values that are transformed to one-hot. The default is True.

type hot_only:

bool [True, False], optional

param deep_extract:

True: Extract information from a vector that contains a list/array/dict. False: The vector is converted to a string and treated as categorical [‘cat’].

type deep_extract:

bool [False, True] (default : False)

param remove_mutual_exclusive:

True: Remove the mutually exclusive groups. In binary features, False and 0 are excluded. False: Do nothing.

type remove_mutual_exclusive:

bool [False, True] (default : False)

param remove_multicollinearity:

True: Remove multicollinear columns by removing one column for each category that is converted into one-hot. False: Do nothing.

type remove_multicollinearity:

bool [False, True] (default : False)
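To see why dropping one column per category removes the redundancy, consider the standalone sketch below. It is not the library's code: the helper onehot is hypothetical, and the choice of which level to drop (the first in sorted order) is an assumption.

```python
def onehot(name, values, drop_first=False):
    """One-hot encode a column into {new_column_name: [0 or 1, ...]}.

    Hypothetical helper; which level gets dropped is an assumption.
    """
    levels = sorted(set(values))
    if drop_first:
        # The dropped level is implied when all remaining columns are 0.
        levels = levels[1:]
    return {f"{name}_{lvl}": [int(v == lvl) for v in values] for lvl in levels}


color = ["red", "green", "blue", "green"]
full = onehot("color", color)
reduced = onehot("color", color, drop_first=True)
print(list(full))     # ['color_blue', 'color_green', 'color_red']
print(list(reduced))  # ['color_green', 'color_red']
```

In the full encoding every row sums to 1, so any one column can be derived from the others; the reduced encoding carries the same information with one column less.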

param excl_background:

Remove values/strings that are listed in this parameter. As an example, the column [‘yes’, ‘no’, ‘yes’, ‘yes’, ‘no’, ‘unknown’, …] is split into ‘column_yes’, ‘column_no’ and ‘column_unknown’. If ‘unknown’ is listed, then ‘column_unknown’ is not transformed into a new one-hot column. The default is None (every possible name is converted into a one-hot column).

type excl_background:

list or None, [0], [0, ‘0.0’, ‘unknown’, ‘nan’, ‘None’ …], optional
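The yes/no/unknown example above can be reproduced with a minimal standalone sketch. The helper onehot_excluding is hypothetical and only mimics the described effect of excl_background; it is not how the library is implemented.

```python
def onehot_excluding(name, values, excl_background=None):
    """One-hot encode a column, skipping the levels listed in excl_background.

    Hypothetical helper for illustration; not part of df2onehot.
    """
    excluded = set(excl_background or [])
    # dict.fromkeys keeps the first-seen order of the unique levels.
    levels = [lvl for lvl in dict.fromkeys(values) if lvl not in excluded]
    return {f"{name}_{lvl}": [int(v == lvl) for v in values] for lvl in levels}


column = ["yes", "no", "yes", "yes", "no", "unknown"]
everything = onehot_excluding("column", column)
filtered = onehot_excluding("column", column, excl_background=["unknown"])
print(list(everything))  # ['column_yes', 'column_no', 'column_unknown']
print(list(filtered))    # ['column_yes', 'column_no']
```

Rows whose value is excluded stay in the data; they simply have 0 in every remaining one-hot column.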

param verbose:

Print messages to screen. The default is 3. 0: None, 1: ERROR, 2: WARN, 3: INFO (default), 4: DEBUG, 5: TRACE

type verbose:

int, optional

returns:
  • dict

  • numeric (DataFrame) – Input-dataframe with converted numerical values

  • onehot (DataFrame) – Input-dataframe with converted one-hot values. Note that continuous values are only removed if hot_only=True.

  • labx (list of str) – Input feature-labels or names

  • df (DataFrame) – Input-dataframe but with set dtypes. Note that df is extended if deep_extract=True

  • labels (list of str) – Column names of df

  • dtypes (list of str) – dtypes for the feature-labels for df in the form of ‘num’ (numerical) and/or ‘cat’ (categorical).

Examples

>>> import df2onehot
>>> df = df2onehot.import_example()
>>> out = df2onehot.df2onehot(df)