Input/Output

Convert dataframe to one-hot matrix.

param df

Input dataframe for which the rows are the features, and columns are the samples.

type df

pd.DataFrame()

param dtypes

Representation of the columns in the form of [‘cat’, ‘num’]. By default, the dtype is determined from the pandas dataframe.

type dtypes

list of str or ‘pandas’, optional

param y_min

Minimum number of samples that must be present in a group. All groups with fewer than y_min samples are labeled as _other_ and are not used in the enriching model. The default is None.

type y_min

int [0..len(y)], optional
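
The effect of y_min can be illustrated with a minimal plain-Python sketch. This is not the library's implementation; the helper name relabel_small_groups and the strict "fewer than y_min" comparison are assumptions based on the description above.

```python
from collections import Counter


def relabel_small_groups(values, y_min, other_label="_other_"):
    """Replace every category that has fewer than y_min samples by other_label.

    Hypothetical helper for illustration only; not part of df2onehot.
    """
    counts = Counter(values)
    return [v if counts[v] >= y_min else other_label for v in values]


# 'a' occurs 3 times, 'b' twice, 'c' once.
column = ["a", "a", "a", "b", "b", "c"]
relabeled = relabel_small_groups(column, y_min=2)
print(relabeled)
```

With y_min=2, only the group 'c' falls below the threshold and is relabeled.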

param perc_min_num

This parameter can be used to force variables to be numeric if the percentage of unique non-zero values is above this threshold. The default is None. An alternative value is 0.8.

type perc_min_num

float [None, 0..1], optional
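
One possible reading of perc_min_num is sketched below in plain Python. The helper looks_numeric is hypothetical, and the choice of denominator (the total number of values in the column) is an assumption; the library may compute the percentage differently.

```python
def looks_numeric(values, perc_min_num):
    """Return True if the share of unique non-zero values exceeds perc_min_num.

    Hypothetical helper for illustration only; not part of df2onehot.
    The ratio is taken over the total number of values (an assumption).
    """
    if not values:
        return False
    unique_nonzero = {v for v in values if v != 0}
    return len(unique_nonzero) / len(values) > perc_min_num


# Nine distinct non-zero values out of ten: ratio 0.9.
measurements = [1.2, 3.4, 0, 5.6, 7.8, 9.1, 2.2, 4.4, 6.6, 8.8]
# Two distinct non-zero values out of ten: ratio 0.2.
codes = [1, 2, 1, 2, 1, 2, 0, 1, 2, 1]

measurements_numeric = looks_numeric(measurements, perc_min_num=0.8)
codes_numeric = looks_numeric(codes, perc_min_num=0.8)
print(measurements_numeric, codes_numeric)
```

Under this reading, a column with mostly distinct values is treated as numeric, while a column with a few repeated codes remains categorical.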

param hot_only

When True, the one-hot matrix contains only the categorical values that are transformed to one-hot. The default is True.

type hot_only

bool [True, False], optional

param deep_extract

True: Extract information from a vector that contains a list/array/dict. False: The values are converted to a string and treated as categorical [‘cat’].

type deep_extract

bool [False, True] (default : False)

param excl_background

Remove values/strings that are listed here. As an example, the column [‘yes’, ‘no’, ‘yes’, ‘yes’, ‘no’, ‘unknown’, …] is split into ‘column_yes’, ‘column_no’ and ‘column_unknown’. If ‘unknown’ is listed, then ‘column_unknown’ is not transformed into a new one-hot column. The default is None (every possible name is converted into a one-hot column).

type excl_background

list or None, [0], [0, ‘0.0’, ‘unknown’, ‘nan’, ‘None’ …], optional
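
The yes/no/unknown example above can be sketched in plain Python as follows. The helper onehot_column is hypothetical and only mimics the described behaviour for a single column of strings; it is not the library's implementation.

```python
def onehot_column(name, values, excl_background=None):
    """One-hot encode a single column, skipping any value in excl_background.

    Hypothetical helper for illustration only; not part of df2onehot.
    Returns a dict that maps '<name>_<category>' to a list of 0/1 flags.
    """
    excluded = set(excl_background or [])
    categories = sorted({v for v in values if v not in excluded})
    return {f"{name}_{cat}": [int(v == cat) for v in values] for cat in categories}


values = ["yes", "no", "yes", "yes", "no", "unknown"]

# Default (None): every distinct value becomes a one-hot column.
full = onehot_column("column", values)
# With 'unknown' listed, no 'column_unknown' column is created.
reduced = onehot_column("column", values, excl_background=["unknown"])

print(list(full))
print(list(reduced))
```

The rows that held an excluded value simply carry zeros in all remaining one-hot columns.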

param verbose

Print messages to screen. The default is 3. 0: no messages, 1: ERROR, 2: WARN, 3: INFO (default), 4: DEBUG, 5: TRACE

type verbose

int, optional

returns
  • dict

  • numeric (DataFrame) – Input-dataframe with converted numerical values

  • onehot (DataFrame) – Input-dataframe with converted one-hot values. Note that continuous values are only removed if hot_only=True.

  • labx (list of str) – Input feature-labels or names

  • df (DataFrame) – Input-dataframe but with set dtypes. Note that df is extended if deep_extract=True

  • labels (list of str) – Column names of df

  • dtypes (list of str) – dtypes for the feature-labels for df in the form of ‘num’ (numerical) and/or ‘cat’ (categorical).

Examples

>>> import df2onehot
>>> df = df2onehot.import_example()
>>> out = df2onehot.df2onehot(df)