Convert dataframe to one-hot matrix.

df2onehot.df2onehot.df2onehot(df, dtypes='pandas', y_min=None, perc_min_num=None, hot_only=True, deep_extract=False, excl_background=None, verbose=3)

Convert dataframe to one-hot matrix.

Parameters
  • df (pd.DataFrame()) – Input dataframe for which the rows are the features, and columns are the samples.

  • dtypes (list of str or 'pandas', optional) – Representation of the columns in the form of [‘cat’, ‘num’]. By default the dtype is determined from the pandas dataframe.

  • y_min (int [0..len(y)], optional) – Minimal number of samples that must be present in a group. All groups with fewer than y_min samples are labeled as _other_ and are not used in the enriching model. The default is None.

  • perc_min_num (float [None, 0..1], optional) – Force a variable to be treated as numeric if its fraction of unique non-zero values is above this percentage. The default is None. An alternative value is 0.8.

  • hot_only (bool [True, False], optional) – When True, the output one-hot matrix exclusively contains categorical values that are transformed to one-hot. The default is True.

  • deep_extract (bool [False, True] (default : False)) – True: extract information from a vector that contains a list/array/dict. False: such values are converted to a string and treated as categorical [‘cat’].

  • excl_background (list or None, [0], [0, '0.0', 'unknown', 'nan', 'None' ...], optional) – Remove the values/strings that are listed. As an example, the following column: [‘yes’, ‘no’, ‘yes’, ‘yes’, ‘no’, ‘unknown’, …], is split into ‘column_yes’, ‘column_no’ and ‘column_unknown’. If ‘unknown’ is listed, then ‘column_unknown’ is not transformed into a new one-hot column. The default is None (every possible name is converted into a one-hot column).

  • verbose (int, optional) – Print message to screen. The default is 3. 0: None, 1: ERROR, 2: WARN, 3: INFO (default), 4: DEBUG, 5: TRACE
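
The effect of excl_background and y_min described above can be illustrated with a small pure-Python sketch. This is not the library's implementation: the helpers relabel_rare and one_hot are hypothetical names introduced here, and only the ‘column_<label>’ naming and the _other_ label are taken from the parameter descriptions.

```python
from collections import Counter


def relabel_rare(values, y_min=None, other="_other_"):
    """Replace labels that occur fewer than y_min times with `other`.

    Hypothetical helper: mimics the y_min description, not the library code.
    """
    if y_min is None:
        return list(values)
    counts = Counter(values)
    return [v if counts[v] >= y_min else other for v in values]


def one_hot(values, name, excl_background=None):
    """Build one 0/1 column per distinct label, skipping excluded labels.

    Hypothetical helper: mimics the excl_background description.
    """
    excluded = {str(v) for v in (excl_background or [])}
    columns = {}
    for label in sorted({str(v) for v in values}):
        if label in excluded:
            continue  # background label: no one-hot column is created
        columns[f"{name}_{label}"] = [int(str(v) == label) for v in values]
    return columns


answers = ["yes", "no", "yes", "yes", "no", "unknown"]

full = one_hot(answers, "column")
trimmed = one_hot(answers, "column", excl_background=["unknown"])
grouped = relabel_rare(answers, y_min=2)

print(sorted(full))     # ['column_no', 'column_unknown', 'column_yes']
print(sorted(trimmed))  # ['column_no', 'column_yes']
print(grouped)          # ['yes', 'no', 'yes', 'yes', 'no', '_other_']
```

With excl_background=['unknown'] the ‘column_unknown’ column is simply never created, and with y_min=2 the single ‘unknown’ sample is folded into _other_.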

Returns

  • dict – Dictionary with the following keys:

  • numeric (DataFrame) – Input-dataframe with converted numerical values

  • onehot (DataFrame) – Input-dataframe with converted one-hot values. Note that continuous values are only removed if hot_only=True.

  • labx (list of str) – Input feature-labels or names

  • df (DataFrame) – Input-dataframe but with set dtypes. Note that df is extended if deep_extract=True

  • labels (list of str) – Column names of df

  • dtypes (list of str) – dtypes for the feature-labels for df in the form of ‘num’ (numerical) and/or ‘cat’ (categorical).
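
How a threshold such as perc_min_num could decide between ‘num’ and ‘cat’ can be sketched as follows. This is one possible reading of the parameter description, not the library's actual logic: the helper names are hypothetical, and it is assumed that the fraction is the number of distinct non-zero values divided by the total number of values.

```python
def unique_nonzero_fraction(values):
    """Fraction of entries accounted for by distinct non-zero values.

    Assumption: the denominator is the total number of values.
    """
    if not values:
        return 0.0
    nonzero = {v for v in values if v != 0}
    return len(nonzero) / len(values)


def treat_as_numeric(values, perc_min_num=None):
    """Return True if the column would be forced to numeric (hypothetical rule)."""
    if perc_min_num is None:
        return False  # no forcing requested
    return unique_nonzero_fraction(values) > perc_min_num


ages = [22, 38, 26, 35, 54, 2, 27, 14, 4, 58]  # 10 distinct non-zero values
pclass = [3, 1, 3, 1, 3, 3, 1, 3, 2, 3]        # only 3 distinct values

print(treat_as_numeric(ages, perc_min_num=0.8))    # True  (1.0 > 0.8)
print(treat_as_numeric(pclass, perc_min_num=0.8))  # False (0.3 <= 0.8)
```

Under this reading, a column with many distinct values behaves like a continuous variable, while a column with a few repeated codes stays categorical.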

Examples

>>> import df2onehot
>>> df = df2onehot.import_example()
>>> out = df2onehot.df2onehot(df)
>>> onehot = out['onehot']  # one-hot encoded DataFrame (see Returns)

df2onehot.df2onehot.dict2df(dfc)

df2onehot.df2onehot.import_example(data='titanic', url=None, sep=',', verbose=3)

Import example dataset from github source.

Description

Import one of several example datasets from the GitHub source, or specify your own download URL.

Parameters
  • data (str) – Name of the dataset: ‘sprinkler’, ‘titanic’, ‘student’, ‘fifa’, ‘cancer’, ‘waterpump’.

  • url (str) – URL link to the dataset.

  • verbose (int, (default: 3)) – Print message to screen.

Returns

  • pd.DataFrame() – Dataset containing mixed features.