Convert dataframe to one-hot matrix.
- df2onehot.df2onehot.df2onehot(df, dtypes='pandas', y_min=None, perc_min_num=None, hot_only=True, deep_extract=False, excl_background=None, verbose=3)
Convert dataframe to one-hot matrix.
- Parameters
df (pd.DataFrame()) – Input dataframe for which the rows are the features, and columns are the samples.
dtypes (list of str or 'pandas', optional) – Representation of the columns in the form of [‘cat’, ‘num’]. By default the dtype is determined based on the pandas dataframe.
y_min (int [0..len(y)], optional) – Minimal number of samples that must be present in a group. All groups with fewer than y_min samples are labeled as _other_ and are not used in the enriching model. The default is None.
perc_min_num (float [None, 0..1], optional) – Forces a variable to be treated as numeric when the percentage of its unique non-zero values is above this threshold. The default is None. An alternative value is 0.8.
hot_only (bool [True, False], optional) – When True, the one-hot output exclusively contains the categorical values that are transformed to one-hot. The default is True.
deep_extract (bool [False, True], optional) – True: extract information from a vector that contains a list/array/dict. False: the vector is converted to a string and treated as categorical [‘cat’]. The default is False.
excl_background (list or None, [0], [0, '0.0', 'unknown', 'nan', 'None' ...], optional) – Remove the values/strings that are listed. As an example, the column [‘yes’, ‘no’, ‘yes’, ‘yes’, ‘no’, ‘unknown’, …] is split into ‘column_yes’, ‘column_no’ and ‘column_unknown’. If ‘unknown’ is listed, then ‘column_unknown’ is not transformed into a new one-hot column. The default is None (every possible name is converted into a one-hot column).
verbose (int, optional) – Print message to screen. The default is 3. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE
- Returns
dict
numeric (DataFrame) – Input-dataframe with converted numerical values
onehot (DataFrame) – Input-dataframe with converted one-hot values. Note that continuous values are only removed if hot_only=True.
labx (list of str) – Input feature-labels or names
df (DataFrame) – Input-dataframe but with set dtypes. Note that df is extended if deep_extract=True
labels (list of str) – Column names of df
dtypes (list of str) – dtypes for the feature-labels for df in the form of ‘num’ (numerical) and/or ‘cat’ (categorical).
Examples
>>> import df2onehot
>>> df = df2onehot.import_example()
>>> out = df2onehot.df2onehot(df)
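The effect of y_min and excl_background on a single categorical column can be illustrated with a small pure-Python sketch. This is not the library's implementation: the helper name onehot_column and its return format (a dict of boolean lists) are assumptions made for illustration only. It follows the column-naming example given for excl_background and the _other_ relabeling described for y_min. Whether df2onehot itself emits a separate column for _other_ is not stated above, so treat that part as an assumption.

```python
from collections import Counter


def onehot_column(name, values, y_min=None, excl_background=None):
    """Illustrative one-hot encoding of one categorical column.

    Hypothetical helper, NOT part of df2onehot.
    """
    values = [str(v) for v in values]

    # Groups with fewer than y_min samples are relabeled as '_other_'.
    if y_min is not None:
        counts = Counter(values)
        values = [v if counts[v] >= y_min else "_other_" for v in values]

    # Categories listed in excl_background get no one-hot column.
    excluded = {str(v) for v in (excl_background or [])}
    categories = sorted(set(values) - excluded)

    # One output column per remaining category, named '<column>_<category>'.
    return {f"{name}_{cat}": [v == cat for v in values] for cat in categories}


column = ["yes", "no", "yes", "yes", "no", "unknown"]

# 'unknown' is treated as background, so no 'column_unknown' is created.
encoded = onehot_column("column", column, excl_background=["unknown"])
print(list(encoded))  # ['column_no', 'column_yes']

# With y_min=2 the single 'unknown' sample is relabeled as '_other_'.
rare = onehot_column("column", column, y_min=2)
print(list(rare))
```

With excl_background=None, the same column would also yield ‘column_unknown’, which matches the default behavior described for that parameter.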
- df2onehot.df2onehot.dict2df(dfc)
- df2onehot.df2onehot.import_example(data='titanic', url=None, sep=',', verbose=3)
Import example dataset from github source.
Description
Import one of the few datasets from github source or specify your own download url link.
- Parameters
data (str) – Name of the dataset: ‘sprinkler’, ‘titanic’, ‘student’, ‘fifa’, ‘cancer’, ‘waterpump’.
url (str) – URL link to the dataset.
verbose (int, optional) – Print message to screen. The default is 3.
- Returns
Dataset containing mixed features.
- Return type
pd.DataFrame()