
Convert dataframe to one-hot matrix.

df2onehot.df2onehot.df2onehot(df, dtypes='pandas', y_min=2, perc_min_num=None, hot_only=True, deep_extract=False, excl_background=None, remove_mutual_exclusive=False, remove_multicollinearity=False, verbose=3)

Parameters:

df (pd.DataFrame()) – Input dataframe for which the rows are the samples and the columns are the features.
dtypes (list of str or 'pandas', optional) – Representation of the columns in the form of [‘cat’, ‘num’]. By default the dtype is determined from the pandas dataframe.
y_min (int [0..len(y)], optional) – Minimum number of samples that must be present in a group. All groups with fewer than y_min samples are labeled as _other_ and are not used in the enriching model. The default is 2.
perc_min_num (float [None, 0..1], optional) – This parameter can be used to force variables to numeric if the fraction of unique non-zero values is above this threshold. The default is None. A possible alternative is 0.8.
hot_only (bool [True, False], optional) – When True, the one-hot output contains only the categorical values that are transformed to one-hot. The default is True.
deep_extract (bool [False, True] (default : False)) – True: Extract information from a vector that contains a list/array/dict. False: The vector is converted to a string and treated as categorical [‘cat’].
remove_mutual_exclusive (bool [False, True] (default : False)) – True: Remove the mutually exclusive groups. In binary features, False and 0 are excluded. False: Do nothing.
remove_multicollinearity (bool [False, True] (default : False)) – True: Remove multicollinear columns by removing one column for each category that is converted into one-hot. False: Do nothing.
excl_background (list or None, [0], [0, '0.0', 'unknown', 'nan', 'None' ...], optional) – Remove the values/strings that are listed. As an example, the column [‘yes’, ‘no’, ‘yes’, ‘yes’, ‘no’, ‘unknown’, …] is split into ‘column_yes’, ‘column_no’ and ‘column_unknown’. If ‘unknown’ is listed, then ‘column_unknown’ is not transformed into a new one-hot column. The default is None (every possible name is converted into a one-hot column).
verbose (int, optional) – Print messages to screen. The default is 3. 0: no messages, 1: ERROR, 2: WARN, 3: INFO (default), 4: DEBUG, 5: TRACE
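
The effect of y_min and excl_background on a single categorical column can be sketched in plain Python. This is only an illustration of the behaviour described in the parameter list, not the implementation used by df2onehot: the helper name onehot_column is hypothetical, and the sketch assumes that rare groups are kept as their own ‘_other_’ column.

```python
from collections import Counter


def onehot_column(name, values, y_min=2, excl_background=None):
    """Illustrative one-hot encoding of one categorical column.

    Hypothetical helper for this sketch; it is not part of df2onehot.
    """
    excluded = set() if excl_background is None else {str(v) for v in excl_background}
    values = [str(v) for v in values]
    counts = Counter(values)
    # Groups with fewer than y_min samples are relabeled as '_other_'.
    relabeled = [v if counts[v] >= y_min else "_other_" for v in values]
    # One output column per remaining group, skipping the background labels.
    categories = sorted(set(relabeled) - excluded)
    return {f"{name}_{cat}": [v == cat for v in relabeled] for cat in categories}


column = ["yes", "no", "yes", "yes", "no", "unknown"]

# 'unknown' is listed as background, so no 'column_unknown' is created.
encoded = onehot_column("column", column, y_min=1, excl_background=["unknown"])
for label, flags in encoded.items():
    print(label, flags)

# With y_min=2 the single 'unknown' sample falls into the '_other_' group.
rare = onehot_column("column", column, y_min=2)
print(list(rare))
```

In the first call only ‘column_no’ and ‘column_yes’ are produced; in the second call the lone ‘unknown’ value no longer gets a column of its own.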

Returns:

dict
numeric (DataFrame) – Input-dataframe with converted numerical values
onehot (DataFrame) – Input-dataframe with converted one-hot values. Note that continuous values are only removed if hot_only=True.
labx (list of str) – Input feature-labels or names
df (DataFrame) – Input-dataframe but with set dtypes. Note that df is extended if deep_extract=True
labels (list of str) – Column names of df
dtypes (list of str) – dtypes for the feature-labels for df in the form of ‘num’ (numerical) and/or ‘cat’ (categorical).
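
One way to use the returned labels and dtypes together is to group the column names by their tag. The sketch below assumes that the two lists are index-aligned (one dtype per label), which the descriptions above suggest but do not state explicitly. The helper split_by_dtype and the example column names are hypothetical stand-ins, not output of df2onehot.

```python
def split_by_dtype(labels, dtypes):
    """Group column names by their 'cat' / 'num' tag (hypothetical helper)."""
    if len(labels) != len(dtypes):
        raise ValueError("labels and dtypes must have the same length")
    groups = {"cat": [], "num": []}
    for label, dtype in zip(labels, dtypes):
        # setdefault keeps the sketch tolerant of tags other than 'cat'/'num'.
        groups.setdefault(dtype, []).append(label)
    return groups


# Made-up stand-ins for out['labels'] and out['dtypes'].
labels = ["colour", "height", "city", "weight"]
dtypes = ["cat", "num", "cat", "num"]

groups = split_by_dtype(labels, dtypes)
print(groups["cat"])
print(groups["num"])
```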

Examples

>>> import df2onehot
>>> df = df2onehot.import_example()
>>> out = df2onehot.df2onehot(df)

df2onehot.df2onehot.dict2df(dfc)

df2onehot.df2onehot.import_example(data='titanic', url=None, sep=',', overwrite=False, verbose=3)

Import example dataset from GitHub source.

Import one of the available example datasets from GitHub, or specify your own download URL.

Parameters:

data (str) – Name of the dataset: ‘sprinkler’, ‘titanic’, ‘student’, ‘fifa’, ‘cancer’, ‘waterpump’
url (str) – URL of the dataset.
verbose (int, (default: 3)) – Print message to screen.

Returns:

Dataset containing mixed features.

Return type:

pd.DataFrame()

df2onehot.df2onehot.make_elements_unique(X)