Process Mixed dataset

In the following example we load the Titanic dataset, and use df2onehot to convert it towards a structured dataset.

# Load library
from df2onehot import df2onehot, import_example

# Import Titanic dataset
df = import_example(data="titanic")

print(df)
#      PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
# 0              1         0       3  ...   7.2500   NaN         S
# 1              2         1       1  ...  71.2833   C85         C
# 2              3         1       3  ...   7.9250   NaN         S
# 3              4         1       1  ...  53.1000  C123         S
# 4              5         0       3  ...   8.0500   NaN         S
# ..           ...       ...     ...  ...      ...   ...       ...
# 886          887         0       2  ...  13.0000   NaN         S
# 887          888         1       1  ...  30.0000   B42         S
# 888          889         0       3  ...  23.4500   NaN         S
# 889          890         1       1  ...  30.0000  C148         C
# 890          891         0       3  ...   7.7500   NaN         Q
#
# [891 rows x 12 columns]

# Convert the matrix into a structured datasset
results = df2onehot(df)

print(results.keys())
# dict_keys(['numeric', 'dtypes', 'onehot', 'labx', 'df', 'labels'])

# The onehot array exploded into 2637 features!
print(results['onehot'].shape)
# (891, 2637)

# The reason for the large onehot dataset is, among others incorrect typing of variables.
# Columns such as PassengerId should be removed because now these are typed as categorical.
# Prevent that integer variables are typed as categorical.

print(np.c_[results['labels'],results['dtypes']])
# array([['PassengerId', 'cat'],
#        ['Survived', 'cat'],
#        ['Pclass', 'cat'],
#        ['Name', 'cat'],
#        ['Sex', 'cat'],
#        ['Age', 'num'],
#        ['SibSp', 'cat'],
#        ['Parch', 'cat'],
#        ['Ticket', 'cat'],
#        ['Fare', 'num'],
#        ['Cabin', 'cat'],
#        ['Embarked', 'cat'],
#        ['all_true', 'cat']], dtype=object)

Force categorical values into numeric

We can force variables to be numeric if the number of unique values are above the given percentage: 80%. Or in other words, if a variable contains more then 80% unique values, it is set as numerical.

# Set the parameter to force columns into numerical dtypes
results = df2onehot(df, perc_min_num=0.8)

# Also remove categorical features for which less then 2 values exists.
results = df2onehot(df, perc_min_num=0.8, y_min=2)

# Check whether the dtypes are correct.
# PassengerId, Age and Fare are set as numerical, and the rest categorical.

print(np.c_[results['labels'],results['dtypes']])

# [['PassengerId' 'num']
#  ['Survived' 'cat']
#  ['Pclass' 'cat']
#  ['Name' 'cat']
#  ['Sex' 'cat']
#  ['Age' 'num']
#  ['SibSp' 'cat']
#  ['Parch' 'cat']
#  ['Ticket' 'cat']
#  ['Fare' 'num']
#  ['Cabin' 'cat']
#  ['Embarked' 'cat']
#  ['all_true' 'cat']]

# If we look at our one hot dense array, we notice that behind each column the sub-category is added.
print(results['onehot'])

#      Survived_0.0  Survived_1.0  Pclass_1.0  ...  Embarked_Q  Embarked_S  all_true
# 0            True         False       False  ...       False        True      True
# 1           False          True        True  ...       False       False      True
# 2           False          True       False  ...       False        True      True
# 3           False          True        True  ...       False        True      True
# 4            True         False       False  ...       False        True      True
# ..            ...           ...         ...  ...         ...         ...       ...
# 886          True         False       False  ...       False        True      True
# 887         False          True        True  ...       False        True      True
# 888          True         False       False  ...       False        True      True
# 889         False          True        True  ...       False       False      True
# 890          True         False       False  ...        True       False      True
#
# [891 rows x 206 columns]

Exclude redundant variables

We can make further clean the data by removing mutually exclusive columns. As an example, the column Survived is split into Survived_0.0 and Survived_1.0 but the column Survived_0.0 may not be so relevant. With the parameter excl_background we can ignore the labels that are put begin the columns.

# Ignore specific subcategories
results = df2onehot(df, perc_min_num=0.8, y_min=2, excl_background=['0.0'])

# The final shape of our structured dataset is:
results['onehot'].shape
(891, 203)

# The original variable names can be found here:
results['labx']

Exclude sparse variables

By converting categorical values into one-hot dense arrays, it can easily occur that certain variables will only contain a single or few True values. We can use the y_min functionality to remove such columns.

# We can tune the ``y_min`` parameter further remove even more columns.
results = df2onehot(df, perc_min_num=0.8, y_min=5, excl_background=['0.0'])

# The final shape of our structured dataset is:
results['onehot'].shape
(891, 29)

We still need to manually remove the identifier column and then we are ready to go for analysis!

Custom dtypes

In the following example we load the fifa dataset and structure the dataset.

# Load library
from df2onehot import df2onehot, import_example

# Import Fifa dataset
df = import_example('sprinkler')

# Custom typing of the columns
results = df2onehot(df, dtypes=['cat','cat','cat','cat'], excl_background=['0.0'])

Extracting nested columns

In certain cases, it can occur that your columns are nested with lists and dictionaries. With the deep_extract functionality it is possible to easily structure such columns. Let’s compare the results with and without the deep_extract functionality.

# Load library
from df2onehot import df2onehot, import_example

# Import complex dataframe containing lists in lists
df = import_example('complex')

#
#           feat_1   feat_2
# 0         [3, 4]  [4, 45]
# 1            NaN      NaN
# 2   [5, 6, 7, 8]      NaN
# 3            NaN      NaN
# 4            NaN      NaN
# 5             10      NaN
# 6            NaN      NaN
# 7            NaN      NaN
# 8            NaN      NaN
# 9            NaN      NaN
# 10           NaN      NaN
# 11           NaN      NaN
# 12           NaN      NaN
# 13           NaN      NaN
# 14           NaN      NaN
# 15             1        1
# 16           NaN      NaN
# 17           NaN      NaN
# 18           NaN      NaN
# 19           NaN      NaN
# 20    [9, 11, 4]       10
# 21           NaN      NaN
# 22           NaN      NaN
# 23           NaN      NaN
# 24           NaN      NaN

Without deep extract

Convert to onehot dense-array without using the deep_extract function. The result is a dataframe where each nested element is used as a new column name.

results = df2onehot(df, deep_extract=False)

# print
print(results['onehot'])

#     feat_1_1  feat_1_10  ...  feat_2_None  feat_2_['4', '45']
# 0      False      False  ...        False                True
# 1      False      False  ...         True               False
# 2      False      False  ...         True               False
# 3      False      False  ...         True               False
# 4      False      False  ...         True               False
# 5      False       True  ...         True               False
# 6      False      False  ...         True               False
# 7      False      False  ...         True               False
# 8      False      False  ...         True               False
# ...
# [25 rows x 10 columns]

With deep extract

With deep_extract=True, each element is analyzed whether it contains lists or dictionaries and structured accordingly. If a column name already exists, the value is added into that column for the specific row.

# Convert to onehot dense-array with the ``deep_extract=True`` function
results = df2onehot(df, deep_extract=True)

# print
print(results['onehot'])

#       1     10     11      3      4      5      6      7      8      9     45
# 0   False  False  False   True   True  False  False  False  False  False   True
# 1   False  False  False  False  False  False  False  False  False  False  False
# 2   False  False  False  False  False   True   True   True   True  False  False
# 3   False  False  False  False  False  False  False  False  False  False  False
# 4   False  False  False  False  False  False  False  False  False  False  False
# 5   False   True  False  False  False  False  False  False  False  False  False
# 6   False  False  False  False  False  False  False  False  False  False  False
# 7   False  False  False  False  False  False  False  False  False  False  False
# 8   False  False  False  False  False  False  False  False  False  False  False
# ...
# [25 rows x 11 columns]


# Lets print only the relevant rows.
idx = results['onehot'].sum(axis=1)>0
print(results['onehot'].loc[idx,:])
#       1     10     11      3      4      5      6      7      8      9     45
# 0   False  False  False   True   True  False  False  False  False  False   True
# 2   False  False  False  False  False   True   True   True   True  False  False
# 5   False   True  False  False  False  False  False  False  False  False  False
# 15   True  False  False  False  False  False  False  False  False  False  False
# 20  False   True   True  False   True  False  False  False  False   True  False