Process Mixed dataset
In the following example we load the Titanic dataset and use df2onehot
to convert it into a structured dataset.
# Load libraries (numpy is used further down to print labels next to dtypes)
import numpy as np
from df2onehot import df2onehot, import_example
# Import Titanic dataset
df = import_example(data="titanic")
print(df)
# PassengerId Survived Pclass ... Fare Cabin Embarked
# 0 1 0 3 ... 7.2500 NaN S
# 1 2 1 1 ... 71.2833 C85 C
# 2 3 1 3 ... 7.9250 NaN S
# 3 4 1 1 ... 53.1000 C123 S
# 4 5 0 3 ... 8.0500 NaN S
# .. ... ... ... ... ... ... ...
# 886 887 0 2 ... 13.0000 NaN S
# 887 888 1 1 ... 30.0000 B42 S
# 888 889 0 3 ... 23.4500 NaN S
# 889 890 1 1 ... 30.0000 C148 C
# 890 891 0 3 ... 7.7500 NaN Q
#
# [891 rows x 12 columns]
# Convert the matrix into a structured dataset
results = df2onehot(df)
print(results.keys())
# dict_keys(['numeric', 'dtypes', 'onehot', 'labx', 'df', 'labels'])
# The onehot array exploded into 2637 features!
print(results['onehot'].shape)
# (891, 2637)
# The one-hot dataset is this large partly because some variables are typed incorrectly.
# Columns such as PassengerId are now typed as categorical, so every unique value becomes its own column.
# Such integer variables should be removed or prevented from being typed as categorical.
print(np.c_[results['labels'],results['dtypes']])
# array([['PassengerId', 'cat'],
# ['Survived', 'cat'],
# ['Pclass', 'cat'],
# ['Name', 'cat'],
# ['Sex', 'cat'],
# ['Age', 'num'],
# ['SibSp', 'cat'],
# ['Parch', 'cat'],
# ['Ticket', 'cat'],
# ['Fare', 'num'],
# ['Cabin', 'cat'],
# ['Embarked', 'cat'],
# ['all_true', 'cat']], dtype=object)
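To make the explosion in features concrete, here is a minimal sketch in plain Python (no df2onehot or pandas involved). It assumes that a one-hot encoding creates one column per distinct value of a categorical variable; `onehot_width` and the toy data are hypothetical and only stand in for a few Titanic-like columns.

```python
# Hypothetical helper: estimate how many one-hot columns each categorical
# variable would produce, assuming one column per distinct value.
def onehot_width(columns):
    """columns: dict mapping column name -> list of values."""
    return {name: len(set(values)) for name, values in columns.items()}


# Toy data standing in for a few Titanic-like columns (not the real dataset).
toy = {
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Pclass": [3, 1, 3, 1, 3, 2],
}

widths = onehot_width(toy)
print(widths)                # {'PassengerId': 6, 'Sex': 2, 'Pclass': 3}
print(sum(widths.values()))  # 11
```

An identifier column contributes as many columns as there are rows, which is why typing it as categorical inflates the result so quickly.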
Force categorical values into numeric
We can force variables to be numeric if the fraction of unique values is above a given threshold, here 80%. In other words, if more than 80% of the values in a variable are unique, it is set as numerical.
# Set the parameter to force columns into numerical dtypes
results = df2onehot(df, perc_min_num=0.8)
# Also remove categorical features for which fewer than 2 values exist.
results = df2onehot(df, perc_min_num=0.8, y_min=2)
# Check whether the dtypes are correct.
# PassengerId, Age and Fare are set as numerical, and the rest categorical.
print(np.c_[results['labels'],results['dtypes']])
# [['PassengerId' 'num']
# ['Survived' 'cat']
# ['Pclass' 'cat']
# ['Name' 'cat']
# ['Sex' 'cat']
# ['Age' 'num']
# ['SibSp' 'cat']
# ['Parch' 'cat']
# ['Ticket' 'cat']
# ['Fare' 'num']
# ['Cabin' 'cat']
# ['Embarked' 'cat']
# ['all_true' 'cat']]
# If we look at the one-hot dense array, we see that the sub-category is appended to each column name.
print(results['onehot'])
# Survived_0.0 Survived_1.0 Pclass_1.0 ... Embarked_Q Embarked_S all_true
# 0 True False False ... False True True
# 1 False True True ... False False True
# 2 False True False ... False True True
# 3 False True True ... False True True
# 4 True False False ... False True True
# .. ... ... ... ... ... ... ...
# 886 True False False ... False True True
# 887 False True True ... False True True
# 888 True False False ... False True True
# 889 False True True ... False False True
# 890 True False False ... True False True
#
# [891 rows x 206 columns]
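The unique-value rule behind perc_min_num can be sketched as follows. This is an illustration of the idea, not df2onehot's actual implementation: `guess_dtype` is a hypothetical helper, and it assumes the rule is only applied to values that are already numbers, which is one reading of the output above, where Name stays categorical even though nearly every name is unique.

```python
# Hypothetical sketch of a unique-ratio typing rule (not the library's code).
def guess_dtype(values, perc_min_num=0.8):
    """Return 'num' when the values are numeric and mostly unique, else 'cat'."""
    non_missing = [v for v in values if v is not None]
    if not non_missing:
        return "cat"
    # Assumption: only int/float values are candidates for 'num'.
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in non_missing
    )
    unique_ratio = len(set(non_missing)) / len(non_missing)
    return "num" if all_numeric and unique_ratio > perc_min_num else "cat"


passenger_id = list(range(1, 11))             # 10 of 10 values are unique
pclass = [3, 1, 3, 1, 3, 2, 3, 1, 3, 2]       # 3 of 10 values are unique
name = ["passenger_%d" % i for i in range(10)]  # unique, but not numeric

print(guess_dtype(passenger_id))  # num
print(guess_dtype(pclass))        # cat
print(guess_dtype(name))          # cat
```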
Exclude redundant variables
We can further clean the data by removing mutually exclusive columns.
As an example, the column Survived is split into Survived_0.0 and Survived_1.0, but the column Survived_0.0 may not be so relevant. With the parameter excl_background we can ignore the labels that are appended to the column names.
# Ignore specific subcategories
results = df2onehot(df, perc_min_num=0.8, y_min=2, excl_background=['0.0'])
# The final shape of our structured dataset is:
print(results['onehot'].shape)
# (891, 203)
# The original variable names can be found here:
results['labx']
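The effect of excl_background can be pictured as a filter on the one-hot column names. The sketch below is hypothetical and works on a plain list of names; it assumes the names follow the `<variable>_<label>` pattern visible in the output above, and it does not claim to mirror how df2onehot matches labels internally.

```python
# Hypothetical sketch: drop one-hot columns whose label part is a background label.
def drop_background(onehot_columns, excl_background):
    # Assumption: column names look like "<variable>_<label>".
    suffixes = tuple("_" + label for label in excl_background)
    return [c for c in onehot_columns if not c.endswith(suffixes)]


cols = [
    "Survived_0.0", "Survived_1.0",
    "Pclass_1.0", "Pclass_2.0", "Pclass_3.0",
    "Embarked_Q", "Embarked_S",
]

kept = drop_background(cols, ["0.0"])
print(kept)
# ['Survived_1.0', 'Pclass_1.0', 'Pclass_2.0', 'Pclass_3.0', 'Embarked_Q', 'Embarked_S']
```

Including the underscore in the suffix keeps a label such as `10.0` from being matched by `0.0`.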
Exclude sparse variables
By converting categorical values into one-hot dense arrays, it can easily occur that certain variables contain only a single or a few True values. We can use the y_min functionality to remove such columns.
# We can tune the ``y_min`` parameter to remove even more columns.
results = df2onehot(df, perc_min_num=0.8, y_min=5, excl_background=['0.0'])
# The final shape of our structured dataset is:
print(results['onehot'].shape)
# (891, 29)
We still need to manually remove the identifier column and then we are ready to go for analysis!
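The y_min filter described above can be approximated in a few lines. This is a sketch under the assumption that a column is kept when it has at least y_min True values, which matches the earlier comment about removing features with fewer than 2 values; `drop_sparse` and the toy columns are hypothetical.

```python
# Hypothetical sketch: keep only one-hot columns with at least y_min True values.
def drop_sparse(onehot, y_min):
    """onehot: dict mapping column name -> list of booleans."""
    return {name: col for name, col in onehot.items() if sum(col) >= y_min}


toy_onehot = {
    "Cabin_C85":  [True, False, False, False, False, False],   # 1 True value
    "Embarked_S": [True, False, True, True, True, True],       # 5 True values
    "Sex_female": [False, True, True, True, False, False],     # 3 True values
}

print(list(drop_sparse(toy_onehot, y_min=2)))  # ['Embarked_S', 'Sex_female']
print(list(drop_sparse(toy_onehot, y_min=5)))  # ['Embarked_S']
```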
Custom dtypes
In the following example we load the sprinkler dataset and structure it with custom dtypes.
# Load library
from df2onehot import df2onehot, import_example
# Import sprinkler dataset
df = import_example('sprinkler')
# Custom typing of the columns
results = df2onehot(df, dtypes=['cat','cat','cat','cat'], excl_background=['0.0'])
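Writing the dtypes list by hand becomes error-prone for wider tables. The helper below is a hypothetical convenience, not part of df2onehot; it assumes the dtypes list is matched to the columns by position with one entry per column, as the four-entry list above suggests. The column names are an assumption about the sprinkler example and are used only for illustration.

```python
# Hypothetical helper: build a positional dtypes list from the column names.
def build_dtypes(columns, numeric=()):
    """Return 'num' for columns listed in `numeric` and 'cat' for all others."""
    return ["num" if c in numeric else "cat" for c in columns]


# Assumed column names of the sprinkler example (illustrative only).
columns = ["Cloudy", "Sprinkler", "Rain", "Wet_Grass"]

dtypes = build_dtypes(columns)
print(dtypes)  # ['cat', 'cat', 'cat', 'cat']

# With a real DataFrame this could be used as:
# results = df2onehot(df, dtypes=build_dtypes(df.columns), excl_background=['0.0'])
```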
Extracting nested columns
In certain cases, your columns may contain nested lists and dictionaries.
With the deep_extract functionality it is possible to easily structure such columns.
Let’s compare the results with and without the deep_extract functionality.
# Load library
from df2onehot import df2onehot, import_example
# Import complex dataframe containing lists in lists
df = import_example('complex')
#
# feat_1 feat_2
# 0 [3, 4] [4, 45]
# 1 NaN NaN
# 2 [5, 6, 7, 8] NaN
# 3 NaN NaN
# 4 NaN NaN
# 5 10 NaN
# 6 NaN NaN
# 7 NaN NaN
# 8 NaN NaN
# 9 NaN NaN
# 10 NaN NaN
# 11 NaN NaN
# 12 NaN NaN
# 13 NaN NaN
# 14 NaN NaN
# 15 1 1
# 16 NaN NaN
# 17 NaN NaN
# 18 NaN NaN
# 19 NaN NaN
# 20 [9, 11, 4] 10
# 21 NaN NaN
# 22 NaN NaN
# 23 NaN NaN
# 24 NaN NaN
Without deep extract
Convert to a one-hot dense array without using the deep_extract functionality.
The result is a dataframe where each cell value, including a whole nested list, is used as a new column name.
results = df2onehot(df, deep_extract=False)
# print
print(results['onehot'])
# feat_1_1 feat_1_10 ... feat_2_None feat_2_['4', '45']
# 0 False False ... False True
# 1 False False ... True False
# 2 False False ... True False
# 3 False False ... True False
# 4 False False ... True False
# 5 False True ... True False
# 6 False False ... True False
# 7 False False ... True False
# 8 False False ... True False
# ...
# [25 rows x 10 columns]
With deep extract
With deep_extract=True, each element is checked for lists or dictionaries and structured accordingly. If a column name already exists, the value is added to that column for the specific row.
# Convert to onehot dense-array with ``deep_extract=True``
results = df2onehot(df, deep_extract=True)
# print
print(results['onehot'])
# 1 10 11 3 4 5 6 7 8 9 45
# 0 False False False True True False False False False False True
# 1 False False False False False False False False False False False
# 2 False False False False False True True True True False False
# 3 False False False False False False False False False False False
# 4 False False False False False False False False False False False
# 5 False True False False False False False False False False False
# 6 False False False False False False False False False False False
# 7 False False False False False False False False False False False
# 8 False False False False False False False False False False False
# ...
# [25 rows x 11 columns]
# Let's print only the relevant rows.
idx = results['onehot'].sum(axis=1)>0
print(results['onehot'].loc[idx,:])
# 1 10 11 3 4 5 6 7 8 9 45
# 0 False False False True True False False False False False True
# 2 False False False False False True True True True False False
# 5 False True False False False False False False False False False
# 15 True False False False False False False False False False False
# 20 False True True False True False False False False True False
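The idea behind deep extraction can be sketched in plain Python: flatten every cell, collect the distinct elements as column labels, and mark which labels occur in each row. This is a simplified illustration, not the library's implementation. It handles nested lists only (no dictionaries), `flatten` and `deep_onehot` are hypothetical names, and the column order is a plain string sort that need not match df2onehot's output.

```python
# Hypothetical sketch of deep extraction for nested lists.
def flatten(cell):
    """Yield every scalar inside a possibly nested list as a string; skip None."""
    if cell is None:
        return
    if isinstance(cell, (list, tuple)):
        for item in cell:
            yield from flatten(item)
    else:
        yield str(cell)


def deep_onehot(rows):
    """Return (labels, matrix) with one boolean per label for each row."""
    row_labels = [{lab for cell in row for lab in flatten(cell)} for row in rows]
    labels = sorted(set().union(*row_labels)) if row_labels else []
    matrix = [[lab in present for lab in labels] for present in row_labels]
    return labels, matrix


# A few rows in the shape of the complex example: (feat_1, feat_2).
rows = [
    ([3, 4], [4, 45]),
    (None, None),
    ([5, 6, 7, 8], None),
    (10, None),
    (1, 1),
    ([9, 11, 4], 10),
]

labels, matrix = deep_onehot(rows)
print(labels)  # ['1', '10', '11', '3', '4', '45', '5', '6', '7', '8', '9']
print([lab for lab, hit in zip(labels, matrix[5]) if hit])  # ['10', '11', '4', '9']
```

As in the output above, a value that appears in both columns, such as 4 or 10, ends up in a single shared column.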