Process Mixed dataset
####################################

In the following example we load the Titanic dataset and use ``df2onehot`` to convert it into a structured dataset.

.. code:: python
	
	# Load libraries (numpy is used below to print labels next to dtypes)
	import numpy as np
	from df2onehot import df2onehot, import_example
	
	# Import Titanic dataset
	df = import_example(data="titanic")

	print(df)
	#      PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
	# 0              1         0       3  ...   7.2500   NaN         S
	# 1              2         1       1  ...  71.2833   C85         C
	# 2              3         1       3  ...   7.9250   NaN         S
	# 3              4         1       1  ...  53.1000  C123         S
	# 4              5         0       3  ...   8.0500   NaN         S
	# ..           ...       ...     ...  ...      ...   ...       ...
	# 886          887         0       2  ...  13.0000   NaN         S
	# 887          888         1       1  ...  30.0000   B42         S
	# 888          889         0       3  ...  23.4500   NaN         S
	# 889          890         1       1  ...  30.0000  C148         C
	# 890          891         0       3  ...   7.7500   NaN         Q
	# 
	# [891 rows x 12 columns]

	# Convert the matrix into a structured dataset
	results = df2onehot(df)

	print(results.keys())
	# dict_keys(['numeric', 'dtypes', 'onehot', 'labx', 'df', 'labels'])
	
	# The onehot array exploded into 2637 features!
	print(results['onehot'].shape)
	# (891, 2637)
	
	# The one-hot dataset is this large partly because some variables are typed incorrectly.
	# Columns such as PassengerId should be removed because they are now typed as categorical,
	# so every unique value becomes its own column.
	# The next section shows how to prevent integer variables from being typed as categorical.
	
	print(np.c_[results['labels'],results['dtypes']])
	# array([['PassengerId', 'cat'],
	#        ['Survived', 'cat'],
	#        ['Pclass', 'cat'],
	#        ['Name', 'cat'],
	#        ['Sex', 'cat'],
	#        ['Age', 'num'],
	#        ['SibSp', 'cat'],
	#        ['Parch', 'cat'],
	#        ['Ticket', 'cat'],
	#        ['Fare', 'num'],
	#        ['Cabin', 'cat'],
	#        ['Embarked', 'cat'],
	#        ['all_true', 'cat']], dtype=object)
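
To see which columns drive the explosion, it can help to count the unique values per column, since each unique value of a categorical column becomes one one-hot column. The following is a minimal sketch in plain pandas on a small hypothetical frame. The toy data and the variable names are illustrative assumptions and are not part of ``df2onehot``.

.. code:: python

	import pandas as pd

	# Hypothetical toy frame that mimics a few Titanic columns.
	toy = pd.DataFrame({
	    'PassengerId': [1, 2, 3, 4, 5],
	    'Sex': ['male', 'female', 'female', 'male', 'male'],
	    'Pclass': [3, 1, 3, 1, 3],
	})

	# Number of unique values per column; a categorical column with
	# n unique values contributes n one-hot columns.
	n_unique = toy.nunique()
	print(n_unique.sort_values(ascending=False))

	# Identifier-like columns have as many unique values as there are rows.
	id_like = n_unique[n_unique == len(toy)].index.tolist()
	print(id_like)
	# ['PassengerId']

Running ``df.nunique()`` on the Titanic frame gives the same kind of overview for the real data.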


Force categorical values into numeric
**************************************

We can force variables to be numeric if the fraction of unique values exceeds a given threshold, here 80%.
In other words, if more than 80% of the values in a variable are unique, it is typed as numerical.

.. code:: python
	
	# Set the parameter to force columns into numerical dtypes
	results = df2onehot(df, perc_min_num=0.8)

	# Also remove categorical features for which fewer than 2 values exist.
	results = df2onehot(df, perc_min_num=0.8, y_min=2)
	
	# Check whether the dtypes are correct.
	# PassengerId, Age and Fare are set as numerical, and the rest categorical.
	
	print(np.c_[results['labels'],results['dtypes']])
	
	# [['PassengerId' 'num']
	#  ['Survived' 'cat']
	#  ['Pclass' 'cat']
	#  ['Name' 'cat']
	#  ['Sex' 'cat']
	#  ['Age' 'num']
	#  ['SibSp' 'cat']
	#  ['Parch' 'cat']
	#  ['Ticket' 'cat']
	#  ['Fare' 'num']
	#  ['Cabin' 'cat']
	#  ['Embarked' 'cat']
	#  ['all_true' 'cat']]

	# In the one-hot dense array, each column name is suffixed with its sub-category.
	print(results['onehot'])
	
	#      Survived_0.0  Survived_1.0  Pclass_1.0  ...  Embarked_Q  Embarked_S  all_true
	# 0            True         False       False  ...       False        True      True
	# 1           False          True        True  ...       False       False      True
	# 2           False          True       False  ...       False        True      True
	# 3           False          True        True  ...       False        True      True
	# 4            True         False       False  ...       False        True      True
	# ..            ...           ...         ...  ...         ...         ...       ...
	# 886          True         False       False  ...       False        True      True
	# 887         False          True        True  ...       False        True      True
	# 888          True         False       False  ...       False        True      True
	# 889         False          True        True  ...       False       False      True
	# 890          True         False       False  ...        True       False      True
	# 
	# [891 rows x 206 columns]
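
The 80% rule can be approximated by hand to get a feeling for which columns it affects. The sketch below is an assumption about the rule as described in this section, not the library's implementation. In particular, the strict ``>`` comparison and the restriction to columns that already hold numbers are guesses, the latter motivated by *Name* staying categorical in the output above even though its values are nearly all unique.

.. code:: python

	import pandas as pd

	# Hypothetical toy frame; the column names only mimic the Titanic data.
	toy = pd.DataFrame({
	    'PassengerId': [1, 2, 3, 4, 5],
	    'Pclass': [3, 1, 3, 1, 3],
	    'Name': ['a', 'b', 'c', 'd', 'e'],
	})
	threshold = 0.8  # plays the role of perc_min_num

	dtypes = {}
	for col in toy.columns:
	    # Fraction of unique values in this column.
	    ratio = toy[col].nunique() / len(toy)
	    # Assumption: only columns that hold numbers can become 'num'.
	    is_number = pd.api.types.is_numeric_dtype(toy[col])
	    dtypes[col] = 'num' if (is_number and ratio > threshold) else 'cat'

	print(dtypes)
	# {'PassengerId': 'num', 'Pclass': 'cat', 'Name': 'cat'}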


Exclude redundant variables 
**************************************

We can further clean the data by removing mutually exclusive columns.
As an example, the column **Survived** is split into *Survived_0.0* and *Survived_1.0*, but the column *Survived_0.0* may not be relevant because it mirrors *Survived_1.0*. With the parameter ``excl_background`` we can exclude columns by the sub-category label that is appended to the column name.

.. code:: python

	# Ignore specific subcategories
	results = df2onehot(df, perc_min_num=0.8, y_min=2, excl_background=['0.0'])

	# The final shape of our structured dataset is:
	print(results['onehot'].shape)
	# (891, 203)

	# The original variable names can be found here:
	results['labx']
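
The effect of excluding a background label can be sketched in plain pandas by filtering on the suffix of each one-hot column name. This is a hedged illustration on a hypothetical frame. How ``df2onehot`` matches the labels internally is not shown in this document, so the suffix matching below is an assumption.

.. code:: python

	import pandas as pd

	# Hypothetical one-hot frame with <column>_<sub-category> names.
	onehot = pd.DataFrame({
	    'Survived_0.0': [True, False, False],
	    'Survived_1.0': [False, True, True],
	    'Embarked_S': [True, False, True],
	})
	background = ['0.0']

	# Keep every column whose sub-category suffix is not a background label.
	keep = [col for col in onehot.columns
	        if col.rsplit('_', 1)[-1] not in background]
	reduced = onehot[keep]

	print(reduced.columns.tolist())
	# ['Survived_1.0', 'Embarked_S']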

Exclude sparse variables
**************************************

Converting categorical values into one-hot dense arrays can easily produce columns that contain only one or a few ``True`` values. We can use the ``y_min`` parameter to remove such columns.


.. code:: python

	# Increase the ``y_min`` parameter to remove even more columns.
	results = df2onehot(df, perc_min_num=0.8, y_min=5, excl_background=['0.0'])

	# The final shape of our structured dataset is:
	print(results['onehot'].shape)
	# (891, 29)
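
The filtering on sparse columns can be approximated by counting the ``True`` values per one-hot column and keeping the columns that reach the minimum. A minimal sketch on a hypothetical frame follows. The ``>=`` comparison is inferred from the description of ``y_min`` in this document and is an assumption about the library's exact behavior.

.. code:: python

	import pandas as pd

	# Hypothetical one-hot frame.
	onehot = pd.DataFrame({
	    'Cabin_C85': [True, False, False, False],
	    'Embarked_S': [True, True, False, True],
	    'Sex_male': [True, False, True, False],
	})
	y_min = 2

	# Number of True values per one-hot column.
	counts = onehot.sum(axis=0)

	# Keep only the columns with at least y_min True values.
	dense = onehot.loc[:, counts >= y_min]

	print(dense.columns.tolist())
	# ['Embarked_S', 'Sex_male']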

We still need to manually remove the identifier column and then we are ready to go for analysis!
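
One way to remove the identifier is a plain pandas ``drop``. The sketch below uses a hypothetical toy frame. This document does not state which of the returned frames still holds the identifier, so the example drops it from an input-like frame; the same call should apply to any pandas frame that contains the column.

.. code:: python

	import pandas as pd

	# Hypothetical toy frame with an identifier column.
	toy = pd.DataFrame({
	    'PassengerId': [1, 2, 3],
	    'Sex': ['male', 'female', 'female'],
	    'Fare': [7.25, 71.28, 7.93],
	})

	# Drop the identifier; pandas raises a KeyError if the column is missing.
	features = toy.drop(columns=['PassengerId'])

	print(features.columns.tolist())
	# ['Sex', 'Fare']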


Custom dtypes
####################################

In the following example we load the **sprinkler** dataset and structure it with custom dtypes.


.. code:: python

	# Load library
	from df2onehot import df2onehot, import_example

	# Import sprinkler dataset
	df = import_example('sprinkler')

	# Custom typing of the columns
	results = df2onehot(df, dtypes=['cat','cat','cat','cat'], excl_background=['0.0'])
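
Writing the ``dtypes`` list by hand gets error-prone for wider frames. A small sketch of one alternative is to declare the type per column name and derive the list in column order. It assumes that ``dtypes`` expects exactly one entry per column in the same order as the columns, which matches the four-entry list above but is not confirmed here. The frame and its column names are hypothetical.

.. code:: python

	import pandas as pd

	# Hypothetical frame with four binary columns.
	toy = pd.DataFrame({
	    'Cloudy': [0, 1, 0],
	    'Sprinkler': [1, 0, 1],
	    'Rain': [0, 1, 0],
	    'Wet_Grass': [1, 1, 0],
	})

	# Declare the intended type per column name.
	wanted = {'Cloudy': 'cat', 'Sprinkler': 'cat', 'Rain': 'cat', 'Wet_Grass': 'cat'}

	# Build the list in column order; a missing name raises a KeyError early.
	dtypes = [wanted[col] for col in toy.columns]

	print(dtypes)
	# ['cat', 'cat', 'cat', 'cat']

The resulting list could then be passed as ``dtypes=dtypes``.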


Extracting nested columns
####################################

In certain cases, columns contain nested lists and dictionaries.
With the ``deep_extract`` functionality it is possible to easily structure such columns.
Let's compare the results with and without the ``deep_extract`` functionality.

.. code:: python

	# Load library
	from df2onehot import df2onehot, import_example

	# Import complex dataframe containing lists in lists
	df = import_example('complex')
	
	print(df)
	#           feat_1   feat_2
	# 0         [3, 4]  [4, 45]
	# 1            NaN      NaN
	# 2   [5, 6, 7, 8]      NaN
	# 3            NaN      NaN
	# 4            NaN      NaN
	# 5             10      NaN
	# 6            NaN      NaN
	# 7            NaN      NaN
	# 8            NaN      NaN
	# 9            NaN      NaN
	# 10           NaN      NaN
	# 11           NaN      NaN
	# 12           NaN      NaN
	# 13           NaN      NaN
	# 14           NaN      NaN
	# 15             1        1
	# 16           NaN      NaN
	# 17           NaN      NaN
	# 18           NaN      NaN
	# 19           NaN      NaN
	# 20    [9, 11, 4]       10
	# 21           NaN      NaN
	# 22           NaN      NaN
	# 23           NaN      NaN
	# 24           NaN      NaN


Without deep extract
*********************

Convert to a one-hot dense array **without** using the ``deep_extract`` functionality.
The result is a dataframe in which each nested element, taken as a whole, is used as a new column name.

.. code:: python

	results = df2onehot(df, deep_extract=False)
	
	# print
	print(results['onehot'])

	#     feat_1_1  feat_1_10  ...  feat_2_None  feat_2_['4', '45']
	# 0      False      False  ...        False                True
	# 1      False      False  ...         True               False
	# 2      False      False  ...         True               False
	# 3      False      False  ...         True               False
	# 4      False      False  ...         True               False
	# 5      False       True  ...         True               False
	# 6      False      False  ...         True               False
	# 7      False      False  ...         True               False
	# 8      False      False  ...         True               False
	# ...
	# [25 rows x 10 columns]


With deep extract
*********************

With ``deep_extract=True``, each element is checked for lists or dictionaries and is unpacked accordingly. If a column name already exists, the value is added to that column for the specific row.

.. code:: python

	# Convert to a one-hot dense array with ``deep_extract=True``
	results = df2onehot(df, deep_extract=True)
	
	# print
	print(results['onehot'])

	#         1     10     11      3      4      5      6      7      8      9     45
	# 0   False  False  False   True   True  False  False  False  False  False   True
	# 1   False  False  False  False  False  False  False  False  False  False  False
	# 2   False  False  False  False  False   True   True   True   True  False  False
	# 3   False  False  False  False  False  False  False  False  False  False  False
	# 4   False  False  False  False  False  False  False  False  False  False  False
	# 5   False   True  False  False  False  False  False  False  False  False  False
	# 6   False  False  False  False  False  False  False  False  False  False  False
	# 7   False  False  False  False  False  False  False  False  False  False  False
	# 8   False  False  False  False  False  False  False  False  False  False  False
	# ...
	# [25 rows x 11 columns]


	# Let's print only the rows that contain at least one True value.
	idx = results['onehot'].sum(axis=1)>0
	print(results['onehot'].loc[idx,:])
	#         1     10     11      3      4      5      6      7      8      9     45
	# 0   False  False  False   True   True  False  False  False  False  False   True
	# 2   False  False  False  False  False   True   True   True   True  False  False
	# 5   False   True  False  False  False  False  False  False  False  False  False
	# 15   True  False  False  False  False  False  False  False  False  False  False
	# 20  False   True   True  False   True  False  False  False  False   True  False
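
To make the idea behind the extraction concrete, the sketch below flattens list cells by hand and builds a comparable one-hot frame with plain pandas. It is an approximation under stated assumptions and not the library's implementation: it handles only one level of list nesting, ignores dictionaries, and the toy frame and all variable names are hypothetical.

.. code:: python

	import numpy as np
	import pandas as pd

	# Hypothetical frame with scalars, lists and missing values.
	toy = pd.DataFrame({
	    'feat_1': [[3, 4], np.nan, [5, 6], 10],
	    'feat_2': [[4, 45], np.nan, np.nan, 1],
	})

	# Collect one (row, value) pair per non-missing element.
	pairs = []
	for idx, row in toy.iterrows():
	    for cell in row:
	        items = cell if isinstance(cell, (list, tuple)) else [cell]
	        for item in items:
	            if pd.notna(item):
	                pairs.append((idx, str(item)))

	long = pd.DataFrame(pairs, columns=['row', 'value'])

	# Count occurrences per row and value, then reduce the counts to True/False.
	onehot = pd.crosstab(long['row'], long['value']).astype(bool)

	# Rows without any value are restored as all-False rows.
	onehot = onehot.reindex(toy.index, fill_value=False)

	print(onehot.columns.tolist())
	# ['1', '10', '3', '4', '45', '5', '6']

As in the output above, a value that occurs in both columns of the same row, such as ``4`` in row 0, ends up in a single shared column.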


.. include:: add_bottom.add