Import datasets
HNet learns associations from datasets with mixed datatypes and an unknown underlying function. Input datasets can range from generic dataframes to nested data structures with lists, missing values and enumerations. We use the pandas DataFrame as the input data type for hnet. The columns represent the variables or features containing continuous/categorical values, and the rows are the samples. Below is the titanic dataset, which can be used as input for hnet in its current form; a minimal sketch of doing so follows the table. The pre-processing steps are explained in the pre-processing section.
|     | PassengerId | Survived | Pclass | … | Fare    | Cabin | Embarked |
|-----|-------------|----------|--------|---|---------|-------|----------|
| 0   | 1           | 0        | 3      | … | 7.2500  | NaN   | S        |
| 1   | 2           | 1        | 1      | … | 71.2833 | C85   | C        |
| 2   | 3           | 1        | 3      | … | 7.9250  | NaN   | S        |
| 3   | 4           | 1        | 1      | … | 53.1000 | C123  | S        |
| 4   | 5           | 0        | 3      | … | 8.0500  | NaN   | S        |
| …   | …           | …        | …      | … | …       | …     | …        |
| 886 | 887         | 0        | 2      | … | 13.0000 | NaN   | S        |
| 887 | 888         | 1        | 1      | … | 30.0000 | B42   | S        |
| 888 | 889         | 0        | 3      | … | 23.4500 | NaN   | S        |
| 889 | 890         | 1        | 1      | … | 30.0000 | C148  | C        |
| 890 | 891         | 0        | 3      | … | 7.7500  | NaN   | Q        |
Import example datasets
Importing an example dataset can be performed using hnet.import_example(). This function provides several example datasets, such as sprinkler, titanic and student. The titanic dataset is depicted above and the sprinkler dataset below.
import hnet
# import example
df = hnet.import_example('sprinkler')
# print DataFrame
print(df)
|     | Cloudy | Sprinkler | Rain | Wet_Grass |
|-----|--------|-----------|------|-----------|
| 0   | 0      | 0         | 0    | 0         |
| 1   | 1      | 0         | 1    | 1         |
| 2   | 0      | 1         | 0    | 1         |
| 3   | 1      | 1         | 1    | 1         |
| 4   | 1      | 1         | 1    | 1         |
| …   | …      | …         | …    | …         |
| 995 | 1      | 0         | 1    | 1         |
| 996 | 1      | 0         | 1    | 1         |
| 997 | 1      | 0         | 1    | 1         |
| 998 | 0      | 0         | 0    | 0         |
| 999 | 0      | 1         | 1    | 1         |
Example of the student dataset containing mixed datatypes:
# import example
df = hnet.import_example('student')
# print DataFrame
print(df)
|           | school_GP | school_MS | sex_F | sex_M | … | G3_8 | G3_9 | G1_18    | G2_0 |
|-----------|-----------|-----------|-------|-------|---|------|------|----------|------|
| school_GP | 0.0       | 0.0       | 0.0   | 0.0   | … | 0.0  | 0.0  | 0.000000 | 0.0  |
| school_MS | 0.0       | 0.0       | 0.0   | 0.0   | … | 0.0  | 0.0  | 0.000000 | 0.0  |
| sex_F     | 0.0       | 0.0       | 0.0   | 0.0   | … | 0.0  | 0.0  | 0.000000 | 0.0  |
| sex_M     | 0.0       | 0.0       | 0.0   | 0.0   | … | 0.0  | 0.0  | 0.000000 | 0.0  |
| age_19    | 0.0       | 0.0       | 0.0   | 0.0   | … | 0.0  | 0.0  | 0.000000 | 0.0  |
| …         | …         | …         | …     | …     | … | …    | …    | …        | …    |
| G3_18     | 0.0       | 0.0       | 0.0   | 0.0   | … | 0.0  | 0.0  | 2.931461 | 0.0  |
| G3_8      | 0.0       | 0.0       | 0.0   | 0.0   | … | 0.0  | 0.0  | 0.000000 | 0.0  |
| G3_9      | 0.0       | 0.0       | 0.0   | 0.0   | … | 0.0  | 0.0  | 0.000000 | 0.0  |
| G1_18     | 0.0       | 0.0       | 0.0   | 0.0   | … | 0.0  | 0.0  | 0.000000 | 0.0  |
| G2_0      | 0.0       | 0.0       | 0.0   | 0.0   | … | 0.0  | 0.0  | 0.000000 | 0.0  |
Import from csv
Importing data from a csv file can be performed using pandas:
import pandas as pd
data = pd.read_csv('./pathname/to/file.csv')
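The resulting DataFrame can then be fed to hnet in the same way as in the other examples; a minimal sketch, where the csv path above is a placeholder for your own file:
# Learn associations directly from the imported DataFrame
from hnet import hnet
hn = hnet()
results = hn.association_learning(data)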
Import from url
If your dataset is located on a website, it is possible to download it directly. In the example below, we will download a dataset from the [UCI](https://archive.ics.uci.edu/ml/) archive.
# Import hnet
import hnet
# Download from url
df = hnet.import_example(url='https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data')
# Specify columns
df.columns=['age','workclass','fnlwgt','education','education-num','marital-status','occupation','relationship','race','sex','capital-gain','capital-loss','hours-per-week','native-country','earnings']
# Initialize
from hnet import hnet
hn = hnet(black_list=['fnlwgt'])
# Run HNet
results = hn.association_learning(df)
Import from sklearn
Various example datasets are also available in sklearn. Below is a demonstration of how to import these and use them in hnet. Note that datasets should contain at least one categorical variable; datasets containing only continuous values are better served by a different method, such as t-SNE, SVD or UMAP.
# Import libraries
from sklearn import datasets
import pandas as pd
# Boston housing dataset (note: load_boston was removed in scikit-learn 1.2)
X = datasets.load_boston()
df = pd.DataFrame(data=X['data'], columns=X['feature_names'])
# Diabetes dataset
X = datasets.load_diabetes()
df = pd.DataFrame(data=X['data'], columns=X['feature_names'])
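As noted above, hnet expects at least one categorical variable. If you nevertheless want to explore a continuous-only sklearn dataset with hnet, one possible workaround (not part of hnet itself) is to discretize the continuous columns first, for example into quantile bins with pandas. The sketch below uses the diabetes dataset and an arbitrary choice of 4 bins:
# Workaround sketch: discretize continuous columns into quantile bins so hnet
# can treat them as categorical values (4 bins chosen arbitrarily).
import pandas as pd
from sklearn import datasets
X = datasets.load_diabetes()
df = pd.DataFrame(data=X['data'], columns=X['feature_names'])
# pd.qcut converts each continuous column into categorical interval labels
df_binned = df.apply(lambda col: pd.qcut(col, q=4, duplicates='drop').astype(str))
# The binned DataFrame can now be used for association learning
from hnet import hnet
hn = hnet()
results = hn.association_learning(df_binned)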
Output variables
There are many output variables provided by hnet. It all starts with the initialization:
# Load library
from hnet import hnet
# Initialize model and set parameters
hn = hnet(alpha=0.05, y_min=10, perc_min_num=0.8, multtest='holm', dtypes='pandas')
The returned object contains the user-defined settings. Parameters that are not specified are set to their defaults. For more details, see the API docstrings.
# Learn associations from data set
results = hn.association_learning(df)
The object can now be fed with the dataframe df using the association_learning function, which returns its output variables in a dictionary.
print(results.keys())
# dict_keys(['simmatP', 'simmatLogP', 'labx', 'dtypes', 'counts', 'rules'])
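The exact content of each key is described in the API docstrings. As a short sketch, the individual output variables can be accessed by their key:
# Each output variable can be accessed by its key, for example the P-value matrices
print(results['simmatP'])
print(results['simmatLogP'])
# See the API docstrings for the exact content of labx, dtypes, counts and rules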