Import datasets

HNet learns associations from datasets with mixed data types and unknown function. Input datasets can range from generic DataFrames to nested data structures with lists, missing values and enumerations.

hnet takes a pandas DataFrame as input. The columns represent the variables (features), which contain continuous or categorical values. The rows are the samples.

Below is the Titanic dataset, which can serve as input for hnet in its current form. The pre-processing steps are explained in the pre-processing section.

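| | PassengerId | Survived | Pclass | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | 30.0000 | C148 | C |
| 890 | 891 | 0 | 3 | 7.7500 | NaN | Q |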
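To make the expected layout concrete, the sketch below builds the first three rows of the Titanic table by hand and summarizes each column. It uses only pandas and numpy; the `summary` frame is an illustrative helper and not part of hnet.

```python
import numpy as np
import pandas as pd

# First three rows of the Titanic table shown above.
df = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3],
        "Survived": [0, 1, 1],
        "Pclass": [3, 1, 3],
        "Fare": [7.25, 71.2833, 7.925],
        "Cabin": [np.nan, "C85", np.nan],
        "Embarked": ["S", "C", "S"],
    }
)

# Rows are samples, columns are variables (features).
n_samples, n_features = df.shape

# Per-column overview: dtype, number of distinct values, missing values.
summary = pd.DataFrame(
    {
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
        "n_missing": df.isna().sum(),
    }
)
print(summary)
```

An overview like this helps spot columns with many missing values (such as Cabin) or identifier-like columns (such as PassengerId) before the data is passed on.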

Import example datasets

Example datasets can be imported with hnet.import_example(). This function provides example datasets such as sprinkler, titanic and student. The titanic dataset is depicted above and the sprinkler dataset below.

import hnet

# import example
df = hnet.import_example('sprinkler')

# print DataFrame
print(df)

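| | Cloudy | Sprinkler | Rain | Wet_Grass |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 1 |
| 2 | 0 | 1 | 0 | 1 |
| 3 | 1 | 1 | 1 | 1 |
| 4 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... |
| 995 | 1 | 0 | 1 | 1 |
| 996 | 1 | 0 | 1 | 1 |
| 997 | 1 | 0 | 1 | 1 |
| 998 | 0 | 0 | 0 | 0 |
| 999 | 0 | 1 | 1 | 1 |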

Example of the student dataset containing mixed datatypes:

# import example
df = hnet.import_example('student')

# print DataFrame
print(df)

school_GP

school_MS

sex_F

sex_M

G3_8

G3_9

G1_18

G2_0

school_GP

0.0

0.0

0.0

0.0

0.0

0.0

0.000000

0.0

school_MS

0.0

0.0

0.0

0.0

0.0

0.0

0.000000

0.0

sex_F

0.0

0.0

0.0

0.0

0.0

0.0

0.000000

0.0

sex_M

0.0

0.0

0.0

0.0

0.0

0.0

0.000000

0.0

age_19

0.0

0.0

0.0

0.0

0.0

0.0

0.000000

0.0

G3_18

0.0

0.0

0.0

0.0

0.0

0.0

2.931461

0.0

G3_8

0.0

0.0

0.0

0.0

0.0

0.0

0.000000

0.0

G3_9

0.0

0.0

0.0

0.0

0.0

0.0

0.000000

0.0

G1_18

0.0

0.0

0.0

0.0

0.0

0.0

0.000000

0.0

G2_0

0.0

0.0

0.0

0.0

0.0

0.0

0.000000

0.0

Import from csv

Importing data from a csv file can be performed using pandas:

import pandas as pd

data = pd.read_csv('./pathname/to/file.csv')
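After reading a CSV file it is worth checking that the columns were parsed with the types you expect. The sketch below uses an in-memory string as a stand-in for a file on disk, so it runs without any external file; the sample rows are taken from the Titanic table above.

```python
import io

import pandas as pd

# Stand-in for a file on disk; pd.read_csv accepts a path or a file-like object.
csv_text = """PassengerId,Survived,Pclass,Fare,Cabin,Embarked
1,0,3,7.25,,S
2,1,1,71.2833,C85,C
3,1,3,7.925,,S
"""

df = pd.read_csv(io.StringIO(csv_text))

# Empty fields are parsed as missing values (NaN) by default.
print(df.dtypes)
print(df.isna().sum())
```

If a numeric column shows up as text, or the other way around, the `dtype` argument of `pd.read_csv` can be used to set the type explicitly at read time.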

Import from url

If your dataset is located on a particular website, it is possible to directly download the dataset. In the example below, we will download a dataset from the archives of [UCI](https://archive.ics.uci.edu/ml/).

# Import hnet
import hnet

# Download from url
df = hnet.import_example(url='https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data')

# Specify columns
df.columns=['age','workclass','fnlwgt','education','education-num','marital-status','occupation','relationship','race','sex','capital-gain','capital-loss','hours-per-week','native-country','earnings']

# Initialize
from hnet import hnet
hn = hnet(black_list=['fnlwgt'])
# Run HNet
results = hn.association_learning(df)
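As an alternative sketch, the same file can be read with pandas alone. The adult.data file has no header row and separates fields with a comma followed by a space, so the column names are passed at read time. The helper name `read_adult` and the two sample lines are illustrative assumptions, used here so the example runs without a download.

```python
import io

import pandas as pd

COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "earnings",
]


def read_adult(source):
    """Read adult.data-style text from a path, URL or file-like object."""
    return pd.read_csv(
        source,
        header=None,            # the file has no header row
        names=COLUMNS,          # assign the column names at read time
        skipinitialspace=True,  # drop the space that follows each comma
    )


# Two illustrative lines in the file's format, used instead of a download.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)
df = read_adult(sample)
print(df.shape)
```

Because `pd.read_csv` also accepts URLs, the URL from the example above can be passed to `read_adult` in place of the sample.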

Import from sklearn

Various example datasets are also available in sklearn. The demonstration below shows how to import them for use in hnet. Note that a dataset should contain at least one categorical variable. Datasets containing only continuous values call for a different method, such as t-SNE, SVD or UMAP.

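# Import library
from sklearn import datasets

# Import pandas
import pandas as pd

# Boston housing data. Note: load_boston was removed in scikit-learn 1.2,
# so this example only runs on older versions.
X = datasets.load_boston()
df = pd.DataFrame(data=X['data'], columns=X['feature_names'])

# Diabetes data
X = datasets.load_diabetes()
df = pd.DataFrame(data=X['data'], columns=X['feature_names'])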
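Since a dataset needs at least one categorical variable, a quick pre-check can be useful. The sketch below is a simple heuristic and not hnet's own type detection: it flags non-numeric columns and, optionally, numeric columns with few distinct values. The function name `categorical_columns` and the `max_unique` parameter are hypothetical.

```python
import pandas as pd


def categorical_columns(df, max_unique=None):
    """Return the names of columns that look categorical.

    Heuristic only (an assumption, not hnet's detection): non-numeric
    columns always count; numeric columns count when `max_unique` is set
    and they have at most that many distinct values.
    """
    non_numeric = set(df.select_dtypes(exclude="number").columns)
    selected = []
    for name in df.columns:
        few_values = max_unique is not None and df[name].nunique() <= max_unique
        if name in non_numeric or few_values:
            selected.append(name)
    return selected


# Five rows taken from the Titanic table shown earlier.
df = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 1, 3],
        "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
        "Embarked": ["S", "C", "S", "S", "S"],
    }
)

print(categorical_columns(df))                # only the text column
print(categorical_columns(df, max_unique=3))  # also numerically coded classes
print(categorical_columns(df[["Fare"]]))      # continuous values only
```

The sklearn datasets above are stored as numeric arrays, so any categorical variable in them is numerically coded and is only picked up when `max_unique` is set.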

Output variables

hnet provides a number of output variables. It all starts with the initialization:

# Load library
from hnet import hnet

# Initialize model and set parameters
hn = hnet(alpha=0.05, y_min=10, perc_min_num=0.8, multtest='holm', dtypes='pandas')

The initialization returns an object that holds the user-defined settings. Parameters that are not specified are set to their defaults. For more details, see the API docstrings.

# Learn associations from data set
results = hn.association_learning(df)

The object can now be fed with the DataFrame df using the association_learning function, which returns its output variables in a dictionary.

print(results.keys())
# dict_keys(['simmatP', 'simmatLogP', 'labx', 'dtypes', 'counts', 'rules'])
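One way to explore the output is to list the non-zero cells of an association matrix. The sketch below assumes that `results['simmatLogP']` is a square pandas DataFrame with the same labels on rows and columns, and that a non-zero cell marks an association between its row and column label. The helper `nonzero_associations` is hypothetical, and the small matrix is built by hand with the single non-zero value from the student table above.

```python
import pandas as pd


def nonzero_associations(simmat):
    """Return (row label, column label, value) for non-zero cells, largest first."""
    pairs = [
        (row, col, float(simmat.loc[row, col]))
        for row in simmat.index
        for col in simmat.columns
        if simmat.loc[row, col] != 0
    ]
    pairs.sort(key=lambda item: item[2], reverse=True)
    return pairs


# Hand-built stand-in for an association matrix (not produced by hnet here).
labels = ["G3_18", "G1_18", "G2_0"]
simmat = pd.DataFrame(0.0, index=labels, columns=labels)
simmat.loc["G3_18", "G1_18"] = 2.931461

print(nonzero_associations(simmat))
```

Under the same assumptions, a real run would call `nonzero_associations(results['simmatLogP'])`.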