Import datasets
HNet learns associations from datasets with mixed datatypes and an unknown underlying function. Input datasets can range from generic dataframes to nested data structures with lists, missing values and enumerations. We use the pandas DataFrame as the input data type for hnet. The columns represent the variables or features containing continuous/categorical values, and the rows are the samples. Below is the titanic dataset, which can be used as input for hnet in its current form; a minimal sketch of doing so follows the table. The pre-processing steps are explained in the pre-processing section.
|     | PassengerId | Survived | Pclass | … | Fare    | Cabin | Embarked |
|-----|-------------|----------|--------|---|---------|-------|----------|
| 0   | 1           | 0        | 3      | … | 7.2500  | NaN   | S        |
| 1   | 2           | 1        | 1      | … | 71.2833 | C85   | C        |
| 2   | 3           | 1        | 3      | … | 7.9250  | NaN   | S        |
| 3   | 4           | 1        | 1      | … | 53.1000 | C123  | S        |
| 4   | 5           | 0        | 3      | … | 8.0500  | NaN   | S        |
| …   | …           | …        | …      | … | …       | …     | …        |
| 886 | 887         | 0        | 2      | … | 13.0000 | NaN   | S        |
| 887 | 888         | 1        | 1      | … | 30.0000 | B42   | S        |
| 888 | 889         | 0        | 3      | … | 23.4500 | NaN   | S        |
| 889 | 890         | 1        | 1      | … | 30.0000 | C148  | C        |
| 890 | 891         | 0        | 3      | … | 7.7500  | NaN   | Q        |
Import example datasets
Importing an example dataset can be performed using hnet.import_example(). This function provides several example datasets, such as sprinkler, titanic and student. The titanic dataset is depicted above and the sprinkler dataset below.
import hnet
# import example
df = hnet.import_example('sprinkler')
# print DataFrame
print(df)
|     | Cloudy | Sprinkler | Rain | Wet_Grass |
|-----|--------|-----------|------|-----------|
| 0   | 0      | 0         | 0    | 0         |
| 1   | 1      | 0         | 1    | 1         |
| 2   | 0      | 1         | 0    | 1         |
| 3   | 1      | 1         | 1    | 1         |
| 4   | 1      | 1         | 1    | 1         |
| …   | …      | …         | …    | …         |
| 995 | 1      | 0         | 1    | 1         |
| 996 | 1      | 0         | 1    | 1         |
| 997 | 1      | 0         | 1    | 1         |
| 998 | 0      | 0         | 0    | 0         |
| 999 | 0      | 1         | 1    | 1         |
Example of the student dataset containing mixed datatypes:
# import example
df = hnet.import_example('student')
# print DataFrame
print(df)
|           | school_GP | school_MS | sex_F | sex_M | … | G3_8 | G3_9 | G1_18    | G2_0 |
|-----------|-----------|-----------|-------|-------|---|------|------|----------|------|
| school_GP | 0.0       | 0.0       | 0.0   | 0.0   | … | 0.0  | 0.0  | 0.000000 | 0.0  |
| school_MS | 0.0       | 0.0       | 0.0   | 0.0   | … | 0.0  | 0.0  | 0.000000 | 0.0  |
| sex_F     | 0.0       | 0.0       | 0.0   | 0.0   | … | 0.0  | 0.0  | 0.000000 | 0.0  |
| sex_M     | 0.0       | 0.0       | 0.0   | 0.0   | … | 0.0  | 0.0  | 0.000000 | 0.0  |
| age_19    | 0.0       | 0.0       | 0.0   | 0.0   | … | 0.0  | 0.0  | 0.000000 | 0.0  |
| …         | …         | …         | …     | …     | … | …    | …    | …        | …    |
| G3_18     | 0.0       | 0.0       | 0.0   | 0.0   | … | 0.0  | 0.0  | 2.931461 | 0.0  |
| G3_8      | 0.0       | 0.0       | 0.0   | 0.0   | … | 0.0  | 0.0  | 0.000000 | 0.0  |
| G3_9      | 0.0       | 0.0       | 0.0   | 0.0   | … | 0.0  | 0.0  | 0.000000 | 0.0  |
| G1_18     | 0.0       | 0.0       | 0.0   | 0.0   | … | 0.0  | 0.0  | 0.000000 | 0.0  |
| G2_0      | 0.0       | 0.0       | 0.0   | 0.0   | … | 0.0  | 0.0  | 0.000000 | 0.0  |
Import from csv
Importing data from a csv file can be performed using pandas:
import pandas as pd
data = pd.read_csv('./pathname/to/file.csv')
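The resulting DataFrame can then be fed to hnet in the same way as in the other examples; a minimal sketch, where the csv path above is a placeholder for your own file:
# Learn associations directly from the imported DataFrame
from hnet import hnet
hn = hnet()
results = hn.association_learning(data)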
Import from url
If your dataset is located on a website, it is possible to download it directly. In the example below, we will download a dataset from the [UCI](https://archive.ics.uci.edu/ml/) archive.
# Import hnet
import hnet
# Download from url
df = hnet.import_example(url='https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data')
# Specify columns
df.columns=['age','workclass','fnlwgt','education','education-num','marital-status','occupation','relationship','race','sex','capital-gain','capital-loss','hours-per-week','native-country','earnings']
# Initialize
from hnet import hnet
hn = hnet(black_list=['fnlwgt'])
# Run HNet
results = hn.association_learning(df)
Import from sklearn
Various example datasets are also available in sklearn. Below is a demonstration of how to import these and use them in hnet. Note that datasets should contain at least one categorical variable; datasets containing only continuous values are better served by a different method, such as t-SNE, SVD or UMAP.
# Import libraries
from sklearn import datasets
import pandas as pd
# Boston housing dataset (note: load_boston was removed in scikit-learn 1.2)
X = datasets.load_boston()
df = pd.DataFrame(data=X['data'], columns=X['feature_names'])
# Diabetes dataset
X = datasets.load_diabetes()
df = pd.DataFrame(data=X['data'], columns=X['feature_names'])
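As noted above, hnet expects at least one categorical variable. If you nevertheless want to explore a continuous-only sklearn dataset with hnet, one possible workaround (not part of hnet itself) is to discretize the continuous columns first, for example into quantile bins with pandas. The sketch below uses the diabetes dataset and an arbitrary choice of 4 bins:
# Workaround sketch: discretize continuous columns into quantile bins so hnet
# can treat them as categorical values (4 bins chosen arbitrarily).
import pandas as pd
from sklearn import datasets
X = datasets.load_diabetes()
df = pd.DataFrame(data=X['data'], columns=X['feature_names'])
# pd.qcut converts each continuous column into categorical interval labels
df_binned = df.apply(lambda col: pd.qcut(col, q=4, duplicates='drop').astype(str))
# The binned DataFrame can now be used for association learning
from hnet import hnet
hn = hnet()
results = hn.association_learning(df_binned)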
Output variables
There are many output variables provided by hnet. It all starts with the initialization:
# Load library
from hnet import hnet
# Initialize model and set parameters
hn = hnet(alpha=0.05, y_min=10, perc_min_num=0.8, multtest='holm', dtypes='pandas')
The returned object contains the user-defined settings. Parameters that are not specified are set to their defaults. For more details, see the API docstrings.
# Learn associations from data set
results = hn.association_learning(df)
The object can now be fed with the dataframe df using the association_learning function, which returns its output variables in a dictionary.
print(results.keys())
# dict_keys(['simmatP', 'simmatLogP', 'labx', 'dtypes', 'counts', 'rules'])
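The exact content of each key is described in the API docstrings. As a short sketch, the individual output variables can be accessed by their key:
# Each output variable can be accessed by its key, for example the P-value matrices
print(results['simmatP'])
print(results['simmatLogP'])
# See the API docstrings for the exact content of labx, dtypes, counts and rules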