Examples
Let's load some datasets using hnet.import_example() and demonstrate the usage of hnet in learning associations.
Sprinkler dataset
A natural way to study the relation between nodes in a network is to analyse the presence or absence of node-links. The sprinkler data set contains four nodes and is therefore ideal to demonstrate how hnet infers a network. Links between two nodes of a network can be either undirected or directed (directed edges are indicated with arrows). Notably, a directed edge implies directionality between the two nodes, whereas an undirected edge does not.
from hnet import hnet
hn = hnet()
# Import example dataset
df = hn.import_example('sprinkler')
# Learn the relationships
results = hn.association_learning(df)
# Generate the interactive graph
G = hn.d3graph()
Interactive network
Cluster enrichment
Suppose you have detected cluster labels and now want to know whether any of the clusters is associated with a feature or a group of features. In this example, I load a cancer data set with pre-computed t-SNE coordinates based on genomic profiles. I cluster the t-SNE coordinates and use the detected labels to determine any association with the metadata.
# For cluster evaluation
pip install scikit-learn
# For easy plotting
pip install scatterd
Cancer dataset
# Import
import hnet
# Import example dataset
df = hnet.import_example('cancer')
# Print
print(df.head())
| | tsneX | tsneY | age | sex | survival_months | death_indicator | labx | PC1 | PC2 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 37.2043 | 24.1628 | 58 | male | 44.5175 | 0 | acc | 49.2335 | 14.4965 |
| 1 | 37.0931 | 23.4236 | 44 | female | 55.0965 | 0 | acc | 46.328 | 14.4645 |
| 2 | 36.8063 | 23.4449 | 23 | female | 63.8029 | 1 | acc | 46.5679 | 13.4801 |
| 3 | 38.0679 | 24.4118 | 30 | male | 11.9918 | 0 | acc | 63.6247 | 1.87406 |
| 4 | 36.7912 | 21.7153 | 29 | female | 79.77 | 1 | acc | 41.7467 | 37.5336 |
tSNE scatterplot
For demonstration purposes, we first make a scatter plot colored by the true cancer labels to show that the cancer labels are associated with the clusters. In many use cases, your scatter plot would not be colored because you do not yet know which variables best fit the cluster labels.
# Import
from scatterd import scatterd
# Make scatter plot
scatterd(df['tsneX'],df['tsneY'], label=df['labx'], cmap='Set2', fontcolor=[0,0,0], title='Cancer dataset with True labels')
# Make scatter plot without colors
scatterd(df['tsneX'],df['tsneY'], title='Cancer dataset.')
Compute associations
Step 1 is to compute the cluster labels based on the t-SNE coordinates. These coordinates are already computed and can be extracted from the dataframe. Step 2 is to compute the enrichment of the variables (metadata) with the cluster labels.
# Import
import numpy as np
from sklearn.cluster import DBSCAN
# Determine cluster labels
dbscan = DBSCAN(eps=2)
labx = dbscan.fit_predict(df[['tsneX','tsneY']])
print('Number of detected clusters: %d' %(len(np.unique(labx))))
# Number of detected clusters: 22
# Import
import hnet
# Enrichment of clusterlabels with the meta-data
# results = hnet.enrichment(df[['age', 'sex', 'survival_months', 'death_indicator','labx']], labx)
# [hnet] >Start making fit..
# [df2onehot] >Auto detecting dtypes
# [df2onehot] >[age] > [float] > [num] [74]
# [df2onehot] >[sex] > [obj] > [cat] [2]
# [df2onehot] >[survival_months] > [force] > [num] [1591]
# [df2onehot] >[death_indicator] > [float] > [num] [2]
# [df2onehot] >[labx] > [obj] > [cat] [19]
# [df2onehot] >
# [df2onehot] >Setting dtypes in dataframe
# [hnet] >Analyzing [num] age......................
# [hnet] >Analyzing [cat] sex......................
# [hnet] >Analyzing [num] survival_months......................
# [hnet] >Analyzing [num] death_indicator......................
# [hnet] >Analyzing [cat] labx......................
# [hnet] >Multiple test correction using holm
# [hnet] >Fin
# For demonstration purposes I will only do the true cancer label column.
results = hnet.enrichment(df[['labx']], labx)
# Examine the results
print(results)
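The log output above mentions a multiple test correction using holm, and the corrected values appear in the Padj column of the results. As a rough illustration of what a Holm step-down correction does, here is a minimal, generic sketch in plain Python. It is not hnet's own implementation; the function name `holm_adjust` and the example P-values are made up for illustration.

```python
def holm_adjust(p_values):
    """Generic Holm step-down adjustment (illustrative sketch, not hnet code)."""
    m = len(p_values)
    # Indices of the P-values, sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest P-value is multiplied by m, the next by m - 1, and so on
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Keep the adjusted values monotone in the sorted order
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw P-values, returned in the same order as the input
adjusted = holm_adjust([0.01, 0.04, 0.03])
print(adjusted)
```

The smallest P-value is penalised the most, which is why Padj is never smaller than P in the results.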
Cluster associations with categories
The results (table below) list the category_label in the first column. These are the metadata variables of the dataframe df that was given as input. In the second column, P is the P-value: the computed significance of the association between the category_label and the target variable y. Here, the target variable y holds the cluster labels labx. A disadvantage of the P-value is that it is limited by machine precision, so very small values may be reported as 0. The logP is more informative because it is not capped by machine precision (lower is better). Note that a target label in y can be significantly enriched more than once, which means that certain clusters are enriched for multiple categories. This may happen because the cluster labels need to be estimated better, because the cluster is a mixed group, or for some other reason.
| | category_label | P | logP | overlap_X | popsize_M | nr_succes_pop_n | samplesize_N | dtype | y | category_name | Padj |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | acc | 1.27018e-153 | -352.056 | 71 | 4674 | 77 | 72 | categorical | 0 | labx | 5.15692e-151 |
| 1 | dlbc | 3.22319e-51 | -116.261 | 24 | 4674 | 27 | 48 | categorical | 1 | labx | 1.29572e-48 |
| 2 | kirc | 4.73559e-219 | -502.711 | 218 | 4674 | 259 | 398 | categorical | 10 | labx | 1.94633e-216 |
| 3 | kirp | 2.12553e-166 | -381.475 | 177 | 4674 | 219 | 398 | categorical | 10 | labx | 8.65091e-164 |
| 4 | kirc | 8.16897e-20 | -43.9514 | 15 | 4674 | 259 | 17 | categorical | 11 | labx | 3.24308e-17 |
| 5 | kirp | 1.26634e-20 | -45.8156 | 18 | 4674 | 219 | 26 | categorical | 12 | labx | 5.04005e-18 |
| 6 | blca | 5.65247e-217 | -497.929 | 157 | 4674 | 265 | 161 | categorical | 13 | labx | 2.31751e-214 |
| 7 | kirp | 4.18004e-14 | -30.8059 | 9 | 4674 | 219 | 10 | categorical | 14 | labx | 1.6553e-11 |
| 8 | lgg | 0 | -1571.11 | 500 | 4674 | 504 | 501 | categorical | 15 | labx | 0 |
| 9 | lihc | 0 | -841.979 | 220 | 4674 | 231 | 222 | categorical | 16 | labx | 0 |
| 10 | luad | 0 | -1172.91 | 397 | 4674 | 427 | 419 | categorical | 17 | labx | 0 |
| 11 | ov | 0 | -963.047 | 256 | 4674 | 262 | 258 | categorical | 18 | labx | 0 |
| 12 | brca | 0 | -846.29 | 745 | 4674 | 761 | 1653 | categorical | 2 | labx | 0 |
| 13 | cesc | 1.49892e-49 | -112.422 | 172 | 4674 | 205 | 1653 | categorical | 2 | labx | 5.99569e-47 |
| 14 | hnsc | 1.9156e-212 | -487.498 | 463 | 4674 | 474 | 1653 | categorical | 2 | labx | 7.83481e-210 |
| 15 | lusc | 6.20884e-51 | -115.606 | 159 | 4674 | 182 | 1653 | categorical | 2 | labx | 2.48975e-48 |
| 16 | prad | 0 | -1241.55 | 356 | 4674 | 360 | 357 | categorical | 20 | labx | 0 |
| 17 | laml | 4.39155e-312 | -716.927 | 166 | 4674 | 167 | 167 | categorical | 3 | labx | 1.80932e-309 |
| 18 | paad | 2.14906e-54 | -123.575 | 19 | 4674 | 20 | 21 | categorical | 4 | labx | 8.6822e-52 |
| 19 | cesc | 1.11451e-28 | -64.364 | 21 | 4674 | 205 | 24 | categorical | 5 | labx | 4.44688e-26 |
| 20 | coad | 1.16815e-193 | -444.244 | 122 | 4674 | 134 | 161 | categorical | 6 | labx | 4.76605e-191 |
| 21 | read | 4.83245e-52 | -118.159 | 33 | 4674 | 34 | 161 | categorical | 6 | labx | 1.94748e-49 |
| 22 | coad | 3.71058e-13 | -28.6224 | 7 | 4674 | 134 | 8 | categorical | 7 | labx | 1.46568e-10 |
| 23 | kich | 5.97831e-124 | -283.732 | 59 | 4674 | 66 | 65 | categorical | 8 | labx | 2.42122e-121 |
| 24 | kich | 1.2301e-06 | -13.6084 | 3 | 4674 | 66 | 7 | categorical | 9 | labx | 0.00048466 |
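To illustrate why logP stays informative where P is reported as 0, the sketch below computes a hypergeometric tail probability entirely in log space using only the standard library. It rests on assumptions that are not stated in this text: that the columns popsize_M, nr_succes_pop_n and samplesize_N parameterise a hypergeometric test, and that logP is the natural log of P(K > overlap_X). The helper names `log_choose` and `hypergeom_log_sf` are illustrative and not part of hnet.

```python
import math


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_log_sf(x, popsize_m, n_success, samplesize_n):
    """Natural log of P(K > x) for a hypergeometric variable K (sketch)."""
    # Range of overlaps k that are possible and larger than x
    lowest = max(x + 1, 0, samplesize_n - (popsize_m - n_success))
    highest = min(n_success, samplesize_n)
    if lowest > highest:
        return -math.inf
    log_total = log_choose(popsize_m, samplesize_n)
    log_terms = [
        log_choose(n_success, k)
        + log_choose(popsize_m - n_success, samplesize_n - k)
        - log_total
        for k in range(lowest, highest + 1)
    ]
    # Log-sum-exp: add the probabilities without leaving log space
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


# Values taken from the acc row (index 0) of the table
log_p_acc = hypergeom_log_sf(71, 4674, 77, 72)
print(round(log_p_acc, 2))  # about -352.06

# Values taken from the lgg row (index 8), where P is reported as 0
log_p_lgg = hypergeom_log_sf(500, 4674, 504, 501)
print(math.exp(log_p_lgg))  # 0.0, the probability underflows in floating point
```

Under these assumptions the acc row gives a value in line with the logP listed in the table, and the lgg row shows how converting a very negative log-probability back to a plain probability yields 0.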
Color on significantly associated categories
Let's compute, for each cluster label y, the most significantly enriched category label.
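# Import
import pandas as pd
from scatterd import scatterd
# For each cluster label y, select the row with the lowest logP
out = results.loc[results.groupby(by='y')['logP'].idxmin()]
# Replace each cluster number with its most enriched category label
enriched_label = pd.DataFrame(labx.astype(str))
for i in range(out.shape[0]):
    enriched_label = enriched_label.replace(out['y'].iloc[i], out['category_label'].iloc[i])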
# Scatterplot of the cluster numbers
scatterd(df['tsneX'],df['tsneY'], label=labx, fontcolor=[0,0,0])
# Scatterplot of the significantly enriched cancer labels
scatterd(df['tsneX'],df['tsneY'], label=enriched_label.values.ravel(), fontcolor=[0,0,0], cmap='Set2', title='Significantly enriched cancer labels')
It can be seen that the most significantly enriched cancer labels for the clusters represent the true labels very well.
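To make the selection step concrete for clusters that are enriched for several labels, the following plain-Python sketch picks the label with the lowest logP per cluster. The rows are copied from the results table for clusters 10, 2 and 6; the variable names are illustrative only, and the pandas groupby/idxmin call in the code above does the same job on the full results.

```python
# (category_label, y, logP) copied from the results table
rows = [
    ("kirc", "10", -502.711),
    ("kirp", "10", -381.475),
    ("brca", "2", -846.29),
    ("cesc", "2", -112.422),
    ("hnsc", "2", -487.498),
    ("lusc", "2", -115.606),
    ("coad", "6", -444.244),
    ("read", "6", -118.159),
]

# Keep, per cluster label y, the category with the lowest logP
best = {}
for category_label, y, log_p in rows:
    if y not in best or log_p < best[y][1]:
        best[y] = (category_label, log_p)

top_label = {y: label for y, (label, _) in best.items()}
print(top_label)  # {'10': 'kirc', '2': 'brca', '6': 'coad'}
```

Because only one label is kept per cluster, the other significant labels in a mixed cluster (for example hnsc, cesc and lusc in cluster 2) are not shown in the final scatter plot.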