Examples

Let's load some datasets using hnet.import_example() and demonstrate how hnet learns associations.

Sprinkler dataset

A natural way to study the relationships between nodes in a network is to analyse the presence or absence of links between them. The sprinkler dataset contains four nodes and is therefore ideal to demonstrate how hnet infers a network. Links between two nodes can be either undirected or directed (directed edges are indicated with arrows). Notably, a directed edge implies directionality between the two nodes, whereas an undirected edge does not.

# Import library
from hnet import hnet

# Initialize
hn = hnet()

# Import example dataset
df = hn.import_example('sprinkler')

# Learn the relationships
results = hn.association_learning(df)

# Generate the interactive graph
G = hn.d3graph()
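
Besides the interactive graph, the learned associations can be inspected directly from the results object. Below is a minimal sketch; the exact dictionary keys (such as 'simmatLogP') and the static hn.plot() output may differ depending on the installed hnet version.

# Sketch: inspect the learned associations (key names are assumptions and may differ per hnet version)
print(results.keys())

# Association matrix with the log-transformed P-values between the one-hot encoded categories (assumed key name)
adjmat = results['simmatLogP']
print(adjmat)

# Static network plot of the same associations
G_static = hn.plot()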

Interactive network

Cluster enrichment

Suppose you have detected cluster labels and now want to know whether any of the clusters is associated with a (group of) feature(s). In this example, I will load a cancer dataset with pre-computed t-SNE coordinates based on genomic profiles. I will cluster the t-SNE coordinates, and the detected cluster labels are then used to determine any association with the metadata.

# For cluster evaluation
pip install scikit-learn
# For easy plotting
pip install scatterd

Cancer dataset

# Import
import hnet
# Import example dataset
df = hnet.import_example('cancer')
# Print
print(df.head())

     tsneX    tsneY  age     sex  survival_months  death_indicator  labx      PC1      PC2
0  37.2043  24.1628   58    male          44.5175                0   acc  49.2335  14.4965
1  37.0931  23.4236   44  female          55.0965                0   acc   46.328  14.4645
2  36.8063  23.4449   23  female          63.8029                1   acc  46.5679  13.4801
3  38.0679  24.4118   30    male          11.9918                0   acc  63.6247  1.87406
4  36.7912  21.7153   29  female            79.77                1   acc  41.7467  37.5336

tSNE scatterplot

For demonstration purposes, we make a scatter plot colored by the true cancer labels to show that the cancer labels are associated with the clusters. In many use cases, your scatter plot would not be colored, because you do not yet know which variables best fit the cluster labels.

# Import
from scatterd import scatterd

# Make scatter plot with the true cancer labels
scatterd(df['tsneX'], df['tsneY'], label=df['labx'], cmap='Set2', fontcolor=[0,0,0], title='Cancer dataset with True labels')

# Make scatter plot without colors
scatterd(df['tsneX'], df['tsneY'], title='Cancer dataset.')
tSNE scatter plots of the cancer patients, with and without the true cancer labels.

Compute associations

Step 1 is to compute the cluster labels based on the tSNE coordinates. These coordinates are already computed and can be extracted from the dataframe. Step 2 is to compute the enrichment of the variables (metadata) with the cluster labels.

# Import
import numpy as np
from sklearn.cluster import DBSCAN

# Determine cluster labels with DBSCAN on the tSNE coordinates
dbscan = DBSCAN(eps=2)
labx = dbscan.fit_predict(df[['tsneX','tsneY']])
print('Number of detected clusters: %d' %(len(np.unique(labx))))
# Number of detected clusters: 22
# Import
import hnet

# Enrichment of the cluster labels with the meta-data (full call shown with its output below)
# results = hnet.enrichment(df[['age', 'sex', 'survival_months', 'death_indicator','labx']], labx)

# [hnet] >Start making fit..
# [df2onehot] >Auto detecting dtypes
# [df2onehot] >[age]                     > [float] > [num] [74]
# [df2onehot] >[sex]                     > [obj]   > [cat] [2]
# [df2onehot] >[survival_months] > [force] > [num] [1591]
# [df2onehot] >[death_indicator] > [float] > [num] [2]
# [df2onehot] >[labx]                    > [obj]   > [cat] [19]
# [df2onehot] >
# [df2onehot] >Setting dtypes in dataframe
# [hnet] >Analyzing [num] age......................
# [hnet] >Analyzing [cat] sex......................
# [hnet] >Analyzing [num] survival_months......................
# [hnet] >Analyzing [num] death_indicator......................
# [hnet] >Analyzing [cat] labx......................
# [hnet] >Multiple test correction using holm
# [hnet] >Fin

# For demonstration purposes, I will only use the true cancer label column.
results = hnet.enrichment(df[['labx']], labx)

# Examine the results
print(results)

Cluster associations with categories

When we look at the results (table below), the first column is the category_label. These are the metadata variables of the dataframe df that we gave as input. The second column, P, is the P-value: the computed significance of the category_label with the target variable y. In this case, the target variable y contains the cluster labels labx. A disadvantage of the P-value is that it is limited by machine precision, which can result in a P-value of exactly 0. The logP is therefore more informative, as it is not capped by machine precision (lower is better). Note that a target label in y can be significantly enriched more than once, meaning that certain clusters are enriched for multiple variables. This may occur because the cluster labels need to be estimated more accurately, because a cluster is a mixed group, or for some other reason.

    category_label             P      logP  overlap_X  popsize_M  nr_succes_pop_n  samplesize_N        dtype   y  category_name          Padj
 0             acc  1.27018e-153  -352.056         71       4674               77            72  categorical   0           labx  5.15692e-151
 1            dlbc   3.22319e-51  -116.261         24       4674               27            48  categorical   1           labx   1.29572e-48
 2            kirc  4.73559e-219  -502.711        218       4674              259           398  categorical  10           labx  1.94633e-216
 3            kirp  2.12553e-166  -381.475        177       4674              219           398  categorical  10           labx  8.65091e-164
 4            kirc   8.16897e-20  -43.9514         15       4674              259            17  categorical  11           labx   3.24308e-17
 5            kirp   1.26634e-20  -45.8156         18       4674              219            26  categorical  12           labx   5.04005e-18
 6            blca  5.65247e-217  -497.929        157       4674              265           161  categorical  13           labx  2.31751e-214
 7            kirp   4.18004e-14  -30.8059          9       4674              219            10  categorical  14           labx    1.6553e-11
 8             lgg             0  -1571.11        500       4674              504           501  categorical  15           labx             0
 9            lihc             0  -841.979        220       4674              231           222  categorical  16           labx             0
10            luad             0  -1172.91        397       4674              427           419  categorical  17           labx             0
11              ov             0  -963.047        256       4674              262           258  categorical  18           labx             0
12            brca             0   -846.29        745       4674              761          1653  categorical   2           labx             0
13            cesc   1.49892e-49  -112.422        172       4674              205          1653  categorical   2           labx   5.99569e-47
14            hnsc   1.9156e-212  -487.498        463       4674              474          1653  categorical   2           labx  7.83481e-210
15            lusc   6.20884e-51  -115.606        159       4674              182          1653  categorical   2           labx   2.48975e-48
16            prad             0  -1241.55        356       4674              360           357  categorical  20           labx             0
17            laml  4.39155e-312  -716.927        166       4674              167           167  categorical   3           labx  1.80932e-309
18            paad   2.14906e-54  -123.575         19       4674               20            21  categorical   4           labx    8.6822e-52
19            cesc   1.11451e-28   -64.364         21       4674              205            24  categorical   5           labx   4.44688e-26
20            coad  1.16815e-193  -444.244        122       4674              134           161  categorical   6           labx  4.76605e-191
21            read   4.83245e-52  -118.159         33       4674               34           161  categorical   6           labx   1.94748e-49
22            coad   3.71058e-13  -28.6224          7       4674              134             8  categorical   7           labx   1.46568e-10
23            kich  5.97831e-124  -283.732         59       4674               66            65  categorical   8           labx  2.42122e-121
24            kich    1.2301e-06  -13.6084          3       4674               66             7  categorical   9           labx    0.00048466
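
The Padj column contains the P-values after multiple-testing correction (Holm in this example) and is the most appropriate column to threshold on. A minimal sketch, assuming results is the pandas DataFrame printed above:

# Keep only the significantly enriched categories (alpha=0.05 on the corrected P-values)
significant = results[results['Padj'] < 0.05]

# Sort on logP so that the strongest associations come first (lower is better)
significant = significant.sort_values(by='logP')
print(significant[['y', 'category_label', 'logP', 'Padj']])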

Color on significantly associated categories

Let's compute, for each cluster label y, the most significantly enriched category label.

# Import
import pandas as pd
from scatterd import scatterd

# Per cluster label y, take the row with the lowest (most significant) logP
out = results.loc[results.groupby(by='y')['logP'].idxmin()]

# Map each cluster label to its most significantly enriched category label
enriched_label = pd.DataFrame(labx.astype(str))
for i in range(out.shape[0]):
    enriched_label = enriched_label.replace(str(out['y'].iloc[i]), out['category_label'].iloc[i])

# Scatterplot of the cluster numbers
scatterd(df['tsneX'], df['tsneY'], label=labx, fontcolor=[0,0,0])

# Scatterplot of the significantly enriched cancer labels
scatterd(df['tsneX'], df['tsneY'], label=enriched_label.values.ravel(), fontcolor=[0,0,0], cmap='Set2', title='Significantly enriched cancer labels')
Scatter plots of the detected clusters and the significantly enriched cancer labels for each of the clusters.

It can be seen that the most significantly enriched cancer labels for the clusters represent the true labels very well.
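
This visual agreement can also be quantified. Below is a minimal sketch, assuming the enriched_label variable from the previous code block and the true labels in df['labx'], using the adjusted Rand index from scikit-learn:

# Quantify how well the enriched cluster labels agree with the true cancer labels
from sklearn.metrics import adjusted_rand_score

ari = adjusted_rand_score(df['labx'].values, enriched_label.values.ravel())
print('Adjusted Rand index: %.3f' %(ari))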