
HNET: Graphical Hypergeometric-networks.

hnet.hnet.compare_networks(adjmat_true, adjmat_pred, pos=None, showfig=True, width=15, height=8, verbose=3)

Compare two adjacency matrices and plot the differences.

Comparison of two networks based on two adjacency matrices. Both matrices should be of equal size and of type pandas DataFrame. The columns and rows between both matrices are matched if not ordered similarly.

Parameters:

adjmat_true (pd.DataFrame()) – First array.
adjmat_pred (pd.DataFrame()) – Second array.
pos (dict, optional) – Position of the nodes. The default is None.
showfig (Bool, optional) – Plot figure. The default is True.
width (int, optional) – Width of the figure. The default is 15.
height (int, optional) – Height of the figure. The default is 8.
verbose (int, optional) – Verbosity. The default is 3.

Returns:

tuple – A tuple of two elements: the scores of the matching adjacency matrices and the adjacency matrix differences.
scores (dict) – Contains an extensive number of keys with various scoring values.
adjmat_diff (pd.DataFrame()) – Difference of the first input network compared to the second.
0 = No edge
1 = No difference between networks
2 = Addition of edge in the first input network compared to the second
-1 = Depletion of edge in the first network compared to the second
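
As an illustration of the coding above, the following minimal sketch derives the same four codes for two small networks. It uses plain nested dictionaries instead of pandas DataFrames to stay dependency-free. `diff_codes` is a hypothetical helper and not part of hnet, and it assumes that any non-zero value counts as an edge.

```python
def diff_codes(adj_first, adj_second):
    """Encode edge differences of the first network relative to the second.

    Both inputs are dict-of-dicts adjacency matrices over the same node labels.
    Rows and columns are matched by label, so their ordering does not matter.
    """
    nodes = sorted(adj_first)
    diff = {}
    for src in nodes:
        diff[src] = {}
        for dst in nodes:
            in_first = bool(adj_first[src][dst])
            in_second = bool(adj_second[src][dst])
            if in_first and in_second:
                code = 1    # edge present in both networks
            elif in_first:
                code = 2    # edge only in the first network (addition)
            elif in_second:
                code = -1   # edge only in the second network (depletion)
            else:
                code = 0    # no edge in either network
            diff[src][dst] = code
    return diff


first = {
    "A": {"A": 0, "B": 1, "C": 1},
    "B": {"A": 0, "B": 0, "C": 0},
    "C": {"A": 0, "B": 0, "C": 0},
}
# Same node labels, deliberately stored in a different order.
second = {
    "C": {"C": 0, "B": 0, "A": 0},
    "B": {"C": 1, "B": 0, "A": 0},
    "A": {"C": 0, "B": 1, "A": 0},
}
diff = diff_codes(first, second)
```

Here the edge A to B exists in both networks, A to C only in the first, and B to C only in the second.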

hnet.hnet.enrichment(df, y, y_min=None, alpha=0.05, multtest='holm', dtypes='pandas', specificity='medium', excl_background=None, verbose=3)

Enrichment analysis.

Compute enrichment between input dataset and response variable y. Length of dataframe and y must be equal. The input dataset is converted into a one-hot dense array based on automatic typing dtypes='pandas' or user defined dtypes.

Parameters:

df (DataFrame) – Input Dataframe.
y (list of length df.index) – Response variable.
y_min (int, optional) – Minimal number of samples in a group. The default is None.
alpha (float, optional) – Significance. The default is 0.05.
multtest (String, optional) – Multiple test correction. The default is ‘holm’.
dtypes (list of length df.columns, optional) – By default the dtype is determined based on the pandas dataframe. Empty ones [‘’] are skipped. The default is ‘pandas’.
specificity (String, optional) – Configure how numerical data labels are stored. The default is ‘medium’.
excl_background (String (default : None)) – Name to exclude from the background. Example: [‘0.0’]: To remove categorical values with label 0
verbose (int, optional) – Print message to screen. The higher the number, the more details. The default is 3.

Returns:

pd.DataFrame() with the following columns
category_label (str) – Label of the category.
P (float) – Pvalue of the hypergeometric test or Wilcoxon Ranksum.
logP (float) – -log10(Pvalue) of the hypergeometric test or Wilcoxon Ranksum.
Padj (float) – Adjusted P-value.
dtype (list of str) – Categoric or numeric.
y (str) – Response variable name.
category_name (str) – Subname of the category_label.
popsize_M (int) – Population size: Total number of samples.
nr_succes_pop_n (int) – Number of successes in population.
overlap_X (int) – Overlap between response variable y and input feature.
samplesize_N (int) – Sample size: Random variate, eg clustersize or groupsize, those of interest.
zscore (float) – Z-score of the Wilcoxon Ranksum test.
nr_not_succes_pop_n (int) – Number of non-successes in population.
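
To make the relation between the count columns and the P-value concrete, here is a minimal sketch of an upper-tail hypergeometric test written with the standard library only. The parameter names mirror the output columns above, but `hypergeom_pvalue` is a hypothetical helper, and using the upper tail P(X >= overlap_X) is an assumption about the test direction rather than a statement of hnet's internals.

```python
from math import comb, log10


def hypergeom_pvalue(popsize_M, nr_succes_pop_n, samplesize_N, overlap_X):
    """Return P(X >= overlap_X) for samplesize_N draws without replacement.

    popsize_M       : total number of samples
    nr_succes_pop_n : number of successes in the population
    samplesize_N    : size of the group of interest
    overlap_X       : observed successes inside the group
    """
    total = comb(popsize_M, samplesize_N)
    upper = min(nr_succes_pop_n, samplesize_N)
    tail = sum(
        comb(nr_succes_pop_n, k) * comb(popsize_M - nr_succes_pop_n, samplesize_N - k)
        for k in range(overlap_X, upper + 1)
    )
    return tail / total


# 20 samples, 8 successes overall, a group of 5 with 4 successes.
p = hypergeom_pvalue(popsize_M=20, nr_succes_pop_n=8, samplesize_N=5, overlap_X=4)
logp = -log10(p)  # analogous to the logP column
```

With SciPy available, the same tail probability can be obtained with `scipy.stats.hypergeom.sf(overlap_X - 1, popsize_M, nr_succes_pop_n, samplesize_N)`.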

Examples

>>> import hnet as hn
>>> df = hn.import_example('titanic')
>>> y = df['Survived'].values
>>> out = hn.enrichment(df, y)

class hnet.hnet.hnet(alpha=0.05, y_min=10, perc_min_num=0.8, k=1, multtest='holm', dtypes='pandas', specificity='medium', dropna=True, excl_background=None, black_list=None, white_list=None)

HNET - Graphical Hypergeometric networks.

This is the main function to detect significant edge probabilities between pairs of vertices (node-links) given the input DataFrame.

A multi-step process is performed, consisting of 5 steps.

Pre-processing: Typing and One-hot Encoding. Each feature is set as being categoric, numeric or is excluded. The typing can be user-defined or automatically determined on conditions. Encoding of features in a one-hot dense array is done for the categoric terms. The one-hot dense array is subsequently used to create combinatory features using k combinations over n features (without replacement).
Combinations: Make smart combinations between features, because many mutually exclusive classes exist.
Hypergeometric test: The final dense array is used to assess significance with the categoric features.
Wilcoxon Ranksum: To assess significance across the numeric features (Xnumeric) in relation to the dense array (Xcombination), the Mann-Whitney-U test is performed.
Multiple test correction: Declaring significance for node-links.
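
The first two steps can be sketched as follows. This is a minimal, dependency-free illustration, not hnet's implementation: `one_hot` and `combine` are hypothetical helpers, and skipping combinations of labels that come from the same feature is an assumption about what "smart combinations" means for mutually exclusive classes.

```python
from itertools import combinations


def one_hot(rows):
    """One-hot encode a list of {feature: value} rows into boolean columns."""
    columns = {}
    for i, row in enumerate(rows):
        for feature, value in row.items():
            key = (feature, str(value))
            columns.setdefault(key, [False] * len(rows))[i] = True
    return columns


def combine(columns, k=2):
    """Create k-combinations of one-hot columns (logical AND per sample)."""
    n_samples = len(next(iter(columns.values())))
    combos = {}
    for keys in combinations(sorted(columns), k):
        features = {feature for feature, _ in keys}
        if len(features) < k:
            continue  # labels of one feature are mutually exclusive
        combos[keys] = [
            all(columns[key][i] for key in keys) for i in range(n_samples)
        ]
    return combos


rows = [
    {"Rain": 1, "Sprinkler": 0},
    {"Rain": 1, "Sprinkler": 1},
    {"Rain": 0, "Sprinkler": 1},
]
onehot = one_hot(rows)
pairs = combine(onehot, k=2)
```

Each resulting column is a boolean vector over the samples, which is the form a hypergeometric test on category membership needs.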

The final output of HNet is an adjacency matrix containing edge weights that depict the strength of association between pairs of vertices. The adjacency matrix can then be examined as a network representation using D3blocks.

Parameters:

alpha (float [0..1], (default : 0.05)) – Significance level; only edges with P <= alpha are kept. Set to 1 to keep all results.
y_min (int [1..n], where n is the number of samples. (default : 10)) – Minimum number of samples in a group. Should be [samples>=y_min]. All groups with fewer than y_min samples are labeled as _other_ and are not used in the model. Examples: 10, None, 1.
perc_min_num (float [None, 0..1], optional) – Force column (int or float) to be numerical if unique non-zero values are above percentage.
k (int, [1..n] , (default : 1)) – Number of combinatoric elements to create for the n features
multtest (String, (default : 'holm')) –
- None: No multiple Test,
- ’bonferroni’: one-step correction,
- ’sidak’: one-step correction,
- ’holm-sidak’: step down method using Sidak adjustments,
- ’holm’: step-down method using Bonferroni adjustments,
- ’simes-hochberg’: step-up method (independent),
- ’hommel’: closed method based on Simes tests (non-negative),
- ’fdr_bh’: Benjamini/Hochberg (non-negative),
- ’fdr_by’: Benjamini/Yekutieli (negative),
- ’fdr_tsbh’: two stage fdr correction (non-negative),
- ’fdr_tsbky’: two stage fdr correction (non-negative)
dtypes (list of str, (default : 'pandas')) – List of strings, example: [‘cat’,’num’,’’] of length y. By default the dtype is determined based on the pandas dataframe. Empty ones [‘’] are skipped. Can also be of the form: [‘cat’,’cat’,’num’,’’,’cat’]
specificity (String, (default : 'medium')) – Configure how numerical data labels are stored. Setting this variable can be of use in the ‘association_learning’ function for the creation of a network ([None] will glue most numerical labels together whereas [high] mostly will not).
- None: No additional information in the labels,
- ’low’: ‘high’ or ‘low’ are included that represent significantly higher or lower associations compared to the rest-group,
- ’medium’: ‘high’ or ‘low’ are included with 1 decimal behind the comma,
- ’high’: ‘high’ or ‘low’ are included with 3 decimals behind the comma.
dropna (Bool, [True,False] (Default : True)) – Drop rows/columns in adjacency matrix that showed no significance
excl_background (list or None, [0], [0, '0.0', 'male', ...], (Default: None)) – Remove values/strings that are labeled as background. As an example, in a two-class approach with [0,1], the 0 is usually the background and not of interest. Example: ‘0.0’: To remove categorical values with label 0
black_list (List or None (default : None)) – If a list of edges is provided as black_list, they are excluded from the search and the resulting model will not contain any of those edges.
white_list (List or None (default : None)) – If a list of edges is provided as white_list, the search is limited to those edges. The resulting model will then only contain edges that are in white_list.
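
To illustrate how alpha and the default multtest='holm' interact, the sketch below implements the Holm step-down adjustment (Bonferroni-style multipliers) in plain Python. `holm_adjust` is a hypothetical helper written for this illustration; it is not claimed to be the routine hnet calls.

```python
def holm_adjust(pvalues):
    """Return Holm step-down adjusted P-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest P-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * pvalues[idx])
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


alpha = 0.05
raw = [0.01, 0.04, 0.03]
padj = holm_adjust(raw)
significant = [p <= alpha for p in padj]
```

In this example all three raw P-values are below 0.05, but only the first one remains significant after the adjustment.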

Returns:

simmatP (pd.DataFrame()) – Adjacency matrix containing P-values between variable associations.
simmatLogP (pd.DataFrame()) – -log10(P-value) of the simmatP.
labx (list of str) – Labels that are analyzed.
dtypes (list of str) – dtypes that are set for the labels.
counts (list of str) – Relative counts for the labels based on the number of successes in population.

Return type:

dict()

Examples

>>> from hnet import hnet
>>> hn = hnet()
>>> # Load example dataset
>>> df = hn.import_example('sprinkler')
>>> # Association learning
>>> out = hn.association_learning(df)
>>> # Plot dynamic graph
>>> G_dynamic = hn.d3graph()
>>> # Plot static graph
>>> G_static = hn.plot()
>>> # Plot heatmap
>>> P_heatmap = hn.heatmap(cluster=True)
>>> # Plot feature importance
>>> hn.plot_feat_importance()

References

Blog: https://towardsdatascience.com/explore-and-understand-your-data-with-a-network-of-significant-associations-9a03cf79d254
Github: https://github.com/erdogant/hnet
Documentation: https://erdogant.github.io/hnet/
Article: https://arxiv.org/abs/2005.04679

association_learning(df, verbose=3)

Learn the associations in the data.

Parameters:

df (DataFrame, [NxM]) – N=rows->samples, and M=columns->features.

   | f1| f2| f3|
s1 | 0 | 0 | 1 |
s2 | 0 | 1 | 0 |
s3 | 1 | 1 | 0 |
verbose (int [1-5], default: 3) – Print information to screen. 0: nothing, 1: Error, 2: Warning, 3: information, 4: debug, 5: trace.

Returns:

dict.
simmatP (pd.DataFrame()) – Adjacency matrix containing P-values between variable associations.
simmatLogP (pd.DataFrame()) – -log10(P-value) of the simmatP.
labx (list of str) – Labels that are analyzed.
dtypes (list of str) – dtypes that are set for the labels.
counts (list of str) – Relative counts for the labels based on the number of successes in population.

combined_rules(simmatP=None, labx=None, verbose=3)

Association testing and combining P-values using Fisher's method.

Multiple variables (antecedents) can be associated with a single variable (consequent). To test the significance of combined associations, Fisher's method is used. The strongest connection is sorted on top.

Parameters:

simmatP (matrix) – Similarity matrix.
verbose (int, optional) – Print message to screen. The higher the number, the more details. The default is 3.

Returns:

pd.DataFrame() – Dataset containing antecedents and consequents. The strongest connection is sorted on top. The columns are as follows:
antecedents_labx – Generic label name.
antecedents – Specific label names in the ‘from’ category.
consequents – Specific label names that are the result of the antecedents.
Pfisher – Combined P-value
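
As a sketch of how several antecedent P-values can be combined into a single Pfisher value, the following implements Fisher's method with the standard library only. The statistic -2 * sum(ln p) follows a chi-square distribution with 2k degrees of freedom, for which the survival function has a closed form. `fisher_combine` is a hypothetical helper, not the function hnet uses.

```python
from math import exp, log


def fisher_combine(pvalues):
    """Combine independent P-values with Fisher's method."""
    if not pvalues or any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("P-values must lie in the interval (0, 1].")
    # Half of the test statistic: X / 2 = -sum(ln p).
    half_stat = -sum(log(p) for p in pvalues)
    # Chi-square survival function for 2k degrees of freedom:
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, len(pvalues)):
        term *= half_stat / i
        total += term
    return exp(-half_stat) * total


single = fisher_combine([0.05])
combined = fisher_combine([0.05, 0.05])
```

A single P-value is returned unchanged, while two moderately small P-values combine into a smaller one.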

Examples

>>> from hnet import hnet
>>> hn = hnet()
>>> df = hn.import_example('sprinkler')
>>> hn.association_learning(df)
>>> hn.combined_rules()
>>> print(hn.results['rules'])

compute_associations(df, simmatP, simmat_labx, X_comb, X_labx, dtypes, verbose=3): Association learning on the processed data.

d3graph(summarize=False, node_size_limits=[6, 15], savepath=None, node_color=None, directed=True, threshold=None, white_list=None, black_list=None, min_edges=None, charge=500, figsize=(1500, 1500), showfig=True, elastic=False, verbose=3)

Interactive network creator.

This function creates an interactive, stand-alone network that is built on d3 javascript. d3graph is integrated into hnet and uses the -log10(P-value) adjacency matrix. Each column and index name represents a node, whereas values >0 in the matrix represent an edge. Node links are built from rows to columns. Building the edges from rows to columns only matters in directed cases. The network nodes and edges are adjusted in weight based on the -log10(P-value), and colors are based on the category names.

Parameters:

self (Object) – The output of .association_learning()
summarize (bool, (default: False)) – Show the results based on categoric or label-specific associations. True: Summarize based on the categories. False: All associations across labels.
node_size_limits (tuple) – node sizes are scaled between [min,max] values. The default is [6,15].
savepath (str) – Save the figure in specified path.
node_color (None or 'cluster' default : None) – color nodes based on clustering or by label colors.
directed (bool, default is True.) – Create network using directed edges (arrows).
threshold (int (default : None)) – Associations (edges) are filtered based on the -log10(P) > threshold. threshold should range between 0 and maximum value of -log10(P).
black_list (List or None (default : None)) – If a list of edges is provided as black_list, they are excluded from the search and the resulting model will not contain any of those edges.
white_list (List or None (default : None)) – If a list of edges is provided as white_list, the search is limited to those edges. The resulting model will then only contain edges that are in white_list.
min_edges (int (default : None)) – Edges are only shown if a node has at least min_edges.
showfig (bool, optional) – Plot figure to screen. The default is True.
figsize (tuple, optional) – Size of the figure in the browser, [height,width]. The default is [1500,1500].

Returns:

dict (containing various results derived from network.)
G (graph) – Graph generated by networkx.
savepath (str) – Save the figure in specified path.
labx (array-like) – Cluster labels.
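
The threshold parameter can be pictured with a small sketch that turns P-values into -log10(P) edge weights and keeps only edges above the threshold. It works on a nested dictionary rather than a DataFrame, and `filter_edges` is a hypothetical helper for illustration only.

```python
from math import log10


def filter_edges(simmat_p, threshold=None):
    """Return (source, target, weight) edges with -log10(P) above threshold."""
    edges = []
    for src, row in simmat_p.items():
        for dst, p in row.items():
            # P = 0 is treated as an infinitely strong association.
            weight = float("inf") if p == 0 else -log10(p)
            if weight <= 0:
                continue  # values <= 0 do not represent an edge
            if threshold is None or weight > threshold:
                edges.append((src, dst, weight))
    return edges


simmat_p = {
    "Rain": {"Rain": 1.0, "Wet_Grass": 1e-8},
    "Wet_Grass": {"Rain": 1e-3, "Wet_Grass": 1.0},
}
all_edges = filter_edges(simmat_p)
strong_edges = filter_edges(simmat_p, threshold=5)
```

Edges are read from rows to columns, so the two directions between Rain and Wet_Grass are kept as separate edges.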

d3heatmap(summarize=False, savepath=None, directed=True, threshold=None, white_list=None, black_list=None, min_edges=None, figsize=(700, 700), vmax=None, showfig=True, verbose=3)

Interactive heatmap creator.

This function creates an interactive, stand-alone heatmap that is built on d3 javascript. d3heatmap is integrated into hnet and uses the -log10(P-value) adjacency matrix. Each column and index name represents a node, whereas values >0 in the matrix represent an edge. Node links are built from rows to columns. Building the edges from rows to columns only matters in directed cases. The network nodes and edges are adjusted in weight based on the -log10(P-value), and colors are based on the category names.

Parameters:

self (Object) – The output of .association_learning()
summarize (bool, (default: False)) – Show the results based on categoric or label-specific associations. True: Summarize based on the categories. False: All associations across labels.
savepath (str) – Save the figure in specified path.
directed (bool, default is True.) – Create network using directed edges (arrows).
threshold (int (default : None)) – Associations (edges) are filtered based on the -log10(P) > threshold. threshold should range between 0 and maximum value of -log10(P).
black_list (List or None (default : None)) – If a list of edges is provided as black_list, they are excluded from the search and the resulting model will not contain any of those edges.
white_list (List or None (default : None)) – If a list of edges is provided as white_list, the search is limited to those edges. The resulting model will then only contain edges that are in white_list.
min_edges (int (default : None)) – Edges are only shown if a node has at least min_edges.
showfig (bool, optional) – Plot figure to screen. The default is True.
figsize (tuple, optional) – Size of the figure in the browser, [height,width]. The default is (700, 700).

Returns:

dict (containing various results derived from network.)
savepath (str) – Save the figure in specified path.
labx (array-like) – Cluster labels.

heatmap(summarize=False, cluster=False, figsize=[15, 15], savepath=None, threshold=None, white_list=None, black_list=None, min_edges=None, verbose=3)

Plot static heatmap.

A heatmap can be of use when the results become too large to plot in a network.

Parameters:

self (Object) – The output of .association_learning()
summarize (bool, (default: False)) – Show the results based on categoric or label-specific associations. True: Summarize based on the categories. False: All associations across labels.
cluster (Bool, optional) – Cluster before making heatmap. The default is False.
figsize (tuple, optional) – Figure size. The default is [15, 15].
savepath (str, optional) – Path to save the figure. The default is None.
threshold (int (default : None)) – Associations (edges) are filtered based on the -log10(P) > threshold. threshold should range between 0 and maximum value of -log10(P).
black_list (List or None (default : None)) – If a list of edges is provided as black_list, they are excluded from the search and the resulting model will not contain any of those edges.
white_list (List or None (default : None)) – If a list of edges is provided as white_list, the search is limited to those edges. The resulting model will then only contain edges that are in white_list.
min_edges (int (default : None)) – Edges are only shown if a node has at least min_edges.
verbose (int, optional) – Verbosity. The default is 3.

Return type:

None.

import_example(data='titanic', url=None, sep=',', verbose=3)

Import example dataset from github source.

Import one of the few datasets from github source or specify your own download url link.

Parameters:

data (str) – Name of the example dataset: ‘sprinkler’, ‘titanic’, ‘student’, ‘fifa’, ‘cancer’, ‘waterpump’, ‘retail’.
url (str) – URL link to the dataset.

Returns:

Dataset containing mixed features.

Return type:

pd.DataFrame()

References

https://github.com/erdogant/datazets

load(filepath='hnet_model.pkl', verbose=3)

Load learned model.

Parameters:

filepath (str) – Pathname to stored pickle files.
verbose (int, optional) – Show message. A higher number gives more information. The default is 3.

Return type:

Object.

plot(summarize=False, scale=2, dist_between_nodes=0.4, node_size_limits=[25, 500], directed=True, node_color=None, savepath=None, figsize=[15, 10], pos=None, layout='fruchterman_reingold', dpi=250, threshold=None, white_list=None, black_list=None, min_edges=None, showfig=True, verbose=3)

Make a static network plot of the model results.

The results of hnet can be visualized in several manners; one of them is a static network plot.

Parameters:

self (Object) – The output of .association_learning()
summarize (bool, (default: False)) – Show the results based on categoric or label-specific associations. True: Summarize based on the categories. False: All associations across labels.
scale (int, optional) – Scale the network by blowing it up by scale. The default is 2.
dist_between_nodes (float, optional) – Distance between the nodes. Edges are sized based on this value. The default is 0.4.
node_size_limits (list, optional) – Nodes are scaled between the min and max size. The default is [25,500].
node_color (str, None or 'cluster' default is None) – color nodes based on clustering or by label colors.
directed (bool, default is True.) – Create network using directed edges (arrows).
savepath (str, optional) – Save the figure in specified path.
figsize (tuple, optional) – Size of the figure, [height,width]. The default is [15,10].
pos (list, optional) – list with coordinates to orientate the nodes.
layout (str, optional) – layouts from networkx can be used. The default is ‘fruchterman_reingold’.
dpi (int, optional) – resolution of the figure. The default is 250.
threshold (int (default : None)) – Associations (edges) are filtered based on the -log10(P) > threshold. threshold should range between 0 and maximum value of -log10(P).
black_list (List or None (default : None)) – If a list of edges is provided as black_list, they are excluded from the search and the resulting model will not contain any of those edges.
white_list (List or None (default : None)) – If a list of edges is provided as white_list, the search is limited to those edges. The resulting model will then only contain edges that are in white_list.
min_edges (int (default : None)) – Edges are only shown if a node has at least min_edges.
showfig (bool, optional) – Plot figure to screen. The default is True.

Returns:

dict. Dictionary containing various results derived from network. The keys in the dict contain the following results
G (graph) – Graph generated by networkx.
labx (str) – labels of the nodes.
pos (list) – Coordinates of the node positions.

plot_feat_importance(marker_size=5, top_n=10, figsize=(15, 8), verbose=3)

Plot feature importance.

Parameters:

marker_size (int, (default: 5)) – Marker size in the scatter plot.
top_n (int, optional) – Top n features are labelled in the plot. The default is 10.
figsize (tuple, optional) – Size of the figure, [height,width]. The default is (15, 8).
verbose (int, optional) – Verbosity. The default is 3.

Return type:

None.

prepocessing(df, verbose=3): Pre-processing based on the model parameters.

save(filepath='hnet_model.pkl', overwrite=False, verbose=3)

Save learned model in pickle file.

Parameters:

filepath (str, (default: 'hnet_model.pkl')) – Pathname to store pickle files.
overwrite (bool, (default=False)) – Overwrite file if exists.
verbose (int, optional) – Show message. A higher number gives more information. The default is 3.

Returns:

bool – Status whether the file is saved.

Return type:

[True, False]
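
The save and load behaviour described above, including the overwrite guard and the boolean status, can be sketched with the standard pickle module. `save_model` and `load_model` are hypothetical stand-ins for illustration, and the stored dictionary is made-up example data.

```python
import os
import pickle
import tempfile


def save_model(results, filepath, overwrite=False):
    """Pickle results to filepath; return True if the file was written."""
    if os.path.exists(filepath) and not overwrite:
        return False  # refuse to replace an existing file
    with open(filepath, "wb") as handle:
        pickle.dump(results, handle)
    return True


def load_model(filepath):
    """Load previously pickled results."""
    with open(filepath, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "hnet_model.pkl")
    results = {"labx": ["Rain", "Wet_Grass"], "simmatP": [[1.0, 1e-8], [1e-3, 1.0]]}
    first_save = save_model(results, path)                   # file is new
    second_save = save_model(results, path)                  # blocked by the guard
    forced_save = save_model(results, path, overwrite=True)  # explicitly replaced
    restored = load_model(path)
```

As with any pickle file, only load files from a source you trust.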

hnet.hnet.to_undirected(adjmat, method='logp', verbose=3)

Make adjacency matrix symmetric.

The adjacency matrix resulting from hnet is not necessarily symmetric due to the statistics being used. In some cases, a symmetric matrix can be useful. This function makes sure that values above the diagonal are the same as below the diagonal. Values above and below the diagonal are combined using the max or min value.

Parameters:

adjmat (array) – Square form adjacency matrix.
method (str) – Make matrix symmetric using the ‘max’ or ‘min’ function.
verbose (int) – Verbosity. The default is 3.

Returns:

Symmetric adjacency matrix.

Return type:

pd.DataFrame().
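
A minimal sketch of the symmetrization idea, using nested lists instead of a DataFrame. It covers only the ‘max’ and ‘min’ behaviour described above; `symmetrize` is a hypothetical helper and not hnet's implementation.

```python
def symmetrize(matrix, method="max"):
    """Return a symmetric copy of a square matrix given as nested lists."""
    if method not in ("max", "min"):
        raise ValueError("method must be 'max' or 'min'")
    pick = max if method == "max" else min
    n = len(matrix)
    result = [row[:] for row in matrix]  # leave the input untouched
    for i in range(n):
        for j in range(i + 1, n):
            # Combine the values above and below the diagonal.
            value = pick(matrix[i][j], matrix[j][i])
            result[i][j] = value
            result[j][i] = value
    return result


adjmat = [
    [0, 3, 0],
    [1, 0, 2],
    [5, 0, 0],
]
sym_max = symmetrize(adjmat, method="max")
sym_min = symmetrize(adjmat, method="min")
```

With ‘max’ an edge survives if it exists in either direction, whereas ‘min’ keeps only edges that are present in both directions.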

hnet.adjmat_vec.adjmat2vec(adjmat, min_weight=1)

Convert adjacency matrix into vector with source and target.

Parameters:

adjmat (pd.DataFrame()) – Adjacency matrix.
min_weight (float) – edges are returned with a minimum weight.

Returns:

nodes that are connected based on source and target

Return type:

pd.DataFrame()

Examples

>>> source=['Cloudy','Cloudy','Sprinkler','Rain']
>>> target=['Sprinkler','Rain','Wet_Grass','Wet_Grass']
>>> adjmat = vec2adjmat(source, target)
>>> vector = adjmat2vec(adjmat)

hnet.adjmat_vec.vec2adjmat(source, target, weight=None, symmetric=True)

Convert source and target into adjacency matrix.

Parameters:

source (list) – The source node.
target (list) – The target node.
weight (list of int) – The Weights between the source-target values
symmetric (bool, optional) – Make the adjacency matrix symmetric with the same number of rows as columns. The default is True.

Returns:

adjacency matrix.

Return type:

pd.DataFrame

Examples

>>> source=['Cloudy','Cloudy','Sprinkler','Rain']
>>> target=['Sprinkler','Rain','Wet_Grass','Wet_Grass']
>>> vec2adjmat(source, target)
>>>
>>> weight=[1,2,1,3]
>>> vec2adjmat(source, target, weight=weight)
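
To show what the two conversions do conceptually, here is a dependency-free sketch of the round trip between an edge list and an adjacency matrix. `to_adjmat` and `to_vector` are hypothetical helpers working on nested dictionaries. A default weight of 1 and the `>=` comparison against min_weight are assumptions, not documented behaviour.

```python
def to_adjmat(source, target, weight=None):
    """Build a square dict-of-dicts adjacency matrix from source/target lists."""
    if weight is None:
        weight = [1] * len(source)  # assumed default weight
    nodes = sorted(set(source) | set(target))
    adjmat = {row: {col: 0 for col in nodes} for row in nodes}
    for src, dst, w in zip(source, target, weight):
        adjmat[src][dst] = w
    return adjmat


def to_vector(adjmat, min_weight=1):
    """Flatten the matrix into (source, target, weight) tuples."""
    return [
        (src, dst, w)
        for src, row in adjmat.items()
        for dst, w in row.items()
        if w >= min_weight
    ]


source = ["Cloudy", "Cloudy", "Sprinkler", "Rain"]
target = ["Sprinkler", "Rain", "Wet_Grass", "Wet_Grass"]
weight = [1, 2, 1, 3]
adjmat = to_adjmat(source, target, weight=weight)
vector = to_vector(adjmat)
heavy = to_vector(adjmat, min_weight=2)
```

Every node appears as both a row and a column, so the matrix is square even though Cloudy never occurs as a target.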