bnlearn.structure_learning

Structure learning. Given a set of data samples, estimate a DAG that captures the dependencies between the variables.

bnlearn.structure_learning.fit(df, methodtype='hc', scoretype='bic', black_list=None, white_list=None, bw_list_method=None, max_indegree=None, tabu_length=100, epsilon=0.0001, max_iter=1000000.0, root_node=None, class_node=None, fixed_edges=None, return_all_dags=False, params_lingam={'apply_prior_knowledge_softly': False, 'measure': 'pwling', 'prior_knowledge': None, 'random_state': None}, params_pc={'alpha': 0.05, 'ci_test': 'chi_square'}, n_jobs=-1, verbose=3)

Structure learning fit model.

Search strategies for structure learning: the search space of DAGs is super-exponential in the number of variables, and the scoring functions described below allow for local maxima.
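To give a sense of how fast the search space grows, the following sketch counts the DAGs on n labeled nodes using Robinson's recurrence. It is a standalone illustration in plain Python and not part of bnlearn; the function name count_dags is chosen here for the example only.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of DAGs on n labeled nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


# Print the number of DAGs for 1 to 6 variables.
for n in range(1, 7):
    print(n, count_dags(n))
```

Five variables already admit 29281 DAGs and six admit 3781503, which is why exhaustive search is only practical for very small networks.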

To learn model structure (a DAG) from a data set, there are three broad techniques:

Score-based structure learning: uses a scoring function, set by scoretype, together with a search strategy, set by methodtype.
Constraint-based structure learning (PC): uses statistical tests, such as the chi-square test, to identify dependencies between variables prior to modeling.
Hybrid structure learning: a combination of both techniques.

Score-based Structure Learning

This approach performs model selection as an optimization task. It has two building blocks:

A scoring function s_D: M → R that maps models to a numerical score, based on how well they fit a given data set D.
A search strategy to traverse the search space of possible models M and select a model with optimal score.

Commonly used scoring functions to measure the fit between model and data are Bayesian Dirichlet scores such as BDeu or K2 and the Bayesian Information Criterion (BIC, also called MDL). BDeu depends on an equivalent sample size.

The BDs score [1] adapts its prior to the observed data: the equivalent sample size is divided by the number of parent configurations that actually occur in the data, rather than by the number of all possible parent configurations. Like the other scores, it evaluates how well a model describes the provided dataset.
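As a rough illustration of what a scoring function computes, the sketch below evaluates a BIC-style score for a discrete DAG: the maximized log-likelihood minus a penalty of 0.5 * log(N) per free parameter. It is written in plain Python under simplifying assumptions (variable cardinalities are taken from the observed states) and is not the implementation used by bnlearn or pgmpy; the function name bic_score and the toy data are hypothetical.

```python
import math
from collections import Counter


def bic_score(rows, parents):
    """BIC-style score of a discrete DAG.

    rows:    list of dicts mapping variable name -> observed state
    parents: dict mapping each variable -> list of its parent variables
    """
    n = len(rows)
    # Assumption: the cardinality of a variable is its number of observed states.
    card = {v: len({r[v] for r in rows}) for v in parents}
    score = 0.0
    for node, pa in parents.items():
        # Counts per (parent configuration, node state) and per parent configuration.
        joint = Counter((tuple(r[p] for p in pa), r[node]) for r in rows)
        marg = Counter(tuple(r[p] for p in pa) for r in rows)
        loglik = sum(c * math.log(c / marg[cfg]) for (cfg, _), c in joint.items())
        n_params = (card[node] - 1) * math.prod(card[p] for p in pa)
        score += loglik - 0.5 * math.log(n) * n_params
    return score


# Toy data in which B is an exact copy of A.
rows = [{"A": a, "B": a} for a in (0, 1)] * 50
empty = bic_score(rows, {"A": [], "B": []})
chain = bic_score(rows, {"A": [], "B": ["A"]})
print(chain > empty)
```

In this sketch a higher value indicates a better fit, so the structure with the edge A -> B is preferred over the empty graph for this data.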

The Direct-LiNGAM method (methodtype ‘direct-lingam’) is a semi-parametric approach that assumes linear relationships among the observed variables and non-Gaussian error terms, with the constraint that the graph remains acyclic [2]. In practice, this method allows you to model continuous and mixed datasets.

param df:

Input dataframe.

type df:

pd.DataFrame()

param methodtype:

Search strategy for structure learning.

Constraint-based:
‘pc’ or ‘cs’ or ‘constraintsearch’
Score-based:
‘ex’ or ‘exhaustivesearch’
‘hc’ or ‘hillclimbsearch’ (default)
Score-based, requires a root node:
‘cl’ or ‘chow-liu’ (requires the root_node parameter)
‘nb’ or ‘naivebayes’ (requires the root_node parameter)
‘tan’ (requires the root_node and class_node parameters)
Score-based, for continuous and mixed datasets:
‘direct-lingam’
‘ica-lingam’

type methodtype:

str, (default : ‘hc’)

param scoretype:

Scoring function for the search spaces.

‘bic’
‘k2’
‘bdeu’
‘bds’
‘aic’

type scoretype:

str, (default : ‘bic’)

param black_list:

List of edges or nodes that are blacklisted. In case of filtering on nodes, the blacklisted nodes are removed from the dataframe, so the resulting model will not contain any nodes that are in black_list.

type black_list:

List or None, (default : None)

param white_list:

List of edges or nodes that are whitelisted. In case of filtering on edges, the search is limited to those edges. In case of filtering on nodes, the resulting model will only contain nodes that are in white_list. Filtering on edges works only in case of methodtype=’hc’. See also parameter: bw_list_method

type white_list:

List or None, (default : None)

param bw_list_method:

Determines how black_list and white_list are interpreted: as a list of edges to exclude from or to limit the search, or as a list of nodes used to filter the dataframe.

‘edges’ : [(‘A’, ‘B’), (‘C’,’D’), (…)] This option is limited to only methodtype=’hc’
‘nodes’ : [‘A’, ‘B’, …] Filter the dataframe based on the nodes for black_list or white_list. Filtering can be done for every methodtype/scoretype.

type bw_list_method:

str (‘edges’ or ‘nodes’) or None, (default : None)
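The difference between filtering on edges and filtering on nodes can be sketched as follows. This is a plain-Python illustration of the idea, not bnlearn's internal code; the helper names candidate_edges and filter_nodes are hypothetical.

```python
from itertools import permutations


def candidate_edges(nodes, black_list=None, white_list=None):
    """Directed edges a search may consider after applying edge-based lists."""
    candidates = set(permutations(nodes, 2))
    if white_list is not None:
        candidates &= set(white_list)   # limit the search to whitelisted edges
    if black_list is not None:
        candidates -= set(black_list)   # exclude blacklisted edges
    return sorted(candidates)


def filter_nodes(columns, black_list=None, white_list=None):
    """Column names kept after applying node-based lists."""
    kept = [c for c in columns if white_list is None or c in white_list]
    return [c for c in kept if black_list is None or c not in black_list]


nodes = ["A", "B", "C"]
without_ab = candidate_edges(nodes, black_list=[("A", "B")])
only_listed = candidate_edges(nodes, white_list=[("A", "B"), ("B", "C")])
kept_columns = filter_nodes(nodes, black_list=["C"])
print(without_ab, only_listed, kept_columns)
```

Edge-based lists restrict which arcs the search may propose, whereas node-based lists simply reduce the set of variables before learning starts, which is why node filtering can be combined with any methodtype.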

param max_indegree:

If provided and not None, the procedure only searches among models where all nodes have at most max_indegree parents. (only in case of methodtype=’hc’)

type max_indegree:

int, (default : None)

param epsilon:

Defines the exit condition. If the improvement in score is less than epsilon, the learned model is returned. (only in case of methodtype=’hc’)

type epsilon:

float (default: 1e-4)

param max_iter:

The maximum number of iterations allowed. Returns the learned model when the number of iterations is greater than max_iter. (only in case of methodtype=’hc’)

type max_iter:

int (default: 1e6)
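How max_indegree, epsilon and max_iter interact in a greedy search can be sketched as follows. This is a simplified hill climb in plain Python that only adds or removes single edges (no edge reversal, no tabu list); it is not the algorithm as implemented in bnlearn or pgmpy. The helper names and the toy score function are hypothetical.

```python
from itertools import permutations


def has_cycle(edges, nodes):
    """Return True if the directed graph contains a cycle (Kahn's algorithm)."""
    indeg = {v: 0 for v in nodes}
    for _, child in edges:
        indeg[child] += 1
    queue = [v for v in nodes if indeg[v] == 0]
    seen = 0
    while queue:
        v = queue.pop()
        seen += 1
        for parent, child in edges:
            if parent == v:
                indeg[child] -= 1
                if indeg[child] == 0:
                    queue.append(child)
    return seen != len(nodes)


def hill_climb(nodes, score, epsilon=1e-4, max_iter=1_000_000, max_indegree=None):
    """Greedily apply the best single-edge change until the gain drops below epsilon."""
    edges = set()
    current = score(edges)
    iterations = 0
    while iterations < max_iter:
        iterations += 1
        best_gain, best_edges = 0.0, None
        for edge in permutations(nodes, 2):
            candidate = edges ^ {edge}  # add the edge if absent, remove it if present
            if has_cycle(candidate, nodes):
                continue
            if max_indegree is not None:
                n_parents = sum(1 for _, child in candidate if child == edge[1])
                if n_parents > max_indegree:
                    continue
            gain = score(candidate) - current
            if gain > best_gain:
                best_gain, best_edges = gain, candidate
        if best_edges is None or best_gain < epsilon:
            break  # exit condition: no improvement of at least epsilon
        edges, current = best_edges, current + best_gain
    return edges, iterations


# Toy score (higher is better): reward edges of a known target, penalize others.
target = {("A", "B"), ("B", "C")}
toy_score = lambda e: len(e & target) - len(e - target)

learned, n_iter = hill_climb(["A", "B", "C"], toy_score)
capped, _ = hill_climb(["A", "B", "C"], toy_score, max_iter=1)
print(learned, n_iter, capped)
```

With the toy score, the search recovers both target edges and then stops because no further change improves the score; with max_iter=1 it returns after the first accepted change.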

param root_node:

The root node for treeSearch based methods.

type root_node:

String. (only in case of chow-liu, Tree-augmented Naive Bayes (TAN))

param class_node:

The class node is required for Tree-augmented Naive Bayes (TAN)

type class_node:

String

param fixed_edges:

A list of edges that will always be present in the final learned model. The algorithm adds these edges at the start and never changes them.

type fixed_edges:

iterable, Only in case of HillClimbSearch.

param return_all_dags:

True: Return all possible DAGs. Only in case methodtype=’exhaustivesearch’
False: Do not return DAGs

type return_all_dags:

Bool, (default: False)

param params_lingam:

prior_knowledge: array-like, shape (n_features, n_features), optional (default=None): Prior knowledge used for causal discovery, where n_features is the number of features.
apply_prior_knowledge_softly: boolean, optional (default=False): If True, apply prior knowledge softly.
measure: String (default=’pwling’): One of ‘pwling’, ‘kernel’, ‘pwling_fast’. For fast execution with GPU, ‘pwling_fast’ can be used (culingam is required).

type params_lingam:

dict: {‘random_state’: None, ‘prior_knowledge’: None, ‘apply_prior_knowledge_softly’: False, ‘measure’: ‘pwling’}

param params_pc:

‘ci_test’: Conditional independence test to use. One of ‘chi_square’, ‘pearsonr’, ‘g_sq’, ‘log_likelihood’, ‘freeman_tuckey’, ‘modified_log_likelihood’, ‘neyman’, ‘cressie_read’, ‘power_divergence’
‘alpha’: Significance level of the test (default: 0.05)

type params_pc:

dict: {‘ci_test’: ‘chi_square’, ‘alpha’: 0.05}
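What an independence test with a significance level does can be sketched for the simplest case: Pearson's chi-square test between two binary variables with an empty conditioning set. The code is plain Python and not the test implementation used by bnlearn or pgmpy; the function names are hypothetical, and the closed-form p-value shown holds only for one degree of freedom. It assumes alpha acts as the threshold below which independence is rejected.

```python
import math
from collections import Counter


def chi_square(xs, ys):
    """Pearson chi-square statistic and degrees of freedom for two discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    stat = 0.0
    for x in px:
        for y in py:
            expected = px[x] * py[y] / n
            stat += (joint[(x, y)] - expected) ** 2 / expected
    dof = (len(px) - 1) * (len(py) - 1)
    return stat, dof


def p_value_dof1(stat):
    """Upper-tail probability of a chi-square variable with one degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


alpha = 0.05
xs = [0, 1] * 50
ys_dependent = list(xs)            # exact copy of xs
ys_independent = [0, 0, 1, 1] * 25  # every (x, y) combination occurs equally often

stat_dep, dof_dep = chi_square(xs, ys_dependent)
stat_ind, dof_ind = chi_square(xs, ys_independent)
keep_dep = p_value_dof1(stat_dep) < alpha  # independence rejected: keep the edge
keep_ind = p_value_dof1(stat_ind) < alpha  # independence not rejected: drop the edge
print(stat_dep, keep_dep, stat_ind, keep_ind)
```

A smaller alpha makes the test more conservative about rejecting independence, which in a constraint-based search tends to result in fewer edges.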

param verbose:

0: None, 1: Error, 2: Warning, 3: Info (default), 4: Debug, 5: Trace

type verbose:

int, (default : 3)

returns:

‘model’ : pgmpy model
‘model_edges’ : Edges
‘adjmat’ : Adjacency matrix
‘config’ : Configurations
‘structure_scores’ : Structure scores (the lower the better)

rtype:

dict with keys
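The relation between an edge list and an adjacency matrix can be illustrated with a short sketch. It uses plain Python lists rather than the object bnlearn returns, and it assumes that rows denote source nodes and columns denote target nodes; the function name edges_to_adjmat is hypothetical.

```python
def edges_to_adjmat(edges):
    """Build a boolean adjacency matrix (rows: source, columns: target) from edges."""
    nodes = sorted({node for edge in edges for node in edge})
    index = {node: i for i, node in enumerate(nodes)}
    adjmat = [[False] * len(nodes) for _ in nodes]
    for source, target in edges:
        adjmat[index[source]][index[target]] = True
    return nodes, adjmat


nodes, adjmat = edges_to_adjmat([("smoke", "lung"), ("smoke", "bronc")])
for name, row in zip(nodes, adjmat):
    print(name, row)
```

Each True entry marks one directed edge, so the row for ‘smoke’ has two True values while the rows for ‘lung’ and ‘bronc’ contain none.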

Examples

>>> # Import bnlearn
>>> import bnlearn as bn
>>>
>>> # Load DAG
>>> model = bn.import_DAG('asia')
>>>
>>> # plot ground truth
>>> G = bn.plot(model)

Examples

>>> # Sampling example
>>>
>>> # Load DAG
>>> model = bn.import_DAG('asia')
>>> # Sampling
>>> df = bn.sampling(model, n=10000)
>>>
>>> # Structure learning of sampled dataset
>>> model_sl = bn.structure_learning.fit(df, methodtype='hc', scoretype='bic')
>>>
>>> # Compute edge strength using chi-square independence test
>>> model_sl = bn.independence_test(model_sl, df)
>>>
>>> # Plot based on structure learning of sampled data
>>> bn.plot(model_sl, pos=G['pos'])

Examples

>>> # Compare networks and make plot
>>> bn.compare_networks(model, model_sl, pos=G['pos'])

Examples

>>> # Model mixed data sets (both discrete and continuous variables)
>>> import bnlearn as bn
>>>
>>> # Load example data set
>>> df = bn.import_example(data='auto_mpg')
>>>
>>> # Structure learning on the mixed data set
>>> model = bn.structure_learning.fit(df, methodtype='direct-lingam')
>>>
>>> # Compute edge strength using chi-square independence test
>>> model = bn.independence_test(model, df)
>>>
>>> # Plot the learned structure
>>> bn.plot(model)

References

[1] Scutari, Marco. An Empirical-Bayes Score for Discrete Bayesian Networks. Journal of Machine Learning Research, 2016, pp. 438–48
[2] Shimizu et al, DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model, https://arxiv.org/abs/1101.2489v3