bnlearn.structure_learning

Structure learning. Given a set of data samples, estimate a DAG that captures the dependencies between the variables.

bnlearn.structure_learning.fit(df, methodtype='hc', scoretype='bic', black_list=None, white_list=None, bw_list_method=None, max_indegree=None, tabu_length=100, epsilon=0.0001, max_iter=1000000.0, root_node=None, class_node=None, fixed_edges=None, return_all_dags=False, n_jobs=-1, verbose=3)

Structure learning fit model.

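Search strategies for structure learning. The search space of DAGs is super-exponential in the number of variables, and the scoring functions allow for local maxima. Exhaustive search is therefore only feasible for very small networks, and a greedy search may return a locally optimal structure.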
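To give a feeling for the size of the search space, the sketch below counts labeled DAGs with Robinson's recurrence. The function name count_dags is only illustrative and is not part of bnlearn.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of labeled DAGs on n nodes, via Robinson's recurrence."""
    if n == 0:
        return 1
    total = 0
    for k in range(1, n + 1):
        # Inclusion-exclusion over the k nodes that have no incoming edge.
        sign = 1 if k % 2 == 1 else -1
        total += sign * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
    return total


for n in range(1, 6):
    print(n, count_dags(n))
```

Already at five variables there are 29281 candidate DAGs, which is why exhaustive search is usually restricted to very small networks.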

To learn model structure (a DAG) from a data set, there are three broad techniques:
  1. Score-based structure learning: uses the scoring function set by scoretype and the search strategy set by methodtype.

  2. Constraint-based structure learning (PC): uses statistical tests, such as the chi-square test, to assess the strength of edges prior to modeling.

  3. Hybrid structure learning (MMHC): a combination of both techniques.
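The kind of test behind the constraint-based approach can be sketched as follows. This is a plain Pearson chi-square statistic for two binary variables, written from scratch for illustration; it is not the library's implementation, and the helper name and the example tables are assumptions.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two variables.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


CRITICAL_VALUE_DF1 = 3.841  # 5% critical value for 1 degree of freedom

dependent = [[40, 10], [10, 40]]    # hypothetical counts for A vs. B
independent = [[25, 25], [25, 25]]  # hypothetical counts for A vs. C
for name, table in [("dependent", dependent), ("independent", independent)]:
    stat = chi_square_statistic(table)
    decision = "keep edge" if stat > CRITICAL_VALUE_DF1 else "drop edge"
    print(name, stat, decision)
```

A statistic above the critical value indicates that the two variables are unlikely to be independent, so an edge between them is retained as a candidate.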

Score-based structure learning. This approach performs model selection as an optimization task. It has two building blocks:
  • A scoring function s_D: M → R that maps models to a numerical score, based on how well they fit a given data set D.

  • A search strategy to traverse the search space of possible models M and select a model with optimal score.

Commonly used scoring functions to measure the fit between model and data are Bayesian Dirichlet scores such as BDeu or K2, and the Bayesian Information Criterion (BIC, also called MDL). BDeu depends on an equivalent sample size.
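As a rough illustration of a scoring function, the sketch below computes the BIC contribution of a single discrete node given a candidate parent set. It is a minimal stand-alone version, assuming that cardinalities are taken from the states observed in the data; the name local_bic and the toy data are hypothetical and not taken from the library.

```python
from collections import Counter
from math import log


def local_bic(rows, node, parents):
    """BIC contribution of `node` given `parents` for discrete data.

    rows: list of dicts mapping variable name -> observed state.
    """
    n = len(rows)
    # Counts of (parent configuration, node state) and of parent configurations.
    joint = Counter((tuple(r[p] for p in parents), r[node]) for r in rows)
    parent_counts = Counter(tuple(r[p] for p in parents) for r in rows)

    # Maximum-likelihood log-likelihood of the node given its parents.
    log_lik = sum(c * log(c / parent_counts[cfg]) for (cfg, _), c in joint.items())

    # Free parameters: (node states - 1) per parent configuration.
    # Assumption: cardinalities come from the states observed in the data.
    r_node = len({r[node] for r in rows})
    q = 1
    for p in parents:
        q *= len({r[p] for r in rows})
    return log_lik - 0.5 * log(n) * (r_node - 1) * q


# Toy data in which B is an exact copy of A.
rows = [{"A": a, "B": a} for a in [0, 0, 0, 0, 1, 1, 1, 1]]
score_without_edge = local_bic(rows, "B", [])
score_with_edge = local_bic(rows, "B", ["A"])
print(score_without_edge, score_with_edge)
```

With this sign convention (log-likelihood minus a complexity penalty), the parent set {A} scores higher for B than the empty parent set, so a search strategy would prefer the edge A → B. Other implementations may report scores with a different sign or scale.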

The BDs score [1] adapts its prior to the data: the “equivalent sample size” is divided by the number of parent configurations that are actually observed in the dataset, rather than by the number of all possible parent configurations. Each scoring method evaluates how well a model describes the provided dataset.
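The difference between the BDeu and BDs priors can be sketched with a few lines of arithmetic. This follows the description above and the definition in [1] as a simplified illustration; the function names, the toy data, and the equivalent sample size of 10 are assumptions, not library defaults.

```python
def bdeu_alpha(ess, r, q):
    """BDeu hyperparameter: ess spread over all r * q cells."""
    return ess / (r * q)


def bds_alpha(ess, r, rows, parents):
    """BDs hyperparameter: ess spread only over observed parent configurations."""
    observed = {tuple(row[p] for p in parents) for row in rows}
    return ess / (r * len(observed))


# Node C is binary (r = 2) with two binary parents A and B (q = 4),
# but only two of the four parent configurations occur in the data.
rows = [
    {"A": 0, "B": 0, "C": 0},
    {"A": 0, "B": 0, "C": 1},
    {"A": 1, "B": 1, "C": 1},
]
print(bdeu_alpha(10.0, r=2, q=4))                 # 1.25
print(bds_alpha(10.0, 2, rows, parents=["A", "B"]))  # 2.5
```

With sparse data, BDs concentrates the prior mass on the configurations that actually occur instead of spreading it over configurations that are never observed.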

Parameters:
  • df (pd.DataFrame()) – Input dataframe.

  • methodtype (str, (default : 'hc')) –

    Search strategy for structure learning.
    • ‘hc’ or ‘hillclimbsearch’ (default)

    • ‘ex’ or ‘exhaustivesearch’

    • ‘cs’ or ‘constraintsearch’

    • ‘cl’ or ‘chow-liu’ (requires the root_node parameter)

    • ‘nb’ or ‘naivebayes’ (requires the root_node parameter)

    • ‘tan’ (requires the root_node and class_node parameters)

  • scoretype (str, (default : 'bic')) –

    Scoring function for the search spaces.
    • ’bic’

    • ’k2’

    • ’bdeu’

    • ’bds’

    • ’aic’

  • black_list (List or None, (default : None)) – List of edges or nodes to exclude. When filtering on edges, the listed edges are excluded from the search. When filtering on nodes, the listed nodes are removed from the dataframe, so the resulting model will not contain any node that is in black_list. See also parameter: bw_list_method

  • white_list (List or None, (default : None)) – List of edges or nodes to keep. When filtering on edges, the search is limited to the listed edges (only for methodtype=’hc’). When filtering on nodes, the dataframe is restricted to the listed nodes, so the resulting model will only contain nodes that are in white_list. See also parameter: bw_list_method

  • bw_list_method (list of str or tuple, (default : None)) –

    Determines how black_list and white_list are interpreted: as edges to exclude from or to limit the search, or as nodes to filter the dataframe.
    • ’edges’ : [(‘A’, ‘B’), (‘C’,’D’), (…)] Only available for methodtype=’hc’

    • ’nodes’ : [‘A’, ‘B’, …] Filter the dataframe based on the nodes in black_list or white_list. Filtering can be done for every methodtype/scoretype.

  • max_indegree (int, (default : None)) – If not None, the procedure only searches among models in which every node has at most max_indegree parents. (only in case of methodtype=’hc’)

  • epsilon (float (default: 1e-4)) – Defines the exit condition. If the improvement in score is less than epsilon, the learned model is returned. (only in case of methodtype=’hc’)

  • max_iter (int (default: 1e6)) – The maximum number of iterations allowed. Returns the learned model when the number of iterations is greater than max_iter. (only in case of methodtype=’hc’)

  • root_node (str) – The root node for tree-search based methods. Required for Chow-Liu (‘cl’), naive Bayes (‘nb’), and Tree-augmented Naive Bayes (‘tan’).

  • class_node (str) – The class node. Required for Tree-augmented Naive Bayes (‘tan’).

  • fixed_edges (iterable, only in case of HillClimbSearch) – A list of edges that will always be present in the final learned model. The algorithm adds these edges at the start and never changes them.

  • return_all_dags (Bool, (default: False)) – Return all possible DAGs. Only in case methodtype=’exhaustivesearch’

  • verbose (int, (default : 3)) – 0: None, 1: Error, 2: Warning, 3: Info (default), 4: Debug, 5: Trace
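The effect of black_list and white_list with bw_list_method=’edges’ can be pictured as a filter on the directed edges that a search is allowed to consider. The sketch below is a conceptual illustration only; candidate_edges is a hypothetical helper and does not reflect how the library applies the lists internally.

```python
from itertools import permutations


def candidate_edges(nodes, black_list=None, white_list=None):
    """Directed edges a search may consider after applying the edge lists."""
    edges = list(permutations(nodes, 2))
    if white_list is not None:
        # Limit the search to the white-listed edges.
        allowed = set(white_list)
        edges = [e for e in edges if e in allowed]
    if black_list is not None:
        # Exclude the black-listed edges.
        banned = set(black_list)
        edges = [e for e in edges if e not in banned]
    return edges


nodes = ["A", "B", "C"]
print(candidate_edges(nodes, black_list=[("A", "B")]))
print(candidate_edges(nodes, white_list=[("A", "B"), ("B", "C")]))
```

Note that edges are directed: black-listing ('A', 'B') still leaves ('B', 'A') available to the search.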

Returns:

  • ‘model’ : pgmpy model

  • ‘model_edges’ : Edges

  • ‘adjmat’ : Adjacency matrix

  • ‘config’ : Configurations

  • ‘structure_scores’ : Structure scores (the lower the better)

Return type:

dict with keys

Examples

>>> # Import bnlearn
>>> import bnlearn as bn
>>>
>>> # Load DAG
>>> model = bn.import_DAG('asia')
>>>
>>> # plot ground truth
>>> G = bn.plot(model)
>>>
>>> # Sampling
>>> df = bn.sampling(model, n=10000)
>>>
>>> # Structure learning of sampled dataset
>>> model_sl = bn.structure_learning.fit(df, methodtype='hc', scoretype='bic')
>>>
>>> # Compute edge strength using chi-square independence test
>>> model_sl = bn.independence_test(model_sl, df)
>>>
>>> # Plot based on structure learning of sampled data
>>> bn.plot(model_sl, pos=G['pos'])
>>>
>>> # Compare networks and make plot
>>> bn.compare_networks(model, model_sl, pos=G['pos'])

References

  • [1] Scutari, Marco. An Empirical-Bayes Score for Discrete Bayesian Networks. Journal of Machine Learning Research, 2016, pp. 438–48