Multivariate Parameter Fitting

The distfit library provides multivariate distribution fitting that enables modeling complex dependencies between multiple variables using copula-based methods. Rather than assuming a single multivariate parametric distribution, distfit decomposes the problem into:

Univariate marginal distribution fitting
Dependence modeling via a Gaussian copula

This separation allows flexible modeling of heterogeneous marginals while still capturing multivariate structure.

Core Features

Multivariate distribution fitting with automatic marginal estimation
Gaussian copula–based dependence modeling
Joint density evaluation for relative likelihood comparison
Multivariate outlier detection using joint log-density
Synthetic data generation preserving marginals and dependence
Extensive visualization tools for copula diagnostics

Marginal Distribution Fitting

Each variable is fitted independently using univariate distributions. You need to set multivariate=True and you can also set all other parameters as desired.

dfit = distfit(
    multivariate=True,
    distr='norm',
    method='mle',
    bins=50,
    alpha=0.05
)

Copula Dependence Modeling

Dependence is modeled using a Gaussian copula, where \(\Sigma\) is the estimated correlation matrix.

\[C(u_1, \dots, u_d) = \Phi_\Sigma\left(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_d)\right)\]

Joint Density Evaluation

The joint density is computed as:

\[f(\mathbf{x}) = c(\mathbf{u}) \prod_{i=1}^{d} f_i(x_i)\]

with copula density:

\[c(\mathbf{u}) = \frac{\phi_\Sigma(\mathbf{z})} {\prod_{i=1}^{d} \phi(z_i)}, \quad z_i = \Phi^{-1}(u_i)\]

Quick Example for Multivariate Fitting

from distfit import distfit

# Initialize with multivariate mode
dfit = distfit(multivariate=True)

# Load example data
X = dfit.import_example(data='multi_normal')
# X = dfit.import_example(data='multi_t')

# Fit model
dfit.fit_transform(X)

# Access estimated correlation matrix (Gaussian copula)
print(dfit.model.corr)

# Evaluate joint density
results = dfit.evaluate_pdf(X)
print(results['score'])
print(results['copula_density'])

# Generate synthetic samples
Xnew = dfit.generate(n=10)

# Detect multivariate outliers
bool_outliers = dfit.predict_outliers(X)

Interpretation output

results = dfit.evaluate_pdf(X)

# Output
results['copula_density']
results['score']

copula_density Vector of joint density values, one per observation. These are relative likelihoods, not probabilities.
score Mean log joint density, where higher values indicate a better model fit when comparing models on the same data.

\[\text{score} = \frac{1}{n} \sum_{i=1}^{n} \log f(\mathbf{x}_i)\]

Plots

Copula Gaussian Density

This visualization shows the data transformed to Gaussian copula space, where \(F_i\) are fitted marginal CDFs and \(\Phi^{-1}\) is the inverse standard normal CDF.

\[U_i = F_i(X_i), \quad Z_i = \Phi^{-1}(U_i)\]

Interpretation

Each point represents an observation in latent Gaussian space
Elliptical contours indicate linear dependence
Structure reflects dependence only, not marginal shape

fig, ax = dfit.plot_copulaDensity(plot_type='gaussian', pairplot=False)

Copula Gaussian Density Pairplot

Interpretation

Diagonal panels show marginal distributions in Gaussian space
Off-diagonal panels show pairwise dependence
Linear structure indicates strong dependence
Circular scatter indicates weak or no dependence

fig, ax = dfit.plot_copulaDensity(plot_type='gaussian', pairplot=True)

_images/copulaDensity_gaussian_pairplot.png

Copula Uniform Density

This visualization shows the data in copula (uniform) space.

\[U_i = F_i(X_i)\]

Interpretation

All marginals are uniform on \([0,1]\)
Structure reflects dependence only
Uniform scatter implies independence
Clustering near corners suggests tail dependence

fig, ax = dfit.plot_copulaDensity(plot_type='uniform', pairplot=False)

Copula Uniform Density Pairplot

Interpretation

Diagonal panels test PIT uniformity
Off-diagonal panels show empirical copula structure
Deviations indicate marginal misfit or dependence

fig, ax = dfit.plot_copulaDensity(plot_type='uniform', pairplot=True)

Joint Density Plot

Interpretation

Displays bivariate slices of the joint density
Combines marginal distributions and dependence
Higher dimensions are visualized via pairwise projections

fig, ax = dfit.plot_jointDensity(X)

PDF Plot

Interpretation

Shows fitted marginal probability density functions
Used to assess marginal distribution fit

fig, ax = dfit.plot(chart='pdf')

CDF Plot

Interpretation

Shows fitted marginal cumulative distribution functions
Used to validate probability integral transforms

fig, ax = dfit.plot(chart='cdf')

QQ Plot (Multivariate)

Interpretation

Compares empirical quantiles to fitted marginals
Large deviations indicate poor marginal fit
Multivariate outliers often appear at extremes

fig, ax = dfit.qqplot(X)

Outlier Detection

Outliers are defined as observations with low joint log-density. This detects observations unlikely under the full multivariate model, even if they are not marginal outliers.

outliers = dfit.predict_outliers(X)

It is expected that outliers have lower likelihood. We can expect that as shown in the code-block.

rng = np.random.default_rng(42)
mean = [0, 0]
cov = [[1, 0.6],
       [0.6, 1]]

X = rng.multivariate_normal(mean, cov, size=2000)

# Fit model on multivariate normal random data
dfit = distfit(multivariate=True, verbose=False)
dfit.fit_transform(X)

# Evaluate the copula density
pdf = dfit.evaluate_pdf(X)["copula_density"]

# Get outliers
outliers = dfit.predict_outliers(X)

# Outliers have lower likelihood
print(np.mean(pdf[outliers]))
# 0.0014758104978686533

print(np.mean(pdf[~outliers]))
# 0.10025029900211244

print(np.mean(pdf[outliers]) < np.mean(pdf[~outliers]))
# True

Generate Synthetic Data

Generate multivariate synthetic data based on the multidistribution fit.

# Generate synthetic samples
Xnew = dfit.generate(n=10)

 array([[ 0.61334212,  0.55326009,  0.15892912, -0.08668606],
     [ 1.12584863,  1.14758074,  0.18494332, -0.80220606],
     [ 3.72283115,  0.62819404,  0.31963464, -0.13226541],
     [ 1.05816854,  0.52648982,  0.30748156, -0.10778112],
     [ 0.48590115,  0.5370091 ,  0.31400217,  0.08802375],
     [ 0.51329513,  0.34469918,  0.12943172,  0.74397221],
     [ 1.3917044 ,  1.17482342,  0.30421591, -0.09497158],
     [ 0.42975052,  0.6232065 ,  0.25283493, -0.31761824],
     [ 0.27751107,  0.5779773 ,  0.35859482,  1.66407101],
     [ 1.13505836,  0.41056057,  0.24425488, -0.18984279]])

Model Comparison

Use the mean log-density score for comparison. Higher scores indicate better fit (for the same data).

res1 = model1.evaluate_pdf(X)
res2 = model2.evaluate_pdf(X)

print(res1['score'], res2['score'])

Connected variables

In a Gaussian copula model, all dependencies between variables are encoded in the correlation matrix stored in dfit.model.corr. Each entry corr[i, j] represents the linear dependence between variable i and j in Gaussian copula space.

This correlation matrix induces a graph structure where:

Nodes correspond to variables (columns of X)
Edges exist when two variables have a non-zero (or sufficiently large) correlation

By analysing this graph, we can identify connected components: groups of variables that are mutually dependent (directly or indirectly). Variables belonging to different components are statistically independent under the copula model.

Identifying connected variables helps to:

Interpret the dependency structure learned by the model
Detect independent sub-copulas in high-dimensional data
Explain block-diagonal or near block-diagonal correlation matrices
Simplify diagnostics and model validation

The example below extracts connected components directly from dfit.model.corr using a depth-first search (DFS). A small threshold can be used to avoid spurious connections caused by numerical noise.

print(dfit.model.corr)

[[1.         0.57622997]
[0.57622997 1.        ]]

# Connected variables for the first variable
dfit.model.corr[:, 0] > 0.8

Caveats and Considerations

Gaussian copula assumes elliptical dependence
Tail dependence may be underestimated
Computational cost increases with dimensionality
Density values are relative likelihoods, not probabilities
Covariance regularization is applied for numerical stability

References

The Gaussian copula relies on the multivariate normal distribution [1] [2].