Initialization
Probability density function.
distfit is a Python package for probability density fitting of univariate distributions for random variables. With the random variable as an input, distfit can find the best fit for parametric, non-parametric, and discrete distributions.
For the parametric approach, the distfit library can determine the best fit across 89 theoretical distributions. To score the fit, one of the goodness-of-fit scoring statistics can be used, such as RSS/SSE, Wasserstein, Kolmogorov-Smirnov (KS), or Energy. After finding the best-fitted theoretical distribution, the loc, scale, and arg parameters are returned, such as the mean and standard deviation for the normal distribution.
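To make the RSS score concrete, here is a minimal sketch that uses NumPy only and does not call distfit. It assumes a single candidate (the normal distribution) with loc and scale estimated as the sample mean and standard deviation. The bin count of 50 and the seed are arbitrary choices for illustration, not distfit defaults.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 2, 1000)

# Empirical density: histogram heights and the center of each bin.
heights, edges = np.histogram(X, bins=50, density=True)
centers = (edges[:-1] + edges[1:]) / 2

# Candidate distribution: a normal whose loc/scale are estimated from X.
loc, scale = X.mean(), X.std()
pdf = np.exp(-0.5 * ((centers - loc) / scale) ** 2) / (scale * np.sqrt(2 * np.pi))

# Residual sum of squares between the empirical and the theoretical density.
rss = float(np.sum((heights - pdf) ** 2))
print(f"loc={loc:.3f}, scale={scale:.3f}, RSS={rss:.5f}")
```

A lower RSS means the theoretical curve follows the histogram more closely. Ranking several candidate distributions by this number is the idea behind selecting a best fit.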
For the non-parametric approach, the distfit library contains two methods: the quantile method and the percentile method. Both methods assume that the data does not follow a specific probability distribution. In the case of the quantile method, the quantiles of the data are modeled, whereas for the percentile method, the percentiles are modeled.
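The following sketch illustrates what distribution-free bounds look like. It uses NumPy only and assumes symmetric two-sided tails at alpha = 0.05; how distfit itself places the cut points for each method is not stated here. The labels "down", "up", and "none" are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.exponential(0.5, 5000)  # skewed data; no distribution is assumed
alpha = 0.05

# Quantile-style bounds: cut points given as fractions in [0, 1].
q_low, q_high = np.quantile(X, [alpha / 2, 1 - alpha / 2])

# Percentile-style bounds: the same cut points on a 0-100 scale.
p_low, p_high = np.percentile(X, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Flag new values that fall outside the empirical bounds.
y = np.array([0.001, 0.3, 4.0])
labels = np.where(y < q_low, "down", np.where(y > q_high, "up", "none"))
print(q_low, q_high, list(labels))
```

With identical cut points, `np.quantile` and `np.percentile` return the same bounds; they differ only in the scale of their arguments.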
In case the dataset contains discrete values, the distfit library contains the option for discrete fitting. The best fit is then derived using the binomial distribution.
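As a rough illustration of fitting a binomial to discrete data, the sketch below uses the method of moments. This is a stand-in technique chosen for brevity and is not necessarily the procedure distfit applies. The true parameters (n = 20, p = 0.3) and the sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.binomial(n=20, p=0.3, size=20000)  # discrete counts

mean, var = X.mean(), X.var()

# For a binomial, mean = n*p and var = n*p*(1-p), hence p = 1 - var/mean.
p_hat = 1 - var / mean
n_hat = int(round(mean / p_hat))
print(n_hat, round(p_hat, 3))
```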
Examples
>>> from distfit import distfit
>>> import numpy as np
>>>
>>> X = np.random.normal(0, 2, 1000)
>>> y = [-8,-6,0,1,2,3,4,5,6]
>>>
>>> dfit = distfit()
>>> results = dfit.fit_transform(X)
>>>
>>> # Plot summary
>>> dfit.plot_summary()
>>>
>>> # PDF plot
>>> dfit.plot()
>>>
>>> # Make prediction
>>> results_proba = dfit.predict(y)
>>>
>>> # Plot PDF
>>> fig, ax = dfit.plot(chart='pdf')
>>>
>>> # Add the CDF to the plot
>>> fig, ax = dfit.plot(chart='cdf', n_top=1, ax=ax)
>>>
>>> # QQ-plot for top 10 fitted distributions
>>> fig, ax = dfit.qqplot(X, n_top=10)
>>>
Detect best Fit
Fit best scoring theoretical distribution to the empirical data (X).
- param X:
Set of values belonging to the data
- type X:
array-like
- param verbose:
- Set the verbose messages using string or integer values.
[0, 60, None, ‘silent’, ‘off’, ‘no’]: No message.
10, ‘debug’: Messages from debug level and higher.
20, ‘info’: Messages from info level and higher.
30, ‘warning’: Messages from warning level and higher.
50, ‘critical’: Messages from critical level and higher.
- type verbose:
[str, int], default is ‘info’ or 20
- returns:
dict.
model (dict) – dict containing keys with distribution parameters
score : Scoring statistic
name : distribution name
distr : distribution function
params : all kinds of parameters
loc : loc function parameter
scale : scale function parameter
arg : arg function parameter
summary (list) – Residual Sum of Squares
histdata (tuple (observed, bins)) – tuple containing observed and bins for data X in the histogram.
size (int) – total number of elements in data X
Examples
>>> from distfit import distfit
>>> import numpy as np
>>>
>>> # Create dataset
>>> X = np.random.normal(0, 2, 1000)
>>> y = [-8,-6,0,1,2,3,4,5,6]
>>>
>>> # Default method is parametric.
>>> dfit = distfit()
>>>
>>> # In case of quantile
>>> dfit = distfit(method='quantile')
>>>
>>> # In case of percentile
>>> dfit = distfit(method='percentile')
>>>
>>> # Fit using method
>>> model_results = dfit.fit_transform(X)
>>>
>>> dfit.plot()
>>>
>>> # Make prediction
>>> results = dfit.predict(y)
>>>
>>> # Plot results with CII and predictions.
>>> dfit.plot()
>>>
Predict
Compute probability for response variables y, using the specified method.
Computes P-values for [y] based on the fitted distribution from X.
The empirical distribution of X is used to estimate the loc/scale/arg parameters for a
theoretical distribution in case method type is parametric.
- param y:
Values to be predicted.
- type y:
array-like
- param multtest:
- Multiple test correction.
None
‘bonferroni’
‘sidak’
‘holm-sidak’
‘holm’
‘simes-hochberg’
‘hommel’
‘fdr_bh’
‘fdr_by’
‘fdr_tsbh’
‘fdr_tsbky’
- type multtest:
str, default: ‘fdr_bh’
- param alpha:
Significance alpha is inherited from self if None.
- type alpha:
float, default: None
- param todf:
Output results in pandas dataframe when True. Note that creating pandas dataframes makes the code run significantly slower!
- type todf:
Bool (default: False)
- param verbose:
- Set the verbose messages using string or integer values.
[0, 60, None, ‘silent’, ‘off’, ‘no’]: No message.
10, ‘debug’: Messages from debug level and higher.
20, ‘info’: Messages from info level and higher.
30, ‘warning’: Messages from warning level and higher.
50, ‘critical’: Messages from critical level and higher.
- type verbose:
[str, int], default is ‘info’ or 20
- returns:
Object.
y_pred (list of str) – prediction of bounds [upper, lower] for input y, using the fitted distribution X.
y_proba (list of float) – probability for response variable y.
df (pd.DataFrame (only when set: todf=True)) – Dataframe containing the predictions in a structured manner.
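The sketch below shows one plausible reading of the prediction step: compute a tail probability for each value of y under an assumed fitted normal (loc = 0, scale = 2), then apply the Benjamini-Hochberg correction that corresponds to ‘fdr_bh’. It does not call distfit; the helper names `normal_cdf` and `benjamini_hochberg`, the tail definition, and the labels are assumptions made for illustration.

```python
import math
import numpy as np

def normal_cdf(values, loc, scale):
    """CDF of a normal distribution, evaluated elementwise."""
    z = (np.asarray(values, dtype=float) - loc) / (scale * math.sqrt(2))
    return 0.5 * (1 + np.array([math.erf(v) for v in z]))

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P-values."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    ranked = pvals[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest P-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0, 1)
    return adjusted

loc, scale, alpha = 0.0, 2.0, 0.05  # assumed fitted parameters
y = [-8, -6, 0, 1, 2, 3, 4, 5, 6]

cdf = normal_cdf(y, loc, scale)
p_raw = np.minimum(cdf, 1 - cdf)  # tail probability on the nearest side
p_adj = benjamini_hochberg(p_raw)
y_pred = np.where(p_adj > alpha, "none", np.where(np.asarray(y) < loc, "down", "up"))
print(list(zip(y, np.round(p_adj, 4), y_pred)))
```

Under these assumptions, the values -8 and -6 are flagged in the lower tail and 4, 5, and 6 in the upper tail. The values 0 to 3 remain unflagged after correction.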
Examples
>>> from distfit import distfit
>>> import numpy as np
>>>
>>> # Create dataset
>>> X = np.random.normal(0, 2, 1000)
>>> y = [-8,-6,0,1,2,3,4,5,6]
>>>
>>> # Initialize
>>> dfit = distfit(todf=True)
>>> # Fit
>>> model_results = dfit.fit_transform(X)
>>>
>>> # Make predictions
>>> results = dfit.predict(y)
>>> print(results['df'])
>>>
>>> # Plot results with CII and predictions.
>>> dfit.plot()
>>>
Generate Synthetic data
Generate synthetic data based on the fitted distribution.
- param n:
Number of samples to generate.
- type n:
int
- param random_state:
Random state.
- type random_state:
int, optional
- param verbose:
- Set the verbose messages using string or integer values.
[0, 60, None, ‘silent’, ‘off’, ‘no’]: No message.
10, ‘debug’: Messages from debug level and higher.
20, ‘info’: Messages from info level and higher.
30, ‘warning’: Messages from warning level and higher.
50, ‘critical’: Messages from critical level and higher.
- type verbose:
[str, int], default is ‘info’ or 20
- returns:
X – Numpy array with generated data.
- rtype:
np.array
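Conceptually, generating synthetic data amounts to drawing random samples from the fitted distribution with a seeded generator. The sketch below assumes a fitted normal with loc = 0 and scale = 2. The function name `generate` mirrors the method name but is a standalone illustration, not the library's code.

```python
import numpy as np

def generate(loc, scale, n, random_state=None):
    """Draw n samples from a normal distribution with the given parameters."""
    rng = np.random.default_rng(random_state)
    return rng.normal(loc, scale, n)

# The same random_state reproduces the same samples.
a = generate(loc=0.0, scale=2.0, n=10, random_state=42)
b = generate(loc=0.0, scale=2.0, n=10, random_state=42)
print(a.shape, np.array_equal(a, b))  # (10,) True
```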
Examples
>>> from distfit import distfit
>>> import numpy as np
>>>
>>> # Create dataset
>>> X = np.random.normal(0, 2, 1000)
>>> y = [-8,-6,0,1,2,3,4,5,6]
>>>
>>> # Initialize
>>> dfit = distfit()
>>> # Fit
>>> dfit.fit_transform(X)
>>>
>>> # Create synthetic data using fitted distribution.
>>> Xnew = dfit.generate(10)
>>>
Compute Density
Compute density based on input data and number of bins.
- param X:
Set of values belonging to the data
- type X:
array-like
- param bins:
- Bin size to determine the empirical histogram.
‘auto’: Determine the bin size automatically.
50: Set specific bin size
- type bins:
int or str, default: ‘auto’
- param mhist:
- The density extraction method.
‘numpy’
‘seaborn’
- type mhist:
str, (default: ‘numpy’)
- returns:
binedges (array-like) – Array with the bin edges.
histvals (array-like) – Array with the histogram density values.
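For the ‘numpy’ option, the density computation can be approximated with `np.histogram`. This sketch assumes that the x-positions used for plotting are the bin midpoints; whether distfit returns edges or midpoints is not stated here.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(0, 2, 1000)

# density=True normalizes the histogram so that it integrates to 1.
histvals, edges = np.histogram(X, bins="auto", density=True)

# One x-position per bar: the midpoint of each bin (an assumption).
centers = (edges[:-1] + edges[1:]) / 2

# Sanity check: bar heights times bar widths sum to 1.
area = float(np.sum(histvals * np.diff(edges)))
print(len(centers), len(histvals), round(area, 6))
```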
Examples
>>> from distfit import distfit
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>>
>>> # Create dataset
>>> X = np.random.normal(0, 2, 1000)
>>>
>>> # Initialize
>>> dfit = distfit()
>>>
>>> # Compute bins and density
>>> bins, density = dfit.density(X)
>>>
>>> # Make plot
>>> plt.figure(); plt.plot(bins, density)
>>>
Bootstrapping
Bootstrap.
To validate our fitted model, the Kolmogorov-Smirnov (KS) test is used to compare the distribution of the bootstrapped samples to the original data to assess the goodness of fit. If the model is overfitting, the KS test will reveal a significant difference between the bootstrapped samples and the original data, indicating that the model is not representative of the underlying distribution.
- The goal here is to estimate the KS statistic of the fitted distribution when the params are estimated from data.
1. Resample using the fitted distribution.
2. Use the resampled data to fit the distribution.
3. Compare the resampled data vs. the fitted PDF.
4. Repeat steps 1-3 1000 times.
5. Return score = number of successes / n_boots.
6. Return whether the 95% CII for the KS-test statistic is valid.
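The resampling loop described above can be sketched with NumPy alone, assuming the fitted distribution is normal. Two simplifications are assumptions of this sketch and not taken from distfit. First, a bootstrap counts as a success when its KS statistic stays below the asymptotic 5% critical value 1.358 / sqrt(n). Second, that critical value applies to a fully specified distribution, so it is conservative when the parameters are re-estimated in every round.

```python
import math
import numpy as np

def normal_cdf(x, loc, scale):
    """Normal CDF evaluated elementwise."""
    z = (np.asarray(x, dtype=float) - loc) / (scale * math.sqrt(2))
    return 0.5 * (1 + np.array([math.erf(v) for v in z]))

def ks_statistic(sample, loc, scale):
    """Largest gap between the empirical CDF and the normal CDF."""
    x = np.sort(sample)
    n = x.size
    cdf = normal_cdf(x, loc, scale)
    d_plus = np.max(np.arange(1, n + 1) / n - cdf)
    d_minus = np.max(cdf - np.arange(0, n) / n)
    return float(max(d_plus, d_minus))

def bootstrap_score(X, n_boots=100, n=10000, seed=0):
    """Fraction of bootstraps in which the refitted model passes the KS check."""
    rng = np.random.default_rng(seed)
    n = min(len(X), n)
    loc, scale = float(np.mean(X)), float(np.std(X))  # fit on the original data
    d_crit = 1.358 / math.sqrt(n)  # assumed pass/fail threshold (see lead-in)
    passed = 0
    for _ in range(n_boots):
        resampled = rng.normal(loc, scale, n)               # 1. resample
        loc_b, scale_b = resampled.mean(), resampled.std()  # 2. refit
        d = ks_statistic(resampled, loc_b, scale_b)         # 3. compare
        passed += int(d < d_crit)
    return passed / n_boots

rng = np.random.default_rng(5)
X = rng.normal(163, 10, 2000)
score = bootstrap_score(X, n_boots=100)
print(score)
```

A score close to 1 indicates that data resampled from the fitted model is consistently well described by a refit of the same distribution family.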
- param X:
Set of values belonging to the data
- type X:
array-like
- param n_boots:
- Number of bootstraps to validate the fit.
None: No Bootstrap.
1000: Thousand bootstraps.
- type n_boots:
int, default: None
- param alpha:
Significance alpha.
- type alpha:
float, default: 0.05
- param n:
Number of samples to draw per bootstrap. This number is set to minimum(len(X), n)
- type n:
int, default: 10000
- param n_top:
Show the top number of results. The default is None.
- type n_top:
int, optional
- param update_model:
Update to the best model.
- type update_model:
bool, default: True
- rtype:
None.
Examples
>>> # Import libraries
>>> from distfit import distfit
>>> import numpy as np
>>>
>>> # Initialize with 100 bootstraps
>>> dfit = distfit(n_boots=100)
>>>
>>> # Random data
>>> # X = np.random.exponential(0.5, 10000)
>>> # X = np.random.uniform(0, 1000, 10000)
>>> X = np.random.normal(163, 10, 10000)
>>>
>>> results = dfit.fit_transform(X)
>>>
>>> # Results are stored in summary
>>> dfit.summary[['name', 'score', 'bootstrap_score', 'bootstrap_pass']]
>>>
>>> # Create summary plot
>>> dfit.plot_summary()
Examples
>>> # Import libraries
>>> from distfit import distfit
>>> import numpy as np
>>>
>>> # Initialize without bootstraps
>>> dfit = distfit()
>>>
>>> # Random data
>>> # X = np.random.exponential(0.5, 10000)
>>> # X = np.random.uniform(0, 1000, 10000)
>>> X = np.random.normal(163, 10, 10000)
>>>
>>> # Fit without bootstraps
>>> results = dfit.fit_transform(X)
>>>
>>> # Results are stored in summary
>>> dfit.summary[['name', 'score', 'bootstrap_score', 'bootstrap_pass']]
>>>
>>> # Create summary plot (no bootstrap is present)
>>> dfit.plot_summary()
>>>
>>> results = dfit.bootstrap(X, n_boots=100)
>>>
>>> # Create summary plot (the bootstrap is automatically added to the plot)
>>> dfit.plot_summary()
Get Distributions
Return the distributions.
- param distr:
- Distributions to return.
‘full’: all available distributions.
‘popular’ : [norm, expon, pareto, dweibull, t, genextreme, gamma, lognorm, beta, uniform, loggamma]
‘norm’, ‘t’, ‘k’ or any other distribution name.
[‘norm’, ‘t’, ‘k’]: list of distributions.
- type distr:
str or list of str.
- rtype:
List with distributions.
Plot PDF/CDF
Make plot.
- param chart:
- Chart to plot.
‘pdf’: Probability density function.
‘cdf’: Cumulative distribution function.
- type chart:
str, default: ‘pdf’
- param n_top:
Show the top number of results. The default is 1.
- type n_top:
int, optional
- param title:
Title of the plot.
- type title:
String, optional (default: ‘’)
- param emp_properties:
- The line properties of the empirical line.
None: Do not plot.
{‘color’: ‘#000000’, ‘linewidth’: 3, ‘linestyle’: ‘-’}
- type emp_properties:
dict
- param pdf_properties:
- The line properties of the PDF or the CDF.
None: Do not plot.
{‘color’: ‘#880808’, ‘linewidth’: 3, ‘linestyle’: ‘-’}
- type pdf_properties:
dict
- param bar_properties:
- bar properties of the histogram.
None: Do not plot.
{‘color’: ‘#607B8B’, ‘linewidth’: 1, ‘edgecolor’: ‘#5A5A5A’, ‘align’: ‘edge’}
- type bar_properties:
dict
- param cii_properties:
- Line and marker properties of the confidence interval (CII) and the predictions.
None: Do not plot.
{‘color’: ‘#C41E3A’, ‘linewidth’: 3, ‘linestyle’: ‘dashed’, ‘marker’: ‘x’, ‘size’: 20, ‘color_sign_multipletest’: ‘g’, ‘color_sign’: ‘g’, ‘color_general’: ‘r’}
- type cii_properties:
dict
- param fontsize:
Fontsize for the axis and ticks.
- type fontsize:
int, (default: 18)
- param xlabel:
Label for the x-axis.
- type xlabel:
String, (default: ‘value’)
- param ylabel:
Label for the y-axis.
- type ylabel:
String, (default: ‘Frequency’)
- param figsize:
The figure size.
- type figsize:
tuple, optional (default: (10,8))
- param xlim:
Limit figure in x-axis.
- type xlim:
Float, optional (default: None)
- param ylim:
Limit figure in y-axis.
- type ylim:
Float, optional (default: None)
- param fig:
Matplotlib figure (Note - ignored when method is discrete)
- type fig:
Figure, optional (default: None)
- param ax:
Matplotlib Axes object (Note - ignored when method is discrete)
- type ax:
Axes, optional (default: None)
- param grid:
Show the grid on the figure.
- type grid:
Bool, optional (default: True)
- param cmap:
Colormap when plotting multiple CDFs. The used colors are stored in dfit.summary[‘colors’]. However, when cmap is set, the specified colormap is used.
- type cmap:
String, optional (default: None)
- param verbose:
- Set the verbose messages using string or integer values.
[0, 60, None, ‘silent’, ‘off’, ‘no’]: No message.
10, ‘debug’: Messages from debug level and higher.
20, ‘info’: Messages from info level and higher.
30, ‘warning’: Messages from warning level and higher.
50, ‘critical’: Messages from critical level and higher.
- type verbose:
[str, int], default is ‘info’ or 20
- rtype:
tuple (fig, ax)
Examples
>>> from distfit import distfit
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>>
>>> # Create dataset
>>> X = np.random.normal(0, 2, 10000)
>>> y = [-8,-6,0,1,2,3,4,5,6]
>>>
>>> # Initialize
>>> dfit = distfit(alpha=0.01)
>>> dfit.fit_transform(X)
>>> dfit.predict(y)
>>>
>>> # Plot separately
>>> fig, ax = dfit.plot(chart='pdf')
>>> fig, ax = dfit.plot(chart='cdf')
>>>
>>> # Change or remove properties of the chart.
>>> dfit.plot(chart='pdf', pdf_properties={'color': 'r'}, cii_properties={'color': 'g'}, emp_properties=None, bar_properties=None)
>>> dfit.plot(chart='cdf', pdf_properties={'color': 'r'}, cii_properties={'color': 'g'}, emp_properties=None, bar_properties=None)
>>>
>>> # Create subplot
>>> fig, ax = plt.subplots(1,2, figsize=(25, 10))
>>> dfit.plot(chart='pdf', ax=ax[0])
>>> dfit.plot(chart='cdf', ax=ax[1])
>>>
>>> # Change or remove properties of the chart.
>>> fig, ax = dfit.plot(chart='pdf', pdf_properties={'color': 'r', 'linewidth': 3}, cii_properties={'color': 'r', 'linewidth': 3}, bar_properties={'color': '#1e3f5a'})
>>> dfit.plot(chart='cdf', n_top=10, pdf_properties={'color': 'r'}, cii_properties=None, bar_properties=None, ax=ax)
QQ-plot
Plot QQplot results.
- param X:
The Null distribution or background data is built from X.
- type X:
array-like
- param line:
- Options for the reference line to which the data is compared.
‘45’ - 45-degree line
‘s’ - standardized line, the expected order statistics are scaled by the standard deviation of the given sample and have the mean added to them.
‘r’ - A regression line is fit
‘q’ - A line is fit through the quartiles.
‘None’ - no reference line is added to the plot.
- type line:
str, default: ‘45’
- param n_top:
Show the top number of results. The default is 1.
- type n_top:
int, optional
- param title:
Title of the plot.
- type title:
String, optional (default: ‘’)
- param fontsize:
Fontsize for the axis and ticks.
- type fontsize:
int, (default: 18)
- param figsize:
The figure size.
- type figsize:
tuple, optional (default: (10,8))
- param xlim:
Limit figure in x-axis.
- type xlim:
Float, optional (default: None)
- param ylim:
Limit figure in y-axis.
- type ylim:
Float, optional (default: None)
- param fig:
Matplotlib figure (Note - ignored when method is discrete)
- type fig:
Figure, optional (default: None)
- param ax:
Matplotlib Axes object. If given, this subplot is used to plot in instead of a new figure being created.
- type ax:
AxesSubplot, optional (default: None)
- param grid:
Show the grid on the figure.
- type grid:
Bool, optional (default: True)
- param cmap:
Colormap when plotting multiple distributions. The used colors are stored in dfit.summary[‘colors’]. However, when cmap is set, the specified colormap is used.
- type cmap:
String, optional (default: None)
- param verbose:
- Set the verbose messages using string or integer values.
[0, 60, None, ‘silent’, ‘off’, ‘no’]: No message.
10, ‘debug’: Messages from debug level and higher.
20, ‘info’: Messages from info level and higher.
30, ‘warning’: Messages from warning level and higher.
50, ‘critical’: Messages from critical level and higher.
- type verbose:
[str, int], default is ‘info’ or 20
- rtype:
tuple (fig, ax)
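The numbers behind a QQ-plot can be computed without any plotting. The sketch below assumes a fitted normal and uses the plotting positions (i - 0.5) / n, which is one common convention and not necessarily the one distfit uses. The fitted regression line corresponds in spirit to the ‘r’ option, and a slope near 1 with an intercept near 0 means the points lie close to the ‘45’ line.

```python
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(0, 2, 1000)

# Sample quantiles are simply the sorted observations.
sample_q = np.sort(X)

# Theoretical quantiles of the fitted normal at matching plotting positions.
n = X.size
positions = (np.arange(1, n + 1) - 0.5) / n
fitted = NormalDist(mu=float(X.mean()), sigma=float(X.std()))
theory_q = np.array([fitted.inv_cdf(float(p)) for p in positions])

# Regression line through the QQ points.
slope, intercept = np.polyfit(theory_q, sample_q, 1)
print(round(float(slope), 3), round(float(intercept), 3))
```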
Examples
>>> from distfit import distfit
>>> import numpy as np
>>>
>>> # Create dataset
>>> X = np.random.normal(0, 2, 1000)
>>>
>>> # Initialize
>>> dfit = distfit()
>>>
>>> # Fit
>>> dfit.fit_transform(X)
>>>
>>> # Make qq-plot
>>> dfit.qqplot(X)
>>>
>>> # Make qq-plot for top 10 best fitted models.
>>> dfit.qqplot(X, n_top=10)
>>>
Plot Summary
Plot summary results.
- param n_top:
Show the top number of results. The default is None.
- type n_top:
int, optional
- param figsize:
The figure size.
- type figsize:
tuple, optional (default: (10,8))
- param color_axes_left:
Hex color of goodness of fit axes (left axes).
- type color_axes_left:
str, (default: ‘#0000FF’)
- param color_axes_right:
Hex color of bootstrap axes (right axes).
- type color_axes_right:
str, (default: ‘#FC6600’)
- param title:
Title of the plot.
- type title:
String, optional (default: ‘’)
- param grid:
Show the grid on the figure.
- type grid:
Bool, optional (default: True)
- param fig:
Matplotlib figure
- type fig:
Figure, optional (default: None)
- param ylim:
Limit figure in y-axis.
- type ylim:
Float, optional (default: [None, None])
- param ax:
Matplotlib Axes object
- type ax:
Axes, optional (default: None)
- param verbose:
- Set the verbose messages using string or integer values.
[0, 60, None, ‘silent’, ‘off’, ‘no’]: No message.
10, ‘debug’: Messages from debug level and higher.
20, ‘info’: Messages from info level and higher.
30, ‘warning’: Messages from warning level and higher.
50, ‘critical’: Messages from critical level and higher.
- type verbose:
[str, int], default is ‘info’ or 20
- rtype:
tuple (fig, ax)
Save
Save learned model in pickle file.
- param filepath:
Pathname to store pickle files.
- type filepath:
str
- param verbose:
Show message. A higher number gives more information. The default is 3.
- type verbose:
int, optional
- rtype:
object
Load
Load learned model.
- param filepath:
Pathname to stored pickle files.
- type filepath:
str
- param verbose:
Show message. A higher number gives more information. The default is 3.
- type verbose:
int, optional
- rtype:
Object.
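Because the model is stored as a pickle file, saving and loading reduce to a pickle round trip. The sketch below uses only the standard library. A plain dict with hypothetical contents stands in for the fitted model, and the file name `model.pkl` is arbitrary.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model: a plain dict with hypothetical contents.
model = {"name": "norm", "loc": 0.0, "scale": 2.0}

with tempfile.TemporaryDirectory() as tmpdir:
    filepath = os.path.join(tmpdir, "model.pkl")

    # Save: serialize the object to disk.
    with open(filepath, "wb") as f:
        pickle.dump(model, f)

    # Load: read it back. Only unpickle files from a trusted source.
    with open(filepath, "rb") as f:
        restored = pickle.load(f)

print(restored == model)  # True
```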