API References

distfit is a Python package for probability density fitting.

class distfit.distfit.BinomPMF(n)

Wrapper so that integer parameters don’t occur as function arguments.

distfit.distfit.check_version()
distfit.distfit.compute_cii(self, model, alpha=None, logger=None)
class distfit.distfit.distfit(method='parametric', distr: str = 'popular', stats: str = 'RSS', bins: int = 'auto', bound: str = 'both', alpha: float = 0.05, n_boots: int = None, smooth: int = None, n_perm: int = 10000, todf: bool = False, weighted: bool = True, f: float = 1.5, mhist: str = 'numpy', cmap: str = 'Set1', random_state: int = None, verbose: [<class 'str'>, <class 'int'>] = 'info', multtest=None, n_jobs=1)

Probability density function.

distfit is a Python package for probability density fitting of univariate distributions for random variables. With the random variable as an input, distfit can find the best fit for parametric, non-parametric, and discrete distributions.

  • For the parametric approach, the distfit library can determine the best fit across 89 theoretical distributions. To score the fit, one of the scoring statistics for the goodness-of-fit test can be used, such as RSS/SSE, Wasserstein, Kolmogorov-Smirnov (KS), or Energy. After finding the best-fitted theoretical distribution, the loc, scale, and arg parameters are returned, such as the mean and standard deviation for the normal distribution.

  • For the non-parametric approach, the distfit library contains two methods, the quantile and percentile method. Both methods assume that the data does not follow a specific probability distribution. In the case of the quantile method, the quantiles of the data are modeled whereas for the percentile method, the percentiles are modeled.

  • In case the dataset contains discrete values, the distfit library contains the option for discrete fitting. The best fit is then derived using the binomial distribution.
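To make the RSS scoring idea concrete, here is a minimal sketch that scores two candidate distributions against the histogram density of the data. It is not distfit's implementation: the helper names (rss_score, normal_pdf, uniform_pdf), the 50 bins, and the simple moment and range estimates are assumptions chosen for illustration.

```python
import numpy as np


def rss_score(X, pdf, bins=50):
    """Residual sum of squares between the histogram density of X and a candidate PDF."""
    density, edges = np.histogram(X, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    return float(np.sum((density - pdf(centers)) ** 2))


def normal_pdf(loc, scale):
    """Return a normal PDF with the given location and scale."""
    return lambda x: np.exp(-0.5 * ((x - loc) / scale) ** 2) / (scale * np.sqrt(2 * np.pi))


def uniform_pdf(low, high):
    """Return a uniform PDF on [low, high]."""
    return lambda x: np.where((x >= low) & (x <= high), 1.0 / (high - low), 0.0)


rng = np.random.default_rng(0)
X = rng.normal(0, 2, 1000)

# Estimate each candidate's parameters from the data, then score it.
candidates = {
    "norm": normal_pdf(X.mean(), X.std()),
    "uniform": uniform_pdf(X.min(), X.max()),
}
scores = {name: rss_score(X, pdf) for name, pdf in candidates.items()}

# The lowest RSS marks the best-fitting candidate.
best = min(scores, key=scores.get)
print(best, scores)
```

A lower RSS means the candidate PDF follows the empirical density more closely, which is why the candidates are ranked in ascending order of score.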

Examples

>>> from distfit import distfit
>>> import numpy as np
>>>
>>> X = np.random.normal(0, 2, 1000)
>>> y = [-8,-6,0,1,2,3,4,5,6]
>>>
>>> dfit = distfit()
>>> results = dfit.fit_transform(X)
>>>
>>> # Plot summary
>>> dfit.plot_summary()
>>>
>>> # PDF plot
>>> dfit.plot()
>>>
>>> # Make prediction
>>> results_proba = dfit.predict(y)
>>>
>>> # Plot PDF
>>> fig, ax = dfit.plot(chart='pdf')
>>>
>>> # Add the CDF to the plot
>>> fig, ax = dfit.plot(chart='cdf', n_top=1, ax=ax)
>>>
>>> # QQ-plot for top 10 fitted distributions
>>> fig, ax = dfit.qqplot(X, n_top=10)
>>>
bootstrap(X, n_boots=100, alpha=0.05, n=10000, n_top=None, update_model=True)

Bootstrap.

To validate our fitted model, the Kolmogorov-Smirnov (KS) test is used to compare the distribution of the bootstrapped samples to the original data to assess the goodness of fit. If the model is overfitting, the KS test will reveal a significant difference between the bootstrapped samples and the original data, indicating that the model is not representative of the underlying distribution.

The goal here is to estimate the KS statistic of the fitted distribution when the parameters are estimated from the data.

  1. Resample using the fitted distribution.

  2. Use the resampled data to fit the distribution.

  3. Compare the resampled data vs. the fitted PDF.

  4. Repeat steps 1-3 n_boots times.

  5. Return score = ratio of successes / n_boots.

  6. Return whether the 95% CII for the KS-test statistic is valid.
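The resampling loop above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: it fits only a normal distribution, and both the scoring rule (share of bootstrap KS statistics at least as large as the observed one) and the pass rule (observed statistic below the 1 - alpha bootstrap quantile) are guesses at what "success" and "valid" could mean. The names ks_statistic and bootstrap_ks are hypothetical.

```python
import random
from statistics import NormalDist


def ks_statistic(sample, dist):
    """Largest gap between the empirical CDF of `sample` and the CDF of `dist`."""
    xs = sorted(sample)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        cdf = dist.cdf(x)
        gap = max(gap, (i + 1) / n - cdf, cdf - i / n)
    return gap


def bootstrap_ks(X, n_boots=100, alpha=0.05, n=10000, seed=0):
    """Parametric bootstrap of the KS statistic for a normal fit (illustrative only)."""
    rng = random.Random(seed)
    fitted = NormalDist.from_samples(X)
    observed = ks_statistic(X, fitted)
    size = min(len(X), n)
    boot_stats = []
    for _ in range(n_boots):
        # Step 1: resample from the fitted distribution.
        resampled = [rng.gauss(fitted.mean, fitted.stdev) for _ in range(size)]
        # Step 2: refit the distribution on the resampled data.
        refit = NormalDist.from_samples(resampled)
        # Step 3: compare the resampled data with the refitted distribution.
        boot_stats.append(ks_statistic(resampled, refit))
    # Assumed scoring rule: share of bootstrap statistics >= the observed statistic.
    score = sum(stat >= observed for stat in boot_stats) / n_boots
    # Assumed pass rule: observed statistic stays below the (1 - alpha) bootstrap quantile.
    threshold = sorted(boot_stats)[min(n_boots - 1, int((1 - alpha) * n_boots))]
    return score, observed <= threshold


rng = random.Random(1)
normal_data = [rng.gauss(163, 10) for _ in range(500)]
skewed_data = [rng.expovariate(1.0) for _ in range(500)]

normal_score, normal_pass = bootstrap_ks(normal_data)
skewed_score, skewed_pass = bootstrap_ks(skewed_data)
print(normal_score, normal_pass, skewed_score, skewed_pass)
```

Because the parameters are re-estimated inside every bootstrap round, the reference distribution of the KS statistic accounts for the fact that the fit was tuned to the data. Exponential data forced into a normal fit therefore scores close to zero.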

Parameters:
  • X (array-like) – Set of values belonging to the data

  • n_boots (int, default: 100) –

    Number of bootstraps to validate the fit.
    • None: No Bootstrap.

    • 1000: Thousand bootstraps.

  • alpha (float, default: 0.05) – Significance alpha.

  • n (int, default: 10000) – Number of samples to draw per bootstrap. This number is set to min(len(X), n).

  • n_top (int, optional) – Show the top number of results. The default is None.

  • update_model (bool, default: True) – Update to the best model.

Return type:

None.

Examples

>>> # Import libraries
>>> from distfit import distfit
>>> import numpy as np
>>>
>>> # Initialize with 100 bootstraps
>>> dfit = distfit(n_boots=100)
>>>
>>> # Random data
>>> # X = np.random.exponential(0.5, 10000)
>>> # X = np.random.uniform(0, 1000, 10000)
>>> X = np.random.normal(163, 10, 10000)
>>>
>>> results = dfit.fit_transform(X)
>>>
>>> # Results are stored in summary
>>> dfit.summary[['name', 'score', 'bootstrap_score', 'bootstrap_pass']]
>>>
>>> # Create summary plot
>>> dfit.plot_summary()

Examples

>>> # Import libraries
>>> from distfit import distfit
>>> import numpy as np
>>>
>>> # Initialize without bootstraps
>>> dfit = distfit()
>>>
>>> # Random data
>>> # X = np.random.exponential(0.5, 10000)
>>> # X = np.random.uniform(0, 1000, 10000)
>>> X = np.random.normal(163, 10, 10000)
>>>
>>> # Fit without bootstraps
>>> results = dfit.fit_transform(X)
>>>
>>> # Results are stored in summary
>>> dfit.summary[['name', 'score', 'bootstrap_score', 'bootstrap_pass']]
>>>
>>> # Create summary plot (no bootstrap is present)
>>> dfit.plot_summary()
>>>
>>> results = dfit.bootstrap(X, n_boots=100)
>>>
>>> # Create summary plot (the bootstrap is automatically added to the plot)
>>> dfit.plot_summary()
density(X, bins='auto', mhist='numpy')

Compute density based on input data and number of bins.

Parameters:
  • X (array-like) – Set of values belonging to the data

  • bins (int, default: 'auto') –

    Bin size to determine the empirical histogram.
    • ’auto’: Determine the bin size automatically.

    • 50: Set specific bin size

  • mhist (str, (default: 'numpy')) –

    The density extraction method.
    • ’numpy’

    • ’seaborn’

Returns:

  • binedges (array-like) – Array with the bin edges.

  • histvals (array-like) – Array with the histogram density values.

Examples

>>> from distfit import distfit
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>>
>>> # Create dataset
>>> X = np.random.normal(0, 2, 1000)
>>>
>>> # Initialize
>>> dfit = distfit()
>>>
>>> # Compute bins and density
>>> bins, density = dfit.density(X)
>>>
>>> # Make plot
>>> plt.figure(); plt.plot(bins, density)
>>>
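For readers who want to see what a density computation of this kind involves, the sketch below uses numpy.histogram directly. It assumes the 'numpy' method amounts to a normalized histogram evaluated at the bin centers. That matches the equal-length arrays plotted in the example above, but it is an assumption rather than a statement about the library's internals.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(0, 2, 1000)

# density=True rescales the counts so that the bars integrate to 1.
histvals, edges = np.histogram(X, bins="auto", density=True)

# Use bin centers so the x and y arrays have the same length for plotting.
centers = (edges[:-1] + edges[1:]) / 2

# Sanity check: bar height times bar width sums to one.
area = float(np.sum(histvals * np.diff(edges)))
print(len(centers), len(histvals), area)
```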
fit(verbose=None)

Collect the required distribution functions.

Returns:

  • Object.

  • self.distributions (functions) – list of functions containing distributions.

fit_transform(X, n_boots=None, verbose=None)

Fit best scoring theoretical distribution to the empirical data (X).

Parameters:
  • X (array-like) – Set of values belonging to the data

  • verbose ([str, int], default is 'info' or 20) –

    Set the verbose messages using string or integer values.
    • 0, 60, None, ‘silent’, ‘off’, ‘no’: No message.

    • 10, ‘debug’: Messages from debug level and higher.

    • 20, ‘info’: Messages from info level and higher.

    • 30, ‘warning’: Messages from warning level and higher.

    • 50, ‘critical’: Messages from critical level and higher.

Returns:

  • dict.

  • model (dict) –

    dict containing keys with distribution parameters.
    • score: scoring statistic

    • name: distribution name

    • distr: distribution function

    • params: all kinds of parameters

    • loc: loc function parameter

    • scale: scale function parameter

    • arg: arg function parameter

  • summary (list) – Residual Sum of Squares

  • histdata (tuple (observed, bins)) – tuple containing observed and bins for data X in the histogram.

  • size (int) – total number of elements in data X

Examples

>>> from distfit import distfit
>>> import numpy as np
>>>
>>> # Create dataset
>>> X = np.random.normal(0, 2, 1000)
>>> y = [-8,-6,0,1,2,3,4,5,6]
>>>
>>> # Default method is parametric.
>>> dfit = distfit()
>>>
>>> # In case of quantile
>>> dfit = distfit(method='quantile')
>>>
>>> # In case of percentile
>>> dfit = distfit(method='percentile')
>>>
>>> # Fit using method
>>> model_results = dfit.fit_transform(X)
>>>
>>> dfit.plot()
>>>
>>> # Make prediction
>>> results = dfit.predict(y)
>>>
>>> # Plot results with CII and predictions.
>>> dfit.plot()
>>>
generate(n, random_state=None, verbose=None)

Generate synthetic data based on the fitted distribution.

Parameters:
  • n (int) – Number of samples to generate.

  • random_state (int, optional) – Random state.

  • verbose ([str, int], default is 'info' or 20) –

    Set the verbose messages using string or integer values.
    • 0, 60, None, ‘silent’, ‘off’, ‘no’: No message.

    • 10, ‘debug’: Messages from debug level and higher.

    • 20, ‘info’: Messages from info level and higher.

    • 30, ‘warning’: Messages from warning level and higher.

    • 50, ‘critical’: Messages from critical level and higher.

Returns:

X – Numpy array with generated data.

Return type:

np.array

Examples

>>> from distfit import distfit
>>> import numpy as np
>>>
>>> # Create dataset
>>> X = np.random.normal(0, 2, 1000)
>>> y = [-8,-6,0,1,2,3,4,5,6]
>>>
>>> # Initialize
>>> dfit = distfit()
>>> # Fit
>>> dfit.fit_transform(X)
>>>
>>> # Create synthetic data using fitted distribution.
>>> Xnew = dfit.generate(10)
>>>
get_distributions(distr='full')

Return the distributions.

Parameters:

distr (str) –

Distributions to return.
  • ’full’: all available distributions.

  • ’popular’ : [norm, expon, pareto, dweibull, t, genextreme, gamma, lognorm, beta, uniform, loggamma]

  • ’norm’, ‘t’, ‘k’ or any other distribution name.

  • [‘norm’, ‘t’, ‘k’]: list of distributions.

Return type:

List with distributions.

import_example(data='gas_spot_price')

Import an example dataset directly from the GitHub source.

Parameters:

data (str) –

  • ‘gas_spot_price’

  • ’tips’

  • ’occupancy’

Returns:

DataFrame that contains the data.

Return type:

pd.DataFrame

lineplot(X, labels=None, projection=True, xlabel='x-axes', ylabel='y-axes', title='', fontsize=16, figsize=(25, 12), xlim=None, ylim=None, fig=None, ax=None, grid=True, cii_properties={'alpha': 0.7, 'linewidth': 1}, line_properties={'color': '#004481', 'linestyle': '-', 'linewidth': 1, 'marker': '.', 'markersize': 10}, verbose=None)

Plot data and CII and/or predictions.

Parameters:
  • X (array-like) – The Null distribution or background data is built from X. The x-axis shows the index values, and the y-axis the corresponding values.

  • labels (array-like) – Labels for the x-axes. Should be the same size as X.

  • projection (bool (default: True)) – Projection of the distribution.

  • xlabel (string (default: 'x-axes')) – Label of the x-axis.

  • ylabel (string (default: 'y-axes')) – Label of the y-axis.

  • title (String, optional (default: '')) – Title of the plot.

  • fontsize (int, (default: 16)) – Fontsize for the axis and ticks.

  • figsize (tuple, optional (default: (25, 12))) – The figure size.

  • xlim (tuple, optional (default: None)) – Limit figure in x-axis: [0, 100]

  • ylim (tuple, optional (default: None)) – Limit figure in y-axis: [0, 10]

  • fig (Figure, optional (default: None)) – Matplotlib figure (Note - ignored when method is discrete)

  • ax (AxesSubplot, optional (default: None)) – Matplotlib Axes object. If given, this subplot is used to plot in instead of a new figure being created.

  • grid (Bool, optional (default: True)) – Show the grid on the figure.

  • cii_properties (dict) –

    Properties of the confidence interval (CII) lines.
    • None: Do not plot.

    • {‘color’: ‘#C41E3A’, ‘linewidth’: 3, ‘linestyle’: ‘dashed’, ‘marker’: ‘x’, ‘size’: 20, ‘color_sign_multipletest’: ‘g’, ‘color_sign’: ‘g’, ‘color_general’: ‘r’}

  • line_properties (dict) –

    Properties of the line. Set one or multiple properties.
    • {‘linestyle’: ‘-’, ‘color’: ‘#004481’, ‘marker’: ‘.’, ‘linewidth’: 1, ‘markersize’: 10}

    • {‘color’: ‘#000000’}

    • {‘color’: ‘#000000’, ‘marker’: ‘’}

  • verbose ([str, int], default is 'info' or 20) –

    Set the verbose messages using string or integer values.
    • 0, 60, None, ‘silent’, ‘off’, ‘no’: No message.

    • 10, ‘debug’: Messages from debug level and higher.

    • 20, ‘info’: Messages from info level and higher.

    • 30, ‘warning’: Messages from warning level and higher.

    • 50, ‘critical’: Messages from critical level and higher.

Return type:

tuple (fig, ax)

Examples

>>> from distfit import distfit
>>> import numpy as np
>>>
>>> # Create dataset
>>> X = np.random.normal(0, 2, 1000)
>>>
>>> # Initialize
>>> dfit = distfit()
>>>
>>> # Fit
>>> dfit.fit_transform(X)
>>>
>>> # Make line plot
>>> dfit.lineplot(X)
>>>
>>> # Make line plot with predictions
>>> dfit.predict([0, 1, 2, 3, 4, 5])
>>> dfit.lineplot(X)
load(filepath)

Load learned model.

Parameters:
  • filepath (str) – Pathname to stored pickle files.

  • verbose (int, optional) – Show message. A higher number gives more information. The default is 3.

Return type:

Object.

plot(chart='pdf', n_top=1, title='', emp_properties={'color': '#000000', 'linestyle': '-', 'linewidth': 3}, pdf_properties={'color': '#880808', 'linestyle': '-', 'linewidth': 3}, bar_properties={'align': 'center', 'color': '#607B8B', 'edgecolor': '#5A5A5A', 'linewidth': 1}, cii_properties={'color': '#C41E3A', 'color_general': 'r', 'color_sign': 'g', 'color_sign_multipletest': 'g', 'linestyle': 'dashed', 'linewidth': 3, 'marker': 'x', 'size': 20}, fontsize=16, xlabel='Values', ylabel='Frequency', figsize=(20, 15), xlim=None, ylim=None, fig=None, ax=None, grid=True, cmap=None, verbose=None)

Make plot.

Parameters:
  • chart (str, default: 'pdf') –

    Chart to plot.
    • ’pdf’: Probability density function.

    • ’cdf’: Cumulative distribution function.

  • n_top (int, optional) – Show the top number of results. The default is 1.

  • title (String, optional (default: '')) – Title of the plot.

  • emp_properties (dict) –

    The line properties of the empirical line.
    • None: Do not plot.

    • {‘color’: ‘#000000’, ‘linewidth’: 3, ‘linestyle’: ‘-‘}

  • pdf_properties (dict) –

    The line properties of the PDF or the CDF.
    • None: Do not plot.

    • {‘color’: ‘#880808’, ‘linewidth’: 3, ‘linestyle’: ‘-‘}

  • bar_properties (dict) –

    bar properties of the histogram.
    • None: Do not plot.

    • {‘color’: ‘#607B8B’, ‘linewidth’: 1, ‘edgecolor’: ‘#5A5A5A’, ‘align’: ‘center’}

  • cii_properties (dict) –

    Properties of the confidence interval (CII) lines.
    • None: Do not plot.

    • {‘color’: ‘#C41E3A’, ‘linewidth’: 3, ‘linestyle’: ‘dashed’, ‘marker’: ‘x’, ‘size’: 20, ‘color_sign_multipletest’: ‘g’, ‘color_sign’: ‘g’, ‘color_general’: ‘r’}

  • fontsize (int, (default: 16)) – Fontsize for the axis and ticks.

  • xlabel (String, (default: 'Values')) – Label for the x-axis.

  • ylabel (String, (default: 'Frequency')) – Label for the y-axis.

  • figsize (tuple, optional (default: (20, 15))) – The figure size.

  • xlim (Float, optional (default: None)) – Limit figure in x-axis.

  • ylim (Float, optional (default: None)) – Limit figure in y-axis.

  • fig (Figure, optional (default: None)) – Matplotlib figure (Note - ignored when method is discrete)

  • ax (Axes, optional (default: None)) – Matplotlib Axes object (Note - ignored when method is discrete)

  • grid (Bool, optional (default: True)) – Show the grid on the figure.

  • cmap (String, optional (default: None)) – Colormap used when plotting multiple CDFs. The colors in use are stored in dfit.summary[‘colors’]; when cmap is set, the specified colormap is used instead.

  • verbose ([str, int], default is 'info' or 20) –

    Set the verbose messages using string or integer values.
    • 0, 60, None, ‘silent’, ‘off’, ‘no’: No message.

    • 10, ‘debug’: Messages from debug level and higher.

    • 20, ‘info’: Messages from info level and higher.

    • 30, ‘warning’: Messages from warning level and higher.

    • 50, ‘critical’: Messages from critical level and higher.

Return type:

tuple (fig, ax)

Examples

>>> from distfit import distfit
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>>
>>> # Create dataset
>>> X = np.random.normal(0, 2, 10000)
>>> y = [-8,-6,0,1,2,3,4,5,6]
>>>
>>> # Initialize
>>> dfit = distfit(alpha=0.01)
>>> dfit.fit_transform(X)
>>> dfit.predict(y)
>>>
>>> # Plot separately
>>> fig, ax = dfit.plot(chart='pdf')
>>> fig, ax = dfit.plot(chart='cdf')
>>>
>>> # Change or remove properties of the chart.
>>> dfit.plot(chart='pdf', pdf_properties={'color': 'r'}, cii_properties={'color': 'g'}, emp_properties=None, bar_properties=None)
>>> dfit.plot(chart='cdf', pdf_properties={'color': 'r'}, cii_properties={'color': 'g'}, emp_properties=None, bar_properties=None)
>>>
>>> # Create subplot
>>> fig, ax = plt.subplots(1,2, figsize=(25, 10))
>>> dfit.plot(chart='pdf', ax=ax[0])
>>> dfit.plot(chart='cdf', ax=ax[1])
>>>
>>> # Change or remove properties of the chart.
>>> fig, ax = dfit.plot(chart='pdf', pdf_properties={'color': 'r', 'linewidth': 3}, cii_properties={'color': 'r', 'linewidth': 3}, bar_properties={'color': '#1e3f5a'})
>>> dfit.plot(chart='cdf', n_top=10, pdf_properties={'color': 'r'}, cii_properties=None, bar_properties=None, ax=ax)
plot_cdf(n_top=1, title='', figsize=(20, 15), xlabel='Values', ylabel='Frequency', fontsize=16, xlim=None, ylim=None, fig=None, ax=None, grid=True, emp_properties={'color': '#000000', 'linestyle': '-', 'linewidth': 1.3}, cdf_properties={'color': '#004481', 'linestyle': '-', 'linewidth': 2}, cii_properties={'color': '#880808', 'color_general': 'r', 'color_sign': 'g', 'color_sign_multipletest': 'g', 'linestyle': 'dashed', 'linewidth': 2, 'marker': 'x', 'size': 20}, cmap=None, verbose=None)

Plot CDF results.

Parameters:
  • n_top (int, optional) – Show the top number of results. The default is 1.

  • title (String, optional (default: '')) – Title of the plot.

  • xlabel (string (default: 'Values')) – Label of the x-axis.

  • ylabel (string (default: 'Frequency')) – Label of the y-axis.

  • figsize (tuple, optional (default: (20, 15))) – The figure size.

  • xlim (tuple, optional (default: None)) – Limit figure in x-axis: [0, 100]

  • ylim (tuple, optional (default: None)) – Limit figure in y-axis: [0, 10]

  • fig (Figure, optional (default: None)) – Matplotlib figure (Note - ignored when method is discrete)

  • ax (Axes, optional (default: None)) – Matplotlib Axes object (Note - ignored when method is discrete)

  • grid (Bool, optional (default: True)) – Show the grid on the figure.

  • emp_properties (dict) –

    The line properties of the empirical line.
    • None: Do not plot.

    • {‘color’: ‘#000000’, ‘linewidth’: 1.3, ‘linestyle’: ‘-‘}: default

  • cdf_properties (dict) –

    The line properties of the CDF.
    • None: Do not plot.

    • {‘color’: ‘#004481’, ‘linewidth’: 2, ‘linestyle’: ‘-‘}: default

  • cmap (String, optional (default: None)) – Colormap used when plotting multiple CDFs. The colors in use are stored in dfit.summary[‘colors’]; when cmap is set, the specified colormap is used instead.

  • verbose ([str, int], default is 'info' or 20) –

    Set the verbose messages using string or integer values.
    • 0, 60, None, ‘silent’, ‘off’, ‘no’: No message.

    • 10, ‘debug’: Messages from debug level and higher.

    • 20, ‘info’: Messages from info level and higher.

    • 30, ‘warning’: Messages from warning level and higher.

    • 50, ‘critical’: Messages from critical level and higher.

Return type:

tuple (fig, ax)

Examples

>>> from distfit import distfit
>>> import numpy as np
>>>
>>> # Create dataset
>>> X = np.random.normal(0, 2, 1000)
>>>
>>> # Initialize
>>> dfit = distfit()
>>>
>>> # Fit
>>> dfit.fit_transform(X)
>>>
>>> # Make CDF plot
>>> fig, ax = dfit.plot(chart='cdf')
>>>
>>> # Append the PDF plot
>>> dfit.plot(chart='pdf', fig=fig, ax=ax)
>>>
>>> # Plot the CDF of the top 10 fitted distributions.
>>> fig, ax = dfit.plot(chart='cdf', n_top=10)
>>> # Append the PDF plot
>>> dfit.plot(chart='pdf', n_top=10, fig=fig, ax=ax)
>>>
plot_summary(n_top=None, color_axes_left='#0000FF', color_axes_right='#FC6600', title=None, rotation=45, fontsize=16, grid=True, ylim=[None, None], figsize=(20, 10), fig=None, ax=None, verbose=None)

Plot summary results.

Parameters:
  • n_top (int, optional) – Show the top number of results. The default is None.

  • figsize (tuple, optional (default: (20, 10))) – The figure size.

  • color_axes_left (str, (default: '#0000FF')) – Hex color of goodness of fit axes (left axes).

  • color_axes_right (str, (default: '#FC6600')) – Hex color of bootstrap axes (right axes).

  • title (String, optional (default: None)) – Title of the plot.

  • grid (Bool, optional (default: True)) – Show the grid on the figure.

  • fig (Figure, optional (default: None)) – Matplotlib figure

  • ylim (Float, optional (default: [None, None])) – Limit figure in y-axis.

  • ax (Axes, optional (default: None)) – Matplotlib Axes object

  • verbose ([str, int], default is 'info' or 20) –

    Set the verbose messages using string or integer values.
    • 0, 60, None, ‘silent’, ‘off’, ‘no’: No message.

    • 10, ‘debug’: Messages from debug level and higher.

    • 20, ‘info’: Messages from info level and higher.

    • 30, ‘warning’: Messages from warning level and higher.

    • 50, ‘critical’: Messages from critical level and higher.

Return type:

tuple (fig, ax)

predict(y, alpha: float = None, multtest: str = 'fdr_bh', todf: bool = True, verbose: [<class 'str'>, <class 'int'>] = None)

Compute probability for response variables y, using the specified method.

Computes P-values for [y] based on the fitted distribution from X. The empirical distribution of X is used to estimate the loc/scale/arg parameters for a theoretical distribution in case method type is parametric.

Parameters:
  • y (array-like) – Values to be predicted.

  • multtest (str, default: 'fdr_bh') –

    Multiple test correction.
    • None

    • ’bonferroni’

    • ’sidak’

    • ’holm-sidak’

    • ’holm’

    • ’simes-hochberg’

    • ’hommel’

    • ’fdr_bh’

    • ’fdr_by’

    • ’fdr_tsbh’

    • ’fdr_tsbky’

  • alpha (float, default: None) – Significance alpha is inherited from self if None.

  • todf (Bool (default: True)) – Output results in pandas dataframe when True. Note that creating pandas dataframes makes the code run significantly slower!

  • verbose ([str, int], default is 'info' or 20) –

    Set the verbose messages using string or integer values.
    • 0, 60, None, ‘silent’, ‘off’, ‘no’: No message.

    • 10, ‘debug’: Messages from debug level and higher.

    • 20, ‘info’: Messages from info level and higher.

    • 30, ‘warning’: Messages from warning level and higher.

    • 50, ‘critical’: Messages from critical level and higher.

Returns:

  • Object.

  • y_pred (list of str) – prediction of bounds [upper, lower] for input y, using the fitted distribution X.

  • y_proba (list of float) – probability for response variable y.

  • df (pd.DataFrame (only when set: todf=True)) – Dataframe containing the predictions in a structured manner.
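As background for the multtest options, the following is a plain-Python sketch of the Benjamini-Hochberg procedure, which is the method usually meant by 'fdr_bh'. It only illustrates the correction; how predict applies it internally is not shown in this reference, and the function name and example p-values are made up for the illustration.

```python
def fdr_bh(pvalues, alpha=0.05):
    """Benjamini-Hochberg adjusted p-values and rejection flags."""
    m = len(pvalues)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    reject = [p <= alpha for p in adjusted]
    return adjusted, reject


# Hypothetical raw p-values for five predicted samples.
pvalues = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted, reject = fdr_bh(pvalues)
print(adjusted)
print(reject)
```

Two of the raw p-values (0.039 and 0.041) fall below 0.05 but are no longer significant after correction, which is the behaviour to expect when many values are tested at once.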

Examples

>>> from distfit import distfit
>>> import numpy as np
>>>
>>> # Create dataset
>>> X = np.random.normal(0, 2, 1000)
>>> y = [-8,-6,0,1,2,3,4,5,6]
>>>
>>> # Initialize
>>> dfit = distfit(todf=True)
>>> # Fit
>>> model_results = dfit.fit_transform(X)
>>>
>>> # Make predictions
>>> results = dfit.predict(y)
>>> print(results['df'])
>>>
>>> # Plot results with CII and predictions.
>>> dfit.plot()
>>>
qqplot(X, line='45', n_top=1, title='QQ-plot', fontsize=16, figsize=(20, 15), xlim=None, ylim=None, fig=None, ax=None, grid=True, alpha=0.5, size=15, cmap=None, verbose=None)

Plot QQplot results.

Parameters:
  • X (array-like) – The Null distribution or background data is built from X.

  • line (str, default: '45') –

    Options for the reference line to which the data is compared.
    • ’45’ - 45-degree line

    • ’s’ - standardized line, the expected order statistics are scaled by the standard deviation of the given sample and have the mean added to them.

    • ’r’ - A regression line is fit

    • ’q’ - A line is fit through the quartiles.

    • None - no reference line is added to the plot.

  • n_top (int, optional) – Show the top number of results. The default is 1.

  • title (String, optional (default: 'QQ-plot')) – Title of the plot.

  • fontsize (int, (default: 16)) – Fontsize for the axis and ticks.

  • figsize (tuple, optional (default: (20, 15))) – The figure size.

  • xlim (Float, optional (default: None)) – Limit figure in x-axis.

  • ylim (Float, optional (default: None)) – Limit figure in y-axis.

  • fig (Figure, optional (default: None)) – Matplotlib figure (Note - ignored when method is discrete)

  • ax (AxesSubplot, optional (default: None)) – Matplotlib Axes object. If given, this subplot is used to plot in instead of a new figure being created.

  • grid (Bool, optional (default: True)) – Show the grid on the figure.

  • cmap (String, optional (default: None)) – Colormap used when plotting multiple fitted distributions. The colors in use are stored in dfit.summary[‘colors’]; when cmap is set, the specified colormap is used instead.

  • verbose ([str, int], default is 'info' or 20) –

    Set the verbose messages using string or integer values.
    • 0, 60, None, ‘silent’, ‘off’, ‘no’: No message.

    • 10, ‘debug’: Messages from debug level and higher.

    • 20, ‘info’: Messages from info level and higher.

    • 30, ‘warning’: Messages from warning level and higher.

    • 50, ‘critical’: Messages from critical level and higher.

Return type:

tuple (fig, ax)

Examples

>>> from distfit import distfit
>>> import numpy as np
>>>
>>> # Create dataset
>>> X = np.random.normal(0, 2, 1000)
>>>
>>> # Initialize
>>> dfit = distfit()
>>>
>>> # Fit
>>> dfit.fit_transform(X)
>>>
>>> # Make qq-plot
>>> dfit.qqplot(X)
>>>
>>> # Make qq-plot for top 10 best fitted models.
>>> dfit.qqplot(X, n_top=10)
>>>
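The points behind a QQ-plot can be computed without any plotting library. The sketch below pairs the sorted sample with the quantiles of a fitted normal distribution. The plotting positions (i + 0.5) / n and the helper name qq_points are assumptions for illustration, and the library may use different conventions.

```python
import random
from statistics import NormalDist


def qq_points(X, dist):
    """Pair theoretical quantiles of `dist` with the sorted sample values."""
    xs = sorted(X)
    n = len(xs)
    # Plotting positions (i + 0.5) / n keep the probabilities strictly inside (0, 1).
    theoretical = [dist.inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, xs


rng = random.Random(0)
X = [rng.gauss(0, 2) for _ in range(1000)]

fitted = NormalDist.from_samples(X)
theoretical, observed = qq_points(X, fitted)

# For a good fit the points lie near the 45-degree line, so the mean gap is small.
mean_gap = sum(abs(t - o) for t, o in zip(theoretical, observed)) / len(X)
print(mean_gap)
```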
save(filepath, overwrite=True)

Save learned model in pickle file.

Parameters:
  • filepath (str) – Pathname to store pickle files.

  • verbose (int, optional) – Show message. A higher number gives more information. The default is 3.

Return type:

object
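For orientation, a pickle round trip with an overwrite guard can be sketched with the standard library as follows. The helper names save_model and load_model and the example dictionary are hypothetical, and the real save and load methods may store more than this.

```python
import os
import pickle
import tempfile


def save_model(filepath, model, overwrite=True):
    """Write `model` to `filepath`; refuse to replace a file unless overwrite is True."""
    if os.path.exists(filepath) and not overwrite:
        return False
    with open(filepath, "wb") as fh:
        pickle.dump(model, fh)
    return True


def load_model(filepath):
    """Read a previously pickled model back from disk."""
    with open(filepath, "rb") as fh:
        return pickle.load(fh)


# Hypothetical fitted parameters standing in for a learned model.
model = {"name": "norm", "loc": 0.0, "scale": 2.0}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "model.pkl")
    saved = save_model(path, model)
    restored = load_model(path)
    # A second save with overwrite=False leaves the existing file untouched.
    blocked = save_model(path, model, overwrite=False)

print(saved, restored, blocked)
```

Pickle files can execute code when loaded, so only load model files from sources you trust.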

transform(X, verbose=None)

Determine best model for input data X.

The input data X can be modeled in two ways:

parametric

In the parametric case, the best fit on the data is determined using a scoring statistic, such as the Residual Sum of Squares (RSS), for the specified distributions. Based on the best distribution fit, the confidence intervals (CII) can be determined for later usage in the predict() function.

quantile

In the quantile case, the data is ranked and the top/lower quantiles are determined.
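A minimal sketch of the quantile idea, assuming that alpha is applied to each tail separately and that values are labelled relative to the resulting cut-offs. The function names and the 'down' / 'up' / 'none' labels are illustrative choices, not confirmed library behaviour.

```python
import numpy as np


def quantile_bounds(X, alpha=0.05):
    """Lower and upper cut-offs taken directly from the ranked data."""
    lower, upper = np.quantile(X, [alpha, 1 - alpha])
    return float(lower), float(upper)


def label_values(y, lower, upper):
    """Label each value as 'down', 'up', or 'none' relative to the cut-offs."""
    labels = []
    for value in y:
        if value < lower:
            labels.append("down")
        elif value > upper:
            labels.append("up")
        else:
            labels.append("none")
    return labels


# Background data: the integers 1..100.
X = np.arange(1, 101)
lower, upper = quantile_bounds(X)
labels = label_values([0, 50, 100], lower, upper)
print(lower, upper, labels)
```

No distribution is fitted here: the cut-offs come from the ranked data alone, which is what makes the approach non-parametric.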

Parameters:
  • X (array-like) – The Null distribution or background data is built from X.

  • verbose ([str, int], default is 'info' or 20) –

    Set the verbose messages using string or integer values.
    • 0, 60, None, ‘silent’, ‘off’, ‘no’: No message.

    • 10, ‘debug’: Messages from debug level and higher.

    • 20, ‘info’: Messages from info level and higher.

    • 30, ‘warning’: Messages from warning level and higher.

    • 50, ‘critical’: Messages from critical level and higher.

Returns:

  • Object.

  • model (dict) –

    dict containing keys with distribution parameters.
    • score: scoring statistic

    • name: distribution name

    • distr: distribution function

    • params: all kinds of parameters

    • loc: loc function parameter

    • scale: scale function parameter

    • arg: arg function parameter

  • summary (list) – Residual Sum of Squares

  • histdata (tuple (observed, bins)) – tuple containing observed and bins for data X in the histogram.

  • size (int) – total number of elements in data X

distfit.distfit.fit_binom(X)

Transform array of samples (nonnegative ints) to histogram.

distfit.distfit.fit_transform_binom(X, f=1.5, weighted=True, stats='RSS')

Convert array of samples (nonnegative ints) to histogram and fit.

distfit.distfit.get_logger()
distfit.distfit.get_ppf(self, model, bound, alpha, logger=None)
class distfit.distfit.k_distribution(loc=None, scale=None)

K-Distribution.

fit()

Fit for K-distribution.

Parameters:

X (Vector) – Numpy array containing data in vector form.

Returns:

  • loc (Loc parameter)

  • scale (Scale parameter)

References

    1. Rangaswamy M, Weiner D, Ozturk A. Computer generation of correlated non-Gaussian radar clutter[J]. IEEE Transactions on Aerospace and Electronic Systems, 1995, 31(1): 106-116.

    2. Lamont-Smith T. Translation to the normal distribution for radar clutter[J]. IEE Proceedings-Radar, Sonar and Navigation, 2000, 147(1): 17-22.

    3. https://en.wikipedia.org/wiki/K-distribution

    4. Redding N J. Estimating the parameters of the K distribution in the intensity domain[J]. 1999.

name()

Name of distribution.

pdf(loc, scale)

Compute Probability Density Distribution.

distfit.distfit.plot_binom(self, emp_properties={}, pdf_properties={}, bar_properties={}, cii_properties={}, fontsize=16, xlabel='Values', ylabel='Frequency', title='', figsize=(20, 15), xlim=None, ylim=None, grid=True)

Plot discrete results.

Parameters:

model (dict) – Results derived from the fit_transform function.

distfit.distfit.scale_data(y)
distfit.distfit.scale_data_minmax(X, minvalue, maxvalue)
distfit.distfit.set_colors(df, cmap='Set1')

Set colors.

Parameters:
  • df (DataFrame) – DataFrame.

  • cmap (str, default: 'Set1') – Set the colormap.

Returns:

df – DataFrame.

Return type:

DataFrame

distfit.distfit.set_logger(verbose: [<class 'str'>, <class 'int'>] = 'info')

Set the logger for verbosity messages.

Parameters:

verbose ([str, int], default is 'info' or 20) –

Set the verbose messages using string or integer values.
  • 0, 60, None, ‘silent’, ‘off’, ‘no’: No message.

  • 10, ‘debug’: Messages from debug level and higher.

  • 20, ‘info’: Messages from info level and higher.

  • 30, ‘warning’: Messages from warning level and higher.

  • 50, ‘critical’: Messages from critical level and higher.

Return type:

None.

Examples

>>> # Set the logger to warning
>>> set_logger(verbose='warning')
>>>
>>> # Test with different messages
>>> logger.debug("Hello debug")
>>> logger.info("Hello info")
>>> logger.warning("Hello warning")
>>> logger.critical("Hello critical")
>>>
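The verbosity values listed above line up with Python's standard logging levels. The sketch below shows one way such a mapping could work using only the standard library. The LEVELS table, the to_level helper, and the logger name are assumptions for illustration and are not taken from the distfit source.

```python
import logging

# Assumed mapping from the documented string values to numeric levels.
LEVELS = {"debug": 10, "info": 20, "warning": 30, "critical": 50,
          "silent": 60, "off": 60, "no": 60}


def to_level(verbose):
    """Translate a string or integer verbosity into a numeric logging level."""
    if verbose is None:
        return 60
    if isinstance(verbose, str):
        return LEVELS[verbose.lower()]
    return 60 if verbose == 0 else int(verbose)


logger = logging.getLogger("verbosity_sketch")

# At 'warning', info messages are suppressed and warnings pass through.
logger.setLevel(to_level("warning"))
info_enabled = logger.isEnabledFor(logging.INFO)
warning_enabled = logger.isEnabledFor(logging.WARNING)

# Level 60 sits above CRITICAL (50), so nothing is emitted.
logger.setLevel(to_level("silent"))
critical_enabled = logger.isEnabledFor(logging.CRITICAL)
print(info_enabled, warning_enabled, critical_enabled)
```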
distfit.distfit.smoothline(xs, ys=None, interpol=3, window=1, verbose=None)

Smoothing 1D vector.

Smoothing a 1D vector can be challenging if the data is sparsely sampled. This smoothing function therefore contains two steps: first an interpolation of the input line, followed by a convolution.

Parameters:
  • xs (array-like) – Data points for the x-axis.

  • ys (array-like) – Data points for the y-axis.

  • interpol (int, (default : 3)) – The interpolation factor. The data is interpolated by a factor n before the smoothing step.

  • window (int, (default : 1)) – Smoothing window that is used to create the convolution and gradually smoothen the line.

  • verbose (int [1-5], default: 3) – Print information to screen. A higher number will print more.

Returns:

  • xnew (array-like) – Data points for the x-axis.

  • ynew (array-like) – Data points for the y-axis.
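The two steps, interpolation followed by convolution, can be sketched with NumPy as below. This is a simplified stand-in: it uses linear interpolation and a flat moving-average kernel, whereas the interpolation scheme and kernel used by smoothline are not specified in this reference. The function name smooth_sketch is hypothetical.

```python
import numpy as np


def smooth_sketch(xs, ys, interpol=3, window=1):
    """Interpolate onto a denser grid, then smooth with a moving average."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    # Step 1: linear interpolation onto `interpol` times as many points.
    # np.interp expects xs to be sorted in increasing order.
    xnew = np.linspace(xs.min(), xs.max(), len(xs) * interpol)
    ynew = np.interp(xnew, xs, ys)
    # Step 2: convolve with a flat kernel; mode="same" keeps the length
    # but pads with zeros, so values near the edges are pulled down.
    if window > 1:
        kernel = np.ones(window) / window
        ynew = np.convolve(ynew, kernel, mode="same")
    return xnew, ynew


xs = np.arange(10)
ys = xs ** 2

xnew, yraw = smooth_sketch(xs, ys, interpol=3, window=1)
_, ysmooth = smooth_sketch(xs, ys, interpol=3, window=5)
print(len(xnew), yraw[0], yraw[-1], len(ysmooth))
```

With window=1 the convolution is skipped and the output is the interpolated line itself. Larger windows average over more neighbouring points and flatten the curve further.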

distfit.distfit.transform_binom(hist, plot=True, weighted=True, f=1.5, stats='RSS')

Fit histogram to binomial distribution.

Parameters:
  • hist (array-like) – histogram as int array with counts, array index as bin.

  • weighted (Bool, (default: True)) – In principle, the best fit will be obtained if you set weighted=True. However, to use a different measure, such as the minimum residual sum of squares (RSS), as the metric, you can set weighted=False.

  • f (float, (default: 1.5)) – try to fit n in range n0/f to n0*f where n0 is the initial estimate.

Returns:

  • model (dict) –

    • distr (Object) – fitted binomial model.

    • name (String) – Name of the fitted distribution.

    • RSS (float) – Best RSS score.

    • n (int) – binomial n value.

    • p (float) – binomial p value.

    • chi2r (float) – rchi2: reduced chi-squared. This number should be around 1. Large values indicate a bad fit; small values indicate ‘too good to be true’ data.

  • figdata (dict) –

    • sses (array-like) – The computed RSS scores accompanying the various n.

    • Xdata (array-like) – Input data.

    • hist (array-like) – fitted histogram as int array, same length as hist.

    • Ydata (array-like) – Probability mass function.

    • nvals (array-like) – Evaluated n’s.
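The search over n in the range n0/f to n0*f can be illustrated with a small unweighted RSS fit in plain Python. This sketch makes several assumptions that are not confirmed by the reference: the initial estimate n0 comes from the method of moments, p is set to mean/n for every candidate n, and the names binom_pmf and fit_binom_sketch are hypothetical.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability mass at k for n trials and success probability p."""
    if k > n:
        return 0.0
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def fit_binom_sketch(hist, f=1.5):
    """Scan n in [n0/f, n0*f] and keep the n with the lowest RSS (mean must be > 0)."""
    total = sum(hist)
    freqs = [count / total for count in hist]
    mean = sum(k * q for k, q in enumerate(freqs))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(freqs))
    kmax = max(k for k, count in enumerate(hist) if count > 0)
    # Method-of-moments start: var = n*p*(1-p) and mean = n*p give p0 and n0.
    p0 = 1 - var / mean
    n0 = round(mean / p0) if p0 > 0 else kmax
    # n can never be smaller than the largest observed value.
    n_low = max(kmax, math.floor(n0 / f), 1)
    n_high = max(n_low, math.ceil(n0 * f))
    best = None
    for n in range(n_low, n_high + 1):
        p = mean / n
        rss = sum((q - binom_pmf(k, n, p)) ** 2 for k, q in enumerate(freqs))
        if best is None or rss < best["RSS"]:
            best = {"n": n, "p": p, "RSS": rss}
    return best


# Build a histogram from an exact binomial(n=10, p=0.3); the index is the bin.
hist = [round(1_000_000 * binom_pmf(k, 10, 0.3)) for k in range(11)]
best = fit_binom_sketch(hist)
print(best)
```

Because the histogram is generated from an exact binomial, the scan recovers n = 10 and a p close to 0.3. With real data the RSS curve over n is typically flatter, which is why a bounded search range controlled by f is useful.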