Abstract

Background

Probability distribution fitting is the fitting of a probability distribution to a series of repeated measurements of a variable phenomenon. A distribution with a close fit can be used for the purposes described under Use-cases.

One way to fit a probability distribution to data is a goodness-of-fit test, which compares the observed frequency (f) with the frequency expected under the model (f-hat) across any number of classes. distfit scores the goodness of fit with the Sum of Squared Errors (SSE, also expanded as sum of squared estimate of errors), which is also called the Residual Sum of Squares (RSS).

The RSS measures how far the values predicted by the model deviate from the empirical values of the data, in other words the size of the estimation errors. It is a measure of the discrepancy between the data and an estimation model. A small RSS indicates a close fit of the model to the data. RSS is computed by:

$$\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2$$

where y_i is the i-th value of the variable to be predicted, x_i is the i-th value of the explanatory variable, f(x_i) is the predicted value of y_i (also termed y-hat), and n is the number of observations.
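To make the RSS concrete, the sketch below computes it for a histogram of sample data against two candidate models. It is an illustration only, not distfit's own code: the helper names `empirical_density` and `rss`, the bin count, and the two normal models are assumptions chosen for the example. Here y_i is the empirical density of bin i, x_i is the bin center, and f(x_i) is the model density at that center.

```python
import random
from statistics import NormalDist


def empirical_density(data, bins=20):
    """Return (bin centers, densities) of an equal-width histogram."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for value in data:
        # The maximum value would index one past the end, so clamp it.
        counts[min(int((value - lo) / width), bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    # Dividing by n * width makes the histogram integrate to 1, like a pdf.
    density = [c / (len(data) * width) for c in counts]
    return centers, density


def rss(y, y_hat):
    """Residual sum of squares between observed and predicted values."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]
x, y = empirical_density(data)

# Hypothetical candidate models: one near the truth, one shifted away.
close_model = NormalDist(mu=0.0, sigma=1.0)
poor_model = NormalDist(mu=3.0, sigma=1.0)

rss_close = rss(y, [close_model.pdf(xi) for xi in x])
rss_poor = rss(y, [poor_model.pdf(xi) for xi in x])
print(f"RSS close model: {rss_close:.4f}, RSS poor model: {rss_poor:.4f}")
```

The model whose density follows the histogram yields the smaller RSS, which is the sense in which a small RSS indicates a close fit.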

distfit is a Python package for probability density fitting: it fits 89 univariate distributions to non-censored data and ranks them by RSS. The best-fitting distribution is returned with its loc, scale, and arg parameters, which can then be used to compute probabilities for new data points.
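The selection step can be sketched as a loop over candidate distributions: fit each one, score it by RSS against the empirical density, and keep the lowest score. The code below is a minimal stdlib-only sketch under stated assumptions, not distfit's implementation: it uses three candidates instead of 89, closed-form parameter estimates, and hypothetical helper names (`fit_norm`, `fit_expon`, `fit_uniform`, `rank_by_rss`). The candidates here have no shape parameters, so only loc and scale are returned.

```python
import math
import random
from statistics import fmean, pstdev


def empirical_density(data, bins=25):
    """Return (bin centers, densities) of an equal-width histogram."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for value in data:
        counts[min(int((value - lo) / width), bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    return centers, [c / (len(data) * width) for c in counts]


def fit_norm(data):
    loc, scale = fmean(data), pstdev(data)

    def pdf(x):
        z = (x - loc) / scale
        return math.exp(-0.5 * z * z) / (scale * math.sqrt(2.0 * math.pi))

    return (loc, scale), pdf


def fit_expon(data):
    loc = min(data)
    scale = fmean(data) - loc

    def pdf(x):
        return math.exp(-(x - loc) / scale) / scale if x >= loc else 0.0

    return (loc, scale), pdf


def fit_uniform(data):
    loc = min(data)
    scale = max(data) - loc

    def pdf(x):
        return 1.0 / scale if loc <= x <= loc + scale else 0.0

    return (loc, scale), pdf


# Assumed candidate set for illustration only.
CANDIDATES = {"norm": fit_norm, "expon": fit_expon, "uniform": fit_uniform}


def rank_by_rss(data, bins=25):
    """Fit every candidate and return (rss, name, params), best first."""
    x, y = empirical_density(data, bins)
    results = []
    for name, fit in CANDIDATES.items():
        params, pdf = fit(data)
        score = sum((yi - pdf(xi)) ** 2 for xi, yi in zip(x, y))
        results.append((score, name, params))
    results.sort(key=lambda r: r[0])
    return results


rng = random.Random(1)  # fixed seed so the sketch is reproducible
data = [5.0 + rng.expovariate(0.5) for _ in range(3000)]  # loc=5, scale=2

ranking = rank_by_rss(data)
best_rss, best_name, (best_loc, best_scale) = ranking[0]
print(best_name, round(best_loc, 2), round(best_scale, 2))
```

In this sketch 3000 samples are summarized by a distribution name plus two numbers, and the returned density function can be evaluated at any new data point.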

Use-cases

The distfit function has many use-cases. The first is to determine the best theoretical distribution for your data, which can reduce tens of thousands of data points to 3 floating-point parameters. Another is outlier detection: a null distribution is determined from data in the normal state, and new data points that deviate significantly from it are marked as outliers and are potentially of interest. The null distribution can also be generated by randomization or permutation approaches, in which case new data points are marked if they deviate significantly from randomness.
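The outlier-detection use-case can be sketched as follows, assuming the normal state is well described by a normal distribution. This is an illustration with the standard library, not distfit's API: the helper `two_sided_p`, the significance level `alpha`, and the sample values are all assumptions made for the example, and the permutation-based variant is not shown.

```python
import random
from statistics import NormalDist

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Measurements taken during the normal state (simulated here).
normal_state = [rng.gauss(10.0, 2.0) for _ in range(5000)]

# Null distribution: a normal fitted to the normal-state data.
null_dist = NormalDist.from_samples(normal_state)


def two_sided_p(x, dist):
    """Probability under dist of a value at least as extreme as x."""
    lower_tail = dist.cdf(x)
    return 2.0 * min(lower_tail, 1.0 - lower_tail)


alpha = 0.01  # assumed significance level
new_points = [9.5, 10.8, 18.0, 1.0]
flags = [two_sided_p(x, null_dist) < alpha for x in new_points]

for x, is_outlier in zip(new_points, flags):
    print(x, "outlier" if is_outlier else "normal")
```

Points near the center of the null distribution get large p-values and pass, while points far in either tail fall below `alpha` and are marked. When many points are tested at once, a multiple-testing correction is usually worth considering.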