Quick start to find best fitting distribution
##################################################

Specify ``distfit`` parameters. In this example nothing is specified, so all parameters are set to their defaults.


Generate random data
*************************

.. code:: python

    from distfit import distfit
    import numpy as np

    # Example data
    X = np.random.normal(10, 3, 2000)
    y = [3,4,5,6,10,11,12,18,20]


Fit distributions
**********************************

A series of distributions is fitted to the empirical data, and for each one the residual sum of squares (RSS) is determined. The distribution with the lowest RSS is the best-fitting distribution.

.. code:: python

	# From the distfit library import the class distfit
	from distfit import distfit

	# Initialize
	dfit = distfit(todf=True)

	# Search for best theoretical fit on your empirical data
	results = dfit.fit_transform(X)

	# [distfit] >fit..
	# [distfit] >transform..
	# [distfit] >[norm      ] [0.00 sec] [RSS: 0.0036058] [loc=10.035 scale=2.947]
	# [distfit] >[expon     ] [0.00 sec] [RSS: 0.1821936] [loc=-0.496 scale=10.531]
	# [distfit] >[pareto    ] [0.12 sec] [RSS: 0.1821326] [loc=-699709.530 scale=699709.035]
	# [distfit] >[dweibull  ] [0.02 sec] [RSS: 0.0059431] [loc=10.001 scale=2.541]
	# [distfit] >[t         ] [0.09 sec] [RSS: 0.0036059] [loc=10.035 scale=2.947]
	# [distfit] >[genextreme] [0.27 sec] [RSS: 0.7053157] [loc=17.658 scale=2.731]
	# [distfit] >[gamma     ] [0.07 sec] [RSS: 0.0036036] [loc=-326.130 scale=0.026]
	# [distfit] >[lognorm   ] [0.15 sec] [RSS: 0.0036144] [loc=-187.018 scale=197.039]
	# [distfit] >[beta      ] [0.05 sec] [RSS: 0.0036176] [loc=-16.974 scale=51.538]
	# [distfit] >[uniform   ] [0.00 sec] [RSS: 0.1162497] [loc=-0.496 scale=19.280]
	# [distfit] >[loggamma  ] [0.07 sec] [RSS: 0.0036382] [loc=-493.477 scale=77.133]
	# [distfit] >Compute confidence interval [parametric]
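
To make the RSS criterion concrete, the sketch below compares a normalized histogram of the data with two candidate PDFs and picks the one with the lowest RSS. It is a minimal illustration using only ``numpy``, not the internal implementation of ``distfit``: the number of bins, the seed and the helper names ``normal_pdf`` and ``uniform_pdf`` are assumptions made for this example.

.. code:: python

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(10, 3, 2000)

    # Empirical density: normalized histogram evaluated at the bin centers
    # (50 bins is an arbitrary choice for this sketch)
    density, edges = np.histogram(X, bins=50, density=True)
    centers = (edges[:-1] + edges[1:]) / 2

    def normal_pdf(x, loc, scale):
        return np.exp(-0.5 * ((x - loc) / scale) ** 2) / (scale * np.sqrt(2 * np.pi))

    def uniform_pdf(x, loc, scale):
        return np.where((x >= loc) & (x <= loc + scale), 1.0 / scale, 0.0)

    # Maximum-likelihood parameters for both candidates
    pdf_norm = normal_pdf(centers, X.mean(), X.std())
    pdf_unif = uniform_pdf(centers, X.min(), X.max() - X.min())

    # Residual sum of squares between the empirical and theoretical density
    rss = {
        "norm": float(np.sum((density - pdf_norm) ** 2)),
        "uniform": float(np.sum((density - pdf_unif) ** 2)),
    }
    best = min(rss, key=rss.get)
    print(rss, best)

The candidate with the smallest value in ``rss`` is reported as the best fit, which mirrors how the log above should be read.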


Plot distribution fit
**********************************

.. code:: python

    # Plot
    dfit.plot()

.. |fig1a| image:: ../figs/example_fig1a.png
    :scale: 70%

.. table:: Distribution fit
   :align: center

   +---------+
   | |fig1a| |
   +---------+

Plot RSS
**********************************

Note that the best fit should be **normal**, as this was also the input data. However, many other distributions can look very similar with specific loc/scale parameters. It is not unusual to see the *gamma* and *beta* distributions score well, as these are the "Barbapapas" among the distributions: flexible enough to take on many shapes. Let's plot the summary of the detected distributions with the Residual Sum of Squares.

.. code:: python

    # Make plot
    dfit.plot_summary()

.. |fig1summary| image:: ../figs/fig1_summary.png
    :scale: 60%

.. table:: Summary of fitted theoretical Distributions
   :align: center

   +---------------+
   | |fig1summary| |
   +---------------+
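
The following sketch shows why a *gamma* distribution can score almost as well as the normal on this data. A gamma with a very large shape parameter, shifted and rescaled to the same mean and standard deviation, is nearly indistinguishable from the normal PDF. The shape value ``k`` is an arbitrary choice for illustration and is not taken from the fit above.

.. code:: python

    import math
    import numpy as np

    mu, sigma = 10.0, 3.0
    x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 401)

    normal_pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    # Gamma with a large shape k, matched to the same mean and std:
    # mean = loc + k * scale, std = sqrt(k) * scale
    k = 10_000.0
    scale = sigma / math.sqrt(k)
    loc = mu - k * scale

    # Evaluate the gamma PDF in log-space to avoid overflow for large k
    z = x - loc  # strictly positive on this grid
    log_pdf = (k - 1) * np.log(z) - z / scale - math.lgamma(k) - k * math.log(scale)
    gamma_pdf = np.exp(log_pdf)

    max_diff = float(np.max(np.abs(gamma_pdf - normal_pdf)))
    print(loc, scale, max_diff)

The resulting ``loc`` is strongly negative and ``scale`` is small, the same pattern as the gamma parameters in the fit log.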


Fit for one specific distribution
##########################################


Suppose you want to test for one specific distribution, such as the normal distribution. This can be done as follows:

.. code:: python

    # Create random data
    X = np.random.normal(10, 3, 2000)
    y = [3,4,5,6,10,11,12,18,20]

    # Initialize
    dfit = distfit(distr='norm')
    # Fit on data
    results = dfit.fit_transform(X)

    # [distfit] >fit..
    # [distfit] >transform..
    # [distfit] >[norm] [RSS: 0.0151267] [loc=0.103 scale=2.028]

    dfit.plot()


Fit for multiple distributions
######################################


Suppose you want to test multiple distributions:

.. code:: python

	# Create random data
	X = np.random.normal(10, 3, 2000)
	y = [3,4,5,6,10,11,12,18,20]

	# Initialize
	dfit = distfit(distr=['norm', 't', 'uniform'])
	# Fit on data
	results = dfit.fit_transform(X)

	# [distfit] >fit..
	# [distfit] >transform..
	# [distfit] >[norm   ] [0.00 sec] [RSS: 0.0012337] [loc=0.005 scale=1.982]
	# [distfit] >[t      ] [0.12 sec] [RSS: 0.0012336] [loc=0.005 scale=1.982]
	# [distfit] >[uniform] [0.00 sec] [RSS: 0.2505846] [loc=-6.583 scale=15.076]
	# [distfit] >Compute confidence interval [parametric]

	dfit.plot()


Make predictions
######################


The ``predict`` function computes the probability of samples under the fitted *PDF*.
Note that, because of the multiple testing correction, samples can be located
outside the confidence interval without being marked as significant. See section Algorithm -> Multiple testing for more information.
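
As a rough illustration of what such a probability means, the sketch below computes a tail probability for each new sample under a normal model. It assumes that the raw P-value is the probability mass in the tail on the side where the sample falls. The fixed parameters and the helper ``tail_probability`` are stand-ins for this example, not part of the ``distfit`` API.

.. code:: python

    from statistics import NormalDist

    # Stand-in for a fitted model (parameters are assumed, not fitted here)
    fitted = NormalDist(mu=10, sigma=3)
    y = [3, 4, 5, 6, 10, 11, 12, 18, 20]

    def tail_probability(value, dist):
        """Probability mass in the tail on the side where the value lies."""
        cdf = dist.cdf(value)
        return min(cdf, 1.0 - cdf)

    P = [tail_probability(v, fitted) for v in y]
    for v, p in zip(y, P):
        print(f"y={v:>2}  P={p:.5f}")

Samples near the center of the distribution get a probability close to 0.5, while samples far in either tail get a probability close to 0.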


Generate random data
*************************

.. code:: python

    # Example data
    X = np.random.normal(10, 3, 2000)
    y = [3,4,5,6,10,11,12,18,20]


Fit all distributions
**********************************

A series of distributions is fitted to the empirical data, and for each one the *RSS* is determined. The distribution with the lowest RSS is the best-fitting distribution.

.. code:: python

    # From the distfit library import the class distfit
    from distfit import distfit

    # Initialize
    dfit = distfit(todf=True)

    # Search for best theoretical fit on your empirical data
    dfit.fit_transform(X)

    # Make prediction on new datapoints based on the fit
    results = dfit.predict(y)


Plot predictions
**********************************

The best-fitting distribution is plotted over the empirical data together with its confidence intervals.

.. code:: python

    # The plot function will now also include the predictions of y
    dfit.plot()
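
The sketch below shows how confidence bounds can translate into the labels ``down``, ``up`` and ``none``. It uses a normal model with assumed parameters and assumes that ``alpha`` is applied to each tail separately. The helper ``label`` is hypothetical and only illustrates the idea.

.. code:: python

    from statistics import NormalDist

    # Stand-in for the best-fitting model (parameters are assumed)
    fitted = NormalDist(mu=10, sigma=3)
    alpha = 0.05

    # Assumption: alpha is applied to each tail separately
    lower = fitted.inv_cdf(alpha)
    upper = fitted.inv_cdf(1 - alpha)

    def label(value):
        """Label a sample relative to the confidence bounds."""
        if value < lower:
            return "down"
        if value > upper:
            return "up"
        return "none"

    y = [3, 4, 5, 6, 10, 11, 12, 18, 20]
    labels = [label(v) for v in y]
    print(round(lower, 3), round(upper, 3), labels)

Samples below the lower bound are labelled ``down``, samples above the upper bound are labelled ``up``, and everything in between is labelled ``none``.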


Examine results
**********************************

``results`` is a dictionary containing ``y``, ``y_proba``, ``y_pred`` and ``P``, in which the output values have the same order as the input values ``y``.
``P`` holds the raw P-values and ``y_proba`` holds the P-values after multiple test correction (default: fdr_bh).
If you want to use the raw ``P`` values, set ``multtest`` to None during initialization.
Note that the dataframe ``df`` is included when using the **todf=True** parameter.

.. code:: python

    # Print probabilities
    print(results['y_proba'])
    # > [0.02702734, 0.04908335, 0.08492715, 0.13745288, 0.49567466, 0.41288701, 0.3248188 , 0.02260135, 0.00636084]
    
    # Print the labels with respect to the confidence intervals
    print(results['y_pred'])
    # > ['down' 'down' 'down' 'none' 'none' 'none' 'none' 'up' 'up']

    # Print the dataframe containing the total information
    print(results['df'])

+----+-----+------------+----------+------------+
|    |   y |    y_proba | y_pred   |          P |
+====+=====+============+==========+============+
|  0 |   3 | 0.0270273  | down     | 0.00900911 |
+----+-----+------------+----------+------------+
|  1 |   4 | 0.0490833  | down     | 0.0218148  |
+----+-----+------------+----------+------------+
|  2 |   5 | 0.0849271  | down     | 0.0471817  |
+----+-----+------------+----------+------------+
|  3 |   6 | 0.137453   | none     | 0.0916353  |
+----+-----+------------+----------+------------+
|  4 |  10 | 0.495675   | none     | 0.495675   |
+----+-----+------------+----------+------------+
|  5 |  11 | 0.412887   | none     | 0.367011   |
+----+-----+------------+----------+------------+
|  6 |  12 | 0.324819   | none     | 0.252637   |
+----+-----+------------+----------+------------+
|  7 |  18 | 0.0226014  | up       | 0.00502252 |
+----+-----+------------+----------+------------+
|  8 |  20 | 0.00636084 | up       | 0.00070676 |
+----+-----+------------+----------+------------+
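
The relation between ``P`` and ``y_proba`` in the table can be checked by hand. The sketch below applies a plain Benjamini-Hochberg (fdr_bh) correction to the raw P-values copied from the table. The helper ``fdr_bh`` is written for this example with ``numpy`` only and is not the routine that ``distfit`` calls internally.

.. code:: python

    import numpy as np

    def fdr_bh(pvalues):
        """Benjamini-Hochberg adjusted P-values (illustrative helper)."""
        p = np.asarray(pvalues, dtype=float)
        n = p.size
        order = np.argsort(p)
        # Scale each sorted P-value by n / rank
        scaled = p[order] * n / np.arange(1, n + 1)
        # Enforce monotonicity from the largest P-value downwards, cap at 1
        adjusted_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
        # Restore the original input order
        adjusted = np.empty(n)
        adjusted[order] = adjusted_sorted
        return adjusted

    # Raw P-values copied from the table above
    P = [0.00900911, 0.0218148, 0.0471817, 0.0916353, 0.495675,
         0.367011, 0.252637, 0.00502252, 0.00070676]
    y_proba = fdr_bh(P)
    print(np.round(y_proba, 5))

The adjusted values agree with the ``y_proba`` column up to rounding. For example, the smallest raw P-value is multiplied by 9 (the number of tests) and the second smallest by 9/2.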
    

.. |fig1b| image:: ../figs/example_fig1b.png
    :scale: 70%

.. table:: Plot distribution with predictions
   :align: center

   +---------+
   | |fig1b| |
   +---------+


Output
**********************************

In the previous example, we showed that the output can be captured in ``results``, but the results are also stored in the object itself.
In our examples this is the ``dfit`` object.
The same variable names are used: ``y``, ``y_proba``, ``y_pred`` and ``P``.
Note that the dataframe ``df`` is included when using the todf=True parameter.


.. code:: python

    # All scores of the tested distributions
    print(dfit.summary)

    # Distribution parameters for best fit
    dfit.model

    # Show the predictions for y
    print(dfit.results['y_pred'])
    # ['down' 'down' 'none' 'none' 'none' 'none' 'up' 'up' 'up']

    # Show the probabilities for y that belong with the predictions
    print(dfit.results['y_proba'])
    # [2.75338375e-05 2.74664877e-03 4.74739680e-01 3.28636879e-01 1.99195071e-01 1.06316132e-01 5.05914722e-02 2.18922761e-02 8.89349927e-03]
 
    # All predicted information is also stored in a structured dataframe (only when setting todf=True)
    # y: input values
    # y_proba: corrected P-values after multiple test correction (default: fdr_bh).
    # y_pred: label ('down', 'up' or 'none') with respect to the confidence intervals
    # P: raw P-values

    print(dfit.results['df'])

+----+-----+------------+----------+------------+
|    |   y |    y_proba | y_pred   |          P |
+====+=====+============+==========+============+
|  0 |   3 | 0.0270273  | down     | 0.00900911 |
+----+-----+------------+----------+------------+
|  1 |   4 | 0.0490833  | down     | 0.0218148  |
+----+-----+------------+----------+------------+
|  2 |   5 | 0.0849271  | down     | 0.0471817  |
+----+-----+------------+----------+------------+
|  3 |   6 | 0.137453   | none     | 0.0916353  |
+----+-----+------------+----------+------------+
|  4 |  10 | 0.495675   | none     | 0.495675   |
+----+-----+------------+----------+------------+
|  5 |  11 | 0.412887   | none     | 0.367011   |
+----+-----+------------+----------+------------+
|  6 |  12 | 0.324819   | none     | 0.252637   |
+----+-----+------------+----------+------------+
|  7 |  18 | 0.0226014  | up       | 0.00502252 |
+----+-----+------------+----------+------------+
|  8 |  20 | 0.00636084 | up       | 0.00070676 |
+----+-----+------------+----------+------------+


Parallel Computing
######################

``distfit`` supports parallel computing and, for maximum efficiency, parallelizes two parts separately: the fitting of the distributions and the bootstrap approach.
The chart below shows how effective parallelization is over these two parts. In general, parallelizing is very effective: the compute time is reduced from ~210 sec to ~48 sec.
``n_jobs_dist`` controls the general loop over the distributions, while ``n_jobs`` pertains to the bootstrap part.
When the cores are divided between the two tasks, there is no performance gain. In other words, when bootstrapping is enabled, it is best to allocate most of the cores to it.
Core allocation is managed automatically during initialization, so you only need to set ``n_jobs``.


.. code:: python

    import time

    start_time = time.time()
    # Initialization
    dfit = distfit(distr='popular', n_boots=50, n_jobs=8, verbose='info')
    # Fit
    dfit.fit_transform(X)
    # Compute time
    elapsed_time = time.time() - start_time
    print(elapsed_time)
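
To illustrate why the bootstrap part lends itself to parallelization, the sketch below runs independent bootstrap resamples across a pool of workers. It is a generic bootstrap written with the standard library and ``numpy``, not the procedure that ``distfit`` runs internally. The helper ``bootstrap_fit`` and the use of a thread pool are assumptions made for this example.

.. code:: python

    from concurrent.futures import ThreadPoolExecutor

    import numpy as np

    X = np.random.default_rng(0).normal(10, 3, 2000)

    def bootstrap_fit(seed):
        """Resample X with replacement and refit a normal (hypothetical helper)."""
        rng = np.random.default_rng(seed)
        sample = rng.choice(X, size=X.size, replace=True)
        return sample.mean(), sample.std()

    n_boots = 50

    # Each resample is independent, so the iterations can be spread over workers.
    # pool.map returns the results in the order of the seeds.
    with ThreadPoolExecutor(max_workers=8) as pool:
        boots = list(pool.map(bootstrap_fit, range(n_boots)))

    locs = np.array([loc for loc, _ in boots])
    print(locs.mean(), locs.std())

Because every iteration uses its own seed, the parallel run returns the same values as a serial loop over the same seeds.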


.. |fig_parallel_computing| image:: ../figs/performance_parralel_jobs.png
    :scale: 100%

.. table:: Parallel Computing
   :align: center

   +--------------------------+
   | |fig_parallel_computing| |
   +--------------------------+


.. include:: add_bottom.add