Mohamed Mohamed-Ahmed

Fitting a probability distribution to incoming data

Updated: May 6, 2022


[Figure: histogram of the data with a fitted distribution overlaid]

When working with any dataset, it is useful to build a probabilistic model of the data, which helps in understanding it. A probabilistic model describes the dataset through a probability distribution, treating the observations as the outcome of a random process. The randomness can arise in two ways:

  • The data come from a genuinely random (stochastic) natural process.

  • The data carry uncertainty, for example because they are incomplete.


A probability distribution fitted to the incoming data makes it possible to compute the probabilities of its values occurring. Distributions fall into two groups (a short example follows this list):

  • Discrete distributions: the data take countable values, such as counts or distinct outcomes.

  • Continuous distributions: the data can take any value within a specified range, so the possible values are uncountable.
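
For instance, in Scipy a discrete distribution such as the Poisson exposes a probability mass function, while a continuous one such as the normal exposes a probability density function (the parameter values below are purely illustrative):

import scipy.stats as st

print(st.poisson.pmf(3, mu=2.0))          # discrete: P(X = 3) for a Poisson with mean 2
print(st.norm.pdf(0.5, loc=0, scale=1))   # continuous: density of N(0, 1) at x = 0.5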


The motivation for this blog is that in recent years probability distributions have seen increasing use across disciplines such as engineering, biology, weather forecasting, stocks and investments, and many more. It is therefore important to fit theoretical distributions to a custom dataset in order to find the one that best describes the input data. Finding the distribution that fits the incoming data follows three steps:

  1. Identify the theoretical distribution models that could describe the incoming data.

  2. Estimate the parameters of the selected theoretical distribution models.

  3. Evaluate the goodness of fit of the theoretical distribution models to determine which one fits the incoming data best.


Many distributions are available, both discrete and continuous.


The programming language used here is Python, and the statistical models are built with the Scipy package.


Within the Scipy package there are:

  • 104 continuous distributions

  • 19 discrete distributions

(Appendix 1 shows the code used to count them.)


0. Dependencies:

The first thing to do is to import the required packages:

import numpy as np
import pandas as pd
import scipy.stats as st
import matplotlib.pyplot as plt
import pylab
import math

1. Selecting theoretical distribution models:

The first task is to select the theoretical distribution models that could fit the observed data. A valuable clue for narrowing down the candidates is the shape of the histogram of the observed data, which can be obtained with the following code:


def get_hist(data, bins=10):
    # Obtaining the density-normalized histogram of the data:
    hist, bin_edges = np.histogram(a=data, bins=bins, density=True)
    return hist, bin_edges
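
As a quick sanity check, the helper can be run on synthetic data (the standard-normal sample below is purely illustrative):

sample = np.random.normal(loc=0.0, scale=1.0, size=1000)
hist, bin_edges = get_hist(sample)
print(hist.shape, bin_edges.shape)   # (10,) bin densities and (11,) bin edges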

After acquiring the histogram, it is possible to view its shape and learn about the input data. This helps to narrow down which types of distribution could best fit the shape of the histogram. Fitting the distribution(s) to the input data is done through a technique called Maximum Likelihood Estimation (MLE). This technique uses the input observations to find the parameter estimates that maximize the likelihood of a given distribution:

$$\hat{\theta} = \arg\max_{\theta} \mathcal{L}(\theta) = \arg\max_{\theta} \prod_{i=1}^{n} f(x_i \mid \theta)$$

where $\theta$ is the parameter (vector) of the given distribution to be estimated, $f$ is its density, and $x_1, \dots, x_n$ are the input observations.
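
In Scipy this maximization is carried out by each distribution's fit method, which returns the MLE estimates of the shape, location and scale parameters. A minimal sketch on synthetic data (values are illustrative only):

sample = np.random.normal(loc=5.0, scale=2.0, size=1000)
mu_hat, sigma_hat = st.norm.fit(sample)   # MLE estimates of loc (mean) and scale (std)
print(mu_hat, sigma_hat)                  # should come out close to 5.0 and 2.0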


2. Assessing the fit of the theoretical distribution models:

To find out which of the theoretical distributions fits the input observations, Goodness-of-Fit measures are used. They compare the distribution of the input data with each fitted candidate distribution, producing a test statistic that is either ranked across candidates or compared against a statistical significance threshold. There are different types of Goodness-of-Fit measures, for instance:

  • Sum of Square Error (SSE) test

  • R Square (R^2) test

  • Information Criteria (IC) test

  • Chi-Square (Chi^2) test

  • Kolmogorov-Smirnov (KS) test

  • and many more.

The assessment can be done through the following code:

def get_best_distribution(data, size, sensor_no, damage_case, method, plot=False):
    dist_names = ['alpha', 'anglit', 'arcsine', 'beta', 'betaprime', 'bradford', 'burr', 'cauchy', 'chi', 'chi2', 'cosine', 'dgamma', 'dweibull', 'erlang', 'expon', 'exponweib', 'exponpow', 'f', 'fatiguelife', 'fisk', 'foldcauchy', 'foldnorm', 'frechet_r', 'frechet_l', 'genlogistic', 'genpareto', 'genexpon', 'genextreme', 'gausshyper', 'gamma', 'gengamma', 'genhalflogistic', 'gilbrat',  'gompertz', 'gumbel_r', 'gumbel_l', 'halfcauchy', 'halflogistic', 'halfnorm', 'hypsecant', 'invgamma', 'invgauss', 'invweibull', 'johnsonsb', 'johnsonsu', 'ksone', 'kstwobign', 'laplace', 'logistic', 'loggamma', 'loglaplace', 'lognorm', 'lomax', 'maxwell', 'mielke', 'moyal', 'nakagami', 'ncx2', 'ncf', 'nct', 'norm', 'pareto', 'pearson3', 'powerlaw', 'powerlognorm', 'powernorm', 'rdist', 'reciprocal', 'rayleigh', 'rice', 'recipinvgauss', 'semicircular', 't', 'triang', 'truncexpon', 'truncnorm', 'tukeylambda', 'uniform', 'vonmises', 'wald', 'weibull_min', 'weibull_max', 'wrapcauchy']

    # Applying the Goodness-of-Fit tests to select the best distribution that fits the data:
    dist_results = []
    dist_IC_results = []
    params = {}
    params_IC = {}
    params_SSE = {}

    if method == 'sse':
########################################################################################################################
######################################## Sum of Square Error (SSE) test ################################################
########################################################################################################################

        for dist_name in dist_names:
            dist = getattr(st, dist_name)
            # fit dist to data (MLE) and store the estimated parameters
            params_dist = dist.fit(data)
            params[dist_name] = params_dist

            # Obtaining the histogram:
            Hist_data, bin_data = get_hist(data)

            # Separate parts of parameters
            arg = params_dist[:-2]
            loc = params_dist[-2]
            scale = params_dist[-1]
            
            # Evaluate the fitted PDF at the histogram bin midpoints and compute
            # the sum of squared errors against the histogram densities
            bin_mid = (bin_data + np.roll(bin_data, -1))[:-1] / 2.0
            pdf = dist.pdf(bin_mid, loc=loc, scale=scale, *arg)
            sse = np.sum(np.power(Hist_data - pdf, 2.0))
            print("SSE value for " + dist_name + " for Sensor " + str(sensor_no) + " = " + str(sse))

            # store the SSE for this distribution
            dist_results.append((dist_name, sse))

        # select the best fitted distribution (lowest SSE) and store its name and SSE value
        best_dist, best_stat_test_val = (min(dist_results, key=lambda item: item[1]))

        # Printing out the best distributions based on the tests:
        print('\n################################ Sum of Square Error test parameters ####################################')
        print("Best fitting distribution (SSE test) for Sensor " + str(sensor_no) + " :" + str(best_dist))
        print("Best SSE value (SSE test) for Sensor " + str(sensor_no) + " :" + str(best_stat_test_val))
        print("Parameters for the best fit (SSE test) for Sensor " + str(sensor_no) + " :" + str(params[best_dist]))
        print('###########################################################################################################\n')

########################################################################################################################
########################################################################################################################
########################################################################################################################

    if method == 'r2':
########################################################################################################################
##################################################### R Square (R^2) test ##############################################
########################################################################################################################
        dist_results = []


        for dist_name in dist_names:
            dist = getattr(st, dist_name)
            # fit dist to data (MLE) and store the estimated parameters
            params_dist = dist.fit(data)
            params[dist_name] = params_dist

            # Obtaining the histogram:
            Hist_data, bin_data = get_hist(data)

            # Separate parts of parameters
            arg = params_dist[:-2]
            loc = params_dist[-2]
            scale = params_dist[-1]

            # Evaluate the fitted PDF at the histogram bin midpoints and
            # compute the R^2 score against the histogram densities
            bin_mid = (bin_data + np.roll(bin_data, -1))[:-1] / 2.0
            pdf = dist.pdf(bin_mid, loc=loc, scale=scale, *arg)
            r2 = compute_r2_test(y_true=Hist_data, y_predicted=pdf)
            dist_results.append((dist_name, r2))

        # select the best fitted distribution (highest R^2) and store its name and R^2 value
        best_distribution, best_stat_test_val = (max(dist_results, key=lambda item: item[1]))

        print('\n############################## R Square test parameters ###########################################')
        best_dist = best_distribution
        print("Best fitting distribution (R^2 test) for Sensor " + str(sensor_no) + " :" + str(best_dist))
        print("Best R^2 value (R^2 test) for Sensor " + str(sensor_no) + " :" + str(best_stat_test_val))
        print("Parameters for the best fit (R^2 test) for Sensor " + str(sensor_no) + " :" + str(params[best_dist]))
        print('#####################################################################################################\n')

########################################################################################################################
########################################################################################################################
########################################################################################################################

    if method == 'ic':
########################################################################################################################
######################################## Information Criteria (IC) test ################################################
########################################################################################################################
        dist_results = []        

        for dist_name in dist_names:
            dist = getattr(st, dist_name)
            # fit dist to data (MLE) and store the estimated parameters
            params_dist = dist.fit(data)
            params[dist_name] = params_dist

            # Obtaining the histogram:
            Hist_data, bin_data = get_hist(data)

            # Separate parts of parameters
            arg = params_dist[:-2]
            loc = params_dist[-2]
            scale = params_dist[-1]

            # Evaluate the fitted PDF at the histogram bin midpoints and compute
            # the sum of squared errors against the histogram densities
            bin_mid = (bin_data + np.roll(bin_data, -1))[:-1] / 2.0
            pdf = dist.pdf(bin_mid, loc=loc, scale=scale, *arg)
            sse = np.sum(np.power(Hist_data - pdf, 2.0))

            # Log-likelihood of the data under the fitted distribution:
            loglik = np.sum(dist.logpdf(data, *params_dist))
            k = len(params_dist)  # number of estimated parameters
            n = len(data)
            aic = 2 * k - 2 * loglik
            # BIC approximation based on the least-squares residuals:
            bic = n * np.log(sse / n) + k * np.log(n)
            dist_results.append((dist_name, aic))
            # dist_IC_results.append((dist_name, bic))

        # select the best fitted distribution (lowest AIC) and store its name and IC value
        best_dist, best_stat_test_val = (min(dist_results, key=lambda item: item[1]))

        print('\n############################ Information Criteria (IC) test parameters ##################################')
        print("Best fitting distribution (IC test) for Sensor " + str(sensor_no) + " :" + str(best_dist))
        print("Best IC value (IC test) for Sensor " + str(sensor_no) + " :" + str(best_stat_test_val))
        print("Parameters for the best fit (IC test) for Sensor " + str(sensor_no) + " :" + str(params[best_dist]))
        print('###########################################################################################################\n')

########################################################################################################################
########################################################################################################################
########################################################################################################################

    if method == 'chi':
########################################################################################################################
################################################ Chi-Square (Chi^2) test ###############################################
########################################################################################################################
        dist_results = []        

        # Set up 50 bins for the chi-square test
        # Observed data will be approximately evenly distributed across all bins
        percentile_bins = np.linspace(0, 100, 51)
        percentile_cutoffs = np.percentile(data, percentile_bins)
        observed_frequency, bins = (np.histogram(data, bins=percentile_cutoffs))
        cum_observed_frequency = np.cumsum(observed_frequency)
        size = len(data)

        chi_square = []
        for dist_name in dist_names:
            dist = getattr(st, dist_name)
            # fit dist to data (MLE) and store the estimated parameters
            params_dist = dist.fit(data)
            params[dist_name] = params_dist

            # Separate parts of parameters
            arg = params_dist[:-2]
            loc = params_dist[-2]
            scale = params_dist[-1]


            # Get expected counts in percentile bins
            # This is based on the cumulative distribution function (CDF)
            cdf_fitted = dist.cdf(percentile_cutoffs, *arg, loc=loc, scale=scale)
            expected_frequency = []
            for bin in range(len(percentile_bins) - 1):
                expected_cdf_area = cdf_fitted[bin + 1] - cdf_fitted[bin]
                expected_frequency.append(expected_cdf_area)

            # calculate chi-squared
            expected_frequency = np.array(expected_frequency) * size
            cum_expected_frequency = np.cumsum(expected_frequency)
            ss = sum(((cum_expected_frequency - cum_observed_frequency) ** 2) / cum_observed_frequency)
            chi_square.append(ss)

            # Store the chi-square statistic for this distribution:
            print("Chi-square value for " + dist_name + " for Sensor " + str(sensor_no) + " = " + str(ss))
            dist_results.append((dist_name, ss))

        # select the best fitted distribution (lowest chi-square value) and store its name and statistic
        print('Distribution results: ', dist_results)
        best_dist, best_stat_test_val = (min(dist_results, key=lambda item: item[1]))

        print('\n#################################### Chi-Square test parameters #######################################')
        print("Best fitting distribution (Chi^2 test) for Sensor " + str(sensor_no) + " :" + str(best_dist))
        print("Best p value (Chi^2 test) for Sensor " + str(sensor_no) + " :" + str(best_stat_test_val))
        print("Parameters for the best fit (Chi^2 test) for Sensor " + str(sensor_no) + " :" + str(params[best_dist]))
        print('#########################################################################################################\n')

########################################################################################################################
########################################################################################################################
########################################################################################################################

    if method == 'ks':
########################################################################################################################
########################################## Kolmogorov-Smirnov (KS) test ################################################
########################################################################################################################
        dist_results = []

        for dist_name in dist_names:
            dist = getattr(st, dist_name)
            param = dist.fit(data)
            params[dist_name] = param

            # Applying the Kolmogorov-Smirnov test:
            D, p = st.kstest(data, dist_name, args=param)
            # D, p = st.kstest(data, dist_name, args=param, N=N_len, alternative='greater')
            # print("Kolmogorov-Smirnov test Statistics value for " + dist_name + " for Sensor " + str(sensor_no) + " = " + str(D))
            print("p value for " + dist_name + " for Sensor " + str(sensor_no) + " = " + str(p))
            dist_results.append((dist_name, p))

        # select the best fitted distribution (highest p value) and store its name and p value
        best_dist, best_stat_test_val = (max(dist_results, key=lambda item: item[1]))

        print('\n################################ Kolmogorov-Smirnov test parameters #####################################')
        print("Best fitting distribution (KS test) for Sensor " + str(sensor_no) + " :" + str(best_dist))
        print("Best p value (KS test) for Sensor " + str(sensor_no) + " :" + str(best_stat_test_val))
        print("Parameters for the best fit (KS test) for Sensor " + str(sensor_no) + " :" + str(params[best_dist]))
        print('###########################################################################################################\n')

########################################################################################################################
########################################################################################################################
########################################################################################################################

    # Plotting the distribution with histogram:
    if plot:
        plt.hist(x=data, density=True)
        best_param = params[best_dist]
        best_dist_p = getattr(st, best_dist)
        pdf, x_axis_pdf, y_axis_pdf = make_pdf(data=data, dist=best_dist_p, params=best_param, size=len(data))
        plt.plot(x_axis_pdf, y_axis_pdf, color='red', label='Best dist ={0}'.format(best_dist))
        plt.legend()
        plt.title('Histogram and Distribution plot of Sensor {0} of DC {1}'.format(sensor_no, damage_case))
        # plt.title('Histogram and Distribution plot')
        plt.show()
        # plt.show(block=False)
        # plt.pause(5)  # Pauses the program for 5 seconds
        plt.close('all')

    return best_dist, best_stat_test_val, params[best_dist]

The plotting helper "make_pdf" is defined in the appendix.
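
With the helper functions above in scope, a hypothetical call looks like this (sensor_no and damage_case are only used to label the printed output and the plot title; the input data here is purely illustrative):

data = np.random.normal(loc=0.0, scale=1.0, size=500)   # illustrative input
best_dist, best_val, best_params = get_best_distribution(
    data, size=len(data), sensor_no=1, damage_case='A', method='ks', plot=True)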


3. Evaluating the significance level of the fitted distribution:

For the evaluation, the test statistic value is used to judge which of the fitted distributions comes closest to the true distribution of the data. Each test has its own selection rule: for the KS test the distribution with the highest p value is selected, for R^2 the highest score, and for the SSE, Chi-Square and Information Criteria measures the lowest value wins.
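
As an illustration of the KS rule, here is a one-off test of a single candidate distribution on synthetic data (illustrative only); a large p value means the candidate cannot be rejected:

sample = np.random.normal(size=500)
D, p = st.kstest(sample, 'norm', args=st.norm.fit(sample))
print(D, p)   # a small D and a large p value indicate a good fit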



Appendix:


1. Counting the distributions available within the Scipy package:

from scipy import stats

dist_continu = [d for d in dir(stats) if isinstance(getattr(stats, d), stats.rv_continuous)]
dist_discrete = [d for d in dir(stats) if isinstance(getattr(stats, d), stats.rv_discrete)]
print('number of continuous distributions: %d' % len(dist_continu))
print('number of discrete distributions:   %d' % len(dist_discrete))
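
When this post was written, this printed 104 continuous and 19 discrete distributions; the exact counts vary with the Scipy version.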

2. Creating the PDF distribution:

def make_pdf(data, dist, params, size):
    """Generate the fitted distribution's Probability Density Function (PDF)."""

    # Get histogram of original data
    Hist_data, bin_edges = get_hist(data)
    bin_mid = (bin_edges + np.roll(bin_edges, -1))[:-1] / 2.0  # go from bin edges to bin middles

    # Separate parts of parameters
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]

    # Get sane start and end points of distribution
    start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
    end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)

    # Build PDF and turn into pandas Series
    size = int(size)
    x = np.linspace(start, end, size)
    y = dist.pdf(x, loc=loc, scale=scale, *arg)
    # Rescale the fitted PDF so its area matches the area under the histogram
    scale_pdf = np.trapz(Hist_data, bin_mid) / np.trapz(y, x)
    y *= scale_pdf
    pdf = pd.Series(y, x)
    return pdf, x, y
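
3. Computing the R^2 score:

The helper compute_r2_test used in the R^2 branch is not defined in the original post; the following is a minimal sketch of a standard coefficient-of-determination calculation, consistent with selecting the highest value as the best fit:

def compute_r2_test(y_true, y_predicted):
    # Coefficient of determination: R^2 = 1 - SS_res / SS_tot
    y_true = np.asarray(y_true)
    y_predicted = np.asarray(y_predicted)
    ss_res = np.sum((y_true - y_predicted) ** 2)      # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot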
