Abstract
A method of univariate probability density function (pdf) estimation is developed for big data applications. The method employs a non-parametric maximum entropy estimator (NMEM) within a data-driven, multithreaded probability density estimation algorithm, termed the stitching estimator (SE). The NMEM has previously been shown to be a robust pdf estimator for high-throughput applications, making it the ideal choice for the underlying estimator in the SE's algorithm. This work divides the estimation problem into many smaller estimation problems, termed blocks. The sample is partitioned into blocks by an optimized branching tree algorithm developed to maximize the uniformity of the data density within each block. The algorithm finds a pdf estimate for each block using the NMEM, and the per-block estimates are then combined through a stitching procedure that uses a weighted average based on the cumulative distribution functions (cdf) of each pair of adjacent blocks. Further improvements are obtained by implementing a sub-sampling approach that generates sub-samples from the original sample without replacement; the pdfs from the sub-samples are then averaged to give a final estimate. The SE has been extensively benchmarked against a large set of diverse distributions, with sample sizes ranging from $2^9$ up to $2^{20}$ and 1000 trials per sample size. The quality of the estimates is quantified using scaled quantile residual (SQR) plots, a sample-size-invariant metric that is consistent with the Anderson-Darling test. The test distributions range from easy single-mode distributions to extremely difficult exotic distributions. In all cases tested, the SE yields excellent estimates with no need for a priori knowledge of the structure of the data.
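To illustrate the cdf-weighted stitching step described above, the sketch below blends the pdf estimates of two adjacent blocks over their overlap region. This is a minimal sketch, not the paper's implementation: the Gaussian `pdf_left`/`pdf_right` pairs are hypothetical stand-ins for NMEM output on two adjacent blocks, and the specific weight $w(x) = (1 - F_L(x)) / \big((1 - F_L(x)) + F_R(x)\big)$ is one plausible cdf-based weighting; the paper's exact scheme may differ.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical pdf/cdf estimates for two adjacent blocks
# (stand-ins for the NMEM output on each block).
pdf_left,  cdf_left  = norm(0.0, 1.0).pdf, norm(0.0, 1.0).cdf
pdf_right, cdf_right = norm(0.5, 1.2).pdf, norm(0.5, 1.2).cdf

def stitch_pair(x, eps=1e-12):
    """cdf-weighted average of two adjacent block pdf estimates.

    Early in the overlap the left block's cdf is well below 1 and the
    right block's cdf is near 0, so the weight favors the left estimate;
    the balance shifts smoothly toward the right estimate as x increases.
    Illustrative weighting only; assumed, not taken from the paper.
    """
    wl = 1.0 - cdf_left(x) + eps   # large where the left block still dominates
    wr = cdf_right(x) + eps        # large where the right block takes over
    w = wl / (wl + wr)
    return w * pdf_left(x) + (1.0 - w) * pdf_right(x)

# Evaluate the stitched estimate across the overlap region.
x = np.linspace(-1.0, 1.5, 6)
print(stitch_pair(x))
```

A weight built from the two cdfs, rather than a fixed crossfade, adapts to where each block's probability mass actually lies, which keeps the stitched estimate continuous across block boundaries.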