
Mathematical / Statistical Topics

Below are links to tutorials that I’ve written on various topics in statistical analysis, machine learning, and data science. Links to related resources (e.g., books on Bayesian data analysis) appear further down the page.

NOTE: In case it helps, I run the Jupyter notebooks from Anaconda Navigator, using a separate environment for each notebook.

Sqrt(-1) Explained

This is a very brief explanation of the “square root of -1”, written without mathematical jargon so that a broad audience can read and understand it, and can see that there is nothing “imaginary” going on here.

Extreme Value Analysis (EVA)

“EVA is widely used in many disciplines, such as structural engineering, finance, earth sciences, traffic prediction, and geological engineering. For example, EVA might be used in the field of hydrology to estimate the probability of an unusually large flooding event, such as the 100-year flood.” – Wikipedia

This is the PDF of a presentation on Extreme Value Analysis (EVA) that I gave to the Central Texas IEEE Section on Oct 20, 2022.
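
As a small aside on the “100-year flood” mentioned in the quotation: the term refers to an event with a 1-in-100 chance of being exceeded in any given year. The sketch below is not taken from the presentation; it only shows the arithmetic for the chance of seeing at least one such event over a longer horizon, under the assumption that years are independent. The function name is made up for this illustration.

```python
def prob_at_least_one(return_period_years: float, horizon_years: int) -> float:
    """Probability of at least one exceedance within the horizon.

    Assumes each year is independent, with an annual exceedance
    probability of 1 / return_period_years.
    """
    p_annual = 1.0 / return_period_years
    # Complement of "no exceedance in any of the years".
    return 1.0 - (1.0 - p_annual) ** horizon_years


if __name__ == "__main__":
    for horizon in (1, 30, 100):
        print(horizon, round(prob_at_least_one(100, horizon), 3))
```

Under these assumptions, a 100-year event has roughly a 26% chance of occurring at least once in 30 years and roughly a 63% chance in 100 years, which is why the name can be misleading.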

Covid-19 and Blood Types (using a Bayesian approach)

A recent paper by Jiao Zhao et al. reported that blood group (A, B, AB, or O) appears to affect the likelihood of becoming infected with the virus that causes Covid-19. Specifically, people with blood type A appear to be more susceptible to the virus, while people with blood type O appear to be less susceptible.

The authors of the paper performed several types of statistical analyses to arrive at their conclusion: one-way ANOVA, two-tailed chi-square tests, and a meta-analysis using random-effects models. In this Jupyter notebook, I perform a different type of analysis, Bayesian Data Analysis (BDA), on the data reported in their paper.

[CAVEAT: No one has checked my work, so it may contain errors.] This BDA appears to support their conclusion, and it also provides posterior density estimates for the proportions of A, B, AB, and O blood groups among the infected, along with credible intervals for those proportions. See the four posterior density plots at the end of the notebook.
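
To give a rough idea of how posterior estimates and credible intervals for category proportions can be obtained, here is a minimal sketch using a Dirichlet-multinomial model. This is an assumption chosen for illustration and may differ from the model used in the notebook. The counts are made-up placeholders, not the data from the paper, and the function names are hypothetical.

```python
import random


def dirichlet_sample(alphas, rng):
    """Draw one sample from Dirichlet(alphas) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def posterior_summary(counts, prior=1.0, n_draws=5000, seed=0):
    """Posterior mean and 95% equal-tailed credible interval per category.

    With a symmetric Dirichlet(prior, ..., prior) prior and multinomial
    counts, the posterior is Dirichlet(prior + count_i).
    """
    rng = random.Random(seed)
    labels = list(counts)
    alphas = [prior + counts[label] for label in labels]
    samples = [dirichlet_sample(alphas, rng) for _ in range(n_draws)]
    summary = {}
    for i, label in enumerate(labels):
        column = sorted(s[i] for s in samples)
        lower = column[int(0.025 * n_draws)]
        upper = column[int(0.975 * n_draws) - 1]
        mean = sum(column) / n_draws
        summary[label] = (mean, lower, upper)
    return summary


# HYPOTHETICAL counts for illustration only; NOT the data from the paper.
example_counts = {"A": 40, "B": 25, "AB": 10, "O": 25}

if __name__ == "__main__":
    for label, (mean, lower, upper) in posterior_summary(example_counts).items():
        print(f"{label}: mean={mean:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

Comparing each group's interval among the infected with that group's share in a reference population is one way to judge whether the group is over- or under-represented.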

Bayesian Beta-Binomial Example

This Jupyter notebook provides a very simple example of Bayesian parameter estimation using the Beta-Binomial model. Both analytical and simulation-based results are presented. Three different approaches are used to obtain a parameter estimate for this model:

Exact Analytical Solution
Simple Non-MCMC Solution
MCMC Solution
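
The three approaches listed above can be sketched in plain Python as follows. This is not the notebook's code: the data (7 successes in 20 trials) and the flat Beta(1, 1) prior are made up, the grid approximation is only my guess at what a “simple non-MCMC” method might look like, and the MCMC variant shown is a basic random-walk Metropolis sampler. All function names are hypothetical.

```python
import math
import random


def exact_posterior_mean(k, n, a=1.0, b=1.0):
    """Conjugacy: Beta(a, b) prior + k successes in n trials
    gives a Beta(a + k, b + n - k) posterior, whose mean is below."""
    return (a + k) / (a + b + n)


def log_unnormalized_posterior(theta, k, n, a=1.0, b=1.0):
    """Log of prior times likelihood, up to an additive constant."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return (a + k - 1) * math.log(theta) + (b + n - k - 1) * math.log(1.0 - theta)


def grid_posterior_mean(k, n, a=1.0, b=1.0, n_grid=2001):
    """Non-MCMC approach (assumed): evaluate on a grid and normalize."""
    grid = [(i + 0.5) / n_grid for i in range(n_grid)]
    weights = [math.exp(log_unnormalized_posterior(t, k, n, a, b)) for t in grid]
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total


def metropolis_posterior_mean(k, n, a=1.0, b=1.0, n_steps=20000,
                              burn_in=2000, step=0.1, seed=0):
    """MCMC approach: random-walk Metropolis with a Gaussian proposal."""
    rng = random.Random(seed)
    theta = 0.5
    log_p = log_unnormalized_posterior(theta, k, n, a, b)
    kept = []
    for i in range(n_steps):
        proposal = theta + rng.gauss(0.0, step)
        log_p_new = log_unnormalized_posterior(proposal, k, n, a, b)
        # Accept with probability min(1, p_new / p_old).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            theta, log_p = proposal, log_p_new
        if i >= burn_in:
            kept.append(theta)
    return sum(kept) / len(kept)


if __name__ == "__main__":
    k, n = 7, 20  # HYPOTHETICAL data, for illustration only
    print("exact:     ", exact_posterior_mean(k, n))
    print("grid:      ", grid_posterior_mean(k, n))
    print("metropolis:", metropolis_posterior_mean(k, n))
```

All three should give approximately the same posterior mean; the exact solution is available only because the Beta prior is conjugate to the Binomial likelihood, while the other two generalize to models without a closed form.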

Monoids 101 for Apache Spark

This Jupyter notebook describes what monoids are and the role they play in reduction and aggregation in Spark, specifically PySpark. To illustrate the use of the monoid concept, the following examples are included:

Word count
Max/Min as monoids
Histogram calculation using vectors as monoids
Calculating sample means and standard deviations
Calculating covariances and correlations using vectors and matrices as monoids
Sets as monoids
A HyperLogLog monoid (a “sketch method” for approximating set cardinality). NOTE: This uses the hll.py implementation from https://github.com/Parsely/probably, modified here to remove the dependency on the “smhasher” module and to run under the Anaconda Python distribution.
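
To make the monoid idea concrete, here is a minimal sketch in plain Python (not PySpark, and not the notebook's code) for one of the examples listed above, the sample mean and standard deviation. A monoid is an associative combining operation together with an identity element; here the elements are (count, sum, sum of squares) tuples. The nested lists stand in for data partitions, and all names are hypothetical.

```python
from functools import reduce
import math

# Identity element: combining it with any summary leaves the summary unchanged.
ZERO = (0, 0.0, 0.0)  # (count, sum, sum of squares)


def lift(x):
    """Map a single value into the monoid."""
    x = float(x)
    return (1, x, x * x)


def combine(left, right):
    """Associative combine: element-wise addition of two summaries."""
    return (left[0] + right[0], left[1] + right[1], left[2] + right[2])


def summarize(values):
    """Reduce a sequence of values to a single summary."""
    return reduce(combine, map(lift, values), ZERO)


def mean_and_std(summary):
    """Sample mean and sample standard deviation (n - 1 denominator)."""
    n, total, total_sq = summary
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, math.sqrt(max(variance, 0.0))


# Pretend each inner list is a partition processed independently.
partitions = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
partials = [summarize(p) for p in partitions]
merged = reduce(combine, partials, ZERO)

if __name__ == "__main__":
    print(mean_and_std(merged))
```

Because combine is associative and ZERO is its identity, the partial summaries can be merged in any grouping and give the same result as a single pass over all the data. In PySpark, a similar zero value and pair of functions could be passed to RDD.aggregate. Note that the sum-of-squares formula can lose precision on data with a large mean, so treat this as an illustration rather than a numerically robust implementation.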

Misc. Resources

Books & Python Libraries
“A modeler’s guide to extreme value software”, arXiv:2205.07714v1, Léo R. Belzile, et al., 16 May 2022
Bayesian Inference in R