Statistics Tutorial for Data Science

Founders & Entrepreneurs Network
Jul 10, 2021


Statistics Tutorial with Python

One of the basic requirements for becoming a data scientist is a basic understanding of statistics, from simple concepts such as the mean, median and standard deviation to more complex ones such as probability mass functions, normal distributions, uniform distributions and many more. Luckily for you, I have spent the last three years studying these concepts as an undergraduate student, sitting CATs and main exams on them. So, honestly speaking, you are in the right hands if you want to master them. Let's get started!

A discrete variable is a variable that can take only a countable number of values. One example is the outcome of rolling a die: a standard die can only show 1, 2, 3, 4, 5 or 6. Likewise, if the only possible outcomes of an event are the whole numbers 1 to 10, the variable is discrete and ranges from 1 to 10.

A continuous variable is a variable that can take an uncountable number of values. A good example is length, which can be any real number within a range.
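The difference between the two kinds of variable shows up clearly in simulated data. The snippet below is a minimal sketch, assuming a standard six-sided die and lengths drawn uniformly between 150 cm and 200 cm; the seed, the sample size and the length range are illustrative choices, not values from this tutorial.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for repeatability

# Discrete: a six-sided die can only land on the whole numbers 1..6
die_rolls = rng.integers(low=1, high=7, size=1000)  # 'high' is exclusive
distinct_faces = np.unique(die_rolls)

# Continuous: a length can be any real number in a range (assumed 150-200 cm)
lengths_cm = rng.uniform(low=150.0, high=200.0, size=1000)
distinct_lengths = np.unique(lengths_cm)

print("distinct die outcomes:", distinct_faces)
print("number of distinct lengths:", distinct_lengths.size)
```

The die produces only a handful of distinct values no matter how many times it is rolled, while practically every simulated length is different from all the others.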

In statistics we represent a discrete distribution with a PMF (probability mass function) and a CDF (cumulative distribution function). A continuous distribution is represented with a PDF (probability density function), and it has a CDF as well.

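The PMF gives the probability of each possible value of a discrete random variable, while the PDF gives the probability density at each value of a continuous random variable. For a continuous variable, the probability of falling in an interval is the area under the PDF over that interval.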
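A quick numerical check can make the PMF/PDF distinction concrete. This is a minimal sketch, assuming a fair six-sided die for the discrete case and a standard normal for the continuous case; it uses `scipy.stats.randint` (discrete uniform) and `scipy.integrate.quad` for the areas. Strictly speaking, a PDF value is a density rather than a probability, which is why it can exceed 1.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# PMF: the probabilities of the individual outcomes of a fair die sum to 1
faces = np.arange(1, 7)
die_pmf = stats.randint.pmf(faces, 1, 7)  # discrete uniform on 1..6
pmf_total = die_pmf.sum()

# PDF: the total area under the curve is 1
area_total, _ = quad(stats.norm.pdf, -np.inf, np.inf)

# Probability of an interval = area under the PDF over that interval
area_0_to_1, _ = quad(stats.norm.pdf, 0, 1)

# A density is not a probability: a narrow normal (std 0.1) peaks above 1
peak_density = stats.norm.pdf(0, loc=0, scale=0.1)

print("sum of die PMF:", pmf_total)
print("area under normal PDF:", area_total)
print("P(0 <= X <= 1):", area_0_to_1)
print("peak density for std 0.1:", peak_density)
```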

To understand these concepts better, let's visualize them.

PDF of normal distribution with mean 0 and standard deviation 1

# Import the required libraries

import os

import numpy as np

import pandas as pd

from math import sqrt

import matplotlib.mlab as mlab

import matplotlib.pyplot as plt

import seaborn as sns

# Statistics

from statistics import median

from scipy import signal

from scipy.special import factorial

import scipy.stats as stats

from scipy.stats import sem, binom, lognorm, poisson, bernoulli, spearmanr

from scipy.fftpack import fft, fftshift

# PDF of a normal distribution with mean 0 and standard deviation 1

# Plot normal distribution

mu = 0

variance = 1

sigma = sqrt(variance)

x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)

plt.figure(figsize=(16,5))

plt.plot(x, stats.norm.pdf(x, mu, sigma), label='Normal Distribution')

plt.title('Normal Distribution with mean = 0 and std = 1')

plt.legend(fontsize='xx-large')

plt.grid()

plt.show()

The code above plots the PDF of a normal distribution with mean 0 and standard deviation 1.
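The plot runs from three standard deviations below the mean to three above it, and the area under the curve tells us how much probability sits in each band. The following is a small sketch, using `scipy.stats.norm.cdf` with the same mean and standard deviation as the plot, that computes the probability of landing within 1, 2 and 3 standard deviations of the mean.

```python
from scipy import stats

mu, sigma = 0, 1

# P(mu - k*sigma <= X <= mu + k*sigma) as a difference of two CDF values
within = {}
for k in (1, 2, 3):
    upper = stats.norm.cdf(mu + k * sigma, mu, sigma)
    lower = stats.norm.cdf(mu - k * sigma, mu, sigma)
    within[k] = upper - lower
    print(f"within {k} standard deviation(s): {within[k]:.4f}")
```

These are the figures behind the familiar 68-95-99.7 rule of thumb for the normal distribution.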

PMF (Probability Mass Function)

Let's visualize the PMF of a binomial distribution with n = 200 and p = 0.5, for the values between its 1st and 99th percentiles, which are centred on the mean of n * p = 100.

# PMF Visualization

n = 200

p = 0.5

plt.style.use('dark_background')

fig, ax = plt.subplots(1, 1, figsize=(17,5))

x = np.arange(binom.ppf(0.01, n, p), binom.ppf(0.99, n, p))

ax.plot(x, binom.pmf(x, n, p), 'bo', ms=8, label='Binomial PMF')

ax.vlines(x, 0, binom.pmf(x, n, p), colors='b', lw=5, alpha=0.5)

rv = binom(n, p)

#ax.vlines(x, 0, rv.pmf(x), colors='k', linestyles='-', lw=1, label='frozen PMF')

ax.legend(loc='best', frameon=False, fontsize='xx-large')

plt.title('PMF of a binomial distribution (n=200, p=0.5)', fontsize='xx-large')

plt.grid()

plt.show()

With those two visualizations we now have a better understanding of the PDF and the PMF.
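The CDF mentioned earlier is closely tied to the PMF: for a discrete distribution it is simply the running total of the PMF. Here is a minimal sketch of that relationship, reusing n = 200 and p = 0.5 from the binomial plot and comparing a cumulative sum of `binom.pmf` with `binom.cdf`.

```python
import numpy as np
from scipy.stats import binom

n, p = 200, 0.5
k = np.arange(0, n + 1)          # every possible number of successes

pmf = binom.pmf(k, n, p)
cdf_from_pmf = np.cumsum(pmf)    # running total of the PMF
cdf_direct = binom.cdf(k, n, p)  # SciPy's own CDF

print("largest difference:", np.abs(cdf_from_pmf - cdf_direct).max())
print("P(X <= 100):", cdf_direct[100])
```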

I understand that you may be a bit lost if you are new to statistics and have no idea what the heck a normal distribution or a binomial distribution is, so let me give you some clarity.

Normal Distribution

The normal distribution is also called the Gaussian distribution, or the bell curve, as my lecturer used to call it back in 1.2. In a normal distribution the data is symmetrically distributed with no skew, as you have seen in the above diagram of the PDF of the normal distribution, if you have plotted it, which I highly recommend you do. Most of the values cluster around the central region, with values tapering off as they go further away from the center. A perfect example of this is how the marks of any exam are distributed in a given class. Very few get very good marks, in this case 70 and above, the majority lie between 40 and 69, and very few get 39 and below. And that is how the normal distribution operates. The measures of central tendency are the mean, mode and median, and in a normal distribution all three coincide.
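The exam-marks picture can be simulated to see the measures of central tendency at work. The snippet below is only a sketch under stated assumptions: the marks are drawn from a normal distribution with a hypothetical mean of 55 and standard deviation of 10, and the seed and sample size are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)  # arbitrary seed for repeatability

# Hypothetical exam marks: mean 55, standard deviation 10 (assumed values)
marks = rng.normal(loc=55, scale=10, size=10_000)

marks_mean = marks.mean()
marks_median = np.median(marks)
marks_skew = stats.skew(marks)        # close to 0 for symmetric data

# Share of marks within one standard deviation of the mean
share_within_1sd = np.mean(np.abs(marks - marks_mean) <= marks.std())

print("mean:", marks_mean)
print("median:", marks_median)
print("skewness:", marks_skew)
print("share within 1 std:", share_within_1sd)
```

For symmetric data like this, the mean and the median land almost on top of each other and the skewness stays near zero.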

Let's draw a scatter plot of a normal distribution.

# Generate Normal Distribution

normal_dist = np.random.randn(200)

normal_df = pd.DataFrame({'value' : normal_dist})

# Create a Pandas Series for easy sample function

normal_dist = pd.Series(normal_dist)

normal_dist2 = np.random.randn(200)

normal_df2 = pd.DataFrame({'value' : normal_dist2})

# Create a Pandas Series for easy sample function

normal_dist2 = pd.Series(normal_dist2)

normal_df_total = pd.DataFrame({'value1' : normal_dist,
                                'value2' : normal_dist2})

#Scatterplot

plt.figure(figsize=(15,5))

sns.scatterplot(data=normal_df)

plt.legend(fontsize='xx-large')

plt.title('Scatterplot of a Normal Distribution', fontsize='xx-large')

plt.show()

To show you the bell curve my lecturer used to talk about, let's plot a histogram of the sample together with its density curve.

# Normal Distribution as a Bell Curve

plt.figure(figsize=(18,5))

sns.histplot(normal_df['value'], kde=True)  # histplot replaces the deprecated distplot

plt.title('Normal distribution (n=200)', fontsize='xx-large')

plt.grid()

plt.show()

If you have plotted the histogram above, you now know why my lecturer loved to call it the bell curve.

Binomial Distribution

A binomial distribution has a countable number of outcomes, which makes it discrete, as we saw above with the PMF.

A binomial distribution must meet three conditions:

1. The number of observations (trials) is fixed.

2. Each observation is independent of the others.

3. The probability of success is the same for all the trials.
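The three conditions can be turned directly into a simulation: fix the number of trials, draw each trial independently, and use the same success probability every time. The following is a minimal sketch, assuming 10 coin flips per experiment with p = 0.5 (illustrative values, as are the seed and the number of experiments), that compares the simulated counts of successes with `binom.pmf`.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(seed=1)  # arbitrary seed for repeatability
n, p = 10, 0.5                       # fixed number of trials, same p for each
experiments = 20_000

# Each row is one experiment: n independent trials that succeed with chance p
trials = rng.random((experiments, n)) < p
successes = trials.sum(axis=1)

# Observed share of experiments with 0..n successes vs the theoretical PMF
empirical = np.bincount(successes, minlength=n + 1) / experiments
theoretical = binom.pmf(np.arange(n + 1), n, p)
max_gap = np.abs(empirical - theoretical).max()

print("largest gap between simulation and PMF:", max_gap)
```

When the conditions hold, the simulated frequencies settle close to the binomial PMF as the number of experiments grows.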

Another distribution worth mentioning here is the Bernoulli distribution. It is a special case of the binomial distribution with a single trial (n = 1), so its only possible values are 0 and 1.
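That relationship is easy to verify numerically. As a small sketch, assuming a success probability of p = 0.6, the Bernoulli PMF should match the binomial PMF with a single trial, and its mean and variance should be p and p(1 - p).

```python
import numpy as np
from scipy.stats import bernoulli, binom

p = 0.6
outcomes = np.array([0, 1])

bern_pmf = bernoulli.pmf(outcomes, p)
binom_pmf_one_trial = binom.pmf(outcomes, 1, p)  # binomial with n = 1

bern_mean = bernoulli.mean(p)  # equals p
bern_var = bernoulli.var(p)    # equals p * (1 - p)

print("Bernoulli PMF:", bern_pmf)
print("Binomial (n=1) PMF:", binom_pmf_one_trial)
```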

Let's plot a Bernoulli sample.

# Chance of heads (outcome 1)

p = 0.6

# Create Bernoulli samples

bern_dist = bernoulli.rvs(p, size=1000)

bern_df = pd.DataFrame({'value' : bern_dist})

bern_values = bern_df['value'].value_counts()

# Plot Distribution

plt.figure(figsize=(18,4))

bern_values.plot(kind='bar', rot=0)

# The annotation text is passed positionally so it works across matplotlib versions
plt.annotate('Samples that came up Tails\nn = {}'.format(bern_values[0]),
             xy=(0.85, 300),
             fontsize='large',
             color='white')

plt.annotate('Samples that came up Heads\nn = {}'.format(bern_values[1]),
             xy=(-0.2, 300),
             fontsize='large',
             color='white')

plt.title('Bernoulli Distribution: p = 0.6, n = 1000')

plt.grid()

plt.show()

Poisson Distribution

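The Poisson distribution is a discrete probability distribution that gives the probability of a given number of events occurring in a fixed time period, when the events happen independently at a constant average rate (lambda).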

And here is the plot.

# Poisson is discrete, so evaluate the PMF at whole numbers only
x = np.arange(0, 20)

y = np.exp(-5)*np.power(5.0, x)/factorial(x)

plt.figure(figsize=(15,8))

plt.title('Poisson distribution with lambda=5', fontsize='xx-large')

plt.plot(x, y, 'bs')

plt.show()
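The hand-written formula used for the plot can be cross-checked against SciPy. This is a minimal sketch, using the same lambda of 5 and evaluating only whole-number counts, that compares the manual PMF with `scipy.stats.poisson.pmf` and confirms that the mean and variance of a Poisson distribution both equal lambda.

```python
import numpy as np
from scipy.special import factorial
from scipy.stats import poisson

lam = 5
k = np.arange(0, 20)  # whole-number event counts

# PMF from the formula: exp(-lam) * lam**k / k!
manual_pmf = np.exp(-lam) * np.power(float(lam), k) / factorial(k)

# The same PMF from SciPy
scipy_pmf = poisson.pmf(k, lam)

mean, var = poisson.stats(lam, moments='mv')

print("largest difference:", np.abs(manual_pmf - scipy_pmf).max())
print("mean:", float(mean), "variance:", float(var))
```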

To check out the code, visit: https://github.com/tinajs2018/-Statistics-Tutorial-for-Data-Science-Statistics-Tutorial-with-Python

With that, go and practice more with these distributions. In part two of basic statistics with Python we shall dig deeper.

Stay safe and don’t forget to follow me for weekly updates.
