Statistics Tutorial for Data Science
Statistics Tutorial with Python
One of the basic requirements for becoming a data scientist is a solid understanding of statistics: from simple concepts such as the mean, median and standard deviation to more involved ones such as probability mass functions, the normal distribution, the uniform distribution and many more. But luckily, you have spent the last three years studying those concepts as an undergraduate student, sitting CATs and main exams on them. So honestly speaking, you are in the right place if you want to master these concepts. Let's get started!
A discrete variable is a variable that can take only a countable number of values. One example of a discrete variable is the outcome of a die roll: the result can only be one of the integers 1 through 6. More generally, if the possible outcomes are whole numbers, say 1 to 10, the variable is discrete.
A continuous variable is a variable that can take an uncountable number of values. A good example of this is length.
In statistics we describe a discrete distribution with a PMF (probability mass function) and a CDF (cumulative distribution function), while a continuous distribution is described with a PDF (probability density function).
The PMF gives the probability of each possible value of a discrete random variable, while the PDF describes how probability is spread over the possible values of a continuous random variable.
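Before plotting anything, here is a minimal sketch of the relationship, using a fair six-sided die as my own example (not from the original text): the CDF of a discrete variable is just the running sum of its PMF.
# Minimal sketch: PMF and CDF of a fair six-sided die (illustrative example)
import numpy as np
faces = np.arange(1, 7)
pmf = np.full(6, 1/6)      # each face has probability 1/6
cdf = np.cumsum(pmf)       # the CDF is the running sum of the PMF
for face, p, c in zip(faces, pmf, cdf):
    print(f'P(X = {face}) = {p:.3f}   P(X <= {face}) = {c:.3f}')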
To understand these concepts better, let's visualize them.
PDF of normal distribution with mean 0 and standard deviation 1
# Importing the required libraries
import os
import numpy as np
import pandas as pd
from math import sqrt
import matplotlib.pyplot as plt
import seaborn as sns
# Statistics
from statistics import median
from scipy import signal
from scipy.special import factorial
import scipy.stats as stats
from scipy.stats import sem, binom, lognorm, poisson, bernoulli, spearmanr
from scipy.fftpack import fft, fftshift
# PDF of a normal distribution with mean 0 and standard deviation 1
# Plot the normal distribution
mu = 0
variance = 1
sigma = sqrt(variance)
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
plt.figure(figsize=(16,5))
plt.plot(x, stats.norm.pdf(x, mu, sigma), label='Normal Distribution')
plt.title('Normal Distribution with mean = 0 and std = 1')
plt.legend(fontsize='xx-large')
plt.grid()
plt.show()
The code above plots the PDF of our normal distribution with mean 0 and standard deviation 1.
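As a quick sanity check (an addition of mine, not part of the original tutorial), the area under this PDF should be 1, and roughly 68% of it should lie within one standard deviation of the mean:
# Sanity check: total area under the PDF is 1, and ~68% of the mass
# lies within one standard deviation of the mean
total_area = stats.norm.cdf(np.inf, mu, sigma) - stats.norm.cdf(-np.inf, mu, sigma)
within_one_sigma = stats.norm.cdf(mu + sigma, mu, sigma) - stats.norm.cdf(mu - sigma, mu, sigma)
print(f'Total area under the PDF: {total_area:.3f}')               # -> 1.000
print(f'P(mu - sigma < X < mu + sigma): {within_one_sigma:.3f}')   # -> ~0.683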
PMF (Probability Mass Function)
Let's visualize the PMF of a binomial distribution with n = 200 and p = 0.5, over the values between its 1st and 99th percentiles.
# PMF Visualization
n = 200
p = 0.5
plt.style.use('dark_background')
fig, ax = plt.subplots(1, 1, figsize=(17,5))
x = np.arange(binom.ppf(0.01, n, p), binom.ppf(0.99, n, p))
ax.plot(x, binom.pmf(x, n, p), 'bo', ms=8, label='Binomial PMF')
ax.vlines(x, 0, binom.pmf(x, n, p), colors='b', lw=5, alpha=0.5)
rv = binom(n, p)
#ax.vlines(x, 0, rv.pmf(x), colors='k', linestyles='-', lw=1, label='frozen PMF')
ax.legend(loc='best', frameon=False, fontsize='xx-large')
plt.title('PMF of a binomial distribution (n=200, p=0.5)', fontsize='xx-large')
plt.grid()
plt.show()
With those two visualizations we get a better feel for the PDF and the PMF.
If you are new to statistics you may still be a bit lost, with no idea what a normal distribution or a binomial distribution actually is, so let me give you some clarity.
Normal Distribution
The normal distribution is also called the Gaussian distribution, or the bell curve, as my lecturer used to call it back in 1.2. In a normal distribution the data is symmetrically distributed with no skew, as you have seen in the PDF plot above (which I highly recommend you plot yourself). Most of the values cluster around the central region, tapering off as they move further away from the centre. A perfect example is how marks for any exam are distributed in a given class: very few students score very high marks (in this case 70+), the majority lie between 40 and 69, and very few score 39 and below. That is how the normal distribution behaves. Its measures of central tendency are the mean, mode and median, and for a normal distribution they all coincide.
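To make the exam-marks example concrete, here is a small sketch with simulated marks (the mean of 55 and standard deviation of 10 are my assumptions for illustration, not real data):
# Illustrative sketch: simulated exam marks (not real data)
marks = np.random.normal(loc=55, scale=10, size=1000)   # assumed mean 55, std 10
mean_mark = marks.mean()
median_mark = np.median(marks)
within_one_std = np.mean(np.abs(marks - mean_mark) < marks.std())
print(f'Mean: {mean_mark:.1f}, Median: {median_mark:.1f}')
print(f'Share of marks within one std of the mean: {within_one_std:.2%}')  # ~68%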
Let's plot a scatter plot of a normal distribution.
# Generate two normal distributions
normal_dist = np.random.randn(200)
normal_df = pd.DataFrame({'value' : normal_dist})
# Create a Pandas Series for easy sampling
normal_dist = pd.Series(normal_dist)
normal_dist2 = np.random.randn(200)
normal_df2 = pd.DataFrame({'value' : normal_dist2})
# Create a Pandas Series for easy sampling
normal_dist2 = pd.Series(normal_dist2)
normal_df_total = pd.DataFrame({'value1' : normal_dist,
                                'value2' : normal_dist2})
# Scatterplot
plt.figure(figsize=(15,5))
sns.scatterplot(data=normal_df)
plt.legend(fontsize='xx-large')
plt.title('Scatterplot of a Normal Distribution', fontsize='xx-large')
plt.show()
To show you the bell curve my lecturer used to talk about, let's plot a histogram with a density curve for a better view.
# Normal distribution as a bell curve
plt.figure(figsize=(18,5))
# sns.distplot is deprecated in recent seaborn versions; histplot with kde=True gives the same view
sns.histplot(normal_df['value'], kde=True)
plt.title('Normal distribution (n=200)', fontsize='xx-large')
plt.grid()
plt.show()
If you have plotted the histogram above, you now know why my lecturer loved to call it the bell curve.
Binomial Distribution
The binomial distribution has a countable number of outcomes, which makes it discrete, as we saw above with the PMF.
A binomial distribution must meet three conditions (illustrated in the sketch after this list):
1 The number of observations (trials) is fixed.
2 Each observation is independent of the others.
3 The probability of success is the same for every trial.
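Here is a quick sketch of those conditions in action (my own illustration, assuming 10 independent fair-coin flips per trial): the simulated frequencies line up with binom.pmf.
# Sketch: 10 independent fair-coin flips per trial satisfy all three conditions
n_flips, p_heads = 10, 0.5                           # assumed values for illustration
samples = binom.rvs(n_flips, p_heads, size=10000)    # number of heads in each trial
empirical = np.bincount(samples, minlength=n_flips + 1) / len(samples)
theoretical = binom.pmf(np.arange(n_flips + 1), n_flips, p_heads)
for k in range(n_flips + 1):
    print(f'k={k}: simulated {empirical[k]:.3f} vs theoretical {theoretical[k]:.3f}')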
Another distribution worth mentioning here is the Bernoulli distribution. The Bernoulli distribution is a special case of the binomial distribution with a single trial, so its only possible values are 0 and 1.
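A quick numeric check (my addition, using the same p = 0.6 as the plot below) shows that the Bernoulli PMF is identical to a binomial PMF with a single trial:
# Sketch: Bernoulli(p) is just Binomial(n=1, p) — compare the two PMFs
p = 0.6                                              # same p as in the plot below
for k in (0, 1):
    print(f'k={k}: bernoulli {bernoulli.pmf(k, p):.2f} vs binom(n=1) {binom.pmf(k, 1, p):.2f}')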
Let's plot it.
# Chance of heads (outcome 1)
p = 0.6
# Create Bernoulli samples
bern_dist = bernoulli.rvs(p, size=1000)
bern_df = pd.DataFrame({'value' : bern_dist})
bern_values = bern_df['value'].value_counts()
# Plot the distribution
plt.figure(figsize=(18,4))
bern_values.plot(kind='bar', rot=0)
plt.annotate('Samples that came up Tails\nn = {}'.format(bern_values[0]),
             xy=(0.85,300),
             fontsize='large',
             color='white')
plt.annotate('Samples that came up Heads\nn = {}'.format(bern_values[1]),
             xy=(-0.2,300),
             fontsize='large',
             color='white')
plt.title('Bernoulli Distribution: p = 0.6, n = 1000')
plt.grid()
plt.show()
Poisson Distribution
The Poisson distribution is a discrete probability distribution of the number of events occurring in a fixed time interval, given a known average rate (lambda).
And here is the plot.
# Poisson PMF with lambda = 5, evaluated on the integers 0..19
x = np.arange(0, 20)
y = np.exp(-5)*np.power(5, x)/factorial(x)
plt.figure(figsize=(15,8))
plt.title('Poisson distribution with lambda=5', fontsize='xx-large')
plt.plot(x, y, 'bs')
plt.show()
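As a cross-check (an extra step I'm adding, not part of the original code), the manual formula above agrees with scipy's poisson.pmf:
# Cross-check: the manual exp/power/factorial formula matches scipy.stats.poisson
k = np.arange(0, 20)
manual = np.exp(-5) * np.power(5, k) / factorial(k)
scipy_pmf = poisson.pmf(k, mu=5)
print('Maximum difference:', np.max(np.abs(manual - scipy_pmf)))  # effectively zero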
To check out the code visit : https://github.com/tinajs2018/-Statistics-Tutorial-for-Data-Science-Statistics-Tutorial-with-Python
With that, let's practice more with these distributions. In part two of basic statistics with Python we shall dig deeper.
Stay safe, and don't forget to follow me for weekly updates.