Creating Data Science Python Package using Jupyter Notebook


Introduction
We are going to use the cloud Jupyter Notebook to ease the setting up of the environment and completely focus on creating a package. We have also used Object Orient Programming (OOP) fundamentals like Inheritance, objects, class, magic functions, and that I won’t be discussing how each part work. I am sharing my experience of creating Python packages while taking AWS Machine Learning Foundations It took me 10 minutes to create a Python package once I knew who to build it, so practise will make you better at coding packages.
Have you wondered how Python packages like Scikit-learn, pandas, and NumPy are built? They are all based on OOP to create complex and easy-to-use packages. Why do data scientists need to build python packages? In modern times OOP is a necessity and it helps users to build program and share it with the public or within organizations.
Distributions
Parent Class
So, without wasting time let’s dive into the coding and discuss our Parent class Distribution which will be used by Gaussian and Binomial Classes. We are using the Jupyter notebook magic function “%%witefile ” to create python files.
%%writefile distributions/general.py
The outputs of all Jupyter notebook cells are in bold
This will create a python file into the distributions folder and to make things simple you need to create a test folder that should contain your test files, a distribution folder that should contain all your package’s files, and a data folder that contains a .txt file containing data.
The Distribution class takes 2 arguments, mean and standard deviation, it also contains the read_data_file function to access the file and store it to the data variable.
%%writefile distributions/general.py
class Distribution:
def __init__(self, mu=0, sigma=1):
"""
Generic distribution class for calculating and
visualizing a probability distribution.
Attributes:
mean (float) representing the mean value of the distribution
stdev (float) representing the standard deviation of the distribution
data_list (list of floats) a list of floats extracted from the data file
"""
self.mean = mu
self.stdev = sigma
self.data = []
def read_data_file(self, file_name):
"""
Function to read in data from a txt file. The txt file should have
one number (float) per line. The numbers are stored in the data attribute.
Args:
file_name (string): name of a file to read from
Returns:
None
"""
with open(file_name) as file:
data_list = []
line = file.readline()
while line:
data_list.append(int(line))
line = file.readline()
file.close()
self.data = data_list
writing distributions/general.py
Testing Distribution Class
Everything in this class is working smoothly. We have added mean, standard deviation, and loaded the random.txt file and you can see the results.
from distributions.general import Distribution # importing python file from directory
general1 = Distribution(20, 3) # creating object with 20 mean, 30 standard deviation
print("Mean: {}, standard deviation: {}".format(general1.mean, general1.stdev))
general1.read_data_file("data/random.txt") # loading txt file data and checking
print("Data Loaded :", general1.data)
## everything is working perfectly
Gaussian Distribution
The Gaussian distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. – Wikipedia

Mean

The mean of a list of numbers is the sum of all of the numbers divided by the number of numbers. Mean – Wikipedia
Standard Deviation

This is a measure of the amount of variation of a set of values using the mean. The BMJ
Probability density function

The parameter mu is the mean, while the parameter sigma is the standard deviation. The x is the value in a list.
Gaussian Class
Without going deep into how inheritance work and how the magic function works, let’s discuss the functions:
- Initialize the parent class Distribution
- Create calculate_mean function -> calculate
mean from data - Create calculate_stdev function -> calculate
standard deviation from data - Create plot_histogram function -> display
histogram using matplotlib library - Create pdf function -> calculate
proabaility density function of data using mean and stdev - Create plot_histogram_pdf function -> plot the
normalized histogram of the data and a plot of the probability
density function along the same range. - Create magic function __add__ -> add together two
Gaussian distributions objects - Create magic function __repr__ -> output the
characteristics of the Gaussian instance
%%writefile distributions/gaussian.py
import math
import matplotlib.pyplot as plt
from .general import Distribution
class Gaussian(Distribution):
"""
Gaussian distribution class for calculating and
visualizing a Gaussian distribution.
Attributes:
mean (float) representing the mean value of the distribution
stdev (float) representing the standard deviation of the distribution
data_list (list of floats) a list of floats extracted from the data file
"""
def __init__(self, mu=0, sigma=1):
Distribution.__init__(self, mu, sigma)
def calculate_mean(self):
"""
Function to calculate the mean of the data set.
Args:
None
Returns:
float: mean of the data set
"""
avg = 1.0 * sum(self.data) / len(self.data)
self.mean = avg
return self.mean
def calculate_stdev(self, sample=True):
"""
Function to calculate the standard deviation of the data set.
Args:
sample (bool): whether the data represents a sample or population
Returns:
float: standard deviation of the data set
"""
if sample:
n = len(self.data) - 1
else:
n = len(self.data)
mean = self.calculate_mean()
sigma = 0
for d in self.data:
sigma += (d - mean) ** 2
sigma = math.sqrt(sigma / n)
self.stdev = sigma
return self.stdev
def plot_histogram(self):
"""
Function to output a histogram of the instance variable data using
matplotlib pyplot library.
Args:
None
Returns:
None
"""
plt.hist(self.data)
plt.title('Histogram of Data')
plt.xlabel('data')
plt.ylabel('count')
def pdf(self, x):
"""
Probability density function calculator for the gaussian distribution.
Args:
x (float): point for calculating the probability density function
Returns:
float: probability density function output
"""
return (1.0 / (self.stdev * math.sqrt(2*math.pi))) * math.exp(-0.5*((x - self.mean) / self.stdev) ** 2)
def plot_histogram_pdf(self, n_spaces = 50):
"""
Function to plot the normalized histogram of the data and a plot of the
probability density function along the same range
Args:
n_spaces (int): number of data points
Returns:
list: x values for the pdf plot
list: y values for the pdf plot
"""
mu = self.mean
sigma = self.stdev
min_range = min(self.data)
max_range = max(self.data)
# calculates the interval between x values
interval = 1.0 * (max_range - min_range) / n_spaces
x = []
y = []
# calculate the x values to visualize
for i in range(n_spaces):
tmp = min_range + interval*i
x.append(tmp)
y.append(self.pdf(tmp))
# make the plots
fig, axes = plt.subplots(2,sharex=True)
fig.subplots_adjust(hspace=.5)
axes[0].hist(self.data, density=True)
axes[0].set_title('Normed Histogram of Data')
axes[0].set_ylabel('Density')
axes[1].plot(x, y)
axes[1].set_title('Normal Distribution for n Sample Mean and Sample Standard Deviation')
axes[0].set_ylabel('Density')
plt.show()
return x, y
def __add__(self, other):
"""
Function to add together two Gaussian distributions
Args:
other (Gaussian): Gaussian instance
Returns:
Gaussian: Gaussian distribution
"""
result = Gaussian()
result.mean = self.mean + other.mean
result.stdev = math.sqrt(self.stdev ** 2 + other.stdev ** 2)
return result
def __repr__(self):
"""
Function to output the characteristics of the Gaussian instance
Args:
None
Returns:
string: characteristics of the Gaussian
"""
return "mean {}, standard deviation {}".format(self.mean, self.stdev)
writing distributions/gaussian.py
Experimenting
Testing __repr__ magic function.
from distributions.gaussian import Gaussian Gaussian(20,3) ## checking the object magic function
mean 20, standard deviation 3
Initializing gaussian1 object with 25 mean and 2 standard deviations and then reading random.txt file from data folder.
gaussian1 = Gaussian(25, 2)
gaussian1.read_data_file('data/random.txt')
Calculating probability function on 25 means and 2 Stdev on data
Then, calculating the mean and Stdev of data changes the original mean to 125.1 and Stdev to 210.77 and eventually changing our probability density function.
print("Before pdf : ",round(gaussian1.pdf(25),5)) # checking pdf on initial mean and stdev
print("mean :",round(gaussian1.calculate_mean(),2)) # calculating mean of data
print("Stdev :",round(gaussian1.calculate_stdev(),2)) # calculating stdev of data
print("After pdf : ",round(gaussian1.pdf(75),5)) # calcuation pdf of data
Before pdf : 0.19947 mean : 125.1 Stdev : 210.77 After pdf : 0.00184
Plotting histogram with PDF looks Awsome.
gaussian1.plot_histogram_pdf();

The unittest
The unittest, a unit testing framework was originally inspired by JUnit and has a similar flavour as major unit testing frameworks in other languages. It supports test automation, sharing of setup and shutdown code for tests, aggregation of tests into collections, and independence of the tests from the reporting framework, see the Documentation.
Creating Gaussian Class Test File
Test-First is a great tool. It creates better understanding and productivity in the team. The result is high-quality code – both in terms of early success in finding bugs and implementing features correctly. – Gil Zilberfeld
We are going to use the unittest library to test all of our functions so that in the future if we make any changes we can detect errors within few seconds.
Creating TestGaussianClass and has all the functions to test the functions in Gaussian class. We have used the assertEqual method to hack the validity of functions. I have tested these values myself and then added them individually to test every possibility.
%%writefile test/test_gaussian.py
import unittest
from distributions.gaussian import Gaussian
class TestGaussianClass(unittest.TestCase):
def setUp(self):
self.gaussian = Gaussian(35, 6)
self.gaussian.read_data_file('data/random.txt')
def test_initialization(self):
self.assertEqual(self.gaussian.mean, 35, 'incorrect mean')
self.assertEqual(self.gaussian.stdev, 6, 'incorrect standard deviation')
def test_readdata(self):
self.assertEqual(self.gaussian.data,
[10, 23, 45, 12, 23, 45, 67, 100, 300, 250, 45, 68, 29, 59, 239, 934, 12, 321, 12, 32, 1], 'data not read in correctly')
def test_meancalculation(self):
self.assertEqual(self.gaussian.calculate_mean(),
sum(self.gaussian.data) / float(len(self.gaussian.data)), 'calculated mean not as expected')
def test_stdevcalculation(self):
self.assertEqual(round(self.gaussian.calculate_stdev(), 2), 210.77, 'sample standard deviation incorrect')
self.assertEqual(round(self.gaussian.calculate_stdev(0), 2), 205.69, 'population standard deviation incorrect')
def test_pdf(self):
self.assertEqual(round(self.gaussian.pdf(25), 5), 0.01658,
'pdf function does not give expected result')
self.gaussian.calculate_mean()
self.gaussian.calculate_stdev()
self.assertEqual(round(self.gaussian.pdf(75), 5), 0.00184,
'pdf function after calculating mean and stdev does not give expected result')
def test_add(self):
gaussian_one = Gaussian(25, 3)
gaussian_two = Gaussian(30, 4)
gaussian_sum = gaussian_one + gaussian_two
self.assertEqual(gaussian_sum.mean, 55)
self.assertEqual(gaussian_sum.stdev, 5)
if __name__ == '__main__':
unittest.main()
writing test/test_gaussian.py
Running Test
Let’s run our test file from the test folder using !python. As you can see all tests have passed. I did get multiple errors first and it helped me better understand how my Gaussian class is working at every level.
!python test/test_gaussian.py
...... ---------------------------------------------------------------------- Ran 6 tests in 0.002s OK
Binomial Distribution
The binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes-no question, and each with its own Boolean-valued outcome: success (with probability p) or failure (with probability q = 1 − p). Binomial distribution – Wikipedia

Mean

Variance

Standard Deviation

Probability Density Function

Binomial Class
We will be using the mathematical functions mention above to create mean, standard deviation, and probability density functions. We have done the hard work in Gaussian class now we just need to follow the pattern.
- Initialize the probability and size variable – > p,n
- Initialize the parent class Distribution – > calculating
mean and Stdev and adding it to the parent class. - Create calculate_mean function -> calculate
mean from data - Create calculate_stdev function -> calculate
standard deviation from data - Create replace_stats_with_data function –
> That will calculate, probability and size from data and
then update the new mean and standard deviation. - Create plot_bar function -> display
bar chart using the matplotlib library - Create pdf function -> calculate
the probability density function of data using mean and stdev - Create plot_bar_pdf function -> plot
the pdf of the binomial distribution - Create magic function __add__ -> add
together with two Gaussian distributions objects - Create magic function __repr__ -> output
the characteristics of the Gaussian instance
%%writefile distributions/binomial.py
import math
import matplotlib.pyplot as plt
from .general import Distribution
class Binomial(Distribution):
""" Binomial distribution class for calculating and
visualizing a Binomial distribution.
Attributes:
mean (float) representing the mean value of the distribution
stdev (float) representing the standard deviation of the distribution
data_list (list of floats) a list of floats to be extracted from the data file
p (float) representing the probability of an event occurring
n (int) the total number of trials
"""
def __init__(self, prob=.5, size=20):
self.p = prob
self.n = size
mu = self.calculate_mean()
sigma = self.calculate_stdev()
Distribution.__init__(self, mu, sigma)
def calculate_mean(self):
"""Function to calculate the mean from p and n
Args:
None
Returns:
float: mean of the data set
"""
self.mean = self.p*self.n
return self.mean
def calculate_stdev(self):
"""Function to calculate the standard deviation from p and n.
Args:
None
Returns:
float: standard deviation of the data set
"""
self.stdev = math.sqrt(self.n * self.p * (1 - self.p))
return self.stdev
def replace_stats_with_data(self):
"""Function to calculate p and n from the data set
Args:
None
Returns:
float: the p value
float: the n value
"""
self.n = len(self.data)
self.p = sum(self.data)/len(self.data)
self.mean = self.calculate_mean()
self.stdev = self.calculate_stdev()
return self.p,self.n
def plot_bar(self):
"""Function to output a histogram of the instance variable data using
matplotlib pyplot library.
Args:
None
Returns:
None
"""
plt.bar(x = ['0', '1'], height = [(1 - self.p) * self.n, self.p * self.n])
plt.title('Bar Chart of Data')
plt.xlabel('outcome')
plt.ylabel('count')
def pdf(self, k):
"""Probability density function calculator for the gaussian distribution.
Args:
k (float): point for calculating the probability density function
Returns:
float: probability density function output
"""
a = math.factorial(self.n) / (math.factorial(k) * (math.factorial(self.n - k)))
b = (self.p ** k) * (1 - self.p) ** (self.n - k)
return a * b
def plot_bar_pdf(self):
"""Function to plot the pdf of the binomial distribution
Args:
None
Returns:
list: x values for the pdf plot
list: y values for the pdf plot
"""
x = []
y = []
# calculate the x values to visualize
for i in range(self.n + 1):
x.append(i)
y.append(self.pdf(i))
# make the plots
plt.bar(x, y)
plt.title('Distribution of Outcomes')
plt.ylabel('Probability')
plt.xlabel('Outcome')
plt.show()
return x, y
def __add__(self, other):
"""Function to add together two Binomial distributions with equal p
Args:
other (Binomial): Binomial instance
Returns:
Binomial: Binomial distribution
"""
try:
assert self.p == other.p, 'p values are not equal'
except AssertionError as error:
raise
result = Binomial()
result.p = self.p
result.n = self.n + other.n
result.mean = self.calculate_mean()
result.stdev = self.calculate_stdev()
return result
pass
def __repr__(self):
"""Function to output the characteristics of the Binomial instance
Args:
None
Returns:
string: characteristics of the Gaussian
"""
return "mean {}, standard deviation {}, p {}, n {}".format(self.mean,self.stdev,self.p,self.n)
writing distributions/binomial.py
Experimenting
Testing __repr__ magic function
from distributions.binomial import Binomial Binomial(0.6,300)
mean 180.0, standard deviation 8.48528137423857, p 0.6, n 300
Testing Binomial object and read_data_file
Mean: 8.0, standard deviation: 2.19 Data Loaded : [1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1]
Testing pdf of the initial value of p 0.4 and n 20
Then, using replace_stats_with_data to calculate p and n of data and then recalculating PDF.
Initial PDF 0.07465 Data p: 0.471, n: 17 Data PDF 0.06923
Testing bar plot
B1.plot_bar()

Testing Probability Density Function bar plot.
B1.plot_bar_pdf();

Binomial Class Test Function
Again, we are going to use the unittest library to test all of our functions so that in the future if we make any changes we can detect errors within few seconds.
Creating TestBinomialClass and has all the functions to test the functions in Binomial class. We have used the assertEqual method to hack the validity of functions. I have tested these values myself and then added them individually to test every possibility.
%%writefile test/test_binomial.py
import unittest
from distributions.binomial import Binomial
class TestBinomialClass(unittest.TestCase):
def setUp(self):
self.binomial = Binomial(0.4, 20)
self.binomial.read_data_file('data/binary.txt')
def test_initialization(self):
self.assertEqual(self.binomial.p, 0.4, 'p value incorrect')
self.assertEqual(self.binomial.n, 20, 'n value incorrect')
def test_readdata(self):
self.assertEqual(self.binomial.data,
[1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1], 'data not read in correctly')
def test_calculatemean(self):
mean = self.binomial.calculate_mean()
self.assertEqual(mean, 8)
def test_calculatestdev(self):
stdev = self.binomial.calculate_stdev()
self.assertEqual(round(stdev,2), 2.19)
def test_replace_stats_with_data(self):
p, n = self.binomial.replace_stats_with_data()
self.assertEqual(round(p,3), 0.471)
self.assertEqual(n, 17)
def test_pdf(self):
self.assertEqual(round(self.binomial.pdf(5), 5), 0.07465)
self.assertEqual(round(self.binomial.pdf(3), 5), 0.01235)
self.binomial.replace_stats_with_data()
self.assertEqual(round(self.binomial.pdf(5), 5), 0.06923)
self.assertEqual(round(self.binomial.pdf(3), 5), 0.00963)
def test_add(self):
binomial_one = Binomial(.4, 20)
binomial_two = Binomial(.4, 60)
binomial_sum = binomial_one + binomial_two
self.assertEqual(binomial_sum.p, .4)
self.assertEqual(binomial_sum.n, 80)
if __name__ == '__main__':
unittest.main()
writing test/test_binomial.py
Running Test
Running the test_binomial.py shows that no error was found during testing.
!python test/test_binomial.py
....... ---------------------------------------------------------------------- Ran 7 tests in 0.002s OK
Creating __init__.py Function
We need to create __init__.py file in the distributions folder to initial the Classes within the python file. This will help us call specific classes directly.
We have initiated Binomial and Gaussian class.
%%writefile distributions/__init__.py
from .binomial import Binomial from .gaussian import Gaussian
writing distributions/__init__.py
Creating setup.py Function
setup.py function giving all necessary information about the package. It uses setuptools library to create package name, version, description, author name, etc. We have to create this file outside the distribution folder.
%%writefile setup.py
from setuptools import setup
setup(name='distributions',
version='0.2',
description='Gaussian and Binomial distributions',
packages=['distributions'],
author = "Abid Ali Awan",
author_email = "[email protected]",
zip_safe=False)
Writing setup.py
Directory
You can see how your directory should look like.

Installing distributions Package
Let’s install the package by going into the Linux terminal and going into the work directory by using a cd. Then, use pip install . or pip install -U .
(venv) root@deepnote:~/work # pip install -U .
Processing /work
Building wheels for collected packages: distributions
Building wheel for distributions (setup.py) ... done
Created wheel for distributions: filename=distributions-0.2-py3-none-any.whl size=4800 sha256=39bc76cbf407b2870caea42b684b05efc15641c0583f195f36a315b3bc4476da
Stored in directory: /tmp/pip-ephem-wheel-cache-ef8q6wh9/wheels/95/55/fb/4ee852231f420991169c6c5d3eb5b02c36aea6b6f444965b4b
Successfully built distributions
Installing collected packages: distributions
Attempting uninstall: distributions
Found existing installation: distributions 0.2
Uninstalling distributions-0.2:
Successfully uninstalled distributions-0.2
Successfully installed distributions-0.2
Testing our Package
Let’s test our package directly from the terminal and it seems like everything is working fine. Well done you have created your first python package.
>>> from distributions import Gaussian >>> from distributions import Binomial >>> >>> print(Gaussian(20,6)) mean 20, standard deviation 6 >>> print(Binomial(0.4,50)) mean 20.0, standard deviation 3.4641016151377544, p 0.4, n 50 >>>
Do ask me questions in the comments and that’s it, I hope you enjoyed my article.
If you are still facing problems, I will be making my GitHub repo public after 15th August on a link. You can find more of my work on GitHub, Deepnote, and Medium. I share my articles on LinkedIn every week so follow me there if you want to know more about beginners’ hacks.



