Object Oriented Programming and Packaging
So you've gotten your code to the stage where you'd like to provide stable, usable code for others to use and build upon. This is a fantastic stage to reach! Whether your code is going to be completely open source (visible to everyone in the world) or only available to a select number of collaborators, there are some key things we should keep in mind when making our code available for others to use.
In this workshop we'll walk through taking our code from a simple Python script to a versatile, well-documented Python package. We'll be focusing on the following topics:
- Classes and Inheritance
- Documentation
- Packaging
- Testing
Starting Point
Let's start with some pre-existing code that we've written and would like to package up.
analysis_script.py:
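A minimal sketch of such a script, consistent with the breakdown below (file names, column handling and plot styling are illustrative; the line numbers in the breakdown refer to the original listing):

```python
import numpy as np
import matplotlib.pyplot as plt

# Load the data and create empty arrays for the derived properties
data = np.loadtxt("data.csv", delimiter=",")
mins = np.zeros(data.shape[1])
maxs = np.zeros(data.shape[1])
means = np.zeros(data.shape[1])
stds = np.zeros(data.shape[1])

# Fill the property arrays column by column
for i in range(data.shape[1]):
    mins[i] = data[:, i].min()
    maxs[i] = data[:, i].max()
    means[i] = data[:, i].mean()
    stds[i] = data[:, i].std()

# Min/max normalization...
for i in range(data.shape[1]):
    data[:, i] = (data[:, i] - mins[i]) / (maxs[i] - mins[i])

# ...or, alternatively, standard normalization
# for i in range(data.shape[1]):
#     data[:, i] = (data[:, i] - means[i]) / stds[i]

# Recompute the derived properties now that the data has changed
for i in range(data.shape[1]):
    mins[i] = data[:, i].min()
    maxs[i] = data[:, i].max()
    means[i] = data[:, i].mean()
    stds[i] = data[:, i].std()

# Print a summary of each data column
for i in range(data.shape[1]):
    print(f"Column {i}: min={mins[i]:.3f}, max={maxs[i]:.3f}, "
          f"mean={means[i]:.3f}, std={stds[i]:.3f}")

# Plot the data to a histogram
for i in range(data.shape[1]):
    plt.hist(data[:, i], bins=20, alpha=0.5, label=f"Column {i}")
plt.legend()
plt.savefig("histogram.png")

# Dump the modified data to a new CSV file
np.savetxt("modified_data.csv", data, delimiter=",")
```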
Let's break this down a little:
- On lines 1-2 we're importing the required libraries, namely `numpy` and `matplotlib`.
- On lines 6-12 we're loading in data and assigning empty arrays for properties of our data.
- On lines 15-19 we're assigning to the property arrays.
- On lines 23-26 and 28-31 we're normalizing with either a min/max normalization or a standard normalization.
- On lines 34-38 we're resetting the property arrays.
- On lines 42-45 we're printing out a summary of the data column.
- On lines 49-52 we're plotting the data to a histogram.
- On line 56 we're dumping the modified data to a new CSV file.
Keeping in mind that we want our code to be versatile, easy to use and well documented, let's consider the following takeaways:
- The user needs to have `numpy` and `matplotlib` to run this code. These should be considered requirements or dependencies.
- When we load in data, we perform a number of operations and save some derived data. The user shouldn't need to worry about whether these arrays are getting properly set.
- We want to normalize the data using two similar ideas but different implementations. The user should expect a similar interface when using either method to normalize the data.
- When the data is modified, we need to update the derived data. The user again shouldn't need to worry about correctly setting it; doing so by hand would introduce a potential source of error.
- The derived data is used when printing summaries. We need to make sure it is up to date.
- When dumping to a CSV file, there is no check to see if we're overwriting data. This could lead to an unintentional loss of data.
- There is a lot of repeated code. Repeated code is highly susceptible to errors, as one needs to make sure the same change is made in multiple places.
With these points in mind, let's start writing a Python package to make this workflow more user friendly.
Python Package Layout
Let's start off by creating a new directory called `my_package`.
Within `my_package` we'll create an empty file called `__init__.py` (a `__file__.py` is pronounced "dunder file", hence `__init__.py` is "dunder init"; calling it "init" is also understandable).
This `__init__.py` is a special file in Python with a few functions. Firstly, `__init__.py` tells Python that files in this directory can be `import`-ed into Python: once Python sees `__init__.py`, it knows that any `.py` file within this directory can be `import`-ed.
Secondly, `__init__.py` will act as an entry point to the package (or sub-package, more on this later).
Within `__init__.py` we can specify what is `import`-ed by default when running `from my_package import *`.
We can also do things like specify aliases for functions within `__init__.py`.
Let's say we have the following:
my_package/__init__.py:
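A sketch of what that file might contain (assuming the function lives in `analysis_script.py`):

```python
from .analysis_script import my_super_complicated_long_function_name

# Define a shorter alias for the long function name
short_name = my_super_complicated_long_function_name
```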
Here we're defining `short_name` to be an alias for `my_super_complicated_long_function_name`.
We can then import this function as `from my_package import short_name`.
We'll come back to this in more detail later; for now let's just stick with an empty `my_package/__init__.py`.
Now let's just copy `analysis_script.py` into `my_package` and wrap the main body of the script into a simple function, which takes the input filename as an argument:
my_package/analysis_script.py:
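A sketch of the wrapped script, with the body condensed relative to the original (the real file carries over the full script body):

```python
import numpy as np
import matplotlib.pyplot as plt


def run_analysis(filename):
    """Load a CSV file, normalize each column, then summarize, plot and dump it."""
    data = np.loadtxt(filename, delimiter=",")

    for i in range(data.shape[1]):
        col = data[:, i]
        # Min/max normalization of the column
        data[:, i] = (col - col.min()) / (col.max() - col.min())
        # Summary of the column
        print(f"Column {i}: mean={data[:, i].mean():.3f}, std={data[:, i].std():.3f}")
        # Histogram of the column
        plt.hist(data[:, i], bins=20, alpha=0.5)

    plt.savefig("histogram.png")
    np.savetxt("modified_data.csv", data, delimiter=",")
```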
Let's also add a test data file to the base directory; this can be downloaded from here TO DO.
Our folder layout should look like this:
.
├── data.csv
└── my_package
├── analysis_script.py
└── __init__.py
From here we can try and run the code (I'm using IPython):
In [1]: from my_package.analysis_script import run_analysis
In [2]: run_analysis("./data.csv")
Which worked! However, this is only local to this directory. If we change to a different location and try to rerun the above:
In [1]: from my_package.analysis_script import run_analysis
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[1], line 1
----> 1 from my_package.analysis_script import run_analysis
ModuleNotFoundError: No module named 'my_package'
We want the package to be installed so that we can run the code from anywhere on our system. Let's look at how to do that next.
Environment-wide install using pip and pyproject.toml
There are a few different methods we can choose from when creating a package and installing it:

- Append to the PYTHONPATH: with this method we simply modify the locations where Python searches when running `import`. This isn't a good choice: it requires the user to mess around with environment variables, and not all users will be comfortable doing this.
- Install using `setup.py`: this method will compile the Python code to byte code and place it into the correct location for Python to find when running `import`. Any user who has `pip` installed can install our package using `pip install .`. This isn't a bad option, and it is fairly widely used; however, Python standards and package maintainers are moving away from this method.
- Install using a `pyproject.toml` file: this is very similar to the `setup.py` method, except we'll be defining things like the package name, file locations, additional data files and project metadata all within the `pyproject.toml` file.
`pyproject.toml` is the preferred method for a number of reasons.
Firstly, it is tool agnostic, allowing us to use the installation tools that we are most familiar with.
We can also include the requirements/dependencies within the `pyproject.toml` file.
A `toml` file is a human-readable file format which uses modern syntax.
`pyproject.toml` integrates with other tools such as `commitizen`.
Finally, `pyproject.toml` conforms to Python standards (PEP 517/518)!
Let's start with a simple `pyproject.toml` file:
pyproject.toml:
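Assembled from the fragments we'll walk through below:

```toml
[build-system]
requires = ["setuptools >= 61.0", "setuptools-scm>=8.0"]
build-backend = "setuptools.build_meta"

[project]
dynamic = ["version"]
name = "my_package"
requires-python = ">= 3.8"
dependencies = [
    "numpy",
    "matplotlib",
]
authors = [
  {name = "Stephan O'Brien", email = "[email protected]"},
]
maintainers = [
  {name = "Stephan O'Brien", email = "[email protected]"}
]
description = "A python package for doing cool stuff"
readme = "README.md"
license = {file = "LICENSE"}

[tool.setuptools]
packages = [
    "my_package",
]

[project.urls]
Homepage = "https://github.com/steob92/my_package/"
Documentation = "https://github.com/steob92/my_package/"
Repository = "https://github.com/steob92/my_package.git"
Issues = "https://github.com/steob92/my_package/issues"
Changelog = "https://github.com/steob92/my_package/blob/main/CHANGELOG.md"
```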
There's a lot going on in this file. Let's take it apart:
[build-system]
requires = ["setuptools >= 61.0", "setuptools-scm>=8.0"]
build-backend = "setuptools.build_meta"
Here we are specifying the `build-system`, essentially what we're using to build the package.
In this case we're using `setuptools`, hence adding it as a requirement with the `requires` variable.
[project]
dynamic = ["version"]
name = "my_package"
requires-python = ">= 3.8"
dependencies = [
"numpy",
"matplotlib",
]
authors = [
{name = "Stephan O'Brien", email = "[email protected]"},
]
maintainers = [
{name = "Stephan O'Brien", email = "[email protected]"}
]
description = "A python package for doing cool stuff"
readme = "README.md"
license = {file = "LICENSE"}
Here we're specifying some metadata about our package, including the name, version, who the authors/maintainers are, and a short description.
We can also specify a README and license file; the README acts as an introduction page for our package.
Note that we're also specifying the required Python version with `requires-python = ">= 3.8"`.
This syntax means that we need a Python version greater than or equal to 3.8 to run the code.
We can also specify dependencies using the `dependencies` keyword.
When we build the package, pip will search for these dependencies within our environment and install them if they don't exist.
[tool.setuptools]
packages=[
"my_package",
]
[project.urls]
Homepage = "https://github.com/steob92/my_package/"
Documentation = "https://github.com/steob92/my_package/"
Repository = "https://github.com/steob92/my_package.git"
Issues = "https://github.com/steob92/my_package/issues"
Changelog = "https://github.com/steob92/my_package/blob/main/CHANGELOG.md"
Here we're specifying some URLs that users can consult. For example, the "Issues" URL will link to the GitHub issues page of this repository.
With this file in place, we can simply install the package into our environment using `pip install .`:
Processing /raid/RAID1/Tutorials/python_packaging_working
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: numpy in /home/obriens/mambaforge/envs/dev/lib/python3.11/site-packages (from my_package==0.0.0) (1.26.4)
Requirement already satisfied: matplotlib in /home/obriens/mambaforge/envs/dev/lib/python3.11/site-packages (from my_package==0.0.0) (3.8.0)
Requirement already satisfied: contourpy>=1.0.1 in /home/obriens/mambaforge/envs/dev/lib/python3.11/site-packages (from matplotlib->my_package==0.0.0) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /home/obriens/mambaforge/envs/dev/lib/python3.11/site-packages (from matplotlib->my_package==0.0.0) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in /home/obriens/mambaforge/envs/dev/lib/python3.11/site-packages (from matplotlib->my_package==0.0.0) (4.25.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/obriens/mambaforge/envs/dev/lib/python3.11/site-packages (from matplotlib->my_package==0.0.0) (1.4.4)
Requirement already satisfied: packaging>=20.0 in /home/obriens/mambaforge/envs/dev/lib/python3.11/site-packages (from matplotlib->my_package==0.0.0) (23.1)
Requirement already satisfied: pillow>=6.2.0 in /home/obriens/mambaforge/envs/dev/lib/python3.11/site-packages (from matplotlib->my_package==0.0.0) (10.2.0)
Requirement already satisfied: pyparsing>=2.3.1 in /home/obriens/mambaforge/envs/dev/lib/python3.11/site-packages (from matplotlib->my_package==0.0.0) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in /home/obriens/mambaforge/envs/dev/lib/python3.11/site-packages (from matplotlib->my_package==0.0.0) (2.8.2)
Requirement already satisfied: six>=1.5 in /home/obriens/mambaforge/envs/dev/lib/python3.11/site-packages (from python-dateutil>=2.7->matplotlib->my_package==0.0.0) (1.16.0)
Building wheels for collected packages: my_package
Building wheel for my_package (pyproject.toml) ... done
Created wheel for my_package: filename=my_package-0.0.0-py3-none-any.whl size=2183 sha256=dd2ca561171519e9b5755e595d8ae619d67e3ec7593c00e8e29096d29e6b943d
Stored in directory: /tmp/pip-ephem-wheel-cache-om8ak4ov/wheels/11/62/fc/593354b5442752b3bf9fa9c9fc8fe5e054389e3b47a6fb69bf
Successfully built my_package
Installing collected packages: my_package
Successfully installed my_package-0.0.0
This will now work from any directory as long as we're using the same Python environment!
Converting our script into a versatile package using Object Oriented Programming
As was previously mentioned, `my_package/analysis_script.py` isn't very versatile.
We ideally want something that other users can build upon.
To do this we'll convert some data types into classes.
Let's start by making a base data class.
Create a subdirectory called `my_package/data`; within this directory let's create a `base.py` file:
my_package/data/base.py:
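A sketch consistent with the comments that follow (method names like `dump_to_csv` are illustrative, and the line numbers below refer to the original listing):

```python
from numpy import asarray, copy, savetxt


class Base:
    def __init__(self):
        # The underlying data and its derived properties
        self._data = asarray([])
        self._min = None
        self._max = None
        self._mean = None
        self._std = None

    def __str__(self):
        return (
            f"Data with {self._data.size} entries: "
            f"min={self._min}, max={self._max}, "
            f"mean={self._mean}, std={self._std}"
        )

    @property
    def data(self):
        # "Getter": return a copy so the user never touches _data directly
        return copy(self._data)

    @data.setter
    def data(self, data):
        # "Setter": store the new data and recompute the derived properties
        self._data = asarray(data)
        self._min = self._data.min()
        self._max = self._data.max()
        self._mean = self._data.mean()
        self._std = self._data.std()

    def normalize(self, method="minmax"):
        # Assigning through self.data calls the setter, so the
        # derived properties are recalculated automatically
        if method == "minmax":
            self.data = (self._data - self._min) / (self._max - self._min)
        elif method == "standard":
            self.data = (self._data - self._mean) / self._std
        else:
            raise RuntimeError(f"Unknown normalization method: {method}")

    def dump_to_csv(self, filename):
        # Write the current data out to a CSV file
        savetxt(filename, self._data, delimiter=",")
```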
Some comments:
- On line 1 we're only `import`-ing what we need to use from `numpy`.
- On line 3 we define the name of the class as `Base`.
- On lines 5-10 we define some class members and data that we'll be using.
- On lines 16-19 we're defining the `__str__` method, which defines the behaviour when we run something like `print(my_base_class)`.
- On lines 22-24 we're defining a property called `data` using the `@property` decorator. This acts as a "getter": instead of accessing the real data `_data`, the user accesses a copy of the data called `data`.
- On lines 26-33 we're defining the "setter" of `data` using the `@data.setter` decorator. This defines the behaviour when we're assigning values to `data`. When the user calls `my_base_class.data = some_data`, they're actually calling this function. This allows `_data` to be updated and the data properties to be automatically determined whenever `data` is modified.
- On lines 47-56 we've defined a `normalize` function which allows the user to normalize the data using either a min/max or standard normalization based on a keyword. This defaults to a min/max normalization. Since we've set the property `data`, when we assign with `self.data = ...` the setter function is called, `self._data` is updated, and the data properties (min, max, mean, std) are also recalculated.
- On lines 59-60 we've created a member function to dump the data to a CSV file.
Aside on data classes
Python now has a special class type called "Data Classes". Using data classes allows one to create data-based classes like the one above. For the sake of illustration, we won't be using data classes in this tutorial. See here for more details.
We now want to make this class visible to anyone using this package. Firstly we need to add an `__init__.py` file to this folder:
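A minimal sketch of `my_package/data/__init__.py` (the original listing was a few lines longer, likely with a module docstring):

```python
from .base import Base

__all__ = ["Base"]
```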
Secondly, we need to add this location to the `pyproject.toml` file:
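The relevant change is adding `my_package.data` to the packages list:

```toml
[tool.setuptools]
packages = [
    "my_package",
    "my_package.data",
]
```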
If we now rerun `pip install .`, we can import our base class:
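For example (the values here are illustrative):

In [1]: from my_package.data import Base

In [2]: base = Base()

In [3]: base.data = [1, 2, 3]

In [4]: print(base)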
It is often very useful to have a method or class of randomly generated data for testing pipelines.
Let's build on top of the `Base` class by creating a class that will generate random data.
my_package/data/random_data.py:
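A sketch of this class (the constructor arguments are illustrative; the bullets below reference line numbers in the original listing):

```python
from numpy.random import normal

from .base import Base


class RandomData(Base):
    """A class that generates normally distributed random data for testing."""

    def __init__(self, n_points=100, mean=0.0, std=1.0):
        # Call the parent class's constructor first
        super().__init__()
        # Assigning through the inherited `data` setter means the derived
        # properties (min, max, mean, std) are set automatically
        self.data = normal(loc=mean, scale=std, size=n_points)
```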
- Notice on line 1 we're importing using `from .base import Base`. Here the `.base` is specifying that we're looking for a file called `base.py` within the same directory (`.`). If we wanted to search in the parent directory (one directory above) we'd use `from ..base import Base`. This is similar to the `.` and `..` paths when working with `bash`.
- On line 4 we define the `RandomData` class, specifying that it inherits from the `Base` class. This means that `RandomData` will have all the same functionality as `Base`, meaning the `self.data` setter will also work for `RandomData`.
- On line 17 we're calling the parent class's constructor using `super().__init__()`. Whenever we want to explicitly call the parent class's version of a function, we can call `super().function()`.
Since "my_package.data"
has already been added to the pyproject.toml
file my_package/data/random_data.py
will be installed with pip install .
. We can try run this:
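For example (assuming the constructor defaults sketched above):

In [1]: from my_package.data.random_data import RandomData

In [2]: rnd = RandomData()

In [3]: print(rnd)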
You'll notice that the import is pretty long and cumbersome to type.
Let's make this a little easier for the user by playing around with the `__init__.py` files.
Firstly let's modify `my_package/data/__init__.py`:
my_package/data/__init__.py:
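A sketch of the updated file (keeping the existing `Base` import is assumed):

```python
from .base import Base
from .random_data import RandomData

__all__ = ["RandomData"]
```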
Here we're importing `RandomData`, making it easily available from `my_package.data` rather than `my_package.data.random_data`.
We then create a variable called `__all__` and assign it to a list (`["RandomData"]`). This specifies what is imported when running `from my_package.data import *`.
Next let's modify `my_package/__init__.py`:
my_package/__init__.py:
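A sketch of the top-level file:

```python
from .data import RandomData

__all__ = ["RandomData"]
```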
Notice that rather than importing from `.data.random_data`, we can just import from `.data`.
We can do this because `RandomData` is already visible within `my_package.data`.
We now have a few ways to import `RandomData`:

- `from my_package import *`
- `from my_package import RandomData`
- `from my_package.data import *`
- `from my_package.data import RandomData`
- `from my_package.data.random_data import RandomData`
A user no longer needs to know that `RandomData` is nested within `data`. Reinstalling with `pip install .`, we can run the previous example with:
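In [1]: from my_package import RandomData

In [2]: rnd = RandomData()

In [3]: print(rnd)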
Making a CSV Data loader
Now that we have a base class `Base` and a tester class `RandomData`, let's add a CSV reader class within the `my_package/data` directory:
my_package/data/csv_data.py:
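A sketch consistent with the description below (the line numbers referenced below are those of the original listing):

```python
from numpy import loadtxt

from .base import Base


class CSVData:
    def __init__(self, filename=None):
        # A list of Base objects, one per CSV column
        self.data = []
        if filename is not None:
            self.read_file(filename)

    def read_file(self, filename):
        # Store each column of the CSV file as a Base object
        raw = loadtxt(filename, delimiter=",", ndmin=2)
        for i in range(raw.shape[1]):
            column = Base()
            column.data = raw[:, i]
            self.data.append(column)

    @classmethod
    def from_file(cls, filename):
        # Build a new CSVData object directly from a file
        return cls(filename)
```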
Here `CSVData` doesn't explicitly inherit from `Base`, but there is an internal data member `self.data` which is a list of `Base` objects.
Lines 12-21 define the `read_file` function, which extracts the data from the CSV file and stores each column in a `Base` data type.
On lines 23-25 we're defining a class method using the `@classmethod` decorator.
A class method allows us to call functions associated with the class which aren't tied to a specific instance.
In this case the `from_file` method will call `__init__`, passing the `filename`.
This allows us to generate a new object using `CSVData.from_file(filename)`.
Let's update the `__init__.py` files with the new class:
my_package/data/__init__.py:
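A sketch of the updated file:

```python
from .base import Base
from .random_data import RandomData
from .csv_data import CSVData

__all__ = ["RandomData", "CSVData"]
```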
Next let's modify `my_package/__init__.py`:
my_package/__init__.py:
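And a sketch of the updated top-level file:

```python
from .data import RandomData, CSVData

__all__ = ["RandomData", "CSVData"]
```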
Adding some package data
It might be useful to have some data included within the package.
This could be some reference data or some test data.
Let's start by adding the CSV file into our package; we'll add it at `my_package/package_data/data.csv`.
For the package data to be visible, we'll need to add a `MANIFEST.in` file containing:

include my_package/package_data/*.csv
Let's then update the `CSVData` class to have a method to import this data:
my_package/data/csv_data.py:
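A sketch of the updated class with the added class method (the method name follows the text; the bundled `data.csv` is assumed to be the Titanic dataset, and `pkg_resources.resource_filename` resolves a path inside the installed package):

```python
import pkg_resources
from numpy import loadtxt

from .base import Base


class CSVData:
    def __init__(self, filename=None):
        # A list of Base objects, one per CSV column
        self.data = []
        if filename is not None:
            self.read_file(filename)

    def read_file(self, filename):
        # Store each column of the CSV file as a Base object
        raw = loadtxt(filename, delimiter=",", ndmin=2)
        for i in range(raw.shape[1]):
            column = Base()
            column.data = raw[:, i]
            self.data.append(column)

    @classmethod
    def from_file(cls, filename):
        # Build a new CSVData object directly from a file
        return cls(filename)

    @classmethod
    def get_titanic_data(cls):
        # Locate the bundled CSV file relative to the installed package
        filename = pkg_resources.resource_filename(
            "my_package", "package_data/data.csv"
        )
        return cls(filename)
```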
Here we're using `pkg_resources` to access the location of the package data from the installation location.
Now calling `CSVData.get_titanic_data()` will return a `CSVData` object with the dataset loaded.
Documenting Our Code
So far we've written code with little-to-no documentation. For a small single-use analysis script, this might be OK. However, as our code increases in complexity and we want to start sharing it with others, we need to better document our code. Furthermore, we need to remember that undocumented code may become completely unreadable to anyone not in our current mindset. This includes ourselves! A very common problem people have is returning to old code and failing to understand it without hours of reading and debugging.
In this section we'll look at the following topics:
- Documenting our code with docstrings, providing useful help messages for users.
- Using "type hinting" to show the data types being used and returned by functions.
- Combining the above to easily make documentation pages like numba.readthedocs.io.
Docstrings
Docstrings are a standard method used to describe one's code. Docstrings use a format that is well defined and utilised by both Python and documentation software.
Here are some links discussing documenting one's code using docstrings:
- Real Python: Documenting Python Code
- Example Google Style Python Docstrings
- The Hitchhiker's Guide to Python: Documentation
There are many different standards, but they all use a common syntax: a long string defined just after a function/class declaration:
docstring_example.py:
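The example function (its docstring matches the `help` output shown further down):

```python
def my_function(a, b):
    """A simple function to add two numbers together

    This function will take in two numbers (a and b) and return
    their sum (a + b)

    Parameters
    ----------
    a : float
        The first number
    b : float
        The second number

    Returns
    -------
    float
        The sum of a and b
    """
    return a + b
```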
Notice that the docstring starts and ends with three quotation marks (`"""`).
It is placed directly after the function/class declaration.
The first line is a high-level description of the function/class, followed by an optional more detailed description.
We then define `Parameters` and `Returns` sections, listing what the parameters/returns are, their types and how they are used.
We can also add things like examples, tests (which can be evaluated with a module like `doctest`) and any error handling done via `raise` (see here for some examples).
Using a docstring like this, a new developer has a good idea of what the function does and the expected behaviour.
Furthermore, the user can also call the `help` function to return the docstring:
help(my_function)
Help on function my_function in module __main__:
my_function(a, b)
A simple function to add two numbers together
This function will take in two numbers (a and b) and return their sum (a + b)
Parameters
----------
a : float
The first number
b : float
The second number
Returns
-------
float
The sum of a and b
Let's start looking at `my_package/data/base.py` and add some docstrings:
my_package/data/base.py:
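The fully documented file is long; as a flavour, here's a sketch of the `normalize` method with its docstring (the `Raises` section documents the error discussed below):

```python
class Base:
    # ... rest of the class as before ...

    def normalize(self, method="minmax"):
        """Normalize the data in place.

        Parameters
        ----------
        method : str
            The normalization method, either "minmax" (default)
            or "standard".

        Raises
        ------
        RuntimeError
            If an unknown normalization method is requested.
        """
        if method == "minmax":
            self.data = (self._data - self._min) / (self._max - self._min)
        elif method == "standard":
            self.data = (self._data - self._mean) / self._std
        else:
            raise RuntimeError(f"Unknown normalization method: {method}")
```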
In the above example we've added docstrings to `my_package/data/base.py`.
Notice that the `normalize` function has the potential to raise a `RuntimeError` (line 115).
This potential error is documented in the docstring on lines 102-105.
Type Hinting
Type hinting is an excellent way to inform developers and users of what type the arguments and returns are. This is done by labeling arguments and returns with the expected type. Let's look again at the simple function we used to get the sum of two numbers:
type_hinting_example.py:
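The same function as before, now with type hints (matching the signature quoted below):

```python
def my_function(a : float, b : float) -> float:
    """A simple function to add two numbers together

    Parameters
    ----------
    a : float
        The first number
    b : float
        The second number

    Returns
    -------
    float
        The sum of a and b
    """
    return a + b
```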
Here on line 1 we modify the function definition to include the data types.
We specify the types of `a` and `b` using the syntax `var : type`; in this case we use `a : float, b : float` to indicate that `a` and `b` are expected to be of type `float`.
Additionally, we indicate the return type using `-> type` at the end of the function signature.
In this case we're returning a float, so `def my_function(a : float, b : float) -> float:`.
We can also specify default parameters using the syntax `var : type = default`. For example:
type_hinting_example.py:
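A sketch giving `b` a default value of `1.0`, as described below:

```python
def my_function(a : float, b : float = 1.0) -> float:
    """A simple function to add two numbers together

    Parameters
    ----------
    a : float
        The first number
    b : float
        The second number, defaults to 1.0

    Returns
    -------
    float
        The sum of a and b
    """
    return a + b
```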
Here `b` is of type `float` and has a default value of `1.0`.
This means that if the user calls `my_function(3)`, they'll get a return of `4.0`, whereas if the user calls `my_function(3, 2)` they'll get a return of `5.0`.
A very useful package to use when implementing type hinting is the `typing` library.
`typing` has a lot of useful imports, for example:

- `List[type]`, which can be used when we expect a `list` whose elements are of type `type`.
- `Tuple[type1, type2, ..., typen]`, which can be used when we expect a `tuple` of types `type1, ..., typen`.
- `Optional[type]`, which can be used for optional values.
- `Union[type1, type2]`, which can be used when a variable can have either `type1` or `type2`.

See the typing cheat sheet for more examples.
Let's use `typing` in the `my_function` example:
type_hinting_example.py:
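The updated example, matching the description below:

```python
from typing import Optional


def my_function(a : float, b : Optional[float] = 1.0) -> float:
    """A simple function to add two numbers together

    Parameters
    ----------
    a : float
        The first number
    b : Optional[float]
        The second number, defaults to 1.0

    Returns
    -------
    float
        The sum of a and b
    """
    return a + b
```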
On line 1 we import `Optional` from `typing`.
On line 2 the `my_function` definition is updated to reflect that `b` is an `Optional` parameter of type `float`.
We also update the docstring on line 12 to show that this is an `Optional` parameter.
Now that we have an idea of how type hinting works, let's update the `my_package/data/base.py` file:
my_package/data/base.py:
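An excerpt sketching the kind of annotations described below (attribute and method names are illustrative; the line numbers below refer to the full listing):

```python
from numpy import ndarray, asarray, copy
from typing import List, Tuple, Optional


class Base:
    def __init__(self) -> None:
        # Annotated class members; the derived properties start as None
        self._data : ndarray = asarray([])
        self._min : Optional[float] = None
        self._max : Optional[float] = None
        self._mean : Optional[float] = None
        self._std : Optional[float] = None

    @property
    def data(self) -> ndarray:
        return copy(self._data)

    @data.setter
    def data(self, data : List[float]) -> None:
        self._data = asarray(data)
        self._min = self._data.min()
        self._max = self._data.max()
        self._mean = self._data.mean()
        self._std = self._data.std()

    def get_range(self) -> Tuple[float, float]:
        # Illustrative helper returning the (min, max) of the data
        return (self._min, self._max)
```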
On line 2 we import `List`, `Tuple` and `Optional` from `typing`.
On lines 8-12 the types of the class data are added, noting that `ndarray` was imported on line 1.
On lines 90 and 92 we're utilizing the `typing` imports in the function definitions.
Some functions don't return anything (e.g. line 48); in these cases we can use the type `None` to specify that nothing is returned.