Benchmarking and Profiling¶

Benchmarking and Profiling are crucial techniques used to quantify the performance of code and measure improvements in its execution. In Python, understanding the performance of different implementations or comparing the efficiency of algorithms is essential, especially in scenarios where optimization is vital.

When trying to optimize code, it is important to remember that the time required to optimize the code, might be better spent on other tasks. This is to say, when developing code the primary goal is to deliver working code. Delivering fast and efficent code is a secondary goal. With this in mind, it is important not to prematurely optimize code. One can spend an incredible ammount of time trying to optimize code before it even works! Futhermore, one could spend house of work to save only seconds of run time. A balance needs to be struck!

A suitable development workflow would look something like this:

Get a working instance of the code.
Determine suitable benchmarks for evaluating the code (run time, iterations/second, iterations/cpu cores, percision on output, memory usage etc), and set target performance goals (e.g. total run time, total memory budget, etc).
Evaluate the performance using the predefined benchmarks.
If performance goals aren't met, profile the code to determine where the bottle necks are and whether changes can be made.
Implement suitable changes to the code and repeat from step 3.

Optimizing code can be a very time consuming process. There is a balance that needs to be made between acceptable run time and developer hours needed. Making mulitple changes to the code at once can also introduce bugs that might be difficult to debug. Either make small iterative changes to the code, or make sure the code passes pre-defined tests after each development cycle.

Python provides the timeit module, which enables the measurement of execution time for specific code segments. This module offers a simple yet effective way to evaluate the performance of individual code snippets or functions:

%%time will time the contents of an entire cell with a single run
%%timeit will time the contents of an entire cell, averaging over multiples run of that cell
%time will time a signle line with a single run
%timeit will time a single line, averaging over multiples run of that cell

timeit will also suspend some features, such as the garbage collector, to get an accurate measurement of the code's runtime.

In [29]:

Copied!





import time
import timeit
import numpy as np
import matplotlib.pyplot as plt
import time
import timeit
import numpy as np
import matplotlib.pyplot as plt

In [30]:

Copied!





%%timeit
test = 0
for i in range(100000):
    test += 1
%%timeit
test = 0
for i in range(100000):
    test += 1

2.48 ms ± 31.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

We can also measure the time of the cell from one iteration using %%time

In [31]:

Copied!





%%time
test = 0
for i in range(100000):
    test += 1
%%time
test = 0
for i in range(100000):
    test += 1

CPU times: user 4.59 ms, sys: 0 ns, total: 4.59 ms
Wall time: 4.61 ms

We see a number of different times listed here:

Wall time: This is the time that has passed during the running of the code
CPU time: This is the time that the CPU was busy. This also includes time when the CPU has "swapped" its attention to a different process.
user: This is the time that the process has been running on the CPU but outside of the system kernel
sys : This is the time taken inside the system kernel. This could indicate that the code requires a lot of system calls rather than calls written by the developer.

As a rule of thumb:

CPU Time / Wall Time ~ 1. The process spent most of it's time using the CPU. A faster CPU could improve performance.
CPU Time / Wall Time < 1. This suggest the process has spent time waiting. This could be resource allocations, network/hardware IO, locks (to prevent data races) etc. The smaller the number, the more time spent waiting. 0.9 would suggest 10% of the time the CPU is idle.
CPU Time / Wall Time > 1. This suggests that we are utilizing multiple processors when running our code.

If we are running with N processors:

CPU Time / Wall Time ~ N. The process is well distributed across all CPUs with little to no idle time.
CPU Time / Wall Time < N. This suggest the process has spent time waiting. This could be resource allocations, network/hardware IO, locks (to prevent data races) etc.

We can use %timeit to time an single line or process, averaging over multiple runs:

In [32]:

Copied!





print ("This line won't be timed")
# Either will this line
test = np.random.random(1000)
# This line will be timed
%timeit test = np.random.random(1000)
print ("This line won't be timed")
# Either will this line
test = np.random.random(1000)
# This line will be timed
%timeit test = np.random.random(1000)

This line won't be timed
4.14 µs ± 87.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Or we can use %time to time an single line or process, with no averaging:

In [33]:

Copied!





print ("This line won't be timed")
# Either will this line
test = np.random.random(1000)
# This line will be timed
%time test = np.random.random(1000)
print ("This line won't be timed")
# Either will this line
test = np.random.random(1000)
# This line will be timed
%time test = np.random.random(1000)

This line won't be timed
CPU times: user 100 µs, sys: 4 µs, total: 104 µs
Wall time: 78.7 µs

Let's look at comparing two functions. Consider vector addition. In a pure Python, we would use a for loop to iterate over each element. We could also use numpy's inbuild functions to perform the addition/

In [34]:

Copied!





def add_two_arrays(a : np.ndarray,b : np.ndarray) -> np.ndarray:
    """Add two numpy arrays togeter using a for loop

    Args:
        a : array 1 to be added
        b : array 2 to be added

    Returns:
        array of a + b
    """
    c = np.zeros(len(a))
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c

def add_two_arrays_numpy(a : np.ndarray, b : np.ndarray) -> np.ndarray:
    """Add two numpy arrays togeter using vectorization

    Args:
        a : array 1 to be added
        b : array 2 to be added

    Returns:
        array of a + b
    """
    c = a + b
    return c
def add_two_arrays(a : np.ndarray,b : np.ndarray) -> np.ndarray:
    """Add two numpy arrays togeter using a for loop

    Args:
        a : array 1 to be added
        b : array 2 to be added

    Returns:
        array of a + b
    """
    c = np.zeros(len(a))
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c

def add_two_arrays_numpy(a : np.ndarray, b : np.ndarray) -> np.ndarray:
    """Add two numpy arrays togeter using vectorization

    Args:
        a : array 1 to be added
        b : array 2 to be added

    Returns:
        array of a + b
    """
    c = a + b
    return c

In [35]:

Copied!





test_x = np.random.random(1000)
test_y = np.random.random(1000)

print("Pure Python:")
%timeit test_z = add_two_arrays(test_x, test_y)
print("Numpy:")
%timeit test_z = add_two_arrays_numpy(test_x, test_y)
test_x = np.random.random(1000)
test_y = np.random.random(1000)

print("Pure Python:")
%timeit test_z = add_two_arrays(test_x, test_y)
print("Numpy:")
%timeit test_z = add_two_arrays_numpy(test_x, test_y)

Pure Python:
155 µs ± 742 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Numpy:

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Cell In[35], line 7
      5 get_ipython().run_line_magic('timeit', 'test_z = add_two_arrays(test_x, test_y)')
      6 print("Numpy:")
----> 7 get_ipython().run_line_magic('timeit', 'test_z = add_two_arrays_numpy(test_x, test_y)')

File ~/miniforge3/envs/workshop/lib/python3.10/site-packages/IPython/core/interactiveshell.py:2480, in InteractiveShell.run_line_magic(self, magic_name, line, _stack_depth)
   2478     kwargs['local_ns'] = self.get_local_scope(stack_depth)
   2479 with self.builtin_trap:
-> 2480     result = fn(*args, **kwargs)
   2482 # The code below prevents the output from being displayed
   2483 # when using magics with decorator @output_can_be_silenced
   2484 # when the last Python token in the expression is a ';'.
   2485 if getattr(fn, magic.MAGIC_OUTPUT_CAN_BE_SILENCED, False):

File ~/miniforge3/envs/workshop/lib/python3.10/site-packages/IPython/core/magics/execution.py:1189, in ExecutionMagics.timeit(self, line, cell, local_ns)
   1186         if time_number >= 0.2:
   1187             break
-> 1189 all_runs = timer.repeat(repeat, number)
   1190 best = min(all_runs) / number
   1191 worst = max(all_runs) / number

File ~/miniforge3/envs/workshop/lib/python3.10/timeit.py:206, in Timer.repeat(self, repeat, number)
    204 r = []
    205 for i in range(repeat):
--> 206     t = self.timeit(number)
    207     r.append(t)
    208 return r

File ~/miniforge3/envs/workshop/lib/python3.10/site-packages/IPython/core/magics/execution.py:173, in Timer.timeit(self, number)
    171 gc.disable()
    172 try:
--> 173     timing = self.inner(it, self.timer)
    174 finally:
    175     if gcold:

File <magic-timeit>:1, in inner(_it, _timer)

KeyboardInterrupt:

We see a dramatic improvement by using NumPy arrays to add two vectors together. This is because NumPy is a highly optimized library, with the low-level code written in C++. Is this always the case?

In [ ]:

Copied!





def time_addition(n):
    
    test_x = np.random.random(n)
    test_y = np.random.random(n)
    
    n_repeat = 10
    
    # pure python
    t_start = time.time()
    for _ in range(n_repeat):
        _ = add_two_arrays(test_x, test_y)
    t_pure = np.median(time.time() - t_start)

    # numpy
    t_start = time.time()
    for _ in range(n_repeat):
        _ = add_two_arrays_numpy(test_x, test_y)
    t_numpy = np.median(time.time() - t_start)

    return t_pure, t_numpy
def time_addition(n):
    
    test_x = np.random.random(n)
    test_y = np.random.random(n)
    
    n_repeat = 10
    
    # pure python
    t_start = time.time()
    for _ in range(n_repeat):
        _ = add_two_arrays(test_x, test_y)
    t_pure = np.median(time.time() - t_start)

    # numpy
    t_start = time.time()
    for _ in range(n_repeat):
        _ = add_two_arrays_numpy(test_x, test_y)
    t_numpy = np.median(time.time() - t_start)

    return t_pure, t_numpy

In [17]:

Copied!





n_array = np.arange(20)
times_py = np.zeros(n_array.shape)
times_np = np.zeros(n_array.shape)

for i, n in enumerate(n_array):
    times_py[i], times_np[i] = time_addition(int(2**n))
n_array = np.arange(20)
times_py = np.zeros(n_array.shape)
times_np = np.zeros(n_array.shape)

for i, n in enumerate(n_array):
    times_py[i], times_np[i] = time_addition(int(2**n))

In [18]:

Copied!





plt.plot(2 ** n_array, times_py, "C0s:", label = "Pure Python")
plt.plot(2 ** n_array, times_np, "C1o--", label = "Numpy")
plt.loglog()
plt.grid()
plt.legend()
plt.xlabel("Array Length")
plt.ylabel("Run Time [s]");
plt.plot(2 ** n_array, times_py, "C0s:", label = "Pure Python")
plt.plot(2 ** n_array, times_np, "C1o--", label = "Numpy")
plt.loglog()
plt.grid()
plt.legend()
plt.xlabel("Array Length")
plt.ylabel("Run Time [s]");

No description has been provided for this image

At very small array sizes the performance is comparable. This is likely due to the overhead of calling the low level NumPy code. The NumPy perfomance saturates around $10^2$ - $10^3$ before scaling in a comparatable fashion to pure python, however it is still orders of magnitude faster. This perfomance will be system dependent!

It might be easy to assume for the above example that NumPy is always better than pure Python, but this isn't always the case and we should always consider our specific needs.

If we where writing code to operate on small (O($10^1$)), we may be faster to use a pure python implementation rather than importing NumPy and introducing an additional overhead.

Line profiling¶

Line profiling is a useful tool for finding bottlenecks within a block of code. There are many profilers one can use:

cProfiles
line_profiler
todo get more

For this example, we'll use line_profiler

mamba/conda install line_profiler

or

pip install line_profiler

Let's look at an expensive function that uses a MC method to estimate $\pi$.

We're going to throw random points $x, y \in [0,1]$ into a square of length 1. This will give an estimate of the the area of the square:

$$ A_{square} \propto N_{total}$$ The area of a circle is given by: $$ A_{circle} = \pi r^2$$ Since we're throwing random numbers in the range $[0,1]$, we will fill a quatre of a circle. $$ A_{circle} \propto 4 * \tilde{N_{within}}$$ Where $\tilde{N_{within}}$ is the number of samples within 1 of the origin (or $x^2 + y^2 <1$).

Combing these, we can estimate $pi$ as: $$ \pi \approx 4 \frac{N_{within}}{N_{total}}$$

In [37]:

Copied!





def monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = np.random.random()
        y = np.random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples
def monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = np.random.random()
        y = np.random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

In [38]:

Copied!

n_range = np.arange(7)

pi_diff = [np.abs(np.pi - monte_carlo_pi(10**n)) for n in n_range]
n_range = np.arange(7)

pi_diff = [np.abs(np.pi - monte_carlo_pi(10**n)) for n in n_range]

In [39]:

Copied!





plt.plot(10**n_range, pi_diff)
plt.xlabel("Number of Random Samples")
plt.ylabel("$|\pi_{NumPy} - \pi_{MC}|$")
plt.grid()
plt.xscale('log')
#plt.yscale('log')
plt.plot(10**n_range, pi_diff)
plt.xlabel("Number of Random Samples")
plt.ylabel("$|\pi_{NumPy} - \pi_{MC}|$")
plt.grid()
plt.xscale('log')
#plt.yscale('log')

We can see that as $N_{total}$ increases we tend to the NumPy value of $\pi$. We obtain deminising returns at around $10^4$ samples. Let's choose this as our benchmark point.

In [40]:

Copied!

n_total = 10 ** 4
n_total = 10 ** 4

How long does this take to run?

In [41]:

Copied!

%timeit monte_carlo_pi(n_total)
%timeit monte_carlo_pi(n_total)

6.14 ms ± 51.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [42]:

Copied!





n_run = 100
n_repeat = 10
t_original = np.array(
    timeit.repeat(f"monte_carlo_pi({n_total})", "from __main__ import monte_carlo_pi", repeat = n_repeat,  number = n_run)
) / n_run
n_run = 100
n_repeat = 10
t_original = np.array(
    timeit.repeat(f"monte_carlo_pi({n_total})", "from __main__ import monte_carlo_pi", repeat = n_repeat,  number = n_run)
) / n_run

In [43]:

Copied!

print(f"Original Runtime = {np.mean(t_original) : 0.2e} +/- {np.std(t_original) :0.2e} s")
print(f"Original Runtime = {np.mean(t_original) : 0.2e} +/- {np.std(t_original) :0.2e} s")

Original Runtime =  6.36e-03 +/- 5.86e-04 s

If we want to speed up this calculation, we should first find what is the major bottleneck.

In [45]:

Copied!

%load_ext line_profiler
%load_ext line_profiler

The line_profiler extension is already loaded. To reload it, use:
  %reload_ext line_profiler

In [46]:

Copied!

%lprun -f monte_carlo_pi monte_carlo_pi(n_total)
%lprun -f monte_carlo_pi monte_carlo_pi(n_total)

Timer unit: 1e-09 s

Total time: 0.0180272 s
File: /tmp/ipykernel_235029/1024659408.py
Function: monte_carlo_pi at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def monte_carlo_pi(nsamples):
     2         1        852.0    852.0      0.0      acc = 0
     3     10001    1441541.0    144.1      8.0      for i in range(nsamples):
     4     10000    5764246.0    576.4     32.0          x = np.random.random()
     5     10000    5585385.0    558.5     31.0          y = np.random.random()
     6     10000    3567354.0    356.7     19.8          if (x ** 2 + y ** 2) < 1.0:
     7      7911    1666182.0    210.6      9.2              acc += 1
     8         1       1623.0   1623.0      0.0      return 4.0 * acc / nsamples

The above shows a breakdown of where we're spending time. We have each line of the function broken down. We can see that the allocations of acc and the return of 4.0 * acc / nsamples are expensive operations, but this only happens once per execution. Since these only happen once there may be some "noise" effecting these estimates.

for i in range(nsamples): takes 8% of our time. This is pretty significant. Python is also nutoriously bad at looping. If we can avoid looping in Python we should.

We also see that acc += 1 doesn't happen every iteration, and we spend ~20% of our time checking if (x ** 2 + y ** 2) < 1.0. We can see that the dominate part of the time is spent generating random numbers (x/y = np.random.random()) which takes ~60% of the run time.

If we were looking to speed up this function, we should work target the most expensive parts first, this would suggest targeting:

        x = np.random.random()
        y = np.random.random()
        if (x ** 2 + y ** 2) < 1.0:

Which constitutes 80% or our run time. Since these are ran 10000 times, it might make more sense to somehow only run this once. But how...

In [47]:

Copied!





def monte_carlo_pi_pure_numpy(nsamples):
    # Create random arrays with required number of samples
    # Use the highly optimized numpy functions instead of for loop
    x = np.random.random(nsamples)
    y = np.random.random(nsamples)
    # Do the filtering and calculation in the one line
    return 4.0 *np.sum((x ** 2 + y ** 2) < 1.0) / nsamples
def monte_carlo_pi_pure_numpy(nsamples):
    # Create random arrays with required number of samples
    # Use the highly optimized numpy functions instead of for loop
    x = np.random.random(nsamples)
    y = np.random.random(nsamples)
    # Do the filtering and calculation in the one line
    return 4.0 *np.sum((x ** 2 + y ** 2) < 1.0) / nsamples

In [48]:

Copied!

%timeit monte_carlo_pi_pure_numpy(n_total)
%timeit monte_carlo_pi_pure_numpy(n_total)

90.5 µs ± 3.92 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [31]:

Copied!





n_run = 1000
n_repeat = 10
t_optimized = np.array(
    timeit.repeat(f"monte_carlo_pi_pure_numpy({n_total})", "from __main__ import monte_carlo_pi_pure_numpy", repeat = n_repeat,  number = n_run)
) / n_run
t_optimized
n_run = 1000
n_repeat = 10
t_optimized = np.array(
    timeit.repeat(f"monte_carlo_pi_pure_numpy({n_total})", "from __main__ import monte_carlo_pi_pure_numpy", repeat = n_repeat,  number = n_run)
) / n_run
t_optimized

Out[31]:

array([1.30074857e-04, 1.22065353e-04, 8.42903380e-05, 8.41132850e-05,
       8.37166480e-05, 8.39098730e-05, 8.36135550e-05, 8.34919960e-05,
       8.35091390e-05, 8.30124940e-05])

In [32]:

Copied!

print(f"Optimized Runtime = {np.mean(t_optimized) : 0.2e} +/- {np.std(t_optimized) :0.2e} s")
print(f"Optimized Runtime = {np.mean(t_optimized) : 0.2e} +/- {np.std(t_optimized) :0.2e} s")

Optimized Runtime =  9.22e-05 +/- 1.70e-05 s

In [52]:

Copied!

%lprun -f monte_carlo_pi_pure_numpy monte_carlo_pi_pure_numpy(n_total)
%lprun -f monte_carlo_pi_pure_numpy monte_carlo_pi_pure_numpy(n_total)

Timer unit: 1e-09 s

Total time: 0.000826387 s
File: /tmp/ipykernel_235029/764224049.py
Function: monte_carlo_pi_pure_numpy at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def monte_carlo_pi_pure_numpy(nsamples):
     2                                               # Create random arrays with required number of samples
     3                                               # Use the highly optimized numpy functions instead of for loop
     4         1     184658.0 184658.0     22.3      x = np.random.random(nsamples)
     5         1     180490.0 180490.0     21.8      y = np.random.random(nsamples)
     6                                               # Do the filtering and calculation in the one line
     7         1     461239.0 461239.0     55.8      return 4.0 *np.sum((x ** 2 + y ** 2) < 1.0) / nsamples

In [37]:

Copied!

speedup =  np.mean(t_original) / np.mean(t_optimized)
print (f"Optmization speeds up the code by a factor of {speedup:0.2f}")
speedup =  np.mean(t_original) / np.mean(t_optimized)
print (f"Optmization speeds up the code by a factor of {speedup:0.2f}")

Optmization speeds up the code by a factor of 65.55

Can we do better?¶

We're already using highly optimized NumPy which is based on C/C++ code. However we're still interpting slow bytecode at run time and running on one core.

In [41]:

Copied!

# !pip install numba
# !pip install numba

In [53]:

Copied!

from numba import jit
from numba import jit

In [54]:

Copied!





@jit
def monte_carlo_pi_pure_numpy_jit(nsamples):
    # Create random arrays with required number of samples
    # Use the highly optimized numpy functions instead of for loop
    x = np.random.random(nsamples)
    y = np.random.random(nsamples)
    # Do the filtering and calculation in the one line
    return 4.0 *np.sum((x ** 2 + y ** 2) < 1.0) / nsamples
@jit
def monte_carlo_pi_pure_numpy_jit(nsamples):
    # Create random arrays with required number of samples
    # Use the highly optimized numpy functions instead of for loop
    x = np.random.random(nsamples)
    y = np.random.random(nsamples)
    # Do the filtering and calculation in the one line
    return 4.0 *np.sum((x ** 2 + y ** 2) < 1.0) / nsamples

In [55]:

Copied!





t1 = time.time()
monte_carlo_pi_pure_numpy_jit(n_total)
telap = time.time() - t1
print (f"This took {telap}s")
t1 = time.time()
monte_carlo_pi_pure_numpy_jit(n_total)
telap = time.time() - t1
print (f"This took {telap}s")

This took 0.818145751953125s

In [56]:

Copied!





t1 = time.time()
monte_carlo_pi_pure_numpy_jit(n_total)
telap = time.time() - t1
print (f"This took {telap}s")
t1 = time.time()
monte_carlo_pi_pure_numpy_jit(n_total)
telap = time.time() - t1
print (f"This took {telap}s")

This took 0.00010251998901367188s

In [57]:

Copied!





n_run = 1000
n_repeat = 10
t_jitted = np.array(
    timeit.repeat(f"monte_carlo_pi_pure_numpy_jit({n_total})", "from __main__ import monte_carlo_pi_pure_numpy_jit", repeat = n_repeat,  number = n_run)
) / n_run
# t_jitted
n_run = 1000
n_repeat = 10
t_jitted = np.array(
    timeit.repeat(f"monte_carlo_pi_pure_numpy_jit({n_total})", "from __main__ import monte_carlo_pi_pure_numpy_jit", repeat = n_repeat,  number = n_run)
) / n_run
# t_jitted

In [58]:

Copied!

print(f"jitted Runtime = {np.mean(t_jitted) : 0.2e} +/- {np.std(t_jitted) :0.2e} s")
print(f"jitted Runtime = {np.mean(t_jitted) : 0.2e} +/- {np.std(t_jitted) :0.2e} s")

jitted Runtime =  6.72e-05 +/- 1.59e-05 s

In [60]:

Copied!

speedup =  np.mean(t_original) / np.mean(t_jitted)
print (f"Optmization speeds up the code by a factor of {speedup:0.2f}")
speedup =  np.mean(t_original) / np.mean(t_jitted)
print (f"Optmization speeds up the code by a factor of {speedup:0.2f}")

Optmization speeds up the code by a factor of 94.69

In [61]:

Copied!





@jit( parallel = True )
def monte_carlo_pi_pure_numpy_jit_parallel(nsamples):
    # Create random arrays with required number of samples
    # Use the highly optimized numpy functions instead of for loop
    x = np.random.random(nsamples)
    y = np.random.random(nsamples)
    # Do the filtering and calculation in the one line
    return 4.0 *np.sum((x ** 2 + y ** 2) < 1.0) / nsamples
@jit( parallel = True )
def monte_carlo_pi_pure_numpy_jit_parallel(nsamples):
    # Create random arrays with required number of samples
    # Use the highly optimized numpy functions instead of for loop
    x = np.random.random(nsamples)
    y = np.random.random(nsamples)
    # Do the filtering and calculation in the one line
    return 4.0 *np.sum((x ** 2 + y ** 2) < 1.0) / nsamples

In [62]:

Copied!





t1 = time.time()
monte_carlo_pi_pure_numpy_jit_parallel(n_total)
telap = time.time() - t1
print (f"This took {telap}s")
t1 = time.time()
monte_carlo_pi_pure_numpy_jit_parallel(n_total)
telap = time.time() - t1
print (f"This took {telap}s")

This took 0.7153468132019043s

In [63]:

Copied!





n_run = 10000
n_repeat = 10
t_jitted_parallel = np.array(
    timeit.repeat(f"monte_carlo_pi_pure_numpy_jit_parallel({n_total})", "from __main__ import monte_carlo_pi_pure_numpy_jit_parallel", repeat = n_repeat,  number = n_run)
) / n_run
t_jitted_parallel
n_run = 10000
n_repeat = 10
t_jitted_parallel = np.array(
    timeit.repeat(f"monte_carlo_pi_pure_numpy_jit_parallel({n_total})", "from __main__ import monte_carlo_pi_pure_numpy_jit_parallel", repeat = n_repeat,  number = n_run)
) / n_run
t_jitted_parallel

Out[63]:

array([3.53246284e-05, 1.69790315e-05, 2.01592582e-05, 2.66108533e-05,
       6.16496429e-05, 2.76365722e-05, 3.48632386e-05, 1.14208586e-05,
       1.32643553e-05, 1.14757513e-05])

In [64]:

Copied!

print(f"jitted parallel Runtime = {np.mean(t_jitted_parallel) : 0.2e} +/- {np.std(t_jitted_parallel) :0.2e} s")
print(f"jitted parallel Runtime = {np.mean(t_jitted_parallel) : 0.2e} +/- {np.std(t_jitted_parallel) :0.2e} s")

jitted parallel Runtime =  2.59e-05 +/- 1.46e-05 s

In [65]:

Copied!

speedup =  np.mean(t_original) / np.mean(t_jitted_parallel)
print (f"Optmization speeds up the code by a factor of {speedup:0.2f}")
speedup =  np.mean(t_original) / np.mean(t_jitted_parallel)
print (f"Optmization speeds up the code by a factor of {speedup:0.2f}")

Optmization speeds up the code by a factor of 245.19

In [66]:

Copied!





@jit( parallel = True, nopython = True)
def monte_carlo_pi_pure_numpy_jit_parallel_nopython(nsamples):
    # Create random arrays with required number of samples
    # Use the highly optimized numpy functions instead of for loop
    x = np.random.random(nsamples)
    y = np.random.random(nsamples)
    # Do the filtering and calculation in the one line
    return 4.0 *np.sum((x ** 2 + y ** 2) < 1.0) / nsamples
@jit( parallel = True, nopython = True)
def monte_carlo_pi_pure_numpy_jit_parallel_nopython(nsamples):
    # Create random arrays with required number of samples
    # Use the highly optimized numpy functions instead of for loop
    x = np.random.random(nsamples)
    y = np.random.random(nsamples)
    # Do the filtering and calculation in the one line
    return 4.0 *np.sum((x ** 2 + y ** 2) < 1.0) / nsamples

In [67]:

Copied!





t1 = time.time()
monte_carlo_pi_pure_numpy_jit_parallel_nopython(n_total)
telap = time.time() - t1
print (f"This took {telap}s")
t1 = time.time()
monte_carlo_pi_pure_numpy_jit_parallel_nopython(n_total)
telap = time.time() - t1
print (f"This took {telap}s")

This took 0.31411242485046387s

In [68]:

Copied!





n_run = 10000
n_repeat = 10
t_jitted_parallel_nopython = np.array(
    timeit.repeat(f"monte_carlo_pi_pure_numpy_jit_parallel_nopython({n_total})", "from __main__ import monte_carlo_pi_pure_numpy_jit_parallel_nopython", repeat = n_repeat,  number = n_run)
) / n_run
t_jitted_parallel_nopython
n_run = 10000
n_repeat = 10
t_jitted_parallel_nopython = np.array(
    timeit.repeat(f"monte_carlo_pi_pure_numpy_jit_parallel_nopython({n_total})", "from __main__ import monte_carlo_pi_pure_numpy_jit_parallel_nopython", repeat = n_repeat,  number = n_run)
) / n_run
t_jitted_parallel_nopython

Out[68]:

array([3.03248594e-05, 1.24157844e-05, 1.27079011e-05, 1.13436051e-05,
       1.17766736e-05, 3.14462926e-05, 1.97302514e-05, 1.40485196e-05,
       1.42128084e-05, 1.49775431e-05])

In [69]:

Copied!

print(f"jitted parallel no python Runtime = {np.mean(t_jitted_parallel_nopython) : 0.2e} +/- {np.std(t_jitted_parallel_nopython) :0.2e} s")
print(f"jitted parallel no python Runtime = {np.mean(t_jitted_parallel_nopython) : 0.2e} +/- {np.std(t_jitted_parallel_nopython) :0.2e} s")

jitted parallel no python Runtime =  1.73e-05 +/- 7.16e-06 s

In [70]:

Copied!

speedup =  np.mean(t_original) / np.mean(t_jitted_parallel_nopython)
print (f"Optmization speeds up the code by a factor of {speedup:0.2f}")
speedup =  np.mean(t_original) / np.mean(t_jitted_parallel_nopython)
print (f"Optmization speeds up the code by a factor of {speedup:0.2f}")

Optmization speeds up the code by a factor of 367.65

In [71]:

Copied!





from numba import prange
@jit(nopython = True, parallel = True)
def monte_carlo_pi_revisit(nsamples):
    acc = 0
    for i in prange(nsamples):
        x = np.random.random()
        y = np.random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples
from numba import prange
@jit(nopython = True, parallel = True)
def monte_carlo_pi_revisit(nsamples):
    acc = 0
    for i in prange(nsamples):
        x = np.random.random()
        y = np.random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

In [72]:

Copied!





t1 = time.time()
monte_carlo_pi_revisit(n_total)
telap = time.time() - t1
print (f"This took {telap}s")
t1 = time.time()
monte_carlo_pi_revisit(n_total)
telap = time.time() - t1
print (f"This took {telap}s")

This took 0.2649538516998291s

In [73]:

Copied!





n_run = 10000
n_repeat = 10
t_jitted_revisit = np.array(
    timeit.repeat(f"monte_carlo_pi_revisit({n_total})", "from __main__ import monte_carlo_pi_revisit", repeat = n_repeat,  number = n_run)
) / n_run
t_jitted_revisit
n_run = 10000
n_repeat = 10
t_jitted_revisit = np.array(
    timeit.repeat(f"monte_carlo_pi_revisit({n_total})", "from __main__ import monte_carlo_pi_revisit", repeat = n_repeat,  number = n_run)
) / n_run
t_jitted_revisit

Out[73]:

array([1.48544039e-05, 1.88295904e-05, 1.98919663e-05, 1.77156133e-05,
       1.22228075e-05, 1.14322552e-05, 1.59235416e-05, 1.14099232e-05,
       4.08112965e-05, 4.06835035e-05])

In [74]:

Copied!

print(f"jitted parallel no python Runtime = {np.mean(t_jitted_revisit) : 0.2e} +/- {np.std(t_jitted_revisit) :0.2e} s")
print(f"jitted parallel no python Runtime = {np.mean(t_jitted_revisit) : 0.2e} +/- {np.std(t_jitted_revisit) :0.2e} s")

jitted parallel no python Runtime =  2.04e-05 +/- 1.06e-05 s

In [75]:

Copied!

speedup =  np.mean(t_original) / np.mean(t_jitted_revisit)
print (f"Optmization speeds up the code by a factor of {speedup:0.2f}")
speedup =  np.mean(t_original) / np.mean(t_jitted_revisit)
print (f"Optmization speeds up the code by a factor of {speedup:0.2f}")

Optmization speeds up the code by a factor of 312.10

Summary¶

We've looked at both how to speed up a simple function by first using pure NumPy, then moving to compiled versions, before moving to a parallel implementation:

In [77]:

Copied!





# speedup =  np.mean(t_original) / np.mean(t_optimized)
# print (f"Pure NumPy speeds up the code by a factor of {speedup:0.2f}")

speedup =  np.mean(t_original) / np.mean(t_jitted)
print (f"Compiled NumPy speeds up the code by a factor of {speedup:0.2f}")

speedup =  np.mean(t_original) / np.mean(t_jitted_parallel)
print (f"Parallel annd compiled NumPy speeds up the code by a factor of {speedup:0.2f}")

speedup =  np.mean(t_original) / np.mean(t_jitted_parallel_nopython)
print (f"Parallel annd compiled NumPy with 'no python' speeds up the code by a factor of {speedup:0.2f}")

speedup =  np.mean(t_original) / np.mean(t_jitted_revisit)
print (f"Parallel and complied python speeds up the code by a factor of {speedup:0.2f}")
# speedup =  np.mean(t_original) / np.mean(t_optimized)
# print (f"Pure NumPy speeds up the code by a factor of {speedup:0.2f}")

speedup =  np.mean(t_original) / np.mean(t_jitted)
print (f"Compiled NumPy speeds up the code by a factor of {speedup:0.2f}")

speedup =  np.mean(t_original) / np.mean(t_jitted_parallel)
print (f"Parallel annd compiled NumPy speeds up the code by a factor of {speedup:0.2f}")

speedup =  np.mean(t_original) / np.mean(t_jitted_parallel_nopython)
print (f"Parallel annd compiled NumPy with 'no python' speeds up the code by a factor of {speedup:0.2f}")

speedup =  np.mean(t_original) / np.mean(t_jitted_revisit)
print (f"Parallel and complied python speeds up the code by a factor of {speedup:0.2f}")

Compiled NumPy speeds up the code by a factor of 94.69
Parallel annd compiled NumPy speeds up the code by a factor of 245.19
Parallel annd compiled NumPy with 'no python' speeds up the code by a factor of 367.65
Parallel and complied python speeds up the code by a factor of 312.10

The compiled and parallel implementations offer the best performance. If we don't have access to a multi-core system, we achieve x100 performace by simply compiling the function with numba and jit.

Memory Profiling¶

So far we've only considered the run time performance. This is only one aspect of performance. Memory usage is often a limiting factor in performance.

We can use the memory_profile to profile the memory of our code. This works by sampling the memory usage at intervals while the code is running.

Similar to %%timeit and %timeit we can use %%memit and %memit to get the memory usage of a cell and a single line.

In [171]:

Copied!

# if you have issues with async switch to v0.61
# pip install memory_profiler==0.61
# if you have issues with async switch to v0.61
# pip install memory_profiler==0.61

In [17]:

Copied!





%load_ext memory_profiler
# to automatically reload changed files
%load_ext autoreload
%autoreload 2
%load_ext memory_profiler
# to automatically reload changed files
%load_ext autoreload
%autoreload 2

The memory_profiler extension is already loaded. To reload it, use:
  %reload_ext memory_profiler
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

In [18]:

Copied!

%memit big_array = [i for i in range(10**7)]
%memit big_array = [i for i in range(10**7)]

peak memory: 780.84 MiB, increment: 324.75 MiB

If we want to try and find what processes are using a lot of memory, we can line profile a function.

In [19]:

Copied!





%%writefile function_to_profile.py
import time

def expensive_function():
    
    x = [ 120 for i in range(100) ]
    b = [ x * 100 for i in range(1000) ]
    big_list = [ i for i in range(10**5) ]
    another_big_list = [ i for i in range(10**5) ]
    time.sleep(0.01)

    x = [ 42 for i in range(1000) ]
%%writefile function_to_profile.py
import time

def expensive_function():
    
    x = [ 120 for i in range(100) ]
    b = [ x * 100 for i in range(1000) ]
    big_list = [ i for i in range(10**5) ]
    another_big_list = [ i for i in range(10**5) ]
    time.sleep(0.01)

    x = [ 42 for i in range(1000) ] 

Overwriting function_to_profile.py

In [20]:

Copied!

from function_to_profile import expensive_function
%mprun -f expensive_function expensive_function()
from function_to_profile import expensive_function
%mprun -f expensive_function expensive_function()

Filename: /home/obriens/Documents/websites/mcgill_site/docs/Tutorials/Python/function_to_profile.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     3    455.5 MiB    455.5 MiB           1   def expensive_function():
     4                                             
     5    455.5 MiB      0.0 MiB         103       x = [ 120 for i in range(100) ]
     6    531.0 MiB     75.5 MiB        1003       b = [ x * 100 for i in range(1000) ]
     7    533.7 MiB      2.6 MiB      100003       big_list = [ i for i in range(10**5) ]
     8    538.0 MiB      4.4 MiB      100003       another_big_list = [ i for i in range(10**5) ]
     9    538.0 MiB      0.0 MiB           1       time.sleep(0.01)
    10                                         
    11    538.0 MiB      0.0 MiB        1003       x = [ 42 for i in range(1000) ]

This can often be difficult to parse and understand, in a jupyter notebook. I prefer to use memory_profiler via the command line interface (which we can still call from jupyter!).

In [1]:

Copied!





%%writefile initial_code.py

import numpy as np
import pandas as pd
import time

@profile
def load_data():
    df = pd.read_csv("./gaia3.csv")
    return df

@profile
def filter_data(df):
    parallax = df["parallax"]
    parallax_err = df["parallax_over_error"]
    good_mask = parallax_err > 20
    return good_mask
    
@profile
def get_distance(df):
    distance = 1./ (df["parallax"] * 1e-3)
    return distance

@profile
def assign_distance(df, distance):
    df["distance"] = distance



@profile
def filter_distance(df):
    distance = df["distance"]
    good_mask = distance < 10
    return good_mask


@profile
def apply_filter(df, data_filter):
    data = df[data_filter]
    return data



if __name__ == '__main__':

    data = load_data()
    time.sleep(0.1)
    
    data_filter = filter_data(data)
    time.sleep(0.1)
    
    data_filtered = apply_filter(data, data_filter)
    time.sleep(0.1)
    
    distance = get_distance(data_filtered)
    time.sleep(0.1)
    
    assign_distance(data_filtered, distance)
    time.sleep(0.1)
    
    distance_filter = filter_distance(data_filtered)
    time.sleep(0.1)
    
    data_10pc = apply_filter(data_filtered, distance_filter)
    time.sleep(0.1)
%%writefile initial_code.py

import numpy as np
import pandas as pd
import time

@profile
def load_data():
    df = pd.read_csv("./gaia3.csv")
    return df

@profile
def filter_data(df):
    parallax = df["parallax"]
    parallax_err = df["parallax_over_error"]
    good_mask = parallax_err > 20
    return good_mask
    
@profile
def get_distance(df):
    distance = 1./ (df["parallax"] * 1e-3)
    return distance

@profile
def assign_distance(df, distance):
    df["distance"] = distance



@profile
def filter_distance(df):
    distance = df["distance"]
    good_mask = distance < 10
    return good_mask


@profile
def apply_filter(df, data_filter):
    data = df[data_filter]
    return data



if __name__ == '__main__':

    data = load_data()
    time.sleep(0.1)
    
    data_filter = filter_data(data)
    time.sleep(0.1)
    
    data_filtered = apply_filter(data, data_filter)
    time.sleep(0.1)
    
    distance = get_distance(data_filtered)
    time.sleep(0.1)
    
    assign_distance(data_filtered, distance)
    time.sleep(0.1)
    
    distance_filter = filter_distance(data_filtered)
    time.sleep(0.1)
    
    data_10pc = apply_filter(data_filtered, distance_filter)
    time.sleep(0.1)

Overwriting initial_code.py

In [2]:

Copied!

!mprof run --interval 0.005 --python python initial_code.py
!mprof plot --output initial_code_profile.png
!mprof run --interval 0.005 --python python initial_code.py
!mprof plot --output initial_code_profile.png

mprof: Sampling memory every 0.005s
running new process
running as a Python program...
initial_code.py:25: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["distance"] = distance
Using last profile data.

In [3]:

Copied!

from IPython.display import Image, display
from IPython.display import Image, display

In [4]:

Copied!

display(Image("./initial_code_profile.png"))
display(Image("./initial_code_profile.png"))

From the above we can see a few things:

The file we read in is pretty big. The dataframe takes up ~250 MiB.
Filtering the data has a negligable effect on the memory usage.
We see a huge spike when calling apply_filter for the first time.
We see a second smaller spike when calling get_distance.

A plot like this is useful when we're trying to decide on what resources we need to run this program. We could likely run this on most machines as ~400 MiB is attainable by most personal laptops. If we wanted to submit this job to a cluster, we'd want at least ~450 MiB to provide some breathing room.

If we want to target areas to reduce memory usage, we should target the apply_filter and get_distance functions. Additionally we can look at ways to reduce the dataset size (we'll come back to this...).

Let's look at apply_filter:

@profile
def apply_filter(df, data_filter):
    data = df[data_filter]
    return data
...
    data_filtered = apply_filter(data, data_filter)
...
    data_10pc = apply_filter(data_filtered, distance_filter)
...

We are using the apply_filter. This function takes in a pandas.dataframe and a boolean array and returns a new pandas.dataframe. Let's start by overwriting the data in the function:

@profile
def apply_filter(df, data_filter):
    df = df[data_filter]
    return df

Additionally we'll overwrite all the instances of data

In [5]:

Copied!





%%writefile initial_code_iteration1.py

import numpy as np
import pandas as pd
import time

@profile
def load_data():
    df = pd.read_csv("./gaia3.csv")
    return df

@profile
def filter_data(df):
    parallax = df["parallax"]
    parallax_err = df["parallax_over_error"]
    good_mask = parallax_err > 20
    return good_mask
    
@profile
def get_distance(df):
    distance = 1./ (df["parallax"] * 1e-3)
    return distance

@profile
def assign_distance(df, distance):
    df["distance"] = distance



@profile
def filter_distance(df):
    distance = df["distance"]
    good_mask = distance < 10
    return good_mask


@profile
def apply_filter(df, data_filter):
    df = df[data_filter]
    return df



if __name__ == '__main__':

    data = load_data()
    time.sleep(0.1)
    
    data_filter = filter_data(data)
    time.sleep(0.1)
    
    data = apply_filter(data, data_filter)
    time.sleep(0.1)
    
    distance = get_distance(data)
    time.sleep(0.1)
    
    assign_distance(data, distance)
    time.sleep(0.1)
    
    distance_filter = filter_distance(data)
    time.sleep(0.1)
    
    data = apply_filter(data, distance_filter)
    time.sleep(0.1)
%%writefile initial_code_iteration1.py

import numpy as np
import pandas as pd
import time

@profile
def load_data():
    df = pd.read_csv("./gaia3.csv")
    return df

@profile
def filter_data(df):
    parallax = df["parallax"]
    parallax_err = df["parallax_over_error"]
    good_mask = parallax_err > 20
    return good_mask
    
@profile
def get_distance(df):
    distance = 1./ (df["parallax"] * 1e-3)
    return distance

@profile
def assign_distance(df, distance):
    df["distance"] = distance



@profile
def filter_distance(df):
    distance = df["distance"]
    good_mask = distance < 10
    return good_mask


@profile
def apply_filter(df, data_filter):
    df = df[data_filter]
    return df



if __name__ == '__main__':

    data = load_data()
    time.sleep(0.1)
    
    data_filter = filter_data(data)
    time.sleep(0.1)
    
    data = apply_filter(data, data_filter)
    time.sleep(0.1)
    
    distance = get_distance(data)
    time.sleep(0.1)
    
    assign_distance(data, distance)
    time.sleep(0.1)
    
    distance_filter = filter_distance(data)
    time.sleep(0.1)
    
    data = apply_filter(data, distance_filter)
    time.sleep(0.1)

Overwriting initial_code_iteration1.py

In [6]:

Copied!

!mprof run --interval 0.005 --python python initial_code_iteration1.py
!mprof plot --output initial_code_iteration1.png
!mprof run --interval 0.005 --python python initial_code_iteration1.py
!mprof plot --output initial_code_iteration1.png

mprof: Sampling memory every 0.005s
running new process
running as a Python program...
Using last profile data.

In [7]:

Copied!

display(Image("./initial_code_iteration1.png"))
display(Image("./initial_code_iteration1.png"))

This is an improvement, but we still see a spike at ~400 MiB when apply_filter is called for the first time. Why is this? Let's look at the function again:

def apply_filter(df, data_filter):
    df = df[data_filter]
    return df

When calling df = df[data_filter] we first make a copy of the data using df[data_filter] before assigning it to df. Let's look at a more memory efficient way to do this:

def apply_filter(df, data_filter):
    # Assign NaN to all the RA of the filtered events
    df["ra"][data_filter] = np.nan
    # Drop the entries with NaN, inplace = True meaning overwrite
    df.dropna(inplace = True)

Since we've overwritten the original dataframe, we don't need to return anything...

In [8]:

Copied!





%%writefile initial_code_iteration2.py

import numpy as np
import pandas as pd
import time

@profile
def load_data():
    df = pd.read_csv("./gaia3.csv")
    return df

@profile
def filter_data(df):
    parallax = df["parallax"]
    parallax_err = df["parallax_over_error"]
    good_mask = parallax_err > 20
    return good_mask
    
@profile
def get_distance(df):
    distance = 1./ (df["parallax"] * 1e-3)
    return distance

@profile
def assign_distance(df, distance):
    df["distance"] = distance



@profile
def filter_distance(df):
    distance = df["distance"]
    good_mask = distance < 10
    return good_mask


@profile
def apply_filter(df, data_filter):
    # Assign NaN to all the RA of the filtered events
    df["ra"][data_filter] = np.nan
    # Drop the entries with NaN, inplace = True meaning overwrite
    df.dropna(inplace = True)
    return df



if __name__ == '__main__':

    data = load_data()
    time.sleep(0.1)
    
    data_filter = filter_data(data)
    time.sleep(0.1)
    
    apply_filter(data, data_filter)
    time.sleep(0.1)
    
    distance = get_distance(data)
    time.sleep(0.1)
    
    assign_distance(data, distance)
    time.sleep(0.1)
    
    distance_filter = filter_distance(data)
    time.sleep(0.1)
    
    apply_filter(data, distance_filter)
    time.sleep(0.1)
%%writefile initial_code_iteration2.py

import numpy as np
import pandas as pd
import time

@profile
def load_data():
    df = pd.read_csv("./gaia3.csv")
    return df

@profile
def filter_data(df):
    parallax = df["parallax"]
    parallax_err = df["parallax_over_error"]
    good_mask = parallax_err > 20
    return good_mask
    
@profile
def get_distance(df):
    distance = 1./ (df["parallax"] * 1e-3)
    return distance

@profile
def assign_distance(df, distance):
    df["distance"] = distance



@profile
def filter_distance(df):
    distance = df["distance"]
    good_mask = distance < 10
    return good_mask


@profile
def apply_filter(df, data_filter):
    # Assign NaN to all the RA of the filtered events
    df["ra"][data_filter] = np.nan
    # Drop the entries with NaN, inplace = True meaning overwrite
    df.dropna(inplace = True)
    return df



if __name__ == '__main__':

    data = load_data()
    time.sleep(0.1)
    
    data_filter = filter_data(data)
    time.sleep(0.1)
    
    apply_filter(data, data_filter)
    time.sleep(0.1)
    
    distance = get_distance(data)
    time.sleep(0.1)
    
    assign_distance(data, distance)
    time.sleep(0.1)
    
    distance_filter = filter_distance(data)
    time.sleep(0.1)
    
    apply_filter(data, distance_filter)
    time.sleep(0.1)

Overwriting initial_code_iteration2.py

In [9]:

Copied!

!mprof run --interval 0.005 --python python initial_code_iteration2.py
!mprof plot --output initial_code_iteration2.png
!mprof run --interval 0.005 --python python initial_code_iteration2.py
!mprof plot --output initial_code_iteration2.png

mprof: Sampling memory every 0.005s
running new process
running as a Python program...
initial_code_iteration2.py:39: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df["ra"][data_filter] = np.nan
initial_code_iteration2.py:39: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["ra"][data_filter] = np.nan
initial_code_iteration2.py:39: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df["ra"][data_filter] = np.nan
initial_code_iteration2.py:39: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["ra"][data_filter] = np.nan
Using last profile data.

In [10]:

Copied!

display(Image("./initial_code_iteration2.png"))
display(Image("./initial_code_iteration2.png"))

This looks a lot better, but we still have that inital overhead of the dataset, is there anything we can do here?

Let's look at the data.

In [12]:

Copied!

!head gaia3.csv
!head gaia3.csv

,ra,dec,parallax,parallax_over_error,bp_rp,phot_g_mean_mag,pm,pmra,pmra_error,pmdec,pmdec_error
0,251.49255411977842,-50.74953285089441,6.383755900057115,103.153015,0.79550457,10.54673,32.09387,31.380714480358343,0.05963018,-6.728077202639662,0.045103732
1,251.58997660680922,-50.63863518231249,13.398257319108815,750.62354,2.1655512,13.470652,98.00884,-49.35844067323052,0.017024113,-84.67277009494555,0.012675992
2,251.09313867055343,-51.007071500182086,6.106723881526058,448.8808,1.2691355,12.529914,9.986012,5.650983532208639,0.013577581,8.233275210204,0.009894044
3,251.74646120955111,-51.20105760802903,7.481941724988636,87.46144,2.5889482,16.969418,147.46635,10.335826592962476,0.08570085,-147.10370008642158,0.05650899
4,250.18815359198393,-51.2730504962615,8.606532814900987,163.01225,2.755577,16.33911,5.618203,-1.3052739918459513,0.06419676,5.464472928347609,0.04363508
5,250.32171790439398,-51.39721173654795,5.948736623759927,103.88973,0.010123253,7.615107,31.96969,-16.618858274260234,0.06304326,-27.31070564371402,0.04115518
6,250.63769352342914,-51.83570545712104,25.8932028451182,1238.9082,2.691968,13.436164,434.55444,-58.46042596499814,0.021226894,-430.60414864152074,0.015450544
7,250.1119445586245,-51.26003101959269,6.36085302396653,131.53783,2.53827,16.118967,77.69121,-52.38320393098749,0.049328525,-57.37528741330969,0.037503965
8,250.18507947208107,-51.477673785228106,19.588654431566606,768.66254,0.44006157,6.22464,131.83145,11.454067954459212,0.028156618,131.33292721367764,0.019740378

We have a lot of percision in our measurements. Lets compare the measured values to the uncertainty on these values. Let's limit ourself to the first 100 values. By default, pandas will use 64 bit percision when reading in floats. This might be overkill.

In [13]:

Copied!

import pandas as pd
df = pd.read_csv("gaia3.csv", nrows=100)
df.head()
import pandas as pd
df = pd.read_csv("gaia3.csv", nrows=100)
df.head()

Out[13]:

	Unnamed: 0	ra	dec	parallax	parallax_over_error	bp_rp	phot_g_mean_mag	pm	pmra	pmra_error	pmdec	pmdec_error
0	0	251.492554	-50.749533	6.383756	103.153015	0.795505	10.546730	32.093870	31.380714	0.059630	-6.728077	0.045104
1	1	251.589977	-50.638635	13.398257	750.623540	2.165551	13.470652	98.008840	-49.358441	0.017024	-84.672770	0.012676
2	2	251.093139	-51.007072	6.106724	448.880800	1.269135	12.529914	9.986012	5.650984	0.013578	8.233275	0.009894
3	3	251.746461	-51.201058	7.481942	87.461440	2.588948	16.969418	147.466350	10.335827	0.085701	-147.103700	0.056509
4	4	250.188154	-51.273050	8.606533	163.012250	2.755577	16.339110	5.618203	-1.305274	0.064197	5.464473	0.043635

Do we need this percision? In this case it looks like the errors on our data are already pretty large.

In [15]:

Copied!

import numpy as np
df_f32 = pd.read_csv("gaia3.csv", nrows=100, dtype=np.float32)
df_f32.head()
import numpy as np
df_f32 = pd.read_csv("gaia3.csv", nrows=100, dtype=np.float32)
df_f32.head()

Out[15]:

	Unnamed: 0	ra	dec	parallax	parallax_over_error	bp_rp	phot_g_mean_mag	pm	pmra	pmra_error	pmdec	pmdec_error
0	0.0	251.492554	-50.749535	6.383756	103.153015	0.795505	10.546730	32.093868	31.380714	0.059630	-6.728077	0.045104
1	1.0	251.589981	-50.638634	13.398257	750.623535	2.165551	13.470652	98.008842	-49.358440	0.017024	-84.672768	0.012676
2	2.0	251.093140	-51.007072	6.106724	448.880798	1.269135	12.529914	9.986012	5.650983	0.013578	8.233275	0.009894
3	3.0	251.746460	-51.201057	7.481942	87.461441	2.588948	16.969418	147.466354	10.335827	0.085701	-147.103699	0.056509
4	4.0	250.188156	-51.273052	8.606533	163.012253	2.755577	16.339109	5.618203	-1.305274	0.064197	5.464473	0.043635

In [16]:

Copied!

import matplotlib.pyplot as plt
plt.hist(df["parallax"] - df_f32["parallax"])
import matplotlib.pyplot as plt
plt.hist(df["parallax"] - df_f32["parallax"])

Out[16]:

(array([ 1.,  0.,  1., 15., 32., 32., 15.,  2.,  1.,  1.]),
 array([-8.90233363e-07, -7.14498678e-07, -5.38763994e-07, -3.63029309e-07,
        -1.87294625e-07, -1.15599406e-08,  1.64174744e-07,  3.39909428e-07,
         5.15644113e-07,  6.91378797e-07,  8.67113481e-07]),
 <BarContainer object of 10 artists>)

In [17]:

Copied!

df_f16 = pd.read_csv("gaia3.csv", nrows=100, dtype=np.float16)
df_f16.head()
df_f16 = pd.read_csv("gaia3.csv", nrows=100, dtype=np.float16)
df_f16.head()

Out[17]:

	Unnamed: 0	ra	dec	parallax	parallax_over_error	bp_rp	phot_g_mean_mag	pm	pmra	pmra_error	pmdec	pmdec_error
0	0.0	251.500	-50.75000	6.382812	103.1250	0.795410	10.546875	32.093750	31.375000	0.059631	-6.726562	0.045105
1	1.0	251.625	-50.62500	13.398438	750.5000	2.166016	13.468750	98.000000	-49.343750	0.017029	-84.687500	0.012672
2	2.0	251.125	-51.00000	6.105469	449.0000	1.269531	12.531250	9.984375	5.652344	0.013580	8.234375	0.009895
3	3.0	251.750	-51.18750	7.480469	87.4375	2.589844	16.968750	147.500000	10.335938	0.085693	-147.125000	0.056519
4	4.0	250.250	-51.28125	8.609375	163.0000	2.755859	16.343750	5.617188	-1.305664	0.064209	5.464844	0.043640

In [18]:

Copied!

plt.hist(df["parallax"] - df_f16["parallax"])
plt.hist(df["parallax"] - df_f16["parallax"])

Out[18]:

(array([ 2.,  0.,  8., 13., 36., 23., 12.,  5.,  0.,  1.]),
 array([-0.00563744, -0.00440876, -0.00318007, -0.00195138, -0.0007227 ,
         0.00050599,  0.00173468,  0.00296337,  0.00419205,  0.00542074,
         0.00664943]),
 <BarContainer object of 10 artists>)

For a 32 bit float the uncertainy introduce is O($10^{-7}$), whereas for a 16 bit float the uncertainy is O($10^{-3}$). Let's start off by looking at using 32 bit floats instead when creating the pandas dataframe:

def load_data():
    df = pd.read_csv("./gaia3.csv", dtype = np.float32)
    return df

In [19]:

Copied!





%%writefile initial_code_iteration3.py

import numpy as np
import pandas as pd
import time

@profile
def load_data():
    df = pd.read_csv("./gaia3.csv", dtype = np.float32)
    return df

@profile
def filter_data(df):
    parallax = df["parallax"]
    parallax_err = df["parallax_over_error"]
    good_mask = parallax_err > 20
    return good_mask
    
@profile
def get_distance(df):
    distance = 1./ (df["parallax"] * 1e-3)
    return distance

@profile
def assign_distance(df, distance):
    df["distance"] = distance



@profile
def filter_distance(df):
    distance = df["distance"]
    good_mask = distance < 10
    return good_mask


@profile
def apply_filter(df, data_filter):
    # Assign NaN to all the RA of the filtered events
    df["ra"][data_filter] = np.nan
    # Drop the entries with NaN, inplace = True meaning overwrite
    df.dropna(inplace = True)
    return df



if __name__ == '__main__':

    data = load_data()
    time.sleep(0.1)
    
    data_filter = filter_data(data)
    time.sleep(0.1)
    
    apply_filter(data, data_filter)
    time.sleep(0.1)
    
    distance = get_distance(data)
    time.sleep(0.1)
    
    assign_distance(data, distance)
    time.sleep(0.1)
    
    distance_filter = filter_distance(data)
    time.sleep(0.1)
    
    apply_filter(data, distance_filter)
    time.sleep(0.1)
%%writefile initial_code_iteration3.py

import numpy as np
import pandas as pd
import time

@profile
def load_data():
    df = pd.read_csv("./gaia3.csv", dtype = np.float32)
    return df

@profile
def filter_data(df):
    parallax = df["parallax"]
    parallax_err = df["parallax_over_error"]
    good_mask = parallax_err > 20
    return good_mask
    
@profile
def get_distance(df):
    distance = 1./ (df["parallax"] * 1e-3)
    return distance

@profile
def assign_distance(df, distance):
    df["distance"] = distance



@profile
def filter_distance(df):
    distance = df["distance"]
    good_mask = distance < 10
    return good_mask


@profile
def apply_filter(df, data_filter):
    # Assign NaN to all the RA of the filtered events
    df["ra"][data_filter] = np.nan
    # Drop the entries with NaN, inplace = True meaning overwrite
    df.dropna(inplace = True)
    return df



if __name__ == '__main__':

    data = load_data()
    time.sleep(0.1)
    
    data_filter = filter_data(data)
    time.sleep(0.1)
    
    apply_filter(data, data_filter)
    time.sleep(0.1)
    
    distance = get_distance(data)
    time.sleep(0.1)
    
    assign_distance(data, distance)
    time.sleep(0.1)
    
    distance_filter = filter_distance(data)
    time.sleep(0.1)
    
    apply_filter(data, distance_filter)
    time.sleep(0.1)

Overwriting initial_code_iteration3.py

In [20]:

Copied!

!mprof run --interval 0.005 --python python initial_code_iteration3.py
!mprof plot --output initial_code_iteration3.png
!mprof run --interval 0.005 --python python initial_code_iteration3.py
!mprof plot --output initial_code_iteration3.png

mprof: Sampling memory every 0.005s
running new process
running as a Python program...
initial_code_iteration3.py:39: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df["ra"][data_filter] = np.nan
initial_code_iteration3.py:39: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df["ra"][data_filter] = np.nan
Using last profile data.

In [22]:

Copied!

display(Image("./initial_code_iteration3.png"))
display(Image("./initial_code_iteration3.png"))

We see a big improvement, with the peak memory usage < 250 MiB. Since we're limited by the uncertinty in the measurements, we might decide to lower our precision again:

def load_data():
    df = pd.read_csv("./gaia3.csv", dtype = np.float16)
    return df

In [23]:

Copied!





%%writefile initial_code_iteration4.py

import numpy as np
import pandas as pd
import time

@profile
def load_data():
    df = pd.read_csv("./gaia3.csv", dtype = np.float16)
    return df

@profile
def filter_data(df):
    parallax = df["parallax"]
    parallax_err = df["parallax_over_error"]
    good_mask = parallax_err > 20
    return good_mask
    
@profile
def get_distance(df):
    distance = 1./ (df["parallax"] * 1e-3)
    return distance

@profile
def assign_distance(df, distance):
    df["distance"] = distance



@profile
def filter_distance(df):
    distance = df["distance"]
    good_mask = distance < 10
    return good_mask


@profile
def apply_filter(df, data_filter):
    # Assign NaN to all the RA of the filtered events
    df["ra"][data_filter] = np.nan
    # Drop the entries with NaN, inplace = True meaning overwrite
    df.dropna(inplace = True)
    return df



if __name__ == '__main__':

    data = load_data()
    time.sleep(0.1)
    
    data_filter = filter_data(data)
    time.sleep(0.1)
    
    apply_filter(data, data_filter)
    time.sleep(0.1)
    
    distance = get_distance(data)
    time.sleep(0.1)
    
    assign_distance(data, distance)
    time.sleep(0.1)
    
    distance_filter = filter_distance(data)
    time.sleep(0.1)
    
    apply_filter(data, distance_filter)
    time.sleep(0.1)
%%writefile initial_code_iteration4.py

import numpy as np
import pandas as pd
import time

@profile
def load_data():
    df = pd.read_csv("./gaia3.csv", dtype = np.float16)
    return df

@profile
def filter_data(df):
    parallax = df["parallax"]
    parallax_err = df["parallax_over_error"]
    good_mask = parallax_err > 20
    return good_mask
    
@profile
def get_distance(df):
    distance = 1./ (df["parallax"] * 1e-3)
    return distance

@profile
def assign_distance(df, distance):
    df["distance"] = distance



@profile
def filter_distance(df):
    distance = df["distance"]
    good_mask = distance < 10
    return good_mask


@profile
def apply_filter(df, data_filter):
    # Assign NaN to all the RA of the filtered events
    df["ra"][data_filter] = np.nan
    # Drop the entries with NaN, inplace = True meaning overwrite
    df.dropna(inplace = True)
    return df



if __name__ == '__main__':

    data = load_data()
    time.sleep(0.1)
    
    data_filter = filter_data(data)
    time.sleep(0.1)
    
    apply_filter(data, data_filter)
    time.sleep(0.1)
    
    distance = get_distance(data)
    time.sleep(0.1)
    
    assign_distance(data, distance)
    time.sleep(0.1)
    
    distance_filter = filter_distance(data)
    time.sleep(0.1)
    
    apply_filter(data, distance_filter)
    time.sleep(0.1)

Overwriting initial_code_iteration4.py

In [24]:

Copied!

!mprof run --interval 0.005 --python python initial_code_iteration4.py
!mprof plot --output initial_code_iteration4.png
!mprof run --interval 0.005 --python python initial_code_iteration4.py
!mprof plot --output initial_code_iteration4.png

mprof: Sampling memory every 0.005s
running new process
running as a Python program...
/home/obriens/miniforge3/envs/workshop/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py:234: RuntimeWarning: overflow encountered in cast
  chunks = self._reader.read_low_memory(nrows)
initial_code_iteration4.py:39: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df["ra"][data_filter] = np.nan
initial_code_iteration4.py:39: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df["ra"][data_filter] = np.nan
Using last profile data.

In [25]:

Copied!

display(Image("./initial_code_iteration4.png"))
display(Image("./initial_code_iteration4.png"))

Before we start reading the the dataframe we already have ~75 MiB. Can we reduce this?

NumPy is a pretty big library, let's only import what we need.

In [26]:

Copied!





%%writefile initial_code_iteration5.py

from numpy import float16, nan
import pandas as pd
from time import sleep

@profile
def load_data():
    df = pd.read_csv("./gaia3.csv", dtype = float16)
    return df

@profile
def filter_data(df):
    parallax = df["parallax"]
    parallax_err = df["parallax_over_error"]
    good_mask = parallax_err > 20
    return good_mask
    
@profile
def get_distance(df):
    distance = 1./ (df["parallax"] * 1e-3)
    return distance

@profile
def assign_distance(df, distance):
    df["distance"] = distance



@profile
def filter_distance(df):
    distance = df["distance"]
    good_mask = distance < 10
    return good_mask


@profile
def apply_filter(df, data_filter):
    # Assign NaN to all the RA of the filtered events
    df["ra"][data_filter] = nan
    # Drop the entries with NaN, inplace = True meaning overwrite
    df.dropna(inplace = True)
    return df



if __name__ == '__main__':

    data = load_data()
    sleep(0.1)
    
    data_filter = filter_data(data)
    sleep(0.1)
    
    apply_filter(data, data_filter)
    sleep(0.1)
    
    distance = get_distance(data)
    sleep(0.1)
    
    assign_distance(data, distance)
    sleep(0.1)
    
    distance_filter = filter_distance(data)
    sleep(0.1)
    
    apply_filter(data, distance_filter)
    sleep(0.1)
%%writefile initial_code_iteration5.py

from numpy import float16, nan
import pandas as pd
from time import sleep

@profile
def load_data():
    df = pd.read_csv("./gaia3.csv", dtype = float16)
    return df

@profile
def filter_data(df):
    parallax = df["parallax"]
    parallax_err = df["parallax_over_error"]
    good_mask = parallax_err > 20
    return good_mask
    
@profile
def get_distance(df):
    distance = 1./ (df["parallax"] * 1e-3)
    return distance

@profile
def assign_distance(df, distance):
    df["distance"] = distance



@profile
def filter_distance(df):
    distance = df["distance"]
    good_mask = distance < 10
    return good_mask


@profile
def apply_filter(df, data_filter):
    # Assign NaN to all the RA of the filtered events
    df["ra"][data_filter] = nan
    # Drop the entries with NaN, inplace = True meaning overwrite
    df.dropna(inplace = True)
    return df



if __name__ == '__main__':

    data = load_data()
    sleep(0.1)
    
    data_filter = filter_data(data)
    sleep(0.1)
    
    apply_filter(data, data_filter)
    sleep(0.1)
    
    distance = get_distance(data)
    sleep(0.1)
    
    assign_distance(data, distance)
    sleep(0.1)
    
    distance_filter = filter_distance(data)
    sleep(0.1)
    
    apply_filter(data, distance_filter)
    sleep(0.1)

Overwriting initial_code_iteration5.py

In [27]:

Copied!

!mprof run --interval 0.005 --python python initial_code_iteration5.py
!mprof plot --output initial_code_iteration5.png
!mprof run --interval 0.005 --python python initial_code_iteration5.py
!mprof plot --output initial_code_iteration5.png

mprof: Sampling memory every 0.005s
running new process
running as a Python program...
/home/obriens/miniforge3/envs/workshop/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py:234: RuntimeWarning: overflow encountered in cast
  chunks = self._reader.read_low_memory(nrows)
initial_code_iteration5.py:39: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df["ra"][data_filter] = nan
initial_code_iteration5.py:39: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df["ra"][data_filter] = nan
Using last profile data.

In [28]:

Copied!

display(Image("./initial_code_iteration5.png"))
display(Image("./initial_code_iteration5.png"))

This brings us below 175 MiB. There are other steps we could look at:

Running in batches to reduce the overall usage.
Only loading in the columns needed for analysis
Using a different package/backend?

Summary¶

We're managed to reduce the memory usage from ~450 MiB to < 175 MiB. We achieved this by:

Profiling the memory usage to see what to target
Avoiding copying the data
Using inbuild functions to delete elements rather than doing by copy
Changing the percision of the data we're using
Reducing the import overhead

There is still a lot of redundency in the above code, we could explore other options:

Rewriting parts of the code in more memory efficient languages (Rust, C++).
Reducing the import overhead by reducing the complexity of the code and doing more steps in pure python.
Using more memory efficient libraries (e.g. Polars).
Spliting the dataset into chunks and loading in smaller bits at a time.

We also need to consider the development cost in finding low-memory solutions. Reducing the memory usage by 10% might be better than spending hours trying to reduce it by 20%.

The methods we've used here also have trade offs:

Going from f64 -> f16 is a lossy form of compression. We have lost information that we cannot simply get back.
We've thrown away data after it's read in and filtered. We can no longer compare to the original dataset without rewriting code.
We've decreased the complexity, we can no longer access functions like numpy.sum without explicitly importing them first.

When working with clusters we can use tools such as memory_profiler to estimate the requirements of the job we're about to submit. This can prevent our job crashing if we haven't requested enough memory.

Cheap Tricks to "generally" speed up your code¶

Vectorization is your friend¶

Vecorizing functions allow them to be applied to NumPy arrays, without having to write a for loop.

In [109]:

Copied!





def if_function(x):
    if x < 0:
        return 0
    else:
        return np.sqrt(x)
def if_function(x):
    if x < 0:
        return 0
    else:
        return np.sqrt(x)

In [110]:

Copied!

x = np.random.random(10**5)
x = np.random.random(10**5)

In [111]:

Copied!

%timeit y = if_function(x)
%timeit y = if_function(x)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[111], line 1
----> 1 get_ipython().run_line_magic('timeit', 'y = if_function(x)')

File ~/miniforge3/envs/workshop/lib/python3.10/site-packages/IPython/core/interactiveshell.py:2480, in InteractiveShell.run_line_magic(self, magic_name, line, _stack_depth)
   2478     kwargs['local_ns'] = self.get_local_scope(stack_depth)
   2479 with self.builtin_trap:
-> 2480     result = fn(*args, **kwargs)
   2482 # The code below prevents the output from being displayed
   2483 # when using magics with decorator @output_can_be_silenced
   2484 # when the last Python token in the expression is a ';'.
   2485 if getattr(fn, magic.MAGIC_OUTPUT_CAN_BE_SILENCED, False):

File ~/miniforge3/envs/workshop/lib/python3.10/site-packages/IPython/core/magics/execution.py:1185, in ExecutionMagics.timeit(self, line, cell, local_ns)
   1183 for index in range(0, 10):
   1184     number = 10 ** index
-> 1185     time_number = timer.timeit(number)
   1186     if time_number >= 0.2:
   1187         break

File ~/miniforge3/envs/workshop/lib/python3.10/site-packages/IPython/core/magics/execution.py:173, in Timer.timeit(self, number)
    171 gc.disable()
    172 try:
--> 173     timing = self.inner(it, self.timer)
    174 finally:
    175     if gcold:

File <magic-timeit>:1, in inner(_it, _timer)

Cell In[109], line 2, in if_function(x)
      1 def if_function(x):
----> 2     if x < 0:
      3         return 0
      4     else:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

We might be tempted to simply rewrite this function with a for loop. But remember, Python is terrible when looping...

In [112]:

Copied!





def loop_function(x):
    ret = np.zeros(len(x))
    for i in range(len(x)):
        if x[i] < 0:
            ret[i] =  0
        else:
            ret[i] = np.sqrt(x[i])
    return ret
def loop_function(x):
    ret = np.zeros(len(x))
    for i in range(len(x)):
        if x[i] < 0:
            ret[i] =  0
        else:
            ret[i] = np.sqrt(x[i])
    return ret 

In [113]:

Copied!

%timeit y = loop_function(x)
%timeit y = loop_function(x)

65.5 ms ± 523 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [114]:

Copied!

vec_function = np.vectorize(if_function)
vec_function = np.vectorize(if_function)

In [115]:

Copied!

%timeit y = vec_function(x)
%timeit y = vec_function(x)

54.3 ms ± 422 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Caching frequently used function calls¶

When you're expecting to repeat function calls with the same arguements, cache the results.

In [116]:

Copied!





#first in first out cache wrapper
def fifo(func):
    # Define the cache size 
    cache_size = 5
    # Use a dictionary as the cache
    cache = {}

    def wrapper(*args, **kwargs):
        # tokenize the arguments
        token = ""
        for arg in args:
            token += str(arg)
        for arg in kwargs:
            token += str(arg)

        # If still have the token
        if token in cache:
            return cache[token]

        # Call function
        result = func(*args, **kwargs)
        cache[token] = result

        # Check if we need to remove items from cache
        if len(cache) > 5:
            cache[next(iter(cache))]
            
        # Return the results of the function
        return result
    return wrapper
#first in first out cache wrapper
def fifo(func):
    # Define the cache size 
    cache_size = 5
    # Use a dictionary as the cache
    cache = {}

    def wrapper(*args, **kwargs):
        # tokenize the arguments
        token = ""
        for arg in args:
            token += str(arg)
        for arg in kwargs:
            token += str(arg)

        # If still have the token
        if token in cache:
            return cache[token]

        # Call function
        result = func(*args, **kwargs)
        cache[token] = result

        # Check if we need to remove items from cache
        if len(cache) > 5:
            cache[next(iter(cache))]
            
        # Return the results of the function
        return result
    return wrapper

In [117]:

Copied!

def multiply_string(my_string, rept):
    time.sleep(0.5)
    return my_string * rept
def multiply_string(my_string, rept):
    time.sleep(0.5)
    return my_string * rept

In [118]:

Copied!

%time multiply_string("hello", 5)
%time multiply_string("hello", 5)

CPU times: user 1.07 ms, sys: 33 µs, total: 1.1 ms
Wall time: 503 ms

Out[118]:

'hellohellohellohellohello'

In [119]:

Copied!

%time multiply_string("hello", 5)
%time multiply_string("hello", 5)

CPU times: user 1.25 ms, sys: 37 µs, total: 1.29 ms
Wall time: 502 ms

Out[119]:

'hellohellohellohellohello'

In [120]:

Copied!





@fifo
def multiply_string_cache(my_string, rept):
    time.sleep(0.5)
    return my_string * rept
@fifo
def multiply_string_cache(my_string, rept):
    time.sleep(0.5)
    return my_string * rept

In [121]:

Copied!

%time multiply_string_cache("goodbye", 5)
%time multiply_string_cache("goodbye", 5)

CPU times: user 1.2 ms, sys: 36 µs, total: 1.24 ms
Wall time: 503 ms

Out[121]:

'goodbyegoodbyegoodbyegoodbyegoodbye'

In [122]:

Copied!

%time multiply_string_cache("goodbye", 5)
%time multiply_string_cache("goodbye", 5)

CPU times: user 8 µs, sys: 0 ns, total: 8 µs
Wall time: 13.1 µs

Out[122]:

'goodbyegoodbyegoodbyegoodbyegoodbye'

Now that we know how the cache works, we can use some blackboxed python functions.

Let's use a Least recently used) cache. This will track the number of times we call a set of args. When the cache fills up, the least used item will be removed.

In [123]:

Copied!





from functools import lru_cache

@lru_cache(maxsize=5)
def multiply_string_lru(my_string, rept):
    time.sleep(0.5)
    return my_string * rept
from functools import lru_cache

@lru_cache(maxsize=5)
def multiply_string_lru(my_string, rept):
    time.sleep(0.5)
    return my_string * rept

In [124]:

Copied!

%time multiply_string_lru("goodbye", 5)
%time multiply_string_lru("goodbye", 5)

CPU times: user 952 µs, sys: 29 µs, total: 981 µs
Wall time: 503 ms

Out[124]:

'goodbyegoodbyegoodbyegoodbyegoodbye'

In [125]:

Copied!

%time multiply_string_lru("goodbye", 5)
%time multiply_string_lru("goodbye", 5)

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 7.39 µs

Out[125]:

'goodbyegoodbyegoodbyegoodbyegoodbye'

Already have working code from a more efficient language? Import it!¶

Python has lots of tools to import code from different languages

In [78]:

Copied!





def py_fib(x):
    if x in [0,1] :
        return 1
    else:
        return py_fib(x - 1) + py_fib(x - 2)
def py_fib(x):
    if x in [0,1] :
        return 1
    else:
        return py_fib(x - 1) + py_fib(x - 2)

In [79]:

Copied!

%timeit py_fib(10)
%timeit py_fib(10)

9.06 µs ± 93.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [80]:

Copied!





from numba import jit

@jit( nopython = True )
def py_fib(x):
    if x in [0,1] :
        return 1
    else:
        return py_fib(x - 1) + py_fib(x - 2)
from numba import jit

@jit( nopython = True )
def py_fib(x):
    if x in [0,1] :
        return 1
    else:
        return py_fib(x - 1) + py_fib(x - 2)

In [81]:

Copied!

py_fib(10)
py_fib(10)

Out[81]:

In [74]:

Copied!

%timeit py_fib(10)
%timeit py_fib(10)

362 ns ± 3.35 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

We can use packages like Cython to write C++ code in Python. Or we can directly import a precompiled library.

We can make C++ functions accessible in Python by wrapping our code in a extern "C" block.

extern "C" {

    double my_function(... some arguments...){
        ... some code ...
    }
}

Why extern "C"? One of the differences between C and C++ is that C++ allows us to overload functions. This means that two functions can have the same name, but accept different arguements. For example let say we have some functionality that we would like to make common across different data types:

double get_sqrt_integer( int a );
double get_sqrt_double( double a );

The user shouldn't need to worry if the argument is an int or a double. So we would "overload" the function like so:

double get_sqrt( int a );
double get_sqrt( double a );

So the user only ever calls get_sqrt and the compile handles the logic on whether it is a double or an int. The compile does this by giving each of the function versions unique names that are generated at compile time.

By using extern "C" we are reverting back to the C method, which will explicitly asign the name we give the function, but it means that we can no longer overload the function. We therefore would need to run:

extern "C" {   
    double get_sqrt_integer( int a );
    double get_sqrt_double( double a );
}

When analyzing the compiled library we would see the two functions called get_sqrt_integer and get_sqrt_double.

In [82]:

Copied!





%%writefile cpp_fib.cpp
extern "C" {
    int cpp_fib(unsigned int x)
    {
        if ( (x == 0) || (x == 1) )
        {
            return 1;
        }
        else
        {
            return cpp_fib(x-1) + cpp_fib(x-2);
        }
    }
}
%%writefile cpp_fib.cpp
extern "C" {
    int cpp_fib(unsigned int x)
    {
        if ( (x == 0) || (x == 1) )
        {
            return 1;
        }
        else
        {
            return cpp_fib(x-1) + cpp_fib(x-2);
        }
    }
}

Overwriting cpp_fib.cpp

Compile to a shared library

In [76]:

Copied!

# mamba / conda install gxx
# or
# mamba /conda install gxx-compiler
# mamba / conda install gxx
# or
# mamba /conda install gxx-compiler

In [83]:

Copied!

! g++ -fPIC -shared -o cpp_fib.so cpp_fib.cpp -O3
! g++ -fPIC -shared -o cpp_fib.so cpp_fib.cpp -O3

In [84]:

Copied!

! ls cpp_*
! ls cpp_*

cpp_fib.cpp  cpp_fib.so

Load in the library with ctypes and "translate" the argument types

In [85]:

Copied!





import ctypes

cpp_lib = ctypes.CDLL("./cpp_fib.so")
# Access the cpp_fib function from the library
# and assigne the argument type
cpp_lib.cpp_fib.argtypes = [ctypes.c_uint] 
# Alias it to cpp_fib
cpp_fib = cpp_lib.cpp_fib
import ctypes

cpp_lib = ctypes.CDLL("./cpp_fib.so")
# Access the cpp_fib function from the library
# and assigne the argument type
cpp_lib.cpp_fib.argtypes = [ctypes.c_uint] 
# Alias it to cpp_fib
cpp_fib = cpp_lib.cpp_fib

In [86]:

Copied!

%timeit cpp_fib(10)
%timeit cpp_fib(10)

273 ns ± 4.17 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

We can also load in Rust code using rustimport

mamba/conda install rustimport

And the pyo3 rust crate:

In [87]:

Copied!

%load_ext rustimport
%load_ext rustimport

use --release to complie an optimized binary rather than a debug binary.

In [88]:

Copied!





%%rustimport  --release --force
use pyo3::prelude::*;

#[pyfunction]
fn rust_fib(x: usize) -> usize {
    match x {
        1 => 1,
        0 => 1,
        _ => rust_fib(x - 1) + rust_fib(x - 2)
    }
}
%%rustimport  --release --force
use pyo3::prelude::*;

#[pyfunction]
fn rust_fib(x: usize) -> usize {
    match x {
        1 => 1,
        0 => 1,
        _ => rust_fib(x - 1) + rust_fib(x - 2)
    }
}

    Updating crates.io index
 Downloading crates ...
  Downloaded syn v2.0.64
   Compiling target-lexicon v0.12.14
   Compiling once_cell v1.19.0
   Compiling autocfg v1.3.0
   Compiling proc-macro2 v1.0.82
   Compiling unicode-ident v1.0.12
   Compiling libc v0.2.154
   Compiling parking_lot_core v0.9.10
   Compiling portable-atomic v1.6.0
   Compiling cfg-if v1.0.0
   Compiling scopeguard v1.2.0
   Compiling smallvec v1.13.2
   Compiling heck v0.4.1
   Compiling indoc v2.0.5
   Compiling unindent v0.2.3
   Compiling lock_api v0.4.12
   Compiling memoffset v0.9.1
   Compiling pyo3-build-config v0.21.2
   Compiling quote v1.0.36
   Compiling syn v2.0.64
   Compiling parking_lot v0.12.2
   Compiling pyo3-ffi v0.21.2
   Compiling pyo3 v0.21.2
   Compiling pyo3-macros-backend v0.21.2
   Compiling pyo3-macros v0.21.2
   Compiling _magic_4716e3197ba1bc8f v0.1.0 (/tmp/rustimport/_magic_4716e3197ba1bc8f-860e1651e5b8e95bc6f26cfde7bdc78b/_magic_4716e3197ba1bc8f)
    Finished release [optimized] target(s) in 8.30s

In [87]:

Copied!

%timeit rust_fib(10)
%timeit rust_fib(10)

168 ns ± 19.9 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [88]:

Copied!

# fib_sqe = [py_fib(i) for i in range(10)]
# fib_sqe
# fib_sqe = [py_fib(i) for i in range(10)]
# fib_sqe

In [89]:

Copied!

# fib_sqe = [cpp_fib(i) for i in range(10)]
# fib_sqe
# fib_sqe = [cpp_fib(i) for i in range(10)]
# fib_sqe

In [90]:

Copied!

# fib_sqe = [rust_fib(i) for i in range(10)]
# fib_sqe
# fib_sqe = [rust_fib(i) for i in range(10)]
# fib_sqe

Don't reinvent the wheel!¶

Python is one of the most popular languages. A lot of super talented people have spent a lot of time optimizing python and wrapping low lower level languages in Python code. Use libraries such as:

NumPy
SciPy
Pandas

In [ ]: