Benchmarking and Profiling¶
Benchmarking and Profiling are crucial techniques used to quantify the performance of code and measure improvements in its execution. In Python, understanding the performance of different implementations or comparing the efficiency of algorithms is essential, especially in scenarios where optimization is vital.
When trying to optimize code, it is important to remember that the time spent optimizing might be better spent on other tasks. That is to say, when developing code the primary goal is to deliver working code; delivering fast and efficient code is a secondary goal. With this in mind, it is important not to prematurely optimize code. One can spend an incredible amount of time trying to optimize code before it even works, or spend hours of work to save only seconds of run time. A balance needs to be struck!
A suitable development workflow would look something like this:
- Get a working instance of the code.
- Determine suitable benchmarks for evaluating the code (run time, iterations/second, iterations/CPU core, precision of the output, memory usage, etc.), and set target performance goals (e.g. total run time, total memory budget, etc.).
- Evaluate the performance using the predefined benchmarks.
- If performance goals aren't met, profile the code to determine where the bottlenecks are and whether changes can be made.
- Implement suitable changes to the code and repeat from step 3.
Optimizing code can be a very time-consuming process. A balance needs to be struck between acceptable run time and the developer hours needed. Making multiple changes to the code at once can also introduce bugs that are difficult to track down. Either make small iterative changes to the code, or make sure the code passes pre-defined tests after each development cycle.
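For example, a minimal sketch of such a regression check (the `slow_sum`/`fast_sum` functions below are hypothetical stand-ins for your own original and optimized implementations) compares the two before any timing is done:
import numpy as np
def slow_sum(values):
    # hypothetical original implementation
    total = 0.0
    for v in values:
        total += v
    return total
def fast_sum(values):
    # hypothetical optimized implementation
    return np.sum(values)
data = np.random.random(1000)
# check the optimized version still agrees with the original before benchmarking it
assert np.allclose(slow_sum(data), fast_sum(data))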
Python provides the `timeit` module, which enables the measurement of execution time for specific code segments. This module offers a simple yet effective way to evaluate the performance of individual code snippets or functions (a plain-script sketch using the module directly follows the list below). Within a notebook, the corresponding magic commands are:
- `%%time` will time the contents of an entire cell with a single run
- `%%timeit` will time the contents of an entire cell, averaging over multiple runs of that cell
- `%time` will time a single line with a single run
- `%timeit` will time a single line, averaging over multiple runs of that line
`timeit` will also suspend some features, such as the garbage collector, to get an accurate measurement of the code's runtime.
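Outside of a notebook, the `timeit` module can be used directly. A minimal sketch (the statement strings here are just illustrative):
import timeit
# time a snippet, summed over 10,000 executions
elapsed = timeit.timeit("sum(range(1000))", number=10_000)
print(f"{elapsed / 10_000:.2e} s per call")
# repeat() returns one total per repetition, which gives an idea of the spread
runs = timeit.repeat("sum(range(1000))", repeat=5, number=10_000)
print(f"best of 5: {min(runs) / 10_000:.2e} s per call")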
import time
import timeit
import numpy as np
import matplotlib.pyplot as plt
%%timeit
test = 0
for i in range(100000):
test += 1
2.48 ms ± 31.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
We can also measure the time of the cell from one iteration using `%%time`:
%%time
test = 0
for i in range(100000):
test += 1
CPU times: user 4.59 ms, sys: 0 ns, total: 4.59 ms Wall time: 4.61 ms
We see a number of different times listed here:
- `Wall time`: the time that has passed during the running of the code.
- `CPU time`: the time that the CPU was busy. This also includes time when the CPU has "swapped" its attention to a different process.
- `user`: the time that the process has been running on the CPU, outside of the system kernel.
- `sys`: the time taken inside the system kernel. A large value could indicate that the code spends a lot of time in system calls rather than in code written by the developer.
As a rule of thumb:
- CPU Time / Wall Time ~ 1: the process spent most of its time using the CPU. A faster CPU could improve performance.
- CPU Time / Wall Time < 1: the process has spent time waiting. This could be resource allocation, network/hardware IO, locks (to prevent data races), etc. The smaller the number, the more time spent waiting; a value of 0.9 would suggest the CPU was idle 10% of the time.
- CPU Time / Wall Time > 1: the code is utilizing multiple processors.
If we are running with N processors:
- CPU Time / Wall Time ~ N: the process is well distributed across all CPUs with little to no idle time.
- CPU Time / Wall Time < N: the process has spent time waiting. This could be resource allocation, network/hardware IO, locks (to prevent data races), etc. (A small sketch for measuring this ratio yourself follows.)
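A rough way to estimate this ratio yourself is to compare `time.process_time()` (CPU time of the current process) against `time.perf_counter()` (wall-clock time). A minimal sketch with a deliberately sleepy workload:
import time
wall_start = time.perf_counter()
cpu_start = time.process_time()
total = sum(i * i for i in range(10**6))  # CPU-bound work
time.sleep(0.5)                           # waiting, e.g. I/O or a lock
wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print(f"CPU Time / Wall Time = {cpu / wall:.2f}")  # well below 1 because of the sleep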
We can use `%timeit` to time a single line or process, averaging over multiple runs:
print ("This line won't be timed")
# Neither will this line
test = np.random.random(1000)
# This line will be timed
%timeit test = np.random.random(1000)
This line won't be timed 4.14 µs ± 87.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
Or we can use `%time` to time a single line or process, with no averaging:
print ("This line won't be timed")
# Neither will this line
test = np.random.random(1000)
# This line will be timed
%time test = np.random.random(1000)
This line won't be timed CPU times: user 100 µs, sys: 4 µs, total: 104 µs Wall time: 78.7 µs
Let's look at comparing two functions. Consider vector addition: in pure Python we would use a for loop to iterate over each element, whereas with NumPy we can use its built-in vectorized functions to perform the addition.
def add_two_arrays(a : np.ndarray,b : np.ndarray) -> np.ndarray:
"""Add two numpy arrays togeter using a for loop
Args:
a : array 1 to be added
b : array 2 to be added
Returns:
array of a + b
"""
c = np.zeros(len(a))
for i in range(len(a)):
c[i] = a[i] + b[i]
return c
def add_two_arrays_numpy(a : np.ndarray, b : np.ndarray) -> np.ndarray:
"""Add two numpy arrays togeter using vectorization
Args:
a : array 1 to be added
b : array 2 to be added
Returns:
array of a + b
"""
c = a + b
return c
test_x = np.random.random(1000)
test_y = np.random.random(1000)
print("Pure Python:")
%timeit test_z = add_two_arrays(test_x, test_y)
print("Numpy:")
%timeit test_z = add_two_arrays_numpy(test_x, test_y)
Pure Python: 155 µs ± 742 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each) Numpy:
We see a dramatic improvement by using NumPy arrays to add two vectors together. This is because NumPy is a highly optimized library, with the low-level code written in C/C++. Is this always the case?
def time_addition(n):
    test_x = np.random.random(n)
    test_y = np.random.random(n)
    n_repeat = 10
    # pure python: time each repeat and take the median
    t_runs = []
    for _ in range(n_repeat):
        t_start = time.time()
        _ = add_two_arrays(test_x, test_y)
        t_runs.append(time.time() - t_start)
    t_pure = np.median(t_runs)
    # numpy: time each repeat and take the median
    t_runs = []
    for _ in range(n_repeat):
        t_start = time.time()
        _ = add_two_arrays_numpy(test_x, test_y)
        t_runs.append(time.time() - t_start)
    t_numpy = np.median(t_runs)
    return t_pure, t_numpy
n_array = np.arange(20)
times_py = np.zeros(n_array.shape)
times_np = np.zeros(n_array.shape)
for i, n in enumerate(n_array):
times_py[i], times_np[i] = time_addition(int(2**n))
plt.plot(2 ** n_array, times_py, "C0s:", label = "Pure Python")
plt.plot(2 ** n_array, times_np, "C1o--", label = "Numpy")
plt.loglog()
plt.grid()
plt.legend()
plt.xlabel("Array Length")
plt.ylabel("Run Time [s]");
At very small array sizes the performance is comparable. This is likely due to the overhead of calling the low-level NumPy code. The NumPy performance saturates around $10^2$ - $10^3$ elements before scaling in a comparable fashion to pure Python, however it is still orders of magnitude faster. This performance will be system dependent!
It might be easy to assume for the above example that NumPy is always better than pure Python, but this isn't always the case and we should always consider our specific needs.
If we were writing code to operate on small arrays (O($10^1$) elements), it might be faster to use a pure Python implementation rather than importing NumPy and introducing the additional overhead.
Line profiling¶
Line profiling is a useful tool for finding bottlenecks within a block of code. There are many profilers one can use:
- cProfile
- line_profiler
- py-spy
- memray
For this example, we'll use line_profiler
mamba/conda install line_profiler
or
pip install line_profiler
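Outside of a notebook, a common workflow (sketched here with a placeholder script name) is to decorate the functions of interest with `@profile` and run the script through `kernprof`, which injects the decorator for you:
kernprof -l -v my_script.py
# view the saved results again later
python -m line_profiler my_script.py.lprof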
Let's look at an expensive function that uses a Monte Carlo (MC) method to estimate $\pi$.
We're going to throw random points $x, y \in [0,1]$ into a square of side length 1. The number of points thrown is proportional to the area of the square:
$$ A_{square} \propto N_{total}$$
The area of a circle is given by $A_{circle} = \pi r^2$. Since we're throwing random numbers in the range $[0,1]$, the points satisfying $x^2 + y^2 < 1$ fill a quarter of a unit circle, so the number of such points, $N_{within}$, satisfies:
$$ \frac{A_{circle}}{4} \propto N_{within}$$
Combining these, we can estimate $\pi$ as:
$$ \pi \approx 4 \frac{N_{within}}{N_{total}}$$
def monte_carlo_pi(nsamples):
acc = 0
for i in range(nsamples):
x = np.random.random()
y = np.random.random()
if (x ** 2 + y ** 2) < 1.0:
acc += 1
return 4.0 * acc / nsamples
n_range = np.arange(7)
pi_diff = [np.abs(np.pi - monte_carlo_pi(10**n)) for n in n_range]
plt.plot(10**n_range, pi_diff)
plt.xlabel("Number of Random Samples")
plt.ylabel("$|\pi_{NumPy} - \pi_{MC}|$")
plt.grid()
plt.xscale('log')
#plt.yscale('log')
We can see that as $N_{total}$ increases we tend towards the NumPy value of $\pi$. We reach diminishing returns at around $10^4$ samples, so let's choose this as our benchmark point.
n_total = 10 ** 4
How long does this take to run?
%timeit monte_carlo_pi(n_total)
6.14 ms ± 51.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
n_run = 100
n_repeat = 10
t_original = np.array(
timeit.repeat(f"monte_carlo_pi({n_total})", "from __main__ import monte_carlo_pi", repeat = n_repeat, number = n_run)
) / n_run
print(f"Original Runtime = {np.mean(t_original) : 0.2e} +/- {np.std(t_original) :0.2e} s")
Original Runtime = 6.36e-03 +/- 5.86e-04 s
If we want to speed up this calculation, we should first find what is the major bottleneck.
%load_ext line_profiler
The line_profiler extension is already loaded. To reload it, use: %reload_ext line_profiler
%lprun -f monte_carlo_pi monte_carlo_pi(n_total)
Timer unit: 1e-09 s Total time: 0.0180272 s File: /tmp/ipykernel_235029/1024659408.py Function: monte_carlo_pi at line 1 Line # Hits Time Per Hit % Time Line Contents ============================================================== 1 def monte_carlo_pi(nsamples): 2 1 852.0 852.0 0.0 acc = 0 3 10001 1441541.0 144.1 8.0 for i in range(nsamples): 4 10000 5764246.0 576.4 32.0 x = np.random.random() 5 10000 5585385.0 558.5 31.0 y = np.random.random() 6 10000 3567354.0 356.7 19.8 if (x ** 2 + y ** 2) < 1.0: 7 7911 1666182.0 210.6 9.2 acc += 1 8 1 1623.0 1623.0 0.0 return 4.0 * acc / nsamples
The above shows a line-by-line breakdown of where we're spending time. We can see that the initialization of `acc` and the return of `4.0 * acc / nsamples` are relatively expensive per hit, but each only happens once per execution, so they contribute almost nothing to the total (and, with only one hit each, these estimates are noisy).
`for i in range(nsamples):` takes 8% of our time. This is pretty significant: Python is notoriously slow at looping, so if we can avoid looping in Python we should.
We also see that `acc += 1` doesn't happen on every iteration, and we spend ~20% of our time checking `if (x ** 2 + y ** 2) < 1.0`. The dominant part of the time is spent generating random numbers (`x = np.random.random()` and `y = np.random.random()`), which together take ~60% of the run time.
If we were looking to speed up this function, we should target the most expensive parts first. This suggests targeting:
x = np.random.random()
y = np.random.random()
if (x ** 2 + y ** 2) < 1.0:
These lines constitute roughly 80% of our run time. Since each of them is run 10,000 times, it might make more sense to somehow only run them once. But how...
def monte_carlo_pi_pure_numpy(nsamples):
# Create random arrays with required number of samples
# Use the highly optimized numpy functions instead of for loop
x = np.random.random(nsamples)
y = np.random.random(nsamples)
# Do the filtering and calculation in the one line
return 4.0 *np.sum((x ** 2 + y ** 2) < 1.0) / nsamples
%timeit monte_carlo_pi_pure_numpy(n_total)
90.5 µs ± 3.92 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
n_run = 1000
n_repeat = 10
t_optimized = np.array(
timeit.repeat(f"monte_carlo_pi_pure_numpy({n_total})", "from __main__ import monte_carlo_pi_pure_numpy", repeat = n_repeat, number = n_run)
) / n_run
t_optimized
array([1.30074857e-04, 1.22065353e-04, 8.42903380e-05, 8.41132850e-05, 8.37166480e-05, 8.39098730e-05, 8.36135550e-05, 8.34919960e-05, 8.35091390e-05, 8.30124940e-05])
print(f"Optimized Runtime = {np.mean(t_optimized) : 0.2e} +/- {np.std(t_optimized) :0.2e} s")
Optimized Runtime = 9.22e-05 +/- 1.70e-05 s
%lprun -f monte_carlo_pi_pure_numpy monte_carlo_pi_pure_numpy(n_total)
Timer unit: 1e-09 s Total time: 0.000826387 s File: /tmp/ipykernel_235029/764224049.py Function: monte_carlo_pi_pure_numpy at line 1 Line # Hits Time Per Hit % Time Line Contents ============================================================== 1 def monte_carlo_pi_pure_numpy(nsamples): 2 # Create random arrays with required number of samples 3 # Use the highly optimized numpy functions instead of for loop 4 1 184658.0 184658.0 22.3 x = np.random.random(nsamples) 5 1 180490.0 180490.0 21.8 y = np.random.random(nsamples) 6 # Do the filtering and calculation in the one line 7 1 461239.0 461239.0 55.8 return 4.0 *np.sum((x ** 2 + y ** 2) < 1.0) / nsamples
speedup = np.mean(t_original) / np.mean(t_optimized)
print (f"Optmization speeds up the code by a factor of {speedup:0.2f}")
Optimization speeds up the code by a factor of 65.55
Can we do better?¶
We're already using highly optimized NumPy, which is backed by C/C++ code. However, we're still interpreting slow bytecode at run time and running on one core.
# !pip install numba
from numba import jit
@jit
def monte_carlo_pi_pure_numpy_jit(nsamples):
# Create random arrays with required number of samples
# Use the highly optimized numpy functions instead of for loop
x = np.random.random(nsamples)
y = np.random.random(nsamples)
# Do the filtering and calculation in the one line
return 4.0 *np.sum((x ** 2 + y ** 2) < 1.0) / nsamples
t1 = time.time()
monte_carlo_pi_pure_numpy_jit(n_total)
telap = time.time() - t1
print (f"This took {telap}s")
This took 0.818145751953125s
t1 = time.time()
monte_carlo_pi_pure_numpy_jit(n_total)
telap = time.time() - t1
print (f"This took {telap}s")
This took 0.00010251998901367188s
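The large difference between these two calls is the just-in-time compilation: the first call includes the time Numba spends compiling the function for the given argument types, while subsequent calls reuse the compiled machine code. If that one-off cost matters (for example in short-lived scripts), Numba can cache the compiled function on disk between runs; a small sketch using the `cache` option:
@jit(cache=True)
def monte_carlo_pi_pure_numpy_cached(nsamples):
    # identical body to before; cache=True stores the compiled code on disk
    x = np.random.random(nsamples)
    y = np.random.random(nsamples)
    return 4.0 * np.sum((x ** 2 + y ** 2) < 1.0) / nsamples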
n_run = 1000
n_repeat = 10
t_jitted = np.array(
timeit.repeat(f"monte_carlo_pi_pure_numpy_jit({n_total})", "from __main__ import monte_carlo_pi_pure_numpy_jit", repeat = n_repeat, number = n_run)
) / n_run
# t_jitted
print(f"jitted Runtime = {np.mean(t_jitted) : 0.2e} +/- {np.std(t_jitted) :0.2e} s")
jitted Runtime = 6.72e-05 +/- 1.59e-05 s
speedup = np.mean(t_original) / np.mean(t_jitted)
print (f"Optmization speeds up the code by a factor of {speedup:0.2f}")
Optimization speeds up the code by a factor of 94.69
@jit( parallel = True )
def monte_carlo_pi_pure_numpy_jit_parallel(nsamples):
# Create random arrays with required number of samples
# Use the highly optimized numpy functions instead of for loop
x = np.random.random(nsamples)
y = np.random.random(nsamples)
# Do the filtering and calculation in the one line
return 4.0 *np.sum((x ** 2 + y ** 2) < 1.0) / nsamples
t1 = time.time()
monte_carlo_pi_pure_numpy_jit_parallel(n_total)
telap = time.time() - t1
print (f"This took {telap}s")
This took 0.7153468132019043s
n_run = 10000
n_repeat = 10
t_jitted_parallel = np.array(
timeit.repeat(f"monte_carlo_pi_pure_numpy_jit_parallel({n_total})", "from __main__ import monte_carlo_pi_pure_numpy_jit_parallel", repeat = n_repeat, number = n_run)
) / n_run
t_jitted_parallel
array([3.53246284e-05, 1.69790315e-05, 2.01592582e-05, 2.66108533e-05, 6.16496429e-05, 2.76365722e-05, 3.48632386e-05, 1.14208586e-05, 1.32643553e-05, 1.14757513e-05])
print(f"jitted parallel Runtime = {np.mean(t_jitted_parallel) : 0.2e} +/- {np.std(t_jitted_parallel) :0.2e} s")
jitted parallel Runtime = 2.59e-05 +/- 1.46e-05 s
speedup = np.mean(t_original) / np.mean(t_jitted_parallel)
print (f"Optmization speeds up the code by a factor of {speedup:0.2f}")
Optimization speeds up the code by a factor of 245.19
@jit( parallel = True, nopython = True)
def monte_carlo_pi_pure_numpy_jit_parallel_nopython(nsamples):
# Create random arrays with required number of samples
# Use the highly optimized numpy functions instead of for loop
x = np.random.random(nsamples)
y = np.random.random(nsamples)
# Do the filtering and calculation in the one line
return 4.0 *np.sum((x ** 2 + y ** 2) < 1.0) / nsamples
t1 = time.time()
monte_carlo_pi_pure_numpy_jit_parallel_nopython(n_total)
telap = time.time() - t1
print (f"This took {telap}s")
This took 0.31411242485046387s
n_run = 10000
n_repeat = 10
t_jitted_parallel_nopython = np.array(
timeit.repeat(f"monte_carlo_pi_pure_numpy_jit_parallel_nopython({n_total})", "from __main__ import monte_carlo_pi_pure_numpy_jit_parallel_nopython", repeat = n_repeat, number = n_run)
) / n_run
t_jitted_parallel_nopython
array([3.03248594e-05, 1.24157844e-05, 1.27079011e-05, 1.13436051e-05, 1.17766736e-05, 3.14462926e-05, 1.97302514e-05, 1.40485196e-05, 1.42128084e-05, 1.49775431e-05])
print(f"jitted parallel no python Runtime = {np.mean(t_jitted_parallel_nopython) : 0.2e} +/- {np.std(t_jitted_parallel_nopython) :0.2e} s")
jitted parallel no python Runtime = 1.73e-05 +/- 7.16e-06 s
speedup = np.mean(t_original) / np.mean(t_jitted_parallel_nopython)
print (f"Optmization speeds up the code by a factor of {speedup:0.2f}")
Optimization speeds up the code by a factor of 367.65
from numba import prange
@jit(nopython = True, parallel = True)
def monte_carlo_pi_revisit(nsamples):
acc = 0
for i in prange(nsamples):
x = np.random.random()
y = np.random.random()
if (x ** 2 + y ** 2) < 1.0:
acc += 1
return 4.0 * acc / nsamples
t1 = time.time()
monte_carlo_pi_revisit(n_total)
telap = time.time() - t1
print (f"This took {telap}s")
This took 0.2649538516998291s
n_run = 10000
n_repeat = 10
t_jitted_revisit = np.array(
timeit.repeat(f"monte_carlo_pi_revisit({n_total})", "from __main__ import monte_carlo_pi_revisit", repeat = n_repeat, number = n_run)
) / n_run
t_jitted_revisit
array([1.48544039e-05, 1.88295904e-05, 1.98919663e-05, 1.77156133e-05, 1.22228075e-05, 1.14322552e-05, 1.59235416e-05, 1.14099232e-05, 4.08112965e-05, 4.06835035e-05])
print(f"jitted parallel no python Runtime = {np.mean(t_jitted_revisit) : 0.2e} +/- {np.std(t_jitted_revisit) :0.2e} s")
jitted parallel loop Runtime = 2.04e-05 +/- 1.06e-05 s
speedup = np.mean(t_original) / np.mean(t_jitted_revisit)
print (f"Optmization speeds up the code by a factor of {speedup:0.2f}")
Optimization speeds up the code by a factor of 312.10
Summary¶
We've looked at how to speed up a simple function: first using pure NumPy, then a compiled version, before moving to parallel implementations:
# speedup = np.mean(t_original) / np.mean(t_optimized)
# print (f"Pure NumPy speeds up the code by a factor of {speedup:0.2f}")
speedup = np.mean(t_original) / np.mean(t_jitted)
print (f"Compiled NumPy speeds up the code by a factor of {speedup:0.2f}")
speedup = np.mean(t_original) / np.mean(t_jitted_parallel)
print (f"Parallel annd compiled NumPy speeds up the code by a factor of {speedup:0.2f}")
speedup = np.mean(t_original) / np.mean(t_jitted_parallel_nopython)
print (f"Parallel annd compiled NumPy with 'no python' speeds up the code by a factor of {speedup:0.2f}")
speedup = np.mean(t_original) / np.mean(t_jitted_revisit)
print (f"Parallel and complied python speeds up the code by a factor of {speedup:0.2f}")
Compiled NumPy speeds up the code by a factor of 94.69 Parallel and compiled NumPy speeds up the code by a factor of 245.19 Parallel and compiled NumPy with 'no python' speeds up the code by a factor of 367.65 Parallel and compiled Python speeds up the code by a factor of 312.10
The compiled and parallel implementations offer the best performance. Even without access to a multi-core system, we achieve a ~100x speedup simply by compiling the function with Numba's JIT.
Memory Profiling¶
So far we've only considered the run time performance. This is only one aspect of performance. Memory usage is often a limiting factor in performance.
We can use `memory_profiler` to profile the memory usage of our code. This works by sampling the memory usage at intervals while the code is running.
Similar to `%%timeit` and `%timeit`, we can use `%%memit` and `%memit` to get the memory usage of a cell or a single line.
# if you have issues with async switch to v0.61
# pip install memory_profiler==0.61
%load_ext memory_profiler
# to automatically reload changed files
%load_ext autoreload
%autoreload 2
The memory_profiler extension is already loaded. To reload it, use: %reload_ext memory_profiler The autoreload extension is already loaded. To reload it, use: %reload_ext autoreload
%memit big_array = [i for i in range(10**7)]
peak memory: 780.84 MiB, increment: 324.75 MiB
If we want to find which parts of the code are using a lot of memory, we can line-profile a function.
%%writefile function_to_profile.py
import time
def expensive_function():
x = [ 120 for i in range(100) ]
b = [ x * 100 for i in range(1000) ]
big_list = [ i for i in range(10**5) ]
another_big_list = [ i for i in range(10**5) ]
time.sleep(0.01)
x = [ 42 for i in range(1000) ]
Overwriting function_to_profile.py
from function_to_profile import expensive_function
%mprun -f expensive_function expensive_function()
Filename: /home/obriens/Documents/websites/mcgill_site/docs/Tutorials/Python/function_to_profile.py Line # Mem usage Increment Occurrences Line Contents ============================================================= 3 455.5 MiB 455.5 MiB 1 def expensive_function(): 4 5 455.5 MiB 0.0 MiB 103 x = [ 120 for i in range(100) ] 6 531.0 MiB 75.5 MiB 1003 b = [ x * 100 for i in range(1000) ] 7 533.7 MiB 2.6 MiB 100003 big_list = [ i for i in range(10**5) ] 8 538.0 MiB 4.4 MiB 100003 another_big_list = [ i for i in range(10**5) ] 9 538.0 MiB 0.0 MiB 1 time.sleep(0.01) 10 11 538.0 MiB 0.0 MiB 1003 x = [ 42 for i in range(1000) ]
This can often be difficult to parse and understand in a Jupyter notebook. I prefer to use `memory_profiler` via the command line interface (which we can still call from Jupyter!).
%%writefile initial_code.py
import numpy as np
import pandas as pd
import time
@profile
def load_data():
df = pd.read_csv("./gaia3.csv")
return df
@profile
def filter_data(df):
parallax = df["parallax"]
parallax_err = df["parallax_over_error"]
good_mask = parallax_err > 20
return good_mask
@profile
def get_distance(df):
distance = 1./ (df["parallax"] * 1e-3)
return distance
@profile
def assign_distance(df, distance):
df["distance"] = distance
@profile
def filter_distance(df):
distance = df["distance"]
good_mask = distance < 10
return good_mask
@profile
def apply_filter(df, data_filter):
data = df[data_filter]
return data
if __name__ == '__main__':
data = load_data()
time.sleep(0.1)
data_filter = filter_data(data)
time.sleep(0.1)
data_filtered = apply_filter(data, data_filter)
time.sleep(0.1)
distance = get_distance(data_filtered)
time.sleep(0.1)
assign_distance(data_filtered, distance)
time.sleep(0.1)
distance_filter = filter_distance(data_filtered)
time.sleep(0.1)
data_10pc = apply_filter(data_filtered, distance_filter)
time.sleep(0.1)
Overwriting initial_code.py
!mprof run --interval 0.005 --python python initial_code.py
!mprof plot --output initial_code_profile.png
mprof: Sampling memory every 0.005s running new process running as a Python program... initial_code.py:25: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df["distance"] = distance Using last profile data.
from IPython.display import Image, display
display(Image("initial_code_profile.png"))
From the above we can see a few things:
- The file we read in is pretty big; the dataframe takes up ~250 MiB.
- Filtering the data has a negligible effect on the memory usage.
- We see a huge spike when calling `apply_filter` for the first time.
- We see a second, smaller spike when calling `get_distance`.
A plot like this is useful when we're trying to decide on what resources we need to run this program. We could likely run this on most machines as ~400 MiB is attainable by most personal laptops. If we wanted to submit this job to a cluster, we'd want at least ~450 MiB to provide some breathing room.
If we want to target areas to reduce memory usage, we should focus on the `apply_filter` and `get_distance` functions. Additionally, we can look at ways to reduce the dataset size (we'll come back to this...).
Let's look at `apply_filter`:
@profile
def apply_filter(df, data_filter):
data = df[data_filter]
return data
...
data_filtered = apply_filter(data, data_filter)
...
data_10pc = apply_filter(data_filtered, distance_filter)
...
We are using `apply_filter` twice. This function takes in a `pandas.DataFrame` and a boolean array and returns a new `pandas.DataFrame`. Let's start by overwriting the data inside the function:
@profile
def apply_filter(df, data_filter):
df = df[data_filter]
return df
Additionally, we'll overwrite the `data` variable at each step rather than keeping separate copies:
%%writefile initial_code_iteration1.py
import numpy as np
import pandas as pd
import time
@profile
def load_data():
df = pd.read_csv("./gaia3.csv")
return df
@profile
def filter_data(df):
parallax = df["parallax"]
parallax_err = df["parallax_over_error"]
good_mask = parallax_err > 20
return good_mask
@profile
def get_distance(df):
distance = 1./ (df["parallax"] * 1e-3)
return distance
@profile
def assign_distance(df, distance):
df["distance"] = distance
@profile
def filter_distance(df):
distance = df["distance"]
good_mask = distance < 10
return good_mask
@profile
def apply_filter(df, data_filter):
df = df[data_filter]
return df
if __name__ == '__main__':
data = load_data()
time.sleep(0.1)
data_filter = filter_data(data)
time.sleep(0.1)
data = apply_filter(data, data_filter)
time.sleep(0.1)
distance = get_distance(data)
time.sleep(0.1)
assign_distance(data, distance)
time.sleep(0.1)
distance_filter = filter_distance(data)
time.sleep(0.1)
data = apply_filter(data, distance_filter)
time.sleep(0.1)
Overwriting initial_code_iteration1.py
!mprof run --interval 0.005 --python python initial_code_iteration1.py
!mprof plot --output initial_code_iteration1.png
mprof: Sampling memory every 0.005s running new process running as a Python program... Using last profile data.
This is an improvement, but we still see a spike at ~400 MiB when `apply_filter` is called for the first time. Why is this? Let's look at the function again:
def apply_filter(df, data_filter):
df = df[data_filter]
return df
When calling `df = df[data_filter]` we first make a copy of the data with `df[data_filter]` before assigning it to `df`. Let's look at a more memory-efficient way to do this:
def apply_filter(df, data_filter):
    # Assign NaN to the RA of the rows we want to drop (those outside the mask)
    df["ra"][~data_filter] = np.nan
    # Drop the entries with NaN; inplace = True means overwrite the dataframe
    df.dropna(inplace = True)
Since we've overwritten the original dataframe, we don't need to return anything...
%%writefile initial_code_iteration2.py
import numpy as np
import pandas as pd
import time
@profile
def load_data():
df = pd.read_csv("./gaia3.csv")
return df
@profile
def filter_data(df):
parallax = df["parallax"]
parallax_err = df["parallax_over_error"]
good_mask = parallax_err > 20
return good_mask
@profile
def get_distance(df):
distance = 1./ (df["parallax"] * 1e-3)
return distance
@profile
def assign_distance(df, distance):
df["distance"] = distance
@profile
def filter_distance(df):
distance = df["distance"]
good_mask = distance < 10
return good_mask
@profile
def apply_filter(df, data_filter):
    # Assign NaN to the RA of the rows we want to drop (those outside the mask)
    df["ra"][~data_filter] = np.nan
# Drop the entries with NaN, inplace = True meaning overwrite
df.dropna(inplace = True)
return df
if __name__ == '__main__':
data = load_data()
time.sleep(0.1)
data_filter = filter_data(data)
time.sleep(0.1)
apply_filter(data, data_filter)
time.sleep(0.1)
distance = get_distance(data)
time.sleep(0.1)
assign_distance(data, distance)
time.sleep(0.1)
distance_filter = filter_distance(data)
time.sleep(0.1)
apply_filter(data, distance_filter)
time.sleep(0.1)
Overwriting initial_code_iteration2.py
!mprof run --interval 0.005 --python python initial_code_iteration2.py
!mprof plot --output initial_code_iteration2.png
mprof: Sampling memory every 0.005s running new process running as a Python program... initial_code_iteration2.py:39: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0! You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy. A typical example is when you are setting values in a column of a DataFrame, like: df["col"][row_indexer] = value Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`. See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df["ra"][data_filter] = np.nan initial_code_iteration2.py:39: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df["ra"][data_filter] = np.nan initial_code_iteration2.py:39: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0! You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy. A typical example is when you are setting values in a column of a DataFrame, like: df["col"][row_indexer] = value Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`. See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df["ra"][data_filter] = np.nan initial_code_iteration2.py:39: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df["ra"][data_filter] = np.nan Using last profile data.
This looks a lot better, but we still have the initial overhead of reading in the dataset. Is there anything we can do here? Let's look at the data.
!head gaia3.csv
,ra,dec,parallax,parallax_over_error,bp_rp,phot_g_mean_mag,pm,pmra,pmra_error,pmdec,pmdec_error 0,251.49255411977842,-50.74953285089441,6.383755900057115,103.153015,0.79550457,10.54673,32.09387,31.380714480358343,0.05963018,-6.728077202639662,0.045103732 1,251.58997660680922,-50.63863518231249,13.398257319108815,750.62354,2.1655512,13.470652,98.00884,-49.35844067323052,0.017024113,-84.67277009494555,0.012675992 2,251.09313867055343,-51.007071500182086,6.106723881526058,448.8808,1.2691355,12.529914,9.986012,5.650983532208639,0.013577581,8.233275210204,0.009894044 3,251.74646120955111,-51.20105760802903,7.481941724988636,87.46144,2.5889482,16.969418,147.46635,10.335826592962476,0.08570085,-147.10370008642158,0.05650899 4,250.18815359198393,-51.2730504962615,8.606532814900987,163.01225,2.755577,16.33911,5.618203,-1.3052739918459513,0.06419676,5.464472928347609,0.04363508 5,250.32171790439398,-51.39721173654795,5.948736623759927,103.88973,0.010123253,7.615107,31.96969,-16.618858274260234,0.06304326,-27.31070564371402,0.04115518 6,250.63769352342914,-51.83570545712104,25.8932028451182,1238.9082,2.691968,13.436164,434.55444,-58.46042596499814,0.021226894,-430.60414864152074,0.015450544 7,250.1119445586245,-51.26003101959269,6.36085302396653,131.53783,2.53827,16.118967,77.69121,-52.38320393098749,0.049328525,-57.37528741330969,0.037503965 8,250.18507947208107,-51.477673785228106,19.588654431566606,768.66254,0.44006157,6.22464,131.83145,11.454067954459212,0.028156618,131.33292721367764,0.019740378
We have a lot of precision in our measurements. Let's compare the measured values to the uncertainty on those values, limiting ourselves to the first 100 rows. By default, `pandas` will use 64-bit precision when reading in floats. This might be overkill.
import pandas as pd
df = pd.read_csv("gaia3.csv", nrows=100)
df.head()
Unnamed: 0 | ra | dec | parallax | parallax_over_error | bp_rp | phot_g_mean_mag | pm | pmra | pmra_error | pmdec | pmdec_error | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 251.492554 | -50.749533 | 6.383756 | 103.153015 | 0.795505 | 10.546730 | 32.093870 | 31.380714 | 0.059630 | -6.728077 | 0.045104 |
1 | 1 | 251.589977 | -50.638635 | 13.398257 | 750.623540 | 2.165551 | 13.470652 | 98.008840 | -49.358441 | 0.017024 | -84.672770 | 0.012676 |
2 | 2 | 251.093139 | -51.007072 | 6.106724 | 448.880800 | 1.269135 | 12.529914 | 9.986012 | 5.650984 | 0.013578 | 8.233275 | 0.009894 |
3 | 3 | 251.746461 | -51.201058 | 7.481942 | 87.461440 | 2.588948 | 16.969418 | 147.466350 | 10.335827 | 0.085701 | -147.103700 | 0.056509 |
4 | 4 | 250.188154 | -51.273050 | 8.606533 | 163.012250 | 2.755577 | 16.339110 | 5.618203 | -1.305274 | 0.064197 | 5.464473 | 0.043635 |
Do we need this precision? In this case it looks like the errors on our data are already pretty large.
import numpy as np
df_f32 = pd.read_csv("gaia3.csv", nrows=100, dtype=np.float32)
df_f32.head()
Unnamed: 0 | ra | dec | parallax | parallax_over_error | bp_rp | phot_g_mean_mag | pm | pmra | pmra_error | pmdec | pmdec_error | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 251.492554 | -50.749535 | 6.383756 | 103.153015 | 0.795505 | 10.546730 | 32.093868 | 31.380714 | 0.059630 | -6.728077 | 0.045104 |
1 | 1.0 | 251.589981 | -50.638634 | 13.398257 | 750.623535 | 2.165551 | 13.470652 | 98.008842 | -49.358440 | 0.017024 | -84.672768 | 0.012676 |
2 | 2.0 | 251.093140 | -51.007072 | 6.106724 | 448.880798 | 1.269135 | 12.529914 | 9.986012 | 5.650983 | 0.013578 | 8.233275 | 0.009894 |
3 | 3.0 | 251.746460 | -51.201057 | 7.481942 | 87.461441 | 2.588948 | 16.969418 | 147.466354 | 10.335827 | 0.085701 | -147.103699 | 0.056509 |
4 | 4.0 | 250.188156 | -51.273052 | 8.606533 | 163.012253 | 2.755577 | 16.339109 | 5.618203 | -1.305274 | 0.064197 | 5.464473 | 0.043635 |
import matplotlib.pyplot as plt
plt.hist(df["parallax"] - df_f32["parallax"])
df_f16 = pd.read_csv("gaia3.csv", nrows=100, dtype=np.float16)
df_f16.head()
Unnamed: 0 | ra | dec | parallax | parallax_over_error | bp_rp | phot_g_mean_mag | pm | pmra | pmra_error | pmdec | pmdec_error | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 251.500 | -50.75000 | 6.382812 | 103.1250 | 0.795410 | 10.546875 | 32.093750 | 31.375000 | 0.059631 | -6.726562 | 0.045105 |
1 | 1.0 | 251.625 | -50.62500 | 13.398438 | 750.5000 | 2.166016 | 13.468750 | 98.000000 | -49.343750 | 0.017029 | -84.687500 | 0.012672 |
2 | 2.0 | 251.125 | -51.00000 | 6.105469 | 449.0000 | 1.269531 | 12.531250 | 9.984375 | 5.652344 | 0.013580 | 8.234375 | 0.009895 |
3 | 3.0 | 251.750 | -51.18750 | 7.480469 | 87.4375 | 2.589844 | 16.968750 | 147.500000 | 10.335938 | 0.085693 | -147.125000 | 0.056519 |
4 | 4.0 | 250.250 | -51.28125 | 8.609375 | 163.0000 | 2.755859 | 16.343750 | 5.617188 | -1.305664 | 0.064209 | 5.464844 | 0.043640 |
plt.hist(df["parallax"] - df_f16["parallax"])
For a 32-bit float the uncertainty introduced is O($10^{-7}$), whereas for a 16-bit float the uncertainty is O($10^{-3}$). In this dataset the measurement errors are larger than either, so we have some room to reduce precision.
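We can also ask pandas directly how much memory each of these (100-row) dataframes uses; a quick sketch using `memory_usage`:
# compare the in-memory size of the three precisions we just loaded
for name, frame in [("float64", df), ("float32", df_f32), ("float16", df_f16)]:
    print(name, frame.memory_usage(deep=True).sum(), "bytes")
With that in mind, let's start by using 32-bit floats when creating the full dataframe: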
def load_data():
df = pd.read_csv("./gaia3.csv", dtype = np.float32)
return df
%%writefile initial_code_iteration3.py
import numpy as np
import pandas as pd
import time
@profile
def load_data():
df = pd.read_csv("./gaia3.csv", dtype = np.float32)
return df
@profile
def filter_data(df):
parallax = df["parallax"]
parallax_err = df["parallax_over_error"]
good_mask = parallax_err > 20
return good_mask
@profile
def get_distance(df):
distance = 1./ (df["parallax"] * 1e-3)
return distance
@profile
def assign_distance(df, distance):
df["distance"] = distance
@profile
def filter_distance(df):
distance = df["distance"]
good_mask = distance < 10
return good_mask
@profile
def apply_filter(df, data_filter):
    # Assign NaN to the RA of the rows we want to drop (those outside the mask)
    df["ra"][~data_filter] = np.nan
# Drop the entries with NaN, inplace = True meaning overwrite
df.dropna(inplace = True)
return df
if __name__ == '__main__':
data = load_data()
time.sleep(0.1)
data_filter = filter_data(data)
time.sleep(0.1)
apply_filter(data, data_filter)
time.sleep(0.1)
distance = get_distance(data)
time.sleep(0.1)
assign_distance(data, distance)
time.sleep(0.1)
distance_filter = filter_distance(data)
time.sleep(0.1)
apply_filter(data, distance_filter)
time.sleep(0.1)
Overwriting initial_code_iteration3.py
!mprof run --interval 0.005 --python python initial_code_iteration3.py
!mprof plot --output initial_code_iteration3.png
mprof: Sampling memory every 0.005s running new process running as a Python program... initial_code_iteration3.py:39: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0! You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy. A typical example is when you are setting values in a column of a DataFrame, like: df["col"][row_indexer] = value Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`. See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df["ra"][data_filter] = np.nan initial_code_iteration3.py:39: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0! You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy. A typical example is when you are setting values in a column of a DataFrame, like: df["col"][row_indexer] = value Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`. See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df["ra"][data_filter] = np.nan Using last profile data.
We see a big improvement, with the peak memory usage < 250 MiB. Since we're limited by the uncertainty in the measurements anyway, we might decide to lower our precision again:
def load_data():
df = pd.read_csv("./gaia3.csv", dtype = np.float16)
return df
%%writefile initial_code_iteration4.py
import numpy as np
import pandas as pd
import time
@profile
def load_data():
df = pd.read_csv("./gaia3.csv", dtype = np.float16)
return df
@profile
def filter_data(df):
parallax = df["parallax"]
parallax_err = df["parallax_over_error"]
good_mask = parallax_err > 20
return good_mask
@profile
def get_distance(df):
distance = 1./ (df["parallax"] * 1e-3)
return distance
@profile
def assign_distance(df, distance):
df["distance"] = distance
@profile
def filter_distance(df):
distance = df["distance"]
good_mask = distance < 10
return good_mask
@profile
def apply_filter(df, data_filter):
    # Assign NaN to the RA of the rows we want to drop (those outside the mask)
    df["ra"][~data_filter] = np.nan
# Drop the entries with NaN, inplace = True meaning overwrite
df.dropna(inplace = True)
return df
if __name__ == '__main__':
data = load_data()
time.sleep(0.1)
data_filter = filter_data(data)
time.sleep(0.1)
apply_filter(data, data_filter)
time.sleep(0.1)
distance = get_distance(data)
time.sleep(0.1)
assign_distance(data, distance)
time.sleep(0.1)
distance_filter = filter_distance(data)
time.sleep(0.1)
apply_filter(data, distance_filter)
time.sleep(0.1)
Overwriting initial_code_iteration4.py
!mprof run --interval 0.005 --python python initial_code_iteration4.py
!mprof plot --output initial_code_iteration4.png
mprof: Sampling memory every 0.005s running new process running as a Python program... /home/obriens/miniforge3/envs/workshop/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py:234: RuntimeWarning: overflow encountered in cast chunks = self._reader.read_low_memory(nrows) initial_code_iteration4.py:39: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0! You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy. A typical example is when you are setting values in a column of a DataFrame, like: df["col"][row_indexer] = value Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`. See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df["ra"][data_filter] = np.nan initial_code_iteration4.py:39: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0! You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy. A typical example is when you are setting values in a column of a DataFrame, like: df["col"][row_indexer] = value Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`. See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df["ra"][data_filter] = np.nan Using last profile data.
Before we even start reading in the dataframe we are already using ~75 MiB. Can we reduce this? NumPy is a pretty big library; let's only import what we need.
%%writefile initial_code_iteration5.py
from numpy import float16, nan
import pandas as pd
from time import sleep
@profile
def load_data():
df = pd.read_csv("./gaia3.csv", dtype = float16)
return df
@profile
def filter_data(df):
parallax = df["parallax"]
parallax_err = df["parallax_over_error"]
good_mask = parallax_err > 20
return good_mask
@profile
def get_distance(df):
distance = 1./ (df["parallax"] * 1e-3)
return distance
@profile
def assign_distance(df, distance):
df["distance"] = distance
@profile
def filter_distance(df):
distance = df["distance"]
good_mask = distance < 10
return good_mask
@profile
def apply_filter(df, data_filter):
    # Assign NaN to the RA of the rows we want to drop (those outside the mask)
    df["ra"][~data_filter] = nan
# Drop the entries with NaN, inplace = True meaning overwrite
df.dropna(inplace = True)
return df
if __name__ == '__main__':
data = load_data()
sleep(0.1)
data_filter = filter_data(data)
sleep(0.1)
apply_filter(data, data_filter)
sleep(0.1)
distance = get_distance(data)
sleep(0.1)
assign_distance(data, distance)
sleep(0.1)
distance_filter = filter_distance(data)
sleep(0.1)
apply_filter(data, distance_filter)
sleep(0.1)
Overwriting initial_code_iteration5.py
!mprof run --interval 0.005 --python python initial_code_iteration5.py
!mprof plot --output initial_code_iteration5.png
mprof: Sampling memory every 0.005s running new process running as a Python program... /home/obriens/miniforge3/envs/workshop/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py:234: RuntimeWarning: overflow encountered in cast chunks = self._reader.read_low_memory(nrows) initial_code_iteration5.py:39: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0! You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy. A typical example is when you are setting values in a column of a DataFrame, like: df["col"][row_indexer] = value Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`. See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df["ra"][data_filter] = nan initial_code_iteration5.py:39: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0! You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy. A typical example is when you are setting values in a column of a DataFrame, like: df["col"][row_indexer] = value Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`. See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df["ra"][data_filter] = nan Using last profile data.
This brings us below 175 MiB. There are other steps we could look at:
- Running in batches to reduce the overall usage (sketched below).
- Only loading in the columns needed for the analysis (also sketched below).
- Using a different package/backend.
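A rough sketch of the first two ideas, reading only the columns we need and filtering the file chunk by chunk (the column names come from the header we inspected earlier; the chunk size is an arbitrary choice):
import pandas as pd
cols = ["ra", "parallax", "parallax_over_error"]
chunks = []
for chunk in pd.read_csv("./gaia3.csv", usecols=cols, dtype="float32", chunksize=100_000):
    # filter each chunk as it is read, so the full raw table is never held in memory
    chunks.append(chunk[chunk["parallax_over_error"] > 20])
data = pd.concat(chunks, ignore_index=True)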
Summary¶
We've managed to reduce the memory usage from ~450 MiB to < 175 MiB. We achieved this by:
- Profiling the memory usage to see what to target
- Avoiding copying the data
- Using built-in functions to drop elements in place rather than making copies
- Reducing the precision of the data we're using
- Reducing the import overhead
There is still a lot of redundancy in the above code, and we could explore other options:
- Rewriting parts of the code in more memory-efficient languages (Rust, C++).
- Reducing the import overhead by reducing the complexity of the code and doing more steps in pure Python.
- Using more memory-efficient libraries (e.g. Polars).
- Splitting the dataset into chunks and loading in smaller pieces at a time.
We also need to consider the development cost in finding low-memory solutions. Reducing the memory usage by 10% might be better than spending hours trying to reduce it by 20%.
The methods we've used here also have trade-offs:
- Going from f64 -> f16 is a lossy form of compression. We have lost information that we cannot simply get back.
- We've thrown away data after it's read in and filtered. We can no longer compare to the original dataset without rewriting code.
- We've decreased the complexity of our imports; we can no longer access functions like `numpy.sum` without explicitly importing them first.
When working with clusters we can use tools such as `memory_profiler` to estimate the requirements of the job we're about to submit. This can prevent our job from crashing because we haven't requested enough memory.
Let's look at another common pattern. A function containing an `if` statement works fine on scalars, but fails when passed a whole NumPy array:
def if_function(x):
if x < 0:
return 0
else:
return np.sqrt(x)
x = np.random.random(10**5)
%timeit y = if_function(x)
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[111], line 1 ----> 1 get_ipython().run_line_magic('timeit', 'y = if_function(x)') File ~/miniforge3/envs/workshop/lib/python3.10/site-packages/IPython/core/interactiveshell.py:2480, in InteractiveShell.run_line_magic(self, magic_name, line, _stack_depth) 2478 kwargs['local_ns'] = self.get_local_scope(stack_depth) 2479 with self.builtin_trap: -> 2480 result = fn(*args, **kwargs) 2482 # The code below prevents the output from being displayed 2483 # when using magics with decorator @output_can_be_silenced 2484 # when the last Python token in the expression is a ';'. 2485 if getattr(fn, magic.MAGIC_OUTPUT_CAN_BE_SILENCED, False): File ~/miniforge3/envs/workshop/lib/python3.10/site-packages/IPython/core/magics/execution.py:1185, in ExecutionMagics.timeit(self, line, cell, local_ns) 1183 for index in range(0, 10): 1184 number = 10 ** index -> 1185 time_number = timer.timeit(number) 1186 if time_number >= 0.2: 1187 break File ~/miniforge3/envs/workshop/lib/python3.10/site-packages/IPython/core/magics/execution.py:173, in Timer.timeit(self, number) 171 gc.disable() 172 try: --> 173 timing = self.inner(it, self.timer) 174 finally: 175 if gcold: File <magic-timeit>:1, in inner(_it, _timer) Cell In[109], line 2, in if_function(x) 1 def if_function(x): ----> 2 if x < 0: 3 return 0 4 else: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
We might be tempted to simply rewrite this function with a for loop. But remember, Python is terrible when looping...
def loop_function(x):
ret = np.zeros(len(x))
for i in range(len(x)):
if x[i] < 0:
ret[i] = 0
else:
ret[i] = np.sqrt(x[i])
return ret
%timeit y = loop_function(x)
65.5 ms ± 523 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
vec_function = np.vectorize(if_function)
%timeit y = vec_function(x)
54.3 ms ± 422 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
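`np.vectorize` is essentially a convenience wrapper around a Python loop, which is why it is barely faster than the explicit loop above. A truly vectorized version of the same logic, sketched here with `np.where`, avoids the per-element `if` entirely:
def where_function(x):
    # clip to zero first so we never take the square root of a negative value
    return np.where(x < 0, 0.0, np.sqrt(np.clip(x, 0, None)))
y = where_function(x)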
Caching frequently used function calls¶
When you're expecting to repeat function calls with the same arguments, cache the results.
# first-in-first-out (FIFO) cache decorator
def fifo(func):
    # Define the cache size
    cache_size = 5
    # Use a dictionary as the cache (dicts preserve insertion order)
    cache = {}
    def wrapper(*args, **kwargs):
        # Tokenize the arguments
        token = ""
        for arg in args:
            token += str(arg)
        for key, value in kwargs.items():
            token += str(key) + str(value)
        # If we have already cached this token, return the stored result
        if token in cache:
            return cache[token]
        # Call the function and cache the result
        result = func(*args, **kwargs)
        cache[token] = result
        # If the cache is full, remove the oldest entry
        if len(cache) > cache_size:
            del cache[next(iter(cache))]
        # Return the result of the function
        return result
    return wrapper
def multiply_string(my_string, rept):
time.sleep(0.5)
return my_string * rept
%time multiply_string("hello", 5)
CPU times: user 1.07 ms, sys: 33 µs, total: 1.1 ms Wall time: 503 ms
'hellohellohellohellohello'
%time multiply_string("hello", 5)
CPU times: user 1.25 ms, sys: 37 µs, total: 1.29 ms Wall time: 502 ms
'hellohellohellohellohello'
@fifo
def multiply_string_cache(my_string, rept):
time.sleep(0.5)
return my_string * rept
%time multiply_string_cache("goodbye", 5)
CPU times: user 1.2 ms, sys: 36 µs, total: 1.24 ms Wall time: 503 ms
'goodbyegoodbyegoodbyegoodbyegoodbye'
%time multiply_string_cache("goodbye", 5)
CPU times: user 8 µs, sys: 0 ns, total: 8 µs Wall time: 13.1 µs
'goodbyegoodbyegoodbyegoodbyegoodbye'
Now that we know how a cache works, we can use Python's built-in implementations. Let's use a Least Recently Used (LRU) cache. This tracks how recently each set of arguments was used; when the cache fills up, the least recently used item is removed.
from functools import lru_cache
@lru_cache(maxsize=5)
def multiply_string_lru(my_string, rept):
time.sleep(0.5)
return my_string * rept
%time multiply_string_lru("goodbye", 5)
CPU times: user 952 µs, sys: 29 µs, total: 981 µs Wall time: 503 ms
'goodbyegoodbyegoodbyegoodbyegoodbye'
%time multiply_string_lru("goodbye", 5)
CPU times: user 4 µs, sys: 0 ns, total: 4 µs Wall time: 7.39 µs
'goodbyegoodbyegoodbyegoodbyegoodbye'
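`lru_cache` also provides some introspection helpers, which are handy for checking whether the cache is actually being hit:
print(multiply_string_lru.cache_info())   # hits, misses, maxsize, currsize
multiply_string_lru.cache_clear()         # empty the cache if needed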
Already have working code from a more efficient language? Import it!¶
Python has lots of tools for importing code written in other languages.
def py_fib(x):
if x in [0,1] :
return 1
else:
return py_fib(x - 1) + py_fib(x - 2)
%timeit py_fib(10)
9.06 µs ± 93.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
from numba import jit
@jit( nopython = True )
def py_fib(x):
if x in [0,1] :
return 1
else:
return py_fib(x - 1) + py_fib(x - 2)
py_fib(10)
89
%timeit py_fib(10)
362 ns ± 3.35 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
We can use packages like Cython to write compiled C/C++ extensions using Python-like syntax. Or we can directly import a precompiled library. We can make C++ functions accessible in Python by wrapping our code in an `extern "C"` block.
extern "C" {
double my_function(... some arguments...){
... some code ...
}
}
Why `extern "C"`? One of the differences between C and C++ is that C++ allows us to overload functions: two functions can have the same name but accept different arguments. For example, let's say we have some functionality that we would like to make common across different data types:
double get_sqrt_integer( int a );
double get_sqrt_double( double a );
The user shouldn't need to worry about whether the argument is an `int` or a `double`, so we would "overload" the function like so:
double get_sqrt( int a );
double get_sqrt( double a );
Now the user only ever calls `get_sqrt`, and the compiler handles the logic of whether the argument is a `double` or an `int`. The compiler does this by giving each version of the function a unique name generated at compile time (name mangling).
By using `extern "C"` we revert to the C convention, which assigns exactly the name we give the function, but it means we can no longer overload it. We would therefore need to write:
extern "C" {
double get_sqrt_integer( int a );
double get_sqrt_double( double a );
}
When inspecting the compiled library we would see the two functions `get_sqrt_integer` and `get_sqrt_double`.
%%writefile cpp_fib.cpp
extern "C" {
int cpp_fib(unsigned int x)
{
if ( (x == 0) || (x == 1) )
{
return 1;
}
else
{
return cpp_fib(x-1) + cpp_fib(x-2);
}
}
}
Overwriting cpp_fib.cpp
Compile to a shared library
# mamba / conda install gxx
# or
# mamba /conda install gxx-compiler
! g++ -fPIC -shared -o cpp_fib.so cpp_fib.cpp -O3
! ls cpp_*
cpp_fib.cpp cpp_fib.so
Load in the library with `ctypes` and "translate" the argument types:
import ctypes
cpp_lib = ctypes.CDLL("./cpp_fib.so")
# Access the cpp_fib function from the library
# and assign the argument type
cpp_lib.cpp_fib.argtypes = [ctypes.c_uint]
# Alias it to cpp_fib
cpp_fib = cpp_lib.cpp_fib
%timeit cpp_fib(10)
273 ns ± 4.17 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
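By default `ctypes` assumes a C `int` return type, which happens to match `cpp_fib`; for any other return type we would also need to set `restype`. It's also worth sanity-checking the wrapped function against the pure-Python version:
cpp_lib.cpp_fib.restype = ctypes.c_int  # explicit here, although int is the default
assert cpp_fib(10) == 89                # matches py_fib(10) from above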
We can also load in Rust code using rustimport
mamba/conda install rustimport
And the `pyo3` Rust crate:
%load_ext rustimport
Use `--release` to compile an optimized binary rather than a debug binary.
%%rustimport --release --force
use pyo3::prelude::*;
#[pyfunction]
fn rust_fib(x: usize) -> usize {
match x {
1 => 1,
0 => 1,
_ => rust_fib(x - 1) + rust_fib(x - 2)
}
}
Updating crates.io index Downloading crates ... Downloaded syn v2.0.64 Compiling target-lexicon v0.12.14 Compiling once_cell v1.19.0 Compiling autocfg v1.3.0 Compiling proc-macro2 v1.0.82 Compiling unicode-ident v1.0.12 Compiling libc v0.2.154 Compiling parking_lot_core v0.9.10 Compiling portable-atomic v1.6.0 Compiling cfg-if v1.0.0 Compiling scopeguard v1.2.0 Compiling smallvec v1.13.2 Compiling heck v0.4.1 Compiling indoc v2.0.5 Compiling unindent v0.2.3 Compiling lock_api v0.4.12 Compiling memoffset v0.9.1 Compiling pyo3-build-config v0.21.2 Compiling quote v1.0.36 Compiling syn v2.0.64 Compiling parking_lot v0.12.2 Compiling pyo3-ffi v0.21.2 Compiling pyo3 v0.21.2 Compiling pyo3-macros-backend v0.21.2 Compiling pyo3-macros v0.21.2 Compiling _magic_4716e3197ba1bc8f v0.1.0 (/tmp/rustimport/_magic_4716e3197ba1bc8f-860e1651e5b8e95bc6f26cfde7bdc78b/_magic_4716e3197ba1bc8f) Finished release [optimized] target(s) in 8.30s
%timeit rust_fib(10)
168 ns ± 19.9 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
# fib_sqe = [py_fib(i) for i in range(10)]
# fib_sqe
# fib_sqe = [cpp_fib(i) for i in range(10)]
# fib_sqe
# fib_sqe = [rust_fib(i) for i in range(10)]
# fib_sqe
Don't reinvent the wheel!¶
Python is one of the most popular languages, and a lot of talented people have spent a lot of time optimizing Python and wrapping lower-level languages in Python code. Use libraries such as:
- NumPy
- SciPy
- Pandas