
cProfile

There are quite a few profiling tools for Python, like the popular Scalene. But I like Python for its simplicity and prefer to use the "default" tools from the standard Python toolchain - so cProfile is my choice.
It doesn't require fine-grained code annotation and is very simple to use.
It shows stats on how much time is spent executing each function, and the output is easy to read and interpret.

Run

For example, the following command will profile every function executed while running main.py and dump the stats to main.prof. (The -s cumulative flag sorts output printed to stdout by cumulative time; when -o is given, the stats go to the file instead and can be sorted at analysis time.)

python -m cProfile -s cumulative -o main.prof main.py
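
The dump can then be loaded and sorted with the standard pstats module; a quick sketch (main.prof and the top-15 cutoff are just the names used here):

import pstats

# Load the dump produced by cProfile and print the 15 entries
# with the largest cumulative time.
stats = pstats.Stats("main.prof")
stats.sort_stats("cumtime").print_stats(15)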

Or you can profile only certain function calls by changing your code like this:

import cProfile, pstats
...

if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.enable()               # start collecting stats
    foo(inp)                        # the code you actually want to profile
    profiler.disable()              # stop collecting stats
    stats = pstats.Stats(profiler).sort_stats('cumtime')
    stats.print_stats()             # print the summary to stdout
    stats.dump_stats("main.prof")   # save for later analysis/visualization
    ...
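
On Python 3.8+, the same can be written a bit more compactly, since Profile also works as a context manager (foo and inp are placeholders, as above):

import cProfile, pstats
...

if __name__ == "__main__":
    # stats are collected only inside the with-block
    with cProfile.Profile() as profiler:
        foo(inp)
    stats = pstats.Stats(profiler).sort_stats('cumtime')
    stats.print_stats()
    stats.dump_stats("main.prof")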

And then run:

python main.py

Here is an example of the summary that gets printed for PyTorch Stable Diffusion model inference code (in the ncalls column, a value like 47918/75 means the function recursed: 47918 total calls, 75 primitive ones):

         592917 function calls (534423 primitive calls) in 12.643 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   12.646   12.646 sd_prof.py:20(run)
        1    0.000    0.000   12.646   12.646 .../torch/autograd/grad_mode.py:24(decorate_context)
        1    0.005    0.005   12.646   12.646 .../diffusers/stable_diffusion/pipeline_stable_diffusion.py:408(__call__)
       71    0.000    0.000    8.350    0.118 .../diffusers/schedulers/scheduling_pndm.py:192(step)
       71    0.041    0.001    8.349    0.118 .../diffusers/schedulers/scheduling_pndm.py:278(step_plms)
       71    8.302    0.117    8.307    0.117 .../diffusers/schedulers/scheduling_pndm.py:358(_get_prev_sample)
 47918/75    0.111    0.000    4.083    0.054 .../torch/nn/modules/module.py:1104(_call_impl)
       71    0.010    0.000    4.033    0.057 .../diffusers/models/unet_2d_condition.py:337(forward)
      213    0.007    0.000    2.811    0.013 .../diffusers/models/unet_2d_blocks.py:1528(forward)
     1136    0.041    0.000    2.260    0.002 .../diffusers/models/attention.py:178(forward)
     1136    0.160    0.000    1.782    0.002 .../diffusers/models/attention.py:479(forward)
     1576    0.176    0.000    1.530    0.001 .../diffusers/models/resnet.py:454(forward)
     2272    0.054    0.000    1.215    0.001 .../diffusers/models/attention.py:592(forward)
    15616    0.027    0.000    0.970    0.000 .../torch/nn/modules/linear.py:102(forward)
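
To dig into a particular hot spot from the table, pstats can also show a function's callers and callees; a quick sketch, reusing the stats object from the snippets above (the string argument is a regex filter on function names):

# Who calls _get_prev_sample, and what does it spend time on itself?
stats.print_callers("_get_prev_sample")
stats.print_callees("_get_prev_sample")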

To visualize the result, you can use gprof2dot, which draws the call graph.

First, install graphviz:

apt-get install graphviz

or

yum install graphviz

Then install gprof2dot from PyPI:

pip install gprof2dot

And finally, you can generate the graph from the profile file produced by cProfile:

gprof2dot -f pstats main.prof | dot -Tpng -o output.png

The example result looks like this:

Percentages in the picture represent the total time spent in the function together with all its callees, and the self time spent in the function itself. See the short and clear explanation of how to interpret these boxes on the project's GitHub page.
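
If the graph is too busy, gprof2dot can prune insignificant nodes and edges via its -n/--node-thres and -e/--edge-thres percentage options; for example (the thresholds here are arbitrary):

gprof2dot -f pstats -n 2 -e 1 main.prof | dot -Tpng -o output.png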

Results review

Using the text and graph outputs for the Stable Diffusion model above, we can give a short summary:

  • (the obvious one) the fewer reconstruction iterations, the faster ;)
  • apparently, ~2/3 of the time is spent in the "scheduler", inside the _get_prev_sample function, which does a lot of matrix multiplications and square roots
  • most of the remaining time is under the UNet (which makes sense - it's the core part), and a lot of it appears to happen inside the attention blocks' forward passes
  • a slightly misleading point: most operations in torch go through the torch.nn.Module class, so a large number of calls there is expected

It makes sense to start optimizing by applying, say, flash attention, since a large share of the time was spent running the UNet's attention layers.
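
As a minimal sketch of that first step, assuming a recent diffusers version with the optional xformers package installed (the model id below is just an example, not necessarily the one profiled above):

import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion pipeline (example model id).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Replace the default attention with a memory-efficient
# (flash-style) implementation provided by xformers.
pipe.enable_xformers_memory_efficient_attention()

Re-running cProfile after a change like this shows whether the time under the attention forward passes actually went down.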