Julia Community 🟣

Neven

Profile a short-running function

Recently I went to profile a short-running function. Several Julia packages, ProfileCanvas.jl among them, provide a way to collect a CPU execution profile, often via a macro invoked like @profview expr_to_profile.

One problem that can occur is that a single invocation of expr_to_profile finishes so quickly that too few samples are collected, leaving the profile useless. Here's a quick-and-dirty hack to work around this problem. Criticize it in the comments below!
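As a rough back-of-the-envelope check, the number of repetitions needed for a useful profile can be estimated from the sampling interval. The sketch below assumes an interval of about 1 ms, which I believe is the default of the Profile standard library on most platforms (calling `Profile.init()` with no arguments reports the actual setting). The function name `estimate_repeat_count` and its parameters are hypothetical, introduced only for illustration.

```julia
# Hypothetical helper: estimate how many calls of a short-running function
# are needed to collect roughly `target_samples` profile samples.
# `sample_interval` is in seconds; the 1 ms default is an assumption.
function estimate_repeat_count(seconds_per_call::Real, target_samples::Integer;
                               sample_interval::Real = 0.001)
    ceil(Int, target_samples * sample_interval / seconds_per_call)
end

# A call that takes a few nanoseconds needs on the order of 10^8
# repetitions to yield about a thousand samples.
println(estimate_repeat_count(5e-9, 1000))
```

This only gives an order of magnitude; loop overhead and argument fetching also consume samples.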

Example: profiling cos(::Float64)

"""
    run_in_a_loop(run_workload, get_argument, repeat_count)::Nothing
"""
function run_in_a_loop(run_workload::F, get_argument::G, repeat_count::Int) where {F, G}
    for i ∈ 1:repeat_count
        x = get_argument(i)
        r = @noinline run_workload(x)
        Base.donotdelete(r)  # warning: not public
    end
end

# Example: profile `cos(::Float64)` execution

# Use a global to prevent constant folding, but precompute everything
# to allow the intended workload to dominate the profiling results.
precomputed_arguments::NTuple{128, Float64} = (rand(Float64, 128)...,);

function get_argument(i::Int)
    axs = Base.OneTo(length(precomputed_arguments))
    j = mod(i, axs)
    precomputed_arguments[j]
end

# Use a global to prevent unexpected compiler optimization.
repeat_count::Int = 1000000

using ProfileCanvas: @profview

# Get everything compiled. The profile may (?) include compilation
# here, so we're not interested in it yet.
run_in_a_loop(cos, get_argument, repeat_count)
@profview run_in_a_loop(cos, get_argument, repeat_count)

# Run again to see run-time performance.
repeat_count *= 1000
@profview run_in_a_loop(cos, get_argument, repeat_count)

Here's the profile flame graph I ended up with:

[Figure: profile flame graph for the example]

So about 85% of the samples do end up in cos, even though a single call of cos(::Float64) is over very quickly. Not bad!
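The share of samples can also be checked without a flame graph, using the text report of the Profile standard library. The following is a minimal sketch under the same idea as the example above, not the exact setup used for the graph; `repeat_workload` and `xs` are hypothetical names, and `Base.donotdelete` is not public API.

```julia
using Profile

# Hypothetical stand-in for the loop above: call `f` repeatedly on
# precomputed arguments and keep each result alive.
function repeat_workload(f::F, xs::Vector{Float64}, n::Int) where {F}
    for i in 1:n
        r = @noinline f(xs[mod1(i, length(xs))])
        Base.donotdelete(r)  # warning: not public
    end
end

xs = rand(128)
repeat_workload(cos, xs, 10)  # compile before profiling

Profile.clear()
@profile repeat_workload(cos, xs, 10^8)

# Flat listing sorted by sample count; lines with fewer than 10 samples are hidden.
Profile.print(format = :flat, sortedby = :count, mincount = 10)
```

The exact counts vary from run to run, so only the relative weight of the `cos` frames is meaningful.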

donotdelete

The secret sauce above is donotdelete. This is its docstring:

help?> Base.donotdelete
  │ Warning
  │
  │  The following bindings may be internal; they may change or be removed in future versions:
  │
  │    •  Base.donotdelete

  Base.donotdelete(args...)

  This function prevents dead-code elimination (DCE) of itself and any arguments passed to it, but is
  otherwise the lightest barrier possible. In particular, it is not a GC safepoint, does not model an
  observable heap effect, does not expand to any code itself and may be re-ordered with respect to
  other side effects (though the total number of executions may not change).

  A useful model for this function is that it hashes all memory reachable from args and escapes this
  information through some observable side-channel that does not otherwise impact program behavior. Of
  course that's just a model. The function does nothing and returns nothing.

  This is intended for use in benchmarks that want to guarantee that args are actually computed.
  (Otherwise DCE may see that the result of the benchmark is unused and delete the entire benchmark
  code).

  │ Note
  │
  │  donotdelete does not affect constant folding. For example, in donotdelete(1+1), no add
  │  instruction needs to be executed at runtime and the code is semantically equivalent to
  │  donotdelete(2).

  │ Note
  │
  │  This intrinsic does not affect the semantics of code that is dead because it is
  │  unreachable. For example, the body of the function f(x) = false && donotdelete(x) may be
  │  deleted in its entirety. The semantics of this intrinsic only guarantee that if the
  │  intrinsic is semantically executed, then there is some program state at which the value of
  │  the arguments of this intrinsic were available (in a register, in memory, etc.).

  │ Julia 1.8
  │
  │  This method was added in Julia 1.8.

  Examples
  ≡≡≡≡≡≡≡≡

  function loop()
      for i = 1:1000
          # The compiler must guarantee that there are 1000 program points (in the correct
          # order) at which the value of `i` is in a register, but has otherwise
          # total control over the program.
          donotdelete(i)
      end
  end
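To get a feel for what the barrier does, one can time a loop whose results are unused, with and without donotdelete. This is only a sketch: the function names and the arithmetic workload are hypothetical, and whether the compiler actually removes the first loop depends on the Julia and LLVM versions, so no particular timings are claimed.

```julia
# The result of the arithmetic is unused, so the compiler is allowed
# (but not required) to delete the computation.
function loop_without_barrier(xs::Vector{Float64})
    for x in xs
        x * x + 1.0
    end
end

# Same loop, but each result is passed to the barrier and so must be computed.
function loop_with_barrier(xs::Vector{Float64})
    for x in xs
        Base.donotdelete(x * x + 1.0)  # warning: not public
    end
end

xs = rand(10^6)
loop_without_barrier(xs); loop_with_barrier(xs)  # compile first

t_without = @elapsed loop_without_barrier(xs)
t_with = @elapsed loop_with_barrier(xs)
println((without_barrier = t_without, with_barrier = t_with))
```

If the first loop is optimized away, its time should be close to zero regardless of the length of `xs`.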

Similar functionality in Rust: std::hint::black_box.

Some C++ libraries offer similar functionality, too. For example, the folly library has doNotOptimizeAway.
