Profiling: timing and memory tracing
The toolkit includes a stage profiler (cim_app_histories.general.perf)
for measuring where pipeline runs spend time and memory. This page
explains why it works the way it does, how to use it from the CLI and
from code, how to read its numbers, and when to reach for external
tools instead.
Why stage profiling rather than line profiling
Line-oriented tools such as memory_profiler decorate individual
functions and report per-line costs. That model fights class-structured
code (every method needs decorating, inheritance and callbacks confuse
attribution) and is the wrong granularity for a corpus pipeline anyway:
the question is not "which line allocates" but "what does each stage
— string extraction, classification, link building — cost, summed over
ten thousand APKs".
The harness therefore profiles named stages. You place the boundaries with a context manager, so it follows any code structure, and results are aggregated by stage name: a stage entered once per file across a whole shard produces one summary row (count, total, max), keeping the report at JSON-line size no matter how large the corpus.
Quick start
From the CLI, add --profile to a flows run:
cim-apps flows --apk app.apk --outdir results/ --profile
Each output record in results/<app>.flows.jsonl gains a "profile"
section. On a SLURM array this means per-stage economics for every app
in the corpus, embedded in the data you were already collecting:
#SBATCH --array=0-99
apptainer run cim-apps.sif flows \
--apk-list corpus.txt --outdir results/ --profile \
--task-index "$SLURM_ARRAY_TASK_ID" --task-count 100
From code:
from cim_app_histories.general.perf import StageProfiler
from cim_app_histories.calls.multimodal_pipeline import build_flow_graph
prof = StageProfiler()
graph = build_flow_graph(files, permissions=perms, profiler=prof)
report = prof.report()
prof.close()
To instrument your own code, wrap the phases that matter:
prof = StageProfiler()
for name, data in files:
with prof.stage("extract"):
strings = extract(data)
with prof.stage("classify"):
result = classify(name, strings)
print(prof.report())
Stages nest, producing dotted labels (outer.inner). A decorator form
exists for wrapping existing callables without editing them:
classify = prof.wrap("classify", classifier.classify_so_pre).
Reading the report
{"total_wall_s": 4.36, "rss_kb_now": 95436, "python_memory_traced": true,
"stages": [
{"stage": "classify_module", "calls": 30, "wall_s": 1.504,
"wall_max_s": 0.089, "cpu_s": 1.489, "py_kb_delta": 0,
"py_kb_peak_max": 63753, "rss_kb_delta": 61552}, ...]}
| Field | Meaning |
|---|---|
calls |
times the stage was entered (aggregated by name) |
wall_s / wall_max_s |
total and worst-case elapsed time — what a user or job waits for |
cpu_s |
process CPU time; a large wall−cpu gap means waiting (I/O, locks), not computing |
py_kb_delta |
net Python-heap growth across all calls; persistent growth here suggests retained objects |
py_kb_peak_max |
highest Python-heap high-water mark observed during any single call |
rss_kb_delta |
net process RSS growth; this is what the OS and SLURM see, and includes native allocations invisible to tracemalloc |
Two readings worth internalising. A stage with large py_kb_peak_max
but near-zero py_kb_delta allocates transiently and cleans up — the
peak tells you the memory headroom one worker needs. A stage where
rss_kb_delta grows but py_kb_delta does not points at native-code
allocations (or allocator fragmentation), which is your cue for memray
(below).
Per-app reports across a corpus analyse naturally with pandas:
import json, pandas as pd, pathlib
rows = [{"app": r["input"], **s}
for p in pathlib.Path("results").glob("*.flows.jsonl")
for r in map(json.loads, open(p))
for s in r.get("profile", {}).get("stages", [])]
df = pd.DataFrame(rows)
print(df.groupby("stage")[["wall_s", "py_kb_peak_max"]]
.agg({"wall_s": "sum", "py_kb_peak_max": "max"})
.sort_values("wall_s", ascending=False))
Costs and caveats
The profiler is honest about its own overhead. Python-memory tracing
uses tracemalloc, which typically slows execution 1.5–3x and adds
memory of its own; it is therefore opt-in. For timing and RSS only, at
near-zero overhead, construct StageProfiler(trace_python_memory=False).
Library code instrumented with stages defaults to NullProfiler, a
do-nothing stand-in, so unprofiled production runs are unaffected — the
test suite asserts that profiled and unprofiled runs produce identical
results.
Three measurement caveats. First, tracemalloc sees Python allocations
only; native allocations show up solely in the RSS columns. Second, the
peak counter is process-global: a nested stage's reset truncates its
parent's peak window, so treat py_kb_peak_max as reliable for stages
with no profiled children (timings and deltas nest correctly). Third,
RSS is read from /proc/self/statm, so RSS columns are zero on
non-Linux platforms; everything else works everywhere.
When to use external tools instead
The harness gives continuous, corpus-scale numbers embedded in your
output data. For occasional deep dives, three external tools complement
it without code changes. py-spy produces sampled flamegraphs of a live
process and can attach to a running HPC task by PID
(py-spy record -o flame.svg --pid <pid> or
py-spy record -- cim-apps flows ...), at negligible overhead. memray
(memray run --native -o out.bin cim-apps flows ..., then
memray flamegraph out.bin) produces allocation flamegraphs that do
include native extensions — the strongest tool whenever RSS and
tracemalloc disagree. Scalene offers line-level profiles separating
Python, native, and copy costs, useful on a laptop once the harness has
told you which stage to stare at. For whole-job ground truth on the
cluster, /usr/bin/time -v reports peak RSS for any command, and
sacct --format=JobID,MaxRSS,Elapsed reports what SLURM actually
charged you; the harness's numbers should reconcile with both.