Running at scale
The same commands that analyse one app on a laptop analyse ten thousand
on a cluster. Three features make that work: deterministic sharding
(--task-index/--task-count), resumable outputs (existing results
are skipped), and atomic writes (a killed job never leaves a truncated
file that a restart would wrongly skip).
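To make those three properties concrete, here is a minimal Python sketch of how they can fit together. It is not the toolkit's actual implementation: the round-robin shard rule, the helper names (`shard`, `atomic_write_text`, `process_shard`), the `analyse` callback, and the `<name>.metadata.jsonl` output naming are all assumptions made for illustration.

```python
import os
import tempfile
from pathlib import Path


def shard(items, task_index, task_count):
    """Deterministic shard: sort, then take every task_count-th item.

    Assumed scheme for illustration; the real tool may shard differently.
    """
    return sorted(items)[task_index::task_count]


def atomic_write_text(path, text):
    """Write to a temp file in the same directory, then rename into place."""
    path = Path(path)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write(text)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # atomic on POSIX within one filesystem
    except BaseException:
        os.unlink(tmp)  # never leave a partial file behind
        raise


def process_shard(apks, outdir, task_index, task_count, analyse):
    """Run analyse() on this task's shard, skipping finished outputs."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    done = []
    for apk in shard(apks, task_index, task_count):
        out = outdir / (Path(apk).stem + ".metadata.jsonl")  # hypothetical naming
        if out.exists():  # resumable: a result is already in place
            continue
        atomic_write_text(out, analyse(apk) + "\n")
        done.append(apk)
    return done
```

Because the final name only appears through the rename, a job killed mid-write leaves at most a stray `.tmp` file, which the existence check ignores.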
SLURM array jobs
Divide a corpus across an array by passing the array variables through:
#!/bin/bash
#SBATCH --array=0-99
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=04:00:00
cim-apps metadata \
--apk-dir /scratch/apps --outdir /scratch/results \
--task-index "$SLURM_ARRAY_TASK_ID" \
--task-count "$SLURM_ARRAY_TASK_COUNT"
Each array task processes its own deterministic shard; within a task,
work spreads across the allocated CPUs automatically (the worker count
respects --cpus-per-task, so do not set --workers by hand). If some
tasks fail or time out, resubmit the identical array: completed apps
skip, and only the gaps are reprocessed.
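Before resubmitting, it can help to see how large the gaps are. The sketch below lists APKs that have no result file yet. The helper name `missing_outputs` is hypothetical, and it assumes each app's result is named after the APK's stem with a `.metadata.jsonl` suffix; check this against the files your run actually produced.

```python
from pathlib import Path


def missing_outputs(apk_paths, outdir, suffix=".metadata.jsonl"):
    """Return the APK paths that have no result file in outdir.

    Assumes one output per app, named <apk stem> + suffix (an assumption,
    not a documented guarantee of the toolkit).
    """
    outdir = Path(outdir)
    return [apk for apk in apk_paths
            if not (outdir / (Path(apk).stem + suffix)).is_file()]
```

For example, passing the sorted contents of /scratch/apps and /scratch/results as `outdir` gives the list of apps a resubmitted array would still have to process.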
For corpora that are very large or still being downloaded, freeze a manifest first and shard that instead of a directory:
find /scratch/apps -name '*.apk' | sort > corpus.txt
cim-apps metadata --apk-list corpus.txt --outdir /scratch/results \
--task-index "$SLURM_ARRAY_TASK_ID" --task-count 100
Containers (Apptainer)
Most clusters run Apptainer/Singularity rather than Docker. The
repository ships a definition file; build once (on a machine where you
have --fakeroot or root), then run the same image everywhere:
apptainer build cim-apps.sif apptainer.def
apptainer run --bind /scratch:/scratch cim-apps.sif metadata \
--apk-dir /scratch/apps --outdir /scratch/results ...
The image pins the toolkit and its dependencies, so the laptop pilot and the cluster run use bit-identical code, and the version recorded in every output lets you confirm it.
Sizing a run
Analysis cost is dominated by androguard parsing, roughly seconds per app and scaling with APK size; the batch machinery itself adds well under a millisecond per app. Before committing an allocation, pilot a small shard with profiling:
head -20 corpus.txt > pilot.txt
cim-apps metadata --apk-list pilot.txt --outdir pilot/ --profile
The embedded profile (see the profiling guide)
gives per-stage wall time and peak memory per app; multiply out for the
corpus, add headroom, and set --time and --mem from measurements
rather than guesses. Afterwards, reconcile against what SLURM actually
charged with sacct --format=JobID,MaxRSS,Elapsed.
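The multiply-out step is simple arithmetic, sketched below. The function names and the 1.5x default headroom are illustrative choices, and the model assumes one app in flight per CPU with every app costing the pilot figure; feeding it the slowest and hungriest pilot app rather than the mean gives a safer bound.

```python
import math


def size_array_job(n_apps, task_count, cpus_per_task,
                   secs_per_app, peak_mem_mb_per_app, headroom=1.5):
    """Estimate per-task wall time (seconds) and memory (MB) from pilot numbers."""
    apps_per_task = math.ceil(n_apps / task_count)
    # Apps run cpus_per_task at a time, so the task needs this many rounds.
    wall_secs = math.ceil(apps_per_task / cpus_per_task) * secs_per_app
    # Assumes one app per CPU, each reaching its pilot peak at the same time.
    mem_mb = cpus_per_task * peak_mem_mb_per_app
    return math.ceil(wall_secs * headroom), math.ceil(mem_mb * headroom)


def as_slurm_time(seconds):
    """Format seconds as HH:MM:SS for #SBATCH --time."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"


# Made-up pilot figures, not measurements: 6 s and 400 MB per app.
secs, mem = size_array_job(10_000, 100, 4, 6, 400)
print(as_slurm_time(secs), f"{mem}M")  # 00:03:45 2400M
```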
Filesystem etiquette
Parallel filesystems (Lustre, GPFS) prefer few large files over many
small ones. The toolkit's one-JSONL-per-app output is fine at the
tens-of-thousands scale, but keep --outdir on scratch rather than
home, and merge results into a single file or Parquet table for the
analysis phase:
find /scratch/results -name '*.metadata.jsonl' -print0 | sort -z | xargs -0 cat > corpus.metadata.jsonl
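If the merge runs as the first step of a Python analysis script instead, a stdlib-only sketch might look like this. The helper name `merge_jsonl` is hypothetical; the `*.metadata.jsonl` pattern follows the command above. It merges in sorted name order, adds a trailing newline where a file lacks one, and excludes the merged file itself in case it lives in the results directory.

```python
from pathlib import Path


def merge_jsonl(results_dir, merged_path, pattern="*.metadata.jsonl"):
    """Concatenate per-app JSONL files into one file, in sorted name order."""
    merged_path = Path(merged_path)
    # Collect sources first, excluding the merged file if it matches the pattern.
    sources = sorted(p for p in Path(results_dir).glob(pattern)
                     if p.resolve() != merged_path.resolve())
    with open(merged_path, "wb") as out:
        for src in sources:
            data = src.read_bytes()  # per-app files are small
            out.write(data)
            if data and not data.endswith(b"\n"):
                out.write(b"\n")  # keep one record per line
    return len(sources)
```

Calling `merge_jsonl("/scratch/results", "corpus.metadata.jsonl")` returns the number of per-app files merged, which is a quick sanity check against the size of the corpus.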