Command reference
All analysis runs through one command, cim-apps, with four
workflows. Every workflow shares the same input and batching options,
so anything you learn about running one applies to the others.
cim-apps {metadata, classify, flows, listening} [input] --outdir DIR [options]
Choosing inputs
Supply exactly one of three input modes:
--apk PATH analyses a single APK file.
--apk-dir DIR analyses every .apk file in a directory (add
--recursive to include subdirectories). Files are processed in sorted
order; hidden files and non-APK files are ignored. A directory may
contain just one app. For very large corpora, or folders that a
scraper is still writing into, prefer a manifest (--apk-list): array
sharding divides the file list safely only if the list is identical
every time the command runs.
--apk-list FILE reads one APK path per line from a text file (#
comments allowed). Generate the list once with, for example,
find /scratch/apps -name '*.apk' | sort > corpus.txt.
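A manifest only helps sharding if it stays frozen. The sketch below runs against a throwaway directory created with mktemp rather than a real corpus: it builds a sorted manifest and records a checksum that later runs can compare against. The directory layout, the file names, and the corpus.txt.cksum file are illustrative assumptions, not part of cim-apps.

```shell
# Throwaway stand-in for a real corpus directory (assumption for illustration).
corpus_dir=$(mktemp -d)
mkdir -p "$corpus_dir/a" "$corpus_dir/b"
: > "$corpus_dir/b/two.apk"
: > "$corpus_dir/a/one.apk"
: > "$corpus_dir/a/notes.txt"   # non-APK file, excluded by -name below

manifest="$corpus_dir/corpus.txt"

# LC_ALL=C keeps the sort order byte-wise and therefore locale-independent.
find "$corpus_dir" -type f -name '*.apk' | LC_ALL=C sort > "$manifest"

# Record a checksum once; compare against it before each later run.
cksum < "$manifest" > "$manifest.cksum"

apk_count=$(wc -l < "$manifest" | tr -d ' ')
first_apk=$(head -n 1 "$manifest")

if [ "$(cksum < "$manifest")" = "$(cat "$manifest.cksum")" ]; then
    manifest_state=unchanged
else
    manifest_state=changed
fi
echo "$apk_count APKs listed, manifest $manifest_state"
```

If the state ever reports changed, regenerate the manifest deliberately and restart the array rather than letting shards run against different lists.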
Common options
--outdir DIR (required) is where results go: one
<app>.<workflow>.jsonl file per input (the listening workflow writes a
.json document instead). If two inputs share a filename (for example,
two files named app.apk in different folders), a short hash is added so
outputs never collide. Failures produce an <app>.<workflow>.error.jsonl
file instead of stopping the batch.
--workers N sets parallel processes; the default is the number of
CPUs actually allocated to the job (it respects SLURM and container
limits), so you rarely need to set it.
--force reprocesses inputs whose output file already exists. Without
it, completed apps are skipped — which is what makes interrupted runs
resumable: just run the same command again.
--task-index I / --task-count N divide the input list into N
deterministic shards and process only shard I. These map directly onto
SLURM array variables; see Running at scale.
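As a rough sketch of that mapping, a job script could pass SLURM's array variables straight through. SLURM_ARRAY_TASK_ID and SLURM_ARRAY_TASK_COUNT are standard SLURM environment variables. The array range 0-9, the corpus.txt manifest, and in particular the assumption that --task-index counts from 0 are illustrative and should be checked against Running at scale. The script only prints the command it would run, and falls back to a single shard when run outside a SLURM array.

```shell
#!/bin/sh
#SBATCH --array=0-9

# Outside a SLURM array these are unset, so fall back to one shard covering everything.
task_index="${SLURM_ARRAY_TASK_ID:-0}"
task_count="${SLURM_ARRAY_TASK_COUNT:-1}"

# Assumption: --task-index is 0-based, matching the --array=0-9 range above.
set -- cim-apps metadata --apk-list corpus.txt --outdir results/ \
    --task-index "$task_index" --task-count "$task_count"

planned_command="$*"
echo "would run: $planned_command"

# To execute instead of printing, replace the echo with:  "$@"
```

Using a manifest here rather than --apk-dir keeps the input list identical across all array tasks, which is the condition sharding relies on.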
metadata
cim-apps metadata --apk-dir apps/ --outdir results/
One record per app: application name, package, version code and name,
permissions, activities, intents, localisation coverage (which
language/region resource sets the app ships, e.g. ["zh", "CN", ""]),
and the A/B-testing SDKs detected in its code (all classes*.dex files
are scanned, so multidex apps are fully covered). This is the
per-version observable set that app-histories studies track across
releases — A/B and localisation are fields of this record, not separate
commands.
classify
cim-apps classify --apk-dir apps/ --outdir results/
Detects AI/ML components: native runtime libraries (TensorFlow Lite, ONNX Runtime, and proprietary vendor runtimes detected heuristically) and model files, with inferred vendor and category. Detections are heuristic — treat them as evidence to audit, not ground truth, and see the methods page for how thresholds are set.
flows
cim-apps flows --apk app.apk --outdir results/ [--trace] [--profile]
Builds a graph per app of how data may flow: device inputs (microphone, camera, Bluetooth/MIDI, text, network streams, files, sensors, screen) linked to the modules that show evidence of using them (native libraries and the model files they reference), linked onward to network endpoints and produced outputs. Links require co-located evidence in the module's own strings — keyword tables cover English and Chinese — and each link records the evidence behind it. The output includes Sankey-ready edges for visualisation.
--trace strengthens links using dex method tracing (substantially
slower; the traced chains are summarised into evidence and are never
emitted themselves).
--profile embeds per-stage timing and memory into each record — see
the profiling guide.
listening
cim-apps listening --apk-dir apps/ --outdir results/
The audio-specialist workflow: traces audio inputs only (microphone,
streams, files, Bluetooth/MIDI) through the canonical chain capture →
dsp → features → inference → output, extracting the audio parameters
visible at each stage (sample rates, channels, frame/hop sizes, mel
bins, codecs) and recording where they change between stages. Outputs
one pretty-printed .json document per app rather than JSONL. See the
listening guide for the stage model and worked
examples.
Recipes
Resume an interrupted corpus run (completed apps skip automatically):
cim-apps metadata --apk-dir apps/ --outdir results/
Re-run everything after upgrading the toolkit, either into a fresh output directory or in place with --force:
cim-apps metadata --apk-dir apps/ --outdir results-v2/ # or --force
Pilot ten apps with profiling before committing a cluster allocation:
head -10 corpus.txt > pilot.txt
cim-apps flows --apk-list pilot.txt --outdir pilot/ --profile
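Before resuming or scaling up, it can help to see how far a run got. The sketch below counts result and error files by the naming pattern described under Common options. It uses a temporary directory with empty placeholder files in place of a real --outdir, and the app names are made up. Whether a single app can leave both a result file and an error file is not specified on this page, so treat the two counts as independent.

```shell
# Temporary stand-in for a real --outdir (assumption for illustration).
outdir=$(mktemp -d)
: > "$outdir/alpha.metadata.jsonl"
: > "$outdir/beta.metadata.jsonl"
: > "$outdir/gamma.metadata.error.jsonl"

workflow=metadata

# Error files end in .error.jsonl, so the two patterns never match the same file.
ok_count=$(find "$outdir" -maxdepth 1 -type f -name "*.$workflow.jsonl" | wc -l | tr -d ' ')
error_count=$(find "$outdir" -maxdepth 1 -type f -name "*.$workflow.error.jsonl" | wc -l | tr -d ' ')

echo "completed=$ok_count failed=$error_count"

# List the failures so they can be inspected one by one.
find "$outdir" -maxdepth 1 -type f -name "*.$workflow.error.jsonl" | LC_ALL=C sort
```

Point outdir at the real results directory and drop the placeholder lines to use this on an actual run.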