Listening: tracing audio inputs and their parameters
The listening workflow is built for machine-listening studies. Where
flows asks "what inputs feed what modules" across all modalities,
listening asks the audio question in depth: how does this app capture
sound, what does it do to the signal, what does a model receive, and
what leaves the device — and crucially, with what parameters, and how
those parameters change along the way.
cim-apps listening --apk app.apk --outdir results/
cim-apps listening --apk-dir apps/ --outdir results/
Output is one pretty-printed JSON file per app
(<app>.listening.json), unlike the JSONL of the other workflows —
each app's trace is a single document meant to be read and visualised
as a whole. All the usual batching options (--apk-dir, --apk-list,
sharding, resume) work identically.
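Because each app gets its own file, a batch run can be read back by globbing the output directory. The following is a minimal sketch, assuming only the <app>.listening.json naming described above; the helper name load_listening_records is hypothetical and not part of cim-apps.

```python
import json
from pathlib import Path

SUFFIX = ".listening.json"

def load_listening_records(outdir):
    """Return {app_name: record} for every <app>.listening.json in outdir."""
    records = {}
    for path in sorted(Path(outdir).glob("*" + SUFFIX)):
        app = path.name[: -len(SUFFIX)]  # strip the suffix to recover <app>
        with path.open(encoding="utf-8") as fh:
            records[app] = json.load(fh)
    return records
```

Calling load_listening_records("results/") after either command above would give one dictionary entry per analysed app.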
The stage model
Audio processing in apps follows a recognisable pipeline, and the workflow organises everything it finds into that canonical order:
capture -> dsp -> features -> inference -> output
capture is where sound enters: AudioRecord/AAudio/OpenSL for the microphone, a stream player for network audio, MediaExtractor for files, MIDI over Bluetooth.
dsp covers signal conditioning: resampling, denoising, echo cancellation, gain control, voice activity detection, loudness normalisation.
features is the transform into model input: FFT/STFT, mel spectrograms, MFCCs, filterbanks, pitch and chroma.
inference is the model itself, with a task label (speech_to_text, wake_word, speaker_recognition, music_analysis, speech_synthesis, audio_classification, audio_embedding) and, where a model file is referenced by name from an audio module, the model artefact with its format.
output is what leaves: transcripts and labels, and network endpoints with their categories (upload, streaming, recommendation).
A single native library frequently spans several stages — one .so
doing capture, DSP, and features is common — so chain entries are
per-(module, stage), each carrying the operations observed and the
parameters visible at that point.
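To see which libraries span several stages, the per-(module, stage) entries can be regrouped by module. This is a sketch only: the field names "module" and "stage" and the library names in the example are assumptions made for illustration, so check them against a real output file before relying on them.

```python
# Canonical stage order, as given in the stage model above.
STAGE_ORDER = ["capture", "dsp", "features", "inference", "output"]

def stages_by_module(chain):
    """Map each module to the stages it spans, in canonical order.

    Assumes each chain entry is a dict with "module" and "stage" keys
    (hypothetical field names).
    """
    spans = {}
    for entry in chain:
        spans.setdefault(entry["module"], set()).add(entry["stage"])
    return {
        module: [s for s in STAGE_ORDER if s in stages]
        for module, stages in spans.items()
    }

# Illustrative entries; the library names are made up.
example_chain = [
    {"module": "libfrontend.so", "stage": "features"},
    {"module": "libfrontend.so", "stage": "capture"},
    {"module": "libfrontend.so", "stage": "dsp"},
    {"module": "libasr.so", "stage": "inference"},
]
print(stages_by_module(example_chain))
```

A module that maps to more than one stage is a library of the multi-stage kind described above.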
Parameters and transitions
For each stage the workflow extracts the audio parameter vocabulary
visible in the module's strings: sample rates (validated against the
standard set — 8000/16000/22050/44100/48000 etc., including "16k"
forms), channel layout (mono/stereo), named values such as n_fft,
hop_length, frame_size, n_mels, buffer_size and bitrate, and
codecs (Opus, AAC, PCM, FLAC...).
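The sample-rate validation described above can be pictured with a small sketch. It is not the extractor's actual code: the rate set is limited to the five values listed in the text (the "etc." is left unexpanded), and the regular expression and the function name parse_sample_rate are assumptions.

```python
import re

# Only the rates named in the text; the real set may be larger.
STANDARD_RATES = {8000, 16000, 22050, 44100, 48000}

# A number, optionally followed by k / kHz / Hz (case-insensitive).
_RATE_TOKEN = re.compile(r"^(\d+(?:\.\d+)?)\s*(k|khz|hz)?$", re.IGNORECASE)

def parse_sample_rate(token):
    """Return the rate in Hz if token names a standard rate, else None."""
    match = _RATE_TOKEN.match(token.strip())
    if match is None:
        return None
    value = float(match.group(1))
    unit = (match.group(2) or "").lower()
    if unit in ("k", "khz"):
        value *= 1000
    rate = int(round(value))
    # Reject anything outside the standard set, e.g. a buffer size like 3840.
    return rate if rate in STANDARD_RATES else None
```

Under these assumptions parse_sample_rate("16k") and parse_sample_rate("16000") both give 16000, while an arbitrary number such as "3840" is rejected.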
Where the same parameter appears with different values in successive stages, a transition is recorded:
{"parameter": "sample_rate",
"from_stage": "capture", "from": [48000],
"to_stage": "features", "to": [16000]}
That particular transition, high-rate capture collapsing to 16 kHz before features and typically paired with a stereo-to-mono change in channel layout, is the classic signature of an ASR front end. It is exactly the kind of finding the workflow exists to surface: not just that an app listens, but the shape of the listening.
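The rule for recording a transition can be sketched as follows. This is a hedged illustration, not the workflow's implementation: it assumes each chain entry carries a "stage" key and a "parameters" mapping from parameter name to a list of values (hypothetical field names), and it reads "successive stages" as the successive stages at which a given parameter is visible.

```python
STAGE_ORDER = ["capture", "dsp", "features", "inference", "output"]

def parameter_transitions(chain):
    """Emit a transition wherever a parameter's values differ between
    consecutive stages that both mention it."""
    # parameter name -> stage -> set of observed values
    by_param = {}
    for entry in chain:
        for name, values in entry.get("parameters", {}).items():
            by_param.setdefault(name, {}).setdefault(entry["stage"], set()).update(values)

    transitions = []
    for name, per_stage in by_param.items():
        stages = [s for s in STAGE_ORDER if s in per_stage]
        for src, dst in zip(stages, stages[1:]):
            if per_stage[src] != per_stage[dst]:
                transitions.append({
                    "parameter": name,
                    "from_stage": src, "from": sorted(per_stage[src]),
                    "to_stage": dst, "to": sorted(per_stage[dst]),
                })
    return transitions

# Illustrative chain: the dsp stage exposes no sample rate at all.
example_chain = [
    {"stage": "capture", "parameters": {"sample_rate": [48000]}},
    {"stage": "dsp", "parameters": {}},
    {"stage": "features", "parameters": {"sample_rate": [16000]}},
]
print(parameter_transitions(example_chain))
```

With the illustrative chain, the single transition produced has the same shape as the sample_rate record shown above.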
Two worked examples
These illustrate what the output shows for the two motivating cases. They are schematic: what the workflow surfaces depends on what each real app's binaries reveal, so the examples below are illustrative shapes, not results from any real app or service.
A speech-to-text app. Sources show microphone evidenced by the
AudioRecord API, microphone keywords (including 麦克风), and the
RECORD_AUDIO permission as corroboration. The chain shows capture at
48 kHz stereo with a 3840-byte buffer; a dsp stage with resample,
denoise, and VAD; a features stage at 16 kHz mono with n_fft 512,
hop_length 160, n_mels 80 (a standard log-mel front end); an
inference stage linking assets/asr_v3.tflite (task: speech_to_text)
referenced by the feature library; and an output stage with a
transcript and an upload endpoint. Two transitions are recorded:
sample_rate 48000→16000 and channel_layout stereo→mono.
A music-streaming app with recommendations. Sources show
network_stream (HLS/ExoPlayer evidence) rather than the microphone —
and, importantly, no microphone claim is invented just because the
app plays audio. The chain shows codec parameters (Opus/AAC) at
capture, loudness handling in dsp, then analysis features — tempo,
beat tracking, chroma — feeding an inference stage with task
music_analysis and an embedding vocabulary, and an output stage whose
endpoint is categorised recommendation (.../recommendations/next).
That is the observable skeleton of "what to play next": which signal
properties the app computes on-device, and where they are sent.
Reading the output
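import json

# Load one app's trace (a single pretty-printed JSON document).
with open("results/MyApp.listening.json", encoding="utf-8") as fh:
    rec = json.load(fh)

rec["summary"]                # sources, stages present, tasks, endpoints
rec["sources"]                # per-source: which modules, with what evidence
rec["chain"]                  # the ordered stage entries
rec["parameter_transitions"]  # parameters whose values differ between stages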
Across a corpus, the summaries aggregate naturally — for instance, counting which inference tasks co-occur with which sources, or how common the 48 kHz→16 kHz signature is per app category.
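As a sketch of the first aggregation, the snippet below counts (source, task) pairs over a list of per-app summaries. It assumes each summary holds "sources" and "tasks" as lists of strings; those key names are guesses based on the comment above, and the function name is hypothetical.

```python
from collections import Counter
from itertools import product

def task_source_cooccurrence(summaries):
    """Count how many apps show each (source, task) pair."""
    counts = Counter()
    for summary in summaries:
        sources = set(summary.get("sources", []))
        tasks = set(summary.get("tasks", []))
        # Each pair is counted at most once per app.
        counts.update(product(sources, tasks))
    return counts

# Illustrative summaries, not real results.
example_summaries = [
    {"sources": ["microphone"], "tasks": ["speech_to_text"]},
    {"sources": ["microphone"], "tasks": ["speech_to_text", "wake_word"]},
    {"sources": ["network_stream"], "tasks": ["music_analysis"]},
]
for pair, n in sorted(task_source_cooccurrence(example_summaries).items()):
    print(pair, n)
```

Passing in the "summary" field of every loaded record gives a corpus-wide table; grouping by app category would need category labels from outside the listening output.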
What this evidence is, and is not
The workflow performs static analysis: parameters are vocabulary
observed in binary strings, not measured runtime values. A 16000
adjacent to sample_rate is strong evidence of a 16 kHz path, but it is
not a recording of one: a value may be a default, one of several
configurations, or occasionally an unrelated constant. The extractor
only accepts numbers that match known rates or named parameters; even
so, audit a sample by hand before reporting. Transitions are inferred
from stage co-location within the app, not from observed dataflow; the
--trace-style dex confirmation available in flows is a natural
extension here. Sources require co-located evidence, and permissions
only corroborate, so an app is never claimed to use the microphone
merely because it could. Keyword tables carry English and Chinese and
are data, not code: extend them in
src/cim_app_histories/calls/listening.py (STAGE_OPERATIONS,
INFERENCE_TASKS, the parameter patterns) as your corpus teaches you
vendor-specific vocabulary.