* Implement rotating JSONL recorder for persistent logging * Fixes * Update documentation and clean up imports in command files * Address remaining recorder review feedback Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/2541773c-869a-463f-9fae-8505272c06ff Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com> * recorder: fix lock re-entry deadlock on start() and force_rotate_all() The previous "Fixes" commit added `_files_snapshot()` which acquires `self._lock` so handlers don't race with `stop()` clearing `_files`. But two callers were already holding `self._lock` when they invoked methods that go through the snapshot: - `start()` writes the `recorder_start` event from inside its `with self._lock:` block. `_write_event` -> `_files_snapshot` re-acquires the same non-reentrant `threading.Lock`, freezing process startup. - `force_rotate_all()` calls `self.status()` (which also acquires `self._lock`) while still holding the lock from rotating each file. Both fixes release the lock before the call. The recorder_start marker still lands in events.jsonl because the started/started_at flags are already set when we write it. Verified end-to-end against the standalone /tmp/verify_pr_fixes.py harness — all 9 PR review-comment fixes pass, including pause/resume event ordering and concurrent start/stop without KeyError. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fix markdown linting issues in leakhunt.md and repro.md * Handle recorder startup and query review fixes Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/78540a9f-fe62-4350-b252-0ae5621f0b8a Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com> * Tighten recorder follow-up tests Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/78540a9f-fe62-4350-b252-0ae5621f0b8a Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com> * Stabilize recorder startup tests Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/78540a9f-fe62-4350-b252-0ae5621f0b8a Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com> * Remove brittle recorder startup test Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/78540a9f-fe62-4350-b252-0ae5621f0b8a Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com> * Polish recorder follow-up errors Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/78540a9f-fe62-4350-b252-0ae5621f0b8a Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com> * Refine recorder startup and regex errors Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/78540a9f-fe62-4350-b252-0ae5621f0b8a Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com> * Clean up recorder follow-up nits Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/78540a9f-fe62-4350-b252-0ae5621f0b8a Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com> * Trunk --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8.5 KiB
description: Hunt for memory leaks (and other slow degradations) by reading the persistent recorder's heap timeline + log slice over a window argument-hint: [window=1h] [field=free_heap] [variant=local]
/leakhunt — read the recorder, classify a memory leak
Use the always-on recorder (mcp-server/.mtlog/) to read a heap timeline plus the matching log slice and produce a one-page verdict: steady / slow leak / fragmentation / OOM-imminent. No firmware changes, no special build flags — the LocalStats telemetry packet that the firmware already broadcasts every ~60 s carries heap_free_bytes and heap_total_bytes.
Two signal paths — pick the right one
| Path | Build flag | Cadence | Per-thread attribution | Cost |
|---|---|---|---|---|
| LocalStats packet | (default) | ~60 s | No | Free — always on |
[heap N] log prefix |
-DDEBUG_HEAP=1 |
every log line | Yes (Thread X leaked) | Bigger flash + log volume |
Both feed the same telemetry_timeline(field="free_heap") query — when DEBUG_HEAP is on, the recorder synthesizes telemetry rows from log prefixes (tagged source: debug_heap), so a single timeline call gets whichever signal is available. For a slow leak diagnosis, the default path is plenty (60 s cadence over 6 h = 360 points; linear regression over that nails sub-100-byte/min slopes). DEBUG_HEAP is for attribution — when the slope is real and you need to know which thread is leaking.
What to do
-
Parse
$ARGUMENTS: optionalwindow(default1h, accepts30m/6h/-3d/etc.), optionalfield(defaultfree_heap; alternates:total_heap,battery_level, anything in the LocalStats variant), optionalvariant(defaultlocal; alternates:device,environment,power,airQuality,health). -
Verify the recorder is alive — call
mcp__meshtastic__recorder_status. Check:running == Truefiles.telemetry.lines > 0(at least one telemetry packet recorded — if zero, the device hasn't broadcast LocalStats yet ORset_debug_log_apihas never been on; tell the operator to runmcp__meshtastic__set_debug_log_api(enabled=True)and wait one device-update interval)files.telemetry.last_tswithin the last 5 minutes (if older, the device is silent — log that, not "leak detected")
-
Detect whether DEBUG_HEAP is active —
mcp__meshtastic__logs_window(start="-2m", grep=r"\\[heap \\d+\\]", max_lines=3). If any line matches, the firmware has the prefix → DEBUG_HEAP is on, expect higher-cadence data andheap_eventrows. If zero matches over the last 2 minutes, you're on the LocalStats-only path. -
Pull the timeline —
mcp__meshtastic__telemetry_timeline(window=$window, variant=$variant, field=$field, max_points=200). Read:samples— how many raw points contributedmin,max— total swingslope_per_min— units per minute (linear regression over the whole window)
-
Pull the log context for the same window —
mcp__meshtastic__logs_window(start="-${window}", grep="Heap status|leaked heap|freed heap|out of memory|Alloc an err|panic|abort", max_lines=200). These are the strings the firmware emits when something memory-related happens (DEBUG_HEAPbuilds emit"Heap status:"and"leaked heap"lines; production builds emit"Alloc an err"on failure and"out of memory"on OOM). -
Pull marker events so we know if the operator labeled phases —
mcp__meshtastic__events_window(start="-${window}", kind="mark|connection_lost|connection_established"). If aconnection_lostoverlaps a sharp drop, that's not a leak; that's a reboot.
6a. (DEBUG_HEAP only) Per-thread attribution — mcp__meshtastic__logs_window(start="-${window}", grep="leaked heap", max_lines=200). Each row has a structured heap_event field with {kind, thread, before, after, delta}. Aggregate by thread: sum the delta over the window per thread name. The thread with the largest cumulative negative delta is your suspect. Note the count too — a thread with 50× small leaks is different from 1× big leak.
-
Classify based on what the data says, NOT on what you wish it said. Use these rules in order:
- Insufficient data (< 5 samples): say so. Suggest a longer window or longer wait. Stop.
- Reboot mid-window: if any
connection_lostevent is present ANDfree_heapjumped UP at that timestamp, the device rebooted. Note it; pre-reboot trend may be a leak but you only have part of the curve. - OOM-imminent: any
Alloc an err=orout of memoryline in the log slice. This trumps everything; flag urgently. - Slow leak:
slope_per_min < -50ANDmax - min > 1000AND no reboot. The heap is monotonically (or near-monotonically) declining. Estimate time-to-zero:min / -slope_per_minminutes. Surface it. - Fragmentation suspect:
slope_per_minclose to zero (|x| < 50) BUT min trends down across the window AND the log slice showsAlloc an errwarnings WITHOUT total OOM. Means free total is OK but largest contiguous block is shrinking. Recommend aDEBUG_HEAPbuild to confirm. - Steady: |slope_per_min| < 50, no error lines. Heap is fine.
- Recovery curve: slope is POSITIVE — heap recovered. Either a workload completed or GC fired. Note it; not a leak.
-
Report:
/leakhunt window=6h field=free_heap variant=local ──────────────────────────────────────────────────── recorder : running, telem last_ts 8s ago build : DEBUG_HEAP=ON (per-line prefix detected) samples : 14,200 over 6h (cadence ~1.5s, log-line synth) free_heap : min 92,344 / max 124,008 / range 31,664 slope : -82 bytes/min (negative — heap declining) reboots : none in window OOM events : none error lines : 3× "Alloc an err=ESP_ERR_NO_MEM" at +4h12m, +5h08m, +5h44m thread leaks : (DEBUG_HEAP) MeshPacket -3,124 B over 18 events Router -1,408 B over 4 events others -240 B verdict : SLOW LEAK — primary suspect MeshPacket thread est. time-to-OOM: ~1,127 min (~18.8 h) at current slope evidence : (3 log line citations with uptimes)Then: what to do next.
- SLOW LEAK, DEBUG_HEAP off → recommend rebuilding with the flag and re-running this skill. Concrete one-liner the operator can copy:
After flash, set debug_log_api back on and wait one window; re-run
mcp__meshtastic__build(env="<env>", build_flags={"DEBUG_HEAP": 1}) mcp__meshtastic__pio_flash(env="<env>", port="<port>", confirm=True)/leakhunt. - SLOW LEAK, DEBUG_HEAP on → cite the top-leaking thread name from step 6a. Point at the corresponding source file (
grep -rn "ThreadName(\"<name>\")" src/); the operator decides what to fix. - FRAGMENTATION SUSPECT → propose pre-allocating any per-packet buffers; or rebuilding with
CONFIG_HEAP_TASK_TRACKING=yon ESP32 to see who's holding the largest blocks. - OOM-IMMINENT → flag for immediate attention; don't wait for the next telemetry interval.
- STEADY → say so; stop. Don't invent problems.
- SLOW LEAK, DEBUG_HEAP off → recommend rebuilding with the flag and re-running this skill. Concrete one-liner the operator can copy:
What NOT to do
- Don't assume a leak from a single dip. LocalStats fires every ~60 s and the firmware naturally allocates+frees on each broadcast cycle; one packet sees the trough. Look at the slope, not the deltas.
- Don't recommend code changes. This skill diagnoses; the operator decides what to fix.
- Don't enable
set_debug_log_apiautomatically — if it's off, telemetry isn't reaching pubsub anyway, and the recorder will be empty. Tell the operator to flip it on and wait, then re-run. - Don't run heavy workloads to "trigger the leak." The recorder is passive; we read what's there.
Companion: mark_event for stress runs
If the operator wants to test under stimulus (e.g. blast 50 broadcasts and see what the heap does), they can frame the experiment with markers:
mark_event("burst-start")
… run the workload …
mark_event("burst-end")
/leakhunt window=15m
The markers land in both events.jsonl and logs.jsonl, so the report can show "free_heap dipped 8 KB during the burst window, recovered to baseline within 2 LocalStats cycles" → not a leak.