mirror of https://github.com/meshtastic/firmware.git synced 2026-06-05 18:08:45 +00:00

Files

T

Ben Meadors f6a954b97e Implement rotating JSONL recorder for persistent logging (#10428 )

* Implement rotating JSONL recorder for persistent logging

* Fixes

* Update documentation and clean up imports in command files

* Address remaining recorder review feedback

Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/2541773c-869a-463f-9fae-8505272c06ff

Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com>

* recorder: fix lock re-entry deadlock on start() and force_rotate_all()

The previous "Fixes" commit added `_files_snapshot()` which acquires
`self._lock` so handlers don't race with `stop()` clearing `_files`.
But two callers were already holding `self._lock` when they invoked
methods that go through the snapshot:

  - `start()` writes the `recorder_start` event from inside its `with
    self._lock:` block. `_write_event` -> `_files_snapshot` re-acquires
    the same non-reentrant `threading.Lock`, freezing process startup.

  - `force_rotate_all()` calls `self.status()` (which also acquires
    `self._lock`) while still holding the lock from rotating each file.

Both fixes release the lock before the call. The recorder_start marker
still lands in events.jsonl because the started/started_at flags are
already set when we write it.

Verified end-to-end against the standalone /tmp/verify_pr_fixes.py
harness — all 9 PR review-comment fixes pass, including pause/resume
event ordering and concurrent start/stop without KeyError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix markdown linting issues in leakhunt.md and repro.md

* Handle recorder startup and query review fixes

Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/78540a9f-fe62-4350-b252-0ae5621f0b8a

Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com>

* Tighten recorder follow-up tests

Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/78540a9f-fe62-4350-b252-0ae5621f0b8a

Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com>

* Stabilize recorder startup tests

Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/78540a9f-fe62-4350-b252-0ae5621f0b8a

Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com>

* Remove brittle recorder startup test

Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/78540a9f-fe62-4350-b252-0ae5621f0b8a

Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com>

* Polish recorder follow-up errors

Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/78540a9f-fe62-4350-b252-0ae5621f0b8a

Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com>

* Refine recorder startup and regex errors

Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/78540a9f-fe62-4350-b252-0ae5621f0b8a

Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com>

* Clean up recorder follow-up nits

Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/78540a9f-fe62-4350-b252-0ae5621f0b8a

Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com>

* Trunk

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-10 09:22:40 -05:00

8.5 KiB

Raw Permalink Blame History

description: Hunt for memory leaks (and other slow degradations) by reading the persistent recorder's heap timeline + log slice over a window argument-hint: [window=1h] [field=free_heap] [variant=local]

`/leakhunt` — read the recorder, classify a memory leak

Use the always-on recorder (mcp-server/.mtlog/) to read a heap timeline plus the matching log slice and produce a one-page verdict: steady / slow leak / fragmentation / OOM-imminent. No firmware changes, no special build flags — the LocalStats telemetry packet that the firmware already broadcasts every ~60 s carries heap_free_bytes and heap_total_bytes.

Two signal paths — pick the right one

Path	Build flag	Cadence	Per-thread attribution	Cost
LocalStats packet	(default)	~60 s	No	Free — always on
`[heap N]` log prefix	`-DDEBUG_HEAP=1`	every log line	Yes (Thread X leaked)	Bigger flash + log volume

Both feed the same telemetry_timeline(field="free_heap") query — when DEBUG_HEAP is on, the recorder synthesizes telemetry rows from log prefixes (tagged source: debug_heap), so a single timeline call gets whichever signal is available. For a slow leak diagnosis, the default path is plenty (60 s cadence over 6 h = 360 points; linear regression over that nails sub-100-byte/min slopes). DEBUG_HEAP is for attribution — when the slope is real and you need to know which thread is leaking.

What to do

Parse $ARGUMENTS: optional window (default 1h, accepts 30m/6h/-3d/etc.), optional field (default free_heap; alternates: total_heap, battery_level, anything in the LocalStats variant), optional variant (default local; alternates: device, environment, power, airQuality, health).
Verify the recorder is alive — call mcp__meshtastic__recorder_status. Check:
- running == True
- files.telemetry.lines > 0 (at least one telemetry packet recorded — if zero, the device hasn't broadcast LocalStats yet OR set_debug_log_api has never been on; tell the operator to run mcp__meshtastic__set_debug_log_api(enabled=True) and wait one device-update interval)
- files.telemetry.last_ts within the last 5 minutes (if older, the device is silent — log that, not "leak detected")
Detect whether DEBUG_HEAP is active — mcp__meshtastic__logs_window(start="-2m", grep=r"\\[heap \\d+\\]", max_lines=3). If any line matches, the firmware has the prefix → DEBUG_HEAP is on, expect higher-cadence data and heap_event rows. If zero matches over the last 2 minutes, you're on the LocalStats-only path.
Pull the timeline — mcp__meshtastic__telemetry_timeline(window=$window, variant=$variant, field=$field, max_points=200). Read:
- samples — how many raw points contributed
- min, max — total swing
- slope_per_min — units per minute (linear regression over the whole window)
Pull the log context for the same window — mcp__meshtastic__logs_window(start="-${window}", grep="Heap status|leaked heap|freed heap|out of memory|Alloc an err|panic|abort", max_lines=200). These are the strings the firmware emits when something memory-related happens (DEBUG_HEAP builds emit "Heap status:" and "leaked heap" lines; production builds emit "Alloc an err" on failure and "out of memory" on OOM).
Pull marker events so we know if the operator labeled phases — mcp__meshtastic__events_window(start="-${window}", kind="mark|connection_lost|connection_established"). If a connection_lost overlaps a sharp drop, that's not a leak; that's a reboot.

6a. (DEBUG_HEAP only) Per-thread attribution — mcp__meshtastic__logs_window(start="-${window}", grep="leaked heap", max_lines=200). Each row has a structured heap_event field with {kind, thread, before, after, delta}. Aggregate by thread: sum the delta over the window per thread name. The thread with the largest cumulative negative delta is your suspect. Note the count too — a thread with 50× small leaks is different from 1× big leak.

Classify based on what the data says, NOT on what you wish it said. Use these rules in order:
- Insufficient data (< 5 samples): say so. Suggest a longer window or longer wait. Stop.
- Reboot mid-window: if any connection_lost event is present AND free_heap jumped UP at that timestamp, the device rebooted. Note it; pre-reboot trend may be a leak but you only have part of the curve.
- OOM-imminent: any Alloc an err= or out of memory line in the log slice. This trumps everything; flag urgently.
- Slow leak: slope_per_min < -50 AND max - min > 1000 AND no reboot. The heap is monotonically (or near-monotonically) declining. Estimate time-to-zero: min / -slope_per_min minutes. Surface it.
- Fragmentation suspect: slope_per_min close to zero (|x| < 50) BUT min trends down across the window AND the log slice shows Alloc an err warnings WITHOUT total OOM. Means free total is OK but largest contiguous block is shrinking. Recommend a DEBUG_HEAP build to confirm.
- Steady: |slope_per_min| < 50, no error lines. Heap is fine.
- Recovery curve: slope is POSITIVE — heap recovered. Either a workload completed or GC fired. Note it; not a leak.

Report:

/leakhunt window=6h field=free_heap variant=local
────────────────────────────────────────────────────
recorder        : running, telem last_ts 8s ago
build           : DEBUG_HEAP=ON (per-line prefix detected)
samples         : 14,200 over 6h (cadence ~1.5s, log-line synth)
free_heap       : min 92,344 / max 124,008 / range 31,664
slope           : -82 bytes/min (negative — heap declining)
reboots         : none in window
OOM events      : none
error lines     : 3× "Alloc an err=ESP_ERR_NO_MEM" at +4h12m, +5h08m, +5h44m
thread leaks    : (DEBUG_HEAP) MeshPacket -3,124 B over 18 events
                               Router       -1,408 B over 4 events
                               others       -240 B
verdict         : SLOW LEAK — primary suspect MeshPacket thread
est. time-to-OOM: ~1,127 min (~18.8 h) at current slope
evidence        : (3 log line citations with uptimes)

Then: what to do next.

SLOW LEAK, DEBUG_HEAP off → recommend rebuilding with the flag and re-running this skill. Concrete one-liner the operator can copy:
```
mcp__meshtastic__build(env="<env>", build_flags={"DEBUG_HEAP": 1})
mcp__meshtastic__pio_flash(env="<env>", port="<port>", confirm=True)
```
After flash, set debug_log_api back on and wait one window; re-run /leakhunt.
SLOW LEAK, DEBUG_HEAP on → cite the top-leaking thread name from step 6a. Point at the corresponding source file (grep -rn "ThreadName(\"<name>\")" src/); the operator decides what to fix.
FRAGMENTATION SUSPECT → propose pre-allocating any per-packet buffers; or rebuilding with CONFIG_HEAP_TASK_TRACKING=y on ESP32 to see who's holding the largest blocks.
OOM-IMMINENT → flag for immediate attention; don't wait for the next telemetry interval.
STEADY → say so; stop. Don't invent problems.

What NOT to do

Don't assume a leak from a single dip. LocalStats fires every ~60 s and the firmware naturally allocates+frees on each broadcast cycle; one packet sees the trough. Look at the slope, not the deltas.
Don't recommend code changes. This skill diagnoses; the operator decides what to fix.
Don't enable set_debug_log_api automatically — if it's off, telemetry isn't reaching pubsub anyway, and the recorder will be empty. Tell the operator to flip it on and wait, then re-run.
Don't run heavy workloads to "trigger the leak." The recorder is passive; we read what's there.

Companion: `mark_event` for stress runs

If the operator wants to test under stimulus (e.g. blast 50 broadcasts and see what the heap does), they can frame the experiment with markers:

mark_event("burst-start")
… run the workload …
mark_event("burst-end")
/leakhunt window=15m

The markers land in both events.jsonl and logs.jsonl, so the report can show "free_heap dipped 8 KB during the burst window, recovered to baseline within 2 LocalStats cycles" → not a leak.

8.5 KiB Raw Permalink Blame History Unescape Escape