* Implement rotating JSONL recorder for persistent logging * Fixes * Update documentation and clean up imports in command files * Address remaining recorder review feedback Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/2541773c-869a-463f-9fae-8505272c06ff Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com> * recorder: fix lock re-entry deadlock on start() and force_rotate_all() The previous "Fixes" commit added `_files_snapshot()` which acquires `self._lock` so handlers don't race with `stop()` clearing `_files`. But two callers were already holding `self._lock` when they invoked methods that go through the snapshot: - `start()` writes the `recorder_start` event from inside its `with self._lock:` block. `_write_event` -> `_files_snapshot` re-acquires the same non-reentrant `threading.Lock`, freezing process startup. - `force_rotate_all()` calls `self.status()` (which also acquires `self._lock`) while still holding the lock from rotating each file. Both fixes release the lock before the call. The recorder_start marker still lands in events.jsonl because the started/started_at flags are already set when we write it. Verified end-to-end against the standalone /tmp/verify_pr_fixes.py harness — all 9 PR review-comment fixes pass, including pause/resume event ordering and concurrent start/stop without KeyError. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fix markdown linting issues in leakhunt.md and repro.md * Handle recorder startup and query review fixes Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/78540a9f-fe62-4350-b252-0ae5621f0b8a Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com> * Tighten recorder follow-up tests Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/78540a9f-fe62-4350-b252-0ae5621f0b8a Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com> * Stabilize recorder startup tests Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/78540a9f-fe62-4350-b252-0ae5621f0b8a Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com> * Remove brittle recorder startup test Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/78540a9f-fe62-4350-b252-0ae5621f0b8a Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com> * Polish recorder follow-up errors Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/78540a9f-fe62-4350-b252-0ae5621f0b8a Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com> * Refine recorder startup and regex errors Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/78540a9f-fe62-4350-b252-0ae5621f0b8a Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com> * Clean up recorder follow-up nits Agent-Logs-Url: https://github.com/meshtastic/firmware/sessions/78540a9f-fe62-4350-b252-0ae5621f0b8a Co-authored-by: thebentern <9000580+thebentern@users.noreply.github.com> * Trunk --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5.2 KiB
description, argument-hint
| description | argument-hint |
|---|---|
| Re-run a specific test N times in isolation to triage flakes, diff firmware logs between passes and failures | <test-node-id> [count=5] |
/repro — flakiness triage for one test
Re-run a single pytest node ID N times in isolation, track pass rate, and surface what's different in the firmware logs between the passing attempts and the failing ones. Turns "it's flaky, I guess" into "it fails when X, passes when Y."
What to do
-
Parse
$ARGUMENTS: first token is the pytest node id (e.g.tests/mesh/test_direct_with_ack.py::test_direct_with_ack_roundtrip[nrf52->esp32s3]); second token is an integer count (default5, cap at20). If the first token doesn't look like a test path (no::and notests/prefix), treat the whole$ARGUMENTSas a-kfilter instead. -
Sanity-check the hub first (so we're not measuring "nothing plugged in" N times): call
mcp__meshtastic__list_devices. If the test name containsnrf52oresp32s3and the matching VID isn't present, stop and report — re-running won't help. -
Loop N times. For each iteration:
./mcp-server/run-tests.sh <test-id> --tb=short -p no:cacheproviderCapture: exit code, duration, and (on failure) the
Meshtastic debugfirmware log section frommcp-server/tests/report.html.-p no:cacheprovidersuppresses pytest's.pytest_cachewrites so iterations don't influence each other. -
Track a small structured tally:
attempt 1: PASS (42s) attempt 2: FAIL (128s) ← firmware log 200-line tail captured attempt 3: PASS (39s) attempt 4: FAIL (121s) attempt 5: PASS (41s) -------------------------------------- pass rate: 3/5 (60%) | mean duration: 74s -
On mixed outcomes: diff the firmware log tails between a representative passing attempt and a representative failing attempt. Focus on:
- Error-level lines only present in failures (
PKI_UNKNOWN_PUBKEY,Alloc an err=,Skip send,No suitable channel) - Timing around the assertion event — did a broadcast go out, was there an ACK, did NAK fire?
- Device state fields that changed (nodesByNum entries, region/preset, channel_num)
Surface the top 3 differences as a "passes when / fails when" table. Don't dump full logs — pull specific lines with uptime timestamps.
- Error-level lines only present in failures (
5a. Archive recorder slices per attempt (no extra device interaction; the recorder runs autouse). Right after each attempt finishes, capture its (start_ts, end_ts) and call mcp__meshtastic__recorder_export(start=<start>, end=<end>, dest_dir="mcp-server/tests/repro_artifacts/<safe-test-id>/attempt_<n>/"). This drops a logs.jsonl, telemetry.jsonl, packets.jsonl, and events.jsonl snapshot scoped to the attempt window. Use these for cross-attempt diffs in step 5: jq '.line' logs.jsonl is faster than re-running the test, and the telemetry slice lets you compare heap behavior across attempts.
-
Classify the flake into one of:
- LoRa airtime collision → pass rate improves with fewer concurrent transmitters; propose a
time.sleepgap or retry bump in the test body. - PKI key staleness → fails on first attempt, passes after self-heal; existing retry loop in
test_direct_with_ack.pyhandles this. - NodeInfo cooldown →
Skip send NodeInfo since we sent it <600s agoin fail-only logs; needsbroadcast_nodeinfo_ping()warmup. - Hardware-specific (one direction fails, other passes; one device's firmware is older; driver wedged) → specific recovery pointer. For a device that's wedged past
touch_1200bps, the next escalation isuhubctl_cycle(role=..., confirm=True)to hard-power-cycle its hub port (requiresuhubctlinstalled). - Device went dark mid-run → fails from some attempt onward, never recovers, firmware log stops arriving. Almost always hardware: a Guru crash + frozen CDC. Hard-power-cycle via
uhubctl_cycle(role=..., confirm=True)before the next iteration; if that also fails, escalate to replug. - Genuinely unknown → say so; don't invent a root cause.
- LoRa airtime collision → pass rate improves with fewer concurrent transmitters; propose a
-
Report back with:
- Pass rate and mean duration.
- Classification + evidence (the specific log lines that support it).
- A suggested next step (re-run with specific args, open
/diagnose, edit a specific test file, nothing).
Examples
/repro tests/mesh/test_direct_with_ack.py::test_direct_with_ack_roundtrip[esp32s3->nrf52] 10— runs 10 times, diffs firmware logs./repro broadcast_delivers— no::, notests/, so interpreted as-k broadcast_delivers; runs every matching test the default 5 times./repro tests/telemetry/test_device_telemetry_broadcast.py 3— shorter run for a slow test.
Constraints
- Don't exceed
count=20per invocation — airtime and USB wear add up. If the user asks for 50, negotiate down. - Don't rebuild firmware as part of triage; flakes that only reproduce under different firmware belong in a separate session.
- If the FIRST attempt fails AND the rest all pass, that's a classic "state leak from a prior test" → say so and suggest running with
--force-bakeor starting from a clean state rather than chasing the first failure.