OpenTelemetry assumes spans end. Agent runs disagree.
A single Amy job can spawn 200 tool calls across an eight-hour window, with retries, partial failures, and human approvals in the middle. Treating that as one trace leaves a root span open for hours, and since SDKs only export a span once it ends, the exporter and the backend both cope badly; treating each tool call as its own trace loses the parent context.
What works: a trace per logical phase, a stable correlation ID stitched through metadata, and ledger-style append-only events for anything we'd want to replay. The traces stop being the source of truth and start being the search index over the events.
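A minimal sketch of that shape, in plain Python with no OpenTelemetry SDK involved. The names `Event`, `Ledger`, `JobRun`, and the phase labels are hypothetical, not Amy's actual code, and `trace_id` here is a random 32-hex string standing in for whatever ID a real tracer would assign to the phase.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Event:
    seq: int                 # position in the ledger; assigned once, never reused
    correlation_id: str      # stable for the whole job, across every phase
    trace_id: str            # one per logical phase
    phase: str
    kind: str                # e.g. "tool_call", "retry", "human_approval"
    payload: dict[str, Any]


class Ledger:
    """Append-only event log. The only mutation offered is append()."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, correlation_id: str, trace_id: str, phase: str,
               kind: str, payload: dict[str, Any]) -> Event:
        event = Event(len(self._events), correlation_id, trace_id,
                      phase, kind, dict(payload))
        self._events.append(event)
        return event

    def replay(self, correlation_id: str) -> list[Event]:
        # Source of truth: every event for the job, in append order.
        return [e for e in self._events if e.correlation_id == correlation_id]

    def by_trace(self, trace_id: str) -> list[Event]:
        # The "search index" view: events that belong to one phase's trace.
        return [e for e in self._events if e.trace_id == trace_id]


class JobRun:
    """One agent job: a single correlation ID, a fresh trace ID per phase."""

    def __init__(self, ledger: Ledger) -> None:
        self.ledger = ledger
        self.correlation_id = uuid.uuid4().hex
        self.phase_traces: dict[str, str] = {}

    def start_phase(self, phase: str) -> str:
        trace_id = uuid.uuid4().hex  # assumption: stand-in for a tracer-issued ID
        self.phase_traces[phase] = trace_id
        return trace_id

    def record(self, phase: str, kind: str, **payload: Any) -> Event:
        return self.ledger.append(self.correlation_id,
                                  self.phase_traces[phase],
                                  phase, kind, payload)


ledger = Ledger()
run = JobRun(ledger)

run.start_phase("plan")
run.record("plan", "tool_call", tool="search", attempt=1)

run.start_phase("approval")
run.record("approval", "human_approval", approved=True)

run.start_phase("execute")
run.record("execute", "tool_call", tool="deploy", attempt=1)
run.record("execute", "retry", tool="deploy", attempt=2)

events = ledger.replay(run.correlation_id)
execute_events = ledger.by_trace(run.phase_traces["execute"])
```

Each phase's trace can close within minutes, so nothing has to stay open for eight hours. The correlation ID is what ties the phases back into one job, and `replay` reads from the ledger rather than from the traces.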