Default retry libraries treat every error as a network blip. That's fine when every error is a network blip. It's a slow disaster when a 4xx means "this will never succeed" and your job retries it 18 times before giving up.
Our taxonomy is three rules. Network errors and 5xx: retry with backoff. 429: retry, but respect Retry-After. Any other 4xx: don't retry, escalate. The rule that matters is the third one. The first two are industry standard; the third is what stops a single permission misconfiguration from burning a day's compute.
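As a rough illustration, the three rules could be sketched in Python like this. The names `Action`, `classify`, and `delay_seconds` are hypothetical, as are the 1-second base and 60-second cap. The sketch also assumes `Retry-After` has already been parsed into seconds (the header can carry an HTTP date instead) and uses full-jitter exponential backoff, which is one common choice rather than anything the rules require.

```python
import random
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY_BACKOFF = "retry_backoff"  # network errors and 5xx
    RETRY_AFTER = "retry_after"      # 429
    ESCALATE = "escalate"            # any other 4xx


def classify(status: Optional[int]) -> Action:
    """Map an HTTP status to a rule. None stands for a network-level failure."""
    if status is None or 500 <= status <= 599:
        return Action.RETRY_BACKOFF
    if status == 429:
        return Action.RETRY_AFTER
    if 400 <= status <= 499:
        return Action.ESCALATE
    raise ValueError(f"status {status} is not an error")


def delay_seconds(
    action: Action,
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 1.0,   # assumed base delay, in seconds
    cap: float = 60.0,   # assumed upper bound on any backoff delay
) -> Optional[float]:
    """Return how long to wait before the next attempt, or None for "don't retry"."""
    if action is Action.ESCALATE:
        return None
    if action is Action.RETRY_AFTER and retry_after is not None:
        # The server told us when to come back; honor it as given.
        return max(float(retry_after), 0.0)
    # Exponential backoff with full jitter; attempt counts from 0.
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```

A caller would loop on `delay_seconds(classify(status), attempt, retry_after)`, sleep for the returned value, and hand the error to a human or an alerting path as soon as it gets `None` back. A 403 from a misconfigured permission hits that path on the first attempt instead of the eighteenth.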