What products were affected and what was the impact?
Classify, Country of Origin
Impact: DEGRADED PERFORMANCE
What timeframe did this issue occur?
| Date | Time |
|---|---|
| Mar 11, 2026 | 10:12 - 12:06 |
How was the issue detected?
Zonos observed elevated error rates affecting the Classify and Country of Origin APIs.
What functionality was affected?
A portion of Classify and Country of Origin inference requests did not return a valid response.
What was the root cause?
A firmware issue on a subset of GPU nodes caused network connectivity failures, which put those nodes into an unhealthy state. Our reconnect logic did not properly remove the unhealthy nodes from the pool, so they continued to receive requests.
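To illustrate the failure mode described above, here is a minimal sketch of a node pool that stops routing to a node after repeated failures and readmits it only once a health probe passes. It is not our production code: the `NodePool` class, the `probe` callback, and the `max_failures` threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class NodePool:
    """Round-robin pool that quarantines nodes after repeated failures."""

    probe: Callable[[str], bool]  # hypothetical health check: True if the node is reachable
    max_failures: int = 3         # assumed threshold, not taken from the incident
    healthy: List[str] = field(default_factory=list)
    quarantined: List[str] = field(default_factory=list)
    _failures: Dict[str, int] = field(default_factory=dict)
    _cursor: int = 0

    def add(self, node: str) -> None:
        """Register a node as healthy with a clean failure count."""
        self.healthy.append(node)
        self._failures[node] = 0

    def pick(self) -> str:
        """Return the next healthy node; quarantined nodes are never returned."""
        if not self.healthy:
            raise RuntimeError("no healthy nodes available")
        node = self.healthy[self._cursor % len(self.healthy)]
        self._cursor += 1
        return node

    def report_success(self, node: str) -> None:
        """Reset the consecutive-failure count after a good response."""
        self._failures[node] = 0

    def report_failure(self, node: str) -> None:
        """Count a failure and quarantine the node once it reaches the threshold."""
        self._failures[node] = self._failures.get(node, 0) + 1
        if self._failures[node] >= self.max_failures and node in self.healthy:
            self.healthy.remove(node)
            self.quarantined.append(node)

    def recheck_quarantined(self) -> None:
        """Readmit quarantined nodes only if the health probe now passes."""
        for node in list(self.quarantined):
            if self.probe(node):
                self.quarantined.remove(node)
                self._failures[node] = 0
                self.healthy.append(node)
```

In this sketch, a caller would invoke `report_failure` whenever a request to a node fails and run `recheck_quarantined` on a timer. The key property is that a node that keeps failing leaves the `healthy` list, so `pick` cannot hand it further requests until the probe succeeds.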
What was the resolution of the problem, and what steps are being taken to prevent future issues?
Shortly after discovering the issue, we restarted nodes to resolve the connectivity failures and restore service. We then identified the defect in the reconnect logic and released a fix to prevent a recurrence.
We have scheduled a firmware upgrade on all nodes to resolve the underlying network issue.
We have also refined our alerting to catch errors more quickly in the future.