2026-03-11 - Elevated Error Rates on Inference Services

Incident Report for Zonos

Postmortem

What products were affected and what was the impact?

Classify, Country of Origin

Impact: DEGRADED PERFORMANCE

What timeframe did this issue occur?

Date: Mar 11, 2026
Time: 2026-03-11 10:12 - 2026-03-11 12:06

How was the issue detected?

Zonos observed elevated error rates affecting the Classify and Country of Origin APIs.

What functionality was affected?

A portion of Classify and Country of Origin inference requests did not return a valid response.

What was the root cause?

A firmware issue on a subset of GPU nodes caused network connectivity failures, which put those nodes into an unhealthy state. Our reconnect logic did not remove the unhealthy nodes from the pool, so they continued to receive requests.
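
For illustration only, the failure mode described above can be sketched as a node pool that takes a node out of rotation before attempting to reconnect it, and readmits it only after a successful probe. This is not Zonos's production code: the `NodePool` class, the `probe` callback, the `on_connection_lost` method, and the node names are all hypothetical and exist only to show the general pattern.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class NodePool:
    """Hypothetical pool of inference nodes (illustrative, not the real system)."""
    probe: Callable[[str], bool]  # assumed health probe: True if the node is reachable
    healthy: List[str] = field(default_factory=list)
    quarantined: List[str] = field(default_factory=list)
    _next: int = 0

    def pick(self) -> str:
        """Round-robin over healthy nodes only."""
        if not self.healthy:
            raise RuntimeError("no healthy nodes available")
        node = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return node

    def on_connection_lost(self, node: str, max_attempts: int = 3) -> bool:
        """Quarantine the node first, then try to reconnect.

        Removing the node before reconnecting means a failed reconnect
        cannot leave an unhealthy node in rotation.
        """
        if node in self.healthy:
            self.healthy.remove(node)
            self.quarantined.append(node)
        elif node not in self.quarantined:
            raise ValueError(f"unknown node: {node}")
        for _ in range(max_attempts):
            if self.probe(node):
                # Readmit only after a successful probe.
                self.quarantined.remove(node)
                self.healthy.append(node)
                return True
        return False


# Usage: "gpu-b" is unreachable, so it stops receiving requests.
reachable = {"gpu-a": True, "gpu-b": False}
pool = NodePool(probe=lambda n: reachable[n], healthy=["gpu-a", "gpu-b"])
pool.on_connection_lost("gpu-b")
picks = [pool.pick() for _ in range(4)]
```

In this sketch, a node whose reconnect attempts all fail stays in `quarantined`, and `pick` never returns it until a later call to `on_connection_lost` succeeds in probing it.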

What was the resolution of the problem, and what steps are being taken to prevent future issues?

Shortly after discovering the problem, we restarted nodes to resolve the connectivity failures and restore service. We then identified the defect in the reconnect logic and released a fix for it.

We have scheduled a firmware upgrade on all nodes to resolve the underlying network issue.

We have also refined our alerting to detect elevated error rates more quickly in the future.

Posted Mar 12, 2026 - 08:00 MDT

Resolved

This incident has been resolved.
Posted Mar 11, 2026 - 12:06 MDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Mar 11, 2026 - 11:54 MDT

Identified

The issue has been identified and a fix is being implemented.
Posted Mar 11, 2026 - 11:44 MDT

Investigating

We are currently investigating this issue.
Posted Mar 11, 2026 - 11:42 MDT
This incident affected: Classify.