reciTAL × status

Incident #9 / SEV-2
Predictions are extremely slow to process

resolved

Detected: 2024-02-22 10:19:38 UTC

Resolved: 2024-02-22 10:57:22 UTC

Summary

The service accepts classification requests, but they are initially slow to process. As a result, webhooks are triggered, and the result APIs keep responding with "404 : predictions not ready", long after the initial requests.

After an initial correction, classification requests are no longer processed at all until resolution, which forces us to raise the severity and declare customer impact.

Customer Impact

Ended: 2024-02-20 18:05:00 UTC

Classifications are no longer sent by the system. All prediction requests stay in the queue unprocessed, and the prediction results APIs respond with "404 : predictions not ready".
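For API clients, a 404 "predictions not ready" response during such an incident means "retry later", not "the request is lost". The sketch below is one way a client might poll with exponential backoff under that assumption. The function name `wait_for_prediction`, the injected `fetch` callable, and all timing values are hypothetical and are not part of the reciTAL API.

```python
import time
from typing import Callable, Optional, Tuple


def wait_for_prediction(
    fetch: Callable[[], Tuple[int, Optional[dict]]],
    max_wait_s: float = 60.0,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Poll until the result is ready (200) or the time budget is spent.

    `fetch` is a hypothetical callable returning (status_code, body).
    Returns the body on 200, or None if the result never became ready.
    """
    waited = 0.0
    delay = initial_delay_s
    while True:
        status, body = fetch()
        if status == 200:
            return body
        if status != 404:
            # Anything other than "not ready" is treated as a real error.
            raise RuntimeError(f"unexpected status {status}")
        if waited >= max_wait_s:
            return None
        sleep(delay)
        waited += delay
        # Double the delay after each attempt, up to a ceiling.
        delay = min(delay * 2, max_delay_s)
```

Injecting `fetch` and `sleep` keeps the sketch independent of any HTTP library and makes it testable without real waiting. A client that also registers a webhook would typically use polling like this only as a fallback.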

Root Cause

In an effort to make task processing fairer (especially between training and inference), priority handling was deployed.
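As an illustration of what priority handling between task types can look like, here is a minimal sketch of an in-memory priority queue. It assumes, purely for the example, that inference tasks are served before training tasks and that tasks of the same type keep their arrival order. The report does not describe the actual scheme or broker configuration, so the `TaskQueue` class and the `PRIORITY` mapping are hypothetical.

```python
import heapq
import itertools
from typing import Any, List, Tuple

# Hypothetical priority levels (assumption): a lower number is served first.
PRIORITY = {"inference": 0, "training": 1}


class TaskQueue:
    """Priority queue that keeps FIFO order within each priority level."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str, Any]] = []
        # Monotonic counter used as a tie-breaker for equal priorities.
        self._counter = itertools.count()

    def put(self, kind: str, payload: Any) -> None:
        entry = (PRIORITY[kind], next(self._counter), kind, payload)
        heapq.heappush(self._heap, entry)

    def get(self) -> Tuple[str, Any]:
        _, _, kind, payload = heapq.heappop(self._heap)
        return kind, payload

    def __len__(self) -> int:
        return len(self._heap)
```

Note that strict priorities like these can starve the lower-priority type under sustained load. A scheme aimed at fairness would usually add aging or per-type quotas, which this sketch leaves out.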

After these changes were introduced, workers started to die unexpectedly. This was not a major issue at first because they restarted automatically.

In an effort to resolve this issue, on Feb. 20, 2024 at 4:00 pm the workers were migrated to a different broker (message queue) that was configured slightly differently. From that point on, workers no longer restarted automatically, which interrupted the processing flow.

On Feb. 20, 2024 at 8:15 pm, the installation was rolled back to a known working version.