Detected: 2024-02-22 10:19:38 UTC
Resolved: 2024-02-22 10:57:22 UTC
The service accepts classification requests but they are initially slow to process resulting in webhooks being triggered and result APIs responding with "404 : predictions not ready" long after the initial requests.
After an initial correction, classification requests are no longer processed until resolution forcing us to increase severity and declare customer impact.
Ended: 2024-02-20 18:05:00 UTC
Classification no longer sent by the system. All prediction requests stay in queue and are not processed, prediction results APIs respond with code 404 : predictions not ready.
In an effort to increase the fairness of tasks processing (especially between training and inference), priority handling was deployed.
It was discovered that after the introduction of these changes, workers started to die unexpectedly. This was not initially too big of an issue because they would restart automatically.
In an effort to resolve this issue, on Feb. 20, 2024 at 4:00 pm the workers were migrated to a different broker (Message Queue) that was configured slightly differently. From that point, workers no longer restarted automatically which interrupted the processing flow.
On Feb, 20, 2024 at 8:15 pm, the installation was rolled back to a known working version.