
FLUX inference is down

Jan 06 at 05:16am PST
Affected services
Web application
Private Models Infrastructure

Resolved
Jan 06 at 05:16am PST

On January 6, 2026, between 2:09 PM and 4:30 PM CET, we experienced a service disruption specifically affecting our FLUX image generation models. During this window, users attempting to use these models would have seen their generations load indefinitely without completing.
All other services, including our standard model training and other generation engines, remained fully operational.

What Happened?
The interruption was caused by a technical issue during a scheduled update to our inference infrastructure. A specific error in our model-loading process prevented the large data files required by the FLUX models from initializing correctly.
While our automated monitoring systems initially flagged the deployment as "healthy," the underlying storage was in an inconsistent state, leading to failed requests once the update went live.
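To illustrate the gap (a hypothetical sketch, not our production code): a shallow health check only confirms that the service is responding, whereas an integrity check would also compare the model weight files on disk against a known checksum manifest.

```python
import hashlib
from pathlib import Path

def shallow_health_check(process_responding: bool) -> bool:
    # The kind of check that reported "healthy" here: it only asks whether
    # the service process is up, not whether the weights on disk are intact.
    return process_responding

def weights_match_manifest(weight_file: Path, expected_sha256: str) -> bool:
    # A deeper check: hash the weight file and compare it to the checksum
    # recorded when the model was published.
    if not weight_file.is_file():
        return False
    digest = hashlib.sha256()
    with weight_file.open("rb") as f:
        for chunk in iter(lambda: f.read(8 * 1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256
```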

How We Resolved It
Our engineering team identified the root cause at 2:10 PM CET.
We implemented a fix that:
- Corrected the model download sequence to ensure data integrity.
- Wiped and resynchronized the affected storage volumes.
- Validated the fix through our preview environment before pushing it to all users.
Service was fully restored by 4:30 PM CET.

What We’re Doing to Prevent This
To prevent a recurrence, we have implemented the following:
- Enhanced Deployment Logic: We have updated our deployment process to "fail fast." If a model does not load completely and correctly, the update now aborts automatically instead of going live.
- Pre-Production Testing: We have added a mandatory validation step in a separate preview environment for all infrastructure changes before they reach production.
- Improved Monitoring: We have added health checks that verify the integrity of the model files on disk, so the service only accepts traffic once it is fully ready (sketched below).
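For readers who want the shape of the change, the sketch below combines the fail-fast and readiness behaviour. The file names, sizes, and function names are placeholders rather than our actual configuration; the point is that the service exits before accepting traffic if any model file is missing or truncated, so the orchestrator keeps the previous version live.

```python
import sys
from pathlib import Path

# Placeholder manifest: expected size in bytes for each model file.
MANIFEST = {
    "flux-dev.safetensors": 24_000_000_000,       # illustrative value
    "flux-schnell.safetensors": 24_000_000_000,   # illustrative value
}

READY = False  # what a readiness endpoint would report to the load balancer

def validate_model_files(model_dir: Path) -> None:
    # Fail fast: raise on the first missing or truncated file so the
    # deployment aborts instead of going live with inconsistent storage.
    for name, expected_size in MANIFEST.items():
        path = model_dir / name
        if not path.is_file():
            raise RuntimeError(f"missing model file: {path}")
        if path.stat().st_size != expected_size:
            raise RuntimeError(f"truncated model file: {path}")

def start_service(model_dir: Path) -> None:
    global READY
    try:
        validate_model_files(model_dir)
    except RuntimeError as exc:
        print(f"model validation failed, aborting rollout: {exc}", file=sys.stderr)
        sys.exit(1)   # the orchestrator keeps the previous version serving
    READY = True      # only now is traffic routed to this instance
```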