In this post I will explain a recent situation where clients of a SignalR/WebSockets application were disconnected whenever we changed listener configuration on the Azure Application Gateway routing their traffic.
My team and I are working with a client, migrating their on-premises workloads to Microsoft Azure. Some of the workloads are built with SignalR, which provides real-time, in-order messaging over WebSockets. The users of these applications expect a reliable stream of data over long-lived connections.
Our design features an Azure Application Gateway with the Web Application Firewall (WAF) enabled. The public DNS records for the applications point to the AppGw, which inspects the traffic and proxies it to the backend pools that host the applications.
As one can imagine, there has been a lot of testing, debugging, and improvement. That means there have been many configuration changes in the AppGw: listeners, HTTP settings, and backend pools.
We had stable connections from test clients to the applications, but the developers noticed something: every now and then, all clients would lose their connection at once. The developers compared the timestamps and saw a correlation with the runs of our DevOps pipelines that apply changes. In short: every time we updated the AppGw, the clients were disconnected.
I reached out to Microsoft (thank you to Ashutosh who was very helpful!). Ashutosh dug into the platform logs and explained the issue to me.
The WebSocket sessions were handled by the “data plane” of the AppGw resource. Every time a new configuration is applied, a new data plane is created. The old data plane is kept alive for a short period – 30 seconds by default – before being dropped, taking any WebSocket connections it still holds with it. That means every time we applied a change, the existing WebSocket connections were dropped 30 seconds later.
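The mechanism Ashutosh described can be sketched as a toy model – this is illustrative Python, not Azure code, and the class and method names are my own invention. New sessions land on the newest data plane; an applied configuration swaps in a fresh plane and the old one keeps its sessions only until the drain window expires:

```python
# Toy model of the AppGw data-plane swap (illustrative only, not Azure code).
from dataclasses import dataclass, field

@dataclass
class DataPlane:
    version: int
    sessions: list = field(default_factory=list)

class AppGwModel:
    def __init__(self, drain_seconds: int = 30):
        self.drain_seconds = drain_seconds
        self.active = DataPlane(version=1)
        self.draining = []          # (old plane, close_at) pairs
        self.clock = 0

    def connect(self, client: str) -> None:
        # New WebSocket sessions always land on the newest data plane.
        self.active.sessions.append(client)

    def apply_config(self) -> None:
        # A config change creates a fresh data plane; the old one keeps
        # its existing sessions only until the drain window expires.
        old = self.active
        self.draining.append((old, self.clock + self.drain_seconds))
        self.active = DataPlane(version=old.version + 1)

    def tick(self, seconds: int) -> list:
        # Advance time; return the clients whose sessions were dropped.
        self.clock += seconds
        dropped = []
        for plane, close_at in list(self.draining):
            if self.clock >= close_at:
                dropped.extend(plane.sessions)
                plane.sessions.clear()
                self.draining.remove((plane, close_at))
        return dropped

gw = AppGwModel(drain_seconds=30)
gw.connect("client-a")
gw.connect("client-b")
gw.apply_config()            # pipeline run: new data plane created
assert gw.tick(29) == []     # old plane still alive inside the window
print(gw.tick(1))            # 30 s after the change, both sessions drop
```

Note that sessions opened *after* the change survive – only the connections that were alive when the configuration was applied are lost when the old plane is dropped.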
The timeout for the old data plane can be adjusted from the default of 30 seconds up to 1 hour (3600 seconds). This would not solve our issue, though – whether it is 1 minute or 1 hour, a longer timeout just delays the disconnect instead of avoiding it.
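If the drain window Ashutosh described is the one surfaced through the connection-draining setting on the backend HTTP settings – an assumption on my part, worth verifying against your own gateway – it can be raised with the Azure CLI. The resource names below are placeholders:

```shell
# Hypothetical resource names – substitute your own.
# Keeps existing backend connections open for up to 3600 s (the maximum)
# after a change, instead of the default ~30 s. A value of 0 disables draining.
az network application-gateway http-settings update \
  --resource-group my-rg \
  --gateway-name my-appgw \
  --name my-http-settings \
  --connection-draining-timeout 3600
```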
The solution we have come up with is to isolate the “production” workloads behind a stable WAF, while workloads still under active change are served by a separate “pre-staging” WAF. Any changes to the “production” WAF must be done out of hours, unless there is an emergency that demands an immediate change – in which case we accept that disconnects will happen.