UI latency pausing DAGs from main /dags page (Airflow 3.1.4, Celery + Docker Compose) #60847
Replies: 13 comments 1 reply
-
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; there is no need to wait for approval.
-
I’d like to take a closer look at this. I’ll start by reproducing the behavior locally and comparing the API calls and UI flow between pausing a DAG from the main /dags list and from the DAG detail page.
-
Thank you. Not really, the setup is quite simple and I don't really know what's going on. I have another Airflow instance with pretty much the same config running on k8s and it feels way smoother.
-
Thanks for the additional context, that’s helpful.
-
Could just be a lack of memory for Docker.
-
Thanks for the suggestion, that makes sense. Before proposing any code changes, I’ll focus on verifying whether this is primarily a Docker Compose resource issue or a UI/API behavior difference. I’ll compare the API calls and response timings when pausing from the DAG list versus the DAG detail page and report back with concrete findings. If it turns out to be environment-related rather than a code issue, I’ll close the loop here accordingly.
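For anyone who wants to run the same comparison outside the browser, here is a minimal Python sketch of the timing side only. It is not Airflow code: `time_action` and `summarize` are hypothetical helper names, and the two stand-in functions just simulate a slow and a fast request with `time.sleep`. The durations in them are made up for illustration, not measurements from this deployment.

```python
import statistics
import time
from typing import Callable


def time_action(action: Callable[[], object], repeats: int = 5) -> list[float]:
    """Run `action` several times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(label: str, durations: list[float]) -> dict:
    """Collapse raw durations into a few numbers that are easy to compare."""
    return {
        "label": label,
        "runs": len(durations),
        "median_s": statistics.median(durations),
        "max_s": max(durations),
    }


# Stand-ins for the two code paths (assumptions, for illustration only).
# Replace the bodies with the real requests you want to measure.
def pause_from_list_page() -> None:
    time.sleep(0.05)


def pause_from_detail_page() -> None:
    time.sleep(0.01)


if __name__ == "__main__":
    for label, action in [
        ("list page", pause_from_list_page),
        ("detail page", pause_from_detail_page),
    ]:
        print(summarize(label, time_action(action)))
```

To measure a real deployment, the stand-ins would be swapped for functions that send the same requests the UI sends in each flow (the browser's network tab shows them). Reporting the median alongside the maximum helps separate a consistently slow path from an occasional spike.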
-
What environment-related issue could potentially cause it? Memory seems fine and Docker has plenty of resources: 32 vCPUs and 256 GB of memory. A quick look with htop shows resources are fine, and historical data from the box in Grafana also looks fine. It's also worth mentioning that the main DAGs page takes an awfully long time to load. When refreshing/loading the page you see the skeletons for a solid 5-6 seconds while the DAGs load, whereas on k8s it's pretty much instant.
-
One of the reasons might also be not upgrading to the latest version. 3.1.5 had some "N+1" DAG fixes (too many SQL queries generated on the DAG list page). A huge history in DAG files can also slow queries down. Or maybe it is limited resources on your database, or indexes that have not been vacuumed/cleaned/rebuilt: the usual database maintenance that should happen regardless of which application is using the database. You should take a look at the monitoring of your database, looking for excessive queries, peaks in resource usage, and strange errors, and possibly at the plans of the queries your database runs, looking for full-table scans. Generally, when database statistics are not updated and cleaned up regularly, your table stats might differ wildly from the actual data stored, and the database engine might plan queries in sub-optimal ways, which, for example, might result in loading all rows into memory instead of just loading indexes.

Also, we are just about (next week at the latest) to have an RC of 3.1.6, which you might want to take for a spin before we release it (watch the devlist for the announcement). It has even more small fixes and optimisations in the UI/DB queries generated, and it might be that this will help.

Another thing you might take a look at: FastAPI (when run in containers) recommends ONE worker per container and multiple containers. So, compared to Flask, it's way better to have several running api-servers, each in a separate container, rather than several workers in a single api-server (as counter-intuitive as it might seem).

You can try it all, ideally one by one, and see if things improve after every step. It would be great to have feedback on whether some of that helped.
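To make the "N+1" point concrete, here is a small self-contained sketch. It uses an in-memory SQLite database with a made-up two-table schema (`dag` and `dag_run` here are illustrative and are not Airflow's real tables or queries), and a hypothetical `CountingDb` wrapper that counts how many queries each approach issues to fetch the latest run per DAG.

```python
import sqlite3


def build_db(n_dags: int = 50) -> sqlite3.Connection:
    """Create a toy schema with n_dags DAGs and three runs each."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dag (dag_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE dag_run (dag_id TEXT, run_number INTEGER)")
    for i in range(n_dags):
        conn.execute("INSERT INTO dag VALUES (?)", (f"dag_{i}",))
        conn.executemany(
            "INSERT INTO dag_run VALUES (?, ?)",
            [(f"dag_{i}", r) for r in range(3)],
        )
    conn.commit()
    return conn


class CountingDb:
    """Thin wrapper that counts every query sent through it."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.queries = 0

    def fetchall(self, sql: str, params: tuple = ()) -> list:
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def latest_runs_n_plus_one(db: CountingDb) -> dict:
    """N+1 pattern: one query for the list, then one more query per DAG."""
    result = {}
    for (dag_id,) in db.fetchall("SELECT dag_id FROM dag"):
        rows = db.fetchall(
            "SELECT MAX(run_number) FROM dag_run WHERE dag_id = ?", (dag_id,)
        )
        result[dag_id] = rows[0][0]
    return result


def latest_runs_single_query(db: CountingDb) -> dict:
    """Same result from a single joined, grouped query."""
    rows = db.fetchall(
        "SELECT d.dag_id, MAX(r.run_number) "
        "FROM dag d LEFT JOIN dag_run r ON r.dag_id = d.dag_id "
        "GROUP BY d.dag_id"
    )
    return dict(rows)


if __name__ == "__main__":
    conn = build_db(50)
    slow, fast = CountingDb(conn), CountingDb(conn)
    assert latest_runs_n_plus_one(slow) == latest_runs_single_query(fast)
    print("N+1 queries:", slow.queries)
    print("single-query count:", fast.queries)
```

With 50 DAGs the first approach issues 51 queries and the second issues 1. Each extra query also pays a network round trip to the database, so the gap grows with both the number of DAGs on the page and the latency between the api-server and the database.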
-
Yeah, let's see how it goes; hopefully it solves the issue. I'm just surprised by the difference in load times between the Celery executor and the k8s executor. Also worth mentioning that on my k8s setup I'm using git bundles, and in the Docker Compose one I use S3 bundles.
-
So far
The DB is cleaned fairly often, which is why I doubt it's a DB issue. All metrics in the AWS console look healthy and match what I would expect.
-
It seems none of them improved latency. @potiuk Is there any more info I can give so that we can try to pinpoint the issue? Happy to debug more on my side if you have any other ideas.
-
Here is another idea: try 3.1.6rc1, which has been up for testing since yesterday morning.
-
Converting to discussion (should be converted already). Let us know how the new releases work for you.
-
Apache Airflow version
Other Airflow 3 version (please specify below)
If "Other Airflow 3 version" selected, which one?
3.1.4
What happened?
When navigating the main DAGs list page (/dags) in the UI, clicking the pause toggle for a DAG results in significant latency before the UI reflects the new state. Users see a long delay between the click and the toggle animation updating.
However, when pausing the same DAG from its detail view (inside the DAG view), the action is much faster and more responsive.
What you think should happen instead?
Pausing a DAG from the main DAGs list should be as responsive as pausing from the DAG detail view. Latency should be minimal (<1 second ideally) on both pages.
How to reproduce
3. Click the pause toggle on any listed DAG.
Operating System
AlmaLinux 9.5 (Teal Serval)
Versions of Apache Airflow Providers
Deployment
Docker-Compose
Deployment details
No response
Anything else?
Environment
Are you willing to submit PR?
Code of Conduct