Airflow Triggerer facing frequent restarts #33647
Comments
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; no need to wait for approval. |
Thanks for the detailed analysis and the cost breakdown before/after. This is super helpful. Looks like an index hint might be needed or something like that. Very interesting one. I will mark it for 2.7.1, hoping maybe someone will have time to fix it before then. |
2.6.3 had a known issue with Triggerer health which was fixed in 2.7.0 #33089 |
I don't think so - this looks more like the trigger query taking far too long because the DB optimiser does not choose the right plan to execute it efficiently. |
Hello @potiuk thank you for the reply. For now, is there a way we can solve this, other than regularly running the ANALYZE command? |
No idea. Because I do not know the reason yet. |
Someone will have to take a look and investigate it |
Sure, thanks for help though :) |
To be honest, it would be better to have a rule not to use query hints. For example, in vanilla Postgres everything in this area is left to the planner. In general, it is better to get rid of non-constant-sized IN filters (the couple of statuses for tasks and dags) and replace them with other methods; one possible alternative is sketched below.
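One way to avoid a large, variable-size IN list (a sketch only, not necessarily the method meant above; the temporary-table name and helper function are made up for illustration) is to stage the ids in a temporary table and join against it, so the optimiser sees an ordinary join instead of an ever-changing IN expression:

```python
from sqlalchemy import text
from sqlalchemy.orm import Session


def fetch_triggers_by_ids(session: Session, ids: list[int]):
    # Stage the ids in a session-scoped temporary table (MySQL syntax shown).
    session.execute(text("CREATE TEMPORARY TABLE IF NOT EXISTS _trigger_ids (id BIGINT PRIMARY KEY)"))
    session.execute(text("DELETE FROM _trigger_ids"))
    if ids:
        session.execute(text("INSERT INTO _trigger_ids (id) VALUES (:id)"), [{"id": i} for i in ids])
    # Join against the staged ids instead of using `trigger.id IN (...)`.
    return session.execute(
        text("SELECT t.* FROM `trigger` t JOIN _trigger_ids i ON i.id = t.id")
    ).all()
``` |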
@shubhransh-eb I guess you use MySQL backend? If so, I wonder which version? |
And I am currently thinking about how to optimize or get rid of the magic wrapper: airflow/airflow/utils/sqlalchemy.py Line 535 in 32a490e
|
Hello , |
Not sure how much help this will provide, but we are facing this issue now around every week, or at most once every two weeks. Just to mitigate it, we have added alerts on top of our database (MySQL) so we get alerted if CPU spikes (because of the frequent restarts of the triggerer), and then we manually run the ANALYZE command. |
First of all, I would recommend considering upgrading to a newer version of MySQL; 5.7 is almost EOL, and even if Amazon keeps supporting MySQL 5.7 on Aurora, there is a big chance that Airflow will stop supporting MySQL 5.7 in versions released after 31 Oct 2023. That means further improvements in the triggerer would not be available. In addition, 8.0 should provide a better query analyser/planner. Just make sure that you test the migration on a snapshot of the DB before doing this on the production database.
Anyway, I inspected the data transfers between the Triggerer and TriggerJob; it might help someone (maybe it will be me) who wants to optimise this: |
Seems like steps 1-4 might be executed in one query, with some additional overhead on the data captured, but it might reduce execution time on the DB side; however, it would require additional filtering on the client (Airflow) side. |
Hello @Taragolis Just wanted to confirm: what you are suggesting is that the query used for triggers needs to be updated to solve this, correct? |
Let me explain how I see the options we have here (I have not done a detailed analysis of what is wrong - these are somewhat informed guesses). I am not sure you can do much more NOW than analysing the tables periodically until the code of Airflow is updated (but maybe you can also attempt to PR some changes). Likely we need to look at the steps involved and optimise the way the DB is used.
Running analyze frequently should likely help you (and you can likely even schedule it every day, for example) - but I think fundamentally someone (maybe @Taragolis or someone else) needs to optimise the way we run queries to get rid of the effect you see when you have a huge number of triggers happening.
The root cause is - I believe - that when you add and delete a lot of data, at some point in time the built-in optimiser of MySQL gets confused about the fastest execution plan to get the data, and likely produces a plan that is not at all optimised. I think the main reason is that indexes are never rebuilt: they grow in size when data is frequently deleted and added, and at some point the optimizer sees that the index is so big that it is better and faster not to use it at all. This is why "analyze" helps - it looks at the actual data left and allows the optimizer to find a better, more optimized way to read the data; it rebuilds the index and makes it much smaller, and then the optimizer starts using it again. I see two ways you could approach it:
These are somewhat guesses - maybe @Taragolis, who has done a bit more analysis, can also confirm whether my thinking is right. |
Thanks for the suggestions @potiuk. For now we run the ANALYZE command regularly. |
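For anyone who wants to automate that mitigation, a minimal sketch of a maintenance DAG that refreshes the optimizer statistics daily could look like the following (the dag_id and the table list are illustrative assumptions; it assumes a MySQL metadata DB reachable from the worker, with privileges to run ANALYZE):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.session import create_session
from sqlalchemy import text


def analyze_tables():
    # ANALYZE TABLE refreshes the statistics the MySQL optimizer uses when planning queries.
    with create_session() as session:
        session.execute(text("ANALYZE TABLE task_instance"))
        session.execute(text("ANALYZE TABLE `trigger`"))


with DAG(
    dag_id="metadata_db_analyze",  # hypothetical dag_id
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
):
    PythonOperator(task_id="analyze_metadata_tables", python_callable=analyze_tables)
```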
To be honest, I had a look after I initially found this issue, but I was lying in bed checking the code through a browser on an iPad and just forgot to write a message. That means all findings need to be verified first. I assume that we use this approach:
Another problem is that we operate with |
I like the position of one Postgres-vendor developer about hints, something like: "Maybe we want to have hints in vanilla Postgres, but not in the same way they are implemented in Oracle; in our product we need to implement something closely related to help people who migrate from OracleDB to our product." In general this comes from the fact that statistics are in most cases better, especially when it comes to CBO (Cost-Based Optimisation) or the next generation of CBO. The problem with a hint is that it fixes things "here and now": it might work in this particular case, with this particular amount of data, these particular indexes, this particular amount of memory, for this particular user - and as soon as some of those parameters change, things could become worse, or not improve at all compared to not having the hint. This is just my personal position: "A query hint is a solution of last resort, after you have tried all the other last-resort solutions."
That is nice.
@shubhransh-eb I'm not an expert on MySQL, but is there any configuration which might turn on/off automatic gathering of table statistics (aka ANALYZE)? Or maybe by design you are expected to manually run ANALYZE from time to time. Compared to Postgres, I know for sure that the autoanalyze daemon runs in the background there, and if a user turns it off then queries on high-intensity workloads become slower over time. But even with the Postgres autoanalyze daemon, in some cases it is better to manually run ANALYZE TABLE, especially after a huge delete + insert. |
I am not an expert in MySQL either, but from what I found on the internet, I don't think MySQL automatically runs the analyze command; that could be the reason why we have to run it manually to make this work. |
Also, in our case this issue happens when we have around 150 sensors starting at around the same time (within a minute). |
I think MySQL should run something to gather statistics; without it, it is hardly possible to calculate costs for queries. A manual ANALYZE in this case is something like: "Forget everything you know about the table and collect new statistics."
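For reference, InnoDB (the default MySQL storage engine) does keep persistent statistics and, with innodb_stats_auto_recalc enabled (the default), recalculates them automatically - but only after roughly 10% of a table's rows have changed, which can lag badly behind huge delete/insert bursts. A quick way to check the relevant settings (the connection string below is a placeholder):

```python
from sqlalchemy import create_engine, text

engine = create_engine("mysql+mysqldb://user:password@host/airflow")  # placeholder DSN

with engine.connect() as conn:
    # innodb_stats_persistent / innodb_stats_auto_recalc control whether and when
    # InnoDB refreshes the statistics used by the query planner.
    for name, value in conn.execute(text("SHOW VARIABLES LIKE 'innodb_stats%'")):
        print(name, value)
```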
That could also be a reason. Again, I am not an expert in MySQL and have no idea how it handles many simultaneous connections. In Postgres it is quite an expensive operation (time + memory), so the base recommendation was to use a connection pooler between the DB and Airflow. Internally Airflow uses the SQLAlchemy pooler, but it is limited to a single process, so it is better to have something in between; since you use managed MySQL on AWS, you might try RDS Proxy. And last but not least, another possible reason is the fact that most of the deferrable operators are not truly async, especially something like TaskStateTrigger, which might keep a session open for a very long time and prevent gathering statistics from the database. It should not be a problem on Postgres, but who knows, maybe it is a problem for MySQL. This one is my assumption. |
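To illustrate the per-process pooling mentioned above (a sketch only; the connection string and numbers are placeholders, and Airflow exposes equivalent sql_alchemy_* settings in its config rather than requiring you to build an engine by hand):

```python
from sqlalchemy import create_engine

# Every Airflow process (scheduler, triggerer, webserver worker, ...) keeps its own
# pool like this, so the limits are per process; an external pooler such as RDS Proxy
# can still consolidate connections across all of them.
engine = create_engine(
    "mysql+mysqldb://user:password@host/airflow",  # placeholder DSN
    pool_size=5,         # connections kept open per process
    max_overflow=10,     # extra connections allowed under bursts
    pool_recycle=1800,   # recycle connections periodically to avoid stale ones
    pool_pre_ping=True,  # validate a connection before handing it out
)
```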
So we had an incident in the morning, where we had to run the following commands to bring back the table:
Our hunch is that since trigger_id is set to , we have updated the value in our table and will analyze it; another possible solution could be to set the page_size for this table higher so that it can handle this. |
@shubhransh-eb Can you please add the previous version of Airflow where it worked fine? We upgraded from Airflow 2.3.4 to 2.7.2. We are also using MySQL 8 (self-hosted) and faced the query taking a long time even with a single running trigger entry, in one of the environments where the db migration was done in place and which has a lot of records in other tables. The triggerer just hangs with the query executing. In the other environment, the db migration was done on a backup-and-restored database with fewer records, and the triggerer job runs there with no problem. We found by trial and error that making dag_run load lazily helped the triggerer process run fine, though the list-triggers page in the UI still hangs when trying to use it with active triggers. It loads fine without lazy loading when there is no trigger, though. I am not sure if it is relevant to you, but the query mentioned in the issue description was the same one taking a long time in our case, so I thought I would add it here.
airflow/airflow/models/trigger.py Lines 99 to 110 in 4824ca7
Lazy-load dag_run. We tried noload, but it seems this is needed for logging trigger logs (there were some db changes to reference task_instance for logging), though I am not sure if that could be the issue with MySQL.

```python
def bulk_fetch(cls, ids: Iterable[int], session: Session = NEW_SESSION) -> dict[int, Trigger]:
    """Fetch all the Triggers by ID and return a dict mapping ID -> Trigger instance."""
    query = session.scalars(
        select(cls)
        .where(cls.id.in_(ids))
        .options(
            joinedload("task_instance").lazyload("dag_run"),
            joinedload("task_instance.trigger"),
            joinedload("task_instance.trigger.triggerer_job"),
        )
    )
    return {obj.id: obj for obj in query}
``` |
Hello @tirkarthi |
For what it's worth, I managed to fix this issue by cleaning up some old records from the table. From 2.8.0 onwards, |
@QuintenBruynseraede we are doing somewhat the same thing; as mentioned, we are deleting all task instances older than 90 days, but we will check this as well. Thanks for the help :) |
Hi @arunravimv, you can open a PR with your suggested code changes and we can review the specific suggestion on the PR itself. |
Hi @eladkal, we've implemented a patch in which the joinedload of task_instance is replaced with selectinload. Below are the code snippets and EXPLAIN ANALYZE output for both variants.

joinedload

Airflow Code Snippet

```python
query = session.scalars(
    select(cls)
    .where(cls.id.in_(ids))
    .options(
        joinedload("task_instance"),
        joinedload("task_instance.trigger"),
        joinedload("task_instance.trigger.triggerer_job"),
    )
)
```

Explain Analyze

```
-> Nested loop left join  (cost=101 rows=95) (actual time=0.22..0.359 rows=3 loops=1)
    -> Nested loop left join  (cost=67.8 rows=95) (actual time=0.21..0.348 rows=3 loops=1)
        -> Nested loop left join  (cost=34.6 rows=95) (actual time=0.204..0.338 rows=3 loops=1)
            -> Filter: (`trigger`.id in (969,968,984))  (cost=1.36 rows=3) (actual time=0.049..0.0565 rows=3 loops=1)
                -> Index range scan on trigger using PRIMARY over (id = 968) OR (id = 969) OR (id = 984)  (cost=1.36 rows=3) (actual time=0.048..0.0545 rows=3 loops=1)
            -> Nested loop inner join  (cost=35.9 rows=31.7) (actual time=0.0915..0.0932 rows=1 loops=3)
                -> Index lookup on task_instance_1 using ti_trigger_id (trigger_id=`trigger`.id)  (cost=8.97 rows=31.7) (actual time=0.0716..0.0731 rows=1 loops=3)
                -> Single-row index lookup on dag_run_1 using dag_run_dag_id_run_id_key (dag_id=task_instance_1.dag_id, run_id=task_instance_1.run_id)  (cost=0.251 rows=1) (actual time=0.0194..0.0195 rows=1 loops=3)
        -> Single-row index lookup on trigger_1 using PRIMARY (id=task_instance_1.trigger_id)  (cost=0.251 rows=1) (actual time=0.00301..0.00305 rows=1 loops=3)
    -> Single-row index lookup on job_1 using PRIMARY (id=trigger_1.triggerer_id)  (cost=0.251 rows=1) (actual time=0.00316..0.00321 rows=1 loops=3)
```

selectinload

Airflow Code Snippet

```python
query = session.scalars(
    select(cls)
    .where(cls.id.in_(ids))
    .options(
        selectinload("task_instance"),
        joinedload("task_instance.trigger"),
        joinedload("task_instance.trigger.triggerer_job"),
    )
)
```

Explain Analyze

```
-> Nested loop inner join  (cost=5.26 rows=3) (actual time=0.0895..0.362 rows=3 loops=1)
    -> Nested loop left join  (cost=4.21 rows=3) (actual time=0.065..0.313 rows=3 loops=1)
        -> Nested loop left join  (cost=3.16 rows=3) (actual time=0.0521..0.298 rows=3 loops=1)
            -> Index range scan on task_instance using ti_trigger_id over (trigger_id = 968) OR (trigger_id = 969) OR (trigger_id = 984), with index condition: (task_instance.trigger_id in (969,968,984))  (cost=2.11 rows=3) (actual time=0.0399..0.273 rows=3 loops=1)
            -> Single-row index lookup on trigger_1 using PRIMARY (id=task_instance.trigger_id)  (cost=0.283 rows=1) (actual time=0.00755..0.00759 rows=1 loops=3)
        -> Single-row index lookup on job_1 using PRIMARY (id=trigger_1.triggerer_id)  (cost=0.283 rows=1) (actual time=0.00454..0.00458 rows=1 loops=3)
    -> Single-row index lookup on dag_run_1 using dag_run_dag_id_run_id_key (dag_id=task_instance.dag_id, run_id=task_instance.run_id)  (cost=0.283 rows=1) (actual time=0.016..0.016 rows=1 loops=3)
``` |
That would be best |
@arunravimv We are also facing an issue where the trigger entry is inserted into the table but takes 20-30 seconds to be picked up by the triggerer, causing delays between trigger creation and the trigger actually being executed. The delay is not consistent, though; sometimes the triggerer picks up the trigger quickly. We are using MySQL with a large number of task instances and active dags. Regarding performance improvements, did this patch help with reducing that delay, or with any other specific performance issue? Thanks |
@tirkarthi we are able to pick triggers up in the poll almost immediately (1-2 seconds), but I think this also depends on a lot of parameters like DB configuration/load and the number of concurrent triggers and triggerer processes you are running. |
Apache Airflow version
Other Airflow 2 version (please specify below)
What happened
We are using Airflow version: 2.6.3
We have the metastore in AWS.
At around 12:00 AM PST, we have around 150+ async sensors starting at the same time. They act as our sensors to wait for upstream data. We have them waiting for around 6-12 hours daily. Now, after the upgrade, after running for 4-5 days we see that the triggerer gets restarted automatically.
On investigation we found that the query used by the triggerer to get the list of triggers is taking a lot of time, causing the triggerer to kill the Python code and hence restart.
We are able to resolve it after running the analyze command on the task_instance table.
Query used:
It takes around 5-6 minutes, and after running the analyze command it takes less than 1 second.
The number of sensors is the same before and after the upgrade.
What you think should happen instead
No response
How to reproduce
Operating System
NAME="Debian GNU/Linux" VERSION_ID="11" VERSION="11 (bullseye)"
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct