SparkSqlOperator and SparkSubmitOperator are using different types for configurations #40507

Open
2 tasks done
duhizjame opened this issue Jun 30, 2024 · 2 comments · May be fixed by #40527
Labels: area:providers, kind:bug

duhizjame commented Jun 30, 2024

Apache Airflow Provider(s)

apache-spark

Versions of Apache Airflow Providers

apache-airflow==2.9.2
apache-airflow-providers-apache-spark==4.8.2

Apache Airflow version

2.9.2

Operating System

MacOS

Deployment

Docker-Compose

Deployment details

No response

What happened

The SparkSubmitOperator takes a dictionary for its 'conf' property.
The SparkSqlOperator takes a string in the format PARAM=VALUE,PARAM2=VALUE2 for its 'conf' property.

The first option allows a config like this to be passed:

conf = {
        'spark.jars.packages': 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.80.0',
        'spark.driver.extraJavaOptions': '-Divy.cache.dir=/tmp -Divy.home=/tmp',
        'spark.sql.extensions': 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions'
}

while the second option will always split the packages into:
--conf spark.jars.packages=org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 --conf org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.80.0
because the whole string is split on the comma as a delimiter.
This effectively makes it impossible to pass any Spark configuration whose value is itself comma-delimited.
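
To make the splitting concrete, here is a minimal sketch of the behaviour. The function `build_conf_args_from_string` is a hypothetical stand-in written for this illustration, not the provider's code, and the package names are made up. It only mirrors the comma split described above.

```python
from typing import List


def build_conf_args_from_string(conf: str) -> List[str]:
    """Hypothetical helper: turn 'K=V,K2=V2' into --conf arguments by splitting on commas."""
    args: List[str] = []
    for conf_el in conf.split(","):
        args += ["--conf", conf_el.strip()]
    return args


# One value (spark.jars.packages) legitimately contains a comma.
conf_string = (
    "spark.jars.packages=org.example:pkg-a:1.0,org.example:pkg-b:2.0,"
    "spark.driver.memory=2g"
)

args = build_conf_args_from_string(conf_string)
for flag, value in zip(args[::2], args[1::2]):
    print(flag, value)
# --conf spark.jars.packages=org.example:pkg-a:1.0
# --conf org.example:pkg-b:2.0
# --conf spark.driver.memory=2g
```

The second package ends up as its own `--conf` argument with no key, which is the same failure as in the Iceberg/Nessie example.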

The SparkSubmitOperator also exposes a larger set of properties, including the --packages flag, which the spark/bin/spark-sql script supports as well.

What you think should happen instead

The first option allows for more flexibility when adding configs, and a dictionary seems like the right way to store them. Using it in both Spark operators would enforce the same behaviour, making them easier to adjust and maintain. Also less documentation to keep :)

for conf_el in self._conf.split(","):

This is where the config is split on ','.
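
For comparison, a dict-based 'conf' could be turned into arguments roughly as follows. This is only a sketch of the idea proposed in this issue: `build_conf_args_from_dict` is a hypothetical name, the package names are made up, and this is not the actual change in any PR.

```python
from typing import Dict, List


def build_conf_args_from_dict(conf: Dict[str, str]) -> List[str]:
    """Hypothetical helper: emit one --conf KEY=VALUE pair per dict entry, never splitting values."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


conf = {
    # The comma inside the value is preserved because nothing splits on it.
    "spark.jars.packages": "org.example:pkg-a:1.0,org.example:pkg-b:2.0",
    "spark.driver.memory": "2g",
}

dict_args = build_conf_args_from_dict(conf)
for flag, value in zip(dict_args[::2], dict_args[1::2]):
    print(flag, value)
# --conf spark.jars.packages=org.example:pkg-a:1.0,org.example:pkg-b:2.0
# --conf spark.driver.memory=2g
```

Because each key/value pair is already a separate dict entry, no delimiter has to be chosen, so values containing commas pass through unchanged.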

How to reproduce

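Create a DAG and a task:

from airflow.providers.apache.spark.operators.spark_sql import SparkSqlOperator

# `dag` and `ref` are assumed to be defined elsewhere in the DAG file
conf = {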
        'spark.jars.packages': 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.80.0',
        'spark.driver.extraJavaOptions': '-Divy.cache.dir=/tmp -Divy.home=/tmp',
        'spark.sql.extensions': 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions'
}

config_string = ','.join([f"{key}={value}" for key, value in conf.items()])

merge_branch = SparkSqlOperator(
    name="merge_branch",
    task_id="merge_branch",
    conf=config_string, # requires a string instead of a dict
    conn_id='spark',
    dag=dag,
    sql=f"MERGE BRANCH {ref} INTO main IN nessie",
    retries=0
)

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

duhizjame added the area:providers, kind:bug, and needs-triage labels on Jun 30, 2024

boring-cyborg bot commented Jun 30, 2024

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

duhizjame added a commit to duhizjame/dataops-with-nessie that referenced this issue Jun 30, 2024
aritra24 (Collaborator) commented Jul 1, 2024

@duhizjame feel free to raise a PR

aritra24 removed the needs-triage label on Jul 1, 2024