
Getting "error sending request for url" AzureError when writing very large deltatable to Azure Gen 2 #2639

Open
Josh-Hiz opened this issue Jun 30, 2024 · 9 comments
Labels
bug (Something isn't working), question (Further information is requested)

Comments

@Josh-Hiz

Environment

Delta-rs version:
This happens on both 0.18.1 and 0.16.1; I haven't tested other versions.

Environment: Python 3.11

  • OS: Windows 10

Bug

What happened:

When writing an extremely large Delta table (30,000 partition folders in total) to Azure Gen 2, I keep getting the following:

OSError: Generic MicrosoftAzure error: Error after 0 retries in 30.0027383s, max_retries:10, retry_timeout:180s, source:error sending request for url (url_here): operation timed out

This error happens regardless of engine (I tested both Rust and PyArrow) and regardless of deltalake version (I tried 0.18.1 and 0.16.1). I run the following call after creating an extremely large dataframe via pd.concat:

    write_deltalake(
        table_or_uri=f"url_here",
        data=data,  # extremely large pandas DataFrame
        storage_options={options_here},
        mode="overwrite",
        partition_by=partition_scheme,
        engine="rust",
    )

My Delta table contains millions of rows; however, that should not be a problem for writing to the lake, so I am not sure why I am getting this error at all. Writing very small tables (thousands of rows) works fine. What could the cause and solution be?

Everything else works, including the concatenation; it only errors when I try to write the dataframe.

What you expected to happen:

For the write to be successful regardless of how long it takes.

How to reproduce it:

Most likely you need an extremely large Delta table of millions of rows (gigabytes of data) and then try performing a write to Azure Gen 2.

It is important to note that the error message above is from the PyArrow engine; the Rust engine is similar except it performs 10 retries. I don't know why Azure won't even retry in the PyArrow case. I am using an abfss URL when writing to Azure Gen 2.
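A rough sketch of such a reproduction is below; the account, container, credential, and column names are placeholders, and the row and partition counts are only illustrative:

    import numpy as np
    import pandas as pd
    from deltalake import write_deltalake

    # Illustrative data: millions of rows spread across ~30k partition values.
    n_rows = 5_000_000
    df = pd.DataFrame({
        "partition_col": np.random.randint(0, 30_000, size=n_rows),
        "value": np.random.rand(n_rows),
    })

    write_deltalake(
        table_or_uri="abfss://container@account.dfs.core.windows.net/table",
        data=df,
        storage_options={
            "account_name": "account",  # placeholder credentials
            "account_key": "<key>",
        },
        mode="overwrite",
        partition_by=["partition_col"],
        engine="rust",
    )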

@Josh-Hiz Josh-Hiz added the bug Something isn't working label Jun 30, 2024
@Josh-Hiz
Author

After further investigating, whether this error occurs actually depends on the column I chose to partition the table by. One of the partition schemes I tried produced 267 partitions in total; the next scheme produced over 30k. Why is my choice of partition_by affecting this? It should be error-free regardless of the number of partitions or the time I need to wait.

@ion-elgreco
Collaborator

You can pass in storage_options: {"timeout": "120s"}
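For reference, a minimal sketch of that suggestion (account name and key are placeholders; "timeout" is a client option string forwarded to the underlying object store client):

    from deltalake import write_deltalake

    storage_options = {
        "account_name": "account",  # placeholder credentials
        "account_key": "<key>",
        "timeout": "120s",  # raise the per-request timeout (the error above shows ~30s)
    }

    write_deltalake(
        table_or_uri="abfss://container@account.dfs.core.windows.net/table",
        data=data,
        storage_options=storage_options,
        mode="overwrite",
        partition_by=partition_scheme,
        engine="rust",
    )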

@ion-elgreco ion-elgreco added question Further information is requested and removed bug Something isn't working labels Jun 30, 2024
@Josh-Hiz
Author

You can pass in storage_options: {"timeout": "120s"}

The error still persists depending on the partition chosen.

@Josh-Hiz
Author

After looking further into the data, I do not believe the data itself has anything to do with this error @ion-elgreco; the number of partitions might be the issue instead. Assuming 30k+ partitions, can deltalake even handle that? If so, is there a possibility that write_deltalake is trying to write to partitions before even creating the partition folders?

@Josh-Hiz
Author

Josh-Hiz commented Jul 1, 2024

@ion-elgreco Would it be an issue if I partition by a timestamp column?

@Josh-Hiz
Author

Josh-Hiz commented Jul 2, 2024

ValueError: Incorrect array length for StructArray field "column_name", expected 40000 got 39999

This is another error that frequently occurs when operating on large data with Azure.

@rtyler rtyler added the bug Something isn't working label Jul 5, 2024
@rtyler
Member

rtyler commented Jul 5, 2024

Calling write_deltalake() with a URL means that the writer has to first read the transaction log and open the table. Are you able to construct a DeltaTable() reliably on this URL? I'm still uncertain how we might write a test reproduction case here.
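A quick way to check that, roughly (the URL and credentials are placeholders):

    from deltalake import DeltaTable

    # If this succeeds, the transaction log can be opened from the abfss URL.
    dt = DeltaTable(
        "abfss://container@account.dfs.core.windows.net/table",
        storage_options={"account_name": "account", "account_key": "<key>"},
    )
    print(dt.version(), len(dt.files()))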

@Josh-Hiz
Author

Josh-Hiz commented Jul 5, 2024

Calling write_deltalake() with a URL means that the writer has to first read the transaction log and open the table. Are you able to construct a DeltaTable() reliably on this URL? I'm still uncertain how we might write a test reproduction case here.

Yes, constructing the DeltaTable works fine.

@Josh-Hiz
Author

Josh-Hiz commented Jul 10, 2024

@rtyler I have come up with a workaround in the meantime. It seems like (at least with the data I have) deltalake simply can't handle writing all the partitions at once, so I write in batches: I take a few thousand dataframes at a time (4096 is what I used for testing), pd.concat them, write them as one batch, then move on to the next batch, and so on. This is fast enough for my purposes. I am not sure whether deltalake can handle writing 10k+ partitions at a time, as writing 31k resulted in the error I was originally getting.
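Roughly, that batching approach looks like the sketch below, assuming a list of small dataframes named frames plus the same placeholder URL, storage_options, and partition_scheme as above:

    import pandas as pd
    from deltalake import write_deltalake

    BATCH_SIZE = 4096  # number of small dataframes concatenated per write

    for start in range(0, len(frames), BATCH_SIZE):
        batch = pd.concat(frames[start:start + BATCH_SIZE])
        write_deltalake(
            table_or_uri="abfss://container@account.dfs.core.windows.net/table",
            data=batch,
            storage_options=storage_options,
            # the first batch replaces the table, later batches are appended
            mode="overwrite" if start == 0 else "append",
            partition_by=partition_scheme,
            engine="rust",
        )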
