[FEA] Add a low-memory JSON lines reader option based on byte range reads #16122

Closed · Opened by GregoryKimball on Jun 27, 2024 · 0 comments · Fixed by #16204

Labels: cuDF (Python), cuIO, feature request


GregoryKimball commented Jun 27, 2024

The cudf JSON reader has a large memory footprint: it was around 8x the input size in 23.12 and has exceeded 15x in testing for 24.08. This makes JSON reading very difficult in memory-constrained environments. Let's add a low-memory mode to the JSON lines reader based on byte-range support.

Here is an experiment that yields the same dataframe, but reads the data in 100 MB chunks:

import cudf
import rmm
import nvtx

# Build a test dataframe (~480 MB of string data) and write it as JSON lines
df = cudf.DataFrame({
    'a': ['aaaa'] * 1_000_000,
    'b': ['bbbb'] * 1_000_000,
    'c': ['cccc'] * 1_000_000,
})
df = cudf.concat([df] * 20, ignore_index=True)
file_path = '/raid/file.jsonl'
df.to_json(file_path, lines=True, orient='records', engine='cudf')


# Track peak device memory with a statistics adaptor
mr = rmm.mr.CudaMemoryResource()
mr = rmm.mr.StatisticsResourceAdaptor(mr)
rmm.mr.set_current_device_resource(mr)

with nvtx.annotate('single read'):
    df = cudf.read_json(file_path, lines=True)

df_size_mb = df.memory_usage(deep=True).sum() / 1e6
print(f"Dataframe size: {df_size_mb:.2f} MB")
peak_mb = mr.allocation_counts.peak_bytes / 1e6
print(f"Peak memory: {peak_mb:.2f} MB")

# Reset peak-memory tracking before the chunked read
mr = rmm.mr.CudaMemoryResource()
mr = rmm.mr.StatisticsResourceAdaptor(mr)
rmm.mr.set_current_device_resource(mr)

with nvtx.annotate('byte_range read'):
    chunk_size = 100_000_000
    data = []
    x = 0
    while True:
        try:
            d = cudf.read_json(
                file_path,
                lines=True,
                byte_range=(chunk_size * x, chunk_size)
            )
        except RuntimeError:
            # RuntimeError: CUDF failure at: /opt/conda/conda-bld/work/cpp/src/io/utilities/datasource.cpp:219: Offset is past end of file
            break
        data.append(d)
        x += 1
    df = cudf.concat(data, ignore_index=True)

df_size_mb = df.memory_usage(deep=True).sum() / 1e6
print(f"Dataframe size: {df_size_mb:.2f} MB")
peak_mb = mr.allocation_counts.peak_bytes / 1e6
print(f"Peak memory: {peak_mb:.2f} MB")
Single read:
Dataframe size: 480.00 MB
Peak memory: 7580.00 MB

Byte-range read:
Dataframe size: 480.00 MB
Peak memory: 1494.29 MB

Surprisingly, the chunked reader is even faster.

[image: timing comparison of the single read vs. the byte-range read]

Additional context

If we keep the chunk size below 2 GB, this also gives us large strings support in the JSON reader. I believe we should consider making byte-range based reading the default for cudf.pandas.
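As a rough sketch of how a byte-range-driven read could avoid the try/except loop above, the number of chunks can be computed from the file size up front. The helper name read_json_lines_chunked and the 256 MB default below are placeholders for illustration, not an existing cudf API; only cudf.read_json's byte_range argument is assumed.

import math
import os

import cudf

def read_json_lines_chunked(path, chunk_size=256_000_000):
    # Hypothetical helper: split the file into byte ranges (each well under
    # 2 GB) and concatenate the per-chunk results into one dataframe.
    file_size = os.path.getsize(path)
    num_chunks = math.ceil(file_size / chunk_size)
    parts = [
        cudf.read_json(path, lines=True, byte_range=(i * chunk_size, chunk_size))
        for i in range(num_chunks)
    ]
    return cudf.concat(parts, ignore_index=True)

df = read_json_lines_chunked(file_path)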
