[FEA] Add a low-memory JSON lines reader option based on byte range reads #16122

Closed · Opened by GregoryKimball on Jun 27, 2024 · 0 comments · Fixed by #16204

Labels: cuDF (Python), cuIO, feature request


GregoryKimball commented Jun 27, 2024

The cudf JSON reader has a large memory footprint: it was around 8x the input size in 23.12 and has exceeded 15x in testing for 24.08. This makes JSON reading very difficult in memory-constrained environments. Let's add a low-memory mode to the JSON lines reader based on byte-range support.

Here is an experiment that yields the same dataframe, but reads the data in 100 MB chunks:

import cudf
import rmm
import nvtx

# Build a test dataframe (~480 MB of string data) and write it as JSON lines
df = cudf.DataFrame({
    'a': ['aaaa'] * 1_000_000,
    'b': ['bbbb'] * 1_000_000,
    'c': ['cccc'] * 1_000_000,
})
df = cudf.concat([df] * 20, ignore_index=True)
file_path = '/raid/file.jsonl'
df.to_json(file_path, lines=True, orient='records', engine='cudf')


# Track peak device memory with a statistics adaptor
mr = rmm.mr.CudaMemoryResource()
mr = rmm.mr.StatisticsResourceAdaptor(mr)
rmm.mr.set_current_device_resource(mr)

with nvtx.annotate('single read'):
    df = cudf.read_json(file_path, lines=True)

df_size_mb = df.memory_usage(deep=True).sum() / 1e6
print(f"Dataframe size: {df_size_mb:.2f} MB")
peak_mb = mr.allocation_counts.peak_bytes / 1e6
print(f"Peak memory: {peak_mb:.2f} MB")

# Reset peak-memory tracking before the chunked read
mr = rmm.mr.CudaMemoryResource()
mr = rmm.mr.StatisticsResourceAdaptor(mr)
rmm.mr.set_current_device_resource(mr)

with nvtx.annotate('byte_range read'):
    chunk_size = 100_000_000
    data = []
    x = 0
    while True:
        try:
            d = cudf.read_json(
                file_path,
                lines=True,
                byte_range=(chunk_size * x, chunk_size)
            )
        except RuntimeError:
            # RuntimeError: CUDF failure at: /opt/conda/conda-bld/work/cpp/src/io/utilities/datasource.cpp:219: Offset is past end of file
            break
        data.append(d)
        x += 1
    df = cudf.concat(data, ignore_index=True)

df_size_mb = df.memory_usage(deep=True).sum() / 1e6
print(f"Dataframe size: {df_size_mb:.2f} MB")
peak_mb = mr.allocation_counts.peak_bytes / 1e6
print(f"Peak memory: {peak_mb:.2f} MB")
Single read:
Dataframe size: 480.00 MB
Peak memory: 7580.00 MB

Byte-range read:
Dataframe size: 480.00 MB
Peak memory: 1494.29 MB

Surprisingly, the chunked reader is even faster.

[image: timing comparison of the single read vs. the byte-range read]

Additional context

If we keep the chunk size below 2 GB, this also gives us large strings support in the JSON reader. I believe we should consider making byte-range based reading the default for cudf.pandas.
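As a rough sketch of how a byte-range-driven read could avoid the try/except loop above, the number of chunks can be computed from the file size up front. The helper name read_json_lines_chunked and the 256 MB default below are placeholders for illustration, not an existing cudf API; only cudf.read_json's byte_range argument is assumed.

import math
import os

import cudf

def read_json_lines_chunked(path, chunk_size=256_000_000):
    # Hypothetical helper: split the file into byte ranges (each well under
    # 2 GB) and concatenate the per-chunk results into one dataframe.
    file_size = os.path.getsize(path)
    num_chunks = math.ceil(file_size / chunk_size)
    parts = [
        cudf.read_json(path, lines=True, byte_range=(i * chunk_size, chunk_size))
        for i in range(num_chunks)
    ]
    return cudf.concat(parts, ignore_index=True)

df = read_json_lines_chunked(file_path)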
