[Performance] Run live preview inference on a cudastream #15837

drhead · 2024-05-19T00:05:33Z

Description

I have vastly improved the performance of live preview by making two changes:

running the decode on its own CUDA stream which lets operations parallelize
making the required DtoH transfer non-blocking

As a result, live preview (at least with TAESD) is basically free now. On a 150 step, 512x512, batch 4 inference, it takes 25.1s to complete in total without live preview, and 25.7s to complete with live preview happening as often as it can (100ms delay, every step. In practice it doesn't actually preview that fast, but I find it hard to imagine that this isn't fast enough for almost anyone.)

In its current state, ~~I am about as certain as I can be that this will cause problems for anyone who isn't using an NVIDIA card!~~ it shouldn't run at all on CPU, but I know torch with AMD backends still calls things "cuda" so I would love feedback on how well it works on different hardware, to see if attempting to use CUDA streams on AMD makes your card explode or something, so that something can be done about that.

Screenshots/videos:

Delicious compute overlap:

Stream 7 is the main/default cudastream, stream 13 is the live preview cudastream.

Checklist:

I have read contributing wiki page
I have performed a self-review of my own code
My code follows the style guidelines
My code passes tests

light-and-ray · 2024-05-19T22:10:53Z

But will it slow down the main generation?

drhead · 2024-05-19T22:15:30Z

But will it slow down the main generation?

Having live preview enabled will always slightly slow down the main generation since it involves doing more work compared to not having live preview on. This implementation should have a much lower reduction in performance, since it not only ensures that live preview never blocks the main generation, but it is also able to overlap its compute to an extent.

gel-crabs · 2024-05-19T22:22:39Z

It works on AMD! I've used cudastreams before with CuPy, so the support is there, and the CUDA checks will work on AMD.

You will have to set the "Show previews of all images generated in a batch as a grid" in the Live Previews settings to on, or else you will get a 'Tensor' object has no attribute 'save' error (not exclusive to AMD).

However, I intermittently get purple images, noise images, etc. as live previews during generation, in between normal live previews. Forcing non_blocking to false in sd_samplers_common fixes this (I assume at a cost to performance).

It may have something to do with the other performance patches I've applied recently; can anyone else reproduce this?

drhead · 2024-05-19T22:37:00Z

It works on AMD! I've used cudastreams before with CuPy, so the support is there, and the CUDA checks will work on AMD.

You will have to set the "Show previews of all images generated in a batch as a grid" in the Live Previews settings to on, or else you will get a 'Tensor' object has no attribute 'save' error (not exclusive to AMD).

However, I intermittently get purple images, noise images, etc. as live previews during generation, in between normal live previews. Forcing non_blocking to false in sd_samplers_common fixes this (I assume at a cost to performance).

It may have something to do with the other performance patches I've applied recently; can anyone else reproduce this?

So that's where that function is used... I didn't test single image generation at all, sorry.

Please tell me if the commit I just made fixes both problems. If it doesn't, then that means there's a serious problem with non-blocking on AMD that needs further investigation.

gel-crabs · 2024-05-19T23:25:34Z

It works on AMD! I've used cudastreams before with CuPy, so the support is there, and the CUDA checks will work on AMD.
You will have to set the "Show previews of all images generated in a batch as a grid" in the Live Previews settings to on, or else you will get a 'Tensor' object has no attribute 'save' error (not exclusive to AMD).
However, I intermittently get purple images, noise images, etc. as live previews during generation, in between normal live previews. Forcing non_blocking to false in sd_samplers_common fixes this (I assume at a cost to performance).
It may have something to do with the other performance patches I've applied recently; can anyone else reproduce this?

So that's where that function is used... I didn't test single image generation at all, sorry.

Please tell me if the commit I just made fixes both problems. If it doesn't, then that means there's a serious problem with non-blocking on AMD that needs further investigation.

The issue with single image generation is fixed!

The issue with intermittent purple/noisy images is still there, but less frequent. It turns out that removing --disable-nan-check from my startup file fixes it, though. Do you have the NaN check disabled?

drhead · 2024-05-19T23:28:19Z

I do have it disabled. disable_nan_check would be causing a forced sync. Whatever is happening indicates that synchronizing the cuda stream doesn't do what it should on AMD.

gel-crabs · 2024-05-20T00:19:06Z

I do have it disabled. disable_nan_check would be causing a forced sync. Whatever is happening indicates that synchronizing the cuda stream doesn't do what it should on AMD.

I really should've taken into account that my setup is unsuitable for testing right now (I've started getting actual memory errors due to hot weather and a dirt-cheap mobo).

Since this issue involves synchronization with the CPU/memory, it could very well just be that. This issue seems to only start after my fans spin up.

I'm going to test this again in a few days after I rebuild my computer. Sorry for leading you on a wild goose chase if this turns out to be an issue on my end.

drhead · 2024-05-20T08:35:02Z

On another note, I do think this needs to include some sort of forced maximum interval between live preview updates as an option. Having it fully async like this is amazing when it can keep up (even though it trends towards providing less than the user may have asked for), but that won't always be the case for every system.

wfjsw · 2024-05-21T03:16:25Z

nvm it is my fault; I somehow dropped an s when applying the patches.

It does error out, but the error is not in the console. It ends up in sysinfo endpoint

Soulreaver90 · 2024-05-21T09:41:39Z

I run a AMD rx 6700xt. I've swapped between this and what's in production, and noticed my speed goes down quite a bit using this commit. At 1024x with SDXL, I get a avg of 1.10s/it, but with this commit it drops to 1.50s/it, slower by 40s. I haven't tested much else, but it doesn't improve performance for me unless I am missing something.

gel-crabs · 2024-05-21T15:32:24Z

I run a AMD rx 6700xt. I've swapped between this and what's in production, and noticed my speed goes down quite a bit using this commit. At 1024x with SDXL, I get a avg of 1.10s/it, but with this commit it drops to 1.50s/it, slower by 40s. I haven't tested much else, but it doesn't improve performance for me unless I am missing something.

Did you run multiple generations each time you tested?

drhead · 2024-05-21T16:23:47Z

I run a AMD rx 6700xt. I've swapped between this and what's in production, and noticed my speed goes down quite a bit using this commit. At 1024x with SDXL, I get a avg of 1.10s/it, but with this commit it drops to 1.50s/it, slower by 40s. I haven't tested much else, but it doesn't improve performance for me unless I am missing something.

Do you have the other performance PR patches applied? The changes might depend on that to a degree, and I exclusively tested on the assumption that this is building on top of the other patches.

If you do this, and it still shows a performance regression, even after trying multiple times, I would like for you to run profiling and figure out a way to get me the file (it might end up being somewhere around 200MB). Having profiling data for an AMD card would be extremely helpful. I have been doing it by wrapping the main processing loop in processing.py around line 980:

            with devices.without_autocast() if devices.unet_needs_upcast else devices.autocast():
                from torch.profiler import profile, record_function, ProfilerActivity
                with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                        record_shapes=True,
                        profile_memory=True, # Track memory allocation
                        with_stack=True,
                        with_flops=True) as prof:
                    with record_function("model_inference"):
                        samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)

                # Print profiling results
                print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
                print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=20))
                # Export to Chrome trace format
                prof.export_chrome_trace("trace_livepreview.json")

Apply this code, and start WebUI. Run a 10 step inference to let it warm up and get all of the first time operations out of the way (they make the profile unnecessarily large and hard to read). Once that is done, run a 20 step inference, making sure that it actually shows you live previews in that time span. You will overwrite the first profile when you do this.

This will give a printout of the top 20 operators by self time on both CPU and GPU, and export a trace file that can be opened on http://ui.perfetto.dev/. The profile shouldn't contain personal information (I chose the chrome trace even though Tensorboard traces contain more useful info like SM utilization, since tensorboard traces will show your hostname), but you should check out the trace on Perfetto yourself to verify.

Soulreaver90 · 2024-05-21T16:47:25Z

I run a AMD rx 6700xt. I've swapped between this and what's in production, and noticed my speed goes down quite a bit using this commit. At 1024x with SDXL, I get a avg of 1.10s/it, but with this commit it drops to 1.50s/it, slower by 40s. I haven't tested much else, but it doesn't improve performance for me unless I am missing something.

Do you have the other performance PR patches applied? The changes might depend on that to a degree, and I exclusively tested on the assumption that this is building on top of the other patches.

If you do this, and it still shows a performance regression, even after trying multiple times, I would like for you to run profiling and figure out a way to get me the file (it might end up being somewhere around 200MB). Having profiling data for an AMD card would be extremely helpful. I have been doing it by wrapping the main processing loop in processing.py around line 980:
            with devices.without_autocast() if devices.unet_needs_upcast else devices.autocast():
                from torch.profiler import profile, record_function, ProfilerActivity
                with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                        record_shapes=True,
                        profile_memory=True, # Track memory allocation
                        with_stack=True,
                        with_flops=True) as prof:
                    with record_function("model_inference"):
                        samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)

                # Print profiling results
                print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
                print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=20))
                # Export to Chrome trace format
                prof.export_chrome_trace("trace_livepreview.json")
Apply this code, and start WebUI. Run a 10 step inference to let it warm up and get all of the first time operations out of the way (they make the profile unnecessarily large and hard to read). Once that is done, run a 20 step inference, making sure that it actually shows you live previews in that time span. You will overwrite the first profile when you do this.

This will give a printout of the top 20 operators by self time on both CPU and GPU, and export a trace file that can be opened on http://ui.perfetto.dev/. The profile shouldn't contain personal information (I chose the chrome trace even though Tensorboard traces contain more useful info like SM utilization, since tensorboard traces will show your hostname), but you should check out the trace on Perfetto yourself to verify.

I did not test this with the performance patches, I had actually tested them separately. I am cloning a clean instance for testing and will apply this ontop of the performance patches.

gel-crabs · 2024-05-26T23:25:51Z

It had nothing to do with my memory, same issues.

drhead · 2024-05-28T17:26:09Z

It had nothing to do with my memory, same issues.

In my own testing lately I have noticed a few issues with occasional noise outputs, but they're infrequent. I will probably want to make this a toggleable option if I can't eliminate these with more careful syncs.

gel-crabs · 2024-05-29T14:45:34Z

It had nothing to do with my memory, same issues.

In my own testing lately I have noticed a few issues with occasional noise outputs, but they're infrequent. I will probably want to make this a toggleable option if I can't eliminate these with more careful syncs.

Ahh. If you have any ideas for places in the file where the stream sync can be put, I'd be glad to help out as they're more frequent on my machine/setup.

Another idea may be to put the sync in progress.py, as that's where the live preview itself is updated.

FurkanGozukara · 2024-06-05T14:55:01Z

i just tested this branch on ubuntu A6000 GPU

still way slower than ComfyUI or Forge

API • Github • Gradio • Startup profile • Reload UI
version: v1.9.3-6-g27e35f13 • python: 3.10.12 • torch: 2.1.2+cu121 • xformers: 0.0.23.post1 • gradio: 3.41.2 • checkpoint: 912c9dc74f

Automatic1111 vs Forge vs ComfyUI on our Massed Compute VM image - 3.63 it vs 4.9 it vs 5.35 it

AUTOMATIC1111 · 2024-06-08T09:57:01Z

This changes single_sample_to_image to return a tensor instead of PIL image; if some extension is using it, that extension will break.

Is this only useful for grids, or for single images too?

And, well, if this adds noise, then this shouldn't be added at all.

drhead · 2024-06-08T16:03:34Z

This changes single_sample_to_image to return a tensor instead of PIL image; if some extension is using it, that extension will break.

Is this only useful for grids, or for single images too?

And, well, if this adds noise, then this shouldn't be added at all.

The noise outputs in question are only on the live previews, and in my experiences only on one image usually for one live preview frame. Final outputs shouldn't be at any risk of being noisy, I haven't experienced that and people I see with issues are all saying it is specifically the live previews only.

The issue appears to be that under some circumstances the image tensors which were sent from device to host as non-blocking transfers are accessed before they are actually ready on device, so you get a preview of whatever happened to be in memory in the space that was reserved for that tensor.

I want to debug this more when I have time and see if I can find a more reliable solution for non-blocking DtoH transfers, and also make the cudastream function optional and definitely non-default if it causes malformed previews. I will also modify the existing functions for backwards compatibility as requested. For now I will mark this as a draft.

drhead and others added 4 commits May 18, 2024 19:56

Live previews run on cudastream

387bcd8

Lint

72c5966

Only use cudastreams for live preview when cuda available

044494d

the code runs better when you import things you need

4eb7cb4

drhead marked this pull request as ready for review May 19, 2024 01:03

drhead requested a review from AUTOMATIC1111 as a code owner May 19, 2024 01:03

handle non blocking better and case of single image

27e35f1

This comment was marked as outdated.

Sign in to view

This comment was marked as resolved.

Sign in to view

This comment was marked as outdated.

Sign in to view

This comment was marked as resolved.

Sign in to view

This comment was marked as outdated.

Sign in to view

drhead marked this pull request as draft June 8, 2024 16:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Performance] Run live preview inference on a cudastream #15837

[Performance] Run live preview inference on a cudastream #15837

drhead commented May 19, 2024 •

edited

Loading

light-and-ray commented May 19, 2024

drhead commented May 19, 2024

gel-crabs commented May 19, 2024

drhead commented May 19, 2024

gel-crabs commented May 19, 2024

drhead commented May 19, 2024

gel-crabs commented May 20, 2024 •

edited

Loading

This comment was marked as outdated.

This comment was marked as resolved.

This comment was marked as outdated.

This comment was marked as resolved.

drhead commented May 20, 2024

This comment was marked as outdated.

wfjsw commented May 21, 2024

Soulreaver90 commented May 21, 2024

gel-crabs commented May 21, 2024

drhead commented May 21, 2024

Soulreaver90 commented May 21, 2024

gel-crabs commented May 26, 2024

drhead commented May 28, 2024

gel-crabs commented May 29, 2024

FurkanGozukara commented Jun 5, 2024

AUTOMATIC1111 commented Jun 8, 2024

drhead commented Jun 8, 2024

[Performance] Run live preview inference on a cudastream #15837

Are you sure you want to change the base?

[Performance] Run live preview inference on a cudastream #15837

Conversation

drhead commented May 19, 2024 • edited Loading

Description

Screenshots/videos:

Checklist:

light-and-ray commented May 19, 2024

drhead commented May 19, 2024

gel-crabs commented May 19, 2024

drhead commented May 19, 2024

gel-crabs commented May 19, 2024

drhead commented May 19, 2024

gel-crabs commented May 20, 2024 • edited Loading

This comment was marked as outdated.

This comment was marked as resolved.

This comment was marked as outdated.

This comment was marked as resolved.

drhead commented May 20, 2024

This comment was marked as outdated.

wfjsw commented May 21, 2024

Soulreaver90 commented May 21, 2024

gel-crabs commented May 21, 2024

drhead commented May 21, 2024

Soulreaver90 commented May 21, 2024

gel-crabs commented May 26, 2024

drhead commented May 28, 2024

gel-crabs commented May 29, 2024

FurkanGozukara commented Jun 5, 2024

AUTOMATIC1111 commented Jun 8, 2024

drhead commented Jun 8, 2024

drhead commented May 19, 2024 •

edited

Loading

gel-crabs commented May 20, 2024 •

edited

Loading