[Performance] Run live preview inference on a cudastream #15837
base: dev
Conversation
But will it slow down the main generation?
Having live preview enabled will always slightly slow down the main generation, since it involves doing more work than not having live preview on. This implementation should cost far less, since it not only ensures that live preview never blocks the main generation, but also overlaps its compute to an extent.
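A minimal sketch of the general pattern being described, assuming a placeholder `decode_preview` for whatever preview decoder (such as TAESD) is in use; this is not the PR's exact code:

```python
import torch

preview_stream = torch.cuda.Stream()  # secondary stream, separate from sampling

def queue_preview(latent: torch.Tensor, decode_preview):
    # Let the preview stream wait until the default stream has finished
    # producing the latent before it starts reading it.
    preview_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(preview_stream):
        # Clone so the sampler is free to overwrite the original latent.
        snapshot = latent.detach().clone()
        image = decode_preview(snapshot)
        # Async device-to-host copy; the consumer must sync preview_stream
        # (or an event recorded on it) before reading host_image.
        host_image = image.to("cpu", non_blocking=True)
    return host_image
```

With this arrangement the decode and copy run on the side stream while the sampler keeps issuing work on the default stream, which is where the overlap comes from.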
It works on AMD! I've used CUDA streams before with CuPy, so the support is there, and the CUDA checks will work on AMD. You will have to enable "Show previews of all images generated in a batch as a grid" in the Live Previews settings, or else you will get a 'Tensor' object has no attribute 'save' error (not exclusive to AMD). However, I intermittently get purple images, noise images, etc. as live previews during generation, in between normal live previews. Forcing non_blocking to false in sd_samplers_common fixes this (I assume at a cost to performance). It may have something to do with the other performance patches I've applied recently; can anyone else reproduce this?
So that's where that function is used... I didn't test single image generation at all, sorry. Please tell me whether the commit I just made fixes both problems. If it doesn't, that means there's a serious problem with non-blocking on AMD that needs further investigation.
The issue with single image generation is fixed! The issue with intermittent purple/noisy images is still there, but less frequent. It turns out that removing --disable-nan-check from my startup file fixes it, though. Do you have the NaN check disabled?
I do have it disabled. disable_nan_check would be causing a forced sync. Whatever is happening indicates that synchronizing the cuda stream doesn't do what it should on AMD.
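A hedged illustration of why such a check acts as a sync point: converting a GPU-side result into a Python value blocks the CPU until the queued GPU work it depends on has finished.

```python
import torch

x = torch.randn(4, 4, device="cuda")
has_nan = torch.isnan(x).any()  # still an asynchronous GPU op at this point
if has_nan.item():              # .item() forces the device to catch up (a sync)
    raise RuntimeError("NaN detected")
```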
I really should've taken into account that my setup is unsuitable for testing right now (I've started getting actual memory errors due to hot weather and a dirt-cheap mobo). Since this issue involves synchronization with the CPU/memory, it could very well just be that. This issue seems to only start after my fans spin up. I'm going to test this again in a few days after I rebuild my computer. Sorry for leading you on a wild goose chase if this turns out to be an issue on my end.
On another note, I do think this needs to include some sort of forced maximum interval between live preview updates as an option. Having it fully async like this is amazing when it can keep up (even though it trends towards providing less than the user may have asked for), but that won't always be the case for every system.
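A sketch of what such an option could look like; the names here are hypothetical and not part of this PR:

```python
import time
import torch

class PreviewThrottle:
    """Force a preview at least every `max_interval` seconds."""

    def __init__(self, max_interval: float = 2.0):
        self.max_interval = max_interval
        self.last_update = time.monotonic()

    def mark_updated(self):
        # Call whenever an async preview frame actually reaches the UI.
        self.last_update = time.monotonic()

    def maybe_force_sync(self, preview_stream: torch.cuda.Stream):
        # If the async path has not kept up, block until the pending
        # preview work finishes so the user still sees progress.
        if time.monotonic() - self.last_update > self.max_interval:
            preview_stream.synchronize()
            self.mark_updated()
```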
nvm, it is my fault; I somehow dropped an [...]. It does error out, but the error is not in the console; it ends up in the sysinfo endpoint.
I run an AMD RX 6700 XT. I've swapped between this and what's in production, and noticed my speed goes down quite a bit using this commit. At 1024x with SDXL, I get an average of 1.10 s/it, but with this commit it drops to 1.50 s/it, about 0.40 s/it slower. I haven't tested much else, but it doesn't improve performance for me unless I am missing something.
Did you run multiple generations each time you tested?
Do you have the other performance PR patches applied? The changes might depend on that to a degree, and I exclusively tested on the assumption that this is building on top of the other patches. If you do this and it still shows a performance regression, even after trying multiple times, I would like you to run profiling and figure out a way to get me the file (it might end up being somewhere around 200MB). Having profiling data for an AMD card would be extremely helpful. I have been doing it by wrapping the main processing loop in processing.py, around line 980:

```python
with devices.without_autocast() if devices.unet_needs_upcast else devices.autocast():
    from torch.profiler import profile, record_function, ProfilerActivity
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                 record_shapes=True,
                 profile_memory=True,  # track memory allocation
                 with_stack=True,
                 with_flops=True) as prof:
        with record_function("model_inference"):
            samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)

    # Print profiling results
    print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
    print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=20))

    # Export to Chrome trace format
    prof.export_chrome_trace("trace_livepreview.json")
```

Apply this code and start WebUI. Run a 10-step inference to let it warm up and get all of the first-time operations out of the way (they make the profile unnecessarily large and hard to read). Once that is done, run a 20-step inference, making sure that it actually shows you live previews in that time span. You will overwrite the first profile when you do this. This will print the top 20 operators by self time on both CPU and GPU, and export a trace file that can be opened at http://ui.perfetto.dev/. The profile shouldn't contain personal information (I chose the Chrome trace format even though TensorBoard traces contain more useful info like SM utilization, since TensorBoard traces show your hostname), but you should check out the trace on Perfetto yourself to verify.
I did not test this with the performance patches; I had actually tested them separately. I am cloning a clean instance for testing and will apply this on top of the performance patches.
It had nothing to do with my memory; same issues.
In my own testing lately I have noticed a few issues with occasional noise outputs, but they're infrequent. I will probably want to make this a toggleable option if I can't eliminate these with more careful syncs.
Ahh. If you have any ideas for places in the file where the stream sync can be put, I'd be glad to help out, as the noise outputs are more frequent on my machine/setup. Another idea may be to put the sync in progress.py, as that's where the live preview itself is updated.
I just tested this branch on Ubuntu with an A6000 GPU; it is still way slower than ComfyUI or Forge. Automatic1111 vs Forge vs ComfyUI on our Massed Compute VM image: 3.63 it/s vs 4.9 it/s vs 5.35 it/s.
This changes [...]. Is this only useful for grids, or for single images too? And, well, if this adds noise, then this shouldn't be added at all.
The noise outputs in question are only on the live previews, and in my experience usually only on one image for one live preview frame. Final outputs shouldn't be at any risk of being noisy; I haven't experienced that, and the people I see with issues are all saying it is specifically the live previews. The issue appears to be that, under some circumstances, the image tensors which were sent from device to host as non-blocking transfers are accessed before they have actually arrived on the host, so you get a preview of whatever happened to be in memory in the space that was reserved for that tensor. I want to debug this more when I have time and see if I can find a more reliable solution for non-blocking DtoH transfers, and also make the cudastream function optional and definitely non-default if it causes malformed previews. I will also modify the existing functions for backwards compatibility as requested. For now I will mark this as a draft.
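A hedged sketch of one safer pattern for non-blocking device-to-host transfers, assuming a pinned staging buffer and a CUDA event (none of these names are from the PR): the consumer checks the event before touching the host tensor, so it never reads a half-filled buffer.

```python
import torch

preview_stream = torch.cuda.Stream()
copy_done = torch.cuda.Event()

gpu_image = torch.rand(3, 64, 64, device="cuda")  # stand-in for a decoded preview
# Pinned host memory is required for the copy to be truly asynchronous.
host_image = torch.empty_like(gpu_image, device="cpu").pin_memory()

# Producer: enqueue the async copy on the preview stream, then mark it with an event.
with torch.cuda.stream(preview_stream):
    host_image.copy_(gpu_image, non_blocking=True)
    copy_done.record(preview_stream)

# Consumer (e.g. wherever the frame is handed to the live preview UI):
if copy_done.query():    # non-blocking: has the copy finished?
    frame = host_image   # safe to read now
else:
    frame = None         # keep showing the previous preview frame
```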
Description
I have vastly improved the performance of live preview by making two changes:
- running live preview inference on its own CUDA stream, so it never blocks the main generation
- making the device-to-host transfers of preview tensors non-blocking
As a result, live preview (at least with TAESD) is basically free now. On a 150-step, 512x512, batch 4 inference, it takes 25.1s to complete without live preview, and 25.7s with live preview happening as often as it can (100ms delay, every step; in practice it doesn't actually preview that fast, but I find it hard to imagine that this isn't fast enough for almost anyone).
In its current state, I am about as certain as I can be that this will cause problems for anyone who isn't using an NVIDIA card! It shouldn't run at all on CPU, but I know torch with AMD backends still calls things "cuda", so I would love feedback on how well it works on different hardware, to see if attempting to use CUDA streams on AMD makes your card explode or something, so that something can be done about that.

Screenshots/videos:
Delicious compute overlap:
Stream 7 is the main/default cudastream; stream 13 is the live preview cudastream.
Checklist: