FileWatching.mkpidlock by multiple Julia instances in parallel leads to 10 seconds delay #55038

aw32 · 2024-07-05T12:51:37Z

Description

The FileWatching.mkpidlock function appears to introduce a 10-second delay when enough Julia processes try to acquire the same pid file at the same time.
Minimal test code:

for i in {1..10}; do julia -e "using FileWatching; @time FileWatching.mkpidlock(\"testwaitfile.pid\") do ; end" & done

Example output:

  0.036439 seconds (21.68 k allocations: 1.112 MiB, 64.32% compilation time)
  0.037288 seconds (21.68 k allocations: 1.112 MiB, 62.75% compilation time)
  0.037221 seconds (21.68 k allocations: 1.112 MiB, 63.32% compilation time)
  0.049062 seconds (21.68 k allocations: 1.112 MiB, 67.08% compilation time)
  0.036490 seconds (21.68 k allocations: 1.112 MiB, 63.47% compilation time)
  0.035922 seconds (21.68 k allocations: 1.112 MiB, 65.30% compilation time)
  0.035703 seconds (21.68 k allocations: 1.112 MiB, 64.55% compilation time)
  0.040621 seconds (21.79 k allocations: 1.117 MiB, 65.54% compilation time)
  0.037188 seconds (21.68 k allocations: 1.112 MiB, 63.14% compilation time)
 10.076159 seconds (27.64 k allocations: 1.417 MiB, 0.46% compilation time)

The number of instances needed to trigger this, and the number of instances that end up delayed, depend on the filesystem and its speed. On faster filesystems, try more instances, e.g. 50 or 100. This may hint at a race condition.
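
To vary the instance count more easily, the shell loop can be replaced by a small Julia launcher. This is only a sketch: the function name run_instances is hypothetical, and it assumes the child processes should use the same Julia binary as the launcher (via Base.julia_cmd()).

```julia
# Hypothetical helper: start `n` Julia processes that all contend for the
# same pid file, and wait until every one of them has finished.
function run_instances(n::Integer; pidfile::AbstractString = "testwaitfile.pid")
    # Same statement as in the minimal test code above.
    child_code = "using FileWatching; @time FileWatching.mkpidlock(\"$pidfile\") do ; end"
    cmd = `$(Base.julia_cmd()) -e $child_code`
    # With wait=false the child I/O goes to devnull by default, so forward
    # stdout/stderr explicitly to see the @time output of each instance.
    procs = [run(pipeline(cmd; stdout = stdout, stderr = stderr); wait = false)
             for _ in 1:n]
    foreach(wait, procs)
    return procs
end
```

Calling run_instances(50) or run_instances(100) then corresponds to the larger runs suggested for fast filesystems.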

Expected behavior

We expect that an instance trying to acquire the pid file does not wait 10 seconds after the lock has been released.

Background

This problem first came up when starting multiple Julia instances for an MPI run on an HPC cluster, launched as a SLURM job.
The script contained code to activate an environment:

using Pkg;
Pkg.activate(".");

This caused the startup delay to accumulate across the instances and increased the total runtime:

0.3087480068206787
10.339919805526733
20.351145029067993
30.371562957763672
40.39089107513428
50.40513586997986
60.42451500892639
70.41116309165955
80.43787407875061
90.45805597305298

We work around the problem by passing the --project argument instead of calling Pkg.activate in the script; the problem does not occur with --project. We also traced the delay to the use of mkpidlock on the manifest_usage.toml.pid file.

Julia versions

The behavior was reproduced with Julia 1.10.4 and Julia 1.11.0-rc1 using the official builds from the website.

sgaure · 2024-07-05T13:32:38Z

This happens if the file watcher (FileWatching.watch_file) in the OS fails for some reason; whether it does varies with the OS, the file system, and so on. mkpidlock then falls back to polling. The default poll interval is 10 seconds. It can be changed with e.g. mkpidlock(..., poll_interval=1.0).
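
As a minimal sketch of that suggestion, using the pid file name from the test code above (the 1.0 second value is just the example interval mentioned here, not a recommendation):

```julia
using FileWatching

# poll_interval is given in seconds. It only matters when the file watcher
# fails and mkpidlock has to poll for the lock instead.
elapsed = @elapsed FileWatching.mkpidlock("testwaitfile.pid"; poll_interval = 1.0) do
    # critical section: runs while this process holds the lock
end
println("lock acquired and released after ", elapsed, " seconds")
```

With a shorter interval, a waiting instance that has fallen back to polling should notice a released lock sooner, at the cost of checking the file more often.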
