
FileWatching.mkpidlock by multiple Julia instances in parallel leads to 10 seconds delay #55038

Open
aw32 opened this issue Jul 5, 2024 · 1 comment

Comments

aw32 commented Jul 5, 2024

Description

The FileWatching.mkpidlock function appears to introduce a 10-second delay when sufficiently many Julia processes try to acquire the same pid file at the same time.
Minimal test code:

for i in {1..10}; do julia -e "using FileWatching; @time FileWatching.mkpidlock(\"testwaitfile.pid\") do ; end" & done

Example output:

  0.036439 seconds (21.68 k allocations: 1.112 MiB, 64.32% compilation time)
  0.037288 seconds (21.68 k allocations: 1.112 MiB, 62.75% compilation time)
  0.037221 seconds (21.68 k allocations: 1.112 MiB, 63.32% compilation time)
  0.049062 seconds (21.68 k allocations: 1.112 MiB, 67.08% compilation time)
  0.036490 seconds (21.68 k allocations: 1.112 MiB, 63.47% compilation time)
  0.035922 seconds (21.68 k allocations: 1.112 MiB, 65.30% compilation time)
  0.035703 seconds (21.68 k allocations: 1.112 MiB, 64.55% compilation time)
  0.040621 seconds (21.79 k allocations: 1.117 MiB, 65.54% compilation time)
  0.037188 seconds (21.68 k allocations: 1.112 MiB, 63.14% compilation time)
 10.076159 seconds (27.64 k allocations: 1.417 MiB, 0.46% compilation time)

The number of instances needed to trigger this, and the number of instances that end up delayed, depend on the filesystem and its speed. On faster filesystems, try more instances, e.g. 50 or 100. This may hint at a race condition.
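Because the required instance count varies, it can help to make it a parameter. The following is only a sketch of the same reproduction driven from Julia: `spawn_lockers` and its `lockfile` keyword are hypothetical names introduced here, and each child is assumed to run the same `FileWatching.mkpidlock` call as the shell one-liner above.

```julia
# Hypothetical helper (not part of FileWatching): start `n` Julia processes
# that all try to acquire the same pid file, and return the time each spent
# in mkpidlock, in seconds.
function spawn_lockers(n::Integer; lockfile::AbstractString = "testwaitfile.pid")
    # Script run by every child: time an empty critical section and print it.
    script = """
        using FileWatching
        t = @elapsed FileWatching.mkpidlock(() -> nothing, $(repr(lockfile)))
        println(t)
        """
    cmd = `$(Base.julia_cmd()) --startup-file=no -e $script`
    # Start all children before reading any output so that they contend.
    procs = [open(cmd, "r") for _ in 1:n]
    return [parse(Float64, strip(read(p, String))) for p in procs]
end

times = spawn_lockers(10)
foreach(println, sort(times))
```

Raising the argument to 50 or 100 corresponds to the suggestion above for faster filesystems. Unlike `@time` in the one-liner, `@elapsed` reports only wall-clock time, which is enough to spot an outlier of about 10 seconds.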

Expected behavior

We expect an instance that is waiting for the pid file to acquire it promptly once it has been released, rather than waiting up to 10 seconds.

Background

This problem first came up when we started multiple Julia instances for an MPI run on an HPC cluster, launched as a SLURM job.
The script contained code to activate an environment:

using Pkg;
Pkg.activate(".");

As a result, startup delays accumulated across the instances and the total runtime increased:

0.3087480068206787
10.339919805526733
20.351145029067993
30.371562957763672
40.39089107513428
50.40513586997986
60.42451500892639
70.41116309165955
80.43787407875061
90.45805597305298

We work around this by passing the --project argument instead of calling Pkg.activate in the script; the problem does not occur with --project. We also traced the delay to the use of mkpidlock on the manifest_usage.toml.pid file.

Julia versions

The behavior was reproduced with Julia 1.10.4 and Julia 1.11.0-rc1 using the official builds from the website.

sgaure commented Jul 5, 2024

This happens when the file watcher (FileWatching.watch_file) in the OS fails for some reason, which varies with the OS, file system, etc. mkpidlock then falls back to polling. The default poll interval is 10 seconds. It can be changed with, e.g., mkpidlock(..., poll_interval=1.0).
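A minimal sketch of that workaround, assuming you call mkpidlock yourself and control its arguments. The lock path below is a hypothetical temporary file used only for illustration, and the 1-second interval is just the value from the comment above, not a recommendation.

```julia
using FileWatching

# Hypothetical lock path for this illustration only.
lockfile = joinpath(mktempdir(), "example.pid")

# If the file watcher does not deliver a notification, a waiting process
# retries every second instead of every 10 seconds (the default).
result = FileWatching.mkpidlock(lockfile; poll_interval = 1.0) do
    # Critical section: runs while this process holds the pid file.
    :done
end
```

The do-block form returns the value of the block and releases the lock afterwards. A shorter interval reduces the worst-case wait but makes waiting processes check the filesystem more often, which may matter on shared HPC filesystems.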
