Keras 3 gives incorrect output from evaluate/fit in distributed context #19891
Comments
With a bit more investigation, I figured out what's going on: the value returned by `evaluate` appears to match the loss from just the first replica, rather than the loss over the full batch. This can be seen with:

```python
import tensorflow as tf
import keras
# import tf_keras as keras

keras.utils.set_random_seed(0)

n_replicas = 2
gpus = tf.config.list_physical_devices("GPU")
tf.config.set_logical_device_configuration(
    gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=1000)] * n_replicas
)

batch_size = 12
x = tf.random.uniform((batch_size, 1), -1, 1, seed=0)
y = tf.random.uniform((batch_size, 10), -1, 1, seed=1)

with tf.distribute.MirroredStrategy().scope():
    inp = keras.Input(shape=(1,))
    layer = keras.layers.Dense(10)
    model = keras.Model(inp, layer(inp))
    model.compile(loss="mse", optimizer="sgd")

gt = keras.losses.mean_squared_error(y, model.predict(x, batch_size=batch_size))
eval = model.evaluate(x, y, batch_size=batch_size)
model.fit(x, y, batch_size=batch_size, epochs=1)

print(f"ground truth: {tf.reduce_mean(gt)}")
print(f"loss from first replica: {tf.reduce_mean(gt[:batch_size//n_replicas])}")
print(f"evaluate: {eval}")
```

Which gives output:
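To make the arithmetic concrete, here is a small standalone numpy sketch (not Keras code; the loss values are made up for illustration) of why reporting only the first replica's mean loss gives the wrong answer whenever the per-example losses differ across shards:

```python
import numpy as np

# Hypothetical per-example losses over a global batch of 6,
# split evenly across 2 replicas.
n_replicas = 2
per_example_loss = np.array([1.0, 2.0, 3.0, 4.0, 9.0, 11.0])
shard = len(per_example_loss) // n_replicas

global_mean = per_example_loss.mean()            # mean over the full batch
replica0_mean = per_example_loss[:shard].mean()  # what replica 0 alone sees

print(global_mean)    # 5.0
print(replica0_mean)  # 2.0
```

Unless every shard happens to have the same mean loss, replica 0's mean is not the global mean, which matches the discrepancy printed by the script above.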
One more piece of investigation. I believe the above issue with `evaluate` may just be a display issue; I think the reason the `fit` results change with the number of replicas is that gradient aggregation is not happening, which can be seen with:

```python
import tensorflow as tf
import keras
# import tf_keras as keras

keras.utils.set_random_seed(0)

n_replicas = 2
gpus = tf.config.list_physical_devices("GPU")
tf.config.set_logical_device_configuration(
    gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=1000)] * n_replicas
)

batch_size = 12
local_batch_size = batch_size // n_replicas
x = tf.random.uniform((batch_size, 1), -1, 1, seed=0)
y = tf.random.uniform((batch_size, 1), -1, 1, seed=1)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    inp = keras.Input(shape=(1,))
    layer = keras.layers.Dense(
        1, use_bias=False, kernel_initializer=keras.initializers.constant(1)
    )
    model = keras.Model(inp, layer(inp))
    model.compile(loss="mse", optimizer=keras.optimizers.SGD(learning_rate=1.0))

model.fit(x, y, batch_size=batch_size, epochs=1)

weights = strategy.run(lambda: layer.kernel.value).values
print(f"per-replica weights: {[w.numpy() for w in weights]}")
```

We can see that each replica is maintaining independent weights. If we switch to `tf-keras`, the per-replica weights stay in sync, as expected.
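As a purely illustrative toy sketch (numpy only, not Keras or TensorFlow internals; the data and learning rate are made up), here is one SGD step on a scalar weight replicated across two shards. With a cross-replica gradient average the copies stay identical; skipping the aggregation lets them drift apart, matching the independent weights observed above:

```python
import numpy as np

def grad(w, x, y):
    # gradient of the mean of 0.5 * (w*x - y)^2 over a shard, w.r.t. w
    return np.mean((w * x - y) * x)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 12)
y = rng.uniform(-1.0, 1.0, 12)
shards = np.split(np.arange(12), 2)  # each replica sees half the batch

lr = 1.0
g = [grad(1.0, x[s], y[s]) for s in shards]  # per-replica gradients

# with aggregation (all-reduce mean): both copies take the same step
g_mean = sum(g) / len(g)
w_sync = [1.0 - lr * g_mean, 1.0 - lr * g_mean]

# without aggregation: each copy applies only its local gradient
w_indep = [1.0 - lr * g[0], 1.0 - lr * g[1]]

print(w_sync[0] == w_sync[1])  # True: replicas stay identical
print(w_indep)                 # the two copies diverge (different shards)
```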
Thank you for providing a detailed investigation of the issue. We will look into it.
I think I was able to get past this issue, but then I ran into bug #19246, so I can't really tell whether things are working correctly or not.
Hi @drasmuss! Based on this comment, it seems like you were able to resolve the problem raised in this issue, and there is already an open issue for the outstanding bug, so I'm closing this one. If possible, it would be great if you could add more information about how you resolved the problem discussed here. Thanks!
No, the issue is not resolved. I had been working on a fix locally, but was unable to verify it due to that other bug. This issue itself is still present: Keras gives incorrect output from `fit` and `evaluate` in a distributed context.
I see! Thanks for clarifying! We're looking into the other bug you linked here. After #19246 is resolved, please let us know if this issue persists!
This issue will still require a pull request (or two) of its own to fix; it definitely won't be resolved on its own after #19246 is fixed.
I see! I re-opened the issue! Is it the display issue you mentioned in #19891 (comment)?
I believe it's actually two separate issues (both requiring fixes). One is the wrong value being returned from `evaluate`. The other is that gradient aggregation is not happening, so the distributed replicas are not sharing information at all during training (which essentially means that you get no benefit from the distributed training).
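For the `evaluate` half, the fix amounts to combining per-replica partial sums rather than taking one replica's mean. A minimal numpy sketch of that reduction (illustrative only; the values are hypothetical and this is not Keras internals):

```python
import numpy as np

# Hypothetical per-example losses over a global batch, split across 2 replicas.
per_example_loss = np.array([1.0, 2.0, 3.0, 4.0, 9.0, 11.0])
shards = np.split(per_example_loss, 2)

# each replica reports a partial (sum, count) pair ...
partial = [(s.sum(), len(s)) for s in shards]

# ... and a cross-replica reduction recovers the true global mean
global_loss = sum(p[0] for p in partial) / sum(p[1] for p in partial)

print(global_loss)  # 5.0, the mean over the full batch
```

Because sums and counts combine linearly, this reduction gives the same answer for any number of replicas, which is the invariance the bug report expects.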
Thanks for clarifying! I'll look into this!
In Keras 3, changing the number of replicas during distributed training/evaluation changes the output of the model:
This gives output:

`n_replicas=1`:
`n_replicas=2`:
`n_replicas=4`:

We can see that the ground truth is invariant to the number of replicas, as expected, but the loss value calculated by `evaluate` is incorrect for all `n_replicas > 1`. And this doesn't just impact evaluation: `fit` results in a different change in the model output as we change the number of replicas.

If we switch to `tf-keras`, then we get the expected output regardless of the number of replicas:

`n_replicas=1`:
`n_replicas=2`:
`n_replicas=4`: