Convergence criterion based on inner products between successive gradients.
Inherits From: ConvergenceCriterion
tfp.substrates.jax.optimizer.convergence_criteria.SuccessiveGradientsAreUncorrelated(
window_size=10, min_num_steps=20, name=None
)
Let `g[t]` be the gradient vector at step `t`, and `g[t-1]` the previous gradient. Their inner product,

`grad_inner_product[t] = sum_i(g[t, i] * g[t - 1, i])`,
measures correlation between successive optimization steps. We expect this to be positive if the optimization is making progress; conversely, it can be shown to be negative in expectation at the stationary distribution of constant-step-size SGD [(Pflug, 1990)][2].
This criterion detects convergence when an exponentially-weighted moving average of `grad_inner_product` becomes negative; intuitively, when there has been no consistent direction across the most recent `window_size` steps.
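As a rough illustration (not the library's internal implementation), the tracked quantity can be sketched with plain `jax.numpy`; the helper names `grad_inner_product` and `update_ewma`, and the decay parameterization, are assumptions for this sketch.

```python
import jax.numpy as jnp

def grad_inner_product(g_t, g_prev):
  # Inner product of the current and previous gradient vectors:
  # sum_i(g[t, i] * g[t - 1, i]).
  return jnp.sum(g_t * g_prev)

def update_ewma(ewma, g_t, g_prev, window_size=10):
  # One update of an exponentially-weighted moving average of the inner
  # product. The decay is chosen so that roughly the last `window_size`
  # steps dominate the average (assumed parameterization, not necessarily
  # the one used internally).
  decay = 1. - 1. / window_size
  return decay * ewma + (1. - decay) * grad_inner_product(g_t, g_prev)

# Convergence is declared (after `min_num_steps`) once this moving average
# turns negative, i.e. successive gradients point in inconsistent directions.
```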
Theoretical analysis shows that with no decay (`window_size=np.inf`), this rule stops in finite time almost surely, for constant-step-size SGD under standard assumptions ([Pflug, 1990][2]; [Chee and Toulis, 2017][1]). In practice, it is often more efficient to use a decaying moving average.
Batch semantics: because this criterion does not depend on the loss, vector-valued losses will not produce vector-valued convergence indicators. Instead, the returned `has_converged` is always scalar, and is computed from the inner product summed across gradients from all variables being optimized, as sketched below.
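For example, with several variables being optimized, the per-variable inner products are summed into a single scalar before averaging; a minimal sketch of that reduction (the helper name is hypothetical):

```python
import jax.numpy as jnp

def total_grad_inner_product(grads_t, grads_prev):
  # `grads_t` and `grads_prev` are lists of per-variable gradient arrays, as
  # in the `grads` argument of the methods below. The result is a single
  # scalar, so the criterion's `has_converged` is scalar even for
  # vector-valued losses.
  return sum(jnp.sum(g * g_prev) for g, g_prev in zip(grads_t, grads_prev))
```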
If you have questions or feedback about this criterion, please contact tfprobability@tensorflow.org.
References
[1] Jerry Chee and Panos Toulis. Convergence diagnostics for stochastic gradient descent with constant step size. arXiv preprint arXiv:1710.06382, 2017. https://arxiv.org/abs/1710.06382
[2] Georg Ch. Pflug. Non-asymptotic confidence bounds for stochastic approximation algorithms with constant step size. Monatshefte für Mathematik, 110(3-4), pp. 297-314, 1990.
| Attributes | |
| --- | --- |
| `min_num_steps` | |
| `name` | |
| `window_size` | |
Methods
bootstrap
bootstrap(loss, grads, parameters)
Returns a structure of `Tensor`s for the rule's state at step 0.

The shape of the `Tensor`s specifying `loss`, `grads`, and `parameters` may optionally be prefixed by one or more batch dimension(s).
| Args | |
| --- | --- |
| `loss` | float `Tensor` initial value of loss being optimized. |
| `grads` | list of float `Tensor` gradients of `loss` wrt `parameters`. |
| `parameters` | list of float `Tensor` initial values of `parameters` being optimized. |

| Returns | |
| --- | --- |
| `initial_auxiliary_state` | (Structure of) `Tensor`(s) representing the initial auxiliary state carried forward by this criterion. |
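A minimal usage sketch for `bootstrap` in the JAX substrate; the toy loss, parameter shape, and values below are illustrative assumptions:

```python
import jax
import jax.numpy as jnp
from tensorflow_probability.substrates import jax as tfp

criterion = tfp.optimizer.convergence_criteria.SuccessiveGradientsAreUncorrelated(
    window_size=10, min_num_steps=20)

# Toy quadratic loss over one parameter vector (illustrative only).
def loss_fn(w):
  return jnp.sum((w - 2.) ** 2)

w = jnp.zeros([3])
loss, grad = jax.value_and_grad(loss_fn)(w)

# Initialize the criterion's auxiliary state at step 0.
auxiliary_state = criterion.bootstrap(loss, [grad], [w])
```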
one_step
one_step(step, loss, grads, parameters, auxiliary_state)
Updates tracked quantities for a new step, and determines if converged.
The shape of the `Tensor`s specifying `loss`, `grads`, and `parameters` may optionally be prefixed by one or more batch dimension(s). In this case, the returned value `has_converged` will have shape equal to the broadcast batch shape of whichever of those quantities is used by this convergence criterion, and the quantities defining the convergence criterion (`min_num_steps`, etc.).
| Args | |
| --- | --- |
| `step` | integer `Tensor` index of the current step, where `step >= 1` (on step `0`, `initial_state` should be called instead). |
| `loss` | float `Tensor` value of loss at the current step. |
| `grads` | list of float `Tensor` gradients of `loss` wrt `parameters`. |
| `parameters` | list of float `Tensor` current values of `parameters` being optimized. |
| `auxiliary_state` | the (structure of) `Tensor`(s) containing state carried forward from the previous step. |

| Returns | |
| --- | --- |
| `has_converged` | boolean `Tensor` indicating whether the optimization has converged. |
| `updated_auxiliary_state` | (Structure of) `Tensor`(s) representing updated quantities tracked by the convergence criterion. This should match the structure of the value returned by `bootstrap`. |
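Putting `bootstrap` and `one_step` together, here is a sketch of a constant-step-size SGD loop that stops once the criterion reports convergence; the objective, noise scale, learning rate, and step budget are assumptions for illustration:

```python
import jax
import jax.numpy as jnp
from tensorflow_probability.substrates import jax as tfp

criterion = tfp.optimizer.convergence_criteria.SuccessiveGradientsAreUncorrelated(
    window_size=10, min_num_steps=20)

def loss_fn(w):
  return jnp.sum((w - 2.) ** 2)

learning_rate = 0.1
w = jnp.zeros([3])
key = jax.random.PRNGKey(0)

# Step 0: initialize the criterion's state from the initial loss and gradient.
loss, grad = jax.value_and_grad(loss_fn)(w)
auxiliary_state = criterion.bootstrap(loss, [grad], [w])

for step in range(1, 1000):
  loss, grad = jax.value_and_grad(loss_fn)(w)
  # Add noise to mimic minibatch gradients in constant-step-size SGD.
  key, subkey = jax.random.split(key)
  noisy_grad = grad + 0.5 * jax.random.normal(subkey, grad.shape)
  has_converged, auxiliary_state = criterion.one_step(
      step, loss, [noisy_grad], [w], auxiliary_state)
  if has_converged:
    break
  w = w - learning_rate * noisy_grad
```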