Policy that samples actions based on the FALCON algorithm.
Inherits From: RewardPredictionBasePolicy, TFPolicy
tf_agents.bandits.policies.falcon_reward_prediction_policy.FalconRewardPredictionPolicy(
    time_step_spec: tf_agents.typing.types.TimeStep,
    action_spec: tf_agents.typing.types.NestedTensorSpec,
    reward_network: tf_agents.typing.types.Network,
    exploitation_coefficient: Optional[types.FloatOrReturningFloat] = 1.0,
    max_exploration_probability_hint: Optional[types.FloatOrReturningFloat] = None,
    observation_and_action_constraint_splitter: Optional[types.Splitter] = None,
    accepts_per_arm_features: bool = False,
    constraints: Iterable[tf_agents.bandits.policies.constraints.BaseConstraint] = (),
    emit_policy_info: Tuple[Text, ...] = (),
    num_samples_list: Sequence[tf.Variable] = (),
    name: Optional[Text] = None
)
Args
  time_step_spec: A TimeStep spec of the expected time_steps.
  action_spec: A nest of BoundedTensorSpec representing the actions.
  reward_network: An instance of a tf_agents.network.Network, callable via
    network(observation, step_type) -> (output, final_state).
  exploitation_coefficient: A float, or a callable that returns a float.
    Its value is internally lower-bounded at 0. It controls how
    exploitative the policy behaves with respect to the predicted rewards:
    a larger value makes the policy sample the greedy action (the one with
    the best predicted reward) with higher probability. See the sketch
    after this table for how it shapes the sampling probabilities.
  max_exploration_probability_hint: An optional float, representing a hint
    on the maximum exploration probability, internally clipped to [0, 1].
    When this argument is set, exploitation_coefficient is ignored and the
    policy attempts to choose non-greedy actions with at most this
    probability. When such an upper bound cannot be achieved, e.g. due to
    insufficient training data, the policy minimizes the probability of
    choosing non-greedy actions on a best-effort basis. For a
    demonstration of how it affects the policy behavior, see the unit test
    testMaxExplorationProbabilityHint in
    falcon_reward_prediction_policy_test.
  observation_and_action_constraint_splitter: A function used for masking
    valid/invalid actions for each state of the environment. The function
    takes in a full observation and returns a tuple consisting of 1) the
    part of the observation intended as input to the network and 2) the
    mask. The mask should be a 0-1 Tensor of shape [batch_size,
    num_actions]. This function should also work with a TensorSpec as
    input, and should output TensorSpec objects for the observation and
    mask.
  accepts_per_arm_features: (bool) Whether the policy accepts per-arm
    features.
  constraints: An iterable of constraint objects that are instances of
    tf_agents.bandits.policies.constraints.BaseConstraint.
  emit_policy_info: (tuple of strings) Which side information to emit as
    part of the policy info. Allowed values can be found in
    policy_utilities.PolicyInfo.
  num_samples_list: A sequence of tf.Variable's representing the number of
    training examples for every action the policy was trained with. For
    per-arm features, the list is expected to have size 1, representing
    the total number of examples the policy was trained with.
  name: The name of this policy. All variables in this module will fall
    under that name. Defaults to the class name.
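For intuition, FALCON samples each non-greedy action with probability
roughly inversely proportional to the number of actions plus a scaled gap
between its predicted reward and the best predicted reward, and puts the
remaining mass on the greedy action. The sketch below only illustrates this
rule; the gamma argument stands in for the quantity the policy derives
internally from exploitation_coefficient and num_samples_list, so the exact
numbers produced by the library may differ.

import numpy as np

def falcon_action_probabilities(predicted_rewards, gamma):
  # Illustrative FALCON-style sampling rule (not the library's exact code):
  # each non-greedy action a gets probability 1 / (K + gamma * gap(a)),
  # where gap(a) is the shortfall of its predicted reward relative to the
  # greedy action; the greedy action receives the remaining mass.
  predicted_rewards = np.asarray(predicted_rewards, dtype=np.float64)
  num_actions = predicted_rewards.shape[0]
  greedy = int(np.argmax(predicted_rewards))
  gaps = predicted_rewards[greedy] - predicted_rewards
  probs = 1.0 / (num_actions + gamma * gaps)
  probs[greedy] = 0.0
  probs[greedy] = 1.0 - probs.sum()
  return probs

# A larger gamma (e.g. driven by a larger exploitation_coefficient or more
# training samples) concentrates probability on the greedy action:
print(falcon_action_probabilities([0.1, 0.5, 0.4], gamma=1.0))
print(falcon_action_probabilities([0.1, 0.5, 0.4], gamma=100.0))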
Raises
  NotImplementedError: If action_spec contains more than one
    BoundedTensorSpec or the BoundedTensorSpec is not valid.
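A minimal construction sketch follows; the specs, network architecture, and
counter dtype are illustrative choices rather than requirements.

import tensorflow as tf
from tf_agents.bandits.policies import falcon_reward_prediction_policy
from tf_agents.networks import q_network
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts

# Hypothetical problem: 8-dimensional observations and 3 arms.
observation_spec = tensor_spec.TensorSpec(shape=(8,), dtype=tf.float32)
time_step_spec = ts.time_step_spec(observation_spec)
action_spec = tensor_spec.BoundedTensorSpec(
    shape=(), dtype=tf.int32, minimum=0, maximum=2)

# Any network producing one output per action can serve as the reward
# network; a QNetwork is used here purely for illustration.
reward_network = q_network.QNetwork(
    input_tensor_spec=observation_spec,
    action_spec=action_spec,
    fc_layer_params=(32, 32))
reward_network.create_variables()

# Per-action training-example counters, normally maintained by the agent.
num_samples_list = [tf.Variable(0, dtype=tf.int32) for _ in range(3)]

policy = falcon_reward_prediction_policy.FalconRewardPredictionPolicy(
    time_step_spec=time_step_spec,
    action_spec=action_spec,
    reward_network=reward_network,
    exploitation_coefficient=1.0,
    num_samples_list=num_samples_list)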
Attributes
  accepts_per_arm_features
  action_spec: Describes the TensorSpecs of the Tensors expected by
    step(action). action can be a single Tensor, or a nested dict, list or
    tuple of Tensors.
  collect_data_spec: Describes the Tensors written when using this policy
    with an environment.
  emit_log_probability: Whether this policy instance emits log
    probabilities or not.
  info_spec: Describes the Tensors emitted as info by action and
    distribution. info can be an empty tuple, a single Tensor, or a nested
    dict, list or tuple of Tensors.
  num_samples_list
  num_trainable_elements
  observation_and_action_constraint_splitter
  policy_state_spec: Describes the Tensors expected by step(_,
    policy_state). policy_state can be an empty tuple, a single Tensor, or
    a nested dict, list or tuple of Tensors.
  policy_step_spec: Describes the output of action().
  time_step_spec: Describes the TimeStep tensors returned by step().
  trajectory_spec: Describes the Tensors written when using this policy
    with an environment.
  validate_args: Whether action & distribution validate input and output
    args.
Methods
action
action(
    time_step: tf_agents.trajectories.TimeStep,
    policy_state: tf_agents.typing.types.NestedTensor = (),
    seed: Optional[types.Seed] = None
) -> tf_agents.trajectories.PolicyStep
Generates next action given the time_step and policy_state.
Args
  time_step: A TimeStep tuple corresponding to time_step_spec().
  policy_state: A Tensor, or a nested dict, list or tuple of Tensors
    representing the previous policy_state.
  seed: Seed to use if action performs sampling (optional).
Returns
  A PolicyStep named tuple containing:
    action: An action Tensor matching the action_spec.
    state: A policy state tensor to be fed into the next call to action.
    info: Optional side information such as action log probabilities.
Raises
  RuntimeError: If the subclass __init__ didn't call super().__init__().
  ValueError or TypeError: If validate_args is True and inputs or outputs
    do not match time_step_spec, policy_state_spec, or policy_step_spec.
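Continuing the construction sketch above, a brief example of querying
actions for a batch of two observations (the observation values and seed
are arbitrary):

observations = tf.random.uniform(shape=(2, 8), dtype=tf.float32)
time_step = ts.restart(observations, batch_size=2)

action_step = policy.action(time_step, seed=123)
# action_step.action is an int32 Tensor of shape [2] with values in
# {0, 1, 2}; action_step.info carries whatever side information was
# requested via emit_policy_info.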
distribution
distribution(
    time_step: tf_agents.trajectories.TimeStep,
    policy_state: tf_agents.typing.types.NestedTensor = ()
) -> tf_agents.trajectories.PolicyStep
Generates the distribution over next actions given the time_step.
Args
  time_step: A TimeStep tuple corresponding to time_step_spec().
  policy_state: A Tensor, or a nested dict, list or tuple of Tensors
    representing the previous policy_state.
Returns
  A PolicyStep named tuple containing:
    action: A tf.distribution capturing the distribution of next actions.
    state: A policy state tensor for the next call to distribution.
    info: Optional side information such as action log probabilities.
Raises
  ValueError or TypeError: If validate_args is True and inputs or outputs
    do not match time_step_spec, policy_state_spec, or policy_step_spec.
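A sketch of the distribution view, reusing the hypothetical policy and
time_step from above and assuming the returned PolicyStep.action behaves
like a standard TensorFlow Probability distribution:

distribution_step = policy.distribution(time_step)
action_distribution = distribution_step.action

# Standard distribution operations, useful e.g. for importance weighting.
sampled_actions = action_distribution.sample()
log_probs = action_distribution.log_prob(sampled_actions)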
get_initial_state
get_initial_state(
    batch_size: Optional[types.Int]
) -> tf_agents.typing.types.NestedTensor
Returns an initial state usable by the policy.
Args
  batch_size: Tensor or constant: size of the batch dimension. Can be
    None, in which case no batch dimension is added.
Returns
  A nested object of type policy_state containing properly initialized
  Tensors.
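With a feed-forward reward network like the QNetwork in the sketch above,
the policy has no recurrent state, so the initial state is simply an empty
structure that can be threaded through action() unchanged:

policy_state = policy.get_initial_state(batch_size=2)  # () for this policy
action_step = policy.action(time_step, policy_state)
policy_state = action_step.state  # pass along to the next action() call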
update
update(
    policy,
    tau: float = 1.0,
    tau_non_trainable: Optional[float] = None,
    sort_variables_by_name: bool = False
) -> tf.Operation
Updates the current policy with another policy. This includes copying the
variables from the other policy.
Args
  policy: Another policy this policy can update from.
  tau: A float scalar in [0, 1]. When tau is 1.0 (the default), we do a
    hard update. This is used for trainable variables.
  tau_non_trainable: A float scalar in [0, 1] for non-trainable variables.
    If None, tau is used.
  sort_variables_by_name: A bool; when True, the variables are sorted by
    name before doing the update.
Returns
  A TF op that performs the update.
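A hedged sketch of copying the trained weights into a second, identically
constructed policy (target_policy and target_network are hypothetical
names; both policies must track the same set of variables for the copy to
line up):

target_network = q_network.QNetwork(
    input_tensor_spec=observation_spec,
    action_spec=action_spec,
    fc_layer_params=(32, 32))
target_network.create_variables()

target_policy = falcon_reward_prediction_policy.FalconRewardPredictionPolicy(
    time_step_spec=time_step_spec,
    action_spec=action_spec,
    reward_network=target_network,
    num_samples_list=[tf.Variable(0, dtype=tf.int32) for _ in range(3)])

# Hard update (tau=1.0): copy every variable from `policy` into `target_policy`.
update_op = target_policy.update(policy, tau=1.0)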