Policy that samples actions based on the FALCON algorithm.
Inherits From: RewardPredictionBasePolicy, TFPolicy
tf_agents.bandits.policies.falcon_reward_prediction_policy.FalconRewardPredictionPolicy(
    time_step_spec: tf_agents.typing.types.TimeStep,
    action_spec: tf_agents.typing.types.NestedTensorSpec,
    reward_network: tf_agents.typing.types.Network,
    exploitation_coefficient: Optional[types.FloatOrReturningFloat] = 1.0,
    max_exploration_probability_hint: Optional[types.FloatOrReturningFloat] = None,
    observation_and_action_constraint_splitter: Optional[types.Splitter] = None,
    accepts_per_arm_features: bool = False,
    constraints: Iterable[tf_agents.bandits.policies.constraints.BaseConstraint] = (),
    emit_policy_info: Tuple[Text, ...] = (),
    num_samples_list: Sequence[tf.Variable] = (),
    name: Optional[Text] = None
)
Args
  time_step_spec: A TimeStep spec of the expected time_steps.
  action_spec: A nest of BoundedTensorSpec representing the actions.
  reward_network: An instance of a tf_agents.network.Network, callable via
    network(observation, step_type) -> (output, final_state).
  exploitation_coefficient: A float, or a callable that returns a float.
    Its value is internally lower-bounded at 0. It controls how
    exploitative the policy behaves with respect to the predicted rewards:
    a larger value makes the policy sample the greedy action (the one with
    the best predicted reward) with higher probability. See the sketch
    after this table for how it shapes the sampling probabilities.
  max_exploration_probability_hint: An optional float, representing a hint
    on the maximum exploration probability, internally clipped to [0, 1].
    When this argument is set, exploitation_coefficient is ignored and the
    policy attempts to choose non-greedy actions with at most this
    probability. When such an upper bound cannot be achieved, e.g. due to
    insufficient training data, the policy minimizes the probability of
    choosing non-greedy actions on a best-effort basis. For a
    demonstration of how it affects the policy behavior, see the unit test
    testMaxExplorationProbabilityHint in
    falcon_reward_prediction_policy_test.
  observation_and_action_constraint_splitter: A function used for masking
    valid/invalid actions for each state of the environment. The function
    takes in a full observation and returns a tuple consisting of 1) the
    part of the observation intended as input to the network and 2) the
    mask. The mask should be a 0-1 Tensor of shape [batch_size,
    num_actions]. This function should also work with a TensorSpec as
    input, and should output TensorSpec objects for the observation and
    mask.
  accepts_per_arm_features: (bool) Whether the policy accepts per-arm
    features.
  constraints: An iterable of constraint objects that are instances of
    tf_agents.bandits.policies.constraints.BaseConstraint.
  emit_policy_info: (tuple of strings) Which side information to emit as
    part of the policy info. Allowed values can be found in
    policy_utilities.PolicyInfo.
  num_samples_list: A sequence of tf.Variable's representing the number of
    training examples for every action the policy was trained with. For
    per-arm features, the list is expected to have size 1, representing
    the total number of examples the policy was trained with.
  name: The name of this policy. All variables in this module will fall
    under that name. Defaults to the class name.
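For intuition, FALCON samples each non-greedy action with probability
roughly inversely proportional to the number of actions plus a scaled gap
between its predicted reward and the best predicted reward, and puts the
remaining mass on the greedy action. The sketch below only illustrates this
rule; the gamma argument stands in for the quantity the policy derives
internally from exploitation_coefficient and num_samples_list, so the exact
numbers produced by the library may differ.

import numpy as np

def falcon_action_probabilities(predicted_rewards, gamma):
  # Illustrative FALCON-style sampling rule (not the library's exact code):
  # each non-greedy action a gets probability 1 / (K + gamma * gap(a)),
  # where gap(a) is the shortfall of its predicted reward relative to the
  # greedy action; the greedy action receives the remaining mass.
  predicted_rewards = np.asarray(predicted_rewards, dtype=np.float64)
  num_actions = predicted_rewards.shape[0]
  greedy = int(np.argmax(predicted_rewards))
  gaps = predicted_rewards[greedy] - predicted_rewards
  probs = 1.0 / (num_actions + gamma * gaps)
  probs[greedy] = 0.0
  probs[greedy] = 1.0 - probs.sum()
  return probs

# A larger gamma (e.g. driven by a larger exploitation_coefficient or more
# training samples) concentrates probability on the greedy action:
print(falcon_action_probabilities([0.1, 0.5, 0.4], gamma=1.0))
print(falcon_action_probabilities([0.1, 0.5, 0.4], gamma=100.0))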
Raises
  NotImplementedError: If action_spec contains more than one
    BoundedTensorSpec or the BoundedTensorSpec is not valid.
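A minimal construction sketch follows; the specs, network architecture, and
counter dtype are illustrative choices rather than requirements.

import tensorflow as tf
from tf_agents.bandits.policies import falcon_reward_prediction_policy
from tf_agents.networks import q_network
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts

# Hypothetical problem: 8-dimensional observations and 3 arms.
observation_spec = tensor_spec.TensorSpec(shape=(8,), dtype=tf.float32)
time_step_spec = ts.time_step_spec(observation_spec)
action_spec = tensor_spec.BoundedTensorSpec(
    shape=(), dtype=tf.int32, minimum=0, maximum=2)

# Any network producing one output per action can serve as the reward
# network; a QNetwork is used here purely for illustration.
reward_network = q_network.QNetwork(
    input_tensor_spec=observation_spec,
    action_spec=action_spec,
    fc_layer_params=(32, 32))
reward_network.create_variables()

# Per-action training-example counters, normally maintained by the agent.
num_samples_list = [tf.Variable(0, dtype=tf.int32) for _ in range(3)]

policy = falcon_reward_prediction_policy.FalconRewardPredictionPolicy(
    time_step_spec=time_step_spec,
    action_spec=action_spec,
    reward_network=reward_network,
    exploitation_coefficient=1.0,
    num_samples_list=num_samples_list)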
Attributes
  accepts_per_arm_features
  action_spec: Describes the TensorSpecs of the Tensors expected by
    step(action). action can be a single Tensor, or a nested dict, list or
    tuple of Tensors.
  collect_data_spec: Describes the Tensors written when using this policy
    with an environment.
  emit_log_probability: Whether this policy instance emits log
    probabilities or not.
  info_spec: Describes the Tensors emitted as info by action and
    distribution. info can be an empty tuple, a single Tensor, or a nested
    dict, list or tuple of Tensors.
  num_samples_list
  num_trainable_elements
  observation_and_action_constraint_splitter
  policy_state_spec: Describes the Tensors expected by step(_,
    policy_state). policy_state can be an empty tuple, a single Tensor, or
    a nested dict, list or tuple of Tensors.
  policy_step_spec: Describes the output of action().
  time_step_spec: Describes the TimeStep tensors returned by step().
  trajectory_spec: Describes the Tensors written when using this policy
    with an environment.
  validate_args: Whether action & distribution validate input and output
    args.
Methods
action
action(
    time_step: tf_agents.trajectories.TimeStep,
    policy_state: tf_agents.typing.types.NestedTensor = (),
    seed: Optional[types.Seed] = None
) -> tf_agents.trajectories.PolicyStep
Generates next action given the time_step and policy_state.
Args
  time_step: A TimeStep tuple corresponding to time_step_spec().
  policy_state: A Tensor, or a nested dict, list or tuple of Tensors
    representing the previous policy_state.
  seed: Seed to use if action performs sampling (optional).
Returns
  A PolicyStep named tuple containing:
    action: An action Tensor matching the action_spec.
    state: A policy state tensor to be fed into the next call to action.
    info: Optional side information such as action log probabilities.
Raises
  RuntimeError: If the subclass __init__ didn't call super().__init__().
  ValueError or TypeError: If validate_args is True and inputs or outputs
    do not match time_step_spec, policy_state_spec, or policy_step_spec.
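Continuing the construction sketch above, a brief example of querying
actions for a batch of two observations (the observation values and seed
are arbitrary):

observations = tf.random.uniform(shape=(2, 8), dtype=tf.float32)
time_step = ts.restart(observations, batch_size=2)

action_step = policy.action(time_step, seed=123)
# action_step.action is an int32 Tensor of shape [2] with values in
# {0, 1, 2}; action_step.info carries whatever side information was
# requested via emit_policy_info.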
distribution
distribution(
    time_step: tf_agents.trajectories.TimeStep,
    policy_state: tf_agents.typing.types.NestedTensor = ()
) -> tf_agents.trajectories.PolicyStep
Generates the distribution over next actions given the time_step.
Args
  time_step: A TimeStep tuple corresponding to time_step_spec().
  policy_state: A Tensor, or a nested dict, list or tuple of Tensors
    representing the previous policy_state.
Returns
  A PolicyStep named tuple containing:
    action: A tf.distribution capturing the distribution of next actions.
    state: A policy state tensor for the next call to distribution.
    info: Optional side information such as action log probabilities.
Raises
  ValueError or TypeError: If validate_args is True and inputs or outputs
    do not match time_step_spec, policy_state_spec, or policy_step_spec.
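A sketch of the distribution view, reusing the hypothetical policy and
time_step from above and assuming the returned PolicyStep.action behaves
like a standard TensorFlow Probability distribution:

distribution_step = policy.distribution(time_step)
action_distribution = distribution_step.action

# Standard distribution operations, useful e.g. for importance weighting.
sampled_actions = action_distribution.sample()
log_probs = action_distribution.log_prob(sampled_actions)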
get_initial_state
get_initial_state(
    batch_size: Optional[types.Int]
) -> tf_agents.typing.types.NestedTensor
Returns an initial state usable by the policy.
Args
  batch_size: Tensor or constant: size of the batch dimension. Can be
    None, in which case no batch dimension is added.
Returns
  A nested object of type policy_state containing properly initialized
  Tensors.
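With a feed-forward reward network like the QNetwork in the sketch above,
the policy has no recurrent state, so the initial state is simply an empty
structure that can be threaded through action() unchanged:

policy_state = policy.get_initial_state(batch_size=2)  # () for this policy
action_step = policy.action(time_step, policy_state)
policy_state = action_step.state  # pass along to the next action() call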
update
update(
    policy,
    tau: float = 1.0,
    tau_non_trainable: Optional[float] = None,
    sort_variables_by_name: bool = False
) -> tf.Operation
Updates the current policy with another policy. This includes copying the
variables from the other policy.
Args
  policy: Another policy this policy can update from.
  tau: A float scalar in [0, 1]. When tau is 1.0 (the default), we do a
    hard update. This is used for trainable variables.
  tau_non_trainable: A float scalar in [0, 1] for non-trainable variables.
    If None, tau is used.
  sort_variables_by_name: A bool; when True, the variables are sorted by
    name before doing the update.
Returns
  A TF op that performs the update.
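A hedged sketch of copying the trained weights into a second, identically
constructed policy (target_policy and target_network are hypothetical
names; both policies must track the same set of variables for the copy to
line up):

target_network = q_network.QNetwork(
    input_tensor_spec=observation_spec,
    action_spec=action_spec,
    fc_layer_params=(32, 32))
target_network.create_variables()

target_policy = falcon_reward_prediction_policy.FalconRewardPredictionPolicy(
    time_step_spec=time_step_spec,
    action_spec=action_spec,
    reward_network=target_network,
    num_samples_list=[tf.Variable(0, dtype=tf.int32) for _ in range(3)])

# Hard update (tau=1.0): copy every variable from `policy` into `target_policy`.
update_op = target_policy.update(policy, tau=1.0)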