tf.experimental.dtensor.initialize_accelerator_system

Initializes accelerators and communication fabrics for DTensor.

DTensor configures TensorFlow to run in either local mode or multi-client mode.

  • In local mode, a mesh can only use devices attached to the current process.
  • In multi-client mode, a mesh can span across devices from multiple clients.

If DTENSOR_JOBS is non-empty, DTensor configures TensorFlow to run in multi-client mode using the distributed runtime. In multi-client mode, devices on different clients can communicate with each other.

The following environment variables control the behavior of this function.

  • DTENSOR_JOBS: string, a comma-separated list. Each item in the list has the format {hostname}:{port}. If empty, DTensor runs in local mode. Examples of valid DTENSOR_JOBS values:
    • 4 clients on localhost: localhost:10000,localhost:10001,localhost:10002,localhost:10003
    • 2 clients on host1, 2 clients on host2: host1:10000,host1:10001,host2:10000,host2:10003
    If the hostnames are BNS addresses, the items must be sorted in alphabetical order.
  • DTENSOR_CLIENT_ID: integer between 0 and num_clients - 1 that identifies the client ID of the current process. The default value is 0.
  • DTENSOR_JOB_NAME: string, the name of the TensorFlow job. The job name controls the job name section of the TensorFlow DeviceSpecs, e.g., job:worker in /job:worker/replica:0/task:0/device:TPU:0 when the job name is worker. The default value is localhost in local mode and worker in multi-client mode. All DTensor clients within the same multi-client cluster share the same job name.
  • DTENSOR_USE_PARALLEL_EXECUTOR: string. Set it to pw to specify that the backend is Pathways; with any other value the backend is TensorFlow.
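As a rough illustration of how these variables fit together, the following standard-library Python sketch reads them and reports the mode they imply. The helper name describe_dtensor_env is hypothetical and is not part of TensorFlow. It only mirrors the rules described above (the documented defaults and the switch between local and multi-client mode) and does not reproduce DTensor's actual parsing or validation.

```python
import os


def describe_dtensor_env(environ=None):
    """Summarize the DTensor settings implied by the environment.

    Hypothetical helper for illustration only; not a TensorFlow API.
    """
    env = os.environ if environ is None else environ

    # DTENSOR_JOBS is a comma-separated list of {hostname}:{port} items.
    jobs = [item for item in env.get("DTENSOR_JOBS", "").split(",") if item]
    multi_client = bool(jobs)  # a non-empty list selects multi-client mode
    num_clients = len(jobs) if multi_client else 1

    # DTENSOR_CLIENT_ID defaults to 0 and must lie in [0, num_clients - 1].
    client_id = int(env.get("DTENSOR_CLIENT_ID", "0"))
    if not 0 <= client_id < num_clients:
        raise ValueError(
            f"DTENSOR_CLIENT_ID={client_id} is outside 0..{num_clients - 1}"
        )

    # DTENSOR_JOB_NAME defaults to "worker" in multi-client mode and to
    # "localhost" in local mode.
    default_job_name = "worker" if multi_client else "localhost"
    job_name = env.get("DTENSOR_JOB_NAME", default_job_name)

    return {
        "mode": "multi-client" if multi_client else "local",
        "num_clients": num_clients,
        "client_id": client_id,
        "job_name": job_name,
    }


# Example: the second of two clients running on localhost.
example_env = {
    "DTENSOR_JOBS": "localhost:10000,localhost:10001",
    "DTENSOR_CLIENT_ID": "1",
}
print(describe_dtensor_env(example_env))
```

Passing an explicit dictionary, as in the example, keeps the sketch independent of the real process environment. Calling describe_dtensor_env() with no argument reads os.environ instead.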

Args

  • device_type: Type of accelerator to use. Can be CPU, GPU, or TPU. If None, uses tf.experimental.dtensor.preferred_device_type().
  • enable_coordination_service: If True, enables the distributed coordination service so that workers know about each other's devices when there is more than one client.
  • experimental_reset_context: Resets the TensorFlow context. After a reset, the behavior of existing TensorFlow objects (e.g. Tensors) is undefined. Set this to True as an escape hatch if there is no clear way to refactor your code to call initialize_accelerator_system() before calling TensorFlow APIs that initialize the context.

Returns

  • device_type: The type of accelerator that was initialized.
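
A minimal usage sketch, assuming TensorFlow with DTensor is installed and that the call is made before any other TensorFlow API initializes the context. The wrapper name init_dtensor is hypothetical. It only forwards device_type to initialize_accelerator_system and returns the result.

```python
def init_dtensor(device_type=None):
    """Initialize DTensor accelerators and return the initialized device type.

    Hypothetical convenience wrapper for illustration only.
    """
    # Imported inside the function so that defining this helper does not
    # require TensorFlow; calling it does.
    from tensorflow.experimental import dtensor

    # With device_type=None, DTensor falls back to
    # dtensor.preferred_device_type(), as documented for this argument.
    return dtensor.initialize_accelerator_system(device_type=device_type)
```

For example, calling init_dtensor("CPU") in a process where DTENSOR_JOBS is unset would run in local mode, and the return value is the type of accelerator that was initialized.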