Two different robots#

The code for this example is implemented in different_robots. Let us import it.

[1]:
from enki_env.examples import different_robots

Environment#

The environment contains one Thymio and one E-Puck. Otherwise it is very similar to the previous “same robots” example: same task, same reward, just different robots with (in this case, slightly) different sensors.

To create the environment via script, run:

python -m enki_env.examples.different_robots.environment
[2]:
env = different_robots.make_env(render_mode="human")
env.reset()
env.snapshot()
snapshot

The robots belong to different groups with different observation spaces.

[3]:
env.group_map
[3]:
{'thymio': ['thymio_0'], 'e-puck': ['e-puck_0']}
[4]:
env.observation_spaces
[4]:
{'thymio_0': Dict('wheel_speeds': Box(-1.0, 1.0, (2,), float32), 'prox/value': Box(0.0, 1.0, (7,), float32)),
 'e-puck_0': Dict('wheel_speeds': Box(-1.0, 1.0, (2,), float32), 'prox/value': Box(0.0, 1.0, (8,), float32))}

Baseline#

We adapted the Thymio baseline to work for the E-Puck.

To evaluate the performance of both baselines via script, run:

python -m enki_env.examples.different_robots.baseline
[5]:
import inspect

print(inspect.getsource(different_robots.EPuckBaseline.predict))
    def predict(self,
                observation: Observation,
                state: State | None = None,
                episode_start: EpisodeStart | None = None,
                deterministic: bool = False) -> tuple[Action, State | None]:
        prox = np.atleast_2d(observation['prox/value'])
        m = np.max(prox, axis=-1)
        prox[m > 0] /= m[:, np.newaxis][m > 0]
        ws = np.array([(-0.1, -0.25, -0.5, -1, -1, 0.5, 0.25, 0.1)], dtype=np.float32)
        w = np.tensordot(prox, ws, axes=([1], [1]))
        w[m == 0] = 1
        return np.clip(w, -1, 1), None
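To make the control law above easier to follow, here is a self-contained re-implementation of its core (a sketch using only numpy; the weight vector is copied from the source above, and the function name epuck_baseline_step is ours, not part of the library). Each row of proximity readings is normalized by its maximum, then reduced to a command by a weighted sum: the left-side and frontal sensors carry negative weights (back off), the right-side ones positive weights.

```python
import numpy as np

def epuck_baseline_step(prox):
    """Sketch of the E-Puck baseline control law on a batch of
    proximity readings (8 sensors per robot, values in [0, 1])."""
    # Copy to avoid mutating the caller's array (the library version
    # divides prox in place).
    prox = np.atleast_2d(prox).astype(np.float32).copy()
    m = np.max(prox, axis=-1)
    # Normalize each row by its maximum so the strongest reading is 1.
    prox[m > 0] /= m[:, np.newaxis][m > 0]
    # Negative weights (left/front sensors) push the command down,
    # positive weights (right sensors) push it up.
    ws = np.array([(-0.1, -0.25, -0.5, -1, -1, 0.5, 0.25, 0.1)],
                  dtype=np.float32)
    # Weighted sum over the sensor axis: (N, 8) x (1, 8) -> (N, 1).
    w = np.tensordot(prox, ws, axes=([1], [1]))
    # With no obstacle in sight, command full speed ahead.
    w[m == 0] = 1
    return np.clip(w, -1, 1)
```

For example, all-zero readings produce the full-speed command 1, while a strong reading on a frontal sensor (weight -1) saturates the command at -1.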

To perform a rollout, we need to assign one policy to each group.

[6]:
rollout = env.unwrapped.rollout(max_steps=10, policies={'thymio': different_robots.ThymioBaseline(),
                                                        'e-puck': different_robots.EPuckBaseline()})
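Conceptually, a per-group rollout dispatches each agent's observation to the policy of the group it belongs to. The following is a minimal sketch of that dispatch step with dummy stand-in policies (ConstantPolicy is hypothetical; this is not enki_env's actual rollout implementation, and the observation/action shapes are illustrative):

```python
import numpy as np

class ConstantPolicy:
    """Hypothetical stand-in policy that always returns the same action."""
    def __init__(self, action):
        self.action = np.asarray(action, dtype=np.float32)

    def predict(self, observation, state=None, episode_start=None,
                deterministic=False):
        return self.action, None

# Group structure mirroring env.group_map above.
group_map = {'thymio': ['thymio_0'], 'e-puck': ['e-puck_0']}
policies = {'thymio': ConstantPolicy([1.0]),
            'e-puck': ConstantPolicy([0.5])}

# Dispatch: every agent receives the action computed by its group's policy.
observations = {'thymio_0': np.zeros(7), 'e-puck_0': np.zeros(8)}
actions = {}
for group, agents in group_map.items():
    for agent in agents:
        actions[agent], _ = policies[group].predict(observations[agent])
```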
[7]:
rollout.keys()
[7]:
dict_keys(['thymio', 'e-puck'])
[8]:
rollout['thymio'].episode_reward, rollout['e-puck'].episode_reward
[8]:
(np.float64(-2.969223195758214), np.float64(-13.845665476374068))

Reinforcement Learning#

Let us now train and evaluate two RL policies for this task, one for each robot.

To perform this via script, run:

python -m enki_env.examples.different_robots.rl
[9]:
policies = different_robots.get_policies()
[10]:
policies.keys()
[10]:
dict_keys(['thymio', 'e-puck'])
[11]:
rollout = env.rollout(max_steps=10, policies=policies)
rollout['thymio'].episode_reward, rollout['e-puck'].episode_reward
[11]:
(np.float64(-3.5418375520177077), np.float64(-15.970891166964257))

Learning a centralized policy#

As an alternative to learning two distributed policies (one for each type of robot), we could learn a single centralized policy that computes the actions for both robots at once, taking the aggregated observation as input.

To perform this via script, run:

python -m enki_env.examples.different_robots.centralized_policy_rl

We wrap the multi-agent environment in a concatenated environment.

[12]:
from enki_env.concat_env import ConcatEnv
[13]:
cenv = ConcatEnv(different_robots.make_env())

The observation space contains all the keys of the individual robot observation spaces, with values concatenated when multiple robots share the same key.

In this case, for example, prox/value has a total of 15 values: the 7 Thymio values followed by the 8 E-Puck values.

[14]:
cenv.observation_space
[14]:
Dict('prox/value': Box(0.0, 1.0, (15,), float32), 'wheel_speeds': Box(-1.0, 1.0, (4,), float32))
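To illustrate how such a key-wise concatenation might work, here is a hypothetical sketch with plain numpy dicts (concat_observations is our illustrative helper, not ConcatEnv's actual implementation):

```python
import numpy as np

def concat_observations(obs_by_agent):
    """Merge per-agent observation dicts into one flat dict,
    concatenating the arrays of agents that share a key."""
    merged = {}
    for obs in obs_by_agent.values():
        for key, value in obs.items():
            merged.setdefault(key, []).append(np.asarray(value))
    return {key: np.concatenate(parts) for key, parts in merged.items()}

# Illustrative observations with the shapes reported above.
obs = {
    'thymio_0': {'prox/value': np.zeros(7), 'wheel_speeds': np.zeros(2)},
    'e-puck_0': {'prox/value': np.zeros(8), 'wheel_speeds': np.zeros(2)},
}
flat = concat_observations(obs)
# 'prox/value' ends up with 7 + 8 = 15 entries, 'wheel_speeds' with 2 + 2 = 4,
# matching the concatenated observation space printed above.
```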

Similarly, the action space is the concatenation of the action spaces of the individual robots.

[15]:
cenv.action_space
[15]:
Box(-1.0, 1.0, (2,), float32)

We can load and evaluate the policy similarly to the other environments:

[16]:
policy = different_robots.get_policy()
rollout = cenv.rollout(max_steps=10, policy=policy)
rollout['thymio'].episode_reward, rollout['e-puck'].episode_reward
[16]:
(np.float64(-3.2257519966167836), np.float64(-18.096107888526664))

Video#

To conclude, to generate a video similar to the previous examples, you can run

python -m enki_env.examples.different_robots.video

The following lines display the video, alternating between the baseline (yellow) and the distributed policy (cyan):

[17]:
video = different_robots.make_video()
video.display_in_notebook(fps=30, width=640, rd_kwargs=dict(logger=None))
[17]:

and the video of the centralized policy (violet):

[18]:
video = different_robots.make_video(centralized=True)
video.display_in_notebook(fps=30, width=640, rd_kwargs=dict(logger=None))
[18]: