Single Robot#

The code for this example is implemented in the single_robot module. Let us import it.

[1]:
from enki_env.examples import single_robot

Environment#

The environment contains a single Thymio and a wall. The initial position of the robot is fixed, while its orientation is sampled uniformly. The robot's goal is to rotate to face the wall perpendicularly in the shortest time.

To create the environment via script, run:

python -m enki_env.examples.single_robot.environment
[2]:
env = single_robot.make_env(render_mode="human")
env.reset()
env.unwrapped.snapshot()
snapshot
[3]:
env.unwrapped.display_in_notebook()

The view above is interactive. Try zooming in and out (trackpad/mouse wheel). Try also selecting the robot (mouse click) and moving it (dragging while the mouse is pressed).

The robot, in this configuration, is constrained to stay in place (fix_position=True).

[4]:
env.unwrapped.config.action
[4]:
ThymioAction(max_speed=16.6, max_acceleration=1, fix_orientation=False, fix_position=True, acceleration=False, dtype=<class 'numpy.float64'>, prox_comm=False, max_proximity_comm_payload=1)

The action space therefore contains a single value, which is applied as a negative speed to the left wheel and a positive speed to the right wheel.
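
As a sketch (not the library's actual implementation), the mapping from the normalized action to wheel target speeds could look like this, using the max_speed value from the configuration above:

```python
import numpy as np

MAX_SPEED = 16.6  # cm/s, the max_speed from the action configuration above


def wheel_speeds(action: np.ndarray) -> tuple[float, float]:
    # A single normalized value in [-1, 1] makes the robot turn in place:
    # opposite speeds on the two wheels.
    a = float(np.clip(action, -1.0, 1.0)[0])
    return (-a * MAX_SPEED, a * MAX_SPEED)
```

For example, an action of 0.5 would yield wheel speeds of (-8.3, 8.3), i.e. a rotation in place at half the maximum speed.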

[5]:
env.action_space
[5]:
Box(-1.0, 1.0, (1,), float64)

Observations are configured

[6]:
env.unwrapped.config.observation
[6]:
ThymioObservation(max_speed=16.6, speed=True, normalize=True, dtype=<class 'numpy.float64'>, proximity_distance=False, proximity_value=True, proximity_comm_payload=False, proximity_comm_intensity=False, proximity_comm_rx=False, max_proximity_distance=28, max_proximity_value=4505, max_proximity_comm_payload=1, max_proximity_comm_intensity=4600, max_proximity_comm_number=1)

so that the observation space contains the values measured by the proximity sensors, normalized between 0 and 1, and the speeds of the wheels, normalized between -1 and 1.
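
As a sketch of this normalization (using the max_proximity_value and max_speed constants shown above; not the library's actual code):

```python
import numpy as np

MAX_PROX_VALUE = 4505  # max_proximity_value from the configuration above
MAX_SPEED = 16.6       # cm/s, max_speed from the configuration above


def normalize_observation(prox_value, wheel_speeds):
    # Proximity readings are scaled to [0, 1]; wheel speeds to [-1, 1].
    return {
        'prox/value': np.clip(np.asarray(prox_value) / MAX_PROX_VALUE, 0.0, 1.0),
        'wheel_speeds': np.clip(np.asarray(wheel_speeds) / MAX_SPEED, -1.0, 1.0),
    }
```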

[11]:
env.observation_space
[11]:
Dict('wheel_speeds': Box(-1.0, 1.0, (2,), float64), 'prox/value': Box(0.0, 1.0, (7,), float64))

The reward penalizes the robot for not facing the wall (angle=0) and, with a smaller weight, for moving its wheels:

[12]:
import inspect

print(inspect.getsource(env.unwrapped.config.reward))
def reward(robot: pyenki.DifferentialWheeled, success: bool | None) -> float:
    return -1 - abs(normalize_angle(robot.angle)) - 0.1 * (abs(
        robot.left_wheel_encoder_speed) + abs(robot.right_wheel_encoder_speed))
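
The helper normalize_angle wraps the robot's heading; a common definition (assumed here, together with a standalone version of the reward above) is:

```python
import math


def normalize_angle(angle: float) -> float:
    # Wrap an angle to (-pi, pi]; assumed definition, matching the usual convention.
    return math.atan2(math.sin(angle), math.cos(angle))


def reward(angle: float, left_speed: float, right_speed: float) -> float:
    # -1 per step (time penalty), plus penalties for misalignment and wheel motion.
    return -1 - abs(normalize_angle(angle)) - 0.1 * (abs(left_speed) + abs(right_speed))
```

The best achievable reward per step is -1, obtained when the robot faces the wall and is still; the constant -1 term pushes the policy to finish quickly.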

The task terminates with success when the robot faces the wall and is (almost) still.

[13]:
print(inspect.getsource(env.unwrapped.config.terminations[0]))
    def f(robot: pyenki.DifferentialWheeled) -> bool | None:
        if abs(normalize_angle(robot.angle)) < angle_tol and is_still(
                robot, speed_tol):
            # Success
            return True
        return None
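
With illustrative tolerances (the actual angle_tol and speed_tol values live in the environment configuration), the success check amounts to:

```python
import math

ANGLE_TOL = 0.02  # rad; illustrative value, not taken from the configuration
SPEED_TOL = 0.1   # cm/s; illustrative value


def is_success(angle: float, left_speed: float, right_speed: float) -> bool:
    # Facing the wall within tolerance, with both wheels (almost) still.
    facing = abs(math.atan2(math.sin(angle), math.cos(angle))) < ANGLE_TOL
    still = max(abs(left_speed), abs(right_speed)) < SPEED_TOL
    return facing and still
```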

Let us apply random actions for a few steps and record the rewards.

[14]:
rollout = env.unwrapped.rollout(max_steps=10)
rollout.reward
[14]:
array([[-4.49364891],
       [-4.05116681],
       [-4.09256358],
       [-2.57216218],
       [-2.49780775],
       [-4.07027706],
       [-2.98486844],
       [-2.73763285],
       [-2.66497817],
       [-3.88526066]])

Baseline#

We have hand-coded a simple policy to achieve the task.

To evaluate the baseline via script, run:

python -m enki_env.examples.single_robot.baseline
[15]:
print(inspect.getsource(single_robot.Baseline.predict))
    def predict(self,
                observation: Observation,
                state: State | None = None,
                episode_start: EpisodeStart | None = None,
                deterministic: bool = False) -> tuple[Action, State | None]:
        prox = observation['prox/value']
        prox = np.atleast_2d(prox)
        w = 0.25 * (prox[:, 0] + 2 * prox[:, 1] - 2 * prox[:, 3] - prox[:, 4])
        w[np.all(prox[:, :5] == 0, axis=-1)] = 1
        return np.clip([w], -1, 1), None
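
To see the control law in isolation, here is a standalone version of the same computation, taking only the 7 normalized proximity readings:

```python
import numpy as np


def baseline_action(prox: np.ndarray) -> np.ndarray:
    # Steer proportionally to the left/right imbalance of the five front sensors;
    # if none of them sees the wall, rotate at full speed to search for it.
    prox = np.atleast_2d(prox)
    w = 0.25 * (prox[:, 0] + 2 * prox[:, 1] - 2 * prox[:, 3] - prox[:, 4])
    w[np.all(prox[:, :5] == 0, axis=-1)] = 1
    return np.clip(w, -1, 1)
```

A wall seen mostly by the right front sensors yields a negative command (turning toward it), while an all-zero reading yields the full-speed search rotation.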

[16]:
env.unwrapped.display_in_notebook()

Try changing the seed and redoing the rollout!

To evaluate the policy quantitatively, it is better to skip real-time rendering (render_mode=None). Let us evaluate the baseline over 100 episodes.

[17]:
import numpy as np

eval_env = single_robot.make_env(render_mode=None)

rollouts = [eval_env.unwrapped.rollout(seed=i, policy=single_robot.Baseline()) for i in range(100)]
rewards = np.array([rollout.episode_reward for rollout in rollouts])
steps =  np.array([rollout.episode_length for rollout in rollouts])
print(f"Mean reward: {rewards.mean(): .1f}, mean steps: {steps.mean(): .1f}")
Mean reward: -14.6, mean steps:  6.8

Reinforcement learning#

Let us now train and evaluate an RL policy for the same task.

To perform this via script, run:

python -m enki_env.examples.single_robot.rl
[18]:
policy = single_robot.get_policy()

Let us redo the same evaluation, but with the trained policy.

[19]:
rollouts = [eval_env.unwrapped.rollout(seed=i, policy=policy) for i in range(100)]
rewards = np.array([rollout.episode_reward for rollout in rollouts])
steps =  np.array([rollout.episode_length for rollout in rollouts])
print(f"Mean reward: {rewards.mean(): .1f}, mean steps: {steps.mean(): .1f}")
Mean reward: -9.7, mean steps:  3.4

Video#

The examples include a script to generate a video where, for the same initial states, we first play out the episode using the baseline and then using the RL policy. The Thymio changes color between yellow (baseline) and cyan (RL policy).

python -m enki_env.examples.single_robot.video
[20]:
video = single_robot.make_video()
video.display_in_notebook(fps=30, width=640, rd_kwargs=dict(logger=None))
[20]: