Hierarchical Generative Adversarial Imitation Learning with Mid-level Input Generation for Autonomous Driving on Urban Environments

Deriving robust control policies for realistic urban navigation scenarios is not a trivial task. In an end-to-end approach, these policies must map high-dimensional images from the vehicle's cameras to low-level actions such as steering and throttle. While pure Reinforcement Learning (RL) approaches are based exclusively on rewards, Generative Adversarial Imitation Learning (GAIL) agents learn from expert demonstrations while interacting with the environment, which favors GAIL on tasks for which a reward signal is difficult to derive.

In this work, the hGAIL architecture was proposed to solve the autonomous navigation of a vehicle in an end-to-end approach, mapping sensory perceptions directly to low-level actions, while simultaneously learning mid-level input representations of the agent's environment. The proposed hGAIL consists of an hierarchical Adversarial Imitation Learning architecture composed of two main modules: the GAN (Generative Adversarial Nets) which generates the Bird's-Eye View (BEV) representation mainly from the images of three frontal cameras of the vehicle, and the GAIL which learns to control the vehicle based mainly on the BEV predictions from the GAN as input.

Our experiments have shown that GAIL exclusively from cameras (without BEV) fails to even learn the task, while hGAIL, after training, was able to autonomously navigate successfully in all intersections of the city.

Fig. 1 - Hierarchical Generative Adversarial Imitation Learning (hGAIL) for policy learning with mid-level input representation. It basically consists of chained GAN and GAIL networks, where the first one (GAN) generates BEV representation from the vehicle's three frontal cameras, sparse trajectory and high-level command, while the latter (GAIL) outputs the acceleration and steering based on the predicted BEV input (generated by GAN), the current speed and the last applied actions. Both GAN and GAIL learn simultaneously while the agent interacts to the CARLA environment. The discriminator parts of both networks are not shown for the sake of simplicity.

This research has been summarized in a paper that is under review. More details at:

https://github.com/gustavokcouto/hgail

Some of the results can be seen on Youtube as series of videos, showing the learning through interaction with the urban environment:
https://www.youtube.com/playlist?list=PLDjIHGRag4wB2v5KdnInWj7MTCTNEPQ3U

Both the intermediate mid-level input BEV representation and the control policy are learned as the agent navigates in an urban town.
After training, we can see the autonomous vehicle navigation in the video below:

We can also observe the learning of the Bird's-Eye view representation of the Conditional GAN embedded in the hGAIL architecture, which learns concomitantly with the GAIL policy: