Deep Reinforcement Learning in Self Driving Cars 🚙 - Part 1

Sampath Kumaran Ganesan
5 min read · Aug 19, 2023


Today, we live in an era where robots are being granted citizenship (a robot named Sophia received citizenship from the Saudi Arabian government in 2017), exploring other planets and celestial objects in space (e.g. NASA's Mars rovers), performing surgeries (even complex transplant surgeries and tumor extractions), handling transport and navigation (self driving cars and other mobile robots), working in manufacturing (e.g. KUKA robots), agriculture, cooking, the military (e.g. Spot robots from Boston Dynamics), and the list goes on…

Experts say that by 2035, the self driving space will generate revenue of $300 billion to $400 billion. Companies like Tesla (Autopilot), Google (Waymo), Mercedes, Toyota and many others in the industry compete with each other to push the boundaries of this fascinating field.

In the self driving space, SAE (Society of Automotive Engineers) defines six levels of automation, from Level 0 to Level 5 in increasing order of automation: Level 0 is no automation and Level 5 is full automation. As of now, many companies provide Level 2 or Level 3 automation.

In part 1 of the self driving car series, below are the basics that we will cover:

  1. Introduction to Reinforcement Learning
  2. Types of Reinforcement Learning
  3. Introduction to Deep Learning
  4. Basics of Q-learning with an introduction to TD3

In the subsequent parts, we shall see how a deep reinforcement learning algorithm called TD3 (Twin Delayed Deep Deterministic Policy Gradient) can be used in the self driving space with the CARLA simulator.

CARLA has been developed from the ground up to support development, training, and validation of autonomous driving systems.

Autonomous Systems

Reinforcement Learning (RL) is a branch of machine learning concerned with learning the optimal behavior in an environment so as to maximize the cumulative reward, or return.

The main terminologies in RL are:

  1. Agent: The one that takes decisions based on the rewards and punishments it gets from the environment. In the self driving space, for example, the agent is the car itself.
  2. Environment: The world that the Agent lives in and interacts with. For the self driving car, the road and its surroundings are the environment.
  3. State: The current representation of the environment the Agent is in. For example, the self driving car in traffic at a particular distance from its destination.
  4. Action: The mechanism by which the Agent transitions from state to state in an environment. For example, the self driving car taking a right turn.
  5. Reward: Arguably the most important concept in RL. It is the feedback the Agent gets from the Environment for the Action it takes when moving from State A to State B. For example, the self driving car may get a reward for keeping a safe distance from other vehicles in the Environment.

In the diagram below, the Agent interacts with the Environment by taking an Action at time t (Aβ‚œ). By taking the Action, the Agent moves from one State (Sβ‚œ) to the next State (Sβ‚œβ‚Šβ‚) and receives a Reward (Rβ‚œ). This cycle continues until the Agent reaches a terminal state.

Reinforcement Learning Cycle
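
This cycle can be expressed in just a few lines of code. Below is a minimal sketch of the loop, assuming the Gymnasium-style interface (an environment with reset() and step() methods) and a placeholder policy that picks random actions; CartPole is used only as a stand-in environment, not a driving one.

```python
# Minimal sketch of the Agent <-> Environment loop (Gymnasium-style interface assumed).
import gymnasium as gym

env = gym.make("CartPole-v1")          # stand-in environment; the same loop applies to driving
state, _ = env.reset()                 # initial state S_0

total_reward = 0.0
terminated = truncated = False
while not (terminated or truncated):   # loop until a terminal state is reached
    action = env.action_space.sample() # placeholder policy: pick a random action A_t
    state, reward, terminated, truncated, _ = env.step(action)  # observe S_{t+1} and the reward
    total_reward += reward             # accumulate the return

print(f"Episode return: {total_reward}")
```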

There are two main types of Tasks (a task is an instance of the Reinforcement Learning problem): Episodic and Continuous.

Episodic Task: There is a terminal or end state. The sequence of Agent-Environment interactions from the starting state to the ending state is called an Episode. For example, a Tic Tac Toe game, where there is a start state and a terminal state. Often, in this type of task, the reward is given at the end or at a few intermediate states.

Continuous Task: There is no end for this type of task. For example, training a self driving car to keep driving forward or making a humanoid walk by itself. Rewards may arrive throughout the task.
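
One way to see the difference is in how the return is computed: an episodic task sums rewards over a finite episode, while a continuous task never ends, so a discount factor gamma < 1 is needed to keep the sum finite. Here is a small illustrative sketch; the reward values are made up.

```python
# Discounted return G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):   # work backwards so each reward is discounted correctly
        g = r + gamma * g
    return g

episode_rewards = [0.0, 0.0, 1.0, -0.5, 2.0]   # a finite list of rewards -> episodic task
print(discounted_return(episode_rewards))

# In a continuous task the reward stream never ends, so gamma < 1 is required
# for the (infinite) sum to converge.
```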

We also need to understand what a Policy is in Reinforcement Learning. A policy is simply a function that maps states to actions. The policy the Agent actually follows to select actions in the environment is called the behavior policy. The policy the Agent is trying to learn, improving it from the rewards received for its actions, is called the target policy.

  1. Off-Policy: the behavior policy is different from the target policy
  2. On-Policy: the behavior policy is the same as the target policy
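
As a concrete illustration, Q-learning is off-policy: the behavior policy that collects experience can be ε-greedy (it sometimes explores), while the target policy used in the update is purely greedy. Here is a small sketch, assuming the Q-values are stored in a NumPy table; the names and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def behavior_policy(q_table, state, epsilon=0.1):
    """Epsilon-greedy: used to act in the environment, explores with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))   # random exploratory action
    return int(np.argmax(q_table[state]))            # otherwise act greedily

def target_policy(q_table, state):
    """Purely greedy: the policy the agent is actually learning about."""
    return int(np.argmax(q_table[state]))

# Off-policy: behavior_policy != target_policy (as in Q-learning).
# On-policy: the same (e.g. epsilon-greedy) policy is used both to act and to learn (as in SARSA).
```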

So, moving on, what is Deep Reinforcement Learning 💭? It is a combination of Deep Learning and Reinforcement Learning.

Deep Learning: A branch of Machine Learning that uses Neural Networks (inspired by the human brain) with three or more layers. It has the ability to learn from large volumes of data. It is also called representation learning because it can learn features by itself, whereas in classical machine learning we need to select the appropriate features ourselves. Deep learning algorithms are mainly used on unstructured data like images, text, audio, etc.

In the diagram below, there is an Input layer, three hidden layers and an Output layer. The hidden layers learn complex patterns from the inputs and produce an output through nonlinear activation functions like the rectified linear unit, hyperbolic tangent, etc. In the self driving space, for example, we can use deep learning to classify road signs, detect objects and pedestrians in the environment, segment road lanes and drivable pathways, and so on.
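
A network like the one in the diagram (an input layer, three hidden layers, an output layer) can be written in a few lines. The sketch below uses PyTorch purely for illustration; the layer sizes and the "road-sign" output are placeholders, not values from this article.

```python
import torch
import torch.nn as nn

# A small fully connected network: input layer, three hidden layers, output layer.
# Sizes are illustrative (e.g. 10 input features, 4 output classes).
model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),   # hidden layer 1 with a nonlinear activation
    nn.Linear(64, 64), nn.ReLU(),   # hidden layer 2
    nn.Linear(64, 64), nn.ReLU(),   # hidden layer 3
    nn.Linear(64, 4),               # output layer (e.g. logits for 4 road-sign classes)
)

x = torch.randn(8, 10)              # a batch of 8 made-up input vectors
print(model(x).shape)               # -> torch.Size([8, 4])
```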

So far, we have seen the basics of deep reinforcement learning. Now we will dive into the TD3 algorithm and its use in self driving cars.

TD3: An off-policy deep reinforcement learning algorithm designed for continuous action spaces, such as self driving or robotics. It follows an Actor-Critic approach. Actor: the network that outputs an action given the current state of the environment. Critic: the network that estimates the value of that action and provides the signal used to critique the Actor. Q-learning is an essential building block of the TD3 algorithm.
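
To make the Actor-Critic structure concrete, here is a rough sketch of what the networks in TD3 typically look like: a deterministic Actor that maps a state to a continuous action, and two Critics (hence "Twin") that each estimate the Q-value of a state-action pair. This uses PyTorch, and the sizes and names are illustrative placeholders, not the exact architecture we will use later.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: state -> continuous action in [-max_action, max_action]."""
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # squash to [-1, 1]
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class Critic(nn.Module):
    """Q-value estimate for a (state, action) pair; TD3 keeps two of these."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# TD3 uses one actor and two critics; the smaller of the two Q estimates is used in the target,
# which is what makes it "Twin". Dimensions below are placeholders.
actor = Actor(state_dim=24, action_dim=2)
critic_1, critic_2 = Critic(24, 2), Critic(24, 2)
```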

Q-Learning: A value-based method that learns how good each action is in each state, so that the agent can find the best next action given its current state.

Q-Learning Pseudocode

Above is the pseudocode for Q-Learning. The steps are explained in detail below, followed by a minimal code sketch:

  1. Initialize a Q matrix over states (s) and actions (a)
  2. In the outer loop (which represents an episode), initialize/reset the state (s)
  3. Then, for each step of the episode, choose an action from the current state using the policy derived from Q (e.g. ε-greedy)
  4. Once the action is taken, receive the reward and observe the new state
  5. Update the Q matrix using the temporal-difference rule: Q(s, a) ← Q(s, a) + α [r + γ · max Q(s', a') − Q(s, a)]
  6. Make the new state the current state and continue until the new state is a terminal/end state
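
A minimal tabular implementation of these steps might look like the following sketch. It assumes a small discrete environment (Gymnasium's FrozenLake is used here only as a stand-in, not a driving task) and the standard temporal-difference update; the hyperparameters are illustrative.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")                       # small discrete stand-in environment
n_states, n_actions = env.observation_space.n, env.action_space.n

Q = np.zeros((n_states, n_actions))                   # 1. initialize the Q matrix
alpha, gamma, epsilon = 0.1, 0.99, 0.1                # illustrative hyperparameters
rng = np.random.default_rng(0)

for episode in range(5000):
    state, _ = env.reset()                            # 2. reset the state at the start of each episode
    done = False
    while not done:
        # 3. choose an action from the current state using an epsilon-greedy policy derived from Q
        if rng.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        # 4. take the action, receive the reward and observe the new state
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # 5. temporal-difference update: Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state                            # 6. the new state becomes the current state
```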

In the next part of the series, we will dive deeper into TD3 and its usage in self driving cars with simulations. Stay tuned for the next part.
