Toronto Public Transportation Optimization with DQN

Last updated on Jan 19, 2021

Coordinated Transit Signal Priority (cTSP)

Introduction

This project coordinates Transit Signal Priority across two traffic intersections in the Toronto area, aiming to improve transit speed and reliability with a single deep reinforcement learning agent.

Model

1. Time step

Time steps are renewed upon bus check-in events. A time step is a point in time at which a bus is detected by a check-in loop detector. Every check-in event at any check-in loop detector in the system (the system comprises all intersections and the road segments connecting them) initiates a new time step. The implementation should avoid bugs when more than one bus checks in at the same time, for example when Bus A checks in at time x at Intx 1 and Bus B checks in at time x at Intx 2.

Example: Buses 1, 2, and 3 are in the system at time step t. When Bus 4 checks in, time step t+1 is initiated.
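One possible way to handle simultaneous check-ins is to treat all check-in events that share a timestamp as a single time step. The sketch below illustrates that idea only; the `CheckIn` record, its fields, and the `to_time_steps` helper are hypothetical names, and merging same-time events is an assumption rather than the project's confirmed behaviour.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class CheckIn:
    """Hypothetical record of one check-in detection."""
    time: float        # detection time (unit assumed to be seconds)
    intersection: int  # intersection whose check-in detector fired
    bus_id: str


def to_time_steps(events):
    """Group check-in events into time steps, one step per distinct timestamp.

    Assumption: events with the same timestamp (e.g. Bus A at Intx 1 and
    Bus B at Intx 2) start one shared time step instead of two.
    """
    ordered = sorted(events, key=lambda e: (e.time, e.intersection))
    return [list(group) for _, group in groupby(ordered, key=lambda e: e.time)]


events = [
    CheckIn(time=10.0, intersection=1, bus_id="A"),
    CheckIn(time=10.0, intersection=2, bus_id="B"),
    CheckIn(time=42.5, intersection=1, bus_id="C"),
]
steps = to_time_steps(events)  # two time steps: [A, B] and [C]
```

With this grouping, Bus A and Bus B contribute to the same state instead of triggering two back-to-back decisions at time x.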

At each time step, the RL model:

- reads the state of the current environment,
- chooses an action, and
- calculates the reward of the last time step.
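Because the reward for a time step only becomes known at the next check-in, each transition is completed one step late. The following is a minimal sketch of that bookkeeping, not the project's actual training loop; `StepLoop`, `on_check_in`, and the `policy` callable are hypothetical names, and the state and action are treated as opaque values.

```python
class StepLoop:
    """Pairs each (state, action) with the reward observed at the next time step."""

    def __init__(self, policy):
        self.policy = policy      # hypothetical callable: state -> action
        self.prev = None          # (state, action) from the previous time step
        self.transitions = []     # completed (s, a, r, s_next) tuples

    def on_check_in(self, state, reward_for_last_step):
        # 1. The state has been read by the caller and is passed in.
        # 2. Choose an action for the current time step.
        action = self.policy(state)
        # 3. The reward of the last time step is now known, so the
        #    previous transition can be completed and stored.
        if self.prev is not None:
            prev_state, prev_action = self.prev
            self.transitions.append(
                (prev_state, prev_action, reward_for_last_step, state)
            )
        self.prev = (state, action)
        return action


loop = StepLoop(policy=lambda state: 0)  # placeholder policy: always do-nothing
loop.on_check_in(state=(1.0,), reward_for_last_step=0.0)
loop.on_check_in(state=(2.0,), reward_for_last_step=0.35)
```

The stored tuples have the shape a replay buffer would typically hold; how the agent actually consumes them is not covered here.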

2. State

States are collected when time steps are renewed. A state includes observations at all intersections in the system, which contain bus-, traffic-, and signal-related information. Each intersection has the following observations:

Upstream of the POZ, downstream of the upstream intersection (prePOZ)
- Check-out time of the bus closest to the downstream POZ
- Number of buses
In the POZ
- Last available check-out time
  If buses are bunched (the POZ has more than one bus, i.e. number of buses > 1), use the current time as the check-out time
- Check-in time of the current bus (the current time) that initiated this time step
- Check-in headway
- Number of buses in the POZ
- Number of cars in the POZ
- Time to the end of EW green, excluding any registered action
  Registered action: any action that has been planned but not yet executed, or that is being executed at the time of check-in
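The observations listed above could be collected per intersection and concatenated into one flat state vector. This is a sketch under assumptions: the `IntersectionObs` field names, the field ordering, and the absence of any normalization are illustrative choices, not taken from the project.

```python
from dataclasses import dataclass, astuple


@dataclass
class IntersectionObs:
    """Hypothetical container for one intersection's observations."""
    # prePOZ observations
    prepoz_checkout_time: float   # check-out time of the bus closest to the downstream POZ
    prepoz_num_buses: int
    # POZ observations
    last_checkout_time: float
    checkin_time: float           # current time, for the bus that initiated the step
    checkin_headway: float
    poz_num_buses: int
    poz_num_cars: int
    time_to_end_of_ew_green: float  # excludes any registered action


def last_available_checkout_time(recorded_checkout_time, poz_num_buses, now):
    """Bunching rule: with more than one bus in the POZ, use the current time."""
    return now if poz_num_buses > 1 else recorded_checkout_time


def build_state(observations):
    """Concatenate per-intersection observations into one flat list of floats."""
    state = []
    for obs in observations:
        state.extend(float(value) for value in astuple(obs))
    return state


now = 130.0
intx1 = IntersectionObs(95.0, 1, last_available_checkout_time(100.0, 2, now),
                        now, 310.0, 2, 14, 12.0)
intx2 = IntersectionObs(80.0, 0, last_available_checkout_time(100.0, 1, now),
                        now, 290.0, 1, 9, 25.0)
state = build_state([intx1, intx2])  # 8 values per intersection
```

In this example the first intersection is bunched, so its last available check-out time is replaced by the current time, while the second keeps its recorded value.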

3. Action

An action is chosen at every time step, as soon as the RL model receives the state. The action adjusts the duration of the first available EW green at each intersection at time step t.

- If a bus checks in during EW red, the adjustment is made to the first available EW green following the red.
- If a bus checks in during EW green, the adjustment is made to the current EW green.

Example: A bus checks in at intersection 1 at time step t. At time step t, the phase at Intx 1 is red in the direction of bus movement, so the adjustment is made to the EW green following the red. At time step t, the phase at Intx 2 is EW green, so the adjustment is made to this EW green.

To ensure consistency with iTSP, the actions are EW green truncations of -20, -15, -10, and -5 s, do-nothing, and EW green extensions of +5, +10, +15, and +20 s.

When a_t is selected, if an intersection already has a registered action (possibly decided at time step t-1), a_t overwrites a_{t-1} if possible. If a truncation action is selected and the truncation amount exceeds the remaining EW green, the EW green ends “now.”
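The truncation and overwrite rules can be illustrated with a short sketch. The helper names (`apply_adjustment`, `register_action`) and the dictionary of registered actions are hypothetical, and the overwrite here is unconditional, which simplifies the "if possible" condition because that condition is not specified above.

```python
# Adjustment choices in seconds: truncations, do-nothing (0), extensions.
ADJUSTMENTS = (-20, -15, -10, -5, 0, 5, 10, 15, 20)


def apply_adjustment(remaining_green, adjustment):
    """Return the remaining EW green (seconds) after applying an adjustment.

    If a truncation is larger than the remaining green, the result is
    clamped to 0, i.e. the EW green ends "now".
    """
    if adjustment not in ADJUSTMENTS:
        raise ValueError(f"unsupported adjustment: {adjustment}")
    return max(0.0, remaining_green + adjustment)


def register_action(registered, intersection, adjustment):
    """Store the latest adjustment for an intersection.

    Assumption: a newer action always replaces the registered one.
    """
    registered[intersection] = adjustment
    return registered


registered = {}
register_action(registered, intersection=1, adjustment=10)   # decided at step t-1
register_action(registered, intersection=1, adjustment=-20)  # decided at step t

truncated = apply_adjustment(remaining_green=12.0, adjustment=-20)
extended = apply_adjustment(remaining_green=12.0, adjustment=10)
```

Here a 20 s truncation against 12 s of remaining green leaves 0 s, while a 10 s extension leaves 22 s.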

4. Reward

The reward associated with the state and action at time step t is calculated at time step t+1. Rewards are computed using data (headway and travel time in the POZ) from all check-out events that occurred between time steps t and t+1.

Example: If buses A and B check out at two different intersections (or at the same intersection) between time steps t and t+1, then r_t = r_A + r_B = 0.6*(headway improvement of bus A) - 0.4*(travel time of bus A in the POZ) + 0.6*(headway improvement of bus B) - 0.4*(travel time of bus B in the POZ)

If no bus checks out between time steps t and t+1, r_t = 0.
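The reward rule can be written as a sum over the check-out events observed between two time steps. The sketch below uses the 0.6 and 0.4 weights from the example; the function name and the representation of each check-out as a `(headway_improvement, travel_time)` pair are assumptions, and the units or scaling of the two terms are left open because they are not specified above.

```python
HEADWAY_WEIGHT = 0.6
TRAVEL_TIME_WEIGHT = 0.4


def step_reward(checkouts):
    """Sum per-bus rewards for all check-outs between time steps t and t+1.

    `checkouts` is an iterable of (headway_improvement, travel_time_in_poz)
    pairs, one per check-out event. An empty iterable gives a reward of 0.
    """
    return sum(
        (HEADWAY_WEIGHT * headway_improvement - TRAVEL_TIME_WEIGHT * travel_time
         for headway_improvement, travel_time in checkouts),
        0.0,
    )


# Bus A and bus B both check out between time steps t and t+1.
r_t = step_reward([(1.0, 0.5), (0.5, 1.0)])
# No check-out between two time steps.
r_empty = step_reward([])
```

For the two illustrative buses, r_A = 0.6*1.0 - 0.4*0.5 = 0.4 and r_B = 0.6*0.5 - 0.4*1.0 = -0.1, so r_t is approximately 0.3.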