1. Introduction

This is an Exploratory Data Analysis (EDA) kernel for the Lyft Motion Prediction for Autonomous Vehicles competition dataset.

We start with the analysis preparation which, for this competition, requires installing and loading several packages to load and manage the l5kit dataset.
We then move on to data exploration, reviewing the agents, the scenes and the frames, and finish with an inspection of the animated scenes.

2. Analysis preparation

2.1. Install & load packages

We will have to install l5kit to access the data.
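
A minimal install command, assuming internet access is enabled for the kernel (an offline Kaggle kernel would instead install l5kit from an attached wheels dataset):

# install l5kit (assumes the kernel has internet access)
!pip install -q l5kit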

In [2]:
import os
import numpy as np
import pandas as pd
from l5kit.data import ChunkedDataset, LocalDataManager
from l5kit.dataset import EgoDataset, AgentDataset
from l5kit.rasterization import build_rasterizer
from l5kit.configs import load_config_data
from l5kit.visualization import draw_trajectory, TARGET_POINTS_COLOR
from l5kit.geometry import transform_points
from l5kit.data import PERCEPTION_LABELS
from tqdm import tqdm
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns 
from matplotlib import animation, rc
from matplotlib.ticker import MultipleLocator
from IPython.display import display, clear_output
import PIL
from IPython.display import HTML

rc('animation', html='jshtml')

2.2. Configuration

We set the local dataset configuration before accessing the data: we point the L5KIT_DATA_FOLDER environment variable to the l5kit data folder and load the Lyft configuration from a YAML file provided in an external dataset.

In [3]:
os.environ["L5KIT_DATA_FOLDER"] = "/kaggle/input/lyft-motion-prediction-autonomous-vehicles"
cfg = load_config_data("/kaggle/input/lyft-config-files/visualisation_config.yaml")
print(cfg)
{'format_version': 4, 'model_params': {'model_architecture': 'resnet50', 'history_num_frames': 0, 'history_step_size': 1, 'history_delta_time': 0.1, 'future_num_frames': 50, 'future_step_size': 1, 'future_delta_time': 0.1}, 'raster_params': {'raster_size': [224, 224], 'pixel_size': [0.5, 0.5], 'ego_center': [0.25, 0.5], 'map_type': 'py_semantic', 'satellite_map_key': 'aerial_map/aerial_map.png', 'semantic_map_key': 'semantic_map/semantic_map.pb', 'dataset_meta_key': 'meta.json', 'filter_agents_threshold': 0.5}, 'val_data_loader': {'key': 'scenes/sample.zarr', 'batch_size': 12, 'shuffle': False, 'num_workers': 16}}
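
The loaded configuration is a plain nested dict; for instance, the raster parameters and the validation data loader key (used below to locate the sample scenes) can be read directly:

# a quick peek at a few configuration values used later in this kernel
print(cfg["raster_params"]["raster_size"], cfg["raster_params"]["pixel_size"])
print(cfg["val_data_loader"]["key"])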

3. Load data

We load the dataset using the l5kit local data manager. L5kit stores the data in zarr format, where the arrays are split into chunks and compressed.

In [4]:
# local data manager
dm = LocalDataManager()
# set dataset path
dataset_path = dm.require(cfg["val_data_loader"]["key"])
# load the dataset; this is a zarr format, chunked dataset
chunked_dataset = ChunkedDataset(dataset_path)
# open the dataset
chunked_dataset.open()
print(chunked_dataset)
+------------+------------+------------+---------------+-----------------+----------------------+----------------------+----------------------+---------------------+
| Num Scenes | Num Frames | Num Agents | Num TR lights | Total Time (hr) | Avg Frames per Scene | Avg Agents per Frame | Avg Scene Time (sec) | Avg Frame frequency |
+------------+------------+------------+---------------+-----------------+----------------------+----------------------+----------------------+---------------------+
|    100     |   24838    |  1893736   |     316008    |       0.69      |        248.38        |        76.24         |        24.83         |        10.00        |
+------------+------------+------------+---------------+-----------------+----------------------+----------------------+----------------------+---------------------+
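
The summary above is built from the dataset's zarr arrays (scenes, frames, agents, traffic light faces). Assuming they expose the standard zarr attributes, we can peek at their shapes and chunking directly:

# inspect the underlying zarr arrays (shape, chunk size, dtype)
print(chunked_dataset.scenes.shape, chunked_dataset.scenes.chunks)
print(chunked_dataset.frames.shape, chunked_dataset.frames.chunks)
print(chunked_dataset.agents.shape, chunked_dataset.agents.chunks, chunked_dataset.agents.dtype)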

4. Data exploration

4.1. Explore the dataset

We load and inspect the entities in the dataset: the agents, the scenes and the frames.

4.1.1. Agents

We start with the agents.

In [5]:
agents = chunked_dataset.agents
agents_df = pd.DataFrame(agents)
agents_df.columns = ["data"]; features = ['centroid', 'extent', 'yaw', 'velocity', 'track_id', 'label_probabilities']

for i, feature in enumerate(features):
    agents_df[feature] = agents_df['data'].apply(lambda x: x[i])
agents_df.drop(columns=["data"],inplace=True)
print(f"agents dataset: {agents_df.shape}")
agents_df.head()
agents dataset: (1893736, 6)
Out[5]:
centroid extent yaw velocity track_id label_probabilities
0 [665.0342407226562, -2207.51220703125] [4.3913283, 1.8138304, 1.5909758] 1.016675 [0.0, 0.0] 1 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
1 [717.6612548828125, -2173.760009765625] [5.150925, 1.9530917, 2.04021] -0.783224 [0.0, 0.0] 2 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2 [730.681396484375, -2180.678955078125] [2.9482825, 1.4842174, 1.1125067] -0.321747 [0.0, 0.0] 3 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
3 [671.2536010742188, -2204.745361328125] [1.7067024, 0.9287868, 0.6282158] 0.785501 [0.0, 0.0] 4 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
4 [669.7763061523438, -2213.004638671875] [0.25109944, 0.6343781, 1.654377] 1.492359 [0.0, 0.0] 5 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...

The fields in the agents dataset are the following:

  • centroid - the agent position (in the plane - two dimensions);
  • extent - the agent dimensions (three dimensions; let's call them length, width and height);
  • yaw - the agent heading, i.e. its rotation about the vertical axis;
  • velocity - the agent velocity vector in the plane (a speed and a heading in degrees are derived from these two fields in the sketch after this list);
  • track_id - the index of the track associated with the agent;
  • label_probabilities - the probability that the agent belongs to each of 17 different agent types; we will explore these labels in a moment.
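
As a small illustration of the yaw and velocity fields, a planar speed and a heading in degrees can be derived per agent; the speed and yaw_deg names below are our own, not dataset fields:

# derived quantities, for illustration only
speed = agents_df['velocity'].apply(lambda v: np.hypot(v[0], v[1]))  # planar speed magnitude
yaw_deg = np.degrees(agents_df['yaw'])                               # heading in degrees
print(speed.describe())
print(yaw_deg.describe())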

Let's look at the distribution of a few of these values.

Centroid distribution

In [6]:
agents_df['cx'] = agents_df['centroid'].apply(lambda x: x[0])
agents_df['cy'] = agents_df['centroid'].apply(lambda x: x[1])
In [7]:
fig, ax = plt.subplots(1,1,figsize=(8,8))
plt.scatter(agents_df['cx'], agents_df['cy'], marker='+')
plt.xlabel('x', fontsize=11); plt.ylabel('y', fontsize=11)
plt.title("Centroids distribution")
plt.show()

Extent distribution

In [8]:
agents_df['ex'] = agents_df['extent'].apply(lambda x: x[0])
agents_df['ey'] = agents_df['extent'].apply(lambda x: x[1])
agents_df['ez'] = agents_df['extent'].apply(lambda x: x[2])
In [9]:
sns.set_style('whitegrid')

fig, ax = plt.subplots(1,3,figsize=(16,5))
plt.subplot(1,3,1)
plt.scatter(agents_df['ex'], agents_df['ey'], marker='+')
plt.xlabel('ex', fontsize=11); plt.ylabel('ey', fontsize=11)
plt.title("Extent: ex-ey")
plt.subplot(1,3,2)
plt.scatter(agents_df['ey'], agents_df['ez'], marker='+', color="red")
plt.xlabel('ey', fontsize=11); plt.ylabel('ez', fontsize=11)
plt.title("Extent: ey-ez")
plt.subplot(1,3,3)
plt.scatter(agents_df['ez'], agents_df['ex'], marker='+', color="green")
plt.xlabel('ez', fontsize=11); plt.ylabel('ex', fontsize=11)
plt.title("Extent: ez-ex")
plt.show();

Yaw

Let's see the yaw distribution.

In [10]:
fig, ax = plt.subplots(1,1,figsize=(8,8))
sns.distplot(agents_df['yaw'],color="magenta")
plt.title("Yaw distribution")
plt.show()

Velocity

Let's look at the velocity distribution.

In [11]:
agents_df['vx'] = agents_df['velocity'].apply(lambda x: x[0])
agents_df['vy'] = agents_df['velocity'].apply(lambda x: x[1])
In [12]:
fig, ax = plt.subplots(1,1,figsize=(8,8))
plt.title("Velocity distribution")
plt.scatter(agents_df['vx'], agents_df['vy'], marker='.', color="red")
plt.xlabel('vx', fontsize=11); plt.ylabel('vy', fontsize=11)
plt.show();

Track id

In [13]:
print("Number of tracks: ", agents_df.track_id.nunique())
print("Entries per track id (first 10): \n", agents_df.track_id.value_counts()[0:10])
                                        
Number of tracks:  2547
Entries per track id (first 10): 
 1     14922
2     12377
3     10179
5      9108
6      8605
4      8224
9      7927
7      7371
10     7345
8      7050
Name: track_id, dtype: int64

Let's look at the distribution of the label probabilities.

In [14]:
# most likely label for each agent, taken as the argmax of its label probabilities
probabilities = agents["label_probabilities"]
labels_indexes = np.argmax(probabilities, axis=1)
counts = []
for idx_label, label in enumerate(PERCEPTION_LABELS):
    counts.append(np.sum(labels_indexes == idx_label))

# note: we reuse the agents_df name here for the per-label counts table
agents_df = pd.DataFrame()
for count, label in zip(counts, PERCEPTION_LABELS):
    agents_df = agents_df.append(pd.DataFrame({'label':label, 'count':count},index=[0]))
agents_df = agents_df.reset_index().drop(columns=['index'], axis=1)
In [15]:
print(f"agents probabilities dataset: {agents_df.shape}")
agents_df  
agents probabilities dataset: (17, 2)
Out[15]:
label count
0 PERCEPTION_LABEL_NOT_SET 0
1 PERCEPTION_LABEL_UNKNOWN 1324481
2 PERCEPTION_LABEL_DONTCARE 0
3 PERCEPTION_LABEL_CAR 519385
4 PERCEPTION_LABEL_VAN 0
5 PERCEPTION_LABEL_TRAM 0
6 PERCEPTION_LABEL_BUS 0
7 PERCEPTION_LABEL_TRUCK 0
8 PERCEPTION_LABEL_EMERGENCY_VEHICLE 0
9 PERCEPTION_LABEL_OTHER_VEHICLE 0
10 PERCEPTION_LABEL_BICYCLE 0
11 PERCEPTION_LABEL_MOTORCYCLE 0
12 PERCEPTION_LABEL_CYCLIST 6688
13 PERCEPTION_LABEL_MOTORCYCLIST 0
14 PERCEPTION_LABEL_PEDESTRIAN 43182
15 PERCEPTION_LABEL_ANIMAL 0
16 AVRESEARCH_LABEL_DONTCARE 0

There are 4 different agent types actually present in the dataset, as follows (their relative shares are computed in the sketch after this list):

  • PERCEPTION_LABEL_UNKNOWN - majority;
  • PERCEPTION_LABEL_CAR;
  • PERCEPTION_LABEL_CYCLIST;
  • PERCEPTION_LABEL_PEDESTRIAN.
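
A quick way to quantify this is to compute each label's share of the agents; the shares variable below is our own, for illustration:

# fraction of agents per perception label (only the non-zero ones)
shares = agents_df.set_index('label')['count'] / agents_df['count'].sum()
print(shares[shares > 0].sort_values(ascending=False))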

Let's look at their distribution:

In [16]:
f, ax = plt.subplots(1,1, figsize=(10,4))
plt.scatter(agents_df['label'], agents_df['count']+1, marker='*')
plt.xticks(rotation=90, size=8)
plt.xlabel('Perception label')
plt.ylabel(f'Agents count')
plt.title("Agents perception label values count distribution")
plt.grid(True)
ax.set(yscale="log")
plt.show()

4.1.2. Scenes

Let's now look at the scenes.

In [17]:
scenes = chunked_dataset.scenes
scenes_df = pd.DataFrame(scenes)
scenes_df.columns = ["data"]; features = ['frame_index_interval', 'host', 'start_time', 'end_time']
for i, feature in enumerate(features):
    scenes_df[feature] = scenes_df['data'].apply(lambda x: x[i])
scenes_df.drop(columns=["data"],inplace=True)
print(f"scenes dataset: {scenes_df.shape}")
scenes_df.head()
scenes dataset: (100, 4)
Out[17]:
frame_index_interval host start_time end_time
0 [0, 248] host-a013 1572643684617362176 1572643709617362176
1 [248, 497] host-a013 1572643749559148288 1572643774559148288
2 [497, 746] host-a013 1572643774559148288 1572643799559148288
3 [746, 995] host-a013 1572643799559148288 1572643824559148288
4 [995, 1244] host-a013 1572643824559148288 1572643849559148288
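
The start and end times appear to be Unix epoch timestamps in nanoseconds; under that assumption, a quick check of the scene durations should be consistent with the ~25 s average scene time reported in the dataset summary:

# scene durations in seconds, assuming nanosecond epoch timestamps
durations_s = (scenes_df['end_time'] - scenes_df['start_time']) / 1e9
print(durations_s.describe())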
In [18]:
f, ax = plt.subplots(1,1, figsize=(6,4))
sns.countplot(scenes_df.host)
plt.xlabel('Host')
plt.ylabel(f'Count')
plt.title("Scenes host count distribution")
plt.show()

Let's show the scenes' frame index succession on the same graph, grouped by host.

In [19]:
scenes_df['frame_index_start'] = scenes_df['frame_index_interval'].apply(lambda x: x[0])
scenes_df['frame_index_end'] = scenes_df['frame_index_interval'].apply(lambda x: x[1])
scenes_df.head()
Out[19]:
frame_index_interval host start_time end_time frame_index_start frame_index_end
0 [0, 248] host-a013 1572643684617362176 1572643709617362176 0 248
1 [248, 497] host-a013 1572643749559148288 1572643774559148288 248 497
2 [497, 746] host-a013 1572643774559148288 1572643799559148288 497 746
3 [746, 995] host-a013 1572643799559148288 1572643824559148288 746 995
4 [995, 1244] host-a013 1572643824559148288 1572643849559148288 995 1244
In [20]:
f, ax = plt.subplots(1,1, figsize=(8,8))
spacing = 498
minorLocator = MultipleLocator(spacing)
ax.yaxis.set_minor_locator(minorLocator)
ax.xaxis.set_minor_locator(minorLocator)
plt.xlabel('Start frame index')
plt.ylabel(f'End frame index')
plt.grid(which = 'minor')
plt.title("Frames scenes start and end index (grouped per host)")
sns.scatterplot(scenes_df['frame_index_start'], scenes_df['frame_index_end'], marker='|',  hue=scenes_df['host'])
plt.show()

4.1.3. Frames

We now look at the frames.

In [21]:
frames_df = pd.DataFrame(chunked_dataset.frames)
frames_df.columns = ["data"]; features = ['timestamp', 'agent_index_interval', 'traffic_light_faces_index_interval', 
                                          'ego_translation','ego_rotation']
for i, feature in enumerate(features):
    frames_df[feature] = frames_df['data'].apply(lambda x: x[i])
frames_df.drop(columns=["data"],inplace=True)
print(f"frames dataset: {frames_df.shape}")
frames_df.head()
frames dataset: (24838, 5)
Out[21]:
timestamp agent_index_interval traffic_light_faces_index_interval ego_translation ego_rotation
0 1572643684801892606 [0, 38] [0, 0] [680.6197509765625, -2183.32763671875, 288.541... [[0.5467331409454346, -0.837294340133667, 0.00...
1 1572643684901714926 [38, 85] [0, 0] [681.1856079101562, -2182.42236328125, 288.608... [[0.5470812916755676, -0.837059736251831, 0.00...
2 1572643685001499246 [85, 142] [0, 0] [681.7647094726562, -2181.522705078125, 288.68... [[0.5479603409767151, -0.8364874720573425, 0.0...
3 1572643685101394026 [142, 200] [0, 0] [682.3414306640625, -2180.624267578125, 288.75... [[0.5491225123405457, -0.8357341885566711, 0.0...
4 1572643685201412346 [200, 254] [0, 0] [682.9197998046875, -2179.73046875, 288.827392... [[0.5504215955734253, -0.8348868489265442, -7....

The frames are described by:

  • timestamp (converted to a readable datetime in the sketch after this list);
  • agent index interval;
  • traffic light faces index interval;
  • ego translation;
  • ego rotation.
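
The timestamps also appear to be Unix epoch times in nanoseconds. As a sanity check under that assumption, we can convert them to datetimes and look at the spacing between consecutive frames, which should be roughly 0.1 s (the 10 Hz frame frequency from the dataset summary):

# convert nanosecond epoch timestamps to datetimes and check the typical frame spacing
ts = pd.to_datetime(frames_df['timestamp'], unit='ns')
print(ts.min(), ts.max())
print(ts.diff().dt.total_seconds().median())  # expected to be about 0.1 s (10 Hz)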

Let's look at the ego translations.

Ego translations

In [22]:
frames_df['dx'] = frames_df['ego_translation'].apply(lambda x: x[0])
frames_df['dy'] = frames_df['ego_translation'].apply(lambda x: x[1])
frames_df['dz'] = frames_df['ego_translation'].apply(lambda x: x[2])
In [23]:
sns.set_style('whitegrid')

fig, ax = plt.subplots(1,3,figsize=(16,5))

plt.subplot(1,3,1)
plt.scatter(frames_df['dx'], frames_df['dy'], marker='+')
plt.xlabel('dx', fontsize=11); plt.ylabel('dy', fontsize=11)
plt.title("Translations: dx-dy")
plt.subplot(1,3,2)
plt.scatter(frames_df['dy'], frames_df['dz'], marker='+', color="red")
plt.xlabel('dy', fontsize=11); plt.ylabel('dz', fontsize=11)
plt.title("Translations: dy-dz")
plt.subplot(1,3,3)
plt.scatter(frames_df['dz'], frames_df['dx'], marker='+', color="green")
plt.xlabel('dz', fontsize=11); plt.ylabel('dx', fontsize=11)
plt.title("Translations: dz-dx")

fig.suptitle("Ego translations in 2D planes of the 3 components (dx,dy,dz)", size=14)
plt.show();
In [24]:
fig, ax = plt.subplots(1,3,figsize=(16,5))
colors = ['magenta', 'orange', 'darkblue']; labels= ["dx", "dy", "dz"]
for i in range(0,3):
    df = frames_df['ego_translation'].apply(lambda x: x[i])
    plt.subplot(1,3,i + 1)
    sns.distplot(df, hist=False, color = colors[ i ])
    plt.xlabel(labels[i])
fig.suptitle("Ego translations distribution", size=14)
plt.show()

Ego rotations

Let's also plot the distributions of the ego rotation components. The rotation matrix is 3 x 3; a single heading angle can also be derived from it, as sketched after the plot below.

In [25]:
fig, ax = plt.subplots(3,3,figsize=(16,16))
colors = ['red', 'blue', 'green', 'magenta', 'orange', 'darkblue', 'black', 'cyan', 'darkgreen']
for i in range(0,3):
    for j in range(0,3):
        df = frames_df['ego_rotation'].apply(lambda x: x[i][j])
        plt.subplot(3,3,i * 3 + j + 1)
        sns.distplot(df, hist=False, color = colors[ i * 3 + j  ])
        plt.xlabel(f'r[ {i + 1} ][ {j + 1} ]')
fig.suptitle("Ego rotation angles distribution", size=14)
plt.show()
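
A minimal sketch of extracting a heading angle from the rotation matrix, assuming the usual convention yaw = atan2(r21, r11); the ego_yaw name is our own:

# ego heading derived from the 3x3 rotation matrix (assumed convention: atan2(r21, r11))
ego_yaw = frames_df['ego_rotation'].apply(lambda R: np.arctan2(R[1][0], R[0][0]))
fig, ax = plt.subplots(1,1,figsize=(8,5))
sns.distplot(ego_yaw, hist=False, color='purple')
plt.xlabel('ego yaw (rad)')
plt.title("Ego heading derived from ego_rotation")
plt.show()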

Traffic lights faces index interval

In [26]:
frames_df['tlfii0'] = frames_df['traffic_light_faces_index_interval'].apply(lambda x: x[0])
frames_df['tlfii1'] = frames_df['traffic_light_faces_index_interval'].apply(lambda x: x[1])
sns.set_style('whitegrid')
fig, ax = plt.subplots(1,1,figsize=(8,8))
plt.scatter(frames_df['tlfii0'], frames_df['tlfii1'], marker='+')
plt.xlabel('Traffic lights faces index interval [0]', fontsize=11); plt.ylabel('Traffic lights faces index interval [1]', fontsize=11)
plt.title("Traffic lights faces index interval")
plt.show()

Agents index interval

In [27]:
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
colors = ['cyan', 'darkgreen']
for i in range(0,2):
    df = frames_df['agent_index_interval'].apply(lambda x: x[i])
    plt.subplot(1, 2, i + 1)
    sns.distplot(df, hist=False, color = colors[ i ])
    plt.xlabel(f'agent index interval [ {i} ]')
fig.suptitle("Agent index interval", size=14)
plt.show()