We are ready now to play with the data to create the visualisations.
Challenges:
To obtain the data needed for the visuals, my first intuition was to look at the cumulative distance column for every runner, identify when each of them completed a lap distance (1000, 2000, 3000, etc.) and compute the differences between their timestamps.
That algorithm looks simple and might work, but it has some limitations that I needed to address:
- Exact lap distances are usually completed between two recorded data points. To be more accurate, I had to interpolate both position and time.
- Due to differences in the precision of devices, there might be misalignments across runners. The most typical case is when one runner’s lap notification beeps before another’s, even though they have been together for the whole track. To minimise this, I decided to use the reference runner to set the position marks for every lap in the track. The time difference is calculated when the other runners cross those marks (even if their cumulative distance is ahead of or behind the lap). This is closer to the reality of the race: if someone crosses a point first, they are ahead (regardless of the cumulative distance on their device).
- With the previous point comes another problem: the latitude and longitude of a reference mark might never be registered exactly in the other runners’ data. I used Nearest Neighbours to find the closest datapoint in terms of position.
- Finally, Nearest Neighbours might return the wrong datapoint if the track passes through the same position at different moments in time. So the population in which Nearest Neighbours looks for the best match needs to be reduced to a smaller group of candidates. I defined a window of 20 datapoints around the target distance (distance_cum).
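The last point can be illustrated with a minimal, self-contained sketch. The out-and-back track, the `nearest_in_window` helper and all coordinates below are hypothetical and are not part of the original code; a plain Euclidean search stands in for Nearest Neighbours.

```python
import math

def nearest_in_window(points, target, centre, window_size=20):
    """Index of the point closest to `target`, searching only around `centre`."""
    half = window_size // 2
    start = max(0, centre - half)
    stop = min(len(points), centre + half + 1)
    return min(range(start, stop), key=lambda i: math.dist(points[i], target))

# Hypothetical out-and-back track: 50 points out, then 50 points back
# along almost the same line (the return leg is offset by GPS noise).
outward = [(float(x), 0.0) for x in range(50)]
back = [(float(x), 0.5) for x in range(49, -1, -1)]
track = outward + back

# A reference mark that belongs to the return leg but, because of noise,
# lies slightly closer to the outward leg.
target = (10.2, 0.2)

# Searching the whole track picks the outward pass (index 10).
global_idx = min(range(len(track)), key=lambda i: math.dist(track[i], target))

# Searching only around the index suggested by cumulative distance
# (assumed here to be 88) picks the return pass (index 89).
windowed_idx = nearest_in_window(track, target, centre=88)

print(global_idx, windowed_idx)
```

Restricting the candidates to a window around the expected cumulative distance is what keeps the match on the correct pass.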
Algorithm
With all the previous limitations in mind, the algorithm should be as follows:
1. Choose the reference and a lap distance (default = 1 km).
2. Using the reference data, identify the position and the moment every lap was completed: the reference marks.
3. Go to the other runners’ data and identify the moments they crossed those position marks. Then calculate the difference in time between both runners crossing each mark. Finally, calculate the delta of this time difference from one mark to the next, which represents the evolution of the gap.
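Step 3 can be sketched with a few made-up numbers. The crossing times below are hypothetical and only illustrate the arithmetic: the gap at each mark is the runner's crossing time minus the reference's, and the delta is the change in that gap between consecutive marks.

```python
from datetime import datetime

# Hypothetical crossing times at three reference marks
ref_times = [
    datetime(2024, 1, 1, 9, 5, 0),
    datetime(2024, 1, 1, 9, 10, 10),
    datetime(2024, 1, 1, 9, 15, 5),
]
runner_times = [
    datetime(2024, 1, 1, 9, 5, 12),
    datetime(2024, 1, 1, 9, 10, 30),
    datetime(2024, 1, 1, 9, 15, 20),
]

# Gap to the reference at each mark, in seconds (positive = behind)
gaps = [(r - ref).total_seconds() for r, ref in zip(runner_times, ref_times)]

# Evolution of the gap: change between consecutive marks (0 for the first)
deltas = [0.0] + [b - a for a, b in zip(gaps, gaps[1:])]

print(gaps)    # [12.0, 20.0, 15.0]
print(deltas)  # [0.0, 8.0, -5.0]
```

In this example the runner loses 8 seconds on the second lap and recovers 5 seconds on the third.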
Code Example
1. Choose the reference and a lap distance (default = 1 km)
- Juan will be the reference (juan_df) in the examples.
- The other runners will be Pedro (pedro_df) and Jimena (jimena_df).
- Lap distance will be 1000 metres.
2. Create interpolate_laps(): a function that finds or interpolates the exact point for each completed lap and returns the laps in a new dataframe. The interpolation is done with a second function, interpolate_value(), which I also created.
## Function: interpolate_value()
Input:
- start: The starting value.
- end: The ending value.
- fraction: A value between 0 and 1 that represents the position between
  the start and end values where the interpolation should occur.
Return:
- The interpolated value that lies between the start and end values
  at the specified fraction.
def interpolate_value(start, end, fraction):
    return start + (end - start) * fraction
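As a quick worked example of the interpolation, here is a small sketch with two hypothetical datapoints recorded just before and just after the 1000 m mark. The numbers are made up, and interpolate_value() is repeated only so that the snippet runs on its own.

```python
from datetime import datetime

def interpolate_value(start, end, fraction):
    return start + (end - start) * fraction

# Hypothetical datapoints around the 1000 m mark
before = {'distance_cum': 996.0, 'latitude': 40.00000,
          'date_time': datetime(2024, 1, 1, 9, 4, 58)}
after = {'distance_cum': 1002.0, 'latitude': 40.00006,
         'date_time': datetime(2024, 1, 1, 9, 5, 1)}

target_distance = 1000

# How far between the two datapoints the lap was completed (4 m out of 6 m)
fraction = (target_distance - before['distance_cum']) / (
    after['distance_cum'] - before['distance_cum'])

# The same formula works for numbers and for timestamps
lap_latitude = interpolate_value(before['latitude'], after['latitude'], fraction)
lap_time = interpolate_value(before['date_time'], after['date_time'], fraction)

print(round(fraction, 3), round(lap_latitude, 5), lap_time)
```

The lap is completed two thirds of the way between the two datapoints, so the interpolated timestamp falls 2 seconds after the first one.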
## Function: interpolate_laps()
Input:
- track_df: dataframe with track data (assumed to have a 0-based integer index).
- lap_distance: metres per lap (default 1000)
Return:
- track_laps: dataframe with lap metrics. As many rows as laps identified.
import pandas as pd

def interpolate_laps(track_df, lap_distance=1000):
    #### 1. Initialise track_laps with the first row of track_df
    track_laps = track_df.loc[0][['latitude','longitude','elevation','date_time','distance_cum']].copy()
    # Set distance_cum = 0
    track_laps[['distance_cum']] = 0
    # Transpose dataframe
    track_laps = pd.DataFrame(track_laps)
    track_laps = track_laps.transpose()
    #### 2. Calculate number_of_laps = Total Distance / lap_distance
    number_of_laps = track_df['distance_cum'].max()//lap_distance
    #### 3. For each lap i from 1 to number_of_laps:
    for i in range(1,int(number_of_laps+1),1):
        # a. Calculate target_distance = i * lap_distance
        target_distance = i*lap_distance
        # b. Find first_crossing_index where track_df['distance_cum'] >= target_distance
        #    (>= rather than >, otherwise the exact match in step c could never happen)
        first_crossing_index = (track_df['distance_cum'] >= target_distance).idxmax()
        # c. If match is exactly the lap distance, copy that row
        if (track_df.loc[first_crossing_index]['distance_cum'] == target_distance):
            new_row = track_df.loc[first_crossing_index][['latitude','longitude','elevation','date_time','distance_cum']]
        # Else: create new_row with values interpolated between the previous and the crossing datapoint
        else:
            fraction = (target_distance - track_df.loc[first_crossing_index-1, 'distance_cum']) / (track_df.loc[first_crossing_index, 'distance_cum'] - track_df.loc[first_crossing_index-1, 'distance_cum'])
            # Create the new row
            new_row = pd.Series({
                'latitude': interpolate_value(track_df.loc[first_crossing_index-1, 'latitude'], track_df.loc[first_crossing_index, 'latitude'], fraction),
                'longitude': interpolate_value(track_df.loc[first_crossing_index-1, 'longitude'], track_df.loc[first_crossing_index, 'longitude'], fraction),
                'elevation': interpolate_value(track_df.loc[first_crossing_index-1, 'elevation'], track_df.loc[first_crossing_index, 'elevation'], fraction),
                'date_time': track_df.loc[first_crossing_index-1, 'date_time'] + (track_df.loc[first_crossing_index, 'date_time'] - track_df.loc[first_crossing_index-1, 'date_time']) * fraction,
                'distance_cum': target_distance
            }, name=f'lap_{i}')
        # d. Add the new row to the dataframe that stores the laps
        new_row_df = pd.DataFrame(new_row)
        new_row_df = new_row_df.transpose()
        track_laps = pd.concat([track_laps,new_row_df])
    #### 4. Convert date_time to datetime format and remove timezone
    track_laps['date_time'] = pd.to_datetime(track_laps['date_time'], format='%Y-%m-%d %H:%M:%S.%f%z')
    track_laps['date_time'] = track_laps['date_time'].dt.tz_localize(None)
    #### 5. Calculate seconds_diff between consecutive rows in track_laps (stored as a Timedelta)
    track_laps['seconds_diff'] = track_laps['date_time'].diff()
    return track_laps
Applying interpolate_laps() to the reference dataframe generates the following dataframe:
juan_laps = interpolate_laps(juan_df , lap_distance=1000)
Note that, as it was a 10k race, 10 laps of 1000 m have been identified (see column distance_cum). The column seconds_diff has the time per lap. The rest of the columns (latitude, longitude, elevation and date_time) mark the position and time at which the reference completed each lap, as a result of the interpolation.
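As an aside, the same lap marks can also be obtained without an explicit loop. The sketch below is an alternative, not the method used in this article: it relies on numpy's `np.interp`, assumes the cumulative distance never decreases, and works on elapsed seconds rather than timestamps. All sample values are hypothetical.

```python
import numpy as np

# Hypothetical samples: cumulative distance (m), elapsed time (s), latitude
distance_cum = np.array([0.0, 400.0, 900.0, 1100.0, 1900.0, 2200.0])
elapsed_s = np.array([0.0, 120.0, 270.0, 330.0, 570.0, 660.0])
latitude = np.array([40.000, 40.004, 40.009, 40.011, 40.019, 40.022])

lap_distance = 1000
number_of_laps = int(distance_cum.max() // lap_distance)

# Target distances for every completed lap: 1000, 2000
targets = np.arange(1, number_of_laps + 1) * lap_distance

# Linear interpolation of each column at the target distances
lap_elapsed_s = np.interp(targets, distance_cum, elapsed_s)
lap_latitude = np.interp(targets, distance_cum, latitude)

# Time per lap: difference between consecutive lap marks
lap_seconds = np.diff(lap_elapsed_s, prepend=0.0)

print(lap_elapsed_s)  # elapsed seconds at 1000 m and 2000 m
print(lap_seconds)    # seconds spent on each lap
```

With these numbers both laps take 300 seconds. The loop-based function is easier to follow step by step, while this version is more compact once the idea is clear.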
3. To calculate the time gaps between the reference and the other runners I created the function gap_to_reference()
## Helper Functions:
- get_seconds(): Convert timedelta to total seconds
- format_timedelta(): Format timedelta as a string (e.g., "+01:23" or "-00:45")
# Convert timedelta to total seconds
def get_seconds(td):
    # Convert to total seconds
    total_seconds = td.total_seconds()
    return total_seconds
# Format timedelta as a string (e.g., "+01:23" or "-00:45")
def format_timedelta(td):
    # Convert to total seconds
    total_seconds = td.total_seconds()
    # Determine sign
    sign = '+' if total_seconds >= 0 else '-'
    # Take absolute value for calculation
    total_seconds = abs(total_seconds)
    # Calculate minutes and remaining seconds
    minutes = int(total_seconds // 60)
    seconds = int(total_seconds % 60)
    # Format the string
    return f"{sign}{minutes:02d}:{seconds:02d}"
## Function: gap_to_reference()
Input:
- laps_dict: dictionary containing the df_laps for all the runners' names
- df_dict: dictionary containing the track_df for all the runners' names
- reference_name: name of the reference
Return:
- matches: processed data with time differences.
from sklearn.neighbors import NearestNeighbors

def gap_to_reference(laps_dict, df_dict, reference_name):
    #### 1. Get the reference's lap data from laps_dict (copied, so the original laps are not modified)
    matches = laps_dict[reference_name][['latitude','longitude','date_time','distance_cum']].copy()
    #### 2. For each racer (name) and their data (df) in df_dict:
    for name, df in df_dict.items():
        # If racer is the reference:
        if name == reference_name:
            # Set time difference to zero for all laps
            matches[f'seconds_to_reference_{reference_name}'] = 0
        # If racer is not the reference:
        if name != reference_name:
            # a. For each lap find the nearest point in racer's data based on lat, lon.
            for lap, row in matches.iterrows():
                # Step 1: set the position and lap distance from the reference
                target_coordinates = matches.loc[lap][['latitude', 'longitude']].values.astype(float)
                target_distance = matches.loc[lap]['distance_cum']
                # Step 2: find the datapoint that will be in the centre of the window
                first_crossing_index = (df['distance_cum'] > target_distance).idxmax()
                # Step 3: select the candidate datapoints (window_size//2 on each side of the centre)
                window_size = 20
                window_sample = df.loc[first_crossing_index-(window_size//2):first_crossing_index+(window_size//2)]
                candidates = window_sample[['latitude', 'longitude']].values
                # Step 4: get the nearest match using the coordinates
                nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
                nn.fit(candidates)
                distance, indice = nn.kneighbors([target_coordinates])
                nearest_timestamp = window_sample.iloc[indice.flatten()]['date_time'].values
                nearest_distance_cum = window_sample.iloc[indice.flatten()]['distance_cum'].values
                # kneighbors returns a 2D array; keep the single distance as a scalar
                euclidean_distance = distance[0][0]
                matches.loc[lap,f'nearest_timestamp_{name}'] = nearest_timestamp[0]
                matches.loc[lap,f'nearest_distance_cum_{name}'] = nearest_distance_cum[0]
                matches.loc[lap,f'euclidean_distance_{name}'] = euclidean_distance
            # b. Calculate time difference between racer and reference at this point
            matches[f'time_to_ref_{name}'] = matches[f'nearest_timestamp_{name}'] - matches['date_time']
            # c. Store time difference and other relevant data
            matches[f'time_to_ref_diff_{name}'] = matches[f'time_to_ref_{name}'].diff()
            matches[f'time_to_ref_diff_{name}'] = matches[f'time_to_ref_diff_{name}'].fillna(pd.Timedelta(seconds=0))
            # d. Format data using helper functions
            matches[f'lap_difference_seconds_{name}'] = matches[f'time_to_ref_diff_{name}'].apply(get_seconds)
            matches[f'lap_difference_formatted_{name}'] = matches[f'time_to_ref_diff_{name}'].apply(format_timedelta)
            matches[f'seconds_to_reference_{name}'] = matches[f'time_to_ref_{name}'].apply(get_seconds)
            matches[f'time_to_reference_formatted_{name}'] = matches[f'time_to_ref_{name}'].apply(format_timedelta)
    #### 3. Return processed data with time differences
    return matches
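One detail worth keeping in mind: the Euclidean distance returned by the Nearest Neighbours search is measured in degrees of latitude and longitude, not in metres. To sanity-check how far a match really is from its reference mark, a great-circle distance can be computed separately. The `haversine_m` helper below is a hypothetical addition, not part of the original code, and assumes a spherical Earth with a mean radius of 6,371 km.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

# The same 0.001 degree offset is a different distance depending on direction:
# roughly 111 m along a meridian, but only about 85 m along the parallel at 40° N.
north_south = haversine_m(40.0, -3.0, 40.001, -3.0)
east_west = haversine_m(40.0, -3.0, 40.0, -3.001)

print(round(north_south, 1), round(east_west, 1))
```

Within a window of about 20 neighbouring datapoints the ranking by Euclidean distance in degrees is usually good enough, but a distance in metres makes it easier to spot matches that are suspiciously far from the mark.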
Below is the code that implements the logic and stores the results in the dataframe matches_gap_to_reference:
# Lap distance
lap_distance = 1000
# Store the DataFrames in a dictionary
df_dict = {
    'jimena': jimena_df,
    'juan': juan_df,
    'pedro': pedro_df,
}
# Store the Lap DataFrames in a dictionary
laps_dict = {
    'jimena': interpolate_laps(jimena_df, lap_distance),
    'juan': interpolate_laps(juan_df, lap_distance),
    'pedro': interpolate_laps(pedro_df, lap_distance)
}
# Calculate gaps to reference
reference_name = 'juan'
matches_gap_to_reference = gap_to_reference(laps_dict, df_dict, reference_name)
The columns of the resulting dataframe contain the important information that will be displayed on the graphs: