Introduction
A couple of years ago I compared racing data between pairs of races (Gold Discovery and Equinox, Santa Claus and Equinox) in the same season for all runners who ran both events. The result was an estimate of how fast I might run the Equinox Marathon based on my times for Gold Discovery and the Santa Claus Half Marathon.
Several years have passed since then: I've run more races, and I've collected racing data for all the major Fairbanks races, so I wanted to repeat the same analysis for all combinations of races.
Data
The data comes from a database I’ve built of race times for all competitors, mostly drawn from the results available from Chronotrack, but including some race results from SportAlaska.
We started by loading the required R packages and reading in all the racing data, a small subset of which looks like this.
race | year | name | finish_time | birth_year | sex |
---|---|---|---|---|---|
Beat Beethoven | 2015 | thomas mcclelland | 00:21:49 | 1995 | M |
Equinox Marathon | 2015 | jennifer paniati | 06:24:14 | 1989 | F |
Equinox Marathon | 2014 | kris starkey | 06:35:55 | 1972 | F |
Midnight Sun Run | 2014 | kathy toohey | 01:10:42 | 1960 | F |
Midnight Sun Run | 2016 | steven rast | 01:59:41 | 1960 | M |
Equinox Marathon | 2013 | elizabeth smith | 09:18:53 | 1987 | F |
... | ... | ... | ... | ... | ... |
Next we loaded the names and distances of the races and combined them with the individual racing data. The data from Chronotrack doesn’t include the mileage, which we need to calculate pace (minutes per mile).
My database doesn’t have complete information about all the racers who competed, and in some cases the information for a runner in one race conflicts with the information for the same runner in a different race. To resolve this, we generated a list of runners, grouped by name, and threw out racers whose names matched but whose gender was reported differently from one race to the next. Please understand we’re not doing this to exclude those who have changed their gender identity along the way, but to eliminate possible bias from data-entry mistakes.
Finally, we combined the racers with the individual racing data, substituting our corrected runner information for what appeared in each race’s data. We also calculated minutes per mile (pace) and the age of the runner during the year of the race (age). Because we assign each runner the minimum birth year reported across all their races, the age variable won’t change during the running season, which is closer to the way age categories are calculated in Europe. Last, we removed results where pace was greater than 20 minutes per mile for races of ten miles or longer, or greater than 16 minutes per mile for races shorter than ten miles. These are likely to be outliers, or competitors who weren’t actually running the race.
name | birth_year | gender | race_str | year | miles | minutes | pace | age |
---|---|---|---|---|---|---|---|---|
aaron austin | 1983 | M | midnight_sun_run | 2014 | 6.2 | 50.60 | 8.16 | 31 |
aaron bravo | 1999 | M | midnight_sun_run | 2013 | 6.2 | 45.26 | 7.30 | 14 |
aaron bravo | 1999 | M | midnight_sun_run | 2014 | 6.2 | 40.08 | 6.46 | 15 |
aaron bravo | 1999 | M | midnight_sun_run | 2015 | 6.2 | 36.65 | 5.91 | 16 |
aaron bravo | 1999 | M | midnight_sun_run | 2016 | 6.2 | 36.31 | 5.85 | 17 |
aaron bravo | 1999 | M | spruce_tree_classic | 2014 | 6.0 | 42.17 | 7.03 | 15 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
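The pace and age calculation described above can be sketched in a few lines. This is a minimal, self-contained illustration with invented example rows, not the full pipeline from the appendix:

```r
library(dplyr)
library(lubridate)

# Invented example rows: finish_time comes in either "HH:MM:SS" or "MM:SS" form.
results <- tibble::tibble(
  finish_time = c("00:21:49", "45:26"),
  miles       = c(3.1, 6.2),
  year        = c(2015, 2013),
  birth_year  = c(1995, 1999)
)

results <- results %>%
  mutate(
    # pad "MM:SS" times out to "HH:MM:SS" before parsing
    fixed   = ifelse(grepl("^[0-9]+:[0-9]+:[0-9.]+$", finish_time),
                     finish_time, paste0("00:", finish_time)),
    minutes = as.numeric(seconds(hms(fixed))) / 60,
    pace    = minutes / miles,   # minutes per mile
    age     = year - birth_year  # fixed for the whole season
  )
```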
We then reshaped all available results so that each row represents one runner in one year, each column is a race, and each cell holds the pace for that runner × year × race combination.
For example, here are the first six rows for runners who completed both Beat Beethoven and the Chena River Run in the years I have data. I’ve also included the column for the Midnight Sun Run in the table, but the actual data has a column for every major Fairbanks race. You’ll see that two of the six runners listed ran BB and CRR but didn’t run MSR that year.
name | gender | age | year | beat_beethoven | chena_river_run | midnight_sun_run |
---|---|---|---|---|---|---|
aaron schooley | M | 36 | 2016 | 8.19 | 8.15 | 8.88 |
abby fett | F | 33 | 2014 | 10.68 | 10.34 | 11.59 |
abby fett | F | 35 | 2016 | 11.97 | 12.58 | NA |
abigail haas | F | 11 | 2015 | 9.34 | 8.29 | NA |
abigail haas | F | 12 | 2016 | 8.48 | 7.90 | 11.40 |
aimee hughes | F | 43 | 2015 | 11.32 | 9.50 | 10.69 |
... | ... | ... | ... | ... | ... | ... |
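The reshaping step can be sketched with tidyr's spread(), using a small invented example; races a runner skipped that year come out as NA:

```r
library(dplyr)
library(tidyr)

# Invented long-format rows: one row per runner x year x race.
long <- tibble::tibble(
  name     = rep("abby fett", 3),
  year     = rep(2014, 3),
  race_str = c("beat_beethoven", "chena_river_run", "midnight_sun_run"),
  pace     = c(10.68, 10.34, 11.59)
)

# One row per runner x year; one column per race, holding pace.
wide <- long %>% spread(race_str, pace)
```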
With this data, we built a whole series of linear models, one for each race combination. We created formula strings and formula objects for all the combinations, then fit them using map(). We combined the start and predicted race names with the linear models, and used glance() and tidy() from the broom package to turn the models into statistics and coefficients.
All of the models between races were highly significant, but many of them contained coefficients that weren’t significantly different from zero, meaning that the term in question (age, gender, or first-race pace) wasn’t adding anything useful to the model. We used the significance of each term to reduce our models so they contained only significant coefficients, then regenerated the statistics and coefficients for these reduced models.
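A sketch of a single race-pair model, assuming the wide data frame above is named speeds_combined as in the appendix code; the full analysis repeats this for every ordered pair of races:

```r
library(broom)

# One race-pair model: predict Santa Claus Half pace from Gold Discovery
# pace, plus gender and age.
model <- lm(santa_claus_half_marathon ~ gold_discovery_run + gender + age,
            data = speeds_combined)

glance(model)  # model-level statistics: R-squared, p-value, etc.
tidy(model)    # per-term coefficients and p-values; terms with
               # p >= 0.05 are dropped to form the reduced model
```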
The full R code appears at the bottom of this post.
Results
Here are the statistics from the ten best-performing models (based on R²).
start_race | predicted_race | n | R² | p-value |
---|---|---|---|---|
run_of_the_valkyries | golden_heart_trail_run | 40 | 0.956 | 0 |
golden_heart_trail_run | equinox_marathon | 36 | 0.908 | 0 |
santa_claus_half_marathon | golden_heart_trail_run | 34 | 0.896 | 0 |
midnight_sun_run | gold_discovery_run | 139 | 0.887 | 0 |
beat_beethoven | golden_heart_trail_run | 32 | 0.886 | 0 |
run_of_the_valkyries | gold_discovery_run | 44 | 0.877 | 0 |
midnight_sun_run | golden_heart_trail_run | 52 | 0.877 | 0 |
gold_discovery_run | santa_claus_half_marathon | 111 | 0.876 | 0 |
chena_river_run | golden_heart_trail_run | 44 | 0.873 | 0 |
run_of_the_valkyries | santa_claus_half_marathon | 91 | 0.851 | 0 |
It’s interesting how many times the Golden Heart Trail Run appears on this list, since that run is something of an outlier in the Usibelli running series: it’s the only race entirely on trails. Maybe it’s because its distance (5K) is comparable to a lot of the earlier races in the season, but because it’s on trails it matches well with the later races, like Gold Discovery or Equinox, that are at least partially on trails.
Here are the ten worst models.
start_race | predicted_race | n | R² | p-value |
---|---|---|---|---|
midnight_sun_run | equinox_marathon | 431 | 0.525 | 0 |
beat_beethoven | hoodoo_half_marathon | 87 | 0.533 | 0 |
beat_beethoven | midnight_sun_run | 818 | 0.570 | 0 |
chena_river_run | equinox_marathon | 196 | 0.572 | 0 |
equinox_marathon | hoodoo_half_marathon | 90 | 0.584 | 0 |
beat_beethoven | equinox_marathon | 265 | 0.585 | 0 |
gold_discovery_run | hoodoo_half_marathon | 41 | 0.599 | 0 |
beat_beethoven | santa_claus_half_marathon | 163 | 0.612 | 0 |
run_of_the_valkyries | equinox_marathon | 125 | 0.642 | 0 |
midnight_sun_run | hoodoo_half_marathon | 118 | 0.657 | 0 |
Most of these models pair shorter races like Beat Beethoven or the Chena River Run with longer races like Equinox or one of the half marathons. Even so, each model explains more than half the variation in the data, which isn’t terrible.
Application
Now that we have all our models and their coefficients, we can use them to make predictions of future performance. I’ve written an online calculator based on the reduced models that lets you predict your race results as you go through the running season. The calculator is here: Fairbanks Running Race Converter.
For example, I ran a 7:41 pace for Run of the Valkyries this year. Entering that, plus my age and gender, into the converter predicts an 8:57 pace for the first running of the HooDoo Half Marathon. The R² for this model was a respectable 0.71 even though only 23 runners ran both races this year (including me). My actual pace for HooDoo was 8:18, so I came in quite a bit faster than predicted. No wonder my knee and hip hurt after the race! Using my time from the Golden Heart Trail Run, the converter predicts a HooDoo Half pace of 8:16.2, less than a minute off my 1:48:11 finish.
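For readers following along in R, a single prediction can be sketched with predict() on a fitted model. Here hoodoo_model and the age value are placeholders, not the actual fitted model or my age; a 7:41 pace is roughly 7.68 minutes per mile:

```r
# Placeholder model object: a reduced lm() predicting HooDoo Half pace
# from Run of the Valkyries pace (plus any terms that survived reduction).
predict(hoodoo_model,
        newdata = data.frame(run_of_the_valkyries = 7 + 41 / 60,
                             gender = factor("M", levels = c("F", "M")),
                             age = 45))
```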
Appendix: R code
library(tidyverse)
library(lubridate)
library(broom)
library(scales)   # pretty_breaks()
library(viridis)  # scale_color_viridis()
races_db <- src_postgres(host="localhost", dbname="races")
combined_races <- tbl(races_db, build_sql(
"SELECT race, year, lower(name) AS name, finish_time,
year - age AS birth_year, sex
FROM chronotrack
UNION
SELECT race, year, lower(name) AS name, finish_time,
birth_year,
CASE WHEN age_class ~ 'M' THEN 'M' ELSE 'F' END AS sex
FROM sportalaska
UNION
SELECT race, year, lower(name) AS name, finish_time,
NULL AS birth_year, NULL AS sex
FROM other"))
races <- tbl(races_db, build_sql(
"SELECT race,
lower(regexp_replace(race, '[ ’]', '_', 'g')) AS race_str,
date_part('year', date) AS year,
miles
FROM races"))
racing_data <- combined_races %>%
inner_join(races) %>%
filter(!is.na(finish_time))
racers <- racing_data %>%
group_by(name) %>%
summarize(races=n(),
birth_year=min(birth_year),
gender_filter=ifelse(sum(ifelse(sex=='M',1,0))==
sum(ifelse(sex=='F',1,0)),
FALSE, TRUE),
gender=ifelse(sum(ifelse(sex=='M',1,0))>
sum(ifelse(sex=='F',1,0)),
'M', 'F')) %>%
ungroup() %>%
filter(gender_filter) %>%
select(-gender_filter)
racing_data_filled <- racing_data %>%
inner_join(racers, by="name") %>%
mutate(birth_year=birth_year.y) %>%
select(name, birth_year, gender, race_str, year, miles, finish_time) %>%
group_by(name, race_str, year) %>%
mutate(n=n()) %>%
filter(!is.na(birth_year), n==1) %>%
ungroup() %>%
collect() %>%
mutate(fixed=ifelse(grepl('[0-9]+:[0-9]+:[0-9.]+', finish_time),
finish_time,
paste0('00:', finish_time)),
minutes=as.numeric(seconds(hms(fixed)))/60.0,
pace=minutes/miles,
age=year-birth_year,
age_class=as.integer(age/10)*10,
group=paste0(gender, age_class),
gender=as.factor(gender)) %>%
filter((miles<10 & pace<16) | (miles>=10 & pace<20)) %>%
select(-fixed, -finish_time, -n)
speeds_combined <- racing_data_filled %>%
select(name, gender, age, age_class, group, race_str, year, pace) %>%
spread(race_str, pace)
main_races <- c('beat_beethoven', 'chena_river_run', 'midnight_sun_run',
'run_of_the_valkyries', 'gold_discovery_run',
'santa_claus_half_marathon', 'golden_heart_trail_run',
'equinox_marathon', 'hoodoo_half_marathon')
race_formula_str <-
lapply(seq(1, length(main_races)-1),
function(i)
lapply(seq(i+1, length(main_races)),
function(j) paste(main_races[[j]], '~',
main_races[[i]],
'+ gender', '+ age'))) %>%
unlist()
race_formulas <- lapply(race_formula_str, function(i) as.formula(i)) %>%
unlist()
lm_models <- map(race_formulas, ~ lm(.x, data=speeds_combined))
models <- tibble(start_race=factor(gsub('.* ~ ([^ ]+).*',
'\\1',
race_formula_str),
levels=main_races),
predicted_race=factor(gsub('([^ ]+).*',
'\\1',
race_formula_str),
levels=main_races),
lm_models=lm_models) %>%
arrange(start_race, predicted_race)
model_stats <- glance(models %>% rowwise(), lm_models)
model_coefficients <- tidy(models %>% rowwise(), lm_models)
reduced_formula_str <- model_coefficients %>%
ungroup() %>%
filter(p.value<0.05, term!='(Intercept)') %>%
mutate(term=gsub('genderM', 'gender', term)) %>%
group_by(predicted_race, start_race) %>%
summarize(independent_vars=paste(term, collapse=" + ")) %>%
ungroup() %>%
transmute(reduced_formulas=paste(predicted_race, independent_vars, sep=' ~ '))
reduced_formula_str <- reduced_formula_str$reduced_formulas
reduced_race_formulas <- lapply(reduced_formula_str,
function(i) as.formula(i)) %>% unlist()
reduced_lm_models <- map(reduced_race_formulas, ~ lm(.x, data=speeds_combined))
n_from_lm <- function(model) {
summary_object <- summary(model)
summary_object$df[1] + summary_object$df[2]
}
reduced_models <- tibble(start_race=factor(gsub('.* ~ ([^ ]+).*', '\\1', reduced_formula_str),
levels=main_races),
predicted_race=factor(gsub('([^ ]+).*', '\\1', reduced_formula_str),
levels=main_races),
lm_models=reduced_lm_models) %>%
arrange(start_race, predicted_race) %>%
rowwise() %>%
mutate(n=n_from_lm(lm_models))
reduced_model_stats <- glance(reduced_models %>% rowwise(), lm_models)
reduced_model_coefficients <- tidy(reduced_models %>% rowwise(), lm_models) %>%
ungroup()
coefficients_and_stats <- reduced_model_stats %>%
inner_join(reduced_model_coefficients,
by=c("start_race", "predicted_race", "n")) %>%
select(start_race, predicted_race, n, r.squared, term, estimate)
write_csv(coefficients_and_stats,
"coefficients.csv")
make_scatterplot <- function(start_race, predicted_race) {
age_limits <- speeds_combined %>%
filter_(paste("!is.na(", start_race, ")"),
paste("!is.na(", predicted_race, ")")) %>%
summarize(min=min(age), max=max(age)) %>%
unlist()
q <- ggplot(data=speeds_combined,
aes_string(x=start_race, y=predicted_race)) +
# plasma works better with a grey background
# theme_bw() +
geom_abline(slope=1, color="darkred", alpha=0.5) +
geom_smooth(method="lm", se=FALSE) +
geom_point(aes(shape=gender, color=age)) +
scale_color_viridis(option="plasma",
limits=age_limits) +
scale_x_continuous(breaks=pretty_breaks(n=10)) +
scale_y_continuous(breaks=pretty_breaks(n=6))
svg_filename <- paste0(paste(start_race, predicted_race, sep="-"), ".svg")
height <- 9
width <- 16
resize <- 0.75
svg(svg_filename, height=height*resize, width=width*resize)
print(q)
dev.off()
}
lapply(seq(1, length(main_races)-1),
function(i)
lapply(seq(i+1, length(main_races)),
function(j)
make_scatterplot(main_races[[i]], main_races[[j]])
  )
)