Introduction
A couple of years ago I compared racing data between pairs of races (Gold Discovery and Equinox, Santa Claus and Equinox) in the same season for all runners who ran both events. The result was an estimate of how fast I might run the Equinox Marathon based on my times for Gold Discovery and the Santa Claus Half Marathon.
Several years have passed since then: I've run more races, and I've collected racing data for all the major Fairbanks races, so I wanted to repeat the same analysis for all combinations of races.
Data
The data comes from a database I’ve built of race times for all competitors, mostly drawn from the results available from Chronotrack, but including some race results from SportAlaska.
We started by loading the required R packages and reading in all the racing data, a small subset of which looks like this.
race | year | name | finish_time | birth_year | sex |
---|---|---|---|---|---|
Beat Beethoven | 2015 | thomas mcclelland | 00:21:49 | 1995 | M |
Equinox Marathon | 2015 | jennifer paniati | 06:24:14 | 1989 | F |
Equinox Marathon | 2014 | kris starkey | 06:35:55 | 1972 | F |
Midnight Sun Run | 2014 | kathy toohey | 01:10:42 | 1960 | F |
Midnight Sun Run | 2016 | steven rast | 01:59:41 | 1960 | M |
Equinox Marathon | 2013 | elizabeth smith | 09:18:53 | 1987 | F |
... | ... | ... | ... | ... | ... |
Next we loaded the names and distances of the races and combined them with the individual racing data. The data from Chronotrack doesn’t include the mileage, which we need to calculate pace (minutes per mile).
My database doesn’t have complete information about all the racers who competed, and in some cases the information for a runner in one race conflicts with the information for the same runner in a different race. To resolve this, we generated a list of runners, grouped by name, and threw out racers whose names matched but whose gender was reported differently from one race to the next. Please understand we’re not doing this to exclude those who have changed their gender identity along the way, but to eliminate possible bias from data-entry mistakes.
Finally, we combined the racers with the individual racing data, substituting our corrected runner information for what appeared in each race’s data. We also calculated minutes per mile (pace) and the age of the runner during the year of the race (age). Because we assign each runner the minimum birth year reported across all their races, the age variable won’t change during the running season, which is closer to the way age categories are calculated in Europe. Last, we removed results where pace was greater than 20 minutes per mile for races of ten miles or longer, or greater than 16 minutes per mile for races shorter than ten miles. These are likely to be outliers, or competitors who weren’t actually running the race.
name | birth_year | gender | race_str | year | miles | minutes | pace | age |
---|---|---|---|---|---|---|---|---|
aaron austin | 1983 | M | midnight_sun_run | 2014 | 6.2 | 50.60 | 8.16 | 31 |
aaron bravo | 1999 | M | midnight_sun_run | 2013 | 6.2 | 45.26 | 7.30 | 14 |
aaron bravo | 1999 | M | midnight_sun_run | 2014 | 6.2 | 40.08 | 6.46 | 15 |
aaron bravo | 1999 | M | midnight_sun_run | 2015 | 6.2 | 36.65 | 5.91 | 16 |
aaron bravo | 1999 | M | midnight_sun_run | 2016 | 6.2 | 36.31 | 5.85 | 17 |
aaron bravo | 1999 | M | spruce_tree_classic | 2014 | 6.0 | 42.17 | 7.03 | 15 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
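The pace and age calculation described above can be sketched in a few lines. This is a minimal, self-contained illustration with invented example rows, not the full pipeline from the appendix:

```r
library(dplyr)
library(lubridate)

# Invented example rows: finish_time comes in either "HH:MM:SS" or "MM:SS" form.
results <- tibble::tibble(
  finish_time = c("00:21:49", "45:26"),
  miles       = c(3.1, 6.2),
  year        = c(2015, 2013),
  birth_year  = c(1995, 1999)
)

results <- results %>%
  mutate(
    # pad "MM:SS" times out to "HH:MM:SS" before parsing
    fixed   = ifelse(grepl("^[0-9]+:[0-9]+:[0-9.]+$", finish_time),
                     finish_time, paste0("00:", finish_time)),
    minutes = as.numeric(seconds(hms(fixed))) / 60,
    pace    = minutes / miles,   # minutes per mile
    age     = year - birth_year  # fixed for the whole season
  )
```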
We then reshaped all available results so that each row represents one runner in one year, each column is a race, and each cell holds the pace for that runner × year × race combination.
For example, here are the first six rows for runners who completed both Beat Beethoven and the Chena River Run in the years I have data. I’ve also included the column for the Midnight Sun Run in the table, but the actual data has a column for every major Fairbanks race. You’ll see that two of the six runners listed ran BB and CRR but didn’t run MSR that year.
name | gender | age | year | beat_beethoven | chena_river_run | midnight_sun_run |
---|---|---|---|---|---|---|
aaron schooley | M | 36 | 2016 | 8.19 | 8.15 | 8.88 |
abby fett | F | 33 | 2014 | 10.68 | 10.34 | 11.59 |
abby fett | F | 35 | 2016 | 11.97 | 12.58 | NA |
abigail haas | F | 11 | 2015 | 9.34 | 8.29 | NA |
abigail haas | F | 12 | 2016 | 8.48 | 7.90 | 11.40 |
aimee hughes | F | 43 | 2015 | 11.32 | 9.50 | 10.69 |
... | ... | ... | ... | ... | ... | ... |
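The reshaping step can be sketched with tidyr's spread(), using a small invented example; races a runner skipped that year come out as NA:

```r
library(dplyr)
library(tidyr)

# Invented long-format rows: one row per runner x year x race.
long <- tibble::tibble(
  name     = rep("abby fett", 3),
  year     = rep(2014, 3),
  race_str = c("beat_beethoven", "chena_river_run", "midnight_sun_run"),
  pace     = c(10.68, 10.34, 11.59)
)

# One row per runner x year; one column per race, holding pace.
wide <- long %>% spread(race_str, pace)
```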
With this data, we built a whole series of linear models, one for each race combination. We created formula strings and formula objects for all the combinations, then fit them using map(). We combined the start and predicted race names with the linear models, and used glance() and tidy() from the broom package to turn the models into statistics and coefficients.
All of the models between races were highly significant, but many of them contained coefficients that weren’t significantly different from zero, meaning that the term in question (age, gender, or first-race pace) wasn’t adding anything useful to the model. We used the significance of each term to reduce our models so they contained only significant coefficients, then regenerated the statistics and coefficients for these reduced models.
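A sketch of a single race-pair model, assuming the wide data frame above is named speeds_combined as in the appendix code; the full analysis repeats this for every ordered pair of races:

```r
library(broom)

# One race-pair model: predict Santa Claus Half pace from Gold Discovery
# pace, plus gender and age.
model <- lm(santa_claus_half_marathon ~ gold_discovery_run + gender + age,
            data = speeds_combined)

glance(model)  # model-level statistics: R-squared, p-value, etc.
tidy(model)    # per-term coefficients and p-values; terms with
               # p >= 0.05 are dropped to form the reduced model
```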
The full R code appears at the bottom of this post.
Results
Here are the statistics from the ten best-performing models (based on R²).
start_race | predicted_race | n | R² | p-value |
---|---|---|---|---|
run_of_the_valkyries | golden_heart_trail_run | 40 | 0.956 | 0 |
golden_heart_trail_run | equinox_marathon | 36 | 0.908 | 0 |
santa_claus_half_marathon | golden_heart_trail_run | 34 | 0.896 | 0 |
midnight_sun_run | gold_discovery_run | 139 | 0.887 | 0 |
beat_beethoven | golden_heart_trail_run | 32 | 0.886 | 0 |
run_of_the_valkyries | gold_discovery_run | 44 | 0.877 | 0 |
midnight_sun_run | golden_heart_trail_run | 52 | 0.877 | 0 |
gold_discovery_run | santa_claus_half_marathon | 111 | 0.876 | 0 |
chena_river_run | golden_heart_trail_run | 44 | 0.873 | 0 |
run_of_the_valkyries | santa_claus_half_marathon | 91 | 0.851 | 0 |
It’s interesting how many times the Golden Heart Trail Run appears on this list, since that run is something of an outlier in the Usibelli running series: it’s the only race entirely on trails. Maybe it’s because its distance (5K) is comparable to a lot of the earlier races in the season, but because it’s on trails it matches well with the later races, like Gold Discovery or Equinox, that are at least partially on trails.
Here are the ten worst models.
start_race | predicted_race | n | R² | p-value |
---|---|---|---|---|
midnight_sun_run | equinox_marathon | 431 | 0.525 | 0 |
beat_beethoven | hoodoo_half_marathon | 87 | 0.533 | 0 |
beat_beethoven | midnight_sun_run | 818 | 0.570 | 0 |
chena_river_run | equinox_marathon | 196 | 0.572 | 0 |
equinox_marathon | hoodoo_half_marathon | 90 | 0.584 | 0 |
beat_beethoven | equinox_marathon | 265 | 0.585 | 0 |
gold_discovery_run | hoodoo_half_marathon | 41 | 0.599 | 0 |
beat_beethoven | santa_claus_half_marathon | 163 | 0.612 | 0 |
run_of_the_valkyries | equinox_marathon | 125 | 0.642 | 0 |
midnight_sun_run | hoodoo_half_marathon | 118 | 0.657 | 0 |
Most of these models pair shorter races like Beat Beethoven or the Chena River Run with longer races like Equinox or one of the half marathons. Even so, each model explains more than half the variation in the data, which isn’t terrible.
Application
Now that we have all our models and their coefficients, we can use them to make predictions of future performance. I’ve written an online calculator based on the reduced models that lets you predict your race results as you go through the running season. The calculator is here: Fairbanks Running Race Converter.
For example, I ran a 7:41 pace for Run of the Valkyries this year. Entering that, plus my age and gender, into the converter predicts an 8:57 pace for the first running of the HooDoo Half Marathon. The R² for this model was a respectable 0.71 even though only 23 runners ran both races this year (including me). My actual pace for HooDoo was 8:18, so I came in quite a bit faster than predicted. No wonder my knee and hip hurt after the race! Using my time from the Golden Heart Trail Run, the converter predicts a HooDoo Half pace of 8:16.2, less than a minute off my 1:48:11 finish.
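For readers following along in R, a single prediction can be sketched with predict() on a fitted model. Here hoodoo_model and the age value are placeholders, not the actual fitted model or my age; a 7:41 pace is roughly 7.68 minutes per mile:

```r
# Placeholder model object: a reduced lm() predicting HooDoo Half pace
# from Run of the Valkyries pace (plus any terms that survived reduction).
predict(hoodoo_model,
        newdata = data.frame(run_of_the_valkyries = 7 + 41 / 60,
                             gender = factor("M", levels = c("F", "M")),
                             age = 45))
```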
Appendix: R code
library(tidyverse)
library(lubridate)
library(broom)
library(scales)   # pretty_breaks()
library(viridis)  # scale_color_viridis()
races_db <- src_postgres(host="localhost", dbname="races")
combined_races <- tbl(races_db, build_sql(
"SELECT race, year, lower(name) AS name, finish_time,
year - age AS birth_year, sex
FROM chronotrack
UNION
SELECT race, year, lower(name) AS name, finish_time,
birth_year,
CASE WHEN age_class ~ 'M' THEN 'M' ELSE 'F' END AS sex
FROM sportalaska
UNION
SELECT race, year, lower(name) AS name, finish_time,
NULL AS birth_year, NULL AS sex
FROM other"))
races <- tbl(races_db, build_sql(
"SELECT race,
lower(regexp_replace(race, '[ ’]', '_', 'g')) AS race_str,
date_part('year', date) AS year,
miles
FROM races"))
racing_data <- combined_races %>%
inner_join(races) %>%
filter(!is.na(finish_time))
racers <- racing_data %>%
group_by(name) %>%
summarize(races=n(),
birth_year=min(birth_year),
gender_filter=ifelse(sum(ifelse(sex=='M',1,0))==
sum(ifelse(sex=='F',1,0)),
FALSE, TRUE),
gender=ifelse(sum(ifelse(sex=='M',1,0))>
sum(ifelse(sex=='F',1,0)),
'M', 'F')) %>%
ungroup() %>%
filter(gender_filter) %>%
select(-gender_filter)
racing_data_filled <- racing_data %>%
inner_join(racers, by="name") %>%
mutate(birth_year=birth_year.y) %>%
select(name, birth_year, gender, race_str, year, miles, finish_time) %>%
group_by(name, race_str, year) %>%
mutate(n=n()) %>%
filter(!is.na(birth_year), n==1) %>%
ungroup() %>%
collect() %>%
mutate(fixed=ifelse(grepl('[0-9]+:[0-9]+:[0-9.]+', finish_time),
finish_time,
paste0('00:', finish_time)),
minutes=as.numeric(seconds(hms(fixed)))/60.0,
pace=minutes/miles,
age=year-birth_year,
age_class=as.integer(age/10)*10,
group=paste0(gender, age_class),
gender=as.factor(gender)) %>%
filter((miles<10 & pace<16) | (miles>=10 & pace<20)) %>%
select(-fixed, -finish_time, -n)
speeds_combined <- racing_data_filled %>%
select(name, gender, age, age_class, group, race_str, year, pace) %>%
spread(race_str, pace)
main_races <- c('beat_beethoven', 'chena_river_run', 'midnight_sun_run',
'run_of_the_valkyries', 'gold_discovery_run',
'santa_claus_half_marathon', 'golden_heart_trail_run',
'equinox_marathon', 'hoodoo_half_marathon')
race_formula_str <-
lapply(seq(1, length(main_races)-1),
function(i)
lapply(seq(i+1, length(main_races)),
function(j) paste(main_races[[j]], '~',
main_races[[i]],
'+ gender', '+ age'))) %>%
unlist()
race_formulas <- lapply(race_formula_str, function(i) as.formula(i)) %>%
unlist()
lm_models <- map(race_formulas, ~ lm(.x, data=speeds_combined))
models <- tibble(start_race=factor(gsub('.* ~ ([^ ]+).*',
'\\1',
race_formula_str),
levels=main_races),
predicted_race=factor(gsub('([^ ]+).*',
'\\1',
race_formula_str),
levels=main_races),
lm_models=lm_models) %>%
arrange(start_race, predicted_race)
model_stats <- glance(models %>% rowwise(), lm_models)
model_coefficients <- tidy(models %>% rowwise(), lm_models)
reduced_formula_str <- model_coefficients %>%
ungroup() %>%
filter(p.value<0.05, term!='(Intercept)') %>%
mutate(term=gsub('genderM', 'gender', term)) %>%
group_by(predicted_race, start_race) %>%
summarize(independent_vars=paste(term, collapse=" + ")) %>%
ungroup() %>%
transmute(reduced_formulas=paste(predicted_race, independent_vars, sep=' ~ '))
reduced_formula_str <- reduced_formula_str$reduced_formulas
reduced_race_formulas <- lapply(reduced_formula_str,
function(i) as.formula(i)) %>% unlist()
reduced_lm_models <- map(reduced_race_formulas, ~ lm(.x, data=speeds_combined))
n_from_lm <- function(model) {
summary_object <- summary(model)
summary_object$df[1] + summary_object$df[2]
}
reduced_models <- tibble(start_race=factor(gsub('.* ~ ([^ ]+).*', '\\1', reduced_formula_str),
levels=main_races),
predicted_race=factor(gsub('([^ ]+).*', '\\1', reduced_formula_str),
levels=main_races),
lm_models=reduced_lm_models) %>%
arrange(start_race, predicted_race) %>%
rowwise() %>%
mutate(n=n_from_lm(lm_models))
reduced_model_stats <- glance(reduced_models %>% rowwise(), lm_models)
reduced_model_coefficients <- tidy(reduced_models %>% rowwise(), lm_models) %>%
ungroup()
coefficients_and_stats <- reduced_model_stats %>%
inner_join(reduced_model_coefficients,
by=c("start_race", "predicted_race", "n")) %>%
select(start_race, predicted_race, n, r.squared, term, estimate)
write_csv(coefficients_and_stats,
"coefficients.csv")
make_scatterplot <- function(start_race, predicted_race) {
age_limits <- speeds_combined %>%
filter_(paste("!is.na(", start_race, ")"),
paste("!is.na(", predicted_race, ")")) %>%
summarize(min=min(age), max=max(age)) %>%
unlist()
q <- ggplot(data=speeds_combined,
aes_string(x=start_race, y=predicted_race)) +
# plasma works better with a grey background
# theme_bw() +
geom_abline(slope=1, color="darkred", alpha=0.5) +
geom_smooth(method="lm", se=FALSE) +
geom_point(aes(shape=gender, color=age)) +
scale_color_viridis(option="plasma",
limits=age_limits) +
scale_x_continuous(breaks=pretty_breaks(n=10)) +
scale_y_continuous(breaks=pretty_breaks(n=6))
svg_filename <- paste0(paste(start_race, predicted_race, sep="-"), ".svg")
height <- 9
width <- 16
resize <- 0.75
svg(svg_filename, height=height*resize, width=width*resize)
print(q)
dev.off()
}
lapply(seq(1, length(main_races)-1),
function(i)
lapply(seq(i+1, length(main_races)),
function(j)
make_scatterplot(main_races[[i]], main_races[[j]])
  )
)