Bayesian Nonparametrics for Marathon Modeling

06 October 2014

Bayesian nonparametric (BNP) models have several desirable properties. Their most recognizable benefit is that they avoid the need to specify a closed model, letting the data choose the model (or models) that describes it best and thereby providing competitive predictions [6]. In this paper, we focus instead on their generative property, which explains the data in an accessible way and allows us to form hypotheses and draw conclusions from the results. This ability facilitates collaboration with experts in other fields and avoids the black-box flavor frequently found in other methods. Examples of descriptive analyses can be found in psychiatry [15], genetics [19], topic modeling [7], image segmentation [9], speaker diarization [8], and tracking [11].
BNP models are extremely flexible, and they can also provide accurate predictions with a structure that is not necessarily interpretable. If we want both insightful descriptive conclusions and accurate predictions, we need to specify the prior so that it points towards the sought explanation; that is, we need to add our prior information to the model. In this way, the first insights should not be foreign to us. This makes the model trustworthy for experts in other fields, so that further conclusions that were not common knowledge can be taken as plausible. At this stage we can formulate hypotheses that can be tested with future data and that may provide previously unknown insights about the problem. In a way, most BNP models are described as general priors [17, 12, 18, 10] that are applicable to a large number of problems. We believe that BNP models will be useful to non-machine-learning experts if we can constrain the priors to provide accurate and interpretable solutions.
In this paper, we present a novel application of BNP models to marathon runners. We aim to analyze the data from different perspectives in order to find hidden properties of the athletes while providing accurate predictions.
We resort to a nonparametric model instead of a parametric one to leave room for the unexpected. We build a model to fairly compare the finishing times of runners of different ages and sexes. This has several applications. First, some marathons award entry to participants based on their best marathon time in the previous 12 months. The entry requirements vary considerably from one event to the next, as there is no widely accepted standard method for specifying them. Second, World Masters Athletics (WMA) has an age-graded system [3] for equalizing finishing times according to age and sex, and lobbies for this measure to be taken into consideration when selecting the winners of each race. However, this procedure is based on world records, that is, on extreme outliers, which might not be very representative, or even realistic.
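
To make the idea of a record-based age grade concrete, here is a minimal sketch in Python. It assumes the commonly described form of age grading, in which a runner's score is the ratio between a reference time for their age and sex and their own finishing time. The function name `age_graded_score` and the table `REFERENCE_SECONDS` are hypothetical, and the reference times are made-up placeholders rather than actual WMA standards; this is not the model proposed in the paper.

```python
# Hypothetical reference times in seconds for a (sex, age) group.
# These numbers are placeholders for illustration only, NOT real WMA standards.
REFERENCE_SECONDS = {
    ("M", 30): 7400.0,
    ("M", 60): 9000.0,
    ("F", 30): 8200.0,
    ("F", 60): 10400.0,
}


def age_graded_score(finish_seconds, sex, age):
    """Return reference time divided by finishing time for the runner's group.

    A score of 1.0 means the runner matched the reference time for their
    age and sex; lower scores mean slower relative performances.
    """
    if finish_seconds <= 0:
        raise ValueError("finish_seconds must be positive")
    reference = REFERENCE_SECONDS[(sex, age)]
    return reference / finish_seconds


# Compare two runners from different groups on the same scale.
younger_man = age_graded_score(10000.0, "M", 30)
older_woman = age_graded_score(13000.0, "F", 60)
print(f"M30: {younger_man:.2f}, F60: {older_woman:.2f}")
```

Because every score is anchored to a single reference time per group, the comparison is only as representative as those reference times, which is the limitation pointed out above when they are taken from world records.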