# Introduction to Data Science at the Middletown Bike Rally

Grit, genes or privilege – which matters most?

The 10th-grade math class at Middletown High School is using data science tools to explain the results of last summer’s Middletown bike rally. To construct a dataset, they use the miles covered by each of the 30 riders in the rally as the dependent variable. The independent explanatory variables they choose are motivation (“grit”), aptitude (“genes”) and the amount each rider’s parents spend on their child’s equipment and training for the rally (“privilege”). Based on answers to a questionnaire, each rider is given a grit score, a gene score and a privilege score.

What you’ll learn

• How correlation between variables is measured and interpreted.
• How simple and multiple regression equations are estimated and interpreted.
• How and why repeated trials of a random variable that is a sum of random variables approach a normal distribution.
• How probability distributions for random variables are constructed from regression equations.
• How probability distributions are used to predict outcomes for random variables.

Course Content

• Introduction –> 1 lecture • 4min.
• The Middletown Bike Rally –> 2 lectures • 10min.
• Explaining rider performance –> 3 lectures • 21min.
• Predicting rider performance –> 4 lectures • 32min.

Requirements

The class uses several data science tools to analyse the dataset, to determine how much of the variation in rider performance is explained by grit, genes and privilege, and to predict rider performance in next summer’s rally.

They begin by looking at the correlations between rider performance and the explanatory variables. They learn how correlation is calculated, and how to interpret strong, weak, positive and negative correlation.
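
The correlation calculation can be sketched in a few lines. This is a minimal illustration of the Pearson correlation coefficient, assuming made-up scores for five riders; the names `pearson_r`, `grit` and `miles` and all of the numbers are hypothetical and are not the class's data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical grit scores and miles ridden for five riders.
grit = [2, 4, 5, 7, 9]
miles = [20, 26, 27, 33, 38]

r = pearson_r(grit, miles)
print(round(r, 3))
```

A value near +1 indicates strong positive correlation, a value near -1 strong negative correlation, and a value near 0 weak or no linear relationship.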

The class then performs simple regressions on rider performance using each of the explanatory variables in turn. Each regression produces an equation whose coefficient and constant describe the relationship between rider performance and grit, genes or privilege.
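
As a rough sketch of what a simple regression computes, the following fits a least-squares line to one explanatory variable. The function name `simple_regression` and the data are hypothetical, chosen only to illustrate how the coefficient and constant arise.

```python
def simple_regression(xs, ys):
    """Least-squares fit of ys = constant + coefficient * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    coefficient = sxy / sxx
    constant = mean_y - coefficient * mean_x
    return coefficient, constant

# Hypothetical grit scores and miles ridden for five riders.
grit = [2, 4, 5, 7, 9]
miles = [20, 26, 27, 33, 38]

coefficient, constant = simple_regression(grit, miles)
print(f"miles = {constant:.2f} + {coefficient:.2f} * grit")
```

The coefficient is the estimated change in miles for a one-point change in the score, and the constant is the estimated miles for a score of zero.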

The class then looks at how the R-squared value reported in each regression is calculated. R-squared measures the proportion of the variation in the dependent variable, usually quoted as a percentage, that is explained by variation in the explanatory variable.
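
One common way to compute R-squared is one minus the ratio of unexplained variation (squared residuals) to total variation around the mean. The sketch below assumes that definition and hypothetical data; `r_squared` is an illustrative name.

```python
def r_squared(xs, ys):
    """R-squared of a simple least-squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # Variation left unexplained by the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation of ys around their mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical grit scores and miles ridden for five riders.
grit = [2, 4, 5, 7, 9]
miles = [20, 26, 27, 33, 38]

r2 = r_squared(grit, miles)
print(f"{r2:.1%} of the variation in miles is explained by grit")
```

For a simple regression, this value equals the square of the correlation coefficient between the two variables.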

To understand the combined explanatory effect of grit, genes and privilege on rider performance, the class proceeds to use multiple regression on the dataset. Multiple regression estimates coefficients and a constant for a single equation that includes all three explanatory variables.
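
To illustrate, a multiple regression can be estimated by solving the least-squares normal equations. The sketch below does this in plain Python with Gaussian elimination; it is one possible approach, not necessarily the tool the class uses, and the names `solve`, `fit_multiple` and all scores are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_multiple(rows, ys):
    """Return [constant, coef_1, ..., coef_k] minimising squared error."""
    X = [[1.0] + list(r) for r in rows]  # leading 1.0 gives the constant
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical (grit, genes, privilege) scores and miles for six riders.
scores = [(2, 5, 1), (4, 3, 6), (5, 8, 2), (7, 6, 8), (9, 4, 3), (3, 7, 5)]
miles = [18, 20, 28, 31, 30, 24]

constant, b_grit, b_genes, b_privilege = fit_multiple(scores, miles)
print(f"miles = {constant:.2f} + {b_grit:.2f}*grit "
      f"+ {b_genes:.2f}*genes + {b_privilege:.2f}*privilege")
```

Each coefficient estimates the effect of one score on miles while the other two scores are held fixed.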

Having estimated the equation that best explains rider performance last summer, the class then learns how the regression equation can be used to predict rider performance in next summer’s rally.

The starting point here is to understand that rider performance next summer can be seen as a random variable, because it is the sum of random variables, each represented by one of the terms of the regression equation.

The class then looks at frequency distributions that result after multiple trials of a random variable that is the sum of random variables. They see that as the number of trials increases, the distribution takes on the bell shape of the so-called normal distribution.
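
A small simulation shows the effect. As a stand-in for the terms of a regression equation, the sketch below sums three dice rolls per trial and tallies the totals; the dice, the function name `simulate_sums` and the seed are assumptions made only for illustration.

```python
import random
from collections import Counter

def simulate_sums(n_trials, n_terms=3, seed=0):
    """Each trial is the sum of n_terms independent dice rolls."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(rng.randint(1, 6) for _ in range(n_terms))
            for _ in range(n_trials)]

sums = simulate_sums(10_000)
counts = Counter(sums)
mean_sum = sum(sums) / len(sums)

# Text histogram: one '#' per 25 trials that produced each total.
for total in range(3, 19):
    print(f"{total:2d} {'#' * (counts[total] // 25)}")
```

With few trials the histogram is ragged; with many trials the totals pile up around the middle and thin out symmetrically toward the extremes.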

Next, the class considers how a frequency distribution can also be read as a probability distribution. The class learns how to build a normal probability distribution for a random variable from the variable’s mean, or expected value, and its standard error, which measures how widely repeated trials of the variable are spread around the mean.
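
In Python, a normal distribution defined by a mean and a spread can be sketched with the standard library's `statistics.NormalDist`. The expected value of 30 miles and standard error of 4 miles below are hypothetical numbers, not results from the course.

```python
from statistics import NormalDist

expected_miles = 30.0  # hypothetical mean (expected value)
standard_error = 4.0   # hypothetical spread around the mean

dist = NormalDist(mu=expected_miles, sigma=standard_error)

# Height of the bell curve at the mean.
density_at_mean = dist.pdf(expected_miles)

# Probability of landing within one standard error of the mean.
p_within_one_se = (dist.cdf(expected_miles + standard_error)
                   - dist.cdf(expected_miles - standard_error))
print(round(p_within_one_se, 3))
```

Roughly two-thirds of outcomes fall within one standard error of the mean, whatever the particular mean and standard error are.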

The class is now ready to use the multiple regression equation to build the probability distribution for a rider’s performance next summer. For any given rider, the equation calculates the expected number of miles the rider will cover, based on that rider’s grit, gene and privilege scores. The regression also reports the standard error of the estimate, which supplies the spread of the distribution.
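
The two ingredients can be sketched as follows. The coefficients, the rider's scores and the actual and fitted miles are all hypothetical, and the standard error is computed with the usual formula of residual sum of squares divided by n minus the number of predictors minus one, which is assumed here rather than taken from the course.

```python
import math

def predict(coefs, scores):
    """Expected miles: constant plus coefficient times score for each variable."""
    constant, *slopes = coefs
    return constant + sum(b * s for b, s in zip(slopes, scores))

def standard_error_of_estimate(actual, fitted, n_predictors):
    """Square root of residual sum of squares over degrees of freedom."""
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    degrees_of_freedom = len(actual) - n_predictors - 1
    return math.sqrt(ss_res / degrees_of_freedom)

# Hypothetical equation: constant, then grit, genes and privilege coefficients.
coefs = [5.0, 2.0, 1.5, 0.5]
gina_scores = (7, 6, 4)  # hypothetical grit, gene and privilege scores

gina_expected = predict(coefs, gina_scores)

# Hypothetical actual and fitted miles for six riders from last summer.
actual = [20, 26, 27, 33, 38, 29]
fitted = [21, 25, 28, 32, 37, 30]
se = standard_error_of_estimate(actual, fitted, n_predictors=3)

print(gina_expected, round(se, 3))
```

Together, the expected value and the standard error define the normal distribution used for that rider's prediction.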

In the final stage of the analysis, the class uses probability distributions to calculate the odds of various outcomes in next summer’s rally: for example, the odds that Gina will ride more than 35 miles, or the odds that Gina will ride farther than her brother Joey.
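
Both kinds of question can be sketched with `statistics.NormalDist`. The expected miles and standard errors below are hypothetical, and the comparison between the two riders additionally assumes their results are independent, which may not hold for siblings.

```python
import math
from statistics import NormalDist

# Hypothetical expected miles and standard errors for the two riders.
gina = NormalDist(mu=30.0, sigma=4.0)
joey = NormalDist(mu=28.0, sigma=4.0)

# Probability that Gina rides more than 35 miles: area to the right of 35.
p_over_35 = 1 - gina.cdf(35.0)

# Gina minus Joey is also normal if the two results are independent (assumed):
# the means subtract and the variances add.
difference = NormalDist(mu=gina.mean - joey.mean,
                        sigma=math.hypot(gina.stdev, joey.stdev))
p_gina_beats_joey = 1 - difference.cdf(0.0)

print(round(p_over_35, 3), round(p_gina_beats_joey, 3))
```

These are probabilities between 0 and 1; a probability p corresponds to odds of p to (1 - p) in favour of the outcome.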
