# Cycling Power Predictions II

In my first attempt, I set out to see how accurately you could predict your power output during cycling from features such as cadence, speed and heartrate. I extracted the data from roughly 50 of my previous ride files, during which I used a PowerTap Hub to measure my true power output. I also measured my speed, elevation, heartrate and cadence through other devices.

I carried out some basic feature engineering to create variables such as change in heartrate, change in speed, and the steepness of the road. I fitted numerous models to the data and compared their results. Judging by the feature importances generated from the Random Forest model, the most important variables turned out to be the un-engineered cadence and heartrate.

## Physics-Inspired Features

The question I want to answer here is: can we improve on our feature engineering? More specifically, can we use the physical equations of motion to construct better features? I mentioned in the first attempt that Strava provides a power estimate based on simple physical assumptions – it estimates your power output by balancing all the relevant forces that act on you as a cyclist. Let’s begin with the famous equation that states that the net force acting on a body equals the mass times the acceleration, $F_{net} = ma$.

We can now write the net force $F_{net}$ as the sum of the individual forces that act on a cyclist as the following $F_{net} = F_{output} - F_{air} - F_{grav} - F_{rolling} - F_{brake} ,$

where $F_{output}$ is the force being output by the cyclist (which is what we’re trying to predict), $F_{grav}$ is the force of gravity (this can be positive or negative depending on whether we’re going up or downhill), $F_{air}$ is aerodynamic drag, $F_{rolling}$ is the rolling resistance between the tyres and road, and finally $F_{brake}$ is the force of the cyclist applying the brakes.

We now want to write down physical equations for each force. The rolling resistance will not vary much during a ride so we will ignore this term, and we cannot predict when a cyclist will brake either so this term will also be excluded. That leaves us with the aerodynamic drag and the force of gravity. The force caused by aerodynamic drag has the following equation $F_{air} = \frac{1}{2} C_d A \rho v^2,$

where $C_d$ is the coefficient of drag (i.e. how much drag the wind creates per unit area), $A$ is the cyclist’s total frontal area that the wind ‘sees’, $\rho$ is the density of air, and $v$ is the speed of the cyclist. The drag coefficient multiplied by the frontal area produces the total aerodynamic drag area ‘$C_dA$’, which cyclists are obsessed with reducing. The equation for the gravitational term is given by $F_{grav} = m g \sin(\theta),$

where $m$ is the total mass of the cyclist plus the bike, $g = 9.81 \text{ m/s}^2$ is the gravitational acceleration, and $\theta$ is the angle of the road relative to a flat surface ($\theta = 0$ is completely flat, while $\theta = 90$ degrees is a vertical wall). We now have formulas for the forces of aerodynamic drag and gravity, but we’re interested in predicting the power generated by the cyclist, not the force. This is an important difference, as power is the energy generated (work done) by a force per second. To convert the above force terms to power, we simply multiply by the speed, leading to $P_{air} = \frac{1}{2} C_d A \rho v^3,$ $P_{grav} = m g \sin(\theta) v.$
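To get a feel for the magnitudes involved, here is a quick back-of-the-envelope evaluation of the aerodynamic power term. The parameter values are illustrative ballpark figures, not taken from the ride data:

```python
# P_air = 0.5 * CdA * rho * v^3 at a typical riding speed
cda = 0.3     # drag area CdA in m^2 (typical road riding position)
rho = 1.225   # air density in kg/m^3 (sea level)
v = 10.0      # speed in m/s (36 km/h)

p_air = 0.5 * cda * rho * v**3
print(round(p_air, 2))  # -> 183.75 watts just to overcome drag
```

Because of the cubic dependence, doubling the speed multiplies the aerodynamic power by eight, which is why $C_dA$ matters so much at race speeds.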

These terms tell us how aerodynamic drag and gravity contribute to the kinetic energy change of the cyclist. Ignoring the constants, we can see that these two power terms are proportional to the following quantities $P_{air} \propto (\text{speed})^3,$ $P_{grav} \propto \text{gradient}\times\text{speed},$

where we have assumed $\sin(\theta) \approx \tan(\theta) = \text{gradient}$, which is not a bad approximation, as changes in elevation are small compared to changes in horizontal distance. These two quantities, $(\text{speed})^3$ and $\text{gradient}\times\text{speed}$, are good candidates for features as they tell us how the power demands on the cyclist change in time. The speed data is stored in the .fit files, and we can easily calculate the gradient.
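As a sketch, the two candidate features can be computed directly from the speed and elevation arrays; the array values below are hypothetical 1 Hz samples, while the real extraction lives in dataPrep2.py:

```python
import numpy as np

# hypothetical 1 Hz samples of speed (m/s) and elevation (m)
speed = np.array([8.0, 8.5, 9.0, 9.2, 9.1])
elevation = np.array([100.0, 100.2, 100.5, 100.9, 101.0])

# horizontal distance travelled (1-second samples)
distance = np.cumsum(speed)

# gradient = rise over run ~ tan(theta) ~ sin(theta) for small angles
gradient = np.gradient(elevation, distance)

feat_speed_cubed = speed**3         # proportional to aerodynamic power
feat_grad_speed = gradient * speed  # proportional to gravitational power
```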

One problem with these two new features is that although the wind and gradient may change, this does not necessarily mean the cyclist will increase or decrease their power. For example, when going downhill, most cyclists will reduce their power output and freewheel, even though the wind resistance has increased dramatically. To try to make an even better feature, let’s consider the balance of all the power terms (ignoring rolling resistance) $mav = P_{output} - P_{air} - P_{grav},$

which is $F_{net} = ma$ multiplied by the speed $v$. Rearranging for $P_{output}$ and substituting our equations for $P_{air}$ and $P_{grav}$ gives $P_{output} = mav + \frac{1}{2} C_d A \rho v^3 + m g \sin(\theta) v.$

This gives us a rough estimate of the cyclist’s power output by balancing the dominant forces. It should hopefully form a better feature than $(\text{speed})^3$ and $\text{gradient}\times\text{speed}$, as it should more accurately track changes in the true power output of the cyclist. We’ll use ballpark values for our parameters: $m = 80 \text{ kg}$, $C_d A = 0.3 \text{ m}^2$, $\rho = 1.225 \text{ kg/m}^3$. The remaining variables (speed, gradient and acceleration) can be calculated from the ride data.

We now have three new features to add to our dataset:

| New Feature | Equation | Physical Origin |
| --- | --- | --- |
| 1 | $v^3$ | Proportional to wind power |
| 2 | $\sin(\theta)\, v$ | Proportional to gravitational power |
| 3 | $mav + \frac{1}{2} C_d A \rho v^3 + m g \sin(\theta) v$ | Rough estimate of cyclist’s power output |
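A minimal sketch of how the third feature could be computed, using the ballpark parameter values quoted above (the function name and example inputs are illustrative):

```python
def crude_power_estimate(v, accel, gradient,
                         m=80.0, cda=0.3, rho=1.225, g=9.81):
    """Rough physics-based estimate of the cyclist's power output:
    P = m*a*v + 0.5*CdA*rho*v^3 + m*g*sin(theta)*v, with sin(theta)
    approximated by the road gradient (small-angle approximation)."""
    return m * accel * v + 0.5 * cda * rho * v**3 + m * g * gradient * v

# e.g. holding a steady 10 m/s (36 km/h) up a 3% gradient
print(round(crude_power_estimate(10.0, 0.0, 0.03), 2))  # -> 419.19 watts
```

Applied element-wise to the speed, acceleration and gradient arrays, this gives a per-second power estimate for every point of the ride.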

We now want to see if these three new features improve the accuracy of our models. To compare with the results from my first attempt, I’ll use the exact same models with the exact same parameters, but this time we’ll introduce the three new features above in addition to the original features used previously. The only model change I will make is to introduce three more hidden units in each hidden layer of the neural network; this ensures that the number of hidden units is one more than the number of features.

The dataset is formed using the functions defined in the python script titled ‘dataPrep2.py’, which can be found in the GitHub repository for this post. The data is extracted from the .fit files and saved as a numpy array. Let’s now load this data.

In :
import numpy as np
from dataPrep2 import loadDataset2   # helper defined in the repo's dataPrep2.py

xTrain, yTrain, xTest, yTest = loadDataset2()

print "Total number of observations: ", xTrain.shape[0] + xTest.shape[0]
print "       Training observations: ", xTrain.shape[0]
print "           Test observations: ", xTest.shape[0]
print "\n Number of inputs/features: ", xTrain.shape[1]
print " Number of outputs/targets: ", 1

Total number of observations:  177330
Training observations:  137208
Test observations:  40122

Number of inputs/features:  9
Number of outputs/targets:  1

In :
# plot 20 minutes of data of our 3 new features

import matplotlib.pyplot as plt
%matplotlib inline

fig, (ax0,ax1,ax2) = plt.subplots(3,1,sharex=True)
t = np.linspace(1,1200,1200)

ax0.plot( t, xTrain[-1200:,-3], color='darkblue', label='speed*gradient' )
ax0.set_xlim( (1,1200) )
ax0.yaxis.grid(True)
ax0.set_title('New Feature 1')

ax1.plot( t, xTrain[-1200:,-2], color='darkred', label='speed cubed' )
ax1.set_xlim( (1,1200) )
ax1.yaxis.grid(True)
ax1.set_title('New Feature 2')

ax2.plot( t, xTrain[-1200:,-1], color='darkgreen', label='crude power estimate')
ax2.set_xlim( (1,1200) )
ax2.yaxis.grid(True)
ax2.set_title('New Feature 3')
ax2.set_xlabel('Time (seconds)')

fig.set_size_inches( (12,9) )

plt.show()

It can be seen that the third feature, the estimate of the cyclist’s power based on simple physical assumptions, has unrealistic positive spikes (I’d like to see any cyclist put out over 6000 watts). The power also drops below zero at points; this may be because we are ignoring rolling resistance and any braking that might occur. If a cyclist applies the brakes, this estimate might interpret the resulting deceleration as the cyclist applying negative power.

These issues, however, do not matter too much. We’re using this power estimate as a feature to increase the amount of information we have on the variability of the true power output of the cyclist. The power estimate feature doesn’t have to be physically consistent; it simply has to contain potentially useful information.

Before training our models, we’ll clean up and standardise our data as before.

In :
# Remove NaNs and standardise data

from sklearn.preprocessing import Imputer, StandardScaler

# convert all infinite values to nans
xTrain[ np.isinf( xTrain ) ] = np.nan
xTest[ np.isinf( xTest ) ] = np.nan

# construct imputer and scaler
myImputer = Imputer( missing_values='NaN', strategy='median', axis=0 )
myScaler = StandardScaler()

# impute NaNs
xTrain = myImputer.fit_transform( xTrain )
xTest = myImputer.transform( xTest )

# remove mean and divide by standard deviation
xTrain = myScaler.fit_transform( xTrain )
xTest = myScaler.transform( xTest )


## Construct and Train Models

Use the same models as before:

• Multiple Linear Regression.
• K-Nearest Neighbour Regressor.
• Random Forest Regressor.
• Multi-Layer Perceptron Regressor.
In :
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

# model names
modelNames = [ 'Linear Regression', 'K-Nearest', 'Random Forest', 'Neural Network' ]

# construct models
modelList =  [ LinearRegression(),
KNeighborsRegressor( n_neighbors=75 ),
RandomForestRegressor( n_estimators=500, n_jobs=3 ),
MLPRegressor( hidden_layer_sizes=(10,10), learning_rate_init=0.001,
alpha=0.0, max_iter=1000, tol=0.0 )
]

# train each model
for model in modelList:

    print "\nCurrently training:\n\t", model, "\n"
    model.fit( xTrain, yTrain )
    print "Done!"

Currently training:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Done!

Currently training:
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=75, p=2,
weights='uniform')

Done!

Currently training:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=3,
oob_score=False, random_state=None, verbose=0, warm_start=False)

Done!

Currently training:
MLPRegressor(activation='relu', alpha=0.0, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(10, 10), learning_rate='constant',
learning_rate_init=0.001, max_iter=1000, momentum=0.9,
nesterovs_momentum=True, power_t=0.5, random_state=None,
verbose=False, warm_start=False)

Done!


## Model Evaluation

Previously we used the coefficient of determination $R^2$ to compare our models. The $R^2$ score equals one for perfect predictions, zero for simply using the average power value, and is negative for really bad models. The original scores we are trying to improve on are listed below.

| Model | Score |
| --- | --- |
| Multiple Linear Regression | 0.0614 |
| K-Nearest Neighbour | 0.0929 |
| Random Forest | 0.0913 |
| Neural Network | 0.0945 |

Let’s see if these scores change by introducing our three new physics-inspired features.

In :
# loop through each model and calculate the score
for i, model in enumerate( modelList ) :

    print modelNames[i], ":", np.round( model.score( xTest, yTest ), 4 )

Linear Regression : 0.0633
K-Nearest : 0.0924
Random Forest : 0.0928
Neural Network : 0.0958


We do see some modest improvements, with the linear regression improving the most. This is likely because the new features are nonlinear functions of the original data, so the linear model is now regressing onto features which vary more linearly with the true power. This increases the linear model’s predictive power.

The K-Nearest Neighbour model score decreases slightly. The performance of a K-Nearest Neighbour model generally degrades as the number of features grows; this is because notions of similarity/distance between data points distort in a high-dimensional space. The decrease in score, however, is small and may not be statistically significant, so we should refrain from over-interpreting such changes.
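This distance-concentration effect is easy to demonstrate with random data (a standard illustration, not part of the original analysis): as the number of dimensions grows, the nearest and farthest neighbours of a point become nearly equidistant.

```python
import numpy as np

rng = np.random.RandomState(0)

def distance_spread(d, n=500):
    """Ratio of nearest to farthest neighbour distance from one point
    to n-1 random points in d dimensions; tends towards 1 as d grows."""
    x = rng.rand(n, d)
    dists = np.linalg.norm(x[1:] - x[0], axis=1)
    return dists.min() / dists.max()

for d in [2, 10, 100, 1000]:
    print(d, round(distance_spread(d), 3))  # ratio creeps towards 1
```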

Both the Random Forest and Neural Network model scores increase, and this is likely due to the extra information supplied to the models by the new features. Why not form every conceivable combination of the original features and feed them in too? This would increase the amount of information the models can utilise and therefore lead to more accurate predictions. There are a few issues with this approach. First, it would dramatically increase the computational cost of training these models, potentially making the training intractable. Second, if the number of features becomes comparable to the number of training examples, you’re at risk of overfitting and will then have to employ some form of regularisation. Third, you may lose some interpretability of your features if you start making increasingly complex functions of the original features (this, however, may not be an issue depending on the goal of the project).

Finally, let’s look at the feature importances generated by the Random Forest and see how these new features compare with regard to predictive skill.

In :
# extract feature importances from random forest model
featImpo = modelList[2].feature_importances_   # random forest is the third model in the list

# make list of feature names
features = [ 'Cadence', 'Speed', 'Heartrate',
'Gradient', 'Change in \nheartrate', 'Change in \nspeed',
'Gradient*speed', 'Speed cubed', 'Power estimate' ]

# plot a bar chart
plt.barh( np.linspace(1,9,9), featImpo, color='red', edgecolor='black', linewidth=2 )

fig, ax = plt.gcf(), plt.gca()

fig.set_size_inches( (12,10) )
ax.set_xlabel( 'Feature Importance', fontsize=15 )
ax.yaxis.grid(True)
ax.yaxis.set_ticks( np.linspace(1,9,9) )
ax.yaxis.set_ticklabels([])
ax.set_axisbelow(True)
ax.set_title('Feature Importance from Random Forest Model', fontsize=15, fontweight='bold' )

# annotate
for i in range(9) :
    if i<6 :
        plt.text( -0.01, i+1, features[i], fontsize=15, va='center', ha='right' )
    else :
        plt.text( -0.01, i+1, features[i], fontsize=15, va='center', ha='right', fontweight='bold' )

plt.show()

Heartrate and cadence remain the most useful features, but the gradient$\times$speed feature moves into third place. This is interesting: individually, gradient and speed are the least important features, yet the interaction between them (their product) becomes useful. The speed cubed feature didn’t turn out to be that important; despite being proportional to the aerodynamic drag, the true power output must not vary much with changes in the wind forcing.

Which features does the linear regression rely on the most? Since all the input features have been standardised, we can directly compare the magnitudes of the resulting linear regression coefficients. Let’s extract them and compare them as we did with the feature importances above.

In :
coef = modelList[0].coef_   # linear regression is the first model in the list

# plot a bar chart
plt.barh( np.linspace(1,9,9), np.abs( coef ), color='brown', edgecolor='black', linewidth=2 )

fig, ax = plt.gcf(), plt.gca()

fig.set_size_inches( (12,10) )
ax.set_xlabel( 'Magnitude of coefficient', fontsize=15 )
ax.yaxis.grid(True)
ax.yaxis.set_ticks( np.linspace(1,9,9) )
ax.yaxis.set_ticklabels([])
ax.set_axisbelow(True)
ax.set_title('Coefficient Magnitudes of Linear Regression Model', fontsize=15, fontweight='bold' )

# annotate
for i in range(9) :
    if i<6 :
        plt.text( -1, i+1, features[i], fontsize=15, va='center', ha='right' )
    else :
        plt.text( -1, i+1, features[i], fontsize=15, va='center', ha='right', fontweight='bold' )

plt.show()

We see that cadence is only the fourth most important feature within the linear regression model, and this makes sense: when pedalling, a cyclist can exhibit a range of power values for the same cadence value, but if the cadence drops to zero, then the power must also drop to zero. This is a highly nonlinear relationship, so it’s not surprising that the linear model can’t capture that behaviour. We also see that our physics-inspired features are relatively useful within the linear regression model.

To sum up, you can improve model performance by engineering features based on the physical equations of the system. The code for this post can be found in this GitHub repo. Thank you to Reddit user nickl for suggesting looking into this type of feature engineering.

# Cycling Power Without a Power Meter

Power meters within cycling have exploded in popularity in recent years. Cyclists are now more aware of the benefits of knowing their power output. This is useful for tracking performance, measuring functional threshold power and calories burned, as well as completing highly specific and targeted training sessions. It also gives you one of the best metrics for comparing your fitness with others: the number of watts you can generate per kilo of body weight. When used in combination with heartrate, power meters allow you to monitor long-term changes in your fitness and training load, with the bonus of being able to spot if you’re overtraining.

Despite the recent reduction in the price of power meters, they’re still not cheap. A new unit (whether it is hub based, crank based, pedal based, etc.) will set you back at least £400-£500, with certain brands costing a couple of grand. Wouldn’t it be great if you could have an accurate measure of your power output without having to fork out for (yet another) expensive piece of cycling kit? My wallet is still recovering from my recent carbon wheelset purchase…

Predicting power without a power meter, however, is very difficult. When you upload a ride to Strava, it takes your ride data and estimates your power output based on a physical model which makes assumptions about wind drag and rolling resistance. These estimates are OK, but they generally underestimate average power and do not capture the temporal variability accurately. They also don’t include effects such as drafting behind traffic or braking on a descent.

Strava uses a physical model to estimate power – what about a purely statistical model that relies on proxies of power? This is exactly what the PowerTap PowerCal Heartrate Monitor does; it measures your heartrate and then uses an algorithm to estimate your power output. The cycling community was skeptical of such an approach, but an in-depth review showed it wasn’t actually that bad. Indeed, I owned one myself for a while and was surprised at how well it did for a purely statistical approach (here’s a ride I did while wearing the PowerCal back in 2015).

In this blog post, I’m going to attempt the statistical approach to predicting power, but using a range of machine learning algorithms. Machine learning algorithms are great at finding relationships between variables that a human may not necessarily be able to spot, or explicitly code. In general, supervised machine learning algorithms learn a mapping between input and output variables. In this post I will collect data from many of my previous rides in order to train models to predict power from various input variables. The best predictors of power will be identified and various models will be compared.

## Preparing the data

At any particular point in time, we want to take certain variables (e.g., heartrate or acceleration) and make a prediction for the corresponding power output. We therefore need enough data to train our model and make accurate predictions. We also don’t want to train our model on a single ride; this will likely lead to overfitting and won’t generalise well to other situations.

To do this, I use the .fit files from my rides between 18th August 2017 and 5th November 2017. I haven’t shown the code here, but I use a python script to loop through all the .fit files, extract the data I need, and calculate any quantities I believe will be useful in predicting power. If you want to see how to extract data from your own .fit files generated by a GPS device, see this previous post.

I then save a single numpy array that contains all these quantities, from every ride, where each row corresponds to a different point in time. The particular quantities I use are defined below – some features are already stored in the .fit file while some require engineering.

### Output/targets

• Power. This is simply the quantity we are trying to predict and is in watts. Any .fit files that do not contain power values are ignored.

### Input/features

• Cadence. How fast you are pedalling: the number of revolutions of the pedals per minute. If you stop pedalling, then cadence goes to zero, and therefore power goes to zero. We therefore expect cadence to contain some predictive skill.
• Speed. If you’re moving along a flat road and increase your power output, then your speed will also increase until the wind resistance balances the increased force applied to the pedals. Obviously this relationship breaks down in many situations so it is unclear how useful speed will be.
• Heartrate. As we saw in a previous post, heart rate is correlated with your power output, so should be a useful variable to feed into our model.
• Change in heartrate. How fast our heartrate changes may also be useful in predicting power. For example, if you go from riding steadily to a full-blown sprint, you will see a sharp increase in heartrate. This quantity has to be calculated manually and is formed by taking the current heartrate value and subtracting the heartrate value from ten seconds previously.
• Gradient. If you start cycling up a steep hill, it’s likely your power output increases (and conversely, if you’re going downhill, your power is likely to decrease as you rest/recover). We therefore include the gradient of the road as a feature. We calculate the gradient by taking the elevation and subtracting the elevation from ten seconds before; we then divide by the distance travelled during this ten-second period. This gives us the vertical change in height per metre of horizontal distance moved.
• Change in speed. Our final feature is the change in speed, i.e. a measure of the acceleration. This is calculated in a similar way to the previous features: we subtract the speed from ten seconds previously from the current speed. It may be that the change in speed, and not the speed itself, is a good predictor of power output.
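The engineered features above are all built from ten-second lagged differences, which can be sketched in a few lines of numpy (the array names are hypothetical; the actual extraction code loops over the .fit files):

```python
import numpy as np

def lagged_diff(x, lag=10):
    """Current value minus the value `lag` samples (= seconds at 1 Hz)
    earlier; the first `lag` entries have no history, so they stay NaN."""
    out = np.full(x.shape, np.nan)
    out[lag:] = x[lag:] - x[:-lag]
    return out

heartrate = np.linspace(100.0, 159.0, 60)  # hypothetical 1 Hz heartrate trace
d_heartrate = lagged_diff(heartrate)       # change in heartrate over 10 s

# the gradient feature divides two such differences:
# lagged_diff(elevation) / lagged_diff(distance)
```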

Let’s load the data (which has been previously extracted from .fit files) from the file cyclingData.npy. This contains the seven variables above aggregated from 54 rides.

In :
import numpy as np
from dataPrep import loadDataset   # helper defined in the repo's dataPrep.py

xTrain, yTrain, xTest, yTest = loadDataset()

print "Total number of observations: ", xTrain.shape[0] + xTest.shape[0]
print "       Training observations: ", xTrain.shape[0]
print "           Test observations: ", xTest.shape[0]
print "\n Number of inputs/features: ", xTrain.shape[1]
print " Number of outputs/targets: ", 1

Total number of observations:  177330
Training observations:  137208
Test observations:  40122

Number of inputs/features:  6
Number of outputs/targets:  1


I wrote the loadDataset function to load the data and split it into training and test datasets. Six rides were used as test data – this corresponds roughly to a 77:23 split. If the data were randomly split into training and test datasets, then you would have data from the same ride within both datasets; this would lead to an artificial boost in accuracy, as the models would have already seen some of that ride previously. That is why we don’t want to split any one ride between training and test datasets.
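A sketch of this ride-level split; the ride-ID bookkeeping here is hypothetical (the real split lives in loadDataset), but the idea is the same:

```python
import numpy as np

def split_by_ride(x, y, ride_ids, test_rides):
    """Hold out whole rides: every second of a ride lands entirely in
    either the training set or the test set, never both."""
    test_mask = np.isin(ride_ids, test_rides)
    return x[~test_mask], y[~test_mask], x[test_mask], y[test_mask]

# three tiny hypothetical rides; ride 2 held out for testing
x = np.arange(12.0).reshape(6, 2)
y = np.arange(6.0)
ride_ids = np.array([0, 0, 1, 1, 2, 2])
xTr, yTr, xTe, yTe = split_by_ride(x, y, ride_ids, test_rides=[2])
```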

The python script titled dataPrep.py, which contains the functions to form the dataset from the .fit files and to load the data, can be found in the GitHub repo for this post.

### Preprocessing

The data is fairly clean already, but there are a few NaNs in the input data. There are not many of them, so an appropriate solution is to replace them with the median of that quantity. We use the Scikit Learn Imputer to carry this out. After imputation has taken place, we standardise each variable by removing the mean and dividing by the standard deviation.

Afterwards, we’ll plot the first 20 minutes of power data.

In :
from sklearn.preprocessing import Imputer, StandardScaler

import matplotlib.pyplot as plt
%matplotlib inline

# convert all infinite values to nans
xTrain[ np.isinf( xTrain ) ] = np.nan
xTest[ np.isinf( xTest ) ] = np.nan

# construct imputer and scaler
myImputer = Imputer( missing_values='NaN', strategy='median', axis=0 )
myScaler = StandardScaler()

# impute NaNs
xTrain = myImputer.fit_transform( xTrain )
xTest = myImputer.transform( xTest )

# remove mean and divide by standard deviation
xTrain = myScaler.fit_transform( xTrain )
xTest = myScaler.transform( xTest )

# plot the first 20 minutes of power data
# in the test dataset

plt.plot( np.linspace(1,1200,1200), yTest[:1200], color='black', label='Power' )

ax = plt.gca()
ax.set_xlabel('Time (seconds)', fontsize=15)
ax.set_ylabel('Power (watts)', fontsize=15)
ax.set_title('First 20 minutes of power data', fontsize=15, fontweight='bold' )
ax.set_xlim([1,1200])
ax.set_ylim([0,600])

fig = plt.gcf()
fig.set_size_inches( (12,3) )

plt.show()

## The models

Here we describe the particular algorithms/models we are going to train. For each model, a brief description is given of how it works, along with a few pros and cons.

### Multiple Linear Regression (MLR)

• Assumes that the output variable (power) is a linear function of all the input variables, and that there is zero collinearity between the input variables.
• Assumes that the errors of such a model form a normal distribution.
• Pros: simple interpretation; computationally inexpensive.
• Cons: we do have collinearity in the input data; it cannot capture nonlinear behaviour in the data (i.e. it will likely underfit); it is sensitive to the scaling of the data.

### K-Nearest Neighbour Regressor (KNNR)

• Simply finds the K data points in the training data which most closely match (based on Euclidean distance) the data point being predicted, and averages the output values of those K data points.
• It assumes that the Euclidean distance between data points is an accurate measure of how ‘similar’ two data points are.
• Pros: it can capture nonlinear behaviour; is intuitive.
• Cons: can sometimes be computationally expensive to calculate the Euclidean distances; as it is instance based, it doesn’t generalise well to new situations.
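The averaging step is simple enough to write out by hand; a toy numpy sketch of the idea (not the sklearn implementation, which here uses K=75 on the standardised ride data):

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k=3):
    """Average the targets of the k training points closest
    (in Euclidean distance) to the query point."""
    dists = np.linalg.norm(x_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

x_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])
print(knn_predict(x_train, y_train, np.array([1.1])))  # -> 1.0 (mean of 3 nearest)
```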

### Random Forest Regressor (RFR)

• An ensemble method that averages the predictions of many (independent) decision trees; each tree is built using a random (bootstrapped) sample of the training data. Features are also randomly selected at each node within the trees.
• Pros: requires very little hyperparameter tuning; unlikely to overfit; insensitive to input scaling or type.
• Cons: can use a lot of memory to build; need to use a large number of trees; lacks interpretability.

### Multi-Layer Perceptron (MLP)

• A popular form of artificial neural network which contains one or more layers of hidden units. Perceptron models, and the associated back-propagation algorithm, are loosely based on how neurons fire in the brain.
• Pros: can learn any nonlinear function; has potential to generalise well.
• Cons: a bit of a ‘black box’; lots of hyperparameters to tune; can be difficult to train.

## Construct and train models

We will use Scikit Learn to construct all our models, but if you start doing more computationally intense projects with neural networks, then I recommend using the Keras API which works on top of TensorFlow.

In :
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

# model names
modelNames = [ 'Linear Regression', 'K-Nearest', 'Random Forest', 'Neural Network' ]

# construct models
modelList =  [ LinearRegression(),
KNeighborsRegressor( n_neighbors=75 ),
RandomForestRegressor( n_estimators=500, n_jobs=3 ),
MLPRegressor( hidden_layer_sizes=(7,7), learning_rate_init=0.0005,
alpha=0.0, max_iter=1000 )
]

# train each model
for model in modelList:

    print "\nCurrently training:\n\t", model, "\n"
    model.fit( xTrain, yTrain )
    print "Done!"

Currently training:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Done!

Currently training:
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=75, p=2,
weights='uniform')

Done!

Currently training:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=3,
oob_score=False, random_state=None, verbose=0, warm_start=False)

Done!

Currently training:
MLPRegressor(activation='relu', alpha=0.0, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(7, 7), learning_rate='constant',
learning_rate_init=0.0005, max_iter=1000, momentum=0.9,
nesterovs_momentum=True, power_t=0.5, random_state=None,
verbose=False, warm_start=False)

Done!


### Model Evaluation

We want to see how well our models fit the training data and then generalise to previously unseen test data. We’ll use the $R^2$ coefficient of determination, which measures performance relative to simply predicting the expected value: a perfect model gives $R^2 = 1$, always predicting the average power gives $R^2 = 0$, and a really bad model produces a negative $R^2$.
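For reference, this is the same quantity returned by each model's score method; a minimal sketch of the definition:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot: 1 for perfect predictions, 0 for
    always predicting the mean, negative for worse than the mean."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([100.0, 200.0, 300.0])
print(r_squared(y, y))                          # -> 1.0: perfect predictions
print(r_squared(y, np.full_like(y, y.mean()))) # -> 0.0: predicting the mean
```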

In :
# loop through each model and calculate the score
for i, model in enumerate( modelList ) :

    print modelNames[i], ":", np.round( model.score( xTest, yTest ), 4 )


Linear Regression : 0.0614
K-Nearest : 0.0929
Random Forest : 0.0913
Neural Network : 0.0945


As we can see, all the models do slightly better than simply predicting the average power, but not by much. This shows how difficult this problem is, even when you throw fancy machine learning methods at it. The neural network came out on top, but the real surprise is how well the K-Nearest Neighbour algorithm (the underdog of the group) performed. The K-Nearest model beat the random forest and wasn’t far off the neural network. The poor performance of the linear regression shows that there are important nonlinear interactions in the data.

Let’s now plot some of these predictions to get a feel for their performance. We’ll take the final 20 minutes of riding in the test data and plot each model’s prediction.

In :
# construct figure
fig, (ax0,ax1,ax2,ax3,ax4) = plt.subplots(5,1,sharex=True)
fig.set_size_inches( (12,8) )

time = np.linspace(1,1200,1200)

# plot the true values, and then the model predictions
ax0.plot( time, yTest[-1200:], color='black', label='Truth' )
ax1.plot( time, modelList[0].predict( xTest[-1200:] ), label=modelNames[0], color='darkblue' )
ax2.plot( time, modelList[1].predict( xTest[-1200:] ), label=modelNames[1], color='darkgreen' )
ax3.plot( time, modelList[2].predict( xTest[-1200:] ), label=modelNames[2], color='darkred' )
ax4.plot( time, modelList[3].predict( xTest[-1200:] ), label=modelNames[3], color='darkcyan' )

ax0.set_xlim( (1,1200) )
ax1.set_xlim( (1,1200) )
ax2.set_xlim( (1,1200) )
ax3.set_xlim( (1,1200) )
ax4.set_xlim( (1,1200) )

ax0.set_ylim( (0,500) )
ax1.set_ylim( (0,500) )
ax2.set_ylim( (0,500) )
ax3.set_ylim( (0,500) )
ax4.set_ylim( (0,500) )

ax4.set_xlabel('Time (seconds)')
ax2.set_ylabel('Power (watts)')

plt.show()

One important test of these models is whether their predictions drop to zero when the true power drops to zero. The neural network, K-nearest, and random forest models manage this fairly well, whereas the linear regression model hangs around 100 watts (check out the extended period of zero power at roughly 350 seconds). The linear regression model is also too smooth; it doesn’t capture the short time-scale fluctuations that we see in the top panel showing the true data.

All models (except linear regression) do generally track the true changes in power – this is especially clear towards the end of the plotted window. They just can’t quite match the magnitude of the spikes in power. This is likely because when a cyclist goes above anaerobic threshold, power output decouples from heartrate and becomes harder to predict.

We can now ask the question: do all the input features contribute to these predictions? Or is it only one or two features that produce the majority of the predictive skill? To answer this, we can use the ‘feature importances’ generated by the random forest model. These feature importances tell us the relative importance of each variable (for more information on how they’re calculated, see this).

In :
# extract feature importances from random forest model
featImpo = modelList[2].feature_importances_   # random forest is the third model in the list

# make list of feature names
features = [ 'Cadence', 'Speed', 'Heartrate',
'Gradient', 'Change in \nheartrate', 'Change in \nspeed' ]

# plot a bar chart
plt.barh( [1,2,3,4,5,6], featImpo, color='orange', edgecolor='black', linewidth=2 )

fig, ax = plt.gcf(), plt.gca()

fig.set_size_inches( (12,10) )
ax.set_xlabel( 'Feature Importance', fontsize=15 )
ax.yaxis.grid(True)
ax.yaxis.set_ticklabels([])
ax.set_axisbelow(True)
ax.set_title('Feature Importance from Random Forest Model', fontsize=15, fontweight='bold' )

# annotate
for i in range(6) :
    plt.text( 0.01, i+1, features[i], fontsize=15, va='center' )

plt.show()

It’s clear that heartrate and cadence are the best predictors, and this makes intuitive sense. I wonder if the accuracy would improve if PowerTap included cadence data in their PowerCal algorithm. I thought that the change in heartrate feature would have been more useful than it is – maybe the limited skill of this feature is due to the time delay between changes in power output and changes in heartrate.

In the future it would be interesting to compare these purely statistical models with a physical model that makes simple assumptions about rolling resistance and wind drag. In the meantime, however, even though the statistical models did a fairly good job, I won’t be selling my power meter anytime soon.

(The python scripts, notebook, and data can be found here on GitHub)