Cycling Power Predictions II: Physics-Inspired Feature Engineering


In my first attempt, I set out to see how accurately you could predict your power output during cycling from features such as cadence, speed and heartrate. I extracted the data from roughly 50 of my previous ride files, during which I used a PowerTap Hub to measure my true power output. I also measured my speed, elevation, heartrate and cadence through other devices.

I carried out some basic feature engineering to create variables such as change in heartrate, change in speed, and the steepness of the road. I fitted numerous models to the data and compared their results. Judging by the feature importances generated from the Random Forest model, the most important variables turned out to be the un-engineered cadence and heartrate.


Physics-Inspired Features

The question I want to answer here is: can we improve on our feature engineering? More specifically, can we use the physical equations of motion to improve how we construct features? I mentioned in the first attempt that Strava provides a power estimate based on simple physical assumptions – it estimates your power output by balancing all the relevant forces that act on you as a cyclist. Let’s begin with the famous equation stating that the net force acting on a body equals its mass times its acceleration,

F_{net} = ma .

We can now write the net force F_{net} as the sum of the individual forces acting on a cyclist:

F_{net} = F_{output} - F_{air} - F_{grav} - F_{rolling} - F_{brake} ,

where F_{output} is the force output by the cyclist (which is what we’re trying to predict), F_{air} is the aerodynamic drag, F_{grav} is the force of gravity along the road (positive when going uphill and negative when going downhill), F_{rolling} is the rolling resistance between the tyres and the road, and finally F_{brake} is the braking force applied by the cyclist.

We now want to write down physical equations for each force. The rolling resistance will not vary much during a ride so we will ignore this term, and we cannot predict when a cyclist will brake either so this term will also be excluded. That leaves us with the aerodynamic drag and the force of gravity. The force caused by aerodynamic drag has the following equation

F_{air} = \frac{1}{2} C_d A \rho v^2,

where C_d is the coefficient of drag (i.e. how much drag the wind creates per unit area), A is the cyclist’s total frontal area that the wind ‘sees’, \rho is the density of air, and v is the speed of the cyclist. The drag coefficient multiplied by the frontal area produces the total aerodynamic drag C_d A, which cyclists are obsessed with reducing. The equation for the gravitational term is given by

F_{grav} = m g \sin(\theta),

where m is the total mass of the cyclist plus the bike, g = 9.81 \text{ ms}^{-2} is the gravitational acceleration, and \theta is the angle of the road relative to a flat surface (\theta = 0 is completely flat, while \theta = 90 degrees is a vertical wall). We now have formulas for the force of aerodynamic drag and gravity, but we’re interested in predicting the power generated by the cyclist, not the force. This is an important difference, as power is the rate at which a force does work (energy per second). To convert the above force terms to power, we simply multiply by the speed, leading to

P_{air} = \frac{1}{2} C_d A \rho v^3 ,

P_{grav} = m g \sin(\theta)\, v.

These terms tell us how aerodynamic drag and gravity contribute to the change in the cyclist’s kinetic energy. Ignoring the constants, we can see that these two power terms are proportional to the following quantities

P_{air} \propto (\text{speed})^3,

P_{grav} \propto \text{gradient}\times\text{speed},

where we have assumed \sin(\theta) \approx \tan(\theta) = \text{gradient}, which is not a bad approximation as changes in elevation are small compared to changes in horizontal distance. These two quantities, (\text{speed})^3 and \text{gradient}\times\text{speed}, are good candidates for features as they tell us how the power demands on the cyclist change in time. The speed is stored in the .fit files, and we can easily calculate the gradient from the elevation and distance.
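To make this concrete, here is a minimal sketch (not the exact code in; the helper name and argument layout are my own) of how these two quantities could be computed from per-second speed, elevation and distance streams:

import numpy as np

def physicsFeatures( speed, elevation, distance ):
    """Sketch: build the two physics-inspired features from per-second
    speed (m/s), elevation (m) and cumulative distance (m) arrays."""
    # gradient = rise over run, i.e. tan(theta), roughly sin(theta) for shallow roads
    gradient = np.gradient( elevation ) / np.maximum( np.gradient( distance ), 1e-6 )
    gradientSpeed = gradient * speed    # proportional to the gravitational power
    speedCubed = speed**3               # proportional to the aerodynamic drag power
    return gradientSpeed, speedCubed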

One problem with these two new features is that although the aerodynamic drag and gradient may change, this does not necessarily mean the cyclist will increase or decrease their power. For example, when going downhill most cyclists will reduce their power output and freewheel, even though the air resistance has increased dramatically. To try to make an even better feature, let’s consider the balance of all the power terms (ignoring rolling resistance and braking)

mav = P_{output} - P_{air} - P_{grav} ,

which is just F_{net} = ma multiplied by the speed v. Rearranging for P_{output} and substituting our expressions for P_{air} and P_{grav} gives

P_{output} = mav + \frac{1}{2} C_d A \rho v^3 + m g \sin(\theta)\, v .

This gives us a rough estimate of the cyclist’s power output by balancing the dominant forces. This should hopefully form a better feature than (\text{speed})^3 and \text{gradient}\times\text{speed}, as it should more accurately track changes in the true power output of the cyclist. We’ll use ballpark values for the parameters: m = 80 \text{ kg}, C_d A = 0.3 \text{ m}^2, \rho = 1.225 \text{ kg m}^{-3}. The remaining variables (speed, gradient and acceleration) can be calculated from the ride data.
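As a hedged sketch (the function name and signature below are my own, not taken from the GitHub code), the estimate with these ballpark parameters could be computed as:

def crudePowerEstimate( speed, gradient, accel, m=80.0, CdA=0.3, rho=1.225, g=9.81 ):
    """Rough estimate P = m*a*v + 0.5*CdA*rho*v^3 + m*g*sin(theta)*v,
    using sin(theta) ~ gradient for shallow road angles."""
    return m*accel*speed + 0.5*CdA*rho*speed**3 + m*g*gradient*speed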

We now have three new features to add to our dataset, numbered in the order they appear as columns in the data:

  • New Feature 1: \sin(\theta)\, v  (proportional to the gravitational power)
  • New Feature 2: v^3  (proportional to the aerodynamic drag power)
  • New Feature 3: mav + \frac{1}{2} C_d A \rho v^3 + m g \sin(\theta)\, v  (a rough estimate of the cyclist’s power output)

Load Data and Preprocess

We now want to see if using these three new features improves the accuracy of our models. In order to compare with the results from my first attempt, I’ll use exactly the same models with exactly the same parameters, but this time we’ll introduce the three new features above in addition to the original features used previously. The only model change I will make is to add three more hidden units to each hidden layer of the neural network; this ensures that the number of hidden units is one more than the number of features.

The dataset is formed using the functions defined in the Python script, which can be found in the GitHub repository for this post. The data is extracted from the .fit files and saved as a NumPy array. Let’s now load this data.

In [28]:
import numpy as np

from dataPrep2 import loadDataset2

xTrain, yTrain, xTest, yTest = loadDataset2()

print "Total number of observations: ", xTrain.shape[0]+xTest.shape[0]
print "       Training observations: ", xTrain.shape[0]
print "           Test observations: ", xTest.shape[0]
print "\n Number of inputs/features: ", xTrain.shape[1]
print " Number of outputs/targets: ", 1
Total number of observations:  177330
       Training observations:  137208
           Test observations:  40122

 Number of inputs/features:  9
 Number of outputs/targets:  1
In [29]:
# plot 20 minutes of data of our 3 new features

import matplotlib.pyplot as plt
%matplotlib inline

fig, (ax0,ax1,ax2) = plt.subplots(3,1,sharex=True)
t = np.linspace(1,1200,1200)

ax0.plot( t, xTrain[-1200:,-3], color='darkblue', label='speed*gradient' )
ax0.set_xlim( (1,1200) )
ax0.legend( shadow=True, loc=3, fontsize=15 )
ax0.set_title('New Feature 1')

ax1.plot( t, xTrain[-1200:,-2], color='darkred', label='speed cubed' )
ax1.set_xlim( (1,1200) )
ax1.legend( shadow=True, loc=2, fontsize=15 )
ax1.set_title('New Feature 2')

ax2.plot( t, xTrain[-1200:,-1], color='darkgreen', label='crude power estimate')
ax2.set_xlim( (1,1200) )
ax2.legend( shadow=True, loc=2, fontsize=15 )
ax2.set_title('New Feature 3')
ax2.set_xlabel('Time (seconds)')

fig.set_size_inches( (12,9) )

It can be seen that the third feature, the estimate of the cyclist’s power based on simple physical assumptions, has unrealistic positive spikes (I’d like to see any cyclist put out over 6000 watts). The estimate also drops below zero at points; this may be because we are ignoring rolling resistance and any braking that might occur. If a cyclist applies the brakes, this estimate might interpret the resulting deceleration as the cyclist applying negative power.

These issues, however, do not matter too much. We’re using this power estimate as a feature to increase the amount of information we have on the variability of the cyclist’s true power output. The power estimate feature doesn’t have to be physically consistent; it simply has to contain potentially useful information.

Before training our models, we’ll clean up and standardise our data as before.

In [30]:
# Remove NaNs and standardise data

from sklearn.preprocessing import Imputer, StandardScaler

# convert all infinite values to nans
xTrain[ np.isinf( xTrain ) ] = np.nan
xTest[ np.isinf( xTest ) ] = np.nan

# construct imputer and scaler
myImputer = Imputer( missing_values='NaN', strategy='median', axis=0 )
myScaler = StandardScaler()

# impute NaNs
xTrain = myImputer.fit_transform( xTrain )
xTest = myImputer.transform( xTest )

# remove mean and divide by standard deviation
xTrain = myScaler.fit_transform( xTrain )
xTest = myScaler.transform( xTest )

Construct and Train Models

Use the same models as before:

  • Multiple Linear Regression.
  • K-Nearest Neighbour Regressor.
  • Random Forest Regressor.
  • Multi-Layer Perceptron Regressor.
In [31]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

# model names
modelNames = [ 'Linear Regression', 'K-Nearest', 'Random Forest', 'Neural Network' ]

# construct models
modelList =  [ LinearRegression(),
               KNeighborsRegressor( n_neighbors=75 ),
               RandomForestRegressor( n_estimators=500, n_jobs=3 ),
               MLPRegressor( hidden_layer_sizes=(10,10), learning_rate_init=0.001, 
                            alpha=0.0, max_iter=1000, tol=0.0 ) ]

# train each model
for model in modelList:
    print "\nCurrently training:\n\t", model, "\n" xTrain, yTrain )
    print "Done!"
Currently training:
	LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False) 


Currently training:
	KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=75, p=2,
          weights='uniform') 

Currently training:
	RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=3,
           oob_score=False, random_state=None, verbose=0, warm_start=False) 


Currently training:
	MLPRegressor(activation='relu', alpha=0.0, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(10, 10), learning_rate='constant',
       learning_rate_init=0.001, max_iter=1000, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0, validation_fraction=0.1,
       verbose=False, warm_start=False) 


Model Evaluation

Previously we used the coefficient of determination R^2 to compare our models. The R^2 score equals one for perfect predictions, zero for a model that simply predicts the average power value, and is negative for models that do worse than that. The original scores we are trying to improve on are listed below.

Model scores from the first attempt:

  • Multiple Linear Regression: 0.0614
  • K-Nearest Neighbour: 0.0929
  • Random Forest: 0.0913
  • Neural Network: 0.0945
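For reference, the score returned by each regressor’s .score method is just R^2, and is equivalent to computing it by hand or via sklearn.metrics.r2_score (shown here as a sketch for the Random Forest, modelList[2], which was trained above):

from sklearn.metrics import r2_score

yTrue = np.ravel( yTest )                          # ensure a flat target array
yPred = modelList[2].predict( xTest )              # random forest predictions
ssRes = np.sum( ( yTrue - yPred )**2 )             # residual sum of squares
ssTot = np.sum( ( yTrue - np.mean( yTrue ) )**2 )  # total sum of squares

print "Manual R^2 :", 1.0 - ssRes/ssTot
print "sklearn R^2:", r2_score( yTrue, yPred )     # same value as modelList[2].score( xTest, yTest )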

Let’s see if these scores change by introducing our three new physics-inspired features.

In [32]:
# loop through each model and calculate the score
for i, model in enumerate( modelList ) :
    print modelNames[i], ":", np.round( model.score( xTest, yTest ), 4 )
Linear Regression : 0.0633
K-Nearest : 0.0924
Random Forest : 0.0928
Neural Network : 0.0958

We do see some modest improvements, with the linear regression improving the most. This is likely because the new features are nonlinear functions of the original data, so the linear model is now regressing onto features which vary more linearly with the true power. This increases the linear model’s predictive power.

The K-Nearest Neighbour model score decreases slightly. The performance of a K-Nearest Neighbour model generally degrades as the number of features grows, because notions of similarity/distance between data points become distorted in a high-dimensional space. The decrease in score, however, is small and may not be statistically significant, so we should refrain from over-interpreting such changes.

Both the Random Forest and Neural Network model scores increase, and this is likely due to the extra information being supplied to the models by the new features. So why not form every conceivable combination of the original features and feed them in too? This would increase the amount of information the models can utilise and could therefore lead to more accurate predictions. There are a few issues with this approach. First, it would dramatically increase the computational cost of training these models, potentially making the training intractable. Second, if the number of features becomes comparable to the number of training examples, you’re at risk of overfitting and will then have to employ some form of regularisation. Third, you may lose some interpretability of your features if you start constructing increasingly complex functions of the original features (this, however, may not be an issue depending on the goal of the project).
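To make the blow-up in feature count concrete, here is a small illustration (not something done in this post) using scikit-learn’s PolynomialFeatures, which generates all pairwise products of the existing features:

from sklearn.preprocessing import PolynomialFeatures

# all degree-2 combinations of the 9 features: 9 originals + 9 squares + 36 interactions = 54
poly = PolynomialFeatures( degree=2, include_bias=False )
print poly.fit_transform( xTrain[:5,:] ).shape   # (5, 54)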

Finally, let’s look at the feature importances generated by the Random Forest and see how these new features compare to each other in terms of predictive skill.

In [33]:
# extract feature importances from random forest model
featImpo = modelList[2].feature_importances_

# make list of feature names
features = [ 'Cadence', 'Speed', 'Heartrate', 
            'Gradient', 'Change in \nheartrate', 
            'Change in \nspeed', 'Gradient*speed',
            'Speed cubed', 'Power estimate']

# plot a bar chart
plt.barh( np.linspace(1,9,9), featImpo, color='red', edgecolor='black', linewidth=2 )

fig, ax = plt.gcf(), plt.gca()

fig.set_size_inches( (12,10) )
ax.set_xlabel( 'Feature Importance', fontsize=15 )
ax.yaxis.set_ticks( np.linspace(1,9,9) )
ax.set_title('Feature Importance from Random Forest Model', fontsize=15, fontweight='bold' )

# annotate
for i in range(9) :
    if i<6 :
        plt.text( -0.01, i+1, features[i], fontsize=15, va='center', ha='right' )
    else :
        plt.text( -0.01, i+1, features[i], fontsize=15, va='center', ha='right', fontweight='bold' )

Heartrate and cadence remain the most useful features, but the gradient*speed feature moves into third place. This is interesting because, individually, gradient and speed are the least important features, while the interaction between them (their product) becomes useful. The speed cubed feature didn’t turn out to be that important; despite being proportional to the aerodynamic drag power, the true power output evidently doesn’t vary much with changes in the air resistance.

Which features does the linear regression rely on the most? Since all the input features have been standardised, we can directly compare the magnitudes of the resulting linear regression coefficients. Let’s extract them and compare them in the same way as the feature importances above.

In [34]:
coef = modelList[0].coef_

# plot a bar chart
plt.barh( np.linspace(1,9,9), np.abs( coef ), color='brown', edgecolor='black', linewidth=2 )

fig, ax = plt.gcf(), plt.gca()

fig.set_size_inches( (12,10) )
ax.set_xlabel( 'Magnitude of coefficient', fontsize=15 )
ax.yaxis.set_ticks( np.linspace(1,9,9) )
ax.set_title('Coefficient Magnitudes of Linear Regression Model', fontsize=15, fontweight='bold' )

# annotate
for i in range(9) :
    if i<6 :
        plt.text( -1, i+1, features[i], fontsize=15, va='center', ha='right' )
    else :
        plt.text( -1, i+1, features[i], fontsize=15, va='center', ha='right', fontweight='bold' )

We see that cadence is only the fourth most important feature within the linear regression model, and this makes sense: when pedalling, a cyclist can exhibit a range of power values for the same cadence, but if the cadence drops to zero, then the power must also drop to zero. This is a highly nonlinear relationship, so it’s not surprising that the linear model can’t capture that behaviour. We also see that our physics-inspired features are relatively useful within the linear regression model.

To sum up, you can improve model performance by engineering features based on the physical equations of the system. The code for this post can be found in this GitHub repo. Thank you to Reddit user nickl for suggesting this type of feature engineering.


Visualising the Gulf Stream

My PhD research focuses on the turbulent fluid dynamics within the Gulf Stream, which involves analysing data from both observations and models. There is a rich source of observational data on the world’s oceans, primarily thanks to the introduction of satellites to collect data. We now live in a golden age of satellite observations, where data sets on various quantities (e.g. temperature, salinity, chlorophyll) have become very accessible; tools now exist to select data from any region of the globe, from any time, and from any particular mission. The data is also very clean and requires virtually no pre-processing, meaning you can jump straight into the analysis.

The aim of this post is to provide some simple illustrations of the sort of data that is available, and of what you can do with it. The primary region of focus will be the Gulf Stream, a strong current which hugs the east coast of the US before separating into the open ocean at Cape Hatteras. The Gulf Stream is a highly turbulent and complex system, making it great for data visualisations. All figures are created using the Matplotlib library for Python.

Figure 1: A snapshot of sea-surface temperature in the North Atlantic on 26th October 2017. Data obtained from the Group for High Resolution Sea Surface Temperature (GHRSST), distributed by the NASA Jet Propulsion Laboratory.

Let’s start with the big picture of how the Gulf Stream fits into the rest of the North Atlantic. Figure 1 shows a snapshot of sea-surface temperature (SST) on 26th October 2017. This data set combines raw data from multiple satellites into a single gridded product; this particular product has an impressive spatial resolution of ~5 km.

It can be seen from Figure 1 that warm waters (red) are located close to the equator while waters become cooler (blue) nearer the Arctic Circle. The Gulf Stream transports large amounts of heat northeast towards western Europe, drawing up waters from the south along the east coast of the US.

This however is just a snapshot; the wiggly meanders of the Gulf Stream vary from week-to-week and year-to-year. This is because the underlying fluid dynamics are turbulent, which makes the system hard to predict, but also interesting to study.

The video above shows the ocean currents – with brighter colours indicating higher speeds – on each day for the period 2014-16. This velocity data came from satellite observations; it was produced by Ssalto/Duacs and distributed by Aviso, with support from Cnes.

There are many ways to visualise such ocean currents, and which one you choose mainly comes down to personal preference. Figure 2 shows various methods for displaying velocity data from 27th September 1995. The contourf method (top right) produces a nice, clean display of the ocean current speed, while the quiver method (bottom left) includes information on the direction of the flow. One potential issue with the quiver and streamplot methods is that they can lead to confusing plots with too much going on.

Figure 2: Visualising a snapshot of the ocean current velocity on September 27th 1995, using four different methods: pcolor (top left), contourf (top right), quiver (bottom left) and streamplot (bottom right).
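For anyone wanting to reproduce this kind of comparison, a minimal sketch using a synthetic velocity field (the actual Aviso-based scripts are in the GitHub repository linked at the end) might look like this:

import numpy as np
import matplotlib.pyplot as plt

# synthetic velocity field standing in for the satellite-derived currents
x, y = np.meshgrid( np.linspace(-5,5,50), np.linspace(-5,5,50) )
u = -y * np.exp( -(x**2 + y**2)/8.0 )
v =  x * np.exp( -(x**2 + y**2)/8.0 )
speed = np.sqrt( u**2 + v**2 )

fig, ax = plt.subplots( 2, 2, figsize=(10,8) )
ax[0,0].pcolor( x, y, speed )                                        # raw gridded speed
ax[0,1].contourf( x, y, speed, 20 )                                  # filled contours of speed
ax[1,0].quiver( x[::4,::4], y[::4,::4], u[::4,::4], v[::4,::4] )     # arrows show direction
ax[1,1].streamplot( x, y, u, v, color=speed )                        # streamlines coloured by speed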

Plots can be improved by adding the continents and countries within view, in addition to visualising the ocean data. This can be achieved with the brilliant Matplotlib Basemap Toolkit. Not only does this library draw continents, but you can also customise the map projection used to visualise geographical data on a sphere – both the continents and the projection in Figure 1 were produced using the Basemap package. These projections can be a bit fiddly to get right, but once you do they look great.

The Basemap library can also be used to add state boundaries for North America – this provides a sense of spatial scale to the observational data being visualised. For example, the video above shows Aviso sea-surface height (SSH) data for the period 2014-2016 (full-screen is recommended). The state boundaries, in combination with annotations of major cities, provide a better sense of the scale of the Gulf Stream – it’s difficult to comprehend the scale of a 100 km wide turbulent eddy in your head.
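As a rough sketch of the Basemap calls involved (the parameter values here are illustrative, not those used for the figures), drawing coastlines, filled continents, country borders and US state boundaries on an orthographic projection looks something like this:

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

# orthographic projection centred roughly on the North Atlantic
m = Basemap( projection='ortho', lat_0=35.0, lon_0=-50.0, resolution='l' )
m.drawcoastlines()
m.fillcontinents( color='0.8' )
m.drawcountries()
m.drawstates()            # US state boundaries, for a sense of spatial scale
plt.show()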

Another Matplotlib toolkit that is useful is the 3D plotting library. This is particularly useful for SSH data, which naturally lends itself to three spatial coordinate axes. A 3D surface plot is shown in Figure 3, using a snapshot of SSH from 27th September 1995 (the same day as Figure 2).

Figure 3: A 3D surface plot of sea-surface height using the 3D plotting toolkit within Matplotlib. The data is a snapshot of SSH on 27th September 1995 (land points were coloured brown and given a negative value simply to differentiate them from ocean points).
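A minimal sketch of the 3D plotting toolkit in action (using a synthetic height field rather than the real SSH snapshot) is shown below:

from mpl_toolkits.mplot3d import Axes3D   # registers the '3d' projection
import numpy as np
import matplotlib.pyplot as plt

# synthetic wavy surface standing in for the SSH field
x, y = np.meshgrid( np.linspace(0,10,100), np.linspace(0,10,100) )
ssh = 0.5 * np.sin(x) * np.cos(y)

fig = plt.figure( figsize=(10,7) )
ax = fig.add_subplot( 111, projection='3d' )
ax.plot_surface( x, y, ssh, cmap=plt.cm.viridis, linewidth=0 )
ax.set_zlabel( 'SSH (m)' )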

(The Python scripts used to make these illustrations can be found in the Visualising the Gulf Stream repository on GitHub.)