R Squared Can Be Negative

May 29, 2017 · Machine Learning

Let’s do a little linear regression in Python with scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = np.random.randn(100, 20), np.random.randn(100)
X_train, X_test, y_train, y_test = train_test_split(X, y)

model = LinearRegression()
model.fit(X_train, y_train)

It is a property of ordinary least squares regression that, on the training data we fit on, the coefficient of determination R^2 equals the squared correlation coefficient r^2 between the model's predictions and the actual values; this follows because OLS residuals (with an intercept term) are orthogonal to the fitted values.

# coefficient of determination R^2
print(model.score(X_train, y_train))
## 0.203942898079

# squared correlation coefficient r^2
print(np.corrcoef(model.predict(X_train), y_train)[0, 1]**2)
## 0.203942898079
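
To see what `score` is reporting, we can compute R^2 directly from its definition, 1 - SS_res/SS_tot, which is also what `sklearn.metrics.r2_score` implements. A minimal sketch, reusing the fitted `model` from above:

from sklearn.metrics import r2_score

pred = model.predict(X_train)
ss_res = np.sum((y_train - pred)**2)            # residual sum of squares
ss_tot = np.sum((y_train - y_train.mean())**2)  # total sum of squares

# both should match model.score(X_train, y_train)
print(1 - ss_res / ss_tot)
print(r2_score(y_train, pred))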

This does not hold for new data, and if the model is bad enough the coefficient of determination can be negative: since R^2 = 1 - SS_res/SS_tot, it falls below zero whenever the model's squared error exceeds that of simply predicting the mean of the targets. The squared correlation coefficient is never negative, but it can be quite low.

# coefficient of determination R^2
print(model.score(X_test, y_test))
## -0.277742673311

# squared correlation coefficient r^2
print(np.corrcoef(model.predict(X_test), y_test)[0, 1]**2)
## 0.0266856746214
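
The negative score has a concrete meaning: the predictions do worse than always guessing the mean of the test targets. A small illustration with a deliberately bad constant prediction (the constant 10 here is arbitrary):

# a constant far from the data has larger squared error than the mean,
# so its R^2 is strongly negative
bad_pred = np.full_like(y_test, 10.0)
print(r2_score(y_test, bad_pred))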

These drops from training to test performance grow worse the more the model overfits.
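
A quick way to see this is to give OLS nearly as many random features as training samples; a sketch, where the feature count of 90 is an arbitrary choice:

X, y = np.random.randn(100, 90), np.random.randn(100)
X_train, X_test, y_train, y_test = train_test_split(X, y)

model = LinearRegression().fit(X_train, y_train)

# near-perfect fit on the training noise...
print(model.score(X_train, y_train))

# ...and typically a far more negative R^2 on the test set
print(model.score(X_test, y_test))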