R Squared Can Be Negative
May 29, 2017 00:00 · 162 words · 1 minute read
Let’s do a little linear regression in Python with scikit-learn:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X, y = np.random.randn(100, 20), np.random.randn(100)  # y is pure noise, unrelated to X
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = LinearRegression()
model.fit(X_train, y_train)
It is a property of ordinary least squares regression (with an intercept) that, on the training data we fit on, the coefficient of determination R^2 is equal to the squared correlation coefficient r^2 between the model's predictions and the actual values.
# coefficient of determination R^2
print(model.score(X_train, y_train))
## 0.203942898079
# squared correlation coefficient r^2
print(np.corrcoef(model.predict(X_train), y_train)[0, 1] ** 2)
## 0.203942898079
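To see what those two numbers are measuring, we can compute both from their definitions. This is just a quick sketch reusing the fitted model above:
preds = model.predict(X_train)
ss_res = np.sum((y_train - preds) ** 2)           # residual sum of squares
ss_tot = np.sum((y_train - y_train.mean()) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot)                     # same value as model.score(X_train, y_train)
print(np.corrcoef(preds, y_train)[0, 1] ** 2)  # equal to the above on the training data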
This equality does not hold for new data, and if our model is sufficiently bad the coefficient of determination can be negative: since R^2 = 1 - SS_res/SS_tot, it drops below zero whenever the model's predictions fit the test targets worse than simply predicting their mean. The squared correlation coefficient is never negative but can be quite low.
# coefficient of determination R^2
print(model.score(X_test, y_test))
## -0.277742673311
# squared correlation coefficient r^2
print(np.corrcoef(model.predict(X_test), y_test)[0, 1] ** 2)
## 0.0266856746214
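In other words, on the test split the residual sum of squares exceeds the total sum of squares. A quick sketch of that check, again reusing the fitted model:
test_preds = model.predict(X_test)
ss_res = np.sum((y_test - test_preds) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
print(ss_res > ss_tot)      # True exactly when the test R^2 is negative
print(1 - ss_res / ss_tot)  # reproduces model.score(X_test, y_test)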
Both of these declines in out-of-sample performance get worse as the model overfits.
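As a rough illustration of that last point (the feature counts below are chosen arbitrarily for this sketch): adding more pure-noise features pushes the training R^2 toward 1 while the test R^2 tends to sink further below zero.
# More noise features: training fit improves, test R^2 usually gets worse.
for n_features in (5, 20, 60):
    X, y = np.random.randn(100, n_features), np.random.randn(100)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    model = LinearRegression().fit(X_train, y_train)
    print(n_features, model.score(X_train, y_train), model.score(X_test, y_test))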