Let’s do a little linear regression in Python with scikit-learn:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# 100 samples of pure noise: 20 random features and a random target
X, y = np.random.randn(100, 20), np.random.randn(100)
X_train, X_test, y_train, y_test = train_test_split(X, y)

model = LinearRegression()
model.fit(X_train, y_train)
It is a property of ordinary least squares regression that, on the training data the model was fit to, the coefficient of determination R^2 equals the square of the Pearson correlation coefficient r^2 between the model's predictions and the actual values.
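As a reminder, with ŷ denoting the model's predictions and ȳ the mean of the observed values, the two quantities are

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
\qquad
r^2 = \left( \frac{\operatorname{cov}(\hat{y},\, y)}{\sigma_{\hat{y}} \, \sigma_y} \right)^2.
$$

For OLS with an intercept these coincide in-sample because the residuals have zero mean and are uncorrelated with the fitted values; neither property is guaranteed to hold out of sample. We can check the in-sample equality numerically: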
# coefficient of determination R^2
print(model.score(X_train, y_train))
## 0.203942898079

# squared correlation coefficient r^2
print(np.corrcoef(model.predict(X_train), y_train)[0, 1]**2)
## 0.203942898079
This does not hold on new data, and if our model is sufficiently bad, the coefficient of determination can even be negative: R^2 compares the model's squared error to that of simply predicting the mean of the observed values, so a model that does worse than the mean on held-out data scores below zero. The squared correlation coefficient is never negative, but it can be quite low.
# coefficient of determination R^2
print(model.score(X_test, y_test))
## -0.277742673311

# squared correlation coefficient r^2
print(np.corrcoef(model.predict(X_test), y_test)[0, 1]**2)
## 0.0266856746214
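To see where the difference comes from, here is a minimal sketch (continuing the session above; the variable names are ours) that computes both quantities by hand. The first value should match what model.score(X_test, y_test) printed, since that method uses the same 1 - SS_res/SS_tot formula.

# compute both quantities by hand on the test set
pred = model.predict(X_test)

ss_res = np.sum((y_test - pred) ** 2)           # squared error of the model
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # squared error of predicting the mean
print(1 - ss_res / ss_tot)                      # R^2: negative whenever ss_res > ss_tot

print(np.corrcoef(pred, y_test)[0, 1] ** 2)     # r^2: a squared correlation, so always in [0, 1]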
Both declines in performance worsen as the model overfits: the more closely the fit captures noise in the training data, the larger the gap between training and test scores.
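As a rough illustration (a sketch of ours, not part of the original experiment), refitting on pure noise with ever more features pushes the training R^2 upward while the test R^2 sinks further below zero:

# illustrative sketch: more noise features => worse overfitting
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

def run_trial(n_features, n_samples=100):
    # pure-noise data: there is nothing real to learn
    X = rng.randn(n_samples, n_features)
    y = rng.randn(n_samples)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)

for k in (5, 20, 50):
    train_r2, test_r2 = run_trial(k)
    print(k, train_r2, test_r2)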