python - what information does higher RMSE score using CV convey

I've used cross-validation on a gradient boosting regressor.

I calculated the RMSE for each fold's score during cross-validation and took the mean, but it is far from the RMSE I get with train_test_split by comparing predicted and actual values.

As I understand it, the train_test_split result should not be overfit, since I am not trying different parameter combinations against the test set.

What does this difference represent?

Here is my code:

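import statistics

import numpy as np
from sklearn import metrics
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, train_test_split

# X and y are the feature matrix and target (not shown in the question)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)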

gbr_le = GradientBoostingRegressor(
    n_estimators  = 1000,
    learning_rate = 0.1,
    random_state  = 0
)

model = gbr_le.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f'{np.sqrt(metrics.mean_squared_error(y_test, y_pred))}')

>>> 4.881378370139346

and using CV:

scores = cross_val_score(gbr_le, X, y, cv=7, scoring='neg_mean_squared_error')

statistics.mean([np.sqrt(-sc) for sc in scores])

>>> 9.381100515895412

You need to check the standard deviation of your cross-validation scores. It may be that the mean is 9.3 but the standard deviation is fairly high. In that case the CV results reflect the true error rate on your data, and it is only by chance that your particular test split produced such a low error. Try changing the random state and see whether the error stays around 4 or whether it varies with a distribution similar to the cross-validation scores.
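A rough sketch of that check follows. The asker's X and y are not shown, so this uses a synthetic dataset from make_regression as a stand-in, and fewer trees than in the question just to keep it quick; both are my assumptions, not the original setup.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the asker's data (assumption: the real X, y are unknown).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Fewer trees than in the question, only to keep this sketch fast.
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=0)

# Per-fold RMSE from cross-validation, reported with its spread.
neg_mse = cross_val_score(gbr, X, y, cv=7, scoring="neg_mean_squared_error")
cv_rmse = np.sqrt(-neg_mse)
print(f"CV RMSE: mean={cv_rmse.mean():.3f}, std={cv_rmse.std(ddof=1):.3f}")

# Hold-out RMSE for several random_state values of train_test_split.
split_rmse = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    y_pred = gbr.fit(X_tr, y_tr).predict(X_te)
    split_rmse.append(np.sqrt(mean_squared_error(y_te, y_pred)))
split_rmse = np.array(split_rmse)
print(f"Hold-out RMSE per seed: {np.round(split_rmse, 3)}")
```

If the hold-out values jump around as much as the per-fold values do, a single split giving 4.88 is probably just one draw from that spread.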

"What does this difference represent?"

It suggests that you got lucky with your train/test split. You seem to have picked a split that happens to be extremely favourable for training and testing.

In this case I would trust the cross_val_score result, try more splits, and tune gbr_le to get a better result.
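One possible way to do "more splits" is RepeatedKFold, sketched here on synthetic stand-in data (make_regression and the smaller n_estimators are my assumptions, since the real data is not shown). It may also be worth knowing that an integer cv on a regressor uses KFold without shuffling, while train_test_split shuffles by default, so if the rows are ordered in some way that could contribute to the gap as well.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in for the asker's data (assumption: the real X, y are unknown).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Fewer trees than in the question, only to keep this sketch fast.
gbr = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1, random_state=0)

# 7 folds repeated 3 times with different shuffles gives 21 RMSE estimates.
rkf = RepeatedKFold(n_splits=7, n_repeats=3, random_state=0)
neg_mse = cross_val_score(gbr, X, y, cv=rkf, scoring="neg_mean_squared_error")
rmse = np.sqrt(-neg_mse)

print(f"{len(rmse)} splits: mean={rmse.mean():.3f}, std={rmse.std(ddof=1):.3f}")
print(f"min={rmse.min():.3f}, max={rmse.max():.3f}")
```

The min and max give a feel for how far a single lucky or unlucky split can land from the mean.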

To get an idea of how much your data fluctuates, we need to know how widely it is spread, as @BICube already said. What are the values of your target variable? What is mean(y), and what is its standard deviation?
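To put the two RMSE numbers in context, one option is to compare them with the spread of the target. A small sketch: the y values below are made up purely for illustration (the asker's target is not shown), relative_rmse is a hypothetical helper, and only the two RMSE figures come from the question.

```python
import statistics

# Hypothetical target values, for illustration only.
y = [12.0, 30.5, 18.2, 45.1, 27.3, 22.8, 39.4, 15.6]

# The two RMSE values reported in the question.
rmse_holdout = 4.881378370139346
rmse_cv = 9.381100515895412

y_mean = statistics.mean(y)
y_std = statistics.stdev(y)  # sample standard deviation


def relative_rmse(rmse, spread):
    """RMSE expressed as a fraction of the target's standard deviation."""
    return rmse / spread


print(f"mean(y)={y_mean:.2f}, std(y)={y_std:.2f}")
print(f"hold-out RMSE / std(y) = {relative_rmse(rmse_holdout, y_std):.2f}")
print(f"CV RMSE / std(y)       = {relative_rmse(rmse_cv, y_std):.2f}")
```

A ratio close to 1 would mean the model does little better than always predicting mean(y); a ratio well below 1 means it explains a good part of the variation.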
