# Seed Hacking Cross-Validation

Vary the random seed used to create the cross-validation folds until you get the best result.

## Description

Cross-validation hacking is a deceptive practice where data scientists manipulate random seeds to artificially improve model performance metrics.

This technique involves repeatedly changing the random seed used to split data into cross-validation folds until finding a particularly favorable split that produces better metrics.

The danger lies in creating an overly optimistic view of model performance. By cherry-picking the best-performing split, you're essentially overfitting to the validation data itself.

This practice can be especially tempting for new data scientists who feel pressure to demonstrate strong results. However, it undermines the entire purpose of cross-validation: obtaining an unbiased estimate of model performance.

The consequences become apparent when the model is deployed. The reported performance metrics won't reflect real-world performance, potentially leading to failed projects and damaged credibility.

Think of this as a form of data leakage: you are effectively using information from your validation set to make modeling decisions, which violates fundamental machine learning principles.

The correct approach is to fix your random seed at the start of your project and stick with it. This ensures your cross-validation results are honest and reliable indicators of true model performance.
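For contrast, here is a minimal sketch of the honest workflow: the fold seed is chosen once up front (the value 42 is an arbitrary assumption, as are the dataset parameters, which simply mirror the example below) and the resulting score is reported as-is, never tuned.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold

# Synthetic dataset with the same shape as the example below
X, y = make_classification(n_samples=100, n_features=5, n_informative=4,
                           n_redundant=1, random_state=42)

# Fix the fold seed once, at the start of the project, and never revisit it
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Model with a fixed seed for reproducibility
model = RandomForestClassifier(n_estimators=50, random_state=42)

# Report whatever this single, pre-committed split configuration gives
scores = cross_val_score(model, X, y, cv=kf)
print(f"Mean CV accuracy: {scores.mean():.4f} (std: {scores.std():.4f})")
```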

## Example

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold

# Define the number of trials
num_trials = 100

# Define variables to track the best fold configuration and best performance
best_fold_seed = None
best_performance = -np.inf

# Create a synthetic classification dataset
X, y = make_classification(n_samples=100, n_features=5, n_informative=4, n_redundant=1, random_state=42)

# Initialize the model with a fixed seed
model = RandomForestClassifier(n_estimators=50, random_state=42)

# Iterate over multiple seeds to vary the k-fold cross-validation splits
for trial in range(num_trials):
    # Set the seed for the k-fold shuffle
    fold_seed = trial

    # Initialize k-fold cross-validation with the current seed
    kf = KFold(n_splits=5, shuffle=True, random_state=fold_seed)

    # Evaluate the model using cross-validation
    scores = cross_val_score(model, X, y, cv=kf)

    # Calculate the mean performance
    mean_performance = scores.mean()

    # Record and report the new best fold seed whenever the mean performance improves
    if mean_performance > best_performance:
        print(f"Fold Seed: {fold_seed}, Performance: {mean_performance:.4f}")
        best_performance = mean_performance
        best_fold_seed = fold_seed

# Report the best fold seed and its performance
print(f"\nBest Fold Seed: {best_fold_seed}, Best Performance: {best_performance:.4f}")

Example Output:

```
Fold Seed: 0, Performance: 0.8000
Fold Seed: 12, Performance: 0.8200
Fold Seed: 56, Performance: 0.8400

Best Fold Seed: 56, Best Performance: 0.8400
```
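To see why the cherry-picked number is optimistic, one way is to hold out a test set that plays no part in the seed search and compare the "best" cross-validation score against it. The sketch below is not part of the original example; the `train_test_split` call and its `test_size=0.3` / `random_state=0` values are illustrative assumptions. The held-out accuracy will typically come in below the hacked cross-validation figure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Same synthetic dataset as above
X, y = make_classification(n_samples=100, n_features=5, n_informative=4,
                           n_redundant=1, random_state=42)

# Hold out a test set that the seed search never sees
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=42)

# Repeat the seed search, but only on the training portion
best_seed, best_cv = None, -np.inf
for fold_seed in range(100):
    kf = KFold(n_splits=5, shuffle=True, random_state=fold_seed)
    score = cross_val_score(model, X_train, y_train, cv=kf).mean()
    if score > best_cv:
        best_seed, best_cv = fold_seed, score

# Compare the cherry-picked CV score with honest held-out performance
model.fit(X_train, y_train)
print(f"Best hacked CV score: {best_cv:.4f} (seed {best_seed})")
print(f"Held-out test score:  {model.score(X_test, y_test):.4f}")
```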