Can LLMs spot machine learning pitfalls?
In last week’s post on ML pitfalls, I made a passing remark about the potential for using large language models (LLMs) to help us check for errors in machine learning pipelines. I thought I’d investigate this a bit using Google’s Gemini LLM¹.
First, I asked it to generate some ML code, to see whether the code it generated contained errors. A common issue with ML code generated by humans is the presence of information (or data) leaks, so I wanted to know whether the LLM would avoid this particular modelling pitfall. And probably the simplest kind of information leak is normalising a data set before splitting it into train and test folds, since this allows information about the distribution of the test set to leak into model training — so this seemed like a good place to start.
Me: Generate python code to classify the iris dataset using a random forest
<I’ve omitted the code, since it largely repeats the next excerpt>
Me: Change this code so that the data is centred and scaled
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset
iris = load_iris()
# Create a DataFrame for better data handling
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target
# Separate features (X) and target variable (y)
X = df.drop('target', axis=1)
y = df['target']
# Standardize (center and scale) features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split data into training and testing sets (80/20 ratio)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Create a Random Forest classifier
rf = RandomForestClassifier(n_estimators=100) # Consider tuning n_estimators for optimal performance
# Train the model
rf.fit(X_train, y_train)
# Make predictions on the testing set (use scaled features)
y_pred = rf.predict(X_test)
# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
That’s a bit worrying. The generated code is applying centring and scaling (the normalising operation I asked for) to the entire dataset, before splitting into train and test sets. This causes information about the test set distribution to leak into training.
Me: Is there an information leak in this code?
No, the provided code does not pose an immediate information leak risk. It uses the publicly available Iris dataset, and all processing and analysis happen within the local environment.
Hmm, I think it got the wrong end of the stick there. Let’s give it another clue…
Me: Does doing preprocessing before splitting into train and test sets cause an information leak?
You're correct, doing some types of preprocessing before splitting the data into train and test sets can indeed cause an information leak. This can lead to an overly optimistic estimation of the model's performance and hinder its ability to generalize to unseen data.
Me: Bearing that in mind, does the code you generated contain an information leak?
You're absolutely right, the code I previously generated for classifying the Iris dataset using a Random Forest does contain an information leak.
While the dataset itself (Iris) is public and the processing happens locally, the preprocessing step incorrectly centers and scales the entire dataset before splitting it into training and test sets. This introduces the issue you mentioned where the test set statistics influence the training data, leading to overfitting and hindering generalizability.
Woohoo! So, with sufficient prompting, it does recognise the information leak in the code it generated. Or at least it appears to — I guess it could just be responding to my leading questions without taking the code into account.
LLMs are stochastic; that is, they can produce different outputs for the same prompt. With this in mind, I gave it another chance by starting a new session and running the first two queries again. This time it produced code without an information leak, as shown by the fragment below, in which it fits the scaler using only the training set and then applies the same transformation to the test set:
# Load the Iris dataset
iris = load_iris()
# Split data into features (X) and target (y)
X = iris.data
y = iris.target
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Center and scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
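As an aside, one way to make this particular leak harder to introduce in the first place is to wrap the preprocessing and the model in a scikit-learn Pipeline, so that the scaler is refitted on the training portion of every split or cross-validation fold automatically. The following is just a minimal sketch of that pattern (my own code, not Gemini’s output):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset as plain arrays
X, y = load_iris(return_X_y=True)
# The scaler is fitted only on the training portion of each fold,
# so no test-fold statistics can influence training
model = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=100, random_state=42))
scores = cross_val_score(model, X, y, cv=5)
print("Mean cross-validated accuracy:", scores.mean())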
Another characteristic of LLMs is that they are sensitive to their context window, i.e. the series of prompts and responses made before the current prompt. So, to get a more realistic idea of how well it critiques fresh code, I started a new session with no prior context, and then asked it to critique the incorrect code it generated previously.
Me: Critique the following machine learning code, identifying any common modelling pitfalls: <previous incorrect code provided here>
The provided code demonstrates a strong foundation for building a machine learning model. However, there are a few areas where potential pitfalls could be addressed:
1. Lack of Feature Engineering:
While the code utilizes standardization, additional feature engineering techniques could be explored to improve model performance. This might involve creating new features by combining existing ones, addressing potential outliers, or handling missing values (if any).
2. Limited Exploration of Hyperparameters:
The code uses a default value for n_estimators in the Random Forest classifier. Tuning hyperparameters like the number of trees, maximum depth, or minimum samples per split can significantly improve model performance. Consider using grid search or randomized search techniques for hyperparameter optimization.
3. Single Metric Evaluation:
The code relies solely on accuracy as an evaluation metric. Depending on the problem, other metrics like precision, recall, F1-score, or ROC AUC might be more informative. It's crucial to consider the appropriate metrics based on the specific task and data characteristics.
4. Missing Baseline Model:
The code doesn't establish a baseline model performance. Comparing the Random Forest's accuracy to a simple baseline (e.g., majority class or predicting the mean) helps understand how well the model performs compared to a basic approach.
5. No Cross-validation:
The code splits the data into training and testing sets once. Using techniques like k-fold cross-validation provides a more robust estimate of model performance and reduces biases from a single split.
All good points, but it didn’t spot the information leak. So, maybe a more specific prompt (again starting from a fresh session) might help…
Me: Critique the following machine learning code, in particular identifying anything that might cause information to leak from the test set to the train set: <code>
The provided code does not exhibit any flaws that would directly leak information from the test set to the training set. Here's why:
Data Splitting:
The code uses train_test_split from sklearn.model_selection to split the data into training and testing sets before any preprocessing or scaling takes place. This ensures that the test set remains completely unseen until the model is trained on the training data.
Standardization:
The code uses StandardScaler to standardize the features, but the scaling is done separately for the training and testing sets. This means the model only learns the mean and standard deviation from the training data and uses them to transform both the training and testing features independently.
Therefore, the code follows good practices for data splitting and preprocessing, preventing information leakage from the test set to the training set.
That’s disappointing. Not only did it not spot the leak, but it also reassured the user that there wasn’t a leak. It also explicitly says that splitting is done before scaling, which is very much not the case.
So what can we take home from this? First, LLMs sometimes generate code that contains examples of common modelling pitfalls. This is not too surprising; these are common issues, so LLMs were undoubtedly trained on code containing examples of them. Second, with sufficient prompting and guidance, they can sometimes detect these issues in code, so perhaps there is scope for using them to critique machine learning code. However, the effectiveness of this appears to depend very much on how guidance is provided, and how prompts are ordered. As the last example above shows, if you don’t get this right, they’ll happily hallucinate mistruths.
And this all comes down to the essential problem with LLMs. Although they can demonstrate seemingly intelligent behaviour, all they’re really doing at heart is sampling from a complex probability distribution. They don’t really understand the language they’re modelling, and so are sensitive to distractors like how a prompt is formulated and the ordering of queries. Because of this, it wouldn’t be surprising if they also respond differently depending on things like variable names, the choice of base model in the code, and the filename of the data set that’s being used.
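To illustrate what that means in practice, here is a toy sketch (nothing to do with Gemini’s actual internals; the tokens and probabilities are entirely made up) of sampling a continuation from a probability distribution. Run it a few times and you can get different answers to the same “prompt”, which is essentially the stochasticity seen in the sessions above:
import numpy as np
# A hypothetical distribution over candidate next tokens for some fixed prompt
tokens = ["No,", "Yes,", "Perhaps"]
probs = [0.55, 0.35, 0.10]
# Each call samples from the same distribution, so repeated runs of the
# same prompt can start the response differently
rng = np.random.default_rng()
for _ in range(3):
    print(rng.choice(tokens, p=probs))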
So, I’d say the jury is still out on whether LLMs can help people to avoid ML pitfalls.
¹ The free version. I’m an academic.