This is my 42nd post on Substack, so an auspicious moment to raise an issue that’s been niggling at the back of my mind — people using the value 42 to seed the random number generator when they do machine learning. This is common practice, so much so that even LLMs do it when you ask them to generate machine learning code1. But it is potentially problematic, and I don’t think that many people appreciate this.
Many components of the machine learning pipeline use a random number generator, or RNG for short. Take splitting the data into training and test sets. This can be done2 by walking through the samples in the data set, and for each sample checking whether the next random number generated by the RNG is above or below a certain threshold. If it’s below the threshold, the sample gets placed in the training set; if it’s above the threshold, it ends up in the test set3.
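To make that concrete, here's a minimal sketch of the threshold idea using NumPy (the toy samples and the 0.8 threshold are purely illustrative; real libraries don't implement their splits quite this way):
import numpy as np
# a toy dataset of ten samples, purely for illustration
samples = list(range(10))
rng = np.random.default_rng()  # an unseeded random number generator
train, test = [], []
for sample in samples:
    # rng.random() returns a number between 0 and 1, so a threshold of 0.8
    # sends roughly 80% of the samples to the training set
    if rng.random() < 0.8:
        train.append(sample)
    else:
        test.append(sample)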
The RNG is also commonly used when training machine learning models. In neural networks, for example, random numbers are used to select the initial values of weights and other parameters, and they’re also used to select which training samples are used in each epoch, and in which order. All of this has an effect on the route taken by the training process, and hence what the final model looks like.
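As a rough sketch of what this looks like (the layer size and distribution here are illustrative, not how any particular library initialises its networks), a single seeded generator can determine both the starting weights and the order in which training samples are visited:
import numpy as np
rng = np.random.default_rng(seed=42)
# initial weights for a small layer: the same seed always gives the same starting point
weights = rng.normal(loc=0.0, scale=0.1, size=(4, 3))
# the order in which 100 training samples would be visited in one epoch
sample_order = rng.permutation(100)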
And this randomness is not just there to annoy practitioners. Its purpose is to discourage overfitting, particularly the kind of overfitting that occurs when someone tweaks their model so that it performs well on a specific partition of training data and/or a specific sequence of model training steps. Which is easy to do when the training data and/or model training steps never change.
Most RNGs are not truly random. Instead, they use an algorithm that generates a series of numbers that appear random, but in reality are pseudo-random, with each being derived from the previous one in some complex but deterministic fashion. The number at the start of this series is known as the seed4. So, if you always set the seed to a fixed value, then exactly the same sequence of numbers will be generated each time you use an RNG.
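You can see this for yourself; using NumPy's generator as an example, seeding twice with the same value produces exactly the same sequence:
import numpy as np
first = np.random.default_rng(seed=42).random(5)
second = np.random.default_rng(seed=42).random(5)
print(np.array_equal(first, second))  # True: same seed, same sequence of "random" numbers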
But you don't need to set the seed. If you don't, then your computer will set it for you, using things like the current time, or by sampling some random process within your computer (the timing variations between keystrokes on the keyboard, for example). There are some circumstances where it can make sense to set the seed yourself, such as when you really need repeatability: setting the random seed once at the start of a machine learning script will cause the script to produce exactly the same behaviour every time it is run5.
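One way to get the best of both worlds (a sketch using Python's standard library; there are plenty of other ways to do this) is to draw a fresh seed from the operating system's entropy on each run, print it so the run can be reproduced later, and then use it to seed everything downstream:
import secrets
import numpy as np
seed = secrets.randbits(32)  # a fresh seed drawn from the operating system's entropy
print(f"seed used for this run: {seed}")  # record it in case you need to reproduce the run
rng = np.random.default_rng(seed)  # everything downstream draws from this seeded generator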
However, I suspect that most people set the seed — and set it to 42 in particular — because they see this being done in other code. What are the implications of this? Well, in a nutshell, by reducing the amount of randomness, it increases the likelihood of overfitting, and reduces the range of models that are produced.
Consider the following Python code for loading and splitting the Iris dataset6:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# and split it into train and test partitions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
The random_state=42 bit here means the number 42 is being used to seed the RNG. Everyone who does this7 will be training and testing their models on exactly the same train/test split. So, they'll all be fitting their models to the same training partition and evaluating their models on the same test partition. Which defeats the purpose of doing a random split, since the whole point of a random split is to avoid bias towards specific data partitions.
No one really cares about the Iris dataset specifically, since it’s only used in introductory machine learning courses. However, the same will be true whenever a substantial group of people use any shared dataset with the random seed 42. This includes benchmark datasets, which are commonly used to determine the best model for a particular task, and where bias towards a particular data split could lead to fragile conclusions about which model is best.
Let’s continue the example with the following code, which trains a random forest model on the existing training data:
from sklearn.ensemble import RandomForestClassifier
# make a random forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
# and fit it to the training data
rf_classifier.fit(X_train, y_train)
Since the training data is always the same, and the sequence of random numbers used to train the random forest is always the same, this will result in exactly the same random forest model being trained every time it is run. And this is despite random forests being an innately random model which intentionally tries to reach robust decisions by constructing many decision trees. It’s called a random forest for a reason. So, the same model will be produced every time, and this is unlikely to be the best random forest model that can be produced using this particular training partition. That is, if you changed the random seed, you could probably find a better model.
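Just to illustrate that variability (and emphatically not as a way of hunting for the "best" seed, for reasons discussed below), you can re-train the forest from the snippet above with a handful of different seeds and compare the resulting test scores, which may well differ:
# re-train the same forest with a few different seeds, just to see the variation
for seed in [0, 7, 42, 123]:
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_train, y_train)
    print(seed, clf.score(X_test, y_test))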
So, whilst setting the seed leads to repeatability, it also removes diversity — missing out on a broader, and potentially better, range of solutions. But even more importantly, if everyone uses the same random seed (and I’m looking at you 42), then this leads to collective bias in the models we train, since everyone is using the same set of random numbers to design their models. Or to put it another way, they’re not using random numbers; they’re using a perfectly predictable sequence of numbers. And I think this is a problem. So, if you do want to set the seed, my advice would be to use a different number each time. Use a dice if you have to, but please don’t use 42.
Or even better, don’t set the seed at all. This will result in a different model being trained every time the code is run. Run it lots of times, and you’ll get a whole bunch of different models, each with different strengths and weaknesses, which collectively tell you a lot more about the solution space than training a single model with a fixed random seed8. You can even take these models and ensemble them together, and this may produce something even better than the individual models. So, don’t be afraid of randomness; it’s a good thing!
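As a sketch of this, continuing the Iris example from above, you could train several forests without setting a seed and combine them; scikit-learn's VotingClassifier is one convenient way to do the combining:
from sklearn.ensemble import VotingClassifier
# several forests, each initialised and trained with its own fresh (unset) random state
forests = [(f"rf{i}", RandomForestClassifier(n_estimators=100)) for i in range(5)]
ensemble = VotingClassifier(estimators=forests, voting="soft")
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))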
Incidentally, I find it quite ironic that the specific value of 42 originates from Douglas Adams' excellent science fiction romp The Hitchhiker's Guide to the Galaxy, where the number 42 is revealed to be the answer to the ultimate question of life, the universe, and everything. It's ironic because, in the context of machine learning, the number 42 is effectively doing the opposite, getting in the way of exploring a wider universe of solutions.
One more important point before I go: don't try9 to optimise the value of the seed. It's nonsensical at best, and misleading at worst. If the seed is being used to split the data, then an "optimised value" is likely to be the one that produces the easiest test data. This may lead to an awesome test score, but the score won't be realistic, and the model won't generalise. Optimising seeds is also often seen as a form of cheating, so it's best avoided for that reason alone. Even if you're only optimising a seed for model training (i.e. not for splitting data), it's still a bad idea10.
1. And LLMs don't do it because they think it's a good idea, but simply because so much of their training data included it. If you ask them for their thoughts on the matter, they'll likely tell you that it's not a good idea, which tells you something about how joined up their thinking is.
2. Though I doubt it's implemented in quite this way in practice, since this wouldn't guarantee an exact percentage of samples ending up in the train and test splits.
3. For example, assuming you want 80% of the data to end up in the training set, and assuming that numbers generated by the RNG are between 0 and 1, then you could do this with a threshold of 0.8.
4. Sort of. The seed is used to set the initial value, but there's often scaling and such like.
5. Well, assuming it's run using the same RNG. Different versions of Python, for example, have slightly different RNGs, and this is also the case for other programming languages. If your code contains multi-threaded execution, then the non-determinism of threading would also limit repeatability.
6. Generated by Microsoft Copilot, but perfectly typical of code written by people.
7. A lot of people, given that Iris is one of the first datasets most people come across.
8. And this makes it particularly important to do when your aim is to compare different model types, where a single trained instance would not be representative. You can read more about this sort of thing in my ML pitfalls guide. If you want repeatability, then you can use a different seed each time (for example using the system time) and print these out.
9. And try is the operative word here. Due to the chaotic nature of RNGs, neighbouring seeds are unlikely to lead to similar number sequences, meaning there's no pattern for an optimiser to follow, so you're basically limited to random or exhaustive search.
10. In this situation, you'd presumably be using validation data to evaluate how good each seed is, so the most likely outcome would be overfitting the validation data, with no benefit in terms of generalisability to the test set. It might also cause model sensitivity or instability, due to dependence on a highly specific sequence of random numbers.