In the first post of this series, we introduced a type of data that is the opposite of big data: wide data. An extreme form of wide data is genomic data, where you have a limited sample size, usually in the hundreds or thousands, and for each sample/patient, you have millions of genetic variants that could cause a given disease.
If you use popular machine learning algorithms, such as deep learning, on this wide data, your models will drastically overfit - they will learn patterns in the training data that describe the disease well in the training data, but do not show any correlation with it in the real world.
Remember, the goal of machine learning and predictive modeling is to find patterns in your training data that generalize well to the real world, so that you can make predictions. For example, you might train your model on the genetics of a certain disease and then go out into the world and identify patients at risk.
Using simpler models, such as logistic regression, does not work either. One would test (and this is actually a common practice in bioinformatics) each one of the millions of genetic variants for an association with the disease.
Have you heard of the multiple testing problem?
Well, if you have a data set where the number of observations (patients in this case), is not drastically larger than the number of features (genetic variants), testing each variant for an association with the disease will lead you to find patterns that randomly explain the disease very well. Have a look at the simple table of randomly generated features below.
Because the number of features is larger than the number of observations, even randomly generated features are able to predict the outcome very well - in this case Variable 7 is a perfect predictor. This, of course, doesn’t mean that this random feature has anything to do with the outcome or would be able to make a prediction better than chance.
A somewhat funny illustration of the multiple testing problem comes from the analysis of US government data. Here the US collected all kinds of different information on parts of their population and then decided to test each variable they collected for a correlation with every other variable - so they did a lot of statistical tests. What they found was a near-perfect correlation between spending on science and suicide rate. Yes, as some people asked me, this was across political parties. And no, of course this effect does not exist in the ‘real world’. It is an artifact of multiple testing.
Have you encountered overfitting? What methods do you use to avoid the problem? Can we solve this by increasing the sample size, i.e., collecting data from a lot more people? Let us know what you think in the comments. We’ll give our two cents in the next post.