April 15, 2020

Wide Data

It's not Big Data!

We’ve all seen and experienced the benefits that big data has brought to machine learning. Computers have become good at understanding human language and interpreting images - think of self-driving cars.

Big data means that you have millions (if not billions) of observations, and for each observation you have a manageable amount of data. When data scientists analyze Twitter data, for instance, they have hundreds of millions of Tweets, but each Tweet is shorter than 30 words.

A Tweet (from the German version of Twitter, for those who were wondering)

This is classical big data, and all the well-known machine learning techniques work very well as long as big data is provided. The availability of big data eclipses the actual algorithms used. You might know the MNIST data - it’s a large collection of images of handwritten digits, and tons of machine learning tutorials use this data set. If you haven’t yet, you should try it out, and don’t worry - you will get great results. Even the most basic machine learning models can identify 97% of the digits correctly. If you really put in some effort and build complicated deep nets, you can get up to 99.7% accuracy, so using the fanciest ‘AI’ algorithm out there gives you a whopping 2.7 percentage points more - seems like data trumps algorithm.

Hand-written digits from the MNIST data set

While I’m not arguing that there are no clever ways of using deep learning that yield massive benefits, the availability of big data is the driving factor. If you have access to large amounts of nice data that nobody else has, go to Yoshua Bengio’s GitHub page, download the latest deep net, and go out there and pitch your startup - in 2015 you would have had a better chance than in 2020, but it’s still worth a shot.

The message so far - if you have access to big data, use it!

But what about areas in which data is super expensive or simply doesn’t exist? What if you need to cure diseases? You are looking at human genomes, so your data sets contain, if you’re lucky, 2000 subjects. In the Twitter data mentioned at the beginning of this post, you have 500 million observations and 20 words per observation; in the genomic data, you have 2000 observations and for each observation (patient) you have 2 million genetic variants that could play a role in the disease - now your data is not big, it is small and wide:

Instead of learning something very simple from a large number of examples, your algorithm now needs to learn something incredibly complex from very few examples.

Wide Data vs. Big Data
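To put rough numbers on the two shapes, here is a minimal Python sketch using the figures from this post. It assumes that each word in a Tweet and each genetic variant counts as one feature, which is a simplification. The `DatasetShape` class and its property names are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetShape:
    """Hypothetical helper: a data set described only by its dimensions."""
    name: str
    observations: int  # rows: Tweets or patients
    features: int      # columns: words per Tweet or genetic variants per patient

    @property
    def cells(self) -> int:
        # Total number of values in the table.
        return self.observations * self.features

    @property
    def observations_per_feature(self) -> float:
        # How many examples there are to learn from, per feature.
        return self.observations / self.features


# Figures taken from the post; one word = one feature is an assumption.
twitter = DatasetShape("Twitter (big)", 500_000_000, 20)
genomics = DatasetShape("Genomics (wide)", 2_000, 2_000_000)

for d in (twitter, genomics):
    print(f"{d.name:16} cells={d.cells:,}  "
          f"observations per feature={d.observations_per_feature:,.3f}")
```

Under these assumptions the two tables hold a similar number of values (10 billion versus 4 billion), but the Twitter data has 25 million observations per feature, while the genomic data has 0.001. The difference lies in the shape of the data, not in its sheer volume.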

Have you ever crossed paths with Wide Data or run into the Curse of Dimensionality? Or just wondered if there is a world beyond Big Data? Let us know in the comments!

In the next post, we’ll discuss what happens when you use common machine learning techniques on this wide data, and what alternative solutions people have devised.