June 26, 2020

Wide Data - Part 3

Polygenic Risk Scores

In previous posts (Part 1, Part 2) we’ve covered the challenges that Wide Data, Big Data’s more complex cousin, presents to data scientists. We’ve also discussed mass sequencing as an attempt to make data bigger and less wide. While increasing sample size is helpful in general, we can only increase it linearly, whereas our feature space grows exponentially as we strive to discover even mildly complex patterns. Thus, ultimately, we need algorithmic solutions to tackle Wide Data. In this post and the next, we examine the industry standard for the analysis of Wide Genomic Data, Polygenic Risk Scores, as well as a deep learning solution.

Polygenic Risk Scores (PRS) promise that “scientists can capture in a single measurement how millions of sites across the genome can impact one patient’s health”. In essence, PRS are the weighted sum of all genetic variants that bear influence on a given disease. Patients with a high risk score are more likely to develop a disease.
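
To make the "weighted sum" concrete, here is a minimal sketch in Python. The function name `polygenic_risk_score`, the allele counts, and the weights are all made up for illustration; in practice the weights would come from a prior association study, and real pipelines add steps (variant selection, normalization) that are omitted here.

```python
def polygenic_risk_score(allele_counts, weights):
    """Weighted sum of one patient's risk-allele counts (0, 1 or 2 per variant)."""
    if len(allele_counts) != len(weights):
        raise ValueError("need exactly one weight per variant")
    return sum(count * weight for count, weight in zip(allele_counts, weights))

# Hypothetical per-variant weights (illustrative values, not real effect sizes)
weights = [0.10, -0.05, 0.20, 0.30]

# Hypothetical patients: each row holds the allele counts for the four variants
patients = [
    [0, 1, 2, 0],
    [1, 1, 0, 0],
    [2, 0, 1, 1],
]

scores = [polygenic_risk_score(p, weights) for p in patients]
for i, s in enumerate(scores):
    print(f"patient {i}: score {s:.2f}")
```

Note that each variant contributes independently of all the others: nothing in this sum can express "variant A matters only if variant B is present".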

Even though PRS are possibly the most popular approach to deal with Wide Genomic Data, they do little to address the obvious concerns discussed in previous posts:

  1. Despite having “polygenic” in the name, the score covers merely additive effects (variant A leads to an increase in risk, and so does variant B), but not interactions between different variants (variant A leads to an increase in risk, but only if variant B is also present and variant C is not present).
  2. The basis of a risk score is still blindly testing each single gene variant for a correlation with the disease - a practice that leads to the multiple comparisons problem, as discussed in Part 2. Therefore any artifacts originating from that bad practice may also be present in PRS.
  3. PRS overfit. Because they include tens of thousands of genetic variants, PRS are very prone to learning patterns in the training data that are not present in the real world. PRS advocates deny this with fervor, but if you are in any doubt, just use PRS in a predictive model and see for yourself how it will perfectly fit the training set, but struggle to predict the test portion of the data.
  4. PRS do not work with sample sizes of <2000 patients and are therefore not a viable solution for rare disease data.
  5. Our tests show that predictive models based on PRS often have an accuracy of <60%, i.e. out of a hundred patients, the model predicts fewer than 60 correctly and more than 40 incorrectly. Compare this with tossing a coin to predict who will suffer from a disease, which leads to an accuracy of 50%. This limits the practical use of PRS. Note that, as with all predictive models, extreme cases (here a risk score close to 0 or 100) are more likely to be accurately predicted.
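
The overfitting concern in point 3 can be illustrated with a small simulation. This is only a sketch under strong assumptions: the genotypes are random, the disease labels carry no genetic signal at all, and the weights are a deliberately simple stand-in (the difference in mean allele count between cases and controls) rather than the effect sizes a real PRS pipeline would estimate. All function names are hypothetical.

```python
import random

def simulate_genotypes(n_patients, n_variants, rng):
    # Allele counts 0/1/2, drawn as two fair coin flips per variant
    return [[(rng.random() < 0.5) + (rng.random() < 0.5) for _ in range(n_variants)]
            for _ in range(n_patients)]

def column_means(rows, n_variants):
    return [sum(row[j] for row in rows) / len(rows) for j in range(n_variants)]

def fit_weights(genotypes, labels):
    # Simplified stand-in for effect sizes: mean allele count in cases minus controls
    n_variants = len(genotypes[0])
    cases = [g for g, y in zip(genotypes, labels) if y == 1]
    controls = [g for g, y in zip(genotypes, labels) if y == 0]
    case_means = column_means(cases, n_variants)
    control_means = column_means(controls, n_variants)
    return [c - k for c, k in zip(case_means, control_means)]

def score(genotype, weights):
    return sum(g * w for g, w in zip(genotype, weights))

def accuracy(genotypes, labels, weights, threshold):
    # A patient is predicted to be a case when the score exceeds the threshold
    hits = sum((score(g, weights) > threshold) == (y == 1)
               for g, y in zip(genotypes, labels))
    return hits / len(labels)

rng = random.Random(0)
n_variants = 2000  # far more features than training patients: "wide" data

# Labels are assigned independently of the genotypes, so there is nothing to learn
train_x = simulate_genotypes(100, n_variants, rng)
train_y = [1] * 50 + [0] * 50
test_x = simulate_genotypes(200, n_variants, rng)
test_y = [1] * 100 + [0] * 100

weights = fit_weights(train_x, train_y)

# Threshold: midpoint between the mean training score of cases and of controls
train_scores = [score(g, weights) for g in train_x]
case_mean = sum(s for s, y in zip(train_scores, train_y) if y == 1) / 50
control_mean = sum(s for s, y in zip(train_scores, train_y) if y == 0) / 50
threshold = (case_mean + control_mean) / 2

train_acc = accuracy(train_x, train_y, weights, threshold)
test_acc = accuracy(test_x, test_y, weights, threshold)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Because the labels are pure noise, any accuracy above chance on the training set is memorization. With this many variants per patient, the training accuracy should come out close to perfect while the test accuracy should hover around the coin-toss level. Real data does contain some signal, so the gap will be less extreme, but the mechanism is the same.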

Given that PRS raise a variety of statistical concerns and have limited predictive performance, why are they so popular?

This might be a question for social scientists to explore. Perhaps medical practitioners have been abundantly cautious about making predictions in the face of uncertainty and therefore gravitate toward metrics that bear their probabilistic nature in full view. I would be very interested to hear your thoughts on the matter. What accounts for the popularity of PRS, in your view? Do you use them yourself, and if so, why? Please let us know in the comments!