Polygenic scores from individual-level data

If you have individual-level data (i.e. both genotypes and phenotypes), you can in principle use any supervised learning (machine learning) method to train a PGS. However, because of the size of genetic data, most of these methods quickly run into scalability issues. Moreover, it has been shown that effects for most diseases and traits are small and essentially additive, and that more complex methods such as deep learning are not very effective at constructing PGS (Kelemen et al. 2025).
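To make "essentially additive" concrete, here is a minimal sketch in plain Python of how a linear PGS is computed once per-variant effect sizes are available: each individual's score is the sum of their allele counts weighted by those effects. The function name `polygenic_score`, the toy genotype matrix, and the effect sizes are all made up for illustration; they do not come from any real dataset or package.

```python
def polygenic_score(genotypes, effects, intercept=0.0):
    """Additive PGS: intercept + sum over variants of (allele count * effect size)."""
    if len(genotypes) != len(effects):
        raise ValueError("one effect size is needed per variant")
    return intercept + sum(g * b for g, b in zip(genotypes, effects))


# Hypothetical toy data: 3 individuals x 4 variants, allele counts coded 0/1/2.
G = [
    [0, 1, 2, 0],
    [1, 1, 0, 2],
    [2, 0, 1, 1],
]
# Made-up per-allele effect sizes (one per variant).
beta = [0.5, -0.25, 0.125, 0.0]

scores = [polygenic_score(row, beta) for row in G]
print(scores)
```

Under this additive model there are no interaction terms between variants, which is why a (penalized) linear model is a natural fit for the training step.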

Therefore, penalized linear or logistic regression (PLR) is an efficient and effective way to train a PGS. In my R package bigstatsr, I have developed a very fast implementation with automatic choice of the two hyper-parameters (Privé, Aschard, and Blum 2019). You can find a tutorial explaining its implementation and use here.
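As a rough illustration of what a penalized linear regression does, the sketch below fits an elastic-net model by coordinate descent in plain Python. It is not the bigstatsr implementation and makes several assumptions: the two hyper-parameters are taken to be the usual elastic-net pair (called `lam` and `alpha` here), they are fixed by hand rather than chosen automatically, the inputs are assumed to be centered so that no intercept is needed, and the toy data are invented.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net(X, y, lam, alpha, n_sweeps=100):
    """Coordinate descent for the objective
        (1 / 2n) * ||y - X b||^2 + lam * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||^2),
    assuming y and the columns of X are already centered (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residuals y - X beta, starting from beta = 0
    col_ms = [sum(row[j] ** 2 for row in X) / n for j in range(p)]  # mean squares
    for _ in range(n_sweeps):
        for j in range(p):
            if col_ms[j] == 0.0:
                continue  # constant (monomorphic) column: nothing to estimate
            # Average product of variant j with the partial residual,
            # i.e. the residual with variant j's own contribution added back.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_ms[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (col_ms[j] + lam * (1 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


# Hypothetical centered toy data: 6 individuals x 2 variants.
# Only the first variant affects the phenotype (true effect = 2).
X = [[-1, -1], [-1, 1], [0, -1], [0, 1], [1, 0], [1, 0]]
y = [-2.0, -2.0, 0.0, 0.0, 2.0, 2.0]

beta_unpenalized = elastic_net(X, y, lam=0.0, alpha=1.0)  # ordinary least squares
beta_lasso = elastic_net(X, y, lam=0.5, alpha=1.0)        # effect shrunk toward 0
beta_null = elastic_net(X, y, lam=2.0, alpha=1.0)         # penalty removes everything
print(beta_unpenalized, beta_lasso, beta_null)
```

In this parameterization, `lam` controls the overall strength of the penalty, while `alpha` moves between a ridge-like penalty (`alpha` near 0) and a lasso penalty (`alpha = 1`) that sets small effects exactly to zero. In practice both values would be tuned on the data rather than set by hand.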

Here is an example of using PLR to predict height from genotypes in the UK Biobank:

training on 350K individuals x 656K variants took less than 24 hours
within both males and females, the PGS achieved a correlation of 65.5% (\(r^2\) of 42.9%) between genetically predicted and true height
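The two accuracy figures above are related by a simple squaring: the \(r^2\) is the square of the correlation between predicted and true values. The short Python sketch below shows how such a correlation can be computed and checks the arithmetic for the reported 65.5%. The helper `pearson_r` and the five pairs of heights are hypothetical and are not UK Biobank data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy values (cm): genetically predicted vs. measured height.
predicted = [165.0, 170.0, 172.0, 180.0, 185.0]
observed = [160.0, 174.0, 170.0, 178.0, 190.0]

r = pearson_r(predicted, observed)
r2 = r ** 2  # proportion of variance in observed height explained by the PGS
print(round(r, 3), round(r2, 3))

# Arithmetic check of the figures reported in the text.
reported_r = 0.655
reported_r2_pct = round(reported_r ** 2 * 100, 1)
print(reported_r2_pct)  # 42.9
```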

References

Kelemen, Martin, Yu Xu, Tao Jiang, Jing Hua Zhao, Carl A Anderson, Chris Wallace, Adam Butterworth, and Michael Inouye. 2025. “Performance of Deep-Learning-Based Approaches to Improve Polygenic Scores.” Nature Communications 16 (1): 1–9. https://doi.org/10.1038/s41467-025-60056-1.

Privé, Florian, Hugues Aschard, and Michael GB Blum. 2019. “Efficient Implementation of Penalized Regression for Genetic Risk Prediction.” Genetics 212 (1): 65–74. https://doi.org/10.1534/genetics.119.302019.

Florian Privé

July 4, 2025
