Chapter 8 iPSYCH data

8.1 Data on GenomeDK

  • The iPSYCH2015 data (Bybjerg-Grauholm et al., 2020) have been imputed (not by me) using the RICOPILI pipeline (Lam et al., 2020); see that reference for more information about this pipeline

  • imputation has been performed separately for the previous (2012) data and the new (2015) data

  • outputs are in IMPUTE2 Oxford format, storing imputation probabilities (of each genotype being 0, 1, or 2), separately for 2012 and 2015 and by region of the genome

  • RICOPILI “qc1”: variants with an INFO score > 0.1 and MAF > 0.005
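As an illustration of the qc1 filter above, here is a minimal sketch in base R on a toy table of variants. The column names (id, info, maf) and the values are made up for the example; they are not the RICOPILI output format.

```r
# Toy variant table; column names and values are hypothetical
variants <- data.frame(
  id   = c("rs1", "rs2", "rs3", "rs4"),
  info = c(0.95, 0.08, 0.60, 0.30),
  maf  = c(0.200, 0.100, 0.001, 0.010),
  stringsAsFactors = FALSE
)

# qc1 as described above: INFO score > 0.1 and MAF > 0.005
pass_qc1 <- variants$info > 0.1 & variants$maf > 0.005
variants_qc1 <- variants[pass_qc1, ]
variants_qc1
```

Here rs2 fails on the INFO score and rs3 fails on the MAF, so only rs1 and rs4 are kept.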

Then,

  • I converted these data to my bigSNP format by transforming probabilities to dosages and merging all datasets, while restricting to variants passing qc1 in both waves (2012 and 2015); this gives around 8.8M variants across a total of around 134K individuals

  • this is available in the sub-folder bigsnp_r_format/ and can be loaded into R with snp_attach("dosage_ipsych2015.rds"); the backingfile is a single, very large binary file of 1.1 TB

  • this stores dosage data, i.e. expected genotype values \(0 \times P(0) + 1 \times P(1) + 2 \times P(2)\) (between 0 and 2, but rounded to 2 decimal places through CODE_DOSAGE)

  • there is also information on the individuals (in $fam) and on the variants (in $map)
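To make the dosage formula above concrete, here is a minimal sketch in base R. The probabilities are made-up values for three genotype calls, and the rounding with round() only mimics the 2-decimal precision of the stored dosages; it is not the code used for the actual conversion.

```r
# Toy imputation probabilities P(0), P(1), P(2), one row per genotype call
# (made-up values)
probs <- rbind(
  c(0.90, 0.10, 0.00),
  c(0.05, 0.25, 0.70),
  c(0.00, 0.00, 1.00)
)

# Expected genotype: 0 * P(0) + 1 * P(1) + 2 * P(2)
dosage <- drop(probs %*% c(0, 1, 2))

# Keep 2 decimal places, as the stored dosages do
dosage_rounded <- round(dosage, 2)
dosage_rounded
```

On the real data, the dosages are read from the $genotypes element of the object returned by snp_attach(), next to $fam and $map.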

8.2 Data on Statistics Denmark

  • I split the dosage data (from my format) into 137 parts with at most 70K variants (also 4 parts for chromosome X), wrote these to text files, then Sussie converted these to SAS format to be sent to Statistics Denmark

  • Emil helped convert these 137 SAS files back to my format

  • the information on individuals was not sent to Statistics Denmark for some reason, but sample IDs are included (in $fam) so that information on individuals can be found elsewhere on the server and linked to the genotype data via these IDs (using e.g. match() or dplyr::left_join())

  • you can then use my R packages either to analyze the data or to write bed files (with a loss of information, since dosages are further rounded to 0/1/2)
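The two operations mentioned above (linking information on individuals via sample IDs, and rounding dosages to hard calls) can be sketched in base R as follows. All object names, column names, and values are hypothetical stand-ins; on the server, the IDs would come from $fam and the information on individuals from wherever it is stored.

```r
# Toy stand-ins: sample IDs in the order of the genotype data, and a separate
# table of information on individuals (all names and values are made up)
fam_ids <- c("id3", "id1", "id4", "id2")
info <- data.frame(
  sample_id  = c("id1", "id2", "id3", "id5"),
  birth_year = c(1985, 1990, 1988, 1992),
  stringsAsFactors = FALSE
)

# Row of `info` for each genotyped individual (NA when there is no match)
ind <- match(fam_ids, info$sample_id)
info_matched <- info[ind, ]

# Rounding dosages to 0/1/2, which is the information lost in bed files
dosages <- c(0.04, 0.97, 1.62, 2.00)
hard_calls <- round(dosages)
```

Using match() keeps the rows in the same order as the genotype data, which is what you need when passing phenotypes or covariates alongside the genotype matrix; individuals without a match (id4 here) get NA rather than being dropped.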

8.3 Warnings about the data

  • there are 49 duplicate individuals in the data on GenomeDK, and these were removed from the data transferred to Statistics Denmark

  • only dosage data is available on Statistics Denmark (i.e. imputation probabilities in the original format are not available there)

  • imputation is far from perfect (due to small genotyping chips)

  • imputation accuracies, as well as some allele frequencies, are not the same for 2012 and 2015; you may need to perform some QC, analyze the two cohorts separately, or at least add an indicator variable (e.g. is_2012) as a covariate, for instance when performing a GWAS
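As a sketch of the last suggestion, here is how a wave indicator could enter a single-variant association test. Everything below is simulated (it is not iPSYCH data), the effect sizes are arbitrary, and lm() stands in for a proper GWAS tool.

```r
set.seed(1)
n <- 500

# Simulated stand-ins: wave indicator, dosages for one variant, and a
# continuous phenotype that depends on both
is_2012 <- rbinom(n, size = 1, prob = 0.5)
dosage  <- round(runif(n, min = 0, max = 2), 2)
y       <- 0.3 * dosage + 0.5 * is_2012 + rnorm(n)

# Association test for this variant, adjusting for the wave indicator
fit <- lm(y ~ dosage + is_2012)
coef(summary(fit))
```

With bigstatsr, the same indicator would typically go into the covar.train argument of big_univLinReg() or big_univLogReg(), together with PCs and any other covariates.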

8.4 Other data available

  • KING relatedness coefficients computed between pairs of individuals, for pairs with a coefficient \(> 2^{-4.5}\) (cf. section 4.1)

  • PCs computed on the combined data, following best practices from Privé, Luu, Blum, et al. (2020) (cf. chapter 5)

  • a subset of homogeneous individuals (basically Northern Europeans) derived from PCs with two lines of code

  • PGS for 215 different traits and diseases, based on the UK Biobank individual-level data (Privé, Aschard, et al., 2022), computed for all iPSYCH individuals

  • 900+ external PGS derived by Clara from externally published GWAS summary statistics (Albiñana et al., 2023)

  • even more PGS derived by Ole (ask him about them)
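To give an idea of how a homogeneous subset can be derived from PCs, as mentioned in the list above, here is a minimal sketch in base R. It is not the two lines of code used for the actual subset, which are not shown here: it swaps in a simple robust distance (coordinate-wise median and MAD, then mahalanobis()), uses simulated PCs, and applies a threshold chosen for this toy example only.

```r
set.seed(42)

# Simulated stand-in for two PCs (not iPSYCH data): a main cluster of 200
# individuals, followed by 20 individuals from a distant cluster
PC <- rbind(
  matrix(rnorm(200 * 2), ncol = 2),
  matrix(rnorm(20 * 2, mean = 10), ncol = 2)
)

# Robust center and (diagonal) scale, so the distant cluster has little influence
center <- apply(PC, 2, median)
scale2 <- apply(PC, 2, mad)^2

# Squared distance of each individual to the robust center
dist2 <- mahalanobis(PC, center = center, cov = diag(scale2))

# Threshold picked for this toy example; on real data, inspect hist(log(dist2))
keep <- dist2 < 30
table(keep)
```

The idea is that individuals far from the center of the main cluster in PC space are excluded; where to put the threshold is a judgment call that is best made by looking at the distribution of (log-)distances.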

References

Albiñana, C., Zhu, Z., Schork, A.J., Ingason, A., Aschard, H., Brikell, I., et al. (2023). Multi-PGS enhances polygenic prediction by combining 937 polygenic scores. Nature Communications, 14, 4702.
Bybjerg-Grauholm, J., Pedersen, C.B., Bækvad-Hansen, M., Pedersen, M.G., Adamsen, D., Hansen, C.S., et al. (2020). The iPSYCH2015 case-cohort sample: Updated directions for unravelling genetic and environmental architectures of severe mental disorders. medRxiv. Retrieved from https://doi.org/10.1101/2020.11.30.20237768
Lam, M., Awasthi, S., Watson, H.J., Goldstein, J., Panagiotaropoulou, G., Trubetskoy, V., et al. (2020). RICOPILI: rapid imputation for COnsortias PIpeLIne. Bioinformatics, 36, 930–933.
Privé, F., Aschard, H., Carmi, S., Folkersen, L., Hoggart, C., O’Reilly, P.F., & Vilhjálmsson, B.J. (2022). Portability of 245 polygenic scores when derived from the UK Biobank and applied to 9 ancestry groups from the same cohort. The American Journal of Human Genetics, 109, 12–23.
Privé, F., Luu, K., Blum, M.G., McGrath, J.J., & Vilhjálmsson, B.J. (2020). Efficient toolkit implementing best practices for principal component analysis of population genetic data. Bioinformatics, 36, 4449–4457.