Chapter 8 iPSYCH data

8.1 Data on GenomeDK

  • The iPSYCH2015 data (Bybjerg-Grauholm et al., 2020) has been imputed (not by me) using the RICOPILI pipeline (Lam et al., 2020); you can get more info about this pipeline here

  • imputation has been performed separately for the previous (2012) data and the new (2015) data

  • outputs are in IMPUTE2 Oxford format storing imputation probabilities (to be 0, 1, or 2), separated for 2012 and 2015 and by regions of the genome

  • RICOPILI “qc1”: variants with an INFO score > 0.1 and MAF > 0.005


  • I transformed this data to my bigSNP format by transforming probabilities to dosages, and merging all datasets while restricting to variants passing qc1 for both waves 2012 / 2015 (around 8.8M variants, across a total of around 134K individuals)

  • this is available in the sub-folder bigsnp_r_format/, and can be loaded into R with snp_attach("dosage_ipsych2015.rds"), and for which the backingfile is one very large binary file of 1.1 TB

  • this stores dosage data, i.e. expected genotype values \(0 \times P(0) + 1 \times P(1) + 2 \times P(2)\) (between 0 and 2, but rounded to 2 decimal places through CODE_DOSAGE)

  • there are also information on the individuals (in $fam) and on the variants (in $map)

8.2 Data on Statistics Denmark

  • I split the dosage data (from my format) into 137 parts with at most 70K variants (also 4 parts for chromosome X), wrote these to text files, then Sussie converted these to SAS format to be sent to Statistics Denmark

  • Emil helped convert these 137 SAS files back to my format

  • the information on individuals was not sent to Statistics Denmark for some reason, but sample IDs are included (in $fam) so that information on individuals can be found elsewhere on the server and linked to the genotype data via these IDs (using e.g. match() or dplyr::left_join())

  • then you can either use my R packages to analyze the data or to write bed files (with a loss of information, where dosages are further rounded to 0/1/2)

8.3 Warnings about the data

  • there are 49 duplicate individuals in the data on GenomeDK, and these were removed from the data transferred to Statistics Denmark

  • only dosage data is available on Statistics Denmark (i.e. imputation probabilities in the original format are not available there)

  • imputation is far from perfect (due to small chips)

  • imputation accuracies are not the same for 2012 / 2015, as well as for some allele frequencies; you may need to perform some QC, analyze the two cohorts separately, or at least add an indicator variable as covariate (is_2012, e.g. when performing a GWAS)

8.4 Other data available

  • relatedness KING coefficients (\(> 2^{-4.5}\)) computed between pairs of individuals (cf. section 4.1)

  • PCs computed on the combined data, following best practices from Privé, Luu, Blum, et al. (2020) (cf. chapter 5)

  • a subset of homogeneous individuals (basically Northern Europeans) derived from PCs with two lines of code

  • polygenic scores for 215 different traits and diseases, based on the UK Biobank individual-level data (Privé, Aschard, et al., 2022), computed for all iPSYCH individuals

  • 900+ external polygenic scores derived by Clara from externally published GWAS summary statistics (Albinana et al., 2022)


Albinana, C., Zhu, Z., Schork, A.J., Ingason, A., Aschard, H., Brikell, I., et al.others. (2022). Multi-PGS enhances polygenic prediction: Weighting 937 polygenic scores. medRxiv. Retrieved from
Bybjerg-Grauholm, J., Pedersen, C.B., Bækvad-Hansen, M., Pedersen, M.G., Adamsen, D., Hansen, C.S., et al.others. (2020). The iPSYCH2015 case-cohort sample: Updated directions for unravelling genetic and environmental architectures of severe mental disorders. medRxiv. Retrieved from
Lam, M., Awasthi, S., Watson, H.J., Goldstein, J., Panagiotaropoulou, G., Trubetskoy, V., et al.others. (2020). RICOPILI: rapid imputation for COnsortias PIpeLIne. Bioinformatics, 36, 930–933.
Privé, F., Aschard, H., Carmi, S., Folkersen, L., Hoggart, C., O’Reilly, P.F., & Vilhjálmsson, B.J. (2022). Portability of 245 polygenic scores when derived from the UK Biobank and applied to 9 ancestry groups from the same cohort. The American Journal of Human Genetics, 109, 12–23.
Privé, F., Luu, K., Blum, M.G., McGrath, J.J., & Vilhjálmsson, B.J. (2020). Efficient toolkit implementing best practices for principal component analysis of population genetic data. Bioinformatics, 36, 4449–4457.