Chapter 8 iPSYCH data

8.1 Data on GenomeDK

The iPSYCH2015 data (Bybjerg-Grauholm et al., 2020) has been imputed (not by me) using the RICOPILI pipeline; see Lam et al. (2020) for more information about this pipeline
imputation has been performed separately for the previous (2012) data and the new (2015) data
outputs are in IMPUTE2 Oxford format, storing imputation probabilities (for each genotype to be 0, 1, or 2), in separate files for 2012 and 2015 and for each region of the genome
RICOPILI “qc1”: variants with an INFO score > 0.1 and MAF > 0.005

Then,

I converted this data to my bigSNP format by transforming probabilities to dosages, and merged all datasets while restricting to variants passing qc1 in both waves (2012 and 2015); this gives around 8.8M variants across a total of around 134K individuals
this is available in the sub-folder bigsnp_r_format/ and can be loaded into R with snp_attach("dosage_ipsych2015.rds"); the backingfile is one very large binary file of 1.1 TB
this stores dosage data, i.e. expected genotype values $0 \times P(0) + 1 \times P(1) + 2 \times P(2)$ (between 0 and 2, but rounded to 2 decimal places through CODE_DOSAGE)
there is also information on the individuals (in $fam) and on the variants (in $map)
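
To make the dosage definition above concrete, here is a minimal sketch in base R. It is a toy illustration with made-up probabilities, not the code that was actually used for the conversion; the object names (probs, dosage, dosage_2dp) are hypothetical.

```r
# Toy imputation probabilities for 4 individuals at one variant (made-up values);
# columns are P(0), P(1), P(2), and each row sums to 1.
probs <- rbind(
  c(0.90,  0.10,  0.00),
  c(0.05,  0.90,  0.05),
  c(0.00,  0.25,  0.75),
  c(0.333, 0.334, 0.333)
)
stopifnot(all(abs(rowSums(probs) - 1) < 1e-8))

# Expected genotype value: 0 * P(0) + 1 * P(1) + 2 * P(2)
dosage <- drop(probs %*% c(0, 1, 2))

# Keep 2 decimal places only, mimicking the precision of the stored dosages
dosage_2dp <- round(dosage, 2)
dosage_2dp
```

Note that the last individual has nearly uniform probabilities yet gets a dosage of about 1, the same as a confident heterozygote; this is the kind of information that dosages do not retain compared to the full probabilities.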

8.2 Data on Statistics Denmark

I split the dosage data (from my format) into 137 parts, each with at most 70K variants (also 4 parts for chromosome X), and wrote these to text files; Sussie then converted these to SAS format to be sent to Statistics Denmark
Emil helped convert these 137 SAS files back to my format
the information on individuals was not sent to Statistics Denmark for some reason, but sample IDs are included (in $fam) so that information on individuals can be found elsewhere on the server and linked to the genotype data via these IDs (using e.g. match() or dplyr::left_join())
you can then use my R packages either to analyze the data or to write bed files (with a loss of information, since dosages are further rounded to 0/1/2)
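
As a rough illustration of the linking step mentioned above, the following base R sketch aligns a table of individual-level information with the order of the genotype data using match(). All IDs, column names (sample_id, is_2012) and values are made up for the example; the actual names in $fam and in the register data may differ. The last two lines also show what rounding dosages to 0/1/2 does.

```r
# Hypothetical sample IDs, in the order of the genotype data (as stored in $fam)
geno_ids <- c("id3", "id1", "id4", "id2")

# Hypothetical table of individual-level information, in a different order
info <- data.frame(
  sample_id = c("id1", "id2", "id3", "id5"),
  is_2012   = c(TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Position of each genotyped individual in `info` (NA when there is no match)
ind <- match(geno_ids, info$sample_id)
info_matched <- info[ind, ]  # one row per genotyped individual, in genotype order

# Toy dosages for these 4 individuals at one variant
dosage <- c(0.10, 1.49, 1.75, 2.00)
# Rounding to hard calls 0/1/2 (as needed for a bed file) loses information
hard_call <- round(dosage)
```

It is worth checking anyNA(ind) before going further, since individuals without a match get a row of missing values rather than being dropped.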

8.3 Warnings about the data

there are 49 duplicate individuals in the data on GenomeDK, and these were removed from the data transferred to Statistics Denmark
only dosage data is available on Statistics Denmark (i.e. imputation probabilities in the original format are not available there)
imputation is far from perfect (due to small genotyping chips)
imputation accuracies differ between the 2012 and 2015 waves, and so do some allele frequencies; you may need to perform some QC, analyze the two cohorts separately, or at least add an indicator variable (is_2012) as covariate, e.g. when performing a GWAS
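
The last warning can be sketched on simulated data as follows. This is only a toy example in base R, assuming a binary phenotype and a single variant; the data are simulated, and the variable names other than is_2012 are hypothetical. In a real GWAS the same covariate would be passed to whichever association function you use.

```r
set.seed(1)
n <- 500

# Wave indicator: 1 for individuals from the 2012 wave, 0 for the 2015 wave
is_2012 <- rbinom(n, size = 1, prob = 0.5)

# Simulated dosage whose mean differs between waves (a batch difference),
# truncated to [0, 2] and rounded to 2 decimal places
dosage <- round(pmin(pmax(rnorm(n, mean = 0.6 + 0.2 * is_2012, sd = 0.4), 0), 2), 2)

# Simulated binary phenotype that depends on the wave but not on the variant
y <- rbinom(n, size = 1, prob = plogis(-1 + is_2012))

# Association test without and with the wave indicator as covariate
fit_naive <- glm(y ~ dosage, family = binomial)
fit_adj   <- glm(y ~ dosage + is_2012, family = binomial)

coef(summary(fit_naive))
coef(summary(fit_adj))
```

In the adjusted model, the effect of the variant is estimated conditionally on the wave, so a difference in dosage that is only due to the wave should no longer be attributed to the variant.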

8.4 Other data available

KING relatedness coefficients ($> 2^{-4.5}$) computed between pairs of individuals (cf. section 4.1)
PCs computed on the combined data, following best practices from Privé, Luu, Blum, et al. (2020) (cf. chapter 5)
a subset of homogeneous individuals (basically Northern Europeans) derived from PCs with two lines of code
PGS for 215 different traits and diseases, based on the UK Biobank individual-level data (Privé, Aschard, et al., 2022), computed for all iPSYCH individuals
900+ external PGS derived by Clara from externally published GWAS summary statistics (Albiñana et al., 2023)
even more PGS derived by Ole (ask him about them)
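
To give an idea of how the relatedness threshold above is used, here is a small base R sketch that filters a table of pairwise KING coefficients at $2^{-4.5}$ (about 0.044, the usual lower bound for third-degree relatives). The table and its column names (id1, id2, kinship) are made up for the example and do not reflect the format of the actual file.

```r
# Hypothetical table of pairwise KING kinship coefficients
king <- data.frame(
  id1     = c("a", "a", "b", "c"),
  id2     = c("b", "c", "d", "d"),
  kinship = c(0.25, 0.01, 0.06, 0.044),
  stringsAsFactors = FALSE
)

# Threshold used in this chapter
thr <- 2^(-4.5)

# Pairs considered as related
related <- king[king$kinship > thr, ]
related
```

Here the pair (c, d) is not kept because 0.044 is just below the threshold.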

References

Albiñana, C., Zhu, Z., Schork, A.J., Ingason, A., Aschard, H., Brikell, I., et al. (2023). Multi-PGS enhances polygenic prediction by combining 937 polygenic scores. Nature Communications, 14, 4702.

Bybjerg-Grauholm, J., Pedersen, C.B., Bækvad-Hansen, M., Pedersen, M.G., Adamsen, D., Hansen, C.S., et al. (2020). The iPSYCH2015 case-cohort sample: Updated directions for unravelling genetic and environmental architectures of severe mental disorders. medRxiv. Retrieved from https://doi.org/10.1101/2020.11.30.20237768

Lam, M., Awasthi, S., Watson, H.J., Goldstein, J., Panagiotaropoulou, G., Trubetskoy, V., et al. (2020). RICOPILI: rapid imputation for COnsortias PIpeLIne. Bioinformatics, 36, 930–933.

Privé, F., Aschard, H., Carmi, S., Folkersen, L., Hoggart, C., O’Reilly, P.F., & Vilhjálmsson, B.J. (2022). Portability of 245 polygenic scores when derived from the UK Biobank and applied to 9 ancestry groups from the same cohort. The American Journal of Human Genetics, 109, 12–23.

Privé, F., Luu, K., Blum, M.G., McGrath, J.J., & Vilhjálmsson, B.J. (2020). Efficient toolkit implementing best practices for principal component analysis of population genetic data. Bioinformatics, 36, 4449–4457.