# Chapter 8 iPSYCH data

## 8.1 Data on GenomeDK

• imputation has been performed separately for the previous (2012) data and the new (2015) data

• outputs are in IMPUTE2 Oxford format, storing imputation probabilities (of each genotype being 0, 1, or 2), in separate files for 2012 and 2015 and for each region of the genome

• RICOPILI “qc1”: variants with an INFO score > 0.1 and MAF > 0.005

Then,

• I converted these data to my bigSNP format by transforming probabilities to dosages and merging all datasets, while restricting to variants passing qc1 in both waves 2012 / 2015 (around 8.8M variants, across a total of around 134K individuals)

• this is available in the sub-folder bigsnp_r_format/ and can be loaded into R with snp_attach("dosage_ipsych2015.rds"); the backingfile is one very large binary file of 1.1 TB

• this stores dosage data, i.e. expected genotype values $$0 \times P(0) + 1 \times P(1) + 2 \times P(2)$$ (between 0 and 2, but rounded to 2 decimal places through CODE_DOSAGE)

• there is also information on the individuals (in $fam) and on the variants (in $map)
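As a small illustration of the dosage definition above, here is a minimal base R sketch. The probabilities are made up, and rounding with round() only mimics the 2-decimal precision of the stored dosages; it is not how CODE_DOSAGE is implemented internally.

```r
# Toy genotype probabilities P(0), P(1), P(2) for three individuals at one
# variant (made-up values; each row sums to 1).
probs <- rbind(
  c(0.90, 0.10, 0.00),
  c(0.05, 0.80, 0.15),
  c(0.00, 0.02, 0.98)
)

# Expected genotype: 0 * P(0) + 1 * P(1) + 2 * P(2)
dosage <- drop(probs %*% 0:2)

# Mimic the 2-decimal precision of the stored dosages (illustration only)
dosage_rounded <- round(dosage, 2)
print(dosage_rounded)
```

Each value lies between 0 and 2; the closer it is to an integer, the more certain the imputed genotype.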

## 8.2 Data on Statistics Denmark

• I split the dosage data (from my format) into 137 parts with at most 70K variants (also 4 parts for chromosome X), wrote these to text files, then Sussie converted these to SAS format to be sent to Statistics Denmark

• Emil helped convert these 137 SAS files back to my format

• the information on individuals was not sent to Statistics Denmark for some reason, but sample IDs are included (in $fam) so that information on individuals can be found elsewhere on the server and linked to the genotype data via these IDs (using e.g. match() or dplyr::left_join())

• then you can either use my R packages to analyze the data, or write bed files (with a loss of information, since dosages are further rounded to 0/1/2)
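Linking individual-level information to the genotype data via sample IDs could look like the following sketch. The data frames and the column names sample.ID and is_2012 are hypothetical stand-ins; the actual column names in $fam and in the tables on the server may differ.

```r
# Hypothetical stand-in for obj$fam: sample IDs in the row order of the
# genotype matrix.
fam <- data.frame(sample.ID = c("id3", "id1", "id4", "id2"))

# Hypothetical table of individual-level information found elsewhere
# (it may be in a different order and may miss some individuals).
info <- data.frame(
  sample.ID = c("id1", "id2", "id3"),
  is_2012   = c(TRUE, FALSE, TRUE)
)

# Position of each genotyped individual in `info` (NA if absent)
ind <- match(fam$sample.ID, info$sample.ID)

# Reorder `info` to follow the genotype row order, keeping unmatched rows as NA
fam_info <- cbind(fam, info[ind, -1, drop = FALSE])
rownames(fam_info) <- NULL
print(fam_info)
```

Using dplyr::left_join(fam, info, by = "sample.ID") should give the same result here; either way, the row order of the genotype data is preserved, which matters when the result is used alongside the genotype matrix.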

## 8.3 Warnings about the data

• there are 49 duplicate individuals in the data on GenomeDK, and these were removed from the data transferred to Statistics Denmark

• only dosage data is available on Statistics Denmark (i.e. imputation probabilities in the original format are not available there)

• imputation is far from perfect (due to small chips)

• imputation accuracies are not the same for 2012 / 2015, as well as for some allele frequencies; you may need to perform some QC, analyze the two cohorts separately, or at least add an indicator variable (e.g. is_2012) as a covariate, for instance when performing a GWAS
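To make the last suggestion concrete, here is a minimal base R sketch of a single-variant association test that includes the wave indicator as a covariate. All data are simulated and the variable names are illustrative; a real GWAS would loop over all variants and include further covariates such as sex, age, and PCs.

```r
set.seed(1)
n <- 500

# Simulated wave indicator: 1 for the 2012 wave, 0 for the 2015 wave
is_2012 <- rep(c(1, 0), each = n / 2)

# Simulated dosages for one variant, with 2-decimal precision
dosage <- round(runif(n, min = 0, max = 2), 2)

# Simulated phenotype with both a variant effect and a wave effect
y <- 0.3 * dosage + 0.5 * is_2012 + rnorm(n)

# Linear regression of the phenotype on the dosage, adjusting for the wave
fit <- lm(y ~ dosage + is_2012)
coefs <- summary(fit)$coefficients
print(coefs[c("dosage", "is_2012"), ])
```

For the full data, the same adjustment can presumably be made by passing the indicator (as a column of a covariate matrix) to the covar.train argument of bigstatsr::big_univLinReg().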

## 8.4 Other data available

• relatedness KING coefficients ($$> 2^{-4.5}$$) computed between pairs of individuals (cf. section 4.1)

• PCs computed on the combined data, following best practices from Privé, Luu, Blum, et al. (2020) (cf. chapter 5)

• a subset of homogeneous individuals (basically Northern Europeans) derived from PCs with two lines of code

• polygenic scores for 215 different traits and diseases, based on the UK Biobank individual-level data, computed for all iPSYCH individuals

• 900+ external polygenic scores derived by Clara from externally published GWAS summary statistics
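As an illustration of the relatedness threshold mentioned above, the sketch below filters a toy table of pairwise kinship coefficients at $$2^{-4.5}$$ (about 0.044). The table and its column names are hypothetical, not the format of the actual file.

```r
# KING kinship threshold used for the stored pairs
thr <- 2^(-4.5)

# Hypothetical table of pairwise kinship coefficients (made-up values)
king_pairs <- data.frame(
  id1     = c("a", "a", "b", "c"),
  id2     = c("b", "c", "d", "d"),
  kinship = c(0.25, 0.03, 0.0625, 0.001)
)

# Keep only the pairs considered related at this threshold
related <- king_pairs[king_pairs$kinship > thr, ]
print(related)
```

To get a set of unrelated individuals, one would then typically remove one individual from each remaining pair.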

### References

Albinana, C., Zhu, Z., Schork, A.J., Ingason, A., Aschard, H., Brikell, I., et al. (2022). Multi-PGS enhances polygenic prediction: Weighting 937 polygenic scores. medRxiv. Retrieved from https://doi.org/10.1101/2022.09.14.22279940
Bybjerg-Grauholm, J., Pedersen, C.B., Bækvad-Hansen, M., Pedersen, M.G., Adamsen, D., Hansen, C.S., et al. (2020). The iPSYCH2015 case-cohort sample: Updated directions for unravelling genetic and environmental architectures of severe mental disorders. medRxiv. Retrieved from https://doi.org/10.1101/2020.11.30.20237768
Lam, M., Awasthi, S., Watson, H.J., Goldstein, J., Panagiotaropoulou, G., Trubetskoy, V., et al. (2020). RICOPILI: rapid imputation for COnsortias PIpeLIne. Bioinformatics, 36, 930–933.
Privé, F., Aschard, H., Carmi, S., Folkersen, L., Hoggart, C., O’Reilly, P.F., & Vilhjálmsson, B.J. (2022). Portability of 245 polygenic scores when derived from the UK Biobank and applied to 9 ancestry groups from the same cohort. The American Journal of Human Genetics, 109, 12–23.
Privé, F., Luu, K., Blum, M.G., McGrath, J.J., & Vilhjálmsson, B.J. (2020). Efficient toolkit implementing best practices for principal component analysis of population genetic data. Bioinformatics, 36, 4449–4457.