Fast imputation algorithm based on local XGBoost models.
snp_fastImpute(
  Gna,
  infos.chr,
  alpha = 1e-04,
  size = 200,
  p.train = 0.8,
  n.cor = nrow(Gna),
  seed = NA,
  ncores = 1
)
Gna: An FBM.code256 (typically <bigSNP>$genotypes). These data may contain missing values.
infos.chr: Vector of integers specifying each SNP's chromosome. Typically <bigSNP>$map$chromosome.
alpha: Type-I error for testing correlations. Default is 1e-4.
size: Number of neighbor SNPs that may be included in the model imputing a given SNP. Default is 200.
p.train: Proportion of non-missing genotypes used to train the imputation model; the rest are used to assess the accuracy of this model. Default is 0.8.
n.cor: Number of rows used to estimate correlations. Default uses them all.
seed: An integer, for reproducibility. Default doesn't use seeds.
ncores: Number of cores used. Default doesn't use parallelism. You may use bigstatsr::nb_cores().
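To make the role of p.train concrete, here is a minimal base-R sketch of a hold-out split on the non-missing genotypes of a single SNP. It is an illustration under stated assumptions, not the package's implementation: the genotype vector is simulated, and a trivial "most frequent genotype" predictor stands in for the local XGBoost model.

```r
set.seed(1)  # only to make this illustration reproducible

# Simulated genotypes for one SNP (hypothetical data), with some missing values
geno <- sample(c(0, 1, 2, NA), 100, replace = TRUE,
               prob = c(0.5, 0.3, 0.1, 0.1))
p.train <- 0.8

# Split the non-missing individuals into training and validation sets
ind_obs   <- which(!is.na(geno))
n_train   <- floor(p.train * length(ind_obs))
ind_train <- sample(ind_obs, n_train)
ind_val   <- setdiff(ind_obs, ind_train)

# Stand-in "model" (an assumption): predict the most frequent training genotype
pred <- as.numeric(names(which.max(table(geno[ind_train]))))

# Estimated error rate = proportion of wrong predictions on the held-out part
err_est <- mean(geno[ind_val] != pred)
```

With p.train = 0.8, about 20% of the observed genotypes are kept aside, so the error estimate becomes noisier when a SNP has few observed values.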
An FBM with two rows:
the proportion of missing values by SNP (first row),
the estimated proportion of imputation errors by SNP (second row).
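As a hedged sketch of how these two rows can be combined for post-checking, the product of the two proportions estimates the share of genotypes that are both missing and wrongly imputed (the quantity the example's legend labels "p(NA & Error)"). The matrix below is a hypothetical stand-in for the returned FBM, and the 1% threshold is an assumed tolerance, not a package default.

```r
# Hypothetical stand-in for the 2-row result:
# row 1 = proportion of missing values per SNP,
# row 2 = estimated proportion of imputation errors per SNP
# (NA here stands in for a SNP with no error estimate)
infos <- rbind(
  pNA    = c(0.00, 0.02, 0.10, 0.30),
  pError = c(NA,   0.05, 0.02, 0.04)
)

# Estimated proportion of genotypes that are both missing and mis-imputed
p_bad <- infos[1, ] * infos[2, ]

# Flag SNPs above an assumed tolerance of 1%; which() skips NA entries
threshold <- 0.01
flagged <- which(p_bad > threshold)
```

Flagged SNPs could then be excluded from downstream analyses or inspected individually.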
if (FALSE) {

  fake <- snp_attachExtdata("example-missing.bed")
  G <- fake$genotypes
  CHR <- fake$map$chromosome
  infos <- snp_fastImpute(G, CHR)
  infos[, 1:5]

  # Still missing values
  big_counts(G, ind.col = 1:10)
  # You need to change the code of G
  # To make this permanent, you need to save (modify) the file on disk
  fake$genotypes$code256 <- CODE_IMPUTE_PRED
  fake <- snp_save(fake)
  big_counts(fake$genotypes, ind.col = 1:10)

  # Plot for post-checking
  ## Here there is no SNP with more than 1% error (estimated)
  pvals <- c(0.01, 0.005, 0.002, 0.001); colvals <- 2:5
  df <- data.frame(pNA = infos[1, ], pError = infos[2, ])

  # base R
  plot(subset(df, pNA > 0.001), pch = 20)
  idc <- lapply(seq_along(pvals), function(i) {
    curve(pvals[i] / x, from = 0, lwd = 2,
          col = colvals[i], add = TRUE)
  })
  legend("topright", legend = pvals, title = "p(NA & Error)",
         col = colvals, lty = 1, lwd = 2)

  # ggplot2
  library(ggplot2)
  Reduce(function(p, i) {
    p + stat_function(fun = function(x) pvals[i] / x, color = colvals[i])
  }, x = seq_along(pvals), init = ggplot(df, aes(pNA, pError))) +
    geom_point() +
    coord_cartesian(ylim = range(df$pError, na.rm = TRUE)) +
    theme_bigstatsr()

}