
Cross-validation for Ridge Redundancy Analysis
This function performs cross-validation to evaluate the performance of Ridge Redundancy Analysis (RRDA) models. It calculates the mean squared error (MSE) for each combination of rank and ridge penalty value across the cross-validation folds. The function also supports centering and scaling of the input matrices.
The range of lambda for cross-validation is calculated automatically, following the method of "glmnet" (Friedman et al., 2010). Given a matrix of response variables (Y; an \(n \times q\) matrix) and a matrix of explanatory variables (X; an \(n \times p\) matrix), the largest lambda is obtained as follows:
$$ \lambda_{\text{max}} = \frac{\max_{j \in \{1, 2, \dots, p\}} \sqrt{\sum_{k=1}^{q} \left( \sum_{i=1}^{n} (x_{ij}\cdot y_{ik}) \right)^2}}{n \times 10^{-3}}$$
Then, we define \(\lambda_{\text{min}}=10^{-4}\lambda_{\text{max}}\), and the sequence of \(\lambda\) values is generated over the range from \(\lambda_{\text{min}}\) to \(\lambda_{\text{max}}\).
To reduce computation, variables are subsampled before the lambda range is estimated when X or Y is large (by default, when the number of variables exceeds 1000; see sample.X and sample.Y). Alternatively, the lambda values can be specified manually.
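The lambda range described above can be sketched in base R as follows. This is a minimal illustration of the formula, not the package's internal code. It assumes that the sample-size term in the denominator is the number of rows n, that X and Y are centered, and that the grid is log-spaced as in glmnet (the text only says the sequence is generated from the range). The names `XtY`, `row_norms`, and `lambda_seq` are illustrative.

```r
# Sketch of the lambda range from the formula in the text (toy data, base R only)
set.seed(1)
n <- 50; p <- 8; q <- 4
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
Y <- scale(matrix(rnorm(n * q), n, q), center = TRUE, scale = FALSE)

# crossprod(X, Y) is the p x q matrix whose (j, k) entry is sum_i x_ij * y_ik
XtY <- crossprod(X, Y)

# Euclidean norm over the q responses for each explanatory variable j
row_norms <- sqrt(rowSums(XtY^2))

# Largest and smallest lambda as defined in the text
lambda_max <- max(row_norms) / (n * 1e-3)
lambda_min <- 1e-4 * lambda_max

# Assumption: a log-spaced, decreasing grid, as glmnet uses
num_lambda <- 50
lambda_seq <- exp(seq(log(lambda_max), log(lambda_min), length.out = num_lambda))
```

A vector built this way could, in principle, be passed through the lambda argument instead of relying on the automatic range.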
Usage
rrda.cv(
Y,
X,
maxrank = NULL,
lambda = NULL,
num.lambda = 50,
nfold = 5,
folds = NULL,
sample.X = 1000,
sample.Y = 1000,
scale.X = FALSE,
scale.Y = FALSE,
center.X = TRUE,
center.Y = TRUE,
verbose = TRUE
)
Arguments
- Y
A numeric matrix of response variables.
- X
A numeric matrix of explanatory variables.
- maxrank
A numeric vector specifying the maximum rank of the coefficient Bhat. Default is NULL, which sets it to min(15, min(dim(X), dim(Y))).
- lambda
A numeric vector of ridge penalty values. Default is NULL, in which case the lambda values are chosen automatically.
- num.lambda
The number of lambda values to generate (used only when lambda is not supplied by the user). Default is 50.
- nfold
The number of folds for cross-validation. Default is 5.
- folds
A vector specifying the folds. Default is NULL, which randomly assigns folds.
- sample.X
The number of variables sampled from X for the lambda range estimate. Default is 1000.
- sample.Y
The number of variables sampled from Y for the lambda range estimate. Default is 1000.
- scale.X
Logical indicating whether X should be scaled. If TRUE, scales X. Default is FALSE.
- scale.Y
Logical indicating whether Y should be scaled. If TRUE, scales Y. Default is FALSE.
- center.X
Logical indicating whether X should be centered. If TRUE, centers X. Default is TRUE.
- center.Y
Logical indicating whether Y should be centered. If TRUE, centers Y. Default is TRUE.
- verbose
Logical. If TRUE, the function displays information about the function call. Default is TRUE.
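As an illustration of the folds argument, the sketch below builds a balanced random fold assignment in base R. It assumes that folds is expected to be an integer vector with one fold label per observation (row of X and Y). The documentation above only says "a vector specifying the folds", so check this format against the package before relying on it.

```r
# Sketch: balanced random fold labels for n observations (base R only)
set.seed(10)
n <- 100      # number of observations (rows of X and Y)
nfold <- 5    # number of folds

# Repeat the labels 1..nfold to length n, then shuffle them
folds <- sample(rep(seq_len(nfold), length.out = n))

# Each fold receives n / nfold observations when n is a multiple of nfold
fold_sizes <- table(folds)

# Hypothetical usage, assuming the format described above:
# cv_result <- rrda.cv(Y = Y, X = X, maxrank = 5, folds = folds)
```

Fixing the assignment this way makes repeated cross-validation runs comparable, since every run then uses the same partition of the observations.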
Value
A list containing the cross-validated MSE matrix, lambda values, rank values, and the optimal lambda and rank.
Examples
if (FALSE) { # \dontrun{
library(rrda)
set.seed(10)
simdata <- rdasim1(n = 100, p = 200, q = 200, k = 3)
X <- simdata$X
Y <- simdata$Y
cv_result <- rrda.cv(Y = Y, X = X, maxrank = 5, nfold = 5)
rrda.summary(cv_result = cv_result)
## Complete Example ##
# library(future) # <- if you want to compute in parallel
# plan(multisession) # <- if you want to compute in parallel
# cv_result <- rrda.cv(Y = Y, X = X, maxrank = 5, nfold = 5) # cv
# plan(sequential) # <- to return to sequential computing
# rrda.summary(cv_result = cv_result) # cv result
p <- rrda.plot(cv_result) # cv result plot
print(p)
h <- rrda.heatmap(cv_result) # cv result heatmap
print(h)
estimated_lambda <- cv_result$opt_min$lambda # selected lambda
estimated_rank <- cv_result$opt_min$rank # selected rank
Bhat <- rrda.fit(Y = Y, X = X, nrank = estimated_rank, lambda = estimated_lambda) # fitting
Bhat_mat <- rrda.coef(Bhat) # coefficient matrix
Yhat_mat <- rrda.predict(Bhat = Bhat, X = X) # prediction
Yhat <- Yhat_mat[[1]][[1]][[1]] # predicted values
cor_Y_Yhat <- diag(cor(Y, Yhat)) # correlation between observed and predicted responses
summary(cor_Y_Yhat)
} # }