| Title: | Retention Time Prediction in Liquid Chromatography |
|---|---|
| Description: | A framework for predicting retention times in liquid chromatography. Users can train custom models for specific chromatography columns, predict retention times using existing models, or adjust existing models to account for altered experimental conditions. The provided functionalities can be accessed either via the R console or via a graphical user interface. Related work: Bonini et al. (2020) <doi:10.1021/acs.analchem.9b05765>. |
| Authors: | Tobias Schmidt [aut, cre, cph] (ORCID: <https://orcid.org/0000-0001-9681-9253>), Christian Amesoeder [aut, cph] (ORCID: <https://orcid.org/0000-0002-1668-8351>), Marian Schoen [aut, cph], Fadi Fadil [aut, cph] (ORCID: <https://orcid.org/0000-0002-9532-1901>), Katja Dettmer [aut, cph] (ORCID: <https://orcid.org/0000-0001-7337-2380>), Peter Oefner [ths, cph] (ORCID: <https://orcid.org/0000-0002-1499-3977>) |
| Maintainer: | Tobias Schmidt <[email protected]> |
| License: | GPL-3 |
| Version: | 1.3.0 |
| Built: | 2026-05-10 06:35:14 UTC |
| Source: | https://github.com/spang-lab/FastRet |
The goal of this function is to train a model that predicts RT_ADJ (retention time measured on a new, adjusted column) from RT (retention time measured on the original column) and to attach this adjustment model to an existing FastRet model.
adjust_frm(
  frm,
  new_data,
  predictors = 1:6,
  nfolds = 5,
  verbose = 1,
  seed = NULL,
  do_cv = TRUE,
  adj_type = "lm",
  add_cds = NULL
)
frm |
An object of class |
new_data |
Data frame with required columns "RT", "NAME", "SMILES"; optional "INCHIKEY".
"RT" must be the retention time measured on the adjusted column.
Each row must match at least one row in |
predictors |
Numeric vector specifying which transformations to include in the model. Available options are: 1=RT, 2=RT^2, 3=RT^3, 4=log(RT), 5=exp(RT), 6=sqrt(RT). Note that predictor 1 (RT) is always included, even if not specified explicitly. |
nfolds |
The number of folds for cross validation. |
verbose |
Show progress messages? |
seed |
An integer value to set the seed for random number generation to allow for reproducible results. |
do_cv |
A logical value indicating whether to perform cross-validation. If FALSE,
the |
adj_type |
A string representing the adjustment model type. Either "lm", "lasso", "ridge", or "gbtree". |
add_cds |
A logical value indicating whether to add chemical descriptors as predictors
to new data. Default is TRUE if |
Matching is done via "SMILES"+"INCHIKEY" if both datasets have non-missing
INCHIKEYs for all rows; otherwise via "SMILES"+"NAME". If multiple rows in
frm$df match the same row in new_data, their RT values are averaged
first, and this average is used for training the adjustment model.
Example: if frm$df equals data.frame OLD shown below and new_data equals
data.frame NEW, then the resulting, paired data.frame will look like PAIRED.
OLD <- data.frame(
NAME = c("A", "B", "B", "C" ),
SMILES = c("C", "CC", "CC", "CCC"),
RT = c(5.0, 8.0, 8.2, 9.0 )
)
NEW <- data.frame(
NAME = c("A", "B", "B", "B"),
SMILES = c("C", "CC", "CC", "CC"),
RT = c(2.5, 5.5, 5.7, 5.6)
)
PAIRED <- data.frame(
NAME = c("A", "B", "B", "B"),
SMILES = c("C", "CC", "CC", "CC"),
RT = c(5.0, 8.1, 8.1, 8.1), # Average of OLD$RT[2:3]
RT_ADJ = c(2.5, 5.5, 5.7, 5.6) # Taken from NEW
)
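The pairing rule described above can be reproduced with base R. The following is a minimal sketch and not FastRet's internal implementation; the object names old_avg, new_adj and paired are chosen for illustration only.

```r
OLD <- data.frame(
  NAME = c("A", "B", "B", "C"),
  SMILES = c("C", "CC", "CC", "CCC"),
  RT = c(5.0, 8.0, 8.2, 9.0)
)
NEW <- data.frame(
  NAME = c("A", "B", "B", "B"),
  SMILES = c("C", "CC", "CC", "CC"),
  RT = c(2.5, 5.5, 5.7, 5.6)
)

# Average duplicated measurements from the original column
old_avg <- aggregate(RT ~ NAME + SMILES, data = OLD, FUN = mean)

# RT in the new data is the retention time on the adjusted column
new_adj <- data.frame(NAME = NEW$NAME, SMILES = NEW$SMILES, RT_ADJ = NEW$RT)

# Inner join on NAME + SMILES; molecule "C" has no partner and drops out
paired <- merge(new_adj, old_avg, by = c("NAME", "SMILES"))
paired <- paired[, c("NAME", "SMILES", "RT", "RT_ADJ")]
```

Note that merge() may return the rows in a different order than NEW; the set of pairs is what matters for fitting the adjustment model.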
If do_cv is TRUE, the adjustment procedure is evaluated in
cross-validation. However, care must be taken when interpreting the CV
results, as the model performance depends on both the adjustment layer and
the original model, which was trained on the full base dataset. Therefore,
the observed CV metrics should be read as "expected performance when
predicting RTs for molecules that were part of the base-model training but
not part of the adjustment set" instead of "expected performance when
predicting RTs for completely new molecules".
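To make the role of the predictors argument and the cross-validation more concrete, here is a hedged sketch of an "lm"-type adjustment fitted on simulated paired data. The data, the rt_terms lookup vector and the fold assignment are assumptions made for illustration; the sketch cross-validates only the mapping from measured RT to RT_ADJ and does not involve a base model, so it is simpler than what adjust_frm() evaluates.

```r
set.seed(1)
n <- 30
# Simulated paired data: RT on the original column, RT_ADJ on the new one
paired <- data.frame(RT = runif(n, min = 1, max = 10))
paired$RT_ADJ <- 0.8 * paired$RT + 0.02 * paired$RT^2 + rnorm(n, sd = 0.1)

# Lookup of the six transformations; predictor 1 (RT) is always included
rt_terms <- c("RT", "I(RT^2)", "I(RT^3)", "log(RT)", "exp(RT)", "sqrt(RT)")
predictors <- c(1, 2)
fml <- reformulate(rt_terms[union(1, predictors)], response = "RT_ADJ")

# Adjustment model fitted on all pairs
fit <- lm(fml, data = paired)

# Simple k-fold cross-validation of the adjustment layer
nfolds <- 5
fold_id <- sample(rep(seq_len(nfolds), length.out = n))
rmse <- vapply(seq_len(nfolds), function(i) {
  train <- paired[fold_id != i, ]
  test <- paired[fold_id == i, ]
  pred <- predict(lm(fml, data = train), newdata = test)
  sqrt(mean((test$RT_ADJ - pred)^2))
}, numeric(1))
```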
An object of class frm, as returned by train_frm(), but with an
additional element adj containing the adjustment model. Components of adj
are:
model: The fitted adjustment model. Class depends on adj_type and is
one of lm, glmnet, or xgb.Booster.
df: The data frame used for training the adjustment model. It includes
columns "NAME", "SMILES", "RT", "RT_ADJ" and optionally "INCHIKEY", as well
as any additional predictors specified via the predictors argument.
cv: A named list containing the cross validation results (see 'Details'),
or NULL if do_cv = FALSE. When not NULL, elements are:
folds: A list of integer vectors specifying the samples in each fold.
models: A list of adjustment models trained on each fold.
stats: A list of vectors with RMSE, Rsquared, MAE, pBelow1Min per fold.
Added with v1.3.0.
preds: Retention time predictions obtained during CV by applying the
adjustment model to the hold-out data.
preds_adjonly: Removed (i.e. NULL) since v1.3.0.
args: Function arguments used for adjustment (excluding frm, new_data
and verbose). Added with v1.3.0.
version: The version of the FastRet package used to train the adjustment
model. Added with v1.3.0.
frm <- read_rp_lasso_model_rds()
new_data <- read_rpadj_xlsx()
frm_adj <- adjust_frm(frm, new_data, verbose = 0)
Clips predicted retention times by fitting a log-normal distribution to the observed training RTs and bounding predictions to the central 99.99% interval. All observed RTs must be positive to estimate the distribution. If the estimated lower bound would be negative, it is replaced by 1% of the observed minimum RT instead.
clip_predictions(yhat, y)
yhat |
Numeric vector of predicted retention times. |
y |
Numeric vector of observed retention times used to derive bounds. |
Numeric vector of clipped (bounded) predictions.
# Draw only a few samples (10) and clip based on these. The allowed range will
# be much bigger than the observed range.
set.seed(42)
y <- rlnorm(n = 1000, meanlog = 2, sdlog = 0.1)
yhat <- y
yhat[1] <- -100 # way too low to be realistic
yhat[2] <- 1000 # way too high to be realistic
yhat <- clip_predictions(yhat, y)
range(y)  # [ 6.18, 8.93]
yhat[1:2] # [ 4.96, 10.61] # Limited by theoretical bounds

# Draw more samples (1000) and clip based on these. The allowed range will
# be almost identical to the observed range.
set.seed(42)
y <- rnorm(n = 100, mean = 100, sd = 5)
yhat <- y
yhat[1] <- -100
yhat[2] <- 1000
yhat <- clip_predictions(yhat, y)
range(y)  # 83.14, 117.47
yhat[1:2] # 83.14, 117.72
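The clipping idea described for clip_predictions() can be sketched in a few lines of base R. The function name clip_sketch and the way the log-normal parameters are estimated (mean and standard deviation of the log-transformed RTs) are assumptions for illustration and may differ from the package's actual estimator; the fallback for a negative lower bound is omitted.

```r
clip_sketch <- function(yhat, y, level = 0.9999) {
  # A log-normal fit requires strictly positive observations
  stopifnot(is.numeric(y), all(y > 0))
  mu <- mean(log(y))
  sigma <- sd(log(y))
  alpha <- (1 - level) / 2
  # Bounds of the central `level` interval of the fitted distribution
  lower <- qlnorm(alpha, meanlog = mu, sdlog = sigma)
  upper <- qlnorm(1 - alpha, meanlog = mu, sdlog = sigma)
  pmin(pmax(yhat, lower), upper)
}

y <- c(5, 6, 7, 8, 9)
clipped <- clip_sketch(c(-100, 7, 1000), y)
```

Predictions inside the interval (here 7) pass through unchanged, while the two implausible values are pulled back to the lower and upper bound.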
Creates the FastRet GUI
fastret_app(port = 8080, host = "0.0.0.0", reload = FALSE, nsw = 1)
port |
The port the application should listen on |
host |
The address the application should listen on |
reload |
Whether to reload the application when the source code changes |
nsw |
The number of subworkers each worker is allowed to start. The higher this number, the faster individual tasks like model fitting can be processed. |
An object of class shiny.appobj.
x <- fastret_app()
if (interactive()) shiny::runApp(x)
Calculate Chemical Descriptors (CDs) for a list of molecules. Molecules can appear multiple times in the list.
getCDs(df, verbose = 1, nw = 1, keepdf = TRUE)
df |
dataframe with two mandatory columns: "NAME" and "SMILES" |
verbose |
0: no output, 1: progress, 2: more progress and warnings |
nw |
number of workers for parallel processing |
keepdf |
If TRUE, |
A dataframe with all input columns (if keepdf is TRUE) and chemical
descriptors as remaining columns.
cds <- getCDs(head(RP, 3), verbose = 1, nw = 1)
Creates scatter plots of measured vs. predicted retention times (RT) for a
FastRet Model (FRM). Supports plotting cross-validation (CV) predictions and
fitted predictions on the training set, as well as their adjusted variants
when the model has been adjusted via adjust_frm(). Coloring highlights
points within 1 minute of the identity line and simple outliers.
plot_frm(frm = train_frm(verbose = 1), type = "scatter.cv", trafo = "identity")
frm |
An object of class |
type |
Plot type. One of:
|
trafo |
Transformation applied for display. One of:
|
NULL, called for its side effect of plotting.
frm <- read_rp_lasso_model_rds()
plot_frm(frm, type = "scatter.cv")
Predict retention times for new data using a FastRet Model (FRM).
## S3 method for class 'frm'
predict(
  object = train_frm(),
  df = object$df,
  adjust = NULL,
  verbose = 0,
  clip = TRUE,
  impute = TRUE,
  ...
)
object |
An object of class |
df |
A data.frame with the same columns as the training data. |
adjust |
If |
verbose |
A logical value indicating whether to print progress messages. |
clip |
Clip predictions to be within RT range of training data? |
impute |
Impute missing predictor values using column means of training data? |
... |
Not used. Required to match the generic signature of |
A numeric vector with the predicted retention times.
object <- read_rp_lasso_model_rds()
df <- head(RP)
yhat <- predict(object, df)
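The impute option replaces missing predictor values with column means of the training data. A minimal base R sketch of that idea follows; the toy data frames and the descriptor names cd1 and cd2 are hypothetical, and the package's internal handling may differ.

```r
# Toy training predictors and new data with missing values
train <- data.frame(cd1 = c(1, 2, 3), cd2 = c(10, 20, 30))
newdata <- data.frame(cd1 = c(NA, 4), cd2 = c(15, NA))

# Column means are taken from the training data, not from newdata
train_means <- colMeans(train, na.rm = TRUE)

imputed <- newdata
for (col in names(train_means)) {
  missing <- is.na(imputed[[col]])
  imputed[[col]][missing] <- train_means[[col]]
}
```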
Preprocess data so they can be used as input for train_frm().
preprocess_data(
  data,
  degree_polynomial = 1,
  interaction_terms = FALSE,
  verbose = 1,
  nw = 1,
  rm_near_zero_var = TRUE,
  rm_na = TRUE,
  add_cds = TRUE,
  rm_ucs = TRUE,
  rt_terms = 1,
  mandatory = c("NAME", "RT", "SMILES")
)
data |
Dataframe with following columns:
|
degree_polynomial |
Add predictors with polynomial terms up to the specified degree, e.g. 2 means "add squares", 3 means "add squares and cubes". Set to 1 to leave descriptors unchanged. |
interaction_terms |
Add interaction terms? Polynomial terms are not included in the generation of interaction terms. |
verbose |
0: no output, 1: show progress, 2: progress and warnings. |
nw |
Number of workers to use for parallel processing. |
rm_near_zero_var |
Remove near zero variance predictors? |
rm_na |
Remove NA values? |
add_cds |
Add chemical descriptors using |
rm_ucs |
Remove unsupported columns? |
rt_terms |
Which retention-time transformations to append as extra predictors. Supply a
numeric vector referencing predefined rt_terms (1=RT, 2=I(RT^2),
3=I(RT^3), 4=log(RT), 5=exp(RT), 6=sqrt(RT)) or a character vector with the
explicit transformation terms. Character values are passed to |
mandatory |
Character vector of mandatory columns that must be present in |
If add_cds = TRUE, chemical descriptors are added using getCDs(). If
all chemical descriptors listed in CDFeatures are already present in
the input data object, getCDs() will leave them unchanged. If one or more
chemical descriptors are missing, all chemical descriptors will be
recalculated and existing ones will be overwritten.
A dataframe with the preprocessed data.
data <- head(RP, 3)
pre <- preprocess_data(data, verbose = 0)
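The effect of degree_polynomial and interaction_terms can be illustrated on a toy descriptor table. This is a sketch under assumptions: the descriptor names and the naming scheme of the generated columns (_pow2, _x_) are hypothetical and not necessarily what preprocess_data() produces. As described above, interaction terms are built from the original descriptors only, not from the polynomial terms.

```r
cds <- data.frame(cd1 = c(1, 2, 3), cd2 = c(4, 5, 6))
degree_polynomial <- 2

out <- cds
# Polynomial terms: degree 2 adds squares, degree 3 adds squares and cubes
for (d in seq_len(degree_polynomial)[-1]) {
  powered <- cds^d
  names(powered) <- paste0(names(cds), "_pow", d)
  out <- cbind(out, powered)
}

# Pairwise interaction terms of the original descriptors
for (p in combn(names(cds), 2, simplify = FALSE)) {
  out[[paste(p, collapse = "_x_")]] <- cds[[p[1]]] * cds[[p[2]]]
}
```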
Reads the Retip::HILIC dataset (CC BY 4.0) from the Retip package or, if
Retip is not installed, downloads the dataset directly from the Retip GitHub repository. Before returning the dataset,
SMILES strings are canonicalized and the original tibble object is
converted to a base R data.frame.
read_retip_hilic_data(verbose = 1)
verbose |
Verbosity. 1 for messages, 0 to suppress them. |
Attribution as required by CC BY 4.0:
Original dataset by:
Paolo Bonini, Tobias Kind, Hiroshi Tsugawa, Dinesh Kumar Barupal, and
Oliver Fiehn as part of the Retip project.
Source repository: https://github.com/oloBion/Retip
Original file: https://github.com/oloBion/Retip/raw/master/data/HILIC.RData
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
Modifications in FastRet:
converted tibble to data.frame
canonicalized SMILES using as_canonical()
renamed column 'INCHKEY' to 'INCHIKEY'
A data frame with 970 rows and the following columns:
NAME: Molecule name
INCHIKEY: InChIKey
SMILES: Canonical SMILES string
RT: Retention time in minutes
https://github.com/oloBion/Retip/raw/master/data/HILIC.RData
Retip: Retention Time Prediction for Compound Annotation in Untargeted Metabolomics
Paolo Bonini, Tobias Kind, Hiroshi Tsugawa, Dinesh Kumar Barupal, and Oliver Fiehn
Analytical Chemistry 2020 92 (11), 7515-7522 DOI: 10.1021/acs.analchem.9b05765
df <- read_retip_hilic_data(verbose = 0)
Read a LASSO model trained on the RP dataset using train_frm().
read_rp_lasso_model_rds()
A frm object.
frm <- read_rp_lasso_model_rds()
Reads retention times from a reverse phase liquid chromatography experiment,
performed at 35 °C and a flow rate of 0.3 mL/min. The data is also available
as a dataframe in the package; to access it directly, use RP.
read_rp_xlsx()
A dataframe of 442 metabolites with columns RT, SMILES and NAME.
Measured by the Institute of Functional Genomics at the University of Regensburg.
RP
x <- read_rp_xlsx()
all.equal(x, RP)
Subset of the data from read_rp_xlsx() with some slight modifications to
simulate changes in temperature and/or flow rate.
read_rpadj_xlsx()
A dataframe with 25 rows (metabolites) and 3 columns: RT, SMILES and NAME.
x <- read_rpadj_xlsx()
Retention time data from a reverse phase liquid chromatography experiment,
measured at a temperature of 35 °C and a flow rate of 0.3 mL/min. The same
data is available as an xlsx file in the package. To read it into R, use
read_rp_xlsx().
A dataframe of 442 metabolites with the following columns:
RT: Retention time
SMILES: SMILES notation of the metabolite
NAME: Name of the metabolite
RP
An object of class data.frame with 442 rows and 3 columns.
Measured by the Institute of Functional Genomics at the University of Regensburg.
read_rp_xlsx
The function adjust_frm() is used to modify existing FastRet models based
on changes in chromatographic conditions. It requires a set of molecules with
measured retention times on both the original and new column. This function
selects a sensible subset of molecules from the original dataset for
re-measurement. The selection process includes:
Generating chemical descriptors from the SMILES strings of the molecules.
These are the features used by train_frm() and adjust_frm().
Standardizing chemical descriptors to have zero mean and unit variance.
Training a Ridge Regression model with the standardized chemical descriptors as features and the retention times as the target variable.
Scaling the chemical descriptors by coefficients of the Ridge Regression model.
Clustering the entire dataset, which includes the scaled chemical descriptors and the retention times.
Returning the clustering results, which include the cluster assignments, the medoid indicators, and the raw data.
selective_measuring(
  raw_data,
  k_cluster = 25,
  verbose = 1,
  seed = NULL,
  rt_coef = "max_ridge_coef"
)
raw_data |
The raw data to be processed. Must be a dataframe with columns NAME, RT and SMILES. |
k_cluster |
The number of clusters for PAM clustering. |
verbose |
The level of verbosity. |
seed |
An optional random seed for reproducibility, set at the beginning of the function. |
rt_coef |
Which coefficient to use for scaling RT before clustering. Options are:
|
A list containing the following elements:
clustering: A data frame with columns RT, SMILES, NAME, CLUSTER and
IS_MEDOID.
clobj: The clustering object. The object returned by the clustering
function. Depends on the method parameter.
coefs: The coefficients from the Ridge Regression model.
model: The Ridge Regression model.
df: The preprocessed data.
dfz: The standardized features.
dfzb: The features scaled by the coefficients (betas) of the Ridge
Regression model.
x <- selective_measuring(RP[1:50, ], k_cluster = 5, verbose = 0)
# For the sake of a short runtime, only the first 50 rows of the RP dataset
# were used in this example. In practice, you should always use the entire
# dataset to find the optimal subset for re-measurement.
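The selection steps listed above (standardize, fit ridge regression, scale by coefficients, cluster, report medoids) can be sketched on simulated data. Everything below is an illustrative assumption rather than the package's implementation: the descriptors are random numbers, the ridge fit uses a closed-form solution with a fixed penalty lambda instead of a tuned model, and RT is scaled by the largest absolute ridge coefficient as one possible reading of rt_coef = "max_ridge_coef". PAM clustering is taken from the cluster package.

```r
set.seed(1)
n <- 40
# Simulated chemical descriptors and retention times
X <- matrix(rnorm(n * 3), nrow = n, dimnames = list(NULL, c("cd1", "cd2", "cd3")))
rt <- 5 + 2 * X[, "cd1"] - X[, "cd2"] + rnorm(n, sd = 0.2)

# Standardize descriptors to zero mean and unit variance
Z <- scale(X)

# Ridge regression, closed form with a fixed penalty (assumption)
lambda <- 1
beta <- solve(crossprod(Z) + lambda * diag(ncol(Z)), crossprod(Z, rt - mean(rt)))
beta <- as.numeric(beta)

# Scale each descriptor by its ridge coefficient
Zb <- sweep(Z, 2, beta, `*`)

# Scale standardized RT by the largest absolute coefficient (assumption)
rt_scaled <- as.numeric(scale(rt)) * max(abs(beta))

# PAM clustering on scaled descriptors plus scaled RT
cl <- cluster::pam(cbind(Zb, RT = rt_scaled), k = 5)
clustering <- data.frame(
  RT = rt,
  CLUSTER = cl$clustering,
  IS_MEDOID = seq_len(n) %in% cl$id.med
)
```

The rows flagged by IS_MEDOID are the candidates for re-measurement, one per cluster.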
Starts the FastRet GUI
start_gui(port = 8080, host = "0.0.0.0", reload = FALSE, nw = 2, nsw = 1)
port |
The port the application should listen on |
host |
The address the application should listen on |
reload |
Whether to reload the application when the source code changes |
nw |
The number of worker processes started. The first worker always listens for
user input from the GUI. The other workers are used for handling long running
tasks like model fitting or clustering. If |
nsw |
The number of subworkers each worker is allowed to start. The higher this number, the faster individual tasks like model fitting can be processed. A value of 1 means that all subprocesses will run sequentially. |
If you set nw = 3 and nsw = 4, you should have at least 16 cores: one
core for the shiny main process, three cores for the three worker processes,
and 12 cores (3 * 4) for the subworkers. For the default case, nw = 2
and nsw = 1, you only need 3 cores, as nsw = 1 means that all
subprocesses will run sequentially.
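The core count described above follows a simple rule, which can be written as a small helper. The function required_cores is hypothetical and not part of FastRet; it only encodes the arithmetic from the two examples in the text.

```r
required_cores <- function(nw, nsw) {
  main <- 1                                  # shiny main process
  workers <- nw                              # one core per worker
  subworkers <- if (nsw > 1) nw * nsw else 0 # nsw = 1 runs sequentially
  main + workers + subworkers
}

required_cores(nw = 3, nsw = 4) # 16
required_cores(nw = 2, nsw = 1) # 3
```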
A shiny app. This function returns a shiny app that can be run to interact with the model.
if (interactive()) start_gui()
Trains a new model from molecule SMILES to predict retention times (RT) using the specified method.
train_frm(
  df,
  method = "lasso",
  verbose = 1,
  nfolds = 5,
  nw = 1,
  degree_polynomial = 1,
  interaction_terms = FALSE,
  rm_near_zero_var = TRUE,
  rm_na = TRUE,
  rm_ns = FALSE,
  seed = NULL,
  do_cv = TRUE
)
df |
A dataframe with columns "NAME", "RT", "SMILES" and optionally a set of
chemical descriptors. If no chemical descriptors are provided, they are
calculated using the function |
method |
A string representing the prediction algorithm. Either "lasso", "ridge", "gbtree", "gbtreeDefault" or "gbtreeRP". Method "gbtree" is an alias for "gbtreeDefault". |
verbose |
A logical value indicating whether to print progress messages. |
nfolds |
An integer representing the number of folds for cross validation. |
nw |
An integer representing the number of workers for parallel processing. |
degree_polynomial |
An integer representing the degree of the polynomial. Polynomials up to the specified degree are included in the model. |
interaction_terms |
A logical value indicating whether to include interaction terms in the model. |
rm_near_zero_var |
A logical value indicating whether to remove near zero variance predictors. |
rm_na |
A logical value indicating whether to remove NA values before training.
Highly recommended to avoid issues during model fitting. Setting this to
FALSE with |
rm_ns |
A logical value indicating whether to remove chemical descriptors that were considered as not suitable for linear regression based on a previous analysis of an independent dataset. Currently not used. |
seed |
An integer value to set the seed for random number generation to allow for reproducible results. |
do_cv |
A logical value indicating whether to perform cross-validation. If FALSE,
the |
A 'FastRet Model', i.e., an object of class frm. Components are:
model: The fitted base model. This can be an object of class glmnet
(for Lasso or Ridge regression) or xgb.Booster (for GBTree models).
df: The data frame used for training the model. The data frame contains
all user-provided columns (including mandatory columns RT, SMILES and NAME)
as well as the calculated chemical descriptors (but no interaction terms or
polynomial features, as these can be recreated within a few milliseconds).
cv: A named list containing the cross validation results, or NULL if
do_cv = FALSE. When not NULL, elements are:
folds: A list of integer vectors specifying the samples in each fold.
models: A list of models trained on each fold.
stats: A list of vectors with RMSE, Rsquared, MAE, pBelow1Min per fold.
preds: Retention time predictions obtained in CV as numeric vector.
seed: The seed used for random number generation.
version: The version of the FastRet package used to train the model.
args: The value of function arguments besides df as named list.
m <- train_frm(df = RP[1:40, ], method = "lasso", nfolds = 2, verbose = 0)
# For the sake of a short runtime, only the first 40 rows of the RP dataset
# are used in this example. In practice, you should always use the entire
# training dataset for model training.
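The per-fold statistics named in the cv component (RMSE, Rsquared, MAE, pBelow1Min) can be computed from measured and predicted retention times as sketched below. The definitions are assumptions inferred from the metric names: Rsquared is taken as the squared Pearson correlation and pBelow1Min as the share of predictions with an absolute error below one minute. FastRet's exact definitions may differ.

```r
# Toy measured and predicted retention times in minutes
measured <- c(2.0, 4.0, 6.0, 8.0)
predicted <- c(2.5, 3.0, 6.2, 9.5)
err <- predicted - measured

stats <- c(
  RMSE = sqrt(mean(err^2)),
  Rsquared = cor(measured, predicted)^2,
  MAE = mean(abs(err)),
  pBelow1Min = mean(abs(err) < 1)
)
```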