Title: | Retention Time Prediction in Liquid Chromatography |
---|---|
Description: | A framework for predicting retention times in liquid chromatography. Users can train custom models for specific chromatography columns, predict retention times using existing models, or adjust existing models to account for altered experimental conditions. The provided functionalities can be accessed either via the R console or via a graphical user interface. Related work: Bonini et al. (2020) <doi:10.1021/acs.analchem.9b05765>. |
Authors: | Christian Amesoeder [aut, cph] |
Maintainer: | Tobias Schmidt <[email protected]> |
License: | GPL-3 |
Version: | 1.1.4 |
Built: | 2025-02-11 05:32:48 UTC |
Source: | https://github.com/spang-lab/fastret |
The goal of this function is to train a model that predicts RT_ADJ (retention time measured on a new, adjusted column) from RT (retention time measured on the original column) and to attach this "adjustmodel" to an existing FastRet model.
adjust_frm(
  frm = train_frm(),
  new_data = read_rpadj_xlsx(),
  predictors = 1:6,
  nfolds = 5,
  verbose = 1
)
frm |
An object of class |
new_data |
Dataframe with columns "RT", "NAME", "SMILES" and optionally a set of chemical descriptors. |
predictors |
Numeric vector specifying which predictors to include in the model in addition to RT. Available options are: 1=RT, 2=RT^2, 3=RT^3, 4=log(RT), 5=exp(RT), 6=sqrt(RT). |
nfolds |
An integer representing the number of folds for cross validation. |
verbose |
A logical value indicating whether to print progress messages. |
An object of class frm, which is a list with the following elements:
model: A list containing details about the original model.
df: The data frame used for training the model.
cv: A list containing the cross validation results.
seed: The seed used for random number generation.
version: The version of the FastRet package used to train the model.
adj: A list containing details about the adjusted model.
frm <- read_rp_lasso_model_rds()
new_data <- read_rpadj_xlsx()
frmAdjusted <- adjust_frm(frm, new_data, verbose = 0)
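The predictors argument of adjust_frm() can be pictured as selecting transformed copies of RT that feed a regression of RT_ADJ on RT. The following is a minimal sketch of that idea, not FastRet's implementation: it assumes ordinary least squares via base R's lm() as a stand-in for whatever fitting routine adjust_frm() uses internally, the data are made up, and make_rt_predictors() is a hypothetical helper.

```r
# Hypothetical helper (not part of FastRet): build the six candidate
# predictors listed for the `predictors` argument and keep the chosen ones.
make_rt_predictors <- function(rt, predictors = 1:6) {
  all_terms <- data.frame(
    RT = rt,            # 1 = RT
    RT_SQ = rt^2,       # 2 = RT^2
    RT_CB = rt^3,       # 3 = RT^3
    RT_LOG = log(rt),   # 4 = log(RT)
    RT_EXP = exp(rt),   # 5 = exp(RT)
    RT_SQRT = sqrt(rt)  # 6 = sqrt(RT)
  )
  all_terms[, predictors, drop = FALSE]
}

# Made-up data: 25 retention times on the original column and the
# corresponding retention times on the adjusted column.
set.seed(1)
rt <- runif(25, min = 1, max = 10)
rt_adj <- 0.9 * rt + 0.02 * rt^2 + rnorm(25, sd = 0.05)

# Fit RT_ADJ as a function of the selected RT terms (assumption: plain lm()).
train <- data.frame(RT_ADJ = rt_adj, make_rt_predictors(rt, predictors = c(1, 2)))
fit <- lm(RT_ADJ ~ ., data = train)

# Map new retention times from the original to the adjusted column.
pred <- predict(fit, newdata = make_rt_predictors(c(2, 5), predictors = c(1, 2)))
```

Restricting predictors to a small subset, as done here with c(1, 2), keeps the mapping simple when only a few re-measured molecules are available.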
Creates the FastRet GUI
fastret_app(port = 8080, host = "0.0.0.0", reload = FALSE, nsw = 1)
port |
The port the application should listen on |
host |
The address the application should listen on |
reload |
Whether to reload the application when the source code changes |
nsw |
The number of subworkers each worker is allowed to start. The higher this number, the faster individual tasks like model fitting can be processed. |
A shiny app. This function returns a shiny app that can be run to interact with the model.
An object of class shiny.appobj.
x <- fastret_app()
if (interactive()) shiny::runApp(x)
Calculate Chemical Descriptors for a list of molecules. Molecules can appear multiple times in the list.
getCDs(df, verbose = 1, nw = 1)
df |
dataframe with two mandatory columns: "NAME" and "SMILES" |
verbose |
0: no output, 1: progress, 2: more progress and warnings |
nw |
number of workers for parallel processing |
A dataframe with the chemical descriptor values appended as columns to the input dataframe.
cds <- getCDs(head(RP, 3), verbose = 1, nw = 1)
Predict retention times for new data using a FastRet Model (FRM).
## S3 method for class 'frm'
predict(object = train_frm(), df = object$df, adjust = NULL, verbose = 0, ...)
object |
An object of class |
df |
A data.frame with the same columns as the training data. |
adjust |
If |
verbose |
A logical value indicating whether to print progress messages. |
... |
Not used. Required to match the generic signature of |
A numeric vector with the predicted retention times.
frm <- read_rp_lasso_model_rds()
newdata <- head(RP)
yhat <- predict(frm, newdata)
Preprocess data so they can be used as input for train_frm().
preprocess_data(
  data,
  degree_polynomial = 1,
  interaction_terms = FALSE,
  verbose = 1,
  nw = 1
)
data |
dataframe with columns RT, NAME, SMILES |
degree_polynomial |
The degree of the polynomial terms to add (e.g. if 3, quadratic and cubic terms are added) |
interaction_terms |
If TRUE, all interaction terms are added to the dataset |
verbose |
0 == no output, 1 == show progress, 2 == show progress and warnings |
nw |
number of workers to use for parallel processing |
A dataframe with the preprocessed data
data <- head(RP, 3) # Only use first three rows to speed up example runtime
pre <- preprocess_data(data, verbose = 0)
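To make the effect of degree_polynomial and interaction_terms concrete, the sketch below expands a small numeric matrix in base R. It is an illustration under assumptions, not the code used by preprocess_data(): add_terms() is a hypothetical helper, "interaction terms" is assumed to mean pairwise products of the original columns, and the column naming is made up.

```r
# Hypothetical helper (not part of FastRet): append polynomial terms up to
# `degree_polynomial` and, optionally, all pairwise products of the columns.
add_terms <- function(X, degree_polynomial = 1, interaction_terms = FALSE) {
  out <- X
  # Degrees 2, 3, ...; empty when degree_polynomial is 1.
  for (d in seq_len(degree_polynomial)[-1]) {
    P <- X^d
    colnames(P) <- paste0(colnames(X), "^", d)
    out <- cbind(out, P)
  }
  if (interaction_terms && ncol(X) >= 2) {
    pairs <- combn(colnames(X), 2)  # one column per pair of descriptors
    I <- do.call(cbind, lapply(seq_len(ncol(pairs)), function(j) {
      X[, pairs[1, j]] * X[, pairs[2, j]]
    }))
    colnames(I) <- paste(pairs[1, ], pairs[2, ], sep = ":")
    out <- cbind(out, I)
  }
  out
}

# Two made-up descriptors for three molecules.
X <- cbind(a = c(1, 2, 3), b = c(4, 5, 6))
expanded <- add_terms(X, degree_polynomial = 3, interaction_terms = TRUE)
colnames(expanded)
# "a" "b" "a^2" "b^2" "a^3" "b^3" "a:b"
```

The number of columns grows quickly with the number of descriptors once interaction terms are enabled, which is one reason both options are off by default.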
Downloads and reads the HILIC dataset from the Retip package. The dataset is downloaded from https://github.com/oloBion/Retip/raw/master/data/HILIC.RData, saved to a temporary file and then read and returned.
read_retip_hilic_data(verbose = 1)
verbose |
Verbosity level. 1 == print progress messages, 0 == no progress messages. |
A data frame containing the HILIC dataset.
Bonini, P., Kind, T., Tsugawa, H., Barupal, D. K., and Fiehn, O. (2020). Retip: Retention Time Prediction for Compound Annotation in Untargeted Metabolomics. Analytical Chemistry, 92(11), 7515-7522. DOI: 10.1021/acs.analchem.9b05765
df <- read_retip_hilic_data(verbose = 0)
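The "save to a temporary file, then read and return" pattern described for read_retip_hilic_data() can be sketched in base R as follows. This is an assumption-laden illustration rather than the package's code: read_first_object() is a hypothetical helper, and a small made-up data frame written with save() stands in for the download so that the example runs without network access.

```r
# Hypothetical helper (not part of FastRet): load an .RData file into a
# private environment and return the first object stored in it.
read_first_object <- function(path) {
  env <- new.env()
  object_names <- load(path, envir = env)  # load() returns the object names
  get(object_names[1], envir = env)
}

# Stand-in for the download step. With network access, one would instead
# fetch the file with download.file(url, tmp, mode = "wb").
tmp <- tempfile(fileext = ".RData")
demo <- data.frame(NAME = c("A", "B"), RT = c(1.2, 3.4))
save(demo, file = tmp)

df <- read_first_object(tmp)
unlink(tmp)  # remove the temporary file again
```

Loading into a private environment avoids overwriting objects of the same name in the caller's workspace.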
Read a LASSO model trained on the RP dataset using train_frm().
read_rp_lasso_model_rds()
A frm object.
frm <- read_rp_lasso_model_rds()
Read retention time data from a reverse phase liquid chromatography measurement performed at a temperature of 35 degrees and a flow rate of 0.3 ml/min. The data is also available as a dataframe in the package. To use it directly in R, just enter RP.
read_rp_xlsx()
A dataframe of 442 metabolites with columns RT, SMILES and NAME.
Measured by functional genomics lab at the University of Regensburg.
RP
x <- read_rp_xlsx()
all.equal(x, RP)
Subset of the data from read_rp_xlsx() with some slight modifications to simulate changes in temperature and/or flowrate.
read_rpadj_xlsx()
A dataframe with 25 rows (metabolites) and 3 columns: RT, SMILES and NAME.
x <- read_rpadj_xlsx()
Retention time data from a reverse phase liquid chromatography measurement performed at a temperature of 35 degrees and a flow rate of 0.3 ml/min. The same data is available as an xlsx file in the package. To read it into R, use read_rp_xlsx().
RP
A dataframe of 442 metabolites with the following columns:
RT: Retention time
SMILES: SMILES notation of the metabolite
NAME: Name of the metabolite
Measured by functional genomics lab at the University of Regensburg.
read_rp_xlsx
The function adjust_frm() is used to modify existing FastRet models based on changes in chromatographic conditions. It requires a set of molecules with measured retention times on both the original and the new column. selective_measuring() selects a sensible subset of molecules from the original dataset for re-measurement. The selection process includes:
1. Generating chemical descriptors from the SMILES strings of the molecules. These are the features used by train_frm() and adjust_frm().
2. Standardizing chemical descriptors to have zero mean and unit variance.
3. Training a Ridge Regression model with the standardized chemical descriptors as features and the retention times as the target variable.
4. Scaling the chemical descriptors by the coefficients of the Ridge Regression model.
5. Applying PAM clustering on the entire dataset, which includes the scaled chemical descriptors and the retention times.
6. Returning the clustering results, which include the cluster assignments, the medoid indicators, and the raw data.
selective_measuring(raw_data, k_cluster = 25, verbose = 1)
raw_data |
The raw data to be processed. Must be a dataframe with columns NAME, RT and SMILES. |
k_cluster |
The number of clusters for PAM clustering. |
verbose |
The level of verbosity. |
A list containing the following elements:
clustering: a data frame with raw data, cluster assignments, and medoid indicators
clobj: the PAM clustering object
coefs: the coefficients from the Ridge Regression model
model: the Ridge Regression model
df: the preprocessed data
dfz: the standardized features
dfzb: the features scaled by coefficients of the Ridge Regression model
x <- selective_measuring(RP[1:50, ], k_cluster = 5, verbose = 0)
# For the sake of a short runtime, only the first 50 rows of the RP dataset
# were used in this example. In practice, you should always use the entire
# dataset to find the optimal subset for re-measurement.
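The selection steps described for selective_measuring() (standardize, fit a ridge regression, scale by the coefficients, cluster with PAM) can be sketched on synthetic data as follows. This is a simplified illustration, not the package's implementation: the descriptors are random numbers instead of values derived from SMILES, the ridge fit uses the closed-form solution with an arbitrarily chosen penalty lambda instead of a tuned model, and pam() comes from the recommended cluster package, which is assumed to be installed.

```r
library(cluster)  # provides pam(); assumed to be installed

# Made-up stand-in for chemical descriptors and retention times.
set.seed(42)
n <- 60
p <- 4
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("cd", 1:p)))
rt <- 5 + 2 * X[, 1] - X[, 2] + rnorm(n, sd = 0.1)

# Standardize the descriptors to zero mean and unit variance.
Z <- scale(X)

# Closed-form ridge regression on the standardized descriptors
# (assumption: fixed penalty, no cross validation).
lambda <- 1
beta <- solve(crossprod(Z) + lambda * diag(p), crossprod(Z, rt - mean(rt)))

# Scale each descriptor by its ridge coefficient, so that descriptors
# with little influence on RT contribute little to the distances.
Zb <- sweep(Z, 2, as.numeric(beta), `*`)

# PAM clustering on the scaled descriptors together with the retention times.
cl <- pam(cbind(Zb, RT = rt), k = 5)

# The medoids are the candidate molecules for re-measurement.
is_medoid <- seq_len(n) %in% cl$id.med
```

Because medoids are actual observations rather than averaged cluster centers, each cluster is represented by a real molecule that can be measured on the new column.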
Starts the FastRet GUI
start_gui(port = 8080, host = "0.0.0.0", reload = FALSE, nw = 2, nsw = 1)
port |
The port the application should listen on |
host |
The address the application should listen on |
reload |
Whether to reload the application when the source code changes |
nw |
The number of worker processes started. The first worker always listens for user input from the GUI. The other workers are used for handling long running tasks like model fitting or clustering. If |
nsw |
The number of subworkers each worker is allowed to start. The higher this number, the faster individual tasks like model fitting can be processed. A value of 1 means that all subprocesses will run sequentially. |
If you set nw = 3 and nsw = 4, you should have at least 16 cores: one core for the shiny main process, three cores for the three worker processes and 12 cores (3 * 4) for the subworkers. For the default case, nw = 2 and nsw = 1, you only need 3 cores, as nsw = 1 means that all subprocesses will run sequentially.
A shiny app. This function returns a shiny app that can be run to interact with the model.
if (interactive()) start_gui()
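The core-count arithmetic given for nw and nsw can be written down as a small function. required_cores() is a hypothetical helper for illustration only and is not part of FastRet; it simply encodes the rule stated in the details of start_gui().

```r
# Hypothetical helper (not part of FastRet): number of cores suggested by
# the rule above. One core for the shiny main process, one per worker, and
# nw * nsw for subworkers unless nsw = 1 (sequential subprocesses).
required_cores <- function(nw = 2, nsw = 1) {
  subworker_cores <- if (nsw > 1) nw * nsw else 0
  1 + nw + subworker_cores
}

required_cores(nw = 3, nsw = 4)  # 16
required_cores()                 # 3 (default case)
```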
Trains a new model from molecule SMILES to predict retention times (RT) using the specified method.
train_frm(
  df,
  method = "lasso",
  verbose = 1,
  nfolds = 5,
  nw = 1,
  degree_polynomial = 1,
  interaction_terms = FALSE,
  rm_near_zero_var = TRUE,
  rm_na = TRUE,
  rm_ns = FALSE,
  seed = NULL
)
df |
A dataframe with columns "NAME", "RT", "SMILES" and optionally a set of chemical descriptors. If no chemical descriptors are provided, they are calculated using the function |
method |
A string representing the prediction algorithm. Either "lasso", "ridge" or "gbtree". |
verbose |
A logical value indicating whether to print progress messages. |
nfolds |
An integer representing the number of folds for cross validation. |
nw |
An integer representing the number of workers for parallel processing. |
degree_polynomial |
An integer representing the degree of the polynomial. Polynomials up to the specified degree are included in the model. |
interaction_terms |
A logical value indicating whether to include interaction terms in the model. |
rm_near_zero_var |
A logical value indicating whether to remove near zero variance predictors. Setting this to TRUE can cause the CV results to be overoptimistic, as the variance filtering is done on the whole dataset, i.e. information from the test folds is used for feature selection. |
rm_na |
A logical value indicating whether to remove NA values. Setting this to TRUE can cause the CV results to be overoptimistic, as the predictor filtering is done on the whole dataset, i.e. information from the test folds is used for feature selection. |
rm_ns |
A logical value indicating whether to remove chemical descriptors that were considered as not suitable for linear regression based on previous analysis of an independent dataset. |
seed |
An integer value to set the seed for random number generation to allow for reproducible results. |
Setting rm_near_zero_var and/or rm_na to TRUE can cause the CV results to be overoptimistic, as the predictor filtering is done on the whole dataset, i.e. information from the test folds is used for feature selection.
A trained FastRet model.
system.time(m <- train_frm(RP[1:80, ], method = "lasso", nfolds = 2, nw = 1, verbose = 0))
# For the sake of a short runtime, only the first 80 rows of the RP dataset
# are used in this example. In practice, you should always use the entire
# training dataset for model training.
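The note on overoptimistic CV results refers to filtering predictors on the whole dataset before splitting into folds. The sketch below shows the leakage-free alternative, in which the filter is recomputed on each training fold. It is an illustration on made-up data, not a description of train_frm(): lm() stands in for the actual learners, and keep_varying() is a hypothetical helper that uses a plain variance threshold as a simplification of a near-zero-variance filter.

```r
# Made-up data: two informative/noisy predictors and one nearly constant one.
set.seed(1)
n <- 40
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = c(rep(0, n - 1), 1))
y <- 1 + 2 * X$x1 + rnorm(n, sd = 0.2)
folds <- sample(rep(1:4, length.out = n))

# Hypothetical helper: names of the columns whose variance exceeds a
# threshold (simplified stand-in for a near-zero-variance filter).
keep_varying <- function(df, min_var = 0.05) {
  names(df)[sapply(df, var) > min_var]
}

# Cross validation with the filter applied inside each fold, so that the
# held-out rows never influence which predictors are kept.
cv_rmse <- sapply(1:4, function(k) {
  train <- folds != k
  cols <- keep_varying(X[train, , drop = FALSE])
  fit <- lm(y ~ ., data = data.frame(y = y[train], X[train, cols, drop = FALSE]))
  pred <- predict(fit, newdata = X[!train, cols, drop = FALSE])
  sqrt(mean((y[!train] - pred)^2))
})
```

Comparing such per-fold estimates with the ones reported after whole-dataset filtering gives a rough sense of how much optimism the filtering introduces.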