API Improvements:
getCDs():
plot_frm():
preprocess_data():
add_cds to control whether chemical descriptors should be
added to the input data using getCDs().
rm_ucs to control whether unsupported columns (i.e.
columns that are neither mandatory nor optional) should be removed from the
input data.
rt_terms to control whether transformations of the RT column
(square, cube, log, exp, sqrt) should be added to the input data.
CDFeatures are allowed as
optional columns.
train_frm():
do_cv to control whether cross-validation should be
performed for performance estimation. Default is TRUE.
method now accepts two values for training models with xgbtree
base: "gbtreeDefault" (train xgboost with default params) and "gbtreeRP"
(train xgboost with parameters optimized for the RP dataset). The old value
"gbtree" still works and is now an alias for "gbtreeDefault".
frm objects are fully
specified now).
clip_predictions(). Of course, the clipping is
always based on the RT range of training folds, not the whole original
training data.
predict.frm():
degree_polynomial>1
and/or interaction_terms=TRUE, unless the transformations were manually
applied to the new data beforehand.
clip to allow clipping of predictions to be within
the RT range of the training data. Works for both adjusted and unadjusted
models.
clip=FALSE. See clip_predictions()
for details.
impute=FALSE in predict.frm().
selective_measuring():
rt_coef, allowing the user to control the influence of RT on
the clustering. A value of 0 means that RT is ignored, a value of
"max_ridge_coefficient" means that RT has the same weight as the most
important chemical descriptor and a value of 1 means no scaling at all
(except standardization to z-scores, which is applied to the whole
dataset before the ridge regression is trained).
adjust_frm():
seed to allow reproducible results.
do_cv to control whether cross-validation should be
performed for performance estimation. Default is TRUE.
adj_type to control which model should be trained for
adjustment: supported options are "lm", "lasso", "ridge", or "gbtree".
Previously, only "lm" was supported. To stay backwards compatible, the default
is "lm".
add_cds to control whether chemical descriptors should be
added to the input data using getCDs(). Only recommended for adj_type other
than "lm".
clip_predictions(). Of course, the clipping is
always based on the RT range of training folds, not the whole original
training data.
print.frm():
clip_predictions():
train_frm(), predict.frm() and
adjust_frm().
get_predictors():
base and adjust to control whether predictors for the
base model, the adjustment model or both should be returned.
Bugfixes:
preprocess_data() are now generated
correctly as the product of the involved features instead of a division. This
follows common practice in regression modeling and avoids division by zero
issues. Passing older models, trained with division-based interaction terms,
to downstream functions like predict.frm() or adjust_frm() will now lead
to an error. (This is not a breaking change, as predict.frm() and friends
have in fact never been able to handle such models).
plot_frm() with type "scatter.cv.adj" or "scatter.train.adj" now correctly
shows retention times from the new data (used for model adjustment) as x-axis
values instead of the original training retention times.
catf() now only emits escape codes (i.e. colored output) if the output is
directed to a terminal. If the output is redirected to a file or a pipe, no
escape codes are emitted anymore. Since catf() is used throughout the
package for logging, this fixes the output for the whole package.
Internal Improvements:
adjust_frm()
fit_gbtree()
fit_glmnet()
get_param_grid()
get_predictors()
getCDs()
plot_frm()
predict_frm()
preprocess_data()
selective_measuring()
train_frm()
validate_inputdata()
caret dependency by adding custom implementations for:
createFolds()
nearZeroVar()
adjust_frm() into a private function
merge_dfs().
fit_glmnet(), fit_lasso() and fit_ridge() with a single
function fit_glmnet(), which takes the method ("lasso" or "ridge") as
parameter. Instead of a dataframe df that has to contain only predictors
plus the RT column (as response), the function now takes a matrix of
predictors X and a vector of responses y. This makes the function more
flexible and easier to test.
fit_gbtree_grid() with a much simpler function
find_params_best(). Instead of allowing the specification of every grid
parameter, the new function accepts a keyword searchspace for
specifying predefined grids to choose from.
fit_gbtree by exposing lots of hardcoded internal xgboost
parameters as function parameters with sensible defaults. In particular, the
user can now set xpar to "default", "rpopt" or a predefined grid-size to
train the model with different hyperparameter settings. Furthermore, the
function is now written in a way that works with both xgboost version 1.7.9.1
and the new version 3.1.2.1, published on 2025/12/03 (yes, version 2.x was
skipped completely).
get_param_grid() for returning predefined
hyperparameter grids for xgboost model training based on keywords like
"tiny", "small" or "large".
benchmark_find_params() to benchmark runtime of
find_params_best() for different numbers of cores and/or threads. As it
turns out, choosing a higher number of cores is usually more efficient (at
the cost of worse progress output).
named(), as_str(), is_valid_smiles() and
as_canonical()
selective_measuring() by aligning glmnet coefficients to columns by
name (more stable) and by including RT, scaled by max(abs(coefs)), in PAM
clustering.
libwebp-dev as dependency to Dockerfile.
Measurements_v8.xlsx to inst/extdata/. The new
list contains corrections to the old RP dataset plus 1660 new measurements
from a total of 18 different chromatographic environments.
seed parameter to selective_measuring() function for reproducible
clustering results
train_frm() function
digest and shinybusy dependencies
inst/mockdata/
getCDsFor1Molecule(), get_cache_dir(), ram_cache (these
were exported, but declared as internal)
parLapply2
Improved read_retip_hilic_data():
the dataset is now only downloaded from GitHub if the package is not installed.
If it is installed, the dataset is loaded directly.
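The lookup order described for read_retip_hilic_data() can be sketched roughly as follows. This is only an illustration under stated assumptions, not FastRet's implementation: the helper name locate_dataset(), its arguments, the placeholder URL and the temporary-directory fallback are all made up for the example.

```r
# Hypothetical helper (not part of FastRet): return the path to a data file,
# preferring the copy shipped with an installed package.
locate_dataset <- function(pkg, relpath, url,
                           download = utils::download.file) {
  # system.file() returns "" if the package or the file is not installed
  installed_path <- system.file(relpath, package = pkg)
  if (nzchar(installed_path)) {
    return(installed_path)
  }
  # Fallback: fetch the file once into the session's temporary directory
  dest <- file.path(tempdir(), basename(relpath))
  if (!file.exists(dest)) {
    download(url, dest, mode = "wb", quiet = TRUE)
  }
  dest
}

# An installed package is served from disk, so no download takes place
# (the URL is a placeholder and is never contacted here)
locate_dataset("stats", "DESCRIPTION", url = "https://example.org/unused")
```

Passing the download function as an argument keeps the sketch testable without network access.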
Internal Changes:
TODOS.md
util.R to data.R
misc/datasets
load_all() and document() to util.R
xlsx and readxl packages with openxlsx
Added a cache cleanup handler that gets registered via
reg.finalizer() upon package loading to ensure that the cache
directory is removed if it doesn't contain any files that should
persist between R sessions.
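A minimal sketch of such a cleanup handler, using only base R. The directory name, the helper cleanup_cache() and the rule "any remaining file counts as persistent" are assumptions for illustration; the package's actual criteria for persistent files may differ.

```r
# Illustrative cache directory; the real location used by the package may differ
cache_dir <- file.path(tempdir(), "demo_cache")
dir.create(cache_dir, showWarnings = FALSE)

# Remove the directory only if it holds no files
# (assumption: any remaining file counts as "should persist")
cleanup_cache <- function(dir) {
  files <- list.files(dir, all.files = TRUE, recursive = TRUE)
  if (length(files) == 0) unlink(dir, recursive = TRUE)
  invisible(!dir.exists(dir))
}

# Finalizers attach to environments; onexit = TRUE also runs the handler
# when the R session ends, not only when the object is garbage collected
handle <- new.env()
reg.finalizer(handle, function(e) cleanup_cache(cache_dir), onexit = TRUE)
```

In a package, the registration would typically happen in a load hook so that the handler exists for the whole session.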
Added an article about installation details incl. a troubleshooting section
Improved function docs
Improved examples by removing donttest blocks
Improved examples & tests by using smaller example datasets to reduce runtime
Moved patch.R from the R folder to misc/scripts, which is
excluded from the package build using .Rbuildignore. The file
is conditionally sourced by the private function
start_gui_in_devmode() if available, allowing its use during
development without including it in the package.
Added \value tags to the mentioned .Rd files describing the
functions' return values.
Added Bonini et al. (2020) doi:10.1021/acs.analchem.9b05765
as reference to the description part of the DESCRIPTION file,
listing it as Related work. This reference is used in the
documentation for read_retip_hilic_data() and ram_cache. No
additional references are used in the package documentation.
Added Fadi Fadil as a contributor. Fadi measured the example datasets shipped with FastRet.
Added ORCID IDs for contributors as described in CRAN's checklist for submissions.
read_rp_xlsx() and read_rpadj_xlsx()
into donttest to prevent note "Examples with CPU time > 2.5
times elapsed time: ...". By now I'm pretty sure the culprit is
the xlsx package, which uses a Java process for reading the
file. Maybe we should switch to openxlsx or readxl in the
future.
preprocess_data() to prevent note
"Examples with CPU time > 2.5 times elapsed time:
preprocess_data (CPU=2.772, elapsed=0.788)".
getCDs()
Added examples to start_gui(), fastret_app(), getCDsFor1Molecule(),
analyzeCDNames(), check_lm_suitability(), plot_lm_suitability(),
extendedTask(), selective_measuring(), train_frm(), adjust_frm(),
get_predictors()
Improved lots of existing examples
Added additional logging messages at various places
Submitted to CRAN, but rejected because the following examples caused at least one of the following notes on the CRAN testing machines: (1) "CPU time > 5s", (2) "CPU time > 2.5 times elapsed time". In this context, "CPU time" is calculated as the sum of the measured "user" and "system" times.
| function             | user  | system | elapsed | ratio |
| -------------------- | ----- | ------ | ------- | ----- |
| check_lm_suitability | 5.667 | 0.248  | 2.211   | 2.675 |
| predict.frm          | 2.477 | 0.112  | 0.763   | 3.393 |
| getCDs               | 2.745 | 0.089  | 0.961   | 2.949 |
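The ratio column follows from the definition given above: CPU time is user plus system time, and the ratio is CPU time divided by elapsed time. A small sketch that recomputes it from the reported timings and applies the two thresholds named in the notes (the data frame and its column names are only illustrative):

```r
# Timings as reported in the table above (seconds)
timings <- data.frame(
  fn      = c("check_lm_suitability", "predict.frm", "getCDs"),
  user    = c(5.667, 2.477, 2.745),
  system  = c(0.248, 0.112, 0.089),
  elapsed = c(2.211, 0.763, 0.961)
)

# CPU time = user + system; ratio = CPU time / elapsed time
timings$cpu   <- timings$user + timings$system
timings$ratio <- round(timings$cpu / timings$elapsed, 3)

# Thresholds from the two notes: (1) CPU time > 5s, (2) ratio > 2.5
timings$note_cpu   <- timings$cpu > 5
timings$note_ratio <- timings$ratio > 2.5

print(timings)
```

With these numbers, all three examples exceed the ratio threshold, and only check_lm_suitability also exceeds the 5 second CPU limit.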
Completely refactored source code, e.g.:
Added a test suite covering all important functions
The UI now uses Extended Tasks for background processing, allowing GUI usage by multiple users at the same time
The clustering now uses Partitioning Around Medoids (PAM) instead of k-means, which is faster and much better suited for our use case
The training of the Lasso and/or XGBoost models is no longer
done using caret but using glmnet and xgboost directly.
The new implementation is much faster and allows for full
control over the number of workers started.
Function getCDs() now caches its results on disk, making the
retrieval of chemical descriptors much faster
The GUI now has a console element, showing the progress of the background tasks like clustering and model training
The GUI has a cleaner interface, because lots of the options are now hidden in the "Advanced" tab by default and are only displayed upon user request
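To illustrate the switch to PAM mentioned in the list above: unlike k-means centroids, PAM medoids are actual rows of the input, so each cluster is represented by a real observation. The sketch below uses pam() from the cluster package on random toy data; the matrix, its row and column names and the choice of k are assumptions for the example and not the package's clustering code.

```r
library(cluster)  # provides pam(); a recommended package in most R installations

set.seed(1)
# Toy stand-in for a standardized descriptor matrix: 30 molecules x 4 features
X <- scale(matrix(rnorm(30 * 4), nrow = 30,
                  dimnames = list(paste0("mol", 1:30), paste0("cd", 1:4))))

k <- 5
fit <- pam(X, k = k)

# Medoids are rows of X, i.e. real molecules rather than averaged centroids
selected <- rownames(X)[fit$id.med]
selected

# Cluster sizes
table(fit$clustering)
```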
Initial version.
Copy of commit cd243aa82a56df405df8060b84535633cf06b692 of Christian
Amesöder's repository.
(Christian wrote this initial version of FastRet as part of his master's thesis
at the Institute of Functional Genomics, University of Regensburg).