REED : Rapid & Easy Evaluation of Datasets by DNAlytics

We want you to be happy to pay us!

Predictive modeling is a complex science. But what is more frustrating than obtaining poor results, or none at all, after investing time and money in a data mining project? At DNAlytics, we understand this well. For our part, we regret having to announce such a poor project outcome to our customers. We definitely don't like it. That is why we now offer a very fast evaluation of the potential value of your data, free of charge!

What does REED do?

REED (Rapid and Easy Evaluation of Datasets) is a web application that automatically processes a dataset to give a quick estimate of its value for predictive modeling and marker identification. The idea is not to perform our best work, which cannot be automated, but to give prospects some hints about the potential of their data and about specific issues in the data that should be looked at in detail. In particular, REED provides:

  • An estimate of predictive performance, either with all available features or in a favorable scenario with a smaller number of features automatically selected by our algorithms.
  • The number of features that we judge necessary to transform or drop because of their particular behavior.
  • An outlier analysis, suggesting that some samples might be considered for removal before processing the data further.
  • The number of features that differ statistically significantly from one diagnosis to the other.
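
To make the idea of dropping features "because of their particular behavior" more concrete, here is a minimal Python sketch of one possible screening step. It flags features with many missing values or a very small variance, the two reasons given in the results section. This is not REED's actual algorithm: the function name `screen_features` and both thresholds are assumptions chosen for illustration.

```python
import statistics

def screen_features(columns, max_missing_fraction=0.5, min_variance=1e-8):
    """Return {feature name: reason} for features a pipeline might drop.

    `columns` maps a feature name to its list of values, with None marking
    a missing value. Both thresholds are illustrative assumptions.
    """
    flagged = {}
    for name, values in columns.items():
        if not values:
            flagged[name] = "no values"
            continue
        present = [v for v in values if v is not None]
        missing_fraction = 1 - len(present) / len(values)
        if missing_fraction > max_missing_fraction:
            flagged[name] = "too many missing values"
        elif len(present) < 2 or statistics.pvariance(present) < min_variance:
            flagged[name] = "variance too small"
    return flagged

# Hypothetical toy data: one usable feature, one mostly missing, one constant.
columns = {
    "Sepal.Length": [5.1, 4.9, 4.7, 4.6],
    "Petal.Length": [1.4, None, None, None],
    "Constant": [1.0, 1.0, 1.0, 1.0],
}
print(screen_features(columns))
```

In this toy example only "Sepal.Length" survives the screening. A real pipeline would likely tune the thresholds to the dataset.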

To begin, enter your email address and upload a dataset

After the upload, please validate the data import (see left column). You will then receive an email as soon as the results are available.

Expected file format : Comma-separated values

  • Categorical values are surrounded by double quotes ( " ).
  • Numerical values are encoded without quotes.
  • The decimal separator is a dot ( . ).
  • Missing values are encoded as empty fields (see third sample of the example below) or as NA without quotes.
  • The first line begins with an empty field and then encodes the variable names.
  • The first column encodes sample identifiers.
  • For classification, the target variable is named class.
  • For regression, the target variable is named response.
  • The target variable cannot contain missing values.
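
As an illustration of the rules above, the following Python sketch writes a small table in this layout. The names `format_field` and `to_reed_csv` are hypothetical, values are assumed to be plain Python strings, numbers or None, and doubling an embedded double quote is a common CSV convention assumed here, not something REED documents.

```python
def format_field(value):
    """Encode one value according to the format rules listed above."""
    if value is None:
        return ""  # missing value: empty field
    if isinstance(value, str):
        # Categorical value: surrounded by double quotes.
        # Doubling embedded quotes is an assumption (usual CSV convention).
        return '"' + value.replace('"', '""') + '"'
    return str(value)  # numerical value: no quotes, dot as decimal separator

def to_reed_csv(sample_ids, features, target_name, target_values):
    """Build the CSV text; `features` maps a variable name to its values."""
    if target_name not in ("class", "response"):
        raise ValueError('the target must be named "class" or "response"')
    if any(value is None for value in target_values):
        raise ValueError("the target variable cannot contain missing values")
    names = list(features)
    # The first line begins with an empty field, then the variable names.
    header = [""] + [format_field(n) for n in names] + [format_field(target_name)]
    lines = [",".join(header)]
    for i, sample_id in enumerate(sample_ids):
        # The first column holds the sample identifier.
        row = [format_field(sample_id)]
        row += [format_field(features[n][i]) for n in names]
        row.append(format_field(target_values[i]))
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"

text = to_reed_csv(
    ["x1", "x3"],
    {"Sepal.Length": [5.1, 4.7], "Petal.Length": [1.4, None]},
    "class",
    ["setosa", "setosa"],
)
print(text)
```

Formatting each field by hand keeps the distinction between quoted categorical values, unquoted numbers and empty missing fields explicit.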

Here is an example of the expected file format. It is a subsample of the Iris dataset. Spaces are optional.

     ,"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","class"
 "x1",           5.1,          3.5,           1.4,          0.2,"setosa"
 "x2",           4.9,            3,           1.4,          0.2,"setosa"
 "x3",           4.7,          3.2,              ,          0.2,"setosa"
 "x4",           4.6,          3.1,           1.5,          0.2,"setosa"
"x51",             7,          3.2,           4.7,          1.4,"versicolor"
"x52",           6.4,          3.2,           4.5,          1.5,"versicolor"
"x53",           6.9,          3.1,           4.9,          1.5,"versicolor"
"x54",           5.5,          2.3,             4,          1.3,"versicolor"
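
Before uploading, a file in this layout can be checked locally. The sketch below uses Python's standard `csv` module to test a few of the rules listed above on an excerpt of the example. It is not REED's own import validation, and the function name `check_reed_csv` is hypothetical. Because `csv.reader` discards quotes, a categorical value spelled "NA" would be counted as missing here.

```python
import csv
import io

MISSING = ("", "NA")                  # the two encodings of a missing value
TARGET_NAMES = ("class", "response")  # classification or regression target

EXAMPLE = '''     ,"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","class"
 "x1",           5.1,          3.5,           1.4,          0.2,"setosa"
 "x3",           4.7,          3.2,              ,          0.2,"setosa"
"x51",             7,          3.2,           4.7,          1.4,"versicolor"
'''

def check_reed_csv(text):
    """Return (problems, missing_count) for CSV content given as a string."""
    # skipinitialspace ignores the optional spaces at the start of a field.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = [[field.strip() for field in row] for row in reader if row]
    if not rows:
        return ["file is empty"], 0
    header, samples = rows[0], rows[1:]
    problems = []
    if header[0] != "":
        problems.append("the first line must begin with an empty field")
    target = next(
        (i for i, name in enumerate(header) if i > 0 and name in TARGET_NAMES),
        None,
    )
    if target is None:
        problems.append('no column named "class" or "response"')
    missing_count = 0
    for row in samples:
        if len(row) != len(header):
            problems.append(
                f"sample {row[0]!r}: expected {len(header)} fields, found {len(row)}"
            )
            continue
        if target is not None and row[target] in MISSING:
            problems.append(f"sample {row[0]!r}: the target value is missing")
        # Count missing feature values (identifier and target excluded).
        missing_count += sum(
            1 for i, field in enumerate(row)
            if i not in (0, target) and field in MISSING
        )
    return problems, missing_count

problems, missing_count = check_reed_csv(EXAMPLE)
print(problems, missing_count)
```

On the excerpt above the check reports no problem and one missing feature value, the empty Petal.Length field of sample x3.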

Some demo datasets are available in the Help section.

How to interpret the results?

  • 16 outliers: Number of samples excluded from the analyses because they are suspiciously different from the other samples.
  • 42 significantly important variables: Number of variables that are important according to a statistical test.
  • 91% accuracy (ACC), with balanced classification rate (BCR) = 87% and area under the ROC curve (AUC) = 89%: Predictive performances on the whole dataset. ACC, BCR and AUC are classification metrics. They assess how well a predictive model can classify new unseen samples. The higher, the better. For regression metrics, see the RMSR entry below.
  • 8 variables removed: Number of variables excluded from the analyses because of their missing values or too small variance.
  • 23 variables transformed: Number of variables that underwent a transformation before being fed to the predictive model.
  • 4.00 root mean squared residuals (RMSR) using 15 variables, with variance explained (VE) = 96%: Predictive performances when using a subset of the variables. RMSR and VE are regression metrics. RMSR estimates the error made when predicting the response. The lower, the better. VE estimates how well the regression model fits the data. The higher, the better.
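
For readers who want to relate these figures to formulas, here is a minimal Python sketch of the five metrics under their usual textbook definitions. These definitions are assumptions: REED does not document its exact formulas. BCR is taken as the mean per-class recall, AUC as the probability that a positive sample scores higher than a negative one (ties counted as one half), and VE as 1 minus the ratio of residual to total sum of squares. All function names and toy values are hypothetical.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class is correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_classification_rate(y_true, y_pred):
    """Mean of the per-class recalls (assumed definition of BCR)."""
    recalls = []
    for label in sorted(set(y_true)):
        predicted = [p for t, p in zip(y_true, y_pred) if t == label]
        recalls.append(sum(p == label for p in predicted) / len(predicted))
    return sum(recalls) / len(recalls)

def auc(y_true, scores, positive):
    """Two-class AUC computed from pairwise comparisons of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rmsr(y_true, y_pred):
    """Root of the mean squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def variance_explained(y_true, y_pred):
    """1 - SS_res / SS_tot (assumed definition of VE)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy classification example with unbalanced classes.
labels = ["setosa"] * 4 + ["versicolor"] * 2
predicted = ["setosa"] * 4 + ["versicolor", "setosa"]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]  # hypothetical scores for "versicolor"
print(accuracy(labels, predicted))
print(balanced_classification_rate(labels, predicted))
print(auc(labels, scores, positive="versicolor"))

# Toy regression example.
response = [1.0, 2.0, 3.0, 4.0]
estimate = [1.0, 2.0, 3.0, 5.0]
print(rmsr(response, estimate))
print(variance_explained(response, estimate))
```

In the toy classification example, accuracy is 5/6 while BCR is 0.75, because the single error falls in the smaller class. This is why BCR can sit below ACC on unbalanced datasets.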

Example datasets

We provide three public datasets that can be tested directly in REED.

  • Iris is a classification dataset with three classes. The original data only contains 4 continuous variables. We have added a few other variables for illustrative purposes.
  • Iris (2 classes) is the same dataset restricted to two of the classes.
  • Boston is a regression dataset. The original data is documented in this paper.

Disclaimer

This online application offers a raw estimation of standard metrics in a context of predictive analytics, based on data uploaded by the user. Obtaining validated estimations is a much longer and tailor-made process that cannot really be automated. DNAlytics does not support any claim about the validity of the results, in particular in terms of generalization capability or robustness of biomarker identification. DNAlytics will not be liable for any use that would be made of the offered results. On the contrary, DNAlytics recommends not taking these results for granted before a more extensive analysis of the data set has been performed. The aim of this application is purely to get a rough estimation of the interest there might be in pursuing such a detailed analysis. Even in that case, DNAlytics does not guarantee that poor results provided by this application are definitive, nor does it guarantee that encouraging results provided by this application are proof of the interest the data might represent.

License

This application is provided at no cost to any user wishing to get a first idea of the value of their data. In that context, only the following uses are granted:

  • copy/dissemination of the URL of the application
  • use of the application for its intended purpose, specified above in the disclaimer
  • non-commercial use of the application only (one cannot bill a customer for results provided by the application)
All other uses are not allowed by DNAlytics.

Version: DNA-REED-4.2.0