From daa275d3259ea08ac28c63f0ebda24a69901b8c3 Mon Sep 17 00:00:00 2001
From: Lorenzo Volpi
Date: Tue, 2 Jul 2024 16:33:27 +0200
Subject: [PATCH] Updated UCI binary notes

---
 docs/source/wiki_editable/Datasets.md | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/docs/source/wiki_editable/Datasets.md b/docs/source/wiki_editable/Datasets.md
index a05244e..615a100 100644
--- a/docs/source/wiki_editable/Datasets.md
+++ b/docs/source/wiki_editable/Datasets.md
@@ -243,24 +243,15 @@ are summarized below.
 | wine-q-white | 2 | 4898 | 11 | [0.335, 0.665] | dense |
 | yeast | 2 | 1484 | 8 | [0.711, 0.289] | dense |
-### Issues:
+#### Notes:
 All datasets will be downloaded automatically the first time they are requested, and stored in the _quapy_data_ folder for faster further reuse.
-However, some datasets require special actions that at the moment are not fully
-automated.
-* Datasets with ids "ctg.1", "ctg.2", and "ctg.3" (_Cardiotocography Data Set_) load
-an Excel file, which requires the user to install the _xlrd_ Python module in order
-to open it.
-* The dataset with id "pageblocks.5" (_Page Blocks Classification (5)_) needs to
-open a "unix compressed file" (extension .Z), which is not directly doable with
-standard Pythons packages like gzip or zip. This file would need to be uncompressed using
-OS-dependent software manually. Information on how to do it will be printed the first
-time the dataset is invoked.
-* It is a good idea to ignore datasets _acute.a_, _acute.b_ and _balance.2_, since the former two
-are very easy (many classifiers would score 100% accuracy) while the latter is extremely difficult
- (probably there is some problem with this dataset, the errors it tends to produce are orders of magnitude
-greater than for other datasets, and this has a disproportionate impact in the average performance).
+However, notice that it is a good idea to ignore datasets:
+* _acute.a_ and _acute.b_: these are very easy and many classifiers would score 100% accuracy
+* _balance.2_: this is extremely difficult; probably there is some problem with this dataset,
+the errors it tends to produce are orders of magnitude greater than for other datasets,
+and this has a disproportionate impact in the average performance.
 ### Multiclass datasets