Updated UCI binary notes

2024-07-02 16:33:27 +02:00 · daa275d325
parent 76b38cb81c
commit daa275d325
1 changed file with 6 additions and 15 deletions
--- a/docs/source/wiki_editable/Datasets.md
+++ b/docs/source/wiki_editable/Datasets.md
@@ -243,24 +243,15 @@ are summarized below.
 | wine-q-white | 2 | 4898 | 11 | [0.335, 0.665] | dense |
 | yeast | 2 | 1484 | 8 | [0.711, 0.289] | dense |
-### Issues:
+#### Notes:
 All datasets will be downloaded automatically the first time they are requested, and
 stored in the _quapy_data_ folder for faster further reuse. 
-However, some datasets require special actions that at the moment are not fully
-automated.
-* Datasets with ids "ctg.1", "ctg.2", and "ctg.3" (_Cardiotocography Data Set_) load
-an Excel file, which requires the user to install the _xlrd_ Python module in order 
-to open it.
-* The dataset with id "pageblocks.5" (_Page Blocks Classification (5)_) needs to
-open a "unix compressed file" (extension .Z), which is not directly doable with
-standard Pythons packages like gzip or zip. This file would need to be uncompressed using
-OS-dependent software manually. Information on how to do it will be printed the first
-time the dataset is invoked. 
-* It is a good idea to ignore datasets _acute.a_, _acute.b_ and _balance.2_, since the former two
-are very easy (many classifiers would score 100% accuracy) while the latter is extremely difficult
- (probably there is some problem with this dataset, the errors it tends to produce are orders of magnitude 
-greater than for other datasets, and this has a disproportionate impact in the average performance).
+However, notice that it is a good idea to ignore datasets:
+* _acute.a_ and _acute.b_: these are very easy and many classifiers would score 100% accuracy
+* _balance.2_: this is extremely difficult; probably there is some problem with this dataset, 
+the errors it tends to produce are orders of magnitude greater than for other datasets, 
+and this has a disproportionate impact in the average performance.
 ### Multiclass datasets