# Datasets

QuaPy makes available several datasets that have been used in the
quantification literature, as well as an interface that allows anyone
to import their own custom datasets.

A _Dataset_ object in QuaPy is roughly a pair of _LabelledCollection_ objects,
one playing the role of the training set, another the test set.
_LabelledCollection_ is a data class consisting of the (iterable)
instances and labels. This class handles most of the sampling functionality in QuaPy.
Take a look at the following code:

```python
import quapy as qp
import quapy.functional as F

instances = [
    '1st positive document', '2nd positive document',
    'the only negative document',
    '1st neutral document', '2nd neutral document', '3rd neutral document'
]
labels = [2, 2, 0, 1, 1, 1]

data = qp.data.LabelledCollection(instances, labels)
print(F.strprev(data.prevalence(), prec=2))
```

The output shows the class prevalences (with 2-digit precision):

```
[0.17, 0.50, 0.33]
```

One can easily produce new samples at desired class prevalence values:

```python
sample_size = 10
prev = [0.4, 0.1, 0.5]
sample = data.sampling(sample_size, *prev)

print('instances:', sample.instances)
print('labels:', sample.labels)
print('prevalence:', F.strprev(sample.prevalence(), prec=2))
```

Which outputs:

```
instances: ['the only negative document' '2nd positive document'
 '2nd positive document' '2nd neutral document' '1st positive document'
 'the only negative document' 'the only negative document'
 'the only negative document' '2nd positive document'
 '1st positive document']
labels: [0 2 2 1 2 0 0 0 2 2]
prevalence: [0.40, 0.10, 0.50]
```

Samples can be made consistent across different runs (e.g., to test
different methods on the exact same samples) by sampling and retaining
the indexes, which can then be used to regenerate the sample:

```python
index = data.sampling_index(sample_size, *prev)
for method in methods:  # 'methods' stands for any collection of quantifiers to compare
    sample = data.sampling_from_index(index)
    ...
```

However, generating samples for evaluation purposes is tackled in QuaPy
by means of the evaluation protocols (see the dedicated entries in the manuals
for [evaluation](./evaluation) and [protocols](./protocols)).

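To give the flavour, the following minimal sketch relies on the artificial prevalence protocol (`qp.protocol.APP`); the parameter values are merely illustrative, and we assume that, as in recent versions of QuaPy, the protocol yields (instances, prevalence) pairs by default:

```python
import quapy as qp
import quapy.functional as F

# APP generates samples of a fixed size whose prevalence values
# are spread over a grid of points in the probability simplex
prot = qp.protocol.APP(data, sample_size=10, n_prevalences=5, repeats=1)
for instances, prev in prot():
    print(F.strprev(prev, prec=2))
```
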
## Reviews Datasets

Three datasets of reviews about Kindle devices, Harry Potter's series, and
the well-known IMDb movie reviews can be fetched using a unified interface.
For example:

```python
import quapy as qp
data = qp.datasets.fetch_reviews('kindle')
```

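By default the instances are returned as raw text; assuming the loader's `tfidf` and `min_df` arguments (present in recent versions of QuaPy), the documents can instead be fetched directly in tf-idf form:

```python
import quapy as qp

# fetch the same dataset as a tf-idf weighted matrix,
# discarding terms that appear in fewer than 5 documents
data = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5)
```
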
These datasets have been used in:

```
Esuli, A., Moreo, A., & Sebastiani, F. (2018, October).
A recurrent neural network for sentiment quantification.
In Proceedings of the 27th ACM International Conference on
Information and Knowledge Management (pp. 1775-1778).
```

The list of review ids is available in:

```python
qp.datasets.REVIEWS_SENTIMENT_DATASETS
```

Some statistics of the available datasets are summarized below:

| Dataset | classes | train size | test size | train prev | test prev | type |
|---|:---:|:---:|:---:|:---:|:---:|---|
| hp | 2 | 9533 | 18399 | \[0.018, 0.982\] | \[0.065, 0.935\] | text |
| kindle | 2 | 3821 | 21591 | \[0.081, 0.919\] | \[0.063, 0.937\] | text |
| imdb | 2 | 25000 | 25000 | \[0.500, 0.500\] | \[0.500, 0.500\] | text |

## Twitter Sentiment Datasets

11 Twitter datasets for sentiment analysis.
The text is not directly accessible; instead, the documents are made available
in tf-idf format. Each dataset presents two splits: a train/val
split for model selection purposes, and a train+val/test split
for model evaluation. The following code exemplifies how to load
a Twitter dataset for model selection:

```python
import quapy as qp
data = qp.datasets.fetch_twitter('gasp', for_model_selection=True)
```

The datasets were used in:

```
Gao, W., & Sebastiani, F. (2015, August).
Tweet sentiment: From classification to quantification.
In 2015 IEEE/ACM International Conference on Advances in
Social Networks Analysis and Mining (ASONAM) (pp. 97-104). IEEE.
```

Three of the datasets (semeval13, semeval14, and semeval15) share the
same training set (semeval), meaning that the training split one would get
when requesting any of them is the same. The dataset "semeval" can only
be requested with "for_model_selection=True".
The lists of the Twitter datasets' ids can be consulted in:

```python
# a list of the 11 dataset ids that can be used for model selection or model evaluation
qp.datasets.TWITTER_SENTIMENT_DATASETS_TEST

# a list of 9 dataset ids in which "semeval13", "semeval14", and "semeval15" are replaced with "semeval"
qp.datasets.TWITTER_SENTIMENT_DATASETS_TRAIN
```

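As an illustration, the following sketch loads every Twitter dataset in its train+val/test version for evaluation; the `min_df` and `pickle` arguments are assumptions based on the loader's signature in recent versions of QuaPy:

```python
import quapy as qp

for dataset_name in qp.datasets.TWITTER_SENTIMENT_DATASETS_TEST:
    # min_df prunes infrequent terms; pickle caches the parsed dataset for faster reloads
    data = qp.datasets.fetch_twitter(dataset_name, for_model_selection=False, min_df=5, pickle=True)
    print(dataset_name, data.training.prevalence())
```
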
Some details can be found below:

| Dataset | classes | train size | test size | features | train prev | test prev | type |
|---|:---:|:---:|:---:|:---:|:---:|:---:|---|
| gasp | 3 | 8788 | 3765 | 694582 | [0.421, 0.496, 0.082] | [0.407, 0.507, 0.086] | sparse |
| hcr | 3 | 1594 | 798 | 222046 | [0.546, 0.211, 0.243] | [0.640, 0.167, 0.193] | sparse |
| omd | 3 | 1839 | 787 | 199151 | [0.463, 0.271, 0.266] | [0.437, 0.283, 0.280] | sparse |
| sanders | 3 | 2155 | 923 | 229399 | [0.161, 0.691, 0.148] | [0.164, 0.688, 0.148] | sparse |
| semeval13 | 3 | 11338 | 3813 | 1215742 | [0.159, 0.470, 0.372] | [0.158, 0.430, 0.412] | sparse |
| semeval14 | 3 | 11338 | 1853 | 1215742 | [0.159, 0.470, 0.372] | [0.109, 0.361, 0.530] | sparse |
| semeval15 | 3 | 11338 | 2390 | 1215742 | [0.159, 0.470, 0.372] | [0.153, 0.413, 0.434] | sparse |
| semeval16 | 3 | 8000 | 2000 | 889504 | [0.157, 0.351, 0.492] | [0.163, 0.341, 0.497] | sparse |
| sst | 3 | 2971 | 1271 | 376132 | [0.261, 0.452, 0.288] | [0.207, 0.481, 0.312] | sparse |
| wa | 3 | 2184 | 936 | 248563 | [0.305, 0.414, 0.281] | [0.282, 0.446, 0.272] | sparse |
| wb | 3 | 4259 | 1823 | 404333 | [0.270, 0.392, 0.337] | [0.274, 0.392, 0.335] | sparse |

## UCI Machine Learning

### Binary datasets

A set of 32 datasets from the [UCI Machine Learning repository](https://archive.ics.uci.edu/ml/datasets.php)
used in:

```
Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017).
Using ensembles for problems with characterizable changes
in data distribution: A case study on quantification.
Information Fusion, 34, 87-100.
```

The list does not exactly coincide with that used in Pérez-Gállego et al. (2017),
since we were unable to find the datasets with ids "diabetes" and "phoneme".

These datasets can be loaded by calling, e.g.:

```python
import quapy as qp

data = qp.datasets.fetch_UCIBinaryDataset('yeast', verbose=True)
```

This call will return a _Dataset_ object in which the training and
test splits are randomly drawn, in a stratified manner, from the whole
collection, at 70% and 30%, respectively. The _verbose=True_ option indicates
that the dataset description should be printed to standard output.
The original data is not split,
and some papers submit the entire collection to a kFCV validation.
In order to accommodate these practices, one could first instantiate
the entire collection, and then create a generator that will return one
training+test dataset at a time, following a kFCV protocol:

```python
import quapy as qp

collection = qp.datasets.fetch_UCIBinaryLabelledCollection("yeast")
for data in qp.data.Dataset.kFCV(collection, nfolds=5, nrepeats=2):
    ...
```

The above code allows one to conduct a 2x5FCV evaluation on the "yeast" dataset.

All datasets come in numerical form (dense matrices); some statistics
are summarized below.

| Dataset | classes | instances | features | prev | type |
|---|:---:|:---:|:---:|:---:|---|
| acute.a | 2 | 120 | 6 | [0.508, 0.492] | dense |
| acute.b | 2 | 120 | 6 | [0.583, 0.417] | dense |
| balance.1 | 2 | 625 | 4 | [0.539, 0.461] | dense |
| balance.2 | 2 | 625 | 4 | [0.922, 0.078] | dense |
| balance.3 | 2 | 625 | 4 | [0.539, 0.461] | dense |
| breast-cancer | 2 | 683 | 9 | [0.350, 0.650] | dense |
| cmc.1 | 2 | 1473 | 9 | [0.573, 0.427] | dense |
| cmc.2 | 2 | 1473 | 9 | [0.774, 0.226] | dense |
| cmc.3 | 2 | 1473 | 9 | [0.653, 0.347] | dense |
| ctg.1 | 2 | 2126 | 21 | [0.222, 0.778] | dense |
| ctg.2 | 2 | 2126 | 21 | [0.861, 0.139] | dense |
| ctg.3 | 2 | 2126 | 21 | [0.917, 0.083] | dense |
| german | 2 | 1000 | 24 | [0.300, 0.700] | dense |
| haberman | 2 | 306 | 3 | [0.735, 0.265] | dense |
| ionosphere | 2 | 351 | 34 | [0.641, 0.359] | dense |
| iris.1 | 2 | 150 | 4 | [0.667, 0.333] | dense |
| iris.2 | 2 | 150 | 4 | [0.667, 0.333] | dense |
| iris.3 | 2 | 150 | 4 | [0.667, 0.333] | dense |
| mammographic | 2 | 830 | 5 | [0.514, 0.486] | dense |
| pageblocks.5 | 2 | 5473 | 10 | [0.979, 0.021] | dense |
| semeion | 2 | 1593 | 256 | [0.901, 0.099] | dense |
| sonar | 2 | 208 | 60 | [0.534, 0.466] | dense |
| spambase | 2 | 4601 | 57 | [0.606, 0.394] | dense |
| spectf | 2 | 267 | 44 | [0.794, 0.206] | dense |
| tictactoe | 2 | 958 | 9 | [0.653, 0.347] | dense |
| transfusion | 2 | 748 | 4 | [0.762, 0.238] | dense |
| wdbc | 2 | 569 | 30 | [0.627, 0.373] | dense |
| wine.1 | 2 | 178 | 13 | [0.669, 0.331] | dense |
| wine.2 | 2 | 178 | 13 | [0.601, 0.399] | dense |
| wine.3 | 2 | 178 | 13 | [0.730, 0.270] | dense |
| wine-q-red | 2 | 1599 | 11 | [0.465, 0.535] | dense |
| wine-q-white | 2 | 4898 | 11 | [0.335, 0.665] | dense |
| yeast | 2 | 1484 | 8 | [0.711, 0.289] | dense |

#### Notes:

All datasets will be downloaded automatically the first time they are requested, and
stored in the _quapy_data_ folder for faster reuse.

Notice, however, that it is a good idea to ignore some of the datasets (see the sketch after this list):
* _acute.a_ and _acute.b_: these are very easy, and many classifiers would score 100% accuracy
* _balance.2_: this is extremely difficult; there is probably some problem with this dataset,
since the errors it tends to produce are orders of magnitude greater than for the other datasets,
and this has a disproportionate impact on the average performance.

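A minimal sketch of such a filtered loop follows; it assumes that the list of ids is exposed as `qp.datasets.UCI_BINARY_DATASETS` (the exact name may vary across QuaPy versions):

```python
import quapy as qp

ignore = {'acute.a', 'acute.b', 'balance.2'}

for dataset_name in qp.datasets.UCI_BINARY_DATASETS:
    if dataset_name in ignore:
        continue  # skip the trivially easy and the anomalous datasets
    data = qp.datasets.fetch_UCIBinaryDataset(dataset_name)
    ...
```
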
### Multiclass datasets

A collection of 24 multiclass datasets from the [UCI Machine Learning repository](https://archive.ics.uci.edu/ml/datasets.php).
Some of the datasets were first used in [this paper](https://arxiv.org/abs/2401.00490) and can be instantiated as follows:

```python
import quapy as qp
data = qp.datasets.fetch_UCIMulticlassLabelledCollection('dry-bean', verbose=True)
```

A dataset can be instantiated while filtering out the classes that do not reach a minimum number of instances,
using the `min_class_support` parameter (default: `100`), as follows:

```python
import quapy as qp
data = qp.datasets.fetch_UCIMulticlassLabelledCollection('dry-bean', min_class_support=50, verbose=True)
```

There are no pre-defined train-test partitions for these datasets, but you can easily create your own with the
`split_stratified` method, e.g., `data.split_stratified()`. This can also be achieved using the method `fetch_UCIMulticlassDataset`,
as shown below:

```python
data = qp.datasets.fetch_UCIMulticlassDataset('dry-bean', min_test_split=0.4, verbose=True)
train, test = data.train_test
```

This method tries to respect the `min_test_split` value while generating the train-test partition, but the resulting training set
will not be bigger than `max_train_instances`, which defaults to `25000`. A bigger value can be passed as a parameter:

```python
data = qp.datasets.fetch_UCIMulticlassDataset('dry-bean', min_test_split=0.4, max_train_instances=30000, verbose=True)
train, test = data.train_test
```

The datasets correspond to a part of the datasets that can be retrieved from the platform using the following filters:
* datasets for classification
* more than 2 classes
* containing at least 1,000 instances
* can be imported using the Python API.

Some statistics about these datasets are displayed below:

| **Dataset** | **classes** | **instances** | **features** | **prevs** | **type** |
|:------------|:-----------:|:-------------:|:------------:|:----------|:--------:|
| dry-bean | 7 | 13611 | 16 | [0.097, 0.038, 0.120, 0.261, 0.142, 0.149, 0.194] | dense |
| wine-quality | 5 | 6462 | 11 | [0.033, 0.331, 0.439, 0.167, 0.030] | dense |
| academic-success | 3 | 4424 | 36 | [0.321, 0.179, 0.499] | dense |
| digits | 10 | 5620 | 64 | [0.099, 0.102, 0.099, 0.102, 0.101, 0.099, 0.099, 0.101, 0.099, 0.100] | dense |
| letter | 26 | 20000 | 16 | [0.039, 0.038, 0.037, 0.040, 0.038, 0.039, 0.039, 0.037, 0.038, 0.037, 0.037, 0.038, 0.040, 0.039, 0.038, 0.040, 0.039, 0.038, 0.037, 0.040, 0.041, 0.038, 0.038, 0.039, 0.039, 0.037] | dense |
| abalone | 11 | 3842 | 9 | [0.030, 0.067, 0.102, 0.148, 0.179, 0.165, 0.127, 0.069, 0.053, 0.033, 0.027] | dense |
| obesity | 7 | 2111 | 23 | [0.129, 0.136, 0.166, 0.141, 0.153, 0.137, 0.137] | dense |
| nursery | 4 | 12958 | 19 | [0.333, 0.329, 0.312, 0.025] | dense |
| yeast | 4 | 1299 | 8 | [0.356, 0.125, 0.188, 0.330] | dense |
| hand_digits | 10 | 10992 | 16 | [0.104, 0.104, 0.104, 0.096, 0.104, 0.096, 0.096, 0.104, 0.096, 0.096] | dense |
| satellite | 6 | 6435 | 36 | [0.238, 0.109, 0.211, 0.097, 0.110, 0.234] | dense |
| shuttle | 4 | 57927 | 7 | [0.787, 0.003, 0.154, 0.056] | dense |
| cmc | 3 | 1473 | 9 | [0.427, 0.226, 0.347] | dense |
| isolet | 26 | 7797 | 617 | [0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038, 0.038] | dense |
| waveform-v1 | 3 | 5000 | 21 | [0.331, 0.329, 0.339] | dense |
| molecular | 3 | 3190 | 227 | [0.240, 0.241, 0.519] | dense |
| poker_hand | 8 | 1024985 | 10 | [0.501, 0.423, 0.048, 0.021, 0.004, 0.002, 0.001, 0.000] | dense |
| connect-4 | 3 | 67557 | 84 | [0.095, 0.246, 0.658] | dense |
| mhr | 3 | 1014 | 6 | [0.268, 0.400, 0.331] | dense |
| chess | 15 | 27870 | 20 | [0.100, 0.051, 0.102, 0.078, 0.017, 0.007, 0.163, 0.061, 0.025, 0.021, 0.014, 0.071, 0.150, 0.129, 0.009] | dense |
| page_block | 3 | 5357 | 10 | [0.917, 0.061, 0.021] | dense |
| phishing | 3 | 1353 | 9 | [0.519, 0.076, 0.405] | dense |
| image_seg | 7 | 2310 | 19 | [0.143, 0.143, 0.143, 0.143, 0.143, 0.143, 0.143] | dense |
| hcv | 4 | 1385 | 28 | [0.243, 0.240, 0.256, 0.261] | dense |

Values shown above refer to datasets obtained through `fetch_UCIMulticlassLabelledCollection` using all default parameters.

## LeQua 2022 Datasets

QuaPy also provides the datasets used for the LeQua 2022 competition.
In brief, there are 4 tasks (T1A, T1B, T2A, T2B) having to do with text quantification
problems. Tasks T1A and T1B provide documents in vector form, while T2A and T2B provide
raw documents instead.
Tasks T1A and T2A are binary sentiment quantification problems, while T1B and T2B
are multiclass quantification problems consisting of estimating the class prevalence
values of 28 different merchandise products.

Every task consists of a training set, a set of validation samples (for model selection)
and a set of test samples (for evaluation). QuaPy returns this data as a LabelledCollection
(training) and two generation protocols (for validation and test samples), as follows:

```python
import quapy as qp

training, val_generator, test_generator = qp.datasets.fetch_lequa2022(task='T1A')
```

See the `lequa2022_experiments.py` script in the examples folder for further details on how to
carry out experiments using these datasets.

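As a quick orientation, a minimal end-to-end sketch might look as follows; the choice of `PACC` with a logistic regression classifier is merely illustrative:

```python
import quapy as qp
from sklearn.linear_model import LogisticRegression

training, val_generator, test_generator = qp.datasets.fetch_lequa2022(task='T1A')

# train a quantifier on the training collection
quantifier = qp.method.aggregative.PACC(LogisticRegression())
quantifier.fit(training)

# evaluate it on the test samples generated by the protocol
mae = qp.evaluation.evaluate(quantifier, protocol=test_generator, error_metric='mae')
print(f'MAE={mae:.4f}')
```
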
The datasets are downloaded only once, and stored for fast reuse.

Some statistics are summarized below:

| Dataset | classes | train size | validation samples | test samples | docs by sample | type |
|---------|:-------:|:----------:|:------------------:|:------------:|:----------------:|:--------:|
| T1A | 2 | 5000 | 1000 | 5000 | 250 | vector |
| T1B | 28 | 20000 | 1000 | 5000 | 1000 | vector |
| T2A | 2 | 5000 | 1000 | 5000 | 250 | text |
| T2B | 28 | 20000 | 1000 | 5000 | 1000 | text |

For further details on the datasets, we refer to the original
[paper](https://ceur-ws.org/Vol-3180/paper-146.pdf):

```
Esuli, A., Moreo, A., Sebastiani, F., & Sperduti, G. (2022).
A Detailed Overview of LeQua@CLEF 2022: Learning to Quantify.
```

## IFCB Plankton dataset

IFCB is a dataset of plankton species in water samples, hosted on [Zenodo](https://zenodo.org/records/10036244).
It is based on the data publicly available at the [WHOI-Plankton repo](https://github.com/hsosik/WHOI-Plankton),
and the scripts used for processing it are available at [P. González's repo](https://github.com/pglez82/IFCB_Zenodo).

This dataset comes with precomputed features for testing quantification algorithms.

Some statistics:

|                 | **Training** | **Validation** | **Test** |
|-----------------|:------------:|:--------------:|:--------:|
| samples         | 200 | 86 | 678 |
| total instances | 584474 | 246916 | 2626429 |
| mean per sample | 2922.3 | 2871.1 | 3873.8 |
| min per sample  | 266 | 59 | 33 |
| max per sample  | 6645 | 7375 | 9112 |

The number of features is 512, while the number of classes is 50.
In terms of prevalence, the mean is 0.020, the minimum is 0, and the maximum is 0.978.

The dataset can be loaded for model selection (`for_model_selection=True`, thus returning the training and validation sets)
or for testing (`for_model_selection=False`, thus returning the training+validation and the test sets).

Additionally, the training set can be interpreted as a list (a generator) of samples (`single_sample_train=False`)
or as a single training set (`single_sample_train=True`).

Example:

```python
import quapy as qp

train, val_gen = qp.datasets.fetch_IFCB(for_model_selection=True, single_sample_train=True)
# ... model selection

train, test_gen = qp.datasets.fetch_IFCB(for_model_selection=False, single_sample_train=True)
# ... training and evaluation
```

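Once loaded, the validation generator can be plugged into QuaPy's model selection machinery like any other protocol; the following sketch assumes `GridSearchQ` with an illustrative `PACC` quantifier (hyperparameter names such as `classifier__C` may vary across versions):

```python
import quapy as qp
from sklearn.linear_model import LogisticRegression

train, val_gen = qp.datasets.fetch_IFCB(for_model_selection=True, single_sample_train=True)

# explore the classifier's regularization strength against the validation samples
model = qp.method.aggregative.PACC(LogisticRegression())
model = qp.model_selection.GridSearchQ(
    model,
    param_grid={'classifier__C': [0.1, 1.0, 10.0]},
    protocol=val_gen,
    error=qp.error.mae,
    refit=False,  # the validation samples cannot be merged back into the training set
).fit(train)
```
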
## Adding Custom Datasets

QuaPy provides data loaders for simple formats dealing with
text, following the format:

```
class-id \t first document's pre-processed text \n
class-id \t second document's pre-processed text \n
...
```

and sparse representations of the form:

```
{-1, 0, or +1} col(int):val(float) col(int):val(float) ... \n
...
```

The code in charge of loading a LabelledCollection is:

```python
@classmethod
def load(cls, path: str, loader_func: callable):
    return LabelledCollection(*loader_func(path))
```

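For the two formats shown above there is no need to write a custom loader: assuming the reader functions `from_text` and `from_sparse` (as found in `quapy.data.reader` in recent versions), the collections can be loaded directly:

```python
import quapy as qp
from quapy.data.reader import from_text, from_sparse

# load a collection of tab-separated labelled documents (hypothetical paths)
text_data = qp.data.LabelledCollection.load('./my_data/documents.txt', loader_func=from_text)

# load a collection in the sparse {label col:val ...} format
sparse_data = qp.data.LabelledCollection.load('./my_data/vectors.dat', loader_func=from_sparse)
```
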
indicating that any _loader_func_ (e.g., a user-defined one) that
returns valid arguments for initializing a _LabelledCollection_ object makes it
possible to load any collection. In particular, the _LabelledCollection_ receives as
arguments the instances (as an iterable) and the labels (as an iterable) and,
additionally, the number of classes can be specified (it would otherwise be
inferred from the labels, but that requires at least one example of every
class to be present in the collection).

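For example, a collection in which some class has no examples could be declared as follows; the `classes` keyword reflects recent versions of QuaPy and might be named differently in older ones:

```python
import quapy as qp

instances = ['a first document', 'a second document']
labels = [0, 0]

# declare three classes explicitly; classes 1 and 2 have no examples in this collection
data = qp.data.LabelledCollection(instances, labels, classes=[0, 1, 2])
print(data.prevalence())  # prints [1. 0. 0.]
```
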
The same _loader_func_ can be passed to a Dataset, along with two
paths, in order to create a training and test pair of _LabelledCollection_ objects,
e.g.:

```python
import quapy as qp

train_path = '../my_data/train.dat'
test_path = '../my_data/test.dat'

def my_custom_loader(path):
    with open(path, 'rb') as fin:
        ...
    return instances, labels

data = qp.data.Dataset.load(train_path, test_path, my_custom_loader)
```

### Data Processing

QuaPy implements a number of preprocessing functions in the package _qp.data.preprocessing_, including (see the example after this list):

* _text2tfidf_: tfidf vectorization
* _reduce_columns_: reducing the number of columns based on term frequency
* _standardize_: transforms the column values into z-scores (i.e., subtracts the mean and normalizes by the standard deviation, so
that the column values have zero mean and unit variance)
* _index_: transforms textual tokens into lists of numeric ids

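For instance, the following sketch converts the raw text of a reviews dataset into a tf-idf representation; the `min_df` and `inplace` arguments reflect the function's signature in recent versions of QuaPy:

```python
import quapy as qp
from quapy.data.preprocessing import text2tfidf

data = qp.datasets.fetch_reviews('kindle')

# vectorize the raw documents, discarding terms that appear in fewer than 5 of them
data = text2tfidf(data, min_df=5, inplace=False)
print(data.training.instances.shape)  # a sparse (n_documents x n_features) matrix
```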