# Datasets

QuaPy makes available several datasets that have been used in the quantification literature, as well as an interface that allows anyone to import their custom datasets.

A Dataset object in QuaPy is roughly a pair of LabelledCollection objects, one playing the role of the training set and the other the test set. LabelledCollection is a data class consisting of the (iterable) instances and labels. This class handles most of the sampling functionality in QuaPy. Take a look at the following code:
```python
import quapy as qp
import quapy.functional as F

instances = [
    '1st positive document', '2nd positive document',
    'the only negative document',
    '1st neutral document', '2nd neutral document', '3rd neutral document'
]
labels = [2, 2, 0, 1, 1, 1]

data = qp.data.LabelledCollection(instances, labels)
print(F.strprev(data.prevalence(), prec=2))
```
which outputs the class prevalences (with 2-digit precision):

```
[0.17, 0.50, 0.33]
```
One can easily produce new samples at desired class prevalence values:

```python
sample_size = 10
prev = [0.4, 0.1, 0.5]
sample = data.sampling(sample_size, *prev)

print('instances:', sample.instances)
print('labels:', sample.labels)
print('prevalence:', F.strprev(sample.prevalence(), prec=2))
```
which outputs:

```
instances: ['the only negative document' '2nd positive document'
 '2nd positive document' '2nd neutral document' '1st positive document'
 'the only negative document' 'the only negative document'
 'the only negative document' '2nd positive document'
 '1st positive document']
labels: [0 2 2 1 2 0 0 0 2 2]
prevalence: [0.40, 0.10, 0.50]
```
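Under the hood, sampling at a target prevalence amounts to deciding how many instances of each class should enter the sample. The following pure-Python sketch illustrates this computation; it is only an illustration of the idea, not QuaPy's actual implementation (the function name and the rounding strategy are assumptions):

```python
# Sketch: derive per-class sample counts from desired prevalence values.
# Illustrative only -- not QuaPy's actual sampling code.

def counts_from_prevalence(prevalence, sample_size):
    # round each class count; assign the remainder to the last class
    # so that the counts always sum to sample_size
    counts = [round(p * sample_size) for p in prevalence[:-1]]
    counts.append(sample_size - sum(counts))
    return counts

print(counts_from_prevalence([0.4, 0.1, 0.5], 10))  # -> [4, 1, 5]
```

Once the per-class counts are fixed, drawing the sample reduces to picking that many instances uniformly at random (with or without replacement) from each class.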
Samples can be made consistent across different runs (e.g., to test different methods on the exact same samples) by sampling and retaining the indexes, which can then be used to generate the sample:

```python
index = data.sampling_index(sample_size, *prev)
for method in methods:
    sample = data.sampling_from_index(index)
    ...
```
However, generating samples for evaluation purposes is tackled in QuaPy by means of the evaluation protocols (see the dedicated Wiki entries on evaluation and protocols).
## Reviews Datasets

Three datasets of reviews about Kindle devices, Harry Potter's series, and the well-known IMDb movie reviews can be fetched using a unified interface. For example:

```python
import quapy as qp
data = qp.datasets.fetch_reviews('kindle')
```
These datasets have been used in:

> Esuli, A., Moreo, A., & Sebastiani, F. (2018, October). A recurrent neural network for sentiment quantification. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (pp. 1775-1778).

The list of review ids is available in:

```python
qp.datasets.REVIEWS_SENTIMENT_DATASETS
```
Some statistics of the available datasets are summarized below:

| Dataset | classes | train size | test size | train prev | test prev | type |
|---|---|---|---|---|---|---|
| hp | 2 | 9533 | 18399 | [0.018, 0.982] | [0.065, 0.935] | text |
| kindle | 2 | 3821 | 21591 | [0.081, 0.919] | [0.063, 0.937] | text |
| imdb | 2 | 25000 | 25000 | [0.500, 0.500] | [0.500, 0.500] | text |
## Twitter Sentiment Datasets

11 Twitter datasets for sentiment analysis. The raw text is not accessible; the documents are made available in tf-idf format. Each dataset presents two splits: a train/val split for model selection purposes, and a train+val/test split for model evaluation. The following code exemplifies how to load a Twitter dataset for model selection:

```python
import quapy as qp
data = qp.datasets.fetch_twitter('gasp', for_model_selection=True)
```
The datasets were used in:

> Gao, W., & Sebastiani, F. (2015, August). Tweet sentiment: From classification to quantification. In 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) (pp. 97-104). IEEE.

Three of the datasets (semeval13, semeval14, and semeval15) share the same training set (semeval), meaning that the training split one would get when requesting any of them is the same. The dataset "semeval" can only be requested with for_model_selection=True. The lists of Twitter dataset ids can be consulted in:

```python
# a list of 11 dataset ids that can be used for model selection or model evaluation
qp.datasets.TWITTER_SENTIMENT_DATASETS_TEST

# 9 dataset ids in which "semeval13", "semeval14", and "semeval15" are replaced with "semeval"
qp.datasets.TWITTER_SENTIMENT_DATASETS_TRAIN
```
Some details can be found below:

| Dataset | classes | train size | test size | features | train prev | test prev | type |
|---|---|---|---|---|---|---|---|
| gasp | 3 | 8788 | 3765 | 694582 | [0.421, 0.496, 0.082] | [0.407, 0.507, 0.086] | sparse |
| hcr | 3 | 1594 | 798 | 222046 | [0.546, 0.211, 0.243] | [0.640, 0.167, 0.193] | sparse |
| omd | 3 | 1839 | 787 | 199151 | [0.463, 0.271, 0.266] | [0.437, 0.283, 0.280] | sparse |
| sanders | 3 | 2155 | 923 | 229399 | [0.161, 0.691, 0.148] | [0.164, 0.688, 0.148] | sparse |
| semeval13 | 3 | 11338 | 3813 | 1215742 | [0.159, 0.470, 0.372] | [0.158, 0.430, 0.412] | sparse |
| semeval14 | 3 | 11338 | 1853 | 1215742 | [0.159, 0.470, 0.372] | [0.109, 0.361, 0.530] | sparse |
| semeval15 | 3 | 11338 | 2390 | 1215742 | [0.159, 0.470, 0.372] | [0.153, 0.413, 0.434] | sparse |
| semeval16 | 3 | 8000 | 2000 | 889504 | [0.157, 0.351, 0.492] | [0.163, 0.341, 0.497] | sparse |
| sst | 3 | 2971 | 1271 | 376132 | [0.261, 0.452, 0.288] | [0.207, 0.481, 0.312] | sparse |
| wa | 3 | 2184 | 936 | 248563 | [0.305, 0.414, 0.281] | [0.282, 0.446, 0.272] | sparse |
| wb | 3 | 4259 | 1823 | 404333 | [0.270, 0.392, 0.337] | [0.274, 0.392, 0.335] | sparse |
## UCI Machine Learning

A set of 32 datasets from the UCI Machine Learning repository, used in:

> Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. Information Fusion, 34, 87-100.

The list does not exactly coincide with that used in Pérez-Gállego et al. (2017), since we were unable to find the datasets with ids "diabetes" and "phoneme".

These datasets can be loaded by calling, e.g.:

```python
import quapy as qp
data = qp.datasets.fetch_UCIDataset('yeast', verbose=True)
```
This call returns a Dataset object in which the training and test splits are randomly drawn, in a stratified manner, from the whole collection, at 70% and 30% respectively. The verbose=True option indicates that the dataset description should be printed to standard output. The original data come undivided, and some papers submit the entire collection to a kFCV validation. To accommodate these practices, one can first instantiate the entire collection and then create a generator that returns one training+test Dataset at a time, following a kFCV protocol:

```python
import quapy as qp
collection = qp.datasets.fetch_UCILabelledCollection("yeast")
for data in qp.data.Dataset.kFCV(collection, nfolds=5, nrepeats=2):
    ...
```

The code above conducts a 2x5FCV evaluation on the "yeast" dataset.
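For intuition, the index bookkeeping behind a repeated k-fold split can be sketched in plain Python. This is only an illustration of the protocol (the function name and fold-assignment scheme are assumptions, not QuaPy's implementation):

```python
import random

def repeated_kfold_indices(n, nfolds=5, nrepeats=2, seed=0):
    """Yield (train_idx, test_idx) pairs for nrepeats shuffled k-fold rounds."""
    rng = random.Random(seed)
    for _ in range(nrepeats):
        idx = list(range(n))
        rng.shuffle(idx)                     # reshuffle at every repetition
        for f in range(nfolds):
            test = idx[f::nfolds]            # every nfolds-th index forms one fold
            test_set = set(test)
            train = [i for i in idx if i not in test_set]
            yield train, test

splits = list(repeated_kfold_indices(20, nfolds=5, nrepeats=2))
print(len(splits))  # 5 folds x 2 repeats = 10 train/test pairs
```

Each of the nfolds x nrepeats pairs partitions the full index range, so every instance appears in exactly one test fold per repetition.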
All datasets come in numerical form (dense matrices); some statistics are summarized below:

| Dataset | classes | instances | features | prev | type |
|---|---|---|---|---|---|
| acute.a | 2 | 120 | 6 | [0.508, 0.492] | dense |
| acute.b | 2 | 120 | 6 | [0.583, 0.417] | dense |
| balance.1 | 2 | 625 | 4 | [0.539, 0.461] | dense |
| balance.2 | 2 | 625 | 4 | [0.922, 0.078] | dense |
| balance.3 | 2 | 625 | 4 | [0.539, 0.461] | dense |
| breast-cancer | 2 | 683 | 9 | [0.350, 0.650] | dense |
| cmc.1 | 2 | 1473 | 9 | [0.573, 0.427] | dense |
| cmc.2 | 2 | 1473 | 9 | [0.774, 0.226] | dense |
| cmc.3 | 2 | 1473 | 9 | [0.653, 0.347] | dense |
| ctg.1 | 2 | 2126 | 22 | [0.222, 0.778] | dense |
| ctg.2 | 2 | 2126 | 22 | [0.861, 0.139] | dense |
| ctg.3 | 2 | 2126 | 22 | [0.917, 0.083] | dense |
| german | 2 | 1000 | 24 | [0.300, 0.700] | dense |
| haberman | 2 | 306 | 3 | [0.735, 0.265] | dense |
| ionosphere | 2 | 351 | 34 | [0.641, 0.359] | dense |
| iris.1 | 2 | 150 | 4 | [0.667, 0.333] | dense |
| iris.2 | 2 | 150 | 4 | [0.667, 0.333] | dense |
| iris.3 | 2 | 150 | 4 | [0.667, 0.333] | dense |
| mammographic | 2 | 830 | 5 | [0.514, 0.486] | dense |
| pageblocks.5 | 2 | 5473 | 10 | [0.979, 0.021] | dense |
| semeion | 2 | 1593 | 256 | [0.901, 0.099] | dense |
| sonar | 2 | 208 | 60 | [0.534, 0.466] | dense |
| spambase | 2 | 4601 | 57 | [0.606, 0.394] | dense |
| spectf | 2 | 267 | 44 | [0.794, 0.206] | dense |
| tictactoe | 2 | 958 | 9 | [0.653, 0.347] | dense |
| transfusion | 2 | 748 | 4 | [0.762, 0.238] | dense |
| wdbc | 2 | 569 | 30 | [0.627, 0.373] | dense |
| wine.1 | 2 | 178 | 13 | [0.669, 0.331] | dense |
| wine.2 | 2 | 178 | 13 | [0.601, 0.399] | dense |
| wine.3 | 2 | 178 | 13 | [0.730, 0.270] | dense |
| wine-q-red | 2 | 1599 | 11 | [0.465, 0.535] | dense |
| wine-q-white | 2 | 4898 | 11 | [0.335, 0.665] | dense |
| yeast | 2 | 1484 | 8 | [0.711, 0.289] | dense |
### Issues

All datasets are downloaded automatically the first time they are requested, and stored in the quapy_data folder for faster reuse. However, some datasets require special actions that, at the moment, are not fully automated:

- Datasets with ids "ctg.1", "ctg.2", and "ctg.3" (Cardiotocography Data Set) load an Excel file, which requires the user to install the xlrd Python module in order to open it.
- The dataset with id "pageblocks.5" (Page Blocks Classification (5)) needs to open a "unix compressed file" (extension .Z), which cannot be done directly with standard Python packages like gzip or zip; the file has to be uncompressed manually using OS-dependent software. Information on how to do this is printed the first time the dataset is invoked.
## LeQua Datasets

QuaPy also provides the datasets used for the LeQua competition. In brief, there are 4 tasks (T1A, T1B, T2A, T2B) having to do with text quantification problems. Tasks T1A and T1B provide documents in vector form, while T2A and T2B provide raw documents instead. Tasks T1A and T2A are binary sentiment quantification problems, while T1B and T2B are multiclass quantification problems consisting of estimating the class prevalence values of 28 different merchandise products.

Every task consists of a training set, a set of validation samples (for model selection), and a set of test samples (for evaluation). QuaPy returns this data as a LabelledCollection (training) and two generation protocols (for the validation and test samples), as follows:

```python
training, val_generator, test_generator = fetch_lequa2022(task=task)
```
See lequa2022_experiments.py in the examples folder for further details on how to carry out experiments using these datasets. The datasets are downloaded only once, and stored for fast reuse.

Some statistics are summarized below:
| Dataset | classes | train size | validation samples | test samples | docs by sample | type |
|---|---|---|---|---|---|---|
| T1A | 2 | 5000 | 1000 | 5000 | 250 | vector |
| T1B | 28 | 20000 | 1000 | 5000 | 1000 | vector |
| T2A | 2 | 5000 | 1000 | 5000 | 250 | text |
| T2B | 28 | 20000 | 1000 | 5000 | 1000 | text |
For further details on the datasets, we refer to the original paper:

> Esuli, A., Moreo, A., Sebastiani, F., & Sperduti, G. (2022). A Detailed Overview of LeQua@CLEF 2022: Learning to Quantify.
## Adding Custom Datasets

QuaPy provides data loaders for simple formats dealing with text, following the format:

```
class-id \t first document's pre-processed text \n
class-id \t second document's pre-processed text \n
...
```

and sparse representations of the form:

```
{-1, 0, or +1} col(int):val(float) col(int):val(float) ... \n
...
```
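As an illustration of the sparse format above, a small hand-rolled parser for a single line might look as follows (this is a hypothetical sketch, not QuaPy's loader):

```python
def parse_sparse_line(line):
    """Parse '{-1, 0, or +1} col:val col:val ...' into (label, {col: val})."""
    parts = line.split()
    label = int(parts[0])          # the leading class id: -1, 0, or +1
    features = {}
    for token in parts[1:]:
        col, val = token.split(':')
        features[int(col)] = float(val)
    return label, features

label, feats = parse_sparse_line('+1 3:0.5 7:1.25')
print(label, feats)  # 1 {3: 0.5, 7: 1.25}
```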
The code in charge of loading a LabelledCollection is:

```python
@classmethod
def load(cls, path:str, loader_func:callable):
    return LabelledCollection(*loader_func(path))
```

indicating that any loader_func (e.g., a user-defined one) that returns valid arguments for initializing a LabelledCollection object makes it possible to load any collection. In particular, LabelledCollection receives as arguments the instances (as an iterable) and the labels (as an iterable); additionally, the number of classes can be specified (it would otherwise be inferred from the labels, but that requires at least one example of every class to be present in the collection).
The same loader_func can be passed to a Dataset, along with two paths, in order to create a training and test pair of LabelledCollection objects, e.g.:

```python
import quapy as qp

train_path = '../my_data/train.dat'
test_path = '../my_data/test.dat'

def my_custom_loader(path):
    with open(path, 'rb') as fin:
        ...
    return instances, labels

data = qp.data.Dataset.load(train_path, test_path, my_custom_loader)
```
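To make the loader skeleton above concrete, here is a hypothetical loader_func for the tab-separated text format described earlier; the function name and the details (encoding, blank-line handling) are illustrative assumptions:

```python
def my_text_loader(path):
    """Read 'class-id \t document text' lines into (instances, labels)."""
    instances, labels = [], []
    with open(path, 'rt', encoding='utf-8') as fin:
        for line in fin:
            line = line.strip()
            if not line:
                continue                     # skip empty lines
            label, text = line.split('\t', maxsplit=1)
            instances.append(text)
            labels.append(int(label))
    return instances, labels
```

Since it returns an (instances, labels) pair, such a function could be passed wherever a loader_func is expected.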
## Data Processing

QuaPy implements a number of preprocessing functions in the package qp.data.preprocessing, including:

- text2tfidf: tf-idf vectorization
- reduce_columns: reduces the number of columns based on term frequency
- standardize: transforms the column values into z-scores (i.e., subtracts the mean and normalizes by the standard deviation, so that the column values have zero mean and unit variance)
- index: transforms textual tokens into lists of numeric ids
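As a reference for what standardize computes, the z-score transformation for a single column can be written in a few lines of plain Python (illustrative only; QuaPy applies the transformation column-wise over the whole matrix, and the function name here is an assumption):

```python
import statistics

def zscores(column):
    """Standardize one column: subtract the mean, divide by the std deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)    # population standard deviation
    return [(x - mean) / std for x in column]

print(zscores([1.0, 2.0, 3.0]))  # mean 2.0, std ~0.816 -> values around [-1.22, 0, 1.22]
```

After the transformation, each column has zero mean and unit variance, which is what the standardize description above states.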