diff --git a/.gitignore b/.gitignore
index b9703a3..8eaff3e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -130,3 +130,32 @@
 dmypy.json
 .pyre/
 *__pycache__*
+*.pdf
+*.zip
+*.png
+*.csv
+*.pkl
+*.dataframe
+
+
+# other projects
+LeQua2022
+MultiLabel
+NewMethods
+Ordinal
+Retrieval
+eDiscovery
+poster-cikm
+slides-cikm
+slides-short-cikm
+quick_experiment
+svm_perf_quantification/svm_struct
+svm_perf_quantification/svm_light
+TweetSentQuant
+
+
+
+
+
+
+*.png
diff --git a/quapy/CHANGE_LOG.txt b/CHANGE_LOG.txt
similarity index 60%
rename from quapy/CHANGE_LOG.txt
rename to CHANGE_LOG.txt
index 1e0908a..5bf2643 100644
--- a/quapy/CHANGE_LOG.txt
+++ b/CHANGE_LOG.txt
@@ -1,3 +1,56 @@
+Change Log 0.1.8
+----------------
+
+- Added Kernel Density Estimation methods (KDEyML, KDEyCS, KDEyHD) as proposed in the paper:
+  Moreo, A., González, P., & del Coz, J. J. Kernel Density Estimation for Multiclass Quantification.
+  arXiv preprint arXiv:2401.00490, 2024
+
+- Substantial internal refactor: aggregative methods now inherit a pattern by which the fit method consists of:
+  a) fitting the classifier and returning the representations of the training instances (typically the posterior
+     probabilities, the label predictions, or the classifier scores, typically obtained through kFCV);
+  b) fitting an aggregation function.
+  The function implemented in step a) is inherited from the super class. Each new aggregative method now has to
+  implement only the "aggregative_fit" of step b).
+  This pattern was already implemented for the prediction (thus allowing evaluation functions to be performed
+  very quickly), and is now available also for training. The main benefit is that model selection can now nest
+  the training of quantifiers at two levels: one for the classifier, and another for the aggregation function.
+  As a result, a method with a param grid of 10 combinations for the classifier and 10 combinations for the
+  quantifier now implies 10 trainings of the classifier + 10*10 trainings of the aggregation function (this is
+  typically much faster than the classifier training), whereas in versions <0.1.8 this amounted to training
+  10*10 (classifiers+aggregations).
+
+- Added different solvers for ACC and PACC quantifiers. In quapy < 0.1.8 these quantifiers try to solve the system
+  of equations Ax=B exactly (by means of np.linalg.solve). As noted by Mirko Bunse (thanks!), such an exact solution
+  sometimes does not exist. In cases like this, quapy < 0.1.8 resorted to CC for providing a plausible solution.
+  ACC and PACC now resort to an approximate solution in such cases (minimizing the L2-norm of the difference
+  Ax-B) as proposed by Mirko Bunse. A quick experiment reveals this heuristic greatly improves the results
+  of ACC and PACC in T2A@LeQua.
+
+- Fixed ThresholdOptimization methods (X, T50, MAX, MS and MS2). Thanks to Tobias Schumacher and colleagues for pointing
+  this out in Appendix A of "Schumacher, T., Strohmaier, M., & Lemmerich, F. (2021). A comparative evaluation of
+  quantification methods. arXiv:2103.03223v3 [cs.LG]"
+
+- Added HDx and DistributionMatchingX to non-aggregative quantifiers (see also the new example "comparing_HDy_HDx.py")
+
+- New UCI multiclass datasets added (thanks to Pablo González). The 5 UCI multiclass datasets are those corresponding
+  to the following criteria:
+  - >1000 instances
+  - >2 classes
+  - classification datasets
+  - Python API available
+
+- New IFCB (plankton) dataset added (thanks to Pablo González). See qp.datasets.fetch_IFCB.
+
+- Added new evaluation measures NAE, NRAE (thanks to Andrea Esuli)
+
+- Added new meta method "MedianEstimator"; an ensemble of binary base quantifiers that receives as input a dictionary
+  of hyperparameters that it will explore exhaustively, fitting and generating predictions for each combination of
+  hyperparameters, and that returns, as the prevalence estimates, the median across all predictions.
+
+- Added "custom_protocol.py" example.
+
+- New API documentation template.
+
 Change Log 0.1.7
 ----------------
diff --git a/README.md b/README.md
index e383da4..d9f697c 100644
--- a/README.md
+++ b/README.md
@@ -13,7 +13,7 @@ for facilitating the analysis and interpretation of the experimental results.
 ### Last updates:
-* Version 0.1.7 is released! major changes can be consulted [here](quapy/CHANGE_LOG.txt).
+* Version 0.1.8 is released! major changes can be consulted [here](CHANGE_LOG.txt).
 * A detailed documentation is now available [here](https://hlt-isti.github.io/QuaPy/)
 * The developer API documentation is available [here](https://hlt-isti.github.io/QuaPy/build/html/modules.html)
@@ -76,7 +76,7 @@ See the [Wiki](https://github.com/HLT-ISTI/QuaPy/wiki) for detailed examples.
 * Implementation of many popular quantification methods (Classify-&-Count and its variants, Expectation Maximization,
 quantification methods based on structured output learning, HDy, QuaNet, quantification ensembles, among others).
 * Versatile functionality for performing evaluation based on sampling generation protocols (e.g., APP, NPP, etc.).
-* Implementation of most commonly used evaluation metrics (e.g., AE, RAE, SE, KLD, NKLD, etc.).
+* Implementation of most commonly used evaluation metrics (e.g., AE, RAE, NAE, NRAE, SE, KLD, NKLD, etc.).
 * Datasets frequently used in quantification (textual and numeric), including:
     * 32 UCI Machine Learning datasets.
     * 11 Twitter quantification-by-sentiment datasets.
@@ -96,6 +96,9 @@ quantification methods based on structured output learning, HDy, QuaNet, quantif
 * pandas, xlrd
 * matplotlib
+## Contributing
+
+In case you want to contribute improvements to quapy, please generate a pull request to the "devel" branch.
 ## Documentation
diff --git a/TODO.txt b/TODO.txt
index 7e99fb2..d3f2b3d 100644
--- a/TODO.txt
+++ b/TODO.txt
@@ -33,7 +33,6 @@ Refactor protocols. APP and NPP related functionalities are duplicated in functi
 New features:
 ==========================================
-Add NAE, NRAE
 Add "measures for evaluating ordinal"?
 Add datasets for topic.
 Do we want to cover cross-lingual quantification natively in QuaPy, or does it make more sense as an application on top?
diff --git a/docs/build/html/Datasets.html b/docs/build/html/Datasets.html
deleted file mode 100644
index 775690d..0000000
--- a/docs/build/html/Datasets.html
+++ /dev/null
@@ -1,831 +0,0 @@
Datasets

-

QuaPy makes available several datasets that have been used in the quantification literature, as well as an interface that allows anyone to import their own custom datasets.

-

A Dataset object in QuaPy is roughly a pair of LabelledCollection objects, -one playing the role of the training set, another the test set. -LabelledCollection is a data class consisting of the (iterable) -instances and labels. This class handles most of the sampling functionality in QuaPy. -Take a look at the following code:

-
import quapy as qp
-import quapy.functional as F
-
-instances = [
-    '1st positive document', '2nd positive document',
-    'the only negative document',
-    '1st neutral document', '2nd neutral document', '3rd neutral document'
-]
-labels = [2, 2, 0, 1, 1, 1]
-
-data = qp.data.LabelledCollection(instances, labels)
-print(F.strprev(data.prevalence(), prec=2))
-
-
-

The output shows the class prevalences (with 2-digit precision):

-
[0.17, 0.50, 0.33]
-
-
-

One can easily produce new samples at desired class prevalence values:

-
sample_size = 10
-prev = [0.4, 0.1, 0.5]
-sample = data.sampling(sample_size, *prev)
-
-print('instances:', sample.instances)
-print('labels:', sample.labels)
-print('prevalence:', F.strprev(sample.prevalence(), prec=2))
-
-
-

Which outputs:

-
instances: ['the only negative document' '2nd positive document'
- '2nd positive document' '2nd neutral document' '1st positive document'
- 'the only negative document' 'the only negative document'
- 'the only negative document' '2nd positive document'
- '1st positive document']
-labels: [0 2 2 1 2 0 0 0 2 2]
-prevalence: [0.40, 0.10, 0.50]
-
-
-

Samples can be made consistent across different runs (e.g., to test -different methods on the same exact samples) by sampling and retaining -the indexes, that can then be used to generate the sample:

-
index = data.sampling_index(sample_size, *prev)
-for method in methods:
-    sample = data.sampling_from_index(index)
-    ...
-
-
-

However, generating samples for evaluation purposes is tackled in QuaPy -by means of the evaluation protocols (see the dedicated entries in the Wiki -for evaluation and -protocols).

-
-

Reviews Datasets

-

Three datasets of reviews about Kindle devices, Harry Potter’s series, and -the well-known IMDb movie reviews can be fetched using a unified interface. -For example:

-
import quapy as qp
-data = qp.datasets.fetch_reviews('kindle')
-
-
-

These datasets have been used in:

-
Esuli, A., Moreo, A., & Sebastiani, F. (2018, October). 
-A recurrent neural network for sentiment quantification. 
-In Proceedings of the 27th ACM International Conference on 
-Information and Knowledge Management (pp. 1775-1778).
-
-
-

The list of review ids is available in:

-
qp.datasets.REVIEWS_SENTIMENT_DATASETS
-
-
-

Some statistics of the available datasets are summarized below:

Dataset   classes   train size   test size   train prev       test prev        type
-------   -------   ----------   ---------   --------------   --------------   ----
hp        2         9533         18399       [0.018, 0.982]   [0.065, 0.935]   text
kindle    2         3821         21591       [0.081, 0.919]   [0.063, 0.937]   text
imdb      2         25000        25000       [0.500, 0.500]   [0.500, 0.500]   text

-
-
-

Twitter Sentiment Datasets

-

11 Twitter datasets for sentiment analysis. -Text is not accessible, and the documents were made available -in tf-idf format. Each dataset presents two splits: a train/val -split for model selection purposes, and a train+val/test split -for model evaluation. The following code exemplifies how to load -a twitter dataset for model selection.

-
import quapy as qp
-data = qp.datasets.fetch_twitter('gasp', for_model_selection=True)
-
-
-

The datasets were used in:

-
Gao, W., & Sebastiani, F. (2015, August). 
-Tweet sentiment: From classification to quantification. 
-In 2015 IEEE/ACM International Conference on Advances in 
-Social Networks Analysis and Mining (ASONAM) (pp. 97-104). IEEE.
-
-
-

Three of the datasets (semeval13, semeval14, and semeval15) share the -same training set (semeval), meaning that the training split one would get -when requesting any of them is the same. The dataset “semeval” can only -be requested with “for_model_selection=True”. -The lists of the Twitter dataset’s ids can be consulted in:

-
# a list of 11 dataset ids that can be used for model selection or model evaluation
-qp.datasets.TWITTER_SENTIMENT_DATASETS_TEST
-
-# 9 dataset ids in which "semeval13", "semeval14", and "semeval15" are replaced with "semeval"
-qp.datasets.TWITTER_SENTIMENT_DATASETS_TRAIN  
-
-
-

Some details can be found below:

Dataset     classes   train size   test size   features   train prev              test prev               type
-------     -------   ----------   ---------   --------   ---------------------   ---------------------   ------
gasp        3         8788         3765        694582     [0.421, 0.496, 0.082]   [0.407, 0.507, 0.086]   sparse
hcr         3         1594         798         222046     [0.546, 0.211, 0.243]   [0.640, 0.167, 0.193]   sparse
omd         3         1839         787         199151     [0.463, 0.271, 0.266]   [0.437, 0.283, 0.280]   sparse
sanders     3         2155         923         229399     [0.161, 0.691, 0.148]   [0.164, 0.688, 0.148]   sparse
semeval13   3         11338        3813        1215742    [0.159, 0.470, 0.372]   [0.158, 0.430, 0.412]   sparse
semeval14   3         11338        1853        1215742    [0.159, 0.470, 0.372]   [0.109, 0.361, 0.530]   sparse
semeval15   3         11338        2390        1215742    [0.159, 0.470, 0.372]   [0.153, 0.413, 0.434]   sparse
semeval16   3         8000         2000        889504     [0.157, 0.351, 0.492]   [0.163, 0.341, 0.497]   sparse
sst         3         2971         1271        376132     [0.261, 0.452, 0.288]   [0.207, 0.481, 0.312]   sparse
wa          3         2184         936         248563     [0.305, 0.414, 0.281]   [0.282, 0.446, 0.272]   sparse
wb          3         4259         1823        404333     [0.270, 0.392, 0.337]   [0.274, 0.392, 0.335]   sparse

-
-
-

UCI Machine Learning

-

A set of 32 datasets from the UCI Machine Learning repository -used in:

-
Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017).
-Using ensembles for problems with characterizable changes 
-in data distribution: A case study on quantification.
-Information Fusion, 34, 87-100.
-
-
-

The list does not exactly coincide with that used in Pérez-Gállego et al. 2017 -since we were unable to find the datasets with ids “diabetes” and “phoneme”.

-

These dataset can be loaded by calling, e.g.:

-
import quapy as qp
-data = qp.datasets.fetch_UCIDataset('yeast', verbose=True)
-
-
-

This call will return a Dataset object in which the training and test splits are randomly drawn, in a stratified manner, from the whole collection, at 70% and 30%, respectively. The verbose=True option indicates that the dataset description should be printed to standard output. The original data is not split, and some papers submit the entire collection to a kFCV validation. To accommodate these practices, one can first instantiate the entire collection, and then create a generator that returns one training+test dataset at a time, following a kFCV protocol:

-
import quapy as qp
-collection = qp.datasets.fetch_UCILabelledCollection("yeast")
-for data in qp.data.Dataset.kFCV(collection, nfolds=5, nrepeats=2):
-    ...
-
-
-

The above code will conduct a 2x5 FCV evaluation on the “yeast” dataset.

-

All datasets come in numerical form (dense matrices); some statistics -are summarized below.

Dataset         classes   instances   features   prev             type
-------         -------   ---------   --------   --------------   -----
acute.a         2         120         6          [0.508, 0.492]   dense
acute.b         2         120         6          [0.583, 0.417]   dense
balance.1       2         625         4          [0.539, 0.461]   dense
balance.2       2         625         4          [0.922, 0.078]   dense
balance.3       2         625         4          [0.539, 0.461]   dense
breast-cancer   2         683         9          [0.350, 0.650]   dense
cmc.1           2         1473        9          [0.573, 0.427]   dense
cmc.2           2         1473        9          [0.774, 0.226]   dense
cmc.3           2         1473        9          [0.653, 0.347]   dense
ctg.1           2         2126        22         [0.222, 0.778]   dense
ctg.2           2         2126        22         [0.861, 0.139]   dense
ctg.3           2         2126        22         [0.917, 0.083]   dense
german          2         1000        24         [0.300, 0.700]   dense
haberman        2         306         3          [0.735, 0.265]   dense
ionosphere      2         351         34         [0.641, 0.359]   dense
iris.1          2         150         4          [0.667, 0.333]   dense
iris.2          2         150         4          [0.667, 0.333]   dense
iris.3          2         150         4          [0.667, 0.333]   dense
mammographic    2         830         5          [0.514, 0.486]   dense
pageblocks.5    2         5473        10         [0.979, 0.021]   dense
semeion         2         1593        256        [0.901, 0.099]   dense
sonar           2         208         60         [0.534, 0.466]   dense
spambase        2         4601        57         [0.606, 0.394]   dense
spectf          2         267         44         [0.794, 0.206]   dense
tictactoe       2         958         9          [0.653, 0.347]   dense
transfusion     2         748         4          [0.762, 0.238]   dense
wdbc            2         569         30         [0.627, 0.373]   dense
wine.1          2         178         13         [0.669, 0.331]   dense
wine.2          2         178         13         [0.601, 0.399]   dense
wine.3          2         178         13         [0.730, 0.270]   dense
wine-q-red      2         1599        11         [0.465, 0.535]   dense
wine-q-white    2         4898        11         [0.335, 0.665]   dense
yeast           2         1484        8          [0.711, 0.289]   dense

-
-

Issues:

-

All datasets will be downloaded automatically the first time they are requested, and -stored in the quapy_data folder for faster further reuse. -However, some datasets require special actions that at the moment are not fully -automated.

-
    -
  • Datasets with ids “ctg.1”, “ctg.2”, and “ctg.3” (Cardiotocography Data Set) load -an Excel file, which requires the user to install the xlrd Python module in order -to open it.

  • -
  • The dataset with id “pageblocks.5” (Page Blocks Classification (5)) needs to open a “unix compressed file” (extension .Z), which is not directly doable with standard Python packages like gzip or zip. This file needs to be uncompressed manually using OS-dependent software. Information on how to do it will be printed the first time the dataset is invoked.

  • -
-
-
-
-

LeQua Datasets

-

QuaPy also provides the datasets used for the LeQua competition. In brief, there are 4 tasks (T1A, T1B, T2A, T2B) having to do with text quantification problems. Tasks T1A and T1B provide documents in vector form, while T2A and T2B provide raw documents instead. Tasks T1A and T2A are binary sentiment quantification problems, while T1B and T2B are multiclass quantification problems consisting of estimating the class prevalence values of 28 different merchandise products.

-

Every task consists of a training set, a set of validation samples (for model selection) -and a set of test samples (for evaluation). QuaPy returns this data as a LabelledCollection -(training) and two generation protocols (for validation and test samples), as follows:

-
training, val_generator, test_generator = fetch_lequa2022(task=task)
-
-
-

See the lequa2022_experiments.py in the examples folder for further details on how to -carry out experiments using these datasets.

-

The datasets are downloaded only once, and stored for fast reuse.

-

Some statistics are summarized below:

Dataset   classes   train size   validation samples   test samples   docs by sample   type
-------   -------   ----------   ------------------   ------------   --------------   ------
T1A       2         5000         1000                 5000           250              vector
T1B       28        20000        1000                 5000           1000             vector
T2A       2         5000         1000                 5000           250              text
T2B       28        20000        1000                 5000           1000             text

-

For further details on the datasets, we refer to the original -paper:

-
Esuli, A., Moreo, A., Sebastiani, F., & Sperduti, G. (2022).
-A Detailed Overview of LeQua@ CLEF 2022: Learning to Quantify.
-
-
-
-
-

Adding Custom Datasets

-

QuaPy provides data loaders for simple formats dealing with -text, following the format:

-
class-id \t first document's pre-processed text \n
-class-id \t second document's pre-processed text \n
-...
-
-
-

and sparse representations of the form:

-
{-1, 0, or +1} col(int):val(float) col(int):val(float) ... \n
-...
-
-
-

The code in charge of loading a LabelledCollection is:

-
@classmethod
-def load(cls, path:str, loader_func:callable):
-    return LabelledCollection(*loader_func(path))
-
-
-

indicating that any loader_func (e.g., a user-defined one) that returns valid arguments for initializing a LabelledCollection object makes it possible to load any collection. In particular, LabelledCollection receives as arguments the instances (as an iterable) and the labels (as an iterable); additionally, the number of classes can be specified (it would otherwise be inferred from the labels, but that requires at least one example of every class to be present in the collection).

-

The same loader_func can be passed to a Dataset, along with two -paths, in order to create a training and test pair of LabelledCollection, -e.g.:

-
import quapy as qp
-
-train_path = '../my_data/train.dat'
-test_path = '../my_data/test.dat'
-
-def my_custom_loader(path):
-    with open(path, 'rb') as fin:
-        ...
-    return instances, labels
-
-data = qp.data.Dataset.load(train_path, test_path, my_custom_loader)
-
-
-
-

Data Processing

-

QuaPy implements a number of preprocessing functions in the package qp.data.preprocessing, including:

-
    -
  • text2tfidf: tfidf vectorization

  • -
  • reduce_columns: reducing the number of columns based on term frequency

  • -
  • standardize: transforms the column values into z-scores (i.e., subtract the mean and normalizes by the standard deviation, so -that the column values have zero mean and unit variance).

  • -
  • index: transforms textual tokens into lists of numeric ids)

  • -
-
-
-
- - -
-
-
-
- -
-
- - - - \ No newline at end of file diff --git a/docs/build/html/Evaluation.html b/docs/build/html/Evaluation.html deleted file mode 100644 index 1b41a03..0000000 --- a/docs/build/html/Evaluation.html +++ /dev/null @@ -1,281 +0,0 @@ - - - - - - - - - - Evaluation — QuaPy 0.1.7 documentation - - - - - - - - - - - - - - - - - - - -
-
-
-
- -
-

Evaluation

-

Quantification is an appealing tool in scenarios of dataset shift, -and particularly in scenarios of prior-probability shift. -That is, the interest in estimating the class prevalences arises -under the belief that those class prevalences might have changed -with respect to the ones observed during training. -In other words, one could simply return the training prevalence -as a predictor of the test prevalence if this change is assumed -to be unlikely (as is the case in general scenarios of -machine learning governed by the iid assumption). -In brief, quantification requires dedicated evaluation protocols, -which are implemented in QuaPy and explained here.

-
-

Error Measures

-

The module quapy.error implements the following error measures for quantification:

-
    -
  • mae: mean absolute error

  • -
  • mrae: mean relative absolute error

  • -
  • mse: mean squared error

  • -
  • mkld: mean Kullback-Leibler Divergence

  • -
  • mnkld: mean normalized Kullback-Leibler Divergence

  • -
-

Functions ae, rae, se, kld, and nkld are also available, which return the individual (per-sample) errors, i.e., without averaging across samples.
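As an illustration of the difference between the per-sample and the averaged variants, the following small sketch (with made-up prevalence vectors) computes the sample-wise absolute errors and their mean:

import numpy as np
import quapy as qp

# two samples (rows) over two classes (columns); values are made up for illustration
true_prevs = np.asarray([[0.5, 0.5], [0.8, 0.2]])
estim_prevs = np.asarray([[0.6, 0.4], [0.7, 0.3]])

print(qp.error.ae(true_prevs, estim_prevs))   # one absolute error per sample
print(qp.error.mae(true_prevs, estim_prevs))  # the mean across samples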

-

Some errors of classification are also available:

-
    -
  • acce: accuracy error (1-accuracy)

  • -
  • f1e: F-1 score error (1-F1 score)

  • -
-

The error functions implement the following interface, e.g.:

-
mae(true_prevs, prevs_hat)
-
-
-

in which the first argument is a ndarray containing the true -prevalences, and the second argument is another ndarray with -the estimations produced by some method.

-

Some error functions, e.g., mrae, mkld, and mnkld, are -smoothed for numerical stability. In those cases, there is a -third argument, e.g.:

-
def mrae(true_prevs, prevs_hat, eps=None): ...
-
-
-

indicating the value for the smoothing parameter epsilon. Traditionally, this value is set to 1/(2T) in past literature, with T the sampling size. One could either pass this value to the function each time, or set QuaPy's environment variable SAMPLE_SIZE once and omit this argument thereafter (recommended); e.g.:

-
qp.environ['SAMPLE_SIZE'] = 100  # once for all
-true_prev = np.asarray([0.5, 0.3, 0.2])  # let's assume 3 classes
-estim_prev = np.asarray([0.1, 0.3, 0.6])
-error = qp.error.mrae(true_prev, estim_prev)
-print(f'mrae({true_prev}, {estim_prev}) = {error:.3f}')
-
-
-

will print:

-
mrae([0.500, 0.300, 0.200], [0.100, 0.300, 0.600]) = 0.914
-
-
-

Finally, it is possible to instantiate QuaPy’s quantification -error functions from strings using, e.g.:

-
error_function = qp.error.from_name('mse')
-error = error_function(true_prev, estim_prev)
-
-
-
-
-

Evaluation Protocols

-

An evaluation protocol is an evaluation procedure that uses one specific sample generation protocol to generate many samples, typically characterized by widely varying amounts of shift with respect to the original distribution, that are then used to evaluate the performance of a (trained) quantifier. These protocols are explained in more detail in a dedicated entry in the wiki. For the time being, let us assume we have already chosen and instantiated one specific such protocol, that we here simply call prot. Let us also assume our model is called quantifier and that our evaluation measure of choice is mae. The evaluation comes down to:

-
mae = qp.evaluation.evaluate(quantifier, protocol=prot, error_metric='mae')
-print(f'MAE = {mae:.4f}')
-
-
-

It is often desirable to evaluate our system using more than one single evaluation measure. In this case, it is convenient to generate a report. A report in QuaPy is a dataframe accounting for all the true prevalence values and their corresponding prevalence values as estimated by the quantifier, along with the error each estimation has given rise to.

-
report = qp.evaluation.evaluation_report(quantifier, protocol=prot, error_metrics=['mae', 'mrae', 'mkld'])
-
-
-

From a pandas’ dataframe, it is straightforward to visualize all the results, -and compute the averaged values, e.g.:

-
pd.set_option('display.expand_frame_repr', False)
-report['estim-prev'] = report['estim-prev'].map(F.strprev)
-print(report)
-
-print('Averaged values:')
-print(report.mean())
-
-
-

This will produce an output like:

-
           true-prev      estim-prev       mae      mrae      mkld
-0     [0.308, 0.692]  [0.314, 0.686]  0.005649  0.013182  0.000074
-1     [0.896, 0.104]  [0.909, 0.091]  0.013145  0.069323  0.000985
-2     [0.848, 0.152]  [0.809, 0.191]  0.039063  0.149806  0.005175
-3     [0.016, 0.984]  [0.033, 0.967]  0.017236  0.487529  0.005298
-4     [0.728, 0.272]  [0.751, 0.249]  0.022769  0.057146  0.001350
-...              ...             ...       ...       ...       ...
-4995    [0.72, 0.28]  [0.698, 0.302]  0.021752  0.053631  0.001133
-4996  [0.868, 0.132]  [0.888, 0.112]  0.020490  0.088230  0.001985
-4997  [0.292, 0.708]  [0.298, 0.702]  0.006149  0.014788  0.000090
-4998    [0.24, 0.76]  [0.220, 0.780]  0.019950  0.054309  0.001127
-4999  [0.948, 0.052]  [0.965, 0.035]  0.016941  0.165776  0.003538
-
-[5000 rows x 5 columns]
-Averaged values:
-mae     0.023588
-mrae    0.108779
-mkld    0.003631
-dtype: float64
-
-Process finished with exit code 0
-
-
-

Alternatively, we can simply generate all the predictions by:

-
true_prevs, estim_prevs = qp.evaluation.prediction(quantifier, protocol=prot)
-
-
-

All the evaluation functions implement specific optimizations for speeding up the evaluation of aggregative quantifiers (i.e., of instances of AggregativeQuantifier). The optimization comes down to generating classification predictions (either crisp or soft) only once for the entire test set, and then applying the sampling procedure to the predictions, instead of generating samples of instances and then computing the classification predictions every time. This is only possible when the protocol is an instance of OnLabelledCollectionProtocol. The optimization is only carried out when the number of classification predictions thus generated would be smaller than the number of predictions required for the entire protocol; e.g., if the original dataset contains 1M instances, but the protocol is such that it would at most generate 20 samples of 100 instances, then it would be preferable to postpone the classification for each sample. This behaviour is indicated by setting aggr_speedup="auto". Conversely, when indicating aggr_speedup="force", QuaPy will precompute all the predictions irrespective of the number of instances and number of samples. Finally, this can be deactivated by setting aggr_speedup=False. Note that this optimization is not only applied for the final evaluation, but also for the internal evaluations carried out during model selection. Since these are typically many, the heuristic can help reduce the execution time a lot.
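For instance, reusing the quantifier and prot objects from above, the heuristic can be controlled directly in the call to evaluate (a small sketch based on the three aggr_speedup values just described):

mae = qp.evaluation.evaluate(quantifier, protocol=prot, error_metric='mae', aggr_speedup='auto')   # apply the heuristic described above
mae = qp.evaluation.evaluate(quantifier, protocol=prot, error_metric='mae', aggr_speedup='force')  # always precompute all classification predictions
mae = qp.evaluation.evaluate(quantifier, protocol=prot, error_metric='mae', aggr_speedup=False)    # deactivate the optimization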

-
-
- - -
-
-
-
- -
-
- - - - \ No newline at end of file diff --git a/docs/build/html/Installation.html b/docs/build/html/Installation.html deleted file mode 100644 index b63e795..0000000 --- a/docs/build/html/Installation.html +++ /dev/null @@ -1,178 +0,0 @@ - - - - - - - - - - Installation — QuaPy 0.1.7 documentation - - - - - - - - - - - - - - - - - - - -
-
-
-
- -
-

Installation

-

QuaPy can be easily installed via pip

-
pip install quapy
-
-
-

See pip page for older versions.

-
-

Requirements

-
    -
  • scikit-learn, numpy, scipy

  • -
  • pytorch (for QuaNet)

  • -
  • svmperf patched for quantification (see below)

  • -
  • joblib

  • -
  • tqdm

  • -
  • pandas, xlrd

  • -
  • matplotlib

  • -
-
-
-

SVM-perf with quantification-oriented losses

-

In order to run experiments involving SVM(Q), SVM(KLD), SVM(NKLD), SVM(AE), or SVM(RAE), you have to first download the svmperf package, apply the patch svm-perf-quantification-ext.patch, and compile the sources. The script prepare_svmperf.sh does all the work. Simply run:

-
./prepare_svmperf.sh
-
-
-

The resulting directory ./svm_perf_quantification contains the -patched version of svmperf with quantification-oriented losses.

-

The -svm-perf-quantification-ext.patch -is an extension of the patch made available by -Esuli et al. 2015 -that allows SVMperf to optimize for -the Q measure as proposed by -Barranquero et al. 2015 -and for the KLD and NKLD as proposed by -Esuli et al. 2015 -for quantification. -This patch extends the former by also allowing SVMperf to optimize for -AE and RAE.

-
-
- - -
-
-
-
- -
-
- - - - \ No newline at end of file diff --git a/docs/build/html/Methods.html b/docs/build/html/Methods.html deleted file mode 100644 index 8471f3d..0000000 --- a/docs/build/html/Methods.html +++ /dev/null @@ -1,539 +0,0 @@ - - - - - - - - - - Quantification Methods — QuaPy 0.1.7 documentation - - - - - - - - - - - - - - - - - - - -
-
-
-
- -
-

Quantification Methods

-

Quantification methods can be categorized as belonging to -aggregative and non-aggregative groups. -Most methods included in QuaPy at the moment are of type aggregative -(though we plan to add many more methods in the near future), i.e., -are methods characterized by the fact that -quantification is performed as an aggregation function of the individual -products of classification.

-

Any quantifier in QuaPy should extend the class BaseQuantifier, and implement some abstract methods:

-
    @abstractmethod
-    def fit(self, data: LabelledCollection): ...
-
-    @abstractmethod
-    def quantify(self, instances): ...
-
-
-

The meaning of those functions should be familiar to those used to working with scikit-learn, since the class structure of QuaPy is directly inspired by scikit-learn's Estimators. Functions fit and quantify are used to train the model and to provide class estimations (the reason why scikit-learn's structure has not been adopted as-is in QuaPy is that scikit-learn's predict function is expected to return one output for each input element (e.g., a predicted label for each instance in a sample), while in quantification the output for a sample is one single array of class prevalences). Quantifiers also extend scikit-learn's BaseEstimator, in order to simplify the use of set_params and get_params used in model selection.

-
-

Aggregative Methods

-

All quantification methods are implemented as part of the -qp.method package. In particular, aggregative methods are defined in -qp.method.aggregative, and extend AggregativeQuantifier(BaseQuantifier). -The methods that any aggregative quantifier must implement are:

-
    @abstractmethod
-    def fit(self, data: LabelledCollection, fit_learner=True): ...
-
-    @abstractmethod
-    def aggregate(self, classif_predictions:np.ndarray): ...
-
-
-

since, as mentioned before, aggregative methods base their prediction on the -individual predictions of a classifier. Indeed, a default implementation -of BaseQuantifier.quantify is already provided, which looks like:

-
    def quantify(self, instances):
-    classif_predictions = self.classify(instances)
-    return self.aggregate(classif_predictions)
-
-
-

Aggregative quantifiers are expected to maintain a classifier (which is -accessed through the @property classifier). This classifier is -given as input to the quantifier, and can be already fit -on external data (in which case, the fit_learner argument should -be set to False), or be fit by the quantifier’s fit (default).

-

Another class of aggregative methods are the probabilistic -aggregative methods, that should inherit from the abstract class -AggregativeProbabilisticQuantifier(AggregativeQuantifier). -The particularity of probabilistic aggregative methods (w.r.t. -non-probabilistic ones), is that the default quantifier is defined -in terms of the posterior probabilities returned by a probabilistic -classifier, and not by the crisp decisions of a hard classifier. -In any case, the interface classify(instances) remains unchanged.

-

One advantage of aggregative methods (either probabilistic or not) -is that the evaluation according to any sampling procedure (e.g., -the artificial sampling protocol) -can be achieved very efficiently, since the entire set can be pre-classified -once, and the quantification estimations for different samples can directly -reuse these predictions, without requiring to classify each element every time. -QuaPy leverages this property to speed-up any procedure having to do with -quantification over samples, as is customarily done in model selection or -in evaluation.

-
-

The Classify & Count variants

-

QuaPy implements the four CC variants, i.e.:

-
    -
  • CC (Classify & Count), the simplest aggregative quantifier; one that -simply relies on the label predictions of a classifier to deliver class estimates.

  • -
  • ACC (Adjusted Classify & Count), the adjusted variant of CC.

  • -
  • PCC (Probabilistic Classify & Count), the probabilistic variant of CC that -relies on the soft estimations (or posterior probabilities) returned by a (probabilistic) classifier.

  • -
  • PACC (Probabilistic Adjusted Classify & Count), the adjusted variant of PCC.

  • -
-

The following code serves as a complete example using CC equipped -with a SVM as the classifier:

-
import quapy as qp
-import quapy.functional as F
-from sklearn.svm import LinearSVC
-
-training, test = qp.datasets.fetch_twitter('hcr', pickle=True).train_test
-
-# instantiate a classifier learner, in this case a SVM
-svm = LinearSVC()
-
-# instantiate a Classify & Count with the SVM
-# (an alias is available in qp.method.aggregative.ClassifyAndCount)
-model = qp.method.aggregative.CC(svm)
-model.fit(training)
-estim_prevalence = model.quantify(test.instances)
-
-
-

The same code could be used to instantiate an ACC, by simply replacing -the instantiation of the model with:

-
model = qp.method.aggregative.ACC(svm)
-
-
-

Note that the adjusted variants (ACC and PACC) need to estimate some parameters for performing the adjustment (e.g., the true positive rate and the false positive rate in case of binary classification) that are estimated on a validation split of the labelled set. In this case, the init method of ACC defines an additional parameter, val_split, which, by default, is set to 0.4, meaning that 40% of the labelled data will be used for estimating the parameters for adjusting the predictions. This parameter can also be set to an integer, indicating that the parameters should be estimated by means of k-fold cross-validation, in which case the integer indicates the number k of folds. Finally, val_split can be set to a specific held-out validation set (i.e., an instance of LabelledCollection).

-

The specification of val_split can be postponed to the invocation of the fit method (if val_split was also set in the constructor, the one specified at fit time prevails), e.g.:

-
model = qp.method.aggregative.ACC(svm)
-# perform 5-fold cross validation for estimating ACC's parameters
-# (overrides the default val_split=0.4 in the constructor)
-model.fit(training, val_split=5)
-
-
-

The following code illustrates the case in which PCC is used:

-
model = qp.method.aggregative.PCC(svm)
-model.fit(training)
-estim_prevalence = model.quantify(test.instances)
-print('classifier:', model.classifier)
-
-
-

In this case, QuaPy will print:

-
The learner LinearSVC does not seem to be probabilistic. The learner will be calibrated.
-classifier: CalibratedClassifierCV(base_estimator=LinearSVC(), cv=5)
-
-
-

The first output indicates that the learner (LinearSVC in this case) is not a probabilistic classifier (i.e., it does not implement the predict_proba method) and so the classifier will be converted to a probabilistic one through calibration. As a result, the classifier printed in the second line points to a CalibratedClassifierCV instance. Note that calibration can only be applied to hard classifiers when fit_learner=True; an exception will be raised otherwise.

-

Lastly, everything we said about ACC and PCC applies to PACC as well.

-
-
-

Expectation Maximization (EMQ)

-

The Expectation Maximization Quantifier (EMQ), also known as -the SLD, is available at qp.method.aggregative.EMQ or via the -alias qp.method.aggregative.ExpectationMaximizationQuantifier. -The method is described in:

-

Saerens, M., Latinne, P., and Decaestecker, C. (2002). Adjusting the outputs of a classifier -to new a priori probabilities: A simple procedure. Neural Computation, 14(1):21–41.

-

EMQ works with a probabilistic classifier (if the classifier -given as input is a hard one, a calibration will be attempted). -Although this method was originally proposed for improving the -posterior probabilities of a probabilistic classifier, and not -for improving the estimation of prior probabilities, EMQ ranks -almost always among the most effective quantifiers in the -experiments we have carried out.

-

An example of use can be found below:

-
import quapy as qp
-from sklearn.linear_model import LogisticRegression
-
-dataset = qp.datasets.fetch_twitter('hcr', pickle=True)
-
-model = qp.method.aggregative.EMQ(LogisticRegression())
-model.fit(dataset.training)
-estim_prevalence = model.quantify(dataset.test.instances)
-
-
-

New in v0.1.7: EMQ now accepts two new parameters in the construction method, namely exact_train_prev, which allows using the true training prevalence as the departing prevalence estimation (default behaviour), or instead an approximation of it as suggested by Alexandari et al. (2020) (by setting exact_train_prev=False). The other parameter is recalib, which allows indicating a calibration method, among those proposed by Alexandari et al. (2020), including the Bias-Corrected Temperature Scaling, Vector Scaling, etc. See the API documentation for further details.
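For example, the following sketch instantiates EMQ with an estimated (rather than exact) training prevalence and a recalibration method; here we assume 'bcts' is the identifier for Bias-Corrected Temperature Scaling (check the API documentation for the admissible values of recalib):

import quapy as qp
from sklearn.linear_model import LogisticRegression

model = qp.method.aggregative.EMQ(
    LogisticRegression(),
    exact_train_prev=False,  # use the approximation suggested by Alexandari et al. (2020)
    recalib='bcts'           # assumed identifier for Bias-Corrected Temperature Scaling
)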

-
-
-

Hellinger Distance y (HDy)

-

Implementation of the method based on the Hellinger Distance y (HDy) proposed by -González-Castro, V., Alaiz-Rodrı́guez, R., and Alegre, E. (2013). Class distribution -estimation based on the Hellinger distance. Information Sciences, 218:146–164.

-

It is implemented in qp.method.aggregative.HDy (also accessible through the alias qp.method.aggregative.HellingerDistanceY). This method works with a probabilistic classifier (hard classifiers can be used as well and will be calibrated) and requires a validation set to estimate the parameters of the mixture model. Just like ACC and PACC, this quantifier receives a val_split argument in the constructor (or in the fit method, in which case the previous value is overridden) that can either be a float indicating the proportion of training data to be taken as the validation set (in a random stratified split), or a validation set (i.e., an instance of LabelledCollection) itself.

-

HDy was proposed as a binary quantification method and the implementation provided in QuaPy accepts only binary datasets.

-

The following code shows an example of use:

-
import quapy as qp
-from sklearn.linear_model import LogisticRegression
-
-# load a binary dataset
-dataset = qp.datasets.fetch_reviews('hp', pickle=True)
-qp.data.preprocessing.text2tfidf(dataset, min_df=5, inplace=True)
-
-model = qp.method.aggregative.HDy(LogisticRegression())
-model.fit(dataset.training)
-estim_prevalence = model.quantify(dataset.test.instances)
-
-
-

New in v0.1.7: QuaPy now provides an implementation of the generalized -“Distribution Matching” approaches for multiclass, inspired by the framework -of Firat (2016). One can instantiate -a variant of HDy for multiclass quantification as follows:

-
mutliclassHDy = qp.method.aggregative.DistributionMatching(classifier=LogisticRegression(), divergence='HD', cdf=False)
-
-
-

New in v0.1.7: QuaPy now provides an implementation of the “DyS” -framework proposed by Maletzke et al (2020) -and the “SMM” method proposed by Hassan et al (2019) -(thanks to Pablo González for the contributions!)

-
-
-

Threshold Optimization methods

-

New in v0.1.7: QuaPy now implements Forman’s threshold optimization methods; -see, e.g., (Forman 2006) -and (Forman 2008). -These include: T50, MAX, X, Median Sweep (MS), and its variant MS2.
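A minimal usage sketch, assuming these methods are exposed in qp.method.aggregative under the names listed above and follow the same fit/quantify interface as the other aggregative quantifiers:

import quapy as qp
from sklearn.linear_model import LogisticRegression

# load a binary dataset
dataset = qp.datasets.fetch_reviews('hp', pickle=True)
qp.data.preprocessing.text2tfidf(dataset, min_df=5, inplace=True)

# Median Sweep; T50, MAX, X, and MS2 would be instantiated analogously
model = qp.method.aggregative.MS(LogisticRegression())
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)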

-
-
-

Explicit Loss Minimization

-

Explicit Loss Minimization (ELM) represents a family of methods based on structured output learning, i.e., quantifiers relying on classifiers that have been optimized targeting a quantification-oriented evaluation measure. The original methods are implemented in QuaPy as classify & count (CC) quantifiers that use Joachims' SVMperf as the underlying classifier, properly set to optimize for the desired loss.

-

In QuaPy, this can be achieved by calling the corresponding functions:

- -

the last two methods (SVM(AE) and SVM(RAE)) have been implemented in -QuaPy in order to make available ELM variants for what nowadays -are considered the most well-behaved evaluation metrics in quantification.
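By way of illustration, the following sketch instantiates an SVM(Q)-style ELM quantifier using the SVMQ class (also used in the one-vs-all example further below); the other losses are instantiated analogously:

import quapy as qp
from quapy.method.aggregative import SVMQ

# let QuaPy know where the patched svmperf is (see the Installation instructions)
qp.environ['SVMPERF_HOME'] = '../svm_perf_quantification'

# load a binary dataset in tf-idf format
dataset = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5)

model = SVMQ()
model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)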

-

In order to make these models work, you would need to run the script prepare_svmperf.sh (distributed along with QuaPy) that downloads SVMperf's source code, applies a patch that implements the quantification-oriented losses, and compiles the sources.

-

If you want to add any custom loss, you would need to modify -the source code of SVMperf in order to implement it, and -assign a valid loss code to it. Then you must re-compile -the whole thing and instantiate the quantifier in QuaPy -as follows:

-
# you can either set the path to your custom svm_perf_quantification implementation
-# in the environment variable, or as an argument to the constructor of ELM
-qp.environ['SVMPERF_HOME'] = './path/to/svm_perf_quantification'
-
-# assign an alias to your custom loss and the id you have assigned to it
-svmperf = qp.classification.svmperf.SVMperf
-svmperf.valid_losses['mycustomloss'] = 28
-
-# instantiate the ELM method indicating the loss
-model = qp.method.aggregative.ELM(loss='mycustomloss')
-
-
-

All ELM methods are binary quantifiers, since they rely on SVMperf, which currently supports only binary classification. ELM variants (and any binary quantifier in general) can be trivially extended to operate in single-label scenarios by adopting a "one-vs-all" strategy (as, e.g., in Gao, W. and Sebastiani, F. (2016). From classification to quantification in tweet sentiment analysis. Social Network Analysis and Mining, 6(19):1–22). In QuaPy this is possible by using the OneVsAll class.

-

There are two ways for instantiating this class, OneVsAllGeneric that works for -any quantifier, and OneVsAllAggregative that is optimized for aggregative quantifiers. -In general, you can simply use the getOneVsAll function and QuaPy will choose -the more convenient of the two.

-
import quapy as qp
-from quapy.method.aggregative import SVMQ
-
-# load a single-label dataset (this one contains 3 classes)
-dataset = qp.datasets.fetch_twitter('hcr', pickle=True)
-
-# let qp know where svmperf is
-qp.environ['SVMPERF_HOME'] = '../svm_perf_quantification'
-
-model = getOneVsAll(SVMQ(), n_jobs=-1)  # run them on parallel
-model.fit(dataset.training)
-estim_prevalence = model.quantify(dataset.test.instances)
-
-
-

Check the examples explicit_loss_minimization.py -and one_vs_all.py for more details.

-
-
-
-

Meta Models

-

By meta models we mean quantification methods that are defined on top of other -quantification methods, and that thus do not squarely belong to the aggregative nor -the non-aggregative group (indeed, meta models could use quantifiers from any of those -groups). -Meta models are implemented in the qp.method.meta module.

-
-

Ensembles

-

QuaPy implements (some of) the variants proposed in:

- -

The following code shows how to instantiate an Ensemble of 30 Adjusted Classify & Count (ACC) -quantifiers operating with a Logistic Regressor (LR) as the base classifier, and using the -average as the aggregation policy (see the original article for further details). -The last parameter indicates to use all processors for parallelization.

-
import quapy as qp
-from quapy.method.aggregative import ACC
-from quapy.method.meta import Ensemble
-from sklearn.linear_model import LogisticRegression
-
-dataset = qp.datasets.fetch_UCIDataset('haberman')
-
-model = Ensemble(quantifier=ACC(LogisticRegression()), size=30, policy='ave', n_jobs=-1)
-model.fit(dataset.training)
-estim_prevalence = model.quantify(dataset.test.instances)
-
-
-

Other aggregation policies implemented in QuaPy include:

-
    -
  • ‘ptr’ for applying a dynamic selection based on the training prevalence of the ensemble’s members

  • -
  • ‘ds’ for applying a dynamic selection based on the Hellinger Distance

  • -
  • any valid quantification measure (e.g., ‘mse’) for performing a static selection based on -the performance estimated for each member of the ensemble in terms of that evaluation metric.

  • -
-

When using any of the above options, it is important to set the red_size parameter, which indicates the number of members to retain.
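For instance, a sketch of a dynamic-selection ensemble that retains 15 out of 50 members according to the training-prevalence policy mentioned above:

from sklearn.linear_model import LogisticRegression
from quapy.method.aggregative import ACC
from quapy.method.meta import Ensemble

# 'ptr' performs a dynamic selection based on the training prevalence of the members;
# red_size indicates how many of the 50 members are retained
model = Ensemble(quantifier=ACC(LogisticRegression()), size=50, red_size=15, policy='ptr', n_jobs=-1)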

-

Please, check the model selection -wiki if you want to optimize the hyperparameters of ensemble for classification or quantification.

-
-
-

The QuaNet neural network

-

QuaPy offers an implementation of QuaNet, a deep learning model presented in:

-

Esuli, A., Moreo, A., & Sebastiani, F. (2018, October). -A recurrent neural network for sentiment quantification. -In Proceedings of the 27th ACM International Conference on -Information and Knowledge Management (pp. 1775-1778).

-

This model requires torch to be installed. -QuaNet also requires a classifier that can provide embedded representations -of the inputs. -In the original paper, QuaNet was tested using an LSTM as the base classifier. -In the following example, we show an instantiation of QuaNet that instead uses CNN as a probabilistic classifier, taking its last layer representation as the document embedding:

-
import quapy as qp
-from quapy.method.meta import QuaNet
-from quapy.classification.neural import NeuralClassifierTrainer, CNNnet
-
-# use samples of 100 elements
-qp.environ['SAMPLE_SIZE'] = 100
-
-# load the kindle dataset as text, and convert words to numerical indexes
-dataset = qp.datasets.fetch_reviews('kindle', pickle=True)
-qp.data.preprocessing.index(dataset, min_df=5, inplace=True)
-
-# the text classifier is a CNN trained by NeuralClassifierTrainer
-cnn = CNNnet(dataset.vocabulary_size, dataset.n_classes)
-learner = NeuralClassifierTrainer(cnn, device='cuda')
-
-# train QuaNet
-model = QuaNet(learner, device='cuda')
-model.fit(dataset.training)
-estim_prevalence = model.quantify(dataset.test.instances)
-
-
-
-
-
- - -
-
-
-
- -
-
- - - - \ No newline at end of file diff --git a/docs/build/html/Model-Selection.html b/docs/build/html/Model-Selection.html deleted file mode 100644 index 03a399e..0000000 --- a/docs/build/html/Model-Selection.html +++ /dev/null @@ -1,268 +0,0 @@ - - - - - - - - - - Model Selection — QuaPy 0.1.7 documentation - - - - - - - - - - - - - - - - - - - -
-
-
-
- -
-

Model Selection

-

As a supervised machine learning task, quantification methods -can strongly depend on a good choice of model hyper-parameters. -The process whereby those hyper-parameters are chosen is -typically known as Model Selection, and typically consists of -testing different settings and picking the one that performed -best in a held-out validation set in terms of any given -evaluation measure.

-
-

Targeting a Quantification-oriented loss

-

The task being optimized determines the evaluation protocol, i.e., the criteria according to which the performance of any given method for solving it is to be assessed.

-

Quantification has long been regarded as an add-on of -classification, and thus the model selection strategies -customarily adopted in classification have simply been -applied to quantification (see the next section). -It has been argued in Moreo, Alejandro, and Fabrizio Sebastiani. -Re-Assessing the “Classify and Count” Quantification Method. -ECIR 2021: Advances in Information Retrieval pp 75–91. -that specific model selection strategies should -be adopted for quantification. That is, model selection -strategies for quantification should target -quantification-oriented losses and be tested in a variety -of scenarios exhibiting different degrees of prior -probability shift.

-

The class qp.model_selection.GridSearchQ implements a grid-search exploration over the space of -hyper-parameter combinations that evaluates -each combination of hyper-parameters by means of a given quantification-oriented -error metric (e.g., any of the error functions implemented -in qp.error) and according to a -sampling generation protocol.

-

The following is an example (also included in the examples folder) of model selection for quantification:

-
import quapy as qp
-from quapy.protocol import APP
-from quapy.method.aggregative import DistributionMatching
-from sklearn.linear_model import LogisticRegression
-import numpy as np
-
-"""
-In this example, we show how to perform model selection on a DistributionMatching quantifier.
-"""
-
-model = DistributionMatching(LogisticRegression())
-
-qp.environ['SAMPLE_SIZE'] = 100
-qp.environ['N_JOBS'] = -1  # explore hyper-parameters in parallel
-
-training, test = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=5).train_test
-
-# The model will be returned by the fit method of GridSearchQ.
-# Every combination of hyper-parameters will be evaluated by confronting the
-# quantifier thus configured against a series of samples generated by means
-# of a sample generation protocol. For this example, we will use the
-# artificial-prevalence protocol (APP), that generates samples with prevalence
-# values in the entire range of values from a grid (e.g., [0, 0.1, 0.2, ..., 1]).
-# We devote 30% of the dataset for this exploration.
-training, validation = training.split_stratified(train_prop=0.7)
-protocol = APP(validation)
-
-# We will explore a classification-dependent hyper-parameter (e.g., the 'C'
-# hyper-parameter of LogisticRegression) and a quantification-dependent hyper-parameter
-# (e.g., the number of bins in a DistributionMatching quantifier.
-# Classifier-dependent hyper-parameters have to be marked with a prefix "classifier__"
-# in order to let the quantifier know this hyper-parameter belongs to its underlying
-# classifier.
-param_grid = {
-    'classifier__C': np.logspace(-3,3,7),
-    'nbins': [8, 16, 32, 64],
-}
-
-model = qp.model_selection.GridSearchQ(
-    model=model,
-    param_grid=param_grid,
-    protocol=protocol,
-    error='mae',  # the error to optimize is the MAE (a quantification-oriented loss)
-    refit=True,   # retrain on the whole labelled set once done
-    verbose=True  # show information as the process goes on
-).fit(training)
-
-print(f'model selection ended: best hyper-parameters={model.best_params_}')
-model = model.best_model_
-
-# evaluation in terms of MAE
-# we use the same evaluation protocol (APP) on the test set
-mae_score = qp.evaluation.evaluate(model, protocol=APP(test), error_metric='mae')
-
-print(f'MAE={mae_score:.5f}')
-
-
-

In this example, the system outputs:

-
[GridSearchQ]: starting model selection with self.n_jobs =-1
-[GridSearchQ]: hyperparams={'classifier__C': 0.01, 'nbins': 64}	 got mae score 0.04021 [took 1.1356s]
-[GridSearchQ]: hyperparams={'classifier__C': 0.01, 'nbins': 32}	 got mae score 0.04286 [took 1.2139s]
-[GridSearchQ]: hyperparams={'classifier__C': 0.01, 'nbins': 16}	 got mae score 0.04888 [took 1.2491s]
-[GridSearchQ]: hyperparams={'classifier__C': 0.001, 'nbins': 8}	 got mae score 0.05163 [took 1.5372s]
-[...]
-[GridSearchQ]: hyperparams={'classifier__C': 1000.0, 'nbins': 32}	 got mae score 0.02445 [took 2.9056s]
-[GridSearchQ]: optimization finished: best params {'classifier__C': 100.0, 'nbins': 32} (score=0.02234) [took 7.3114s]
-[GridSearchQ]: refitting on the whole development set
-model selection ended: best hyper-parameters={'classifier__C': 100.0, 'nbins': 32}
-MAE=0.03102
-
-
-

The parameter val_split can alternatively be used to indicate -a validation set (i.e., an instance of LabelledCollection) instead -of a proportion. This could be useful if one wants to have control -on the specific data split to be used across different model selection -experiments.

-
-
-

Targeting a Classification-oriented loss

-

Optimizing a model for quantification can be computationally costly. In aggregative methods, one could alternatively try to optimize the classifier's hyper-parameters for classification. Although this is theoretically suboptimal, many articles in the quantification literature have opted for this strategy.

-

In QuaPy, this is achieved by simply instantiating the -classifier learner as a GridSearchCV from scikit-learn. -The following code illustrates how to do that:

-
learner = GridSearchCV(
-    LogisticRegression(),
-    param_grid={'C': np.logspace(-4, 5, 10), 'class_weight': ['balanced', None]},
-    cv=5)
-model = DistributionMatching(learner).fit(dataset.training)
-
-
-

However, this is conceptually flawed, since the model should be -optimized for the task at hand (quantification), and not for a surrogate task (classification), -i.e., the model should be requested to deliver low quantification errors, rather -than low classification errors.

-
-
- - -
-
-
-
- -
-
- - - - \ No newline at end of file diff --git a/docs/build/html/Plotting.html b/docs/build/html/Plotting.html deleted file mode 100644 index d41bef6..0000000 --- a/docs/build/html/Plotting.html +++ /dev/null @@ -1,350 +0,0 @@ - - - - - - - - - - Plotting — QuaPy 0.1.7 documentation - - - - - - - - - - - - - - - - - - - -
-
-
-
- -
-

Plotting

-

The module qp.plot implements some basic plotting functions -that can help analyse the performance of a quantification method.

-

All plotting functions receive as inputs the outcomes of -some experiments and include, for each experiment, -the following three main arguments:

-
    -
  • method_names a list containing the names of the quantification methods

  • -
  • true_prevs a list containing matrices of true prevalences

  • -
  • estim_prevs a list containing matrices of estimated prevalences -(should be of the same shape as the corresponding matrix in true_prevs)

  • -
-

Note that a method (as indicated by a name in method_names) can -appear more than once. This could occur when various datasets are -involved in the experiments. In this case, all experiments for the -method will be merged and the plot will represent the method’s -performance across various datasets.

-

This is a very simple example of a valid input for the plotting functions:

-
method_names = ['classify & count', 'EMQ', 'classify & count']
-true_prevs = [
-    np.array([[0.5, 0.5], [0.25, 0.75]]),
-    np.array([[0.0, 1.0], [0.25, 0.75], [0.0, 0.1]]),
-    np.array([[0.0, 1.0], [0.25, 0.75], [0.0, 0.1]]),
-]
-estim_prevs = [
-    np.array([[0.45, 0.55], [0.6, 0.4]]),
-    np.array([[0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]),
-    np.array([[0.1, 0.9], [0.3, 0.7], [0.0, 0.1]]),
-]
-
-
-

in which the classify & count has been tested in two datasets and -the EMQ method has been tested only in one dataset. For the first -experiment, only two (binary) quantifications have been tested, -while for the second and third experiments three instances have -been tested.

-

In general, we would like to test the performance of the -quantification methods across different scenarios showcasing -the accuracy of the quantifier in predicting class prevalences -for a wide range of prior distributions. This can easily be -achieved by means of the -artificial sampling protocol -that is implemented in QuaPy.

-

The following code shows how to perform one simple experiment -in which the 4 CC-variants, all equipped with a linear SVM, are -applied to one binary dataset of reviews about Kindle devices and -tested across the entire spectrum of class priors (taking 21 splits -of the interval [0,1], i.e., using prevalence steps of 0.05, and -generating 100 random samples at each prevalence).

-
import quapy as qp
-from quapy.protocol import APP
-import numpy as np  # needed later, when gen_data() is redefined using np.linspace
-from quapy.method.aggregative import CC, ACC, PCC, PACC
-from sklearn.svm import LinearSVC
-
-qp.environ['SAMPLE_SIZE'] = 500
-
-def gen_data():
-
-    def base_classifier():
-        return LinearSVC(class_weight='balanced')
-
-    def models():
-        yield 'CC', CC(base_classifier())
-        yield 'ACC', ACC(base_classifier())
-        yield 'PCC', PCC(base_classifier())
-        yield 'PACC', PACC(base_classifier())
-
-    train, test = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5).train_test
-
-    method_names, true_prevs, estim_prevs, tr_prevs = [], [], [], []
-
-    for method_name, model in models():
-        model.fit(train)
-        true_prev, estim_prev = qp.evaluation.prediction(model, APP(test, repeats=100, random_state=0))
-
-        method_names.append(method_name)
-        true_prevs.append(true_prev)
-        estim_prevs.append(estim_prev)
-        tr_prevs.append(train.prevalence())
-
-    return method_names, true_prevs, estim_prevs, tr_prevs
-
-method_names, true_prevs, estim_prevs, tr_prevs = gen_data()
-
-
-

the plots that can be generated are explained below.

-
-

Diagonal Plot

-

The diagonal plot shows a very insightful view of the quantifier’s performance. It plots the predicted class prevalence (on the y-axis) against the true class prevalence (on the x-axis). Unfortunately, it is limited to binary quantification, although one can simply generate as many diagonal plots as there are classes by indicating which class should be considered the target of the plot.

-

The following call will produce the plot:

-
qp.plot.binary_diagonal(method_names, true_prevs, estim_prevs, train_prev=tr_prevs[0], savepath='./plots/bin_diag.png')
-
-
-

the last argument is optional, and indicates the path in which to save the plot (the file extension will determine the format – typical extensions are ‘.png’ or ‘.pdf’). If this path is not provided, then the plot will be shown but not saved. The resulting plot should look like:

-

diagonal plot on Kindle

-

Note that in this case, we are also indicating the training prevalence, which is plotted on the diagonal as a cyan dot. The color bands indicate the standard deviations of the predictions, and can be hidden by setting the argument show_std=False (see the complete list of arguments in the documentation).
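For instance, the bands can be hidden and the plot saved in PDF format instead; this is a minimal sketch reusing only the arguments already introduced above (the output path is merely illustrative):

qp.plot.binary_diagonal(method_names, true_prevs, estim_prevs,
    train_prev=tr_prevs[0], show_std=False, savepath='./plots/bin_diag.pdf')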

-

Finally, note how most quantifiers, and especially the “unadjusted” variants CC and PCC, are strongly biased towards the prevalence seen during training.

-
-
-

Quantification bias

-

This plot aims at evincing the bias that any quantifier displays with respect to the training prevalences by means of box plots. It can be generated by:

-
qp.plot.binary_bias_global(method_names, true_prevs, estim_prevs, savepath='./plots/bin_bias.png')
-
-
-

and should look like:

-

bias plot on Kindle

-

The box plots show some interesting facts:

-
  • all methods are biased towards the training prevalence, but especially so CC and PCC (an unbiased quantifier would have a box centered at 0)

  • the bias is always positive, indicating that all methods tend to overestimate the positive class prevalence

  • CC and PCC have high variability, while ACC and especially PACC exhibit lower variability.

Again, these plots could be generated for experiments ranging across -different datasets, and the plot will merge all data accordingly.
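For instance, assuming gen_data() were parameterized by the dataset name (a hypothetical variation of the function defined above), the outcomes of several runs could simply be concatenated before plotting; repeated method names are then merged automatically:

# hypothetical sketch: run the same experiment on two datasets and concatenate the outcomes
method_names, true_prevs, estim_prevs, tr_prevs = [], [], [], []
for dataset_name in ['kindle', 'imdb']:
    names_i, true_i, estim_i, tr_i = gen_data(dataset_name)  # assumes gen_data takes the dataset name
    method_names += list(names_i)
    true_prevs += list(true_i)
    estim_prevs += list(estim_i)
    tr_prevs += list(tr_i)
qp.plot.binary_bias_global(method_names, true_prevs, estim_prevs, savepath='./plots/bin_bias_all.png')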

-

Another illustrative example consists of training different CC quantifiers at different (artificially sampled) training prevalences. For this example, we generate training samples of 5000 documents containing 10%, 20%, …, 90% of positives from the IMDb dataset, and generate the bias plot again. This example can be run by rewriting the gen_data() function like this:

-
def gen_data():
-
-    train, test = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=5).train_test
-    model = CC(LinearSVC())
-
-    method_data = []
-    for training_prevalence in np.linspace(0.1, 0.9, 9):
-        training_size = 5000
-        # since the problem is binary, it suffices to specify the negative prevalence, since the positive is constrained
-        train_sample = train.sampling(training_size, 1-training_prevalence)
-        model.fit(train_sample)
-        true_prev, estim_prev = qp.evaluation.prediction(model, APP(test, repeats=100, random_state=0))
-        method_name = 'CC$_{'+f'{int(100*training_prevalence)}' + '\%}$'
-        method_data.append((method_name, true_prev, estim_prev, train_sample.prevalence()))
-
-    return zip(*method_data)
-
-
-

and the plot should now look like:

-

bias plot on IMDb

-

which clearly shows a negative bias for CC variants trained on -data containing more negatives (i.e., < 50%) and positive biases -in cases containing more positives (i.e., >50%). The CC trained -at 50% behaves as an unbiased estimator of the positive class -prevalence.

-

The function qp.plot.binary_bias_bins allows the user to generate box plots broken down by bins of true test prevalence. To this aim, an argument nbins is passed, which indicates how many isometric subintervals to take. For example, the plot below is produced for nbins=3:
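A call along the following lines should produce it; this is just a sketch that assumes binary_bias_bins accepts the same main arguments as the other binary plotting functions shown above, plus the nbins argument (the output path is merely illustrative):

qp.plot.binary_bias_bins(method_names, true_prevs, estim_prevs,
    nbins=3, savepath='./plots/bin_bias_bin.png')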

-

bias plot on IMDb

-

Interestingly enough, the seemingly unbiased estimator (CC at 50%) happens to display -a positive bias (or a tendency to overestimate) in cases of low prevalence -(i.e., when the true prevalence of the positive class is below 33%), -and a negative bias (or a tendency to underestimate) in cases of high prevalence -(i.e., when the true prevalence is beyond 67%).

-

Out of curiosity, the diagonal plot for this experiment looks like:

-

diag plot on IMDb

-

showing pretty clearly the dependency of CC on the prior probabilities -of the labeled set it was trained on.

-
-
-

Error by Drift

-

The plots discussed above are useful for analyzing and comparing the performance of different quantification methods, but are limited to the binary case. The “error by drift” is a plot that shows the error in predictions as a function of the (prior probability) drift between each test sample and the training set. Interestingly, the error and the drift can both be measured in terms of any evaluation measure for quantification (like the ones available in qp.error) and can thus be computed irrespective of the number of classes.

-

The following shows how to generate the plot for the 4 CC variants, using 10 bins for the drift and absolute error as the error measure (the drift on the x-axis is always computed in terms of absolute error, since other errors are harder to interpret):

-
qp.plot.error_by_drift(method_names, true_prevs, estim_prevs, tr_prevs, 
-    error_name='ae', n_bins=10, savepath='./plots/err_drift.png')
-
-
-

error-by-drift plot on IMDb

-

Note that all methods work reasonably well in cases of low prevalence -drift (i.e., any CC-variant is a good quantifier whenever the IID -assumption is approximately preserved). The higher the drift, the worse -those quantifiers tend to perform, although it is clear that PACC -yields the lowest error for the most difficult cases.

-

Remember that any plot can be generated across many datasets, and that this would probably result in a more solid comparison. In those cases, however, it is likely that the variances of each method get higher, to the detriment of the visualization. We recommend setting show_std=False in those cases in order to hide the color bands.
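For instance, in such a multi-dataset comparison the error-by-drift plot could be generated with the bands suppressed; this is a sketch, assuming error_by_drift accepts the show_std argument mentioned above (the output path is merely illustrative):

qp.plot.error_by_drift(method_names, true_prevs, estim_prevs, tr_prevs,
    error_name='ae', n_bins=10, show_std=False, savepath='./plots/err_drift_all.png')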

- - - - \ No newline at end of file diff --git a/docs/build/html/_images/bin_bias.png b/docs/build/html/_images/bin_bias.png deleted file mode 100644 index 572dae4..0000000 Binary files a/docs/build/html/_images/bin_bias.png and /dev/null differ diff --git a/docs/build/html/_images/bin_bias_bin_cc.png b/docs/build/html/_images/bin_bias_bin_cc.png deleted file mode 100644 index db34c76..0000000 Binary files a/docs/build/html/_images/bin_bias_bin_cc.png and /dev/null differ diff --git a/docs/build/html/_images/bin_bias_cc.png b/docs/build/html/_images/bin_bias_cc.png deleted file mode 100644 index db91dd4..0000000 Binary files a/docs/build/html/_images/bin_bias_cc.png and /dev/null differ diff --git a/docs/build/html/_images/bin_diag.png b/docs/build/html/_images/bin_diag.png deleted file mode 100644 index 7ded71a..0000000 Binary files a/docs/build/html/_images/bin_diag.png and /dev/null differ diff --git a/docs/build/html/_images/bin_diag_cc.png b/docs/build/html/_images/bin_diag_cc.png deleted file mode 100644 index 01bb43d..0000000 Binary files a/docs/build/html/_images/bin_diag_cc.png and /dev/null differ diff --git a/docs/build/html/_images/err_drift.png b/docs/build/html/_images/err_drift.png deleted file mode 100644 index 496b66c..0000000 Binary files a/docs/build/html/_images/err_drift.png and /dev/null differ diff --git a/docs/build/html/_sources/Datasets.md.txt b/docs/build/html/_sources/Datasets.md.txt deleted file mode 100644 index d5e7563..0000000 --- a/docs/build/html/_sources/Datasets.md.txt +++ /dev/null @@ -1,356 +0,0 @@ -# Datasets - -QuaPy makes available several datasets that have been used in -quantification literature, as well as an interface to allow -anyone import their custom datasets. - -A _Dataset_ object in QuaPy is roughly a pair of _LabelledCollection_ objects, -one playing the role of the training set, another the test set. -_LabelledCollection_ is a data class consisting of the (iterable) -instances and labels. This class handles most of the sampling functionality in QuaPy. 
-Take a look at the following code: - -```python -import quapy as qp -import quapy.functional as F - -instances = [ - '1st positive document', '2nd positive document', - 'the only negative document', - '1st neutral document', '2nd neutral document', '3rd neutral document' -] -labels = [2, 2, 0, 1, 1, 1] - -data = qp.data.LabelledCollection(instances, labels) -print(F.strprev(data.prevalence(), prec=2)) -``` - -Output the class prevalences (showing 2 digit precision): -``` -[0.17, 0.50, 0.33] -``` - -One can easily produce new samples at desired class prevalence values: - -```python -sample_size = 10 -prev = [0.4, 0.1, 0.5] -sample = data.sampling(sample_size, *prev) - -print('instances:', sample.instances) -print('labels:', sample.labels) -print('prevalence:', F.strprev(sample.prevalence(), prec=2)) -``` - -Which outputs: -``` -instances: ['the only negative document' '2nd positive document' - '2nd positive document' '2nd neutral document' '1st positive document' - 'the only negative document' 'the only negative document' - 'the only negative document' '2nd positive document' - '1st positive document'] -labels: [0 2 2 1 2 0 0 0 2 2] -prevalence: [0.40, 0.10, 0.50] -``` - -Samples can be made consistent across different runs (e.g., to test -different methods on the same exact samples) by sampling and retaining -the indexes, that can then be used to generate the sample: - -```python -index = data.sampling_index(sample_size, *prev) -for method in methods: - sample = data.sampling_from_index(index) - ... -``` - -However, generating samples for evaluation purposes is tackled in QuaPy -by means of the evaluation protocols (see the dedicated entries in the Wiki -for [evaluation](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation) and -[protocols](https://github.com/HLT-ISTI/QuaPy/wiki/Protocols)). - - -## Reviews Datasets - -Three datasets of reviews about Kindle devices, Harry Potter's series, and -the well-known IMDb movie reviews can be fetched using a unified interface. -For example: - -```python -import quapy as qp -data = qp.datasets.fetch_reviews('kindle') -``` - -These datasets have been used in: -``` -Esuli, A., Moreo, A., & Sebastiani, F. (2018, October). -A recurrent neural network for sentiment quantification. -In Proceedings of the 27th ACM International Conference on -Information and Knowledge Management (pp. 1775-1778). -``` - -The list of reviews ids is available in: - -```python -qp.datasets.REVIEWS_SENTIMENT_DATASETS -``` - -Some statistics of the fhe available datasets are summarized below: - -| Dataset | classes | train size | test size | train prev | test prev | type | -|---|:---:|:---:|:---:|:---:|:---:|---| -| hp | 2 | 9533 | 18399 | [0.018, 0.982] | [0.065, 0.935] | text | -| kindle | 2 | 3821 | 21591 | [0.081, 0.919] | [0.063, 0.937] | text | -| imdb | 2 | 25000 | 25000 | [0.500, 0.500] | [0.500, 0.500] | text | - - -## Twitter Sentiment Datasets - -11 Twitter datasets for sentiment analysis. -Text is not accessible, and the documents were made available -in tf-idf format. Each dataset presents two splits: a train/val -split for model selection purposes, and a train+val/test split -for model evaluation. The following code exemplifies how to load -a twitter dataset for model selection. - -```python -import quapy as qp -data = qp.datasets.fetch_twitter('gasp', for_model_selection=True) -``` - -The datasets were used in: - -``` -Gao, W., & Sebastiani, F. (2015, August). -Tweet sentiment: From classification to quantification. 
-In 2015 IEEE/ACM International Conference on Advances in -Social Networks Analysis and Mining (ASONAM) (pp. 97-104). IEEE. -``` - -Three of the datasets (semeval13, semeval14, and semeval15) share the -same training set (semeval), meaning that the training split one would get -when requesting any of them is the same. The dataset "semeval" can only -be requested with "for_model_selection=True". -The lists of the Twitter dataset's ids can be consulted in: - -```python -# a list of 11 dataset ids that can be used for model selection or model evaluation -qp.datasets.TWITTER_SENTIMENT_DATASETS_TEST - -# 9 dataset ids in which "semeval13", "semeval14", and "semeval15" are replaced with "semeval" -qp.datasets.TWITTER_SENTIMENT_DATASETS_TRAIN -``` - -Some details can be found below: - -| Dataset | classes | train size | test size | features | train prev | test prev | type | -|---|:---:|:---:|:---:|:---:|:---:|:---:|---| -| gasp | 3 | 8788 | 3765 | 694582 | [0.421, 0.496, 0.082] | [0.407, 0.507, 0.086] | sparse | -| hcr | 3 | 1594 | 798 | 222046 | [0.546, 0.211, 0.243] | [0.640, 0.167, 0.193] | sparse | -| omd | 3 | 1839 | 787 | 199151 | [0.463, 0.271, 0.266] | [0.437, 0.283, 0.280] | sparse | -| sanders | 3 | 2155 | 923 | 229399 | [0.161, 0.691, 0.148] | [0.164, 0.688, 0.148] | sparse | -| semeval13 | 3 | 11338 | 3813 | 1215742 | [0.159, 0.470, 0.372] | [0.158, 0.430, 0.412] | sparse | -| semeval14 | 3 | 11338 | 1853 | 1215742 | [0.159, 0.470, 0.372] | [0.109, 0.361, 0.530] | sparse | -| semeval15 | 3 | 11338 | 2390 | 1215742 | [0.159, 0.470, 0.372] | [0.153, 0.413, 0.434] | sparse | -| semeval16 | 3 | 8000 | 2000 | 889504 | [0.157, 0.351, 0.492] | [0.163, 0.341, 0.497] | sparse | -| sst | 3 | 2971 | 1271 | 376132 | [0.261, 0.452, 0.288] | [0.207, 0.481, 0.312] | sparse | -| wa | 3 | 2184 | 936 | 248563 | [0.305, 0.414, 0.281] | [0.282, 0.446, 0.272] | sparse | -| wb | 3 | 4259 | 1823 | 404333 | [0.270, 0.392, 0.337] | [0.274, 0.392, 0.335] | sparse | - - -## UCI Machine Learning - -A set of 32 datasets from the [UCI Machine Learning repository](https://archive.ics.uci.edu/ml/datasets.php) -used in: - -``` -Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). -Using ensembles for problems with characterizable changes -in data distribution: A case study on quantification. -Information Fusion, 34, 87-100. -``` - -The list does not exactly coincide with that used in Pérez-Gállego et al. 2017 -since we were unable to find the datasets with ids "diabetes" and "phoneme". - -These dataset can be loaded by calling, e.g.: - -```python -import quapy as qp -data = qp.datasets.fetch_UCIDataset('yeast', verbose=True) -``` - -This call will return a _Dataset_ object in which the training and -test splits are randomly drawn, in a stratified manner, from the whole -collection at 70% and 30%, respectively. The _verbose=True_ option indicates -that the dataset description should be printed in standard output. -The original data is not split, -and some papers submit the entire collection to a kFCV validation. -In order to accommodate with these practices, one could first instantiate -the entire collection, and then creating a generator that will return one -training+test dataset at a time, following a kFCV protocol: - -```python -import quapy as qp -collection = qp.datasets.fetch_UCILabelledCollection("yeast") -for data in qp.data.Dataset.kFCV(collection, nfolds=5, nrepeats=2): - ... -``` - -Above code will allow to conduct a 2x5FCV evaluation on the "yeast" dataset. 
- -All datasets come in numerical form (dense matrices); some statistics -are summarized below. - -| Dataset | classes | instances | features | prev | type | -|---|:---:|:---:|:---:|:---:|---| -| acute.a | 2 | 120 | 6 | [0.508, 0.492] | dense | -| acute.b | 2 | 120 | 6 | [0.583, 0.417] | dense | -| balance.1 | 2 | 625 | 4 | [0.539, 0.461] | dense | -| balance.2 | 2 | 625 | 4 | [0.922, 0.078] | dense | -| balance.3 | 2 | 625 | 4 | [0.539, 0.461] | dense | -| breast-cancer | 2 | 683 | 9 | [0.350, 0.650] | dense | -| cmc.1 | 2 | 1473 | 9 | [0.573, 0.427] | dense | -| cmc.2 | 2 | 1473 | 9 | [0.774, 0.226] | dense | -| cmc.3 | 2 | 1473 | 9 | [0.653, 0.347] | dense | -| ctg.1 | 2 | 2126 | 22 | [0.222, 0.778] | dense | -| ctg.2 | 2 | 2126 | 22 | [0.861, 0.139] | dense | -| ctg.3 | 2 | 2126 | 22 | [0.917, 0.083] | dense | -| german | 2 | 1000 | 24 | [0.300, 0.700] | dense | -| haberman | 2 | 306 | 3 | [0.735, 0.265] | dense | -| ionosphere | 2 | 351 | 34 | [0.641, 0.359] | dense | -| iris.1 | 2 | 150 | 4 | [0.667, 0.333] | dense | -| iris.2 | 2 | 150 | 4 | [0.667, 0.333] | dense | -| iris.3 | 2 | 150 | 4 | [0.667, 0.333] | dense | -| mammographic | 2 | 830 | 5 | [0.514, 0.486] | dense | -| pageblocks.5 | 2 | 5473 | 10 | [0.979, 0.021] | dense | -| semeion | 2 | 1593 | 256 | [0.901, 0.099] | dense | -| sonar | 2 | 208 | 60 | [0.534, 0.466] | dense | -| spambase | 2 | 4601 | 57 | [0.606, 0.394] | dense | -| spectf | 2 | 267 | 44 | [0.794, 0.206] | dense | -| tictactoe | 2 | 958 | 9 | [0.653, 0.347] | dense | -| transfusion | 2 | 748 | 4 | [0.762, 0.238] | dense | -| wdbc | 2 | 569 | 30 | [0.627, 0.373] | dense | -| wine.1 | 2 | 178 | 13 | [0.669, 0.331] | dense | -| wine.2 | 2 | 178 | 13 | [0.601, 0.399] | dense | -| wine.3 | 2 | 178 | 13 | [0.730, 0.270] | dense | -| wine-q-red | 2 | 1599 | 11 | [0.465, 0.535] | dense | -| wine-q-white | 2 | 4898 | 11 | [0.335, 0.665] | dense | -| yeast | 2 | 1484 | 8 | [0.711, 0.289] | dense | - -### Issues: -All datasets will be downloaded automatically the first time they are requested, and -stored in the _quapy_data_ folder for faster further reuse. -However, some datasets require special actions that at the moment are not fully -automated. - -* Datasets with ids "ctg.1", "ctg.2", and "ctg.3" (_Cardiotocography Data Set_) load -an Excel file, which requires the user to install the _xlrd_ Python module in order -to open it. -* The dataset with id "pageblocks.5" (_Page Blocks Classification (5)_) needs to -open a "unix compressed file" (extension .Z), which is not directly doable with -standard Pythons packages like gzip or zip. This file would need to be uncompressed using -OS-dependent software manually. Information on how to do it will be printed the first -time the dataset is invoked. - -## LeQua Datasets - -QuaPy also provides the datasets used for the LeQua competition. -In brief, there are 4 tasks (T1A, T1B, T2A, T2B) having to do with text quantification -problems. Tasks T1A and T1B provide documents in vector form, while T2A and T2B provide -raw documents instead. -Tasks T1A and T2A are binary sentiment quantification problems, while T2A and T2B -are multiclass quantification problems consisting of estimating the class prevalence -values of 28 different merchandise products. - -Every task consists of a training set, a set of validation samples (for model selection) -and a set of test samples (for evaluation). 
QuaPy returns this data as a LabelledCollection -(training) and two generation protocols (for validation and test samples), as follows: - -```python -training, val_generator, test_generator = fetch_lequa2022(task=task) -``` - -See the `lequa2022_experiments.py` in the examples folder for further details on how to -carry out experiments using these datasets. - -The datasets are downloaded only once, and stored for fast reuse. - -Some statistics are summarized below: - -| Dataset | classes | train size | validation samples | test samples | docs by sample | type | -|---------|:-------:|:----------:|:------------------:|:------------:|:----------------:|:--------:| -| T1A | 2 | 5000 | 1000 | 5000 | 250 | vector | -| T1B | 28 | 20000 | 1000 | 5000 | 1000 | vector | -| T2A | 2 | 5000 | 1000 | 5000 | 250 | text | -| T2B | 28 | 20000 | 1000 | 5000 | 1000 | text | - -For further details on the datasets, we refer to the original -[paper](https://ceur-ws.org/Vol-3180/paper-146.pdf): - -``` -Esuli, A., Moreo, A., Sebastiani, F., & Sperduti, G. (2022). -A Detailed Overview of LeQua@ CLEF 2022: Learning to Quantify. -``` - -## Adding Custom Datasets - -QuaPy provides data loaders for simple formats dealing with -text, following the format: - -``` -class-id \t first document's pre-processed text \n -class-id \t second document's pre-processed text \n -... -``` - -and sparse representations of the form: - -``` -{-1, 0, or +1} col(int):val(float) col(int):val(float) ... \n -... -``` - -The code in charge in loading a LabelledCollection is: - -```python -@classmethod -def load(cls, path:str, loader_func:callable): - return LabelledCollection(*loader_func(path)) -``` - -indicating that any _loader_func_ (e.g., a user-defined one) which -returns valid arguments for initializing a _LabelledCollection_ object will allow -to load any collection. In particular, the _LabelledCollection_ receives as -arguments the instances (as an iterable) and the labels (as an iterable) and, -additionally, the number of classes can be specified (it would otherwise be -inferred from the labels, but that requires at least one positive example for -all classes to be present in the collection). - -The same _loader_func_ can be passed to a Dataset, along with two -paths, in order to create a training and test pair of _LabelledCollection_, -e.g.: - -```python -import quapy as qp - -train_path = '../my_data/train.dat' -test_path = '../my_data/test.dat' - -def my_custom_loader(path): - with open(path, 'rb') as fin: - ... - return instances, labels - -data = qp.data.Dataset.load(train_path, test_path, my_custom_loader) -``` - -### Data Processing - -QuaPy implements a number of preprocessing functions in the package _qp.data.preprocessing_, including: - -* _text2tfidf_: tfidf vectorization -* _reduce_columns_: reducing the number of columns based on term frequency -* _standardize_: transforms the column values into z-scores (i.e., subtract the mean and normalizes by the standard deviation, so -that the column values have zero mean and unit variance). -* _index_: transforms textual tokens into lists of numeric ids) diff --git a/docs/build/html/_sources/Evaluation.md.txt b/docs/build/html/_sources/Evaluation.md.txt deleted file mode 100644 index a0175d2..0000000 --- a/docs/build/html/_sources/Evaluation.md.txt +++ /dev/null @@ -1,169 +0,0 @@ -# Evaluation - -Quantification is an appealing tool in scenarios of dataset shift, -and particularly in scenarios of prior-probability shift. 
-That is, the interest in estimating the class prevalences arises -under the belief that those class prevalences might have changed -with respect to the ones observed during training. -In other words, one could simply return the training prevalence -as a predictor of the test prevalence if this change is assumed -to be unlikely (as is the case in general scenarios of -machine learning governed by the iid assumption). -In brief, quantification requires dedicated evaluation protocols, -which are implemented in QuaPy and explained here. - -## Error Measures - -The module quapy.error implements the following error measures for quantification: -* _mae_: mean absolute error -* _mrae_: mean relative absolute error -* _mse_: mean squared error -* _mkld_: mean Kullback-Leibler Divergence -* _mnkld_: mean normalized Kullback-Leibler Divergence - -Functions _ae_, _rae_, _se_, _kld_, and _nkld_ are also available, -which return the individual errors (i.e., without averaging the whole). - -Some errors of classification are also available: -* _acce_: accuracy error (1-accuracy) -* _f1e_: F-1 score error (1-F1 score) - -The error functions implement the following interface, e.g.: - -```python -mae(true_prevs, prevs_hat) -``` - -in which the first argument is a ndarray containing the true -prevalences, and the second argument is another ndarray with -the estimations produced by some method. - -Some error functions, e.g., _mrae_, _mkld_, and _mnkld_, are -smoothed for numerical stability. In those cases, there is a -third argument, e.g.: - -```python -def mrae(true_prevs, prevs_hat, eps=None): ... -``` - -indicating the value for the smoothing parameter epsilon. -Traditionally, this value is set to 1/(2T) in past literature, -with T the sampling size. One could either pass this value -to the function each time, or to set a QuaPy's environment -variable _SAMPLE_SIZE_ once, and omit this argument -thereafter (recommended); -e.g.: - -```python -qp.environ['SAMPLE_SIZE'] = 100 # once for all -true_prev = np.asarray([0.5, 0.3, 0.2]) # let's assume 3 classes -estim_prev = np.asarray([0.1, 0.3, 0.6]) -error = qp.error.mrae(true_prev, estim_prev) -print(f'mrae({true_prev}, {estim_prev}) = {error:.3f}') -``` - -will print: -``` -mrae([0.500, 0.300, 0.200], [0.100, 0.300, 0.600]) = 0.914 -``` - -Finally, it is possible to instantiate QuaPy's quantification -error functions from strings using, e.g.: - -```python -error_function = qp.error.from_name('mse') -error = error_function(true_prev, estim_prev) -``` - -## Evaluation Protocols - -An _evaluation protocol_ is an evaluation procedure that uses -one specific _sample generation procotol_ to genereate many -samples, typically characterized by widely varying amounts of -_shift_ with respect to the original distribution, that are then -used to evaluate the performance of a (trained) quantifier. -These protocols are explained in more detail in a dedicated [entry -in the wiki](Protocols.md). For the moment being, let us assume we already have -chosen and instantiated one specific such protocol, that we here -simply call _prot_. Let also assume our model is called -_quantifier_ and that our evaluatio measure of choice is -_mae_. The evaluation comes down to: - -```python -mae = qp.evaluation.evaluate(quantifier, protocol=prot, error_metric='mae') -print(f'MAE = {mae:.4f}') -``` - -It is often desirable to evaluate our system using more than one -single evaluatio measure. In this case, it is convenient to generate -a _report_. 
A report in QuaPy is a dataframe accounting for all the -true prevalence values with their corresponding prevalence values -as estimated by the quantifier, along with the error each has given -rise. - -```python -report = qp.evaluation.evaluation_report(quantifier, protocol=prot, error_metrics=['mae', 'mrae', 'mkld']) -``` - -From a pandas' dataframe, it is straightforward to visualize all the results, -and compute the averaged values, e.g.: - -```python -pd.set_option('display.expand_frame_repr', False) -report['estim-prev'] = report['estim-prev'].map(F.strprev) -print(report) - -print('Averaged values:') -print(report.mean()) -``` - -This will produce an output like: - -``` - true-prev estim-prev mae mrae mkld -0 [0.308, 0.692] [0.314, 0.686] 0.005649 0.013182 0.000074 -1 [0.896, 0.104] [0.909, 0.091] 0.013145 0.069323 0.000985 -2 [0.848, 0.152] [0.809, 0.191] 0.039063 0.149806 0.005175 -3 [0.016, 0.984] [0.033, 0.967] 0.017236 0.487529 0.005298 -4 [0.728, 0.272] [0.751, 0.249] 0.022769 0.057146 0.001350 -... ... ... ... ... ... -4995 [0.72, 0.28] [0.698, 0.302] 0.021752 0.053631 0.001133 -4996 [0.868, 0.132] [0.888, 0.112] 0.020490 0.088230 0.001985 -4997 [0.292, 0.708] [0.298, 0.702] 0.006149 0.014788 0.000090 -4998 [0.24, 0.76] [0.220, 0.780] 0.019950 0.054309 0.001127 -4999 [0.948, 0.052] [0.965, 0.035] 0.016941 0.165776 0.003538 - -[5000 rows x 5 columns] -Averaged values: -mae 0.023588 -mrae 0.108779 -mkld 0.003631 -dtype: float64 - -Process finished with exit code 0 -``` - -Alternatively, we can simply generate all the predictions by: - -```python -true_prevs, estim_prevs = qp.evaluation.prediction(quantifier, protocol=prot) -``` - -All the evaluation functions implement specific optimizations for speeding-up -the evaluation of aggregative quantifiers (i.e., of instances of _AggregativeQuantifier_). -The optimization comes down to generating classification predictions (either crisp or soft) -only once for the entire test set, and then applying the sampling procedure to the -predictions, instead of generating samples of instances and then computing the -classification predictions every time. This is only possible when the protocol -is an instance of _OnLabelledCollectionProtocol_. The optimization is only -carried out when the number of classification predictions thus generated would be -smaller than the number of predictions required for the entire protocol; e.g., -if the original dataset contains 1M instances, but the protocol is such that it would -at most generate 20 samples of 100 instances, then it would be preferable to postpone the -classification for each sample. This behaviour is indicated by setting -_aggr_speedup="auto"_. Conversely, when indicating _aggr_speedup="force"_ QuaPy will -precompute all the predictions irrespectively of the number of instances and number of samples. -Finally, this can be deactivated by setting _aggr_speedup=False_. Note that this optimization -is not only applied for the final evaluation, but also for the internal evaluations carried -out during _model selection_. Since these are typically many, the heuristic can help reduce the -execution time a lot. 
\ No newline at end of file diff --git a/docs/build/html/_sources/Installation.rst.txt b/docs/build/html/_sources/Installation.rst.txt deleted file mode 100644 index 0eaabd6..0000000 --- a/docs/build/html/_sources/Installation.rst.txt +++ /dev/null @@ -1,56 +0,0 @@ -Installation ------------- - -QuaPy can be easily installed via `pip` - -:: - - pip install quapy - -See `pip page `_ for older versions. - -Requirements -************ - -* scikit-learn, numpy, scipy -* pytorch (for QuaNet) -* svmperf patched for quantification (see below) -* joblib -* tqdm -* pandas, xlrd -* matplotlib - - -SVM-perf with quantification-oriented losses -******************************************** - -In order to run experiments involving SVM(Q), SVM(KLD), SVM(NKLD), -SVM(AE), or SVM(RAE), you have to first download the -`svmperf `_ -package, apply the patch -`svm-perf-quantification-ext.patch `_, -and compile the sources. -The script -`prepare_svmperf.sh `_, -does all the job. Simply run: - -:: - - ./prepare_svmperf.sh - - -The resulting directory `./svm_perf_quantification` contains the -patched version of `svmperf` with quantification-oriented losses. - -The -`svm-perf-quantification-ext.patch `_ -is an extension of the patch made available by -`Esuli et al. 2015 `_ -that allows SVMperf to optimize for -the `Q` measure as proposed by -`Barranquero et al. 2015 `_ -and for the `KLD` and `NKLD` as proposed by -`Esuli et al. 2015 `_ -for quantification. -This patch extends the former by also allowing SVMperf to optimize for -`AE` and `RAE`. \ No newline at end of file diff --git a/docs/build/html/_sources/Methods.md.txt b/docs/build/html/_sources/Methods.md.txt deleted file mode 100644 index 7060a0a..0000000 --- a/docs/build/html/_sources/Methods.md.txt +++ /dev/null @@ -1,438 +0,0 @@ -# Quantification Methods - -Quantification methods can be categorized as belonging to -_aggregative_ and _non-aggregative_ groups. -Most methods included in QuaPy at the moment are of type _aggregative_ -(though we plan to add many more methods in the near future), i.e., -are methods characterized by the fact that -quantification is performed as an aggregation function of the individual -products of classification. - -Any quantifier in QuaPy shoud extend the class _BaseQuantifier_, -and implement some abstract methods: -```python - @abstractmethod - def fit(self, data: LabelledCollection): ... - - @abstractmethod - def quantify(self, instances): ... -``` -The meaning of those functions should be familiar to those -used to work with scikit-learn since the class structure of QuaPy -is directly inspired by scikit-learn's _Estimators_. Functions -_fit_ and _quantify_ are used to train the model and to provide -class estimations (the reason why -scikit-learn' structure has not been adopted _as is_ in QuaPy responds to -the fact that scikit-learn's _predict_ function is expected to return -one output for each input element --e.g., a predicted label for each -instance in a sample-- while in quantification the output for a sample -is one single array of class prevalences). -Quantifiers also extend from scikit-learn's `BaseEstimator`, in order -to simplify the use of _set_params_ and _get_params_ used in -[model selector](https://github.com/HLT-ISTI/QuaPy/wiki/Model-Selection). - -## Aggregative Methods - -All quantification methods are implemented as part of the -_qp.method_ package. In particular, _aggregative_ methods are defined in -_qp.method.aggregative_, and extend _AggregativeQuantifier(BaseQuantifier)_. 
-The methods that any _aggregative_ quantifier must implement are: - -```python - @abstractmethod - def fit(self, data: LabelledCollection, fit_learner=True): ... - - @abstractmethod - def aggregate(self, classif_predictions:np.ndarray): ... -``` - -since, as mentioned before, aggregative methods base their prediction on the -individual predictions of a classifier. Indeed, a default implementation -of _BaseQuantifier.quantify_ is already provided, which looks like: - -```python - def quantify(self, instances): - classif_predictions = self.classify(instances) - return self.aggregate(classif_predictions) -``` -Aggregative quantifiers are expected to maintain a classifier (which is -accessed through the _@property_ _classifier_). This classifier is -given as input to the quantifier, and can be already fit -on external data (in which case, the _fit_learner_ argument should -be set to False), or be fit by the quantifier's fit (default). - -Another class of _aggregative_ methods are the _probabilistic_ -aggregative methods, that should inherit from the abstract class -_AggregativeProbabilisticQuantifier(AggregativeQuantifier)_. -The particularity of _probabilistic_ aggregative methods (w.r.t. -non-probabilistic ones), is that the default quantifier is defined -in terms of the posterior probabilities returned by a probabilistic -classifier, and not by the crisp decisions of a hard classifier. -In any case, the interface _classify(instances)_ remains unchanged. - -One advantage of _aggregative_ methods (either probabilistic or not) -is that the evaluation according to any sampling procedure (e.g., -the [artificial sampling protocol](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation)) -can be achieved very efficiently, since the entire set can be pre-classified -once, and the quantification estimations for different samples can directly -reuse these predictions, without requiring to classify each element every time. -QuaPy leverages this property to speed-up any procedure having to do with -quantification over samples, as is customarily done in model selection or -in evaluation. - -### The Classify & Count variants - -QuaPy implements the four CC variants, i.e.: - -* _CC_ (Classify & Count), the simplest aggregative quantifier; one that - simply relies on the label predictions of a classifier to deliver class estimates. -* _ACC_ (Adjusted Classify & Count), the adjusted variant of CC. -* _PCC_ (Probabilistic Classify & Count), the probabilistic variant of CC that -relies on the soft estimations (or posterior probabilities) returned by a (probabilistic) classifier. -* _PACC_ (Probabilistic Adjusted Classify & Count), the adjusted variant of PCC. 
- -The following code serves as a complete example using CC equipped -with a SVM as the classifier: - -```python -import quapy as qp -import quapy.functional as F -from sklearn.svm import LinearSVC - -training, test = qp.datasets.fetch_twitter('hcr', pickle=True).train_test - -# instantiate a classifier learner, in this case a SVM -svm = LinearSVC() - -# instantiate a Classify & Count with the SVM -# (an alias is available in qp.method.aggregative.ClassifyAndCount) -model = qp.method.aggregative.CC(svm) -model.fit(training) -estim_prevalence = model.quantify(test.instances) -``` - -The same code could be used to instantiate an ACC, by simply replacing -the instantiation of the model with: -```python -model = qp.method.aggregative.ACC(svm) -``` -Note that the adjusted variants (ACC and PACC) need to estimate -some parameters for performing the adjustment (e.g., the -_true positive rate_ and the _false positive rate_ in case of -binary classification) that are estimated on a validation split -of the labelled set. In this case, the __init__ method of -ACC defines an additional parameter, _val_split_ which, by -default, is set to 0.4 and so, the 40% of the labelled data -will be used for estimating the parameters for adjusting the -predictions. This parameters can also be set with an integer, -indicating that the parameters should be estimated by means of -_k_-fold cross-validation, for which the integer indicates the -number _k_ of folds. Finally, _val_split_ can be set to a -specific held-out validation set (i.e., an instance of _LabelledCollection_). - -The specification of _val_split_ can be -postponed to the invokation of the fit method (if _val_split_ was also -set in the constructor, the one specified at fit time would prevail), -e.g.: - -```python -model = qp.method.aggregative.ACC(svm) -# perform 5-fold cross validation for estimating ACC's parameters -# (overrides the default val_split=0.4 in the constructor) -model.fit(training, val_split=5) -``` - -The following code illustrates the case in which PCC is used: - -```python -model = qp.method.aggregative.PCC(svm) -model.fit(training) -estim_prevalence = model.quantify(test.instances) -print('classifier:', model.classifier) -``` -In this case, QuaPy will print: -``` -The learner LinearSVC does not seem to be probabilistic. The learner will be calibrated. -classifier: CalibratedClassifierCV(base_estimator=LinearSVC(), cv=5) -``` -The first output indicates that the learner (_LinearSVC_ in this case) -is not a probabilistic classifier (i.e., it does not implement the -_predict_proba_ method) and so, the classifier will be converted to -a probabilistic one through [calibration](https://scikit-learn.org/stable/modules/calibration.html). -As a result, the classifier that is printed in the second line points -to a _CalibratedClassifier_ instance. Note that calibration can only -be applied to hard classifiers when _fit_learner=True_; an exception -will be raised otherwise. - -Lastly, everything we said aboud ACC and PCC -applies to PACC as well. - - -### Expectation Maximization (EMQ) - -The Expectation Maximization Quantifier (EMQ), also known as -the SLD, is available at _qp.method.aggregative.EMQ_ or via the -alias _qp.method.aggregative.ExpectationMaximizationQuantifier_. -The method is described in: - -_Saerens, M., Latinne, P., and Decaestecker, C. (2002). Adjusting the outputs of a classifier -to new a priori probabilities: A simple procedure. 
Neural Computation, 14(1):21–41._ - -EMQ works with a probabilistic classifier (if the classifier -given as input is a hard one, a calibration will be attempted). -Although this method was originally proposed for improving the -posterior probabilities of a probabilistic classifier, and not -for improving the estimation of prior probabilities, EMQ ranks -almost always among the most effective quantifiers in the -experiments we have carried out. - -An example of use can be found below: - -```python -import quapy as qp -from sklearn.linear_model import LogisticRegression - -dataset = qp.datasets.fetch_twitter('hcr', pickle=True) - -model = qp.method.aggregative.EMQ(LogisticRegression()) -model.fit(dataset.training) -estim_prevalence = model.quantify(dataset.test.instances) -``` - -_New in v0.1.7_: EMQ now accepts two new parameters in the construction method, namely -_exact_train_prev_ which allows to use the true training prevalence as the departing -prevalence estimation (default behaviour), or instead an approximation of it as -suggested by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html) -(by setting _exact_train_prev=False_). -The other parameter is _recalib_ which allows to indicate a calibration method, among those -proposed by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html), -including the Bias-Corrected Temperature Scaling, Vector Scaling, etc. -See the API documentation for further details. - - -### Hellinger Distance y (HDy) - -Implementation of the method based on the Hellinger Distance y (HDy) proposed by -[González-Castro, V., Alaiz-Rodrı́guez, R., and Alegre, E. (2013). Class distribution -estimation based on the Hellinger distance. Information Sciences, 218:146–164.](https://www.sciencedirect.com/science/article/pii/S0020025512004069) - -It is implemented in _qp.method.aggregative.HDy_ (also accessible -through the allias _qp.method.aggregative.HellingerDistanceY_). -This method works with a probabilistic classifier (hard classifiers -can be used as well and will be calibrated) and requires a validation -set to estimate parameter for the mixture model. Just like -ACC and PACC, this quantifier receives a _val_split_ argument -in the constructor (or in the fit method, in which case the previous -value is overridden) that can either be a float indicating the proportion -of training data to be taken as the validation set (in a random -stratified split), or a validation set (i.e., an instance of -_LabelledCollection_) itself. - -HDy was proposed as a binary classifier and the implementation -provided in QuaPy accepts only binary datasets. - -The following code shows an example of use: -```python -import quapy as qp -from sklearn.linear_model import LogisticRegression - -# load a binary dataset -dataset = qp.datasets.fetch_reviews('hp', pickle=True) -qp.data.preprocessing.text2tfidf(dataset, min_df=5, inplace=True) - -model = qp.method.aggregative.HDy(LogisticRegression()) -model.fit(dataset.training) -estim_prevalence = model.quantify(dataset.test.instances) -``` - -_New in v0.1.7:_ QuaPy now provides an implementation of the generalized -"Distribution Matching" approaches for multiclass, inspired by the framework -of [Firat (2016)](https://arxiv.org/abs/1606.00868). 
One can instantiate -a variant of HDy for multiclass quantification as follows: - -```python -mutliclassHDy = qp.method.aggregative.DistributionMatching(classifier=LogisticRegression(), divergence='HD', cdf=False) -``` - -_New in v0.1.7:_ QuaPy now provides an implementation of the "DyS" -framework proposed by [Maletzke et al (2020)](https://ojs.aaai.org/index.php/AAAI/article/view/4376) -and the "SMM" method proposed by [Hassan et al (2019)](https://ieeexplore.ieee.org/document/9260028) -(thanks to _Pablo González_ for the contributions!) - -### Threshold Optimization methods - -_New in v0.1.7:_ QuaPy now implements Forman's threshold optimization methods; -see, e.g., [(Forman 2006)](https://dl.acm.org/doi/abs/10.1145/1150402.1150423) -and [(Forman 2008)](https://link.springer.com/article/10.1007/s10618-008-0097-y). -These include: T50, MAX, X, Median Sweep (MS), and its variant MS2. - -### Explicit Loss Minimization - -The Explicit Loss Minimization (ELM) represent a family of methods -based on structured output learning, i.e., quantifiers relying on -classifiers that have been optimized targeting a -quantification-oriented evaluation measure. -The original methods are implemented in QuaPy as classify & count (CC) -quantifiers that use Joachim's [SVMperf](https://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html) -as the underlying classifier, properly set to optimize for the desired loss. - -In QuaPy, this can be more achieved by calling the functions: - -* _newSVMQ_: returns the quantification method called SVM(Q) that optimizes for the metric _Q_ defined -in [_Barranquero, J., Díez, J., and del Coz, J. J. (2015). Quantification-oriented learning based -on reliable classifiers. Pattern Recognition, 48(2):591–604._](https://www.sciencedirect.com/science/article/pii/S003132031400291X) -* _newSVMKLD_ and _newSVMNKLD_: returns the quantification method called SVM(KLD) and SVM(nKLD), standing for - Kullback-Leibler Divergence and Normalized Kullback-Leibler Divergence, as proposed in [_Esuli, A. and Sebastiani, F. (2015). - Optimizing text quantifiers for multivariate loss functions. - ACM Transactions on Knowledge Discovery and Data, 9(4):Article 27._](https://dl.acm.org/doi/abs/10.1145/2700406) -* _newSVMAE_ and _newSVMRAE_: returns a quantification method called SVM(AE) and SVM(RAE) that optimizes for the (Mean) Absolute Error and for the - (Mean) Relative Absolute Error, as first used by - [_Moreo, A. and Sebastiani, F. (2021). Tweet sentiment quantification: An experimental re-evaluation. PLOS ONE 17 (9), 1-23._](https://arxiv.org/abs/2011.02552) - -the last two methods (SVM(AE) and SVM(RAE)) have been implemented in -QuaPy in order to make available ELM variants for what nowadays -are considered the most well-behaved evaluation metrics in quantification. - -In order to make these models work, you would need to run the script -_prepare_svmperf.sh_ (distributed along with QuaPy) that -downloads _SVMperf_' source code, applies a patch that -implements the quantification oriented losses, and compiles the -sources. - -If you want to add any custom loss, you would need to modify -the source code of _SVMperf_ in order to implement it, and -assign a valid loss code to it. 
Then you must re-compile -the whole thing and instantiate the quantifier in QuaPy -as follows: - -```python -# you can either set the path to your custom svm_perf_quantification implementation -# in the environment variable, or as an argument to the constructor of ELM -qp.environ['SVMPERF_HOME'] = './path/to/svm_perf_quantification' - -# assign an alias to your custom loss and the id you have assigned to it -svmperf = qp.classification.svmperf.SVMperf -svmperf.valid_losses['mycustomloss'] = 28 - -# instantiate the ELM method indicating the loss -model = qp.method.aggregative.ELM(loss='mycustomloss') -``` - -All ELM are binary quantifiers since they rely on _SVMperf_, that -currently supports only binary classification. -ELM variants (any binary quantifier in general) can be extended -to operate in single-label scenarios trivially by adopting a -"one-vs-all" strategy (as, e.g., in -[_Gao, W. and Sebastiani, F. (2016). From classification to quantification in tweet sentiment -analysis. Social Network Analysis and Mining, 6(19):1–22_](https://link.springer.com/article/10.1007/s13278-016-0327-z)). -In QuaPy this is possible by using the _OneVsAll_ class. - -There are two ways for instantiating this class, _OneVsAllGeneric_ that works for -any quantifier, and _OneVsAllAggregative_ that is optimized for aggregative quantifiers. -In general, you can simply use the _getOneVsAll_ function and QuaPy will choose -the more convenient of the two. - -```python -import quapy as qp -from quapy.method.aggregative import SVMQ - -# load a single-label dataset (this one contains 3 classes) -dataset = qp.datasets.fetch_twitter('hcr', pickle=True) - -# let qp know where svmperf is -qp.environ['SVMPERF_HOME'] = '../svm_perf_quantification' - -model = getOneVsAll(SVMQ(), n_jobs=-1) # run them on parallel -model.fit(dataset.training) -estim_prevalence = model.quantify(dataset.test.instances) -``` - -Check the examples _[explicit_loss_minimization.py](..%2Fexamples%2Fexplicit_loss_minimization.py)_ -and [one_vs_all.py](..%2Fexamples%2Fone_vs_all.py) for more details. - -## Meta Models - -By _meta_ models we mean quantification methods that are defined on top of other -quantification methods, and that thus do not squarely belong to the aggregative nor -the non-aggregative group (indeed, _meta_ models could use quantifiers from any of those -groups). -_Meta_ models are implemented in the _qp.method.meta_ module. - -### Ensembles - -QuaPy implements (some of) the variants proposed in: - -* [_Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). -Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. -Information Fusion, 34, 87-100._](https://www.sciencedirect.com/science/article/pii/S1566253516300628) -* [_Pérez-Gállego, P., Castano, A., Quevedo, J. R., & del Coz, J. J. (2019). - Dynamic ensemble selection for quantification tasks. - Information Fusion, 45, 1-15._](https://www.sciencedirect.com/science/article/pii/S1566253517303652) - -The following code shows how to instantiate an Ensemble of 30 _Adjusted Classify & Count_ (ACC) -quantifiers operating with a _Logistic Regressor_ (LR) as the base classifier, and using the -_average_ as the aggregation policy (see the original article for further details). -The last parameter indicates to use all processors for parallelization. 
- -```python -import quapy as qp -from quapy.method.aggregative import ACC -from quapy.method.meta import Ensemble -from sklearn.linear_model import LogisticRegression - -dataset = qp.datasets.fetch_UCIDataset('haberman') - -model = Ensemble(quantifier=ACC(LogisticRegression()), size=30, policy='ave', n_jobs=-1) -model.fit(dataset.training) -estim_prevalence = model.quantify(dataset.test.instances) -``` - -Other aggregation policies implemented in QuaPy include: -* 'ptr' for applying a dynamic selection based on the training prevalence of the ensemble's members -* 'ds' for applying a dynamic selection based on the Hellinger Distance -* _any valid quantification measure_ (e.g., 'mse') for performing a static selection based on -the performance estimated for each member of the ensemble in terms of that evaluation metric. - -When using any of the above options, it is important to set the _red_size_ parameter, which -informs of the number of members to retain. - -Please, check the [model selection](https://github.com/HLT-ISTI/QuaPy/wiki/Model-Selection) -wiki if you want to optimize the hyperparameters of ensemble for classification or quantification. - -### The QuaNet neural network - -QuaPy offers an implementation of QuaNet, a deep learning model presented in: - -[_Esuli, A., Moreo, A., & Sebastiani, F. (2018, October). -A recurrent neural network for sentiment quantification. -In Proceedings of the 27th ACM International Conference on -Information and Knowledge Management (pp. 1775-1778)._](https://dl.acm.org/doi/abs/10.1145/3269206.3269287) - -This model requires _torch_ to be installed. -QuaNet also requires a classifier that can provide embedded representations -of the inputs. -In the original paper, QuaNet was tested using an LSTM as the base classifier. -In the following example, we show an instantiation of QuaNet that instead uses CNN as a probabilistic classifier, taking its last layer representation as the document embedding: - -```python -import quapy as qp -from quapy.method.meta import QuaNet -from quapy.classification.neural import NeuralClassifierTrainer, CNNnet - -# use samples of 100 elements -qp.environ['SAMPLE_SIZE'] = 100 - -# load the kindle dataset as text, and convert words to numerical indexes -dataset = qp.datasets.fetch_reviews('kindle', pickle=True) -qp.data.preprocessing.index(dataset, min_df=5, inplace=True) - -# the text classifier is a CNN trained by NeuralClassifierTrainer -cnn = CNNnet(dataset.vocabulary_size, dataset.n_classes) -learner = NeuralClassifierTrainer(cnn, device='cuda') - -# train QuaNet -model = QuaNet(learner, device='cuda') -model.fit(dataset.training) -estim_prevalence = model.quantify(dataset.test.instances) -``` - diff --git a/docs/build/html/_sources/Model-Selection.md.txt b/docs/build/html/_sources/Model-Selection.md.txt deleted file mode 100644 index 1df9107..0000000 --- a/docs/build/html/_sources/Model-Selection.md.txt +++ /dev/null @@ -1,150 +0,0 @@ -# Model Selection - -As a supervised machine learning task, quantification methods -can strongly depend on a good choice of model hyper-parameters. -The process whereby those hyper-parameters are chosen is -typically known as _Model Selection_, and typically consists of -testing different settings and picking the one that performed -best in a held-out validation set in terms of any given -evaluation measure. 
- -## Targeting a Quantification-oriented loss - -The task being optimized determines the evaluation protocol, -i.e., the criteria according to which the performance of -any given method for solving is to be assessed. -As a task on its own right, quantification should impose -its own model selection strategies, i.e., strategies -aimed at finding appropriate configurations -specifically designed for the task of quantification. - -Quantification has long been regarded as an add-on of -classification, and thus the model selection strategies -customarily adopted in classification have simply been -applied to quantification (see the next section). -It has been argued in [Moreo, Alejandro, and Fabrizio Sebastiani. -Re-Assessing the "Classify and Count" Quantification Method. -ECIR 2021: Advances in Information Retrieval pp 75–91.](https://link.springer.com/chapter/10.1007/978-3-030-72240-1_6) -that specific model selection strategies should -be adopted for quantification. That is, model selection -strategies for quantification should target -quantification-oriented losses and be tested in a variety -of scenarios exhibiting different degrees of prior -probability shift. - -The class _qp.model_selection.GridSearchQ_ implements a grid-search exploration over the space of -hyper-parameter combinations that [evaluates](https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation) -each combination of hyper-parameters by means of a given quantification-oriented -error metric (e.g., any of the error functions implemented -in _qp.error_) and according to a -[sampling generation protocol](https://github.com/HLT-ISTI/QuaPy/wiki/Protocols). - -The following is an example (also included in the examples folder) of model selection for quantification: - -```python -import quapy as qp -from quapy.protocol import APP -from quapy.method.aggregative import DistributionMatching -from sklearn.linear_model import LogisticRegression -import numpy as np - -""" -In this example, we show how to perform model selection on a DistributionMatching quantifier. -""" - -model = DistributionMatching(LogisticRegression()) - -qp.environ['SAMPLE_SIZE'] = 100 -qp.environ['N_JOBS'] = -1 # explore hyper-parameters in parallel - -training, test = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=5).train_test - -# The model will be returned by the fit method of GridSearchQ. -# Every combination of hyper-parameters will be evaluated by confronting the -# quantifier thus configured against a series of samples generated by means -# of a sample generation protocol. For this example, we will use the -# artificial-prevalence protocol (APP), that generates samples with prevalence -# values in the entire range of values from a grid (e.g., [0, 0.1, 0.2, ..., 1]). -# We devote 30% of the dataset for this exploration. -training, validation = training.split_stratified(train_prop=0.7) -protocol = APP(validation) - -# We will explore a classification-dependent hyper-parameter (e.g., the 'C' -# hyper-parameter of LogisticRegression) and a quantification-dependent hyper-parameter -# (e.g., the number of bins in a DistributionMatching quantifier. -# Classifier-dependent hyper-parameters have to be marked with a prefix "classifier__" -# in order to let the quantifier know this hyper-parameter belongs to its underlying -# classifier. 
-param_grid = { - 'classifier__C': np.logspace(-3,3,7), - 'nbins': [8, 16, 32, 64], -} - -model = qp.model_selection.GridSearchQ( - model=model, - param_grid=param_grid, - protocol=protocol, - error='mae', # the error to optimize is the MAE (a quantification-oriented loss) - refit=True, # retrain on the whole labelled set once done - verbose=True # show information as the process goes on -).fit(training) - -print(f'model selection ended: best hyper-parameters={model.best_params_}') -model = model.best_model_ - -# evaluation in terms of MAE -# we use the same evaluation protocol (APP) on the test set -mae_score = qp.evaluation.evaluate(model, protocol=APP(test), error_metric='mae') - -print(f'MAE={mae_score:.5f}') -``` - -In this example, the system outputs: -``` -[GridSearchQ]: starting model selection with self.n_jobs =-1 -[GridSearchQ]: hyperparams={'classifier__C': 0.01, 'nbins': 64} got mae score 0.04021 [took 1.1356s] -[GridSearchQ]: hyperparams={'classifier__C': 0.01, 'nbins': 32} got mae score 0.04286 [took 1.2139s] -[GridSearchQ]: hyperparams={'classifier__C': 0.01, 'nbins': 16} got mae score 0.04888 [took 1.2491s] -[GridSearchQ]: hyperparams={'classifier__C': 0.001, 'nbins': 8} got mae score 0.05163 [took 1.5372s] -[...] -[GridSearchQ]: hyperparams={'classifier__C': 1000.0, 'nbins': 32} got mae score 0.02445 [took 2.9056s] -[GridSearchQ]: optimization finished: best params {'classifier__C': 100.0, 'nbins': 32} (score=0.02234) [took 7.3114s] -[GridSearchQ]: refitting on the whole development set -model selection ended: best hyper-parameters={'classifier__C': 100.0, 'nbins': 32} -MAE=0.03102 -``` - -The parameter _val_split_ can alternatively be used to indicate -a validation set (i.e., an instance of _LabelledCollection_) instead -of a proportion. This could be useful if one wants to have control -on the specific data split to be used across different model selection -experiments. - -## Targeting a Classification-oriented loss - -Optimizing a model for quantification could rather be -computationally costly. -In aggregative methods, one could alternatively try to optimize -the classifier's hyper-parameters for classification. -Although this is theoretically suboptimal, many articles in -quantification literature have opted for this strategy. - -In QuaPy, this is achieved by simply instantiating the -classifier learner as a GridSearchCV from scikit-learn. -The following code illustrates how to do that: - -```python -learner = GridSearchCV( - LogisticRegression(), - param_grid={'C': np.logspace(-4, 5, 10), 'class_weight': ['balanced', None]}, - cv=5) -model = DistributionMatching(learner).fit(dataset.training) -``` - -However, this is conceptually flawed, since the model should be -optimized for the task at hand (quantification), and not for a surrogate task (classification), -i.e., the model should be requested to deliver low quantification errors, rather -than low classification errors. - - - diff --git a/docs/build/html/_sources/Plotting.md.txt b/docs/build/html/_sources/Plotting.md.txt deleted file mode 100644 index 99f3f7e..0000000 --- a/docs/build/html/_sources/Plotting.md.txt +++ /dev/null @@ -1,250 +0,0 @@ -# Plotting - -The module _qp.plot_ implements some basic plotting functions -that can help analyse the performance of a quantification method. 
-

All plotting functions receive as inputs the outcomes of
some experiments and include, for each experiment,
the following three main arguments:

* _method_names_ a list containing the names of the quantification methods
* _true_prevs_ a list containing matrices of true prevalences
* _estim_prevs_ a list containing matrices of estimated prevalences
(should be of the same shape as the corresponding matrix in _true_prevs_)

Note that a method (as indicated by a name in _method_names_) can
appear more than once. This could occur when various datasets are
involved in the experiments. In this case, all experiments for the
method will be merged and the plot will represent the method's
performance across various datasets.

This is a very simple example of a valid input for the plotting functions:
```python
import numpy as np

method_names = ['classify & count', 'EMQ', 'classify & count']
true_prevs = [
    np.array([[0.5, 0.5], [0.25, 0.75]]),
    np.array([[0.0, 1.0], [0.25, 0.75], [0.0, 0.1]]),
    np.array([[0.0, 1.0], [0.25, 0.75], [0.0, 0.1]]),
]
estim_prevs = [
    np.array([[0.45, 0.55], [0.6, 0.4]]),
    np.array([[0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]),
    np.array([[0.1, 0.9], [0.3, 0.7], [0.0, 0.1]]),
]
```
in which the _classify & count_ method has been tested on two datasets and
the _EMQ_ method has been tested on only one dataset. For the first
experiment, only two (binary) quantifications have been tested,
while for the second and third experiments three instances have
been tested.

In general, we would like to test the performance of the
quantification methods across different scenarios showcasing
the accuracy of the quantifier in predicting class prevalences
for a wide range of prior distributions. This can easily be
achieved by means of the
[artificial sampling protocol](https://github.com/HLT-ISTI/QuaPy/wiki/Protocols)
that is implemented in QuaPy.

The following code shows how to perform one simple experiment
in which the 4 _CC-variants_, all equipped with a linear SVM, are
applied to one binary dataset of reviews about _Kindle_ devices and
tested across the entire spectrum of class priors (taking 21 splits
of the interval [0,1], i.e., using prevalence steps of 0.05, and
generating 100 random samples at each prevalence).

```python
import quapy as qp
from quapy.protocol import APP
from quapy.method.aggregative import CC, ACC, PCC, PACC
from sklearn.svm import LinearSVC

qp.environ['SAMPLE_SIZE'] = 500

def gen_data():

    def base_classifier():
        return LinearSVC(class_weight='balanced')

    def models():
        yield 'CC', CC(base_classifier())
        yield 'ACC', ACC(base_classifier())
        yield 'PCC', PCC(base_classifier())
        yield 'PACC', PACC(base_classifier())

    train, test = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5).train_test

    method_names, true_prevs, estim_prevs, tr_prevs = [], [], [], []

    for method_name, model in models():
        model.fit(train)
        true_prev, estim_prev = qp.evaluation.prediction(model, APP(test, repeats=100, random_state=0))

        method_names.append(method_name)
        true_prevs.append(true_prev)
        estim_prevs.append(estim_prev)
        tr_prevs.append(train.prevalence())

    return method_names, true_prevs, estim_prevs, tr_prevs

method_names, true_prevs, estim_prevs, tr_prevs = gen_data()
```

The plots that can be generated are explained below.

## Diagonal Plot

The _diagonal_ plot shows a very insightful view of the
quantifier's performance.
It plots the predicted class
prevalence (on the y-axis) against the true class prevalence
(on the x-axis). Unfortunately, it is limited to binary quantification,
although one can simply generate as many _diagonal_ plots as
there are classes by indicating which class should be considered
the target of the plot.

The following call will produce the plot:

```python
qp.plot.binary_diagonal(method_names, true_prevs, estim_prevs, train_prev=tr_prevs[0], savepath='./plots/bin_diag.png')
```

The last argument is optional, and indicates the path where the plot
will be saved (the file extension will determine the format -- typical extensions
are '.png' or '.pdf'). If this path is not provided, then the plot
will be shown but not saved.
The resulting plot should look like:

![diagonal plot on Kindle](./wiki_examples/selected_plots/bin_diag.png)

Note that in this case, we are also indicating the training
prevalence, which is plotted on the diagonal as a cyan dot.
The color bands indicate the standard deviations of the predictions,
and can be hidden by setting the argument _show_std=False_ (see
the complete list of arguments in the documentation).

Finally, note how most quantifiers, and especially the "unadjusted"
variants CC and PCC, are strongly biased towards the
prevalence seen during training.

## Quantification bias

This plot aims at revealing the bias that any quantifier
displays with respect to the training prevalences by
means of [box plots](https://en.wikipedia.org/wiki/Box_plot).
This plot can be generated by:

```python
qp.plot.binary_bias_global(method_names, true_prevs, estim_prevs, savepath='./plots/bin_bias.png')
```

and should look like:

![bias plot on Kindle](./wiki_examples/selected_plots/bin_bias.png)

The box plots show some interesting facts:
* all methods are biased towards the training prevalence, but especially
so CC and PCC (an unbiased quantifier would have a box centered at 0)
* the bias is always positive, indicating that all methods tend to
overestimate the positive class prevalence
* CC and PCC have high variability, while ACC and especially PACC exhibit
lower variability.

Again, these plots could be generated for experiments ranging across
different datasets, and the plot will merge all data accordingly.

Another illustrative example consists of
training different CC quantifiers at different
(artificially sampled) training prevalences.
For this example, we generate training samples of 5000
documents containing 10%, 20%, ..., 90% of positives from the
IMDb dataset, and generate the bias plot again.
-
This example can be run by rewriting the _gen_data()_ function
like this:

```python
import numpy as np

def gen_data():

    train, test = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=5).train_test
    model = CC(LinearSVC())

    method_data = []
    for training_prevalence in np.linspace(0.1, 0.9, 9):
        training_size = 5000
        # since the problem is binary, it suffices to specify the negative prevalence, since the positive is constrained
        train_sample = train.sampling(training_size, 1-training_prevalence)
        model.fit(train_sample)
        true_prev, estim_prev = qp.evaluation.prediction(model, APP(test, repeats=100, random_state=0))
        method_name = 'CC$_{'+f'{int(100*training_prevalence)}' + '\\%}$'
        method_data.append((method_name, true_prev, estim_prev, train_sample.prevalence()))

    return zip(*method_data)
```

and the plot should now look like:

![bias plot on IMDb](./wiki_examples/selected_plots/bin_bias_cc.png)

which clearly shows a negative bias for CC variants trained on
data containing more negatives (i.e., < 50%) and positive biases
in cases containing more positives (i.e., > 50%). The CC trained
at 50% behaves as an unbiased estimator of the positive class
prevalence.

The function _qp.plot.binary_bias_bins_ allows the user to
generate box plots broken down by bins of true test prevalence.
To this aim, an argument _nbins_ is passed, which indicates
how many equal-width subintervals to take. For example,
the following plot is produced for _nbins=3_:

![bias plot on IMDb](./wiki_examples/selected_plots/bin_bias_bin_cc.png)

Interestingly enough, the seemingly unbiased estimator (CC at 50%) happens to display
a positive bias (or a tendency to overestimate) in cases of low prevalence
(i.e., when the true prevalence of the positive class is below 33%),
and a negative bias (or a tendency to underestimate) in cases of high prevalence
(i.e., when the true prevalence is beyond 67%).

Out of curiosity, the diagonal plot for this experiment looks like:

![diag plot on IMDb](./wiki_examples/selected_plots/bin_diag_cc.png)

showing quite clearly the dependency of CC on the prior probabilities
of the labeled set it was trained on.


## Error by Drift

The plots discussed above are useful for analyzing and comparing
the performance of different quantification methods, but are
limited to the binary case. The "error by drift" is a plot
that shows the error in predictions as a function of the
(prior probability) drift between each test sample and the
training set. Interestingly, the error and drift can both be measured
in terms of any evaluation measure for quantification (like the
ones available in _qp.error_) and can thus be computed
irrespective of the number of classes.

The following shows how to generate the plot for the 4 CC variants,
using 10 bins for the drift
and _absolute error_ as the measure of the error (the
drift on the x-axis is always computed in terms of _absolute error_ since
other errors are harder to interpret):

```python
qp.plot.error_by_drift(method_names, true_prevs, estim_prevs, tr_prevs,
                       error_name='ae', n_bins=10, savepath='./plots/err_drift.png')
```

![diag plot on IMDb](./wiki_examples/selected_plots/err_drift.png)

Note that all methods work reasonably well in cases of low prevalence
drift (i.e., any CC-variant is a good quantifier whenever the IID
assumption is approximately preserved).
The higher the drift, the worse -those quantifiers tend to perform, although it is clear that PACC -yields the lowest error for the most difficult cases. - -Remember that any plot can be generated _across many datasets_, and -that this would probably result in a more solid comparison. -In those cases, however, it is likely that the variances of each -method get higher, to the detriment of the visualization. -We recommend to set _show_std=False_ in those cases -in order to hide the color bands. diff --git a/docs/build/html/_sources/index.rst.txt b/docs/build/html/_sources/index.rst.txt index bf17bc7..cc5b4dc 100644 --- a/docs/build/html/_sources/index.rst.txt +++ b/docs/build/html/_sources/index.rst.txt @@ -1,87 +1,36 @@ -.. QuaPy documentation master file, created by - sphinx-quickstart on Tue Nov 9 11:31:32 2021. +.. QuaPy: A Python-based open-source framework for quantification documentation master file, created by + sphinx-quickstart on Wed Feb 7 16:26:46 2024. You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. Welcome to QuaPy's documentation! -================================= +========================================================================================== -QuaPy is an open source framework for Quantification (a.k.a. Supervised Prevalence Estimation) -written in Python. +QuaPy is a Python-based open-source framework for quantification. -Introduction +This document contains the API of the modules included in QuaPy. + +Installation ------------ -QuaPy roots on the concept of data sample, and provides implementations of most important concepts -in quantification literature, such as the most important quantification baselines, many advanced -quantification methods, quantification-oriented model selection, many evaluation measures and protocols -used for evaluating quantification methods. -QuaPy also integrates commonly used datasets and offers visualization tools for facilitating the analysis and -interpretation of results. +`pip install quapy` -A quick example: -**************** +GitHub +------------ -The following script fetchs a Twitter dataset, trains and evaluates an -`Adjusted Classify & Count` model in terms of the `Mean Absolute Error` (MAE) -between the class prevalences estimated for the test set and the true prevalences -of the test set. +QuaPy is hosted in GitHub at `https://github.com/HLT-ISTI/QuaPy `_ -:: - - import quapy as qp - from sklearn.linear_model import LogisticRegression - - dataset = qp.datasets.fetch_twitter('semeval16') - - # create an "Adjusted Classify & Count" quantifier - model = qp.method.aggregative.ACC(LogisticRegression()) - model.fit(dataset.training) - - estim_prevalences = model.quantify(dataset.test.instances) - true_prevalences = dataset.test.prevalence() - - error = qp.error.mae(true_prevalences, estim_prevalences) - - print(f'Mean Absolute Error (MAE)={error:.3f}') - - -Quantification is useful in scenarios of prior probability shift. In other -words, we would not be interested in estimating the class prevalences of the test set if -we could assume the IID assumption to hold, as this prevalence would simply coincide with the -class prevalence of the training set. For this reason, any Quantification model -should be tested across samples characterized by different class prevalences. -QuaPy implements sampling procedures and evaluation protocols that automates this endeavour. -See the :doc:`Evaluation` for detailed examples. 
- -Features -******** - -* Implementation of most popular quantification methods (Classify-&-Count variants, Expectation-Maximization, SVM-based variants for quantification, HDy, QuaNet, and Ensembles). -* Versatile functionality for performing evaluation based on artificial sampling protocols. -* Implementation of most commonly used evaluation metrics (e.g., MAE, MRAE, MSE, NKLD, etc.). -* Popular datasets for Quantification (textual and numeric) available, including: - * 32 UCI Machine Learning datasets. - * 11 Twitter Sentiment datasets. - * 3 Reviews Sentiment datasets. - * 4 tasks from LeQua competition (_new in v0.1.7!_) -* Native supports for binary and single-label scenarios of quantification. -* Model selection functionality targeting quantification-oriented losses. -* Visualization tools for analysing results. .. toctree:: :maxdepth: 2 :caption: Contents: - Installation - Datasets - Evaluation - Protocols - Methods - Model-Selection - Plotting - API Developers documentation +Contents +-------- +.. toctree:: + + modules Indices and tables diff --git a/docs/build/html/_sources/quapy.classification.rst.txt b/docs/build/html/_sources/quapy.classification.rst.txt index 3d14431..cfc7d9b 100644 --- a/docs/build/html/_sources/quapy.classification.rst.txt +++ b/docs/build/html/_sources/quapy.classification.rst.txt @@ -1,38 +1,35 @@ -:tocdepth: 2 - quapy.classification package ============================ Submodules ---------- -quapy.classification.calibration --------------------------------- +quapy.classification.calibration module +--------------------------------------- -.. versionadded:: 0.1.7 .. automodule:: quapy.classification.calibration :members: :undoc-members: :show-inheritance: -quapy.classification.methods ----------------------------- +quapy.classification.methods module +----------------------------------- .. automodule:: quapy.classification.methods :members: :undoc-members: :show-inheritance: -quapy.classification.neural ---------------------------- +quapy.classification.neural module +---------------------------------- .. automodule:: quapy.classification.neural :members: :undoc-members: :show-inheritance: -quapy.classification.svmperf ----------------------------- +quapy.classification.svmperf module +----------------------------------- .. automodule:: quapy.classification.svmperf :members: diff --git a/docs/build/html/_sources/quapy.data.rst.txt b/docs/build/html/_sources/quapy.data.rst.txt index fda5ff0..cadace6 100644 --- a/docs/build/html/_sources/quapy.data.rst.txt +++ b/docs/build/html/_sources/quapy.data.rst.txt @@ -1,37 +1,36 @@ -:tocdepth: 2 - quapy.data package ================== Submodules ---------- -quapy.data.base ---------------- +quapy.data.base module +---------------------- .. automodule:: quapy.data.base :members: :undoc-members: :show-inheritance: -quapy.data.datasets -------------------- +quapy.data.datasets module +-------------------------- .. automodule:: quapy.data.datasets :members: :undoc-members: :show-inheritance: -quapy.data.preprocessing ------------------------- + +quapy.data.preprocessing module +------------------------------- .. automodule:: quapy.data.preprocessing :members: :undoc-members: :show-inheritance: -quapy.data.reader ------------------ +quapy.data.reader module +------------------------ .. 
automodule:: quapy.data.reader :members: diff --git a/docs/build/html/_sources/quapy.method.rst.txt b/docs/build/html/_sources/quapy.method.rst.txt index cfda57f..8026e0a 100644 --- a/docs/build/html/_sources/quapy.method.rst.txt +++ b/docs/build/html/_sources/quapy.method.rst.txt @@ -1,45 +1,51 @@ -:tocdepth: 2 - quapy.method package ==================== Submodules ---------- -quapy.method.aggregative ------------------------- +quapy.method.aggregative module +------------------------------- .. automodule:: quapy.method.aggregative :members: :undoc-members: :show-inheritance: -quapy.method.base ------------------ +.. automodule:: quapy.method._kdey + :members: + :undoc-members: + :show-inheritance: + +.. automodule:: quapy.method._neural + :members: + :undoc-members: + :show-inheritance: + +.. automodule:: quapy.method._threshold_optim + :members: + :undoc-members: + :show-inheritance: + + +quapy.method.base module +------------------------ .. automodule:: quapy.method.base :members: :undoc-members: :show-inheritance: -quapy.method.meta ------------------ +quapy.method.meta module +------------------------ .. automodule:: quapy.method.meta :members: :undoc-members: :show-inheritance: -quapy.method.neural -------------------- - -.. automodule:: quapy.method.neural - :members: - :undoc-members: - :show-inheritance: - -quapy.method.non\_aggregative ------------------------------ +quapy.method.non\_aggregative module +------------------------------------ .. automodule:: quapy.method.non_aggregative :members: diff --git a/docs/build/html/_sources/quapy.rst.txt b/docs/build/html/_sources/quapy.rst.txt index e3e1697..af2708b 100644 --- a/docs/build/html/_sources/quapy.rst.txt +++ b/docs/build/html/_sources/quapy.rst.txt @@ -1,79 +1,76 @@ -:tocdepth: 2 - quapy package ============= +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + quapy.classification + quapy.data + quapy.method + + Submodules ---------- -quapy.error ------------ +quapy.error module +------------------ .. automodule:: quapy.error :members: :undoc-members: :show-inheritance: -quapy.evaluation ----------------- +quapy.evaluation module +----------------------- .. automodule:: quapy.evaluation :members: :undoc-members: :show-inheritance: -quapy.protocol --------------- - -.. versionadded:: 0.1.7 -.. automodule:: quapy.protocol - :members: - :undoc-members: - :show-inheritance: - -quapy.functional ----------------- +quapy.functional module +----------------------- .. automodule:: quapy.functional :members: :undoc-members: :show-inheritance: -quapy.model\_selection ----------------------- +quapy.model\_selection module +----------------------------- .. automodule:: quapy.model_selection :members: :undoc-members: :show-inheritance: -quapy.plot ----------- +quapy.plot module +----------------- .. automodule:: quapy.plot :members: :undoc-members: :show-inheritance: -quapy.util ----------- +quapy.protocol module +--------------------- + +.. automodule:: quapy.protocol + :members: + :undoc-members: + :show-inheritance: + +quapy.util module +----------------- .. automodule:: quapy.util :members: :undoc-members: :show-inheritance: -Subpackages ------------ - -.. 
toctree:: - :maxdepth: 3 - - quapy.classification - quapy.data - quapy.method - - Module contents --------------- @@ -81,4 +78,3 @@ Module contents :members: :undoc-members: :show-inheritance: - diff --git a/docs/build/html/_static/background_b01.png b/docs/build/html/_static/background_b01.png deleted file mode 100644 index 353f26d..0000000 Binary files a/docs/build/html/_static/background_b01.png and /dev/null differ diff --git a/docs/build/html/_static/basic.css b/docs/build/html/_static/basic.css index 096e3f6..f316efc 100644 --- a/docs/build/html/_static/basic.css +++ b/docs/build/html/_static/basic.css @@ -4,7 +4,7 @@ * * Sphinx stylesheet -- basic theme. * - * :copyright: Copyright 2007-2022 by the Sphinx team, see AUTHORS. + * :copyright: Copyright 2007-2024 by the Sphinx team, see AUTHORS. * :license: BSD, see LICENSE for details. * */ @@ -55,7 +55,7 @@ div.sphinxsidebarwrapper { div.sphinxsidebar { float: left; - width: 210px; + width: 230px; margin-left: -100%; font-size: 90%; word-wrap: break-word; @@ -237,6 +237,10 @@ a.headerlink { visibility: hidden; } +a:visited { + color: #551A8B; +} + h1:hover > a.headerlink, h2:hover > a.headerlink, h3:hover > a.headerlink, @@ -324,6 +328,7 @@ aside.sidebar { p.sidebar-title { font-weight: bold; } + nav.contents, aside.topic, div.admonition, div.topic, blockquote { @@ -331,6 +336,7 @@ div.admonition, div.topic, blockquote { } /* -- topics ---------------------------------------------------------------- */ + nav.contents, aside.topic, div.topic { @@ -606,6 +612,7 @@ ol.simple p, ul.simple p { margin-bottom: 0; } + aside.footnote > span, div.citation > span { float: left; @@ -667,6 +674,16 @@ dd { margin-left: 30px; } +.sig dd { + margin-top: 0px; + margin-bottom: 0px; +} + +.sig dl { + margin-top: 0px; + margin-bottom: 0px; +} + dl > dd:last-child, dl > dd:last-child > :last-child { margin-bottom: 0; @@ -735,6 +752,14 @@ abbr, acronym { cursor: help; } +.translated { + background-color: rgba(207, 255, 207, 0.2) +} + +.untranslated { + background-color: rgba(255, 207, 207, 0.2) +} + /* -- code displays --------------------------------------------------------- */ pre { diff --git a/docs/build/html/_static/bizstyle.css b/docs/build/html/_static/bizstyle.css deleted file mode 100644 index ec32aa0..0000000 --- a/docs/build/html/_static/bizstyle.css +++ /dev/null @@ -1,508 +0,0 @@ -/* - * bizstyle.css_t - * ~~~~~~~~~~~~~~ - * - * Sphinx stylesheet -- business style theme. - * - * :copyright: Copyright 2011-2014 by Sphinx team, see AUTHORS. - * :license: BSD, see LICENSE for details. 
- * - */ - -@import url("basic.css"); - -/* -- page layout ----------------------------------------------------------- */ - -body { - font-family: 'Lucida Grande', 'Lucida Sans Unicode', 'Geneva', - 'Verdana', sans-serif; - font-size: 14px; - letter-spacing: -0.01em; - line-height: 150%; - text-align: center; - background-color: white; - background-image: url(background_b01.png); - color: black; - padding: 0; - border-right: 1px solid #336699; - border-left: 1px solid #336699; - - margin: 0px 40px 0px 40px; -} - -div.document { - background-color: white; - text-align: left; - background-repeat: repeat-x; - - -moz-box-shadow: 2px 2px 5px #000; - -webkit-box-shadow: 2px 2px 5px #000; -} - -div.documentwrapper { - float: left; - width: 100%; -} - -div.bodywrapper { - margin: 0 0 0 240px; - border-left: 1px solid #ccc; -} - -div.body { - margin: 0; - padding: 0.5em 20px 20px 20px; -} -div.bodywrapper { - margin: 0 0 0 calc(210px + 30px); -} - -div.related { - font-size: 1em; - - -moz-box-shadow: 2px 2px 5px #000; - -webkit-box-shadow: 2px 2px 5px #000; -} - -div.related ul { - background-color: #336699; - height: 100%; - overflow: hidden; - border-top: 1px solid #ddd; - border-bottom: 1px solid #ddd; -} - -div.related ul li { - color: white; - margin: 0; - padding: 0; - height: 2em; - float: left; -} - -div.related ul li.right { - float: right; - margin-right: 5px; -} - -div.related ul li a { - margin: 0; - padding: 0 5px 0 5px; - line-height: 1.75em; - color: #fff; -} - -div.related ul li a:hover { - color: #fff; - text-decoration: underline; -} - -div.sphinxsidebarwrapper { - padding: 0; -} - -div.sphinxsidebar { - padding: 0.5em 12px 12px 12px; - width: 210px; - font-size: 1em; - text-align: left; -} - -div.sphinxsidebar h3, div.sphinxsidebar h4 { - margin: 1em 0 0.5em 0; - font-size: 1em; - padding: 0.1em 0 0.1em 0.5em; - color: white; - border: 1px solid #336699; - background-color: #336699; -} - -div.sphinxsidebar h3 a { - color: white; -} - -div.sphinxsidebar ul { - padding-left: 1.5em; - margin-top: 7px; - padding: 0; - line-height: 130%; -} - -div.sphinxsidebar ul ul { - margin-left: 20px; -} - -div.sphinxsidebar input { - border: 1px solid #336699; -} - -div.footer { - background-color: white; - color: #336699; - padding: 3px 8px 3px 0; - clear: both; - font-size: 0.8em; - text-align: right; - border-bottom: 1px solid #336699; - - -moz-box-shadow: 2px 2px 5px #000; - -webkit-box-shadow: 2px 2px 5px #000; -} - -div.footer a { - color: #336699; - text-decoration: underline; -} - -/* -- body styles ----------------------------------------------------------- */ - -p { - margin: 0.8em 0 0.5em 0; -} - -a { - color: #336699; - text-decoration: none; -} - -a:hover { - color: #336699; - text-decoration: underline; -} - -div.body a { - text-decoration: underline; -} - -h1, h2, h3 { - color: #336699; -} - -h1 { - margin: 0; - padding: 0.7em 0 0.3em 0; - font-size: 1.5em; -} - -h2 { - margin: 1.3em 0 0.2em 0; - font-size: 1.35em; - padding-bottom: .5em; - border-bottom: 1px solid #336699; -} - -h3 { - margin: 1em 0 -0.3em 0; - font-size: 1.2em; - padding-bottom: .3em; - border-bottom: 1px solid #CCCCCC; -} - -div.body h1 a, div.body h2 a, div.body h3 a, -div.body h4 a, div.body h5 a, div.body h6 a { - color: black!important; -} - -h1 a.anchor, h2 a.anchor, h3 a.anchor, -h4 a.anchor, h5 a.anchor, h6 a.anchor { - display: none; - margin: 0 0 0 0.3em; - padding: 0 0.2em 0 0.2em; - color: #aaa!important; -} - -h1:hover a.anchor, h2:hover a.anchor, h3:hover a.anchor, h4:hover a.anchor, -h5:hover 
a.anchor, h6:hover a.anchor { - display: inline; -} - -h1 a.anchor:hover, h2 a.anchor:hover, h3 a.anchor:hover, h4 a.anchor:hover, -h5 a.anchor:hover, h6 a.anchor:hover { - color: #777; - background-color: #eee; -} - -a.headerlink { - color: #c60f0f!important; - font-size: 1em; - margin-left: 6px; - padding: 0 4px 0 4px; - text-decoration: none!important; -} - -a.headerlink:hover { - background-color: #ccc; - color: white!important; -} - -cite, code, tt { - font-family: 'Consolas', 'Deja Vu Sans Mono', - 'Bitstream Vera Sans Mono', monospace; - font-size: 0.95em; - letter-spacing: 0.01em; -} - -code { - background-color: #F2F2F2; - border-bottom: 1px solid #ddd; - color: #333; -} - -code.descname, code.descclassname, code.xref { - border: 0; -} - -hr { - border: 1px solid #abc; - margin: 2em; -} - -a code { - border: 0; - color: #CA7900; -} - -a code:hover { - color: #2491CF; -} - -pre { - background-color: transparent !important; - font-family: 'Consolas', 'Deja Vu Sans Mono', - 'Bitstream Vera Sans Mono', monospace; - font-size: 0.95em; - letter-spacing: 0.015em; - line-height: 120%; - padding: 0.5em; - border-right: 5px solid #ccc; - border-left: 5px solid #ccc; -} - -pre a { - color: inherit; - text-decoration: underline; -} - -td.linenos pre { - padding: 0.5em 0; -} - -div.quotebar { - background-color: #f8f8f8; - max-width: 250px; - float: right; - padding: 2px 7px; - border: 1px solid #ccc; -} -nav.contents, -aside.topic, - -div.topic { - background-color: #f8f8f8; -} - -table { - border-collapse: collapse; - margin: 0 -0.5em 0 -0.5em; -} - -table td, table th { - padding: 0.2em 0.5em 0.2em 0.5em; -} - -div.admonition { - font-size: 0.9em; - margin: 1em 0 1em 0; - border: 3px solid #cccccc; - background-color: #f7f7f7; - padding: 0; -} - -div.admonition p { - margin: 0.5em 1em 0.5em 1em; - padding: 0; -} - -div.admonition li p { - margin-left: 0; -} - -div.admonition pre, div.warning pre { - margin: 0; -} - -div.highlight { - margin: 0.4em 1em; -} - -div.admonition p.admonition-title { - margin: 0; - padding: 0.1em 0 0.1em 0.5em; - color: white; - border-bottom: 3px solid #cccccc; - font-weight: bold; - background-color: #165e83; -} - -div.danger { border: 3px solid #f0908d; background-color: #f0cfa0; } -div.error { border: 3px solid #f0908d; background-color: #ede4cd; } -div.warning { border: 3px solid #f8b862; background-color: #f0cfa0; } -div.caution { border: 3px solid #f8b862; background-color: #ede4cd; } -div.attention { border: 3px solid #f8b862; background-color: #f3f3f3; } -div.important { border: 3px solid #f0cfa0; background-color: #ede4cd; } -div.note { border: 3px solid #f0cfa0; background-color: #f3f3f3; } -div.hint { border: 3px solid #bed2c3; background-color: #f3f3f3; } -div.tip { border: 3px solid #bed2c3; background-color: #f3f3f3; } - -div.danger p.admonition-title, div.error p.admonition-title { - background-color: #b7282e; - border-bottom: 3px solid #f0908d; -} - -div.caution p.admonition-title, -div.warning p.admonition-title, -div.attention p.admonition-title { - background-color: #f19072; - border-bottom: 3px solid #f8b862; -} - -div.note p.admonition-title, div.important p.admonition-title { - background-color: #f8b862; - border-bottom: 3px solid #f0cfa0; -} - -div.hint p.admonition-title, div.tip p.admonition-title { - background-color: #7ebea5; - border-bottom: 3px solid #bed2c3; -} - -div.admonition ul, div.admonition ol, -div.warning ul, div.warning ol { - margin: 0.1em 0.5em 0.5em 3em; - padding: 0; -} - -div.versioninfo { - margin: 1em 0 0 0; - 
border: 1px solid #ccc; - background-color: #DDEAF0; - padding: 8px; - line-height: 1.3em; - font-size: 0.9em; -} - -.viewcode-back { - font-family: 'Lucida Grande', 'Lucida Sans Unicode', 'Geneva', - 'Verdana', sans-serif; -} - -div.viewcode-block:target { - background-color: #f4debf; - border-top: 1px solid #ac9; - border-bottom: 1px solid #ac9; -} - -p.versionchanged span.versionmodified { - font-size: 0.9em; - margin-right: 0.2em; - padding: 0.1em; - background-color: #DCE6A0; -} - -dl.field-list > dt { - color: white; - background-color: #82A0BE; -} - -dl.field-list > dd { - background-color: #f7f7f7; -} - -/* -- table styles ---------------------------------------------------------- */ - -table.docutils { - margin: 1em 0; - padding: 0; - border: 1px solid white; - background-color: #f7f7f7; -} - -table.docutils td, table.docutils th { - padding: 1px 8px 1px 5px; - border-top: 0; - border-left: 0; - border-right: 1px solid white; - border-bottom: 1px solid white; -} - -table.docutils td p { - margin-top: 0; - margin-bottom: 0.3em; -} - -table.field-list td, table.field-list th { - border: 0 !important; - word-break: break-word; -} - -table.footnote td, table.footnote th { - border: 0 !important; -} - -th { - color: white; - text-align: left; - padding-right: 5px; - background-color: #82A0BE; -} - -div.literal-block-wrapper div.code-block-caption { - background-color: #EEE; - border-style: solid; - border-color: #CCC; - border-width: 1px 5px; -} - -/* WIDE DESKTOP STYLE */ -@media only screen and (min-width: 1176px) { -body { - margin: 0 40px 0 40px; -} -} - -/* TABLET STYLE */ -@media only screen and (min-width: 768px) and (max-width: 991px) { -body { - margin: 0 40px 0 40px; -} -} - -/* MOBILE LAYOUT (PORTRAIT/320px) */ -@media only screen and (max-width: 767px) { -body { - margin: 0; -} -div.bodywrapper { - margin: 0; - width: 100%; - border: none; -} -div.sphinxsidebar { - display: none; -} -} - -/* MOBILE LAYOUT (LANDSCAPE/480px) */ -@media only screen and (min-width: 480px) and (max-width: 767px) { -body { - margin: 0 20px 0 20px; -} -} - -/* RETINA OVERRIDES */ -@media -only screen and (-webkit-min-device-pixel-ratio: 2), -only screen and (min-device-pixel-ratio: 2) { -} - -/* -- end ------------------------------------------------------------------- */ \ No newline at end of file diff --git a/docs/build/html/_static/bizstyle.js b/docs/build/html/_static/bizstyle.js deleted file mode 100644 index 4d5d01d..0000000 --- a/docs/build/html/_static/bizstyle.js +++ /dev/null @@ -1,30 +0,0 @@ -// -// bizstyle.js -// ~~~~~~~~~~~ -// -// Sphinx javascript -- for bizstyle theme. -// -// This theme was created by referring to 'sphinxdoc' -// -// :copyright: Copyright 2012-2014 by Sphinx team, see AUTHORS. -// :license: BSD, see LICENSE for details. -// -const initialiseBizStyle = () => { - if (navigator.userAgent.indexOf("iPhone") > 0 || navigator.userAgent.indexOf("Android") > 0) { - document.querySelector("li.nav-item-0 a").innerText = "Top" - } - const truncator = item => {if (item.textContent.length > 20) { - item.title = item.innerText - item.innerText = item.innerText.substr(0, 17) + "..." - } - } - document.querySelectorAll("div.related:first ul li:not(.right) a").slice(1).forEach(truncator); - document.querySelectorAll("div.related:last ul li:not(.right) a").slice(1).forEach(truncator); -} - -window.addEventListener("resize", - () => (document.querySelector("li.nav-item-0 a").innerText = (window.innerWidth <= 776) ? 
"Top" : "QuaPy 0.1.7 documentation") -) - -if (document.readyState !== "loading") initialiseBizStyle() -else document.addEventListener("DOMContentLoaded", initialiseBizStyle) \ No newline at end of file diff --git a/docs/build/html/_static/css3-mediaqueries.js b/docs/build/html/_static/css3-mediaqueries.js deleted file mode 100644 index 59735f5..0000000 --- a/docs/build/html/_static/css3-mediaqueries.js +++ /dev/null @@ -1 +0,0 @@ -if(typeof Object.create!=="function"){Object.create=function(e){function t(){}t.prototype=e;return new t}}var ua={toString:function(){return navigator.userAgent},test:function(e){return this.toString().toLowerCase().indexOf(e.toLowerCase())>-1}};ua.version=(ua.toString().toLowerCase().match(/[\s\S]+(?:rv|it|ra|ie)[\/: ]([\d.]+)/)||[])[1];ua.webkit=ua.test("webkit");ua.gecko=ua.test("gecko")&&!ua.webkit;ua.opera=ua.test("opera");ua.ie=ua.test("msie")&&!ua.opera;ua.ie6=ua.ie&&document.compatMode&&typeof document.documentElement.style.maxHeight==="undefined";ua.ie7=ua.ie&&document.documentElement&&typeof document.documentElement.style.maxHeight!=="undefined"&&typeof XDomainRequest==="undefined";ua.ie8=ua.ie&&typeof XDomainRequest!=="undefined";var domReady=function(){var e=[];var t=function(){if(!arguments.callee.done){arguments.callee.done=true;for(var t=0;t=200&&r.status<300||r.status===304||navigator.userAgent.indexOf("Safari")>-1&&typeof r.status==="undefined"){t(r.responseText)}else{n()}document.documentElement.style.cursor="";r=null}};r.send("")};var l=function(t){t=t.replace(e.REDUNDANT_COMPONENTS,"");t=t.replace(e.REDUNDANT_WHITESPACE,"$1");t=t.replace(e.WHITESPACE_IN_PARENTHESES,"($1)");t=t.replace(e.MORE_WHITESPACE," ");t=t.replace(e.FINAL_SEMICOLONS,"}");return t};var c={stylesheet:function(t){var n={};var r=[],i=[],s=[],o=[];var u=t.cssHelperText;var a=t.getAttribute("media");if(a){var f=a.toLowerCase().split(",")}else{var f=["all"]}for(var l=0;l-1&&a.href&&a.href.length!==0&&!a.disabled){r[r.length]=a}}if(r.length>0){var c=0;var d=function(){c++;if(c===r.length){i()}};var v=function(t){var n=t.href;f(n,function(r){r=l(r).replace(e.RELATIVE_URLS,"url("+n.substring(0,n.lastIndexOf("/"))+"/$1)");t.cssHelperText=r;d()},d)};for(u=0;u0){r.setAttribute("media",t.join(","))}document.getElementsByTagName("head")[0].appendChild(r);if(r.styleSheet){r.styleSheet.cssText=e}else{r.appendChild(document.createTextNode(e))}r.addedWithCssHelper=true;if(typeof n==="undefined"||n===true){cssHelper.parsed(function(t){var n=p(r,e);for(var i in n){if(n.hasOwnProperty(i)){g(i,n[i])}}a("newStyleParsed",r)})}else{r.parsingDisallowed=true}return r},removeStyle:function(e){return e.parentNode.removeChild(e)},parsed:function(e){if(n){s(e)}else{if(typeof t!=="undefined"){if(typeof e==="function"){e(t)}}else{s(e);d()}}},stylesheets:function(e){cssHelper.parsed(function(t){e(m.stylesheets||y("stylesheets"))})},mediaQueryLists:function(e){cssHelper.parsed(function(t){e(m.mediaQueryLists||y("mediaQueryLists"))})},rules:function(e){cssHelper.parsed(function(t){e(m.rules||y("rules"))})},selectors:function(e){cssHelper.parsed(function(t){e(m.selectors||y("selectors"))})},declarations:function(e){cssHelper.parsed(function(t){e(m.declarations||y("declarations"))})},properties:function(e){cssHelper.parsed(function(t){e(m.properties||y("properties"))})},broadcast:a,addListener:function(e,t){if(typeof t==="function"){if(!u[e]){u[e]={listeners:[]}}u[e].listeners[u[e].listeners.length]=t}},removeListener:function(e,t){if(typeof t==="function"&&u[e]){var n=u[e].listeners;for(var 
r=0;r=a||s&&l0}}else if("device-height"===e.substring(r-13,r)){c=screen.height;if(t!==null){if(u==="length"){return i&&c>=a||s&&c0}}else if("width"===e.substring(r-5,r)){l=document.documentElement.clientWidth||document.body.clientWidth;if(t!==null){if(u==="length"){return i&&l>=a||s&&l0}}else if("height"===e.substring(r-6,r)){c=document.documentElement.clientHeight||document.body.clientHeight;if(t!==null){if(u==="length"){return i&&c>=a||s&&c0}}else if("device-aspect-ratio"===e.substring(r-19,r)){return u==="aspect-ratio"&&screen.width*a[1]===screen.height*a[0]}else if("color-index"===e.substring(r-11,r)){var h=Math.pow(2,screen.colorDepth);if(t!==null){if(u==="absolute"){return i&&h>=a||s&&h0}}else if("color"===e.substring(r-5,r)){var p=screen.colorDepth;if(t!==null){if(u==="absolute"){return i&&p>=a||s&&p0}}else if("resolution"===e.substring(r-10,r)){var d;if(f==="dpcm"){d=o("1cm")}else{d=o("1in")}if(t!==null){if(u==="resolution"){return i&&d>=a||s&&d0}}else{return false}};var a=function(e){var t=e.getValid();var n=e.getExpressions();var r=n.length;if(r>0){for(var i=0;i0){u=false;for(var f=0;f0){l[c++]=","}l[c++]=h}}if(l.length>0){r[r.length]=cssHelper.addStyle("@media "+l.join("")+"{"+e.getCssText()+"}",t,false)}};var l=function(e,t){for(var n=0;n0}}var o=[],u=[];for(var f in i){if(i.hasOwnProperty(f)){o[o.length]=f;if(i[f]){u[u.length]=f}if(f==="all"){n=true}}}if(u.length>0){r[r.length]=cssHelper.addStyle(e.getCssText(),u,false)}var c=e.getMediaQueryLists();if(n){l(c)}else{l(c,o)}};var h=function(e){for(var t=0;td||Math.abs(s-t)>d){e=n;t=s;clearTimeout(r);r=setTimeout(function(){if(!i()){p()}else{cssHelper.broadcast("cssMediaQueriesTested")}},500)}};window.onresize=function(){var e=window.onresize||function(){};return function(){e();s()}}()};var m=document.documentElement;m.style.marginLeft="-32767px";setTimeout(function(){m.style.marginLeft=""},5e3);return function(){if(!i()){cssHelper.addListener("newStyleParsed",function(e){c(e.cssHelperParsed.stylesheet)});cssHelper.addListener("cssMediaQueriesTested",function(){if(ua.ie){m.style.width="1px"}setTimeout(function(){m.style.width="";m.style.marginLeft=""},0);cssHelper.removeListener("cssMediaQueriesTested",arguments.callee)});s();p()}else{m.style.marginLeft=""}v()}}());try{document.execCommand("BackgroundImageCache",false,true)}catch(e){} diff --git a/docs/build/html/_static/css3-mediaqueries_src.js b/docs/build/html/_static/css3-mediaqueries_src.js deleted file mode 100644 index 7878620..0000000 --- a/docs/build/html/_static/css3-mediaqueries_src.js +++ /dev/null @@ -1,1104 +0,0 @@ -/* -css3-mediaqueries.js - CSS Helper and CSS3 Media Queries Enabler - -author: Wouter van der Graaf -version: 1.0 (20110330) -license: MIT -website: http://code.google.com/p/css3-mediaqueries-js/ - -W3C spec: http://www.w3.org/TR/css3-mediaqueries/ - -Note: use of embedded