updated manuals

This commit is contained in:
Alejandro Moreo Fernandez 2025-10-03 13:00:59 +02:00
parent 8adcc33c59
commit c26c463f3d
7 changed files with 188 additions and 114 deletions

View File

@ -1,4 +1,4 @@
Change Log 0.1.10
Change Log 0.2.0
-----------------
CLEAN TODO-FILE
@ -6,7 +6,7 @@ CLEAN TODO-FILE
- Base code Refactor:
- Removing coupling between LabelledCollection and quantification methods; the fit interface changes:
def fit(data:LabelledCollection): -> def fit(X, y):
- Adding function "predict" (function "quantify" is still present as an alias)
- Adding function "predict" (function "quantify" is still present as an alias, for the nostalgic)
- Aggregative methods' behavior in terms of fit_classifier and how to treat the val_split is now
indicated exclusively at construction time, and it is no longer possible to indicate it at fit time.
This is because, in v<=0.1.9, one could create a method (e.g., ACC) and then indicate:
@ -21,15 +21,16 @@ CLEAN TODO-FILE
- A new parameter "on_calib_error" is passed to the constructor, which informs of the policy to follow
in case the abstention's calibration functions fail (which happens sometimes). Options include:
- 'raise': raises a RuntimeException (default)
- 'backup': reruns avoiding calibration
- 'backup': reruns by silently avoiding calibration
- Parameter "recalib" has been renamed "calib"
- Added aggregative bootstrap for deriving confidence regions (confidence intervals, ellipses in the simplex, or
ellipses in the CLR space). This method is efficient as it leverages the two phases of aggregative quantifiers.
This method applies resampling only to the aggregation phase, thus avoiding training many quantifiers, or
classifying the instances of a sample multiple times. See:
- quapy/method/confidence.py (new)
- the new example no. 15.
- BayesianCC moved to confidence.py, where methods having to do with confidence intervals live
- the new example no. 16.confidence_regions.py
- BayesianCC moved to confidence.py, where methods having to do with confidence intervals belong.
- Improved documentation of qp.plot module.
Change Log 0.1.9

View File

@ -340,10 +340,10 @@ and a set of test samples (for evaluation). QuaPy returns this data as a Labelle
(training) and two generation protocols (for validation and test samples), as follows:
```python
training, val_generator, test_generator = fetch_lequa2022(task=task)
training, val_generator, test_generator = qp.datasets.fetch_lequa2022(task=task)
```
See the `lequa2022_experiments.py` in the examples folder for further details on how to
See the `5a.lequa2022_experiments.py` in the examples folder for further details on how to
carry out experiments using these datasets.
The datasets are downloaded only once, and stored for fast reuse.
@ -365,6 +365,53 @@ Esuli, A., Moreo, A., Sebastiani, F., & Sperduti, G. (2022).
A Detailed Overview of LeQua@CLEF 2022: Learning to Quantify.
```
## LeQua 2024 Datasets
QuaPy also provides the datasets used for the [LeQua 2024 competition](https://lequa2024.github.io/).
In brief, there are 4 tasks:
* T1: binary quantification (by sentiment)
* T2: multiclass quantification (28 classes, merchandise products)
* T3: ordinal quantification (5-stars sentiment ratings)
* T4: binary sentiment quantification under a combination of covariate shift and prior shift
In all cases, the covariate space has 256 dimensions (extracted using the `ELECTRA-Small` model).
Every task consists of a training set, a set of validation samples (for model selection)
and a set of test samples (for evaluation). QuaPy returns this data as a LabelledCollection
(training bags) and sampling generation protocols (for validation and test bags).
T3 also offers the possibility to obtain a series of training bags (in the form of a
sampling generation protocol) instead of one single training bag. Use it as follows:
```python
training, val_generator, test_generator = qp.datasets.fetch_lequa2024(task=task)
```
See the `5b.lequa2024_experiments.py` in the examples folder for further details on how to
carry out experiments using these datasets.
The datasets are downloaded only once, and stored for fast reuse.
Some statistics are summarized below:
| Dataset | classes | train size | validation samples | test samples | docs by sample | type |
|---------|:-------:|:-----------:|:------------------:|:------------:|:--------------:|:--------:|
| T1 | 2 | 5000 | 1000 | 5000 | 250 | vector |
| T2 | 28 | 20000 | 1000 | 5000 | 1000 | vector |
| T3 | 5 | 100 samples | 1000 | 5000 | 200 | vector |
| T4 | 2 | 5000 | 1000 | 5000 | 250 | vector |
For further details on the datasets or the competition, we refer to
[the official site](https://lequa2024.github.io/data/) and
[the overview paper](http://nmis.isti.cnr.it/sebastiani/Publications/LQ2024.pdf).
```
Esuli, A., Moreo, A., Sebastiani, F., & Sperduti, G. (2024).
An Overview of LeQua 2024, the 2nd International Data Challenge on Learning to Quantify,
Proceedings of the 4th International Workshop on Learning to Quantify (LQ 2024),
ECML-PKDD 2024, Vilnius, Lithuania.
```
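A minimal end-to-end sketch follows; the choice of task (`T2`) and of the PACC quantifier is merely illustrative:
```python
import quapy as qp
from quapy.method.aggregative import PACC
from sklearn.linear_model import LogisticRegression

# fetch the training bag and the validation/test sampling protocols
training, val_generator, test_generator = qp.datasets.fetch_lequa2024(task='T2')

quantifier = PACC(LogisticRegression())
quantifier.fit(*training.Xy)

# the generators are sampling protocols, directly usable for evaluation
mrae = qp.evaluation.evaluate(quantifier, protocol=test_generator, error_metric='mrae')
print(f'MRAE = {mrae:.4f}')
```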
## IFCB Plankton dataset
IFCB is a dataset of plankton species in water samples hosted in [Zenodo](https://zenodo.org/records/10036244).
@ -410,8 +457,12 @@ Journal of Plankton Research 41 (4), 449-463](https://par.nsf.gov/servlets/purl/
## Adding Custom Datasets
It is straightforward to import your own datasets into QuaPy.
In what follows, there are some code snippets for doing so; see also the example
[3.custom_collection.py](https://github.com/HLT-ISTI/QuaPy/blob/master/examples/3.custom_collection.py).
QuaPy provides data loaders for simple formats dealing with
text, following the format:
text; for example, use `qp.data.reader.from_text` for the following format:
```
class-id \t first document's pre-processed text \n
@ -419,13 +470,16 @@ class-id \t second document's pre-processed text \n
...
```
and sparse representations of the form:
or `qp.data.reader.from_sparse` for sparse representations of the form:
```
{-1, 0, or +1} col(int):val(float) col(int):val(float) ... \n
...
```
Both functions return a tuple `(X, y)` containing the instances (a list of strings in the case of
`from_text`, a sparse matrix in the case of `from_sparse`) and the corresponding labels.
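For instance, a minimal sketch using `from_text` (the path is hypothetical):
```python
import quapy as qp

# hypothetical path to a file following the tab-separated format shown above
X, y = qp.data.reader.from_text('./my_data/train.dat')
data = qp.data.LabelledCollection(X, y)
print(data.classes_, data.prevalence())
```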
The code in charge of loading a LabelledCollection is:
```python
def load(cls, path:str, loader_func:callable):
    return LabelledCollection(*loader_func(path))
```
indicating that any _loader_func_ (e.g., a user-defined one) which
indicating that any `loader_func` (e.g., `from_text`, `from_sparse`, `from_csv`, or a user-defined one) which
returns valid arguments for initializing a _LabelledCollection_ object will allow
to load any collection. In particular, the _LabelledCollection_ receives as
arguments the instances (as an iterable) and the labels (as an iterable) and,
additionally, the number of classes can be specified (it would otherwise be
inferred from the labels, but that requires at least one positive example for
to load any collection. More specifically, the _LabelledCollection_ receives as
arguments the _instances_ (iterable) and the _labels_ (iterable) and,
optionally, the number of classes (it would be
inferred from the labels if not indicated, but this requires at least one
positive example for
all classes to be present in the collection).
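For illustration, a minimal (hypothetical) collection built directly from iterables; the optional argument for the classes is assumed to be named `classes`:
```python
from quapy.data import LabelledCollection

texts = ['good film', 'bad film', 'great plot']
labels = [1, 0, 1]
# class 2 is absent from the labels; declaring it explicitly keeps it in the codeframe
data = LabelledCollection(texts, labels, classes=[0, 1, 2])
```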
The same _loader_func_ can be passed to a Dataset, along with two
@ -452,20 +507,23 @@ import quapy as qp
train_path = '../my_data/train.dat'
test_path = '../my_data/test.dat'
def my_custom_loader(path):
def my_custom_loader(path, **custom_kwargs):
    with open(path, 'rb') as fin:
        ...
    return instances, labels
data = qp.data.Dataset.load(train_path, test_path, my_custom_loader)
data = qp.data.Dataset.load(train_path, test_path, my_custom_loader, **custom_kwargs)
```
### Data Processing
QuaPy implements a number of preprocessing functions in the package _qp.data.preprocessing_, including:
QuaPy implements a number of preprocessing functions in the package `qp.data.preprocessing`, including:
* _text2tfidf_: tfidf vectorization
* _reduce_columns_: reducing the number of columns based on term frequency
* _standardize_: transforms the column values into z-scores (i.e., subtracts the mean and normalizes by the standard deviation, so
that the column values have zero mean and unit variance).
* _index_: transforms textual tokens into lists of numeric ids
These functions are applied to `Dataset` objects, and offer the possibility to apply the transformation
inline (thus modifying the original dataset), or to return a modified copy.
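For instance, a minimal sketch applying `text2tfidf` inline to a textual dataset:
```python
import quapy as qp

dataset = qp.datasets.fetch_reviews('hp', pickle=True)
# inplace=True transforms the dataset in place; inplace=False would return a modified copy
qp.data.preprocessing.text2tfidf(dataset, min_df=5, inplace=True)
```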

View File

@ -46,18 +46,18 @@ e.g.:
```python
qp.environ['SAMPLE_SIZE'] = 100 # once for all
true_prev = np.asarray([0.5, 0.3, 0.2]) # let's assume 3 classes
estim_prev = np.asarray([0.1, 0.3, 0.6])
true_prev = [0.5, 0.3, 0.2] # let's assume 3 classes
estim_prev = [0.1, 0.3, 0.6]
error = qp.error.mrae(true_prev, estim_prev)
print(f'mrae({true_prev}, {estim_prev}) = {error:.3f}')
```
will print:
```
mrae([0.500, 0.300, 0.200], [0.100, 0.300, 0.600]) = 0.914
mrae([0.5, 0.3, 0.2], [0.1, 0.3, 0.6]) = 0.914
```
Finally, it is possible to instantiate QuaPy's quantification
It is also possible to instantiate QuaPy's quantification
error functions from strings using, e.g.:
```python
@ -85,7 +85,7 @@ print(f'MAE = {mae:.4f}')
```
It is often desirable to evaluate our system using more than one
single evaluatio measure. In this case, it is convenient to generate
single evaluation measure. In this case, it is convenient to generate
a _report_. A report in QuaPy is a dataframe accounting for all the
true prevalence values with their corresponding prevalence values
as estimated by the quantifier, along with the error each has given
@ -104,7 +104,7 @@ report['estim-prev'] = report['estim-prev'].map(F.strprev)
print(report)
print('Averaged values:')
print(report.mean())
print(report.mean(numeric_only=True))
```
This will produce an output like:
@ -141,11 +141,14 @@ true_prevs, estim_prevs = qp.evaluation.prediction(quantifier, protocol=prot)
All the evaluation functions implement specific optimizations for speeding-up
the evaluation of aggregative quantifiers (i.e., of instances of _AggregativeQuantifier_).
The optimization comes down to generating classification predictions (either crisp or soft)
only once for the entire test set, and then applying the sampling procedure to the
predictions, instead of generating samples of instances and then computing the
classification predictions every time. This is only possible when the protocol
is an instance of _OnLabelledCollectionProtocol_. The optimization is only
is an instance of _OnLabelledCollectionProtocol_.
The optimization is only
carried out when the number of classification predictions thus generated would be
smaller than the number of predictions required for the entire protocol; e.g.,
if the original dataset contains 1M instances, but the protocol is such that it would
@ -156,4 +159,4 @@ precompute all the predictions irrespectively of the number of instances and num
Finally, this can be deactivated by setting _aggr_speedup=False_. Note that this optimization
is not only applied for the final evaluation, but also for the internal evaluations carried
out during _model selection_. Since these are typically many, the heuristic can help reduce the
execution time a lot.
execution time significantly.
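For instance, a sketch showing how the speed-up could be explicitly disabled (the dataset and quantifier are illustrative):
```python
import quapy as qp
from quapy.protocol import APP
from quapy.method.aggregative import PACC
from sklearn.linear_model import LogisticRegression

qp.environ['SAMPLE_SIZE'] = 100
train, test = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=5).train_test
quantifier = PACC(LogisticRegression()).fit(*train.Xy)

# same evaluation, but with the aggregative speed-up deactivated
mae = qp.evaluation.evaluate(quantifier, protocol=APP(test), error_metric='mae', aggr_speedup=False)
```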

View File

@ -1,7 +1,7 @@
# Quantification Methods
Quantification methods can be categorized as belonging to
`aggregative` and `non-aggregative` groups.
`aggregative`, `non-aggregative`, and `meta-learning` groups.
Most methods included in QuaPy at the moment are of type `aggregative`
(though we plan to add many more methods in the near future), i.e.,
are methods characterized by the fact that
@ -12,21 +12,17 @ Any quantifier in QuaPy should extend the class `BaseQuantifier`,
and implement some abstract methods:
```python
@abstractmethod
def fit(self, data: LabelledCollection): ...
def fit(self, X, y): ...
@abstractmethod
def quantify(self, instances): ...
def predict(self, X): ...
```
The meaning of those functions should be familiar to those
used to working with scikit-learn since the class structure of QuaPy
is directly inspired by scikit-learn's _Estimators_. Functions
`fit` and `quantify` are used to train the model and to provide
class estimations (the reason why
scikit-learn' structure has not been adopted _as is_ in QuaPy responds to
the fact that scikit-learn's `predict` function is expected to return
one output for each input element --e.g., a predicted label for each
instance in a sample-- while in quantification the output for a sample
is one single array of class prevalences).
`fit` and `predict` (for which there is an alias `quantify`)
are used to train the model and to provide
class estimations.
Quantifiers also extend from scikit-learn's `BaseEstimator`, in order
to simplify the use of `set_params` and `get_params` used in
[model selection](./model-selection).
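For illustration, a minimal (hypothetical) quantifier implementing this interface, which simply memorizes the training prevalence and returns it for every sample:
```python
import numpy as np
import quapy.functional as F
from quapy.method.base import BaseQuantifier

class TrainPrevalence(BaseQuantifier):
    """a trivial quantifier: always returns the prevalence observed during training"""

    def fit(self, X, y):
        self.estimated_prev_ = F.prevalence_from_labels(y, classes=np.unique(y))
        return self

    def predict(self, X):
        return self.estimated_prev_
```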
@ -40,21 +36,26 @@ The methods that any `aggregative` quantifier must implement are:
```python
@abstractmethod
def aggregation_fit(self, classif_predictions: LabelledCollection, data: LabelledCollection):
def aggregation_fit(self, classif_predictions, labels):
@abstractmethod
def aggregate(self, classif_predictions:np.ndarray): ...
def aggregate(self, classif_predictions): ...
```
These two functions replace the `fit` and `quantify` methods, since those
come with default implementations. The `fit` function is provided and amounts to:
The argument `classif_predictions` is whatever the method `classify` returns.
QuaPy comes with default implementations that cover most common cases, but you can
override `classify` in case your method requires further or different information to work.
These two functions replace the `fit` and `predict` methods, which
come with default implementations. For instance, the `fit` function is
provided and amounts to:
```python
def fit(self, data: LabelledCollection, fit_classifier=True, val_split=None):
self._check_init_parameters()
classif_predictions = self.classifier_fit_predict(data, fit_classifier, predict_on=val_split)
self.aggregation_fit(classif_predictions, data)
return self
def fit(self, X, y):
    self._check_init_parameters()
    classif_predictions, labels = self.classifier_fit_predict(X, y)
    self.aggregation_fit(classif_predictions, labels)
    return self
```
Note that this function fits the classifier, and generates the predictions. This is assumed
@ -72,11 +73,11 @ overridden (if needed) and allows the method to quickly raise any exception based
found in the `__init__` arguments, thus avoiding to break after training the classifier and generating
predictions.
Similarly, the function `quantify` is provided, and amounts to:
Similarly, the function `predict` (alias `quantify`) is provided, and amounts to:
```python
def quantify(self, instances):
classif_predictions = self.classify(instances)
def predict(self, X):
    classif_predictions = self.classify(X)
    return self.aggregate(classif_predictions)
```
@ -84,12 +85,14 @ in which only the function `aggregate` is required to be overridden in most cases
Aggregative quantifiers are expected to maintain a classifier (which is
accessed through the `@property` `classifier`). This classifier is
given as input to the quantifier, and can be already fit
on external data (in which case, the `fit_learner` argument should
be set to False), or be fit by the quantifier's fit (default).
given as input to the quantifier, and will be trained by the quantifier's fit (default).
Alternatively, the classifier can be already fit on external data; in this case, the `fit_classifier`
argument in the `__init__` should be set to False (see [4.using_pretrained_classifier.py](https://github.com/HLT-ISTI/QuaPy/blob/master/examples/4.using_pretrained_classifier.py)
for a full code example).
The above patterns (in training: fit the classifier, then fit the aggregation;
in test: classify, then aggregate) allows QuaPy to optimize many internal procedures.
The above patterns (in training: (i) fit the classifier, then (ii) fit the aggregation;
in test: (i) classify, then (ii) aggregate) allow QuaPy to optimize many internal procedures,
on the grounds that steps (i) are slower than steps (ii).
In particular, the model selection routine takes advantage of this two-step process
and generates classifiers only for the valid combinations of hyperparameters of the
classifier, and then _clones_ these classifiers and explores the combinations
@ -124,6 +127,7 @@ import quapy.functional as F
from sklearn.svm import LinearSVC
training, test = qp.datasets.fetch_twitter('hcr', pickle=True).train_test
Xtr, ytr = training.Xy
# instantiate a classifier learner, in this case a SVM
svm = LinearSVC()
@ -131,7 +135,7 @@ svm = LinearSVC()
# instantiate a Classify & Count with the SVM
# (an alias is available in qp.method.aggregative.ClassifyAndCount)
model = qp.method.aggregative.CC(svm)
model.fit(training)
model.fit(Xtr, ytr)
estim_prevalence = model.predict(test.instances)
```
@ -153,26 +157,14 @ predictions. This parameter can also be set with an integer,
indicating that the parameters should be estimated by means of
_k_-fold cross-validation, for which the integer indicates the
number _k_ of folds (the default value is 5). Finally, `val_split` can be set to a
specific held-out validation set (i.e., an instance of `LabelledCollection`).
The specification of `val_split` can be
postponed to the invokation of the fit method (if `val_split` was also
set in the constructor, the one specified at fit time would prevail),
e.g.:
```python
model = qp.method.aggregative.ACC(svm)
# perform 5-fold cross validation for estimating ACC's parameters
# (overrides the default val_split=0.4 in the constructor)
model.fit(training, val_split=5)
```
specific held-out validation set (i.e., a tuple `(X,y)`).
The following code illustrates the case in which PCC is used:
```python
model = qp.method.aggregative.PCC(svm)
model.fit(training)
estim_prevalence = model.predict(test.instances)
model.fit(Xtr, ytr)
estim_prevalence = model.predict(test.X)
print('classifier:', model.classifier)
```
In this case, QuaPy will print:
@ -185,11 +177,11 @@ is not a probabilistic classifier (i.e., it does not implement the
`predict_proba` method) and so, the classifier will be converted to
a probabilistic one through [calibration](https://scikit-learn.org/stable/modules/calibration.html).
As a result, the classifier that is printed in the second line points
to a `CalibratedClassifier` instance. Note that calibration can only
be applied to hard classifiers when `fit_learner=True`; an exception
to a `CalibratedClassifierCV` instance. Note that calibration can only
be applied to hard classifiers if `fit_classifier=True`; an exception
will be raised otherwise.
Lastly, everything we said aboud ACC and PCC
Lastly, everything we said about ACC and PCC
applies to PACC as well.
_New in v0.1.9_: quantifiers ACC and PACC now have three additional arguments: `method`, `solver` and `norm`:
@ -259,22 +251,28 @@ An example of use can be found below:
import quapy as qp
from sklearn.linear_model import LogisticRegression
dataset = qp.datasets.fetch_twitter('hcr', pickle=True)
train, test = qp.datasets.fetch_twitter('hcr', pickle=True).train_test
model = qp.method.aggregative.EMQ(LogisticRegression())
model.fit(dataset.training)
estim_prevalence = model.predict(dataset.test.instances)
model.fit(*train.Xy)
estim_prevalence = model.predict(test.X)
```
_New in v0.1.7_: EMQ now accepts two new parameters in the construction method, namely
`exact_train_prev` which allows to use the true training prevalence as the departing
prevalence estimation (default behaviour), or instead an approximation of it as
EMQ accepts additional parameters in the construction method:
* `exact_train_prev`: set to True for using the true training prevalence as the departing
prevalence estimation (default behaviour), or to False for using an approximation of it as
suggested by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html)
(by setting `exact_train_prev=False`).
The other parameter is `recalib` which allows to indicate a calibration method, among those
* `calib`: allows one to indicate a calibration method, among those
proposed by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html),
including the Bias-Corrected Temperature Scaling, Vector Scaling, etc.
See the API documentation for further details.
including the Bias-Corrected Temperature Scaling
(`bcts`), Vector Scaling (`vs`), No-Bias Vector Scaling (`nbvs`),
or Temperature Scaling (`ts`); default is `None` (no calibration).
* `on_calib_error`: indicates the policy to follow in case the calibrator fails at runtime.
Options include `raise` (default), in which case a RuntimeException is raised; and `backup`, in which
case the calibrator is silently skipped.
You can use the class method `EMQ_BCTS` to effortlessly instantiate EMQ with the best performing
heuristics found by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html). See the API documentation for further details.
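Putting these options together (the specific parameter combination below is merely illustrative):
```python
from quapy.method.aggregative import EMQ
from sklearn.linear_model import LogisticRegression

# approximate training prevalence + BCTS calibration; skip calibration if it fails
model = EMQ(LogisticRegression(), exact_train_prev=False, calib='bcts', on_calib_error='backup')

# or, directly, the best-performing heuristics of Alexandari et al. (2020)
model = EMQ.EMQ_BCTS(LogisticRegression())
```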
### Hellinger Distance y (HDy)
@ -289,11 +287,10 @@ This method works with a probabilistic classifier (hard classifiers
can be used as well and will be calibrated) and requires a validation
set to estimate the parameters of the mixture model. Just like
ACC and PACC, this quantifier receives a `val_split` argument
in the constructor (or in the fit method, in which case the previous
value is overridden) that can either be a float indicating the proportion
in the constructor that can either be a float indicating the proportion
of training data to be taken as the validation set (in a random
stratified split), or a validation set (i.e., an instance of
`LabelledCollection`) itself.
stratified split), or the validation set itself (i.e., a tuple
`(X,y)`).
HDy was proposed as a binary quantifier and the implementation
provided in QuaPy accepts only binary datasets.
@ -309,11 +306,11 @@ dataset = qp.datasets.fetch_reviews('hp', pickle=True)
qp.data.preprocessing.text2tfidf(dataset, min_df=5, inplace=True)
model = qp.method.aggregative.HDy(LogisticRegression())
model.fit(dataset.training)
estim_prevalence = model.predict(dataset.test.instances)
model.fit(*dataset.training.Xy)
estim_prevalence = model.predict(dataset.test.X)
```
_New in v0.1.7:_ QuaPy now provides an implementation of the generalized
QuaPy also provides an implementation of the generalized
"Distribution Matching" approaches for multiclass, inspired by the framework
of [Firat (2016)](https://arxiv.org/abs/1606.00868). One can instantiate
a variant of HDy for multiclass quantification as follows:
@ -322,17 +319,22 @@ a variant of HDy for multiclass quantification as follows:
multiclassHDy = qp.method.aggregative.DMy(classifier=LogisticRegression(), divergence='HD', cdf=False)
```
_New in v0.1.7:_ QuaPy now provides an implementation of the "DyS"
QuaPy also provides an implementation of the "DyS"
framework proposed by [Maletzke et al (2020)](https://ojs.aaai.org/index.php/AAAI/article/view/4376)
and the "SMM" method proposed by [Hassan et al (2019)](https://ieeexplore.ieee.org/document/9260028)
(thanks to _Pablo González_ for the contributions!)
### Threshold Optimization methods
_New in v0.1.7:_ QuaPy now implements Forman's threshold optimization methods;
QuaPy implements Forman's threshold optimization methods;
see, e.g., [(Forman 2006)](https://dl.acm.org/doi/abs/10.1145/1150402.1150423)
and [(Forman 2008)](https://link.springer.com/article/10.1007/s10618-008-0097-y).
These include: T50, MAX, X, Median Sweep (MS), and its variant MS2.
These include: `T50`, `MAX`, `X`, Median Sweep (`MS`), and its variant `MS2`.
These methods are binary-only and implement different heuristics for
improving the stability of the denominator of the ACC adjustment (`tpr-fpr`).
The methods are called "threshold" since said heuristics have to do
with different choices of the underlying classifier's threshold.
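For instance, a sketch using the Median Sweep variant (assuming, as the names above suggest, that these classes are exposed in `qp.method.aggregative`):
```python
import quapy as qp
from sklearn.linear_model import LogisticRegression

# binary-only: MS2 stabilizes the ACC adjustment by sweeping the classifier's threshold
model = qp.method.aggregative.MS2(LogisticRegression())
```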
### Explicit Loss Minimization
@ -415,16 +417,18 @@ model.fit(dataset.training)
estim_prevalence = model.predict(dataset.test.instances)
```
Check the examples on [explicit_loss_minimization](https://github.com/HLT-ISTI/QuaPy/blob/devel/examples/5.explicit_loss_minimization.py)
Check the examples on [explicit loss minimization](https://github.com/HLT-ISTI/QuaPy/blob/devel/examples/17.explicit_loss_minimization.py)
and on [one versus all quantification](https://github.com/HLT-ISTI/QuaPy/blob/devel/examples/10.one_vs_all.py) for more details.
**Note**, however, that the _one versus all_ approach is considered inappropriate under prior probability shift.
### Kernel Density Estimation methods (KDEy)
_New in v0.1.8_: QuaPy now provides implementations for the three variants
QuaPy provides implementations for the three variants
of KDE-based methods proposed in
_[Moreo, A., González, P. and del Coz, J.J., 2023.
_[Moreo, A., González, P. and del Coz, J.J.
Kernel Density Estimation for Multiclass Quantification.
arXiv preprint arXiv:2401.00490](https://arxiv.org/abs/2401.00490)_.
Machine Learning. Vol 114 (92), 2025](https://link.springer.com/article/10.1007/s10994-024-06726-5)_
(a [preprint](https://arxiv.org/abs/2401.00490) is available online).
The variants differ in the divergence metric to be minimized:
- KDEy-HD: minimizes the (squared) Hellinger Distance and solves the problem via a Monte Carlo approach
@ -435,22 +439,27 @@ These methods are specifically devised for multiclass problems (although they ca
binary problems too).
All KDE-based methods depend on the hyperparameter `bandwidth` of the kernel. Typical values
that can be explored in model selection range in [0.01, 0.25]. The methods' performance
vary smoothing with smooth variations of this hyperparameter.
that can be explored in model selection range in [0.01, 0.25]. Previous experiments reveal the methods' performance
varies smoothly under small variations of this hyperparameter.
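For instance, a sketch of the maximum-likelihood variant `KDEyML` (the bandwidth value is illustrative):
```python
from quapy.method.aggregative import KDEyML
from sklearn.linear_model import LogisticRegression

# bandwidth is the key hyperparameter; values in [0.01, 0.25] are typical
model = KDEyML(LogisticRegression(), bandwidth=0.1)
```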
## Composable Methods
The [](quapy.method.composable) module allows the composition of quantification methods from loss functions and feature transformations. Any composed method solves a linear system of equations by minimizing the loss after transforming the data. Methods of this kind include ACC, PACC, HDx, HDy, and many other well-known methods, as well as an unlimited number of re-combinations of their building blocks.
The `quapy.method.composable` module integrates [qunfold](https://github.com/mirkobunse/qunfold), allowing the composition
of quantification methods from loss functions and feature transformations (thanks to Mirko Bunse for the integration!).
Any composed method solves a linear system of equations by minimizing the loss after transforming the data. Methods of this kind include ACC, PACC, HDx, HDy, and many other well-known methods, as well as an unlimited number of re-combinations of their building blocks.
### Installation
```sh
pip install --upgrade pip setuptools wheel
pip install "jax[cpu]"
pip install "qunfold @ git+https://github.com/mirkobunse/qunfold@v0.1.4"
pip install "qunfold @ git+https://github.com/mirkobunse/qunfold@v0.1.5"
```
**Note:** since version 0.2.0, QuaPy is only compatible with qunfold >=0.1.5.
### Basics
The composition of a method is implemented through the [](quapy.method.composable.ComposableQuantifier) class. Its documentation also features an example to get you started in composing your own methods.
@ -529,10 +538,11 @@ from quapy.method.meta import Ensemble
from sklearn.linear_model import LogisticRegression
dataset = qp.datasets.fetch_UCIBinaryDataset('haberman')
train, test = dataset.train_test
model = Ensemble(quantifier=ACC(LogisticRegression()), size=30, policy='ave', n_jobs=-1)
model.fit(dataset.training)
estim_prevalence = model.predict(dataset.test.instances)
model.fit(*train.Xy)
estim_prevalence = model.predict(test.X)
```
Other aggregation policies implemented in QuaPy include:
@ -579,13 +589,13 @@ learner = NeuralClassifierTrainer(cnn, device='cuda')
# train QuaNet
model = QuaNet(learner, device='cuda')
model.fit(dataset.training)
estim_prevalence = model.predict(dataset.test.instances)
model.fit(*dataset.training.Xy)
estim_prevalence = model.predict(dataset.test.X)
```
## Confidence Regions for Class Prevalence Estimation
_(New in v0.1.10!)_ Some quantification methods go beyond providing a single point estimate of class prevalence values and also produce confidence regions, which characterize the uncertainty around the point estimate. In QuaPy, two such methods are currently implemented:
_(New in v0.2.0!)_ Some quantification methods go beyond providing a single point estimate of class prevalence values and also produce confidence regions, which characterize the uncertainty around the point estimate. In QuaPy, two such methods are currently implemented:
* Aggregative Bootstrap: The Aggregative Bootstrap method extends any aggregative quantifier by generating confidence regions for class prevalence estimates through bootstrapping. Key features of this method include:
@ -593,9 +603,9 @@ _(New in v0.1.10!)_ Some quantification methods go beyond providing a single poi
During training, bootstrap repetitions are performed only after training the classifier once. These repetitions are used to train multiple aggregation functions.
During inference, bootstrap is applied over pre-classified test instances.
* General Applicability: Aggregative Bootstrap can be applied to any aggregative quantifier.
For further information, check the [example](https://github.com/HLT-ISTI/QuaPy/tree/master/examples) provided.
For further information, check the [example](https://github.com/HLT-ISTI/QuaPy/tree/master/examples/16.confidence_regions.py) provided.
* BayesianCC: is a Bayesian variant of the Adjusted Classify & Count (ACC) quantifier (see more details in [Aggregative Quantifiers](#bayesiancc)).
* BayesianCC: is a Bayesian variant of the Adjusted Classify & Count (ACC) quantifier; see more details in the [example](https://github.com/HLT-ISTI/QuaPy/tree/master/examples/14.bayesian_quantification.py) provided.
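A sketch of how the Aggregative Bootstrap might be used follows; the constructor arguments and the `predict_conf` call mirror the bundled example and should be taken as assumptions rather than a definitive API:
```python
import quapy as qp
from quapy.method.aggregative import PACC
from quapy.method.confidence import AggregativeBootstrap
from sklearn.linear_model import LogisticRegression

train, test = qp.datasets.fetch_UCIBinaryDataset('haberman').train_test

# the classifier is trained once; bootstrap resampling is confined to the aggregation phase
quantifier = AggregativeBootstrap(PACC(LogisticRegression()), confidence_level=0.95)
quantifier.fit(*train.Xy)
point_estimate, conf_region = quantifier.predict_conf(test.X)
```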
Confidence regions are constructed around a point estimate, which is typically computed as the mean value of a set of samples.
The confidence region can be instantiated in three ways:

View File

@ -87,7 +87,7 @@ model = qp.model_selection.GridSearchQ(
error='mae', # the error to optimize is the MAE (a quantification-oriented loss)
refit=True, # retrain on the whole labelled set once done
verbose=True # show information as the process goes on
).fit(training)
).fit(*training.Xy)
print(f'model selection ended: best hyper-parameters={model.best_params_}')
model = model.best_model_
@ -133,7 +133,7 @@ learner = GridSearchCV(
LogisticRegression(),
param_grid={'C': np.logspace(-4, 5, 10), 'class_weight': ['balanced', None]},
cv=5)
model = DistributionMatching(learner).fit(dataset.train)
model = DistributionMatching(learner).fit(*dataset.train.Xy)
```
However, this is conceptually flawed, since the model should be

View File

@ -2,6 +2,9 @@
The module _qp.plot_ implements some basic plotting functions
that can help analyse the performance of a quantification method.
See the provided
[code example](https://github.com/HLT-ISTI/QuaPy/blob/master/examples/13.plotting.py)
for further details.
All plotting functions receive as inputs the outcomes of
some experiments and include, for each experiment,
@ -77,7 +80,7 @@ def gen_data():
method_names, true_prevs, estim_prevs, tr_prevs = [], [], [], []
for method_name, model in models():
model.fit(train)
model.fit(*train.Xy)
true_prev, estim_prev = qp.evaluation.prediction(model, APP(test, repeats=100, random_state=0))
method_names.append(method_name)
@ -171,7 +174,7 @@ def gen_data():
training_size = 5000
# since the problem is binary, it suffices to specify the negative prevalence (the positive one is constrained)
train_sample = train.sampling(training_size, 1-training_prevalence)
model.fit(train_sample)
model.fit(*train_sample.Xy)
true_prev, estim_prev = qp.evaluation.prediction(model, APP(test, repeats=100, random_state=0))
method_name = 'CC$_{'+f'{int(100*training_prevalence)}' + '\%}$'
method_data.append((method_name, true_prev, estim_prev, train_sample.prevalence()))

View File

@ -1,7 +1,5 @@
# Protocols
_New in v0.1.7!_
Quantification methods are expected to behave robustly in the presence of
shift. For this reason, quantification methods need to be confronted with
samples exhibiting widely varying amounts of shift.
@ -106,15 +104,16 @@ train, test = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=5).train_test
# model selection
train, val = train.split_stratified(train_prop=0.75)
Xtr, ytr = train.Xy
quantifier = qp.model_selection.GridSearchQ(
quantifier,
param_grid={'classifier__C': np.logspace(-2, 2, 5)},
protocol=APP(val) # <- this is the protocol we use for generating validation samples
).fit(train)
).fit(Xtr, ytr)
# default values are n_prevalences=21, repeats=10, random_state=0; this is equivalent to:
# val_app = APP(val, n_prevalences=21, repeats=10, random_state=0)
# quantifier = GridSearchQ(quantifier, param_grid, protocol=val_app).fit(train)
# quantifier = GridSearchQ(quantifier, param_grid, protocol=val_app).fit(Xtr, ytr)
# evaluation with APP
mae = qp.evaluation.evaluate(quantifier, protocol=APP(test), error_metric='mae')