improving conf regions docs

This commit is contained in:
Alejandro Moreo Fernandez 2024-12-02 12:03:15 +01:00
parent c79b76516c
commit c8235ddb2a
4 changed files with 33 additions and 9 deletions

View File

@@ -1,4 +1,3 @@
- [TODO] document confidence in manuals
- [TODO] Test the return_type="index" in protocols and finish the "distributing_samples.py" example
- [TODO] Add EDy (an implementation is available at quantificationlib)
- [TODO] add ensemble methods SC-MQ, MC-SQ, MC-MQ

View File

@@ -221,7 +221,7 @@ Options are:
* `"condsoftmax"` applies softmax normalization only if the prevalence vector lies outside of the probability simplex.
#### BayesianCC (_New in v0.1.9_!)
#### BayesianCC
The `BayesianCC` is a variant of ACC introduced in
[Ziegler, A. and Czyż, P. "Bayesian quantification with black-box estimators", arXiv (2023)](https://arxiv.org/abs/2302.09159),
@@ -280,8 +280,8 @@ See the API documentation for further details.
### Hellinger Distance y (HDy)
Implementation of the method based on the Hellinger Distance y (HDy) proposed by
[González-Castro, V., Alaiz-Rodrı́guez, R., and Alegre, E. (2013). Class distribution
estimation based on the Hellinger distance. Information Sciences, 218:146164.](https://www.sciencedirect.com/science/article/pii/S0020025512004069)
[González-Castro, V., Alaiz-Rodríguez, R., and Alegre, E. (2013). Class distribution
estimation based on the Hellinger distance. Information Sciences, 218:146-164.](https://www.sciencedirect.com/science/article/pii/S0020025512004069)
It is implemented in `qp.method.aggregative.HDy` (also accessible
through the alias `qp.method.aggregative.HellingerDistanceY`).
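A minimal usage sketch follows (the choice of dataset and classifier is illustrative; HDy is a binary method, so any binary dataset would do):

```python
from sklearn.linear_model import LogisticRegression
import quapy as qp
from quapy.method.aggregative import HDy

# load a binary sentiment dataset as tfidf vectors
data = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5)

# HDy wraps a probabilistic classifier
hdy = HDy(LogisticRegression())
hdy.fit(data.training)

# estimate the class prevalence values in the test set
estim_prevalence = hdy.quantify(data.test.instances)
```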
@@ -423,7 +423,7 @@ _New in v0.1.8_: QuaPy now provides implementations for the three variants
of KDE-based methods proposed in
_[Moreo, A., González, P. and del Coz, J.J., 2023.
Kernel Density Estimation for Multiclass Quantification.
arXiv preprint arXiv:2401.00490.](https://arxiv.org/abs/2401.00490)_.
arXiv preprint arXiv:2401.00490](https://arxiv.org/abs/2401.00490)_.
The variants differ in the divergence metric to be minimized:
- KDEy-HD: minimizes the (squared) Hellinger Distance and solves the problem via a Monte Carlo approach
@@ -582,3 +582,25 @@ model.fit(dataset.training)
estim_prevalence = model.quantify(dataset.test.instances)
```
## Confidence Regions for Class Prevalence Estimation
_(New in v0.1.10!)_ Some quantification methods go beyond providing a single point estimate of class prevalence values and also produce confidence regions, which characterize the uncertainty around the point estimate. In QuaPy, two such methods are currently implemented:
* Aggregative Bootstrap: extends any aggregative quantifier with confidence regions for the class prevalence estimates, generated through bootstrapping. Key features of this method include:
  * Optimized computation: the bootstrap is applied to pre-classified instances, which significantly speeds up both training and inference.
    During training, the classifier is trained only once, and the bootstrap repetitions are used to train multiple aggregation functions;
    during inference, the bootstrap is applied over the pre-classified test instances.
  * General applicability: Aggregative Bootstrap can be applied to any aggregative quantifier.

  For further information, check the [example](https://github.com/HLT-ISTI/QuaPy/tree/master/examples) provided.
* BayesianCC: a Bayesian variant of the Adjusted Classify & Count (ACC) quantifier (see [Aggregative Quantifiers](#bayesiancc) for further details).
Confidence regions are constructed around a point estimate, which is typically computed as the mean value of a set of samples.
The confidence region can be instantiated in three ways (a usage sketch is provided after this list):
* Confidence intervals: standard confidence intervals are generated for each class independently (_method="intervals"_).
* Confidence ellipse in the simplex: an ellipse is constructed around the mean point; the ellipse lies on the simplex and takes
into account possible inter-class dependencies in the data (_method="ellipse"_).
* Confidence ellipse in the Centered-Log Ratio (CLR) space: the ellipse assumes the components to be normally
distributed, but elements of the simplex are known to have an inner structure that violates this assumption. A better
approach is to first transform the components into an unconstrained space (the CLR space), and then construct the ellipse in that space (_method="ellipse-clr"_).
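The following sketch illustrates how the three region types might be requested from the Aggregative Bootstrap method. The `method` argument and the `quantify_conf` call are assumptions based on the description above; check the [example](https://github.com/HLT-ISTI/QuaPy/tree/master/examples) for the actual API:

```python
import quapy as qp
from quapy.method.aggregative import PACC
from quapy.method.confidence import AggregativeBootstrap

# any dataset fetcher works here; this binary dataset is an illustrative choice
train, test = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5).train_test

# 500 bootstrap repetitions over the pre-classified test instances, at 95% confidence;
# method="ellipse-clr" is assumed to select the CLR-space ellipse
# (the alternatives being "intervals" and "ellipse")
pacc = AggregativeBootstrap(PACC(), n_test_samples=500, confidence_level=0.95, method='ellipse-clr')

with qp.util.temp_seed(0):
    # train the quantifier the usual way (the classifier is trained only once)
    pacc.fit(train)
    # quantify_conf (assumed name) returns the point estimate together with
    # the confidence region constructed around it
    prev_estim, conf_region = pacc.quantify_conf(test.instances)
```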

View File

@@ -1,4 +1,3 @@
from quapy.method.confidence import BayesianCC
from quapy.method.confidence import AggregativeBootstrap
from quapy.method.aggregative import PACC
import quapy.functional as F
@@ -26,7 +25,6 @@ train, test = data.train_test
# intervals around the point estimate, in this case, at 95% of confidence
pacc = AggregativeBootstrap(PACC(), n_test_samples=500, confidence_level=0.95)
with qp.util.temp_seed(0):
    # we train the quantifier the usual way
    pacc.fit(train)

View File

@@ -447,8 +447,13 @@ class BayesianCC(AggregativeCrispQuantifier, WithConfidenceABC):
`$ pip install quapy[bayes]`
:param classifier: a sklearn's Estimator that generates a classifier
:param val_split: a float in (0, 1) indicating the proportion of the training data to be used,
as a stratified held-out validation set, for generating classifier predictions.
:param val_split: specifies the data used for generating classifier predictions. This specification
can be made as a float in (0, 1), indicating the proportion of the training set to be held out as a
stratified validation set; as an integer (default 5), indicating that the predictions are to be
generated in a `k`-fold cross-validation manner (with this integer indicating the value for `k`);
or as a collection defining the specific set of data to use for validation.
Alternatively, this set can be specified at fit time by indicating the exact set of data
on which the predictions are to be generated.
:param num_warmup: number of warmup iterations for the MCMC sampler (default 500)
:param num_samples: number of samples to draw from the posterior (default 1000)
:param mcmc_seed: random seed for the MCMC sampler (default 0)
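A brief sketch of the `val_split` options described above (the import path mirrors the example file shown earlier and may differ across versions; recall that BayesianCC additionally requires `pip install quapy[bayes]`):

```python
from sklearn.linear_model import LogisticRegression
from quapy.method.confidence import BayesianCC

# hold out 30% of the training data as a stratified validation set
# for generating the classifier predictions
bcc = BayesianCC(LogisticRegression(), val_split=0.3)

# or: generate the predictions via 10-fold cross-validation
bcc = BayesianCC(LogisticRegression(), val_split=10)
```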