improving conf regions docs

parent c79b76516c
commit c8235ddb2a

TODO.txt | 1
@@ -1,4 +1,3 @@
-- [TODO] document confidence in manuals
 - [TODO] Test the return_type="index" in protocols and finish the "distributing_samples.py" example
 - [TODO] Add EDy (an implementation is available at quantificationlib)
 - [TODO] add ensemble methods SC-MQ, MC-SQ, MC-MQ
@@ -221,7 +221,7 @@ Options are:
 * `"condsoftmax"` applies softmax normalization only if the prevalence vector lies outside of the probability simplex.
 
-#### BayesianCC (_New in v0.1.9_!)
+#### BayesianCC
 
 The `BayesianCC` is a variant of ACC introduced in
 [Ziegler, A. and Czyż, P. "Bayesian quantification with black-box estimators", arXiv (2023)](https://arxiv.org/abs/2302.09159),
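For intuition, here is a minimal sketch of the `"condsoftmax"` rule mentioned in the hunk above, assuming a NumPy prevalence vector; the function name and the simplex check are illustrative, not QuaPy's internal code:

```python
import numpy as np

def condsoftmax(prevs: np.ndarray) -> np.ndarray:
    # apply softmax only if the vector is not already a valid
    # point in the probability simplex (illustrative check)
    if np.all(prevs >= 0) and np.isclose(prevs.sum(), 1.0):
        return prevs
    exp = np.exp(prevs - prevs.max())  # shift by max for numerical stability
    return exp / exp.sum()
```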
@@ -280,8 +280,8 @@ See the API documentation for further details.
 ### Hellinger Distance y (HDy)
 
 Implementation of the method based on the Hellinger Distance y (HDy) proposed by
-[González-Castro, V., Alaiz-Rodrı́guez, R., and Alegre, E. (2013). Class distribution
-estimation based on the Hellinger distance. Information Sciences, 218:146–164.](https://www.sciencedirect.com/science/article/pii/S0020025512004069)
+[González-Castro, V., Alaiz-Rodríguez, R., and Alegre, E. (2013). Class distribution
+estimation based on the Hellinger distance. Information Sciences, 218:146-164.](https://www.sciencedirect.com/science/article/pii/S0020025512004069)
 
 It is implemented in `qp.method.aggregative.HDy` (also accessible
 through the alias `qp.method.aggregative.HellingerDistanceY`).
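For reference (a standard formulation, not quoted from QuaPy's docs), the Hellinger distance between two discrete distributions $P=(p_1,\dots,p_n)$ and $Q=(q_1,\dots,q_n)$ that HDy seeks to minimize is:

$$
\mathrm{HD}(P, Q) = \frac{1}{\sqrt{2}}\sqrt{\sum_{i=1}^{n}\left(\sqrt{p_i}-\sqrt{q_i}\right)^2}
$$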
@@ -423,7 +423,7 @@ _New in v0.1.8_: QuaPy now provides implementations for the three variants
 of KDE-based methods proposed in
 _[Moreo, A., González, P. and del Coz, J.J., 2023.
 Kernel Density Estimation for Multiclass Quantification.
-arXiv preprint arXiv:2401.00490.](https://arxiv.org/abs/2401.00490)_.
+arXiv preprint arXiv:2401.00490](https://arxiv.org/abs/2401.00490)_.
 The variants differ in the divergence metric to be minimized:
 
 - KDEy-HD: minimizes the (squared) Hellinger Distance and solves the problem via a Monte Carlo approach
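As a hedged illustration of the Monte Carlo idea behind KDEy-HD (a sketch, not QuaPy's actual implementation): using the identity $\mathrm{HD}^2(p,q) = 1 - \mathbb{E}_{x\sim p}\left[\sqrt{q(x)/p(x)}\right]$, the squared Hellinger distance between two density models can be approximated by sampling. The sketch assumes scikit-learn `KernelDensity` models:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def squared_hellinger_mc(kde_p: KernelDensity, kde_q: KernelDensity,
                         n: int = 10_000, seed: int = 0) -> float:
    # HD^2(p, q) = 1 - E_{x~p}[ sqrt(q(x)/p(x)) ], estimated by sampling from p
    X = kde_p.sample(n, random_state=seed)
    log_p = kde_p.score_samples(X)   # log densities under p
    log_q = kde_q.score_samples(X)   # log densities under q
    return float(1.0 - np.mean(np.exp(0.5 * (log_q - log_p))))
```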
@@ -582,3 +582,25 @@ model.fit(dataset.training)
 estim_prevalence = model.quantify(dataset.test.instances)
 ```
 
+## Confidence Regions for Class Prevalence Estimation
+
+_(New in v0.1.10!)_ Some quantification methods go beyond providing a single point estimate of class prevalence values and also produce confidence regions, which characterize the uncertainty around the point estimate. In QuaPy, two such methods are currently implemented:
+
+* Aggregative Bootstrap: extends any aggregative quantifier by generating confidence regions for class prevalence estimates through bootstrapping. Key features of this method include:
+  * Optimized Computation: the bootstrap is applied to pre-classified instances, significantly speeding up training and inference.
+    During training, bootstrap repetitions are performed only after training the classifier once; the repetitions are used to train multiple aggregation functions.
+    During inference, the bootstrap is applied over pre-classified test instances.
+  * General Applicability: Aggregative Bootstrap can be applied to any aggregative quantifier.
+
+  For further information, check the [example](https://github.com/HLT-ISTI/QuaPy/tree/master/examples) provided.
+
+* BayesianCC: a Bayesian variant of the Adjusted Classify & Count (ACC) quantifier (see more details in [Aggregative Quantifiers](#bayesiancc)).
+
+Confidence regions are constructed around a point estimate, which is typically computed as the mean value of a set of samples.
+The confidence region can be instantiated in three ways:
+
+* Confidence intervals: standard confidence intervals generated for each class independently (_method="intervals"_).
+* Confidence ellipse in the simplex: an ellipse constructed around the mean point; it lies on the simplex and takes into account possible inter-class dependencies in the data (_method="ellipse"_).
+* Confidence ellipse in the Centered Log-Ratio (CLR) space: a Gaussian ellipse assumes normally distributed components, whereas elements of the simplex have an inner structure; a better approach is to first transform the components into an unconstrained space (the CLR space) and then construct the ellipse there (_method="ellipse-clr"_).
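The CLR transform mentioned in the last bullet above is standard compositional-data machinery; a minimal sketch follows (illustrative, not QuaPy's code; the epsilon guard against zero prevalences is an added assumption). An ellipse fitted to CLR-transformed samples then lives in an unconstrained space where Gaussian assumptions are more tenable:

```python
import numpy as np

def clr(prevs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    # centered log-ratio: log of each component minus the mean log,
    # mapping a point of the simplex to an unconstrained space
    logp = np.log(prevs + eps)  # eps guards against zero prevalences
    return logp - logp.mean()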
@@ -1,4 +1,3 @@
-from quapy.method.confidence import BayesianCC
 from quapy.method.confidence import AggregativeBootstrap
 from quapy.method.aggregative import PACC
 import quapy.functional as F
@@ -26,7 +25,6 @@ train, test = data.train_test
 # intervals around the point estimate, in this case, at 95% of confidence
 pacc = AggregativeBootstrap(PACC(), n_test_samples=500, confidence_level=0.95)
-
 
 with qp.util.temp_seed(0):
     # we train the quantifier the usual way
     pacc.fit(train)
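To make the "Optimized Computation" point concrete, here is a minimal sketch (plain NumPy, not QuaPy's API) of bootstrapping over pre-classified instances: the classifier runs once to produce posteriors, and only the cheap aggregation step is repeated per resample:

```python
import numpy as np

def bootstrap_prevalences(posteriors: np.ndarray, aggregate,
                          n_reps: int = 500, seed: int = 0) -> np.ndarray:
    # posteriors: (n_instances, n_classes) classifier outputs, computed once
    rng = np.random.default_rng(seed)
    n = posteriors.shape[0]
    estimates = []
    for _ in range(n_reps):
        idx = rng.integers(0, n, size=n)              # resample with replacement
        estimates.append(aggregate(posteriors[idx]))  # re-run only the aggregation
    return np.vstack(estimates)

# e.g., with a Probabilistic Classify & Count style aggregation:
# prevs = bootstrap_prevalences(P, aggregate=lambda P: P.mean(axis=0))
```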
@@ -447,8 +447,13 @@ class BayesianCC(AggregativeCrispQuantifier, WithConfidenceABC):
     `$ pip install quapy[bayes]`
 
     :param classifier: a scikit-learn Estimator that generates a classifier
-    :param val_split: a float in (0, 1) indicating the proportion of the training data to be used,
-        as a stratified held-out validation set, for generating classifier predictions.
+    :param val_split: specifies the data used for generating classifier predictions. This specification
+        can be made as a float in (0, 1), indicating the proportion of the training set to be extracted
+        as a stratified held-out validation set; or as an integer (default 5), indicating that the
+        predictions are to be generated in a `k`-fold cross-validation manner (with this integer
+        indicating the value for `k`); or as a collection defining the specific set of data to use
+        for validation. Alternatively, this set can be specified at fit time by indicating the exact
+        set of data on which the predictions are to be generated.
     :param num_warmup: number of warmup iterations for the MCMC sampler (default 500)
     :param num_samples: number of samples to draw from the posterior (default 1000)
     :param mcmc_seed: random seed for the MCMC sampler (default 0)
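A hedged instantiation sketch based on the docstring above; the `quapy.method.confidence` import path is taken from the example diff earlier in this commit, and all parameter values are the documented defaults:

```python
from sklearn.linear_model import LogisticRegression
from quapy.method.confidence import BayesianCC

quantifier = BayesianCC(
    classifier=LogisticRegression(),
    val_split=5,       # int k: predictions generated via k-fold cross-validation
    num_warmup=500,    # MCMC warmup iterations (documented default)
    num_samples=1000,  # posterior samples to draw (documented default)
    mcmc_seed=0,       # random seed for the MCMC sampler
)
```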