diff --git a/TODO.txt b/TODO.txt
index d3f2b3d..b7d69fa 100644
--- a/TODO.txt
+++ b/TODO.txt
@@ -1,95 +1,6 @@
-ensembles seem to be broken; they have an internal model selection which takes the parameters, but since quapy now
- works with protocols it would need to know the validation set in order to pass something like
- "protocol: APP(val, etc.)"
-sample_size should not be mandatory when qp.environ['SAMPLE_SIZE'] has been specified
-clean all the cumbersome methods that have to be implemented for new quantifiers (e.g., n_classes_ prop, etc.)
-make truly parallel the GridSearchQ
-make more examples in the "examples" directory
-merge with master, because I had to fix some problems with QuaNet due to an issue notified via GitHub!
-added cross_val_predict in qp.model_selection (i.e., a cross_val_predict for quantification) --would be nice to have
- it parallelized
-
-check the OneVsAll module(s)
-
-check the set_params de neural.py, because the separation of estimator__ is not implemented; see also
- __check_params_colision
-
-HDy can be customized so that the number of bins is specified, instead of explored within the fit method
-
-Packaging:
-==========================================
-Document methods with paper references
-unit-tests
-clean wiki_examples!
-
-Refactor:
-==========================================
-Unify ThresholdOptimization methods, as an extension of PACC (and not ACC), the fit methods are almost identical and
- use a prob classifier (take into account that PACC uses pcc internally, whereas the threshold methods use cc
- instead). The fit method of ACC and PACC has a block for estimating the validation estimates that should be unified
- as well...
-Refactor protocols. APP and NPP related functionalities are duplicated in functional, LabelledCollection, and evaluation
-
-
-New features:
-==========================================
-Add "measures for evaluating ordinal"?
-Add datasets for topic.
-Do we want to cover cross-lingual quantification natively in QuaPy, or does it make more sense as an application on top?
-
-Current issues:
-==========================================
-Revise the class structure of quantification methods and the methods they inherit... There is some confusion regarding
- methods isbinary, isprobabilistic, and the like. The attribute "learner_" in aggregative quantifiers is also
- confusing, since there is a getter and a setter.
-Remove the "deep" in get_params. There is no real compatibility with scikit-learn as for now.
-SVMperf-based learners do not remove temp files in __del__?
-In binary quantification (hp, kindle, imdb) we used F1 in the minority class (which in kindle and hp happens to be the
-negative class). This is not covered in this new implementation, in which the binary case is not treated as such, but as
-an instance of single-label with 2 labels. Check
-Add automatic reindex of class labels in LabelledCollection (currently, class indexes should be ordered and with no gaps)
-OVR I believe is currently tied to aggregative methods. We should provide a general interface also for general quantifiers
-Currently, being "binary" only adds one checker; we should figure out how to impose the check to be automatically performed
-Add random seed management to support replicability (see temp_seed in util.py).
-GridSearchQ is not trully parallelized. It only parallelizes on the predictions.
-In the context of a quantifier (e.g., QuaNet or CC), the parameters of the learner should be prefixed with "estimator__",
- in QuaNet this is resolved with a __check_params_colision, but this should be improved. It might be cumbersome to
- impose the "estimator__" prefix for, e.g., quantifiers like CC though... This should be changed everywhere...
-QuaNet needs refactoring. The base quantifiers ACC and PACC receive val_data with instances already transformed. This
- issue is due to a bad design.
-
-Improvements:
-==========================================
-Explore the hyperparameter "number of bins" in HDy
-Rename EMQ to SLD ?
-Parallelize the kFCV in ACC and PACC?
-Parallelize model selection trainings
-We might want to think of (improving and) adding the class Tabular (it is defined and used on branch tweetsent). A more
- recent version is in the project ql4facct. This class is meant to generate latex tables from results (highligting
- best results, computing statistical tests, colouring cells, producing rankings, producing averages, etc.). Trying
- to generate tables is typically a bad idea, but in this specific case we do have pretty good control of what an
- experiment looks like. (Do we want to abstract experimental results? this could be useful not only for tables but
- also for plots).
-Add proper logging system. Currently we use print
-It might be good to simplify the number of methods that have to be implemented for any new Quantifier. At the moment,
- there are many functions like get_params, set_params, and, specially, @property classes_, which are cumbersome to
- implement for quick experiments. A possible solution is to impose get_params and set_params only in cases in which
- the model extends some "ModelSelectable" interface only. The classes_ should have a default implementation.
-
-Checks:
-==========================================
-How many times is the system of equations for ACC and PACC not solved? How many times is it clipped? Do they sum up
- to one always?
-Re-check how hyperparameters from the quantifier and hyperparameters from the classifier (in aggregative quantifiers)
- is handled. In scikit-learn the hyperparameters from a wrapper method are indicated directly whereas the hyperparams
- from the internal learner are prefixed with "estimator__". In QuaPy, combinations having to do with the classifier
- can be computed at the begining, and then in an internal loop the hyperparams of the quantifier can be explored,
- passing fit_learner=False.
-Re-check Ensembles. As for now, they are strongly tied to aggregative quantifiers.
-Re-think the environment variables. Maybe add new ones (like, for example, parameters for the plots)
-Do we want to wrap prevalences (currently simple np.ndarray) as a class? This might be convenient for some interfaces
- (e.g., for specifying artificial prevalences in samplings, for printing them -- currently supported through
- F.strprev(), etc.). This might however add some overload, and prevent/difficult post processing with numpy.
-Would be nice to get a better integration with sklearn.
-
-
+- [TODO] add ensemble methods SC-MQ, MC-SQ, MC-MQ
+- [TODO] add HistNetQ
+- [TODO] add CDE-iteration and Bayes-CDE methods
+- [TODO] add Friedman's method and DeBias
+- [TODO] check ignore warning stuff
+ check https://docs.python.org/3/library/warnings.html#temporarily-suppressing-warnings
diff --git a/docs/.gitignore b/docs/.gitignore
new file mode 100644
index 0000000..567609b
--- /dev/null
+++ b/docs/.gitignore
@@ -0,0 +1 @@
+build/
diff --git a/docs/build/html/_sources/api.rst.txt b/docs/build/html/_sources/api.rst.txt
deleted file mode 100644
index b628a93..0000000
--- a/docs/build/html/_sources/api.rst.txt
+++ /dev/null
@@ -1,7 +0,0 @@
-API
-===
-
-.. autosummary::
- :toctree: generated
-
- quapy
\ No newline at end of file
diff --git a/docs/build/html/_sources/generated/quapy.rst.txt b/docs/build/html/_sources/generated/quapy.rst.txt
deleted file mode 100644
index 52098bb..0000000
--- a/docs/build/html/_sources/generated/quapy.rst.txt
+++ /dev/null
@@ -1,23 +0,0 @@
-quapy
-=====
-
-.. automodule:: quapy
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
diff --git a/docs/build/html/_sources/index.rst.txt b/docs/build/html/_sources/index.rst.txt
deleted file mode 100644
index cc5b4dc..0000000
--- a/docs/build/html/_sources/index.rst.txt
+++ /dev/null
@@ -1,41 +0,0 @@
-.. QuaPy: A Python-based open-source framework for quantification documentation master file, created by
- sphinx-quickstart on Wed Feb 7 16:26:46 2024.
- You can adapt this file completely to your liking, but it should at least
- contain the root `toctree` directive.
-
-Welcome to QuaPy's documentation!
-==========================================================================================
-
-QuaPy is a Python-based open-source framework for quantification.
-
-This document contains the API of the modules included in QuaPy.
-
-Installation
-------------
-
-`pip install quapy`
-
-GitHub
-------------
-
-QuaPy is hosted in GitHub at `https://github.com/HLT-ISTI/QuaPy
QuaPy is a Python-based open-source framework for quantification.
-This document contains the API of the modules included in QuaPy.
-pip install quapy
-QuaPy is hosted in GitHub at https://github.com/HLT-ISTI/QuaPy
-BCTSCalibration
-NBVSCalibration
-RecalibratedProbabilisticClassifier
-RecalibratedProbabilisticClassifierBase
-RecalibratedProbabilisticClassifierBase.classes_
-RecalibratedProbabilisticClassifierBase.fit()
-RecalibratedProbabilisticClassifierBase.fit_cv()
-RecalibratedProbabilisticClassifierBase.fit_tr_val()
-RecalibratedProbabilisticClassifierBase.predict()
-RecalibratedProbabilisticClassifierBase.predict_proba()
-TSCalibration
-VSCalibration
-CNNnet
-LSTMnet
-NeuralClassifierTrainer
-TextClassifierNet
-TorchDataset
-Dataset
-LabelledCollection
-LabelledCollection.X
-LabelledCollection.Xp
-LabelledCollection.Xy
-LabelledCollection.binary
-LabelledCollection.counts()
-LabelledCollection.join()
-LabelledCollection.kFCV()
-LabelledCollection.load()
-LabelledCollection.n_classes
-LabelledCollection.p
-LabelledCollection.prevalence()
-LabelledCollection.sampling()
-LabelledCollection.sampling_from_index()
-LabelledCollection.sampling_index()
-LabelledCollection.split_random()
-LabelledCollection.split_stratified()
-LabelledCollection.stats()
-LabelledCollection.uniform_sampling()
-LabelledCollection.uniform_sampling_index()
-LabelledCollection.y
-ACC
-AdjustedClassifyAndCount
-AggregativeCrispQuantifier
-AggregativeMedianEstimator
-AggregativeQuantifier
-AggregativeQuantifier.aggregate()
-AggregativeQuantifier.aggregation_fit()
-AggregativeQuantifier.classes_
-AggregativeQuantifier.classifier
-AggregativeQuantifier.classifier_fit_predict()
-AggregativeQuantifier.classify()
-AggregativeQuantifier.fit()
-AggregativeQuantifier.quantify()
-AggregativeQuantifier.val_split
-AggregativeQuantifier.val_split_
-AggregativeSoftQuantifier
-BinaryAggregativeQuantifier
-CC
-ClassifyAndCount
-DMy
-DistributionMatchingY
-DyS
-EMQ
-ExpectationMaximizationQuantifier
-HDy
-HellingerDistanceY
-OneVsAllAggregative
-PACC
-PCC
-ProbabilisticAdjustedClassifyAndCount
-ProbabilisticClassifyAndCount
-SLD
-SMM
-newELM()
-newSVMAE()
-newSVMKLD()
-newSVMQ()
-newSVMRAE()
-KDEBase
-KDEyCS
-KDEyHD
-KDEyML
-QuaNetModule
-QuaNetTrainer
-mae_loss()
-MAX
-MS
-MS2
-T50
-ThresholdOptimization
-X
-absolute_error()
-acc_error()
-acce()
-ae()
-f1_error()
-f1e()
-from_name()
-kld()
-mae()
-mean_absolute_error()
-mean_normalized_absolute_error()
-mean_normalized_relative_absolute_error()
-mean_relative_absolute_error()
-mkld()
-mnae()
-mnkld()
-mnrae()
-mrae()
-mse()
-nae()
-nkld()
-normalized_absolute_error()
-normalized_relative_absolute_error()
-nrae()
-rae()
-relative_absolute_error()
-se()
-smooth()
-HellingerDistance()
-TopsoeDistance()
-adjusted_quantification()
-argmin_prevalence()
-as_binary_prevalence()
-check_prevalence_vector()
-get_divergence()
-get_nprevpoints_approximation()
-linear_search()
-normalize_prevalence()
-num_prevalence_combinations()
-optim_minimize()
-prevalence_from_labels()
-prevalence_from_probabilities()
-prevalence_linspace()
-strprev()
-uniform_prevalence_sampling()
-uniform_simplex_sampling()
-ConfigStatus
-GridSearchQ
-Status
-cross_val_predict()
-expand_grid()
-group_params()
-quapy
-quapy.classification
-quapy.classification.calibration
-quapy.classification.methods
-quapy.classification.neural
-quapy.classification.svmperf
-quapy.data
-quapy.data.base
-quapy.data.datasets
-quapy.data.preprocessing
-quapy.data.reader
-quapy.error
-quapy.evaluation
-quapy.functional
-quapy.method
-quapy.method._kdey
-quapy.method._neural
-quapy.method._threshold_optim
-quapy.method.aggregative
-quapy.method.base
-quapy.method.meta
-quapy.method.non_aggregative
-quapy.model_selection
-quapy.plot
-quapy.protocol
-quapy.util
Bases: RecalibratedProbabilisticClassifierBase
Applies the Bias-Corrected Temperature Scaling (BCTS) calibration method from abstention.calibration, as defined in -Alexandari et al. paper:
-classifier – a scikit-learn probabilistic classifier
val_split – an integer k for performing kFCV to obtain the posterior probabilities, or a float p in (0,1) to indicate that the posteriors are obtained in a stratified validation split containing a proportion p of the training instances (the rest is used for training). In any case, the classifier is retrained on the whole training set afterwards. Default value is 5.
n_jobs – indicate the number of parallel workers (only when val_split is an integer)
verbose – whether or not to display information in the standard output
Bases: RecalibratedProbabilisticClassifierBase
Applies the No-Bias Vector Scaling (NBVS) calibration method from abstention.calibration, as defined in -Alexandari et al. paper:
-classifier – a scikit-learn probabilistic classifier
val_split – an integer k for performing kFCV to obtain the posterior probabilities, or a float p in (0,1) to indicate that the posteriors are obtained in a stratified validation split containing a proportion p of the training instances (the rest is used for training). In any case, the classifier is retrained on the whole training set afterwards. Default value is 5.
n_jobs – indicate the number of parallel workers (only when val_split is an integer)
verbose – whether or not to display information in the standard output
Bases: object
Abstract class for (re)calibration methods from abstention.calibration, as defined in Alexandari, A., Kundaje, A., & Shrikumar, A. (2020, November). Maximum likelihood with bias-corrected calibration is hard-to-beat at label shift adaptation. In International Conference on Machine Learning (pp. 222-232). PMLR.
-Bases: BaseEstimator, RecalibratedProbabilisticClassifier
Applies a (re)calibration method from abstention.calibration, as defined in -Alexandari et al. paper.
-classifier – a scikit-learn probabilistic classifier
calibrator – the calibration object (an instance of abstention.calibration.CalibratorFactory)
val_split – an integer k for performing kFCV to obtain the posterior probabilities, or a float p in (0,1) to indicate that the posteriors are obtained in a stratified validation split containing a proportion p of the training instances (the rest is used for training). In any case, the classifier is retrained on the whole training set afterwards. Default value is 5.
n_jobs – indicate the number of parallel workers (only when val_split is an integer); default=None
verbose – whether or not to display information in the standard output
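The two meanings of val_split described above (an integer k for kFCV, a float p in (0,1) for a held-out split) can be sketched as a small dispatcher. This is only an illustration: the helper name `resolve_val_split`, the returned tags, and the requirement that k be at least 2 are assumptions, not part of QuaPy.

```python
def resolve_val_split(val_split):
    """Classify a val_split value as described above (hypothetical helper)."""
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(val_split, bool):
        raise ValueError("val_split must be an int or a float, not a bool")
    if isinstance(val_split, int):
        # assumption: cross-validation needs at least 2 folds
        if val_split < 2:
            raise ValueError("k-fold cross-validation needs at least 2 folds")
        return ("kfcv", val_split)
    if isinstance(val_split, float):
        if not 0.0 < val_split < 1.0:
            raise ValueError("a float val_split must lie strictly in (0, 1)")
        return ("train_val_split", val_split)
    raise ValueError(f"unsupported val_split: {val_split!r}")


print(resolve_val_split(5))    # integer: posteriors come from 5-fold CV
print(resolve_val_split(0.4))  # float: posteriors come from a held-out split
```

Rejecting booleans up front avoids `True` being silently read as k=1.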
Returns the classes on which the classifier has been trained
-array-like of shape (n_classes)
-Fits the calibration for the probabilistic classifier.
-X – array-like of shape (n_samples, n_features) with the data instances
y – array-like of shape (n_samples,) with the class labels
self
-Fits the calibration in a cross-validation manner, i.e., it generates posterior probabilities for all -training instances via cross-validation, and then retrains the classifier on all training instances. -The posterior probabilities thus generated are used for calibrating the outputs of the classifier.
-X – array-like of shape (n_samples, n_features) with the data instances
y – array-like of shape (n_samples,) with the class labels
self
-Fits the calibration in a train/val-split manner, i.e., it partitions the training instances into a training and a validation set, and then uses the training samples to learn a classifier which is then used to generate posterior probabilities for the held-out validation data. These posteriors are used to calibrate the classifier. The classifier is not retrained on the whole dataset.
-X – array-like of shape (n_samples, n_features) with the data instances
y – array-like of shape (n_samples,) with the class labels
self
-Bases: RecalibratedProbabilisticClassifierBase
Applies the Temperature Scaling (TS) calibration method from abstention.calibration, as defined in -Alexandari et al. paper:
-classifier – a scikit-learn probabilistic classifier
val_split – an integer k for performing kFCV to obtain the posterior probabilities, or a float p in (0,1) to indicate that the posteriors are obtained in a stratified validation split containing a proportion p of the training instances (the rest is used for training). In any case, the classifier is retrained on the whole training set afterwards. Default value is 5.
n_jobs – indicate the number of parallel workers (only when val_split is an integer)
verbose – whether or not to display information in the standard output
Bases: RecalibratedProbabilisticClassifierBase
Applies the Vector Scaling (VS) calibration method from abstention.calibration, as defined in -Alexandari et al. paper:
-classifier – a scikit-learn probabilistic classifier
val_split – an integer k for performing kFCV to obtain the posterior probabilities, or a float p in (0,1) to indicate that the posteriors are obtained in a stratified validation split containing a proportion p of the training instances (the rest is used for training). In any case, the classifier is retrained on the whole training set afterwards. Default value is 5.
n_jobs – indicate the number of parallel workers (only when val_split is an integer)
verbose – whether or not to display information in the standard output
Bases: BaseEstimator
An example of a classification method (i.e., an object that implements fit, predict, and predict_proba)
-that also generates embedded inputs (i.e., that implements transform), as those required for
-quapy.method.neural.QuaNet. This is a mock method to allow for easily instantiating
-quapy.method.neural.QuaNet on array-like real-valued instances.
-The transformation consists of applying sklearn.decomposition.TruncatedSVD
-while classification is performed using sklearn.linear_model.LogisticRegression on the low-rank space.
n_components – the number of principal components to retain
kwargs – parameters for the -Logistic Regression classifier
Fit the model according to the given training data. The fit consists of -fitting TruncatedSVD and then LogisticRegression on the low-rank representation.
-X – array-like of shape (n_samples, n_features) with the instances
y – array-like of shape (n_samples, n_classes) with the class labels
self
-Get hyper-parameters for this estimator.
-a dictionary with parameter names mapped to their values
-Predicts labels for the instances X embedded into the low-rank space.
-X – array-like of shape (n_samples, n_features) instances to classify
-a numpy array of length n containing the label predictions, where n is the number of -instances in X
-Predicts posterior probabilities for the instances X embedded into the low-rank space.
-X – array-like of shape (n_samples, n_features) instances to classify
-array-like of shape (n_samples, n_classes) with the posterior probabilities
-Set the parameters of this estimator.
-parameters – a **kwargs dictionary with the estimator parameters for Logistic Regression and possibly also n_components for TruncatedSVD
-Returns the low-rank approximation of X with n_components dimensions, or X unaltered if -n_components >= X.shape[1].
-X – array-like of shape (n_samples, n_features) instances to embed
-array-like of shape (n_samples, n_components) with the embedded instances
-Bases: TextClassifierNet
An implementation of quapy.classification.neural.TextClassifierNet based on
-Convolutional Neural Networks.
vocabulary_size – the size of the vocabulary
n_classes – number of target classes
embedding_size – the dimensionality of the word embeddings space (default 100)
hidden_size – the dimensionality of the hidden space (default 256)
repr_size – the dimensionality of the document embeddings space (default 100)
kernel_heights – list of kernel lengths (default [3,5,7]), i.e., the number of -consecutive tokens that each kernel covers
stride – convolutional stride (default 1)
padding – convolutional pad (default 0)
drop_p – drop probability for dropout (default 0.5)
Embeds documents (i.e., performs the forward pass up to the -next-to-last layer).
-input – a batch of instances, typically generated by a torch’s DataLoader
-instance (see quapy.classification.neural.TorchDataset)
a torch tensor of shape (n_samples, n_dimensions), where -n_samples is the number of documents, and n_dimensions is the -dimensionality of the embedding
-Get hyper-parameters for this estimator
-a dictionary with parameter names mapped to their values
-Return the size of the vocabulary
-integer
-Bases: TextClassifierNet
An implementation of quapy.classification.neural.TextClassifierNet based on
-Long Short Term Memory networks.
vocabulary_size – the size of the vocabulary
n_classes – number of target classes
embedding_size – the dimensionality of the word embeddings space (default 100)
hidden_size – the dimensionality of the hidden space (default 256)
repr_size – the dimensionality of the document embeddings space (default 100)
lstm_class_nlayers – number of LSTM layers (default 1)
drop_p – drop probability for dropout (default 0.5)
Embeds documents (i.e., performs the forward pass up to the -next-to-last layer).
-x – a batch of instances, typically generated by a torch’s DataLoader
-instance (see quapy.classification.neural.TorchDataset)
a torch tensor of shape (n_samples, n_dimensions), where -n_samples is the number of documents, and n_dimensions is the -dimensionality of the embedding
-Get hyper-parameters for this estimator
-a dictionary with parameter names mapped to their values
-Return the size of the vocabulary
-integer
-Bases: object
Trains a neural network for text classification.
-net – an instance of TextClassifierNet implementing the forward pass
lr – learning rate (default 1e-3)
weight_decay – weight decay (default 0)
patience – number of epochs that do not show any improvement in validation -to wait before applying early stop (default 10)
epochs – maximum number of training epochs (default 200)
batch_size – batch size for training (default 64)
batch_size_test – batch size for test (default 512)
padding_length – maximum number of tokens to consider in a document (default 300)
device – specify ‘cpu’ (default) or ‘cuda’ for enabling gpu
checkpointpath – where to store the parameters of the best model found so far -according to the evaluation in the held-out validation split (default ‘../checkpoint/classifier_net.dat’)
Gets the device in which the network is allocated
-device
-Fits the model according to the given training data.
-instances – list of lists of indexed tokens
labels – array-like of shape (n_samples, n_classes) with the class labels
val_split – proportion of training documents to be taken as the validation set (default 0.3)
Get hyper-parameters for this estimator
-a dictionary with parameter names mapped to their values
-Predicts labels for the instances
-instances – list of lists of indexed tokens
-a numpy array of length n containing the label predictions, where n is the number of -instances in X
-Predicts posterior probabilities for the instances
-X – array-like of shape (n_samples, n_features) instances to classify
-array-like of shape (n_samples, n_classes) with the posterior probabilities
-Reinitialize the network parameters
-vocab_size – the size of the vocabulary
n_classes – the number of target classes
Bases: Module
Abstract Text classifier (torch.nn.Module)
-Gets the number of dimensions of the embedding space
-integer
-Embeds documents (i.e., performs the forward pass up to the -next-to-last layer).
-x – a batch of instances, typically generated by a torch’s DataLoader
-instance (see quapy.classification.neural.TorchDataset)
a torch tensor of shape (n_samples, n_dimensions), where -n_samples is the number of documents, and n_dimensions is the -dimensionality of the embedding
-Performs the forward pass.
-x – a batch of instances, typically generated by a torch’s DataLoader
-instance (see quapy.classification.neural.TorchDataset)
a tensor of shape (n_instances, n_classes) with the decision scores -for each of the instances and classes
-Get hyper-parameters for this estimator
-a dictionary with parameter names mapped to their values
-Predicts posterior probabilities for the instances in x
-x – a torch tensor of indexed tokens with shape (n_instances, pad_length) where n_instances is the number of instances in the batch, and pad_length is the length of the pad in the batch
-array-like of shape (n_samples, n_classes) with the posterior probabilities
-Return the size of the vocabulary
-integer
-Bases: Dataset
Transforms labelled instances into a Torch’s torch.utils.data.DataLoader object
instances – list of lists of indexed tokens
labels – array-like of shape (n_samples, n_classes) with the class labels
Converts the labelled collection into a Torch DataLoader with dynamic padding for -the batch
-batch_size – batch size
shuffle – whether or not to shuffle instances
pad_length – the maximum length for the list of tokens (dynamic padding is applied, meaning that if the longest document in the batch is shorter than pad_length, then the batch is padded up to its length, and not to pad_length)
device – whether to allocate tensors in cpu or in cuda
a torch.utils.data.DataLoader object
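The dynamic padding rule described for pad_length can be sketched in plain Python, without torch. The function name `pad_batch`, the `pad_index` argument, left-side padding, and truncation of overlong documents are all assumptions made for this illustration; the text above only fixes how the target length is chosen.

```python
def pad_batch(batch, pad_length, pad_index=0):
    """Pad (or truncate) every token list in `batch` to a common length."""
    # Dynamic padding: the target is the longest document in the batch,
    # capped at pad_length.
    longest = max(len(doc) for doc in batch)
    target = min(longest, pad_length)
    padded = []
    for doc in batch:
        doc = doc[:target]  # assumption: overlong documents are truncated
        # assumption: padding tokens are added on the left
        padded.append([pad_index] * (target - len(doc)) + doc)
    return padded


batch = [[5, 8, 2], [7], [4, 4, 9, 1, 3]]
print(pad_batch(batch, pad_length=300))  # padded to length 5, not to 300
print(pad_batch(batch, pad_length=4))    # capped at pad_length=4
```

Padding to the batch maximum rather than to pad_length keeps short batches small, which is the point of the dynamic variant.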
Bases: BaseEstimator, ClassifierMixin
A wrapper for the SVM-perf package by Thorsten Joachims. -When using losses for quantification, the source code has to be patched. See -the installation documentation -for further details.
-References
- -svmperf_base – path to directory containing the binary files svm_perf_learn and svm_perf_classify
C – trade-off between training error and margin (default 0.01)
verbose – set to True to print svm-perf std outputs
loss – the loss to optimize for. Available losses are “01”, “f1”, “kld”, “nkld”, “q”, “qacc”, “qf1”, “qgm”, “mae”, “mrae”.
host_folder – directory where to store the trained model; set to None (default) for using a tmp directory (temporary directories are automatically deleted)
Evaluate the decision function for the samples in X.
-X – array-like of shape (n_samples, n_features) containing the instances to classify
y – unused
array-like of shape (n_samples,) containing the decision scores of the instances
-Trains the SVM for the multivariate performance loss
-X – training instances
y – a binary vector of labels
self
-Predicts labels for the instances X
-X – array-like of shape (n_samples, n_features) instances to classify
-a numpy array of length n containing the label predictions, where n is the number of -instances in X
-Bases: object
Abstraction of training and test LabelledCollection objects.
training – a LabelledCollection instance
test – a LabelledCollection instance
vocabulary – if indicated, is a dictionary of the terms used in this textual dataset
name – a string representing the name of the dataset
Generates a Dataset from a stratified split of a LabelledCollection instance.
-See LabelledCollection.split_stratified()
collection – LabelledCollection
train_size – the proportion of training documents (the rest forms the test split)
an instance of Dataset
Returns True if the training collection is labelled according to two classes
-boolean
-The classes according to which the training collection is labelled
-The classes according to which the training collection is labelled
-Generator of stratified folds to be used in k-fold cross validation. This function is only a wrapper around
-LabelledCollection.kFCV() that returns Dataset instances made of training and test folds.
nfolds – integer (default 5), the number of folds to generate
nrepeats – integer (default 1), the number of rounds of k-fold cross validation to run
random_state – integer (default 0), guarantees that the folds generated are reproducible
yields nfolds * nrepeats folds for k-fold cross validation as instances of Dataset
Loads a training and a test labelled set of data and converts them into a Dataset instance.
-The function in charge of reading the instances must be specified. This function can be a custom one, or any of
-the reading functions defined in quapy.data.reader module.
train_path – string, the path to the file containing the training instances
test_path – string, the path to the file containing the test instances
loader_func – a custom function that implements the data loader and returns a tuple with instances and -labels
classes – array-like, the classes according to which the instances are labelled
loader_kwargs – any argument that the loader_func function needs in order to read the instances.
-See LabelledCollection.load() for further details.
a Dataset object
The number of classes according to which the training collection is labelled
-integer
-Reduce the number of instances in place for quick experiments. Preserves the prevalence of each set.
-n_train – number of training documents to keep (default 100)
n_test – number of test documents to keep (default 100)
self
-Returns (and optionally prints) a dictionary with some stats of this dataset, e.g.:
->>> data = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5)
->>> data.stats()
-Dataset=kindle #tr-instances=3821, #te-instances=21591, type=<class 'scipy.sparse.csr.csr_matrix'>, #features=4403, #classes=[0 1], tr-prevs=[0.081, 0.919], te-prevs=[0.063, 0.937]
-show – if set to True (default), prints the stats in standard output
-a dictionary containing some stats of this collection for the training and test collections. The keys -are train and test, and point to dedicated dictionaries of stats, for each collection, with keys -#instances (the number of instances), type (the type representing the instances), -#features (the number of features, if the instances are in array-like format), #classes (the classes of -the collection), prevs (the prevalence values for each class)
-Alias to self.training and self.test
-the training and test collections
-the training and test collections
-If the dataset is textual, and the vocabulary was indicated, returns the size of the vocabulary
-integer
-Bases: object
A LabelledCollection is a set of objects, each with a label attached. This class implements several sampling routines and other utilities.
-instances – array-like (np.ndarray, list, or csr_matrix are supported)
labels – array-like with the same length as instances
classes – optional, list of classes from which labels are taken. If not specified, the classes are inferred -from the labels. The classes must be indicated in cases in which some of the labels might have no examples -(i.e., a prevalence of 0)
An alias to self.instances
-self.instances
-Gets the instances and the true prevalence. This is useful when implementing evaluation protocols from
-a LabelledCollection object.
a tuple (instances, prevalence) from this collection
-Gets the instances and labels. This is useful when working with sklearn estimators, e.g.:
->>> svm = LinearSVC().fit(*my_collection.Xy)
-a tuple (instances, labels) from this collection
-Returns True if the number of classes is 2
-boolean
-Returns the number of instances for each of the classes in the codeframe.
-a np.ndarray of shape (n_classes) with the number of instances of each class, in the same order -as listed by self.classes_
-Returns a new LabelledCollection as the union of the collections given in input.
args – instances of LabelledCollection
a LabelledCollection representing the union of both collections
Generator of stratified folds to be used in k-fold cross validation.
-nfolds – integer (default 5), the number of folds to generate
nrepeats – integer (default 1), the number of rounds of k-fold cross validation to run
random_state – integer (default 0), guarantees that the folds generated are reproducible
yields nfolds * nrepeats folds for k-fold cross validation
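As a rough illustration of what stratified, repeated k-fold generation involves, here is a minimal pure-Python sketch. It is not the library's implementation: the function name `stratified_kfold_indexes` and the round-robin dealing strategy are assumptions made for this example only.

```python
import random
from collections import defaultdict


def stratified_kfold_indexes(labels, nfolds=5, nrepeats=1, random_state=0):
    """Yield (train_idx, test_idx) pairs; nfolds * nrepeats pairs in total."""
    rng = random.Random(random_state)  # fixed seed makes the folds reproducible
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    for _ in range(nrepeats):
        folds = [[] for _ in range(nfolds)]
        for members in by_class.values():
            shuffled = members[:]
            rng.shuffle(shuffled)
            # deal the members of each class round-robin, so that every fold
            # keeps (approximately) the class proportions of the collection
            for pos, i in enumerate(shuffled):
                folds[pos % nfolds].append(i)
        for k in range(nfolds):
            test = sorted(folds[k])
            train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
            yield train, test


labels = [0] * 10 + [1] * 20
for train_idx, test_idx in stratified_kfold_indexes(labels, nfolds=5, nrepeats=2):
    pass  # fit on train_idx, evaluate on test_idx
```

Each repeat reshuffles the instances before dealing them, so the two rounds of a 5FCVx2 protocol produce different folds while remaining reproducible for a given `random_state`.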
-Loads a labelled set of data and converts it into a LabelledCollection instance. The function in charge
-of reading the instances must be specified. This function can be a custom one, or any of the reading functions
-defined in the quapy.data.reader module.
path – string, the path to the file containing the labelled instances
loader_func – a custom function that implements the data loader and returns a tuple with instances and -labels
classes – array-like, the classes according to which the instances are labelled
loader_kwargs – any argument that the loader_func function needs in order to read the instances, i.e., -these arguments are used to call loader_func(path, **loader_kwargs)
a LabelledCollection object
The number of classes
-integer
-An alias to self.prevalence()
-self.prevalence()
-Returns the prevalence, or relative frequency, of the classes in the codeframe.
-a np.ndarray of shape (n_classes) with the relative frequencies of each class, in the same order -as listed by self.classes_
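To make the relation between counts and prevalence concrete, the following sketch computes both from a plain list of labels. The helper names mirror the documented methods but are standalone functions written for this example; they return lists rather than np.ndarray.

```python
from collections import Counter


def counts(labels, classes):
    """Number of instances per class, in the order given by `classes`."""
    tally = Counter(labels)
    return [tally.get(c, 0) for c in classes]


def prevalence(labels, classes):
    """Relative frequency of each class, in the order given by `classes`."""
    n = len(labels)
    return [k / n for k in counts(labels, classes)]


labels = [0, 1, 1, 1, 0, 1, 1, 1]
print(counts(labels, classes=[0, 1, 2]))      # [2, 6, 0]
print(prevalence(labels, classes=[0, 1, 2]))  # [0.25, 0.75, 0.0]
```

Class 2 has no examples here, which is the situation in which the classes must be given explicitly instead of being inferred from the labels.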
-Return a random sample (an instance of LabelledCollection) of desired size and desired prevalence
-values. For each class, the sampling is drawn with replacement if the requested prevalence is larger than
-the actual prevalence of the class, or without replacement otherwise.
size – integer, the requested size
prevs – the prevalence for each class; the prevalence value for the last class can be left empty since -it is constrained. E.g., for binary collections, only the prevalence p for the first class (as listed in -self.classes_) can be specified, while the other class takes prevalence value 1-p
shuffle – if set to True (default), shuffles the index before returning it
random_state – seed for reproducing sampling
an instance of LabelledCollection with length == size and prevalence close to prevs (or
-prevalence == prevs if the exact prevalence values can be met as proportions of instances)
Returns an instance of LabelledCollection whose elements are sampled from this collection using the
-index.
index – np.ndarray
-an instance of LabelledCollection
Returns an index to be used to extract a random sample of desired size and desired prevalence values. If the -prevalence values are not specified, then returns the index of a uniform sampling. -For each class, the sampling is drawn with replacement if the requested prevalence is larger than -the actual prevalence of the class, or without replacement otherwise.
-size – integer, the requested size
prevs – the prevalence for each class; the prevalence value for the last class can be left empty since -it is constrained. E.g., for binary collections, only the prevalence p for the first class (as listed in -self.classes_) can be specified, while the other class takes prevalence value 1-p
shuffle – if set to True (default), shuffles the index before returning it
random_state – seed for reproducing sampling
a np.ndarray of shape (size) with the indexes
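The sampling rule described above can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the library's code: the standalone function takes `labels` and `classes` explicitly, returns a list instead of a np.ndarray, and the largest-remainder rounding of the per-class requests is a choice made for this example.

```python
import random


def sampling_index(labels, classes, size, prevs, shuffle=True, random_state=None):
    rng = random.Random(random_state)
    prevs = list(prevs)
    if len(prevs) == len(classes) - 1:
        prevs.append(1.0 - sum(prevs))  # the last class is constrained
    # turn prevalence values into integer requests that add up to `size`
    exact = [size * p for p in prevs]
    requested = [int(x) for x in exact]
    leftover = size - sum(requested)
    by_fraction = sorted(range(len(exact)),
                         key=lambda k: exact[k] - requested[k], reverse=True)
    for k in by_fraction[:leftover]:
        requested[k] += 1
    index = []
    for c, n_req in zip(classes, requested):
        candidates = [i for i, y in enumerate(labels) if y == c]
        if n_req > len(candidates):
            index += rng.choices(candidates, k=n_req)  # with replacement
        else:
            index += rng.sample(candidates, n_req)     # without replacement
    if shuffle:
        rng.shuffle(index)
    return index


labels = [0] * 5 + [1] * 95
idx = sampling_index(labels, classes=[0, 1], size=20, prevs=[0.5], random_state=0)
```

In the usage above, ten instances of class 0 are requested but only five exist, so that class is drawn with replacement, while class 1 is drawn without replacement.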
-Returns two instances of LabelledCollection split randomly from this collection, at desired
-proportion.
train_prop – the proportion of elements to include in the left-most returned collection (typically used -as the training collection). The rest of elements are included in the right-most returned collection -(typically used as a test collection).
random_state – if specified, guarantees reproducibility of the split.
two instances of LabelledCollection, the first one with train_prop elements, and the
-second one with 1-train_prop elements
Returns two instances of LabelledCollection split with stratification from this collection, at desired
-proportion.
train_prop – the proportion of elements to include in the left-most returned collection (typically used -as the training collection). The rest of elements are included in the right-most returned collection -(typically used as a test collection).
random_state – if specified, guarantees reproducibility of the split.
two instances of LabelledCollection, the first one with train_prop elements, and the
-second one with 1-train_prop elements
Returns (and optionally prints) a dictionary with some stats of this collection, e.g.:
->>> data = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5)
->>> data.training.stats()
->>> #instances=3821, type=<class 'scipy.sparse.csr.csr_matrix'>, #features=4403, #classes=[0 1], prevs=[0.081, 0.919]
-show – if set to True (default), prints the stats in standard output
-a dictionary containing some stats of this collection. Keys include #instances (the number of -instances), type (the type representing the instances), #features (the number of features, if the -instances are in array-like format), #classes (the classes of the collection), prevs (the prevalence -values for each class)
-Returns a uniform sample (an instance of LabelledCollection) of desired size. The sampling is drawn
-with replacement if the requested size is greater than the number of instances, or without replacement
-otherwise.
size – integer, the requested size
random_state – if specified, guarantees reproducibility of the split.
an instance of LabelledCollection with length == size
Returns an index to be used to extract a uniform sample of desired size. The sampling is drawn -with replacement if the requested size is greater than the number of instances, or without replacement -otherwise.
-size – integer, the size of the uniform sample
random_state – if specified, guarantees reproducibility of the split.
a np.ndarray of shape (size) with the indexes
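The replacement rule for uniform sampling is simple enough to show in a few lines. The sketch below is a standalone illustration that takes the number of instances as an argument and returns a list; it is not the method itself.

```python
import random


def uniform_sampling_index(n_instances, size, random_state=None):
    rng = random.Random(random_state)
    if size > n_instances:
        # more indexes requested than available: draw with replacement
        return [rng.randrange(n_instances) for _ in range(size)]
    # otherwise every index appears at most once
    return rng.sample(range(n_instances), size)


print(len(uniform_sampling_index(10, 5, random_state=0)))  # 5
```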
-An alias to self.labels
-self.labels
-Loads the IFCB dataset for quantification from Zenodo (for more -information on this dataset, please follow the zenodo link). -This dataset is based on the data available publicly at -WHOI-Plankton repo. -The scripts for the processing are available at P. González’s repo. -Basically, this is the IFCB dataset with precomputed features for testing quantification algorithms.
-The datasets are downloaded only once, and stored for fast reuse.
-single_sample_train – a boolean. If true, it will return the train dataset as a
-quapy.data.base.LabelledCollection (all examples together).
-If false, a generator of training samples will be returned. Each example in the training set has an individual label.
for_model_selection – if True, then returns a 30% split of the training set (86 out of 286 samples) to be used for model selection; -if False, then returns the full training set as training set and the test set as the test set
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default -~/quapy_data/ directory)
a tuple (train, test_gen) where train is an instance of
-quapy.data.base.LabelledCollection, if single_sample_train is true or
-quapy.data._ifcb.IFCBTrainSamplesFromDir, i.e. a sampling protocol that returns a series of samples
-labelled example by example. test_gen will be a quapy.data._ifcb.IFCBTestSamples,
-i.e., a sampling protocol that returns a series of samples labelled by prevalence.
Loads a UCI dataset as an instance of quapy.data.base.Dataset, as used in
-Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017).
-Using ensembles for problems with characterizable changes in data distribution: A case study on quantification.
-Information Fusion, 34, 87-100.
-and
-Pérez-Gállego, P., Castano, A., Quevedo, J. R., & del Coz, J. J. (2019).
-Dynamic ensemble selection for quantification tasks.
-Information Fusion, 45, 1-15.
-The datasets do not come with a predefined train-test split (see fetch_UCILabelledCollection() for further
-information on how to use these collections), and so a train-test split is generated at desired proportion.
-The list of valid dataset names can be accessed in quapy.data.datasets.UCI_DATASETS
dataset_name – a dataset name
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default -~/quapy_data/ directory)
test_split – proportion of documents to be included in the test set. The rest forms the training set
verbose – set to True (default is False) to get information (from the UCI ML repository) about the datasets
a quapy.data.base.Dataset instance
Loads a UCI collection as an instance of quapy.data.base.LabelledCollection, as used in
-Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017).
-Using ensembles for problems with characterizable changes in data distribution: A case study on quantification.
-Information Fusion, 34, 87-100.
-and
-Pérez-Gállego, P., Castano, A., Quevedo, J. R., & del Coz, J. J. (2019).
-Dynamic ensemble selection for quantification tasks.
-Information Fusion, 45, 1-15.
-The datasets do not come with a predefined train-test split, and so Pérez-Gállego et al. adopted a 5FCVx2 evaluation
-protocol, meaning that each collection was used to generate two rounds (hence the x2) of 5 fold cross validation.
-This can be reproduced by using quapy.data.base.Dataset.kFCV(), e.g.:
>>> import quapy as qp
->>> collection = qp.datasets.fetch_UCIBinaryLabelledCollection("yeast")
->>> for data in qp.data.base.Dataset.kFCV(collection, nfolds=5, nrepeats=2):
-...     ...
-The list of valid dataset names can be accessed in quapy.data.datasets.UCI_DATASETS
-dataset_name – a dataset name
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default -~/quapy_data/ directory)
test_split – proportion of documents to be included in the test set. The rest forms the training set
verbose – set to True (default is False) to get information (from the UCI ML repository) about the datasets
a quapy.data.base.LabelledCollection instance
Loads a UCI multiclass dataset as an instance of quapy.data.base.Dataset.
The list of available datasets is taken from https://archive.ics.uci.edu/, following these criteria: -- It has more than 1000 instances -- It is suited for classification -- It has more than two classes -- It is available for Python import (requires ucimlrepo package)
->>> import quapy as qp
->>> dataset = qp.datasets.fetch_UCIMulticlassDataset("dry-bean")
->>> train, test = dataset.train_test
->>> ...
-The list of valid dataset names can be accessed in quapy.data.datasets.UCI_MULTICLASS_DATASETS
-The datasets are downloaded only once and pickled into disk, saving time for consecutive calls.
-dataset_name – a dataset name
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default -~/quapy_data/ directory)
test_split – proportion of documents to be included in the test set. The rest forms the training set
verbose – set to True (default is False) to get information (stats) about the dataset
a quapy.data.base.Dataset instance
Loads a UCI multiclass collection as an instance of quapy.data.base.LabelledCollection.
The list of available datasets is taken from https://archive.ics.uci.edu/, following these criteria: -- It has more than 1000 instances -- It is suited for classification -- It has more than two classes -- It is available for Python import (requires ucimlrepo package)
->>> import quapy as qp
->>> collection = qp.datasets.fetch_UCIMulticlassLabelledCollection("dry-bean")
->>> X, y = collection.Xy
->>> ...
-The list of valid dataset names can be accessed in quapy.data.datasets.UCI_MULTICLASS_DATASETS
-The datasets are downloaded only once and pickled into disk, saving time for consecutive calls.
-dataset_name – a dataset name
data_home – specify the quapy home directory where the dataset will be dumped (leave empty to use the default -~/quapy_data/ directory)
test_split – proportion of documents to be included in the test set. The rest forms the training set
verbose – set to True (default is False) to get information (stats) about the dataset
a quapy.data.base.LabelledCollection instance
Loads the official datasets provided for the LeQua competition. -In brief, there are 4 tasks (T1A, T1B, T2A, T2B) having to do with text quantification -problems. Tasks T1A and T1B provide documents in vector form, while T2A and T2B provide raw documents instead. -Tasks T1A and T2A are binary sentiment quantification problems, while T1B and T2B are multiclass quantification -problems consisting of estimating the class prevalence values of 28 different merchandise products. -We refer to Esuli, A., Moreo, A., Sebastiani, F., & Sperduti, G. (2022), -A Detailed Overview of LeQua@CLEF 2022: Learning to Quantify, for a detailed description -of the tasks and datasets.
-The datasets are downloaded only once, and stored for fast reuse.
-See lequa2022_experiments.py provided in the example folder, that can serve as a guide on how to use these -datasets.
-task – a string representing the task name; valid ones are T1A, T1B, T2A, and T2B
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default -~/quapy_data/ directory)
a tuple (train, val_gen, test_gen) where train is an instance of
-quapy.data.base.LabelledCollection, val_gen and test_gen are instances of
-quapy.data._lequa2022.SamplesFromDir, a subclass of quapy.protocol.AbstractProtocol,
-that return a series of samples stored in a directory which are labelled by prevalence.
Loads a Reviews dataset as a Dataset instance, as used in -Esuli, A., Moreo, A., and Sebastiani, F. “A recurrent neural network for sentiment quantification.” -Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 2018. -The list of valid dataset names can be accessed in quapy.data.datasets.REVIEWS_SENTIMENT_DATASETS
-dataset_name – the name of the dataset: valid ones are ‘hp’, ‘kindle’, ‘imdb’
tfidf – set to True to transform the raw documents into tfidf weighted matrices
min_df – minimum number of documents that should contain a term in order for the term to be -kept (ignored if tfidf==False)
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default -~/quapy_data/ directory)
pickle – set to True to pickle the Dataset object the first time it is generated, in order to allow for -faster subsequent invocations
a quapy.data.base.Dataset instance
Loads a Twitter dataset as a quapy.data.base.Dataset instance, as used in:
-Gao, W., Sebastiani, F.: From classification to quantification in tweet sentiment analysis.
-Social Network Analysis and Mining 6(19), 1–22 (2016)
-Note that the datasets ‘semeval13’, ‘semeval14’, ‘semeval15’ share the same training set.
-The list of valid dataset names corresponding to training sets can be accessed in
-quapy.data.datasets.TWITTER_SENTIMENT_DATASETS_TRAIN, while the test sets can be accessed in
-quapy.data.datasets.TWITTER_SENTIMENT_DATASETS_TEST
dataset_name – the name of the dataset: valid ones are ‘gasp’, ‘hcr’, ‘omd’, ‘sanders’, ‘semeval13’, -‘semeval14’, ‘semeval15’, ‘semeval16’, ‘sst’, ‘wa’, ‘wb’
for_model_selection – if True, then returns the train split as the training set and the devel split -as the test set; if False, then returns the train+devel split as the training set and the test set as the -test set
min_df – minimum number of documents that should contain a term in order for the term to be kept
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default -~/quapy_data/ directory)
pickle – set to True to pickle the Dataset object the first time it is generated, in order to allow for -faster subsequent invocations
a quapy.data.base.Dataset instance
Bases: object
This class implements an sklearn-style transformer that indexes text as numerical ids for the tokens it -contains, i.e., the tokens that would be generated by sklearn’s -CountVectorizer
-kwargs – keyword arguments for sklearn’s CountVectorizer
-Adds a new token (regardless of whether it has been found in the text or not), with a dedicated id. -Useful to define special tokens for codifying unknown words, or padding tokens.
-word – string, surface form of the token
id – integer, numerical value to assign to the token (leave as None for indicating the next valid id, -default)
nogaps – if set to True (default) asserts that the id indicated leads to no numerical gaps with -precedent ids stored so far
integer, the numerical id for the new token
-Fits the transformer, i.e., decides on the vocabulary, given a list of strings.
-X – a list of strings
-self
-Fits the transform on X and transforms it.
-X – a list of strings
n_jobs – the number of parallel workers to carry out this task
a np.ndarray of numerical ids
-Indexes the tokens of a textual quapy.data.base.Dataset of string documents.
-To index a document means to replace each different token by a unique numerical index.
-Rare words (i.e., words occurring less than min_df times) are replaced by a special token UNK
dataset – a quapy.data.base.Dataset object where the instances of training and test documents
-are lists of str
min_df – minimum number of occurrences below which the term is replaced by a UNK index
inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)
kwargs – the rest of parameters of the transformation (as for sklearn’s -CountVectorizer; see https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
a new quapy.data.base.Dataset (if inplace=False) or a reference to the current
-quapy.data.base.Dataset (inplace=True) consisting of lists of integer values representing indices.
Reduces the dimensionality of the instances, represented as a csr_matrix (or any subtype of -scipy.sparse.spmatrix), of training and test documents by removing the columns of words which are not present -in at least min_df instances in the training set
-dataset – a quapy.data.base.Dataset in which instances are represented in sparse format (any
-subtype of scipy.sparse.spmatrix)
min_df – integer, minimum number of instances below which the columns are removed
inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)
a new quapy.data.base.Dataset (if inplace=False) or a reference to the current
-quapy.data.base.Dataset (inplace=True) where the dimensions corresponding to infrequent terms
-in the training set have been removed
Standardizes the real-valued columns of a quapy.data.base.Dataset.
-Standardization, aka z-scoring, of a variable X comes down to subtracting the average and normalizing by the
-standard deviation.
dataset – a quapy.data.base.Dataset object
inplace – set to True if the transformation is to be applied inplace, or to False (default) if a new
-quapy.data.base.Dataset is to be returned
an instance of quapy.data.base.Dataset
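A minimal sketch of z-scoring on plain lists of rows follows. It assumes, as is common practice (the text above does not state it), that the means and standard deviations are estimated on the training instances and then reused for the test instances; `fit_zscore` and `apply_zscore` are hypothetical helper names.

```python
from statistics import fmean, pstdev


def fit_zscore(rows):
    """Estimate per-column mean and standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # a constant column has std 0; use 1.0 instead to avoid dividing by zero
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_zscore(rows, means, stds):
    """Subtract the mean and divide by the standard deviation, column-wise."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


train = [[1.0, 10.0], [3.0, 10.0]]
test = [[5.0, 12.0]]
means, stds = fit_zscore(train)
train_z = apply_zscore(train, means, stds)
test_z = apply_zscore(test, means, stds)
```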
Transforms a quapy.data.base.Dataset of textual instances into a quapy.data.base.Dataset of
-tfidf weighted sparse vectors
dataset – a quapy.data.base.Dataset where the instances of training and test collections are
-lists of str
min_df – minimum number of occurrences for a word to be considered as part of the vocabulary (default 3)
sublinear_tf – whether or not to apply the log scaling to the tf counters (default True)
inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)
kwargs – the rest of parameters of the transformation (as for sklearn’s -TfidfVectorizer)
a new quapy.data.base.Dataset in csr_matrix format (if inplace=False) or a reference to the
-current Dataset (if inplace=True) where the instances are stored in a csr_matrix of real-valued tfidf scores
Binarizes a categorical array-like collection of labels towards the positive class pos_class. E.g.,:
->>> binarize([1, 2, 3, 1, 1, 0], pos_class=2)
->>> array([0, 1, 0, 0, 0, 0])
-y – array-like of labels
pos_class – integer, the positive class
a binary np.ndarray, in which value 1 corresponds to positions in which y had pos_class labels, and -0 otherwise
-Reads a csv file in which columns are separated by ‘,’. -File format <label>,<feat1>,<feat2>,…,<featn>
-path – path to the csv file
encoding – the text encoding used to open the file
a np.ndarray for the labels and a ndarray (float) for the covariates
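For illustration, a reader for the `<label>,<feat1>,…,<featn>` format could look like the sketch below. It uses only the standard library, returns plain lists rather than arrays, and `read_labelled_csv` is a hypothetical name, not the reader shipped with the package.

```python
import csv
import io


def read_labelled_csv(stream):
    """Parse lines of the form <label>,<feat1>,...,<featn>."""
    labels, rows = [], []
    for record in csv.reader(stream):
        if not record:  # skip blank lines
            continue
        labels.append(record[0])
        rows.append([float(value) for value in record[1:]])
    return labels, rows


# an in-memory file stands in for open(path, encoding=...)
labels, rows = read_labelled_csv(io.StringIO("1,0.5,2.0\n0,1.5,3.0\n"))
print(labels)  # ['1', '0']
print(rows)    # [[0.5, 2.0], [1.5, 3.0]]
```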
-Reads a labelled collection of real-valued instances expressed in sparse format -File format <-1 or 0 or 1>[s col(int):val(float)]
-path – path to the labelled collection
-a csr_matrix containing the instances (rows), and a ndarray containing the labels
-Reads a labelled collection of documents. -File format <0 or 1> <document>
-path – path to the labelled collection
encoding – the text encoding used to open the file
verbose – if >0 (default) shows some progress information in standard output
a list of sentences, and a list of labels
-Re-indexes a list of labels as a list of indexes, and returns the classnames corresponding to the indexes. -E.g.:
->>> reindex_labels(['B', 'B', 'A', 'C'])
->>> (array([1, 1, 0, 2]), array(['A', 'B', 'C'], dtype='<U1'))
-y – the list or array of original labels
-a ndarray (int) of class indexes, and a ndarray of classnames corresponding to the indexes.
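The re-indexing shown in the example above can be reproduced with a short pure-Python sketch. It assumes that class names are ordered by sorting, which matches the example output, and it returns lists instead of ndarrays.

```python
def reindex_labels(y):
    """Map each label to the position of its class name in sorted order."""
    classnames = sorted(set(y))
    position = {name: i for i, name in enumerate(classnames)}
    return [position[label] for label in y], classnames


print(reindex_labels(['B', 'B', 'A', 'C']))  # ([1, 1, 0, 2], ['A', 'B', 'C'])
```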
-Implementation of error measures used for quantification
-Absolute error between two prevalence vectors \(p\) and \(\hat{p}\) is computed as -\(AE(p,\hat{p})=\frac{1}{|\mathcal{Y}|}\sum_{y\in \mathcal{Y}}|\hat{p}(y)-p(y)|\), -where \(\mathcal{Y}\) are the classes of interest.
-prevs – array-like of shape (n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_classes,) with the predicted prevalence values
absolute error
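The AE formula translates directly into code. The sketch below works on plain sequences of floats and is a standalone illustration of the definition, not the library function.

```python
def ae(prevs, prevs_hat):
    """Mean, over the classes, of |p_hat(y) - p(y)|."""
    if len(prevs) != len(prevs_hat):
        raise ValueError("both prevalence vectors must cover the same classes")
    return sum(abs(ph - p) for p, ph in zip(prevs, prevs_hat)) / len(prevs)


# true prevalence (0.2, 0.8) against estimate (0.3, 0.7):
# (|0.3 - 0.2| + |0.7 - 0.8|) / 2, i.e. approximately 0.1
print(ae([0.2, 0.8], [0.3, 0.7]))
```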
-Computes the error in terms of 1-accuracy. The accuracy is computed as -\(\frac{tp+tn}{tp+fp+fn+tn}\), with tp, fp, fn, and tn standing -for true positives, false positives, false negatives, and true negatives, -respectively
-y_true – array-like of true labels
y_pred – array-like of predicted labels
1-accuracy
-Computes the error in terms of 1-accuracy. The accuracy is computed as -\(\frac{tp+tn}{tp+fp+fn+tn}\), with tp, fp, fn, and tn standing -for true positives, false positives, false negatives, and true negatives, -respectively
-y_true – array-like of true labels
y_pred – array-like of predicted labels
1-accuracy
-Absolute error between two prevalence vectors \(p\) and \(\hat{p}\) is computed as -\(AE(p,\hat{p})=\frac{1}{|\mathcal{Y}|}\sum_{y\in \mathcal{Y}}|\hat{p}(y)-p(y)|\), -where \(\mathcal{Y}\) are the classes of interest.
-prevs – array-like of shape (n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_classes,) with the predicted prevalence values
absolute error
-F1 error: simply computes the error in terms of macro \(F_1\), i.e., -\(1-F_1^M\), where \(F_1\) is the harmonic mean of precision and recall, -defined as \(\frac{2tp}{2tp+fp+fn}\), with tp, fp, and fn standing -for true positives, false positives, and false negatives, respectively. -Macro averaging means the \(F_1\) is computed for each category independently, -and then averaged.
-y_true – array-like of true labels
y_pred – array-like of predicted labels
\(1-F_1^M\)
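To see how macro averaging works in practice, here is a small sketch that computes \(1-F_1^M\) from two label lists. It is an illustration of the definition above; the set of classes is taken as the union of the true and predicted labels, which is an assumption of this example.

```python
def f1_error(y_true, y_pred):
    """1 minus the macro-averaged F1 over the classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # every class occurs in y_true or y_pred, so the denominator is > 0
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    return 1 - sum(f1_scores) / len(f1_scores)


# class 0: tp=1, fp=0, fn=1, so F1 = 2/3; class 1: tp=2, fp=1, fn=0, so F1 = 4/5
error = f1_error([0, 0, 1, 1], [0, 1, 1, 1])
```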
-F1 error: simply computes the error in terms of macro \(F_1\), i.e., -\(1-F_1^M\), where \(F_1\) is the harmonic mean of precision and recall, -defined as \(\frac{2tp}{2tp+fp+fn}\), with tp, fp, and fn standing -for true positives, false positives, and false negatives, respectively. -Macro averaging means the \(F_1\) is computed for each category independently, -and then averaged.
-y_true – array-like of true labels
y_pred – array-like of predicted labels
\(1-F_1^M\)
-Gets an error function from its name. E.g., from_name("mae")
-will return function quapy.error.mae()
err_name – string, the error name
-a callable implementing the requested error
-Kullback-Leibler divergence between two prevalence distributions \(p\) and \(\hat{p}\)
-is computed as
-\(KLD(p,\hat{p})=D_{KL}(p||\hat{p})=
-\sum_{y\in \mathcal{Y}} p(y)\log\frac{p(y)}{\hat{p}(y)}\),
-where \(\mathcal{Y}\) are the classes of interest.
-The distributions are smoothed using the eps factor (see quapy.error.smooth()).
prevs – array-like of shape (n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_classes,) with the predicted prevalence values
eps – smoothing factor. KLD is not defined in cases in which the distributions contain -zeros; eps is typically set to be \(\frac{1}{2T}\), with \(T\) the sample size. -If eps=None, the sample size will be taken from the environment variable SAMPLE_SIZE -(which has thus to be set beforehand).
Kullback-Leibler divergence between the two distributions
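The role of eps is easiest to see in code. The sketch below assumes additive smoothing of the form \((p(y)+\epsilon)/(\epsilon|\mathcal{Y}|+1)\); the text above only points to quapy.error.smooth() without giving its formula, so treat that form as an assumption of this example.

```python
import math


def smooth(prevs, eps):
    """Additive smoothing; the result still sums to 1 (assumed form)."""
    n_classes = len(prevs)
    return [(p + eps) / (eps * n_classes + 1) for p in prevs]


def kld(prevs, prevs_hat, eps):
    """KL divergence D(p || p_hat) computed on the smoothed distributions."""
    p = smooth(prevs, eps)
    q = smooth(prevs_hat, eps)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


sample_size = 100
eps = 1 / (2 * sample_size)  # the typical choice 1/(2T)
# without smoothing, the zero in the true distribution would break the log
value = kld([1.0, 0.0], [0.5, 0.5], eps)
```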
-Computes the mean absolute error (see quapy.error.ae()) across the sample pairs.
prevs – array-like of shape (n_samples, n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_samples, n_classes,) with the predicted -prevalence values
mean absolute error
-Computes the mean absolute error (see quapy.error.ae()) across the sample pairs.
prevs – array-like of shape (n_samples, n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_samples, n_classes,) with the predicted -prevalence values
mean absolute error
-Computes the mean normalized absolute error (see quapy.error.nae()) across the sample pairs.
prevs – array-like of shape (n_samples, n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_samples, n_classes,) with the predicted -prevalence values
mean normalized absolute error
-Computes the mean normalized relative absolute error (see quapy.error.nrae()) across
-the sample pairs. The distributions are smoothed using the eps factor (see
-quapy.error.smooth()).
prevs – array-like of shape (n_samples, n_classes,) with the true -prevalence values
prevs_hat – array-like of shape (n_samples, n_classes,) with the predicted -prevalence values
eps – smoothing factor. mnrae is not defined in cases in which the true -distribution contains zeros; eps is typically set to be \(\frac{1}{2T}\), -with \(T\) the sample size. If eps=None, the sample size will be taken from -the environment variable SAMPLE_SIZE (which has thus to be set beforehand).
mean normalized relative absolute error
-Computes the mean relative absolute error (see quapy.error.rae()) across
-the sample pairs. The distributions are smoothed using the eps factor (see
-quapy.error.smooth()).
prevs – array-like of shape (n_samples, n_classes,) with the true -prevalence values
prevs_hat – array-like of shape (n_samples, n_classes,) with the predicted -prevalence values
eps – smoothing factor. mrae is not defined in cases in which the true -distribution contains zeros; eps is typically set to be \(\frac{1}{2T}\), -with \(T\) the sample size. If eps=None, the sample size will be taken from -the environment variable SAMPLE_SIZE (which has thus to be set beforehand).
mean relative absolute error
-Computes the mean Kullback-Leibler divergence (see quapy.error.kld()) across the
-sample pairs. The distributions are smoothed using the eps factor
-(see quapy.error.smooth()).
prevs – array-like of shape (n_samples, n_classes,) with the true -prevalence values
prevs_hat – array-like of shape (n_samples, n_classes,) with the predicted -prevalence values
eps – smoothing factor. KLD is not defined in cases in which the distributions contain -zeros; eps is typically set to be \(\frac{1}{2T}\), with \(T\) the sample size. -If eps=None, the sample size will be taken from the environment variable SAMPLE_SIZE -(which has thus to be set beforehand).
mean Kullback-Leibler divergence
-Computes the mean normalized absolute error (see quapy.error.nae()) across the sample pairs.
prevs – array-like of shape (n_samples, n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_samples, n_classes,) with the predicted -prevalence values
mean normalized absolute error
-Computes the mean Normalized Kullback-Leibler divergence (see quapy.error.nkld())
-across the sample pairs. The distributions are smoothed using the eps factor
-(see quapy.error.smooth()).
prevs – array-like of shape (n_samples, n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_samples, n_classes,) with the predicted -prevalence values
eps – smoothing factor. NKLD is not defined in cases in which the distributions contain -zeros; eps is typically set to be \(\frac{1}{2T}\), with \(T\) the sample size. -If eps=None, the sample size will be taken from the environment variable SAMPLE_SIZE -(which has thus to be set beforehand).
mean Normalized Kullback-Leibler divergence
-Computes the mean normalized relative absolute error (see quapy.error.nrae()) across
-the sample pairs. The distributions are smoothed using the eps factor (see
-quapy.error.smooth()).
prevs – array-like of shape (n_samples, n_classes,) with the true -prevalence values
prevs_hat – array-like of shape (n_samples, n_classes,) with the predicted -prevalence values
eps – smoothing factor. mnrae is not defined in cases in which the true -distribution contains zeros; eps is typically set to be \(\frac{1}{2T}\), -with \(T\) the sample size. If eps=None, the sample size will be taken from -the environment variable SAMPLE_SIZE (which has thus to be set beforehand).
mean normalized relative absolute error
-Computes the mean relative absolute error (see quapy.error.rae()) across
-the sample pairs. The distributions are smoothed using the eps factor (see
-quapy.error.smooth()).
prevs – array-like of shape (n_samples, n_classes,) with the true -prevalence values
prevs_hat – array-like of shape (n_samples, n_classes,) with the predicted -prevalence values
eps – smoothing factor. mrae is not defined in cases in which the true -distribution contains zeros; eps is typically set to be \(\frac{1}{2T}\), -with \(T\) the sample size. If eps=None, the sample size will be taken from -the environment variable SAMPLE_SIZE (which has thus to be set beforehand).
mean relative absolute error
-Computes the mean squared error (see quapy.error.se()) across the sample pairs.
prevs – array-like of shape (n_samples, n_classes,) with the -true prevalence values
prevs_hat – array-like of shape (n_samples, n_classes,) with the -predicted prevalence values
mean squared error
-Normalized absolute error between two prevalence vectors \(p\) and \(\hat{p}\) is computed as -\(NAE(p,\hat{p})=\frac{AE(p,\hat{p})}{z_{AE}}\), -where \(z_{AE}=\frac{2(1-\min_{y\in \mathcal{Y}} p(y))}{|\mathcal{Y}|}\), and \(\mathcal{Y}\) -are the classes of interest.
-prevs – array-like of shape (n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_classes,) with the predicted prevalence values
normalized absolute error
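The normalization above can be illustrated with a minimal pure-Python sketch. The name `nae_sketch` is hypothetical, and the sketch assumes that AE is the mean absolute difference across classes; it is not the library's implementation.

```python
def nae_sketch(prevs, prevs_hat):
    """Normalized absolute error between two prevalence vectors (plain lists)."""
    n_classes = len(prevs)
    # AE: mean absolute difference across classes (assumed definition)
    ae = sum(abs(ph - p) for p, ph in zip(prevs, prevs_hat)) / n_classes
    # z_AE: the normalization constant from the formula above
    z_ae = 2 * (1 - min(prevs)) / n_classes
    return ae / z_ae


worst = nae_sketch([0.5, 0.5], [1.0, 0.0])
perfect = nae_sketch([0.2, 0.8], [0.2, 0.8])
```

In this example `worst` evaluates to 1 and `perfect` to 0, which shows how dividing by \(z_{AE}\) rescales the error.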
-Normalized Kullback-Leibler divergence between two prevalence distributions \(p\) and
-\(\hat{p}\) is computed as
-\(NKLD(p,\hat{p}) = 2\frac{e^{KLD(p,\hat{p})}}{e^{KLD(p,\hat{p})}+1}-1\),
-where
-\(\mathcal{Y}\) are the classes of interest.
-The distributions are smoothed using the eps factor (see quapy.error.smooth()).
prevs – array-like of shape (n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_classes,) with the predicted prevalence values
eps – smoothing factor. NKLD is not defined in cases in which the distributions -contain zeros; eps is typically set to be \(\frac{1}{2T}\), with \(T\) the sample -size. If eps=None, the sample size will be taken from the environment variable -SAMPLE_SIZE (which has thus to be set beforehand).
Normalized Kullback-Leibler divergence between the two distributions
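A minimal sketch of the computation, assuming the usual definition \(KLD(p,\hat{p})=\sum_{y} p(y)\log\frac{p(y)}{\hat{p}(y)}\) applied to the smoothed distributions, and eps set to \(\frac{1}{2T}\). The helper names `smooth_sketch` and `nkld_sketch` are hypothetical and do not belong to the library.

```python
import math


def smooth_sketch(prevs, eps):
    # Additive smoothing so that no class has zero prevalence
    denom = eps * len(prevs) + sum(prevs)
    return [(eps + p) / denom for p in prevs]


def nkld_sketch(prevs, prevs_hat, sample_size):
    eps = 1.0 / (2.0 * sample_size)  # the typical choice eps = 1/(2T)
    p = smooth_sketch(prevs, eps)
    p_hat = smooth_sketch(prevs_hat, eps)
    # KLD between the smoothed distributions (assumed standard definition)
    kld = sum(pi * math.log(pi / qi) for pi, qi in zip(p, p_hat))
    e = math.exp(kld)
    # Squash KLD from [0, inf) into [0, 1)
    return 2.0 * e / (e + 1.0) - 1.0


same = nkld_sketch([0.3, 0.7], [0.3, 0.7], sample_size=100)
opposite = nkld_sketch([1.0, 0.0], [0.0, 1.0], sample_size=100)
```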
-Normalized absolute error between two prevalence vectors \(p\) and \(\hat{p}\) is computed as -\(NAE(p,\hat{p})=\frac{AE(p,\hat{p})}{z_{AE}}\), -where \(z_{AE}=\frac{2(1-\min_{y\in \mathcal{Y}} p(y))}{|\mathcal{Y}|}\), and \(\mathcal{Y}\) -are the classes of interest.
-prevs – array-like of shape (n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_classes,) with the predicted prevalence values
normalized absolute error
-Normalized relative absolute error between two prevalence vectors \(p\) and \(\hat{p}\)
-is computed as
-\(NRAE(p,\hat{p})= \frac{RAE(p,\hat{p})}{z_{RAE}}\),
-where
-\(z_{RAE} = \frac{|\mathcal{Y}|-1+\frac{1-\min_{y\in \mathcal{Y}} p(y)}{\min_{y\in \mathcal{Y}} p(y)}}{|\mathcal{Y}|}\)
-and \(\mathcal{Y}\) are the classes of interest.
-The distributions are smoothed using the eps factor (see quapy.error.smooth()).
prevs – array-like of shape (n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_classes,) with the predicted prevalence values
eps – smoothing factor. nrae is not defined in cases in which the true distribution -contains zeros; eps is typically set to be \(\frac{1}{2T}\), with \(T\) the -sample size. If eps=None, the sample size will be taken from the environment variable -SAMPLE_SIZE (which has thus to be set beforehand).
normalized relative absolute error
-Normalized relative absolute error between two prevalence vectors \(p\) and \(\hat{p}\)
-is computed as
-\(NRAE(p,\hat{p})= \frac{RAE(p,\hat{p})}{z_{RAE}}\),
-where
-\(z_{RAE} = \frac{|\mathcal{Y}|-1+\frac{1-\min_{y\in \mathcal{Y}} p(y)}{\min_{y\in \mathcal{Y}} p(y)}}{|\mathcal{Y}|}\)
-and \(\mathcal{Y}\) are the classes of interest.
-The distributions are smoothed using the eps factor (see quapy.error.smooth()).
prevs – array-like of shape (n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_classes,) with the predicted prevalence values
eps – smoothing factor. nrae is not defined in cases in which the true distribution -contains zeros; eps is typically set to be \(\frac{1}{2T}\), with \(T\) the -sample size. If eps=None, the sample size will be taken from the environment variable -SAMPLE_SIZE (which has thus to be set beforehand).
normalized relative absolute error
-Relative absolute error between two prevalence vectors \(p\) and \(\hat{p}\)
-is computed as
-\(RAE(p,\hat{p})=
-\frac{1}{|\mathcal{Y}|}\sum_{y\in \mathcal{Y}}\frac{|\hat{p}(y)-p(y)|}{p(y)}\),
-where \(\mathcal{Y}\) are the classes of interest.
-The distributions are smoothed using the eps factor (see quapy.error.smooth()).
prevs – array-like of shape (n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_classes,) with the predicted prevalence values
eps – smoothing factor. rae is not defined in cases in which the true distribution -contains zeros; eps is typically set to be \(\frac{1}{2T}\), with \(T\) the -sample size. If eps=None, the sample size will be taken from the environment variable -SAMPLE_SIZE (which has thus to be set beforehand).
relative absolute error
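The RAE formula, together with the smoothing step, can be sketched as follows. The function names are hypothetical, and the sketch assumes that both distributions are smoothed with eps set to \(\frac{1}{2T}\) before the error is computed.

```python
def smooth_sketch(prevs, eps):
    # Additive smoothing so that no class has zero prevalence
    denom = eps * len(prevs) + sum(prevs)
    return [(eps + p) / denom for p in prevs]


def rae_sketch(prevs, prevs_hat, sample_size):
    eps = 1.0 / (2.0 * sample_size)  # the typical choice eps = 1/(2T)
    p = smooth_sketch(prevs, eps)
    p_hat = smooth_sketch(prevs_hat, eps)
    # Mean over classes of |p_hat(y) - p(y)| / p(y)
    return sum(abs(q - t) / t for t, q in zip(p, p_hat)) / len(p)


error = rae_sketch([0.5, 0.5], [0.25, 0.75], sample_size=100)
# Without smoothing the division below would fail, since the true prevalence of class 0 is zero
with_zero = rae_sketch([0.0, 1.0], [0.1, 0.9], sample_size=100)
```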
-Relative absolute error between two prevalence vectors \(p\) and \(\hat{p}\)
-is computed as
-\(RAE(p,\hat{p})=
-\frac{1}{|\mathcal{Y}|}\sum_{y\in \mathcal{Y}}\frac{|\hat{p}(y)-p(y)|}{p(y)}\),
-where \(\mathcal{Y}\) are the classes of interest.
-The distributions are smoothed using the eps factor (see quapy.error.smooth()).
prevs – array-like of shape (n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_classes,) with the predicted prevalence values
eps – smoothing factor. rae is not defined in cases in which the true distribution -contains zeros; eps is typically set to be \(\frac{1}{2T}\), with \(T\) the -sample size. If eps=None, the sample size will be taken from the environment variable -SAMPLE_SIZE (which has thus to be set beforehand).
relative absolute error
-Squared error between two prevalence vectors \(p\) and \(\hat{p}\) is computed as -\(SE(p,\hat{p})=\frac{1}{|\mathcal{Y}|}\sum_{y\in \mathcal{Y}}(\hat{p}(y)-p(y))^2\), -where -\(\mathcal{Y}\) are the classes of interest.
-prevs – array-like of shape (n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_classes,) with the predicted prevalence values
squared error
-Smooths a prevalence distribution with \(\epsilon\) (eps) as: -\(\underline{p}(y)=\frac{\epsilon+p(y)}{\epsilon|\mathcal{Y}|+ -\displaystyle\sum_{y\in \mathcal{Y}}p(y)}\)
-prevs – array-like of shape (n_classes,) with the true prevalence values
eps – smoothing factor
array-like of shape (n_classes,) with the smoothed distribution
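As an illustration, the smoothing formula can be written in a few lines of plain Python. The name `smooth_sketch` is hypothetical; this is a sketch of the formula above and not the library code.

```python
def smooth_sketch(prevs, eps):
    """Applies (eps + p(y)) / (eps * |Y| + sum_y p(y)) to every class."""
    denom = eps * len(prevs) + sum(prevs)
    return [(eps + p) / denom for p in prevs]


smoothed = smooth_sketch([0.0, 0.5, 0.5], eps=0.005)
```

The smoothed vector still sums to 1, and the class with zero prevalence now receives a small positive mass.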
-Evaluates a quantification model according to a specific sample generation protocol and in terms of one -evaluation metric (error).
-model – a quantifier, instance of quapy.method.base.BaseQuantifier
protocol – quapy.protocol.AbstractProtocol; if this object is also instance of
-quapy.protocol.OnLabelledCollectionProtocol, then the aggregation speed-up can be run. This is the
-protocol in charge of generating the samples in which the model is evaluated.
error_metric – a string representing the name(s) of an error function in qp.error -(e.g., ‘mae’), or a callable function implementing the error function itself.
aggr_speedup – whether or not to apply the speed-up. Set to “force” for applying it even if the number of -instances in the original collection on which the protocol acts is larger than the number of instances -in the samples to be generated. Set to True or “auto” (default) for letting QuaPy decide whether it is -convenient or not. Set to False to deactivate.
verbose – boolean, show or not information in stdout
if the error metric is not averaged (e.g., ‘ae’, ‘rae’), returns an array of shape (n_samples,) with -the error scores for each sample; if the error metric is averaged (e.g., ‘mae’, ‘mrae’) then returns -a single float
-Evaluates a quantification model on a given set of samples and in terms of one evaluation metric (error).
-model – a quantifier, instance of quapy.method.base.BaseQuantifier
samples – a list of samples on which the quantifier is to be evaluated
error_metric – a string representing the name(s) of an error function in qp.error -(e.g., ‘mae’), or a callable function implementing the error function itself.
verbose – boolean, show or not information in stdout
if the error metric is not averaged (e.g., ‘ae’, ‘rae’), returns an array of shape (n_samples,) with -the error scores for each sample; if the error metric is averaged (e.g., ‘mae’, ‘mrae’) then returns -a single float
-Generates a report (a pandas DataFrame) containing information on the evaluation of the model according -to a specific protocol and in terms of one or more evaluation metrics (errors).
-model – a quantifier, instance of quapy.method.base.BaseQuantifier
protocol – quapy.protocol.AbstractProtocol; if this object is also instance of
-quapy.protocol.OnLabelledCollectionProtocol, then the aggregation speed-up can be run. This is the protocol
-in charge of generating the samples in which the model is evaluated.
error_metrics – a string, or list of strings, representing the name(s) of an error function in qp.error -(e.g., ‘mae’, the default value), or a callable function, or a list of callable functions, implementing -the error function itself.
aggr_speedup – whether or not to apply the speed-up. Set to “force” for applying it even if the number of -instances in the original collection on which the protocol acts is larger than the number of instances -in the samples to be generated. Set to True or “auto” (default) for letting QuaPy decide whether it is -convenient or not. Set to False to deactivate.
verbose – boolean, show or not information in stdout
a pandas’ DataFrame containing the columns ‘true-prev’ (the true prevalence of each sample), -‘estim-prev’ (the prevalence estimated by the model for each sample), and as many columns as error metrics -have been indicated, each displaying the score in terms of that metric for every sample.
-Uses a quantification model to generate predictions for the samples generated via a specific protocol. -This function is central to all evaluation processes, and is endowed with an optimization to speed-up the -prediction of protocols that generate samples from a large collection. The optimization applies to aggregative -quantifiers only, and to OnLabelledCollectionProtocol protocols, and comes down to generating the classification -predictions once and for all, and then generating samples over the classification predictions (instead of over -the raw instances), so that the classifier prediction is never called again. This behaviour is obtained by -setting aggr_speedup to ‘auto’ or True, and is only carried out if the overall process is convenient in terms -of computations (e.g., if the number of classification predictions needed for the original collection exceed the -number of classification predictions needed for all samples, then the optimization is not undertaken).
-model – a quantifier, instance of quapy.method.base.BaseQuantifier
protocol – quapy.protocol.AbstractProtocol; if this object is also instance of
-quapy.protocol.OnLabelledCollectionProtocol, then the aggregation speed-up can be run. This is the protocol
-in charge of generating the samples for which the model has to issue class prevalence predictions.
aggr_speedup – whether or not to apply the speed-up. Set to “force” for applying it even if the number of -instances in the original collection on which the protocol acts is larger than the number of instances -in the samples to be generated. Set to True or “auto” (default) for letting QuaPy decide whether it is -convenient or not. Set to False to deactivate.
verbose – boolean, show or not information in stdout
a tuple (true_prevs, estim_prevs) in which each element in the tuple is an array of shape -(n_samples, n_classes) containing the true, or predicted, prevalence values for each sample
-Computes the Hellinger Distance (HD) between (discretized) distributions P and Q. -The HD for two discrete distributions of k bins is defined as:
-P – real-valued array-like of shape (k,) representing a discrete distribution
Q – real-valued array-like of shape (k,) representing a discrete distribution
float
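The formula itself is not reproduced in the text above. The sketch below assumes the unnormalized form \(HD(P,Q)=\sqrt{\sum_{i=1}^{k}(\sqrt{p_i}-\sqrt{q_i})^2}\); some references include an additional \(\frac{1}{\sqrt{2}}\) factor, so this choice is an assumption. The name `hellinger_sketch` is hypothetical.

```python
import math


def hellinger_sketch(P, Q):
    # Assumed unnormalized Hellinger distance between two k-bin distributions
    return math.sqrt(sum((math.sqrt(p) - math.sqrt(q)) ** 2 for p, q in zip(P, Q)))


identical = hellinger_sketch([0.2, 0.3, 0.5], [0.2, 0.3, 0.5])
disjoint = hellinger_sketch([1.0, 0.0], [0.0, 1.0])
```

Under this assumption the distance is 0 for identical distributions and \(\sqrt{2}\) for distributions with disjoint support.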
-Topsoe distance between two (discretized) distributions P and Q. -The Topsoe distance for two discrete distributions of k bins is defined as:
-P – real-valued array-like of shape (k,) representing a discrete distribution
Q – real-valued array-like of shape (k,) representing a discrete distribution
float
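The formula is not reproduced in the text above. The sketch below assumes the common definition \(\sum_{i=1}^{k}\left(p_i\log\frac{2p_i}{p_i+q_i}+q_i\log\frac{2q_i}{p_i+q_i}\right)\), with the convention that terms with zero mass contribute 0. The name `topsoe_sketch` is hypothetical, and the way zeros are handled here is a choice of this sketch.

```python
import math


def topsoe_sketch(P, Q):
    total = 0.0
    for p, q in zip(P, Q):
        # A term with zero mass contributes nothing (0 * log 0 taken as 0)
        if p > 0:
            total += p * math.log(2 * p / (p + q))
        if q > 0:
            total += q * math.log(2 * q / (p + q))
    return total


identical = topsoe_sketch([0.2, 0.8], [0.2, 0.8])
disjoint = topsoe_sketch([1.0, 0.0], [0.0, 1.0])
```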
-Implements the adjustment of ACC and PACC for the binary case. The adjustment for a prevalence estimate of the -positive class p comes down to computing:
-prevalence_estim – float, the estimated value for the positive class
tpr – float, the true positive rate of the classifier
fpr – float, the false positive rate of the classifier
clip – set to True (default) to clip values that might exceed the range [0,1]
float, the adjusted count
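The formula is not reproduced in the text above. The sketch below assumes the usual adjusted-count correction \(\frac{p - fpr}{tpr - fpr}\). The name `adjust_binary_sketch` is hypothetical, and returning the unadjusted estimate when tpr equals fpr is a choice of this sketch, not documented behaviour.

```python
def adjust_binary_sketch(prevalence_estim, tpr, fpr, clip=True):
    den = tpr - fpr
    if den == 0:
        # The classifier carries no information; keep the raw estimate (assumption)
        adjusted = prevalence_estim
    else:
        adjusted = (prevalence_estim - fpr) / den
    if clip:
        # Keep the adjusted value inside the valid range [0, 1]
        adjusted = min(1.0, max(0.0, adjusted))
    return adjusted


clipped = adjust_binary_sketch(0.9, tpr=0.8, fpr=0.2)
unclipped = adjust_binary_sketch(0.9, tpr=0.8, fpr=0.2, clip=False)
```

In this example the raw correction exceeds 1, which is the situation the clip parameter is meant to handle.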
-Helper that, given a float representing the prevalence for the positive class, returns a np.ndarray of two -values representing a binary distribution.
-positive_prevalence – prevalence for the positive class
clip_if_necessary – if True, clips the value in [0,1] in order to guarantee the resulting distribution -is valid. If False, it then checks that the value is in the valid range, and raises an error if not.
np.ndarray of shape (2,)
-Checks that p is a valid prevalence vector, i.e., that it contains values in [0,1] and that the values sum up to 1.
-p – the prevalence vector to check
-True if p is valid, False otherwise
-Searches for the largest number of (equidistant) prevalence points to define for each of the n_classes classes so -that the number of valid prevalence values generated as combinations of prevalence points (points in a -n_classes-dimensional simplex) does not exceed combinations_budget.
-combinations_budget – integer, maximum number of combinations allowed
n_classes – integer, number of classes
n_repeats – integer, number of repetitions for each prevalence combination
the largest number of prevalence points that generate less than combinations_budget valid prevalences
-Performs a linear search for the best prevalence value in binary problems. The search is carried out by exploring -the range [0,1] stepping by 0.01. This search is inefficient, and is added only for completeness (some of the -early methods in quantification literature used it, e.g., HDy). A more powerful alternative is optim_minimize.
-loss – (callable) the function to minimize
n_classes – (int) the number of classes, i.e., the dimensionality of the prevalence vector
(ndarray) the best prevalence vector found
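A minimal sketch of such a search, restricted to the binary case. The name `linear_search_sketch` is hypothetical, and the sketch assumes that the loss receives a two-component vector ordered as [negative, positive].

```python
def linear_search_sketch(loss):
    best_prev, best_score = None, None
    # Explore the positive-class prevalence in [0, 1] in steps of 0.01
    for i in range(101):
        prev = i / 100
        score = loss([1 - prev, prev])
        if best_score is None or score < best_score:
            best_prev, best_score = prev, score
    return [1 - best_prev, best_prev]


# Toy loss whose minimum lies at a positive prevalence of 0.3
best = linear_search_sketch(lambda p: (p[1] - 0.3) ** 2)
```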
-Normalizes a vector or matrix of prevalence values. The normalization consists of applying an L1 normalization in -cases in which the prevalence values are not all zero, and of setting the prevalence values to 1/n_classes in -cases in which all values are zero.
-prevalences – array-like of shape (n_classes,) or of shape (n_samples, n_classes,) with prevalence values
-a normalized vector or matrix of prevalence values
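For a single vector, the two cases can be sketched as follows. The name `normalize_prevalence_sketch` is hypothetical, and the matrix case is left out for brevity.

```python
def normalize_prevalence_sketch(prevalences):
    n_classes = len(prevalences)
    if all(p == 0 for p in prevalences):
        # All-zero input: fall back to the uniform distribution
        return [1.0 / n_classes] * n_classes
    total = sum(abs(p) for p in prevalences)  # L1 norm
    return [p / total for p in prevalences]


scaled = normalize_prevalence_sketch([2, 1, 1])
uniform = normalize_prevalence_sketch([0, 0, 0, 0])
```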
-Computes the number of valid prevalence combinations in the n_classes-dimensional simplex if n_prevpoints equally -distant prevalence values are generated and n_repeats repetitions are requested. -The computation comes down to calculating:
-where N is n_prevpoints-1, i.e., the number of probability mass blocks to allocate, C is the number of -classes, and r is n_repeats. This solution comes from the -Stars and Bars problem.
-n_classes – integer, number of classes
n_prevpoints – integer, number of prevalence points.
n_repeats – integer, number of repetitions for each prevalence combination
The number of possible combinations. For example, if n_classes=2, n_prevpoints=5, n_repeats=1, then the -number of possible combinations is 5, i.e.: [0,1], [0.25,0.75], [0.50,0.50], [0.75,0.25], and [1.0,0.0]
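The formula is not reproduced in the text above. The sketch below assumes the standard Stars and Bars count \(\binom{N+C-1}{C-1}\times r\), with N, C and r as described. The function name is hypothetical.

```python
from math import comb


def num_prevalence_combinations_sketch(n_prevpoints, n_classes, n_repeats=1):
    n_blocks = n_prevpoints - 1  # N: probability mass blocks to allocate
    # Stars and Bars: ways to distribute N blocks among C classes, times r repeats
    return comb(n_blocks + n_classes - 1, n_classes - 1) * n_repeats


binary = num_prevalence_combinations_sketch(n_prevpoints=5, n_classes=2)
ternary = num_prevalence_combinations_sketch(n_prevpoints=21, n_classes=3)
```

The binary case reproduces the 5 combinations listed in the example; with three classes and 21 points the count already grows to 231.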
-Searches for the optimal prevalence values, i.e., an n_classes-dimensional vector of the (n_classes-1)-simplex -that yields the smallest loss. This optimization is carried out by means of a constrained search using scipy’s -SLSQP routine.
-loss – (callable) the function to minimize
n_classes – (int) the number of classes, i.e., the dimensionality of the prevalence vector
(ndarray) the best prevalence vector found
-Computes the prevalence values from a vector of labels.
-labels – array-like of shape (n_instances) with the label for each instance
classes – the class labels. This is needed in order to correctly compute the prevalence vector even when -some classes have no examples.
an ndarray of shape (len(classes)) with the class prevalence values
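The role of the classes argument can be seen in a small sketch. The name `prevalence_from_labels_sketch` is hypothetical, and it returns a plain list rather than an ndarray.

```python
from collections import Counter


def prevalence_from_labels_sketch(labels, classes):
    counts = Counter(labels)
    n_instances = len(labels)
    # Iterating over `classes` keeps a slot for classes with no examples
    return [counts[c] / n_instances for c in classes]


prevs = prevalence_from_labels_sketch([0, 0, 1], classes=[0, 1, 2])
```

Class 2 never appears among the labels, yet it still gets an entry (with prevalence 0) because it is listed in classes.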
-Returns a vector of prevalence values from a matrix of posterior probabilities.
-posteriors – array-like of shape (n_instances, n_classes,) with posterior probabilities for each class
binarize – set to True (default is False) for computing the prevalence values on crisp decisions (i.e., -converting the vectors of posterior probabilities into class indices, by taking the argmax).
array of shape (n_classes,) containing the prevalence values
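Both modes can be sketched in plain Python. The function name is hypothetical, and the sketch assumes that the non-binarized mode simply averages the posteriors per class.

```python
def prevalence_from_probabilities_sketch(posteriors, binarize=False):
    n_instances = len(posteriors)
    n_classes = len(posteriors[0])
    if binarize:
        # Crisp decisions: count the argmax class of every instance
        counts = [0] * n_classes
        for row in posteriors:
            counts[max(range(n_classes), key=row.__getitem__)] += 1
        return [c / n_instances for c in counts]
    # Soft decisions: average the posterior probability of each class
    return [sum(row[j] for row in posteriors) / n_instances for j in range(n_classes)]


posteriors = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
soft = prevalence_from_probabilities_sketch(posteriors)
crisp = prevalence_from_probabilities_sketch(posteriors, binarize=True)
```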
-Produces an array of uniformly separated values of prevalence. -By default, produces an array of 21 prevalence values, with -step 0.05 and with the limits smoothed, i.e.: -[0.01, 0.05, 0.10, 0.15, …, 0.90, 0.95, 0.99]
-n_prevalences – the number of prevalence values to sample from the [0,1] interval (default 21)
repeats – number of times each prevalence is to be repeated (defaults to 1)
smooth_limits_epsilon – the quantity to add and subtract to the limits 0 and 1
an array of uniformly separated prevalence values
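A minimal sketch of the behaviour described above. The function name is hypothetical, and the sketch assumes that repeated values are placed consecutively.

```python
def prevalence_linspace_sketch(n_prevalences=21, repeats=1, smooth_limits_epsilon=0.01):
    # Equally spaced points in [0, 1]
    values = [i / (n_prevalences - 1) for i in range(n_prevalences)]
    # Move the two limits slightly inwards
    values[0] += smooth_limits_epsilon
    values[-1] -= smooth_limits_epsilon
    if repeats > 1:
        values = [v for v in values for _ in range(repeats)]
    return values


grid = prevalence_linspace_sketch()
```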
-Returns a string representation for a prevalence vector. E.g.,
->>> strprev([1/3, 2/3], prec=2)
-'[0.33, 0.67]'
-prevalences – a vector of prevalence values
prec – float precision
string
-Implements the Kraemer algorithm -for sampling uniformly at random from the unit simplex. This implementation is adapted from this -post: https://cs.stackexchange.com/questions/3227/uniform-sampling-from-a-simplex
-n_classes – integer, number of classes (dimensionality of the simplex)
size – number of samples to return
np.ndarray of shape (size, n_classes,) if size>1, or of shape (n_classes,) otherwise
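The idea can be sketched with the standard library only: draw n_classes-1 uniform cut points in [0,1], sort them, and take the gaps between consecutive cuts as the prevalence values. The function name and the seed parameter are hypothetical, and the sketch returns plain lists instead of an ndarray.

```python
import random


def uniform_simplex_sampling_sketch(n_classes, size=1, seed=0):
    rng = random.Random(seed)
    samples = []
    for _ in range(size):
        # Sorted uniform cut points split [0, 1] into n_classes segments
        cuts = sorted(rng.random() for _ in range(n_classes - 1))
        edges = [0.0] + cuts + [1.0]
        # The segment lengths form a point of the unit simplex
        samples.append([b - a for a, b in zip(edges, edges[1:])])
    return samples if size > 1 else samples[0]


points = uniform_simplex_sampling_sketch(n_classes=3, size=5)
```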
-Implements the Kraemer algorithm -for sampling uniformly at random from the unit simplex. This implementation is adapted from this -post: https://cs.stackexchange.com/questions/3227/uniform-sampling-from-a-simplex
-n_classes – integer, number of classes (dimensionality of the simplex)
size – number of samples to return
np.ndarray of shape (size, n_classes,) if size>1, or of shape (n_classes,) otherwise
-Bases: object
Bases: BaseQuantifier
Grid Search optimization targeting a quantification-oriented metric.
-Optimizes the hyperparameters of a quantification method, based on an evaluation method and on an evaluation -protocol for quantification.
-model (BaseQuantifier) – the quantifier to optimize
param_grid – a dictionary with keys the parameter names and values the list of values to explore
protocol – a sample generation protocol, an instance of quapy.protocol.AbstractProtocol
error – an error function (callable) or a string indicating the name of an error function (valid ones
-are those in quapy.error.QUANTIFICATION_ERROR)
refit – whether to refit the model on the whole labelled collection (training+validation) with -the best chosen hyperparameter combination. Ignored if protocol=’gen’
timeout – establishes a timer (in seconds) for each of the hyperparameters configurations being tested. -Whenever a run takes longer than this timer, that configuration will be ignored. If all configurations end up -being ignored, a TimeoutError exception is raised. If -1 (default) then no time bound is set.
raise_errors – boolean, if True then raises an exception when a param combination yields any error, if -otherwise is False (default), then the combination is marked with an error status, but the process goes on. -However, if no configuration yields a valid model, then a ValueError exception will be raised.
verbose – set to True to get information through the stdout
Returns the best model found after calling the fit() method, i.e., the one trained on the combination
-of hyper-parameters that minimized the error function.
a trained quantifier
-the error metric.
-training – the training set on which to optimize the hyperparameters
-self
-Returns the dictionary of hyper-parameters to explore (param_grid)
-deep – Unused
-the dictionary param_grid
-Estimate class prevalence values using the best model found after calling the fit() method.
instances – sample containing the instances
-an ndarray of shape (n_classes) with class prevalence estimates according to the best model found -by the model selection process.
-Bases: Enum
An enumeration.
-Akin to scikit-learn’s cross_val_predict -but for quantification.
-quantifier – a quantifier issuing class prevalence values
data – a labelled collection
nfolds – number of folds for k-fold cross validation generation
random_state – random seed for reproducibility
a vector of class prevalence values
-Expands a param_grid dictionary as a list of configurations. -Example:
->>> combinations = expand_grid({'A': [1, 10, 100], 'B': [True, False]})
->>> print(combinations)
-[{'A': 1, 'B': True}, {'A': 1, 'B': False}, {'A': 10, 'B': True}, {'A': 10, 'B': False}, {'A': 100, 'B': True}, {'A': 100, 'B': False}]
-param_grid – dictionary with keys representing hyper-parameter names, and values representing the range -to explore for that hyper-parameter
-a list of configurations, i.e., combinations of hyper-parameter assignments in the grid.
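The expansion amounts to a Cartesian product over the value lists. A minimal sketch, with the hypothetical name `expand_grid_sketch`, could look like this:

```python
from itertools import product


def expand_grid_sketch(param_grid):
    keys = list(param_grid.keys())
    # One configuration per element of the Cartesian product of the value lists
    return [dict(zip(keys, values)) for values in product(*param_grid.values())]


combinations = expand_grid_sketch({'A': [1, 10, 100], 'B': [True, False]})
```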
-Partitions a param_grid dictionary as two lists of configurations, one for the classifier-specific -hyper-parameters, and another for the quantifier-specific hyper-parameters
-param_grid – dictionary with keys representing hyper-parameter names, and values representing the range -to explore for that hyper-parameter
-two expanded grids of configurations, one for the classifier, another for the quantifier
-Box-plots displaying the local bias (i.e., signed error computed as the estimated value minus the true value) -for different bins of (true) prevalence of the positive class, for each quantification method.
-method_names – array-like with the method names for each experiment
true_prevs – array-like with the true prevalence values (each being a ndarray with n_classes components) for -each experiment
estim_prevs – array-like with the estimated prevalence values (each being a ndarray with n_classes components) -for each experiment
pos_class – index of the positive class
title – the title to be displayed in the plot
nbins – number of bins
colormap – the matplotlib colormap to use (default cm.tab10)
vertical_xticks – whether or not to add secondary grid (default is False)
legend – whether or not to display the legend (default is True)
savepath – path where to save the plot. If not indicated (as default), the plot is shown.
Box-plots displaying the global bias (i.e., signed error computed as the estimated value minus the true value) -for each quantification method with respect to a given positive class.
-method_names – array-like with the method names for each experiment
true_prevs – array-like with the true prevalence values (each being a ndarray with n_classes components) for -each experiment
estim_prevs – array-like with the estimated prevalence values (each being a ndarray with n_classes components) -for each experiment
pos_class – index of the positive class
title – the title to be displayed in the plot
savepath – path where to save the plot. If not indicated (as default), the plot is shown.
The diagonal plot displays the predicted prevalence values (along the y-axis) as a function of the true prevalence
-values (along the x-axis). The optimal quantifier is described by the diagonal (0,0)-(1,1) of the plot (hence the
-name). It is convenient for binary quantification problems, though it can be used for multiclass problems by
-indicating which class is to be taken as the positive class. (For multiclass quantification problems, other plots
-like the error_by_drift() might be preferable though).
method_names – array-like with the method names for each experiment
true_prevs – array-like with the true prevalence values (each being a ndarray with n_classes components) for -each experiment
estim_prevs – array-like with the estimated prevalence values (each being a ndarray with n_classes components) -for each experiment
pos_class – index of the positive class
title – the title to be displayed in the plot
show_std – whether or not to show standard deviations (represented by color bands). This might be inconvenient -for cases in which many methods are compared, or when the standard deviations are high (default True)
legend – whether or not to display the legend (default True)
train_prev – if indicated (default is None), the training prevalence (for the positive class) is highlighted -in the plot. This is convenient when all the experiments have been conducted in the same dataset.
savepath – path where to save the plot. If not indicated (as default), the plot is shown.
method_order – if indicated (default is None), imposes the order in which the methods are processed (i.e., -listed in the legend and associated with matplotlib colors).
Displays (only) the top performing methods for different regions of the train-test shift in the form of a broken -bar chart, in which each method has bars only for those regions in which either of the following conditions -holds: (i) it is the best method (on average) for the bin, or (ii) it is not statistically significantly different -from the best method (on average) according to a two-sided t-test on independent samples at confidence ttest_alpha. -The binning can be made “isometric” (same size), or “isomerous” (same number of experiments – default). A second -plot is displayed on top, showing the distribution of experiments for each bin (when binning=”isometric”) or -the percentile points of the distribution (when binning=”isomerous”).
-method_names – array-like with the method names for each experiment
true_prevs – array-like with the true prevalence values (each being a ndarray with n_classes components) for -each experiment
estim_prevs – array-like with the estimated prevalence values (each being a ndarray with n_classes components) -for each experiment
tr_prevs – training prevalence of each experiment
n_bins – number of bins in which the x-axis (the train-test shift) is to be divided (default is 20)
binning – type of binning, either “isomerous” (default) or “isometric”
x_error – a string representing the name of an error function (as defined in quapy.error) to be used for -measuring the amount of train-test shift (default is “ae”)
y_error – a string representing the name of an error function (as defined in quapy.error) to be used for -measuring the amount of error in the prevalence estimations (default is “ae”)
ttest_alpha – the confidence interval above which a p-value (two-sided t-test on independent samples) is -to be considered as an indicator that the two means are not statistically significantly different. Default is -0.005, meaning that a p-value > 0.005 indicates the two methods involved are to be considered similar
tail_density_threshold – sets a threshold on the density of experiments (over the total number of experiments) -below which a bin in the tail (i.e., the right-most ones) will be discarded. This is in order to avoid some -bins to be shown for train-test outliers.
method_order – if indicated (default is None), imposes the order in which the methods are processed (i.e., -listed in the legend and associated with matplotlib colors).
savepath – path where to save the plot. If not indicated (as default), the plot is shown.
Plots the error (along the y-axis, as measured in terms of error_name) as a function of the train-test shift
-(along the x-axis, as measured in terms of quapy.error.ae()). This plot is useful especially for multiclass
-problems, in which “diagonal plots” may be cumbersome, and in order to gain understanding about how methods
-fare in different regions of the prior probability shift spectrum (e.g., in the low-shift regime vs. in the
-high-shift regime).
method_names – array-like with the method names for each experiment
true_prevs – array-like with the true prevalence values (each being a ndarray with n_classes components) for -each experiment
estim_prevs – array-like with the estimated prevalence values (each being a ndarray with n_classes components) -for each experiment
tr_prevs – training prevalence of each experiment
n_bins – number of bins in which the x-axis (the train-test shift) is to be divided (default is 20)
error_name – a string representing the name of an error function (as defined in quapy.error, default is “ae”)
show_std – whether or not to show standard deviations as color bands (default is False)
show_density – whether or not to display the distribution of experiments for each bin (default is True)
show_legend – whether or not to display the legend of the chart (default is True)
logscale – whether or not to log-scale the y-error measure (default is False)
title – title of the plot (default is “Quantification error as a function of distribution shift”)
vlines – array-like list of values (default is None). If indicated, highlights some regions of the space -using vertical dotted lines.
method_order – if indicated (default is None), imposes the order in which the methods are processed (i.e., -listed in the legend and associated with matplotlib colors).
savepath – path where to save the plot. If not indicated (as default), the plot is shown.
Bases: AbstractStochasticSeededProtocol, OnLabelledCollectionProtocol
Implementation of the artificial prevalence protocol (APP). -The APP consists of exploring a grid of prevalence values containing n_prevalences points (e.g., -[0, 0.05, 0.1, 0.15, …, 1], if n_prevalences=21), and generating all valid combinations of -prevalence values for all classes (e.g., for 3 classes, samples with [0, 0, 1], [0, 0.05, 0.95], …, -[1, 0, 0] prevalence values of size sample_size will be yielded). The number of samples for each valid -combination of prevalence values is indicated by repeats.
-data – a LabelledCollection from which the samples will be drawn
sample_size – integer, number of instances in each sample; if None (default) then it is taken from -qp.environ[“SAMPLE_SIZE”]. If this is not set, a ValueError exception is raised.
n_prevalences – the number of equidistant prevalence points to extract from the [0,1] interval for the -grid (default is 21)
repeats – number of copies for each valid prevalence vector (default is 10)
smooth_limits_epsilon – the quantity to add and subtract to the limits 0 and 1
random_state – allows replicating samples across runs (default 0, meaning that the sequence of samples -will be the same every time the protocol is called)
sanity_check – int, raises an exception warning the user that the number of examples to be generated exceeds -this number; set to None for skipping this check
return_type – set to “sample_prev” (default) to get the pairs of (sample, prevalence) at each iteration, or -to “labelled_collection” to get instead instances of LabelledCollection
Generates vectors of prevalence values from an exhaustive grid of prevalence values. The -number of prevalence values explored for each dimension depends on n_prevalences, so that, if, for example, -n_prevalences=11 then the prevalence values of the grid are taken from [0, 0.1, 0.2, …, 0.9, 1]. Only -valid prevalence distributions are returned, i.e., vectors of prevalence values that sum up to 1. For each -valid vector of prevalence values, repeat copies are returned. The vector of prevalence values can be -implicit (by setting return_constrained_dim=False), meaning that the last dimension (which is constrained -to 1 - sum of the rest) is not returned (note that, quite obviously, in this case the vector does not sum up to -1). Note that this method is deterministic, i.e., there is no random sampling anywhere.
-a np.ndarray of shape (n, dimensions) if return_constrained_dim=True or of shape -(n, dimensions-1) if return_constrained_dim=False, where n is the number of valid combinations found -in the grid multiplied by repeat
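The grid construction described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the function name prevalence_grid_sketch and its n_classes argument are hypothetical, and smooth_limits_epsilon is ignored. It enumerates integer step counts so that the "sums up to 1" check is exact.

```python
from itertools import product


def prevalence_grid_sketch(n_classes, n_prevalences=21, repeats=1,
                           return_constrained_dim=True):
    """Enumerate all prevalence vectors on an equidistant grid that sum to 1.

    Hypothetical helper written for illustration only.
    """
    steps = n_prevalences - 1  # number of intervals in [0, 1]
    grid = []
    # Iterate over the free (unconstrained) dimensions using integer step
    # counts, which avoids floating-point issues when checking the sum.
    for counts in product(range(steps + 1), repeat=n_classes - 1):
        if sum(counts) > steps:
            continue  # the free dimensions already exceed 1: not valid
        vector = [c / steps for c in counts]
        if return_constrained_dim:
            # the last dimension is constrained to 1 - sum of the rest
            vector.append((steps - sum(counts)) / steps)
        # each valid vector is returned `repeats` times (as separate copies)
        grid.extend(list(vector) for _ in range(repeats))
    return grid


grid = prevalence_grid_sketch(n_classes=3, n_prevalences=5, repeats=2)
implicit = prevalence_grid_sketch(n_classes=3, n_prevalences=5,
                                  return_constrained_dim=False)
print(len(grid), grid[0])
```

The number of valid combinations grows combinatorially with the number of classes, which is why the sanity_check parameter exists and why a sampling-based alternative is preferable for many classes.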
-Realizes the sample given the index of the instances.
-index – indexes of the instances to select
-an instance of qp.data.LabelledCollection
Bases: object
Abstract parent class for sample generation protocols.
- - -Bases: AbstractProtocol
An AbstractStochasticSeededProtocol is a protocol that generates, via any random procedure (e.g.,
-via random sampling), sequences of quapy.data.base.LabelledCollection samples.
-The protocol abstraction enforces
-the object to be instantiated using a seed, so that the sequence can be fully replicated.
-In order to make this functionality possible, the classes extending this abstraction need to
-implement only two functions, samples_parameters() which generates all the parameters
-needed for extracting the samples, and sample() that, given some parameters as input,
-deterministically generates a sample.
random_state – the seed that allows any sequence of samples to be replicated. Default is 0, meaning that -the sequence will be consistent every time the protocol is called.
-The collator prepares the sample to accommodate the desired output format before returning the output. -This collator simply returns the sample as it is. Classes inheriting from this abstract class can -implement their custom collators.
-sample – the sample to be returned
args – additional arguments
the sample adhering to a desired output format (in this case, the sample is returned as it is)
-Bases: AbstractStochasticSeededProtocol
Generates mixtures of two domains (A and B) at controlled rates, but preserving the original class prevalence.
-domainA – one domain, an object of qp.data.LabelledCollection
domainB – another domain, an object of qp.data.LabelledCollection
sample_size – integer, the number of instances in each sample; if None (default) then it is taken from -qp.environ[“SAMPLE_SIZE”]. If this is not set, a ValueError exception is raised.
repeats – int, number of samples to draw for every mixture rate
prevalence – the prevalence to preserve along the mixtures. If specified, should be an array containing -one prevalence value (positive float) for each class and summing up to one. If not specified, the prevalence -will be taken from the domain A (default).
mixture_points – an integer indicating the number of points to take from a linear scale (e.g., 21 will -generate the mixture points [1, 0.95, 0.9, …, 0]), or the array of mixture values itself.
random_state – allows replicating samples across runs (default 0, meaning that the sequence of samples -will be the same every time the protocol is called)
Realizes the sample given a pair of indexes of the instances from A and B.
-indexes – indexes of the instances to select from A and B
-an instance of qp.data.LabelledCollection
Bases: AbstractProtocol
A very simple protocol which simply iterates over a list of previously generated samples
-samples – a list of quapy.data.base.LabelledCollection
Bases: AbstractStochasticSeededProtocol, OnLabelledCollectionProtocol
A generator of samples that implements the natural prevalence protocol (NPP). The NPP consists of drawing -samples uniformly at random, therefore approximately preserving the natural prevalence of the collection.
-data – a LabelledCollection from which the samples will be drawn
sample_size – integer, the number of instances in each sample; if None (default) then it is taken from -qp.environ[“SAMPLE_SIZE”]. If this is not set, a ValueError exception is raised.
repeats – the number of samples to generate. Default is 100.
random_state – allows replicating samples across runs (default 0, meaning that the sequence of samples -will be the same every time the protocol is called)
return_type – set to “sample_prev” (default) to get the pairs of (sample, prevalence) at each iteration, or -to “labelled_collection” to get instead instances of LabelledCollection
Realizes the sample given the index of the instances.
-index – indexes of the instances to select
-an instance of qp.data.LabelledCollection
Bases: object
Protocols that generate samples from a qp.data.LabelledCollection object.
Returns a collator function, i.e., a function that prepares the yielded data
-return_type – either ‘sample_prev’ (default) if the collator is requested to yield tuples of
-(sample, prevalence), or ‘labelled_collection’ when it is requested to yield instances of
-qp.data.LabelledCollection
the collator function (a callable function that takes as input an instance of
-qp.data.LabelledCollection)
Returns the labelled collection on which this protocol acts.
-an object of type qp.data.LabelledCollection
Returns a copy of this protocol that acts on a modified version of the original
-qp.data.LabelledCollection in which the original instances have been replaced
-with the outputs of a classifier for each instance. (This is convenient for speeding-up
-the evaluation procedures for many samples, by pre-classifying the instances in advance.)
pre_classifications – the predictions issued by a classifier, typically an array-like -with shape (n_instances,) when the classifier is a hard one, or with shape -(n_instances, n_classes) when the classifier is a probabilistic one.
in_place – whether or not to apply the modification in-place or in a new copy (default).
a copy of this protocol
-Bases: AbstractStochasticSeededProtocol, OnLabelledCollectionProtocol
A variant of APP that, instead of using a grid of equidistant prevalence values,
-relies on the Kraemer algorithm for sampling the unit (k-1)-simplex uniformly at random, with
-k the number of classes. This protocol covers the entire range of prevalence values in a
-statistical sense, i.e., unlike APP there is no guarantee that it is covered precisely
-equally for all classes, but it is preferred in cases in which the number of possible
-combinations of the grid values of APP makes this endeavour intractable.
data – a LabelledCollection from which the samples will be drawn
sample_size – integer, the number of instances in each sample; if None (default) then it is taken from -qp.environ[“SAMPLE_SIZE”]. If this is not set, a ValueError exception is raised.
repeats – the number of samples to generate. Default is 100.
random_state – allows replicating samples across runs (default 0, meaning that the sequence of samples -will be the same every time the protocol is called)
return_type – set to “sample_prev” (default) to get the pairs of (sample, prevalence) at each iteration, or -to “labelled_collection” to get instead instances of LabelledCollection
Realizes the sample given the index of the instances.
-index – indexes of the instances to select
-an instance of qp.data.LabelledCollection
Bases: object
A class implementing the early-stopping condition typically used for training neural networks.
->>> earlystop = EarlyStop(patience=2, lower_is_better=True)
->>> earlystop(0.9, epoch=0)
->>> earlystop(0.7, epoch=1)
->>> earlystop.IMPROVED # is True
->>> earlystop(1.0, epoch=2)
->>> earlystop.STOP # is False (patience=1)
->>> earlystop(1.0, epoch=3)
->>> earlystop.STOP # is True (patience=0)
->>> earlystop.best_epoch # is 1
->>> earlystop.best_score # is 0.7
-patience – the number of (consecutive) times that a monitored evaluation metric (typically obtained in a -held-out validation split) can be found to be worse than the best one obtained so far, before flagging the -stopping condition. An instance of this class is callable, and is to be used as follows:
lower_is_better – if True (default) the metric is to be minimized.
best_score – keeps track of the best value seen so far
best_epoch – keeps track of the epoch in which the best score was set
STOP – flag (boolean) indicating the stopping condition
IMPROVED – flag (boolean) indicating whether there was an improvement in the last call
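The behaviour documented for the early-stopping helper can be reproduced with a small class. This is a sketch that assumes the patience counter is reset on every improvement and decremented otherwise; the class name EarlyStopSketch and its internals are hypothetical and only mirror the documented attributes.

```python
class EarlyStopSketch:
    """Minimal early-stopping tracker mirroring the documented attributes."""

    def __init__(self, patience, lower_is_better=True):
        self.patience_limit = patience
        self.patience = patience
        self.lower_is_better = lower_is_better
        self.best_score = None
        self.best_epoch = None
        self.STOP = False
        self.IMPROVED = False

    def _is_better(self, score):
        if self.best_score is None:
            return True  # the first score is always an improvement
        if self.lower_is_better:
            return score < self.best_score
        return score > self.best_score

    def __call__(self, watch_score, epoch):
        self.IMPROVED = self._is_better(watch_score)
        if self.IMPROVED:
            self.best_score = watch_score
            self.best_epoch = epoch
            self.patience = self.patience_limit  # assumed: reset on improvement
        else:
            self.patience -= 1
            if self.patience <= 0:
                self.STOP = True


earlystop = EarlyStopSketch(patience=2, lower_is_better=True)
earlystop(0.9, epoch=0)
earlystop(0.7, epoch=1)
improved_after_epoch_1 = earlystop.IMPROVED
earlystop(1.0, epoch=2)
stop_after_epoch_2 = earlystop.STOP
earlystop(1.0, epoch=3)
```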
An alias to os.makedirs(path, exist_ok=True) that also returns the path. This is useful in cases like, e.g.:
->>> path = create_if_not_exist(os.path.join(dir, subdir, anotherdir))
-path – path to create
-the path itself
-Creates the parent dir (if any) of a given path, if not exists. E.g., for ./path/to/file.txt, the path ./path/to -is created.
-path – the path
-Downloads a file from a url
-url – the url
archive_filename – destination filename
Downloads a file (using download_file()) if the file does not exist.
url – the url
archive_filename – destination filename
Gets the home directory of QuaPy, i.e., the directory where QuaPy saves permanent data, such as downloaded datasets. -This directory is ~/quapy_data
-a string representing the path
-Applies func to n_jobs slices of args. E.g., if args is an array of 99 items and n_jobs=2, then -func is applied in two parallel processes to args[0:50] and to args[50:99]. func is a function -that already works with a list of arguments.
-func – function to be parallelized
args – array-like of arguments to be passed to the function in different parallel calls
n_jobs – the number of workers
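The slicing scheme can be illustrated as follows. This sketch assumes contiguous slices of size ceil(len(args) / n_jobs), which reproduces the args[0:50] and args[50:99] example; the helper names are hypothetical, and a thread pool from the standard library stands in for whatever parallel backend the library actually uses.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def split_into_slices(n_items, n_jobs):
    """Return at most n_jobs contiguous slices covering range(n_items)."""
    if n_items == 0:
        return []
    chunk = math.ceil(n_items / n_jobs)
    return [slice(start, min(start + chunk, n_items))
            for start in range(0, n_items, chunk)]


def map_slices_sketch(func, args, n_jobs):
    """Apply func (which already accepts a list) to each slice in parallel."""
    slices = split_into_slices(len(args), n_jobs)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        partial_results = list(pool.map(func, [args[s] for s in slices]))
    # concatenate the per-slice outputs, preserving the original order
    return [item for part in partial_results for item in part]


def square_all(values):
    return [v * v for v in values]


slices = split_into_slices(99, 2)
results = map_slices_sketch(square_all, list(range(99)), n_jobs=2)
print(slices)
```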
A wrapper of multiprocessing:
->>> Parallel(n_jobs=n_jobs)(
->>> delayed(func)(args_i) for args_i in args
->>> )
-that takes the quapy.environ variable as input silently. -Seeds the child processes to ensure reproducibility when n_jobs>1.
-func – callable
args – args of func
seed – the numeric seed
asarray – set to True to return a np.ndarray instead of a list
backend – indicates the backend used for handling parallel works
Allows for fast reuse of resources that are generated only once by calling generation_func(*args). The next times -this function is invoked, it loads the pickled resource. Example:
->>> def some_array(n): # a mock resource created with one parameter (`n`)
->>> return np.random.rand(n)
->>> pickled_resource('./my_array.pkl', some_array, 10) # the resource does not exist: it is created by calling some_array(10)
->>> pickled_resource('./my_array.pkl', some_array, 10) # the resource exists; it is loaded from './my_array.pkl'
-pickle_path – the path where to save (first time) and load (next times) the resource
generation_func – the function that generates the resource, in case it does not exist in pickle_path
args – any arg that generation_func uses for generating the resources
the resource
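The caching pattern described above amounts to "load if the pickle exists, otherwise generate and save". A minimal sketch, assuming the parent directory should be created on demand; the name pickled_resource_sketch is hypothetical and the example uses a plain list instead of a numpy array so that it runs with the standard library alone.

```python
import os
import pickle
import tempfile


def pickled_resource_sketch(pickle_path, generation_func, *args):
    """Load the resource from pickle_path, or create and store it first."""
    if os.path.exists(pickle_path):
        with open(pickle_path, 'rb') as fin:
            return pickle.load(fin)
    resource = generation_func(*args)
    parent = os.path.dirname(pickle_path)
    if parent:
        os.makedirs(parent, exist_ok=True)  # assumed: create missing dirs
    with open(pickle_path, 'wb') as fout:
        pickle.dump(resource, fout, pickle.HIGHEST_PROTOCOL)
    return resource


calls = []  # records how many times the generator was actually invoked


def some_list(n):
    calls.append(n)
    return list(range(n))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cache', 'my_list.pkl')
    first = pickled_resource_sketch(path, some_list, 10)   # generated
    second = pickled_resource_sketch(path, some_list, 10)  # loaded from disk
```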
-Saves a text file to disk, given its full path, and creates the parent directory if missing.
-path – path where to save the text.
text – text to save.
Can be used in a “with” context to set a temporary seed without modifying the outer numpy’s current state. E.g.:
->>> with temp_seed(random_seed):
->>> pass # do any computation depending on np.random functionality
-random_state – the seed to set within the “with” context
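The save, seed, and restore pattern behind such a context manager can be sketched with the standard library. Note the substitution: this illustration uses Python's random module rather than numpy's global generator, and the name temp_seed_sketch is hypothetical.

```python
import random
from contextlib import contextmanager


@contextmanager
def temp_seed_sketch(random_state):
    """Seed `random` inside the block, then restore the outer state."""
    outer_state = random.getstate()
    random.seed(random_state)
    try:
        yield
    finally:
        random.setstate(outer_state)  # restored even if the block raises


# Reference: what the outer generator would produce without interference.
random.seed(123)
expected_next = random.random()

random.seed(123)
with temp_seed_sketch(0):
    a = random.random()
after = random.random()  # the outer sequence continues undisturbed

with temp_seed_sketch(0):
    b = random.random()  # same seed, hence the same draw as `a`
```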
-Opens a context that will raise an exception if not closed after a given number of seconds
->>> def func(start_msg, end_msg):
->>> print(start_msg)
->>> sleep(2)
->>> print(end_msg)
->>>
->>> with timeout(1):
->>> func('begin function', 'end function')
->>> Out[]
->>> begin function
->>> TimeoutError
-seconds – number of seconds, set to <=0 to ignore the timer
-QuaPy module for quantification
-Bases: AggregativeCrispQuantifier
Adjusted Classify & Count,
-the “adjusted” variant of CC, that corrects the predictions of CC
-according to the misclassification rates.
classifier – a sklearn’s Estimator that generates a classifier
val_split – specifies the data used for generating classifier predictions. This specification -can be made as float in (0, 1) indicating the proportion of stratified held-out validation set to -be extracted from the training set; or as an integer (default 5), indicating that the predictions -are to be generated in a k-fold cross-validation manner (with this integer indicating the value -for k); or as a collection defining the specific set of data to use for validation. -Alternatively, this set can be specified at fit time by indicating the exact set of data -on which the predictions are to be generated.
n_jobs – number of parallel workers
solver – indicates the method to be used for obtaining the final estimates. The choice -‘exact’ comes down to solving the system of linear equations \(Ax=B\) where A is a -matrix containing the class-conditional probabilities of the predictions (e.g., the tpr and fpr in -binary) and B is the vector of prevalence values estimated via CC, as \(x=A^{-1}B\). This solution -might not exist for degenerate classifiers, in which case the method defaults to classify and count -(i.e., does not attempt any adjustment). -Another option is to search for the prevalence vector that minimizes the L2 norm of \(|Ax-B|\). The latter -is achieved by indicating solver=’minimize’. This one generally works better, and is the default parameter. -More details about this can be consulted in Bunse, M. “On Multi-Class Extensions of Adjusted Classify and -Count”, in Proceedings of the 2nd International Workshop on Learning to Quantify: Methods and Applications -(LQ 2022), ECML/PKDD 2022, Grenoble (France).
Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Estimates the misclassification rates.
-classif_predictions – classifier predictions with true labels
-Solves the linear system \(Ax = B\) with \(A\) = PteCondEstim and \(B\) = prevs_estim
-PteCondEstim – a np.ndarray of shape (n_classes,n_classes,) with entry (i,j) being the estimate -of \(P(y_i|y_j)\), that is, the probability that an instance that belongs to \(y_j\) ends up being -classified as belonging to \(y_i\)
prevs_estim – a np.ndarray of shape (n_classes,) with the class prevalence estimates
solver – indicates the method to use for solving the system of linear equations. Valid options are -‘exact’ (tries to solve the system; may fail if the misclassification matrix has rank < n_classes) or -‘optim_minimize’ (minimizes a norm; a solution always exists).
an adjusted np.ndarray of shape (n_classes,) with the corrected class prevalence estimates
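In the binary case, the ‘exact’ adjustment has a closed form: since the CC estimate satisfies cc = tpr·p + fpr·(1 − p), the corrected prevalence is p = (cc − fpr) / (tpr − fpr). The following is a minimal sketch of that case only; the function name is hypothetical, the fallback to the CC estimate follows the behaviour described for degenerate classifiers, and the final clipping to [0, 1] is an assumption added here for illustration.

```python
def solve_adjustment_binary_sketch(tpr, fpr, cc_positive):
    """Return [negative, positive] prevalence after the 'exact' correction."""
    denominator = tpr - fpr
    if abs(denominator) < 1e-12:
        # degenerate classifier: the system has no unique solution,
        # so fall back to the unadjusted classify-and-count estimate
        return [1.0 - cc_positive, cc_positive]
    positive = (cc_positive - fpr) / denominator
    positive = min(1.0, max(0.0, positive))  # assumed clipping step
    return [1.0 - positive, positive]


# A sample with true positive prevalence 0.3, seen through a classifier
# with tpr=0.8 and fpr=0.1, yields a CC estimate of 0.8*0.3 + 0.1*0.7.
cc_estimate = 0.8 * 0.3 + 0.1 * 0.7
adjusted = solve_adjustment_binary_sketch(tpr=0.8, fpr=0.1,
                                          cc_positive=cc_estimate)
degenerate = solve_adjustment_binary_sketch(tpr=0.5, fpr=0.5,
                                            cc_positive=0.4)
print(adjusted)
```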
-Bases: AggregativeQuantifier, ABC
Abstract class for quantification methods that base their estimations on the aggregation of crisp decisions -as returned by a hard classifier. Aggregative crisp quantifiers thus extend Aggregative -Quantifiers by implementing specifications about crisp predictions.
-Bases: BinaryQuantifier
This method is a meta-quantifier that returns, as the estimated class prevalence values, the median of the -estimation returned by differently (hyper)parameterized base quantifiers. -The median of unit-vectors is only guaranteed to be a unit-vector for n=2 dimensions, -i.e., in cases of binary quantification.
-base_quantifier – the base, binary quantifier
random_state – a seed to be set before fitting any base quantifier (default None)
param_grid – the grid of parameters over which the median will be computed
n_jobs – number of parallel workers
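The reason the median works in the binary case can be seen in a few lines: for vectors of the form [1 − p, p], the median of the first component equals one minus the median of the second, so the result is still a valid prevalence vector. The sketch below is an illustration only; the function name is hypothetical and the estimates are made-up numbers standing in for the outputs of differently parameterized base quantifiers.

```python
from statistics import median


def median_estimate_sketch(estimates):
    """Combine binary prevalence estimates [neg, pos] via the median."""
    positive = median(est[1] for est in estimates)
    return [1.0 - positive, positive]


# Illustrative estimates from three hypothetical configurations.
estimates = [[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]]
combined = median_estimate_sketch(estimates)
print(combined)
```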
Trains a quantifier.
-data – a quapy.data.base.LabelledCollection consisting of the training data
self
-Get parameters for this estimator.
-deep (bool, default=True) – If True, will return the parameters for this estimator and -contained subobjects that are estimators.
-params – Parameter names mapped to their values.
-dict
-Generate class prevalence estimates for the sample’s instances
-instances – array-like
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Set the parameters of this estimator.
-The method works on simple estimators as well as on nested objects
-(such as Pipeline). The latter have
-parameters of the form <component>__<parameter> so that it’s
-possible to update each component of a nested object.
**params (dict) – Estimator parameters.
-self – Estimator instance.
-estimator instance
-Bases: BaseQuantifier, ABC
Abstract class for quantification methods that base their estimations on the aggregation of classification
-results. Aggregative quantifiers implement a pipeline that consists of generating classification predictions
-and aggregating them. For this reason, the training phase is implemented by classification_fit() followed
-by aggregation_fit(), while the testing phase is implemented by classify() followed by
-aggregate(). Subclasses of this abstract class must provide implementations for these methods.
-Aggregative quantifiers also maintain a classifier attribute.
The method fit() comes with a default implementation based on classification_fit()
-and aggregation_fit().
The method quantify() comes with a default implementation based on classify()
-and aggregate().
Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Trains the aggregation function.
-classif_predictions – a LabelledCollection containing the label predictions issued -by the classifier
data – a quapy.data.base.LabelledCollection consisting of the training data
Class labels, in the same order in which class prevalence values are to be computed. -This default implementation actually returns the class labels of the learner.
-array-like
-Gives access to the classifier
-the classifier (typically an sklearn’s Estimator)
-Trains the classifier if requested (fit_classifier=True) and generates the necessary predictions to -train the aggregation function.
-data – a quapy.data.base.LabelledCollection consisting of the training data
fit_classifier – whether to train the learner (default is True). Set to False if the -learner has been trained outside the quantifier.
predict_on – specifies the set on which predictions need to be issued. This parameter can -be specified as None (default) to indicate no prediction is needed; a float in (0, 1) to -indicate the proportion of instances to be used for predictions (the remainder is used for -training); an integer >1 to indicate that the predictions must be generated via k-fold -cross-validation, using this integer as k; or the data sample itself on which to generate -the predictions.
Provides the label predictions for the given instances. The predictions should respect the format expected by
-aggregate(), e.g., posterior probabilities for probabilistic quantifiers, or crisp predictions for
-non-probabilistic quantifiers. The default one is “decision_function”.
instances – array-like of shape (n_instances, n_features,)
-np.ndarray of shape (n_instances,) with label predictions
-Trains the aggregative quantifier. This comes down to training a classifier and an aggregation function.
-data – a quapy.data.base.LabelledCollection consisting of the training data
fit_classifier – whether to train the learner (default is True). Set to False if the -learner has been trained outside the quantifier.
self
-Generate class prevalence estimates for the sample’s instances by aggregating the label predictions generated -by the classifier.
-instances – array-like
-np.ndarray of shape (n_classes) with class prevalence estimates.
-Bases: AggregativeQuantifier, ABC
Abstract class for quantification methods that base their estimations on the aggregation of posterior -probabilities as returned by a probabilistic classifier. -Aggregative soft quantifiers thus extend Aggregative Quantifiers by implementing specifications -about soft predictions.
-Bases: AggregativeQuantifier, BinaryQuantifier
Trains the aggregative quantifier. This comes down to training a classifier and an aggregation function.
-data – a quapy.data.base.LabelledCollection consisting of the training data
fit_classifier – whether to train the learner (default is True). Set to False if the -learner has been trained outside the quantifier.
self
-Bases: AggregativeCrispQuantifier
The most basic Quantification method. One that simply classifies all instances and counts how many have been -attributed to each of the classes in order to compute class prevalence estimates.
-classifier – a sklearn’s Estimator that generates a classifier
-Computes class prevalence estimates by counting the prevalence of each of the predicted labels.
-classif_predictions – array-like with label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Nothing to do here!
-classif_predictions – this is actually None
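Classify & Count reduces to counting label frequencies among the predictions. A minimal sketch, assuming the predictions are already available as a list of labels; the function name and the explicit classes argument are hypothetical.

```python
from collections import Counter


def classify_and_count_sketch(predicted_labels, classes):
    """Return the fraction of predictions assigned to each class."""
    counts = Counter(predicted_labels)
    total = len(predicted_labels)
    # classes that were never predicted get prevalence 0
    return [counts[c] / total for c in classes]


predictions = ['a', 'b', 'a', 'a']
prevalence = classify_and_count_sketch(predictions, classes=['a', 'b', 'c'])
print(prevalence)  # [0.75, 0.25, 0.0]
```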
-Bases: AggregativeSoftQuantifier
Generic Distribution Matching quantifier for binary or multiclass quantification based on the space of posterior -probabilities. This implementation takes the number of bins, the divergence, and the possibility to work on CDF -as hyperparameters.
-classifier – a sklearn’s Estimator that generates a probabilistic classifier
val_split – indicates the proportion of data to be used as a stratified held-out validation set to model the
-validation distribution.
-This parameter can be indicated as a real value (between 0 and 1), representing a proportion of
-validation data, or as an integer, indicating that the validation distribution should be estimated via
-k-fold cross validation (this integer stands for the number of folds k, defaults 5), or as a
-quapy.data.base.LabelledCollection (the split itself).
nbins – number of bins used to discretize the distributions (default 8)
divergence – a string representing a divergence measure (currently, “HD” and “topsoe” are implemented) -or a callable function taking two ndarrays of the same dimension as input (default “HD”, meaning Hellinger -Distance)
cdf – whether to use CDF instead of PDF (default False)
n_jobs – number of parallel workers (default None)
Searches for the mixture model parameter (the sought prevalence values) that yields a validation distribution -(the mixture) that best matches the test distribution, in terms of the divergence measure of choice. -In the multiclass case, with n the number of classes, the test and mixture distributions contain -n channels (proper distributions of binned posterior probabilities), on which the divergence is computed -independently. The matching is computed as an average of the divergence across all channels.
-posteriors – posterior probabilities of the instances in the sample
-a vector of class prevalence estimates
-Trains the classifier (if requested) and generates the validation distributions out of the training data. -The validation distributions have shape (n, ch, nbins), with n the number of classes, ch the number of -channels, and nbins the number of bins. In particular, let V be the validation distributions; then di=V[i] -are the distributions obtained from training data labelled with class i; while dij = di[j] is the discrete -distribution of posterior probabilities P(Y=j|X=x) for training data labelled with class i, and dij[k] -is the fraction of instances with a value in the k-th bin.
-data – the training set
fit_classifier – set to False to bypass the training (the learner is assumed to be already fit)
val_split – either a float in (0,1) indicating the proportion of training instances to use for -validation (e.g., 0.3 for using 30% of the training set as validation data), or a LabelledCollection -indicating the validation set itself, or an int indicating the number k of folds to be used in kFCV -to estimate the parameters
Bases: AggregativeSoftQuantifier, BinaryAggregativeQuantifier
DyS framework (DyS). -DyS is a generalization of the HDy method, using a ternary search to find the prevalence that -minimizes the distance between distributions. -Details of the ternary search are taken from <https://dl.acm.org/doi/pdf/10.1145/3219819.3220059>
-classifier – a sklearn’s Estimator that generates a binary classifier
val_split – a float in range (0,1) indicating the proportion of data to be used as a stratified held-out
-validation distribution, or a quapy.data.base.LabelledCollection (the split itself), or an integer indicating the number of folds (default 5).
n_bins – an int with the number of bins to use to compute the histograms.
divergence – a str indicating the name of divergence (currently supported ones are “HD” or “topsoe”), or a -callable function that computes the divergence between two distributions (two equally sized arrays).
tol – a float with the tolerance for the ternary search algorithm.
n_jobs – number of parallel workers.
Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Trains the aggregation function.
-classif_predictions – a LabelledCollection containing the label predictions issued -by the classifier
data – a quapy.data.base.LabelledCollection consisting of the training data
Bases: AggregativeSoftQuantifier
Expectation Maximization for Quantification (EMQ), -aka Saerens-Latinne-Decaestecker (SLD) algorithm. -EMQ consists of using the well-known Expectation Maximization algorithm to iteratively update the posterior -probabilities generated by a probabilistic classifier and the class prevalence estimates obtained via -maximum-likelihood estimation, in a mutually recursive way, until convergence.
-This implementation also gives access to the heuristics proposed in the Alexandari et al. paper. These heuristics consist of using, as the training -prevalence, an estimate of it obtained via k-fold cross validation (instead of the true training prevalence), -and of recalibrating the posterior probabilities of the classifier.
-classifier – a sklearn’s Estimator that generates a classifier
val_split – specifies the data used for generating classifier predictions. This specification -can be made as float in (0, 1) indicating the proportion of stratified held-out validation set to -be extracted from the training set; or as an integer, indicating that the predictions -are to be generated in a k-fold cross-validation manner (with this integer indicating the value -for k, default 5); or as a collection defining the specific set of data to use for validation. -Alternatively, this set can be specified at fit time by indicating the exact set of data -on which the predictions are to be generated. This hyperparameter is only meant to be used when the -heuristics are to be applied, i.e., if a recalibration is required. The default value is None (meaning -the recalibration is not required). In case this hyperparameter is set to a value other than None, but -the recalibration is not required (recalib=None), a warning message will be raised.
exact_train_prev – set to True (default) for using the true training prevalence as the initial observation; -set to False for computing the training prevalence as an estimate of it, i.e., as the expected -value of the posterior probabilities of the training instances.
recalib – a string indicating the method of recalibration. -Available choices include “nbvs” (No-Bias Vector Scaling), “bcts” (Bias-Corrected Temperature Scaling), -“ts” (Temperature Scaling), and “vs” (Vector Scaling). Default is None (no recalibration).
n_jobs – number of parallel workers. Only used for recalibrating the classifier if val_split is set to -an integer k –the number of folds.
Computes the Expectation Maximization routine.
-tr_prev – array-like, the training prevalence
posterior_probabilities – np.ndarray of shape (n_instances, n_classes,) with the -posterior probabilities
epsilon – float, the threshold difference between two consecutive iterations -to reach before stopping the loop
a tuple with the estimated prevalence values (shape (n_classes,)) and -the corrected posterior probabilities (shape (n_instances, n_classes,))
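The mutually recursive update can be sketched in pure Python. This is an illustration under stated assumptions rather than the library's routine: the function name em_sketch and the max_iter cap are hypothetical, convergence is measured as the mean absolute difference between consecutive prevalence estimates, and all training prevalences are assumed to be strictly positive.

```python
def em_sketch(tr_prev, posteriors, epsilon=1e-4, max_iter=1000):
    """Return (prevalence estimate, corrected posteriors)."""
    n_classes = len(tr_prev)
    n_instances = len(posteriors)
    prev = list(tr_prev)  # start from the training prevalence
    corrected = [list(row) for row in posteriors]
    for _ in range(max_iter):
        # E-step: rescale the original posteriors by the ratio between the
        # current prevalence estimate and the training prevalence
        corrected = []
        for row in posteriors:
            scaled = [p * prev[c] / tr_prev[c] for c, p in enumerate(row)]
            norm = sum(scaled)
            corrected.append([s / norm for s in scaled])
        # M-step: the new prevalence is the mean of the corrected posteriors
        new_prev = [sum(row[c] for row in corrected) / n_instances
                    for c in range(n_classes)]
        change = sum(abs(a - b) for a, b in zip(new_prev, prev)) / n_classes
        prev = new_prev
        if change < epsilon:
            break
    return prev, corrected


# Toy sample: 7 instances leaning towards class 1, 3 towards class 0.
sample = [[0.2, 0.8]] * 7 + [[0.8, 0.2]] * 3
prev, corrected = em_sketch([0.5, 0.5], sample, epsilon=1e-6)
print(prev)
```

In this toy example the plain average of the posteriors gives 0.62 for class 1, and the EM iterations move the estimate further in the same direction.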
-Constructs an instance of EMQ using the best configuration found in the Alexandari et al. paper, i.e., one that relies on Bias-Corrected Temperature -Scaling (BCTS) as a recalibration function, and that uses an estimate of the training prevalence instead of -the true training prevalence.
-classifier – a sklearn’s Estimator that generates a classifier
n_jobs – number of parallel workers.
An instance of EMQ with BCTS
-Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Trains the aggregation function.
-classif_predictions – a LabelledCollection containing the label predictions issued -by the classifier
data – a quapy.data.base.LabelledCollection consisting of the training data
Provides the posterior probabilities for the given instances. If the classifier was required -to be recalibrated, then these posteriors are recalibrated accordingly.
-instances – array-like of shape (n_instances, n_dimensions,)
-np.ndarray of shape (n_instances, n_classes,) with posterior probabilities
-Bases: AggregativeSoftQuantifier, BinaryAggregativeQuantifier
Hellinger Distance y (HDy). -HDy is a probabilistic method for training binary quantifiers, that models quantification as the problem of -minimizing the divergence (in terms of the Hellinger Distance) between two distributions of posterior -probabilities returned by the classifier. One of the distributions is generated from the unlabelled examples and -the other is generated from a validation set. This latter distribution is defined as a mixture of the -class-conditional distributions of the posterior probabilities returned for the positive and negative validation -examples, respectively. The parameters of the mixture thus represent the estimates of the class prevalence values.
-classifier – a sklearn’s Estimator that generates a binary classifier
val_split – a float in range (0,1) indicating the proportion of data to be used as a stratified held-out
-validation distribution, or a quapy.data.base.LabelledCollection (the split itself), or an integer indicating the number of folds (default 5).
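The matching step of HDy can be illustrated on pre-computed histograms. This sketch makes several assumptions: the histograms of posterior probabilities are given as normalized lists, a plain grid search over the mixture weight stands in for whatever search the implementation uses, and the Hellinger distance is written without the 1/√2 normalization (which does not affect the minimizer). All names are hypothetical.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance (unnormalized) between two discrete distributions."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q)))


def hdy_search_sketch(pos_hist, neg_hist, test_hist, n_points=101):
    """Find the mixture weight whose mixture best matches test_hist."""
    best_prev, best_dist = 0.0, float('inf')
    for i in range(n_points):
        alpha = i / (n_points - 1)  # candidate positive prevalence
        mixture = [alpha * p + (1 - alpha) * n
                   for p, n in zip(pos_hist, neg_hist)]
        dist = hellinger_distance(mixture, test_hist)
        if dist < best_dist:
            best_prev, best_dist = alpha, dist
    return [1.0 - best_prev, best_prev]


# Validation histograms for positives and negatives (illustrative values),
# and a test histogram built as a 30%/70% mixture of the two.
pos_hist = [0.1, 0.2, 0.7]
neg_hist = [0.6, 0.3, 0.1]
test_hist = [0.3 * p + 0.7 * n for p, n in zip(pos_hist, neg_hist)]
estimate = hdy_search_sketch(pos_hist, neg_hist, test_hist)
print(estimate)
```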
Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Trains a HDy quantifier.
-data – the training set
fit_classifier – set to False to bypass the training (the learner is assumed to be already fit)
val_split – either a float in (0,1) indicating the proportion of training instances to use for
-validation (e.g., 0.3 for using 30% of the training set as validation data), or a
-quapy.data.base.LabelledCollection indicating the validation set itself
self
-Bases: OneVsAllGeneric, AggregativeQuantifier
Allows any binary quantifier to perform quantification on single-label datasets.
-The method maintains one binary quantifier for each class, and then l1-normalizes the outputs so that the
-class prevalences sum up to 1.
-This variant was used, along with the EMQ quantifier, in
-Gao and Sebastiani, 2016.
binary_quantifier – a quantifier (binary) that will be employed to work on multiclass model in a -one-vs-all manner
n_jobs – number of parallel workers
parallel_backend – the parallel backend for joblib (default “loky”); this is helpful for some quantifiers -(e.g., ELM-based ones) that cannot be run with multiprocessing, since the temp dir they create during fit will -be removed and no longer available at predict time.
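The final l1-normalization of the one-vs-all outputs is straightforward to sketch. The input is assumed to be the list of positive-class prevalences returned by the per-class binary quantifiers; the function name is hypothetical, and falling back to a uniform vector when all outputs are zero is an assumption made here to avoid a division by zero.

```python
def one_vs_all_normalize_sketch(positive_prevalences):
    """l1-normalize per-class binary estimates so that they sum to 1."""
    total = sum(positive_prevalences)
    n_classes = len(positive_prevalences)
    if total == 0:
        # assumed fallback: no class received any mass
        return [1.0 / n_classes] * n_classes
    return [p / total for p in positive_prevalences]


# Each binary quantifier estimates "class i vs the rest" independently,
# so the raw outputs need not sum to 1.
raw = [0.2, 0.3, 0.1]
normalized = one_vs_all_normalize_sketch(raw)
print(normalized)
```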
Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-If the base quantifier is not probabilistic, returns a matrix of shape (n,m,) with n the number of -instances and m the number of classes. The entry (i,j) is a binary value indicating whether instance -i belongs to class j. The binary classifications are independent of each other, meaning that an instance -can end up being attributed to 0, 1, or more classes. -If the base quantifier is probabilistic, returns a matrix of shape (n,m,2) with n the number of instances -and m the number of classes. The entry (i,j,1) (resp. (i,j,0)) is a value in [0,1] indicating the -posterior probability that instance i belongs (resp. does not belong) to class j. The posterior -probabilities are independent of each other, meaning that, in general, they do not sum up to one.
-instances – array-like
-np.ndarray
-Bases: AggregativeSoftQuantifier
Probabilistic Adjusted Classify & Count, -the probabilistic variant of ACC that relies on the posterior probabilities returned by a probabilistic classifier.
-classifier – a sklearn’s Estimator that generates a classifier
val_split – specifies the data used for generating classifier predictions. This specification -can be made as float in (0, 1) indicating the proportion of stratified held-out validation set to -be extracted from the training set; or as an integer (default 5), indicating that the predictions -are to be generated in a k-fold cross-validation manner (with this integer indicating the value -for k). Alternatively, this set can be specified at fit time by indicating the exact set of data -on which the predictions are to be generated.
n_jobs – number of parallel workers
solver –
indicates the method to be used for obtaining the final estimates. The choice -‘exact’ comes down to solving the system of linear equations \(Ax=B\) where A is a -matrix containing the class-conditional probabilities of the predictions (e.g., the tpr and fpr in -binary) and B is the vector of prevalence values estimated via CC, as \(x=A^{-1}B\). This solution -might not exist for degenerate classifiers, in which case the method defaults to classify and count -(i.e., does not attempt any adjustment). -Another option is to search for the prevalence vector that minimizes the L2 norm of \(|Ax-B|\). The latter -is achieved by indicating solver=’minimize’. This one generally works better, and is the default parameter. -More details about this can be consulted in Bunse, M. “On Multi-Class Extensions of Adjusted Classify and -Count”, in Proceedings of the 2nd International Workshop on Learning to Quantify: Methods and Applications -(LQ 2022), ECML/PKDD 2022, Grenoble (France).
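In the binary case, the ‘exact’ solution of \(Ax=B\) reduces to the familiar adjustment \((\hat{p} - fpr)/(tpr - fpr)\). The following is a minimal sketch of that special case, including the fallback to the unadjusted estimate when the system has no solution. The function name `adjust_binary` and the clipping to [0, 1] are assumptions made for illustration; this is not the library's own code.

```python
def adjust_binary(raw_prev, tpr, fpr):
    """Binary 'exact' adjustment of an unadjusted positive-class estimate.

    Solves raw_prev = tpr * x + fpr * (1 - x) for x.
    """
    if tpr == fpr:
        # Degenerate classifier: A is singular, so no adjustment is attempted.
        return raw_prev
    adjusted = (raw_prev - fpr) / (tpr - fpr)
    # Assumption: clip to the valid range, since the raw solution of the
    # linear system may fall outside [0, 1].
    return min(1.0, max(0.0, adjusted))


# Hypothetical misclassification rates estimated on validation data.
estimate = adjust_binary(raw_prev=0.5, tpr=0.8, fpr=0.2)
```

The ‘minimize’ alternative instead searches the probability simplex for the vector with the smallest residual, which always yields a valid prevalence vector.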
-Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Estimates the misclassification rates
-classif_predictions – classifier soft predictions with true labels
-Bases: AggregativeSoftQuantifier
Probabilistic Classify & Count, -the probabilistic variant of CC that relies on the posterior probabilities returned by a probabilistic classifier.
-classifier – a sklearn’s Estimator that generates a classifier
-Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Nothing to do here!
-classif_predictions – this is actually None
-Bases: AggregativeSoftQuantifier, BinaryAggregativeQuantifier
SMM method (SMM). -SMM is a simplification of matching distribution methods where the representation of the examples -is created using the mean instead of a histogram (conceptually equivalent to PACC).
-classifier – a sklearn’s Estimator that generates a binary classifier.
val_split – a float in range (0,1) indicating the proportion of data to be used as a stratified held-out
-validation distribution, or a quapy.data.base.LabelledCollection (the split itself), or an integer indicating the number of folds (default 5).
Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Trains the aggregation function.
-classif_predictions – a LabelledCollection containing the label predictions issued -by the classifier
data – a quapy.data.base.LabelledCollection consisting of the training data
Explicit Loss Minimization (ELM) quantifiers. -Quantifiers based on ELM represent a family of methods based on structured output learning; -these quantifiers rely on classifiers that have been optimized using a quantification-oriented loss -measure. This implementation relies on -Joachims’ SVM perf structured output -learning algorithm, which has to be installed and patched for the purpose (see this -script). -This function is equivalent to:
->>> CC(SVMperf(svmperf_base, loss, C))
-svmperf_base – path to the folder containing the binary files of SVM perf; if set to None (default) -this path will be obtained from qp.environ[‘SVMPERF_HOME’]
loss – the loss to optimize (see quapy.classification.svmperf.SVMperf.valid_losses)
C – trade-off between training error and margin (default 0.01)
returns an instance of CC set to work with SVMperf (with loss and C set properly) as the -underlying classifier
-SVM(AE) is an Explicit Loss Minimization (ELM) quantifier set to optimize for the Absolute Error as first used by -Moreo and Sebastiani, 2021. -Equivalent to:
->>> CC(SVMperf(svmperf_base, loss='mae', C=C))
-Quantifiers based on ELM represent a family of methods based on structured output learning; -these quantifiers rely on classifiers that have been optimized using a quantification-oriented loss -measure. This implementation relies on -Joachims’ SVM perf structured output -learning algorithm, which has to be installed and patched for the purpose (see this -script). -This function is a wrapper around CC(SVMperf(svmperf_base, loss, C))
-svmperf_base – path to the folder containing the binary files of SVM perf; if set to None (default) -this path will be obtained from qp.environ[‘SVMPERF_HOME’]
C – trade-off between training error and margin (default 0.01)
returns an instance of CC set to work with SVMperf (with loss and C set properly) as the -underlying classifier
-SVM(KLD) is an Explicit Loss Minimization (ELM) quantifier set to optimize for the Kullback-Leibler Divergence -normalized via the logistic function, as proposed by -Esuli et al. 2015. -Equivalent to:
->>> CC(SVMperf(svmperf_base, loss='nkld', C=C))
-Quantifiers based on ELM represent a family of methods based on structured output learning; -these quantifiers rely on classifiers that have been optimized using a quantification-oriented loss -measure. This implementation relies on -Joachims’ SVM perf structured output -learning algorithm, which has to be installed and patched for the purpose (see this -script). -This function is a wrapper around CC(SVMperf(svmperf_base, loss, C))
-svmperf_base – path to the folder containing the binary files of SVM perf; if set to None (default) -this path will be obtained from qp.environ[‘SVMPERF_HOME’]
C – trade-off between training error and margin (default 0.01)
returns an instance of CC set to work with SVMperf (with loss and C set properly) as the -underlying classifier
-SVM(Q) is an Explicit Loss Minimization (ELM) quantifier set to optimize for the Q loss combining a -classification-oriented loss and a quantification-oriented loss, as proposed by -Barranquero et al. 2015. -Equivalent to:
->>> CC(SVMperf(svmperf_base, loss='q', C=C))
-Quantifiers based on ELM represent a family of methods based on structured output learning; -these quantifiers rely on classifiers that have been optimized using a quantification-oriented loss -measure. This implementation relies on -Joachims’ SVM perf structured output -learning algorithm, which has to be installed and patched for the purpose (see this -script). -This function is a wrapper around CC(SVMperf(svmperf_base, loss, C))
-svmperf_base – path to the folder containing the binary files of SVM perf; if set to None (default) -this path will be obtained from qp.environ[‘SVMPERF_HOME’]
C – trade-off between training error and margin (default 0.01)
returns an instance of CC set to work with SVMperf (with loss and C set properly) as the -underlying classifier
-SVM(RAE) is an Explicit Loss Minimization (ELM) quantifier set to optimize for the Relative Absolute Error as first -used by Moreo and Sebastiani, 2021. -Equivalent to:
->>> CC(SVMperf(svmperf_base, loss='mrae', C=C))
-Quantifiers based on ELM represent a family of methods based on structured output learning; -these quantifiers rely on classifiers that have been optimized using a quantification-oriented loss -measure. This implementation relies on -Joachims’ SVM perf structured output -learning algorithm, which has to be installed and patched for the purpose (see this -script). -This function is a wrapper around CC(SVMperf(svmperf_base, loss, C))
-svmperf_base – path to the folder containing the binary files of SVM perf; if set to None (default) -this path will be obtained from qp.environ[‘SVMPERF_HOME’]
C – trade-off between training error and margin (default 0.01)
returns an instance of CC set to work with SVMperf (with loss and C set properly) as the -underlying classifier
-Bases: object
Common ancestor for KDE-based methods. Implements some common routines.
-Wraps the KDE function from scikit-learn.
-X – data for which the density function is to be estimated
bandwidth – the bandwidth of the kernel
a scikit-learn’s KernelDensity object
-Returns an array containing the mixture components, i.e., the KDE functions for each class.
-X – the data containing the covariates
y – the class labels
n_classes – integer, the number of classes
bandwidth – float, the bandwidth of the kernel
a list of KernelDensity objects, each fitted with the corresponding class-specific covariates
-Wraps the density evaluation of scikit-learn’s KDE. Scikit-learn returns log-scores (s), so this -function returns \(e^{s}\)
-kde – a previously fit KDE function
X – the data for which the density is to be estimated
np.ndarray with the densities
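The relation between log-scores and densities can be illustrated without scikit-learn. The sketch below is a standalone one-dimensional Gaussian KDE, written only to show the idea of computing a log-score \(s\) and returning \(e^{s}\); the function names and the normalization by \(1/(nh)\) are assumptions of this example, not the wrapped scikit-learn implementation.

```python
import math


def log_density(x, centers, bandwidth):
    """Log-score of a 1-D Gaussian KDE at x, computed with log-sum-exp."""
    log_norm = -math.log(len(centers) * bandwidth * math.sqrt(2 * math.pi))
    exponents = [-0.5 * ((x - c) / bandwidth) ** 2 for c in centers]
    largest = max(exponents)
    # Subtracting the largest exponent avoids underflow for distant centers.
    return log_norm + largest + math.log(
        sum(math.exp(e - largest) for e in exponents)
    )


def density(x, centers, bandwidth):
    """Density value e^s obtained from the log-score s."""
    return math.exp(log_density(x, centers, bandwidth))


# With a single center and unit bandwidth, this is the standard normal pdf.
peak = density(0.0, centers=[0.0], bandwidth=1.0)
```

Working in log-space and exponentiating at the end keeps the computation stable when some kernel centers are far from the evaluation point.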
-Bases: AggregativeSoftQuantifier
Kernel Density Estimation model for quantification (KDEy) relying on the Cauchy-Schwarz divergence (CS) as -the divergence measure to be minimized. This method was first proposed in the paper -Kernel Density Estimation for Multiclass Quantification, in which -the authors proposed a Monte Carlo approach for minimizing the divergence.
-The distribution matching optimization problem comes down to solving:
-\(\hat{\alpha} = \arg\min_{\alpha\in\Delta^{n-1}} \mathcal{D}(\boldsymbol{p}_{\alpha}||q_{\widetilde{U}})\)
-where \(p_{\alpha}\) is the mixture of class-specific KDEs with mixture parameter (hence class prevalence) -\(\alpha\) defined by
-\(\boldsymbol{p}_{\alpha}(\widetilde{x}) = \sum_{i=1}^n \alpha_i p_{\widetilde{L}_i}(\widetilde{x})\)
-where \(p_X(\boldsymbol{x}) = \frac{1}{|X|} \sum_{x_i\in X} K\left(\frac{x-x_i}{h}\right)\) is the -KDE function that uses the datapoints in X as the kernel centers.
-In KDEy-CS, the divergence is taken to be the Cauchy-Schwarz divergence given by:
-\(\mathcal{D}_{\mathrm{CS}}(p||q)=-\log\left(\frac{\int p(x)q(x)dx}{\sqrt{\int p(x)^2dx \int q(x)^2dx}}\right)\)
-The authors showed that this distribution matching admits a closed-form solution
-classifier – a sklearn’s Estimator that generates a binary classifier.
val_split – specifies the data used for generating classifier predictions. This specification -can be made as float in (0, 1) indicating the proportion of stratified held-out validation set to -be extracted from the training set; or as an integer (default 5), indicating that the predictions -are to be generated in a k-fold cross-validation manner (with this integer indicating the value -for k); or as a collection defining the specific set of data to use for validation. -Alternatively, this set can be specified at fit time by indicating the exact set of data -on which the predictions are to be generated.
bandwidth – float, the bandwidth of the Kernel
n_jobs – number of parallel workers
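As a rough illustration of the Cauchy-Schwarz divergence defined above, the following sketch evaluates a discrete analogue in which the integrals are replaced by sums over density values taken on a shared grid. This is an assumption-laden simplification for intuition only; it is not the closed-form solution over KDE mixtures referred to in the text, and `cauchy_schwarz_divergence` is a hypothetical name.

```python
import math


def cauchy_schwarz_divergence(p, q):
    """Discrete analogue of D_CS(p||q) for two equally long value lists."""
    cross = sum(pi * qi for pi, qi in zip(p, q))
    if cross == 0:
        # Disjoint supports: the ratio is 0 and the divergence is unbounded.
        return math.inf
    norm = math.sqrt(sum(pi * pi for pi in p) * sum(qi * qi for qi in q))
    return -math.log(cross / norm)


# Hypothetical density values of two distributions on the same two-point grid.
same = cauchy_schwarz_divergence([0.5, 0.5], [0.5, 0.5])
different = cauchy_schwarz_divergence([0.5, 0.5], [0.9, 0.1])
```

By the Cauchy-Schwarz inequality the ratio inside the logarithm never exceeds 1, so the divergence is non-negative and vanishes when the two inputs are proportional.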
Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Trains the aggregation function.
-classif_predictions – a LabelledCollection containing the label predictions issued -by the classifier
data – a quapy.data.base.LabelledCollection consisting of the training data
Bases: AggregativeSoftQuantifier, KDEBase
Kernel Density Estimation model for quantification (KDEy) relying on the squared Hellinger Distance (HD) as -the divergence measure to be minimized. This method was first proposed in the paper -Kernel Density Estimation for Multiclass Quantification, in which -the authors proposed a Monte Carlo approach for minimizing the divergence.
-The distribution matching optimization problem comes down to solving:
-\(\hat{\alpha} = \arg\min_{\alpha\in\Delta^{n-1}} \mathcal{D}(\boldsymbol{p}_{\alpha}||q_{\widetilde{U}})\)
-where \(p_{\alpha}\) is the mixture of class-specific KDEs with mixture parameter (hence class prevalence) -\(\alpha\) defined by
-\(\boldsymbol{p}_{\alpha}(\widetilde{x}) = \sum_{i=1}^n \alpha_i p_{\widetilde{L}_i}(\widetilde{x})\)
-where \(p_X(\boldsymbol{x}) = \frac{1}{|X|} \sum_{x_i\in X} K\left(\frac{x-x_i}{h}\right)\) is the -KDE function that uses the datapoints in X as the kernel centers.
-In KDEy-HD, the divergence is taken to be the squared Hellinger Distance, an f-divergence with corresponding -f-generator function given by:
-\(f(u)=(\sqrt{u}-1)^2\)
-The authors proposed a Monte Carlo solution that relies on importance sampling:
-\(\hat{D}_f(p||q)= \frac{1}{t} \sum_{i=1}^t f\left(\frac{p(x_i)}{q(x_i)}\right) \frac{q(x_i)}{r(x_i)}\)
-where the datapoints (trials) \(x_1,\ldots,x_t\sim_{\mathrm{iid}} r\) with \(r\) the -uniform distribution.
-classifier – a sklearn’s Estimator that generates a binary classifier.
val_split – specifies the data used for generating classifier predictions. This specification -can be made as float in (0, 1) indicating the proportion of stratified held-out validation set to -be extracted from the training set; or as an integer (default 5), indicating that the predictions -are to be generated in a k-fold cross-validation manner (with this integer indicating the value -for k); or as a collection defining the specific set of data to use for validation. -Alternatively, this set can be specified at fit time by indicating the exact set of data -on which the predictions are to be generated.
bandwidth – float, the bandwidth of the Kernel
n_jobs – number of parallel workers
random_state – a seed to be set before fitting any base quantifier (default None)
montecarlo_trials – number of Monte Carlo trials (default 10000)
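The importance-sampling estimator given above can be sketched in a few lines for one-dimensional densities. The example assumes a uniform reference distribution \(r\) on [0, 1] and two hypothetical densities \(p\) and \(q\) on that interval; all names are illustrative and the code does not reproduce the library's optimization over \(\alpha\).

```python
import math
import random


def hellinger_generator(u):
    """f-generator of the squared Hellinger Distance: f(u) = (sqrt(u) - 1)^2."""
    return (math.sqrt(u) - 1.0) ** 2


def mc_f_divergence(p, q, r_sample, r_density, trials=10000, seed=0):
    """Monte Carlo estimate of D_f(p||q) with trials drawn i.i.d. from r."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = r_sample(rng)
        # Each trial contributes f(p/q) weighted by the ratio q/r.
        total += hellinger_generator(p(x) / q(x)) * q(x) / r_density(x)
    return total / trials


# Hypothetical densities on [0, 1]; q is strictly positive on the interval.
p = lambda x: 1.0
q = lambda x: 0.5 + x
estimate = mc_f_divergence(p, q, lambda rng: rng.random(), lambda x: 1.0)
```

For this pair of densities the exact value of the integral is about 0.0219, so the estimate should land close to it with 10000 trials.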
Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Trains the aggregation function.
-classif_predictions – a LabelledCollection containing the label predictions issued -by the classifier
data – a quapy.data.base.LabelledCollection consisting of the training data
Bases: AggregativeSoftQuantifier, KDEBase
Kernel Density Estimation model for quantification (KDEy) relying on the Kullback-Leibler divergence (KLD) as -the divergence measure to be minimized. This method was first proposed in the paper -Kernel Density Estimation for Multiclass Quantification, in which -the authors show that minimizing the distribution matching criterion for KLD is akin to performing -maximum likelihood (ML).
-The distribution matching optimization problem comes down to solving:
-\(\hat{\alpha} = \arg\min_{\alpha\in\Delta^{n-1}} \mathcal{D}(\boldsymbol{p}_{\alpha}||q_{\widetilde{U}})\)
-where \(p_{\alpha}\) is the mixture of class-specific KDEs with mixture parameter (hence class prevalence) -\(\alpha\) defined by
-\(\boldsymbol{p}_{\alpha}(\widetilde{x}) = \sum_{i=1}^n \alpha_i p_{\widetilde{L}_i}(\widetilde{x})\)
-where \(p_X(\boldsymbol{x}) = \frac{1}{|X|} \sum_{x_i\in X} K\left(\frac{x-x_i}{h}\right)\) is the -KDE function that uses the datapoints in X as the kernel centers.
-In KDEy-ML, the divergence is taken to be the Kullback-Leibler Divergence. This is equivalent to solving: -\(\hat{\alpha} = \arg\min_{\alpha\in\Delta^{n-1}} - -\mathbb{E}_{q_{\widetilde{U}}} \left[ \log \boldsymbol{p}_{\alpha}(\widetilde{x}) \right]\)
-which corresponds to the maximum likelihood estimate.
-classifier – a sklearn’s Estimator that generates a binary classifier.
val_split – specifies the data used for generating classifier predictions. This specification -can be made as float in (0, 1) indicating the proportion of stratified held-out validation set to -be extracted from the training set; or as an integer (default 5), indicating that the predictions -are to be generated in a k-fold cross-validation manner (with this integer indicating the value -for k); or as a collection defining the specific set of data to use for validation. -Alternatively, this set can be specified at fit time by indicating the exact set of data -on which the predictions are to be generated.
bandwidth – float, the bandwidth of the Kernel
n_jobs – number of parallel workers
random_state – a seed to be set before fitting any base quantifier (default None)
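The maximum likelihood view can be made concrete with a small binary sketch: given the class-conditional density values of each test point, search for the mixture weight that minimizes the negative log-likelihood. The grid search, the function names, and the toy density values are assumptions for illustration; a multiclass version would need an optimizer over the probability simplex.

```python
import math


def negative_log_likelihood(alpha, densities):
    """NLL of the test points under the mixture (1 - alpha) * d0 + alpha * d1."""
    return -sum(math.log((1 - alpha) * d0 + alpha * d1) for d0, d1 in densities)


def ml_prevalence(densities, steps=1000):
    """Grid search for the positive-class prevalence with the lowest NLL."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda a: negative_log_likelihood(a, densities))


# Hypothetical (d0, d1) pairs: density of each test point under the KDE of
# the negative and of the positive class; values are assumed strictly positive.
densities = [(0.9, 0.1), (0.9, 0.1), (0.9, 0.1), (0.1, 0.9)]
alpha_hat = ml_prevalence(densities)
```

For these four points the analytical optimum is 0.1875, which is lower than the 0.25 obtained by simply counting the one point that looks positive.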
Searches for the mixture model parameter (the sought prevalence values) that maximizes the likelihood -of the data (i.e., that minimizes the negative log-likelihood)
-posteriors – instances in the sample converted into posterior probabilities
-a vector of class prevalence estimates
-Trains the aggregation function.
-classif_predictions – a LabelledCollection containing the label predictions issued -by the classifier
data – a quapy.data.base.LabelledCollection consisting of the training data
Bases: Module
Implements the QuaNet forward pass.
-See QuaNetTrainer for training QuaNet.
doc_embedding_size – integer, the dimensionality of the document embeddings
n_classes – integer, number of classes
stats_size – integer, number of statistics estimated by simple quantification methods
lstm_hidden_size – integer, hidden dimensionality of the LSTM cell
lstm_nlayers – integer, number of LSTM layers
ff_layers – list of integers, dimensions of the densely-connected FF layers on top of the -quantification embedding
bidirectional – boolean, whether or not to use bidirectional LSTM
qdrop_p – float, dropout probability
order_by – integer, class for which the document embeddings are to be sorted
Defines the computation performed at every call.
-Should be overridden by all subclasses.
-Note
-Although the recipe for forward pass needs to be defined within
-this function, one should call the Module instance afterwards
-instead of this since the former takes care of running the
-registered hooks while the latter silently ignores them.
Bases: BaseQuantifier
Implementation of QuaNet, a neural network for -quantification. This implementation uses PyTorch and can take advantage of GPU -for speeding-up the training phase.
-Example:
->>> import quapy as qp
->>> from quapy.method.meta import QuaNet
->>> from quapy.classification.neural import NeuralClassifierTrainer, CNNnet
->>>
->>> # use samples of 100 elements
->>> qp.environ['SAMPLE_SIZE'] = 100
->>>
->>> # load the kindle dataset as text, and convert words to numerical indexes
->>> dataset = qp.datasets.fetch_reviews('kindle', pickle=True)
->>> qp.data.preprocessing.index(dataset, min_df=5, inplace=True)
->>>
->>> # the text classifier is a CNN trained by NeuralClassifierTrainer
->>> cnn = CNNnet(dataset.vocabulary_size, dataset.n_classes)
->>> classifier = NeuralClassifierTrainer(cnn, device='cuda')
->>>
->>> # train QuaNet (QuaNet is an alias to QuaNetTrainer)
->>> model = QuaNet(classifier, qp.environ['SAMPLE_SIZE'], device='cuda')
->>> model.fit(dataset.training)
->>> estim_prevalence = model.quantify(dataset.test.instances)
-classifier – an object implementing fit (i.e., that can be trained on labelled data), -predict_proba (i.e., that can generate posterior probabilities of unlabelled examples) and -transform (i.e., that can generate embedded representations of the unlabelled instances).
sample_size – integer, the sample size; default is None, meaning that the sample size should be -taken from qp.environ[“SAMPLE_SIZE”]
n_epochs – integer, maximum number of training epochs
tr_iter_per_poch – integer, number of training iterations before considering an epoch complete
va_iter_per_poch – integer, number of validation iterations to perform after each epoch
lr – float, the learning rate
lstm_hidden_size – integer, hidden dimensionality of the LSTM cells
lstm_nlayers – integer, number of LSTM layers
ff_layers – list of integers, dimensions of the densely-connected FF layers on top of the -quantification embedding
bidirectional – boolean, indicates whether the LSTM is bidirectional or not
qdrop_p – float, dropout probability
patience – integer, number of epochs showing no improvement in the validation set before stopping the -training phase (early stopping)
checkpointdir – string, a path where to store models’ checkpoints
checkpointname – string (optional), the name of the model’s checkpoint
device – string, either “cpu” or “cuda”
Trains QuaNet.
-data – the training data on which to train QuaNet. If fit_classifier=True, the data will be split in -40/40/20 for training the classifier, training QuaNet, and validating QuaNet, respectively. If -fit_classifier=False, the data will be split in 66/34 for training QuaNet and validating it, respectively.
fit_classifier – if True, trains the classifier on a split containing 40% of the data
self
-Get parameters for this estimator.
-deep (bool, default=True) – If True, will return the parameters for this estimator and -contained subobjects that are estimators.
-params – Parameter names mapped to their values.
-dict
-Generate class prevalence estimates for the sample’s instances
-instances – array-like
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Set the parameters of this estimator.
-The method works on simple estimators as well as on nested objects
-(such as Pipeline). The latter have
-parameters of the form <component>__<parameter> so that it’s
-possible to update each component of a nested object.
**params (dict) – Estimator parameters.
-self – Estimator instance.
-estimator instance
-Torch-like wrapper for the Mean Absolute Error
-output – predictions
target – ground truth values
mean absolute error loss
-Bases: ThresholdOptimization
Threshold Optimization variant for ACC as proposed by
-Forman 2006 and
-Forman 2008 that looks
-for the threshold that maximizes tpr-fpr.
-The goal is to bring improved stability to the denominator of the adjustment.
classifier – a sklearn’s Estimator that generates a classifier
val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the
-misclassification rates are to be estimated.
-This parameter can be indicated as a real value (between 0 and 1), representing a proportion of
-validation data, or as an integer, indicating that the misclassification rates should be estimated via
-k-fold cross validation (this integer stands for the number of folds k, default 5), or as a
-quapy.data.base.LabelledCollection (the split itself).
Implements the criterion according to which the threshold should be selected. -This function should return the (float) score to be minimized.
-tpr – float, true positive rate
fpr – float, false positive rate
float, a score for the given tpr and fpr
-Bases: ThresholdOptimization
Median Sweep. Threshold Optimization variant for ACC as proposed by
-Forman 2006 and
-Forman 2008 that generates
-class prevalence estimates for all decision thresholds and returns the median of them all.
-The goal is to bring improved stability to the denominator of the adjustment.
classifier – a sklearn’s Estimator that generates a classifier
val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the
-misclassification rates are to be estimated.
-This parameter can be indicated as a real value (between 0 and 1), representing a proportion of
-validation data, or as an integer, indicating that the misclassification rates should be estimated via
-k-fold cross validation (this integer stands for the number of folds k, default 5), or as a
-quapy.data.base.LabelledCollection (the split itself).
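The median-sweep idea can be sketched as follows: compute one adjusted estimate per decision threshold and return the median. The optional `min_gap` argument mimics the Median Sweep 2 restriction to thresholds with tpr-fpr above 0.25. The function name, the input layout, and the clipping to [0, 1] are assumptions of this example rather than the library's implementation.

```python
from statistics import median


def median_sweep(candidates, min_gap=0.0):
    """Median of the adjusted estimates over all usable thresholds.

    candidates: (raw_prev, tpr, fpr) triples, one per decision threshold.
    min_gap: only thresholds with tpr - fpr > min_gap are considered.
    """
    estimates = []
    for raw_prev, tpr, fpr in candidates:
        if tpr - fpr > min_gap:
            adjusted = (raw_prev - fpr) / (tpr - fpr)
            # Assumption: clip each adjusted estimate to the valid range.
            estimates.append(min(1.0, max(0.0, adjusted)))
    # statistics.median raises StatisticsError if no threshold qualifies.
    return median(estimates)


# Hypothetical values measured at four decision thresholds.
candidates = [(0.55, 0.9, 0.2), (0.40, 0.8, 0.1), (0.28, 0.6, 0.05), (0.10, 0.3, 0.1)]
ms_estimate = median_sweep(candidates)
ms2_estimate = median_sweep(candidates, min_gap=0.25)
```

Taking the median makes the result robust to the few thresholds whose small tpr-fpr denominator produces unstable adjusted values.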
Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Trains the aggregation function.
-classif_predictions – a LabelledCollection containing the label predictions issued -by the classifier
data – a quapy.data.base.LabelledCollection consisting of the training data
Implements the criterion according to which the threshold should be selected. -This function should return the (float) score to be minimized.
-tpr – float, true positive rate
fpr – float, false positive rate
float, a score for the given tpr and fpr
-Bases: MS
Median Sweep 2. Threshold Optimization variant for ACC as proposed by
-Forman 2006 and
-Forman 2008 that generates
-class prevalence estimates for all decision thresholds and returns the median of those for cases in
-which tpr-fpr>0.25.
-The goal is to bring improved stability to the denominator of the adjustment.
classifier – a sklearn’s Estimator that generates a classifier
val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the
-misclassification rates are to be estimated.
-This parameter can be indicated as a real value (between 0 and 1), representing a proportion of
-validation data, or as an integer, indicating that the misclassification rates should be estimated via
-k-fold cross validation (this integer stands for the number of folds k, default 5), or as a
-quapy.data.base.LabelledCollection (the split itself).
Bases: ThresholdOptimization
Threshold Optimization variant for ACC as proposed by
-Forman 2006 and
-Forman 2008 that looks
-for the threshold that makes tpr closest to 0.5.
-The goal is to bring improved stability to the denominator of the adjustment.
classifier – a sklearn’s Estimator that generates a classifier
val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the
-misclassification rates are to be estimated.
-This parameter can be indicated as a real value (between 0 and 1), representing a proportion of
-validation data, or as an integer, indicating that the misclassification rates should be estimated via
-k-fold cross validation (this integer stands for the number of folds k, default 5), or as a
-quapy.data.base.LabelledCollection (the split itself).
Implements the criterion according to which the threshold should be selected. -This function should return the (float) score to be minimized.
-tpr – float, true positive rate
fpr – float, false positive rate
float, a score for the given tpr and fpr
-Bases: BinaryAggregativeQuantifier
Abstract class of Threshold Optimization variants for ACC as proposed by
-Forman 2006 and
-Forman 2008.
-The goal is to bring improved stability to the denominator of the adjustment.
-The different variants are based on different heuristics for choosing a decision threshold
-that would allow for more true positives and many more false positives, on the grounds this
-would deliver larger denominators.
classifier – a sklearn’s Estimator that generates a classifier
val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the
-misclassification rates are to be estimated.
-This parameter can be indicated as a real value (between 0 and 1), representing a proportion of
-validation data, or as an integer, indicating that the misclassification rates should be estimated via
-k-fold cross validation (this integer stands for the number of folds k, default 5), or as a
-quapy.data.base.LabelledCollection (the split itself).
Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Trains the aggregation function.
-classif_predictions – a LabelledCollection containing the label predictions issued -by the classifier
data – a quapy.data.base.LabelledCollection consisting of the training data
Implements the criterion according to which the threshold should be selected. -This function should return the (float) score to be minimized.
-tpr – float, true positive rate
fpr – float, false positive rate
float, a score for the given tpr and fpr
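To make the role of the criterion concrete, the sketch below writes the three heuristics described in this section as scores to be minimized and picks the threshold with the lowest score. The function names, the candidate layout, and the example rates are hypothetical; this is an illustration of the selection rule, not the library's code.

```python
def max_criterion(tpr, fpr):
    """Minimizing fpr - tpr is the same as maximizing tpr - fpr."""
    return fpr - tpr


def t50_criterion(tpr, fpr):
    """Distance of tpr from 0.5."""
    return abs(tpr - 0.5)


def x_criterion(tpr, fpr):
    """Distance from the point where tpr = 1 - fpr."""
    return abs(1 - (tpr + fpr))


def select_threshold(candidates, criterion):
    """Return the threshold whose (tpr, fpr) pair has the lowest score."""
    best = min(candidates, key=lambda c: criterion(c[1], c[2]))
    return best[0]


# Hypothetical (threshold, tpr, fpr) triples estimated on validation data.
candidates = [(0.2, 0.90, 0.15), (0.5, 0.85, 0.05), (0.8, 0.50, 0.01)]
chosen = {
    "max": select_threshold(candidates, max_criterion),
    "t50": select_threshold(candidates, t50_criterion),
    "x": select_threshold(candidates, x_criterion),
}
```

On these values each heuristic selects a different threshold, which shows why the choice of criterion matters for the stability of the adjustment.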
-Bases: ThresholdOptimization
Threshold Optimization variant for ACC as proposed by
-Forman 2006 and
-Forman 2008 that looks
-for the threshold that yields tpr=1-fpr.
-The goal is to bring improved stability to the denominator of the adjustment.
classifier – a sklearn’s Estimator that generates a classifier
val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the
-misclassification rates are to be estimated.
-This parameter can be indicated as a real value (between 0 and 1), representing a proportion of
-validation data, or as an integer, indicating that the misclassification rates should be estimated via
-k-fold cross validation (this integer stands for the number of folds k, default 5), or as a
-quapy.data.base.LabelledCollection (the split itself).
Implements the criterion according to which the threshold should be selected. -This function should return the (float) score to be minimized.
-tpr – float, true positive rate
fpr – float, false positive rate
float, a score for the given tpr and fpr
-Bases: BaseEstimator
Abstract Quantifier. A quantifier is defined as an object of a class that implements the method fit() on
-quapy.data.base.LabelledCollection, the method quantify(), and the set_params() and
-get_params() for model selection (see quapy.model_selection.GridSearchQ())
Trains a quantifier.
-data – a quapy.data.base.LabelledCollection consisting of the training data
self
-Bases: BaseQuantifier
Abstract class of binary quantifiers, i.e., quantifiers estimating class prevalence values for only two classes -(typically, to be interpreted as one class and its complement).
-Bases: OneVsAll, BaseQuantifier
Allows any binary quantifier to perform quantification on single-label datasets. The method maintains one binary -quantifier for each class, and then l1-normalizes the outputs so that the class prevalence values sum up to 1.
-Trains a quantifier.
-data – a quapy.data.base.LabelledCollection consisting of the training data
self
-Implements an ensemble of quapy.method.aggregative.ACC quantifiers, as used by
-Pérez-Gállego et al., 2019.
Equivalent to:
->>> ensembleFactory(classifier, ACC, param_grid, optim, param_mod_sel, **kwargs)
-See ensembleFactory() for further details.
classifier – sklearn’s Estimator that generates a classifier
param_grid – a dictionary with the grid of parameters to optimize for
optim – a valid quantification or classification error, or a string name of it
param_model_sel – a dictionary containing any keyworded argument to pass to
-quapy.model_selection.GridSearchQ
kwargs – kwargs for the class Ensemble
an instance of Ensemble
Implements an ensemble of quapy.method.aggregative.CC quantifiers, as used by
-Pérez-Gállego et al., 2019.
Equivalent to:
->>> ensembleFactory(classifier, CC, param_grid, optim, param_mod_sel, **kwargs)
-See ensembleFactory() for further details.
classifier – sklearn’s Estimator that generates a classifier
param_grid – a dictionary with the grid of parameters to optimize for
optim – a valid quantification or classification error, or a string name of it
param_model_sel – a dictionary containing any keyworded argument to pass to
-quapy.model_selection.GridSearchQ
kwargs – kwargs for the class Ensemble
an instance of Ensemble
Implements an ensemble of quapy.method.aggregative.EMQ quantifiers.
Equivalent to:
->>> ensembleFactory(classifier, EMQ, param_grid, optim, param_mod_sel, **kwargs)
-See ensembleFactory() for further details.
classifier – sklearn’s Estimator that generates a classifier
param_grid – a dictionary with the grid of parameters to optimize for
optim – a valid quantification or classification error, or a string name of it
param_model_sel – a dictionary containing any keyworded argument to pass to
-quapy.model_selection.GridSearchQ
kwargs – kwargs for the class Ensemble
an instance of Ensemble
Implements an ensemble of quapy.method.aggregative.HDy quantifiers, as used by
-Pérez-Gállego et al., 2019.
Equivalent to:
->>> ensembleFactory(classifier, HDy, param_grid, optim, param_mod_sel, **kwargs)
-See ensembleFactory() for further details.
classifier – sklearn’s Estimator that generates a classifier
param_grid – a dictionary with the grid of parameters to optimize for
optim – a valid quantification or classification error, or a string name of it
param_mod_sel – a dictionary containing any keyword arguments to pass to
-quapy.model_selection.GridSearchQ
kwargs – kwargs for the class Ensemble
an instance of Ensemble
Implements an ensemble of quapy.method.aggregative.PACC quantifiers.
Equivalent to:
->>> ensembleFactory(classifier, PACC, param_grid, optim, param_mod_sel, **kwargs)
-See ensembleFactory() for further details.
classifier – sklearn’s Estimator that generates a classifier
param_grid – a dictionary with the grid of parameters to optimize for
optim – a valid quantification or classification error, or a string name of it
param_mod_sel – a dictionary containing any keyword arguments to pass to
-quapy.model_selection.GridSearchQ
kwargs – kwargs for the class Ensemble
an instance of Ensemble
Bases: BaseQuantifier
Implementation of the Ensemble methods for quantification described by Pérez-Gállego et al., 2017 and Pérez-Gállego et al., 2019. The policies implemented include:
-Average (policy='ave'): computes class prevalence estimates as the average of the estimates returned by the base quantifiers.
Training Prevalence (policy='ptr'): applies a dynamic selection to the ensemble's members by retaining only those members such that the class prevalence values in the samples they use as training set are closest to preliminary class prevalence estimates computed as the average of the estimates of all the members. The final estimate is recomputed by considering only the selected members.
Distribution Similarity (policy='ds'): performs a dynamic selection of base members by retaining the members trained on samples whose distribution of posterior probabilities is closest, in terms of the Hellinger Distance, to the distribution of posterior probabilities in the test sample.
Accuracy (policy='<valid error name>'): performs a static selection of the ensemble members by retaining those that minimize a quantification error measure, which is passed as an argument.
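The Average and Training Prevalence policies can be illustrated with a minimal, library-free sketch. This is not quapy's implementation: the function names, the plain-list representation of prevalence vectors, and the use of squared Euclidean distance to rank members are all assumptions made for illustration.

```python
# Illustrative sketch only; names and the distance measure are hypothetical.

def average_policy(member_estimates):
    """Average the prevalence vectors returned by all ensemble members."""
    n_members = len(member_estimates)
    n_classes = len(member_estimates[0])
    return [
        sum(est[c] for est in member_estimates) / n_members
        for c in range(n_classes)
    ]


def ptr_policy(member_estimates, training_prevalences, red_size):
    """Keep the red_size members whose training prevalence is closest to the
    preliminary (average) estimate, then average only those members."""
    preliminary = average_policy(member_estimates)

    def distance(prev):
        # Squared Euclidean distance (an assumption, chosen for simplicity).
        return sum((p - q) ** 2 for p, q in zip(prev, preliminary))

    order = sorted(
        range(len(member_estimates)),
        key=lambda i: distance(training_prevalences[i]),
    )
    selected = [member_estimates[i] for i in order[:red_size]]
    return average_policy(selected)


estimates = [[0.2, 0.8], [0.4, 0.6], [0.9, 0.1]]
train_prevs = [[0.5, 0.5], [0.45, 0.55], [0.1, 0.9]]
ave = average_policy(estimates)
ptr = ptr_policy(estimates, train_prevs, red_size=2)
```

In this toy case the third member was trained on a sample far from the preliminary estimate, so the dynamic selection drops it before the final average is recomputed.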
Example:
->>> model = Ensemble(quantifier=ACC(LogisticRegression()), size=30, policy='ave', n_jobs=-1)
-quantifier – base quantification member of the ensemble
size – number of members
red_size – number of members to retain after selection (depending on the policy)
min_pos – minimum number of positive instances to consider a sample as valid
policy – the selection policy; available policies include: ave (default), ptr, ds, and accuracy (which is instantiated via a valid error name, e.g., mae)
max_sample_size – maximum number of instances to consider in the samples (set to None to indicate no limit, default)
val_split – a float in range (0,1) indicating the proportion of data to be used as a stratified held-out
-validation split, or a quapy.data.base.LabelledCollection (the split itself).
n_jobs – number of parallel workers (default 1)
verbose – set to True (default is False) to get some information in standard output
Indicates that the quantifier is not aggregative.
-False
-Trains a quantifier.
-data – a quapy.data.base.LabelledCollection consisting of the training data
self
-This function should not be used within quapy.model_selection.GridSearchQ (it is here only for compatibility
-with the abstract class).
-Instead, use Ensemble(GridSearchQ(q),…), with q a Quantifier (recommended), or
-Ensemble(Q(GridSearchCV(l))) with Q a quantifier class that has a classifier l optimized for
-classification (not recommended).
deep – for compatibility with scikit-learn
-raises an Exception
-Indicates that the quantifier is not probabilistic.
-False
-Generate class prevalence estimates for the sample’s instances
-instances – array-like
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-This function should not be used within quapy.model_selection.GridSearchQ (it is here only for compatibility
-with the abstract class).
-Instead, use Ensemble(GridSearchQ(q),…), with q a Quantifier (recommended), or
-Ensemble(Q(GridSearchCV(l))) with Q a quantifier class that has a classifier l optimized for
-classification (not recommended).
parameters – dictionary
-raises an Exception
-Bases: BinaryQuantifier
This method is a meta-quantifier that returns, as the estimated class prevalence values, the median of the estimates returned by differently (hyper)parameterized base quantifiers. The median of unit-vectors is only guaranteed to be a unit-vector for n=2 dimensions, i.e., in cases of binary quantification.
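The restriction to binary quantification can be checked with a small sketch. The helper name `median_prevalence` and the toy estimates are hypothetical; the code only takes a per-class median over plain lists and is not the class documented here.

```python
from statistics import median


def median_prevalence(estimates):
    """Per-class median over the prevalence vectors of several base quantifiers."""
    n_classes = len(estimates[0])
    return [median(est[c] for est in estimates) for c in range(n_classes)]


# Binary case: each vector is (p, 1 - p), so the medians still add up to 1.
binary = [[0.2, 0.8], [0.3, 0.7], [0.6, 0.4]]
binary_median = median_prevalence(binary)

# Three classes: every input sums to 1, but the per-class medians do not.
ternary = [[0.6, 0.2, 0.2], [0.2, 0.6, 0.2], [0.2, 0.2, 0.6]]
ternary_median = median_prevalence(ternary)

print(sum(binary_median), sum(ternary_median))
```

In the binary case the second component is always one minus the first, so taking medians preserves the sum; with three or more classes no such constraint ties the components together.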
-base_quantifier – the base, binary quantifier
random_state – a seed to be set before fitting any base quantifier (default None)
param_grid – the grid of parameters over which the median will be computed
n_jobs – number of parallel workers
Trains a quantifier.
-data – a quapy.data.base.LabelledCollection consisting of the training data
self
-Get parameters for this estimator.
-deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
-params – Parameter names mapped to their values.
-dict
-Generate class prevalence estimates for the sample’s instances
-instances – array-like
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Set the parameters of this estimator.
-The method works on simple estimators as well as on nested objects
-(such as Pipeline). The latter have
-parameters of the form <component>__<parameter> so that it’s
-possible to update each component of a nested object.
**params (dict) – Estimator parameters.
-self – Estimator instance.
-estimator instance
-Bases: BinaryQuantifier
This method is a meta-quantifier that returns, as the estimated class prevalence values, the median of the estimates returned by differently (hyper)parameterized base quantifiers. The median of unit-vectors is only guaranteed to be a unit-vector for n=2 dimensions, i.e., in cases of binary quantification.
-base_quantifier – the base, binary quantifier
random_state – a seed to be set before fitting any base quantifier (default None)
param_grid – the grid of parameters over which the median will be computed
n_jobs – number of parallel workers
Trains a quantifier.
-data – a quapy.data.base.LabelledCollection consisting of the training data
self
-Get parameters for this estimator.
-deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
-params – Parameter names mapped to their values.
-dict
-Generate class prevalence estimates for the sample’s instances
-instances – array-like
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Set the parameters of this estimator.
-The method works on simple estimators as well as on nested objects
-(such as Pipeline). The latter have
-parameters of the form <component>__<parameter> so that it’s
-possible to update each component of a nested object.
**params (dict) – Estimator parameters.
-self – Estimator instance.
-estimator instance
-Ensemble factory. Provides a unified interface for instantiating ensembles that can be optimized (via model
-selection for quantification) for a given evaluation metric using quapy.model_selection.GridSearchQ.
-If the evaluation metric is classification-oriented
-(instead of quantification-oriented), then the optimization will be carried out via sklearn’s
-GridSearchCV.
Example to instantiate an Ensemble based on quapy.method.aggregative.PACC
-in which the base members are optimized for quapy.error.mae() via
-quapy.model_selection.GridSearchQ. The ensemble follows the policy Accuracy based
-on quapy.error.mae() (the same measure being optimized),
-meaning that a static selection of members of the ensemble is made based on their performance
-in terms of this error.
>>> import numpy as np
->>> from sklearn.linear_model import LogisticRegression
->>> from quapy.method.aggregative import PACC
->>> from quapy.method.meta import ensembleFactory
->>>
->>> param_grid = {
-...     'C': np.logspace(-3, 3, 7),
-...     'class_weight': ['balanced', None]
-... }
->>> param_mod_sel = {
-...     'sample_size': 500,
-...     'protocol': 'app'
-... }
->>> common = {
-...     'max_sample_size': 1000,
-...     'n_jobs': -1,
-...     'param_grid': param_grid,
-...     'param_mod_sel': param_mod_sel,
-... }
->>>
->>> ensembleFactory(LogisticRegression(), PACC, optim='mae', policy='mae', **common)
-classifier – sklearn’s Estimator that generates a classifier
base_quantifier_class – a class of quantifiers
param_grid – a dictionary with the grid of parameters to optimize for
optim – a valid quantification or classification error, or a string name of it
param_mod_sel – a dictionary containing any keyword arguments to pass to
-quapy.model_selection.GridSearchQ
kwargs – kwargs for the class Ensemble
an instance of Ensemble
Gets a histogram out of the posterior probabilities (only for the binary case).
-posterior_probabilities – array-like of shape (n_instances, 2)
bins – integer
np.ndarray with the relative frequencies for each bin (for the positive class only)
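A minimal sketch of such a histogram, in plain Python, is given here for orientation. It assumes that the positive class sits in column 1 and that the bins have equal width over [0, 1]; neither detail is stated above, and the function name `posterior_histogram` is hypothetical.

```python
def posterior_histogram(posterior_probabilities, bins):
    """Relative frequency, per bin, of the positive-class posterior.

    Assumptions (not confirmed by the documentation above): the positive
    class is column 1 and bins are equal-width over [0, 1].
    """
    counts = [0] * bins
    for row in posterior_probabilities:
        p_pos = row[1]
        # A posterior of exactly 1.0 is placed in the last bin.
        idx = min(int(p_pos * bins), bins - 1)
        counts[idx] += 1
    total = len(posterior_probabilities)
    return [c / total for c in counts]


posteriors = [[0.9, 0.1], [0.6, 0.4], [0.1, 0.9], [0.0, 1.0]]
hist = posterior_histogram(posteriors, bins=2)
```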
-Bases: BaseQuantifier
Generic Distribution Matching quantifier for binary or multiclass quantification based on the space of covariates. This implementation takes the number of bins, the divergence, and the possibility to work on CDF as hyperparameters.
-nbins – number of bins used to discretize the distributions (default 8)
divergence – a string representing a divergence measure (currently, "HD" and "topsoe" are implemented) or a callable function taking two ndarrays of the same dimension as input (default "HD", meaning Hellinger Distance)
cdf – whether to use CDF instead of PDF (default False)
n_jobs – number of parallel workers (default None)
Hellinger Distance x (HDx). HDx is a method for training binary quantifiers that models quantification as the problem of minimizing the average divergence (in terms of the Hellinger Distance) across the feature-specific normalized histograms of two representations, one for the unlabelled examples, and another generated from the training examples as a mixture model of the class-specific representations. The parameters of the mixture thus represent the estimates of the class prevalence values.
-The method computes all matchings for nbins in [10, 20, …, 110] and reports the mean of the median. The best prevalence is searched via linear search, from 0 to 1 stepping by 0.01.
-n_jobs – number of parallel workers
-an instance of this class set up to mimic the performance of HDx as originally proposed by González-Castro, Alaiz-Rodríguez, Alegre (2013)
-Generates the validation distributions out of the training data (covariates). The validation distributions have shape (n, nfeats, nbins), with n the number of classes, nfeats the number of features, and nbins the number of bins. In particular, let V be the validation distributions; then di=V[i] are the distributions obtained from training data labelled with class i; while dij = di[j] is the discrete distribution for feature j in training data labelled with class i, and dij[k] is the fraction of instances with a value in the k-th bin.
-data – the training set
-Searches for the mixture model parameter (the sought prevalence values) that yields a validation distribution (the mixture) that best matches the test distribution, in terms of the divergence measure of choice. The matching is computed as the average dissimilarity (in terms of the dissimilarity measure of choice) between all feature-specific discrete distributions.
-instances – instances in the sample
-a vector of class prevalence estimates
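The mixture search can be sketched for the binary case as a plain linear scan over candidate prevalence values. This is an illustrative sketch, not the library's code: the function names and the list-of-histograms layout (one histogram per feature) are assumptions, and the Hellinger distance is written without a normalizing constant, which does not change which candidate is selected.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (unnormalized)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def match_prevalence(neg_hists, pos_hists, test_hists, step=0.01):
    """Scan alpha in [0, 1] and keep the mixture closest to the test data.

    Each argument holds one histogram per feature (hypothetical layout);
    the score of a candidate is the average divergence across features.
    """
    best_alpha, best_div = 0.0, float("inf")
    n_steps = round(1 / step)
    for i in range(n_steps + 1):
        alpha = i * step
        divs = []
        for neg, pos, test in zip(neg_hists, pos_hists, test_hists):
            mixture = [(1 - alpha) * n + alpha * p for n, p in zip(neg, pos)]
            divs.append(hellinger(mixture, test))
        avg_div = sum(divs) / len(divs)
        if avg_div < best_div:
            best_alpha, best_div = alpha, avg_div
    return [1 - best_alpha, best_alpha]


# One feature, two bins; the test histogram is a 70/30 mixture of the classes.
neg = [[0.8, 0.2]]
pos = [[0.1, 0.9]]
test = [[0.7 * 0.8 + 0.3 * 0.1, 0.7 * 0.2 + 0.3 * 0.9]]
estimate = match_prevalence(neg, pos, test)
```

A linear scan is the simplest choice and matches the 0.01 step mentioned for HDx; a multiclass version would need a search over the probability simplex instead of a single scalar.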
-Bases: BaseQuantifier
The Maximum Likelihood Prevalence Estimation (MLPE) method is a lazy method that assumes there is no prior probability shift between training and test instances (put another way, that the i.i.d. assumption holds). The estimation of class prevalence values for any test sample is always (i.e., irrespective of the test sample itself) the class prevalence seen during training. This method is considered to be a lower-bound quantifier that any quantification method should beat.
-Computes the training prevalence and stores it.
-data – the training sample
-self
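The behaviour of this baseline can be summarized in a few lines of plain Python. The class below is a hypothetical stand-in written for illustration (it takes a bare list of labels rather than a LabelledCollection) and is not the class documented here.

```python
from collections import Counter


class MLPESketch:
    """Toy baseline: always predicts the class prevalence seen in training."""

    def fit(self, labels):
        counts = Counter(labels)
        self.classes_ = sorted(counts)
        n = len(labels)
        self.estimated_prevalence = [counts[c] / n for c in self.classes_]
        return self

    def quantify(self, instances):
        # The test instances are deliberately ignored.
        return list(self.estimated_prevalence)


baseline = MLPESketch().fit([0, 0, 1, 0])
prediction = baseline.quantify(["any", "test", "sample"])
```

Because the prediction never depends on the test sample, any error this baseline makes is exactly the amount of prior probability shift between training and test data.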
-