QuaPy is a Python-based open-source framework for quantification.
This document contains the API of the modules included in QuaPy.
QuaPy can be installed via pip:
pip install quapy
QuaPy is hosted on GitHub at https://github.com/HLT-ISTI/QuaPy
In this section you can find useful information concerning different aspects of QuaPy, with examples:
BCTSCalibration
NBVSCalibration
RecalibratedProbabilisticClassifier
RecalibratedProbabilisticClassifierBase
RecalibratedProbabilisticClassifierBase.classes_
RecalibratedProbabilisticClassifierBase.fit()
RecalibratedProbabilisticClassifierBase.fit_cv()
RecalibratedProbabilisticClassifierBase.fit_tr_val()
RecalibratedProbabilisticClassifierBase.predict()
RecalibratedProbabilisticClassifierBase.predict_proba()
TSCalibration
VSCalibration
CNNnet
LSTMnet
NeuralClassifierTrainer
TextClassifierNet
TorchDataset
Dataset
LabelledCollection
LabelledCollection.X
LabelledCollection.Xp
LabelledCollection.Xy
LabelledCollection.binary
LabelledCollection.counts()
LabelledCollection.join()
LabelledCollection.kFCV()
LabelledCollection.load()
LabelledCollection.n_classes
LabelledCollection.p
LabelledCollection.prevalence()
LabelledCollection.sampling()
LabelledCollection.sampling_from_index()
LabelledCollection.sampling_index()
LabelledCollection.split_random()
LabelledCollection.split_stratified()
LabelledCollection.stats()
LabelledCollection.uniform_sampling()
LabelledCollection.uniform_sampling_index()
LabelledCollection.y
ACC
AdjustedClassifyAndCount
AggregativeCrispQuantifier
AggregativeMedianEstimator
AggregativeQuantifier
AggregativeQuantifier.aggregate()
AggregativeQuantifier.aggregation_fit()
AggregativeQuantifier.classes_
AggregativeQuantifier.classifier
AggregativeQuantifier.classifier_fit_predict()
AggregativeQuantifier.classify()
AggregativeQuantifier.fit()
AggregativeQuantifier.quantify()
AggregativeQuantifier.val_split
AggregativeQuantifier.val_split_
AggregativeSoftQuantifier
BayesianCC
BinaryAggregativeQuantifier
CC
ClassifyAndCount
DMy
DistributionMatchingY
DyS
EMQ
ExpectationMaximizationQuantifier
HDy
HellingerDistanceY
OneVsAllAggregative
PACC
PCC
ProbabilisticAdjustedClassifyAndCount
ProbabilisticClassifyAndCount
SLD
SMM
newELM()
newSVMAE()
newSVMKLD()
newSVMQ()
newSVMRAE()
KDEBase
KDEyCS
KDEyHD
KDEyML
QuaNetModule
QuaNetTrainer
mae_loss()
MAX
MS
MS2
T50
ThresholdOptimization
X
BlobelLoss
CVClassifier
ClassTransformer
CombinedLoss
ComposableQuantifier()
DistanceTransformer
EnergyKernelTransformer
EnergyLoss
GaussianKernelTransformer
GaussianRFFKernelTransformer
HellingerSurrogateLoss
HistogramTransformer
KernelTransformer
LaplacianKernelTransformer
LeastSquaresLoss
TikhonovRegularization
TikhonovRegularized()
absolute_error()
acc_error()
acce()
ae()
f1_error()
f1e()
from_name()
kld()
mae()
mean_absolute_error()
mean_normalized_absolute_error()
mean_normalized_relative_absolute_error()
mean_relative_absolute_error()
mkld()
mnae()
mnkld()
mnrae()
mrae()
mse()
nae()
nkld()
normalized_absolute_error()
normalized_relative_absolute_error()
nrae()
rae()
relative_absolute_error()
se()
smooth()
HellingerDistance()
TopsoeDistance()
argmin_prevalence()
as_binary_prevalence()
check_prevalence_vector()
clip()
condsoftmax()
counts_from_labels()
get_divergence()
get_nprevpoints_approximation()
l1_norm()
linear_search()
normalize_prevalence()
num_prevalence_combinations()
optim_minimize()
prevalence_from_labels()
prevalence_from_probabilities()
prevalence_linspace()
projection_simplex_sort()
softmax()
solve_adjustment()
solve_adjustment_binary()
strprev()
ternary_search()
uniform_prevalence()
uniform_prevalence_sampling()
uniform_simplex_sampling()
ConfigStatus
GridSearchQ
Status
cross_val_predict()
expand_grid()
group_params()
Python Module Index
quapy
quapy.classification
quapy.classification.calibration
quapy.classification.methods
quapy.classification.neural
quapy.classification.svmperf
quapy.data
quapy.data.base
quapy.data.datasets
quapy.data.preprocessing
quapy.data.reader
quapy.error
quapy.evaluation
quapy.functional
quapy.method
quapy.method._kdey
quapy.method._neural
quapy.method._threshold_optim
quapy.method.aggregative
quapy.method.base
quapy.method.composable
quapy.method.meta
quapy.method.non_aggregative
quapy.model_selection
quapy.plot
quapy.protocol
quapy.util
Bases: RecalibratedProbabilisticClassifierBase
Applies the Bias-Corrected Temperature Scaling (BCTS) calibration method from abstention.calibration, as defined in the Alexandari et al. paper:
classifier – a scikit-learn probabilistic classifier
val_split – indicate an integer k for performing kFCV to obtain the posterior probabilities, or a float p in (0,1) to indicate that the posteriors are obtained in a stratified validation split containing p% of the training instances (the rest is used for training). In any case, the classifier is retrained on the whole training set afterwards. Default value is 5.
n_jobs – indicate the number of parallel workers (only when val_split is an integer)
verbose – whether or not to display information in the standard output
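A minimal usage sketch (not part of the original docs): it assumes scikit-learn and the abstention package are installed, and uses synthetic data purely for illustration.
>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import LogisticRegression
>>> from quapy.classification.calibration import BCTSCalibration
>>> X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # toy data
>>> calibrated = BCTSCalibration(LogisticRegression(max_iter=1000), val_split=5)
>>> calibrated.fit(X, y)                      # posteriors for calibration come from 5-fold CV
>>> posteriors = calibrated.predict_proba(X)  # calibrated posterior probabilities
>>> labels = calibrated.predict(X)            # crisp predictions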
Bases: RecalibratedProbabilisticClassifierBase
Applies the No-Bias Vector Scaling (NBVS) calibration method from abstention.calibration, as defined in the Alexandari et al. paper:
classifier – a scikit-learn probabilistic classifier
val_split – indicate an integer k for performing kFCV to obtain the posterior probabilities, or a float p in (0,1) to indicate that the posteriors are obtained in a stratified validation split containing p% of the training instances (the rest is used for training). In any case, the classifier is retrained on the whole training set afterwards. Default value is 5.
n_jobs – indicate the number of parallel workers (only when val_split is an integer)
verbose – whether or not to display information in the standard output
Bases: object
Abstract class for (re)calibration method from abstention.calibration, as defined in Alexandari, A., Kundaje, A., & Shrikumar, A. (2020, November). Maximum likelihood with bias-corrected calibration is hard-to-beat at label shift adaptation. In International Conference on Machine Learning (pp. 222-232). PMLR.
Bases: BaseEstimator, RecalibratedProbabilisticClassifier
Applies a (re)calibration method from abstention.calibration, as defined in the Alexandari et al. paper.
classifier – a scikit-learn probabilistic classifier
calibrator – the calibration object (an instance of abstention.calibration.CalibratorFactory)
val_split – indicate an integer k for performing kFCV to obtain the posterior probabilities, or a float p in (0,1) to indicate that the posteriors are obtained in a stratified validation split containing p% of the training instances (the rest is used for training). In any case, the classifier is retrained on the whole training set afterwards. Default value is 5.
n_jobs – indicate the number of parallel workers (only when val_split is an integer); default=None
verbose – whether or not to display information in the standard output
Returns the classes on which the classifier has been trained
array-like of shape (n_classes)
Fits the calibration for the probabilistic classifier.
X – array-like of shape (n_samples, n_features) with the data instances
y – array-like of shape (n_samples,) with the class labels
self
Fits the calibration in a cross-validation manner, i.e., it generates posterior probabilities for all training instances via cross-validation, and then retrains the classifier on all training instances. The posterior probabilities thus generated are used for calibrating the outputs of the classifier.
X – array-like of shape (n_samples, n_features) with the data instances
y – array-like of shape (n_samples,) with the class labels
self
Fits the calibration in a train/val-split manner, i.e., it partitions the training instances into a training and a validation set, and then uses the training samples to learn a classifier which is then used to generate posterior probabilities for the held-out validation data. These posteriors are used to calibrate the classifier. The classifier is not retrained on the whole dataset.
X – array-like of shape (n_samples, n_features) with the data instances
y – array-like of shape (n_samples,) with the class labels
self
Bases: RecalibratedProbabilisticClassifierBase
Applies the Temperature Scaling (TS) calibration method from abstention.calibration, as defined in the Alexandari et al. paper:
classifier – a scikit-learn probabilistic classifier
val_split – indicate an integer k for performing kFCV to obtain the posterior probabilities, or a float p in (0,1) to indicate that the posteriors are obtained in a stratified validation split containing p% of the training instances (the rest is used for training). In any case, the classifier is retrained on the whole training set afterwards. Default value is 5.
n_jobs – indicate the number of parallel workers (only when val_split is an integer)
verbose – whether or not to display information in the standard output
Bases: RecalibratedProbabilisticClassifierBase
Applies the Vector Scaling (VS) calibration method from abstention.calibration, as defined in the Alexandari et al. paper:
classifier – a scikit-learn probabilistic classifier
val_split – indicate an integer k for performing kFCV to obtain the posterior probabilities, or a float p in (0,1) to indicate that the posteriors are obtained in a stratified validation split containing p% of the training instances (the rest is used for training). In any case, the classifier is retrained on the whole training set afterwards. Default value is 5.
n_jobs – indicate the number of parallel workers (only when val_split is an integer)
verbose – whether or not to display information in the standard output
Bases: BaseEstimator
An example of a classification method (i.e., an object that implements fit, predict, and predict_proba) that also generates embedded inputs (i.e., that implements transform), as those required for quapy.method.neural.QuaNet. This is a mock method to allow for easily instantiating quapy.method.neural.QuaNet on array-like real-valued instances. The transformation consists of applying sklearn.decomposition.TruncatedSVD while classification is performed using sklearn.linear_model.LogisticRegression on the low-rank space.
n_components – the number of principal components to retain
kwargs – parameters for the Logistic Regression classifier
Fit the model according to the given training data. The fit consists of fitting TruncatedSVD and then LogisticRegression on the low-rank representation.
X – array-like of shape (n_samples, n_features) with the instances
y – array-like of shape (n_samples, n_classes) with the class labels
self
Get hyper-parameters for this estimator.
a dictionary with parameter names mapped to their values
Predicts labels for the instances X embedded into the low-rank space.
X – array-like of shape (n_samples, n_features) instances to classify
a numpy array of length n containing the label predictions, where n is the number of instances in X
Predicts posterior probabilities for the instances X embedded into the low-rank space.
X – array-like of shape (n_samples, n_features) instances to classify
array-like of shape (n_samples, n_classes) with the posterior probabilities
Set the parameters of this estimator.
parameters – a **kwargs dictionary with the estimator parameters for Logistic Regression and eventually also n_components for TruncatedSVD
Returns the low-rank approximation of X with n_components dimensions, or X unaltered if n_components >= X.shape[1].
X – array-like of shape (n_samples, n_features) instances to embed
array-like of shape (n_samples, n_components) with the embedded instances
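A brief usage sketch of this classifier (here assumed to be quapy.classification.methods.LowRankLogisticRegression), on synthetic data:
>>> from sklearn.datasets import make_classification
>>> from quapy.classification.methods import LowRankLogisticRegression
>>> X, y = make_classification(n_samples=300, n_features=50, random_state=0)  # toy data
>>> clf = LowRankLogisticRegression(n_components=10).fit(X, y)
>>> embedded = clf.transform(X)        # low-rank representation of shape (300, 10)
>>> posteriors = clf.predict_proba(X)  # posterior probabilities computed in the low-rank space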
Bases: TextClassifierNet
An implementation of quapy.classification.neural.TextClassifierNet based on Convolutional Neural Networks.
vocabulary_size – the size of the vocabulary
n_classes – number of target classes
embedding_size – the dimensionality of the word embeddings space (default 100)
hidden_size – the dimensionality of the hidden space (default 256)
repr_size – the dimensionality of the document embeddings space (default 100)
kernel_heights – list of kernel lengths (default [3,5,7]), i.e., the number of consecutive tokens that each kernel covers
stride – convolutional stride (default 1)
padding – convolutional pad (default 0)
drop_p – drop probability for dropout (default 0.5)
Embeds documents (i.e., performs the forward pass up to the next-to-last layer).
input – a batch of instances, typically generated by a torch's DataLoader instance (see quapy.classification.neural.TorchDataset)
a torch tensor of shape (n_samples, n_dimensions), where n_samples is the number of documents, and n_dimensions is the dimensionality of the embedding
Get hyper-parameters for this estimator
a dictionary with parameter names mapped to their values
Return the size of the vocabulary
integer
Bases: TextClassifierNet
An implementation of quapy.classification.neural.TextClassifierNet based on Long Short Term Memory networks.
vocabulary_size – the size of the vocabulary
n_classes – number of target classes
embedding_size – the dimensionality of the word embeddings space (default 100)
hidden_size – the dimensionality of the hidden space (default 256)
repr_size – the dimensionality of the document embeddings space (default 100)
lstm_class_nlayers – number of LSTM layers (default 1)
drop_p – drop probability for dropout (default 0.5)
Embeds documents (i.e., performs the forward pass up to the next-to-last layer).
x – a batch of instances, typically generated by a torch's DataLoader instance (see quapy.classification.neural.TorchDataset)
a torch tensor of shape (n_samples, n_dimensions), where n_samples is the number of documents, and n_dimensions is the dimensionality of the embedding
Get hyper-parameters for this estimator
a dictionary with parameter names mapped to their values
Return the size of the vocabulary
integer
Bases: object
Trains a neural network for text classification.
net – an instance of TextClassifierNet implementing the forward pass
lr – learning rate (default 1e-3)
weight_decay – weight decay (default 0)
patience – number of consecutive epochs without any improvement in validation to wait before applying early stop (default 10)
epochs – maximum number of training epochs (default 200)
batch_size – batch size for training (default 64)
batch_size_test – batch size for test (default 512)
padding_length – maximum number of tokens to consider in a document (default 300)
device – specify ‘cpu’ (default) or ‘cuda’ for enabling gpu
checkpointpath – where to store the parameters of the best model found so far according to the evaluation in the held-out validation split (default '../checkpoint/classifier_net.dat')
Gets the device in which the network is allocated
device
Fits the model according to the given training data.
instances – list of lists of indexed tokens
labels – array-like of shape (n_samples, n_classes) with the class labels
val_split – proportion of training documents to be taken as the validation set (default 0.3)
Get hyper-parameters for this estimator
a dictionary with parameter names mapped to their values
Predicts labels for the instances
instances – list of lists of indexed tokens
a numpy array of length n containing the label predictions, where n is the number of instances in X
Predicts posterior probabilities for the instances
X – array-like of shape (n_samples, n_features) instances to classify
array-like of shape (n_samples, n_classes) with the posterior probabilities
Reinitialize the network parameters
vocab_size – the size of the vocabulary
n_classes – the number of target classes
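A minimal sketch of training a CNNnet with NeuralClassifierTrainer (requires PyTorch; the random token ids below merely stand in for documents indexed, e.g., with IndexTransformer):
>>> import random
>>> from quapy.classification.neural import CNNnet, NeuralClassifierTrainer
>>> random.seed(0)
>>> docs = [[random.randint(0, 99) for _ in range(random.randint(5, 20))] for _ in range(200)]
>>> labels = [random.randint(0, 1) for _ in range(200)]
>>> net = CNNnet(vocabulary_size=100, n_classes=2, embedding_size=50, hidden_size=64)
>>> trainer = NeuralClassifierTrainer(net, lr=1e-3, epochs=3, batch_size=16, device='cpu')
>>> trainer.fit(docs, labels, val_split=0.3)          # 30% of the documents are held out for validation
>>> posteriors = trainer.predict_proba(docs[:5])      # posterior probabilities for the first 5 documents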
Bases: Module
Abstract Text classifier (torch.nn.Module)
Gets the number of dimensions of the embedding space
integer
Embeds documents (i.e., performs the forward pass up to the next-to-last layer).
x – a batch of instances, typically generated by a torch's DataLoader instance (see quapy.classification.neural.TorchDataset)
a torch tensor of shape (n_samples, n_dimensions), where n_samples is the number of documents, and n_dimensions is the dimensionality of the embedding
Performs the forward pass.
x – a batch of instances, typically generated by a torch's DataLoader instance (see quapy.classification.neural.TorchDataset)
a tensor of shape (n_instances, n_classes) with the decision scores for each of the instances and classes
Get hyper-parameters for this estimator
a dictionary with parameter names mapped to their values
Predicts posterior probabilities for the instances in x
x – a torch tensor of indexed tokens with shape (n_instances, pad_length), where n_instances is the number of instances in the batch, and pad_length is the length of the pad in the batch
array-like of shape (n_samples, n_classes) with the posterior probabilities
Return the size of the vocabulary
integer
Bases: Dataset
Transforms labelled instances into a Torch torch.utils.data.DataLoader object
instances – list of lists of indexed tokens
labels – array-like of shape (n_samples, n_classes) with the class labels
Converts the labelled collection into a Torch DataLoader with dynamic padding for the batch
batch_size – batch size
shuffle – whether or not to shuffle instances
pad_length – the maximum length for the list of tokens (dynamic padding is applied, meaning that if the longest document in the batch is shorter than pad_length, then the batch is padded up to its length, and not to pad_length)
device – whether to allocate tensors in cpu or in cuda
a torch.utils.data.DataLoader object
Bases: BaseEstimator, ClassifierMixin
A wrapper for the SVM-perf package by Thorsten Joachims. When using losses for quantification, the source code has to be patched. See the installation documentation for further details.
svmperf_base – path to directory containing the binary files svm_perf_learn and svm_perf_classify
C – trade-off between training error and margin (default 0.01)
verbose – set to True to print svm-perf std outputs
loss – the loss to optimize for. Available losses are “01”, “f1”, “kld”, “nkld”, “q”, “qacc”, “qf1”, “qgm”, “mae”, “mrae”.
host_folder – directory where to store the trained model; set to None (default) for using a tmp directory (temporary directories are automatically deleted)
Evaluate the decision function for the samples in X.
X – array-like of shape (n_samples, n_features) containing the instances to classify
y – unused
array-like of shape (n_samples,) containing the decision scores of the instances
Trains the SVM for the multivariate performance loss
X – training instances
y – a binary vector of labels
self
Predicts labels for the instances X
X – array-like of shape (n_samples, n_features) instances to classify
a numpy array of length n containing the label predictions, where n is the number of instances in X
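A hedged usage sketch (the path and data below are placeholders; SVM-perf must be installed and patched as described above):
>>> from quapy.classification.svmperf import SVMperf
>>> # '/path/to/svm_perf' is a placeholder pointing to the patched SVM-perf binaries
>>> svm = SVMperf(svmperf_base='/path/to/svm_perf', C=0.01, loss='kld')
>>> svm.fit(Xtr, ytr)                     # Xtr: training instances; ytr: binary labels (placeholders)
>>> predictions = svm.predict(Xte)        # Xte: test instances (placeholder)
>>> scores = svm.decision_function(Xte)   # decision scores for the test instances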
Bases: object
Abstraction of training and test LabelledCollection objects.
training – a LabelledCollection instance
test – a LabelledCollection instance
vocabulary – if indicated, is a dictionary of the terms used in this textual dataset
name – a string representing the name of the dataset
Generates a Dataset from a stratified split of a LabelledCollection instance.
See LabelledCollection.split_stratified()
collection – LabelledCollection
train_size – the proportion of training documents (the rest conforms the test split)
an instance of Dataset
Returns True if the training collection is labelled according to two classes
boolean
The classes according to which the training collection is labelled
The classes according to which the training collection is labelled
Generator of stratified folds to be used in k-fold cross validation. This function is only a wrapper around LabelledCollection.kFCV() that returns Dataset instances made of training and test folds.
nfolds – integer (default 5), the number of folds to generate
nrepeats – integer (default 1), the number of rounds of k-fold cross validation to run
random_state – integer (default 0), guarantees that the folds generated are reproducible
yields nfolds * nrepeats folds for k-fold cross validation as instances of Dataset
Loads a training and a test labelled set of data and converts it into a Dataset instance. The function in charge of reading the instances must be specified. This function can be a custom one, or any of the reading functions defined in the quapy.data.reader module.
train_path – string, the path to the file containing the training instances
test_path – string, the path to the file containing the test instances
loader_func – a custom function that implements the data loader and returns a tuple with instances and -labels
classes – array-like, the classes according to which the instances are labelled
loader_kwargs – any argument that the loader_func function needs in order to read the instances. See LabelledCollection.load() for further details.
a Dataset object
The number of classes according to which the training collection is labelled
integer
Reduce the number of instances in place for quick experiments. Preserves the prevalence of each set.
n_train – number of training documents to keep (default 100)
n_test – number of test documents to keep (default 100)
self
Returns (and eventually prints) a dictionary with some stats of this dataset. E.g.,:
>>> data = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5)
>>> data.stats()
>>> Dataset=kindle #tr-instances=3821, #te-instances=21591, type=<class 'scipy.sparse.csr.csr_matrix'>, #features=4403, #classes=[0 1], tr-prevs=[0.081, 0.919], te-prevs=[0.063, 0.937]
show – if set to True (default), prints the stats in standard output
a dictionary containing some stats of this collection for the training and test collections. The keys are train and test, and point to dedicated dictionaries of stats, for each collection, with keys #instances (the number of instances), type (the type representing the instances), #features (the number of features, if the instances are in array-like format), #classes (the classes of the collection), prevs (the prevalence values for each class)
Alias to self.training and self.test
the training and test collections
the training and test collections
If the dataset is textual, and the vocabulary was indicated, returns the size of the vocabulary
integer
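An illustrative sketch (not from the original docs) of how a Dataset can be assembled from a LabelledCollection via a stratified split:
>>> from quapy.data.base import Dataset, LabelledCollection
>>> collection = LabelledCollection(instances=[[0.0], [1.0]] * 50, labels=[0, 1] * 50)  # toy data
>>> training, test = collection.split_stratified(train_prop=0.6, random_state=0)
>>> data = Dataset(training, test, name='toy')
>>> data.stats()   # prints #instances, type, #classes, and prevalences for train and test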
Bases: object
A LabelledCollection is a set of objects, each with a label attached. This class implements several sampling routines and other utilities.
instances – array-like (np.ndarray, list, or csr_matrix are supported)
labels – array-like with the same length of instances
classes – optional, list of classes from which labels are taken. If not specified, the classes are inferred from the labels. The classes must be indicated in cases in which some of the labels might have no examples (i.e., a prevalence of 0)
An alias to self.instances
self.instances
Gets the instances and the true prevalence. This is useful when implementing evaluation protocols from a LabelledCollection object.
a tuple (instances, prevalence) from this collection
Gets the instances and labels. This is useful when working with sklearn estimators, e.g.:
>>> svm = LinearSVC().fit(*my_collection.Xy)
a tuple (instances, labels) from this collection
-Returns True if the number of classes is 2
boolean
Returns the number of instances for each of the classes in the codeframe.
a np.ndarray of shape (n_classes) with the number of instances of each class, in the same order as listed by self.classes_
Returns a new LabelledCollection as the union of the collections given in input.
args – instances of LabelledCollection
a LabelledCollection representing the union of both collections
Generator of stratified folds to be used in k-fold cross validation.
nfolds – integer (default 5), the number of folds to generate
nrepeats – integer (default 1), the number of rounds of k-fold cross validation to run
random_state – integer (default 0), guarantees that the folds generated are reproducible
yields nfolds * nrepeats folds for k-fold cross validation
Loads a labelled set of data and converts it into a LabelledCollection instance. The function in charge of reading the instances must be specified. This function can be a custom one, or any of the reading functions defined in the quapy.data.reader module.
path – string, the path to the file containing the labelled instances
loader_func – a custom function that implements the data loader and returns a tuple with instances and -labels
classes – array-like, the classes according to which the instances are labelled
loader_kwargs – any argument that the loader_func function needs in order to read the instances, i.e., -these arguments are used to call loader_func(path, **loader_kwargs)
a LabelledCollection object
The number of classes
integer
An alias to self.prevalence()
self.prevalence()
Returns the prevalence, or relative frequency, of the classes in the codeframe.
a np.ndarray of shape (n_classes) with the relative frequencies of each class, in the same order as listed by self.classes_
Return a random sample (an instance of LabelledCollection) of desired size and desired prevalence values. For each class, the sampling is drawn with replacement.
size – integer, the requested size
prevs – the prevalence for each class; the prevalence value for the last class can be left empty since it is constrained. E.g., for binary collections, only the prevalence p for the first class (as listed in self.classes_) can be specified, while the other class takes prevalence value 1-p
shuffle – if set to True (default), shuffles the index before returning it
random_state – seed for reproducing sampling
an instance of LabelledCollection with length == size and prevalence close to prevs (or prevalence == prevs if the exact prevalence values can be met as proportions of instances)
Returns an instance of LabelledCollection whose elements are sampled from this collection using the index.
index – np.ndarray
an instance of LabelledCollection
Returns an index to be used to extract a random sample of desired size and desired prevalence values. If the prevalence values are not specified, then returns the index of a uniform sampling. For each class, the sampling is drawn with replacement.
size – integer, the requested size
prevs – the prevalence for each class; the prevalence value for the last class can be left empty since it is constrained. E.g., for binary collections, only the prevalence p for the first class (as listed in self.classes_) can be specified, while the other class takes prevalence value 1-p
shuffle – if set to True (default), shuffles the index before returning it
random_state – seed for reproducing sampling
a np.ndarray of shape (size) with the indexes
Returns two instances of LabelledCollection split randomly from this collection, at desired proportion.
train_prop – the proportion of elements to include in the left-most returned collection (typically used as the training collection). The rest of elements are included in the right-most returned collection (typically used as a test collection).
random_state – if specified, guarantees reproducibility of the split.
two instances of LabelledCollection, the first one with train_prop elements, and the second one with 1-train_prop elements
Returns two instances of LabelledCollection split with stratification from this collection, at desired proportion.
train_prop – the proportion of elements to include in the left-most returned collection (typically used as the training collection). The rest of elements are included in the right-most returned collection (typically used as a test collection).
random_state – if specified, guarantees reproducibility of the split.
two instances of LabelledCollection, the first one with train_prop elements, and the second one with 1-train_prop elements
Returns (and eventually prints) a dictionary with some stats of this collection. E.g.,:
>>> data = qp.datasets.fetch_reviews('kindle', tfidf=True, min_df=5)
>>> data.training.stats()
>>> #instances=3821, type=<class 'scipy.sparse.csr.csr_matrix'>, #features=4403, #classes=[0 1], prevs=[0.081, 0.919]
show – if set to True (default), prints the stats in standard output
a dictionary containing some stats of this collection. Keys include #instances (the number of instances), type (the type representing the instances), #features (the number of features, if the instances are in array-like format), #classes (the classes of the collection), prevs (the prevalence values for each class)
Returns a uniform sample (an instance of LabelledCollection) of desired size. The sampling is drawn with replacement.
size – integer, the requested size
random_state – if specified, guarantees reproducibility of the split.
an instance of LabelledCollection with length == size
Returns an index to be used to extract a uniform sample of desired size. The sampling is drawn with replacement.
size – integer, the size of the uniform sample
random_state – if specified, guarantees reproducibility of the split.
a np.ndarray of shape (size) with the indexes
An alias to self.labels
self.labels
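The following sketch (illustrative, with random data) shows the typical sampling workflow:
>>> import numpy as np
>>> from quapy.data.base import LabelledCollection
>>> X = np.random.rand(1000, 2)
>>> y = np.random.randint(0, 2, size=1000)
>>> collection = LabelledCollection(X, y)
>>> sample = collection.sampling(100, 0.8, random_state=0)    # ~80% of the first class, 20% of the other
>>> sample.prevalence()                                       # close to [0.8, 0.2]
>>> idx = collection.sampling_index(100, 0.8, random_state=0)
>>> same_sample = collection.sampling_from_index(idx)         # builds an equivalent sample from an explicit index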
Loads the IFCB dataset for quantification from Zenodo (for more information on this dataset, please follow the zenodo link). This dataset is based on the data available publicly at the WHOI-Plankton repo. The dataset already comes with processed features. The scripts used for the processing are available at P. González's repo.
The datasets are downloaded only once, and stored for fast reuse.
single_sample_train – a boolean. If true, it will return the train dataset as a quapy.data.base.LabelledCollection (all examples together). If false, a generator of training samples will be returned. Each example in the training set has an individual label.
for_model_selection – if True, then returns a split 30% of the training set (86 out of 286 samples) to be used for model selection; -if False, then returns the full training set as training set and the test set as the test set
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
a tuple (train, test_gen) where train is an instance of quapy.data.base.LabelledCollection, if single_sample_train is true, or quapy.data._ifcb.IFCBTrainSamplesFromDir, i.e., a sampling protocol that returns a series of samples labelled example by example. test_gen will be a quapy.data._ifcb.IFCBTestSamples, i.e., a sampling protocol that returns a series of samples labelled by prevalence.
Loads a UCI dataset as an instance of quapy.data.base.Dataset, as used in
Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. Information Fusion, 34, 87-100.
and
Pérez-Gállego, P., Castano, A., Quevedo, J. R., & del Coz, J. J. (2019). Dynamic ensemble selection for quantification tasks. Information Fusion, 45, 1-15.
The datasets do not come with a predefined train-test split (see fetch_UCILabelledCollection() for further information on how to use these collections), and so a train-test split is generated at desired proportion. The list of valid dataset names can be accessed in quapy.data.datasets.UCI_DATASETS
dataset_name – a dataset name
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
test_split – proportion of documents to be included in the test set. The rest conforms the training set
verbose – set to True (default is False) to get information (from the UCI ML repository) about the datasets
a quapy.data.base.Dataset instance
Loads a UCI collection as an instance of quapy.data.base.LabelledCollection, as used in
Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. Information Fusion, 34, 87-100.
and
Pérez-Gállego, P., Castano, A., Quevedo, J. R., & del Coz, J. J. (2019). Dynamic ensemble selection for quantification tasks. Information Fusion, 45, 1-15.
The datasets do not come with a predefined train-test split, and so Pérez-Gállego et al. adopted a 5FCVx2 evaluation protocol, meaning that each collection was used to generate two rounds (hence the x2) of 5 fold cross validation. This can be reproduced by using quapy.data.base.Dataset.kFCV(), e.g.:
>>> import quapy as qp
>>> collection = qp.datasets.fetch_UCIBinaryLabelledCollection("yeast")
>>> for data in qp.data.Dataset.kFCV(collection, nfolds=5, nrepeats=2):
>>> ...
The list of valid dataset names can be accessed in quapy.data.datasets.UCI_DATASETS
dataset_name – a dataset name
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
test_split – proportion of documents to be included in the test set. The rest conforms the training set
verbose – set to True (default is False) to get information (from the UCI ML repository) about the datasets
a quapy.data.base.LabelledCollection instance
Loads a UCI multiclass dataset as an instance of quapy.data.base.Dataset.
The list of available datasets is taken from https://archive.ics.uci.edu/, following these criteria:
- It has more than 1000 instances
- It is suited for classification
- It has more than two classes
- It is available for Python import (requires ucimlrepo package)
>>> import quapy as qp
>>> dataset = qp.datasets.fetch_UCIMulticlassDataset("dry-bean")
>>> train, test = dataset.train_test
>>> ...
The list of valid dataset names can be accessed in quapy.data.datasets.UCI_MULTICLASS_DATASETS
The datasets are downloaded only once and pickled into disk, saving time for consecutive calls.
dataset_name – a dataset name
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
min_test_split – minimum proportion of instances to be included in the test set. This value is interpreted as a minimum proportion, meaning that the real proportion could be higher in case the training proportion (1-min_test_split % of the instances) surpasses max_train_instances. In such case, only max_train_instances are taken for training, and the rest (irrespective of min_test_split) is taken for test.
max_train_instances – maximum number of instances to keep for training (defaults to 25000)
min_class_support – minimum number of instances per class. Classes with fewer instances are discarded (default is 100)
verbose – set to True (default is False) to get information (stats) about the dataset
a quapy.data.base.Dataset instance
Loads a UCI multiclass collection as an instance of quapy.data.base.LabelledCollection.
The list of available datasets is taken from https://archive.ics.uci.edu/, following these criteria:
- It has more than 1000 instances
- It is suited for classification
- It has more than two classes
- It is available for Python import (requires ucimlrepo package)
>>> import quapy as qp
>>> collection = qp.datasets.fetch_UCIMulticlassLabelledCollection("dry-bean")
>>> X, y = collection.Xy
>>> ...
The list of valid dataset names can be accessed in quapy.data.datasets.UCI_MULTICLASS_DATASETS
The datasets are downloaded only once and pickled into disk, saving time for consecutive calls.
dataset_name – a dataset name
data_home – specify the quapy home directory where the dataset will be dumped (leave empty to use the default ~/quapy_data/ directory)
test_split – proportion of instances to be included in the test set. The rest conforms the training set
min_class_support – minimum number of instances per class. Classes with fewer instances are discarded (default is 100)
verbose – set to True (default is False) to get information (stats) about the dataset
a quapy.data.base.LabelledCollection instance
Loads the official datasets provided for the LeQua competition. In brief, there are 4 tasks (T1A, T1B, T2A, T2B) having to do with text quantification problems. Tasks T1A and T1B provide documents in vector form, while T2A and T2B provide raw documents instead. Tasks T1A and T2A are binary sentiment quantification problems, while T1B and T2B are multiclass quantification problems consisting of estimating the class prevalence values of 28 different merchandise products. We refer to Esuli, A., Moreo, A., Sebastiani, F., & Sperduti, G. (2022). A Detailed Overview of LeQua@CLEF 2022: Learning to Quantify. for a detailed description on the tasks and datasets.
The datasets are downloaded only once, and stored for fast reuse.
See 4.lequa2022_experiments.py provided in the example folder, which can serve as a guide on how to use these datasets.
task – a string representing the task name; valid ones are T1A, T1B, T2A, and T2B
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
a tuple (train, val_gen, test_gen) where train is an instance of quapy.data.base.LabelledCollection, val_gen and test_gen are instances of quapy.data._lequa2022.SamplesFromDir, a subclass of quapy.protocol.AbstractProtocol, that return a series of samples stored in a directory which are labelled by prevalence.
Loads a Reviews dataset as a Dataset instance, as used in Esuli, A., Moreo, A., and Sebastiani, F. "A recurrent neural network for sentiment quantification." Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 2018. The list of valid dataset names can be accessed in quapy.data.datasets.REVIEWS_SENTIMENT_DATASETS
dataset_name – the name of the dataset: valid ones are 'hp', 'kindle', 'imdb'
tfidf – set to True to transform the raw documents into tfidf weighted matrices
min_df – minimum number of documents that should contain a term in order for the term to be kept (ignored if tfidf==False)
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
pickle – set to True to pickle the Dataset object the first time it is generated, in order to allow for faster subsequent invocations
a quapy.data.base.Dataset instance
Loads a Twitter dataset as a quapy.data.base.Dataset instance, as used in: Gao, W., Sebastiani, F.: From classification to quantification in tweet sentiment analysis. Social Network Analysis and Mining 6(19), 1–22 (2016)
Note that the datasets 'semeval13', 'semeval14', 'semeval15' share the same training set.
The list of valid dataset names corresponding to training sets can be accessed in quapy.data.datasets.TWITTER_SENTIMENT_DATASETS_TRAIN, while the test sets can be accessed in quapy.data.datasets.TWITTER_SENTIMENT_DATASETS_TEST
dataset_name – the name of the dataset: valid ones are 'gasp', 'hcr', 'omd', 'sanders', 'semeval13', 'semeval14', 'semeval15', 'semeval16', 'sst', 'wa', 'wb'
for_model_selection – if True, then returns the train split as the training set and the devel split as the test set; if False, then returns the train+devel split as the training set and the test set as the test set
min_df – minimum number of documents that should contain a term in order for the term to be kept
data_home – specify the quapy home directory where collections will be dumped (leave empty to use the default ~/quapy_data/ directory)
pickle – set to True to pickle the Dataset object the first time it is generated, in order to allow for faster subsequent invocations
a quapy.data.base.Dataset instance
Bases: object
This class implements a sklearn-style transformer that indexes text as numerical ids for the tokens it contains, and that would be generated by sklearn's CountVectorizer
kwargs – keyworded arguments from CountVectorizer
Adds a new token (regardless of whether it has been found in the text or not), with dedicated id. Useful to define special tokens for codifying unknown words, or padding tokens.
word – string, surface form of the token
id – integer, numerical value to assign to the token (leave as None for indicating the next valid id, default)
nogaps – if set to True (default) asserts that the id indicated leads to no numerical gaps with precedent ids stored so far
integer, the numerical id for the new token
Fits the transformer, i.e., decides on the vocabulary, given a list of strings.
X – a list of strings
self
Fits the transform on X and transforms it.
X – a list of strings
n_jobs – the number of parallel workers to carry out this task
a np.ndarray of numerical ids
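A small sketch of the indexing workflow (illustrative; the texts and the '[UNK]' token below are arbitrary):
>>> from quapy.data.preprocessing import IndexTransformer
>>> texts = ['the cat sat on the mat', 'the dog barked', 'a cat and a dog']
>>> indexer = IndexTransformer(min_df=1)      # kwargs are forwarded to CountVectorizer
>>> indexed = indexer.fit_transform(texts)    # documents become arrays of token ids
>>> unk_id = indexer.add_word('[UNK]')        # register a special token with the next free id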
Indexes the tokens of a textual quapy.data.base.Dataset of string documents. To index a document means to replace each different token by a unique numerical index. Rare words (i.e., words occurring less than min_df times) are replaced by a special token UNK
dataset – a quapy.data.base.Dataset object where the instances of training and test documents are lists of str
min_df – minimum number of occurrences below which the term is replaced by a UNK index
inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)
kwargs – the rest of parameters of the transformation (as for sklearn's CountVectorizer, https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
a new quapy.data.base.Dataset (if inplace=False) or a reference to the current quapy.data.base.Dataset (inplace=True) consisting of lists of integer values representing indices.
Reduces the dimensionality of the instances, represented as a csr_matrix (or any subtype of scipy.sparse.spmatrix), of training and test documents by removing the columns of words which are not present in at least min_df instances in the training set
dataset – a quapy.data.base.Dataset in which instances are represented in sparse format (any subtype of scipy.sparse.spmatrix)
min_df – integer, minimum number of instances below which the columns are removed
inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)
a new quapy.data.base.Dataset (if inplace=False) or a reference to the current quapy.data.base.Dataset (inplace=True) where the dimensions corresponding to infrequent terms in the training set have been removed
Standardizes the real-valued columns of a quapy.data.base.Dataset. Standardization, aka z-scoring, of a variable X comes down to subtracting the average and normalizing by the standard deviation.
dataset – a quapy.data.base.Dataset object
inplace – set to True if the transformation is to be applied inplace, or to False (default) if a new quapy.data.base.Dataset is to be returned
an instance of quapy.data.base.Dataset
Transforms a quapy.data.base.Dataset of textual instances into a quapy.data.base.Dataset of tfidf weighted sparse vectors
dataset – a quapy.data.base.Dataset where the instances of training and test collections are lists of str
min_df – minimum number of occurrences for a word to be considered as part of the vocabulary (default 3)
sublinear_tf – whether or not to apply the log scaling to the tf counters (default True)
inplace – whether or not to apply the transformation inplace (True), or to a new copy (False, default)
kwargs – the rest of parameters of the transformation (as for sklearn's TfidfVectorizer)
a new quapy.data.base.Dataset in csr_matrix format (if inplace=False) or a reference to the current Dataset (if inplace=True) where the instances are stored in a csr_matrix of real-valued tfidf scores
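For instance (a sketch; fetch_reviews downloads the dataset on first use):
>>> import quapy as qp
>>> from quapy.data.preprocessing import text2tfidf
>>> data = qp.datasets.fetch_reviews('kindle')            # raw text documents
>>> data_tfidf = text2tfidf(data, min_df=5, inplace=False)
>>> data_tfidf.training.instances                          # csr_matrix of tfidf scores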
Binarizes a categorical array-like collection of labels towards the positive class pos_class. E.g.,:
>>> binarize([1, 2, 3, 1, 1, 0], pos_class=2)
>>> array([0, 1, 0, 0, 0, 0])
y – array-like of labels
pos_class – integer, the positive class
a binary np.ndarray, in which value 1 corresponds to positions in which y had pos_class labels, and 0 otherwise
Reads a csv file in which columns are separated by ','. File format: <label>,<feat1>,<feat2>,…,<featn>
path – path to the csv file
encoding – the text encoding used to open the file
a np.ndarray for the labels and a ndarray (float) for the covariates
Reads a labelled collection of real-valued instances expressed in sparse format. File format: <-1 or 0 or 1> [col(int):val(float)]
path – path to the labelled collection
a csr_matrix containing the instances (rows), and a ndarray containing the labels
Reads a labelled collection of documents. File format: <0 or 1> <document>
path – path to the labelled collection
encoding – the text encoding used to open the file
verbose – if >0 (default) shows some progress information in standard output
a list of sentences, and a list of labels
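These readers are typically passed as loader_func to LabelledCollection.load() or Dataset.load(); a sketch ('reviews_train.txt' is a placeholder path with lines of the form <0 or 1> <document>):
>>> from quapy.data.base import LabelledCollection
>>> from quapy.data.reader import from_text
>>> train = LabelledCollection.load('reviews_train.txt', loader_func=from_text)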
Re-indexes a list of labels as a list of indexes, and returns the classnames corresponding to the indexes. E.g.:
>>> reindex_labels(['B', 'B', 'A', 'C'])
>>> (array([1, 1, 0, 2]), array(['A', 'B', 'C'], dtype='<U1'))
y – the list or array of original labels
a ndarray (int) of class indexes, and a ndarray of classnames corresponding to the indexes.
Implementation of error measures used for quantification
Absolute error between two prevalence vectors \(p\) and \(\hat{p}\) is computed as \(AE(p,\hat{p})=\frac{1}{|\mathcal{Y}|}\sum_{y\in \mathcal{Y}}|\hat{p}(y)-p(y)|\), where \(\mathcal{Y}\) are the classes of interest.
prevs – array-like of shape (n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_classes,) with the predicted prevalence values
absolute error
Computes the error in terms of 1-accuracy. The accuracy is computed as \(\frac{tp+tn}{tp+fp+fn+tn}\), with tp, fp, fn, and tn standing for true positives, false positives, false negatives, and true negatives, respectively
y_true – array-like of true labels
y_pred – array-like of predicted labels
1-accuracy
Computes the error in terms of 1-accuracy. The accuracy is computed as \(\frac{tp+tn}{tp+fp+fn+tn}\), with tp, fp, fn, and tn standing for true positives, false positives, false negatives, and true negatives, respectively
y_true – array-like of true labels
y_pred – array-like of predicted labels
1-accuracy
Absolute error between two prevalence vectors \(p\) and \(\hat{p}\) is computed as \(AE(p,\hat{p})=\frac{1}{|\mathcal{Y}|}\sum_{y\in \mathcal{Y}}|\hat{p}(y)-p(y)|\), where \(\mathcal{Y}\) are the classes of interest.
prevs – array-like of shape (n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_classes,) with the predicted prevalence values
absolute error
F1 error: simply computes the error in terms of macro \(F_1\), i.e., \(1-F_1^M\), where \(F_1\) is the harmonic mean of precision and recall, defined as \(\frac{2tp}{2tp+fp+fn}\), with tp, fp, and fn standing for true positives, false positives, and false negatives, respectively. Macro averaging means the \(F_1\) is computed for each category independently, and then averaged.
y_true – array-like of true labels
y_pred – array-like of predicted labels
\(1-F_1^M\)
F1 error: simply computes the error in terms of macro \(F_1\), i.e., \(1-F_1^M\), where \(F_1\) is the harmonic mean of precision and recall, defined as \(\frac{2tp}{2tp+fp+fn}\), with tp, fp, and fn standing for true positives, false positives, and false negatives, respectively. Macro averaging means the \(F_1\) is computed for each category independently, and then averaged.
y_true – array-like of true labels
y_pred – array-like of predicted labels
\(1-F_1^M\)
Gets an error function from its name. E.g., from_name("mae") will return function quapy.error.mae()
err_name – string, the error name
a callable implementing the requested error
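For illustration, a quick sketch of how these error functions are typically invoked:
>>> import numpy as np
>>> import quapy as qp
>>> p_true = np.asarray([0.5, 0.3, 0.2])
>>> p_hat = np.asarray([0.4, 0.4, 0.2])
>>> qp.error.ae(p_true, p_hat)              # (0.1 + 0.1 + 0.0) / 3 ≈ 0.067
>>> error_fn = qp.error.from_name('mae')    # resolves the callable quapy.error.mae
>>> error_fn(np.asarray([p_true]), np.asarray([p_hat]))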
Kullback-Leibler divergence between two prevalence distributions \(p\) and \(\hat{p}\) is computed as \(KLD(p,\hat{p})=D_{KL}(p||\hat{p})=\sum_{y\in \mathcal{Y}} p(y)\log\frac{p(y)}{\hat{p}(y)}\), where \(\mathcal{Y}\) are the classes of interest. The distributions are smoothed using the eps factor (see quapy.error.smooth()).
prevs – array-like of shape (n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_classes,) with the predicted prevalence values
eps – smoothing factor. KLD is not defined in cases in which the distributions contain zeros; eps is typically set to be \(\frac{1}{2T}\), with \(T\) the sample size. If eps=None, the sample size will be taken from the environment variable SAMPLE_SIZE (which has thus to be set beforehand).
Kullback-Leibler divergence between the two distributions
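A short sketch of how the smoothing factor is typically obtained from the environment (illustrative values):
>>> import numpy as np
>>> import quapy as qp
>>> qp.environ['SAMPLE_SIZE'] = 100           # eps then defaults to 1/(2*100)
>>> p = np.asarray([1.0, 0.0])                # contains a zero: smoothing is required
>>> p_hat = np.asarray([0.9, 0.1])
>>> qp.error.kld(p, p_hat)                     # uses eps derived from SAMPLE_SIZE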
Computes the mean absolute error (see quapy.error.ae()) across the sample pairs.
prevs – array-like of shape (n_samples, n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_samples, n_classes,) with the predicted prevalence values
mean absolute error
Computes the mean absolute error (see quapy.error.ae()) across the sample pairs.
prevs – array-like of shape (n_samples, n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_samples, n_classes,) with the predicted prevalence values
mean absolute error
Computes the mean normalized absolute error (see quapy.error.nae()) across the sample pairs.
prevs – array-like of shape (n_samples, n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_samples, n_classes,) with the predicted prevalence values
mean normalized absolute error
Computes the mean normalized relative absolute error (see quapy.error.nrae()) across the sample pairs. The distributions are smoothed using the eps factor (see quapy.error.smooth()).
prevs – array-like of shape (n_samples, n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_samples, n_classes,) with the predicted prevalence values
eps – smoothing factor. mnrae is not defined in cases in which the true distribution contains zeros; eps is typically set to be \(\frac{1}{2T}\), with \(T\) the sample size. If eps=None, the sample size will be taken from the environment variable SAMPLE_SIZE (which has thus to be set beforehand).
mean normalized relative absolute error
Computes the mean relative absolute error (see quapy.error.rae()) across the sample pairs. The distributions are smoothed using the eps factor (see quapy.error.smooth()).
prevs – array-like of shape (n_samples, n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_samples, n_classes,) with the predicted prevalence values
eps – smoothing factor. mrae is not defined in cases in which the true distribution contains zeros; eps is typically set to be \(\frac{1}{2T}\), with \(T\) the sample size. If eps=None, the sample size will be taken from the environment variable SAMPLE_SIZE (which has thus to be set beforehand).
mean relative absolute error
Computes the mean Kullback-Leibler divergence (see quapy.error.kld()) across the sample pairs. The distributions are smoothed using the eps factor (see quapy.error.smooth()).
prevs – array-like of shape (n_samples, n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_samples, n_classes,) with the predicted prevalence values
eps – smoothing factor. KLD is not defined in cases in which the distributions contain zeros; eps is typically set to be \(\frac{1}{2T}\), with \(T\) the sample size. If eps=None, the sample size will be taken from the environment variable SAMPLE_SIZE (which has thus to be set beforehand).
mean Kullback-Leibler divergence
Computes the mean normalized absolute error (see quapy.error.nae()) across the sample pairs.
prevs – array-like of shape (n_samples, n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_samples, n_classes,) with the predicted prevalence values
mean normalized absolute error
Computes the mean Normalized Kullback-Leibler divergence (see quapy.error.nkld()) across the sample pairs. The distributions are smoothed using the eps factor (see quapy.error.smooth()).
prevs – array-like of shape (n_samples, n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_samples, n_classes,) with the predicted prevalence values
eps – smoothing factor. NKLD is not defined in cases in which the distributions contain zeros; eps is typically set to be \(\frac{1}{2T}\), with \(T\) the sample size. If eps=None, the sample size will be taken from the environment variable SAMPLE_SIZE (which has thus to be set beforehand).
mean Normalized Kullback-Leibler divergence
-Computes the mean normalized relative absolute error (see quapy.error.nrae()
) across
-the sample pairs. The distributions are smoothed using the eps factor (see
-quapy.error.smooth()
).
prevs – array-like of shape (n_samples, n_classes,) with the true -prevalence values
prevs_hat – array-like of shape (n_samples, n_classes,) with the predicted -prevalence values
eps – smoothing factor. mnrae is not defined in cases in which the true -distribution contains zeros; eps is typically set to be \(\frac{1}{2T}\), -with \(T\) the sample size. If eps=None, the sample size will be taken from -the environment variable SAMPLE_SIZE (which has thus to be set beforehand).
mean normalized relative absolute error
Computes the mean relative absolute error (see quapy.error.rae()) across the sample pairs. The distributions are smoothed using the eps factor (see quapy.error.smooth()).
prevs – array-like of shape (n_samples, n_classes,) with the true -prevalence values
prevs_hat – array-like of shape (n_samples, n_classes,) with the predicted -prevalence values
eps – smoothing factor. mrae is not defined in cases in which the true -distribution contains zeros; eps is typically set to be \(\frac{1}{2T}\), -with \(T\) the sample size. If eps=None, the sample size will be taken from -the environment variable SAMPLE_SIZE (which has thus to be set beforehand).
mean relative absolute error
Computes the mean squared error (see quapy.error.se()) across the sample pairs.
prevs – array-like of shape (n_samples, n_classes,) with the -true prevalence values
prevs_hat – array-like of shape (n_samples, n_classes,) with the -predicted prevalence values
mean squared error
Normalized absolute error between two prevalence vectors \(p\) and \(\hat{p}\) is computed as \(NAE(p,\hat{p})=\frac{AE(p,\hat{p})}{z_{AE}}\), where \(z_{AE}=\frac{2(1-\min_{y\in \mathcal{Y}} p(y))}{|\mathcal{Y}|}\), and \(\mathcal{Y}\) are the classes of interest.
-prevs – array-like of shape (n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_classes,) with the predicted prevalence values
normalized absolute error
Normalized Kullback-Leibler divergence between two prevalence distributions \(p\) and \(\hat{p}\) is computed as \(NKLD(p,\hat{p}) = 2\frac{e^{KLD(p,\hat{p})}}{e^{KLD(p,\hat{p})}+1}-1\), where \(\mathcal{Y}\) are the classes of interest. The distributions are smoothed using the eps factor (see quapy.error.smooth()).
prevs – array-like of shape (n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_classes,) with the predicted prevalence values
eps – smoothing factor. NKLD is not defined in cases in which the distributions -contain zeros; eps is typically set to be \(\frac{1}{2T}\), with \(T\) the sample -size. If eps=None, the sample size will be taken from the environment variable -SAMPLE_SIZE (which has thus to be set beforehand).
Normalized Kullback-Leibler divergence between the two distributions
Normalized absolute error between two prevalence vectors \(p\) and \(\hat{p}\) is computed as \(NAE(p,\hat{p})=\frac{AE(p,\hat{p})}{z_{AE}}\), where \(z_{AE}=\frac{2(1-\min_{y\in \mathcal{Y}} p(y))}{|\mathcal{Y}|}\), and \(\mathcal{Y}\) are the classes of interest.
-prevs – array-like of shape (n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_classes,) with the predicted prevalence values
normalized absolute error
Normalized relative absolute error between two prevalence vectors \(p\) and \(\hat{p}\) is computed as \(NRAE(p,\hat{p})= \frac{RAE(p,\hat{p})}{z_{RAE}}\), where \(z_{RAE} = \frac{|\mathcal{Y}|-1+\frac{1-\min_{y\in \mathcal{Y}} p(y)}{\min_{y\in \mathcal{Y}} p(y)}}{|\mathcal{Y}|}\) and \(\mathcal{Y}\) are the classes of interest. The distributions are smoothed using the eps factor (see quapy.error.smooth()).
prevs – array-like of shape (n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_classes,) with the predicted prevalence values
eps – smoothing factor. nrae is not defined in cases in which the true distribution -contains zeros; eps is typically set to be \(\frac{1}{2T}\), with \(T\) the -sample size. If eps=None, the sample size will be taken from the environment variable -SAMPLE_SIZE (which has thus to be set beforehand).
normalized relative absolute error
Normalized relative absolute error between two prevalence vectors \(p\) and \(\hat{p}\) is computed as \(NRAE(p,\hat{p})= \frac{RAE(p,\hat{p})}{z_{RAE}}\), where \(z_{RAE} = \frac{|\mathcal{Y}|-1+\frac{1-\min_{y\in \mathcal{Y}} p(y)}{\min_{y\in \mathcal{Y}} p(y)}}{|\mathcal{Y}|}\) and \(\mathcal{Y}\) are the classes of interest. The distributions are smoothed using the eps factor (see quapy.error.smooth()).
prevs – array-like of shape (n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_classes,) with the predicted prevalence values
eps – smoothing factor. nrae is not defined in cases in which the true distribution -contains zeros; eps is typically set to be \(\frac{1}{2T}\), with \(T\) the -sample size. If eps=None, the sample size will be taken from the environment variable -SAMPLE_SIZE (which has thus to be set beforehand).
normalized relative absolute error
Relative absolute error between two prevalence vectors \(p\) and \(\hat{p}\) is computed as \(RAE(p,\hat{p})=\frac{1}{|\mathcal{Y}|}\sum_{y\in \mathcal{Y}}\frac{|\hat{p}(y)-p(y)|}{p(y)}\), where \(\mathcal{Y}\) are the classes of interest. The distributions are smoothed using the eps factor (see quapy.error.smooth()).
prevs – array-like of shape (n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_classes,) with the predicted prevalence values
eps – smoothing factor. rae is not defined in cases in which the true distribution -contains zeros; eps is typically set to be \(\frac{1}{2T}\), with \(T\) the -sample size. If eps=None, the sample size will be taken from the environment variable -SAMPLE_SIZE (which has thus to be set beforehand).
relative absolute error
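For illustration, the RAE formula above can be reproduced directly with NumPy. The following is a minimal sketch (not the library implementation), which smooths both distributions with eps before averaging the per-class relative errors:
>>> import numpy as np
>>> def rae_sketch(p, p_hat, eps):
>>>     smooth = lambda v: (v + eps) / (eps * len(v) + v.sum())   # eps-smoothing, as described for smooth()
>>>     p, p_hat = smooth(p), smooth(p_hat)
>>>     return np.mean(np.abs(p_hat - p) / p)                     # average per-class relative error
>>> rae_sketch(np.array([0.8, 0.2]), np.array([0.7, 0.3]), eps=1/(2*100))   # assuming sample size T=100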
Relative absolute error between two prevalence vectors \(p\) and \(\hat{p}\) is computed as \(RAE(p,\hat{p})=\frac{1}{|\mathcal{Y}|}\sum_{y\in \mathcal{Y}}\frac{|\hat{p}(y)-p(y)|}{p(y)}\), where \(\mathcal{Y}\) are the classes of interest. The distributions are smoothed using the eps factor (see quapy.error.smooth()).
prevs – array-like of shape (n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_classes,) with the predicted prevalence values
eps – smoothing factor. rae is not defined in cases in which the true distribution -contains zeros; eps is typically set to be \(\frac{1}{2T}\), with \(T\) the -sample size. If eps=None, the sample size will be taken from the environment variable -SAMPLE_SIZE (which has thus to be set beforehand).
relative absolute error
Squared error between two prevalence vectors \(p\) and \(\hat{p}\) is computed as \(SE(p,\hat{p})=\frac{1}{|\mathcal{Y}|}\sum_{y\in \mathcal{Y}}(\hat{p}(y)-p(y))^2\), where \(\mathcal{Y}\) are the classes of interest.
-prevs – array-like of shape (n_classes,) with the true prevalence values
prevs_hat – array-like of shape (n_classes,) with the predicted prevalence values
squared error
Smooths a prevalence distribution with \(\epsilon\) (eps) as: \(\underline{p}(y)=\frac{\epsilon+p(y)}{\epsilon|\mathcal{Y}|+\displaystyle\sum_{y\in \mathcal{Y}}p(y)}\)
-prevs – array-like of shape (n_classes,) with the true prevalence values
eps – smoothing factor
array-like of shape (n_classes,) with the smoothed distribution
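The smoothing formula above admits a direct NumPy translation; the following is a minimal sketch (names are illustrative, not the library code):
>>> import numpy as np
>>> def smooth_sketch(prevs, eps):
>>>     prevs = np.asarray(prevs, dtype=float)
>>>     return (prevs + eps) / (eps * len(prevs) + prevs.sum())   # adds eps mass to every class, then renormalizes
>>> smooth_sketch([0.0, 1.0], eps=1/(2*100))   # no zero entries remain, so ratios such as RAE are well defined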
-Evaluates a quantification model according to a specific sample generation protocol and in terms of one -evaluation metric (error).
-model – a quantifier, instance of quapy.method.base.BaseQuantifier
protocol – quapy.protocol.AbstractProtocol; if this object is also an instance of quapy.protocol.OnLabelledCollectionProtocol, then the aggregation speed-up can be run. This is the protocol in charge of generating the samples in which the model is evaluated.
error_metric – a string representing the name(s) of an error function in qp.error -(e.g., ‘mae’), or a callable function implementing the error function itself.
aggr_speedup – whether or not to apply the speed-up. Set to “force” for applying it even if the number of -instances in the original collection on which the protocol acts is larger than the number of instances -in the samples to be generated. Set to True or “auto” (default) for letting QuaPy decide whether it is -convenient or not. Set to False to deactivate.
verbose – boolean, show or not information in stdout
if the error metric is not averaged (e.g., ‘ae’, ‘rae’), returns an array of shape (n_samples,) with -the error scores for each sample; if the error metric is averaged (e.g., ‘mae’, ‘mrae’) then returns -a single float
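A typical call might look as follows. This is a hedged sketch: quantifier and test are placeholders for an already trained quantifier and a held-out LabelledCollection, and the APP protocol used here is documented further below.
>>> import quapy as qp
>>> from quapy.protocol import APP
>>> # 'quantifier' is assumed to be a trained BaseQuantifier; 'test' a LabelledCollection
>>> mae = qp.evaluation.evaluate(quantifier, protocol=APP(test, sample_size=100), error_metric='mae')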
-Evaluates a quantification model on a given set of samples and in terms of one evaluation metric (error).
-model – a quantifier, instance of quapy.method.base.BaseQuantifier
samples – a list of samples on which the quantifier is to be evaluated
error_metric – a string representing the name(s) of an error function in qp.error -(e.g., ‘mae’), or a callable function implementing the error function itself.
verbose – boolean, show or not information in stdout
if the error metric is not averaged (e.g., ‘ae’, ‘rae’), returns an array of shape (n_samples,) with -the error scores for each sample; if the error metric is averaged (e.g., ‘mae’, ‘mrae’) then returns -a single float
-Generates a report (a pandas’ DataFrame) containing information of the evaluation of the model as according -to a specific protocol and in terms of one or more evaluation metrics (errors).
-model – a quantifier, instance of quapy.method.base.BaseQuantifier
protocol – quapy.protocol.AbstractProtocol; if this object is also an instance of quapy.protocol.OnLabelledCollectionProtocol, then the aggregation speed-up can be run. This is the protocol in charge of generating the samples in which the model is evaluated.
error_metrics – a string, or list of strings, representing the name(s) of an error function in qp.error -(e.g., ‘mae’, the default value), or a callable function, or a list of callable functions, implementing -the error function itself.
aggr_speedup – whether or not to apply the speed-up. Set to “force” for applying it even if the number of -instances in the original collection on which the protocol acts is larger than the number of instances -in the samples to be generated. Set to True or “auto” (default) for letting QuaPy decide whether it is -convenient or not. Set to False to deactivate.
verbose – boolean, show or not information in stdout
a pandas’ DataFrame containing the columns ‘true-prev’ (the true prevalence of each sample), -‘estim-prev’ (the prevalence estimated by the model for each sample), and as many columns as error metrics -have been indicated, each displaying the score in terms of that metric for every sample.
-Uses a quantification model to generate predictions for the samples generated via a specific protocol. -This function is central to all evaluation processes, and is endowed with an optimization to speed-up the -prediction of protocols that generate samples from a large collection. The optimization applies to aggregative -quantifiers only, and to OnLabelledCollectionProtocol protocols, and comes down to generating the classification -predictions once and for all, and then generating samples over the classification predictions (instead of over -the raw instances), so that the classifier prediction is never called again. This behaviour is obtained by -setting aggr_speedup to ‘auto’ or True, and is only carried out if the overall process is convenient in terms -of computations (e.g., if the number of classification predictions needed for the original collection exceed the -number of classification predictions needed for all samples, then the optimization is not undertaken).
-model – a quantifier, instance of quapy.method.base.BaseQuantifier
protocol – quapy.protocol.AbstractProtocol; if this object is also an instance of quapy.protocol.OnLabelledCollectionProtocol, then the aggregation speed-up can be run. This is the protocol in charge of generating the samples for which the model has to issue class prevalence predictions.
aggr_speedup – whether or not to apply the speed-up. Set to “force” for applying it even if the number of -instances in the original collection on which the protocol acts is larger than the number of instances -in the samples to be generated. Set to True or “auto” (default) for letting QuaPy decide whether it is -convenient or not. Set to False to deactivate.
verbose – boolean, show or not information in stdout
a tuple (true_prevs, estim_prevs) in which each element in the tuple is an array of shape -(n_samples, n_classes) containing the true, or predicted, prevalence values for each sample
Computes the Hellinger Distance (HD) between (discretized) distributions P and Q. The HD for two discrete distributions of k bins is defined as:
-P – real-valued array-like of shape (k,) representing a discrete distribution
Q – real-valued array-like of shape (k,) representing a discrete distribution
float
-Topsoe distance between two (discretized) distributions P and Q. -The Topsoe distance for two discrete distributions of k bins is defined as:
-P – real-valued array-like of shape (k,) representing a discrete distribution
Q – real-valued array-like of shape (k,) representing a discrete distribution
float
-Searches for the prevalence vector that minimizes a loss function.
-loss – callable, the function to minimize
n_classes – int, number of classes
method – string indicating the search strategy. Possible values are:
‘optim_minimize’: uses scipy.optimize
‘linear_search’: carries out a linear search for binary problems in the space [0, 0.01, 0.02, …, 1]
‘ternary_search’: implements the ternary search (not yet implemented)
np.ndarray, a prevalence vector
-Helper that, given a float representing the prevalence for the positive class, returns a np.ndarray of two -values representing a binary distribution.
-positive_prevalence – float or array-like of floats with the prevalence for the positive class
clip_if_necessary (bool) – if True, clips the value in [0,1] in order to guarantee the resulting distribution -is valid. If False, it then checks that the value is in the valid range, and raises an error if not.
np.ndarray of shape (2,)
Checks that prevalences is a valid prevalence vector, i.e., that it contains values in [0,1] and that the values sum up to 1. In other words, verifies that the prevalence vector lies in the probability simplex.
-prevalences (ArrayLike) – the prevalence vector, or vectors, to check
raise_exception (bool) – whether to raise an exception if the vector (or any of the vectors) does -not lie in the simplex (default False)
tolerance (float) – error tolerance for the check sum(prevalences) - 1 = 0
aggr (bool) – if True (default) returns one single bool (True if all prevalence vectors are valid, -False otherwise), if False returns an array of bool, one for each prevalence vector
a single bool True if prevalences is a vector of prevalence values that lies on the simplex, -or False otherwise; alternatively, if prevalences is a matrix of shape (num_vectors, n_classes,) -then it returns one such bool for each prevalence vector
-Clips the values in [0,1] and then applies the L1 normalization.
-prevalences – array-like of shape (n_classes,) or of shape (n_samples, n_classes,) with prevalence values
-np.ndarray representing a valid distribution
-Applies the softmax function only to vectors that do not represent valid distributions.
-prevalences – array-like of shape (n_classes,) or of shape (n_samples, n_classes,) with prevalence values
-np.ndarray representing a valid distribution
-Computes the raw count values from a vector of labels.
-labels – array-like of shape (n_instances,) with the label for each instance
classes – the class labels. This is needed in order to correctly compute the prevalence vector even when -some classes have no examples.
ndarray of shape (len(classes),) with the raw counts for each class, in the same order -as they appear in classes
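A minimal sketch of this counting logic (illustrative only; the library implementation may differ):
>>> import numpy as np
>>> def counts_sketch(labels, classes):
>>>     labels = np.asarray(labels)
>>>     return np.array([np.sum(labels == c) for c in classes])   # classes without examples get a 0 count
>>> counts_sketch(['a', 'a', 'c'], classes=['a', 'b', 'c'])        # -> array([2, 0, 1])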
-Guarantees that the divergence received as argument is a function. That is, if this argument is already -a callable, then it is returned, if it is instead a string, then tries to instantiate the corresponding -divergence from the string name.
-divergence – callable or string indicating the name of the divergence function
-callable
-Searches for the largest number of (equidistant) prevalence points to define for each of the n_classes classes so -that the number of valid prevalence values generated as combinations of prevalence points (points in a -n_classes-dimensional simplex) do not exceed combinations_budget.
-combinations_budget (int) – maximum number of combinations allowed
n_classes (int) – number of classes
n_repeats (int) – number of repetitions for each prevalence combination
the largest number of prevalence points that generate less than combinations_budget valid prevalences
-Applies L1 normalization to the unnormalized_arr so that it becomes a valid prevalence -vector. Zero vectors are mapped onto the uniform distribution. Raises an exception if -the resulting vectors are not valid distributions. This may happen when the original -prevalence vectors contain negative values. Use the clip normalization function -instead to avoid this possibility.
-prevalences – array-like of shape (n_classes,) or of shape (n_samples, n_classes,) with prevalence values
-np.ndarray representing a valid distribution
Performs a linear search for the best prevalence value in binary problems. The search is carried out by exploring the range [0,1] stepping by 0.01. This search is inefficient, and is added only for completeness (some of the early methods in the quantification literature used it, e.g., HDy). A more powerful alternative is optim_minimize.
-loss – (callable) the function to minimize
n_classes – (int) the number of classes, i.e., the dimensionality of the prevalence vector
(ndarray) the best prevalence vector found
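The search described above can be sketched as follows (illustrative code, assuming the loss takes a binary prevalence vector [negative, positive]):
>>> import numpy as np
>>> def linear_search_sketch(loss):
>>>     grid = np.linspace(0., 1., 101)                               # [0, 0.01, ..., 1]
>>>     best = min(grid, key=lambda p: loss(np.array([1 - p, p])))    # brute-force search over the grid
>>>     return np.array([1 - best, best])
>>> linear_search_sketch(lambda prev: abs(prev[1] - 0.37))            # recovers approximately [0.63, 0.37]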
-Normalizes a vector or matrix of prevalence values. The normalization consists of applying a L1 normalization in -cases in which the prevalence values are not all-zeros, and to convert the prevalence values into 1/n_classes in -cases in which all values are zero.
-prevalences – array-like of shape (n_classes,) or of shape (n_samples, n_classes,) with prevalence values
method (str) –
indicates the normalization method to employ, options are:
-l1: applies L1 normalization (default); a 0 vector is mapped onto the uniform prevalence
clip: clip values in [0,1] and then rescales so that the L1 norm is 1
mapsimplex: projects vectors onto the probability simplex. This implementation relies on -Mathieu Blondel’s projection_simplex_sort
softmax: applies softmax to all vectors
condsoftmax: applies softmax only to invalid prevalence vectors
a normalized vector or matrix of prevalence values
Computes the number of valid prevalence combinations in the n_classes-dimensional simplex if n_prevpoints equally distant prevalence values are generated and n_repeats repetitions are requested. The computation comes down to calculating \(\binom{N+C-1}{C-1}\times r\), where N is n_prevpoints-1, i.e., the number of probability mass blocks to allocate, C is the number of classes, and r is n_repeats. This solution comes from the Stars and Bars problem.
-n_classes (int) – number of classes
n_prevpoints (int) – number of prevalence points.
n_repeats (int) – number of repetitions for each prevalence combination
The number of possible combinations. For example, if `n_classes`=2, `n_prevpoints`=5, `n_repeats`=1, then the number of possible combinations is 5, i.e.: [0, 1], [0.25, 0.75], [0.50, 0.50], [0.75, 0.25], and [1.0, 0.0]
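The stars-and-bars count above can be checked numerically; the following is an illustrative sketch (function name is hypothetical), reproducing the example for 2 classes and 5 prevalence points:
>>> from scipy.special import binom
>>> def num_combinations_sketch(n_classes, n_prevpoints, n_repeats=1):
>>>     N, C, r = n_prevpoints - 1, n_classes, n_repeats      # blocks to allocate, classes, repeats
>>>     return int(binom(N + C - 1, C - 1) * r)               # stars and bars, times the repeats
>>> num_combinations_sketch(n_classes=2, n_prevpoints=5, n_repeats=1)   # -> 5, as in the example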
Searches for the optimal prevalence values, i.e., an n_classes-dimensional vector of the (n_classes-1)-simplex that yields the smallest loss. This optimization is carried out by means of a constrained search using scipy’s SLSQP routine.
-loss – (callable) the function to minimize
n_classes – (int) the number of classes, i.e., the dimensionality of the prevalence vector
(ndarray) the best prevalence vector found
-Computes the prevalence values from a vector of labels.
-labels – array-like of shape (n_instances,) with the label for each instance
classes – the class labels. This is needed in order to correctly compute the prevalence vector even when -some classes have no examples.
ndarray of shape (len(classes),) with the class proportions for each class, in the same order -as they appear in classes
-Returns a vector of prevalence values from a matrix of posterior probabilities.
-posteriors – array-like of shape (n_instances, n_classes,) with posterior probabilities for each class
binarize – set to True (default is False) for computing the prevalence values on crisp decisions (i.e., -converting the vectors of posterior probabilities into class indices, by taking the argmax).
array of shape (n_classes,) containing the prevalence values
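The two behaviours (soft averaging vs. binarized counting) can be illustrated with plain NumPy:
>>> import numpy as np
>>> posteriors = np.array([[0.8, 0.2], [0.6, 0.4], [0.3, 0.7]])
>>> posteriors.mean(axis=0)                                                 # soft prevalence: average posterior per class
>>> np.bincount(posteriors.argmax(axis=1), minlength=2) / len(posteriors)   # binarize=True: count crisp decisions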
-Produces an array of uniformly separated values of prevalence. -By default, produces an array of 21 prevalence values, with -step 0.05 and with the limits smoothed, i.e.: -[0.01, 0.05, 0.10, 0.15, …, 0.90, 0.95, 0.99]
-grid_points – the number of prevalence values to sample from the [0,1] interval (default 21)
repeats – number of times each prevalence is to be repeated (defaults to 1)
smooth_limits_epsilon – the quantity to add and subtract to the limits 0 and 1
an array of uniformly separated prevalence values
-Projects a point onto the probability simplex.
The code is adapted from Mathieu Blondel’s BSD-licensed implementation (see function projection_simplex_sort in their repo), which accompanies the paper:
Mathieu Blondel, Akinori Fujino, and Naonori Ueda. Large-scale Multiclass Support Vector Machine Training via Euclidean Projection onto the Simplex, ICPR 2014, URL
-unnormalized_arr – point in n-dimensional space, shape (n,)
-projection of unnormalized_arr onto the (n-1)-dimensional probability simplex, shape (n,)
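A compact sketch of the sort-based projection (adapted from the cited algorithm; this is an illustration, not the library's exact code):
>>> import numpy as np
>>> def project_simplex_sketch(v, z=1.0):
>>>     u = np.sort(v)[::-1]                                                  # sort descending
>>>     css = np.cumsum(u)
>>>     rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - z))[0][-1]     # last index satisfying the condition
>>>     theta = (css[rho] - z) / (rho + 1.0)                                  # threshold to subtract
>>>     return np.maximum(v - theta, 0.0)
>>> project_simplex_sketch(np.array([0.6, 0.6]))                              # -> array([0.5, 0.5])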
-Applies the softmax function to all vectors even if the original vectors were valid distributions. -If you want to leave valid vectors untouched, use condsoftmax instead.
-prevalences – array-like of shape (n_classes,) or of shape (n_samples, n_classes,) with prevalence values
-np.ndarray representing a valid distribution
-Function that tries to solve for \(p\) the equation \(q = M p\), where \(q\) is the vector of -unadjusted counts (as estimated, e.g., via classify and count) with \(q_i\) an estimate of -\(P(\hat{Y}=y_i)\), and where \(M\) is the matrix of class-conditional rates with \(M_{ij}\) an -estimate of \(P(\hat{Y}=y_i|Y=y_j)\).
-class_conditional_rates – array of shape (n_classes, n_classes,) with entry (i,j) being the estimate -of \(P(\hat{Y}=y_i|Y=y_j)\), that is, the probability that an instance that belongs to class \(y_j\) -ends up being classified as belonging to class \(y_i\)
unadjusted_counts – array of shape (n_classes,) containing the unadjusted prevalence values (e.g., as -estimated by CC or PCC)
method (str) –
indicates the adjustment method to be used. Valid options are:
-inversion: tries to solve the equation \(q = M p\) as \(p = M^{-1} q\) where -\(M^{-1}\) is the matrix inversion of \(M\). This inversion may not exist in -degenerated cases.
invariant-ratio: invariant ratio estimator of Vaz et al. 2018, -which replaces the last equation in \(M\) with the normalization condition (i.e., that the sum of -all prevalence values must equal 1).
solver (str) –
the method to use for solving the system of linear equations. Valid options are:
-exact-raise: tries to solve the system using matrix inversion. Raises an error if the matrix has rank -strictly lower than n_classes.
exact-cc: if the matrix is not full rank, returns \(q\) (i.e., the unadjusted counts) as the estimates
exact: deprecated, defaults to ‘exact-cc’ (will be removed in future versions)
minimize: minimizes a loss, so the solution always exists
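For the ‘inversion’ strategy, the equation \(q = M p\) can be solved directly when \(M\) is full rank. An illustrative sketch with toy numbers (not library code):
>>> import numpy as np
>>> M = np.array([[0.9, 0.2],    # column j holds P(predicted=i | true=j); columns sum to 1
>>>               [0.1, 0.8]])
>>> q = np.array([0.76, 0.24])   # unadjusted classify-and-count estimates
>>> np.linalg.solve(M, q)        # solves q = M p  ->  array([0.8, 0.2])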
Implements the adjustment of ACC and PACC for the binary case. The adjustment for a prevalence estimate of the positive class p comes down to computing \((p - \mathrm{fpr}) / (\mathrm{tpr} - \mathrm{fpr})\), clipped to [0,1] if clip is set.
-prevalence_estim (float) – the estimated value for the positive class (p in the formula)
tpr (float) – the true positive rate of the classifier
fpr (float) – the false positive rate of the classifier
clip (bool) – set to True (default) to clip values that might exceed the range [0,1]
float, the adjusted count
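Continuing the toy numbers from the previous sketch, the binary adjustment amounts to the following (function name is illustrative):
>>> def adjust_binary_sketch(prevalence_estim, tpr, fpr, clip=True):
>>>     adjusted = (prevalence_estim - fpr) / (tpr - fpr)      # inverts p_hat = tpr*p + fpr*(1-p)
>>>     return min(max(adjusted, 0.0), 1.0) if clip else adjusted
>>> adjust_binary_sketch(prevalence_estim=0.24, tpr=0.8, fpr=0.1)   # -> 0.2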
-Returns a string representation for a prevalence vector. E.g.,
>>> strprev([1/3, 2/3], prec=2)
'[0.33, 0.67]'
prevalences – array-like of prevalence values
prec – int, indicates the float precision (number of decimal values to print)
string
-Returns a vector representing the uniform distribution for n_classes
-n_classes – number of classes
-np.ndarray with all values 1/n_classes
Implements the Kraemer algorithm for sampling uniformly at random from the unit simplex. This implementation is adapted from this post: https://cs.stackexchange.com/questions/3227/uniform-sampling-from-a-simplex
-n_classes – integer, number of classes (dimensionality of the simplex)
size – number of samples to return
np.ndarray of shape (size, n_classes,) if size>1, or of shape (n_classes,) otherwise
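The Kraemer algorithm can be sketched as sorting uniform cut points and taking the gaps between them (illustrative code; the library implementation may differ):
>>> import numpy as np
>>> def kraemer_sketch(n_classes, size=1, seed=0):
>>>     rng = np.random.default_rng(seed)
>>>     cuts = np.sort(rng.uniform(size=(size, n_classes - 1)), axis=-1)        # random cut points in [0,1]
>>>     bounded = np.hstack([np.zeros((size, 1)), cuts, np.ones((size, 1))])
>>>     samples = np.diff(bounded, axis=-1)                                     # gaps sum to 1 per row
>>>     return samples[0] if size == 1 else samples
>>> kraemer_sketch(n_classes=3, size=2)   # two prevalence vectors drawn uniformly from the 2-simplex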
Implements the Kraemer algorithm for sampling uniformly at random from the unit simplex. This implementation is adapted from this post: https://cs.stackexchange.com/questions/3227/uniform-sampling-from-a-simplex
-n_classes – integer, number of classes (dimensionality of the simplex)
size – number of samples to return
np.ndarray of shape (size, n_classes,) if size>1, or of shape (n_classes,) otherwise
-Bases: object
Bases: BaseQuantifier
Grid Search optimization targeting a quantification-oriented metric.
-Optimizes the hyperparameters of a quantification method, based on an evaluation method and on an evaluation -protocol for quantification.
-model (BaseQuantifier) – the quantifier to optimize
param_grid – a dictionary with keys the parameter names and values the list of values to explore
protocol – a sample generation protocol, an instance of quapy.protocol.AbstractProtocol
error – an error function (callable) or a string indicating the name of an error function (valid ones are those in quapy.error.QUANTIFICATION_ERROR)
refit – whether to refit the model on the whole labelled collection (training+validation) with -the best chosen hyperparameter combination. Ignored if protocol=’gen’
timeout – establishes a timer (in seconds) for each of the hyperparameters configurations being tested. -Whenever a run takes longer than this timer, that configuration will be ignored. If all configurations end up -being ignored, a TimeoutError exception is raised. If -1 (default) then no time bound is set.
raise_errors – boolean, if True then raises an exception when a param combination yields any error, if -otherwise is False (default), then the combination is marked with an error status, but the process goes on. -However, if no configuration yields a valid model, then a ValueError exception will be raised.
verbose – set to True to get information through the stdout
Returns the best model found after calling the fit() method, i.e., the one trained on the combination of hyper-parameters that minimized the error function.
a trained quantifier
Learning routine. Fits the quantifier with all combinations of hyper-parameters and selects the one minimizing the error metric.
-training – the training set on which to optimize the hyperparameters
-self
-Returns the dictionary of hyper-parameters to explore (param_grid)
-deep – Unused
-the dictionary param_grid
Estimate class prevalence values using the best model found after calling the fit() method.
instances – sample containing the instances
-a ndarray of shape (n_classes) with class prevalence estimates as according to the best model found -by the model selection process.
-Bases: Enum
An enumeration.
-Akin to scikit-learn’s cross_val_predict -but for quantification.
-quantifier – a quantifier issuing class prevalence values
data – a labelled collection
nfolds – number of folds for k-fold cross validation generation
random_state – random seed for reproducibility
a vector of class prevalence values
-Expands a param_grid dictionary as a list of configurations. -Example:
>>> combinations = expand_grid({'A': [1, 10, 100], 'B': [True, False]})
>>> print(combinations)
[{'A': 1, 'B': True}, {'A': 1, 'B': False}, {'A': 10, 'B': True}, {'A': 10, 'B': False}, {'A': 100, 'B': True}, {'A': 100, 'B': False}]
param_grid – dictionary with keys representing hyper-parameter names, and values representing the range -to explore for that hyper-parameter
-a list of configurations, i.e., combinations of hyper-parameter assignments in the grid.
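The expansion can be reproduced with itertools.product; the following is an equivalent sketch (not the library's code):
>>> from itertools import product
>>> def expand_grid_sketch(param_grid):
>>>     keys = sorted(param_grid.keys())
>>>     return [dict(zip(keys, values)) for values in product(*(param_grid[k] for k in keys))]
>>> expand_grid_sketch({'A': [1, 10], 'B': [True, False]})
>>> # -> [{'A': 1, 'B': True}, {'A': 1, 'B': False}, {'A': 10, 'B': True}, {'A': 10, 'B': False}]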
Partitions a param_grid dictionary as two lists of configurations, one for the classifier-specific hyper-parameters, and another for the quantifier-specific hyper-parameters
-param_grid – dictionary with keys representing hyper-parameter names, and values representing the range -to explore for that hyper-parameter
-two expanded grids of configurations, one for the classifier, another for the quantifier
Box-plots displaying the local bias (i.e., signed error computed as the estimated value minus the true value) for different bins of (true) prevalence of the positive class, for each quantification method.
-method_names – array-like with the method names for each experiment
true_prevs – array-like with the true prevalence values (each being a ndarray with n_classes components) for -each experiment
estim_prevs – array-like with the estimated prevalence values (each being a ndarray with n_classes components) -for each experiment
pos_class – index of the positive class
title – the title to be displayed in the plot
nbins – number of bins
colormap – the matplotlib colormap to use (default cm.tab10)
vertical_xticks – whether or not to add secondary grid (default is False)
legend – whether or not to display the legend (default is True)
savepath – path where to save the plot. If not indicated (as default), the plot is shown.
Box-plots displaying the global bias (i.e., signed error computed as the estimated value minus the true value) -for each quantification method with respect to a given positive class.
-method_names – array-like with the method names for each experiment
true_prevs – array-like with the true prevalence values (each being a ndarray with n_classes components) for -each experiment
estim_prevs – array-like with the estimated prevalence values (each being a ndarray with n_classes components) -for each experiment
pos_class – index of the positive class
title – the title to be displayed in the plot
savepath – path where to save the plot. If not indicated (as default), the plot is shown.
The diagonal plot displays the predicted prevalence values (along the y-axis) as a function of the true prevalence values (along the x-axis). The optimal quantifier is described by the diagonal (0,0)-(1,1) of the plot (hence the name). It is convenient for binary quantification problems, though it can be used for multiclass problems by indicating which class is to be taken as the positive class. (For multiclass quantification problems, other plots like the error_by_drift() might be preferable though.)
method_names – array-like with the method names for each experiment
true_prevs – array-like with the true prevalence values (each being a ndarray with n_classes components) for -each experiment
estim_prevs – array-like with the estimated prevalence values (each being a ndarray with n_classes components) -for each experiment
pos_class – index of the positive class
title – the title to be displayed in the plot
show_std – whether or not to show standard deviations (represented by color bands). This might be inconvenient -for cases in which many methods are compared, or when the standard deviations are high – default True)
legend – whether or not to display the legend (default True)
train_prev – if indicated (default is None), the training prevalence (for the positive class) is highlighted in the plot. This is convenient when all the experiments have been conducted in the same dataset.
savepath – path where to save the plot. If not indicated (as default), the plot is shown.
method_order – if indicated (default is None), imposes the order in which the methods are processed (i.e., -listed in the legend and associated with matplotlib colors).
Displays (only) the top performing methods for different regions of the train-test shift in the form of a broken bar chart, in which each method has bars only for those regions in which either one of the following conditions hold: (i) it is the best method (on average) for the bin, or (ii) it is not statistically significantly different (on average) according to a two-sided t-test on independent samples at confidence ttest_alpha. The binning can be made “isometric” (same size) or “isomerous” (same number of experiments – default). A second plot is displayed on top, which shows the distribution of experiments for each bin (when binning=”isometric”) or the percentile points of the distribution (when binning=”isomerous”).
-method_names – array-like with the method names for each experiment
true_prevs – array-like with the true prevalence values (each being a ndarray with n_classes components) for -each experiment
estim_prevs – array-like with the estimated prevalence values (each being a ndarray with n_classes components) -for each experiment
tr_prevs – training prevalence of each experiment
n_bins – number of bins in which the y-axis is to be divided (default is 20)
binning – type of binning, either “isomerous” (default) or “isometric”
x_error – a string representing the name of an error function (as defined in quapy.error) to be used for -measuring the amount of train-test shift (default is “ae”)
y_error – a string representing the name of an error function (as defined in quapy.error) to be used for -measuring the amount of error in the prevalence estimations (default is “ae”)
ttest_alpha – the confidence interval above which a p-value (two-sided t-test on independent samples) is -to be considered as an indicator that the two means are not statistically significantly different. Default is -0.005, meaning that a p-value > 0.005 indicates the two methods involved are to be considered similar
tail_density_threshold – sets a threshold on the density of experiments (over the total number of experiments) -below which a bin in the tail (i.e., the right-most ones) will be discarded. This is in order to avoid some -bins to be shown for train-test outliers.
method_order – if indicated (default is None), imposes the order in which the methods are processed (i.e., -listed in the legend and associated with matplotlib colors).
savepath – path where to save the plot. If not indicated (as default), the plot is shown.
Plots the error (along the x-axis, as measured in terms of error_name) as a function of the train-test shift (along the y-axis, as measured in terms of quapy.error.ae()). This plot is useful especially for multiclass problems, in which “diagonal plots” may be cumbersome, and in order to gain understanding about how methods fare in different regions of the prior probability shift spectrum (e.g., in the low-shift regime vs. in the high-shift regime).
method_names – array-like with the method names for each experiment
true_prevs – array-like with the true prevalence values (each being a ndarray with n_classes components) for -each experiment
estim_prevs – array-like with the estimated prevalence values (each being a ndarray with n_classes components) -for each experiment
tr_prevs – training prevalence of each experiment
n_bins – number of bins in which the y-axis is to be divided (default is 20)
error_name – a string representing the name of an error function (as defined in quapy.error, default is “ae”)
show_std – whether or not to show standard deviations as color bands (default is False)
show_density – whether or not to display the distribution of experiments for each bin (default is True)
show_legend – whether or not to display the legend of the chart (default is True)
logscale – whether or not to log-scale the y-error measure (default is False)
title – title of the plot (default is “Quantification error as a function of distribution shift”)
vlines – array-like list of values (default is None). If indicated, highlights some regions of the space -using vertical dotted lines.
method_order – if indicated (default is None), imposes the order in which the methods are processed (i.e., -listed in the legend and associated with matplotlib colors).
savepath – path where to save the plot. If not indicated (as default), the plot is shown.
Bases: AbstractStochasticSeededProtocol, OnLabelledCollectionProtocol
Implementation of the artificial prevalence protocol (APP). -The APP consists of exploring a grid of prevalence values containing n_prevalences points (e.g., -[0, 0.05, 0.1, 0.15, …, 1], if n_prevalences=21), and generating all valid combinations of -prevalence values for all classes (e.g., for 3 classes, samples with [0, 0, 1], [0, 0.05, 0.95], …, -[1, 0, 0] prevalence values of size sample_size will be yielded). The number of samples for each valid -combination of prevalence values is indicated by repeats.
-data – a LabelledCollection from which the samples will be drawn
sample_size – integer, number of instances in each sample; if None (default) then it is taken from -qp.environ[“SAMPLE_SIZE”]. If this is not set, a ValueError exception is raised.
n_prevalences – the number of equidistant prevalence points to extract from the [0,1] interval for the -grid (default is 21)
repeats – number of copies for each valid prevalence vector (default is 10)
smooth_limits_epsilon – the quantity to add and subtract to the limits 0 and 1
random_state – allows replicating samples across runs (default 0, meaning that the sequence of samples -will be the same every time the protocol is called)
sanity_check – int, raises an exception warning the user that the number of examples to be generated exceed -this number; set to None for skipping this check
return_type – set to “sample_prev” (default) to get the pairs of (sample, prevalence) at each iteration, or -to “labelled_collection” to get instead instances of LabelledCollection
Generates vectors of prevalence values from an exhaustive grid of prevalence values. The -number of prevalence values explored for each dimension depends on n_prevalences, so that, if, for example, -n_prevalences=11 then the prevalence values of the grid are taken from [0, 0.1, 0.2, …, 0.9, 1]. Only -valid prevalence distributions are returned, i.e., vectors of prevalence values that sum up to 1. For each -valid vector of prevalence values, repeat copies are returned. The vector of prevalence values can be -implicit (by setting return_constrained_dim=False), meaning that the last dimension (which is constrained -to 1 - sum of the rest) is not returned (note that, quite obviously, in this case the vector does not sum up to -1). Note that this method is deterministic, i.e., there is no random sampling anywhere.
-a np.ndarray of shape (n, dimensions) if return_constrained_dim=True or of shape -(n, dimensions-1) if return_constrained_dim=False, where n is the number of valid combinations found -in the grid multiplied by repeat
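The exhaustive grid can be sketched by filtering the Cartesian product of prevalence points down to the vectors that sum to 1 (an illustrative, brute-force version that is only practical for a small number of classes):
>>> import numpy as np
>>> from itertools import product
>>> def prevalence_grid_sketch(n_classes, n_prevalences=11):
>>>     points = np.linspace(0., 1., n_prevalences)                              # e.g., step 0.1
>>>     valid = [p for p in product(points, repeat=n_classes) if np.isclose(sum(p), 1.0)]
>>>     return np.array(valid)                                                   # only valid distributions are kept
>>> prevalence_grid_sketch(n_classes=3, n_prevalences=11).shape                  # -> (66, 3)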
-Realizes the sample given the index of the instances.
-index – indexes of the instances to select
-an instance of qp.data.LabelledCollection
Bases: object
Abstract parent class for sample generation protocols.
Bases: AbstractProtocol
An AbstractStochasticSeededProtocol is a protocol that generates, via any random procedure (e.g., via random sampling), sequences of quapy.data.base.LabelledCollection samples. The protocol abstraction enforces the object to be instantiated using a seed, so that the sequence can be fully replicated. In order to make this functionality possible, the classes extending this abstraction need to implement only two functions: samples_parameters(), which generates all the parameters needed for extracting the samples, and sample(), which, given some parameters as input, deterministically generates a sample.
random_state – the seed for allowing to replicate any sequence of samples. Default is 0, meaning that -the sequence will be consistent every time the protocol is called.
-The collator prepares the sample to accommodate the desired output format before returning the output. -This collator simply returns the sample as it is. Classes inheriting from this abstract class can -implement their custom collators.
-sample – the sample to be returned
args – additional arguments
the sample adhering to a desired output format (in this case, the sample is returned as it is)
-Bases: AbstractStochasticSeededProtocol
Generates mixtures of two domains (A and B) at controlled rates, but preserving the original class prevalence.
-domainA – one domain, an object of qp.data.LabelledCollection
domainB – another domain, an object of qp.data.LabelledCollection
sample_size – integer, the number of instances in each sample; if None (default) then it is taken from -qp.environ[“SAMPLE_SIZE”]. If this is not set, a ValueError exception is raised.
repeats – int, number of samples to draw for every mixture rate
prevalence – the prevalence to preserve along the mixtures. If specified, it should be an array containing one prevalence value (a positive float) for each class and summing up to one. If not specified, the prevalence will be taken from domain A (default).
mixture_points – an integer indicating the number of points to take from a linear scale (e.g., 21 will generate the mixture points [1, 0.95, 0.9, …, 0]), or an array containing the specific mixture values to use.
random_state – allows replicating samples across runs (default 0, meaning that the sequence of samples -will be the same every time the protocol is called)
Realizes the sample given a pair of indexes of the instances from A and B.
-indexes – indexes of the instances to select from A and B
-an instance of qp.data.LabelledCollection
Bases: AbstractProtocol
A very simple protocol which simply iterates over a list of previously generated samples
-samples – a list of quapy.data.base.LabelledCollection
Bases: AbstractStochasticSeededProtocol, OnLabelledCollectionProtocol
A generator of samples that implements the natural prevalence protocol (NPP). The NPP consists of drawing -samples uniformly at random, therefore approximately preserving the natural prevalence of the collection.
-data – a LabelledCollection from which the samples will be drawn
sample_size – integer, the number of instances in each sample; if None (default) then it is taken from -qp.environ[“SAMPLE_SIZE”]. If this is not set, a ValueError exception is raised.
repeats – the number of samples to generate. Default is 100.
random_state – allows replicating samples across runs (default 0, meaning that the sequence of samples -will be the same every time the protocol is called)
return_type – set to “sample_prev” (default) to get the pairs of (sample, prevalence) at each iteration, or -to “labelled_collection” to get instead instances of LabelledCollection
Realizes the sample given the index of the instances.
-index – indexes of the instances to select
-an instance of qp.data.LabelledCollection
Bases: object
Protocols that generate samples from a qp.data.LabelledCollection object.
Returns a collator function, i.e., a function that prepares the yielded data.
return_type – either ‘sample_prev’ (default) if the collator is requested to yield tuples of (sample, prevalence), or ‘labelled_collection’ when it is requested to yield instances of qp.data.LabelledCollection
the collator function (a callable function that takes as input an instance of qp.data.LabelledCollection)
Returns the labelled collection on which this protocol acts.
-an object of type qp.data.LabelledCollection
Returns a copy of this protocol that acts on a modified version of the original qp.data.LabelledCollection in which the original instances have been replaced with the outputs of a classifier for each instance. (This is convenient for speeding up the evaluation procedures for many samples, by pre-classifying the instances in advance.)
pre_classifications – the predictions issued by a classifier, typically an array-like -with shape (n_instances,) when the classifier is a hard one, or with shape -(n_instances, n_classes) when the classifier is a probabilistic one.
in_place – whether or not to apply the modification in-place or in a new copy (default).
a copy of this protocol
Bases: AbstractStochasticSeededProtocol, OnLabelledCollectionProtocol
A variant of APP that, instead of using a grid of equidistant prevalence values, relies on the Kraemer algorithm for sampling the unit (k-1)-simplex uniformly at random, with k the number of classes. This protocol covers the entire range of prevalence values in a statistical sense, i.e., unlike APP there is no guarantee that it is covered precisely equally for all classes, but it is preferred in cases in which the number of possible combinations of the grid values of APP makes this endeavour intractable.
data – a LabelledCollection from which the samples will be drawn
sample_size – integer, the number of instances in each sample; if None (default) then it is taken from -qp.environ[“SAMPLE_SIZE”]. If this is not set, a ValueError exception is raised.
repeats – the number of samples to generate. Default is 100.
random_state – allows replicating samples across runs (default 0, meaning that the sequence of samples -will be the same every time the protocol is called)
return_type – set to “sample_prev” (default) to get the pairs of (sample, prevalence) at each iteration, or -to “labelled_collection” to get instead instances of LabelledCollection
Realizes the sample given the index of the instances.
-index – indexes of the instances to select
-an instance of qp.data.LabelledCollection
Bases: object
A class implementing the early-stopping condition typically used for training neural networks.
->>> earlystop = EarlyStop(patience=2, lower_is_better=True)
->>> earlystop(0.9, epoch=0)
->>> earlystop(0.7, epoch=1)
->>> earlystop.IMPROVED # is True
->>> earlystop(1.0, epoch=2)
->>> earlystop.STOP # is False (patience=1)
->>> earlystop(1.0, epoch=3)
->>> earlystop.STOP # is True (patience=0)
->>> earlystop.best_epoch # is 1
->>> earlystop.best_score # is 0.7
-
patience – the number of (consecutive) times that a monitored evaluation metric (typically obtained in a held-out validation split) can be found to be worse than the best one obtained so far, before flagging the stopping condition. An instance of this class is callable, and is to be used as follows:
lower_is_better – if True (default) the metric is to be minimized.
best_score – keeps track of the best value seen so far
best_epoch – keeps track of the epoch in which the best score was set
STOP – flag (boolean) indicating the stopping condition
IMPROVED – flag (boolean) indicating whether there was an improvement in the last call
An alias to os.makedirs(path, exist_ok=True) that also returns the path. This is useful in cases like, e.g.:
->>> path = create_if_not_exist(os.path.join(dir, subdir, anotherdir))
-
path – path to create
-the path itself
-Creates the parent dir (if any) of a given path, if not exists. E.g., for ./path/to/file.txt, the path ./path/to -is created.
-path – the path
-Downloads a file from a url
-url – the url
archive_filename – destination filename
Downloads a file (using download_file()) if the file does not exist.
url – the url
archive_filename – destination filename
Gets the home directory of QuaPy, i.e., the directory where QuaPy saves permanent data, such as downloaded datasets. This directory is ~/quapy_data
-a string representing the path
-Applies func to n_jobs slices of args. E.g., if args is an array of 99 items and n_jobs=2, then -func is applied in two parallel processes to args[0:50] and to args[50:99]. func is a function -that already works with a list of arguments.
-func – function to be parallelized
args – array-like of arguments to be passed to the function in different parallel calls
n_jobs – the number of workers
A wrapper of multiprocessing:
->>> Parallel(n_jobs=n_jobs)(
->>> delayed(func)(args_i) for args_i in args
->>> )
-
that takes the quapy.environ variable as input silently. -Seeds the child processes to ensure reproducibility when n_jobs>1.
-func – callable
args – args of func
seed – the numeric seed
asarray – set to True to return a np.ndarray instead of a list
backend – indicates the backend used for handling parallel works
open_args – if True, then the delayed function is called on *args_i, instead of on args_i
A wrapper of multiprocessing:
->>> Parallel(n_jobs=n_jobs)(
->>> delayed(func)(*args_i) for args_i in args
->>> )
-
that takes the quapy.environ variable as input silently. -Seeds the child processes to ensure reproducibility when n_jobs>1.
-func – callable
args – args of func
seed – the numeric seed
asarray – set to True to return a np.ndarray instead of a list
backend – indicates the backend used for handling parallel works
Allows for fast reuse of resources that are generated only once by calling generation_func(*args). The next times -this function is invoked, it loads the pickled resource. Example:
->>> def some_array(n): # a mock resource created with one parameter (`n`)
->>> return np.random.rand(n)
->>> pickled_resource('./my_array.pkl', some_array, 10) # the resource does not exist: it is created by calling some_array(10)
->>> pickled_resource('./my_array.pkl', some_array, 10) # the resource exists; it is loaded from './my_array.pkl'
-
pickle_path – the path where to save (first time) and load (next times) the resource
generation_func – the function that generates the resource, in case it does not exist in pickle_path
args – any arg that generation_func uses for generating the resources
the resource
-Saves a text file to disk, given its full path, and creates the parent directory if missing.
path – path where to save the file.
text – text to save.
Can be used in a “with” context to set a temporary seed without modifying numpy’s outer random state. E.g.:
->>> with temp_seed(random_seed):
->>> pass # do any computation depending on np.random functionality
-
random_state – the seed to set within the “with” context
-Opens a context that will launch an exception if not closed after a given number of seconds
->>> def func(start_msg, end_msg):
->>> print(start_msg)
->>> sleep(2)
->>> print(end_msg)
->>>
->>> with timeout(1):
->>> func('begin function', 'end function')
->>> Out[]
->>> begin function
->>> TimeoutError
-
seconds – number of seconds, set to <=0 to ignore the timer
-QuaPy module for quantification
-Bases: AggregativeCrispQuantifier
Adjusted Classify & Count, the “adjusted” variant of CC, which corrects the predictions of CC according to the misclassification rates.
classifier – a sklearn’s Estimator that generates a classifier
val_split – specifies the data used for generating classifier predictions. This specification -can be made as float in (0, 1) indicating the proportion of stratified held-out validation set to -be extracted from the training set; or as an integer (default 5), indicating that the predictions -are to be generated in a k-fold cross-validation manner (with this integer indicating the value -for k); or as a collection defining the specific set of data to use for validation. -Alternatively, this set can be specified at fit time by indicating the exact set of data -on which the predictions are to be generated.
method (str) –
adjustment method to be used:
-’inversion’: matrix inversion method based on the matrix equality \(P(C)=P(C|Y)P(Y)\), -which tries to invert \(P(C|Y)\) matrix.
’invariant-ratio’: invariant ratio estimator of Vaz et al. 2018, -which replaces the last equation with the normalization condition.
solver (str) –
indicates the method to use for solving the system of linear equations. Valid options are:
-’exact-raise’: tries to solve the system using matrix inversion. Raises an error if the matrix has rank -strictly less than n_classes.
’exact-cc’: if the matrix is not of full rank, returns p_c as the estimates, which corresponds to no adjustment (i.e., the classify and count method; see quapy.method.aggregative.CC)
’exact’: deprecated, defaults to ‘exact-cc’
’minimize’: minimizes the L2 norm of \(|Ax-B|\). This one generally works better, and is the -default parameter. More details about this can be consulted in Bunse, M. “On Multi-Class Extensions of -Adjusted Classify and Count”, on proceedings of the 2nd International Workshop on Learning to Quantify: -Methods and Applications (LQ 2022), ECML/PKDD 2022, Grenoble (France).
norm (str) –
the method to use for normalization.
-clip, the values are clipped to the range [0,1] and then L1-normalized.
mapsimplex projects vectors onto the probability simplex. This implementation relies on -Mathieu Blondel’s projection_simplex_sort
condsoftmax, applies a softmax normalization only to prevalence vectors that lie outside the simplex
n_jobs – number of parallel workers
Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
Estimates the misclassification rates.
classif_predictions – a quapy.data.base.LabelledCollection containing, as instances, the label predictions issued by the classifier and, as labels, the true labels
data – a quapy.data.base.LabelledCollection consisting of the training data
Estimates the matrix with entry (i,j) being the estimate of \(P(\hat{y}_i|y_j)\), that is, the probability that a document that belongs to class \(y_j\) ends up being classified as belonging to class \(y_i\)
-classes – array-like with the class names
y – array-like with the true labels
y_ – array-like with the estimated labels
np.ndarray
-Constructs a quantifier that implements the Invariant Ratio Estimator of -Vaz et al. 2018. This amounts -to setting method to ‘invariant-ratio’ and clipping to ‘project’.
-classifier – a sklearn’s Estimator that generates a classifier
val_split – specifies the data used for generating classifier predictions. This specification can be made as float in (0, 1) indicating the proportion of stratified held-out validation set to be extracted from the training set; or as an integer (default 5), indicating that the predictions are to be generated in a k-fold cross-validation manner (with this integer indicating the value for k); or as a collection defining the specific set of data to use for validation. Alternatively, this set can be specified at fit time by indicating the exact set of data on which the predictions are to be generated.
n_jobs – number of parallel workers
an instance of ACC configured so that it implements the Invariant Ratio Estimator
Bases: AggregativeQuantifier, ABC
Abstract class for quantification methods that base their estimations on the aggregation of crisp decisions -as returned by a hard classifier. Aggregative crisp quantifiers thus extend Aggregative -Quantifiers by implementing specifications about crisp predictions.
-Bases: BinaryQuantifier
This method is a meta-quantifier that returns, as the estimated class prevalence values, the median of the -estimation returned by differently (hyper)parameterized base quantifiers. -The median of unit-vectors is only guaranteed to be a unit-vector for n=2 dimensions, -i.e., in cases of binary quantification.
-base_quantifier – the base, binary quantifier
random_state – a seed to be set before fitting any base quantifier (default None)
param_grid – the grid or parameters towards which the median will be computed
n_jobs – number of parallel workers
Trains a quantifier.
data – a quapy.data.base.LabelledCollection consisting of the training data
self
-Get parameters for this estimator.
-deep (bool, default=True) – If True, will return the parameters for this estimator and -contained subobjects that are estimators.
-params – Parameter names mapped to their values.
-dict
-Generate class prevalence estimates for the sample’s instances
-instances – array-like
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
**params (dict) – Estimator parameters.
-self – Estimator instance.
-estimator instance
Bases: BaseQuantifier, ABC
Abstract class for quantification methods that base their estimations on the aggregation of classification results. Aggregative quantifiers implement a pipeline that consists of generating classification predictions and aggregating them. For this reason, the training phase is implemented by classifier_fit_predict() followed by aggregation_fit(), while the testing phase is implemented by classify() followed by aggregate(). Subclasses of this abstract class must provide implementations for these methods. Aggregative quantifiers also maintain a classifier attribute.
The method fit() comes with a default implementation based on classifier_fit_predict() and aggregation_fit().
The method quantify() comes with a default implementation based on classify() and aggregate().
Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Trains the aggregation function.
-classif_predictions – a quapy.data.base.LabelledCollection
containing,
-as instances, the predictions issued by the classifier and, as labels, the true labels
data – a quapy.data.base.LabelledCollection
consisting of the training data
Class labels, in the same order in which class prevalence values are to be computed. -This default implementation actually returns the class labels of the learner.
-array-like
-Gives access to the classifier
-the classifier (typically an sklearn’s Estimator)
Trains the classifier if requested (fit_classifier=True) and generates the necessary predictions to train the aggregation function.
-data – a quapy.data.base.LabelledCollection
consisting of the training data
fit_classifier – whether to train the learner (default is True). Set to False if the -learner has been trained outside the quantifier.
predict_on – specifies the set on which predictions need to be issued. This parameter can -be specified as None (default) to indicate no prediction is needed; a float in (0, 1) to -indicate the proportion of instances to be used for predictions (the remainder is used for -training); an integer >1 to indicate that the predictions must be generated via k-fold -cross-validation, using this integer as k; or the data sample itself on which to generate -the predictions.
Provides the label predictions for the given instances. The predictions should respect the format expected by
-aggregate()
, e.g., posterior probabilities for probabilistic quantifiers, or crisp predictions for
-non-probabilistic quantifiers. The default one is “decision_function”.
instances – array-like of shape (n_instances, n_features,)
-np.ndarray of shape (n_instances,) with label predictions
-Trains the aggregative quantifier. This comes down to training a classifier and an aggregation function.
-data – a quapy.data.base.LabelledCollection
consisting of the training data
fit_classifier – whether to train the learner (default is True). Set to False if the -learner has been trained outside the quantifier.
val_split – specifies the data used for generating classifier predictions. This specification -can be made as float in (0, 1) indicating the proportion of stratified held-out validation set to -be extracted from the training set; or as an integer (default 5), indicating that the predictions -are to be generated in a k-fold cross-validation manner (with this integer indicating the value -for k); or as a collection defining the specific set of data to use for validation. -Alternatively, this set can be specified at fit time by indicating the exact set of data -on which the predictions are to be generated.
self
-Generate class prevalence estimates for the sample’s instances by aggregating the label predictions generated -by the classifier.
-instances – array-like
-np.ndarray of shape (n_classes) with class prevalence estimates.
-Bases: AggregativeQuantifier
, ABC
Abstract class for quantification methods that base their estimations on the aggregation of posterior -probabilities as returned by a probabilistic classifier. -Aggregative soft quantifiers thus extend Aggregative Quantifiers by implementing specifications -about soft predictions.
-Bases: AggregativeCrispQuantifier
Bayesian quantification method,
-which is a variant of ACC
that calculates the posterior probability distribution
-over the prevalence vectors, rather than providing a point estimate obtained
-by matrix inversion.
Can be used to diagnose degeneracy in the predictions visible when the confusion -matrix has high condition number or to quantify uncertainty around the point estimate.
-This method relies on extra dependencies, which have to be installed via: -$ pip install quapy[bayes]
-classifier – a sklearn’s Estimator that generates a classifier
val_split – a float in (0, 1) indicating the proportion of the training data to be used, -as a stratified held-out validation set, for generating classifier predictions.
num_warmup – number of warmup iterations for the MCMC sampler (default 500)
num_samples – number of samples to draw from the posterior (default 1000)
mcmc_seed – random seed for the MCMC sampler (default 0)
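A usage sketch with the hyperparameters documented above (the val_split value is just an example):
>>> # $ pip install quapy[bayes]
>>> from quapy.method.aggregative import BayesianCC
>>> from sklearn.linear_model import LogisticRegression
>>>
>>> bayes_cc = BayesianCC(LogisticRegression(), val_split=0.3, num_warmup=500, num_samples=1000, mcmc_seed=0)
>>> # bayes_cc.fit(training_data)
>>> # point_estimate = bayes_cc.quantify(test_instances)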
Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Estimates the misclassification rates.
-classif_predictions – a quapy.data.base.LabelledCollection
containing,
-as instances, the label predictions issued by the classifier and, as labels, the true labels
data – a quapy.data.base.LabelledCollection
consisting of the training data
Bases: AggregativeQuantifier
, BinaryQuantifier
Trains the aggregative quantifier. This comes down to training a classifier and an aggregation function.
-data – a quapy.data.base.LabelledCollection
consisting of the training data
fit_classifier – whether to train the learner (default is True). Set to False if the -learner has been trained outside the quantifier.
val_split – specifies the data used for generating classifier predictions. This specification -can be made as float in (0, 1) indicating the proportion of stratified held-out validation set to -be extracted from the training set; or as an integer (default 5), indicating that the predictions -are to be generated in a k-fold cross-validation manner (with this integer indicating the value -for k); or as a collection defining the specific set of data to use for validation. -Alternatively, this set can be specified at fit time by indicating the exact set of data -on which the predictions are to be generated.
self
-Bases: AggregativeCrispQuantifier
The most basic Quantification method. One that simply classifies all instances and counts how many have been -attributed to each of the classes in order to compute class prevalence estimates.
-classifier – a sklearn’s Estimator that generates a classifier
-Computes class prevalence estimates by counting the prevalence of each of the predicted labels.
-classif_predictions – array-like with label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Nothing to do here!
-classif_predictions – not used
data – not used
Bases: AggregativeSoftQuantifier
Generic Distribution Matching quantifier for binary or multiclass quantification based on the space of posterior -probabilities. This implementation takes the number of bins, the divergence, and the possibility to work on CDF -as hyperparameters.
-classifier – a sklearn’s Estimator that generates a probabilistic classifier
val_split – indicates the proportion of data to be used as a stratified held-out validation set to model the validation distribution. This parameter can be indicated as a real value (between 0 and 1), representing a proportion of validation data, or as an integer, indicating that the validation distribution should be estimated via k-fold cross-validation (this integer stands for the number of folds k, default 5), or as a quapy.data.base.LabelledCollection (the split itself).
nbins – number of bins used to discretize the distributions (default 8)
divergence – a string representing a divergence measure (currently, “HD” and “topsoe” are implemented) -or a callable function taking two ndarrays of the same dimension as input (default “HD”, meaning Hellinger -Distance)
cdf – whether to use CDF instead of PDF (default False)
n_jobs – number of parallel workers (default None)
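A usage sketch with the hyperparameters documented above (values are just examples):
>>> from quapy.method.aggregative import DMy
>>> from sklearn.linear_model import LogisticRegression
>>>
>>> dm = DMy(LogisticRegression(), val_split=5, nbins=8, divergence='HD', cdf=False)
>>> # dm.fit(training_data)
>>> # estim_prev = dm.quantify(test_instances)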
Searches for the mixture model parameter (the sought prevalence values) that yields a validation distribution -(the mixture) that best matches the test distribution, in terms of the divergence measure of choice. -In the multiclass case, with n the number of classes, the test and mixture distributions contain -n channels (proper distributions of binned posterior probabilities), on which the divergence is computed -independently. The matching is computed as an average of the divergence across all channels.
-posteriors – posterior probabilities of the instances in the sample
-a vector of class prevalence estimates
-Trains the aggregation function of a distribution matching method. This comes down to generating the -validation distributions out of the training data. -The validation distributions have shape (n, ch, nbins), with n the number of classes, ch the number of -channels, and nbins the number of bins. In particular, let V be the validation distributions; then di=V[i] -are the distributions obtained from training data labelled with class i; while dij = di[j] is the discrete -distribution of posterior probabilities P(Y=j|X=x) for training data labelled with class i, and dij[k] -is the fraction of instances with a value in the k-th bin.
-classif_predictions – a quapy.data.base.LabelledCollection
containing,
-as instances, the posterior probabilities issued by the classifier and, as labels, the true labels
data – a quapy.data.base.LabelledCollection
consisting of the training data
Bases: AggregativeSoftQuantifier
, BinaryAggregativeQuantifier
DyS framework (DyS). DyS is a generalization of the HDy method that uses a ternary search to find the prevalence that minimizes the distance between distributions. Details of the ternary search are taken from <https://dl.acm.org/doi/pdf/10.1145/3219819.3220059>
-classifier – a sklearn’s Estimator that generates a binary classifier
val_split – a float in range (0,1) indicating the proportion of data to be used as a stratified held-out validation distribution, or a quapy.data.base.LabelledCollection (the split itself), or an integer indicating the number of folds (default 5).
n_bins – an int with the number of bins to use to compute the histograms.
divergence – a str indicating the name of divergence (currently supported ones are “HD” or “topsoe”), or a callable function that computes the divergence between two distributions (two equally sized arrays).
tol – a float with the tolerance for the ternary search algorithm.
n_jobs – number of parallel workers.
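A usage sketch with the hyperparameters documented above (values are just examples; DyS is a binary quantifier):
>>> from quapy.method.aggregative import DyS
>>> from sklearn.linear_model import LogisticRegression
>>>
>>> dys = DyS(LogisticRegression(), val_split=5, n_bins=8, divergence='HD', tol=1e-5)
>>> # dys.fit(binary_training_data)
>>> # estim_prev = dys.quantify(test_instances)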
Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Trains the aggregation function of DyS.
-classif_predictions – a quapy.data.base.LabelledCollection
containing,
-as instances, the posterior probabilities issued by the classifier and, as labels, the true labels
data – a quapy.data.base.LabelledCollection
consisting of the training data
Bases: AggregativeSoftQuantifier
Expectation Maximization for Quantification (EMQ), -aka Saerens-Latinne-Decaestecker (SLD) algorithm. -EMQ consists of using the well-known Expectation Maximization algorithm to iteratively update the posterior -probabilities generated by a probabilistic classifier and the class prevalence estimates obtained via -maximum-likelihood estimation, in a mutually recursive way, until convergence.
This implementation also gives access to the heuristics proposed in the Alexandari et al. paper. These heuristics consist of using, as the training prevalence, an estimate of it obtained via k-fold cross-validation (instead of the true training prevalence), and of recalibrating the posterior probabilities of the classifier.
-classifier – a sklearn’s Estimator that generates a classifier
val_split – specifies the data used for generating classifier predictions. This specification -can be made as float in (0, 1) indicating the proportion of stratified held-out validation set to -be extracted from the training set; or as an integer, indicating that the predictions -are to be generated in a k-fold cross-validation manner (with this integer indicating the value -for k, default 5); or as a collection defining the specific set of data to use for validation. -Alternatively, this set can be specified at fit time by indicating the exact set of data -on which the predictions are to be generated. This hyperparameter is only meant to be used when the -heuristics are to be applied, i.e., if a recalibration is required. The default value is None (meaning -the recalibration is not required). In case this hyperparameter is set to a value other than None, but -the recalibration is not required (recalib=None), a warning message will be raised.
exact_train_prev – set to True (default) for using the true training prevalence as the initial observation; -set to False for computing the training prevalence as an estimate of it, i.e., as the expected -value of the posterior probabilities of the training instances.
recalib – a string indicating the method of recalibration. Available choices include “nbvs” (No-Bias Vector Scaling), “bcts” (Bias-Corrected Temperature Scaling, the recommended choice), “ts” (Temperature Scaling), and “vs” (Vector Scaling). The default is None (no recalibration).
n_jobs – number of parallel workers. Only used for recalibrating the classifier if val_split is set to -an integer k –the number of folds.
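A usage sketch enabling the Alexandari et al. heuristics documented above (estimated training prevalence plus BCTS recalibration):
>>> from quapy.method.aggregative import EMQ
>>> from sklearn.linear_model import LogisticRegression
>>>
>>> emq = EMQ(LogisticRegression(), val_split=5, exact_train_prev=False, recalib='bcts')
>>> # emq.fit(training_data)
>>> # estim_prev = emq.quantify(test_instances)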
Computes the Expectation Maximization routine.
-tr_prev – array-like, the training prevalence
posterior_probabilities – np.ndarray of shape (n_instances, n_classes,) with the -posterior probabilities
epsilon – float, the threshold difference between two consecutive iterations below which the loop stops
a tuple with the estimated prevalence values (shape (n_classes,)) and -the corrected posterior probabilities (shape (n_instances, n_classes,))
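For reference, the standard SLD/EM iteration (given here in its usual textbook form, not quoted from the QuaPy source) alternates the following two steps until the change in the prevalence estimate falls below epsilon, starting from \(\hat{p}^{(0)}\) equal to the (true or estimated) training prevalence:
\(\text{E-step:}\quad p^{(k)}(y_j|x_i) = \frac{\frac{\hat{p}^{(k)}(y_j)}{\hat{p}_{\mathrm{tr}}(y_j)}\, p_{\mathrm{tr}}(y_j|x_i)}{\sum_{l=1}^{n} \frac{\hat{p}^{(k)}(y_l)}{\hat{p}_{\mathrm{tr}}(y_l)}\, p_{\mathrm{tr}}(y_l|x_i)}\)
\(\text{M-step:}\quad \hat{p}^{(k+1)}(y_j) = \frac{1}{m}\sum_{i=1}^{m} p^{(k)}(y_j|x_i)\)
where \(p_{\mathrm{tr}}(y|x)\) are the posterior probabilities returned by the classifier, n is the number of classes, and m is the number of instances in the sample.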
Constructs an instance of EMQ using the best configuration found in the Alexandari et al. paper, i.e., one that relies on Bias-Corrected Temperature Scaling (BCTS) as a recalibration function, and that uses an estimate of the training prevalence instead of the true training prevalence.
-classifier – a sklearn’s Estimator that generates a classifier
n_jobs – number of parallel workers.
An instance of EMQ with BCTS
-Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
Trains the aggregation function of EMQ. This comes down to recalibrating the posterior probabilities, if requested.
-classif_predictions – a quapy.data.base.LabelledCollection
containing,
-as instances, the posterior probabilities issued by the classifier and, as labels, the true labels
data – a quapy.data.base.LabelledCollection
consisting of the training data
Provides the posterior probabilities for the given instances. If the classifier was required -to be recalibrated, then these posteriors are recalibrated accordingly.
-instances – array-like of shape (n_instances, n_dimensions,)
-np.ndarray of shape (n_instances, n_classes,) with posterior probabilities
-Bases: AggregativeSoftQuantifier
, BinaryAggregativeQuantifier
Hellinger Distance y (HDy). -HDy is a probabilistic method for training binary quantifiers, that models quantification as the problem of -minimizing the divergence (in terms of the Hellinger Distance) between two distributions of posterior -probabilities returned by the classifier. One of the distributions is generated from the unlabelled examples and -the other is generated from a validation set. This latter distribution is defined as a mixture of the -class-conditional distributions of the posterior probabilities returned for the positive and negative validation -examples, respectively. The parameters of the mixture thus represent the estimates of the class prevalence values.
-classifier – a sklearn’s Estimator that generates a binary classifier
val_split – a float in range (0,1) indicating the proportion of data to be used as a stratified held-out validation distribution, or a quapy.data.base.LabelledCollection (the split itself), or an integer indicating the number of folds (default 5).
Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Trains the aggregation function of HDy.
-classif_predictions – a quapy.data.base.LabelledCollection
containing,
-as instances, the posterior probabilities issued by the classifier and, as labels, the true labels
data – a quapy.data.base.LabelledCollection
consisting of the training data
Bases: OneVsAllGeneric
, AggregativeQuantifier
Allows any binary quantifier to perform quantification on single-label datasets.
The method maintains one binary quantifier for each class, and then l1-normalizes the outputs so that the class prevalences sum up to 1.
-This variant was used, along with the EMQ
quantifier, in
-Gao and Sebastiani, 2016.
binary_quantifier – a quantifier (binary) that will be employed to work on multiclass model in a -one-vs-all manner
n_jobs – number of parallel workers
parallel_backend – the parallel backend for joblib (default “loky”); this is helpful for some quantifiers (e.g., ELM-based ones) that cannot be run with multiprocessing, since the temporary directory they create during fit is removed and is no longer available at predict time.
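A usage sketch wrapping a binary quantifier for a multiclass problem:
>>> from quapy.method.aggregative import OneVsAllAggregative, HDy
>>> from sklearn.linear_model import LogisticRegression
>>>
>>> ova = OneVsAllAggregative(binary_quantifier=HDy(LogisticRegression()), n_jobs=-1)
>>> # ova.fit(multiclass_training_data)
>>> # estim_prev = ova.quantify(test_instances)  # l1-normalized across classes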
Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
If the base quantifier is not probabilistic, returns a matrix of shape (n,m,) with n the number of instances and m the number of classes. The entry (i,j) is a binary value indicating whether instance i belongs to class j. The binary classifications are independent of each other, meaning that an instance can end up being attributed to 0, 1, or more classes. If the base quantifier is probabilistic, returns a matrix of shape (n,m,2) with n the number of instances and m the number of classes. The entry (i,j,1) (resp. (i,j,0)) is a value in [0,1] indicating the posterior probability that instance i belongs (resp. does not belong) to class j. The posterior probabilities are independent of each other, meaning that, in general, they do not sum up to one.
-instances – array-like
-np.ndarray
-Bases: AggregativeSoftQuantifier
Probabilistic Adjusted Classify & Count, -the probabilistic variant of ACC that relies on the posterior probabilities returned by a probabilistic classifier.
-classifier – a sklearn’s Estimator that generates a classifier
val_split – specifies the data used for generating classifier predictions. This specification -can be made as float in (0, 1) indicating the proportion of stratified held-out validation set to -be extracted from the training set; or as an integer (default 5), indicating that the predictions -are to be generated in a k-fold cross-validation manner (with this integer indicating the value -for k). Alternatively, this set can be specified at fit time by indicating the exact set of data -on which the predictions are to be generated.
method (str) –
adjustment method to be used:
-’inversion’: matrix inversion method based on the matrix equality \(P(C)=P(C|Y)P(Y)\), -which tries to invert P(C|Y) matrix.
’invariant-ratio’: invariant ratio estimator of Vaz et al., -which replaces the last equation with the normalization condition.
solver (str) –
the method to use for solving the system of linear equations. Valid options are:
-’exact-raise’: tries to solve the system using matrix inversion. -Raises an error if the matrix has rank strictly less than n_classes.
’exact-cc’: if the matrix is not of full rank, returns p_c as the estimates, which corresponds to no adjustment (i.e., the classify and count method; see quapy.method.aggregative.CC)
’exact’: deprecated, defaults to ‘exact-cc’
’minimize’: minimizes the L2 norm of \(|Ax-B|\). This one generally works better, and is the -default parameter. More details about this can be consulted in Bunse, M. “On Multi-Class Extensions -of Adjusted Classify and Count”, on proceedings of the 2nd International Workshop on Learning to -Quantify: Methods and Applications (LQ 2022), ECML/PKDD 2022, Grenoble (France).
norm (str) –
the method to use for normalization.
’clip’: the values are clipped to the range [0,1] and then L1-normalized.
’mapsimplex’: projects vectors onto the probability simplex. This implementation relies on Mathieu Blondel’s projection_simplex_sort.
’condsoftmax’: applies a softmax normalization only to prevalence vectors that lie outside the simplex.
n_jobs – number of parallel workers
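A usage sketch with the adjustment, solver, and normalization options documented above (values are just examples):
>>> from quapy.method.aggregative import PACC
>>> from sklearn.linear_model import LogisticRegression
>>>
>>> pacc = PACC(LogisticRegression(), val_split=5, method='inversion', solver='minimize', norm='clip')
>>> # pacc.fit(training_data)
>>> # estim_prev = pacc.quantify(test_instances)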
Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Estimates the misclassification rates
-classif_predictions – a quapy.data.base.LabelledCollection
containing,
-as instances, the posterior probabilities issued by the classifier and, as labels, the true labels
data – a quapy.data.base.LabelledCollection
consisting of the training data
Bases: AggregativeSoftQuantifier
Probabilistic Classify & Count, -the probabilistic variant of CC that relies on the posterior probabilities returned by a probabilistic classifier.
-classifier – a sklearn’s Estimator that generates a classifier
-Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Nothing to do here!
-classif_predictions – not used
data – not used
Bases: AggregativeSoftQuantifier
, BinaryAggregativeQuantifier
SMM method (SMM). -SMM is a simplification of matching distribution methods where the representation of the examples -is created using the mean instead of a histogram (conceptually equivalent to PACC).
-classifier – a sklearn’s Estimator that generates a binary classifier.
val_split – a float in range (0,1) indicating the proportion of data to be used as a stratified held-out validation distribution, or a quapy.data.base.LabelledCollection (the split itself), or an integer indicating the number of folds (default 5).
Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Trains the aggregation function of SMM.
-classif_predictions – a quapy.data.base.LabelledCollection
containing,
-as instances, the posterior probabilities issued by the classifier and, as labels, the true labels
data – a quapy.data.base.LabelledCollection
consisting of the training data
Explicit Loss Minimization (ELM) quantifiers. Quantifiers based on ELM represent a family of methods based on structured output learning; these quantifiers rely on classifiers that have been optimized using a quantification-oriented loss measure. This implementation relies on Joachims’ SVM perf structured output learning algorithm, which has to be installed and patched for the purpose (see this script). This function is equivalent to:
->>> CC(SVMperf(svmperf_base, loss, C))
-
svmperf_base – path to the folder containing the binary files of SVM perf; if set to None (default) -this path will be obtained from qp.environ[‘SVMPERF_HOME’]
loss – the loss to optimize (see quapy.classification.svmperf.SVMperf.valid_losses
)
C – trade-off between training error and margin (default 0.01)
returns an instance of CC set to work with SVMperf (with loss and C set properly) as the -underlying classifier
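A usage sketch (the installation path below is an assumption; point it at your patched SVMperf build):
>>> import quapy as qp
>>> from quapy.method.aggregative import newELM
>>>
>>> qp.environ['SVMPERF_HOME'] = './svm_perf_quantification'  # assumed local path to the patched binaries
>>> elm = newELM(C=0.01)  # equivalent to CC(SVMperf(svmperf_base, loss, C)) with the default loss
>>> # elm.fit(binary_training_data)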
SVM(AE) is an Explicit Loss Minimization (ELM) quantifier set to optimize for the Absolute Error as first used by Moreo and Sebastiani, 2021. Equivalent to:
->>> CC(SVMperf(svmperf_base, loss='mae', C=C))
-
Quantifiers based on ELM represent a family of methods based on structured output learning; -these quantifiers rely on classifiers that have been optimized using a quantification-oriented loss -measure. This implementation relies on -Joachims’ SVM perf structured output -learning algorithm, which has to be installed and patched for the purpose (see this -script). -This function is a wrapper around CC(SVMperf(svmperf_base, loss, C))
-svmperf_base – path to the folder containing the binary files of SVM perf; if set to None (default) -this path will be obtained from qp.environ[‘SVMPERF_HOME’]
C – trade-off between training error and margin (default 0.01)
returns an instance of CC set to work with SVMperf (with loss and C set properly) as the -underlying classifier
-SVM(KLD) is an Explicit Loss Minimization (ELM) quantifier set to optimize for the Kullback-Leibler Divergence -normalized via the logistic function, as proposed by -Esuli et al. 2015. -Equivalent to:
->>> CC(SVMperf(svmperf_base, loss='nkld', C=C))
-
Quantifiers based on ELM represent a family of methods based on structured output learning; -these quantifiers rely on classifiers that have been optimized using a quantification-oriented loss -measure. This implementation relies on -Joachims’ SVM perf structured output -learning algorithm, which has to be installed and patched for the purpose (see this -script). -This function is a wrapper around CC(SVMperf(svmperf_base, loss, C))
-svmperf_base – path to the folder containing the binary files of SVM perf; if set to None (default) -this path will be obtained from qp.environ[‘SVMPERF_HOME’]
C – trade-off between training error and margin (default 0.01)
returns an instance of CC set to work with SVMperf (with loss and C set properly) as the -underlying classifier
-SVM(Q) is an Explicit Loss Minimization (ELM) quantifier set to optimize for the Q loss combining a -classification-oriented loss and a quantification-oriented loss, as proposed by -Barranquero et al. 2015. -Equivalent to:
->>> CC(SVMperf(svmperf_base, loss='q', C=C))
-
Quantifiers based on ELM represent a family of methods based on structured output learning; -these quantifiers rely on classifiers that have been optimized using a quantification-oriented loss -measure. This implementation relies on -Joachims’ SVM perf structured output -learning algorithm, which has to be installed and patched for the purpose (see this -script). -This function is a wrapper around CC(SVMperf(svmperf_base, loss, C))
-svmperf_base – path to the folder containing the binary files of SVM perf; if set to None (default) -this path will be obtained from qp.environ[‘SVMPERF_HOME’]
C – trade-off between training error and margin (default 0.01)
returns an instance of CC set to work with SVMperf (with loss and C set properly) as the -underlying classifier
SVM(RAE) is an Explicit Loss Minimization (ELM) quantifier set to optimize for the Relative Absolute Error as first used by Moreo and Sebastiani, 2021. Equivalent to:
->>> CC(SVMperf(svmperf_base, loss='mrae', C=C))
-
Quantifiers based on ELM represent a family of methods based on structured output learning; -these quantifiers rely on classifiers that have been optimized using a quantification-oriented loss -measure. This implementation relies on -Joachims’ SVM perf structured output -learning algorithm, which has to be installed and patched for the purpose (see this -script). -This function is a wrapper around CC(SVMperf(svmperf_base, loss, C))
-svmperf_base – path to the folder containing the binary files of SVM perf; if set to None (default) -this path will be obtained from qp.environ[‘SVMPERF_HOME’]
C – trade-off between training error and margin (default 0.01)
returns an instance of CC set to work with SVMperf (with loss and C set properly) as the -underlying classifier
-Bases: object
Common ancestor for KDE-based methods. Implements some common routines.
-Wraps the KDE function from scikit-learn.
-X – data for which the density function is to be estimated
bandwidth – the bandwidth of the kernel
a scikit-learn’s KernelDensity object
-Returns an array containing the mixture components, i.e., the KDE functions for each class.
-X – the data containing the covariates
y – the class labels
n_classes – integer, the number of classes
bandwidth – float, the bandwidth of the kernel
a list of KernelDensity objects, each fitted with the corresponding class-specific covariates
Wraps the density evaluation of scikit-learn’s KDE. Scikit-learn returns log-scores (s), so this function returns \(e^{s}\).
-kde – a previously fit KDE function
X – the data for which the density is to be estimated
np.ndarray with the densities
-Bases: AggregativeSoftQuantifier
Kernel Density Estimation model for quantification (KDEy) relying on the Cauchy-Schwarz divergence (CS) as -the divergence measure to be minimized. This method was first proposed in the paper -Kernel Density Estimation for Multiclass Quantification, in which -the authors proposed a Monte Carlo approach for minimizing the divergence.
-The distribution matching optimization problem comes down to solving:
-\(\hat{\alpha} = \arg\min_{\alpha\in\Delta^{n-1}} \mathcal{D}(\boldsymbol{p}_{\alpha}||q_{\widetilde{U}})\)
-where \(p_{\alpha}\) is the mixture of class-specific KDEs with mixture parameter (hence class prevalence) -\(\alpha\) defined by
-\(\boldsymbol{p}_{\alpha}(\widetilde{x}) = \sum_{i=1}^n \alpha_i p_{\widetilde{L}_i}(\widetilde{x})\)
-where \(p_X(\boldsymbol{x}) = \frac{1}{|X|} \sum_{x_i\in X} K\left(\frac{x-x_i}{h}\right)\) is the -KDE function that uses the datapoints in X as the kernel centers.
-In KDEy-CS, the divergence is taken to be the Cauchy-Schwarz divergence given by:
-\(\mathcal{D}_{\mathrm{CS}}(p||q)=-\log\left(\frac{\int p(x)q(x)dx}{\sqrt{\int p(x)^2dx \int q(x)^2dx}}\right)\)
-The authors showed that this distribution matching admits a closed-form solution
-classifier – a sklearn’s Estimator that generates a binary classifier.
val_split – specifies the data used for generating classifier predictions. This specification -can be made as float in (0, 1) indicating the proportion of stratified held-out validation set to -be extracted from the training set; or as an integer (default 5), indicating that the predictions -are to be generated in a k-fold cross-validation manner (with this integer indicating the value -for k); or as a collection defining the specific set of data to use for validation. -Alternatively, this set can be specified at fit time by indicating the exact set of data -on which the predictions are to be generated.
bandwidth – float, the bandwidth of the Kernel
Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Trains the aggregation function.
-classif_predictions – a quapy.data.base.LabelledCollection
containing,
-as instances, the predictions issued by the classifier and, as labels, the true labels
data – a quapy.data.base.LabelledCollection
consisting of the training data
Bases: AggregativeSoftQuantifier
, KDEBase
Kernel Density Estimation model for quantification (KDEy) relying on the squared Hellinger Distance (HD) as the divergence measure to be minimized. This method was first proposed in the paper Kernel Density Estimation for Multiclass Quantification, in which the authors proposed a Monte Carlo approach for minimizing the divergence.
-The distribution matching optimization problem comes down to solving:
-\(\hat{\alpha} = \arg\min_{\alpha\in\Delta^{n-1}} \mathcal{D}(\boldsymbol{p}_{\alpha}||q_{\widetilde{U}})\)
-where \(p_{\alpha}\) is the mixture of class-specific KDEs with mixture parameter (hence class prevalence) -\(\alpha\) defined by
-\(\boldsymbol{p}_{\alpha}(\widetilde{x}) = \sum_{i=1}^n \alpha_i p_{\widetilde{L}_i}(\widetilde{x})\)
-where \(p_X(\boldsymbol{x}) = \frac{1}{|X|} \sum_{x_i\in X} K\left(\frac{x-x_i}{h}\right)\) is the -KDE function that uses the datapoints in X as the kernel centers.
-In KDEy-HD, the divergence is taken to be the squared Hellinger Distance, an f-divergence with corresponding -f-generator function given by:
-\(f(u)=(\sqrt{u}-1)^2\)
-The authors proposed a Monte Carlo solution that relies on importance sampling:
-\(\hat{D}_f(p||q)= \frac{1}{t} \sum_{i=1}^t f\left(\frac{p(x_i)}{q(x_i)}\right) \frac{q(x_i)}{r(x_i)}\)
-where the datapoints (trials) \(x_1,\ldots,x_t\sim_{\mathrm{iid}} r\) with \(r\) the -uniform distribution.
-classifier – a sklearn’s Estimator that generates a binary classifier.
val_split – specifies the data used for generating classifier predictions. This specification -can be made as float in (0, 1) indicating the proportion of stratified held-out validation set to -be extracted from the training set; or as an integer (default 5), indicating that the predictions -are to be generated in a k-fold cross-validation manner (with this integer indicating the value -for k); or as a collection defining the specific set of data to use for validation. -Alternatively, this set can be specified at fit time by indicating the exact set of data -on which the predictions are to be generated.
bandwidth – float, the bandwidth of the Kernel
random_state – a seed to be set before fitting any base quantifier (default None)
montecarlo_trials – number of Monte Carlo trials (default 10000)
Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Trains the aggregation function.
-classif_predictions – a quapy.data.base.LabelledCollection
containing,
-as instances, the predictions issued by the classifier and, as labels, the true labels
data – a quapy.data.base.LabelledCollection
consisting of the training data
Bases: AggregativeSoftQuantifier
, KDEBase
Kernel Density Estimation model for quantification (KDEy) relying on the Kullback-Leibler divergence (KLD) as the divergence measure to be minimized. This method was first proposed in the paper Kernel Density Estimation for Multiclass Quantification, in which the authors show that minimizing the distribution matching criterion for KLD is akin to performing maximum likelihood (ML).
-The distribution matching optimization problem comes down to solving:
-\(\hat{\alpha} = \arg\min_{\alpha\in\Delta^{n-1}} \mathcal{D}(\boldsymbol{p}_{\alpha}||q_{\widetilde{U}})\)
-where \(p_{\alpha}\) is the mixture of class-specific KDEs with mixture parameter (hence class prevalence) -\(\alpha\) defined by
-\(\boldsymbol{p}_{\alpha}(\widetilde{x}) = \sum_{i=1}^n \alpha_i p_{\widetilde{L}_i}(\widetilde{x})\)
-where \(p_X(\boldsymbol{x}) = \frac{1}{|X|} \sum_{x_i\in X} K\left(\frac{x-x_i}{h}\right)\) is the -KDE function that uses the datapoints in X as the kernel centers.
In KDEy-ML, the divergence is taken to be the Kullback-Leibler Divergence. This is equivalent to solving:
\(\hat{\alpha} = \arg\min_{\alpha\in\Delta^{n-1}} -\mathbb{E}_{q_{\widetilde{U}}} \left[ \log \boldsymbol{p}_{\alpha}(\widetilde{x}) \right]\)
-which corresponds to the maximum likelihood estimate.
-classifier – a sklearn’s Estimator that generates a binary classifier.
val_split – specifies the data used for generating classifier predictions. This specification -can be made as float in (0, 1) indicating the proportion of stratified held-out validation set to -be extracted from the training set; or as an integer (default 5), indicating that the predictions -are to be generated in a k-fold cross-validation manner (with this integer indicating the value -for k); or as a collection defining the specific set of data to use for validation. -Alternatively, this set can be specified at fit time by indicating the exact set of data -on which the predictions are to be generated.
bandwidth – float, the bandwidth of the Kernel
random_state – a seed to be set before fitting any base quantifier (default None)
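A usage sketch with an explicit kernel bandwidth (the value is just an example):
>>> from quapy.method.aggregative import KDEyML
>>> from sklearn.linear_model import LogisticRegression
>>>
>>> kdey = KDEyML(LogisticRegression(), val_split=5, bandwidth=0.1, random_state=0)
>>> # kdey.fit(training_data)
>>> # estim_prev = kdey.quantify(test_instances)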
Searches for the mixture model parameter (the sought prevalence values) that maximizes the likelihood -of the data (i.e., that minimizes the negative log-likelihood)
-posteriors – instances in the sample converted into posterior probabilities
-a vector of class prevalence estimates
-Trains the aggregation function.
-classif_predictions – a quapy.data.base.LabelledCollection
containing,
-as instances, the predictions issued by the classifier and, as labels, the true labels
data – a quapy.data.base.LabelledCollection
consisting of the training data
Bases: Module
Implements the QuaNet forward pass.
-See QuaNetTrainer
for training QuaNet.
doc_embedding_size – integer, the dimensionality of the document embeddings
n_classes – integer, number of classes
stats_size – integer, number of statistics estimated by simple quantification methods
lstm_hidden_size – integer, hidden dimensionality of the LSTM cell
lstm_nlayers – integer, number of LSTM layers
ff_layers – list of integers, dimensions of the densely-connected FF layers on top of the -quantification embedding
bidirectional – boolean, whether or not to use bidirectional LSTM
qdrop_p – float, dropout probability
order_by – integer, class for which the document embeddings are to be sorted
Defines the computation performed at every call. Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
Bases: BaseQuantifier
Implementation of QuaNet, a neural network for -quantification. This implementation uses PyTorch and can take advantage of GPU -for speeding-up the training phase.
-Example:
->>> import quapy as qp
->>> from quapy.method.meta import QuaNet
->>> from quapy.classification.neural import NeuralClassifierTrainer, CNNnet
->>>
->>> # use samples of 100 elements
->>> qp.environ['SAMPLE_SIZE'] = 100
->>>
->>> # load the kindle dataset as text, and convert words to numerical indexes
->>> dataset = qp.datasets.fetch_reviews('kindle', pickle=True)
->>> qp.data.preprocessing.index(dataset, min_df=5, inplace=True)
->>>
->>> # the text classifier is a CNN trained by NeuralClassifierTrainer
->>> cnn = CNNnet(dataset.vocabulary_size, dataset.n_classes)
->>> classifier = NeuralClassifierTrainer(cnn, device='cuda')
->>>
->>> # train QuaNet (QuaNet is an alias to QuaNetTrainer)
->>> model = QuaNet(classifier, qp.environ['SAMPLE_SIZE'], device='cuda')
->>> model.fit(dataset.training)
->>> estim_prevalence = model.quantify(dataset.test.instances)
-
classifier – an object implementing fit (i.e., that can be trained on labelled data), -predict_proba (i.e., that can generate posterior probabilities of unlabelled examples) and -transform (i.e., that can generate embedded representations of the unlabelled instances).
sample_size – integer, the sample size; default is None, meaning that the sample size should be -taken from qp.environ[“SAMPLE_SIZE”]
n_epochs – integer, maximum number of training epochs
tr_iter_per_poch – integer, number of training iterations before considering an epoch complete
va_iter_per_poch – integer, number of validation iterations to perform after each epoch
lr – float, the learning rate
lstm_hidden_size – integer, hidden dimensionality of the LSTM cells
lstm_nlayers – integer, number of LSTM layers
ff_layers – list of integers, dimensions of the densely-connected FF layers on top of the -quantification embedding
bidirectional – boolean, indicates whether the LSTM is bidirectional or not
qdrop_p – float, dropout probability
patience – integer, number of epochs showing no improvement in the validation set before stopping the -training phase (early stopping)
checkpointdir – string, a path where to store models’ checkpoints
checkpointname – string (optional), the name of the model’s checkpoint
device – string, indicate “cpu” or “cuda”
Trains QuaNet.
-data – the training data on which to train QuaNet. If fit_classifier=True, the data will be split in -40/40/20 for training the classifier, training QuaNet, and validating QuaNet, respectively. If -fit_classifier=False, the data will be split in 66/34 for training QuaNet and validating it, respectively.
fit_classifier – if True, trains the classifier on a split containing 40% of the data
self
-Get parameters for this estimator.
-deep (bool, default=True) – If True, will return the parameters for this estimator and -contained subobjects that are estimators.
-params – Parameter names mapped to their values.
-dict
-Generate class prevalence estimates for the sample’s instances
-instances – array-like
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
**params (dict) – Estimator parameters.
-self – Estimator instance.
-estimator instance
-Torch-like wrapper for the Mean Absolute Error
-output – predictions
target – ground truth values
mean absolute error loss
-Bases: ThresholdOptimization
Threshold Optimization variant for ACC
as proposed by
-Forman 2006 and
-Forman 2008 that looks
-for the threshold that maximizes tpr-fpr.
-The goal is to bring improved stability to the denominator of the adjustment.
classifier – a sklearn’s Estimator that generates a classifier
val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the
-misclassification rates are to be estimated.
-This parameter can be indicated as a real value (between 0 and 1), representing a proportion of
-validation data, or as an integer, indicating that the misclassification rates should be estimated via
-k-fold cross validation (this integer stands for the number of folds k, defaults 5), or as a
-quapy.data.base.LabelledCollection
(the split itself).
Implements the criterion according to which the threshold should be selected. -This function should return the (float) score to be minimized.
-tpr – float, true positive rate
fpr – float, false positive rate
float, a score for the given tpr and fpr
-Bases: ThresholdOptimization
Median Sweep. Threshold Optimization variant for ACC
as proposed by
-Forman 2006 and
-Forman 2008 that generates
-class prevalence estimates for all decision thresholds and returns the median of them all.
-The goal is to bring improved stability to the denominator of the adjustment.
classifier – a sklearn’s Estimator that generates a classifier
val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the
-misclassification rates are to be estimated.
-This parameter can be indicated as a real value (between 0 and 1), representing a proportion of
-validation data, or as an integer, indicating that the misclassification rates should be estimated via
-k-fold cross validation (this integer stands for the number of folds k, defaults 5), or as a
-quapy.data.base.LabelledCollection
(the split itself).
Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Trains the aggregation function.
-classif_predictions – a quapy.data.base.LabelledCollection
containing,
-as instances, the predictions issued by the classifier and, as labels, the true labels
data – a quapy.data.base.LabelledCollection
consisting of the training data
Implements the criterion according to which the threshold should be selected. -This function should return the (float) score to be minimized.
-tpr – float, true positive rate
fpr – float, false positive rate
float, a score for the given tpr and fpr
-Bases: MS
Median Sweep 2. Threshold Optimization variant for ACC
as proposed by
-Forman 2006 and
-Forman 2008 that generates
class prevalence estimates for all decision thresholds and returns the median of those cases in which tpr-fpr > 0.25.
-The goal is to bring improved stability to the denominator of the adjustment.
classifier – a sklearn’s Estimator that generates a classifier
val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the
-misclassification rates are to be estimated.
-This parameter can be indicated as a real value (between 0 and 1), representing a proportion of
-validation data, or as an integer, indicating that the misclassification rates should be estimated via
-k-fold cross validation (this integer stands for the number of folds k, defaults 5), or as a
-quapy.data.base.LabelledCollection
(the split itself).
Bases: ThresholdOptimization
Threshold Optimization variant for ACC
as proposed by
-Forman 2006 and
-Forman 2008 that looks
-for the threshold that makes tpr closest to 0.5.
-The goal is to bring improved stability to the denominator of the adjustment.
classifier – a sklearn’s Estimator that generates a classifier
val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the
-misclassification rates are to be estimated.
-This parameter can be indicated as a real value (between 0 and 1), representing a proportion of
-validation data, or as an integer, indicating that the misclassification rates should be estimated via
-k-fold cross validation (this integer stands for the number of folds k, defaults 5), or as a
-quapy.data.base.LabelledCollection
(the split itself).
Implements the criterion according to which the threshold should be selected. -This function should return the (float) score to be minimized.
-tpr – float, true positive rate
fpr – float, false positive rate
float, a score for the given tpr and fpr
-Bases: BinaryAggregativeQuantifier
Abstract class of Threshold Optimization variants for ACC
as proposed by
-Forman 2006 and
-Forman 2008.
-The goal is to bring improved stability to the denominator of the adjustment.
-The different variants are based on different heuristics for choosing a decision threshold
-that would allow for more true positives and many more false positives, on the grounds this
-would deliver larger denominators.
classifier – a sklearn’s Estimator that generates a classifier
val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the
-misclassification rates are to be estimated.
-This parameter can be indicated as a real value (between 0 and 1), representing a proportion of
-validation data, or as an integer, indicating that the misclassification rates should be estimated via
-k-fold cross validation (this integer stands for the number of folds k, defaults 5), or as a
-quapy.data.base.LabelledCollection
(the split itself).
Implements the aggregation of label predictions.
-classif_predictions – np.ndarray of label predictions
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Trains the aggregation function.
-classif_predictions – a quapy.data.base.LabelledCollection
containing,
-as instances, the predictions issued by the classifier and, as labels, the true labels
data – a quapy.data.base.LabelledCollection
consisting of the training data
Implements the criterion according to which the threshold should be selected. -This function should return the (float) score to be minimized.
-tpr – float, true positive rate
fpr – float, false positive rate
float, a score for the given tpr and fpr
-Bases: ThresholdOptimization
Threshold Optimization variant for ACC
as proposed by
-Forman 2006 and
-Forman 2008 that looks
-for the threshold that yields tpr=1-fpr.
-The goal is to bring improved stability to the denominator of the adjustment.
classifier – a sklearn’s Estimator that generates a classifier
val_split – indicates the proportion of data to be used as a stratified held-out validation set in which the
-misclassification rates are to be estimated.
-This parameter can be indicated as a real value (between 0 and 1), representing a proportion of
-validation data, or as an integer, indicating that the misclassification rates should be estimated via
-k-fold cross validation (this integer stands for the number of folds k, defaults 5), or as a
-quapy.data.base.LabelledCollection
(the split itself).
Implements the criterion according to which the threshold should be selected. -This function should return the (float) score to be minimized.
-tpr – float, true positive rate
fpr – float, false positive rate
float, a score for the given tpr and fpr
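To make the criteria concrete, the scores-to-minimize that the variants described above correspond to can be written as follows (an illustration only, not QuaPy code):
>>> def max_criterion(tpr, fpr):
>>>     return -(tpr - fpr)  # MAX: maximize tpr - fpr
>>>
>>> def t50_criterion(tpr, fpr):
>>>     return abs(tpr - 0.5)  # T50: bring tpr closest to 0.5
>>>
>>> def x_criterion(tpr, fpr):
>>>     return abs(1 - (tpr + fpr))  # X: seek tpr = 1 - fpr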
-Bases: BaseEstimator
Abstract Quantifier. A quantifier is defined as an object of a class that implements the method fit() on quapy.data.base.LabelledCollection, the method quantify(), and the methods set_params() and get_params() for model selection (see quapy.model_selection.GridSearchQ).
Trains a quantifier.
-data – a quapy.data.base.LabelledCollection
consisting of the training data
self
-Bases: BaseQuantifier
Abstract class of binary quantifiers, i.e., quantifiers estimating class prevalence values for only two classes -(typically, to be interpreted as one class and its complement).
-Bases: OneVsAll
, BaseQuantifier
Allows any binary quantifier to perform quantification on single-label datasets. The method maintains one binary quantifier for each class, and then l1-normalizes the outputs so that the class prevalence values sum up to 1.
-Trains a quantifier.
-data – a quapy.data.base.LabelledCollection
consisting of the training data
self
-Implements an ensemble of quapy.method.aggregative.ACC
quantifiers, as used by
-Pérez-Gállego et al., 2019.
Equivalent to:
->>> ensembleFactory(classifier, ACC, param_grid, optim, param_mod_sel, **kwargs)
-
See ensembleFactory()
for further details.
classifier – sklearn’s Estimator that generates a classifier
param_grid – a dictionary with the grid of parameters to optimize for
optim – a valid quantification or classification error, or a string name of it
param_model_sel – a dictionary containing any keyworded argument to pass to
-quapy.model_selection.GridSearchQ
kwargs – kwargs for the class Ensemble
an instance of Ensemble
Implements an ensemble of quapy.method.aggregative.CC
quantifiers, as used by
-Pérez-Gállego et al., 2019.
Equivalent to:
->>> ensembleFactory(classifier, CC, param_grid, optim, param_mod_sel, **kwargs)
-
See ensembleFactory()
for further details.
classifier – sklearn’s Estimator that generates a classifier
param_grid – a dictionary with the grid of parameters to optimize for
optim – a valid quantification or classification error, or a string name of it
param_model_sel – a dictionary containing any keyworded argument to pass to
-quapy.model_selection.GridSearchQ
kwargs – kwargs for the class Ensemble
an instance of Ensemble
Implements an ensemble of quapy.method.aggregative.EMQ
quantifiers.
Equivalent to:
->>> ensembleFactory(classifier, EMQ, param_grid, optim, param_mod_sel, **kwargs)
-
See ensembleFactory()
for further details.
classifier – sklearn’s Estimator that generates a classifier
param_grid – a dictionary with the grid of parameters to optimize for
optim – a valid quantification or classification error, or a string name of it
param_model_sel – a dictionary containing any keyworded argument to pass to
-quapy.model_selection.GridSearchQ
kwargs – kwargs for the class Ensemble
an instance of Ensemble
Implements an ensemble of quapy.method.aggregative.HDy
quantifiers, as used by
-Pérez-Gállego et al., 2019.
Equivalent to:
->>> ensembleFactory(classifier, HDy, param_grid, optim, param_mod_sel, **kwargs)
-
See ensembleFactory()
for further details.
classifier – sklearn’s Estimator that generates a classifier
param_grid – a dictionary with the grid of parameters to optimize for
optim – a valid quantification or classification error, or a string name of it
param_model_sel – a dictionary containing any keyworded argument to pass to
-quapy.model_selection.GridSearchQ
kwargs – kwargs for the class Ensemble
an instance of Ensemble
Implements an ensemble of quapy.method.aggregative.PACC
quantifiers.
Equivalent to:
->>> ensembleFactory(classifier, PACC, param_grid, optim, param_mod_sel, **kwargs)
-
See ensembleFactory()
for further details.
classifier – sklearn’s Estimator that generates a classifier
param_grid – a dictionary with the grid of parameters to optimize for
optim – a valid quantification or classification error, or a string name of it
param_model_sel – a dictionary containing any keyworded argument to pass to
-quapy.model_selection.GridSearchQ
kwargs – kwargs for the class Ensemble
an instance of Ensemble
Bases: BaseQuantifier
Implementation of the Ensemble methods for quantification described by -Pérez-Gállego et al., 2017 -and -Pérez-Gállego et al., 2019. -The policies implemented include:
-Average (policy=’ave’): computes class prevalence estimates as the average of the estimates -returned by the base quantifiers.
Training Prevalence (policy=’ptr’): applies a dynamic selection to the ensemble’s members by retaining only -those members such that the class prevalence values in the samples they use as training set are closest to -preliminary class prevalence estimates computed as the average of the estimates of all the members. The final -estimate is recomputed by considering only the selected members.
Distribution Similarity (policy=’ds’): performs a dynamic selection of base members by retaining -the members trained on samples whose distribution of posterior probabilities is closest, in terms of the -Hellinger Distance, to the distribution of posterior probabilities in the test sample
Accuracy (policy=’<valid error name>’): performs a static selection of the ensemble members by -retaining those that minimize a quantification error measure, which is passed as an argument.
Example:
->>> model = Ensemble(quantifier=ACC(LogisticRegression()), size=30, policy='ave', n_jobs=-1)
-
quantifier – base quantification member of the ensemble
size – number of members
red_size – number of members to retain after selection (depending on the policy)
min_pos – minimum number of positive instances to consider a sample as valid
policy – the selection policy; available policies include: ave (default), ptr, ds, and accuracy -(which is instantiated via a valid error name, e.g., mae)
max_sample_size – maximum number of instances to consider in the samples (set to None -to indicate no limit, default)
val_split – a float in range (0,1) indicating the proportion of data to be used as a stratified held-out
-validation split, or a quapy.data.base.LabelledCollection
(the split itself).
n_jobs – number of parallel workers (default 1)
verbose – set to True (default is False) to get some information in standard output
Indicates that the quantifier is not aggregative.
-False
-Trains a quantifier.
-data – a quapy.data.base.LabelledCollection
consisting of the training data
self
This function should not be used within quapy.model_selection.GridSearchQ (it is provided here for compatibility with the abstract class). Instead, use Ensemble(GridSearchQ(q),…), with q a Quantifier (recommended), or Ensemble(Q(GridSearchCV(l))) with Q a quantifier class that has a classifier l optimized for classification (not recommended).
deep – for compatibility with scikit-learn
-raises an Exception
-Indicates that the quantifier is not probabilistic.
-False
-Generate class prevalence estimates for the sample’s instances
-instances – array-like
-np.ndarray of shape (n_classes,) with class prevalence estimates.
This function should not be used within quapy.model_selection.GridSearchQ (it is provided here for compatibility with the abstract class). Instead, use Ensemble(GridSearchQ(q),…), with q a Quantifier (recommended), or Ensemble(Q(GridSearchCV(l))) with Q a quantifier class that has a classifier l optimized for classification (not recommended).
parameters – dictionary
-raises an Exception
-Bases: BinaryQuantifier
This method is a meta-quantifier that returns, as the estimated class prevalence values, the median of the estimates returned by differently (hyper)parameterized base quantifiers. The median of unit-vectors is only guaranteed to be a unit-vector for n=2 dimensions, i.e., in cases of binary quantification.
-base_quantifier – the base, binary quantifier
random_state – a seed to be set before fitting any base quantifier (default None)
param_grid – the grid or parameters towards which the median will be computed
n_jobs – number of parallel workers
Trains a quantifier.
-data – a quapy.data.base.LabelledCollection
consisting of the training data
self
-Get parameters for this estimator.
-deep (bool, default=True) – If True, will return the parameters for this estimator and -contained subobjects that are estimators.
-params – Parameter names mapped to their values.
-dict
-Generate class prevalence estimates for the sample’s instances
-instances – array-like
-np.ndarray of shape (n_classes,) with class prevalence estimates.
-Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
**params (dict) – Estimator parameters.
-self – Estimator instance.
-estimator instance
Bases: BinaryQuantifier
This method is a meta-quantifier that returns, as the estimated class prevalence values, the median of the estimations returned by differently (hyper)parameterized base quantifiers. The median of unit-vectors is only guaranteed to be a unit-vector for n=2 dimensions, i.e., in cases of binary quantification.
base_quantifier – the base, binary quantifier
random_state – a seed to be set before fitting any base quantifier (default None)
param_grid – the grid of parameters over which the median will be computed
n_jobs – number of parallel workers
Trains a quantifier.
data – a quapy.data.base.LabelledCollection consisting of the training data
self
Get parameters for this estimator.
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
params – Parameter names mapped to their values.
dict
Generate class prevalence estimates for the sample’s instances.
instances – array-like
np.ndarray of shape (n_classes,) with class prevalence estimates.
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
**params (dict) – Estimator parameters.
self – Estimator instance.
estimator instance
Ensemble factory. Provides a unified interface for instantiating ensembles that can be optimized (via model selection for quantification) for a given evaluation metric using quapy.model_selection.GridSearchQ. If the evaluation metric is classification-oriented (instead of quantification-oriented), then the optimization will be carried out via sklearn’s GridSearchCV.
Example to instantiate an Ensemble based on quapy.method.aggregative.PACC in which the base members are optimized for quapy.error.mae() via quapy.model_selection.GridSearchQ. The ensemble follows the policy Accuracy based on quapy.error.mae() (the same measure being optimized), meaning that a static selection of members of the ensemble is made based on their performance in terms of this error.
>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from quapy.method.aggregative import PACC
>>> from quapy.method.meta import ensembleFactory
>>>
>>> param_grid = {
>>>     'C': np.logspace(-3,3,7),
>>>     'class_weight': ['balanced', None]
>>> }
>>> param_mod_sel = {
>>>     'sample_size': 500,
>>>     'protocol': 'app'
>>> }
>>> common = {
>>>     'max_sample_size': 1000,
>>>     'n_jobs': -1,
>>>     'param_grid': param_grid,
>>>     'param_model_sel': param_mod_sel,
>>> }
>>>
>>> ensembleFactory(LogisticRegression(), PACC, optim='mae', policy='mae', **common)
classifier – sklearn’s Estimator that generates a classifier
base_quantifier_class – a class of quantifiers
param_grid – a dictionary with the grid of parameters to optimize for
optim – a valid quantification or classification error, or a string name of it
param_model_sel – a dictionary containing any keyword argument to pass to quapy.model_selection.GridSearchQ
kwargs – kwargs for the class Ensemble
an instance of Ensemble
Gets a histogram out of the posterior probabilities (only for the binary case).
posterior_probabilities – array-like of shape (n_instances, 2)
bins – integer
np.ndarray with the relative frequencies for each bin (for the positive class only)
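The computation amounts to binning the positive-class posteriors and normalizing the bin counts; a rough numpy sketch of the idea (illustrative, not the library’s exact implementation):
>>> import numpy as np
>>>
>>> def positive_histogram(posterior_probabilities, bins):
>>>     # keep only the posterior of the positive class (second column)
>>>     positives = posterior_probabilities[:, 1]
>>>     counts, _ = np.histogram(positives, bins=bins, range=(0, 1))
>>>     return counts / counts.sum()   # relative frequencies per bin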
Bases: BaseQuantifier
Generic Distribution Matching quantifier for binary or multiclass quantification based on the space of covariates. This implementation takes the number of bins, the divergence, and whether to work on the CDF as hyperparameters.
nbins – number of bins used to discretize the distributions (default 8)
divergence – a string representing a divergence measure (currently, “HD” and “topsoe” are implemented) or a callable function taking two ndarrays of the same dimension as input (default “HD”, meaning Hellinger Distance)
cdf – whether to use CDF instead of PDF (default False)
n_jobs – number of parallel workers (default None)
Hellinger Distance x (HDx). HDx is a method for training binary quantifiers that models quantification as the problem of minimizing the average divergence (in terms of the Hellinger Distance) across the feature-specific normalized histograms of two representations, one for the unlabelled examples and another generated from the training examples as a mixture model of the class-specific representations. The parameters of the mixture thus represent the estimates of the class prevalence values.
The method computes all matchings for nbins in [10, 20, …, 110] and reports the mean of the median. The best prevalence is searched via linear search, from 0 to 1 stepping by 0.01.
n_jobs – number of parallel workers
an instance of this class set up to mimic the performance of HDx as originally proposed by González-Castro, Alaiz-Rodríguez, Alegre (2013)
Generates the validation distributions out of the training data (covariates). The validation distributions have shape (n, nfeats, nbins), with n the number of classes, nfeats the number of features, and nbins the number of bins. In particular, let V be the validation distributions; then di=V[i] are the distributions obtained from training data labelled with class i; while dij = di[j] is the discrete distribution for feature j in training data labelled with class i, and dij[k] is the fraction of instances with a value in the k-th bin.
data – the training set
Searches for the mixture model parameter (the sought prevalence values) that yields a validation distribution (the mixture) that best matches the test distribution, in terms of the divergence measure of choice. The matching is computed as the average dissimilarity (in terms of the dissimilarity measure of choice) between all feature-specific discrete distributions.
instances – instances in the sample
a vector of class prevalence estimates
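A short usage sketch (assuming this class is DMx from quapy.method.non_aggregative, and that train and test are a LabelledCollection split as in the earlier examples):
>>> from quapy.method.non_aggregative import DMx
>>>
>>> # generic distribution matching on the covariates
>>> dm = DMx(nbins=8, divergence='HD', cdf=False)
>>> dm.fit(train)
>>> dm.quantify(test.X)
>>>
>>> # configuration mimicking the original HDx (binary problems only)
>>> hdx = DMx.HDx(n_jobs=-1).fit(train)
>>> hdx.quantify(test.X)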
Bases: BaseQuantifier
The Maximum Likelihood Prevalence Estimation (MLPE) method is a lazy method that assumes there is no prior probability shift between training and test instances (put another way, that the i.i.d. assumption holds). The estimation of class prevalence values for any test sample is always (i.e., irrespective of the test sample itself) the class prevalence seen during training. This method is considered to be a lower-bound quantifier that any quantification method should beat.
Computes the training prevalence and stores it.
data – the training sample
self
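A minimal sketch (assuming this class is MaximumLikelihoodPrevalenceEstimation from quapy.method.non_aggregative):
>>> from quapy.method.non_aggregative import MaximumLikelihoodPrevalenceEstimation
>>>
>>> mlpe = MaximumLikelihoodPrevalenceEstimation()
>>> mlpe.fit(train)          # memorizes the training prevalence
>>> mlpe.quantify(test.X)    # returns that same prevalence for any sample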
Bases: BaseQuantifier
Trains a quantifier.
data – a quapy.data.base.LabelledCollection consisting of the training data
self
This module allows the composition of quantification methods from loss functions and feature transformations. This functionality is realized through an integration of the qunfold package: https://github.com/mirkobunse/qunfold.
Bases: FunctionLoss
The loss function of RUN (Blobel, 1985).
This loss function models a likelihood function under the assumption of independent Poisson-distributed elements of q with Poisson rates M*p.
Bases: BaseEstimator, ClassifierMixin
An ensemble of classifiers that are trained from cross-validation folds.
All objects of this type have a fixed attribute oob_score = True and, when trained, a fitted attribute self.oob_decision_function_, just like scikit-learn bagging classifiers.
estimator – A classifier that implements the API of scikit-learn.
n_estimators (optional) – The number of stratified cross-validation folds. Defaults to 5.
random_state (optional) – The random state for stratification. Defaults to None.
Examples
Here, we create an instance of ACC that trains a logistic regression classifier with 10 cross-validation folds.
>>> ACC(CVClassifier(LogisticRegression(), 10))
Bases: AbstractTransformer
A classification-based feature transformation.
This transformation can either be probabilistic (using the posterior predictions of a classifier) or crisp (using the class predictions of a classifier). It is used in ACC, PACC, CC, PCC, and SLD.
classifier – A classifier that implements the API of scikit-learn.
is_probabilistic (optional) – Whether probabilistic or crisp predictions of the classifier are used to transform the data. Defaults to False.
fit_classifier (optional) – Whether to fit the classifier when this quantifier is fitted. Defaults to True.
This abstract method has to fit the transformer and to return the transformation of the input data.
Note
Implementations of this abstract method should check the sanity of labels by calling _check_y(y, n_classes) and they must set the property self.p_trn = class_prevalences(y, n_classes).
X – The feature matrix to which this transformer will be fitted.
y – The labels to which this transformer will be fitted.
average (optional) – Whether to return a transfer matrix M or a transformation (f(X), y). Defaults to True.
n_classes (optional) – The number of expected classes. Defaults to None.
A transfer matrix M if average==True or a transformation (f(X), y) if average==False.
This abstract method has to transform the data X.
X – The feature matrix that will be transformed.
average (optional) – Whether to return a vector q or a transformation f(X). Defaults to True.
A vector q = f(X).mean(axis=0) if average==True or a transformation f(X) if average==False.
Bases: AbstractLoss
The weighted sum of multiple losses.
*losses – An arbitrary number of losses to be added together.
weights (optional) – An array of weights with which the losses are scaled.
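For instance, a least-squares data term can be combined with a Tikhonov smoothness term as sketched below (the weight 0.01 is illustrative, and the import path is assumed to be the same composable module used in the other examples of this section):
>>> from qunfold.method.composable import (
>>>     CombinedLoss,
>>>     LeastSquaresLoss,
>>>     TikhonovRegularization,
>>> )
>>>
>>> # weight 1 for the data term, weight 0.01 for the smoothness term
>>> loss = CombinedLoss(LeastSquaresLoss(), TikhonovRegularization(), weights=[1, 0.01])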
A generic quantification / unfolding method that solves a linear system of equations.
This class represents any quantifier that can be described in terms of a loss function, a feature transformation, and a regularization term. In this implementation, the loss is minimized through unconstrained second-order minimization. Valid probability estimates are ensured through a soft-max trick by Bunse (2022).
loss – An instance of a loss class from quapy.method.composable.
transformer – An instance of a transformer class from quapy.method.composable.
solver (optional) – The method argument in scipy.optimize.minimize. Defaults to “trust-ncg”.
solver_options (optional) – The options argument in scipy.optimize.minimize. Defaults to {“gtol”: 1e-8, “maxiter”: 1000}.
seed (optional) – A random number generator seed from which a numpy RandomState is created. Defaults to None.
Examples
Here, we create the ordinal variant of ACC (Bunse et al., 2023). This variant consists of the original feature transformation of ACC and of the original loss of ACC, the latter of which is regularized towards smooth solutions.
>>> from qunfold.method.composable import (
>>>     ComposableQuantifier,
>>>     TikhonovRegularized,
>>>     LeastSquaresLoss,
>>>     ClassTransformer,
>>> )
>>> from sklearn.ensemble import RandomForestClassifier
>>> o_acc = ComposableQuantifier(
>>>     TikhonovRegularized(LeastSquaresLoss(), 0.01),
>>>     ClassTransformer(RandomForestClassifier(oob_score=True))
>>> )
Here, we perform hyper-parameter optimization with the ordinal ACC.
>>> quapy.model_selection.GridSearchQ(
>>>     model = o_acc,
>>>     param_grid = { # try both splitting criteria
>>>         "transformer__classifier__estimator__criterion": ["gini", "entropy"],
>>>     },
>>>     # ...
>>> )
To use a classifier that does not provide the oob_score argument, such as logistic regression, you have to configure a cross validation of this classifier. Here, we employ 10 cross validation folds. 5 folds are the default.
>>> from qunfold.method.composable import CVClassifier
>>> from sklearn.linear_model import LogisticRegression
>>> acc_lr = ComposableQuantifier(
>>>     LeastSquaresLoss(),
>>>     ClassTransformer(CVClassifier(LogisticRegression(), 10))
>>> )
Bases: AbstractTransformer
A distance-based feature transformation, as it is used in EDx and EDy.
metric (optional) – The metric with which the distance between data items is measured. Can take any value that is accepted by scipy.spatial.distance.cdist. Defaults to “euclidean”.
preprocessor (optional) – Another AbstractTransformer that is called before this transformer. Defaults to None.
This abstract method has to fit the transformer and to return the transformation of the input data.
Note
Implementations of this abstract method should check the sanity of labels by calling _check_y(y, n_classes) and they must set the property self.p_trn = class_prevalences(y, n_classes).
X – The feature matrix to which this transformer will be fitted.
y – The labels to which this transformer will be fitted.
average (optional) – Whether to return a transfer matrix M or a transformation (f(X), y). Defaults to True.
n_classes (optional) – The number of expected classes. Defaults to None.
A transfer matrix M if average==True or a transformation (f(X), y) if average==False.
This abstract method has to transform the data X.
X – The feature matrix that will be transformed.
average (optional) – Whether to return a vector q or a transformation f(X). Defaults to True.
A vector q = f(X).mean(axis=0) if average==True or a transformation f(X) if average==False.
Bases: AbstractTransformer
A kernel-based feature transformation, as it is used in KMM, that uses the energy kernel:
k(x_1, x_2) = ||x_1|| + ||x_2|| - ||x_1 - x_2||
Note
The methods of this transformer do not support setting average=False.
preprocessor (optional) – Another AbstractTransformer that is called before this transformer. Defaults to None.
This abstract method has to fit the transformer and to return the transformation of the input data.
Note
Implementations of this abstract method should check the sanity of labels by calling _check_y(y, n_classes) and they must set the property self.p_trn = class_prevalences(y, n_classes).
X – The feature matrix to which this transformer will be fitted.
y – The labels to which this transformer will be fitted.
average (optional) – Whether to return a transfer matrix M or a transformation (f(X), y). Defaults to True.
n_classes (optional) – The number of expected classes. Defaults to None.
A transfer matrix M if average==True or a transformation (f(X), y) if average==False.
This abstract method has to transform the data X.
X – The feature matrix that will be transformed.
average (optional) – Whether to return a vector q or a transformation f(X). Defaults to True.
A vector q = f(X).mean(axis=0) if average==True or a transformation f(X) if average==False.
Bases: FunctionLoss
The loss function of EDx (Kawakubo et al., 2016) and EDy (Castaño et al., 2022).
This loss function represents the Energy Distance between two samples.
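As a sketch, this loss can be paired with the DistanceTransformer documented above to obtain an EDx-like quantifier (illustrative composition; the import path follows the other examples in this section):
>>> from qunfold.method.composable import (
>>>     ComposableQuantifier,
>>>     EnergyLoss,
>>>     DistanceTransformer,
>>> )
>>>
>>> # energy-distance loss over a pairwise-distance representation (EDx-like)
>>> edx_like = ComposableQuantifier(EnergyLoss(), DistanceTransformer(metric="euclidean"))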
Bases: AbstractTransformer
A kernel-based feature transformation, as it is used in KMM, that uses the gaussian kernel:
k(x, y) = exp(-||x - y||^2 / (2σ^2))
sigma (optional) – A smoothing parameter of the kernel function. Defaults to 1.
preprocessor (optional) – Another AbstractTransformer that is called before this transformer. Defaults to None.
This abstract method has to fit the transformer and to return the transformation of the input data.
Note
Implementations of this abstract method should check the sanity of labels by calling _check_y(y, n_classes) and they must set the property self.p_trn = class_prevalences(y, n_classes).
X – The feature matrix to which this transformer will be fitted.
y – The labels to which this transformer will be fitted.
average (optional) – Whether to return a transfer matrix M or a transformation (f(X), y). Defaults to True.
n_classes (optional) – The number of expected classes. Defaults to None.
A transfer matrix M if average==True or a transformation (f(X), y) if average==False.
This abstract method has to transform the data X.
X – The feature matrix that will be transformed.
average (optional) – Whether to return a vector q or a transformation f(X). Defaults to True.
A vector q = f(X).mean(axis=0) if average==True or a transformation f(X) if average==False.
Bases: AbstractTransformer
An efficient approximation of the GaussianKernelTransformer, as it is used in KMM, using random Fourier features.
sigma (optional) – A smoothing parameter of the kernel function. Defaults to 1.
n_rff (optional) – The number of random Fourier features. Defaults to 1000.
preprocessor (optional) – Another AbstractTransformer that is called before this transformer. Defaults to None.
seed (optional) – Controls the randomness of the random Fourier features. Defaults to None.
This abstract method has to fit the transformer and to return the transformation of the input data.
Note
Implementations of this abstract method should check the sanity of labels by calling _check_y(y, n_classes) and they must set the property self.p_trn = class_prevalences(y, n_classes).
X – The feature matrix to which this transformer will be fitted.
y – The labels to which this transformer will be fitted.
average (optional) – Whether to return a transfer matrix M or a transformation (f(X), y). Defaults to True.
n_classes (optional) – The number of expected classes. Defaults to None.
A transfer matrix M if average==True or a transformation (f(X), y) if average==False.
This abstract method has to transform the data X.
X – The feature matrix that will be transformed.
average (optional) – Whether to return a vector q or a transformation f(X). Defaults to True.
A vector q = f(X).mean(axis=0) if average==True or a transformation f(X) if average==False.
Bases: FunctionLoss
The loss function of HDx and HDy (González-Castro et al., 2013).
This loss function computes the average of the squared Hellinger distances between feature-wise (or class-wise) histograms. Note that the original HDx and HDy by González-Castro et al. (2013) do not use the squared but the regular Hellinger distance. Their approach is problematic because the regular distance is not always twice differentiable and, hence, complicates numerical optimizations.
Bases: AbstractTransformer
A histogram-based feature transformation, as it is used in HDx and HDy.
n_bins – The number of bins in each feature.
preprocessor (optional) – Another AbstractTransformer that is called before this transformer. Defaults to None.
unit_scale (optional) – Whether or not to scale each output to a sum of one. A value of False indicates that the sum of each output is the number of features. Defaults to True.
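For example, pairing this transformer with the HellingerSurrogateLoss documented above yields an HDx-like quantifier (illustrative composition; n_bins=10 is arbitrary and the import path follows the other examples in this section):
>>> from qunfold.method.composable import (
>>>     ComposableQuantifier,
>>>     HellingerSurrogateLoss,
>>>     HistogramTransformer,
>>> )
>>>
>>> # feature-wise histograms matched under a squared-Hellinger surrogate loss
>>> hdx_like = ComposableQuantifier(HellingerSurrogateLoss(), HistogramTransformer(n_bins=10))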
This abstract method has to fit the transformer and to return the transformation of the input data.
Note
Implementations of this abstract method should check the sanity of labels by calling _check_y(y, n_classes) and they must set the property self.p_trn = class_prevalences(y, n_classes).
X – The feature matrix to which this transformer will be fitted.
y – The labels to which this transformer will be fitted.
average (optional) – Whether to return a transfer matrix M or a transformation (f(X), y). Defaults to True.
n_classes (optional) – The number of expected classes. Defaults to None.
A transfer matrix M if average==True or a transformation (f(X), y) if average==False.
This abstract method has to transform the data X.
X – The feature matrix that will be transformed.
average (optional) – Whether to return a vector q or a transformation f(X). Defaults to True.
A vector q = f(X).mean(axis=0) if average==True or a transformation f(X) if average==False.
Bases: AbstractTransformer
A general kernel-based feature transformation, as it is used in KMM. If you intend to use a Gaussian kernel or energy kernel, prefer their dedicated and more efficient implementations over this class.
Note
The methods of this transformer do not support setting average=False.
kernel – A callable that will be used as the kernel. Must follow the signature (X[y==i], X[y==j]) -> scalar.
This abstract method has to fit the transformer and to return the transformation of the input data.
Note
Implementations of this abstract method should check the sanity of labels by calling _check_y(y, n_classes) and they must set the property self.p_trn = class_prevalences(y, n_classes).
X – The feature matrix to which this transformer will be fitted.
y – The labels to which this transformer will be fitted.
average (optional) – Whether to return a transfer matrix M or a transformation (f(X), y). Defaults to True.
n_classes (optional) – The number of expected classes. Defaults to None.
A transfer matrix M if average==True or a transformation (f(X), y) if average==False.
This abstract method has to transform the data X.
X – The feature matrix that will be transformed.
average (optional) – Whether to return a vector q or a transformation f(X). Defaults to True.
A vector q = f(X).mean(axis=0) if average==True or a transformation f(X) if average==False.
Bases: KernelTransformer
A kernel-based feature transformation, as it is used in KMM, that uses the laplacian kernel.
sigma (optional) – A smoothing parameter of the kernel function. Defaults to 1.
Bases: FunctionLoss
The loss function of ACC (Forman, 2008), PACC (Bella et al., 2019), and ReadMe (Hopkins & King, 2010).
This loss function computes the sum of squares of element-wise errors between q and M*p.
Bases: AbstractLoss
Tikhonov regularization, as proposed by Blobel (1985).
This regularization promotes smooth solutions. This behavior is often required in ordinal quantification and in unfolding problems.
Add TikhonovRegularization (Blobel, 1985) to any loss.
Calling this function is equivalent to calling
>>> CombinedLoss(loss, TikhonovRegularization(), weights=[1, tau])
loss – An instance from qunfold.losses.
tau (optional) – The regularization strength. Defaults to 0.
An instance of CombinedLoss.
Examples
The regularized loss of RUN (Blobel, 1985) is:
>>> TikhonovRegularized(BlobelLoss(), tau)