updated manuals
parent
8adcc33c59
commit
c26c463f3d
|
|
@ -1,4 +1,4 @@
|
|||
Change Log 0.1.10
|
||||
Change Log 0.2.0
|
||||
-----------------
|
||||
|
||||
CLEAN TODO-FILE
|
||||
|
|
@ -6,7 +6,7 @@ CLEAN TODO-FILE
|
|||
- Base code Refactor:
|
||||
- Removing coupling between LabelledCollection and quantification methods; the fit interface changes:
|
||||
def fit(data:LabelledCollection): -> def fit(X, y):
|
||||
- Adding function "predict" (function "quantify" is still present as an alias)
|
||||
- Adding function "predict" (function "quantify" is still present as an alias, for the nostalgic)
|
||||
- Aggregative methods' behavior in terms of fit_classifier and how to treat the val_split is now
|
||||
indicated exclusively at construction time, and it is no longer possible to indicate it at fit time.
|
||||
This is because, in v<=0.1.9, one could create a method (e.g., ACC) and then indicate:
|
||||
|
|
@ -21,15 +21,16 @@ CLEAN TODO-FILE
|
|||
- A new parameter "on_calib_error" is passed to the constructor, which informs of the policy to follow
|
||||
in case the abstention's calibration functions fail (which happens sometimes). Options include:
|
||||
- 'raise': raises a RuntimeException (default)
|
||||
- 'backup': reruns avoiding calibration
|
||||
- 'backup': reruns by silently avoiding calibration
|
||||
- Parameter "recalib" has been renamed "calib"
|
||||
- Added aggregative bootstrap for deriving confidence regions (confidence intervals, ellipses in the simplex, or
|
||||
ellipses in the CLR space). This method is efficient as it leverages the two phases of the aggregative quantifiers.
|
||||
This method applies resampling only to the aggregation phase, thus avoiding the need to train many quantifiers, or to
|
||||
classify the instances of a sample multiple times. See:
|
||||
- quapy/method/confidence.py (new)
|
||||
- the new example no. 15.
|
||||
- BayesianCC moved to confidence.py, where methods having to do with confidence intervals live
|
||||
- the new example no. 16.confidence_regions.py
|
||||
- BayesianCC moved to confidence.py, where methods having to do with confidence intervals belong.
|
||||
- Improved documentation of qp.plot module.
|
||||
|
||||
|
||||
Change Log 0.1.9
|
||||
|
|
|
|||
|
|
@ -340,10 +340,10 @@ and a set of test samples (for evaluation). QuaPy returns this data as a Labelle
|
|||
(training) and two generation protocols (for validation and test samples), as follows:
|
||||
|
||||
```python
|
||||
training, val_generator, test_generator = fetch_lequa2022(task=task)
|
||||
training, val_generator, test_generator = qp.datasets.fetch_lequa2022(task=task)
|
||||
```
|
||||
|
||||
See the `lequa2022_experiments.py` in the examples folder for further details on how to
|
||||
See the `5a.lequa2022_experiments.py` in the examples folder for further details on how to
|
||||
carry out experiments using these datasets.
|
||||
|
||||
The datasets are downloaded only once, and stored for fast reuse.
|
||||
|
|
@ -365,6 +365,53 @@ Esuli, A., Moreo, A., Sebastiani, F., & Sperduti, G. (2022).
|
|||
A Detailed Overview of LeQua@ CLEF 2022: Learning to Quantify.
|
||||
```
|
||||
|
||||
## LeQua 2024 Datasets
|
||||
|
||||
QuaPy also provides the datasets used for the [LeQua 2024 competition](https://lequa2024.github.io/).
|
||||
In brief, there are 4 tasks:
|
||||
* T1: binary quantification (by sentiment)
|
||||
* T2: multiclass quantification (28 classes, merchandise products)
|
||||
* T3: ordinal quantification (5-star sentiment ratings)
|
||||
* T4: binary sentiment quantification under a combination of covariate shift and prior shift
|
||||
|
||||
In all cases, the covariate space has 256 dimensions (extracted using the `ELECTRA-Small` model).
|
||||
|
||||
Every task consists of a training set, a set of validation samples (for model selection)
|
||||
and a set of test samples (for evaluation). QuaPy returns this data as a LabelledCollection
|
||||
(training bags) and sampling generation protocols (for validation and test bags).
|
||||
T3 also offers the possibility to obtain a series of training bags (in the form of a
|
||||
sampling generation protocol) instead of one single training bag. Use it as follows:
|
||||
|
||||
```python
|
||||
training, val_generator, test_generator = qp.datasets.fetch_lequa2024(task=task)
|
||||
```
|
||||
|
||||
See the `5b.lequa2024_experiments.py` in the examples folder for further details on how to
|
||||
carry out experiments using these datasets.
|
||||
|
||||
The datasets are downloaded only once, and stored for fast reuse.
|
||||
|
||||
Some statistics are summarized below:
|
||||
|
||||
| Dataset | classes | train size | validation samples | test samples | docs by sample | type |
|---------|:-------:|:-----------:|:------------------:|:------------:|:--------------:|:--------:|
| T1 | 2 | 5000 | 1000 | 5000 | 250 | vector |
| T2 | 28 | 20000 | 1000 | 5000 | 1000 | vector |
| T3 | 5 | 100 samples | 1000 | 5000 | 200 | vector |
| T4 | 2 | 5000 | 1000 | 5000 | 250 | vector |
|
||||
|
||||
For further details on the datasets or the competition, we refer to
|
||||
[the official site](https://lequa2024.github.io/data/) and
|
||||
[the overview paper](http://nmis.isti.cnr.it/sebastiani/Publications/LQ2024.pdf).
|
||||
|
||||
```
|
||||
Esuli, A., Moreo, A., Sebastiani, F., & Sperduti, G. (2024).
|
||||
An Overview of LeQua 2024, the 2nd International Data Challenge on Learning to Quantify,
|
||||
Proceedings of the 4th International Workshop on Learning to Quantify (LQ 2024),
|
||||
ECML-PKDD 2024, Vilnius, Lithuania.
|
||||
```
|
||||
|
||||
|
||||
## IFCB Plankton dataset
|
||||
|
||||
IFCB is a dataset of plankton species in water samples hosted in [Zenodo](https://zenodo.org/records/10036244).
|
||||
|
|
@ -410,8 +457,12 @@ Journal of Plankton Research 41 (4), 449-463](https://par.nsf.gov/servlets/purl/
|
|||
|
||||
## Adding Custom Datasets
|
||||
|
||||
It is straightforward to import your own datasets into QuaPy.
|
||||
In what follows, there are some code snippets for doing so; see also the example
|
||||
[3.custom_collection.py](https://github.com/HLT-ISTI/QuaPy/blob/master/examples/3.custom_collection.py).
|
||||
|
||||
QuaPy provides data loaders for simple formats dealing with
|
||||
text, following the format:
|
||||
text; for example, use `qp.data.reader.from_text` for the following format:
|
||||
|
||||
```
|
||||
class-id \t first document's pre-processed text \n
|
||||
|
|
@ -419,13 +470,16 @@ class-id \t second document's pre-processed text \n
|
|||
...
|
||||
```
|
||||
|
||||
and sparse representations of the form:
|
||||
or `qp.data.reader.from_sparse` for sparse representations of the form:
|
||||
|
||||
```
|
||||
{-1, 0, or +1} col(int):val(float) col(int):val(float) ... \n
|
||||
...
|
||||
```
|
||||
|
||||
Both functions return a tuple `X, y` containing a list of strings and the corresponding
|
||||
labels, respectively.
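
As a rough illustration of what a reader for the tab-separated text format has to do, the following dependency-free sketch parses such lines into a pair `X, y`. The function name `parse_labelled_text` is hypothetical (it is not part of QuaPy), and the sketch assumes that class ids are integers.

```python
import io

def parse_labelled_text(stream):
    """Parse lines of the form `class-id<TAB>text` into (instances, labels)."""
    instances, labels = [], []
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, text = line.split("\t", 1)  # split on the first tab only
        instances.append(text)
        labels.append(int(label))  # assumption: class ids are integers
    return instances, labels

# a toy in-memory file standing in for a real path
toy_file = io.StringIO(
    "1\tgreat phone, would buy again\n"
    "0\tbattery died after a week\n"
)
X, y = parse_labelled_text(toy_file)
print(X, y)
```

The returned pair has the same shape as the tuple `X, y` described above: one string per document and one label per document.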
|
||||
|
||||
The code in charge of loading a LabelledCollection is:
|
||||
|
||||
```python
|
||||
|
|
@ -434,12 +488,13 @@ def load(cls, path:str, loader_func:callable):
|
|||
return LabelledCollection(*loader_func(path))
|
||||
```
|
||||
|
||||
indicating that any _loader_func_ (e.g., a user-defined one) which
|
||||
indicating that any `loader_func` (e.g., `from_text`, `from_sparse`, `from_csv`, or a user-defined one) which
|
||||
returns valid arguments for initializing a _LabelledCollection_ object can be used
|
||||
to load any collection. In particular, the _LabelledCollection_ receives as
|
||||
arguments the instances (as an iterable) and the labels (as an iterable) and,
|
||||
additionally, the number of classes can be specified (it would otherwise be
|
||||
inferred from the labels, but that requires at least one positive example for
|
||||
to load any collection. More specifically, the _LabelledCollection_ receives as
|
||||
arguments the _instances_ (iterable) and the _labels_ (iterable) and,
|
||||
optionally, the number of classes (it would be
|
||||
inferred from the labels if not indicated, but this requires at least one
|
||||
positive example for
|
||||
all classes to be present in the collection).
|
||||
|
||||
The same _loader_func_ can be passed to a Dataset, along with two
|
||||
|
|
@ -452,20 +507,23 @@ import quapy as qp
|
|||
train_path = '../my_data/train.dat'
|
||||
test_path = '../my_data/test.dat'
|
||||
|
||||
def my_custom_loader(path):
|
||||
def my_custom_loader(path, **custom_kwargs):
|
||||
with open(path, 'rb') as fin:
|
||||
...
|
||||
return instances, labels
|
||||
|
||||
data = qp.data.Dataset.load(train_path, test_path, my_custom_loader)
|
||||
data = qp.data.Dataset.load(train_path, test_path, my_custom_loader, **custom_kwargs)
|
||||
```
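
To make the role of the extra keyword arguments more concrete, here is a minimal sketch of a loader that takes two of them. The loader name `my_csv_loader`, its parameters `delimiter` and `label_column`, and the file layout (one instance per row, numeric features, integer label) are all assumptions made for illustration. The sketch only exercises the loader itself and does not call QuaPy.

```python
import csv
import os
import tempfile

def my_csv_loader(path, delimiter=",", label_column=0):
    """Hypothetical loader: one row per instance, label stored in `label_column`."""
    instances, labels = [], []
    with open(path, newline="") as fin:
        for row in csv.reader(fin, delimiter=delimiter):
            if not row:
                continue  # skip empty rows
            labels.append(int(row[label_column]))
            instances.append(
                [float(v) for i, v in enumerate(row) if i != label_column]
            )
    return instances, labels

# write a tiny file and load it back with a custom keyword argument
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.dat")
    with open(path, "w", newline="") as fout:
        fout.write("1;0.5;2.0\n0;1.5;0.25\n")
    instances, labels = my_csv_loader(path, delimiter=";")

print(instances, labels)
```

Any function with this shape (a path in, a pair `instances, labels` out) fits the loader contract described above.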
|
||||
|
||||
### Data Processing
|
||||
|
||||
QuaPy implements a number of preprocessing functions in the package _qp.data.preprocessing_, including:
|
||||
QuaPy implements a number of preprocessing functions in the package `qp.data.preprocessing`, including:
|
||||
|
||||
* _text2tfidf_: tfidf vectorization
|
||||
* _reduce_columns_: reducing the number of columns based on term frequency
|
||||
* _standardize_: transforms the column values into z-scores (i.e., subtracts the mean and divides by the standard deviation, so
|
||||
that the column values have zero mean and unit variance).
|
||||
* _index_: transforms textual tokens into lists of numeric ids
|
||||
|
||||
These functions are applied to `Dataset` objects, and offer the possibility to apply the transformation
|
||||
inline (thus modifying the original dataset), or to return a modified copy.
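
The z-score transformation mentioned in the list above can be sketched in plain Python as follows. This is an illustration of the arithmetic only, not QuaPy's implementation: the helper name `standardize_columns` is hypothetical, the population standard deviation is an assumption, and the function returns a modified copy rather than working inline.

```python
from statistics import mean, pstdev

def standardize_columns(rows):
    """Return a copy of `rows` in which every column has zero mean and unit variance."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    # assumption: population standard deviation; constant columns are left unscaled
    sigmas = [pstdev(col) or 1.0 for col in columns]
    return [
        [(v - mu) / sd for v, mu, sd in zip(row, mus, sigmas)]
        for row in rows
    ]

X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
Z = standardize_columns(X)

for col in zip(*Z):
    print(round(mean(col), 6), round(pstdev(col), 6))
```

The original list `X` is left untouched, which mirrors the "return a modified copy" option.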
|
||||
|
|
@ -46,18 +46,18 @@ e.g.:
|
|||
|
||||
```python
|
||||
qp.environ['SAMPLE_SIZE'] = 100 # once for all
|
||||
true_prev = np.asarray([0.5, 0.3, 0.2]) # let's assume 3 classes
|
||||
estim_prev = np.asarray([0.1, 0.3, 0.6])
|
||||
true_prev = [0.5, 0.3, 0.2] # let's assume 3 classes
|
||||
estim_prev = [0.1, 0.3, 0.6]
|
||||
error = qp.error.mrae(true_prev, estim_prev)
|
||||
print(f'mrae({true_prev}, {estim_prev}) = {error:.3f}')
|
||||
```
|
||||
|
||||
will print:
|
||||
```
|
||||
mrae([0.500, 0.300, 0.200], [0.100, 0.300, 0.600]) = 0.914
|
||||
mrae([0.5, 0.3, 0.2], [0.1, 0.3, 0.6]) = 0.914
|
||||
```
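
The value 0.914 can be reproduced by hand. The sketch below assumes that the prevalence vectors are smoothed additively with `eps = 1 / (2 * SAMPLE_SIZE)` before the relative errors are computed; the helper names are hypothetical and the code is a plain-Python illustration, not the library's implementation.

```python
def smooth(prevalence, eps):
    """Additive smoothing that keeps the vector summing to one."""
    n_classes = len(prevalence)
    return [(p + eps) / (1.0 + eps * n_classes) for p in prevalence]

def mrae(true_prev, estim_prev, sample_size):
    """Mean relative absolute error between two prevalence vectors."""
    eps = 1.0 / (2.0 * sample_size)  # assumption: smoothing factor tied to the sample size
    true_s = smooth(true_prev, eps)
    estim_s = smooth(estim_prev, eps)
    rel_errors = [abs(e - t) / t for t, e in zip(true_s, estim_s)]
    return sum(rel_errors) / len(rel_errors)

error = mrae([0.5, 0.3, 0.2], [0.1, 0.3, 0.6], sample_size=100)
print(f"{error:.3f}")  # prints 0.914
```

Smoothing is what keeps the relative error finite when a true prevalence is zero, which is why the sample size has to be known.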
|
||||
|
||||
Finally, it is possible to instantiate QuaPy's quantification
|
||||
It is also possible to instantiate QuaPy's quantification
|
||||
error functions from strings using, e.g.:
|
||||
|
||||
```python
|
||||
|
|
@ -85,7 +85,7 @@ print(f'MAE = {mae:.4f}')
|
|||
```
|
||||
|
||||
It is often desirable to evaluate our system using more than one
|
||||
single evaluatio measure. In this case, it is convenient to generate
|
||||
single evaluation measure. In this case, it is convenient to generate
|
||||
a _report_. A report in QuaPy is a dataframe accounting for all the
|
||||
true prevalence values with their corresponding prevalence values
|
||||
as estimated by the quantifier, along with the error each has given
|
||||
|
|
@ -104,7 +104,7 @@ report['estim-prev'] = report['estim-prev'].map(F.strprev)
|
|||
print(report)
|
||||
|
||||
print('Averaged values:')
|
||||
print(report.mean())
|
||||
print(report.mean(numeric_only=True))
|
||||
```
|
||||
|
||||
This will produce an output like:
|
||||
|
|
@ -141,11 +141,14 @@ true_prevs, estim_prevs = qp.evaluation.prediction(quantifier, protocol=prot)
|
|||
|
||||
All the evaluation functions implement specific optimizations for speeding-up
|
||||
the evaluation of aggregative quantifiers (i.e., of instances of _AggregativeQuantifier_).
|
||||
|
||||
The optimization comes down to generating classification predictions (either crisp or soft)
|
||||
only once for the entire test set, and then applying the sampling procedure to the
|
||||
predictions, instead of generating samples of instances and then computing the
|
||||
classification predictions every time. This is only possible when the protocol
|
||||
is an instance of _OnLabelledCollectionProtocol_. The optimization is only
|
||||
is an instance of _OnLabelledCollectionProtocol_.
|
||||
|
||||
The optimization is only
|
||||
carried out when the number of classification predictions thus generated would be
|
||||
smaller than the number of predictions required for the entire protocol; e.g.,
|
||||
if the original dataset contains 1M instances, but the protocol is such that it would
|
||||
|
|
@ -156,4 +159,4 @@ precompute all the predictions irrespectively of the number of instances and num
|
|||
Finally, this can be deactivated by setting _aggr_speedup=False_. Note that this optimization
|
||||
is not only applied for the final evaluation, but also for the internal evaluations carried
|
||||
out during _model selection_. Since these are typically many, the heuristic can help reduce the
|
||||
execution time a lot.
|
||||
execution time significantly.
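
The idea behind this optimization can be sketched without QuaPy. In the toy example below, a hypothetical `CountingClassifier` records how many instances it labels; samples are represented as lists of indexes into the test set. Classifying every sample separately and classifying the test set once lead to the same estimates, but to very different numbers of classifier calls.

```python
import random

class CountingClassifier:
    """Toy hard classifier that counts how many instances it has labelled."""
    def __init__(self):
        self.calls = 0

    def predict(self, X):
        self.calls += len(X)
        return [1 if x > 0.5 else 0 for x in X]

def prevalence(labels):
    """Fraction of positive predictions (classify & count)."""
    return sum(labels) / len(labels)

rng = random.Random(0)
test_X = [rng.random() for _ in range(1000)]
# 50 samples of 100 instances each, stored as indexes into the test set
samples = [[rng.randrange(len(test_X)) for _ in range(100)] for _ in range(50)]

# naive strategy: classify the instances of every sample
naive = CountingClassifier()
naive_estimates = [
    prevalence(naive.predict([test_X[i] for i in idx])) for idx in samples
]

# optimized strategy: classify the test set once, then sample the predictions
fast = CountingClassifier()
cached = fast.predict(test_X)
fast_estimates = [prevalence([cached[i] for i in idx]) for idx in samples]

print(naive.calls, fast.calls)
```

Here the protocol requires 5000 predictions while the test set has only 1000 instances, so precomputing pays off; in the opposite situation it would not, which is the reason for the heuristic described above.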
|
||||
|
|
@ -1,7 +1,7 @@
|
|||
# Quantification Methods
|
||||
|
||||
Quantification methods can be categorized as belonging to
|
||||
`aggregative` and `non-aggregative` groups.
|
||||
`aggregative`, `non-aggregative`, and `meta-learning` groups.
|
||||
Most methods included in QuaPy at the moment are of type `aggregative`
|
||||
(though we plan to add many more methods in the near future), i.e.,
|
||||
are methods characterized by the fact that
|
||||
|
|
@ -12,21 +12,17 @@ Any quantifier in QuaPy shoud extend the class `BaseQuantifier`,
|
|||
and implement some abstract methods:
|
||||
```python
|
||||
@abstractmethod
|
||||
def fit(self, data: LabelledCollection): ...
|
||||
def fit(self, X, y): ...
|
||||
|
||||
@abstractmethod
|
||||
def quantify(self, instances): ...
|
||||
def predict(self, X): ...
|
||||
```
|
||||
The meaning of those functions should be familiar to those
|
||||
used to work with scikit-learn since the class structure of QuaPy
|
||||
is directly inspired by scikit-learn's _Estimators_. Functions
|
||||
`fit` and `quantify` are used to train the model and to provide
|
||||
class estimations (the reason why
|
||||
scikit-learn' structure has not been adopted _as is_ in QuaPy responds to
|
||||
the fact that scikit-learn's `predict` function is expected to return
|
||||
one output for each input element --e.g., a predicted label for each
|
||||
instance in a sample-- while in quantification the output for a sample
|
||||
is one single array of class prevalences).
|
||||
`fit` and `predict` (for which there is an alias `quantify`)
|
||||
are used to train the model and to provide
|
||||
class estimations.
|
||||
Quantifiers also extend from scikit-learn's `BaseEstimator`, in order
|
||||
to simplify the use of `set_params` and `get_params` used in
|
||||
[model selection](./model-selection).
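
To illustrate the `fit(X, y)` / `predict(X)` contract, here is a deliberately trivial quantifier that ignores the test data and returns the class prevalence observed during training. The class name is hypothetical, and the class does not extend `BaseQuantifier` so that the snippet runs without QuaPy installed; a real quantifier should extend it as explained above.

```python
from collections import Counter

class TrainingPrevalenceQuantifier:
    """Toy quantifier: always predicts the prevalence seen in the training labels."""

    def fit(self, X, y):
        counts = Counter(y)
        self.classes_ = sorted(counts)
        self.prevalence_ = [counts[c] / len(y) for c in self.classes_]
        return self

    def predict(self, X):
        # one prevalence vector for the whole sample, not one output per instance
        return list(self.prevalence_)

    quantify = predict  # alias, mirroring the naming used in the manual

model = TrainingPrevalenceQuantifier().fit(
    [[0.1], [0.4], [0.9], [0.7]], [0, 0, 1, 0]
)
estim = model.predict([[0.2], [0.8]])
print(estim)  # prints [0.75, 0.25]
```

Note that `predict` returns a single array of class prevalence values for the sample, which is the main difference with respect to a classifier.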
|
||||
|
|
@ -40,21 +36,26 @@ The methods that any `aggregative` quantifier must implement are:
|
|||
|
||||
```python
|
||||
@abstractmethod
|
||||
def aggregation_fit(self, classif_predictions: LabelledCollection, data: LabelledCollection):
|
||||
def aggregation_fit(self, classif_predictions, labels):
|
||||
|
||||
@abstractmethod
|
||||
def aggregate(self, classif_predictions:np.ndarray): ...
|
||||
def aggregate(self, classif_predictions): ...
|
||||
```
|
||||
|
||||
These two functions replace the `fit` and `quantify` methods, since those
|
||||
come with default implementations. The `fit` function is provided and amounts to:
|
||||
The argument `classif_predictions` is whatever the method `classify` returns.
|
||||
QuaPy comes with default implementations that cover most common cases, but you can
|
||||
override `classify` in case your method requires further or different information to work.
|
||||
|
||||
These two functions replace the `fit` and `predict` methods, which
|
||||
come with default implementations. For instance, the `fit` function is
|
||||
provided and amounts to:
|
||||
|
||||
```python
|
||||
def fit(self, data: LabelledCollection, fit_classifier=True, val_split=None):
|
||||
self._check_init_parameters()
|
||||
classif_predictions = self.classifier_fit_predict(data, fit_classifier, predict_on=val_split)
|
||||
self.aggregation_fit(classif_predictions, data)
|
||||
return self
|
||||
def fit(self, X, y):
|
||||
self._check_init_parameters()
|
||||
classif_predictions, labels = self.classifier_fit_predict(X, y)
|
||||
self.aggregation_fit(classif_predictions, labels)
|
||||
return self
|
||||
```
|
||||
|
||||
Note that this function fits the classifier, and generates the predictions. This is assumed
|
||||
|
|
@ -72,11 +73,11 @@ overriden (if needed) and allows the method to quickly raise any exception based
|
|||
found in the `__init__` arguments, thus avoiding to break after training the classifier and generating
|
||||
predictions.
|
||||
|
||||
Similarly, the function `quantify` is provided, and amounts to:
|
||||
Similarly, the function `predict` (alias `quantify`) is provided, and amounts to:
|
||||
|
||||
```python
|
||||
def quantify(self, instances):
|
||||
classif_predictions = self.classify(instances)
|
||||
def predict(self, X):
|
||||
classif_predictions = self.classify(X)
|
||||
return self.aggregate(classif_predictions)
|
||||
```
|
||||
|
||||
|
|
@ -84,12 +85,14 @@ in which only the function `aggregate` is required to be overriden in most cases
|
|||
|
||||
Aggregative quantifiers are expected to maintain a classifier (which is
|
||||
accessed through the `@property` `classifier`). This classifier is
|
||||
given as input to the quantifier, and can be already fit
|
||||
on external data (in which case, the `fit_learner` argument should
|
||||
be set to False), or be fit by the quantifier's fit (default).
|
||||
given as input to the quantifier, and will be trained by the quantifier's fit (default).
|
||||
Alternatively, the classifier can already be fit on external data; in this case, the `fit_classifier`
|
||||
argument in the `__init__` should be set to False (see [4.using_pretrained_classifier.py](https://github.com/HLT-ISTI/QuaPy/blob/master/examples/4.using_pretrained_classifier.py)
|
||||
for a full code example).
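
The division of labour between classification and aggregation can be sketched with a toy Classify & Count. The class below is a dependency-free illustration (its name and its threshold-based `classify` are assumptions); it does not extend QuaPy's `AggregativeQuantifier` and only mimics the method names discussed in this section.

```python
class ToyClassifyAndCount:
    """Sketch of the aggregative pattern: classify first, then aggregate."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # stands in for an already trained classifier

    def classify(self, X):
        # step (i): crisp per-instance predictions
        return [1 if x >= self.threshold else 0 for x in X]

    def aggregation_fit(self, classif_predictions, labels):
        # Classify & Count learns nothing at the aggregation stage;
        # only the set of classes is recorded
        self.classes_ = sorted(set(labels))
        return self

    def aggregate(self, classif_predictions):
        # step (ii): turn per-instance predictions into class prevalence values
        n = len(classif_predictions)
        return [classif_predictions.count(c) / n for c in self.classes_]

    def fit(self, X, y):
        self.aggregation_fit(self.classify(X), y)
        return self

    def predict(self, X):
        return self.aggregate(self.classify(X))

model = ToyClassifyAndCount().fit([0.1, 0.9, 0.8, 0.3], [0, 1, 1, 0])
estim_prevalence = model.predict([0.2, 0.6, 0.7, 0.95])
print(estim_prevalence)  # prints [0.25, 0.75]
```

Since `aggregate` only consumes predictions, it can be re-run on cached predictions as many times as needed without invoking the classifier again.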
|
||||
|
||||
The above patterns (in training: fit the classifier, then fit the aggregation;
|
||||
in test: classify, then aggregate) allows QuaPy to optimize many internal procedures.
|
||||
The above patterns (in training: (i) fit the classifier, then (ii) fit the aggregation;
|
||||
in test: (i) classify, then (ii) aggregate) allow QuaPy to optimize many internal procedures,
|
||||
on the grounds that steps (i) are slower than steps (ii).
|
||||
In particular, the model selection routine takes advantage of this two-step process
|
||||
and generates classifiers only for the valid combinations of hyperparameters of the
|
||||
classifier, and then _clones_ these classifiers and explores the combinations
|
||||
|
|
@ -124,6 +127,7 @@ import quapy.functional as F
|
|||
from sklearn.svm import LinearSVC
|
||||
|
||||
training, test = qp.datasets.fetch_twitter('hcr', pickle=True).train_test
|
||||
Xtr, ytr = training.Xy
|
||||
|
||||
# instantiate a classifier learner, in this case a SVM
|
||||
svm = LinearSVC()
|
||||
|
|
@ -131,7 +135,7 @@ svm = LinearSVC()
|
|||
# instantiate a Classify & Count with the SVM
|
||||
# (an alias is available in qp.method.aggregative.ClassifyAndCount)
|
||||
model = qp.method.aggregative.CC(svm)
|
||||
model.fit(training)
|
||||
model.fit(Xtr, ytr)
|
||||
estim_prevalence = model.predict(test.instances)
|
||||
```
|
||||
|
||||
|
|
@ -153,26 +157,14 @@ predictions. This parameters can also be set with an integer,
|
|||
indicating that the parameters should be estimated by means of
|
||||
_k_-fold cross-validation, for which the integer indicates the
|
||||
number _k_ of folds (the default value is 5). Finally, `val_split` can be set to a
|
||||
specific held-out validation set (i.e., an instance of `LabelledCollection`).
|
||||
|
||||
The specification of `val_split` can be
|
||||
postponed to the invokation of the fit method (if `val_split` was also
|
||||
set in the constructor, the one specified at fit time would prevail),
|
||||
e.g.:
|
||||
|
||||
```python
|
||||
model = qp.method.aggregative.ACC(svm)
|
||||
# perform 5-fold cross validation for estimating ACC's parameters
|
||||
# (overrides the default val_split=0.4 in the constructor)
|
||||
model.fit(training, val_split=5)
|
||||
```
|
||||
specific held-out validation set (i.e., a tuple `(X,y)`).
|
||||
|
||||
The following code illustrates the case in which PCC is used:
|
||||
|
||||
```python
|
||||
model = qp.method.aggregative.PCC(svm)
|
||||
model.fit(training)
|
||||
estim_prevalence = model.predict(test.instances)
|
||||
model.fit(Xtr, ytr)
|
||||
estim_prevalence = model.predict(test.X)
|
||||
print('classifier:', model.classifier)
|
||||
```
|
||||
In this case, QuaPy will print:
|
||||
|
|
@ -185,11 +177,11 @@ is not a probabilistic classifier (i.e., it does not implement the
|
|||
`predict_proba` method) and so, the classifier will be converted to
|
||||
a probabilistic one through [calibration](https://scikit-learn.org/stable/modules/calibration.html).
|
||||
As a result, the classifier that is printed in the second line points
|
||||
to a `CalibratedClassifier` instance. Note that calibration can only
|
||||
be applied to hard classifiers when `fit_learner=True`; an exception
|
||||
to a `CalibratedClassifierCV` instance. Note that calibration can only
|
||||
be applied to hard classifiers if `fit_classifier=True`; an exception
|
||||
will be raised otherwise.
|
||||
|
||||
Lastly, everything we said aboud ACC and PCC
|
||||
Lastly, everything we said about ACC and PCC
|
||||
applies to PACC as well.
|
||||
|
||||
_New in v0.1.9_: quantifiers ACC and PACC now have three additional arguments: `method`, `solver` and `norm`:
|
||||
|
|
@ -259,22 +251,28 @@ An example of use can be found below:
|
|||
import quapy as qp
|
||||
from sklearn.linear_model import LogisticRegression
|
||||
|
||||
dataset = qp.datasets.fetch_twitter('hcr', pickle=True)
|
||||
train, test = qp.datasets.fetch_twitter('hcr', pickle=True).train_test
|
||||
|
||||
model = qp.method.aggregative.EMQ(LogisticRegression())
|
||||
model.fit(dataset.training)
|
||||
estim_prevalence = model.predict(dataset.test.instances)
|
||||
model.fit(*train.Xy)
|
||||
estim_prevalence = model.predict(test.X)
|
||||
```
|
||||
|
||||
_New in v0.1.7_: EMQ now accepts two new parameters in the construction method, namely
|
||||
`exact_train_prev` which allows to use the true training prevalence as the departing
|
||||
prevalence estimation (default behaviour), or instead an approximation of it as
|
||||
EMQ accepts additional parameters in the construction method:
|
||||
* `exact_train_prev`: set to True for using the true training prevalence as the departing
|
||||
prevalence estimation (default behaviour), or to False for using an approximation of it as
|
||||
suggested by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html)
|
||||
(by setting `exact_train_prev=False`).
|
||||
The other parameter is `recalib` which allows to indicate a calibration method, among those
|
||||
* `calib`: allows to indicate a calibration method, among those
|
||||
proposed by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html),
|
||||
including the Bias-Corrected Temperature Scaling, Vector Scaling, etc.
|
||||
See the API documentation for further details.
|
||||
including the Bias-Corrected Temperature Scaling
|
||||
(`bcts`), Vector Scaling (`vs`), No-Bias Vector Scaling (`nbvs`),
|
||||
or Temperature Scaling (`ts`); default is `None` (no calibration).
|
||||
* `on_calib_error`: indicates the policy to follow in case the calibrator fails at runtime.
|
||||
Options include `raise` (default), in which case a RuntimeException is raised; and `backup`, in which
|
||||
case the calibrator is silently skipped.
|
||||
|
||||
You can use the class method `EMQ_BCTS` to effortlessly instantiate EMQ with the best performing
|
||||
heuristics found by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html). See the API documentation for further details.
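
For readers unfamiliar with the method, the expectation-maximization update that EMQ builds on can be sketched in a few lines. The code below is a plain-Python illustration under simplifying assumptions (no calibration, the training prevalence used as the starting point, a fixed tolerance); the function name `em_prevalence` is hypothetical and this is not QuaPy's implementation.

```python
def em_prevalence(posteriors, train_prev, max_iter=1000, tol=1e-6):
    """Iteratively re-estimate class prevalence from classifier posteriors."""
    n_classes = len(train_prev)
    qs = list(train_prev)  # departing estimate: the training prevalence
    for _ in range(max_iter):
        # E-step: rescale each posterior by the ratio between current and training priors
        rescaled = []
        for post in posteriors:
            weighted = [p * q / t for p, q, t in zip(post, qs, train_prev)]
            total = sum(weighted)
            rescaled.append([w / total for w in weighted])
        # M-step: the new prevalence is the mean of the rescaled posteriors
        new_qs = [
            sum(r[c] for r in rescaled) / len(rescaled) for c in range(n_classes)
        ]
        converged = max(abs(a - b) for a, b in zip(new_qs, qs)) < tol
        qs = new_qs
        if converged:
            break
    return qs

# classifier trained on balanced data; most test posteriors favour class 1
posteriors = [[0.1, 0.9]] * 8 + [[0.9, 0.1]] * 2
estim = em_prevalence(posteriors, train_prev=[0.5, 0.5])
print([round(q, 3) for q in estim])
```

In this toy case the plain average of the posteriors would give 0.74 for the second class, whereas the iterative update moves the estimate further away from the training prevalence.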
|
||||
|
||||
|
||||
### Hellinger Distance y (HDy)
|
||||
|
|
@ -289,11 +287,10 @@ This method works with a probabilistic classifier (hard classifiers
|
|||
can be used as well and will be calibrated) and requires a validation
|
||||
set to estimate the parameters of the mixture model. Just like
|
||||
ACC and PACC, this quantifier receives a `val_split` argument
|
||||
in the constructor (or in the fit method, in which case the previous
|
||||
value is overridden) that can either be a float indicating the proportion
|
||||
in the constructor that can either be a float indicating the proportion
|
||||
of training data to be taken as the validation set (in a random
|
||||
stratified split), or a validation set (i.e., an instance of
|
||||
`LabelledCollection`) itself.
|
||||
stratified split), or the validation set itself (i.e., a tuple
|
||||
`(X,y)`).
|
||||
|
||||
HDy was proposed as a binary quantifier and the implementation
|
||||
provided in QuaPy accepts only binary datasets.
|
||||
|
|
@ -309,11 +306,11 @@ dataset = qp.datasets.fetch_reviews('hp', pickle=True)
|
|||
qp.data.preprocessing.text2tfidf(dataset, min_df=5, inplace=True)
|
||||
|
||||
model = qp.method.aggregative.HDy(LogisticRegression())
|
||||
model.fit(dataset.training)
|
||||
estim_prevalence = model.predict(dataset.test.instances)
|
||||
model.fit(*dataset.training.Xy)
|
||||
estim_prevalence = model.predict(dataset.test.X)
|
||||
```
|
||||
|
||||
_New in v0.1.7:_ QuaPy now provides an implementation of the generalized
|
||||
QuaPy also provides an implementation of the generalized
|
||||
"Distribution Matching" approaches for multiclass, inspired by the framework
|
||||
of [Firat (2016)](https://arxiv.org/abs/1606.00868). One can instantiate
|
||||
a variant of HDy for multiclass quantification as follows:
|
||||
|
|
@ -322,17 +319,22 @@ a variant of HDy for multiclass quantification as follows:
|
|||
multiclassHDy = qp.method.aggregative.DMy(classifier=LogisticRegression(), divergence='HD', cdf=False)
|
||||
```
|
||||
|
||||
_New in v0.1.7:_ QuaPy now provides an implementation of the "DyS"
|
||||
QuaPy also provides an implementation of the "DyS"
|
||||
framework proposed by [Maletzke et al (2020)](https://ojs.aaai.org/index.php/AAAI/article/view/4376)
|
||||
and the "SMM" method proposed by [Hassan et al (2019)](https://ieeexplore.ieee.org/document/9260028)
|
||||
(thanks to _Pablo González_ for the contributions!)
|
||||
|
||||
### Threshold Optimization methods
|
||||
|
||||
_New in v0.1.7:_ QuaPy now implements Forman's threshold optimization methods;
|
||||
QuaPy implements Forman's threshold optimization methods;
|
||||
see, e.g., [(Forman 2006)](https://dl.acm.org/doi/abs/10.1145/1150402.1150423)
|
||||
and [(Forman 2008)](https://link.springer.com/article/10.1007/s10618-008-0097-y).
|
||||
These include: T50, MAX, X, Median Sweep (MS), and its variant MS2.
|
||||
These include: `T50`, `MAX`, `X`, Median Sweep (`MS`), and its variant `MS2`.
|
||||
|
||||
These methods are binary-only and implement different heuristics for
|
||||
improving the stability of the denominator of the ACC adjustment (`tpr-fpr`).
|
||||
The methods are called "threshold" since said heuristics have to do
|
||||
with different choices of the underlying classifier's threshold.
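
A small numeric example may help to see why the denominator matters. The sketch below implements the binary ACC adjustment `(cc - fpr) / (tpr - fpr)`; the function name, the clipping to `[0, 1]`, and the fallback used when the denominator is zero are choices made for this illustration only.

```python
def acc_adjust(cc_estimate, tpr, fpr):
    """Binary ACC adjustment of a classify & count estimate, clipped to [0, 1]."""
    denominator = tpr - fpr
    if denominator == 0:
        return cc_estimate  # adjustment undefined; fall back to the raw estimate
    return min(1.0, max(0.0, (cc_estimate - fpr) / denominator))

# a well-separated classifier: tpr - fpr = 0.80
well_a = acc_adjust(0.40, tpr=0.90, fpr=0.10)  # about 0.375
well_b = acc_adjust(0.42, tpr=0.90, fpr=0.10)  # about 0.400

# a poorly separated classifier: tpr - fpr = 0.14
ill_a = acc_adjust(0.40, tpr=0.52, fpr=0.38)   # about 0.143
ill_b = acc_adjust(0.42, tpr=0.52, fpr=0.38)   # about 0.286

print(round(well_b - well_a, 3), round(ill_b - ill_a, 3))
```

The same change of 0.02 in the raw estimate moves the adjusted value by about 0.025 in the first case and by about 0.143 in the second, which is the instability that the threshold selection heuristics try to mitigate.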
|
||||
|
||||
### Explicit Loss Minimization
|
||||
|
||||
|
|
@ -415,16 +417,18 @@ model.fit(dataset.training)
|
|||
estim_prevalence = model.predict(dataset.test.instances)
|
||||
```
|
||||
|
||||
Check the examples on [explicit_loss_minimization](https://github.com/HLT-ISTI/QuaPy/blob/devel/examples/5.explicit_loss_minimization.py)
|
||||
Check the examples on [explicit loss minimization](https://github.com/HLT-ISTI/QuaPy/blob/devel/examples/17.explicit_loss_minimization.py)
|
||||
and on [one versus all quantification](https://github.com/HLT-ISTI/QuaPy/blob/devel/examples/10.one_vs_all.py) for more details.
|
||||
**Note** that the _one versus all_ approach is considered inappropriate under prior probability shift, though.
|
||||
|
||||
### Kernel Density Estimation methods (KDEy)
|
||||
|
||||
_New in v0.1.8_: QuaPy now provides implementations for the three variants
|
||||
QuaPy provides implementations for the three variants
|
||||
of KDE-based methods proposed in
|
||||
_[Moreo, A., González, P. and del Coz, J.J., 2023.
|
||||
_[Moreo, A., González, P. and del Coz, J.J..
|
||||
Kernel Density Estimation for Multiclass Quantification.
|
||||
arXiv preprint arXiv:2401.00490](https://arxiv.org/abs/2401.00490)_.
|
||||
Machine Learning. Vol 114 (92), 2025](https://link.springer.com/article/10.1007/s10994-024-06726-5)_
|
||||
(a [preprint](https://arxiv.org/abs/2401.00490) is available online).
|
||||
The variants differ in the divergence metric to be minimized:
|
||||
|
||||
- KDEy-HD: minimizes the (squared) Hellinger Distance and solves the problem via a Monte Carlo approach
|
||||
|
|
@ -435,22 +439,27 @@ These methods are specifically devised for multiclass problems (although they ca
|
|||
binary problems too).
|
||||
|
||||
All KDE-based methods depend on the hyperparameter `bandwidth` of the kernel. Typical values
|
||||
that can be explored in model selection range in [0.01, 0.25]. The methods' performance
|
||||
vary smoothing with smooth variations of this hyperparameter.
|
||||
that can be explored in model selection range in [0.01, 0.25]. Previous experiments reveal the methods' performance
|
||||
varies smoothly with small variations of this hyperparameter.
|
||||
|
||||
|
||||
## Composable Methods
|
||||
|
||||
The [](quapy.method.composable) module allows the composition of quantification methods from loss functions and feature transformations. Any composed method solves a linear system of equations by minimizing the loss after transforming the data. Methods of this kind include ACC, PACC, HDx, HDy, and many other well-known methods, as well as an unlimited number of re-combinations of their building blocks.
|
||||
The `quapy.method.composable` module integrates [qunfold](https://github.com/mirkobunse/qunfold), which allows the composition
|
||||
of quantification methods from loss functions and feature transformations (thanks to Mirko Bunse for the integration!).
|
||||
|
||||
Any composed method solves a linear system of equations by minimizing the loss after transforming the data. Methods of this kind include ACC, PACC, HDx, HDy, and many other well-known methods, as well as an unlimited number of re-combinations of their building blocks.
|
||||
|
||||
### Installation
|
||||
|
||||
```sh
|
||||
pip install --upgrade pip setuptools wheel
|
||||
pip install "jax[cpu]"
|
||||
pip install "qunfold @ git+https://github.com/mirkobunse/qunfold@v0.1.4"
|
||||
pip install "qunfold @ git+https://github.com/mirkobunse/qunfold@v0.1.5"
|
||||
```
|
||||
|
||||
**Note:** since version 0.2.0, QuaPy is only compatible with qunfold >=0.1.5.
|
||||
|
||||
### Basics
|
||||
|
||||
The composition of a method is implemented through the [](quapy.method.composable.ComposableQuantifier) class. Its documentation also features an example to get you started in composing your own methods.
|
||||
|
|
@ -529,10 +538,11 @@ from quapy.method.meta import Ensemble
|
|||
from sklearn.linear_model import LogisticRegression
|
||||
|
||||
dataset = qp.datasets.fetch_UCIBinaryDataset('haberman')
|
||||
train, test = dataset.train_test
|
||||
|
||||
model = Ensemble(quantifier=ACC(LogisticRegression()), size=30, policy='ave', n_jobs=-1)
|
||||
model.fit(dataset.training)
|
||||
estim_prevalence = model.predict(dataset.test.instances)
|
||||
model.fit(*train.Xy)
|
||||
estim_prevalence = model.predict(test.X)
|
||||
```
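
Assuming that the `ave` policy amounts to averaging the prevalence vectors returned by the ensemble members, the aggregation step can be sketched as follows. The helper `average_policy` and the member estimates are made up for the example.

```python
def average_policy(member_estimates):
    """Class-wise average of the prevalence vectors returned by the members."""
    n_members = len(member_estimates)
    return [sum(column) / n_members for column in zip(*member_estimates)]

# hypothetical estimates returned by three ensemble members for one test sample
member_estimates = [[0.30, 0.70], [0.40, 0.60], [0.35, 0.65]]
ensemble_estimate = average_policy(member_estimates)
print([round(p, 2) for p in ensemble_estimate])
```

Because every member returns a vector that sums to one, the class-wise average is itself a valid prevalence vector.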
|
||||
|
||||
Other aggregation policies implemented in QuaPy include:
|
||||
|
|
@ -579,13 +589,13 @@ learner = NeuralClassifierTrainer(cnn, device='cuda')
|
|||
|
||||
# train QuaNet
|
||||
model = QuaNet(learner, device='cuda')
|
||||
model.fit(dataset.training)
|
||||
estim_prevalence = model.predict(dataset.test.instances)
|
||||
model.fit(*dataset.training.Xy)
|
||||
estim_prevalence = model.predict(dataset.test.X)
|
||||
```
|
||||
|
||||
## Confidence Regions for Class Prevalence Estimation
|
||||
|
||||
_(New in v0.1.10!)_ Some quantification methods go beyond providing a single point estimate of class prevalence values and also produce confidence regions, which characterize the uncertainty around the point estimate. In QuaPy, two such methods are currently implemented:
|
||||
_(New in v0.2.0!)_ Some quantification methods go beyond providing a single point estimate of class prevalence values and also produce confidence regions, which characterize the uncertainty around the point estimate. In QuaPy, two such methods are currently implemented:
|
||||
|
||||
* Aggregative Bootstrap: The Aggregative Bootstrap method extends any aggregative quantifier by generating confidence regions for class prevalence estimates through bootstrapping. Key features of this method include:
|
||||
|
||||
|
|
@ -593,9 +603,9 @@ _(New in v0.1.10!)_ Some quantification methods go beyond providing a single poi
|
|||
During training, bootstrap repetitions are performed only after training the classifier once. These repetitions are used to train multiple aggregation functions.
|
||||
During inference, bootstrap is applied over pre-classified test instances.
|
||||
* General Applicability: Aggregative Bootstrap can be applied to any aggregative quantifier.
|
||||
For further information, check the [example](https://github.com/HLT-ISTI/QuaPy/tree/master/examples) provided.
|
||||
For further information, check the [example](https://github.com/HLT-ISTI/QuaPy/tree/master/examples/16.confidence_regions.py) provided.
|
||||
|
||||
* BayesianCC: is a Bayesian variant of the Adjusted Classify & Count (ACC) quantifier (see more details in [Aggregative Quantifiers](#bayesiancc)).
|
||||
* BayesianCC: a Bayesian variant of the Adjusted Classify & Count (ACC) quantifier; see the [example](https://github.com/HLT-ISTI/QuaPy/tree/master/examples/14.bayesian_quantification.py) provided for more details.
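
The inference-time idea behind Aggregative Bootstrap (classify once, then resample only the aggregation step) can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not QuaPy's implementation: the function name `bootstrap_prevalence_interval` is hypothetical, the aggregation is plain Classify & Count on binary predictions, and the region is a simple percentile interval.

```python
import random


def bootstrap_prevalence_interval(predicted_labels, n_bootstrap=1000,
                                  confidence=0.95, seed=0):
    """Percentile interval for the positive prevalence via bootstrap.

    `predicted_labels` are 0/1 predictions computed once; only the
    aggregation (here the mean, i.e. Classify & Count) is repeated.
    """
    rng = random.Random(seed)
    n = len(predicted_labels)
    estimates = []
    for _ in range(n_bootstrap):
        resample = rng.choices(predicted_labels, k=n)  # sample with replacement
        estimates.append(sum(resample) / n)
    estimates.sort()
    alpha = (1.0 - confidence) / 2.0
    low = estimates[int(alpha * n_bootstrap)]
    high = estimates[min(n_bootstrap - 1, int((1.0 - alpha) * n_bootstrap))]
    point = sum(estimates) / n_bootstrap  # point estimate: mean of the replicates
    return point, (low, high)


# assumed toy input: 100 test instances already classified, 30 as positive
preds = [1] * 30 + [0] * 70
point, (low, high) = bootstrap_prevalence_interval(preds)
print(f"point={point:.3f}, interval=[{low:.3f}, {high:.3f}]")
```

Because the classifier is never re-run, the cost of each replicate is just one aggregation over cached predictions.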
|
||||
|
||||
Confidence regions are constructed around a point estimate, which is typically computed as the mean value of a set of samples.
|
||||
The confidence region can be instantiated in three ways:
|
||||
|
|
|
|||
|
|
@ -87,7 +87,7 @@ model = qp.model_selection.GridSearchQ(
|
|||
error='mae', # the error to optimize is the MAE (a quantification-oriented loss)
|
||||
refit=True, # retrain on the whole labelled set once done
|
||||
verbose=True # show information as the process goes on
|
||||
).fit(training)
|
||||
).fit(*training.Xy)
|
||||
|
||||
print(f'model selection ended: best hyper-parameters={model.best_params_}')
|
||||
model = model.best_model_
|
||||
|
|
@ -133,7 +133,7 @@ learner = GridSearchCV(
|
|||
LogisticRegression(),
|
||||
param_grid={'C': np.logspace(-4, 5, 10), 'class_weight': ['balanced', None]},
|
||||
cv=5)
|
||||
model = DistributionMatching(learner).fit(dataset.train)
|
||||
model = DistributionMatching(learner).fit(*dataset.train.Xy)
|
||||
```
|
||||
|
||||
However, this is conceptually flawed, since the model should be
|
||||
|
|
|
|||
|
|
@ -2,6 +2,9 @@
|
|||
|
||||
The module _qp.plot_ implements some basic plotting functions
|
||||
that can help analyse the performance of a quantification method.
|
||||
See the provided
|
||||
[code example](https://github.com/HLT-ISTI/QuaPy/blob/master/examples/13.plotting.py)
|
||||
for a full example.
|
||||
|
||||
All plotting functions receive as inputs the outcomes of
|
||||
some experiments and include, for each experiment,
|
||||
|
|
@ -77,7 +80,7 @@ def gen_data():
|
|||
method_names, true_prevs, estim_prevs, tr_prevs = [], [], [], []
|
||||
|
||||
for method_name, model in models():
|
||||
model.fit(train)
|
||||
model.fit(*train.Xy)
|
||||
true_prev, estim_prev = qp.evaluation.prediction(model, APP(test, repeats=100, random_state=0))
|
||||
|
||||
method_names.append(method_name)
|
||||
|
|
@ -171,7 +174,7 @@ def gen_data():
|
|||
training_size = 5000
|
||||
# the problem is binary, so it suffices to specify the negative prevalence; the positive one is then determined
|
||||
train_sample = train.sampling(training_size, 1-training_prevalence)
|
||||
model.fit(train_sample)
|
||||
model.fit(*train_sample.Xy)
|
||||
true_prev, estim_prev = qp.evaluation.prediction(model, APP(test, repeats=100, random_state=0))
|
||||
method_name = 'CC$_{'+f'{int(100*training_prevalence)}' + '\\%}$'
|
||||
method_data.append((method_name, true_prev, estim_prev, train_sample.prevalence()))
|
||||
|
|
|
|||
|
|
@ -1,7 +1,5 @@
|
|||
# Protocols
|
||||
|
||||
_New in v0.1.7!_
|
||||
|
||||
Quantification methods are expected to behave robustly in the presence of
|
||||
shift. For this reason, quantification methods need to be confronted with
|
||||
samples exhibiting widely varying amounts of shift.
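
As a rough illustration of how such samples can be generated, the sketch below mimics the idea of an artificial-prevalence protocol for a binary problem in plain Python. It is a simplified assumption-laden sketch, not QuaPy's `APP`: the function name `app_samples`, the fixed `sample_size`, and drawing with replacement are choices made here for brevity; the defaults `n_prevalences=21` and `repeats=10` follow the values mentioned in this manual.

```python
import random


def app_samples(labels, sample_size=100, n_prevalences=21, repeats=10, seed=0):
    """Yield (indices, prevalence) pairs over an evenly spaced prevalence grid.

    Binary-only sketch: `labels` is a list of 0/1 values.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    for k in range(n_prevalences):
        prev = k / (n_prevalences - 1)      # 0.0, 0.05, ..., 1.0 by default
        n_pos = round(prev * sample_size)
        for _ in range(repeats):
            # draw with replacement so that extreme prevalences are always reachable
            idx = rng.choices(pos, k=n_pos) + rng.choices(neg, k=sample_size - n_pos)
            yield idx, n_pos / sample_size


# assumed toy pool: 200 positives and 800 negatives
labels = [1] * 200 + [0] * 800
samples = list(app_samples(labels))
print(len(samples), samples[0][1], samples[-1][1])
```

Each yielded sample has a known positive prevalence, so a quantifier can be scored across the whole range of shift rather than only at the training prevalence.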
|
||||
|
|
@ -106,15 +104,16 @@ train, test = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=5).train_test
|
|||
|
||||
# model selection
|
||||
train, val = train.split_stratified(train_prop=0.75)
|
||||
Xtr, ytr = train.Xy
|
||||
quantifier = qp.model_selection.GridSearchQ(
|
||||
quantifier,
|
||||
param_grid={'classifier__C': np.logspace(-2, 2, 5)},
|
||||
protocol=APP(val) # <- this is the protocol we use for generating validation samples
|
||||
).fit(train)
|
||||
).fit(Xtr, ytr)
|
||||
|
||||
# default values are n_prevalences=21, repeats=10, random_state=0; this is equivalent to:
|
||||
# val_app = APP(val, n_prevalences=21, repeats=10, random_state=0)
|
||||
# quantifier = GridSearchQ(quantifier, param_grid, protocol=val_app).fit(train)
|
||||
# quantifier = GridSearchQ(quantifier, param_grid, protocol=val_app).fit(Xtr, ytr)
|
||||
|
||||
# evaluation with APP
|
||||
mae = qp.evaluation.evaluate(quantifier, protocol=APP(test), error_metric='mae')
|
||||
|
|
|
|||