updated manuals
commit c26c463f3d (parent 8adcc33c59)
@@ -1,4 +1,4 @@
-Change Log 0.1.10
+Change Log 0.2.0
 -----------------

 CLEAN TODO-FILE

@@ -6,7 +6,7 @@ CLEAN TODO-FILE
 - Base code Refactor:
   - Removing coupling between LabelledCollection and quantification methods; the fit interface changes
     (see the sketch after this change log):
       def fit(data:LabelledCollection): -> def fit(X, y):
-  - Adding function "predict" (function "quantify" is still present as an alias)
+  - Adding function "predict" (function "quantify" is still present as an alias, for the nostalgic)
   - Aggregative methods' behavior in terms of fit_classifier and how to treat the val_split is now
     indicated exclusively at construction time, and it is no longer possible to indicate it at fit time.
     This is because, in v<=0.1.9, one could create a method (e.g., ACC) and then indicate:

@@ -21,15 +21,16 @@ CLEAN TODO-FILE
   - A new parameter "on_calib_error" is passed to the constructor, which indicates the policy to follow
     in case the calibration functions (of the abstention package) fail (which happens sometimes). Options include:
     - 'raise': raises a RuntimeError (default)
-    - 'backup': reruns avoiding calibration
+    - 'backup': reruns by silently avoiding calibration
   - Parameter "recalib" has been renamed "calib"
 - Added aggregative bootstrap for deriving confidence regions (confidence intervals, ellipses in the simplex, or
   ellipses in the CLR space). This method is efficient as it leverages the two phases of the aggregative quantifiers.
   This method applies resampling only to the aggregation phase, thus avoiding training many quantifiers or
   classifying the instances of a sample multiple times. See:
   - quapy/method/confidence.py (new)
-  - the new example no. 15.
+  - the new example no. 16, confidence_regions.py
-- BayesianCC moved to confidence.py, where methods having to do with confidence intervals live
+- BayesianCC moved to confidence.py, where methods having to do with confidence intervals belong.
+- Improved documentation of qp.plot module.

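For illustration, a minimal sketch of the new v0.2.0 interface mentioned above (the dataset and classifier are arbitrary choices):

```python
# a minimal sketch of the new v0.2.0 fit/predict interface; the dataset and
# classifier are arbitrary choices made for illustration
import quapy as qp
from sklearn.linear_model import LogisticRegression

train, test = qp.datasets.fetch_reviews('hp', tfidf=True, min_df=5).train_test
Xtr, ytr = train.Xy

model = qp.method.aggregative.PACC(LogisticRegression())
model.fit(Xtr, ytr)                 # formerly model.fit(train)
estim_prev = model.predict(test.X)  # 'quantify' remains available as an alias
```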

 Change Log 0.1.9

@@ -340,10 +340,10 @@ and a set of test samples (for evaluation). QuaPy returns this data as a LabelledCollection
 (training) and two generation protocols (for validation and test samples), as follows:

 ```python
-training, val_generator, test_generator = fetch_lequa2022(task=task)
+training, val_generator, test_generator = qp.datasets.fetch_lequa2022(task=task)
 ```

-See the `lequa2022_experiments.py` in the examples folder for further details on how to
+See the `5a.lequa2022_experiments.py` in the examples folder for further details on how to
 carry out experiments using these datasets.

 The datasets are downloaded only once, and stored for fast reuse.

@@ -365,6 +365,53 @@ Esuli, A., Moreo, A., Sebastiani, F., & Sperduti, G. (2022).
 A Detailed Overview of LeQua@ CLEF 2022: Learning to Quantify.
 ```

+## LeQua 2024 Datasets
+
+QuaPy also provides the datasets used for the [LeQua 2024 competition](https://lequa2024.github.io/).
+In brief, there are 4 tasks:
+* T1: binary quantification (by sentiment)
+* T2: multiclass quantification (28 classes, merchandise products)
+* T3: ordinal quantification (5-stars sentiment ratings)
+* T4: binary sentiment quantification under a combination of covariate shift and prior shift
+
+In all cases, the covariate space has 256 dimensions (extracted using the `ELECTRA-Small` model).
+
+Every task consists of a training set, a set of validation samples (for model selection)
+and a set of test samples (for evaluation). QuaPy returns this data as a LabelledCollection
+(training bags) and sampling generation protocols (for validation and test bags).
+T3 also offers the possibility to obtain a series of training bags (in the form of a
+sampling generation protocol) instead of one single training bag. Use it as follows:
+
+```python
+training, val_generator, test_generator = qp.datasets.fetch_lequa2024(task=task)
+```
+
+See the `5b.lequa2024_experiments.py` in the examples folder for further details on how to
+carry out experiments using these datasets.
+
+The datasets are downloaded only once, and stored for fast reuse.
+
+Some statistics are summarized below:
+
+| Dataset | classes | train size  | validation samples | test samples | docs by sample | type   |
+|---------|:-------:|:-----------:|:------------------:|:------------:|:--------------:|:------:|
+| T1      | 2       | 5000        | 1000               | 5000         | 250            | vector |
+| T2      | 28      | 20000       | 1000               | 5000         | 1000           | vector |
+| T3      | 5       | 100 samples | 1000               | 5000         | 200            | vector |
+| T4      | 2       | 5000        | 1000               | 5000         | 250            | vector |
+
+For further details on the datasets or the competition, we refer to
+[the official site](https://lequa2024.github.io/data/) and
+[the overview paper](http://nmis.isti.cnr.it/sebastiani/Publications/LQ2024.pdf).
+
+```
+Esuli, A., Moreo, A., Sebastiani, F., & Sperduti, G. (2024).
+An Overview of LeQua 2024, the 2nd International Data Challenge on Learning to Quantify,
+Proceedings of the 4th International Workshop on Learning to Quantify (LQ 2024),
+ECML-PKDD 2024, Vilnius, Lithuania.
+```

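The validation and test generators follow QuaPy's usual protocol interface; a short sketch (the task choice is arbitrary) of how the bags can be traversed:

```python
# a sketch: traversing the validation bags; a protocol is iterated by calling it,
# each iteration yielding a bag of covariates together with its true prevalence
import quapy as qp

training, val_generator, test_generator = qp.datasets.fetch_lequa2024(task='T1')
for X_bag, true_prev in val_generator():
    ...  # e.g., compare true_prev with a quantifier's estimate on X_bag
```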

 ## IFCB Plankton dataset

 IFCB is a dataset of plankton species in water samples, hosted on [Zenodo](https://zenodo.org/records/10036244).

@@ -410,8 +457,12 @@ Journal of Plankton Research 41 (4), 449-463](https://par.nsf.gov/servlets/purl/

 ## Adding Custom Datasets

+It is straightforward to import your own datasets into QuaPy.
+In what follows, there are some code snippets for doing so; see also the example
+[3.custom_collection.py](https://github.com/HLT-ISTI/QuaPy/blob/master/examples/3.custom_collection.py).
+
 QuaPy provides data loaders for simple formats dealing with
-text, following the format:
+text; for example, use `qp.data.reader.from_text` for the following format:

 ```
 class-id \t first document's pre-processed text \n
@@ -419,13 +470,16 @@ class-id \t second document's pre-processed text \n
 ...
 ```

-and sparse representations of the form:
+or `qp.data.reader.from_sparse` for sparse representations of the form:

 ```
 {-1, 0, or +1} col(int):val(float) col(int):val(float) ... \n
 ...
 ```

+Both functions return a tuple `X, y` containing the instances and the corresponding
+labels, respectively.
+
 The code in charge of loading a LabelledCollection is:

 ```python
@@ -434,12 +488,13 @@ def load(cls, path:str, loader_func:callable):
     return LabelledCollection(*loader_func(path))
 ```

-indicating that any _loader_func_ (e.g., a user-defined one) which
+indicating that any `loader_func` (e.g., `from_text`, `from_sparse`, `from_csv`, or a user-defined one) which
 returns valid arguments for initializing a _LabelledCollection_ object will allow
-to load any collection. In particular, the _LabelledCollection_ receives as
-arguments the instances (as an iterable) and the labels (as an iterable) and,
-additionally, the number of classes can be specified (it would otherwise be
-inferred from the labels, but that requires at least one positive example for
+to load any collection. More specifically, the _LabelledCollection_ receives as
+arguments the _instances_ (iterable) and the _labels_ (iterable) and,
+optionally, the number of classes (it would be
+inferred from the labels if not indicated, but this requires at least one
+positive example for
 all classes to be present in the collection).

 The same _loader_func_ can be passed to a Dataset, along with two
@@ -452,20 +507,23 @@ import quapy as qp
 train_path = '../my_data/train.dat'
 test_path = '../my_data/test.dat'

-def my_custom_loader(path):
+def my_custom_loader(path, **custom_kwargs):
     with open(path, 'rb') as fin:
         ...
     return instances, labels

-data = qp.data.Dataset.load(train_path, test_path, my_custom_loader)
+data = qp.data.Dataset.load(train_path, test_path, my_custom_loader, **custom_kwargs)
 ```

 ### Data Processing

-QuaPy implements a number of preprocessing functions in the package _qp.data.preprocessing_, including:
+QuaPy implements a number of preprocessing functions in the package `qp.data.preprocessing`, including:

 * _text2tfidf_: tfidf vectorization
 * _reduce_columns_: reducing the number of columns based on term frequency
 * _standardize_: transforms the column values into z-scores (i.e., subtracts the mean and normalizes by the standard deviation, so
 that the column values have zero mean and unit variance).
 * _index_: transforms textual tokens into lists of numeric ids
+
+These functions are applied to `Dataset` objects, and offer the possibility to apply the transformation
+inline (thus modifying the original dataset), or to return a modified copy.

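A short sketch of the inline usage (the dataset and parameters match those used elsewhere in these manuals):

```python
# a sketch: tfidf-vectorize a textual Dataset in place
import quapy as qp

dataset = qp.datasets.fetch_reviews('hp', pickle=True)
qp.data.preprocessing.text2tfidf(dataset, min_df=5, inplace=True)  # inplace=False returns a modified copy instead
```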
@@ -46,18 +46,18 @@ e.g.:

 ```python
 qp.environ['SAMPLE_SIZE'] = 100  # once for all
-true_prev = np.asarray([0.5, 0.3, 0.2])  # let's assume 3 classes
+true_prev = [0.5, 0.3, 0.2]  # let's assume 3 classes
-estim_prev = np.asarray([0.1, 0.3, 0.6])
+estim_prev = [0.1, 0.3, 0.6]
 error = qp.error.mrae(true_prev, estim_prev)
 print(f'mrae({true_prev}, {estim_prev}) = {error:.3f}')
 ```

 will print:
 ```
-mrae([0.500, 0.300, 0.200], [0.100, 0.300, 0.600]) = 0.914
+mrae([0.5, 0.3, 0.2], [0.1, 0.3, 0.6]) = 0.914
 ```

-Finally, it is possible to instantiate QuaPy's quantification
+It is also possible to instantiate QuaPy's quantification
 error functions from strings using, e.g.:

 ```python

@@ -85,7 +85,7 @@ print(f'MAE = {mae:.4f}')
 ```

 It is often desirable to evaluate our system using more than one
-single evaluatio measure. In this case, it is convenient to generate
+single evaluation measure. In this case, it is convenient to generate
 a _report_. A report in QuaPy is a dataframe accounting for all the
 true prevalence values with their corresponding prevalence values
 as estimated by the quantifier, along with the error each has given

@@ -104,7 +104,7 @@ report['estim-prev'] = report['estim-prev'].map(F.strprev)
 print(report)

 print('Averaged values:')
-print(report.mean())
+print(report.mean(numeric_only=True))
 ```

 This will produce an output like:

@@ -141,11 +141,14 @@ true_prevs, estim_prevs = qp.evaluation.prediction(quantifier, protocol=prot)

 All the evaluation functions implement specific optimizations for speeding-up
 the evaluation of aggregative quantifiers (i.e., of instances of _AggregativeQuantifier_).

 The optimization comes down to generating classification predictions (either crisp or soft)
 only once for the entire test set, and then applying the sampling procedure to the
 predictions, instead of generating samples of instances and then computing the
 classification predictions every time. This is only possible when the protocol
-is an instance of _OnLabelledCollectionProtocol_. The optimization is only
+is an instance of _OnLabelledCollectionProtocol_.
+
+The optimization is only
 carried out when the number of classification predictions thus generated would be
 smaller than the number of predictions required for the entire protocol; e.g.,
 if the original dataset contains 1M instances, but the protocol is such that it would

@@ -156,4 +159,4 @@ precompute all the predictions irrespectively of the number of instances and number of samples
 Finally, this can be deactivated by setting _aggr_speedup=False_. Note that this optimization
 is not only applied for the final evaluation, but also for the internal evaluations carried
 out during _model selection_. Since these are typically many, the heuristic can help reduce the
-execution time a lot.
+execution time significantly.

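For instance, a sketch of how the speed-up can be deactivated in `evaluate` (the remaining arguments are as in the examples above):

```python
# a sketch: disabling the aggregative speed-up during evaluation
mae = qp.evaluation.evaluate(quantifier, protocol=APP(test), error_metric='mae', aggr_speedup=False)
```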
@@ -1,7 +1,7 @@
 # Quantification Methods

 Quantification methods can be categorized as belonging to
-`aggregative` and `non-aggregative` groups.
+`aggregative`, `non-aggregative`, and `meta-learning` groups.
 Most methods included in QuaPy at the moment are of type `aggregative`
 (though we plan to add many more methods in the near future), i.e.,
 are methods characterized by the fact that

@@ -12,21 +12,17 @@ Any quantifier in QuaPy should extend the class `BaseQuantifier`,
 and implement some abstract methods:
 ```python
 @abstractmethod
-def fit(self, data: LabelledCollection): ...
+def fit(self, X, y): ...

 @abstractmethod
-def quantify(self, instances): ...
+def predict(self, X): ...
 ```
 The meaning of those functions should be familiar to those
 used to work with scikit-learn since the class structure of QuaPy
 is directly inspired by scikit-learn's _Estimators_. Functions
-`fit` and `quantify` are used to train the model and to provide
-class estimations (the reason why
-scikit-learn' structure has not been adopted _as is_ in QuaPy responds to
-the fact that scikit-learn's `predict` function is expected to return
-one output for each input element --e.g., a predicted label for each
-instance in a sample-- while in quantification the output for a sample
-is one single array of class prevalences).
+`fit` and `predict` (for which there is an alias `quantify`)
+are used to train the model and to provide
+class estimations.
 Quantifiers also extend from scikit-learn's `BaseEstimator`, in order
 to simplify the use of `set_params` and `get_params` used in
 [model selection](./model-selection).

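A minimal sketch of a custom quantifier implementing these two abstract methods (not taken from the QuaPy codebase; `prevalence_from_labels` is a helper from `quapy.functional`):

```python
# a sketch: a trivial quantifier that memorizes the training prevalence and
# returns it for any test set, ignoring the test instances altogether
import numpy as np
import quapy.functional as F
from quapy.method.base import BaseQuantifier

class TrainingPrevalence(BaseQuantifier):
    def fit(self, X, y):
        self.prev_ = F.prevalence_from_labels(y, classes=np.unique(y))
        return self

    def predict(self, X):
        return self.prev_
```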
@@ -40,20 +36,25 @@ The methods that any `aggregative` quantifier must implement are:

 ```python
 @abstractmethod
-def aggregation_fit(self, classif_predictions: LabelledCollection, data: LabelledCollection):
+def aggregation_fit(self, classif_predictions, labels):

 @abstractmethod
-def aggregate(self, classif_predictions:np.ndarray): ...
+def aggregate(self, classif_predictions): ...
 ```

-These two functions replace the `fit` and `quantify` methods, since those
-come with default implementations. The `fit` function is provided and amounts to:
+The argument `classif_predictions` is whatever the method `classify` returns.
+QuaPy comes with default implementations that cover most common cases, but you can
+override `classify` in case your method requires further or different information to work.
+
+These two functions replace the `fit` and `predict` methods, which
+come with default implementations. For instance, the `fit` function is
+provided and amounts to:

 ```python
-def fit(self, data: LabelledCollection, fit_classifier=True, val_split=None):
+def fit(self, X, y):
     self._check_init_parameters()
-    classif_predictions = self.classifier_fit_predict(data, fit_classifier, predict_on=val_split)
-    self.aggregation_fit(classif_predictions, data)
+    classif_predictions, labels = self.classifier_fit_predict(X, y)
+    self.aggregation_fit(classif_predictions, labels)
     return self
 ```

@@ -72,11 +73,11 @@ overriden (if needed) and allows the method to quickly raise any exception based
 found in the `__init__` arguments, thus avoiding breaking after training the classifier and generating
 predictions.

-Similarly, the function `quantify` is provided, and amounts to:
+Similarly, the function `predict` (alias `quantify`) is provided, and amounts to:

 ```python
-def quantify(self, instances):
-    classif_predictions = self.classify(instances)
+def predict(self, X):
+    classif_predictions = self.classify(X)
     return self.aggregate(classif_predictions)
 ```

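Putting the pieces together, a sketch of a bare-bones Classify & Count written against this interface (the base class `AggregativeCrispQuantifier` is assumed here; QuaPy's own `CC` is the reference implementation):

```python
# a sketch: re-implementing Classify & Count via aggregation_fit/aggregate;
# AggregativeCrispQuantifier (assumed) makes classify() return crisp predictions
import quapy.functional as F
from quapy.method.aggregative import AggregativeCrispQuantifier

class MyCC(AggregativeCrispQuantifier):
    def aggregation_fit(self, classif_predictions, labels):
        pass  # CC needs no training of the aggregation function

    def aggregate(self, classif_predictions):
        # fraction of crisp predictions assigned to each class
        return F.prevalence_from_labels(classif_predictions, classes=self.classifier.classes_)
```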
@@ -84,12 +85,14 @@ in which only the function `aggregate` is required to be overriden in most cases

 Aggregative quantifiers are expected to maintain a classifier (which is
 accessed through the `@property` `classifier`). This classifier is
-given as input to the quantifier, and can be already fit
-on external data (in which case, the `fit_learner` argument should
-be set to False), or be fit by the quantifier's fit (default).
+given as input to the quantifier, and will be trained by the quantifier's fit (default).
+Alternatively, the classifier can be already fit on external data; in this case, the `fit_learner`
+argument in the `__init__` should be set to False (see [4.using_pretrained_classifier.py](https://github.com/HLT-ISTI/QuaPy/blob/master/examples/4.using_pretrained_classifier.py)
+for a full code example).

-The above patterns (in training: fit the classifier, then fit the aggregation;
-in test: classify, then aggregate) allows QuaPy to optimize many internal procedures.
+The above patterns (in training: (i) fit the classifier, then (ii) fit the aggregation;
+in test: (i) classify, then (ii) aggregate) allow QuaPy to optimize many internal procedures,
+on the grounds that steps (i) are slower than steps (ii).
 In particular, the model selection routine takes advantage of this two-step process
 and generates classifiers only for the valid combinations of hyperparameters of the
 classifier, and then _clones_ these classifiers and explores the combinations

@@ -124,6 +127,7 @@ import quapy.functional as F
 from sklearn.svm import LinearSVC

 training, test = qp.datasets.fetch_twitter('hcr', pickle=True).train_test
+Xtr, ytr = training.Xy

 # instantiate a classifier learner, in this case an SVM
 svm = LinearSVC()

|
||||||
# instantiate a Classify & Count with the SVM
|
# instantiate a Classify & Count with the SVM
|
||||||
# (an alias is available in qp.method.aggregative.ClassifyAndCount)
|
# (an alias is available in qp.method.aggregative.ClassifyAndCount)
|
||||||
model = qp.method.aggregative.CC(svm)
|
model = qp.method.aggregative.CC(svm)
|
||||||
model.fit(training)
|
model.fit(Xtr, ytr)
|
||||||
estim_prevalence = model.predict(test.instances)
|
estim_prevalence = model.predict(test.instances)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
@@ -153,26 +157,14 @@ predictions. This parameter can also be set with an integer,
 indicating that the parameters should be estimated by means of
 _k_-fold cross-validation, for which the integer indicates the
 number _k_ of folds (the default value is 5). Finally, `val_split` can be set to a
-specific held-out validation set (i.e., an instance of `LabelledCollection`).
+specific held-out validation set (i.e., a tuple `(X,y)`); see the sketch below.

-The specification of `val_split` can be
-postponed to the invokation of the fit method (if `val_split` was also
-set in the constructor, the one specified at fit time would prevail),
-e.g.:
-
-```python
-model = qp.method.aggregative.ACC(svm)
-# perform 5-fold cross validation for estimating ACC's parameters
-# (overrides the default val_split=0.4 in the constructor)
-model.fit(training, val_split=5)
-```

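A sketch of the v0.2.0 style, in which `val_split` is indicated at construction time only:

```python
# a sketch: val_split is fixed at construction time (5-fold CV is the default)
model = qp.method.aggregative.ACC(svm, val_split=5)
model.fit(Xtr, ytr)
```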

 The following code illustrates the case in which PCC is used:

 ```python
 model = qp.method.aggregative.PCC(svm)
-model.fit(training)
+model.fit(Xtr, ytr)
-estim_prevalence = model.predict(test.instances)
+estim_prevalence = model.predict(Xte)
 print('classifier:', model.classifier)
 ```
 In this case, QuaPy will print:

@@ -185,11 +177,11 @@ is not a probabilistic classifier (i.e., it does not implement the
 `predict_proba` method) and so, the classifier will be converted to
 a probabilistic one through [calibration](https://scikit-learn.org/stable/modules/calibration.html).
 As a result, the classifier that is printed in the second line points
-to a `CalibratedClassifier` instance. Note that calibration can only
+to a `CalibratedClassifierCV` instance. Note that calibration can only
-be applied to hard classifiers when `fit_learner=True`; an exception
+be applied to hard classifiers if `fit_learner=True`; an exception
 will be raised otherwise.

-Lastly, everything we said aboud ACC and PCC
+Lastly, everything we said about ACC and PCC
 applies to PACC as well.

 _New in v0.1.9_: quantifiers ACC and PACC now have three additional arguments: `method`, `solver` and `norm`:

@@ -259,22 +251,28 @@ An example of use can be found below:
 import quapy as qp
 from sklearn.linear_model import LogisticRegression

-dataset = qp.datasets.fetch_twitter('hcr', pickle=True)
+train, test = qp.datasets.fetch_twitter('hcr', pickle=True).train_test

 model = qp.method.aggregative.EMQ(LogisticRegression())
-model.fit(dataset.training)
-estim_prevalence = model.predict(dataset.test.instances)
+model.fit(*train.Xy)
+estim_prevalence = model.predict(test.X)
 ```

-_New in v0.1.7_: EMQ now accepts two new parameters in the construction method, namely
-`exact_train_prev` which allows to use the true training prevalence as the departing
-prevalence estimation (default behaviour), or instead an approximation of it as
-suggested by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html)
-(by setting `exact_train_prev=False`).
-The other parameter is `recalib` which allows to indicate a calibration method, among those
-proposed by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html),
-including the Bias-Corrected Temperature Scaling, Vector Scaling, etc.
-See the API documentation for further details.
+EMQ accepts additional parameters in the construction method (see the sketch below):
+* `exact_train_prev`: set to True for using the true training prevalence as the departing
+prevalence estimation (default behaviour), or to False for using an approximation of it as
+suggested by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html)
+* `calib`: indicates a calibration method, among those
+proposed by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html),
+including the Bias-Corrected Temperature Scaling
+(`bcts`), Vector Scaling (`vs`), No-Bias Vector Scaling (`nbvs`),
+or Temperature Scaling (`ts`); default is `None` (no calibration).
+* `on_calib_error`: indicates the policy to follow in case the calibrator fails at runtime.
+Options include `raise` (default), in which case a RuntimeError is raised; and `backup`, in which
+case the calibrator is silently skipped.
+
+You can use the class method `EMQ_BCTS` to effortlessly instantiate EMQ with the best performing
+heuristics found by [Alexandari et al. (2020)](http://proceedings.mlr.press/v119/alexandari20a.html). See the API documentation for further details.

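A sketch combining the options above (parameter names as listed):

```python
# a sketch: EMQ with the approximated training prevalence, BCTS calibration,
# and a fallback policy for when calibration fails at runtime
model = qp.method.aggregative.EMQ(
    LogisticRegression(),
    exact_train_prev=False,   # use Alexandari et al.'s approximation
    calib='bcts',             # bias-corrected temperature scaling
    on_calib_error='backup')  # silently skip calibration upon failure
model.fit(*train.Xy)
```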

 ### Hellinger Distance y (HDy)

@@ -289,11 +287,10 @@ This method works with a probabilistic classifier (hard classifiers
 can be used as well and will be calibrated) and requires a validation
 set to estimate the parameters of the mixture model. Just like
 ACC and PACC, this quantifier receives a `val_split` argument
-in the constructor (or in the fit method, in which case the previous
-value is overridden) that can either be a float indicating the proportion
+in the constructor that can either be a float indicating the proportion
 of training data to be taken as the validation set (in a random
-stratified split), or a validation set (i.e., an instance of
-`LabelledCollection`) itself.
+stratified split), or the validation set itself (i.e., a tuple
+`(X,y)`).

 HDy was proposed as a binary quantifier and the implementation
 provided in QuaPy accepts only binary datasets.

@@ -309,11 +306,11 @@ dataset = qp.datasets.fetch_reviews('hp', pickle=True)
 qp.data.preprocessing.text2tfidf(dataset, min_df=5, inplace=True)

 model = qp.method.aggregative.HDy(LogisticRegression())
-model.fit(dataset.training)
-estim_prevalence = model.predict(dataset.test.instances)
+model.fit(*dataset.training.Xy)
+estim_prevalence = model.predict(dataset.test.X)
 ```

-_New in v0.1.7:_ QuaPy now provides an implementation of the generalized
+QuaPy also provides an implementation of the generalized
 "Distribution Matching" approaches for multiclass, inspired by the framework
 of [Firat (2016)](https://arxiv.org/abs/1606.00868). One can instantiate
 a variant of HDy for multiclass quantification as follows:

@@ -322,17 +319,22 @@ a variant of HDy for multiclass quantification as follows:
 multiclassHDy = qp.method.aggregative.DMy(classifier=LogisticRegression(), divergence='HD', cdf=False)
 ```

-_New in v0.1.7:_ QuaPy now provides an implementation of the "DyS"
+QuaPy also provides an implementation of the "DyS"
 framework proposed by [Maletzke et al (2020)](https://ojs.aaai.org/index.php/AAAI/article/view/4376)
 and the "SMM" method proposed by [Hassan et al (2019)](https://ieeexplore.ieee.org/document/9260028)
 (thanks to _Pablo González_ for the contributions!)

 ### Threshold Optimization methods

-_New in v0.1.7:_ QuaPy now implements Forman's threshold optimization methods;
+QuaPy implements Forman's threshold optimization methods;
 see, e.g., [(Forman 2006)](https://dl.acm.org/doi/abs/10.1145/1150402.1150423)
 and [(Forman 2008)](https://link.springer.com/article/10.1007/s10618-008-0097-y).
-These include: T50, MAX, X, Median Sweep (MS), and its variant MS2.
+These include: `T50`, `MAX`, `X`, Median Sweep (`MS`), and its variant `MS2`.
+
+These methods are binary-only and implement different heuristics for
+improving the stability of the denominator of the ACC adjustment (`tpr-fpr`).
+The methods are called "threshold" since said heuristics have to do
+with different choices of the underlying classifier's threshold; a short
+sketch follows.
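For instance (a sketch; any of the classes listed above can be swapped in):

```python
# a sketch: Median Sweep variant MS2 on a binary problem
from quapy.method.aggregative import MS2

model = MS2(LogisticRegression())
model.fit(Xtr, ytr)
estim_prevalence = model.predict(Xte)
```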

 ### Explicit Loss Minimization

|
||||||
estim_prevalence = model.predict(dataset.test.instances)
|
estim_prevalence = model.predict(dataset.test.instances)
|
||||||
```
|
```
|
||||||
|
|
||||||
Check the examples on [explicit_loss_minimization](https://github.com/HLT-ISTI/QuaPy/blob/devel/examples/5.explicit_loss_minimization.py)
|
Check the examples on [explicit loss minimization](https://github.com/HLT-ISTI/QuaPy/blob/devel/examples/17.explicit_loss_minimization.py)
|
||||||
and on [one versus all quantification](https://github.com/HLT-ISTI/QuaPy/blob/devel/examples/10.one_vs_all.py) for more details.
|
and on [one versus all quantification](https://github.com/HLT-ISTI/QuaPy/blob/devel/examples/10.one_vs_all.py) for more details.
|
||||||
|
**Note** that the _one versus all_ approach is considered inappropriate under prior probability shift, though.
|
||||||
|
|
||||||
### Kernel Density Estimation methods (KDEy)
|
### Kernel Density Estimation methods (KDEy)
|
||||||
|
|
||||||
_New in v0.1.8_: QuaPy now provides implementations for the three variants
|
QuaPy provides implementations for the three variants
|
||||||
of KDE-based methods proposed in
|
of KDE-based methods proposed in
|
||||||
_[Moreo, A., González, P. and del Coz, J.J., 2023.
|
_[Moreo, A., González, P. and del Coz, J.J..
|
||||||
Kernel Density Estimation for Multiclass Quantification.
|
Kernel Density Estimation for Multiclass Quantification.
|
||||||
arXiv preprint arXiv:2401.00490](https://arxiv.org/abs/2401.00490)_.
|
Machine Learning. Vol 114 (92), 2025](https://link.springer.com/article/10.1007/s10994-024-06726-5)_
|
||||||
|
(a [preprint](https://arxiv.org/abs/2401.00490) is available online).
|
||||||
The variants differ in the divergence metric to be minimized:
|
The variants differ in the divergence metric to be minimized:
|
||||||
|
|
||||||
- KDEy-HD: minimizes the (squared) Hellinger Distance and solves the problem via a Monte Carlo approach
|
- KDEy-HD: minimizes the (squared) Hellinger Distance and solves the problem via a Monte Carlo approach
|
||||||
|
|
@@ -435,22 +439,27 @@ These methods are specifically devised for multiclass problems (although they can be applied to
 binary problems too).

 All KDE-based methods depend on the hyperparameter `bandwidth` of the kernel. Typical values
-that can be explored in model selection range in [0.01, 0.25]. The methods' performance
-vary smoothing with smooth variations of this hyperparameter.
+that can be explored in model selection range in [0.01, 0.25]. Previous experiments reveal the methods'
+performance varies smoothly at small variations of this hyperparameter (see the sketch below).

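A sketch of a bandwidth grid for model selection (`KDEyML`, the maximum-likelihood variant, is chosen arbitrarily; the grid bounds are those suggested above):

```python
# a sketch: exploring the kernel bandwidth within the suggested range
import numpy as np
from quapy.method.aggregative import KDEyML

model = KDEyML(LogisticRegression())
param_grid = {'bandwidth': np.linspace(0.01, 0.25, 25)}
# param_grid can then be passed to qp.model_selection.GridSearchQ
```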

 ## Composable Methods

-The [](quapy.method.composable) module allows the composition of quantification methods from loss functions and feature transformations. Any composed method solves a linear system of equations by minimizing the loss after transforming the data. Methods of this kind include ACC, PACC, HDx, HDy, and many other well-known methods, as well as an unlimited number of re-combinations of their building blocks.
+The `quapy.method.composable` module integrates [qunfold](https://github.com/mirkobunse/qunfold), which allows the composition
+of quantification methods from loss functions and feature transformations (thanks to Mirko Bunse for the integration!).
+
+Any composed method solves a linear system of equations by minimizing the loss after transforming the data. Methods of this kind include ACC, PACC, HDx, HDy, and many other well-known methods, as well as an unlimited number of re-combinations of their building blocks.

 ### Installation

 ```sh
 pip install --upgrade pip setuptools wheel
 pip install "jax[cpu]"
-pip install "qunfold @ git+https://github.com/mirkobunse/qunfold@v0.1.4"
+pip install "qunfold @ git+https://github.com/mirkobunse/qunfold@v0.1.5"
 ```

+**Note:** since version 0.2.0, QuaPy is only compatible with qunfold >=0.1.5.
+
 ### Basics

 The composition of a method is implemented through the [](quapy.method.composable.ComposableQuantifier) class. Its documentation also features an example to get you started in composing your own methods.

@@ -529,10 +538,11 @@ from quapy.method.meta import Ensemble
 from sklearn.linear_model import LogisticRegression

 dataset = qp.datasets.fetch_UCIBinaryDataset('haberman')
+train, test = dataset.train_test

 model = Ensemble(quantifier=ACC(LogisticRegression()), size=30, policy='ave', n_jobs=-1)
-model.fit(dataset.training)
-estim_prevalence = model.predict(dataset.test.instances)
+model.fit(*train.Xy)
+estim_prevalence = model.predict(test.X)
 ```

 Other aggregation policies implemented in QuaPy include:

@@ -579,13 +589,13 @@ learner = NeuralClassifierTrainer(cnn, device='cuda')

 # train QuaNet
 model = QuaNet(learner, device='cuda')
-model.fit(dataset.training)
-estim_prevalence = model.predict(dataset.test.instances)
+model.fit(*dataset.training.Xy)
+estim_prevalence = model.predict(dataset.test.X)
 ```

 ## Confidence Regions for Class Prevalence Estimation

-_(New in v0.1.10!)_ Some quantification methods go beyond providing a single point estimate of class prevalence values and also produce confidence regions, which characterize the uncertainty around the point estimate. In QuaPy, two such methods are currently implemented:
+_(New in v0.2.0!)_ Some quantification methods go beyond providing a single point estimate of class prevalence values and also produce confidence regions, which characterize the uncertainty around the point estimate. In QuaPy, two such methods are currently implemented:

 * Aggregative Bootstrap: The Aggregative Bootstrap method extends any aggregative quantifier by generating confidence regions for class prevalence estimates through bootstrapping. Key features of this method include:

@@ -593,9 +603,9 @@ _(New in v0.1.10!)_ Some quantification methods go beyond providing a single point estimate
   During training, bootstrap repetitions are performed only after training the classifier once. These repetitions are used to train multiple aggregation functions.
   During inference, bootstrap is applied over pre-classified test instances.
 * General Applicability: Aggregative Bootstrap can be applied to any aggregative quantifier.
-For further information, check the [example](https://github.com/HLT-ISTI/QuaPy/tree/master/examples) provided.
+For further information, check the [example](https://github.com/HLT-ISTI/QuaPy/tree/master/examples/16.confidence_regions.py) provided, and the sketch below.

-* BayesianCC: is a Bayesian variant of the Adjusted Classify & Count (ACC) quantifier (see more details in [Aggregative Quantifiers](#bayesiancc)).
+* BayesianCC: is a Bayesian variant of the Adjusted Classify & Count (ACC) quantifier; see more details in the [example](https://github.com/HLT-ISTI/QuaPy/tree/master/examples/14.bayesian_quantification.py) provided.

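A sketch of the intended usage (the names `AggregativeBootstrap` and `predict_conf` are assumptions based on the new `quapy/method/confidence.py`; see example no. 16 for the actual API):

```python
# a sketch only: AggregativeBootstrap and predict_conf are assumed names;
# check quapy/method/confidence.py and example no. 16 for the actual API
from quapy.method.confidence import AggregativeBootstrap
from quapy.method.aggregative import PACC

model = AggregativeBootstrap(PACC(LogisticRegression()), n_test_samples=500)
model.fit(Xtr, ytr)
point_estimate, conf_region = model.predict_conf(Xte)
```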
 Confidence regions are constructed around a point estimate, which is typically computed as the mean value of a set of samples.
 The confidence region can be instantiated in three ways:

@@ -87,7 +87,7 @@ model = qp.model_selection.GridSearchQ(
     error='mae',  # the error to optimize is the MAE (a quantification-oriented loss)
     refit=True,   # retrain on the whole labelled set once done
     verbose=True  # show information as the process goes on
-).fit(training)
+).fit(*training.Xy)

 print(f'model selection ended: best hyper-parameters={model.best_params_}')
 model = model.best_model_

@@ -133,7 +133,7 @@ learner = GridSearchCV(
     LogisticRegression(),
     param_grid={'C': np.logspace(-4, 5, 10), 'class_weight': ['balanced', None]},
     cv=5)
-model = DistributionMatching(learner).fit(dataset.train)
+model = DistributionMatching(learner).fit(*dataset.train.Xy)
 ```

 However, this is conceptually flawed, since the model should be

@@ -2,6 +2,9 @@

 The module _qp.plot_ implements some basic plotting functions
 that can help analyse the performance of a quantification method.
+See the provided
+[code example](https://github.com/HLT-ISTI/QuaPy/blob/master/examples/13.plotting.py)
+for full details.

 All plotting functions receive as inputs the outcomes of
 some experiments and include, for each experiment,

@@ -77,7 +80,7 @@ def gen_data():
     method_names, true_prevs, estim_prevs, tr_prevs = [], [], [], []

     for method_name, model in models():
-        model.fit(train)
+        model.fit(*train.Xy)
         true_prev, estim_prev = qp.evaluation.prediction(model, APP(test, repeats=100, random_state=0))

         method_names.append(method_name)

@@ -171,7 +174,7 @@ def gen_data():
     training_size = 5000
     # since the problem is binary, it suffices to specify the negative prevalence, since the positive is constrained
     train_sample = train.sampling(training_size, 1-training_prevalence)
-    model.fit(train_sample)
+    model.fit(*train_sample.Xy)
     true_prev, estim_prev = qp.evaluation.prediction(model, APP(test, repeats=100, random_state=0))
     method_name = 'CC$_{'+f'{int(100*training_prevalence)}' + '\%}$'
     method_data.append((method_name, true_prev, estim_prev, train_sample.prevalence()))

@@ -1,7 +1,5 @@
 # Protocols

-_New in v0.1.7!_
-
 Quantification methods are expected to behave robustly in the presence of
 shift. For this reason, quantification methods need to be confronted with
 samples exhibiting widely varying amounts of shift.

@@ -106,15 +104,16 @@ train, test = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=5).train_test

 # model selection
 train, val = train.split_stratified(train_prop=0.75)
+Xtr, ytr = train.Xy
 quantifier = qp.model_selection.GridSearchQ(
     quantifier,
     param_grid={'classifier__C': np.logspace(-2, 2, 5)},
     protocol=APP(val)  # <- this is the protocol we use for generating validation samples
-).fit(train)
+).fit(Xtr, ytr)

 # default values are n_prevalences=21, repeats=10, random_state=0; this is equivalent to:
 # val_app = APP(val, n_prevalences=21, repeats=10, random_state=0)
-# quantifier = GridSearchQ(quantifier, param_grid, protocol=val_app).fit(train)
+# quantifier = GridSearchQ(quantifier, param_grid, protocol=val_app).fit(Xtr, ytr)

 # evaluation with APP
 mae = qp.evaluation.evaluate(quantifier, protocol=APP(test), error_metric='mae')