forked from moreo/QuaPy
88 lines
5.7 KiB
Plaintext
88 lines
5.7 KiB
Plaintext
sample_size should not be mandatory when qp.environ['SAMPLE_SIZE'] has been specified
|
|
clean all the cumbersome methods that have to be implemented for new quantifiers (e.g., n_classes_ prop, etc.)
|
|
make truly parallel the GridSearchQ
|
|
make more examples in the "examples" directory
|
|
merge with master, because I had to fix some problems with QuaNet due to an issue notified via GitHub!
|
|
added cross_val_predict in qp.model_selection (i.e., a cross_val_predict for quantification) --would be nice to have
|
|
it parallelized
|
|
|
|
|
|
Packaging:
|
|
==========================================
|
|
Document methods with paper references
|
|
unit-tests
|
|
clean wiki_examples!
|
|
|
|
Refactor:
|
|
==========================================
|
|
Unify ThresholdOptimization methods, as an extension of PACC (and not ACC), the fit methods are almost identical and
|
|
use a prob classifier (take into account that PACC uses pcc internally, whereas the threshold methods use cc
|
|
instead). The fit method of ACC and PACC has a block for estimating the validation estimates that should be unified
|
|
as well...
|
|
Refactor protocols. APP and NPP related functionalities are duplicated in functional, LabelledCollection, and evaluation
|
|
|
|
|
|
New features:
|
|
==========================================
|
|
Add NAE, NRAE
|
|
Add "measures for evaluating ordinal"?
|
|
Add datasets for topic.
|
|
Do we want to cover cross-lingual quantification natively in QuaPy, or does it make more sense as an application on top?
|
|
|
|
Current issues:
|
|
==========================================
|
|
Revise the class structure of quantification methods and the methods they inherit... There is some confusion regarding
|
|
methods isbinary, isprobabilistic, and the like. The attribute "learner_" in aggregative quantifiers is also
|
|
confusing, since there is a getter and a setter.
|
|
Remove the "deep" in get_params. There is no real compatibility with scikit-learn as for now.
|
|
SVMperf-based learners do not remove temp files in __del__?
|
|
In binary quantification (hp, kindle, imdb) we used F1 in the minority class (which in kindle and hp happens to be the
|
|
negative class). This is not covered in this new implementation, in which the binary case is not treated as such, but as
|
|
an instance of single-label with 2 labels. Check
|
|
Add automatic reindex of class labels in LabelledCollection (currently, class indexes should be ordered and with no gaps)
|
|
OVR I believe is currently tied to aggregative methods. We should provide a general interface also for general quantifiers
|
|
Currently, being "binary" only adds one checker; we should figure out how to impose the check to be automatically performed
|
|
Add random seed management to support replicability (see temp_seed in util.py).
|
|
GridSearchQ is not trully parallelized. It only parallelizes on the predictions.
|
|
In the context of a quantifier (e.g., QuaNet or CC), the parameters of the learner should be prefixed with "estimator__",
|
|
in QuaNet this is resolved with a __check_params_colision, but this should be improved. It might be cumbersome to
|
|
impose the "estimator__" prefix for, e.g., quantifiers like CC though... This should be changed everywhere...
|
|
QuaNet needs refactoring. The base quantifiers ACC and PACC receive val_data with instances already transformed. This
|
|
issue is due to a bad design.
|
|
|
|
Improvements:
|
|
==========================================
|
|
Explore the hyperparameter "number of bins" in HDy
|
|
Rename EMQ to SLD ?
|
|
Parallelize the kFCV in ACC and PACC?
|
|
Parallelize model selection trainings
|
|
We might want to think of (improving and) adding the class Tabular (it is defined and used on branch tweetsent). A more
|
|
recent version is in the project ql4facct. This class is meant to generate latex tables from results (highligting
|
|
best results, computing statistical tests, colouring cells, producing rankings, producing averages, etc.). Trying
|
|
to generate tables is typically a bad idea, but in this specific case we do have pretty good control of what an
|
|
experiment looks like. (Do we want to abstract experimental results? this could be useful not only for tables but
|
|
also for plots).
|
|
Add proper logging system. Currently we use print
|
|
It might be good to simplify the number of methods that have to be implemented for any new Quantifier. At the moment,
|
|
there are many functions like get_params, set_params, and, specially, @property classes_, which are cumbersome to
|
|
implement for quick experiments. A possible solution is to impose get_params and set_params only in cases in which
|
|
the model extends some "ModelSelectable" interface only. The classes_ should have a default implementation.
|
|
|
|
Checks:
|
|
==========================================
|
|
How many times is the system of equations for ACC and PACC not solved? How many times is it clipped? Do they sum up
|
|
to one always?
|
|
Re-check how hyperparameters from the quantifier and hyperparameters from the classifier (in aggregative quantifiers)
|
|
is handled. In scikit-learn the hyperparameters from a wrapper method are indicated directly whereas the hyperparams
|
|
from the internal learner are prefixed with "estimator__". In QuaPy, combinations having to do with the classifier
|
|
can be computed at the begining, and then in an internal loop the hyperparams of the quantifier can be explored,
|
|
passing fit_learner=False.
|
|
Re-check Ensembles. As for now, they are strongly tied to aggregative quantifiers.
|
|
Re-think the environment variables. Maybe add new ones (like, for example, parameters for the plots)
|
|
Do we want to wrap prevalences (currently simple np.ndarray) as a class? This might be convenient for some interfaces
|
|
(e.g., for specifying artificial prevalences in samplings, for printing them -- currently supported through
|
|
F.strprev(), etc.). This might however add some overload, and prevent/difficult post processing with numpy.
|
|
Would be nice to get a better integration with sklearn.
|
|
|
|
|