Compare commits

...

43 Commits

Author SHA1 Message Date
Lorenzo Volpi 923a4c8037 cleaning 2024-04-11 14:24:23 +02:00
Lorenzo Volpi f109ac241d README updated 2024-04-11 14:23:14 +02:00
Lorenzo Volpi c179c3b5d6 cleaning 2024-04-11 14:22:57 +02:00
Lorenzo Volpi bd0c15b178 bug found in delta_plot_data, get_shift method added 2024-04-08 18:16:25 +02:00
Lorenzo Volpi e0fec320cf gitignore updated 2024-04-08 17:59:56 +02:00
Lorenzo Volpi 432dd7d41d imports fixed 2024-04-08 17:59:08 +02:00
Lorenzo Volpi ad9fdef786 matplotlib added, shift missing, plotly updated 2024-04-08 17:58:56 +02:00
Lorenzo Volpi 8a087e3e2f 1xn2 cont_table output fixed 2024-04-08 17:58:34 +02:00
Lorenzo Volpi 5cfd5d87dd laod results fixed, data added, methods for plot data added 2024-04-08 17:58:08 +02:00
Lorenzo Volpi d7bb8bb2b9 plot saving methods added 2024-04-08 17:57:25 +02:00
Lorenzo Volpi 2b5c7c7e35 quacc methods uncommented 2024-04-08 17:57:01 +02:00
Lorenzo Volpi b17ae5e45d imports fixed, added control variables 2024-04-08 17:56:41 +02:00
Lorenzo Volpi 5bb66b85c8 name bug fixed 2024-04-08 17:56:16 +02:00
Lorenzo Volpi 4c6a5e69f3 cleaning 2024-04-05 17:28:11 +02:00
Lorenzo Volpi af0f1c7085 generators updated, cleaning 2024-04-05 17:22:59 +02:00
Lorenzo Volpi cb6d1c8f2a imports cleaned 2024-04-05 17:20:38 +02:00
Lorenzo Volpi 3403cc05fe path methods added 2024-04-05 15:57:20 +02:00
Lorenzo Volpi a3ffd689b1 plotly methods fixed, plot saving implemented 2024-04-05 15:57:05 +02:00
Lorenzo Volpi 43056e76a8 legacy imports fixed 2024-04-05 15:56:37 +02:00
Lorenzo Volpi 4090bd0cee gitignore updated 2024-04-05 15:56:21 +02:00
Lorenzo Volpi 89df5be187 cleaning 2024-04-05 15:54:29 +02:00
Lorenzo Volpi ddce8634ac getpath moved and renamed, prevs_from_prot fixed 2024-04-05 15:54:19 +02:00
Lorenzo Volpi dcbbaba361 test_prevs bug fixed, basedir added to TestReport, added load_results 2024-04-05 15:53:33 +02:00
Lorenzo Volpi 5ee04a2a19 improved binary dataset generator 2024-04-05 15:52:29 +02:00
Lorenzo Volpi f787c4510d prevs bug fixed, testing results loading 2024-04-05 15:52:05 +02:00
Lorenzo Volpi 7854569c5e gitignore updated 2024-04-05 15:50:07 +02:00
Lorenzo Volpi 322f060a13 qcdash updated for plotting refactoring 2024-04-05 15:49:57 +02:00
Lorenzo Volpi 558c3231a3 cleaning 2024-04-04 17:09:05 +02:00
Lorenzo Volpi fd8dae5ceb quacc init updated, not adapted 2024-04-04 17:08:47 +02:00
Lorenzo Volpi 0f48b9dcb5 qcdash refactoring started 2024-04-04 17:08:25 +02:00
Lorenzo Volpi 085889b87e ideas file created 2024-04-04 17:07:56 +02:00
Lorenzo Volpi cbc07304c3 TODO updated 2024-04-04 17:07:43 +02:00
Lorenzo Volpi af06068077 merge_data updated, not adapted 2024-04-04 17:07:27 +02:00
Lorenzo Volpi 9d2a6dcdec datasets added and adapted to refactoring 2024-04-04 17:06:51 +02:00
Lorenzo Volpi d4b0212a92 quacc errors added to adapt refactoring 2024-04-04 17:06:21 +02:00
Lorenzo Volpi 2f0632b63d models refactored 2024-04-04 17:05:51 +02:00
Lorenzo Volpi 1318ef0aaa qcpanel updated, not adapted 2024-04-04 17:05:28 +02:00
Lorenzo Volpi 13aa616e3b running files updated, not adapted 2024-04-04 17:04:59 +02:00
Lorenzo Volpi 33f9561e81 quacc utils updated 2024-04-04 17:04:20 +02:00
Lorenzo Volpi b8e43c02f2 plots refactoring started 2024-04-04 17:03:52 +02:00
Lorenzo Volpi 4a06c83256 tests updated, not adapted 2024-04-04 17:03:04 +02:00
Lorenzo Volpi 9bc1208309 experiments created, report refactoring started 2024-04-04 17:02:25 +02:00
Lorenzo Volpi 51867f3e9c legacy created 2024-04-04 17:01:29 +02:00
84 changed files with 2069 additions and 54316 deletions

44
.gitignore vendored
View File

@@ -1,30 +1,40 @@
*.code-workspace
quavenv/*
*.pdf
*.md
*.html
# virtualenvs
quavenv/*
.venv/*
# vscode config
.vscode/*
__pycache__/*
baselines/__pycache__/*
baselines/densratio/__pycache__/*
qcdash/__pycache__/*
qcpanel/__pycache__/*
quacc/__pycache__/*
quacc/*/__pycache__/*
tests/__pycache__/*
tests/*/__pycache__/*
tests/*/*/__pycache__/*
htmlcov/*
test*.py
# cache
*__pycache__*
.pytest_cache/
# coverage
htmlcov/
*.coverage
.coverage
scp_sync.py
# results
*out
out/*
output/*
# !output/main/
results/*
plots/*
dataset_stats/*
# pyenv
.python-version
poetry.lock
# poetry
poetry.lock
# log files
*.log
log
test*.py

View File

@@ -1 +1 @@
# tesi
# QuAcc

143
TODO.html
View File

@@ -1,143 +0,0 @@
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title></title>
<style>
/* From extension vscode.github */
/*---------------------------------------------------------------------------------------------
* Copyright (c) Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See License.txt in the project root for license information.
*--------------------------------------------------------------------------------------------*/
.vscode-dark img[src$=\#gh-light-mode-only],
.vscode-light img[src$=\#gh-dark-mode-only] {
display: none;
}
</style>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/Microsoft/vscode/extensions/markdown-language-features/media/markdown.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/Microsoft/vscode/extensions/markdown-language-features/media/highlight.css">
<style>
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe WPC', 'Segoe UI', system-ui, 'Ubuntu', 'Droid Sans', sans-serif;
font-size: 14px;
line-height: 1.6;
}
</style>
<style>
.task-list-item {
list-style-type: none;
}
.task-list-item-checkbox {
margin-left: -20px;
vertical-align: middle;
pointer-events: none;
}
</style>
</head>
<body class="vscode-body vscode-light">
<ul class="contains-task-list">
<li class="task-list-item enabled">
<p><input class="task-list-item-checkbox" checked=""type="checkbox"> aggiungere media tabelle</p>
</li>
<li class="task-list-item enabled">
<p><input class="task-list-item-checkbox" checked=""type="checkbox"> plot; 3 tipi (appunti + email + garg)</p>
</li>
<li class="task-list-item enabled">
<p><input class="task-list-item-checkbox" checked=""type="checkbox"> sistemare kfcv baseline</p>
</li>
<li class="task-list-item enabled">
<p><input class="task-list-item-checkbox" checked=""type="checkbox"> aggiungere metodo con CC oltre SLD</p>
</li>
<li class="task-list-item enabled">
<p><input class="task-list-item-checkbox" checked=""type="checkbox"> prendere classe più popolosa di rcv1, togliere negativi fino a raggiungere 50/50; poi fare subsampling con 9 training prvalences (da 0.1-0.9 a 0.9-0.1)</p>
</li>
<li class="task-list-item enabled">
<p><input class="task-list-item-checkbox" checked=""type="checkbox"> variare parametro recalibration in SLD</p>
</li>
<li class="task-list-item enabled">
<p><input class="task-list-item-checkbox" checked=""type="checkbox"> fix grafico diagonal</p>
<ul>
<li>seaborn example gallery</li>
</ul>
</li>
<li class="task-list-item enabled">
<p><input class="task-list-item-checkbox" checked=""type="checkbox"> varianti recalib: bcts, SLD (provare exact_train_prev=False)</p>
</li>
<li class="task-list-item enabled">
<p><input class="task-list-item-checkbox" checked=""type="checkbox"> vedere cosa usa garg di validation size</p>
</li>
<li class="task-list-item enabled">
<p><input class="task-list-item-checkbox" checked=""type="checkbox"> per model selection testare il parametro c del classificatore, si esplora in np.logscale(-3,3, 7) oppure np.logscale(-4, 4, 9), parametro class_weight si esplora in None oppure &quot;balanced&quot;; va usato qp.model_selection.GridSearchQ in funzione di mae come errore, UPP come protocollo</p>
<ul>
<li>qp.train_test_split per avere v_train e v_val</li>
<li>GridSearchQ(
model: BaseQuantifier,
param_grid: {
'classifier__C': np.logspace(-3,3,7),
'classifier__class_weight': [None, 'balanced'],
'recalib': [None, 'bcts']
},
protocol: UPP(V_val, repeats=1000),
error = qp.error.mae,
refit=True,
timeout=-1,
n_jobs=-2,
verbose=True).fit(V_tr)</li>
</ul>
</li>
<li class="task-list-item enabled">
<p><input class="task-list-item-checkbox" checked=""type="checkbox"> plot collettivo, con sulla x lo shift e prenda in considerazione tutti i training set, facendo la media sui 9 casi (ogni line è un metodo), risultati non ottimizzati e ottimizzati</p>
</li>
<li class="task-list-item enabled">
<p><input class="task-list-item-checkbox" checked=""type="checkbox"> salvare il best score ottenuto da ogni applicazione di GridSearchQ</p>
<ul>
<li>nel caso di bin fare media dei due best score</li>
</ul>
</li>
<li class="task-list-item enabled">
<p><input class="task-list-item-checkbox" checked=""type="checkbox"> import baselines</p>
</li>
<li class="task-list-item enabled">
<p><input class="task-list-item-checkbox"type="checkbox"> importare mandoline</p>
<ul>
<li>mandoline può essere importato, ma richiedere uno slicing delle features a priori che devere essere realizzato ad hoc</li>
</ul>
</li>
<li class="task-list-item enabled">
<p><input class="task-list-item-checkbox"type="checkbox"> sistemare vecchie iw baselines</p>
<ul>
<li>non possono essere fixate perché dipendono da numpy</li>
</ul>
</li>
<li class="task-list-item enabled">
<p><input class="task-list-item-checkbox" checked=""type="checkbox"> plot avg con train prevalence sull'asse x e media su test prevalecne</p>
</li>
<li class="task-list-item enabled">
<p><input class="task-list-item-checkbox" checked=""type="checkbox"> realizzare grid search per task specifico partendo da GridSearchQ</p>
</li>
<li class="task-list-item enabled">
<p><input class="task-list-item-checkbox" checked=""type="checkbox"> provare PACC come quantificatore</p>
</li>
<li class="task-list-item enabled">
<p><input class="task-list-item-checkbox" checked=""type="checkbox"> aggiungere etichette in shift plot</p>
</li>
<li class="task-list-item enabled">
<p><input class="task-list-item-checkbox" checked=""type="checkbox"> sistemare exact_train quapy</p>
</li>
<li class="task-list-item enabled">
<p><input class="task-list-item-checkbox" checked=""type="checkbox"> testare anche su imbd</p>
</li>
<li class="task-list-item enabled">
<p><input class="task-list-item-checkbox"type="checkbox"> rivedere nuove baselines</p>
</li>
</ul>
</body>
</html>

66
TODO.md
View File

@@ -1,66 +0,0 @@
- [x] add table averages
- [x] plots; 3 types (notes + email + garg)
- [x] fix the kfcv baseline
- [x] add a CC-based method besides SLD
- [x] take the most populous class of rcv1, remove negatives until reaching 50/50; then subsample with 9 training prevalences (from 0.1-0.9 to 0.9-0.1)
- [x] vary the recalibration parameter in SLD
- [x] fix the diagonal plot
- seaborn example gallery
- [x] recalib variants: bcts, SLD (try exact_train_prev=False)
- [x] check what validation size garg uses
- [x] for model selection, test the classifier's C parameter, explored over np.logspace(-3, 3, 7) or np.logspace(-4, 4, 9), and the class_weight parameter, explored over None or "balanced"; use qp.model_selection.GridSearchQ with mae as the error and UPP as the protocol (see the runnable sketch after this list)
- qp.train_test_split to obtain v_train and v_val
- GridSearchQ(
model: BaseQuantifier,
param_grid: {
'classifier__C': np.logspace(-3,3,7),
'classifier__class_weight': [None, 'balanced'],
'recalib': [None, 'bcts']
},
protocol: UPP(V_val, repeats=1000),
error = qp.error.mae,
refit=True,
timeout=-1,
n_jobs=-2,
verbose=True).fit(V_tr)
- [x] aggregate plot, with shift on the x axis, considering all training sets and averaging over the 9 cases (each line is a method), for both non-optimized and optimized results
- [x] save the best score obtained by each run of GridSearchQ
- in the binary case, average the two best scores
- [x] import baselines
- [ ] import mandoline
- mandoline can be imported, but it requires an a-priori slicing of the features that must be built ad hoc
- [ ] fix the old iw baselines
- they cannot be fixed because they depend on numpy
- [x] avg plot with train prevalence on the x axis, averaged over test prevalence
- [x] build a task-specific grid search starting from GridSearchQ
- [x] try PACC as quantifier
- [x] add labels to the shift plot
- [x] fix exact_train in quapy
- [x] also test on imdb
- [x] add remote execution via ssh
- [x] test confidence with both max_conf and entropy
- [x] implement mul3
- [ ] review the new baselines
- [ ] import new datasets
- [ ] test kernel density estimation (alternative to sld)
- [ ] statistical significance (Monday, 10.00)
- [ ] use a different classification method, both as the base classifier and inside the quantifier, for cifar10
- [ ] evaluate other possible explorations of the binary case
multiclass:
- [x] add a class to handle the estimator result (ExtendedPrev)
- [x] fix the return value in MCAE and BQAE estimate
- [x] update acc and f1 in error.py
- [x] update report.py so that the dataframe index is a tuple of prevalences
- [x] update the plots to follow the report changes
- [x] add multiclass support in dataset.py
- [x] add group_false in ExtensionPolicy
- [ ] update BQAE so that the quantifiers adapt to the setting (binary/multi depending on group_false)
fix:
- [ ] make quantifiers predict 0 prevalence for classes for which we have 0 samples
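A minimal, runnable sketch of the GridSearchQ recipe from the model-selection item above, assuming quapy's API as sketched in the list; the dataset, classifier and split proportion are illustrative choices, not project settings:

import numpy as np
import quapy as qp
from quapy.method.aggregative import EMQ
from quapy.protocol import UPP
from sklearn.linear_model import LogisticRegression

train, test = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=10).train_test
V_tr, V_val = train.split_stratified(train_prop=0.6, random_state=0)  # v_train / v_val

mod_sel = qp.model_selection.GridSearchQ(
    model=EMQ(LogisticRegression()),                      # SLD (EMQ) over an LR classifier
    param_grid={
        'classifier__C': np.logspace(-3, 3, 7),
        'classifier__class_weight': [None, 'balanced'],
        'recalib': [None, 'bcts'],
    },
    protocol=UPP(V_val, repeats=1000),                    # UPP as the evaluation protocol
    error=qp.error.mae,                                   # mae as the model-selection error
    refit=True,
    timeout=-1,
    n_jobs=-2,
    verbose=True,
).fit(V_tr)
print(mod_sel.best_params_, mod_sel.best_score_)          # the best score, to be saved per the item above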

View File

@@ -1,90 +0,0 @@
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
import quapy as qp
from method.kdey import KDEyML, KDEyCS, KDEyHD
from quapy.protocol import APP
from quapy.method.aggregative import PACC, ACC, EMQ, PCC, CC, DMy
datasets = qp.datasets.UCI_DATASETS
# target = 'f1'
target = 'acc'
errors = []
# dataset_name = datasets[-2]
for dataset_name in datasets:
if dataset_name in ['balance.2', 'acute.a', 'acute.b', 'iris.1']:
continue
train, test = qp.datasets.fetch_UCIDataset(dataset_name).train_test
print(f'dataset name = {dataset_name}')
print(f'#train = {len(train)}')
print(f'#test = {len(test)}')
cls = LogisticRegression()
train, val = train.split_stratified(random_state=0)
cls.fit(*train.Xy)
y_val = val.labels
y_hat_val = cls.predict(val.instances)
for sample in APP(test, n_prevalences=11, repeats=1, sample_size=100, return_type='labelled_collection')():
print('='*80)
y_hat = cls.predict(sample.instances)
y = sample.labels
if target == 'acc':
acc = (y_hat==y).mean()
else:
acc = f1_score(y, y_hat, zero_division=0)
q = EMQ(cls)
q.fit(train, fit_classifier=False)
# q = EMQ(cls)
# q.fit(train, val_split=val, fit_classifier=False)
M_hat = ACC.getPteCondEstim(train.classes_, y_val, y_hat_val)
M_true = ACC.getPteCondEstim(train.classes_, y, y_hat)
p_hat = q.quantify(sample.instances)
cont_table_hat = p_hat * M_hat
tp = cont_table_hat[1,1]
tn = cont_table_hat[0,0]
fn = cont_table_hat[0,1]
fp = cont_table_hat[1,0]
if target == 'acc':
acc_hat = (tp+tn)
else:
den = (2*tp + fn + fp)
if den > 0:
acc_hat = 2*tp / den
else:
acc_hat = 0
error = abs(acc - acc_hat)
errors.append(error)
print('true_prev: ', sample.prevalence())
print('estim_prev: ', p_hat)
print('M-true:\n', M_true)
print('M-hat:\n', M_hat)
print('cont_table:\n', cont_table_hat)
print(f'classifier accuracy={acc:.3f}')
print(f'estimated accuracy={acc_hat:.3f}')
print(f'estimation error={error:.4f}')
print('process end')
print('='*80)
print(f'mean error = {np.mean(errors)}')
print(f'std error = {np.std(errors)}')
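A note on the estimate used above, assuming quapy's convention for ACC.getPteCondEstim (entry $(i,j)$ estimating $P(\hat{y}=i \mid y=j)$): with $\hat{p}$ the quantified test prevalence, the broadcast product p_hat * M_hat gives

$$\hat{C}_{ij} = \hat{P}(\hat{y}=i \mid y=j)\,\hat{p}_j \approx \hat{Q}(\hat{y}=i,\, y=j),$$

which is why tn, fn, fp, tp are read off as $\hat{C}_{00}, \hat{C}_{01}, \hat{C}_{10}, \hat{C}_{11}$ and accuracy is estimated as tp + tn; reusing the validation-estimated $\hat{P}(\hat{y}\mid y)$ on the test sample rests on the prior-probability-shift assumption.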

View File

@@ -1,269 +0,0 @@
import numpy as np
import scipy.special
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
import quapy as qp
from quapy.protocol import APP
from quapy.method.aggregative import PACC, ACC, EMQ, PCC, CC, DMy, T50, MS2, KDEyML, KDEyCS, KDEyHD
from sklearn import clone
import quapy.functional as F
# datasets = qp.datasets.UCI_DATASETS
datasets = ['imdb']
# target = 'f1'
target = 'acc'
errors = []
def method_1(cls, train, val, sample, y=None, y_hat=None):
"""
Converts a misclassification matrix computed in validation (i.e., in the train distribution P) into
the corresponding equivalent misclassification matrix in test (i.e., in the test distribution Q)
by relying on the PPS assumptions.
:return: tuple (tn, fn, fp, tp,) of floats in [0,1] summing up to 1
"""
y_val = val.labels
y_hat_val = cls.predict(val.instances)
# q = EMQ(LogisticRegression(class_weight='balanced'))
# q.fit(val, fit_classifier=True)
q = EMQ(cls)
q.fit(train, fit_classifier=False)
# q = KDEyML(cls)
# q.fit(train, val_split=val, fit_classifier=False)
M_hat = ACC.getPteCondEstim(train.classes_, y_val, y_hat_val)
M_true = ACC.getPteCondEstim(train.classes_, y, y_hat)
p_hat = q.quantify(sample.instances)
cont_table_hat = p_hat * M_hat
# cont_table_hat = np.clip(cont_table_hat, 0, 1)
# cont_table_hat = cont_table_hat / cont_table_hat.sum()
print('true_prev: ', sample.prevalence())
print('estim_prev: ', p_hat)
print('M-true:\n', M_true)
print('M-hat:\n', M_hat)
print('cont_table:\n', cont_table_hat)
print('cont_table Sum :\n', cont_table_hat.sum())
tp = cont_table_hat[1, 1]
tn = cont_table_hat[0, 0]
fn = cont_table_hat[0, 1]
fp = cont_table_hat[1, 0]
return tn, fn, fp, tp
def method_2(cls, train, val, sample, y=None, y_hat=None):
"""
Assume P and Q are the training and test distributions
Solves the following system of linear equations:
tp + fp = CC (the classify & count estimate, observed)
fn + tp = Q(Y=1) (this is not observed but is estimated via quantification)
tp + fp + fn + tn = 1 (trivial)
There are 4 unknowns and 3 equations. The fourth required one is established
by assuming that the PPS conditions hold, i.e., that P(X|Y)=Q(X|Y); note that
this implies P(hatY|Y)=Q(hatY|Y) if hatY is computed by any measurable function.
In particular, we consider that the tpr in P (estimated via validation, hereafter tpr) and
in Q (unknown, hereafter tpr_Q) should
be the same. This means:
tpr = tpr_Q = tp / (tp + fn)
after some manipulation:
tp (tpr-1) + fn (tpr) = 0 <-- our last equation
Note that the last equation relies on the estimate tpr. It is likely that, the more
positives we have, the more reliable this estimate is. This suggests that, in cases
in which we have more negatives in the validation set than positives, it might be
convenient to resort to the true negative rate (tnr) instead. This gives rise to
the alternative fourth equation:
tn (tnr-1) + fp (tnr) = 0
:return: tuple (tn, fn, fp, tp,) of floats in [0,1] summing up to 1
"""
y_val = val.labels
y_hat_val = cls.predict(val.instances)
q = ACC(cls)
q.fit(train, val_split=val, fit_classifier=False)
p_hat = q.quantify(sample.instances)
pos_prev = p_hat[1]
# pos_prev = sample.prevalence()[1]
cc = CC(cls)
cc.fit(train, fit_classifier=False)
cc_prev = cc.quantify(sample.instances)[1]
M_hat = ACC.getPteCondEstim(train.classes_, y_val, y_hat_val)
M_true = ACC.getPteCondEstim(train.classes_, y, y_hat)
cont_table_true = sample.prevalence() * M_true
if val.prevalence()[1] > 0.5:
# in this case, the tpr might be a more reliable estimate than tnr
tpr_hat = M_hat[1, 1]
A = np.asarray([
[0, 0, 1, 1],
[0, 1, 0, 1],
[1, 1, 1, 1],
[0, tpr_hat, 0, tpr_hat - 1]
])
else:
# in this case, the tnr might be a more reliable estimate than tpr
tnr_hat = M_hat[0, 0]
A = np.asarray([
[0, 0, 1, 1],
[0, 1, 0, 1],
[1, 1, 1, 1],
[tnr_hat-1, 0, tnr_hat, 0]
])
b = np.asarray(
[cc_prev, pos_prev, 1, 0]
)
tn, fn, fp, tp = np.linalg.solve(A, b)
cont_table_estim = np.asarray([
[tn, fn],
[fp, tp]
])
# if (cont_table_estim < 0).any() or (cont_table_estim>1).any():
# cont_table_estim = scipy.special.softmax(cont_table_estim)
print('true_prev: ', sample.prevalence())
print('estim_prev: ', p_hat)
print('true_cont_table:\n', cont_table_true)
print('estim_cont_table:\n', cont_table_estim)
# print('true_tpr', M_true[1,1])
# print('estim_tpr', tpr_hat)
return tn, fn, fp, tp
def method_3(cls, train, val, sample, y=None, y_hat=None):
"""
This is just method 2 but without involving any quapy's quantifier.
:return: tuple (tn, fn, fp, tp,) of floats in [0,1] summing up to 1
"""
classes = val.classes_
y_val = val.labels
y_hat_val = cls.predict(val.instances)
M_hat = ACC.getPteCondEstim(classes, y_val, y_hat_val)
y_hat_test = cls.predict(sample.instances)
pos_prev_cc = F.prevalence_from_labels(y_hat_test, classes)[1]
tpr_hat = M_hat[1,1]
fpr_hat = M_hat[1,0]
tnr_hat = M_hat[0,0]
pos_prev_test_hat = (pos_prev_cc - fpr_hat) / (tpr_hat - fpr_hat)
pos_prev_test_hat = np.clip(pos_prev_test_hat, 0, 1)
pos_prev_val = val.prevalence()[1]
if pos_prev_val > 0.5:
# in this case, the tpr might be a more reliable estimate than tnr
A = np.asarray([
[0, 0, 1, 1],
[0, 1, 0, 1],
[1, 1, 1, 1],
[0, tpr_hat, 0, tpr_hat - 1]
])
else:
# in this case, the tnr might be a more reliable estimate than tpr
A = np.asarray([
[0, 0, 1, 1],
[0, 1, 0, 1],
[1, 1, 1, 1],
[tnr_hat-1, 0, tnr_hat, 0]
])
b = np.asarray(
[pos_prev_cc, pos_prev_test_hat, 1, 0]
)
tn, fn, fp, tp = np.linalg.solve(A, b)
return tn, fn, fp, tp
def cls_eval_from_counters(tn, fn, fp, tp):
if target == 'acc':
acc_hat = (tp + tn)
else:
den = (2 * tp + fn + fp)
if den > 0:
acc_hat = 2 * tp / den
else:
acc_hat = 0
return acc_hat
def cls_eval_from_labels(y, y_hat):
if target == 'acc':
acc = (y_hat == y).mean()
else:
acc = f1_score(y, y_hat, zero_division=0)
return acc
for dataset_name in datasets:
train_orig, test = qp.datasets.fetch_reviews(dataset_name, tfidf=True, min_df=10).train_test
train_prot = APP(train_orig, n_prevalences=11, repeats=1, return_type='labelled_collection', random_state=0, sample_size=10000)
for train in train_prot():
if np.product(train.prevalence()) == 0:
# skip experiments with no positives or no negatives in training
continue
cls = LogisticRegression(class_weight='balanced')
train, val = train.split_stratified(train_prop=0.5, random_state=0)
print(f'dataset name = {dataset_name}')
print(f'#train = {len(train)}, prev={F.strprev(train.prevalence())}')
print(f'#val = {len(val)}, prev={F.strprev(val.prevalence())}')
print(f'#test = {len(test)}, prev={F.strprev(test.prevalence())}')
cls.fit(*train.Xy)
for sample in APP(test, n_prevalences=21, repeats=10, sample_size=1000, return_type='labelled_collection')():
print('='*80)
y_hat = cls.predict(sample.instances)
y = sample.labels
acc_true = cls_eval_from_labels(y, y_hat)
tn, fn, fp, tp = method_3(cls, train, val, sample, y, y_hat)
acc_hat = cls_eval_from_counters(tn, fn, fp, tp)
error = abs(acc_true - acc_hat)
errors.append(error)
print(f'classifier accuracy={acc_true:.3f}')
print(f'estimated accuracy={acc_hat:.3f}')
print(f'estimation error={error:.4f}')
print('process end')
print('='*80)
print(f'mean error = {np.mean(errors)}')
print(f'std error = {np.std(errors)}')
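For reference, the linear system solved by method_2 and method_3 above, restated from the docstring with the unknowns ordered $x = (tn, fn, fp, tp)$ to match the rows of $A$ (the tnr-based variant replaces the last equation as described):

$$\begin{aligned}
fp + tp &= \hat{p}^{\,CC}_1 \\
fn + tp &= \hat{q}_1 \\
tn + fn + fp + tp &= 1 \\
\widehat{tpr}\cdot fn + (\widehat{tpr} - 1)\cdot tp &= 0
\end{aligned}$$

where $\hat{p}^{\,CC}_1$ is the classify-and-count estimate, $\hat{q}_1$ the quantified (in method_3, ACC-corrected) positive prevalence, and the last equation encodes $tpr_P = tpr_Q$ under the prior-probability-shift assumption.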

View File

@@ -1,44 +0,0 @@
import numpy as np
from sklearn.metrics import f1_score
def get_entropy(probs):
return np.sum(np.multiply(probs, np.log(probs + 1e-20)), axis=1)
def get_max_conf(probs):
return np.max(probs, axis=-1)
def find_ATC_threshold(scores, labels):
sorted_idx = np.argsort(scores)
sorted_scores = scores[sorted_idx]
sorted_labels = labels[sorted_idx]
fp = np.sum(labels == 0)
fn = 0.0
min_fp_fn = np.abs(fp - fn)
thres = 0.0
for i in range(len(labels)):
if sorted_labels[i] == 0:
fp -= 1
else:
fn += 1
if np.abs(fp - fn) < min_fp_fn:
min_fp_fn = np.abs(fp - fn)
thres = sorted_scores[i]
return min_fp_fn, thres
def get_ATC_acc(thres, scores):
return np.mean(scores >= thres)
def get_ATC_f1(thres, scores, probs, average="binary"):
preds = np.argmax(probs, axis=-1)
estim_y = np.abs(1 - (scores >= thres) ^ preds)
return f1_score(estim_y, preds, average=average)
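A minimal usage sketch of the ATC helpers above, assuming (as in the ATC baseline) that the labels passed to find_ATC_threshold are 0/1 correctness indicators of the validation predictions; clf, X_val, y_val and X_test are illustrative names:

probs_val = clf.predict_proba(X_val)                                 # validation posteriors
probs_test = clf.predict_proba(X_test)                               # posteriors on a test sample
correct_val = (probs_val.argmax(axis=1) == y_val).astype(int)        # 0/1 correctness on validation
_, thres = find_ATC_threshold(get_max_conf(probs_val), correct_val)  # threshold fit on validation scores
estim_acc = get_ATC_acc(thres, get_max_conf(probs_test))             # estimated accuracy on the test sample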

View File

@@ -1,277 +0,0 @@
"""
Relative Unconstrained Least-Squares Fitting (RuLSIF): A Python Implementation
References:
'Change-point detection in time-series data by relative density-ratio estimation'
Song Liu, Makoto Yamada, Nigel Collier and Masashi Sugiyama,
Neural Networks 43 (2013) 72-83.
'A Least-squares Approach to Direct Importance Estimation'
Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama,
Journal of Machine Learning Research 10 (2009) 1391-1445.
"""
from warnings import warn
from numpy import (
array,
asarray,
asmatrix,
diag,
diagflat,
empty,
exp,
inf,
log,
matrix,
multiply,
ones,
power,
sum,
)
from numpy.linalg import solve
from numpy.random import randint
from .density_ratio import DensityRatio, KernelInfo
from .helpers import guvectorize_compute, np_float, to_ndarray
def RuLSIF(x, y, alpha, sigma_range, lambda_range, kernel_num=100, verbose=True):
"""
Estimation of the alpha-Relative Density Ratio p(x)/p_alpha(x) by RuLSIF
(Relative Unconstrained Least-Square Importance Fitting)
p_alpha(x) = alpha * p(x) + (1 - alpha) * q(x)
Arguments:
x (numpy.matrix): Sample from p(x).
y (numpy.matrix): Sample from q(x).
alpha (float): Mixture parameter.
sigma_range (list<float>): Search range of Gaussian kernel bandwidth.
lambda_range (list<float>): Search range of regularization parameter.
kernel_num (int): Number of kernels. (Default 100)
verbose (bool): Indicator to print messages (Default True)
Returns:
densratio.DensityRatio object which has `compute_density_ratio()`.
"""
# Number of samples.
nx = x.shape[0]
ny = y.shape[0]
# Number of kernel functions.
kernel_num = min(kernel_num, nx)
# Randomly take a subset of x, to identify centers for the kernels.
centers = x[randint(nx, size=kernel_num)]
if verbose:
print("RuLSIF starting...")
if len(sigma_range) == 1 and len(lambda_range) == 1:
sigma = sigma_range[0]
lambda_ = lambda_range[0]
else:
if verbose:
print("Searching for the optimal sigma and lambda...")
# Grid-search cross-validation for optimal kernel and regularization parameters.
opt_params = search_sigma_and_lambda(
x, y, alpha, centers, sigma_range, lambda_range, verbose
)
sigma = opt_params["sigma"]
lambda_ = opt_params["lambda"]
if verbose:
print(
"Found optimal sigma = {:.3f}, lambda = {:.3f}.".format(sigma, lambda_)
)
if verbose:
print("Optimizing theta...")
phi_x = compute_kernel_Gaussian(x, centers, sigma)
phi_y = compute_kernel_Gaussian(y, centers, sigma)
H = alpha * (phi_x.T.dot(phi_x) / nx) + (1 - alpha) * (phi_y.T.dot(phi_y) / ny)
h = phi_x.mean(axis=0).T
theta = asarray(solve(H + diag(array(lambda_).repeat(kernel_num)), h)).ravel()
# No negative coefficients.
theta[theta < 0] = 0
# Compute the alpha-relative density ratio, at the given coordinates.
def alpha_density_ratio(coordinates):
# Evaluate the kernel at these coordinates, and take the dot-product with the weights.
coordinates = to_ndarray(coordinates)
phi_x = compute_kernel_Gaussian(coordinates, centers, sigma)
alpha_density_ratio = phi_x @ theta
return alpha_density_ratio
# Compute the approximate alpha-relative PE-divergence, given samples x and y from the respective distributions.
def alpha_PE_divergence(x, y):
# This is Y, in Reference 1.
x = to_ndarray(x)
# Obtain alpha-relative density ratio at these points.
g_x = alpha_density_ratio(x)
# This is Y', in Reference 1.
y = to_ndarray(y)
# Obtain alpha-relative density ratio at these points.
g_y = alpha_density_ratio(y)
# Compute the alpha-relative PE-divergence as given in Reference 1.
n = x.shape[0]
divergence = (
-alpha * (g_x @ g_x) / 2 - (1 - alpha) * (g_y @ g_y) / 2 + g_x.sum(axis=0)
) / n - 1.0 / 2
return divergence
# Compute the approximate alpha-relative KL-divergence, given samples x and y from the respective distributions.
def alpha_KL_divergence(x, y):
# This is Y, in Reference 1.
x = to_ndarray(x)
# Obtain alpha-relative density ratio at these points.
g_x = alpha_density_ratio(x)
# Compute the alpha-relative KL-divergence.
n = x.shape[0]
divergence = log(g_x).sum(axis=0) / n
return divergence
alpha_PE = alpha_PE_divergence(x, y)
alpha_KL = alpha_KL_divergence(x, y)
if verbose:
print("Approximate alpha-relative PE-divergence = {:03.2f}".format(alpha_PE))
print("Approximate alpha-relative KL-divergence = {:03.2f}".format(alpha_KL))
kernel_info = KernelInfo(
kernel_type="Gaussian", kernel_num=kernel_num, sigma=sigma, centers=centers
)
result = DensityRatio(
method="RuLSIF",
alpha=alpha,
theta=theta,
lambda_=lambda_,
alpha_PE=alpha_PE,
alpha_KL=alpha_KL,
kernel_info=kernel_info,
compute_density_ratio=alpha_density_ratio,
)
if verbose:
print("RuLSIF completed.")
return result
# Grid-search cross-validation for the optimal parameters sigma and lambda by leave-one-out cross-validation. See Reference 2.
def search_sigma_and_lambda(x, y, alpha, centers, sigma_range, lambda_range, verbose):
nx = x.shape[0]
ny = y.shape[0]
n_min = min(nx, ny)
kernel_num = centers.shape[0]
score_new = inf
sigma_new = 0
lambda_new = 0
for sigma in sigma_range:
phi_x = compute_kernel_Gaussian(x, centers, sigma) # (nx, kernel_num)
phi_y = compute_kernel_Gaussian(y, centers, sigma) # (ny, kernel_num)
H = alpha * (phi_x.T @ phi_x / nx) + (1 - alpha) * (
phi_y.T @ phi_y / ny
) # (kernel_num, kernel_num)
h = phi_x.mean(axis=0).reshape(-1, 1) # (kernel_num, 1)
phi_x = phi_x[:n_min].T # (kernel_num, n_min)
phi_y = phi_y[:n_min].T # (kernel_num, n_min)
for lambda_ in lambda_range:
B = H + diag(
array(lambda_ * (ny - 1) / ny).repeat(kernel_num)
) # (kernel_num, kernel_num)
B_inv_X = solve(B, phi_y) # (kernel_num, n_min)
X_B_inv_X = multiply(phi_y, B_inv_X) # (kernel_num, n_min)
denom = ny * ones(n_min) - ones(kernel_num) @ X_B_inv_X # (n_min, )
B0 = solve(B, h @ ones((1, n_min))) + B_inv_X @ diagflat(
h.T @ B_inv_X / denom
) # (kernel_num, n_min)
B1 = solve(B, phi_x) + B_inv_X @ diagflat(
ones(kernel_num) @ multiply(phi_x, B_inv_X)
) # (kernel_num, n_min)
B2 = (ny - 1) * (nx * B0 - B1) / (ny * (nx - 1)) # (kernel_num, n_min)
B2[B2 < 0] = 0
r_y = multiply(phi_y, B2).sum(axis=0).T # (n_min, )
r_x = multiply(phi_x, B2).sum(axis=0).T # (n_min, )
# Squared loss of RuLSIF, without regularization term.
# Directly related to the negative of the PE-divergence.
score = (r_y @ r_y / 2 - r_x.sum(axis=0)) / n_min
if verbose:
print(
"sigma = %.5f, lambda = %.5f, score = %.5f"
% (sigma, lambda_, score)
)
if score < score_new:
score_new = score
sigma_new = sigma
lambda_new = lambda_
return {"sigma": sigma_new, "lambda": lambda_new}
def _compute_kernel_Gaussian(x_list, y_row, neg_gamma, res) -> None:
sq_norm = sum(power(x_list - y_row, 2), 1)
multiply(neg_gamma, sq_norm, res)
exp(res, res)
def _target_numpy_wrapper(x_list, y_list, neg_gamma):
res = empty((y_list.shape[0], x_list.shape[0]), np_float)
if isinstance(x_list, matrix) or isinstance(y_list, matrix):
res = asmatrix(res)
for j, y_row in enumerate(y_list):
# `.T` aligns shapes for matrices, does nothing for 1D ndarray.
_compute_kernel_Gaussian(x_list, y_row, neg_gamma, res[j].T)
return res
_compute_functions = {"numpy": _target_numpy_wrapper}
if guvectorize_compute:
_compute_functions.update(
{
key: guvectorize_compute(key)(_compute_kernel_Gaussian)
for key in ("cpu", "parallel")
}
)
_compute_function = _compute_functions[
"cpu" if "cpu" in _compute_functions else "numpy"
]
# Returns a 2D numpy matrix of kernel evaluated at the gridpoints with coordinates from x_list and y_list.
def compute_kernel_Gaussian(x_list, y_list, sigma):
return _compute_function(x_list, y_list, -0.5 * sigma**-2).T
def set_compute_kernel_target(target: str) -> None:
global _compute_function
if target not in ("numpy", "cpu", "parallel"):
raise ValueError(
"'target' must be one of the following: 'numpy', 'cpu', or 'parallel'."
)
if target not in _compute_functions:
warn("'numba' not available; defaulting to 'numpy'.", ImportWarning)
target = "numpy"
_compute_function = _compute_functions[target]

View File

@@ -1,7 +0,0 @@
from warnings import filterwarnings
from .core import densratio
from .RuLSIF import set_compute_kernel_target
filterwarnings("default", message="'numba'", category=ImportWarning, module="densratio")
__all__ = ["densratio", "set_compute_kernel_target"]

View File

@@ -1,70 +0,0 @@
"""
densratio.core
~~~~~~~~~~~~~~
Estimate Density Ratio p(x)/q(y)
"""
from numpy import linspace
from .helpers import to_ndarray
from .RuLSIF import RuLSIF
def densratio(
x, y, alpha=0, sigma_range="auto", lambda_range="auto", kernel_num=100, verbose=True
):
"""Estimate alpha-mixture Density Ratio p(x)/(alpha*p(x) + (1 - alpha)*q(x))
Arguments:
x: sample from p(x).
y: sample from q(x).
alpha: Default 0 - corresponds to ordinary density ratio.
sigma_range: search range of Gaussian kernel bandwidth.
Default "auto" means 10^-3, 10^-2, ..., 10^9.
lambda_range: search range of regularization parameter for uLSIF.
Default "auto" means 10^-3, 10^-2, ..., 10^9.
kernel_num: number of kernels. Default 100.
verbose: indicator to print messages. Default True.
Returns:
densratio.DensityRatio object which has `compute_density_ratio()`.
Raises:
ValueError: if dimension of x != dimension of y
Usage::
>>> from scipy.stats import norm
>>> from densratio import densratio
>>> x = norm.rvs(size=200, loc=1, scale=1./8)
>>> y = norm.rvs(size=200, loc=1, scale=1./2)
>>> result = densratio(x, y, alpha=0.7)
>>> print(result)
>>> density_ratio = result.compute_density_ratio(y)
>>> print(density_ratio)
"""
x = to_ndarray(x)
y = to_ndarray(y)
if x.shape[1] != y.shape[1]:
raise ValueError("x and y must be same dimensions.")
if isinstance(sigma_range, str) and sigma_range != "auto":
raise TypeError("Invalid value for sigma_range.")
if isinstance(lambda_range, str) and lambda_range != "auto":
raise TypeError("Invalid value for lambda_range.")
if sigma_range is None or (isinstance(sigma_range, str) and sigma_range == "auto"):
sigma_range = 10 ** linspace(-3, 9, 13)
if lambda_range is None or (
isinstance(lambda_range, str) and lambda_range == "auto"
):
lambda_range = 10 ** linspace(-3, 9, 13)
result = RuLSIF(x, y, alpha, sigma_range, lambda_range, kernel_num, verbose)
return result

View File

@@ -1,88 +0,0 @@
from pprint import pformat
from re import sub
class DensityRatio:
"""Density Ratio."""
def __init__(
self,
method,
alpha,
theta,
lambda_,
alpha_PE,
alpha_KL,
kernel_info,
compute_density_ratio,
):
self.method = method
self.alpha = alpha
self.theta = theta
self.lambda_ = lambda_
self.alpha_PE = alpha_PE
self.alpha_KL = alpha_KL
self.kernel_info = kernel_info
self.compute_density_ratio = compute_density_ratio
def __str__(self):
return """
Method: %(method)s
Alpha: %(alpha)s
Kernel Information:
%(kernel_info)s
Kernel Weights (theta):
%(theta)s
Regularization Parameter (lambda): %(lambda_)s
Alpha-Relative PE-Divergence: %(alpha_PE)s
Alpha-Relative KL-Divergence: %(alpha_KL)s
Function to Estimate Density Ratio:
compute_density_ratio(x)
"""[
1:-1
] % dict(
method=self.method,
kernel_info=self.kernel_info,
alpha=self.alpha,
theta=my_format(self.theta),
lambda_=self.lambda_,
alpha_PE=self.alpha_PE,
alpha_KL=self.alpha_KL,
)
class KernelInfo:
"""Kernel Information."""
def __init__(self, kernel_type, kernel_num, sigma, centers):
self.kernel_type = kernel_type
self.kernel_num = kernel_num
self.sigma = sigma
self.centers = centers
def __str__(self):
return """
Kernel type: %(kernel_type)s
Number of kernels: %(kernel_num)s
Bandwidth(sigma): %(sigma)s
Centers: %(centers)s
"""[
1:-1
] % dict(
kernel_type=self.kernel_type,
kernel_num=self.kernel_num,
sigma=self.sigma,
centers=my_format(self.centers),
)
def my_format(str):
return sub(r"\s+", " ", (pformat(str).split("\n")[0] + ".."))

View File

@@ -1,36 +0,0 @@
from numpy import array, ndarray, result_type
np_float = result_type(float)
try:
import numba as nb
except ModuleNotFoundError:
guvectorize_compute = None
else:
_nb_float = nb.from_dtype(np_float)
def guvectorize_compute(target: str, *, cache: bool = True):
return nb.guvectorize(
[nb.void(_nb_float[:, :], _nb_float[:], _nb_float, _nb_float[:])],
"(m, p),(p),()->(m)",
nopython=True,
target=target,
cache=cache,
)
def is_numeric(x):
return isinstance(x, int) or isinstance(x, float)
def to_ndarray(x):
if isinstance(x, ndarray):
if len(x.shape) == 1:
return x.reshape(-1, 1)
else:
return x
elif str(type(x)) == "<class 'pandas.core.frame.DataFrame'>":
return x.values
elif not x:
raise ValueError("Cannot transform to numpy.matrix.")
else:
return to_ndarray(array(x))

View File

@@ -1,4 +0,0 @@
import numpy as np
def get_doc(probs1, probs2):
return np.mean(probs2) - np.mean(probs1)
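The two-line helper above is the difference-of-confidences (DOC) baseline. A hedged usage sketch, assuming probs1 and probs2 are the max-confidence scores on the source (validation) and target (test) samples and the usual additive correction of the validation accuracy; clf, X_val, y_val and X_test are illustrative names:

conf_val = clf.predict_proba(X_val).max(axis=1)      # confidences on validation (source)
conf_test = clf.predict_proba(X_test).max(axis=1)    # confidences on the test sample (target)
acc_val = (clf.predict(X_val) == y_val).mean()       # observed validation accuracy
estim_acc = acc_val + get_doc(conf_val, conf_test)   # acc_val + (mean(conf_test) - mean(conf_val))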

View File

@@ -1,5 +0,0 @@
import numpy as np
def get_score(pred1, pred2):
return np.mean(pred1 == pred2)

View File

@@ -1,66 +0,0 @@
import numpy as np
from scipy.sparse import issparse, vstack
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity
from baselines import densratio
from baselines.pykliep import DensityRatioEstimator
def kliep(Xtr, ytr, Xte):
kliep = DensityRatioEstimator()
kliep.fit(Xtr, Xte)
return kliep.predict(Xtr)
def usilf(Xtr, ytr, Xte, alpha=0.0):
dense_ratio_obj = densratio(Xtr, Xte, alpha=alpha, verbose=False)
return dense_ratio_obj.compute_density_ratio(Xtr)
def logreg(Xtr, ytr, Xte):
# check "Direct Density Ratio Estimation for
# Large-scale Covariate Shift Adaptation", Eq.28
if issparse(Xtr):
X = vstack([Xtr, Xte])
else:
X = np.concatenate([Xtr, Xte])
y = [0] * Xtr.shape[0] + [1] * Xte.shape[0]
logreg = GridSearchCV(
LogisticRegression(),
param_grid={"C": np.logspace(-3, 3, 7), "class_weight": ["balanced", None]},
n_jobs=-1,
)
logreg.fit(X, y)
probs = logreg.predict_proba(Xtr)
prob_train, prob_test = probs[:, 0], probs[:, 1]
prior_train = Xtr.shape[0]
prior_test = Xte.shape[0]
w = (prior_train / prior_test) * (prob_test / prob_train)
return w
kdex2_params = {"bandwidth": np.logspace(-1, 1, 20)}
def kdex2_lltr(Xtr):
if issparse(Xtr):
Xtr = Xtr.toarray()
return GridSearchCV(KernelDensity(), kdex2_params).fit(Xtr).score_samples(Xtr)
def kdex2_weights(Xtr, Xte, log_likelihood_tr):
log_likelihood_te = (
GridSearchCV(KernelDensity(), kdex2_params).fit(Xte).score_samples(Xtr)
)
likelihood_tr = np.exp(log_likelihood_tr)
likelihood_te = np.exp(log_likelihood_te)
return likelihood_te / likelihood_tr
def get_acc(tr_preds, ytr, w):
return np.sum((1.0 * (tr_preds == ytr)) * w) / np.sum(w)

View File

@@ -1,261 +0,0 @@
from functools import partial
from types import SimpleNamespace
from typing import List, Optional
import numpy as np
import scipy.optimize
import scipy.special
import sklearn.metrics.pairwise as skmetrics
def Phi(
D: np.ndarray,
edge_list: np.ndarray = None,
):
"""
Given an n x d matrix of (example, slices), calculate the potential
matrix.
Includes correlations modeled by the edges in the `edge_list`.
Args:
D (np.ndarray): n x d matrix of (example, slice)
edge_list (np.ndarray): k x 2 matrix of edge correlations to be modeled.
edge_list[i, :] should be indices for a pair of columns of D.
Returns:
Potential matrix. Equals D when edge_list is None, otherwise adds additional
(x_i * x_j) "cross-terms" corresponding to the edges in the `edge_list`.
Examples:
>>> D = np.random.choice([-1, 1], size=(100, 6))
>>> edge_list = np.array([(0, 1), (1, 4)])
>>> Phi(D, edge_list)
"""
if edge_list is not None:
pairwise_terms = (
D[np.arange(len(D)), edge_list[:, 0][:, np.newaxis]].T
* D[np.arange(len(D)), edge_list[:, 1][:, np.newaxis]].T
)
return np.concatenate([D, pairwise_terms], axis=1)
else:
return D
def log_partition_ratio(
x: np.ndarray,
Phi_D_src: np.ndarray,
n_src: int,
):
"""
Calculate the log-partition ratio in the KLIEP problem.
"""
return np.log(n_src) - scipy.special.logsumexp(Phi_D_src.dot(x))
def mandoline(
D_src: np.ndarray,
D_tgt: np.ndarray,
edge_list: np.ndarray,
sigma: float = None,
):
"""
Mandoline solver.
Args:
D_src: (n_src x d) matrix of (example, slices) for the source distribution.
D_tgt: (n_tgt x d) matrix of (example, slices) for the target distribution.
edge_list: list of edge correlations between slices that should be modeled.
sigma: optional parameter that activates RBF kernel-based KLIEP with scale
`sigma`.
Returns: SimpleNamespace that contains
opt: result of scipy.optimize
Phi_D_src: source potential matrix used in Mandoline
Phi_D_tgt: target potential matrix used in Mandoline
n_src: number of source samples
n_tgt: number of target samples
edge_list: the `edge_list` parameter passed as input
"""
# Copy and binarize the input matrices to -1/1
D_src, D_tgt = np.copy(D_src), np.copy(D_tgt)
if np.min(D_src) == 0:
D_src[D_src == 0] = -1
D_tgt[D_tgt == 0] = -1
# Edge list encoding dependencies between gs
if edge_list is not None:
edge_list = np.array(edge_list)
# Create the potential matrices
Phi_D_tgt, Phi_D_src = Phi(D_tgt, edge_list), Phi(D_src, edge_list)
# Number of examples
n_src, n_tgt = Phi_D_src.shape[0], Phi_D_tgt.shape[0]
def f(x):
obj = Phi_D_tgt.dot(x).sum() - n_tgt * scipy.special.logsumexp(Phi_D_src.dot(x))
return -obj
# Set the kernel
kernel = partial(skmetrics.rbf_kernel, gamma=sigma)
def llkliep_f(x):
obj = kernel(
Phi_D_tgt, x[:, np.newaxis]
).sum() - n_tgt * scipy.special.logsumexp(kernel(Phi_D_src, x[:, np.newaxis]))
return -obj
# Solve
if not sigma:
opt = scipy.optimize.minimize(
f, np.random.randn(Phi_D_tgt.shape[1]), method="BFGS"
)
else:
opt = scipy.optimize.minimize(
llkliep_f, np.random.randn(Phi_D_tgt.shape[1]), method="BFGS"
)
return SimpleNamespace(
opt=opt,
Phi_D_src=Phi_D_src,
Phi_D_tgt=Phi_D_tgt,
n_src=n_src,
n_tgt=n_tgt,
edge_list=edge_list,
)
def log_density_ratio(D, solved):
"""
Calculate the log density ratio for a solved Mandoline run.
"""
Phi_D = Phi(D, None)
return Phi_D.dot(solved.opt.x) + log_partition_ratio(
solved.opt.x, solved.Phi_D_src, solved.n_src
)
def get_k_most_unbalanced_gs(D_src, D_tgt, k):
"""
Get the top k slices that shift most between source and target
distributions.
Uses difference in marginals between each slice.
"""
marginal_diff = np.abs(D_src.mean(axis=0) - D_tgt.mean(axis=0))
differences = np.sort(marginal_diff)[-k:]
indices = np.argsort(marginal_diff)[-k:]
return list(indices), list(differences)
def weighted_estimator(weights: Optional[np.ndarray], mat: np.ndarray):
"""
Calculate a weighted empirical mean over a matrix of samples.
Args:
weights (Optional[np.ndarray]):
length n array of weights that sums to 1. Calculates an unweighted
mean if `weights` is None.
mat (np.ndarray):
(n x r) matrix of empirical observations that is being averaged.
Returns:
Length r np.ndarray of weighted means.
"""
_sum_weights = np.sum(weights)
if _sum_weights != 1.0:
if (_err := abs(1.0 - _sum_weights)) > 1e-15:
print(_err)
assert _sum_weights == 1, "`weights` must sum to 1."
if weights is None:
return np.mean(mat, axis=0)
return np.sum(weights[:, np.newaxis] * mat, axis=0)
def estimate_performance(
D_src: np.ndarray,
D_tgt: np.ndarray,
edge_list: np.ndarray,
empirical_mat_list_src: List[np.ndarray],
):
"""
Estimate performance on a target distribution using slice information from the
source and target data.
This function runs Mandoline to calculate the importance weights to reweight
the source data.
Args:
D_src (np.ndarray): (n_src x d) matrix of (example, slices) for the source
distribution.
D_tgt (np.ndarray): (n_tgt x d) matrix of (example, slices) for the target
distribution.
edge_list (np.ndarray):
empirical_mat_list_src (List[np.ndarray]):
Returns:
SimpleNamespace with 3 attributes
- `all_estimates` is a list of SimpleNamespace objects with
2 attributes
- `weighted` is the estimate for the target distribution
- `source` is the estimate for the source distribution
- `solved`: result of scipy.optimize Mandoline solver
- `weights`: self-normalized importance weights used to weight the source data
"""
# Run the solver
solved = mandoline(D_src, D_tgt, edge_list)
# Compute the weights on the source dataset
density_ratios = np.e ** log_density_ratio(solved.Phi_D_src, solved)
# Self-normalized importance weights
weights = density_ratios / np.sum(density_ratios)
all_estimates = []
for mat_src in empirical_mat_list_src:
# Estimates is a 1-D array of estimates for each mat e.g.
# each mat can correspond to a model's (n x 1) error matrix
weighted_estimates = weighted_estimator(weights, mat_src)
source_estimates = weighted_estimator(
np.ones(solved.n_src) / solved.n_src, mat_src
)
all_estimates.append(
SimpleNamespace(
weighted=weighted_estimates,
source=source_estimates,
)
)
return SimpleNamespace(
all_estimates=all_estimates,
solved=solved,
weights=weights,
)
###########################################################################
def get_entropy(probas):
return -np.sum(np.multiply(probas, np.log(probas + 1e-20)), axis=1)
def get_slices(probas, n_ent_bins=6):
ln, ncl = probas.shape
preds = np.argmax(probas, axis=1)
pred_slices = np.full((ln, ncl), fill_value=-1, dtype="<i8")
pred_slices[np.arange(ln), preds] = 1
ent = get_entropy(probas)
range_top = get_entropy(np.array([np.ones(ncl) / ncl]))[0]
ent_bins = np.linspace(0, range_top, n_ent_bins + 1)
bins_map = np.digitize(ent, bins=ent_bins, right=True) - 1
ent_slices = np.full((ln, n_ent_bins), fill_value=-1, dtype="<i8")
ent_slices[np.arange(ln), bins_map] = 1
return np.concatenate([pred_slices, ent_slices], axis=1)
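A hedged end-to-end sketch of the Mandoline pieces above, estimating a model's error on the target distribution from slice statistics; probs_val, probs_test, preds_val and y_val are illustrative names, and the slices are built with get_slices as defined above:

D_src = get_slices(probs_val)                              # (n_val x d) slice matrix on source data
D_tgt = get_slices(probs_test)                             # (n_test x d) slice matrix on target data
err_src = (preds_val != y_val).astype(float)[:, None]      # per-example 0/1 error matrix on source
res = estimate_performance(D_src, D_tgt, edge_list=None, empirical_mat_list_src=[err_src])
estim_target_err = res.all_estimates[0].weighted           # importance-weighted estimate (length-1 array)
source_err = res.all_estimates[0].source                   # unweighted source estimate, for comparison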

View File

@@ -1,140 +0,0 @@
# import itertools
# from typing import Iterable
# import quapy as qp
# import quapy.functional as F
# from densratio import densratio
# from quapy.method.aggregative import *
# from quapy.protocol import (
# AbstractStochasticSeededProtocol,
# OnLabelledCollectionProtocol,
# )
# from scipy.sparse import issparse, vstack
# from scipy.spatial.distance import cdist
# from scipy.stats import multivariate_normal
# from sklearn.linear_model import LogisticRegression
# from sklearn.model_selection import GridSearchCV
# from sklearn.neighbors import KernelDensity
import time
import numpy as np
import sklearn.metrics as metrics
from pykliep import DensityRatioEstimator
from quapy.protocol import APP
from scipy.sparse import issparse, vstack
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity
import baselines.impweight as iw
from baselines.densratio import densratio
from quacc.dataset import Dataset
# ---------------------------------------------------------------------------------------
# Methods of "importance weight", e.g., by ratio density estimation (KLIEP, SILF, LogReg)
# ---------------------------------------------------------------------------------------
class ImportanceWeight:
def weights(self, Xtr, ytr, Xte):
...
class KLIEP(ImportanceWeight):
def __init__(self):
pass
def weights(self, Xtr, ytr, Xte):
kliep = DensityRatioEstimator()
kliep.fit(Xtr, Xte)
return kliep.predict(Xtr)
class USILF(ImportanceWeight):
def __init__(self, alpha=0.0):
self.alpha = alpha
def weights(self, Xtr, ytr, Xte):
dense_ratio_obj = densratio(Xtr, Xte, alpha=self.alpha, verbose=False)
return dense_ratio_obj.compute_density_ratio(Xtr)
class LogReg(ImportanceWeight):
def __init__(self):
pass
def weights(self, Xtr, ytr, Xte):
# check "Direct Density Ratio Estimation for
# Large-scale Covariate Shift Adaptation", Eq.28
if issparse(Xtr):
X = vstack([Xtr, Xte])
else:
X = np.concatenate([Xtr, Xte])
y = [0] * Xtr.shape[0] + [1] * Xte.shape[0]
logreg = GridSearchCV(
LogisticRegression(),
param_grid={"C": np.logspace(-3, 3, 7), "class_weight": ["balanced", None]},
n_jobs=-1,
)
logreg.fit(X, y)
probs = logreg.predict_proba(Xtr)
prob_train, prob_test = probs[:, 0], probs[:, 1]
prior_train = Xtr.shape[0]
prior_test = Xte.shape[0]
w = (prior_train / prior_test) * (prob_test / prob_train)
return w
class KDEx2(ImportanceWeight):
def __init__(self):
pass
def weights(self, Xtr, ytr, Xte):
params = {"bandwidth": np.logspace(-1, 1, 20)}
log_likelihood_tr = (
GridSearchCV(KernelDensity(), params).fit(Xtr).score_samples(Xtr)
)
log_likelihood_te = (
GridSearchCV(KernelDensity(), params).fit(Xte).score_samples(Xtr)
)
likelihood_tr = np.exp(log_likelihood_tr)
likelihood_te = np.exp(log_likelihood_te)
return likelihood_te / likelihood_tr
if __name__ == "__main__":
# d = Dataset("rcv1", target="CCAT").get_raw()
d = Dataset("imdb", n_prevalences=1).get()[0]
tstart = time.time()
lr = LogisticRegression()
lr.fit(*d.train.Xy)
val_preds = lr.predict(d.validation.X)
protocol = APP(
d.test,
n_prevalences=21,
repeats=1,
sample_size=100,
return_type="labelled_collection",
)
results = []
for sample in protocol():
wx = iw.kliep(d.validation.X, d.validation.y, sample.X)
test_preds = lr.predict(sample.X)
estim_acc = np.sum((1.0 * (val_preds == d.validation.y)) * wx) / np.sum(wx)
true_acc = metrics.accuracy_score(sample.y, test_preds)
results.append((sample.prevalence(), estim_acc, true_acc))
tend = time.time()
for r in results:
print(*r)
print(f"logreg finished [took {tend-tstart:.3f}s]")
import win11toast
win11toast.notify("models.py", "Completed")

View File

@@ -1,221 +0,0 @@
import warnings
import numpy as np
from scipy.sparse import csr_matrix
class DensityRatioEstimator:
"""
Class to accomplish direct density estimation implementing the original KLIEP
algorithm from Direct Importance Estimation with Model Selection
and Its Application to Covariate Shift Adaptation by Sugiyama et al.
The training set is distributed via
train ~ p(x)
and the test set is distributed via
test ~ q(x).
The KLIEP algorithm and its variants approximate w(x) = q(x) / p(x) directly. The predict function returns the
estimate of w(x). The function w(x) can serve as sample weights for the training set during
training to modify the expectation function that the model's loss function is optimized via,
i.e.
E_{x ~ w(x)p(x)} loss(x) = E_{x ~ q(x)} loss(x).
Usage :
The fit method is used to run the KLIEP algorithm using LCV and returns value of J
trained on the entire training/test set with the best sigma found.
Use the predict method on the training set to determine the sample weights from the KLIEP algorithm.
"""
def __init__(
self,
max_iter=5000,
num_params=[0.1, 0.2],
epsilon=1e-4,
cv=3,
sigmas=[0.01, 0.1, 0.25, 0.5, 0.75, 1],
random_state=None,
verbose=0,
):
"""
Direct density estimation using an inner LCV loop to estimate the proper model. Can be used with sklearn
cross validation methods with or without storing the inner CV. To use a standard grid search.
max_iter : Number of iterations to perform
num_params : List of number of test set vectors used to construct the approximation for inner LCV.
Must be a float. Original paper used 10%, i.e. =.1
sigmas : List of sigmas to be used in inner LCV loop.
epsilon : Additive factor in the iterative algorithm for numerical stability.
"""
self.max_iter = max_iter
self.num_params = num_params
self.epsilon = epsilon
self.verbose = verbose
self.sigmas = sigmas
self.cv = cv
self.random_state = 0
def fit(self, X_train, X_test, alpha_0=None):
"""Uses cross validation to select sigma as in the original paper (LCV).
In a break from sklearn convention, y=X_test.
The parameter cv corresponds to R in the original paper.
Once found, the best sigma is used to train on the full set."""
# LCV loop, shuffle a copy in place for performance.
cv = self.cv
chunk = int(X_test.shape[0] / float(cv))
if self.random_state is not None:
np.random.seed(self.random_state)
# if isinstance(X_test, csr_matrix):
# X_test_shuffled = X_test.toarray()
# else:
# X_test_shuffled = X_test.copy()
X_test_shuffled = X_test.copy()
X_test_index = np.arange(X_test_shuffled.shape[0])
np.random.shuffle(X_test_index)
X_test_shuffled = X_test_shuffled[X_test_index, :]
j_scores = {}
if type(self.sigmas) != list:
self.sigmas = [self.sigmas]
if type(self.num_params) != list:
self.num_params = [self.num_params]
if len(self.sigmas) * len(self.num_params) > 1:
# Inner LCV loop
for num_param in self.num_params:
for sigma in self.sigmas:
j_scores[(num_param, sigma)] = np.zeros(cv)
for k in range(1, cv + 1):
if self.verbose > 0:
print("Training: sigma: %s R: %s" % (sigma, k))
X_test_fold = X_test_shuffled[(k - 1) * chunk : k * chunk, :]
j_scores[(num_param, sigma)][k - 1] = self._fit(
X_train=X_train,
X_test=X_test_fold,
num_parameters=num_param,
sigma=sigma,
)
j_scores[(num_param, sigma)] = np.mean(j_scores[(num_param, sigma)])
sorted_scores = sorted(
[x for x in j_scores.items() if np.isfinite(x[1])],
key=lambda x: x[1],
reverse=True,
)
if len(sorted_scores) == 0:
warnings.warn("LCV failed to converge for all values of sigma.")
return self
self._sigma = sorted_scores[0][0][1]
self._num_parameters = sorted_scores[0][0][0]
self._j_scores = sorted_scores
else:
self._sigma = self.sigmas[0]
self._num_parameters = self.num_params[0]
# best sigma
self._j = self._fit(
X_train=X_train,
X_test=X_test_shuffled,
num_parameters=self._num_parameters,
sigma=self._sigma,
)
return self # Compatibility with sklearn
def _fit(self, X_train, X_test, num_parameters, sigma, alpha_0=None):
"""Fits the estimator with the given parameters w-hat and returns J"""
num_parameters = num_parameters
if type(num_parameters) == float:
num_parameters = int(X_test.shape[0] * num_parameters)
self._select_param_vectors(
X_test=X_test, sigma=sigma, num_parameters=num_parameters
)
# if isinstance(X_train, csr_matrix):
# X_train = X_train.toarray()
X_train = self._reshape_X(X_train)
X_test = self._reshape_X(X_test)
if alpha_0 is None:
alpha_0 = np.ones(shape=(num_parameters, 1)) / float(num_parameters)
self._find_alpha(
X_train=X_train,
X_test=X_test,
num_parameters=num_parameters,
epsilon=self.epsilon,
alpha_0=alpha_0,
sigma=sigma,
)
return self._calculate_j(X_test, sigma=sigma)
def _calculate_j(self, X_test, sigma):
pred = self.predict(X_test, sigma=sigma) + 0.0000001
log = np.log(pred).sum()
return log / (X_test.shape[0])
def score(self, X_test):
"""Return the J score, similar to sklearn's API"""
return self._calculate_j(X_test=X_test, sigma=self._sigma)
@staticmethod
def _reshape_X(X):
"""Reshape input from mxn to mx1xn to take advantage of numpy broadcasting."""
if len(X.shape) != 3:
return X.reshape((X.shape[0], 1, X.shape[1]))
return X
def _select_param_vectors(self, X_test, sigma, num_parameters):
"""X_test is the test set. b is the number of parameters."""
indices = np.random.choice(X_test.shape[0], size=num_parameters, replace=False)
self._test_vectors = X_test[indices, :].copy()
self._phi_fitted = True
def _phi(self, X, sigma=None):
if sigma is None:
sigma = self._sigma
if self._phi_fitted:
return np.exp(
-np.sum((X - self._test_vectors) ** 2, axis=-1) / (2 * sigma**2)
)
raise Exception("Phi not fitted.")
def _find_alpha(self, alpha_0, X_train, X_test, num_parameters, sigma, epsilon):
A = np.zeros(shape=(X_test.shape[0], num_parameters))
b = np.zeros(shape=(num_parameters, 1))
A = self._phi(X_test, sigma)
b = self._phi(X_train, sigma).sum(axis=0) / X_train.shape[0]
b = b.reshape((num_parameters, 1))
out = alpha_0.copy()
for k in range(self.max_iter):
mat = np.dot(A, out)
mat += 0.000000001
out += epsilon * np.dot(np.transpose(A), 1.0 / mat)
out += b * (
((1 - np.dot(np.transpose(b), out)) / np.dot(np.transpose(b), b))
)
out = np.maximum(0, out)
out /= np.dot(np.transpose(b), out)
self._alpha = out
self._fitted = True
def predict(self, X, sigma=None):
"""Equivalent of w(X) from the original paper."""
X = self._reshape_X(X)
if not self._fitted:
raise Exception("Not fitted!")
return np.dot(self._phi(X, sigma=sigma), self._alpha).reshape((X.shape[0],))
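A minimal usage sketch of the DensityRatioEstimator above; per the class docstring, y = X_test in fit, and X_train, X_test are illustrative arrays:

kliep = DensityRatioEstimator(cv=3, sigmas=[0.1, 0.25, 0.5, 1.0])
kliep.fit(X_train, X_test)     # LCV over sigmas / num_params, then refit on the full sets
w = kliep.predict(X_train)     # estimated w(x) = q(x) / p(x) at the training points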

View File

@@ -1,5 +0,0 @@
import numpy as np
def get_score(pred1, pred2, labels):
return np.mean((pred1 == labels).astype(int) - (pred2 == labels).astype(int))

View File

@@ -1,8 +0,0 @@
from sklearn import clone
from sklearn.base import BaseEstimator
def clone_fit(c_model: BaseEstimator, data, labels):
c_model2 = clone(c_model)
c_model2.fit(data, labels)
return c_model2

488
conf.yaml
View File

@@ -1,488 +0,0 @@
debug_conf: &debug_conf
global:
METRICS:
- acc
OUT_DIR_NAME: output/debug
DATASET_N_PREVS: 4
# DATASET_PREVS: [[0.1, 0.1, 0.8]]
COMP_ESTIMATORS:
# - bin_sld_lr
# - mul_sld_lr
# - m3w_sld_lr
# - d_bin_sld_lr
# - d_mul_sld_lr
# - d_m3w_sld_lr
# - d_bin_sld_rbf
# - d_mul_sld_rbf
# - d_m3w_sld_rbf
# - bin_kde_lr
# - mul_kde_lr
# - m3w_kde_lr
# - d_bin_kde_lr
# - d_mul_kde_lr
# - d_m3w_kde_lr
# - d_bin_kde_rbf
# - d_mul_kde_rbf
# - d_m3w_kde_rbf
# - mandoline
# - bin_sld_lr_is
- bin_sld_lr_gs
- mul_sld_lr_gs
# - m3w_sld_lr_is
# - rca
# - rca_star
- doc
- atc_mc
N_JOBS: -2
confs:
- DATASET_NAME: twitter_gasp
other_confs:
- DATASET_NAME: rcv1
DATASET_TARGET: GCAT
- DATASET_NAME: rcv1
DATASET_TARGET: MCAT
- DATASET_NAME: imdb
- DATASET_NAME: imdb
- DATASET_NAME: rcv1
DATASET_TARGET: CCAT
test_conf: &test_conf
global:
METRICS:
- acc
- f1
OUT_DIR_NAME: output/test
DATASET_N_PREVS: 9
COMP_ESTIMATORS:
- cross
- cross2
- bin_sld_lr
- mul_sld_lr
- m3w_sld_lr
- bin_sld_lr_is
- mul_sld_lr_is
- m3w_sld_lr_is
- doc
- atc_mc
N_JOBS: -2
confs:
- DATASET_NAME: imdb
- DATASET_NAME: rcv1
DATASET_TARGET: CCAT
other_confs:
- DATASET_NAME: twitter_gasp
main:
confs: &main_confs
- DATASET_NAME: rcv1
DATASET_TARGET: CCAT
- DATASET_NAME: imdb
- DATASET_NAME: rcv1
DATASET_TARGET: GCAT
- DATASET_NAME: rcv1
DATASET_TARGET: MCAT
other_confs:
sld_lr_conf: &sld_lr_conf
global:
METRICS:
- acc
- f1
OUT_DIR_NAME: output/sld_lr
DATASET_N_PREVS: 9
N_JOBS: -2
COMP_ESTIMATORS:
- bin_sld_lr
- mul_sld_lr
- m3w_sld_lr
- bin_sld_lr_c
- mul_sld_lr_c
- m3w_sld_lr_c
- bin_sld_lr_mc
- mul_sld_lr_mc
- m3w_sld_lr_mc
- bin_sld_lr_ne
- mul_sld_lr_ne
- m3w_sld_lr_ne
- bin_sld_lr_is
- mul_sld_lr_is
- m3w_sld_lr_is
- bin_sld_lr_a
- mul_sld_lr_a
- m3w_sld_lr_a
- bin_sld_lr_gs
- mul_sld_lr_gs
- m3w_sld_lr_gs
- doc
- atc_mc
confs: *main_confs
confs_next:
- DATASET_NAME: imdb
- DATASET_NAME: twitter_gasp
- DATASET_NAME: rcv1
DATASET_TARGET: CCAT
- DATASET_NAME: rcv1
DATASET_TARGET: GCAT
- DATASET_NAME: rcv1
DATASET_TARGET: MCAT
- DATASET_NAME: cifar10
DATASET_TARGET: dog
d_sld_lr_conf: &d_sld_lr_conf
global:
METRICS:
- acc
- f1
OUT_DIR_NAME: output/d_sld_lr
DATASET_N_PREVS: 9
N_JOBS: -2
COMP_ESTIMATORS:
- d_bin_sld_lr
- d_mul_sld_lr
- d_m3w_sld_lr
- d_bin_sld_lr_c
- d_mul_sld_lr_c
- d_m3w_sld_lr_c
- d_bin_sld_lr_mc
- d_mul_sld_lr_mc
- d_m3w_sld_lr_mc
- d_bin_sld_lr_ne
- d_mul_sld_lr_ne
- d_m3w_sld_lr_ne
- d_bin_sld_lr_is
- d_mul_sld_lr_is
- d_m3w_sld_lr_is
- d_bin_sld_lr_a
- d_mul_sld_lr_a
- d_m3w_sld_lr_a
- d_bin_sld_lr_gs
- d_mul_sld_lr_gs
- d_m3w_sld_lr_gs
- doc
- atc_mc
confs: *main_confs
confs_next:
- DATASET_NAME: rcv1
DATASET_TARGET: CCAT
- DATASET_NAME: imdb
- DATASET_NAME: twitter_gasp
- DATASET_NAME: rcv1
DATASET_TARGET: GCAT
- DATASET_NAME: rcv1
DATASET_TARGET: MCAT
- DATASET_NAME: cifar10
DATASET_TARGET: dog
d_sld_rbf_conf: &d_sld_rbf_conf
global:
METRICS:
- acc
- f1
OUT_DIR_NAME: output/d_sld_rbf
DATASET_N_PREVS: 9
N_JOBS: -2
COMP_ESTIMATORS:
- d_bin_sld_rbf
- d_mul_sld_rbf
- d_m3w_sld_rbf
- d_bin_sld_rbf_c
- d_mul_sld_rbf_c
- d_m3w_sld_rbf_c
- d_bin_sld_rbf_mc
- d_mul_sld_rbf_mc
- d_m3w_sld_rbf_mc
- d_bin_sld_rbf_ne
- d_mul_sld_rbf_ne
- d_m3w_sld_rbf_ne
- d_bin_sld_rbf_is
- d_mul_sld_rbf_is
- d_m3w_sld_rbf_is
- d_bin_sld_rbf_a
- d_mul_sld_rbf_a
- d_m3w_sld_rbf_a
- d_bin_sld_rbf_gs
- d_mul_sld_rbf_gs
- d_m3w_sld_rbf_gs
- doc
- atc_mc
confs: *main_confs
confs_next:
- DATASET_NAME: imdb
- DATASET_NAME: twitter_gasp
- DATASET_NAME: rcv1
DATASET_TARGET: CCAT
- DATASET_NAME: rcv1
DATASET_TARGET: GCAT
- DATASET_NAME: rcv1
DATASET_TARGET: MCAT
- DATASET_NAME: cifar10
DATASET_TARGET: dog
kde_lr_conf: &kde_lr_conf
global:
METRICS:
- acc
- f1
OUT_DIR_NAME: output/kde_lr
DATASET_N_PREVS: 9
COMP_ESTIMATORS:
- bin_kde_lr
- mul_kde_lr
- m3w_kde_lr
- bin_kde_lr_c
- mul_kde_lr_c
- m3w_kde_lr_c
- bin_kde_lr_mc
- mul_kde_lr_mc
- m3w_kde_lr_mc
- bin_kde_lr_ne
- mul_kde_lr_ne
- m3w_kde_lr_ne
- bin_kde_lr_is
- mul_kde_lr_is
- m3w_kde_lr_is
- bin_kde_lr_a
- mul_kde_lr_a
- m3w_kde_lr_a
- bin_kde_lr_gs
- mul_kde_lr_gs
- m3w_kde_lr_gs
- doc
- atc_mc
N_JOBS: -2
confs: *main_confs
other_confs:
- DATASET_NAME: imdb
- DATASET_NAME: rcv1
DATASET_TARGET: CCAT
d_kde_lr_conf: &d_kde_lr_conf
global:
METRICS:
- acc
- f1
OUT_DIR_NAME: output/d_kde_lr
DATASET_N_PREVS: 9
COMP_ESTIMATORS:
- d_bin_kde_lr
- d_mul_kde_lr
- d_m3w_kde_lr
- d_bin_kde_lr_c
- d_mul_kde_lr_c
- d_m3w_kde_lr_c
- d_bin_kde_lr_mc
- d_mul_kde_lr_mc
- d_m3w_kde_lr_mc
- d_bin_kde_lr_ne
- d_mul_kde_lr_ne
- d_m3w_kde_lr_ne
- d_bin_kde_lr_is
- d_mul_kde_lr_is
- d_m3w_kde_lr_is
- d_bin_kde_lr_a
- d_mul_kde_lr_a
- d_m3w_kde_lr_a
- d_bin_kde_lr_gs
- d_mul_kde_lr_gs
- d_m3w_kde_lr_gs
- doc
- atc_mc
N_JOBS: -2
confs: *main_confs
other_confs:
- DATASET_NAME: imdb
- DATASET_NAME: rcv1
DATASET_TARGET: CCAT
d_kde_rbf_conf: &d_kde_rbf_conf
global:
METRICS:
- acc
- f1
OUT_DIR_NAME: output/d_kde_rbf
DATASET_N_PREVS: 9
COMP_ESTIMATORS:
- d_bin_kde_rbf
- d_mul_kde_rbf
- d_m3w_kde_rbf
- d_bin_kde_rbf_c
- d_mul_kde_rbf_c
- d_m3w_kde_rbf_c
- d_bin_kde_rbf_mc
- d_mul_kde_rbf_mc
- d_m3w_kde_rbf_mc
- d_bin_kde_rbf_ne
- d_mul_kde_rbf_ne
- d_m3w_kde_rbf_ne
- d_bin_kde_rbf_is
- d_mul_kde_rbf_is
- d_m3w_kde_rbf_is
- d_bin_kde_rbf_a
- d_mul_kde_rbf_a
- d_m3w_kde_rbf_a
- d_bin_kde_rbf_gs
- d_mul_kde_rbf_gs
- d_m3w_kde_rbf_gs
- doc
- atc_mc
N_JOBS: -2
confs: *main_confs
other_confs:
- DATASET_NAME: imdb
- DATASET_NAME: rcv1
DATASET_TARGET: CCAT
cc_lr_conf: &cc_lr_conf
global:
METRICS:
- acc
- f1
OUT_DIR_NAME: output/cc_lr
DATASET_N_PREVS: 9
COMP_ESTIMATORS:
# - bin_cc_lr
# - mul_cc_lr
# - m3w_cc_lr
# - bin_cc_lr_c
# - mul_cc_lr_c
# - m3w_cc_lr_c
# - bin_cc_lr_mc
# - mul_cc_lr_mc
# - m3w_cc_lr_mc
# - bin_cc_lr_ne
# - mul_cc_lr_ne
# - m3w_cc_lr_ne
# - bin_cc_lr_is
# - mul_cc_lr_is
# - m3w_cc_lr_is
# - bin_cc_lr_a
# - mul_cc_lr_a
# - m3w_cc_lr_a
- bin_cc_lr_gs
- mul_cc_lr_gs
- m3w_cc_lr_gs
N_JOBS: -2
confs: *main_confs
other_confs:
- DATASET_NAME: imdb
- DATASET_NAME: rcv1
DATASET_TARGET: CCAT
baselines_conf: &baselines_conf
global:
METRICS:
- acc
- f1
OUT_DIR_NAME: output/baselines
DATASET_N_PREVS: 9
COMP_ESTIMATORS:
- doc
- atc_mc
- naive
# - mandoline
# - rca
# - rca_star
N_JOBS: -2
confs: *main_confs
other_confs:
- DATASET_NAME: imdb
- DATASET_NAME: rcv1
DATASET_TARGET: CCAT
kde_lr_gs_conf: &kde_lr_gs_conf
global:
METRICS:
- acc
- f1
OUT_DIR_NAME: output/kde_lr_gs
DATASET_N_PREVS: 9
COMP_ESTIMATORS:
- bin_kde_lr_gs
- mul_kde_lr_gs
- m3w_kde_lr_gs
N_JOBS: -2
confs:
- DATASET_NAME: twitter_gasp
multiclass_conf: &multiclass_conf
global:
METRICS:
- acc
- f1
OUT_DIR_NAME: output/multiclass
DATASET_N_PREVS: 5
COMP_ESTIMATORS:
- bin_sld_lr_a
- mul_sld_lr_a
- bin_sld_lr_gs
- mul_sld_lr_gs
- bin_kde_lr_gs
- mul_kde_lr_gs
- atc_mc
- doc
N_JOBS: -2
confs: *main_confs
timing_conf: &timing_conf
global:
METRICS:
- acc
- f1
OUT_DIR_NAME: output/timing
DATASET_N_PREVS: 1
COMP_ESTIMATORS:
- bin_sld_lr_a
- mul_sld_lr_a
- m3w_sld_lr_a
- bin_kde_lr_a
- mul_kde_lr_a
- m3w_kde_lr_a
- doc
- atc_mc
- rca
- rca_star
- mandoline
- naive
N_JOBS: 1
PROTOCOL_REPEATS: 1
# PROTOCOL_N_PREVS: 1
confs: *main_confs
timing_gs_conf: &timing_gs_conf
global:
METRICS:
- acc
- f1
OUT_DIR_NAME: output/timing_gs
DATASET_N_PREVS: 1
COMP_ESTIMATORS:
- bin_sld_lr_gs
- mul_sld_lr_gs
- m3w_sld_lr_gs
- bin_kde_lr_gs
- mul_kde_lr_gs
- m3w_kde_lr_gs
N_JOBS: -1
PROTOCOL_REPEATS: 1
# PROTOCOL_N_PREVS: 1
confs: *main_confs
exec: *timing_conf
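# the `exec` key above selects which anchored configuration block is actually run
# (here timing_conf); point it at a different anchor, e.g. *debug_conf, to switch experiments.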

View File

@ -1,23 +0,0 @@
#!/bin/bash
DIRS=()
# DIRS+=("kde_lr_gs")
# DIRS+=("cc_lr")
# DIRS+=("baselines")
# DIRS+=("d_sld_rbf")
# DIRS+=("d_sld_lr")
# DIRS+=("debug")
DIRS+=("multiclass")
for dir in "${DIRS[@]}"; do
scp -r andreaesuli@edge-nd1.isti.cnr.it:/home/andreaesuli/raid/lorenzo/output/${dir} ./output/
scp -r ./output/${dir} volpi@ilona.isti.cnr.it:/home/volpi/tesi/output/
done
# scp -r andreaesuli@edge-nd1.isti.cnr.it:/home/andreaesuli/raid/lorenzo/output/kde_lr_gs ./output/
# scp -r andreaesuli@edge-nd1.isti.cnr.it:/home/andreaesuli/raid/lorenzo/output/cc_lr ./output/
# scp -r andreaesuli@edge-nd1.isti.cnr.it:/home/andreaesuli/raid/lorenzo/output/baselines ./output/
# scp -r ./output/kde_lr_gs volpi@ilona.isti.cnr.it:/home/volpi/tesi/output/
# scp -r ./output/cc_lr volpi@ilona.isti.cnr.it:/home/volpi/tesi/output/
# scp -r ./output/baselines volpi@ilona.isti.cnr.it:/home/volpi/tesi/output/

View File

@ -1,13 +0,0 @@
#!/bin/bash
# CMD="cp"
# DEST="~/tesi_docker/"
CMD="scp"
DEST="andreaesuli@edge-nd1.isti.cnr.it:/home/andreaesuli/raid/lorenzo/"
bash -c "${CMD} -r quacc ${DEST}"
bash -c "${CMD} -r baselines ${DEST}"
bash -c "${CMD} run.py ${DEST}"
bash -c "${CMD} remote.py ${DEST}"
bash -c "${CMD} conf.yaml ${DEST}"
bash -c "${CMD} requirements.txt ${DEST}"

10
log
View File

@ -1,10 +0,0 @@
#!/bin/bash
if [[ "${1}" == "r" ]]; then
scp volpi@ilona.isti.cnr.it:~/tesi/quacc.log ~/tesi/remote.log &>/dev/null
ssh volpi@ilona.isti.cnr.it tail -n 500 -f /home/volpi/tesi/quacc.log | bat -P --language=log
elif [[ "${1}" == "d" ]]; then
ssh andreaesuli@edge-nd1.isti.cnr.it tail -n 500 -f /home/andreaesuli/raid/lorenzo/quacc.log | bat -P --language=log
else
tail -n 500 -f /home/lorev/quacc/quacc.log | bat --paging=never --language log
fi

View File

@ -1,110 +0,0 @@
import argparse
import os
import shutil
from pathlib import Path
import numpy as np
import pandas as pd
from quacc.evaluation.estimators import CE
from quacc.evaluation.report import DatasetReport, DatasetReportInfo
def load_report_info(path: Path) -> DatasetReportInfo:
return DatasetReport.unpickle(path, report_info=True)
def list_reports(base_path: Path | str):
if isinstance(base_path, str):
base_path = Path(base_path)
if base_path.name == "plot":
return []
reports = []
for f in os.listdir(base_path):
fp = base_path / f
if fp.is_dir():
reports.extend(list_reports(fp))
elif fp.is_file():
if fp.suffix == ".pickle" and fp.stem == base_path.name:
reports.append(load_report_info(fp))
return reports
def playground():
data_a = np.array(np.random.random((4, 6)))
data_b = np.array(np.random.random((4, 4)))
_ind1 = pd.MultiIndex.from_product([["0.2", "0.8"], ["0", "1"]])
_col1 = pd.MultiIndex.from_product([["a", "b"], ["1", "2", "5"]])
_col2 = pd.MultiIndex.from_product([["a", "b"], ["1", "2"]])
a = pd.DataFrame(data_a, index=_ind1, columns=_col1)
b = pd.DataFrame(data_b, index=_ind1, columns=_col2)
print(a)
print(b)
print((a.index == b.index).all())
update_col = a.columns.intersection(b.columns)
col_to_join = b.columns.difference(update_col)
_b = b.drop(columns=[(slice(None), "2")])
_join = pd.concat([a, _b.loc[:, col_to_join]], axis=1)
_join.loc[:, update_col.to_list()] = _b.loc[:, update_col.to_list()]
_join.sort_index(axis=1, level=0, sort_remaining=False, inplace=True)
print(_join)
def merge(dri1: DatasetReportInfo, dri2: DatasetReportInfo, path: Path):
drm = dri1.dr.join(dri2.dr, estimators=CE.name.all)
# save merged dr
_path = path / drm.name / f"{drm.name}.pickle"
os.makedirs(_path.parent, exist_ok=True)
drm.pickle(_path)
# rename dri1 pickle
dri1_bp = Path(dri1.name) / f"{dri1.name.split('/')[-1]}.pickle"
os.rename(dri1_bp, dri1_bp.with_suffix(f".pickle.pre_{dri2.name.split('/')[-2]}"))
# copy merged pickle in place of old dri1 one
shutil.copyfile(_path, dri1_bp)
# copy dri2 log file inside dri1 folder
dri2_bp = Path(dri2.name) / f"{dri2.name.split('/')[-1]}.pickle"
shutil.copyfile(
dri2_bp.with_suffix(".log"),
dri1_bp.with_name(f"{dri1_bp.stem}_{dri2.name.split('/')[-2]}.log"),
)
def run():
parser = argparse.ArgumentParser()
parser.add_argument("path1", nargs="?", default=None)
parser.add_argument("path2", nargs="?", default=None)
parser.add_argument("-l", "--list", action="store_true", dest="list")
parser.add_argument("-v", "--verbose", action="store_true", dest="verbose")
parser.add_argument(
"-o", "--output", action="store", dest="output", default="output/merge"
)
args = parser.parse_args()
reports = list_reports("output")
reports = {r.name: r for r in reports}
if args.list:
for i, r in enumerate(reports.values()):
if args.verbose:
print(f"{i}: {r}")
else:
print(f"{i}: {r.name}")
else:
dri1, dri2 = reports.get(args.path1, None), reports.get(args.path2, None)
if dri1 is None or dri2 is None:
raise ValueError(
f"({args.path1}, {args.path2}) is not a valid pair of paths"
)
merge(dri1, dri2, path=Path(args.output))
if __name__ == "__main__":
run()

91
out

File diff suppressed because one or more lines are too long

View File

@ -12,10 +12,12 @@ import numpy as np
from dash import Dash, Input, Output, State, callback, ctx, dash_table, dcc, html
from dash.dash_table.Format import Align, Format, Scheme
from quacc import plot
from quacc.evaluation.estimators import CE, _renames
from quacc.evaluation.report import CompReport, DatasetReport
from quacc.evaluation.stats import wilcoxon
from quacc.experiments.report import Report
from quacc.experiments.util import get_acc_name
from quacc.legacy.evaluation.estimators import CE, _renames
from quacc.legacy.evaluation.report import CompReport, DatasetReport
from quacc.legacy.evaluation.stats import wilcoxon
from quacc.plot.plotly import plot_delta, plot_diagonal, plot_shift
valid_plot_modes = defaultdict(lambda: CompReport._default_modes)
valid_plot_modes["avg"] = DatasetReport._default_dr_modes
@ -74,29 +76,71 @@ def get_datasets(root: str | Path) -> List[DatasetReport]:
return {str(drp.parent): load_dataset(drp) for drp in dr_paths}
def get_fig(dr: DatasetReport, metric, estimators, view, mode, backend=None):
_backend = backend or plot.get_backend("plotly")
estimators = CE.name[estimators]
def get_fig(rep: Report, cls_name, acc_name, dataset_name, estimators, view, mode):
match (view, mode):
case ("avg", _):
return dr.get_plots(
mode=mode,
metric=metric,
estimators=estimators,
conf="plotly",
save_fig=False,
backend=_backend,
case ("avg", "diagonal"):
true_accs, estim_accs = rep.diagonal_plot_data(
dataset_name=dataset_name,
method_names=estimators,
acc_name=acc_name,
)
return plot_diagonal(
method_names=estimators,
true_accs=true_accs,
estim_accs=estim_accs,
cls_name=cls_name,
acc_name=acc_name,
dataset_name=dataset_name,
)
case ("avg", "delta_train"):
prevs, acc_errs = rep.delta_train_plot_data(
dataset_name=dataset_name,
method_names=estimators,
acc_name=acc_name,
)
return plot_delta(
method_names=estimators,
prevs=prevs,
acc_errs=acc_errs,
cls_name=cls_name,
acc_name=acc_name,
dataset_name=dataset_name,
prev_name="Test",
)
case ("avg", "stdev_train"):
prevs, acc_errs, stdevs = rep.delta_train_plot_data(
dataset_name=dataset_name,
method_names=estimators,
acc_name=acc_name,
stdev=True,
)
return plot_delta(
method_names=estimators,
prevs=prevs,
acc_errs=acc_errs,
cls_name=cls_name,
acc_name=acc_name,
dataset_name=dataset_name,
prev_name="Test",
stdevs=stdevs,
)
case ("avg", "shift"):
prevs, acc_errs, counts = rep.shift_plot_data(
dataset_name=dataset_name,
method_names=estimators,
acc_name=acc_name,
)
return plot_shift(
method_names=estimators,
prevs=prevs,
acc_errs=acc_errs,
cls_name=cls_name,
acc_name=acc_name,
dataset_name=dataset_name,
counts=counts,
)
case (_, _):
cr = dr.crs[[_get_prev_str(c.train_prev) for c in dr.crs].index(view)]
return cr.get_plots(
mode=mode,
metric=metric,
estimators=estimators,
conf="plotly",
save_fig=False,
backend=_backend,
)
return None
def get_table(dr: DatasetReport, metric, estimators, view, mode):
@ -509,7 +553,7 @@ def update_content(dataset, metric, estimators, view, mode, root):
case _:
fig = get_fig(
dr=dr,
metric=metric,
acc_name=metric,
estimators=estimators,
view=view,
mode=mode,

View File

@ -1,100 +0,0 @@
import argparse
import panel as pn
from panel.theme.fast import FastDarkTheme, FastDefaultTheme
from qcpanel.viewer import QuaccTestViewer
# pn.config.design = Fast
# pn.config.theme = "dark"
pn.config.notifications = True
def app_instance():
param_init = {
k: v
for k, v in pn.state.location.query_params.items()
if k in ["root", "dataset", "metric", "plot_view", "mode", "estimators"]
}
qtv = QuaccTestViewer(param_init=param_init)
pn.state.location.sync(
qtv,
{
"root": "root",
"dataset": "dataset",
"metric": "metric",
"plot_view": "plot_view",
"mode": "mode",
"estimators": "estimators",
},
)
def save_callback(event):
app.open_modal()
def refresh_callback(event):
qtv.update_datasets()
save_button = pn.widgets.Button(
# name="Save",
icon="device-floppy",
icon_size="16px",
# sizing_mode="scale_width",
button_style="solid",
button_type="success",
)
save_button.on_click(save_callback)
refresh_button = pn.widgets.Button(
icon="refresh",
icon_size="16px",
button_style="solid",
)
refresh_button.on_click(refresh_callback)
app = pn.template.FastListTemplate(
title="quacc tests",
sidebar=[
pn.FlexBox(save_button, refresh_button, flex_direction="row-reverse"),
qtv.get_param_pane,
],
main=[pn.Column(qtv.get_plot, sizing_mode="stretch_both")],
modal=[qtv.modal_pane],
# theme=FastDefaultTheme,
theme_toggle=True,
)
app.servable()
return app
def serve(address="localhost"):
__port = 33420
__allowed = [address]
if address == "localhost":
__allowed.append("127.0.0.1")
pn.serve(
app_instance,
autoreload=True,
port=__port,
show=False,
address=address,
websocket_origin=[f"{_a}:{__port}" for _a in __allowed],
)
def run():
parser = argparse.ArgumentParser()
parser.add_argument(
"--address",
action="store",
dest="address",
default="localhost",
)
args = parser.parse_args()
serve(address=args.address)
if __name__ == "__main__":
run()

View File

@ -1,156 +0,0 @@
import os
from collections import defaultdict
from pathlib import Path
import numpy as np
import panel as pn
from quacc.evaluation.estimators import CE
from quacc.evaluation.report import CompReport, DatasetReport
from quacc.evaluation.stats import wilcoxon
_plot_sizing_mode = "stretch_both"
valid_plot_modes = defaultdict(lambda: CompReport._default_modes)
valid_plot_modes["avg"] = DatasetReport._default_dr_modes
def _get_prev_str(prev: np.ndarray):
return str(tuple(np.around(prev, decimals=2)))
def create_plot(
dr: DatasetReport,
mode="delta",
metric="acc",
estimators=None,
plot_view=None,
):
_prevs = [_get_prev_str(cr.train_prev) for cr in dr.crs]
estimators = CE.name[estimators]
if mode is None:
mode = valid_plot_modes[plot_view][0]
match (plot_view, mode):
case ("avg", _ as plot_mode):
_plot = dr.get_plots(
mode=mode,
metric=metric,
estimators=estimators,
conf="panel",
save_fig=False,
)
case (_, _ as plot_mode):
cr = dr.crs[_prevs.index(plot_view)]
_plot = cr.get_plots(
mode=plot_mode,
metric=metric,
estimators=estimators,
conf="panel",
save_fig=False,
)
if _plot is None:
return None
return pn.pane.Matplotlib(
_plot,
tight=True,
format="png",
# sizing_mode="scale_height",
sizing_mode=_plot_sizing_mode,
styles=dict(margin="0"),
# sizing_mode="scale_both",
)
def create_table(
dr: DatasetReport,
mode="delta",
metric="acc",
estimators=None,
plot_view=None,
):
_prevs = [round(cr.train_prev[1] * 100) for cr in dr.crs]
estimators = CE.name[estimators]
if mode is None:
mode = valid_plot_modes[plot_view][0]
match (plot_view, mode):
case ("avg", "train_table"):
_data = (
dr.data(metric=metric, estimators=estimators).groupby(level=1).mean()
)
case ("avg", "test_table"):
_data = (
dr.data(metric=metric, estimators=estimators).groupby(level=0).mean()
)
case ("avg", "shift_table"):
_data = (
dr.shift_data(metric=metric, estimators=estimators)
.groupby(level=0)
.mean()
)
case ("avg", "stats_table"):
_data = wilcoxon(dr, metric=metric, estimators=estimators)
case (_, "train_table"):
cr = dr.crs[_prevs.index(int(plot_view))]
_data = (
cr.data(metric=metric, estimators=estimators).groupby(level=0).mean()
)
case (_, "shift_table"):
cr = dr.crs[_prevs.index(int(plot_view))]
_data = (
cr.shift_data(metric=metric, estimators=estimators)
.groupby(level=0)
.mean()
)
case (_, "stats_table"):
cr = dr.crs[_prevs.index(int(plot_view))]
_data = wilcoxon(cr, metric=metric, estimators=estimators)
return (
pn.Column(
pn.pane.DataFrame(
_data,
align="center",
float_format=lambda v: f"{v:6e}",
styles={"font-size-adjust": "0.62"},
),
sizing_mode="stretch_both",
# scroll=True,
)
if not _data.empty
else None
)
def create_result(
dr: DatasetReport,
mode="delta",
metric="acc",
estimators=None,
plot_view=None,
):
match mode:
case m if m.endswith("table"):
return create_table(dr, mode, metric, estimators, plot_view)
case _:
return create_plot(dr, mode, metric, estimators, plot_view)
def explore_datasets(root: Path | str):
if isinstance(root, str):
root = Path(root)
if root.name == "plot":
return []
if not root.exists():
return []
drs = []
for f in os.listdir(root):
if (root / f).is_dir():
drs += explore_datasets(root / f)
elif f == f"{root.name}.pickle":
drs.append(root / f)
# drs.append((str(root),))
return drs

View File

@ -1,386 +0,0 @@
import os
from pathlib import Path
import numpy as np
import pandas as pd
import panel as pn
import param
from qcpanel.util import (
_get_prev_str,
create_result,
explore_datasets,
valid_plot_modes,
)
from quacc.evaluation.estimators import CE
from quacc.evaluation.report import DatasetReport
class QuaccTestViewer(param.Parameterized):
__base_path = "output"
dataset = param.Selector()
metric = param.Selector()
estimators = param.ListSelector()
plot_view = param.Selector()
mode = param.Selector()
modal_estimators = param.ListSelector()
modal_plot_view = param.ListSelector()
modal_mode_prev = param.ListSelector(
objects=valid_plot_modes[0], default=valid_plot_modes[0]
)
modal_mode_avg = param.ListSelector(
objects=valid_plot_modes["avg"], default=valid_plot_modes["avg"]
)
param_pane = param.Parameter()
plot_pane = param.Parameter()
modal_pane = param.Parameter()
root = param.String()
def __init__(self, param_init=None, **params):
super().__init__(**params)
self.param_init = param_init
self.__setup_watchers()
self.update_datasets()
# self._update_on_dataset()
self.__create_param_pane()
self.__create_modal_pane()
def __get_param_init(self, val):
__b = val in self.param_init
if __b:
setattr(self, val, self.param_init[val])
del self.param_init[val]
return __b
def __save_callback(self, event):
_home = Path("output")
_save_input_val = self.save_input.value_input
_config = "default" if len(_save_input_val) == 0 else _save_input_val
base_path = _home / self.dataset / _config
os.makedirs(base_path, exist_ok=True)
base_plot = base_path / "plot"
os.makedirs(base_plot, exist_ok=True)
l_dr = self.datasets_[self.dataset]
res = l_dr.to_md(
conf=_config,
metric=self.metric,
estimators=CE.name[self.modal_estimators],
dr_modes=self.modal_mode_avg,
cr_modes=self.modal_mode_prev,
cr_prevs=self.modal_plot_view,
plot_path=base_plot,
)
with open(base_path / f"{self.metric}.md", "w") as f:
f.write(res)
pn.state.notifications.success(f'"{_config}" successfully saved')
def __create_param_pane(self):
self.dataset_widget = pn.Param(
self,
show_name=False,
parameters=["dataset"],
widgets={"dataset": {"widget_type": pn.widgets.Select}},
)
self.metric_widget = pn.Param(
self,
show_name=False,
parameters=["metric"],
widgets={"metric": {"widget_type": pn.widgets.Select}},
)
self.estimators_widgets = pn.Param(
self,
show_name=False,
parameters=["estimators"],
widgets={
"estimators": {
"widget_type": pn.widgets.MultiChoice,
# "orientation": "vertical",
"sizing_mode": "scale_width",
# "button_type": "primary",
# "button_style": "outline",
"solid": True,
"search_option_limit": 1000,
"option_limit": 1000,
"max_items": 1000,
}
},
)
self.plot_view_widget = pn.Param(
self,
show_name=False,
parameters=["plot_view"],
widgets={
"plot_view": {
"widget_type": pn.widgets.RadioButtonGroup,
"orientation": "vertical",
"button_type": "primary",
"button_style": "outline",
}
},
)
self.mode_widget = pn.Param(
self,
show_name=False,
parameters=["mode"],
widgets={
"mode": {
"widget_type": pn.widgets.RadioButtonGroup,
"orientation": "vertical",
"sizing_mode": "scale_width",
"button_type": "primary",
"button_style": "outline",
}
},
align="center",
)
self.param_pane = pn.Column(
self.dataset_widget,
self.metric_widget,
pn.Row(
self.plot_view_widget,
self.mode_widget,
),
self.estimators_widgets,
)
def __create_modal_pane(self):
self.modal_estimators_widgets = pn.Param(
self,
show_name=False,
parameters=["modal_estimators"],
widgets={
"modal_estimators": {
"widget_type": pn.widgets.CheckButtonGroup,
"orientation": "vertical",
"sizing_mode": "scale_width",
"button_type": "primary",
"button_style": "outline",
}
},
)
self.modal_plot_view_widget = pn.Param(
self,
show_name=False,
parameters=["modal_plot_view"],
widgets={
"modal_plot_view": {
"widget_type": pn.widgets.CheckButtonGroup,
"orientation": "vertical",
"button_type": "primary",
"button_style": "outline",
}
},
)
self.modal_mode_prev_widget = pn.Param(
self,
show_name=False,
parameters=["modal_mode_prev"],
widgets={
"modal_mode_prev": {
"widget_type": pn.widgets.CheckButtonGroup,
"orientation": "vertical",
"sizing_mode": "scale_width",
"button_type": "primary",
"button_style": "outline",
}
},
align="center",
)
self.modal_mode_avg_widget = pn.Param(
self,
show_name=False,
parameters=["modal_mode_avg"],
widgets={
"modal_mode_avg": {
"widget_type": pn.widgets.CheckButtonGroup,
"orientation": "vertical",
"sizing_mode": "scale_width",
"button_type": "primary",
"button_style": "outline",
}
},
align="center",
)
self.save_input = pn.widgets.TextInput(
name="Configuration Name", placeholder="default", sizing_mode="scale_width"
)
self.save_button = pn.widgets.Button(
name="Save",
sizing_mode="scale_width",
button_style="solid",
button_type="success",
)
self.save_button.on_click(self.__save_callback)
_title_styles = {
"font-size": "14pt",
"font-weight": "bold",
}
self.modal_pane = pn.Column(
pn.Column(
pn.pane.Str("Avg. configuration", styles=_title_styles),
self.modal_mode_avg_widget,
pn.pane.Str("Train prevs. configuration", styles=_title_styles),
pn.Row(
self.modal_plot_view_widget,
self.modal_mode_prev_widget,
),
pn.pane.Str("Estimators configuration", styles=_title_styles),
self.modal_estimators_widgets,
self.save_input,
self.save_button,
pn.Spacer(height=20),
width=450,
align="center",
scroll=True,
),
sizing_mode="stretch_both",
)
def update_datasets(self):
if not self.__get_param_init("root"):
self.root = self.__base_path
dataset_paths = sorted(
explore_datasets(self.root), key=lambda t: (-len(t.parts), t)
)
self.datasets_ = {
str(dp.parent.relative_to(Path(self.root))): DatasetReport.unpickle(dp)
for dp in dataset_paths
}
self.available_datasets = list(self.datasets_.keys())
_old_dataset = self.dataset
self.param["dataset"].objects = self.available_datasets
if not self.__get_param_init("dataset"):
self.dataset = (
_old_dataset
if _old_dataset in self.available_datasets
else self.available_datasets[0]
)
def __setup_watchers(self):
self.param.watch(
self._update_on_dataset,
["dataset"],
queued=True,
precedence=0,
)
self.param.watch(self._update_on_view, ["plot_view"], queued=True, precedence=1)
self.param.watch(self._update_on_metric, ["metric"], queued=True, precedence=2)
self.param.watch(
self._update_plot,
["dataset", "metric", "estimators", "plot_view", "mode"],
# ["metric", "estimators", "mode"],
onlychanged=False,
precedence=3,
)
self.param.watch(
self._update_on_estimators,
["estimators"],
queued=True,
precedence=4,
)
def _update_on_dataset(self, *events):
l_dr = self.datasets_[self.dataset]
l_data = l_dr.data()
l_metrics = l_data.columns.unique(0)
l_valid_metrics = [m for m in l_metrics if not m.endswith("_score")]
_old_metric = self.metric
self.param["metric"].objects = l_valid_metrics
if not self.__get_param_init("metric"):
self.metric = (
_old_metric if _old_metric in l_valid_metrics else l_valid_metrics[0]
)
_old_estimators = self.estimators
l_valid_estimators = l_dr.data(metric=self.metric).columns.unique(0).to_numpy()
_new_estimators = l_valid_estimators[
np.isin(l_valid_estimators, _old_estimators)
].tolist()
self.param["estimators"].objects = l_valid_estimators
if not self.__get_param_init("estimators"):
self.estimators = _new_estimators
l_valid_views = [_get_prev_str(cr.train_prev) for cr in l_dr.crs]
l_valid_views = ["avg"] + l_valid_views
_old_view = self.plot_view
self.param["plot_view"].objects = l_valid_views
if not self.__get_param_init("plot_view"):
self.plot_view = _old_view if _old_view in l_valid_views else "avg"
self.param["mode"].objects = valid_plot_modes[self.plot_view]
if not self.__get_param_init("mode"):
_old_mode = self.mode
if _old_mode in valid_plot_modes[self.plot_view]:
self.mode = _old_mode
else:
self.mode = valid_plot_modes[self.plot_view][0]
self.param["modal_estimators"].objects = l_valid_estimators
self.modal_estimators = []
self.param["modal_plot_view"].objects = l_valid_views
self.modal_plot_view = l_valid_views.copy()
def _update_on_view(self, *events):
_old_mode = self.mode
self.param["mode"].objects = valid_plot_modes[self.plot_view]
if _old_mode in valid_plot_modes[self.plot_view]:
self.mode = _old_mode
else:
self.mode = valid_plot_modes[self.plot_view][0]
def _update_on_metric(self, *events):
_old_estimators = self.estimators
l_dr = self.datasets_[self.dataset]
l_data: pd.DataFrame = l_dr.data(metric=self.metric)
l_valid_estimators: np.ndarray = l_data.columns.unique(0).to_numpy()
_new_estimators = l_valid_estimators[
np.isin(l_valid_estimators, _old_estimators)
].tolist()
self.param["estimators"].objects = l_valid_estimators
self.estimators = _new_estimators
def _update_on_estimators(self, *events):
self.modal_estimators = self.estimators.copy()
def _update_plot(self, *events):
__svg = pn.pane.SVG(
"""<svg xmlns="http://www.w3.org/2000/svg" class="icon icon-tabler icon-tabler-chart-area-filled" width="24" height="24" viewBox="0 0 24 24" stroke-width="2" stroke="currentColor" fill="none" stroke-linecap="round" stroke-linejoin="round">
<path stroke="none" d="M0 0h24v24H0z" fill="none" />
<path d="M20 18a1 1 0 0 1 .117 1.993l-.117 .007h-16a1 1 0 0 1 -.117 -1.993l.117 -.007h16z" stroke-width="0" fill="currentColor" />
<path d="M15.22 5.375a1 1 0 0 1 1.393 -.165l.094 .083l4 4a1 1 0 0 1 .284 .576l.009 .131v5a1 1 0 0 1 -.883 .993l-.117 .007h-16.022l-.11 -.009l-.11 -.02l-.107 -.034l-.105 -.046l-.1 -.059l-.094 -.07l-.06 -.055l-.072 -.082l-.064 -.089l-.054 -.096l-.016 -.035l-.04 -.103l-.027 -.106l-.015 -.108l-.004 -.11l.009 -.11l.019 -.105c.01 -.04 .022 -.077 .035 -.112l.046 -.105l.059 -.1l4 -6a1 1 0 0 1 1.165 -.39l.114 .05l3.277 1.638l3.495 -4.369z" stroke-width="0" fill="currentColor" />
</svg>""",
sizing_mode="stretch_both",
)
if len(self.estimators) == 0:
self.plot_pane = __svg
else:
_dr = self.datasets_[self.dataset]
__plot = create_result(
_dr,
mode=self.mode,
metric=self.metric,
estimators=self.estimators,
plot_view=self.plot_view,
)
self.plot_pane = __svg if __plot is None else __plot
def get_plot(self):
return self.plot_pane
def get_param_pane(self):
return self.param_pane

17902
quacc.log

File diff suppressed because it is too large

View File

@ -1,9 +1,9 @@
import quacc.dataset as dataset
import quacc.error as error
import quacc.logger as logger
import quacc.plot as plot
import quacc.utils as utils
from quacc.environment import env
import quacc.dataset as dataset # noqa: F401
import quacc.error as error # noqa: F401
import quacc.logger as logger # noqa: F401
import quacc.plot as plot # noqa: F401
import quacc.utils.commons as commons # noqa: F401
from quacc.legacy.environment import env
def _get_njobs(n_jobs):

View File

@ -1,376 +0,0 @@
from typing import List, Tuple
import numpy as np
import scipy.sparse as sp
from quapy.data import LabelledCollection
# Extended classes
#
# 0 ~ True 0
# 1 ~ False 1
# 2 ~ False 0
# 3 ~ True 1
# _____________________
# | | |
# | True 0 | False 1 |
# |__________|__________|
# | | |
# | False 0 | True 1 |
# |__________|__________|
#
def _split_index_by_pred(pred_proba: np.ndarray) -> List[np.ndarray]:
_pred_label = np.argmax(pred_proba, axis=1)
return [(_pred_label == cl).nonzero()[0] for cl in np.arange(pred_proba.shape[1])]
class ExtensionPolicy:
def __init__(self, collapse_false=False, group_false=False, dense=False):
self.collapse_false = collapse_false
self.group_false = group_false
self.dense = dense
def qclasses(self, nbcl):
if self.collapse_false:
return np.arange(nbcl + 1)
elif self.group_false:
return np.arange(nbcl * 2)
return np.arange(nbcl**2)
def eclasses(self, nbcl):
return np.arange(nbcl**2)
def tfp_classes(self, nbcl):
if self.group_false:
return np.arange(2)
else:
return np.arange(nbcl)
def matrix_idx(self, nbcl):
if self.collapse_false:
_idxs = np.array([[i, i] for i in range(nbcl)] + [[0, 1]]).T
return tuple(_idxs)
elif self.group_false:
diag_idxs = np.diag_indices(nbcl)
sub_diag_idxs = tuple(
np.array([((i + 1) % nbcl, i) for i in range(nbcl)]).T
)
return tuple(np.concatenate(axis) for axis in zip(diag_idxs, sub_diag_idxs))
# def mask_fn(m, k):
# n = m.shape[0]
# d = np.diag(np.tile(1, n))
# d[tuple(zip(*[(i, (i + 1) % n) for i in range(n)]))] = 1
# return d
# _mi = np.mask_indices(nbcl, mask_func=mask_fn)
# print(_mi)
# return _mi
else:
_idxs = np.indices((nbcl, nbcl))
return _idxs[0].flatten(), _idxs[1].flatten()
def ext_lbl(self, nbcl):
if self.collapse_false:
def cf_fun(t, p):
return t if t == p else nbcl
return np.vectorize(cf_fun, signature="(),()->()")
elif self.group_false:
def gf_fun(t, p):
# if t < nbcl - 1:
# return t * 2 if t == p else (t * 2) + 1
# else:
# return t * 2 if t != p else (t * 2) + 1
return p if t == p else nbcl + p
return np.vectorize(gf_fun, signature="(),()->()")
else:
def default_fn(t, p):
return t * nbcl + p
return np.vectorize(default_fn, signature="(),()->()")
def true_lbl_from_pred(self, nbcl):
if self.group_false:
return np.vectorize(lambda t, p: 0 if t == p else 1, signature="(),()->()")
else:
return np.vectorize(lambda t, p: t, signature="(),()->()")
def can_f1(self, nbcl):
return nbcl == 2 or (not self.collapse_false and not self.group_false)
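# Worked example (illustrative, not part of the original module): with the default policy the
# extended label is true * nbcl + pred, which for a binary task reproduces the 2x2 scheme
# sketched in the comment at the top of this file.
_example_pol = ExtensionPolicy()
_example_ext = _example_pol.ext_lbl(2)(np.array([0, 0, 1, 1]), np.array([0, 1, 0, 1]))
# _example_ext -> array([0, 1, 2, 3]): True 0, False 1, False 0, True 1
# _example_pol.matrix_idx(2) -> (array([0, 0, 1, 1]), array([0, 1, 0, 1])), the indices used
# by ExtendedPrev to lay the flat prevalence back out as a confusion-matrix-shaped array.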
class ExtendedData:
def __init__(
self,
instances: np.ndarray | sp.csr_matrix,
pred_proba: np.ndarray,
ext: np.ndarray = None,
extpol=None,
):
self.extpol = ExtensionPolicy() if extpol is None else extpol
self.b_instances_ = instances
self.pred_proba_ = pred_proba
self.ext_ = ext
self.instances = self.__extend_instances(instances, pred_proba, ext=ext)
def __extend_instances(
self,
instances: np.ndarray | sp.csr_matrix,
pred_proba: np.ndarray,
ext: np.ndarray = None,
) -> np.ndarray | sp.csr_matrix:
to_append = ext
if ext is None:
to_append = pred_proba
if isinstance(instances, sp.csr_matrix):
if self.extpol.dense:
n_x = to_append
else:
n_x = sp.hstack([instances, sp.csr_matrix(to_append)], format="csr")
elif isinstance(instances, np.ndarray):
_concat = [instances, to_append] if not self.extpol.dense else [to_append]
n_x = np.concatenate(_concat, axis=1)
else:
raise ValueError("Unsupported matrix format")
return n_x
@property
def X(self):
return self.instances
@property
def nbcl(self):
return self.pred_proba_.shape[1]
def split_by_pred(self, _indexes: List[np.ndarray] | None = None):
def _empty_matrix():
if isinstance(self.instances, np.ndarray):
return np.asarray([], dtype=int)
elif isinstance(self.instances, sp.csr_matrix):
return sp.csr_matrix(np.empty((0, 0), dtype=int))
if _indexes is None:
_indexes = _split_index_by_pred(self.pred_proba_)
_instances = [
self.instances[ind] if ind.shape[0] > 0 else _empty_matrix()
for ind in _indexes
]
return _instances
def __len__(self):
return self.instances.shape[0]
class ExtendedLabels:
def __init__(
self,
true: np.ndarray,
pred: np.ndarray,
nbcl: np.ndarray,
extpol: ExtensionPolicy = None,
):
self.extpol = ExtensionPolicy() if extpol is None else extpol
self.true = true
self.pred = pred
self.nbcl = nbcl
@property
def y(self):
return self.extpol.ext_lbl(self.nbcl)(self.true, self.pred)
@property
def classes(self):
return self.extpol.qclasses(self.nbcl)
def __getitem__(self, idx):
return ExtendedLabels(self.true[idx], self.pred[idx], self.nbcl)
def split_by_pred(self, _indexes: List[np.ndarray]):
_labels = []
for cl, ind in enumerate(_indexes):
_true, _pred = self.true[ind], self.pred[ind]
assert (
_pred.shape[0] == 0 or (_pred == _pred[0]).all()
), "index is selecting non uniform class"
_tfp = self.extpol.true_lbl_from_pred(self.nbcl)(_true, _pred)
_labels.append(_tfp)
return _labels, self.extpol.tfp_classes(self.nbcl)
class ExtendedPrev:
def __init__(
self,
flat: np.ndarray,
nbcl: int,
extpol: ExtensionPolicy = None,
):
self.flat = flat
self.nbcl = nbcl
self.extpol = ExtensionPolicy() if extpol is None else extpol
# self._matrix = self.__build_matrix()
def __build_matrix(self):
_matrix = np.zeros((self.nbcl, self.nbcl))
_matrix[self.extpol.matrix_idx(self.nbcl)] = self.flat
return _matrix
def can_f1(self):
return self.extpol.can_f1(self.nbcl)
@property
def A(self):
# return self._matrix
return self.__build_matrix()
@property
def classes(self):
return self.extpol.qclasses(self.nbcl)
class ExtMulPrev(ExtendedPrev):
def __init__(
self,
flat: np.ndarray,
nbcl: int,
q_classes: list = None,
extpol: ExtensionPolicy = None,
):
super().__init__(flat, nbcl, extpol=extpol)
self.flat = self.__check_q_classes(q_classes, flat)
def __check_q_classes(self, q_classes, flat):
if q_classes is None:
return flat
q_classes = np.array(q_classes)
_flat = np.zeros(self.extpol.qclasses(self.nbcl).shape)
_flat[q_classes] = flat
return _flat
class ExtBinPrev(ExtendedPrev):
def __init__(
self,
flat: List[np.ndarray],
nbcl: int,
q_classes: List[List[int]] = None,
extpol: ExtensionPolicy = None,
):
super().__init__(flat, nbcl, extpol=extpol)
flat = self.__check_q_classes(q_classes, flat)
self.flat = self.__build_flat(flat)
def __check_q_classes(self, q_classes, flat):
if q_classes is None:
return flat
_flat = []
for fl, qc in zip(flat, q_classes):
qc = np.array(qc)
_fl = np.zeros(self.extpol.tfp_classes(self.nbcl).shape)
_fl[qc] = fl
_flat.append(_fl)
return np.array(_flat)
def __build_flat(self, flat):
return np.concatenate(flat.T)
class ExtendedCollection(LabelledCollection):
def __init__(
self,
instances: np.ndarray | sp.csr_matrix,
labels: np.ndarray,
pred_proba: np.ndarray = None,
ext: np.ndarray = None,
extpol=None,
):
self.extpol = ExtensionPolicy() if extpol is None else extpol
e_data, e_labels = self.__extend_collection(
instances=instances,
labels=labels,
pred_proba=pred_proba,
ext=ext,
)
self.e_data_ = e_data
self.e_labels_ = e_labels
super().__init__(e_data.X, e_labels.y, classes=e_labels.classes)
@classmethod
def from_lc(
cls,
lc: LabelledCollection,
pred_proba: np.ndarray,
ext: np.ndarray = None,
extpol=None,
):
return ExtendedCollection(
lc.X, lc.y, pred_proba=pred_proba, ext=ext, extpol=extpol
)
@property
def pred_proba(self):
return self.e_data_.pred_proba_
@property
def ext(self):
return self.e_data_.ext_
@property
def eX(self):
return self.e_data_
@property
def ey(self):
return self.e_labels_
@property
def n_base_classes(self):
return self.e_labels_.nbcl
@property
def n_classes(self):
return len(self.e_labels_.classes)
def e_prevalence(self) -> ExtendedPrev:
_prev = self.prevalence()
return ExtendedPrev(_prev, self.n_base_classes, extpol=self.extpol)
def split_by_pred(self):
_indexes = _split_index_by_pred(self.pred_proba)
_instances = self.e_data_.split_by_pred(_indexes)
# _labels = [self.ey[ind] for ind in _indexes]
_labels, _cls = self.e_labels_.split_by_pred(_indexes)
return [
LabelledCollection(inst, lbl, classes=_cls)
for inst, lbl in zip(_instances, _labels)
]
def __extend_collection(
self,
instances: sp.csr_matrix | np.ndarray,
labels: np.ndarray,
pred_proba: np.ndarray,
ext: np.ndarray = None,
extpol=None,
) -> Tuple[ExtendedData, ExtendedLabels]:
n_classes = pred_proba.shape[1]
# n_X = [ X | predicted probs. ]
e_instances = ExtendedData(instances, pred_proba, ext=ext, extpol=self.extpol)
# n_y = (expected y, predicted y)
preds = np.argmax(pred_proba, axis=-1)
e_labels = ExtendedLabels(labels, preds, n_classes, extpol=self.extpol)
return e_instances, e_labels

View File

@ -1,29 +1,31 @@
import itertools
import math
import os
import pickle
import tarfile
from typing import List, Tuple
from typing import List
import numpy as np
import quapy as qp
from quapy.data.base import LabelledCollection
from sklearn.conftest import fetch_rcv1
from quapy.data.datasets import fetch_lequa2022, fetch_UCIMulticlassLabelledCollection
from sklearn.datasets import fetch_20newsgroups, fetch_rcv1
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils import Bunch
from quacc import utils
from quacc.environment import env
from quacc.legacy.environment import env
from quacc.utils import commons
from quacc.utils.commons import save_json_file
TRAIN_VAL_PROP = 0.5
def fetch_cifar10() -> Bunch:
URL = "https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
data_home = utils.get_quacc_home()
data_home = commons.get_quacc_home()
unzipped_path = data_home / "cifar-10-batches-py"
if not unzipped_path.exists():
downloaded_path = data_home / URL.split("/")[-1]
utils.download_file(URL, downloaded_path)
commons.download_file(URL, downloaded_path)
with tarfile.open(downloaded_path) as f:
f.extractall(data_home)
os.remove(downloaded_path)
@ -58,11 +60,11 @@ def fetch_cifar10() -> Bunch:
def fetch_cifar100():
URL = "https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz"
data_home = utils.get_quacc_home()
data_home = commons.get_quacc_home()
unzipped_path = data_home / "cifar-100-python"
if not unzipped_path.exists():
downloaded_path = data_home / URL.split("/")[-1]
utils.download_file(URL, downloaded_path)
commons.download_file(URL, downloaded_path)
with tarfile.open(downloaded_path) as f:
f.extractall(data_home)
os.remove(downloaded_path)
@ -96,6 +98,23 @@ def fetch_cifar100():
)
def save_dataset_stats(path, test_prot, L, V):
test_prevs = [Ui.prevalence() for Ui in test_prot()]
shifts = [qp.error.ae(L.prevalence(), Ui_prev) for Ui_prev in test_prevs]
info = {
"n_classes": L.n_classes,
"n_train": len(L),
"n_val": len(V),
"train_prev": L.prevalence().tolist(),
"val_prev": V.prevalence().tolist(),
"test_prevs": [x.tolist() for x in test_prevs],
"shifts": [x.tolist() for x in shifts],
"sample_size": test_prot.sample_size,
"num_samples": test_prot.total(),
}
save_json_file(path, info)
class DatasetSample:
def __init__(
self,
@ -121,49 +140,69 @@ class DatasetSample:
class DatasetProvider:
def __spambase(self, **kwargs):
return qp.datasets.fetch_UCIDataset("spambase", verbose=False).train_test
@classmethod
def _split_train(cls, train: LabelledCollection):
return train.split_stratified(0.5, random_state=0)
# try min_df=5
def __imdb(self, **kwargs):
return qp.datasets.fetch_reviews("imdb", tfidf=True, min_df=3).train_test
@classmethod
def _split_whole(cls, dataset: LabelledCollection):
train, U = dataset.split_stratified(train_prop=0.66, random_state=0)
T, V = train.split_stratified(train_prop=0.5, random_state=0)
return T, V, U
@classmethod
def spambase(cls):
train, U = qp.datasets.fetch_UCIDataset("spambase", verbose=False).train_test
T, V = cls._split_train(train)
return T, V, U
@classmethod
def imdb(cls):
train, U = qp.datasets.fetch_reviews(
"imdb", tfidf=True, min_df=10, pickle=True
).train_test
T, V = cls._split_train(train)
return T, V, U
@classmethod
def rcv1(cls, target):
training = fetch_rcv1(subset="train")
test = fetch_rcv1(subset="test")
def __rcv1(self, target, **kwargs):
n_train = 23149
available_targets = ["CCAT", "GCAT", "MCAT"]
if target is None or target not in available_targets:
raise ValueError(f"Invalid target {target}")
class_names = training.target_names.tolist()
class_idx = class_names.index(target)
tr_labels = training.target[:, class_idx].toarray().flatten()
te_labels = test.target[:, class_idx].toarray().flatten()
tr = LabelledCollection(training.data, tr_labels)
U = LabelledCollection(test.data, te_labels)
T, V = cls._split_train(tr)
return T, V, U
@classmethod
def cifar10(cls, target):
dataset = fetch_cifar10()
available_targets: list = dataset.label_names
if target is None or target not in available_targets:
raise ValueError(f"Invalid target {target}")
dataset = fetch_rcv1()
target_index = np.where(dataset.target_names == target)[0]
all_train_d = dataset.data[:n_train, :]
test_d = dataset.data[n_train:, :]
labels = dataset.target[:, target_index].toarray().flatten()
all_train_l, test_l = labels[:n_train], labels[n_train:]
all_train = LabelledCollection(all_train_d, all_train_l, classes=[0, 1])
test = LabelledCollection(test_d, test_l, classes=[0, 1])
return all_train, test
def __cifar10(self, target, **kwargs):
dataset = fetch_cifar10()
available_targets: list = dataset.label_names
if target is None or target not in available_targets:
raise ValueError(f"Invalid target {target}")
target_index = available_targets.index(target)
all_train_d = dataset.train.data
all_train_l = (dataset.train.labels == target_index).astype(int)
target_idx = available_targets.index(target)
train_d = dataset.train.data
train_l = (dataset.train.labels == target_idx).astype(int)
test_d = dataset.test.data
test_l = (dataset.test.labels == target_index).astype(int)
all_train = LabelledCollection(all_train_d, all_train_l, classes=[0, 1])
test = LabelledCollection(test_d, test_l, classes=[0, 1])
test_l = (dataset.test.labels == target_idx).astype(int)
train = LabelledCollection(train_d, train_l, classes=[0, 1])
U = LabelledCollection(test_d, test_l, classes=[0, 1])
T, V = cls._split_train(train)
return all_train, test
return T, V, U
def __cifar100(self, target, **kwargs):
@classmethod
def cifar100(cls, target):
dataset = fetch_cifar100()
available_targets: list = dataset.coarse_label_names
@ -171,31 +210,48 @@ class DatasetProvider:
raise ValueError(f"Invalid target {target}")
target_index = available_targets.index(target)
all_train_d = dataset.train.data
all_train_l = (dataset.train.coarse_labels == target_index).astype(int)
train_d = dataset.train.data
train_l = (dataset.train.coarse_labels == target_index).astype(int)
test_d = dataset.test.data
test_l = (dataset.test.coarse_labels == target_index).astype(int)
all_train = LabelledCollection(all_train_d, all_train_l, classes=[0, 1])
test = LabelledCollection(test_d, test_l, classes=[0, 1])
train = LabelledCollection(train_d, train_l, classes=[0, 1])
U = LabelledCollection(test_d, test_l, classes=[0, 1])
T, V = cls._split_train(train)
return all_train, test
return T, V, U
def __twitter_gasp(self, **kwargs):
return qp.datasets.fetch_twitter("gasp", min_df=3).train_test
@classmethod
def twitter(cls, dataset_name):
data = qp.datasets.fetch_twitter(dataset_name, min_df=3, pickle=True)
T, V = cls._split_train(data.training)
U = data.test
return T, V, U
def alltrain_test(
self, name: str, target: str | None
) -> Tuple[LabelledCollection, LabelledCollection]:
all_train, test = {
"spambase": self.__spambase,
"imdb": self.__imdb,
"rcv1": self.__rcv1,
"cifar10": self.__cifar10,
"cifar100": self.__cifar100,
"twitter_gasp": self.__twitter_gasp,
}[name](target=target)
@classmethod
def uci_multiclass(cls, dataset_name):
dataset = fetch_UCIMulticlassLabelledCollection(dataset_name)
return cls._split_whole(dataset)
return all_train, test
@classmethod
def news20(cls):
train = fetch_20newsgroups(
subset="train", remove=("headers", "footers", "quotes")
)
test = fetch_20newsgroups(
subset="test", remove=("headers", "footers", "quotes")
)
tfidf = TfidfVectorizer(min_df=5, sublinear_tf=True)
Xtr = tfidf.fit_transform(train.data)
Xte = tfidf.transform(test.data)
train = LabelledCollection(instances=Xtr, labels=train.target)
U = LabelledCollection(instances=Xte, labels=test.target)
T, V = cls._split_train(train)
return T, V, U
@classmethod
def t1b_lequa2022(cls):
dataset, _, _ = fetch_lequa2022(task="T1B")
return cls._split_whole(dataset)
class Dataset(DatasetProvider):

View File

@ -1,86 +0,0 @@
from contextlib import contextmanager
import numpy as np
import quapy as qp
import yaml
class environ:
_default_env = {
"DATASET_NAME": None,
"DATASET_TARGET": None,
"METRICS": [],
"COMP_ESTIMATORS": [],
"DATASET_N_PREVS": 9,
"DATASET_PREVS": None,
"OUT_DIR_NAME": "output",
"OUT_DIR": None,
"PLOT_DIR_NAME": "plot",
"PLOT_OUT_DIR": None,
"DATASET_DIR_UPDATE": False,
"PROTOCOL_N_PREVS": 21,
"PROTOCOL_REPEATS": 100,
"SAMPLE_SIZE": 1000,
# "PLOT_ESTIMATORS": [],
"PLOT_STDEV": False,
"_R_SEED": 0,
"N_JOBS": 1,
}
_keys = list(_default_env.keys())
def __init__(self):
self.__load_file()
def __load_file(self):
_state = environ._default_env.copy()
with open("conf.yaml", "r") as f:
confs = yaml.safe_load(f)["exec"]
_state = _state | confs["global"]
self.__setdict(_state)
self._confs = confs["confs"]
def __setdict(self, d: dict):
for k, v in d.items():
super().__setattr__(k, v)
match k:
case "SAMPLE_SIZE":
qp.environ["SAMPLE_SIZE"] = v
case "_R_SEED":
qp.environ["_R_SEED"] = v
np.random.seed(v)
def to_dict(self) -> dict:
return {k: self.__getattribute__(k) for k in environ._keys}
@property
def confs(self):
return self._confs.copy()
@contextmanager
def load(self, conf):
__current = self.to_dict()
__np_random_state = np.random.get_state()
if conf is None:
conf = {}
if isinstance(conf, environ):
conf = conf.to_dict()
self.__setdict(conf)
try:
yield
finally:
self.__setdict(__current)
np.random.set_state(__np_random_state)
def load_confs(self):
for c in self.confs:
with self.load(c):
yield c
env = environ()
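# Illustrative usage (not part of the original module): iterate the dataset configurations
# listed under the `exec` anchor of conf.yaml; within each context the globals
# (DATASET_NAME, DATASET_TARGET, METRICS, ...) reflect that configuration and are
# restored on exit.
#
#     for conf in env.load_confs():
#         print(env.DATASET_NAME, env.DATASET_TARGET, env.METRICS)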

View File

@ -3,8 +3,9 @@ from typing import List
import numpy as np
import quapy as qp
from sklearn.metrics import accuracy_score, f1_score
from quacc.data import ExtendedPrev
from quacc.legacy.data import ExtendedPrev
def from_name(err_name):
@ -78,6 +79,85 @@ def maccd(
return accd(true_prevs, estim_prevs).mean()
def from_contingency_table(param1, param2):
if (
param2 is None
and isinstance(param1, np.ndarray)
and param1.ndim == 2
and (param1.shape[0] == param1.shape[1])
):
return True
elif (
isinstance(param1, np.ndarray)
and isinstance(param2, np.ndarray)
and param1.shape == param2.shape
):
return False
else:
raise ValueError("parameters for evaluation function not understood")
def vanilla_acc_fn(param1, param2=None):
if from_contingency_table(param1, param2):
return _vanilla_acc_from_ct(param1)
else:
return accuracy_score(param1, param2)
def macrof1_fn(param1, param2=None):
if from_contingency_table(param1, param2):
return macro_f1_from_ct(param1)
else:
return f1_score(param1, param2, average="macro")
def _vanilla_acc_from_ct(cont_table):
return np.diag(cont_table).sum() / cont_table.sum()
def _f1_bin(tp, fp, fn):
if tp + fp + fn == 0:
return 1
else:
return (2 * tp) / (2 * tp + fp + fn)
def macro_f1_from_ct(cont_table):
n = cont_table.shape[0]
if n == 2:
tp = cont_table[1, 1]
fp = cont_table[0, 1]
fn = cont_table[1, 0]
return _f1_bin(tp, fp, fn)
f1_per_class = []
for i in range(n):
tp = cont_table[i, i]
fp = cont_table[:, i].sum() - tp
fn = cont_table[i, :].sum() - tp
f1_per_class.append(_f1_bin(tp, fp, fn))
return np.mean(f1_per_class)
def microf1(cont_table):
n = cont_table.shape[0]
if n == 2:
tp = cont_table[1, 1]
fp = cont_table[0, 1]
fn = cont_table[1, 0]
return _f1_bin(tp, fp, fn)
tp, fp, fn = 0, 0, 0
for i in range(n):
tp_i = cont_table[i, i]
tp += tp_i
fp += cont_table[:, i].sum() - tp_i
fn += cont_table[i, :].sum() - tp_i
return _f1_bin(tp, fp, fn)
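# Worked example (illustrative): for the 2x2 contingency table np.array([[40, 10], [5, 45]])
# (rows = true class, columns = predicted class, 100 instances in total):
#   vanilla_acc_fn -> (40 + 45) / 100 = 0.85
#   macrof1_fn     -> 2*45 / (2*45 + 10 + 5) ~= 0.857
# In the binary case both macrof1_fn and microf1 reduce to the F1 of the positive class,
# as per the n == 2 branch above.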
ACCURACY_ERROR = {maccd}
ACCURACY_ERROR_SINGLE = {accd}
ACCURACY_ERROR_NAMES = {func.__name__ for func in ACCURACY_ERROR}

View File

@ -1,115 +0,0 @@
from functools import wraps
import numpy as np
import quapy.functional as F
import sklearn.metrics as metrics
from quapy.method.aggregative import ACC, EMQ
from sklearn import clone
from sklearn.linear_model import LogisticRegression
import quacc as qc
from quacc.evaluation.report import EvaluationReport
_alts = {}
def alt(func):
@wraps(func)
def wrapper(c_model, validation, protocol):
return func(c_model, validation, protocol)
wrapper.name = func.__name__
_alts[func.__name__] = wrapper
return wrapper
@alt
def cross(c_model, validation, protocol):
y_val = validation.labels
y_hat_val = c_model.predict(validation.instances)
qcls = clone(c_model)
qcls.fit(*validation.Xy)
er = EvaluationReport(name="cross")
for sample in protocol():
y_hat = c_model.predict(sample.instances)
y = sample.labels
ground_acc = (y_hat == y).mean()
ground_f1 = metrics.f1_score(y, y_hat, zero_division=0)
q = EMQ(qcls)
q.fit(validation, fit_classifier=False)
M_hat = ACC.getPteCondEstim(validation.classes_, y_val, y_hat_val)
p_hat = q.quantify(sample.instances)
cont_table_hat = p_hat * M_hat
acc_score = qc.error.acc(cont_table_hat)
f1_score = qc.error.f1(cont_table_hat)
meta_acc = abs(acc_score - ground_acc)
meta_f1 = abs(f1_score - ground_f1)
er.append_row(
sample.prevalence(),
acc=meta_acc,
f1=meta_f1,
acc_score=acc_score,
f1_score=f1_score,
)
return er
@alt
def cross2(c_model, validation, protocol):
classes = validation.classes_
y_val = validation.labels
y_hat_val = c_model.predict(validation.instances)
M_hat = ACC.getPteCondEstim(classes, y_val, y_hat_val)
pos_prev_val = validation.prevalence()[1]
er = EvaluationReport(name="cross2")
for sample in protocol():
y_test = sample.labels
y_hat_test = c_model.predict(sample.instances)
ground_acc = (y_hat_test == y_test).mean()
ground_f1 = metrics.f1_score(y_test, y_hat_test, zero_division=0)
pos_prev_cc = F.prevalence_from_labels(y_hat_test, classes)[1]
tpr_hat = M_hat[1, 1]
fpr_hat = M_hat[1, 0]
tnr_hat = M_hat[0, 0]
pos_prev_test_hat = (pos_prev_cc - fpr_hat) / (tpr_hat - fpr_hat)
pos_prev_test_hat = np.clip(pos_prev_test_hat, 0, 1)
if pos_prev_val > 0.5:
# in this case, the tpr might be a more reliable estimate than tnr
A = np.asarray(
[[0, 0, 1, 1], [0, 1, 0, 1], [1, 1, 1, 1], [0, tpr_hat, 0, tpr_hat - 1]]
)
else:
# in this case, the tnr might be a more reliable estimate than tpr
A = np.asarray(
[[0, 0, 1, 1], [0, 1, 0, 1], [1, 1, 1, 1], [tnr_hat - 1, 0, tnr_hat, 0]]
)
b = np.asarray([pos_prev_cc, pos_prev_test_hat, 1, 0])
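# (added comment) The 4x4 system solves for [tn, fn, fp, tp] under:
#   fp + tp = pos_prev_cc          (fraction predicted positive by the classifier)
#   fn + tp = pos_prev_test_hat    (ACC-corrected estimate of the true positive prevalence)
#   tn + fn + fp + tp = 1          (cell proportions sum to one)
#   tp = tpr_hat * (tp + fn)   or   tn = tnr_hat * (tn + fp)
# i.e. the tpr (resp. tnr) estimated on the validation set is assumed to hold on the test sample.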
tn, fn, fp, tp = np.linalg.solve(A, b)
cont_table_hat = np.array([[tn, fp], [fn, tp]])
acc_score = qc.error.acc(cont_table_hat)
f1_score = qc.error.f1(cont_table_hat)
meta_acc = abs(acc_score - ground_acc)
meta_f1 = abs(f1_score - ground_f1)
er.append_row(
sample.prevalence(),
acc=meta_acc,
f1=meta_f1,
acc_score=acc_score,
f1_score=f1_score,
)
return er

View File

@ -1,590 +0,0 @@
from functools import wraps
from statistics import mean
import numpy as np
import sklearn.metrics as metrics
from quapy.data import LabelledCollection
from quapy.protocol import APP, AbstractStochasticSeededProtocol
from scipy.sparse import issparse
from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
import baselines.atc as atc
import baselines.doc as doclib
import baselines.gde as gdelib
import baselines.impweight as iw
import baselines.mandoline as mandolib
import baselines.rca as rcalib
from baselines.utils import clone_fit
from quacc.environment import env
from .report import EvaluationReport
_baselines = {}
def baseline(func):
@wraps(func)
def wrapper(c_model, validation, protocol):
return func(c_model, validation, protocol)
wrapper.name = func.__name__
_baselines[func.__name__] = wrapper
return wrapper
@baseline
def kfcv(
c_model: BaseEstimator,
validation: LabelledCollection,
protocol: AbstractStochasticSeededProtocol,
predict_method="predict",
):
c_model_predict = getattr(c_model, predict_method)
f1_average = "binary" if validation.n_classes == 2 else "macro"
scoring = ["accuracy", "f1_macro"]
scores = cross_validate(c_model, validation.X, validation.y, scoring=scoring)
acc_score = mean(scores["test_accuracy"])
f1_score = mean(scores["test_f1_macro"])
report = EvaluationReport(name="kfcv")
for test in protocol():
test_preds = c_model_predict(test.X)
meta_acc = abs(acc_score - metrics.accuracy_score(test.y, test_preds))
meta_f1 = abs(
f1_score - metrics.f1_score(test.y, test_preds, average=f1_average)
)
report.append_row(
test.prevalence(),
acc_score=acc_score,
f1_score=f1_score,
acc=meta_acc,
f1=meta_f1,
)
return report
@baseline
def ref(
c_model: BaseEstimator,
validation: LabelledCollection,
protocol: AbstractStochasticSeededProtocol,
):
c_model_predict = getattr(c_model, "predict")
f1_average = "binary" if validation.n_classes == 2 else "macro"
report = EvaluationReport(name="ref")
for test in protocol():
test_preds = c_model_predict(test.X)
report.append_row(
test.prevalence(),
acc_score=metrics.accuracy_score(test.y, test_preds),
f1_score=metrics.f1_score(test.y, test_preds, average=f1_average),
)
return report
@baseline
def naive(
c_model: BaseEstimator,
validation: LabelledCollection,
protocol: AbstractStochasticSeededProtocol,
predict_method="predict",
):
c_model_predict = getattr(c_model, predict_method)
f1_average = "binary" if validation.n_classes == 2 else "macro"
val_preds = c_model_predict(validation.X)
val_acc = metrics.accuracy_score(validation.y, val_preds)
val_f1 = metrics.f1_score(validation.y, val_preds, average=f1_average)
report = EvaluationReport(name="naive")
for test in protocol():
test_preds = c_model_predict(test.X)
test_acc = metrics.accuracy_score(test.y, test_preds)
test_f1 = metrics.f1_score(test.y, test_preds, average=f1_average)
meta_acc = abs(val_acc - test_acc)
meta_f1 = abs(val_f1 - test_f1)
report.append_row(
test.prevalence(),
acc_score=val_acc,
f1_score=val_f1,
acc=meta_acc,
f1=meta_f1,
)
return report
@baseline
def mandoline(
c_model: BaseEstimator,
validation: LabelledCollection,
protocol: AbstractStochasticSeededProtocol,
predict_method="predict_proba",
) -> EvaluationReport:
c_model_predict = getattr(c_model, predict_method)
val_probs = c_model_predict(validation.X)
val_preds = np.argmax(val_probs, axis=1)
D_val = mandolib.get_slices(val_probs)
empirical_mat_list_val = (1.0 * (val_preds == validation.y))[:, np.newaxis]
report = EvaluationReport(name="mandoline")
for test in protocol():
test_probs = c_model_predict(test.X)
test_pred = np.argmax(test_probs, axis=1)
D_test = mandolib.get_slices(test_probs)
wp = mandolib.estimate_performance(D_val, D_test, None, empirical_mat_list_val)
score = wp.all_estimates[0].weighted[0]
meta_score = abs(score - metrics.accuracy_score(test.y, test_pred))
report.append_row(test.prevalence(), acc=meta_score, acc_score=score)
return report
@baseline
def rca(
c_model: BaseEstimator,
validation: LabelledCollection,
protocol: AbstractStochasticSeededProtocol,
predict_method="predict",
):
"""elsahar19"""
c_model_predict = getattr(c_model, predict_method)
f1_average = "binary" if validation.n_classes == 2 else "macro"
val1, val2 = validation.split_stratified(train_prop=0.5, random_state=env._R_SEED)
val1_pred1 = c_model_predict(val1.X)
val2_protocol = APP(
val2,
n_prevalences=21,
repeats=100,
return_type="labelled_collection",
)
val2_rca = []
val2_prot_preds = []
val2_prot_y = []
for v2 in val2_protocol():
_preds = c_model_predict(v2.X)
try:
c_model2 = clone_fit(c_model, v2.X, _preds)
c_model2_predict = getattr(c_model2, predict_method)
val1_pred2 = c_model2_predict(val1.X)
rca_score = 1.0 - rcalib.get_score(val1_pred1, val1_pred2, val1.y)
val2_rca.append(rca_score)
val2_prot_preds.append(_preds)
val2_prot_y.append(v2.y)
except ValueError:
pass
val_targets_acc = np.array(
[
metrics.accuracy_score(v2_y, v2_preds)
for v2_y, v2_preds in zip(val2_prot_y, val2_prot_preds)
]
)
reg_acc = LinearRegression().fit(np.array(val2_rca)[:, np.newaxis], val_targets_acc)
val_targets_f1 = np.array(
[
metrics.f1_score(v2_y, v2_preds, average=f1_average)
for v2_y, v2_preds in zip(val2_prot_y, val2_prot_preds)
]
)
reg_f1 = LinearRegression().fit(np.array(val2_rca)[:, np.newaxis], val_targets_f1)
report = EvaluationReport(name="rca")
for test in protocol():
try:
test_preds = c_model_predict(test.X)
c_model2 = clone_fit(c_model, test.X, test_preds)
c_model2_predict = getattr(c_model2, predict_method)
val1_pred2 = c_model2_predict(val1.X)
rca_score = 1.0 - rcalib.get_score(val1_pred1, val1_pred2, val1.y)
acc_score = reg_acc.predict(np.array([[rca_score]]))[0]
f1_score = reg_f1.predict(np.array([[rca_score]]))[0]
meta_acc = abs(acc_score - metrics.accuracy_score(test.y, test_preds))
meta_f1 = abs(
f1_score - metrics.f1_score(test.y, test_preds, average=f1_average)
)
report.append_row(
test.prevalence(),
acc=meta_acc,
acc_score=acc_score,
f1=meta_f1,
f1_score=f1_score,
)
except ValueError:
report.append_row(
test.prevalence(),
acc=np.nan,
acc_score=np.nan,
f1=np.nan,
f1_score=np.nan,
)
return report
@baseline
def rca_star(
c_model: BaseEstimator,
validation: LabelledCollection,
protocol: AbstractStochasticSeededProtocol,
predict_method="predict",
):
"""elsahar19"""
c_model_predict = getattr(c_model, predict_method)
f1_average = "binary" if validation.n_classes == 2 else "macro"
validation1, val2 = validation.split_stratified(
train_prop=0.5, random_state=env._R_SEED
)
val11, val12 = validation1.split_stratified(
train_prop=0.5, random_state=env._R_SEED
)
val11_pred = c_model_predict(val11.X)
c_model1 = clone_fit(c_model, val11.X, val11_pred)
c_model1_predict = getattr(c_model1, predict_method)
val12_pred1 = c_model1_predict(val12.X)
val2_protocol = APP(
val2,
n_prevalences=21,
repeats=100,
return_type="labelled_collection",
)
val2_rca = []
val2_prot_preds = []
val2_prot_y = []
for v2 in val2_protocol():
_preds = c_model_predict(v2.X)
try:
c_model2 = clone_fit(c_model, v2.X, _preds)
c_model2_predict = getattr(c_model2, predict_method)
val12_pred2 = c_model2_predict(val12.X)
rca_score = 1.0 - rcalib.get_score(val12_pred1, val12_pred2, val12.y)
val2_rca.append(rca_score)
val2_prot_preds.append(_preds)
val2_prot_y.append(v2.y)
except ValueError:
pass
val_targets_acc = np.array(
[
metrics.accuracy_score(v2_y, v2_preds)
for v2_y, v2_preds in zip(val2_prot_y, val2_prot_preds)
]
)
reg_acc = LinearRegression().fit(np.array(val2_rca)[:, np.newaxis], val_targets_acc)
val_targets_f1 = np.array(
[
metrics.f1_score(v2_y, v2_preds, average=f1_average)
for v2_y, v2_preds in zip(val2_prot_y, val2_prot_preds)
]
)
reg_f1 = LinearRegression().fit(np.array(val2_rca)[:, np.newaxis], val_targets_f1)
report = EvaluationReport(name="rca_star")
for test in protocol():
try:
test_pred = c_model_predict(test.X)
c_model2 = clone_fit(c_model, test.X, test_pred)
c_model2_predict = getattr(c_model2, predict_method)
val12_pred2 = c_model2_predict(val12.X)
rca_star_score = 1.0 - rcalib.get_score(val12_pred1, val12_pred2, val12.y)
acc_score = reg_acc.predict(np.array([[rca_star_score]]))[0]
f1_score = reg_f1.predict(np.array([[rca_star_score]]))[0]
meta_acc = abs(acc_score - metrics.accuracy_score(test.y, test_pred))
meta_f1 = abs(
f1_score - metrics.f1_score(test.y, test_pred, average=f1_average)
)
report.append_row(
test.prevalence(),
acc=meta_acc,
acc_score=acc_score,
f1=meta_f1,
f1_score=f1_score,
)
except ValueError:
report.append_row(
test.prevalence(),
acc=np.nan,
acc_score=np.nan,
f1=np.nan,
f1_score=np.nan,
)
return report
@baseline
def atc_mc(
c_model: BaseEstimator,
validation: LabelledCollection,
protocol: AbstractStochasticSeededProtocol,
predict_method="predict_proba",
):
"""garg"""
c_model_predict = getattr(c_model, predict_method)
f1_average = "binary" if validation.n_classes == 2 else "macro"
## Load ID validation data probs and labels
val_probs, val_labels = c_model_predict(validation.X), validation.y
## score function, e.g., negative entropy or argmax confidence
val_scores = atc.get_max_conf(val_probs)
val_preds = np.argmax(val_probs, axis=-1)
_, atc_thres = atc.find_ATC_threshold(val_scores, val_labels == val_preds)
report = EvaluationReport(name="atc_mc")
for test in protocol():
## Load OOD test data probs
test_probs = c_model_predict(test.X)
test_preds = np.argmax(test_probs, axis=-1)
test_scores = atc.get_max_conf(test_probs)
atc_accuracy = atc.get_ATC_acc(atc_thres, test_scores)
meta_acc = abs(atc_accuracy - metrics.accuracy_score(test.y, test_preds))
f1_score = atc.get_ATC_f1(
atc_thres, test_scores, test_probs, average=f1_average
)
meta_f1 = abs(
f1_score - metrics.f1_score(test.y, test_preds, average=f1_average)
)
report.append_row(
test.prevalence(),
acc=meta_acc,
acc_score=atc_accuracy,
f1_score=f1_score,
f1=meta_f1,
)
return report
@baseline
def atc_ne(
c_model: BaseEstimator,
validation: LabelledCollection,
protocol: AbstractStochasticSeededProtocol,
predict_method="predict_proba",
):
"""garg"""
c_model_predict = getattr(c_model, predict_method)
f1_average = "binary" if validation.n_classes == 2 else "macro"
## Load ID validation data probs and labels
val_probs, val_labels = c_model_predict(validation.X), validation.y
## score function, e.g., negative entropy or argmax confidence
val_scores = atc.get_entropy(val_probs)
val_preds = np.argmax(val_probs, axis=-1)
_, atc_thres = atc.find_ATC_threshold(val_scores, val_labels == val_preds)
report = EvaluationReport(name="atc_ne")
for test in protocol():
## Load OOD test data probs
test_probs = c_model_predict(test.X)
test_preds = np.argmax(test_probs, axis=-1)
test_scores = atc.get_entropy(test_probs)
atc_accuracy = atc.get_ATC_acc(atc_thres, test_scores)
meta_acc = abs(atc_accuracy - metrics.accuracy_score(test.y, test_preds))
f1_score = atc.get_ATC_f1(
atc_thres, test_scores, test_probs, average=f1_average
)
meta_f1 = abs(
f1_score - metrics.f1_score(test.y, test_preds, average=f1_average)
)
report.append_row(
test.prevalence(),
acc=meta_acc,
acc_score=atc_accuracy,
f1_score=f1_score,
f1=meta_f1,
)
return report
@baseline
def doc(
c_model: BaseEstimator,
validation: LabelledCollection,
protocol: AbstractStochasticSeededProtocol,
predict_method="predict_proba",
):
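# DoC: fit linear regressions mapping the difference of confidences between val1 and
# artificially shifted samples of val2 onto the observed drop in accuracy/F1, then use
# them to predict performance on each test sample.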
c_model_predict = getattr(c_model, predict_method)
f1_average = "binary" if validation.n_classes == 2 else "macro"
val1, val2 = validation.split_stratified(train_prop=0.5, random_state=env._R_SEED)
val1_probs = c_model_predict(val1.X)
val1_mc = np.max(val1_probs, axis=-1)
val1_preds = np.argmax(val1_probs, axis=-1)
val1_acc = metrics.accuracy_score(val1.y, val1_preds)
val1_f1 = metrics.f1_score(val1.y, val1_preds, average=f1_average)
val2_protocol = APP(
val2,
n_prevalences=21,
repeats=100,
return_type="labelled_collection",
)
val2_prot_mc = []
val2_prot_preds = []
val2_prot_y = []
for v2 in val2_protocol():
_probs = c_model_predict(v2.X)
_mc = np.max(_probs, axis=-1)
_preds = np.argmax(_probs, axis=-1)
val2_prot_mc.append(_mc)
val2_prot_preds.append(_preds)
val2_prot_y.append(v2.y)
val_scores = np.array([doclib.get_doc(val1_mc, v2_mc) for v2_mc in val2_prot_mc])
val_targets_acc = np.array(
[
val1_acc - metrics.accuracy_score(v2_y, v2_preds)
for v2_y, v2_preds in zip(val2_prot_y, val2_prot_preds)
]
)
reg_acc = LinearRegression().fit(val_scores[:, np.newaxis], val_targets_acc)
val_targets_f1 = np.array(
[
val1_f1 - metrics.f1_score(v2_y, v2_preds, average=f1_average)
for v2_y, v2_preds in zip(val2_prot_y, val2_prot_preds)
]
)
reg_f1 = LinearRegression().fit(val_scores[:, np.newaxis], val_targets_f1)
report = EvaluationReport(name="doc")
for test in protocol():
test_probs = c_model_predict(test.X)
test_preds = np.argmax(test_probs, axis=-1)
test_mc = np.max(test_probs, axis=-1)
acc_score = (
val1_acc
- reg_acc.predict(np.array([[doclib.get_doc(val1_mc, test_mc)]]))[0]
)
f1_score = (
val1_f1 - reg_f1.predict(np.array([[doclib.get_doc(val1_mc, test_mc)]]))[0]
)
meta_acc = abs(acc_score - metrics.accuracy_score(test.y, test_preds))
meta_f1 = abs(
f1_score - metrics.f1_score(test.y, test_preds, average=f1_average)
)
report.append_row(
test.prevalence(),
acc=meta_acc,
acc_score=acc_score,
f1=meta_f1,
f1_score=f1_score,
)
return report
@baseline
def doc_feat(
c_model: BaseEstimator,
validation: LabelledCollection,
protocol: AbstractStochasticSeededProtocol,
predict_method="predict_proba",
):
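# Regression-free DoC variant: estimated accuracy is the validation accuracy adjusted
# by the difference of confidences between validation and test scores.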
c_model_predict = getattr(c_model, predict_method)
val_probs, val_labels = c_model_predict(validation.X), validation.y
val_scores = np.max(val_probs, axis=-1)
val_preds = np.argmax(val_probs, axis=-1)
v1acc = np.mean(val_preds == val_labels) * 100
report = EvaluationReport(name="doc_feat")
for test in protocol():
test_probs = c_model_predict(test.X)
test_preds = np.argmax(test_probs, axis=-1)
test_scores = np.max(test_probs, axis=-1)
score = (v1acc + doclib.get_doc(val_scores, test_scores)) / 100.0
meta_acc = abs(score - metrics.accuracy_score(test.y, test_preds))
report.append_row(test.prevalence(), acc=meta_acc, acc_score=score)
return report
@baseline
def gde(
c_model: BaseEstimator,
validation: LabelledCollection,
protocol: AbstractStochasticSeededProtocol,
predict_method="predict",
) -> EvaluationReport:
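# GDE: train two clones on disjoint validation halves and estimate accuracy from the
# agreement between their predictions on each test sample.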
c_model_predict = getattr(c_model, predict_method)
val1, val2 = validation.split_stratified(train_prop=0.5, random_state=env._R_SEED)
c_model1 = clone_fit(c_model, val1.X, val1.y)
c_model1_predict = getattr(c_model1, predict_method)
c_model2 = clone_fit(c_model, val2.X, val2.y)
c_model2_predict = getattr(c_model2, predict_method)
report = EvaluationReport(name="gde")
for test in protocol():
test_pred = c_model_predict(test.X)
test_pred1 = c_model1_predict(test.X)
test_pred2 = c_model2_predict(test.X)
score = gdelib.get_score(test_pred1, test_pred2)
meta_score = abs(score - metrics.accuracy_score(test.y, test_pred))
report.append_row(test.prevalence(), acc=meta_score, acc_score=score)
return report
@baseline
def logreg(
c_model: BaseEstimator,
validation: LabelledCollection,
protocol: AbstractStochasticSeededProtocol,
predict_method="predict",
):
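# Importance weighting: weights estimated via logistic regression (iw.logreg) reweight
# the validation predictions to estimate accuracy under the test distribution.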
c_model_predict = getattr(c_model, predict_method)
val_preds = c_model_predict(validation.X)
report = EvaluationReport(name="logreg")
for test in protocol():
wx = iw.logreg(validation.X, validation.y, test.X)
test_preds = c_model_predict(test.X)
estim_acc = iw.get_acc(val_preds, validation.y, wx)
true_acc = metrics.accuracy_score(test.y, test_preds)
meta_score = abs(estim_acc - true_acc)
report.append_row(test.prevalence(), acc=meta_score, acc_score=estim_acc)
return report
@baseline
def kdex2(
c_model: BaseEstimator,
validation: LabelledCollection,
protocol: AbstractStochasticSeededProtocol,
predict_method="predict",
):
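# Importance weighting with KDE-based weights (iw.kdex2_*): density ratios between
# validation and test features reweight the validation predictions.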
c_model_predict = getattr(c_model, predict_method)
val_preds = c_model_predict(validation.X)
log_likelihood_val = iw.kdex2_lltr(validation.X)
Xval = validation.X.toarray() if issparse(validation.X) else validation.X
report = EvaluationReport(name="kdex2")
for test in protocol():
Xte = test.X.toarray() if issparse(test.X) else test.X
wx = iw.kdex2_weights(Xval, Xte, log_likelihood_val)
test_preds = c_model_predict(Xte)
estim_acc = iw.get_acc(val_preds, validation.y, wx)
true_acc = metrics.accuracy_score(test.y, test_preds)
meta_score = abs(estim_acc - true_acc)
report.append_row(test.prevalence(), acc=meta_score, acc_score=estim_acc)
return report

View File

@ -1,121 +0,0 @@
import os
import time
from traceback import print_exception as traceback
import numpy as np
import pandas as pd
import quapy as qp
from joblib import Parallel, delayed
from quapy.protocol import APP
from sklearn.linear_model import LogisticRegression
from quacc import logger
from quacc.dataset import Dataset
from quacc.environment import env
from quacc.evaluation.estimators import CE
from quacc.evaluation.report import CompReport, DatasetReport
from quacc.utils import parallel
# from quacc.logger import logger, logger_manager
# from quacc.evaluation.worker import WorkerArgs, estimate_worker
pd.set_option("display.float_format", "{:.4f}".format)
# qp.environ["SAMPLE_SIZE"] = env.SAMPLE_SIZE
def estimate_worker(_estimate, train, validation, test, q=None):
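# Worker: fit a LogisticRegression on the training split, build an APP protocol over
# the test set, run the given estimation method and return its timed report.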
# qp.environ["SAMPLE_SIZE"] = env.SAMPLE_SIZE
log = logger.setup_worker_logger(q)
model = LogisticRegression()
model.fit(*train.Xy)
protocol = APP(
test,
n_prevalences=env.PROTOCOL_N_PREVS,
repeats=env.PROTOCOL_REPEATS,
return_type="labelled_collection",
random_state=env._R_SEED,
)
start = time.time()
try:
result = _estimate(model, validation, protocol)
except Exception as e:
log.warning(f"Method {_estimate.name} failed. Exception: {e}")
traceback(e)
return None
result.time = time.time() - start
log.info(f"{_estimate.name} finished [took {result.time:.4f}s]")
logger.logger_manager().rm_worker()
return result
def split_tasks(estimators, train, validation, test, q):
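# Split tasks between parallel and sequential execution: grid-search methods
# (names ending in "_gs") run sequentially, everything else runs in parallel.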
_par, _seq = [], []
for estim in estimators:
if hasattr(estim, "nocall"):
continue
_task = [estim, train, validation, test]
match estim.name:
case n if n.endswith("_gs"):
_seq.append(_task)
case _:
_par.append(_task + [q])
return _par, _seq
def evaluate_comparison(dataset: Dataset, estimators=None) -> DatasetReport:
# log = Logger.logger()
log = logger.logger()
# with multiprocessing.Pool(1) as pool:
__pool_size = round(os.cpu_count() * 0.8)
# with multiprocessing.Pool(__pool_size) as pool:
dr = DatasetReport(dataset.name)
log.info(f"dataset {dataset.name} [pool size: {__pool_size}]")
for d in dataset():
log.info(
f"Dataset sample {np.around(d.train_prev, decimals=2)} "
f"of dataset {dataset.name} started"
)
par_tasks, seq_tasks = split_tasks(
CE.func[estimators],
d.train,
d.validation,
d.test,
logger.logger_manager().q,
)
try:
tstart = time.time()
results = parallel(estimate_worker, par_tasks, n_jobs=env.N_JOBS, _env=env)
results += parallel(estimate_worker, seq_tasks, n_jobs=1, _env=env)
results = [r for r in results if r is not None]
g_time = time.time() - tstart
log.info(
f"Dataset sample {np.around(d.train_prev, decimals=2)} "
f"of dataset {dataset.name} finished "
f"[took {g_time:.4f}s]"
)
cr = CompReport(
results,
name=dataset.name,
train_prev=d.train_prev,
valid_prev=d.validation_prev,
g_time=g_time,
)
dr += cr
except Exception as e:
log.warning(
f"Dataset sample {np.around(d.train_prev, decimals=2)} "
f"of dataset {dataset.name} failed. "
f"Exception: {e}"
)
traceback(e)
return dr

View File

@ -1,112 +0,0 @@
from typing import List
import numpy as np
from quacc.evaluation import baseline, method, alt
class CompEstimatorFunc_:
def __init__(self, ce):
self.ce = ce
def __getitem__(self, e: str | List[str]):
if isinstance(e, str):
return list(self.ce._CompEstimator__get(e).values())[0]
elif isinstance(e, list):
return list(self.ce._CompEstimator__get(e).values())
class CompEstimatorName_:
def __init__(self, ce):
self.ce = ce
def __getitem__(self, e: str | List[str]):
if isinstance(e, str):
return list(self.ce._CompEstimator__get(e).keys())[0]
elif isinstance(e, list):
return list(self.ce._CompEstimator__get(e).keys())
def sort(self, e: List[str]):
return list(self.ce._CompEstimator__get(e, get_ref=False).keys())
@property
def all(self):
return list(self.ce._CompEstimator__get("__all").keys())
@property
def baselines(self):
return list(self.ce._CompEstimator__get("__baselines").keys())
class CompEstimator:
def __get(cls, e: str | List[str], get_ref=True):
_dict = alt._alts | baseline._baselines | method._methods
if isinstance(e, str) and e == "__all":
e = list(_dict.keys())
if isinstance(e, str) and e == "__baselines":
e = list(baseline._baselines.keys())
if isinstance(e, str):
try:
return {e: _dict[e]}
except KeyError:
raise KeyError(f"Invalid estimator: estimator {e} does not exist")
elif isinstance(e, list) or isinstance(e, np.ndarray):
_subtr = np.setdiff1d(e, list(_dict.keys()))
if len(_subtr) > 0:
raise KeyError(
f"Invalid estimator: estimator {_subtr[0]} does not exist"
)
e_fun = {k: fun for k, fun in _dict.items() if k in e}
if get_ref and "ref" not in e:
e_fun["ref"] = _dict["ref"]
elif not get_ref and "ref" in e:
del e_fun["ref"]
return e_fun
@property
def name(self):
return CompEstimatorName_(self)
@property
def func(self):
return CompEstimatorFunc_(self)
CE = CompEstimator()
_renames = {
"bin_sld_lr": "(2x2)_SLD_LR",
"mul_sld_lr": "(1x4)_SLD_LR",
"m3w_sld_lr": "(1x3)_SLD_LR",
"d_bin_sld_lr": "d_(2x2)_SLD_LR",
"d_mul_sld_lr": "d_(1x4)_SLD_LR",
"d_m3w_sld_lr": "d_(1x3)_SLD_LR",
"d_bin_sld_rbf": "(2x2)_SLD_RBF",
"d_mul_sld_rbf": "(1x4)_SLD_RBF",
"d_m3w_sld_rbf": "(1x3)_SLD_RBF",
# "sld_lr_gs": "MS_SLD_LR",
"sld_lr_gs": "QuAcc(SLD)",
"bin_kde_lr": "(2x2)_KDEy_LR",
"mul_kde_lr": "(1x4)_KDEy_LR",
"m3w_kde_lr": "(1x3)_KDEy_LR",
"d_bin_kde_lr": "d_(2x2)_KDEy_LR",
"d_mul_kde_lr": "d_(1x4)_KDEy_LR",
"d_m3w_kde_lr": "d_(1x3)_KDEy_LR",
"bin_cc_lr": "(2x2)_CC_LR",
"mul_cc_lr": "(1x4)_CC_LR",
"m3w_cc_lr": "(1x3)_CC_LR",
# "kde_lr_gs": "MS_KDEy_LR",
"kde_lr_gs": "QuAcc(KDEy)",
# "cc_lr_gs": "MS_CC_LR",
"cc_lr_gs": "QuAcc(CC)",
"atc_mc": "ATC",
"doc": "DoC",
"mandoline": "Mandoline",
"rca": "RCA",
"rca_star": "RCA*",
"naive": "Naive",
}

View File

@ -1,32 +0,0 @@
from typing import Callable, Union
from quapy.protocol import AbstractProtocol, OnLabelledCollectionProtocol
import quacc as qc
from quacc.method.base import BaseAccuracyEstimator
def evaluate(
estimator: BaseAccuracyEstimator,
protocol: AbstractProtocol,
error_metric: Union[Callable, str],
) -> float:
if isinstance(error_metric, str):
error_metric = qc.error.from_name(error_metric)
collator_bck_ = protocol.collator
protocol.collator = OnLabelledCollectionProtocol.get_collator("labelled_collection")
estim_prevs, true_prevs = [], []
for sample in protocol():
e_sample = estimator.extend(sample)
estim_prev = estimator.estimate(e_sample.eX)
estim_prevs.append(estim_prev)
true_prevs.append(e_sample.e_prevalence())
protocol.collator = collator_bck_
# true_prevs = np.array(true_prevs)
# estim_prevs = np.array(estim_prevs)
return error_metric(true_prevs, estim_prevs)

View File

@ -1,517 +0,0 @@
from dataclasses import dataclass
from typing import Callable, List, Union
import numpy as np
from matplotlib.pylab import rand
from quapy.method.aggregative import CC, PACC, SLD, BaseQuantifier
from quapy.protocol import UPP, AbstractProtocol, OnLabelledCollectionProtocol
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
import quacc as qc
from quacc.environment import env
from quacc.evaluation.report import EvaluationReport
from quacc.method.base import BQAE, MCAE, BaseAccuracyEstimator
from quacc.method.model_selection import (
GridSearchAE,
SpiderSearchAE,
)
from quacc.quantification import KDEy
import traceback
def _param_grid(method, X_fit: np.ndarray):
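# Hyperparameter grids for model selection; for RBF kernels, gamma is scaled by
# 1 / (n_features * X.var()), mirroring sklearn's gamma="scale" default.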
match method:
case "sld_lr":
return {
"q__classifier__C": np.logspace(-3, 3, 7),
"q__classifier__class_weight": [None, "balanced"],
"q__recalib": [None, "bcts"],
"confidence": [
None,
["isoft"],
["max_conf", "entropy"],
["max_conf", "entropy", "isoft"],
],
}
case "sld_rbf":
_scale = 1.0 / (X_fit.shape[1] * X_fit.var())
return {
"q__classifier__C": np.logspace(-3, 3, 7),
"q__classifier__class_weight": [None, "balanced"],
"q__classifier__gamma": _scale * np.logspace(-2, 2, 5),
"q__recalib": [None, "bcts"],
"confidence": [
None,
["isoft"],
["max_conf", "entropy"],
["max_conf", "entropy", "isoft"],
],
}
case "pacc":
return {
"q__classifier__C": np.logspace(-3, 3, 7),
"q__classifier__class_weight": [None, "balanced"],
"confidence": [None, ["isoft"], ["max_conf", "entropy"]],
}
case "cc_lr":
return {
"q__classifier__C": np.logspace(-3, 3, 7),
"q__classifier__class_weight": [None, "balanced"],
"confidence": [
None,
["isoft"],
["max_conf", "entropy"],
["max_conf", "entropy", "isoft"],
],
}
case "kde_lr":
return {
"q__classifier__C": np.logspace(-3, 3, 7),
"q__classifier__class_weight": [None, "balanced"],
"q__bandwidth": np.linspace(0.01, 0.2, 20),
"confidence": [None, ["isoft"], ["max_conf", "entropy", "isoft"]],
}
case "kde_rbf":
_scale = 1.0 / (X_fit.shape[1] * X_fit.var())
return {
"q__classifier__C": np.logspace(-3, 3, 7),
"q__classifier__class_weight": [None, "balanced"],
"q__classifier__gamma": _scale * np.logspace(-2, 2, 5),
"q__bandwidth": np.linspace(0.01, 0.2, 20),
"confidence": [None, ["isoft"], ["max_conf", "entropy", "isoft"]],
}
def evaluation_report(
estimator: BaseAccuracyEstimator, protocol: AbstractProtocol, method_name=None
) -> EvaluationReport:
# method_name = inspect.stack()[1].function
report = EvaluationReport(name=method_name)
for sample in protocol():
try:
e_sample = estimator.extend(sample)
estim_prev = estimator.estimate(e_sample.eX)
true_prev = e_sample.e_prevalence()
acc_score = qc.error.acc(estim_prev)
row = dict(
acc_score=acc_score,
acc=abs(qc.error.acc(true_prev) - acc_score),
)
if estim_prev.can_f1():
f1_score = qc.error.f1(estim_prev)
row = row | dict(
f1_score=f1_score,
f1=abs(qc.error.f1(true_prev) - f1_score),
)
report.append_row(sample.prevalence(), **row)
except Exception as e:
print(f"sample prediction failed for method {method_name}: {e}")
traceback.print_exception(e)
report.append_row(
sample.prevalence(),
acc_score=np.nan,
acc=np.nan,
f1_score=np.nan,
f1=np.nan,
)
return report
@dataclass(frozen=True)
class EmptyMethod:
name: str
nocall: bool = True
def __call__(self, c_model, validation, protocol) -> EvaluationReport:
pass
@dataclass(frozen=True)
class EvaluationMethod:
name: str
q: BaseQuantifier
est_n: str
conf: List[str] | str = None
cf: bool = False # collapse_false
gf: bool = False # group_false
d: bool = False # dense
def get_est(self, c_model):
match self.est_n:
case "mul":
return MCAE(
c_model,
self.q,
confidence=self.conf,
collapse_false=self.cf,
group_false=self.gf,
dense=self.d,
)
case "bin":
return BQAE(
c_model,
self.q,
confidence=self.conf,
group_false=self.gf,
dense=self.d,
)
def __call__(self, c_model, validation, protocol) -> EvaluationReport:
est = self.get_est(c_model).fit(validation)
return evaluation_report(
estimator=est, protocol=protocol, method_name=self.name
)
@dataclass(frozen=True)
class EvaluationMethodGridSearch(EvaluationMethod):
pg: str = "sld"
search: str = "grid"
def get_search(self):
match self.search:
case "grid":
return (GridSearchAE, {})
case "spider" | "spider2":
return (SpiderSearchAE, dict(best_width=2))
case "spider3":
return (SpiderSearchAE, dict(best_width=3))
case _:
return (GridSearchAE, {})
def __call__(self, c_model, validation, protocol) -> EvaluationReport:
v_train, v_val = validation.split_stratified(0.6, random_state=env._R_SEED)
_model = self.get_est(c_model)
_grid = _param_grid(self.pg, X_fit=_model.extend(v_train, prefit=True).X)
_search_class, _search_params = self.get_search()
est = _search_class(
model=_model,
param_grid=_grid,
refit=False,
protocol=UPP(v_val, repeats=100),
verbose=False,
**_search_params,
).fit(v_train)
er = evaluation_report(
estimator=est,
protocol=protocol,
method_name=self.name,
)
er.fit_score = est.best_score()
return er
E = EmptyMethod
M = EvaluationMethod
G = EvaluationMethodGridSearch
def __sld_lr():
return SLD(LogisticRegression())
def __sld_rbf():
return SLD(SVC(kernel="rbf", probability=True))
def __kde_lr():
return KDEy(LogisticRegression(), random_state=env._R_SEED)
def __kde_rbf():
return KDEy(SVC(kernel="rbf", probability=True), random_state=env._R_SEED)
def __sld_lsvc():
return SLD(LinearSVC())
def __pacc_lr():
return PACC(LogisticRegression())
def __cc_lr():
return CC(LogisticRegression())
# fmt: off
__sld_lr_set = [
M("bin_sld_lr", __sld_lr(), "bin" ),
M("bgf_sld_lr", __sld_lr(), "bin", gf=True),
M("mul_sld_lr", __sld_lr(), "mul" ),
M("m3w_sld_lr", __sld_lr(), "mul", cf=True),
M("mgf_sld_lr", __sld_lr(), "mul", gf=True),
# max_conf sld
M("bin_sld_lr_mc", __sld_lr(), "bin", conf="max_conf", ),
M("bgf_sld_lr_mc", __sld_lr(), "bin", conf="max_conf", gf=True),
M("mul_sld_lr_mc", __sld_lr(), "mul", conf="max_conf", ),
M("m3w_sld_lr_mc", __sld_lr(), "mul", conf="max_conf", cf=True),
M("mgf_sld_lr_mc", __sld_lr(), "mul", conf="max_conf", gf=True),
# entropy sld
M("bin_sld_lr_ne", __sld_lr(), "bin", conf="entropy", ),
M("bgf_sld_lr_ne", __sld_lr(), "bin", conf="entropy", gf=True),
M("mul_sld_lr_ne", __sld_lr(), "mul", conf="entropy", ),
M("m3w_sld_lr_ne", __sld_lr(), "mul", conf="entropy", cf=True),
M("mgf_sld_lr_ne", __sld_lr(), "mul", conf="entropy", gf=True),
# inverse softmax sld
M("bin_sld_lr_is", __sld_lr(), "bin", conf="isoft", ),
M("bgf_sld_lr_is", __sld_lr(), "bin", conf="isoft", gf=True),
M("mul_sld_lr_is", __sld_lr(), "mul", conf="isoft", ),
M("m3w_sld_lr_is", __sld_lr(), "mul", conf="isoft", cf=True),
M("mgf_sld_lr_is", __sld_lr(), "mul", conf="isoft", gf=True),
# max_conf + entropy sld
M("bin_sld_lr_c", __sld_lr(), "bin", conf=["max_conf", "entropy"] ),
M("bgf_sld_lr_c", __sld_lr(), "bin", conf=["max_conf", "entropy"], gf=True),
M("mul_sld_lr_c", __sld_lr(), "mul", conf=["max_conf", "entropy"] ),
M("m3w_sld_lr_c", __sld_lr(), "mul", conf=["max_conf", "entropy"], cf=True),
M("mgf_sld_lr_c", __sld_lr(), "mul", conf=["max_conf", "entropy"], gf=True),
# sld all
M("bin_sld_lr_a", __sld_lr(), "bin", conf=["max_conf", "entropy", "isoft"], ),
M("bgf_sld_lr_a", __sld_lr(), "bin", conf=["max_conf", "entropy", "isoft"], gf=True),
M("mul_sld_lr_a", __sld_lr(), "mul", conf=["max_conf", "entropy", "isoft"], ),
M("m3w_sld_lr_a", __sld_lr(), "mul", conf=["max_conf", "entropy", "isoft"], cf=True),
M("mgf_sld_lr_a", __sld_lr(), "mul", conf=["max_conf", "entropy", "isoft"], gf=True),
# gs sld
G("bin_sld_lr_gs", __sld_lr(), "bin", pg="sld_lr" ),
G("bgf_sld_lr_gs", __sld_lr(), "bin", pg="sld_lr", gf=True),
G("mul_sld_lr_gs", __sld_lr(), "mul", pg="sld_lr" ),
G("m3w_sld_lr_gs", __sld_lr(), "mul", pg="sld_lr", cf=True),
G("mgf_sld_lr_gs", __sld_lr(), "mul", pg="sld_lr", gf=True),
]
__dense_sld_lr_set = [
M("d_bin_sld_lr", __sld_lr(), "bin", d=True, ),
M("d_bgf_sld_lr", __sld_lr(), "bin", d=True, gf=True),
M("d_mul_sld_lr", __sld_lr(), "mul", d=True, ),
M("d_m3w_sld_lr", __sld_lr(), "mul", d=True, cf=True),
M("d_mgf_sld_lr", __sld_lr(), "mul", d=True, gf=True),
# max_conf sld
M("d_bin_sld_lr_mc", __sld_lr(), "bin", d=True, conf="max_conf", ),
M("d_bgf_sld_lr_mc", __sld_lr(), "bin", d=True, conf="max_conf", gf=True),
M("d_mul_sld_lr_mc", __sld_lr(), "mul", d=True, conf="max_conf", ),
M("d_m3w_sld_lr_mc", __sld_lr(), "mul", d=True, conf="max_conf", cf=True),
M("d_mgf_sld_lr_mc", __sld_lr(), "mul", d=True, conf="max_conf", gf=True),
# entropy sld
M("d_bin_sld_lr_ne", __sld_lr(), "bin", d=True, conf="entropy", ),
M("d_bgf_sld_lr_ne", __sld_lr(), "bin", d=True, conf="entropy", gf=True),
M("d_mul_sld_lr_ne", __sld_lr(), "mul", d=True, conf="entropy", ),
M("d_m3w_sld_lr_ne", __sld_lr(), "mul", d=True, conf="entropy", cf=True),
M("d_mgf_sld_lr_ne", __sld_lr(), "mul", d=True, conf="entropy", gf=True),
# inverse softmax sld
M("d_bin_sld_lr_is", __sld_lr(), "bin", d=True, conf="isoft", ),
M("d_bgf_sld_lr_is", __sld_lr(), "bin", d=True, conf="isoft", gf=True),
M("d_mul_sld_lr_is", __sld_lr(), "mul", d=True, conf="isoft", ),
M("d_m3w_sld_lr_is", __sld_lr(), "mul", d=True, conf="isoft", cf=True),
M("d_mgf_sld_lr_is", __sld_lr(), "mul", d=True, conf="isoft", gf=True),
# max_conf + entropy sld
M("d_bin_sld_lr_c", __sld_lr(), "bin", d=True, conf=["max_conf", "entropy"] ),
M("d_bgf_sld_lr_c", __sld_lr(), "bin", d=True, conf=["max_conf", "entropy"], gf=True),
M("d_mul_sld_lr_c", __sld_lr(), "mul", d=True, conf=["max_conf", "entropy"] ),
M("d_m3w_sld_lr_c", __sld_lr(), "mul", d=True, conf=["max_conf", "entropy"], cf=True),
M("d_mgf_sld_lr_c", __sld_lr(), "mul", d=True, conf=["max_conf", "entropy"], gf=True),
# sld all
M("d_bin_sld_lr_a", __sld_lr(), "bin", d=True, conf=["max_conf", "entropy", "isoft"], ),
M("d_bgf_sld_lr_a", __sld_lr(), "bin", d=True, conf=["max_conf", "entropy", "isoft"], gf=True),
M("d_mul_sld_lr_a", __sld_lr(), "mul", d=True, conf=["max_conf", "entropy", "isoft"], ),
M("d_m3w_sld_lr_a", __sld_lr(), "mul", d=True, conf=["max_conf", "entropy", "isoft"], cf=True),
M("d_mgf_sld_lr_a", __sld_lr(), "mul", d=True, conf=["max_conf", "entropy", "isoft"], gf=True),
# gs sld
G("d_bin_sld_lr_gs", __sld_lr(), "bin", d=True, pg="sld_lr" ),
G("d_bgf_sld_lr_gs", __sld_lr(), "bin", d=True, pg="sld_lr", gf=True),
G("d_mul_sld_lr_gs", __sld_lr(), "mul", d=True, pg="sld_lr" ),
G("d_m3w_sld_lr_gs", __sld_lr(), "mul", d=True, pg="sld_lr", cf=True),
G("d_mgf_sld_lr_gs", __sld_lr(), "mul", d=True, pg="sld_lr", gf=True),
]
__dense_sld_rbf_set = [
M("d_bin_sld_rbf", __sld_rbf(), "bin", d=True, ),
M("d_bgf_sld_rbf", __sld_rbf(), "bin", d=True, gf=True),
M("d_mul_sld_rbf", __sld_rbf(), "mul", d=True, ),
M("d_m3w_sld_rbf", __sld_rbf(), "mul", d=True, cf=True),
M("d_mgf_sld_rbf", __sld_rbf(), "mul", d=True, gf=True),
# max_conf sld
M("d_bin_sld_rbf_mc", __sld_rbf(), "bin", d=True, conf="max_conf", ),
M("d_bgf_sld_rbf_mc", __sld_rbf(), "bin", d=True, conf="max_conf", gf=True),
M("d_mul_sld_rbf_mc", __sld_rbf(), "mul", d=True, conf="max_conf", ),
M("d_m3w_sld_rbf_mc", __sld_rbf(), "mul", d=True, conf="max_conf", cf=True),
M("d_mgf_sld_rbf_mc", __sld_rbf(), "mul", d=True, conf="max_conf", gf=True),
# entropy sld
M("d_bin_sld_rbf_ne", __sld_rbf(), "bin", d=True, conf="entropy", ),
M("d_bgf_sld_rbf_ne", __sld_rbf(), "bin", d=True, conf="entropy", gf=True),
M("d_mul_sld_rbf_ne", __sld_rbf(), "mul", d=True, conf="entropy", ),
M("d_m3w_sld_rbf_ne", __sld_rbf(), "mul", d=True, conf="entropy", cf=True),
M("d_mgf_sld_rbf_ne", __sld_rbf(), "mul", d=True, conf="entropy", gf=True),
# inverse softmax sld
M("d_bin_sld_rbf_is", __sld_rbf(), "bin", d=True, conf="isoft", ),
M("d_bgf_sld_rbf_is", __sld_rbf(), "bin", d=True, conf="isoft", gf=True),
M("d_mul_sld_rbf_is", __sld_rbf(), "mul", d=True, conf="isoft", ),
M("d_m3w_sld_rbf_is", __sld_rbf(), "mul", d=True, conf="isoft", cf=True),
M("d_mgf_sld_rbf_is", __sld_rbf(), "mul", d=True, conf="isoft", gf=True),
# max_conf + entropy sld
M("d_bin_sld_rbf_c", __sld_rbf(), "bin", d=True, conf=["max_conf", "entropy"] ),
M("d_bgf_sld_rbf_c", __sld_rbf(), "bin", d=True, conf=["max_conf", "entropy"], gf=True),
M("d_mul_sld_rbf_c", __sld_rbf(), "mul", d=True, conf=["max_conf", "entropy"] ),
M("d_m3w_sld_rbf_c", __sld_rbf(), "mul", d=True, conf=["max_conf", "entropy"], cf=True),
M("d_mgf_sld_rbf_c", __sld_rbf(), "mul", d=True, conf=["max_conf", "entropy"], gf=True),
# sld all
M("d_bin_sld_rbf_a", __sld_rbf(), "bin", d=True, conf=["max_conf", "entropy", "isoft"], ),
M("d_bgf_sld_rbf_a", __sld_rbf(), "bin", d=True, conf=["max_conf", "entropy", "isoft"], gf=True),
M("d_mul_sld_rbf_a", __sld_rbf(), "mul", d=True, conf=["max_conf", "entropy", "isoft"], ),
M("d_m3w_sld_rbf_a", __sld_rbf(), "mul", d=True, conf=["max_conf", "entropy", "isoft"], cf=True),
M("d_mgf_sld_rbf_a", __sld_rbf(), "mul", d=True, conf=["max_conf", "entropy", "isoft"], gf=True),
# gs sld
G("d_bin_sld_rbf_gs", __sld_rbf(), "bin", d=True, pg="sld_rbf", search="grid", ),
G("d_bgf_sld_rbf_gs", __sld_rbf(), "bin", d=True, pg="sld_rbf", search="grid", gf=True),
G("d_mul_sld_rbf_gs", __sld_rbf(), "mul", d=True, pg="sld_rbf", search="grid", ),
G("d_m3w_sld_rbf_gs", __sld_rbf(), "mul", d=True, pg="sld_rbf", search="grid", cf=True),
G("d_mgf_sld_rbf_gs", __sld_rbf(), "mul", d=True, pg="sld_rbf", search="grid", gf=True),
]
__kde_lr_set = [
# base kde
M("bin_kde_lr", __kde_lr(), "bin" ),
M("mul_kde_lr", __kde_lr(), "mul" ),
M("m3w_kde_lr", __kde_lr(), "mul", cf=True),
# max_conf kde
M("bin_kde_lr_mc", __kde_lr(), "bin", conf="max_conf", ),
M("mul_kde_lr_mc", __kde_lr(), "mul", conf="max_conf", ),
M("m3w_kde_lr_mc", __kde_lr(), "mul", conf="max_conf", cf=True),
# entropy kde
M("bin_kde_lr_ne", __kde_lr(), "bin", conf="entropy", ),
M("mul_kde_lr_ne", __kde_lr(), "mul", conf="entropy", ),
M("m3w_kde_lr_ne", __kde_lr(), "mul", conf="entropy", cf=True),
# inverse softmax kde
M("bin_kde_lr_is", __kde_lr(), "bin", conf="isoft", ),
M("mul_kde_lr_is", __kde_lr(), "mul", conf="isoft", ),
M("m3w_kde_lr_is", __kde_lr(), "mul", conf="isoft", cf=True),
# max_conf + entropy kde
M("bin_kde_lr_c", __kde_lr(), "bin", conf=["max_conf", "entropy"] ),
M("mul_kde_lr_c", __kde_lr(), "mul", conf=["max_conf", "entropy"] ),
M("m3w_kde_lr_c", __kde_lr(), "mul", conf=["max_conf", "entropy"], cf=True),
# kde all
M("bin_kde_lr_a", __kde_lr(), "bin", conf=["max_conf", "entropy", "isoft"], ),
M("mul_kde_lr_a", __kde_lr(), "mul", conf=["max_conf", "entropy", "isoft"], ),
M("m3w_kde_lr_a", __kde_lr(), "mul", conf=["max_conf", "entropy", "isoft"], cf=True),
# gs kde
G("bin_kde_lr_gs", __kde_lr(), "bin", pg="kde_lr", search="grid" ),
G("mul_kde_lr_gs", __kde_lr(), "mul", pg="kde_lr", search="grid" ),
G("m3w_kde_lr_gs", __kde_lr(), "mul", pg="kde_lr", search="grid", cf=True),
]
__dense_kde_lr_set = [
# base kde
M("d_bin_kde_lr", __kde_lr(), "bin", d=True, ),
M("d_mul_kde_lr", __kde_lr(), "mul", d=True, ),
M("d_m3w_kde_lr", __kde_lr(), "mul", d=True, cf=True),
# max_conf kde
M("d_bin_kde_lr_mc", __kde_lr(), "bin", d=True, conf="max_conf", ),
M("d_mul_kde_lr_mc", __kde_lr(), "mul", d=True, conf="max_conf", ),
M("d_m3w_kde_lr_mc", __kde_lr(), "mul", d=True, conf="max_conf", cf=True),
# entropy kde
M("d_bin_kde_lr_ne", __kde_lr(), "bin", d=True, conf="entropy", ),
M("d_mul_kde_lr_ne", __kde_lr(), "mul", d=True, conf="entropy", ),
M("d_m3w_kde_lr_ne", __kde_lr(), "mul", d=True, conf="entropy", cf=True),
# inverse softmax kde
M("d_bin_kde_lr_is", __kde_lr(), "bin", d=True, conf="isoft", ),
M("d_mul_kde_lr_is", __kde_lr(), "mul", d=True, conf="isoft", ),
M("d_m3w_kde_lr_is", __kde_lr(), "mul", d=True, conf="isoft", cf=True),
# max_conf + entropy kde
M("d_bin_kde_lr_c", __kde_lr(), "bin", d=True, conf=["max_conf", "entropy"] ),
M("d_mul_kde_lr_c", __kde_lr(), "mul", d=True, conf=["max_conf", "entropy"] ),
M("d_m3w_kde_lr_c", __kde_lr(), "mul", d=True, conf=["max_conf", "entropy"], cf=True),
# kde all
M("d_bin_kde_lr_a", __kde_lr(), "bin", d=True, conf=["max_conf", "entropy", "isoft"], ),
M("d_mul_kde_lr_a", __kde_lr(), "mul", d=True, conf=["max_conf", "entropy", "isoft"], ),
M("d_m3w_kde_lr_a", __kde_lr(), "mul", d=True, conf=["max_conf", "entropy", "isoft"], cf=True),
# gs kde
G("d_bin_kde_lr_gs", __kde_lr(), "bin", d=True, pg="kde_lr", search="grid" ),
G("d_mul_kde_lr_gs", __kde_lr(), "mul", d=True, pg="kde_lr", search="grid" ),
G("d_m3w_kde_lr_gs", __kde_lr(), "mul", d=True, pg="kde_lr", search="grid", cf=True),
]
__dense_kde_rbf_set = [
# base kde
M("d_bin_kde_rbf", __kde_rbf(), "bin", d=True, ),
M("d_mul_kde_rbf", __kde_rbf(), "mul", d=True, ),
M("d_m3w_kde_rbf", __kde_rbf(), "mul", d=True, cf=True),
# max_conf kde
M("d_bin_kde_rbf_mc", __kde_rbf(), "bin", d=True, conf="max_conf", ),
M("d_mul_kde_rbf_mc", __kde_rbf(), "mul", d=True, conf="max_conf", ),
M("d_m3w_kde_rbf_mc", __kde_rbf(), "mul", d=True, conf="max_conf", cf=True),
# entropy kde
M("d_bin_kde_rbf_ne", __kde_rbf(), "bin", d=True, conf="entropy", ),
M("d_mul_kde_rbf_ne", __kde_rbf(), "mul", d=True, conf="entropy", ),
M("d_m3w_kde_rbf_ne", __kde_rbf(), "mul", d=True, conf="entropy", cf=True),
# inverse softmax kde
M("d_bin_kde_rbf_is", __kde_rbf(), "bin", d=True, conf="isoft", ),
M("d_mul_kde_rbf_is", __kde_rbf(), "mul", d=True, conf="isoft", ),
M("d_m3w_kde_rbf_is", __kde_rbf(), "mul", d=True, conf="isoft", cf=True),
# max_conf + entropy kde
M("d_bin_kde_rbf_c", __kde_rbf(), "bin", d=True, conf=["max_conf", "entropy"] ),
M("d_mul_kde_rbf_c", __kde_rbf(), "mul", d=True, conf=["max_conf", "entropy"] ),
M("d_m3w_kde_rbf_c", __kde_rbf(), "mul", d=True, conf=["max_conf", "entropy"], cf=True),
# kde all
M("d_bin_kde_rbf_a", __kde_rbf(), "bin", d=True, conf=["max_conf", "entropy", "isoft"], ),
M("d_mul_kde_rbf_a", __kde_rbf(), "mul", d=True, conf=["max_conf", "entropy", "isoft"], ),
M("d_m3w_kde_rbf_a", __kde_rbf(), "mul", d=True, conf=["max_conf", "entropy", "isoft"], cf=True),
# gs kde
G("d_bin_kde_rbf_gs", __kde_rbf(), "bin", d=True, pg="kde_rbf", search="spider" ),
G("d_mul_kde_rbf_gs", __kde_rbf(), "mul", d=True, pg="kde_rbf", search="spider" ),
G("d_m3w_kde_rbf_gs", __kde_rbf(), "mul", d=True, pg="kde_rbf", search="spider", cf=True),
]
__cc_lr_set = [
# base cc
M("bin_cc_lr", __cc_lr(), "bin" ),
M("mul_cc_lr", __cc_lr(), "mul" ),
M("m3w_cc_lr", __cc_lr(), "mul", cf=True),
# max_conf cc
M("bin_cc_lr_mc", __cc_lr(), "bin", conf="max_conf", ),
M("mul_cc_lr_mc", __cc_lr(), "mul", conf="max_conf", ),
M("m3w_cc_lr_mc", __cc_lr(), "mul", conf="max_conf", cf=True),
# entropy cc
M("bin_cc_lr_ne", __cc_lr(), "bin", conf="entropy", ),
M("mul_cc_lr_ne", __cc_lr(), "mul", conf="entropy", ),
M("m3w_cc_lr_ne", __cc_lr(), "mul", conf="entropy", cf=True),
# inverse softmax cc
M("bin_cc_lr_is", __cc_lr(), "bin", conf="isoft", ),
M("mul_cc_lr_is", __cc_lr(), "mul", conf="isoft", ),
M("m3w_cc_lr_is", __cc_lr(), "mul", conf="isoft", cf=True),
# max_conf + entropy cc
M("bin_cc_lr_c", __cc_lr(), "bin", conf=["max_conf", "entropy"] ),
M("mul_cc_lr_c", __cc_lr(), "mul", conf=["max_conf", "entropy"] ),
M("m3w_cc_lr_c", __cc_lr(), "mul", conf=["max_conf", "entropy"], cf=True),
# cc all
M("bin_cc_lr_a", __cc_lr(), "bin", conf=["max_conf", "entropy", "isoft"], ),
M("mul_cc_lr_a", __cc_lr(), "mul", conf=["max_conf", "entropy", "isoft"], ),
M("m3w_cc_lr_a", __cc_lr(), "mul", conf=["max_conf", "entropy", "isoft"], cf=True),
# gs cc
G("bin_cc_lr_gs", __cc_lr(), "bin", pg="cc_lr", search="grid" ),
G("mul_cc_lr_gs", __cc_lr(), "mul", pg="cc_lr", search="grid" ),
G("m3w_cc_lr_gs", __cc_lr(), "mul", pg="cc_lr", search="grid", cf=True),
]
__ms_set = [
E("cc_lr_gs"),
E("sld_lr_gs"),
E("kde_lr_gs"),
E("QuAcc"),
]
# fmt: on
__methods_set = (
__sld_lr_set
+ __dense_sld_lr_set
+ __dense_sld_rbf_set
+ __kde_lr_set
+ __dense_kde_lr_set
+ __dense_kde_rbf_set
+ __cc_lr_set
+ __ms_set
)
_methods = {m.name: m for m in __methods_set}

View File

@ -1,956 +0,0 @@
import json
import pickle
from collections import defaultdict
from pathlib import Path
from typing import List, Tuple
import numpy as np
import pandas as pd
import quacc as qc
import quacc.plot as plot
from quacc.utils import fmt_line_md
def _get_metric(metric: str):
return slice(None) if metric is None else metric
def _get_estimators(estimators: List[str], cols: np.ndarray):
if estimators is None:
return slice(None)
estimators = np.array(estimators)
return estimators[np.isin(estimators, cols)]
def _get_shift(index: np.ndarray, train_prev: np.ndarray):
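# Shift between each test prevalence and the training prevalence, measured as the
# normalized absolute error (qc.error.nae) and rounded to two decimals.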
index = np.array([np.array(tp) for tp in index])
train_prevs = np.tile(train_prev, (index.shape[0], 1))
# assert index.shape[1] == train_prev.shape[0], "Mismatch in prevalence shape"
# _shift = np.abs(index - train_prev)[:, 1:].sum(axis=1)
_shift = qc.error.nae(index, train_prevs)
return np.around(_shift, decimals=2)
class EvaluationReport:
def __init__(self, name=None):
self.data: pd.DataFrame | None = None
self.name = name if name is not None else "default"
self.time = 0.0
self.fit_score = None
def append_row(self, basep: np.ndarray | Tuple, **row):
# bp = basep[1]
bp = tuple(basep)
_keys, _values = zip(*row.items())
# _keys = list(row.keys())
# _values = list(row.values())
if self.data is None:
_idx = 0
self.data = pd.DataFrame(
{k: [v] for k, v in row.items()},
index=pd.MultiIndex.from_tuples([(bp, _idx)]),
columns=_keys,
)
return
_idx = len(self.data.loc[(bp,), :]) if (bp,) in self.data.index else 0
not_in_data = np.setdiff1d(list(row.keys()), self.data.columns.unique(0))
self.data.loc[:, not_in_data] = np.nan
self.data.loc[(bp, _idx), :] = row
return
@property
def columns(self) -> np.ndarray:
return self.data.columns.unique(0)
@property
def prevs(self):
return np.sort(self.data.index.unique(0))
class CompReport:
_default_modes = [
"delta_train",
"stdev_train",
"train_table",
"shift",
"shift_table",
"diagonal",
"stats_table",
]
def __init__(
self,
datas: List[EvaluationReport] | pd.DataFrame,
name="default",
train_prev: np.ndarray = None,
valid_prev: np.ndarray = None,
times=None,
fit_scores=None,
g_time=None,
):
if isinstance(datas, pd.DataFrame):
self._data: pd.DataFrame = datas
else:
self._data: pd.DataFrame = (
pd.concat(
[er.data for er in datas],
keys=[er.name for er in datas],
axis=1,
)
.swaplevel(0, 1, axis=1)
.sort_index(axis=1, level=0, sort_remaining=False)
.sort_index(axis=0, level=0, ascending=False, sort_remaining=False)
)
if fit_scores is None:
self.fit_scores = {
er.name: er.fit_score for er in datas if er.fit_score is not None
}
else:
self.fit_scores = fit_scores
if times is None:
self.times = {er.name: er.time for er in datas}
else:
self.times = times
self.times["tot"] = g_time if g_time is not None else 0.0
self.train_prev = train_prev
self.valid_prev = valid_prev
def postprocess(
self,
f_data: pd.DataFrame,
_data: pd.DataFrame,
metric=None,
estimators=None,
) -> pd.DataFrame:
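# Collapse grouped grid-search variants (e.g. sld_lr_gs) into a single column by
# picking, among the available bin/mul/m3w variants, the one with the lowest fit score.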
_mapping = {
"sld_lr_gs": [
"bin_sld_lr_gs",
"mul_sld_lr_gs",
"m3w_sld_lr_gs",
],
"kde_lr_gs": [
"bin_kde_lr_gs",
"mul_kde_lr_gs",
"m3w_kde_lr_gs",
],
"cc_lr_gs": [
"bin_cc_lr_gs",
"mul_cc_lr_gs",
"m3w_cc_lr_gs",
],
"QuAcc": [
"bin_sld_lr_gs",
"mul_sld_lr_gs",
"m3w_sld_lr_gs",
"bin_kde_lr_gs",
"mul_kde_lr_gs",
"m3w_kde_lr_gs",
],
}
for name, methods in _mapping.items():
if estimators is not None and name not in estimators:
continue
available_idx = np.where(np.in1d(methods, self._data.columns.unique(1)))[0]
if len(available_idx) == 0:
continue
methods = np.array(methods)[available_idx]
_metric = _get_metric(metric)
m_data = _data.loc[:, (_metric, methods)]
_fit_scores = [(k, v) for (k, v) in self.fit_scores.items() if k in methods]
_best_method = [k for k, v in _fit_scores][
np.argmin([v for k, v in _fit_scores])
]
_metric = (
[_metric]
if isinstance(_metric, str)
else m_data.columns.unique(0)
)
for _m in _metric:
f_data.loc[:, (_m, name)] = m_data.loc[:, (_m, _best_method)]
return f_data
@property
def prevs(self) -> np.ndarray:
return self.data().index.unique(0)
def join(self, other, how="update", estimators=None):
if how not in ["update"]:
how = "update"
if not (self.train_prev == other.train_prev).all():
raise ValueError(
f"self has train prev. {self.train_prev} while other has {other.train_prev}"
)
self_data = self.data(estimators=estimators)
other_data = other.data(estimators=estimators)
if not (self_data.index == other_data.index).all():
raise ValueError("self and other have different indexes")
update_col = self_data.columns.intersection(other_data.columns)
other_join_col = other_data.columns.difference(update_col)
_join = pd.concat(
[self_data, other_data.loc[:, other_join_col.to_list()]],
axis=1,
)
_join.loc[:, update_col.to_list()] = other_data.loc[:, update_col.to_list()]
_join.sort_index(axis=1, level=0, sort_remaining=False, inplace=True)
df = CompReport(
_join,
self.name if hasattr(self, "name") else "default",
train_prev=self.train_prev,
valid_prev=self.valid_prev,
times=self.times | other.times,
fit_scores=self.fit_scores | other.fit_scores,
g_time=self.times["tot"] + other.times["tot"],
)
return df
def data(self, metric: str = None, estimators: List[str] = None) -> pd.DataFrame:
_metric = _get_metric(metric)
_estimators = _get_estimators(
estimators, self._data.loc[:, (_metric, slice(None))].columns.unique(1)
)
_data: pd.DataFrame = self._data.copy()
f_data: pd.DataFrame = _data.loc[:, (_metric, _estimators)]
f_data = self.postprocess(f_data, _data, metric=metric, estimators=estimators)
if len(f_data.columns.unique(0)) == 1:
f_data = f_data.droplevel(level=0, axis=1)
return f_data
def shift_data(
self, metric: str = None, estimators: List[str] = None
) -> pd.DataFrame:
shift_idx_0 = _get_shift(
self._data.index.get_level_values(0).to_numpy(),
self.train_prev,
)
shift_idx_1 = np.zeros(shape=shift_idx_0.shape[0], dtype="<i4")
for _id in np.unique(shift_idx_0):
_wh = (shift_idx_0 == _id).nonzero()[0]
shift_idx_1[_wh] = np.arange(_wh.shape[0], dtype="<i4")
shift_data = self._data.copy()
shift_data.index = pd.MultiIndex.from_arrays([shift_idx_0, shift_idx_1])
shift_data = shift_data.sort_index(axis=0, level=0)
_metric = _get_metric(metric)
_estimators = _get_estimators(
estimators, shift_data.loc[:, (_metric, slice(None))].columns.unique(1)
)
s_data: pd.DataFrame = shift_data
shift_data: pd.DataFrame = shift_data.loc[:, (_metric, _estimators)]
shift_data = self.postprocess(
shift_data, s_data, metric=metric, estimators=estimators
)
if len(shift_data.columns.unique(0)) == 1:
shift_data = shift_data.droplevel(level=0, axis=1)
return shift_data
def avg_by_prevs(
self, metric: str = None, estimators: List[str] = None
) -> pd.DataFrame:
f_dict = self.data(metric=metric, estimators=estimators)
return f_dict.groupby(level=0, sort=False).mean()
def stdev_by_prevs(
self, metric: str = None, estimators: List[str] = None
) -> pd.DataFrame:
f_dict = self.data(metric=metric, estimators=estimators)
return f_dict.groupby(level=0, sort=False).std()
def train_table(
self, metric: str = None, estimators: List[str] = None
) -> pd.DataFrame:
f_data = self.data(metric=metric, estimators=estimators)
avg_p = f_data.groupby(level=0, sort=False).mean()
avg_p.loc["mean", :] = f_data.mean()
return avg_p
def shift_table(
self, metric: str = None, estimators: List[str] = None
) -> pd.DataFrame:
f_data = self.shift_data(metric=metric, estimators=estimators)
avg_p = f_data.groupby(level=0, sort=False).mean()
avg_p.loc["mean", :] = f_data.mean()
return avg_p
def get_plots(
self,
mode="delta_train",
metric="acc",
estimators=None,
conf="default",
save_fig=True,
base_path=None,
backend=None,
) -> List[Tuple[str, Path]]:
if mode == "delta_train":
avg_data = self.avg_by_prevs(metric=metric, estimators=estimators)
if avg_data.empty:
return None
return plot.plot_delta(
base_prevs=self.prevs,
columns=avg_data.columns.to_numpy(),
data=avg_data.T.to_numpy(),
metric=metric,
name=conf,
train_prev=self.train_prev,
save_fig=save_fig,
base_path=base_path,
backend=backend,
)
elif mode == "stdev_train":
avg_data = self.avg_by_prevs(metric=metric, estimators=estimators)
if avg_data.empty is True:
return None
st_data = self.stdev_by_prevs(metric=metric, estimators=estimators)
return plot.plot_delta(
base_prevs=self.prevs,
columns=avg_data.columns.to_numpy(),
data=avg_data.T.to_numpy(),
metric=metric,
name=conf,
train_prev=self.train_prev,
stdevs=st_data.T.to_numpy(),
save_fig=save_fig,
base_path=base_path,
backend=backend,
)
elif mode == "diagonal":
f_data = self.data(metric=metric + "_score", estimators=estimators)
if f_data.empty is True:
return None
ref: pd.Series = f_data.loc[:, "ref"]
f_data.drop(columns=["ref"], inplace=True)
return plot.plot_diagonal(
reference=ref.to_numpy(),
columns=f_data.columns.to_numpy(),
data=f_data.T.to_numpy(),
metric=metric,
name=conf,
train_prev=self.train_prev,
save_fig=save_fig,
base_path=base_path,
backend=backend,
)
elif mode == "shift":
_shift_data = self.shift_data(metric=metric, estimators=estimators)
if _shift_data.empty is True:
return None
shift_avg = _shift_data.groupby(level=0, sort=False).mean()
shift_counts = _shift_data.groupby(level=0, sort=False).count()
shift_prevs = shift_avg.index.unique(0)
# shift_prevs = np.around(
# [(1.0 - p, p) for p in np.sort(shift_avg.index.unique(0))],
# decimals=2,
# )
return plot.plot_shift(
shift_prevs=shift_prevs,
columns=shift_avg.columns.to_numpy(),
data=shift_avg.T.to_numpy(),
metric=metric,
name=conf,
train_prev=self.train_prev,
counts=shift_counts.T.to_numpy(),
save_fig=save_fig,
base_path=base_path,
backend=backend,
)
def to_md(
self,
conf="default",
metric="acc",
estimators=None,
modes=_default_modes,
plot_path=None,
) -> str:
res = f"## {int(np.around(self.train_prev, decimals=2)[1]*100)}% positives\n"
res += fmt_line_md(f"train: {str(self.train_prev)}")
res += fmt_line_md(f"validation: {str(self.valid_prev)}")
for k, v in self.times.items():
if estimators is not None and k not in estimators:
continue
res += fmt_line_md(f"{k}: {v:.3f}s")
res += "\n"
if "train_table" in modes:
res += "### table\n"
res += (
self.train_table(metric=metric, estimators=estimators).to_html()
+ "\n\n"
)
if "shift_table" in modes:
res += "### shift table\n"
res += (
self.shift_table(metric=metric, estimators=estimators).to_html()
+ "\n\n"
)
plot_modes = [m for m in modes if not m.endswith("table")]
for mode in plot_modes:
res += f"### {mode}\n"
_, op = self.get_plots(
mode=mode,
metric=metric,
estimators=estimators,
conf=conf,
save_fig=True,
base_path=plot_path,
)
res += f"![plot_{mode}]({op.relative_to(op.parents[1]).as_posix()})\n"
return res
def _cr_train_prev(cr: CompReport):
return tuple(np.around(cr.train_prev, decimals=2))
def _cr_data(cr: CompReport, metric=None, estimators=None):
return cr.data(metric, estimators)
def _key_reverse_delta_train(idx):
idx = idx.to_numpy()
sorted_idx = np.array(
sorted(list(idx), key=lambda x: x[-1]), dtype=("float," * len(idx[0]))[:-1]
)
# get sorting index
nparr = np.nonzero(idx[:, None] == sorted_idx)[1]
return nparr
class DatasetReport:
_default_dr_modes = [
"delta_train",
"stdev_train",
"train_table",
"train_std_table",
"shift",
"shift_table",
"delta_test",
"stdev_test",
"test_table",
"diagonal",
"stats_table",
"fit_scores",
]
_default_cr_modes = CompReport._default_modes
def __init__(self, name, crs=None):
self.name = name
self.crs: List[CompReport] = [] if crs is None else crs
def sort_delta_train_index(self, data):
# data_ = data.sort_index(axis=0, level=0, ascending=True, sort_remaining=False)
data_ = data.sort_index(
axis=0,
level=0,
key=_key_reverse_delta_train,
)
print(data_.index)
return data_
def join(self, other, estimators=None):
_crs = [
s_cr.join(o_cr, estimators=estimators)
for s_cr, o_cr in zip(self.crs, other.crs)
]
return DatasetReport(self.name, _crs)
def fit_scores(self, metric: str = None, estimators: List[str] = None):
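# For each training prevalence, measure how far the estimator selected by model
# selection (lowest fit score) falls from the best-performing estimator per sample.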
def _get_sort_idx(arr):
return np.array([np.searchsorted(np.sort(a), a) + 1 for a in arr])
def _get_best_idx(arr):
return np.argmin(arr, axis=1)
def _fdata_idx(idx) -> np.ndarray:
return _fdata.loc[(idx, slice(None), slice(None)), :].to_numpy()
_crs_train = [_cr_train_prev(cr) for cr in self.crs]
for cr in self.crs:
if not hasattr(cr, "fit_scores"):
return None
_crs_fit_scores = [cr.fit_scores for cr in self.crs]
_fit_scores = pd.DataFrame(_crs_fit_scores, index=_crs_train)
_fit_scores = _fit_scores.sort_index(axis=0, ascending=False)
_estimators = _get_estimators(estimators, _fit_scores.columns)
if _estimators.shape[0] == 0:
return None
_fdata = self.data(metric=metric, estimators=_estimators)
# ensure that columns in _fit_scores have the same ordering of _fdata
_fit_scores = _fit_scores.loc[:, _fdata.columns]
_best_fit_estimators = _get_best_idx(_fit_scores.to_numpy())
# scores = np.array(
# [
# _get_sort_idx(
# _fdata.loc[(idx, slice(None), slice(None)), :].to_numpy()
# )[:, cl].mean()
# for idx, cl in zip(_fit_scores.index, _best_fit_estimators)
# ]
# )
# for idx, cl in zip(_fit_scores.index, _best_fit_estimators):
# print(_fdata_idx(idx)[:, cl])
# print(_fdata_idx(idx).min(axis=1), end="\n\n")
scores = np.array(
[
np.abs(_fdata_idx(idx)[:, cl] - _fdata_idx(idx).min(axis=1)).mean()
for idx, cl in zip(_fit_scores.index, _best_fit_estimators)
]
)
return scores
def data(self, metric: str = None, estimators: List[str] = None) -> pd.DataFrame:
_crs_sorted = sorted(
[(_cr_train_prev(cr), _cr_data(cr, metric, estimators)) for cr in self.crs],
key=lambda cr: len(cr[1].columns),
reverse=True,
)
_crs_train, _crs_data = zip(*_crs_sorted)
_data: pd.DataFrame = pd.concat(
_crs_data,
axis=0,
keys=_crs_train,
)
# The MultiIndex is recreated to make the outer-most level a tuple and not a
# sequence of values
_len_tr_idx = len(_crs_train[0])
_idx = _data.index.to_list()
_idx = pd.MultiIndex.from_tuples(
[tuple([midx[:_len_tr_idx]] + list(midx[_len_tr_idx:])) for midx in _idx]
)
_data.index = _idx
_data = _data.sort_index(axis=0, level=0, ascending=False, sort_remaining=False)
return _data
def shift_data(
self, metric: str = None, estimators: List[str] = None
) -> pd.DataFrame:
_shift_data: pd.DataFrame = pd.concat(
sorted(
[cr.shift_data(metric, estimators) for cr in self.crs],
key=lambda d: len(d.columns),
reverse=True,
),
axis=0,
)
shift_idx_0 = _shift_data.index.get_level_values(0)
shift_idx_1 = np.empty(shape=shift_idx_0.shape, dtype="<i4")
for _id in np.unique(shift_idx_0):
_wh = np.where(shift_idx_0 == _id)[0]
shift_idx_1[_wh] = np.arange(_wh.shape[0])
_shift_data.index = pd.MultiIndex.from_arrays([shift_idx_0, shift_idx_1])
_shift_data = _shift_data.sort_index(axis=0, level=0)
return _shift_data
def add(self, cr: CompReport):
if cr is None:
return
self.crs.append(cr)
def __add__(self, cr: CompReport):
if cr is None:
return
return DatasetReport(self.name, crs=self.crs + [cr])
def __iadd__(self, cr: CompReport):
self.add(cr)
return self
def train_table(
self, metric: str = None, estimators: List[str] = None
) -> pd.DataFrame:
f_data = self.data(metric=metric, estimators=estimators)
avg_p = f_data.groupby(level=1, sort=False).mean()
avg_p.loc["mean", :] = f_data.mean()
return avg_p
def train_std_table(self, metric: str = None, estimators: List[str] = None):
f_data = self.data(metric=metric, estimators=estimators)
avg_p = f_data.groupby(level=1, sort=False).mean()
avg_p.loc["mean", :] = f_data.mean()
avg_s = f_data.groupby(level=1, sort=False).std()
avg_s.loc["mean", :] = f_data.std()
avg_r = pd.concat([avg_p, avg_s], axis=1, keys=["avg", "std"])
return avg_r
def test_table(
self, metric: str = None, estimators: List[str] = None
) -> pd.DataFrame:
f_data = self.data(metric=metric, estimators=estimators)
avg_p = f_data.groupby(level=0, sort=False).mean()
avg_p.loc["mean", :] = f_data.mean()
return avg_p
def shift_table(
self, metric: str = None, estimators: List[str] = None
) -> pd.DataFrame:
f_data = self.shift_data(metric=metric, estimators=estimators)
avg_p = f_data.groupby(level=0, sort=False).mean()
avg_p.loc["mean", :] = f_data.mean()
return avg_p
def get_plots(
self,
data=None,
mode="delta_train",
metric="acc",
estimators=None,
conf="default",
save_fig=True,
base_path=None,
backend=None,
):
if mode == "delta_train":
_data = self.data(metric, estimators) if data is None else data
avg_on_train = _data.groupby(level=1, sort=False).mean()
if avg_on_train.empty:
return None
# sort index in reverse order
avg_on_train = self.sort_delta_train_index(avg_on_train)
prevs_on_train = avg_on_train.index.unique(0)
return plot.plot_delta(
# base_prevs=np.around(
# [(1.0 - p, p) for p in prevs_on_train], decimals=2
# ),
base_prevs=prevs_on_train,
columns=avg_on_train.columns.to_numpy(),
data=avg_on_train.T.to_numpy(),
metric=metric,
name=conf,
train_prev=None,
avg="train",
save_fig=save_fig,
base_path=base_path,
backend=backend,
)
elif mode == "stdev_train":
_data = self.data(metric, estimators) if data is None else data
avg_on_train = _data.groupby(level=1, sort=False).mean()
if avg_on_train.empty:
return None
prevs_on_train = avg_on_train.index.unique(0)
stdev_on_train = _data.groupby(level=1, sort=False).std()
return plot.plot_delta(
# base_prevs=np.around(
# [(1.0 - p, p) for p in prevs_on_train], decimals=2
# ),
base_prevs=prevs_on_train,
columns=avg_on_train.columns.to_numpy(),
data=avg_on_train.T.to_numpy(),
metric=metric,
name=conf,
train_prev=None,
stdevs=stdev_on_train.T.to_numpy(),
avg="train",
save_fig=save_fig,
base_path=base_path,
backend=backend,
)
elif mode == "delta_test":
_data = self.data(metric, estimators) if data is None else data
avg_on_test = _data.groupby(level=0, sort=False).mean()
if avg_on_test.empty:
return None
prevs_on_test = avg_on_test.index.unique(0)
return plot.plot_delta(
# base_prevs=np.around([(1.0 - p, p) for p in prevs_on_test], decimals=2),
base_prevs=prevs_on_test,
columns=avg_on_test.columns.to_numpy(),
data=avg_on_test.T.to_numpy(),
metric=metric,
name=conf,
train_prev=None,
avg="test",
save_fig=save_fig,
base_path=base_path,
backend=backend,
)
elif mode == "stdev_test":
_data = self.data(metric, estimators) if data is None else data
avg_on_test = _data.groupby(level=0, sort=False).mean()
if avg_on_test.empty:
return None
prevs_on_test = avg_on_test.index.unique(0)
stdev_on_test = _data.groupby(level=0, sort=False).std()
return plot.plot_delta(
# base_prevs=np.around([(1.0 - p, p) for p in prevs_on_test], decimals=2),
base_prevs=prevs_on_test,
columns=avg_on_test.columns.to_numpy(),
data=avg_on_test.T.to_numpy(),
metric=metric,
name=conf,
train_prev=None,
stdevs=stdev_on_test.T.to_numpy(),
avg="test",
save_fig=save_fig,
base_path=base_path,
backend=backend,
)
elif mode == "shift":
_shift_data = self.shift_data(metric, estimators) if data is None else data
avg_shift = _shift_data.groupby(level=0, sort=False).mean()
if avg_shift.empty:
return None
count_shift = _shift_data.groupby(level=0, sort=False).count()
prevs_shift = avg_shift.index.unique(0)
return plot.plot_shift(
# shift_prevs=np.around([(1.0 - p, p) for p in prevs_shift], decimals=2),
shift_prevs=prevs_shift,
columns=avg_shift.columns.to_numpy(),
data=avg_shift.T.to_numpy(),
metric=metric,
name=conf,
train_prev=None,
counts=count_shift.T.to_numpy(),
save_fig=save_fig,
base_path=base_path,
backend=backend,
)
elif mode == "fit_scores":
_fit_scores = self.fit_scores(metric, estimators) if data is None else data
if _fit_scores is None:
return None
train_prevs = self.data(metric, estimators).index.unique(0)
return plot.plot_fit_scores(
train_prevs=train_prevs,
scores=_fit_scores,
metric=metric,
name=conf,
save_fig=save_fig,
base_path=base_path,
backend=backend,
)
elif mode == "diagonal":
f_data = self.data(metric=metric + "_score", estimators=estimators)
if f_data.empty:
return None
ref: pd.Series = f_data.loc[:, "ref"]
f_data.drop(columns=["ref"], inplace=True)
return plot.plot_diagonal(
reference=ref.to_numpy(),
columns=f_data.columns.to_numpy(),
data=f_data.T.to_numpy(),
metric=metric,
name=conf,
# train_prev=self.train_prev,
fixed_lim=True,
save_fig=save_fig,
base_path=base_path,
backend=backend,
)
def to_md(
self,
conf="default",
metric="acc",
estimators=[],
dr_modes=_default_dr_modes,
cr_modes=_default_cr_modes,
cr_prevs: List[str] = None,
plot_path=None,
):
res = f"# {self.name}\n\n"
for cr in self.crs:
if (
cr_prevs is not None
and str(round(cr.train_prev[1] * 100)) not in cr_prevs
):
continue
_md = cr.to_md(
conf,
metric=metric,
estimators=estimators,
modes=cr_modes,
plot_path=plot_path,
)
res += f"{_md}\n\n"
_data = self.data(metric=metric, estimators=estimators)
_shift_data = self.shift_data(metric=metric, estimators=estimators)
res += "## avg\n"
######################## avg on train ########################
res += "### avg on train\n"
if "train_table" in dr_modes:
avg_on_train_tbl = _data.groupby(level=1, sort=False).mean()
avg_on_train_tbl.loc["avg", :] = _data.mean()
res += avg_on_train_tbl.to_html() + "\n\n"
if "delta_train" in dr_modes:
_, delta_op = self.get_plots(
data=_data,
mode="delta_train",
metric=metric,
estimators=estimators,
conf=conf,
base_path=plot_path,
save_fig=True,
)
_op = delta_op.relative_to(delta_op.parents[1]).as_posix()
res += f"![plot_delta]({_op})\n"
if "stdev_train" in dr_modes:
_, delta_stdev_op = self.get_plots(
data=_data,
mode="stdev_train",
metric=metric,
estimators=estimators,
conf=conf,
base_path=plot_path,
save_fig=True,
)
_op = delta_stdev_op.relative_to(delta_stdev_op.parents[1]).as_posix()
res += f"![plot_delta_stdev]({_op})\n"
######################## avg on test ########################
res += "### avg on test\n"
if "test_table" in dr_modes:
avg_on_test_tbl = _data.groupby(level=0, sort=False).mean()
avg_on_test_tbl.loc["avg", :] = _data.mean()
res += avg_on_test_tbl.to_html() + "\n\n"
if "delta_test" in dr_modes:
_, delta_op = self.get_plots(
data=_data,
mode="delta_test",
metric=metric,
estimators=estimators,
conf=conf,
base_path=plot_path,
save_fig=True,
)
_op = delta_op.relative_to(delta_op.parents[1]).as_posix()
res += f"![plot_delta]({_op})\n"
if "stdev_test" in dr_modes:
_, delta_stdev_op = self.get_plots(
data=_data,
mode="stdev_test",
metric=metric,
estimators=estimators,
conf=conf,
base_path=plot_path,
save_fig=True,
)
_op = delta_stdev_op.relative_to(delta_stdev_op.parents[1]).as_posix()
res += f"![plot_delta_stdev]({_op})\n"
######################## avg shift ########################
res += "### avg dataset shift\n"
if "shift_table" in dr_modes:
shift_on_train_tbl = _shift_data.groupby(level=0, sort=False).mean()
shift_on_train_tbl.loc["avg", :] = _shift_data.mean()
res += shift_on_train_tbl.to_html() + "\n\n"
if "shift" in dr_modes:
_, shift_op = self.get_plots(
data=_shift_data,
mode="shift",
metric=metric,
estimators=estimators,
conf=conf,
base_path=plot_path,
save_fig=True,
)
_op = shift_op.relative_to(shift_op.parents[1]).as_posix()
res += f"![plot_shift]({_op})\n"
return res
def pickle(self, pickle_path: Path):
with open(pickle_path, "wb") as f:
pickle.dump(self, f)
return self
@classmethod
def unpickle(cls, pickle_path: Path, report_info=False):
with open(pickle_path, "rb") as f:
dr = pickle.load(f)
if report_info:
return DatasetReportInfo(dr, pickle_path)
return dr
def __iter__(self):
return (cr for cr in self.crs)
class DatasetReportInfo:
def __init__(self, dr: DatasetReport, path: Path):
self.dr = dr
self.name = str(path.parent)
_data = dr.data()
self.columns = defaultdict(list)
for metric, estim in _data.columns:
self.columns[estim].append(metric)
# self.columns = list(_data.columns.unique(1))
self.train_prevs = len(self.dr.crs)
self.test_prevs = len(_data.index.unique(1))
self.repeats = len(_data.index.unique(2))
def __repr__(self) -> str:
_d = {
"train prevs.": self.train_prevs,
"test prevs.": self.test_prevs,
"repeats": self.repeats,
"columns": self.columns,
}
_r = f"{self.name}\n{json.dumps(_d, indent=2)}\n"
return _r
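A minimal usage sketch of the report API above; the paths and the metric name are placeholders, not values taken from this repository.

from pathlib import Path
from quacc.evaluation.report import DatasetReport

# hypothetical output location of a previous run
dr = DatasetReport.unpickle(Path("output/imdb/imdb.pickle"))
# per-test-prevalence and per-shift averages of the error metric
acc_tbl = dr.test_table(metric="acc")
shift_tbl = dr.shift_table(metric="acc")
# full markdown report, saving plots under the given directory
md = dr.to_md(conf="default", metric="acc", plot_path=Path("output/imdb/plots"))
Path("output/imdb/report.md").write_text(md)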

View File

@@ -1,41 +0,0 @@
from typing import List
import numpy as np
import pandas as pd
from scipy import stats as sp_stats
# from quacc.evaluation.estimators import CE
from quacc.evaluation.report import CompReport, DatasetReport
def shapiro(
r: DatasetReport | CompReport, metric: str = None, estimators: List[str] = None
) -> pd.DataFrame:
_data = r.data(metric, estimators)
shapiro_data = np.array(
[sp_stats.shapiro(_data.loc[:, e]) for e in _data.columns.unique(0)]
).T
dr_index = ["shapiro_W", "shapiro_p"]
dr_columns = _data.columns.unique(0)
return pd.DataFrame(shapiro_data, columns=dr_columns, index=dr_index)
def wilcoxon(
r: DatasetReport | CompReport, metric: str = None, estimators: List[str] = None
) -> pd.DataFrame:
_data = r.data(metric, estimators)
_data = _data.dropna(axis=0, how="any")
_wilcoxon = {}
for est in _data.columns.unique(0):
_wilcoxon[est] = [
sp_stats.wilcoxon(_data.loc[:, est], _data.loc[:, e]).pvalue
if e != est
else 1.0
for e in _data.columns.unique(0)
]
wilcoxon_data = np.array(list(_wilcoxon.values()))
dr_index = list(_wilcoxon.keys())
dr_columns = _data.columns.unique(0)
return pd.DataFrame(wilcoxon_data, columns=dr_columns, index=dr_index)
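A brief sketch of applying these tests to a loaded report (dr is a DatasetReport as in the sketch above; the metric name is again a placeholder):

normality = shapiro(dr, metric="acc")   # Shapiro-Wilk W and p-value per estimator
pairwise = wilcoxon(dr, metric="acc")   # Wilcoxon signed-rank p-values, estimator vs estimator
print(normality.round(4))
print(pairwise.round(4))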

View File

@@ -0,0 +1,146 @@
import os
import numpy as np
import quapy as qp
from quapy.data.base import LabelledCollection
from quapy.data.datasets import (
TWITTER_SENTIMENT_DATASETS_TEST,
UCI_MULTICLASS_DATASETS,
)
from quapy.method.aggregative import EMQ
from sklearn.linear_model import LogisticRegression
from quacc.dataset import DatasetProvider as DP
from quacc.error import macrof1_fn, vanilla_acc_fn
from quacc.models.base import ClassifierAccuracyPrediction
from quacc.models.baselines import ATC, DoC
from quacc.models.cont_table import (
CAPContingencyTable,
ContTableTransferCAP,
NaiveCAP,
QuAcc1xN2,
QuAccNxN,
)
from quacc.utils.commons import get_results_path
def gen_classifiers():
param_grid = {"C": np.logspace(-4, -4, 9), "class_weight": ["balanced", None]}
yield "LR", LogisticRegression()
# yield 'LR-opt', GridSearchCV(LogisticRegression(), param_grid, cv=5, n_jobs=-1)
# yield 'NB', GaussianNB()
# yield 'SVM(rbf)', SVC()
# yield 'SVM(linear)', LinearSVC()
def gen_multi_datasets(
only_names=False,
) -> [str, [LabelledCollection, LabelledCollection, LabelledCollection]]:
for dataset_name in np.setdiff1d(UCI_MULTICLASS_DATASETS, ["wine-quality"]):
if only_names:
yield dataset_name, None
else:
yield dataset_name, DP.uci_multiclass(dataset_name)
# yields the 20 newsgroups dataset
if only_names:
yield "20news", None
else:
yield "20news", DP.news20()
# yields the T1B@LeQua2022 (training) dataset
if only_names:
yield "T1B-LeQua2022", None
else:
yield "T1B-LeQua2022", DP.t1b_lequa2022()
def gen_tweet_datasets(
only_names=False,
) -> [str, [LabelledCollection, LabelledCollection, LabelledCollection]]:
for dataset_name in TWITTER_SENTIMENT_DATASETS_TEST:
if only_names:
yield dataset_name, None
else:
yield dataset_name, DP.twitter(dataset_name)
def gen_bin_datasets(
only_names=False,
) -> [str, [LabelledCollection, LabelledCollection, LabelledCollection]]:
_IMDB = [
"imdb",
]
_RCV1 = [
"CCAT",
"GCAT",
"MCAT",
]
for dn in _IMDB:
dval = None if only_names else DP.imdb()
yield dn, dval
for dn in _RCV1:
dval = None if only_names else DP.rcv1(dn)
yield dn, dval
def gen_CAP(h, acc_fn, with_oracle=False) -> [str, ClassifierAccuracyPrediction]:
### CAP methods ###
# yield 'SebCAP', SebastianiCAP(h, acc_fn, ACC)
# yield 'SebCAP-SLD', SebastianiCAP(h, acc_fn, EMQ, predict_train_prev=not with_oracle)
# yield 'SebCAP-KDE', SebastianiCAP(h, acc_fn, KDEyML)
# yield 'SebCAPweight', SebastianiCAP(h, acc_fn, ACC, alpha=0)
# yield 'PabCAP', PabloCAP(h, acc_fn, ACC)
# yield 'PabCAP-SLD-median', PabloCAP(h, acc_fn, EMQ, aggr='median')
### baselines ###
yield "ATC-MC", ATC(h, acc_fn, scoring_fn="maxconf")
# yield 'ATC-NE', ATC(h, acc_fn, scoring_fn='neg_entropy')
yield "DoC", DoC(h, acc_fn, sample_size=qp.environ["SAMPLE_SIZE"])
# fmt: off
def gen_CAP_cont_table(h) -> [str, CAPContingencyTable]:
acc_fn = None
yield "Naive", NaiveCAP(h, acc_fn)
# yield "CT-PPS-EMQ", ContTableTransferCAP(h, acc_fn, EMQ(LogisticRegression()))
# yield 'CT-PPS-KDE', ContTableTransferCAP(h, acc_fn, KDEyML(LogisticRegression(class_weight='balanced'), bandwidth=0.01))
# yield 'CT-PPS-KDE05', ContTableTransferCAP(h, acc_fn, KDEyML(LogisticRegression(class_weight='balanced'), bandwidth=0.05))
# yield 'QuAcc(EMQ)nxn-noX', QuAccNxN(h, acc_fn, EMQ(LogisticRegression()), add_posteriors=True, add_X=False)
# yield 'QuAcc(EMQ)nxn', QuAccNxN(h, acc_fn, EMQ(LogisticRegression()))
yield "QuAcc(EMQ)nxn-MC", QuAccNxN(h, acc_fn, EMQ(LogisticRegression()), add_maxconf=True)
# yield 'QuAcc(EMQ)nxn-NE', QuAccNxN(h, acc_fn, EMQ(LogisticRegression()), add_negentropy=True)
yield 'QuAcc(EMQ)nxn-MIS', QuAccNxN(h, acc_fn, EMQ(LogisticRegression()), add_maxinfsoft=True)
yield 'QuAcc(EMQ)nxn-MC-MIS', QuAccNxN(h, acc_fn, EMQ(LogisticRegression()), add_maxconf=True, add_maxinfsoft=True)
# yield 'QuAcc(EMQ)1xn2', QuAcc1xN2(h, acc_fn, EMQ(LogisticRegression()))
yield 'QuAcc(EMQ)1xn2-MC', QuAcc1xN2(h, acc_fn, EMQ(LogisticRegression()), add_maxconf=True)
# yield 'QuAcc(EMQ)1xn2-NE', QuAcc1xN2(h, acc_fn, EMQ(LogisticRegression()), add_negentropy=True)
yield 'QuAcc(EMQ)1xn2-MIS', QuAcc1xN2(h, acc_fn, EMQ(LogisticRegression()), add_maxinfsoft=True)
yield 'QuAcc(EMQ)1xn2-MC-MIS', QuAcc1xN2(h, acc_fn, EMQ(LogisticRegression()), add_maxconf=True, add_maxinfsoft=True)
# yield 'CT-PPSh-EMQ', ContTableTransferCAP(h, acc_fn, EMQ(LogisticRegression()), reuse_h=True)
# yield 'Equations-ACCh', NsquaredEquationsCAP(h, acc_fn, ACC, reuse_h=True)
# yield 'Equations-ACC', NsquaredEquationsCAP(h, acc_fn, ACC)
# yield 'Equations-SLD', NsquaredEquationsCAP(h, acc_fn, EMQ)
# fmt: on
def get_method_names():
mock_h = LogisticRegression()
return [m for m, _ in gen_CAP(mock_h, None)] + [
m for m, _ in gen_CAP_cont_table(mock_h)
]
def gen_acc_measure():
yield "vanilla_accuracy", vanilla_acc_fn
yield "macro-F1", macrof1_fn
def any_missing(basedir, cls_name, dataset_name, method_name):
for acc_name, _ in gen_acc_measure():
if not os.path.exists(
get_results_path(basedir, cls_name, acc_name, dataset_name, method_name)
):
return True
return False

View File

@@ -0,0 +1,60 @@
import numpy as np
from quacc.experiments.generators import get_method_names
from quacc.experiments.report import Report
from quacc.plot.matplotlib import plot_delta, plot_diagonal
def save_plot_diagonal(
basedir, cls_name, acc_name, dataset_name="*", report: Report = None
):
methods = get_method_names()
report = (
Report.load_results(
basedir,
cls_name,
acc_name,
dataset_name=dataset_name,
method_name=methods,
)
if report is None
else report
)
_methods, _true_accs, _estim_accs = report.diagonal_plot_data()
plot_diagonal(
method_names=_methods,
true_accs=_true_accs,
estim_accs=_estim_accs,
cls_name=cls_name,
acc_name=acc_name,
dataset_name=dataset_name,
basedir=basedir,
)
def save_plot_delta(
basedir, cls_name, acc_name, dataset_name="*", stdev=False, report: Report = None
):
methods = get_method_names()
report = (
Report.load_results(
basedir,
cls_name,
acc_name,
dataset_name=dataset_name,
method_name=methods,
)
if report is None
else report
)
_methods, _prevs, _acc_errs, _stdevs = report.delta_plot_data(stdev=stdev)
plot_delta(
method_names=_methods,
prevs=_prevs,
acc_errs=_acc_errs,
cls_name=cls_name,
acc_name=acc_name,
dataset_name=dataset_name,
basedir=basedir,
stdevs=_stdevs,
)
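A possible invocation of the two helpers above, mirroring the configuration generated by run.py further below ("binary", "LR", "vanilla_accuracy" and "imdb" all appear there):

rep = Report.load_results(
    "binary", "LR", "vanilla_accuracy",
    dataset_name="imdb", method_name=get_method_names(),
)
save_plot_diagonal("binary", "LR", "vanilla_accuracy", dataset_name="imdb", report=rep)
save_plot_delta("binary", "LR", "vanilla_accuracy", dataset_name="imdb", stdev=True, report=rep)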

quacc/experiments/report.py (new file, 171 lines)
View File

@@ -0,0 +1,171 @@
import itertools
from collections import defaultdict
from glob import glob
from pathlib import Path
import numpy as np
import pandas as pd
from quacc.error import nae
from quacc.utils.commons import get_results_path, load_json_file, save_json_file
def _get_shift(index: np.ndarray, train_prev: np.ndarray):
index = np.array([np.array(tp) for tp in index])
train_prevs = np.tile(train_prev, (index.shape[0], 1))
_shift = nae(index, train_prevs)
return np.around(_shift, decimals=2)
class TestReport:
def __init__(
self,
basedir,
cls_name,
acc_name,
dataset_name,
method_name,
train_prev,
val_prev,
):
self.basedir = basedir
self.cls_name = cls_name
self.acc_name = acc_name
self.dataset_name = dataset_name
self.method_name = method_name
self.train_prev = train_prev
self.val_prev = val_prev
@property
def path(self):
return get_results_path(
self.basedir,
self.cls_name,
self.acc_name,
self.dataset_name,
self.method_name,
)
def add_result(self, test_prevs, true_accs, estim_accs, t_train, t_test_ave):
self.test_prevs = test_prevs
self.true_accs = true_accs
self.estim_accs = estim_accs
self.t_train = t_train
self.t_test_ave = t_test_ave
return self
def save_json(self, basedir):
if not all([hasattr(self, _attr) for _attr in ["true_accs", "estim_accs"]]):
raise AttributeError("Incomplete report cannot be dumped")
result = {
"basedir": self.basedir,
"cls_name": self.cls_name,
"acc_name": self.acc_name,
"dataset_name": self.dataset_name,
"method_name": self.method_name,
"train_prev": self.train_prev,
"val_prev": self.val_prev,
"test_prevs": self.test_prevs,
"true_accs": self.true_accs,
"estim_accs": self.estim_accs,
"t_train": self.t_train,
"t_test_ave": self.t_test_ave,
}
save_json_file(self.path, result)
@classmethod
def load_json(cls, path) -> "TestReport":
def _test_report_hook(_dict):
return TestReport(
basedir=_dict["basedir"],
cls_name=_dict["cls_name"],
acc_name=_dict["acc_name"],
dataset_name=_dict["dataset_name"],
method_name=_dict["method_name"],
train_prev=_dict["train_prev"],
val_prev=_dict["val_prev"],
).add_result(
test_prevs=_dict["test_prevs"],
true_accs=_dict["true_accs"],
estim_accs=_dict["estim_accs"],
t_train=_dict["t_train"],
t_test_ave=_dict["t_test_ave"],
)
return load_json_file(path, object_hook=_test_report_hook)
class Report:
def __init__(self, results: dict[str, list[TestReport]]):
self.results = results
@classmethod
def load_results(
cls, basedir, cls_name, acc_name, dataset_name="*", method_name="*"
) -> "Report":
_results = defaultdict(lambda: [])
if isinstance(method_name, str):
method_name = [method_name]
if isinstance(dataset_name, str):
dataset_name = [dataset_name]
for dataset_, method_ in itertools.product(dataset_name, method_name):
path = get_results_path(basedir, cls_name, acc_name, dataset_, method_)
for file in glob(path):
if file.endswith(".json"):
# print(file)
method = Path(file).stem
_res = TestReport.load_json(file)
_results[method].append(_res)
return Report(_results)
def train_table(self):
pass
def test_table(self):
pass
def shift_table(self):
pass
def diagonal_plot_data(self):
methods = []
true_accs = []
estim_accs = []
for _method, _results in self.results.items():
methods.append(_method)
_true_acc = np.array([_r.true_accs for _r in _results]).flatten()
_estim_acc = np.array([_r.estim_accs for _r in _results]).flatten()
true_accs.append(_true_acc)
estim_accs.append(_estim_acc)
return methods, true_accs, estim_accs
def delta_plot_data(self, stdev=False):
methods = []
prevs = []
acc_errs = []
        stdevs = [] if stdev else None  # only collect standard deviations when requested
for _method, _results in self.results.items():
methods.append(_method)
_prevs = np.array(
[_r.test_prevs for _r in _results]
).flatten() # should not be flattened, check this
_true_accs = np.array([_r.true_accs for _r in _results]).flatten()
_estim_accs = np.array([_r.estim_accs for _r in _results]).flatten()
_acc_errs = np.abs(_true_accs - _estim_accs)
df = pd.DataFrame(
np.array([_prevs, _acc_errs]).T, columns=["prevs", "errs"]
)
df_acc_errs = df.groupby(["prevs"]).mean().reset_index()
prevs.append(df_acc_errs["prevs"].to_numpy())
acc_errs.append(df_acc_errs["errs"].to_numpy())
if stdev:
df_stdevs = df.groupby(["prevs"]).std().reset_index()
stdevs.append(df_stdevs["errs"].to_numpy())
return methods, prevs, acc_errs, stdevs
def shift_plot_data(self):
pass
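A small sketch of the JSON round trip these classes implement; every value below is an illustrative placeholder, not a real result:

rep = TestReport(
    basedir="binary", cls_name="LR", acc_name="vanilla_accuracy",
    dataset_name="imdb", method_name="ATC-MC",
    train_prev=[0.5, 0.5], val_prev=[0.5, 0.5],
)
rep.add_result(
    test_prevs=[0.2, 0.8],  # one entry per test sample (binary case: positive-class prevalence)
    true_accs=[0.91, 0.88], estim_accs=[0.90, 0.86],
    t_train=1.2, t_test_ave=0.01,
).save_json("binary")
loaded = Report.load_results("binary", "LR", "vanilla_accuracy", method_name="ATC-MC")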

quacc/experiments/run.py (new file, 145 lines)
View File

@@ -0,0 +1,145 @@
import itertools
import os
import quapy as qp
from quapy.protocol import UPP
from quacc.dataset import save_dataset_stats
from quacc.experiments.generators import (
any_missing,
gen_acc_measure,
gen_bin_datasets,
gen_CAP,
gen_CAP_cont_table,
gen_classifiers,
gen_multi_datasets,
gen_tweet_datasets,
)
from quacc.experiments.plotting import save_plot_delta, save_plot_diagonal
from quacc.experiments.report import Report, TestReport
from quacc.experiments.util import (
fit_method,
predictionsCAP,
predictionsCAPcont_table,
prevs_from_prot,
true_acc,
)
PROBLEM = "binary"
ORACLE = False
basedir = PROBLEM + ("-oracle" if ORACLE else "")
EXPERIMENT = True
PLOTTING = True
if PROBLEM == "binary":
qp.environ["SAMPLE_SIZE"] = 1000
NUM_TEST = 1000
gen_datasets = gen_bin_datasets
elif PROBLEM == "multiclass":
qp.environ["SAMPLE_SIZE"] = 250
NUM_TEST = 1000
gen_datasets = gen_multi_datasets
elif PROBLEM == "tweet":
qp.environ["SAMPLE_SIZE"] = 100
NUM_TEST = 1000
gen_datasets = gen_tweet_datasets
if EXPERIMENT:
for (cls_name, h), (dataset_name, (L, V, U)) in itertools.product(
gen_classifiers(), gen_datasets()
):
print(f"training {cls_name} in {dataset_name}")
h.fit(*L.Xy)
# test generation protocol
test_prot = UPP(
U, repeats=NUM_TEST, return_type="labelled_collection", random_state=0
)
# compute some stats of the dataset
save_dataset_stats(f"dataset_stats/{dataset_name}.json", test_prot, L, V)
# precompute the actual accuracy values
true_accs = {}
for acc_name, acc_fn in gen_acc_measure():
true_accs[acc_name] = [true_acc(h, acc_fn, Ui) for Ui in test_prot()]
print("CAP methods")
        # instances of ClassifierAccuracyPrediction are bound to a specific evaluation measure,
        # so they must be instantiated inside the accuracy-measure loop
for acc_name, acc_fn in gen_acc_measure():
print(f"\tfor measure {acc_name}")
for method_name, method in gen_CAP(h, acc_fn, with_oracle=ORACLE):
report = TestReport(
basedir=basedir,
cls_name=cls_name,
acc_name=acc_name,
dataset_name=dataset_name,
method_name=method_name,
train_prev=L.prevalence().tolist(),
val_prev=V.prevalence().tolist(),
)
if os.path.exists(report.path):
print(f"\t\t{method_name}-{acc_name} exists, skipping")
continue
print(f"\t\t{method_name} computing...")
method, t_train = fit_method(method, V)
estim_accs, t_test_ave = predictionsCAP(method, test_prot, ORACLE)
test_prevs = prevs_from_prot(test_prot)
report.add_result(
test_prevs=test_prevs,
true_accs=true_accs[acc_name],
estim_accs=estim_accs,
t_train=t_train,
t_test_ave=t_test_ave,
).save_json(basedir)
print("\nCAP_cont_table methods")
        # instances of CAPContingencyTable, instead, are generic: the evaluation measures can
        # be applied to the predicted contingency tables afterwards, which speeds things up
for method_name, method in gen_CAP_cont_table(h):
if not any_missing(basedir, cls_name, dataset_name, method_name):
print(
f"\t\tmethod {method_name} has all results already computed. Skipping."
)
continue
print(f"\t\tmethod {method_name} computing...")
method, t_train = fit_method(method, V)
estim_accs_dict, t_test_ave = predictionsCAPcont_table(
method, test_prot, gen_acc_measure, ORACLE
)
for acc_name, estim_accs in estim_accs_dict.items():
                report = TestReport(
                    basedir=basedir,
                    cls_name=cls_name,
                    acc_name=acc_name,
                    dataset_name=dataset_name,
                    method_name=method_name,
                    train_prev=L.prevalence().tolist(),
                    val_prev=V.prevalence().tolist(),
                )
test_prevs = prevs_from_prot(test_prot)
report.add_result(
test_prevs=test_prevs,
true_accs=true_accs[acc_name],
estim_accs=estim_accs,
t_train=t_train,
t_test_ave=t_test_ave,
).save_json(basedir)
print()
# generate plots
if PLOTTING:
for (cls_name, _), (acc_name, _) in itertools.product(
gen_classifiers(), gen_acc_measure()
):
save_plot_diagonal(basedir, cls_name, acc_name)
for dataset_name, _ in gen_datasets(only_names=True):
save_plot_diagonal(basedir, cls_name, acc_name, dataset_name=dataset_name)
save_plot_delta(basedir, cls_name, acc_name, dataset_name=dataset_name)
save_plot_delta(
basedir, cls_name, acc_name, dataset_name=dataset_name, stdev=True
)
# print("generating tables")
# gen_tables(basedir, datasets=[d for d, _ in gen_datasets(only_names=True)])

quacc/experiments/util.py (new file, 64 lines)
View File

@@ -0,0 +1,64 @@
from time import time
import numpy as np
from quapy.data.base import LabelledCollection
from sklearn.base import BaseEstimator
from sklearn.metrics import confusion_matrix
def fit_method(method, V):
tinit = time()
method.fit(V)
t_train = time() - tinit
return method, t_train
def predictionsCAP(method, test_prot, oracle=False):
tinit = time()
if not oracle:
estim_accs = [method.predict(Ui.X) for Ui in test_prot()]
else:
estim_accs = [
method.predict(Ui.X, oracle_prev=Ui.prevalence()) for Ui in test_prot()
]
t_test_ave = (time() - tinit) / test_prot.total()
return estim_accs, t_test_ave
def predictionsCAPcont_table(method, test_prot, gen_acc_measure, oracle=False):
estim_accs_dict = {}
tinit = time()
if not oracle:
estim_tables = [method.predict_ct(Ui.X) for Ui in test_prot()]
else:
estim_tables = [
method.predict_ct(Ui.X, oracle_prev=Ui.prevalence()) for Ui in test_prot()
]
for acc_name, acc_fn in gen_acc_measure():
estim_accs_dict[acc_name] = [acc_fn(cont_table) for cont_table in estim_tables]
t_test_ave = (time() - tinit) / test_prot.total()
return estim_accs_dict, t_test_ave
def prevs_from_prot(prot):
def _get_plain_prev(prev: np.ndarray):
if prev.shape[0] > 2:
return tuple(prev[1:])
else:
return prev[-1]
return [_get_plain_prev(Ui.prevalence()) for Ui in prot()]
def true_acc(h: BaseEstimator, acc_fn: callable, U: LabelledCollection):
y_pred = h.predict(U.X)
y_true = U.y
conf_table = confusion_matrix(y_true, y_pred=y_pred, labels=U.classes_)
return acc_fn(conf_table)
def get_acc_name(acc_name):
    # map a display name to the internal accuracy-measure identifier
    return {
        "Vanilla Accuracy": "vanilla_accuracy",
        "Macro F1": "macro-F1",
    }[acc_name]

View File

@@ -1,132 +0,0 @@
import logging
import logging.handlers
import multiprocessing
import threading
from pathlib import Path
from typing import List
_logger_manager = None
class LoggerManager:
def __init__(self, q, worker, listener=None, th=None):
self.th: threading.Thread = th
self.q: multiprocessing.Queue = q
self.listener: logging.Logger = listener
self._worker: List[logging.Logger] = [worker]
self._listener_handlers: List[logging.Handler] = []
def close(self):
if self.th is not None:
self.q.put(None)
self.th.join()
def rm_worker(self):
self._worker.pop()
@property
def worker(self):
return self._worker[-1]
def new_worker(self):
log = logging.getLogger(f"worker{len(self._worker)}")
log.handlers.clear()
self._worker.append(log)
return log
def add_listener_handler(self, rh):
self._listener_handlers.append(rh)
self.listener.addHandler(rh)
self.listener.info("-" * 100)
def clear_listener_handlers(self):
for rh in self._listener_handlers:
self.listener.removeHandler(rh)
self._listener_handlers.clear()
def log_listener(root, q):
while True:
msg = q.get()
if msg is None:
return
root.handle(msg)
def setup_logger():
q = multiprocessing.Manager().Queue()
log_file = "quacc.log"
root_name = "listener"
root = logging.getLogger(root_name)
root.setLevel(logging.DEBUG)
fh = logging.FileHandler(log_file)
fh.setLevel(logging.DEBUG)
root.addHandler(fh)
th = threading.Thread(target=log_listener, args=[root, q])
th.start()
worker_name = "worker"
worker = logging.getLogger(worker_name)
worker.setLevel(logging.DEBUG)
qh = logging.handlers.QueueHandler(q)
qh.setLevel(logging.DEBUG)
qh.setFormatter(
logging.Formatter(
fmt="%(asctime)s| %(levelname)-8s %(message)s",
datefmt="%d/%m/%y %H:%M:%S",
)
)
worker.addHandler(qh)
global _logger_manager
_logger_manager = LoggerManager(q, worker, listener=root, th=th)
return _logger_manager.worker
def setup_worker_logger(q: multiprocessing.Queue = None):
formatter = logging.Formatter(
fmt="%(asctime)s| %(levelname)-12s%(message)s",
datefmt="%d/%m/%y %H:%M:%S",
)
global _logger_manager
if _logger_manager is None:
worker_name = "worker"
worker = logging.getLogger(worker_name)
worker.setLevel(logging.DEBUG)
qh = logging.handlers.QueueHandler(q)
qh.setLevel(logging.DEBUG)
qh.setFormatter(formatter)
worker.addHandler(qh)
_logger_manager = LoggerManager(q, worker)
return _logger_manager.worker
else:
worker = _logger_manager.new_worker()
worker.setLevel(logging.DEBUG)
qh = logging.handlers.QueueHandler(_logger_manager.q)
qh.setLevel(logging.DEBUG)
qh.setFormatter(formatter)
worker.addHandler(qh)
return worker
def logger():
return _logger_manager.worker
def logger_manager():
return _logger_manager
def add_handler(path: Path):
rh = logging.FileHandler(path, mode="a")
rh.setLevel(logging.DEBUG)
_logger_manager.add_listener_handler(rh)
def clear_handlers():
_logger_manager.clear_listener_handlers()
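A compact sketch of how the queue-based logger above is typically wired up, using the helpers defined in this module (the log-file path is an assumption):

log = setup_logger()                      # main process: starts the listener thread
add_handler(Path("output/example.log"))   # hypothetical per-dataset log file
log.info("experiment started")
# inside a worker process, attach to the shared queue instead:
# worker_log = setup_worker_logger(logger_manager().q)
clear_handlers()
logger_manager().close()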

View File

@@ -1,58 +0,0 @@
from traceback import print_exception as traceback
import quacc.evaluation.comp as comp
# from quacc.logger import Logger
from quacc import logger
from quacc.dataset import Dataset
from quacc.environment import env
from quacc.evaluation.estimators import CE
from quacc.utils import create_dataser_dir
def estimate_comparison():
# log = Logger.logger()
log = logger.logger()
for conf in env.load_confs():
dataset = Dataset(
env.DATASET_NAME,
target=env.DATASET_TARGET,
n_prevalences=env.DATASET_N_PREVS,
prevs=env.DATASET_PREVS,
)
create_dataser_dir(
dataset.name,
update=env.DATASET_DIR_UPDATE,
)
# Logger.add_handler(env.OUT_DIR / f"{dataset.name}.log")
logger.add_handler(env.OUT_DIR / f"{dataset.name}.log")
try:
dr = comp.evaluate_comparison(
dataset,
estimators=CE.name[env.COMP_ESTIMATORS],
)
dr.pickle(env.OUT_DIR / f"{dataset.name}.pickle")
except Exception as e:
log.error(f"Evaluation over {dataset.name} failed. Exception: {e}")
traceback(e)
# Logger.clear_handlers()
logger.clear_handlers()
def main():
# log = Logger.logger()
log = logger.setup_logger()
try:
estimate_comparison()
except Exception as e:
log.error(f"estimate comparison failed. Exception: {e}")
traceback(e)
# Logger.close()
logger.logger_manager().close()
if __name__ == "__main__":
main()

View File

@@ -1,355 +0,0 @@
from abc import abstractmethod
from copy import deepcopy
from typing import List
import numpy as np
import scipy.sparse as sp
from quapy.data import LabelledCollection
from quapy.method.aggregative import BaseQuantifier
from sklearn.base import BaseEstimator
import quacc.method.confidence as conf
from quacc.data import (
ExtBinPrev,
ExtendedCollection,
ExtendedData,
ExtendedPrev,
ExtensionPolicy,
ExtMulPrev,
)
class BaseAccuracyEstimator(BaseQuantifier):
def __init__(
self,
classifier: BaseEstimator,
quantifier: BaseQuantifier,
dense=False,
):
self.__check_classifier(classifier)
self.quantifier = quantifier
self.extpol = ExtensionPolicy(dense=dense)
def __check_classifier(self, classifier):
if not hasattr(classifier, "predict_proba"):
raise ValueError(
f"Passed classifier {classifier.__class__.__name__} cannot predict probabilities."
)
self.classifier = classifier
def extend(self, coll: LabelledCollection, pred_proba=None) -> ExtendedCollection:
if pred_proba is None:
pred_proba = self.classifier.predict_proba(coll.X)
return ExtendedCollection.from_lc(
coll, pred_proba=pred_proba, ext=pred_proba, extpol=self.extpol
)
def _extend_instances(self, instances: np.ndarray | sp.csr_matrix):
pred_proba = self.classifier.predict_proba(instances)
return ExtendedData(instances, pred_proba=pred_proba, extpol=self.extpol)
@abstractmethod
def fit(self, train: LabelledCollection | ExtendedCollection):
...
@abstractmethod
def estimate(self, instances, ext=False) -> ExtendedPrev:
...
@property
def dense(self):
return self.extpol.dense
class ConfidenceBasedAccuracyEstimator(BaseAccuracyEstimator):
def __init__(
self,
classifier: BaseEstimator,
quantifier: BaseQuantifier,
confidence=None,
):
super().__init__(
classifier=classifier,
quantifier=quantifier,
)
self.__check_confidence(confidence)
self.calibrator = None
def __check_confidence(self, confidence):
if isinstance(confidence, str):
self.confidence = [confidence]
elif isinstance(confidence, list):
self.confidence = confidence
else:
self.confidence = None
def _fit_confidence(self, X, y, probas):
self.confidence_metrics = conf.get_metrics(self.confidence)
if self.confidence_metrics is None:
return
for m in self.confidence_metrics:
m.fit(X, y, probas)
def _get_pred_ext(self, pred_proba: np.ndarray):
return pred_proba
def __get_ext(
self, X: np.ndarray | sp.csr_matrix, pred_proba: np.ndarray
) -> np.ndarray:
if self.confidence_metrics is None or len(self.confidence_metrics) == 0:
return pred_proba
_conf_ext = np.concatenate(
[m.conf(X, pred_proba) for m in self.confidence_metrics],
axis=1,
)
_pred_ext = self._get_pred_ext(pred_proba)
return np.concatenate([_conf_ext, _pred_ext], axis=1)
def extend(
self, coll: LabelledCollection, pred_proba=None, prefit=False
) -> ExtendedCollection:
if pred_proba is None:
pred_proba = self.classifier.predict_proba(coll.X)
if prefit:
self._fit_confidence(coll.X, coll.y, pred_proba)
else:
if not hasattr(self, "confidence_metrics"):
raise AttributeError(
"Confidence metrics are not fit and cannot be computed."
"Consider setting prefit to True."
)
_ext = self.__get_ext(coll.X, pred_proba)
return ExtendedCollection.from_lc(
coll, pred_proba=pred_proba, ext=_ext, extpol=self.extpol
)
def _extend_instances(
self,
instances: np.ndarray | sp.csr_matrix,
) -> ExtendedData:
pred_proba = self.classifier.predict_proba(instances)
_ext = self.__get_ext(instances, pred_proba)
return ExtendedData(
instances, pred_proba=pred_proba, ext=_ext, extpol=self.extpol
)
class MultiClassAccuracyEstimator(ConfidenceBasedAccuracyEstimator):
def __init__(
self,
classifier: BaseEstimator,
quantifier: BaseQuantifier,
confidence: str = None,
collapse_false=False,
group_false=False,
dense=False,
):
super().__init__(
classifier=classifier,
quantifier=quantifier,
confidence=confidence,
)
self.extpol = ExtensionPolicy(
collapse_false=collapse_false,
group_false=group_false,
dense=dense,
)
self.e_train = None
# def _get_pred_ext(self, pred_proba: np.ndarray):
# return np.argmax(pred_proba, axis=1, keepdims=True)
def _get_multi_quant(self, quant, train: LabelledCollection):
_nz = np.nonzero(train.counts())[0]
if _nz.shape[0] == 1:
return TrivialQuantifier(train.n_classes, _nz[0])
else:
return quant
def fit(self, train: LabelledCollection):
pred_proba = self.classifier.predict_proba(train.X)
self._fit_confidence(train.X, train.y, pred_proba)
self.e_train = self.extend(train, pred_proba=pred_proba)
self.quantifier = self._get_multi_quant(self.quantifier, self.e_train)
self.quantifier.fit(self.e_train)
return self
def estimate(
self, instances: ExtendedData | np.ndarray | sp.csr_matrix
) -> ExtendedPrev:
e_inst = instances
if not isinstance(e_inst, ExtendedData):
e_inst = self._extend_instances(instances)
estim_prev = self.quantifier.quantify(e_inst.X)
return ExtMulPrev(
estim_prev,
e_inst.nbcl,
q_classes=self.quantifier.classes_,
extpol=self.extpol,
)
@property
def collapse_false(self):
return self.extpol.collapse_false
@property
def group_false(self):
return self.extpol.group_false
class TrivialQuantifier:
def __init__(self, n_classes, trivial_class):
self.trivial_class = trivial_class
def fit(self, train: LabelledCollection):
pass
def quantify(self, inst: LabelledCollection) -> np.ndarray:
return np.array([1.0])
@property
def classes_(self):
return np.array([self.trivial_class])
class QuantifierProxy:
def __init__(self, train: LabelledCollection):
self.o_nclasses = train.n_classes
self.o_classes = train.classes_
self.o_index = {c: i for i, c in enumerate(train.classes_)}
self.mapping = {}
self.r_mapping = {}
_cnt = 0
for cl, c in zip(train.classes_, train.counts()):
if c > 0:
self.mapping[cl] = _cnt
self.r_mapping[_cnt] = cl
_cnt += 1
self.n_nclasses = len(self.mapping)
def apply_mapping(self, coll: LabelledCollection) -> LabelledCollection:
if not self.proxied:
return coll
n_labels = np.copy(coll.labels)
for k in self.mapping:
n_labels[coll.labels == k] = self.mapping[k]
return LabelledCollection(coll.X, n_labels, classes=np.arange(self.n_nclasses))
def apply_rmapping(self, prevs: np.ndarray, q_classes: np.ndarray) -> np.ndarray:
if not self.proxied:
return prevs, q_classes
n_qclasses = np.array([self.r_mapping[qc] for qc in q_classes])
return prevs, n_qclasses
def get_trivial(self):
return TrivialQuantifier(self.o_nclasses, self.n_nclasses)
@property
def proxied(self):
return self.o_nclasses != self.n_nclasses
class BinaryQuantifierAccuracyEstimator(ConfidenceBasedAccuracyEstimator):
def __init__(
self,
classifier: BaseEstimator,
quantifier: BaseAccuracyEstimator,
confidence: str = None,
group_false: bool = False,
dense: bool = False,
):
super().__init__(
classifier=classifier,
quantifier=quantifier,
confidence=confidence,
)
self.quantifiers = []
self.extpol = ExtensionPolicy(
group_false=group_false,
dense=dense,
)
def _get_binary_quant(self, quant, train: LabelledCollection):
_nz = np.nonzero(train.counts())[0]
if _nz.shape[0] == 1:
return TrivialQuantifier(train.n_classes, _nz[0])
else:
return deepcopy(quant)
def fit(self, train: LabelledCollection | ExtendedCollection):
pred_proba = self.classifier.predict_proba(train.X)
self._fit_confidence(train.X, train.y, pred_proba)
self.e_train = self.extend(train, pred_proba=pred_proba)
self.n_classes = self.e_train.n_classes
e_trains = self.e_train.split_by_pred()
self.quantifiers = []
for train in e_trains:
quant = self._get_binary_quant(self.quantifier, train)
quant.fit(train)
self.quantifiers.append(quant)
return self
def estimate(
self, instances: ExtendedData | np.ndarray | sp.csr_matrix
) -> np.ndarray:
e_inst = instances
if not isinstance(e_inst, ExtendedData):
e_inst = self._extend_instances(instances)
s_inst = e_inst.split_by_pred()
norms = [s_i.shape[0] / len(e_inst) for s_i in s_inst]
estim_prevs = self._quantify_helper(s_inst, norms)
# estim_prev = np.concatenate(estim_prevs.T)
# return ExtendedPrev(estim_prev, e_inst.nbcl, extpol=self.extpol)
return ExtBinPrev(
estim_prevs,
e_inst.nbcl,
q_classes=[quant.classes_ for quant in self.quantifiers],
extpol=self.extpol,
)
def _quantify_helper(
self,
s_inst: List[np.ndarray | sp.csr_matrix],
norms: List[float],
):
estim_prevs = []
for quant, inst, norm in zip(self.quantifiers, s_inst, norms):
if inst.shape[0] > 0:
estim_prev = quant.quantify(inst) * norm
estim_prevs.append(estim_prev)
else:
estim_prevs.append(np.zeros((len(quant.classes_),)))
# return np.array(estim_prevs)
return estim_prevs
@property
def group_false(self):
return self.extpol.group_false
BAE = BaseAccuracyEstimator
MCAE = MultiClassAccuracyEstimator
BQAE = BinaryQuantifierAccuracyEstimator
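A hedged sketch of how this (legacy, now removed) estimator API was meant to be used; the quantifier choice and the train/val/test splits are assumptions:

from quapy.method.aggregative import EMQ
from sklearn.linear_model import LogisticRegression

# train, val, test: LabelledCollection splits of a dataset (assumed available)
clf = LogisticRegression().fit(*train.Xy)
estimator = MCAE(clf, EMQ(LogisticRegression()), confidence=["max_conf"])
estimator.fit(val)
ext_prev = estimator.estimate(test.X)  # extended prevalence over the (n_classes x n_classes) label space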

View File

@@ -1,98 +0,0 @@
from typing import List
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LinearRegression
import baselines.atc as atc
__confs = {}
def metric(name):
def wrapper(cl):
__confs[name] = cl
return cl
return wrapper
class ConfidenceMetric:
def fit(self, X, y, probas):
pass
def conf(self, X, probas):
return probas
@metric("max_conf")
class MaxConf(ConfidenceMetric):
def conf(self, X, probas):
_mc = np.max(probas, axis=1, keepdims=True)
return _mc
@metric("entropy")
class Entropy(ConfidenceMetric):
def conf(self, X, probas):
_ent = np.sum(
np.multiply(probas, np.log(probas + 1e-20)), axis=1, keepdims=True
)
return _ent
@metric("isoft")
class InverseSoftmax(ConfidenceMetric):
def conf(self, X, probas):
_probas = probas / np.sum(probas, axis=1, keepdims=True)
_probas = np.log(_probas) - np.mean(np.log(_probas), axis=1, keepdims=True)
return np.max(_probas, axis=1, keepdims=True)
@metric("threshold")
class Threshold(ConfidenceMetric):
def get_scores(self, probas, keepdims=False):
return np.max(probas, axis=1, keepdims=keepdims)
def fit(self, X, y, probas):
scores = self.get_scores(probas)
_, self.threshold = atc.find_ATC_threshold(scores, y)
def conf(self, X, probas):
scores = self.get_scores(probas, keepdims=True)
_exp = scores - self.threshold
return _exp
# def conf(self, X, probas):
# scores = self.get_scores(probas)
# _exp = np.where(
# scores >= self.threshold, np.ones(scores.shape), np.zeros(scores.shape)
# )
# return _exp[:, np.newaxis]
@metric("linreg")
class LinReg(ConfidenceMetric):
def extend(self, X, probas):
if sp.issparse(X):
return sp.hstack([X, probas])
else:
return np.concatenate([X, probas], axis=1)
def fit(self, X, y, probas):
reg_X = self.extend(X, probas)
reg_y = probas[np.arange(probas.shape[0]), y]
self.reg = LinearRegression()
self.reg.fit(reg_X, reg_y)
def conf(self, X, probas):
reg_X = self.extend(X, probas)
return self.reg.predict(reg_X)[:, np.newaxis]
def get_metrics(names: List[str]):
if names is None:
return None
__fnames = [n for n in names if n in __confs]
return [__confs[m]() for m in __fnames]
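A short sketch of how these metrics combine into extra confidence features (X, y and probas are placeholder arrays, probas of shape (n_samples, n_classes)):

metrics = get_metrics(["max_conf", "entropy", "threshold"])
for m in metrics:
    m.fit(X, y, probas)  # no-op for max_conf/entropy; "threshold" fits an ATC-style threshold
conf_feats = np.concatenate([m.conf(X, probas) for m in metrics], axis=1)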

View File

@@ -1,481 +0,0 @@
import itertools
import math
import os
from copy import deepcopy
from time import time
from typing import Callable, Union
import numpy as np
from joblib import Parallel
from quapy.data import LabelledCollection
from quapy.protocol import (
AbstractProtocol,
OnLabelledCollectionProtocol,
)
import quacc as qc
import quacc.error
from quacc.data import ExtendedCollection
from quacc.evaluation.evaluate import evaluate
from quacc.logger import logger
from quacc.method.base import (
BaseAccuracyEstimator,
)
class GridSearchAE(BaseAccuracyEstimator):
def __init__(
self,
model: BaseAccuracyEstimator,
param_grid: dict,
protocol: AbstractProtocol,
error: Union[Callable, str] = qc.error.maccd,
refit=True,
# timeout=-1,
n_jobs=None,
verbose=False,
):
self.model = model
self.param_grid = self.__normalize_params(param_grid)
self.protocol = protocol
self.refit = refit
# self.timeout = timeout
self.n_jobs = qc._get_njobs(n_jobs)
self.verbose = verbose
self.__check_error(error)
assert isinstance(protocol, AbstractProtocol), "unknown protocol"
def _sout(self, msg, level=0):
if level > 0 or self.verbose:
print(f"[{self.__class__.__name__}@{self.model.__class__.__name__}]: {msg}")
def __normalize_params(self, params):
__remap = {}
for key in params.keys():
k, delim, sub_key = key.partition("__")
if delim and k == "q":
__remap[key] = f"quantifier__{sub_key}"
return {(__remap[k] if k in __remap else k): v for k, v in params.items()}
def __check_error(self, error):
if error in qc.error.ACCURACY_ERROR:
self.error = error
elif isinstance(error, str):
self.error = qc.error.from_name(error)
elif hasattr(error, "__call__"):
self.error = error
else:
raise ValueError(
f"unexpected error type; must either be a callable function or a str representing\n"
f"the name of an error function in {qc.error.ACCURACY_ERROR_NAMES}"
)
def fit(self, training: LabelledCollection):
"""Learning routine. Fits methods with all combinations of hyperparameters and selects the one minimizing
the error metric.
:param training: the training set on which to optimize the hyperparameters
:return: self
"""
params_keys = list(self.param_grid.keys())
params_values = list(self.param_grid.values())
protocol = self.protocol
self.param_scores_ = {}
self.best_score_ = None
tinit = time()
hyper = [
dict(zip(params_keys, val)) for val in itertools.product(*params_values)
]
self._sout(f"starting model selection with {self.n_jobs =}")
# self._sout("starting model selection")
# scores = [self.__params_eval((params, training)) for params in hyper]
scores = self._select_scores(hyper, training)
for params, score, model in scores:
if score is not None:
if self.best_score_ is None or score < self.best_score_:
self.best_score_ = score
self.best_params_ = params
self.best_model_ = model
self.param_scores_[str(params)] = score
else:
self.param_scores_[str(params)] = "timeout"
tend = time() - tinit
if self.best_score_ is None:
raise TimeoutError("no combination of hyperparameters seem to work")
self._sout(
f"optimization finished: best params {self.best_params_} (score={self.best_score_:.5f}) "
f"[took {tend:.4f}s]",
level=1,
)
# log = Logger.logger()
log = logger()
log.debug(
f"[{self.model.__class__.__name__}] "
f"optimization finished: best params {self.best_params_} (score={self.best_score_:.5f}) "
f"[took {tend:.4f}s]"
)
if self.refit:
if isinstance(protocol, OnLabelledCollectionProtocol):
self._sout("refitting on the whole development set")
self.best_model_.fit(training + protocol.get_labelled_collection())
else:
raise RuntimeWarning(
f'"refit" was requested, but the protocol does not '
f"implement the {OnLabelledCollectionProtocol.__name__} interface"
)
return self
def _select_scores(self, hyper, training):
return qc.utils.parallel(
self._params_eval,
[(params, training) for params in hyper],
n_jobs=self.n_jobs,
verbose=1,
)
def _params_eval(self, params, training, protocol=None):
protocol = self.protocol if protocol is None else protocol
error = self.error
# if self.timeout > 0:
# def handler(signum, frame):
# raise TimeoutError()
# signal.signal(signal.SIGALRM, handler)
tinit = time()
# if self.timeout > 0:
# signal.alarm(self.timeout)
try:
model = deepcopy(self.model)
# overrides default parameters with the parameters being explored at this iteration
model.set_params(**params)
# print({k: v for k, v in model.get_params().items() if k in params})
model.fit(training)
score = evaluate(model, protocol=protocol, error_metric=error)
ttime = time() - tinit
self._sout(
f"hyperparams={params}\t got score {score:.5f} [took {ttime:.4f}s]",
)
# if self.timeout > 0:
# signal.alarm(0)
# except TimeoutError:
# self._sout(f"timeout ({self.timeout}s) reached for config {params}")
# score = None
except ValueError as e:
self._sout(
f"the combination of hyperparameters {params} is invalid. Exception: {e}",
level=1,
)
score = None
# raise e
except Exception as e:
self._sout(
f"something went wrong for config {params}; skipping:"
f"\tException: {e}",
level=1,
)
# raise e
score = None
return params, score, model
def extend(
self, coll: LabelledCollection, pred_proba=None, prefit=False
) -> ExtendedCollection:
        assert hasattr(self, "best_model_"), "extend called before fit"
return self.best_model().extend(coll, pred_proba=pred_proba, prefit=prefit)
def estimate(self, instances):
"""Estimate class prevalence values using the best model found after calling the :meth:`fit` method.
        :param instances: sample containing the instances
:return: a ndarray of shape `(n_classes)` with class prevalence estimates as according to the best model found
by the model selection process.
"""
assert hasattr(self, "best_model_"), "estimate called before fit"
return self.best_model().estimate(instances)
def set_params(self, **parameters):
"""Sets the hyper-parameters to explore.
:param parameters: a dictionary with keys the parameter names and values the list of values to explore
"""
self.param_grid = parameters
def get_params(self, deep=True):
"""Returns the dictionary of hyper-parameters to explore (`param_grid`)
:param deep: Unused
:return: the dictionary `param_grid`
"""
return self.param_grid
def best_model(self):
"""
Returns the best model found after calling the :meth:`fit` method, i.e., the one trained on the combination
of hyper-parameters that minimized the error function.
:return: a trained quantifier
"""
if hasattr(self, "best_model_"):
return self.best_model_
raise ValueError("best_model called before fit")
def best_score(self):
if hasattr(self, "best_score_"):
return self.best_score_
raise ValueError("best_score called before fit")
class RandomizedSearchAE(GridSearchAE):
ERR_THRESHOLD = 1e-4
MAX_ITER_IMPROV = 3
def _select_scores(self, hyper, training: LabelledCollection):
log = logger()
hyper = np.array(hyper)
rand_index = np.random.choice(
np.arange(len(hyper)), size=len(hyper), replace=False
)
_n_jobs = os.cpu_count() + 1 + self.n_jobs if self.n_jobs < 0 else self.n_jobs
batch_size = _n_jobs
log.debug(f"{batch_size = }")
rand_index = list(
rand_index[: (len(hyper) // batch_size) * batch_size].reshape(
(len(hyper) // batch_size, batch_size)
)
) + [rand_index[(len(hyper) // batch_size) * batch_size :]]
scores = []
best_score, iter_from_improv = np.inf, 0
with Parallel(n_jobs=self.n_jobs) as parallel:
for i, ri in enumerate(rand_index):
tstart = time()
_iter_scores = qc.utils.parallel(
self._params_eval,
[(params, training) for params in hyper[ri]],
parallel=parallel,
)
_best_iter_score = np.min(
[s for _, s, _ in _iter_scores if s is not None]
)
log.debug(
f"[iter {i}] best score = {_best_iter_score:.8f} [took {time() - tstart:.3f}s]"
)
scores += _iter_scores
_check, best_score, iter_from_improv = self.__stop_condition(
_best_iter_score, best_score, iter_from_improv
)
if _check:
break
return scores
def __stop_condition(self, best_iter_score, best_score, iter_from_improv):
if best_iter_score < best_score:
_improv = best_score - best_iter_score
best_score = best_iter_score
else:
_improv = 0
if _improv > self.ERR_THRESHOLD:
iter_from_improv = 0
else:
iter_from_improv += 1
return iter_from_improv > self.MAX_ITER_IMPROV, best_score, iter_from_improv
class HalvingSearchAE(GridSearchAE):
def _select_scores(self, hyper, training: LabelledCollection):
log = logger()
hyper = np.array(hyper)
threshold = 22
factor = 3
n_steps = math.ceil(math.log(len(hyper) / threshold, factor))
steps = np.logspace(n_steps, 0, base=1.0 / factor, num=n_steps + 1)
with Parallel(n_jobs=self.n_jobs, verbose=1) as parallel:
for _step in steps:
tstart = time()
_training, _ = (
training.split_stratified(train_prop=_step)
if _step < 1.0
else (training, None)
)
results = qc.utils.parallel(
self._params_eval,
[(params, _training) for params in hyper],
parallel=parallel,
)
scores = [(1.0 if s is None else s) for _, s, _ in results]
res_hyper = np.array([h for h, _, _ in results], dtype="object")
sorted_scores_idx = np.argsort(scores)
best_score = scores[sorted_scores_idx[0]]
hyper = res_hyper[
sorted_scores_idx[: round(len(res_hyper) * (1.0 / factor))]
]
log.debug(
f"[step {_step}] best score = {best_score:.8f} [took {time() - tstart:.3f}s]"
)
return results
class SpiderSearchAE(GridSearchAE):
def __init__(
self,
model: BaseAccuracyEstimator,
param_grid: dict,
protocol: AbstractProtocol,
error: Union[Callable, str] = qc.error.maccd,
refit=True,
n_jobs=None,
verbose=False,
err_threshold=1e-4,
max_iter_improv=0,
pd_th_min=1,
best_width=2,
):
super().__init__(
model=model,
param_grid=param_grid,
protocol=protocol,
error=error,
refit=refit,
n_jobs=n_jobs,
verbose=verbose,
)
self.err_threshold = err_threshold
self.max_iter_improv = max_iter_improv
self.pd_th_min = pd_th_min
self.best_width = best_width
def _select_scores(self, hyper, training: LabelledCollection):
log = logger()
hyper = np.array(hyper)
_n_jobs = os.cpu_count() + 1 + self.n_jobs if self.n_jobs < 0 else self.n_jobs
batch_size = _n_jobs
rand_index = np.arange(len(hyper))
np.random.shuffle(rand_index)
rand_index = rand_index[:batch_size]
remaining_index = np.setdiff1d(np.arange(len(hyper)), rand_index)
_hyper, _hyper_remaining = hyper[rand_index], hyper[remaining_index]
scores = []
best_score, last_best, iter_from_improv = np.inf, np.inf, 0
with Parallel(n_jobs=self.n_jobs, verbose=1) as parallel:
while len(_hyper) > 0:
# log.debug(f"{len(_hyper_remaining)=}")
tstart = time()
_iter_scores = qc.utils.parallel(
self._params_eval,
[(params, training) for params in _hyper],
parallel=parallel,
)
# if all scores are None, select a new random batch
if all([s[1] is None for s in _iter_scores]):
rand_index = np.arange(len(_hyper_remaining))
np.random.shuffle(rand_index)
rand_index = rand_index[:batch_size]
remaining_index = np.setdiff1d(
np.arange(len(_hyper_remaining)), rand_index
)
_hyper = _hyper_remaining[rand_index]
_hyper_remaining = _hyper_remaining[remaining_index]
continue
_sorted_idx = np.argsort(
[1.0 if s is None else s for _, s, _ in _iter_scores]
)
_sorted_scores = np.array(_iter_scores, dtype="object")[_sorted_idx]
_best_iter_params = np.array(
[p for p, _, _ in _sorted_scores], dtype="object"
)
_best_iter_scores = np.array(
[s for _, s, _ in _sorted_scores], dtype="object"
)
for i, (_score, _param) in enumerate(
zip(
_best_iter_scores[: self.best_width],
_best_iter_params[: self.best_width],
)
):
log.debug(
f"[size={len(_hyper)},place={i+1}] best score = {_score:.8f}; "
f"best param = {_param} [took {time() - tstart:.3f}s]"
)
scores += _iter_scores
_improv = best_score - _best_iter_scores[0]
_improv_last = last_best - _best_iter_scores[0]
if _improv > self.err_threshold:
iter_from_improv = 0
best_score = _best_iter_scores[0]
elif _improv_last < 0:
iter_from_improv += 1
last_best = _best_iter_scores[0]
if iter_from_improv > self.max_iter_improv:
break
_new_hyper = np.array([], dtype="object")
for _base_param in _best_iter_params[: self.best_width]:
_rem_pds = np.array(
[
self.__param_distance(_base_param, h)
for h in _hyper_remaining
]
)
_rem_pd_sort_idx = np.argsort(_rem_pds)
# _min_pd = np.min(_rem_pds)
_min_pd_len = (_rem_pds <= self.pd_th_min).nonzero()[0].shape[0]
_new_hyper_idx = _rem_pd_sort_idx[:_min_pd_len]
_hyper_rem_idx = np.setdiff1d(
np.arange(len(_hyper_remaining)), _new_hyper_idx
)
_new_hyper = np.concatenate(
[_new_hyper, _hyper_remaining[_new_hyper_idx]]
)
_hyper_remaining = _hyper_remaining[_hyper_rem_idx]
_hyper = _new_hyper
return scores
def __param_distance(self, param1, param2):
score = 0
for k, v in param1.items():
if param2[k] != v:
score += 1
return score
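A hedged sketch of driving the grid search above over one of the legacy estimators; the protocol, the parameter grid and the data splits are assumptions, not values from this repository:

from quapy.protocol import APP
from quapy.method.aggregative import EMQ
from sklearn.linear_model import LogisticRegression
from quacc.method.base import MCAE

# train, val: LabelledCollection splits (assumed available);
# quacc.logger.setup_logger() is expected to have been called beforehand (see the logging module above)
base = MCAE(LogisticRegression().fit(*train.Xy), EMQ(LogisticRegression()))
search = GridSearchAE(
    model=base,
    param_grid={"q__classifier__C": [0.1, 1.0, 10.0]},  # "q__" is remapped to "quantifier__"
    protocol=APP(val, n_prevalences=21, repeats=25),
    refit=False,
    n_jobs=-1,
    verbose=True,
)
search.fit(train)
best = search.best_model()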

quacc/models/base.py (new file, 144 lines)
View File

@@ -0,0 +1,144 @@
from abc import ABC, abstractmethod
from copy import deepcopy
import numpy as np
import quapy as qp
import quapy.functional as F
from quapy.protocol import UPP
from sklearn.base import BaseEstimator
from sklearn.metrics import confusion_matrix
from quacc.legacy.data import LabelledCollection
class ClassifierAccuracyPrediction(ABC):
def __init__(self, h: BaseEstimator, acc: callable):
self.h = h
self.acc = acc
@abstractmethod
def fit(self, val: LabelledCollection): ...
@abstractmethod
def predict(self, X, oracle_prev=None):
"""
Evaluates the accuracy function on the predicted contingency table
:param X: test data
:param oracle_prev: np.ndarray with the class prevalence of the test set as estimated by
an oracle. This is meant to test the effect of the errors in CAP that are explained by
the errors in quantification performance
:return: float
"""
return ...
def true_acc(self, sample: LabelledCollection):
y_pred = self.h.predict(sample.X)
y_true = sample.y
conf_table = confusion_matrix(y_true, y_pred=y_pred, labels=sample.classes_)
return self.acc(conf_table)
class SebastianiCAP(ClassifierAccuracyPrediction):
def __init__(
self, h, acc_fn, q_class, n_val_samples=500, alpha=0.3, predict_train_prev=True
):
self.h = h
self.acc = acc_fn
self.q = q_class(h)
self.n_val_samples = n_val_samples
self.alpha = alpha
self.sample_size = qp.environ["SAMPLE_SIZE"]
self.predict_train_prev = predict_train_prev
def fit(self, val: LabelledCollection):
v2, v1 = val.split_stratified(train_prop=0.5)
self.q.fit(v1, fit_classifier=False, val_split=v1)
# precompute classifier predictions on samples
gen_samples = UPP(
v2,
repeats=self.n_val_samples,
sample_size=self.sample_size,
return_type="labelled_collection",
)
self.sigma_acc = [self.true_acc(sigma_i) for sigma_i in gen_samples()]
# precompute prevalence predictions on samples
if self.predict_train_prev:
gen_samples.on_preclassified_instances(self.q.classify(v2.X), in_place=True)
self.sigma_pred_prevs = [
self.q.aggregate(sigma_i.X) for sigma_i in gen_samples()
]
else:
self.sigma_pred_prevs = [sigma_i.prevalence() for sigma_i in gen_samples()]
def predict(self, X, oracle_prev=None):
if oracle_prev is None:
test_pred_prev = self.q.quantify(X)
else:
test_pred_prev = oracle_prev
if self.alpha > 0:
# select samples from V2 with predicted prevalence close to the predicted prevalence for U
selected_accuracies = []
for pred_prev_i, acc_i in zip(self.sigma_pred_prevs, self.sigma_acc):
max_discrepancy = np.max(np.abs(pred_prev_i - test_pred_prev))
if max_discrepancy < self.alpha:
selected_accuracies.append(acc_i)
return np.median(selected_accuracies)
else:
            # weighted average: weight each V2 sample by how close its predicted prevalence is to that of U
accum_weight = 0
moving_mean = 0
epsilon = 10e-4
for pred_prev_i, acc_i in zip(self.sigma_pred_prevs, self.sigma_acc):
max_discrepancy = np.max(np.abs(pred_prev_i - test_pred_prev))
weight = -np.log(max_discrepancy + epsilon)
accum_weight += weight
moving_mean += weight * acc_i
return moving_mean / accum_weight
class PabloCAP(ClassifierAccuracyPrediction):
def __init__(self, h, acc_fn, q_class, n_val_samples=100, aggr="mean"):
self.h = h
self.acc = acc_fn
self.q = q_class(deepcopy(h))
self.n_val_samples = n_val_samples
self.aggr = aggr
assert aggr in [
"mean",
"median",
], "unknown aggregation function, use mean or median"
def fit(self, val: LabelledCollection):
self.q.fit(val)
label_predictions = self.h.predict(val.X)
self.pre_classified = LabelledCollection(
instances=label_predictions, labels=val.labels
)
def predict(self, X, oracle_prev=None):
if oracle_prev is None:
pred_prev = F.smooth(self.q.quantify(X))
else:
pred_prev = oracle_prev
X_size = X.shape[0]
acc_estim = []
for _ in range(self.n_val_samples):
sigma_i = self.pre_classified.sampling(X_size, *pred_prev[:-1])
y_pred, y_true = sigma_i.Xy
conf_table = confusion_matrix(
y_true, y_pred=y_pred, labels=sigma_i.classes_
)
acc_i = self.acc(conf_table)
acc_estim.append(acc_i)
if self.aggr == "mean":
return np.mean(acc_estim)
elif self.aggr == "median":
return np.median(acc_estim)
else:
raise ValueError("unknown aggregation function")

128
quacc/models/baselines.py Normal file
View File

@@ -0,0 +1,128 @@
import numpy as np
from quapy.data.base import LabelledCollection
from quapy.protocol import UPP
from sklearn.linear_model import LinearRegression
from quacc.models.base import ClassifierAccuracyPrediction
from quacc.models.utils import get_posteriors_from_h, max_conf, neg_entropy
class ATC(ClassifierAccuracyPrediction):
VALID_FUNCTIONS = {"maxconf", "neg_entropy"}
def __init__(self, h, acc_fn, scoring_fn="maxconf"):
assert (
scoring_fn in ATC.VALID_FUNCTIONS
), f"unknown scoring function, use any from {ATC.VALID_FUNCTIONS}"
# assert acc_fn == 'vanilla_accuracy', \
        #     'use acc_fn=="vanilla_accuracy"; other metrics are not yet tested in ATC'
self.h = h
self.acc_fn = acc_fn
self.scoring_fn = scoring_fn
def get_scores(self, P):
if self.scoring_fn == "maxconf":
scores = max_conf(P)
else:
scores = neg_entropy(P)
return scores
def fit(self, val: LabelledCollection):
P = get_posteriors_from_h(self.h, val.X)
pred_labels = np.argmax(P, axis=1)
true_labels = val.y
scores = self.get_scores(P)
_, self.threshold = self.__find_ATC_threshold(
scores=scores, labels=(pred_labels == true_labels)
)
def predict(self, X, oracle_prev=None):
P = get_posteriors_from_h(self.h, X)
scores = self.get_scores(P)
# assert self.acc_fn == 'vanilla_accuracy', \
        #     'use acc_fn=="vanilla_accuracy"; other metrics are not yet tested in ATC'
return self.__get_ATC_acc(self.threshold, scores)
def __find_ATC_threshold(self, scores, labels):
# code copy-pasted from https://github.com/saurabhgarg1996/ATC_code/blob/master/ATC_helper.py
sorted_idx = np.argsort(scores)
sorted_scores = scores[sorted_idx]
sorted_labels = labels[sorted_idx]
fp = np.sum(labels == 0)
fn = 0.0
min_fp_fn = np.abs(fp - fn)
thres = 0.0
for i in range(len(labels)):
if sorted_labels[i] == 0:
fp -= 1
else:
fn += 1
if np.abs(fp - fn) < min_fp_fn:
min_fp_fn = np.abs(fp - fn)
thres = sorted_scores[i]
return min_fp_fn, thres
def __get_ATC_acc(self, thres, scores):
# code copy-pasted from https://github.com/saurabhgarg1996/ATC_code/blob/master/ATC_helper.py
return np.mean(scores >= thres)
class DoC(ClassifierAccuracyPrediction):
def __init__(self, h, acc, sample_size, num_samples=500, clip_vals=(0, 1)):
self.h = h
self.acc = acc
self.sample_size = sample_size
self.num_samples = num_samples
self.clip_vals = clip_vals
def _get_post_stats(self, X, y):
P = get_posteriors_from_h(self.h, X)
mc = max_conf(P)
pred_labels = np.argmax(P, axis=-1)
acc = self.acc(y, pred_labels)
return mc, acc
def _doc(self, mc1, mc2):
return mc2.mean() - mc1.mean()
def train_regression(self, v2_mcs, v2_accs):
docs = [self._doc(self.v1_mc, v2_mc_i) for v2_mc_i in v2_mcs]
target = [self.v1_acc - v2_acc_i for v2_acc_i in v2_accs]
docs = np.asarray(docs).reshape(-1, 1)
target = np.asarray(target)
lin_reg = LinearRegression()
return lin_reg.fit(docs, target)
def predict_regression(self, test_mc):
docs = np.asarray([self._doc(self.v1_mc, test_mc)]).reshape(-1, 1)
pred_acc = self.reg_model.predict(docs)
return self.v1_acc - pred_acc
def fit(self, val: LabelledCollection):
v1, v2 = val.split_stratified(train_prop=0.5, random_state=0)
self.v1_mc, self.v1_acc = self._get_post_stats(*v1.Xy)
v2_prot = UPP(
v2,
sample_size=self.sample_size,
repeats=self.num_samples,
return_type="labelled_collection",
)
v2_stats = [self._get_post_stats(*sample.Xy) for sample in v2_prot()]
v2_mcs, v2_accs = list(zip(*v2_stats))
self.reg_model = self.train_regression(v2_mcs, v2_accs)
def predict(self, X, oracle_prev=None):
P = get_posteriors_from_h(self.h, X)
mc = max_conf(P)
acc_pred = self.predict_regression(mc)[0]
if self.clip_vals is not None:
acc_pred = np.clip(acc_pred, *self.clip_vals)
return acc_pred
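
The two baselines above follow the same fit-on-validation / predict-on-test protocol. Below is a minimal usage sketch, assuming a fitted scikit-learn classifier and a vanilla-accuracy callable with the (y_true, y_pred) signature that DoC expects; the synthetic data and split sizes are illustrative only and are not part of this diff.

# Hedged sketch: wiring ATC and DoC to a scikit-learn classifier.
import numpy as np
from quapy.data.base import LabelledCollection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

from quacc.models.baselines import ATC, DoC

# toy binary data (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

train = LabelledCollection(X[:600], y[:600])
val = LabelledCollection(X[600:800], y[600:800])
test = LabelledCollection(X[800:], y[800:])

h = LogisticRegression().fit(*train.Xy)
acc_fn = accuracy_score  # acc(y_true, y_pred), as DoC calls it

atc = ATC(h, acc_fn, scoring_fn="maxconf")
atc.fit(val)                                   # learns the score threshold on validation
print("ATC estimate:", atc.predict(test.X))

doc = DoC(h, acc_fn, sample_size=100, num_samples=50)
doc.fit(val)                                   # fits the DoC regression on validation samples
print("DoC estimate:", doc.predict(test.X))
print("true test accuracy:", acc_fn(test.y, h.predict(test.X)))
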

505
quacc/models/cont_table.py Normal file

@ -0,0 +1,505 @@
from abc import abstractmethod
from copy import deepcopy
import numpy as np
import quapy.functional as F
import scipy
from quapy.data.base import LabelledCollection as LC
from quapy.method.aggregative import AggregativeQuantifier
from quapy.method.base import BaseQuantifier
from scipy.sparse import csr_matrix, issparse
from sklearn.base import BaseEstimator
from sklearn.metrics import confusion_matrix
from quacc.models.base import ClassifierAccuracyPrediction
from quacc.models.utils import get_posteriors_from_h, max_conf, neg_entropy
class LabelledCollection(LC):
def empty_classes(self):
"""
Returns a np.ndarray of empty classes (classes present in self.classes_ but with
no positive instance). In case there is none, then an empty np.ndarray is returned
:return: np.ndarray
"""
idx = np.argwhere(self.counts() == 0).flatten()
return self.classes_[idx]
def non_empty_classes(self):
"""
Returns a np.ndarray of non-empty classes (classes present in self.classes_ but with
at least one positive instance). In case there is none, then an empty np.ndarray is returned
:return: np.ndarray
"""
idx = np.argwhere(self.counts() > 0).flatten()
return self.classes_[idx]
def has_empty_classes(self):
"""
Checks whether the collection has empty classes
:return: boolean
"""
return len(self.empty_classes()) > 0
def compact_classes(self):
"""
Generates a new LabelledCollection object with no empty classes. It also returns a np.ndarray with
the indexes that the retained (non-empty) classes had in the original self.classes_.
:return: (LabelledCollection, np.ndarray,)
"""
non_empty = self.non_empty_classes()
all_classes = self.classes_
old_pos = np.searchsorted(all_classes, non_empty)
non_empty_collection = LabelledCollection(*self.Xy, classes=non_empty)
return non_empty_collection, old_pos
class CAPContingencyTable(ClassifierAccuracyPrediction):
def __init__(self, h: BaseEstimator, acc: callable):
self.h = h
self.acc = acc
def predict(self, X, oracle_prev=None):
"""
Evaluates the accuracy function on the predicted contingency table
:param X: test data
:param oracle_prev: np.ndarray with the class prevalence of the test set as estimated by
an oracle. This is meant to test the effect of the errors in CAP that are explained by
the errors in quantification performance
:return: float
"""
cont_table = self.predict_ct(X, oracle_prev)
raw_acc = self.acc(cont_table)
norm_acc = np.clip(raw_acc, 0, 1)
return norm_acc
@abstractmethod
def predict_ct(self, X, oracle_prev=None):
"""
Predicts the contingency table for the test data
:param X: test data
:param oracle_prev: np.ndarray with the class prevalence of the test set as estimated by
an oracle. This is meant to test the effect of the errors in CAP that are explained by
the errors in quantification performance
:return: a contingency table
"""
...
class NaiveCAP(CAPContingencyTable):
"""
The Naive CAP is a method that relies on the IID assumption, and thus uses the estimation in the validation data
as an estimate for the test data.
"""
def __init__(self, h: BaseEstimator, acc: callable):
super().__init__(h, acc)
def fit(self, val: LabelledCollection):
y_hat = self.h.predict(val.X)
y_true = val.y
self.cont_table = confusion_matrix(y_true, y_pred=y_hat, labels=val.classes_)
return self
def predict_ct(self, test, oracle_prev=None):
"""
This method disregards the test set, under the assumption that it is IID wrt the training. This means that
the confusion matrix for the test data should coincide with the one computed for training (using any cross
validation strategy).
:param test: test collection (ignored)
:param oracle_prev: ignored
:return: a confusion matrix in the return format of `sklearn.metrics.confusion_matrix`
"""
return self.cont_table
class CAPContingencyTableQ(CAPContingencyTable):
def __init__(
self,
h: BaseEstimator,
acc: callable,
q_class: AggregativeQuantifier,
reuse_h=False,
):
super().__init__(h, acc)
self.reuse_h = reuse_h
if reuse_h:
assert isinstance(
q_class, AggregativeQuantifier
), f"quantifier {q_class} is not of type aggregative"
self.q = deepcopy(q_class)
self.q.set_params(classifier=h)
else:
self.q = q_class
def quantifier_fit(self, val: LabelledCollection):
if self.reuse_h:
self.q.fit(val, fit_classifier=False, val_split=val)
else:
self.q.fit(val)
class ContTableTransferCAP(CAPContingencyTableQ):
""" """
def __init__(self, h: BaseEstimator, acc: callable, q_class, reuse_h=False):
super().__init__(h, acc, q_class, reuse_h)
def fit(self, val: LabelledCollection):
y_hat = self.h.predict(val.X)
y_true = val.y
self.cont_table = confusion_matrix(
y_true=y_true, y_pred=y_hat, labels=val.classes_, normalize="all"
)
self.train_prev = val.prevalence()
self.quantifier_fit(val)
return self
def predict_ct(self, test, oracle_prev=None):
"""
:param test: test collection, used (when oracle_prev is None) to quantify the test prevalence
:param oracle_prev: np.ndarray with the class prevalence of the test set as estimated by
an oracle. This is meant to test the effect of the errors in CAP that are explained by
the errors in quantification performance
:return: a confusion matrix in the return format of `sklearn.metrics.confusion_matrix`
"""
if oracle_prev is None:
prev_hat = self.q.quantify(test)
else:
prev_hat = oracle_prev
adjustment = prev_hat / self.train_prev
return self.cont_table * adjustment[:, np.newaxis]
class NsquaredEquationsCAP(CAPContingencyTableQ):
""" """
def __init__(self, h: BaseEstimator, acc: callable, q_class, reuse_h=False):
super().__init__(h, acc, q_class, reuse_h)
def fit(self, val: LabelledCollection):
y_hat = self.h.predict(val.X)
y_true = val.y
self.cont_table = confusion_matrix(y_true, y_pred=y_hat, labels=val.classes_)
self.quantifier_fit(val)
self.A, self.partial_b = self._construct_equations()
return self
def _construct_equations(self):
# we need a n x n matrix of unknowns
n = self.cont_table.shape[1]
# I is the matrix of indexes of unknowns. For example, if we need the counts of
# all instances belonging to class i that have been classified as belonging to 0, 1, ..., n-1:
# the indexes of the corresponding unknowns are given by I[i,:]
I = np.arange(n * n).reshape(n, n)
# system of equations: Ax=b, A.shape=(n*n, n*n,), b.shape=(n*n,)
A = np.zeros(shape=(n * n, n * n))
b = np.zeros(shape=(n * n))
# first equation: the sum of all unknowns is 1
eq_no = 0
A[eq_no, :] = 1
b[eq_no] = 1
eq_no += 1
# (n-1)*(n-1) equations: the class cond ratios should be the same in training and in test due to the
# PPS assumptions. Example in three classes, a ratio: a/(a+b+c) [test] = ar [a ratio in training]
# a / (a + b + c) = ar
# a = (a + b + c) * ar
# a = a ar + b ar + c ar
# a - a ar - b ar - c ar = 0
# a (1-ar) + b (-ar) + c (-ar) = 0
class_cond_ratios_tr = self.cont_table / self.cont_table.sum(
axis=1, keepdims=True
)
for i in range(1, n):
for j in range(1, n):
ratio_ij = class_cond_ratios_tr[i, j]
A[eq_no, I[i, :]] = -ratio_ij
A[eq_no, I[i, j]] = 1 - ratio_ij
b[eq_no] = 0
eq_no += 1
# n-1 equations: the sum of class-cond counts must equal the C&C prevalence prediction
for i in range(1, n):
A[eq_no, I[:, i]] = 1
# b[eq_no] = cc_prev_estim[i]
eq_no += 1
# n-1 equations: the sum of class-conditional counts for each true class must equal the quantified class prevalence in test
for i in range(1, n):
A[eq_no, I[i, :]] = 1
# b[eq_no] = q_prev_estim[i]
eq_no += 1
return A, b
def predict_ct(self, test, oracle_prev):
"""
:param test: test instances, used to compute the classify-and-count and quantifier prevalence estimates
:param oracle_prev: np.ndarray with the class prevalence of the test set as estimated by
an oracle. This is meant to test the effect of the errors in CAP that are explained by
the errors in quantification performance
:return: a confusion matrix in the return format of `sklearn.metrics.confusion_matrix`
"""
n = self.cont_table.shape[1]
h_label_preds = self.h.predict(test)
cc_prev_estim = F.prevalence_from_labels(h_label_preds, self.h.classes_)
if oracle_prev is None:
q_prev_estim = self.q.quantify(test)
else:
q_prev_estim = oracle_prev
A = self.A
b = self.partial_b
# b is partially filled; we finish the vector by plugging in the classify and count
# prevalence estimates (n-1 values only), and the quantification estimates (n-1 values only)
b[-2 * (n - 1) : -(n - 1)] = cc_prev_estim[1:]
b[-(n - 1) :] = q_prev_estim[1:]
# try the fast solution (may not be valid)
x = np.linalg.solve(A, b)
if any(x < 0) or any(x > 0) or not np.isclose(x.sum(), 1):
print("L", end="")
# try the iterative solution
def loss(x):
return np.linalg.norm(A @ x - b, ord=2)
x = F.optim_minimize(loss, n_classes=n**2)
else:
print(".", end="")
cont_table_test = x.reshape(n, n)
return cont_table_test
class QuAcc:
def _get_X_dot(self, X):
h = self.h
P = get_posteriors_from_h(h, X)
add_covs = []
if self.add_posteriors:
add_covs.append(P[:, 1:])
if self.add_maxconf:
mc = max_conf(P, keepdims=True)
add_covs.append(mc)
if self.add_negentropy:
ne = neg_entropy(P, keepdims=True)
add_covs.append(ne)
if self.add_maxinfsoft:
lgP = np.log(P)
mis = np.max(lgP - lgP.mean(axis=1, keepdims=True), axis=1, keepdims=True)
add_covs.append(mis)
if len(add_covs) > 0:
X_dot = np.hstack(add_covs)
if self.add_X:
X_dot = safehstack(X, X_dot)
return X_dot
class QuAcc1xN2(CAPContingencyTableQ, QuAcc):
def __init__(
self,
h: BaseEstimator,
acc: callable,
q_class: AggregativeQuantifier,
add_X=True,
add_posteriors=True,
add_maxconf=False,
add_negentropy=False,
add_maxinfsoft=False,
):
self.h = h
self.acc = acc
self.q = EmptySafeQuantifier(q_class)
self.add_X = add_X
self.add_posteriors = add_posteriors
self.add_maxconf = add_maxconf
self.add_negentropy = add_negentropy
self.add_maxinfsoft = add_maxinfsoft
def fit(self, val: LabelledCollection):
pred_labels = self.h.predict(val.X)
true_labels = val.y
self.ncl = val.n_classes
classes_dot = np.arange(self.ncl**2)
ct_class_idx = classes_dot.reshape(self.ncl, self.ncl)
X_dot = self._get_X_dot(val.X)
y_dot = ct_class_idx[true_labels, pred_labels]
val_dot = LabelledCollection(X_dot, y_dot, classes=classes_dot)
self.q.fit(val_dot)
def predict_ct(self, X, oracle_prev=None):
X_dot = self._get_X_dot(X)
flat_ct = self.q.quantify(X_dot)
return flat_ct.reshape(self.ncl, self.ncl)
class QuAcc1xNp1(CAPContingencyTableQ, QuAcc):
def __init__(
self,
h: BaseEstimator,
acc: callable,
q_class: AggregativeQuantifier,
add_X=True,
add_posteriors=True,
add_maxconf=False,
add_negentropy=False,
add_maxinfsoft=False,
):
self.h = h
self.acc = acc
self.q = EmptySafeQuantifier(q_class)
self.add_X = add_X
self.add_posteriors = add_posteriors
self.add_maxconf = add_maxconf
self.add_negentropy = add_negentropy
self.add_maxinfsoft = add_maxinfsoft
def fit(self, val: LabelledCollection):
pred_labels = self.h.predict(val.X)
true_labels = val.y
self.ncl = val.n_classes
classes_dot = np.arange(self.ncl + 1)
# ct_class_idx = classes_dot.reshape(n, n)
ct_class_idx = np.full((self.ncl, self.ncl), self.ncl)
ct_class_idx[np.diag_indices(self.ncl)] = np.arange(self.ncl)
X_dot = self._get_X_dot(val.X)
y_dot = ct_class_idx[true_labels, pred_labels]
val_dot = LabelledCollection(X_dot, y_dot, classes=classes_dot)
self.q.fit(val_dot)
def _get_ct_hat(self, n, ct_compressed):
_diag_idx = np.diag_indices(n)
ct_rev_idx = (np.append(_diag_idx[0], 0), np.append(_diag_idx[1], 1))
ct_hat = np.zeros((n, n))
ct_hat[ct_rev_idx] = ct_compressed
return ct_hat
def predict_ct(self, X: LabelledCollection, oracle_prev=None):
X_dot = self._get_X_dot(X)
ct_compressed = self.q.quantify(X_dot)
return self._get_ct_hat(self.ncl, ct_compressed)
class QuAccNxN(CAPContingencyTableQ, QuAcc):
def __init__(
self,
h: BaseEstimator,
acc: callable,
q_class: AggregativeQuantifier,
add_X=True,
add_posteriors=True,
add_maxconf=False,
add_negentropy=False,
add_maxinfsoft=False,
):
self.h = h
self.acc = acc
self.q_class = q_class
self.add_X = add_X
self.add_posteriors = add_posteriors
self.add_maxconf = add_maxconf
self.add_negentropy = add_negentropy
self.add_maxinfsoft = add_maxinfsoft
def fit(self, val: LabelledCollection):
pred_labels = self.h.predict(val.X)
true_labels = val.y
X_dot = self._get_X_dot(val.X)
self.q = []
for class_i in self.h.classes_:
X_dot_i = X_dot[pred_labels == class_i]
y_i = true_labels[pred_labels == class_i]
data_i = LabelledCollection(X_dot_i, y_i, classes=val.classes_)
q_i = EmptySafeQuantifier(deepcopy(self.q_class))
q_i.fit(data_i)
self.q.append(q_i)
def predict_ct(self, X, oracle_prev=None):
classes = self.h.classes_
pred_labels = self.h.predict(X)
X_dot = self._get_X_dot(X)
pred_prev = F.prevalence_from_labels(pred_labels, classes)
cont_table = []
for class_i, q_i, p_i in zip(classes, self.q, pred_prev):
X_dot_i = X_dot[pred_labels == class_i]
classcond_cond_table_prevs = q_i.quantify(X_dot_i)
cond_table_prevs = p_i * classcond_cond_table_prevs
cont_table.append(cond_table_prevs)
cont_table = np.vstack(cont_table)
return cont_table
def safehstack(X, P):
if issparse(X) or issparse(P):
XP = scipy.sparse.hstack([X, P])
XP = csr_matrix(XP)
else:
XP = np.hstack([X, P])
return XP
class EmptySafeQuantifier(BaseQuantifier):
def __init__(self, surrogate_quantifier: BaseQuantifier):
self.surrogate = surrogate_quantifier
def fit(self, data: LabelledCollection):
self.n_classes = data.n_classes
class_compact_data, self.old_class_idx = data.compact_classes()
if self.num_non_empty_classes() > 1:
self.surrogate.fit(class_compact_data)
return self
def quantify(self, instances):
num_instances = instances.shape[0]
if self.num_non_empty_classes() == 0 or num_instances == 0:
# returns the uniform prevalence vector
uniform = np.full(
fill_value=1.0 / self.n_classes, shape=self.n_classes, dtype=float
)
return uniform
elif self.num_non_empty_classes() == 1:
# returns a prevalence vector with 100% of the mass in the only non empty class
prev_vector = np.full(fill_value=0.0, shape=self.n_classes, dtype=float)
prev_vector[self.old_class_idx[0]] = 1
return prev_vector
else:
class_compact_prev = self.surrogate.quantify(instances)
prev_vector = np.full(fill_value=0.0, shape=self.n_classes, dtype=float)
prev_vector[self.old_class_idx] = class_compact_prev
return prev_vector
def num_non_empty_classes(self):
return len(self.old_class_idx)
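
As a reading aid for NsquaredEquationsCAP._construct_equations above, the in-code comments can be restated compactly. Writing x_ij for the unknown (normalized) count of test instances with true class i predicted as class j, and r_ij for the training class-conditional ratio cont_table[i,j] / sum_k cont_table[i,k], the square system Ax = b encodes:

\begin{aligned}
\sum_{i,j} x_{ij} &= 1 \\
x_{ij} - r_{ij}\sum_{k} x_{ik} &= 0 && \text{for } i, j \neq 0 \quad \text{(PPS: class-conditional ratios preserved)} \\
\sum_{k} x_{kj} &= \hat{p}^{\mathrm{cc}}_{j} && \text{for } j \neq 0 \quad \text{(classify-and-count prevalence of predicted class } j\text{)} \\
\sum_{k} x_{ik} &= \hat{p}^{\mathrm{q}}_{i} && \text{for } i \neq 0 \quad \text{(quantifier estimate of true class } i\text{)}
\end{aligned}

That is 1 + (n-1)^2 + 2(n-1) = n^2 equations in n^2 unknowns; when the direct solve falls outside the simplex, predict_ct falls back to the constrained minimization via F.optim_minimize.

A short usage sketch for the contingency-table predictors in this file, assuming quapy's EMQ (SLD) quantifier; vanilla_acc is a hypothetical helper (the classes only require that acc(cont_table) returns a float), and the toy data is illustrative.

# Hedged sketch: accuracy estimated from a predicted contingency table.
import numpy as np
from quapy.data.base import LabelledCollection
from quapy.method.aggregative import EMQ
from sklearn.linear_model import LogisticRegression

from quacc.models.cont_table import ContTableTransferCAP, NaiveCAP, QuAcc1xN2

def vanilla_acc(cont_table: np.ndarray) -> float:
    # accuracy = trace over total mass (works for counts or normalized tables)
    return np.diag(cont_table).sum() / cont_table.sum()

# toy binary data (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=1500) > 0).astype(int)
train = LabelledCollection(X[:800], y[:800])
val = LabelledCollection(X[800:1200], y[800:1200])
Xte = X[1200:]

h = LogisticRegression().fit(*train.Xy)

naive = NaiveCAP(h, vanilla_acc).fit(val)
transfer = ContTableTransferCAP(h, vanilla_acc, q_class=EMQ(LogisticRegression())).fit(val)
quacc_1xn2 = QuAcc1xN2(h, vanilla_acc, q_class=EMQ(LogisticRegression()))
quacc_1xn2.fit(val)

for name, method in [("naive", naive), ("transfer", transfer), ("quacc 1xn2", quacc_1xn2)]:
    print(name, method.predict(Xte))
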

26
quacc/models/utils.py Normal file

@ -0,0 +1,26 @@
import numpy as np
import scipy
def get_posteriors_from_h(h, X):
if hasattr(h, "predict_proba"):
P = h.predict_proba(X)
else:
n_classes = len(h.classes_)
dec_scores = h.decision_function(X)
if n_classes == 2:  # binary case: decision_function returns 1-D scores
dec_scores = np.vstack([-dec_scores, dec_scores]).T
P = scipy.special.softmax(dec_scores, axis=1)
return P
def max_conf(P, keepdims=False):
mc = P.max(axis=1, keepdims=keepdims)
return mc
def neg_entropy(P, keepdims=False):
ne = -scipy.stats.entropy(P, axis=1)  # negated so that higher values mean higher confidence
if keepdims:
ne = ne.reshape(-1, 1)
return ne
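
A small sketch of the decision_function fallback in get_posteriors_from_h above: classifiers without predict_proba (e.g., LinearSVC) expose margin scores, which are turned into pseudo-posteriors with a softmax (stacking [-s, s] in the binary case). The toy data is illustrative.

# Hedged sketch: pseudo-posteriors for a margin classifier.
import numpy as np
from sklearn.svm import LinearSVC

from quacc.models.utils import get_posteriors_from_h, max_conf, neg_entropy

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

svm = LinearSVC().fit(X, y)          # has decision_function but no predict_proba

P = get_posteriors_from_h(svm, X)    # softmax over the stacked [-s, s] scores
assert np.allclose(P.sum(axis=1), 1.0)

mc = max_conf(P)                     # per-instance maximum posterior
ne = neg_entropy(P)                  # per-instance (negative-)entropy score
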


@ -1,7 +1,7 @@
from quacc.plot.plot import (
from quacc.legacy.plot.plot import (
get_backend,
plot_delta,
plot_diagonal,
plot_shift,
plot_fit_scores,
plot_shift,
)


@ -1,68 +0,0 @@
from pathlib import Path
class BasePlot:
@classmethod
def save_fig(cls, fig, base_path, title) -> Path:
...
@classmethod
def plot_diagonal(
cls,
reference,
columns,
data,
*,
pos_class=1,
title="default",
x_label="true",
y_label="estim.",
fixed_lim=False,
legend=True,
):
...
@classmethod
def plot_delta(
cls,
base_prevs,
columns,
data,
*,
stdevs=None,
pos_class=1,
title="default",
x_label="prevs.",
y_label="error",
legend=True,
):
...
@classmethod
def plot_shift(
cls,
shift_prevs,
columns,
data,
*,
counts=None,
pos_class=1,
title="default",
x_label="true",
y_label="estim.",
legend=True,
):
...
@classmethod
def plot_fit_scores(
train_prevs,
scores,
*,
pos_class=1,
title="default",
x_label="prev.",
y_label="position",
legend=True,
):
...

149
quacc/plot/matplotlib.py Normal file

@ -0,0 +1,149 @@
import os
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
from cycler import cycler
from matplotlib.figure import Figure
from quacc.plot.utils import _get_ref_limits
from quacc.utils.commons import get_plots_path
def _get_markers(num: int):
ls = "ovx+sDph*^1234X><.Pd"
if num > len(ls):
ls = ls * (num // len(ls) + 1)
return list(ls)[:num]
def _get_cycler(num):
cm = plt.get_cmap("tab20") if num > 10 else plt.get_cmap("tab10")
return cycler(color=[cm(i) for i in range(num)])
def _save_or_return(
fig: Figure, basedir, cls_name, acc_name, dataset_name, plot_type
) -> Figure | None:
if basedir is None:
return fig
plotsubdir = "all" if dataset_name == "*" else dataset_name
file = get_plots_path(basedir, cls_name, acc_name, plotsubdir, plot_type)
os.makedirs(Path(file).parent, exist_ok=True)
fig.savefig(file)
def plot_diagonal(
method_names: list[str],
true_accs: np.ndarray,
estim_accs: np.ndarray,
cls_name,
acc_name,
dataset_name,
*,
basedir=None,
):
fig, ax = plt.subplots()
ax.grid()
ax.set_aspect("equal")
cy = _get_cycler(len(method_names))
for name, x, estim, _cy in zip(method_names, true_accs, estim_accs, cy):
ax.plot(
x,
estim,
label=name,
color=_cy["color"],
linestyle="None",
marker="o",
markersize=3,
zorder=2,
alpha=0.25,
)
# ensure limits are equal for both axes
_lims = _get_ref_limits(true_accs, estim_accs)
ax.set(xlim=_lims[0], ylim=_lims[1])
# draw polyfit line per method
# for name, x, estim, _cy in zip(method_names, true_accs, estim_accs, cy):
# slope, interc = np.polyfit(x, estim, 1)
# y_lr = np.array([slope * x + interc for x in _lims])
# ax.plot(
# _lims,
# y_lr,
# label=name,
# color=_cy["color"],
# linestyle="-",
# markersize="0",
# zorder=1,
# )
# plot reference line
ax.plot(
_lims,
_lims,
color="black",
linestyle="--",
markersize=0,
zorder=1,
)
ax.set(xlabel=f"True {acc_name}", ylabel=f"Estimated {acc_name}")
ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
return _save_or_return(fig, basedir, cls_name, acc_name, dataset_name, "diagonal")
def plot_delta(
method_names: list[str],
prevs: np.ndarray,
acc_errs: np.ndarray,
cls_name,
acc_name,
dataset_name,
prev_name,
*,
stdevs: np.ndarray | None = None,
basedir=None,
):
fig, ax = plt.subplots()
ax.set_aspect("auto")
ax.grid()
cy = _get_cycler(len(method_names))
x = [str(bp) for bp in prevs]
if stdevs is None:
stdevs = [None] * len(method_names)
for name, delta, stdev, _cy in zip(method_names, acc_errs, stdevs, cy):
ax.plot(
x,
delta,
label=name,
color=_cy["color"],
linestyle="-",
marker="",
markersize=3,
zorder=2,
)
if stdev is not None:
ax.fill_between(
prevs,
delta - stdev,
delta + stdev,
color=_cy["color"],
alpha=0.25,
)
ax.set(
xlabel=f"{prev_name} Prevalence",
ylabel=f"Prediction Error for {acc_name}",
)
ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
return fig
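
A brief sketch of how the matplotlib backend above is meant to be called: with basedir=None the Figure is returned, while a non-None basedir routes the file name through get_plots_path (added later in this diff) and saves an SVG. The per-method row layout of the arrays is an assumption based on how plot_diagonal zips its arguments.

# Hedged sketch: one diagonal plot comparing two methods.
import numpy as np
from quacc.plot.matplotlib import plot_diagonal

rng = np.random.default_rng(0)
true_accs = rng.uniform(0.5, 1.0, size=(2, 100))                          # one row per method
estim_accs = np.clip(true_accs + rng.normal(0, 0.05, true_accs.shape), 0, 1)

fig = plot_diagonal(
    method_names=["ATC", "DoC"],
    true_accs=true_accs,
    estim_accs=estim_accs,
    cls_name="LR",
    acc_name="vanilla accuracy",
    dataset_name="imdb",
    basedir=None,   # return the Figure; pass e.g. "main" to save it instead
)
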


@ -1,238 +0,0 @@
from pathlib import Path
from re import X
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from cycler import cycler
from sklearn import base
from quacc import utils
from quacc.plot.base import BasePlot
matplotlib.use("agg")
class MplPlot(BasePlot):
def _get_markers(self, n: int):
ls = "ovx+sDph*^1234X><.Pd"
if n > len(ls):
ls = ls * (n / len(ls) + 1)
return list(ls)[:n]
def save_fig(self, fig, base_path, title) -> Path:
if base_path is None:
base_path = utils.get_quacc_home() / "plots"
output_path = base_path / f"{title}.png"
fig.savefig(output_path, bbox_inches="tight")
return output_path
def plot_delta(
self,
base_prevs,
columns,
data,
*,
stdevs=None,
pos_class=1,
title="default",
x_label="prevs.",
y_label="error",
legend=True,
):
fig, ax = plt.subplots()
ax.set_aspect("auto")
ax.grid()
NUM_COLORS = len(data)
cm = plt.get_cmap("tab10")
if NUM_COLORS > 10:
cm = plt.get_cmap("tab20")
cy = cycler(color=[cm(i) for i in range(NUM_COLORS)])
# base_prevs = base_prevs[:, pos_class]
if isinstance(base_prevs[0], float):
base_prevs = np.around([(1 - bp, bp) for bp in base_prevs], decimals=4)
str_base_prevs = [str(tuple(bp)) for bp in base_prevs]
# xticks = [str(bp) for bp in base_prevs]
xticks = np.arange(len(base_prevs))
for method, deltas, _cy in zip(columns, data, cy):
ax.plot(
xticks,
deltas,
label=method,
color=_cy["color"],
linestyle="-",
marker="o",
markersize=3,
zorder=2,
)
if stdevs is not None:
_col_idx = np.where(columns == method)[0]
stdev = stdevs[_col_idx].flatten()
nn_idx = np.intersect1d(
np.where(deltas != np.nan)[0],
np.where(stdev != np.nan)[0],
)
_bps, _ds, _st = xticks[nn_idx], deltas[nn_idx], stdev[nn_idx]
ax.fill_between(
_bps,
_ds - _st,
_ds + _st,
color=_cy["color"],
alpha=0.25,
)
def format_fn(tick_val, tick_pos):
if int(tick_val) in xticks:
return str_base_prevs[int(tick_val)]
return ""
ax.xaxis.set_major_locator(plt.MaxNLocator(nbins=6, integer=True, prune="both"))
ax.xaxis.set_major_formatter(format_fn)
ax.set(
xlabel=f"{x_label} prevalence",
ylabel=y_label,
title=title,
)
if legend:
ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
return fig
def plot_diagonal(
self,
reference,
columns,
data,
*,
pos_class=1,
title="default",
x_label="true",
y_label="estim.",
legend=True,
):
fig, ax = plt.subplots()
ax.set_aspect("auto")
ax.grid()
ax.set_aspect("equal")
NUM_COLORS = len(data)
cm = plt.get_cmap("tab10")
if NUM_COLORS > 10:
cm = plt.get_cmap("tab20")
cy = cycler(
color=[cm(i) for i in range(NUM_COLORS)],
marker=self._get_markers(NUM_COLORS),
)
reference = np.array(reference)
x_ticks = np.unique(reference)
x_ticks.sort()
for deltas, _cy in zip(data, cy):
ax.plot(
reference,
deltas,
color=_cy["color"],
linestyle="None",
marker=_cy["marker"],
markersize=3,
zorder=2,
alpha=0.25,
)
# ensure limits are equal for both axes
_alims = np.stack(((ax.get_xlim(), ax.get_ylim())), axis=-1)
_lims = np.array([f(ls) for f, ls in zip([np.min, np.max], _alims)])
ax.set(xlim=tuple(_lims), ylim=tuple(_lims))
for method, deltas, _cy in zip(columns, data, cy):
slope, interc = np.polyfit(reference, deltas, 1)
y_lr = np.array([slope * x + interc for x in _lims])
ax.plot(
_lims,
y_lr,
label=method,
color=_cy["color"],
linestyle="-",
markersize="0",
zorder=1,
)
# plot reference line
ax.plot(
_lims,
_lims,
color="black",
linestyle="--",
markersize=0,
zorder=1,
)
ax.set(xlabel=x_label, ylabel=y_label, title=title)
if legend:
ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
return fig
def plot_shift(
self,
shift_prevs,
columns,
data,
*,
counts=None,
pos_class=1,
title="default",
x_label="true",
y_label="estim.",
legend=True,
):
fig, ax = plt.subplots()
ax.set_aspect("auto")
ax.grid()
NUM_COLORS = len(data)
cm = plt.get_cmap("tab10")
if NUM_COLORS > 10:
cm = plt.get_cmap("tab20")
cy = cycler(color=[cm(i) for i in range(NUM_COLORS)])
# shift_prevs = shift_prevs[:, pos_class]
for method, shifts, _cy in zip(columns, data, cy):
ax.plot(
shift_prevs,
shifts,
label=method,
color=_cy["color"],
linestyle="-",
marker="o",
markersize=3,
zorder=2,
)
if counts is not None:
_col_idx = np.where(columns == method)[0]
count = counts[_col_idx].flatten()
for prev, shift, cnt in zip(shift_prevs, shifts, count):
label = f"{cnt}"
plt.annotate(
label,
(prev, shift),
textcoords="offset points",
xytext=(0, 10),
ha="center",
color=_cy["color"],
fontsize=12.0,
)
ax.set(xlabel=x_label, ylabel=y_label, title=title)
if legend:
ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
return fig


@ -1,197 +0,0 @@
from quacc.plot.base import BasePlot
from quacc.plot.mpl import MplPlot
from quacc.plot.plotly import PlotlyPlot
__backend: BasePlot = MplPlot()
def get_backend(name, theme=None):
match name:
case "matplotlib" | "mpl":
return MplPlot()
case "plotly":
return PlotlyPlot(theme=theme)
case _:
return MplPlot()
def plot_delta(
base_prevs,
columns,
data,
*,
stdevs=None,
pos_class=1,
metric="acc",
name="default",
train_prev=None,
legend=True,
avg=None,
save_fig=False,
base_path=None,
backend=None,
):
backend = __backend if backend is None else backend
_base_title = "delta_stdev" if stdevs is not None else "delta"
if train_prev is not None:
t_prev_pos = int(round(train_prev[pos_class] * 100))
title = f"{_base_title}_{name}_{t_prev_pos}_{metric}"
else:
title = f"{_base_title}_{name}_avg_{avg}_{metric}"
if avg is None or avg == "train":
x_label = "Test Prevalence"
else:
x_label = "Train Prevalence"
if metric == "acc":
y_label = "Prediction Error for Vanilla Accuracy"
elif metric == "f1":
y_label = "Prediction Error for F1"
else:
y_label = f"{metric} error"
fig = backend.plot_delta(
base_prevs,
columns,
data,
stdevs=stdevs,
pos_class=pos_class,
title=title,
x_label=x_label,
y_label=y_label,
legend=legend,
)
if save_fig:
output_path = backend.save_fig(fig, base_path, title)
return fig, output_path
return fig
def plot_diagonal(
reference,
columns,
data,
*,
pos_class=1,
metric="acc",
name="default",
train_prev=None,
fixed_lim=False,
legend=True,
save_fig=False,
base_path=None,
backend=None,
):
backend = __backend if backend is None else backend
if train_prev is not None:
t_prev_pos = int(round(train_prev[pos_class] * 100))
title = f"diagonal_{name}_{t_prev_pos}_{metric}"
else:
title = f"diagonal_{name}_{metric}"
if metric == "acc":
x_label = "True Vanilla Accuracy"
y_label = "Estimated Vanilla Accuracy"
else:
x_label = f"true {metric}"
y_label = f"estim. {metric}"
fig = backend.plot_diagonal(
reference,
columns,
data,
pos_class=pos_class,
title=title,
x_label=x_label,
y_label=y_label,
fixed_lim=fixed_lim,
legend=legend,
)
if save_fig:
output_path = backend.save_fig(fig, base_path, title)
return fig, output_path
return fig
def plot_shift(
shift_prevs,
columns,
data,
*,
counts=None,
pos_class=1,
metric="acc",
name="default",
train_prev=None,
legend=True,
save_fig=False,
base_path=None,
backend=None,
):
backend = __backend if backend is None else backend
if train_prev is not None:
t_prev_pos = int(round(train_prev[pos_class] * 100))
title = f"shift_{name}_{t_prev_pos}_{metric}"
else:
title = f"shift_{name}_avg_{metric}"
x_label = "Amount of Prior Probability Shift"
if metric == "acc":
y_label = "Prediction Error for Vanilla Accuracy"
elif metric == "f1":
y_label = "Prediction Error for F1"
else:
y_label = f"{metric} error"
fig = backend.plot_shift(
shift_prevs,
columns,
data,
counts=counts,
pos_class=pos_class,
title=title,
x_label=x_label,
y_label=y_label,
legend=legend,
)
if save_fig:
output_path = backend.save_fig(fig, base_path, title)
return fig, output_path
return fig
def plot_fit_scores(
train_prevs,
scores,
*,
pos_class=1,
metric="acc",
name="default",
legend=True,
save_fig=False,
base_path=None,
backend=None,
):
backend = __backend if backend is None else backend
title = f"fit_scores_{name}_avg_{metric}"
x_label = "train prev."
y_label = "position"
fig = backend.plot_fit_scores(
train_prevs,
scores,
pos_class=pos_class,
title=title,
x_label=x_label,
y_label=y_label,
legend=legend,
)
if save_fig:
output_path = backend.save_fig(fig, base_path, title)
return fig, output_path
return fig


@ -1,330 +1,209 @@
from collections import defaultdict
from pathlib import Path
import numpy as np
import plotly
import plotly.graph_objects as go
from quacc.evaluation.estimators import CE, _renames
from quacc.plot.base import BasePlot
from quacc.plot.utils import _get_ref_limits
class PlotCfg:
def __init__(self, mode, lwidth, font=None, legend=None, template="seaborn"):
self.mode = mode
self.lwidth = lwidth
self.legend = {} if legend is None else legend
self.font = {} if font is None else font
self.template = template
web_cfg = PlotCfg("lines+markers", 2)
png_cfg_old = PlotCfg(
"lines",
5,
legend=dict(
orientation="h",
yanchor="bottom",
xanchor="right",
y=1.02,
x=1,
font=dict(size=24),
),
font=dict(size=24),
# template="ggplot2",
)
png_cfg = PlotCfg(
"lines",
5,
legend=dict(
font=dict(
family="DejaVu Sans",
size=24,
),
),
font=dict(size=24),
# template="ggplot2",
)
_cfg = png_cfg
class PlotlyPlot(BasePlot):
__themes = defaultdict(
lambda: {
"template": _cfg.template,
}
)
__themes = __themes | {
"dark": {
"template": "plotly_dark",
},
MODE = "lines"
L_WIDTH = 5
LEGEND = {
"font": {
"family": "DejaVu Sans",
"size": 24,
}
}
FONT = {"size": 24}
TEMPLATE = "ggplot2"
def __init__(self, theme=None):
self.theme = PlotlyPlot.__themes[theme]
self.rename = True
def hex_to_rgb(self, hex: str, t: float | None = None):
hex = hex.lstrip("#")
rgb = [int(hex[i : i + 2], 16) for i in [0, 2, 4]]
if t is not None:
rgb.append(t)
return f"{'rgb' if t is None else 'rgba'}{str(tuple(rgb))}"
def _update_layout(fig, x_label, y_label, **kwargs):
fig.update_layout(
xaxis_title=x_label,
yaxis_title=y_label,
template=TEMPLATE,
font=FONT,
legend=LEGEND,
**kwargs,
)
def get_colors(self, num):
match num:
case v if v > 10:
__colors = plotly.colors.qualitative.Light24
case _:
__colors = plotly.colors.qualitative.G10
def __generator(cs):
while True:
for c in cs:
yield c
def _hex_to_rgb(hex: str, t: float | None = None):
hex = hex.lstrip("#")
rgb = [int(hex[i : i + 2], 16) for i in [0, 2, 4]]
if t is not None:
rgb.append(t)
return f"{'rgb' if t is None else 'rgba'}{str(tuple(rgb))}"
return __generator(__colors)
def update_layout(self, fig, title, x_label, y_label):
fig.update_layout(
# title=title,
xaxis_title=x_label,
yaxis_title=y_label,
template=self.theme["template"],
font=_cfg.font,
legend=_cfg.legend,
)
def _get_colors(num):
match num:
case v if v > 10:
__colors = plotly.colors.qualitative.Light24
case _:
__colors = plotly.colors.qualitative.G10
def save_fig(self, fig, base_path, title) -> Path:
return None
def __generator(cs):
while True:
for c in cs:
yield c
def rename_plots(
self,
columns,
):
if not self.rename:
return columns
return __generator(__colors)
new_columns = []
for c in columns:
nc = c
for old, new in _renames.items():
if c.startswith(old):
nc = new + c[len(old) :]
new_columns.append(nc)
def plot_diagonal(
method_names,
true_accs,
estim_accs,
cls_name,
acc_name,
dataset_name,
*,
basedir=None,
) -> go.Figure:
fig = go.Figure()
line_colors = _get_colors(len(method_names))
_lims = _get_ref_limits(true_accs, estim_accs)
return np.array(new_columns)
def plot_delta(
self,
base_prevs,
columns,
data,
*,
stdevs=None,
pos_class=1,
title="default",
x_label="prevs.",
y_label="error",
legend=True,
) -> go.Figure:
fig = go.Figure()
if isinstance(base_prevs[0], float):
base_prevs = np.around([(1 - bp, bp) for bp in base_prevs], decimals=4)
x = [str(tuple(bp)) for bp in base_prevs]
named_data = {c: d for c, d in zip(columns, data)}
r_columns = {c: r for c, r in zip(columns, self.rename_plots(columns))}
line_colors = self.get_colors(len(columns))
# for name, delta in zip(columns, data):
columns = np.array(CE.name.sort(columns))
for name in columns:
delta = named_data[name]
r_name = r_columns[name]
color = next(line_colors)
_line = [
for name, x, estim in zip(method_names, true_accs, estim_accs):
color = next(line_colors)
slope, interc = np.polyfit(x, estim, 1)
fig.add_traces(
[
go.Scatter(
x=x,
y=delta,
mode=_cfg.mode,
name=r_name,
line=dict(color=self.hex_to_rgb(color), width=_cfg.lwidth),
hovertemplate="prev.: %{x}<br>error: %{y:,.4f}",
y=estim,
customdata=np.stack((estim - x,), axis=-1),
mode="markers",
name=name,
marker=dict(color=_hex_to_rgb(color, t=0.5)),
hovertemplate="true acc: %{x:,.4f}<br>estim. acc: %{y:,.4f}<br>acc err.: %{customdata[0]:,.4f}",
),
]
)
fig.add_trace(
go.Scatter(
x=_lims[0],
y=_lims[1],
mode="lines",
name="reference",
showlegend=False,
line=dict(color=_hex_to_rgb("#000000"), dash="dash"),
)
)
_update_layout(
fig,
x_label=f"True {acc_name}",
y_label=f"Estimated {acc_name}",
autosize=False,
width=1300,
height=1000,
yaxis_scaleanchor="x",
yaxis_scaleratio=1.0,
yaxis_range=[-0.1, 1.1],
)
# return _save_or_return(fig, basedir, cls_name, acc_name, dataset_name, "diagonal")
return fig
def plot_delta(
method_names: list[str],
prevs: np.ndarray,
acc_errs: np.ndarray,
cls_name,
acc_name,
dataset_name,
prev_name,
*,
stdevs: np.ndarray | None = None,
basedir=None,
) -> go.Figure:
fig = go.Figure()
x = [str(bp) for bp in prevs]
line_colors = _get_colors(len(method_names))
if stdevs is None:
stdevs = [None] * len(method_names)
for name, delta, stdev in zip(method_names, acc_errs, stdevs):
color = next(line_colors)
_line = [
go.Scatter(
x=x,
y=delta,
mode=MODE,
name=name,
line=dict(color=_hex_to_rgb(color), width=L_WIDTH),
hovertemplate="prev.: %{x}<br>error: %{y:,.4f}",
)
]
_error = []
if stdev is not None:
_error = [
go.Scatter(
x=np.concatenate([x, x[::-1]]),
y=np.concatenate([delta - stdev, (delta + stdev)[::-1]]),
name=name,
fill="toself",
fillcolor=_hex_to_rgb(color, t=0.2),
line=dict(color="rgba(255, 255, 255, 0)"),
hoverinfo="skip",
showlegend=False,
)
]
_error = []
if stdevs is not None:
_col_idx = np.where(columns == name)[0]
stdev = stdevs[_col_idx].flatten()
_error = [
go.Scatter(
x=np.concatenate([x, x[::-1]]),
y=np.concatenate([delta - stdev, (delta + stdev)[::-1]]),
name=int(_col_idx[0]),
fill="toself",
fillcolor=self.hex_to_rgb(color, t=0.2),
line=dict(color="rgba(255, 255, 255, 0)"),
hoverinfo="skip",
showlegend=False,
)
]
fig.add_traces(_line + _error)
fig.add_traces(_line + _error)
self.update_layout(fig, title, x_label, y_label)
return fig
_update_layout(
fig,
x_label=f"{prev_name} Prevalence",
y_label=f"Prediction Error for {acc_name}",
)
# return _save_or_return(
# fig,
# basedir,
# cls_name,
# acc_mame,
# dataset_name,
# "delta" if stdevs is None else "stdev",
# )
return fig
def plot_diagonal(
self,
reference,
columns,
data,
*,
pos_class=1,
title="default",
x_label="true",
y_label="estim.",
fixed_lim=False,
legend=True,
) -> go.Figure:
fig = go.Figure()
x = reference
line_colors = self.get_colors(len(columns))
if fixed_lim:
_lims = np.array([[0.0, 1.0], [0.0, 1.0]])
else:
_edges = (
np.min([np.min(x), np.min(data)]),
np.max([np.max(x), np.max(data)]),
)
_lims = np.array([[_edges[0], _edges[1]], [_edges[0], _edges[1]]])
named_data = {c: d for c, d in zip(columns, data)}
r_columns = {c: r for c, r in zip(columns, self.rename_plots(columns))}
columns = np.array(CE.name.sort(columns))
for name in columns:
val = named_data[name]
r_name = r_columns[name]
color = next(line_colors)
slope, interc = np.polyfit(x, val, 1)
# y_lr = np.array([slope * _x + interc for _x in _lims[0]])
fig.add_traces(
[
go.Scatter(
x=x,
y=val,
customdata=np.stack((val - x,), axis=-1),
mode="markers",
name=r_name,
marker=dict(color=self.hex_to_rgb(color, t=0.5)),
hovertemplate="true acc: %{x:,.4f}<br>estim. acc: %{y:,.4f}<br>acc err.: %{customdata[0]:,.4f}",
# showlegend=False,
),
# go.Scatter(
# x=[x[-1]],
# y=[val[-1]],
# mode="markers",
# marker=dict(color=self.hex_to_rgb(color), size=8),
# name=r_name,
# ),
# go.Scatter(
# x=_lims[0],
# y=y_lr,
# mode="lines",
# name=name,
# line=dict(color=self.hex_to_rgb(color), width=3),
# showlegend=False,
# ),
]
)
fig.add_trace(
go.Scatter(
x=_lims[0],
y=_lims[1],
mode="lines",
name="reference",
showlegend=False,
line=dict(color=self.hex_to_rgb("#000000"), dash="dash"),
)
)
self.update_layout(fig, title, x_label, y_label)
fig.update_layout(
autosize=False,
width=1300,
height=1000,
yaxis_scaleanchor="x",
yaxis_scaleratio=1.0,
yaxis_range=[-0.1, 1.1],
)
return fig
def plot_shift(
self,
shift_prevs,
columns,
data,
*,
counts=None,
pos_class=1,
title="default",
x_label="true",
y_label="estim.",
legend=True,
) -> go.Figure:
fig = go.Figure()
# x = shift_prevs[:, pos_class]
x = shift_prevs
line_colors = self.get_colors(len(columns))
named_data = {c: d for c, d in zip(columns, data)}
r_columns = {c: r for c, r in zip(columns, self.rename_plots(columns))}
columns = np.array(CE.name.sort(columns))
for name in columns:
delta = named_data[name]
r_name = r_columns[name]
col_idx = (columns == name).nonzero()[0][0]
color = next(line_colors)
fig.add_trace(
go.Scatter(
x=x,
y=delta,
customdata=np.stack((counts[col_idx],), axis=-1),
mode=_cfg.mode,
name=r_name,
line=dict(color=self.hex_to_rgb(color), width=_cfg.lwidth),
hovertemplate="shift: %{x}<br>error: %{y}"
+ "<br>count: %{customdata[0]}"
if counts is not None
else "",
)
)
self.update_layout(fig, title, x_label, y_label)
return fig
def plot_fit_scores(
self,
train_prevs,
scores,
*,
pos_class=1,
title="default",
x_label="prev.",
y_label="position",
legend=True,
) -> go.Figure:
fig = go.Figure()
# x = train_prevs
x = [str(tuple(bp)) for bp in train_prevs]
def plot_shift(
method_names: list[str],
prevs: np.ndarray,
acc_errs: np.ndarray,
cls_name,
acc_name,
dataset_name,
*,
counts: np.ndarray | None = None,
basedir=None,
) -> go.Figure:
fig = go.Figure()
x = prevs
line_colors = _get_colors(len(method_names))
if counts is None:
counts = [None] * len(method_names)
for name, delta, count in zip(method_names, acc_errs, counts):
color = next(line_colors)
fig.add_trace(
go.Scatter(
x=x,
y=scores,
mode="lines+markers",
showlegend=False,
),
y=delta,
customdata=np.stack((count,), axis=-1),
mode=MODE,
name=name,
line=dict(color=_hex_to_rgb(color), width=L_WIDTH),
hovertemplate="shift: %{x}<br>error: %{y}"
+ "<br>count: %{customdata[0]}"
if count is not None
else "",
)
)
self.update_layout(fig, title, x_label, y_label)
return fig
_update_layout(
fig,
x_label="Amount of Prior Probability Shift",
y_label=f"Prediction Error for {acc_name}",
)
# return _save_or_return(fig, basedir, cls_name, acc_name, dataset_name, "shift")
return fig

15
quacc/plot/utils.py Normal file

@ -0,0 +1,15 @@
import numpy as np
import plotly.graph_objects as go
from quacc.utils.commons import get_plots_path
def _get_ref_limits(true_accs: np.ndarray, estim_accs: np.ndarray):
"""get lmits of reference line"""
_edges = (
np.min([np.min(true_accs), np.min(estim_accs)]),
np.max([np.max(true_accs), np.max(estim_accs)]),
)
_lims = np.array([[_edges[0], _edges[1]], [_edges[0], _edges[1]]])
return _lims


@ -1 +0,0 @@
from .kdey import KDEy


@ -1,86 +0,0 @@
from typing import Union, Callable
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.neighbors import KernelDensity
from quapy.data import LabelledCollection
from quapy.method.aggregative import AggregativeProbabilisticQuantifier, cross_generate_predictions
import quapy as qp
class KDEy(AggregativeProbabilisticQuantifier):
def __init__(self, classifier: BaseEstimator, val_split=10, bandwidth=0.1, n_jobs=None, random_state=0):
self.classifier = classifier
self.val_split = val_split
self.bandwidth = bandwidth
self.n_jobs = n_jobs
self.random_state = random_state
def get_kde_function(self, posteriors):
return KernelDensity(bandwidth=self.bandwidth).fit(posteriors)
def pdf(self, kde, posteriors):
return np.exp(kde.score_samples(posteriors))
def fit(self, data: LabelledCollection, fit_classifier=True, val_split: Union[float, LabelledCollection] = None):
"""
:param data: the training set
:param fit_classifier: set to False to bypass the training (the learner is assumed to be already fit)
:param val_split: either a float in (0,1) indicating the proportion of training instances to use for
validation (e.g., 0.3 for using 30% of the training set as validation data), or a LabelledCollection
indicating the validation set itself, or an int indicating the number k of folds to be used in kFCV
to estimate the parameters
"""
if val_split is None:
val_split = self.val_split
with qp.util.temp_seed(self.random_state):
self.classifier, y, posteriors, classes, class_count = cross_generate_predictions(
data, self.classifier, val_split, probabilistic=True, fit_classifier=fit_classifier, n_jobs=self.n_jobs
)
self.val_densities = [self.get_kde_function(posteriors[y == cat]) for cat in range(data.n_classes)]
return self
def aggregate(self, posteriors: np.ndarray):
"""
Searches for the mixture model parameter (the sought prevalence values) that yields a validation distribution
(the mixture) that best matches the test distribution, in terms of the divergence measure of choice.
:param posteriors: posterior probabilities of the instances in the sample
:return: a vector of class prevalence estimates
"""
eps = 1e-10
np.random.RandomState(self.random_state)
n_classes = len(self.val_densities)
test_densities = [self.pdf(kde_i, posteriors) for kde_i in self.val_densities]
def neg_loglikelihood(prev):
test_mixture_likelihood = sum(prev_i * dens_i for prev_i, dens_i in zip (prev, test_densities))
test_loglikelihood = np.log(test_mixture_likelihood + eps)
return -np.sum(test_loglikelihood)
return optim_minimize(neg_loglikelihood, n_classes)
def optim_minimize(loss, n_classes):
"""
Searches for the optimal prevalence values, i.e., an `n_classes`-dimensional vector of the (`n_classes`-1)-simplex
that yields the smallest loss. This optimization is carried out by means of a constrained search using scipy's
SLSQP routine.
:param loss: (callable) the function to minimize
:param n_classes: (int) the number of classes, i.e., the dimensionality of the prevalence vector
:return: (ndarray) the best prevalence vector found
"""
from scipy import optimize
# the initial point is set as the uniform distribution
uniform_distribution = np.full(fill_value=1 / n_classes, shape=(n_classes,))
# solutions are bounded to those contained in the unit-simplex
bounds = tuple((0, 1) for _ in range(n_classes)) # values in [0,1]
constraints = ({'type': 'eq', 'fun': lambda x: 1 - sum(x)}) # values summing up to 1
r = optimize.minimize(loss, x0=uniform_distribution, method='SLSQP', bounds=bounds, constraints=constraints)
return r.x
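
The constrained search described in optim_minimize above (the same idea is used through F.optim_minimize in quacc/models/cont_table.py) can be illustrated on a toy objective: recover a target prevalence vector by minimizing a distance over the probability simplex, i.e., with box bounds in [0, 1] and a sum-to-one equality constraint. The target vector is arbitrary.

# Hedged sketch: SLSQP search over the probability simplex.
import numpy as np
from scipy import optimize

target = np.array([0.2, 0.5, 0.3])               # arbitrary "true" prevalence vector

def loss(prev):
    return np.linalg.norm(prev - target)

n_classes = len(target)
uniform = np.full(fill_value=1 / n_classes, shape=(n_classes,))   # starting point
bounds = tuple((0, 1) for _ in range(n_classes))                  # each prevalence in [0, 1]
constraints = {"type": "eq", "fun": lambda x: 1 - sum(x)}         # prevalences sum to 1

r = optimize.minimize(loss, x0=uniform, method="SLSQP", bounds=bounds, constraints=constraints)
print(r.x)   # approximately [0.2, 0.5, 0.3]
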


@ -1,8 +1,10 @@
import functools
import json
import os
import shutil
from contextlib import ExitStack
from pathlib import Path
from time import time
from urllib.request import urlretrieve
import pandas as pd
@ -10,7 +12,7 @@ from joblib import Parallel, delayed
from tqdm import tqdm
from quacc import logger
from quacc.environment import env, environ
from quacc.legacy.environment import env, environ
def combine_dataframes(dfs, df_index=[]) -> pd.DataFrame:
@ -106,3 +108,28 @@ def parallel(
Parallel(n_jobs=n_jobs, verbose=verbose) if parallel is None else parallel
)
return parallel(delayed(wrapper)(*_args) for _args in f_args)
def save_json_file(path, data):
os.makedirs(Path(path).parent, exist_ok=True)
with open(path, "w") as f:
json.dump(data, f)
def load_json_file(path, object_hook=None):
if not os.path.exists(path):
raise ValueError("Ivalid path for json file")
with open(path, "r") as f:
return json.load(f, object_hook=object_hook)
def get_results_path(basedir, cls_name, acc_name, dataset_name, method_name):
return os.path.join(
"results", basedir, cls_name, acc_name, dataset_name, method_name + ".json"
)
def get_plots_path(basedir, cls_name, acc_name, dataset_name, plot_type):
return os.path.join(
"plots", basedir, cls_name, acc_name, dataset_name, plot_type + ".svg"
)
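
A quick sketch of the helpers added above, assuming this diff belongs to quacc/utils/commons.py (consistent with the get_plots_path import in quacc/plot/matplotlib.py): results are written as JSON under results/<basedir>/<classifier>/<accuracy>/<dataset>/<method>.json, and plots under an analogous plots/ tree ending in .svg. The argument values below are placeholders.

# Hedged sketch: persisting and reloading a result dict with the new helpers.
from quacc.utils.commons import get_results_path, load_json_file, save_json_file

path = get_results_path("main", "LR", "vanilla_accuracy", "imdb", "ATC")
# -> results/main/LR/vanilla_accuracy/imdb/ATC.json

save_json_file(path, {"true_acc": 0.83, "estim_acc": 0.80})   # creates parent dirs if needed
print(load_json_file(path))
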


@ -1,15 +0,0 @@
# Additional covariates percentage
Rate of usage of additional covariates, recalibration and "balanced" class_weight
during grid search:
| method | av % | recalib % | rebalance % |
| --------------: | :----: | :-------: | :---------: |
| imdb_sld_lr | 81.49% | 77.78% | 59.26% |
| imdb_kde_lr | 71.43% | NA | 88.18% |
| rcv1_CCAT_sld_lr| 62.97% | 70.38% | 77.78% |
| rcv1_CCAT_kde_lr| 78.06% | NA | 84.82% |
| rcv1_GCAT_sld_lr| 76.93% | 61.54% | 65.39% |
| rcv1_GCAT_kde_lr| 71.36% | NA | 78.65% |
| rcv1_MCAT_sld_lr| 62.97% | 48.15% | 74.08% |
| rcv1_MCAT_kde_lr| 71.03% | NA | 68.70% |

27795
remote.log

File diff suppressed because it is too large

214
remote.py

@ -1,214 +0,0 @@
import os
import queue
import stat
import subprocess
import threading
from itertools import product as itproduct
from os.path import expanduser
from pathlib import Path
from subprocess import DEVNULL, STDOUT
import paramiko
from tqdm import tqdm
known_hosts = Path(expanduser("~/.ssh/known_hosts"))
hostname = "ilona.isti.cnr.it"
username = "volpi"
__exec_main = "cd tesi; /home/volpi/.local/bin/poetry run main"
__exec_log = "/usr/bin/tail -f -n 0 tesi/quacc.log"
__log_file = "remote.log"
__target_dir = Path("/home/volpi/tesi")
__to_sync_up = {
"dir": [
"quacc",
"baselines",
"qcpanel",
"qcdash",
],
"file": [
"conf.yaml",
"run.py",
"remote.py",
"merge_data.py",
"pyproject.toml",
],
}
__to_sync_down = {
"dir": [
"output",
],
"file": [],
}
def prune_remote(sftp: paramiko.SFTPClient, remote: Path):
_ex_list = []
mode = sftp.stat(str(remote)).st_mode
if stat.S_ISDIR(mode):
for f in sftp.listdir(str(remote)):
_ex_list.append([prune_remote, sftp, remote / f])
_ex_list.append([sftp.rmdir, str(remote)])
elif stat.S_ISREG(mode):
_ex_list.append([sftp.remove, str(remote)])
return _ex_list
def put_dir(sftp: paramiko.SFTPClient, from_: Path, to_: Path):
_ex_list = []
_ex_list.append([sftp.mkdir, str(to_)])
from_list = os.listdir(from_)
for f in from_list:
if (from_ / f).is_file():
_ex_list.append([sftp.put, str(from_ / f), str(to_ / f)])
elif (from_ / f).is_dir():
_ex_list += put_dir(sftp, from_ / f, to_ / f)
try:
to_list = sftp.listdir(str(to_))
for f in to_list:
if f not in from_list:
_ex_list += prune_remote(sftp, to_ / f)
except FileNotFoundError:
pass
return _ex_list
def get_dir(sftp: paramiko.SFTPClient, from_: Path, to_: Path):
_ex_list = []
if not (to_.exists() and to_.is_dir()):
_ex_list.append([os.mkdir, to_])
for f in sftp.listdir(str(from_)):
mode = sftp.stat(str(from_ / f)).st_mode
if stat.S_ISDIR(mode):
_ex_list += get_dir(sftp, from_ / f, to_ / f)
# _ex_list.append([sftp.rmdir, str(from_ / f)])
elif stat.S_ISREG(mode):
_ex_list.append([sftp.get, str(from_ / f), str(to_ / f)])
# _ex_list.append([sftp.remove, str(from_ / f)])
return _ex_list
def sync_code(*, ssh: paramiko.SSHClient = None, verbose=False):
_was_ssh = ssh is not None
if ssh is None:
ssh = paramiko.SSHClient()
ssh.load_host_keys(known_hosts)
ssh.connect(hostname=hostname, username=username)
sftp = ssh.open_sftp()
to_move = [item for k, vs in __to_sync_up.items() for item in itproduct([k], vs)]
_ex_list = []
for mode, f in to_move:
from_ = Path(f).absolute()
to_ = __target_dir / f
if mode == "dir":
_ex_list += put_dir(sftp, from_, to_)
elif mode == "file":
_ex_list.append([sftp.put, str(from_), str(to_)])
for _ex in tqdm(_ex_list, desc="synching code: "):
fn_ = _ex[0]
try:
fn_(*_ex[1:])
except IOError:
if verbose:
print(f"Info: directory {to_} already exists.")
sftp.close()
if not _was_ssh:
ssh.close()
def sync_output(*, ssh: paramiko.SSHClient = None):
_was_ssh = ssh is not None
if ssh is None:
ssh = paramiko.SSHClient()
ssh.load_host_keys(known_hosts)
ssh.connect(hostname=hostname, username=username)
sftp = ssh.open_sftp()
to_move = [item for k, vs in __to_sync_down.items() for item in itproduct([k], vs)]
_ex_list = []
for mode, f in to_move:
from_ = __target_dir / f
to_ = Path(f).absolute()
if mode == "dir":
_ex_list += get_dir(sftp, from_, to_)
elif mode == "file":
_ex_list.append([sftp.get, str(from_), str(to_)])
for _ex in tqdm(_ex_list, desc="synching output: "):
fn_ = _ex[0]
fn_(*_ex[1:])
sftp.close()
if not _was_ssh:
ssh.close()
def _echo_channel(ch: paramiko.ChannelFile):
while line := ch.readline():
print(line, end="")
def _echo_log(ssh: paramiko.SSHClient, q_: queue.Queue):
_, rout, _ = ssh.exec_command(__exec_log, timeout=5.0)
while True:
try:
_line = rout.readline()
with open(__log_file, "a") as f:
f.write(_line)
except TimeoutError:
pass
try:
q_.get_nowait()
return
except queue.Empty:
pass
def remote(detatch=False):
ssh = paramiko.SSHClient()
ssh.load_host_keys(known_hosts)
ssh.connect(hostname=hostname, username=username)
sync_code(ssh=ssh)
__to_exec = __exec_main
if detatch:
__to_exec += " &> out & disown"
_, rout, rerr = ssh.exec_command(__to_exec)
if detatch:
ssh.close()
return
q = queue.Queue()
_tlog = threading.Thread(target=_echo_log, args=[ssh, q])
_tlog.start()
_tchans = [threading.Thread(target=_echo_channel, args=[ch]) for ch in [rout, rerr]]
for th in _tchans:
th.start()
for th in _tchans:
th.join()
q.put(None)
sync_output(ssh=ssh)
_tlog.join()
ssh.close()


@ -1,40 +0,0 @@
## Roadmap
#### quantificator domain
- single multilabel quantificator
- vector of binary quantificators
| quantificator | | |
|:-------------------:|:--------------:|:--------------:|
| true quantificator | true positive | false positive |
| false quantificator | false negative | true negative |
#### dataset split
- train | test
- classificator C is fit on train
- quantificator Q is fit on cross validation of C over train
- train | validation | test
- classificator C is fit on train
- quantificator Q is fit on validation
#### classificator origin
- black box
- crystal box
#### test metrics
- f1_score
- K
#### models
- classificator
- quantificator

21
run.py

@ -1,21 +0,0 @@
import argparse
from quacc.main import main as run_local
from remote import remote as run_remote
def run():
parser = argparse.ArgumentParser()
parser.add_argument("-l", "--local", action="store_true", dest="local")
parser.add_argument("-r", "--remote", action="store_true", dest="remote")
parser.add_argument("-d", "--detatch", action="store_true", dest="detatch")
args = parser.parse_args()
if args.local:
run_local()
elif args.remote:
run_remote(detatch=args.detatch)
if __name__ == "__main__":
run()


@ -1,48 +0,0 @@
import numpy as np
from quacc.evaluation.report import DatasetReport
datasets = [
"imdb/imdb.pickle",
"rcv1_CCAT/rcv1_CCAT.pickle",
"rcv1_GCAT/rcv1_GCAT.pickle",
"rcv1_MCAT/rcv1_MCAT.pickle",
]
gs = {
"sld_lr_gs": [
"bin_sld_lr_gs",
"mul_sld_lr_gs",
"m3w_sld_lr_gs",
],
"kde_lr_gs": [
"bin_kde_lr_gs",
"mul_kde_lr_gs",
"m3w_kde_lr_gs",
],
}
for dst in datasets:
dr = DatasetReport.unpickle("output/main/" + dst)
print(f"{dst}\n")
for name, methods in gs.items():
print(f"{name}")
sel_methods = [
{k: v for k, v in cr.fit_scores.items() if k in methods} for cr in dr.crs
]
best_methods = [
list(ms.keys())[np.argmin(list(ms.values()))] for ms in sel_methods
]
m_cnt = []
for m in methods:
m_cnt.append((np.array(best_methods) == m).nonzero()[0].shape[0])
m_cnt = np.array(m_cnt)
m_freq = m_cnt / len(best_methods)
for n in methods:
print(n, end="\t")
print()
for v in m_freq:
print(f"{v*100:.2f}", end="\t")
print("\n\n")


@ -1,15 +0,0 @@
from quacc.evaluation.report import DatasetReport
import pandas as pd
dr = DatasetReport.unpickle("output/main/imdb/imdb.pickle")
_data = dr.data(
metric="acc", estimators=["bin_sld_lr_mc", "bin_sld_lr_ne", "bin_sld_lr_c"]
)
d1 = _data.loc[((0.9, 0.1), (1.0, 0.0), slice(None)), :]
d2 = _data.loc[((0.1, 0.9), (0.0, 1.0), slice(None)), :]
dd = pd.concat([d1, d2], axis=0)
print(d1.to_numpy(), "\n", d1.mean(), "\n")
print(d2.to_numpy(), "\n", d2.mean(), "\n")
print(dd.to_numpy(), "\n", dd.mean(), "\n")


@ -1,9 +0,0 @@
from quacc.evaluation.report import DatasetReport
dr = DatasetReport.unpickle("output/main/imdb/imdb.pickle")
_estimators = ["sld_lr_gs", "bin_sld_lr_gs", "mul_sld_lr_gs", "m3w_sld_lr_gs"]
_data = dr.data(metric="acc", estimators=_estimators)
for idx, cr in zip(_data.index.unique(0), dr.crs[::-1]):
print(cr.train_prev)
print({k: v for k, v in cr.fit_scores.items() if k in _estimators})
print(_data.loc[(idx, slice(None), slice(None)), :])


@ -4,7 +4,7 @@ import numpy as np
import pytest
import scipy.sparse as sp
from quacc.data import (
from quacc.legacy.data import (
ExtBinPrev,
ExtendedCollection,
ExtendedData,


@ -2,7 +2,7 @@ import numpy as np
import pytest
from quacc import error
from quacc.data import ExtendedPrev, ExtensionPolicy
from quacc.legacy.data import ExtendedPrev, ExtensionPolicy
@pytest.mark.err


@ -1,7 +1,7 @@
import numpy as np
import pytest
from quacc.evaluation.report import (
from quacc.legacy.evaluation.report import (
CompReport,
DatasetReport,
EvaluationReport,


@ -2,8 +2,8 @@ import numpy as np
import pytest
import scipy.sparse as sp
from quacc.data import ExtendedData, ExtensionPolicy
from quacc.method.base import MultiClassAccuracyEstimator
from quacc.deprecated.method.base import MultiClassAccuracyEstimator
from quacc.legacy.data import ExtendedData, ExtensionPolicy
@pytest.mark.mcae