<!doctype html> <html> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1.0" /> <title>Model Selection — QuaPy 0.1.6 documentation</title> <link rel="stylesheet" type="text/css" href="_static/pygments.css" /> <link rel="stylesheet" type="text/css" href="_static/bizstyle.css" /> <script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script> <script src="_static/jquery.js"></script> <script src="_static/underscore.js"></script> <script src="_static/doctools.js"></script> <script src="_static/bizstyle.js"></script> <link rel="index" title="Index" href="genindex.html" /> <link rel="search" title="Search" href="search.html" /> <!--[if lt IE 9]> <script src="_static/css3-mediaqueries.js"></script> <![endif]--> </head><body> <div class="related" role="navigation" aria-label="related navigation"> <h3>Navigation</h3> <ul> <li class="right" style="margin-right: 10px"> <a href="genindex.html" title="General Index" accesskey="I">index</a></li> <li class="right" > <a href="py-modindex.html" title="Python Module Index" >modules</a> |</li> <li class="nav-item nav-item-0"><a href="index.html">QuaPy 0.1.6 documentation</a> »</li> <li class="nav-item nav-item-this"><a href="">Model Selection</a></li> </ul> </div> <div class="document"> <div class="documentwrapper"> <div class="bodywrapper"> <div class="body" role="main"> <div class="tex2jax_ignore mathjax_ignore section" id="model-selection"> <h1>Model Selection<a class="headerlink" href="#model-selection" title="Permalink to this headline">¶</a></h1> <p>As with any other supervised machine learning task, the performance of quantification methods can strongly depend on a good choice of model hyper-parameters. 
The process whereby those hyper-parameters are chosen is known as <em>Model Selection</em>, and typically consists of testing different settings and picking the one that performs best on a held-out validation set in terms of a given evaluation measure.</p> <div class="section" id="targeting-a-quantification-oriented-loss"> <h2>Targeting a Quantification-oriented loss<a class="headerlink" href="#targeting-a-quantification-oriented-loss" title="Permalink to this headline">¶</a></h2> <p>The task being optimized determines the evaluation protocol, i.e., the criteria according to which the performance of any given method for solving it is to be assessed. As a task in its own right, quantification should impose its own model selection strategies, i.e., strategies aimed at finding configurations specifically suited to the task of quantification.</p> <p>Quantification has long been regarded as an add-on of classification, and thus the model selection strategies customarily adopted in classification have simply been applied to quantification (see the next section). It has been argued in <em>Moreo, Alejandro, and Fabrizio Sebastiani. “Re-Assessing the ‘Classify and Count’ Quantification Method.” arXiv preprint arXiv:2011.02552 (2020)</em> that specific model selection strategies should be adopted for quantification. 
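<p>A quantification-oriented loss scores an estimated vector of class prevalence values against the true one, rather than scoring individual label predictions. The following minimal sketch of mean absolute error (MAE) over prevalence vectors is a simplified stand-in, written with plain numpy for illustration, and is not the actual implementation found in <em>qp.error</em>:</p>

```python
import numpy as np

def mae(true_prevs, estim_prevs):
    # Mean absolute error between true and estimated prevalence vectors,
    # averaged over samples and classes.
    true_prevs = np.asarray(true_prevs)
    estim_prevs = np.asarray(estim_prevs)
    return np.abs(true_prevs - estim_prevs).mean()

# two binary samples: true prevalences vs a quantifier's estimates
print(round(mae([[0.8, 0.2], [0.3, 0.7]], [[0.7, 0.3], [0.4, 0.6]]), 4))  # 0.1
```

<p>Averaging such errors over many samples drawn at widely varying prevalence values, as the artificial sampling protocol does, yields the kind of score that a quantification-oriented model selection should optimize.</p>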
That is, model selection strategies for quantification should target quantification-oriented losses and be tested in a variety of scenarios exhibiting different degrees of prior probability shift.</p> <p>The class <em>qp.model_selection.GridSearchQ</em> implements a grid-search exploration over the space of hyper-parameter combinations that evaluates each combination of hyper-parameters by means of a given quantification-oriented error metric (e.g., any of the error functions implemented in <em>qp.error</em>) and according to the <a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation"><em>artificial sampling protocol</em></a>.</p> <p>The following is an example of model selection for quantification:</p> <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">quapy</span> <span class="k">as</span> <span class="nn">qp</span> <span class="kn">from</span> <span class="nn">quapy.method.aggregative</span> <span class="kn">import</span> <span class="n">PCC</span> <span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span> <span class="c1"># set a seed to replicate runs</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'SAMPLE_SIZE'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">500</span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span 
class="n">fetch_reviews</span><span class="p">(</span><span class="s1">'hp'</span><span class="p">,</span> <span class="n">tfidf</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">min_df</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span> <span class="c1"># The model will be returned by the fit method of GridSearchQ.</span> <span class="c1"># Model selection will be performed with a fixed budget of 1000 evaluations</span> <span class="c1"># for each hyper-parameter combination. The error to optimize is the MAE for</span> <span class="c1"># quantification, as evaluated on artificially drawn samples at prevalences </span> <span class="c1"># covering the entire spectrum on a held-out portion (40%) of the training set.</span> <span class="n">model</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">model_selection</span><span class="o">.</span><span class="n">GridSearchQ</span><span class="p">(</span> <span class="n">model</span><span class="o">=</span><span class="n">PCC</span><span class="p">(</span><span class="n">LogisticRegression</span><span class="p">()),</span> <span class="n">param_grid</span><span class="o">=</span><span class="p">{</span><span class="s1">'C'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">10</span><span class="p">),</span> <span class="s1">'class_weight'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'balanced'</span><span class="p">,</span> <span class="kc">None</span><span class="p">]},</span> <span class="n">sample_size</span><span class="o">=</span><span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span 
class="s1">'SAMPLE_SIZE'</span><span class="p">],</span> <span class="n">eval_budget</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">error</span><span class="o">=</span><span class="s1">'mae'</span><span class="p">,</span> <span class="n">refit</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="c1"># retrain on the whole labelled set</span> <span class="n">val_split</span><span class="o">=</span><span class="mf">0.4</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="kc">True</span> <span class="c1"># show information as the process goes on</span> <span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">training</span><span class="p">)</span> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'model selection ended: best hyper-parameters=</span><span class="si">{</span><span class="n">model</span><span class="o">.</span><span class="n">best_params_</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span> <span class="n">model</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">best_model_</span> <span class="c1"># evaluation in terms of MAE</span> <span class="n">results</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">evaluation</span><span class="o">.</span><span class="n">artificial_sampling_eval</span><span class="p">(</span> <span class="n">model</span><span class="p">,</span> <span class="n">dataset</span><span class="o">.</span><span class="n">test</span><span class="p">,</span> <span class="n">sample_size</span><span class="o">=</span><span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span 
class="s1">'SAMPLE_SIZE'</span><span class="p">],</span> <span class="n">n_prevpoints</span><span class="o">=</span><span class="mi">101</span><span class="p">,</span> <span class="n">n_repetitions</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">error_metric</span><span class="o">=</span><span class="s1">'mae'</span> <span class="p">)</span> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'MAE=</span><span class="si">{</span><span class="n">results</span><span class="si">:</span><span class="s1">.5f</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span> </pre></div> </div> <p>In this example, the system outputs:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span>[GridSearchQ]: starting optimization with n_jobs=1 [GridSearchQ]: checking hyperparams={'C': 0.0001, 'class_weight': 'balanced'} got mae score 0.24987 [GridSearchQ]: checking hyperparams={'C': 0.0001, 'class_weight': None} got mae score 0.48135 [GridSearchQ]: checking hyperparams={'C': 0.001, 'class_weight': 'balanced'} got mae score 0.24866 [...] [GridSearchQ]: checking hyperparams={'C': 100000.0, 'class_weight': None} got mae score 0.43676 [GridSearchQ]: optimization finished: best params {'C': 0.1, 'class_weight': 'balanced'} (score=0.19982) [GridSearchQ]: refitting on the whole development set model selection ended: best hyper-parameters={'C': 0.1, 'class_weight': 'balanced'} 1010 evaluations will be performed for each combination of hyper-parameters [artificial sampling protocol] generating predictions: 100%|██████████| 1010/1010 [00:00<00:00, 5005.54it/s] MAE=0.20342 </pre></div> </div> <p>The parameter <em>val_split</em> can alternatively be used to indicate a validation set (i.e., an instance of <em>LabelledCollection</em>) instead of a proportion. 
This could be useful if one wants to have control over the specific data split to be used across different model selection experiments.</p> </div> <div class="section" id="targeting-a-classification-oriented-loss"> <h2>Targeting a Classification-oriented loss<a class="headerlink" href="#targeting-a-classification-oriented-loss" title="Permalink to this headline">¶</a></h2> <p>Optimizing a model for quantification can be computationally costly. In aggregative methods, one can alternatively optimize the classifier’s hyper-parameters for classification. Although this is theoretically suboptimal, many articles in the quantification literature have opted for this strategy.</p> <p>In QuaPy, this is achieved by simply instantiating the classifier learner as a GridSearchCV from scikit-learn. The following code illustrates how to do that:</p> <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">learner</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span> <span class="n">LogisticRegression</span><span class="p">(),</span> <span class="n">param_grid</span><span class="o">=</span><span class="p">{</span><span class="s1">'C'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">10</span><span class="p">),</span> <span class="s1">'class_weight'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'balanced'</span><span class="p">,</span> <span class="kc">None</span><span class="p">]},</span> <span class="n">cv</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span> <span class="n">model</span> <span class="o">=</span> <span class="n">PCC</span><span class="p">(</span><span class="n">learner</span><span 
class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">training</span><span class="p">)</span> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'model selection ended: best hyper-parameters=</span><span class="si">{</span><span class="n">model</span><span class="o">.</span><span class="n">learner</span><span class="o">.</span><span class="n">best_params_</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span> </pre></div> </div> <p>In this example, the system outputs:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span>model selection ended: best hyper-parameters={'C': 10000.0, 'class_weight': None} 1010 evaluations will be performed for each combination of hyper-parameters [artificial sampling protocol] generating predictions: 100%|██████████| 1010/1010 [00:00<00:00, 5379.55it/s] MAE=0.41734 </pre></div> </div> <p>Note that the MAE is worse than the one obtained when optimizing for quantification and, indeed, the hyper-parameters found to be optimal differ largely between the two selection modalities. The hyper-parameters C=10000 and class_weight=None work well for the specific training prevalence of the HP dataset, but turn out to be suboptimal when the class prevalences of the test set differ (as is indeed the case in quantification scenarios).</p> <p>This is, however, not always the case, and one could, in practice, find examples in which optimizing for classification ends up producing a better quantifier than optimizing for quantification. 
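<p>The reason classification-tuned hyper-parameters tend to degrade under prior probability shift can be seen from the expected behaviour of a classify-and-count style estimator: a classifier with true positive rate <em>tpr</em> and false positive rate <em>fpr</em> yields, in expectation, the prevalence estimate <em>p·tpr + (1−p)·fpr</em> at true prevalence <em>p</em>. The sketch below uses illustrative, made-up rates (not measured on any QuaPy dataset) to show the estimation error growing as the test prevalence departs from the one the classifier was tuned on:</p>

```python
def expected_cc_estimate(p, tpr, fpr):
    # Expected output of classify and count at true prevalence p,
    # for a classifier with the given true/false positive rates.
    return p * tpr + (1 - p) * fpr

# hypothetical rates of a classifier tuned near the training prevalence 0.5
tpr, fpr = 0.9, 0.2
for p in (0.5, 0.9, 0.1):
    estimate = expected_cc_estimate(p, tpr, fpr)
    print(f'true={p:.1f}  estimated={estimate:.2f}  error={abs(estimate - p):.2f}')
```

<p>With these rates, the error is small near the prevalence the classifier was tuned for and grows as the test prevalence drifts away from it, which is one reason why configurations that look best by classification criteria need not minimize quantification error.</p>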
Nonetheless, this is theoretically unlikely to happen.</p> </div> </div> <div class="clearer"></div> </div> </div> </div> <div class="sphinxsidebar" role="navigation" aria-label="main navigation"> <div class="sphinxsidebarwrapper"> <h3><a href="index.html">Table of Contents</a></h3> <ul> <li><a class="reference internal" href="#">Model Selection</a><ul> <li><a class="reference internal" href="#targeting-a-quantification-oriented-loss">Targeting a Quantification-oriented loss</a></li> <li><a class="reference internal" href="#targeting-a-classification-oriented-loss">Targeting a Classification-oriented loss</a></li> </ul> </li> </ul> <div role="note" aria-label="source link"> <h3>This Page</h3> <ul class="this-page-menu"> <li><a href="_sources/Model-Selection.md.txt" rel="nofollow">Show Source</a></li> </ul> </div> <div id="searchbox" style="display: none" role="search"> <h3 id="searchlabel">Quick search</h3> <div class="searchformwrapper"> <form class="search" action="search.html" method="get"> <input type="text" name="q" aria-labelledby="searchlabel" autocomplete="off" autocorrect="off" autocapitalize="off" spellcheck="false"/> <input type="submit" value="Go" /> </form> </div> </div> <script>$('#searchbox').show(0);</script> </div> </div> <div class="clearer"></div> </div> <div class="related" role="navigation" aria-label="related navigation"> <h3>Navigation</h3> <ul> <li class="right" style="margin-right: 10px"> <a href="genindex.html" title="General Index" >index</a></li> <li class="right" > <a href="py-modindex.html" title="Python Module Index" >modules</a> |</li> <li class="nav-item nav-item-0"><a href="index.html">QuaPy 0.1.6 documentation</a> »</li> <li class="nav-item nav-item-this"><a href="">Model Selection</a></li> </ul> </div> <div class="footer" role="contentinfo"> © Copyright 2021, Alejandro Moreo. Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 4.2.0. </div> </body> </html>