<!doctype html> <html lang="en"> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.19: https://docutils.sourceforge.io/" /> <title>Model Selection — QuaPy 0.1.7 documentation</title> <link rel="stylesheet" type="text/css" href="_static/pygments.css" /> <link rel="stylesheet" type="text/css" href="_static/bizstyle.css" /> <script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script> <script src="_static/jquery.js"></script> <script src="_static/underscore.js"></script> <script src="_static/_sphinx_javascript_frameworks_compat.js"></script> <script src="_static/doctools.js"></script> <script src="_static/sphinx_highlight.js"></script> <script src="_static/bizstyle.js"></script> <link rel="index" title="Index" href="genindex.html" /> <link rel="search" title="Search" href="search.html" /> <link rel="next" title="Plotting" href="Plotting.html" /> <link rel="prev" title="Quantification Methods" href="Methods.html" /> <meta name="viewport" content="width=device-width,initial-scale=1.0" /> <!--[if lt IE 9]> <script src="_static/css3-mediaqueries.js"></script> <![endif]--> </head><body> <div class="related" role="navigation" aria-label="related navigation"> <h3>Navigation</h3> <ul> <li class="right" style="margin-right: 10px"> <a href="genindex.html" title="General Index" accesskey="I">index</a></li> <li class="right" > <a href="py-modindex.html" title="Python Module Index" >modules</a> |</li> <li class="right" > <a href="Plotting.html" title="Plotting" accesskey="N">next</a> |</li> <li class="right" > <a href="Methods.html" title="Quantification Methods" accesskey="P">previous</a> |</li> <li class="nav-item nav-item-0"><a href="index.html">QuaPy 0.1.7 documentation</a> »</li> <li class="nav-item nav-item-this"><a href="">Model Selection</a></li> </ul> </div> <div class="document"> <div class="documentwrapper"> <div class="bodywrapper"> <div class="body" role="main"> <section id="model-selection"> <h1>Model Selection<a class="headerlink" href="#model-selection" title="Permalink to this heading">¶</a></h1> <p>As a supervised machine learning task, quantification methods can strongly depend on a good choice of model hyper-parameters. The process whereby those hyper-parameters are chosen is typically known as <em>Model Selection</em>, and typically consists of testing different settings and picking the one that performed best in a held-out validation set in terms of any given evaluation measure.</p> <section id="targeting-a-quantification-oriented-loss"> <h2>Targeting a Quantification-oriented loss<a class="headerlink" href="#targeting-a-quantification-oriented-loss" title="Permalink to this heading">¶</a></h2> <p>The task being optimized determines the evaluation protocol, i.e., the criteria according to which the performance of any given method for solving is to be assessed. As a task on its own right, quantification should impose its own model selection strategies, i.e., strategies aimed at finding appropriate configurations specifically designed for the task of quantification.</p> <p>Quantification has long been regarded as an add-on of classification, and thus the model selection strategies customarily adopted in classification have simply been applied to quantification (see the next section). It has been argued in <a class="reference external" href="https://link.springer.com/chapter/10.1007/978-3-030-72240-1_6">Moreo, Alejandro, and Fabrizio Sebastiani. Re-Assessing the “Classify and Count” Quantification Method. ECIR 2021: Advances in Information Retrieval pp 75–91.</a> that specific model selection strategies should be adopted for quantification. That is, model selection strategies for quantification should target quantification-oriented losses and be tested in a variety of scenarios exhibiting different degrees of prior probability shift.</p> <p>The class <em>qp.model_selection.GridSearchQ</em> implements a grid-search exploration over the space of hyper-parameter combinations that <a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation">evaluates</a> each combination of hyper-parameters by means of a given quantification-oriented error metric (e.g., any of the error functions implemented in <em>qp.error</em>) and according to a <a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Protocols">sampling generation protocol</a>.</p> <p>The following is an example (also included in the examples folder) of model selection for quantification:</p> <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">quapy</span> <span class="k">as</span> <span class="nn">qp</span> <span class="kn">from</span> <span class="nn">quapy.protocol</span> <span class="kn">import</span> <span class="n">APP</span> <span class="kn">from</span> <span class="nn">quapy.method.aggregative</span> <span class="kn">import</span> <span class="n">DistributionMatching</span> <span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span> <span class="sd">"""</span> <span class="sd">In this example, we show how to perform model selection on a DistributionMatching quantifier.</span> <span class="sd">"""</span> <span class="n">model</span> <span class="o">=</span> <span class="n">DistributionMatching</span><span class="p">(</span><span class="n">LogisticRegression</span><span class="p">())</span> <span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'SAMPLE_SIZE'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">100</span> <span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'N_JOBS'</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span> <span class="c1"># explore hyper-parameters in parallel</span> <span class="n">training</span><span class="p">,</span> <span class="n">test</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fetch_reviews</span><span class="p">(</span><span class="s1">'imdb'</span><span class="p">,</span> <span class="n">tfidf</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">min_df</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span><span class="o">.</span><span class="n">train_test</span> <span class="c1"># The model will be returned by the fit method of GridSearchQ.</span> <span class="c1"># Every combination of hyper-parameters will be evaluated by confronting the</span> <span class="c1"># quantifier thus configured against a series of samples generated by means</span> <span class="c1"># of a sample generation protocol. For this example, we will use the</span> <span class="c1"># artificial-prevalence protocol (APP), that generates samples with prevalence</span> <span class="c1"># values in the entire range of values from a grid (e.g., [0, 0.1, 0.2, ..., 1]).</span> <span class="c1"># We devote 30% of the dataset for this exploration.</span> <span class="n">training</span><span class="p">,</span> <span class="n">validation</span> <span class="o">=</span> <span class="n">training</span><span class="o">.</span><span class="n">split_stratified</span><span class="p">(</span><span class="n">train_prop</span><span class="o">=</span><span class="mf">0.7</span><span class="p">)</span> <span class="n">protocol</span> <span class="o">=</span> <span class="n">APP</span><span class="p">(</span><span class="n">validation</span><span class="p">)</span> <span class="c1"># We will explore a classification-dependent hyper-parameter (e.g., the 'C'</span> <span class="c1"># hyper-parameter of LogisticRegression) and a quantification-dependent hyper-parameter</span> <span class="c1"># (e.g., the number of bins in a DistributionMatching quantifier.</span> <span class="c1"># Classifier-dependent hyper-parameters have to be marked with a prefix "classifier__"</span> <span class="c1"># in order to let the quantifier know this hyper-parameter belongs to its underlying</span> <span class="c1"># classifier.</span> <span class="n">param_grid</span> <span class="o">=</span> <span class="p">{</span> <span class="s1">'classifier__C'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">7</span><span class="p">),</span> <span class="s1">'nbins'</span><span class="p">:</span> <span class="p">[</span><span class="mi">8</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">64</span><span class="p">],</span> <span class="p">}</span> <span class="n">model</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">model_selection</span><span class="o">.</span><span class="n">GridSearchQ</span><span class="p">(</span> <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span> <span class="n">param_grid</span><span class="o">=</span><span class="n">param_grid</span><span class="p">,</span> <span class="n">protocol</span><span class="o">=</span><span class="n">protocol</span><span class="p">,</span> <span class="n">error</span><span class="o">=</span><span class="s1">'mae'</span><span class="p">,</span> <span class="c1"># the error to optimize is the MAE (a quantification-oriented loss)</span> <span class="n">refit</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="c1"># retrain on the whole labelled set once done</span> <span class="n">verbose</span><span class="o">=</span><span class="kc">True</span> <span class="c1"># show information as the process goes on</span> <span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">training</span><span class="p">)</span> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'model selection ended: best hyper-parameters=</span><span class="si">{</span><span class="n">model</span><span class="o">.</span><span class="n">best_params_</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span> <span class="n">model</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">best_model_</span> <span class="c1"># evaluation in terms of MAE</span> <span class="c1"># we use the same evaluation protocol (APP) on the test set</span> <span class="n">mae_score</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">evaluation</span><span class="o">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">protocol</span><span class="o">=</span><span class="n">APP</span><span class="p">(</span><span class="n">test</span><span class="p">),</span> <span class="n">error_metric</span><span class="o">=</span><span class="s1">'mae'</span><span class="p">)</span> <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'MAE=</span><span class="si">{</span><span class="n">mae_score</span><span class="si">:</span><span class="s1">.5f</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span> </pre></div> </div> <p>In this example, the system outputs:</p> <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">starting</span> <span class="n">model</span> <span class="n">selection</span> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">n_jobs</span> <span class="o">=-</span><span class="mi">1</span> <span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">hyperparams</span><span class="o">=</span><span class="p">{</span><span class="s1">'classifier__C'</span><span class="p">:</span> <span class="mf">0.01</span><span class="p">,</span> <span class="s1">'nbins'</span><span class="p">:</span> <span class="mi">64</span><span class="p">}</span> <span class="n">got</span> <span class="n">mae</span> <span class="n">score</span> <span class="mf">0.04021</span> <span class="p">[</span><span class="n">took</span> <span class="mf">1.1356</span><span class="n">s</span><span class="p">]</span> <span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">hyperparams</span><span class="o">=</span><span class="p">{</span><span class="s1">'classifier__C'</span><span class="p">:</span> <span class="mf">0.01</span><span class="p">,</span> <span class="s1">'nbins'</span><span class="p">:</span> <span class="mi">32</span><span class="p">}</span> <span class="n">got</span> <span class="n">mae</span> <span class="n">score</span> <span class="mf">0.04286</span> <span class="p">[</span><span class="n">took</span> <span class="mf">1.2139</span><span class="n">s</span><span class="p">]</span> <span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">hyperparams</span><span class="o">=</span><span class="p">{</span><span class="s1">'classifier__C'</span><span class="p">:</span> <span class="mf">0.01</span><span class="p">,</span> <span class="s1">'nbins'</span><span class="p">:</span> <span class="mi">16</span><span class="p">}</span> <span class="n">got</span> <span class="n">mae</span> <span class="n">score</span> <span class="mf">0.04888</span> <span class="p">[</span><span class="n">took</span> <span class="mf">1.2491</span><span class="n">s</span><span class="p">]</span> <span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">hyperparams</span><span class="o">=</span><span class="p">{</span><span class="s1">'classifier__C'</span><span class="p">:</span> <span class="mf">0.001</span><span class="p">,</span> <span class="s1">'nbins'</span><span class="p">:</span> <span class="mi">8</span><span class="p">}</span> <span class="n">got</span> <span class="n">mae</span> <span class="n">score</span> <span class="mf">0.05163</span> <span class="p">[</span><span class="n">took</span> <span class="mf">1.5372</span><span class="n">s</span><span class="p">]</span> <span class="p">[</span><span class="o">...</span><span class="p">]</span> <span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">hyperparams</span><span class="o">=</span><span class="p">{</span><span class="s1">'classifier__C'</span><span class="p">:</span> <span class="mf">1000.0</span><span class="p">,</span> <span class="s1">'nbins'</span><span class="p">:</span> <span class="mi">32</span><span class="p">}</span> <span class="n">got</span> <span class="n">mae</span> <span class="n">score</span> <span class="mf">0.02445</span> <span class="p">[</span><span class="n">took</span> <span class="mf">2.9056</span><span class="n">s</span><span class="p">]</span> <span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">optimization</span> <span class="n">finished</span><span class="p">:</span> <span class="n">best</span> <span class="n">params</span> <span class="p">{</span><span class="s1">'classifier__C'</span><span class="p">:</span> <span class="mf">100.0</span><span class="p">,</span> <span class="s1">'nbins'</span><span class="p">:</span> <span class="mi">32</span><span class="p">}</span> <span class="p">(</span><span class="n">score</span><span class="o">=</span><span class="mf">0.02234</span><span class="p">)</span> <span class="p">[</span><span class="n">took</span> <span class="mf">7.3114</span><span class="n">s</span><span class="p">]</span> <span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">refitting</span> <span class="n">on</span> <span class="n">the</span> <span class="n">whole</span> <span class="n">development</span> <span class="nb">set</span> <span class="n">model</span> <span class="n">selection</span> <span class="n">ended</span><span class="p">:</span> <span class="n">best</span> <span class="n">hyper</span><span class="o">-</span><span class="n">parameters</span><span class="o">=</span><span class="p">{</span><span class="s1">'classifier__C'</span><span class="p">:</span> <span class="mf">100.0</span><span class="p">,</span> <span class="s1">'nbins'</span><span class="p">:</span> <span class="mi">32</span><span class="p">}</span> <span class="n">MAE</span><span class="o">=</span><span class="mf">0.03102</span> </pre></div> </div> <p>The parameter <em>val_split</em> can alternatively be used to indicate a validation set (i.e., an instance of <em>LabelledCollection</em>) instead of a proportion. This could be useful if one wants to have control on the specific data split to be used across different model selection experiments.</p> </section> <section id="targeting-a-classification-oriented-loss"> <h2>Targeting a Classification-oriented loss<a class="headerlink" href="#targeting-a-classification-oriented-loss" title="Permalink to this heading">¶</a></h2> <p>Optimizing a model for quantification could rather be computationally costly. In aggregative methods, one could alternatively try to optimize the classifier’s hyper-parameters for classification. Although this is theoretically suboptimal, many articles in quantification literature have opted for this strategy.</p> <p>In QuaPy, this is achieved by simply instantiating the classifier learner as a GridSearchCV from scikit-learn. The following code illustrates how to do that:</p> <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">learner</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span> <span class="n">LogisticRegression</span><span class="p">(),</span> <span class="n">param_grid</span><span class="o">=</span><span class="p">{</span><span class="s1">'C'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">10</span><span class="p">),</span> <span class="s1">'class_weight'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'balanced'</span><span class="p">,</span> <span class="kc">None</span><span class="p">]},</span> <span class="n">cv</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span> <span class="n">model</span> <span class="o">=</span> <span class="n">DistributionMatching</span><span class="p">(</span><span class="n">learner</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">training</span><span class="p">)</span> </pre></div> </div> <p>However, this is conceptually flawed, since the model should be optimized for the task at hand (quantification), and not for a surrogate task (classification), i.e., the model should be requested to deliver low quantification errors, rather than low classification errors.</p> </section> </section> <div class="clearer"></div> </div> </div> </div> <div class="sphinxsidebar" role="navigation" aria-label="main navigation"> <div class="sphinxsidebarwrapper"> <div> <h3><a href="index.html">Table of Contents</a></h3> <ul> <li><a class="reference internal" href="#">Model Selection</a><ul> <li><a class="reference internal" href="#targeting-a-quantification-oriented-loss">Targeting a Quantification-oriented loss</a></li> <li><a class="reference internal" href="#targeting-a-classification-oriented-loss">Targeting a Classification-oriented loss</a></li> </ul> </li> </ul> </div> <div> <h4>Previous topic</h4> <p class="topless"><a href="Methods.html" title="previous chapter">Quantification Methods</a></p> </div> <div> <h4>Next topic</h4> <p class="topless"><a href="Plotting.html" title="next chapter">Plotting</a></p> </div> <div role="note" aria-label="source link"> <h3>This Page</h3> <ul class="this-page-menu"> <li><a href="_sources/Model-Selection.md.txt" rel="nofollow">Show Source</a></li> </ul> </div> <div id="searchbox" style="display: none" role="search"> <h3 id="searchlabel">Quick search</h3> <div class="searchformwrapper"> <form class="search" action="search.html" method="get"> <input type="text" name="q" aria-labelledby="searchlabel" autocomplete="off" autocorrect="off" autocapitalize="off" spellcheck="false"/> <input type="submit" value="Go" /> </form> </div> </div> <script>document.getElementById('searchbox').style.display = "block"</script> </div> </div> <div class="clearer"></div> </div> <div class="related" role="navigation" aria-label="related navigation"> <h3>Navigation</h3> <ul> <li class="right" style="margin-right: 10px"> <a href="genindex.html" title="General Index" >index</a></li> <li class="right" > <a href="py-modindex.html" title="Python Module Index" >modules</a> |</li> <li class="right" > <a href="Plotting.html" title="Plotting" >next</a> |</li> <li class="right" > <a href="Methods.html" title="Quantification Methods" >previous</a> |</li> <li class="nav-item nav-item-0"><a href="index.html">QuaPy 0.1.7 documentation</a> »</li> <li class="nav-item nav-item-this"><a href="">Model Selection</a></li> </ul> </div> <div class="footer" role="contentinfo"> © Copyright 2021, Alejandro Moreo. Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 5.3.0. </div> </body> </html>