forked from moreo/QuaPy
268 lines
23 KiB
HTML
268 lines
23 KiB
HTML
|
||
|
||
<!doctype html>
|
||
|
||
<html lang="en">
|
||
<head>
|
||
<meta charset="utf-8" />
|
||
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.19: https://docutils.sourceforge.io/" />
|
||
|
||
<title>Model Selection — QuaPy 0.1.7 documentation</title>
|
||
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
|
||
<link rel="stylesheet" type="text/css" href="_static/bizstyle.css" />
|
||
|
||
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
|
||
<script src="_static/jquery.js"></script>
|
||
<script src="_static/underscore.js"></script>
|
||
<script src="_static/_sphinx_javascript_frameworks_compat.js"></script>
|
||
<script src="_static/doctools.js"></script>
|
||
<script src="_static/sphinx_highlight.js"></script>
|
||
<script src="_static/bizstyle.js"></script>
|
||
<link rel="index" title="Index" href="genindex.html" />
|
||
<link rel="search" title="Search" href="search.html" />
|
||
<link rel="next" title="Plotting" href="Plotting.html" />
|
||
<link rel="prev" title="Quantification Methods" href="Methods.html" />
|
||
<meta name="viewport" content="width=device-width,initial-scale=1.0" />
|
||
<!--[if lt IE 9]>
|
||
<script src="_static/css3-mediaqueries.js"></script>
|
||
<![endif]-->
|
||
</head><body>
|
||
<div class="related" role="navigation" aria-label="related navigation">
|
||
<h3>Navigation</h3>
|
||
<ul>
|
||
<li class="right" style="margin-right: 10px">
|
||
<a href="genindex.html" title="General Index"
|
||
accesskey="I">index</a></li>
|
||
<li class="right" >
|
||
<a href="py-modindex.html" title="Python Module Index"
|
||
>modules</a> |</li>
|
||
<li class="right" >
|
||
<a href="Plotting.html" title="Plotting"
|
||
accesskey="N">next</a> |</li>
|
||
<li class="right" >
|
||
<a href="Methods.html" title="Quantification Methods"
|
||
accesskey="P">previous</a> |</li>
|
||
<li class="nav-item nav-item-0"><a href="index.html">QuaPy 0.1.7 documentation</a> »</li>
|
||
<li class="nav-item nav-item-this"><a href="">Model Selection</a></li>
|
||
</ul>
|
||
</div>
|
||
|
||
<div class="document">
|
||
<div class="documentwrapper">
|
||
<div class="bodywrapper">
|
||
<div class="body" role="main">
|
||
|
||
<section id="model-selection">
|
||
<h1>Model Selection<a class="headerlink" href="#model-selection" title="Permalink to this heading">¶</a></h1>
|
||
<p>As a supervised machine learning task, quantification methods
|
||
can strongly depend on a good choice of model hyper-parameters.
|
||
The process whereby those hyper-parameters are chosen is
|
||
typically known as <em>Model Selection</em>, and typically consists of
|
||
testing different settings and picking the one that performed
|
||
best in a held-out validation set in terms of any given
|
||
evaluation measure.</p>
|
||
<section id="targeting-a-quantification-oriented-loss">
|
||
<h2>Targeting a Quantification-oriented loss<a class="headerlink" href="#targeting-a-quantification-oriented-loss" title="Permalink to this heading">¶</a></h2>
|
||
<p>The task being optimized determines the evaluation protocol,
|
||
i.e., the criteria according to which the performance of
|
||
any given method for solving is to be assessed.
|
||
As a task on its own right, quantification should impose
|
||
its own model selection strategies, i.e., strategies
|
||
aimed at finding appropriate configurations
|
||
specifically designed for the task of quantification.</p>
|
||
<p>Quantification has long been regarded as an add-on of
|
||
classification, and thus the model selection strategies
|
||
customarily adopted in classification have simply been
|
||
applied to quantification (see the next section).
|
||
It has been argued in <a class="reference external" href="https://link.springer.com/chapter/10.1007/978-3-030-72240-1_6">Moreo, Alejandro, and Fabrizio Sebastiani.
|
||
Re-Assessing the “Classify and Count” Quantification Method.
|
||
ECIR 2021: Advances in Information Retrieval pp 75–91.</a>
|
||
that specific model selection strategies should
|
||
be adopted for quantification. That is, model selection
|
||
strategies for quantification should target
|
||
quantification-oriented losses and be tested in a variety
|
||
of scenarios exhibiting different degrees of prior
|
||
probability shift.</p>
|
||
<p>The class <em>qp.model_selection.GridSearchQ</em> implements a grid-search exploration over the space of
|
||
hyper-parameter combinations that <a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Evaluation">evaluates</a>
|
||
each combination of hyper-parameters by means of a given quantification-oriented
|
||
error metric (e.g., any of the error functions implemented
|
||
in <em>qp.error</em>) and according to a
|
||
<a class="reference external" href="https://github.com/HLT-ISTI/QuaPy/wiki/Protocols">sampling generation protocol</a>.</p>
|
||
<p>The following is an example (also included in the examples folder) of model selection for quantification:</p>
|
||
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">quapy</span> <span class="k">as</span> <span class="nn">qp</span>
|
||
<span class="kn">from</span> <span class="nn">quapy.protocol</span> <span class="kn">import</span> <span class="n">APP</span>
|
||
<span class="kn">from</span> <span class="nn">quapy.method.aggregative</span> <span class="kn">import</span> <span class="n">DistributionMatching</span>
|
||
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span>
|
||
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
|
||
|
||
<span class="sd">"""</span>
|
||
<span class="sd">In this example, we show how to perform model selection on a DistributionMatching quantifier.</span>
|
||
<span class="sd">"""</span>
|
||
|
||
<span class="n">model</span> <span class="o">=</span> <span class="n">DistributionMatching</span><span class="p">(</span><span class="n">LogisticRegression</span><span class="p">())</span>
|
||
|
||
<span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'SAMPLE_SIZE'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">100</span>
|
||
<span class="n">qp</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'N_JOBS'</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span> <span class="c1"># explore hyper-parameters in parallel</span>
|
||
|
||
<span class="n">training</span><span class="p">,</span> <span class="n">test</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fetch_reviews</span><span class="p">(</span><span class="s1">'imdb'</span><span class="p">,</span> <span class="n">tfidf</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">min_df</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span><span class="o">.</span><span class="n">train_test</span>
|
||
|
||
<span class="c1"># The model will be returned by the fit method of GridSearchQ.</span>
|
||
<span class="c1"># Every combination of hyper-parameters will be evaluated by confronting the</span>
|
||
<span class="c1"># quantifier thus configured against a series of samples generated by means</span>
|
||
<span class="c1"># of a sample generation protocol. For this example, we will use the</span>
|
||
<span class="c1"># artificial-prevalence protocol (APP), that generates samples with prevalence</span>
|
||
<span class="c1"># values in the entire range of values from a grid (e.g., [0, 0.1, 0.2, ..., 1]).</span>
|
||
<span class="c1"># We devote 30% of the dataset for this exploration.</span>
|
||
<span class="n">training</span><span class="p">,</span> <span class="n">validation</span> <span class="o">=</span> <span class="n">training</span><span class="o">.</span><span class="n">split_stratified</span><span class="p">(</span><span class="n">train_prop</span><span class="o">=</span><span class="mf">0.7</span><span class="p">)</span>
|
||
<span class="n">protocol</span> <span class="o">=</span> <span class="n">APP</span><span class="p">(</span><span class="n">validation</span><span class="p">)</span>
|
||
|
||
<span class="c1"># We will explore a classification-dependent hyper-parameter (e.g., the 'C'</span>
|
||
<span class="c1"># hyper-parameter of LogisticRegression) and a quantification-dependent hyper-parameter</span>
|
||
<span class="c1"># (e.g., the number of bins in a DistributionMatching quantifier.</span>
|
||
<span class="c1"># Classifier-dependent hyper-parameters have to be marked with a prefix "classifier__"</span>
|
||
<span class="c1"># in order to let the quantifier know this hyper-parameter belongs to its underlying</span>
|
||
<span class="c1"># classifier.</span>
|
||
<span class="n">param_grid</span> <span class="o">=</span> <span class="p">{</span>
|
||
<span class="s1">'classifier__C'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">7</span><span class="p">),</span>
|
||
<span class="s1">'nbins'</span><span class="p">:</span> <span class="p">[</span><span class="mi">8</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">64</span><span class="p">],</span>
|
||
<span class="p">}</span>
|
||
|
||
<span class="n">model</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">model_selection</span><span class="o">.</span><span class="n">GridSearchQ</span><span class="p">(</span>
|
||
<span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
|
||
<span class="n">param_grid</span><span class="o">=</span><span class="n">param_grid</span><span class="p">,</span>
|
||
<span class="n">protocol</span><span class="o">=</span><span class="n">protocol</span><span class="p">,</span>
|
||
<span class="n">error</span><span class="o">=</span><span class="s1">'mae'</span><span class="p">,</span> <span class="c1"># the error to optimize is the MAE (a quantification-oriented loss)</span>
|
||
<span class="n">refit</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="c1"># retrain on the whole labelled set once done</span>
|
||
<span class="n">verbose</span><span class="o">=</span><span class="kc">True</span> <span class="c1"># show information as the process goes on</span>
|
||
<span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">training</span><span class="p">)</span>
|
||
|
||
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'model selection ended: best hyper-parameters=</span><span class="si">{</span><span class="n">model</span><span class="o">.</span><span class="n">best_params_</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
|
||
<span class="n">model</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">best_model_</span>
|
||
|
||
<span class="c1"># evaluation in terms of MAE</span>
|
||
<span class="c1"># we use the same evaluation protocol (APP) on the test set</span>
|
||
<span class="n">mae_score</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">evaluation</span><span class="o">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">protocol</span><span class="o">=</span><span class="n">APP</span><span class="p">(</span><span class="n">test</span><span class="p">),</span> <span class="n">error_metric</span><span class="o">=</span><span class="s1">'mae'</span><span class="p">)</span>
|
||
|
||
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'MAE=</span><span class="si">{</span><span class="n">mae_score</span><span class="si">:</span><span class="s1">.5f</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
|
||
</pre></div>
|
||
</div>
|
||
<p>In this example, the system outputs:</p>
|
||
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">starting</span> <span class="n">model</span> <span class="n">selection</span> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">n_jobs</span> <span class="o">=-</span><span class="mi">1</span>
|
||
<span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">hyperparams</span><span class="o">=</span><span class="p">{</span><span class="s1">'classifier__C'</span><span class="p">:</span> <span class="mf">0.01</span><span class="p">,</span> <span class="s1">'nbins'</span><span class="p">:</span> <span class="mi">64</span><span class="p">}</span> <span class="n">got</span> <span class="n">mae</span> <span class="n">score</span> <span class="mf">0.04021</span> <span class="p">[</span><span class="n">took</span> <span class="mf">1.1356</span><span class="n">s</span><span class="p">]</span>
|
||
<span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">hyperparams</span><span class="o">=</span><span class="p">{</span><span class="s1">'classifier__C'</span><span class="p">:</span> <span class="mf">0.01</span><span class="p">,</span> <span class="s1">'nbins'</span><span class="p">:</span> <span class="mi">32</span><span class="p">}</span> <span class="n">got</span> <span class="n">mae</span> <span class="n">score</span> <span class="mf">0.04286</span> <span class="p">[</span><span class="n">took</span> <span class="mf">1.2139</span><span class="n">s</span><span class="p">]</span>
|
||
<span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">hyperparams</span><span class="o">=</span><span class="p">{</span><span class="s1">'classifier__C'</span><span class="p">:</span> <span class="mf">0.01</span><span class="p">,</span> <span class="s1">'nbins'</span><span class="p">:</span> <span class="mi">16</span><span class="p">}</span> <span class="n">got</span> <span class="n">mae</span> <span class="n">score</span> <span class="mf">0.04888</span> <span class="p">[</span><span class="n">took</span> <span class="mf">1.2491</span><span class="n">s</span><span class="p">]</span>
|
||
<span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">hyperparams</span><span class="o">=</span><span class="p">{</span><span class="s1">'classifier__C'</span><span class="p">:</span> <span class="mf">0.001</span><span class="p">,</span> <span class="s1">'nbins'</span><span class="p">:</span> <span class="mi">8</span><span class="p">}</span> <span class="n">got</span> <span class="n">mae</span> <span class="n">score</span> <span class="mf">0.05163</span> <span class="p">[</span><span class="n">took</span> <span class="mf">1.5372</span><span class="n">s</span><span class="p">]</span>
|
||
<span class="p">[</span><span class="o">...</span><span class="p">]</span>
|
||
<span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">hyperparams</span><span class="o">=</span><span class="p">{</span><span class="s1">'classifier__C'</span><span class="p">:</span> <span class="mf">1000.0</span><span class="p">,</span> <span class="s1">'nbins'</span><span class="p">:</span> <span class="mi">32</span><span class="p">}</span> <span class="n">got</span> <span class="n">mae</span> <span class="n">score</span> <span class="mf">0.02445</span> <span class="p">[</span><span class="n">took</span> <span class="mf">2.9056</span><span class="n">s</span><span class="p">]</span>
|
||
<span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">optimization</span> <span class="n">finished</span><span class="p">:</span> <span class="n">best</span> <span class="n">params</span> <span class="p">{</span><span class="s1">'classifier__C'</span><span class="p">:</span> <span class="mf">100.0</span><span class="p">,</span> <span class="s1">'nbins'</span><span class="p">:</span> <span class="mi">32</span><span class="p">}</span> <span class="p">(</span><span class="n">score</span><span class="o">=</span><span class="mf">0.02234</span><span class="p">)</span> <span class="p">[</span><span class="n">took</span> <span class="mf">7.3114</span><span class="n">s</span><span class="p">]</span>
|
||
<span class="p">[</span><span class="n">GridSearchQ</span><span class="p">]:</span> <span class="n">refitting</span> <span class="n">on</span> <span class="n">the</span> <span class="n">whole</span> <span class="n">development</span> <span class="nb">set</span>
|
||
<span class="n">model</span> <span class="n">selection</span> <span class="n">ended</span><span class="p">:</span> <span class="n">best</span> <span class="n">hyper</span><span class="o">-</span><span class="n">parameters</span><span class="o">=</span><span class="p">{</span><span class="s1">'classifier__C'</span><span class="p">:</span> <span class="mf">100.0</span><span class="p">,</span> <span class="s1">'nbins'</span><span class="p">:</span> <span class="mi">32</span><span class="p">}</span>
|
||
<span class="n">MAE</span><span class="o">=</span><span class="mf">0.03102</span>
|
||
</pre></div>
|
||
</div>
|
||
<p>The parameter <em>val_split</em> can alternatively be used to indicate
|
||
a validation set (i.e., an instance of <em>LabelledCollection</em>) instead
|
||
of a proportion. This could be useful if one wants to have control
|
||
on the specific data split to be used across different model selection
|
||
experiments.</p>
|
||
</section>
|
||
<section id="targeting-a-classification-oriented-loss">
|
||
<h2>Targeting a Classification-oriented loss<a class="headerlink" href="#targeting-a-classification-oriented-loss" title="Permalink to this heading">¶</a></h2>
|
||
<p>Optimizing a model for quantification could rather be
|
||
computationally costly.
|
||
In aggregative methods, one could alternatively try to optimize
|
||
the classifier’s hyper-parameters for classification.
|
||
Although this is theoretically suboptimal, many articles in
|
||
quantification literature have opted for this strategy.</p>
|
||
<p>In QuaPy, this is achieved by simply instantiating the
|
||
classifier learner as a GridSearchCV from scikit-learn.
|
||
The following code illustrates how to do that:</p>
|
||
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">learner</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span>
|
||
<span class="n">LogisticRegression</span><span class="p">(),</span>
|
||
<span class="n">param_grid</span><span class="o">=</span><span class="p">{</span><span class="s1">'C'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">10</span><span class="p">),</span> <span class="s1">'class_weight'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'balanced'</span><span class="p">,</span> <span class="kc">None</span><span class="p">]},</span>
|
||
<span class="n">cv</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
|
||
<span class="n">model</span> <span class="o">=</span> <span class="n">DistributionMatching</span><span class="p">(</span><span class="n">learner</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">training</span><span class="p">)</span>
|
||
</pre></div>
|
||
</div>
|
||
<p>However, this is conceptually flawed, since the model should be
|
||
optimized for the task at hand (quantification), and not for a surrogate task (classification),
|
||
i.e., the model should be requested to deliver low quantification errors, rather
|
||
than low classification errors.</p>
|
||
</section>
|
||
</section>
|
||
|
||
|
||
<div class="clearer"></div>
|
||
</div>
|
||
</div>
|
||
</div>
|
||
<div class="sphinxsidebar" role="navigation" aria-label="main navigation">
|
||
<div class="sphinxsidebarwrapper">
|
||
<div>
|
||
<h3><a href="index.html">Table of Contents</a></h3>
|
||
<ul>
|
||
<li><a class="reference internal" href="#">Model Selection</a><ul>
|
||
<li><a class="reference internal" href="#targeting-a-quantification-oriented-loss">Targeting a Quantification-oriented loss</a></li>
|
||
<li><a class="reference internal" href="#targeting-a-classification-oriented-loss">Targeting a Classification-oriented loss</a></li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
|
||
</div>
|
||
<div>
|
||
<h4>Previous topic</h4>
|
||
<p class="topless"><a href="Methods.html"
|
||
title="previous chapter">Quantification Methods</a></p>
|
||
</div>
|
||
<div>
|
||
<h4>Next topic</h4>
|
||
<p class="topless"><a href="Plotting.html"
|
||
title="next chapter">Plotting</a></p>
|
||
</div>
|
||
<div role="note" aria-label="source link">
|
||
<h3>This Page</h3>
|
||
<ul class="this-page-menu">
|
||
<li><a href="_sources/Model-Selection.md.txt"
|
||
rel="nofollow">Show Source</a></li>
|
||
</ul>
|
||
</div>
|
||
<div id="searchbox" style="display: none" role="search">
|
||
<h3 id="searchlabel">Quick search</h3>
|
||
<div class="searchformwrapper">
|
||
<form class="search" action="search.html" method="get">
|
||
<input type="text" name="q" aria-labelledby="searchlabel" autocomplete="off" autocorrect="off" autocapitalize="off" spellcheck="false"/>
|
||
<input type="submit" value="Go" />
|
||
</form>
|
||
</div>
|
||
</div>
|
||
<script>document.getElementById('searchbox').style.display = "block"</script>
|
||
</div>
|
||
</div>
|
||
<div class="clearer"></div>
|
||
</div>
|
||
<div class="related" role="navigation" aria-label="related navigation">
|
||
<h3>Navigation</h3>
|
||
<ul>
|
||
<li class="right" style="margin-right: 10px">
|
||
<a href="genindex.html" title="General Index"
|
||
>index</a></li>
|
||
<li class="right" >
|
||
<a href="py-modindex.html" title="Python Module Index"
|
||
>modules</a> |</li>
|
||
<li class="right" >
|
||
<a href="Plotting.html" title="Plotting"
|
||
>next</a> |</li>
|
||
<li class="right" >
|
||
<a href="Methods.html" title="Quantification Methods"
|
||
>previous</a> |</li>
|
||
<li class="nav-item nav-item-0"><a href="index.html">QuaPy 0.1.7 documentation</a> »</li>
|
||
<li class="nav-item nav-item-this"><a href="">Model Selection</a></li>
|
||
</ul>
|
||
</div>
|
||
<div class="footer" role="contentinfo">
|
||
© Copyright 2021, Alejandro Moreo.
|
||
Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 5.3.0.
|
||
</div>
|
||
</body>
|
||
</html> |