Recap Feb. 2021:
- Adapt everything to testing classic neural training for AA (i.e., projector+classifier training) vs. applying Supervised Contrastive Learning (SCL) as a pretraining step for solving SAV, and then training a linear classifier with the projector network frozen. Reassess the work in terms of SAV and make connections with KTA and SVMs. Maybe claim that SCL+SVM is the way to go.
- Compare (Attribution):
  - S. Ruder's systems
  - My system (projector + classifier layer) as a reimplementation of S. Ruder's systems
  - Projector trained via SCL + classifier layer trained alone
  - Projector trained via SCL + SVM classifier
  - Projector trained via KTA + SVM classifier
  - Comparator or Siamese networks for SAV + classifier layer
- Compare (SAV):
  - My system (projector + binary-classifier layer)
  - Projector trained via SCL + binary classifier layer trained alone
  - Projector trained via SCL + SVM classifier
  - Projector trained via KTA + SVM classifier
  - Other systems (maybe Diff-Vectors, maybe Impostors, maybe distance-based)
  - Comparator or Siamese networks for SAV
- Additional experiments:
  - show the kernel matrix

Future:
- Test also on general TC? There are some torch datasets in torchtext that could simplify things... but that would blur the idea of SCL-SAV.

Code:
- redo the dataset in terms of PyTorch's DataLoader

---------------------

Things to clarify:

about the network:
==================
- remove the .to() calls inside the Module and use self.on_cpu instead
- process datasets and leave it as a generic parameter
- padding could start at any random point in [0, length_i - pad_length]:
  - in training, pad to the shortest
  - in test, pad to the largest

about the loss and the KTA:
===========================
- it is not clear whether we should define the loss as in "On kernel target alignment", i.e., with the Frobenius inner product <K, Y>_F in the numerator (with the sign changed so it can be minimized), or as the Frobenius norm |K - Y|_F. What about the denominator? (Currently the normalization factor is n**2.)
- maybe the sav-loss is something that may make sense to impose, as a regularization, across many of the last layers, and not only the very last one?
- are the contributions of the two losses comparable, or does one contribute far more than the other?
- is the TwoClassBatch the best way?
- maybe I have to review the validation of the sav-loss; since it is batched, it might always be checking the same submatrices for alignment, and those may be mostly positive or mostly near an identity
- SAV: how should the range of k(xi, xj) be interpreted? How to decide the value threshold for returning -1 or +1? I guess the best thing is to learn a simple threshold, i.e., one feed-forward 1-to-1 unit
- plot the kernel matrix as an imshow, with rows/cols arranged by authors, and check whether the KTA that SCL yields is better than that obtained with traditional training for attribution
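The random-offset padding scheme noted above ([0, length_i - pad_length]; shortest in training, largest in test) could be sketched as follows. `crop_or_pad` and `batch_target_len` are hypothetical names, not functions from the repo:

```python
import random

def batch_target_len(lengths, train=True):
    # Per the note: in training, crop/pad to the shortest document in the
    # batch; in test, pad to the largest.
    return min(lengths) if train else max(lengths)

def crop_or_pad(seq, target_len, pad_value=0, train=True):
    if len(seq) > target_len:
        # The window can start at any random point in
        # [0, len(seq) - target_len] during training; at test time it
        # starts at 0 for determinism.
        start = random.randint(0, len(seq) - target_len) if train else 0
        return seq[start:start + target_len]
    # Shorter sequences are right-padded up to target_len.
    return seq + [pad_value] * (target_len - len(seq))

# usage on a toy batch of token-id lists
batch = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]
tlen = batch_target_len([len(s) for s in batch], train=True)   # shortest -> 2
padded = [crop_or_pad(s, tlen, train=True) for s in batch]
```

The random offset acts as a cheap data augmentation: the same document yields a different window at each epoch.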
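The two loss candidates being weighed for KTA can be written down as a minimal NumPy sketch (function names are mine; Y is the ideal target kernel, +1 for same-author pairs and -1 otherwise):

```python
import numpy as np

def neg_alignment_loss(K, Y):
    # Alignment as in "On kernel target alignment":
    #   A(K, Y) = <K, Y>_F / (||K||_F * ||Y||_F)
    # Sign changed so that minimizing the loss maximizes alignment; note the
    # denominator already normalizes, so no extra n**2 factor is needed.
    return -np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y))

def frobenius_loss(K, Y):
    # Alternative: squared Frobenius distance |K - Y|_F**2, normalized by
    # the n**2 entries of the matrix (the current normalization factor).
    n = K.shape[0]
    return np.sum((K - Y) ** 2) / n ** 2

# toy check: for a perfectly aligned kernel (K == Y) the first loss reaches
# its minimum -1 and the second is 0
y = np.array([0, 0, 1, 1])                        # author labels
Y = np.where(y[:, None] == y[None, :], 1.0, -1.0)
```

One practical difference: the alignment loss is invariant to the scale of K, whereas the Frobenius loss also penalizes the magnitude of the kernel values.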
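For the imshow check, a sketch of arranging the kernel matrix by authors (`author_sorted_kernel` is a hypothetical helper; `Z` would be the matrix of projector embeddings):

```python
import numpy as np

def author_sorted_kernel(Z, authors):
    # L2-normalize the embeddings so that K[i, j] is a cosine similarity
    # in [-1, 1], then reorder rows and columns so that documents by the
    # same author are contiguous.
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    order = np.argsort(authors, kind="stable")
    K = Z @ Z.T
    # A well-aligned kernel should show bright blocks on the diagonal,
    # one per author, e.g.:
    #   import matplotlib.pyplot as plt
    #   plt.imshow(K[np.ix_(order, order)]); plt.colorbar(); plt.show()
    return K[np.ix_(order, order)]
```

Plotting the matrix from the SCL-pretrained projector side by side with the one from the attribution-trained network would make the KTA comparison visual.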