Fine-tuning multilingual XLM for the Commerce Integrity domain

West Coast NLP Summit (WeCNLP)

Abstract

Commerce Integrity is essential to maintaining a safe and trustworthy ecosystem in which buyers and sellers can conduct business with peace of mind on any e-commerce platform. Product text, including the title and description, is a critical source of information for detecting whether a given product is safe or poses a risk. Building expressive text representations in the form of embeddings provides a flexible and computationally cheap way to train machine learning models that detect different types of violations. Transformer-based pre-trained language models (PLMs) such as BERT (Devlin et al., 2018) and XLM (Lample and Conneau, 2019) have achieved significant breakthroughs on many NLP tasks, the latter especially in cross-lingual classification and machine translation. However, these models are trained on generic datasets, and it has been shown that in-domain fine-tuning improves model performance in most cases (Gururangan et al., 2020; Rietzler et al., 2019; Sun et al., 2019; Edwards et al., 2020), while in other cases plain off-the-shelf PLMs already perform well (Sushil et al., 2021). Whether domain knowledge and fine-tuning of text-embedding generator models help remains an open question for commerce integrity. In this work we address these questions by presenting our work on building commerce text embeddings used to train integrity-violation detection models for a C2C e-commerce platform.
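As a concrete illustration of the pipeline described above, the sketch below continues masked-language-model pre-training of an off-the-shelf multilingual encoder on product text and then uses the adapted encoder as a frozen embedding generator for downstream violation classifiers. This is a minimal sketch using the Hugging Face transformers API under stated assumptions: the xlm-roberta-base checkpoint, the toy corpus, and all hyperparameters are illustrative, not the configuration used in this work.

```python
import torch
from transformers import (
    AutoModel,
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical in-domain corpus: concatenated product titles and descriptions.
product_texts = [
    "Vintage leather handbag, gently used, ships worldwide",
    "Brand-new smartphone, factory sealed, original receipt included",
]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encodings = tokenizer(product_texts, truncation=True, max_length=128)

class ProductTextDataset(torch.utils.data.Dataset):
    """Wraps tokenized product text for the Trainer."""
    def __init__(self, encodings):
        self.encodings = encodings
    def __len__(self):
        return len(self.encodings["input_ids"])
    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.encodings.items()}

# Stage 1: continue masked-language-model pre-training on commerce text
# (domain-adaptive pre-training in the sense of Gururangan et al., 2020).
mlm_model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
trainer = Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="xlmr-commerce", num_train_epochs=1),
    train_dataset=ProductTextDataset(encodings),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
trainer.save_model("xlmr-commerce")

# Stage 2: use the adapted encoder as a frozen embedding generator.
encoder = AutoModel.from_pretrained("xlmr-commerce")
batch = tokenizer(product_texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state       # (batch, seq_len, dim)
mask = batch["attention_mask"].unsqueeze(-1).float()  # zero out padding tokens
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling
# `embeddings` can now feed inexpensive downstream violation classifiers.
```

Mean pooling over the final hidden states is one common choice for producing sentence-level embeddings from an encoder; the pooling strategy and training details used in this work may differ.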

