ECIR 2024 Paper List

Paper	Authors	Organization	Abstract	Translation	Code	Citations
Large Language Models are Zero-Shot Rankers for Recommender Systems	Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian J. McAuley, Wayne Xin Zhao		Recently, large language models (LLMs) (e.g. GPT-4) have demonstrated impressive general-purpose task-solving abilities, including the potential to approach recommendation tasks. Along this line of research, this work aims to investigate the capacity of LLMs that act as the ranking model for recommender systems. To conduct our empirical study, we first formalize the recommendation problem as a conditional ranking task, considering sequential interaction histories as conditions and the items retrieved by the candidate generation model as candidates. We adopt a specific prompting approach to solving the ranking task by LLMs: we carefully design the prompting template by including the sequential interaction history, the candidate items, and the ranking instruction. We conduct extensive experiments on two widely-used datasets for recommender systems and derive several key findings for the use of LLMs in recommender systems. We show that LLMs have promising zero-shot ranking abilities, even competitive to or better than conventional recommendation models on candidates retrieved by multiple candidate generators. We also demonstrate that LLMs struggle to perceive the order of historical interactions and can be affected by biases like position bias, while these issues can be alleviated via specially designed prompting and bootstrapping strategies. The code to reproduce this work is available at https://github.com/RUCAIBox/LLMRank.	
最近，大型语言模型(LLM)(例如 GPT-4)展示了令人印象深刻的通用任务解决能力，包括接近推荐任务的潜力。沿着这条研究路线，这项工作旨在调查作为推荐系统排名模型的 LLM 的能力。为了进行实证研究，我们首先将推荐问题形式化为一个条件排序任务，将序贯交互历史作为条件，并将候选生成模型检索到的项目作为候选项。我们采用了一种特定的提示方法来解决 LLM 的排序问题: 我们仔细设计了提示模板，包括顺序交互历史，候选项，和排序指令。我们对推荐系统中两个广泛使用的数据集进行了广泛的实验，得出了在推荐系统中使用 LLM 的几个关键发现。我们证明了 LLM 具有良好的零拍排序能力，甚至比传统的推荐模型更有竞争力或更好的候选人由多个候选生成器检索。我们还证明 LLM 很难感知历史交互的次序，并且可能受到位置偏差等偏见的影响，而这些问题可以通过特别设计的激励和自举策略得到缓解。复制这项工作的代码可在 https://github.com/rucaibox/llmrank 找到。	code	3
Exploring Large Language Models and Hierarchical Frameworks for Classification of Large Unstructured Legal Documents	Nishchal Prasad, Mohand Boughanem, Taoufiq Dkaki		Legal judgment prediction suffers from the problem of long case documents that generally exceed tens of thousands of words and have a non-uniform structure. Predicting judgments from such documents becomes a challenging task, more so on documents with no structural annotation. We explore the classification of these large legal documents and their lack of structural information with a deep-learning-based hierarchical framework which we call MESc ("Multi-stage Encoder-based Supervised with-clustering") for judgment prediction. Specifically, we divide a document into parts to extract their embeddings from the last four layers of a custom fine-tuned Large Language Model, and try to approximate their structure through unsupervised clustering, which we use in another set of transformer encoder layers to learn the inter-chunk representations. We analyze the adaptability of Large Language Models (LLMs) with multi-billion parameters (GPT-Neo and GPT-J) with the hierarchical framework of MESc and compare them with their standalone performance on legal texts. We also study their intra-domain (legal) transfer learning capability and the impact of combining embeddings from their last layers in MESc. We test these methods and their effectiveness with extensive experiments and ablation studies on legal documents from India, the European Union, and the United States with the ILDC dataset and a subset of the LexGLUE dataset. Our approach achieves a minimum total performance gain of approximately 2 points over previous state-of-the-art methods.	
法律判决预测存在案件文书篇幅过长、文字过长、结构不统一等问题。从这些文档中预测判断是一项具有挑战性的任务，对于没有结构注释的文档更是如此。我们探讨了这些大型法律文件的分类以及它们缺乏结构信息的问题，采用了一种基于深度学习的层次结构框架，我们称之为 MESc，“基于多级编码器的聚类监督”，用于判断预测。具体来说，我们将一个文档分成几个部分，从定制的微调大语言模型的最后四层中提取它们的嵌入，并尝试通过无监督聚类来近似它们的结构。我们在另一组转换器编码器层中使用它来学习块间表示。本文采用 MESc 层次结构分析了具有数十亿参数(GPT-Neo 和 GPT-J)的大语言模型(LLM)的适应性，并与它们在法律文本中的独立性进行了比较。我们还研究了它们的域内(法律)迁移学习能力以及在 MESc 中结合它们最后一层的嵌入所产生的影响。我们使用 ILDC 数据集和 LexGLUE 数据集的子集对来自印度，欧盟和美国的法律文件进行广泛的实验和消融研究，以测试这些方法及其有效性。我们的方法比以前的最先进的方法获得了大约2点的最小总性能增益。	code	1
Overview of PAN 2024: Multi-author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification - Extended Abstract	Janek Bevendorff, Xavier Bonet Casals, Berta Chulvi, Daryna Dementieva, Ashaf Elnagar, Dayne Freitag, Maik Fröbe, Damir Korencic, Maximilian Mayerl, Animesh Mukherjee, Alexander Panchenko, Martin Potthast, Francisco Rangel, Paolo Rosso, Alisa Smirnova, Efstathios Stamatatos, Benno Stein, Mariona Taulé, Dmitry Ustalov, Matti Wiegmann, Eva Zangerle				code	1
Incorporating Query Recommendation for Improving In-Car Conversational Search	Md. Rashad Al Hasan Rony, Soumya Ranjan Sahoo, Abbas Goher Khan, Ken E. Friedl, Viju Sudhi, Christian Süß				code	0
ChatGPT Goes Shopping: LLMs Can Predict Relevance in eCommerce Search	Beatriz Soviero, Daniel Kuhn, Alexandre Salle, Viviane Pereira Moreira				code	0
Lottery4CVR: Neuron-Connection Level Sharing for Multi-task Learning in Video Conversion Rate Prediction	Xuanji Xiao, Jimmy Chen, Yuzhen Liu, Xing Yao, Pei Liu, Chaosheng Fan				code	0
Utilizing Low-Dimensional Molecular Embeddings for Rapid Chemical Similarity Search	Kathryn E. Kirchoff, James Wellnitz, Joshua E. Hochuli, Travis Maxfield, Konstantin I. Popov, Shawn M. Gomez, Alexander Tropsha	Eshelman School of Pharmacy, UNC Chapel Hill; Department of Pharmacology, UNC Chapel Hill; Department of Computer Science, UNC Chapel Hill	Nearest neighbor-based similarity searching is a common task in chemistry, with notable use cases in drug discovery. Yet, some of the most commonly used approaches for this task still leverage a brute-force approach. In practice this can be computationally costly and overly time-consuming, due in part to the sheer size of modern chemical databases. Previous computational advancements for this task have generally relied on improvements to hardware or dataset-specific tricks that lack generalizability. Approaches that leverage lower-complexity searching algorithms remain relatively underexplored. However, many of these algorithms are approximate solutions and/or struggle with typical high-dimensional chemical embeddings. Here we evaluate whether a combination of low-dimensional chemical embeddings and a k-d tree data structure can achieve fast nearest neighbor queries while maintaining performance on standard chemical similarity search benchmarks. We examine different dimensionality reductions of standard chemical embeddings as well as a learned, structurally-aware embedding (SmallSA) for this task. With this framework, searches on over one billion chemicals execute in less than a second on a single CPU core, five orders of magnitude faster than the brute-force approach. We also demonstrate that SmallSA achieves competitive performance on chemical similarity benchmarks.	
基于最近邻的相似性搜索是化学中的一个常见任务，在药物发现中有着显著的应用案例。然而，这项任务中一些最常用的方法仍然使用蛮力方法。在实践中，由于现代化学品数据库的庞大规模，这可能会造成计算成本高昂和过度耗时。此任务之前的计算改进通常依赖于对缺乏普遍性的硬件或数据集特定技巧的改进。利用低复杂度搜索算法的方法仍然相对缺乏探索。然而，许多这些算法是近似解决方案和/或与典型的高维化学嵌入斗争。在这里，我们评估是否结合低维化学嵌入和 k-d 树数据结构可以实现快速最近邻查询，同时保持标准化学相似性搜索基准的性能。我们考察了不同维度的标准化学嵌入降低以及一个学习，结构意识的嵌入-SmallSA-为这项任务。有了这个框架，超过十亿种化学物质的搜索在不到一秒钟的时间内在一个 CPU 核心上执行，比蛮力搜索数量级快5倍。我们亦证明 SmallSA 在化学相似性基准方面取得具竞争力的表现。	code	0
Evaluating the Impact of Content Deletion on Tabular Data Similarity and Retrieval Using Contextual Word Embeddings	Alberto Berenguer, David Tomás, JoseNorberto Mazón				code	0
RIGHT: Retrieval-Augmented Generation for Mainstream Hashtag Recommendation	RunZe Fan, Yixing Fan, Jiangui Chen, Jiafeng Guo, Ruqing Zhang, Xueqi Cheng		Automatic mainstream hashtag recommendation aims to accurately provide users with concise and popular topical hashtags before publication. Generally, mainstream hashtag recommendation faces challenges in the comprehensive difficulty of newly posted tweets in response to new topics, and the accurate identification of mainstream hashtags beyond semantic correctness. However, previous retrieval-based methods based on a fixed predefined mainstream hashtag list excel in producing mainstream hashtags, but fail to understand the constant flow of up-to-date information. Conversely, generation-based methods demonstrate a superior ability to comprehend newly posted tweets, but their capacity is constrained to identifying mainstream hashtags without additional features. Inspired by the recent success of the retrieval-augmented technique, in this work, we attempt to adopt this framework to combine the advantages of both approaches. Meantime, with the help of the generator component, we could rethink how to further improve the quality of the retriever component at a low cost. Therefore, we propose RetrIeval-augmented Generative Mainstream HashTag Recommender (RIGHT), which consists of three components: 1) a retriever seeks relevant hashtags from the entire tweet-hashtags set; 2) a selector enhances mainstream identification by introducing global signals; and 3) a generator incorporates input tweets and selected hashtags to directly generate the desired hashtags. The experimental results show that our method achieves significant improvements over state-of-the-art baselines. Moreover, RIGHT can be easily integrated into large language models, improving the performance of ChatGPT by more than 10%.	
自动主流话题标签推荐的目的是准确地为用户提供简洁和流行的话题标签发布前。一般来说，主流话题标签推荐面临的挑战包括: 新发布的推文在回应新话题方面的综合难度，以及在语义正确性之外对主流话题标签的准确识别。然而，以往基于固定预定义主流标签列表的检索方法在生成主流标签方面表现出色，但不能理解不断更新的信息流。相反，基于生成的方法展示了理解新发布的 tweet 的优越能力，但它们的能力仅限于识别主流标签，而没有其他特性。受近年来检索增强技术的成功启发，本文尝试采用这一框架将两种方法的优点结合起来。同时，借助于发生器组件，我们可以重新思考如何以较低的成本进一步提高检索器组件的质量。因此，我们提出了 RetrIeval 增强的生成主流 HashTag 推荐器(RIGHT) ，它由三个组成部分组成: 1)检索器从整个 tweet-HashTag 集中寻找相关的 HashTag; 2)选择器通过引入全局信号增强主流识别; 3)生成器结合输入 tweet 和选定的 HashTag 直接生成所需的 HashTag。实验结果表明，我们的方法取得了显着的改进，在最先进的基线。此外，可以很容易地将权限集成到大型语言模型中，使 ChatGPT 的性能提高10% 以上。	code	0
Exploring the Nexus Between Retrievability and Query Generation Strategies	Aman Sinha, Priyanshu Raj Mall, Dwaipayan Roy		Quantifying bias in retrieval functions through document retrievability scores is vital for assessing recall-oriented retrieval systems. However, many studies investigating retrieval model bias lack validation of their query generation methods as accurate representations of retrievability for real users and their queries. This limitation results from the absence of established criteria for query generation in retrievability assessments. Typically, researchers resort to using frequent collocations from document corpora when no query log is available. In this study, we address the issue of reproducibility and seek to validate query generation methods by comparing retrievability scores generated from artificially generated queries to those derived from query logs. Our findings demonstrate a minimal or negligible correlation between retrievability scores from artificial queries and those from query logs. This suggests that artificially generated queries may not accurately reflect retrievability scores as derived from query logs. We further explore alternative query generation techniques, uncovering a variation that exhibits the highest correlation. This alternative approach holds promise for improving reproducibility when query logs are unavailable.	通过文档可检索性评分量化检索功能中的偏差对于评估面向回忆的检索系统至关重要。然而，许多研究检索模型偏倚的研究缺乏验证其查询生成方法作为准确表示的可检索性的真实用户和他们的查询。这种局限性是由于在可检索性评估中缺乏确定的查询生成标准造成的。通常，当没有查询日志可用时，研究人员会使用文档语料库中的频繁搭配。在这项研究中，我们解决了重复性的问题，并寻求验证查询生成方法，通过比较从人工生成的查询和从查询日志得到的查询可检索性得分。我们的研究结果表明，人工查询和查询日志的可检索性得分之间的相关性很小，甚至可以忽略不计。这表明人工生成的查询可能不能准确地反映从查询日志中获得的可检索性得分。我们进一步探索替代的查询生成技术，发现具有最高相关性的变体。这种替代方法有望在查询日志不可用时提高可重复性。	code	0
GLAD: Graph-Based Long-Term Attentive Dynamic Memory for Sequential Recommendation	Deepanshu Pandey, Arindam Sarkar, Prakash Mandayam Comar				code	0
BertPE: A BERT-Based Pre-retrieval Estimator for Query Performance Prediction	Maryam Khodabakhsh, Fattane Zarrinkalam, Negar Arabzadeh				code	0
Estimating the Usefulness of Clarifying Questions and Answers for Conversational Search	Ivan Sekulic, Weronika Lajewska, Krisztian Balog, Fabio Crestani		While the body of research directed towards constructing and generating clarifying questions in mixed-initiative conversational search systems is vast, research aimed at processing and comprehending users' answers to such questions is scarce. To this end, we present a simple yet effective method for processing answers to clarifying questions, moving away from previous work that simply appends answers to the original query and thus potentially degrades retrieval performance. Specifically, we propose a classifier for assessing usefulness of the prompted clarifying question and an answer given by the user. Useful questions or answers are further appended to the conversation history and passed to a transformer-based query rewriting module. Results demonstrate significant improvements over strong non-mixed-initiative baselines. Furthermore, the proposed approach mitigates the performance drops when non useful questions and answers are utilized.	尽管在混合主动会话搜索系统中，针对构建和生成澄清问题的研究机构非常庞大，但是针对处理和理解用户对这些问题的回答的研究却很少。为此，我们提出了一种简单而有效的方法，用于处理澄清问题的答案，避免了以前的工作，即只是将答案附加到原始查询，从而可能降低检索性能。具体来说，我们提出了一个分类器，用于评估提示的澄清问题和用户给出的答案的有用性。有用的问题或答案将进一步附加到会话历史中，并传递给基于转换器的查询重写模块。结果显示，与强大的非混合倡议基线相比，有了显著改善。此外，当使用非有用的问题和答案时，提出的方法可以减少性能下降。	code	0
Measuring Bias in Search Results Through Retrieval List Comparison	Linda Ratz, Markus Schedl, Simone Kopeinik, Navid Rekabsaz				code	0
Cascading Ranking Pipelines for Sensitivity-Aware Search	Jack McKechnie				code	0
Advancing Multimedia Retrieval in Medical, Social Media and Content Recommendation Applications with ImageCLEF 2024	Bogdan Ionescu, Henning Müller, AnaMaria Claudia Dragulinescu, Ahmad IdrissiYaghir, Ahmedkhan Radzhabov, Alba Garcia Seco de Herrera, Alexandra Andrei, Alexandru Stan, Andrea M. Storås, Asma Ben Abacha, Benjamin Lecouteux, Benno Stein, Cécile Macaire, Christoph M. Friedrich, Cynthia S. Schmidt, Didier Schwab, Emmanuelle EsperançaRodier, George Ioannidis, Griffin Adams, Henning Schäfer, Hugo Manguinhas, Ioan Coman, Johanna Schöler, Johannes Kiesel, Johannes Rückert, Louise Bloch, Martin Potthast, Maximilian Heinrich, Meliha Yetisgen, Michael A. Riegler, Neal Snider, Pål Halvorsen, Raphael Brüngel, Steven Alexander Hicks, Vajira Thambawita, Vassili Kovalev, Yuri Prokopchuk, Wenwai Yim				code	0
Ranking Heterogeneous Search Result Pages Using the Interactive Probability Ranking Principle	Kanaad Pathak, Leif Azzopardi, Martin Halvey		The Probability Ranking Principle (PRP) ranks search results based on their expected utility derived solely from document contents, often overlooking the nuances of presentation and user interaction. However, with the evolution of Search Engine Result Pages (SERPs), now comprising a variety of result cards, the manner in which these results are presented is pivotal in influencing user engagement and satisfaction. This shift prompts the question: How do the PRP and its user-centric counterpart, the Interactive Probability Ranking Principle (iPRP), compare in the context of these heterogeneous SERPs? Our study draws a comparison between the PRP and the iPRP, revealing significant differences in their output. The iPRP, accounting for item-specific costs and interaction probabilities to determine the “Expected Perceived Utility” (EPU), yields different result orderings compared to the PRP. We evaluate the effect of the EPU on the ordering of results by observing changes in the ranking within a heterogeneous SERP compared to the traditional “ten blue links”. We find that changing the presentation affects the ranking of items according to the iPRP by up to 48% (with respect to DCG, TBG and RBO) in ad-hoc search tasks on the TREC WaPo Collection. This work suggests that the iPRP should be employed when ranking heterogeneous SERPs to provide a user-centric ranking that adapts the ordering based on the presentation and user engagement.	
概率排序原则(PRP)根据搜索结果的预期效用来排序，这些效用完全来自文档内容，往往忽略了表示和用户交互的细微差别。然而，随着搜索引擎结果页面(SERP)的发展，现在包含了各种各样的结果卡，这些结果的呈现方式对于影响用户的参与度和满意度是至关重要的。这种转变提出了一个问题: PRP 和它的以用户为中心的对应物，交互式概率排序原则(iPRP) ，如何在这些异构 SERP 的上下文中进行比较？我们的研究对 PRP 和 iPRP 进行了比较，发现它们的输出存在显著差异。IPRP 考虑了项目特定成本和交互概率，以确定“预期感知效用”(EPU) ，与 PRP 相比产生了不同的结果排序。我们通过观察一个异构 SERP 中的排序变化来评估 EPU 对结果排序的影响，与传统的“十个蓝色链接”相比。我们发现，在 TREC WaPo 集合的特别搜索任务中，根据(iPRP)改变表示影响项目的排名高达48% (相对于 DCG，TBG 和 RBO)。这项工作表明，iPRP 应该被用来排名异构的 SERP 时，提供一个以用户为中心的排名，适应排序的基础上的表示和用户参与。	code	0
Query Exposure Prediction for Groups of Documents in Rankings	Thomas Jänich, Graham McDonald, Iadh Ounis		The main objective of an Information Retrieval system is to provide a user with the most relevant documents to the user's query. To do this, modern IR systems typically deploy a re-ranking pipeline in which a set of documents is retrieved by a lightweight first-stage retrieval process and then re-ranked by a more effective but expensive model. However, the success of a re-ranking pipeline is heavily dependent on the performance of the first stage retrieval, since new documents are not usually identified during the re-ranking stage. Moreover, this can impact the amount of exposure that a particular group of documents, such as documents from a particular demographic group, can receive in the final ranking. For example, the fair allocation of exposure becomes more challenging or impossible if the first stage retrieval returns too few documents from certain groups, since the number of group documents in the ranking affects the exposure more than the documents' positions. With this in mind, it is beneficial to predict the amount of exposure that a group of documents is likely to receive in the results of the first stage retrieval process, in order to ensure that there are a sufficient number of documents included from each of the groups. In this paper, we introduce the novel task of query exposure prediction (QEP). Specifically, we propose the first approach for predicting the distribution of exposure that groups of documents will receive for a given query. Our new approach, called GEP, uses lexical information from individual groups of documents to estimate the exposure the groups will receive in a ranking. Our experiments on the TREC 2021 and 2022 Fair Ranking Track test collections show that our proposed GEP approach results in exposure predictions that are up to 40% more accurate than the predictions of adapted existing query performance prediction and resource allocation approaches.	
信息检索系统的主要目的是向用户提供与其查询最相关的文件。为了做到这一点，现代 IR 系统通常部署一个重新排序的管道，其中一组文档通过轻量级的第一阶段检索过程检索，然后通过一个更有效但昂贵的模型重新排序。然而，重新排序管道的成功与否在很大程度上取决于第一阶段检索的性能，因为在重新排序阶段通常不能确定新文档。此外，这可能会影响特定文档组(如来自特定人口组的文档)在最终排名中可以接受的曝光量。例如，如果第一阶段检索从某些群组返回的文档太少，则公平分配曝光变得更具挑战性或不可能，因为排名中群组文档的数量比文档的位置更能影响曝光。考虑到这一点，有益的做法是预测一组文件在第一阶段检索过程的结果中可能接触的数量，以确保每组文件都有足够的数量。本文介绍了一种新的查询暴露预测(QEP)任务。具体来说，我们提出了第一种方法，用于预测给定查询将接收到的文档组的曝光分布。我们的新方法被称为 GEP，它使用来自单个文档组的词汇信息来估计这些组在一个排名中将接收到的信息。我们在 TREC 2021和2022公平排名跟踪测试集合上的实验表明，我们提出的 GEP 方法导致暴露预测，这是多达40种适应现有查询性能预测和资源分配方法。	code	0
Investigating the Robustness of Sequential Recommender Systems Against Training Data Perturbations	Filippo Betello, Federico Siciliano, Pushkar Mishra, Fabrizio Silvestri		Sequential Recommender Systems (SRSs) have been widely used to model user behavior over time, but their robustness in the face of perturbations to training data is a critical issue. In this paper, we conduct an empirical study to investigate the effects of removing items at different positions within a temporally ordered sequence. We evaluate two different SRS models on multiple datasets, measuring their performance using Normalized Discounted Cumulative Gain (NDCG) and Rank Sensitivity List metrics. Our results demonstrate that removing items at the end of the sequence significantly impacts performance, with NDCG decreasing up to 60%, while removing items from the beginning or middle has no significant effect. These findings highlight the importance of considering the position of the perturbed items in the training data and shall inform the design of more robust SRSs.	随着时间的推移，序贯推荐系统(SRS)已被广泛用于模拟用户行为，但是它们在训练数据受到干扰时的鲁棒性是一个关键问题。在本文中，我们进行了一个实证研究，以探讨删除项目在不同位置的时间顺序的影响。我们在多个数据集上评估两种不同的 SRS 模型，使用归一化贴现累积增益(NDCG)和秩敏感性列表度量衡量它们的性能。结果表明: 去除序列末端的项目对性能有显著影响，NDCG 下降幅度达60% ，而去除序列开头或中间的项目对性能无显著影响。这些发现强调了考虑受干扰项目在训练数据中的位置的重要性，并将为设计更强健的战略参考系提供信息。	code	0
Conversational Search with Tail Entities	Hai Dang Tran, Andrew Yates, Gerhard Weikum				code	0
Event-Specific Document Ranking Through Multi-stage Query Expansion Using an Event Knowledge Graph	Sara Abdollahi, Tin Kuculo, Simon Gottschalk				code	0
Simulating Follow-Up Questions in Conversational Search	Johannes Kiesel, Marcel Gohsen, Nailia Mirzakhmedova, Matthias Hagen, Benno Stein				code	0
MOReGIn: Multi-Objective Recommendation at the Global and Individual Levels	Elizabeth Gómez, David Contreras, Ludovico Boratto, Maria Salamó		Multi-Objective Recommender Systems (MORSs) emerged as a paradigm to guarantee multiple (often conflicting) goals. Besides accuracy, a MORS can operate at the global level, where additional beyond-accuracy goals are met for the system as a whole, or at the individual level, meaning that the recommendations are tailored to the needs of each user. The state-of-the-art MORSs either operate at the global or individual level, without assuming the co-existence of the two perspectives. In this study, we show that when global and individual objectives co-exist, MORSs are not able to meet both types of goals. To overcome this issue, we present an approach that regulates the recommendation lists so as to guarantee both global and individual perspectives, while preserving its effectiveness. Specifically, as individual perspective, we tackle genre calibration and, as global perspective, provider fairness. We validate our approach on two real-world datasets, publicly released with this paper.	多目标推荐系统(MORS)作为一种范式出现，以保证多个(经常相互冲突)的目标。除了准确性之外，一个 MORS 还可以在全球一级运作，在这一级可以为整个系统或在个人一级实现额外的超准确性目标，这意味着建议是根据每个用户的需要量身定制的。最先进的监测系统既可以在全球一级运作，也可以在个人一级运作，而不必假设这两种观点并存。在这项研究中，我们发现当全球目标和个人目标共存时，MORS 不能同时满足这两种目标。为了解决这一问题，我们提出了一种管理建议清单的办法，以保证全球和个人的观点，同时保持其有效性。具体来说，作为个人的角度，我们处理体裁校准和作为全球的角度，提供者的公平性。我们验证了我们的方法在两个真实世界的数据集，公开发布与本文。	code	0
VEMO: A Versatile Elastic Multi-modal Model for Search-Oriented Multi-task Learning	Nanyi Fei, Hao Jiang, Haoyu Lu, Jinqiang Long, Yanqi Dai, Tuo Fan, Zhao Cao, Zhiwu Lu				code	0
Lightweight Modality Adaptation to Sequential Recommendation via Correlation Supervision	Hengchang Hu, Qijiong Liu, Chuang Li, MinYen Kan		In Sequential Recommenders (SR), encoding and utilizing modalities in an end-to-end manner is costly in terms of modality encoder sizes. Two-stage approaches can mitigate such concerns, but they suffer from poor performance due to modality forgetting, where the sequential objective overshadows modality representation. We propose a lightweight knowledge distillation solution that preserves both merits: retaining modality information and maintaining high efficiency. Specifically, we introduce a novel method that enhances the learning of embeddings in SR through the supervision of modality correlations. The supervision signals are distilled from the original modality representations, including both (1) holistic correlations, which quantify their overall associations, and (2) dissected correlation types, which refine their relationship facets (honing in on specific aspects like color or shape consistency). To further address the issue of modality forgetting, we propose an asynchronous learning step, allowing the original information to be retained longer for training the representation learning module. Our approach is compatible with various backbone architectures and outperforms the top baselines by 6.8% on average. We empirically demonstrate that preserving original feature associations from modality encoders significantly boosts task-specific recommendation adaptation. Additionally, we find that larger modality encoders (e.g., Large Language Models) contain richer feature sets which necessitate more fine-grained modeling to reach their full performance potential.	
在序列推荐器(SR)中，以端到端的方式编码和利用模式在模式编码器大小方面是昂贵的。两阶段方法可以减轻这种担忧，但是由于情态遗忘，它们的表现很差，其中连续的目标掩盖了情态表示。提出了一种轻量级的知识提取方法，该方法既保留了模态信息，又保持了高效率。具体来说，我们提出了一种新的方法，通过监督情态相关性来提高嵌入的学习效果。监督信号是从原始的模态表示中提取出来的，包括(1)量化其总体关联的整体相关性和(2)剖析的相关类型，这些相关类型细化了它们的关系方面(在特定方面如颜色或形状一致性上打磨)。为了进一步解决模态遗忘问题，我们提出了一个异步学习步骤，允许原始信息保留更长的时间来训练表征学习模块。我们的方法与各种骨干架构兼容，并优于最高基线6.8原始特征关联的形式编码器显着提高任务特定的推荐适应性。此外，我们发现较大的模态编码器(例如，大型语言模型)包含更丰富的特征集，这需要更细粒度的建模来达到其全部性能潜力。	code	0
DREQ: Document Re-ranking Using Entity-Based Query Understanding	Shubham Chatterjee, Iain Mackie, Jeff Dalton		While entity-oriented neural IR models have advanced significantly, they often overlook a key nuance: the varying degrees of influence individual entities within a document have on its overall relevance. Addressing this gap, we present DREQ, an entity-oriented dense document re-ranking model. Uniquely, we emphasize the query-relevant entities within a document's representation while simultaneously attenuating the less relevant ones, thus obtaining a query-specific entity-centric document representation. We then combine this entity-centric document representation with the text-centric representation of the document to obtain a "hybrid" representation of the document. We learn a relevance score for the document using this hybrid representation. Using four large-scale benchmarks, we show that DREQ outperforms state-of-the-art neural and non-neural re-ranking methods, highlighting the effectiveness of our entity-oriented representation approach.	尽管面向实体的神经 IR 模型已经取得了显著的进步，但它们往往忽略了一个关键的细微差别: 文档中各个实体对其总体相关性的不同程度的影响。针对这一差距，我们提出了面向实体的密集文档重排序模型 DREQ。独特的是，我们强调文档表示中的查询相关实体，同时减弱相关性较差的实体，从而获得一个特定于查询的以实体为中心的文档表示。然后，我们将这种以实体为中心的文档表示与以文本为中心的文档表示结合起来，以获得文档的“混合”表示。我们使用这种混合表示学习文档的相关性得分。使用四个大规模的基准测试，我们表明 DREQ 优于最先进的神经元和非神经元重新排序方法，突出了我们的面向实体的表示方法的有效性。	code	0
Beyond Topicality: Including Multidimensional Relevance in Cross-encoder Re-ranking - The Health Misinformation Case Study	Rishabh Upadhyay, Arian Askari, Gabriella Pasi, Marco Viviani				code	0
Query Obfuscation for Information Retrieval Through Differential Privacy	Guglielmo Faggioli, Nicola Ferro				code	0
On-Device Query Auto-completion for Email Search	Yifan Qiao, Otto Godwin, Hua Ouyang		Traditional query auto-completion (QAC) relies heavily on search logs collected over many users. However, in on-device email search, the scarcity of logs and the governing privacy constraints make QAC a challenging task. In this work, we propose an on-device QAC method that runs directly on users’ devices, where users’ sensitive data and interaction logs are not collected, shared, or aggregated through web services. This method retrieves candidates using pseudo relevance feedback, and ranks them based on relevance signals that explore the textual and structural information from users’ emails. We also propose a private corpora based evaluation method, and empirically demonstrate the effectiveness of our proposed method.	传统的查询自动完成(QAC)在很大程度上依赖于多个用户收集的搜索日志。然而，在设备上的电子邮件搜索，日志的稀缺性和管理隐私的约束使 QAC 一个具有挑战性的任务。在这项工作中，我们提出了一个在设备上的 QAC 方法，直接运行在用户的设备上，其中用户的敏感数据和交互日志不收集，共享，或通过 Web 服务聚合。这种方法使用伪关联反馈检索候选人，并根据相关信号对他们进行排名，这些相关信号探索用户电子邮件的文本和结构信息。我们还提出了一种基于私人语料库的评价方法，并通过实例验证了该方法的有效性。	code	0
Does the Performance of Text-to-Image Retrieval Models Generalize Beyond Captions-as-a-Query?	Juan Manuel Rodriguez, Nima Tavassoli, Eliezer Levy, Gil Lederman, Dima Sivov, Matteo Lissandrini, Davide Mottin				code	0
Query Generation Using Large Language Models - A Reproducibility Study of Unsupervised Passage Reranking	David Rau, Jaap Kamps				code	0
Ranking Distance Metric for Privacy Budget in Distributed Learning of Finite Embedding Data	Georgios Papadopoulos, Yash Satsangi, Shaltiel Eloul, Marco Pistoia				code	0
Effective Adhoc Retrieval Through Traversal of a Query-Document Graph	Erlend Frayling, Sean MacAvaney, Craig Macdonald, Iadh Ounis				code	0
MMCRec: Towards Multi-modal Generative AI in Conversational Recommendation	Tendai Mukande, Esraa Ali, Annalina Caputo, Ruihai Dong, Noel E. O'Connor				code	0
Federated Conversational Recommender Systems	Allen Lin, Jianling Wang, Ziwei Zhu, James Caverlee				code	0
Improving Exposure Allocation in Rankings by Query Generation	Thomas Jänich, Graham McDonald, Iadh Ounis				code	0
KnowFIRES: A Knowledge-Graph Framework for Interpreting Retrieved Entities from Search	Negar Arabzadeh, Kiarash Golzadeh, Christopher Risi, Charles L. A. Clarke, Jian Zhao				code	0
A Conversational Search Framework for Multimedia Archives	Anastasia Potyagalova, Gareth J. F. Jones				code	0
Effective and Efficient Transformer Models for Sequential Recommendation	Aleksandr V. Petrov				code	0
Quantum Computing for Information Retrieval and Recommender Systems	Maurizio Ferrari Dacrema, Andrea Pasin, Paolo Cremonesi, Nicola Ferro				code	0
Transformers for Sequential Recommendation	Aleksandr V. Petrov, Craig Macdonald	Ocean University of China, Qingdao, China; National University of Singapore, Singapore, Singapore; Wuhan University, Wuhan, China; University of Hong Kong, Hong Kong, China	Learning dynamic user preference has become an increasingly important component for many online platforms (e.g., video-sharing sites, e-commerce systems) to make sequential recommendations. Previous works have made many efforts to model item-item transitions over user interaction sequences, based on various architectures, e.g., recurrent neural networks and self-attention mechanism. Recently emerged graph neural networks also serve as useful backbone models to capture item dependencies in sequential recommendation scenarios. Despite their effectiveness, existing methods have far focused on item sequence representation with singular type of interactions, and thus are limited to capture dynamic heterogeneous relational structures between users and items (e.g., page view, add-to-favorite, purchase). To tackle this challenge, we design a Multi-Behavior Hypergraph-enhanced Transformer framework (MBHT) to capture both short-term and long-term cross-type behavior dependencies. Specifically, a multi-scale Transformer is equipped with low-rank self-attention to jointly encode behavior-aware sequential patterns from fine-grained and coarse-grained levels. Additionally, we incorporate the global multi-behavior dependency into the hypergraph neural architecture to capture the hierarchical long-range item correlations in a customized manner. Experimental results demonstrate the superiority of our MBHT over various state-of-the-art recommendation solutions across different settings. Further ablation studies validate the effectiveness of our model design and benefits of the new MBHT framework. Our implementation code is released at: https://github.com/yuh-yang/MBHT-KDD22.	
学习动态用户偏好已经成为许多在线平台(如视频分享网站、电子商务系统)提供连续推荐的一个越来越重要的组成部分。以往的研究基于多种体系结构，如递归神经网络和自我注意机制，对用户交互序列上的项目-项目转换进行了大量的研究。最近出现的图形神经网络也可以作为有用的骨干模型，以捕获项目依赖的顺序推荐场景。尽管现有的方法很有效，但是现有的方法都集中在单一交互类型的项目序列表示上，因此仅限于捕获用户和项目之间的动态异构关系结构(例如，页面查看、添加到收藏夹、购买)。为了应对这一挑战，我们设计了一个多行为超图增强型 T 变换器框架(MBHT)来捕获短期和长期的跨类型行为依赖。具体而言，多尺度变压器配备低级自注意，以从细粒度和粗粒度级别联合编码行为感知的序列模式。此外，我们将全局多行为依赖引入到超图神经结构中，以自定义的方式获取层次化的长期项目相关性。实验结果表明，我们的 MBHT 优于不同设置的各种最先进的推荐解决方案。进一步的消融研究验证了我们的模型设计的有效性和新的 MBHT 框架的好处。我们的实施代码在以下 https://github.com/yuh-yang/mbht-kdd22发布:。	code	0
Context-Aware Query Term Difficulty Estimation for Performance Prediction	Abbas Saleminezhad, Negar Arabzadeh, Soosan Beheshti, Ebrahim Bagheri				code	0
Navigating the Thin Line: Examining User Behavior in Search to Detect Engagement and Backfire Effects	Federico Maria Cau, Nava Tintarev		Opinionated users often seek information that aligns with their preexisting beliefs while dismissing contradictory evidence due to confirmation bias. This conduct hinders their ability to consider alternative stances when searching the web. Despite this, few studies have analyzed how the diversification of search results on disputed topics influences the search behavior of highly opinionated users. To this end, we present a preregistered user study (n = 257) investigating whether different levels (low and high) of bias metrics and search results presentation (with or without AI-predicted stance labels) can affect the stance diversity consumption and search behavior of opinionated users on three debated topics (i.e., atheism, intellectual property rights, and school uniforms). Our results show that exposing participants to (counter-attitudinally) biased search results increases their consumption of attitude-opposing content, but we also found that bias was associated with a trend toward overall fewer interactions within the search page. We also found that 19% of participants did not interact with any search results. When we removed these participants in a post-hoc analysis, we found that stance labels increased the diversity of stances consumed by users, particularly when the search results were biased. Our findings highlight the need for future research to explore distinct search scenario settings to gain insight into opinionated users' behavior.	
固执己见的用户往往寻求与他们先前存在的信念相一致的信息，而由于确认偏见而排除相互矛盾的证据。这种行为妨碍了他们在搜索网页时考虑其他立场的能力。尽管如此，很少有研究分析有争议话题的搜索结果的多样化如何影响高度固执己见的用户的搜索行为。为此，我们提出了一项预先注册的用户研究(n = 257) ，调查不同水平(低和高)的偏倚指标和搜索结果表示(有或没有 AI 预测的立场标签)是否会影响立场多样性消费和搜索行为有意见的用户在三个有争议的话题(即无神论，知识产权和校服)。我们的研究结果显示，将参与者暴露在(反态度的)有偏见的搜索结果中，会增加他们对与态度相反的内容的消费，但是我们也发现，偏见与搜索页面内的整体互动减少的趋势有关。我们还发现19任何搜索结果。当我们在一个事后比较中移除这些参与者时，我们发现立场标签增加了用户使用的立场的多样性，特别是当搜索结果有偏见时。我们的研究结果强调了未来研究探索不同搜索场景设置的必要性，以深入了解固执己见的用户的行为。	code	0
Measuring Bias in a Ranked List Using Term-Based Representations	Amin Abolghasemi, Leif Azzopardi, Arian Askari, Maarten de Rijke, Suzan Verberne		In most recent studies, gender bias in document ranking is evaluated with the NFaiRR metric, which measures bias in a ranked list based on an aggregation over the unbiasedness scores of each ranked document. This perspective in measuring the bias of a ranked list has a key limitation: individual documents of a ranked list might be biased while the ranked list as a whole balances the groups' representations. To address this issue, we propose a novel metric called TExFAIR (term exposure-based fairness), which is based on two new extensions to a generic fairness evaluation framework, attention-weighted ranking fairness (AWRF). TExFAIR assesses fairness based on the term-based representation of groups in a ranked list: (i) an explicit definition of associating documents to groups based on probabilistic term-level associations, and (ii) a rank-biased discounting factor (RBDF) for counting non-representative documents towards the measurement of the fairness of a ranked list. We assess TExFAIR on the task of measuring gender bias in passage ranking, and study the relationship between TExFAIR and NFaiRR. Our experiments show that there is no strong correlation between TExFAIR and NFaiRR, which indicates that TExFAIR measures a different dimension of fairness than NFaiRR. With TExFAIR, we extend the AWRF framework to allow for the evaluation of fairness in settings with term-based representations of groups in documents in a ranked list.	
在最近的大多数研究中，文档排名中的性别偏见是通过 NFaiRR 度量来评估的，该度量基于每个排名文档的无偏评分的聚合来衡量排名列表中的偏见。这种测量排名表偏差的视角有一个关键的局限性: 排名表的个别文档可能有偏差，而排名表作为一个整体平衡各组的表示。为了解决这个问题，我们提出了一种新的度量方法 TExFAIR (术语暴露公平性) ，它基于通用公平性评估框架的两个新的扩展，即注意力加权排序公平性(AWRF)。TExFAIR 基于排名列表中基于术语的群体表示来评估公平性: (i)基于概率术语水平关联的关联文档与群体的明确定义，以及(ii)用于计数非代表性文档的排名折扣因子(RBDF)对排名列表的公平性进行测量。我们通过测量文章排序中的性别偏见来评估 TExFAIR，并研究 TExFAIR 和 NFaiRR 之间的关系。我们的实验表明，TExFAIR 和 NFaiRR 之间没有很强的相关性，这表明 TExFAIR 测量的公平性维度不同于 NFaiRR。通过 TExFAIR，我们扩展了 AWRF 框架，允许在排名列表中的文档中使用基于术语的群组表示来评估设置中的公平性。	code	0
Translate-Distill: Learning Cross-Language Dense Retrieval by Translation and Distillation	Eugene Yang, Dawn J. Lawrie, James Mayfield, Douglas W. Oard, Scott Miller		Prior work on English monolingual retrieval has shown that a cross-encoder trained using a large number of relevance judgments for query-document pairs can be used as a teacher to train more efficient, but similarly effective, dual-encoder student models. Applying a similar knowledge distillation approach to training an efficient dual-encoder model for Cross-Language Information Retrieval (CLIR), where queries and documents are in different languages, is challenging due to the lack of a sufficiently large training collection when the query and document languages differ. The state of the art for CLIR thus relies on translating queries, documents, or both from the large English MS MARCO training set, an approach called Translate-Train. This paper proposes an alternative, Translate-Distill, in which knowledge distillation from either a monolingual cross-encoder or a CLIR cross-encoder is used to train a dual-encoder CLIR student model. This richer design space enables the teacher model to perform inference in an optimized setting, while training the student model directly for CLIR. Trained models and artifacts are publicly available on Huggingface.	先前关于英语单语检索的工作已经表明，使用大量的查询文档对相关性判断训练的交叉编码器可以作为教师来训练更有效但同样有效的双编码器学生模型。应用类似的知识提取方法来训练一个有效的双跨语检索编码器模型(CLIR) ，其中查询和文档使用不同的语言，这是一个挑战，因为当查询和文档语言不同时，缺乏足够大训练集合。因此，CLIR 的技术状态依赖于翻译查询、文档，或者两者都来自大型英文 MS MARCO 训练集，这种方法称为 Translate-Train。本文提出了一种翻译-提取的方法，利用从单语交叉编码器或 CLIR 交叉编码器中提取的知识来训练双语交叉编码器的学生模型。这个更丰富的设计空间使得教师模型能够在一个优化的设置中执行推理，同时直接为 CLIR 培训学生模型。受过训练的模型和工件可以在 Huggingface 上公开获得。	code	0
DESIRE-ME: Domain-Enhanced Supervised Information Retrieval Using Mixture-of-Experts	Pranav Kasela, Gabriella Pasi, Raffaele Perego, Nicola Tonellotto		Open-domain question answering requires retrieval systems able to cope with the diverse and varied nature of questions, providing accurate answers across a broad spectrum of query types and topics. To deal with such topic heterogeneity through a unique model, we propose DESIRE-ME, a neural information retrieval model that leverages the Mixture-of-Experts framework to combine multiple specialized neural models. We rely on Wikipedia data to train an effective neural gating mechanism that classifies the incoming query and that weighs the predictions of the different domain-specific experts correspondingly. This allows DESIRE-ME to specialize adaptively in multiple domains. Through extensive experiments on publicly available datasets, we show that our proposal can effectively generalize domain-enhanced neural models. DESIRE-ME excels in handling open-domain questions adaptively, boosting by up to 12 22	开放领域的问题回答要求检索系统能够处理各种各样的问题，提供准确的答案跨广泛的查询类型和主题。为了通过一个独特的模型来处理这样的话题异质性，我们提出了 DESIRE-ME，一个神经信息检索模型，它利用专家混合框架来结合多个专门的神经模型。我们依靠 Wikipedia 数据来训练一种有效的神经门控机制，该机制对传入的查询进行分类，并相应地权衡不同领域专家的预测。这使得 DESIRE-ME 可以自适应地专门处理多个域。通过在公开数据集上的大量实验，我们表明我们的方案可以有效地推广领域增强的神经模型。DESIRE-ME 擅长于自适应地处理开放领域的问题，最多可提高12	code	0
A Deep Learning Approach for Selective Relevance Feedback	Suchana Datta, Debasis Ganguly, Sean MacAvaney, Derek Greene		Pseudo-relevance feedback (PRF) can enhance average retrieval effectiveness over a sufficiently large number of queries. However, PRF often introduces a drift into the original information need, thus hurting the retrieval effectiveness of several queries. While a selective application of PRF can potentially alleviate this issue, previous approaches have largely relied on unsupervised or feature-based learning to determine whether a query should be expanded. In contrast, we revisit the problem of selective PRF from a deep learning perspective, presenting a model that is entirely data-driven and trained in an end-to-end manner. The proposed model leverages a transformer-based bi-encoder architecture. Additionally, to further improve retrieval effectiveness with this selective PRF approach, we make use of the model's confidence estimates to combine the information from the original and expanded queries. In our experiments, we apply this selective feedback on a number of different combinations of ranking and feedback models, and show that our proposed approach consistently improves retrieval effectiveness for both sparse and dense ranking models, with the feedback models being either sparse, dense or generative.	伪相关反馈(PRF)可以提高对足够大数量查询的平均检索效率。然而，PRF 常常引入对原始信息需求的漂移，从而影响了多个查询的检索效率。尽管 PRF 的选择性应用有可能缓解这一问题，但以前的方法在很大程度上依赖于无监督或基于特征的学习来确定是否应该扩展查询。相比之下，我们从深度学习的角度重新审视选择性 PRF 的问题，提出了一个完全由数据驱动并以端到端方式进行训练的模型。该模型利用了基于变压器的双编码器结构。此外，为了进一步提高这种选择性 PRF 方法的检索效率，我们利用模型的置信度估计来组合来自原始和扩展查询的信息。在我们的实验中，我们将这种选择性反馈应用于许多不同的排序和反馈模型组合，并且表明我们提出的方法始终如一地提高了稀疏和密集排序模型的检索效率，反馈模型要么是稀疏的，要么是密集的，要么是生成的。	code	0
Self Contrastive Learning for Session-Based Recommendation	Zhengxiang Shi, Xi Wang, Aldo Lipani		Session-based recommendation, which aims to predict the next item of users' interest as per an existing sequence interaction of items, has attracted growing applications of Contrastive Learning (CL) with improved user and item representations. However, these contrastive objectives: (1) serve a similar role as the cross-entropy loss while ignoring the item representation space optimisation; and (2) commonly require complicated modelling, including complex positive/negative sample constructions and extra data augmentation. In this work, we introduce Self-Contrastive Learning (SCL), which simplifies the application of CL and enhances the performance of state-of-the-art CL-based recommendation techniques. Specifically, SCL is formulated as an objective function that directly promotes a uniform distribution among item representations and efficiently replaces all the existing contrastive objective components of state-of-the-art models. Unlike previous works, SCL eliminates the need for any positive/negative sample construction or data augmentation, leading to enhanced interpretability of the item representation space and facilitating its extensibility to existing recommender systems. Through experiments on three benchmark datasets, we demonstrate that SCL consistently improves the performance of state-of-the-art models with statistical significance. Notably, our experiments show that SCL improves the performance of two best-performing models by 8.2% and 9.5% in P@10 (Precision) and 9.9% and 11.2% in MRR@10 (Mean Reciprocal Rank) on average across different benchmarks. Additionally, our analysis elucidates the improvement in terms of alignment and uniformity of representations, as well as the effectiveness of SCL with a low computational cost.	
基于会话的推荐，旨在根据已有的项目序列交互预测用户的下一个兴趣项目，已经吸引了越来越多的应用对比学习(CL)与改进的用户和项目表示。然而，这些对比的目标: (1)服务于类似的作用作为交叉熵损失，而忽略项目表示空间优化; (2)通常需要复杂的建模，包括复杂的正/负样本结构和额外的数据增强。本文介绍了自对比学习(SCL) ，简化了 CL 的应用，提高了基于 CL 的推荐技术的性能。具体来说，SCL 是一个直接促进项目表征之间均匀分布的目标函数，它有效地替代了现有最先进模型的所有对比性目标成分。与以前的工作不同，SCL 消除了任何正/负样本构建或数据增强的需要，从而增强了项目表示空间的可解释性，并促进了其对现有推荐系统的可扩展性。通过对三个基准数据集的实验，我们证明了 SCL 能够持续地提高具有统计学意义的最先进模型的性能。值得注意的是，我们的实验表明，在不同的基准测试中，SCL 提高了两个性能最好的模型的性能，P@10(精度)平均提高了8.2% 和9.5% ，MRR@10(平均倒数排名)平均提高了9.9% 和11.2% 。此外，我们的分析阐明了改进方面的对齐和一致性的表示，以及有效的 SCL 与低计算成本。	code	0
Revealing the Hidden Impact of Top-N Metrics on Optimization in Recommender Systems	Lukas Wegmeth, Tobias Vente, Lennart Purucker		The hyperparameters of recommender systems for top-n predictions are typically optimized to enhance the predictive performance of algorithms. Thereby, the optimization algorithm, e.g., grid search or random search, searches for the best hyperparameter configuration according to an optimization-target metric, like nDCG or Precision. In contrast, the optimized algorithm, internally optimizes a different loss function during training, like squared error or cross-entropy. To tackle this discrepancy, recent work focused on generating loss functions better suited for recommender systems. Yet, when evaluating an algorithm using a top-n metric during optimization, another discrepancy between the optimization-target metric and the training loss has so far been ignored. During optimization, the top-n items are selected for computing a top-n metric; ignoring that the top-n items are selected from the recommendations of a model trained with an entirely different loss function. Item recommendations suitable for optimization-target metrics could be outside the top-n recommended items; hiddenly impacting the optimization performance. Therefore, we were motivated to analyze whether the top-n items are optimal for optimization-target top-n metrics. In pursuit of an answer, we exhaustively evaluate the predictive performance of 250 selection strategies besides selecting the top-n. We extensively evaluate each selection strategy over twelve implicit feedback and eight explicit feedback data sets with eleven recommender systems algorithms. Our results show that there exist selection strategies other than top-n that increase predictive performance for various algorithms and recommendation domains. However, the performance of the top 43 of selection strategies is not significantly different. We discuss the impact of our findings on optimization and re-ranking in recommender systems and feasible solutions.	为了提高算法的预测性能，对推荐系统的超参数进行了典型的优化。因此，优化算法，例如网格搜索或随机搜索，根据优化目标度量(如 nDCG 或 Precision)搜索最佳超参数配置。相比之下，优化后的算法，在训练期间内部优化了不同的损失函数，如平方误差或交叉熵。为了解决这个差异，最近的工作集中在产生更适合推荐系统的损失函数。然而，当在优化过程中使用 top-n 度量对算法进行评估时，优化目标度量与训练损失之间的另一个差异被忽略了。在优化过程中，选择 top-n 项目来计算 top-n 度量; 忽略 top-n 项目是从使用完全不同的损失函数训练的模型的建议中选择的。适合于优化的项目推荐——目标指标可能不在推荐项目的前列; 这会对优化性能产生隐性影响。因此，我们被激励去分析是否前 n 个项目对于优化目标的前 n 个度量是最佳的。在寻找答案的过程中，我们除了选择前 n 个选择策略外，还对250个选择策略的预测性能进行了详尽的评估。我们使用十二个隐式反馈和8个显式反馈数据集和十一个推荐系统算法对每个选择策略进行了广泛的评估。我们的研究结果表明，除了 top-n 之外，还存在其他的选择策略可以提高各种算法和推荐域的预测性能。然而，前43名选择策略的表现并没有显著差异。我们讨论了我们的研究结果对优化和重新排序的推荐系统和可行的解决方案的影响。	code	0
TWOLAR: A TWO-Step LLM-Augmented Distillation Method for Passage Reranking	Davide Baldelli, Junfeng Jiang, Akiko Aizawa, Paolo Torroni		In this paper, we present TWOLAR: a two-stage pipeline for passage reranking based on the distillation of knowledge from Large Language Models (LLM). TWOLAR introduces a new scoring strategy and a distillation process consisting in the creation of a novel and diverse training dataset. The dataset consists of 20K queries, each associated with a set of documents retrieved via four distinct retrieval methods to ensure diversity, and then reranked by exploiting the zero-shot reranking capabilities of an LLM. Our ablation studies demonstrate the contribution of each new component we introduced. Our experimental results show that TWOLAR significantly enhances the document reranking ability of the underlying model, matching and in some cases even outperforming state-of-the-art models with three orders of magnitude more parameters on the TREC-DL test sets and the zero-shot evaluation benchmark BEIR. To facilitate future work we release our data set, finetuned models, and code.	在本文中，我们提出了 TWOLAR: 一个基于从大语言模型(LLM)中提取知识的两阶段通道重新排序流水线。TWOLAR 引入了一个新的评分策略和一个精馏过程，包括创建一个新的和多样化的训练数据集。该数据集由20K 个查询组成，每个查询与一组文档相关联，这些文档通过四种不同的检索方法检索以确保多样性，然后通过利用 LLM 的零拍重新排序功能进行重新排序。我们的消融研究证明了我们引入的每个新组件的贡献。我们的实验结果显示，TWOLAR 显著提高了基础模型的文档重新排序能力，在 TREC-dL 测试集和零拍评估基准 BEIR 上，通过三个以上的参数，匹配甚至在某些情况下超越了最先进的模型，从而提高了文档重新排序的数量级。为了方便未来的工作，我们发布了我们的数据集、微调模型和代码。	code	0
Estimating Query Performance Through Rich Contextualized Query Representations	Sajad Ebrahimi, Maryam Khodabakhsh, Negar Arabzadeh, Ebrahim Bagheri				code	0
Performance Comparison of Session-Based Recommendation Algorithms Based on GNNs	Faisal Shehzad, Dietmar Jannach		In session-based recommendation settings, a recommender system has to base its suggestions on the user interactions that are observed in an ongoing session. Since such sessions can consist of only a small set of interactions, various approaches based on Graph Neural Networks (GNN) were recently proposed, as they allow us to integrate various types of side information about the items in a natural way. Unfortunately, a variety of evaluation settings are used in the literature, e.g., in terms of protocols, metrics and baselines, making it difficult to assess what represents the state of the art. In this work, we present the results of an evaluation of eight recent GNN-based approaches that were published in high-quality outlets. For a fair comparison, all models are systematically tuned and tested under identical conditions using three common datasets. We furthermore include k-nearest-neighbor and sequential rules-based models as baselines, as such models have previously exhibited competitive performance results for similar settings. To our surprise, the evaluation showed that the simple models outperform all recent GNN models in terms of the Mean Reciprocal Rank, which we used as an optimization criterion, and were only outperformed in three cases in terms of the Hit Rate. Additional analyses furthermore reveal that several other factors that are often not deeply discussed in papers, e.g., random seeds, can markedly impact the performance of GNN-based models. Our results therefore (a) point to continuing issues in the community in terms of research methodology and (b) indicate that there is ample room for improvement in session-based recommendation.	
在基于会话的推荐设置中，推荐系统必须根据当前会话中的用户交互情况提出建议。由于这样的会议可以只包括一小组交互，最近提出了各种基于图神经网络(GNN)的方法，因为它们允许我们以一种自然的方式整合关于项目的各种类型的副信息。不幸的是，文献中使用了各种各样的评估设置，例如，在协议、指标和基线方面，这使得评估什么代表了最先进的技术变得困难。在这项工作中，我们介绍了最近在高质量网点发表的八种基于 GNN 的方法的评价结果。为了进行公平的比较，使用三个共同的数据集，在相同的条件下系统地调整和测试所有模型。我们进一步包括 k 最近邻和顺序规则为基线的模型，因为这样的模型已经表现出竞争性能结果在类似的设置。令我们惊讶的是，评估显示，简单的模型在平均倒数排名方面表现优于所有最近的 GNN 模型，我们将其作为优化标准，在命中率方面只有三种情况表现优于 GNN 模型。进一步的分析表明，论文中通常不深入讨论的其他几个因素，例如随机种子，可以显著影响基于 GNN 的模型的性能。因此，我们的研究结果(a)指出了社区在研究方法方面仍然存在的问题，(b)表明在基于会话的推荐方面还有很大的改进空间。	code	0
Weighted AUReC: Handling Skew in Shard Map Quality Estimation for Selective Search	Gijs Hendriksen, Djoerd Hiemstra, Arjen P. de Vries				code	0
Measuring Item Fairness in Next Basket Recommendation: A Reproducibility Study	Yuanna Liu, Ming Li, Mozhdeh Ariannezhad, Masoud Mansoury, Mohammad Aliannejadi, Maarten de Rijke				code	0
Is Interpretable Machine Learning Effective at Feature Selection for Neural Learning-to-Rank?	Lijun Lyu, Nirmal Roy, Harrie Oosterhuis, Avishek Anand				code	0
The Impact of Differential Privacy on Recommendation Accuracy and Popularity Bias	Peter Müllner, Elisabeth Lex, Markus Schedl, Dominik Kowald		Collaborative filtering-based recommender systems leverage vast amounts of behavioral user data, which poses severe privacy risks. Thus, often, random noise is added to the data to ensure Differential Privacy (DP). However, to date, it is not well understood, in which ways this impacts personalized recommendations. In this work, we study how DP impacts recommendation accuracy and popularity bias, when applied to the training data of state-of-the-art recommendation models. Our findings are three-fold: First, we find that nearly all users' recommendations change when DP is applied. Second, recommendation accuracy drops substantially while recommended item popularity experiences a sharp increase, suggesting that popularity bias worsens. Third, we find that DP exacerbates popularity bias more severely for users who prefer unpopular items than for users that prefer popular items.	基于协同过滤的推荐系统利用了大量的行为用户数据，这带来了严重的隐私风险。因此，随机噪音往往被添加到数据中，以确保差分隐私(DP)。然而，到目前为止，人们还没有很好地理解这对个性化推荐的影响。在本研究中，我们研究了当应用于最先进的推荐模型的训练数据时，DP 如何影响推荐的准确性和受欢迎程度偏差。我们的发现有三个方面: 首先，我们发现几乎所有用户的建议在应用 DP 时都会发生变化。其次，推荐的准确性大幅下降，而推荐项目的流行经历了急剧增加，这表明流行偏差恶化。第三，我们发现对于喜欢不受欢迎项目的用户而言，DP 加剧流行偏见的程度要比喜欢受欢迎项目的用户严重得多。	code	0
How to Forget Clients in Federated Online Learning to Rank?	Shuyi Wang, Bing Liu, Guido Zuccon		Data protection legislation like the European Union's General Data Protection Regulation (GDPR) establishes the right to be forgotten: a user (client) can request contributions made using their data to be removed from learned models. In this paper, we study how to remove the contributions made by a client participating in a Federated Online Learning to Rank (FOLTR) system. In a FOLTR system, a ranker is learned by aggregating local updates to the global ranking model. Local updates are learned in an online manner at a client-level using queries and implicit interactions that have occurred within that specific client. By doing so, each client's local data is not shared with other clients or with a centralised search service, while at the same time clients can benefit from an effective global ranking model learned from contributions of each client in the federation. In this paper, we study an effective and efficient unlearning method that can remove a client's contribution without compromising the overall ranker effectiveness and without needing to retrain the global ranker from scratch. A key challenge is how to measure whether the model has unlearned the contributions from the client c^* that has requested removal. For this, we instruct c^* to perform a poisoning attack (add noise to this client updates) and then we measure whether the impact of the attack is lessened when the unlearning process has taken place. Through experiments on four datasets, we demonstrate the effectiveness and efficiency of the unlearning strategy under different combinations of parameter settings.	
数据保护立法，如欧盟的一般数据保护条例(GDPR)规定了被遗忘的权利: 用户(客户)可以要求使用他们的数据作出贡献，从学习的模型中删除。在本文中，我们研究了如何删除参与联邦在线学习排名(FOLTR)系统的客户所做的贡献。在 FOLTR 系统中，通过将本地更新聚合到全局排名模型中来学习排名器。使用在特定客户端中发生的查询和隐式交互，在客户端级别以联机方式学习本地更新。通过这样做，每个客户的本地数据不会与其他客户共享，也不会与中央搜索服务共享，同时客户可以从联合会中每个客户贡献的有效全球排名模型中受益。在本文中，我们研究了一个有效和高效的去除方法，可以消除客户的贡献，而不损害整体排名有效性，不需要从头再培训全球排名。一个关键的挑战是如何衡量模型是否已经从请求删除的客户机 c ^ * 那里忘记了贡献。为此，我们指示 c ^ * 执行中毒攻击(为客户端更新添加噪声) ，然后在发生忘记过程时测量攻击的影响是否减轻。通过对四个数据集的实验，验证了在不同的参数设置组合下，忘却策略的有效性和效率。	code	0
InDi: Informative and Diverse Sampling for Dense Retrieval	Nachshon Cohen, Hedda Cohen Indelman, Yaron Fairstein, Guy Kushilevitz				code	0
Learning-to-Rank with Nested Feedback	Hitesh Sagtani, Olivier Jeunen, Aleksei Ustimenko		Many platforms on the web present ranked lists of content to users, typically optimized for engagement-, satisfaction- or retention- driven metrics. Advances in the Learning-to-Rank (LTR) research literature have enabled rapid growth in this application area. Several popular interfaces now include nested lists, where users can enter a 2nd-level feed via any given 1st-level item. Naturally, this has implications for evaluation metrics, objective functions, and the ranking policies we wish to learn. We propose a theoretically grounded method to incorporate 2nd-level feedback into any 1st-level ranking model. Online experiments on a large-scale recommendation system confirm our theoretical findings.	网络上的许多平台对用户的内容列表进行排序，通常针对参与度、满意度或保留驱动的指标进行优化。学习到等级(LTR)研究文献的进步使得这一应用领域的快速增长成为可能。一些流行的界面现在包括嵌套列表，用户可以通过任何给定的第一级项目输入第二级提要。当然，这对评估指标、目标函数和我们希望学习的排名策略都有影响。我们提出了一个理论基础的方法，将二级反馈纳入任何一级排名模型。在一个大规模推荐系统上的在线实验证实了我们的理论发现。	code	0
Simple Domain Adaptation for Sparse Retrievers	Mathias Vast, Yuxuan Zong, Benjamin Piwowarski, Laure Soulier		In Information Retrieval, and more generally in Natural Language Processing, adapting models to specific domains is conducted through fine-tuning. Despite the successes achieved by this method and its versatility, the need for human-curated and labeled data makes it impractical to transfer to new tasks, domains, and/or languages when training data doesn't exist. Using the model without training (zero-shot) is another option that however suffers an effectiveness cost, especially in the case of first-stage retrievers. Numerous research directions have emerged to tackle these issues, most of them in the context of adapting to a task or a language. However, the literature is scarcer for domain (or topic) adaptation. In this paper, we address this issue of cross-topic discrepancy for a sparse first-stage retriever by transposing a method initially designed for language adaptation. By leveraging pre-training on the target data to learn domain-specific knowledge, this technique alleviates the need for annotated data and expands the scope of domain adaptation. Despite their relatively good generalization ability, we show that even sparse retrievers can benefit from our simple domain adaptation method.	在自然语言处理信息检索，以及更广泛的自然语言处理领域，通过微调来调整模型以适应特定的领域。尽管这种方法取得了成功，而且通用性强，但是对人工管理和标记数据的需求使得在培训数据不存在的情况下将数据转移到新的任务、领域和/或语言是不切实际的。不经训练就使用该模型(零射击)是另一种选择，但是这种方法会带来有效性损失，特别是对于第一阶段的检索器。为了解决这些问题，出现了许多研究方向，其中大多数是在适应一项任务或一种语言的背景下。然而，文献对领域(或主题)的适应性较少。在本文中，我们解决这个问题的跨主题差异的稀疏第一阶段的检索，移位的方法最初设计的语言适应。通过利用对目标数据的预训练来学习特定领域的知识，该技术减轻了对带注释数据的需求，并扩大了领域适应的范围。尽管它们具有相对较好的泛化能力，但是我们表明即使是稀疏的检索器也可以从我们简单的领域自适应方法中受益。	code	0
Selma: A Semantic Local Code Search Platform	Anja Reusch, Guilherme C. Lopes, Wilhelm Pertsch, Hannes Ueck, Julius Gonsior, Wolfgang Lehner				code	0
FAR-AI: A Modular Platform for Investment Recommendation in the Financial Domain	Javier SanzCruzado, Edward Richards, Richard McCreadie				code	0
Semantic Content Search on IKEA.com	Mateusz Slominski, Ezgi Yildirim, Martin Tegner				code	0
Semantic Search in Archive Collections Through Interpretable and Adaptable Relation Extraction About Person and Places	Nicolas Gutehrlé				code	0
Reproduction and Simulation of Interactive Retrieval Experiments	Jana Isabelle Friese				code	0
Efficient Multi-vector Dense Retrieval with Bit Vectors	Franco Maria Nardini, Cosimo Rulli, Rossano Venturini				code	0
Prompt-Based Generative News Recommendation (PGNR): Accuracy and Controllability	Xinyi Li, Yongfeng Zhang, Edward C. Malthouse				code	0
CaseGNN: Graph Neural Networks for Legal Case Retrieval with Text-Attributed Graphs	Yanran Tang, Ruihong Qiu, Yilun Liu, Xue Li, Zi Huang		Legal case retrieval is an information retrieval task in the legal domain, which aims to retrieve relevant cases with a given query case. Recent research of legal case retrieval mainly relies on traditional bag-of-words models and language models. Although these methods have achieved significant improvement in retrieval accuracy, there are still two challenges: (1) Legal structural information neglect. Previous neural legal case retrieval models mostly encode the unstructured raw text of case into a case representation, which causes the lack of important legal structural information in a case and leads to poor case representation; (2) Lengthy legal text limitation. When using the powerful BERT-based models, there is a limit of input text lengths, which inevitably requires to shorten the input via truncation or division with a loss of legal context information. In this paper, a graph neural networks-based legal case retrieval model, CaseGNN, is developed to tackle these challenges. To effectively utilise the legal structural information during encoding, a case is firstly converted into a Text-Attributed Case Graph (TACG), followed by a designed Edge Graph Attention Layer and a readout function to obtain the case graph representation. The CaseGNN model is optimised with a carefully designed contrastive loss with easy and hard negative sampling. Since the text attributes in the case graph come from individual sentences, the restriction of using language models is further avoided without losing the legal context. Extensive experiments have been conducted on two benchmarks from COLIEE 2022 and COLIEE 2023, which demonstrate that CaseGNN outperforms other state-of-the-art legal case retrieval methods. The code has been released on https://github.com/yanran-tang/CaseGNN.	
法律案例检索是法律领域的一项信息检索工作，其目的是检索具有给定查询案例的相关案例。目前法律案例检索的研究主要依赖于传统的词袋模型和语言模型。虽然这些方法在检索精度方面取得了显著的进步，但仍然存在两个挑战: (1)法律结构信息的忽视。以往的神经网络法律案例检索模型大多将非结构化的原始案例文本编码为案例表示，导致案例缺乏重要的法律结构信息，导致案例表示效果不佳; (2)冗长法律文本的限制。在使用基于 BERT 的强大模型时，存在输入文本长度的限制，这就不可避免地要求通过截断或除法来缩短输入，同时丢失法律上下文信息。本文提出了一种基于图神经网络的法律案例检索模型 CaseGNN，以解决这些问题。为了在编码过程中有效地利用法律结构信息，首先将案例转换为文本属性案例图(TACG) ，然后设计边缘图注意层和读出功能，得到案例图表示。CaseGNN 模型通过精心设计的对比损失和简单和硬负采样进行优化。由于案例图中的文本属性来自于单个句子，因此在不失去法律上下文的前提下，进一步避免了语言模型的使用限制。对 COLIEE 2022和 COLIEE 2023的两个基准进行了广泛的实验，证明 CaseGNN 优于其他最先进的法律案例检索方法。代码已经在 https://github.com/yanran-tang/CaseGNN 上发布了。	code	0
Context-Driven Interactive Query Simulations Based on Generative Large Language Models	Björn Engelmann, Timo Breuer, Jana Isabelle Friese, Philipp Schaer, Norbert Fuhr		Simulating user interactions enables a more user-oriented evaluation of information retrieval (IR) systems. While user simulations are cost-efficient and reproducible, many approaches often lack fidelity regarding real user behavior. Most notably, current user models neglect the user's context, which is the primary driver of perceived relevance and the interactions with the search results. To this end, this work introduces the simulation of context-driven query reformulations. The proposed query generation methods build upon recent Large Language Model (LLM) approaches and consider the user's context throughout the simulation of a search session. Compared to simple context-free query generation approaches, these methods show better effectiveness and allow the simulation of more efficient IR sessions. Similarly, our evaluations consider more interaction context than current session-based measures and reveal interesting complementary insights in addition to the established evaluation protocols. We conclude with directions for future work and provide an entirely open experimental setup.	通过模拟用户交互，可以对信息检索系统进行更加面向用户的评估。虽然用户模拟具有成本效益和可重复性，但许多方法通常缺乏真实用户行为的保真度。最值得注意的是，当前的用户模型忽视了用户的上下文，而上下文是感知相关性和与搜索结果交互的主要驱动因素。为此，本文介绍了上下文驱动的查询重构的仿真。提出的查询生成方法建立在最新的大型语言模型(LLM)方法的基础上，并在搜索会话的整个仿真过程中考虑用户的上下文。与简单的上下文无关的查询生成方法相比，这些方法显示出更好的效率，并允许模拟更有效的 IR 会话。同样，我们的评价考虑了比目前基于会议的措施更多的互动背景，除了既定的评价方案之外，还揭示了有趣的互补见解。我们总结了未来工作的方向，并提供了一个完全开放的实验装置。	code	0
Emotional Insights for Food Recommendations	Mehrdad Rostami, Ali Vardasbi, Mohammad Aliannejadi, Mourad Oussalah				code	0
LaQuE: Enabling Entity Search at Scale	Negar Arabzadeh, Amin Bigdeli, Ebrahim Bagheri				code	0
Analyzing Adversarial Attacks on Sequence-to-Sequence Relevance Models	Andrew Parry, Maik Fröbe, Sean MacAvaney, Martin Potthast, Matthias Hagen		Modern sequence-to-sequence relevance models like monoT5 can effectively capture complex textual interactions between queries and documents through cross-encoding. However, the use of natural language tokens in prompts, such as Query, Document, and Relevant for monoT5, opens an attack vector for malicious documents to manipulate their relevance score through prompt injection, e.g., by adding target words such as true. Since such possibilities have not yet been considered in retrieval evaluation, we analyze the impact of query-independent prompt injection via manually constructed templates and LLM-based rewriting of documents on several existing relevance models. Our experiments on the TREC Deep Learning track show that adversarial documents can easily manipulate different sequence-to-sequence relevance models, while BM25 (as a typical lexical model) is not affected. Remarkably, the attacks also affect encoder-only relevance models (which do not rely on natural language prompt tokens), albeit to a lesser extent.	现代的序列-序列相关模型，如 monoT5，可以通过交叉编码有效地捕获查询和文档之间复杂的文本交互。然而，在提示符中使用自然语言标记，比如 monoT5 的 Query、Document 和 Relevant，为恶意文档打开了一个攻击向量，通过提示注入操纵它们的相关性得分，例如，通过添加目标词，比如 true。由于在检索评估中还没有考虑到这种可能性，我们通过手工构建模板和基于 LLM 的文档重写来分析与查询无关的提示注入对几种现有相关性模型的影响。我们在 TREC Deep Learning 进行的实验表明，对抗性文档可以轻易地操纵不同的顺序-顺序关联模型，而 BM25(作为一个典型的词汇模型)不受影响。值得注意的是，这些攻击还会影响编码器相关性模型(不依赖于自然语言提示符) ，尽管影响程度较小。	code	0
Two-Step SPLADE: Simple, Efficient and Effective Approximation of SPLADE	Carlos Lassance, Hervé Déjean, Stéphane Clinchant, Nicola Tonellotto		Learned sparse models such as SPLADE have successfully shown how to incorporate the benefits of state-of-the-art neural information retrieval models into the classical inverted index data structure. Despite their improvements in effectiveness, learned sparse models are not as efficient as classical sparse models such as BM25. The problem has been investigated and addressed by recently developed strategies, such as guided traversal query processing and static pruning, with different degrees of success on in-domain and out-of-domain datasets. In this work, we propose a new query processing strategy for SPLADE based on a two-step cascade. The first step uses a pruned and reweighted version of the SPLADE sparse vectors, and the second step uses the original SPLADE vectors to re-score a sample of documents retrieved in the first stage. Our extensive experiments, performed on 30 different in-domain and out-of-domain datasets, show that our proposed strategy is able to improve mean and tail response times over the original single-stage SPLADE processing by up to 30× and 40×, respectively, for in-domain datasets, and by 12× to 25× for mean response on out-of-domain datasets, while not incurring a statistically significant difference in 60% of datasets.	像 SPLADE 这样的稀疏学习模型已经成功地展示了如何将最先进的神经信息检索模型的优点融入到经典的倒排索引数据结构中。尽管学习稀疏模型的有效性有所提高，但其效率不如经典稀疏模型如 BM25。该问题已经通过最近开发的策略得到了研究和解决，如引导遍历查询处理和静态剪枝，在域内和域外数据集上取得了不同程度的成功。本文提出了一种新的基于两步级联的 SPLADE 查询处理策略。第一步使用 SPLADE 稀疏向量的修剪和重新加权版本，第二步使用原始 SPLADE 向量对在第一阶段检索到的文档样本进行重新评分。我们在30个不同的域内和域外数据集上进行的广泛实验表明，我们提出的策略能够将原始单阶段 SPLADE 处理的平均和尾部响应时间分别提高30倍和40倍，对于域内数据集，提高12倍至25倍，对于域外数据集的平均响应，同时在60% 的数据集中不引起统计学显着差异。	code	0
Adapting Standard Retrieval Benchmarks to Evaluate Generated Answers	Negar Arabzadeh, Amin Bigdeli, Charles L. A. Clarke		Large language models can now directly generate answers to many factual questions without referencing external sources. Unfortunately, relatively little attention has been paid to methods for evaluating the quality and correctness of these answers, for comparing the performance of one model to another, or for comparing one prompt to another. In addition, the quality of generated answers is rarely directly compared to the quality of retrieved answers. As models evolve and prompts are modified, we have no systematic way to measure improvements without resorting to expensive human judgments. To address this problem we adapt standard retrieval benchmarks to evaluate answers generated by large language models. Inspired by the BERTScore metric for summarization, we explore two approaches. In the first, we base our evaluation on the benchmark relevance judgments. We empirically run experiments on how information retrieval relevance judgments can be utilized as an anchor to evaluating the generated answers. In the second, we compare generated answers to the top results retrieved by a diverse set of retrieval models, ranging from traditional approaches to advanced methods, allowing us to measure improvements without human judgments. In both cases, we measure the similarity between an embedded representation of the generated answer and an embedded representation of a known, or assumed, relevant passage from the retrieval benchmark.	大型语言模型现在可以直接生成许多实际问题的答案，而无需引用外部资源。遗憾的是，对于评价这些答案的质量和正确性、比较一个模型与另一个模型的表现或比较一个提示与另一个提示的方法，人们的关注相对较少。此外，生成的答案的质量很少直接比较检索的答案的质量。随着模型的发展和提示的修改，我们没有系统的方法来衡量改进而不诉诸昂贵的人类判断。为了解决这个问题，我们采用标准的检索基准来评估由大型语言模型生成的答案。受到用于总结的 BERTScore 度量的启发，我们探索了两种方法。首先，我们以基准相关性判断为基础进行评价。我们通过实验来研究信息检索相关性判断是如何被用来作为评估生成的答案的锚的。在第二个实验中，我们将生成的答案与不同检索模型(从传统方法到高级方法)检索到的最高结果进行比较，使我们能够在没有人为判断的情况下衡量改进情况。在这两种情况下，我们测量生成的答案的嵌入表示和检索基准中已知或假定的相关段落的嵌入表示之间的相似性。	code	0
Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control	Thong Nguyen, Mariya Hendriksen, Andrew Yates, Maarten de Rijke		Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors that can be indexed and retrieved efficiently with an inverted index. We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval. While LSR has seen success in text retrieval, its application in multimodal retrieval remains underexplored. Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets. Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors. We address issues of high dimension co-activation and semantic deviation through a new training algorithm, using Bernoulli random variables to control query expansion. Experiments with two dense models (BLIP, ALBEF) and two datasets (MSCOCO, Flickr30k) show that our proposed algorithm effectively reduces co-activation and semantic deviation. Our best-performing sparsified model outperforms state-of-the-art text-image LSR models with a shorter training time and lower GPU memory requirements. Our approach offers an effective solution for training LSR retrieval models in multimodal settings. Our code and model checkpoints are available at github.com/thongnt99/lsr-multimodal	学习稀疏检索(LSR)是一类将查询和文档编码成稀疏词汇向量的神经元方法，可以通过反向索引有效地进行索引和检索。我们探讨了 LSR 在多模态领域的应用，重点研究了文本图像检索。虽然 LSR 在文本检索方面取得了成功，但它在多模态检索中的应用仍然有待探索。目前的方法如 LexLIP 和 STAIR 需要对大量数据集进行复杂的多步训练。我们提出的方法有效地将密集向量从一个冻结的密集模型转换成稀疏的词汇向量。通过一种新的训练算法，利用贝努利随机变量控制查询扩展，解决了高维共激活和语义偏差的问题。对两个密集模型(BLIP，ALBEF)和两个数据集(MSCOCO，Flickr30k)的实验表明，该算法有效地减少了协同激活和语义偏差。我们性能最好的稀疏模型优于最先进的文本图像 LSR 模型，具有更短的训练时间和更低的 GPU 内存需求。该方法为在多模态环境下训练 LSR 检索模型提供了一种有效的解决方案。我们的代码和模型检查点在 github.com/thongnt99/lsr-multimodal 都有	code	0
Alleviating Confounding Effects with Contrastive Learning in Recommendation	Di You, Kyumin Lee				code	0
Align MacridVAE: Multimodal Alignment for Disentangled Recommendations	Ignacio Avas, Liesbeth Allein, Katrien Laenen, MarieFrancine Moens				code	0
Learning Action Embeddings for Off-Policy Evaluation	Matej Cief, Jacek Golebiowski, Philipp Schmidt, Ziawasch Abedjan, Artur Bekasov		Off-policy evaluation (OPE) methods allow us to compute the expected reward of a policy by using the logged data collected by a different policy. OPE is a viable alternative to running expensive online A/B tests: it can speed up the development of new policies and reduce the risk of exposing customers to suboptimal treatments. However, when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have a high or even infinite variance. Saito and Joachims (arXiv:2202.06317v2 [cs.LG]) propose marginalized IPS (MIPS), which uses action embeddings instead and thereby reduces the variance of IPS in large action spaces. MIPS assumes that good action embeddings can be defined by the practitioner, which is difficult to do in many real-world applications. In this work, we explore learning action embeddings from logged data. In particular, we use intermediate outputs of a trained reward model to define action embeddings for MIPS. This approach extends MIPS to more applications, and in our experiments improves upon MIPS with pre-defined embeddings, as well as standard baselines, on both synthetic and real-world data. Our method does not make assumptions about the reward model class, and supports using additional action information to further improve the estimates. The proposed approach presents an appealing alternative to the doubly robust (DR) estimator for combining the low variance of the direct method (DM) with the low bias of IPS.	
非策略评估(OPE)方法允许我们通过使用不同策略收集的日志数据来计算策略的预期回报。相对于运行昂贵的在线 A/B 测试，OPE 是一种可行的替代方案: 它可以加快新政策的制定，并降低客户接触次优治疗的风险。然而，当操作的数量很大，或者某些操作被日志策略低估时，基于逆倾向评分(IPS)的现有估计量可能会有很高甚至无限的方差。Saito 和 Joachims (arXiv: 2202.06317 v2[ cs.LG ])提出使用动作嵌入的边缘化 IPS (MIPS) ，这减少了大动作空间中 IPS 的方差。MIPS 假设好的操作嵌入可以由从业人员定义，这在许多实际应用程序中是很难做到的。在这项工作中，我们探讨了从日志数据学习动作嵌入。特别地，我们使用训练过的奖励模型的中间输出来定义 MIPS 的行动嵌入。这种方法将 MIPS 扩展到更多的应用程序，并且在我们的实验中通过预定义的嵌入以及在合成和真实世界数据上的标准基线改进了 MIPS。我们的方法不对奖励模型类做假设，并支持使用额外的行动信息，以进一步改善估计。提出的方法提出了一个吸引人的替代 DR 相结合的 DM 的低方差和 IPS 的低偏差。	code	0
Simulated Task Oriented Dialogues for Developing Versatile Conversational Agents	Xi Wang, Procheta Sen, Ruizhe Li, Emine Yilmaz				code	0
Hypergraphs with Attention on Reviews for Explainable Recommendation	Theis E. Jendal, TrungHoang Le, Hady W. Lauw, Matteo Lissandrini, Peter Dolog, Katja Hose				code	0
Investigating the Usage of Formulae in Mathematical Answer Retrieval	Anja Reusch, Julius Gonsior, Claudio Hartmann, Wolfgang Lehner				code	0
Empowering Legal Citation Recommendation via Efficient Instruction-Tuning of Pre-trained Language Models	Jie Wang, Kanha Bansal, Ioannis Arapakis, Xuri Ge, Joemon M. Jose				code	0
Fine-Tuning CLIP via Explainability Map Propagation for Boosting Image and Video Retrieval	Yoav Shalev, Lior Wolf				code	0
Cross-Modal Retrieval for Knowledge-Based Visual Question Answering	Paul Lerner, Olivier Ferret, Camille Guinaudeau		Knowledge-based Visual Question Answering about Named Entities is a challenging task that requires retrieving information from a multimodal Knowledge Base. Named entities have diverse visual representations and are therefore difficult to recognize. We argue that cross-modal retrieval may help bridge the semantic gap between an entity and its depictions, and is above all complementary to mono-modal retrieval. We provide empirical evidence through experiments with a multimodal dual encoder, namely CLIP, on the recent ViQuAE, InfoSeek, and Encyclopedic-VQA datasets. Additionally, we study three different strategies to fine-tune such a model: mono-modal, cross-modal, or joint training. Our method, which combines mono- and cross-modal retrieval, is competitive with billion-parameter models on the three datasets, while being conceptually simpler and computationally cheaper.	基于知识的命名实体可视化问答是一项具有挑战性的任务，需要从多模态知识库中检索信息。命名实体具有不同的可视化表示，因此难以识别。我们认为，跨模态检索有助于弥合实体与其描述之间的语义鸿沟，并与单模态检索相辅相成。我们提供经验证明通过实验与多模态双编码器，即 CLIP，在最近的 ViQuAE，资讯搜寻和百科全书-VQA 数据集。此外，我们还研究了三种不同的策略来微调这种模型: 单模态、跨模态或联合训练。我们的方法结合了单模态检索和跨模态检索，与三个数据集上的十亿参数模型相比具有竞争力，同时在概念上更简单，计算成本更低。	code	0
Learning to Jointly Transform and Rank Difficult Queries	Amin Bigdeli, Negar Arabzadeh, Ebrahim Bagheri				code	0
Instant Answering in E-Commerce Buyer-Seller Messaging Using Message-to-Question Reformulation	Besnik Fetahu, Tejas Mehta, Qun Song, Nikhita Vedula, Oleg Rokhlenko, Shervin Malmasi		E-commerce customers frequently seek detailed product information for purchase decisions, commonly contacting sellers directly with extended queries. This manual response requirement imposes additional costs and disrupts buyer's shopping experience with response time fluctuations ranging from hours to days. We seek to automate buyer inquiries to sellers in a leading e-commerce store using a domain-specific federated Question Answering (QA) system. The main challenge is adapting current QA systems, designed for single questions, to address detailed customer queries. We address this with a low-latency, sequence-to-sequence approach, MESSAGE-TO-QUESTION ( M2Q ). It reformulates buyer messages into succinct questions by identifying and extracting the most salient information from a message. Evaluation against baselines shows that M2Q yields relative increases of 757 answering rate from the federated QA system. Live deployment shows that automatic answering saves sellers from manually responding to millions of messages per year, and also accelerates customer purchase decisions by eliminating the need for buyers to wait for a reply	电子商务客户经常为购买决策寻找详细的产品信息，通常直接与销售商进行扩展查询。这种手动响应要求增加了额外的成本，并且由于响应时间从几小时到几天的波动而扰乱了买家的购物体验。我们寻求在一个领先的电子商务商店使用领域特定的联邦问题回答(QA)系统自动化的买方询问卖方。主要的挑战是适应当前的 QA 系统，为单个问题设计，以解决详细的客户查询。我们使用低延迟、序列到序列的方法 MESSAGE-TO-QUESTION (M2Q)来解决这个问题。它通过从消息中识别和提取最突出的信息，将买方消息重新表述为简洁的问题。对基线的评估表明，M2Q 在联邦 QA 系统中的应答率相对提高了757。实时部署显示，自动回复可以节省卖家每年手动回复数百万条消息的时间，还可以消除买家等待回复的需要，从而加快客户的购买决策	code	0
Towards Automated End-to-End Health Misinformation Free Search with a Large Language Model	Ronak Pradeep, Jimmy Lin				code	0
Reproducibility Analysis and Enhancements for Multi-aspect Dense Retriever with Aspect Learning	Keping Bi, Xiaojie Sun, Jiafeng Guo, Xueqi Cheng		Multi-aspect dense retrieval aims to incorporate aspect information (e.g., brand and category) into dual encoders to facilitate relevance matching. As an early and representative multi-aspect dense retriever, MADRAL learns several extra aspect embeddings and fuses the explicit aspects with an implicit aspect "OTHER" for final representation. MADRAL was evaluated on proprietary data and its code was not released, making it challenging to validate its effectiveness on other datasets. We failed to reproduce its effectiveness on the public MA-Amazon data, motivating us to probe the reasons and re-examine its components. We propose several component alternatives for comparisons, including replacing "OTHER" with "CLS" and representing aspects with the first several content tokens. Through extensive experiments, we confirm that learning "OTHER" from scratch in aspect fusion is harmful. In contrast, our proposed variants can greatly enhance the retrieval performance. Our research not only sheds light on the limitations of MADRAL but also provides valuable insights for future studies on more powerful multi-aspect dense retrieval models. Code will be released at: https://github.com/sunxiaojie99/Reproducibility-for-MADRAL.	多方面密集检索旨在将方面信息(例如，品牌和类别)合并到双编码器中，以促进相关性匹配。作为一个早期的、有代表性的多方面密集检索器，MADRAL 学习了一些额外的方面嵌入，并将显式方面与隐式方面“ OTHER”融合以得到最终的表示。MADRAL 是根据专有数据进行评估的，其代码没有发布，这使得在其他数据集上验证其有效性具有挑战性。我们未能在公开的 MA-Amazon 数据上再现其有效性，这促使我们探究其原因并重新检查其组成部分。我们提出了几种可供比较的组件替代方案，包括用“ CLS”替换“ OTHER”，以及用前几个内容标记表示方面。通过大量的实验，我们证实了在方面融合中从头学习“其他”是有害的。相比之下，我们提出的变量可以大大提高检索性能。我们的研究不仅揭示了 MADRAL 的局限性，而且为未来更强大的多方面密集检索模型的研究提供了有价值的见解。密码将在下列 https://github.com/sunxiaojie99/reproducibility-for-madral 公布:。	code	0
An Empirical Analysis of Intervention Strategies' Effectiveness for Countering Misinformation Amplification by Recommendation Algorithms	Royal Pathak, Francesca Spezzano				code	0
Not Just Algorithms: Strategically Addressing Consumer Impacts in Information Retrieval	Michael D. Ekstrand, Lex Beattie, Maria Soledad Pera, Henriette Cramer				code	0
A Study of Pre-processing Fairness Intervention Methods for Ranking People	Clara Rus, Andrew Yates, Maarten de Rijke				code	0
Evaluating the Explainability of Neural Rankers	Saran Pandian, Debasis Ganguly, Sean MacAvaney		Information retrieval models have witnessed a paradigm shift from unsupervised statistical approaches to feature-based supervised approaches to completely data-driven ones that make use of the pre-training of large language models. While the increasing complexity of the search models has been able to demonstrate improvements in effectiveness (measured in terms of relevance of top-retrieved results), a question worthy of a thorough inspection is "how explainable are these models?", which is what this paper aims to evaluate. In particular, we propose a common evaluation platform to systematically evaluate the explainability of any ranking model (the explanation algorithm being identical for all the models that are to be evaluated). In our proposed framework, each model, in addition to returning a ranked list of documents, is also required to return a list of explanation units or rationales for each document. This meta-information from each document is then used to measure how locally consistent these rationales are, as an intrinsic measure of interpretability that does not require manual relevance assessments. Additionally, as an extrinsic measure, we compute how relevant these rationales are by leveraging sub-document level relevance assessments. Our findings include a number of interesting observations: sentence-level rationales are more consistent, an increase in complexity mostly leads to less consistent explanations, and interpretability measures offer a complementary dimension of evaluation of IR systems because consistency is not well-correlated with nDCG at top ranks.	
信息检索模型已经见证了从无监督统计方法到基于特征的监督方法到完全数据驱动方法的范式转变，这种方法利用了大型语言模型的预训练。虽然搜索模型日益增加的复杂性已经能够证明有效性的改善(根据检索结果的相关性衡量) ，但是一个值得彻底检查的问题是——“这些模型如何解释?”这就是本文的目的。特别是，我们提出了一个通用的评估平台，系统地评估任何排名模型的可解释性(解释算法对于所有待评估的模型都是相同的)。在我们提出的框架中，每个模型除了返回排序的文档列表之外，还需要返回每个文档的解释单元或基本原理的列表。然后，利用每份文件中的元信息来衡量这些理由在当地的一致程度，作为衡量可解释性的内在尺度，而不需要人工进行相关性评估。此外，作为一个外在的测量，我们通过利用子文档级别的相关性评估来计算这些基本原理的相关性。我们的研究结果显示，许多有趣的观察结果，例如句子水平的基本原理更加一致，复杂性的增加主要导致不一致的解释，并且可解释性测量提供了 IR 系统评估的补充维度，因为一致性与顶级的 nDCG 不相关。	code	0
Knowledge Graph Cross-View Contrastive Learning for Recommendation	Zeyuan Meng, Iadh Ounis, Craig Macdonald, Zixuan Yi				code	0
Recommendation Fairness in eParticipation: Listening to Minority, Vulnerable and NIMBY Citizens	Marina AlonsoCortés, Iván Cantador, Alejandro Bellogín				code	0
Responsible Opinion Formation on Debated Topics in Web Search	Alisa Rieger, Tim Draws, Nicolas Mattis, David Maxwell, David Elsweiler, Ujwal Gadiraju, Dana McKay, Alessandro Bozzon, Maria Soledad Pera				code	0
Is Google Getting Worse? A Longitudinal Investigation of SEO Spam in Search Engines	Janek Bevendorff, Matti Wiegmann, Martin Potthast, Benno Stein				code	0
Robustness in Fairness Against Edge-Level Perturbations in GNN-Based Recommendation	Ludovico Boratto, Francesco Fabbri, Gianni Fenu, Mirko Marras, Giacomo Medda		Efforts in the recommendation community are shifting from the sole emphasis on utility to considering beyond-utility factors, such as fairness and robustness. Robustness of recommendation models is typically linked to their ability to maintain the original utility when subjected to attacks. Limited research has explored the robustness of a recommendation model in terms of fairness, e.g., the parity in performance across groups, under attack scenarios. In this paper, we aim to assess the robustness of graph-based recommender systems concerning fairness, when exposed to attacks based on edge-level perturbations. To this end, we considered four different fairness operationalizations, including both consumer and provider perspectives. Experiments on three datasets shed light on the impact of perturbations on the targeted fairness notion, uncovering key shortcomings in existing evaluation protocols for robustness. As an example, we observed that perturbations affect consumer fairness to a greater extent than provider fairness, with alarming unfairness for the former. Source code: https://github.com/jackmedda/CPFairRobust	推荐社区的工作正在从单纯强调效用转向考虑效用以外的因素，如公平性和稳健性。推荐模型的健壮性通常与它们在受到攻击时维护原始实用程序的能力有关。有限的研究已经探索了推荐模型在公平性方面的健壮性，例如，在攻击场景下组间的性能均等。本文旨在评估基于图形的推荐系统在受到基于边界层扰动的攻击时对公平性的鲁棒性。为此，我们考虑了四种不同的公平可操作性，包括消费者和提供者视角。在三个数据集上的实验揭示了扰动对目标公平性概念的影响，揭示了现有鲁棒性评估协议的关键缺陷。作为一个例子，我们观察到扰动对消费者公平性的影响程度高于提供者公平性，前者的不公平性令人担忧。源代码: https://github.com/jackmedda/cpfairrobust	code	0
Shallow Cross-Encoders for Low-Latency Retrieval	Aleksandr V. Petrov, Sean MacAvaney, Craig Macdonald		Transformer-based Cross-Encoders achieve state-of-the-art effectiveness in text retrieval. However, Cross-Encoders based on large transformer models (such as BERT or T5) are computationally expensive and allow for scoring only a small number of documents within a reasonably small latency window. Yet keeping search latencies low is important for user satisfaction and energy usage. In this paper, we show that weaker shallow transformer models (i.e., transformers with a limited number of layers) actually perform better than full-scale models when constrained to these practical low-latency settings, since they can estimate the relevance of more documents in the same time budget. We further show that shallow transformers may benefit from the generalized Binary Cross-Entropy (gBCE) training scheme, which has recently demonstrated success for recommendation tasks. Our experiments with TREC Deep Learning passage ranking query sets demonstrate significant improvements in shallow and full-scale models in low-latency scenarios. For example, when the latency limit is 25ms per query, MonoBERT-Large (a cross-encoder based on a full-scale BERT model) is only able to achieve NDCG@10 of 0.431 on TREC DL 2019, while TinyBERT-gBCE (a cross-encoder based on TinyBERT trained with gBCE) reaches NDCG@10 of 0.652, a +51% relative gain. Cross-Encoders are effective even when used without a GPU (e.g., with CPU inference, NDCG@10 decreases only by 3%), which makes Cross-Encoders practical to run even without specialized hardware acceleration.	
基于变压器的交叉编码器在文本检索中取得了最先进的效果。然而，基于大型转换器模型(如 BERT 或 T5)的交叉编码器在计算上是昂贵的，并且只允许在一个相当小的延迟窗口内对少量文档进行评分。然而，保持较低的搜索延迟对于用户满意度和能量使用非常重要。在本文中，我们表明，较弱的浅层变压器模型(即，有限层数的变压器)实际上比全尺寸模型表现更好，当约束到这些实际的低延迟设置，因为他们可以估计相关性更多的文件在同一时间预算。我们进一步表明，浅变压器可以受益于广义二进制交叉熵(gBCE)训练方案，最近证明了推荐任务的成功。我们对 TREC 深度学习段落排序查询集的实验表明，在低延迟场景中，浅层和全尺度模型有了显著的改进。例如，当每个查询的延迟限制为25毫秒时，MonoBERT-Large (一种基于全尺寸 BERT 模型的交叉编码器)在 TREC dL 2019上只能达到0.431的 NDCG@10，而 TinyBERT-gBCE (一种基于 TinyBERT 的交叉编码器，经 gBCE 训练)达到0.652的 NDCG@10，a + 51交叉编码器即使在没有图形处理器的情况下也是有效的(例如，根据 CPU 推断，NDCG@10只减少了3个延迟) ，这使得交叉编码器即使没有专门的硬件加速也能实际运行。	code	0
Improved Learned Sparse Retrieval with Corpus-Specific Vocabularies	Puxuan Yu, Antonio Mallia, Matthias Petri		We explore leveraging corpus-specific vocabularies that improve both efficiency and effectiveness of learned sparse retrieval systems. We find that pre-training the underlying BERT model on the target corpus, specifically targeting different vocabulary sizes incorporated into the document expansion process, improves retrieval quality by up to 12% while in some scenarios decreasing latency by up to 50%. Our experiments show that adopting corpus-specific vocabulary and increasing vocabulary size decreases average postings list length which in turn reduces latency. Ablation studies show interesting interactions between custom vocabularies, document expansion techniques, and sparsification objectives of sparse models. Both effectiveness and efficiency improvements transfer to different retrieval approaches such as uniCOIL and SPLADE and offer a simple yet effective approach to providing new efficiency-effectiveness trade-offs for learned sparse retrieval systems.	我们探索利用特定于语料库的词汇来提高学习的稀疏检索系统的效率和有效性。我们发现，在目标语料库上预先训练基础的 BERT 模型，特别是针对文档扩展过程中包含的不同词汇量，可以提高检索质量达12% ，而在某些情况下可以减少50% 的延迟。我们的实验表明，采用特定语料库词汇和增加词汇量减少了平均发布列表长度，从而减少了延迟。消融研究显示了自定义词汇表、文档扩展技术和稀疏模型的稀疏化目标之间有趣的交互作用。成效和效率的提高都转移到不同的检索方法，如 uniCOIL 和 SPLADE，并提供了一种简单而有效的方法，为学习的稀疏检索系统提供新的效率效益权衡。	code	0
An Adaptive Framework of Geographical Group-Specific Network on O2O Recommendation	Luo Ji, Jiayu Mao, Hailong Shi, Qian Li, Yunfei Chu, Hongxia Yang		Online to offline recommendation strongly correlates with the user and service's spatiotemporal information, therefore calling for a higher degree of model personalization. The traditional methodology is based on a uniform model structure trained by collected centralized data, which is unlikely to capture all user patterns over different geographical areas or time periods. To tackle this challenge, we propose a geographical group-specific modeling method called GeoGrouse, which simultaneously studies the common knowledge as well as group-specific knowledge of user preferences. An automatic grouping paradigm is employed and verified based on users' geographical grouping indicators. Offline and online experiments are conducted to verify the effectiveness of our approach, and substantial business improvement is achieved.	Online To Offline线上到线下推荐与用户和服务的时空信息密切相关，因此需要更高程度的模型个性化。传统的方法是基于由收集的中央数据训练的统一模型结构，这种结构不太可能捕获不同地理区域或不同时期的所有用户模式。为了应对这一挑战，我们提出了一种名为 GeoGrouse 的地理组特定建模方法，该方法同时研究用户偏好的常识和组特定知识。基于用户的地理分组指标，采用自动分组范式进行验证。通过离线和在线实验验证了该方法的有效性，并取得了实质性的业务改进。	code	0
GenQREnsemble: Zero-Shot LLM Ensemble Prompting for Generative Query Reformulation	Kaustubh D. Dhole, Eugene Agichtein		Query Reformulation(QR) is a set of techniques used to transform a user's original search query to a text that better aligns with the user's intent and improves their search experience. Recently, zero-shot QR has been shown to be a promising approach due to its ability to exploit knowledge inherent in large language models. By taking inspiration from the success of ensemble prompting strategies which have benefited many tasks, we investigate if they can help improve query reformulation. In this context, we propose an ensemble based prompting technique, GenQREnsemble which leverages paraphrases of a zero-shot instruction to generate multiple sets of keywords ultimately improving retrieval performance. We further introduce its post-retrieval variant, GenQREnsembleRF to incorporate pseudo relevant feedback. On evaluations over four IR benchmarks, we find that GenQREnsemble generates better reformulations with relative nDCG@10 improvements up to 18 the previous zero-shot state-of-art. On the MSMarco Passage Ranking task, GenQREnsembleRF shows relative gains of 5 and 9	查询重构(Query Reformation，QR)是一组技术，用于将用户的原始搜索查询转换为更好地符合用户意图并改善其搜索体验的文本。最近，零拍 QR 已被证明是一种有前途的方法，因为它能够利用知识固有的大型语言模型。本文从集成提示策略的成功经验中得到启发，探讨了集成提示策略是否有助于改进查询重构。在这种背景下，我们提出了一种基于集成的提示技术，GenQR 集成，它利用一个零拍指令的释义来生成多组关键字，最终提高检索性能。我们进一步引入其检索后变体，GenQREnsembleRF，以纳入伪相关反馈。通过对四个 IR 基准的评估，我们发现 GenQREnamble 在相对 nDCG@10的改进下产生了更好的重构效果，最高达到了之前的最高水平。在 MSMarco 通道排名任务中，GenQREnsembleRF 显示相对增益为5和9	code	0
Improving the Robustness of Dense Retrievers Against Typos via Multi-Positive Contrastive Learning	Georgios Sidiropoulos, Evangelos Kanoulas		Dense retrieval has become the new paradigm in passage retrieval. Despite its effectiveness on typo-free queries, it is not robust when dealing with queries that contain typos. Current works on improving the typo-robustness of dense retrievers combine (i) data augmentation to obtain the typoed queries during training time with (ii) additional robustifying subtasks that aim to align the original, typo-free queries with their typoed variants. Even though multiple typoed variants are available as positive samples per query, some methods assume a single positive sample and a set of negative ones per anchor and tackle the robustifying subtask with contrastive learning; therefore, making insufficient use of the multiple positives (typoed queries). In contrast, in this work, we argue that all available positives can be used at the same time and employ contrastive learning that supports multiple positives (multi-positive). Experimental results on two datasets show that our proposed approach of leveraging all positives simultaneously and employing multi-positive contrastive learning on the robustifying subtask yields improvements in robustness against using contrastive learning with a single positive.	密集检索已经成为文章检索的新范式。尽管它对于无输入错误的查询有效，但是在处理包含输入错误的查询时并不健壮。目前致力于改善密集检索器的输入鲁棒性，将(i)数据增强结合起来以在训练期间获得输入查询和(ii)额外的鲁棒子任务，旨在将原始的，无输入的查询与其输入变体对齐。尽管每个查询可以提供多个类型变体作为正面样本，但是一些方法假设每个锚具有单个正面样本和一组负面样本，并通过对比学习处理强健的子任务; 因此，没有充分利用多个正面样本(类型查询)。相比之下，在这项工作中，我们认为所有可用的积极因素可以同时使用，并采用对比学习，支持多个积极因素(多积极)。在两个数据集上的实验结果表明，我们提出的方法同时利用所有的积极和使用多个积极的对比学习的鲁棒性子任务产生的鲁棒性对使用单个积极的对比学习的改善。	code	0
Towards Reliable and Factual Response Generation: Detecting Unanswerable Questions in Information-Seeking Conversations	Weronika Lajewska, Krisztian Balog		Generative AI models face the challenge of hallucinations that can undermine users' trust in such systems. We approach the problem of conversational information seeking as a two-step process, where relevant passages in a corpus are identified first and then summarized into a final system response. This way we can automatically assess if the answer to the user's question is present in the corpus. Specifically, our proposed method employs a sentence-level classifier to detect if the answer is present, then aggregates these predictions on the passage level, and eventually across the top-ranked passages to arrive at a final answerability estimate. For training and evaluation, we develop a dataset based on the TREC CAsT benchmark that includes answerability labels on the sentence, passage, and ranking levels. We demonstrate that our proposed method represents a strong baseline and outperforms a state-of-the-art LLM on the answerability prediction task.	生成型人工智能模型面临着幻觉的挑战，幻觉可能会破坏用户对这类系统的信任。我们将会话信息搜寻问题分为两个步骤，首先识别语料库中的相关段落，然后将其归纳为最终的系统反应。这样我们就可以自动评估用户问题的答案是否在语料库中。具体来说，我们提出的方法使用一个句子级别的分类器来检测答案是否存在，然后将这些预测集中在短文级别，最终通过排名最高的段落来得到最终的可回答性估计。为了培训和评估，我们开发了一个基于 TREC CAsT 基准的数据集，包括句子、段落和排名等级上的可回答性标签。我们证明，我们提出的方法代表了一个强大的基线和优于国家的最先进的 LLM 的应答性预测任务。	code	0
On the Influence of Reading Sequences on Knowledge Gain During Web Search	Wolfgang Gritz, Anett Hoppe, Ralph Ewerth		Nowadays, learning increasingly involves the usage of search engines and web resources. The related interdisciplinary research field search as learning aims to understand how people learn on the web. Previous work has investigated several feature classes to predict, for instance, the expected knowledge gain during web search. Therein, eye-tracking features have not been extensively studied so far. In this paper, we extend a previously used reading model from a line-based one to one that can detect reading sequences across multiple lines. We use publicly available study data from a web-based learning task to examine the relationship between our feature set and the participants' test scores. Our findings demonstrate that learners with higher knowledge gain spent significantly more time reading, and processing more words in total. We also find evidence that faster reading at the expense of more backward regressions may be an indicator of better web-based learning. We make our code publicly available at https://github.com/TIBHannover/reading_web_search.	如今，学习越来越多地涉及到搜索引擎和网络资源的使用。相关的科际整合领域搜索作为学习，旨在了解人们如何在网上学习。以前的工作已经调查了几个特征类来预测，例如，在网络搜索期间的预期知识增益。其中，眼球跟踪特征还没有被广泛研究到目前为止。在本文中，我们将以前使用的读取模型从基于行的模型扩展到能够跨多行检测读取序列的模型。我们使用来自网络学习任务的公开可用的研究数据来检查我们的特征集和参与者的测试分数之间的关系。我们的研究结果表明，获得更高知识的学习者花费更多的时间阅读，并处理更多的单词总数。我们还发现，以更多倒退回归为代价的更快阅读可能是更好的网络学习的一个指标。我们让我们的代码在 https://github.com/tibhannover/reading_web_search 上公开。	code	0
SPARe: Supercharged Lexical Retrievers on GPU with Sparse Kernels	Tiago Almeida, Sérgio Matos				code	0
Beneath the [MASK]: An Analysis of Structural Query Tokens in ColBERT	Ben Giacalone, Greg Paiement, Quinn Tucker, Richard Zanibbi				code	0
A Cost-Sensitive Meta-learning Strategy for Fair Provider Exposure in Recommendation	Ludovico Boratto, Giulia Cerniglia, Mirko Marras, Alessandra Perniciano, Barbara Pes		When devising recommendation services, it is important to account for the interests of all content providers, encompassing not only newcomers but also minority demographic groups. In various instances, certain provider groups find themselves underrepresented in the item catalog, a situation that can influence recommendation results. Hence, platform owners often seek to regulate the exposure of these provider groups in the recommended lists. In this paper, we propose a novel cost-sensitive approach designed to guarantee these target exposure levels in pairwise recommendation models. This approach quantifies, and consequently mitigates, the discrepancies between the volume of recommendations allocated to groups and their contribution in the item catalog, under the principle of equity. Our results show that this approach, while aligning groups' exposure with their assigned levels, does not compromise the original recommendation utility. Source code and pre-processed data can be retrieved at https://github.com/alessandraperniciano/meta-learning-strategy-fair-provider-exposure.	在设计推荐服务时，必须考虑到所有内容提供者的利益，不仅包括新来者，还包括少数人口群体。在各种情况下，某些提供者组发现自己在项目目录中的表示不足，这种情况可能会影响推荐结果。因此，平台所有者往往试图在建议名单中规范这些提供商群体的风险敞口。在本文中，我们提出了一种新的成本敏感的方法来保证这些目标暴露水平的配对推荐模型。根据公平原则，这种方法量化并因此减少了分配给各组的建议数量与它们在项目目录中的贡献之间的差异。我们的研究结果表明，这种方法，虽然调整组暴露与他们指定的水平，不妥协的原始推荐实用程序。源代码和预处理数据可以在 https://github.com/alessandraperniciano/meta-learning-strategy-fair-provider-exposure 检索到。	code	0
Multiple Testing for IR and Recommendation System Experiments	Ngozi Ihemelandu, Michael D. Ekstrand				code	0
An In-Depth Comparison of Neural and Probabilistic Tree Models for Learning-to-rank	Haonan Tan, Kaiyu Yang, Haitao Yu				code	0
GenRec: Large Language Model for Generative Recommendation	Jianchao Ji, Zelong Li, Shuyuan Xu, Wenyue Hua, Yingqiang Ge, Juntao Tan, Yongfeng Zhang		In recent years, large language models (LLMs) have emerged as powerful tools for diverse natural language processing tasks. However, their potential for recommender systems under the generative recommendation paradigm remains relatively unexplored. This paper presents an innovative approach to recommendation systems using LLMs based on text data. We present a novel LLM for generative recommendation (GenRec) that utilizes the expressive power of LLMs to directly generate the target item to recommend, rather than calculating a ranking score for each candidate item one by one as in traditional discriminative recommendation. GenRec uses the LLM's understanding ability to interpret context, learn user preferences, and generate relevant recommendations. Our proposed approach leverages the vast knowledge encoded in large language models to accomplish recommendation tasks. We first formulate specialized prompts to enhance the ability of the LLM to comprehend recommendation tasks. Subsequently, we use these prompts to fine-tune the LLaMA backbone LLM on a dataset of user-item interactions, represented by textual data, to capture user preferences and item characteristics. Our research underscores the potential of LLM-based generative recommendation in revolutionizing the domain of recommendation systems and offers a foundational framework for future explorations in this field. We conduct extensive experiments on benchmark datasets, and the experiments show that GenRec achieves significantly better results on large datasets.	
近年来，大型语言模型(LLM)已经成为处理各种自然语言处理任务的强大工具。然而，他们的潜力推荐系统在生成推荐范式仍然相对未开发。本文提出了一种基于文本数据的大语言模型(LLM)推荐系统的创新方法。本文提出了一种新的生成推荐 LLM (GenRec) ，它利用 LLM 的表达能力直接生成推荐的目标项目，而不是像传统的区分推荐那样逐个计算每个候选项目的排名得分。GenRec 使用 LLM 的理解能力来解释上下文、学习用户偏好并生成相关推荐。我们提出的方法利用大型语言模型中编码的大量知识来完成推荐任务。我们首先制定专门的提示来增强 LLM 理解推荐任务的能力。随后，我们使用这些提示对用户-项目交互的数据集(由文本数据表示)上的 LLaMA 主干 LLM 进行微调，以捕获用户偏好和项目特征。我们的研究强调了基于 LLM 的生成式推荐在推荐系统领域革命性变革中的潜力，并为未来该领域的探索提供了一个基础框架。我们在基准数据集上进行了广泛的实验，实验结果表明我们的 GenRec 在大数据集上有明显的更好的结果。	code	0
News Gathering: Leveraging Transformers to Rank News	Carlos Muñoz, María José Apolo, Maximiliano Ojeda, Hans Lobel, Marcelo Mendoza				code	0
Answer Retrieval in Legal Community Question Answering	Arian Askari, Zihui Yang, Zhaochun Ren, Suzan Verberne		The task of answer retrieval in the legal domain aims to help users to seek relevant legal advice from massive amounts of professional responses. Two main challenges hinder applying existing answer retrieval approaches in other domains to the legal domain: (1) a huge knowledge gap between lawyers and non-professionals; and (2) a mix of informal and formal content on legal QA websites. To tackle these challenges, we propose CE_FS, a novel cross-encoder (CE) re-ranker based on the fine-grained structured inputs. CE_FS uses additional structured information in the CQA data to improve the effectiveness of cross-encoder re-rankers. Furthermore, we propose LegalQA: a real-world benchmark dataset for evaluating answer retrieval in the legal domain. Experiments conducted on LegalQA show that our proposed method significantly outperforms strong cross-encoder re-rankers fine-tuned on MS MARCO. Our novel finding is that adding the question tags of each question besides the question description and title into the input of cross-encoder re-rankers structurally boosts the rankers' effectiveness. While we study our proposed method in the legal domain, we believe that our method can be applied in similar applications in other domains.	法律领域的答案检索任务旨在帮助用户从大量的专业答复中寻求相关的法律咨询。两个主要的挑战阻碍了现有的回答检索方法在其他领域的法律领域: (1)律师和非专业人士之间的巨大知识差距; 和(2)在法律质量保证网站的非正式和正式内容的组合。为了解决这些问题，我们提出了一种基于细粒度结构化输入的交叉编码器(CE)重排序算法 CE _ FS。CE _ FS 在 CQA 数据中使用额外的结构化信息来提高交叉编码器重新排序的有效性。此外，我们提出了 LegalQA: 一个真实世界的基准数据集，用于评估法律领域的答案检索。在 LegalQA 上进行的实验表明，本文提出的方法明显优于在 MS MARCO 上进行微调的强交叉编码器重排序器。我们的新发现是，除了问题描述和题名之外，在交叉编码器重新排序的输入中增加每个问题的问题标签，从结构上提高了排序的有效性。在法律领域研究我们提出的方法的同时，我们相信我们的方法可以应用于其他领域的类似应用。	code	0
Towards Optimizing Ranking in Grid-Layout for Provider-Side Fairness	Amifa Raj, Michael D. Ekstrand				code	0
A Conversational Robot for Children's Access to a Cultural Heritage Multimedia Archive	Thomas Beelen, Roeland Ordelman, Khiet P. Truong, Vanessa Evers, Theo Huibers				code	0
MathMex: Search Engine for Math Definitions	Shea Durgin, James Gore, Behrooz Mansouri				code	0
XSearchKG: A Platform for Explainable Keyword Search over Knowledge Graphs	Leila Feddoul, Martin Birke, Sirko Schindler				code	0
Result Assessment Tool: Software to Support Studies Based on Data from Search Engines	Sebastian Sünkler, Nurce Yagci, Sebastian Schultheiß, Sonja von Mach, Dirk Lewandowski				code	0
Translating Justice: A Cross-Lingual Information Retrieval System for Maltese Case Law Documents	Joel Azzopardi				code	0
Displaying Evolving Events Via Hierarchical Information Threads for Sensitivity Review	Hitarth Narvala, Graham McDonald, Iadh Ounis				code	0
Analyzing Mathematical Content for Plagiarism and Recommendations	Ankit Satpute				code	0
Explainable Recommender Systems with Knowledge Graphs and Language Models	Giacomo Balloccu, Ludovico Boratto, Gianni Fenu, Francesca Maridina Malloci, Mirko Marras				code	0
Recent Advances in Generative Information Retrieval	Yubao Tang, Ruqing Zhang, Zhaochun Ren, Jiafeng Guo, Maarten de Rijke				code	0
Affective Computing for Social Good Applications: Current Advances, Gaps and Opportunities in Conversational Setting	Priyanshu Priya, Mauajama Firdaus, Gopendra Vikram Singh, Asif Ekbal				code	0
Query Performance Prediction: From Fundamentals to Advanced Techniques	Negar Arabzadeh, Chuan Meng, Mohammad Aliannejadi, Ebrahim Bagheri				code	0
Fairness Through Domain Awareness: Mitigating Popularity Bias for Music Discovery	Rebecca Salganik, Fernando Diaz, Golnoosh Farnadi		As online music platforms grow, music recommender systems play a vital role in helping users navigate and discover content within their vast musical databases. At odds with this larger goal, is the presence of popularity bias, which causes algorithmic systems to favor mainstream content over, potentially more relevant, but niche items. In this work we explore the intrinsic relationship between music discovery and popularity bias. To mitigate this issue we propose a domain-aware, individual fairness-based approach which addresses popularity bias in graph neural network (GNNs) based recommender systems. Our approach uses individual fairness to reflect a ground truth listening experience, i.e., if two songs sound similar, this similarity should be reflected in their representations. In doing so, we facilitate meaningful music discovery that is robust to popularity bias and grounded in the music domain. We apply our BOOST methodology to two discovery based tasks, performing recommendations at both the playlist level and user level. Then, we ground our evaluation in the cold start setting, showing that our approach outperforms existing fairness benchmarks in both performance and recommendation of lesser-known content. Finally, our analysis explains why our proposed methodology is a novel and promising approach to mitigating popularity bias and improving the discovery of new and niche content in music recommender systems.	
随着在线音乐平台的发展，音乐推荐系统在帮助用户浏览和发现其庞大的音乐数据库中的内容方面发挥着至关重要的作用。与这个更大的目标不一致的是流行偏见的存在，它导致算法系统偏爱主流内容，而不是潜在的更相关的，但是利基项目。在这项工作中，我们探讨音乐发现和流行偏见之间的内在关系。为了缓解这一问题，我们提出了一种基于领域感知的、基于个体公平性的方法，该方法解决了基于图神经网络(GNN)的推荐系统中的流行偏差问题。我们的方法使用个人的公平性来反映一个基本的真理倾听经验，也就是说，如果两首歌听起来相似，这种相似性应该反映在他们的表述中。这样做，我们促进了有意义的音乐发现，这是强大的流行偏见，并在音乐领域的基础。我们将 BOOST 方法应用于两个基于发现的任务，在播放列表级别和用户级别执行建议。然后，我们在冷启动环境下进行评估，结果表明我们的方法在性能和推荐不太知名内容方面都优于现有的公平性基准。最后，我们的分析解释了为什么我们提出的方法是一个新颖和有前途的方法，以减少流行偏见和改善发现新的和利基内容的音乐推荐系统。	code	0
Countering Mainstream Bias via End-to-End Adaptive Local Learning	Jinhao Pan, Ziwei Zhu, Jianling Wang, Allen Lin, James Caverlee		Collaborative filtering (CF) based recommendations suffer from mainstream bias – where mainstream users are favored over niche users, leading to poor recommendation quality for many long-tail users. In this paper, we identify two root causes of this mainstream bias: (i) discrepancy modeling, whereby CF algorithms focus on modeling mainstream users while neglecting niche users with unique preferences; and (ii) unsynchronized learning, where niche users require more training epochs than mainstream users to reach peak performance. Targeting these causes, we propose a novel end-To-end Adaptive Local Learning (TALL) framework to provide high-quality recommendations to both mainstream and niche users. TALL uses a loss-driven Mixture-of-Experts module to adaptively ensemble experts to provide customized local models for different users. Further, it contains an adaptive weight module to synchronize the learning paces of different users by dynamically adjusting weights in the loss. Extensive experiments demonstrate the state-of-the-art performance of the proposed model. Code and data are provided at https://github.com/JP-25/end-To-end-Adaptive-Local-Leanring-TALL-	基于协同过滤(CF)的推荐受到主流偏见的影响——主流用户比小众用户更受青睐，导致许多长尾用户的推荐质量较差。在本文中，我们确定了这种主流偏见的两个根本原因: (i)差异建模，即 CF 算法侧重于建模主流用户，而忽视具有独特偏好的小生境用户; 和(ii)非同步学习，其中小生境用户需要比主流用户更多的训练周期才能达到峰值性能。针对这些原因，我们提出了一个新颖的端到端适应性本地学习(TALL)框架，为主流和小众用户提供高质量的建议。TALL 使用一个损耗驱动的专家混合模块来自适应地集成专家，为不同的用户提供定制的本地模型。此外，它还包含一个自适应权重模块，通过动态调整权重来同步不同用户的学习步伐。大量的实验证明了该模型的最新性能。代码和数据载于 https://github.com/jp-25/end-to-end-adaptive-local-leanring-tall-	code	0
BioASQ at CLEF2024: The Twelfth Edition of the Large-Scale Biomedical Semantic Indexing and Question Answering Challenge	Anastasios Nentidis, Anastasia Krithara, Georgios Paliouras, Martin Krallinger, Luis Gascó Sánchez, Salvador LimaLópez, Eulàlia Farré, Natalia V. Loukachevitch, Vera Davydova, Elena Tutubalina				code	0
ProMap: Product Mapping Datasets	Katerina Macková, Martin Pilát				code	0
Eliminating Contextual Bias in Aspect-Based Sentiment Analysis	Ruize An, Chen Zhang, Dawei Song				code	0
A Streaming Approach to Neural Team Formation Training	Hossein Fani, Reza Barzegar, Arman Dashti, Mahdis Saeedi				code	0
A Second Look on BASS - Boosting Abstractive Summarization with Unified Semantic Graphs - A Replication Study	Osman Alperen Koras, Jörg Schlötterer, Christin Seifert		We present a detailed replication study of the BASS framework, an abstractive summarization system based on the notion of Unified Semantic Graphs. Our investigation includes challenges in replicating key components and an ablation study to systematically isolate error sources rooted in replicating novel components. Our findings reveal discrepancies in performance compared to the original work. We highlight the significance of paying careful attention even to reasonably omitted details for replicating advanced frameworks like BASS, and emphasize key practices for writing replicable papers.	我们提出了一个详细的复制研究的 BASS 框架，一个抽象的摘要系统的概念为基础的统一语义图。我们的研究包括复制关键组件的挑战和一项消融研究，以系统地隔离根植于复制新组件的错误源。我们的发现揭示了与原始工作相比在性能上的差异。我们强调认真注意甚至合理忽略复制高级框架(如 BASS)的细节的重要性，并强调编写可复制论文的关键实践。	code	0
Absolute Variation Distance: An Inversion Attack Evaluation Metric for Federated Learning	Georgios Papadopoulos, Yash Satsangi, Shaltiel Eloul, Marco Pistoia				code	0
Experiments in News Bias Detection with Pre-trained Neural Transformers	Tim Menzner, Jochen L. Leidner				code	0
A Transformer-Based Object-Centric Approach for Date Estimation of Historical Photographs	Francesc Net, Núria Hernández, Adrià Molina, Lluís Gómez				code	0
Bias Detection and Mitigation in Textual Data: A Study on Fake News and Hate Speech Detection	Apostolos Kasampalis, Despoina Chatzakou, Theodora Tsikrika, Stefanos Vrochidis, Ioannis Kompatsiaris				code	0
DQNC2S: DQN-Based Cross-Stream Crisis Event Summarizer	Daniele Rege Cambrin, Luca Cagliero, Paolo Garza		Summarizing multiple disaster-relevant data streams simultaneously is particularly challenging as existing Retrieve&Re-ranking strategies suffer from the inherent redundancy of multi-stream data and limited scalability in a multi-query setting. This work proposes an online approach to crisis timeline generation based on weak annotation with Deep Q-Networks. It selects on-the-fly the relevant pieces of text without requiring neither human annotations nor content re-ranking. This makes the inference time independent of the number of input queries. The proposed approach also incorporates a redundancy filter into the reward function to effectively handle cross-stream content overlaps. The achieved ROUGE and BERTScore results are superior to those of best-performing models on the CrisisFACTS 2022 benchmark.	同时汇总多个与灾难相关的数据流尤其具有挑战性，因为现有的检索和重新排序策略受到多流数据的固有冗余和多查询设置中有限的可伸缩性的影响。提出了一种基于深度 Q 网络弱注释的危机时间表在线生成方法。它动态地选择相关的文本片段，而不需要人工注释或内容重新排序。这使得推理时间与输入查询的数量无关。该方法还在奖励函数中引入了冗余过滤器，以有效地处理跨流内容重叠。所获得的 ROUGE 和 BERTScore 结果优于那些在 CrisisFACTS 2022基准上表现最好的模型。	code	0
QuantPlorer: Exploration of Quantities in Text	Satya Almasian, Alexander Kosnac, Michael Gertz				code	0
ARElight: Context Sampling of Large Texts for Deep Learning Relation Extraction	Nicolay Rusnachenko, Huizhi Liang, Maksim Kalameyets, Lei Shi				code	0
Variance Reduction in Ratio Metrics for Efficient Online Experiments	Shubham Baweja, Neeti Pokharna, Aleksei Ustimenko, Olivier Jeunen		Online controlled experiments, such as A/B-tests, are commonly used by modern tech companies to enable continuous system improvements. Despite their paramount importance, A/B-tests are expensive: by their very definition, a percentage of traffic is assigned an inferior system variant. To ensure statistical significance on top-level metrics, online experiments typically run for several weeks. Even then, a considerable amount of experiments will lead to inconclusive results (i.e. false negatives, or type-II error). The main culprit for this inefficiency is the variance of the online metrics. Variance reduction techniques have been proposed in the literature, but their direct applicability to commonly used ratio metrics (e.g. click-through rate or user retention) is limited. In this work, we successfully apply variance reduction techniques to ratio metrics on a large-scale short-video platform: ShareChat. Our empirical results show that we can either improve A/B-test confidence in 77% [...] retain the same level of confidence with 30% [...] show that the common approach of including as many covariates as possible in regression is counter-productive, highlighting that control variates based on Gradient-Boosted Decision Tree predictors are most effective. We discuss the practicalities of implementing these methods at scale and showcase the cost reduction they beget.	在线控制实验，如 A/B 测试，通常被现代科技公司用来实现持续的系统改进。尽管 A/B 测试非常重要，但它们的成本很高: 根据它们的定义，一定比例的流量被分配给一个劣质的系统变体。为了确保顶级指标的统计显著性，在线实验通常要运行数周。即使这样，大量的实验也会导致不确定的结果(例如，假阴性，或 II 型错误)。这种低效率的罪魁祸首是在线指标的方差。文献中已经提出了减少方差的技术，但它们对常用比率指标(如点击率或用户保留)的直接适用性是有限的。在这项工作中，我们成功地将方差减少技术应用到一个大规模短视频平台 ShareChat 的比率指标上。我们的实证结果显示，我们可以在77% [...] 提高 A/B 检验置信度 [...] 以30% [...] 保持相同的置信水平 [...] 表明在回归中包含尽可能多的协变量的常见方法是适得其反的，并强调基于梯度增强决策树预测器的控制变量是最有效的。我们讨论了在规模上实施这些方法的实用性，并展示了它们带来的成本降低。	code	0
CLEF 2024 SimpleText Track - Improving Access to Scientific Texts for Everyone	Liana Ermakova, Eric SanJuan, Stéphane Huet, Hosein Azarbonyad, Giorgio Maria Di Nunzio, Federica Vezzani, Jennifer D'Souza, Salomon Kabongo, Hamed Babaei Giglou, Yue Zhang, Sören Auer, Jaap Kamps				code	0
LifeCLEF 2024 Teaser: Challenges on Species Distribution Prediction and Identification	Alexis Joly, Lukás Picek, Stefan Kahl, Hervé Goëau, Vincent Espitalier, Christophe Botella, Benjamin Deneu, Diego Marcos, Joaquim Estopinan, César Leblanc, Théo Larcher, Milan Sulc, Marek Hrúz, Maximilien Servajean, Jirí Matas, Hervé Glotin, Robert Planqué, WillemPier Vellinga, Holger Klinck, Tom Denton, Andrew M. Durso, Ivan Eggel, Pierre Bonnet, Henning Müller				code	0
The CLEF 2024 Monster Track: One Lab to Rule Them All	Nicola Ferro, Julio Gonzalo, Jussi Karlgren, Henning Müller				code	0
CLEF 2024 JOKER Lab: Automatic Humour Analysis	Liana Ermakova, AnneGwenn Bosser, Tristan Miller, Tremaine Thomas, Victor Manuel PalmaPreciado, Grigori Sidorov, Adam Jatowt				code	0
iDPP@CLEF 2024: The Intelligent Disease Progression Prediction Challenge	Helena Aidos, Roberto Bergamaschi, Paola Cavalla, Adriano Chiò, Arianna Dagliati, Barbara Di Camillo, Mamede de Carvalho, Nicola Ferro, Piero Fariselli, Jose Manuel García Dominguez, Sara C. Madeira, Eleonora Tavazzi				code	0
LongEval: Longitudinal Evaluation of Model Performance at CLEF 2024	Rabab Alkhalifa, Hsuvas Borkakoty, Romain Deveaud, Alaa ElEbshihy, Luis Espinosa Anke, Tobias Fink, Gabriela González Sáez, Petra Galuscáková, Lorraine Goeuriot, David Iommi, Maria Liakata, Harish Tayyar Madabushi, Pablo MedinaAlias, Philippe Mulhem, Florina Piroi, Martin Popel, Christophe Servan, Arkaitz Zubiaga				code	0
CrisisKAN: Knowledge-Infused and Explainable Multimodal Attention Network for Crisis Event Classification	Shubham Gupta, Nandini Saini, Suman Kundu, Debasis Das		Pervasive use of social media has become the emerging source for real-time information (like images, text, or both) to identify various events. Despite the rapid growth of image and text-based event classification, the state-of-the-art (SOTA) models find it challenging to bridge the semantic gap between features of image and text modalities due to inconsistent encoding. Also, the black-box nature of models fails to explain the model's outcomes for building trust in high-stakes situations such as disasters, pandemic. Additionally, the word limit imposed on social media posts can potentially introduce bias towards specific events. To address these issues, we proposed CrisisKAN, a novel Knowledge-infused and Explainable Multimodal Attention Network that entails images and texts in conjunction with external knowledge from Wikipedia to classify crisis events. To enrich the context-specific understanding of textual information, we integrated Wikipedia knowledge using proposed wiki extraction algorithm. Along with this, a guided cross-attention module is implemented to fill the semantic gap in integrating visual and textual data. In order to ensure reliability, we employ a model-specific approach called Gradient-weighted Class Activation Mapping (Grad-CAM) that provides a robust explanation of the predictions of the proposed model. The comprehensive experiments conducted on the CrisisMMD dataset yield in-depth analysis across various crisis-specific tasks and settings. As a result, CrisisKAN outperforms existing SOTA methodologies and provides a novel view in the domain of explainable multimodal event classification.	
社交媒体的广泛使用已经成为实时信息(如图像、文本或两者)的新兴来源，用于识别各种事件。尽管图像和基于文本的事件分类发展迅速，但是由于编码不一致，最新的 SOTA 模型在消除图像特征和文本模式之间的语义鸿沟方面遇到了挑战。此外，模型的黑盒子性质也无法解释模型在灾难、流行病等高风险情况下建立信任的结果。此外，对社交媒体帖子的字数限制可能会引起对特定事件的偏见。为了解决这些问题，我们提出了 CrisisKAN，一个新颖的知识注入和可解释的多模式注意力网络，将图像和文本与来自维基百科的外部知识结合起来，对危机事件进行分类。为了丰富文本信息的上下文特定理解，我们使用提出的 wiki 抽取算法集成 Wikipedia 知识。与此同时，引导交叉注意模块的实现，以填补在整合视觉和文本数据的语义差距。为了确保可靠性，我们采用了一种特定于模型的方法，称为梯度加权类激活映射(Grad-CAM) ，它为所提出的模型的预测提供了一个稳健的解释。在 CrisisMMD 数据集上进行的综合实验产生了对各种危机特定任务和设置的深入分析。因此，CrisisKAN 优于现有的 SOTA 方法，在可解释多模态事件分类领域提供了一种新的视角。	code	0
Probing Pretrained Language Models with Hierarchy Properties	Jesús LovónMelgarejo, José G. Moreno, Romaric Besançon, Olivier Ferret, Lynda Tamine		Since Pretrained Language Models (PLMs) are the cornerstone of the most recent Information Retrieval (IR) models, the way they encode semantic knowledge is particularly important. However, little attention has been given to studying the PLMs' capability to capture hierarchical semantic knowledge. Traditionally, evaluating such knowledge encoded in PLMs relies on their performance on a task-dependent evaluation approach based on proxy tasks, such as hypernymy detection. Unfortunately, this approach potentially ignores other implicit and complex taxonomic relations. In this work, we propose a task-agnostic evaluation method able to evaluate to what extent PLMs can capture complex taxonomy relations, such as ancestors and siblings. The evaluation is based on intrinsic properties that capture the hierarchical nature of taxonomies. Our experimental evaluation shows that the lexico-semantic knowledge implicitly encoded in PLMs does not always capture hierarchical relations. We further demonstrate that the proposed properties can be injected into PLMs to improve their understanding of hierarchy. Through evaluations on taxonomy reconstruction, hypernym discovery and reading comprehension tasks, we show that the knowledge about hierarchy is moderately but not systematically transferable across tasks.	由于预训练语言模型是最新的信息检索模型的基石，因此它们编码语义知识的方式尤为重要。然而，对于 PLM 获取层次化语义知识的能力的研究却很少被关注。传统上，评估编码在 PLM 中的此类知识依赖于基于代理任务的任务相关评估方法，如上位词检测。不幸的是，这种方法可能忽略了其他隐式和复杂的分类关系。在这项工作中，我们提出了一个任务无关的评估方法，能够评估 PLM 在多大程度上可以捕获复杂的分类关系，如祖先和兄弟姐妹。评估基于捕获分类法的层次性质的内在属性。实验结果表明，PLM 中隐含的词汇语义知识并不总是能够捕获层次关系。我们进一步证明了所提议的属性可以被注入到 PLM 中，以提高它们对层次结构的理解。通过对分类学重建、上位词发现和阅读理解任务的评估，我们发现关于等级的知识适度但不能系统地跨任务转移。	code	0
HyperPIE: Hyperparameter Information Extraction from Scientific Publications	Tarek Saier, Mayumi Ohta, Takuto Asakura, Michael Färber		Automatic extraction of information from publications is key to making scientific knowledge machine readable at a large scale. The extracted information can, for example, facilitate academic search, decision making, and knowledge graph construction. An important type of information not covered by existing approaches is hyperparameters. In this paper, we formalize and tackle hyperparameter information extraction (HyperPIE) as an entity recognition and relation extraction task. We create a labeled data set covering publications from a variety of computer science disciplines. Using this data set, we train and evaluate BERT-based fine-tuned models as well as five large language models: GPT-3.5, GALACTICA, Falcon, Vicuna, and WizardLM. For fine-tuned models, we develop a relation extraction approach that achieves an improvement of 29% [...] develop an approach leveraging YAML output for structured data extraction, which achieves an average improvement of 5.5% [...] using JSON. With our best performing model we extract hyperparameter information from a large number of unannotated papers, and analyze patterns across disciplines. All our data and source code is publicly available at https://github.com/IllDepence/hyperpie	从出版物中自动提取信息是使科学知识大规模机器可读的关键。提取的信息可以方便学术搜索、决策制定和知识图的构建。现有方法未涵盖的一种重要类型的信息是超参数。在本文中，我们将超参数信息抽取(HyperPIE)形式化并处理为一个实体识别和关系提取任务。我们创建了一个标签数据集，涵盖了来自各种计算机科学学科的出版物。使用这个数据集，我们训练和评估基于 BERT 的微调模型以及五种大型语言模型: GPT-3.5、 GALACTICA、 Falcon、 Vicuna 和 WizardLM。对于微调模型，我们开发了一种关系提取方法，它实现了29% [...] 的改进 [...] 开发了一种利用 YAML 输出进行结构化数据提取的方法，其平均改进为5.5% [...] 使用 JSON。使用性能最好的模型，我们从大量未注释的论文中提取超参数信息，并分析跨学科的模式。我们所有的数据和源代码都可以在 https://github.com/illdepence/hyperpie 上公开获取。	code	0
An EcoSage Assistant: Towards Building A Multimodal Plant Care Dialogue Assistant	Mohit Tomar, Abhisek Tiwari, Tulika Saha, Prince Jha, Sriparna Saha		In recent times, there has been an increasing awareness about imminent environmental challenges, resulting in people showing a stronger dedication to taking care of the environment and nurturing green life. The current $19.6 billion indoor gardening industry, reflective of this growing sentiment, not only signifies a monetary value but also speaks of a profound human desire to reconnect with the natural world. However, several recent surveys cast a revealing light on the fate of plants within our care, with more than half succumbing primarily due to the silent menace of improper care. Thus, the need for accessible expertise capable of assisting and guiding individuals through the intricacies of plant care has become paramount more than ever. In this work, we make the very first attempt at building a plant care assistant, which aims to assist people with plant(-ing) concerns through conversations. We propose a plant care conversational dataset named Plantational, which contains around 1K dialogues between users and plant care experts. Our end-to-end proposed approach is two-fold : (i) We first benchmark the dataset with the help of various large language models (LLMs) and visual language model (VLM) by studying the impact of instruction tuning (zero-shot and few-shot prompting) and fine-tuning techniques on this task; (ii) finally, we build EcoSage, a multi-modal plant care assisting dialogue generation framework, incorporating an adapter-based modality infusion using a gated mechanism. We performed an extensive examination (both automated and manual evaluation) of the performance exhibited by various LLMs and VLM in the generation of the domain-specific dialogue responses to underscore the respective strengths and weaknesses of these diverse models.	
近年来，人们越来越认识到迫在眉睫的环境挑战，因此人们更加致力于保护环境和培育绿色生活。目前196亿美元的室内园艺产业，反映了这种日益增长的情绪，不仅意味着货币价值，而且表明了人类与自然世界重新建立联系的强烈愿望。然而，最近的一些调查揭示了我们所照料的植物的命运，超过一半的植物死亡主要是由于不当照料的无声威胁。因此，现在比以往任何时候都更需要能够帮助和指导个人通过复杂的植物护理的可获得的专业知识。在这项工作中，我们首次尝试构建一个植物护理助手，其目的是通过对话帮助人们处理植物问题。我们提出了一个名为 Plantational 的植物护理会话数据集，它包含用户和植物护理专家之间大约1K 的对话。我们提出的端到端的方法是双重的: (i)我们首先在各种大型语言模型(LLM)和可视化语言模型(VLM)的帮助下，通过研究指令调优(零拍摄和少拍摄提示)和微调技术对这项任务的影响来测试数据集; (ii)最后，我们构建 EcoSage，一个多模态植物护理辅助对话生成框架，使用门控机制结合基于适配器的模式输入。我们对各种 LLM 和 VLM 在生成特定领域的对话响应时所展示的性能进行了广泛的检查(包括自动和手动评估) ，以强调这些不同模型各自的优缺点。	code	0
Controllable Decontextualization of Yes/No Question and Answers into Factual Statements	Lingbo Mo, Besnik Fetahu, Oleg Rokhlenko, Shervin Malmasi				code	0
Reading Between the Frames: Multi-modal Depression Detection in Videos from Non-verbal Cues	David GimenoGómez, AnaMaria Bucur, Adrian Cosma, Carlos David MartínezHinarejos, Paolo Rosso		Depression, a prominent contributor to global disability, affects a substantial portion of the population. Efforts to detect depression from social media texts have been prevalent, yet only a few works explored depression detection from user-generated video content. In this work, we address this research gap by proposing a simple and flexible multi-modal temporal model capable of discerning non-verbal depression cues from diverse modalities in noisy, real-world videos. We show that, for in-the-wild videos, using additional high-level non-verbal cues is crucial to achieving good performance, and we extracted and processed audio speech embeddings, face emotion embeddings, face, body and hand landmarks, and gaze and blinking information. Through extensive experiments, we show that our model achieves state-of-the-art results on three key benchmark datasets for depression detection from video by a substantial margin. Our code is publicly available on GitHub.	抑郁症是导致全球性残疾的一个重要因素，影响着相当一部分人口。从社交媒体文本中检测抑郁症的努力已经很普遍，然而只有少数作品探索了从用户生成的视频内容中检测抑郁症。在这项工作中，我们通过提出一个简单而灵活的多模态时间模型来解决这一研究差距，该模型能够从嘈杂的现实世界视频中的不同模式中辨别出非语言性抑郁的线索。我们表明，对于野外视频，使用额外的高水平非语言线索对获得良好的表现至关重要，我们提取和处理语音嵌入，面部情感嵌入，面部，身体和手的地标，以及凝视和眨眼信息。通过广泛的实验，我们表明，我们的模型实现了国家的最先进的结果，三个关键的基准数据集抑郁症检测从视频相当大的幅度。我们的代码在 GitHub 上公开可用。	code	0
Investigating the Effects of Sparse Attention on Cross-Encoders	Ferdinand Schlatt, Maik Fröbe, Matthias Hagen		Cross-encoders are effective passage and document re-rankers but less efficient than other neural or classic retrieval models. A few previous studies have applied windowed self-attention to make cross-encoders more efficient. However, these studies did not investigate the potential and limits of different attention patterns or window sizes. We close this gap and systematically analyze how token interactions can be reduced without harming the re-ranking effectiveness. Experimenting with asymmetric attention and different window sizes, we find that the query tokens do not need to attend to the passage or document tokens for effective re-ranking and that very small window sizes suffice. In our experiments, even windows of 4 tokens still yield effectiveness on par with previous cross-encoders while reducing the memory requirements to at most 78% [...] for passages / documents.	交叉编码器是有效的段落和文档重排序器，但效率低于其他神经或经典检索模型。以前的一些研究已经应用窗口自我注意，使交叉编码器更有效率。然而，这些研究并没有调查不同注意模式或窗口大小的潜力和局限性。我们缩小了这一差距，并系统地分析了如何在不损害重新排序效果的情况下减少令牌交互。通过不对称注意和不同窗口大小的实验，我们发现查询标记不需要注意段落或文档标记来进行有效的重新排序，非常小的窗口大小就足够了。在我们的实验中，即使是4个令牌的窗口也能产生与以前的交叉编码器相当的效果，同时将段落/文档的内存需求降低到最多78% [...]。	code	0
SumBlogger: Abstractive Summarization of Large Collections of Scientific Articles	Pavlos Zakkas, Suzan Verberne, Jakub Zavrel				code	0
Role-Guided Contrastive Learning for Event Argument Extraction	Chunyu Yao, Yi Guo, Xue Chen, Zhenzhen Duan, Jiaojiao Fu				code	0
Attend All Options at Once: Full Context Input for Multi-choice Reading Comprehension	Runda Wang, Suzan Verberne, Marco Spruit				code	0
Zero-Shot Generative Large Language Models for Systematic Review Screening Automation	Shuai Wang, Harrisen Scells, Shengyao Zhuang, Martin Potthast, Bevan Koopman, Guido Zuccon		Systematic reviews are crucial for evidence-based medicine as they comprehensively analyse published research findings on specific questions. Conducting such reviews is often resource- and time-intensive, especially in the screening phase, where abstracts of publications are assessed for inclusion in a review. This study investigates the effectiveness of using zero-shot large language models (LLMs) for automatic screening. We evaluate the effectiveness of eight different LLMs and investigate a calibration technique that uses a predefined recall threshold to determine whether a publication should be included in a systematic review. Our comprehensive evaluation using five standard test collections shows that instruction fine-tuning plays an important role in screening, that calibration renders LLMs practical for achieving a targeted recall, and that combining both with an ensemble of zero-shot models saves significant screening time compared to state-of-the-art approaches.	系统评价对于循证医学来说至关重要，因为它们全面分析了已发表的关于具体问题的研究成果。进行这种审查往往需要大量资源和时间，特别是在筛选阶段，对出版物摘要进行评估以纳入审查。本研究旨在探讨使用大语言模型(LLM)进行自动筛选的有效性。我们评估了8种不同 LLM 的有效性，并研究了一种校准技术，该技术使用预定义的召回阈值来确定是否应该将出版物纳入系统综述。我们使用五个标准测试集进行的综合评估表明，指令微调在筛选中起着重要作用，校准使 LLM 实用于实现有针对性的召回，并且与一系列零拍模型相结合，与最先进的方法相比节省了显着的筛选时间。	code	0
WebSAM-Adapter: Adapting Segment Anything Model for Web Page Segmentation	Bowen Ren, Zefeng Qian, Yuchen Sun, Chao Gao, Chongyang Zhang				code	0
A Phrase-Level Attention Enhanced CRF for Keyphrase Extraction	Shinian Li, Tao Jiang, Yuxiang Zhang				code	0
Taxonomy of Mathematical Plagiarism	Ankit Satpute, André GreinerPetter, Noah Gießing, Isabel Beckenbach, Moritz Schubotz, Olaf Teschke, Akiko Aizawa, Bela Gipp		Plagiarism is a pressing concern, even more so with the availability of large language models. Existing plagiarism detection systems reliably find copied and moderately reworded text but fail for idea plagiarism, especially in mathematical science, which heavily uses formal mathematical notation. We make two contributions. First, we establish a taxonomy of mathematical content reuse by annotating potentially plagiarised 122 scientific document pairs. Second, we analyze the best-performing approaches to detect plagiarism and mathematical content similarity on the newly established taxonomy. We found that the best-performing methods for plagiarism and math content similarity achieve an overall detection score (PlagDet) of 0.06 and 0.16, respectively. The best-performing methods failed to detect most cases from all seven newly established math similarity types. Outlined contributions will benefit research in plagiarism detection systems, recommender systems, question-answering systems, and search engines. We make our experiment's code and annotated dataset available to the community: https://github.com/gipplab/Taxonomy-of-Mathematical-Plagiarism	剽窃是一个迫切需要关注的问题，在大型语言模型的可用性方面更是如此。现有的剽窃检测系统可靠地发现抄袭和适度重写的文本，但无法发现思想剽窃，尤其是在大量使用正式数学符号的数学科学领域。我们有两个贡献。首先，我们通过对可能抄袭的122个科学文档对进行注释，建立了数学内容重用的分类。其次，我们分析了在新建立的分类法中检测剽窃和数学内容相似性的最佳方法。我们发现表现最好的剽窃和数学内容相似性方法的总体检测得分(PlagDet)分别为0.06和0.16。表现最好的方法无法从所有七个新建立的数学相似性类型中检测出大多数案例。概述的贡献将有助于剽窃检测系统、推荐系统、问答系统和搜索引擎的研究。我们将我们实验的代码和注释数据集提供给社区: https://github.com/gipplab/taxonomy-of-mathematical-plagiarism	code	0
Unraveling Disagreement Constituents in Hateful Speech	Giulia Rizzi, Alessandro Astorino, Paolo Rosso, Elisabetta Fersini				code	0
SoftQE: Learned Representations of Queries Expanded by LLMs	Varad Pimpalkhute, John Heyer, Xusen Yin, Sameer Gupta		We investigate the integration of Large Language Models (LLMs) into query encoders to improve dense retrieval without increasing latency and cost, by circumventing the dependency on LLMs at inference time. SoftQE incorporates knowledge from LLMs by mapping embeddings of input queries to those of the LLM-expanded queries. While improvements over various strong baselines on in-domain MS-MARCO metrics are marginal, SoftQE improves performance by 2.83 absolute percentage points on average on five out-of-domain BEIR tasks.	我们研究了如何将大语言模型(LLM)集成到查询编码器中，通过在推理时避免对 LLM 的依赖，在不增加延迟和成本的情况下提高密集检索。SoftQE 通过将输入查询的嵌入映射到 LLM 扩展查询的嵌入来整合来自 LLM 的知识。虽然在域内 MS-MARCO 指标的各种强基线上的改进是微乎其微的，但是在5个域外 BEIR 任务上，SoftQE 平均提高了2.83个绝对百分点的性能。	code	0
Optimizing BERTopic: Analysis and Reproducibility Study of Parameter Influences on Topic Modeling	Martin Borcin, Joemon M. Jose				code	0
A Reproducibility Study of Goldilocks: Just-Right Tuning of BERT for TAR	Xinyu Mao, Bevan Koopman, Guido Zuccon		Screening documents is a tedious and time-consuming aspect of high-recall retrieval tasks, such as compiling a systematic literature review, where the goal is to identify all relevant documents for a topic. To help streamline this process, many Technology-Assisted Review (TAR) methods leverage active learning techniques to reduce the number of documents requiring review. BERT-based models have shown high effectiveness in text classification, leading to interest in their potential use in TAR workflows. In this paper, we investigate recent work that examined the impact of further pre-training epochs on the effectiveness and efficiency of a BERT-based active learning pipeline. We first report that we could replicate the original experiments on two specific TAR datasets, confirming some of the findings: importantly, that further pre-training is critical to high effectiveness, but requires attention in terms of selecting the correct training epoch. We then investigate the generalisability of the pipeline on a different TAR task, that of medical systematic reviews. In this context, we show that there is no need for further pre-training if a domain-specific BERT backbone is used within the active learning pipeline. This finding provides practical implications for using the studied active learning pipeline within domain-specific TAR tasks.	筛选文档是高召回率检索任务的一个乏味和耗时的方面，例如编写系统的文献综述，其目标是确定某个主题的所有相关文档。为了帮助简化这个过程，许多技术辅助评审(TAR)方法利用主动学习技术来减少需要评审的文档数量。基于 BERT 的模型在文本分类方面表现出了很高的效率，这使得人们对它们在 TAR 工作流中的潜在用途产生了兴趣。在本文中，我们调查了最近的工作，检查进一步的预训练时代对基于 BERT 的主动学习流水线的有效性和效率的影响。我们首先报告说，我们可以在两个特定的 TAR 数据集上复制原始实验，证实了一些发现: 重要的是，进一步的预训练对于高效性至关重要，但是在选择正确的训练时代方面需要注意。然后，我们调查管道的普遍性在一个不同的 TAR 任务，即医疗系统评价。在这种情况下，我们表明，没有必要进一步的预训练，如果领域特定的 BERT 骨干是在主动学习流水线使用。这一发现为在特定领域的 TAR 任务中使用所研究的主动学习流水线提供了实际意义。	code	0
Good for Children, Good for All?	Monica Landoni, Theo Huibers, Emiliana Murgia, Maria Soledad Pera				code	0
Mu2STS: A Multitask Multimodal Sarcasm-Humor-Differential Teacher-Student Model for Sarcastic Meme Detection	Gitanjali Kumari, Chandranath Adak, Asif Ekbal				code	0
An Adaptive Feature Selection Method for Learning-to-Enumerate Problem	Satoshi Horikawa, Chiyonosuke Nemoto, Keishi Tajima, Masaki Matsubara, Atsuyuki Morishima				code	0
Asking Questions Framework for Oral History Archives	Jan Svec, Martin Bulín, Adam Frémund, Filip Polák				code	0
Yes, This Is What I Was Looking For! Towards Multi-modal Medical Consultation Concern Summary Generation	Abhisek Tiwari, Shreyangshu Bera, Sriparna Saha, Pushpak Bhattacharyya, Samrat Ghosh		Over the past few years, the use of the Internet for healthcare-related tasks has grown by leaps and bounds, posing a challenge in effectively managing and processing information to ensure its efficient utilization. During moments of emotional turmoil and psychological challenges, we frequently turn to the internet as our initial source of support, choosing this over discussing our feelings with others due to the associated social stigma. In this paper, we propose a new task of multi-modal medical concern summary (MMCS) generation, which provides a short and precise summary of patients' major concerns brought up during the consultation. Nonverbal cues, such as patients' gestures and facial expressions, aid in accurately identifying patients' concerns. Doctors also consider patients' personal information, such as age and gender, in order to describe the medical condition appropriately. Motivated by the potential efficacy of patients' personal context and visual gestures, we propose a transformer-based multi-task, multi-modal intent-recognition, and medical concern summary generation (IR-MMCSG) system. Furthermore, we propose a multitasking framework for intent recognition and medical concern summary generation for doctor-patient consultations. We construct the first multi-modal medical concern summary generation (MM-MediConSummation) corpus, which includes patient-doctor consultations annotated with medical concern summaries, intents, patient personal information, doctor's recommendations, and keywords. 
Our experiments and analysis demonstrate (a) the significant role of patients' expressions/gestures and their personal information in intent identification and medical concern summary generation, and (b) the strong correlation between intent recognition and patients' medical concern summary generation. The dataset and source code are available at https://github.com/NLP-RL/MMCSG.	在过去几年中，互联网用于与保健有关的任务的使用突飞猛进，对有效管理和处理信息以确保其有效利用提出了挑战。在情绪动荡和心理挑战的时刻，我们经常求助于互联网作为我们最初的支持来源，由于相关的社会耻辱，我们选择这种方式而不是与他人讨论我们的感受。在本文中，我们提出了一个新的任务多模式医疗关注摘要(MMCS)生成，它提供了一个简短而精确的总结病人的主要关注在咨询过程中提出。非语言线索，如病人的手势和面部表情，有助于准确识别病人的关切。医生还会考虑病人的个人信息，如年龄和性别，以便恰当地描述病情。基于患者个人情境和视觉手势的潜在功效，我们提出了一个基于转换器的多任务、多模态意图识别和医学关注摘要生成(IR-MMCSG)系统。此外，我们提出了一个多任务框架的意图识别和医疗关注摘要生成的医患咨询。我们构建了第一个多模式医疗关注摘要生成(MM-MediConSummation)语料库，其中包括用医疗关注摘要，意图，患者个人信息，医生建议和关键字注释的患者-医生咨询。我们的实验和分析表明(a)患者的表情/手势及其个人信息在意图识别和医疗关注摘要生成中的重要作用，以及(b)意图识别和患者医疗关注摘要生成之间的强相关性。数据集和源代码可在 https://github.com/NLP-RL/MMCSG 获得。	code	0
Interactive Topic Tagging in Community Question Answering Platforms	Radin Hamidi Rad, Silviu Cucerzan, Nirupama Chandrasekaran, Michael Gamon				code	0
Mitigating Data Sparsity via Neuro-Symbolic Knowledge Transfer	Tommaso Carraro, Alessandro Daniele, Fabio Aiolli, Luciano Serafini				code	0
Enhancing Legal Named Entity Recognition Using RoBERTa-GCN with CRF: A Nuanced Approach for Fine-Grained Entity Recognition	Arihant Jain, Raksha Sharma				code	0
A Novel Multi-Stage Prompting Approach for Language Agnostic MCQ Generation Using GPT	Subhankar Maity, Aniket Deroy, Sudeshna Sarkar		We introduce a multi-stage prompting approach (MSP) for the generation of multiple choice questions (MCQs), harnessing the capabilities of GPT models such as text-davinci-003 and GPT-4, renowned for their excellence across various NLP tasks. Our approach incorporates the innovative concept of chain-of-thought prompting, a progressive technique in which the GPT model is provided with a series of interconnected cues to guide the MCQ generation process. Automated evaluations consistently demonstrate the superiority of our proposed MSP method over the traditional single-stage prompting (SSP) baseline, resulting in the production of high-quality distractors. Furthermore, the one-shot MSP technique enhances automatic evaluation results, contributing to improved distractor generation in multiple languages, including English, German, Bengali, and Hindi. In human evaluations, questions generated using our approach exhibit superior levels of grammaticality, answerability, and difficulty, highlighting its efficacy in various languages.	我们引入了一种多阶段提示方法(MSP)来生成多项选择题(MCQs) ，利用 GPT 模型的能力，如 text-davinci-003和 GPT-4，它们在各种 NLP 任务中的卓越表现而闻名。我们的方法结合了思维链激励的创新概念，这是一种渐进的技术，其中 GPT 模型提供了一系列相互关联的线索来指导 MCQ 的生成过程。自动评估一致表明，我们提出的 MSP 方法优于传统的单阶段提示(SSP)基线，从而产生了高质量的干扰器。此外，一次性 MSP 技术增强了自动评估结果，有助于改善多种语言的干扰生成，包括英语、德语、孟加拉语和印地语。在人类评价中，使用我们的方法产生的问题表现出优越的语法性、可回答性和难度水平，突出了它在各种语言中的功效。	code	0
A Study on Hierarchical Text Classification as a Seq2seq Task	Fatos Torba, Christophe Gravier, Charlotte Laclau, Abderrhammen Kammoun, Julien Subercaze				code	0
MFVIEW: Multi-modal Fake News Detection with View-Specific Information Extraction	Marium Malik, Jiaojiao Jiang, Yang Song, Sanjay Jha				code	0
Navigating Uncertainty: Optimizing API Dependency for Hallucination Reduction in Closed-Book QA	Pierre Erbacher, Louis Falissard, Vincent Guigue, Laure Soulier				code	0
Can We Predict QPP? An Approach Based on Multivariate Outliers	Adrian-Gabriel Chifu, Sébastien Déjean, Moncef Garouani, Josiane Mothe, Diégo Ortiz, Md Zia Ullah		Query performance prediction (QPP) aims to forecast the effectiveness of a search engine across a range of queries and documents. While state-of-the-art predictors offer a certain level of precision, their accuracy is not flawless. Prior research has recognized the challenges inherent in QPP but often lacks a thorough qualitative analysis. In this paper, we delve into QPP by examining the factors that influence the predictability of query performance accuracy. We propose the working hypothesis that while some queries are readily predictable, others present significant challenges. By focusing on outliers, we aim to identify the queries that are particularly challenging to predict. To this end, we employ a multivariate outlier detection method. Our results demonstrate the effectiveness of this approach in identifying queries on which QPP does not perform well, yielding less reliable predictions. Moreover, we provide evidence that excluding these hard-to-predict queries from the analysis significantly enhances the overall accuracy of QPP.	查询性能预测(QPP)旨在预测一个搜索引擎在一系列查询和文档中的有效性。虽然最先进的预测器提供了一定程度的精确性，但它们的准确性并非完美无缺。先前的研究已经认识到 QPP 固有的挑战，但往往缺乏一个彻底的定性分析。在本文中，我们通过研究影响查询性能准确性可预测性的因素来深入研究 QPP。我们提出这样一个工作假设: 尽管一些查询是容易预测的，但是其他查询会带来重大挑战。通过关注异常值，我们的目标是识别特别难以预测的查询。为此，我们采用了多元异常检测方法。我们的研究结果证明了这种方法在识别 QPP 表现不佳的查询时的有效性，从而产生了不太可靠的预测。此外，我们提供的证据表明，排除这些难以预测的查询从分析显着提高了整体准确性的 QPP。	code	0
SALSA: Salience-Based Switching Attack for Adversarial Perturbations in Fake News Detection Models	Chahat Raj, Anjishnu Mukherjee, Hemant Purohit, Antonios Anastasopoulos, Ziwei Zhu				code	0
FakeClaim: A Multiple Platform-Driven Dataset for Identification of Fake News on 2023 Israel-Hamas War	Gautam Kishore Shahi, Amit Kumar Jaiswal, Thomas Mandl		We contribute the first publicly available dataset of factual claims from different platforms and fake YouTube videos on the 2023 Israel-Hamas war for automatic fake YouTube video classification. The FakeClaim data is collected from 60 fact-checking organizations in 30 languages and enriched with metadata from the fact-checking organizations curated by trained journalists specialized in fact-checking. Further, we classify fake videos within the subset of YouTube videos using textual information and user comments. We used a pre-trained model to classify each video with different feature combinations. Our best-performing fine-tuned language model, Universal Sentence Encoder (USE), achieves a Macro F1 of 87%, which shows that the trained model can be helpful for debunking fake videos using the comments from the user discussion. The dataset is available on GitHub (https://github.com/Gautamshahi/FakeClaim).	我们贡献了第一个公开可用的数据集来自不同平台的事实声明和2023年以色列-哈马斯战争的假 YouTube 视频自动假 YouTube 视频分类。FakeClaim 的数据是从60个事实核查组织以30种语言收集的，并且由专门从事事实核查的训练有素的记者组织的事实核查组织的元数据加以丰富。此外，我们使用文本信息和用户评论将 YouTube 视频子集中的假视频进行分类。我们使用一个预先训练的模型来分类每个视频与不同的特征组合。我们表现最好的微调语言模型，通用句子编码器(USE) ，实现了87% 的宏 F1，这表明，训练有素的模型可以有助于揭穿假视频使用用户讨论的评论。该数据集可在 GitHub (https://github.com/Gautamshahi/FakeClaim) 上获得。	code	0
MedSumm: A Multimodal Approach to Summarizing Code-Mixed Hindi-English Clinical Queries	Akash Ghosh, Arkadeep Acharya, Prince Jha, Sriparna Saha, Aniket Gaudgaul, Rajdeep Majumdar, Aman Chadha, Raghav Jain, Setu Sinha, Shivani Agarwal		In the healthcare domain, summarizing medical questions posed by patients is critical for improving doctor-patient interactions and medical decision-making. Although medical data has grown in complexity and quantity, the current body of research in this domain has primarily concentrated on text-based methods, overlooking the integration of visual cues. Also prior works in the area of medical question summarisation have been limited to the English language. This work introduces the task of multimodal medical question summarization for codemixed input in a low-resource setting. To address this gap, we introduce the Multimodal Medical Codemixed Question Summarization MMCQS dataset, which combines Hindi-English codemixed medical queries with visual aids. This integration enriches the representation of a patient's medical condition, providing a more comprehensive perspective. We also propose a framework named MedSumm that leverages the power of LLMs and VLMs for this task. By utilizing our MMCQS dataset, we demonstrate the value of integrating visual information from images to improve the creation of medically detailed summaries. This multimodal strategy not only improves healthcare decision-making but also promotes a deeper comprehension of patient queries, paving the way for future exploration in personalized and responsive medical care. Our dataset, code, and pre-trained models will be made publicly available.	
在医疗保健领域，总结病人提出的医疗问题对于改善医患互动和医疗决策至关重要。尽管医学数据在复杂性和数量上都有所增长，但目前该领域的研究主要集中在基于文本的方法上，忽视了视觉线索的整合。此外，以前在医疗问题总结领域的工作已经限制在英语。本文介绍了在低资源环境下多模式医学问题摘要的任务。为了解决这一差距，我们引入了多模式医学代码混合问题摘要 MMCQS 数据集，它结合了印地语-英语代码混合医学查询和视觉辅助。这种整合丰富了病人的医疗状况的表示，提供了一个更全面的视角。我们还提出了一个名为 MedSumm 的框架，该框架利用 LLM 和 VLM 的强大功能完成此任务。通过利用我们的 MMCQS 数据集，我们证明了整合来自图像的视觉信息以改善医学详细摘要的创建的价值。这种多模式策略不仅改善了医疗保健决策，而且促进了对患者查询的更深入理解，为未来探索个性化和响应式医疗保健铺平了道路。我们的数据集、代码和预先训练的模型将公开发布。	code	0
The Open Web Index - Crawling and Indexing the Web for Public Use	Gijs Hendriksen, Michael Dinzinger, Sheikh Mastura Farzana, Noor Afshan Fathima, Maik Fröbe, Sebastian Schmidt, Saber Zerhoudi, Michael Granitzer, Matthias Hagen, Djoerd Hiemstra, Martin Potthast, Benno Stein				code	0
Towards Robust Expert Finding in Community Question Answering Platforms	Maddalena Amendola, Andrea Passarella, Raffaele Perego				code	0
Interactive Document Summarization	Raoufdine Said, Adrien Guille				code	0
Physio: An LLM-Based Physiotherapy Advisor	Rúben Almeida, Hugo O. Sousa, Luís Filipe Cunha, Nuno Guimarães, Ricardo Campos, Alípio Jorge		The capabilities of the most recent language models have increased the interest in integrating them into real-world applications. However, the fact that these models generate plausible, yet incorrect text poses a constraint when considering their use in several domains. Healthcare is a prime example of a domain where text-generative trustworthiness is a hard requirement to safeguard patient well-being. In this paper, we present Physio, a chat-based application for physical rehabilitation. Physio is capable of making an initial diagnosis while citing reliable health sources to support the information provided. Furthermore, drawing upon external knowledge databases, Physio can recommend rehabilitation exercises and over-the-counter medication for symptom relief. By combining these features, Physio can leverage the power of generative models for language processing while also conditioning its response on dependable and verifiable sources. A live demo of Physio is available at https://physio.inesctec.pt.	最新语言模型的功能增加了将它们集成到实际应用程序中的兴趣。然而，当考虑到这些模型在几个领域中的使用时，这些模型产生合理但不正确的文本这一事实构成了一个限制。医疗保健是这样一个领域的典型例子，在这个领域，文本生成的可信度是保障患者健康的一个硬性要求。在本文中，我们介绍了 Physio，一个基于聊天的身体康复应用程序。Physio 能够做出初步诊断，同时引用可靠的健康来源来支持所提供的信息。此外，利用外部知识数据库，Physio 可以推荐用于缓解症状的康复练习和非处方药物。通过结合这些特征，Physio 可以利用生成模型在语言处理方面的能力，同时使其回答以可靠且可验证的来源为依据。Physio 的在线演示可在 https://physio.inesctec.pt 获得。	code	0
eval-rationales: An End-to-End Toolkit to Explain and Evaluate Transformers-Based Models	Khalil Maachou, Jesús Lovón-Melgarejo, José G. Moreno, Lynda Tamine				code	0
VADIS - A Variable Detection, Interlinking and Summarization System	Yavuz Selim Kartal, Muhammad Ahsan Shahid, Sotaro Takeshita, Tornike Tsereteli, Andrea Zielinski, Benjamin Zapilko, Philipp Mayr		The VADIS system addresses the demand of providing enhanced information access in the domain of the social sciences. This is achieved by allowing users to search and use survey variables in context of their underlying research data and scholarly publications which have been interlinked with each other.	VADIS 系统满足了增强社会科学领域信息获取的需求。这是通过允许用户在其相互关联的基础研究数据和学术出版物的背景下搜索和使用调查变量来实现的。	code	0
Building and Evaluating a WebApp for Effortless Deep Learning Model Deployment	Ruikun Wu, Jiaxuan Han, Jerome Ramos, Aldo Lipani				code	0
indxr: A Python Library for Indexing File Lines	Elias Bassani, Nicola Tonellotto				code	0
SciSpace Literature Review: Harnessing AI for Effortless Scientific Discovery	Siddhant Jain, Asheesh Kumar, Trinita Roy, Kartik Shinde, Goutham Vignesh, Rohan Tondulkar				code	0
Let's Get It Started: Fostering the Discoverability of New Releases on Deezer	Léa Briand, Théo Bontempelli, Walid Bendada, Mathieu Morlon, François Rigaud, Benjamin Chapus, Thomas Bouabça, Guillaume Salha-Galvan				code	0
Augmenting KG Hierarchies Using Neural Transformers	Sanat Sharma, Mayank Poddar, Jayant Kumar, Kosta Blank, Tracy Holloway King				code	0
Document Level Event Extraction from Narratives	Luís Filipe Cunha				code	0
Shuffling a Few Stalls in a Crowded Bazaar: Potential Impact of Document-Side Fairness on Unprivileged Info-Seekers	Sean Healy				code	0
Knowledge Transfer from Resource-Rich to Resource-Scarce Environments	Negin Ghasemi				code	0
PhD Candidacy: A Tutorial on Overcoming Challenges and Achieving Success	Johanne R. Trippas, David Maxwell				code	0
The CLEF-2024 CheckThat! Lab: Check-Worthiness, Subjectivity, Persuasion, Roles, Authorities, and Adversarial Robustness	Alberto Barrón-Cedeño, Firoj Alam, Tanmoy Chakraborty, Tamer Elsayed, Preslav Nakov, Piotr Przybyla, Julia Maria Struß, Fatima Haouari, Maram Hasanain, Federico Ruggeri, Xingyi Song, Reem Suwaileh				code	0
ELOQUENT CLEF Shared Tasks for Evaluation of Generative Language Model Quality	Jussi Karlgren, Luise Dürlich, Evangelia Gogoulou, Liane Guillou, Joakim Nivre, Magnus Sahlgren, Aarne Talman				code	0
Overview of Touché 2024: Argumentation Systems	Johannes Kiesel, Çagri Çöltekin, Maximilian Heinrich, Maik Fröbe, Milad Alshomary, Bertrand De Longueville, Tomaz Erjavec, Nicolas Handke, Matyás Kopp, Nikola Ljubesic, Katja Meden, Nailia Mirzakhmedova, Vaidas Morkevicius, Theresa Reitis-Münstermann, Mario Scharfbillig, Nicolas Stefanovitch, Henning Wachsmuth, Martin Potthast, Benno Stein				code	0
eRisk 2024: Depression, Anorexia, and Eating Disorder Challenges	Javier Parapar, Patricia Martín-Rodilla, David E. Losada, Fabio Crestani				code	0
QuantumCLEF - Quantum Computing at CLEF	Andrea Pasin, Maurizio Ferrari Dacrema, Paolo Cremonesi, Nicola Ferro				code	0
EXIST 2024: sEXism Identification in Social neTworks and Memes	Laura Plaza, Jorge Carrillo-de-Albornoz, Enrique Amigó, Julio Gonzalo, Roser Morante, Paolo Rosso, Damiano Spina, Berta Chulvi, Alba Maeso, Víctor Ruiz				code	0
