From 2c35fbfd577e3459e637cf25e5862d06ace1f344 Mon Sep 17 00:00:00 2001 From: github-actions Date: Sat, 14 Dec 2024 01:01:07 +0000 Subject: [PATCH] chore: update confs --- arxiv.json | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 70 insertions(+) diff --git a/arxiv.json b/arxiv.json index 3ae49a89..6f0358db 100644 --- a/arxiv.json +++ b/arxiv.json @@ -36818,5 +36818,75 @@ "pub_date": "2024-12-12", "summary": "Deep supervised hashing has become a pivotal technique in large-scale image retrieval, offering significant benefits in terms of storage and search efficiency. However, existing deep supervised hashing models predominantly focus on generating fixed-length hash codes. This approach fails to address the inherent trade-off between efficiency and effectiveness when using hash codes of varying lengths. To determine the optimal hash code length for a specific task, multiple models must be trained for different lengths, leading to increased training time and computational overhead. Furthermore, the current paradigm overlooks the potential relationships between hash codes of different lengths, limiting the overall effectiveness of the models. To address these challenges, we propose the Nested Hash Layer (NHL), a plug-and-play module designed for existing deep supervised hashing models. The NHL framework introduces a novel mechanism to simultaneously generate hash codes of varying lengths in a nested manner. To tackle the optimization conflicts arising from the multiple learning objectives associated with different code lengths, we further propose an adaptive weights strategy that dynamically monitors and adjusts gradients during training. Additionally, recognizing that the structural information in longer hash codes can provide valuable guidance for shorter hash codes, we develop a long-short cascade self-distillation method within the NHL to enhance the overall quality of the generated hash codes. Extensive experiments demonstrate that NHL not only accelerates the training process but also achieves superior retrieval performance across various deep hashing models. Our code is publicly available at https://github.com/hly1998/NHL.", "translated": "深度监督哈希已成为大规模图像检索中的关键技术,在存储和搜索效率方面提供了显著优势。然而,现有的深度监督哈希模型主要集中在生成固定长度的哈希码。这种方法未能解决在使用不同长度的哈希码时效率与效果之间的固有权衡问题。为了确定特定任务的最佳哈希码长度,必须为不同长度训练多个模型,这导致了训练时间和计算开销的增加。此外,当前的方法忽视了不同长度哈希码之间潜在的关联,限制了模型的整体效果。为应对这些挑战,我们提出了嵌套哈希层(Nested Hash Layer,NHL),这是一个为现有深度监督哈希模型设计的即插即用模块。NHL框架引入了一种新颖的机制,能够以嵌套方式同时生成不同长度的哈希码。为了解决与不同码长相关的多个学习目标之间的优化冲突,我们进一步提出了一种自适应权重策略,该策略在训练过程中动态监控和调整梯度。此外,考虑到较长哈希码中的结构信息可以为较短哈希码提供有价值的指导,我们在NHL中开发了一种长短码级联自蒸馏方法,以提高生成哈希码的整体质量。大量实验表明,NHL不仅加速了训练过程,而且在各种深度哈希模型中实现了卓越的检索性能。我们的代码已在 https://github.com/hly1998/NHL 公开发布。" + }, + { + "title": "Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge\n Graph-Based RAG", + "url": "http://arxiv.org/abs/2412.09614v1", + "pub_date": "2024-12-12", + "summary": "We introduce a novel approach to enhance the capabilities of text-to-image models by incorporating a graph-based RAG. Our system dynamically retrieves detailed character information and relational data from the knowledge graph, enabling the generation of visually accurate and contextually rich images. This capability significantly improves upon the limitations of existing T2I models, which often struggle with the accurate depiction of complex or culturally specific subjects due to dataset constraints. 
Furthermore, we propose a novel self-correcting mechanism for text-to-image models to ensure consistency and fidelity in visual outputs, leveraging the rich context from the graph to guide corrections. Our qualitative and quantitative experiments demonstrate that Context Canvas significantly enhances the capabilities of popular models such as Flux, Stable Diffusion, and DALL-E, and improves the functionality of ControlNet for fine-grained image editing tasks. To our knowledge, Context Canvas represents the first application of graph-based RAG to enhancing T2I models, marking a significant advancement in producing high-fidelity, context-aware, multi-faceted images.", "translated": "我们提出了一种新颖的方法,通过结合基于图的检索增强生成(RAG)来提升文本到图像模型的能力。我们的系统能够从知识图中动态检索详细的角色信息和关系数据,从而生成视觉上准确且上下文丰富的图像。这一功能显著超越了现有T2I模型的局限性,这些模型通常因数据集限制而在准确描绘复杂或文化特定主题时遇到困难。此外,我们提出了一种新的自校正机制,用于文本到图像模型,以确保视觉输出的连贯性和忠实度,利用图中的丰富上下文来指导校正过程。我们的定性和定量实验表明,Context Canvas显著增强了Flux、Stable Diffusion和DALL-E等流行模型的能力,并提升了ControlNet在精细图像编辑任务中的功能。据我们所知,Context Canvas是首个将基于图的RAG应用于增强T2I模型的实例,标志着在生成高保真、上下文感知的多方面图像方面取得了重大进展。" }, { "title": "Olympus: A Universal Task Router for Computer Vision Tasks", "url": "http://arxiv.org/abs/2412.09612v1", "pub_date": "2024-12-12", "summary": "We introduce Olympus, a new approach that transforms Multimodal Large Language Models (MLLMs) into a unified framework capable of handling a wide array of computer vision tasks. Utilizing a controller MLLM, Olympus delegates over 20 specialized tasks across images, videos, and 3D objects to dedicated modules. This instruction-based routing enables complex workflows through chained actions without the need for training heavy generative models. Olympus easily integrates with existing MLLMs, expanding their capabilities with comparable performance. Experimental results demonstrate that Olympus achieves an average routing accuracy of 94.75% across 20 tasks and precision of 91.82% in chained action scenarios, showcasing its effectiveness as a universal task router that can solve a diverse range of computer vision tasks. Project page: https://github.com/yuanze-lin/Olympus_page", "translated": "我们提出了Olympus,这是一种将多模态大型语言模型(MLLMs)转化为统一框架的新方法,能够处理广泛的计算机视觉任务。通过利用一个控制器MLLM,Olympus将超过20项专门任务分配给图像、视频和3D对象的专用模块。这种基于指令的路由通过链式操作实现了复杂的任务流程,而无需训练庞大的生成模型。Olympus能够轻松集成现有的MLLMs,扩展其功能并保持相当的性能。实验结果表明,Olympus在20项任务中的平均路由准确率达到94.75%,在链式操作场景中的精确度达到91.82%,展示了其作为解决多样化计算机视觉任务的通用任务路由器的有效性。项目页面:https://github.com/yuanze-lin/Olympus_page" }, { "title": "AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web\n Tutorials", "url": "http://arxiv.org/abs/2412.09605v1", "pub_date": "2024-12-12", "summary": "Graphical User Interface (GUI) agents hold great potential for automating complex tasks across diverse digital environments, from web applications to desktop software. However, the development of such agents is hindered by the lack of high-quality, multi-step trajectory data required for effective training. Existing approaches rely on expensive and labor-intensive human annotation, making them unsustainable at scale. To address this challenge, we propose AgentTrek, a scalable data synthesis pipeline that generates high-quality GUI agent trajectories by leveraging web tutorials. Our method automatically gathers tutorial-like texts from the internet, transforms them into task goals with step-by-step instructions, and employs a visual-language model agent to simulate their execution in a real digital environment.
A VLM-based evaluator ensures the correctness of the generated trajectories. We demonstrate that training GUI agents with these synthesized trajectories significantly improves their grounding and planning performance over current models. Moreover, our approach is more cost-efficient compared to traditional human annotation methods. This work underscores the potential of guided replay with web tutorials as a viable strategy for large-scale GUI agent training, paving the way for more capable and autonomous digital agents.", "translated": "图形用户界面(GUI)代理在自动化跨多种数字环境的复杂任务方面具有巨大潜力,涵盖从网页应用到桌面软件的广泛领域。然而,这类代理的开发受到高质量多步骤轨迹数据缺乏的制约,这些数据对于有效训练至关重要。现有方法依赖于昂贵且劳动密集的人工标注,使其在大规模应用中难以持续。为应对这一挑战,我们提出了AgentTrek,一个可扩展的数据合成管道,通过利用网页教程生成高质量的GUI代理轨迹。我们的方法自动从互联网收集类似教程的文本,将其转化为带有逐步指导的任务目标,并利用视觉语言模型代理在真实数字环境中模拟执行这些任务。基于视觉语言模型(VLM)的评估器确保生成轨迹的正确性。我们证明,使用这些合成的轨迹训练GUI代理能显著提升其界面定位(grounding)和规划性能,超越当前模型。此外,与传统的人工标注方法相比,我们的方法更具成本效益。这项工作强调了利用网页教程进行引导回放作为大规模GUI代理训练可行策略的潜力,为开发更强大、更自主的数字代理铺平了道路。" }, { "title": "TimeRefine: Temporal Grounding with Time Refining Video LLM", "url": "http://arxiv.org/abs/2412.09601v1", "pub_date": "2024-12-12", "summary": "Video temporal grounding aims to localize relevant temporal boundaries in a video given a textual prompt. Recent work has focused on enabling Video LLMs to perform video temporal grounding via next-token prediction of temporal timestamps. However, accurately localizing timestamps in videos remains challenging for Video LLMs when relying solely on temporal token prediction. Our proposed TimeRefine addresses this challenge in two ways. First, instead of directly predicting the start and end timestamps, we reformulate the temporal grounding task as a temporal refining task: the model first makes rough predictions and then refines them by predicting offsets to the target segment. This refining process is repeated multiple times, through which the model progressively self-improves its temporal localization accuracy. Second, to enhance the model's temporal perception capabilities, we incorporate an auxiliary prediction head that penalizes the model more if a predicted segment deviates further from the ground truth, thus encouraging the model to make closer and more accurate predictions. Our plug-and-play method can be integrated into most LLM-based temporal grounding approaches. The experimental results demonstrate that TimeRefine achieves 3.6% and 5.0% mIoU improvements on the ActivityNet and Charades-STA datasets, respectively. Code and pretrained models will be released.", "translated": "视频时间定位旨在根据文本提示在视频中定位相关的时间边界。近期的工作主要集中在通过时间戳的下一词预测,使视频大语言模型(Video LLMs)能够执行视频时间定位。然而,仅依赖时间令牌预测来准确地在视频中定位时间戳仍然对Video LLMs构成了挑战。我们提出的TimeRefine通过两种方式解决了这一挑战。首先,我们不直接预测开始和结束时间戳,而是将时间定位任务重新表述为时间精炼任务:模型首先进行粗略预测,然后通过预测目标片段的偏移量来精炼这些预测。这一精炼过程会重复多次,通过这一过程,模型逐步提升其时间定位的准确性。其次,为了增强模型的时序感知能力,我们引入了一个辅助预测头,预测片段与真实值的偏差越大,它对模型施加的惩罚就越重,从而鼓励模型做出更接近且更准确的预测。我们的即插即用方法可以集成到大多数基于LLM的时间定位方法中。实验结果表明,TimeRefine在ActivityNet和Charades-STA数据集上分别实现了3.6%和5.0%的mIoU提升。代码和预训练模型将会发布。" }, { "title": "InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for\n Long-term Streaming Video and Audio Interactions", "url": "http://arxiv.org/abs/2412.09596v1", "pub_date": "2024-12-12", "summary": "Creating AI systems that can interact with environments over long periods, similar to human cognition, has been a longstanding research goal. Recent advancements in multimodal large language models (MLLMs) have made significant strides in open-world understanding.
However, the challenge of continuous and simultaneous streaming perception, memory, and reasoning remains largely unexplored. Current MLLMs are constrained by their sequence-to-sequence architecture, which limits their ability to process inputs and generate responses simultaneously, akin to being unable to think while perceiving. Furthermore, relying on long contexts to store historical data is impractical for long-term interactions, as retaining all information becomes costly and inefficient. Therefore, rather than relying on a single foundation model to perform all functions, this project draws inspiration from the concept of the Specialized Generalist AI and introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input. The proposed framework InternLM-XComposer2.5-OmniLive (IXC2.5-OL) consists of three key modules: (1) Streaming Perception Module: Processes multimodal information in real-time, storing key details in memory and triggering reasoning in response to user queries. (2) Multi-modal Long Memory Module: Integrates short-term and long-term memory, compressing short-term memories into long-term ones for efficient retrieval and improved accuracy. (3) Reasoning Module: Responds to queries and executes reasoning tasks, coordinating with the perception and memory modules. This project simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.", + "translated": "构建能够像人类认知一样长期与环境互动的AI系统一直是研究的长远目标。近年来,多模态大语言模型(MLLMs)在开放世界理解方面取得了显著进展。然而,持续且同时进行的流式感知、记忆和推理的挑战仍然很大程度上未被探索。当前的MLLMs受限于其序列到序列的架构,这限制了它们同时处理输入和生成响应的能力,类似于无法在感知的同时进行思考。此外,依赖长上下文来存储历史数据对于长期互动是不切实际的,因为保留所有信息既昂贵又低效。因此,本项目不依赖单一的基础模型来执行所有功能,而是借鉴了“专业通才AI”的概念,引入了分离的流式感知、推理和记忆机制,从而实现与流式视频和音频输入的实时互动。提出的框架InternLM-XComposer2.5-OmniLive(IXC2.5-OL)包含三个关键模块:(1)流式感知模块:实时处理多模态信息,将关键细节存储在记忆中,并在响应用户查询时触发推理。(2)多模态长记忆模块:整合短期和长期记忆,将短期记忆压缩为长期记忆,以提高检索效率和准确性。(3)推理模块:响应查询并执行推理任务,与感知和记忆模块协调工作。本项目模拟人类认知,使多模态大语言模型能够提供持续且适应性的服务。" + }, + { + "title": "OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets\n in 50+ Languages", + "url": "http://arxiv.org/abs/2412.09587v1", + "pub_date": "2024-12-12", + "summary": "We present OpenNER 1.0, a standardized collection of openly available named entity recognition (NER) datasets. OpenNER contains 34 datasets spanning 51 languages, annotated in varying named entity ontologies. We correct annotation format issues, standardize the original datasets into a uniform representation, map entity type names to be more consistent across corpora, and provide the collection in a structure that enables research in multilingual and multi-ontology NER. We provide baseline models using three pretrained multilingual language models to compare the performance of recent models and facilitate future research in NER.", + "translated": "我们发布了OpenNER 1.0,这是一个标准化的、公开可用的命名实体识别(NER)数据集集合。OpenNER包含34个数据集,涵盖51种语言,并采用了多种命名实体本体进行标注。我们对标注格式问题进行了修正,将原始数据集标准化为统一的表示形式,对实体类型名称进行了跨语料库的一致性映射,并提供了便于多语言和多本体NER研究的数据结构。我们使用三种预训练的多语言语言模型提供了基线模型,以便比较近期模型的性能,并促进未来在NER领域的研究。" + }, + { + "title": "DISHONEST: Dissecting misInformation Spread using Homogeneous sOcial\n NEtworks and Semantic Topic classification", + "url": "http://arxiv.org/abs/2412.09578v1", + "pub_date": "2024-12-12", + "summary": "The emergence of the COVID-19 pandemic resulted in a significant rise in the spread of misinformation on online platforms such as Twitter. 
Oftentimes this growth is blamed on the idea of the \"echo chamber.\" However, the behavior said to characterize these echo chambers exists in two dimensions. The first is in a user's social interactions, where they are said to stick with the same clique of like-minded users. The second is in the content of their posts, where they are said to repeatedly espouse homogeneous ideas. In this study, we link the two by using Twitter's network of retweets to study social interactions and topic modeling to study tweet content. In order to measure the diversity of a user's interactions over time, we develop a novel metric to track the speed at which they travel through the social network. The application of these analysis methods to misinformation-focused data from the pandemic demonstrates a correlation between social behavior and tweet content. We believe this correlation supports the common intuition about how antisocial users behave, and further suggests that it holds even in subcommunities already rife with misinformation.", "translated": "COVID-19疫情的爆发导致Twitter等在线平台上错误信息的传播显著增加。这种增长通常被归咎于“回音壁”效应。然而,这种所谓回音壁效应的行为特征存在于两个维度。第一个维度是用户的社交互动,即他们倾向于与志同道合的用户群体保持互动。第二个维度是他们的帖子内容,即他们反复表达同质化的观点。在本研究中,我们通过使用Twitter的转发网络来研究社交互动,并通过主题建模来分析推文内容,从而将这两个维度联系起来。为了衡量用户随时间变化的互动多样性,我们开发了一种新颖的度量方法,用于追踪他们在社交网络中移动的速度。将这些分析方法应用于疫情期间以错误信息为重点的数据,我们发现社交行为与推文内容之间存在相关性。我们认为这种相关性支持了关于反社会用户行为的常见直觉,并进一步表明,即使在已经充斥着错误信息的子社区中,这种相关性依然成立。" }, { "title": "DiverseAgentEntropy: Quantifying Black-Box LLM Uncertainty through\n Diverse Perspectives and Multi-Agent Interaction", "url": "http://arxiv.org/abs/2412.09572v1", "pub_date": "2024-12-12", "summary": "Quantifying the uncertainty in the factual parametric knowledge of Large Language Models (LLMs), especially in a black-box setting, poses a significant challenge. Existing methods, which gauge a model's uncertainty by evaluating self-consistency in responses to the original query, do not always capture true uncertainty. Models might respond consistently to the original query with a wrong answer, yet respond correctly to varied questions from different perspectives about the same query, and vice versa. In this paper, we propose a novel method, DiverseAgentEntropy, for evaluating a model's uncertainty using multi-agent interaction under the assumption that if a model is certain, it should consistently recall the answer to the original query across a diverse collection of questions about the same original query. We further implement an abstention policy to withhold responses when uncertainty is high. Our method offers a more accurate prediction of the model's reliability and further detects hallucinations, outperforming other self-consistency-based methods.
Additionally, it demonstrates that existing models often fail to consistently retrieve the correct answer to the same query across diverse variations of the question, even when they know the correct answer.", "translated": "量化大型语言模型(LLMs)在事实参数知识方面的不确定性,尤其是在黑箱环境下,是一个重大挑战。现有的方法通过评估模型对原始查询响应的自洽性来衡量其不确定性,但这种方法并不总能捕捉到真正的模型不确定性。模型可能会对原始查询给出一致但错误的回答,而对同一查询从不同角度提出的多样化问题却能给出正确的回答,反之亦然。本文提出了一种新的方法,称为DiverseAgentEntropy,通过多智能体交互来评估模型的不确定性。该方法假设如果模型具有确定性,那么它应该能够在针对同一原始查询的多样化问题集合中一致地回忆起原始查询的答案。我们进一步实施了一种弃权策略,在不确定性较高时拒绝响应。我们的方法提供了更准确的模型可靠性预测,并能检测出幻觉现象,优于其他基于自洽性的方法。此外,研究还表明,现有的模型在面对同一查询的多样化变体问题时,即便知道正确答案,往往也无法一致地检索出正确答案。" }, { "title": "JuStRank: Benchmarking LLM Judges for System Ranking", "url": "http://arxiv.org/abs/2412.09569v1", "pub_date": "2024-12-12", "summary": "Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach requires first validating the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge's positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and the judge's quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.", "translated": "鉴于生成式人工智能的快速发展,系统性地比较和选择众多可用模型及配置显得尤为迫切。此类评估的规模和多样性使得基于大语言模型(LLM)的评判者成为应对这一挑战的有力解决方案。关键在于,这一方法首先需要验证LLM评判者自身的质量。先前的研究主要集中在基于实例的LLM评判者评估上,即通过一组响应或响应对来评估评判者,而不考虑其来源系统。我们认为,这种设置忽视了影响系统级排序的关键因素,如评判者对某些系统的正向或负向偏见。为填补这一空白,我们进行了首次大规模的LLM评判者作为系统排序器的研究。系统分数通过聚合多个系统输出的评判分数生成,而评判者的质量则通过比较生成的系统排名与基于人类的排名来评估。除了对评判者的整体评估外,我们的分析还提供了对其行为的细致刻画,包括其决断力和偏见。" }, { "title": "Does Representation Matter? Exploring Intermediate Layers in Large\n Language Models", "url": "http://arxiv.org/abs/2412.09563v1", "pub_date": "2024-12-12", "summary": "Understanding what defines a good representation in large language models (LLMs) is fundamental to both theoretical understanding and practical applications. In this paper, we investigate the quality of intermediate representations in various LLM architectures, including Transformers and State Space Models (SSMs). We find that intermediate layers often yield more informative representations for downstream tasks than the final layers. To measure the representation quality, we adapt and apply a suite of metrics - such as prompt entropy, curvature, and augmentation-invariance - originally proposed in other contexts. Our empirical study reveals significant architectural differences, how representations evolve throughout training, and how factors like input randomness and prompt length affect each layer. Notably, we observe a bimodal pattern in the entropy of some intermediate layers and consider potential explanations tied to training data.
Overall, our results illuminate the internal mechanics of LLMs and guide strategies for architectural optimization and training.", "translated": "理解什么构成大语言模型(LLMs)中的良好表示,对于理论理解和实际应用都至关重要。本文中,我们研究了包括Transformer和状态空间模型(SSMs)在内的各种LLM架构中的中间表示质量。我们发现,中间层通常比最终层为下游任务提供更具信息量的表示。为了衡量表示质量,我们改编并应用了最初在其他场景下提出的一系列指标,如提示熵、曲率和增强不变性等。我们的实证研究揭示了显著的架构差异、表示在整个训练过程中的演变方式,以及输入随机性和提示长度等因素如何影响每一层。值得注意的是,我们观察到某些中间层的熵呈现双峰模式,并考虑了与训练数据相关的潜在解释。总体而言,我们的研究揭示了LLMs的内部机制,并为架构优化和训练策略提供了指导。" } ] \ No newline at end of file