
Commit

chore: update confs
actions-user committed Dec 5, 2024
1 parent 525cc7e commit ce10c54
Showing 1 changed file with 35 additions and 0 deletions.
arxiv.json
@@ -36174,5 +36174,40 @@
"pub_date": "2024-12-03",
"summary": "Determining company similarity is a vital task in finance, underpinning hedging, risk management, portfolio diversification, and more. Practitioners often rely on sector and industry classifications to gauge similarity, such as SIC-codes and GICS-codes, the former being used by the U.S. Securities and Exchange Commission (SEC), and the latter widely used by the investment community. Clustering embeddings of company descriptions has been proposed as a potential technique for determining company similarity, but the lack of interpretability in token embeddings poses a significant barrier to adoption in high-stakes contexts. Sparse Autoencoders have shown promise in enhancing the interpretability of Large Language Models by decomposing LLM activations into interpretable features. In this paper, we explore the use of SAE features in measuring company similarity and benchmark them against (1) SIC codes and (2) Major Group codes. We conclude that SAE features can reproduce and even surpass sector classifications in quantifying fundamental characteristics of companies, evaluated by the correlation of monthly returns, a proxy for similarity, and PnL from cointegration.",
"translated": "确定公司相似性在金融领域是一项至关重要的任务,支撑着对冲、风险管理、投资组合多样化等多个方面。从业者通常依赖于行业和部门分类来评估相似性,如SIC代码和GICS代码,前者由美国证券交易委员会(SEC)使用,后者则广泛被投资界采用。将公司描述的嵌入进行聚类已被提出作为一种潜在的公司相似性确定技术,但词嵌入缺乏可解释性对在高风险情境中的应用构成了重大障碍。稀疏自编码器(Sparse Autoencoders)在通过将大型语言模型(LLM)的激活分解为可解释特征来增强其可解释性方面显示出了潜力。本文探讨了在测量公司相似性中使用SAE特征,并将它们与(1)SIC代码和(2)主要组代码进行基准测试。我们的结论是,SAE特征在量化公司的基本特征方面能够重现甚至超越部门分类,这通过月度回报的相关性(作为相似性的代理)和协整的盈亏(PnL)来评估。"
},
{
"title": "Freshness and Informativity Weighted Cognitive Extent and Its\n Correlation with Cumulative Citation Count",
"url": "http://arxiv.org/abs/2412.03557v1",
"pub_date": "2024-12-04",
"summary": "In this paper, we revisit cognitive extent, originally defined as the number of unique phrases in a quota. We introduce Freshness and Informative Weighted Cognitive Extent (FICE), calculated based on two novel weighting factors, the lifetime ratio and informativity of scientific entities. We model the lifetime of each scientific entity as the time-dependent document frequency, which is fit by the composition of multiple Gaussian profiles. The lifetime ratio is then calculated as the cumulative document frequency at the publication time $t_0$ divided by the cumulative document frequency over its entire lifetime. The informativity is calculated by normalizing the document frequency across all scientific entities recognized in a title. Using the ACL Anthology, we verified the trend formerly observed in several other domains that the number of unique scientific entities per quota increased gradually at a slower rate. We found that FICE exhibits a strong correlation with the average cumulative citation count within a quota. Our code is available at \\href{https://github.com/ZiheHerzWang/Freshness-and-Informativity-Weighted-Cognitive-Extent}{https://github.com/ZiheHerzWang/Freshness-and-Informativity-Weighted-Cognitive-Extent}",
"translated": "在本文中,我们重新审视了认知范围(Cognitive Extent),最初定义为配额中独特短语的数量。我们引入了新鲜度和信息加权认知范围(Freshness and Informative Weighted Cognitive Extent, FICE),该指标基于两个新颖的加权因子计算:科学实体的生命周期比率和信息量。我们将每个科学实体的生命周期建模为时间依赖的文档频率,这一频率通过多个高斯曲线的组合来拟合。生命周期比率随后被计算为在出版时间 \\( t_0 \\) 时的累积文档频率除以其整个生命周期内的累积文档频率。信息量则通过将所有在标题中识别出的科学实体的文档频率进行归一化来计算。利用ACL文集,我们验证了过去在其他几个领域中观察到的趋势,即每个配额中独特科学实体的数量以较慢的速度逐渐增加。我们发现,FICE与配额内平均累积引用次数之间存在强相关性。我们的代码可在以下链接获取:\\href{https://github.com/ZiheHerzWang/Freshness-and-Informativity-Weighted-Cognitive-Extent}{https://github.com/ZiheHerzWang/Freshness-and-Informativity-Weighted-Cognitive-Extent}。"
},
{
"title": "YT-30M: A multi-lingual multi-category dataset of YouTube comments",
"url": "http://arxiv.org/abs/2412.03465v1",
"pub_date": "2024-12-04",
"summary": "This paper introduces two large-scale multilingual comment datasets from YouTube, YT-30M (and YT-100K). The analysis in this paper is performed on a smaller sample (YT-100K) of YT-30M. Both datasets, YT-30M (full) and YT-100K (a randomly selected 100K sample of YT-30M), are publicly released for further research. YT-30M (YT-100K) contains 32,236,173 (108,694) comments posted on YouTube channels belonging to various YouTube categories. Each comment is associated with a video ID, comment ID, commentor name, commentor channel ID, comment text, upvotes, original channel ID and category of the YouTube channel (e.g., 'News & Politics', 'Science & Technology', etc.).",
"translated": "本文介绍了两组大规模的多语言评论数据集,即来自YouTube的YT-30M(以及YT-100K)。本文的分析基于YT-30M的一个较小样本(YT-100K)进行。两个数据集:YT-30M(完整版)和YT-100K(从YT-30M中随机选取的100,000个样本)均已公开发布,供进一步研究使用。YT-30M(YT-100K)包含32,236,173(108,694)条评论,这些评论来自属于YouTube分类频道的YouTube频道。每条评论都关联有视频ID、评论ID、评论者名称、评论者频道ID、评论文本、点赞数、原始频道ID以及YouTube频道的类别(例如,'新闻与政治'、'科学与技术'等)。"
},
{
"title": "Beyond Questions: Leveraging ColBERT for Keyphrase Search",
"url": "http://arxiv.org/abs/2412.03193v1",
"pub_date": "2024-12-04",
"summary": "While question-like queries are gaining popularity and search engines' users increasingly adopt them, keyphrase search has traditionally been the cornerstone of web search. This query type is also prevalent in specialised search tasks such as academic or professional search, where experts rely on keyphrases to articulate their information needs. However, current dense retrieval models often fail with keyphrase-like queries, primarily because they are mostly trained on question-like ones. This paper introduces a novel model that employs the ColBERT architecture to enhance document ranking for keyphrase queries. For that, given the lack of large keyphrase-based retrieval datasets, we first explore how Large Language Models can convert question-like queries into keyphrase format. Then, using those keyphrases, we train a keyphrase-based ColBERT ranker (ColBERTKP_QD) to improve the performance when working with keyphrase queries. Furthermore, to reduce the training costs associated with training the full ColBERT model, we investigate the feasibility of training only a keyphrase query encoder while keeping the document encoder weights static (ColBERTKP_Q). We assess our proposals' ranking performance using both automatically generated and manually annotated keyphrases. Our results reveal the potential of the late interaction architecture when working under the keyphrase search scenario.",
"translated": "尽管类似问题的查询越来越受欢迎,搜索引擎用户也越来越多地采用这种查询方式,但关键词搜索一直是网络搜索的基石。这种查询类型在学术或专业搜索等特定搜索任务中也普遍存在,专家们依赖关键词来表达他们的信息需求。然而,当前的密集检索模型在处理类似关键词的查询时往往表现不佳,主要是因为它们大多是在类似问题的查询上进行训练的。本文介绍了一种新型模型,该模型采用ColBERT架构来增强关键词查询的文档排名。为此,鉴于缺乏大规模的关键词检索数据集,我们首先探讨了大型语言模型如何将类似问题的查询转换为关键词格式。然后,利用这些关键词,我们训练了一个基于关键词的ColBERT排序器(ColBERTKP_QD),以提高处理关键词查询时的性能。此外,为了减少训练完整ColBERT模型的成本,我们研究了只训练关键词查询编码器而保持文档编码器权重静态(ColBERTKP_Q)的可行性。我们使用自动生成的和手动标注的关键词评估了我们提出的排序性能。我们的结果揭示了在关键词搜索场景下,后期交互架构的潜力。"
},
{
"title": "Enhancing Recommendation Systems with GNNs and Addressing Over-Smoothing",
"url": "http://arxiv.org/abs/2412.03097v1",
"pub_date": "2024-12-04",
"summary": "This paper addresses key challenges in enhancing recommendation systems by leveraging Graph Neural Networks (GNNs) and addressing inherent limitations such as over-smoothing, which reduces model effectiveness as network hierarchy deepens. The proposed approach introduces three GNN-based recommendation models, specifically designed to mitigate over-smoothing through innovative mechanisms like residual connections and identity mapping within the aggregation propagation process. These modifications enable more effective information flow across layers, preserving essential user-item interaction details to improve recommendation accuracy. Additionally, the study emphasizes the critical need for interpretability in recommendation systems, aiming to provide transparent and justifiable suggestions tailored to dynamic user preferences. By integrating collaborative filtering with GNN architectures, the proposed models not only enhance predictive accuracy but also align recommendations more closely with individual behaviors, adapting to nuanced shifts in user interests. This work advances the field by tackling both technical and user-centric challenges, contributing to the development of robust and explainable recommendation systems capable of managing the complexity and scale of modern online environments.",
"translated": "本文探讨了通过利用图神经网络(GNNs)并解决诸如过平滑等固有限制来增强推荐系统的关键挑战,过平滑问题随着网络层次的加深降低了模型的有效性。提出的方法引入了三种基于GNN的推荐模型,这些模型通过在聚合传播过程中引入创新机制(如残差连接和恒等映射)来专门设计以缓解过平滑问题。这些修改使得信息在各层之间更有效地流动,保留了关键的用户-项目交互细节,从而提高了推荐的准确性。此外,研究强调了推荐系统中可解释性的重要性,旨在提供透明且合理的建议,以适应动态用户偏好。通过将协同过滤与GNN架构相结合,所提出的模型不仅提高了预测准确性,还使推荐更紧密地符合个体行为,适应用户兴趣的细微变化。这项工作通过解决技术和以用户为中心的挑战,推动了该领域的发展,为开发能够处理现代在线环境复杂性和规模的强大且可解释的推荐系统做出了贡献。"
},
{
"title": "CLAS: A Machine Learning Enhanced Framework for Exploring Large 3D\n Design Datasets",
"url": "http://arxiv.org/abs/2412.02996v1",
"pub_date": "2024-12-04",
"summary": "Three-dimensional (3D) objects have wide applications. Despite the growing interest in 3D modeling in academia and industries, designing and/or creating 3D objects from scratch remains time-consuming and challenging. With the development of generative artificial intelligence (AI), designers discover a new way to create images for ideation. However, generative AIs are less useful in creating 3D objects with satisfying qualities. To allow 3D designers to access a wide range of 3D objects for creative activities based on their specific demands, we propose a machine learning (ML) enhanced framework CLAS - named after the four-step of capture, label, associate, and search - to enable fully automatic retrieval of 3D objects based on user specifications leveraging the existing datasets of 3D objects. CLAS provides an effective and efficient method for any person or organization to benefit from their existing but not utilized 3D datasets. In addition, CLAS may also be used to produce high-quality 3D object synthesis datasets for training and evaluating 3D generative models. As a proof of concept, we created and showcased a search system with a web user interface (UI) for retrieving 6,778 3D objects of chairs in the ShapeNet dataset powered by CLAS. In a close-set retrieval setting, our retrieval method achieves a mean reciprocal rank (MRR) of 0.58, top 1 accuracy of 42.27%, and top 10 accuracy of 89.64%.",
"translated": "三维(3D)物体具有广泛的应用。尽管学术界和工业界对3D建模的兴趣日益增长,但从头设计和/或创建3D物体仍然耗时且具有挑战性。随着生成式人工智能(AI)的发展,设计师发现了一种新的图像创作方式以激发创意。然而,生成式AI在创建质量令人满意的3D物体方面效果不佳。为了使3D设计师能够根据其特定需求访问广泛的3D物体以进行创意活动,我们提出了一种名为CLAS的机器学习(ML)增强框架——取名自捕捉、标记、关联和搜索的四个步骤——以实现基于用户规范的3D物体全自动检索,利用现有的3D物体数据集。CLAS为任何个人或组织提供了一种有效且高效的方法,使其能够从现有但未被利用的3D数据集中受益。此外,CLAS还可用于生成高质量的3D物体合成数据集,以用于训练和评估3D生成模型。作为概念验证,我们创建并展示了一个带有网页用户界面(UI)的搜索系统,该系统基于CLAS在ShapeNet数据集中检索了6,778个椅子3D物体。在封闭集检索设置中,我们的检索方法达到了平均倒数排名(MRR)为0.58,前1准确率为42.27%,前10准确率为89.64%。"
}
]
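For readers who want to work with the updated arxiv.json, a minimal sketch of loading and querying it. The schema (title, url, pub_date, summary, translated) is taken from the entries above; the inline sample, the helper name `ids_for_date`, and the truncated field values are illustrative assumptions, not part of the repository:

```python
import json

# Minimal inline sample mirroring the arxiv.json schema shown in this diff.
# (Illustrative stand-in; the real file holds tens of thousands of entries,
# and the "summary"/"translated" values are elided here.)
sample = json.loads("""
[
  {
    "title": "YT-30M: A multi-lingual multi-category dataset of YouTube comments",
    "url": "http://arxiv.org/abs/2412.03465v1",
    "pub_date": "2024-12-04",
    "summary": "...",
    "translated": "..."
  }
]
""")

def ids_for_date(entries, date):
    """Return the arXiv IDs (the last path segment of each entry's URL)
    for entries published on the given date."""
    return [e["url"].rsplit("/", 1)[-1] for e in entries if e["pub_date"] == date]

print(ids_for_date(sample, "2024-12-04"))  # -> ['2412.03465v1']
```

In practice one would replace the inline sample with `json.load(open("arxiv.json"))`; the filter works the same way since every entry in the diff carries the same five keys.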
