A curated list of awesome publications and researchers on log analysis, anomaly detection, fault localization, and AIOps.
| China (& HK SAR) | | | | |
|---|---|---|---|---|
| Michael R. Lyu, CUHK | Dongmei Zhang, Microsoft | Pengfei Chen, SYSU | Dan Pei, Tsinghua | |
| Pinjia He, CUHK-Shenzhen | | | | |
| USA | | | | |
| Yuanyuan Zhou, UCSD | Tao Xie, UIUC | Dawson Engler, Stanford | Ben Liblit, Wisconsin–Madison | |
| Canada | | | | |
| Ding Yuan, University of Toronto | Ahmed E. Hassan, Queen's University | Weiyi Shang, Concordia University | Zhen Ming (Jack) Jiang, York University | |
| Wahab Hamou-Lhadj, Concordia University | | | | |
| UK | | | | |
| Europe | | | | |
| Australia | | | | |
| Ingo Weber, CSIRO | | | | |
Logs are valuable data generated by many sources, such as software, systems, networks, and devices. They are used for a range of tasks related to reliability, security, performance, and energy. As a result, log analysis research has attracted interest from several research areas.
- System area
- Cloud computing area
- Networking area
- Software engineering area
- Reliability area
- Security area
- AI and Big Data area
- Industrial conferences
Loghub
- [ACM Computing Surveys] A Survey on Automated Log Analysis for Reliability Engineering
- [Blog] What is AIOps? Artificial Intelligence for IT Operations Explained
- [Book'14] I Heart Logs
- [Book'12] Logging and Log Management: The Authoritative Guide to Understanding the Concepts Surrounding Logging and Log Management, by Anton A. Chuvakin, Kevin J. Schmidt, Christopher Phillips.
- [Thesis] Log Engineering: Towards Systematic Log Mining to Support the Development of Ultra-large Scale Systems
- [IST'20] A Systematic Literature Review on Automated Log Abstraction Techniques
- [IEEE Software'16] Operational-Log Analysis for Big Data Systems: Challenges and Solutions
- [OSDI 2012] Be Conservative: Enhancing Failure Diagnosis with Proactive Logging
- [TSE 2013] Event Logs for the Analysis of Software Failures: A Rule-Based Approach
- [ICSE 2015] Learning to Log: Helping Developers Make Informed Logging Decisions
- [ICSE 2014] Where do developers log? an empirical study on logging practices in industry
- [ATC 2015] Log2: A Cost-Aware Logging Mechanism for Performance Diagnosis
- [SOSP 2017] Log20: Fully Automated Optimal Placement of Log Printing Statements under Specified Overhead Threshold
- [HotOS 2017] The Game of Twenty Questions: Do You Know Where to Log?
- [ASE 2020] Where Shall We Log? Studying and Suggesting Logging Locations in Code Blocks
- [ASPLOS 2011] Improving Software Diagnosability via Log Enhancement
- [ASE 2018] Characterizing the Natural Language Descriptions in Software Logging Statements
- [TSE 2019] Which Variables Should I Log?
- [ICPC 2019] PADLA: a dynamic log level adapter using online phase detection
- [ECOOP 1997] Aspect-oriented programming
- [DSN 2010] Assessing and improving the effectiveness of logs for the analysis of software faults
- [ICSE 2012] Characterizing logging practices in open-source software
- [ICSME 2014] Understanding Log Lines Using Development Knowledge
- [ICSE 2015] Industry practices and event logging: assessment of a critical software development process
- [ESE 2015] Studying the relationship between logging characteristics and the code quality of platform software
- [ICSE 2017] Characterizing and Detecting Anti-patterns in the Logging Code
- [OSDI 2018] The FuzzyLog: A Partially Ordered Shared Log
- [ATC 2018] Troubleshooting Transiently-Recurring Errors in Production Systems with Blame-Proportional Logging
- [ATC 2018] NanoLog: A Nanosecond Scale Logging System
- [NSDI 2018] Carousel: Scalable Logging for Intrusion Prevention Systems
- [ICSE 2019] DLFinder: Characterizing and Detecting Duplicate Logging Code Smells
- [ICSE 2016] The bones of the system: a case study of logging and telemetry at Microsoft
- [MSR 2016] Logging library migrations: a case study for the apache software foundation projects
- [ESE 2017] Characterizing logging practices in Java-based open source software projects - a replication study in Apache Software Foundation
- [ESE 2018] Studying and detecting log-related issues
- [ESE 2018] Examining the stability of logging statements
- [ESE 2018] An exploratory study on assessing the energy impact of logging on Android applications
- [ESE 2019] Studying the characteristics of logging practices in mobile apps: a case study on F-Droid
- [ICSE 2020] Studying the Use of Java Logging Utilities in the Wild
- [EMSE 2022] The Sense of Logging in the Linux Kernel
- [IPDPS 2006] Lossless compression for large scale cluster logs
- [ADBIS 2007] Fast and efficient log file compression
- [ICSE 2008] An Industrial Case Study of Customizing Operational Profiles Using Log Compression
- [SIGMOD 2013] Adaptive log compression for massive log data
- [IEEE Trustcom/BigDataSE/ISPA 2016] MLC: An Efficient Multi-level Log Compression Method for Cloud Backup Systems
- [TCSET 2008] Sub-atomic field processing for improved web log compression
- [CCGRID 2015] Cowic: A column-wise independent compression for log stream analysis
- [IMCC 2014] Lightweight Packing of Log Files for Improved Compression in Mobile Tactical Networks
- [DCC 2004] High density compression of log files
- [DaWaK 2003] Comprehensive Log Compression with Frequent Patterns
- [ICEIS 2019] Rough Logs: A Data Reduction Approach for Log Files
- [ASE 2019] Logzip: extracting hidden structures via iterative clustering for log compression
- [EMSE 2019] A Study of the Performance of General Compressors on Log Files
- [Ph.D. Dissertation 2008] Using semantic knowledge to improve compression on log files
- [IPOM'03] A Data Clustering Algorithm for Mining Patterns from Event Logs
- [QSIC'08] Abstracting Execution Logs to Execution Events for Enterprise Applications
- [ICDM'09] Execution Anomaly Detection in Distributed Systems through Unstructured Log Analysis
- [MSR'10] Abstracting Log Lines to Log Event Types for Mining Software System Logs
- [CIKM'11] LogSig: Generating System Events from Raw Textual Logs
- [KDD'09] Clustering Event Logs Using Iterative Partitioning
- [TKDE'12] A Lightweight Algorithm for Message Type Extraction in System Application Logs
- [CNSM'15] LogCluster - A Data Clustering and Pattern Mining Algorithm for Event Logs
- [CIKM'16] LogMine: Fast Pattern Recognition for Log Analytics
- [TDSC'18] Towards Automated Log Parsing for Large-Scale Log Data Analysis
- [ICPC'18] A Search-based Approach for Accurate Identification of Log Message Formats
- [SCC'13] Incremental Mining of System Log Format
- [arXiv'15] Length Matters: Clustering System Log Messages using Length of Words
- [TKDE'18] Spell: Online Streaming Parsing of Large Unstructured System Logs
- [ICWS'17] Drain: An Online Log Parsing Approach with Fixed Depth Tree
- [arXiv'18] A Directed Acyclic Graph Approach to Online Log Parsing
- [TSE'20] Logram: Efficient Log Parsing Using n-Gram Dictionaries
- [ICSE-SEIP'19] Tools and benchmarks for automated log parsing
- [ICSME'22] An Effective Approach for Parsing Large Log Files
- [OSDI 2016] Non-intrusive performance profiling for entire software stacks based on the flow reconstruction principle
- [FSE 2018] Using finite-state models for log differencing
- [ICSE 2016] Behavioral log analysis with statistical guarantees
- [FSE 2011] Leveraging existing instrumentation to automatically infer invariant-constrained models
- [KDD 2010] Mining program workflow from interleaved traces
- [ICSE 2014] Inferring models of concurrent systems from logs of their behavior with CSight
- [ASE 2019] Statistical log differencing
- [SOSP 2009] Detecting Large-Scale System Problems by Mining Console Logs
- [IPOM 2003] A data clustering algorithm for mining patterns from event logs
- [FSE 2018] Identifying impactful service system problems via log analysis
- [ICSE 2016] Log clustering based problem identification for online service systems
- [ICDM 2007] Failure prediction in ibm bluegene/l event logs
- [IEICE Transactions on Communications 2018] Proactive failure detection learning generation patterns of large-scale network logs
- [ISSRE 2015] Experience report: Anomaly detection of cloud application operations using log and cloud metric correlation analysis
- [USENIX ATC 2010] Mining Invariants from Console Logs for System Problem Detection
- [ICSE 2013] Assisting developers of big data analytics applications when deploying on hadoop clouds
- [ICDM 2009] Online system problem detection by mining patterns of console logs
- [ISSRE 2017] Experience report: Log-based behavioral differencing
- [KDD 2016] Anomaly detection using program control flow graph mining from execution logs
- [ICDM 2009] Execution Anomaly Detection in Distributed Systems through Unstructured Log Analysis
- [ASPLOS 2016] Cloudseer: Workflow monitoring of cloud infrastructures via interleaved logs
- [KDD 2005] Dynamic syslog mining for network failure monitoring
- [ISSRE 2016] Experience report: System log analysis for anomaly detection
- [CCS 2017] Deeplog: Anomaly detection and diagnosis from system logs through deep learning
- [FSE 2019] Robust log-based anomaly detection on unstable log data
- [IJCAI 2019] LogAnomaly: Unsupervised Detection of Sequential and Quantitative Anomalies in Unstructured Logs
- [ICCCN 2020] Semantic-aware Representation Framework for Online Log Analysis
- [TCCN 2020] An Intelligent Anomaly Detection Scheme for Micro-services Architectures with Temporal and Spatial Data Analysis
- [ISSRE 2020] Cross-System Log Anomaly Detection for Software Systems (to appear)
- [Information Systems Frontiers 2020] LogGAN: a Log-level Generative Adversarial Network for Anomaly Detection using Permutation Event Modeling
- [DASC/PiCom/DataCom/CyberSciTech 2018] Detecting anomaly in big data system logs using convolutional neural network
- [CCS 2019] Log2vec: A Heterogeneous Graph Embedding Based Approach for Detecting Cyber Threats within Enterprise
- [MLCS 2018] Recurrent Neural Network Attention Mechanisms for Interpretable System Log Anomaly Detection
- [POMACS18] PreFix: Switch failure prediction in datacenter networks
- [HPDC18] Desh: deep learning for system health prediction of lead times to failure in HPC
- [KDD03] Critical event prediction for proactive management in large-scale computer clusters
- [IPDPS20] Aarohi: Making real-time node failure prediction feasible
- [CLUSTER17] Data Mining-Based Analysis of HPC Center Operations
- [CLUSTER14] Exploring void search for fault detection on extreme scale systems
- [WWW19] Outage Prediction and Diagnosis for Cloud Service Systems
- [FSE18] Predicting Node failure in cloud service systems
- [FSE19] Latent error prediction and fault localization for microservice applications by learning from system trace logs
- [ICSE 2019] An empirical study on leveraging logs for debugging production failures
- [ASPLOS 2010] SherLog: error diagnosis by connecting clues from run-time logs
- [ISSTA 2009] AVA: automated interpretation of dynamically detected anomalies
- [IC2E 2016] LOGAN: Problem diagnosis in the cloud using log-based reference models
- [ICWS 2017] An approach for anomaly diagnosis based on hybrid graph model with logs for distributed services
- [Cloud 2017] Logsed: Anomaly diagnosis through mining time-weighted control flow graph in logs
- [FSE 2018] CloudRaid: hunting concurrency bugs in the cloud via log-mining
- [TPDS 2013] Toward fine-grained, unsupervised, scalable performance diagnosis for production cloud computing systems
- [CLUSTER 2014] Digging deeper into cluster system logs for failure prediction and root cause diagnosis
- [ASPLOS 2014] Comprehending performance from real-world execution traces: A device-driver case
- [ICWS 2017] Log-based abnormal task detection and root cause analysis for spark
- [EDCC 2015] Insights into the diagnosis of system failures from cluster message logs
- [HPC 2010] Diagnosing the root-causes of failures from cluster log files
- [ASE 2019] SCMiner: localizing system-level concurrency faults from large system call traces
- [NSDI 2012] Structured comparative analysis of systems logs to diagnose performance problems
- [ICSE 2013] Assisting developers of big data analytics applications when deploying on hadoop clouds
- [TPDS 2016] Failure diagnosis for distributed systems using targeted fault injection
- [ICSE 2017] What causes my test alarm? Automatic cause analysis for test alarms in system and integration testing
- [GLOBECOM 2018] Root-Cause Diagnosis Using Logs Generated by User Actions
- [ICSE 2019] Mining Historical Issue Repositories to Heal Large-Scale Online Service Systems
- [CLOUD 2019] An Approach to Cloud Execution Failure Diagnosis Based on Exception Logs in OpenStack
- [FAST 2009] Understanding customer problem troubleshooting from storage system logs
- [DSN 2013] Reading between the lines of failure logs: Understanding how HPC systems fail
- [DSN 2014] What logs should you look at when an application fails? insights from an industrial case study
- [TSE 2018] Fault analysis and debugging of microservice systems: Industrial survey, benchmark system, and empirical study
- [FSE 2019] How bad can a bug get? an empirical analysis of software failures in the OpenStack cloud computing platform
- [DSN14] Mining Historical Issue Repositories to Heal Large-Scale Online Service Systems
- [ASE98] Testing using log file analysis: Tools, methods, and issues
- [ASE18] An automated approach to estimating code coverage measures via execution logs
- [ASE19] An experience report of generating load tests using log-recovered workloads at varying granularities of user behaviour
- [ASE15] Have we seen enough traces? (T)
- [ICSE08] An approach to detecting duplicate bug reports using natural language and execution information
This repo is under the MIT license.