Updated weekly; follow the course schedule.
-
Course Management: Online slides
-
Introduction to Big Data Platforms: PDF Slides
-
Architecting Big Data Platforms: PDF Slides
- Shadi Khalifa, Yehia Elshater, Kiran Sundaravarathan, Aparna Bhat, Patrick Martin, Fahim Imam, Dan Rope, Mike Mcroberts, and Craig Statchuk. 2016. The Six Pillars for Building Big Data Analytics Ecosystems. ACM Comput. Surv. 49, 2, Article 33 (August 2016), 36 pages. DOI: https://doi.org/10.1145/2963143
- NIST Big Data Interoperability Framework: the reference architecture
- Chapter 1: Martin Kleppmann, Designing Data-Intensive Applications, O'Reilly Media, 1st edition (April 11, 2017)
- Pulkit Agrawal, Rajat Arya, Aanchal Bindal, Sandeep Bhatia, Anupriya Gagneja, Joseph Godlewski, Yucheng Low, Timothy Muss, Mudit Manu Paliwal, Sethu Raman, Vishrut Shah, Bochao Shen, Laura Sugden, Kaiyu Zhao, and Ming-Chuan Wu. 2019. Data Platform for Machine Learning. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19). ACM, New York, NY, USA, 1803-1816. DOI: https://doi.org/10.1145/3299869.3314050
- Jordan Tigani, Big Data is Dead
-
Service and Integration Models in Big Data Platforms: Slides
- Adam Jacobs. 2009. The pathologies of big data. Commun. ACM 52, 8 (August 2009), 36-44. DOI: https://doi.org/10.1145/1536616.1536632
- Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. 2010. ZooKeeper: wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX conference on USENIX annual technical conference (USENIXATC'10). USENIX Association, Berkeley, CA, USA, 11-11. (https://www.usenix.org/legacy/events/atc10/tech/full_papers/Hunt.pdf)
- J. Lin, "The Lambda and the Kappa," in IEEE Internet Computing, vol. 21, no. 5, pp. 60-66, 2017, doi: 10.1109/MIC.2017.3481351.
-
Edge Cloud Infrastructures for Big Data Platforms: download PDF
-
A Recap on Performance, Dependability, and Fault Tolerance in Distributed Systems: download PDF
-
Some industrial and open source big data platforms for your tech radar: Slides
-
Data Services: Online Slides (download PDF)
- Data Services - Exploring the technology trends in basic, integrated, and cloud data services, CACM.
- Chapters 5 & 9: Martin Kleppmann, Designing Data-Intensive Applications, O'Reilly Media, 1st edition (April 11, 2017)
- http://cacm.acm.org/news/200095-the-data-lake-concept-is-maturing/fulltext
- Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. 2006. Bigtable: a distributed storage system for structured data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7 (OSDI '06), Vol. 7.
- Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. 2017. Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD '17).
- CAP: http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
- Eventual consistency
- Consistency Tradeoffs in Modern Distributed Database System Design
- A Model and Survey of Distributed Data-Intensive Systems
-
Additional material for Big Data Storage and Database Services: common systems & integration problems (download PDF)
-
A short example of metadata: Video
-
Big Data Ingestion, Transformation and Orchestration: Online slides (download PDF)
-
Apache Kafka for Streaming Data Ingestion - The Core: PDF, basic Kafka setup and examples of data ingestion with Kafka
-
Hadoop and its Big Data Ecosystems: Online slides (download PDF)
-
Some case studies for Hadoop and data ingestion:
- K. Shvachko, H. Kuang, S. Radia and R. Chansler, "The Hadoop Distributed File System," 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), Incline Village, NV, 2010, pp. 1-10. doi: 10.1109/MSST.2010.5496972
- Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. 2013. Apache Hadoop YARN: yet another resource negotiator. In Proceedings of the 4th annual Symposium on Cloud Computing (SOCC '13). ACM, New York, NY, USA, Article 5, 16 pages. DOI: https://doi.org/10.1145/2523616.2523633
- Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. 2009. Hive: a warehousing solution over a map-reduce framework. Proc. VLDB Endow. 2, 2 (August 2009), 1626-1629. DOI: https://doi.org/10.14778/1687553.1687609
- Roshan Sumbaly, Jay Kreps, and Sam Shah. 2013. The big data ecosystem at LinkedIn. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD '13). ACM, New York, NY, USA, 1125-1134. DOI: http://dx.doi.org/10.1145/2463676.2463707
-
Programming Models for Data Processing: Online slides (download PDF), The MapReduce Programming Model
- Belcastro, L., Cantini, R., Marozzo, F. et al., Programming big data analysis: principles and solutions
- Matei Zaharia, Bill Chambers, Spark: The Definitive Guide, Book, Code
- Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. 2016. Apache Spark: a unified engine for big data processing. Commun. ACM 59, 11 (October 2016), 56-65. DOI: https://doi.org/10.1145/2934664
- Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113. DOI: https://doi.org/10.1145/1327452.1327492
-
Workflows for Big Data Platforms: Online slides (download PDF)
- Running Apache Airflow at Lyft
- Mutaz Barika, Saurabh Garg, Albert Y. Zomaya, Lizhe Wang, Aad Van Moorsel, and Rajiv Ranjan. 2019. Orchestrating Big Data Analysis Workflows in the Cloud: Research Challenges, Survey, and Future Directions. ACM Comput. Surv. 52, 5, Article 95 (September 2019), 41 pages. DOI: https://doi.org/10.1145/3332301
- How Agari Uses Airbnb's Airflow as a Smarter Cron
- Ewa Deelman, Karan Vahi, Mats Rynge, Rajiv Mayani, Rafael Ferreira da Silva, George Papadimitriou, Miron Livny: The Evolution of the Pegasus Workflow Management Software. Computing in Science and Engineering 21(4): 22-36 (2019)
- Mohammad Islam, Angelo K. Huang, Mohamed Battisha, Michelle Chiang, Santhosh Srinivasan, Craig Peters, Andreas Neumann, and Alejandro Abdelnur. 2012. Oozie: towards a scalable workflow management system for Hadoop. In Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies (SWEET '12). ACM, New York, NY, USA, Article 4, 10 pages. DOI: https://doi.org/10.1145/2443416.2443420
- Zijun Li, Yushi Liu, Linsong Guo, Quan Chen, Jiagan Cheng, Wenli Zheng, and Minyi Guo. 2022. FaaSFlow: enable efficient workflow execution for function-as-a-service. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2022). Association for Computing Machinery, New York, NY, USA, 782–796. DOI: https://doi.org/10.1145/3503222.3507717
-
Stream Processing and Big Data Platforms: Online slides (download PDF)
- Gianpaolo Cugola and Alessandro Margara, Processing flows of information: From data stream to complex event processing
- Martin Hirzel, Guillaume Baudart, Angela Bonifati, Emanuele Della Valle, Sherif Sakr, and Akrivi Vlachou. 2018. Stream Processing Languages in the Big Data Era. SIGMOD Rec. 47, 2 (December 2018), 29-40. DOI: https://doi.org/10.1145/3299887.3299892
- Tyler Akidau, Streaming 101: The world beyond batch. A high-level tour of modern data-processing concepts. August 5, 2015. Link
- Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, Sam Whittle: The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. Proc. VLDB Endow. 8(12): 1792-1803 (2015), http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
- Ellen Friedman and Kostas Tzoumas, Introduction to Apache Flink, Link
- Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, and Ion Stoica. 2012. Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. In Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing (HotCloud'12). USENIX Association, Berkeley, CA, USA, 10-10. Link
-
Big Data Platforms in the Age of LLMs/Gen-AI: Online slides (download PDF)