diff --git a/contents/data_engineering/data_engineering.qmd b/contents/data_engineering/data_engineering.qmd
index 88a18be3..b658fd72 100644
--- a/contents/data_engineering/data_engineering.qmd
+++ b/contents/data_engineering/data_engineering.qmd
@@ -8,7 +8,7 @@ bibliography: data_engineering.bib
 Resources: [Slides](#sec-data-engineering-resource), [Videos](#sec-data-engineering-resource), [Exercises](#sec-data-engineering-resource), [Labs](#sec-data-engineering-resource)
 :::
-![_DALL·E 3 Prompt: Create a rectangular illustration visualizing the concept of data engineering. Include raw data sources, data processing pipelines, storage systems, and refined datasets. Show how raw data is transformed through cleaning, processing, and storage to become valuable information that can be analyzed and used for decision-making._](images/png/cover_data_engineering.png)
+![_DALL·E 3 Prompt: Create a rectangular illustration visualizing the concept of data engineering. Include elements such as raw data sources, data processing pipelines, storage systems, and refined datasets. Show how raw data is transformed through cleaning, processing, and storage to become valuable information that can be analyzed and used for decision-making._](images/png/cover_data_engineering.png)
 Data is the lifeblood of AI systems. Without good data, even the most advanced machine-learning algorithms will not succeed. However, TinyML models operate on devices with limited processing power and memory. This section explores the intricacies of building high-quality datasets to fuel our AI models. Data engineering involves collecting, storing, processing, and managing data to train machine learning models.
@@ -33,13 +33,6 @@ Data is the lifeblood of AI systems. Without good data, even the most advanced m
 :::
 ## Introduction
-Imagine a world where AI can diagnose diseases with unprecedented accuracy, but only if the data used to train it is unbiased and reliable. This is where data engineering comes in.
While over 90% of the world's data has been created in the past two decades, this vast amount of information is only helpful for building effective AI models with proper processing and preparation. Data engineering bridges this gap by transforming raw data into a high-quality format that fuels AI innovation.
-In today's data-driven world, protecting user privacy is paramount. Whether mandated by law or driven by user concerns, anonymization techniques like differential privacy and aggregation are vital in mitigating privacy risks. However, careful implementation is crucial to ensure these methods don't compromise data utility. Dataset creators face complex privacy and representation challenges when building high-quality training data, especially for sensitive domains like healthcare. Legally, creators may need to remove direct identifiers like names and ages. Even without legal obligations, removing such information can help build user trust. However, excessive anonymization can compromise dataset utility. Techniques like differential privacy$^{1}$, aggregation, and reducing detail provide alternatives to balance privacy and utility but have downsides. Creators must strike a thoughtful balance based on the use case.
-
-
-While privacy is paramount, ensuring fair and robust AI models requires addressing representation gaps in the data. It is crucial yet insufficient to ensure diversity across individual variables like gender, race, and accent. These combinations, sometimes called higher-order gaps, can significantly impact model performance. For example, a medical dataset could have balanced gender, age, and diagnosis data individually, but it lacks enough cases to capture older women with a specific condition. Such [higher-order gaps](https://blog.google/technology/health/healthcare-ai-systems-put-people-center/) are not immediately obvious but can critically impact model performance.
-
-Creating useful, ethical training data requires holistic consideration of privacy risks and representation gaps. Elusive perfect solutions necessitate conscientious data engineering practices like anonymization, aggregation, under-sampling of overrepresented groups, and synthesized data generation to balance competing needs. This facilitates models that are both accurate and socially responsible. Cross-functional collaboration and external audits can also strengthen training data. The challenges are multifaceted but surmountable with thoughtful effort.
 Imagine a world where AI can diagnose diseases with unprecedented accuracy, but only if the data used to train it is unbiased and reliable. This is where data engineering comes in. While over 90% of the world's data has been created in the past two decades, this vast amount of information is only helpful for building effective AI models with proper processing and preparation. Data engineering bridges this gap by transforming raw data into a high-quality format that fuels AI innovation.
 In today's data-driven world, protecting user privacy is paramount. Whether mandated by law or driven by user concerns, anonymization techniques like differential privacy and aggregation are vital in mitigating privacy risks. However, careful implementation is crucial to ensure these methods don't compromise data utility. Dataset creators face complex privacy and representation challenges when building high-quality training data, especially for sensitive domains like healthcare. Legally, creators may need to remove direct identifiers like names and ages. Even without legal obligations, removing such information can help build user trust. However, excessive anonymization can compromise dataset utility. Techniques like differential privacy$^{1}$, aggregation, and reducing detail provide alternatives to balance privacy and utility but have downsides. Creators must strike a thoughtful balance based on the use case.
@@ -87,59 +80,52 @@ Generally, in ML, problem definition has a few key steps:
 A solid project foundation is essential for its trajectory and eventual success. Central to this foundation is first identifying a clear problem, such as ensuring that voice commands in voice assistance systems are recognized consistently across varying environments. Clear objectives, like creating representative datasets for diverse scenarios, provide a unified direction. Benchmarks, such as system accuracy in keyword detection, offer measurable outcomes to gauge progress. Engaging with stakeholders, from end-users to investors, provides invaluable insights and ensures alignment with market needs. Additionally, understanding platform constraints is important when exploring areas like voice assistance. Embedded systems, such as microcontrollers, come with inherent processing power, memory, and energy efficiency limitations. Recognizing these limitations ensures that functionalities, like keyword detection, are tailored to operate optimally, balancing performance with resource conservation.
-Understanding platform constraints is also pivotal when delving into areas like voice assistance. Embedded systems, such as microcontrollers, have inherent processing power, memory, and energy efficiency limitations. Recognizing these limitations is especially crucial for data engineers, as it impacts data collection, pre-processing, and model training on these devices. Functionalities like keyword detection must be tailored to operate optimally, balancing performance with resource conservation while ensuring high data quality.
 In this context, using KWS as an example, we can break each of the steps out as follows:
 1. **Identifying the Problem:** At its core, KWS detects specific keywords amidst ambient sounds and other spoken words.
The primary problem is to design a system that can recognize these keywords with high accuracy, low latency, and minimal false positives or negatives, especially when deployed on devices with limited computational resources.
 2. **Setting Clear Objectives:**
-Designing a KWS system for TinyML involves navigating trade-offs between various critical factors. Key objectives include:
-* ** High Accuracy: ** Ensuring the system accurately identifies keywords, often aiming for a specific accuracy rate (e.g., 98%).
-* ** Seamless User Experience: ** Minimizing detection time to maintain fluid interactions, with a common target being keyword detection and response within 200 milliseconds.
-* ** Power Efficiency: ** Reducing power consumption to extend battery life on embedded devices is crucial for many TinyML applications.
-* ** Optimized Model Size: ** Due to TinyML devices' inherent memory limitations, it is essential to adjust the model's size to fit within the device's memory constraints.
-These objectives are theoretical concepts crucial for successfully deploying KWS systems in real-world TinyML applications. They ensure the systems are both practical and efficient within the constraints of the devices they operate on. The following sections will delve deeper into the trade-offs between these objectives, providing a comprehensive understanding of the design considerations involved.
+ The objectives for a KWS system might include:
+ * Achieving a specific accuracy rate (e.g., 98% accuracy in keyword detection).
+ * Ensuring low latency (e.g., keyword detection and response within 200 milliseconds).
+ * Minimizing power consumption to extend battery life on embedded devices.
+ * Ensuring the model's size is optimized for the available memory on the device.
 3. **Benchmarks for Success:**
- Establishing clear metrics to evaluate the performance of the KWS system is essential for ensuring its effectiveness and efficiency.
These metrics should directly relate to the objectives outlined earlier. Common metrics include:
-* **True Positive Rate (TPR): ** The percentage of correctly identified keywords. This metric reflects the system's ability to detect target keywords accurately.
-* **False Positive Rate (FPR): ** The percentage of non-keywords incorrectly identified as keywords. A low FPR is crucial to avoid unnecessary system activations.
-* **Response Time: ** The time between keyword utterance and system response elapsed. This metric is critical for a seamless user experience.
-* **Power Consumption: ** This is the average power consumption during keyword detection. Minimizing power consumption is essential for battery-powered embedded devices.
-Systematically measuring these metrics ensures the KWS system meets real-world application performance standards.
+ Establish clear metrics to measure the success of the KWS system. This could include:
+ * True Positive Rate: The percentage of correctly identified keywords.
+ * False Positive Rate: The percentage of non-keywords incorrectly identified as keywords.
+ * Response Time: The time taken from keyword utterance to system response.
+ * Power Consumption: Average power used during keyword detection.
 4. **Stakeholder Engagement and Understanding:**
- Engaging actively with stakeholders throughout the KWS development process is crucial. Stakeholders typically include device manufacturers, hardware and software developers, and end-users. Understanding their needs, capabilities, and constraints is vital for designing a successful system.
-* Device manufacturers might prioritize low power consumption to extend battery life.
-*Software developers might emphasize ease of integration with existing software frameworks.
-*End-users would prioritize high accuracy and responsiveness for a positive user experience.
-*Effective stakeholder engagement ensures the KWS system meets all parties' diverse requirements.
+ Engage with stakeholders, which include device manufacturers, hardware and software developers, and end-users. Understand their needs, capabilities, and constraints. For instance:
+ * Device manufacturers might prioritize low power consumption.
+ * Software developers might emphasize ease of integration.
+ * End-users would prioritize accuracy and responsiveness.
 5. **Understanding the Constraints and Limitations of Embedded Systems:**
-Embedded devices present unique challenges for KWS system design. Key constraints include:
-
-**Memory Limitations: ** KWS models must be lightweight to fit within the limited memory of embedded devices. Typically, models need to be as small as 16KB to operate within the always-on island of the System-on-Chip (SoC). This includes the model size and any additional application code for preprocessing.
-* **Processing Power: ** The computational capabilities of embedded devices are often limited (a few hundred MHz of clock speed). Therefore, the KWS model must be optimized for efficient execution on these resource-constrained devices.
-* **Power Consumption: ** Since many embedded devices are battery-powered, the KWS system must be power-efficient to maximize battery life.
-* **Environmental Challenges: ** Devices might be deployed in various environments, from quiet bedrooms to noisy industrial settings. The KWS system must be robust enough to function effectively across these diverse scenarios.
-Addressing these constraints is essential for developing a functional and reliable KWS system for embedded devices.
+ Embedded devices come with their own set of challenges:
+ * Memory Limitations: KWS models must be lightweight to fit within the memory constraints of embedded devices. Typically, KWS models need to be as small as 16KB to fit in the always-on island of the SoC. Moreover, this is just the model size. Additional application code for preprocessing may also need to fit within the memory constraints.
+ * Processing Power: The computational capabilities of embedded devices are limited (a few hundred MHz of clock speed), so the KWS model must be optimized for efficiency.
+ * Power Consumption: Since many embedded devices are battery-powered, the KWS system must be power-efficient.
+ * Environmental Challenges: Devices might be deployed in various environments, from quiet bedrooms to noisy industrial settings. The KWS system must be robust enough to function effectively across these scenarios.
 6. **Data Collection and Analysis:**
-The quality and diversity of data are paramount for training a successful KWS system. Key considerations include:
-*Variety of Accents: Collect data from speakers with various accents to ensure the system can recognize keywords spoken in different dialects.
-*Background Noises: Include data samples with different ambient noises (e.g., traffic, music) to train the model for real-world scenarios.
-*Keyword Variations: People might pronounce keywords differently or have slight variations in the wake word itself. Ensure the dataset captures these nuances to improve recognition accuracy.
-Comprehensive data collection and analysis are foundational to the robustness of the KWS system.
+ For a KWS system, the quality and diversity of data are paramount. Considerations might include:
+ * Variety of Accents: Collect data from speakers with various accents to ensure wide-ranging recognition.
+ * Background Noises: Include data samples with different ambient noises to train the model for real-world scenarios.
+ * Keyword Variations: People might either pronounce keywords differently or have slight variations in the wake word itself. Ensure the dataset captures these nuances.
 7. **Iterative Feedback and Refinement:**
- Developing a KWS system is an iterative process. Once a prototype system is created, it's crucial to test it in real-world scenarios, gather feedback from users and stakeholders, and iteratively refine the model.
This ensures that the system remains significantly aligned with the defined problem and objectives as deployment scenarios and user needs evolve.
+ Once a prototype KWS system is developed, it's crucial to test it in real-world scenarios, gather feedback, and iteratively refine the model. This ensures that the system remains aligned with the defined problem and objectives. This is important because the deployment scenarios change over time as things evolve.
 :::{#exr-kws .callout-caution collapse="true"}
 ### Keyword Spotting with TensorFlow Lite Micro
-Explore a hands-on guide for building and deploying Keyword Spotting (KWS) systems using TensorFlow Lite Micro. Follow steps from data collection to model training and deployment to microcontrollers. Learn to create efficient KWS models that recognize specific keywords amidst background noise. Perfect for those interested in machine learning on embedded systems. Unlock the potential of voice-enabled devices with TensorFlow Lite Micro!
+Explore a hands-on guide for building and deploying Keyword Spotting (KWS) systems using TensorFlow Lite Micro. Follow steps from data collection to model training and deployment to microcontrollers. Learn to create efficient KWS models that recognize specific keywords amidst background noise. Perfect for those interested in machine learning on embedded systems. Unlock the potential of voice-enabled devices with TensorFlow Lite Micro!
 [![](https://colab.research.google.com/assets/colab-badge.png)](https://colab.research.google.com/drive/17I7GL8WTieGzXYKRtQM2FrFi3eLQIrOM)
 :::
@@ -148,7 +134,7 @@ The current chapter underscores the essential role of data quality in ML, using
 ## Data Sourcing
-The quality and diversity of data gathered are essential for developing accurate and robust AI systems, particularly for resource-constrained TinyML applications. Sourcing high-quality training data requires careful consideration of objectives, resource limitations, and ethical implications.
Data can be obtained from various sources depending on the needs of the project:
+The quality and diversity of data gathered are important for developing accurate and robust AI systems. Sourcing high-quality training data requires careful consideration of the objectives, resources, and ethical implications. Data can be obtained from various sources depending on the needs of the project:
 ### Pre-existing datasets
@@ -162,22 +148,9 @@ In addition, bias, validity, and reproducibility issues may exist in these datas
 ![Training different models on the same dataset. Source: (icons from left to right: Becris; Freepik; Freepik; Paul J; SBTS2018).](images/png/dataset_myopia.png){#fig-misalignment}
-
 ### Web Scraping
 Web scraping refers to automated techniques for extracting data from websites. It typically involves sending HTTP requests to web servers, retrieving HTML content, and parsing that content to extract relevant information. Popular tools and frameworks for web scraping include Beautiful Soup, Scrapy, and Selenium. These tools offer different functionalities, from parsing HTML content to automating web browser interactions, especially for websites that load content dynamically using JavaScript.
-Web scraping can effectively gather large datasets, particularly when human-labeled data is scarce. Here are some use cases:
-* **Computer Vision: **Web scraping has enabled the collection of massive images and videos for tasks like object recognition. Examples include datasets like ImageNet](https://www.image-net.org/) and [OpenImages](https://storage.googleapis.com/openimages/web/index.html). For example, one could scrape e-commerce sites to amass product photos for object recognition or social media platforms to collect user uploads for facial analysis. Even before ImageNet, Stanford’s[LabelMe](https://people.csail.mit.edu/torralba/publications/labelmeApplications.pdf) project scraped Flickr for over 63,000 annotated images covering hundreds of object categories.
-* **Natural Language Processing: **Researchers can scrape news sites, forums, or social media for tasks like sentiment analysis, dialogue systems research, or topic modeling. For example, the training data for chatbot ChatGPT was obtained by scraping much of the public Internet. GitHub repositories were scraped to train GitHub's Copilot AI coding assistant.
-* **Structured Data: **Web scraping can collect structured data like stock prices, weather data, or product information for analytical applications. Once data is scraped, storing it structured is essential, often using databases or data warehouses. Proper data management ensures the usability of the scraped data for future analysis and applications.
-However, web scraping has limitations and ethical considerations:
-* **Legal Restrictions: **Not all websites permit scraping, and violating these restrictions can lead to legal repercussions. Scraping copyrighted material or private communications is also unethical and potentially illegal. Ethical web scraping mandates adherence to a website's robots.txt file and respecting rate limits.
-* **Dynamic Content: **The dynamic nature of web content can challenge consistency, especially for longitudinal studies. However, emerging trends like [Web Navigation](https://arxiv.org/abs/1812.09195) leverage machine learning for navigating dynamic content.
-* **Data Quality: **The volume of pertinent data available for scraping might be limited for niche subjects. For example, while scraping for common topics like images of cats and dogs might yield abundant data, searching for rare medical conditions might be less fruitful. Moreover, the data obtained through scraping is often unstructured and noisy, necessitating thorough preprocessing and cleaning. It is crucial to understand that not all scraped data will be of high quality or accuracy. Employing verification methods, such as cross-referencing with alternate data sources, can enhance data reliability.
-* **Data Quality Considerations: **
-* ***Scarcity for Niche Domains: ** The volume of relevant data obtainable through scraping can be limited for specialized areas. This might restrict the applicability of web scraping for specific TinyML projects requiring data on less common subjects.
-* ***Unstructured and Noisy Data: **Scraped data is often unstructured and requires significant preprocessing and cleaning before effectively training TinyML models. This cleaning process can be resource-intensive, especially for TinyML environments with limited computational power.
-* ***Inconsistency and Accuracy: **The accuracy and consistency of scraped data can vary, potentially impacting the performance of TinyML models. Verification methods, such as cross-referencing with established datasets, become crucial for enhancing data reliability.
 Web scraping can effectively gather large datasets for training machine learning models, particularly when human-labeled data is scarce. For computer vision research, web scraping enables the collection of massive volumes of images and videos. Researchers have used this technique to build influential datasets like [ImageNet](https://www.image-net.org/) and [OpenImages](https://storage.googleapis.com/openimages/web/index.html). For example, one could scrape e-commerce sites to amass product photos for object recognition or social media platforms to collect user uploads for facial analysis. Even before ImageNet, Stanford's [LabelMe](https://people.csail.mit.edu/torralba/publications/labelmeApplications.pdf) project scraped Flickr for over 63,000 annotated images covering hundreds of object categories.
@@ -210,34 +183,37 @@ Discover the power of web scraping with Python using libraries like Beautiful So
 ### Crowdsourcing
-Crowdsourcing for Data Collection in TinyML
-Crowdsourcing for datasets involves obtaining data through the collective efforts of a vast, distributed group of participants, typically via the Internet.
This method leverages the services of many people, either from specific communities or the general public, rather than relying on a small team or a specific organization to collect or label data. Platforms like Amazon Mechanical Turk facilitate the distribution of annotation tasks to a large, diverse workforce, enabling the collection of labels for complex tasks such as sentiment analysis or image recognition that require human judgment.
-Crowdsourcing has emerged as an effective approach for data collection and problem-solving. One major advantage is scalability—by distributing tasks to a global pool of contributors on digital platforms, projects can quickly process huge volumes of data. This makes crowdsourcing ideal for large-scale data labeling, collection, and analysis.
-In addition, crowdsourcing taps into a diverse group of participants, bringing a wide range of perspectives, cultural insights, and language abilities that can enrich data and enhance creative problem-solving in ways that a more homogenous group may not. Because crowdsourcing draws from a large audience beyond traditional channels, it is often more cost-effective than conventional methods, especially for simpler microtasks.
-Crowdsourcing platforms provide significant flexibility, as task parameters can be adjusted in real-time based on initial results. This creates a feedback loop for iterative improvements to the data collection process. Complex jobs can be broken down into microtasks and distributed to multiple people, with results cross-validated by assigning redundant versions of the same task. When thoughtfully managed, crowdsourcing enables community engagement around a collaborative project, where participants find reward in contributing.
-* **Advantages for Specific Scenarios: ** While TinyML applications often require specialized sensor data, crowdsourcing can be advantageous for tasks where human perception and subjective labeling are crucial.
For example, crowdsourcing could be suitable for collecting data to train TinyML models for tasks like audio anomaly detection, where human judgment is valuable in identifying unusual sounds.
-* **Considerations for Effective Crowdsourcing: ** While crowdsourcing offers numerous advantages, it's essential to approach it with a clear strategy. Access to a diverse set of annotators introduces variability in the quality of annotations. Platforms like Mechanical Turk might not always capture a complete demographic spectrum; often, tech-savvy individuals are overrepresented, while children and older people may be underrepresented. Providing clear instructions and training for annotators is crucial. Periodic checks and validations of the labeled data help maintain quality. This ties back to the topic of clear problem definition that we discussed earlier. Crowdsourcing for datasets also requires careful attention to ethical considerations. It's crucial to ensure that participants are informed about how their data will be used and that their privacy is protected. Quality control through detailed protocols, transparency in sourcing, and auditing is essential to ensure reliable outcomes.
-* **Challenges Specific to TinyML: ** For TinyML, crowdsourcing presents unique challenges due to the specialized nature of TinyML devices, which are designed for particular tasks within tight constraints:
-* Specialized Data Requirements: TinyML applications often rely on data collected from specific sensors or hardware. Crowdsourcing such specialized data from a general audience may be challenging. For example, participants need access to specific devices, such as microphones with consistent sampling rates, to collect relevant audio data for keyword spotting.
-* High Granularity and Quality: Given TinyML's limitations, the data must be highly granular and high-quality.
Ensuring this level of detail from crowdsourcing participants unfamiliar with the application's context and requirements can be difficult.
-* Privacy, Standardization, and Technical Expertise: Additional issues include maintaining privacy, real-time data collection, standardization, and the need for technical expertise to provide accurate data labeling.
-* Narrow Task Focus: Many TinyML tasks are narrowly defined, making accurate data labeling easier with proper understanding. Participants may need full context to provide reliable annotations.
-* **Careful Planning for Success: ** Thus, while crowdsourcing can work well in many cases, the specialized needs of TinyML introduce unique data challenges. Careful planning is required to set guidelines, target appropriate participants, and implement rigorous quality control. In some applications, crowdsourcing may be feasible, but others may require more focused data collection efforts to obtain relevant, high-quality training data. By understanding the advantages and limitations of crowdsourcing, researchers and developers can make informed decisions about its suitability for their specific TinyML data acquisition needs.
+Crowdsourcing for datasets is the practice of obtaining data using the services of many people, either from a specific community or the general public, typically via the Internet. Instead of relying on a small team or specific organization to collect or label data, crowdsourcing leverages the collective effort of a vast, distributed group of participants. Services like Amazon Mechanical Turk enable the distribution of annotation tasks to a large, diverse workforce. This facilitates the collection of labels for complex tasks like sentiment analysis or image recognition requiring human judgment.
+
+Crowdsourcing has emerged as an effective approach for data collection and problem-solving.
One major advantage of crowdsourcing is scalability—by distributing tasks to a large, global pool of contributors on digital platforms, projects can process huge volumes of data quickly. This makes crowdsourcing ideal for large-scale data labeling, collection, and analysis.
+
+In addition, crowdsourcing taps into a diverse group of participants, bringing a wide range of perspectives, cultural insights, and language abilities that can enrich data and enhance creative problem-solving in ways that a more homogenous group may not. Because crowdsourcing draws from a large audience beyond traditional channels, it is more cost-effective than conventional methods, especially for simpler microtasks.
+
+Crowdsourcing platforms also allow for great flexibility, as task parameters can be adjusted in real time based on initial results. This creates a feedback loop for iterative improvements to the data collection process. Complex jobs can be broken down into microtasks and distributed to multiple people, with results cross-validated by assigning redundant versions of the same task. When thoughtfully managed, crowdsourcing enables community engagement around a collaborative project, where participants find reward in contributing.
+
+However, while crowdsourcing offers numerous advantages, it's essential to approach it with a clear strategy. While it provides access to a diverse set of annotators, it also introduces variability in the quality of annotations. Additionally, platforms like Mechanical Turk might not always capture a complete demographic spectrum; often, tech-savvy individuals are overrepresented, while children and older people may be underrepresented. Providing clear instructions and training for the annotators is crucial. Periodic checks and validations of the labeled data help maintain quality. This ties back to the topic of clear Problem Definition that we discussed earlier. Crowdsourcing for datasets also requires careful attention to ethical considerations.
It's crucial to ensure that participants are informed about how their data will be used and that their privacy is protected. Quality control through detailed protocols, transparency in sourcing, and auditing is essential to ensure reliable outcomes.
+
+For TinyML, crowdsourcing can pose some unique challenges. TinyML devices are highly specialized for particular tasks within tight constraints. As a result, the data they require tends to be very specific. Obtaining such specialized data from a general audience may be difficult through crowdsourcing. For example, TinyML applications often rely on data collected from certain sensors or hardware. Crowdsourcing would require participants to have access to very specific and consistent devices - like microphones, with the same sampling rates. These hardware nuances present obstacles even for simple audio tasks like keyword spotting.
+
+Beyond hardware, the data itself needs high granularity and quality, given the limitations of TinyML. It can be hard to ensure this when crowdsourcing from those unfamiliar with the application's context and requirements. There are also potential issues around privacy, real-time collection, standardization, and technical expertise. Moreover, the narrow nature of many TinyML tasks makes accurate data labeling easier with the proper understanding. Participants may need full context to provide reliable annotations.
+
+Thus, while crowdsourcing can work well in many cases, the specialized needs of TinyML introduce unique data challenges. Careful planning is required for guidelines, targeting, and quality control. For some applications, crowdsourcing may be feasible, but others may require more focused data collection efforts to obtain relevant, high-quality training data.
### Synthetic Data -Synthetic data generation offers a valuable approach for addressing data collection limitations, particularly when real-world data is scarce, expensive, or ethically challenging to acquire, as is often the case in TinyML applications. This technique involves creating data that wasn't originally captured or observed but is generated using algorithms, simulations, or other techniques to resemble real-world data closely. As illustrated in @fig-synthetic-data, synthetic data is merged with historical data and then used as input for model training. It has become a valuable tool in various fields, particularly when real-world data is scarce, expensive, or ethically challenging (e.g., TinyML). Various techniques, such as Generative Adversarial Networks (GANs), can produce high-quality synthetic data almost indistinguishable from real data. These techniques have advanced significantly, making synthetic data generation increasingly realistic and reliable. +Synthetic data generation can be useful for addressing some of the data collection limitations. It involves creating data that wasn't originally captured or observed but is generated using algorithms, simulations, or other techniques to resemble real-world data. As shown in @fig-synthetic-data, synthetic data is merged with historical data and then used as input for model training. It has become a valuable tool in various fields, particularly when real-world data is scarce, expensive, or ethically challenging to collect, as is often the case in TinyML. Various techniques, such as Generative Adversarial Networks (GANs), can produce high-quality synthetic data almost indistinguishable from real data. These techniques have advanced significantly, making synthetic data generation increasingly realistic and reliable. -Advantages of Synthetic Data for TinyML: +In many domains, especially emerging ones, there may not be enough real-world data available for analysis or for training machine learning models.
Synthetic data can fill this gap by producing large volumes of data that mimic real-world scenarios. For instance, detecting the sound of breaking glass might be challenging in security applications where a TinyML device is trying to identify break-ins. Collecting real-world data would require breaking numerous windows, which is impractical and costly. -Addressing Data Scarcity: Many TinyML applications require particular datasets that may be difficult or expensive to collect in the real world. Synthetic data generation can overcome this challenge by producing large volumes of data that mimic real-world scenarios relevant to the TinyML task. For instance, consider a TinyML device designed for security applications that must identify breaking glass sounds. Gathering real-world data for such a task would require breaking numerous windows, which is impractical and costly. Synthetic data generation offers a viable alternative by creating realistic audio samples of breaking glass. -Enhancing Model Robustness: Diversity in datasets is crucial for effective machine learning, especially in deep learning models. Synthetic data can augment existing datasets by introducing variations in data points, thereby enhancing the robustness of models. For example, SpecAugment is a powerful data augmentation technique to improve automatic speech recognition (ASR) systems. -Privacy Preservation: Privacy and confidentiality are major concerns when dealing with sensitive or personal data datasets. Synthetic data, being artificially generated, doesn't have these direct ties to real individuals. This allows for safer use of data while preserving essential statistical properties relevant for model training. -Cost-Effectiveness: Once the generation mechanisms are established, synthetic data generation can be a more cost-effective alternative to traditional data collection methods. 
In the security application scenario mentioned earlier, synthetic data eliminates the need for expensive and impractical data collection involving breaking windows. -Control over Data Generation: Many embedded TinyML use cases deal with unique situations, such as those encountered in manufacturing plants, which can be difficult to simulate in real-world environments. Synthetic data allows researchers complete control over the data generation process, enabling the creation of specific scenarios or conditions that are challenging to capture in real life. -Considerations for Using Synthetic Data: +Moreover, having a diverse dataset is crucial in machine learning, especially in deep learning. Synthetic data can augment existing datasets by introducing variations, thereby enhancing the robustness of models. For example, SpecAugment is a widely used data augmentation technique for Automatic Speech Recognition (ASR) systems. -While synthetic data offers numerous advantages, it is essential to use it judiciously. Care must be taken to ensure that the generated data accurately represents the underlying real-world distributions and does not introduce unintended biases. Validation techniques are crucial to verify the quality and representativeness of the synthetic data before using it for model training. +Privacy and confidentiality are also major concerns. Datasets containing sensitive or personal information pose privacy risks when shared or used. Synthetic data, being artificially generated, doesn't have these direct ties to real individuals, allowing for safer use while preserving essential statistical properties. + +Generating synthetic data, especially once the generation mechanisms have been established, can be a more cost-effective alternative. In the security application scenario above, synthetic data eliminates the need to break multiple windows to gather relevant data.
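
To make the augmentation idea concrete, the sketch below applies frequency and time masking to a spectrogram, in the spirit of the masking steps used by SpecAugment (the time-warping step is omitted). It is a simplified illustration using plain Python lists; the function name `mask_spectrogram`, the mask widths, and the toy spectrogram are assumptions chosen for readability rather than values from the source.

```python
import random


def mask_spectrogram(spec, max_freq_width=2, max_time_width=3, seed=None):
    """Return a copy of spec[freq][time] with one frequency band and
    one time span zeroed out. Widths are drawn at random up to the
    given maxima (assumed values, tune per dataset)."""
    rng = random.Random(seed)
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # leave the input untouched

    # Frequency mask: zero a band of consecutive frequency bins
    f_width = rng.randint(0, min(max_freq_width, n_freq))
    f_start = rng.randint(0, n_freq - f_width)
    for f in range(f_start, f_start + f_width):
        out[f] = [0.0] * n_time

    # Time mask: zero a span of consecutive frames in every bin
    t_width = rng.randint(0, min(max_time_width, n_time))
    t_start = rng.randint(0, n_time - t_width)
    for row in out:
        for t in range(t_start, t_start + t_width):
            row[t] = 0.0
    return out


# Toy spectrogram: 8 frequency bins x 10 time frames
spec = [[1.0] * 10 for _ in range(8)]
augmented = [mask_spectrogram(spec, seed=s) for s in range(4)]
```

Each call yields a differently masked variant of the same clip, so a small set of recordings can be expanded into many training examples without collecting new audio.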
+ +Many embedded use cases deal with unique situations, such as manufacturing plants, that are difficult to simulate. Synthetic data allows researchers complete control over the data generation process, enabling the creation of specific scenarios or conditions that are challenging to capture in real life. + +While synthetic data offers numerous advantages, it is essential to use it judiciously. Care must be taken to ensure that the generated data accurately represents the underlying real-world distributions and does not introduce unintended biases. ![Increasing training data size with synthetic data generation. Source: [AnyLogic](https://www.anylogic.com/features/artificial-intelligence/synthetic-data/).](images/jpg/synthetic_data.jpg){#fig-synthetic-data} @@ -295,33 +271,23 @@ Some examples of data governance across different sectors include: **Special data storage considerations for TinyML** - -***Efficient Audio Storage Formats:*** Keyword spotting systems need specialized audio storage formats to enable quick keyword searching in audio data. Traditional formats like WAV and MP3 store full audio waveforms, which require extensive processing to search through. Keyword spotting uses compressed storage optimized for snippet-based search. One effective approach is storing compact acoustic features extracted from the raw audio, such as Mel-frequency cepstral coefficients (MFCCs), representing important audio characteristics. -Here's a breakdown of the workflow for efficient audio storage in TinyML: - _**Efficient Audio Storage Formats:**_ Keyword spotting systems need specialized audio storage formats to enable quick keyword searching in audio data. Traditional formats like WAV and MP3 store full audio waveforms, which require extensive processing to search through. Keyword spotting uses compressed storage optimized for snippet-based search. One approach is to store compact acoustic features instead of raw audio. 
Such a workflow would involve: +* **Extracting acoustic features:** Mel-frequency cepstral coefficients (MFCCs) commonly represent important audio characteristics. -* ** Feature Extraction:** Acoustic features like MFCCs are extracted from the raw audio data. These features capture essential characteristics of the sound, making them ideal for keyword spotting. +* **Creating Embeddings:** Embeddings transform extracted acoustic features into continuous vector spaces, enabling more compact and representative data storage. This representation is essential in converting high-dimensional data, like audio, into a more manageable and efficient format for computation and storage. -* ** Embedding Creation:** The extracted acoustic features are then transformed into low-dimensional vector embeddings. This process reduces the data size significantly while preserving the information crucial for keyword detection. Vector embeddings allow for more efficient storage and computation on resource-constrained devices. +* **Vector quantization:** This technique represents high-dimensional data, like embeddings, with lower-dimensional vectors, reducing storage needs. Initially, a codebook is generated from the training data to define a set of code vectors representing the original data vectors. Subsequently, each data vector is matched to the nearest codeword according to the codebook, ensuring minimal information loss. -* ** Vector Quantization:** High-dimensional data, like embeddings, are represented with lower-dimensional vectors through vector quantization, reducing storage needs. Initially, a codebook is generated from the training data to define a set of code vectors representing the original data vectors. Each data vector is subsequently matched to the nearest codeword according to the codebook, ensuring minimal information loss. - -* ** Sequential Storage:** The audio is fragmented into short frames, and each frame's quantized features (or embeddings) are stored sequentially. 
This approach maintains the temporal order of the audio data, ensuring context and coherence are preserved for keyword matching. - +* **Sequential storage:** The audio is fragmented into short frames, and the quantized features (or embeddings) for each frame are stored sequentially to maintain the temporal order, preserving the coherence and context of the audio data. This format enables decoding the features frame-by-frame for keyword matching. Searching the features is faster than decompressing the full audio. - -***Selective Network Output Storage:*** During training, only the final network activations (outputs) are retained, discarding intermediate audio features for inference. This approach significantly reduces storage requirements, especially for complex models. The network processes the full audio data during training to extract the necessary features. However, only the final learned representations, captured in the network's outputs, are stored for deployment. This reduces the storage footprint by eliminating redundant intermediate feature layers not crucial for performing keyword spotting during inference. - _**Selective Network Output Storage:**_ Another technique for reducing storage is to discard the intermediate audio features stored during training but not required during inference. The network is run on full audio during training. However, only the final outputs are stored during inference. - ## Data Processing -Data processing refers to the critical steps in transforming raw data into a format suitable for feeding into machine learning algorithms. It is the foundation for successful machine learning projects, enabling models to achieve optimal performance. The time-intensive nature of data cleaning and organization underscores its importance in building robust and reliable ML models.@fig-data-engineering shows a breakdown of a data scientist's time allocation, highlighting the significant portion spent on data cleaning and organizing (%60). 
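
The vector quantization step in the audio storage workflow described earlier can be sketched as follows. The source does not say how the codebook is generated; this sketch assumes a few iterations of k-means, which is one common choice. The helper names (`build_codebook`, `nearest_code`) and the two-dimensional toy frames are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_code(vector, codebook):
    """Index of the codeword closest to vector."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def build_codebook(vectors, n_codes, n_iters=10, seed=0):
    """Build a codebook with a basic k-means loop (an assumption;
    other codebook training methods exist)."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, n_codes)]
    for _ in range(n_iters):
        clusters = [[] for _ in range(n_codes)]
        for v in vectors:
            clusters[nearest_code(v, codebook)].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old codeword if a cluster is empty
                codebook[i] = [sum(dim) / len(members)
                               for dim in zip(*members)]
    return codebook


# Toy per-frame feature vectors standing in for MFCCs or embeddings
frames = [[0.1, 0.2], [0.0, 0.1], [0.9, 1.0],
          [1.0, 0.8], [0.2, 0.0], [0.8, 0.9]]
codebook = build_codebook(frames, n_codes=2)

# Sequential storage: one small integer per frame, in temporal order
codes = [nearest_code(f, codebook) for f in frames]
```

Storing `codes` plus the shared `codebook` replaces each frame's full feature vector with a single index, which is where the storage saving comes from; the frame order is preserved for keyword matching.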
+Data processing refers to the steps involved in transforming raw data into a format suitable for feeding into machine learning algorithms. It is a crucial stage in any ML workflow, yet often overlooked. With proper data processing, ML models are likely to achieve optimal performance. @fig-data-engineering shows a breakdown of a data scientist's time allocation, highlighting the significant portion spent on data cleaning and organizing (60%). ![Data scientists' tasks breakdown by time spent. Source: [Forbes.](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=20c55a266f63)](images/jpg/data_engineering_features.jpg){#fig-data-engineering} @@ -340,7 +306,8 @@ Let’s take a look at @fig-data-engineering-kws2 for an example of a data proce The MSWC used a [forced alignment](https://montreal-forced-aligner.readthedocs.io/en/latest/) method to automatically extract individual word recordings to train keyword-spotting models from the [Common Voice](https://commonvoice.mozilla.org/) project, which features crowdsourced sentence-level recordings. Forced alignment refers to long-standing methods in speech processing that predict when speech phenomena like syllables, words, or sentences start and end within an audio recording. In the MSWC data, crowdsourced recordings often feature background noises, such as static and wind. Depending on the model's requirements, these noises can be removed or intentionally retained. -Maintaining the integrity of the data infrastructure is a continuous endeavor in TinyML applications. This encompasses data storage, security, error handling, and stringent version control. Regular updates are crucial, especially for dynamic applications like keyword spotting, to adapt to evolving trends and device integrations. +Maintaining the integrity of the data infrastructure is a continuous endeavor.
This encompasses data storage, security, error handling, and stringent version control. Periodic updates are crucial, especially in dynamic realms like keyword spotting, to adjust to evolving linguistic trends and device integrations. + There is a boom in data processing pipelines, commonly found in ML operations toolchains, which we will discuss in the MLOps chapter. Briefly, these include frameworks like MLOps by Google Cloud. It provides methods for automation and monitoring at all steps of ML system construction, including integration, testing, releasing, deployment, and infrastructure management. Several mechanisms focus on data processing, an integral part of these systems. :::{#exr-dp .callout-caution collapse="true"} @@ -354,7 +321,7 @@ Let us explore two significant projects in speech data processing and machine le ## Data Labeling -High-quality training datasets are essential for effective machine-learning models. Data labeling is critical in achieving this by providing ground truth information, allowing models to learn relationships between inputs and desired outputs. This section covers key considerations for selecting label types, formats, and content to capture the necessary task information. It discusses common annotation approaches, from manual labeling to crowdsourcing to AI-assisted methods, and best practices for ensuring label quality through training, guidelines, and quality checks. The ethical treatment of human annotators is also emphasized. Additionally, the integration of AI to accelerate and augment human annotation is explored. Understanding labeling needs, challenges, and strategies are essential for constructing reliable, useful datasets to train performant, trustworthy machine learning systems. +Data labeling is important in creating high-quality training datasets for machine learning models. Labels provide ground truth information, allowing models to learn relationships between inputs and desired outputs. 
This section covers key considerations for selecting label types, formats, and content to capture the necessary information for tasks. It discusses common annotation approaches, from manual labeling to crowdsourcing to AI-assisted methods, and best practices for ensuring label quality through training, guidelines, and quality checks. We also emphasize the ethical treatment of human annotators. The integration of AI to accelerate and augment human annotation is also explored. Understanding labeling needs, challenges, and strategies are essential for constructing reliable, useful datasets to train performant, trustworthy machine learning systems. ### Label Types @@ -364,17 +331,11 @@ Labels capture information about key tasks or concepts. @fig-labels includes som Unless focused on self-supervised learning, a dataset will likely provide labels addressing one or more tasks of interest. Given their unique resource constraints, dataset creators must consider what information labels should capture and how they can practically obtain the necessary labels. Creators must first decide what type(s) of content labels should capture. For example, a creator interested in car detection would want to label cars in their dataset. Still, they might also consider whether to simultaneously collect labels for other tasks that the dataset could potentially be used for, such as pedestrian detection. - -Additionally, annotators can provide metadata that provides insight into how the dataset represents different characteristics of interest (see @sec-data-transparency). The Common Voice dataset, for example, includes various types of metadata that provide information about the speakers, recordings, and dataset quality for each language represented (@ardila2020common). They include demographic splits showing the number of recordings by speaker age range and gender. This allows us to see who contributed recordings for each language. 
They also include statistics like average recording duration and total hours of validated recordings. These give insights into the nature and size of the datasets for each language. -Furthermore, quality control metrics like the percentage of recordings that have been validated are useful to know how complete and clean the datasets are. The metadata also includes normalized demographic splits scaled to 100% for comparison across languages. This highlights representation differences between higher and lower resource languages. -Having decided on the information to capture in labels, creators must determine the format next. For example, a creator interested in car detection might choose between binary classification labels that indicate whether a car is present, bounding boxes that show the general locations of any cars, or pixel-wise segmentation labels that show the exact location of each car. Their choice of label format will depend on the use case and resource constraints, as finer-grained labels are typically more expensive and time-consuming to acquire. Once the desired content and label format are determined, creators can begin the annotation process. - Additionally, annotators can provide metadata that provides insight into how the dataset represents different characteristics of interest (see @sec-data-transparency). The Common Voice dataset, for example, includes various types of metadata that provide information about the speakers, recordings, and dataset quality for each language represented [@ardila2020common]. They include demographic splits showing the number of recordings by speaker age range and gender. This allows us to see who contributed recordings for each language. They also include statistics like average recording duration and total hours of validated recordings. These give insights into the nature and size of the datasets for each language. 
Additionally, quality control metrics like the percentage of recordings that have been validated are useful to know how complete and clean the datasets are. The metadata also includes normalized demographic splits scaled to 100% for comparison across languages. This highlights representation differences between higher and lower resource languages. Next, creators must determine the format of those labels. For example, a creator interested in car detection might choose between binary classification labels that say whether a car is present, bounding boxes that show the general locations of any cars, or pixel-wise segmentation labels that show the exact location of each car. Their choice of label format may depend on their use case and resource constraints, as finer-grained labels are typically more expensive and time-consuming to acquire. - ### Annotation Methods Common annotation approaches include manual labeling, crowdsourcing, and semi-automated techniques. Manual labeling by experts yields high quality but lacks scalability. Crowdsourcing distributes annotation across non-experts, often through dedicated platforms [@victor2019machine]. Weakly supervised and programmatic methods can reduce manual effort by heuristically or automatically generating labels [@ratner2018snorkel]. @@ -399,9 +360,6 @@ Let's get started! ### Ensuring Label Quality - -There is no guarantee that the data labels are correct. @fig-hard-labels shows some examples of hard labeling cases: some errors arise from blurred pictures that make them hard to identify (the frog image), and others stem from a lack of domain knowledge (the black stork case). It is possible that despite the best instructions given to labelers, they still mislabel some images (@northcutt2021pervasive). Strategies like quality checks, training annotators, and collecting multiple labels per datapoint can help ensure label quality.
Multiple annotators can help identify controversial datapoints and quantify disagreement levels for ambiguous tasks. - There is no guarantee that the data labels are actually correct. @fig-hard-labels shows some examples of hard labeling cases: some errors arise from blurred pictures that make them hard to identify (the frog image), and others stem from a lack of domain knowledge (the black stork case). It is possible that despite the best instructions being given to labelers, they still mislabel some images [@northcutt2021pervasive]. Strategies like quality checks, training annotators, and collecting multiple labels per datapoint can help ensure label quality. For ambiguous tasks, multiple annotators can help identify controversial datapoints and quantify disagreement levels. ![Some examples of hard labeling cases. Source: @northcutt2021pervasive.](https://raw.githubusercontent.com/cleanlab/assets/master/cleanlab/label-errors-examples.png){#fig-hard-labels} @@ -410,7 +368,7 @@ When working with human annotators, offering fair compensation and otherwise pri ### AI-Assisted Annotation -ML has an insatiable demand for data. Therefore, more data is needed. This raises the question of how we can get more labeled data. Rather than always generating and curating data manually, we can rely on existing AI models to help label datasets more quickly and cheaply, though often with lower quality than human annotation. This can be done in various ways, as shown in @fig-weak-supervision, including the following: +ML has an insatiable demand for data. Therefore, more data is needed. This raises the question of how we can get more labeled data. Rather than always generating and curating data manually, we can rely on existing AI models to help label datasets more quickly and cheaply, though often with lower quality than human annotation. 
This can be done in various ways, as shown in @fig-weak-supervision, including the following: * **Pre-annotation:** AI models can generate preliminary labels for a dataset using methods such as semi-supervised learning [@chapelle2009semisupervised], which humans can then review and correct. This can save a significant amount of time, especially for large datasets. * **Active learning:** AI models can identify the most informative data points in a dataset, which can then be prioritized for human annotation. This can help improve the labeled dataset's quality while reducing the overall annotation time. @@ -451,7 +409,7 @@ With data version control in place, we can track the changes shown in @fig-data- **Popular Data Version Control Systems** -[**[DVC]{.underline}**](https://dvc.org/doc): It stands for Data Version Control in short and is an open-source, lightweight tool that works on top of Git Hub and supports all kinds of data formats. It can seamlessly integrate into the workflow if Git is used to manage code. It captures the versions of data and models in the Git commits while storing them on-premises or on the cloud (e.g., AWS, Google Cloud, Azure). These data and models (e.g., ML artifacts) are defined in the metadata files, which get updated in every commit. It allows metrics tracking of models on different versions of the data. +[**[DVC]{.underline}**](https://dvc.org/doc): Short for Data Version Control, it is an open-source, lightweight tool that works on top of Git and supports all kinds of data formats. It can seamlessly integrate into the workflow if Git is used to manage code. It captures the versions of data and models in the Git commits while storing them on-premises or on the cloud (e.g., AWS, Google Cloud, Azure). These data and models (e.g., ML artifacts) are defined in the metadata files, which get updated in every commit. It also allows tracking model metrics across different versions of the data.
**[lakeFS](https://docs.lakefs.io/):** It is an open-source tool that supports the data version control on data lakes. It supports many git-like operations, such as branching and merging of data, as well as reverting to previous versions of the data. It also has a unique UI feature, making exploring and managing data much easier. @@ -459,7 +417,7 @@ With data version control in place, we can track the changes shown in @fig-data- ## Optimizing Data for Embedded AI -Creators working on embedded systems may have unusual priorities when cleaning their datasets. On the one hand, models may be developed for particular use cases, requiring heavy filtering of datasets. While other large language models may be capable of turning any speech into text, a model for an embedded system may be focused on a single limited task, such as detecting a keyword. As a result, creators may aggressively filter out large amounts of data to address the task of interest. An embedded AI system may also be tied to specific hardware devices or environments. For example, a video model may need to process images from a single type of camera, which will only be mounted on doorbells in residential neighborhoods. In this scenario, creators may discard images if they came from a different kind of camera, show the wrong type of scenery, or were taken from the wrong height or angle. +Creators working on embedded systems may have unusual priorities when cleaning their datasets. On the one hand, models may be developed for unusually specific use cases, requiring heavy filtering of datasets. While other natural language models may be capable of turning any speech into text, a model for an embedded system may be focused on a single limited task, such as detecting a keyword. As a result, creators may aggressively filter out large amounts of data because they need to address the task of interest. An embedded AI system may also be tied to specific hardware devices or environments. 
For example, a video model may need to process images from a single type of camera, which will only be mounted on doorbells in residential neighborhoods. In this scenario, creators may discard images if they came from a different kind of camera, show the wrong type of scenery, or were taken from the wrong height or angle. On the other hand, embedded AI systems are often expected to provide especially accurate performance in unpredictable real-world settings. This may lead creators to design datasets to represent variations in potential inputs and promote model robustness. As a result, they may define a narrow scope for their project but then aim for deep coverage within those bounds. For example, creators of the doorbell model mentioned above might try to cover variations in data arising from: @@ -478,29 +436,16 @@ By providing clear, detailed documentation, creators can help developers underst ![Data card describing a CV dataset. Source: @pushkarna2022data.](images/png/data_card.png){#fig-data-card} -**The Importance of Data Provenance in Machine Learning** - -Data provenance, the ability to track the origin and journey of each data point through the machine learning pipeline, is no longer a nicety but a fundamental requirement for ensuring data quality. Transparent machine learning systems, enabled by robust data provenance, facilitate scrutinizing individual data points. This scrutiny empowers practitioners to identify and rectify errors, biases, and inconsistencies within the data. - -For example, consider a medical ML model exhibiting performance deficiencies in specific areas. By tracing the data provenance, one can pinpoint the root cause: issues with data collection methods, underrepresentation of certain demographic groups, or other factors. This level of transparency goes beyond debugging; it fosters improving data quality. Reliable and trustworthy datasets, bolstered by verifiable data provenance, ultimately enhance model performance and user acceptance. 
- -**Data Access and Maintenance Considerations** - - -When creating documentation, data creators should explicitly outline user access procedures and long-term maintenance plans for the dataset. For instance, accessing sensitive datasets, such as those containing medical information, may necessitate user training or special permissions from the creators. In some cases, users' direct data access might be restricted. Federated learning setups (@aledhari2020federated) offer an alternative approach, where users submit their models for training on the creators' hardware. Additionally, data creators should specify the dataset's accessibility timeframe, user error reporting mechanisms, and plans for future updates. +Keeping track of data provenance- essentially the origins and the journey of each data point through the data pipeline- is not merely a good practice but an essential requirement for data quality. Data provenance contributes significantly to the transparency of machine learning systems. Transparent systems make it easier to scrutinize data points, enabling better identification and rectification of errors, biases, or inconsistencies. For instance, if an ML model trained on medical data is underperforming in particular areas, tracing the provenance can help identify whether the issue is with the data collection methods, the demographic groups represented in the data or other factors. This level of transparency doesn't just help debug the system but also plays a crucial role in enhancing the overall data quality. By improving the reliability and credibility of the dataset, data provenance also enhances the model's performance and its acceptability among end-users. When producing documentation, creators should also specify how users can access the dataset and how the dataset will be maintained over time. 
For example, users may need to undergo training or receive special permission from the creators before accessing a protected information dataset, as with many medical datasets. In some cases, users may not access the data directly. Instead, they must submit their model to be trained on the dataset creators' hardware, following a federated learning setup [@aledhari2020federated]. Creators may also describe how long the dataset will remain accessible, how the users can submit feedback on any errors they discover, and whether there are plans to update the dataset. +Some laws and regulations also promote data transparency through new requirements for organizations: -**Legal and Regulatory Landscape** - -Data transparency is increasingly emphasized by legal and regulatory frameworks. The European Union's General Data Protection Regulation (GDPR) mandates stringent data processing and protection protocols for EU citizens' personal data. Organizations must provide clear, plain-language privacy policies that detail data collection purposes, storage duration, sharing practices, and legal justifications for processing. Additionally, GDPR mandates privacy notices encompassing data transfer procedures, retention periods, access and deletion rights, and contact information for data controllers. +* General Data Protection Regulation (GDPR) in the European Union: It establishes strict requirements for processing and protecting the personal data of EU citizens. It mandates plain-language privacy policies that clearly explain what data is collected, why it is used, how long it is stored, and with whom it is shared. GDPR also mandates that privacy notices must include details on the legal basis for processing, data transfers, retention periods, rights to access and deletion, and contact info for data controllers. +* California's Consumer Privacy Act (CCPA): CCPA requires clear privacy policies and opt-out rights to sell personal data. 
Significantly, it also establishes rights for consumers to request that their specific data be disclosed. Businesses must provide copies of collected personal information and details on what it is used for, what categories are collected, and which third parties receive it. Consumers can identify data points they believe are inaccurate. The law represents a major step forward in empowering personal data access. -California's Consumer Privacy Act (CCPA) echoes similar concerns, mandating clear privacy policies and user opt-out rights to sell personal data. Notably, CCPA empowers consumers to request access to their specific data, including details on its usage, categories collected, and recipients. This legislation represents a significant step towards consumer empowerment in managing personal data. - -**Challenges and Considerations** - -While data transparency offers undeniable benefits, it also presents challenges. Establishing and maintaining robust data provenance requires significant time and financial resources. The inherent complexity of data systems can also make achieving full transparency a time-consuming process. Furthermore, overly detailed information might overwhelm users. Finally, it is also important to balance the trade-off between transparency and privacy. +Ensuring data transparency presents several challenges, especially because it requires significant time and financial resources. Data systems are also quite complex, and achieving full transparency can take time. Full transparency may also overwhelm consumers with too much detail. Finally, it is also important to balance the tradeoff between transparency and privacy. ## Licensing @@ -522,34 +467,24 @@ New data regulations also impact licensing practices. The legislative landscape Additionally, the EU Act addresses the ethical dimensions and operational challenges in sectors such as healthcare and finance.
Key elements include the prohibition of AI systems posing "unacceptable" risks, stringent conditions for high-risk systems, and minimal obligations for "limited risk" AI systems. The proposed European AI Board will oversee and ensure the implementation of efficient regulation. -**Challenges in Constructing ML Training Datasets** +**Challenges in Assembling ML Training Datasets** -***Complexities in Data Access and Usage** -Assembling machine learning (ML) training datasets is a multifaceted, challenging endeavor. Intricate legal issues surrounding proprietary data, copyright law, and privacy regulations constrain the options for building robust datasets. Expanding accessibility through adopting more open licensing practices or fostering public-private data collaborations could significantly accelerate industry progress and elevate ethical standards. -***Data Anonymization, Filtering, and Regulatory Landscape** -Certain portions of a dataset may need to be removed or obfuscated to comply with data usage agreements or safeguard sensitive information. This process often occurs well after active data sourcing and model training. For instance, a user information dataset might require the removal of names, contact details, and other identifying data to ensure anonymity. Similarly, datasets containing copyrighted content or trade secrets may need to be filtered before distribution. -Regulations such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and the Amended Act on the Protection of Personal Information([[APPI]{.underline}](https://www.ppc.go.jp/files/pdf/280222_amendedlaw.pdf)) have been established to guarantee the right to be forgotten. These regulations legally mandate model providers to erase user data upon request. Data collectors and providers must be equipped to take appropriate measures to de-identify or filter out any proprietary, licensed, confidential, or regulated information as required. 
Explicit user requests for data removal may also necessitate action. -***Balancing Data Usage with Privacy** -Data collectors and providers must be equipped to take appropriate measures to de-identify or filter out any proprietary, licensed, confidential, or regulated information as required. Sometimes, the users may explicitly request that their data be removed. +Complex licensing issues around proprietary data, copyright law, and privacy regulations constrain options for assembling ML training datasets. However, expanding accessibility through more open licensing or public-private data collaborations could greatly accelerate industry progress and ethical standards. Sometimes, certain portions of a dataset may need to be removed or obscured to comply with data usage agreements or protect sensitive information. For example, a dataset of user information may contain names, contact details, and other identifying data that need to be removed, often well after the dataset has been sourced and used to train models. Similarly, a dataset that includes copyrighted content or trade secrets may need those portions filtered out before being distributed. Laws such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and the Amended Act on the Protection of Personal Information ([APPI](https://www.ppc.go.jp/files/pdf/280222_amendedlaw.pdf)) have been passed to guarantee the right to be forgotten. These regulations legally require model providers to erase user data upon request. -Dataset licensing is a multifaceted domain that intersects technology, ethics, and law. As the world evolves, understanding these intricacies becomes paramount for anyone building datasets during data engineering. +Data collectors and providers need to be able to take appropriate measures to de-identify or filter out any proprietary, licensed, confidential, or regulated information as needed.
Sometimes, the users may explicitly request that their data be removed. -## Conclusion +The ability to remove data from the dataset enables creators to uphold legal and ethical obligations around data usage and privacy. However, this ability has some important limitations. Some models may have already been trained on the dataset, and there is no clear or known way to eliminate a particular data sample's effect from the trained network. There is no erase mechanism. This raises the question: should the model be retrained from scratch each time a sample is removed? That's a costly option. Once data has been used to train a model, simply removing it from the original dataset may not fully eliminate its impact on the model's behavior. New research is needed on the effects of data removal on already-trained models and on whether full retraining is necessary to avoid retaining artifacts of deleted data. This presents an important consideration when balancing data licensing obligations with efficiency and practicality in an evolving, deployed ML system. +Dataset licensing is a multifaceted domain that intersects technology, ethics, and law. As the world evolves, understanding these intricacies becomes paramount for anyone building datasets during data engineering. -Data is the fundamental building block of AI systems. Without quality data, even the most advanced machine learning algorithms will fail. Data engineering encompasses the end-to-end process of collecting, storing, processing, and managing data to fuel the development of machine learning models. It begins with clearly defining the core problem and objectives, which guides effective data collection. Data can be sourced from diverse means, including existing datasets, web scraping, crowdsourcing, and synthetic data generation. Each approach involves tradeoffs between cost, speed, privacy, and specificity.
- Once data is collected, thoughtful labeling through manual or AI-assisted annotation enables the creation of high-quality training datasets. Proper storage in databases, warehouses, or lakes facilitates easy access and analysis. Metadata provides contextual details about the data. Data processing transforms raw data into a clean, consistent format for machine learning model development. -Throughout this pipeline, transparency through documentation and provenance tracking is crucial for ethics, auditability, and reproducibility. Data licensing protocols also govern legal data access and use. Key challenges in data engineering include privacy risks, representation gaps, legal restrictions around proprietary data, and the need to balance competing constraints like speed versus quality. - By thoughtfully engineering high-quality training data, machine learning practitioners can develop accurate, robust, and responsible AI systems. This includes applications in embedded systems and TinyML, where resource constraints demand particularly efficient and effective data-handling practices. In the context of TinyML, data engineering practices take on a unique character. Resource-constrained devices often necessitate smaller datasets with high signal-to-noise ratios. Data collection may be limited to on-device sensors or specific environmental conditions. Crowdsourcing and synthetic data generation have become precious tools for generating specialized datasets with limited memory and processing power. Careful optimization techniques for data cleansing, feature selection, and model compression are essential for TinyML applications. By understanding these nuances, data engineers can empower the development of efficient and effective AI solutions at the edge. -## Resources {#sec-data-engineering-resource .unnumbered} +## Conclusion Data is the fundamental building block of AI systems. Without quality data, even the most advanced machine learning algorithms will fail. 
Data engineering encompasses the end-to-end process of collecting, storing, processing, and managing data to fuel the development of machine learning models. It begins with clearly defining the core problem and objectives, which guides effective data collection. Data can be sourced from diverse means, including existing datasets, web scraping, crowdsourcing, and synthetic data generation. Each approach involves tradeoffs between cost, speed, privacy, and specificity. Once data is collected, thoughtful labeling through manual or AI-assisted annotation enables the creation of high-quality training datasets. Proper storage in databases, warehouses, or lakes facilitates easy access and analysis. Metadata provides contextual details about the data. Data processing transforms raw data into a clean, consistent format for machine learning model development. Throughout this pipeline, transparency through documentation and provenance tracking is crucial for ethics, auditability, and reproducibility. Data licensing protocols also govern legal data access and use. Key challenges in data engineering include privacy risks, representation gaps, legal restrictions around proprietary data, and the need to balance competing constraints like speed versus quality. By thoughtfully engineering high-quality training data, machine learning practitioners can develop accurate, robust, and responsible AI systems, including embedded and TinyML applications. ## Resources {#sec-data-engineering-resource} - Here is a curated list of resources to support students and instructors in their learning and teaching journeys. We are continuously working on expanding this collection and will add new exercises soon. :::{.callout-note collapse="false"}