Streamline/Optimize Orca Active Learning tool (OrcaAL) #30

Open
valentina-s opened this issue Mar 11, 2022 · 10 comments

valentina-s commented Mar 11, 2022

During the summers of 2020 and 2021, GSoC students worked on creating OrcaAL, the Orca Active Learning tool (repo, demo), which integrates the efforts of human annotators and machine learning experts to create better ML training sets and algorithms. The initial setup was designed to handle one day of Orcasound data with one querying strategy. To make the app more flexible and to scale the annotation process, there are several steps one can take to streamline the OrcaAL tool and improve its performance:

  • Simplify installation
  • Allow for spinning up a more powerful GPU instance for model training
  • Store annotations in cloud-hosted database
  • Allow to continuously obtain new data (after one day is exhausted)
  • Handle examples which cannot be annotated by anyone
  • Integrate embeddings and distance based querying strategy, and optimize performance
  • Allow to select based on embeddings visualization
  • Allow to change source/destinations for use with new datasets
  • Study how querying strategy influences model performance
  • Integrate multiple annotators (model training, uncertainty for querying, …)

Expected outcomes: Improve a tool that accelerates the annotation of Orcasound audio data and the training of machine learning models to classify acoustic signals from orcas.

Required Skills: Python, Machine Learning, Docker

Bonus Skills: Cloud Computing, Flask

Mentors: Valentina, Jesse, Scott

Difficulty level: Hard

Project Size: 175 or 350 h

Resources:

Getting Started:
Follow the instructions to start the API

valentina-s added the 2022 Suggested project idea for GSoC 2022 label Mar 11, 2022

vipul-sharma20 commented Mar 12, 2022

Link to OrcaAL repository / home page looks incorrect. Is it https://github.com/orcasound/orcaal/?


Benjamintdk commented Mar 13, 2022

Hi @valentina-s @scottveirs @yosoyjay, I am in the midst of writing my proposal and am deciding on which steps to focus on. I wanted to clarify a few matters:

  • Allow for spinning up a more powerful GPU instance for model training

I read that the ML API and endpoints are currently containerized and hosted on AWS LightSail. Can I clarify whether the model currently trains and predicts using a CPU, since LightSail does not seem to support GPU instances from my research? Also, is there currently any form of container orchestration used in AWS, e.g. ECS/EKS?

  • Store annotations in cloud-hosted database

I understand that a PostgreSQL database is currently used to store annotations. However, I'm not sure where it is currently hosted; is it on AWS LightSail as well? Also, can I clarify the reason for wanting to switch over to a cloud-hosted database? Would that mean using a solution such as AWS RDS instead?

  • Allow to continuously obtain new data (after one day is exhausted)

How is the acquisition of new unlabeled data currently handled? I don't seem to be able to find much information about it. Is it currently done manually, with a new set of data generated by the preprocess_unlabeled utilities and then pushed to the S3 bucket orcagsoc/unlabeled? If so, might automating it with a GitHub Actions workflow be a possible solution?

  • Handle examples which cannot be annotated by anyone

Can I clarify whether this refers to examples which the human annotators themselves are unsure of? If that is the case, I suppose that a separate database model (perhaps keeping track of the number of times an example is skipped, so that this can be a key used for querying) is required for storing these uncertain examples?

  • Integrate embeddings and distance based querying strategy, and optimize performance

Would a plausible distance based querying strategy be to use some distance measure (e.g. Manhattan or Euclidean) to find the furthest unlabeled examples from the means of both 'orca' and 'non-orca' embeddings, and treat those as the most 'uncertain' examples?

  • Allow to change source/destinations for use with new datasets

Do these new datasets refer to the labelled, unlabeled or both? Should this change be controllable from the OrcaAL site itself (i.e. can be seen and changed by annotators), or should it be more for internal usage?

  • Study how querying strategy influences model performance

Would this entail some form of A/B testing or canary deployment? The downsides to this are that I foresee requiring 2 isolated databases (or multiple depending on how many strategies we are testing) to contain the different annotations, and probably isolated S3 buckets as well to store the resulting different sets of labelled data thereafter. This method might also take some time to test, especially if the traffic for OrcaAL site isn't very high.

  • Integrate multiple annotators (model training, uncertainty for querying, …)

Does this refer to handling concurrent loads and high traffic (load testing)?

Some additional queries I have:

  • I wanted to clarify whether the model presently retrains on the entire labelled dataset every time the RETRAIN_TARGET number is reached, or whether it only retrains on the newly labelled examples? My understanding is that it is the former, and I was wondering if some sort of weighting should be applied to the examples, whereby newly human-labelled examples incur a higher cost if the model gets them wrong during retraining (i.e. some form of boosting). I was thinking that it might be helpful since these human-labelled examples are likely to be 'harder' given the greater uncertainty. Not sure if this makes sense, as I'm new to the active learning field in general.
  • What proportion of the current labelled dataset remains uncertain (i.e. after training the model and predicting on this training dataset, how many examples generate scores between 0.1 and 0.9)? My thought is that if the current training dataset doesn't generate scores for labelled examples consistently above 0.9 or below 0.1, then it would be unreasonable to expect unlabeled examples to have scores above 0.9 or below 0.1. Would looking at this be a useful metric for ML practitioners to decide whether to step in and try to improve model performance?
  • Would having a back button to relabel an annotation (in case of a mistake made) be useful?


valentina-s commented Mar 14, 2022

Thanks for all the questions @Benjamintdk!

Hi @valentina-s @scottveirs @yosoyjay, I am in the midst of writing my proposal and am deciding on which steps to focus on. I wanted to clarify a few matters:

  • Allow for spinning up a more powerful GPU instance for model training

I read that the ML API and endpoints are currently containerized and hosted on AWS LightSail. Can I clarify whether the model currently trains and predicts using a CPU, since LightSail does not seem to support GPU instances from my research? Also, is there currently any form of container orchestration used in AWS, e.g. ECS/EKS?

Yes, the training is on CPU, and the app, database, and training all run on one instance. The containers are started separately. If the annotations are moved to a hosted database, and a training container is spun up only when needed, then one would need just one container for the app. It is important to consider two use cases: 1) the Orcasound community: setting up the tool so that it works well for us; 2) users who want to set up OrcaAL on their own and do not want to depend too much on individual AWS services.
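For the "spin up only when needed" direction, one possibility (outside Lightsail, which has no GPU offerings) is a short-lived GPU instance per retraining round. A minimal boto3 sketch, assuming a prebuilt training image; the AMI id, instance type, and image name are placeholders rather than anything in the current OrcaAL setup:

```python
# Minimal sketch (not current OrcaAL code): launch an on-demand GPU instance for a
# single training run and let it terminate itself when the run finishes. The AMI id,
# instance type, and training image name are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

user_data = """#!/bin/bash
docker run --gpus all --rm example/orcaal-train:latest   # hypothetical training image
shutdown -h now
"""

response = ec2.run_instances(
    ImageId="ami-XXXXXXXXXXXX",                     # e.g. a Deep Learning AMI (placeholder)
    InstanceType="g4dn.xlarge",                     # a single-GPU instance type, as an example
    MinCount=1,
    MaxCount=1,
    InstanceInitiatedShutdownBehavior="terminate",  # the instance disappears after shutdown
    UserData=user_data,                             # boto3 base64-encodes this for us
)
print("Started training instance:", response["Instances"][0]["InstanceId"])
```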

  • Store annotations in cloud-hosted database

I understand that a PostgreSQL database is currently used to store annotations. However, I'm not sure where it is currently hosted; is it on AWS LightSail as well? Also, can I clarify the reason for wanting to switch over to a cloud-hosted database? Would that mean using a solution such as AWS RDS instead?

Currently the database runs in a Docker container on AWS Lightsail, which is fine, but sometimes the containers stop working, so a hosted database (RDS) might be more reliable?
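If the app moved to a hosted database, the switch could be as small as reading the connection string from the environment, so the same code runs against the local Postgres container or RDS. A sketch assuming a Flask + SQLAlchemy setup and a hypothetical DATABASE_URL variable:

```python
# Sketch: read the connection string from the environment so the same app code can
# use either the local Postgres container or a hosted RDS instance. DATABASE_URL is
# an assumed variable name, not necessarily what OrcaAL uses today.
import os
from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = os.environ.get(
    "DATABASE_URL",
    "postgresql://postgres:postgres@localhost:5432/orcaal",  # local dev default
)
db = SQLAlchemy(app)
```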

  • Allow to continuously obtain new data (after one day is exhausted)

How is the acquisition of new unlabeled data currently handled? I don't seem to be able to find much information about it. Is it currently done manually, with a new set of data generated by the preprocess_unlabeled utilities and then pushed to the S3 bucket orcagsoc/unlabeled? If so, might automating it with a GitHub Actions workflow be a possible solution?

Currently, there are two folders in the S3 orcagsoc bucket, labeled_test and unlabeled_test, and the data there comes from a specific day in July 2020, already in .mp3 format. @wetdog worked on grabbing data from the streaming buckets (in .ts format), so you should check out the orca_embeddings branch. Yes, GitHub Actions or some other way to facilitate these workflows would be great!
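A sketch of the fetch-and-convert step that a scheduled workflow (e.g. a GitHub Actions cron job) could run; the bucket and prefix names are assumptions for illustration, not the exact Orcasound configuration:

```python
# Sketch only: pull a few new .ts segments from the streaming bucket, transcode
# them to .mp3 with ffmpeg, and push them to the unlabeled folder. Bucket names,
# prefixes, and the "10 segments per run" limit are illustrative assumptions.
import subprocess
import boto3

s3 = boto3.client("s3")
STREAM_BUCKET = "streaming-orcasound-net"   # assumed public streaming bucket
PREFIX = "rpi_orcasound_lab/hls/"           # assumed hydrophone/HLS prefix
DEST_BUCKET = "orcagsoc"

# Collect candidate .ts keys (bookkeeping of already-processed segments is omitted).
keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=STREAM_BUCKET, Prefix=PREFIX):
    keys.extend(o["Key"] for o in page.get("Contents", []) if o["Key"].endswith(".ts"))

for key in keys[:10]:  # small batch per scheduled run
    local_ts, local_mp3 = "/tmp/segment.ts", "/tmp/segment.mp3"
    s3.download_file(STREAM_BUCKET, key, local_ts)
    # Transcode so the clip matches the .mp3 format the app already serves.
    subprocess.run(["ffmpeg", "-y", "-i", local_ts, local_mp3], check=True)
    dest_key = "unlabeled_test/" + key.split("/")[-1].replace(".ts", ".mp3")
    s3.upload_file(local_mp3, DEST_BUCKET, dest_key)
```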

  • Handle examples which cannot be annotated by anyone

Can I clarify whether this refers to examples which the human annotators themselves are unsure of? If that is the case, I suppose that a separate database model (perhaps keeping track of the number of times an example is skipped, so that this can be a key used for querying) is required for storing these uncertain examples?

Yes, the ones that are left when annotators click the skip button.
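A hypothetical sketch of the kind of schema change the question above suggests; the table and column names are made up for illustration, not the existing OrcaAL models:

```python
# Hypothetical schema sketch (not the current OrcaAL models): count skips so that
# clips nobody can annotate can be filtered out of the querying queue.
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class AudioExample(Base):
    __tablename__ = "audio_examples"
    id = Column(Integer, primary_key=True)
    s3_key = Column(String, nullable=False)      # location of the clip/spectrogram in S3
    times_skipped = Column(Integer, default=0)   # incremented each time "skip" is clicked

# A querying step could then exclude frequently skipped clips, e.g.
#   session.query(AudioExample).filter(AudioExample.times_skipped < 3)
```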

  • Integrate embeddings and distance based querying strategy, and optimize performance

Would a plausible distance based querying strategy be to use some distance measure (e.g. Manhattan or Euclidean) to find the furthest unlabeled examples from the means of both 'orca' and 'non-orca' embeddings, and treat those as the most 'uncertain' examples?

Yes, look through the embeddings branch, the orca-embeddings repo, and Jose's blog posts.
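A minimal numpy sketch of the distance-based idea discussed above, assuming embeddings for the labelled 'orca'/'non-orca' clips and the unlabeled pool are already available (array names are illustrative):

```python
# Illustrative numpy sketch: select the unlabeled embeddings farthest from both
# class means, as proposed above. Arrays are assumed to be (n_examples, emb_dim).
import numpy as np

def farthest_from_class_means(unlabeled_emb, orca_emb, noise_emb, n_query=20):
    """Return indices of the unlabeled embeddings to send to annotators."""
    orca_mean = orca_emb.mean(axis=0)
    noise_mean = noise_emb.mean(axis=0)
    d_orca = np.linalg.norm(unlabeled_emb - orca_mean, axis=1)    # Euclidean distances
    d_noise = np.linalg.norm(unlabeled_emb - noise_mean, axis=1)
    # An example is "uncertain" when even its nearest class mean is far away.
    uncertainty = np.minimum(d_orca, d_noise)
    return np.argsort(uncertainty)[-n_query:]
```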

  • Allow to change source/destinations for use with new datasets

Do these new datasets refer to the labelled, unlabeled or both? Should this change be controllable from the OrcaAL site itself (i.e. can be seen and changed by annotators), or should it be more for internal usage?

It could be either the labelled or the unlabelled data, or both. Maybe not from the OrcaAL site at this stage. Even from the command line, I think the setup steps are pretty tightly tied to our AWS setup. So if a colleague comes with their own data and wants to set up OrcaAL, it might not be trivial to get started. This is a bit open for discussion @yosoyjay @scottveirs

  • Study how querying strategy influences model performance

Would this entail some form of A/B testing or canary deployment? The downsides to this are that I foresee requiring 2 isolated databases (or multiple depending on how many strategies we are testing) to contain the different annotations, and probably isolated S3 buckets as well to store the resulting different sets of labelled data thereafter. This method might also take some time to test, especially if the traffic for OrcaAL site isn't very high.

We can compare the different querying strategies as a start. Even without a full deployment we can do some experiments. Kunal did some evaluation of the uncertainty strategy in a Jupyter notebook with a very small sample, but I think we now have slightly more labeled data. We also need to look at more appropriate metrics: right now we have accuracy, but the number of training and testing samples changes after each round, so one should be careful about how to interpret the results. We could also advertise some experiments on the Slack channel, or make OrcaAL more visible to the citizen scientists.
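For the offline comparison, something along these lines could work: a small simulation loop that replays labelling rounds with different query strategies against a fixed held-out test set, so the accuracy numbers stay comparable between rounds. `train_model` and `predict_proba` below are placeholders for the existing training code, not actual OrcaAL functions:

```python
# Offline comparison sketch (no deployment needed): replay active-learning rounds
# with different query strategies against one fixed held-out test set.
import numpy as np

def simulate(strategy, X_pool, y_pool, X_seed, y_seed, X_test, y_test,
             train_model, predict_proba, rounds=5, batch=50):
    X_train, y_train = X_seed.copy(), y_seed.copy()
    pool_idx = np.arange(len(X_pool))
    history = []
    for _ in range(rounds):
        model = train_model(X_train, y_train)
        test_acc = ((predict_proba(model, X_test) > 0.5) == y_test).mean()
        history.append(test_acc)                         # same test set every round

        scores = predict_proba(model, X_pool[pool_idx])  # P(orca) for the remaining pool
        n = min(batch, len(pool_idx))
        if strategy == "uncertainty":
            pick = np.argsort(np.abs(scores - 0.5))[:n]  # closest to the decision boundary
        else:                                            # "random" baseline
            pick = np.random.choice(len(pool_idx), n, replace=False)
        chosen = pool_idx[pick]
        X_train = np.concatenate([X_train, X_pool[chosen]])
        y_train = np.concatenate([y_train, y_pool[chosen]])  # simulation reveals true labels
        pool_idx = np.setdiff1d(pool_idx, chosen)
    return history
```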

  • Integrate multiple annotators (model training, uncertainty for querying, …)

Does this refer to handling concurrent loads and high traffic (load testing)?

Maybe simply sending the same observation to more than one annotator to check for consistency. We have given demos at some events with a bunch of people: the app has not crashed, but we do not know how slow it is for the users. Maybe that should be on the list to investigate.
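If the same clips were shown to two annotators, a consistency check could be as simple as the following sketch; scikit-learn is assumed to be available, and is not necessarily a current OrcaAL dependency:

```python
# Sketch of a simple consistency check when the same clips go to two annotators.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["orca", "no_orca", "orca", "orca", "no_orca"]
annotator_b = ["orca", "no_orca", "no_orca", "orca", "no_orca"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Inter-annotator agreement (Cohen's kappa): {kappa:.2f}")
# Clips where annotators disagree (or a low overall kappa) are good candidates
# for a third opinion or for exclusion from the training set.
```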

Some additional queries I have:

  • I wanted to clarify whether the model presently retrains on the entire labelled dataset every time the RETRAIN_TARGET number is reached, or whether it only retrains on the newly labelled examples? My understanding is that it is the former, and I was wondering if some sort of weighting should be applied to the examples, whereby newly human-labelled examples incur a higher cost if the model gets them wrong during retraining (i.e. some form of boosting). I was thinking that it might be helpful since these human-labelled examples are likely to be 'harder' given the greater uncertainty. Not sure if this makes sense, as I'm new to the active learning field in general.

This is an interesting topic to investigate. We started with a small training set, so initially the new labels had a big effect, but with time a new batch may not help much, especially with the uncertainty metric. One should also be careful not to bias the model toward a small batch of observations. You may find some inspiration in this book.
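A toy sketch of the sample-weighting idea with Keras, using random stand-in data rather than the actual OrcaAL model or spectrograms; the 2.0 weight is arbitrary and illustrates the caution above about not over-weighting a small batch:

```python
# Toy stand-in data: 100 "old" training examples plus 20 freshly human-labelled ones.
import numpy as np
import tensorflow as tf

X_train = np.random.rand(120, 64).astype("float32")
y_train = np.random.randint(0, 2, size=120).astype("float32")
is_new = np.array([False] * 100 + [True] * 20)

# Mild up-weight for the fresh labels; too large a factor risks biasing the model
# toward a small batch of observations, as noted above.
sample_weight = np.where(is_new, 2.0, 1.0)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# Keras scales each example's contribution to the loss by sample_weight.
model.fit(X_train, y_train, sample_weight=sample_weight, epochs=3, batch_size=32, verbose=0)
```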

  • What proportion of the current labelled dataset remains uncertain (i.e. after training the model and predicting on this training dataset, how many examples generate scores between 0.1 and 0.9)? My thought is that if the current training dataset doesn't generate scores for labelled examples consistently above 0.9 or below 0.1, then it would be unreasonable to expect unlabeled examples to have scores above 0.9 or below 0.1. Would looking at this be a useful metric for ML practitioners to decide whether to step in and try to improve model performance?

We actually have two strategies for selecting the uncertain samples:

  • in the notebook experiments we used the 0.1 – 0.9 score band;
  • in the OrcaAL app, the unlabeled examples were sorted by their prediction score and a certain number with the lowest confidence was taken.

@kunakl07 did quite a lot of experiments on changing the bounds of the confidence score, so he can maybe remind us of some details.
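For reference, the two selection strategies sketched side by side on a vector of model scores; the thresholds and N are the ones mentioned in this thread, not fixed parameters of the app:

```python
# The two strategies side by side, given model scores (P(orca)) for the unlabeled pool.
import numpy as np

scores = np.random.rand(1000)  # stand-in for model predictions on unlabeled clips

# Strategy 1 (notebook experiments): keep everything inside the 0.1 - 0.9 band.
band_idx = np.where((scores > 0.1) & (scores < 0.9))[0]

# Strategy 2 (OrcaAL app): rank by confidence and take the N least confident,
# i.e. the scores closest to 0.5.
N = 50
least_confident_idx = np.argsort(np.abs(scores - 0.5))[:N]
```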
  • Would having a back button to relabel an annotation (in case of a mistake made) be useful?

Possibly, we should collect some user feedback.

@valentina-s

Just a note for everybody: all of the above topics do not need to be in one project 😅; these are just potential directions ⛵!

@yosoyjay

  • Allow to change source/destinations for use with new datasets

Do these new datasets refer to the labelled, unlabeled or both? Should this change be controllable from the OrcaAL site itself (i.e. can be seen and changed by annotators), or should it be more for internal usage?

It could be either the labelled or the unlabelled data, or both. Maybe not from the OrcaAL site at this stage. Even from the command line, I think the setup steps are pretty tightly tied to our AWS setup. So if a colleague comes with their own data and wants to set up OrcaAL, it might not be trivial to get started. This is a bit open for discussion @yosoyjay @scottveirs

I agree with the assessment that the current implementation of the pipeline is pretty tightly coupled to AWS, but it wouldn't be too difficult to separate the AWS-specific deployment bits from the general, reusable code. This would facilitate someone else setting up their own instance of OrcaAL. I think making changes to the current implementation to point to other data sources might be a bigger, though not too heavy, lift, but I think that would be a secondary concern.

@Benjamintdk

Thanks @yosoyjay and @valentina-s for the replies, they'll be really helpful for my proposal!

Regarding the datasets, I think that I misinterpreted it the first time I read it. It seems like developer experience is something that can be improved upon a lot, as I also remember seeing a possible feature to easily add other models for experimentation as a suggestion for the previous GSoC run.

I agree with the assessment that the current implementation of the pipeline is pretty tightly coupled to AWS, but it wouldn't be too difficult to separate the AWS-specific deployment bits from the general, reusable code. This would facilitate someone else setting up their own instance of OrcaAL. I think making changes to the current implementation to point to other data sources might be a bigger, though not too heavy, lift, but I think that would be a secondary concern.

I want to further clarify what sort of data sources @yosoyjay might be referring to: would these be toy datasets that someone has in a zip file locally, in cloud storage on a separate S3 bucket, or even on another cloud platform (e.g. Google Cloud Storage on GCP)? I was thinking that if it were for local experimentation, then creating dev Docker containers might be a solution. There could be volume binds as access points for loading data and saving models on the local machine, and that would simplify installation and the non-trivial setup of the repo, as Valentina has mentioned.

@yosoyjay

Thanks @yosoyjay and @valentina-s for the replies, they'll be really helpful for my proposal!

Regarding the datasets, I think that I misinterpreted it the first time I read it. It seems like developer experience is something that can be improved upon a lot, as I also remember seeing a possible feature to easily add other models for experimentation as a suggestion for the previous GSoC run.

Yeah, absolutely. If this is something folks are interested in pursuing, we might also need to think a bit more about tracking different models.

I agree with the assessment that the current implementation of the pipeline is pretty tightly coupled to AWS, but it wouldn't be too difficult to separate the AWS-specific deployment bits from the general, reusable code. This would facilitate someone else setting up their own instance of OrcaAL. I think making changes to the current implementation to point to other data sources might be a bigger, though not too heavy, lift, but I think that would be a secondary concern.

I want to further clarify what sort of data sources @yosoyjay might be referring to: would these be toy datasets that someone has in a zip file locally, in cloud storage on a separate S3 bucket, or even on another cloud platform (e.g. Google Cloud Storage on GCP)? I was thinking that if it were for local experimentation, then creating dev Docker containers might be a solution. There could be volume binds as access points for loading data and saving models on the local machine, and that would simplify installation and the non-trivial setup of the repo, as Valentina has mentioned.

I don't know the scope of the potential inputs. My response was prompted by @valentina-s mentioning "So if a colleague comes with their own data and wants to set up OrcaAL, it might not be trivial to get started." The issues you bring up are exactly why it was mentioned that this would require additional discussion and why I thought it should be a secondary concern.

@Benjamintdk

@valentina-s @yosoyjay @scottveirs another thing that I was considering is the possibility of incorporating data version control into the current pipeline. I realized when looking through the code that, while we keep track of the different model checkpoints, the data in the S3 bucket doesn't seem to be tracked. This would make it difficult to pinpoint exactly which batch(es) of data started to cause performance to degrade, should that happen.
A potentially simple way of doing this would be to perform version control via the database, but I was thinking that something more robust might be better. Might this be a useful feature to have?


Benjamintdk commented Apr 12, 2022

@valentina-s @yosoyjay @scottveirs another thing that I was considering is the possibility of incorporating data version control into the current pipeline. I realized when looking through the code that, while we keep track of the different model checkpoints, the data in the S3 bucket doesn't seem to be tracked. This would make it difficult to pinpoint exactly which batch(es) of data started to cause performance to degrade, should that happen. A potentially simple way of doing this would be to perform version control via the database, but I was thinking that something more robust might be better. Might this be a useful feature to have?

Hi @yosoyjay @valentina-s @scottveirs, bumping this as I'm not sure if it got missed. Just to provide a little more context, I came across data versioning while doing a course (Full Stack Deep Learning) by UC Berkeley some time back.

@valentina-s

@Benjamintdk versioning did come up in the discussions while building OrcaAL, but of course we did not have time for it. I remember looking at DVC, which was rather new at that time, and now it seems to be a whole suite of useful tools! I do not know exactly how they compute diffs of data and whether there is potential for our data to explode size-wise, but in our context versioning the labels of the audio segments which went into training should be enough.
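A minimal sketch of that lighter-weight label versioning: snapshot the annotation table to a timestamped, content-hashed CSV in S3 before each retraining round, so any model checkpoint can be traced back to the exact labels it saw. The table, column, and bucket/prefix names are illustrative, not the existing schema:

```python
# Sketch: snapshot the annotations before each retraining round.
import hashlib
from datetime import datetime, timezone

import boto3
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://postgres:postgres@localhost:5432/orcaal")
labels = pd.read_sql("SELECT s3_key, label, labelled_at FROM annotations", engine)

csv_bytes = labels.to_csv(index=False).encode()
digest = hashlib.sha256(csv_bytes).hexdigest()[:12]   # content hash identifies the snapshot
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")

snapshot_key = f"label_snapshots/{stamp}_{digest}.csv"
boto3.client("s3").put_object(Bucket="orcagsoc", Key=snapshot_key, Body=csv_bytes)
print("Label snapshot stored at", snapshot_key)
```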
