MVP: User Experience Scenarios
See issue #130
The following describes the experience a user will get when they access our Science Platform.
The goal of this page is to present the different possible paths, along with the development cost associated with each, so that we can have a better idea of what is possible as a Minimum Viable Product deliverable.
The first time a user arrives at the site they will reach a landing page, with some general information on the platform, including information on how to gain access to our system, as well as options to either login, or register.
Login will be done via a username & password form, and/or using OAuth with federated authentication from any of the generally used tools and social media sites (Gmail, Github, Facebook, ...). Potentially, with the OAuth route we can provide authentication via EGI Check-in or IAM (Institute Federated Auth). In our current prototype the user logs in to Zeppelin, so usernames & passwords are created in a Zeppelin configuration. The proposal here is to separate the authentication process from Zeppelin, so that it is built and managed by us.
The following describes some of the possible paths for authentication:
User requests access via email. We manually create the account and send the user their login credentials. The user can then use these credentials to log in using a login form.
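As a rough illustration of the "web service & UI form to handle login" item listed below, here is a minimal sketch of the login check, assuming a Flask service and credentials stored as salted hashes in a simple configuration file. The file name, field names and route are placeholder assumptions, not part of the current prototype:

```python
# Minimal sketch: login endpoint backed by a JSON configuration file that maps
# username -> password hash. File name and layout are assumptions.
import json
from flask import Flask, request, abort, session
from werkzeug.security import check_password_hash

app = Flask(__name__)
app.secret_key = "change-me"  # placeholder; load from secure config in practice

with open("users.json") as f:          # e.g. {"alice": "pbkdf2:sha256:..."}
    USERS = json.load(f)

@app.route("/login", methods=["POST"])
def login():
    username = request.form.get("username", "")
    password = request.form.get("password", "")
    stored_hash = USERS.get(username)
    if stored_hash is None or not check_password_hash(stored_hash, password):
        abort(401)                      # unknown user or wrong password
    session["user"] = username          # session cookie stands in for the access token
    return {"status": "ok", "user": username}
```

The resulting session (or a signed token derived from it) would then still need to be propagated to Zeppelin, which is listed as a separate development cost below.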
Development Cost:
- Username and password storage (Configuration or Database)
- Administration email for registering users [dm]
- Administration process for registering users [dm]
- Web service & UI form to handle login
- Propagate access token from portal to Zeppelin
Threats:
- Manual management of users: we need to manually add usernames & passwords in a configuration file or database
- Registration is manual, so users send us an email and we manually create an account for them
- Management/administration [dm] effort is very high; not good if we want to serve a large number of users
- General Data Protection Regulation (GDPR) [dm]
Benefits:
- Simpler to implement [dm]
- Minimal cost of development if we are only supporting a small set (<10) of beta testers for our December release
Limits:
- Less than 10 users [dm]
User can request access using a registration form. Once registered, the account is authorized either manually or automatically. Once authorized, the user is notified and can use the credentials they set to log in from our login page.
Development Cost:
- Database of users
- Web Service and UI form to handle login
- Registration Web service & UI
- Automated emails for registration
- Propagate access token from portal to Zeppelin
Threats:
- Additional development effort [dm]
- We may still need manual effort authorizing users
- General Data Protection Regulation GDPR [dm]
Benefits:
- Less manual effort managing users
- Automated registration/login meets user expectations
Limits:
- More than 10 users [dm]
Same process as the automated login described above, but additionally the user can log in with OAuth from existing services and social media websites (Gmail, Github, Facebook) or institutional auth services (EGI Check-in/IAM) without entering a username + password in our service.
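As a rough sketch of what one of the OAuth "connectors" listed in the development costs below might look like, here is a minimal flow using the requests-oauthlib library with GitHub as the example provider. The client ID/secret values and route names are placeholders, not an agreed design:

```python
# Sketch of a GitHub OAuth2 login flow using requests-oauthlib.
# CLIENT_ID/CLIENT_SECRET and the route names are placeholder assumptions.
from flask import Flask, request, redirect, session
from requests_oauthlib import OAuth2Session

AUTH_URL = "https://github.com/login/oauth/authorize"
TOKEN_URL = "https://github.com/login/oauth/access_token"
CLIENT_ID = "our-client-id"
CLIENT_SECRET = "our-client-secret"

app = Flask(__name__)
app.secret_key = "change-me"

@app.route("/login/github")
def github_login():
    oauth = OAuth2Session(CLIENT_ID)
    authorization_url, state = oauth.authorization_url(AUTH_URL)
    session["oauth_state"] = state          # guard against CSRF
    return redirect(authorization_url)

@app.route("/login/github/callback")
def github_callback():
    oauth = OAuth2Session(CLIENT_ID, state=session["oauth_state"])
    oauth.fetch_token(TOKEN_URL, client_secret=CLIENT_SECRET,
                      authorization_response=request.url)
    profile = oauth.get("https://api.github.com/user").json()
    # At this point we would look up (or create) the matching account
    # in our own user database and start a platform session.
    session["user"] = profile["login"]
    return {"status": "ok", "user": profile["login"]}
```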
Development Cost:
- All of the above
- OAuth services: we don't create these, but we still need to develop connectors for them
- Connect OAuth account to user account stored in our database (?)
Threats:
- We haven't created an OAuth login system before, so the cost in development effort is unknown
Benefit:
- Easier access for users (e.g. just use an existing service like Github; you don't need to remember another username/password)
- There are tutorials available for doing OAuth. (Very commonly used)
- Some of the General Data Protection Regulation (GDPR) obligations may be handled by the OAuth provider [dm]
Once logged in, users will be able to connect to a Zeppelin instance. In our ideal target platform, new users would be able to spawn minimal clusters and immediately access the data and run Spark jobs, but they would also be able to reserve a cluster of a specific size for a set time duration. As the ideal target platform comes with a cost in development effort, we provide different possible user paths for a Minimum Viable Product. The goal, as with all these paths, is to decide whether the development cost is such that it may risk not meeting our deadline for an MVP product. Additionally, we should, as much as possible, identify common tasks that need to be completed by all paths, so that we start with those.
In our reservation based usage scenario, a user can select, through a calendar based system, a start and end date/time for which they want to book resources. If the resources are available, they can just book them now and jump in. If the system is busy, the calendar will offer slots for tomorrow, later this week or next week, depending on how busy the system is. If they want to run long running analyses they can book a block of several days. If they are a member of the Gaia consortium they can make an exclusive booking which guarantees their jobs will be run on a separate set of virtual machines from the rest of the system.
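As a rough sketch of the booking logic this would need, the snippet below checks a requested time slot against existing reservations and the total capacity, and either accepts it or suggests the next free slot. The class, field names and capacity figures are illustrative assumptions, not a design:

```python
# Sketch: check whether a requested booking fits the remaining capacity.
# Booking fields and the capacity figures are placeholder assumptions.
from dataclasses import dataclass
from datetime import datetime, timedelta

TOTAL_CORES = 128
TOTAL_RAM_GB = 512

@dataclass
class Booking:
    user: str
    cores: int
    ram_gb: int
    start: datetime
    end: datetime

def overlaps(a: Booking, b: Booking) -> bool:
    return a.start < b.end and b.start < a.end

def fits(request: Booking, existing: list[Booking]) -> bool:
    """Conservative check: sum every booking that overlaps the window."""
    concurrent = [b for b in existing if overlaps(b, request)]
    used_cores = sum(b.cores for b in concurrent)
    used_ram = sum(b.ram_gb for b in concurrent)
    return (used_cores + request.cores <= TOTAL_CORES
            and used_ram + request.ram_gb <= TOTAL_RAM_GB)

def next_free_slot(request: Booking, existing: list[Booking]) -> datetime:
    """Slide the requested window forward in 1-hour steps until it fits."""
    candidate = request
    while not fits(candidate, existing):
        shift = timedelta(hours=1)
        candidate = Booking(request.user, request.cores, request.ram_gb,
                            candidate.start + shift, candidate.end + shift)
    return candidate.start
```

In practice the bookings would live in the database table(s) listed under the development costs below, and the capacity totals would come from the Openstack project quota rather than constants.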
Development Cost:
- Build user interface and web services for reservation system, (Create / View / Delete / Edit bookings)
- Create Database table(s) to store reservations
- Build service to spawn a cluster of the reserved size at the given timeframe; the service needs to be able to stop and be deleted after the given timeframe
- Clusters need to be deployable from a given configuration where we specify shape/size/other metadata (RAM, cores, 3rd party libraries, other)
- Build service to manage quotas, to limit how much users can book and for how long (i.e. stop/undefine cluster when time is exceeded)
- Additional development effort not listed [dm]
Threats:
- Development Effort required to build and manage resource booking system
- Allocating an exclusive platform for every session costs more in resources. (shared resources avoids this [dm])
- Starting up a new Kubernetes cluster can take 10 - 20 minutes depending on the system load. (shared resources avoids this [dm])
- Doesn’t really work for running a quick experiment on an interactive notebook platform today. (shared resources avoids this [dm])
- We may be wasting Openstack resources if we allocate them but they are not used. Allocated resources are only available to that user, so they will most likely be under-utilized. (shared resources avoids this [dm])
Benefits:
- Avoids a lot of the user account handling needed for a shared platform [dm]
- Provides a guarantee of availability for the selected resources at the given timeframe
- Suitable for a long-running job (several days)
- Addresses the problem of multiple concurrent users using up all the available resources at a given time, meaning any new jobs have to wait for all others to complete (i.e. simple jobs can take several hours)
User is able to create a "small" package (e.g. X worker nodes, Z GB RAM, ...) and immediately connect to it to start running Spark jobs. The user will also be able to book resources as described above. (why just 'small'? [dm])
Development Costs:
- Additional development effort [dm]
- Manage available resources in an automated way; limit how many small clusters can be run, so that any booked clusters are not affected (this also applies to reservations [dm])
- Clean up service that deletes idle small clusters (this also applies to reservations [dm])
- Notification system that checks if there are available resources, and if not notifies users (this also applies to reservations [dm])
- Additional development effort not listed [dm]
Threats:
- This may not always be available. What happens if all resources are reserved, or used by other users, for immediate access?
- Overhead in developing and managing available resources and user quotas; we need to carefully consider user experience (i.e. what happens if everything is booked? Can users then not run even small packages?)
- Management of idle small clusters
Benefits:
- Unless system is fully booked and/or reserved, new users will have immediate access and will be able to get familiar with using Spark to access the data with a small cluster.
Similar user experience to what we have now. All jobs go to the same queue, as happens currently with Yarn, and the jobs are sent to the same pool of worker nodes. Resources are thus shared, and all Spark jobs are distributed to the same pool of available workers. In this scenario, if at any given time all worker nodes are busy, new jobs that come into the queue cannot be sent to the worker nodes. In this case the new job has to wait for an available slot. Ideally, we would notify the user on the status of the queue and their job, so that they are not just left with a job that has started, without additional feedback.
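As a rough illustration of the queue-status notification mentioned above, a sketch that polls the Yarn ResourceManager REST API for pending and running application counts. The ResourceManager URL and the notify_user helper are assumptions made for the example:

```python
# Sketch: poll the Yarn ResourceManager for queue pressure and warn the user.
# The ResourceManager URL and the notify_user() helper are assumptions.
import time
import requests

RM_METRICS_URL = "http://resourcemanager.example.org:8088/ws/v1/cluster/metrics"

def notify_user(user: str, message: str) -> None:
    # Placeholder: could be a Zeppelin paragraph update, email, or web UI banner.
    print(f"[{user}] {message}")

def watch_queue(user: str, poll_seconds: int = 60) -> None:
    while True:
        metrics = requests.get(RM_METRICS_URL, timeout=10).json()["clusterMetrics"]
        pending = metrics["appsPending"]
        running = metrics["appsRunning"]
        if pending == 0:
            notify_user(user, "Cluster has free capacity; your job should start promptly.")
            return
        notify_user(user, f"{running} jobs running, {pending} waiting; "
                          "your job may be delayed.")
        time.sleep(poll_seconds)
```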
Development Costs:
- Need to generate separate Unix accounts for each user
- Synchronise the accounts on all of the compute and storage nodes
- Prevent people from overwriting each other's files
- Investigate & Implement notification system, for notifying user about queue status
- Additional development effort not listed [dm]
Threats:
- Management and development effort separating user folders, ensuring one user cannot access the folders of another
- In the case of high server load, jobs might become very slow and not complete in a reasonable time. Unless we notify the user of what is going on, this will be a bad user experience
- A few bad jobs might slow down the system for others. To minimize this we would have to limit the amount of resources (cores, RAM) that a single job can use. But in this case we would be limiting the resources we provide to each user, so performance would go down, unless the resources a user can utilize change based on roles (power users get more cores & RAM)
- What happens if 10, 25, 50, 100 users all try to run the example analyses? [dm]
Benefits:
- (Potentially) easier to implement (how? [dm])
- Better Openstack resource utilization (how? [dm])
In this usage scenario, once logged in, a user will be able to immediately create a cluster of a given size, assuming there are currently resources available. This is similar to the calendar based system, only it is a first-come-first-serve scenario, where users are able to get immediate access to a cluster, if one is available. (does this have a calendar or not? [dm])
This process will either take a few minutes to complete if we have to create the cluster at that moment, or it will be immediate if we create clusters before they are needed.
If the resources are all currently used, the user will be shown an appropriate message stating that there are currently no available resources. Potentially, we can build a custom queue for users, so that the user is then shown a message like: "You are in position 1 of the queue." Once resources are freed by another user, the next person in the queue will be allowed to create a cluster.
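A minimal sketch of such a first-come-first-serve queue, assuming an in-memory FIFO and a fixed number of cluster slots (the class, method names and slot count are placeholders):

```python
# Sketch: first-come-first-serve waiting list for cluster slots.
# The slot count and class/method names are placeholder assumptions.
from collections import deque
from typing import Optional

class ClusterQueue:
    def __init__(self, total_slots: int = 4):
        self.free_slots = total_slots
        self.waiting: deque[str] = deque()

    def request_cluster(self, user: str) -> str:
        """Grant a slot immediately if one is free, otherwise queue the user."""
        if self.free_slots > 0 and not self.waiting:
            self.free_slots -= 1
            return f"{user}: cluster slot granted, creating cluster..."
        self.waiting.append(user)
        return f"{user}: You are in position {len(self.waiting)} of the queue."

    def release_cluster(self) -> Optional[str]:
        """Called when a cluster is deleted; hand the slot to the next user."""
        if self.waiting:
            next_user = self.waiting.popleft()
            return f"{next_user}: a slot is now free, creating your cluster..."
        self.free_slots += 1
        return None
```

In practice this state would live in the database or data store listed under the development costs below, not in memory, so that it survives service restarts.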
Development Costs:
- Need to generate separate Unix accounts for each user
- Synchronise the accounts on all of the compute and storage nodes
- Prevent people from overwriting each other's files
- Service to create, or assign a previously created cluster to a user
- Queueing service for creating clusters
- Database or other data store for queue and dynamic cluster information
- Notification service
- Service to clear up idle clusters
- Clusters need to be deployable from a given configuration where we specify shape/size/other metadata (RAM, cores, 3rd party libraries, other)
- Additional development effort not listed
Threats:
- Management and [dm] development effort building services to automatically handle creating new clusters based on demand, and clearing idle clusters
- Management and [dm] development effort handling the queue of users
- Managing quotas and maximum lifetime of clusters
- If we can't create the clusters beforehand, there may be a delay (minutes) while the cluster is created
- Overall complexity
- What happens if the queue is busy for 5 min, 50 min, 5 days? [dm]
Benefits:
- (Potentially) better user experience. Users have immediate access to clusters of different sizes without having to reserve at a specific time.
Development effort may be a benefit or a threat, still an unknown.
The choice among the above paths for an MVP would depend on a few things: how many users are we expecting? How many concurrent users are we expected to support? Do we guarantee immediate availability? Do we guarantee availability at a certain date & time? Do we provide preferential treatment to power users? What can we build by December?
Once in the Zeppelin environment there will be a number of example notebooks that demonstrate how to use Pyspark to access Gaia data and get familiar with our system. The user can start running Spark (or pure Python) jobs iteratively. They will be able to discover and filter through the data, and download results to their local machine. Behind the scenes we will be assigning a new container (Kubernetes Pod) for every new user Zeppelin session; otherwise users will have to share a single Zeppelin service/node. Any Python commands that do not use the Pyspark lib will run on the Zeppelin node, which means that it won't scale well to multiple users if we have a single Zeppelin service.
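As an illustration of what one of those example notebook cells might contain, here is a minimal Pyspark sketch. The parquet path is an assumption about how the Gaia data would be exposed, and in a Zeppelin notebook the SparkSession would normally already be available as `spark`:

```python
# Sketch of an example-notebook cell: filter a Gaia source table with Pyspark.
# The parquet path is a placeholder for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("gaia-example").getOrCreate()

sources = spark.read.parquet("/data/gaia/gaia_source.parquet")

# Select a small, bright subset and bring it back to the notebook for download.
bright = (sources
          .select("source_id", "ra", "dec", "phot_g_mean_mag")
          .filter(F.col("phot_g_mean_mag") < 12.0)
          .limit(1000))

bright_pdf = bright.toPandas()   # small result set the user can save locally
print(bright_pdf.head())
```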
We may not be able to develop all of the options described in this document in time for the release of the first version in December 2020. For this reason we are in the process of defining our MVP, which will help prioritize the order in which we assign development effort to tasks. There will be some common infrastructure components between the different paths that we should implement first. For the different usage scenarios, once we define which is part of the MVP path, we need to implement all the relevant tasks first, and then iteratively add features to that. This requires us to develop the first version in a way that doesn't block the other features of our target platform from being added. We may also decide to prototype different options, if there are usage scenarios that we are not yet sure of, and would like to see in practice to help refine our MVP.
Note: For the development costs, not all required development tasks are listed, because this would grow the page quite a bit, and because some of them are still unknown.