The workshop will follow this weekly schedule for the duration of the course:
- Monday: A markdown file containing the learning materials for the week's topics. These are essentially textbook-style documents with inline code examples.
- Monday: An R Markdown worksheet and a Python worksheet, which you can download and open in RStudio and Spyder respectively. They contain exercises for participants to complete, based on the markdown file released on the same day.
- Thursday: Zoom meeting. Course coordinators will host a conference call in which they share their screen with participants. Solutions and recordings of the tutorial will be posted as soon as possible afterwards.
- Thursday: Solutions to the worksheet are provided after the Zoom meeting for those who could not attend.
Week 1 is an exception, as it is dedicated to installing software. The Zoom meeting will mostly be an informal meet and greet. Participants' issues or queries can be addressed over Zoom in week 1 only.
For the rest of the course, participants are encouraged to use the GitHub "issues" tab to post code queries or issues. This is by far the easiest way for course coordinators to debug issues you may be having. Debugging issues over Zoom does not benefit all participants and is time-consuming for course coordinators.
When an issue has been solved, it will be marked "closed" and will not appear in the open issues tab. Check the 'closed' issues tab for archived posts.
We assume that most participants use either macOS or Windows, so it is crucial that every participant has access to the same software. For this we have decided to use Anaconda, a Python and R distribution that bundles the conda package manager and runs on Windows, macOS and Linux. The first week of the workshop will be dedicated to making sure each participant has a fully working installation of Anaconda.
A tutorial on how to install Anaconda is available here.
GitHub is a free website where users can access code repositories and create their own repositories to store code, notes and small data files (max 100MB per file). As the workshop is being conducted via GitHub, we strongly encourage participants to create a GitHub account.
A guide on how to set up a GitHub account and navigate the website and the workshop repository is available here.
UNIX shell scripting in week 7 is a late addition to the workshop. Participants with Linux or macOS systems do not need to follow this installation step, as both are Unix-like systems that already provide a shell and the standard command-line tools.
Windows 10 users can install the Windows Subsystem for Linux (WSL). WSL provides a Linux environment that runs directly on Windows, so most native Linux command-line tools, utilities and binaries work: users can run Bash scripts and popular command-line tools such as sed, awk, grep, sort, apt and ssh. This will allow most participants to engage with the week 7 shell scripting exercises.
A tutorial for Windows users on installing WSL has been prepared here.
If you don't have Windows 10, an alternative installation using Cygwin is offered here.
Thursday 16th 2-3pm
A gentle introduction to RStudio and the R programming language, covering the basic syntax of R. Topics covered include data structures in R, creating and calling variables, logical operators, conditional statements, vectors, functions, for loops, while loops and loading packages in R.
Resources will be released on 20/04/2020.
A gentle introduction to the Spyder GUI and the Python programming language, covering the basic syntax of Python. Topics covered include data structures in Python, creating and calling variables, logical operators, conditional statements, for loops and while loops.
Resources will be released on 20/04/2020.
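As a rough taste of the Python topics listed above, the following minimal sketch touches each of them once. It is not taken from the workshop materials; the variable names and values are made up for illustration.

```python
# Data structures: a list and a dictionary (values are invented for illustration).
measurements = [12, 7, 43, 21]
participant = {"name": "example", "week": 2}

# Creating a variable and calling (reading) it later.
threshold = 20

# A for loop with a conditional statement that combines two
# comparisons using the logical operator `and`.
large_odd = []
for m in measurements:
    if m > threshold and m % 2 == 1:
        large_odd.append(m)

# A while loop: halve a value until it drops below the threshold.
value = measurements[2]
halvings = 0
while value >= threshold:
    value = value / 2
    halvings += 1

print(large_odd)
print(halvings, value)
```

The same ideas (vectors instead of lists, `&` instead of `and`) carry over to the R session, although the syntax differs.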
Thursday 23rd 2-4pm
Working with matrices in R, reading text/csv files into dataframes and performing manipulations, operations and subsetting using base R functions.
Working with dataframes in Python using the `numpy` and `pandas` libraries. Perform operations and tasks on the dataframes and write to files.
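The kind of dataframe workflow described here might look like the sketch below. It assumes `numpy` and `pandas` are installed (both ship with Anaconda). The column names and values are hypothetical, and an in-memory string stands in for a real CSV file so the example is self-contained.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content; in practice this would be a file on disk.
csv_text = "sample,group,value\nS1,control,4.0\nS2,treated,9.0\nS3,treated,16.0\n"

# Read the CSV into a dataframe.
df = pd.read_csv(io.StringIO(csv_text))

# Operate on a column: add a new column using a numpy function.
df["sqrt_value"] = np.sqrt(df["value"])

# Subset rows with a boolean condition.
treated = df[df["group"] == "treated"]

# Summarise: mean of `value` within each group.
group_means = df.groupby("group")["value"].mean()

# Write the subset out as CSV (to an in-memory buffer here).
out = io.StringIO()
treated.to_csv(out, index=False)

print(group_means)
print(out.getvalue())
```

To write to disk instead, pass a path such as `"treated.csv"` (a made-up name) to `to_csv`.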
Thursday 30th 2-4pm
This tutorial covers creating plots using base R and is extended to cover the `ggplot2` package. Further packages for visualizations are provided in the teaching materials.
This tutorial covers creating plots in Python, using the popular `matplotlib` library and the increasingly popular `seaborn` library.
Thursday 7th 2-4pm
This tutorial has 3 parts:
1) Descriptive statistics for single-variable and multivariable datasets, including measures of central tendency, variability and quantiles, along with distributions, the Central Limit Theorem and confidence intervals for the mean.
2) Hypothesis testing in the form of two-sample comparisons (the t-test and a non-parametric test) and correlation tests (parametric and non-parametric)
3) Linear regression analysis
This tutorial is split into 3 parts and covers:
1) Descriptive statistics for single-variable and multivariable datasets, including measures of central tendency, variability and quantiles, along with distributions, the Central Limit Theorem and confidence intervals for the mean.
2) Hypothesis testing in the form of two-sample comparisons (the t-test and a non-parametric test) and correlation tests (parametric and non-parametric)
3) Linear regression analysis
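To make the three parts concrete, here is a minimal sketch using only Python's standard library. The data are invented, and the helper names `welch_t` and `fit_line` are hypothetical. The confidence interval uses a normal approximation for simplicity; with a sample this small a t-distribution would be more appropriate. Computing a p-value is left out because it needs a t-distribution, which the standard library does not provide.

```python
import math
import statistics as st

# Invented example data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# 1) Descriptive statistics: central tendency, variability, quantiles.
mean_y = st.mean(y)
median_y = st.median(y)
sd_y = st.stdev(y)                # sample standard deviation
quartiles = st.quantiles(y, n=4)  # three cut points

# Approximate 95% confidence interval for the mean (normal approximation).
z = st.NormalDist().inv_cdf(0.975)
half_width = z * sd_y / math.sqrt(len(y))
ci = (mean_y - half_width, mean_y + half_width)


# 2) Two-sample comparison: Welch's t statistic (no p-value computed here).
def welch_t(a, b):
    va, vb = st.variance(a), st.variance(b)
    return (st.mean(a) - st.mean(b)) / math.sqrt(va / len(a) + vb / len(b))


# 3) Simple linear regression by ordinary least squares.
def fit_line(xs, ys):
    mx, my = st.mean(xs), st.mean(ys)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
    sxx = sum((xi - mx) ** 2 for xi in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


slope, intercept = fit_line(x, y)
print(mean_y, median_y, ci)
print(welch_t(y, x))
print(slope, intercept)
```

In practice a statistics library would normally be used for the tests and regression; the hand-written versions are only meant to show what is being computed.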
Thursday 14th 2-4pm
2-3pm - Python part
3-4pm - R part
Distance metrics and clustering methods in unsupervised machine learning, visualised as dendrograms and heatmaps. Dimensionality reduction using PCA, visualising principal components in biplots. Supervised machine learning covering data pre-processing and cleaning, creating training and test sets, and implementing k-nearest neighbours (KNN), random forest (RF) and elastic net machine learning models.
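As a small illustration of how a distance metric, a training/test split and KNN fit together, here is a from-scratch sketch in plain Python. It is not the workshop's implementation, which presumably relies on dedicated libraries. The points, labels and the helper names `euclidean` and `knn_predict` are made up for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: euclidean(train_points[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Invented two-dimensional data forming two well-separated groups.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (1.1, 0.9), (5.1, 5.2),
]
labels = ["a", "a", "a", "b", "b", "b", "a", "b"]

# Training and test sets: hold out the last two points for testing.
train_points, test_points = points[:6], points[6:]
train_labels, test_labels = labels[:6], labels[6:]

predictions = [knn_predict(train_points, train_labels, p, k=3) for p in test_points]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```

A real analysis would shuffle the data before splitting and scale the features during pre-processing; both steps are omitted here to keep the sketch short.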