
Commit

adapting replication package to fit FSE AET requirements
carminevassallo committed Jun 4, 2020
1 parent 6dcfd51 commit cbd928f
Showing 6,600 changed files with 464,741 additions and 31,065 deletions.
The diff you're trying to view is too large. We only load the first 3000 changed files.
1 change: 1 addition & 0 deletions .dockerignore
@@ -0,0 +1 @@
**/*.DS_Store
38 changes: 38 additions & 0 deletions Dockerfile
@@ -0,0 +1,38 @@
FROM ubuntu:18.04

# install Java 11
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update -q \
&& apt-get upgrade -q -y \
&& apt-get install -q -y \
binutils \
openjdk-11-jdk-headless

# install Maven (latest version)
RUN apt-get -y install maven

# install Python 3.7
RUN apt install software-properties-common -y \
&& add-apt-repository ppa:deadsnakes/ppa -y \
&& apt install python3.7 -y

# install pip
RUN apt install python3-pip -y

# install R
RUN apt-get install r-base -y
# RUN R -e "install.packages('devtools',dependencies=TRUE, repos='http://cran.rstudio.com/')"
# RUN R -e "install.packages('likert',dependencies=TRUE, repos='http://cran.rstudio.com/')"
# > require(devtools)
# > install_github("likert","jbryer")

# install an editor
RUN apt-get install vim -y

# copy all content inside
RUN mkdir /home/cd-linter-artifacts
COPY . /home/cd-linter-artifacts
WORKDIR /home/cd-linter-artifacts

# install the required dependencies
RUN pip3 install -r requirements.txt
23 changes: 23 additions & 0 deletions INSTALL.md
@@ -0,0 +1,23 @@
# Installation

In this document, you will find all the information you need to set up an environment that can execute our artifacts.

## Execute the artifacts in a Docker Container (Recommended)

In the root of our replication package, you find a `Dockerfile`. This file can be used to build an image that your Docker installation can run.

1. Install [Docker for Desktop](https://www.docker.com/products/docker-desktop)
2. Open a terminal and navigate to the root folder of the replication package.
3. Build the image: `docker build --no-cache -t artifact:v1 .`
4. Run the image: `docker run -it artifact:v1` (a combined snippet with an optional volume mount is sketched after this list)
5. A `.dockerignore` file is included so that `.DS_Store` files are excluded from the Docker build context.
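
Steps 3 and 4 combined into a single snippet. The `-v` volume mount is an optional addition of ours, not part of the original instructions: it maps a host folder into the container's working directory (`/home/cd-linter-artifacts`, as set in the `Dockerfile`) so that files generated during a session survive after the container exits; the host path is only an example.

```sh
# build the image from the root of the replication package
docker build --no-cache -t artifact:v1 .

# run it interactively; the volume mount is optional and the host path is just an example
docker run -it -v "$(pwd)/docker-output:/home/cd-linter-artifacts/docker-output" artifact:v1
```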

## Setup your own environment

If you like, you can build your own environment to execute the artifacts. You need to manually install the following tools and copy the content of our replication package.

* Java 11
* Maven 3.6.0
* Python 3.6
* pip ...
* R ...
133 changes: 111 additions & 22 deletions README.md
@@ -1,34 +1,123 @@
# Configuration Smells in Continuous Delivery Pipelines: A Linter and A Six-Month Study on GitLab (FSE 2020, Replication Package)
# CD-Linter Artifacts

Authors: Carmine Vassallo, Sebastian Proksch, Anna Jancso, Harald C. Gall, Massimiliano Di Penta.
This is the replication package of the paper "[Configuration Smells in Continuous Delivery Pipelines: A Linter and A Six-Month Study on GitLab](https://doi.org/10.5281/zenodo.3860984)" accepted for publication at [ESEC/FSE 2020](https://2020.esec-fse.org/).

Link to the preprint: http://doi.org/10.5281/zenodo.3860985
In the following, we describe the artifacts of our paper and how to use them to replicate the results of our study. When appropriate, we also link the description of the artifacts to the relevant sections in the paper.

## DATASETS (datasets/)
## Environment Set-up

**Warning** Direct links to the opened issues are omitted according to the double-blind policy. Furthermore, we discourage browsing the issues on the GitLab projects, as doing so might reveal the authors' identities.
Follow the instructions described in [INSTALL.md](INSTALL.md).

- `analyzedprojects.csv`: all starred GitLab projects that are not forks.
- `dataset.csv`: all projects (from the previous set) that have a `.gitlab-ci.yml` file.
- `rq1-comment-sorting/`: this folder contains all the material related to the card sorting performed on the received comments: the labels we found (`card-sorting-labels.csv`) and the agreement between the two coders (`card-sorting-agreement.csv`). This folder also contains a summary of the reasons for rejection.
- `rq1-cdsmells.csv`: info about the opened issues and the corresponding reactions/fixes.
- `rq2-precision.csv`: the results of the manual inspection of issues to compute the precision.
- `rq2-recall/`: this folder contains the selected sample of projects (`recall-sample.csv`) and the results of the recall analysis (`recall-results.csv`).
- `rq3-analysis/`: it contains all the files needed to compute RQ3 and a summary of the RQ3 results.
- `rq3-results.csv`: the list of all anti-patterns detected when taking the "dataset.csv" projects as input.
- `dataset_yml-update.csv`: "dataset.csv" projects with info about the `.gitlab-ci.yml` length and their (updated) existence on GitLab.
- `tools-to-exclude-fuzzyversion.txt`: the list of the tools excluded from fuzzy-version detection.
- `good_bad_cd-practices.csv`: all the good and bad practices that we collected from the Foundations part of Humble and Farley's book.
**Note** To successfully run all scripts listed in this document, you need to keep the root of this repository as your working directory. An internet connection is required.

## <a id="cd-linter"></a>Build and Run CD-Linter

## SCRIPTS (scripts/)
CD-Linter is a semantic linter for GitLab CI/CD pipelines. It downloads and checks the configuration files (e.g., _.gitlab-ci.yml_, _pom.xml_, _requirements.txt_ files) of open-source projects that are hosted on GitLab.com for the presence of four CD smells (see Section 3.2).

Note: please install all the required dependencies.
CD-Linter has been implemented in Java and its source code is available in the `cd-linter` folder. To detect smells, CD-Linter accepts as inputs:

- `study-results.py`: this script analyzes and computes the results of the research questions.
- `likert-scaled-reactions.R`: this script is used to compute Figure 4 (`reactions.csv`, which is generated by `study-results.py`, must be passed as the input file).
* the list of projects (see `cd-smell-occurrences/dataset.csv`)
* the path to the folder where configuration files are cached
* the path to the resulting file that will contain all detected smells.

To build and execute CD-Linter, please run:

## CD-Linter Implementation (cd-linter-code/)
    mvn -f cd-linter clean test exec:java -Dexec.mainClass="ch.uzh.seal.app.ConfigurationAnalytics" -Dexec.args="cd-smell-occurrences/dataset.csv cd-smell-occurrences/linter_configfiles cd-linter/target/cd-smells.csv"

The `cd-smell-occurrences/dataset.csv` file contains the list of projects that have been analyzed in our study (see Section 4.1). The `cd-smell-occurrences/linter_configfiles` folder includes the latest version of the configuration files used to measure the CD-smell occurrences in RQ3 (see Section 4.4).
<!--- The resulting `cd-linter/CI-anti-patterns.csv` file corresponds to `cd-smell-occurrences/rq3-results.csv`.-->

It contains the source code of our linter (all packages have been anonymized).
### Query GitLab projects and Download Configuration Files

While its primary goal is to detect CD smells, CD-Linter provides another functionality for the purposes of our study: it mines the full list of open-source projects available on GitLab.com together with some basic statistics, such as the languages used and the numbers of stars and forks.

To try this feature, please execute[^1]: <mark>TODO: Add a valid token that possibly expires soon</mark>

    mvn clean test exec:java -Dexec.mainClass="ch.uzh.seal.datamining.gitlab.GitLabMiner" -Dexec.args="$TOKEN $OUTPUT 0"
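
The command references two shell variables. A minimal sketch of how they could be set before running it; the token value is a placeholder (generate your own personal access token on GitLab.com), and the output file name is only an example:

```sh
# placeholder values: substitute your own GitLab.com personal access token
# and whatever output path you prefer
export TOKEN="<your-gitlab-personal-access-token>"
export OUTPUT="gitlab-projects.csv"

mvn clean test exec:java -Dexec.mainClass="ch.uzh.seal.datamining.gitlab.GitLabMiner" -Dexec.args="$TOKEN $OUTPUT 0"
```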

<!-- The execution returns a file having the same structure of `cd-smell-occurrences/dataset.csv`. -->

[^1]: Note that a full analysis of the GitLab ecosystem takes weeks.

<!--- ## Construction of the original dataset
We applied several filters from a broad GitLab query to construct a dataset consisting of 5,312 projects.-->

## Analysis of the Reactions to the Opened Issues (RQ1)

<!--- ### Selection of the issues to open
We detected smells on the latest versions of the selected projects available at <mark>XX</mark>. From the resulting 5,312 smells we selected a sample of 168 smells to open applying several filters. During the assessment stage, we discarded 23 from this set and finally open 145 issues. The full list of issues together with the reactions is available at `XX.csv`.-->


All the reactions to the opened issues (see Section 4.2) have been collected in the `reactions-to-issues/rq1-cd-smells.csv` file. Among other fields, each line contains the following information[^2] (a quick way to take a first look at this file is sketched right after the list):

* link to the issue (_linkToOpenedIssue_)
* status of the reported smell (_fixed_)
* reaction to the issue (_reaction_)
* number of upvotes (_numUpvotes_), downvotes (_numDownvotes_), and comments (_numComments_)
* status of the issue (_state_)
* assignment to team members (_isAssigned_)
* resolution time in ms (_resolution-time_)
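
A minimal Python sketch for a first look at this file. It assumes `pandas` is available in the environment (it is not explicitly listed among the dependencies) and uses the column names from the list above; adjust the path or the column names if they differ in your checkout:

```python
# a first look at the collected reactions; pandas availability and the exact
# column names are assumptions based on the description above
import pandas as pd

reactions = pd.read_csv("reactions-to-issues/rq1-cd-smells.csv")
print(reactions["reaction"].value_counts())     # distribution of reactions
print(reactions["fixed"].value_counts())        # how many reported smells were fixed
print(reactions["resolution-time"].describe())  # resolution time in ms
```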

Taking this file as input, please execute the following command to replicate the analysis performed in RQ1.

    python3 reactions-to-issues/rq1-analysis.py reactions-to-issues/rq1-cdsmells.csv reactions-to-issues/fig4-source.csv > reactions-to-issues/rq1-results.txt

The generated output `reactions-to-issues/rq1-results.txt` contains (i) an analysis of the received reactions, (ii) the labels assigned to the rejected issues with their occurrences, and (iii) the resolution type per CD smell.

The previous script requires a second argument, `reactions-to-issues/fig4-source.csv`: the path to the file that will store the reactions to the issues in a format that can be processed by the following script to generate `reactions-to-issues/figure4.png`. <mark>TODO: Add Missing Dependency in Dockerfile</mark>

    Rscript reactions-to-issues/likert-scaled-reactions.R

[^2]: The last update was done on 05/02/2020.

### Card-sorting of the Received Comments

Two authors performed an independent tagging of the comments that we received in the opened issues and then merged their annotations. The folder `reactions-to-issues/rq1-comment-sorting` contains the labels (`card-sorting-labels.csv`) and the result of the agreement (`card-sorting-agreement.csv`).

## Accuracy of CD-Linter (RQ2)

The results of the manual validation of CD smells (see Section 4.3) are available in the `accuracy/rq2-precision.csv` and `accuracy/rq2-recall.csv` files.

The first file, which we used to compute the precision of our tool, contains a sample of detected CD smells together with the final tag (or rating) assigned by the validators. The `accuracy/rq2-recall-sample.csv` file instead contains the list of projects inspected for the recall analysis of CD-Linter. We compared the list of manually identified smells with the ones detected by CD-Linter in `accuracy/rq2-recall.csv`. Specifically, given a manually identified smell, the column _match with cd-linter_ contains _YES_ if the smell was also detected by CD-Linter.
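
Purely as an illustration of how the recall follows from this file (the official computation is in `accuracy/rq2-analysis.py`, invoked below), a minimal sketch; `pandas` and the exact CSV header spelling are assumptions:

```python
# recall = manually identified smells that CD-Linter also detected / all manually identified smells;
# pandas and the exact column header are assumptions based on the description above
import pandas as pd

recall_df = pd.read_csv("accuracy/rq2-recall.csv")
detected = recall_df["match with cd-linter"].astype(str).str.strip().str.upper().eq("YES")
print(f"Recall: {detected.mean():.2f}")
```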

To compute the results shown in Tables 2 and 3 and the recall values (see Section 5.2), please run:

    python3 accuracy/rq2-analysis.py accuracy/rq2-precision.csv accuracy/rq2-recall.csv > accuracy/rq2-results.txt


### Generate a Random Sample to Compute the Recall

We generated a random sample of 100 projects to compute the recall. Specifically, we excluded projects whose smells had already been analyzed for the precision study or were part of the opened issues. To generate such a random sample (and store it in `accuracy/generated-sample.csv`), please run:

    python3 accuracy/rq2-recall-sample.py cd-smell-occurrences/dataset.csv accuracy/rq2-precision.csv reactions-to-issues/rq1-cdsmells.csv accuracy/generated-sample.csv
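
The official logic lives in `accuracy/rq2-recall-sample.py`; purely to illustrate the exclude-then-sample idea described above, a rough sketch in which every column name is an assumption rather than taken from the real files:

```python
# exclude projects already inspected (precision study, opened issues), then draw 100 at random;
# ASSUMPTION: all three files identify projects via a column called "project"
import pandas as pd

projects  = pd.read_csv("cd-smell-occurrences/dataset.csv")
precision = pd.read_csv("accuracy/rq2-precision.csv")
issues    = pd.read_csv("reactions-to-issues/rq1-cdsmells.csv")

already_inspected = set(precision["project"]) | set(issues["project"])
candidates = projects[~projects["project"].isin(already_inspected)]

candidates.sample(n=100, random_state=42).to_csv("accuracy/generated-sample.csv", index=False)
```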

<!-- After incorporating a few feedback in our tool, we rerun it and detected 5,011 smells. We then evaluated CD-Linter's precision and recall measures.
### Selection of smells for precision
Applying several filters (see Section XX), we selected 868 smells to evaluate. The results are available in `hello.csv`. -->

## Occurrences of CD smells (RQ3)

We ran CD-Linter against the latest snapshot of the projects that were still available at the end of the six-month study (as already described in [Build and Run CD-Linter](#cd-linter)). The results are available at `cd-smell-occurrences/rq3-results.csv`.

To compute the results (that are described in Section 5.3), please run:

    python3 cd-smell-occurrences/rq3-analysis.py cd-smell-occurrences/dataset_yml-update.csv cd-smell-occurrences/rq3-cd-smells.csv > cd-smell-occurrences/rq3-results.txt


### Compute the size of .gitlab-ci.yml files

You might have noticed that the previous script takes `dataset_yml-update.csv` instead of `dataset.csv`. This file is required for the analysis of CD smells across different `.gitlab-ci.yml` sizes (see Table 5). To generate it, please run the following command:

    mvn -f cd-linter clean test exec:java -Dexec.mainClass="ch.uzh.seal.app.ConfigurationMiner" -Dexec.args="cd-smell-occurrences/dataset.csv cd-smell-occurrences/linter_configfiles cd-linter/target/dataset_yml-update.csv"

## Others

### Bad practices in CI/CD

All the good and bad practices that we extracted from the _Foundations_ part of Humble and Farley's book[^3] are collected in `good-bad-cd-practices.csv`.

[^3]: https://www.oreilly.com/library/view/continuous-delivery-reliable/9780321670250/