diff --git a/README.md b/README.md
index 6c46469..496371e 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,11 @@
-# Spring Boot Starter (template)
+# Data processing
-A minimal working starter template for a Spring Boot non-web applications using the
-Spring [CommandLineRunner](https://docs.spring.io/spring-boot/docs/current/api/org/springframework/boot/CommandLineRunner.html)
-interface with a demonstration of Java
-annotation-based [Inversion of Control](https://stackoverflow.com/questions/3058/what-is-inversion-of-control)
-via [Dependency Injection](https://stackoverflow.com/questions/130794/what-is-dependency-injection).
+A small Java application that listens for data ready for registration and performs several
+pre-registration checks before moving the dataset to an openBIS ETL routine.
+
+> [!NOTE]
+> Requires Java SE 17 or newer.
## Run the app
@@ -14,174 +15,248 @@ Checkout the latest code from `main` and run the Maven goal `spring-boot:run`:
mvn spring-boot:run
```
-## What the app does
-
-This small app just parses a file with a collection of good coding prayers and creates a singleton
-instance of an `CodingPrayersMessageService`. This concrete implementation uses the
-interface `MessageService`, that comes with only one public method: `String collectMessage()`.
-
-This service is used to demonstrate the IoC principle. We have defined another interface `NewsMedia`
-and provide a concrete implementation `DeveloperNews`, that will call the message service to receive
-recent news and forward them to the caller.
+You have to set several environment variables first to configure the individual process steps.
+Have a look at the [configuration](#configuration) section to learn more.
-In the main app code, we just retrieve this Singleton instance or Bean in Spring lingua from the
-loaded context and call the news media `getNEws()` method. The collected message is then printed out
-to the command line interface:
+## What the app does
-```
+The following figure gives an overview of the process building blocks and flow:
- . ____ _ __ _ _
- /\\ / ___'_ __ _ _(_)_ __ __ _ \ \ \ \
-( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
- \\/ ___)| |_)| | | | | || (_| | ) ) ) )
- ' |____| .__|_| |_|_| |_\__, | / / / /
- =========|_|==============|___/=/_/_/_/
- :: Spring Boot :: (v2.5.6)
-
-2021-11-18 09:17:37.640 INFO 68052 --- [ main] l.q.s.SpringMinimalTemplateApplication : Starting SpringMinimalTemplateApplication using Java 17.0.1 on imperator.am10.uni-tuebingen.de with PID 68052 (/Users/sven1103/git/spring-boot-starter-template/target/classes started by sven1103 in /Users/sven1103/git/spring-boot-starter-template)
-2021-11-18 09:17:37.641 INFO 68052 --- [ main] l.q.s.SpringMinimalTemplateApplication : No active profile set, falling back to default profiles: default
-2021-11-18 09:17:38.164 INFO 68052 --- [ main] l.q.s.SpringMinimalTemplateApplication : Started SpringMinimalTemplateApplication in 0.808 seconds (JVM running for 1.489)
-####################### Message of the day ##################
-Have you written unit tests yet? If not, do it!
-##############################################################
+
-```
+The **basic process flow** is best described as:
-## Realisation of IoC and DPI
+1. Scanning step
+2. Registration step (preparation)
+3. 1 to N processing steps
-The messages collection is stored in a simple text file `messages.txt`, that is provided with the
-apps `resources`. Just go ahead and change the content of the file and run the app!
+The last processing step usually hands the dataset over to the actual registration system; in our
+case, these are several openBIS ETL dropboxes. In the current implementation, a marker file is
+created after successful transfer into the target folder: `.MARKER_is_finished_[task ID]`
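
A minimal sketch of how such a marker file could be written. The class and method names are hypothetical, and placing the marker directly inside the target folder is an assumption of this sketch, not something the application's code confirms.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch, not the application's code: signals a finished transfer. */
final class MarkerFileSketch {

  private MarkerFileSketch() {}

  /**
   * Creates an empty file named `.MARKER_is_finished_[task ID]` in the target folder.
   * Assumption: the marker lives directly in the target folder.
   */
  static Path createMarker(Path targetDir, String taskId) {
    Path marker = targetDir.resolve(".MARKER_is_finished_" + taskId);
    try {
      // createFile fails if the marker already exists, so a task is never marked twice
      return Files.createFile(marker);
    } catch (IOException e) {
      throw new UncheckedIOException("Could not create marker file " + marker, e);
    }
  }
}
```

Calling `MarkerFileSketch.createMarker(targetDir, taskId)` only after the move has completed keeps the marker a reliable "transfer finished" signal for the downstream routine.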
-
+The current implementation consists of 4 steps: _scanning, registration, processing, evaluation_.
+They are described in the following subsections.
+### Scanning
-So how does the app know where to find this file and **load the messages content**?
+In this step, the application scans a [pre-defined path](#scanner-step-config) and looks for
+existing registration folders.
+If a registration folder is present, it is recorded and will be investigated. All other files in a
+user's directory will be ignored.
-We have configured it as a **external property** in a file `application.properties` and load the
-configuration on application startup! Cool eh?
+> [!NOTE]
+> It is important that the move operation of any dataset into the registration folder is **atomic**!
+> Otherwise, data corruption can occur. Ideally, the dataset is staged into the user's home folder
+> first (e.g. via a copy operation or an upload via SFTP or SSH) and then **moved** into the registration
+> folder.
+>
+> Move operations within the same file system are basically a rename of the file path and
+> therefore atomic.
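
For clients that stage data programmatically, the move could be sketched as follows. The helper class is hypothetical and not part of this application. It relies on `StandardCopyOption.ATOMIC_MOVE`, which makes the JDK fail instead of falling back to a non-atomic copy when the move cannot be done atomically (for example across file systems).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper, not part of the application: moves a staged dataset atomically. */
final class AtomicDatasetMove {

  private AtomicDatasetMove() {}

  /**
   * Moves the staged dataset folder into the registration folder, keeping its name.
   * With ATOMIC_MOVE the JDK throws AtomicMoveNotSupportedException (an IOException)
   * rather than copying file by file when an atomic move is not possible.
   */
  static Path moveIntoRegistration(Path stagedDataset, Path registrationDir) {
    Path target = registrationDir.resolve(stagedDataset.getFileName());
    try {
      return Files.move(stagedDataset, target, StandardCopyOption.ATOMIC_MOVE);
    } catch (IOException e) {
      throw new UncheckedIOException("Atomic move failed for " + stagedDataset, e);
    }
  }
}
```

Failing loudly is the point of this design: a silent fallback to copying would let the scanner pick up a half-written dataset.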
-
+Within a user's registration directory, the application expects a registration task to be bundled in one
+folder, e.g.:
-This is how the file content looks like:
-
-```
-messages.file=messages.txt
+```bash
+|- myuser/registration // registration folder for user `myuser`
+ |- my-registration-batch // folder name is irrelevant
+ |- file1_1.fastq.gz
+ |- file1_2.fastq.gz
+ |- file2_1.fastq.gz
+ |- file2_2.fastq.gz
+ |- metadata.txt // mandatory!
```
-So how do we access the value of the `messages.file` property in our application with Spring?
-
-Have a look in the class `AppConfig`, there the magic happens:
+The folder `my-registration-batch` represents an atomic registration unit and must contain a `metadata.txt` file with information
+about the measurement ID and the files belonging to this measurement dataset.
-```groovy
-@Configuration
-@PropertySource("application.properties")
-class AppConfig {
+Following the previous example, the content of a matching `metadata.txt` would look like this:
- @Value('${messages.file}')
- public String messagesFile
+```bash
+NGSQTEST001AE-1234512312 file1_1.fastq.gz
+NGSQTEST001AE-1234512312 file1_2.fastq.gz
+NGSQTEST002BC-3321314441 file2_1.fastq.gz
+NGSQTEST002BC-3321314441 file2_2.fastq.gz
```
+Make sure that the columns are `TAB`-separated (`\t`)!
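
To illustrate the expected format, here is a minimal sketch that groups the file names by measurement ID. The class is hypothetical and does not reproduce the application's own validation. Skipping blank lines and rejecting any line that does not have exactly two columns are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not the application's parser: groups file names by measurement ID. */
final class MetadataSketch {

  private MetadataSketch() {}

  /** Parses TAB-separated lines with a measurement ID in column 1 and a file name in column 2. */
  static Map<String, List<String>> filesByMeasurementId(List<String> lines) {
    Map<String, List<String>> result = new LinkedHashMap<>();
    for (String line : lines) {
      if (line.isBlank()) {
        continue; // tolerate empty lines (an assumption of this sketch)
      }
      String[] columns = line.split("\t");
      if (columns.length != 2 || columns[0].isBlank() || columns[1].isBlank()) {
        throw new IllegalArgumentException("Expected two TAB-separated columns: '" + line + "'");
      }
      // Collect all files that share the same measurement ID, in file order
      result.computeIfAbsent(columns[0].trim(), id -> new ArrayList<>()).add(columns[1].trim());
    }
    return result;
  }
}
```

Applied to the example above, this yields two entries with two files each. A line that uses spaces instead of a `TAB` is rejected with an `IllegalArgumentException`.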
-We define the property source, which is the file `application.properties` that is provided in the
-resource folder of the app and available to the classpath. We also tell with the
-annotation `@Configuration` Spring, hey this is a class that holds app configuration data!
+Once a new registration unit is detected, it gets queued for registration and the next step will take over.
-With the annotation `@Value('${messages.file}')` we tell Spring, which property's value should be
-injected. Here we make use of field injection, other types of injection like method and constructor
-injection are also possible.
+A registration request is submitted to the registration queue only once and is subsequently
+ignored by the scanning process, as long as neither the folder name nor its modification timestamp changes.
-So how is the concrete implementation of the `MessageService` presented to Spring? We can use
-the `@Bean` annotation here, to tell Spring: _hey, this is sth you must load on startup and provide
-to the context_.
+If the application quits or stops unexpectedly, existing registration units are detected again and
+resubmitted on restart.
-```java
-@Configuration
-@PropertySource("application.properties")
-class AppConfig {
+### Registration
- ....
+This process step prepares the dataset registration for the subsequent pre-registration tasks. It
+guarantees a unified structure and processing model that other steps can build on and act on
+accordingly (e.g. harmonised error handling).
- @Bean
- MessageService messageService() {
- return new CodingPrayersMessageService(messagesFile)
- }
-```
-
-That is all there is, you can now load the bean in your main application code:
+Its configuration parameters can be set via environment variables, see
+the [registration step config](#registration-step-config) section to learn more.
-```java
-@SpringBootApplication
-class SpringMinimalTemplateApplication {
+In the current implementation, the registration step does several things:
- static void main(String[] args) {
- SpringApplication.run(SpringMinimalTemplateApplication, args)
- // load the annotation context
- AnnotationConfigApplicationContext context = new AnnotationConfigApplicationContext(AppConfig.class)
- // get the service bean
- MessageService service = context.getBean("messageService", MessageService.class)
- // collect the message and praise the magic
- println service.collectMessage()
+1. Validate the registration metadata file
+2. Aggregate all measurement files per measurement ID
+3. Assign every task (measurement) a unique task ID
+4. Provide provenance information
-```
+The task ID is a randomly generated version 4 UUID, which ensures that datasets with the same name
+do not get overwritten during processing.
-### Inversion of Control
+The provenance information is written as a separate JSON file into the task directory, next to
+the dataset.
-
+The final task directory structure then looks like this (the task directory name is an example):
-You might have already spotted the interface `NewsMedia` and its implementing class `DeveloperNews`
-in the app's source code. Here you can see an example for the magic of inversion of control.
-
-The `NewsMedia` interface is just an abstraction that we will later use, because we don't care about
-the actual implementation details. By this, we also do not create any dependencies to concrete
-implementation details but on actual behaviour. Concrete implementations can then later be exchanged
-without causing any breaking changes in the client code base.
-
-The interface has only one method: `String getNews()`. Now let's have a closer look into the
-class `DeveloperNews` that implements this interface:
-
-```java
-class DeveloperNews implements NewsMedia{
-
- private MessageService service
+```bash
+ |- 74c5d26f-b756-42c3-b6f4-2b4825670a2d
+ |- file1_1.fastq.gz
+ |- file1_2.fastq.gz
+ |- provenance.json
+```
- DeveloperNews(MessageService service) {
- this.service = service
- }
+Here is an example of the provenance file:
- @Override
- String getNews() {
- return service.collectMessage()
- }
+```json
+{
+ "origin": "/Users/myuser/registration",
+ "user": "/Users/myuser",
+ "measurementId": "QTEST001AE-1234512312",
+ "history": [
+ "/opt/scanner-app/scanner-processing-dir/74c5d26f-b756-42c3-b6f4-2b4825670a2d"
+ ]
}
```
-When you check the constructor signature, you see that this method has only one argument, which is a
-reference to an object of type `MessageService`. And when the `getNews()` method is called by the
-client, the class delegates this request to the message service. Since we have stored the reference
-in a private field, that is super easy, we known how to call the service.
+> [!NOTE]
+> The following properties can be expected after all process steps have been executed:
+>
+> `origin`: from which path the dataset has been detected during scanning
+>
+> `user`: from which user directory the dataset has been picked up
+>
+> `measurementId`: any valid QBiC measurement ID that has been found in the dataset (this might
+> be `null` in case the evaluation has not been done yet).
+>
+> `history`: a list of history items recording which steps have been performed. The list is ordered
+> chronologically, with the earliest processing step first and the latest last.
+
+### Processing
+
+In the current implementation, this process step only performs some simple checks. It can be extended, e.g. to
+perform checksum validation. Feel free to use it as a template for subsequent process steps.
+
+### Evaluation
+
+Last but not least, this step looks for any present QBiC measurement ID in the dataset name. If none
+is given, the registration cannot be executed.
+
+In this case, the process moves the task directory into the user's home error folder. After the user
+has provided a valid QBiC measurement ID, they can move the dataset into registration again.
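
The lookup can be sketched with the expression from the `evaluation.measurement-id.pattern` property shown in the evaluation step config. The class is hypothetical. Using `find()` on the dataset name is an assumption of this sketch, since how the application applies the pattern is not documented here.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch, not the application's evaluation step. */
final class MeasurementIdSketch {

  // Same expression as configured for `evaluation.measurement-id.pattern`
  private static final Pattern MEASUREMENT_ID =
      Pattern.compile("^(MS|NGS)Q[A-Z0-9]{4}[0-9]{3}[A-Z0-9]{2}-[0-9]*");

  private MeasurementIdSketch() {}

  /** Returns the measurement ID at the start of the dataset name, if there is one. */
  static Optional<String> findMeasurementId(String datasetName) {
    Matcher matcher = MEASUREMENT_ID.matcher(datasetName);
    // The pattern is anchored with ^, so find() only succeeds at the beginning of the name
    return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
  }
}
```

An empty `Optional` corresponds to the case described above, in which the task directory is moved to the user's error folder.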
+
+## Configuration
+
+### Global settings
+
+```properties
+#------------------------
+# Global settings
+#------------------------
+# Directory name that will be used for the manual intervention directory
+# Created in the users' home folders
+# e.g. /home//error
+users.error.directory.name=error
+# Directory name that will be used for detecting dropped datasets
+# Needs to be present in the users' home folders
+# e.g. /home//registration
+users.registration.directory.name=registration
+```
-So why is this inversion of control?
+Configure the names of the two application directories for error handling and registration.
+
+> [!NOTE]
+> The `registration` folder needs to be present. The application does not create it automatically,
+> to prevent accidental dataset overwrite.
+
+### Scanner step config
+
+```properties
+#--------------------------------------
+# Settings for the data scanning thread
+#--------------------------------------
+# Path to the directory that contains all user directories
+# e.g. /home in Linux or /Users in macOS
+scanner.directory=${SCANNER_DIR:/home}
+# The time interval (milliseconds) the scanner thread iterates through the scanner directory
+# Value must be an integer > 0
+scanner.interval=1000
+```
-Because the `DeveloperNews` class does not manage the instantiation of a concrete message service.
-The configuration happened outside of the class, therefore the DeveloperNews class has no direct
-control over the instantiation. If it had, it would look like this:
+Sets the application's top-level scanning directory. Every folder in it is treated as a separate
+user directory.
-```java
-DeveloperNews(String filePathToMessages) {
- this.service = new CodingPrayersMessageService(filePathToMessages)
-}
-```
+The scanner interval is set to 1 second by default and is not yet meant to be configured via
+environment variables (if required, override it with command line arguments).
-That doesn't look good, does it? In order to create an instance of a message service, we would need
-to know the conrete implementation and its required properties (here it is the file path to
-the `messages.txt`). So the `DeveloperNews` class has the control over the message service.
+### Registration step config
-Instead, we would like to not take care about these details, so we invert the control and inject the
-dependency via the constructor.
+Sets the number of threads per process, its working directory and the target directory to which
+finished tasks are moved after successful operation.
-Please find more in depth documentation on the
-official [Spring website](https://spring.io/projects/spring-framework).
+```properties
+#----------------
+# Settings for the registration worker threads
+#----------------
+registration.threads=2
+registration.working.dir=${WORKING_DIR:}
+registration.target.dir=${PROCESSING_DIR:}
+```
+### Processing step config
+Sets the number of threads per process, its working directory and the target directory to which
+finished tasks are moved after successful operation.
+```properties
+#------------------------------------
+# Settings for the 1. processing step
+# Proper packaging and provenance data, some simple checks
+#------------------------------------
+processing.threads=2
+processing.working.dir=${PROCESSING_DIR}
+processing.target.dir=${EVALUATION_DIR}
+```
+### Evaluation step config
+
+Sets the number of threads per process, its working directory and the target directories to which
+finished tasks are moved after successful operation.
+
+```properties
+#----------------------------------
+# Setting for the 2. processing step:
+# Measurement ID evaluation
+# ---------------------------------
+evaluations.threads=2
+evaluation.working.dir=${EVALUATION_DIR}
+# Define one or more target directories here
+# Example single target dir:
+# evaluation.target.dirs=/my/example/target/dir
+# Example multiple target dir:
+# evaluation.target.dirs=/my/example/target/dir1,/my/example/target/dir2,/my/example/target/dir3
+evaluation.target.dirs=${OPENBIS_ETL_DIRS}
+evaluation.measurement-id.pattern=^(MS|NGS)Q[A-Z0-9]{4}[0-9]{3}[A-Z0-9]{2}-[0-9]*
+```
+> [!NOTE]
+> You can define multiple target directories for this process! You just have to provide a `,`-separated list
+> of target directory paths. The implementation will assign the target directories based on a round-robin draw.
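
A round-robin assignment over the configured directories could look like the sketch below. The class is hypothetical and not the application's implementation. It assumes the raw, comma-separated value of `evaluation.target.dirs` as input.

```java
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of round-robin target selection, not the application's code. */
final class RoundRobinTargets {

  private final List<Path> targets;
  private final AtomicInteger counter = new AtomicInteger();

  /** Accepts the comma-separated value of `evaluation.target.dirs`. */
  RoundRobinTargets(String commaSeparatedDirs) {
    this.targets = Arrays.stream(commaSeparatedDirs.split(","))
        .map(String::trim)
        .filter(dir -> !dir.isEmpty())
        .map(Path::of)
        .toList();
    if (targets.isEmpty()) {
      throw new IllegalArgumentException("At least one target directory is required");
    }
  }

  /** Returns the next target directory; safe to call from several worker threads. */
  Path next() {
    // floorMod keeps the index valid even after the counter overflows to negative values
    int index = Math.floorMod(counter.getAndIncrement(), targets.size());
    return targets.get(index);
  }
}
```

With three configured directories, successive calls to `next()` cycle through `dir1`, `dir2`, `dir3` and then start over with `dir1`.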
diff --git a/img/process-flow.jpg b/img/process-flow.jpg
new file mode 100644
index 0000000..6bb527d
Binary files /dev/null and b/img/process-flow.jpg differ
diff --git a/src/main/java/life/qbic/data/processing/AppConfig.java b/src/main/java/life/qbic/data/processing/AppConfig.java
index fffe848..c9f6b32 100644
--- a/src/main/java/life/qbic/data/processing/AppConfig.java
+++ b/src/main/java/life/qbic/data/processing/AppConfig.java
@@ -85,8 +85,9 @@ ProcessingConfiguration processingConfiguration(ProcessingWorkersConfig processi
@Bean
GlobalConfig globalConfig(
- @Value("${users.error.directory.name}") String usersErrorDirectoryName) {
- return new GlobalConfig(usersErrorDirectoryName);
+ @Value("${users.error.directory.name}") String usersErrorDirectoryName,
+ @Value("${users.registration.directory.name}") String usersRegistrationDirectoryName) {
+ return new GlobalConfig(usersErrorDirectoryName, usersRegistrationDirectoryName);
}
}
diff --git a/src/main/java/life/qbic/data/processing/GlobalConfig.java b/src/main/java/life/qbic/data/processing/GlobalConfig.java
index 0f5610f..9bf90ea 100644
--- a/src/main/java/life/qbic/data/processing/GlobalConfig.java
+++ b/src/main/java/life/qbic/data/processing/GlobalConfig.java
@@ -7,15 +7,25 @@ public class GlobalConfig {
private final Path usersErrorDirectoryName;
- public GlobalConfig(String usersErrorDirectoryName) {
+ private final Path usersDirectoryRegistrationName;
+
+ public GlobalConfig(String usersErrorDirectoryName, String usersRegistrationDirectoryName) {
if (usersErrorDirectoryName == null || usersErrorDirectoryName.isBlank()) {
throw new IllegalArgumentException("usersErrorDirectoryName cannot be null or empty");
}
+ if (usersRegistrationDirectoryName == null || usersRegistrationDirectoryName.isBlank()) {
+ throw new IllegalArgumentException("usersRegistrationDirectoryName cannot be null or empty");
+ }
this.usersErrorDirectoryName = Paths.get(usersErrorDirectoryName);
+ this.usersDirectoryRegistrationName = Paths.get(usersRegistrationDirectoryName);
}
public Path usersErrorDirectory() {
return this.usersErrorDirectoryName;
}
+ public Path usersDirectoryRegistration() {
+ return this.usersDirectoryRegistrationName;
+ }
+
}
diff --git a/src/main/resources/application.properties b/src/main/resources/application.properties
index f782bf6..29d5e52 100644
--- a/src/main/resources/application.properties
+++ b/src/main/resources/application.properties
@@ -9,6 +9,10 @@
# Created in the users' home folders
# e.g. /home//error
users.error.directory.name=error
+# Directory name that will be used for detecting dropped datasets
+# Needs to be present in the users' home folders
+# e.g. /home//registration
+users.registration.directory.name=registration
#--------------------------------------
# Settings for the data scanning thread