7. Clustering of Submissions

Clustering

By default, JPlag clusters the submissions. The clustering partitions the set of submissions into groups of similar submissions. The resulting clusters can be used as candidates for potentially colluding groups. Each cluster has a strength score that measures how suspicious the cluster is compared to other clusters.

Disabling Clustering

Clustering can take a long time when there is a large number of submissions. Users who are not interested in the clustering can safely disable it:

* Using the CLI: With the `--cluster-skip` option

* Programmatically:
  ```java
  JPlagOptions options = new JPlagOptions("/path/to/rootDir", LanguageOption.JAVA);
  options.setClusteringOptions(new ClusteringOptions.Builder().enabled(false).build());
  
  JPlag jplag = new JPlag(options);
  ```

Clustering Configuration

Clustering can be configured either through the CLI options or programmatically through the ClusteringOptions class. Both approaches work analogously and share the same default values.

The clustering is designed to work out of the box for roughly 50 to 500 submissions, but it can be tuned if problems occur. For larger sets of submissions, it might be necessary to increase Max-Runs or Bandwidth so that an appropriate number of clusters can be determined.

TODO Table

Clustering Architecture

All clustering-related classes are contained in the de.jplag.clustering(.*) packages in the core project.

The central idea behind the structure of the clustering is ease of use: calling code should only ever interact with the ClusteringOptions, ClusteringFactory, and ClusteringResult classes:

classDiagram
    ClusteringFactory <.. CallingCode
    ClusteringOptions <.. CallingCode : creates
    ClusteringAdapter <.. ClusteringFactory
    ClusteringAlgorithm <.. ClusteringAdapter : runs
    ClusteringAlgorithm <.. ClusteringFactory : creates instances
    ClusteringPreprocessor <.. ClusteringFactory : creates instances
    PreprocessedClusteringAlgorithm <.. ClusteringFactory : creates
    ClusteringOptions <-- ClusteringFactory
    ClusteringAlgorithm <|-- PreprocessedClusteringAlgorithm
    ClusteringAlgorithm <-- PreprocessedClusteringAlgorithm : delegates to
    ClusteringPreprocessor ..o PreprocessedClusteringAlgorithm
    class ClusteringFactory{
        getClusterings(List~JPlagComparison~ comparisons, ClusteringOptions options)$ ClusteringResult~Submission~
    }
    class ClusteringOptions{
    }
    class ClusteringAlgorithm {
        <<interface>>
        cluster(Matrix similarities) ClusteringResult~Integer~
    }
    class ClusteringPreprocessor {
        <<interface>>
        preprocess(Matrix similarities) Matrix
    }
    class ClusteringAdapter{
        ClusteringAdapter(List~JPlagComparison~ comparisons)
        doClustering(ClusteringAlgorithm algorithm) ClusteringResult~Submission~
    }
    class PreprocessedClusteringAlgorithm{
        cluster(Matrix similarities) ClusteringResult~Integer~
    }
    class CallingCode{

    }

New clustering algorithms and preprocessors can be implemented using the GenericClusteringAlgorithm and ClusteringPreprocessor interfaces, which operate on similarity matrices only. ClusteringAdapter handles the conversion between de.jplag classes and matrices. PreprocessedClusteringAlgorithm wraps another ClusteringAlgorithm and applies a preprocessor to the similarity matrix before delegating to it.
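
The decorator structure described above can be sketched as follows. This is a minimal illustration, not JPlag's code: `SketchPreprocessor`, `SketchAlgorithm`, `ThresholdPreprocessor`, `ConnectedComponents`, and `PreprocessedSketchAlgorithm` are hypothetical stand-ins, a plain `double[][]` replaces the matrix type, and the result is a list of index lists instead of a `ClusteringResult`. The sketch also assumes that the preprocessor keeps the matrix dimensions, so no index mapping is needed afterwards.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for a preprocessor interface: matrix in, matrix out.
interface SketchPreprocessor {
    double[][] preprocess(double[][] similarities);
}

// Hypothetical stand-in for an algorithm interface: matrix in, clusters of indices out.
interface SketchAlgorithm {
    List<List<Integer>> cluster(double[][] similarities);
}

// Example preprocessor: sets every similarity below the threshold to zero.
final class ThresholdPreprocessor implements SketchPreprocessor {
    private final double threshold;

    ThresholdPreprocessor(double threshold) {
        this.threshold = threshold;
    }

    @Override
    public double[][] preprocess(double[][] similarities) {
        int n = similarities.length;
        double[][] result = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                result[i][j] = similarities[i][j] >= threshold ? similarities[i][j] : 0.0;
            }
        }
        return result;
    }
}

// Toy algorithm: submissions linked by a non-zero similarity end up in the same cluster.
final class ConnectedComponents implements SketchAlgorithm {
    @Override
    public List<List<Integer>> cluster(double[][] similarities) {
        int n = similarities.length;
        boolean[] visited = new boolean[n];
        List<List<Integer>> clusters = new ArrayList<>();
        for (int start = 0; start < n; start++) {
            if (visited[start]) {
                continue;
            }
            List<Integer> members = new ArrayList<>();
            Deque<Integer> stack = new ArrayDeque<>();
            stack.push(start);
            visited[start] = true;
            while (!stack.isEmpty()) {
                int current = stack.pop();
                members.add(current);
                for (int next = 0; next < n; next++) {
                    if (!visited[next] && next != current && similarities[current][next] > 0.0) {
                        visited[next] = true;
                        stack.push(next);
                    }
                }
            }
            Collections.sort(members);
            clusters.add(members);
        }
        return clusters;
    }
}

// Decorator: runs the preprocessor first, then delegates to the wrapped algorithm.
final class PreprocessedSketchAlgorithm implements SketchAlgorithm {
    private final SketchPreprocessor preprocessor;
    private final SketchAlgorithm delegate;

    PreprocessedSketchAlgorithm(SketchPreprocessor preprocessor, SketchAlgorithm delegate) {
        this.preprocessor = preprocessor;
        this.delegate = delegate;
    }

    @Override
    public List<List<Integer>> cluster(double[][] similarities) {
        return delegate.cluster(preprocessor.preprocess(similarities));
    }
}

final class ClusteringSketchDemo {
    public static void main(String[] args) {
        double[][] similarities = {
            {1.0, 0.9, 0.1, 0.1},
            {0.9, 1.0, 0.1, 0.1},
            {0.1, 0.1, 1.0, 0.8},
            {0.1, 0.1, 0.8, 1.0},
        };
        SketchAlgorithm algorithm =
            new PreprocessedSketchAlgorithm(new ThresholdPreprocessor(0.5), new ConnectedComponents());
        System.out.println(algorithm.cluster(similarities));
    }
}
```

Because the decorator itself implements the algorithm interface, calling code cannot tell whether preprocessing is involved, which is what lets a factory combine algorithms and preprocessors freely.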

Remarks on Spectral Clustering

* Based on "On Spectral Clustering: Analysis and an algorithm" (Ng, Jordan & Weiss, 2001)

* Automatic hyper-parameter search using Bayesian Optimization with a Gaussian Process as the surrogate model and L-BFGS for optimization on the surrogate

* The L-BFGS implementation is a pit of technical debt, see here.
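
For orientation, the algorithm in the cited paper normalizes an affinity matrix A as L = D^(-1/2) A D^(-1/2), where D is the diagonal matrix of row sums and the diagonal of A is zero, before taking the leading eigenvectors of L and running k-means on their normalized rows. The following sketch shows only that normalization step. It is an illustration of the paper's formula, not JPlag's implementation: `SpectralNormalizationSketch` and `normalize` are hypothetical names, and how JPlag derives the affinity matrix from similarities is not shown.

```java
final class SpectralNormalizationSketch {

    // Computes L = D^(-1/2) * A * D^(-1/2), treating the diagonal of A as zero.
    static double[][] normalize(double[][] affinity) {
        int n = affinity.length;

        // Degree of node i is the sum of its off-diagonal affinities.
        double[] inverseSqrtDegree = new double[n];
        for (int i = 0; i < n; i++) {
            double degree = 0.0;
            for (int j = 0; j < n; j++) {
                if (i != j) {
                    degree += affinity[i][j];
                }
            }
            // Isolated nodes get a factor of zero to avoid division by zero.
            inverseSqrtDegree[i] = degree > 0.0 ? 1.0 / Math.sqrt(degree) : 0.0;
        }

        double[][] normalized = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i != j) {
                    normalized[i][j] = affinity[i][j] * inverseSqrtDegree[i] * inverseSqrtDegree[j];
                }
            }
        }
        return normalized;
    }

    public static void main(String[] args) {
        // Three nodes in a chain: 0 - 1 - 2.
        double[][] affinity = {{0, 1, 0}, {1, 0, 1}, {0, 1, 0}};
        for (double[] row : normalize(affinity)) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```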

Integration Tests

There are integration tests for the Spectral Clustering that verify that, at least for two known sets of similarities, the groups known to be colluders are found. However, these datasets are considered sensitive data. They are not available to the public, so these tests can only be run by maintainers with access.

To run these tests, the contents of the PseudonymizedReports repository must be added to the folder jplag/src/test/resources/de/jplag/PseudonymizedReports.
