Refactored data clumps with the help of LLMs (research project) #9352

Open
wants to merge 2 commits into
base: master

@compf compf commented Jun 5, 2024

Hello maintainers,

I am conducting a master's thesis project focused on enhancing code quality through automated refactoring of data clumps, assisted by Large Language Models (LLMs).

Data clump definition

A data clump exists if

  1. two methods (in the same or in different classes) have at least three parameters in common and neither method overrides the other, or
  2. at least three fields of a class match the parameters of a method (in the same or in a different class), or
  3. two different classes have at least three fields in common.

See also the following UML diagram as an example:
[Figure: Example data clump]
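
As a minimal sketch of case 1 of the definition above, the following Java snippet shows two methods that share three parameters and one possible refactoring into an extracted class. All names (`DataClumpDemo`, `Endpoint`, `connect`, `ping`) are hypothetical and chosen only for illustration; they are not taken from the Jenkins code base or from this pull request.

```java
import java.util.Objects;

// Hypothetical demo; none of these names come from the Jenkins code base.
class DataClumpDemo {

    // Before: two unrelated methods repeat the same three parameters
    // (host, port, timeoutMillis), which is case 1 of the definition.
    static String connectBefore(String host, int port, int timeoutMillis) {
        return "connect " + host + ":" + port + " (" + timeoutMillis + " ms)";
    }

    static String pingBefore(String host, int port, int timeoutMillis) {
        return "ping " + host + ":" + port + " (" + timeoutMillis + " ms)";
    }

    // After: the three values travel together as one Endpoint object.
    static String connect(Endpoint endpoint) {
        return "connect " + endpoint.describe();
    }

    static String ping(Endpoint endpoint) {
        return "ping " + endpoint.describe();
    }

    public static void main(String[] args) {
        Endpoint endpoint = new Endpoint("localhost", 8080, 500);
        System.out.println(connect(endpoint));
        System.out.println(ping(endpoint));
    }
}

// Extracted class that replaces the repeated parameter group.
final class Endpoint {
    private final String host;
    private final int port;
    private final int timeoutMillis;

    Endpoint(String host, int port, int timeoutMillis) {
        this.host = Objects.requireNonNull(host, "host");
        this.port = port;
        this.timeoutMillis = timeoutMillis;
    }

    String describe() {
        return host + ":" + port + " (" + timeoutMillis + " ms)";
    }
}
```

Extracting a class is only one way to resolve such a clump; when the affected classes share a superclass, moving the fields into that superclass can be simpler.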

I believe these refactorings can contribute to the project by reducing complexity and enhancing the readability of your source code.

Pursuant to the EU AI Act, I fully disclose the use of LLMs in generating these refactorings, emphasizing that all changes have undergone human review for quality assurance.

Even if you decide not to integrate my changes into your codebase (which is perfectly fine), I ask you to fill out a feedback survey, which will be scientifically evaluated to determine the acceptance of AI-supported refactorings. You can find the feedback survey at https://campus.lamapoll.de/Data-clump-refactoring/en

Thank you for considering my contribution. I look forward to your feedback. If you have any other questions or comments, feel free to write a comment or email me at [email protected].

Best regards,
Timo Schoemaker
Department of Computer Science
University of Osnabrück

Proposed changelog entries

refactored data clumps

Proposed upgrade guidelines

N/A

Submitter checklist

Desired reviewers

Before the changes are marked as ready-for-merge:

Maintainer checklist

welcome bot commented Jun 5, 2024

Yay, your first pull request towards Jenkins core was created successfully! Thank you so much!

A contributor will provide feedback soon. Meanwhile, you can join the chats and community forums to connect with other Jenkins users, developers, and maintainers.

Contributor

@mawinter69 mawinter69 left a comment

What all classes where you added the ProcessProperties have in common is that they inherit from UnixProcess. So instead of adding a new class just for the properties, wouldn't it be better to define the fields in UnixProcess?

private int ppid = -1;
private EnvVars envVars;
private List<String> arguments;
private ProcessProperties properties = new ProcessProperties(-1, null, null);
Contributor

Just wondering: in all other places the ProcessProperties field is declared transient, so why not here?

Author

That seems to be an oversight on my part. SpotBugs complained that I should add transient everywhere, and once it stopped complaining I didn't look any further. Strange :)

Member

It's unclear to me why the fields are being made transient. If it's SE_BAD_FIELD, wouldn't making ProcessProperties Serializable address this without potentially causing serialization trouble?

(FWIW removing the transient doesn't fail SpotBugs for me locally.)
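
To illustrate the trade-off raised in this comment, here is a minimal sketch, assuming the enclosing process class is written with standard Java serialization. `Props`, `TransientHolder`, and `SerializableHolder` are hypothetical stand-ins and not the actual Jenkins classes. A transient field is skipped when the object is written and comes back as null, whereas a field whose type implements Serializable survives the round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo; these classes only imitate the shape of the PR's types.
class SerializationDemo {

    // Serializes a value to memory and reads it back.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        TransientHolder t = roundTrip(new TransientHolder());
        SerializableHolder s = roundTrip(new SerializableHolder());
        // The transient field is not restored; field initializers do not run
        // again when a Serializable object is deserialized.
        System.out.println("transient properties: " + t.properties);
        System.out.println("serializable ppid: " + s.properties.ppid);
    }
}

// Stand-in for ProcessProperties, made Serializable as the comment suggests.
class Props implements Serializable {
    private static final long serialVersionUID = 1L;
    final int ppid;
    final List<String> arguments;

    Props(int ppid, List<String> arguments) {
        this.ppid = ppid;
        this.arguments = new ArrayList<>(arguments);
    }
}

// Variant from the PR: the field is excluded from serialization.
class TransientHolder implements Serializable {
    private static final long serialVersionUID = 1L;
    transient Props properties = new Props(42, Arrays.asList("sleep", "60"));
}

// Suggested variant: no transient, the field type is Serializable.
class SerializableHolder implements Serializable {
    private static final long serialVersionUID = 1L;
    Props properties = new Props(42, Arrays.asList("sleep", "60"));
}
```

Whether the real field types can be made serializable this way depends on their own members, which this sketch does not model.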

Author

compf commented Jun 5, 2024

What all classes where you added the ProcessProperties have in common is that they inherit from UnixProcess. So instead of adding a new class just for the properties, wouldn't it be better to define the fields in UnixProcess?

Thank you for the feedback. In this particular case, that might be a better solution. The LLM chooses the approach that always works, but I agree that pulling up those fields is also a way to resolve data clumps :)

Member

@daniel-beck daniel-beck left a comment

Overall, #9352 (review) seems preferable to a new type. The following should do it; all we'd lose is the finality of ppid, which is no different from this proposal.

diff --git a/core/src/main/java/hudson/util/ProcessTree.java b/core/src/main/java/hudson/util/ProcessTree.java
index 8fbb80c8a8..80155d3d37 100644
--- a/core/src/main/java/hudson/util/ProcessTree.java
+++ b/core/src/main/java/hudson/util/ProcessTree.java
@@ -796,6 +796,10 @@ public abstract class ProcessTree implements Iterable<OSProcess>, IProcessTree,
      * A process.
      */
     public abstract class UnixProcess extends OSProcess {
+        protected int ppid = -1;
+        protected EnvVars envVars;
+        protected List<String> arguments;
+
         protected UnixProcess(int pid) {
             super(pid);
         }
@@ -877,9 +881,6 @@ public abstract class ProcessTree implements Iterable<OSProcess>, IProcessTree,
         }
 
         class LinuxProcess extends UnixProcess {
-            private int ppid = -1;
-            private EnvVars envVars;
-            private List<String> arguments;
 
             LinuxProcess(int pid) throws IOException {
                 super(pid);
@@ -1001,13 +1002,9 @@ public abstract class ProcessTree implements Iterable<OSProcess>, IProcessTree,
              */
             private final boolean b64;
 
-            private final int ppid;
-
             private final long pr_envp;
             private final long pr_argp;
             private final int argc;
-            private EnvVars envVars;
-            private List<String> arguments;
 
             private AIXProcess(int pid) throws IOException {
                 super(pid);
@@ -1327,7 +1324,6 @@ public abstract class ProcessTree implements Iterable<OSProcess>, IProcessTree,
              */
             private final boolean b64;
 
-            private final int ppid;
             /**
              * Address of the environment vector.
              */
@@ -1337,8 +1333,6 @@ public abstract class ProcessTree implements Iterable<OSProcess>, IProcessTree,
              */
             private final long argp;
             private final int argc;
-            private EnvVars envVars;
-            private List<String> arguments;
 
             private SolarisProcess(int pid) throws IOException {
                 super(pid);
@@ -1596,9 +1590,6 @@ public abstract class ProcessTree implements Iterable<OSProcess>, IProcessTree,
         }
 
         private class DarwinProcess extends UnixProcess {
-            private final int ppid;
-            private EnvVars envVars;
-            private List<String> arguments;
 
             DarwinProcess(int pid, int ppid) {
                 super(pid);
@@ -1881,10 +1872,6 @@ public abstract class ProcessTree implements Iterable<OSProcess>, IProcessTree,
 
         private class FreeBSDProcess extends UnixProcess {
 
-            private final int ppid;
-            private EnvVars envVars;
-            private List<String> arguments;
-
             FreeBSDProcess(int pid, int ppid) {
                 super(pid);
                 this.ppid = ppid;

import hudson.EnvVars;
import java.util.List;

public class ProcessProperties {
Member

Why is this a public class?

@@ -0,0 +1,16 @@
package hudson.util;
Member

Please add a license header.

Author

compf commented Jun 13, 2024

Thank you very much for the feedback. @daniel-beck You are correct that your proposal is better. I haven't encountered this corner case, where fields are shared across derived classes, before, so it is interesting that the LLM did not spot it. I can update this PR to use your "pull the fields up" proposal when I find time :)

@MarkEWaite MarkEWaite added the skip-changelog Should not be shown in the changelog label Jun 24, 2024
4 participants