Reduce DC build time and complexity by getting rid of Scala and Spark #1890

Open
kaspersorensen opened this issue Jan 2, 2022 · 6 comments

Comments

@kaspersorensen
Member

Hi all,

I am picking up DC development for a bit, after a long hiatus. Coming back to this project is making me realize how long and complex a build we have. I would like to make the DC build (and thereby the overall developer experience) much nicer by simplifying it. Right now I am spending a lot of time just getting it to compile on my fresh installation, and the main culprit is something that I've noticed before: Scala, and to some extent also the Spark module. So I would suggest simplifying the developer experience by:

  1. Removing our dependency on Scala.
  2. Removing the Spark module, or looking at converting it into a separate GitHub project/extension.
@kaspersorensen
Member Author

I have prepared a branch to illustrate what would be removed...

https://github.com/datacleaner/DataCleaner/compare/remove-scala?expand=1

@LosD
Contributor

LosD commented Jan 3, 2022

The branch looks more or less good to me, although it's been so long since I even looked at Java that I'm not sure I'd take my own word for it. 😊

Buuuut it's mostly code removal, so as long as it builds and all the runners still work, I guess we're good. I can't test right now, but I'll try to see if I can get some time for it later today or tomorrow.

However, I don't see deletions of the .scala files themselves? But okay, it's just an example, so I guess it doesn't matter for now.

@LosD
Contributor

LosD commented Jan 3, 2022

Regarding pulling it into its own extension, I guess it would take quite a bit of refactoring to allow runners in extensions, especially ones that need to change the system in such a major way? If I remember correctly (which is not a given), we/you tried something like that originally, but ended up this way because such a fundamental change to running just got too hard without quite a bit of coupling. But maybe the Scala parts themselves could be kept in the extension, while the base of the runner was kept here? Admittedly that DOES sound like a bit of a strange design, but if we think the Spark runner still has value to users, it would be a shame to lose it.
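
To make the "runner base in core, Spark specifics in an extension" idea a bit more concrete, here's a rough sketch of such a split using plain `java.util.ServiceLoader`. The interface and class names are made up for the example and are not the actual DataCleaner extension API:

```java
import java.util.ServiceLoader;

// Hypothetical SPI kept in DataCleaner core: "run this job somewhere".
interface JobRunner {
    String name();
    void run(String jobFilePath);
}

// Hypothetical implementation that would live in a separate spark-extension jar,
// registered via META-INF/services/JobRunner.
class SparkJobRunner implements JobRunner {
    @Override
    public String name() {
        return "spark";
    }

    @Override
    public void run(String jobFilePath) {
        // Spark-specific submission logic would go here.
        System.out.println("Submitting " + jobFilePath + " to Spark");
    }
}

class RunnerLookup {
    // Core discovers whatever runners are on the classpath; no compile-time
    // dependency on Spark (or Scala) in the core modules.
    static JobRunner find(String name) {
        for (JobRunner runner : ServiceLoader.load(JobRunner.class)) {
            if (runner.name().equals(name)) {
                return runner;
            }
        }
        throw new IllegalArgumentException("No runner named " + name);
    }
}
```

The point would be that core only knows about the `JobRunner` contract, while the Spark and Scala dependencies stay inside the extension jar.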

@kaspersorensen
Member Author

I just realized that the branch is in no way ready to go :) I mean, there are a bunch of components that are just no longer included, and I guess we just didn't have integration tests for those, but they would disappear from the product if we merged that branch. But I think they're not too hard to reproduce, so that's definitely the next step if we want to complete this issue.

Regarding the Spark runner: I agree it's probably not going to be easy to make it a proper extension. I was more thinking that we could make it a separate distribution, a bit like datacleaner-docker or whatever. A distribution that would include its own Main class and would only be built to work with Spark.

I mean, the other thing is that Spark has moved on massively since this was built. I think everything will break and have to be partially rewritten if we just upgrade Spark to the latest version. But I think it's time that it does get upgraded somehow.
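
Just to sketch what a Spark-only distribution's entry point might look like against a current Spark release (the DataCleaner-side details are hypothetical, and this is not the existing Spark module's Main class):

```java
import org.apache.spark.sql.SparkSession;

// Hypothetical entry point for a Spark-only DataCleaner distribution.
public class SparkDistributionMain {

    public static void main(String[] args) {
        if (args.length < 1) {
            System.err.println("Usage: SparkDistributionMain <job-file>");
            System.exit(1);
        }

        // With Spark 3.x the SparkSession is the usual entry point; the master
        // and most settings would normally come from spark-submit, not code.
        SparkSession spark = SparkSession.builder()
                .appName("DataCleaner Spark job")
                .getOrCreate();
        try {
            // Hypothetical: translate the DataCleaner job file into Spark work.
            System.out.println("Would run " + args[0] + " on Spark " + spark.version());
        } finally {
            spark.stop();
        }
    }
}
```

With spark-submit providing the cluster configuration, the distribution itself wouldn't need much beyond an entry point like this plus the job translation logic.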

@LosD
Contributor

LosD commented Jan 3, 2022

Ah yeah, I remember most of the Spark components being reasonably simple.

How about I finally get back to contributing (and re-jiggle my Java experience a bit) by taking at least some of them on? But it might be a few days before I get started.

@kaspersorensen
Member Author

Yeah, give it a shot! I'm ready to cheer you on! For my part I'm then gonna look into some of the more simple-n-stupid Scala-to-Java conversions in the non-Spark areas of the code, like the HTML rendering module and others.
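
For reference, the kind of conversion I mean is mostly mechanical, roughly along these lines (the class and method names here are made up for the example, not the actual renderer API):

```java
import java.util.Map;

// Hypothetical Java replacement for a small Scala HTML-rendering helper:
// builds an HTML fragment for a metric table using nothing but the JDK.
public class HtmlFragmentExample {

    static String renderMetrics(Map<String, ? extends Number> metrics) {
        StringBuilder html = new StringBuilder("<table class=\"metrics\">\n");
        for (Map.Entry<String, ? extends Number> entry : metrics.entrySet()) {
            html.append("  <tr><td>")
                .append(escape(entry.getKey()))
                .append("</td><td>")
                .append(entry.getValue())
                .append("</td></tr>\n");
        }
        return html.append("</table>").toString();
    }

    // Minimal escaping, enough for the example.
    private static String escape(String value) {
        return value.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println(renderMetrics(Map.of("Row count", 42, "Null values", 3)));
    }
}
```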

kaspersorensen added a commit that referenced this issue Jan 4, 2022
#1890: Converted Value Distribution renderer to Java