
Faster import of GTFS to DB #170

Closed · wants to merge 6 commits
Conversation

laem
Contributor

@laem laem commented Sep 27, 2024

I've identified that importing lines is a big bottleneck. I see a few reasons for this:

  • formatting each line takes time
  • reading the CSV in JS and building the 8,000-line batches can take time too
  • inserting into the DB in batches of 8,000

See this PR as trial and error aiming to divide the import times by 10 :)

@laem laem marked this pull request as draft September 27, 2024 19:42
Looks like this function consumes 5 seconds out of 13 for my control dataset "toulouse".

This memoization doesn't win back those 5 seconds, but could nevertheless help
@brendannee
Member

Thanks for pointing this out and testing everything.

I'll work on getting this implemented.

@laem
Contributor Author

laem commented Sep 28, 2024

I was trying to use the native CLI sqlite3 ".import" feature, which can import CSV. From my first test, writing to CSV in JS and then importing through exec() looks way faster.

Another technique would be to wrap the prepare.run() calls in a transaction(). From what people are saying, this should speed up the process by a lot.

Not sure I'll have time in the next days to finish this PR.

Overall I'm pretty sure a 2x gain can be achieved, maybe 3x, maybe 5x. The final bottleneck would be the JS functions that run on every record.

One last suggestion would be to load the data into the DB first and then run these functions (e.g. secondsAfterMidnight) as SQL queries, since they're not that complicated to reimplement :)

laem added a commit to cartesapp/serveur that referenced this pull request Sep 29, 2024
@laem
Contributor Author

laem commented Oct 3, 2024

Looks like my db creation is missing two things:

  • indexes
  • empty tables, from what I can see

OK, it's because I'm writing to a test DB ^^

@laem
Contributor Author

laem commented Oct 3, 2024

OK, now that I use the createTables function, ints are treated as strings by .import, which makes the checks fail.

Edit: OK, I was using another DB; it now works with the DB prepared by createTables. Just one problem: I need to disable the checks, as they don't work.

But the time gain has shrunk considerably. It turns out that a GTFS file which takes 1min05 on the master branch takes 52 seconds on my branch (no checks), and 35 seconds without the indexes.

It appears, though, that creating the indexes after the import is known to be faster. I'm trying that.

Yes, it brings the time down to 44 seconds, and the gain could compound if the 10 seconds spent indexing doesn't grow proportionally with the dataset.

@laem laem mentioned this pull request Oct 3, 2024
@laem laem closed this Oct 3, 2024