
Extremely low performance with 20 simultaneous clients and 9Gb squashfs image #150

Open
jonsito opened this issue Jul 12, 2023 · 3 comments



jonsito commented Jul 12, 2023

Hi all,

I have a student lab at my university with about 180 clients (i5, 16 GB RAM, Gigabit network) running Ubuntu 20.04 from PXE + NBD by means of LTSP.
The generated squashfs image is about 9 GB, so we run 5 NBD servers in parallel to avoid load issues.
Things go fine: load is balanced across the servers and startup time is acceptable when a single client boots... but problems arise when class begins and 60+ clients are started at the same time.

Network load is almost nil. "htop" shows that the servers (i7 10th gen, 8 CPUs, 32 GB RAM, NVMe disks) are doing minimal CPU work... yet the clients seem frozen: they take 5+ minutes to reach the login screen. It looks as if something is thrashing data with no real effect. Load balancing works correctly: each NBD server handles about 20 clients.

We've tried several tips, such as changing the number of workers and splitting the image files... none of these worked.
So I've started diving into LTSP internals: switching from NBD to NFS, using a raw image instead of squashfs, or even changing the nbd-client and squashfs block sizes.
Perhaps LTSP is not designed to handle such big images with so many simultaneous clients...
Any ideas?
Thanks in advance


yoe commented Jul 28, 2023

This is a difficult one to figure out with the information given.

The size of the image should not be relevant; NBD does not use much memory. Caching might be involved, but unless you enable the copy-on-write option, the cache should be shared amongst all clients.

Is this perhaps a problem with IO bandwidth on your server? Could you check if something like "iotop" on the server is enlightening?
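For example, something along these lines while 20+ clients are booting (just a sketch, assuming the iotop and sysstat packages are installed on the server; nbd-server would be the process to watch for):

    # show only processes actually doing I/O, with accumulated totals
    sudo iotop -o -a
    # per-device utilisation and await times, refreshed every second
    iostat -x 1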


lordbink commented Oct 17, 2023

@jonsito,
Did you get anywhere with this?

I would not expect CPU or memory on the server to make much of a difference. The network, the client systems, the configuration, and the switches in between would be my suspects.

How are you loading the squashfs image? From what I remember, when using squashfs the client needs to load the whole file before booting. Do you know whether loading the squashfs image is what takes so long? (Depending on your config, you will see it load the squashfs followed by many dots.) It could be that 60 systems simultaneously loading the 9 GB squashfs image is the problem.

L


jonsito commented Oct 19, 2023

We have 4 Gb/s of bandwidth in our labs. Traces show neither congestion nor other traffic problems.
In fact, there are two bottlenecks:

  • In the bootp/kernel/initrd loading phase: TFTP has very poor performance, so in our GRUB setup we had to switch from TFTP to HTTP for loading the kernel and initrd.
  • If we use nfsroot and ltsp.image instead of nbdroot on the GRUB command line, Linux startup (once the initrd is loaded) is twice as fast as with plain nbdroot.

Here are some samples from our GRUB entries:
nfs+squashfs based:
...
linux (http,SERVER)/tftpboot/ltsp/LABDit_2023-x86_64/vmlinuz ro splash apparmor=0 ip=dhcp modules_load=e1000e rootfstype=squashfs forcepae root=/dev/nfs nfsroot=SERVER:/opt/ltsp/images ltsp.image=root_squashfs.img
....
( notice the root, nfsroot and ltsp.image entries )
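For completeness, the matching export on the server looks roughly like this (a sketch only; the path comes from the nfsroot= parameter above, but the export options shown are assumptions, not necessarily what we run):

    # /etc/exports on SERVER
    /opt/ltsp/images *(ro,async,no_subtree_check)

followed by "exportfs -ra" to apply it.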

nbd+squashfs based:
...
linux (http,SERVER)/tftpboot/ltsp/LABDit_2023-x86_64/vmlinuz ro splash apparmor=0 ip=dhcp modules_load=e1000e rootfstype=squashfs forcepae root=/dev/nbd0 nbdroot=SERVER:10809/LABDit_2023
....
( also notice the "root" and "nbdroot" entries again; "nbdroot" instructs the nbd-server to provide the same root_squashfs.img file as above )
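The corresponding nbd-server export is configured roughly like this (again just a sketch: the image path is assumed to match the NFS layout above, and readonly/copyonwrite are standard nbd-server config options shown for illustration):

    # /etc/nbd-server/config on SERVER
    [generic]
        # defaults

    [LABDit_2023]
        exportname = /opt/ltsp/images/root_squashfs.img
        readonly = true
        copyonwrite = false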

The first one reaches the desktop twice as fast as the second one.
Even when starting a single client, the first solution is faster.

My feeling is that this comes down to nbd-server being a user-space program (whereas NFS is served by nfs-kernel-server) and too many user/kernel-space context switches.

Anyway, at this moment nfs+squashfs works for us without problems.
