Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

usbhid-ups changes: more resiliant w/failures; scales Tripp Lite SMART1500LCDT #122

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

sbutler
Copy link
Contributor

@sbutler sbutler commented Apr 19, 2014

Had trouble getting nut to work for my new Tripp Lite SMART1500LCDT. So in best open source fashion, I hacked at it till it worked. Couple things in this pull request:

  1. Scale values for some of the SMART1500LCD parameters.
  2. With my hardware and kernel, usbhid-ups wouldn't run more than 10 minutes before exiting with a fatal error. This seems to be related to holding usb_claim_interface() the entire life of the program. I've refactored claim_interface/release_interface calls to be around only when work is being done. Now usbhid-ups has run for 24+ hours without issue.
  3. The very nature of USB in a home means that sometimes the driver will lose communications with the UPS (accidently unplug the hub, needed a port for a camera, etc). This shouldn't be a fatal error. I've tried to detect these situations and let usbhid-ups retry gracefully. Also, at startup the UPS no longer has to be immediately present. It will keep retrying upsdrv_initups() until it finds one (only implemented for usbhid-ups, but someone could do the work for other drivers).

@aquette aquette added this to the 2.7.3 milestone Apr 24, 2014
@aquette
Copy link
Member

aquette commented Jul 12, 2014

this introduces changes on the drivers interface that requires more thinking and review.
this is scheduled and I'll get back on this soon.

the "Scale for SMART1500LCDT" commit should also be extracted and submitted as a separate pull request.
thanks for your work!

@aquette aquette modified the milestones: NUT 2.8, 2.7.3 Jul 12, 2014
clepple added a commit that referenced this pull request Nov 11, 2015
So far, the 3016 protocol units all seem to have this scaling issue.
@bcl
Copy link

bcl commented Nov 21, 2015

I applied these patches to nut 2.7.3 (running on Fedora 21). They improved it from 'never works' to 'works sometimes, maybe'. Repeated runs of usbhid-ups -u root -DDD -a tripplite and lsusb -v -d 09ae:3016 sometimes have what look like complete results, other times complain about timeouts, and a third state with 7 Item(Main ): (null), data=none

When trying to run nut-server the driver initially seems to work and then repeatedly fails with usbhid-ups[11274]: libusb_get_report: Input/output error and upsd[11275]: Data for UPS [tripplite] is stale - check driver

@clepple
Copy link
Member

clepple commented Nov 21, 2015

@bcl this seems to be motherboard-dependent:

http://article.gmane.org/gmane.comp.monitoring.nut.user/9465

What kind of hardware are you running on? Any luck with moving to non-USB3 ports?

Also what kernel and libusb does F21 provide?

@bcl
Copy link

bcl commented Nov 22, 2015

Interesting, I'm not sure what it was plugged into, I'll try switching it. The motherboard is an ASRock B85M-ITX

The kernel is 4.1.13-100.fc21.x86_64 and libusb is libusb-0.1.5-5.fc21.x86_64, I also tried some manual poking around using libhid-python-0.2.17-17.fc21.x86_64 and ran into similar problems - Error from libusb: Input/output error

@clepple
Copy link
Member

clepple commented Nov 22, 2015

Interesting, I'm not sure what it was plugged into, I'll try switching it. The motherboard is an ASRock B85M-ITX

A quick Google search didn't turn up any "lspci" output for that motherboard. Does it look similar to this?

$ lspci|grep -i usb
00:1a.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4
00:1a.1 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #5
00:1a.2 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6
00:1a.7 USB controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2
00:1d.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1
00:1d.1 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2
00:1d.2 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3
00:1d.7 USB controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1

The kernel is 4.1.13-100.fc21.x86_64 and libusb is libusb-0.1.5-5.fc21.x86_64, I also tried some manual poking around using libhid-python-0.2.17-17.fc21.x86_64 and ran into similar problems - Error from libusb: Input/output error

Not entirely surprising - much of the NUT USB HID code came from that libhid source tree. I should look into whether the HIDAPI package does things differently, but I really think this is a compatibility issue at a lower level (as you pointed out with lsusb).

@bcl
Copy link

bcl commented Nov 23, 2015

00:14.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI (rev 05)
00:1a.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2 (rev 05)
00:1d.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #1 (rev 05)

It was on a USB2 port, moving it to a USB3 port didn't change the behavior at all.

@bcl
Copy link

bcl commented Nov 25, 2015

I moved the Tripp Lite to another system with only the scaling patch applied and for the moment it appears to be running ok. lspci looks like:

00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05)
00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05)

@clepple
Copy link
Member

clepple commented Nov 25, 2015 via email

@clepple
Copy link
Member

clepple commented Sep 25, 2016

A lot has changed at once, but I have a positive datapoint for more reliable SMART1500LCDT support:

  • Ubuntu 16.04 (x86_64)
  • Kernel 4.4.0-38-generic
  • libusb 1.0.20 (dpkg says "2:1.0.20-1")
  • NUT libusb-1.0 branch (commit db20c07)
  • Same ICH10-based motherboard as before

I have a feeling that the kernel update helped the most, because lsusb -v is rock solid now. I'll try some of the other USB ports later.

If you had trouble before, can you recheck? If you're still on an older kernel, feel free to try out the libusb-1.0 branch (see issue #300).

@robbiet480
Copy link

So I just got this UPS yesterday and have it working nicely with NUT. However, every so often (about 15-30 minutes, sometimes much longer) i'll get notified that NUT has lost communications. I am running on the NUT provided by apt.

  • Ubuntu 16.04.1 LTS
  • 4.4.0-43-generic
  • Network UPS Tools upsd 2.7.2

I just changed pollinterval to 10 from the default (2) because I had messages like usb 2-2: usbfs: USBDEVFS_CONTROL failed cmd usbhid-ups rqt 161 rq 1 len 3 ret -71 in my syslog. I will try this out for a couple hours and if I see little or no improvement will either try applying this PR locally or try the libusb-1.0 branch that @clepple mentioned above.

If it matters, I originally had the UPS plugged into a USB 2 port but after I started having issues I switched it to a USB 3 port. I believe that it may have started crashing less after switching it.

Also, i'm on a SuperMicro A1SAI-2750F Intel Atom server board. I have a Z-Wave stick plugged in alongside the UPS and it's never given me problems.

Finally, I have been doing a lot of development for NUT since getting my UPS (creating a Go library and implementing NUT support in Home Assistant). Could repeated abuse of the TCP API cause problems somehow? I assume not but thought i'd point it out.

@clepple
Copy link
Member

clepple commented Oct 25, 2016

Could repeated abuse of the TCP API cause problems somehow?

The TCP protocol fetches data from the dstate layer. The polling loop proceeds independently of the dstate queries (unless you are starving out the event loop completely), so I don't think this would affect things.

@robbiet480
Copy link

Changing the pollinterval to 10 made the issue much better but it's not perfect yet. Still getting 20+ drops per day. Will be trying out this PR and the libusb-1.0 branch.

@robbiet480
Copy link

Finally got around to switching to libusb-1.0 branch and this already seems much more stable. Will report back tomorrow to let you know if I'm still having drops.

@clepple
Copy link
Member

clepple commented Nov 7, 2016

The libusb-1.0 branch (#300) is now available in this PPA: https://launchpad.net/~clepple/+archive/ubuntu/nut

@robbiet480
Copy link

@clepple Still getting drops with the libusb-1.0 branch, almost as frequently. I did revert my pollinterval change back to defaults though. I also found that libusb-1.0 would only work when the UPS was plugged into a USB 2.0 port.

@clepple
Copy link
Member

clepple commented Nov 11, 2016

I also found that libusb-1.0 would only work when the UPS was plugged into a USB 2.0 port.

@robbiet480 can you elaborate on this? Failure to detect, or disconnects relatively soon when on USB 3.0? (Also wondering if simply unplugging and reconnecting the USB cable helps...)

@robbiet480
Copy link

@clepple Sorry for the delay, meant to reply to this last night. It was total failure to detect. I did not attempt to unplug/reconnect, just left it on a USB 2.0 port.

@robbiet480
Copy link

I continue having drops. I tried to restart the driver during one of these drops and lsusb didn't see the USB device at all. I had to re-plug the UPS into a new USB port before lsusb recognized it again (this time a USB 3 port) at which point the driver also cleanly started. Changed the pollinterval to 30 now. Wondering if I just have a dud of a UPS?

@clepple
Copy link
Member

clepple commented Nov 13, 2016

Wondering if I just have a dud of a UPS?

I think it's the combination of that model (the 3016 USB ID) plus the USB controllers on newer motherboards.

Another data point: I moved my SMART1500LCDT from a Linux box with an ICH10 to an older Core2 Duo Mac Mini. I let Mac OS X poll the UPS, so libusb was out of the picture. It worked fine for several days. I then moved the UPS to an i7-based Mac Mini (also OS X 10.11) and the UPS disconnected within five minutes of being polled by the OS.

@robbiet480
Copy link

@clepple Interesting data point since my last comment: Haven't had a single drop since changing pollinterval to 30.

Copy link

@sdgathman sdgathman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I applied this scale change as a patch to 2.6.5 on rhel6, and it greatly improves the output.

@clepple
Copy link
Member

clepple commented Dec 10, 2016

@sdgathman I am not familiar with the new "user approves these changes" UI in GitHub, but the reason why this pull request hasn't been merged is due to the major changes in the claim/release logic that would have to be tested on many other OS and UPS combinations, and also because I don't think the infinite retry logic makes sense for non-home users (or if it does, we need additional upsmon notifications for when it is still retrying). We ended up testing all of the OS/UPS combinations anyway as part of #300...

The scaling patch was merged separately as part of 2.7.4.

@robbiet480
Copy link

Until late last night I was continuing to get disconnects many times per day. Then I saw the issue description of #277 where the author stated that he resolved disconnections by changing out the provided USB cable with a known good one. I then searched the Amazon reviews and found many people reporting the exact same issue. Since changing out my USB cable everything has been super stable 😎. First time for everything. Never would have thought it was the cable!

@robbiet480
Copy link

robbiet480 commented Jul 20, 2018

Turns out, I kept getting disconnects and just gave up for a long time. A few days ago, while I was doing annual server maintenance and was waiting for a ZFS scrub to finish I decided to play around with this again.

On the latest NUT, compiled from GitHub, with no other config changes, I am no longer getting disconnects, for real, after like 3 days of monitoring!

What is sometimes happening is upsc will report stale data or just a generic error like

Error while connecting to localhost, disconnect
Error: Server disconnected

Attempting to restart nut-server will throw errors the first time like Can't connect to UPS [tripp-lite] (usbhid-ups-tripp-lite): Interrupted system call and the restart will time out. Attempting a restart shortly after that will complete successfully and data will flow.

Things to note: I was previously on the libusb branch, now back to the mainline. driver.version reports 2.7.4-480-g14301bd.

So anyway, guessing that a thought to be unrelated fix was applied between January 13, 2017 and 3-4 days ago that happened to seriously improve Tripp-Lite support in NUT?

@sdgathman
Copy link

@robbiet480 Have you seen my reset assist kludge for SMART1500LCDT?
https://github.com/sdgathman/trippfix
Blog story: https://gathman.org/blog/category/usb/

@robbiet480
Copy link

Almost two years on and man this UPS really sucks. Continually getting drops still.

@sdgathman
Copy link

Almost two years on and man this UPS really sucks. Continually getting drops still.

I upgraded server to CentOS-8, and the UPS still drops. BUT, kernel-4.18 now automatically tries the power cycle sequence, so my addon package for CentOS-6 is no longer needed. See README on sdgathman/trippfix.

@jimklimov
Copy link
Member

Unfortunately this PR grew out of sync vs. current master, needs expert attention to realign.

@jimklimov jimklimov added the outdated-CI PR was made without rights for NUT Team to update its source branch, proposed codebase has no/old CI label Oct 9, 2020
jimklimov added a commit to jimklimov/nut that referenced this pull request Jan 21, 2021
@jimklimov
Copy link
Member

jimklimov commented Jan 29, 2022

Note for future looks at this PR: according to comments above, "The scaling patch was merged separately as part of 2.7.4" (and I think there were later others between that and 2.7.5, see e.g. #963), while the rest of this change was deemed too disruptive for the time being then (with libusb-1.0 merge ahead).

Now that the libusb-1.0 support is in NUT master branch, as well as other changes around the codebase (from formatting to warnings fixes) this PR has to be adapted to be mergeable again, at least.

Content-wise, seems it can be split into several efforts:

  • one big and relatively independent change is the generally retryable initialization of ups drivers, which now are not "void" but can return 1 if successful. Very few drivers were changed to return 0 if detect/init failed so they should retry in main.c, and they still use fatalx() in case of errors making the retries moot. Rome wasn't built in a day - this is not a blocker per se, but something to address for the feature to be practically useful in the field.
    • Note also that with current NUT on platforms with systemd or SMF, we would be able to wrap each driver process into a service unit instance, so the current failed init is handled by the OS to restart the driver.
  • another change is the different work with USB (and nominally SHUT), e.g. regarding claim/release of interface, maybe something else I missed if I looked too cursorily...
  • UPDATE: relatively recent PRs usbhid-ups: do not crash due to auto-reconnecting #1632 and drivers/usbhid-ups.c: try to close libusb handle before reconnecting … #1335 touched on resilience vs. failures per issue Spurious usbhid-ups ioctl error, fails to recover, dies. #414, enhancing the previously available (for at least 15 years) ability of the driver to reconnect to the device after a communications hiccup.

@caseyjmorton
Copy link

So at this current point in time, is there a viable workaround for this model?

@sdgathman
Copy link

So at this current point in time, is there a viable workaround for this model?

@caseyjmorton I use this model. I used the duct tape referenced above for CentOS-6, and CentOS-8 and now Rocky8 kernels do the port power cycle themselves. You just need a hub that supports PPPS. https://gathman.org/2016/07/30/Standard_Schmandard/

@jimklimov jimklimov marked this pull request as draft September 9, 2022 10:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge-conflicts outdated-CI PR was made without rights for NUT Team to update its source branch, proposed codebase has no/old CI Tripp Lite USB
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants