This repository has been archived by the owner on Jan 16, 2019. It is now read-only.

Adds Low Bandwidth Trajectory Follower implementation #9

Merged
merged 11 commits
Sep 27, 2018

Conversation

potiuk

@potiuk potiuk commented Dec 29, 2017

@Zagitta @gavanderhoorn @ThomasTimm - this is the promised pull request that implements the Safe Trajectory Follower. It's a drop-in update with no public API changes. It should work out of the box; you just need to add `use_safe_trajectory_follower` as a launch argument of the driver.

It is rather safe (by design it never "catches up" dangerously), it behaves well and follows the trajectory very closely, it is resilient to communication problems, it works very reliably over WiFi (!), and it can run on a heavily loaded PC. All that while controlling the position of the robot very precisely. It stops when the client disconnects (after finishing the current interpolation step) and it is preemptible.

I believe that for this specific case (no ros_control, position-based control, MoveIt connected directly to the driver via the "follow_joint_trajectory" action) it has only advantages over the previous implementation and no disadvantages that I am aware of. I'd suggest that after some time, once it is proven, you might want to make it the default implementation for this case. For now I added a switch, "use_safe_trajectory_follower", which has to be set to true to enable it (mutually exclusive with ros_control). A nice side effect of the design I came up with is that you can freely control the speed-up of the executed trajectory vs. the trajectory planned by MoveIt (basically, the robot has its own perception of time passing, and we can control whether robot time flows slower than, the same as, or faster than real time). This is a very cool capability for research and development purposes. Note that the movies below are recorded in real time; it's the robot that moves faster :).
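The time-scaling idea described above can be sketched in a few lines. This is an illustration only - the helper name and the list-of-timestamps representation are mine, not the driver's API - assuming the robot consumes planner timestamps divided by a speed-up factor:

```python
def scale_trajectory_times(times, speedup):
    """Map planner timestamps to 'robot time'. With speedup > 1 the
    robot's clock runs faster than real time, so a planned move
    finishes earlier; with speedup < 1 it runs slower."""
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return [t / speedup for t in times]

# A move planned to take 4 s, executed with a 2x speed-up,
# takes 2 s of real time.
print(scale_trajectory_times([0.0, 1.0, 2.0, 4.0], 2.0))
```

The same mechanism gives slow-motion execution (`speedup < 1`) for free, which is what makes it handy for debugging and research.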

I prepared a set of snapshots from my tests showing several test cases (both movies and graphs showing how the robot behaved). Most of those tests were run with a heavily loaded PC (a normal camera plus an RGBD camera connected via a ROS node, video recording, and the RGBD point cloud displayed in RViz). When we ran the same cases with the "standard" Trajectory Follower under load or over a wireless connection, we got into a dangerous "catch-up" almost immediately. This never happened over my hours of testing with the Safe Trajectory Follower. At most I got a protective stop because I reached limits configured via the robot setup, but it was always fully controlled behaviour, never an uncontrolled "catch-up":

Here are the snapshots: https://drive.google.com/drive/folders/1N2Tu3OgWEZ6nip_XE0o9J0HvWkS4H5I1

These are:

  • Wireless run with normal speed and heavily loaded PC
  • Wired run with 3x speed-up
  • Wired run with 3x speed-up and heavily loaded PC
  • Wireless run with 3x speedup and heavily loaded PC

The "normal speed wireless" charts are the most interesting ones - https://drive.google.com/open?id=1802exN_fbFWLpVbwtrop2SEFBcMAFTQp. It shows how well the driver handles delay situation that might inevitably happen when you use wireless even with my implementation. You can clearly see that it got delayed a bit and that then the "catching up" was really very local - it made the move a bit longer, but nothing dangerous happened.

All the documentation in README.md is updated - I explained the differences and parameters, and gave some examples of the speed-up configuration.

Note - this change is independent of my other pull requests, but it can easily be merged together with them (with small, easily solvable conflicts). I also have a separate branch where all the other pull requests are merged (NOMAGIC_ALL_MERGED). If you conclude that we should merge all of them in one go, we could merge that branch (everything is rebased on top of the current master).

Also, here is the context for some design decisions I made in this change - feel free to comment or suggest other approaches if you have other ideas/suggestions for improvements, or if you feel particularly unhappy about any of them.

  • minimal impact - first and foremost, I implemented it with the goal of minimising the impact on the current architecture of ur_modern_driver and implementing it quickly. I think I managed to do both :).

  • supporting only V3+ - I decided not to implement support for pre-V3 firmware (servoj_gain and lookahead_time were added in V3). The driver will simply exit with a helpful message if someone tries to run it on earlier firmware. I think it is a nice compromise between clarity and functionality - you can still always disable the safe trajectory follower if you want to connect to older firmware.

  • threading model/complexity of URScript - the architecture of the URScript is rather complex (especially the threading model). I have three threads that communicate with each other over global variables (protected with critical sections). Thanks to that, potentially blocking operations (send/receive) have no impact on the control loop: the control thread runs the control loop independently of sending/receiving data. It's a bit complex but works very well - I can request and receive the next trajectory point while interpolating between the previous two points, without impacting the control loop at all. As a side note - this could be vastly simplified if we used the RTDE interface. RTDE can handle the send/receive exchange with the client using general-purpose registers. I am planning to make another Python-only implementation of the driver (a small subset of what the current ur_modern_driver does) using RTDE, and then I will share my experience with you. It would be too much of an architecture change to implement RTDE in ur_modern_driver as it is now (especially since the Python API is provided by UR). It seems my design will eventually allow running a smaller, RTDE-based, Python-only node for all UR communication for this specific case (not a full ur_modern_driver replacement) - those are my mid-term plans.
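The decoupling described above - blocking socket I/O never stalling the control loop because the threads share only lock-protected state - can be illustrated with a small Python analogy. All names here are hypothetical; `threading.Lock` stands in for URScript's critical sections:

```python
import threading

class SharedBuffer:
    """Trajectory points shared between the receiving thread and the
    control loop, guarded by a lock (the URScript analogue would be
    critical sections around global variables)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._points = []

    def push(self, point):
        with self._lock:
            self._points.append(point)

    def pop(self):
        with self._lock:
            return self._points.pop(0) if self._points else None

def receive_thread(buf, incoming):
    # Simulates the socket-handling thread: blocking receives touch
    # only the locked buffer, never the control loop directly.
    for point in incoming:
        buf.push(point)

def control_loop(buf, n_points, executed):
    # Simulates the control thread: it holds the lock only briefly,
    # so slow network I/O cannot stall the servo cycle.
    while len(executed) < n_points:
        point = buf.pop()
        if point is not None:
            executed.append(point)

buf, executed = SharedBuffer(), []
rx = threading.Thread(target=receive_thread, args=(buf, [0.0, 0.5, 1.0]))
rx.start()
control_loop(buf, 3, executed)
rx.join()
print(executed)  # all points arrive, in order, despite concurrency
```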

  • TrajectoryFollowerInterface - both the Trajectory Follower and the Safe Trajectory Follower implement the same interface (TrajectoryFollowerInterface). That allows me to use them interchangeably in the ActionServer. My somewhat dated C++ experience caused some misunderstanding of dynamic_cast on my side, but I think relying on the ros_control variable being mutually exclusive with the Safe Trajectory Follower now does the job. I decided that the interface should be pure virtual (similar to Java interfaces) to keep the two Followers more decoupled. There is very little potentially shareable code between them, and it's better to keep them decoupled.
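The actual interface is a C++ pure-virtual class; as a language-neutral sketch of the same "Java-style interface" idea (the method names below are illustrative, not the driver's real API), the design looks like this in Python:

```python
from abc import ABC, abstractmethod

class TrajectoryFollowerInterface(ABC):
    """Analogue of a C++ pure-virtual interface: no shared state or
    behaviour, just the contract both followers must satisfy, so the
    ActionServer can use either one interchangeably."""
    @abstractmethod
    def start(self): ...
    @abstractmethod
    def execute(self, trajectory): ...
    @abstractmethod
    def stop(self): ...

class LowBandwidthTrajectoryFollower(TrajectoryFollowerInterface):
    # Trivial stand-in implementation, just to show the wiring.
    def start(self):
        return True
    def execute(self, trajectory):
        return list(trajectory)
    def stop(self):
        return True

follower: TrajectoryFollowerInterface = LowBandwidthTrajectoryFollower()
print(follower.execute([1, 2]))
```

Keeping the base class abstract (no shared code) is what keeps the two followers decoupled, at the cost of a little duplication.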

  • add new/missing arguments to ur_common.launch - I added all new (and some old) arguments to ur_common.launch. I see no harm in that, as I added them with defaults. Having the parameters in the default common launch file might help others understand that they can use a different Follower. I have not added them to all the URx launch files, though; I think the common launch file plus the README explanation is enough. Of course, if you decide to make the Safe Follower the default, you will have to update those :)

  • TCP/IP messages - I decided to use an ASCII float array for sending trajectories rather than a binary int array as in the original follower. There is no real need to optimise it, as it is sent at a few Hz, and it's easier to debug and understand what's going on when it is sent as plain ASCII (and the code is a bit simpler - no artificial multiplication to convert between float and int). It works very well even with ASCII.
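To make the trade-off concrete, here is a hedged sketch of the two encoding styles. The exact wire formats of either follower are not reproduced here - the parenthesised ASCII layout and the scale factor are illustrative:

```python
import struct

def encode_ascii(point):
    """ASCII float encoding in the spirit of the new follower:
    human-readable on the wire, trivial to inspect with tcpdump."""
    return ("(" + ",".join(f"{v:.6f}" for v in point) + ")").encode()

def encode_binary(point, scale=1_000_000):
    """Scaled-int binary encoding in the spirit of the original
    follower: floats multiplied into int32s, packed big-endian."""
    return struct.pack(f"!{len(point)}i", *(int(v * scale) for v in point))

p = [0.5, -1.25, 3.141593]
print(encode_ascii(p))        # readable, slightly larger
print(len(encode_binary(p)))  # compact, but opaque and lossy
```

At a few Hz per trajectory, the size difference is irrelevant, which is why readability wins here.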

  • inefficiently reading a line from the TCP/IP server - the implementation of reading a line from URScript over TCP/IP is quite inefficient (reading char by char). Classic. But since those messages are rather short - a maximum of 4 characters per message - it has no real impact at all. URScript only sends the number of the required next trajectory point to the driver each time it starts interpolating the previous segment. I decided it's simpler to do it this way rather than spend time implementing buffering (I'd love to use boost's read_until(), but that's impossible without changing the ur_modern_driver architecture to use asio - far too big a change).
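The char-by-char read can be sketched as follows; `FakeSocket` is a stand-in so the example runs without a robot (the real driver reads from a TCP socket with the same one-byte `recv` pattern):

```python
import io

def read_line(sock, max_len=64):
    """Read a newline-terminated message one byte at a time.
    Inefficient in general, but harmless for the ~4-character
    messages URScript sends once per interpolation segment."""
    chunks = []
    while len(chunks) < max_len:
        c = sock.recv(1)
        if not c or c == b"\n":
            break
        chunks.append(c)
    return b"".join(chunks).decode()

class FakeSocket:
    """Stand-in for a TCP socket, so the sketch runs offline."""
    def __init__(self, data):
        self._buf = io.BytesIO(data)
    def recv(self, n):
        return self._buf.read(n)

print(read_line(FakeSocket(b"42\nnext message")))  # prints 42
```

Buffered reading (e.g. `read_until`) only pays off for longer or more frequent messages, which is the argument made above.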

I hope others who had similar problems will test it while you review it. I have friends at two universities in Poland whom I asked to test it, and they said they will. The "jerky" behaviour is a pain for them as well, so I hope we can get confirmation from others that it works for them too. I'd love for us to merge it quickly, and also for the whole refactor to be merged into the main repo - that would be awesome. The refactor makes it much easier for others - like me - to contribute.

@Zagitta
Owner

Zagitta commented Jan 9, 2018

I will get around to reviewing your changes eventually but I'm quite busy at the moment so please forgive me for taking so long to get back to you.
I'm quite impressed by your approach, though! I'm also curious how much effort you think it would take to support older firmware versions?

@potiuk
Author

potiuk commented Jan 10, 2018

I understand :). No worries. I think it should be rather easy to support older firmware. The only missing bit is to dynamically generate the servoj command, as is done in the original trajectory follower (with or without the lookahead/gain parameters). I did not do it simply because I have no older firmware and no way to test it, so even though it is simple, there is a bit of risk involved. I think it could be added/modified later, after it's merged, by someone who can test it.

BTW, as mentioned before - we already have some good first results using RTDE (but we will gradually rewrite only what we need, not the whole ur_modern_driver). I'd really recommend moving ur_modern_driver to use it instead of the reverse TCP connection. You avoid a number of connectivity problems this way (firewalls on the client PC, for one, or server sockets left timing out when you kill the client), you can make the threading model simpler, and the communication overhead can be much smaller (you only subscribe to what you need, at the frequency you specify - possibly a different frequency for different values - and you can even have multiple clients, each with different data). It's much more stable/versatile/robust.

@simonschmeisser

@potiuk do you have that RTDE variant somewhere? Sounds interesting!

@potiuk
Author

potiuk commented Jan 10, 2018

Not yet. We had some other priorities, but we are planning to open-source the implementation soon-ish (I guess 2-3 weeks is a reasonable estimate). We are still not 100% sure about the final setup, but it will likely be a set of independent, small Python ROS nodes for the various UR interfaces - each communicating with the robot separately rather than one monolithic driver - this is what RTDE enables.

The good thing about Python is that the RTDE client library is provided and supported directly by UR. See the attachment here: https://www.universal-robots.com/how-tos-and-faqs/how-to/ur-how-tos/real-time-data-exchange-rtde-guide-22229 . For a C/C++ implementation there is, I guess, quite an amount of interface code to write.

@axelschroth

Nice work! I tested your implementation (NoMagicAi:SAFE_TRAJECTORY_FOLLOWER) with a UR5.
Unfortunately, I see two disadvantages:

  1. The safe trajectory follower uses a large URScript program, which has to be transmitted and run by the controller with each trajectory
  2. The goal position is not reached correctly in several cases

Here you can see a plot of the movement of joint 'wrist_3' resulting from a trajectory with only 2 points (start and goal):
[image: wrist_3 joint position over time]

  • blue: goal position, brown: actual joint position
  • x-axis: time, y-axis: joint position (rad)
  • movement starts about 700 ms after the trajectory message is published
  • the servo stops clearly before reaching the goal position

Any idea how we could handle such situations?

The Safe Trajectory Follower implements a different approach for
controlling the robot. Rather than calculating the interpolation
steps in the driver and sending the small interpolated steps over
the network to the URScript program at 500 Hz, the coarser MoveIt
trajectory is sent (at a few Hz) and the interpolation steps are
calculated by the URScript program.

The algorithm for time progress also has built-in protection
against any delays induced by load on the driver, network or
URControl - it will never "catch up" dangerously when such delays
are introduced. It will instead pause, wait for the next small
interpolation step instructions, and restart the move more slowly -
never skipping any interpolated steps.

Those changes make the Safe Trajectory Follower much more resilient
to network communication problems and remove any special
requirements on network setup, kernel latency and driver PC load -
making it much more suitable for research, development and quick
iteration loops. It works reliably even over WiFi.
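The controller-side densification of a coarse trajectory can be sketched as a cubic (Hermite) blend between two trajectory points. Whether the driver uses exactly this form is not shown in this thread, so treat it as an illustration of interpolating positions and velocities between coarse MoveIt points:

```python
def interpolate(p0, v0, p1, v1, duration, t):
    """Cubic Hermite interpolation over one segment: matches position
    and velocity at both ends, so coarse points sent at a few Hz can
    be expanded into fine servo set-points on the controller."""
    s = t / duration
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return (h00 * p0 + h10 * duration * v0
            + h01 * p1 + h11 * duration * v1)

# Midpoint of a rest-to-rest segment lies halfway between the points.
print(interpolate(0.0, 0.0, 1.0, 0.0, 2.0, 1.0))
```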
@potiuk potiuk force-pushed the SAFE_TRAJECTORY_FOLLOWER branch from 4e51fb6 to 5dcef72 Compare January 11, 2018 23:00
@potiuk
Author

potiuk commented Jan 11, 2018

Thanks @axelschroth for testing! I agree that the excessive delay at launch is a disadvantage. I have not optimised for that (safety came first!) and I have not measured it so far - a slight delay at start was not a concern for us. But I will take a look at possible optimisations here; it's likely that the code can still be simplified/optimised a bit. Improving on the original TrajectoryFollower was also not a goal, and from my observations (not backed by hard data) the difference in delay vs. the original driver was not huge. But it's really valuable input from you and worth looking into.

Since you've done the test recently, maybe you could run a quick comparison with the old follower? Simply set the use_safe_trajectory_follower flag to false in your launch file and let me know the differences you observe. That would help me tremendously, as I have a feeling that both followers will behave similarly (in both reaction time and precision), and it seems you have everything in place to test it quickly. I will now be working more on our own Python/RTDE implementation of some of the UR interfaces, so it could take me some time to switch back to modern-driver testing mode. However, if you send me the exact trajectory you got (just a rostopic echo from the follow_joint_trajectory/goal topic), I could try to see what speeds etc. you got and reproduce it - but it could take a few days. Another option would be for you to run my version with debugging and send me the console output: uncompress debug.tar.gz and run 'source ./set_debug_mode.bash' before launching the driver (and send me the console output afterwards).

Let me comment on those two problems a bit more:

Upload

One way to decrease the size of the program might indeed be to switch to RTDE. The implementation should be simpler. I believe switching to RTDE would allow decreasing the size of the URScript by 20-30% (and we would have fewer threads - maybe just one). I will most likely do the Python implementation next week, so I will be able to see its effect (and I will make sure to measure the influence on "upload time"). But I don't think I will have time to reimplement the C++ ur_modern_driver - it would likely require some redesign of the driver itself and is quite a bit more involved.

Another optimisation I can make is to remove logging. I used it quite liberally for debugging, but it could easily be removed (even dynamically, before the upload). That might be a small but probably noticeable improvement (not because of execution time but because of uploading/parsing time), if uploading time is the problem. I will test it in the Python implementation first to see if it is worth it, and might port it back to the driver's code.

A third possibility I considered (but which would be more complex to implement) was to provide the URScript side as a URCaps implementation. It would be much more complex both to implement and to operate (you'd have to install it via the pendant first), but it would likely benefit from one-time parsing and storing the program in its internal representation on the control PC. For that, however, I would almost surely have no time.

Precision

I also noticed precision problems, especially when moving fast. Servoj has lookahead time/gain parameters that influence smoothness and precision a lot (less lookahead and gain -> more precision, in general). But in our case it causes the arm to stop a bit after the desired position (with lookahead it continues to run a bit longer, as it doesn't realise it should already stop). Could you tell me what values you used for those? Again - I did not (yet) plan to improve this aspect vs. the original Follower, so seeing a baseline from it would help me understand whether this is my code or whether it worked like that in the original driver too.

As position is important to us, I have already thought about how to improve it fairly easily. We could add an extra loop for several cycles at the end of the trajectory and run the servoj command several times with the same target positions until they are reached. That could solve your problem, as well as ours with high-speed "overshoot". In the overshoot case, however, it would make the robot arm go back and forth a bit at the end of the move.
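The proposed settle loop can be simulated in a few lines. The gain value and the first-order convergence model below are hypothetical - just to show that repeating the final set-point converges on the goal instead of stopping short:

```python
def settle_at_goal(current, goal, gain=0.3, tol=1e-4, max_cycles=500):
    """Keep issuing the final set-point (as repeated servoj calls
    would) until the measured position is within tolerance, instead
    of stopping on the last interpolated segment."""
    cycles = 0
    while abs(goal - current) > tol and cycles < max_cycles:
        # Model a servoj-like step: the joint closes a fraction of
        # the remaining error each control cycle (gain is made up).
        current += gain * (goal - current)
        cycles += 1
    return current, cycles

# Starting 5 mrad short of the goal, the loop settles within tol.
position, cycles = settle_at_goal(current=0.995, goal=1.0)
print(position, cycles)
```

With overshoot instead of undershoot, the same loop would produce the small back-and-forth motion mentioned above.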

I also pondered artificially slowing down the move towards the end when moving fast. It should be possible to recalculate both velocities and times in the received trajectory. I am already doing some of that in the Python code, so I may be able to tell shortly what results it brings.

Another possibility (you could easily try it now and tell me if it worked!) is to set lower time_interval/servoj_time parameters and get finer-grained interpolation. Since everything runs in a loop on the UR controller, I think it should be no problem to run servoj for 0.004 s or 0.002 s now. You simply need to change both the servoj_time and time_interval parameters in the launch file to 0.004, and precision may improve. But better be careful and gradual if you do that :) - I think too short a servoj time causes some undesirable "jerks" as well.
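As a concrete launch-file excerpt (hypothetical - only the two parameter names come from this thread; the include structure is illustrative), the change would look something like:

```xml
<!-- Hypothetical excerpt: halve the 8 ms interpolation period.
     Change both parameters together, and step down gradually. -->
<include file="$(find ur_modern_driver)/launch/ur_common.launch">
  <arg name="servoj_time"   value="0.004"/>
  <arg name="time_interval" value="0.004"/>
</include>
```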

@potiuk
Author

potiuk commented Jan 11, 2018

BTW. I rebased the driver on the latest changes for 3.5 firmware!

@gavanderhoorn
Collaborator

@potiuk: thanks for all the hard work and your evaluation of the current situation (in ros-industrial-attic#153). That is very much appreciated. Complaining is easy, but then doing something constructive is something else, so 💯 for that.

There are several people now looking at this PR, so I'll wait for them to report their findings.

One comment - which might seem minor, but I believe is important: I understand why you feel this implementation is 'safer' than the current trajectory follower (no/less chance of overshoot, no catch-ups due to VM hiccups, etc.), but the 'safe' adjective in the name of the class and the PR implies a lot of things - especially in the context of robotics and (external) motion control - and I believe that is not a good idea.

This PR introduces quite some new code, especially on the controller side. Apart from the tests you have done (and I'm not worried that you didn't do a good job), the code has not been vetted by proper review or certification processes. Even the reviews we do here are far from enough to claim it is safe. We cannot guarantee in any way we've exercised the code (any code, really) sufficiently to conclusively say things about how it will behave in every situation, nor will we do that.

Now certification and risk assessment will always be specific to a certain implementation, and of course prefixing a class and parameter name with safe does not mean that we're suddenly liable (so that is not the issue here), but I'd like to avoid any potential confusion of future users of this driver, and would thus request we come up with a new name for this.

@gavanderhoorn
Collaborator

Note that my comment goes for everything in ur_modern_driver, not just your PR of course.

It's just that your PR introduces the safe adjective in some places, and I personally would like to see that changed.

@gavanderhoorn
Collaborator

gavanderhoorn commented Jan 12, 2018

@potiuk wrote:

Servoj has lookahead time/gain parameters that influence smoothness and precision a lot (less lookahead and gain -> more precision, in general). But in our case it causes the arm to stop a bit after the desired position (with lookahead it continues to run a bit longer, as it doesn't realise it should already stop).

It would seem this could be an inherent (and possibly fundamental) problem with servoj(..).

See also ros-industrial-attic#110.

@potiuk
Author

potiuk commented Jan 13, 2018

@gavanderhoorn I perfectly understand your concerns about the name and concur with them. I have no problem whatsoever with changing the name; I just haven't figured out a good one yet :). It's quite obvious that the driver is much more suitable for research, where you make sure you are in a "controlled" environment. I don't think we'd use it for anything more than research/development anyway. I had not realised the name might carry the implications you describe.

I will change the name, parameters, etc. to something else then. Maybe LowBandwidthTrajectoryFollower would do (unless I figure out something better). It's not an ideal name, but it describes what it really is.

@potiuk potiuk changed the title from "Adds Safe Trajectory Follower implementation" to "Adds Low Bandwidth Trajectory Follower implementation" on Jan 13, 2018
@potiuk
Author

potiuk commented Jan 13, 2018

@gavanderhoorn - interesting point about the "fundamental" servoj problem. I will do some experiments shortly and see whether precision can be improved with movej (or simply with some adjustments towards the end of the trajectory). I am not that concerned about following the trajectory extremely precisely, but rather about finishing the move in precisely the location it was supposed to finish - which I think would be a nice property of the follower. I will also try to see how it behaves with movej with blend.

@j-polden

j-polden commented Jan 14, 2018

@potiuk , thanks a lot for taking the time to produce and share this work. I spent a couple of hours last week testing this PR, so I'll leave some of my observations here for those who may be interested.

I tested this work on our UR5 robot, a short summary of typical behaviour observed is shown in [1]. To stress test this new approach, I used Netem [2] to inject some randomised network delays into the system.

In its ‘standard’ format, the ur_modern_driver suffered from the previously observed problems highlighted in this thread: lagging and then a resultant catch-up, which typically caused an e-stop on the UR controller.

With the low bandwidth trajectory follower, the robot motion was observed to be far smoother. I ran some robot motion routines 20 or 30 times, and also increased the maximum latency in the network. All executed motions seemed consistent despite the different network delays injected into the system, and no motion experienced any problems (unstable motion, e-stop, etc.). I did not, however, plot and do a detailed comparison of the commanded and executed trajectory profiles. These were all just ‘eye’ tests, but to me the motions appeared functional and reliable.

I am wondering out loud whether it would eventually be appropriate to make this the default mode for the driver, so that new users do not fall into the same problems a number of us have already experienced. Please let me know if more testing is required, and I will see what I can do to help (if that is what is needed).

[1] https://www.youtube.com/watch?v=xQJV9BzUaP8&
[2] https://wiki.linuxfoundation.org/networking/netem

@potiuk
Author

potiuk commented Jan 14, 2018

@j-polden -> thanks for the testing - that concurs with my previous tests, but I had no time for such thorough network congestion testing. It's super cool you've done that :D.

I'd love to see it as the default after we get some more testing, and especially a comparison with the old driver (see @axelschroth's comments). Not my call, though :). I also have some experimentation related to the precision described above (especially stopping the move at the desired location) - maybe I can still improve it, so it might be worth iterating a bit on the code. As a result, I might have one or a few more small, incremental changes coming soon-ish (I'd love it if you could help test them then!).

@potiuk
Author

potiuk commented Jan 14, 2018

@gavanderhoorn - I renamed the follower to "LowBandwidthTrajectoryFollower". I checked that it works fine after the rename (it was quite a low-risk change). I've also renamed the parameter and pretty much every "safe" reference in the README and elsewhere. It's a separate commit (so that anyone can see the incremental changes vs. the previous commit), but before merging I think it will be best to squash all the commits into one final commit. This way no reference to "SafeTrajectoryFollower" will remain in the code/history.

The only remaining "SAFE_TRAJECTORY" reference is the branch name, which cannot be changed on GitHub without deleting the pull request, but the branch name is fairly volatile (once merged it will only remain in the pull request history). I hope this addresses the concern you raised :).

@axelschroth

axelschroth commented Jan 15, 2018

Since you've done the test recently, maybe you could run a quick comparision with the old follower?

@potiuk Sure, attached you can find the echo output of /follower_trajectory/goal topic:
follower_echo_wrist3.txt

These plots compare your implementation to the original one:
LowBandwidthTrajectoryFollower:
[image: safe_follower plot]
TrajectoryFollower:
[image: traj_follower plot]

I expected the end of the movement to be quite similar to the original implementation, but it seems to behave differently. Your approach does not overshoot, but rather stops too early, before reaching the goal position. A little delay caused by transmitting/parsing can be seen in these plots as well.

Btw: removing all debug and textmsg outputs from the URScript saved approximately 200 ms of parsing time.

@gavanderhoorn
Collaborator

@axelschroth wrote:

A little delay caused by transmitting/parsing can be seen as well in these plots.

can you quantify this a bit more? How much is a little?

You mentioned 700 ms in #9 (comment); is it 500 ms now?

With an average delay of 24ms (as described in @ThomasTimm's report), 500ms seems like a bit more than a little.

@axelschroth

@gavanderhoorn I didn't run an exact measurement of the delay; it's just an estimate based on some plots. Yes, I observed a delay of ~500 ms without debug output. But you are right - compared to the original implementation it's still a big difference.

@Zagitta
Owner

Zagitta commented Jan 17, 2018

I have no fundamental issues with this becoming the default TrajectoryFollower once it has proven its merit, as I definitely think this approach is a better way to control the robot than the existing TrajectoryFollower.
The only concern I have regarding that is it potentially having slightly different semantics than the current one, breaking stuff for people who might just mindlessly update to the newest version.

I'd rather see the old one remain the default for a while, but loudly complain about being deprecated, and after some grace period change the default. Or maybe just use semantic versioning and be done with it 😄

@potiuk
Author

potiuk commented Jan 18, 2018

Some further updates. We finally started looking at speed/reaction time, and I got to measure (and improve) it. I also observed a ~700 ms delay (so much longer than the ~250 ms observed with the old driver - which mirrors @axelschroth's observations). However, I was able to improve it fairly significantly (and easily). It is indeed a problem of the sheer size of the program to parse. When I created it, I focused more on readability and debugging, but it turns out that initial parsing is really, really slow on the UR, even for a few hundred lines. In the new Python version of the Low Bandwidth Follower I just removed all the unneeded code (including comments, textmsg's, and 'if DEBUG's). I went down from 344 to 204 lines, and (surprise, surprise!) the delay went down to around 400 ms. That was an easy win. I will soon (likely tomorrow) add another commit with those changes, after I test them in the modern driver.

Still more than the original driver, but by a much smaller margin. I am probably also overcautious with local/global variable copying, and we could likely simplify a lot of this code (and, as mentioned before, using the RTDE interface would actually remove the need for synchronisation and multiple threads).
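The stripping step described above can be sketched like this. It is a naive filter - the function name is mine, and real code would have to avoid stripping '#' or 'textmsg' occurrences inside string literals:

```python
def strip_debug(urscript):
    """Remove lines that only help debugging (comments and textmsg()
    calls) before uploading, to cut the controller's parsing time."""
    kept = []
    for line in urscript.splitlines():
        s = line.strip()
        if not s or s.startswith("#") or s.startswith("textmsg("):
            continue
        kept.append(line)
    return "\n".join(kept)

prog = 'def prog():\n  # debug aid\n  textmsg("hello")\n  do_servoj()\nend'
print(strip_debug(prog))  # only the functional lines remain
```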

As a side note - we already finished implementing and testing (today) RTDE interfaces for hand-shaking (for somewhat simpler control), and it will likely result in an open-source, simplified Python implementation of a subset of the modern driver functionality, as mentioned elsewhere. We already know we can do that fairly easily, but again, doing it in the modern driver would need some fundamental changes. We also decided to implement it a bit differently in Python - we upload a program first in our action server and then send it instructions over RTDE - which means that we do not have to upload the program every time we make a move (there might be any number of moves). The benefit is that there is no per-move program upload overhead; the drawback is that you cannot use the pendant while your sequence of moves executes (a low price to pay for the overall speed improvements we observed). Maybe that will be an inspiration for a future version of the modern driver.

Another comment - we are also going to try the movej approach. As it turns out, in our case we want to move through the positions provided by the trajectories, but we don't care about the speeds generated by MoveIt (especially since, as @gavanderhoorn pointed out, the velocities generated by MoveIt are quite far from perfect or even good - moveit/moveit#416). We basically want to move as fast as possible while closely (but not perfectly) following the generated trajectory. For this we likely don't even need interpolation between trajectory points, nor 0.008 s control of the UR - instead we can simply use movej with blend for every coarse segment of the trajectory and use only the positions generated by MoveIt (no velocities/accelerations). It seems this might yield smooth, fast moves reasonably close to the planned trajectory. Of course it won't be perfect for some uses (like trying to synchronise several arms/groups) - so I might actually come up with a third follower to choose from ;).

@potiuk potiuk force-pushed the SAFE_TRAJECTORY_FOLLOWER branch from 73e5e8e to 924c371 Compare March 24, 2018 12:15
@psajecki

psajecki commented Mar 27, 2018

@potiuk , @Zagitta , sorry my testing is a day late. The low bandwidth trajectory follower has not been working well for me today, unfortunately. It seems to work when the speed is within a certain range, but when it is too fast or too slow it fails. The error I keep getting is:
[ERROR] [1522161481.964797186]: Failed to accept incoming robot connection
[ERROR] [1522161481.964823349]: Failed to start trajectory follower!
At 25% speed it seems fine.
At 50% speed it stops mid-movement and errors out with the above.
At 0.06 - 0.10% speed, it also stops mid-movement with the error above.

After this error, it attempts to do the next motion in the program and fails
(or possibly it attempts to do the next motion early because the move command terminates before the move is actually done and silently errors)
On Friday it was doing the slow movements (0.06-0.10%) fine. I'm not sure what changed today.
This may be my first time trying at 50% speed, so it's the first time I've seen an issue at this speed.
I've set the following parameters to make sure it's not related to motion timeouts:
trajectory_execution/allowed_execution_duration_scaling=1.5
trajectory_execution/allowed_goal_duration_margin=2
trajectory_execution/execution_duration_monitoring=false

With use_lowbandwidth_trajectory_follower set to false, everything works as expected.

@psajecki

@potiuk , @Zagitta , I've tested the new pull in ROS control mode. It seems to work as well as before. The velocity-based position controller does not seem to work for me, but the position-based one works fine. May I suggest making position-based control the default for ROS control? I'm hoping the runaway base-axis moves on startup are related to the velocity controller; if it defaults to the position-based one, that would solve this potentially dangerous problem.
In my testing today I did not see the runaway base axis movement, but it does occur fairly infrequently so I may have just been lucky.

@potiuk
Author

potiuk commented Mar 27, 2018

@psajecki -> Is it possible to get more complete logs when you run LBTF? Both from ROS and maybe from the robot? There is a URControl.log file when you log in to the robot, I believe in the home directory of the root user. It looks like for some reason the connection back from the robot cannot be established, which could be because of urscript compilation problems (but that part has not changed for months now :(). I'd really love to understand what's going on.

@psajecki

psajecki commented Mar 27, 2018

@potiuk , Can you please try writing a simple program that runs linear movements (i.e. use the eef_step parameter) between two points using 0.6%, 25%, and 50% velocity and acceleration scaling in a loop? Set velocity and acceleration scaling equal. Use an eef step of 0.01. I think it will be more time efficient for both of us if you can reproduce the issue on your robot, and I think this will do it.

Below is a code snippet of how we move the robot. You can see we are doing it in a form of an action server, so just replace goal->poses with hard-coded poses and remove setting the result in the error handling.

    moveit::planning_interface::MoveGroupInterface move_group(move_group_name);

    std::vector<geometry_msgs::Pose> waypoint_poses(goal->poses);

    moveit_msgs::RobotTrajectory trajectory_msg;
    double ratioCompleted = move_group.computeCartesianPath(
        waypoint_poses,
        eef_step,
        jump_threshold,
        trajectory_msg);

    // computeCartesianPath returns the fraction of the path achieved;
    // compare with <, not floating-point equality against 1.0
    if (ratioCompleted < 1.0)
    {
      result.status_code.val = moveit_msgs::MoveItErrorCodes::PLANNING_FAILED;
      goal_handle.setAborted(result);
      return;
    }

    robot_trajectory::RobotTrajectory trajectory(
        move_group.getCurrentState()->getRobotModel(),
        move_group.getName());

    trajectory.setRobotTrajectoryMsg(
        *move_group.getCurrentState(),
        trajectory_msg);

    trajectory_processing::IterativeParabolicTimeParameterization iptp;

    bool success = iptp.computeTimeStamps(
        trajectory,
        max_velocity_scaling,
        max_acceleration_scaling);
    if (!success)
    {
      result.status_code.val = moveit_msgs::MoveItErrorCodes::PLANNING_FAILED;
      goal_handle.setAborted(result);
      return;
    }

    trajectory.getRobotTrajectoryMsg(trajectory_msg);
    moveit::planning_interface::MoveGroupInterface::Plan plan;
    plan.trajectory_ = trajectory_msg;
    moveit::planning_interface::MoveItErrorCode error_code = move_group.execute(plan);


@potiuk
Author

potiuk commented Mar 29, 2018

@psajecki - I can try, but not very soon. This time I might be the bottleneck, because we are entering a pretty busy period in our project, so it can take quite some time until a) I get hold of the robot, b) I find time to test it.

@psajecki

@potiuk believe me I can relate! I'll keep monitoring this thread for updates.

@SrdjanStevanetic

SrdjanStevanetic commented Apr 25, 2018

Hi all, first of all thanks @potiuk and others for the new driver. We use it to control a Universal Robots UR3. We have done some tests and the behaviour is sometimes good, sometimes problematic. The setup we use is the following: 2 robots (KUKA and UR3) together with some other IoT devices like a conveyor, a tablet, etc. We have a Windows PC with a VMware server hosting the VMs (for controlling the KUKA and UR3). That Windows PC is connected over a gigabit router and then a switch to the real robot controllers from KUKA and UR. We use the low bandwidth trajectory follower to control the robot.
The following parameters are used for trajectory planning:

and the joint_limits set for all joints are:
joint_limits:
  shoulder_pan_joint:
    has_velocity_limits: true
    max_velocity: 0.8
    has_acceleration_limits: true
    max_acceleration: 0.8
What we observed is that sometimes the connection between the ur driver and the real robot controller breaks. In particular, the connection on port 30003 for receiving the joint_states simply breaks down and no more joint_states updates come from the controller to the driver. It did not happen on port 30002, however, which kept sending data. The connection on 30003 broke down even when the UR3 was not in action and had not received any command to execute. When I kill the ur driver and start it again, the joint_states start to come again and the communication works fine. I have checked the .log files both on the ROS side and on the robot controller and I do not see any errors. I have attached the files here. Could you please have a look?
Many thanks in advance.
log_history.txt
ros_network_problems_the_best_example_1.log

@SrdjanStevanetic

Hi all,
I have investigated the problem further and observed the following. If I stress the network, for example using the iperf tool, the ur_driver works fine at the beginning, and some latency in the moves can be seen in RViz, which is expected since there is packet loss in the communication and some of the robot joint_states get lost. After some time, the RViz representation does not move at all, even if I move the robot manually using the pendant. I checked the joint_states topic and there are no published joint_states; it looks like it is frozen. I also ran wireshark with the filter ip.src==robot_ip and tcp.port==30003, and no messages are displayed, which means the communication on port 30003 is lost. In the ROS logs I do not see any errors, nor the messages showing that the producer or pipeline ended or that any disconnection happened. I have attached the ROS log file. There are some additional log messages that I introduced for testing purposes, and I have deleted some messages because the file was too big.
rosout.log

Many thanks.

Regards, Dr. Srdjan Stevanetic

@gavanderhoorn
Collaborator

gavanderhoorn commented May 8, 2018

@SrdjanStevanetic wrote:

In the ros logs I do not see any errors, also not the messages showing that the producer or pipeline ended or any disconnection happened. I have attached the ros log file. There are some more log messages that i introduced for the testing purposes. I have also deleted some of the messages because the file is too big.

If you're 'stressing the network', I don't expect specific errors to pop up in the ROS logs, unless the driver node is able to detect problems with the connection(s).

Most of the traffic will be between the driver node and the UR controller, and the way you describe your setup, I doubt there is much ROS traffic leaving the VM (or even the lo interface).

As to the problem of seemingly lost connections: not sure how to approach this. One experiment I can think of: try running nc ip.of.ur3.controller 30002 > /dev/null and nc ip.of.ur3.controller 30003 > /dev/null. Don't start any driver(s). Now repeat what you did with iperf and check with Wireshark. If that shows the same behaviour (ie: traffic stops), then the driver is not involved. If that keeps working, there would seem to be something in what the driver does to the TCP connection that makes it lose connection (or at least, makes the traffic come to a halt).

It would be good to also take a look at TCP window sizes: do you see those increasing or decreasing significantly for the connection to 30003?

@SrdjanStevanetic

SrdjanStevanetic commented May 8, 2018

@gavanderhoorn Thanks for the feedback. Yes, there is a lot of traffic outside of the VMs, e.g. between the VMs and the robots. I have run the analysis with the nc tool as you proposed and observed the same problem: after some time the TCP connection to the robot freezes and no more messages arrive. It happened for both the 30003 and 30002 ports, but the 30003 port froze faster. Regarding the TCP window size, I observed the window size of the connection to port 30003 on the VM that communicates with the robot and saw that it remains the same; just for the period when the packets start coming only periodically (because of the network stress the packets come all the time at the beginning, then there are some breaks, and at the end no more packets come) no window size is captured. So the plot of the window size looks like a dashed line. I created it using Statistics -> TCP Stream Graph -> Window Scaling Graph in wireshark.
So it seems that the problem is on the robot controller side?
So it seems that the problem is on the robot controller side?

@gavanderhoorn
Collaborator

@SrdjanStevanetic wrote:

I have run the analysis with nc tool as you proposed and observed the same problem. After some time the tcp connection to the robot freezes and there are no messages any more. It happened for both 30003 and 30002 ports but the 30003 port get frozen faster.

That would seem to suggest it's not something the driver does (or fails to do), but a problem somewhere else.

If you can repeat the nc test without a VM (so a native Linux machine + your robot controller) that would take out that variable. UR have done some strange things with the TCP implementation on the controller before (although it's just Gentoo), so it could be that the controller just can't cope with a high volume of traffic on the network segment it is connected to.

I would have two additional comments:

  1. it's generally advisable / preferable to have robots and their external control PCs on isolated segments
  2. do you really have such high volume traffic on your network, or are we talking about KB/s here?

@SrdjanStevanetic

Unfortunately we cannot remove the VMs from our setup, but I can try the test without a VM, although I think that is not the problem here. We do have a high traffic volume (~10 MB/s = 80 Mbps) because we use a camera for scanning the scene.

@gavanderhoorn
Collaborator

I'm not suggesting you remove the VMs, I just suggest you test without a VM. If in that case the traffic does not stop, it would suggest something in the VM is the cause of your problems.

@simonschmeisser

We (or rather our customer, unfortunately ...) noticed another problem. After a certain number of moves (ca. 600 to 1000) it stops working with "accept() on socket failed with error (Too many open files)".

Running
watch -d 'lsof | grep ur_driver | grep TCP | wc -l'
we see an increase by 11 for every trajectory sent, i.e. it looks like the callback socket that does the actual communication is not disposed/closed/reused properly.
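As an aside, the same leak can be watched from inside a test script without lsof by counting the entries in /proc/&lt;pid&gt;/fd. A minimal sketch (a hypothetical helper, Linux-only since it relies on procfs):

```python
# Hypothetical helper to watch a descriptor leak without lsof: count the
# entries in /proc/<pid>/fd. Linux-only (relies on procfs).
import os

def open_fd_count(pid=None):
    pid = os.getpid() if pid is None else pid
    return len(os.listdir("/proc/%d/fd" % pid))
```

Sampling this before and after each trajectory should show the same +11 growth per trajectory as the watch/lsof command above.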

@gavanderhoorn
Collaborator

Hm. Is that with the changes in this PR, or in the base fork by @Zagitta?

@simonschmeisser

simonschmeisser commented May 8, 2018

it's in this one, the low bandwidth trajectory follower; it stops once the above command reaches 11128, which is about 1012 trajectories

@SrdjanStevanetic

SrdjanStevanetic commented May 14, 2018

Regarding the previously mentioned connection problems when the network is heavily stressed, I have implemented a solution using the select() function with a timeout, which indicates when there is something to be read from the 30002 and 30003 interfaces; after that, the call to recv() is made as before. The read() function in the tcp_socket.cpp file now looks like this:

bool TCPSocket::read(uint8_t *buf, size_t buf_len, size_t &read)
{
  read = 0;

  if (state_ != SocketState::Connected)
    return false;

  struct timeval timeout;
  fd_set rfds;
  FD_ZERO(&rfds);
  FD_SET(socket_fd_, &rfds);
  timeout.tv_sec = 10;
  timeout.tv_usec = 0;

  int recVal = ::select(socket_fd_ + 1, &rfds, nullptr, nullptr, &timeout);
  ssize_t res = -1;
  switch (recVal)
  {
    case 0:
      // Timeout expired
      LOG_ERROR("TCPSocket select(): Timeout expired!");
      return false;
    case -1:
      // Error
      LOG_ERROR("TCPSocket select(): Error value returned!");
      return false;
    default:
      res = ::recv(socket_fd_, buf, buf_len, 0);
  }

  if (res == 0)
  {
    state_ = SocketState::Disconnected;
    return false;
  }
  else if (res < 0)
  {
    LOG_ERROR("TCPSocket recv(): NOTHING received (response<0)!");
    return false;
  }

  read = static_cast<size_t>(res);
  return true;
}
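For illustration, the same guarded-read pattern in Python (a sketch, not driver code): select() reports whether recv() would return without blocking, so a stalled connection becomes a detectable timeout instead of an indefinite hang.

```python
# Sketch of the guarded-read pattern from the C++ above: select() tells us
# whether recv() would block, turning a dead connection into a detectable
# timeout instead of a recv() call that hangs forever.
import select

def read_with_timeout(sock, n, timeout_s=10.0):
    """Return received bytes, or None on timeout or peer close."""
    ready, _, _ = select.select([sock], [], [], timeout_s)
    if not ready:
        return None  # timeout expired: the caller can reconnect
    data = sock.recv(n)
    if not data:
        return None  # peer closed the connection
    return data
```

A caller that gets None can then run the same disconnect/reconnect logic as the producer change below.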

To enable the driver to reconnect to the 30002 and 30003 interfaces I have also minimally changed the tryGet() function in producer.h, which now looks like this:

bool tryGet(std::vector<unique_ptr<T>>& products)
{
  // 4KB should be enough to hold any packet received from UR
  uint8_t buf[4096];
  size_t read = 0;

  // exponential backoff reconnects
  while (true)
  {
    if (stream_.read(buf, sizeof(buf), read))
    {
      // reset sleep amount
      timeout_ = std::chrono::seconds(1);
      break;
    }

    if (stream_.closed())
      return false;

    LOG_WARN("Failed to read from stream, reconnecting in %ld seconds...", timeout_.count());
    std::this_thread::sleep_for(timeout_);

    stream_.disconnect();

    if (stream_.connect())
      continue;

    auto next = timeout_ * 2;
    if (next <= std::chrono::seconds(120))
      timeout_ = next;
  }

  BinParser bp(buf, read);
  return parser_.parse(bp, products);
}

Only the stream_.disconnect() call is added. Now the driver reconnects automatically to the given interfaces when the network load decreases and continues its normal operation. Without these changes the driver waits forever in a blocking recv() call in the TCPSocket::read(...) function and needs to be restarted in order to work properly.
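The reconnect schedule in the snippet above, isolated as a sketch: the wait doubles after each failed attempt and stops doubling once the next value would exceed 120 s (so with these constants the waits settle at 64 s).

```python
# Sketch of the backoff schedule from the C++ snippet above: start at 1 s,
# double after each failed reconnect, and stop doubling once 2 * t would
# exceed the limit (matching `if (next <= 120s) timeout_ = next;`).
def backoff_schedule(first=1, limit=120):
    t = first
    while True:
        yield t
        if t * 2 <= limit:
            t *= 2
```

A successful read resets the schedule to 1 s, just as the C++ resets timeout_ to one second.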

@axelschroth

Regarding the sockets issue @simonschmeisser mentioned, we noticed that the socket was not closed properly. Instead of only using shutdown() we should really close sockets; see this commit.

First tests using simple movements were successful without increasing number of socket descriptors. I am currently doing some further tests. Any concerns about using close() instead of shutdown() here?
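For what it's worth, the distinction is easy to demonstrate locally (standard socket semantics, not code from the driver): shutdown() only terminates communication on the connection, while the file descriptor itself stays allocated, and counts against the open-files limit, until close().

```python
# Demonstration of shutdown() vs close() semantics: shutdown() signals EOF
# to the peer but keeps the descriptor allocated; only close() releases it.
import socket

a, b = socket.socketpair()
a.shutdown(socket.SHUT_RDWR)   # the peer sees EOF...
assert a.fileno() != -1        # ...but the fd is still open (still "leaks")
a.close()                      # this is what actually releases it
assert a.fileno() == -1
b.close()
```

This is consistent with the lsof observation above: descriptors that were only shut down, never closed, keep accumulating.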

@gavanderhoorn
Collaborator

As an update: as part of the repository maintenance activities that will be initiated next month (September), this PR will be merged into this fork of ur_modern_driver.

If anyone has any serious objections to this, now would be the time to make them heard.

We will not request additional changes from the submitter (@potiuk), so enhancements or small fixups will be done after the initial merge.

@gavanderhoorn gavanderhoorn dismissed Zagitta’s stale review September 27, 2018 13:41

Outstanding issues will be dealt with in future PRs

@gavanderhoorn gavanderhoorn merged commit 50bd24d into Zagitta:master Sep 27, 2018
@gavanderhoorn
Collaborator

Thanks @potiuk for all the work. 👍 💯 🎆

Please don't take the long time to merge as underappreciation of your efforts.

Good luck with your company and perhaps talk to you later.

@potiuk
Author

potiuk commented Oct 2, 2018

You're welcome. I no longer work with UR (moved back to more cloud and open-source Apache projects) but I am happy I could help with the driver :).
