Watchdog triggered using rtt_rosnode with xenomai

Hi Orocos-users,

I have the following problem using rtt_rosnode running under Xenomai
2.5.5.2 and ask for your help. It's an AMD Geode system with 500 MHz.

A few seconds after starting the real-time application, all Xenomai
threads seem to be frozen and the system log (dmesg) shows the following
messages:

[ 52.424032] Xenomai: watchdog triggered -- signaling runaway thread
'RosPublishActivity'
[ 52.424032] Xenomai: watchdog triggered -- killing runaway thread
'RosPublishActivity'

Restarting the real-time application alone does not help; you have to
restart the system after the watchdog has fired.
Disabling the watchdog or increasing the watchdog timeout freezes the
system after a while.
The orocos.log file does not contain any errors or unexpected messages.

The application is running a PeriodicActivity with a frequency of 50Hz and
11 ports that publish data in each updateHook(). The ROS master and
other nodes are running on a remote system connected via Ethernet, but
not all of the publishers have subscribers. Is it possible that this
overloads the Geode system, causing the RosPublishActivity::loop()
function to never return, or might there be a more general problem with
rtt_rosnode and Xenomai? Does anybody use that combination?

Why does Xenomai bother with RosPublishActivity at all? Shouldn't that
activity run with scheduler ORO_SCHED_OTHER?

Thanks for some useful hints,
Johannes

Watchdog triggered using rtt_rosnode with xenomai

Hi Johannes,

On Thu, Feb 23, 2012 at 9:43 PM, Johannes Meyer
<meyer [..] ...> wrote:

> Hi Orocos-users,
>
> I have the following problem using rtt_rosnode running under Xenomai
> 2.5.5.2 and ask for your help. It's an AMD Geode system with 500 MHz.
>
> A few seconds after starting the real-time application, all Xenomai
> threads seem to be frozen and the system log (dmesg) shows the following
> messages:
>
> [ 52.424032] Xenomai: watchdog triggered -- signaling runaway thread
> 'RosPublishActivity'
> [ 52.424032] Xenomai: watchdog triggered -- killing runaway thread
> 'RosPublishActivity'
>
> Restarting the real-time application alone does not help, you have to
> restart the system after the watchdog has fired.
> Disabling the watchdog or increasing the watchdog timeout freezes the
> system after a while.
> The orocos.log file does not contain any errors or unexpected messages.
>

Orocos cannot detect this incident; it is indeed the OS that needs to handle
runaway threads.

>
> The application is running a PeriodActivity with a frequency of 50Hz and
>

Please, please, don't use PeriodicActivity unless you know very well what
you're doing! (On the other hand, this is not causing your issue...)

> 11 ports that publish data in each updateHook(). The ROS master and
> other nodes are running on a remote system connected via Ethernet, but
> not all of the publishers have subscribers. Is it possible, that this
> overcharges the Geode system causing the RosPublishActivity::loop()
> function to never return or might there be a more general problem with
> rtt_rosnode and Xenomai? Does anybody use that combination?
>

We've been playing with Xenomai + rtt_rosnode on a powerful system, and
did not have this issue. We couldn't get the CPU affinity to work, but that
was about all. All the symptoms you describe indeed point to loop()
being trigger()'ed all the time, using all CPU resources.

>
> Why does Xenomai bother with RosPublishActivity at all? Shouldn't that
> activity run with scheduler ORO_SCHED_OTHER?
>

It is run in that scheduler (could you check in /proc/xenomai/sched?),
but we hold a Xenomai mutex in loop(), which causes that thread to
switch to primary mode. We could try to move the publish() calls out of
the mutex lock and see if this improves things. For example, copy the map
while holding the mutex, then unlock, and then iterate over the copy.
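
A minimal sketch of the "copy under the lock, publish outside it" idea, for
illustration only. It is not the rtt_rosnode code: FakePublisher and
PublishLoop are hypothetical names, and std::mutex stands in for the
Xenomai-backed RTT mutex.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for a ROS publisher channel (not an rtt_rosnode class).
struct FakePublisher {
    int published = 0;
    void publish() { ++published; }  // in reality this may take a long time
};

// Hypothetical publish loop that keeps the mutex hold time short.
class PublishLoop {
public:
    void add(const std::string& name, FakePublisher* pub) {
        std::lock_guard<std::mutex> guard(mutex_);
        publishers_[name] = pub;
    }

    // One iteration: snapshot the map under the lock, publish without it.
    void step() {
        std::vector<FakePublisher*> snapshot;
        {
            std::lock_guard<std::mutex> guard(mutex_);
            snapshot.reserve(publishers_.size());
            for (const auto& entry : publishers_) {
                snapshot.push_back(entry.second);
            }
        }  // mutex released here, before any publish() call
        for (FakePublisher* pub : snapshot) {
            pub->publish();
        }
    }

private:
    std::mutex mutex_;
    std::map<std::string, FakePublisher*> publishers_;
};
```

The snapshot holds raw pointers, so this sketch assumes a publisher is not
destroyed while step() is running; a real implementation would need to
handle removal explicitly.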

Does the system work 'flawlessly' under gnulinux? Or does it have 100%
CPU usage for some threads there too? 50Hz looks very reasonable to me though...

Peter

Watchdog triggered using rtt_rosnode with xenomai

Hi Peter,

On 23.02.2012 22:32, Peter Soetens wrote:
> Hi Johannes,
>
> Please, please, don't use PeriodicActivity unless you know very well what
> you're doing ! (on the other hand, this is not causing your issue...)
Yes, I was wrong on that point. It is in fact a periodic RTT::Activity,
since we use RTT 2.0.

> Does the system work 'flawlessly' under gnulinux ? Or does it have 100%
> CPU usage too for some threads ? 50Hz looks very reasonable to me
> though...
I have not tried with gnulinux so far, as our hardware interface
TaskContext requires direct access to hardware interrupts, which can be
done in user space using Xenomai. With gnulinux, we would need a kernel
driver. I will try with a fake driver and the gnulinux target within the
next few days.

The real-time process also involves a Kalman filter prediction and
update step with 13 state variables, so CPU usage is somewhere between
80% and 100% as far as I remember. But even when the real-time
task uses 100%, I would not expect the RosPublishActivity to bring the
whole system down, but instead just to miss updates or starve.

Do you think it is worth checking ros_pub.getNumSubscribers() in
RosPubChannelElement::signal() to avoid triggering the
RosPublishActivity when there are no subscribers? Or would this break
real-time behavior? Some boost::recursive_mutex locks are involved in this
call. On the other hand, the ros_pub.publish() call should be cheap if
there are no subscribers, as roscpp checks that by itself before
serialization.

By the way, what is the "best practice" for filling ROS message headers with
a valid ros::Time stamp without calling ros::Time::now() in a
real-time context?

Johannes

Watchdog triggered using rtt_rosnode with xenomai

Hi Johannes,

On Fri, Feb 24, 2012 at 12:17 AM, Johannes Meyer
<meyer [..] ...> wrote:

> Hi Peter,
>
> I did not try with gnulinux so far, as our hardware interface TaskContext
> requires direct access to hardware interrupts, which can be done in
> user-space using Xenomai. With gnulinux, we would need a kernel driver. I
> will try with a fake driver and gnulinux target within the next days.
>
> The real-time process also involves a Kalman filter prediction and update
> step with 13 state variables, so CPU usage is somewhere between 80% to 100%
> percent as far as I remember. But even when the real-time task uses 100% I
> would not expect the RosPubActivity to bring the whole system down, but
> instead just miss updates or starve.
>

Yes, that would be the 'ideal' graceful degradation...

>
> Do you think it is worth to check ros_pub.getNumSubscribers() in
> RosPubChannelElement::signal() to avoid triggering the RosPublishActivity
> when there are no subscribers? Or would this break realtimeness? There are
> some boost::recursive_mutex involved in this call. On the other hand, the
> ros_pub.publish() call should be cheap if there are no subscribers as
> roscpp checks that by itself before serialization.
>

Don't try to outsmart these functions; they are optimized already. You
won't win anything. RTT is already dropping samples in case an overrun
occurs. ROS comm is cheap when nothing is to be done. Are you sure that
everything is compiled with -O2 or -O3? How big are these messages anyway?

>
> By the way, what is the "best practice" to fill ROS message headers with a
> valid ros::Time stamp while avoiding to call ros::Time::now() in real-time
> context?

Xenomai 2.6 has a feature to read the system time in hard real-time
contexts. When using 2.5, you'll have to implement your own syncing
mechanism, which is not trivial. We were considering putting such a
mechanism into RTT, but with Xenomai 2.6, there is an alternative...

Peter

Watchdog triggered using rtt_rosnode with xenomai

Hi Peter,

On 24.02.2012 10:13, Peter Soetens wrote:
> Hi Johannes,
>
> On Fri, Feb 24, 2012 at 12:17 AM, Johannes Meyer
> <meyer [..] ...> wrote:
>
> Hi Peter,
>
> I did not try with gnulinux so far, as our hardware interface
> TaskContext requires direct access to hardware interrupts, which
> can be done in user-space using Xenomai. With gnulinux, we would
> need a kernel driver. I will try with a fake driver and gnulinux
> target within the next days.
>

Just a short update on this issue: With the gnulinux target and a fake
driver, the control loop and the RosPublishActivity worked without
problems at about 40% CPU usage. The fake driver just writes constant
sensor data periodically, so the behavior might differ from the case
where Xenomai interrupts are involved.

With Xenomai, the proposed method of copying the map of publishers, then
releasing the mutex, and then calling the individual publishers in
RosPublishActivity::loop() seems to be a working solution. The watchdog
is not triggered anymore. Are there any side effects, e.g. can publish
requests get lost after the mutex is released but before the
function returns? In the current version, the binary flag set during the
requestPublish() call is never reset, which is probably not what the
author intended.
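
One possible way to reset the flag without losing requests, sketched under
assumptions: PendingChannel is a hypothetical class, not the rtt_rosnode
implementation, and it assumes the writer also triggers the publish activity
after calling requestPublish(). The flag is cleared with an atomic exchange
before publishing, so a request that arrives during the publish leaves the
flag set for the next iteration.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical channel with a "publish pending" flag (illustration only).
class PendingChannel {
public:
    // Called from the writer side; lock-free.
    void requestPublish() {
        pending_.store(true, std::memory_order_release);
    }

    // Called from the publish loop. Returns true if a publish was performed.
    bool publishIfPending() {
        // exchange() reads and clears the flag in one atomic step, so a
        // request arriving after this point sets the flag again and is
        // picked up by the next call.
        if (!pending_.exchange(false, std::memory_order_acq_rel)) {
            return false;
        }
        ++published_;  // stands in for the real publish() call
        return true;
    }

    int published() const { return published_; }

private:
    std::atomic<bool> pending_{false};
    int published_ = 0;  // only touched by the publish loop thread
};
```

With this scheme, several requests made before one loop iteration collapse
into a single publish, and a request that races with the publish causes at
worst one extra publish rather than a lost one.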

> Xenomai 2.6 has a feature to read the system time in hard realtime
> contexts. When using 2.5, you'll have to implement your own syncing
> mechanism, which is not trivial. We were pondering about it to put
> such a mechanism in RTT, but with Xenomai 2.6, there is an alternative...
>
I also installed Xenomai 2.6 to make use of the new
CLOCK_HOST_REALTIME feature, but the watchdog issue still remains unless
RosPublishActivity is modified.
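
For reference, a rough sketch of reading that clock and splitting it into a
seconds/nanoseconds pair as used by a ROS header stamp. It assumes the
Xenomai 2.6 POSIX skin headers expose CLOCK_HOST_REALTIME as a macro usable
with clock_gettime(); when the macro is absent (e.g. on plain Linux) the
sketch falls back to CLOCK_REALTIME. Stamp, toStamp and readWallClock are
hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>

// Hypothetical plain struct mirroring the sec/nsec pair of a ROS time stamp.
struct Stamp {
    std::uint32_t sec;
    std::uint32_t nsec;
};

// Convert a POSIX timespec into the sec/nsec pair.
inline Stamp toStamp(const timespec& ts) {
    return Stamp{static_cast<std::uint32_t>(ts.tv_sec),
                 static_cast<std::uint32_t>(ts.tv_nsec)};
}

// Read the wall-clock time. Returns false if the clock cannot be read.
inline bool readWallClock(Stamp& out) {
#ifdef CLOCK_HOST_REALTIME
    const clockid_t clock = CLOCK_HOST_REALTIME;  // assumed: Xenomai 2.6 headers
#else
    const clockid_t clock = CLOCK_REALTIME;       // fallback, not hard real-time safe
#endif
    timespec ts;
    if (clock_gettime(clock, &ts) != 0) {
        return false;
    }
    out = toStamp(ts);
    return true;
}
```

The resulting pair could then be copied into the message header in
updateHook() instead of calling ros::Time::now() there.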

> Peter
>

Thanks for your help,
Johannes