We are using two Ensenso N46 cameras in a real-time multi-camera setup together with other cameras. Each camera has its own camera object so that we can activate and deactivate individual cameras dynamically. However, we’ve noticed that the Ensenso SDK (NxLib) sometimes hangs on the following command:
NxLibCommand computeDisparityMap(cmdComputeDisparityMap);
computeDisparityMap.parameters()[itmCamera] = cameraSerial;
computeDisparityMap.execute(); // ← Suddenly blocks forever after some time, e.g. after 5 minutes
As soon as this happens with the first camera, the processing thread of the second camera object also gets stuck at the same point (on the very next frame). From my point of view, this suggests that the NxLib is no longer functioning correctly internally, possibly because I am using it incorrectly?
The capturing thread per camera works basically as follows:
And the processing thread per camera works basically as follows:
processThread = std::thread([this](){
    while(shouldRun){
        imageAvailable.acquire(); // <- Semaphore IA (waiting)
        if(!shouldRun)
            break;
        NxLibCommand computeDisparityMap(cmdComputeDisparityMap);
        computeDisparityMap.parameters()[itmCamera] = cameraSerial;
        computeDisparityMap.execute();
        // ^^ HERE, after some minutes of working fine, this call
        // seems to never return and blocks forever
        NxLibCommand computePointMap(cmdComputePointMap);
        computePointMap.parameters()[itmCamera] = cameraSerial;
        computePointMap.execute();
        // Some processing done by us to copy the depth image
        // into our internal data structures for our own processing.
        // ...
        imageProcessed.release(); // Semaphore IP (releasing)
    }
});
Do you have any idea what could be causing this? Is the way we’re using the SDK generally correct? Or is there perhaps a real-time multithreading example similar to the one used internally by NxView?
Please note that we simplified the code above for better readability and did not include all of our debug output and error handling. However, to avoid any misunderstandings, here’s how we actually handle this in the processing thread.
tryCatchNxLib([this, &lastTime](){
    // Debug output A
    // ...
    NxLibCommand computeDisparityMap(cmdComputeDisparityMap);
    computeDisparityMap.parameters()[itmCamera] = cameraSerial;
    computeDisparityMap.execute();
    // ^^ HERE, after some minutes of working fine, this call
    // seems to never return and blocks forever
    // Debug Output B
    // ...
    NxLibCommand computePointMap(cmdComputePointMap);
    computePointMap.parameters()[itmCamera] = cameraSerial;
    computePointMap.execute();
    // Some processing done by us to copy the depth image
    // into our internal data structures for our own processing.
    // ...
});
// Debug Output C
// ...
imageProcessed.release();
The last debug output we get in that situation is Debug Output A. If an exception were thrown, the imageProcessed semaphore would still be released and we would also get Debug Output C. So the thread seems to block in the ComputeDisparityMap command.
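The argument above depends on the semaphore being released on every exit path. As a minimal sketch of how that guarantee could be made explicit, the following uses a scope guard. `ReleaseOnExit` and `processOneFrame` are hypothetical names, and plain callbacks stand in for the NxLib commands and for `imageProcessed.release()`, so nothing here is NxLib API.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical helper: runs the given callback when the scope is left,
// whether by normal return or because an exception propagates.
class ReleaseOnExit {
public:
    explicit ReleaseOnExit(std::function<void()> onExit)
        : onExit_(std::move(onExit)) {
        assert(onExit_ && "a release callback is required");
    }
    ~ReleaseOnExit() { onExit_(); }
    ReleaseOnExit(const ReleaseOnExit&) = delete;
    ReleaseOnExit& operator=(const ReleaseOnExit&) = delete;

private:
    std::function<void()> onExit_;
};

// Hypothetical stand-in for one iteration of the processing loop.
// `work` represents the NxLib commands, `release` represents
// imageProcessed.release(). Returns true if `work` did not throw.
bool processOneFrame(const std::function<void()>& work,
                     const std::function<void()>& release) {
    ReleaseOnExit guard(release);  // release runs on every exit path
    try {
        work();
        return true;
    } catch (...) {
        // Real code would inspect and log the exception here.
        return false;
    }
}
```

A guard like this only covers returns and exceptions. If `work` never returns, neither the release nor any later debug output happens, which matches what we observe.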
Could you record a log file of your running application that contains the blockage? You can follow this guide on how to record a log file. If the log file is too large to upload here, I can provide you with a file sharing link.
Thank you for the log file. It looks like there is a bug in our CUDA implementation of the ComputeDisparityMap command. Could you try disabling CUDA for now and see if the blockage still occurs? Alternatively, you can try to execute the ComputeDisparityMap commands in one thread or enable CUDA for only one of the two commands.
We are investigating the problem and tracking it under issue #6089.
I started the program once with NxLibItem(itmCUDA)[itmEnabled].set(false); — it has now been running for an hour without any crashes or freezes on the first attempt (normally, freezes would have occurred earlier).
As expected with CUDA disabled, the overall camera frame rate is lower, and it seems that one of the cameras occasionally stalls for a short period (several hundred milliseconds up to about a second). There are also occasional frame drops on the third camera, probably because I haven’t configured the number of CPU threads used by the Ensenso SDK yet, which might be causing some interference.
If the issue is related to the synchronization of multiple CUDA kernel launches, I might try implementing a shared processing thread for the Ensenso cameras and adjust the synchronization so that the program continues running even if individual cameras are deactivated or disconnected at runtime. I’ll probably give that a try next week.
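A shared processing thread of that kind could look roughly like the sketch below. `SerialWorker` is a hypothetical name and the jobs are plain callables. In the real application each job would wrap the ComputeDisparityMap and ComputePointMap commands for one camera serial, which is an assumption about how the pieces would be wired together.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical sketch: one worker thread that runs submitted jobs strictly
// one after another, so jobs from different cameras never overlap.
class SerialWorker {
public:
    SerialWorker() : worker_([this] { run(); }) {}

    ~SerialWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wakeup_.notify_one();
        worker_.join();  // remaining queued jobs are finished first
    }

    // Queues a job and returns a future that becomes ready once it has run.
    std::future<void> submit(std::function<void()> job) {
        assert(job && "job must be callable");
        std::packaged_task<void()> task(std::move(job));
        std::future<void> done = task.get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push_back(std::move(task));
        }
        wakeup_.notify_one();
        return done;
    }

private:
    void run() {
        for (;;) {
            std::packaged_task<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wakeup_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and nothing left to do
                task = std::move(jobs_.front());
                jobs_.pop_front();
            }
            task();  // an exception ends up in this job's future only
        }
    }

    std::mutex mutex_;
    std::condition_variable wakeup_;
    std::deque<std::packaged_task<void()>> jobs_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so all other members exist before it starts
};
```

Each camera object would call `submit` instead of running its own processing thread and wait on the returned future. A job that throws, for example because a camera was disconnected, fails only its own future and the worker keeps serving the other cameras. This does not help if a call never returns: the single worker would then be blocked for all cameras.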
According to the provided log file, processing was blocked by inter-view propagation (the propagation of information from the left view to the right). It would be great if you could disable the coupled propagation for CUDA PatchMatch. In a camera node, set “Parameters > DisparityMap > StereoMatching > PatchMatch > CoupledPropagation” to false. If the error goes away, then it is most likely that the propagation is the cause. If not, the problem is probably with the GPU in general (and we need another log file).
I have now started the software with camera[itmParameters][itmDisparityMap][itmStereoMatching][itmPatchMatch][itmCoupledPropagation] = false;
for each camera and with CUDA enabled, this time without logging via NxTreeEdit. It froze again after a short while. I’m currently recording a new log file and will send it to you as soon as the issue occurs again.
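To notice the freeze without watching the application, one option is a small heartbeat that the processing thread updates and a monitor thread checks. This is only a sketch: `Heartbeat` is a hypothetical helper and not part of NxLib, and the time limit is an arbitrary choice.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>

// Hypothetical helper: the processing thread calls beat() around each
// command, a monitor thread calls isStale() to detect a stalled thread.
class Heartbeat {
public:
    using Clock = std::chrono::steady_clock;

    Heartbeat() : last_(Clock::now().time_since_epoch().count()) {}

    // Records "the thread was alive just now".
    void beat() {
        last_.store(Clock::now().time_since_epoch().count(),
                    std::memory_order_relaxed);
    }

    // True if the last beat is older than `limit`, measured against `now`.
    bool isStale(Clock::duration limit,
                 Clock::time_point now = Clock::now()) const {
        assert(limit >= Clock::duration::zero());
        const Clock::time_point last{
            Clock::duration{last_.load(std::memory_order_relaxed)}};
        return now - last > limit;
    }

private:
    std::atomic<Clock::rep> last_;  // ticks since the steady_clock epoch
};
```

The processing thread would call `beat()` before and after `computeDisparityMap.execute()`, and a monitor thread could log a timestamp as soon as `isStale(std::chrono::seconds(5))` returns true. That gives the time of the freeze, which can then be matched against the log file.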
Thank you for your quick response. We now see that the issue is not with CUDA PatchMatch itself. In your second log, the system hangs either before or after the PatchMatch processing. However, both logs contain an incomplete “Make stereo matcher” call. Perhaps the problem lies there. Would it be too much to ask you to record another log file? If “Make stereo matcher” appears a third time, we can identify where the problem is occurring.
Thank you for the data. Unfortunately, it doesn’t clarify the situation. The file “PatchMatch_CPDisabled3c.enslog” does not contain incomplete calls, as you mentioned in the DM. The “PatchMatch_CPDisabled4b.enslog” file does not contain an incomplete “Make stereo matcher” call either.
I confirmed that “Make stereo matcher” doesn’t use CUDA. It is a very short function that usually takes about 50 μs. It seems that, at some point, the GPU just stopped working.
It would be great to have answers to the following questions:
Does this hanging ever happen with one camera? Or does it only happen when several cameras are working simultaneously on the same GPU?
Thank you very much for your efforts in troubleshooting. A few weeks ago, I tested the setup on another computer with an RTX 3080 and an RTX 4090 using a single Ensenso N46 camera, which ran for hours without any issues. On Monday, I’ll check whether the current system (with an RTX 5060 Ti) using only a single camera also crashes and will share the results afterward.
Just for your information: We ran a few more tests. When CUDA is disabled, the application always runs without crashing or blocking (the longest test lasted 65 hours, another one 17 hours) — even with two cameras.
When we enable CUDA in the Ensenso SDK, the app running on the RTX 5060 Ti with two Ensenso cameras freezes after about 2–15 minutes. With one camera, the application crashes after about 30–40 minutes with an ‘unknown signal’ inside the NVIDIA driver, during a call in our OpenGL application.
On the computer with the RTX 4090, we repeatedly observed no issues, even after several hours with one camera. However, we currently cannot test with two cameras on that system, since we only have one camera here and the machine with the RTX 5060 Ti is a remote system at a project partner’s site.
It seems that the issue might be related to the NVIDIA driver and the simultaneous use of OpenGL and CUDA on the RTX 5060 Ti, though that is only a guess and I can’t rule out an error on our side. For now, I plan to switch that machine to CPU processing for the Ensenso cameras.
Thank you very much for your testing. Based on my personal experience, recent NVIDIA drivers have many issues. I would be very happy if you could test the RTX 4090 (or any architecture older than the RTX 5060 Ti) with two cameras. We will also try to build and test a similar system. Could you please provide the exact version of the driver you are using?