#90 closed defect (postpone)
Low streaming performance
Reported by: | Matthias Vogelgesang | Owned by: | Suren A. Chilingaryan |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | pcilib | Version: | |
Keywords: | | Cc: | Suren A. Chilingaryan |
Description
The results from the benchmark tool are somewhat disappointing and are probably caused by something we don't yet have under full control. Here are my observations:
- Off-line decoding of frames takes about 3.5 to 4.5 ms which is enough to decode > 200 frames per second.
- The overhead of libuca is negligible: running the benchmark with the mock camera (streaming 640x480 images at 8-bit) results in more than 60000 frames/s or ~ 17 GB/s bandwidth.
- Running the benchmark with the ufo camera, we achieve a miserable 34 frames/s at 10 bits and 24 frames/s at 12 bits, each at 0.00001s exposure time. However, this is with the "synchronous" calls.
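The first two figures in the list above can be cross-checked with a little arithmetic. This is only a sketch: the function names are made up for illustration, 8-bit frames are assumed to take 1 byte per pixel, and the "~ 17 GB/s" figure is assumed to be in binary units (GiB).

```python
def decode_rate_fps(decode_ms):
    """Frames per second a decoder sustains at a given per-frame time (ms)."""
    return 1000.0 / decode_ms


def stream_bandwidth(width, height, bytes_per_pixel, fps):
    """Raw bandwidth in bytes/s for uncompressed frames."""
    return width * height * bytes_per_pixel * fps


# Off-line decoding takes 3.5 to 4.5 ms per frame (from the ticket).
slowest = decode_rate_fps(4.5)
fastest = decode_rate_fps(3.5)

# Mock camera: 640x480 at 8 bit (assumed 1 byte/pixel), 60000 frames/s.
mock_bw = stream_bandwidth(640, 480, 1, 60000)

print(f"decode rate: {slowest:.0f} to {fastest:.0f} frames/s")
print(f"mock camera: {mock_bw / 2**30:.1f} GiB/s")
```

Both results are consistent with the claims above: decoding alone stays above 200 frames/s, and the mock camera rate works out to roughly 17 GiB/s.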
Unfortunately, I cannot single out a specific reason for this performance.
Attachments (1)
Change History (4)
Changed 12 years ago by
comment:1 Changed 12 years ago by
It seems that PC-DAQ takes an unusually long time to read the data. Frames are acquired using stimuli. The DMA engine is immediately enabled and instructed to get the data. Frames are stored in ~100ms (32 frames), but it takes an additional ~80ms to read all the data (see attachment). Data are neither stored nor decoded. Frames are acquired using UFO4 firmware, but the same behavior is observed using UFO5 firmware. This may influence the results reported above.
comment:2 follow-up: 3 Changed 12 years ago by
Resolution: | → postpone |
---|---|
Status: | new → closed |
I.e., according to Uros's numbers, we are reading 32 frames in 180ms. The frame size is currently about 18MB (due to a 4x increase in 12-bit mode). This gives us approximately 3200 MB/s, which is the maximum of the DMA engine if I remember correctly. Of course, this should still result in at least 150 fps, but...
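The DMA estimate can be reproduced as follows. This is a minimal sketch that uses only the figures quoted in this comment (32 frames, about 18MB per frame, 180ms); the variable names are illustrative.

```python
FRAMES = 32        # frames read in one burst
FRAME_MB = 18      # approximate frame size in MB
READ_S = 0.180     # time to read all frames, in seconds

# Sustained DMA throughput and the frame rate it would allow on its own.
dma_mb_s = FRAMES * FRAME_MB / READ_S
dma_fps = dma_mb_s / FRAME_MB

print(f"DMA throughput: {dma_mb_s:.0f} MB/s")
print(f"DMA-limited frame rate: {dma_fps:.0f} fps")
```

At this throughput the DMA transfer alone would allow well over 150 fps, so it is not the limiting factor.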
OK. Now back to Matthias's numbers. Using iss-suren1, we got about 30 frames per second in 12-bit mode. This gives us 540 MB/s. Measured memcpy performance on this PC is 4.61 GB/s. The default processing path of pcitool makes 2 memory copies: the first to free the DMA buffer ASAP, the second during decoding. LibUCA, as far as I can see, makes another copy. With an 18MB frame size, the frame obviously no longer stays in the L2 cache. Now let's compute how much time we need for 32 frames:
DMA: 180ms
Memcopy: 3 x 124ms
Decoding: 32 x 5ms
=======
Overall: 712ms
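The budget above can be recomputed from the quoted inputs. In this sketch the 4.61 GB/s memcpy figure is taken as 4610 MB/s and the per-copy time is truncated to whole milliseconds, which appears to be how the 124ms figure was obtained; both are assumptions.

```python
FRAMES = 32
FRAME_MB = 18          # approximate frame size in MB
DMA_MS = 180           # DMA read time for all frames
MEMCPY_MB_S = 4610     # 4.61 GB/s, assuming 1 GB = 1000 MB
COPIES = 3             # two in pcitool, one in libuca
DECODE_MS = 5          # per-frame decoding time

# Time for one memcpy pass over all frames, truncated to whole ms.
memcpy_ms = int(FRAMES * FRAME_MB / MEMCPY_MB_S * 1000)

budget_ms = DMA_MS + COPIES * memcpy_ms + FRAMES * DECODE_MS

print(f"one copy pass: {memcpy_ms} ms")
print(f"overall: {budget_ms} ms")
```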
OK. I can't directly tell where the other 250ms (25% of the time) goes. But the numbers are pretty reasonable. Now, what we should do:
- We shall use a PC with high-speed memory. memcpy at ipecamera is about 7 GB/s and we can have even faster memory.
- The fast-path of pcitool should be used (rawcallback). In this mode pcitool will not do any memory copies itself but send data to the specified callback as it comes in. This will eliminate 2 unnecessary memory copies.
- ufodecode should implement a streaming interface, so that we can use the L2 cache even for large frame sizes.
- I think the DMA engine can be tuned as well.
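As a rough illustration of what the first two items might buy, the same budget can be recomputed with only one remaining copy (the one in libuca) at 7 GB/s. This is a back-of-the-envelope sketch and not a measurement: it assumes DMA and decoding times stay unchanged, and it does not model any gain from streaming decoding or DMA tuning.

```python
FRAMES = 32
FRAME_MB = 18          # approximate frame size in MB
DMA_MS = 180           # assumed unchanged
DECODE_MS = 5          # per frame, assumed unchanged
MEMCPY_MB_S = 7000     # 7 GB/s as measured on ipecamera, 1 GB = 1000 MB
COPIES = 1             # assumption: only the libuca copy remains

memcpy_ms = FRAMES * FRAME_MB / MEMCPY_MB_S * 1000
projected_ms = DMA_MS + COPIES * memcpy_ms + FRAMES * DECODE_MS
projected_fps = FRAMES / (projected_ms / 1000)

print(f"projected time for {FRAMES} frames: {projected_ms:.0f} ms")
print(f"projected frame rate: {projected_fps:.0f} fps")
```

Under these assumptions the estimate lands at roughly 420ms for 32 frames; the real figure would depend on the unexplained overhead noted above.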
Now, I don't want to do this work twice. Therefore, I don't want to start tuning these things until we get the full-speed test system that Michele promised for October-November.
For this reason, I'm closing this ticket with postpone resolution. A significant architecture change in multiple components is required to achieve higher speed. We need a full-speed test bed to start the work.
comment:3 Changed 12 years ago by
Replying to csa:
For this reason, I'm closing this ticket with postpone resolution. A significant architecture change in multiple components is required to achieve higher speed. We need a full-speed test bed to start the work.
Could you please not close the tickets but rather create a new milestone and assign the ticket to it?
graph depicting PC-DAQ data taking