
Re: Questions regarding multithreaded processing

Posted by Gelman, Laurent on Jul 15, 2016; 7:49am
URL: http://imagej.273.s1.nabble.com/Questions-regarding-multithreaded-processing-tp5016878p5016886.html

Dear George,

We ran the same kind of tests and had the same experience as you. We also tested numerous machines with varying numbers of cores declared in the ImageJ options menu, in combination with different amounts of RAM, without being able to draw really clear conclusions about why ImageJ is fast or slow on a given computer. We also tested different processes, from a simple Gaussian blur to more complex macros.

In a nutshell:
We also observed awful performance on our Microsoft Windows Server 2012 / 32 CPUs / 512 GB RAM machine, irrespective of the combination of CPUs and RAM we declare in ImageJ. Certainly, giving more than 16 CPUs to ImageJ does not improve overall speed, and sometimes even decreases it. Note that this very same machine is really fast when using Matlab and the Parallel Computing Toolbox.
Until recently, the fastest computers we could find to run ImageJ were my iMac, which runs Windows 7 (:-)) (specs: i7-4771 CPU, 3.5 GHz, 32 GB RAM), and the HIVE (hexacore machine) sold by the company Acquifer (no commercial interest). Up to that point, we thought that the speed of the individual CPUs was the key, rather than their number, but we were really surprised recently when we tested the new virtual machines (VMs) our IT department set up for us to do some remote processing of very big microscopy datasets (24 cores, 128 to 256 GB RAM for each VM). Although the CPUs on the physical servers are not that fast (2.5 GHz, but is this really a good measure of computation speed? I am not sure...), our VMs turned out to be the fastest machines we have tested so far. So we actually no longer have any theory about ImageJ and speed. It is not clear to us either whether running Windows 7 or Windows Server 2012 makes a difference.
Finally, I should mention that when you use complex processes, for example Stitching, the speed of the individual CPUs is also important, as we had the impression that the reading/loading of the file uses only one core. There again, we could see a beautiful correlation between CPU speed (GHz specs) and processing speed.

Current solution:
If we really need to be very fast,
1. we write an ImageJ script in Python and launch multiple threads in parallel, but we observed that the whole thing was not "thread safe", i.e. we see "collisions" between the different threads.
2. we write a Python program to launch multiple ImageJ instances in headless mode and run the macro this way.
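
Option 2 could be sketched roughly as follows. This is only an illustration, not the script actually used here: it assumes the Fiji launcher's --headless and -macro options, and the launcher path, macro name and image names are hypothetical placeholders. Each ImageJ instance is a separate OS process, so the instances share no state and cannot "collide" the way threads inside one instance can.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def build_command(launcher, macro_path, macro_arg):
    """Build the argument list for one headless run.

    Assumes the Fiji launcher's --headless and -macro options.
    """
    return [launcher, "--headless", "-macro", macro_path, macro_arg]


def run_commands(commands, max_parallel):
    """Run each command as its own process, at most max_parallel at a time.

    Returns the exit codes in the same order as `commands`.
    """
    def run_one(argv):
        return subprocess.run(argv, capture_output=True).returncode

    # The Python threads only wait on child processes; the real work
    # happens in the children, so the GIL is not a bottleneck here.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_one, commands))


# Hypothetical real use (paths and file names are placeholders):
#   cmds = [build_command("/opt/Fiji.app/ImageJ-linux64", "process.ijm", f)
#           for f in ["a.tif", "b.tif", "c.tif"]]
#   exit_codes = run_commands(cmds, max_parallel=3)

# Stand-in demo that runs anywhere: the Python interpreter plays the
# role of the child process, since the launcher path is site-specific.
demo = [[sys.executable, "-c", "print('done')"] for _ in range(3)]
print(run_commands(demo, max_parallel=2))
```

The value of max_parallel is a tuning knob: each instance needs its own share of RAM, so the useful number of instances may be well below the number of cores.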

I would also be delighted to understand what makes ImageJ fast or slow on a given computer; that would help us purchase the right machines from the start.

Very best regards,

Laurent.

___________________________
Laurent Gelman, PhD
Friedrich Miescher Institut
Head, Facility for Advanced Imaging and Microscopy
Light microscopy
WRO 1066.2.16
Maulbeerstrasse 66
CH-4058 Basel
+41 (0)61 696 35 13
+41 (0)79 618 73 69
www.fmi.ch
www.microscopynetwork.unibas.ch/


-----Original Message-----
From: George Patterson [mailto:[hidden email]]
Sent: Wednesday, 13 July 2016 23:55
Subject: Questions regarding multithreaded processing

Dear all,
I’ve assembled a plugin to analyze a time series on a pixel-by-pixel basis.
It works fine but is slow.
There are likely still plenty of optimizations that could improve the speed, and thanks to Albert Cardona and Stephan Preibisch sharing code and tutorials (http://albert.rierol.net/imagej_programming_tutorials.html),
I even have a version that runs multi-threaded.
When run on multi-core machines the speed is improved, but I’m not sure what sort of improvement I should expect. Moreover, the machines I expected to be the fastest are not. This likely stems from my misunderstanding of parallel processing and Java programming in general, so I’m hoping some of you with more experience can provide some feedback.
I list below some observations and questions along with test runs on the same data set using the same plugin on a few different machines.
Thanks for any suggestions.
George


Since the processing speeds of the machines differ, I realize the time each machine needs to complete the analysis will differ. I’m more interested in the improvement from multiple threads on an individual machine.
In running these tests, I altered the code to use a different number of threads in each run.
Is setting the number of threads in the code and determining the time to finish the analysis a valid approach to testing improvement?
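
One way to make such timings more comparable is to repeat each run and discard a warm-up pass, since a single timed run of a Java plugin can include one-off costs such as JIT compilation. The sketch below only illustrates the measurement pattern in Python; fake_analysis is a hypothetical stand-in for the plugin, not the actual code, and it simply pretends that the work splits evenly across threads.

```python
import statistics
import time


def time_task(task, thread_counts, repeats=5, warmup=1):
    """Time task(n_threads) for each thread count.

    Returns {n_threads: median wall-clock seconds over `repeats` runs}.
    """
    results = {}
    for n in thread_counts:
        for _ in range(warmup):
            task(n)  # untimed run so one-off start-up costs are excluded
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            task(n)
            samples.append(time.perf_counter() - start)
        results[n] = statistics.median(samples)
    return results


def fake_analysis(n_threads):
    """Hypothetical stand-in for the plugin: pretends work splits evenly."""
    time.sleep(0.08 / n_threads)


timings = time_task(fake_analysis, [1, 2, 4], repeats=3)
for n, seconds in timings.items():
    print(f"{n} threads: {seconds:.3f} s")
```

Reporting the median of several runs rather than a single value also shows whether differences such as ~34 sec versus ~35 sec are larger than the run-to-run noise.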

Machine 5 is producing some odd behavior which I’ll discuss and ask for suggestions below.  

For machines 1-4, the speed improves with the number of threads up to about half the number of available processors.  
Do the improvements with the number of threads listed below seem reasonable?
Is the improvement up to only about half the number of available processors due to “hyperthreading”?  My limited (and probably wrong) understanding is that hyperthreading makes a single core appear to be two which share resources and thus a machine with 2 cores will return 4 when queried for number of cpus.  Yes, I know that is too simplistic, but it’s the best I can do.
Could it simply be that my code is not written properly to take advantage of hyperthreading?  Could anyone point me to a source and/or example code explaining how I could change it to take advantage of hyperthreading if this is the problem?

Number of threads used are shown in parentheses where applicable.
1. MacBook Pro 2.66 GHz Intel Core i7
number of processors: 1
Number of cores: 2
non-threaded plugin version ~59 sec
threaded (1) ~51 sec
threaded (2) ~36 sec
threaded (3) ~34 sec
threaded (4) ~35 sec

2. Mac Pro 2 x 2.26 GHz Quad-Core Intel Xeon
number of processors: 2
Number of cores: 8
non-threaded plugin version ~60 sec
threaded (1) ~59 sec
threaded (2) ~28.9 sec
threaded (4) ~15.6 sec
threaded (6) ~13.2 sec
threaded (8) ~11.3 sec
threaded (10) ~11.1 sec
threaded (12) ~11.1 sec
threaded (16) ~11.5 sec

3. Windows 7 DELL 3.2 GHz Intel Core i5
number of cpus shown in resource monitor: 4
non-threaded plugin version ~45.3 sec
threaded (1) ~48.3 sec
threaded (2) ~21.7 sec
threaded (3) ~20.4 sec
threaded (4) ~21.8 sec

4. Windows 7 Xi MTower 2P64 Workstation 2 x 2.1 GHz AMD Opteron 6272
number of cpus shown in resource monitor: 32
non-threaded plugin version ~162 sec
threaded (1) ~158 sec
threaded (2) ~85.1 sec
threaded (4) ~46 sec
threaded (8) ~22.9 sec
threaded (10) ~18.6 sec
threaded (12) ~16.4 sec
threaded (16) ~15.8 sec
threaded (20) ~15.7 sec
threaded (24) ~15.9 sec
threaded (32) ~16 sec

For machines 1-4, the cpu usage can be observed in the Activity Monitor (Mac) or Resource Monitor (Windows) and during the execution of the plugin all of the cpus were active. For machine 5 shown below, only 22 of the 64 show activity. And it is not always the same 22. From the example runs below you can see it really isn’t performing very well considering the number of available cores. I originally thought this machine should be the best, but it barely outperforms my laptop. This is probably a question for another forum, but I am wondering if anyone else has encountered anything similar.

5. Windows Server 2012 Xi MTower 2P64 Workstation 4 x 2.4 GHz AMD Opteron 6378
number of cpus shown in resource monitor: 64
non-threaded plugin version ~140 sec
threaded (1) ~137 sec
threaded (4) ~60.3 sec
threaded (8) ~29.3 sec
threaded (12) ~22.9 sec
threaded (16) ~23.8 sec
threaded (24) ~24.1 sec
threaded (32) ~24.5 sec
threaded (40) ~24.8 sec
threaded (48) ~23.8 sec
threaded (64) ~24.8 sec
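
To judge whether such numbers are "reasonable", it can help to convert the times into speedup (1-thread time divided by n-thread time) and efficiency (speedup divided by n), and to estimate the serial fraction s from Amdahl's law, S(n) = 1 / (s + (1 - s) / n). The sketch below does this for the machine 2 timings listed above. It is a rough model only: it assumes that nothing else (memory bandwidth, garbage collection, file I/O) limits scaling, which may well not hold here.

```python
def speedup_table(times_by_threads):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread time."""
    t1 = times_by_threads[1]
    return {n: (t1 / t, t1 / t / n) for n, t in sorted(times_by_threads.items())}


def amdahl_serial_fraction(threads, speedup):
    """Solve Amdahl's law S = 1 / (s + (1 - s) / n) for the serial fraction s."""
    return (threads / speedup - 1.0) / (threads - 1.0)


# Threaded timings in seconds, copied from machine 2 (8 cores) in the post.
machine2 = {1: 59.0, 2: 28.9, 4: 15.6, 6: 13.2, 8: 11.3}

table = speedup_table(machine2)
for n, (speedup, efficiency) in table.items():
    print(f"{n:2d} threads: speedup {speedup:4.2f}, efficiency {efficiency:4.2f}")

serial = amdahl_serial_fraction(8, table[8][0])
print(f"estimated serial fraction at 8 threads: {serial:.3f}")
```

Under this model, a serial fraction of s caps the achievable speedup at 1 / s no matter how many cores are added, so even a small single-threaded part (such as loading the data) can explain a plateau well below the core count.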







--
View this message in context: http://imagej.1557.x6.nabble.com/Questions-regarding-multithreaded-processing-tp5016878.html
Sent from the ImageJ mailing list archive at Nabble.com.

--
ImageJ mailing list: http://imagej.nih.gov/ij/list.html
