Posted by Michael Schmid on Jul 14, 2016; 8:17am
URL: http://imagej.273.s1.nabble.com/Questions-regarding-multithreaded-processing-tp5016878p5016883.html
Hi George,
there are several reasons why there is no linear increase in speed with
the number of threads. These come to mind:
- Memory bandwidth is shared by all CPUs (typically the most important
issue)
- L2 and/or L3 cache is shared between processors; the cache may be too
small if there are many threads
- When two or more threads access memory addresses that are separated by
i*2^n where i is a small integer, this causes conflicts due to limited
cache associativity. This problem can arise with image sizes that are
powers of 2, such as 4096*4096, and e.g. 4 threads starting at e.g. 0,
1/4, 1/2 and 3/4 of the image height. Independent of multithreading,
cache associativity issues also slow down column-wise
processing of images whose height is a power of 2 or i*2^n (again with i
being a small integer).
- Garbage collection has many "Stop the World" events. This means that
all application threads are stopped until the operation completes.
- Some threads might finish up earlier than others (different work load
or just different success rates with memory and cache access)
- Program parts that are not parallelized (Amdahl's law)
- Overhead when creating the threads (and for synchronization between
threads, if necessary)
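To put a rough number on the Amdahl's law point: if a fraction p of the run time can be parallelized, the ideal speedup with n threads is 1/((1-p) + p/n). The sketch below evaluates this and also inverts it to estimate an "effective" p from a measured speedup. The class and method names are only for illustration (nothing here is ImageJ API), and all the other effects in the list get lumped into the "serial" part, so this is a crude estimate rather than a measurement.

```java
/** Amdahl's law helpers; names are illustrative, not part of ImageJ. */
final class Amdahl {
    /** Ideal speedup with n threads if a fraction p (0..1) of the work is parallel. */
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    /** Inverse of the formula: effective parallel fraction from a measured speedup s with n > 1 threads. */
    static double parallelFraction(double s, int n) {
        return (1.0 - 1.0 / s) / (1.0 - 1.0 / n);
    }

    public static void main(String[] args) {
        // Timings from George's machine 2 (quoted below): ~59 s with 1 thread, ~11.3 s with 8 threads
        double s = 59.0 / 11.3;
        double p = parallelFraction(s, 8);
        System.out.printf("speedup %.2f, effective parallel fraction %.2f%n", s, p);
    }
}
```

For the machine 2 timings this gives an effective parallel fraction of about 0.92, i.e. even on 8 real cores roughly 8% of the single-thread time behaves as if it were serial.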
IMHO the most important point for efficient parallelization (and
efficient Java code anyhow) is avoiding creating lots of objects that
need garbage collection (don't care about dozens, but definitely avoid
hundreds of thousands or millions of Objects).
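As an illustration of avoiding per-pixel garbage, here is a minimal sketch for a pixel-by-pixel time series analysis. I am assuming the data are available as float[][] with frames[t][i] being pixel i at time point t; that layout, the class name and the simple mean as the "analysis" are my assumptions, not George's plugin. The point is only that the scratch array is allocated once and overwritten for each pixel, instead of a new array per pixel.

```java
/** Per-pixel time-series mean; illustrative names, not ImageJ API. */
final class SeriesMean {
    /**
     * frames[t][i] is pixel i at time point t (assumed layout).
     * One scratch buffer is allocated up front and reused for every pixel,
     * so the pixel loop itself creates no objects.
     */
    static float[] meanPerPixel(float[][] frames) {
        int nFrames = frames.length;
        int nPixels = frames[0].length;
        float[] result = new float[nPixels];
        float[] series = new float[nFrames];   // allocated once (one per thread when multithreaded)
        for (int i = 0; i < nPixels; i++) {
            for (int t = 0; t < nFrames; t++)
                series[t] = frames[t][i];      // overwrite the buffer; no "new float[nFrames]" here
            result[i] = mean(series);
        }
        return result;
    }

    /** Arithmetic mean, accumulated in double precision. */
    static float mean(float[] values) {
        double sum = 0;
        for (float v : values) sum += v;
        return (float) (sum / values.length);
    }
}
```

For a 4096*4096 image, allocating the buffer inside the pixel loop would create about 16.8 million short-lived arrays; with the version above it is one (or one per thread).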
What also helps (maybe 20% gain, but strongly depending on the problem)
is having the threads share their load such that they access the same
data area for reading. E.g. the ImageJ built-in RankFilters (mean,
minimum, maximum, median, remove outliers) have the work split up into
pixel rows of an image, and the threads work on nearby rows (each thread
also needs a few adjacent rows of data anyhow). This helps quite a bit,
since all the input data nicely fit into the CPU cache, but it requires
more programming effort and synchronization.
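One simple way to keep the threads on neighboring rows is to hand out rows one by one from a shared counter, instead of giving each thread one large block of the image. The following is only a plain-Java sketch of that idea, not the RankFilters code (which additionally has to synchronize its shared row caches); the class name and the 3-row vertical mean used as a stand-in filter are my own choices.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Threads fetch the next unprocessed row from a shared counter; illustrative only. */
final class NearbyRows {
    /** 3-row vertical mean of a width*height image stored row by row; edge rows are clamped. */
    static float[] verticalMean3(float[] in, int width, int height, int nThreads) {
        float[] out = new float[in.length];
        AtomicInteger nextRow = new AtomicInteger(0);   // shared row dispenser
        Thread[] threads = new Thread[nThreads];
        for (int k = 0; k < nThreads; k++) {
            threads[k] = new Thread(() -> {
                int y;
                // each thread takes the next free row, so all threads read nearby input rows
                while ((y = nextRow.getAndIncrement()) < height) {
                    int above = Math.max(y - 1, 0) * width;
                    int below = Math.min(y + 1, height - 1) * width;
                    int row = y * width;
                    for (int x = 0; x < width; x++)
                        out[row + x] = (in[above + x] + in[row + x] + in[below + x]) / 3f;
                }
            });
            threads[k].start();
        }
        for (Thread t : threads) {
            try {
                t.join();                                // join also makes the writes to out visible
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return out;
    }
}
```

Each output row is written by exactly one thread, so the only synchronization needed here is the atomic counter. A nice side effect is that a thread that finishes its row early simply picks up the next one, which also evens out the load.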
Hyperthreading: In my experience, the gain of using hyperthreading is
modest, but it exists - maybe in the 10% range (i.e., when using 4
threads on a 2-core CPU with hyperthreading). This is in contrast to the
results with your plugin (no gain).
I have no experience with machines having a large number of cores like
your 4*16-core AMD Opteron; I can't say whether Java still distributes
the threads correctly between the cores for 64 cores.
---
By the way, just from the programming side:
The easiest way for parallelization is writing a PlugInFilter.
If operations on the stack slices are independent, just specify the
PARALLELIZE_STACKS flag, and ImageJ will call the run(ip) method in
parallel for the stack slices.
If operations care about the ROI, you can also specify the
PARALLELIZE_IMAGES flag. When processing a single image, ImageJ will
call the run(ip) method in parallel, with rectangular ROIs. E.g. for 4
threads the first thread will get a ROI with the uppermost 1/4 of the
height, the second thread the range from 1/4 to 1/2 of the height, and
so on.
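The splitting into horizontal bands can be pictured with the plain-Java sketch below. It is not ImageJ's internal code: the class and method names are made up, and the exact rounding ImageJ uses for the band boundaries may differ.

```java
/** Splits a ROI height into horizontal bands, one per thread; illustrative only. */
final class Bands {
    /** Returns nThreads+1 y boundaries; band k covers rows bounds[k] (inclusive) to bounds[k+1] (exclusive). */
    static int[] boundaries(int roiY, int roiHeight, int nThreads) {
        int[] bounds = new int[nThreads + 1];
        for (int k = 0; k <= nThreads; k++)
            bounds[k] = roiY + (int) ((long) roiHeight * k / nThreads);
        return bounds;
    }

    public static void main(String[] args) {
        int[] b = boundaries(0, 4096, 4);
        for (int k = 0; k < 4; k++)
            System.out.println("thread " + k + ": rows " + b[k] + " to " + (b[k + 1] - 1));
    }
}
```

For a height of 4096 and 4 threads the bands start at rows 0, 1024, 2048 and 3072, which is exactly the power-of-2 spacing where the cache associativity issue mentioned above can show up.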
The number of threads is set in Edit>Options>Memory & Threads and is
initially set to the number of cores (including hyperthreading).
Michael
________________________________________________________________
On 2016-07-13 23:55, George Patterson wrote:
> Dear all,
> I’ve assembled a plugin to analyze a time series on a pixel-by-pixel basis.
> It works fine but is slow.
> There are likely still plenty of optimizations that can be done to improve
> the speed and thanks to Albert Cardona and Stephan Preibisch sharing code
> and tutorials (http://albert.rierol.net/imagej_programming_tutorials.html),
> I even have a version that runs multi-threaded.
> When run on multi-core machines the speed is improved, but I’m not sure what
> sort of improvement I should expect. Moreover, the machines I expected to
> be the fastest are not. This is likely stemming from my misunderstanding of
> parallel processing and Java programming in general so I’m hoping some of
> you with more experience can provide some feedback.
> I list below some observations and questions along with test runs on the
> same data set using the same plugin on a few different machines.
> Thanks for any suggestions.
> George
>
>
> Since the processing speeds differ, I realize the time each machine takes to
> complete the analysis will differ. I’m more interested in the improvement from
> multiple threads on an individual machine.
> In running these tests, I altered the code to use a different number of
> threads in each run.
> Is setting the number of threads in the code and determining the time to
> finish the analysis a valid approach to testing improvement?
>
> Machine 5 is producing some odd behavior which I’ll discuss and ask for
> suggestions below.
>
> For machines 1-4, the speed improves with the number of threads up to about
> half the number of available processors.
> Do the improvements with the number of threads listed below seem reasonable?
> Is the improvement up to only about half the number of available processors
> due to “hyperthreading”? My limited (and probably wrong) understanding is
> that hyperthreading makes a single core appear to be two which share
> resources and thus a machine with 2 cores will return 4 when queried for
> number of cpus. Yes, I know that is too simplistic, but it’s the best I can
> do.
> Could it simply be that my code is not written properly to take advantage of
> hyperthreading? Could anyone point me to a source and/or example code
> explaining how I could change it to take advantage of hyperthreading if this
> is the problem?
>
> Number of threads used are shown in parentheses where applicable.
> 1. MacBook Pro 2.66 GHz Intel Core i7
> number of processors: 1
> Number of cores: 2
> non-threaded plugin version ~59 sec
> threaded (1) ~51 sec
> threaded (2) ~36 sec
> threaded (3) ~34 sec
> threaded (4) ~35 sec
>
> 2. Mac Pro 2 x 2.26 GHz Quad-Core Intel Xeon
> number of processors: 2
> Number of cores: 8
> non-threaded plugin version ~60 sec
> threaded (1) ~59 sec
> threaded (2) ~28.9 sec
> threaded (4) ~15.6 sec
> threaded (6) ~13.2 sec
> threaded (8) ~11.3 sec
> threaded (10) ~11.1 sec
> threaded (12) ~11.1 sec
> threaded (16) ~11.5 sec
>
> 3. Windows 7 DELL 3.2 GHz Intel Core i5
> number of cpus shown in resource monitor: 4
> non-threaded plugin version ~45.3 sec
> threaded (1) ~48.3 sec
> threaded (2) ~21.7 sec
> threaded (3) ~20.4 sec
> threaded (4) ~21.8 sec
>
> 4. Windows 7 Xi MTower 2P64 Workstation 2 x 2.1 GHz AMD Opteron 6272
> number of cpus shown in resource monitor: 32
> non-threaded plugin version ~162 sec
> threaded (1) ~158 sec
> threaded (2) ~85.1 sec
> threaded (4) ~46 sec
> threaded (8) ~22.9 sec
> threaded (10) ~18.6 sec
> threaded (12) ~16.4 sec
> threaded (16) ~15.8 sec
> threaded (20) ~15.7 sec
> threaded (24) ~15.9 sec
> threaded (32) ~16 sec
>
> For machines 1-4, the cpu usage can be observed in the Activity Monitor
> (Mac) or Resource Monitor (Windows) and during the execution of the plugin
> all of the cpus were active. For machine 5 shown below, only 22 of the 64
> show activity. And it is not always the same 22. From the example runs
> below you can see it really isn’t performing very well considering the
> number of available cores. I originally thought this machine should be the
> best, but it barely outperforms my laptop. This is probably a question for
> another forum, but I am wondering if anyone else has encountered anything
> similar.
>
> 5. Windows Server 2012 Xi MTower 2P64 Workstation 4 x 2.4 GHz AMD Opteron
> 6378
> number of cpus shown in resource monitor: 64
> non-threaded plugin version ~140 sec
> threaded (1) ~137 sec
> threaded (4) ~60.3 sec
> threaded (8) ~29.3 sec
> threaded (12) ~22.9 sec
> threaded (16) ~23.8 sec
> threaded (24) ~24.1 sec
> threaded (32) ~24.5 sec
> threaded (40) ~24.8 sec
> threaded (48) ~23.8 sec
> threaded (64) ~24.8 sec
>
> --
> View this message in context:
> http://imagej.1557.x6.nabble.com/Questions-regarding-multithreaded-processing-tp5016878.html
> Sent from the ImageJ mailing list archive at Nabble.com.
>
> --
> ImageJ mailing list:
> http://imagej.nih.gov/ij/list.html