Dear all,
I’ve assembled a plugin to analyze a time series on a pixel-by-pixel basis. It works fine but is slow. There are likely still plenty of optimizations that could improve the speed, and thanks to Albert Cardona and Stephan Preibisch sharing code and tutorials (http://albert.rierol.net/imagej_programming_tutorials.html), I even have a version that runs multi-threaded. When run on multi-core machines the speed improves, but I’m not sure what sort of improvement I should expect. Moreover, the machines I expected to be the fastest are not. This likely stems from my misunderstanding of parallel processing and Java programming in general, so I’m hoping some of you with more experience can provide feedback. Below I list some observations and questions along with test runs on the same data set using the same plugin on a few different machines. Thanks for any suggestions.

George

Since the processing speeds differ, I realize the times for each machine to complete the analysis will differ; I’m more interested in the improvement from multiple threads on an individual machine. In running these tests, I altered the code to use a different number of threads in each run. Is setting the number of threads in the code and timing the analysis a valid approach to testing improvement?

Machine 5 is producing some odd behavior which I’ll discuss and ask for suggestions about below.

For machines 1-4, the speed improves with the number of threads up to about half the number of available processors. Do the improvements with the number of threads listed below seem reasonable? Is the improvement up to only about half the number of available processors due to “hyperthreading”? My limited (and probably wrong) understanding is that hyperthreading makes a single core appear to be two which share resources, and thus a machine with 2 cores will return 4 when queried for the number of CPUs. Yes, I know that is too simplistic, but it’s the best I can do.
Could it simply be that my code is not written properly to take advantage of hyperthreading? Could anyone point me to a source and/or example code explaining how I could change it to take advantage of hyperthreading, if this is the problem?

The number of threads used is shown in parentheses where applicable.

1. MacBook Pro 2.66 GHz Intel Core i7
number of processors: 1
number of cores: 2
non-threaded plugin version ~59 sec
threaded (1) ~51 sec
threaded (2) ~36 sec
threaded (3) ~34 sec
threaded (4) ~35 sec

2. Mac Pro 2 x 2.26 GHz Quad-Core Intel Xeon
number of processors: 2
number of cores: 8
non-threaded plugin version ~60 sec
threaded (1) ~59 sec
threaded (2) ~28.9 sec
threaded (4) ~15.6 sec
threaded (6) ~13.2 sec
threaded (8) ~11.3 sec
threaded (10) ~11.1 sec
threaded (12) ~11.1 sec
threaded (16) ~11.5 sec

3. Windows 7 DELL 3.2 GHz Intel Core i5
number of cpus shown in resource monitor: 4
non-threaded plugin version ~45.3 sec
threaded (1) ~48.3 sec
threaded (2) ~21.7 sec
threaded (3) ~20.4 sec
threaded (4) ~21.8 sec

4. Windows 7 Xi MTower 2P64 Workstation 2 x 2.1 GHz AMD Opteron 6272
number of cpus shown in resource monitor: 32
non-threaded plugin version ~162 sec
threaded (1) ~158 sec
threaded (2) ~85.1 sec
threaded (4) ~46 sec
threaded (8) ~22.9 sec
threaded (10) ~18.6 sec
threaded (12) ~16.4 sec
threaded (16) ~15.8 sec
threaded (20) ~15.7 sec
threaded (24) ~15.9 sec
threaded (32) ~16 sec

For machines 1-4, CPU usage can be observed in the Activity Monitor (Mac) or Resource Monitor (Windows), and during execution of the plugin all of the CPUs were active. For machine 5, shown below, only 22 of the 64 show activity, and it is not always the same 22. From the example runs below you can see it really isn’t performing very well considering the number of available cores. I originally thought this machine would be the best, but it barely outperforms my laptop.
This is probably a question for another forum, but I am wondering if anyone else has encountered anything similar.

5. Windows Server 2012 Xi MTower 2P64 Workstation 4 x 2.4 GHz AMD Opteron 6378
number of cpus shown in resource monitor: 64
non-threaded plugin version ~140 sec
threaded (1) ~137 sec
threaded (4) ~60.3 sec
threaded (8) ~29.3 sec
threaded (12) ~22.9 sec
threaded (16) ~23.8 sec
threaded (24) ~24.1 sec
threaded (32) ~24.5 sec
threaded (40) ~24.8 sec
threaded (48) ~23.8 sec
threaded (64) ~24.8 sec
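[Editor's note: George's method — setting the thread count in code and timing the whole run on the same data — is a reasonable benchmark; the main caveat is JVM warm-up, so each configuration should be run a few times and the later timings kept. A minimal stand-alone sketch of that method in plain Java; the class name and the dummy per-pixel workload are hypothetical placeholders, not George's actual plugin:]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SpeedupTest {

    // Stand-in for the per-pixel analysis (the real plugin fits a curve to
    // each pixel's time trace). Hypothetical workload, CPU-bound on purpose.
    static double processPixel(int i) {
        double v = i;
        for (int k = 1; k <= 200; k++) v = Math.sqrt(v + k);
        return v;
    }

    // Process nPixels with nThreads, each task owning one contiguous block.
    static double run(int nPixels, int nThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        List<Future<Double>> parts = new ArrayList<>();
        int block = (nPixels + nThreads - 1) / nThreads;
        for (int t = 0; t < nThreads; t++) {
            final int start = t * block;
            final int end = Math.min(nPixels, start + block);
            parts.add(pool.submit(() -> {
                double s = 0;
                for (int i = start; i < end; i++) s += processPixel(i);
                return s;
            }));
        }
        double sum = 0;
        for (Future<Double> f : parts) sum += f.get();
        pool.shutdown();
        return sum;
    }

    public static void main(String[] args) throws Exception {
        // Same data each run, varying only the thread count -- George's method.
        for (int threads : new int[]{1, 2, 4, 8}) {
            long t0 = System.nanoTime();
            run(1 << 20, threads);
            System.out.printf("threads=%d  %.0f ms%n",
                    threads, (System.nanoTime() - t0) / 1e6);
        }
    }
}
```

The result should be identical regardless of thread count, which doubles as a correctness check on the parallelization.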
I don't know enough about multithreading to say much intelligent, but I
did recently see a post suggesting that in parallel processing with Matlab, using more instances than physical cores may not produce much speed improvement: http://undocumentedmatlab.com/blog/a-few-parfor-tips

Kurt

On 7/13/2016 2:55 PM, George Patterson wrote:
> [quoted text snipped]

--
Kurt Thorn
Associate Professor
Director, Nikon Imaging Center
http://thornlab.ucsf.edu/
http://nic.ucsf.edu/blog/

--
ImageJ mailing list: http://imagej.nih.gov/ij/list.html
Hi Kurt and George.
Seeing as this bit is a bit more technical and closer to a plugin development question, would you mind posting it on http://forum.imagej.net ? Long technical email threads like this one tend to get muddy, especially if we try to share code snippets or want to comment on a particular part. Plus you get a bunch of the ImageJ devs who hang out there all the time. And finally, it looks pretty :)

A quick remark, though. Seeing as we do not know HOW you implemented the parallel processing, it will be difficult to help. Some notes: if you 'simply' make a bunch of threads where each accesses a given pixel in a loop through an atomic integer, for example, it is not going to be faster. Accessing a single pixel is extremely fast, and what will slow you down is having each thread wait to get its pixel index. This is why in most examples you first break up the task by number of cores and assign each thread a pre-defined block of pixels to process. That way each thread can just go and access the pixels it wants without worrying about what another thread does. There you can expect a speed increase that scales pretty linearly with the number of available cores.

Best
Oli

> -----Original Message-----
> From: ImageJ Interest Group [mailto:[hidden email]] On Behalf Of Kurt Thorn
> Sent: jeudi, 14 juillet 2016 01:26
> To: [hidden email]
> Subject: Re: Questions regarding multithreaded processing
> [quoted text snipped]
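[Editor's note: Oli's block-partitioning scheme can be sketched in plain Java. This is an illustrative stand-alone example — the class name and the trivial invert operation are placeholders for the real per-pixel analysis. Each thread owns a contiguous block of rows, so there is no per-pixel coordination at all:]

```java
import java.util.ArrayList;
import java.util.List;

public class BlockThreads {

    // Split the rows of a width*height image among threads: each thread owns
    // a contiguous block of rows and sweeps it independently. (The anti-pattern
    // is an AtomicInteger handing out one pixel index at a time -- then threads
    // spend their time contending for the shared counter instead of working.)
    static void invertParallel(float[] pixels, int width, int height,
                               int nThreads) throws InterruptedException {
        List<Thread> threads = new ArrayList<>();
        int rowsPerThread = (height + nThreads - 1) / nThreads;
        for (int t = 0; t < nThreads; t++) {
            final int yStart = t * rowsPerThread;
            final int yEnd = Math.min(height, yStart + rowsPerThread);
            Thread th = new Thread(() -> {
                for (int y = yStart; y < yEnd; y++)
                    for (int x = 0; x < width; x++)
                        pixels[y * width + x] = 255f - pixels[y * width + x];
            });
            threads.add(th);
            th.start();
        }
        for (Thread th : threads) th.join();  // wait for every block to finish
    }

    public static void main(String[] args) throws InterruptedException {
        float[] img = new float[256 * 256];
        invertParallel(img, 256, 256, 4);
        System.out.println(img[0]); // 255.0
    }
}
```

Because the blocks never overlap, no synchronization is needed inside the loop; the single join() at the end is the only coordination point.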
Hi George,
there are several reasons why there is no linear increase in speed with the number of threads. These come to my mind:

- Memory bandwidth is shared by all CPUs (typically the most important issue).
- L2 and/or L3 cache is shared between processors; the cache may be too small if there are many threads.
- When two or more threads access memory addresses that are separated by i*2^n, where i is a small integer, this causes conflicts due to limited cache associativity. This problem can arise with image sizes that are powers of 2, such as 4096*4096, and e.g. 4 threads starting at 0, 1/4, 1/2 and 3/4 of the image height. Independent of multithreading, cache associativity issues also slow down column-wise processing of images whose height is a power of 2 or i*2^n (again with i a small integer).
- Garbage collection has many "stop the world" events, meaning all application threads are stopped until the operation completes.
- Some threads might finish earlier than others (different work load, or just different success rates with memory and cache access).
- Program parts that are not parallelized (Amdahl's law).
- Overhead when creating the threads (and for synchronization between threads, if necessary).

IMHO the most important point for efficient parallelization (and efficient Java code anyhow) is avoiding creating lots of objects that need garbage collection (don't care about dozens, but definitely avoid hundreds of thousands or millions of objects).

What also helps (maybe a 20% gain, but strongly depending on the problem) is having the threads share their load such that they access the same data area for reading. E.g. the ImageJ built-in RankFilters (mean, minimum, maximum, median, remove outliers) have the work split up into pixel rows of an image, and the threads work on nearby rows (each thread also needs a few adjacent rows of data anyhow). This helps quite a bit, since all the input data nicely fit into the CPU cache, but it requires more programming effort and synchronization.

Hyperthreading: in my experience, the gain from using hyperthreading is modest, but it exists - maybe in the 10% range (i.e., when using 4 threads on a 2-core CPU with hyperthreading). This is in contrast to the results with your plugin (no gain). I have no experience with machines having a large number of cores like your 4 x 16-core AMD Opterons; I can't say whether Java still distributes the threads correctly between the cores for 64 cores.

---

By the way, just from the programming side: the easiest way to parallelize is writing a PlugInFilter. If operations on the stack slices are independent, just specify the PARALLELIZE_STACKS flag, and ImageJ will call the run(ip) method in parallel for the stack slices. If operations care about the ROI, you can also specify the PARALLELIZE_IMAGES flag. When processing a single image, ImageJ will then call the run(ip) method in parallel, with rectangular ROIs. E.g. for 4 threads, the first thread will get a ROI with the uppermost 1/4 of the height, the second thread the range from 1/4 to 1/2 of the height, and so on. The number of threads is set in Edit>Options>Memory & Threads and is initially the number of cores (including hyperthreading).

Michael

________________________________________________________________
On 2016-07-13 23:55, George Patterson wrote:
> [quoted text snipped]
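[Editor's note: Michael's Amdahl's-law point can be made concrete with a few lines of Java. If only a fraction p of the run is parallelizable, n threads can speed it up by at most 1/((1-p) + p/n), which plateaus far below n. The fractions below are illustrative, not measurements of George's plugin:]

```java
public class Amdahl {

    // Amdahl's law: if a fraction p of the total work can be parallelized
    // over n threads, the best possible overall speedup is 1/((1-p) + p/n).
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // Even with 95% of the run parallelizable, 64 threads cannot give
        // more than about a 15x speedup -- one reason a 64-CPU machine can
        // plateau long before all its cores are used.
        for (int n : new int[]{2, 8, 16, 64})
            System.out.printf("n=%2d  max speedup=%.1f%n", n, speedup(0.95, n));
    }
}
```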
Dear George,
We have made the same kind of tests and had the same experience as you. We also tested numerous machines with a variable number of cores declared in the ImageJ Options menu, in combination with different amounts of RAM, without being able to draw really clear conclusions about why it is fast or slow on the respective computers. We also tested different processes, from a simple Gaussian blur to more complex macros.

In a nutshell: we also observed awful performance on our Microsoft Server 2012 / 32 CPUs / 512 GB RAM machine, irrespective of the combination of CPUs and RAM we declare in ImageJ. Surely, giving more than 16 CPUs to ImageJ does not improve overall speed; sometimes it even decreases it. Note that this very same machine is really fast when using Matlab and the parallel processing toolbox.

Until recently, the fastest computers we could find to run ImageJ were my iMac, which runs Windows 7 (:-)) (specs: i7-4771 CPU 3.5 GHz, 32 GB RAM), and the HIVE (hexacore machine) sold by the company Acquifer (no commercial interest). Until then, we thought the speed of the individual CPUs was the key, not their number, but we were really surprised lately when we tested the new virtual machines (VMs) our IT department set up for us to do some remote processing of very big microscopy datasets (24 cores, 128 to 256 GB RAM for each VM). Although the CPUs on the physical servers are not that fast (2.5 GHz, but is this really a good measure of computation speed? I am not sure...), we measured that our VMs were the fastest machines we have tested so far. So we actually have no theory anymore about ImageJ and speed. It is not clear to us either whether having Windows 7 or Windows Server 2012 makes a difference.

Finally, I should mention that when you use complex processes, for example stitching, the speed of the individual CPUs is also important, as we had the impression that the reading/loading of the file uses only one core. There again, we could see a beautiful correlation between CPU speed (GHz specs) and the process.

Current solution: if we really need to be very fast,
1. we write an ImageJ macro in Python and launch multiple threads in parallel, but we observed that the whole was not "thread safe", i.e. we see "collisions" between the different processes;
2. we write a Python program to launch multiple ImageJ instances in headless mode and pass the macro to them that way.

I would also be delighted to understand what makes ImageJ go fast or slow on a given computer; that would help us purchase the right machines from the beginning.

Very best regards,
Laurent.

___________________________
Laurent Gelman, PhD
Friedrich Miescher Institut
Head, Facility for Advanced Imaging and Microscopy
Light microscopy
WRO 1066.2.16
Maulbeerstrasse 66
CH-4058 Basel
+41 (0)61 696 35 13
+41 (0)79 618 73 69
www.fmi.ch
www.microscopynetwork.unibas.ch/

-----Original Message-----
From: George Patterson [mailto:[hidden email]]
Sent: mercredi 13 juillet 2016 23:55
Subject: Questions regarding multithreaded processing
[quoted text snipped]
Hi all,
Thank you all for your feedback. Below I'll try to respond to the parts I can answer.

> Seeing as this bit is a bit more technical and closer to a plugin development question, would you mind posting it on http://forum.imagej.net
> Long technical email threads like this one tend to get muddy, especially if we try to share code snippets or want to comment on a particular part.

In the future, I'll direct plugin development questions to that forum. I didn't bother sharing code at this point since I just wanted to know what improvements to expect with multi-threaded processing.

> A quick remark though. Seeing as we do not know HOW you implemented the parallel processing, it will be difficult to help.
> Some notes: If you 'simply' make a bunch of threads where each accesses a given pixel in a loop through an atomic integer for example, it is not going to be faster. Accessing a single pixel is extremely fast and what will slow you down is having each thread waiting to get its pixel index.

As I mentioned, I didn't know what to expect, so I wasn't sure I had a problem. The atomic-integer approach is what I used initially. To be clear, the speed does improve with more threads; it just doesn't improve as much as it should, based on the responses from Oli and Michael. Following their suggestions, I changed the code to assign different blocks of the image to different threads. This improved the speed modestly, 5-10%. Thanks for the suggestion. I'll take this approach in any future development.

> IMHO the most important point for efficient parallelization (and efficient Java code anyhow) is avoiding creating lots of objects that need garbage collection (don't care about dozens, but definitely avoid hundreds of thousands or millions of Objects).

Michael, thanks for sharing the list of potential problems. I'll work my way through them as well as I can. The number of objects created is the first thing I started checking.
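[For illustration, the block-per-thread layout described above might look like the following self-contained sketch, which sums each pixel's values across a time series. Class and method names are hypothetical; the real plugin fits a curve per pixel rather than summing:]

```java
// Each thread gets one contiguous block of pixel indices, so threads never
// contend on a shared counter (unlike the atomic-integer scheme).
public class BlockThreadsDemo {

    // data holds nFrames frames of nPixels each, frame-major:
    // data[f * nPixels + p] is pixel p at time point f.
    // Returns, for each pixel, the sum over all time points.
    public static float[] sumOverTime(float[] data, int nPixels, int nFrames, int nThreads) {
        float[] out = new float[nPixels];
        Thread[] threads = new Thread[nThreads];
        int block = (nPixels + nThreads - 1) / nThreads;   // ceiling division
        for (int t = 0; t < nThreads; t++) {
            final int start = t * block;
            final int end = Math.min(start + block, nPixels);
            threads[t] = new Thread(() -> {
                for (int p = start; p < end; p++) {        // this thread's block only
                    float sum = 0;
                    for (int f = 0; f < nFrames; f++)
                        sum += data[f * nPixels + p];
                    out[p] = sum;                          // no two threads share a pixel
                }
            });
            threads[t].start();
        }
        for (Thread th : threads) {
            try { th.join(); }
            catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return out;
    }
}
```

[Each thread reads and writes only its own index range, so no synchronization is needed beyond the final join.]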
A new CurveFitter is created for every pixel, so for a 256x256x200 stack more than 65,000 are created and subjected to garbage collection, I guess. I still haven't found a way around generating this many CurveFitters.

This led me to look more closely at the CurveFitter documentation, and I found https://imagej.nih.gov/ij/docs/curve-fitter.html, where it indicates: "Two threads with two independent minimization runs, with repetition until two identical results are found, to avoid local minima or keeping the result of a stuck simplex." Does this mean that for each thread that generates a new CurveFitter, the CurveFitter generates a second thread on its own? If so, then my plugin is generating twice as many threads as I think, which might explain why my speed improvement is observed only up to about half the number of CPUs. Possible? Probable? No way? Since this is getting into some technical bits which the plugin developers probably know well, I'll take Oli's advice and ask this on the imagej.net forum.

> We made the same kind of tests and experience as you did. We also tested numerous machines with a variable number of cores declared in the ImageJ Option Menu, in combination with different amounts of RAM, without being able to draw really clear conclusions about why it is fast or slow on the respective computers. We also tested different processes, from a simple Gaussian blur to more complex macros.

Laurent, thanks for sharing your experiences. Our issues with different machines might be better answered on another forum (maybe http://forum.imagej.net ). Maybe we should start a new query on just this topic?

Thanks again for the feedback.

George

On Fri, Jul 15, 2016 at 3:49 AM, Gelman, Laurent <[hidden email]> wrote:

> Dear George,
>
> We made the same kind of tests and experience as you did. We also tested numerous machines with a variable number of cores declared in the ImageJ Option Menu, in combination with different amounts of RAM, without being able to draw really clear conclusions about why it is fast or slow on the respective computers. We also tested different processes, from a simple Gaussian blur to more complex macros.
>
> In a nutshell:
> We also observed awful performance on our Microsoft Server 2012 / 32 CPUs / 512 GB RAM machine, irrespective of the combination of CPUs and RAM we declare in ImageJ. Surely, giving more than 16 CPUs to ImageJ does not improve overall speed; sometimes it even decreases it. Note that this very same machine is really fast when using Matlab and the parallel processing toolbox.
> Until recently, the fastest computers we could find to run ImageJ were my iMac, which runs Windows 7 (:-)) (specs: i7-4771 CPU 3.5 GHz, 32 GB RAM), and the HIVE (hexacore machine) sold by the company Acquifer (no commercial interest). Until then, we thought the speed of individual CPUs was the key, less their number, but we were really surprised lately when we tested the new virtual machines (VMs) our IT department set up for us to do some remote processing of very big microscopy datasets (24 cores, 128 to 256 GB RAM for each VM). Although the CPUs on the physical servers are not that fast (2.5 GHz, but is this really a good measure of computation speed? I am not sure...), we measured that our VMs were the fastest machines we have tested so far. So we actually have no theory anymore about ImageJ and speed. It is not clear to us either whether having Windows 7 or Windows Server 2012 makes a difference.
> Finally, I should mention that when you use complex processes, for example Stitching, the speed of the individual CPUs is also important, as we had the impression that the reading/loading of the file uses only one core. There again, we could see a beautiful correlation between CPU speed (GHz specs) and the process.
>
> Current solution:
> If we really need to be very fast,
> 1. we write an ImageJ macro in python and launch multiple threads in parallel, but we observed that the whole was not "thread safe", i.e. we see "collisions" between the different processes.
> 2. we write a python program to launch multiple ImageJ instances in headless mode and parse the macro this way.
>
> I would also be delighted to understand what makes ImageJ go fast or slow on a computer; that would help us purchase the right machines from the beginning.
>
> Very best regards,
>
> Laurent.
>
> ___________________________
> Laurent Gelman, PhD
> Friedrich Miescher Institut
> Head, Facility for Advanced Imaging and Microscopy
> Light microscopy
> WRO 1066.2.16
> Maulbeerstrasse 66
> CH-4058 Basel
> +41 (0)61 696 35 13
> +41 (0)79 618 73 69
> www.fmi.ch
> www.microscopynetwork.unibas.ch/
>
> -----Original Message-----
> From: George Patterson [mailto:[hidden email]]
> Sent: Wednesday, 13 July 2016 23:55
> Subject: Questions regarding multithreaded processing
|
Hi George,
as curve fitting is an essential part of your plugin, probably I should answer. (I am responsible for the two threads and other code in the CurveFitter. You would not reach me on the developers' mailing list, as my main occupation is not in computing or image processing; I contribute to ImageJ only now and then, when I need something for my own work or think I might eventually need it.)

The CurveFitter usually uses the Minimizer, which indeed uses two threads. It does not use the Minimizer (and thus uses only one thread) for the linear regression fits:
- Straight Line: y = a + bx
- Exponential (linear regression): y = a*exp(bx)
- Power (linear regression): y = a*x^b

If it's not linear regression, you can disable the use of two Minimizer threads with

  myCurveFitter.getMinimizer().setMaximumThreads(1);

(obviously, before you call myCurveFitter.doFit). When using many parallel threads, this will also speed things up slightly: if the Minimizer does not find two consistent solutions immediately, with two threads it will give it two more tries (a total of four), while with one thread it may get two consistent solutions after a total of three tries.

If it is linear regression, ask me again; somewhere I have a linear regression class that can be reused without creating a new object. If your curve fitting problem is linear in all parameters, but not a linear regression with only two parameters (e.g. polynomial, a*sin(x)+b*cos(x)+c, etc.), it would be faster to use the analytical solution instead of the CurveFitter, but it is more programming effort (you could try to find a suitable library such as Apache Commons Math or Jama).
If your problem is nonlinear but has one of the forms
- a + b*function(c, d, ...; x),
- a + function(b, c, d, ...; x),
- a + b*x + function(c, d, ...; x), or
- a*x + function(b, c, d, ...; x),
you can speed it up a lot by eliminating one or two parameters with myCurveFitter.setOffsetMultiplySlopeParams.

By the way, creating a new CurveFitter also creates several other objects, so having one per pixel really induces a lot of garbage collection.

If creating many CurveFitters or Minimizers is common (anyone out there who also does this?), we should consider making the CurveFitter and Minimizer reusable (e.g. with a CurveFitter.setData(double[] xData, double[] yData) method, which clears the previous result and settings).

Best regards,
Michael

________________________________________________________________
On 2016-07-15 18:16, George Patterson wrote:
> [...]
|
Michael,
Thanks for the quick response, and for providing us with the CurveFitter.

> as curve fitting is an essential part of your plugin, probably I should answer (I am responsible for the two threads and more code in the CurveFitter. You would not reach me on the developers' mailing list as my main occupation is not in computing or image processing...)

Good to know. I'll let the other forum know the answer is over here.

> The CurveFitter usually uses the Minimizer, which uses two threads, indeed. It does not use the Minimizer (and thus, only one thread) for linear regression fits: Straight Line, Exponential (linear regression), and Power (linear regression).

I am using "Exponential with offset", so no linear regression.

> If it's not linear regression, you can disable using two Minimizer threads by
> myCurveFitter.getMinimizer().setMaximumThreads(1);
> (obviously, before you call myCurveFitter.doFit)

Using the machine below again:

4. Windows 7 Xi MTower 2P64 Workstation, 2 x 2.1 GHz AMD Opteron 6272
number of cpus shown in resource monitor: 32

With the maximum threads set to one (previous times for comparison):
threaded (1) ~205.3 sec (previously 158 sec)
threaded (2) ~108.9 sec (85.1 sec)
threaded (4) ~64.5 sec (46 sec)
threaded (8) ~35.2 sec (22.9 sec)
threaded (10) ~28.1 sec (18.6 sec)
threaded (12) ~24.6 sec (16.4 sec)
threaded (16) ~17.7 sec (15.8 sec)
threaded (20) ~15.1 sec (15.7 sec)
threaded (24) ~13.3 sec (15.9 sec)
threaded (32) ~10 sec (16 sec)

The improvement is now much closer to linear, though it is slower with fewer threads than before. I bet you know why. Care to educate me?

> By the way, creating a new CurveFitter also creates several other objects, so having one per pixel really induces a lot of garbage collection.
> So in addition to producing more threads than I originally thought, my major limitation is probably the amount of garbage I'm producing. Correct? > > If creating many CurveFitters or Minimizers is more common (anyone out > there who also does this?) we should consider making the CurveFitter and > Minimizer reusable (e.g. with a CurveFitter.setData(double[] xData, > double[] yData) method, which clears the previous result and settings). > Obviously I'm in favor but that sounds like it might take a bit of effort by someone in the know. Thanks again for your help. Best, George > _________________________ > > On 2016-07-15 18:16, George Patterson wrote: > >> Hi all, >> >> Thank you all for your feedback. Below I'll try to respond to the parts I >> can answer. >> >> >> Seeing as this bit is a bit more technical and closer to a plugin >>> >> development question, would you mind posting it on >> http://forum.imagej.net >> >> Long technical email threads like this one tend to get muddy, especially >>> >> if we try to share code snippets or want to comment on a particular part. >> >> In the future, I’ll direct plugin development questions to that forum. I >> didn’t bother sharing code since at this point since I just wanted to know >> what improvements to expect with multi-threaded processing. >> >> >> A quick remark though. Seeing as we do not know HOW you implemented the >>> >> parallel processing, it will be difficult to help. >> >> Some notes: If you 'simply' make a bunch of threads where each accesses a >>> >> given pixel in a loop through an atomic integer for example, it is not >> going to be faster. Accessing a single pixel is extremely fast >and what >> will slow you down is having each thread waiting to get its pixel index. >> >> As I mentioned, I didn’t know what to expect so I wasn’t sure I had a >> problem. The atomic integer approach is what I used initially. 
To be >> clear, the speed does improve with more threads, it just doesn’t improve >> as >> much as it should based on the responses by Oli and Micheal. Based on >> suggestions from Oli and Micheal, I changed the code to designate >> different >> blocks of the image to different threads. This seemed to improve the >> speed >> modestly 5-10%. Thanks for the suggestion. I’ll take this approach for >> any future developments. >> >> >> IMHO the most important point for efficient parallelization (and efficient >>> >> Java code anyhow) is avoiding creating lots of objects that need garbage >> collection (don't care about dozens, but definitely avoid >hundreds of >> thousands or millions of Objects). >> >> Micheal thanks for sharing the list of potential problems. I’ll work my >> way through them as well as I can. The number of objects created is the >> first I started checking. A new Curvefitter is created for every pixel so >> for a 256x256x200 stack >65000 are created and subjected to garbage >> collection I guess. I still haven’t found a way around generating this >> many curvefitters. >> >> This led me to looking more closely at the Curvefitter documentation and I >> found this https://imagej.nih.gov/ij/docs/curve-fitter.html where it >> indicates “Two threads with two independent minimization runs, with >> repetition until two identical results are found, to avoid local minima or >> keeping the result of a stuck simplex.” Does this mean that for each >> thread that generates a new Curvefitter, the Curvefitter generates a >> second >> thread on its own? If so, then my plugin is generating twice as many >> threads as I think and might explain why my speed improvement is observed >> only to about half the number of cpus. Possible? Probable? No way? Since >> this is maybe getting into some technical bits which the plugin developers >> probably know well, I’ll take Oli's advice ask this on the imagej.net >> forum. 
>>> We made the same kind of tests and experience as you did. We also tested
>>> numerous machines with a variable number of cores declared in the ImageJ
>>> Option Menu, in combination with different amounts of RAM, without being
>>> able to draw really clear conclusions about why it is fast or slow on the
>>> respective computers. We also tested different processes, from a simple
>>> Gaussian blur to more complex macros.
>>
>> Laurent, thanks for sharing your experiences. Our issues with different
>> machines might be better answered on another forum (maybe
>> http://forum.imagej.net ). Maybe we should start a new query on just this
>> topic?
>>
>> Thanks again for the feedback.
>>
>> George
>>
>> On Fri, Jul 15, 2016 at 3:49 AM, Gelman, Laurent <[hidden email]> wrote:
>>
>>> Dear George,
>>>
>>> We made the same kind of tests and experience as you did. We also tested
>>> numerous machines with a variable number of cores declared in the ImageJ
>>> Option Menu, in combination with different amounts of RAM, without being
>>> able to draw really clear conclusions about why it is fast or slow on the
>>> respective computers. We also tested different processes, from a simple
>>> Gaussian blur to more complex macros.
>>>
>>> In a nutshell:
>>> We also observed awful performance on our Microsoft Server 2012 / 32 CPUs
>>> / 512 GB RAM machine, irrespective of the combination of CPUs and RAM we
>>> declare in ImageJ. Giving more than 16 CPUs to ImageJ does not improve
>>> overall speed; sometimes it even decreases it. Note that this very same
>>> machine is really fast when using Matlab and the parallel processing
>>> toolbox.
>>> Until recently, the fastest computers we could find to run ImageJ were my
>>> iMac, which runs Windows 7 (:-)) (specs: i7-4771 CPU 3.5GHz, 32GB RAM),
>>> and the HIVE (hexacore machine) sold by the company Acquifer (no
>>> commercial interest).
Until then, we thought the speed of individual CPUs is the >>> key, >>> less their numbers, but we got really surprised lately when we tested the >>> new virtual machines (VMs) our IT department set up for us to do some >>> remote processing of very big microscopy datasets (24 cores, 128 to 256 >>> GB >>> RAM for each VM). Although the CPUs on the physical servers are not that >>> fast (2.5 GHz, but is this really a good measure of computation speed? I >>> am >>> not sure...), we measured that our VMs were the fastest machines we >>> tested >>> so far. So we have actually no theory anymore about ImageJ and speed. It >>> is >>> not clear to us either, whether having Windows 7 or Windows server 2012 >>> makes a difference. >>> Finally, I should mention that when you use complex processes, for >>> example >>> Stitching, the speed of the individual CPUs is also important, as we had >>> the impression that the reading/loading of the file uses only one core. >>> There again, we could see a beautiful correlation between CPU speed (GHz >>> specs) and the process. >>> >>> Current solution: >>> If we really need to be very fast, >>> 1. we write an ImageJ macro in python and launch multiple threads in >>> parallel, but we observed that the whole was not "thread safe", i.e. we >>> see >>> "collisions" between the different processes. >>> 2. we write a python program to launch multiple ImageJ instances in a >>> headless mode and parse the macro this way. >>> >>> I would be also delighted to understand what makes ImageJ go fast or slow >>> on a computer, that would help us to purchase the right machines from the >>> beginning. >>> >>> Very best regards, >>> >>> Laurent. 
>>> ___________________________
>>> Laurent Gelman, PhD
>>> Friedrich Miescher Institut
>>> Head, Facility for Advanced Imaging and Microscopy
>>> Light microscopy
>>> WRO 1066.2.16
>>> Maulbeerstrasse 66
>>> CH-4058 Basel
>>> +41 (0)61 696 35 13
>>> +41 (0)79 618 73 69
>>> www.fmi.ch
>>> www.microscopynetwork.unibas.ch/
>>>
>>> -----Original Message-----
>>> From: George Patterson [mailto:[hidden email]]
>>> Sent: Wednesday, 13 July 2016 23:55
>>> Subject: Questions regarding multithreaded processing
>>>
>>> Dear all,
>>> I've assembled a plugin to analyze a time series on a pixel-by-pixel
>>> basis. It works fine but is slow.
>>> There are likely still plenty of optimizations that can be done to
>>> improve the speed, and thanks to Albert Cardona and Stephan Preibisch
>>> sharing code and tutorials
>>> ( http://albert.rierol.net/imagej_programming_tutorials.html ),
>>> I even have a version that runs multi-threaded.
>>> When run on multi-core machines the speed is improved, but I'm not sure
>>> what sort of improvement I should expect. Moreover, the machines I
>>> expected to be the fastest are not. This likely stems from my
>>> misunderstanding of parallel processing and Java programming in
>>> general, so I'm hoping some of you with more experience can provide
>>> some feedback.
>>> I list below some observations and questions along with test runs on
>>> the same data set using the same plugin on a few different machines.
>>> Thanks for any suggestions.
>>> George
>>>
>>> Since the processing speeds differ, I realize the speeds of each machine
>>> to complete the analysis will differ. I'm more interested in the
>>> improvement with multiple threads on an individual machine.
>>> In running these tests, I altered the code to use a different number of
>>> threads in each run.
>>> Is setting the number of threads in the code and determining the time to
>>> finish the analysis a valid approach to testing improvement?
>>> Machine 5 is producing some odd behavior which I'll discuss and ask for
>>> suggestions below.
>>>
>>> For machines 1-4, the speed improves with the number of threads up to
>>> about half the number of available processors.
>>> Do the improvements with the number of threads listed below seem
>>> reasonable?
>>> Is the improvement up to only about half the number of available
>>> processors due to "hyperthreading"? My limited (and probably wrong)
>>> understanding is that hyperthreading makes a single core appear to be
>>> two which share resources, and thus a machine with 2 cores will return
>>> 4 when queried for the number of cpus. Yes, I know that is too
>>> simplistic, but it's the best I can do.
>>> Could it simply be that my code is not written properly to take
>>> advantage of hyperthreading? Could anyone point me to a source and/or
>>> example code explaining how I could change it to take advantage of
>>> hyperthreading if this is the problem?
>>>
>>> The number of threads used is shown in parentheses where applicable.
>>>
>>> 1. MacBook Pro 2.66 GHz Intel Core i7
>>> number of processors: 1
>>> Number of cores: 2
>>> non-threaded plugin version ~59 sec
>>> threaded (1) ~51 sec
>>> threaded (2) ~36 sec
>>> threaded (3) ~34 sec
>>> threaded (4) ~35 sec
>>>
>>> 2. Mac Pro 2 x 2.26 GHz Quad-Core Intel Xeon
>>> number of processors: 2
>>> Number of cores: 8
>>> non-threaded plugin version ~60 sec
>>> threaded (1) ~59 sec
>>> threaded (2) ~28.9 sec
>>> threaded (4) ~15.6 sec
>>> threaded (6) ~13.2 sec
>>> threaded (8) ~11.3 sec
>>> threaded (10) ~11.1 sec
>>> threaded (12) ~11.1 sec
>>> threaded (16) ~11.5 sec
>>>
>>> 3. Windows 7 DELL 3.2 GHz Intel Core i5
>>> number of cpus shown in resource monitor: 4
>>> non-threaded plugin version ~45.3 sec
>>> threaded (1) ~48.3 sec
>>> threaded (2) ~21.7 sec
>>> threaded (3) ~20.4 sec
>>> threaded (4) ~21.8 sec
>>>
>>> 4.
>>> Windows 7 Xi MTower 2P64 Workstation 2 x 2.1 GHz AMD Opteron 6272
>>> number of cpus shown in resource monitor: 32
>>> non-threaded plugin version ~162 sec
>>> threaded (1) ~158 sec
>>> threaded (2) ~85.1 sec
>>> threaded (4) ~46 sec
>>> threaded (8) ~22.9 sec
>>> threaded (10) ~18.6 sec
>>> threaded (12) ~16.4 sec
>>> threaded (16) ~15.8 sec
>>> threaded (20) ~15.7 sec
>>> threaded (24) ~15.9 sec
>>> threaded (32) ~16 sec
>>>
>>> For machines 1-4, the cpu usage can be observed in the Activity Monitor
>>> (Mac) or Resource Monitor (Windows), and during the execution of the
>>> plugin all of the cpus were active. For machine 5 shown below, only 22
>>> of the 64 show activity. And it is not always the same 22. From the
>>> example runs below you can see it really isn't performing very well
>>> considering the number of available cores. I originally thought this
>>> machine should be the best, but it barely outperforms my laptop. This
>>> is probably a question for another forum, but I am wondering if anyone
>>> else has encountered anything similar.
>>>
>>> 5. Windows Server 2012 Xi MTower 2P64 Workstation 4 x 2.4 GHz AMD
>>> Opteron 6378
>>> number of cpus shown in resource monitor: 64
>>> non-threaded plugin version ~140 sec
>>> threaded (1) ~137 sec
>>> threaded (4) ~60.3 sec
>>> threaded (8) ~29.3 sec
>>> threaded (12) ~22.9 sec
>>> threaded (16) ~23.8 sec
>>> threaded (24) ~24.1 sec
>>> threaded (32) ~24.5 sec
>>> threaded (40) ~24.8 sec
>>> threaded (48) ~23.8 sec
>>> threaded (64) ~24.8 sec
--
ImageJ mailing list: http://imagej.nih.gov/ij/list.html
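[Editor's note] The reusable-fitter idea endorsed in the message above (a setData(double[], double[]) method that clears the previous result, so one instance per thread can serve many fits without per-pixel allocation) can be illustrated with a toy, self-contained class. This is a hypothetical sketch, not ImageJ's actual CurveFitter; for simplicity it fits a straight line y = a + b*x:

```java
/**
 * Toy sketch of the proposed reuse pattern: allocate one fitter per
 * worker thread and re-prime it with setData() for each pixel, so no
 * new objects are created (and garbage-collected) per fit.
 */
public class ReusableLineFitter {
    private double[] x, y;
    private double a, b;
    private boolean done;

    /** Replaces the data and clears the previous result and state. */
    public void setData(double[] xData, double[] yData) {
        x = xData; y = yData;
        a = b = 0; done = false;
    }

    /** Ordinary least-squares fit of y = a + b*x. */
    public void doFit() {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        a = (sy - b * sx) / n;
        done = true;
    }

    public double[] getParams() {
        if (!done) throw new IllegalStateException("call doFit() first");
        return new double[] { a, b };
    }
}
```

The same instance can be reused across an entire block of pixels, which is exactly what the proposed CurveFitter.setData() would enable.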
Hi George,
concerning the comparison with one (new) or two (previous) Minimizer threads:

>                setting the maximum threads to one    previous
> threaded (1)   ~205.3 sec                            158 sec
> threaded (2)   ~108.9 sec                            85.1 sec
> threaded (4)   ~64.5 sec                             46 sec
> threaded (8)   ~35.2 sec                             22.9 sec
> threaded (10)  ~28.1 sec                             18.6 sec
> threaded (12)  ~24.6 sec                             16.4 sec
> threaded (16)  ~17.7 sec                             15.8 sec
> threaded (20)  ~15.1 sec                             15.7 sec
> threaded (24)  ~13.3 sec                             15.9 sec
> threaded (32)  ~10 sec                               16 sec

For the minimizing operation itself, the 'previous' case has twice the number of threads due to the Minimizer, so it was actually minimizing with 2 to 64 threads. This explains why there was no gain in the previous version when increasing the number of threads from 16 to 32 (with 32 processors): it was actually an increase from 32 to 64.

For comparing the speed with one or two Minimizer threads, this means that you have to compare like the following:

               new, one minimizer thread   previous
threaded (2)   ~108.9 sec                  158 sec
threaded (4)   ~64.5 sec                   85.1 sec
threaded (8)   ~35.2 sec                   46 sec
threaded (16)  ~17.7 sec                   22.9 sec
threaded (32)  ~10 sec                     15.8 sec
threaded (64)                              16 sec

So it clearly helps to use one Minimizer thread; possibly the main reason is avoiding the overhead of creating a Minimizer thread for each pixel and the accompanying synchronization between the two Minimizer threads.

The table also tells us that the gain from parallelization is not so bad: a factor of 20 from 1 to 32 threads, so the total time is not dominated by the 'stop the world' events for garbage collection or by memory bandwidth.

--

Concerning the curve fitting problem, "Exponential with offset", y = a*exp(-bx) + c:

The CurveFitter eliminates two parameters (a, c) by linear regression, so it actually performs a one-dimensional minimization. I guess that this problem is well-behaved and the Minimizer always finds the correct result in the first attempt. Then it is not necessary to try a second run.
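[Editor's note] The point above about eliminating 'a' and 'c' by linear regression can be made concrete with a small self-contained sketch (hypothetical class name, not ImageJ's implementation): once 'b' is fixed, the model y = a*exp(-bx) + c is linear in 'a' and 'c', so both follow directly from the 2x2 normal equations and the Minimizer only has to search along the single dimension 'b'.

```java
/**
 * For a fixed 'b', solve the linear least-squares problem in 'a' and
 * 'c' for the model y = a*exp(-b*x) + c. This reduces the nonlinear
 * fit to a one-dimensional search over 'b'.
 */
public class ExpOffsetFit {

    /** Returns {a, c} minimizing sum over i of (a*exp(-b*x[i]) + c - y[i])^2. */
    public static double[] solveACforB(double[] x, double[] y, double b) {
        int n = x.length;
        double se = 0, se2 = 0, sy = 0, sey = 0;
        for (int i = 0; i < n; i++) {
            double e = Math.exp(-b * x[i]);   // basis value exp(-b*x) at x[i]
            se += e; se2 += e * e; sy += y[i]; sey += e * y[i];
        }
        // Normal equations:  [se2 se; se n] * [a; c] = [sey; sy]
        double det = se2 * n - se * se;
        double a = (sey * n - se * sy) / det;
        double c = (se2 * sy - se * sey) / det;
        return new double[] { a, c };
    }
}
```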
So you can try:

  myCurveFitter.getMinimizer().setMaxRestarts(0);

This makes the Minimizer run only once, with no second try to make sure the result is correct. It also avoids a second thread. I would suggest that you try it and compare whether the result is the same (there might be tiny differences since minimization is stochastic and the accuracy is finite).

If it works as I expect, it should cut the time for minimization to 1/2. If the decrease in processing time is comparable, it would mean that computing time is still dominated by the Minimizer, not garbage collection (and the rest of processing each pixel, including memory access). If the speed gain is only marginal, it would indicate that optimization should focus on garbage collection and the non-minimizer operations per pixel.

What you might also do to speed up the process: if you have a good guess for the 'b' parameter and the typical uncertainty of this guess, specify them in the initialParams and initialParamVariations. E.g. if 'b' does not change much between neighboring pixels, use the previous value for initialization. The default value of the initialParamVariations for 'b' is 10% of the specified 'b' value.
Don't worry about the initial 'a' and 'c' parameters and their ranges; these values will be ignored.

HTH,
Michael

____________________________________________________________________

On Fri, July 15, 2016 22:03, George Patterson wrote:
> Michael,
> Thanks for the quick response and for providing us with CurveFitter.
>
>> as curve fitting is an essential part of your plugin, probably I should
>> answer (I am responsible for the two threads and more code in the
>> CurveFitter. You would not reach me on the developers' mailing list, as
>> my main occupation is not in computing or image processing, and I
>> contribute to ImageJ only now and then, if I need it for my own work or
>> I think I might eventually need it).
>
> Good to know.
> I'll let the other forum know the answer is over here.
>
>> The CurveFitter usually uses the Minimizer, which uses two threads,
>> indeed. It does not use the Minimizer (and thus, only one thread) for
>> linear regression fits:
>> - Straight Line 'y = a+bx'
>> - "Exponential (linear regression)" 'y = a*exp(bx)', and
>> - "Power (linear regression)" 'a*x^b'.
>
> I am using "Exponential with offset", so no linear regression.
>
>> If it's not linear regression, you can disable using two Minimizer
>> threads by
>>   myCurveFitter.getMinimizer().setMaximumThreads(1);
>> (obviously, before you call myCurveFitter.doFit)
>
> Using the machine below again.
>
> 4. Windows 7 Xi MTower 2P64 Workstation 2 x 2.1 GHz AMD Opteron 6272
> number of cpus shown in resource monitor: 32
>
>                setting the maximum threads to one    previous
> threaded (1)   ~205.3 sec                            158 sec
> threaded (2)   ~108.9 sec                            85.1 sec
> threaded (4)   ~64.5 sec                             46 sec
> threaded (8)   ~35.2 sec                             22.9 sec
> threaded (10)  ~28.1 sec                             18.6 sec
> threaded (12)  ~24.6 sec                             16.4 sec
> threaded (16)  ~17.7 sec                             15.8 sec
> threaded (20)  ~15.1 sec                             15.7 sec
> threaded (24)  ~13.3 sec                             15.9 sec
> threaded (32)  ~10 sec                               16 sec
>
> The improvement is much closer to linear. It is slower with fewer
> threads than before. I bet you know why. Care to educate me?
>
>> By the way, creating a new CurveFitter also creates several other
>> objects, so having one per pixel really induces a lot of garbage
>> collection.
>
> So in addition to producing more threads than I originally thought, my
> major limitation is probably the amount of garbage I'm producing.
> Correct?
>
>> If creating many CurveFitters or Minimizers is more common (anyone out
>> there who also does this?) we should consider making the CurveFitter
>> and Minimizer reusable (e.g. with a CurveFitter.setData(double[] xData,
>> double[] yData) method, which clears the previous result and settings).
>
> Obviously I'm in favor but that sounds like it might take a bit of effort
> by someone in the know.
>
> Thanks again for your help.
>
> Best,
> George

--
ImageJ mailing list: http://imagej.nih.gov/ij/list.html
Hi Michael,
Thanks for the feedback.

> For comparing the speed with one or two Minimizer threads, this means
> that you have to compare like the following:
>
>                new, one minimizer thread   previous
> threaded (2)   ~108.9 sec                  158 sec
> threaded (4)   ~64.5 sec                   85.1 sec
> threaded (8)   ~35.2 sec                   46 sec
> threaded (16)  ~17.7 sec                   22.9 sec
> threaded (32)  ~10 sec                     15.8 sec
> threaded (64)                              16 sec

Of course. Sorry for the mix-up.

> So it clearly helps to use one Minimizer thread; possibly the main reason
> is avoiding the overhead of creating a Minimizer thread for each pixel
> and the accompanying synchronization between the two Minimizer threads.
>
> The table also tells that the gain with parallelization is not so bad:
> a factor of 20 from 1 to 32 threads, so the total time is not dominated
> by the 'stop the world' events for garbage collection or memory
> bandwidth.
>
> --
>
> Concerning the curve fitting problem, "Exponential with offset",
> y = a*exp(-bx) + c:
>
> The CurveFitter eliminates two parameters (a, c) by linear regression,
> so it actually performs a one-dimensional minimization. I guess that
> this problem is well-behaved and the Minimizer always finds the correct
> result in the first attempt. Then it is not necessary to try a second
> run.
> So you can try:
>   myCurveFitter.getMinimizer().setMaxRestarts(0);
> This makes the Minimizer run only once, with no second try to make sure
> the result is correct. It also avoids a second thread.
> I would suggest that you try it and compare whether the result is the
> same (there might be tiny differences since minimization is stochastic
> and the accuracy is finite).

Some new comparisons are below to see if it is behaving as you expect.

4. Windows 7 Xi MTower 2P64 Workstation 2 x 2.1 GHz AMD Opteron 6272
number of cpus shown in resource monitor: 32
myCurveFitter.getMinimizer().setMaxRestarts(0);
threads (1)   103.5 sec
threads (2)   55 sec
threads (4)   33.3 sec
threads (8)   18.2 sec
threads (16)  9.7 sec
threads (32)  5.6 sec

Seems to make it even faster. I think this is what you predicted.

4. Windows 7 Xi MTower 2P64 Workstation 2 x 2.1 GHz AMD Opteron 6272
number of cpus shown in resource monitor: 32
myCurveFitter.getMinimizer().setMaximumThreads(1);
myCurveFitter.getMinimizer().setMaxRestarts(0);
threads (1)   205.3 sec
threads (2)   107.6 sec
threads (4)   64.9 sec
threads (8)   34.9 sec
threads (16)  17.9 sec
threads (32)  10.2 sec

There is likely no reason to use both of these commands together, but it seems to give the same result as using only myCurveFitter.getMinimizer().setMaximumThreads(1); I included this just to see if this is expected behavior.

> If it works as I expect, it should cut the time for minimization to 1/2.
> If the decrease in processing time is comparable, it would mean that
> computing time is still dominated by the Minimizer, not garbage
> collection (and the rest of processing each pixel, including memory
> access). If the speed gain is only marginal, it would indicate that
> optimization should focus on garbage collection and the non-minimizer
> operations per pixel.
>
> What you might also do to speed up the process: if you have a good guess
> for the 'b' parameter and the typical uncertainty of this guess, specify
> them in the initialParams and initialParamVariations. E.g. if 'b' does
> not change much between neighboring pixels, use the previous value for
> initialization. The default value of the initialParamVariations for 'b'
> is 10% of the specified 'b' value.
> Don't worry about the initial 'a' and 'c' parameters and their ranges;
> these values will be ignored.

Thanks for the suggestions. They've given me some ideas to incorporate into the plugin.
I wasn't really expecting any miracle speed up for my plugin. Just the suggestions you've made have vastly improved it. I do notice small differences in the final results with the different versions 2 threads versus setMaximumThreads versus setMaxRestarts, but these seem to be at the 8th and 9th decimal places. And thanks again for all your help. George -- ImageJ mailing list: http://imagej.nih.gov/ij/list.html |
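For reference, the per-pixel threading pattern discussed in this thread (a fixed number of worker threads pulling pixels from a shared counter, as in the Cardona/Preibisch tutorials) can be sketched in plain Java. This is only an illustration: `fitPixel` is a stand-in for the real per-pixel CurveFitter call, and the class name is invented for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PixelWiseFit {
    // Stand-in for the real per-pixel fit; the actual plugin would run
    // an ImageJ CurveFitter on this pixel's time series instead.
    static double fitPixel(float[] timeSeries) {
        double sum = 0;
        for (float v : timeSeries) sum += v;
        return sum / timeSeries.length;
    }

    // Split the pixels across nThreads workers; each worker repeatedly
    // claims the next unprocessed pixel via an AtomicInteger, so the
    // load balances itself even if some pixels fit slower than others.
    static double[] run(final float[][] pixels, int nThreads) {
        final double[] result = new double[pixels.length];
        final AtomicInteger next = new AtomicInteger(0);
        Thread[] threads = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            threads[t] = new Thread(() -> {
                for (int i = next.getAndIncrement(); i < pixels.length;
                         i = next.getAndIncrement())
                    result[i] = fitPixel(pixels[i]);
            });
            threads[t].start();
        }
        for (Thread th : threads) {
            try { th.join(); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy data: pixel i has a constant time series of value i,
        // so its "fit" (the mean) is simply i.
        float[][] pixels = new float[1000][10];
        for (int i = 0; i < pixels.length; i++)
            for (int j = 0; j < 10; j++) pixels[i][j] = i;
        double[] r = run(pixels, 4);
        System.out.println(r[0] + " " + r[999]);  // prints 0.0 999.0
    }
}
```

Timing such a run with different hard-coded thread counts, as done in the tables above, is a reasonable way to measure scaling, provided each measurement is repeated a few times to average out JIT warm-up and garbage collection.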
Hi George,
concerning setMaximumThreads(1) and setMaxRestarts(0): I guess that your text should read setMaximumThreads(0) and setMaxRestarts(0)? There is a bug in my Minimizer code, which makes it perform two runs instead of one in that case. I have asked Wayne to fix it.

By the way, I just found out that the CurveFitter has no way to specify the initialParamVariations for the built-in functions (setting the initialParamVariations on the Minimizer won't help; the CurveFitter will override it with its own default values). So please forget my idea of specifying the initialParamVariations (= the typical accuracy of the initial parameters given by the user). Nevertheless, specifying a starting value for the 'b' parameter of the "Exponential with offset" fit, y = a*exp(-bx) + c, should be beneficial for speed (use CurveFitter.setInitialParameters, not the Minimizer method).

If there are many people out there doing fits with functions that need one-dimensional minimization (2nd-order polynomials, and the exp/log/power fits that are not simply done by regression), and speed is a general issue, one might eventually consider modifying the Minimizer to add a different algorithm for better convergence in the one-dimensional case. The current Nelder-Mead simplex algorithm is quite good for general (also badly behaved) few-dimensional problems, but rather inefficient for the final steps of well-behaved one-dimensional minimization problems.

Michael

________________________________________________________________
On 2016-07-18 17:30, George Patterson wrote:
> [...]
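The one-dimensional reduction described above (eliminate a and c by linear regression, then minimize over b alone) can be illustrated with a self-contained toy. This is not ImageJ's CurveFitter/Minimizer code: the class and method names are invented for the example, and a simple golden-section search stands in for the Nelder-Mead simplex.

```java
public class ExpOffsetFit {
    // For a fixed b, fit y = a*exp(-b*x) + c: with z = exp(-b*x) this is
    // ordinary linear regression y = a*z + c, so a and c have closed forms.
    // Returns the resulting sum of squared residuals as a function of b.
    static double sse(double[] x, double[] y, double b) {
        int n = x.length;
        double sz = 0, sy = 0, szz = 0, szy = 0;
        for (int i = 0; i < n; i++) {
            double z = Math.exp(-b * x[i]);
            sz += z; sy += y[i]; szz += z * z; szy += z * y[i];
        }
        double a = (n * szy - sz * sy) / (n * szz - sz * sz);
        double c = (sy - a * sz) / n;
        double s = 0;
        for (int i = 0; i < n; i++) {
            double r = y[i] - (a * Math.exp(-b * x[i]) + c);
            s += r * r;
        }
        return s;
    }

    // One-dimensional golden-section search for the best b on [lo, hi];
    // assumes sse(b) is unimodal on the interval.
    static double fitB(double[] x, double[] y, double lo, double hi) {
        final double g = (Math.sqrt(5) - 1) / 2;  // golden ratio conjugate
        double b1 = hi - g * (hi - lo), b2 = lo + g * (hi - lo);
        double f1 = sse(x, y, b1), f2 = sse(x, y, b2);
        while (hi - lo > 1e-10) {
            if (f1 < f2) {
                hi = b2; b2 = b1; f2 = f1;
                b1 = hi - g * (hi - lo); f1 = sse(x, y, b1);
            } else {
                lo = b1; b1 = b2; f1 = f2;
                b2 = lo + g * (hi - lo); f2 = sse(x, y, b2);
            }
        }
        return (lo + hi) / 2;
    }

    public static void main(String[] args) {
        // Noiseless synthetic data: y = 3*exp(-0.5*x) + 1
        double[] x = new double[20], y = new double[20];
        for (int i = 0; i < 20; i++) {
            x[i] = i * 0.5;
            y[i] = 3 * Math.exp(-0.5 * x[i]) + 1;
        }
        System.out.printf("b = %.6f%n", fitB(x, y, 0.01, 5.0));  // should recover b close to 0.5
    }
}
```

This also shows why a good starting guess for b (e.g. the value from the neighboring pixel) pays off: the whole fit is just a search along one axis, so a tighter starting bracket means fewer function evaluations per pixel.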