
Re: Questions regarding multithreaded processing

Posted by Michael Schmid on Jul 16, 2016; 11:18am
URL: http://imagej.273.s1.nabble.com/Questions-regarding-multithreaded-processing-tp5016878p5016893.html

Hi George,

concerning the comparison with one (new) or two (previous) Minimizer threads:

> setting the maximum threads to one                      previous
> threaded (1) ~205.3 sec                                 158 sec
> threaded (2) ~108.9 sec                                  85.1 sec
> threaded (4) ~64.5 sec                                   46 sec
> threaded (8) ~35.2 sec                                   22.9 sec
> threaded (10) ~28.1 sec                                  18.6 sec
> threaded (12) ~24.6 sec                                  16.4 sec
> threaded (16) ~17.7 sec                                  15.8 sec
> threaded (20) ~15.1 sec                                  15.7 sec
> threaded (24) ~13.3 sec                                  15.9 sec
> threaded (32) ~10 sec                                    16 sec

For the minimizing operation itself, the 'previous' case has twice the
number of threads due to the Minimizer, so it was actually minimizing with
2 to 64 threads.
This explains why there was no gain in the previous version when
increasing the number of threads from 16 to 32 (with 32 processors): it
was actually an increase from 32 to 64.
For comparing the speed with one or two Minimizer threads, this means that
you have to compare as follows:

new, one Minimizer thread                                previous
 threaded (2) ~108.9 sec                                 158 sec
 threaded (4) ~64.5 sec                                   85.1 sec
 threaded (8) ~35.2 sec                                   46   sec
 threaded (16) ~17.7 sec                                  22.9 sec
 threaded (32) ~10 sec                                    15.8 sec
 threaded (64)                                            16 sec

So it clearly helps to use one Minimizer thread; possibly the main reason
is avoiding the overhead of creating a Minimizer thread for each pixel and
the accompanying synchronization between the two Minimizer threads.

The table also shows that the gain from parallelization is not so bad: a
factor of 20 from 1 to 32 threads, so the total time is not dominated by
'stop the world' garbage-collection events or by memory bandwidth.
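As an aside, parallelizing over contiguous pixel blocks (rather than over a
shared atomic pixel counter) can be sketched in plain Java. This is a
standalone toy, not the actual plugin code; processPixel is a hypothetical
stand-in for the per-pixel curve fit:

```java
import java.util.Arrays;

public class BlockThreads {
    // Hypothetical stand-in for the real per-pixel work (e.g. a curve fit).
    static double processPixel(double v) { return Math.sqrt(v) + 1.0; }

    // Each thread processes one contiguous block of pixels, so there is no
    // shared counter to synchronize on.
    static double[] processAll(double[] pixels, int nThreads) {
        double[] out = new double[pixels.length];
        Thread[] workers = new Thread[nThreads];
        int block = (pixels.length + nThreads - 1) / nThreads;  // ceiling division
        for (int t = 0; t < nThreads; t++) {
            final int from = Math.min(pixels.length, t * block);
            final int to = Math.min(pixels.length, from + block);
            workers[t] = new Thread(() -> {
                for (int i = from; i < to; i++)
                    out[i] = processPixel(pixels[i]);
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); }                    // wait for every block to finish
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] px = new double[1000];
        for (int i = 0; i < px.length; i++) px[i] = i;
        // The result must not depend on the number of threads.
        System.out.println(Arrays.equals(processAll(px, 1), processAll(px, 4)));  // prints "true"
    }
}
```

The point is only the partitioning pattern; in the real plugin the per-pixel
work would be the CurveFitter call.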

--

Concerning the curve fitting problem, "Exponential with offset", y =
a*exp(-bx) + c:

The CurveFitter eliminates two parameters (a, c) by linear regression, so
it actually performs a one-dimensional minimization.  I guess that this
problem is well-behaved and the Minimizer always finds the correct result
on the first attempt, so a second run is unnecessary.
  So you can try:
  myCurveFitter.getMinimizer().setMaxRestarts(0);
This makes the Minimizer run only once, with no second try to make sure
the result is correct. It also avoids a second thread.
I would suggest that you try it and compare whether the result is the same
(there might be tiny differences since minimization is stochastic and the
accuracy is finite).
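To make the elimination above concrete, here is a standalone sketch in plain
Java (an illustration of the idea, not the actual CurveFitter internals):
for any fixed 'b', the model y = a*exp(-bx) + c is linear in 'a' and 'c',
so those two follow from a closed-form 2x2 least-squares solve and the
simplex only has to search along 'b'.

```java
public class ExpOffsetLinearPart {
    // For a fixed b, minimize sum_i (y_i - a*exp(-b*x_i) - c)^2 over a and c
    // by solving the 2x2 normal equations; returns {a, c}.
    static double[] linearPart(double[] x, double[] y, double b) {
        double n = x.length, sE = 0, sEE = 0, sY = 0, sEY = 0;
        for (int i = 0; i < x.length; i++) {
            double e = Math.exp(-b * x[i]);
            sE += e; sEE += e * e; sY += y[i]; sEY += e * y[i];
        }
        double det = n * sEE - sE * sE;          // determinant of the normal matrix
        double a = (n * sEY - sE * sY) / det;
        double c = (sEE * sY - sE * sEY) / det;
        return new double[] {a, c};
    }

    public static void main(String[] args) {
        // Noise-free data from a=2, b=0.5, c=1; with the true b, the linear
        // solve recovers a and c directly, no iteration needed.
        double[] x = {0, 1, 2, 3, 4, 5};
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++) y[i] = 2 * Math.exp(-0.5 * x[i]) + 1;
        double[] ac = linearPart(x, y, 0.5);
        System.out.printf("a = %.4f, c = %.4f%n", ac[0], ac[1]);  // a ≈ 2, c ≈ 1
    }
}
```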

If it works as I expect, it should cut the time for minimization in half.
If the decrease in total processing time is comparable, it would mean that
computing time is still dominated by the Minimizer, not by garbage
collection (and the rest of processing each pixel, including memory
access). If the speed gain is only marginal, it would indicate that
optimization should focus on garbage collection and the non-minimizer
operations per pixel.

What you might also do to speed up the process: if you have a good guess
for the 'b' parameter and the typical uncertainty of that guess, specify
them in the initialParams and initialParamVariations.  E.g. if 'b' does
not change much between neighboring pixels, use the previous value for
initialization. The default initialParamVariations value for 'b' is 10% of
the specified 'b' value.
Don't worry about the initial 'a' and 'c' parameters and their ranges;
these values will be ignored.
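How much a good seed helps can be illustrated with a standalone toy in plain
Java (none of this is ImageJ code: a golden-section search stands in for the
Minimizer, and 'a' and 'c' are held fixed to keep it short). A narrow
bracket around the previous pixel's 'b' needs noticeably fewer function
evaluations than a wide default bracket.

```java
import java.util.function.DoubleUnaryOperator;

public class SeededSearch {
    static int evals;  // counts objective evaluations

    // Golden-section search for the minimum of a unimodal f on [lo, hi].
    static double minimize(DoubleUnaryOperator f, double lo, double hi, double tol) {
        final double g = (Math.sqrt(5) - 1) / 2;           // golden-ratio conjugate
        double c = hi - g * (hi - lo), d = lo + g * (hi - lo);
        double fc = f.applyAsDouble(c), fd = f.applyAsDouble(d);
        evals += 2;
        while (hi - lo > tol) {
            if (fc < fd) { hi = d; d = c; fd = fc; c = hi - g * (hi - lo); fc = f.applyAsDouble(c); }
            else         { lo = c; c = d; fc = fd; d = lo + g * (hi - lo); fd = f.applyAsDouble(d); }
            evals++;
        }
        return (lo + hi) / 2;
    }

    // SSE of y = 2*exp(-b*x) + 1 with a=2, c=1 held fixed (toy objective).
    static double sse(double[] x, double[] y, double b) {
        double s = 0;
        for (int i = 0; i < x.length; i++) {
            double r = y[i] - (2 * Math.exp(-b * x[i]) + 1);
            s += r * r;
        }
        return s;
    }

    public static void main(String[] args) {
        double trueB = 0.52, prevB = 0.50;   // neighboring pixels have similar b
        double[] x = {0, 1, 2, 3, 4, 5};
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++) y[i] = 2 * Math.exp(-trueB * x[i]) + 1;

        evals = 0;
        double wide = minimize(b -> sse(x, y, b), 0.01, 10, 1e-6);   // default-style bracket
        int wideEvals = evals;
        evals = 0;
        double seeded = minimize(b -> sse(x, y, b), prevB * 0.8, prevB * 1.2, 1e-6);
        int seededEvals = evals;
        System.out.println("b = " + seeded + ", evaluations: "
                + wideEvals + " wide vs " + seededEvals + " seeded");
    }
}
```

Both searches find the same 'b'; the seeded one just gets there with a
smaller budget, which is the same effect the initial-parameter hint has on
the real Minimizer.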


HTH,

Michael
____________________________________________________________________


On Fri, July 15, 2016 22:03, George Patterson wrote:

> Michael,
> Thanks for the quick response and for helping provide us with
> CurveFitter.
>
>
>> as curve fitting is an essential part of your plugin, probably I should
>> answer (I am responsible for the two threads and more code in the
>> CurveFitter. You would not reach me on the developers' mailing list, as
>> my main occupation is not in computing or image processing and I
>> contribute to ImageJ only now and then, when I need it for my own work
>> or think I might eventually need it).
>>
>
> Good to know.  I'll let the other forum know the answer is over here.
>
>>
>> The CurveFitter usually uses the Minimizer, which uses two threads,
>> indeed. It does not use the Minimizer (and thus, only one Thread) for
>> linear regression fits:
>> - Straight Line 'y = a+bx'
>> - "Exponential (linear regression)" 'y = a*exp(bx)', and
>> - "Power (linear regression)" 'a*x^b'.
>>
>
> I am using "Exponential with offset" so no linear regression.
>
>> If it's not linear regression, you can disable using two Minimizer
>> Threads by
>>   myCurveFitter.getMinimizer().setMaximumThreads(1);
>> (obviously, before you call myCurveFitter.doFit)
>>
>
>
>  Using the machine below again.
>
> 4. Windows 7 Xi MTower 2P64 Workstation         2 x 2.1 GHz  AMD Opteron
> 6272
> number of cpus shown in resource monitor: 32
>
> setting the maximum threads to one                      previous
> threaded (1) ~205.3 sec                                 158 sec
> threaded (2) ~108.9 sec                                 85.1 sec
> threaded (4) ~64.5 sec                                  46 sec
> threaded (8) ~35.2 sec                                  22.9 sec
> threaded (10) ~28.1 sec                                 18.6 sec
> threaded (12) ~24.6 sec                                 16.4 sec
> threaded (16) ~17.7 sec                                 15.8 sec
> threaded (20) ~15.1 sec                                 15.7 sec
> threaded (24) ~13.3 sec                                 15.9 sec
> threaded (32) ~10 sec                                   16 sec
>
> The improvement is much closer to linear.  It is slower with fewer
> threads than before.  I bet you know why.  Care to educate me?
>
>
>> By the way, creating a new CurveFitter also creates several other
>> objects,
>> so having one per pixel really induces a lot of garbage collection.
>>
>
> So in addition to producing more threads than I originally thought, my
> major limitation is probably the amount of garbage I'm producing.
> Correct?
>
>
>>
>> If creating many CurveFitters or Minimizers is more common (anyone out
>> there who also does this?) we should consider making the CurveFitter and
>> Minimizer reusable (e.g. with a CurveFitter.setData(double[] xData,
>> double[] yData) method, which clears the previous result and settings).
>>
>
> Obviously I'm in favor but that sounds like it might take a bit of effort
> by someone in the know.
>
>
> Thanks again for your help.
>
> Best,
> George
>
>> _________________________
>>
>> On 2016-07-15 18:16, George Patterson wrote:
>>
>>> Hi all,
>>>
>>> Thank you all for your feedback.  Below I'll try to respond to the
>>> parts I
>>> can answer.
>>>
>>>
>>> Seeing as this bit is a bit more technical and closer to a plugin
>>> development question, would you mind posting it on
>>> http://forum.imagej.net
>>>
>>> Long technical email threads like this one tend to get muddy,
>>> especially
>>> if we try to share code snippets or want to comment on a particular
>>> part.
>>>
>>> In the future, I’ll direct plugin development questions to that
>>> forum. I didn’t bother sharing code at this point since I just wanted
>>> to know what improvements to expect with multi-threaded processing.
>>>
>>>
>>> A quick remark though. Seeing as we do not know HOW you implemented the
>>> parallel processing, it will be difficult to help.
>>>
>>> Some notes: If you 'simply' make a bunch of threads where each accesses
>>> a
>>> given pixel in a loop through an atomic integer for example, it is not
>>> going to be faster. Accessing a single pixel is extremely fast and what
>>> will slow you down is having each thread waiting to get its pixel
>>> index.
>>>
>>> As I mentioned, I didn’t know what to expect so I wasn’t sure I had
>>> a
>>> problem.  The atomic integer approach is what I used initially.  To be
>>> clear, the speed does improve with more threads, it just doesn’t
>>> improve
>>> as
>>> much as it should based on the responses by Oli and Michael.  Based on
>>> suggestions from Oli and Michael, I changed the code to designate
>>> different
>>> blocks of the image to different threads.  This seemed to improve the
>>> speed
>>> modestly, by 5-10%.  Thanks for the suggestion.  I’ll take this approach
>>> for
>>> any future developments.
>>>
>>>
>>> IMHO the most important point for efficient parallelization (and
>>> efficient
>>> Java code anyhow) is avoiding creating lots of objects that need
>>> garbage
>>> collection (don't care about dozens, but definitely avoid hundreds of
>>> thousands or millions of Objects).
>>>
>>> Michael, thanks for sharing the list of potential problems.  I’ll work
>>> my
>>> way through them as well as I can.  The number of objects created is
>>> the
>>> first I started checking.  A new CurveFitter is created for every pixel
>>> so
>>> for a 256x256x200 stack >65000 are created and subjected to garbage
>>> collection I guess.  I still haven’t found a way around generating
>>> this
>>> many curvefitters.
>>>
>>> This led me to look more closely at the CurveFitter documentation
>>> and I
>>> found this https://imagej.nih.gov/ij/docs/curve-fitter.html where it
>>> indicates “Two threads with two independent minimization runs, with
>>> repetition until two identical results are found, to avoid local minima
>>> or
>>> keeping the result of a stuck simplex.”  Does this mean that for each
>>> thread that generates a new CurveFitter, the CurveFitter generates a
>>> second
>>> thread on its own?  If so, then my plugin is generating twice as many
>>> threads as I think and might explain why my speed improvement is
>>> observed
>>> only to about half the number of cpus.  Possible? Probable? No way?
>>> Since
>>> this is maybe getting into some technical bits which the plugin
>>> developers
>>> probably know well, I’ll take Oli's advice and ask this on the imagej.net
>>> forum.
>>>
>>>
>>> We made the same kind of tests and experience as you did. We also
>>> tested
>>>>
>>> numerous machines with a variable number of cores declared in the
>>> ImageJ
>>> Option Menu, in combination with different amounts of >RAM, without
>>> being
>>> able to draw really clear conclusions about why it is fast or slow on
>>> the
>>> respective computers. We also tested different processes, from a simple
>>> Gaussian blur to more complex macros.
>>>
>>> Laurent, thanks for sharing your experiences.  Our issues with
>>> different
>>> machines might be better answered on another forum (maybe
>>> http://forum.imagej.net ).  Maybe we should start a new query on just
>>> this
>>> topic?
>>>
>>>
>>> Thanks again for the feedback.
>>>
>>> George
>>>
>>>
>>>
>>> On Fri, Jul 15, 2016 at 3:49 AM, Gelman, Laurent
>>> <[hidden email]>
>>> wrote:
>>>
>>>> Dear George,
>>>> We made the same kind of tests and experience as you did. We also
>>>> tested
>>>> numerous machines with a variable number of cores declared in the
>>>> ImageJ
>>>> Option Menu, in combination with different amounts of RAM, without
>>>> being
>>>> able to draw really clear conclusions about why it is fast or slow on
>>>> the
>>>> respective computers. We also tested different processes, from a
>>>> simple
>>>> Gaussian blur to more complex macros.
>>>>
>>>> In a nutshell:
>>>> We also observed awful performance on our Microsoft Server 2012 /
>>>> 32CPUs
>>>> / 512GB RAM machine, irrespective of the combination of CPUs and RAM
>>>> we
>>>> declare in ImageJ. Surely, giving more than 16 CPUs to ImageJ does not
>>>> improve overall speed, sometimes it even decreases. Note that this
>>>> very
>>>> same machine is really fast when using Matlab and the parallel
>>>> processing
>>>> toolbox.
>>>> Until recently, the fastest computers we could find to run ImageJ were
>>>> my
>>>> iMac, which runs Windows 7 (:-)) (specs: i7-4771 CPU  3.5GHz, 32GB
>>>> RAM),
>>>> and the HIVE (hexacore machine) sold by the company Acquifer (no
>>>> commercial
>>>> interest). Until then, we thought the speed of individual CPUs was the
>>>> key, rather than their number, but we got really surprised lately when
>>>> we tested
>>>> the
>>>> new virtual machines (VMs) our IT department set up for us to do some
>>>> remote processing of very big microscopy datasets (24 cores, 128 to
>>>> 256
>>>> GB
>>>> RAM for each VM). Although the CPUs on the physical servers are not
>>>> that
>>>> fast (2.5 GHz, but is this really a good measure of computation speed?
>>>> I
>>>> am
>>>> not sure...), we measured that our VMs were the fastest machines we
>>>> tested
>>>> so far. So we have actually no theory anymore about ImageJ and speed.
>>>> It
>>>> is
>>>> not clear to us either, whether having Windows 7 or Windows server
>>>> 2012
>>>> makes a difference.
>>>> Finally, I should mention that when you use complex processes, for
>>>> example
>>>> Stitching, the speed of the individual CPUs is also important, as we
>>>> had
>>>> the impression that the reading/loading of the file uses only one
>>>> core.
>>>> There again, we could see a beautiful correlation between CPU speed
>>>> (GHz
>>>> specs) and the process.
>>>>
>>>> Current solution:
>>>> If we really need to be very fast,
>>>> 1. we write an ImageJ macro in python and launch multiple threads in
>>>> parallel, but we observed that the whole was not "thread safe", i.e.
>>>> we
>>>> see
>>>> "collisions" between the different processes.
>>>> 2. we write a python program to launch multiple ImageJ instances in a
>>>> headless mode and parse the macro this way.
>>>>
>>>> I would be also delighted to understand what makes ImageJ go fast or
>>>> slow
>>>> on a computer, that would help us to purchase the right machines from
>>>> the
>>>> beginning.
>>>>
>>>> Very best regards,
>>>>
>>>> Laurent.
>>>>
>>>> ___________________________
>>>> Laurent Gelman, PhD
>>>> Friedrich Miescher Institut
>>>> Head, Facility for Advanced Imaging and Microscopy
>>>> Light microscopy
>>>> WRO 1066.2.16
>>>> Maulbeerstrasse 66
>>>> CH-4058 Basel
>>>> +41 (0)61 696 35 13
>>>> +41 (0)79 618 73 69
>>>> www.fmi.ch
>>>> www.microscopynetwork.unibas.ch/
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: George Patterson [mailto:[hidden email]]
>>>> Sent: mercredi 13 juillet 2016 23:55
>>>> Subject: Questions regarding multithreaded processing
>>>>
>>>> Dear all,
>>>> I’ve assembled a plugin to analyze a time series on a pixel-by-pixel
>>>> basis.
>>>> It works fine but is slow.
>>>> There are likely still plenty of optimizations that can be done to
>>>> improve
>>>> the speed and thanks to Albert Cardona and Stephen Preibisch sharing
>>>> code
>>>> and tutorials  (
>>>> http://albert.rierol.net/imagej_programming_tutorials.html
>>>> ),
>>>> I even have a version that runs multi-threaded.
>>>> When run on multi-core machines the speed is improved, but I’m not
>>>> sure
>>>> what sort of improvement I should expect.  Moreover, the machines I
>>>> expected to be the fastest are not.  This is likely stemming from my
>>>> misunderstanding of parallel processing and Java programming in
>>>> general
>>>> so
>>>> I’m hoping some of you with more experience can provide some
>>>> feedback.
>>>> I list below some observations and questions along with test runs on
>>>> the
>>>> same data set using the same plugin on a few different machines.
>>>> Thanks for any suggestions.
>>>> George
>>>>
>>>>
>>>> Since the processing speeds differ, I realize the speeds of each
>>>> machine
>>>> to complete the analysis will differ.  I’m more interested in the
>>>> improvement
>>>> of multiple threads on an individual machine.
>>>> In running these tests, I altered the code to use a different number
>>>> of
>>>> threads in each run.
>>>> Is setting the number of threads in the code and determining the time
>>>> to
>>>> finish the analysis a valid approach to testing improvement?
>>>>
>>>> Machine 5 is producing some odd behavior which I’ll discuss and ask
>>>> for
>>>> suggestions below.
>>>>
>>>> For machines 1-4, the speed improves with the number of threads up to
>>>> about half the number of available processors.
>>>> Do the improvements with the number of threads listed below seem
>>>> reasonable?
>>>> Is the improvement up to only about half the number of available
>>>> processors due to “hyperthreading”?  My limited (and probably
>>>> wrong)
>>>> understanding is that hyperthreading makes a single core appear to be
>>>> two
>>>> which share resources and thus a machine with 2 cores will return 4
>>>> when
>>>> queried for number of cpus.  Yes, I know that is too simplistic, but
>>>> it’s
>>>> the best I can do.
>>>> Could it simply be that my code is not written properly to take
>>>> advantage
>>>> of hyperthreading?  Could anyone point me to a source and/or example
>>>> code
>>>> explaining how I could change it to take advantage of hyperthreading
>>>> if
>>>> this is the problem?
>>>>
>>>> Number of threads used are shown in parentheses where applicable.
>>>> 1. MacBook Pro 2.66 GHz Intel Core i7
>>>> number of processors: 1
>>>> Number of cores: 2
>>>> non-threaded plugin version ~59 sec
>>>> threaded (1) ~51 sec
>>>> threaded (2) ~36 sec
>>>> threaded (3) ~34 sec
>>>> threaded (4) ~35 sec
>>>>
>>>> 2. Mac Pro 2 x 2.26 GHz Quad-Core Intel Xeon
>>>> number of processors: 2
>>>> Number of cores: 8
>>>> non-threaded plugin version ~60 sec
>>>> threaded (1) ~59 sec
>>>> threaded (2) ~28.9 sec
>>>> threaded (4) ~15.6 sec
>>>> threaded (6) ~13.2 sec
>>>> threaded (8) ~11.3 sec
>>>> threaded (10) ~11.1 sec
>>>> threaded (12) ~11.1 sec
>>>> threaded (16) ~11.5 sec
>>>>
>>>> 3. Windows 7 DELL   3.2 GHz Intel Core i5
>>>> number of cpus shown in resource monitor: 4
>>>> non-threaded plugin version ~45.3 sec
>>>> threaded (1) ~48.3 sec
>>>> threaded (2) ~21.7 sec
>>>> threaded (3) ~20.4 sec
>>>> threaded (4) ~21.8 sec
>>>>
>>>> 4. Windows 7 Xi MTower 2P64 Workstation    2 x 2.1 GHz AMD Opteron 6272
>>>> number of cpus shown in resource monitor: 32
>>>> non-threaded plugin version ~162 sec
>>>> threaded (1) ~158 sec
>>>> threaded (2) ~85.1 sec
>>>> threaded (4) ~46 sec
>>>> threaded (8) ~22.9 sec
>>>> threaded (10) ~18.6 sec
>>>> threaded (12) ~16.4 sec
>>>> threaded (16) ~15.8 sec
>>>> threaded (20) ~15.7 sec
>>>> threaded (24) ~15.9 sec
>>>> threaded (32) ~16 sec
>>>>
>>>> For machines 1-4, the cpu usage can be observed in the Activity
>>>> Monitor
>>>> (Mac) or Resource Monitor (Windows) and during the execution of the
>>>> plugin
>>>> all of the cpus were active.  For machine 5 shown below, only 22 of
>>>> the
>>>> 64
>>>> show activity.  And it is not always the same 22.  From the example
>>>> runs
>>>> below you can see it really isn’t performing very well considering
>>>> the
>>>> number of available cores.  I originally thought this machine should
>>>> be
>>>> the
>>>> best, but it barely outperforms my laptop.  This is probably a
>>>> question
>>>> for
>>>> another forum, but I am wondering if anyone else has encountered
>>>> anything
>>>> similar.
>>>>
>>>> 5. Windows Server 2012 Xi MTower 2P64 Workstation    4 x 2.4 GHz AMD
>>>> Opteron 6378
>>>> number of cpus shown in resource monitor: 64
>>>> non-threaded plugin version ~140 sec
>>>> threaded (1) ~137 sec
>>>> threaded (4) ~60.3 sec
>>>> threaded (8) ~29.3 sec
>>>> threaded (12) ~22.9 sec
>>>> threaded (16) ~23.8 sec
>>>> threaded (24) ~24.1 sec
>>>> threaded (32) ~24.5 sec
>>>> threaded (40) ~24.8 sec
>>>> threaded (48) ~23.8 sec
>>>> threaded (64) ~24.8 sec
>>>>
>>>> --
>>>> View this message in context:
>>>>
>>>> http://imagej.1557.x6.nabble.com/Questions-regarding-multithreaded-processing-tp5016878.html
>>>> Sent from the ImageJ mailing list archive at Nabble.com.
>>>>
>>>> --
>>>> ImageJ mailing list: http://imagej.nih.gov/ij/list.html
