
Re: Questions regarding multithreaded processing

Posted by Michael Schmid on Jul 15, 2016; 5:23pm
URL: http://imagej.273.s1.nabble.com/Questions-regarding-multithreaded-processing-tp5016878p5016889.html

Hi George,

since curve fitting is an essential part of your plugin, I should
probably answer (I am responsible for the two threads and some more code
in the CurveFitter). You would not reach me on the developers' mailing
list, as my main occupation is not in computing or image processing, and
I contribute to ImageJ only now and then, when I need something for my
own work or think I might eventually need it.

The CurveFitter usually uses the Minimizer, which indeed uses two
threads. It does not use the Minimizer (and thus uses only one thread)
for the linear-regression fits:
- "Straight Line" 'y = a+bx',
- "Exponential (linear regression)" 'y = a*exp(bx)', and
- "Power (linear regression)" 'y = a*x^b'.

If it is not a linear-regression fit, you can disable the use of two
Minimizer threads with
   myCurveFitter.getMinimizer().setMaximumThreads(1);
(obviously, before you call myCurveFitter.doFit).
When you run many fits in parallel threads anyway, limiting the
Minimizer to one thread will also speed things up slightly: if the
Minimizer does not find two consistent solutions immediately, with two
threads it starts two more tries (a total of four), whereas with one
thread it may already get two consistent solutions after a total of
three tries.

If it is linear regression, ask me again; somewhere I have a linear
regression class that can be reused without creating a new object.
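For illustration, a reusable linear-regression accumulator could look
roughly like the sketch below (my own minimal version, not the class
mentioned above): one instance can be reset and refilled for every
pixel, so no per-pixel objects are created at all.

```java
// Minimal reusable linear-regression accumulator (a sketch, not the
// class referred to above). Call reset(), feed (x, y) pairs with add(),
// then read the least-squares line y = a + b*x via intercept()/slope().
public class LinearRegression {
    private double n, sumX, sumY, sumXX, sumXY;

    public void reset() { n = sumX = sumY = sumXX = sumXY = 0; }

    public void add(double x, double y) {
        n++; sumX += x; sumY += y; sumXX += x * x; sumXY += x * y;
    }

    public double slope() {       // b
        return (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    }

    public double intercept() {   // a
        return (sumY - slope() * sumX) / n;
    }
}
```

The same object serves the whole stack, so fitting produces zero garbage.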

If your curve-fitting problem is linear in all parameters but is not a
linear regression with only two parameters (e.g. a polynomial,
a*sin(x)+b*cos(x)+c, etc.), it would be faster to use the analytical
least-squares solution instead of the CurveFitter, but it is more
programming effort (you could try to find a suitable library such as
Apache Commons Math or Jama).
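To illustrate the analytical route with a hand-rolled sketch (a real
program would rather use a library such as Apache Commons Math for
robustness): for y = a*sin(x) + b*cos(x) + c the least-squares
parameters follow in closed form from the 3x3 normal equations.

```java
// Closed-form least squares for y = a*sin(x) + b*cos(x) + c.
// Builds the 3x3 normal equations A^T A p = A^T y and solves them by
// Gaussian elimination with partial pivoting. A sketch only.
public class AnalyticalFit {
    public static double[] fit(double[] x, double[] y) {
        double[][] m = new double[3][4];     // augmented normal equations
        for (int i = 0; i < x.length; i++) {
            double[] basis = { Math.sin(x[i]), Math.cos(x[i]), 1.0 };
            for (int r = 0; r < 3; r++) {
                for (int c = 0; c < 3; c++) m[r][c] += basis[r] * basis[c];
                m[r][3] += basis[r] * y[i];
            }
        }
        // Gaussian elimination with partial pivoting
        for (int col = 0; col < 3; col++) {
            int pivot = col;
            for (int r = col + 1; r < 3; r++)
                if (Math.abs(m[r][col]) > Math.abs(m[pivot][col])) pivot = r;
            double[] tmp = m[col]; m[col] = m[pivot]; m[pivot] = tmp;
            for (int r = col + 1; r < 3; r++) {
                double f = m[r][col] / m[col][col];
                for (int c = col; c < 4; c++) m[r][c] -= f * m[col][c];
            }
        }
        double[] p = new double[3];          // back-substitution
        for (int r = 2; r >= 0; r--) {
            p[r] = m[r][3];
            for (int c = r + 1; c < 3; c++) p[r] -= m[r][c] * p[c];
            p[r] /= m[r][r];
        }
        return p;                            // { a, b, c }
    }
}
```

No iteration, no Minimizer threads; the cost is one pass over the data
plus a tiny linear solve per pixel.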

If your problem is nonlinear but has one of the forms
- a + b*function(c, d, ...; x),
- a + function(b, c, d, ...; x),
- a + b*x + function(c, d, ...; x), or
- a*x + function(b, c, d, ...; x),
you can speed things up a lot by eliminating one or two parameters with
   myCurveFitter.setOffsetMultiplySlopeParams
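To illustrate why eliminating the offset and factor helps so much: for a
model y = a + b*f(c; x), the best a and b for any fixed c follow from a
plain linear regression of y against f(c; x), so the minimizer only has
to search over c. Below is a self-contained sketch of that idea (my own
toy version using a grid search over c, not the CurveFitter internals,
which use a simplex over the reduced parameter set), for the example
model y = a + b*exp(-c*x).

```java
// Fits y = a + b*exp(-c*x) by searching only over c; for each trial c
// the optimal a and b come from linear regression of y against
// exp(-c*x). A sketch of the parameter-elimination idea only.
public class ReducedFit {
    public static double[] fit(double[] x, double[] y,
                               double cMin, double cMax, int steps) {
        double bestSse = Double.POSITIVE_INFINITY;
        double[] best = new double[3];       // { a, b, c }
        for (int i = 0; i <= steps; i++) {
            double c = cMin + (cMax - cMin) * i / steps;
            // closed-form a, b for this c via linear regression
            double n = x.length, sf = 0, sy = 0, sff = 0, sfy = 0;
            for (int j = 0; j < x.length; j++) {
                double f = Math.exp(-c * x[j]);
                sf += f; sy += y[j]; sff += f * f; sfy += f * y[j];
            }
            double b = (n * sfy - sf * sy) / (n * sff - sf * sf);
            double a = (sy - b * sf) / n;
            double sse = 0;                  // sum of squared residuals
            for (int j = 0; j < x.length; j++) {
                double r = y[j] - a - b * Math.exp(-c * x[j]);
                sse += r * r;
            }
            if (sse < bestSse) {
                bestSse = sse; best[0] = a; best[1] = b; best[2] = c;
            }
        }
        return best;
    }
}
```

The search space shrinks from three dimensions to one, which is exactly
what makes the elimination pay off.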

By the way, creating a new CurveFitter also creates several other
objects, so having one per pixel really induces a lot of garbage collection.

If creating many CurveFitters or Minimizers is more common (anyone out
there who also does this?) we should consider making the CurveFitter and
Minimizer reusable (e.g. with a CurveFitter.setData(double[] xData,
double[] yData) method, which clears the previous result and settings).


Best regards,

Michael
________________________________________________________________
On 2016-07-15 18:16, George Patterson wrote:

> Hi all,
>
> Thank you all for your feedback.  Below I'll try to respond to the parts I
> can answer.
>
>
>> Seeing as this bit is a bit more technical and closer to a plugin
> development question, would you mind posting it on http://forum.imagej.net
>
>> Long technical email threads like this one tend to get muddy, especially
> if we try to share code snippets or want to comment on a particular part.
>
> In the future, I’ll direct plugin development questions to that forum. I
> didn’t bother sharing code at this point since I just wanted to know
> what improvements to expect with multi-threaded processing.
>
>
>> A quick remark though. Seeing as we do not know HOW you implemented the
> parallel processing, it will be difficult to help.
>
>> Some notes: If you 'simply' make a bunch of threads where each accesses a
> given pixel in a loop through an atomic integer for example, it is not
> going to be faster. Accessing a single pixel is extremely fast and what
> will slow you down is having each thread waiting to get its pixel index.
>
> As I mentioned, I didn’t know what to expect so I wasn’t sure I had a
> problem.  The atomic integer approach is what I used initially.  To be
> clear, the speed does improve with more threads, it just doesn’t improve as
> much as it should based on the responses by Oli and Michael.  Based on
> suggestions from Oli and Michael, I changed the code to designate different
> blocks of the image to different threads.  This seemed to improve the speed
> modestly, by 5-10%.  Thanks for the suggestion.  I’ll take this approach for
> any future developments.
>
>
>> IMHO the most important point for efficient parallelization (and efficient
> Java code anyhow) is avoiding creating lots of objects that need garbage
> collection (don't care about dozens, but definitely avoid >hundreds of
> thousands or millions of Objects).
>
> Michael, thanks for sharing the list of potential problems.  I’ll work my
> way through them as well as I can.  The number of objects created is the
> first thing I started checking.  A new CurveFitter is created for every
> pixel, so for a 256x256x200 stack >65000 are created and subjected to
> garbage collection, I guess.  I still haven’t found a way around generating
> this many CurveFitters.
>
> This led me to look more closely at the CurveFitter documentation, and I
> found this: https://imagej.nih.gov/ij/docs/curve-fitter.html where it
> indicates “Two threads with two independent minimization runs, with
> repetition until two identical results are found, to avoid local minima or
> keeping the result of a stuck simplex.”  Does this mean that for each
> thread that generates a new CurveFitter, the CurveFitter generates a second
> thread on its own?  If so, then my plugin is generating twice as many
> threads as I think, which might explain why my speed improvement is observed
> only up to about half the number of CPUs.  Possible? Probable? No way?  Since
> this is maybe getting into some technical bits which the plugin developers
> probably know well, I’ll take Oli's advice and ask this on the imagej.net forum.
>
>
>> We made the same kind of tests and experience as you did. We also tested
> numerous machines with a variable number of cores declared in the ImageJ
> Option Menu, in combination with different amounts of RAM, without being
> able to draw really clear conclusions about why it is fast or slow on the
> respective computers. We also tested different processes, from a simple
> Gaussian blur to more complex macros.
>
> Laurent, thanks for sharing your experiences.  Our issues with different
> machines might be better answered on another forum (maybe
> http://forum.imagej.net ).  Maybe we should start a new query on just this
> topic?
>
>
> Thanks again for the feedback.
>
> George
>
>
>
> On Fri, Jul 15, 2016 at 3:49 AM, Gelman, Laurent <[hidden email]>
> wrote:
>
>> Dear George,
>>
>> We made the same kind of tests and experience as you did. We also tested
>> numerous machines with a variable number of cores declared in the ImageJ
>> Option Menu, in combination with different amounts of RAM, without being
>> able to draw really clear conclusions about why it is fast or slow on the
>> respective computers. We also tested different processes, from a simple
>> Gaussian blur to more complex macros.
>>
>> In a nutshell:
>> We also observed awful performance on our Microsoft Server 2012 / 32 CPUs
>> / 512GB RAM machine, irrespective of the combination of CPUs and RAM we
>> declare in ImageJ. Surely, giving more than 16 CPUs to ImageJ does not
>> improve overall speed, sometimes it even decreases. Note that this very
>> same machine is really fast when using Matlab and the parallel processing
>> toolbox.
>> Until recently, the fastest computers we could find to run ImageJ were my
>> iMac, which runs Windows 7 (:-)) (specs: i7-4771 CPU  3.5GHz, 32GB RAM),
>> and the HIVE (hexacore machine) sold by the company Acquifer (no commercial
>> interest). Until then, we thought the speed of individual CPUs is the key,
>> less their numbers, but we got really surprised lately when we tested the
>> new virtual machines (VMs) our IT department set up for us to do some
>> remote processing of very big microscopy datasets (24 cores, 128 to 256 GB
>> RAM for each VM). Although the CPUs on the physical servers are not that
>> fast (2.5 GHz, but is this really a good measure of computation speed? I am
>> not sure...), we measured that our VMs were the fastest machines we tested
>> so far. So we have actually no theory anymore about ImageJ and speed. It is
>> not clear to us either, whether having Windows 7 or Windows server 2012
>> makes a difference.
>> Finally, I should mention that when you use complex processes, for example
>> Stitching, the speed of the individual CPUs is also important, as we had
>> the impression that the reading/loading of the file uses only one core.
>> There again, we could see a beautiful correlation between CPU speed (GHz
>> specs) and the process.
>>
>> Current solution:
>> If we really need to be very fast,
>> 1. we write an ImageJ macro in python and launch multiple threads in
>> parallel, but we observed that the whole was not "thread safe", i.e. we see
>> "collisions" between the different processes.
>> 2. we write a python program to launch multiple ImageJ instances in a
>> headless mode and parse the macro this way.
>>
>> I would be also delighted to understand what makes ImageJ go fast or slow
>> on a computer, that would help us to purchase the right machines from the
>> beginning.
>>
>> Very best regards,
>>
>> Laurent.
>>
>> ___________________________
>> Laurent Gelman, PhD
>> Friedrich Miescher Institut
>> Head, Facility for Advanced Imaging and Microscopy
>> Light microscopy
>> WRO 1066.2.16
>> Maulbeerstrasse 66
>> CH-4058 Basel
>> +41 (0)61 696 35 13
>> +41 (0)79 618 73 69
>> www.fmi.ch
>> www.microscopynetwork.unibas.ch/
>>
>>
>> -----Original Message-----
>> From: George Patterson [mailto:[hidden email]]
>> Sent: mercredi 13 juillet 2016 23:55
>> Subject: Questions regarding multithreaded processing
>>
>> Dear all,
>> I’ve assembled a plugin to analyze a time series on a pixel-by-pixel basis.
>> It works fine but is slow.
>> There are likely still plenty of optimizations that can be done to improve
>> the speed, and thanks to Albert Cardona and Stephen Preibisch sharing code
>> and tutorials (http://albert.rierol.net/imagej_programming_tutorials.html),
>> I even have a version that runs multi-threaded.
>> When run on multi-core machines the speed is improved, but I’m not sure
>> what sort of improvement I should expect.  Moreover, the machines I
>> expected to be the fastest are not.  This is likely stemming from my
>> misunderstanding of parallel processing and Java programming in general so
>> I’m hoping some of you with more experience can provide some feedback.
>> I list below some observations and questions along with test runs on the
>> same data set using the same plugin on a few different machines.
>> Thanks for any suggestions.
>> George
>>
>>
>> Since the processing speeds differ, I realize the time each machine takes
>> to complete the analysis will differ.  I’m more interested in the
>> improvement from multiple threads on an individual machine.
>> In running these tests, I altered the code to use a different number of
>> threads in each run.
>> Is setting the number of threads in the code and determining the time to
>> finish the analysis a valid approach to testing improvement?
>>
>> Machine 5 is producing some odd behavior which I’ll discuss and ask for
>> suggestions below.
>>
>> For machines 1-4, the speed improves with the number of threads up to
>> about half the number of available processors.
>> Do the improvements with the number of threads listed below seem
>> reasonable?
>> Is the improvement up to only about half the number of available
>> processors due to “hyperthreading”?  My limited (and probably wrong)
>> understanding is that hyperthreading makes a single core appear to be two
>> which share resources and thus a machine with 2 cores will return 4 when
>> queried for number of cpus.  Yes, I know that is too simplistic, but it’s
>> the best I can do.
>> Could it simply be that my code is not written properly to take advantage
>> of hyperthreading?  Could anyone point me to a source and/or example code
>> explaining how I could change it to take advantage of hyperthreading if
>> this is the problem?
>>
>> Number of threads used are shown in parentheses where applicable.
>> 1. MacBook Pro 2.66 GHz Intel Core i7
>> number of processors: 1
>> Number of cores: 2
>> non-threaded plugin version ~59 sec
>> threaded (1) ~51 sec
>> threaded (2) ~36 sec
>> threaded (3) ~34 sec
>> threaded (4) ~35 sec
>>
>> 2. Mac Pro 2 x 2.26 GHz Quad-Core Intel Xeon
>> number of processors: 2
>> Number of cores: 8
>> non-threaded plugin version ~60 sec
>> threaded (1) ~59 sec
>> threaded (2) ~28.9 sec
>> threaded (4) ~15.6 sec
>> threaded (6) ~13.2 sec
>> threaded (8) ~11.3 sec
>> threaded (10) ~11.1 sec
>> threaded (12) ~11.1 sec
>> threaded (16) ~11.5 sec
>>
>> 3. Windows 7 DELL   3.2 GHz Intel Core i5
>> number of cpus shown in resource monitor: 4
>> non-threaded plugin version ~45.3 sec
>> threaded (1) ~48.3 sec
>> threaded (2) ~21.7 sec
>> threaded (3) ~20.4 sec
>> threaded (4) ~21.8 sec
>>
>> 4. Windows 7 Xi MTower 2P64 Workstation   2 x 2.1 GHz AMD Opteron 6272
>> number of cpus shown in resource monitor: 32
>> non-threaded plugin version ~162 sec
>> threaded (1) ~158 sec
>> threaded (2) ~85.1 sec
>> threaded (4) ~46 sec
>> threaded (8) ~22.9 sec
>> threaded (10) ~18.6 sec
>> threaded (12) ~16.4 sec
>> threaded (16) ~15.8 sec
>> threaded (20) ~15.7 sec
>> threaded (24) ~15.9 sec
>> threaded (32) ~16 sec
>>
>> For machines 1-4, the cpu usage can be observed in the Activity Monitor
>> (Mac) or Resource Monitor (Windows) and during the execution of the plugin
>> all of the cpus were active.  For machine 5 shown below, only 22 of the 64
>> show activity.  And it is not always the same 22.  From the example runs
>> below you can see it really isn’t performing very well considering the
>> number of available cores.  I originally thought this machine should be the
>> best, but it barely outperforms my laptop.  This is probably a question for
>> another forum, but I am wondering if anyone else has encountered anything
>> similar.
>>
>> 5. Windows Server 2012 Xi MTower 2P64 Workstation   4 x 2.4 GHz AMD Opteron 6378
>> number of cpus shown in resource monitor: 64
>> non-threaded plugin version ~140 sec
>> threaded (1) ~137 sec
>> threaded (4) ~60.3 sec
>> threaded (8) ~29.3 sec
>> threaded (12) ~22.9 sec
>> threaded (16) ~23.8 sec
>> threaded (24) ~24.1 sec
>> threaded (32) ~24.5 sec
>> threaded (40) ~24.8 sec
>> threaded (48) ~23.8 sec
>> threaded (64) ~24.8 sec
>>
>>
>> --
>> View this message in context:
>> http://imagej.1557.x6.nabble.com/Questions-regarding-multithreaded-processing-tp5016878.html
>> Sent from the ImageJ mailing list archive at Nabble.com.
>>
>> --
>> ImageJ mailing list: http://imagej.nih.gov/ij/list.html
>>
>
