
Re: Questions regarding multithreaded processing

Posted by George Patterson on Jul 15, 2016; 7:49pm
URL: http://imagej.273.s1.nabble.com/Questions-regarding-multithreaded-processing-tp5016878p5016892.html

Michael,
Thanks for the quick response and for helping to provide us with
CurveFitter.


> as curve fitting is an essential part of your plugin, probably I should
> answer (I am responsible for the two threads and more code in the
> CurveFitter. You would not reach me on the developers' mailing list as my
> main occupation is not in computing or image processing, and I contribute
> to ImageJ only now and then, if I need it for my own work or I think I
> might eventually need it).
>

Good to know.  I'll let the other forum know the answer is over here.

>
> The CurveFitter usually uses the Minimizer, which uses two threads,
> indeed. It does not use the Minimizer (and thus uses only one thread) for
> linear regression fits:
> - "Straight Line" 'y = a+bx',
> - "Exponential (linear regression)" 'y = a*exp(bx)', and
> - "Power (linear regression)" 'y = a*x^b'.
>

I am using "Exponential with offset", so no linear regression.

> If it's not linear regression, you can disable using two Minimizer threads
> by
>   myCurveFitter.getMinimizer().setMaximumThreads(1);
> (obviously, before you call myCurveFitter.doFit)
>


 Using the machine below again.

4. Windows 7 Xi MTower 2P64 Workstation, 2 x 2.1 GHz AMD Opteron 6272
number of cpus shown in resource monitor: 32

                 setting the maximum threads to one    previous
threaded (1)     ~205.3 sec                            158 sec
threaded (2)     ~108.9 sec                            85.1 sec
threaded (4)     ~64.5 sec                             46 sec
threaded (8)     ~35.2 sec                             22.9 sec
threaded (10)    ~28.1 sec                             18.6 sec
threaded (12)    ~24.6 sec                             16.4 sec
threaded (16)    ~17.7 sec                             15.8 sec
threaded (20)    ~15.1 sec                             15.7 sec
threaded (24)    ~13.3 sec                             15.9 sec
threaded (32)    ~10 sec                               16 sec

The improvement is now much closer to linear.  It is slower with fewer threads
than before.  I bet you know why.  Care to educate me?
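
To compare the two series in the table above, speedup (single-thread time
divided by N-thread time) and parallel efficiency (speedup divided by N) can
be computed directly. The following is only a minimal sketch in plain Java;
the class and method names are made up for illustration, and the numbers are
copied from the table.

```java
import java.util.Locale;

class SpeedupTable {
    // Plugin thread counts and wall-clock times (seconds) copied from the table above.
    static final int[] THREADS = {1, 2, 4, 8, 10, 12, 16, 20, 24, 32};
    static final double[] ONE_MINIMIZER_THREAD =
            {205.3, 108.9, 64.5, 35.2, 28.1, 24.6, 17.7, 15.1, 13.3, 10.0};
    static final double[] PREVIOUS =
            {158, 85.1, 46, 22.9, 18.6, 16.4, 15.8, 15.7, 15.9, 16};

    /** Speedup of run i relative to the single-thread run of the same series. */
    static double speedup(double[] times, int i) {
        return times[0] / times[i];
    }

    /** Parallel efficiency: speedup divided by the number of plugin threads. */
    static double efficiency(double[] times, int i) {
        return speedup(times, i) / THREADS[i];
    }

    public static void main(String[] args) {
        System.out.println("threads | max threads = 1: speedup, eff. | previous: speedup, eff.");
        for (int i = 0; i < THREADS.length; i++) {
            System.out.printf(Locale.ROOT, "%7d | %8.2f %5.2f | %8.2f %5.2f%n",
                    THREADS[i],
                    speedup(ONE_MINIMIZER_THREAD, i), efficiency(ONE_MINIMIZER_THREAD, i),
                    speedup(PREVIOUS, i), efficiency(PREVIOUS, i));
        }
    }
}
```

With these numbers, the 32-thread run reaches a speedup of about 20.5 when
the Minimizer is limited to one thread, versus about 9.9 before.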


> By the way, creating a new CurveFitter also creates several other objects,
> so having one per pixel really induces a lot of garbage collection.
>

So in addition to producing more threads than I originally thought, my
major limitation is probably the amount of garbage I'm producing.  Correct?
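
One way to cut down on per-pixel garbage is to give each thread a contiguous
block of pixels and let it reuse a single scratch array for the time series
of every pixel in that block. The sketch below is plain Java and makes
several assumptions: the stack layout (stack[t][p] is pixel p in frame t),
the class name, and the use of a simple mean as a stand-in for the actual
curve fit are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BlockwiseStackProcessing {

    /** Computes one value per pixel; stack[t][p] is pixel p in frame t (assumed layout). */
    static double[] process(float[][] stack, int nThreads) {
        final int nFrames = stack.length;
        final int nPixels = stack[0].length;
        final double[] result = new double[nPixels];
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int b = 0; b < nThreads; b++) {
                // Contiguous block [start, end) of pixels handled by this task.
                final int start = (int) ((long) nPixels * b / nThreads);
                final int end = (int) ((long) nPixels * (b + 1) / nThreads);
                futures.add(pool.submit(() -> {
                    // One scratch array per task, reused for every pixel in the block.
                    double[] series = new double[nFrames];
                    for (int p = start; p < end; p++) {
                        for (int t = 0; t < nFrames; t++) {
                            series[t] = stack[t][p];
                        }
                        result[p] = mean(series); // stand-in for the per-pixel fit
                    }
                }));
            }
            for (Future<?> f : futures) {
                try {
                    f.get(); // wait for the block and surface any exception
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            }
        } finally {
            pool.shutdown();
        }
        return result;
    }

    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        float[][] stack = new float[200][256 * 256];
        double[] out = process(stack, Runtime.getRuntime().availableProcessors());
        System.out.println("processed " + out.length + " pixels");
    }
}
```

Each task writes only to its own range of result, so no locking is needed,
and the number of scratch arrays equals the number of tasks rather than the
number of pixels.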


>
> If creating many CurveFitters or Minimizers is more common (anyone out
> there who also does this?) we should consider making the CurveFitter and
> Minimizer reusable (e.g. with a CurveFitter.setData(double[] xData,
> double[] yData) method, which clears the previous result and settings).
>

Obviously I'm in favor but that sounds like it might take a bit of effort
by someone in the know.
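
To make the reuse idea concrete, here is a small sketch of what a reusable
fitter with a setData method could look like. This is not ImageJ's
CurveFitter: the class ReusableLineFitter and all of its methods are
hypothetical, and it only does a least-squares straight-line fit
'y = a+bx' so that the example stays short and self-contained.

```java
class ReusableLineFitter {
    private double[] x;
    private double[] y;
    private double offset;   // a in y = a + b*x
    private double slope;    // b in y = a + b*x
    private boolean fitted;

    /** Points the fitter at new data and discards the previous result. */
    void setData(double[] xData, double[] yData) {
        if (xData.length != yData.length || xData.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        x = xData;
        y = yData;
        fitted = false;
    }

    /** Ordinary least-squares fit of a straight line to the current data. */
    void doFit() {
        if (x == null) {
            throw new IllegalStateException("setData must be called first");
        }
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i];
            sy += y[i];
            sxx += x[i] * x[i];
            sxy += x[i] * y[i];
        }
        double denom = n * sxx - sx * sx;
        if (denom == 0) {
            throw new ArithmeticException("all x values are identical");
        }
        slope = (n * sxy - sx * sy) / denom;
        offset = (sy - slope * sx) / n;
        fitted = true;
    }

    double getOffset() { requireFit(); return offset; }

    double getSlope() { requireFit(); return slope; }

    private void requireFit() {
        if (!fitted) {
            throw new IllegalStateException("no fit result for the current data");
        }
    }

    public static void main(String[] args) {
        double[] time = {0, 1, 2};
        double[][] pixels = {{1, 3, 5}, {4, 4, 4}};
        ReusableLineFitter fitter = new ReusableLineFitter(); // one instance, reused
        for (double[] series : pixels) {
            fitter.setData(time, series);
            fitter.doFit();
            System.out.println(fitter.getOffset() + " + " + fitter.getSlope() + "*x");
        }
    }
}
```

In a multithreaded plugin, one such instance per thread would replace one
new object per pixel; the instances must not be shared between threads.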


Thanks again for your help.

Best,
George









> _________________________
>
> On 2016-07-15 18:16, George Patterson wrote:
>
>> Hi all,
>>
>> Thank you all for your feedback.  Below I'll try to respond to the parts I
>> can answer.
>>
>>
>>> Seeing as this bit is a bit more technical and closer to a plugin
>>> development question, would you mind posting it on
>>> http://forum.imagej.net
>>>
>>> Long technical email threads like this one tend to get muddy, especially
>>> if we try to share code snippets or want to comment on a particular part.
>>
>>
>> In the future, I’ll direct plugin development questions to that forum. I
>> didn’t bother sharing code at this point since I just wanted to know
>> what improvements to expect with multi-threaded processing.
>>
>>
>>> A quick remark though. Seeing as we do not know HOW you implemented the
>>> parallel processing, it will be difficult to help.
>>>
>>> Some notes: If you 'simply' make a bunch of threads where each accesses a
>>> given pixel in a loop through an atomic integer for example, it is not
>>> going to be faster. Accessing a single pixel is extremely fast and what
>>> will slow you down is having each thread waiting to get its pixel index.
>>
>> As I mentioned, I didn’t know what to expect so I wasn’t sure I had a
>> problem.  The atomic integer approach is what I used initially.  To be
>> clear, the speed does improve with more threads, it just doesn’t improve
>> as much as it should based on the responses by Oli and Michael.  Based on
>> suggestions from Oli and Michael, I changed the code to designate
>> different blocks of the image to different threads.  This seemed to
>> improve the speed modestly, by 5-10%.  Thanks for the suggestion.  I’ll
>> take this approach for any future developments.
>>
>>
>>> IMHO the most important point for efficient parallelization (and efficient
>>> Java code anyhow) is avoiding creating lots of objects that need garbage
>>> collection (don't care about dozens, but definitely avoid hundreds of
>>> thousands or millions of Objects).
>>
>> Michael, thanks for sharing the list of potential problems.  I’ll work my
>> way through them as well as I can.  The number of objects created is the
>> first thing I started checking.  A new CurveFitter is created for every
>> pixel, so for a 256x256x200 stack more than 65,000 are created and
>> subjected to garbage collection, I guess.  I still haven’t found a way
>> around generating this many CurveFitters.
>>
>> This led me to look more closely at the CurveFitter documentation and I
>> found this https://imagej.nih.gov/ij/docs/curve-fitter.html where it
>> indicates “Two threads with two independent minimization runs, with
>> repetition until two identical results are found, to avoid local minima or
>> keeping the result of a stuck simplex.”  Does this mean that for each
>> thread that generates a new CurveFitter, the CurveFitter generates a
>> second thread on its own?  If so, then my plugin is generating twice as
>> many threads as I think, which might explain why my speed improvement is
>> observed only up to about half the number of cpus.  Possible? Probable? No
>> way?  Since this is maybe getting into some technical bits which the
>> plugin developers probably know well, I’ll take Oli's advice and ask this
>> on the imagej.net forum.
>>
>>
>>> We made the same kind of tests and experience as you did. We also tested
>>> numerous machines with a variable number of cores declared in the ImageJ
>>> Option Menu, in combination with different amounts of RAM, without being
>>> able to draw really clear conclusions about why it is fast or slow on the
>>> respective computers. We also tested different processes, from a simple
>>> Gaussian blur to more complex macros.
>>
>> Laurent, thanks for sharing your experiences.  Our issues with different
>> machines might be better answered on another forum (maybe
>> http://forum.imagej.net ).  Maybe we should start a new query on just
>> this topic?
>>
>>
>> Thanks again for the feedback.
>>
>> George
>>
>>
>>
>> On Fri, Jul 15, 2016 at 3:49 AM, Gelman, Laurent <[hidden email]>
>> wrote:
>>
>>> Dear George,
>>>
>>> We ran the same kind of tests and had the same experience as you did. We
>>> also tested numerous machines with a variable number of cores declared in
>>> the ImageJ Option Menu, in combination with different amounts of RAM,
>>> without being able to draw really clear conclusions about why it is fast
>>> or slow on the respective computers. We also tested different processes,
>>> from a simple Gaussian blur to more complex macros.
>>>
>>> In a nutshell:
>>> We also observed awful performance on our Microsoft Server 2012 /
>>> 32 CPUs / 512 GB RAM machine, irrespective of the combination of CPUs
>>> and RAM we declare in ImageJ. Surely, giving more than 16 CPUs to ImageJ
>>> does not improve overall speed; sometimes it even decreases. Note that
>>> this very same machine is really fast when using Matlab and the parallel
>>> processing toolbox.
>>> Until recently, the fastest computers we could find to run ImageJ were my
>>> iMac, which runs Windows 7 (:-)) (specs: i7-4771 CPU  3.5GHz, 32GB RAM),
>>> and the HIVE (hexacore machine) sold by the company Acquifer (no
>>> commercial interest). Until then, we thought the speed of individual CPUs
>>> was the key, less their number, but we were really surprised lately when
>>> we tested the new virtual machines (VMs) our IT department set up for us
>>> to do some remote processing of very big microscopy datasets (24 cores,
>>> 128 to 256 GB RAM for each VM). Although the CPUs on the physical servers
>>> are not that fast (2.5 GHz, but is this really a good measure of
>>> computation speed? I am not sure...), we measured that our VMs were the
>>> fastest machines we have tested so far. So we actually have no theory
>>> anymore about ImageJ and speed. It is not clear to us either whether
>>> having Windows 7 or Windows Server 2012 makes a difference.
>>> Finally, I should mention that when you use complex processes, for
>>> example Stitching, the speed of the individual CPUs is also important, as
>>> we had the impression that the reading/loading of the file uses only one
>>> core. There again, we could see a beautiful correlation between CPU speed
>>> (GHz specs) and the process.
>>>
>>> Current solution:
>>> If we really need to be very fast,
>>> 1. we write an ImageJ macro in python and launch multiple threads in
>>> parallel, but we observed that the whole was not "thread safe", i.e. we
>>> see "collisions" between the different processes.
>>> 2. we write a python program to launch multiple ImageJ instances in a
>>> headless mode and parse the macro this way.
>>>
>>> I would also be delighted to understand what makes ImageJ go fast or slow
>>> on a computer, that would help us to purchase the right machines from the
>>> beginning.
>>>
>>> Very best regards,
>>>
>>> Laurent.
>>>
>>> ___________________________
>>> Laurent Gelman, PhD
>>> Friedrich Miescher Institut
>>> Head, Facility for Advanced Imaging and Microscopy
>>> Light microscopy
>>> WRO 1066.2.16
>>> Maulbeerstrasse 66
>>> CH-4058 Basel
>>> +41 (0)61 696 35 13
>>> +41 (0)79 618 73 69
>>> www.fmi.ch
>>> www.microscopynetwork.unibas.ch/
>>>
>>>
>>> -----Original Message-----
>>> From: George Patterson [mailto:[hidden email]]
>>> Sent: Wednesday, 13 July 2016 23:55
>>> Subject: Questions regarding multithreaded processing
>>>
>>> Dear all,
>>> I’ve assembled a plugin to analyze a time series on a pixel-by-pixel
>>> basis. It works fine but is slow.
>>> There are likely still plenty of optimizations that can be done to
>>> improve the speed, and thanks to Albert Cardona and Stephan Preibisch
>>> sharing code and tutorials
>>> ( http://albert.rierol.net/imagej_programming_tutorials.html ),
>>> I even have a version that runs multi-threaded.
>>> When run on multi-core machines the speed is improved, but I’m not sure
>>> what sort of improvement I should expect.  Moreover, the machines I
>>> expected to be the fastest are not.  This likely stems from my
>>> misunderstanding of parallel processing and Java programming in general,
>>> so I’m hoping some of you with more experience can provide some feedback.
>>> I list below some observations and questions along with test runs on the
>>> same data set using the same plugin on a few different machines.
>>> Thanks for any suggestions.
>>> George
>>>
>>>
>>> Since the processing speeds differ, I realize the time each machine
>>> takes to complete the analysis will differ.  I’m more interested in the
>>> improvement from multiple threads on an individual machine.
>>> In running these tests, I altered the code to use a different number of
>>> threads in each run.
>>> Is setting the number of threads in the code and determining the time to
>>> finish the analysis a valid approach to testing improvement?
>>>
>>> Machine 5 is producing some odd behavior which I’ll discuss and ask for
>>> suggestions below.
>>>
>>> For machines 1-4, the speed improves with the number of threads up to
>>> about half the number of available processors.
>>> Do the improvements with the number of threads listed below seem
>>> reasonable?
>>> Is the improvement up to only about half the number of available
>>> processors due to “hyperthreading”?  My limited (and probably wrong)
>>> understanding is that hyperthreading makes a single core appear to be two
>>> which share resources and thus a machine with 2 cores will return 4 when
>>> queried for number of cpus.  Yes, I know that is too simplistic, but it’s
>>> the best I can do.
>>> Could it simply be that my code is not written properly to take advantage
>>> of hyperthreading?  Could anyone point me to a source and/or example code
>>> explaining how I could change it to take advantage of hyperthreading if
>>> this is the problem?
>>>
>>> Number of threads used are shown in parentheses where applicable.
>>> 1. MacBook Pro 2.66 GHz Intel Core i7
>>> number of processors: 1
>>> Number of cores: 2
>>> non-threaded plugin version ~59 sec
>>> threaded (1) ~51 sec
>>> threaded (2) ~36 sec
>>> threaded (3) ~34 sec
>>> threaded (4) ~35 sec
>>>
>>> 2. Mac Pro 2 x 2.26 GHz Quad-Core Intel Xeon
>>> number of processors: 2
>>> Number of cores: 8
>>> non-threaded plugin version ~60 sec
>>> threaded (1) ~59 sec
>>> threaded (2) ~28.9 sec
>>> threaded (4) ~15.6 sec
>>> threaded (6) ~13.2 sec
>>> threaded (8) ~11.3 sec
>>> threaded (10) ~11.1 sec
>>> threaded (12) ~11.1 sec
>>> threaded (16) ~11.5 sec
>>>
>>> 3. Windows 7 DELL   3.2 GHz Intel Core i5
>>> number of cpus shown in resource monitor: 4
>>> non-threaded plugin version ~45.3 sec
>>> threaded (1) ~48.3 sec
>>> threaded (2) ~21.7 sec
>>> threaded (3) ~20.4 sec
>>> threaded (4) ~21.8 sec
>>>
>>> 4. Windows 7 Xi MTower 2P64 Workstation, 2 x 2.1 GHz AMD Opteron 6272
>>> number of cpus shown in resource monitor: 32
>>> non-threaded plugin version ~162 sec
>>> threaded (1) ~158 sec
>>> threaded (2) ~85.1 sec
>>> threaded (4) ~46 sec
>>> threaded (8) ~22.9 sec
>>> threaded (10) ~18.6 sec
>>> threaded (12) ~16.4 sec
>>> threaded (16) ~15.8 sec
>>> threaded (20) ~15.7 sec
>>> threaded (24) ~15.9 sec
>>> threaded (32) ~16 sec
>>>
>>> For machines 1-4, the cpu usage can be observed in the Activity Monitor
>>> (Mac) or Resource Monitor (Windows) and during the execution of the
>>> plugin all of the cpus were active.  For machine 5 shown below, only 22
>>> of the 64 show activity.  And it is not always the same 22.  From the
>>> example runs below you can see it really isn’t performing very well
>>> considering the number of available cores.  I originally thought this
>>> machine should be the best, but it barely outperforms my laptop.  This is
>>> probably a question for another forum, but I am wondering if anyone else
>>> has encountered anything similar.
>>>
>>> 5. Windows Server 2012 Xi MTower 2P64 Workstation, 4 x 2.4 GHz AMD Opteron 6378
>>> number of cpus shown in resource monitor: 64
>>> non-threaded plugin version ~140 sec
>>> threaded (1) ~137 sec
>>> threaded (4) ~60.3 sec
>>> threaded (8) ~29.3 sec
>>> threaded (12) ~22.9 sec
>>> threaded (16) ~23.8 sec
>>> threaded (24) ~24.1 sec
>>> threaded (32) ~24.5 sec
>>> threaded (40) ~24.8 sec
>>> threaded (48) ~23.8 sec
>>> threaded (64) ~24.8 sec
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> ImageJ mailing list: http://imagej.nih.gov/ij/list.html
>>>
>>>