Re: Questions regarding multithreaded processing

Posted by George Patterson on Jul 15, 2016; 4:05pm
URL: http://imagej.273.s1.nabble.com/Questions-regarding-multithreaded-processing-tp5016878p5016887.html

Hi all,

Thank you all for your feedback.  Below I'll try to respond to the parts I
can answer.


>Seeing as this bit is a bit more technical and closer to a plugin
development question, would you mind posting it on http://forum.imagej.net

>Long technical email threads like this one tend to get muddy, especially
if we try to share code snippets or want to comment on a particular part.

In the future, I’ll direct plugin development questions to that forum. I
didn’t bother sharing code at this point since I just wanted to know
what improvements to expect with multi-threaded processing.


>A quick remark though. Seeing as we do not know HOW you implemented the
parallel processing, it will be difficult to help.

>Some notes: If you 'simply' make a bunch of threads where each accesses a
given pixel in a loop through an atomic integer for example, it is not
going to be faster. Accessing a single pixel is extremely fast and what
will slow you down is having each thread waiting to get its pixel index.

As I mentioned, I didn’t know what to expect so I wasn’t sure I had a
problem.  The atomic integer approach is what I used initially.  To be
clear, the speed does improve with more threads, it just doesn’t improve as
much as it should based on the responses by Oli and Michael.  Following
their suggestions, I changed the code to assign different
blocks of the image to different threads.  This seemed to improve the speed
modestly, by 5-10%.  Thanks for the suggestion.  I’ll take this approach for
any future development.
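
The block approach can be sketched as follows. This is a minimal,
hypothetical example in plain Java, not the plugin's actual code:
processPixel is a placeholder for the per-pixel fit, and the class and
method names are made up for illustration. Each thread receives one
contiguous range of pixel indices up front, so no shared counter is
consulted inside the loop.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: contiguous block partitioning of pixel indices across threads. */
final class BlockPartitionSketch {

    /** Placeholder for the per-pixel work (a curve fit in the real plugin). */
    static double processPixel(float[] pixels, int index) {
        return Math.sqrt(pixels[index]);
    }

    /** Splits [0, pixels.length) into nThreads contiguous blocks, one block per thread. */
    static double[] processInBlocks(float[] pixels, int nThreads) {
        final double[] result = new double[pixels.length];
        final int blockSize = (pixels.length + nThreads - 1) / nThreads; // ceiling division
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < nThreads; t++) {
            // Clamp both ends so surplus threads get an empty range.
            final int start = Math.min(pixels.length, t * blockSize);
            final int end = Math.min(pixels.length, start + blockSize);
            Thread worker = new Thread(() -> {
                for (int i = start; i < end; i++) {
                    result[i] = processPixel(pixels, i); // each index is written by exactly one thread
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join(); // join() also makes the worker's writes visible to this thread
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException("interrupted while waiting for workers", e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        float[] pixels = new float[256 * 256];
        for (int i = 0; i < pixels.length; i++) {
            pixels[i] = i;
        }
        int nThreads = Runtime.getRuntime().availableProcessors();
        double[] out = processInBlocks(pixels, nThreads);
        System.out.println("processed " + out.length + " pixels on " + nThreads + " threads");
    }
}
```

Because every thread writes to a disjoint index range, the result array
needs no locking; the only synchronization point is the final join.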


>IMHO the most important point for efficient parallelization (and efficient
Java code anyhow) is avoiding creating lots of objects that need garbage
collection (don't care about dozens, but definitely avoid hundreds of
thousands or millions of Objects).

Michael, thanks for sharing the list of potential problems.  I’ll work my
way through them as well as I can.  The number of objects created is the
first thing I started checking.  A new CurveFitter is created for every
pixel, so for a 256x256x200 stack more than 65,000 (256 x 256 = 65,536) are
created and subjected to garbage collection, I guess.  I still haven’t found
a way around generating this many CurveFitters.
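
Independently of the fitter objects themselves, the arrays allocated around
each fit can be reused. A minimal sketch, assuming the stack is held as one
float[] per frame (the frames layout, the slope stand-in for the real fit,
and all names here are hypothetical): each worker allocates its time-series
buffer once and refills it for every pixel, instead of creating a new array
per pixel.

```java
/** Hypothetical sketch: one reusable time-series buffer per worker instead of one array per pixel. */
final class BufferReuseSketch {

    /** Stand-in for the real fit: least-squares slope of y against the frame index. */
    static double slope(double[] y) {
        int n = y.length; // assumes at least two frames
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int t = 0; t < n; t++) {
            sx += t;
            sy += y[t];
            sxx += (double) t * t;
            sxy += t * y[t];
        }
        return (n * sxy - sx * sy) / (n * sxx - sx * sx);
    }

    /** Processes pixels [start, end); frames[t][p] is the value of pixel p in frame t. */
    static void fitRange(float[][] frames, int start, int end, double[] out) {
        double[] series = new double[frames.length]; // allocated once per worker, not per pixel
        for (int p = start; p < end; p++) {
            for (int t = 0; t < frames.length; t++) {
                series[t] = frames[t][p]; // refill the same buffer
            }
            out[p] = slope(series);
        }
    }

    public static void main(String[] args) {
        int nFrames = 200, nPixels = 256 * 256;
        float[][] frames = new float[nFrames][nPixels];
        for (int t = 0; t < nFrames; t++) {
            for (int p = 0; p < nPixels; p++) {
                frames[t][p] = 0.5f * t; // synthetic linear ramp
            }
        }
        double[] out = new double[nPixels];
        fitRange(frames, 0, nPixels, out);
        System.out.println("slope of first pixel: " + out[0]);
    }
}
```

This only removes the per-pixel array allocations; whether the fitter
object itself can be reused depends on the fitting library.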

This led me to look more closely at the CurveFitter documentation, and I
found this https://imagej.nih.gov/ij/docs/curve-fitter.html where it
indicates “Two threads with two independent minimization runs, with
repetition until two identical results are found, to avoid local minima or
keeping the result of a stuck simplex.”  Does this mean that for each
thread that generates a new CurveFitter, the CurveFitter generates a second
thread on its own?  If so, then my plugin is generating twice as many
threads as I think, which might explain why my speed improves only up to
about half the number of cpus.  Possible? Probable? No way?  Since
this is maybe getting into some technical bits which the plugin developers
probably know well, I’ll take Oli's advice and ask this on the imagej.net forum.
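
To put numbers on "how much improvement", speedup (single-thread time
divided by n-thread time) and parallel efficiency (speedup divided by n) can
be computed from timings like those in the original message quoted below.
A minimal sketch using the machine 2 timings; the class and method names
are hypothetical.

```java
/** Hypothetical sketch: speedup and parallel efficiency from measured run times. */
final class SpeedupSketch {

    /** Speedup = time with one thread / time with n threads. */
    static double speedup(double secondsOneThread, double secondsNThreads) {
        return secondsOneThread / secondsNThreads;
    }

    /** Efficiency = speedup / n; 1.0 would mean perfect scaling. */
    static double efficiency(double secondsOneThread, double secondsNThreads, int nThreads) {
        return speedup(secondsOneThread, secondsNThreads) / nThreads;
    }

    public static void main(String[] args) {
        // Timings for machine 2 (Mac Pro, 8 cores) as listed in the original message.
        int[] threads = {1, 2, 4, 6, 8, 10, 12, 16};
        double[] seconds = {59, 28.9, 15.6, 13.2, 11.3, 11.1, 11.1, 11.5};
        for (int i = 0; i < threads.length; i++) {
            System.out.printf("%2d threads: speedup %.2f, efficiency %.2f%n",
                    threads[i],
                    speedup(seconds[0], seconds[i]),
                    efficiency(seconds[0], seconds[i], threads[i]));
        }
    }
}
```

An efficiency that falls steadily as threads are added, while the speedup
flattens out, points to a shared bottleneck (memory bandwidth, garbage
collection, or extra threads spawned inside the fit) rather than to the
partitioning scheme.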


>We made the same kind of tests and experience as you did. We also tested
numerous machines with a variable number of cores declared in the ImageJ
Option Menu, in combination with different amounts of RAM, without being
able to draw really clear conclusions about why it is fast or slow on the
respective computers. We also tested different processes, from a simple
Gaussian blur to more complex macros.

Laurent, thanks for sharing your experiences.  Our issues with different
machines might be better answered on another forum (maybe
http://forum.imagej.net ).  Maybe we should start a new query on just this
topic?


Thanks again for the feedback.

George



On Fri, Jul 15, 2016 at 3:49 AM, Gelman, Laurent <[hidden email]>
wrote:

> Dear George,
>
> We made the same kind of tests and experience as you did. We also tested
> numerous machines with a variable number of cores declared in the ImageJ
> Option Menu, in combination with different amounts of RAM, without being
> able to draw really clear conclusions about why it is fast or slow on the
> respective computers. We also tested different processes, from a simple
> Gaussian blur to more complex macros.
>
> In a nutshell:
> We also observed awful performance on our Microsoft Server 2012 / 32CPUs
> / 512GB RAM machine, irrespective of the combination of CPUs and RAM we
> declare in ImageJ. Surely, giving more than 16 CPUs to ImageJ does not
> improve overall speed, sometimes it even decreases. Note that this very
> same machine is really fast when using Matlab and the parallel processing
> toolbox.
> Until recently, the fastest computers we could find to run ImageJ were my
> iMac, which runs Windows 7 (:-)) (specs: i7-4771 CPU  3.5GHz, 32GB RAM),
> and the HIVE (hexacore machine) sold by the company Acquifer (no commercial
> interest). Until then, we thought the speed of individual CPUs is the key,
> less their numbers, but we got really surprised lately when we tested the
> new virtual machines (VMs) our IT department set up for us to do some
> remote processing of very big microscopy datasets (24 cores, 128 to 256 GB
> RAM for each VM). Although the CPUs on the physical servers are not that
> fast (2.5 GHz, but is this really a good measure of computation speed? I am
> not sure...), we measured that our VMs were the fastest machines we tested
> so far. So we have actually no theory anymore about ImageJ and speed. It is
> not clear to us either, whether having Windows 7 or Windows server 2012
> makes a difference.
> Finally, I should mention that when you use complex processes, for example
> Stitching, the speed of the individual CPUs is also important, as we had
> the impression that the reading/loading of the file uses only one core.
> There again, we could see a beautiful correlation between CPU speed (GHz
> specs) and the process.
>
> Current solution:
> If we really need to be very fast,
> 1. we write an ImageJ macro in python and launch multiple threads in
> parallel, but we observed that the whole was not "thread safe", i.e. we see
> "collisions" between the different processes.
> 2. we write a python program to launch multiple ImageJ instances in a
> headless mode and parse the macro this way.
>
> I would be also delighted to understand what makes ImageJ go fast or slow
> on a computer, that would help us to purchase the right machines from the
> beginning.
>
> Very best regards,
>
> Laurent.
>
> ___________________________
> Laurent Gelman, PhD
> Friedrich Miescher Institut
> Head, Facility for Advanced Imaging and Microscopy
> Light microscopy
> WRO 1066.2.16
> Maulbeerstrasse 66
> CH-4058 Basel
> +41 (0)61 696 35 13
> +41 (0)79 618 73 69
> www.fmi.ch
> www.microscopynetwork.unibas.ch/
>
>
> -----Original Message-----
> From: George Patterson [mailto:[hidden email]]
> Sent: mercredi 13 juillet 2016 23:55
> Subject: Questions regarding multithreaded processing
>
> Dear all,
> I’ve assembled a plugin to analyze a time series on a pixel-by-pixel basis.
> It works fine but is slow.
> There are likely still plenty of optimizations that can be done to improve
> the speed, and thanks to Albert Cardona and Stephan Preibisch sharing code
> and tutorials (http://albert.rierol.net/imagej_programming_tutorials.html),
> I even have a version that runs multi-threaded.
> When run on multi-core machines the speed is improved, but I’m not sure
> what sort of improvement I should expect.  Moreover, the machines I
> expected to be the fastest are not.  This is likely stemming from my
> misunderstanding of parallel processing and Java programming in general so
> I’m hoping some of you with more experience can provide some feedback.
> I list below some observations and questions along with test runs on the
> same data set using the same plugin on a few different machines.
> Thanks for any suggestions.
> George
>
>
> Since the processing speeds differ, I realize the time each machine takes
> to complete the analysis will differ.  I’m more interested in the
> improvement from multiple threads on an individual machine.
> In running these tests, I altered the code to use a different number of
> threads in each run.
> Is setting the number of threads in the code and determining the time to
> finish the analysis a valid approach to testing improvement?
>
> Machine 5 is producing some odd behavior which I’ll discuss and ask for
> suggestions below.
>
> For machines 1-4, the speed improves with the number of threads up to
> about half the number of available processors.
> Do the improvements with the number of threads listed below seem
> reasonable?
> Is the improvement up to only about half the number of available
> processors due to “hyperthreading”?  My limited (and probably wrong)
> understanding is that hyperthreading makes a single core appear to be two
> which share resources and thus a machine with 2 cores will return 4 when
> queried for number of cpus.  Yes, I know that is too simplistic, but it’s
> the best I can do.
> Could it simply be that my code is not written properly to take advantage
> of hyperthreading?  Could anyone point me to a source and/or example code
> explaining how I could change it to take advantage of hyperthreading if
> this is the problem?
>
> Number of threads used are shown in parentheses where applicable.
> 1. MacBook Pro 2.66 GHz Intel Core i7
> number of processors: 1
> Number of cores: 2
> non-threaded plugin version ~59 sec
> threaded (1) ~51 sec
> threaded (2) ~36 sec
> threaded (3) ~34 sec
> threaded (4) ~35 sec
>
> 2. Mac Pro 2 x 2.26 GHz Quad-Core Intel Xeon
> number of processors: 2
> Number of cores: 8
> non-threaded plugin version ~60 sec
> threaded (1) ~59 sec
> threaded (2) ~28.9 sec
> threaded (4) ~15.6 sec
> threaded (6) ~13.2 sec
> threaded (8) ~11.3 sec
> threaded (10) ~11.1 sec
> threaded (12) ~11.1 sec
> threaded (16) ~11.5 sec
>
> 3. Windows 7 DELL 3.2 GHz Intel Core i5
> number of cpus shown in resource monitor: 4
> non-threaded plugin version ~45.3 sec
> threaded (1) ~48.3 sec
> threaded (2) ~21.7 sec
> threaded (3) ~20.4 sec
> threaded (4) ~21.8 sec
>
> 4. Windows 7 Xi MTower 2P64 Workstation 2 x 2.1 GHz AMD Opteron 6272
> number of cpus shown in resource monitor: 32
> non-threaded plugin version ~162 sec
> threaded (1) ~158 sec
> threaded (2) ~85.1 sec
> threaded (4) ~46 sec
> threaded (8) ~22.9 sec
> threaded (10) ~18.6 sec
> threaded (12) ~16.4 sec
> threaded (16) ~15.8 sec
> threaded (20) ~15.7 sec
> threaded (24) ~15.9 sec
> threaded (32) ~16 sec
>
> For machines 1-4, the cpu usage can be observed in the Activity Monitor
> (Mac) or Resource Monitor (Windows) and during the execution of the plugin
> all of the cpus were active.  For machine 5 shown below, only 22 of the 64
> show activity.  And it is not always the same 22.  From the example runs
> below you can see it really isn’t performing very well considering the
> number of available cores.  I originally thought this machine should be the
> best, but it barely outperforms my laptop.  This is probably a question for
> another forum, but I am wondering if anyone else has encountered anything
> similar.
>
> 5. Windows Server 2012 Xi MTower 2P64 Workstation 4 x 2.4 GHz AMD Opteron 6378
> number of cpus shown in resource monitor: 64
> non-threaded plugin version ~140 sec
> threaded (1) ~137 sec
> threaded (4) ~60.3 sec
> threaded (8) ~29.3 sec
> threaded (12) ~22.9 sec
> threaded (16) ~23.8 sec
> threaded (24) ~24.1 sec
> threaded (32) ~24.5 sec
> threaded (40) ~24.8 sec
> threaded (48) ~23.8 sec
> threaded (64) ~24.8 sec
>
>
>
>
>
>
>
> --
> View this message in context:
> http://imagej.1557.x6.nabble.com/Questions-regarding-multithreaded-processing-tp5016878.html
> Sent from the ImageJ mailing list archive at Nabble.com.
>
> --
> ImageJ mailing list: http://imagej.nih.gov/ij/list.html
>
