Dear all,
I’ve assembled a plugin to analyze a time series on a pixel-by-pixel basis. It works fine but is slow. There are likely still plenty of optimizations that could improve the speed, and thanks to Albert Cardona and Stephan Preibisch sharing code and tutorials (http://albert.rierol.net/imagej_programming_tutorials.html), I even have a version that runs multi-threaded. When run on multi-core machines the speed improves, but I’m not sure what sort of improvement I should expect. Moreover, the machines I expected to be the fastest are not. This likely stems from my misunderstanding of parallel processing and Java programming in general, so I’m hoping some of you with more experience can provide feedback. Below I list some observations and questions along with test runs on the same data set using the same plugin on a few different machines. Thanks for any suggestions.

George

Since the processing speeds differ, I realize the times for each machine to complete the analysis will differ; I’m more interested in the improvement from multiple threads on an individual machine. In running these tests, I altered the code to use a different number of threads in each run. Is setting the number of threads in the code and timing the analysis a valid approach to testing improvement?

Machine 5 is producing some odd behavior which I’ll discuss and ask for suggestions about below.

For machines 1-4, the speed improves with the number of threads up to about half the number of available processors. Do the improvements with the number of threads listed below seem reasonable? Is the improvement up to only about half the number of available processors due to “hyperthreading”? My limited (and probably wrong) understanding is that hyperthreading makes a single core appear to be two which share resources, and thus a machine with 2 cores will return 4 when queried for the number of CPUs. Yes, I know that is too simplistic, but it’s the best I can do.
Could it simply be that my code is not written properly to take advantage of hyperthreading? Could anyone point me to a source and/or example code explaining how I could change it to take advantage of hyperthreading, if this is the problem?

The number of threads used is shown in parentheses where applicable.

1. MacBook Pro 2.66 GHz Intel Core i7
number of processors: 1
number of cores: 2
non-threaded plugin version ~59 sec
threaded (1) ~51 sec
threaded (2) ~36 sec
threaded (3) ~34 sec
threaded (4) ~35 sec

2. Mac Pro 2 x 2.26 GHz Quad-Core Intel Xeon
number of processors: 2
number of cores: 8
non-threaded plugin version ~60 sec
threaded (1) ~59 sec
threaded (2) ~28.9 sec
threaded (4) ~15.6 sec
threaded (6) ~13.2 sec
threaded (8) ~11.3 sec
threaded (10) ~11.1 sec
threaded (12) ~11.1 sec
threaded (16) ~11.5 sec

3. Windows 7 DELL 3.2 GHz Intel Core i5
number of cpus shown in resource monitor: 4
non-threaded plugin version ~45.3 sec
threaded (1) ~48.3 sec
threaded (2) ~21.7 sec
threaded (3) ~20.4 sec
threaded (4) ~21.8 sec

4. Windows 7 Xi MTower 2P64 Workstation 2 x 2.1 GHz AMD Opteron 6272
number of cpus shown in resource monitor: 32
non-threaded plugin version ~162 sec
threaded (1) ~158 sec
threaded (2) ~85.1 sec
threaded (4) ~46 sec
threaded (8) ~22.9 sec
threaded (10) ~18.6 sec
threaded (12) ~16.4 sec
threaded (16) ~15.8 sec
threaded (20) ~15.7 sec
threaded (24) ~15.9 sec
threaded (32) ~16 sec

For machines 1-4, CPU usage can be observed in the Activity Monitor (Mac) or Resource Monitor (Windows), and during execution of the plugin all of the CPUs were active. For machine 5, shown below, only 22 of the 64 show activity, and it is not always the same 22. From the example runs below you can see it really isn’t performing very well considering the number of available cores. I originally thought this machine would be the best, but it barely outperforms my laptop.
This is probably a question for another forum, but I am wondering if anyone else has encountered anything similar.

5. Windows Server 2012 Xi MTower 2P64 Workstation 4 x 2.4 GHz AMD Opteron 6378
number of cpus shown in resource monitor: 64
non-threaded plugin version ~140 sec
threaded (1) ~137 sec
threaded (4) ~60.3 sec
threaded (8) ~29.3 sec
threaded (12) ~22.9 sec
threaded (16) ~23.8 sec
threaded (24) ~24.1 sec
threaded (32) ~24.5 sec
threaded (40) ~24.8 sec
threaded (48) ~23.8 sec
threaded (64) ~24.8 sec
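[Editor's note: George's method — setting the thread count in code and timing the whole run on the same data — is a reasonable benchmark; the main caveat is JVM warm-up, so each configuration should be run a few times and the later timings kept. A minimal stand-alone sketch of that method in plain Java; the class name and the dummy per-pixel workload are hypothetical placeholders, not George's actual plugin:]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SpeedupTest {

    // Stand-in for the per-pixel analysis (the real plugin fits a curve to
    // each pixel's time trace). Hypothetical workload, CPU-bound on purpose.
    static double processPixel(int i) {
        double v = i;
        for (int k = 1; k <= 200; k++) v = Math.sqrt(v + k);
        return v;
    }

    // Process nPixels with nThreads, each task owning one contiguous block.
    static double run(int nPixels, int nThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        List<Future<Double>> parts = new ArrayList<>();
        int block = (nPixels + nThreads - 1) / nThreads;
        for (int t = 0; t < nThreads; t++) {
            final int start = t * block;
            final int end = Math.min(nPixels, start + block);
            parts.add(pool.submit(() -> {
                double s = 0;
                for (int i = start; i < end; i++) s += processPixel(i);
                return s;
            }));
        }
        double sum = 0;
        for (Future<Double> f : parts) sum += f.get();
        pool.shutdown();
        return sum;
    }

    public static void main(String[] args) throws Exception {
        // Same data each run, varying only the thread count -- George's method.
        for (int threads : new int[]{1, 2, 4, 8}) {
            long t0 = System.nanoTime();
            run(1 << 20, threads);
            System.out.printf("threads=%d  %.0f ms%n",
                    threads, (System.nanoTime() - t0) / 1e6);
        }
    }
}
```

The result should be identical regardless of thread count, which doubles as a correctness check on the parallelization.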
I don't know enough about multithreading to say much intelligent, but I
did recently see a post suggesting that in parallel processing with Matlab, using more instances than physical cores may not produce much speed improvement: http://undocumentedmatlab.com/blog/a-few-parfor-tips

Kurt

On 7/13/2016 2:55 PM, George Patterson wrote:
> [quoted text snipped]

--
Kurt Thorn
Associate Professor
Director, Nikon Imaging Center
http://thornlab.ucsf.edu/
http://nic.ucsf.edu/blog/

--
ImageJ mailing list: http://imagej.nih.gov/ij/list.html
Hi Kurt and George.
Seeing as this bit is a bit more technical and closer to a plugin development question, would you mind posting it on http://forum.imagej.net ? Long technical email threads like this one tend to get muddy, especially if we try to share code snippets or want to comment on a particular part. Plus you get a bunch of the ImageJ devs who hang out there all the time. And finally, it looks pretty :)

A quick remark, though. Seeing as we do not know HOW you implemented the parallel processing, it will be difficult to help. Some notes: if you 'simply' make a bunch of threads where each accesses a given pixel in a loop through an atomic integer, for example, it is not going to be faster. Accessing a single pixel is extremely fast, and what will slow you down is having each thread wait to get its pixel index. This is why in most examples you first break up the task by number of cores and assign each thread a pre-defined block of pixels to process. That way each thread can just go and access the pixels it wants without worrying about what another thread does. There you can expect a speed increase that scales pretty linearly with the number of available cores.

Best
Oli

> -----Original Message-----
> From: ImageJ Interest Group [mailto:[hidden email]] On Behalf Of Kurt Thorn
> Sent: jeudi, 14 juillet 2016 01:26
> To: [hidden email]
> Subject: Re: Questions regarding multithreaded processing
> [quoted text snipped]
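[Editor's note: Oli's block-partitioning scheme can be sketched in plain Java. This is an illustrative stand-alone example — the class name and the trivial invert operation are placeholders for the real per-pixel analysis. Each thread owns a contiguous block of rows, so there is no per-pixel coordination at all:]

```java
import java.util.ArrayList;
import java.util.List;

public class BlockThreads {

    // Split the rows of a width*height image among threads: each thread owns
    // a contiguous block of rows and sweeps it independently. (The anti-pattern
    // is an AtomicInteger handing out one pixel index at a time -- then threads
    // spend their time contending for the shared counter instead of working.)
    static void invertParallel(float[] pixels, int width, int height,
                               int nThreads) throws InterruptedException {
        List<Thread> threads = new ArrayList<>();
        int rowsPerThread = (height + nThreads - 1) / nThreads;
        for (int t = 0; t < nThreads; t++) {
            final int yStart = t * rowsPerThread;
            final int yEnd = Math.min(height, yStart + rowsPerThread);
            Thread th = new Thread(() -> {
                for (int y = yStart; y < yEnd; y++)
                    for (int x = 0; x < width; x++)
                        pixels[y * width + x] = 255f - pixels[y * width + x];
            });
            threads.add(th);
            th.start();
        }
        for (Thread th : threads) th.join();  // wait for every block to finish
    }

    public static void main(String[] args) throws InterruptedException {
        float[] img = new float[256 * 256];
        invertParallel(img, 256, 256, 4);
        System.out.println(img[0]); // 255.0
    }
}
```

Because the blocks never overlap, no synchronization is needed inside the loop; the single join() at the end is the only coordination point.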
Hi George,
there are several reasons why there is no linear increase in speed with the number of threads. These come to my mind:

- Memory bandwidth is shared by all CPUs (typically the most important issue).
- L2 and/or L3 cache is shared between processors; the cache may be too small if there are many threads.
- When two or more threads access memory addresses that are separated by i*2^n, where i is a small integer, this causes conflicts due to limited cache associativity. This problem can arise with image sizes that are powers of 2, such as 4096*4096, and e.g. 4 threads starting at 0, 1/4, 1/2 and 3/4 of the image height. Independent of multithreading, cache associativity issues also slow down column-wise processing of images whose height is a power of 2 or i*2^n (again with i a small integer).
- Garbage collection has many "stop the world" events, meaning all application threads are stopped until the operation completes.
- Some threads might finish earlier than others (different work load, or just different success rates with memory and cache access).
- Program parts that are not parallelized (Amdahl's law).
- Overhead when creating the threads (and for synchronization between threads, if necessary).

IMHO the most important point for efficient parallelization (and efficient Java code anyhow) is avoiding creating lots of objects that need garbage collection (don't care about dozens, but definitely avoid hundreds of thousands or millions of objects).

What also helps (maybe a 20% gain, but strongly depending on the problem) is having the threads share their load such that they access the same data area for reading. E.g. the ImageJ built-in RankFilters (mean, minimum, maximum, median, remove outliers) have the work split up into pixel rows of an image, and the threads work on nearby rows (each thread also needs a few adjacent rows of data anyhow). This helps quite a bit, since all the input data nicely fit into the CPU cache, but it requires more programming effort and synchronization.

Hyperthreading: in my experience, the gain from using hyperthreading is modest, but it exists - maybe in the 10% range (i.e., when using 4 threads on a 2-core CPU with hyperthreading). This is in contrast to the results with your plugin (no gain). I have no experience with machines having a large number of cores like your 4 x 16-core AMD Opterons; I can't say whether Java still distributes the threads correctly between the cores for 64 cores.

---

By the way, just from the programming side: the easiest way to parallelize is writing a PlugInFilter. If operations on the stack slices are independent, just specify the PARALLELIZE_STACKS flag, and ImageJ will call the run(ip) method in parallel for the stack slices. If operations care about the ROI, you can also specify the PARALLELIZE_IMAGES flag. When processing a single image, ImageJ will then call the run(ip) method in parallel, with rectangular ROIs. E.g. for 4 threads, the first thread will get a ROI with the uppermost 1/4 of the height, the second thread the range from 1/4 to 1/2 of the height, and so on. The number of threads is set in Edit>Options>Memory & Threads and is initially the number of cores (including hyperthreading).

Michael

________________________________________________________________
On 2016-07-13 23:55, George Patterson wrote:
> [quoted text snipped]
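[Editor's note: Michael's Amdahl's-law point can be made concrete with a few lines of Java. If only a fraction p of the run is parallelizable, n threads can speed it up by at most 1/((1-p) + p/n), which plateaus far below n. The fractions below are illustrative, not measurements of George's plugin:]

```java
public class Amdahl {

    // Amdahl's law: if a fraction p of the total work can be parallelized
    // over n threads, the best possible overall speedup is 1/((1-p) + p/n).
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // Even with 95% of the run parallelizable, 64 threads cannot give
        // more than about a 15x speedup -- one reason a 64-CPU machine can
        // plateau long before all its cores are used.
        for (int n : new int[]{2, 8, 16, 64})
            System.out.printf("n=%2d  max speedup=%.1f%n", n, speedup(0.95, n));
    }
}
```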
Dear George,
We have made the same kind of tests and had the same experience as you. We also tested numerous machines with a variable number of cores declared in the ImageJ Options menu, in combination with different amounts of RAM, without being able to draw really clear conclusions about why it is fast or slow on the respective computers. We also tested different processes, from a simple Gaussian blur to more complex macros.

In a nutshell: we also observed awful performance on our Microsoft Server 2012 / 32 CPUs / 512 GB RAM machine, irrespective of the combination of CPUs and RAM we declare in ImageJ. Surely, giving more than 16 CPUs to ImageJ does not improve overall speed; sometimes it even decreases it. Note that this very same machine is really fast when using Matlab and the parallel processing toolbox.

Until recently, the fastest computers we could find to run ImageJ were my iMac, which runs Windows 7 (:-)) (specs: i7-4771 CPU 3.5 GHz, 32 GB RAM), and the HIVE (hexacore machine) sold by the company Acquifer (no commercial interest). Until then, we thought the speed of the individual CPUs was the key, not their number, but we were really surprised lately when we tested the new virtual machines (VMs) our IT department set up for us to do some remote processing of very big microscopy datasets (24 cores, 128 to 256 GB RAM for each VM). Although the CPUs on the physical servers are not that fast (2.5 GHz, but is this really a good measure of computation speed? I am not sure...), we measured that our VMs were the fastest machines we have tested so far. So we actually have no theory anymore about ImageJ and speed. It is not clear to us either whether having Windows 7 or Windows Server 2012 makes a difference.

Finally, I should mention that when you use complex processes, for example stitching, the speed of the individual CPUs is also important, as we had the impression that the reading/loading of the file uses only one core. There again, we could see a beautiful correlation between CPU speed (GHz specs) and the process.

Current solution: if we really need to be very fast,
1. we write an ImageJ macro in Python and launch multiple threads in parallel, but we observed that the whole was not "thread safe", i.e. we see "collisions" between the different processes;
2. we write a Python program to launch multiple ImageJ instances in headless mode and pass the macro to them that way.

I would also be delighted to understand what makes ImageJ go fast or slow on a given computer; that would help us purchase the right machines from the beginning.

Very best regards,
Laurent.

___________________________
Laurent Gelman, PhD
Friedrich Miescher Institut
Head, Facility for Advanced Imaging and Microscopy
Light microscopy
WRO 1066.2.16
Maulbeerstrasse 66
CH-4058 Basel
+41 (0)61 696 35 13
+41 (0)79 618 73 69
www.fmi.ch
www.microscopynetwork.unibas.ch/

-----Original Message-----
From: George Patterson [mailto:[hidden email]]
Sent: mercredi 13 juillet 2016 23:55
Subject: Questions regarding multithreaded processing
[quoted text snipped]
Hi all,
Thank you all for your feedback. Below I'll try to respond to the parts I can answer.

> Seeing as this bit is a bit more technical and closer to a plugin development question, would you mind posting it on http://forum.imagej.net
> Long technical email threads like this one tend to get muddy, especially if we try to share code snippets or want to comment on a particular part.

In the future, I'll direct plugin development questions to that forum. I didn't bother sharing code at this point since I just wanted to know what improvements to expect with multi-threaded processing.

> A quick remark though. Seeing as we do not know HOW you implemented the parallel processing, it will be difficult to help.
> Some notes: If you 'simply' make a bunch of threads where each accesses a given pixel in a loop through an atomic integer for example, it is not going to be faster. Accessing a single pixel is extremely fast and what will slow you down is having each thread waiting to get its pixel index.

As I mentioned, I didn't know what to expect, so I wasn't sure I had a problem. The atomic-integer approach is what I used initially. To be clear, the speed does improve with more threads; it just doesn't improve as much as it should, based on the responses from Oli and Michael. Following their suggestions, I changed the code to assign different blocks of the image to different threads. This improved the speed modestly, 5-10%. Thanks for the suggestion. I'll take this approach in any future development.

> IMHO the most important point for efficient parallelization (and efficient Java code anyhow) is avoiding creating lots of objects that need garbage collection (don't care about dozens, but definitely avoid hundreds of thousands or millions of Objects).

Michael, thanks for sharing the list of potential problems. I'll work my way through them as well as I can. The number of objects created is the first thing I started checking.
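[For illustration, the block-per-thread layout described above might look like the following self-contained sketch, which sums each pixel's values across a time series. Class and method names are hypothetical; the real plugin fits a curve per pixel rather than summing:]

```java
// Each thread gets one contiguous block of pixel indices, so threads never
// contend on a shared counter (unlike the atomic-integer scheme).
public class BlockThreadsDemo {

    // data holds nFrames frames of nPixels each, frame-major:
    // data[f * nPixels + p] is pixel p at time point f.
    // Returns, for each pixel, the sum over all time points.
    public static float[] sumOverTime(float[] data, int nPixels, int nFrames, int nThreads) {
        float[] out = new float[nPixels];
        Thread[] threads = new Thread[nThreads];
        int block = (nPixels + nThreads - 1) / nThreads;   // ceiling division
        for (int t = 0; t < nThreads; t++) {
            final int start = t * block;
            final int end = Math.min(start + block, nPixels);
            threads[t] = new Thread(() -> {
                for (int p = start; p < end; p++) {        // this thread's block only
                    float sum = 0;
                    for (int f = 0; f < nFrames; f++)
                        sum += data[f * nPixels + p];
                    out[p] = sum;                          // no two threads share a pixel
                }
            });
            threads[t].start();
        }
        for (Thread th : threads) {
            try { th.join(); }
            catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return out;
    }
}
```

[Each thread reads and writes only its own index range, so no synchronization is needed beyond the final join.]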
A new CurveFitter is created for every pixel, so for a 256x256x200 stack more than 65,000 are created and subjected to garbage collection, I guess. I still haven't found a way around generating this many CurveFitters.

This led me to look more closely at the CurveFitter documentation, and I found https://imagej.nih.gov/ij/docs/curve-fitter.html, where it indicates: "Two threads with two independent minimization runs, with repetition until two identical results are found, to avoid local minima or keeping the result of a stuck simplex." Does this mean that for each thread that generates a new CurveFitter, the CurveFitter generates a second thread on its own? If so, then my plugin is generating twice as many threads as I think, which might explain why my speed improvement is observed only up to about half the number of CPUs. Possible? Probable? No way? Since this is getting into some technical bits which the plugin developers probably know well, I'll take Oli's advice and ask this on the imagej.net forum.

> We made the same kind of tests and experience as you did. We also tested numerous machines with a variable number of cores declared in the ImageJ Option Menu, in combination with different amounts of RAM, without being able to draw really clear conclusions about why it is fast or slow on the respective computers. We also tested different processes, from a simple Gaussian blur to more complex macros.

Laurent, thanks for sharing your experiences. Our issues with different machines might be better answered on another forum (maybe http://forum.imagej.net ). Maybe we should start a new query on just this topic?

Thanks again for the feedback.

George

On Fri, Jul 15, 2016 at 3:49 AM, Gelman, Laurent <[hidden email]> wrote:

> Dear George,
>
> We made the same kind of tests and experience as you did. We also tested numerous machines with a variable number of cores declared in the ImageJ Option Menu, in combination with different amounts of RAM, without being able to draw really clear conclusions about why it is fast or slow on the respective computers. We also tested different processes, from a simple Gaussian blur to more complex macros.
>
> In a nutshell:
> We also observed awful performance on our Microsoft Server 2012 / 32 CPUs / 512 GB RAM machine, irrespective of the combination of CPUs and RAM we declare in ImageJ. Surely, giving more than 16 CPUs to ImageJ does not improve overall speed; sometimes it even decreases it. Note that this very same machine is really fast when using Matlab and the parallel processing toolbox.
> Until recently, the fastest computers we could find to run ImageJ were my iMac, which runs Windows 7 (:-)) (specs: i7-4771 CPU 3.5 GHz, 32 GB RAM), and the HIVE (hexacore machine) sold by the company Acquifer (no commercial interest). Until then, we thought the speed of individual CPUs was the key, less their number, but we were really surprised lately when we tested the new virtual machines (VMs) our IT department set up for us to do some remote processing of very big microscopy datasets (24 cores, 128 to 256 GB RAM for each VM). Although the CPUs on the physical servers are not that fast (2.5 GHz, but is this really a good measure of computation speed? I am not sure...), we measured that our VMs were the fastest machines we have tested so far. So we actually have no theory anymore about ImageJ and speed. It is not clear to us either whether having Windows 7 or Windows Server 2012 makes a difference.
> Finally, I should mention that when you use complex processes, for example Stitching, the speed of the individual CPUs is also important, as we had the impression that the reading/loading of the file uses only one core. There again, we could see a beautiful correlation between CPU speed (GHz specs) and the process.
>
> Current solution:
> If we really need to be very fast,
> 1. we write an ImageJ macro in python and launch multiple threads in parallel, but we observed that the whole was not "thread safe", i.e. we see "collisions" between the different processes.
> 2. we write a python program to launch multiple ImageJ instances in headless mode and parse the macro this way.
>
> I would also be delighted to understand what makes ImageJ go fast or slow on a computer; that would help us purchase the right machines from the beginning.
>
> Very best regards,
>
> Laurent.
>
> ___________________________
> Laurent Gelman, PhD
> Friedrich Miescher Institut
> Head, Facility for Advanced Imaging and Microscopy
> Light microscopy
> WRO 1066.2.16
> Maulbeerstrasse 66
> CH-4058 Basel
> +41 (0)61 696 35 13
> +41 (0)79 618 73 69
> www.fmi.ch
> www.microscopynetwork.unibas.ch/
>
> -----Original Message-----
> From: George Patterson [mailto:[hidden email]]
> Sent: Wednesday, 13 July 2016 23:55
> Subject: Questions regarding multithreaded processing
|
Hi George,
as curve fitting is an essential part of your plugin, probably I should answer. (I am responsible for the two threads and other code in the CurveFitter. You would not reach me on the developers' mailing list, as my main occupation is not in computing or image processing; I contribute to ImageJ only now and then, when I need something for my own work or think I might eventually need it.)

The CurveFitter usually uses the Minimizer, which indeed uses two threads. It does not use the Minimizer (and thus uses only one thread) for the linear regression fits:
- Straight Line: y = a + bx
- Exponential (linear regression): y = a*exp(bx)
- Power (linear regression): y = a*x^b

If it's not linear regression, you can disable the use of two Minimizer threads with

  myCurveFitter.getMinimizer().setMaximumThreads(1);

(obviously, before you call myCurveFitter.doFit). When using many parallel threads, this will also speed things up slightly: if the Minimizer does not find two consistent solutions immediately, with two threads it will give it two more tries (a total of four), while with one thread it may get two consistent solutions after a total of three tries.

If it is linear regression, ask me again; somewhere I have a linear regression class that can be reused without creating a new object. If your curve fitting problem is linear in all parameters, but not a linear regression with only two parameters (e.g. polynomial, a*sin(x)+b*cos(x)+c, etc.), it would be faster to use the analytical solution instead of the CurveFitter, but it is more programming effort (you could try to find a suitable library such as Apache Commons Math or Jama).
If your problem is nonlinear but has one of the forms
- a + b*function(c, d, ...; x),
- a + function(b, c, d, ...; x),
- a + b*x + function(c, d, ...; x), or
- a*x + function(b, c, d, ...; x),
you can speed it up a lot by eliminating one or two parameters with myCurveFitter.setOffsetMultiplySlopeParams.

By the way, creating a new CurveFitter also creates several other objects, so having one per pixel really induces a lot of garbage collection.

If creating many CurveFitters or Minimizers is common (anyone out there who also does this?), we should consider making the CurveFitter and Minimizer reusable (e.g. with a CurveFitter.setData(double[] xData, double[] yData) method, which clears the previous result and settings).

Best regards,
Michael

________________________________________________________________
On 2016-07-15 18:16, George Patterson wrote:
> [...]
|
Michael,
Thanks for the quick response, and for providing us with the CurveFitter.

> as curve fitting is an essential part of your plugin, probably I should answer (I am responsible for the two threads and more code in the CurveFitter. You would not reach me on the developers' mailing list as my main occupation is not in computing or image processing...)

Good to know. I'll let the other forum know the answer is over here.

> The CurveFitter usually uses the Minimizer, which uses two threads, indeed. It does not use the Minimizer (and thus, only one thread) for linear regression fits: Straight Line, Exponential (linear regression), and Power (linear regression).

I am using "Exponential with offset", so no linear regression.

> If it's not linear regression, you can disable using two Minimizer threads by
> myCurveFitter.getMinimizer().setMaximumThreads(1);
> (obviously, before you call myCurveFitter.doFit)

Using the machine below again:

4. Windows 7 Xi MTower 2P64 Workstation, 2 x 2.1 GHz AMD Opteron 6272
number of cpus shown in resource monitor: 32

With the maximum threads set to one (previous times for comparison):
threaded (1) ~205.3 sec (previously 158 sec)
threaded (2) ~108.9 sec (85.1 sec)
threaded (4) ~64.5 sec (46 sec)
threaded (8) ~35.2 sec (22.9 sec)
threaded (10) ~28.1 sec (18.6 sec)
threaded (12) ~24.6 sec (16.4 sec)
threaded (16) ~17.7 sec (15.8 sec)
threaded (20) ~15.1 sec (15.7 sec)
threaded (24) ~13.3 sec (15.9 sec)
threaded (32) ~10 sec (16 sec)

The improvement is now much closer to linear, though it is slower with fewer threads than before. I bet you know why. Care to educate me?

> By the way, creating a new CurveFitter also creates several other objects, so having one per pixel really induces a lot of garbage collection.
> So in addition to producing more threads than I originally thought, my major limitation is probably the amount of garbage I'm producing. Correct? > > If creating many CurveFitters or Minimizers is more common (anyone out > there who also does this?) we should consider making the CurveFitter and > Minimizer reusable (e.g. with a CurveFitter.setData(double[] xData, > double[] yData) method, which clears the previous result and settings). > Obviously I'm in favor but that sounds like it might take a bit of effort by someone in the know. Thanks again for your help. Best, George > _________________________ > > On 2016-07-15 18:16, George Patterson wrote: > >> Hi all, >> >> Thank you all for your feedback. Below I'll try to respond to the parts I >> can answer. >> >> >> Seeing as this bit is a bit more technical and closer to a plugin >>> >> development question, would you mind posting it on >> http://forum.imagej.net >> >> Long technical email threads like this one tend to get muddy, especially >>> >> if we try to share code snippets or want to comment on a particular part. >> >> In the future, I’ll direct plugin development questions to that forum. I >> didn’t bother sharing code since at this point since I just wanted to know >> what improvements to expect with multi-threaded processing. >> >> >> A quick remark though. Seeing as we do not know HOW you implemented the >>> >> parallel processing, it will be difficult to help. >> >> Some notes: If you 'simply' make a bunch of threads where each accesses a >>> >> given pixel in a loop through an atomic integer for example, it is not >> going to be faster. Accessing a single pixel is extremely fast >and what >> will slow you down is having each thread waiting to get its pixel index. >> >> As I mentioned, I didn’t know what to expect so I wasn’t sure I had a >> problem. The atomic integer approach is what I used initially. 
To be >> clear, the speed does improve with more threads, it just doesn’t improve >> as >> much as it should based on the responses by Oli and Micheal. Based on >> suggestions from Oli and Micheal, I changed the code to designate >> different >> blocks of the image to different threads. This seemed to improve the >> speed >> modestly 5-10%. Thanks for the suggestion. I’ll take this approach for >> any future developments. >> >> >> IMHO the most important point for efficient parallelization (and efficient >>> >> Java code anyhow) is avoiding creating lots of objects that need garbage >> collection (don't care about dozens, but definitely avoid >hundreds of >> thousands or millions of Objects). >> >> Micheal thanks for sharing the list of potential problems. I’ll work my >> way through them as well as I can. The number of objects created is the >> first I started checking. A new Curvefitter is created for every pixel so >> for a 256x256x200 stack >65000 are created and subjected to garbage >> collection I guess. I still haven’t found a way around generating this >> many curvefitters. >> >> This led me to looking more closely at the Curvefitter documentation and I >> found this https://imagej.nih.gov/ij/docs/curve-fitter.html where it >> indicates “Two threads with two independent minimization runs, with >> repetition until two identical results are found, to avoid local minima or >> keeping the result of a stuck simplex.” Does this mean that for each >> thread that generates a new Curvefitter, the Curvefitter generates a >> second >> thread on its own? If so, then my plugin is generating twice as many >> threads as I think and might explain why my speed improvement is observed >> only to about half the number of cpus. Possible? Probable? No way? Since >> this is maybe getting into some technical bits which the plugin developers >> probably know well, I’ll take Oli's advice ask this on the imagej.net >> forum. 
>>> We made the same kind of tests and experience as you did. We also tested
>>> numerous machines with a variable number of cores declared in the ImageJ
>>> Option Menu, in combination with different amounts of RAM, without being
>>> able to draw really clear conclusions about why it is fast or slow on the
>>> respective computers. We also tested different processes, from a simple
>>> Gaussian blur to more complex macros.
>>
>> Laurent, thanks for sharing your experiences. Our issues with different
>> machines might be better answered on another forum (maybe
>> http://forum.imagej.net ). Maybe we should start a new query on just this
>> topic?
>>
>> Thanks again for the feedback.
>>
>> George
>>
>> On Fri, Jul 15, 2016 at 3:49 AM, Gelman, Laurent <[hidden email]> wrote:
>>
>>> Dear George,
>>>
>>> We made the same kind of tests and experience as you did. We also tested
>>> numerous machines with a variable number of cores declared in the ImageJ
>>> Option Menu, in combination with different amounts of RAM, without being
>>> able to draw really clear conclusions about why it is fast or slow on the
>>> respective computers. We also tested different processes, from a simple
>>> Gaussian blur to more complex macros.
>>>
>>> In a nutshell:
>>> We also observed awful performance on our Microsoft Server 2012 / 32 CPUs
>>> / 512 GB RAM machine, irrespective of the combination of CPUs and RAM we
>>> declare in ImageJ. Giving more than 16 CPUs to ImageJ does not improve
>>> overall speed; sometimes it even decreases it. Note that this very same
>>> machine is really fast when using Matlab and the parallel processing
>>> toolbox.
>>> Until recently, the fastest computers we could find to run ImageJ were my
>>> iMac, which runs Windows 7 (:-)) (specs: i7-4771 CPU 3.5GHz, 32GB RAM),
>>> and the HIVE (hexacore machine) sold by the company Acquifer (no
>>> commercial interest).
Until then, we thought the speed of individual CPUs is the >>> key, >>> less their numbers, but we got really surprised lately when we tested the >>> new virtual machines (VMs) our IT department set up for us to do some >>> remote processing of very big microscopy datasets (24 cores, 128 to 256 >>> GB >>> RAM for each VM). Although the CPUs on the physical servers are not that >>> fast (2.5 GHz, but is this really a good measure of computation speed? I >>> am >>> not sure...), we measured that our VMs were the fastest machines we >>> tested >>> so far. So we have actually no theory anymore about ImageJ and speed. It >>> is >>> not clear to us either, whether having Windows 7 or Windows server 2012 >>> makes a difference. >>> Finally, I should mention that when you use complex processes, for >>> example >>> Stitching, the speed of the individual CPUs is also important, as we had >>> the impression that the reading/loading of the file uses only one core. >>> There again, we could see a beautiful correlation between CPU speed (GHz >>> specs) and the process. >>> >>> Current solution: >>> If we really need to be very fast, >>> 1. we write an ImageJ macro in python and launch multiple threads in >>> parallel, but we observed that the whole was not "thread safe", i.e. we >>> see >>> "collisions" between the different processes. >>> 2. we write a python program to launch multiple ImageJ instances in a >>> headless mode and parse the macro this way. >>> >>> I would be also delighted to understand what makes ImageJ go fast or slow >>> on a computer, that would help us to purchase the right machines from the >>> beginning. >>> >>> Very best regards, >>> >>> Laurent. 
>>> ___________________________
>>> Laurent Gelman, PhD
>>> Friedrich Miescher Institut
>>> Head, Facility for Advanced Imaging and Microscopy
>>> Light microscopy
>>> WRO 1066.2.16
>>> Maulbeerstrasse 66
>>> CH-4058 Basel
>>> +41 (0)61 696 35 13
>>> +41 (0)79 618 73 69
>>> www.fmi.ch
>>> www.microscopynetwork.unibas.ch/
>>>
>>> -----Original Message-----
>>> From: George Patterson [mailto:[hidden email]]
>>> Sent: Wednesday, 13 July 2016 23:55
>>> Subject: Questions regarding multithreaded processing
>>>
>>> Dear all,
>>> I've assembled a plugin to analyze a time series on a pixel-by-pixel
>>> basis. It works fine but is slow.
>>> There are likely still plenty of optimizations that can be done to
>>> improve the speed, and thanks to Albert Cardona and Stephan Preibisch
>>> sharing code and tutorials
>>> ( http://albert.rierol.net/imagej_programming_tutorials.html ),
>>> I even have a version that runs multi-threaded.
>>> When run on multi-core machines the speed is improved, but I'm not sure
>>> what sort of improvement I should expect. Moreover, the machines I
>>> expected to be the fastest are not. This likely stems from my
>>> misunderstanding of parallel processing and Java programming in
>>> general, so I'm hoping some of you with more experience can provide
>>> some feedback.
>>> I list below some observations and questions along with test runs on
>>> the same data set using the same plugin on a few different machines.
>>> Thanks for any suggestions.
>>> George
>>>
>>> Since the processing speeds differ, I realize the speeds of each machine
>>> to complete the analysis will differ. I'm more interested in the
>>> improvement with multiple threads on an individual machine.
>>> In running these tests, I altered the code to use a different number of
>>> threads in each run.
>>> Is setting the number of threads in the code and determining the time to
>>> finish the analysis a valid approach to testing improvement?
>>> Machine 5 is producing some odd behavior which I'll discuss and ask for
>>> suggestions below.
>>>
>>> For machines 1-4, the speed improves with the number of threads up to
>>> about half the number of available processors.
>>> Do the improvements with the number of threads listed below seem
>>> reasonable?
>>> Is the improvement up to only about half the number of available
>>> processors due to "hyperthreading"? My limited (and probably wrong)
>>> understanding is that hyperthreading makes a single core appear to be
>>> two which share resources, and thus a machine with 2 cores will return
>>> 4 when queried for the number of cpus. Yes, I know that is too
>>> simplistic, but it's the best I can do.
>>> Could it simply be that my code is not written properly to take
>>> advantage of hyperthreading? Could anyone point me to a source and/or
>>> example code explaining how I could change it to take advantage of
>>> hyperthreading if this is the problem?
>>>
>>> The number of threads used is shown in parentheses where applicable.
>>>
>>> 1. MacBook Pro 2.66 GHz Intel Core i7
>>> number of processors: 1
>>> Number of cores: 2
>>> non-threaded plugin version ~59 sec
>>> threaded (1) ~51 sec
>>> threaded (2) ~36 sec
>>> threaded (3) ~34 sec
>>> threaded (4) ~35 sec
>>>
>>> 2. Mac Pro 2 x 2.26 GHz Quad-Core Intel Xeon
>>> number of processors: 2
>>> Number of cores: 8
>>> non-threaded plugin version ~60 sec
>>> threaded (1) ~59 sec
>>> threaded (2) ~28.9 sec
>>> threaded (4) ~15.6 sec
>>> threaded (6) ~13.2 sec
>>> threaded (8) ~11.3 sec
>>> threaded (10) ~11.1 sec
>>> threaded (12) ~11.1 sec
>>> threaded (16) ~11.5 sec
>>>
>>> 3. Windows 7 DELL 3.2 GHz Intel Core i5
>>> number of cpus shown in resource monitor: 4
>>> non-threaded plugin version ~45.3 sec
>>> threaded (1) ~48.3 sec
>>> threaded (2) ~21.7 sec
>>> threaded (3) ~20.4 sec
>>> threaded (4) ~21.8 sec
>>>
>>> 4.
>>> Windows 7 Xi MTower 2P64 Workstation 2 x 2.1 GHz AMD Opteron 6272
>>> number of cpus shown in resource monitor: 32
>>> non-threaded plugin version ~162 sec
>>> threaded (1) ~158 sec
>>> threaded (2) ~85.1 sec
>>> threaded (4) ~46 sec
>>> threaded (8) ~22.9 sec
>>> threaded (10) ~18.6 sec
>>> threaded (12) ~16.4 sec
>>> threaded (16) ~15.8 sec
>>> threaded (20) ~15.7 sec
>>> threaded (24) ~15.9 sec
>>> threaded (32) ~16 sec
>>>
>>> For machines 1-4, the cpu usage can be observed in the Activity Monitor
>>> (Mac) or Resource Monitor (Windows), and during the execution of the
>>> plugin all of the cpus were active. For machine 5 shown below, only 22
>>> of the 64 show activity. And it is not always the same 22. From the
>>> example runs below you can see it really isn't performing very well
>>> considering the number of available cores. I originally thought this
>>> machine should be the best, but it barely outperforms my laptop. This
>>> is probably a question for another forum, but I am wondering if anyone
>>> else has encountered anything similar.
>>>
>>> 5. Windows Server 2012 Xi MTower 2P64 Workstation 4 x 2.4 GHz AMD
>>> Opteron 6378
>>> number of cpus shown in resource monitor: 64
>>> non-threaded plugin version ~140 sec
>>> threaded (1) ~137 sec
>>> threaded (4) ~60.3 sec
>>> threaded (8) ~29.3 sec
>>> threaded (12) ~22.9 sec
>>> threaded (16) ~23.8 sec
>>> threaded (24) ~24.1 sec
>>> threaded (32) ~24.5 sec
>>> threaded (40) ~24.8 sec
>>> threaded (48) ~23.8 sec
>>> threaded (64) ~24.8 sec
--
ImageJ mailing list: http://imagej.nih.gov/ij/list.html
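[Editor's note] The reusable-fitter idea endorsed in the message above (a setData(double[], double[]) method that clears the previous result, so one instance per thread can serve many fits without per-pixel allocation) can be illustrated with a toy, self-contained class. This is a hypothetical sketch, not ImageJ's actual CurveFitter; for simplicity it fits a straight line y = a + b*x:

```java
/**
 * Toy sketch of the proposed reuse pattern: allocate one fitter per
 * worker thread and re-prime it with setData() for each pixel, so no
 * new objects are created (and garbage-collected) per fit.
 */
public class ReusableLineFitter {
    private double[] x, y;
    private double a, b;
    private boolean done;

    /** Replaces the data and clears the previous result and state. */
    public void setData(double[] xData, double[] yData) {
        x = xData; y = yData;
        a = b = 0; done = false;
    }

    /** Ordinary least-squares fit of y = a + b*x. */
    public void doFit() {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        a = (sy - b * sx) / n;
        done = true;
    }

    public double[] getParams() {
        if (!done) throw new IllegalStateException("call doFit() first");
        return new double[] { a, b };
    }
}
```

The same instance can be reused across an entire block of pixels, which is exactly what the proposed CurveFitter.setData() would enable.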
Hi George,
concerning the comparison with one (new) or two (previous) Minimizer threads:

>                setting the maximum threads to one    previous
> threaded (1)   ~205.3 sec                            158 sec
> threaded (2)   ~108.9 sec                            85.1 sec
> threaded (4)   ~64.5 sec                             46 sec
> threaded (8)   ~35.2 sec                             22.9 sec
> threaded (10)  ~28.1 sec                             18.6 sec
> threaded (12)  ~24.6 sec                             16.4 sec
> threaded (16)  ~17.7 sec                             15.8 sec
> threaded (20)  ~15.1 sec                             15.7 sec
> threaded (24)  ~13.3 sec                             15.9 sec
> threaded (32)  ~10 sec                               16 sec

For the minimizing operation itself, the 'previous' case has twice the number of threads due to the Minimizer, so it was actually minimizing with 2 to 64 threads. This explains why there was no gain in the previous version when increasing the number of threads from 16 to 32 (with 32 processors): it was actually an increase from 32 to 64.

For comparing the speed with one or two Minimizer threads, this means that you have to compare like the following:

               new, one minimizer thread   previous
threaded (2)   ~108.9 sec                  158 sec
threaded (4)   ~64.5 sec                   85.1 sec
threaded (8)   ~35.2 sec                   46 sec
threaded (16)  ~17.7 sec                   22.9 sec
threaded (32)  ~10 sec                     15.8 sec
threaded (64)                              16 sec

So it clearly helps to use one Minimizer thread; possibly the main reason is avoiding the overhead of creating a Minimizer thread for each pixel and the accompanying synchronization between the two Minimizer threads.

The table also tells us that the gain from parallelization is not so bad: a factor of 20 from 1 to 32 threads, so the total time is not dominated by the 'stop the world' events for garbage collection or by memory bandwidth.

--

Concerning the curve fitting problem, "Exponential with offset", y = a*exp(-bx) + c:

The CurveFitter eliminates two parameters (a, c) by linear regression, so it actually performs a one-dimensional minimization. I guess that this problem is well-behaved and the Minimizer always finds the correct result in the first attempt. Then it is not necessary to try a second run.
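[Editor's note] The point above about eliminating 'a' and 'c' by linear regression can be made concrete with a small self-contained sketch (hypothetical class name, not ImageJ's implementation): once 'b' is fixed, the model y = a*exp(-bx) + c is linear in 'a' and 'c', so both follow directly from the 2x2 normal equations and the Minimizer only has to search along the single dimension 'b'.

```java
/**
 * For a fixed 'b', solve the linear least-squares problem in 'a' and
 * 'c' for the model y = a*exp(-b*x) + c. This reduces the nonlinear
 * fit to a one-dimensional search over 'b'.
 */
public class ExpOffsetFit {

    /** Returns {a, c} minimizing sum over i of (a*exp(-b*x[i]) + c - y[i])^2. */
    public static double[] solveACforB(double[] x, double[] y, double b) {
        int n = x.length;
        double se = 0, se2 = 0, sy = 0, sey = 0;
        for (int i = 0; i < n; i++) {
            double e = Math.exp(-b * x[i]);   // basis value exp(-b*x) at x[i]
            se += e; se2 += e * e; sy += y[i]; sey += e * y[i];
        }
        // Normal equations:  [se2 se; se n] * [a; c] = [sey; sy]
        double det = se2 * n - se * se;
        double a = (sey * n - se * sy) / det;
        double c = (se2 * sy - se * sey) / det;
        return new double[] { a, c };
    }
}
```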
So you can try:

  myCurveFitter.getMinimizer().setMaxRestarts(0);

This makes the Minimizer run only once, with no second try to make sure the result is correct. It also avoids a second thread. I would suggest that you try it and compare whether the result is the same (there might be tiny differences since minimization is stochastic and the accuracy is finite).

If it works as I expect, it should cut the time for minimization to 1/2. If the decrease in processing time is comparable, it would mean that computing time is still dominated by the Minimizer, not garbage collection (and the rest of processing each pixel, including memory access). If the speed gain is only marginal, it would indicate that optimization should focus on garbage collection and the non-minimizer operations per pixel.

What you might also do to speed up the process: if you have a good guess for the 'b' parameter and the typical uncertainty of this guess, specify them in the initialParams and initialParamVariations. E.g. if 'b' does not change much between neighboring pixels, use the previous value for initialization. The default value of the initialParamVariations for 'b' is 10% of the specified 'b' value.
Don't worry about the initial 'a' and 'c' parameters and their ranges; these values will be ignored.

HTH,
Michael

____________________________________________________________________

On Fri, July 15, 2016 22:03, George Patterson wrote:
> Michael,
> Thanks for the quick response and for providing us with CurveFitter.
>
>> as curve fitting is an essential part of your plugin, probably I should
>> answer (I am responsible for the two threads and more code in the
>> CurveFitter. You would not reach me on the developers' mailing list, as
>> my main occupation is not in computing or image processing, and I
>> contribute to ImageJ only now and then, if I need it for my own work or
>> I think I might eventually need it).
>
> Good to know.
> I'll let the other forum know the answer is over here.
>
>> The CurveFitter usually uses the Minimizer, which uses two threads,
>> indeed. It does not use the Minimizer (and thus, only one thread) for
>> linear regression fits:
>> - Straight Line 'y = a+bx'
>> - "Exponential (linear regression)" 'y = a*exp(bx)', and
>> - "Power (linear regression)" 'a*x^b'.
>
> I am using "Exponential with offset", so no linear regression.
>
>> If it's not linear regression, you can disable using two Minimizer
>> threads by
>>   myCurveFitter.getMinimizer().setMaximumThreads(1);
>> (obviously, before you call myCurveFitter.doFit)
>
> Using the machine below again.
>
> 4. Windows 7 Xi MTower 2P64 Workstation 2 x 2.1 GHz AMD Opteron 6272
> number of cpus shown in resource monitor: 32
>
>                setting the maximum threads to one    previous
> threaded (1)   ~205.3 sec                            158 sec
> threaded (2)   ~108.9 sec                            85.1 sec
> threaded (4)   ~64.5 sec                             46 sec
> threaded (8)   ~35.2 sec                             22.9 sec
> threaded (10)  ~28.1 sec                             18.6 sec
> threaded (12)  ~24.6 sec                             16.4 sec
> threaded (16)  ~17.7 sec                             15.8 sec
> threaded (20)  ~15.1 sec                             15.7 sec
> threaded (24)  ~13.3 sec                             15.9 sec
> threaded (32)  ~10 sec                               16 sec
>
> The improvement is much closer to linear. It is slower with fewer
> threads than before. I bet you know why. Care to educate me?
>
>> By the way, creating a new CurveFitter also creates several other
>> objects, so having one per pixel really induces a lot of garbage
>> collection.
>
> So in addition to producing more threads than I originally thought, my
> major limitation is probably the amount of garbage I'm producing.
> Correct?
>
>> If creating many CurveFitters or Minimizers is more common (anyone out
>> there who also does this?) we should consider making the CurveFitter
>> and Minimizer reusable (e.g. with a CurveFitter.setData(double[] xData,
>> double[] yData) method, which clears the previous result and settings).
>
> Obviously I'm in favor but that sounds like it might take a bit of effort
> by someone in the know.
>
> Thanks again for your help.
>
> Best,
> George

--
ImageJ mailing list: http://imagej.nih.gov/ij/list.html
Hi Michael,
Thanks for the feedback.

> For comparing the speed with one or two Minimizer threads, this means
> that you have to compare like the following:
>
>                new, one minimizer thread   previous
> threaded (2)   ~108.9 sec                  158 sec
> threaded (4)   ~64.5 sec                   85.1 sec
> threaded (8)   ~35.2 sec                   46 sec
> threaded (16)  ~17.7 sec                   22.9 sec
> threaded (32)  ~10 sec                     15.8 sec
> threaded (64)                              16 sec

Of course. Sorry for the mix-up.

> So it clearly helps to use one Minimizer thread; possibly the main reason
> is avoiding the overhead of creating a Minimizer thread for each pixel
> and the accompanying synchronization between the two Minimizer threads.
>
> The table also tells that the gain with parallelization is not so bad:
> a factor of 20 from 1 to 32 threads, so the total time is not dominated
> by the 'stop the world' events for garbage collection or memory
> bandwidth.
>
> --
>
> Concerning the curve fitting problem, "Exponential with offset",
> y = a*exp(-bx) + c:
>
> The CurveFitter eliminates two parameters (a, c) by linear regression,
> so it actually performs a one-dimensional minimization. I guess that
> this problem is well-behaved and the Minimizer always finds the correct
> result in the first attempt. Then it is not necessary to try a second
> run.
> So you can try:
>   myCurveFitter.getMinimizer().setMaxRestarts(0);
> This makes the Minimizer run only once, with no second try to make sure
> the result is correct. It also avoids a second thread.
> I would suggest that you try it and compare whether the result is the
> same (there might be tiny differences since minimization is stochastic
> and the accuracy is finite).

Some new comparisons are below to see if it is behaving as you expect.

4. Windows 7 Xi MTower 2P64 Workstation 2 x 2.1 GHz AMD Opteron 6272
number of cpus shown in resource monitor: 32
myCurveFitter.getMinimizer().setMaxRestarts(0);
threads (1)   103.5 sec
threads (2)   55 sec
threads (4)   33.3 sec
threads (8)   18.2 sec
threads (16)  9.7 sec
threads (32)  5.6 sec

Seems to make it even faster. I think this is what you predicted.

4. Windows 7 Xi MTower 2P64 Workstation 2 x 2.1 GHz AMD Opteron 6272
number of cpus shown in resource monitor: 32
myCurveFitter.getMinimizer().setMaximumThreads(1);
myCurveFitter.getMinimizer().setMaxRestarts(0);
threads (1)   205.3 sec
threads (2)   107.6 sec
threads (4)   64.9 sec
threads (8)   34.9 sec
threads (16)  17.9 sec
threads (32)  10.2 sec

There is likely no reason to use both of these commands together, but it seems to give the same result as using only myCurveFitter.getMinimizer().setMaximumThreads(1); I included this just to see if this is expected behavior.

> If it works as I expect, it should cut the time for minimization to 1/2.
> If the decrease in processing time is comparable, it would mean that
> computing time is still dominated by the Minimizer, not garbage
> collection (and the rest of processing each pixel, including memory
> access). If the speed gain is only marginal, it would indicate that
> optimization should focus on garbage collection and the non-minimizer
> operations per pixel.
>
> What you might also do to speed up the process: if you have a good guess
> for the 'b' parameter and the typical uncertainty of this guess, specify
> them in the initialParams and initialParamVariations. E.g. if 'b' does
> not change much between neighboring pixels, use the previous value for
> initialization. The default value of the initialParamVariations for 'b'
> is 10% of the specified 'b' value.
> Don't worry about the initial 'a' and 'c' parameters and their ranges;
> these values will be ignored.

Thanks for the suggestions. They've given me some ideas to incorporate into the plugin.
I wasn't really expecting any miracle speed up for my plugin. Just the suggestions you've made have vastly improved it. I do notice small differences in the final results with the different versions 2 threads versus setMaximumThreads versus setMaxRestarts, but these seem to be at the 8th and 9th decimal places. And thanks again for all your help. George -- ImageJ mailing list: http://imagej.nih.gov/ij/list.html |
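For reference, the per-pixel threading pattern discussed in this thread (a fixed number of worker threads pulling pixels from a shared counter, as in the Cardona/Preibisch tutorials) can be sketched in plain Java. This is only an illustration: `fitPixel` is a stand-in for the real per-pixel CurveFitter call, and the class name is invented for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PixelWiseFit {
    // Stand-in for the real per-pixel fit; the actual plugin would run
    // an ImageJ CurveFitter on this pixel's time series instead.
    static double fitPixel(float[] timeSeries) {
        double sum = 0;
        for (float v : timeSeries) sum += v;
        return sum / timeSeries.length;
    }

    // Split the pixels across nThreads workers; each worker repeatedly
    // claims the next unprocessed pixel via an AtomicInteger, so the
    // load balances itself even if some pixels fit slower than others.
    static double[] run(final float[][] pixels, int nThreads) {
        final double[] result = new double[pixels.length];
        final AtomicInteger next = new AtomicInteger(0);
        Thread[] threads = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            threads[t] = new Thread(() -> {
                for (int i = next.getAndIncrement(); i < pixels.length;
                         i = next.getAndIncrement())
                    result[i] = fitPixel(pixels[i]);
            });
            threads[t].start();
        }
        for (Thread th : threads) {
            try { th.join(); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy data: pixel i has a constant time series of value i,
        // so its "fit" (the mean) is simply i.
        float[][] pixels = new float[1000][10];
        for (int i = 0; i < pixels.length; i++)
            for (int j = 0; j < 10; j++) pixels[i][j] = i;
        double[] r = run(pixels, 4);
        System.out.println(r[0] + " " + r[999]);  // prints 0.0 999.0
    }
}
```

Timing such a run with different hard-coded thread counts, as done in the tables above, is a reasonable way to measure scaling, provided each measurement is repeated a few times to average out JIT warm-up and garbage collection.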
Hi George,
concerning setMaximumThreads(1) and setMaxRestarts(0): I guess that your text should read setMaximumThreads(0) and setMaxRestarts(0)? There is a bug in my Minimizer code, which makes it perform two runs instead of one in that case. I have asked Wayne to fix it.

By the way, I just found out that the CurveFitter has no way to specify the initialParamVariations for the built-in functions (setting the initialParamVariations on the Minimizer won't help; the CurveFitter will override it with its own default values). So please forget my idea of specifying the initialParamVariations (= the typical accuracy of the initial parameters given by the user). Nevertheless, specifying a starting value for the 'b' parameter of the "Exponential with offset" fit, y = a*exp(-bx) + c, should be beneficial for speed (use CurveFitter.setInitialParameters, not the Minimizer method).

If there are many people out there doing fits with functions that need one-dimensional minimization (2nd-order polynomials, and the exp/log/power fits that are not simply done by regression), and speed is a general issue, one might eventually consider modifying the Minimizer to add a different algorithm for better convergence in the one-dimensional case. The current Nelder-Mead simplex algorithm is quite good for general (also badly behaved) few-dimensional problems, but rather inefficient for the final steps of well-behaved one-dimensional minimization problems.

Michael

________________________________________________________________
On 2016-07-18 17:30, George Patterson wrote:
> [...]
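The one-dimensional reduction described above (eliminate a and c by linear regression, then minimize over b alone) can be illustrated with a self-contained toy. This is not ImageJ's CurveFitter/Minimizer code: the class and method names are invented for the example, and a simple golden-section search stands in for the Nelder-Mead simplex.

```java
public class ExpOffsetFit {
    // For a fixed b, fit y = a*exp(-b*x) + c: with z = exp(-b*x) this is
    // ordinary linear regression y = a*z + c, so a and c have closed forms.
    // Returns the resulting sum of squared residuals as a function of b.
    static double sse(double[] x, double[] y, double b) {
        int n = x.length;
        double sz = 0, sy = 0, szz = 0, szy = 0;
        for (int i = 0; i < n; i++) {
            double z = Math.exp(-b * x[i]);
            sz += z; sy += y[i]; szz += z * z; szy += z * y[i];
        }
        double a = (n * szy - sz * sy) / (n * szz - sz * sz);
        double c = (sy - a * sz) / n;
        double s = 0;
        for (int i = 0; i < n; i++) {
            double r = y[i] - (a * Math.exp(-b * x[i]) + c);
            s += r * r;
        }
        return s;
    }

    // One-dimensional golden-section search for the best b on [lo, hi];
    // assumes sse(b) is unimodal on the interval.
    static double fitB(double[] x, double[] y, double lo, double hi) {
        final double g = (Math.sqrt(5) - 1) / 2;  // golden ratio conjugate
        double b1 = hi - g * (hi - lo), b2 = lo + g * (hi - lo);
        double f1 = sse(x, y, b1), f2 = sse(x, y, b2);
        while (hi - lo > 1e-10) {
            if (f1 < f2) {
                hi = b2; b2 = b1; f2 = f1;
                b1 = hi - g * (hi - lo); f1 = sse(x, y, b1);
            } else {
                lo = b1; b1 = b2; f1 = f2;
                b2 = lo + g * (hi - lo); f2 = sse(x, y, b2);
            }
        }
        return (lo + hi) / 2;
    }

    public static void main(String[] args) {
        // Noiseless synthetic data: y = 3*exp(-0.5*x) + 1
        double[] x = new double[20], y = new double[20];
        for (int i = 0; i < 20; i++) {
            x[i] = i * 0.5;
            y[i] = 3 * Math.exp(-0.5 * x[i]) + 1;
        }
        System.out.printf("b = %.6f%n", fitB(x, y, 0.01, 5.0));  // should recover b close to 0.5
    }
}
```

This also shows why a good starting guess for b (e.g. the value from the neighboring pixel) pays off: the whole fit is just a search along one axis, so a tighter starting bracket means fewer function evaluations per pixel.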