Re: Usage data collection (was: Re: ImageJ 2.0.0-rc-11 released)

Posted by Mark Hiner-2 on
URL: http://imagej.273.s1.nabble.com/ImageJ-2-0-0-rc-11-released-tp5009074p5009115.html

Stephan:

>   In addition to that, I would ask you to, in spirit of full Open Source
> and Open Policy make the collected data (the data, not the derived
> statistics) read-accessible to everybody in full and license it under an
> Open Data license

This is a fantastic idea and I wish we had come up with it earlier... I've
filed an issue here: https://github.com/imagej/imagej-usage/issues/2

>I am very sorry to be so negative in this particular aspect.

No need to be sorry for being honest. If we release something it's because
we think it's a good idea, but we are extremely lucky to have this
community already installed and willing to provide feedback.

>Please ask yourself whether you would feel comfortable about your operating
>system reporting back which applications you've used when, where and how
>often

Because you asked...

I'm starting to feel that after 4 years of open source software development
as a public employee, I may be the wrong person to answer this question.
All of my work is open to the public <http://fiji.sc/User:Hinerm>, my
salary is public record <http://host.madison.com/data/uw_salaries/>, and my
GitHub profile is basically my work schedule <https://github.com/Hinerm>. I
shop on Amazon.com and use Google a lot, so I'm fairly certain I've
contributed counts to at least a few database entries, somewhere, and every
software tool I've worked on has been to facilitate the sharing of data
between scientists. So my honest answer is no, I would only feel spied upon
if personally identifying data were actually being stored.

But of course, those are my personal feelings. I appreciate that others do
not feel that way; I would love it if I could both fully understand those
reservations and put them at ease... but those are not necessary
prerequisites for me to see that something should probably change.

I do appreciate the concerns for the database being hacked, and I hope I
made it clear that we do not put identifying data in the database. So even
if the database was hacked, there should not be a way to abuse this data
(another reason why I think making it open source is a cool idea).
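
As a rough illustration of what "no identifying data" could mean for the
stored records, here is a minimal Python sketch. It is not the actual
imagej-usage code; the names (usage, record_use, count_for) and the choice
of key are assumptions made for the example. It only covers what is stored,
and says nothing about what a server or network log might capture
separately.

```python
from collections import Counter

# Hypothetical aggregate store: counts keyed only by (feature, java_version).
# No user name, IP address, or timestamp is part of the key or the value,
# so a per-user question cannot be answered from this table alone.
usage = Counter()


def record_use(feature: str, java_version: str) -> None:
    """Increment the aggregate counter for one run of a feature."""
    usage[(feature, java_version)] += 1


def count_for(feature: str, java_version: str) -> int:
    """Answer 'how many times was <feature> used with <java_version>?'"""
    return usage[(feature, java_version)]


record_use("Bio-Formats", "1.7")
record_use("Bio-Formats", "1.7")
record_use("Bio-Formats", "1.6")
print(count_for("Bio-Formats", "1.7"))
```

With a table shaped like this, "how many times was Bio-Formats used with
Java 7" is a lookup, while "how many times did a given person run
Bio-Formats" has no column to query.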

Gabriel:

>What is the problem in showing a similar dialog and let people know what
>will be going on?

So, this is embarrassing, but I completely forgot about my own commit
<https://github.com/imagej/imagej/commit/49d6bc1f8491a4fe41101f88588895dc851041ad>
describing the usage statistics addition to the ImageJ welcome message that
gets displayed any time a new ImageJ2 is uploaded. I know it's not the same
as a dedicated dialog, but I put that text there because I did want to have
something pop up to inform people of what was going on.

I had mixed feelings about whether the dialog should look like a static
document or a changelog, and it was "easier" to just insert the update at
the time. Now I feel like that was a mistake and it should be a revision
log, with the relevant changes separated out by version with most recent at
the top. I was also thinking the welcome message should be modal
<https://github.com/scijava/scijava-common/issues/115> to ensure users have
to see it... but I don't know if that's too annoying? Anyway, my hope is
that this would be a general solution to raise the visibility of all the
changes we make. (I have been seriously distressed at how many bug reports
we received where people didn't realize they should disable SCIFIO to
restore some lost behavior.)

Gabriel / Preibisch:

You both mention asking for the upload permission at startup because that's
how we did it with SCIFIO, but I just wanted to clarify: would it matter
if permission was requested at upload time instead? I like the idea that
the dialog would be saying "this is happening right now" instead of asking
for an ambiguous pass to upload in the future... but I wasn't sure if I was
missing something that makes asking on startup more desirable? See this
issue for the permissions refactoring discussion:
https://github.com/imagej/imagej-usage/issues/3
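
To make the "ask at upload time" idea concrete, here is a minimal Python
sketch of the decision logic behind a yes / no / don't ask me again prompt.
Everything in it is hypothetical (the Answer and UploadConsent names, and
the `ask` callable standing in for a dialog); it is not how imagej-usage is
implemented. It also assumes that "don't ask me again" means a standing
refusal, which is only one possible reading.

```python
from enum import Enum


class Answer(Enum):
    YES = "yes"
    NO = "no"
    NEVER_ASK = "don't ask me again"


class UploadConsent:
    """Hypothetical helper: asks for permission each time an upload is due."""

    def __init__(self, ask):
        self._ask = ask            # callable returning an Answer (e.g. a dialog)
        self._suppressed = False   # set once the user picks NEVER_ASK

    def may_upload(self) -> bool:
        # Assumption: NEVER_ASK is treated as a standing "no", so the
        # prompt is skipped and nothing is uploaded from then on.
        if self._suppressed:
            return False
        answer = self._ask()
        if answer is Answer.NEVER_ASK:
            self._suppressed = True
            return False
        return answer is Answer.YES


# Simulate four upload attempts with scripted answers instead of a dialog.
answers = iter([Answer.YES, Answer.NO, Answer.NEVER_ASK])
consent = UploadConsent(lambda: next(answers))
results = [consent.may_upload() for _ in range(4)]
print(results)
```

The fourth attempt never reaches the prompt, which is the point of the
third option: the user is asked in the moment, but can stop being asked.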

Thanks to all for bearing with us/me during this feedback period!

- Mark

P.S. Fun fact: last Friday, my biggest concern for our release was that the
statistics uploading would make quitting too slow... ;)

On Mon, Aug 11, 2014 at 5:54 PM, Stephan Saalfeld <
[hidden email]> wrote:

> I fully support Gabriel and Herbie in their strong rejection of this
> sort of data collection and, in particular, the way it aims to dupe the
> clueless, introducing it default on.
>
> Nobody questions that this kind of data is extremely interesting and
> useful.  I strongly believe that every plugin developer drools over
> knowing how much their plugins got used, and when, and where.  This
> lust, however, does by no means justify to actually do it.
>
> I am also pretty sure that your intentions about this procedure are
> perfectly honorable and that you're not planning any evil.  However,
> that does not mean that everybody else does at any time in the future
> and the data will be there, exposed to those hacking into your servers.
>
> Please ask yourself whether you would feel comfortable about your
> operating system reporting back which applications you've used when,
> where and how often.  I would feel spied on.
>
> It would be great if you could consider switching that functionality off
> by default and offer interested users the choice to contribute
> consciously and voluntarily.  That way, you would consciously make the
> decision to gather significantly less data but in the most honorable
> way, withstanding the (understandable) desire to get more faster.  In
> addition to that, I would ask you to, in spirit of full Open Source and
> Open Policy make the collected data (the data, not the derived
> statistics) read-accessible to everybody in full and license it under an
> Open Data license, e.g.
>
> http://opendatacommons.org/licenses/odbl/
>
> or one of the CCs
>
> http://creativecommons.org/licenses/
>
> Everybody can then test whether the data is truly harmless, but I
> actually believe that we may find interesting ways to identify
> individuals by their ImageJ usage patterns.
>
> If you would change the usage data collection policy in this spirit, I
> would consider switching it on, for a while.
>
> I am very sorry to be so negative in this particular aspect.  I
> appreciate a lot the immense amount of work you are spending to make
> ImageJ2 a better analysis tool freely available to the community.
>
> Best regards,
> Stephan
>
>
>
>
>
> On Mon, 2014-08-11 at 22:53 +0100, Gabriel Landini wrote:
> > On Monday 11 Aug 2014 14:18:11 Mark Hiner wrote:
> > > > SCIFIO was opt in, but usage tracking is opt out? It does not make
> > > > sense.
> > >
> > > To be clear, SCIFIO is enabled by default.. you have to uncheck a box
> > > to disable SCIFIO, so it is opt out.
> >
> > Right, but it was impossible to miss as I had to answer the SCIFIO dialog
> > when the update came.
> > What is the problem in showing a similar dialog and let people know what
> > will be going on?
> >
> > > I think there is a difference in the questions "what do you do with the
> > > software" and "what do users do with the software". I don't believe we
> > > will ever ask the former question.
> >
> > Mark, what you or me personally *believe* somebody will ask in the future
> > does not matter. It is the process of getting informed consent on the data
> > collection; IJ2 is assuming and makes it less obvious than it could be.
> >
> > > we can ask:
> > >  "How many times was Bio-Formats used with Java 7"?
> > > we *can not* ask:
> > >  "how many times did Gabriel Landini run Bio-Formats?"
> >
> > Even with my poor knowledge of network traffic I can imagine that it
> > might be trivial to script something using time stamps and ip addresses
> > of the uploading machine as well as plenty of emails also ip addresses
> > from users. Not that I remotely think that the devel team would have the
> > time or inclination to do this, but if we are talking about what is
> > impossible, I suspect it is not. So whether that is potentially
> > identifiable information is probably debatable. If there is then you
> > would be effectively logging in a database their location every hour (!)
> > IJ2 runs. Doesn't that sound a bit creepy?
> >
> > My issue was (and remains) that data collection needs to be fully informed
> > before it takes place, not to be On by default.
> >
> > > >If this happens to be something people want to adhere to, then there is
> > > > nothing to worry about as there will be lots of users opting in when
> > > > given the chance.
> >
> > > I believe this is actually hard to predict.
> >
> > Ask the users in a similar way SCIFIO was done and you will have the answer.
> > Then we would not be having this conversation.
> >
> > > If usage statistics were presented similarly - with a pop up on launch
> > > and an options menu - my expectations for opt-in numbers would be very
> > > low. Not because people don't want to contribute but because we created
> > > a barrier to the process.
> >
> > The issue that does not seem to stick after all this typing is that IJ2
> > should not make that decision for the users. IJ2 is not the owner of the
> > processes happening in a user's computer. You need to ask, not assume,
> > that people will be happy for their computers to contact a database every
> > hour and letting it know they are there and doing this or that.
> >
> > > A more successful alternative might be, when statistics are actually
> > > being uploaded, to display a dialog asking to proceed or not - with
> > > yes/no/don't ask me again options. That sounds promising, but also
> > > potentially annoying or confusing to get that pop up, and we can still
> > > expect statistics reporting to drop.
> >
> > But if the reporting statistics drop, that would have to do. Make
> > estimates instead of collecting all possible data.
> >
> > > So since we are not sending or storing use-specific data, and provided
> > > and publicized the opt-out mechanism, we decided to go with the option
> > > that was un-disruptive at the workflow level and maximized data
> > > collection.
> >
> > Yes, you said that before, and I am sure I am not alone thinking it is not
> > the desirable way of doing it.
> >
> > > Especially given, as you mentioned, that users ultimately need to agree
> > > to communicate with an external server to download these applications
> > > and updates.
> >
> > But there is an obvious difference between the two situations. One is
> > requesting an update. The other is broadcasting to a database.
> >
> > > I hope it's clear that I am not saying we are unwilling to change how
> > > permissions are exposed.. but if we can circumvent that need via
> > > discussion it would certainly be my preference. And if we do end up
> > > making any changes, I would like them to be as minimally damaging to the
> > > quality of the data gathering as possible.
> >
> > It sounds like it is preferable not to ask people about the data
> > collection. That is in my view an error of judgement that can be
> > resolved easily.
> >
> > > To me, there has to be actual user data being exposed to be a matter of
> > > privacy. Can you clarify what you believe to be the concern here?
> >
> > Sure: that the process of collecting usage data is not made clear from the
> > beginning and it should have informed consent before the collection
> > starts.
> >
> > If there are no privacy issues, why is the function to switch it Off
> > called "Privacy"?
> >
> > Regards
> >
> > Gabriel
> >
> > --
> > ImageJ mailing list: http://imagej.nih.gov/ij/list.html
>

--
ImageJ mailing list: http://imagej.nih.gov/ij/list.html