Should CISV get into the data mining business?
The latest edition of The Economist has a special report called "Data, data, everywhere" (the full-text PDF is free to download at the moment!). It's about the fact that things like digital cameras, Walmart records and your footprints on the web create more and more information that can be stored and used for different purposes: private companies build internet tools on free government data (like crime reports), Google built a spell checker from trillions of spelling mistakes made in the search box, and Amazon "knows" what books and movies you may like by mining other customers' data.
Of course, one of the articles also refers to one of my favourite guys, Hans Rosling, and his Trendalyzer software*, which I used to create the CISV bubbles. Yet one of the most interesting paragraphs of the special report was the description of an emerging executive job:
Chief Information Officers (CIOs) have become somewhat more prominent in the executive suite, and a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data.
This is exactly what I've been trying to do with all the "Statistics Nerd" posts here at FTB. But whereas my amateur attempts may have sparked a few ideas, I wonder if CISV should take the issue more seriously. How about building a database that contains much more than just how many camps were hosted by whom in which year? Let's add cancellation data and evaluation data. Let's try to track down costs (of travel and hosting). Finally, how about getting somebody into IO who's good at working all this out to benefit CISV?
Of course, collecting and handling such data will raise privacy and security issues, even legal ones. Already, German parents find it difficult to register their kids at CISV Friends. But I'm sure there are many things we aren't aware of that could be extracted: Maybe some camps in some chapters are way more expensive than others? Or maybe there is a quality trend, such that summer camps for 15-year-olds are much worse than the others? With that information, maybe trainings and programme development could be applied in a more targeted way?
On a different note, the special report also contains the following paragraph:
Best Buy, a retailer, found that 7% of its customers accounted for 43% of its sales, so that it reorganized its stores to concentrate on those customers' needs.
It reminded me that Arne-Christian/NOR, then IFC-chair, told the board in 2002 that 10% of the NAs were hosting 60% of all camps. Should CISV concentrate on those NAs' needs?
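A figure like "10% of the NAs host 60% of all camps" is easy to recompute once the hosting counts per NA sit in one place. Here is a minimal Python sketch, assuming a simple mapping from NA to number of camps hosted. The function name `top_share` and the ten NAs below are made up for illustration and are not real CISV figures.

```python
def top_share(camps_per_na, top_fraction=0.10):
    """Share of all camps hosted by the top `top_fraction` of NAs."""
    # Rank NAs from most to fewest camps hosted.
    ranked = sorted(camps_per_na.values(), reverse=True)
    # Number of NAs in the "top" group (at least one).
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)


# Hypothetical hosting counts for ten invented NAs (NOT real data).
example = {"A": 12, "B": 2, "C": 1, "D": 1, "E": 1,
           "F": 1, "G": 1, "H": 1, "I": 0, "J": 0}

print(f"Top 10% of NAs host {top_share(example):.0%} of camps")
```

Running the same function on each year's real hosting list would show whether the concentration is growing or shrinking.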
*The article calls the software "Gapminder", which is in fact the name of his organization. It shows that the mighty Economist is sometimes a bit sloppy in checking its facts.
We've been talking about a similar issue in the US now for about a year. We need an online database to keep track of members' information. We've got plenty of it on paper but how useful and efficient is that? It needs to be secure too. Every other sizable organization (national or international) has a member database. Part of this not getting done has to do with the fact that we're all volunteers. We need to hire people to get this done not only on a national level but also an international one.
After that I think we should also include what you're talking about with camps and evaluations.
No, it should not.
That's the response to the (probably provocative?) headline, but let's take a look at the arguments:
First, data mining (defined as recognizing patterns in big amounts of data) needs one thing we do not have: big amounts of data. Now, as far as I remember, FTB has always been advocating decentralized organisational patterns, a cut-down on bureaucratic obstacles, and an "intuitive" way of measuring aspects of CISV's success and effectiveness.
Assuming that we collect three values for each delegate (say, travel cost, a "quality" value out of the PDPEF, and a "send again?" yes/no value), CISV Austria would generate around 200 data sets a year. That would probably be an additional 100 hours of work, assuming that the infrastructure for storing this data is readily available, which would mean we'd have to make a board member responsible only for that.
With 3-4 Camps, 1 or 2 Mosaic Projects and 6-10 Interchanges a year, another 100 hours would accumulate for generating the data for our programmes. (Maybe less if all we focus on is what already is in PDPEF, but if we want to include things like costs, sponsor funds, staff work, site evaluation, etc. a lot more data has to be collected and entered.)
So, two more volunteer "positions" in our national Board, just for data collection? That would be more than 10% of our board...
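Spelled out as arithmetic, using only the figures estimated above: a small Python sketch in which the variable names are my own, and the board-size bound is simply what "more than 10%" implies rather than a stated number.

```python
# Figures taken from the estimate above (CISV Austria, per year).
delegate_records_per_year = 200
hours_for_delegate_data = 100
hours_for_programme_data = 100

# Time budget implied per delegate data set.
minutes_per_record = hours_for_delegate_data * 60 / delegate_records_per_year

# Total extra volunteer time per year.
total_hours = hours_for_delegate_data + hours_for_programme_data

# If two positions are "more than 10%" of the board,
# the board must have fewer members than this bound.
new_positions = 2
board_size_upper_bound = new_positions / 0.10

print(minutes_per_record, total_hours, board_size_upper_bound)
```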
Secondly, assuming we really start collecting this data: how do we handle it? I mean: what questions do we ask that we cannot answer yet, and how would they impact our work? Say for example that camps in NA X turn out to be very expensive in terms of travel costs, whereas NA M's camps are not only cheaper, but also have the same quality. Does that mean NA X doesn't get to host that camp any more? Would we actually look into _consequences_ when discovering things like that? And in how many cases would the answer be something plain obvious (that flights overseas are more expensive than intracontinental flights)?
Thirdly: yet another IO staff? A paid (and expensively so), full time data miner? .....
And on the last question: Should CISV focus more on the needs of those 10% (i.e. 6 NAs) that host 60% of the camps? Definitely not. They're obviously getting along well in the World Of CISV. Focus on the 90% that host the rest, and ask what you can do to improve this ratio :-) After all, how much more do you want to "sell" to those 6 NAs?
CISV does, in fact, generate a ton of "data." There is also the potential for the organization to generate even more useful "data." Of course, all of this depends greatly on the need for such information. Here are a few ideas I had where a data mining person, contract or software could be of use:
Membership: As Katie said above, a database of strictly CISV members/participants would be highly useful in the retention of volunteers and potential financial development of the organization. This is, in a sense, what Friends is for, although I think its implementation hasn't been very well put into place. CISV USA's current "database" (if one can call it that) is made up of several boxes of microfilm (seriously!) dating back several decades. As far as I know, the NA still pays a company to transfer YLIF/ALIF & other forms to microfiche. NOTHING is digitized. This is a waste of money and resources, and a bear when it comes to searching for past participants.
Imagine a secured online database which stores participant information, forms, and more. I've been witness to the complete digital transformation of contact and admissions data as an assistant in my university's graduate school, all thanks to two pieces of web-based software. From those two systems, my coworkers and I are able to search for specific contacts, send "blast" e-mails to groups of prospective students, track web information, and more. Of course, access to this software comes at a price. I'm not sure of the full price, but I'm guessing it's in the tens of thousands of dollars. Would that be worth it for CISV USA? I'm not sure.
Academic Research: Those in academia love their data. Having a digitized database of information regarding participants and other "of interest to academics" stuff would be useful to those doing research on CISV programs.
Those are just a few ideas...
Hmm... Flo, you may be right that the "return on investment" may not be that high, and that maybe the price tag (like Martin suggests) could be too high. I tend to agree with most of your critical points.
Nevertheless, I still think that we already are accumulating tons of data, with quite some volunteer working hours involved:
- PDPEF data
- participants info
- CISV friends data
- membership lists
- camp budgets
It's just that it's vastly decentralized, and it may be of no use whatsoever if nobody tries to make sense of the big picture.
One last thing: I disagree that FTB has a tradition of "intuitively" trying to measure things. Quite the contrary: I actually did try to crunch the data once in a while to check the validity of certain assumptions, such as "AIMs boost volunteer motivation".
Speaking of interesting uses for data - (and even though it's US only) how cool is this? http://maps.ers.usda.gov/FoodAtlas/
(thanks to Mae Cooper for the link).
I am not sure whether we are talking about the same things here... Of course we should have member databases (CISV Austria has had a pretty system in place since ~2000, handling addresses, participants, participations, finances and stuff - all online and easy to manage), and of course microfiche doesn't seem too pretty in the 21st century; but that's just data: no-one is "mining" into it (yet). The only data internationally available that we could "mine" into (i.e. ask intelligent questions where a (non-obvious) answer could be extracted from the data) is probably the PDPEF data - and we don't really have a stable basis to work on that (yet). Everything else is just too widely distributed between IO, Committees, NAs, Chapters, JBs... to really make anything out of it.
And @Nick, the "intuitive" was not meant to describe how you do the statistics (I believe they are mathematically sound), but rather your choice of indicators (meaning, working with the indicators we have and choosing the "obvious" ones); it was meant to be a compliment on how intuitive and easy it is to understand how the indicators work together. I don't believe this would be possible using the complex mathematical and statistical methods that are in use in data mining...
Flo, I do think that both kinds of "data" go hand in hand: our membership data (at the moment usually at home in obscure Excel spreadsheets) and any kind of evaluation data (until the electronic PDPEF comes up, at home on piles and piles of paper). If we could establish an electronic link between every member and any kind of camp data, the stuff gets interesting.
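To make the "electronic link" a bit more concrete, here is a rough Python sketch using the standard-library `sqlite3` module. All table names, column names and values are hypothetical and only illustrate the idea of joining membership, participation and evaluation records. Members are referenced by an ID rather than by name, since privacy is a real concern here.

```python
import sqlite3

# In-memory database with three hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE member (member_id INTEGER PRIMARY KEY, chapter TEXT);
    CREATE TABLE participation (member_id INTEGER, camp_id TEXT, role TEXT);
    CREATE TABLE evaluation (camp_id TEXT, quality_score REAL);
""")

# Invented sample rows, purely for illustration.
conn.executemany("INSERT INTO member VALUES (?, ?)",
                 [(1, "Chapter A"), (2, "Chapter B")])
conn.executemany("INSERT INTO participation VALUES (?, ?, ?)",
                 [(1, "camp-1", "leader"),
                  (2, "camp-1", "delegate"),
                  (2, "camp-2", "delegate")])
conn.executemany("INSERT INTO evaluation VALUES (?, ?)",
                 [("camp-1", 4.0), ("camp-2", 2.0)])

# Link members to the evaluations of the camps they took part in,
# then average the (hypothetical) quality score per chapter.
rows = conn.execute("""
    SELECT m.chapter, AVG(e.quality_score)
    FROM member m
    JOIN participation p ON p.member_id = m.member_id
    JOIN evaluation e ON e.camp_id = p.camp_id
    GROUP BY m.chapter
    ORDER BY m.chapter
""").fetchall()

for chapter, avg_score in rows:
    print(chapter, avg_score)
```

The point is only that once both kinds of data share a key, questions across them become a single query. What the questions should be is a separate matter.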
I'm still exploring the theme, and I'm unsure whether any effort in this area is useful enough to be worth the privacy risks. However, when I was more involved in IPP, we once discussed the issue of "minimal standards": How bad can a camp be before we agree that the world would be better without it? How many camps have serious quality issues? Maybe this is the kind of question a more elaborate database could answer.
Hmm, I think the problem with judging the usefulness of collecting data is that we can't see whether the collecting and interpreting helps before we do it. Sure, once you have all the data, you could sort it by anything you want. But it also means that you can read much more into it than the data actually holds. You could find answers to all questions, but I'm not sure the results we would get out of the data would be reliable enough to work with.
If I look at the recommendations Amazon gives me, it's maybe 10% (maximum) that really catches my eye (and thereby works for Amazon). Do we really want to spend time/money/energy on interpretations of our data which are not dependable? What if maybe 8 out of 10 interpretations are not correct? It might work for DVDs, CDs, books and flatscreen TVs, but for educational content and humans?
Which brings me to Nick's question "How bad can a camp be before we agree that the world would be better without it?". It's similar to "How shitty does a leader have to be to never be sent away again?". So we're talking about blacklists for camps, camp sites and leaders/staff. Where is the limit of inadequacy? It depends on so many facts, social stuff and feelings; let's sum them up as "cultures". Shouldn't we avoid limits if we want to respect other cultures? Does "blacklisting" mean "being closed-minded to other views"? I don't know, but what I do know is that in the end the attention should be on the welfare of the kids and not on sticking to limits. And what if the interpretation of the data blacklists someone wrongfully?
Conclusion:
Are the interpretations reliable enough to be worth the extra work?
Having thought about this some more, my biggest worry is that CISV just doesn't have enough data:
If Walmart realizes from analyzing the data of 1 zillion shoppers that, let's say, the more expensive chocolate is bought 0.1% more often if it sits on the shelf next to the banana milk, that can turn into a real profit. But when I tried to analyze whether CISV NAs profit from hosting an AIM, I only had very few numbers, and even those were questionable. So I guess it's a question of the reliability and the validity of the numbers, and in the end the statistics may not work out because there aren't enough numbers.
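The sample-size worry can be made concrete with the usual margin of error for a proportion, roughly z * sqrt(p(1-p)/n). The Python sketch below is a deliberate simplification (a proper comparison of two groups would need a two-sample test), and every number in it is an assumption: a 5% baseline purchase rate, an effect of 0.1 percentage points, and sample sizes of 60 versus ten million.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated from n observations."""
    return z * math.sqrt(p * (1 - p) / n)


def is_detectable(p, effect, n):
    """Rough check: is the effect larger than the noise in the estimate?"""
    return effect > margin_of_error(p, n)


baseline = 0.05   # hypothetical baseline rate
effect = 0.001    # hypothetical effect: 0.1 percentage points

# A handful of observations versus supermarket-scale data.
for n in (60, 10_000_000):
    print(n, round(margin_of_error(baseline, n), 5),
          is_detectable(baseline, effect, n))
```

With only a few dozen observations the noise is far larger than any small effect, which is exactly the problem with the AIM numbers.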
Nevertheless, let me draw a conclusion from my own profession: The concept of "Evidence-based Medicine" asks the doctor to combine the best evidence available with the individual patient's history and the doctor's own personal experience. Transferring this concept to CISV, I'd like us to have a bit more evidence when making decisions for the future, and that can only come from trying to collect the best available data.
Let me be a bit more precise: We're always complaining that our programmes are getting more expensive, but we have no clue how much money is spent on travelling. Furthermore, everybody thinks that some programmes are better than others, but nobody has ever compared evaluations. Get the stuff together, and then we can talk!
The PDPEF gets the stuff together. Just add questions like "How much was your flight?" and "When did you book?" in part 1, and the IO (and hopefully, later on, we too) will know how much money is spent on travelling. And (at least regarding educational matters) the PDPEF is already a tool for comparing the programs. The questions in the PDPEF are (highly) debatable, but the PDPEF as a tool exists. Leadoff summaries/evaluations already exist; they just do not compare the programs but show the weaknesses and strengths of each single program.
I stuck to the Interchange PDPEF. For other programs it might be more difficult to evaluate the travel costs. Sorry for the confusion.
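If those two questions were added, summarising the answers would take only a few lines. Here is a hedged Python sketch: the field names, the `summarise` helper and all sample answers are invented for illustration and say nothing about how the actual PDPEF stores its data.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical answers to "How much was your flight?" and "When did you book?"
answers = [
    {"programme": "Interchange", "flight_cost": 300.0,
     "booked": date(2010, 6, 1), "travel": date(2010, 7, 1)},
    {"programme": "Interchange", "flight_cost": 500.0,
     "booked": date(2010, 5, 2), "travel": date(2010, 7, 1)},
]


def summarise(answers):
    """Mean flight cost and mean booking lead time (in days) per programme."""
    by_programme = defaultdict(list)
    for answer in answers:
        by_programme[answer["programme"]].append(answer)
    return {
        programme: {
            "mean_cost": mean(a["flight_cost"] for a in rows),
            "mean_days_ahead": mean((a["travel"] - a["booked"]).days for a in rows),
        }
        for programme, rows in by_programme.items()
    }


summary = summarise(answers)
print(summary)
```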
Electronic PDPEF is not far away:
http://forms.cisv.org/test/pdpef/default.aspx
You need a Friends login, and you must have been a Programme Staff on a Camp program the last 10 years to access it.
Feedback is very welcome!
My comment is outdated, but I absolutely agree with Nick that there is not enough data in CISV, and I would be completely 'for' having a professional in charge of it for us. We make too many assumptions without having the evidence to back them up. I see three key reasons for having someone in charge of data: Education, Marketing and Organisational Development (not intentionally, but quite hilariously, the 3 strategic priorities).
Education: we have started with the PDPEF data, and this is already suggesting ways that we can improve programmes, I look forward to seeing how this data will be interpreted in the future.
Marketing: If we improved the communications to our members, and created tools (better than Friends) to communicate with our members centrally (as most NAs/PAs don't have the capacity to develop that themselves), then perhaps we could target different groups in different ways.
Organisational development: The concept of a chapter evaluation tool is a fantastic idea: a lot of useful data could be drawn from that to self improve and to inform ODC.
Conclusion: I absolutely love data. I am also a visual person, and nothing speaks better to a group of people than a nice graph. More please.