Harnessing the Power of the Crowd to Increase Capacity for Data Science in the Social Sector

We present three case studies of organizations using a data science competition to answer a pressing question. The first is in education where a nonprofit that creates smart school budgets wanted to automatically tag budget line items. The second is …

Authors: Peter Bull, Isaac Slavitt, Greg Lipstein

Peter Bull, PETER@DRIVENDATA.ORG
DrivenData, Cambridge, MA USA

Isaac Slavitt, ISAAC@DRIVENDATA.ORG
DrivenData, Cambridge, MA USA

Greg Lipstein, GREG@DRIVENDATA.ORG
DrivenData, Cambridge, MA USA

2016 ICML Workshop on #Data4Good: Machine Learning in Social Good Applications, New York, NY, USA. Copyright by the author(s).

Abstract

We present three case studies of organizations using a data science competition to answer a pressing question. The first is in education, where a nonprofit that creates smart school budgets wanted to automatically tag budget line items. The second is in public health, where a low-cost, nonprofit women's health care provider wanted to understand the effect of demographic and behavioral questions on predicting which services a woman would need. The third and final example is in government innovation: using online restaurant reviews from Yelp, competitors built models to forecast which restaurants were most likely to have hygiene violations when visited by health inspectors.[1] Finally, we reflect on the unique benefits of the open, public competition model.

[1] These competitions were run on the DrivenData competition platform (www.drivendata.org); DrivenData employs the authors of this paper.

1. Introduction

If your goal is to change the future, it helps to have good predictions about what that future looks like. And there are many groups interested in changing the future. Amazon wants to increase the number of products you order, so they predict which ones you might want to buy next and recommend them (Schafer et al., 1999). Twitter wants to boost your use of their platform, so they predict which tweets you will ignore and which you will engage with (Lin & Kolcz, 2012). Facebook and Google want to increase the number of ads you click on their sites, so they predict your personal click-through behavior (McMahan et al., 2013). These companies have become extraordinarily skilled at making these predictions.

Figure 1. The best predictions (lower is better) over time in a data science competition. Different color dots represent different submitters.

But there are many other reasons to want to change the future. Educators want to increase the number of students graduating high school. Health workers want to improve the overall health of a population at a sustainable cost. Microlenders want to give more individuals in the developing world a chance to pursue their dreams without incurring default. Conservationists want to curb our energy usage without hampering productivity. Governments want to prevent fires from destroying lives and property.

We have the computational power and methods to tackle many of these challenges. However, the data scientists who can leverage these resources are hard to find and expensive to hire. McKinsey has estimated that in 2018 there will be 190,000 analytics positions that go unfilled in the United States (Manyika et al., 2011). If there is that kind of shortfall in the commercial sector, we expect the social sector will lag even further behind.
In 2014, the median salary for a data scientist was $98,000, which puts qualified data scientists out of reach as full-time hires at many nonprofits (King & Magoulas, 2014).

In the last ten years, open innovation has been recognized as a cost-effective way to generate creative, high-quality solutions to problems (Boudreau & Lakhani, 2013). Data science competitions, where practitioners, researchers, and students publicly compete to make the best predictions, offer a way to harness the power of open innovation for data analysis.

Competitions have great visibility among both data science practitioners and other nonprofits working in similar issue areas. Participating in a competition spurs data scientists to think more about applicable problems in the social sector, which is especially important at a time when creative applications of new data techniques can greatly improve how organizations operate. These competitions also help start the conversation among nonprofits about how machine learning tools can be used on their own data.

2. Case Study: “Box-plots for Education”

2.1. Context

Education Resource Strategies (ERS) was founded as a non-profit consulting firm in 2004, with a primary goal of helping public school districts use their limited resources more strategically. However, this goal is often easier said than done. Unlike companies, which benefit from comparing themselves to their peers, districts frequently have no reliable ways to compare their spending to other districts. Even if expenditures are public, the comparisons are not apples-to-apples because of a lack of standardized reporting structures. As a result, district decision makers are often left in the dark.
ERS attempts to solve this problem by working with districts’ budgets to assign every line item to certain categories in a comprehensive financial spending framework. If this process is completed correctly, ERS can offer cross-district insight into a partner district’s finances. For example, ERS might observe that a particular district spends more on facilities and maintenance than peer districts. While this is not inherently good or bad, it helps the district to know how its decisions and spending compare to those of its peers.

In order to compare budget or expenditure data across districts, ERS assigns every line item to certain categories in a comprehensive financial spending framework. For instance, some labels describe what the spending “is”: compensation, benefits, equipment, property rental, and so on. Other categories describe what the spending “does,” which groups of students benefit, and where the funds come from.

However, categorizing each of these budget line items is extremely labor-intensive, often taking several weeks for an employee to hand-tag each row in Excel. This challenge put a limit on the quality of comparisons, because ERS could only process so many budgets per year.

Table 1. Example text from school district budgets

    PETRO-VEND FUEL AND FLUIDS
    Regional Playoff Hosts
    Capital Assets - Locally Defined Groupings
    FURNITURE AND FIXTURES
    ITEMGH EXTENDED DAY
    Water and Sewage *
    UPPER EARLY INTERVENTION PROGRAM 4-5
    Food Services - Other Costs
    Supp.- Materials

2.2. Challenge

In some senses, this is a classic machine learning problem. ERS had hand-labeled over 450,000 budget line items with the relevant categories. The objective was to train the machine to do the process that they had been doing by hand. This kind of problem is known as supervised machine learning.
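To make the supervised setup concrete, the following is a minimal sketch of learning labels from hand-tagged line items. It is an illustration only, not the method used by ERS or by any competitor: it uses a small multinomial naive Bayes classifier written with the Python standard library, the training texts are borrowed from Table 1, and the label names (`supplies`, `utilities`, `equipment`) are hypothetical stand-ins for the real ERS categories.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a budget line item and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesTagger:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.token_counts[label].update(tokenize(text))
        self.vocab = {tok for c in self.token_counts.values() for tok in c}
        return self

    def predict_proba(self, text):
        """Return a dict mapping each label to its estimated probability."""
        tokens = tokenize(text)
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)  # smoothed likelihood
            log_scores[label] = score
        # Normalize in log space to avoid underflow.
        top = max(log_scores.values())
        exp = {k: math.exp(v - top) for k, v in log_scores.items()}
        z = sum(exp.values())
        return {k: v / z for k, v in exp.items()}

    def predict(self, text):
        probs = self.predict_proba(text)
        return max(probs, key=probs.get)


# Line-item text from Table 1; the labels are HYPOTHETICAL, not ERS categories.
train_texts = [
    "PETRO-VEND FUEL AND FLUIDS",
    "Supp.- Materials",
    "Water and Sewage *",
    "FURNITURE AND FIXTURES",
]
train_labels = ["supplies", "supplies", "utilities", "equipment"]

tagger = NaiveBayesTagger().fit(train_texts, train_labels)
print(tagger.predict("WATER AND SEWAGE FEES"))
```

Returning a probability per label, rather than a single hard label, lets a reviewer focus on the line items where the model is least certain.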
In this competition, competitors were asked to identify the highest-probability labels for each of nine different categories. To do this, they first had to create features from the text of the district budgets. We include some examples of text from school budgets to demonstrate that this is not a standard natural language processing task. In fact, it is very much “unnatural” language processing, given the number and variety of abbreviations and punctuation.

2.3. Results

The winning algorithm came from a competitor who had submitted over 100 times to the competition, a testament to the passion of data scientists working for a cause. The algorithm uses a standard method, logistic regression, but derives its power from its feature engineering: for example, using tri-grams, pairwise interactions, the “hashing trick” for dimensionality reduction, and term frequency-inverse document frequency (tf-idf), among other techniques.

ERS estimates that this algorithm will tag files with over 90% accuracy and will save 75% of the time usually taken to code financial files. At 400 hours per project, this means 300 hours saved per project, or close to 1,000 hours per employee, who typically does three projects per year. Ultimately, this equates to roughly twenty-five weeks of employee time saved!

3. Case Study: “Countable Care”

3.1. Context

Planned Parenthood is the nation’s leading provider and advocate of high-quality, affordable healthcare for women, men, and young people, as well as the nation’s largest provider of sex education. With approximately 700 health centers across the country, Planned Parenthood organizations serve all patients with care and compassion, with respect and without judgment.
Understanding the trends in women’s health care is critical to delivering the expert, quality care that is the hallmark of Planned Parenthood. Planned Parenthood is an innovator in health care delivery, continually looking to find the best ways to expand access to quality, affordable care to everyone who needs it. We want your help to better understand the complex dynamics of health care in order to better serve the needs of those who depend on us.

The goal of this competition is to drive innovation and analysis in the field of population health by predicting which reproductive health care services are accessed by women. The end product of the competition will improve public health with novel predictive analytics as part of our effort to give back to the research and health care community.

3.2. Challenge

Users were given extremely detailed information from multiple years of the Centers for Disease Control and Prevention (CDC) reproductive health and family choice survey, the National Survey of Family Growth (NSFG). The survey “gathers information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and men’s and women’s health” and is widely used by researchers focusing on contraception, family planning, and women’s health in general (Chandra et al., 2005).

In addition to a vast amount of in-depth demographic and personal history information, the NSFG captures fairly detailed family and reproductive histories, as well as health decisions. We challenged users to model deeply latent, non-linear, and non-obvious associations between demographics and personal histories in order to predict these health choices for individual respondents.

This is a particularly interesting challenge given the branching nature of the survey.
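In a branching (skip-logic) survey, a follow-up question is left blank whenever its gating question was not answered affirmatively, so a blank can mean two different things. The sketch below shows one possible way to keep "never asked" distinct from "asked but unanswered" before modeling. It is an assumption-laden illustration, not how the NSFG is coded and not a competitor's approach: the question names, the `gates` mapping, and the sentinel values are all hypothetical.

```python
NOT_ASKED = "not_asked"  # skipped by the survey's branching logic (hypothetical sentinel)
NO_ANSWER = "no_answer"  # asked, but the respondent gave no answer (hypothetical sentinel)


def encode_skip_logic(responses, gates):
    """Replace None answers with a sentinel that records why they are missing.

    responses: dict mapping question name -> answer, or None if blank
    gates:     dict mapping follow-up question -> the question that gates it
    """
    encoded = {}
    for question, answer in responses.items():
        if answer is not None:
            encoded[question] = answer
            continue
        gate = gates.get(question)
        if gate is not None and responses.get(gate) != "yes":
            # The gating question was not answered "yes", so this one was never asked.
            encoded[question] = NOT_ASKED
        else:
            encoded[question] = NO_ANSWER
    return encoded


# Hypothetical respondent and question names, for illustration only.
respondent = {"ever_pregnant": "no", "num_pregnancies": None, "age_group": None}
gates = {"num_pregnancies": "ever_pregnant"}

print(encode_skip_logic(respondent, gates))
```

Keeping the two kinds of blank separate lets a model treat structural gaps as information about the respondent's path through the survey rather than as noise.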
Because respondents were only asked certain questions if they answered affirmatively to a previous question, there was a lot of missing data in the dataset. It was an open challenge to competitors to determine how to treat these missing values and still make effective predictions.

Figure 2. Green pixels are questions that were answered; yellow pixels were gaps in the data where survey questions were unanswered.

3.3. Results

Using robust ensembles of individually sophisticated models, the top performers in this challenge were able to achieve significant predictive lift over both random noise and more naive models. The code and models associated with this challenge were delivered to the Guttmacher Institute for further study.

4. Case Study: “Keeping it Fresh”

4.1. Context

The City of Boston regularly inspects every restaurant to monitor and improve food safety and public health. As in most cities, health inspections are generally random, which can mean time spent on spot checks at clean restaurants that have been following the rules closely, and missed opportunities to improve health and hygiene at places with more pressing food safety issues.

Each year, millions of people cycle through and post Yelp reviews about their experiences at these same restaurants. The information in these reviews has the potential to improve the City’s inspection efforts, and could transform the way inspections are targeted.

A team of Harvard economists and Yelp, with support from the City of Boston, co-sponsored this competition to explore ways to use Yelp review data to improve the inspections process. The City of Boston, a partner in this challenge, was committed to examining ways to integrate the winning algorithm into its day-to-day inspection operations.

Table 2. Example review from over 230k in the Yelp dataset

    {
      "business_id": "CgdK8DiyX9Y4kTKEPi_qgA",
      "type": "review",
      "text": "This is the place I like to go for deli sandwiches (and salads/soups) when in the Financial District. I'm not sure what makes this place stand out from the million other deli sandwich places in the area. Maybe it's the lack of pretentiousness...",
      "date": "2005-12-11",
      "stars": 4,
      "review_id": "zQH071b6x9g1ZHbhJnaNKw",
      "user_id": "NfvN6-zeU0RsD0Q_Sk-DSQ",
      "votes": {"cool": 1, "useful": 1, "funny": 0}
    }

4.2. Challenge

The goal for this competition was to use data from social media to narrow the search for health code violations in Boston. Competitors were given access to historical hygiene violation records from the City of Boston and a massive archive of Yelp’s consumer reviews along with restaurant metadata. Specifically, users were directed to predict the results of an inspection of every restaurant on every day for the time period in question. The challenge is to figure out the words, phrases, patterns in foot traffic, prices, cuisines, and other clues that harness digital exhaust to make city services more effective.

4.3. Results

The competition opened Monday, April 27th 2015 and accepted normal submissions for eight weeks. During this period, users could see how well their predictions performed relative to the holdout data, and compare their performance with other competitors on the public leaderboard.

In this competition, data scientists were not simply trying to make predictions on normal holdout data alone. Rather, they were challenged to actually predict the future. In addition to a normal public leaderboard, this competition featured an evaluation period after normal submissions closed; prior to submissions being closed, competitors made special submissions not on the holdout dataset but on the upcoming six weeks of actual Boston food inspections.

Figure 3. A map of hygiene violations in the City of Boston. Darker circles have more violations historically, while lighter ones have fewer.

By July 7th 2015, normal submissions were closed and users had submitted their final evaluation predictions for the evaluation period. Over the next six weeks, as the City of Boston went about its normal inspection routines, we pulled the violations from the open data portal and evaluated competitors’ performance in real time. At the end of the competition, the results were evaluated by a team of researchers at Harvard University who “estimate that the City of Boston would be 30%-50% more productive using a top-performing algorithm from the tournament . . . [and] are currently testing the winning algorithms’ efficacy in practice, using a field experiment that integrates the winning algorithms into Boston’s process for allocating inspectors.” (Glaeser et al., 2016)

5. Conclusion

It is an exciting time to be working on data-for-good projects. With a bit of creativity, data scientists can create analogies between problems that are being solved in industry and the challenges facing nonprofits, NGOs, and governments. The tools that are used by corporations to improve their operations and bottom lines can just as easily be used to help social impact organizations be more effective and more efficient.

Open innovation provides a new way for nonprofits and governments to access talent that is hard to find and expensive. Experts from around the world can contribute to social impact from wherever they are, whenever they have free time. Nonprofits can be almost guaranteed high-performing algorithms given the sheer number of models explored during a competition. Both groups can learn from and explore new applications of existing techniques to make the world a better place.

References

Boudreau, Kevin J. and Lakhani, Karim R. Using the crowd as an innovation partner. Harvard Business Review, 91(4):60–69, 2013.

Chandra, Anjani, Martinez, Gladys M., Mosher, William D., Abma, Joyce C., and Jones, Jennifer. Fertility, family planning, and reproductive health of US women: Data from the 2002 National Survey of Family Growth. Vital and Health Statistics. Series 23, Data from the National Survey of Family Growth, (25):1–160, 2005.

Glaeser, Edward L., Hillis, Andrew, Kominers, Scott Duke, and Luca, Michael. Crowdsourcing city government: Using tournaments to improve inspection accuracy. Working Paper 22124, National Bureau of Economic Research, March 2016. URL http://www.nber.org/papers/w22124.

Manyika, James, Chui, Michael, Brown, Brad, Bughin, Jacques, Dobbs, Richard, Roxburgh, Charles, and Byers, Angela Hung. Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute, June 2011.

King, John and Magoulas, Roger. 2014 data science salary survey: Tools, trends, what pays (and what doesn’t) for data professionals. November 2014.

Lin, Jimmy and Kolcz, Alek. Large-scale machine learning at Twitter. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 793–804. ACM, 2012.

McMahan, H. Brendan, Holt, Gary, Sculley, D., Young, Michael, Ebner, Dietmar, Grady, Julian, Nie, Lan, Phillips, Todd, Davydov, Eugene, Golovin, Daniel, Chikkerur, Sharat, Liu, Dan, Wattenberg, Martin, Hrafnkelsson, Arnar Mar, Boulos, Tom, and Kubica, Jeremy. Ad click prediction: A view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13, pp. 1222–1230, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2174-7. doi: 10.1145/2487575.2488200. URL http://doi.acm.org/10.1145/2487575.2488200.

Schafer, J. Ben, Konstan, Joseph, and Riedl, John. Recommender systems in e-commerce. In Proceedings of the 1st ACM Conference on Electronic Commerce, EC ’99, pp. 158–166, New York, NY, USA, 1999. ACM. ISBN 1-58113-176-3. doi: 10.1145/336992.337035. URL http://doi.acm.org/10.1145/336992.337035.
