Harnessing the Power of the Crowd to Increase Capacity for Data Science in the Social Sector
We present three case studies of organizations using a data science competition to answer a pressing question. The first is in education where a nonprofit that creates smart school budgets wanted to automatically tag budget line items. The second is …
Authors: Peter Bull, Isaac Slavitt, Greg Lipstein
Peter Bull (peter@drivendata.org), DrivenData, Cambridge, MA, USA
Isaac Slavitt (isaac@drivendata.org), DrivenData, Cambridge, MA, USA
Greg Lipstein (greg@drivendata.org), DrivenData, Cambridge, MA, USA

Abstract

We present three case studies of organizations using a data science competition to answer a pressing question. The first is in education, where a nonprofit that creates smart school budgets wanted to automatically tag budget line items. The second is in public health, where a low-cost, nonprofit women's health care provider wanted to understand the effect of demographic and behavioral questions on predicting which services a woman would need. The third and final example is in government innovation: using online restaurant reviews from Yelp, competitors built models to forecast which restaurants were most likely to have hygiene violations when visited by health inspectors. [1] Finally, we reflect on the unique benefits of the open, public competition model.

[1] These competitions were run on the DrivenData competition platform (www.drivendata.org); DrivenData employs the authors of this paper.

2016 ICML Workshop on #Data4Good: Machine Learning in Social Good Applications, New York, NY, USA. Copyright by the author(s).

Figure 1. The best predictions (lower is better) over time in a data science competition. Different color dots represent different submitters.

1. Introduction

If your goal is to change the future, it helps to have good predictions about what that future looks like. And there are many groups interested in changing the future. Amazon wants to increase the number of products you order, so they predict which ones you might want to buy next and recommend them (Schafer et al., 1999). Twitter wants to boost your use of their platform, so
they predict which tweets you will ignore and which you will engage with (Lin & Kolcz, 2012). Facebook and Google want to increase the number of ads you click on their sites, so they predict your personal click-through behavior (McMahan et al., 2013). These companies have become extraordinarily skilled at making these predictions.

But there are many other reasons to want to change the future. Educators want to increase the number of students graduating high school. Health workers want to improve the overall health of a population at a sustainable cost. Microlenders want to give more individuals in the developing world a chance to pursue their dreams without incurring default. Conservationists want to curb our energy usage without hampering productivity. Governments want to prevent fires from destroying lives and property.

We have the computational power and methods to tackle many of these challenges. However, the data scientists who can leverage these resources are hard to find and expensive to hire. McKinsey has estimated that in 2018 there will be 190,000 analytics positions that go unfilled in the United States (Manyika et al., 2011). If there is that kind of shortfall in the commercial sector, we expect the social sector will lag even further behind.
In 2014, the median salary for a data scientist was $98,000, which puts qualified data scientists out of reach as full-time hires at many nonprofits (King & Magoulas, 2014).

In the last ten years, open innovation has been recognized as a cost-effective way to generate creative, high-quality solutions to problems (Boudreau & Lakhani, 2013). Data science competitions, where practitioners, researchers, and students publicly compete to make the best predictions, offer a way to harness the power of open innovation for data analysis.

Competitions have great visibility among both data science practitioners and other nonprofits working in similar issue areas. Participating in a competition spurs data scientists to think more about applicable problems in the social sector, which is especially important at a time when creative applications of new data techniques can greatly improve how organizations operate. These competitions also help start the conversation among nonprofits about how machine learning tools can be used on their own data.

2. Case Study: "Box-plots for Education"

2.1. Context

Education Resource Strategies (ERS) was founded as a non-profit consulting firm in 2004, with a primary goal of helping public school districts use their limited resources more strategically. However, this goal is often easier said than done. Unlike companies, which benefit from comparing themselves to their peers, districts frequently have no reliable ways to compare their spending to other districts. Even if expenditures are public, the comparisons are not apples-to-apples because of a lack of standardized reporting structures. As a result, district decision makers are often left in the dark.
ERS attempts to solve this problem by working with districts' budgets to assign every line item to certain categories in a comprehensive financial spending framework. If this process is completed correctly, ERS can offer cross-district insight into a partner district's finances. For example, ERS might observe that a particular district spends more on facilities and maintenance than peer districts. While this is not inherently good or bad, it helps the district to know how its decisions and spending compare to those of its peers.

In order to compare budget or expenditure data across districts, ERS assigns every line item to certain categories in a comprehensive financial spending framework. For instance, some labels describe what the spending "is": compensation, benefits, equipment, property rental, and so on. Other categories describe what the spending "does," which groups of students benefit, and where the funds come from.

However, categorizing each of these budget line items is extremely labor-intensive, often taking several weeks for an employee to hand-tag each row in Excel. This challenge put a limit on the quality of comparisons, because ERS could only process so many budgets per year.

Table 1. Example text from school district budgets
  PETRO-VEND FUEL AND FLUIDS
  Regional Playoff Hosts
  Capital Assets - Locally Defined Groupings
  FURNITURE AND FIXTURES
  ITEMGH EXTENDED DAY
  Water and Sewage *
  UPPER EARLY INTERVENTION PROGRAM 4-5
  Food Services - Other Costs
  Supp.- Materials

2.2. Challenge

In some senses, this is a classic machine learning problem. ERS had hand-labeled over 450,000 budget line items with the relevant categories. The objective was to train the machine to do the process that they had been doing by hand. This kind of problem is known as supervised machine learning.
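As a rough illustration of how short, abbreviation-heavy budget text like the Table 1 examples can be turned into fixed-length numeric features for a supervised model, the sketch below builds hashed word n-gram counts in plain Python. It is not the competitors' code: the function names, the bucket count, the tokenization rule, and the use of CRC32 as the hash function are all assumptions made for illustration.

```python
import re
import zlib
from collections import Counter

# Assumed feature-vector length, chosen only for illustration.
N_BUCKETS = 2 ** 10


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, max_n=3):
    """Return all word n-grams of length 1..max_n as space-joined strings."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def hashed_features(text, n_buckets=N_BUCKETS, max_n=3):
    """Map one budget line to sparse {bucket_index: count} features.

    Each n-gram is hashed into one of n_buckets slots, so the feature
    space has a fixed size no matter how many distinct n-grams exist.
    Distinct n-grams may collide in the same bucket.
    """
    counts = Counter()
    for gram in ngrams(tokenize(text), max_n):
        bucket = zlib.crc32(gram.encode("utf-8")) % n_buckets
        counts[bucket] += 1
    return dict(counts)


# Hypothetical usage on a few lines taken from Table 1.
examples = [
    "PETRO-VEND FUEL AND FLUIDS",
    "FURNITURE AND FIXTURES",
    "Food Services - Other Costs",
]
for line in examples:
    print(line, "->", len(hashed_features(line)), "non-zero buckets")
```

Sparse dictionaries like these could then be fed to any classifier that accepts sparse input, with one model (or one output group) per label category. Hashing trades a small risk of collisions for a bounded memory footprint, which matters when the vocabulary of abbreviations is large and irregular.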
In this competition, competitors were asked to identify the highest probability labels for each of nine different categories. To do this, they first had to create features from the text of the district budgets. We include some examples of text from school budgets (Table 1) to demonstrate that this is not a standard natural language processing task. In fact, it is very much "unnatural" language processing given the number and variety of abbreviations and punctuation.

2.3. Results

The winning algorithm came from a competitor who had submitted over 100 times to the competition, a testament to the passion of data scientists working for a cause. The algorithm uses a standard method, logistic regression, but derives its power from its feature engineering: for example, using tri-grams, pairwise interactions, the "hashing trick" for dimensionality reduction, and term frequency-inverse document frequency (tf-idf), among other techniques.

ERS estimates that this algorithm will tag files with over 90% accuracy and will save 75% of the time usually taken to code financial files. At 400 hours per project, this means 300 hours saved per project, or close to 1,000 hours per employee, who typically does three projects per year. Ultimately, this equates to roughly twenty-five weeks of employee time saved!

3. Case Study: "Countable Care"

3.1. Context

Planned Parenthood is the nation's leading provider and advocate of high-quality, affordable healthcare for women, men, and young people, as well as the nation's largest provider of sex education. With approximately 700 health centers across the country, Planned Parenthood organizations serve all patients with care and compassion, with respect and without judgment.
Understanding the trends in women's health care is critical to delivering the expert, quality care that is the hallmark of Planned Parenthood. Planned Parenthood is an innovator in health care delivery, continually looking to find the best ways to expand access to quality, affordable care to everyone who needs it. We want your help to better understand the complex dynamics of health care in order to better serve the needs of those who depend on us.

The goal of this competition is to drive innovation and analysis in the field of population health by predicting which reproductive health care services are accessed by women. The end product of the competition will improve public health with novel predictive analytics as part of our effort to give back to the research and health care community.

3.2. Challenge

Users were given extremely detailed information from multiple years of the Centers for Disease Control and Prevention (CDC) reproductive health and family choice survey, the National Survey of Family Growth (NSFG). The survey "gathers information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and men's and women's health" and is widely used by researchers focusing on contraception, family planning, and women's health in general (Chandra et al., 2005).

In addition to a vast amount of in-depth demographic and personal history information, the NSFG captures fairly detailed family and reproductive histories, as well as health decisions. We challenged users to model deeply latent, non-linear, and non-obvious associations between demographics and personal histories in order to predict these health choices for individual respondents.

This is a particularly interesting challenge given the branching nature of the survey.
Because respondents were only asked certain questions if they answered affirmatively to a previous question, there was a lot of missing data in the dataset. It was an open challenge to competitors to determine how to treat these missing values and still make effective predictions.

Figure 2. Green pixels are questions that were answered; yellow pixels were gaps in the data where survey questions were unanswered.

3.3. Results

Using robust ensembles of individually sophisticated models, the top performers in this challenge were able to achieve significant predictive lift over both random noise and more naive models. The code and models associated with this challenge were delivered to the Guttmacher Institute for further study.

4. Case Study: "Keeping it Fresh"

4.1. Context

The City of Boston regularly inspects every restaurant to monitor and improve food safety and public health. As in most cities, health inspections are generally random, which can increase time spent on spot checks at clean restaurants that have been following the rules closely, and lead to missed opportunities to improve health and hygiene at places with more pressing food safety issues.

Each year, millions of people cycle through and post Yelp reviews about their experiences at these same restaurants. The information in these reviews has the potential to improve the City's inspection efforts, and could transform the way inspections are targeted.

A team of Harvard economists and Yelp, with support from the City of Boston, co-sponsored this competition to explore ways to use Yelp review data to improve the inspections process. The City of Boston, a partner in this challenge, was committed to examining ways to integrate the winning algorithm into its day-to-day inspection operations.

Table 2. Example review from over 230k in the Yelp dataset
  {
    "business_id": "CgdK8DiyX9Y4kTKEPi_qgA",
    "type": "review",
    "text": "This is the place I like to go for deli sandwiches (and salads/soups) when in the Financial District. I'm not sure what makes this place stand out from the million other deli sandwich places in the area. Maybe it's the lack of pretentiousness...",
    "date": "2005-12-11",
    "stars": 4,
    "review_id": "zQH071b6x9g1ZHbhJnaNKw",
    "user_id": "NfvN6-zeU0RsD0Q_Sk-DSQ",
    "votes": {"cool": 1, "useful": 1, "funny": 0}
  }

4.2. Challenge

The goal for this competition was to use data from social media to narrow the search for health code violations in Boston. Competitors were given access to historical hygiene violation records from the City of Boston and a massive archive of Yelp's consumer reviews along with restaurant metadata. Specifically, users were directed to predict the results of an inspection of every restaurant on every day for the time period in question. The challenge is to figure out the words, phrases, patterns in foot traffic, prices, cuisines, and other clues that harness digital exhaust to make city services more effective.

4.3. Results

The competition opened Monday, April 27th 2015 and accepted normal submissions for eight weeks. During this period, users could see how well their predictions performed relative to the holdout data, and compare their performance with other competitors on the public leaderboard.

In this competition, data scientists were not simply trying to make predictions on normal holdout data alone. Rather, they were challenged to actually predict the future. In addition to a normal public leaderboard, this competition featured an evaluation period after normal submissions closed; prior to submissions being closed, competitors made special submissions not on the holdout dataset but on the upcoming six weeks of actual Boston food inspections.

Figure 3. A map of hygiene violations in the City of Boston. Darker circles have more violations historically, while lighter ones have fewer.

By July 7th 2015, normal submissions were closed and users had submitted their final evaluation predictions for the evaluation period. Over the next six weeks, as the City of Boston went about its normal inspection routines, we pulled the violations from the open data portal and evaluated competitors' performance in real time. At the end of the competition, the results were evaluated by a team of researchers at Harvard University who "estimate that the City of Boston would be 30%-50% more productive using a top-performing algorithm from the tournament . . . [and] are currently testing the winning algorithms' efficacy in practice, using a field experiment that integrates the winning algorithms into Boston's process for allocating inspectors." (Glaeser et al., 2016)

5. Conclusion

It is an exciting time to be working on data-for-good projects. With a bit of creativity, data scientists can create analogies between problems that are being solved in industry and the challenges facing nonprofits, NGOs, and governments. The tools that are used by corporations to improve their operations and bottom lines can just as easily be used to help social impact organizations be more effective and more efficient.

Open innovation provides a new way for nonprofits and governments to access talent that is hard to find and expensive. Experts from around the world can contribute to social impact from wherever they are, whenever they have free time.
Nonprofits can be almost guaranteed high-performing algorithms given the sheer number of models explored during a competition. Both groups can learn from and explore new applications of existing techniques to make the world a better place.

References

Boudreau, Kevin J. and Lakhani, Karim R. Using the crowd as an innovation partner. Harvard Business Review, 91(4):60–69, 2013.

Chandra, Anjani, Martinez, Gladys M., Mosher, William D., Abma, Joyce C., and Jones, Jennifer. Fertility, family planning, and reproductive health of US women: data from the 2002 National Survey of Family Growth. Vital and Health Statistics. Series 23, Data from the National Survey of Family Growth, (25):1–160, 2005.

Glaeser, Edward L., Hillis, Andrew, Kominers, Scott Duke, and Luca, Michael. Crowdsourcing city government: Using tournaments to improve inspection accuracy. Working Paper 22124, National Bureau of Economic Research, March 2016. URL http://www.nber.org/papers/w22124.

King, John and Magoulas, Roger. 2014 data science salary survey: Tools, trends, what pays (and what doesn't) for data professionals. November 2014.

Lin, Jimmy and Kolcz, Alek. Large-scale machine learning at Twitter. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 793–804. ACM, 2012.

Manyika, James, Chui, Michael, Brown, Brad, Bughin, Jacques, Dobbs, Richard, Roxburgh, Charles, and Byers, Angela Hung. Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute, June 2011.

McMahan, H. Brendan, Holt, Gary, Sculley, D., Young, Michael, Ebner, Dietmar, Grady, Julian, Nie, Lan, Phillips, Todd, Davydov, Eugene, Golovin, Daniel, Chikkerur, Sharat, Liu, Dan, Wattenberg, Martin, Hrafnkelsson, Arnar Mar, Boulos, Tom, and Kubica, Jeremy. Ad click prediction: A view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '13, pp. 1222–1230, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2174-7. doi: 10.1145/2487575.2488200. URL http://doi.acm.org/10.1145/2487575.2488200.

Schafer, J. Ben, Konstan, Joseph, and Riedl, John. Recommender systems in e-commerce. In Proceedings of the 1st ACM Conference on Electronic Commerce, EC '99, pp. 158–166, New York, NY, USA, 1999. ACM. ISBN 1-58113-176-3. doi: 10.1145/336992.337035. URL http://doi.acm.org/10.1145/336992.337035.