Soil Property and Class Maps of the Conterminous US at 100 meter Spatial Resolution based on a Compilation of National Soil Point Observations and Machine Learning
With growing concern for the depletion of soil resources, conventional soil data must be updated to support spatially explicit human-landscape models. Three US soil point datasetswere combined with a stack of over 200 environmental datasets to genera…
Authors: Am, a Ramcharan, Tomislav Hengl
Soil Science Society of America Journal Supplemental material av ailable for this article. Soil Sci. Soc. Am. J. doi:10.2136/sssaj2017.04.0122 Received 21 Apr . 2017. Accepted 2 Oct. 2017. *Corresponding author (amr418@psu.edu). © Soil Science Society of America. T his is an open access article distributed under the CC BY -NC-ND license (http://creativecommons.org/licenses/b y-nc-nd/4.0/) Soil P ropert y and Class Maps of the Conterminous United St ates at 100-Meter Spat ial R esolut ion P edology With gr owing concern for the depletion of soil r esources, conv entional soil maps need to be updated and pro vided at ner and ner resolutions to be able to support spatially explicit human–landscape models. T hree US soil point datasets—the National Cooper ative Soil Sur ve y Characterization Database, the National Soil Information System, and the Rapid Carbon Assessment dataset—wer e combined with a stack of ov er 200 environmental datasets and gSSURGO polygon maps to gener ate complete cov erage grid - ded predictions at 100-m spatial resolution of six soil properties (per centage of organic C, total N, bulk density , pH, and per centage of sand and clay) and two US soil taxonomic classes (291 gr eat groups [GGs] and 78 modied par - ticle size classes [mPSCs]) for the conterminous United States. Models wer e built using par allelized random for est and gradient boosting algorithms as implemented in the r anger and xgboost packages for R. Soil property pr e - dictions wer e generated at se ven standard soil depths (0, 5, 15, 30, 60, 100, and 200 cm). Prediction pr obability maps for US soil taxonomic classica - tions wer e also generated. Cr oss validation r esults indicated an out-of-bag classication accur acy of 60% for GGs and 66% for mPSCs; for soil pr oper - ties, RMSE for leav e-location-out cross-v alidation was 0.74 ( R 2 = 0.68), 17.8 wt% ( R 2 = 0.57), 12 wt% ( R 2 = 0.46), 3.63 wt% ( R 2 = 0.41), 0.2 g cm −3 ( R 2 = 0.42), and 0.27 wt% ( R 2 = 0.39) for pH, per cent sand and clay , weight per centage of organic C, bulk density , and weight per centage of total N, respectiv ely . Nine independent validation datasets wer e used to assess pre - diction accur acies for soil class models, and results r anged between 24 and 58% and between 24 and 93% for GG and mPSC pr ediction accuracies, respectiv ely . Although mapping accuracies w ere v ariable and likely low er than gSSURGO in some areas, this modeling appr oach can enable easier integr ation of soil information with spatially explicit models compared with multicomponent map units . Abbreviations: DEM, digital elev ation model; GG, great group; mPSC, modied particle size class; NASIS, National Soil Information System; NCSS, National Cooperati ve Soil Survey; RaC A, Rapid Carbon Assessment; SOC, soil organic carbon; SSURGO , Soil Survey Geographic database; VIMP , variable importance. T he N atio nal Academ y of Sciences has reported that ch anges in the qual - ity of soil and water resou rces ha ve accelerated du ring the past decades at a rate pr oportional to hu man- induced activities and populat ion growth (N atio nal Research Council, 2001). With t he growing concer n about finite soil resour ces, soil scientis ts are task ed with providing i nforma tion th at can su pport spat ial ly explicit hu man– landscape models that c an aid in preser ving soil resou rces . This infor mat ion nee ds to meet curr ent r e quir emen ts to estim ate and man ag e stor es and fluxes of w ater , carbon, nut rien ts, and solutes in soil. Con ven tional systems fo r mappi ng and classif ying soils were not designed for this purpose (A rroua ys et al ., Amanda Ramchar an* P enn State Univ . University P ark, P A 16802 T omisla v Hengl ISRIC—W orld Soil Information W ageningen, T he Netherlands T ra vis Nauman USGS Southwest Biological Science Center Moab, UT 84532 Colb y Brungard New Mexico State Univ . Las Cruces, NM 88003 Sharon W altman W est Virginia Univ .— GRU Morgantown, WV 26506 Sk ye W ills NRCS Lincoln, NE 68508 James T hompson W est Virginia Univ . Morgantown, WV 26506 Core Ideas • Ensemble machine learning methods wer e used to obtain gridded soil property and class maps. • Final predictions w ere gener ated for six soil properties and tw o soil classes at 100-m resolution. • Soil data are easier to integr ate with spatially explicit models compar ed with multicomponent map units. • Soil property maps are a vailable at sev en standard depths. Published online January 12, 2018 ∆ Soil Science Society of America Journal 2014). I nstead, a gridded product with th re e-dimension al (3D) estim ates of soil prope rties is required. The highest-resolu tion 3D soil database wit h conter minous coverage of the U nite d St ates is the v ector So il Su r vey Ge ographic database (SSURG O) and its 10-m resolu tion gridded equivalent gSS URGO ( https://g dg .sc.egov .usda.gov/). B ecause t his data - base compiles appro ximat ely 3000 independen t soil sur veys vary - ing in scale from 1:12,000 t o 1:125,000 ( Thompson et al., 2012), associated tabular att ributes ar e of different sc ales, and map units comprise diffe rent n umbers of soil componen ts. Within a m ap unit, soil componen ts are not spa tially assig ned, and therefor e spat ial d isagg regation is required to der ive a gridded product at regional or nat ional scales. Od gers et al. (2012) produce d 100-m resolu tion soil property maps fr om the US General Soil M ap (ST A TSG O2) by combining spline fitt ing and weighted ave rag ing per ma pping unit. Al though nominally at hig h spa - tial resolut ion, maps p roduce d by Odgers et al. (2012) re veal a relat ively coarse scale of the ST A TSG O2 (1:250,000). Chaney et al. (2016) recently disag gregated and harmonized SS URG O componen ts within t he SS URGO database to create POL ARI S, a 30-m spa tial resolutio n map of the 50 mos t probable SS URGO componen ts. Linking soil properties to POL ARI S is complex due to low accu racies, and ther efore it rem ains to be determined whethe r creat ing soil property maps from SSURG O componen t predictions is t he optimal pat h to follow . Ack nowledg ing the com plexities of crea ting gridded soil property maps from con vention al soil maps, such as artificial boundaries i n the data associat ed with g eopolitical boundaries, un named soil component s, and variat ion in soil properties in soil componen ts withou t location of these com ponents wit hin the larger map u nit delineation (A rroua ys et al., 2014; Batjes, 2016; Odg ers et al., 2012; Thompson et al., 2010), we ha ve opted here for a point dat a/machine learnin g -driven 3D ap proach to pre - dictive soil ma pping. Machine learning algorith ms, such as re - gression tr ees, Cubist, and random for ests, ha ve been proven use - ful for predictive soil mappin g by He nderson et al. (2004) and Kheir et al. (2010) and then la ter by H engl et al. (2014, 2017), who ha ve developed a f ully autom ated framework for globa l soil property and class mappin g calle d “SoilGrids. ” One limitat ion of the ( globa l) S oilGrids system is t he inability to integrate detailed nat ional data i nto the modeling syste m. In tegrating n ation al soil and en viron ment al datasets may r esult in predictions wit h a hig her accu rac y th an previous SoilGrids results. W e pr opose a machine learning–based framework tha t builds 3D spat ial pred ictio ns from a compr ehensive st ack of en viron ment al datasets available in t he U nite d St ates, including conv entio nal soil maps, clima tic and topographic covariates, and point dat a . W e u se this framework, visualized in Fig. 1, to provide ful l-cove rag e gridded informa tion for the con terminou s U nite d Sta tes, which can be rapidly updated as revised obser vations of soil properties and classes become available. The soil pr opert y maps w e produced include bulk density , pH, perce nt sand, clay , percen t org anic C, and total N at seven st andard soil depth s (0, 5, 15, 30, 60, 100, and 200 cm). Soil class maps i nclude US Soil T axonom y great group (G G) and modified par ticle size classes (mPSCs). M odifie d particle size class is a functio nal descriptor in US Soil T axonom y designe d to combine agricultural and engi - neering soil particle size classifica tion system s (So il Su r vey Staff, 2010) and has been used to identify areas of similar e colog ical potent ial to aid in land management (N auman and Du niway , 2016). W e p rovide model fitting and accu racy a ssess ment res ults and discuss adv antages and d isadvan tages of our approach. All soil maps desc ribe d in t his paper are available u nder the Open Dat abase L icense v1.0 and can be obtained from the P enn St ate U niversity repositor y at h ttps://scholarsphere.psu.edu. MA TERIALS AND METHODS Spatial Prediction with Mac hine Learning F or this study , we used two typ es of models for generati ng predictions: ( i) Fo r soil class–typ e var iables (G G and mPSC), we used the random for est method (Brei man, 2001), as imple - ment ed in the ranger packag e ( W right and Zieg ler , 2017) for R (R Development Cor e T eam, 2009), and ( ii) for soil property nu meric–type variables, we used an en semble of two tree-based machine learnin g methods—random for est (Breim an, 2001) and gradient boosting (H astie et al., 2009)— as impleme nted via the ranger packag e (W rig ht and Ziegler , 2017) and the xgb oost packag e (Chen and G uestr in, 2016), which are both av ailable a s R cont ributed packag es (R Development Co re T eam, 2009). Random forest is a nonparamet ric model base d on patt ern recog nition usi ng similarities amo ng obser vat ions to fit decision tr ees . T o build a t ree, a bo otst rap sam ple (a random sample with replaceme nt) is tak en from observations. T o det ermine a split at a node in a tree, a random subsam ple of predictor variables is tak en to select the predictor th at minimizes the regression e rror . The size of the predictor subsam ple is usually the squ are root of the total nu mber of predictor variables. N odes contin ue to be split unt il no further im prove ment in e rror is achieved. Omitted obser vations, ter med the “ out-of- ba g ” sample, are u sed to compute the r egression err ors for tr ees . Gradien t bo ost ing builds additive r eg ressio n mod - els by sequent ially fitting a parameter ized f unct ion or base learner to cur ren t “pseudo ” residuals by least squar es at each itera tion. The pseudo -resid uals are the gradient of t he curr ent loss fu nction being minimized at each iterat ive step (H astie et al., 2009). F or soil class–type variables, we used a g ene ral spatial pre - diction model of the for m : soil class probabilities ( s ) = f [ X 1 ( s ) + X 2 ( s ) + X 3 ( s )+…+ X p ( s )] where s is two -dimension al space (northing , ea sti ng ), and X va l - ues are covaria tes deriv ed from the stack of e nvir onmen tal da - tasets, which wer e either agg regated from finer- resolut ion (e.g., 30 m) dat a or downsc ale d (u sing cubic spline or av erag ed ) from coarser (e.g., 1 km or 250 m) resolu tion dat a to 100 m. F or soil property numer ic–typ e (3D) var iables, the model was : soil property ( s , d ) = f [ d + X 1 (s,d) + X 2 ( s , d ) + X 3 ( s , d ) +…+ X p ( s , d )] www .soils.org/public ations/sssaj ∆ where s is two- dimensio nal space (northing , easting ), d is depth, and X values are covar iates, including global pre diction s of soil properties (SoilGrids), which we re downsc ale d from 250- m to 100-m resolu tion usin g cubic spline inte rpolation. This method of int eg rat ing globa l soil model predictions into ou r mo del form ulation can be conside red an ensemble pre - diction or model ave rag ing , operati ng under t he assum ption tha t combining realization s of target variables is better th an one realization alone. M ulder et al . (2016) corr ectly recog nized tha t, in man y areas in the wor ld, lo cally produced predictions of soil properties would lik ely be sig nificant ly more accura te than SoilGrids. Be cau se each contr ibutor model has its ow n str eng th and weaknesses, we decided to combine glob al and local predic - tions r ather t han selecting t he single b est -performin g model for a given situat ion or scenar io (the t radition al pursuit). A cav eat of this modeling decision is tha t predictions can be vulnerable to er - ror pro pa gation, res ulting in opt imistic cr oss validatio n results. I f available, the u se of an independen t testin g dataset for validat ion can produce more r obust validation met rics, or an em pirical un - certainty function can be used to scale prediction probabilities for class models, as was done in this study . No model variograms for residu als or kriging of predictions wit h residuals wer e done because some research h as shown this to be un necessar y for vari- ables where cov ariates explain >60% of spa tial variat ion (Lacoste et al., 2016; V aysse and Lag acherie, 2015). The ranger and xg boost packag es were selected for predic - tive modeling because (i) the y ha ve high pred iction accu racies when modeling complex i nteract ions among pr ed ictor v ariables (C utler et al., 2007; H engl et al ., 2017) and (ii ) they can be im - plemen ted in paral lel, so th at predictions can be generated for large volumes of pixels. The nu mber of tr ees g row n in the random for est model was reduced from the default setti ng of 500 to minimize the risk of model overfitt ing . This was also done because previous resear ch showed no significant effect on model ac cu rac y from r educing the n umber of tr ees in the model (Latin ne et al., 2001; Oshiro et al., 2012). After initial testin g , the nu mb er of t rees g ro wn for all mo dels was set to 85. T he train function of t he caret pack - ag e (K uhn and J ohnson, 2013) w as use d to tu ne model param - eters, and fin al parameter selection was based on reducing the RM SE over 100 ite ration s. The parameters tu ned included eta and max_depth, and fin al values were 0.4 and 3 for soil pro pert y models. Eta con tr ols the learning ra te of the forest b y scaling the cont ribut ion of each tree betwe en 0 and 1, w hereas M a x_dept h is the maximu m depth of a t ree. F or soil property mo dels, the xg boost tuning paramet ers γ , colsample_bytree, and min_child_ weight were held con stant a t 0, 0.8, and 1, respectively . Gam ma is the minim um loss reduct ion required to make a par tit ion on a leaf node of a tree, Colsample_bytr ee is the fraction of colu mns to sample from w hen const ructing a t ree, and min_child_weig ht is the minim um nu mber of inst ances set in a termin al node (Chen and G uestr in, 2016). T uning results of m try (the nu mb er of variables randomly sam ple d at each s plit in the random fores t model ) are re por ted in Su pplemen tar y T able S1. The general spat ial pred ictio n framework, described in de - tail in Fig. 1, includes a nu mber of GIS op era tions, such as dow n - scaling , a gg regation of raste rs, spatial ove rlay , fittin g of machine learning models, generat ion of predictions, and da ta export. All Fig. 1. Gener al data processing worko w used to gener ate spatial predictions at 100 m resolution (USA48 SoilGrids100m+). ∆ Soil Science Society of America Journal processing w as implemen ted on a 508-R AM hig h- perform ance ser ver w ith 48 cores and 4 gig abytes of disk space. F or soil prop - erties, model fitti ng and spat ial pred iction (v alues at seven stan - dards dept hs) took appro ximat ely 12 h per property to build models and predict ; soil classes took appro ximat ely 14 h to build models and g ene rate spat ial predictions. Fi nal soil properties modeled and mapped included ( i) soil organic C in % weig ht, sand and cla y in % weight, bulk density of the fine earth fract ion (<2 mm ) in g cm -3 , total N in % weight, and soil pH in 1:1 soil –H 2 O solut ion. The USDA taxonomic soil classes modeled and mapped included mPSCs (78 classes) and soil G G (291 classes). Data F or model training we used point da ta from t hree data sou rces: ( i) The N ational Cooperat ive Soil Sur vey (NC SS) Charact erizat ion Dat abase, ( ii ) The N ation al S oil I nformat ion Syst em (NASIS), and ( iii) The Rapid Carbon Assess ment (RaCA) P roject. National Cooper ativ e Soil Surv ey Char acterization Database The NC SS characte rizat ion database provides soil ch ar - acteriza tion dat a from projects performed by t he Kellog g Soil Su r vey Lab orat or y and cooperatin g laboratories (http://ncss - labdatam art.sc.eg ov . usda .gov/). Beg inning w ith the 2015 NCS S database, 34,183 pedons (213,499 horizons) w ere av ailable for spat ial mo deling. This database was filtered to ret ain only soil samples th at had a minim um of a site locat ion ( W GS84 c oordi - nat es) and at least one soil property obser vation. The degree of completeness for each soil pedon and the available pro pert y and att ribute d ata for each individu al sample varied cros s the dat abase. Most dat a had soil organ ic C, soil tex - tur e (tot al sand , silt, clay con tent), and p H. For the o rg anic C data, a linear regression based on Wills et al. (2013) was applie d to W alkley–Black o rg anic C measure ments t o al ign with tot al combust ion C measureme nts. Da ta prepr ocessing was done to remov e neg ativ e values for soil properties. Com pile d soil sample data con tain varia tions t hat can come fro m several potential er - ror types in addition to the actu al property variation s. These in - clude the level of accur ac y recorded, mea su remen t err or , the dat e at which the sam ple was taken, and i mprecise methods of locat - ing sampling loca tions. These factors can i ntr oduce bia ses in to the models because these differ ences may be confou nded with changes in property or class values. Correctio ns can be made for some factors to reduce this possibility , as was done for organic C , but ove rall there is limited abilit y to isolat e or neg ate t he effects of these factors on de rived models. National Soil Information System Data The USDA–NRCS maintain s a larg e dat abase of soil maps and obser vat ions called NAS IS. W e used NASI S records made prim arily dur ing soil surve y activ ities (separat e from NC SS char - acteriza tion dat a). These obser vat ions are most ly field obser va - tions m ade by scien tists usi ng tran sects to determine SSURG O componen ts and to build map u nits. The NASIS point data ar e not given the same review process as the componen t/map u nit data (SSURG O). M ost hav e not be en tes ted in a laboratory but were desc ribed and classifie d to SSURG O componen ts and US Soil T axonomy classes (Soil Su r vey Staff, 1999, 2014) by exper i - enced local so il scien tists. The most recen t taxonomic classificat ion for each obser va - tion w as queried from NASIS using R (R Development Co re T eam, 2009). Soil great group (G G ; e.g., “ Calciarg ids ”) and par - ticle size classes (PSCs) (e.g., “Loamy” or “Coarse-loamy”) were then ex tracted from t a xo nomic class names. P article size classes are one of the mos t consisten tly defined taxonomic descri ptors in US Soil T a xonom y (Caudle et al., 2013; Duniwa y et al., 2010). These classes generalize the text ure of a soil profile based on a rep - resen tativ e set of depths c al led the cont rol section, which is specif - ic to differe nt groups of soils (Soil Sur vey Staff , 2014). Extract e d PSCs were t hen modifie d (m PS C) as follows to ens ure t hat class - es more fully covered the spectrum of possible soils acr oss the con - ter minous U nite d St ates (N auman and Du niway , 2016). Obser vat ions of rock out crops, which t raditionally hav e no PSC , wer e labele d as such. T o av oid redundancy in the taxonom - ic classificat ion, Psam ment s (sandy soils) were assigned to a san - dy class because these soils ha ve no PSC desig na tion. Soils with a Lithic ( i.e., shallow bedrock contact) taxonomic classifica tion were l abele d with a “Lithic” m PSC. Histosols (organic soils), which tradit ionally have no PSC, were assigned to an “Organ ic” mPSC. These processin g steps result ed in 328,380 G G obser va - tions and 308,291 m P SC obser vat ions th at were u sed a s tr aining data fo r building soil class models (Fig. 2). Rapid C Assessment Dataset The NRCS R aCA project was conducted to capture i nfor - mat ion on the C con tent of soils acr oss the con terminous U n ited Sta tes (Soil Sur vey Staff, 2016). A m ultilevel st ratified random sampling scheme w as created to maximize spat ial sample cover - ag e. The RaCA proje ct consisted of 31,215 pedon obser vat ions with organic C, total N, and bulk density measure ments (w hen possible) in the 0 to 50 pr ofile range. S oil organic C and total N measure ments w ere perfor med ac cording t o standard K ellog g Soil Sur vey Laboratory methods (S oil Su r vey Staff, 2014). The Ra CA dataset w as c ombined with t he NC SS database for pre - dicting soil pr operties. Soil Cov ariates W e used a rang e of ge ospa tial datasets relevan t to soil forma - tion to build a s tack of en viron ment al c ovar iates for predictiv e soil modeling . Each env iron mental covar iate was sour ce d at 100- m resolut ion, or a resampling met hod was applied to convert data t o 100-m resolut ion (e.g., cubicspline was applied to P RISM datasets t o conve rt from 800 to 100 m). Examples of the e nvir on - ment al covariates are sho wn in F ig . 3, and a detailed sum mar y of the dat asets, sources, and resam pling methods is available on Github (https://github.com/aramch aran/US_SoilGrids100m). The en viron men tal covariates incl uded : www .soils.org/public ations/sssaj ∆ • Digital ele vat ion model (DE M)-derived covar iates: T en ter rain attr ibutes we re derived from a con ter minous 100- m DEM for the U nited States (http ://na tionalma p.gov/) using SAGA GIS so war e (Conrad et al., 2015). ese included : elevation, slope gradient, pr ole cur va ture, m ul - tir esolution index of v alle y bottom at ness, su rface roug h - ness, valley depth, negative topographic openness, posi - tive t opog raphic opennes s, Melto n rug ge dness n umber , SAGA wetnes s index, and topo graphic position index. • P RISM cl im ate covaria tes : PRIS M 30-year nor mal climate surfaces cove ring the per iod 1971 to 2010 included precipi - tat ion, mean temper atur e, minimum t emperat ure, m a xi - mu m tempera ture, mean dew p oint t emperat ure, minim um vapor pr essur e decit, and maximum v apor pressu re decit (Daly et al., 2008). Nineteen bioclimatic i ndices (e.g ., an - nual diu rnal range, total precipitatio n in the driest q uarter , ave rag e tem peratur e in the dries t quarter) der ived from PRI SM norm al datasets by O’Don nel and Ignizio (2012) were i ncluded. • MOD IS land products: Satellite imager y dat asets derived from 35 yr of MODIS data included six long-ter m aver - ag ed bimonthly mean and SD enhanced vegetation i ndi - ces, six long-term av erage d bimont hly mean surface reec - tances for MODIS bands four (NIR) and se ven (MIR), 24 long-term a verage d mont hly mean daytime and nighttime surface te mperatu re measur ement s, and 12 long -t erm av - erage d mont hly cloud cover dat asets of t wice-daily remote sensin g –derived cloud obser vat ions (https://mod is-land. g sfc.n asa .g ov). • Landsat 30-m resolut ion products: Landsat (cloud-fr ee) NIR , SWIR bands, γ radiometr ic imag es (H ansen et al., 2013), long-term su rf ace wat er occurre nce (Pek el et a l., 2016), and bare ground im ag es were agg regated from 30 to 100 m using t he aver ag ing alg orit hm in GDAL ( W armer - dam, 2008). • SoilGrids250m : SoilGrids250m (He ng l et al., 2017) soil property maps we re included in the covar iate stack. e original maps we re down scale d to 100 resolu tion usi ng bi - cubic resampling in GDAL ( W arme rdam, 2008). • Addition al envi ronme ntal covaria tes : ese included land cover classes fo r the year 2011 (H omer et al., 2015), US poten tial natu ral veg eta tion (K üchler , 1964), histori - cal natu ral re regime classes (Fir e Sciences Laborator y , Rocky Mou ntain Research, 2000), and aero radiometric grids (Duval et al., 2005). gSSURGO Map Unit Data P aren t mater ial classes (87 classes) and ag greg ated drain - ag e classes (four cl asses) base d on the re presen tativ e soil compo - nent s of gS SURG O ma p units wer e extract ed from gS SURG O . Original parent m ater ial classes were su mmar ized to 87 distinct paren t mate rial kind classes based on S choeneberger (2012). These 87 classes we re nested into nine pare nt ma terial kind grou ps base d on t he g eomorphic descriptio n system of Schoeneberg er (2012). This relat ionship is illust rated in Su pplemen tar y T able S1. F or drainag e classes, one limita tion of the d ata was the v ariatio n in methodolog y for ma pping drainage cla ssifica tions acr oss the U nited States. Give n this limitation, t he eig ht dr ainag e classes ex - tract ed were reclassified to four drainage classes (Suppleme ntary T able S2) to dampen the effect of inconsiste nt classificat ion. P aren t mater ial and drainag e class data were missin g from appr oxima tely 15% of the gSSU RGO map unit s. T o use these covaria tes for spatial pr e diction (w hich requires spat ial ly com - plete covaria tes), random forest models we re built to predict the missing da ta using t he 10 terrain a ttribu tes (see So il Covar iates) and the USGS surficial mat erials map (Soller and Reheis, 2004). P aren t mater ial and drainag e class prediction accur acies were 54 and 79%, respectively . These models were used to gap fill missing data fr om SSU RGO map units. Random checks were made on predictions before m aps wer e used a s model inpu ts; these results were not i ndependen tly peer reviewe d. Accur ac y Assessment Cross-v alidation of Soil Property and Class Models After reviewing modeling res ults and completing m ultiple itera tions of t he model ing p rocess to eliminate obv ious err ors (e.g., incorrect tre nds in results due t o pol itical borde rs), 10- fold leave-loca tion-out cr oss-validat ion was run to assess model accuracy for soil property and ta xo nomic class models . F or this validation sche me, soil samples were s plit into t raining and testi ng datasets. SoilGrids250m covaria tes were r emoved from the c ross- Fig. 2. Maps of points used to tr ain models. Red, NASIS soil class points; blue, NCSS database and RaCA points (combined). ∆ Soil Science Society of America Journal validation t o ensu re the assess ment w as completely independen t of covaria tes produced with NC SS pe don da ta . M odel training used 90% of the dat a , and the r emaining 10% we re used for model testi ng ; therefor e, validation r esults are more con ser vat ive com - pared with models using 100% of t he data to pr oduce maps. Soil samples wer e split into t raining and test da ta such th at horizon obser vat ions from a si ng le pedon were not s plit across tr aining and test dat asets ( leave- location-out sche me). This 90/10 data partitioning sche me was repeated 10 times t o calculate the av er - ag ed performance met rics. F or each of the six nu meric soil pro p - erties, we de rived the R 2 value, mean er ror , mean absolute e rror , and RM SE (K uhn and J ohnson, 2013). All equa tions for c ross- validation met rics are provided in the su pplemen tar y mat erials. Independent V alidation and Uncertainty of Soil Class Models I n addition to cr oss-validat ion, nine regional datasets wer e collated from pedon obser vation s and used to independen tly validate GG and mPSC class predictions. These collated and harmo nize d validat ion datasets, su mm arized in T ab le s 1 and 2, Fig. 3. Examples of cov ariates used for the conterminous United States. (a) Elevation (http://nationalmap.gov/). (b) V ertical distance to channel network (Conr ad et al., 2015). (c) SSURGO parent material (Schoeneberger , 2012). (d) Mean monthly MODIS vegetation index for J anuary and Februar y (https://modis-land.gsfc.nasa.gov). (e) Long-term a ver aged mean monthly surface temper ature (daytime); MODIS, J anuary . (f) Long-term av erage cloud co ver (W ilson and Jetz, 2016). (g) Annual mean diurnal range (O’Donnel and Ignizio, 2012). (h) T otal precipitation in the driest month (O’Donnel and Ignizio, 2012). (i) Aver age vapor pressure decit (Dal y et al., 2008). (j) T ree cov er (Hansen et al., 2013). (k) SoilGrids250m Clay (Hengl et al., 2017). (l) SoilGrids250m Sand (Hengl et al., 2017). www .soils.org/public ations/sssaj ∆ were collected as par t of graduate s tudent t heses or as part of NRC S soil su r vey effor ts bu t had not been included in any of t he databases used in model training (B rungard et al ., 2015; F onnesbeck, 2015; Molstad, 2000; Stu m et a l., 2010). Althou g h coverage of independen t data is not exha ustiv e due to the limited av ailabil ity of soils data, these regional datasets provide an indica tion of modeling perform ance in a range of climat es and landscapes across t he U nite d St ates. V alidation da - tasets included dat a from th ree areas in wester n U tah (Beaver Coun t y , J uab Cou nty , Millard County), sout heastern U tah, north-cent ra l W yoming , southcen tral N ew Mexico, norther n Min nesota, ea ste rn I owa, and Maine. Fi nal soil class predictions and the associated predic - tion pr obabil ities w ere validated. Overall accuracy was use d to validate soil taxono mic class prediction accuracies for t he ent ire validat ion dataset and for individ ual validation da tasets. P rediction probabilities were v alidated using em pirical uncer - tainty function s desig ned to relat e the deter ministic ma ps of G G and m PS C to inde pendent field observations. These fu nc - tions build o n methods from several prior soil class m apping studies showin g positive but nonlinear t rends between rates of independe nt obser vat ion ag reeme nt with t ree-based machine learning pr ed ictio ns and the u nderlying pr ed ictio n probabili - ties produced by these models (N auman and T hompson, 2014; N au man et al., 2014). This process starts by t al lying i nstances where i ndependent field obse r vation s are correctly and incor - rectly predicted by the har dened determinist ic class maps. Then t he instances ar e g rou pe d by in ter vals of the pr ed iction probabilities from t he machine learning model associated with each field obser vat ion. The inte r vals were cr eated to include appro ximately equal nu mbers of field obser vat ions using t he gg plot number functio n in R ( Wick ham, 2016). The nu mber of inte r vals was initially varied in a sensit ivity analysis b etween 10 and 20 to main tain 0.5 to 1% resolut ion in the es tima tes of proportion s of field obser vation s correctly predicted, hereto - fore refe rred to as validation pr obabilities. W e loo k ed for tre nds bet ween median prediction probabili - ties and v al idat ion probabilities of the i nter vals in t he differen t inte r val binning scen arios. As an example, if the median predic - tion pr obabil ity for an inte r val is low (20), we expect the propor - tion of inde pendent observations t hat m atch this class (i.e., the validation p robabilit y) to also be low (and likely lowe r than the median prediction probability). Based on the finding th at ve r y similar positive tr ends wer e found for all nu mbers of inte r vals tested, we looked for the best fittin g regression model among all inter val scen arios using a variety of empir ical link f unct ions (including linear , linear- log , quadra tic, power , exponential, and generalize d additive models) in R . The b est models for both m PSC and G G were chosen based on the hig hest R 2 value. The fitted cu r ves were projected onto the m apped prediction probabilit y su rface to creat e a validation pr obabil ity surface ma p. This validation probability map estim ates how lik ely it is tha t the deter minis - tic ma p is correct at every pixel, which is valuable inform ation for end use rs. I t also provides an idea of over al l model uncertain - ty at a pixel because low dete rministic m ap probabilities indicate more ove rall mo del confusion. RESUL TS Model Fitting Examples of the com plete coverage, g ridded soil property , and class maps r esults are illust rated in Fig. 4. Visual analysis of Fig. 5 shows mapped featu res can be distinguished up to the 1:25,000 scale. I n Fig. 4, examples of spatially explicit predictions of percen t clay and pH ar e shown at 5 c m depth. The probability of occurre nce of coarse loamy mPSC and F rag iudalfs across the U nited States is also shown. The pr obabilit y map of F ragiuda lfs alig ns wit h SS URGO, showing t hat the mos t extensiv e areas of occurre nce of Alfisols with fragipans is the sout hern Missis sippi valley and the southe rn A ppalachian Moun tains (Bockheim and H artemink, 2013). The probabilit y ma p of F ra giudalf s in F ig . 5A also shows the probability of oc cu rrence inc reasing in fores t - ed areas, similar to the dist ributio n of frag ipan soils reported in Gross man and Car lisle (1969). Fig u re 5B shows th at percen t clay at 5 cm de pth is lower i n areas of hig h probability of F rag iudalfs (and vice versa), w hich is in line with findings that fr ag ipans oc - cur pr imarily in t he fine-silty , fine-loamy , and coarse-loamy PSCs (Bockheim and H ar temink, 2013). I n addition to fin al reported so il pro pert y maps, t he data - set colle ctio n includes preliminary maps of M g [cmol(+)/kg ], K [cmol(+)/kg ], and cat ion exchan g e activity [cmol(+)/kg ] th at T able 1. Number of modied particle size class observ ations, great group classes, and obser vation density in eac h independent dataset. Area location Number of observ ations Number of classes Observ ation density (per km 2 ) Iowa 69 3 1.898 Maine 183 14 0.001 North-central Wy oming 51 7 0.029 Northern Minnestota 108 12 0.002 South-central New Mexico 103 8 0.042 Southeastern Utah 356 7 0.005 W estern Utah (Beav er County) 329 13 0.04 W estern Utah (Juab County) 208 16 0.02 W estern Utah (Millard County) 605 10 0.02 T otal 2012 T able 2. Number of great group obser vations, gr eat group classes, and observ ation density in each independent dataset. Area location Number of observ ations Number of classes Observ ation density (per km 2 ) Eastern Iowa 69 4 2.531 Maine 183 12 0.001 Northcentral Wy oming 51 5 0.021 Northern Minnesota 97 13 0.002 South-central New Mexico 103 5 0.026 Southeastern Utah 361 9 0.006 W estern Utah (Beav er County) 326 11 0.034 W estern Utah (Juab County) 204 8 0.01 W estern Utah (Millard County) 605 12 0.024 T otal 1999 ∆ Soil Science Society of America Journal were i nitially tested but not validated in this study . Probability maps fo r each soil G G and mPSC class are also provided. Su mmary results of the model fitti ng are show n in Fig. 6 and 7 and T able 3. In F ig . 6, the variable im portance ( VIMP) indicates the im portance of a predictor variable in a model by measur - ing the i ncrease in prediction er ror whe n a predictor variable is “ noised up ” by randomizatio n (Breim an, 2001). U sing variable importance met rics for model inte rpretat ion can be limiting where t here are grou ps of hig hly correla ted variables. Str obl et al. (2007) showed that cor related var iables are used inter change - ably in decision trees of random for est models, resulting i n less- relevant v ariables receiving boosted importance. Therefor e, we limited using the r esults of the VIMP analysis for model inter - preta tion and rela tion to how soil su r vey is conducted. The top 25 pr ed ictors r eported by the ranger packag e are shown in F ig . 6. Be yond these 25 t op predictors, VIM P was rela - tively equal among t he rem aining predictor variables. The drop of importance of covaria tes is relat ively gradual. Depth was a str ong predictor acros s all so il pro pert y models, indicatin g the str ong correl ation between soil properties and sample de pth in the models. Figure 6 also shows th at the highly c or related cli - mat ic normals and bioclimat ic indicators de rived from PRIS M were t he most frequen t predictor variables for both soil pro p - erty and class models . P arameters of the DEM were the second most frequen t dataset, followed by SSURG O -derived and satel - lite data de rived from MODIS and Landsat. PR ISM data sets, deriv ed from climate-elevation r eg res sions for DEM g rid cells (Daly et al., 2008), correlated more st rong ly ove rall with soil characte ristics t han with sa tellite-derived data, possibly be cause MODI S/Landsat da ta reflect dynamic chan g es in vegetat ion tha t may or m ay not be related to soil properties. Dat a from PRI SM and DEM parameters, both hig hly cor - relat ed pre dictor v ariables, dominated the mos t important pr e - dictors in the soil class models, whe reas SS URG O-derived parent mat erial and drainag e class predictor v ariables were pr esent in t he top 10 predictors only for soil organic C , tot al N, sand fraction, and clay fract ion. The SS URGO -derived variables ap peared not as important as clima tic layers o r DE M der ivativ es for mappin g G G and m PS C classes. F rom a pract ical standpoint, t his result may be due to t he hig hly correla ted predictor variables domina t - ing the VIMP results or t he large number of SSURG O parent mat erial categories used, which resulted in low importance per paren t mate rial categor y . F rom the pers pe ctiv e of soil sur vey , G G taxa are largely informed by climat e, not by parent m ater ial . Accur ac y Assessment Cross-V alidation of Soil Property and Class Models T en -fold cross-v alidation was used to calculate t he perfor - mance met rics of the models for soil properties and classes. W e note th at, given the spat ial cluster ing of the soil samples ( e.g ., hig h sampling de nsity on a gricultur al land and low sampling density in na tional forest s), it is likely the v alidation result s var y spat ial ly , and we mu st assume t here is spa tial autocorr elation of prediction er rors. Mor e robust validat ion results can lik ely be Fig. 4. Examples of soil property and class maps for tw o soil properties and tw o soil classes. (A) P ercent cla y . (B) pH at 5 cm soil depth. (C) P ercent probability of coarse loamy modied particle size class. (D) P ercent pr obability of Fr agiudalfs great group. T he complete dataset collection is av ailable at doi:https://doi.org/10.18113/S1KW2H. www .soils.org/public ations/sssaj ∆ achieved with a testin g dataset col - lected af ter a v al ida tion st rateg y , such as an obje ctiv e probabilit y sam - pling (Brus et al., 2011). The model with t he best performance met rics was soil pH, and t he weakest per - form ance metrics res ulted from the soil textu re models ( percen t sand and clay). F or the soil class models, cross- validation r esulted in an ave r - ag e misclassificat ion rate of 0.34 for the m PSC mo del and 0.40 for the soil G G model, g iving m apping ac - curacies of 66 and 60%, r espectively . W e provide ma pping accuracy sta tistics and an e mpirical unce r - tainty analysis instead of ka ppa sta tistics, which ar e not as usef ul for int erpreta tion and compariso n of classificat ion accuracy (Pon tius and Millones, 2011). Soil property models under estima ted values, as indicated by mean er ror in T able 3. Figure 7 shows t he correla tion plots for soil properties along wit h the line of perfect fit (solid l ine) and t he line of best fit ( least squar e method ; dashed line). Base d on these p H correla tion plot, predictions w ere close to measur ed values over the range of soil propert y values. Bulk density , soil org anic C (SO C), and total N showed over estim ation for low values. The re were both ov er- and unde r-estima ted across the r ang e of values for per cent clay and sand, with both models tending mo re to unde rpredict hig h values. Figure 8 shows a com parison of the soil property spat ial prediction result s with SoilW eb (Beaudette and O’ Ge en, 2009), the online soil su r vey for California base d on SSURG O and ST A TSG O for percen t clay . Ma ps derived from com pletely dif - fere nt modeling appr oaches show similar spatial patt erns. The resolu tion of the s patial prediction result s is hig her (100 m compared with 1 k m), but t he comparison shows t hat our r esults unde restim ated values in the high range c ompar ed with SoilW eb ( in line wit h the results of t he perform ance met - rics and corr elation plots). A nother v al idat ion str ateg y atte mpt - ed was to extract SOC estim ates from t he valu1 table associated with t he gS SURG OFY2016 product. These were a vailable at standar d depth in ter vals, as opposed to S OC concent ration s, which were a vailable at differe nt levels. This comparison seemed less infor mat ive because SO C and bulk density are required, and if maps diffe r it is difficult to determine if it is the SOC or bulk density results t hat differ fr om the SSURG O products because SSU RGO provides ve r y g ene ral g uidance to the ag greg at ion of soil properties. Th us, the compar ison in Fig. 8 was provided for visualizat ion of patter ns or “face v al idity . ” Independent Data V alidation of Soil Class Model The ove rall ac cu rac y of G G prediction s using all indepen - dent v alidation obser va tions w as 37%. The overall mPSC ac - curacy using all independen t validation obse r vation s was 42%. V alidation of GG and mPSC predictions for each independe nt dataset are p resented in F ig . 9 and 10. Grea t g rou p prediction ac - curacy range d between 24 and 58%. Modified par ticle size class prediction accur ac y range d between 24 and 93%. Empirical Uncertainty of Soil Classes Relationships between the en semble model prediction probabilities and validat ion probabilities were q uite str ong . Linear–log function s with 10 probability inter vals fit the GG data best (F ig . 11A ), and 12 int er vals resulted in the best m PSC fit (Fig. 11B). Str ong linear–log rel ationships for bot h probabil - ity distribut ions sup por t the h yp othesis t hat the det erministic probabilities produced by random fores t models hold valuable Fig. 5. Zoomed in 1:25,000 scale maps for (A) per cent probability of F ragiudalfs and (B) per cent cla y ov erlaid on terrain map with tr aining point values displa yed. Map location: Little Black Slough Nature Reserv e, IL. ∆ Soil Science Society of America Journal infor matio n about uncertain t y in fin al predictions but th at the relat ionship to prediction u ncertainty is nonlinear . Overall, mPSC had hig her validat ion probabilities, reflecting the higher overall accuracy stat istics. Relatively si milar spatial pat - ter ns in validatio n probabilities can be seen by comparing t he G G (Fig. 11C) and mPSC (Fig. 11D). Both show better accu racies in the cen tral midweste rn sta tes (N ebraska , Michigan, Iow a , Illinois), portions of T exas, and along the easte rn coastal plain, likely due to the higher density of soil samples in these areas. The decreased slope of the line at higher prediction probabilities (not actu al ly an asymptote) conceptu al ly re presen ts model overfitti ng where t he prediction probability is hig her th an the validatio n probabilit y . Fig. 6. V ariable importance plots per soil variable reported b y the ranger pac kage. DEPTH, sampling (vertical) depth; DRNGSS, dr ainage class according to the gSSURGO; L14 and L00, Landsat cloud-free bands; MCF , MODIS Cloud Fr action images; MOD3, 1-km land surface temperatur es; MOD5, Enhanced V egetation Index images; NED6, digital elev ation model (DEM) deriv atives; PMTGSS, parent material accor ding to the gSSURGO; PRI5, PRISM climatic images; sg, SoilGrids datasets. Complete descriptions of codes are pro vided in the GitHub repositor y (https://github. com/aramc haran/US_SoilGrids100m/blob/master/ Cov ariate_Codes/SoilGrids_USA48_Covs100m.csv). www .soils.org/public ations/sssaj ∆ DISCUSSION Machine Learning and Soil Science This study har nessed the capabilities of a high-perfor mance comput ing en viron ment along with parallel programmed machine learning algorithm s to create soil pr opert y and class m aps with ex - hau stive cove rag e for the con terminou s U nite d St ates ( illust rated in F ig . 6 and 7, respectively). W e work e d within a r elatively new scient ific paradigm where large datasets from a v ariety of sources were i ntegrated with qu antit ative met hods to provide gridded spat ial variation of soil ch aracteris tics. T o illust rate ou r point, we presen t a SSU RGO map u nit in U tah where seve n soil t ypes are ag greg ated in a SSURG O association m ap unit (F ig . 12). This varia tion of soil componen ts within m ap units can limit the usefulness of soils maps for l and managers. Previous research shows t hat t his map u nit has be en im plicated a s a dust e missions sour ce, possibly causing acciden ts on I ntersta te 70 because of heavy grazing and hig hly erodible soils (Flag g et al., 2014). T o manage for that, t he variability in soil textur e within t his map unit is v er y importan t. I n this case, dominant sout hwest wi nds blowing saltat ing sands from t he south west onto cl ayey soils derived from M ancos shales is a likely mecha - nism for dus t mobil izat ion (Miller et al., 2012). Therefor e, a manager mig ht w ant to initially focus on stabilizing t he sandier areas upwi nd so they do not bombard the finer areas wit h saltat - ing sands. H owever , this solution might not occur to a m anag er withou t the spat ial context these m aps provide. T his example demons trat es a worst-case scenario i n areas of the U nited States where det aile d soil ma pping is limited. Our r esults were m ade possible thr oug h collaboration w ith soil and compute r scient ists, ens urin g expert knowledge was cap - tur ed and use d to v al idat e results. The collaboratio n with soil scient ists wit hin g over nme nt ag encies and land grant u niversi - ties en abled us to make m any decisions on how t o prepare t he data, run the an alyses, e valua te err ors and inconsiste ncies in the map p roducts, conduct visu al and quantit ative v alidations of re - sults, and info rm futur e research efforts wit hin this domain. With Fig. 7. Correlation plots, based on cr oss-validation results, for eac h soil property with the line of perfect t (solid line) and the line of best t (least squared method) (dashed line). T able 3. A ver age prediction err or for soil properties based on 10-fold cross-v alidation. V ariable n† Mean Min. Max. R 2 Mean error SD RMSE Mean absolute error Soil organic C 239,526 1.62 0 70.4 0.41 −0.15 4.72 3.63 1.17 % Sand 195,748 34.45 0 99.62 0.57 −0.37 27.1 17.8 13.5 % Clay 195,748 24.12 0.02 96.5 0.46 −0.16 16.4 12.0 7.06 Bulk density 81,541 1.38 0.06 2.6 0.42 0.01 0.27 0.20 0.15 pH 200,870 6.22 2 10.9 0.68 0 1.32 0.74 0.57 T otal N 74,369 0.22 0.1 5.5 0.39 −0.01 0.35 0.27 0.13 † Number of samples used for training. ∆ Soil Science Society of America Journal awar eness of the “ dat a deluge” in many sciences (Roudier et al., 2015), we developed our analysis in an open science framewor k, providin g the necessar y metadat a , dat a , and code in an open- access Gith ub repositor y (https://g ithub .com/aramcharan/US_ SoilGrids100m). Limitations of W orking with Conventional Soil Data Our wo rk atte mpted to address some of t he main chal - lenges of work ing with con ven tional soils dat asets. These include work ing with m ap units com prising m ultiple soil componen ts, managing missing da ta, identif ying t a xo nomic classifications from m ultiple editions of US Soil T axonomy , and a ssessi ng the accuracy of soil maps when i ndependent d ata are limited. Our appr oach to these challenges is described b elow . C urre ntly , the SS URGO database represe nts soil var iabilit y at scales within individu al map unit polygons by assigning composi - tion ( or proportion) of m ultiple SSURG O component s as well a s inclusion s of minor soil component s and nonsoil areas to map u nit ident ifiers (N auman and T hompson, 2014). P revious research h as work e d to disag greg ate m ap units w ith soil–l andscape models (Bui, 2003; de B ruin et a l., 1999). Ch aney et a l. (2016) provide an example of this for t he conter minous U nited States. T o exploit infor matio n represe nted in the SSURG O database, we a pplie d previous resear ch that asse mbled so il pare nt mat erial classes, base d on the re presen tativ e component per centage for SSU RGO, to cre - ate a pare nt mat erial conte rminous ma p ( htt ps ://gdg .sc.eg ov . usda . gov/). The tradeoff in selecting a re presen tativ e soil component for a ma p unit was the loss of i nforma tion cont ained within the map u nit pertaining to additional soil componen ts. W e also d id not hav e the actual locat ion of component s within the m ap unit, which would hav e a var ying effect on models because the nu mb er of componen ts changed and the size of the ma p unit changed. Anothe r limitation of w orkin g with conv ent ional soil data was the classifica tion of GG data using m ultiple editions of US Soil T axonomy . Of the 328,380 obser vatio ns of most recent GG ext racted from the N A SIS database, 6622 had obsolete classifica - tions accor ding to the 12th edition of U S So il T a xo nomy (Soil Su r vey Staff, 2014). T o mak e the most of con ven tional soil dat a , futur e work could include recorr elating obsolete d ata. With mul - tiple editions of GG classes in the tr aining data, there is t he po - ten tial for confusion bet ween ove rlappin g taxonomic concepts, which could reduce model ac cu rac y . I f model accurac y is high, the confusion mat rix results for the GG model cou ld infor m str ateg ies to recorrela te obsolete classes by prov iding insight into which GG classes misclassif y with each othe r . Overall accuracy metrics for the soil taxono mic class models were accept able considerin g the ge ographic extent of t he model. H owever , d iffer ences in validatio n accuracy b etween indepen - dent v alidation dat asets indicate spa tially variable prediction ac - curacy . The lack of a clear tr end bet ween t he number of obse r va - tions, t he nu mber of classes, obser vat ion density ( T ables and 2 and 3), and prediction accur ac y (Fig. 9 and 10) indicates th at differences in v alidation accur ac y betwe en areas is not because of differences in t he independen t validation d ata. I t is difficult to specif y which conditions caused the v ariabilit y in accu rac y metr ics bet ween independe nt validat ion datasets. H owever , it is likely t hat t his variabilit y is a res ult of se ve ral factors, includ - ing the fr equenc y distr ibution of obse r ved soil ta xono mic classes (Brun g ard et al., 2015) and similarity bet ween taxonomic classes (Rossiter et al., 2017). Differe nces in prediction accuracy be - tween areas are also likely a result of differ ences in the st reng th of covariat e–cla ss r elationships. F urther i mprove ment in predictive accuracy results is need - ed and will likely come from model st ratifica tion, the addit ion of regionally important covaria tes (e.g., so il er osion and distur - bance rates) and man ag emen t practices (e.g., irrigation ), and the inclu sion of additional tr aining obser vatio ns at locatio ns of low prediction qu alit y . V ariabilit y in accu rac y met rics (and thu s variability in prediction quality) likely also results fro m a mis - mat ch bet ween grid size predictions (100 m) and t he inheren t Fig. 8. Comparison of SoilW eb Soil Properties (Beaudette and O’Geen, 2009) (a) near Sacramento, C A (inset of a) with US48 SoilGrids100m+ results and (b) for per cent clay . Legends are the same for both data sources. SoilW eb properties ar e derived fr om SSURGO and ST ATSGO. www .soils.org/public ations/sssaj ∆ scale at which soil characte ristics v ar y across differen t landscapes. I n our vis ual review of the prediction s (data not sho wn), it ap peare d tha t for some landscapes 100 m was too coarse to captu re the geog raphic scale of soil G G taxonomic var iabilit y . U sing eastern I owa as an example, the n umber of GGs oc cu rring a t fine scales depends on t he landscape and the pro ximit y of soil featu res to taxono mic breaks . I n this case, ther e are th ree g eolog ic stra ta ex - posed through erosion wit h differen t mineral - og ies and erosio n histories. These pedons also str addle ta x onomic breaks with diag nost ic ho - rizons and soil feat ures fou nd in multi ple taxa withou t larg e differ ences in properties. This hig hlig ht s the difficulties in pr ed icti ng taxo - nomic classes when t he similarities between taxa are not consisten t betwe en higher taxa and across soil landsc apes. I t is possible that soil taxonomic class predictions could im prove if en viro nmen tal covaria tes at resolut ions <100 m wer e used for modeling (e.g ., N au man and Duniwa y , 2016). Howeve r , this would increase comput ation com plexit y for a model of this large an exten t and limit the nu mber of usable covaria tes (e.g., P RISM data should not be downscaled to 30 m where mic roclimate effects incr ease in importance). Empir ical uncertainty functions w ere cr eated for soil class maps t o compare deter ministic maps of GG and mPSC with the independe nt validatio n data. The empiric al uncertainty function predicted validation probabilities t hat re present es tima tes of the independe nt validatio n data ma tch rates t hat are mor e realistic repr esenta tions of local model uncertain t y at each pix el. With the high R 2 values for the models to r escale prediction probabili - ties (0.84 and 0.91 for GG and mPSC , res pe ctiv ely), the unce r - tainty analyses sugg est the p robabilities produced by the models are infor mat ive even though they were lo wer th an expecte d. The disadvan tag e of these an alyses is that the dete rministic m aps used to validate inde - pendent obse r vation s only use the top pr e - dicted class ( T op -1 prediction), r educing the outpu t of the models from a dist ribut ion of soil class probabilities to one pr ed icted class. F or soil class data wit h man y classes and/or class boundaries t hat ar e abstract or ov erlap conceptually , there w ould be a c onside rable loss of info rmat ion and a perceived low model perform ance. This is e vide nt in compari ng the res ults for the validat ion probabilities for the two soil classes ma ppe d. With more soil G G classes (291 classes), the v alidation prob - abilities wer e lower th an the m PS C classes (78 classes). F utu re work can r epeat the u ncertainty analyses with t he top five predicted classes, as is often reported in the m achine learning litera - tur e (Szeg edy et al., 2016). Altern atively , we can compare results with anot her product at similar sc ale ( gSSU RGO or ST A TS GO). The result s of the soil class models demonst rate t he challenges of work ing with con ven tional soil concepts. Ou r results show t hat ther e are categories within U S So il T a xo nomy th at can tak e advan - tage of so il ma pping concepts wit hout much r evision by soil sci - ence experts (m PSC), whereas other cat eg ories requir e extensiv e inpu t by experts to updat e conve ntion al soil data to apply t hese methods (G G). M odifications of USDA PSCs similar to w hat we map pe d (mPSC) ha ve proved to be useful for reg ion al-scale land management (N aum an and Duniway , 2016; Na uman et al., 2017), and our r esults support further developme nt. CONCLUSION This work de monst rated the use of a high-perfor mance comput ing en viron ment and ense mble machine learning met h - ods to produce gridded soil propert y and class ma ps for the Fig. 9. Great Gr oup prediction accur acy b y independent validation dataset. Number of observ ations is the number of great group obser vations in eac h validation dataset. Number of classes is the number of great group classes in eac h validation dataset. Fig. 10. Modied particle size class (mPSC) prediction accur acy b y independent validation dataset. Number of observ ations is the number of mPSC observ ations in each validation dataset. Number of classes is the number of mPSCs in each v alidation dataset. ∆ Soil Science Society of America Journal conte rminous U nited States a t 100 m. The results of t his study provide docu mented es - tim ates and unce rtainties of machine lear ning predictions, given cu rren t available na tional soils and env iron mental da tasets, of six soil properties (percent organic C, total N, bulk density , pH, percen t sand, and clay) at seven standar d depths (0, 5, 15, 30, 60, 100, and 200 cm) and two soil classificat ion schemes (G G and m PS C). By i mplemen ting t he study within an a utoma ted framework, g ridded maps w ith exhau stive cove rag e can be rapidly impr oved (<1 wk of comput ing ) and shared as more dat a be come av ailable and more users test these d ata products. The av erag e R 2 for soil property models was 0.49, and soil class models had an ave rag e classificat ion accuracy of 63%. Inde pendent data w ere used to validate soil class model predictions (1999 point s for G G and 2012 points for m PSC) and to conduct empirical uncertain ty analyses . Ni ne reg ional dat asets were u sed to validate soil class maps, and, al - Fig. 11. Empirical uncertainty analyses r esults. (A) Empirical relationships between pr ediction and validation probabilities for gr eat groups and (B) modied particle size classes. (C) Resulting maps for great groups and (D) modied particle size classes. Gr aphs (A) and (B) include a 1:1 reference line (blac k). GG, great group; mPSC, modied particle size class. Fig. 12. The T en-Mile Cany on region near Moab, UT , w as mapped with large association map units that often aggregate soils with widely v arying management strategies associated with dust emission. T his map shows that in terms of near surface cla y content variation and variable attribution of SSURGO map unit component. www .soils.org/public ations/sssaj ∆ though an evaluation w as not conducted for the conter minous U nited States, we con sider the test c ase areas a step for ward i n the process. This collaborative s tudy attem pted to extend t he curr ent soils data r esources for t he U n ited St ates to meet the needs of ge ographic simulation models depe ndent on s patially explicit, 3D refer ence soil g eog raphic dat asets by creat ing a product t hat is easier to in teg rat e with models compared with m ulticomponen t map u nits. E ven t houg h ma pping accuracy is variable and lik ely lower t han gSSU RGO in some areas, the soil property maps can immediately be used for refine ment/tes ting of h uman l andscape models and can be used by land managers who require spat ial variability of soil character istics within SSURG O ma p units. The dat a , metada ta, covariates, models, and codes can be freely accessed via doi : h ttps://do i.org/10.18113/S1K W2H and from a public Gith ub repositor y (https://g ithub .com/aram - charan/US_SoilGrids100m). SUPPLEMENT AL MA TERIAL Sup plement al informa tion provides d ata on pare nt mat erial and drainage cla sses, model tu ning parameters, and cr oss valida - tion eq uations . A CKNOWLEDGMENTS is project was supported by the Agriculture and F oo d Research I nitiativ e Competit ive Grant 2012-68005-19703 from t he USDA N ation al I nstitu te of F oo d and Agricultur e. e autho rs thank the reviewers fo r their edits to this m anuscript; the USDA– NRC S (specically the US N ation al Co opera tive Soil Survey S oil Charact erizat ion database and the N ational Soil In format ion Syst em), USGS, and NASA for creating the na tional dat asets and providing nu merous GIS layers t hat we re used as input to modeling ; the open source soware developers, es pecia lly the cr eators of the packages rang er (W rig ht and Ziegler , 2017), xg boost (Chen and G uestri n, 2016), and caret (Kuh n and Joh nson, 2013); and the cr eators and maint ainers of GDAL ( W armerdam, 2008) and SAGA GIS (Conrad et al., 2015). ese maps wer e made possible thanks to a generation of soil s ur veyors tha t have i nvested 100+ years to collect all ground observations of soil pro perties and systemat ically document the m. REFERENCES Arrou ays, D., M.G. Gru ndy , A.E . H artemink, J. W . Hem pel , G.B.M. H euvelink, Y . Hon g , et a l. 2014. GlobalSoilMap: T oward a ne-r esolution global g rid of soil properties. Adv . Agron. 125:93–134. doi:10.1016/B978-0-12-800137- 0.00003-0 Batjes, N.H. 2016. H armonized soil property values for broad-scale modelling (WIS E30sec) with es timat es of g lobal soil carbon stocks. G eoderma 269:61–68. doi:10.1016/j.g eoderma.2016.01.034 Beaudette, D ., and A . T . O’Geen. 2009. Soil-web : An online soil su r vey for California, Arizona, and Nevad a . Compu t. Ge osci. 35:2119–2128. doi:10.1016/j.cag eo.2008.10.016 Bockheim, J .G., and A.E . H artemink . 2013. Soils with fragipans in the USA. Cat ena 104:233–242. doi:10.1016/j.catena.2012.11.014 Breim an, L. 2001. R andom forest s. Mach. Learn. 45(1):5–32. doi:10.1023/A :1010933404324 Brungard, C. W ., J.L. Boettinger , M.C. Duniw ay , S.A . Wills, and T .C. Edwar ds. 2015. Machine lear ning for predicting soil classes in t hree semi-arid landscapes. Geo der ma 239:68–83. doi:10.1016/j.g eoderma.2014.09.019 Brus, D.J ., B. Kempen, and G.B.M. H euvelink . 2011. Sampling for v alidation of digital so il ma ps. Eur . J . Soil S ci. 62(3):394–407. doi:10.1111/j.1365- 2389.2011.01364.x Bui, E. 2003. A stra teg y to ll gaps in soil surve y ove r large spatial extents: An example from t he Mu rray -Darling basin of A ustr alia . Ge oderm a 111(01/2003):21–44. Caudle, D ., H. Sanchez, J. DiBenedetto, C. T albot, and M. Karl. 2013. In teragenc y ecolog ical site handbook for rang elands. USDA–NRCS, USDA–FS, and Dep. of I nter ior–Bu reau of Land M anag eme nt, W ashington, D C. Chaney , N. W ., E.F . W ood , A.B. McBra tney , J . W . H empel, T . W . N auman, C. W . Brungard, et al. 2016. P OLAR IS: A 30-meter probabilistic soil series m ap of the cont ig uous U nited States. Geo der ma 274:54–67. doi:10.1016/j. ge oderm a .2016.03.025 Chen, T ., and C. Guest rin. 2016. XGBoost: A scalable tree bo osti ng system. ArXiv e-pri nts. https://arxiv .org/abs/1603.02754 (accessed Mar . 2016). Conrad, O, B. Bechtel, M. Bock, H. Dietrich, E. Fische r, L. Gerlitz, et al. 2015. Syste m for autom ated ge oscient ic analyses (SAGA) v . 2.1. 4. Geosci. Model Dev . 8(7):1991–2007. doi:10.5194/g md-8-1991-2015 Cu tler , D, T .C. Edwards J r ,K.H. Beard,A . C utler ,K . T . Hess,J . Gibson,et al. 2007. Random forests for classicat ion in ecolog y . Eco log y 88(11):2783–2792. doi:10.1890/07-0539.1 Daly , C., M. Halbleib, J.I. S mith, W .P . Gibson, M.K. D ogg ett, G.H. T aylor , et a l. 2008. Physiographically sensitiv e mapping of clima tolog ical temper ature and precipitatio n across the con terminous U nited States. I nt. J. C limatol. 28(15):2031–2064. doi:10.1002/joc.1688 de Bruin, S., W .G. Wielemake r, and M. M o len aar . 1999. For malisation of soil- landscape know led ge throu g h in teractive hier archical disa ggreg ation. Geo der ma 91(1):151–172. doi:10.1016/S0016-7061(99)00004-X Duniway , M.C., B. T . Bestelmeyer, and A. T ug el. 2010. Soil processes and properties tha t distinguish e cologica l sites and st ates. Rang elands 32(6):9–15. doi:10.2111/R angelands-D-10-00090.1 Duval, J .S., J .M. Carson, P .B. Holman, and A.G. Darnley . 2005. T errestr ial radioactivity and gamma-ray exposu re in the U nited States and C anada. T e ch. rept. Open- File Report 2005-1413. USG S, Reston, V A . Fir e Sciences Laborator y , Ro cky M ount ain Research. 2000. Historical natu ral re regimes, V ersion 2000. Rocky M ountain Research, Missoula, MT . Flagg , C.B., J.C. N e, R .L . Reynolds, and J. Belna p. 2014. Spatial and te mporal patter ns of dust emission s (2004–2012) in semi-arid landscapes, southeast ern U tah, USA. Aeo lian Res. 15:31–43. doi:10.1016/j.ae olia .2013.10.002 Fo nnesbeck, B.B. 2015. Dig ital soil mapping u sing landscape str aticat ion for arid rangelands in the Eastern Gr eat Basin, Cent ral U tah. Ph.D. diss., U tah Stat e U niversity, Logan. Grossm an, R .B., and F .J . Carlisle. 1969. F rag ipan soils of the eastern U nited States. Adv . Agron. 21:237–279. doi :10.1016/S0065-2113(08)60099-1 H ansen, M.C., P . V . Pot apov , R . Moore, M. H ancher , S.A . T urubanova, A . T y ukavin a , et al. 2013. High-resolut ion glob al maps of 21st -century forest cover chan g e. Science 342(6160):850–853. doi:10.1126/science.1244693 H astie, T .J., R .J . Tibshirani, and J.J .H. F riedman. 2009. e elements of st atist ical learning : Data mining , infer ence, and prediction. Sprin g er - V erlag , New Yo r k . doi:10.1007/978-0-387-84858-7 He nderson, B.L., E.N. Bui, C.J . Moran, and D .A .p. Simon. 2004. A ustralia-wide predictions of soil propert ies using decision trees. Geo der ma 124(3-4):383– 398. doi:10.1016/j.g eoderma.2004.06.007 He ngl T ., J.M. de J esus, R .A . M acMillan, N.H. Batjes, G.B. Heu velink, E. Ribeiro, et al. 2014. So ilGrids1k m : Global soil informa tion based on automa ted mappin g . PLoS One 9(8):e105992. doi :10.1371/journ al.pone.0105992 He ngl T ., J . Mendes de J esus, G.B. H euvelink, M. Ruiperez Gonzalez, M. Kilibarda, A. Bla gotić, et al. 2017. SoilGrids250m : Global gridde d soil inform ation based on machine learning. PLoS One 12(2):e0169748. doi:10.1371/journal.pone.0169748 Home r , C., J. Dewitz, L. Y ang , S. J in, P . Danielson, G. Xian, et al. 2015. Completion of the 2011 N ational Land Cover Dat abase for the conter minous U nite d Sta tes: R epr esentin g a decade of land cover change informa tion. Photogramm. Eng. Remote Sensing 81(5):345–354. Kheir , R .B., M.H. Greve, P .K . Bøcher , M.B. Greve, R . Larsen, and K . McCloy . 2010. Pr ed ictiv e mapping of soil organic carbon in wet cultiva ted lands using classicatio n-tree based models : e case study of Den mark. J. En viron. Man ag e. 91(5):1150–1160. doi:10.1016/j.jenvman.2010.01.001 Küchler , A . W . 1964. Poten tial natu ral veg etat ion of the conte rminous U nited Stat es. USG S, Reston, V A. Kuh n, M., and K . J ohnson. 2013. Ap plie d predictive modeling. Sprin g er , New Yo r k . doi:10.1007/978-1-4614-6849-3 Lacoste, M., V .L. Mulder , A .C. Richer -de Forges, M.P . Martin, and D . Arrou ays. 2016. E valuati ng large- exten t spatial modeling appr oaches : A case study for ∆ Soil Science Society of America Journal soil depth for F rance. Geo derm a Reg ional 7(2):137–152. doi:10.1016/j. ge odrs.2016.02.006 Latin ne, P ., O. Debeir , and C. Decaestecker . 2001. L imitin g the number of t rees in random forest s. In: J. Kit tler and F . Ro li, editors, M ultiple classier systems. Spri nger, N e w Y ork. p. 178–187. Miller , M.E., M.A . Bowke r, R .L. Reynolds, and H.L . Goldstein. 2012. P ost-re land trea tmen ts and wind erosion –lessons from t he Milford F lat Fir e, U T , USA. Aeo lian Res. 7:29–44. doi :10.1016/j.aeolia .2012.04.001 Molst ad, N.E . 2000. Landscapes, soil morpholog y and switchgrass management and productivity in the C hariton Rive r V alley , Iowa. Ph.D. diss., I owa S tate U niversity, A mes. M ulder , V .L ., M. Lacoste, A.C. Richer-de F orges, M.P . Martin, and D. A rrouays. 2016. N ational ve rsus globa l modelling the 3D distr ibution of soil organic carbon in mainland F rance. Ge oderma 263:16–34. doi:10.1016/j. ge oderm a .2015.08.035 N ational Research Cou ncil. 2001. Grand challenges in enviro nment al sciences. N ational Ac ademy P ress, W a shington, DC. N auman, T . W ., and M.C. Duniway . 2016. e aut omated refere nce toolset : A soil- ge omorphic ecolog ical potential ma tching alg orit hm. Soil Sci. S oc. Am. J . 80(5):1317–1328. doi:10.2136/sssaj2016.05.0151 N auman, T . W ., and J.A. ompson. 2014. Semi-automa ted disa gg regation of conve ntional soil ma ps using know ledg e driven da ta mining and classicat ion trees. Geoderma 213:385–399. doi:10.1016/j.g eoderma.2013.08.024 N auman, T . W ., J.A. ompson, and C. Rasmussen. 2014. Semi-autom ated disag gregation of a conve ntional soil ma p using know ledg e driven da ta mining and random fores ts in the Sonoran Desert, USA. Photogramm. Eng . Remote Sensing 80(4):353–366. doi:10.14358/PER S.80.4.353 N auman, T . W ., M.C. Duniway , M.L. Villarreal, and T .B. Poitras. 2017. Disturbance au tomated refe rence toolset (DAR T ): Assessing pat terns in ecolog ical recover y from ene rg y de velopme nt on the Colorado P lateau. Sci. T otal Environ. 584:476–488. doi:10.1016/j.scitotenv .2017.01.034 Odg ers, N.P ., Z . Lib ohova, and J .A . ompson. 2012. Equal-area s pline functions applied to a legac y soil database to create w eig hted- means maps of soil organic carbon at a cont inental scale. Geoderma 189:153–163. doi:10.1016/j. ge oderm a .2012.05.026 O’Donnel, M.S., and D .A . Ignizio. 2012. Bioclimatic p redictors for supporting ecolog ical applications i n the conter minous U n ited Sta tes. USG S dat a series. USGS , Reston, V AS. Oshiro, T .M., P .S. P erez, and J.A. Baranauskas. 2012. Ho w many t rees in a random forest? In: P . Per ner , editor, M achine learning and da ta mining in patte rn recog nition. Lecture N otes in Computer Science, vol. 7376. S pringer , Berlin, Heidelberg. p. 154–168. P ekel, J .-F ., A . Cottam, N. Gorelick, and A .S. Belward. 2016. H ig h-r esolution mappin g of g lobal surface wate r and its long-term chan g es. N atur e 540(7633):418–422. doi:10.1038/natur e20584 P ontius, R .G., and M. Millones. 2011. Death t o Kappa : Birth of quan tity disag reemen t and allocation disag reemen t for accuracy assessment. I nt. J . Remote Sens. 32(15):4407–4429. doi:10.1080/01431161.2011.552923 R Development Core T eam. 2009. R : A languag e and envi ronmen t for stat istical computing . R F oundatio n for Stat istical Comput ing , Vienna, Au stria. Rossiter , D.G., R . Zeng , and G.-L. Zhang . 2017. Accou nting for t a xo nomic distance in accuracy assessmen t of soil class predictions. Geoderma 292:118– 127. doi:10.1016/j.g eoderma.2017.01.012 Roudier , P ., A . Ritchie, C. Hedley , and D. M edyckyj -Scott. 2015. e rise of inform ation science: A changing landscape for soil science. I n: I OP Confere nce Series: E arth and En viron mental Science, V ol. 25. I OP Publishing , Bristol, UK. doi :10.1088/1755-1315/25/1/012023 Schoeneberg er , P .J. 2012. F ield book for describing and sampling soils, V ersion 3.0. N ational Soil Survey Center , L incoln, NE. Soil Surve y Sta. 1999. Soil taxonom y : A basis system of soil classication for making and i nterpret ing soil surveys . USDA–NRCS, W ashing ton, DC. Soil Surve y St a. 2010. Keys to soil taxonomy , 2010. US Gov . Print. Oce, W ashington, D C Soil Surve y Sta. 2014. K eys to soil taxonomy , 2014 . US Gov . Prin t. Oce, W ashington, D C. Soil Surve y Sta. 2016. Rapid carbon assessme nt (RaCa): Met hodolo g y , sampling , and sum mary. U SDA–NRCS , W a shington, DC. Soller, D .R ., and M.C. Reheis. 2004. Surcial mate rials in the conte rminous U nite d Sta tes. Open-le Report 03-275. USG S, Reston, V A . Str obl, C., A .-L. Boulesteix, A . Zeileis, and T . Hot horn. 2007. Bias in random forest var iable importance measures: Illust rations, sou rces and a solution. BMC Bioinforma tics 8(1):25. doi:10.1186/1471-2105-8-25 Stu m, A .K., J.L. Boettinger , M.A . White, and R .D. Ramsey . 2010. Random forests applied as a soil spatial predictive model in arid U tah. In: J.L. Boettinger , D. W .Howell, A.C. Moore, A.E . H artemink, and S. Kienast-B rown, editors, Digital soil mapping. Springer , New Y ork . p. 179–190. Szegedy, C., V . V anhoucke, S. I oe, J . Shlens, and Z . W ojna . 2016. Rethinking the incept ion architecture for compu ter vision. P roc. IEEE (6)2818–2826. doi:10.1109/CV PR .2016.308 ompson, J . A., T . Prescott, A.C. Moore, J . B ell, D. Kau tz, J. H empel, S. W . W altm an, and C.H. Per r y . 2010. Re gional approach to soil pro pert y map ping using legac y dat a and spatial disag gregation techniques. I n: R . Gilkes, editor , Pr oce edings of the 19th W orld Congress of Soil Science : Soil Solutions for a Changing W orld. 1–6 Aug. 2010, Brisbane, A ustralia. p. 17–20. ompson, J . A., T . W . Na uman, N.P . O dgers, Z . Libohova, and J. W . Hempel. 2012. H armonization of legacy so il ma ps in No rth America: Status, tr ends, and implicat ions for digital so il ma pping eorts. I n : B. Minas ny , B.P . Malone, and A .B. Mc Bratney , e ditors, Digital soil assessment s and be yond: Pr oce edings of the Fih Global W orkshop on Digital So il M apping. CRC Pr ess, Boca R aton, FL. p. 97–102. doi:10.1201/b12728-21 V aysse, K ., and P . Lagacherie. 2015. Ev aluating digital soil mapping a pproaches for mappin g GlobalS oilMa p soil properties from legac y dat a in Langue doc- Roussillon (F rance). G eoderma Regional 4:20–30. doi :10.1016/j. ge odrs.2014.11.003 W armer dam, F ran k. 2008. e g eospatial dat a abstractio n librar y . In: B. Hall and M.G. Leahy , editors, Open source appr oaches in spatial data h andling . Spri nger, N e w Y ork. p. 87–104. Wickh am, H. 2016. g gplot2: Eleg ant graphics for dat a analysis. Springer , New Yo r k . Wills, S., C. Sey bold, J . Chiaretti, C. Sequeira, and L. W est. 2013. Quantif ying tacit kno wledg e about soil organic carb on stocks usin g soil taxa and ocia l soil series descri ptions. Soil Sci. S oc. Am. J . 77(5):1711–1723. doi :10.2136/ sssaj2012.0168 Wilson, A.M., and W . Jetz. 2016. Remotely sensed high-resolut ion g lobal cloud dynamics for predicting ecosystem and biodiversity distr ibutions. PLoS Biol. 14(3):e1002415. doi :10.1371/jour nal.pbio.1002415 W rig ht, M.N., and A. Zieg ler . 2017. R anger: A fa st im plementa tion of random forests for high dimension al data in C++ and R . J. S tat. Sow . 77(1):1–17. doi:10.18637/jss.v077.i01
Original Paper
Loading high-quality paper...
Comments & Academic Discussion
Loading comments...
Leave a Comment