Cyberinfrastructure Exploration Server

Please note: this site is under development.
This site is generated automatically from raw corpora; the information it contains has therefore not been validated.

Internal identifier: 000069 (Pmc/Corpus); previous: 0000689; next: 0000700



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Enabling Big Geoscience Data Analytics with a Cloud-Based, MapReduce-Enabled and Service-Oriented Workflow Framework</title>
<author>
<name sortKey="Li, Zhenlong" sort="Li, Zhenlong" uniqKey="Li Z" first="Zhenlong" last="Li">Zhenlong Li</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>NSF Spatiotemporal Innovation Center, George Mason University, Fairfax, VA, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Yang, Chaowei" sort="Yang, Chaowei" uniqKey="Yang C" first="Chaowei" last="Yang">Chaowei Yang</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>NSF Spatiotemporal Innovation Center, George Mason University, Fairfax, VA, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Jin, Baoxuan" sort="Jin, Baoxuan" uniqKey="Jin B" first="Baoxuan" last="Jin">Baoxuan Jin</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>NSF Spatiotemporal Innovation Center, George Mason University, Fairfax, VA, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff002">
<addr-line>Yunnan Provincial Geomatics Center, Yunnan Bureau of Surveying, Mapping, and GeoInformation, Kunming, Yunnan, China</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Yu, Manzhu" sort="Yu, Manzhu" uniqKey="Yu M" first="Manzhu" last="Yu">Manzhu Yu</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>NSF Spatiotemporal Innovation Center, George Mason University, Fairfax, VA, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Liu, Kai" sort="Liu, Kai" uniqKey="Liu K" first="Kai" last="Liu">Kai Liu</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>NSF Spatiotemporal Innovation Center, George Mason University, Fairfax, VA, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Sun, Min" sort="Sun, Min" uniqKey="Sun M" first="Min" last="Sun">Min Sun</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>NSF Spatiotemporal Innovation Center, George Mason University, Fairfax, VA, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Zhan, Matthew" sort="Zhan, Matthew" uniqKey="Zhan M" first="Matthew" last="Zhan">Matthew Zhan</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>NSF Spatiotemporal Innovation Center, George Mason University, Fairfax, VA, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff003">
<addr-line>Department of Computer Science, University of Texas—Austin, Austin, Texas, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">25742012</idno>
<idno type="pmc">4351198</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4351198</idno>
<idno type="RBID">PMC:4351198</idno>
<idno type="doi">10.1371/journal.pone.0116781</idno>
<date when="2015">2015</date>
<idno type="wicri:Area/Pmc/Corpus">000069</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Enabling Big Geoscience Data Analytics with a Cloud-Based, MapReduce-Enabled and Service-Oriented Workflow Framework</title>
<author>
<name sortKey="Li, Zhenlong" sort="Li, Zhenlong" uniqKey="Li Z" first="Zhenlong" last="Li">Zhenlong Li</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>NSF Spatiotemporal Innovation Center, George Mason University, Fairfax, VA, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Yang, Chaowei" sort="Yang, Chaowei" uniqKey="Yang C" first="Chaowei" last="Yang">Chaowei Yang</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>NSF Spatiotemporal Innovation Center, George Mason University, Fairfax, VA, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Jin, Baoxuan" sort="Jin, Baoxuan" uniqKey="Jin B" first="Baoxuan" last="Jin">Baoxuan Jin</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>NSF Spatiotemporal Innovation Center, George Mason University, Fairfax, VA, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff002">
<addr-line>Yunnan Provincial Geomatics Center, Yunnan Bureau of Surveying, Mapping, and GeoInformation, Kunming, Yunnan, China</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Yu, Manzhu" sort="Yu, Manzhu" uniqKey="Yu M" first="Manzhu" last="Yu">Manzhu Yu</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>NSF Spatiotemporal Innovation Center, George Mason University, Fairfax, VA, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Liu, Kai" sort="Liu, Kai" uniqKey="Liu K" first="Kai" last="Liu">Kai Liu</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>NSF Spatiotemporal Innovation Center, George Mason University, Fairfax, VA, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Sun, Min" sort="Sun, Min" uniqKey="Sun M" first="Min" last="Sun">Min Sun</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>NSF Spatiotemporal Innovation Center, George Mason University, Fairfax, VA, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Zhan, Matthew" sort="Zhan, Matthew" uniqKey="Zhan M" first="Matthew" last="Zhan">Matthew Zhan</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>NSF Spatiotemporal Innovation Center, George Mason University, Fairfax, VA, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff003">
<addr-line>Department of Computer Science, University of Texas—Austin, Austin, Texas, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PLoS ONE</title>
<idno type="eISSN">1932-6203</idno>
<imprint>
<date when="2015">2015</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>Geoscience observations and model simulations are generating vast amounts of multi-dimensional data. Effectively analyzing these data is essential for geoscience studies. However, the task is challenging for geoscientists because processing the massive amounts of data is both computing- and data-intensive, and the analytics require complex procedures and multiple tools. To tackle these challenges, a scientific workflow framework is proposed for big geoscience data analytics. The framework leverages cloud computing, MapReduce, and Service Oriented Architecture (SOA). Specifically, HBase is adopted for storing and managing big geoscience data across distributed computers; a MapReduce-based algorithm framework is developed to support parallel processing of geoscience data; and a service-oriented workflow architecture is built to support on-demand complex data analytics in the cloud environment. A proof-of-concept prototype tests the performance of the framework. Results show that this framework significantly improves the efficiency of big geoscience data analytics by reducing data processing time and simplifying analytical procedures for geoscientists.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Groot, R" uniqKey="Groot R">R Groot</name>
</author>
<author>
<name sortKey="Mclaughlin, Jd" uniqKey="Mclaughlin J">JD McLaughlin</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, C" uniqKey="Yang C">C Yang</name>
</author>
<author>
<name sortKey="Li, W" uniqKey="Li W">W Li</name>
</author>
<author>
<name sortKey="Xie, J" uniqKey="Xie J">J Xie</name>
</author>
<author>
<name sortKey="Zhou, B" uniqKey="Zhou B">B Zhou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, C" uniqKey="Yang C">C Yang</name>
</author>
<author>
<name sortKey="Goodchild, M" uniqKey="Goodchild M">M Goodchild</name>
</author>
<author>
<name sortKey="Huang, Q" uniqKey="Huang Q">Q Huang</name>
</author>
<author>
<name sortKey="Nebert, D" uniqKey="Nebert D">D Nebert</name>
</author>
<author>
<name sortKey="Raskin, R" uniqKey="Raskin R">R Raskin</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Edwards, Pn" uniqKey="Edwards P">PN Edwards</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Grassl, H" uniqKey="Grassl H">H Grassl</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hodgson, Ja" uniqKey="Hodgson J">JA Hodgson</name>
</author>
<author>
<name sortKey="Thomas, Cd" uniqKey="Thomas C">CD Thomas</name>
</author>
<author>
<name sortKey="Wintle, Ba" uniqKey="Wintle B">BA Wintle</name>
</author>
<author>
<name sortKey="Moilanen, A" uniqKey="Moilanen A">A Moilanen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Murphy, Jm" uniqKey="Murphy J">JM Murphy</name>
</author>
<author>
<name sortKey="Sexton, Dm" uniqKey="Sexton D">DM Sexton</name>
</author>
<author>
<name sortKey="Barnett, Dn" uniqKey="Barnett D">DN Barnett</name>
</author>
<author>
<name sortKey="Jones, Gs" uniqKey="Jones G">GS Jones</name>
</author>
<author>
<name sortKey="Webb, Mj" uniqKey="Webb M">MJ Webb</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, Z" uniqKey="Li Z">Z Li</name>
</author>
<author>
<name sortKey="Yang, C" uniqKey="Yang C">C Yang</name>
</author>
<author>
<name sortKey="Sun, M" uniqKey="Sun M">M Sun</name>
</author>
<author>
<name sortKey="Li, J" uniqKey="Li J">J Li</name>
</author>
<author>
<name sortKey="Xu, C" uniqKey="Xu C">C Xu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cui, D" uniqKey="Cui D">D Cui</name>
</author>
<author>
<name sortKey="Wu, Y" uniqKey="Wu Y">Y Wu</name>
</author>
<author>
<name sortKey="Zhang, Q" uniqKey="Zhang Q">Q Zhang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, Y" uniqKey="Liu Y">Y Liu</name>
</author>
<author>
<name sortKey="Guo, W" uniqKey="Guo W">W Guo</name>
</author>
<author>
<name sortKey="Jiang, W" uniqKey="Jiang W">W Jiang</name>
</author>
<author>
<name sortKey="Gong, J" uniqKey="Gong J">J Gong</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, C" uniqKey="Yang C">C Yang</name>
</author>
<author>
<name sortKey="Wu, H" uniqKey="Wu H">H Wu</name>
</author>
<author>
<name sortKey="Huang, Q" uniqKey="Huang Q">Q Huang</name>
</author>
<author>
<name sortKey="Li, Z" uniqKey="Li Z">Z Li</name>
</author>
<author>
<name sortKey="Li, J" uniqKey="Li J">J Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, J" uniqKey="Li J">J Li</name>
</author>
<author>
<name sortKey="Wang, Fz" uniqKey="Wang F">FZ Wang</name>
</author>
<author>
<name sortKey="Meng, L" uniqKey="Meng L">L Meng</name>
</author>
<author>
<name sortKey="Zhang, W" uniqKey="Zhang W">W Zhang</name>
</author>
<author>
<name sortKey="Cai, Y" uniqKey="Cai Y">Y Cai</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Juve, G" uniqKey="Juve G">G Juve</name>
</author>
<author>
<name sortKey="Deelman, E" uniqKey="Deelman E">E Deelman</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wright, Dj" uniqKey="Wright D">DJ Wright</name>
</author>
<author>
<name sortKey="Wang, S" uniqKey="Wang S">S Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fiore, S" uniqKey="Fiore S">S Fiore</name>
</author>
<author>
<name sortKey="Negro, A" uniqKey="Negro A">A Negro</name>
</author>
<author>
<name sortKey="Aloisio, G" uniqKey="Aloisio G">G Aloisio</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stonebraker, M" uniqKey="Stonebraker M">M Stonebraker</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, Y" uniqKey="Liu Y">Y Liu</name>
</author>
<author>
<name sortKey="Chen, B" uniqKey="Chen B">B Chen</name>
</author>
<author>
<name sortKey="He, W" uniqKey="He W">W He</name>
</author>
<author>
<name sortKey="Fang, Y" uniqKey="Fang Y">Y Fang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Khetrapal, A" uniqKey="Khetrapal A">A Khetrapal</name>
</author>
<author>
<name sortKey="Ganesh, V" uniqKey="Ganesh V">V Ganesh</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lakshman, A" uniqKey="Lakshman A">A Lakshman</name>
</author>
<author>
<name sortKey="Malik, P" uniqKey="Malik P">P Malik</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chang, F" uniqKey="Chang F">F Chang</name>
</author>
<author>
<name sortKey="Dean, J" uniqKey="Dean J">J Dean</name>
</author>
<author>
<name sortKey="Ghemawat, S" uniqKey="Ghemawat S">S Ghemawat</name>
</author>
<author>
<name sortKey="Hsieh, Wc" uniqKey="Hsieh W">WC Hsieh</name>
</author>
<author>
<name sortKey="Wallach, Da" uniqKey="Wallach D">DA Wallach</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, J" uniqKey="Chen J">J Chen</name>
</author>
<author>
<name sortKey="Zheng, G" uniqKey="Zheng G">G Zheng</name>
</author>
<author>
<name sortKey="Chen, H" uniqKey="Chen H">H Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, H" uniqKey="Zhang H">H Zhang</name>
</author>
<author>
<name sortKey="Liu, M" uniqKey="Liu M">M Liu</name>
</author>
<author>
<name sortKey="Shi, Y" uniqKey="Shi Y">Y Shi</name>
</author>
<author>
<name sortKey="Yuen, Da" uniqKey="Yuen D">DA Yuen</name>
</author>
<author>
<name sortKey="Yan, Z" uniqKey="Yan Z">Z Yan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Geist, A" uniqKey="Geist A">A Geist</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gropp, W" uniqKey="Gropp W">W Gropp</name>
</author>
<author>
<name sortKey="Lusk, E" uniqKey="Lusk E">E Lusk</name>
</author>
<author>
<name sortKey="Skjellum, A" uniqKey="Skjellum A">A Skjellum</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Foster, I" uniqKey="Foster I">I Foster</name>
</author>
<author>
<name sortKey="Kesselman, C" uniqKey="Kesselman C">C Kesselman</name>
</author>
<author>
<name sortKey="Tuecke, S" uniqKey="Tuecke S">S Tuecke</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dean, J" uniqKey="Dean J">J Dean</name>
</author>
<author>
<name sortKey="Ghemawat, S" uniqKey="Ghemawat S">S Ghemawat</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rizvandi, Nb" uniqKey="Rizvandi N">NB Rizvandi</name>
</author>
<author>
<name sortKey="Boloori, Aj" uniqKey="Boloori A">AJ Boloori</name>
</author>
<author>
<name sortKey="Kamyabpour, N" uniqKey="Kamyabpour N">N Kamyabpour</name>
</author>
<author>
<name sortKey="Zomaya, Ay" uniqKey="Zomaya A">AY Zomaya</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhao, H" uniqKey="Zhao H">H Zhao</name>
</author>
<author>
<name sortKey="Ai, S" uniqKey="Ai S">S Ai</name>
</author>
<author>
<name sortKey="Lv, Z" uniqKey="Lv Z">Z Lv</name>
</author>
<author>
<name sortKey="Li, B" uniqKey="Li B">B Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lud Scher, B" uniqKey="Lud Scher B">B Ludäscher</name>
</author>
<author>
<name sortKey="Altintas, I" uniqKey="Altintas I">I Altintas</name>
</author>
<author>
<name sortKey="Berkley, C" uniqKey="Berkley C">C Berkley</name>
</author>
<author>
<name sortKey="Higgins, D" uniqKey="Higgins D">D Higgins</name>
</author>
<author>
<name sortKey="Jaeger, E" uniqKey="Jaeger E">E Jaeger</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wang, S" uniqKey="Wang S">S Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, C" uniqKey="Yang C">C Yang</name>
</author>
<author>
<name sortKey="Raskin, R" uniqKey="Raskin R">R Raskin</name>
</author>
<author>
<name sortKey="Goodchild, M" uniqKey="Goodchild M">M Goodchild</name>
</author>
<author>
<name sortKey="Gahegan, M" uniqKey="Gahegan M">M Gahegan</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gil, Y" uniqKey="Gil Y">Y Gil</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Taylor, Ij" uniqKey="Taylor I">IJ Taylor</name>
</author>
<author>
<name sortKey="Deelman, E" uniqKey="Deelman E">E Deelman</name>
</author>
<author>
<name sortKey="Gannon, D" uniqKey="Gannon D">D Gannon</name>
</author>
<author>
<name sortKey="Shields, M" uniqKey="Shields M">M Shields</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yue, P" uniqKey="Yue P">P Yue</name>
</author>
<author>
<name sortKey="He, L" uniqKey="He L">L He</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Oinn, T" uniqKey="Oinn T">T Oinn</name>
</author>
<author>
<name sortKey="Addis, M" uniqKey="Addis M">M Addis</name>
</author>
<author>
<name sortKey="Ferris, J" uniqKey="Ferris J">J Ferris</name>
</author>
<author>
<name sortKey="Marvin, D" uniqKey="Marvin D">D Marvin</name>
</author>
<author>
<name sortKey="Senger, M" uniqKey="Senger M">M Senger</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Barga, R" uniqKey="Barga R">R Barga</name>
</author>
<author>
<name sortKey="Jackson, J" uniqKey="Jackson J">J Jackson</name>
</author>
<author>
<name sortKey="Araujo, N" uniqKey="Araujo N">N Araujo</name>
</author>
<author>
<name sortKey="Guo, D" uniqKey="Guo D">D Guo</name>
</author>
<author>
<name sortKey="Gautam, N" uniqKey="Gautam N">N Gautam</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mattmann, Ca" uniqKey="Mattmann C">CA Mattmann</name>
</author>
<author>
<name sortKey="Crichton, Dj" uniqKey="Crichton D">DJ Crichton</name>
</author>
<author>
<name sortKey="Hart, Af" uniqKey="Hart A">AF Hart</name>
</author>
<author>
<name sortKey="Goodale, C" uniqKey="Goodale C">C Goodale</name>
</author>
<author>
<name sortKey="Hughes, Js" uniqKey="Hughes J">JS Hughes</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Williams, Dn" uniqKey="Williams D">DN Williams</name>
</author>
<author>
<name sortKey="Drach, R" uniqKey="Drach R">R Drach</name>
</author>
<author>
<name sortKey="Ananthakrishnan, R" uniqKey="Ananthakrishnan R">R Ananthakrishnan</name>
</author>
<author>
<name sortKey="Foster, I" uniqKey="Foster I">I Foster</name>
</author>
<author>
<name sortKey="Fraser, D" uniqKey="Fraser D">D Fraser</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huang, Q" uniqKey="Huang Q">Q Huang</name>
</author>
<author>
<name sortKey="Yang, C" uniqKey="Yang C">C Yang</name>
</author>
<author>
<name sortKey="Liu, K" uniqKey="Liu K">K Liu</name>
</author>
<author>
<name sortKey="Xia, J" uniqKey="Xia J">J Xia</name>
</author>
<author>
<name sortKey="Xu, C" uniqKey="Xu C">C Xu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yue, P" uniqKey="Yue P">P Yue</name>
</author>
<author>
<name sortKey="Di, L" uniqKey="Di L">L Di</name>
</author>
<author>
<name sortKey="Yang, W" uniqKey="Yang W">W Yang</name>
</author>
<author>
<name sortKey="Yu, G" uniqKey="Yu G">G Yu</name>
</author>
<author>
<name sortKey="Zhao, P" uniqKey="Zhao P">P Zhao</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, Z" uniqKey="Li Z">Z Li</name>
</author>
<author>
<name sortKey="Yang, C" uniqKey="Yang C">C Yang</name>
</author>
<author>
<name sortKey="Wu, H" uniqKey="Wu H">H Wu</name>
</author>
<author>
<name sortKey="Li, W" uniqKey="Li W">W Li</name>
</author>
<author>
<name sortKey="Miao, L" uniqKey="Miao L">L Miao</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">PLoS One</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS ONE</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">plosone</journal-id>
<journal-title-group>
<journal-title>PLoS ONE</journal-title>
</journal-title-group>
<issn pub-type="epub">1932-6203</issn>
<publisher>
<publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, CA, USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">25742012</article-id>
<article-id pub-id-type="pmc">4351198</article-id>
<article-id pub-id-type="doi">10.1371/journal.pone.0116781</article-id>
<article-id pub-id-type="publisher-id">PONE-D-14-42409</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Enabling Big Geoscience Data Analytics with a Cloud-Based, MapReduce-Enabled and Service-Oriented Workflow Framework</article-title>
<alt-title alt-title-type="running-head">Developing a New Framework to Enable Big Geoscience Data Analytics</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Li</surname>
<given-names>Zhenlong</given-names>
</name>
<xref ref-type="aff" rid="aff001">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Yang</surname>
<given-names>Chaowei</given-names>
</name>
<xref ref-type="aff" rid="aff001">
<sup>1</sup>
</xref>
<xref rid="cor001" ref-type="corresp">*</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Jin</surname>
<given-names>Baoxuan</given-names>
</name>
<xref ref-type="aff" rid="aff001">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff002">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Yu</surname>
<given-names>Manzhu</given-names>
</name>
<xref ref-type="aff" rid="aff001">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Liu</surname>
<given-names>Kai</given-names>
</name>
<xref ref-type="aff" rid="aff001">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Sun</surname>
<given-names>Min</given-names>
</name>
<xref ref-type="aff" rid="aff001">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Zhan</surname>
<given-names>Matthew</given-names>
</name>
<xref ref-type="aff" rid="aff001">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff003">
<sup>3</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff001">
<label>1</label>
<addr-line>NSF Spatiotemporal Innovation Center, George Mason University, Fairfax, VA, United States of America</addr-line>
</aff>
<aff id="aff002">
<label>2</label>
<addr-line>Yunnan Provincial Geomatics Center, Yunnan Bureau of Surveying, Mapping, and GeoInformation, Kunming, Yunnan, China</addr-line>
</aff>
<aff id="aff003">
<label>3</label>
<addr-line>Department of Computer Science, University of Texas—Austin, Austin, Texas, United States of America</addr-line>
</aff>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Gomez-Gesteira</surname>
<given-names>Moncho</given-names>
</name>
<role>Academic Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1">
<addr-line>University of Vigo, SPAIN</addr-line>
</aff>
<author-notes>
<fn fn-type="conflict" id="coi001">
<p>
<bold>Competing Interests: </bold>
The authors have declared that no competing interests exist.</p>
</fn>
<fn fn-type="con" id="contrib001">
<p>Conceived and designed the experiments: CY ZL BJ. Performed the experiments: ZL KL MY MZ. Analyzed the data: ZL CY BJ KL MY. Contributed reagents/materials/analysis tools: ZL CY BJ MS KL. Wrote the paper: ZL CY MY BJ.</p>
</fn>
<corresp id="cor001">* E-mail:
<email>cyang3@gmu.edu</email>
</corresp>
</author-notes>
<pub-date pub-type="epub">
<day>5</day>
<month>3</month>
<year>2015</year>
</pub-date>
<pub-date pub-type="collection">
<year>2015</year>
</pub-date>
<volume>10</volume>
<issue>3</issue>
<elocation-id>e0116781</elocation-id>
<history>
<date date-type="received">
<day>20</day>
<month>9</month>
<year>2014</year>
</date>
<date date-type="accepted">
<day>14</day>
<month>12</month>
<year>2014</year>
</date>
</history>
<permissions>
<copyright-year>2015</copyright-year>
<copyright-holder>Li et al</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open access article distributed under the terms of the
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License</ext-link>
, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:type="simple" xlink:href="pone.0116781.pdf"></self-uri>
<abstract>
<p>Geoscience observations and model simulations are generating vast amounts of multi-dimensional data. Effectively analyzing these data is essential for geoscience studies. However, the task is challenging for geoscientists because processing the massive amounts of data is both computing- and data-intensive, and the analytics require complex procedures and multiple tools. To tackle these challenges, a scientific workflow framework is proposed for big geoscience data analytics. The framework leverages cloud computing, MapReduce, and Service Oriented Architecture (SOA). Specifically, HBase is adopted for storing and managing big geoscience data across distributed computers; a MapReduce-based algorithm framework is developed to support parallel processing of geoscience data; and a service-oriented workflow architecture is built to support on-demand complex data analytics in the cloud environment. A proof-of-concept prototype tests the performance of the framework. Results show that this framework significantly improves the efficiency of big geoscience data analytics by reducing data processing time and simplifying analytical procedures for geoscientists.</p>
</abstract>
<funding-group>
<funding-statement>This research is supported by NSF (PLR-1349259, IIP-1338925, CNS-1117300) and NASA (NNG12PP37I). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</funding-statement>
</funding-group>
<counts>
<fig-count count="13"></fig-count>
<table-count count="2"></table-count>
<page-count count="23"></page-count>
</counts>
<custom-meta-group>
<custom-meta id="data-availability">
<meta-name>Data Availability</meta-name>
<meta-value>All relevant data are within the paper.</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
<notes>
<title>Data Availability</title>
<p>All relevant data are within the paper.</p>
</notes>
</front>
<body>
<sec sec-type="intro" id="sec001">
<title>Introduction</title>
<p>Geoscience data are a core component driving geoscience advancement [
<xref rid="pone.0116781.ref001" ref-type="bibr">1</xref>
]. Understanding the Earth as a system requires a combination of observational data recorded by sensors and simulation data produced by numerical models [
<xref rid="pone.0116781.ref002" ref-type="bibr">2</xref>
]. Over the past half century, humans’ capability to explore the Earth system has been enhanced by the emergence of new computing, sensor, and information technologies [
<xref rid="pone.0116781.ref003" ref-type="bibr">3</xref>
]. While these technological advancements accelerate the collection, simulation, and sharing of geoscience data, they also produce Big Data for the geosciences in at least two ways. First, massive amounts of multi-dimensional data recording various physical phenomena are captured by sensors across the globe, and these data accumulate rapidly, growing by terabytes to petabytes daily [
<xref rid="pone.0116781.ref004" ref-type="bibr">4</xref>
]. For example, the meteorological satellite Himawari-9 collects ∼3 terabytes of data from space every day [
<xref rid="pone.0116781.ref005" ref-type="bibr">5</xref>
]. Second, supercomputers enable geoscientists to simulate Earth phenomena with finer spatiotemporal resolution and greater space and time coverage, producing large amounts of simulated geoscience data.</p>
<p>Effectively processing and analyzing big geoscience data is becoming critical to addressing challenges such as climate change, natural disasters, diseases, and other emergencies. However, the ever-growing volume of big geoscience data exceeds the capacity of current computing and data management technologies [
<xref rid="pone.0116781.ref006" ref-type="bibr">6</xref>
]. This is particularly true in climate science, which normally produces hundreds of terabytes of data in model simulations [
<xref rid="pone.0116781.ref002" ref-type="bibr">2</xref>
,
<xref rid="pone.0116781.ref007" ref-type="bibr">7</xref>
].</p>
<p>In this paper, we first take big climate data analytics as a case study to exemplify three challenges in processing and analyzing big geoscience data, and then demonstrate how our proposed solution addresses these challenges.</p>
<sec id="sec002">
<title>1.1 A Case Study: Climate Model Sensitivity</title>
<p>Climate change is one of the biggest contemporary concerns for humankind due to its broad impacts on society and ecosystems worldwide [
<xref rid="pone.0116781.ref008" ref-type="bibr">8</xref>
]. Information about the future climate is critical for decision makers in areas such as agricultural planning, emergency preparedness, political negotiations, and intelligence [
<xref rid="pone.0116781.ref009" ref-type="bibr">9</xref>
]. However, a major problem decision makers face is that different climate models project different climate scenarios because of unknown model uncertainties. Testing a climate model's sensitivity to its input parameters is a standard modeling practice for quantifying these uncertainties [
<xref rid="pone.0116781.ref010" ref-type="bibr">10</xref>
]. To do this, perturbed physics ensembles (PPEs) run a model hundreds or thousands of times with different input parameter values; the model inputs and outputs are then analyzed to identify which parameters the simulated climate changes are most sensitive to (diagnosis).</p>
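The PPE procedure described above can be sketched in a few lines: perturb the inputs, run the model many times, then rank parameters by how strongly each one tracks the output. This is a minimal illustrative sketch, not the paper's implementation; the toy model, the ranges, and the parameter names other than funio_denom (which appears in Table 1) are hypothetical.

```python
# Sketch of a perturbed physics ensemble (PPE) sweep with a toy stand-in model.
# Sensitivity is diagnosed via the absolute Pearson correlation between each
# perturbed input parameter and the model output across the ensemble.
import random

PARAMS = {                      # (low, high) sampling ranges; illustrative
    "funio_denom": (5.0, 25.0), # range taken from Table 1
    "param_b": (0.0, 1.0),      # hypothetical insensitive parameter
    "param_c": (1.0, 10.0),     # hypothetical weakly sensitive parameter
}

def toy_model(p):
    # Stand-in for one climate model run: depends strongly on funio_denom,
    # weakly on param_c, and not at all on param_b, plus some noise.
    return 2.0 * p["funio_denom"] + 0.1 * p["param_c"] + random.gauss(0, 0.5)

def corr(xs, ys):
    # Pearson correlation coefficient, implemented with the stdlib only.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def ensemble_sensitivity(n_runs=300, seed=42):
    random.seed(seed)
    runs = []
    for _ in range(n_runs):                      # one ensemble member per loop
        p = {k: random.uniform(lo, hi) for k, (lo, hi) in PARAMS.items()}
        runs.append((p, toy_model(p)))
    outputs = [y for _, y in runs]
    return {k: abs(corr([p[k] for p, _ in runs], outputs)) for k in PARAMS}

sens = ensemble_sensitivity()   # funio_denom dominates in this toy setup
```

In the real PPE-300 experiment each "run" is a full ModelE simulation producing a 4-D output field rather than a scalar, but the sweep-then-diagnose structure is the same.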
<p>Climate@Home (
<ext-link ext-link-type="uri" xlink:href="http://climateathome.com/climate@home">http://climateathome.com/climate@home</ext-link>
) is a project initiated by NASA to advance climate modeling studies [
<xref rid="pone.0116781.ref011" ref-type="bibr">11</xref>
]. In this project to study the sensitivity of ModelE (
<ext-link ext-link-type="uri" xlink:href="http://www.giss.nasa.gov/tools/modelE/">http://www.giss.nasa.gov/tools/modelE/</ext-link>
, a global climate model developed by NASA), 300 ensemble model-runs (PPE-300) are required for each experiment, sweeping seven atmospheric parameters in each model-run's input (
<xref rid="pone.0116781.t001" ref-type="table">Table 1</xref>
). The simulation period is from December 1949 to January 1961 with a 4° x 5° spatial resolution and a monthly time resolution. Each model run generates ∼10 gigabytes of data in four dimensions (3D space and 1D time) with 336 climatic variables, totaling three terabytes of data for the PPE-300 experiment.</p>
<table-wrap id="pone.0116781.t001" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0116781.t001</object-id>
<label>Table 1</label>
<caption>
<title>Seven tested atmospheric parameters in the PPE-300 experiment.</title>
</caption>
<alternatives>
<graphic id="pone.0116781.t001g" xlink:href="pone.0116781.t001"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
</colgroup>
<thead>
<tr>
<th align="left" rowspan="1" colspan="1">Parameter</th>
<th align="left" rowspan="1" colspan="1">Definition</th>
<th align="left" rowspan="1" colspan="1">Range</th>
<th align="left" rowspan="1" colspan="1">Default</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">funio_denom</td>
<td align="left" rowspan="1" colspan="1">Affects fraction of time that clouds in a mixed-phase regime over ocean are ice or water</td>
<td align="left" rowspan="1" colspan="1">5–25</td>
<td align="left" rowspan="1" colspan="1">22</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">autoconv_mult</td>
<td align="left" rowspan="1" colspan="1">Multiplies rate of auto conversion of cloud condensate</td>
<td align="left" rowspan="1" colspan="1">0.5–2</td>
<td align="left" rowspan="1" colspan="1">1</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">radius_mult</td>
<td align="left" rowspan="1" colspan="1">Multiples effective radius of cloud droplets</td>
<td align="left" rowspan="1" colspan="1">0.5–2</td>
<td align="left" rowspan="1" colspan="1">1</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">ent_conf1</td>
<td align="left" rowspan="1" colspan="1">Entrainment coefficient for convective plume</td>
<td align="left" rowspan="1" colspan="1">0.1–0.9</td>
<td align="left" rowspan="1" colspan="1">0.3</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">ent_conf2</td>
<td align="left" rowspan="1" colspan="1">Entrainment coefficient for secondary convective plume</td>
<td align="left" rowspan="1" colspan="1">0.1–0.9</td>
<td align="left" rowspan="1" colspan="1">0.6</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">U00a</td>
<td align="left" rowspan="1" colspan="1">Relative humidity threshold for stratus cloud formation</td>
<td align="left" rowspan="1" colspan="1">0.4–0.8</td>
<td align="left" rowspan="1" colspan="1">0.6</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">U00b</td>
<td align="left" rowspan="1" colspan="1">Relative humidity multiplier for low clouds</td>
<td align="left" rowspan="1" colspan="1">0.9–2.5</td>
<td align="left" rowspan="1" colspan="1">2.3</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<p>To identify which of the 336 output variables are sensitive to the seven input parameters, the three terabytes of model output are analyzed. Specifically, the following steps are taken:
<list list-type="simple">
<list-item>
<p>
<bold>S1. Simulation</bold>
: Run ModelE 300 times sweeping seven input parameters;</p>
</list-item>
<list-item>
<p>
<bold>S2. Preprocess</bold>
: Convert model output (monthly .acc files) into NetCDF files, and combine monthly data to reduce the file numbers;</p>
</list-item>
<list-item>
<p>
<bold>S3. Management</bold>
: Store and manage the NetCDF files in a file system or database;</p>
</list-item>
<list-item>
<p>
<bold>S4. Process</bold>
: For each of the 336 variables in each of the 300 runs, calculate the annual global and 10-year mean.</p>
</list-item>
<list-item>
<p>
<bold>S5. Analysis</bold>
: Conduct linear regression analysis for each Parameter-Variable (P, V) pair (336 × 7 = 2,352 pairs in total) using the 300 runs; and</p>
</list-item>
<list-item>
<p>
<bold>S6. Visualization</bold>
: Identify and plot the variables most affected by the parameters.</p>
</list-item>
</list>
</p>
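<p>Steps S4 and S5 above can be sketched in plain Python. The sketch below is illustrative only: it uses a toy three-run ensemble instead of the 300-run PPE, and all names (e.g., <italic>ten_year_mean</italic>, <italic>regression_slope</italic>) are hypothetical rather than taken from the actual Climate@Home tooling.</p>

```python
# Hypothetical sketch of steps S4-S5: for one (parameter, variable) pair,
# reduce each run's monthly output to a single multi-year mean (S4), then
# fit a linear regression across the ensemble to gauge how sensitive the
# output variable is to the input parameter (S5).

def ten_year_mean(monthly_values):
    """S4: collapse one run's monthly series for one variable to a mean."""
    return sum(monthly_values) / len(monthly_values)

def regression_slope(xs, ys):
    """S5: ordinary least-squares slope of ys (variable) on xs (parameter)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Toy ensemble of 3 runs instead of 300: parameter value -> monthly output
runs = {0.4: [10.0, 10.2], 0.6: [11.0, 11.2], 0.8: [12.0, 12.2]}
xs = list(runs)
ys = [ten_year_mean(v) for v in runs.values()]
print(round(regression_slope(xs, ys), 3))  # slope of variable vs. parameter
```

<p>A large slope magnitude for a (P, V) pair would flag that variable as sensitive to that parameter, which is then plotted in S6.</p>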
</sec>
<sec id="sec003">
<title>1.2 Challenges Posed by Geoscience Data Analytics</title>
<p>Geoscience data analytics poses three computing challenges, as exemplified in the climate model sensitivity case study.</p>
<p>
<bold>C1. Big data or data intensity:</bold>
Storing, managing, and processing massive datasets are a grand challenge in the geosciences [
<xref rid="pone.0116781.ref012" ref-type="bibr">12</xref>
,
<xref rid="pone.0116781.ref013" ref-type="bibr">13</xref>
,
<xref rid="pone.0116781.ref051" ref-type="bibr">51</xref>
]. For example, one PPE-300 experiment produces 3 terabytes of climate data. A scalable data management framework is critical for managing these datasets. Furthermore, geoscience data analytics needs to deal with heterogeneous data formats (e.g., array-based data, text files, and images), access distributed data sources, and share the results. Different data access protocols (e.g., FTP, HTTP) and data service standards (e.g., WCS, WFS, and OPeNDAP) are normally involved in each step’s input/output. Hence, a mechanism to encapsulate these heterogeneities is essential.</p>
<p>
<bold>C2. Computing intensity:</bold>
Multi-dimensionality and heterogeneous data structures are intrinsic characteristics of geoscience data [
<xref rid="pone.0116781.ref014" ref-type="bibr">14</xref>
]. Processing and analyzing these complex big data is computing intensive, requiring massive amounts of computing resources. In the case study, S4 is computing intensive given the terabytes of 4-D data. A parallelization-enabled algorithm is one key to accelerating these processes. Another computing-intensive aspect is climate simulation (S1), where each model-run requires ∼5 days to simulate a single 10-year scenario. Traditional computing cannot finish the 300 model-runs with reasonable effort and time [
<xref rid="pone.0116781.ref015" ref-type="bibr">15</xref>
]. In addition, parallelization requires more resources since many processing threads run at the same time. Therefore, supplying adequate computing resources is another key to tackling the computing-intensity challenge.</p>
<p>
<bold>C3. Procedure complexity:</bold>
Geoscience data analytics normally require complex steps with a specific sequence [
<xref rid="pone.0116781.ref016" ref-type="bibr">16</xref>
]. For example, the case study needs six steps (S1 to S6) from data generation (simulation) to visualization. A workflow platform tailored for handling these procedures is critical for managing, conducting and reusing the processes. In addition, conducting each step requires different tools, libraries and external processing services. To accomplish an analytics task, geoscientists normally need to discover appropriate tools/libraries, write their own programs/scripts and deal with Linux command lines. For example, S2 requires data format conversion tools, and S4 requires specific tools built on libraries (e.g., NetCDF-Java,
<ext-link ext-link-type="uri" xlink:href="http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/">http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/</ext-link>
). For S5 and S6, scientists need to program in R or other languages. A mechanism to integrate these heterogeneous tools and libraries is essential.</p>
<p>Cloud computing is a new computing paradigm characterized by its on-demand self-service, broad network access, resource pooling, rapid elasticity and measured service [
<xref rid="pone.0116781.ref017" ref-type="bibr">17</xref>
]. Cloud computing provides potential computing solutions to the management, discovery, access, and analytics of the big geoscience data for intuitive decision support [
<xref rid="pone.0116781.ref004" ref-type="bibr">4</xref>
].</p>
<p>In this paper, we explore the idea of building a scientific workflow framework using cloud computing as the fundamental infrastructure to tackle the aforementioned challenges. In this framework, methodologies are proposed for leveraging cloud computing, parallel computing and Service Oriented Architecture (SOA) as follows: HBase stores and manages big geoscience data across distributed computers; a MapReduce-based algorithm framework supports parallel processing of geoscience data; service-oriented workflow architecture supports on-demand complex data analytics; and the whole framework is implemented in the cloud environment. The remainder of this paper details the framework in the following sequence: Section 2 reviews relevant research; Section 3 details the methodologies; Section 4 introduces a proof-of-concept prototype and experimental results; and Section 5 draws conclusions and discusses future research.</p>
</sec>
</sec>
<sec id="sec004">
<title>Related Work</title>
<p>This section discusses related work, fundamental technologies, and background for the research.</p>
<sec id="sec005">
<title>2.1. Database Technologies for Managing Big Geoscience Data</title>
<p>Over the past decades, relational database management systems (RDBMS) (e.g., Oracle) have been used to manage a variety of scientific data, including that of the geosciences [
<xref rid="pone.0116781.ref018" ref-type="bibr">18</xref>
]. With RDBMS, metadata are normally managed in a relational database while the actual data are stored in file systems. The data can be accessed by querying the database to find the reference (file location). While this approach takes advantage of mature relational database technology, it is limited in terms of scalability and reliability since the data are normally archived as raw files. In fact, the growth of geoscience data has exceeded the capability of existing infrastructure for data access, archiving, analysis and mining [
<xref rid="pone.0116781.ref019" ref-type="bibr">19</xref>
,
<xref rid="pone.0116781.ref020" ref-type="bibr">20</xref>
].</p>
<p>To overcome the drawbacks of the traditional RDBMS, an emerging group of projects are addressing the multi-dimensional geoscience data utilizing distributed data management (e.g., Integrated Rule-Oriented Data Systems:
<ext-link ext-link-type="uri" xlink:href="http://irods.org/">http://irods.org/</ext-link>
, Climate-G testbed [
<xref rid="pone.0116781.ref021" ref-type="bibr">21</xref>
], the Earth System Grid Federation [
<xref rid="pone.0116781.ref052" ref-type="bibr">52</xref>
]). These projects provide a grid-based framework to manage big geoscience data in a distributed environment. However, they do not draw support from cloud computing [
<xref rid="pone.0116781.ref022" ref-type="bibr">22</xref>
], so the resources and services can neither be initiated on demand nor meet the requirements of high scalability, availability and elasticity of computing processes. In addition, these systems are normally complicated and bulky, making them hard to adopt for other scientific research and applications.</p>
<p>NoSQL databases [
<xref rid="pone.0116781.ref023" ref-type="bibr">23</xref>
] provide a potential solution to the traditional RDBMS problems while offering the flexibility to be tailored for various requirements. Over the past several years, NoSQL databases have been used to store and manage big data in distributed environments. Compared to traditional RDBMS, NoSQL databases are schema-free, support replication by default, and offer simple APIs [
<xref rid="pone.0116781.ref024" ref-type="bibr">24</xref>
]. The most prevalent NoSQL databases such as HBase [
<xref rid="pone.0116781.ref025" ref-type="bibr">25</xref>
] and Cassandra [
<xref rid="pone.0116781.ref026" ref-type="bibr">26</xref>
] are based on a BigTable [
<xref rid="pone.0116781.ref027" ref-type="bibr">27</xref>
] schema. HBase, an open source distributed database running on top of Hadoop Distributed File System (HDFS), provides high scalability and reliability by storing data across a cluster of commodity hardware with automatic failover support. Studies to harness the power of HBase to manage big geoscience data include that of Liu et al. [
<xref rid="pone.0116781.ref024" ref-type="bibr">24</xref>
], who proposed a method to store massive imagery data in HBase by introducing two specific tables (“
<italic>HRasterTable</italic>
” and “
<italic>HRasterDataTable</italic>
”), and Chen et al. [
<xref rid="pone.0116781.ref028" ref-type="bibr">28</xref>
], who proposed a mechanism to effectively search and manage remote sensing images stored in HBase. Unfortunately, little research attention has focused on leveraging HBase to handle big array-based multi-dimensional data (e.g., NetCDF or HDF).</p>
<p>To address this shortcoming, a data decomposition mechanism is proposed to manage multidimensional geoscience data with HBase in a scalable cloud computing environment.</p>
</sec>
<sec id="sec006">
<title>2.2. Parallelization Technologies to Process Big Geoscience Data</title>
<p>Computation- and data-intensive geoscience analytics are becoming prevalent. To improve scalability and performance, parallelization technologies are essential [
<xref rid="pone.0116781.ref029" ref-type="bibr">29</xref>
]. Traditionally, most parallel applications achieve fine grained parallelism using message passing infrastructures such as PVM [
<xref rid="pone.0116781.ref030" ref-type="bibr">30</xref>
] and MPI [
<xref rid="pone.0116781.ref031" ref-type="bibr">31</xref>
] executed on computer clusters, super computers, or grid infrastructures [
<xref rid="pone.0116781.ref032" ref-type="bibr">32</xref>
]. While these infrastructures are efficient for computing-intensive parallel applications, the overall performance decreases as data volumes increase due to the inevitable data movement, which hampers the use of MPI-based infrastructures for processing big geoscience data. In addition, these infrastructures normally have poor scalability, and resource allocation is constrained by the underlying computational infrastructure.</p>
<p>MapReduce [
<xref rid="pone.0116781.ref033" ref-type="bibr">33</xref>
], a parallelization model initiated by Google, is a potential solution to the big data challenges, as it adopts a data-centered approach that moves computation to the data instead of the converse. This avoids moving large volumes of data across the network, which would degrade performance. Hadoop (
<ext-link ext-link-type="uri" xlink:href="http://hadoop.apache.org/">http://hadoop.apache.org/</ext-link>
) is an open source implementation of MapReduce and has been adopted in the geoscience research community [
<xref rid="pone.0116781.ref015" ref-type="bibr">15</xref>
,
<xref rid="pone.0116781.ref028" ref-type="bibr">28</xref>
,
<xref rid="pone.0116781.ref034" ref-type="bibr">34</xref>
].</p>
<p>Since Hadoop is designed to process unstructured data (e.g., texts, documents, and web pages), array-based, multi-dimensional geoscience data cannot be directly digested by Hadoop. Studies have explored processing geoscience data in Hadoop. One approach converts binary datasets into text-based datasets. For example, Zhao et al. [
<xref rid="pone.0116781.ref035" ref-type="bibr">35</xref>
] converted NetCDF data into text-based CDL (
<ext-link ext-link-type="uri" xlink:href="http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/CDL-Syntax.html">http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/CDL-Syntax.html</ext-link>
) files to allow parallel access to massive NetCDF data using MapReduce. Although straightforward, this approach poses two issues: the transformation sacrifices the integrity and portability of the NetCDF data and increases the data management complexity; and the transformed data volume may grow to several times the original. The other approach reorganizes and stores the original NetCDF dataset in Hadoop-supported file formats (e.g., Sequence Files,
<ext-link ext-link-type="uri" xlink:href="http://wiki.apache.org/hadoop/SequenceFile">http://wiki.apache.org/hadoop/SequenceFile</ext-link>
). Duffy et al. [
<xref rid="pone.0116781.ref036" ref-type="bibr">36</xref>
] leveraged Hadoop MapReduce to process climate data by converting the dataset into Hadoop Sequence Files, eliminating the issues of the first approach. However, since Sequence Files support no indexing or querying, all records must be fully traversed to find matching records, degrading performance as the number of records increases.</p>
<p>To address this problem, this paper explores a mechanism to store big geoscience data in HBase (see Section 2.1). Based on the proposed data decomposition mechanism, a MapReduce-enabled framework is introduced to support on-demand access and parallel processing of big geoscience data.</p>
</sec>
<sec id="sec007">
<title>2.3. Scientific Workflow for Geosciences</title>
<p>Scientific workflow serves as a problem-solving environment that simplifies tasks by decomposing them into meaningful sub-tasks and combining these into executable data analysis pipelines [
<xref rid="pone.0116781.ref037" ref-type="bibr">37</xref>
]. Scientific workflow provides mechanisms to discover, share, analyze, and evaluate research tools [
<xref rid="pone.0116781.ref038" ref-type="bibr">38</xref>
,
<xref rid="pone.0116781.ref039" ref-type="bibr">39</xref>
] and is a significant element of geospatial cyberinfrastructure [
<xref rid="pone.0116781.ref037" ref-type="bibr">37</xref>
,
<xref rid="pone.0116781.ref040" ref-type="bibr">40</xref>
<xref rid="pone.0116781.ref042" ref-type="bibr">42</xref>
]. Provenance tracking provided by workflow systems enables geoscientists to determine the reliability of the data and service products and validate and reproduce scientific results in cyberinfrastructure [
<xref rid="pone.0116781.ref043" ref-type="bibr">43</xref>
].</p>
<p>There are several scientific workflow systems including Kepler [
<xref rid="pone.0116781.ref037" ref-type="bibr">37</xref>
], Taverna [
<xref rid="pone.0116781.ref044" ref-type="bibr">44</xref>
], Triana [
<xref rid="pone.0116781.ref045" ref-type="bibr">45</xref>
], Trident [
<xref rid="pone.0116781.ref046" ref-type="bibr">46</xref>
], and VisTrails [
<xref rid="pone.0116781.ref047" ref-type="bibr">47</xref>
]. These systems compose and schedule complex workflows on a distributed environment, such as clusters and Grids [
<xref rid="pone.0116781.ref032" ref-type="bibr">32</xref>
]. As a new computing infrastructure, cloud computing is a new approach for deploying and executing scientific workflows [
<xref rid="pone.0116781.ref016" ref-type="bibr">16</xref>
]. Preliminary studies to evaluate feasibility and performance of migrating scientific workflows into the cloud [
<xref rid="pone.0116781.ref048" ref-type="bibr">48</xref>
<xref rid="pone.0116781.ref050" ref-type="bibr">50</xref>
] have found that, given similar resources, cloud computing provides performance comparable to traditional computing infrastructure with better scalability and flexibility. However, these studies mainly focused on deploying current scientific workflow platforms to the cloud environment by replacing the traditional physical machines in existing workflow deployments with virtual machines. A more comprehensive study is desired to fully leverage the advantages of cloud computing to enable scientific workflow for supporting the geosciences.</p>
<p>We propose a cloud-based workflow framework that incorporates cloud computing to provision the whole workflow execution environment on demand, dynamically add computing resources to the workflow at runtime, and seamlessly integrate heterogeneous tools.</p>
</sec>
</sec>
<sec id="sec008">
<title>Methodologies</title>
<sec id="sec009">
<title>3.1. Framework</title>
<p>The framework (
<xref rid="pone.0116781.g001" ref-type="fig">Fig. 1</xref>
) is layer-based and includes four layers: computing resource (Cloud Platform), processing (Hadoop Cluster), service, and presentation (Workflow Builder). The cloud platform provides on-demand computing resources, including computing, storage, and network, as services. It hosts the processing layer, where the workflow engine runs on a virtualized Hadoop cluster (virtual machines as cluster nodes). The service layer is built on top of the cluster for registering, managing, and chaining services, which are chained into executable workflows in an on-demand, scalable computing environment. The processing and service layers together form the workflow execution environment. On top is the presentation layer, which enables users to publish, discover, and use services to build workflows in a drag-and-drop style, and to run and monitor workflows in a web-based interface. Oozie (
<ext-link ext-link-type="uri" xlink:href="http://yahoo.github.com/oozie/">http://yahoo.github.com/oozie/</ext-link>
) is adopted as the workflow engine due to its intrinsic integration with Hadoop MapReduce.</p>
<fig id="pone.0116781.g001" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0116781.g001</object-id>
<label>Fig 1</label>
<caption>
<title>Framework architecture.</title>
</caption>
<graphic xlink:href="pone.0116781.g001"></graphic>
</fig>
<p>In this framework, data intensity is handled through storing and managing data using HBase in a distributed environment. Computing intensity is tackled by allocating the intensive computation tasks to many computing nodes using the MapReduce model. By integrating cloud computing, computing resources associated with the workflow are provisioned or terminated on-demand to ensure performance while minimizing resource consumption.</p>
<p>Service Oriented Architecture (SOA) is adopted to publish different processes as individual services; these include not only processing and data services but also infrastructure and tool services. Unlike traditional web service orchestration [
<xref rid="pone.0116781.ref055" ref-type="bibr">55</xref>
<xref rid="pone.0116781.ref056" ref-type="bibr">56</xref>
], the “service” herein does not refer to a “web service” but rather to a self-described functional unit that can be plugged into the workflow. The following subsections detail the framework in terms of big geoscience data processing, the service-oriented approach, and the cloud-based workflow execution environment.</p>
</sec>
<sec id="sec010">
<title>3.2. Big Geoscience Data Processing with MapReduce</title>
<sec id="sec011">
<title>3.2.1. Multi-Dimensional Geoscience Data Decomposition Mechanism</title>
<p>This section details the mechanism to decompose the array-based data files and store them in HBase.</p>
<p>Normally, geoscience data are five-dimensional: space (latitude, longitude, and altitude), time and variable. In array-based data models, data are stored in individual files, each regarded as a dataset. A dataset is located by its dataset id (e.g., file URI). The array-based data model is expressed as
<xref rid="pone.0116781.e001" ref-type="disp-formula">Equation 1</xref>
, and each dataset id refers to a dataset containing five dimensions (X, Y, Z, T and V).
<disp-formula id="pone.0116781.e001">
<alternatives>
<graphic xlink:href="pone.0116781.e001.jpg" id="pone.0116781.e001g" position="anchor" mimetype="image" orientation="portrait"></graphic>
<mml:math id="M1">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>D</mml:mi>
<mml:mi>S</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>X</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>Y</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>Z</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>V</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</alternatives>
<label>Equation 1</label>
</disp-formula>
Where
<italic>DS = Dataset, V = Variable, T = Time, X = Latitude, Y = Longitude, Z = Altitude, and D = Dataset Id</italic>
</p>
<p>In HBase, a straightforward way to store the array-based data is to use
<italic>Dataset Id</italic>
as the row key and
<italic>Dataset</italic>
as the row value. While this works for storing data, parallelization of data processing is problematic because a single dataset may reach gigabytes in size.</p>
<p>Based on the array-based data model, geoscience data is decomposed hierarchically (
<xref rid="pone.0116781.g002" ref-type="fig">Fig. 2</xref>
). Each dataset contains one or multiple timestamps, and at each timestamp there are multiple variables; each variable refers to a 2D or 3D data grid. Treating each data grid as an
<italic>AtomDataset</italic>
, the decomposed data model is expressed as
<xref rid="pone.0116781.e002" ref-type="disp-formula">Equation 2</xref>
<disp-formula id="pone.0116781.e002">
<alternatives>
<graphic xlink:href="pone.0116781.e002.jpg" id="pone.0116781.e002g" position="anchor" mimetype="image" orientation="portrait"></graphic>
<mml:math id="M2">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>V</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>A</mml:mi>
<mml:mi>D</mml:mi>
<mml:mi>S</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>X</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>Y</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>Z</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</alternatives>
<label>Equation 2</label>
</disp-formula>
Where
<italic>ADS = AtomDataset, V = Variable, T = Time, X = Latitude, Y = Longitude, Z = Altitude</italic>
, and
<italic>D = Dataset Id</italic>
.</p>
<fig id="pone.0116781.g002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0116781.g002</object-id>
<label>Fig 2</label>
<caption>
<title>Hierarchical structure of the multi-dimensional geoscience data.</title>
</caption>
<graphic xlink:href="pone.0116781.g002"></graphic>
</fig>
<p>Compared with
<xref rid="pone.0116781.e001" ref-type="disp-formula">Equation 1</xref>
, the decomposed data model moves two dimensions, T and V, from the right to the left side. This triggers two changes: the 5D dataset (X, Y, Z, T, V) is reduced to a 3D
<italic>AtomDataset</italic>
(X, Y, Z), and the single dataset id (D) becomes a composite id (D, T, V). With this decomposition, large volumes of geoscience data are managed in a Bigtable style [
<xref rid="pone.0116781.ref027" ref-type="bibr">27</xref>
], where the (D, T, V) are stored as the composite row key and the
<italic>AtomDataset</italic>
as the row value in HBase.</p>
<p>Besides the scalability and reliability of HBase, this decomposition has three advantages. First, D, T, and V are stored in HBase in series as the composite row key, enabling flexible searches against time, variable and dataset. Once data are loaded into HBase, the
<italic>AtomDataset</italic>
can be queried and accessed through various filters. Second, new data can be seamlessly appended and integrated into the database without breaking the current data structure. And third, parallelization with the MapReduce algorithm is achieved at a finer granularity by decomposing the data from five to three dimensions.</p>
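<p>The decomposition in Equation 2 can be illustrated with a short Python sketch. The composite-key layout and the pipe separator below are assumptions for illustration, not the exact HBase schema used by the framework.</p>

```python
# Illustrative sketch of Equation 2: a 5D dataset DS(X, Y, Z, T, V) is
# split into 3D AtomDatasets keyed by the composite row key (D, T, V).
# The key format "D|T|V" is an assumed layout, not the paper's schema.

def make_row_key(dataset_id, time, variable):
    """Compose a row key from dataset id (D), timestamp (T), variable (V)."""
    return f"{dataset_id}|{time}|{variable}"

def decompose(dataset_id, dataset):
    """Flatten {time: {variable: grid}} into (row_key, AtomDataset) pairs."""
    return {
        make_row_key(dataset_id, t, v): grid
        for t, variables in dataset.items()
        for v, grid in variables.items()
    }

# Toy dataset: 2 timestamps x 2 variables, each a (flattened) spatial grid
ds = {"1950-01": {"tsurf": [271.3, 272.1], "prec": [1.2, 0.8]},
      "1950-02": {"tsurf": [270.9, 271.7], "prec": [1.5, 0.6]}}
rows = decompose("run042", ds)
print(sorted(rows))  # 4 composite keys, e.g. 'run042|1950-01|prec'
```

<p>Because D, T, and V are concatenated in series, an HBase prefix or row filter on any leading portion of the key can select a dataset, a time slice, or a single variable.</p>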
</sec>
<sec id="sec012">
<title>3.2.2. MapReduce-enabled Framework for Processing Big Geoscience Data</title>
<p>Based on the above data decomposition mechanism, we introduce a MapReduce-enabled framework to process big geoscience data. The back end of the framework is a Hadoop cluster deployed in the cloud environment that provides distributed storage and computing power. The framework contains the following components:
<italic>Geo-HBase, Controller</italic>
, and
<italic>Pluggable MR Operator</italic>
(
<xref rid="pone.0116781.g003" ref-type="fig">Fig. 3</xref>
).</p>
<fig id="pone.0116781.g003" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0116781.g003</object-id>
<label>Fig 3</label>
<caption>
<title>MapReduce-based framework for processing big geoscience data.</title>
</caption>
<graphic xlink:href="pone.0116781.g003"></graphic>
</fig>
<list list-type="bullet">
<list-item>
<p>
<italic>Geo-HBase</italic>
stores the decomposed big geoscience data (Section 3.2.1).
<italic>Geo-HBase</italic>
supports flexible queries to the data repository based on dataset id, time, and variable, so a subset of data of interest can be efficiently extracted and processed.</p>
</list-item>
<list-item>
<p>
<italic>Pluggable MR Operator</italic>
is a MapReduce program conducting a processing task against the data stored in HBase (e.g., calculating an annual mean for selected variables, sub-setting the data based on user-specified regions).</p>
</list-item>
<list-item>
<p>
<italic>Controller</italic>
is the user interface allowing users to interact with the framework, such as starting a processing job with specified parameters.</p>
</list-item>
</list>
<p>A typical workflow for processing data with this framework is the following sequence:
<italic>Controller</italic>
sends a processing request with the query parameters (dataset ids, time period, and variables) and a spatial region;
<italic>Geo-HBase</italic>
extracts the required data based on the dataset id (D), time (T), and variable (V); the extracted data are loaded to
<italic>MR Operator</italic>
as a list of key-value pairs; the
<italic>Map</italic>
first conducts spatial (X, Y, Z) sub-setting based on the specified spatial region; the intermediate data emitted from
<italic>Map</italic>
are then sorted and grouped on the composite key (D, T, V) by
<italic>MR Operator</italic>
; and finally the result is written back to HBase.</p>
<p>Scientists can develop different MapReduce algorithms to process the data stored in Geo-HBase as
<italic>Pluggable MR Operators</italic>
. Furthermore, these
<italic>Pluggable MR Operators</italic>
are published as
<italic>Processing Services</italic>
that are used to build the workflow.
<xref rid="pone.0116781.g004" ref-type="fig">Fig. 4</xref>
is an example MapReduce algorithm for calculating the annual global mean of a subset of climate data.</p>
<fig id="pone.0116781.g004" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0116781.g004</object-id>
<label>Fig 4</label>
<caption>
<title>Algorithm for computing annual global mean of multiple datasets based on the framework.</title>
</caption>
<graphic xlink:href="pone.0116781.g004"></graphic>
</fig>
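<p>The map/reduce logic described above can be emulated in plain Python as a conceptual sketch. The Hadoop and HBase plumbing is omitted, and the grouping-key layout is an assumption rather than the exact signature of the Fig. 4 algorithm.</p>

```python
from collections import defaultdict

# Conceptual emulation of the annual-global-mean computation: Map emits
# ((D, year, V), spatially averaged value) per monthly AtomDataset;
# Reduce averages the grouped monthly values into an annual global mean.

def map_phase(records):
    """records: iterable of ((D, 'YYYY-MM', V), grid) key-value pairs."""
    for (d, t, v), grid in records:
        year = t.split("-")[0]
        yield (d, year, v), sum(grid) / len(grid)  # spatial (global) mean

def reduce_phase(mapped):
    groups = defaultdict(list)
    for key, value in mapped:            # shuffle: group by (D, year, V)
        groups[key].append(value)
    return {k: sum(vs) / len(vs) for k, vs in groups.items()}  # annual mean

records = [(("run1", "1950-01", "tsurf"), [270.0, 272.0]),
           (("run1", "1950-02", "tsurf"), [271.0, 275.0])]
print(reduce_phase(map_phase(records)))  # {('run1','1950','tsurf'): 272.0}
```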
</sec>
</sec>
<sec id="sec013">
<title>3.3. Service-oriented Approach</title>
<sec id="sec014">
<title>3.3.1. Service Model</title>
<p>The key to leveraging the service-oriented concept in the workflow architecture is that each step in the workflow is abstracted as a service, and various services are chained to form a workflow. To ensure that different services can be connected in a unified fashion, we abstract each service as a processing unit with two general interfaces: input and output. For input, two types are defined: Input Parameter (IP) and Input Data (ID). Similarly, there are two output types: Output Parameter (OP) and Output Data (OD). The input and output parameters are primitive values (e.g., numbers, short texts), whereas the input and output data refer to data files stored in the shared file system. Based on this, we define a unified service model as
<xref rid="pone.0116781.e003" ref-type="disp-formula">Expression 1</xref>
, where the output data and parameter of one service are used as the input data and parameter by another service, thus enabling the servicing chaining.</p>
<disp-formula id="pone.0116781.e003">
<alternatives>
<graphic xlink:href="pone.0116781.e003.jpg" id="pone.0116781.e003g" position="anchor" mimetype="image" orientation="portrait"></graphic>
<mml:math id="M3">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>v</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>I</mml:mi>
<mml:mi>D</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>I</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mover>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>e</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:mover>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>O</mml:mi>
<mml:mi>D</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>O</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</alternatives>
<label>Expression 1</label>
</disp-formula>
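Expression 1 can be sketched as a minimal interface, assuming hypothetical service classes and HDFS-style paths (none of these names come from the paper's implementation); chaining simply passes one service's (OD, OP) to the next as (ID, IP):

```python
from typing import List

class Service:
    """Unified service model: Service(ID, IP) --execute--> (OD, OP).
    Paths stand in for data files in the shared file system (HDFS)."""
    def execute(self, input_data: List[str], input_params: dict):
        raise NotImplementedError

class SubsetService(Service):              # hypothetical Processing Service
    def execute(self, input_data, input_params):
        out = [p + ".subset" for p in input_data]       # pretend-processing
        return out, {"region": input_params["region"]}

class PlotService(Service):                # hypothetical Processing Service
    def execute(self, input_data, input_params):
        return [p + ".png" for p in input_data], {}

# Chaining: one service's (OD, OP) becomes the next one's (ID, IP).
od, op = SubsetService().execute(["hdfs:///climate/run1.nc"], {"region": "global"})
plots, _ = PlotService().execute(od, op)
```

Because every service exposes the same two-in, two-out signature, the workflow engine can wire arbitrary services together without knowing their internals.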
<p>Unlike traditional scientific workflows, in which each step is normally a computational process, we define four types of services to build a workflow; each is described below.</p>
<list list-type="bullet">
<list-item>
<p>
<bold>Processing Service</bold>
processes, analyzes, or visualizes input data. Three types of programs are published as processing services: MapReduce programs processing the big geoscience data stored in HBase (Section 3.2); Java executable programs conducting general processing tasks; and Shell scripts conducting data preprocessing, statistics, or visualization. For example, a Shell script calling an R script to plot a climate variable is published as a processing service.</p>
</list-item>
<list-item>
<p>
<bold>Data Service</bold>
focuses on fetching data from outside the workflow as service input and publishing outputs as various services for sharing (Section 3.3.2).</p>
</list-item>
<list-item>
<p>
<bold>Model Service</bold>
runs geoscience models (e.g., climate models) with user-specified model input; the modeling environment (software configuration and computing resources) for running the model is automatically provisioned in the cloud [
<xref rid="pone.0116781.ref053" ref-type="bibr">53</xref>
].</p>
</list-item>
<list-item>
<p>
<bold>Infrastructure Service</bold>
provisions virtual machine-based services by leveraging IaaS. Three types are included: provisioning pure computing resources (e.g., a bare-metal virtual machine); provisioning computing platforms (e.g., a Hadoop or MPI-based cluster); and provisioning virtual machines with pre-installed software packages or applications (e.g., a virtual machine with the R environment).</p>
</list-item>
</list>
<p>Following the service model, each service is composed of a service executable program and service definition metadata. The service definition metadata is an XML document describing the service (
<xref rid="pone.0116781.g005" ref-type="fig">Fig. 5</xref>
) and comprises three sections: a service description with the general service information; a service entry point indicating the location of the service executable program; and a service interface detailing the service input and output along with semantic descriptions. To register a service into the workflow framework, the service definition metadata is first interpreted to add the service to the service catalogue, and the service executable program is uploaded to the workflow execution environment.</p>
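Interpreting the metadata at registration time might look like the following sketch. The element names here are hypothetical (the real schema ships as a Supporting Information file); only the three-section structure follows the text:

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata instance; element names are illustrative only.
metadata = """
<service>
  <description><name>ProvisionVM</name><type>Infrastructure</type></description>
  <entryPoint>hdfs:///services/provision_vm.sh</entryPoint>
  <interface>
    <input kind="IP" name="imageId"/>
    <input kind="IP" name="vmType"/>
    <output kind="OP" name="instanceIp"/>
  </interface>
</service>
"""

def register(xml_text, catalogue):
    """Interpret the three metadata sections and add the service to the catalogue."""
    root = ET.fromstring(xml_text)
    name = root.findtext("description/name")
    catalogue[name] = {
        "type": root.findtext("description/type"),
        "entry_point": root.findtext("entryPoint"),
        "inputs": [e.get("name") for e in root.findall("interface/input")],
        "outputs": [e.get("name") for e in root.findall("interface/output")],
    }
    return name

catalogue = {}
register(metadata, catalogue)
```

In the prototype, the catalogue lives in the registry database and the entry-point program is uploaded to the execution environment in the same step.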
<fig id="pone.0116781.g005" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0116781.g005</object-id>
<label>Fig 5</label>
<caption>
<title>Service definition metadata example (Infrastructure Service: ProvisionVM).</title>
<p>Please refer to the Supporting Information file (service definition XML schema.xsd) for the service metadata XML schema.</p>
</caption>
<graphic xlink:href="pone.0116781.g005"></graphic>
</fig>
</sec>
<sec id="sec015">
<title>3.3.2. Loosely-coupled Service I/O Mechanism</title>
<p>The workflow engine is deployed on Hadoop, and the workflow tasks (services) are executed on different machines. Hence, it is important that all services read input and write output data in a shared file system to avoid extra data transfer loads. HDFS is used as such a file system in the framework, providing a unified service execution environment. However, geoscience analytics often requires small to mid-sized data from remote data services (e.g., WFS, WCS, and OPeNDAP) as part of the input, and publishes the output as web services (e.g., WMS, WFS). One solution is for each service to include functions to fetch data from and publish data to remote services. However, at least two problems arise. The first is that data handling is tightly coupled with the processing logic, which makes it difficult for the service to incorporate other types of data services. The second is that each service implements its own data handling function, which cannot be reused. We propose a loosely-coupled service input/output (I/O) mechanism, illustrated in
<xref rid="pone.0116781.g006" ref-type="fig">Fig. 6</xref>
, to address these shortcomings.</p>
<fig id="pone.0116781.g006" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0116781.g006</object-id>
<label>Fig 6</label>
<caption>
<title>Loosely-coupled Service I/O mechanism.</title>
</caption>
<graphic xlink:href="pone.0116781.g006"></graphic>
</fig>
<p>This mechanism extracts the data handling components and publishes them as individual workflow
<italic>Data Services</italic>
, including two categories:
<italic>Fetch Data Services</italic>
and
<italic>Publish Data Services. Fetch Data Service</italic>
fetches data from remote data servers and loads them into HDFS. Other services, such as
<italic>Processing Service</italic>
and
<italic>Model Service</italic>
, can access the data directly from HDFS. For example,
<italic>Fetch WAF</italic>
(Web Access Folder) service downloads data from a WAF and loads them to HDFS;
<italic>Fetch OPeNDAP</italic>
service subsets data from an OPeNDAP server.
<italic>Publish Data Service</italic>
requires a server to host the data. For example, to publish a
<italic>Processing Service’s</italic>
output as WMS, a WMS server (e.g., GeoServer) is required to host the service, and an
<italic>Infrastructure Service</italic>
can be integrated into the workflow to provision a virtual machine with a pre-installed GeoServer.</p>
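The decoupling can be illustrated with a toy stand-in in which a Fetch Data Service writes into the shared file system and a Processing Service reads only by path. The function names and the dict-based file system are illustrative, not the framework's API:

```python
# Data handling decoupled from processing: a Fetch Data Service writes to the
# shared file system, and any Processing Service reads from there by path.
shared_fs = {}          # stand-in for HDFS: path -> bytes

def fetch_waf(url, dest_path):
    """Hypothetical Fetch WAF service: download into the shared file system."""
    shared_fs[dest_path] = f"contents of {url}".encode()   # real code would HTTP GET
    return dest_path

def processing_service(in_path):
    """A Processing Service sees only an HDFS path, not the remote protocol."""
    return len(shared_fs[in_path])

path = fetch_waf("http://example.org/waf/boundary.shp", "/hdfs/in/boundary.shp")
n_bytes = processing_service(path)
```

Supporting a new remote protocol then means adding one fetch/publish service, with no change to any processing logic.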
<p>
<xref rid="pone.0116781.g007" ref-type="fig">Fig. 7</xref>
shows a typical workflow consisting of four different services:
<list list-type="order">
<list-item>
<p>A
<italic>Fetch Data Service</italic>
fetches vector data (U.S. state boundary) from a WFS server as the input of the
<italic>Processing Service</italic>
;</p>
</list-item>
<list-item>
<p>The
<italic>Processing Service</italic>
is a MapReduce program which calculates the monthly mean land surface temperature from the climate data stored in
<italic>Geo-HBase</italic>
using the boundary data as the statistics unit;</p>
</list-item>
<list-item>
<p>Meanwhile, an
<italic>Infrastructure Service</italic>
provisions a virtual machine with pre-installed GeoServer from the cloud platform; and</p>
</list-item>
<list-item>
<p>A
<italic>Publish Data Service</italic>
publishes the output data from the
<italic>Processing Service</italic>
to GeoServer as WMS.</p>
</list-item>
</list>
</p>
<fig id="pone.0116781.g007" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0116781.g007</object-id>
<label>Fig 7</label>
<caption>
<title>A typical workflow with four types of services.</title>
</caption>
<graphic xlink:href="pone.0116781.g007"></graphic>
</fig>
<p>This service I/O mechanism is flexible and extensible in that external services are supported by developing corresponding data services in the workflow platform. Once a data service is registered, it can be used by any other service to fetch input or publish output. This mechanism addresses the challenge of heterogeneous and distributed data associated with each step’s input and output in the workflow.</p>
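The four-service workflow above can be mimicked with a tiny dependency-ordered runner, a stand-in for the Oozie-based engine; the service callables and names are hypothetical:

```python
# A minimal stand-in for the engine: run services in dependency order,
# piping each service's output data (OD) into its successors' input data (ID).
def run_workflow(services, edges):
    """services: name -> callable(list_of_inputs) -> output.
    edges: (upstream, downstream) pairs forming a DAG."""
    upstream = {name: [u for (u, d) in edges if d == name] for name in services}
    done, results = set(), {}
    while len(done) < len(services):
        for name in services:
            if name not in done and all(u in done for u in upstream[name]):
                results[name] = services[name]([results[u] for u in upstream[name]])
                done.add(name)
    return results

# Hypothetical versions of the four services in Fig. 7.
services = {
    "FetchWFS":      lambda ins: "states.shp",            # Fetch Data Service
    "MonthlyMeanMR": lambda ins: f"mean({ins[0]})",       # Processing Service
    "ProvisionVM":   lambda ins: "geoserver-vm",          # Infrastructure Service
    "PublishWMS":    lambda ins: f"wms[{ins[0]}@{ins[1]}]",  # Publish Data Service
}
edges = [("FetchWFS", "MonthlyMeanMR"),
         ("MonthlyMeanMR", "PublishWMS"),
         ("ProvisionVM", "PublishWMS")]
results = run_workflow(services, edges)
```

Note that ProvisionVM has no upstream edge, so the runner executes it independently of the data path, mirroring how the infrastructure step runs alongside the processing step.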
</sec>
</sec>
<sec id="sec016">
<title>3.4. Cloud-based Workflow Execution Environment</title>
<p>Scientific workflows normally require a collection of software components such as tools, libraries, programs, models, and applications, and these components are developed at different times by different people [
<xref rid="pone.0116781.ref016" ref-type="bibr">16</xref>
]. In this case study, the workflow needs to set up and run a climate model, use the NetCDF Operator (NCO) library to preprocess the model output, process the output in parallel with Hadoop MapReduce, and then feed the result to a Java program (or R script) for linear regression analysis and visualization. These heterogeneous software components must be seamlessly integrated into a coherent workflow. To achieve this, a traditional workflow system needs to pre-install the required software components on the physical machine(s), which poses two problems. First, if the execution environment is backed by a cluster, the same software components must be configured on each machine, and any update to the execution environment is time consuming. Second, some software components are complex and require specific execution environments that cannot be installed on the common environment. To address these shortcomings, we deploy the workflow in the cloud environment with two mechanisms.</p>
<p>The first mechanism deploys the whole Workflow Execution Environment (WEE, a Hadoop cluster) in the cloud. The entire WEE, including the Hadoop software, the workflow engine, and the library environment for executing the workflow tasks (e.g., R, NCO, JRE), is “burned” to an image and can be provisioned within minutes. The VMs are provisioned as cluster nodes based on the VM image (a snapshot of a pre-configured operating system used to launch a VM). When an update is required, the VM image is re-built by installing new or removing old software components, and the WEE is re-provisioned quickly from the new VM image. Another advantage is that new computing resources can be easily added to the WEE by provisioning more cluster nodes.</p>
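The image-based lifecycle can be sketched as follows; build_image and provision_wee are illustrative stand-ins for image creation and VM provisioning, not Eucalyptus API calls:

```python
# Sketch of the image-based WEE lifecycle: the environment is "burned" to a VM
# image once, and the cluster is (re)provisioned from that image on demand.
def build_image(software):
    return {"software": frozenset(software)}

def provision_wee(image, n_nodes):
    # Every node gets an identical environment from the same image.
    return [{"node": i, "software": image["software"]} for i in range(n_nodes)]

image = build_image(["hadoop", "oozie", "R", "NCO", "JRE"])
wee = provision_wee(image, 5)

# An update rebuilds the image and re-provisions, rather than patching each node.
image_v2 = build_image(["hadoop", "oozie", "R", "NCO", "JRE", "GDAL"])
wee = provision_wee(image_v2, 6)        # scaling out is just asking for more nodes
```

The key property is that every node derives from the same image, so the cluster can never drift into inconsistent configurations.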
<p>The second mechanism integrates specified software into VM images and publishes these images as
<italic>Infrastructure Services</italic>
. This is more flexible in that the software environment is self-contained and exposed as a standard infrastructure service in the workflow platform. These services can be added and removed without affecting the current WEE. In addition, complex software components (e.g., a climate model, GeoServer) are difficult to integrate into the WEE due to their specific system requirements and high resource consumption, and publishing them as
<italic>Infrastructure Services</italic>
improves the system performance and flexibility. Furthermore, this mechanism provides an alternative for integrating legacy software that requires a specific execution environment into the workflow. Finally, the image-based
<italic>Infrastructure Service</italic>
offers a reproducible environment for certain tasks in the workflow.</p>
</sec>
</sec>
<sec id="sec017">
<title>Prototype and Experiment Result</title>
<p>To verify the performance of the proposed framework, a proof-of-concept prototype is offered, and an experiment is conducted for the aforementioned case study using the prototype.</p>
<sec id="sec018">
<title>4.1. Prototype Based on the Framework</title>
<sec id="sec019">
<title>4.1.1. Cloud Environment Setup</title>
<p>The proposed framework is based on both private and public clouds. A private cloud platform on Eucalyptus (
<ext-link ext-link-type="uri" xlink:href="http://www.eucalyptus.com">http://www.eucalyptus.com</ext-link>
) 4.0 is established in our data center to serve as the cloud environment; this selection is based on our previous study [
<xref rid="pone.0116781.ref054" ref-type="bibr">54</xref>
]. In addition, Eucalyptus has Application Programming Interfaces (APIs) compatible with Amazon’s Elastic Compute Cloud (Amazon EC2,
<ext-link ext-link-type="uri" xlink:href="http://aws.amazon.com/ec2/">http://aws.amazon.com/ec2/</ext-link>
), a widely used public cloud service. The underlying hardware consists of 16 physical machines connected with 1 Gigabit Ethernet (1 Gbps); each has an 8-core CPU running at 2.35 GHz, 16 GB of RAM, and 60 GB of on-board storage. In total, 120 m1.small VMs (1-core CPU running at 1 GHz and 2 GB of RAM) are provisioned in the cloud.</p>
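A quick arithmetic check, using only the figures quoted above, confirms that 120 one-core m1.small VMs fit within the 16-host testbed:

```python
# Sanity check on testbed capacity (figures from the cluster description above).
hosts, cores_per_host, ram_per_host_gb = 16, 8, 16
vms, vm_cores, vm_ram_gb = 120, 1, 2

total_cores = hosts * cores_per_host          # 16 x 8 = 128 cores
total_ram = hosts * ram_per_host_gb           # 16 x 16 = 256 GB
fits = (vms * vm_cores <= total_cores) and (vms * vm_ram_gb <= total_ram)
```

120 cores and 240 GB of VM RAM sit under the 128-core / 256 GB totals, leaving a small margin for the hypervisor and host overhead.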
</sec>
<sec id="sec020">
<title>4.1.2. Prototype Implementation</title>
<p>The prototype implementation architecture (
<xref rid="pone.0116781.g008" ref-type="fig">Fig. 8</xref>
) contains four major components:
<italic>Eucalyptus Cloud, Workflow Execution Environment (WEE), Web-based Workflow Builder</italic>
, and
<italic>Service/Workflow Registry</italic>
.</p>
<fig id="pone.0116781.g008" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0116781.g008</object-id>
<label>Fig 8</label>
<caption>
<title>Prototype implementation architecture.</title>
</caption>
<graphic xlink:href="pone.0116781.g008"></graphic>
</fig>
<p>
<italic>Eucalyptus Cloud</italic>
provides virtualized computing resources. The WEE, built on top of the cloud platform, consists of computing, storage, and processing libraries. The computing is provided by a virtualized Hadoop cluster and coordinated by the workflow engine (powered by Oozie). Storage is provided by HBase and HDFS: HBase stores the decomposed big climate data and key/value-based workflow output, whereas HDFS stores the service executable programs and other workflow output.</p>
<p>
<italic>Service/Workflow Registry</italic>
is the service layer providing a database for managing the registered services and saved workflows. Service definition metadata (XML) and workflow definition files (XML) are stored in the database. During service registration, the service executable program is uploaded to the WEE.</p>
<p>
<italic>Web-based Workflow Builder</italic>
is the graphic interface (
<xref rid="pone.0116781.g009" ref-type="fig">Fig. 9</xref>
) through which users build workflows by visually connecting various services, run them by submitting the request to the WEE with one click, and monitor the workflow execution status in real time. Services and workflows are loaded into the builder from the registry. A workflow can be saved to the server for re-running or downloaded as XML for sharing. The builder is based on the open source workflow-generator tool (
<ext-link ext-link-type="uri" xlink:href="https://github.com/jabirahmed/oozie/tree/master/workflowgenerator">https://github.com/jabirahmed/oozie/tree/master/workflowgenerator</ext-link>
).</p>
<fig id="pone.0116781.g009" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0116781.g009</object-id>
<label>Fig 9</label>
<caption>
<title>GUI of the web-based workflow builder.</title>
</caption>
<graphic xlink:href="pone.0116781.g009"></graphic>
</fig>
</sec>
</sec>
<sec id="sec021">
<title>4.2. Experiment Result</title>
<sec id="sec022">
<title>4.2.1. Executable Workflow for the Study Case</title>
<p>To demonstrate how the proposed workflow framework addresses the challenges posed by the study case (Section 1.2), over ten services are developed following the proposed service model. These are registered to the prototype system to facilitate the six steps of the study case.</p>
<p>For S1, a
<italic>Model Service</italic>
(
<italic>RunModelE</italic>
) is based on our previous study [
<xref rid="pone.0116781.ref053" ref-type="bibr">53</xref>
] to set up and run ModelE automatically. This is also an
<italic>Infrastructure Service</italic>
since it provisions a virtual cluster with a configured modeling environment to run the model. For S2, two
<italic>Processing Services</italic>
are developed:
<italic>AccToNetCDF</italic>
, a script-based service converting model output .acc files to NetCDF format, and
<italic>NetCDFtoHBase</italic>
, which uses the NCO library to decompose (split) the NetCDF files and subsequently uploads them into the database using HBase APIs. For S4, a MapReduce-enabled
<italic>Processing Service</italic>
computes the global monthly mean for all model output. Finally, for S5 and S6, a Java-based
<italic>Processing Service</italic>
conducts linear regression analysis and plots the relationships for the most affected variables. To support input and output for the above services,
<italic>FetchDataHttp</italic>
downloads data from a web accessible folder or simply a URL to the WEE, and
<italic>PublishDataWaf</italic>
publishes the data in the WEE to a web accessible folder.</p>
<p>Once these services are registered, an executable workflow is built by visually dragging and connecting services in the
<italic>Web-based Workflow Builder</italic>
to conduct the experiment (
<xref rid="pone.0116781.g010" ref-type="fig">Fig. 10</xref>
). In this workflow,
<italic>RunModelE</italic>
provisions virtual machines to run the climate model. When the model runs are finished, the outputs are preprocessed and loaded to HBase with
<italic>AccToNetCDF</italic>
and
<italic>NetCDFtoHBase</italic>
. Then the global monthly mean for each output climate variable is calculated in parallel in the WEE with the
<italic>ComputeGlobalMonthlyMean</italic>
service. Next, two services,
<italic>GetGlobalEnsembleMean</italic>
and
<italic>FetchModelParameters</italic>
, are executed in parallel. Once they finish, the
<italic>CorrelationAnalysis</italic>
service calculates linear regression statistics for each Parameter-Variable pair based on the variable ensemble mean values and the model input parameters. Finally, the workflow output (intermediate and final) is published to a web accessible folder (
<xref rid="pone.0116781.g011" ref-type="fig">Fig. 11</xref>
).</p>
<fig id="pone.0116781.g010" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0116781.g010</object-id>
<label>Fig 10</label>
<caption>
<title>Executable workflow for the study case built in the prototype.</title>
</caption>
<graphic xlink:href="pone.0116781.g010"></graphic>
</fig>
<fig id="pone.0116781.g011" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0116781.g011</object-id>
<label>Fig 11</label>
<caption>
<title>Workflow output: workflow output published in a web accessible folder (A), presented as correlation statistics in CSV format (B), and plotted output climate variables highly affected by the seven model input parameters (R
<sup>2</sup>
> 0.6, 9 of 57 pairs) (C).</title>
</caption>
<graphic xlink:href="pone.0116781.g011"></graphic>
</fig>
<p>This workflow transforms a complex geoscience experiment into an intuitive diagram-based workflow. In contrast to a traditional workflow, this workflow addresses the three problems of data intensity, computing intensity, and procedure complexity as presented below:
<list list-type="bullet">
<list-item>
<p>For the computing intensity, the
<italic>RunModelE</italic>
service provisions on demand a cluster of virtual machines with a pre-configured model environment and on-demand parameter configuration to conduct ensemble model runs in parallel. In addition, a Hadoop cluster is provisioned on demand in the workflow (
<xref rid="pone.0116781.g012" ref-type="fig">Fig. 12</xref>
);</p>
</list-item>
<list-item>
<p>For data intensity, the MapReduce-enabled
<italic>ComputeGlobalMonthlyMean</italic>
service conducts parallel processing of large volumes of model output in the cloud-based WEE; and</p>
</list-item>
<list-item>
<p>For the procedure complexity, the service model enables the complex problem to be decoupled into reusable services. Furthermore, the heterogeneous software components (e.g., Hadoop, R, NCO, JRE) are seamlessly integrated in the cloud-based WEE.</p>
</list-item>
</list>
</p>
<fig id="pone.0116781.g012" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0116781.g012</object-id>
<label>Fig 12</label>
<caption>
<title>An executable workflow for provisioning a Hadoop cluster on-demand in the cloud.</title>
<p>Three Infrastructure Services support this provisioning: ProvisionMaster, ProvisionSlave, and ConfigureHadoop.</p>
</caption>
<graphic xlink:href="pone.0116781.g012"></graphic>
</fig>
</sec>
<sec id="sec023">
<title>4.2.2. Performance Evaluation for Big Climate Data Processing</title>
<p>To evaluate the performance of the big geoscience data processing strategy (Section 3.2), we calculated the global monthly mean for 100 model outputs using a 6-node Hadoop cluster (1 master node and 5 slave nodes). Each node is a virtual machine with an 8-core 2.35 GHz CPU and 16 GB of RAM, and the 100 model outputs are preprocessed and loaded into HBase deployed on the Hadoop cluster. Another virtual machine with the same configuration processes the same data with the traditional serial method. Two sets of tests are conducted. The first keeps the number of cluster nodes the same and processes different numbers of model outputs, from 1 to 100 (
<xref rid="pone.0116781.g013" ref-type="fig">Fig. 13A</xref>
). The second keeps the 100 model outputs unchanged but changes the number of cluster nodes from 1 to 5 (
<xref rid="pone.0116781.g013" ref-type="fig">Fig. 13B</xref>
).</p>
<fig id="pone.0116781.g013" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0116781.g013</object-id>
<label>Fig 13</label>
<caption>
<title>Performance evaluation result for the MapReduce-enabled big climate data processing.</title>
</caption>
<graphic xlink:href="pone.0116781.g013"></graphic>
</fig>
<p>For the first set of tests, as the number of model outputs increases beyond 5, the time consumed by the serial method increases dramatically, whereas the time for the MapReduce approach increases only marginally (
<xref rid="pone.0116781.g013" ref-type="fig">Fig. 13A</xref>
). With 100 outputs, the serial process takes > 20 minutes, while the MapReduce approach takes ∼3.5 minutes. It should be noted that if the number of model outputs is < 5, the MapReduce approach takes more time than the serial approach due to the overhead of the Hadoop framework. For the second set of tests, with increasing node number the consumed time decreases significantly (
<xref rid="pone.0116781.g013" ref-type="fig">Fig. 13B</xref>
), which indicates efficient scalability of the proposed big geoscience data processing strategy. Scalability is important in a cloud environment because new nodes can be quickly provisioned and added to the cluster as needed to improve performance.</p>
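From the figures quoted above, the implied speedup at 100 model outputs is at least roughly 5.7x, using 20 minutes as a lower bound for the serial run:

```python
# Back-of-the-envelope speedup from the timings reported above:
# serial > 20 minutes vs MapReduce ~3.5 minutes for 100 model outputs.
serial_min = 20.0            # lower bound; the text reports "> 20 minutes"
mapreduce_min = 3.5
speedup = serial_min / mapreduce_min    # at least ~5.7x on the 6-node cluster
```

Since the serial time is only a lower bound, the true speedup may be somewhat higher.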
</sec>
</sec>
</sec>
<sec id="sec024">
<title>Discussion and Conclusion</title>
<p>This paper proposes a cloud-based, MapReduce-enabled, and service-oriented workflow framework to address the challenges posed by big geoscience data analytics. These challenges are exemplified by a case study of climate model sensitivity diagnostics. Methodologies for designing and implementing the framework are presented. To test the feasibility of the framework, a proof-of-concept workflow platform prototype is offered. A set of services is developed and registered to the prototype system, and an executable workflow is built from these services for the study case. Two sets of tests are conducted to evaluate the performance of the proposed big geoscience data processing strategy.</p>
<p>The workflow and test results show that the proposed framework provides a robust and efficient approach to accelerate geoscience studies. Each proposed methodology addresses one or several aspects of the challenges facing the geosciences community. Specifically,
<xref rid="pone.0116781.t002" ref-type="table">Table 2</xref>
summarizes the proposed methodologies (Section 3) for addressing the corresponding challenges (Section 1).</p>
<table-wrap id="pone.0116781.t002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0116781.t002</object-id>
<label>Table 2</label>
<caption>
<title>Methodologies addressing the challenges.</title>
</caption>
<alternatives>
<graphic id="pone.0116781.t002g" xlink:href="pone.0116781.t002"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
</colgroup>
<thead>
<tr>
<th colspan="2" align="center" rowspan="1">
<bold>Challenges</bold>
</th>
<th align="left" rowspan="1" colspan="1">
<bold>3.1</bold>
</th>
<th align="left" rowspan="1" colspan="1">
<bold>3.2.1</bold>
</th>
<th align="left" rowspan="1" colspan="1">
<bold>3.2.2</bold>
</th>
<th align="left" rowspan="1" colspan="1">
<bold>3.3.1</bold>
</th>
<th align="left" rowspan="1" colspan="1">
<bold>3.3.2</bold>
</th>
<th align="left" rowspan="1" colspan="1">
<bold>3.4</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2" align="left" colspan="1">C1. Data intensity</td>
<td align="left" rowspan="1" colspan="1">Data management</td>
<td align="left" rowspan="1" colspan="1">x</td>
<td align="left" rowspan="1" colspan="1">x</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Distributed data</td>
<td align="left" rowspan="1" colspan="1">x</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">x</td>
<td align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="2" align="left" colspan="1">C2. Computing intensity</td>
<td align="left" rowspan="1" colspan="1">Model simulation</td>
<td align="left" rowspan="1" colspan="1">x</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">x</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">x</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Big data processing</td>
<td align="left" rowspan="1" colspan="1">x</td>
<td align="left" rowspan="1" colspan="1">x</td>
<td align="left" rowspan="1" colspan="1">x</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">x</td>
</tr>
<tr>
<td rowspan="2" align="left" colspan="1">C3. Complex procedure</td>
<td align="left" rowspan="1" colspan="1">Complex steps</td>
<td align="left" rowspan="1" colspan="1">x</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">x</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Heterogeneous tools</td>
<td align="left" rowspan="1" colspan="1">x</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">x</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">x</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<p>By leveraging cloud computing, MapReduce, and SOA, this framework seamlessly integrates the proposed methodologies as a whole to form a scalable, reliable, and interoperable workflow environment. Such a workflow environment enables scientists to achieve four goals: transform complex geoscience experiments into intuitive diagram-based workflows by decoupling the experiment into reusable services; manage big geoscience data in a scalable and reliable distributed environment; process big geoscience data in parallel by adapting MapReduce; and provision computing resources on demand during workflow execution to meet the performance requirements.</p>
<sec id="sec025">
<title>5.1. Key Features of the Workflow Framework</title>
<p>This framework provides three features compared to traditional scientific workflow platforms, as presented below:
<list list-type="bullet">
<list-item>
<p>
<bold>Cloud-based for computing intensity:</bold>
Adequate computing resources are critical since scientific workflows normally contain computing-intensive tasks and require hundreds of steps executed in parallel. This workflow framework provides mechanisms to supply adequate computing resources to the WEE by provisioning more VMs into the WEE and by shifting the computing load to resources independent of the WEE using
<italic>Infrastructure Services</italic>
(e.g., running a computing-intensive model on a virtual machine). In addition, the entire WEE is provisioned from customized VM images, and virtualization enables each node of the WEE to have exactly the same computational environment. Therefore, this framework provides provenance for the WEE at a bitwise level. This cloud-based feature helps address the computing intensity challenge.</p>
</list-item>
<list-item>
<p>
<bold>MapReduce-enabled for data intensity:</bold>
By incorporating the big geoscience data processing strategy (Section 3.2), the proposed framework manages and processes big geoscience data. The data decomposition and storage mechanism enables multi-dimensional geoscience data to be stored effectively in a distributed environment (HBase), while the MapReduce-enabled processing framework enables data to be processed in parallel on the WEE cluster, chained with other tasks in the workflow. This MapReduce-enabled feature helps address the data intensity challenge.</p>
</list-item>
<list-item>
<p>
<bold>Service-oriented for procedure complexity</bold>
: Different steps involved in scientific workflows are published as four types of services: processing, data, model, and infrastructure (Section 3.3). In contrast to traditional scientific workflows, which consider only computational tasks, infrastructure services enable scientists to provision more computing resources on demand during workflow execution. Model services enable scientists to integrate an entire modeling environment into the workflow. By introducing a unified service model, these services are registered to the framework and connected in a unified manner. In addition, the service-oriented mechanism opens the framework, allowing scientists to collaborate by publishing their own services and workflows. Thus, the service-oriented feature helps address the challenge of procedure complexity.</p>
</list-item>
</list>
</p>
</sec>
<sec id="sec026">
<title>5.2. Future Research</title>
<p>As a preliminary study, this framework has limitations. There are at least two major challenges that need to be addressed in the future:
<list list-type="bullet">
<list-item>
<p>The framework for data storage currently uses virtual storage attached to the VMs to form the HDFS. The storage attached to each VM is of two types. The first is virtualized directly from the physical machine on which the VM is hosted, and the stored data are accessible directly by the VM without going through any network. However, such storage is not permanent, and the data are lost with the termination of the VM. The second storage type is virtualized from a storage cluster connected to the cloud platform and persists even when the VM is terminated. However, since the storage is from a storage cluster instead of the VM’s host machine, the VM needs to access the data remotely. Therefore, neither storage type is optimized for the framework. Further study is desired to explore a new storage mechanism to support both local access and persistence.</p>
</list-item>
<list-item>
<p>Only a private cloud is considered in the prototype system. While a private cloud may be sufficient for a research center, spike workloads normally cannot be handled due to its limited resources. To address this problem, a hybrid cloud mechanism is a candidate for the framework, using a fully controlled private cloud as the primary cloud while bursting to a public cloud (e.g., Amazon EC2) for extra computing resources when needed.</p>
</list-item>
</list>
</p>
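The hybrid-cloud idea above can be sketched as a simple allocation policy: run tasks on the private cloud by default, and request public-cloud instances only for the overflow. This is an illustrative sketch of cloud bursting, not the framework's implementation; the function name and slot-based model are hypothetical simplifications.

```python
# Illustrative cloud-bursting policy (hypothetical, not the paper's code):
# fill the private cloud first, then burst the overflow to a public cloud
# such as Amazon EC2, capped by a budgeted number of public instances.
def plan_allocation(pending_tasks: int, private_free_slots: int,
                    max_public_slots: int) -> tuple:
    """Return (private_slots, public_slots) to allocate for this batch."""
    private = min(pending_tasks, private_free_slots)
    overflow = pending_tasks - private
    public = min(overflow, max_public_slots)  # burst only when needed
    return private, public

# Normal load: everything fits in the private cloud, no bursting.
print(plan_allocation(8, 10, 20))   # (8, 0)
# Spike workload: the private cloud saturates and the overflow bursts out.
print(plan_allocation(50, 10, 20))  # (10, 20)
```

A real scheduler would also weigh data-transfer cost and instance start-up latency before bursting, since moving big geoscience data to the public cloud can outweigh the compute gain.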
<p>Data intensity, computing intensity and procedure complexity remain grand challenges in the geosciences even with 21
<sup>st</sup>
century computing technologies. The proposed framework offers a potential solution to these challenges and serves as a path toward a common geospatial cyberinfrastructure platform shared by the geoscience community, relieving scientists from computing issues and facilitating scientific discoveries.</p>
</sec>
</sec>
</body>
<back>
<ack>
<p>Dr. George Taylor helped proofread an earlier version of the manuscript.</p>
</ack>
</back>
</pmc>
</record>
