Fault-Management in P2P-MPI

Internal identifier: 001594 (Istex/Corpus); previous: 001593; next: 001595

Authors: Stéphane Genaud; Emmanuel Jeannot; Choopan Rattanapoka

Source :

International Journal of Parallel Programming [ 0885-7458 ] ; 2009-10-01.

RBID : ISTEX:5E7C8EC4D7C270F8D66020C33884FC34178138D6

English descriptors

KwdEn :
- Fault-tolerance, Grid computing, Middleware, Parallelism.

Abstract

We present a study of fault management in a grid middleware. The middleware is our home-grown software, P2P-MPI. This framework is MPJ compliant, lets users execute message-passing parallel programs, and aims to support environments built from commodity hardware. Program execution in such environments is therefore failure prone, and particular attention must be paid to fault management. Fault management covers two issues: fault-tolerance and fault detection. Fault-tolerance deals with the program execution: P2P-MPI provides a transparent fault-tolerance facility based on replication of computations. Fault detection concerns the monitoring of the program execution by the system. The monitoring is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is the evaluation of the failure probability of an application as a function of the replication degree. The failure probability depends on the execution length, and we propose a model to evaluate the duration of a replicated parallel program. We then give an expression for the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. The criteria of our evaluation are the reliability of the failure detection service and the failure detection speed. We retain the binary round-robin protocol for its failure detection speed, and we propose a variant of this protocol which is more reliable than the application execution in any case. Experiments involving up to 256 processes, carried out on Grid’5000, show that the real detection times closely match the predictions.

URL:

https://api.istex.fr/ark:/67375/VQC-T0H758JH-P/fulltext.pdf

DOI: 10.1007/s10766-009-0115-8

Links to Exploration step

ISTEX:5E7C8EC4D7C270F8D66020C33884FC34178138D6

The document in XML format

<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Fault-Management in P2P-MPI</title>
<author><name sortKey="Genaud, Stephane" sort="Genaud, Stephane" uniqKey="Genaud S" first="Stéphane" last="Genaud">Stéphane Genaud</name>
<affiliation><mods:affiliation>AlGorille Team, LORIA, Campus Scientifique, BP 239, 54506, Vandoeuvre-lès-Nancy, France</mods:affiliation>
</affiliation>
<affiliation><mods:affiliation>E-mail: Stephane.Genaud@loria.fr</mods:affiliation>
</affiliation>
<affiliation><mods:affiliation>E-mail: genaud@icps.u-strasbg.fr</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Jeannot, Emmanuel" sort="Jeannot, Emmanuel" uniqKey="Jeannot E" first="Emmanuel" last="Jeannot">Emmanuel Jeannot</name>
<affiliation><mods:affiliation>AlGorille Team, LORIA, Campus Scientifique, BP 239, 54506, Vandoeuvre-lès-Nancy, France</mods:affiliation>
</affiliation>
<affiliation><mods:affiliation>E-mail: Emmanuel.Jeannot@loria.fr</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Rattanapoka, Choopan" sort="Rattanapoka, Choopan" uniqKey="Rattanapoka C" first="Choopan" last="Rattanapoka">Choopan Rattanapoka</name>
<affiliation><mods:affiliation>Department of Electronics Engineering Technology, College of Industrial Technology, King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand</mods:affiliation>
</affiliation>
<affiliation><mods:affiliation>E-mail: choopanr@kmutnb.ac.th</mods:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:5E7C8EC4D7C270F8D66020C33884FC34178138D6</idno>
<date when="2009" year="2009">2009</date>
<idno type="doi">10.1007/s10766-009-0115-8</idno>
<idno type="url">https://api.istex.fr/ark:/67375/VQC-T0H758JH-P/fulltext.pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001594</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">001594</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Fault-Management in P2P-MPI</title>
<author><name sortKey="Genaud, Stephane" sort="Genaud, Stephane" uniqKey="Genaud S" first="Stéphane" last="Genaud">Stéphane Genaud</name>
<affiliation><mods:affiliation>AlGorille Team, LORIA, Campus Scientifique, BP 239, 54506, Vandoeuvre-lès-Nancy, France</mods:affiliation>
</affiliation>
<affiliation><mods:affiliation>E-mail: Stephane.Genaud@loria.fr</mods:affiliation>
</affiliation>
<affiliation><mods:affiliation>E-mail: genaud@icps.u-strasbg.fr</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Jeannot, Emmanuel" sort="Jeannot, Emmanuel" uniqKey="Jeannot E" first="Emmanuel" last="Jeannot">Emmanuel Jeannot</name>
<affiliation><mods:affiliation>AlGorille Team, LORIA, Campus Scientifique, BP 239, 54506, Vandoeuvre-lès-Nancy, France</mods:affiliation>
</affiliation>
<affiliation><mods:affiliation>E-mail: Emmanuel.Jeannot@loria.fr</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Rattanapoka, Choopan" sort="Rattanapoka, Choopan" uniqKey="Rattanapoka C" first="Choopan" last="Rattanapoka">Choopan Rattanapoka</name>
<affiliation><mods:affiliation>Department of Electronics Engineering Technology, College of Industrial Technology, King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand</mods:affiliation>
</affiliation>
<affiliation><mods:affiliation>E-mail: choopanr@kmutnb.ac.th</mods:affiliation>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="j">International Journal of Parallel Programming</title>
<title level="j" type="abbrev">Int J Parallel Prog</title>
<idno type="ISSN">0885-7458</idno>
<idno type="eISSN">1573-7640</idno>
<imprint><publisher>Springer US; http://www.springer-ny.com</publisher>
<pubPlace>Boston</pubPlace>
<date type="published" when="2009-10-01">2009-10-01</date>
<biblScope unit="volume">37</biblScope>
<biblScope unit="issue">5</biblScope>
<biblScope unit="page" from="433">433</biblScope>
<biblScope unit="page" to="461">461</biblScope>
</imprint>
<idno type="ISSN">0885-7458</idno>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0885-7458</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Fault-tolerance</term>
<term>Grid computing</term>
<term>Middleware</term>
<term>Parallelism</term>
</keywords>
</textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: We present in this paper a study on fault management in a grid middleware. The middleware is our home-grown software called P2P-MPI. This framework is MPJ compliant, allows users to execute message passing parallel programs, and its objective is to support environments using commodity hardware. Hence, running programs is failure prone and a particular attention must be paid to fault management. The fault management covers two issues: fault-tolerance and fault detection. Fault-tolerance deals with the program execution: P2P-MPI provides a transparent fault tolerance facility based on replication of computations. Fault detection concerns the monitoring of the program execution by the system. The monitoring is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is the evaluation of the failure probability of an application depending on the replication degree. The failure probability depends on the execution length, and we propose a model to evaluate the duration of a replicated parallel program. Then, we give an expression of the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. The criteria of our evaluation are the reliability of the failure detection service and the failure detection speed. We retain the binary round-robin protocol for its failure detection speed, and we propose a variant of this protocol which is more reliable than the application execution in any case. Experiments involving of up to 256 processes, carried out on Grid’5000, show that the real detection times closely match the predictions.</div>
</front>
</TEI>
<istex><corpusName>springer-journals</corpusName>
<author><json:item><name>Stéphane Genaud</name>
<affiliations><json:string>AlGorille Team, LORIA, Campus Scientifique, BP 239, 54506, Vandoeuvre-lès-Nancy, France</json:string>
<json:string>E-mail: Stephane.Genaud@loria.fr</json:string>
<json:string>E-mail: genaud@icps.u-strasbg.fr</json:string>
</affiliations>
</json:item>
<json:item><name>Emmanuel Jeannot</name>
<affiliations><json:string>AlGorille Team, LORIA, Campus Scientifique, BP 239, 54506, Vandoeuvre-lès-Nancy, France</json:string>
<json:string>E-mail: Emmanuel.Jeannot@loria.fr</json:string>
</affiliations>
</json:item>
<json:item><name>Choopan Rattanapoka</name>
<affiliations><json:string>Department of Electronics Engineering Technology, College of Industrial Technology, King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand</json:string>
<json:string>E-mail: choopanr@kmutnb.ac.th</json:string>
</affiliations>
</json:item>
</author>
<subject><json:item><lang><json:string>eng</json:string>
</lang>
<value>Grid computing</value>
</json:item>
<json:item><lang><json:string>eng</json:string>
</lang>
<value>Middleware</value>
</json:item>
<json:item><lang><json:string>eng</json:string>
</lang>
<value>Parallelism</value>
</json:item>
<json:item><lang><json:string>eng</json:string>
</lang>
<value>Fault-tolerance</value>
</json:item>
</subject>
<articleId><json:string>115</json:string>
<json:string>s10766-009-0115-8</json:string>
</articleId>
<arkIstex>ark:/67375/VQC-T0H758JH-P</arkIstex>
<language><json:string>eng</json:string>
</language>
<originalGenre><json:string>OriginalPaper</json:string>
</originalGenre>
<abstract>Abstract: We present in this paper a study on fault management in a grid middleware. The middleware is our home-grown software called P2P-MPI. This framework is MPJ compliant, allows users to execute message passing parallel programs, and its objective is to support environments using commodity hardware. Hence, running programs is failure prone and a particular attention must be paid to fault management. The fault management covers two issues: fault-tolerance and fault detection. Fault-tolerance deals with the program execution: P2P-MPI provides a transparent fault tolerance facility based on replication of computations. Fault detection concerns the monitoring of the program execution by the system. The monitoring is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is the evaluation of the failure probability of an application depending on the replication degree. The failure probability depends on the execution length, and we propose a model to evaluate the duration of a replicated parallel program. Then, we give an expression of the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. The criteria of our evaluation are the reliability of the failure detection service and the failure detection speed. We retain the binary round-robin protocol for its failure detection speed, and we propose a variant of this protocol which is more reliable than the application execution in any case. Experiments involving of up to 256 processes, carried out on Grid’5000, show that the real detection times closely match the predictions.</abstract>
<qualityIndicators><score>10</score>
<pdfWordCount>11239</pdfWordCount>
<pdfCharCount>60707</pdfCharCount>
<pdfVersion>1.3</pdfVersion>
<pdfPageCount>29</pdfPageCount>
<pdfPageSize>439.37 x 666.142 pts</pdfPageSize>
<refBibsNative>false</refBibsNative>
<abstractWordCount>273</abstractWordCount>
<abstractCharCount>1801</abstractCharCount>
<keywordCount>4</keywordCount>
</qualityIndicators>
<title>Fault-Management in P2P-MPI</title>
<genre><json:string>research-article</json:string>
</genre>
<host><title>International Journal of Parallel Programming</title>
<language><json:string>unknown</json:string>
</language>
<publicationDate>2009</publicationDate>
<copyrightDate>2009</copyrightDate>
<issn><json:string>0885-7458</json:string>
</issn>
<eissn><json:string>1573-7640</json:string>
</eissn>
<journalId><json:string>10766</json:string>
</journalId>
<volume>37</volume>
<issue>5</issue>
<pages><first>433</first>
<last>461</last>
</pages>
<genre><json:string>journal</json:string>
</genre>
<subject><json:item><value>Software Engineering/Programming and Operating Systems</value>
</json:item>
<json:item><value>Processor Architectures</value>
</json:item>
<json:item><value>Theory of Computation</value>
</json:item>
</subject>
</host>
<ark><json:string>ark:/67375/VQC-T0H758JH-P</json:string>
</ark>
<publicationDate>2009</publicationDate>
<copyrightDate>2009</copyrightDate>
<doi><json:string>10.1007/s10766-009-0115-8</json:string>
</doi>
<id>5E7C8EC4D7C270F8D66020C33884FC34178138D6</id>
<score>1</score>
<fulltext><json:item><extension>pdf</extension>
<original>true</original>
<mimetype>application/pdf</mimetype>
<uri>https://api.istex.fr/ark:/67375/VQC-T0H758JH-P/fulltext.pdf</uri>
</json:item>
<json:item><extension>zip</extension>
<original>false</original>
<mimetype>application/zip</mimetype>
<uri>https://api.istex.fr/ark:/67375/VQC-T0H758JH-P/bundle.zip</uri>
</json:item>
<istex:fulltextTEI uri="https://api.istex.fr/ark:/67375/VQC-T0H758JH-P/fulltext.tei"><teiHeader><fileDesc><titleStmt><title level="a" type="main" xml:lang="en">Fault-Management in P2P-MPI</title>
</titleStmt>
<publicationStmt><authority>ISTEX</authority>
<publisher scheme="https://scientific-publisher.data.istex.fr">Springer US; http://www.springer-ny.com</publisher>
<pubPlace>Boston</pubPlace>
<availability><licence><p>Springer Science+Business Media, LLC, 2009</p>
</licence>
<p scheme="https://loaded-corpus.data.istex.fr/ark:/67375/XBH-3XSW68JL-F">springer</p>
</availability>
<date>2008-05-23</date>
</publicationStmt>
<notesStmt><note type="research-article" scheme="https://content-type.data.istex.fr/ark:/67375/XTP-1JC4F85T-7">research-article</note>
<note type="journal" scheme="https://publication-type.data.istex.fr/ark:/67375/JMC-0GLKJH51-B">journal</note>
</notesStmt>
<sourceDesc><biblStruct type="inbook"><analytic><title level="a" type="main" xml:lang="en">Fault-Management in P2P-MPI</title>
<author xml:id="author-0000" corresp="yes"><persName><forename type="first">Stéphane</forename>
<surname>Genaud</surname>
</persName>
<email>Stephane.Genaud@loria.fr</email>
<email>genaud@icps.u-strasbg.fr</email>
<affiliation>AlGorille Team, LORIA, Campus Scientifique, BP 239, 54506, Vandoeuvre-lès-Nancy, France</affiliation>
</author>
<author xml:id="author-0001"><persName><forename type="first">Emmanuel</forename>
<surname>Jeannot</surname>
</persName>
<email>Emmanuel.Jeannot@loria.fr</email>
<affiliation>AlGorille Team, LORIA, Campus Scientifique, BP 239, 54506, Vandoeuvre-lès-Nancy, France</affiliation>
</author>
<author xml:id="author-0002"><persName><forename type="first">Choopan</forename>
<surname>Rattanapoka</surname>
</persName>
<email>choopanr@kmutnb.ac.th</email>
<affiliation>Department of Electronics Engineering Technology, College of Industrial Technology, King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand</affiliation>
</author>
<idno type="istex">5E7C8EC4D7C270F8D66020C33884FC34178138D6</idno>
<idno type="ark">ark:/67375/VQC-T0H758JH-P</idno>
<idno type="DOI">10.1007/s10766-009-0115-8</idno>
<idno type="article-id">115</idno>
<idno type="article-id">s10766-009-0115-8</idno>
</analytic>
<monogr><title level="j">International Journal of Parallel Programming</title>
<title level="j" type="abbrev">Int J Parallel Prog</title>
<idno type="pISSN">0885-7458</idno>
<idno type="eISSN">1573-7640</idno>
<idno type="journal-ID">true</idno>
<idno type="issue-article-count">4</idno>
<idno type="volume-issue-count">6</idno>
<imprint><publisher>Springer US; http://www.springer-ny.com</publisher>
<pubPlace>Boston</pubPlace>
<date type="published" when="2009-10-01"></date>
<biblScope unit="volume">37</biblScope>
<biblScope unit="issue">5</biblScope>
<biblScope unit="page" from="433">433</biblScope>
<biblScope unit="page" to="461">461</biblScope>
</imprint>
</monogr>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><creation><date>2008-05-23</date>
</creation>
<langUsage><language ident="en">en</language>
</langUsage>
<abstract xml:lang="en"><p>Abstract: We present in this paper a study on fault management in a grid middleware. The middleware is our home-grown software called P2P-MPI. This framework is MPJ compliant, allows users to execute message passing parallel programs, and its objective is to support environments using commodity hardware. Hence, running programs is failure prone and a particular attention must be paid to fault management. The fault management covers two issues: fault-tolerance and fault detection. Fault-tolerance deals with the program execution: P2P-MPI provides a transparent fault tolerance facility based on replication of computations. Fault detection concerns the monitoring of the program execution by the system. The monitoring is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is the evaluation of the failure probability of an application depending on the replication degree. The failure probability depends on the execution length, and we propose a model to evaluate the duration of a replicated parallel program. Then, we give an expression of the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. The criteria of our evaluation are the reliability of the failure detection service and the failure detection speed. We retain the binary round-robin protocol for its failure detection speed, and we propose a variant of this protocol which is more reliable than the application execution in any case. Experiments involving of up to 256 processes, carried out on Grid’5000, show that the real detection times closely match the predictions.</p>
</abstract>
<textClass xml:lang="en"><keywords scheme="keyword"><list><head>Keywords</head>
<item><term>Grid computing</term>
</item>
<item><term>Middleware</term>
</item>
<item><term>Parallelism</term>
</item>
<item><term>Fault-tolerance</term>
</item>
</list>
</keywords>
</textClass>
<textClass><keywords scheme="Journal Subject"><list><head>Computer Science</head>
<item><term>Software Engineering/Programming and Operating Systems</term>
</item>
<item><term>Processor Architectures</term>
</item>
<item><term>Theory of Computation</term>
</item>
</list>
</keywords>
</textClass>
</profileDesc>
<revisionDesc><change when="2008-05-23">Created</change>
<change when="2009-10-01">Published</change>
</revisionDesc>
</teiHeader>
</istex:fulltextTEI>
<json:item><extension>txt</extension>
<original>false</original>
<mimetype>text/plain</mimetype>
<uri>https://api.istex.fr/ark:/67375/VQC-T0H758JH-P/fulltext.txt</uri>
</json:item>
</fulltext>
<metadata><istex:metadataXml wicri:clean="corpus springer-journals not found" wicri:toSee="no header"><istex:xmlDeclaration>version="1.0" encoding="UTF-8"</istex:xmlDeclaration>
<istex:docType PUBLIC="-//Springer-Verlag//DTD A++ V2.4//EN" URI="http://devel.springer.de/A++/V2.4/DTD/A++V2.4.dtd" name="istex:docType"></istex:docType>
<istex:document><Publisher><PublisherInfo><PublisherName>Springer US</PublisherName>
<PublisherLocation>Boston</PublisherLocation>
<PublisherURL>http://www.springer-ny.com</PublisherURL>
</PublisherInfo>
<Journal OutputMedium="All"><JournalInfo JournalProductType="ArchiveJournal" NumberingStyle="ContentOnly"><JournalID>10766</JournalID>
<JournalPrintISSN>0885-7458</JournalPrintISSN>
<JournalElectronicISSN>1573-7640</JournalElectronicISSN>
<JournalTitle>International Journal of Parallel Programming</JournalTitle>
<JournalAbbreviatedTitle>Int J Parallel Prog</JournalAbbreviatedTitle>
<JournalSubjectGroup><JournalSubject Type="Primary">Computer Science</JournalSubject>
<JournalSubject Type="Secondary">Software Engineering/Programming and Operating Systems</JournalSubject>
<JournalSubject Type="Secondary">Processor Architectures</JournalSubject>
<JournalSubject Type="Secondary">Theory of Computation</JournalSubject>
</JournalSubjectGroup>
</JournalInfo>
<Volume OutputMedium="All"><VolumeInfo TocLevels="0" VolumeType="Regular"><VolumeIDStart>37</VolumeIDStart>
<VolumeIDEnd>37</VolumeIDEnd>
<VolumeIssueCount>6</VolumeIssueCount>
</VolumeInfo>
<Issue IssueType="Regular" OutputMedium="All"><IssueInfo IssueType="Regular" TocLevels="0"><IssueIDStart>5</IssueIDStart>
<IssueIDEnd>5</IssueIDEnd>
<IssueArticleCount>4</IssueArticleCount>
<IssueHistory><OnlineDate><Year>2009</Year>
<Month>8</Month>
<Day>18</Day>
</OnlineDate>
<PrintDate><Year>2009</Year>
<Month>8</Month>
<Day>17</Day>
</PrintDate>
<CoverDate><Year>2009</Year>
<Month>10</Month>
</CoverDate>
<PricelistYear>2009</PricelistYear>
</IssueHistory>
<IssueCopyright><CopyrightHolderName>Springer Science+Business Media, LLC</CopyrightHolderName>
<CopyrightYear>2009</CopyrightYear>
</IssueCopyright>
</IssueInfo>
<Article ID="s10766-009-0115-8" OutputMedium="All"><ArticleInfo ArticleType="OriginalPaper" ContainsESM="No" Language="En" NumberingStyle="ContentOnly" TocLevels="0"><ArticleID>115</ArticleID>
<ArticleDOI>10.1007/s10766-009-0115-8</ArticleDOI>
<ArticleSequenceNumber>1</ArticleSequenceNumber>
<ArticleTitle Language="En">Fault-Management in P2P-MPI</ArticleTitle>
<ArticleFirstPage>433</ArticleFirstPage>
<ArticleLastPage>461</ArticleLastPage>
<ArticleHistory><RegistrationDate><Year>2009</Year>
<Month>7</Month>
<Day>21</Day>
</RegistrationDate>
<Received><Year>2008</Year>
<Month>5</Month>
<Day>23</Day>
</Received>
<Accepted><Year>2009</Year>
<Month>7</Month>
<Day>21</Day>
</Accepted>
<OnlineDate><Year>2009</Year>
<Month>8</Month>
<Day>5</Day>
</OnlineDate>
</ArticleHistory>
<ArticleCopyright><CopyrightHolderName>Springer Science+Business Media, LLC</CopyrightHolderName>
<CopyrightYear>2009</CopyrightYear>
</ArticleCopyright>
<ArticleGrants Type="Regular"><MetadataGrant Grant="OpenAccess"></MetadataGrant>
<AbstractGrant Grant="OpenAccess"></AbstractGrant>
<BodyPDFGrant Grant="Restricted"></BodyPDFGrant>
<BodyHTMLGrant Grant="Restricted"></BodyHTMLGrant>
<BibliographyGrant Grant="Restricted"></BibliographyGrant>
<ESMGrant Grant="Restricted"></ESMGrant>
</ArticleGrants>
</ArticleInfo>
<ArticleHeader><AuthorGroup><Author AffiliationIDS="Aff1" CorrespondingAffiliationID="Aff1"><AuthorName DisplayOrder="Western"><GivenName>Stéphane</GivenName>
<FamilyName>Genaud</FamilyName>
</AuthorName>
<Contact><Email>Stephane.Genaud@loria.fr</Email>
<Email>genaud@icps.u-strasbg.fr</Email>
</Contact>
</Author>
<Author AffiliationIDS="Aff1"><AuthorName DisplayOrder="Western"><GivenName>Emmanuel</GivenName>
<FamilyName>Jeannot</FamilyName>
</AuthorName>
<Contact><Email>Emmanuel.Jeannot@loria.fr</Email>
</Contact>
</Author>
<Author AffiliationIDS="Aff2"><AuthorName DisplayOrder="Western"><GivenName>Choopan</GivenName>
<FamilyName>Rattanapoka</FamilyName>
</AuthorName>
<Contact><Email>choopanr@kmutnb.ac.th</Email>
</Contact>
</Author>
<Affiliation ID="Aff1"><OrgName>AlGorille Team, LORIA</OrgName>
<OrgAddress><Street>Campus Scientifique</Street>
<Postbox>BP 239</Postbox>
<Postcode>54506</Postcode>
<City>Vandoeuvre-lès-Nancy</City>
<Country Code="FR">France</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff2"><OrgDivision>Department of Electronics Engineering Technology, College of Industrial Technology</OrgDivision>
<OrgName>King Mongkut’s University of Technology North Bangkok</OrgName>
<OrgAddress><City>Bangkok</City>
<Country Code="TH">Thailand</Country>
</OrgAddress>
</Affiliation>
</AuthorGroup>
<Abstract ID="Abs1" Language="En"><Heading>Abstract</Heading>
<Para>We present in this paper a study on fault management in a grid middleware. The middleware is our home-grown software called P2P-MPI. This framework is MPJ compliant, allows users to execute message passing parallel programs, and its objective is to support environments using commodity hardware. Hence, running programs is failure prone and a particular attention must be paid to fault management. The fault management covers two issues: fault-tolerance and fault detection. Fault-tolerance deals with the program execution: P2P-MPI provides a transparent fault tolerance facility based on replication of computations. Fault detection concerns the monitoring of the program execution by the system. The monitoring is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is the evaluation of the failure probability of an application depending on the replication degree. The failure probability depends on the execution length, and we propose a model to evaluate the duration of a replicated parallel program. Then, we give an expression of the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. The criteria of our evaluation are the reliability of the failure detection service and the failure detection speed. We retain the <Emphasis Type="Italic">binary round-robin protocol</Emphasis>
 for its failure detection speed, and we propose a variant of this protocol which is more reliable than the application execution in any case. Experiments involving of up to 256 processes, carried out on Grid’5000, show that the real detection times closely match the predictions.</Para>
</Abstract>
<KeywordGroup Language="En"><Heading>Keywords</Heading>
<Keyword>Grid computing</Keyword>
<Keyword>Middleware</Keyword>
<Keyword>Parallelism</Keyword>
<Keyword>Fault-tolerance</Keyword>
</KeywordGroup>
</ArticleHeader>
<NoBody></NoBody>
</Article>
</Issue>
</Volume>
</Journal>
</Publisher>
</istex:document>
</istex:metadataXml>
<mods version="3.6"><titleInfo lang="en"><title>Fault-Management in P2P-MPI</title>
</titleInfo>
<titleInfo type="alternative" contentType="CDATA"><title>Fault-Management in P2P-MPI</title>
</titleInfo>
<name type="personal" displayLabel="corresp"><namePart type="given">Stéphane</namePart>
<namePart type="family">Genaud</namePart>
<affiliation>AlGorille Team, LORIA, Campus Scientifique, BP 239, 54506, Vandoeuvre-lès-Nancy, France</affiliation>
<affiliation>E-mail: Stephane.Genaud@loria.fr</affiliation>
<affiliation>E-mail: genaud@icps.u-strasbg.fr</affiliation>
<role><roleTerm type="text">author</roleTerm>
</role>
</name>
<name type="personal"><namePart type="given">Emmanuel</namePart>
<namePart type="family">Jeannot</namePart>
<affiliation>AlGorille Team, LORIA, Campus Scientifique, BP 239, 54506, Vandoeuvre-lès-Nancy, France</affiliation>
<affiliation>E-mail: Emmanuel.Jeannot@loria.fr</affiliation>
<role><roleTerm type="text">author</roleTerm>
</role>
</name>
<name type="personal"><namePart type="given">Choopan</namePart>
<namePart type="family">Rattanapoka</namePart>
<affiliation>Department of Electronics Engineering Technology, College of Industrial Technology, King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand</affiliation>
<affiliation>E-mail: choopanr@kmutnb.ac.th</affiliation>
<role><roleTerm type="text">author</roleTerm>
</role>
</name>
<typeOfResource>text</typeOfResource>
<genre type="research-article" displayLabel="OriginalPaper" authority="ISTEX" authorityURI="https://content-type.data.istex.fr" valueURI="https://content-type.data.istex.fr/ark:/67375/XTP-1JC4F85T-7">research-article</genre>
<originInfo><publisher>Springer US; http://www.springer-ny.com</publisher>
<place><placeTerm type="text">Boston</placeTerm>
</place>
<dateCreated encoding="w3cdtf">2008-05-23</dateCreated>
<dateIssued encoding="w3cdtf">2009-10-01</dateIssued>
<copyrightDate encoding="w3cdtf">2009</copyrightDate>
</originInfo>
<language><languageTerm type="code" authority="rfc3066">en</languageTerm>
<languageTerm type="code" authority="iso639-2b">eng</languageTerm>
</language>
<abstract lang="en">Abstract: We present in this paper a study on fault management in a grid middleware. The middleware is our home-grown software called P2P-MPI. This framework is MPJ compliant, allows users to execute message passing parallel programs, and its objective is to support environments using commodity hardware. Hence, running programs is failure prone and a particular attention must be paid to fault management. The fault management covers two issues: fault-tolerance and fault detection. Fault-tolerance deals with the program execution: P2P-MPI provides a transparent fault tolerance facility based on replication of computations. Fault detection concerns the monitoring of the program execution by the system. The monitoring is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is the evaluation of the failure probability of an application depending on the replication degree. The failure probability depends on the execution length, and we propose a model to evaluate the duration of a replicated parallel program. Then, we give an expression of the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. The criteria of our evaluation are the reliability of the failure detection service and the failure detection speed. We retain the binary round-robin protocol for its failure detection speed, and we propose a variant of this protocol which is more reliable than the application execution in any case. Experiments involving of up to 256 processes, carried out on Grid’5000, show that the real detection times closely match the predictions.</abstract>
<subject lang="en"><genre>Keywords</genre>
<topic>Grid computing</topic>
<topic>Middleware</topic>
<topic>Parallelism</topic>
<topic>Fault-tolerance</topic>
</subject>
<relatedItem type="host"><titleInfo><title>International Journal of Parallel Programming</title>
</titleInfo>
<titleInfo type="abbreviated"><title>Int J Parallel Prog</title>
</titleInfo>
<genre type="journal" authority="ISTEX" authorityURI="https://publication-type.data.istex.fr" valueURI="https://publication-type.data.istex.fr/ark:/67375/JMC-0GLKJH51-B">journal</genre>
<originInfo><publisher>Springer</publisher>
<dateIssued encoding="w3cdtf">2009-08-18</dateIssued>
<copyrightDate encoding="w3cdtf">2009</copyrightDate>
</originInfo>
<subject><genre>Computer Science</genre>
<topic>Software Engineering/Programming and Operating Systems</topic>
<topic>Processor Architectures</topic>
<topic>Theory of Computation</topic>
</subject>
<identifier type="ISSN">0885-7458</identifier>
<identifier type="eISSN">1573-7640</identifier>
<identifier type="JournalID">10766</identifier>
<identifier type="IssueArticleCount">4</identifier>
<identifier type="VolumeIssueCount">6</identifier>
<part><date>2009</date>
<detail type="volume"><number>37</number>
<caption>vol.</caption>
</detail>
<detail type="issue"><number>5</number>
<caption>no.</caption>
</detail>
<extent unit="pages"><start>433</start>
<end>461</end>
</extent>
</part>
<recordInfo><recordOrigin>Springer Science+Business Media, LLC, 2009</recordOrigin>
</recordInfo>
</relatedItem>
<identifier type="istex">5E7C8EC4D7C270F8D66020C33884FC34178138D6</identifier>
<identifier type="ark">ark:/67375/VQC-T0H758JH-P</identifier>
<identifier type="DOI">10.1007/s10766-009-0115-8</identifier>
<identifier type="ArticleID">115</identifier>
<identifier type="ArticleID">s10766-009-0115-8</identifier>
<accessCondition type="use and reproduction" contentType="copyright">Springer Science+Business Media, LLC, 2009</accessCondition>
<recordInfo><recordContentSource authority="ISTEX" authorityURI="https://loaded-corpus.data.istex.fr" valueURI="https://loaded-corpus.data.istex.fr/ark:/67375/XBH-3XSW68JL-F">springer</recordContentSource>
<recordOrigin>Springer Science+Business Media, LLC, 2009</recordOrigin>
</recordInfo>
</mods>
<json:item><extension>json</extension>
<original>false</original>
<mimetype>application/json</mimetype>
<uri>https://api.istex.fr/ark:/67375/VQC-T0H758JH-P/record.json</uri>
</json:item>
</metadata>
<serie></serie>
</istex>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Istex/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001594 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Istex/Corpus/biblio.hfd -nk 001594 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Lorraine
   |area=    InforLorV4
   |flux=    Istex
   |étape=   Corpus
   |type=    RBID
   |clé=     ISTEX:5E7C8EC4D7C270F8D66020C33884FC34178138D6
   |texte=   Fault-Management in P2P-MPI
}}

This area was generated with Dilib version V0.6.33.
Data generation: Mon Jun 10 21:56:28 2019. Site generation: Fri Feb 25 15:29:27 2022

	Exploration server for computer science research in Lorraine
	Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.
