Exploration server on computer science research in Lorraine

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Fault-Management in P2P-MPI

Internal identifier: 003A35 (Main/Merge)

Authors: Stéphane Genaud [France]; Emmanuel Jeannot [France]; Choopan Rattanapoka [Thailand]

RBID : ISTEX:5E7C8EC4D7C270F8D66020C33884FC34178138D6

English descriptors: Fault-tolerance; Grid computing; Middleware; Parallelism

Abstract

We present in this paper a study on fault management in a grid middleware. The middleware is our home-grown software called P2P-MPI. This framework is MPJ-compliant, allows users to execute message-passing parallel programs, and its objective is to support environments using commodity hardware. Hence, running programs is failure-prone and particular attention must be paid to fault management. The fault management covers two issues: fault-tolerance and fault detection. Fault-tolerance deals with the program execution: P2P-MPI provides a transparent fault tolerance facility based on replication of computations. Fault detection concerns the monitoring of the program execution by the system. The monitoring is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is the evaluation of the failure probability of an application depending on the replication degree. The failure probability depends on the execution length, and we propose a model to evaluate the duration of a replicated parallel program. Then, we give an expression of the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. The criteria of our evaluation are the reliability of the failure detection service and the failure detection speed. We retain the binary round-robin protocol for its failure detection speed, and we propose a variant of this protocol which is more reliable than the application execution in any case. Experiments involving up to 256 processes, carried out on Grid’5000, show that the real detection times closely match the predictions.
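
The abstract links the replication degree to the failure probability of a run, and it relies on binary round-robin failure detection. The sketch below illustrates both ideas under simplifying assumptions that are ours, not the paper's. First, hosts are assumed to fail independently with an exponential lifetime of rate lam. Second, the execution length t is held fixed regardless of the replication degree r, whereas the paper explicitly models how replication lengthens a run. Third, binary round-robin is taken in its commonly described form, where at step k node i sends everything it knows to node (i + 2^k) mod n. All function names are hypothetical.

```python
import math


def app_failure_probability(n, r, lam, t):
    """Probability that a replicated run of n logical processes fails.

    Hypothetical model: each replica runs on its own host, hosts fail
    independently with exponential lifetimes of rate `lam`, and the run
    lasts `t` time units whatever `r` is. A logical process is lost only
    if all r of its replicas fail; the application fails if any process is lost.
    """
    p_host = 1.0 - math.exp(-lam * t)    # one host fails before the run ends
    p_process = p_host ** r              # all r replicas of one process fail
    return 1.0 - (1.0 - p_process) ** n  # at least one of the n processes is lost


def min_replication_degree(n, lam, t, threshold, r_max=16):
    """Smallest r in 1..r_max keeping the failure probability <= threshold."""
    for r in range(1, r_max + 1):
        if app_failure_probability(n, r, lam, t) <= threshold:
            return r
    return None  # threshold not reachable within r_max


def brr_steps_to_disseminate(n):
    """Synchronous binary round-robin steps until every node has heard,
    directly or transitively, from every other node.

    At step k, node i forwards its knowledge to node (i + 2**(k mod m)) mod n,
    with m = ceil(log2 n), so the shifts cycle through 1, 2, 4, ...
    """
    period = max(1, math.ceil(math.log2(n)))
    knows = [{i} for i in range(n)]  # each node initially knows only itself
    step = 0
    while any(len(k) < n for k in knows):
        shift = 2 ** (step % period)
        updated = [set(k) for k in knows]
        for i in range(n):
            # All messages of one step are computed from the previous state.
            updated[(i + shift) % n] |= knows[i]
        knows = updated
        step += 1
    return step


if __name__ == "__main__":
    # Hypothetical figures: 64 processes, lam * t = 0.1 per host.
    print(min_replication_degree(64, 1e-5, 1e4, 0.01))  # 4
    print(brr_steps_to_disseminate(256))                 # 8
```

With these made-up figures (64 processes, lam * t = 0.1), a replication degree of 4 is the smallest that keeps the failure probability under 1%. The second call shows that 256 nodes need 8 binary round-robin steps, that is ceil(log2 n), before every node has heard from every other.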

DOI: 10.1007/s10766-009-0115-8

The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Fault-Management in P2P-MPI</title>
<author>
<name sortKey="Genaud, Stephane" sort="Genaud, Stephane" uniqKey="Genaud S" first="Stéphane" last="Genaud">Stéphane Genaud</name>
</author>
<author>
<name sortKey="Jeannot, Emmanuel" sort="Jeannot, Emmanuel" uniqKey="Jeannot E" first="Emmanuel" last="Jeannot">Emmanuel Jeannot</name>
</author>
<author>
<name sortKey="Rattanapoka, Choopan" sort="Rattanapoka, Choopan" uniqKey="Rattanapoka C" first="Choopan" last="Rattanapoka">Choopan Rattanapoka</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:5E7C8EC4D7C270F8D66020C33884FC34178138D6</idno>
<date when="2009" year="2009">2009</date>
<idno type="doi">10.1007/s10766-009-0115-8</idno>
<idno type="url">https://api.istex.fr/ark:/67375/VQC-T0H758JH-P/fulltext.pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001594</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">001594</idno>
<idno type="wicri:Area/Istex/Curation">001575</idno>
<idno type="wicri:Area/Istex/Checkpoint">000A64</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Checkpoint">000A64</idno>
<idno type="wicri:doubleKey">0885-7458:2009:Genaud S:fault:management:in</idno>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:inria-00425516</idno>
<idno type="url">https://hal.inria.fr/inria-00425516</idno>
<idno type="wicri:Area/Hal/Corpus">006C19</idno>
<idno type="wicri:Area/Hal/Curation">006C19</idno>
<idno type="wicri:Area/Hal/Checkpoint">002A06</idno>
<idno type="wicri:explorRef" wicri:stream="Hal" wicri:step="Checkpoint">002A06</idno>
<idno type="wicri:doubleKey">0885-7458:2009:Genaud S:fault:management:in</idno>
<idno type="wicri:Area/Main/Merge">003A35</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">Fault-Management in P2P-MPI</title>
<author>
<name sortKey="Genaud, Stephane" sort="Genaud, Stephane" uniqKey="Genaud S" first="Stéphane" last="Genaud">Stéphane Genaud</name>
<affiliation wicri:level="3">
<country xml:lang="fr">France</country>
<wicri:regionArea>AlGorille Team, LORIA, Campus Scientifique, BP 239, 54506, Vandoeuvre-lès-Nancy</wicri:regionArea>
<placeName>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">France</country>
</affiliation>
</author>
<author>
<name sortKey="Jeannot, Emmanuel" sort="Jeannot, Emmanuel" uniqKey="Jeannot E" first="Emmanuel" last="Jeannot">Emmanuel Jeannot</name>
<affiliation wicri:level="3">
<country xml:lang="fr">France</country>
<wicri:regionArea>AlGorille Team, LORIA, Campus Scientifique, BP 239, 54506, Vandoeuvre-lès-Nancy</wicri:regionArea>
<placeName>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">France</country>
</affiliation>
</author>
<author>
<name sortKey="Rattanapoka, Choopan" sort="Rattanapoka, Choopan" uniqKey="Rattanapoka C" first="Choopan" last="Rattanapoka">Choopan Rattanapoka</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Thaïlande</country>
<wicri:regionArea>Department of Electronics Engineering Technology, College of Industrial Technology, King Mongkut’s University of Technology North Bangkok, Bangkok</wicri:regionArea>
<wicri:noRegion>Bangkok</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Thaïlande</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">International Journal of Parallel Programming</title>
<title level="j" type="abbrev">Int J Parallel Prog</title>
<idno type="ISSN">0885-7458</idno>
<idno type="eISSN">1573-7640</idno>
<imprint>
<publisher>Springer US; http://www.springer-ny.com</publisher>
<pubPlace>Boston</pubPlace>
<date type="published" when="2009-10-01">2009-10-01</date>
<biblScope unit="volume">37</biblScope>
<biblScope unit="issue">5</biblScope>
<biblScope unit="page" from="433">433</biblScope>
<biblScope unit="page" to="461">461</biblScope>
</imprint>
<idno type="ISSN">0885-7458</idno>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0885-7458</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Fault-tolerance</term>
<term>Grid computing</term>
<term>Middleware</term>
<term>Parallelism</term>
</keywords>
<keywords scheme="mix" xml:lang="en">
<term>Fault-tolerance</term>
<term>Grid computing</term>
<term>Middleware</term>
<term>Parallelism</term>
</keywords>
</textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: We present in this paper a study on fault management in a grid middleware. The middleware is our home-grown software called P2P-MPI. This framework is MPJ compliant, allows users to execute message passing parallel programs, and its objective is to support environments using commodity hardware. Hence, running programs is failure prone and particular attention must be paid to fault management. The fault management covers two issues: fault-tolerance and fault detection. Fault-tolerance deals with the program execution: P2P-MPI provides a transparent fault tolerance facility based on replication of computations. Fault detection concerns the monitoring of the program execution by the system. The monitoring is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is the evaluation of the failure probability of an application depending on the replication degree. The failure probability depends on the execution length, and we propose a model to evaluate the duration of a replicated parallel program. Then, we give an expression of the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. The criteria of our evaluation are the reliability of the failure detection service and the failure detection speed. We retain the binary round-robin protocol for its failure detection speed, and we propose a variant of this protocol which is more reliable than the application execution in any case. Experiments involving up to 256 processes, carried out on Grid’5000, show that the real detection times closely match the predictions.</div>
</front>
</TEI>
<double doi="10.1007/s10766-009-0115-8">
<HAL>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Fault management in P2P-MPI</title>
<author>
<name sortKey="Genaud, Stephane" sort="Genaud, Stephane" uniqKey="Genaud S" first="Stéphane" last="Genaud">Stéphane Genaud</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-2346" status="OLD">
<idno type="RNSR">200718299P</idno>
<orgName>Algorithms for the Grid</orgName>
<orgName type="acronym">ALGORILLE</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-160" type="direct"></relation>
<relation name="UMR7503" active="#struct-441569" type="indirect"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-300291" type="indirect"></relation>
<relation active="#struct-300292" type="indirect"></relation>
<relation active="#struct-300293" type="indirect"></relation>
<relation active="#struct-2496" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-160" type="direct">
<org type="laboratory" xml:id="struct-160" status="OLD">
<orgName>Laboratoire Lorrain de Recherche en Informatique et ses Applications</orgName>
<orgName type="acronym">LORIA</orgName>
<desc>
<address>
<addrLine>Campus Scientifique BP 239 54506 Vandoeuvre-lès-Nancy Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.loria.fr</ref>
</desc>
<listRelation>
<relation name="UMR7503" active="#struct-441569" type="direct"></relation>
<relation active="#struct-300009" type="direct"></relation>
<relation active="#struct-300291" type="direct"></relation>
<relation active="#struct-300292" type="direct"></relation>
<relation active="#struct-300293" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR7503" active="#struct-441569" type="indirect">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="ISNI">0000000122597504</idno>
<idno type="IdRef">02636817X</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect">
<org type="institution" xml:id="struct-300009" status="VALID">
<orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc>
<address>
<addrLine>Domaine de Voluceau, Rocquencourt - BP 105, 78153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300291" type="indirect">
<org type="institution" xml:id="struct-300291" status="OLD">
<orgName>Université Henri Poincaré - Nancy 1</orgName>
<orgName type="acronym">UHP</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<addrLine>24-30 rue Lionnois, BP 60120, 54 003 NANCY cedex, France</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300292" type="indirect">
<org type="institution" xml:id="struct-300292" status="OLD">
<orgName>Université Nancy 2</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<addrLine>91 avenue de la Libération, BP 454, 54001 Nancy cedex</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300293" type="indirect">
<org type="institution" xml:id="struct-300293" status="OLD">
<orgName>Institut National Polytechnique de Lorraine</orgName>
<orgName type="acronym">INPL</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-2496" type="direct">
<org type="laboratory" xml:id="struct-2496" status="OLD">
<orgName>INRIA Lorraine</orgName>
<desc>
<address>
<addrLine>615 rue du Jardin Botanique 54600 Villers-lès-Nancy</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre-de-recherche-inria/nancy-grand-est</ref>
</desc>
<listRelation>
<relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="university">Université Nancy 2</orgName>
<orgName type="institution" wicri:auto="newGroup">Université de Lorraine</orgName>
<placeName>
<settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="university">Institut national polytechnique de Lorraine</orgName>
<orgName type="institution" wicri:auto="newGroup">Université de Lorraine</orgName>
</affiliation>
</author>
<author>
<name sortKey="Jeannot, Emmanuel" sort="Jeannot, Emmanuel" uniqKey="Jeannot E" first="Emmanuel" last="Jeannot">Emmanuel Jeannot</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-2346" status="OLD">
<idno type="RNSR">200718299P</idno>
<orgName>Algorithms for the Grid</orgName>
<orgName type="acronym">ALGORILLE</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-160" type="direct"></relation>
<relation name="UMR7503" active="#struct-441569" type="indirect"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-300291" type="indirect"></relation>
<relation active="#struct-300292" type="indirect"></relation>
<relation active="#struct-300293" type="indirect"></relation>
<relation active="#struct-2496" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-160" type="direct">
<org type="laboratory" xml:id="struct-160" status="OLD">
<orgName>Laboratoire Lorrain de Recherche en Informatique et ses Applications</orgName>
<orgName type="acronym">LORIA</orgName>
<desc>
<address>
<addrLine>Campus Scientifique BP 239 54506 Vandoeuvre-lès-Nancy Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.loria.fr</ref>
</desc>
<listRelation>
<relation name="UMR7503" active="#struct-441569" type="direct"></relation>
<relation active="#struct-300009" type="direct"></relation>
<relation active="#struct-300291" type="direct"></relation>
<relation active="#struct-300292" type="direct"></relation>
<relation active="#struct-300293" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR7503" active="#struct-441569" type="indirect">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="ISNI">0000000122597504</idno>
<idno type="IdRef">02636817X</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect">
<org type="institution" xml:id="struct-300009" status="VALID">
<orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc>
<address>
<addrLine>Domaine de Voluceau, Rocquencourt - BP 105, 78153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300291" type="indirect">
<org type="institution" xml:id="struct-300291" status="OLD">
<orgName>Université Henri Poincaré - Nancy 1</orgName>
<orgName type="acronym">UHP</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<addrLine>24-30 rue Lionnois, BP 60120, 54 003 NANCY cedex, France</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300292" type="indirect">
<org type="institution" xml:id="struct-300292" status="OLD">
<orgName>Université Nancy 2</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<addrLine>91 avenue de la Libération, BP 454, 54001 Nancy cedex</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300293" type="indirect">
<org type="institution" xml:id="struct-300293" status="OLD">
<orgName>Institut National Polytechnique de Lorraine</orgName>
<orgName type="acronym">INPL</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-2496" type="direct">
<org type="laboratory" xml:id="struct-2496" status="OLD">
<orgName>INRIA Lorraine</orgName>
<desc>
<address>
<addrLine>615 rue du Jardin Botanique 54600 Villers-lès-Nancy</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre-de-recherche-inria/nancy-grand-est</ref>
</desc>
<listRelation>
<relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="university">Université Nancy 2</orgName>
<orgName type="institution" wicri:auto="newGroup">Université de Lorraine</orgName>
<placeName>
<settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="university">Institut national polytechnique de Lorraine</orgName>
<orgName type="institution" wicri:auto="newGroup">Université de Lorraine</orgName>
</affiliation>
</author>
<author>
<name sortKey="Rattanapoka, Choopan" sort="Rattanapoka, Choopan" uniqKey="Rattanapoka C" first="Choopan" last="Rattanapoka">Choopan Rattanapoka</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-93785" status="OLD">
<orgName>Laboratoire de Sciences de l'Image, de l'Informatique et de la Télédétection, équipe ICPS</orgName>
<orgName type="acronym">LSIIT / ICPS</orgName>
<desc>
<address>
<addrLine>Bd S. Brant 67412 Illkirch</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://icps.u-strasbg.fr</ref>
</desc>
<listRelation>
<relation active="#struct-302445" type="direct"></relation>
<relation name="UMR7005" active="#struct-441569" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-302445" type="direct">
<org type="institution" xml:id="struct-302445" status="VALID">
<orgName>Université de Strasbourg</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="UMR7005" active="#struct-441569" type="direct">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="ISNI">0000000122597504</idno>
<idno type="IdRef">02636817X</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Strasbourg</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Alsace (région administrative)</region>
</placeName>
<orgName type="university">Université de Strasbourg</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:inria-00425516</idno>
<idno type="halId">inria-00425516</idno>
<idno type="halUri">https://hal.inria.fr/inria-00425516</idno>
<idno type="url">https://hal.inria.fr/inria-00425516</idno>
<idno type="doi">10.1007/s10766-009-0115-8</idno>
<date when="2009-10">2009-10</date>
<idno type="wicri:Area/Hal/Corpus">006C19</idno>
<idno type="wicri:Area/Hal/Curation">006C19</idno>
<idno type="wicri:Area/Hal/Checkpoint">002A06</idno>
<idno type="wicri:explorRef" wicri:stream="Hal" wicri:step="Checkpoint">002A06</idno>
<idno type="wicri:doubleKey">0885-7458:2009:Genaud S:fault:management:in</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Fault management in P2P-MPI</title>
<author>
<name sortKey="Genaud, Stephane" sort="Genaud, Stephane" uniqKey="Genaud S" first="Stéphane" last="Genaud">Stéphane Genaud</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-2346" status="OLD">
<idno type="RNSR">200718299P</idno>
<orgName>Algorithms for the Grid</orgName>
<orgName type="acronym">ALGORILLE</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-160" type="direct"></relation>
<relation name="UMR7503" active="#struct-441569" type="indirect"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-300291" type="indirect"></relation>
<relation active="#struct-300292" type="indirect"></relation>
<relation active="#struct-300293" type="indirect"></relation>
<relation active="#struct-2496" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-160" type="direct">
<org type="laboratory" xml:id="struct-160" status="OLD">
<orgName>Laboratoire Lorrain de Recherche en Informatique et ses Applications</orgName>
<orgName type="acronym">LORIA</orgName>
<desc>
<address>
<addrLine>Campus Scientifique BP 239 54506 Vandoeuvre-lès-Nancy Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.loria.fr</ref>
</desc>
<listRelation>
<relation name="UMR7503" active="#struct-441569" type="direct"></relation>
<relation active="#struct-300009" type="direct"></relation>
<relation active="#struct-300291" type="direct"></relation>
<relation active="#struct-300292" type="direct"></relation>
<relation active="#struct-300293" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR7503" active="#struct-441569" type="indirect">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="ISNI">0000000122597504</idno>
<idno type="IdRef">02636817X</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect">
<org type="institution" xml:id="struct-300009" status="VALID">
<orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc>
<address>
<addrLine>Domaine de Voluceau, Rocquencourt - BP 105, 78153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300291" type="indirect">
<org type="institution" xml:id="struct-300291" status="OLD">
<orgName>Université Henri Poincaré - Nancy 1</orgName>
<orgName type="acronym">UHP</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<addrLine>24-30 rue Lionnois, BP 60120, 54 003 NANCY cedex, France</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300292" type="indirect">
<org type="institution" xml:id="struct-300292" status="OLD">
<orgName>Université Nancy 2</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<addrLine>91 avenue de la Libération, BP 454, 54001 Nancy cedex</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300293" type="indirect">
<org type="institution" xml:id="struct-300293" status="OLD">
<orgName>Institut National Polytechnique de Lorraine</orgName>
<orgName type="acronym">INPL</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-2496" type="direct">
<org type="laboratory" xml:id="struct-2496" status="OLD">
<orgName>INRIA Lorraine</orgName>
<desc>
<address>
<addrLine>615 rue du Jardin Botanique 54600 Villers-lès-Nancy</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre-de-recherche-inria/nancy-grand-est</ref>
</desc>
<listRelation>
<relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="university">Université Nancy 2</orgName>
<orgName type="institution" wicri:auto="newGroup">Université de Lorraine</orgName>
<placeName>
<settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="university">Institut national polytechnique de Lorraine</orgName>
<orgName type="institution" wicri:auto="newGroup">Université de Lorraine</orgName>
</affiliation>
</author>
<author>
<name sortKey="Jeannot, Emmanuel" sort="Jeannot, Emmanuel" uniqKey="Jeannot E" first="Emmanuel" last="Jeannot">Emmanuel Jeannot</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-2346" status="OLD">
<idno type="RNSR">200718299P</idno>
<orgName>Algorithms for the Grid</orgName>
<orgName type="acronym">ALGORILLE</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-160" type="direct"></relation>
<relation name="UMR7503" active="#struct-441569" type="indirect"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-300291" type="indirect"></relation>
<relation active="#struct-300292" type="indirect"></relation>
<relation active="#struct-300293" type="indirect"></relation>
<relation active="#struct-2496" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-160" type="direct">
<org type="laboratory" xml:id="struct-160" status="OLD">
<orgName>Laboratoire Lorrain de Recherche en Informatique et ses Applications</orgName>
<orgName type="acronym">LORIA</orgName>
<desc>
<address>
<addrLine>Campus Scientifique BP 239 54506 Vandoeuvre-lès-Nancy Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.loria.fr</ref>
</desc>
<listRelation>
<relation name="UMR7503" active="#struct-441569" type="direct"></relation>
<relation active="#struct-300009" type="direct"></relation>
<relation active="#struct-300291" type="direct"></relation>
<relation active="#struct-300292" type="direct"></relation>
<relation active="#struct-300293" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR7503" active="#struct-441569" type="indirect">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="ISNI">0000000122597504</idno>
<idno type="IdRef">02636817X</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect">
<org type="institution" xml:id="struct-300009" status="VALID">
<orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc>
<address>
<addrLine>Domaine de Voluceau, Rocquencourt - BP 105, 78153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300291" type="indirect">
<org type="institution" xml:id="struct-300291" status="OLD">
<orgName>Université Henri Poincaré - Nancy 1</orgName>
<orgName type="acronym">UHP</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<addrLine>24-30 rue Lionnois, BP 60120, 54 003 NANCY cedex, France</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300292" type="indirect">
<org type="institution" xml:id="struct-300292" status="OLD">
<orgName>Université Nancy 2</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<addrLine>91 avenue de la Libération, BP 454, 54001 Nancy cedex</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300293" type="indirect">
<org type="institution" xml:id="struct-300293" status="OLD">
<orgName>Institut National Polytechnique de Lorraine</orgName>
<orgName type="acronym">INPL</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-2496" type="direct">
<org type="laboratory" xml:id="struct-2496" status="OLD">
<orgName>INRIA Lorraine</orgName>
<desc>
<address>
<addrLine>615 rue du Jardin Botanique 54600 Villers-lès-Nancy</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre-de-recherche-inria/nancy-grand-est</ref>
</desc>
<listRelation>
<relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="university">Université Nancy 2</orgName>
<orgName type="institution" wicri:auto="newGroup">Université de Lorraine</orgName>
<placeName>
<settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="university">Institut national polytechnique de Lorraine</orgName>
<orgName type="institution" wicri:auto="newGroup">Université de Lorraine</orgName>
</affiliation>
</author>
<author>
<name sortKey="Rattanapoka, Choopan" sort="Rattanapoka, Choopan" uniqKey="Rattanapoka C" first="Choopan" last="Rattanapoka">Choopan Rattanapoka</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-93785" status="OLD">
<orgName>Laboratoire de Sciences de l'Image, de l'Informatique et de la Télédétection, équipe ICPS</orgName>
<orgName type="acronym">LSIIT / ICPS</orgName>
<desc>
<address>
<addrLine>Bd S. Brant 67412 Illkirch</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://icps.u-strasbg.fr</ref>
</desc>
<listRelation>
<relation active="#struct-302445" type="direct"></relation>
<relation name="UMR7005" active="#struct-441569" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-302445" type="direct">
<org type="institution" xml:id="struct-302445" status="VALID">
<orgName>Université de Strasbourg</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="UMR7005" active="#struct-441569" type="direct">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="ISNI">0000000122597504</idno>
<idno type="IdRef">02636817X</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Strasbourg</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Alsace (région administrative)</region>
</placeName>
<orgName type="university">Université de Strasbourg</orgName>
</affiliation>
</author>
</analytic>
<idno type="DOI">10.1007/s10766-009-0115-8</idno>
<series>
<title level="j">International Journal of Parallel Programming</title>
<idno type="ISSN">0885-7458</idno>
<imprint>
<date type="datePub">2009-10</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="mix" xml:lang="en">
<term>Fault-tolerance</term>
<term>Grid computing</term>
<term>Middleware</term>
<term>Parallelism</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">We present in this paper a study on fault management in a grid middleware. The middleware is our home-grown software called P2P-MPI. This framework is MPJ-compliant, allows users to execute message passing parallel programs, and its objective is to support environments using commodity hardware. Hence, running programs is failure-prone and particular attention must be paid to fault management. The fault management covers two issues: fault-tolerance and fault detection. Fault-tolerance deals with the program execution: P2P-MPI provides a transparent fault-tolerance facility based on replication of computations. Fault detection concerns the monitoring of the program execution by the system. The monitoring is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is the evaluation of the failure probability of an application depending on the replication degree. The failure probability depends on the execution length, and we propose a model to evaluate the duration of a replicated parallel program. Then, we give an expression of the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. The criteria of our evaluation are the reliability of the failure detection service and the failure detection speed. We retain the binary round-robin protocol for its failure detection speed, and we propose a variant of this protocol which is more reliable than the application execution in any case. Experiments involving up to 256 processes, carried out on Grid'5000, show that the real detection times closely match the predictions.</div>
</front>
</TEI>
</HAL>
<ISTEX>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Fault-Management in P2P-MPI</title>
<author>
<name sortKey="Genaud, Stephane" sort="Genaud, Stephane" uniqKey="Genaud S" first="Stéphane" last="Genaud">Stéphane Genaud</name>
</author>
<author>
<name sortKey="Jeannot, Emmanuel" sort="Jeannot, Emmanuel" uniqKey="Jeannot E" first="Emmanuel" last="Jeannot">Emmanuel Jeannot</name>
</author>
<author>
<name sortKey="Rattanapoka, Choopan" sort="Rattanapoka, Choopan" uniqKey="Rattanapoka C" first="Choopan" last="Rattanapoka">Choopan Rattanapoka</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:5E7C8EC4D7C270F8D66020C33884FC34178138D6</idno>
<date when="2009" year="2009">2009</date>
<idno type="doi">10.1007/s10766-009-0115-8</idno>
<idno type="url">https://api.istex.fr/ark:/67375/VQC-T0H758JH-P/fulltext.pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001594</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">001594</idno>
<idno type="wicri:Area/Istex/Curation">001575</idno>
<idno type="wicri:Area/Istex/Checkpoint">000A64</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Checkpoint">000A64</idno>
<idno type="wicri:doubleKey">0885-7458:2009:Genaud S:fault:management:in</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">Fault-Management in P2P-MPI</title>
<author>
<name sortKey="Genaud, Stephane" sort="Genaud, Stephane" uniqKey="Genaud S" first="Stéphane" last="Genaud">Stéphane Genaud</name>
<affiliation wicri:level="3">
<country xml:lang="fr">France</country>
<wicri:regionArea>AlGorille Team, LORIA, Campus Scientifique, BP 239, 54506, Vandoeuvre-lès-Nancy</wicri:regionArea>
<placeName>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">France</country>
</affiliation>
</author>
<author>
<name sortKey="Jeannot, Emmanuel" sort="Jeannot, Emmanuel" uniqKey="Jeannot E" first="Emmanuel" last="Jeannot">Emmanuel Jeannot</name>
<affiliation wicri:level="3">
<country xml:lang="fr">France</country>
<wicri:regionArea>AlGorille Team, LORIA, Campus Scientifique, BP 239, 54506, Vandoeuvre-lès-Nancy</wicri:regionArea>
<placeName>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">France</country>
</affiliation>
</author>
<author>
<name sortKey="Rattanapoka, Choopan" sort="Rattanapoka, Choopan" uniqKey="Rattanapoka C" first="Choopan" last="Rattanapoka">Choopan Rattanapoka</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Thaïlande</country>
<wicri:regionArea>Department of Electronics Engineering Technology, College of Industrial Technology, King Mongkut’s University of Technology North Bangkok, Bangkok</wicri:regionArea>
<wicri:noRegion>Bangkok</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Thaïlande</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">International Journal of Parallel Programming</title>
<title level="j" type="abbrev">Int J Parallel Prog</title>
<idno type="ISSN">0885-7458</idno>
<idno type="eISSN">1573-7640</idno>
<imprint>
<publisher>Springer US; http://www.springer-ny.com</publisher>
<pubPlace>Boston</pubPlace>
<date type="published" when="2009-10-01">2009-10-01</date>
<biblScope unit="volume">37</biblScope>
<biblScope unit="issue">5</biblScope>
<biblScope unit="page" from="433">433</biblScope>
<biblScope unit="page" to="461">461</biblScope>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0885-7458</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Fault-tolerance</term>
<term>Grid computing</term>
<term>Middleware</term>
<term>Parallelism</term>
</keywords>
</textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">We present in this paper a study on fault management in a grid middleware. The middleware is our home-grown software called P2P-MPI. This framework is MPJ-compliant, allows users to execute message passing parallel programs, and its objective is to support environments using commodity hardware. Hence, running programs is failure-prone and particular attention must be paid to fault management. The fault management covers two issues: fault-tolerance and fault detection. Fault-tolerance deals with the program execution: P2P-MPI provides a transparent fault-tolerance facility based on replication of computations. Fault detection concerns the monitoring of the program execution by the system. The monitoring is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is the evaluation of the failure probability of an application depending on the replication degree. The failure probability depends on the execution length, and we propose a model to evaluate the duration of a replicated parallel program. Then, we give an expression of the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. The criteria of our evaluation are the reliability of the failure detection service and the failure detection speed. We retain the binary round-robin protocol for its failure detection speed, and we propose a variant of this protocol which is more reliable than the application execution in any case. Experiments involving up to 256 processes, carried out on Grid’5000, show that the real detection times closely match the predictions.</div>
</front>
</TEI>
</ISTEX>
</double>
</record>

To handle this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 003A35 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 003A35 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Lorraine
   |area=    InforLorV4
   |flux=    Main
   |étape=   Merge
   |type=    RBID
   |clé=     ISTEX:5E7C8EC4D7C270F8D66020C33884FC34178138D6
   |texte=   Fault-Management in P2P-MPI
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Jun 10 21:56:28 2019. Site generation: Fri Feb 25 15:29:27 2022