Fault management in P2P-MPI
Identifieur interne : 002A06 ( Hal/Checkpoint ); précédent : 002A05; suivant : 002A07Fault management in P2P-MPI
Auteurs : Stéphane Genaud [France] ; Emmanuel Jeannot [France] ; Choopan Rattanapoka [France]Source :
- International Journal of Parallel Programming [ 0885-7458 ] ; 2009-10.
English descriptors
Abstract
We present in this paper a study on fault management in a grid middleware. The middleware is our home-grown software called P2P-MPI. This framework is MPJ compliant, allows users to execute message passing parallel programs, and its objective is to support environments using commodity hardware. Hence, running programs is failure prone and a particular attention must be paid to fault management. The fault management covers two issues: fault-tolerance and fault detection. Fault-tolerance deals with the program execution: P2P-MPI provides a transparent fault tolerance facility based on replication of computations. Fault detection concerns the monitoring of the program execution by the system. The monitoring is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is the evaluation of the failure probability of an application depending on the replication degree. The failure probability depends on the execution length, and we propose a model to evaluate the duration of a replicated parallel program. Then, we give an expression of the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. The criteria of our evaluation are the reliability of the failure detection service and the failure detection speed. We retain the binary round-robin protocol for its failure detection speed, and we propose a variant of this protocol which is more reliable than the application execution in any case. Experiments involving of up to 256 processes, carried out on Grid'5000, show that the real detection times closely match the predictions.
Url:
DOI: 10.1007/s10766-009-0115-8
Links toward previous steps (curation, corpus...)
Links to Exploration step
Hal:inria-00425516Le document en format XML
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="nl">Fault management in P2P-MPI</title>
<author><name sortKey="Genaud, Stephane" sort="Genaud, Stephane" uniqKey="Genaud S" first="Stéphane" last="Genaud">Stéphane Genaud</name>
<affiliation wicri:level="1"><hal:affiliation type="researchteam" xml:id="struct-2346" status="OLD"><idno type="RNSR">200718299P</idno>
<orgName>Algorithms for the Grid</orgName>
<orgName type="acronym">ALGORILLE</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
<listRelation><relation active="#struct-160" type="direct"></relation>
<relation name="UMR7503" active="#struct-441569" type="indirect"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-300291" type="indirect"></relation>
<relation active="#struct-300292" type="indirect"></relation>
<relation active="#struct-300293" type="indirect"></relation>
<relation active="#struct-2496" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-160" type="direct"><org type="laboratory" xml:id="struct-160" status="OLD"><orgName>Laboratoire Lorrain de Recherche en Informatique et ses Applications</orgName>
<orgName type="acronym">LORIA</orgName>
<desc><address><addrLine>Campus Scientifique BP 239 54506 Vandoeuvre-lès-Nancy Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.loria.fr</ref>
</desc>
<listRelation><relation name="UMR7503" active="#struct-441569" type="direct"></relation>
<relation active="#struct-300009" type="direct"></relation>
<relation active="#struct-300291" type="direct"></relation>
<relation active="#struct-300292" type="direct"></relation>
<relation active="#struct-300293" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR7503" active="#struct-441569" type="indirect"><org type="institution" xml:id="struct-441569" status="VALID"><idno type="ISNI">0000000122597504</idno>
<idno type="IdRef">02636817X</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc><address><country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect"><org type="institution" xml:id="struct-300009" status="VALID"><orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc><address><addrLine>Domaine de VoluceauRocquencourt - BP 10578153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300291" type="indirect"><org type="institution" xml:id="struct-300291" status="OLD"><orgName>Université Henri Poincaré - Nancy 1</orgName>
<orgName type="acronym">UHP</orgName>
<date type="end">2011-12-31</date>
<desc><address><addrLine>24-30 rue Lionnois, BP 60120, 54 003 NANCY cedex, France</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300292" type="indirect"><org type="institution" xml:id="struct-300292" status="OLD"><orgName>Université Nancy 2</orgName>
<date type="end">2011-12-31</date>
<desc><address><addrLine>91 avenue de la Libération, BP 454, 54001 Nancy cedex</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300293" type="indirect"><org type="institution" xml:id="struct-300293" status="OLD"><orgName>Institut National Polytechnique de Lorraine</orgName>
<orgName type="acronym">INPL</orgName>
<date type="end">2011-12-31</date>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-2496" type="direct"><org type="laboratory" xml:id="struct-2496" status="OLD"><orgName>INRIA Lorraine</orgName>
<desc><address><addrLine>615 rue du Jardin Botanique 54600 Villers-lès-Nancy</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre-de-recherche-inria/nancy-grand-est</ref>
</desc>
<listRelation><relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName><settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="university">Université Nancy 2</orgName>
<orgName type="institution" wicri:auto="newGroup">Université de Lorraine</orgName>
<placeName><settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="university">Institut national polytechnique de Lorraine</orgName>
<orgName type="institution" wicri:auto="newGroup">Université de Lorraine</orgName>
</affiliation>
</author>
<author><name sortKey="Jeannot, Emmanuel" sort="Jeannot, Emmanuel" uniqKey="Jeannot E" first="Emmanuel" last="Jeannot">Emmanuel Jeannot</name>
<affiliation wicri:level="1"><hal:affiliation type="researchteam" xml:id="struct-2346" status="OLD"><idno type="RNSR">200718299P</idno>
<orgName>Algorithms for the Grid</orgName>
<orgName type="acronym">ALGORILLE</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
<listRelation><relation active="#struct-160" type="direct"></relation>
<relation name="UMR7503" active="#struct-441569" type="indirect"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-300291" type="indirect"></relation>
<relation active="#struct-300292" type="indirect"></relation>
<relation active="#struct-300293" type="indirect"></relation>
<relation active="#struct-2496" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-160" type="direct"><org type="laboratory" xml:id="struct-160" status="OLD"><orgName>Laboratoire Lorrain de Recherche en Informatique et ses Applications</orgName>
<orgName type="acronym">LORIA</orgName>
<desc><address><addrLine>Campus Scientifique BP 239 54506 Vandoeuvre-lès-Nancy Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.loria.fr</ref>
</desc>
<listRelation><relation name="UMR7503" active="#struct-441569" type="direct"></relation>
<relation active="#struct-300009" type="direct"></relation>
<relation active="#struct-300291" type="direct"></relation>
<relation active="#struct-300292" type="direct"></relation>
<relation active="#struct-300293" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR7503" active="#struct-441569" type="indirect"><org type="institution" xml:id="struct-441569" status="VALID"><idno type="ISNI">0000000122597504</idno>
<idno type="IdRef">02636817X</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc><address><country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect"><org type="institution" xml:id="struct-300009" status="VALID"><orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc><address><addrLine>Domaine de VoluceauRocquencourt - BP 10578153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300291" type="indirect"><org type="institution" xml:id="struct-300291" status="OLD"><orgName>Université Henri Poincaré - Nancy 1</orgName>
<orgName type="acronym">UHP</orgName>
<date type="end">2011-12-31</date>
<desc><address><addrLine>24-30 rue Lionnois, BP 60120, 54 003 NANCY cedex, France</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300292" type="indirect"><org type="institution" xml:id="struct-300292" status="OLD"><orgName>Université Nancy 2</orgName>
<date type="end">2011-12-31</date>
<desc><address><addrLine>91 avenue de la Libération, BP 454, 54001 Nancy cedex</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300293" type="indirect"><org type="institution" xml:id="struct-300293" status="OLD"><orgName>Institut National Polytechnique de Lorraine</orgName>
<orgName type="acronym">INPL</orgName>
<date type="end">2011-12-31</date>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-2496" type="direct"><org type="laboratory" xml:id="struct-2496" status="OLD"><orgName>INRIA Lorraine</orgName>
<desc><address><addrLine>615 rue du Jardin Botanique 54600 Villers-lès-Nancy</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre-de-recherche-inria/nancy-grand-est</ref>
</desc>
<listRelation><relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName><settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="university">Université Nancy 2</orgName>
<orgName type="institution" wicri:auto="newGroup">Université de Lorraine</orgName>
<placeName><settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="university">Institut national polytechnique de Lorraine</orgName>
<orgName type="institution" wicri:auto="newGroup">Université de Lorraine</orgName>
</affiliation>
</author>
<author><name sortKey="Rattanapoka, Choopan" sort="Rattanapoka, Choopan" uniqKey="Rattanapoka C" first="Choopan" last="Rattanapoka">Choopan Rattanapoka</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-93785" status="OLD"><orgName>Laboratoire de Sciences de l'Image, de l'Informatique et de la Télédétection, équipe ICPS</orgName>
<orgName type="acronym">LSIIT / ICPS</orgName>
<desc><address><addrLine>Bd S. Brant 67412 Illkirch</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://icps.u-strasbg.fr</ref>
</desc>
<listRelation><relation active="#struct-302445" type="direct"></relation>
<relation name="UMR7005" active="#struct-441569" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-302445" type="direct"><org type="institution" xml:id="struct-302445" status="VALID"><orgName>Université de Strasbourg</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="UMR7005" active="#struct-441569" type="direct"><org type="institution" xml:id="struct-441569" status="VALID"><idno type="ISNI">0000000122597504</idno>
<idno type="IdRef">02636817X</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc><address><country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName><settlement type="city">Strasbourg</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Alsace (région administrative)</region>
</placeName>
<orgName type="university">Université de Strasbourg</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:inria-00425516</idno>
<idno type="halId">inria-00425516</idno>
<idno type="halUri">https://hal.inria.fr/inria-00425516</idno>
<idno type="url">https://hal.inria.fr/inria-00425516</idno>
<idno type="doi">10.1007/s10766-009-0115-8</idno>
<date when="2009-10">2009-10</date>
<idno type="wicri:Area/Hal/Corpus">006C19</idno>
<idno type="wicri:Area/Hal/Curation">006C19</idno>
<idno type="wicri:Area/Hal/Checkpoint">002A06</idno>
<idno type="wicri:explorRef" wicri:stream="Hal" wicri:step="Checkpoint">002A06</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="nl">Fault management in P2P-MPI</title>
<author><name sortKey="Genaud, Stephane" sort="Genaud, Stephane" uniqKey="Genaud S" first="Stéphane" last="Genaud">Stéphane Genaud</name>
<affiliation wicri:level="1"><hal:affiliation type="researchteam" xml:id="struct-2346" status="OLD"><idno type="RNSR">200718299P</idno>
<orgName>Algorithms for the Grid</orgName>
<orgName type="acronym">ALGORILLE</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
<listRelation><relation active="#struct-160" type="direct"></relation>
<relation name="UMR7503" active="#struct-441569" type="indirect"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-300291" type="indirect"></relation>
<relation active="#struct-300292" type="indirect"></relation>
<relation active="#struct-300293" type="indirect"></relation>
<relation active="#struct-2496" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-160" type="direct"><org type="laboratory" xml:id="struct-160" status="OLD"><orgName>Laboratoire Lorrain de Recherche en Informatique et ses Applications</orgName>
<orgName type="acronym">LORIA</orgName>
<desc><address><addrLine>Campus Scientifique BP 239 54506 Vandoeuvre-lès-Nancy Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.loria.fr</ref>
</desc>
<listRelation><relation name="UMR7503" active="#struct-441569" type="direct"></relation>
<relation active="#struct-300009" type="direct"></relation>
<relation active="#struct-300291" type="direct"></relation>
<relation active="#struct-300292" type="direct"></relation>
<relation active="#struct-300293" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR7503" active="#struct-441569" type="indirect"><org type="institution" xml:id="struct-441569" status="VALID"><idno type="ISNI">0000000122597504</idno>
<idno type="IdRef">02636817X</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc><address><country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect"><org type="institution" xml:id="struct-300009" status="VALID"><orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc><address><addrLine>Domaine de VoluceauRocquencourt - BP 10578153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300291" type="indirect"><org type="institution" xml:id="struct-300291" status="OLD"><orgName>Université Henri Poincaré - Nancy 1</orgName>
<orgName type="acronym">UHP</orgName>
<date type="end">2011-12-31</date>
<desc><address><addrLine>24-30 rue Lionnois, BP 60120, 54 003 NANCY cedex, France</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300292" type="indirect"><org type="institution" xml:id="struct-300292" status="OLD"><orgName>Université Nancy 2</orgName>
<date type="end">2011-12-31</date>
<desc><address><addrLine>91 avenue de la Libération, BP 454, 54001 Nancy cedex</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300293" type="indirect"><org type="institution" xml:id="struct-300293" status="OLD"><orgName>Institut National Polytechnique de Lorraine</orgName>
<orgName type="acronym">INPL</orgName>
<date type="end">2011-12-31</date>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-2496" type="direct"><org type="laboratory" xml:id="struct-2496" status="OLD"><orgName>INRIA Lorraine</orgName>
<desc><address><addrLine>615 rue du Jardin Botanique 54600 Villers-lès-Nancy</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre-de-recherche-inria/nancy-grand-est</ref>
</desc>
<listRelation><relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName><settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="university">Université Nancy 2</orgName>
<orgName type="institution" wicri:auto="newGroup">Université de Lorraine</orgName>
<placeName><settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="university">Institut national polytechnique de Lorraine</orgName>
<orgName type="institution" wicri:auto="newGroup">Université de Lorraine</orgName>
</affiliation>
</author>
<author><name sortKey="Jeannot, Emmanuel" sort="Jeannot, Emmanuel" uniqKey="Jeannot E" first="Emmanuel" last="Jeannot">Emmanuel Jeannot</name>
<affiliation wicri:level="1"><hal:affiliation type="researchteam" xml:id="struct-2346" status="OLD"><idno type="RNSR">200718299P</idno>
<orgName>Algorithms for the Grid</orgName>
<orgName type="acronym">ALGORILLE</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
<listRelation><relation active="#struct-160" type="direct"></relation>
<relation name="UMR7503" active="#struct-441569" type="indirect"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-300291" type="indirect"></relation>
<relation active="#struct-300292" type="indirect"></relation>
<relation active="#struct-300293" type="indirect"></relation>
<relation active="#struct-2496" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-160" type="direct"><org type="laboratory" xml:id="struct-160" status="OLD"><orgName>Laboratoire Lorrain de Recherche en Informatique et ses Applications</orgName>
<orgName type="acronym">LORIA</orgName>
<desc><address><addrLine>Campus Scientifique BP 239 54506 Vandoeuvre-lès-Nancy Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.loria.fr</ref>
</desc>
<listRelation><relation name="UMR7503" active="#struct-441569" type="direct"></relation>
<relation active="#struct-300009" type="direct"></relation>
<relation active="#struct-300291" type="direct"></relation>
<relation active="#struct-300292" type="direct"></relation>
<relation active="#struct-300293" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR7503" active="#struct-441569" type="indirect"><org type="institution" xml:id="struct-441569" status="VALID"><idno type="ISNI">0000000122597504</idno>
<idno type="IdRef">02636817X</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc><address><country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect"><org type="institution" xml:id="struct-300009" status="VALID"><orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc><address><addrLine>Domaine de VoluceauRocquencourt - BP 10578153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300291" type="indirect"><org type="institution" xml:id="struct-300291" status="OLD"><orgName>Université Henri Poincaré - Nancy 1</orgName>
<orgName type="acronym">UHP</orgName>
<date type="end">2011-12-31</date>
<desc><address><addrLine>24-30 rue Lionnois, BP 60120, 54 003 NANCY cedex, France</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300292" type="indirect"><org type="institution" xml:id="struct-300292" status="OLD"><orgName>Université Nancy 2</orgName>
<date type="end">2011-12-31</date>
<desc><address><addrLine>91 avenue de la Libération, BP 454, 54001 Nancy cedex</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300293" type="indirect"><org type="institution" xml:id="struct-300293" status="OLD"><orgName>Institut National Polytechnique de Lorraine</orgName>
<orgName type="acronym">INPL</orgName>
<date type="end">2011-12-31</date>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-2496" type="direct"><org type="laboratory" xml:id="struct-2496" status="OLD"><orgName>INRIA Lorraine</orgName>
<desc><address><addrLine>615 rue du Jardin Botanique 54600 Villers-lès-Nancy</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre-de-recherche-inria/nancy-grand-est</ref>
</desc>
<listRelation><relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName><settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="university">Université Nancy 2</orgName>
<orgName type="institution" wicri:auto="newGroup">Université de Lorraine</orgName>
<placeName><settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="university">Institut national polytechnique de Lorraine</orgName>
<orgName type="institution" wicri:auto="newGroup">Université de Lorraine</orgName>
</affiliation>
</author>
<author><name sortKey="Rattanapoka, Choopan" sort="Rattanapoka, Choopan" uniqKey="Rattanapoka C" first="Choopan" last="Rattanapoka">Choopan Rattanapoka</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-93785" status="OLD"><orgName>Laboratoire de Sciences de l'Image, de l'Informatique et de la Télédétection, équipe ICPS</orgName>
<orgName type="acronym">LSIIT / ICPS</orgName>
<desc><address><addrLine>Bd S. Brant 67412 Illkirch</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://icps.u-strasbg.fr</ref>
</desc>
<listRelation><relation active="#struct-302445" type="direct"></relation>
<relation name="UMR7005" active="#struct-441569" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-302445" type="direct"><org type="institution" xml:id="struct-302445" status="VALID"><orgName>Université de Strasbourg</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="UMR7005" active="#struct-441569" type="direct"><org type="institution" xml:id="struct-441569" status="VALID"><idno type="ISNI">0000000122597504</idno>
<idno type="IdRef">02636817X</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc><address><country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName><settlement type="city">Strasbourg</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Alsace (région administrative)</region>
</placeName>
<orgName type="university">Université de Strasbourg</orgName>
</affiliation>
</author>
</analytic>
<idno type="DOI">10.1007/s10766-009-0115-8</idno>
<series><title level="j">International Journal of Parallel Programming</title>
<idno type="ISSN">0885-7458</idno>
<imprint><date type="datePub">2009-10</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="mix" xml:lang="en"><term>Fault-tolerance</term>
<term>Grid computing</term>
<term>Middleware</term>
<term>Parallelism</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">We present in this paper a study on fault management in a grid middleware. The middleware is our home-grown software called P2P-MPI. This framework is MPJ compliant, allows users to execute message passing parallel programs, and its objective is to support environments using commodity hardware. Hence, running programs is failure prone and a particular attention must be paid to fault management. The fault management covers two issues: fault-tolerance and fault detection. Fault-tolerance deals with the program execution: P2P-MPI provides a transparent fault tolerance facility based on replication of computations. Fault detection concerns the monitoring of the program execution by the system. The monitoring is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is the evaluation of the failure probability of an application depending on the replication degree. The failure probability depends on the execution length, and we propose a model to evaluate the duration of a replicated parallel program. Then, we give an expression of the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. The criteria of our evaluation are the reliability of the failure detection service and the failure detection speed. We retain the binary round-robin protocol for its failure detection speed, and we propose a variant of this protocol which is more reliable than the application execution in any case. Experiments involving of up to 256 processes, carried out on Grid'5000, show that the real detection times closely match the predictions.</div>
</front>
</TEI>
<hal api="V3"><titleStmt><title xml:lang="nl">Fault management in P2P-MPI</title>
<author role="aut"><persName><forename type="first">Stéphane</forename>
<surname>Genaud</surname>
</persName>
<email>stephane.genaud@loria.fr</email>
<idno type="halauthor">440063</idno>
<affiliation ref="#struct-2346"></affiliation>
</author>
<author role="aut"><persName><forename type="first">Emmanuel</forename>
<surname>Jeannot</surname>
</persName>
<email>emmanuel.jeannot@loria.fr</email>
<idno type="halauthor">59321</idno>
<affiliation ref="#struct-2346"></affiliation>
</author>
<author role="aut"><persName><forename type="first">Choopan</forename>
<surname>Rattanapoka</surname>
</persName>
<email>choopanr@kmutnb.ac.th</email>
<idno type="halauthor">440064</idno>
<orgName ref="#struct-366717"></orgName>
<affiliation ref="#struct-93785"></affiliation>
</author>
<editor role="depositor"><persName><forename>Stéphane</forename>
<surname>Genaud</surname>
</persName>
<email>genaud@unistra.fr</email>
</editor>
</titleStmt>
<editionStmt><edition n="v1" type="current"><date type="whenSubmitted">2009-10-21 21:04:08</date>
<date type="whenModified">2016-05-18 08:56:00</date>
<date type="whenReleased">2009-10-21 21:04:08</date>
<date type="whenProduced">2009-10</date>
</edition>
<respStmt><resp>contributor</resp>
<name key="124667"><persName><forename>Stéphane</forename>
<surname>Genaud</surname>
</persName>
<email>genaud@unistra.fr</email>
</name>
</respStmt>
</editionStmt>
<publicationStmt><distributor>CCSD</distributor>
<idno type="halId">inria-00425516</idno>
<idno type="halUri">https://hal.inria.fr/inria-00425516</idno>
<idno type="halBibtex">genaud:inria-00425516</idno>
<idno type="halRefHtml">International Journal of Parallel Programming, Springer Verlag, 2009, 37 (5), pp.433-461. <10.1007/s10766-009-0115-8></idno>
<idno type="halRef">International Journal of Parallel Programming, Springer Verlag, 2009, 37 (5), pp.433-461. <10.1007/s10766-009-0115-8></idno>
</publicationStmt>
<seriesStmt><idno type="stamp" n="CNRS">CNRS - Centre national de la recherche scientifique</idno>
<idno type="stamp" n="INRIA">INRIA - Institut National de Recherche en Informatique et en Automatique</idno>
<idno type="stamp" n="UNIV-STRASBG">Université de Strasbourg</idno>
<idno type="stamp" n="LORIA2">Publications du LORIA</idno>
<idno type="stamp" n="INRIA-NANCY-GRAND-EST">INRIA Nancy - Grand Est</idno>
<idno type="stamp" n="FRANCE-GRILLES">France Grilles</idno>
<idno type="stamp" n="LORIA">LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications</idno>
<idno type="stamp" n="LORIA-NSS" p="LORIA">Réseaux, systèmes et services</idno>
<idno type="stamp" n="GRID5000">Grid'5000</idno>
<idno type="stamp" n="INRIA2">INRIA 2</idno>
<idno type="stamp" n="INPL">Institut National Polytechnique de Lorraine</idno>
<idno type="stamp" n="INRIA-LORRAINE">INRIA Nancy - Grand Est</idno>
<idno type="stamp" n="LABO-LORIA-SET" p="LORIA">LABO-LORIA-SET</idno>
<idno type="stamp" n="UNIV-LORRAINE">Université de Lorraine</idno>
</seriesStmt>
<notesStmt><note type="audience" n="2">International</note>
<note type="popular" n="0">No</note>
<note type="peer" n="1">Yes</note>
</notesStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="nl">Fault management in P2P-MPI</title>
<author role="aut"><persName><forename type="first">Stéphane</forename>
<surname>Genaud</surname>
</persName>
<email>stephane.genaud@loria.fr</email>
<idno type="halAuthorId">440063</idno>
<affiliation ref="#struct-2346"></affiliation>
</author>
<author role="aut"><persName><forename type="first">Emmanuel</forename>
<surname>Jeannot</surname>
</persName>
<email>emmanuel.jeannot@loria.fr</email>
<idno type="halAuthorId">59321</idno>
<affiliation ref="#struct-2346"></affiliation>
</author>
<author role="aut"><persName><forename type="first">Choopan</forename>
<surname>Rattanapoka</surname>
</persName>
<email>choopanr@kmutnb.ac.th</email>
<idno type="halAuthorId">440064</idno>
<orgName ref="#struct-366717"></orgName>
<affiliation ref="#struct-93785"></affiliation>
</author>
</analytic>
<monogr><idno type="halJournalId" status="VALID">14619</idno>
<idno type="issn">0885-7458</idno>
<idno type="eissn">1573-7640</idno>
<title level="j">International Journal of Parallel Programming</title>
<imprint><publisher>Springer Verlag</publisher>
<biblScope unit="volume">37</biblScope>
<biblScope unit="issue">5</biblScope>
<biblScope unit="pp">433-461</biblScope>
<date type="datePub">2009-10</date>
</imprint>
</monogr>
<idno type="doi">10.1007/s10766-009-0115-8</idno>
</biblStruct>
</sourceDesc>
<profileDesc><langUsage><language ident="en">English</language>
</langUsage>
<textClass><keywords scheme="author"><term xml:lang="en">Grid computing</term>
<term xml:lang="en">Middleware</term>
<term xml:lang="en">Parallelism</term>
<term xml:lang="en">Fault-tolerance</term>
</keywords>
<classCode scheme="halDomain" n="info.info-dc">Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC]</classCode>
<classCode scheme="halTypology" n="ART">Journal articles</classCode>
</textClass>
<abstract xml:lang="en">We present in this paper a study on fault management in a grid middleware. The middleware is our home-grown software called P2P-MPI. This framework is MPJ compliant, allows users to execute message passing parallel programs, and its objective is to support environments using commodity hardware. Hence, running programs is failure prone and a particular attention must be paid to fault management. The fault management covers two issues: fault-tolerance and fault detection. Fault-tolerance deals with the program execution: P2P-MPI provides a transparent fault tolerance facility based on replication of computations. Fault detection concerns the monitoring of the program execution by the system. The monitoring is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is the evaluation of the failure probability of an application depending on the replication degree. The failure probability depends on the execution length, and we propose a model to evaluate the duration of a replicated parallel program. Then, we give an expression of the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. The criteria of our evaluation are the reliability of the failure detection service and the failure detection speed. We retain the binary round-robin protocol for its failure detection speed, and we propose a variant of this protocol which is more reliable than the application execution in any case. Experiments involving of up to 256 processes, carried out on Grid'5000, show that the real detection times closely match the predictions.</abstract>
<particDesc><org type="consortium">Grid'5000</org>
</particDesc>
</profileDesc>
</hal>
</record>
Pour manipuler ce document sous Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Hal/Checkpoint
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002A06 | SxmlIndent | more
Ou
HfdSelect -h $EXPLOR_AREA/Data/Hal/Checkpoint/biblio.hfd -nk 002A06 | SxmlIndent | more
Pour mettre un lien sur cette page dans le réseau Wicri
{{Explor lien |wiki= Wicri/Lorraine |area= InforLorV4 |flux= Hal |étape= Checkpoint |type= RBID |clé= Hal:inria-00425516 |texte= Fault management in P2P-MPI }}
This area was generated with Dilib version V0.6.33. |