Exploration server for computer science research in Lorraine

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Design of a model and frameworks for application distribution on PC clusters with low-overhead fault tolerance

Internal identifier: 002514 (Main/Merge); previous: 002513; next: 002515


Author: Constantinos Makassikis [France]

Source :

RBID : Hal:tel-00591083

French descriptors

English descriptors

Abstract

PC clusters are distributed architectures whose adoption is spreading thanks to their low cost and their extensibility in terms of nodes. In particular, the growing number of nodes is responsible for an increase in fail-stop failures, which jeopardize distributed applications. The absence of efficient and portable solutions limits their use to non-critical applications or applications without time constraints. MoLOToF is a model for application-level fault tolerance based on checkpointing. To ease the addition of fault tolerance, it proposes structuring applications with fault-tolerant skeletons, together with collaborations between the programmer and the fault tolerance system to gain efficiency. Applying MoLOToF to the SPMD and Master-Worker families of parallel algorithms led to the FT-GReLoSSS and ToMaWork frameworks, respectively. Each framework provides fault-tolerant skeletons suited to its targeted family of algorithms and an original implementation. FT-GReLoSSS uses C++ on top of MPI, while ToMaWork uses Java on top of a virtual shared memory system provided by JavaSpaces technology. The frameworks' evaluation reveals a reasonable development-time overhead and negligible runtime overhead in the absence of fault tolerance. Experiments on up to 256 nodes of a dual-core PC cluster demonstrate that the fault tolerance solution of FT-GReLoSSS is more efficient than existing system-level solutions (LAM/MPI and DMTCP).
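The idea of application-level checkpointing wrapped in a skeleton can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the abstract does not show the MoLOToF, FT-GReLoSSS, or ToMaWork APIs, so every name here (`save_checkpoint`, `load_checkpoint`, `run_tasks`, `fail_at`) is hypothetical, and the sketch runs in a single process rather than on a cluster. The programmer supplies only the task list and the work function, and the skeleton decides when to save state and where to resume.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write `state` to `path` via a temporary file and an atomic rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump(state, tmp)
    os.replace(tmp_path, path)  # replaces any previous checkpoint in one step


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run_tasks(tasks, work, path, every=1, fail_at=None):
    """Hypothetical fault-tolerant loop skeleton (not the thesis's API).

    `every` is the checkpoint interval in tasks; `fail_at` simulates a
    fail-stop failure just before the task at that index, for the demo.
    """
    state = load_checkpoint(path, {"next": 0, "results": []})
    for i in range(state["next"], len(tasks)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated fail-stop failure")
        state["results"].append(work(tasks[i]))
        state["next"] = i + 1
        if state["next"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # final state, whatever the interval
    return state["results"]
```

Calling `run_tasks` again with the same checkpoint path after a failure resumes from the last saved index, so only the tasks completed since that checkpoint are recomputed. A larger `every` lowers checkpointing cost at the price of more recomputation after a failure.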

Url:

Links toward previous steps (curation, corpus...)


Links to Exploration step

Hal:tel-00591083

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Design of a model and frameworks for application distribution on PC clusters with low-overhead fault tolerance</title>
<title xml:lang="fr">Conception d'un modèle et de frameworks de distribution d'applications sur grappes de PCs avec tolérance aux pannes à faible coût</title>
<author>
<name sortKey="Makassikis, Constantinos" sort="Makassikis, Constantinos" uniqKey="Makassikis C" first="Constantinos" last="Makassikis">Constantinos Makassikis</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-26305" status="VALID">
<orgName>SUPELEC-Campus Metz</orgName>
<desc>
<address>
<addrLine>2 rue Edouard Belin 57070 Metz</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.metz.supelec.fr/metz/</ref>
</desc>
<listRelation>
<relation active="#struct-300812" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300812" type="direct">
<org type="institution" xml:id="struct-300812" status="VALID">
<orgName>SUPELEC</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:tel-00591083</idno>
<idno type="halId">tel-00591083</idno>
<idno type="halUri">https://tel.archives-ouvertes.fr/tel-00591083</idno>
<idno type="url">https://tel.archives-ouvertes.fr/tel-00591083</idno>
<date when="2011-02-02">2011-02-02</date>
<idno type="wicri:Area/Hal/Corpus">001B82</idno>
<idno type="wicri:Area/Hal/Curation">001B82</idno>
<idno type="wicri:Area/Hal/Checkpoint">001F50</idno>
<idno type="wicri:explorRef" wicri:stream="Hal" wicri:step="Checkpoint">001F50</idno>
<idno type="wicri:Area/Main/Merge">002514</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Design of a model and frameworks for application distribution on PC clusters with low-overhead fault tolerance</title>
<title xml:lang="fr">Conception d'un modèle et de frameworks de distribution d'applications sur grappes de PCs avec tolérance aux pannes à faible coût</title>
<author>
<name sortKey="Makassikis, Constantinos" sort="Makassikis, Constantinos" uniqKey="Makassikis C" first="Constantinos" last="Makassikis">Constantinos Makassikis</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-26305" status="VALID">
<orgName>SUPELEC-Campus Metz</orgName>
<desc>
<address>
<addrLine>2 rue Edouard Belin 57070 Metz</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.metz.supelec.fr/metz/</ref>
</desc>
<listRelation>
<relation active="#struct-300812" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300812" type="direct">
<org type="institution" xml:id="struct-300812" status="VALID">
<orgName>SUPELEC</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="mix" xml:lang="en">
<term>Master-Worker algorithms</term>
<term>SPMD algorithms</term>
<term>checkpoint</term>
<term>fault tolerance</term>
<term>programming skeletons</term>
</keywords>
<keywords scheme="mix" xml:lang="fr">
<term>algorithmes Maître-Travailleur</term>
<term>algorithmes SPMD</term>
<term>framework</term>
<term>points de reprise</term>
<term>squelettes de programmation</term>
<term>tolérance aux pannes</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">PC clusters are distributed architectures whose adoption is spreading thanks to their low cost and their extensibility in terms of nodes. In particular, the growing number of nodes is responsible for an increase in fail-stop failures, which jeopardize distributed applications. The absence of efficient and portable solutions limits their use to non-critical applications or applications without time constraints. MoLOToF is a model for application-level fault tolerance based on checkpointing. To ease the addition of fault tolerance, it proposes structuring applications with fault-tolerant skeletons, together with collaborations between the programmer and the fault tolerance system to gain efficiency. Applying MoLOToF to the SPMD and Master-Worker families of parallel algorithms led to the FT-GReLoSSS and ToMaWork frameworks, respectively. Each framework provides fault-tolerant skeletons suited to its targeted family of algorithms and an original implementation. FT-GReLoSSS uses C++ on top of MPI, while ToMaWork uses Java on top of a virtual shared memory system provided by JavaSpaces technology. The frameworks' evaluation reveals a reasonable development-time overhead and negligible runtime overhead in the absence of fault tolerance. Experiments on up to 256 nodes of a dual-core PC cluster demonstrate that the fault tolerance solution of FT-GReLoSSS is more efficient than existing system-level solutions (LAM/MPI and DMTCP).</div>
</front>
</TEI>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002514 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 002514 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Lorraine
   |area=    InforLorV4
   |flux=    Main
   |étape=   Merge
   |type=    RBID
   |clé=     Hal:tel-00591083
   |texte=   Design of a model and frameworks for application distribution on PC clusters with low-overhead fault tolerance
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Jun 10 21:56:28 2019. Site generation: Fri Feb 25 15:29:27 2022