Exploration server for computer science research in Lorraine

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Design of a model and frameworks for application distribution on PC clusters with low-overhead fault tolerance

Internal identifier: 001B82 (Hal/Corpus); previous: 001B81; next: 001B83

Author: Constantinos Makassikis

RBID : Hal:tel-00591083

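French descriptors: algorithmes Maître-Travailleur, algorithmes SPMD, framework, points de reprise, squelettes de programmation, tolérance aux pannes

English descriptors: Master-Worker algorithms, SPMD algorithms, checkpoint, fault tolerance, programming skeletons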

Abstract

PC clusters are distributed architectures whose adoption is spreading because of their low cost and their extensibility in terms of nodes. In particular, the growing number of nodes is responsible for an increase in fail-stop failures, which jeopardize distributed applications. The absence of efficient and portable solutions limits their use to applications that are non-critical or have no time constraints. MoLOToF is a model for application-level fault tolerance based on checkpointing. To ease the addition of fault tolerance, it proposes structuring applications with fault-tolerant skeletons, together with collaborations between the programmer and the fault tolerance system to gain efficiency. Applying MoLOToF to the SPMD and Master-Worker families of parallel algorithms led to the FT-GReLoSSS and ToMaWork frameworks, respectively. Each framework provides fault-tolerant skeletons suited to the targeted families of algorithms and an original implementation. FT-GReLoSSS uses C++ on top of MPI, while ToMaWork uses Java on top of a virtual shared memory system provided by JavaSpaces technology. The frameworks' evaluation reveals a reasonable development-time overhead and negligible runtime overheads in the absence of fault tolerance. Experiments on up to 256 nodes of a dual-core PC cluster demonstrate that FT-GReLoSSS' fault tolerance solution is more efficient than existing system-level solutions (LAM/MPI and DMTCP).
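
The general idea of application-level checkpointing mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the MoLOToF model, nor the FT-GReLoSSS or ToMaWork API: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON state layout, and the `every` and `fail_at` parameters are all assumptions made for illustration. It only shows the programmer choosing which state to save and where checkpoints may occur, and a restart resuming from the last saved state after a simulated fail-stop failure.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically: dump to a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # a crash never leaves a half-written checkpoint


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_task": 0, "partial_sum": 0}
    with open(path) as handle:
        return json.load(handle)


def run(tasks, path, every=2, fail_at=None):
    """Sum the squares of the tasks, checkpointing after every `every` completed tasks.

    `fail_at` (hypothetical, for demonstration only) simulates a fail-stop
    failure just before the task at that index is processed.
    """
    state = load_checkpoint(path)
    for index in range(state["next_task"], len(tasks)):
        if fail_at is not None and index == fail_at:
            raise RuntimeError("simulated fail-stop failure")
        state["partial_sum"] += tasks[index] ** 2
        state["next_task"] = index + 1
        # Only the minimal state chosen by the programmer is saved,
        # and only at points the programmer declared safe.
        if state["next_task"] % every == 0:
            save_checkpoint(path, state)
    return state["partial_sum"]
```

Calling `run` a second time with the same checkpoint path after a failure skips the tasks already covered by the last checkpoint and redoes only the work done since then. Saving less state less often lowers overhead but increases the amount of recomputation after a failure.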

Url: https://tel.archives-ouvertes.fr/tel-00591083

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Design of a model and frameworks for application distribution on PC clusters with low-overhead fault tolerance</title>
<title xml:lang="fr">Conception d'un modèle et de frameworks de distribution d'applications sur grappes de PCs avec tolérance aux pannes à faible coût</title>
<author>
<name sortKey="Makassikis, Constantinos" sort="Makassikis, Constantinos" uniqKey="Makassikis C" first="Constantinos" last="Makassikis">Constantinos Makassikis</name>
<affiliation>
<hal:affiliation type="laboratory" xml:id="struct-26305" status="VALID">
<orgName>SUPELEC-Campus Metz</orgName>
<desc>
<address>
<addrLine>2 rue Edouard Belin 57070 Metz</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.metz.supelec.fr/metz/</ref>
</desc>
<listRelation>
<relation active="#struct-300812" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300812" type="direct">
<org type="institution" xml:id="struct-300812" status="VALID">
<orgName>SUPELEC</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:tel-00591083</idno>
<idno type="halId">tel-00591083</idno>
<idno type="halUri">https://tel.archives-ouvertes.fr/tel-00591083</idno>
<idno type="url">https://tel.archives-ouvertes.fr/tel-00591083</idno>
<date when="2011-02-02">2011-02-02</date>
<idno type="wicri:Area/Hal/Corpus">001B82</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Design of a model and frameworks for application distribution on PC clusters with low-overhead fault tolerance</title>
<title xml:lang="fr">Conception d'un modèle et de frameworks de distribution d'applications sur grappes de PCs avec tolérance aux pannes à faible coût</title>
<author>
<name sortKey="Makassikis, Constantinos" sort="Makassikis, Constantinos" uniqKey="Makassikis C" first="Constantinos" last="Makassikis">Constantinos Makassikis</name>
<affiliation>
<hal:affiliation type="laboratory" xml:id="struct-26305" status="VALID">
<orgName>SUPELEC-Campus Metz</orgName>
<desc>
<address>
<addrLine>2 rue Edouard Belin 57070 Metz</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.metz.supelec.fr/metz/</ref>
</desc>
<listRelation>
<relation active="#struct-300812" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300812" type="direct">
<org type="institution" xml:id="struct-300812" status="VALID">
<orgName>SUPELEC</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="mix" xml:lang="en">
<term>Master-Worker algorithms</term>
<term>SPMD algorithms</term>
<term>checkpoint</term>
<term>fault tolerance</term>
<term>programming skeletons</term>
</keywords>
<keywords scheme="mix" xml:lang="fr">
<term>algorithmes Maître-Travailleur</term>
<term>algorithmes SPMD</term>
<term>framework</term>
<term>points de reprise</term>
<term>squelettes de programmation</term>
<term>tolérance aux pannes</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">PC clusters are distributed architectures whose adoption is spreading because of their low cost and their extensibility in terms of nodes. In particular, the growing number of nodes is responsible for an increase in fail-stop failures, which jeopardize distributed applications. The absence of efficient and portable solutions limits their use to applications that are non-critical or have no time constraints. MoLOToF is a model for application-level fault tolerance based on checkpointing. To ease the addition of fault tolerance, it proposes structuring applications with fault-tolerant skeletons, together with collaborations between the programmer and the fault tolerance system to gain efficiency. Applying MoLOToF to the SPMD and Master-Worker families of parallel algorithms led to the FT-GReLoSSS and ToMaWork frameworks, respectively. Each framework provides fault-tolerant skeletons suited to the targeted families of algorithms and an original implementation. FT-GReLoSSS uses C++ on top of MPI, while ToMaWork uses Java on top of a virtual shared memory system provided by JavaSpaces technology. The frameworks' evaluation reveals a reasonable development-time overhead and negligible runtime overheads in the absence of fault tolerance. Experiments on up to 256 nodes of a dual-core PC cluster demonstrate that FT-GReLoSSS' fault tolerance solution is more efficient than existing system-level solutions (LAM/MPI and DMTCP).</div>
</front>
</TEI>
<hal api="V3">
<titleStmt>
<title xml:lang="en">Design of a model and frameworks for application distribution on PC clusters with low-overhead fault tolerance</title>
<title xml:lang="fr">Conception d'un modèle et de frameworks de distribution d'applications sur grappes de PCs avec tolérance aux pannes à faible coût</title>
<author role="aut">
<persName>
<forename type="first">Constantinos</forename>
<surname>Makassikis</surname>
</persName>
<email>cmakassikis@gmail.com</email>
<idno type="halauthor">559216</idno>
<affiliation ref="#struct-26305"></affiliation>
<affiliation ref="#struct-2346"></affiliation>
</author>
<editor role="depositor">
<persName>
<forename>Constantinos</forename>
<surname>Makassikis</surname>
</persName>
<email>cmakassikis@gmail.com</email>
</editor>
</titleStmt>
<editionStmt>
<edition n="v1" type="current">
<date type="whenSubmitted">2011-05-06 14:44:44</date>
<date type="whenModified">2016-05-18 08:52:51</date>
<date type="whenReleased">2011-05-06 14:55:52</date>
<date type="whenProduced">2011-02-02</date>
<date type="whenEndEmbargoed">2011-05-06</date>
<ref type="file" target="https://tel.archives-ouvertes.fr/tel-00591083/document">
<date notBefore="2011-05-06"></date>
</ref>
<ref type="file" n="1" target="https://tel.archives-ouvertes.fr/tel-00591083/file/these.pdf">
<date notBefore="2011-05-06"></date>
</ref>
<ref type="annex" subtype="other" n="0" target="https://tel.archives-ouvertes.fr/tel-00591083/file/soutenance.pdf">
<date notBefore="2011-05-06"></date>
</ref>
</edition>
<respStmt>
<resp>contributor</resp>
<name key="157405">
<persName>
<forename>Constantinos</forename>
<surname>Makassikis</surname>
</persName>
<email>cmakassikis@gmail.com</email>
</name>
</respStmt>
</editionStmt>
<publicationStmt>
<distributor>CCSD</distributor>
<idno type="halId">tel-00591083</idno>
<idno type="halUri">https://tel.archives-ouvertes.fr/tel-00591083</idno>
<idno type="halBibtex">makassikis:tel-00591083</idno>
<idno type="halRefHtml">Réseaux et télécommunications [cs.NI]. Université Henri Poincaré - Nancy I, 2011. Français</idno>
<idno type="halRef">Réseaux et télécommunications [cs.NI]. Université Henri Poincaré - Nancy I, 2011. Français</idno>
</publicationStmt>
<seriesStmt>
<idno type="stamp" n="CNRS">CNRS - Centre national de la recherche scientifique</idno>
<idno type="stamp" n="INRIA">INRIA - Institut National de Recherche en Informatique et en Automatique</idno>
<idno type="stamp" n="INPL">Institut National Polytechnique de Lorraine</idno>
<idno type="stamp" n="LORIA2">Publications du LORIA</idno>
<idno type="stamp" n="SUP_IMS" p="SUPELEC">IMS - Equipe Information, Multimodalité et Signal</idno>
<idno type="stamp" n="SUPELEC">SUPELEC</idno>
<idno type="stamp" n="INRIA-NANCY-GRAND-EST">INRIA Nancy - Grand Est</idno>
<idno type="stamp" n="LORIA-NSS" p="LORIA">Réseaux, systèmes et services</idno>
<idno type="stamp" n="LORIA">LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications</idno>
<idno type="stamp" n="INRIA2">INRIA 2</idno>
<idno type="stamp" n="INRIA-LORRAINE">INRIA Nancy - Grand Est</idno>
<idno type="stamp" n="LABO-LORIA-SET" p="LORIA">LABO-LORIA-SET</idno>
<idno type="stamp" n="GRID5000">Grid'5000</idno>
<idno type="stamp" n="UNIV-LORRAINE">Université de Lorraine</idno>
</seriesStmt>
<notesStmt></notesStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Design of a model and frameworks for application distribution on PC clusters with low-overhead fault tolerance</title>
<title xml:lang="fr">Conception d'un modèle et de frameworks de distribution d'applications sur grappes de PCs avec tolérance aux pannes à faible coût</title>
<author role="aut">
<persName>
<forename type="first">Constantinos</forename>
<surname>Makassikis</surname>
</persName>
<email>cmakassikis@gmail.com</email>
<idno type="halAuthorId">559216</idno>
<affiliation ref="#struct-26305"></affiliation>
<affiliation ref="#struct-2346"></affiliation>
</author>
</analytic>
<monogr>
<imprint>
<date type="dateDefended">2011-02-02</date>
</imprint>
<authority type="institution">Université Henri Poincaré - Nancy I</authority>
<authority type="school">IAEM</authority>
<authority type="supervisor">Stéphane Vialle(Stephane.Vialle@supelec.fr)</authority>
<authority type="jury">Laurent Philippe (président)</authority>
<authority type="jury">Pierre Manneback (rapporteur)</authority>
<authority type="jury">Serge Chaumette (rapporteur)</authority>
<authority type="jury">Claude Godart (examinateur)</authority>
<authority type="jury">Xavier Warin (examinateur)</authority>
<authority type="jury">Stéphane Vialle (examinateur)</authority>
<authority type="jury">Virginie Galtier (examinateur)</authority>
</monogr>
</biblStruct>
</sourceDesc>
<profileDesc>
<langUsage>
<language ident="fr">French</language>
</langUsage>
<textClass>
<keywords scheme="author">
<term xml:lang="en">fault tolerance</term>
<term xml:lang="en">checkpoint</term>
<term xml:lang="en">programming skeletons</term>
<term xml:lang="en">SPMD algorithms</term>
<term xml:lang="en">Master-Worker algorithms</term>
<term xml:lang="fr">framework</term>
<term xml:lang="fr">tolérance aux pannes</term>
<term xml:lang="fr">points de reprise</term>
<term xml:lang="fr">squelettes de programmation</term>
<term xml:lang="fr">algorithmes SPMD</term>
<term xml:lang="fr">algorithmes Maître-Travailleur</term>
</keywords>
<classCode scheme="halDomain" n="info.info-ni">Computer Science [cs]/Networking and Internet Architecture [cs.NI]</classCode>
<classCode scheme="halTypology" n="THESE">Theses</classCode>
</textClass>
<abstract xml:lang="en">PC clusters are distributed architectures whose adoption is spreading because of their low cost and their extensibility in terms of nodes. In particular, the growing number of nodes is responsible for an increase in fail-stop failures, which jeopardize distributed applications. The absence of efficient and portable solutions limits their use to applications that are non-critical or have no time constraints. MoLOToF is a model for application-level fault tolerance based on checkpointing. To ease the addition of fault tolerance, it proposes structuring applications with fault-tolerant skeletons, together with collaborations between the programmer and the fault tolerance system to gain efficiency. Applying MoLOToF to the SPMD and Master-Worker families of parallel algorithms led to the FT-GReLoSSS and ToMaWork frameworks, respectively. Each framework provides fault-tolerant skeletons suited to the targeted families of algorithms and an original implementation. FT-GReLoSSS uses C++ on top of MPI, while ToMaWork uses Java on top of a virtual shared memory system provided by JavaSpaces technology. The frameworks' evaluation reveals a reasonable development-time overhead and negligible runtime overheads in the absence of fault tolerance. Experiments on up to 256 nodes of a dual-core PC cluster demonstrate that FT-GReLoSSS' fault tolerance solution is more efficient than existing system-level solutions (LAM/MPI and DMTCP).</abstract>
<abstract xml:lang="fr">Les grappes de PCs constituent des architectures distribuées dont l'adoption se répand à cause de leur faible coût mais aussi de leur extensibilité en termes de noeuds. Notamment, l'augmentation du nombre des noeuds est à l'origine d'un nombre croissant de pannes par arrêt qui mettent en péril l'exécution d'applications distribuées. L'absence de solutions efficaces et portables confine leur utilisation à des applications non critiques ou sans contraintes de temps. MoLOToF est un modèle de tolérance aux pannes de niveau applicatif et fondée sur la réalisation de sauvegardes. Pour faciliter l'ajout de la tolérance aux pannes, il propose une structuration de l'application selon des squelettes tolérants aux pannes, ainsi que des collaborations entre le programmeur et le système de tolérance des pannes pour gagner en efficacité. L'application de MoLOToF à des familles d'algorithmes parallèles SPMD et Maître-Travailleur a mené aux frameworks FT-GReLoSSS et ToMaWork respectivement. Chaque framework fournit des squelettes tolérants aux pannes adaptés aux familles d'algorithmes visées et une mise en oeuvre originale. FT-GReLoSSS est implanté en C++ au-dessus de MPI alors que ToMaWork est implanté en Java au-dessus d'un système de mémoire partagée virtuelle fourni par la technologie JavaSpaces. L'évaluation des frameworks montre un surcoût en temps de développement raisonnable et des surcoûts en temps d'exécution négligeables en l'absence de tolérance aux pannes. Les expériences menées jusqu'à 256 noeuds sur une grappe de PCs bi-coeurs, démontrent une meilleure efficacité de la solution de tolérance aux pannes de FT-GReLoSSS par rapport à des solutions existantes de niveau système (LAM/MPI et DMTCP).</abstract>
<particDesc>
<org type="consortium">Grid'5000</org>
</particDesc>
</profileDesc>
</hal>
</record>
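
As a hedged illustration of how the keyword lists in the record above could be read programmatically, the following Python sketch groups `<term>` values by their `xml:lang` attribute using the standard library. It deliberately works on a copy of the `<textClass>` fragment only: the full record uses a `hal:` prefix with no namespace declaration visible in this dump, so a namespace-aware parser may reject it as-is. The names `FRAGMENT`, `keywords_by_language` and `KEYWORDS`, and the `"und"` fallback for a missing language, are assumptions introduced for this example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix under its namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Copy of the <textClass> fragment from the TEI header of the record.
FRAGMENT = """
<textClass>
<keywords scheme="mix" xml:lang="en">
<term>Master-Worker algorithms</term>
<term>SPMD algorithms</term>
<term>checkpoint</term>
<term>fault tolerance</term>
<term>programming skeletons</term>
</keywords>
<keywords scheme="mix" xml:lang="fr">
<term>algorithmes Maître-Travailleur</term>
<term>algorithmes SPMD</term>
<term>framework</term>
<term>points de reprise</term>
<term>squelettes de programmation</term>
<term>tolérance aux pannes</term>
</keywords>
</textClass>
"""


def keywords_by_language(xml_text):
    """Return a dict mapping each xml:lang value to its list of keyword terms."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for group in root.iter("keywords"):
        lang = group.get(XML_LANG, "und")  # "und" = undetermined (assumed fallback)
        result.setdefault(lang, []).extend(term.text for term in group.iter("term"))
    return result


KEYWORDS = keywords_by_language(FRAGMENT)
```

The resulting dictionary has one entry per language code found in the fragment, which makes it easy to display the English and French descriptors separately.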

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Hal/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001B82 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Hal/Corpus/biblio.hfd -nk 001B82 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Lorraine
   |area=    InforLorV4
   |flux=    Hal
   |étape=   Corpus
   |type=    RBID
   |clé=     Hal:tel-00591083
   |texte=   Design of a model and frameworks for application distribution on PC clusters with low-overhead fault tolerance
}}

This area was generated with Dilib version V0.6.33.
Data generation: Mon Jun 10 21:56:28 2019. Site generation: Fri Feb 25 15:29:27 2022