Cyberinfrastructure exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

In-Memory Checkpointing for MPI Programs by XOR-Based Double-Erasure Codes

Internal identifier: 000C77 (Main/Merge); previous: 000C76; next: 000C78

Authors: Gang Wang [People's Republic of China]; Xiaoguang Liu [People's Republic of China]; Ang Li [People's Republic of China]; Fan Zhang [People's Republic of China]

Source: Lecture Notes in Computer Science (ISSN 0302-9743), 2009

RBID : ISTEX:F6CFCB9D26FFDDB6DEB8E60F8DE484F834B85BAC

Abstract

Today, the scale of high-performance computing (HPC) systems is much larger than ever, which poses a challenge for their fault tolerance. MPI (Message Passing Interface) is one of the most important programming tools for HPC. There are quite a few fault-tolerant extensions for MPI, such as MPICH-V, StarFish, and FT-MPI, most of which are based on on-disk checkpointing. In this paper, we apply two kinds of XOR-based double-erasure codes, RDP (Row-Diagonal Parity) and B-Code, to in-memory checkpointing for MPI programs. We develop scalable checkpointing/recovery algorithms that embed erasure-code encoding/decoding computation into MPI collective communication operations. The experiments show that the scalable algorithms decrease communication overhead and balance computation effectively. Our approach provides highly reliable, fast in-memory checkpointing for MPI programs.
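
The idea behind XOR-based in-memory checkpointing can be illustrated with a deliberately simplified sketch. It is not the RDP or B-Code algorithm from the paper: it uses a single XOR parity block, so it tolerates only one lost checkpoint, whereas the double-erasure codes named in the abstract tolerate two. The function names (`xor_blocks`, `encode_parity`, `recover`) and the use of plain Python lists in place of MPI processes are assumptions made for illustration. In a real MPI program the same XOR reduction could presumably be expressed with a collective such as `MPI_Reduce` using the `MPI_BXOR` operator.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(checkpoints: List[bytes]) -> bytes:
    """XOR all per-process checkpoints into one parity block."""
    return reduce(xor_blocks, checkpoints)


def recover(checkpoints: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Rebuild at most one missing checkpoint (marked None) from the parity."""
    missing = [i for i, c in enumerate(checkpoints) if c is None]
    if len(missing) > 1:
        raise ValueError("single parity tolerates only one lost checkpoint")
    restored = list(checkpoints)
    if missing:
        survivors = [c for c in checkpoints if c is not None]
        # parity XOR all surviving blocks leaves exactly the missing block
        restored[missing[0]] = reduce(xor_blocks, survivors, parity)
    return restored


# Hypothetical in-memory states of three processes, all the same length.
states = [b"rank0---", b"rank1---", b"rank2---"]
parity = encode_parity(states)
damaged = [states[0], None, states[2]]  # process 1 has failed
restored = recover(damaged, parity)
```

Because the parity lives in memory rather than on disk, both encoding and recovery reduce to XOR passes over the checkpoint buffers, which is the property the abstract relies on for speed.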

Url:
DOI: 10.1007/978-3-642-03770-2_15

Links to Exploration step

ISTEX:F6CFCB9D26FFDDB6DEB8E60F8DE484F834B85BAC

The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">In-Memory Checkpointing for MPI Programs by XOR-Based Double-Erasure Codes</title>
<author>
<name sortKey="Wang, Gang" sort="Wang, Gang" uniqKey="Wang G" first="Gang" last="Wang">Gang Wang</name>
</author>
<author>
<name sortKey="Liu, Xiaoguang" sort="Liu, Xiaoguang" uniqKey="Liu X" first="Xiaoguang" last="Liu">Xiaoguang Liu</name>
</author>
<author>
<name sortKey="Li, Ang" sort="Li, Ang" uniqKey="Li A" first="Ang" last="Li">Ang Li</name>
</author>
<author>
<name sortKey="Zhang, Fan" sort="Zhang, Fan" uniqKey="Zhang F" first="Fan" last="Zhang">Fan Zhang</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:F6CFCB9D26FFDDB6DEB8E60F8DE484F834B85BAC</idno>
<date when="2009" year="2009">2009</date>
<idno type="doi">10.1007/978-3-642-03770-2_15</idno>
<idno type="url">https://api.istex.fr/document/F6CFCB9D26FFDDB6DEB8E60F8DE484F834B85BAC/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000384</idno>
<idno type="wicri:Area/Istex/Curation">000384</idno>
<idno type="wicri:Area/Istex/Checkpoint">000422</idno>
<idno type="wicri:doubleKey">0302-9743:2009:Wang G:in:memory:checkpointing</idno>
<idno type="wicri:Area/Main/Merge">000C77</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">In-Memory Checkpointing for MPI Programs by XOR-Based Double-Erasure Codes</title>
<author>
<name sortKey="Wang, Gang" sort="Wang, Gang" uniqKey="Wang G" first="Gang" last="Wang">Gang Wang</name>
<affiliation wicri:level="3">
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Nankai-Baidu Joint Lab, College of Information Technical Science, Nankai University, 94 Weijin Road, 300071, Tianjin</wicri:regionArea>
<placeName>
<settlement type="city">Tianjin</settlement>
</placeName>
</affiliation>
<affiliation>
<wicri:noCountry code="no comma">E-mail: wgzwp@163.com</wicri:noCountry>
</affiliation>
</author>
<author>
<name sortKey="Liu, Xiaoguang" sort="Liu, Xiaoguang" uniqKey="Liu X" first="Xiaoguang" last="Liu">Xiaoguang Liu</name>
<affiliation wicri:level="3">
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Nankai-Baidu Joint Lab, College of Information Technical Science, Nankai University, 94 Weijin Road, 300071, Tianjin</wicri:regionArea>
<placeName>
<settlement type="city">Tianjin</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">République populaire de Chine</country>
</affiliation>
</author>
<author>
<name sortKey="Li, Ang" sort="Li, Ang" uniqKey="Li A" first="Ang" last="Li">Ang Li</name>
<affiliation wicri:level="3">
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Nankai-Baidu Joint Lab, College of Information Technical Science, Nankai University, 94 Weijin Road, 300071, Tianjin</wicri:regionArea>
<placeName>
<settlement type="city">Tianjin</settlement>
</placeName>
</affiliation>
<affiliation>
<wicri:noCountry code="no comma">E-mail: megathere@gmail.com</wicri:noCountry>
</affiliation>
</author>
<author>
<name sortKey="Zhang, Fan" sort="Zhang, Fan" uniqKey="Zhang F" first="Fan" last="Zhang">Fan Zhang</name>
<affiliation wicri:level="3">
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Nankai-Baidu Joint Lab, College of Information Technical Science, Nankai University, 94 Weijin Road, 300071, Tianjin</wicri:regionArea>
<placeName>
<settlement type="city">Tianjin</settlement>
</placeName>
</affiliation>
<affiliation>
<wicri:noCountry code="no comma">E-mail: zhangfan555@gmail.com</wicri:noCountry>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="s">Lecture Notes in Computer Science</title>
<imprint>
<date>2009</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
</series>
<idno type="istex">F6CFCB9D26FFDDB6DEB8E60F8DE484F834B85BAC</idno>
<idno type="DOI">10.1007/978-3-642-03770-2_15</idno>
<idno type="ChapterID">15</idno>
<idno type="ChapterID">Chap15</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: Today, the scale of High performance computing (HPC) systems is much larger than ever. This brings a challenge to fault tolerance of HPC systems. MPI (Message Passing Interface) is one of the most important programming tools for HPC. There are quite a few fault-tolerant extensions for MPI, such as MPICH-V, StarFish, FT-MPI and so on. Most of them are based on on-disk checkpointing. In this paper, we apply two kinds of XOR-based double-erasure codes - RDP (Row-Diagonal Parity) and B-Code to in-memory checkpointing for MPI programs. We develop scalable checkpointing/recovery algorithms which embed erasure code encoding/decoding computation into MPI collective communications operations. The experiments show that the scalable algorithms decrease communication overhead and balance computation effectively. Our approach provides highly reliable, fast in-memory checkpointing for MPI programs.</div>
</front>
</TEI>
</record>
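
As a rough illustration of how such a record might be read programmatically, the sketch below extracts the title, authors, and DOI with Python's standard `xml.etree.ElementTree`. It works on an abbreviated copy of the record above. One assumption is worth stating loudly: the `wicri:` prefix is not declared anywhere in the record, so a strict XML parser rejects it; the sketch binds it to a placeholder URI (`urn:example:wicri`, a hypothetical value, not an official namespace) before parsing. The helper names `parse_record` and `summarize` are also illustrative.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the record above; only a few fields are kept.
RECORD = """<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader><fileDesc>
<titleStmt>
<title xml:lang="en">In-Memory Checkpointing for MPI Programs by XOR-Based Double-Erasure Codes</title>
<author><name sortKey="Wang, Gang">Gang Wang</name></author>
<author><name sortKey="Liu, Xiaoguang">Xiaoguang Liu</name></author>
</titleStmt>
<publicationStmt>
<idno type="doi">10.1007/978-3-642-03770-2_15</idno>
</publicationStmt>
</fileDesc></teiHeader>
</TEI>
</record>"""

# Assumption: placeholder namespace URI for the undeclared "wicri" prefix.
PLACEHOLDER_NS = "urn:example:wicri"


def parse_record(text: str) -> ET.Element:
    """Declare the wicri prefix on the root element, then parse."""
    patched = text.replace(
        "<record>", f'<record xmlns:wicri="{PLACEHOLDER_NS}">', 1
    )
    return ET.fromstring(patched)


def summarize(root: ET.Element) -> dict:
    """Collect title, author names, and DOI from the TEI header."""
    return {
        "title": root.findtext(".//titleStmt/title"),
        "authors": [n.text for n in root.findall(".//titleStmt/author/name")],
        "doi": root.findtext(".//publicationStmt/idno[@type='doi']"),
    }


info = summarize(parse_record(RECORD))
```

Patching the root element keeps the record's own tags and attributes untouched, so the same paths should apply to the full record as well.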

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/CyberinfraV1/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000C77 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 000C77 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    CyberinfraV1
   |flux=    Main
   |étape=   Merge
   |type=    RBID
   |clé=     ISTEX:F6CFCB9D26FFDDB6DEB8E60F8DE484F834B85BAC
   |texte=   In-Memory Checkpointing for MPI Programs by XOR-Based Double-Erasure Codes
}}

This area was generated with Dilib version V0.6.25.
Data generation: Thu Oct 27 09:30:58 2016. Site generation: Sun Mar 10 23:08:40 2024