Exploration server for computer science research in Lorraine

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information it contains has therefore not been validated.

Performance Testing of a Parallel Multiblock CFD Solver

Internal identifier: 000384 (Istex/Corpus); previous: 000383; next: 000385

Performance Testing of a Parallel Multiblock CFD Solver

Authors: David Kerlick; Eric Dillon; David Levine

Source:

RBID: ISTEX:10F89E9E8C955141EF1B7E305BA97BB38B570610

English descriptors

Abstract

A distributed-memory version of the OVERFLOW computational fluid dynamics code was evaluated on several parallel systems and compared with other approaches using test cases provided by the NASA-Boeing High-Speed Civil Transport program. A principal goal was to develop partitioning and load-balancing strategies that led to a reduction in computation time. We found multipartitioning, in which the aspect ratio of the multipartition is close to the aspect ratio of the grid zone, offered the best performance. The (uniprocessor) performance of the CRAY vector systems was superior to all other systems tested. However, the distributed-memory version when run on an SGI Origin system offers a price performance advantage over the CRAY vector systems. Performance on personal computer systems is promising but faces several hurdles.
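The abstract's key finding is that a multipartition whose aspect ratio matches the grid zone's gives the best performance; the article notes that near-cubic subdomains minimize the surface-to-volume ratio of rectangular partitions. The short Python sketch below only illustrates that idea: the brute-force search and the helper names (factor_triplets, surface_to_volume, best_multipartition) are assumptions of this sketch, not the authors' code.

```python
from itertools import product

def factor_triplets(p):
    """All ordered triplets (pi, pj, pk) with pi * pj * pk == p."""
    return [t for t in product(range(1, p + 1), repeat=3)
            if t[0] * t[1] * t[2] == p]

def surface_to_volume(dims, part):
    """Surface-to-volume ratio of one subdomain when a zone of
    dims = (Ni, Nj, Nk) points is split into part = (pi, pj, pk) blocks."""
    ni, nj, nk = (d / q for d, q in zip(dims, part))
    return 2.0 * (ni * nj + nj * nk + ni * nk) / (ni * nj * nk)

def best_multipartition(dims, nprocs):
    """Pick the triplet whose subdomains are closest to cubic, i.e. whose
    aspect ratio best matches the grid zone's."""
    return min(factor_triplets(nprocs),
               key=lambda part: surface_to_volume(dims, part))

# Grid zone 2 of the paper's wingbody case: 271 x 121 x 46 points
# (the grid aspect ratio the paper quotes as 5.8 : 2.6 : 1).
print(best_multipartition((271, 121, 46), 18))
# -> (6, 3, 1); near-cubic splits like this beat skewed ones such as 1.3.6,
#    consistent with the results reported in the article.
```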

Url:
DOI: 10.1177/109434200101500103

Links to Exploration step

ISTEX:10F89E9E8C955141EF1B7E305BA97BB38B570610

The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Performance Testing of a Parallel Multiblock CFD Solver</title>
<author wicri:is="90%">
<name sortKey="Kerlick, David" sort="Kerlick, David" uniqKey="Kerlick D" first="David" last="Kerlick">David Kerlick</name>
<affiliation>
<mods:affiliation>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</mods:affiliation>
</affiliation>
</author>
<author wicri:is="90%">
<name sortKey="Dillon, Eric" sort="Dillon, Eric" uniqKey="Dillon E" first="Eric" last="Dillon">Eric Dillon</name>
<affiliation>
<mods:affiliation>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</mods:affiliation>
</affiliation>
</author>
<author wicri:is="90%">
<name sortKey="Levine, David" sort="Levine, David" uniqKey="Levine D" first="David" last="Levine">David Levine</name>
<affiliation>
<mods:affiliation>Rosetta Inpharmatics, Kirkland, Washington</mods:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:10F89E9E8C955141EF1B7E305BA97BB38B570610</idno>
<date when="2001" year="2001">2001</date>
<idno type="doi">10.1177/109434200101500103</idno>
<idno type="url">https://api.istex.fr/ark:/67375/M70-X79RBH0Z-6/fulltext.pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000384</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">000384</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">Performance Testing of a Parallel Multiblock CFD Solver</title>
<author wicri:is="90%">
<name sortKey="Kerlick, David" sort="Kerlick, David" uniqKey="Kerlick D" first="David" last="Kerlick">David Kerlick</name>
<affiliation>
<mods:affiliation>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</mods:affiliation>
</affiliation>
</author>
<author wicri:is="90%">
<name sortKey="Dillon, Eric" sort="Dillon, Eric" uniqKey="Dillon E" first="Eric" last="Dillon">Eric Dillon</name>
<affiliation>
<mods:affiliation>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</mods:affiliation>
</affiliation>
</author>
<author wicri:is="90%">
<name sortKey="Levine, David" sort="Levine, David" uniqKey="Levine D" first="David" last="Levine">David Levine</name>
<affiliation>
<mods:affiliation>Rosetta Inpharmatics, Kirkland, Washington</mods:affiliation>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">The International Journal of High Performance Computing Applications</title>
<idno type="ISSN">1094-3420</idno>
<idno type="eISSN">1741-2846</idno>
<imprint>
<publisher>Sage Publications</publisher>
<pubPlace>Sage CA: Thousand Oaks, CA</pubPlace>
<date type="published" when="2001-02">2001-02</date>
<biblScope unit="volume">15</biblScope>
<biblScope unit="issue">1</biblScope>
<biblScope unit="page" from="22">22</biblScope>
<biblScope unit="page" to="35">35</biblScope>
</imprint>
<idno type="ISSN">1094-3420</idno>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">1094-3420</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="Teeft" xml:lang="en">
<term>Aiaa proceedings</term>
<term>Aspect ratio</term>
<term>Average number</term>
<term>Better load balancing</term>
<term>Boeing company</term>
<term>Cache</term>
<term>Coalescing</term>
<term>Coarse grain</term>
<term>Common divisor</term>
<term>Compaq</term>
<term>Computational</term>
<term>Computer science</term>
<term>Convergence</term>
<term>Cray</term>
<term>Cray cray cray</term>
<term>Cray vector systems</term>
<term>Design processes</term>
<term>Digital fortran version</term>
<term>Disk space</term>
<term>Fine grain</term>
<term>Fine grid</term>
<term>Flow field</term>
<term>Full speed</term>
<term>Grain time speedup</term>
<term>Grid</term>
<term>Grid aspect ratio</term>
<term>Grid number points</term>
<term>Grid points</term>
<term>Grid zone</term>
<term>Grid zones</term>
<term>Hatay</term>
<term>High performance</term>
<term>Hsct</term>
<term>Inner loops</term>
<term>Iteration</term>
<term>Jespersen</term>
<term>Kayak</term>
<term>Larger number</term>
<term>Larger numbers</term>
<term>Largest grid zone</term>
<term>Machine comparison number</term>
<term>Main memory</term>
<term>Medium grid</term>
<term>Memory bandwidth</term>
<term>Methods support coalescing</term>
<term>More processors</term>
<term>Multigrid</term>
<term>Multigrid method</term>
<term>Multigrid scheme</term>
<term>Multipartition</term>
<term>Multipartitioning</term>
<term>Multiple processors</term>
<term>Nasa</term>
<term>Nasa ames</term>
<term>Nasa ames research center</term>
<term>Node performance</term>
<term>Number points</term>
<term>Origin system</term>
<term>Origin systems</term>
<term>Other systems</term>
<term>Overflow</term>
<term>Overflow code</term>
<term>Overflow offer</term>
<term>Parallel directives</term>
<term>Parallel hardware</term>
<term>Parallel implementations</term>
<term>Partitioning</term>
<term>Partitioning scheme</term>
<term>Partitioning strategies</term>
<term>Performance degradation</term>
<term>Performance evaluation</term>
<term>Processor</term>
<term>Programming model</term>
<term>Pulliam</term>
<term>Risc systems</term>
<term>Sequential version</term>
<term>Serial version</term>
<term>Single grid</term>
<term>Single grid zone</term>
<term>Single processor</term>
<term>Small amount</term>
<term>Small grid zones</term>
<term>Smaller numbers</term>
<term>Solver</term>
<term>Speedup</term>
<term>Strategy sequential</term>
<term>Test case</term>
<term>Test cases</term>
<term>Test problems</term>
<term>Threshold number</term>
<term>Total grid</term>
<term>Total number</term>
<term>Turbulence model</term>
<term>Unipartitioning</term>
<term>Unusual representation</term>
<term>Vector architecture</term>
<term>Vector supercomputers</term>
<term>Virtual memory</term>
<term>Viscous compressible flow equations</term>
<term>Wingbody</term>
<term>Wingbody case</term>
<term>Wingbody problem</term>
<term>Wingbody test case</term>
<term>Wingbody timings</term>
<term>Zone</term>
</keywords>
</textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">A distributed-memory version of the OVERFLOW computational fluid dynamics code was evaluated on several parallel systems and compared with other approaches using test cases provided by the NASA-Boeing High-Speed Civil Transport program. A principal goal was to develop partitioning and load-balancing strategies that led to a reduction in computation time. We found multipartitioning, in which the aspect ratio of the multipartition is close to the aspect ratio of the grid zone, offered the best performance. The (uniprocessor) performance of the CRAY vector systems was superior to all other systems tested. However, the distributed-memory version when run on an SGI Origin system offers a price performance advantage over the CRAY vector systems.Performance on personal computer systems is promising but faces several hurdles.</div>
</front>
</TEI>
<istex>
<corpusName>sage</corpusName>
<keywords>
<teeft>
<json:string>grid</json:string>
<json:string>partitioning</json:string>
<json:string>cray</json:string>
<json:string>grid zone</json:string>
<json:string>compaq</json:string>
<json:string>grid zones</json:string>
<json:string>wingbody</json:string>
<json:string>multipartitioning</json:string>
<json:string>processor</json:string>
<json:string>unipartitioning</json:string>
<json:string>kayak</json:string>
<json:string>single processor</json:string>
<json:string>multigrid</json:string>
<json:string>iteration</json:string>
<json:string>aspect ratio</json:string>
<json:string>convergence</json:string>
<json:string>overflow</json:string>
<json:string>speedup</json:string>
<json:string>multiple processors</json:string>
<json:string>hatay</json:string>
<json:string>nasa</json:string>
<json:string>single grid zone</json:string>
<json:string>coalescing</json:string>
<json:string>pulliam</json:string>
<json:string>test problems</json:string>
<json:string>jespersen</json:string>
<json:string>hsct</json:string>
<json:string>multipartition</json:string>
<json:string>solver</json:string>
<json:string>cache</json:string>
<json:string>largest grid zone</json:string>
<json:string>test cases</json:string>
<json:string>aiaa proceedings</json:string>
<json:string>high performance</json:string>
<json:string>grid points</json:string>
<json:string>zone</json:string>
<json:string>cray vector systems</json:string>
<json:string>grid aspect ratio</json:string>
<json:string>partitioning scheme</json:string>
<json:string>boeing company</json:string>
<json:string>computational</json:string>
<json:string>parallel directives</json:string>
<json:string>total number</json:string>
<json:string>turbulence model</json:string>
<json:string>serial version</json:string>
<json:string>fine grain</json:string>
<json:string>parallel hardware</json:string>
<json:string>coarse grain</json:string>
<json:string>number points</json:string>
<json:string>test case</json:string>
<json:string>other systems</json:string>
<json:string>average number</json:string>
<json:string>wingbody timings</json:string>
<json:string>sequential version</json:string>
<json:string>computer science</json:string>
<json:string>viscous compressible flow equations</json:string>
<json:string>single grid</json:string>
<json:string>programming model</json:string>
<json:string>inner loops</json:string>
<json:string>overflow code</json:string>
<json:string>risc systems</json:string>
<json:string>origin system</json:string>
<json:string>total grid</json:string>
<json:string>medium grid</json:string>
<json:string>larger numbers</json:string>
<json:string>grid number points</json:string>
<json:string>parallel implementations</json:string>
<json:string>nasa ames</json:string>
<json:string>design processes</json:string>
<json:string>main memory</json:string>
<json:string>multigrid method</json:string>
<json:string>disk space</json:string>
<json:string>flow field</json:string>
<json:string>multigrid scheme</json:string>
<json:string>origin systems</json:string>
<json:string>wingbody problem</json:string>
<json:string>partitioning strategies</json:string>
<json:string>common divisor</json:string>
<json:string>better load balancing</json:string>
<json:string>fine grid</json:string>
<json:string>strategy sequential</json:string>
<json:string>grain time speedup</json:string>
<json:string>wingbody case</json:string>
<json:string>larger number</json:string>
<json:string>smaller numbers</json:string>
<json:string>small grid zones</json:string>
<json:string>vector architecture</json:string>
<json:string>more processors</json:string>
<json:string>small amount</json:string>
<json:string>machine comparison number</json:string>
<json:string>cray cray cray</json:string>
<json:string>virtual memory</json:string>
<json:string>node performance</json:string>
<json:string>unusual representation</json:string>
<json:string>memory bandwidth</json:string>
<json:string>full speed</json:string>
<json:string>wingbody test case</json:string>
<json:string>threshold number</json:string>
<json:string>performance degradation</json:string>
<json:string>overflow offer</json:string>
<json:string>methods support coalescing</json:string>
<json:string>nasa ames research center</json:string>
<json:string>vector supercomputers</json:string>
<json:string>performance evaluation</json:string>
<json:string>digital fortran version</json:string>
</teeft>
</keywords>
<author>
<json:item>
<name>David Kerlick</name>
<affiliations>
<json:string>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</json:string>
</affiliations>
</json:item>
<json:item>
<name>Eric Dillon</name>
<affiliations>
<json:string>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</json:string>
</affiliations>
</json:item>
<json:item>
<name>David Levine</name>
<affiliations>
<json:string>Rosetta Inpharmatics, Kirkland, Washington</json:string>
</affiliations>
</json:item>
</author>
<articleId>
<json:string>10.1177_109434200101500103</json:string>
</articleId>
<arkIstex>ark:/67375/M70-X79RBH0Z-6</arkIstex>
<language>
<json:string>eng</json:string>
</language>
<originalGenre>
<json:string>research-article</json:string>
</originalGenre>
<abstract>A distributed-memory version of the OVERFLOW computational fluid dynamics code was evaluated on several parallel systems and compared with other approaches using test cases provided by the NASA-Boeing High-Speed Civil Transport program. A principal goal was to develop partitioning and load-balancing strategies that led to a reduction in computation time. We found multipartitioning, in which the aspect ratio of the multipartition is close to the aspect ratio of the grid zone, offered the best performance. The (uniprocessor) performance of the CRAY vector systems was superior to all other systems tested. However, the distributed-memory version when run on an SGI Origin system offers a price performance advantage over the CRAY vector systems. Performance on personal computer systems is promising but faces several hurdles.</abstract>
<qualityIndicators>
<score>8.452</score>
<pdfWordCount>6487</pdfWordCount>
<pdfCharCount>39548</pdfCharCount>
<pdfVersion>1.5</pdfVersion>
<pdfPageCount>14</pdfPageCount>
<pdfPageSize>612 x 792 pts (letter)</pdfPageSize>
<refBibsNative>true</refBibsNative>
<abstractWordCount>121</abstractWordCount>
<abstractCharCount>829</abstractCharCount>
<keywordCount>0</keywordCount>
</qualityIndicators>
<title>Performance Testing of a Parallel Multiblock CFD Solver</title>
<genre>
<json:string>research-article</json:string>
</genre>
<host>
<title>The International Journal of High Performance Computing Applications</title>
<language>
<json:string>unknown</json:string>
</language>
<issn>
<json:string>1094-3420</json:string>
</issn>
<eissn>
<json:string>1741-2846</json:string>
</eissn>
<publisherId>
<json:string>HPC</json:string>
</publisherId>
<volume>15</volume>
<issue>1</issue>
<pages>
<first>22</first>
<last>35</last>
</pages>
<genre>
<json:string>journal</json:string>
</genre>
</host>
<namedEntities>
<unitex>
<date></date>
<geogName></geogName>
<orgName></orgName>
<orgName_funder></orgName_funder>
<orgName_provider></orgName_provider>
<persName></persName>
<placeName></placeName>
<ref_url></ref_url>
<ref_bibl></ref_bibl>
<bibl></bibl>
</unitex>
</namedEntities>
<ark>
<json:string>ark:/67375/M70-X79RBH0Z-6</json:string>
</ark>
<categories>
<wos>
<json:string>1 - science</json:string>
<json:string>2 - computer science, theory & methods</json:string>
<json:string>2 - computer science, interdisciplinary applications</json:string>
<json:string>2 - computer science, hardware & architecture</json:string>
</wos>
<scienceMetrix>
<json:string>1 - applied sciences</json:string>
<json:string>2 - information & communication technologies</json:string>
<json:string>3 - distributed computing</json:string>
</scienceMetrix>
<scopus>
<json:string>1 - Physical Sciences</json:string>
<json:string>2 - Computer Science</json:string>
<json:string>3 - Hardware and Architecture</json:string>
<json:string>1 - Physical Sciences</json:string>
<json:string>2 - Mathematics</json:string>
<json:string>3 - Theoretical Computer Science</json:string>
<json:string>1 - Physical Sciences</json:string>
<json:string>2 - Computer Science</json:string>
<json:string>3 - Software</json:string>
</scopus>
<inist>
<json:string>1 - sciences appliquees, technologies et medecines</json:string>
<json:string>2 - sciences exactes et technologie</json:string>
<json:string>3 - sciences et techniques communes</json:string>
<json:string>4 - sciences de l'information. documentation</json:string>
</inist>
</categories>
<publicationDate>2001</publicationDate>
<copyrightDate>2001</copyrightDate>
<doi>
<json:string>10.1177/109434200101500103</json:string>
</doi>
<id>10F89E9E8C955141EF1B7E305BA97BB38B570610</id>
<score>1</score>
<fulltext>
<json:item>
<extension>pdf</extension>
<original>true</original>
<mimetype>application/pdf</mimetype>
<uri>https://api.istex.fr/ark:/67375/M70-X79RBH0Z-6/fulltext.pdf</uri>
</json:item>
<json:item>
<extension>zip</extension>
<original>false</original>
<mimetype>application/zip</mimetype>
<uri>https://api.istex.fr/ark:/67375/M70-X79RBH0Z-6/bundle.zip</uri>
</json:item>
<istex:fulltextTEI uri="https://api.istex.fr/ark:/67375/M70-X79RBH0Z-6/fulltext.tei">
<teiHeader>
<fileDesc>
<titleStmt>
<title level="a" type="main" xml:lang="en">Performance Testing of a Parallel Multiblock CFD Solver</title>
</titleStmt>
<publicationStmt>
<authority>ISTEX</authority>
<publisher scheme="https://scientific-publisher.data.istex.fr">Sage Publications</publisher>
<pubPlace>Sage CA: Thousand Oaks, CA</pubPlace>
<availability>
<licence>
<p>sage</p>
</licence>
</availability>
<p scheme="https://loaded-corpus.data.istex.fr/ark:/67375/XBH-0J1N7DQT-B"></p>
<date>2001</date>
</publicationStmt>
<notesStmt>
<note type="research-article" scheme="https://content-type.data.istex.fr/ark:/67375/XTP-1JC4F85T-7">research-article</note>
<note type="journal" scheme="https://publication-type.data.istex.fr/ark:/67375/JMC-0GLKJH51-B">journal</note>
</notesStmt>
<sourceDesc>
<biblStruct type="inbook">
<analytic>
<title level="a" type="main" xml:lang="en">Performance Testing of a Parallel Multiblock CFD Solver</title>
<author xml:id="author-0000">
<persName>
<forename type="first">David</forename>
<surname>Kerlick</surname>
</persName>
<affiliation>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</affiliation>
</author>
<author xml:id="author-0001">
<persName>
<forename type="first">Eric</forename>
<surname>Dillon</surname>
</persName>
<affiliation>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</affiliation>
</author>
<author xml:id="author-0002">
<persName>
<forename type="first">David</forename>
<surname>Levine</surname>
</persName>
<affiliation>Rosetta Inpharmatics, Kirkland, Washington</affiliation>
</author>
<idno type="istex">10F89E9E8C955141EF1B7E305BA97BB38B570610</idno>
<idno type="ark">ark:/67375/M70-X79RBH0Z-6</idno>
<idno type="DOI">10.1177/109434200101500103</idno>
<idno type="article-id">10.1177_109434200101500103</idno>
</analytic>
<monogr>
<title level="j">The International Journal of High Performance Computing Applications</title>
<idno type="pISSN">1094-3420</idno>
<idno type="eISSN">1741-2846</idno>
<idno type="publisher-id">HPC</idno>
<idno type="PublisherID-hwp">sphpc</idno>
<imprint>
<publisher>Sage Publications</publisher>
<pubPlace>Sage CA: Thousand Oaks, CA</pubPlace>
<date type="published" when="2001-02"></date>
<biblScope unit="volume">15</biblScope>
<biblScope unit="issue">1</biblScope>
<biblScope unit="page" from="22">22</biblScope>
<biblScope unit="page" to="35">35</biblScope>
</imprint>
</monogr>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<creation>
<date>2001</date>
</creation>
<langUsage>
<language ident="en">en</language>
</langUsage>
<abstract xml:lang="en">
<p>A distributed-memory version of the OVERFLOW computational fluid dynamics code was evaluated on several parallel systems and compared with other approaches using test cases provided by the NASA-Boeing High-Speed Civil Transport program. A principal goal was to develop partitioning and load-balancing strategies that led to a reduction in computation time. We found multipartitioning, in which the aspect ratio of the multipartition is close to the aspect ratio of the grid zone, offered the best performance. The (uniprocessor) performance of the CRAY vector systems was superior to all other systems tested. However, the distributed-memory version when run on an SGI Origin system offers a price performance advantage over the CRAY vector systems. Performance on personal computer systems is promising but faces several hurdles.</p>
</abstract>
</profileDesc>
<revisionDesc>
<change when="2001-02">Published</change>
</revisionDesc>
</teiHeader>
</istex:fulltextTEI>
<json:item>
<extension>txt</extension>
<original>false</original>
<mimetype>text/plain</mimetype>
<uri>https://api.istex.fr/ark:/67375/M70-X79RBH0Z-6/fulltext.txt</uri>
</json:item>
</fulltext>
<metadata>
<istex:metadataXml wicri:clean="corpus sage not found" wicri:toSee="no header">
<istex:xmlDeclaration>version="1.0" encoding="UTF-8"</istex:xmlDeclaration>
<istex:docType PUBLIC="-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" URI="journalpublishing.dtd" name="istex:docType"></istex:docType>
<istex:document>
<article article-type="research-article" dtd-version="2.3" xml:lang="EN">
<front>
<journal-meta>
<journal-id journal-id-type="hwp">sphpc</journal-id>
<journal-id journal-id-type="publisher-id">HPC</journal-id>
<journal-title>The International Journal of High Performance Computing Applications</journal-title>
<issn pub-type="ppub">1094-3420</issn>
<publisher>
<publisher-name>Sage Publications</publisher-name>
<publisher-loc>Sage CA: Thousand Oaks, CA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.1177/109434200101500103</article-id>
<article-id pub-id-type="publisher-id">10.1177_109434200101500103</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Articles</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Performance Testing of a Parallel Multiblock CFD Solver</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" xlink:type="simple">
<name name-style="western">
<surname>Kerlick</surname>
<given-names>David</given-names>
</name>
<aff>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</aff>
</contrib>
</contrib-group>
<contrib-group>
<contrib contrib-type="author" xlink:type="simple">
<name name-style="western">
<surname>Dillon</surname>
<given-names>Eric</given-names>
</name>
<aff>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</aff>
</contrib>
</contrib-group>
<contrib-group>
<contrib contrib-type="author" xlink:type="simple">
<name name-style="western">
<surname>Levine</surname>
<given-names>David</given-names>
</name>
<aff>Rosetta Inpharmatics, Kirkland, Washington</aff>
</contrib>
</contrib-group>
<pub-date pub-type="ppub">
<month>02</month>
<year>2001</year>
</pub-date>
<volume>15</volume>
<issue>1</issue>
<fpage>22</fpage>
<lpage>35</lpage>
<abstract>
<p>A distributed-memory version of the OVERFLOW computational fluid dynamics code was evaluated on several parallel systems and compared with other approaches using test cases provided by the NASA-Boeing High-Speed Civil Transport program. A principal goal was to develop partitioning and load-balancing strategies that led to a reduction in computation time. We found multipartitioning, in which the aspect ratio of the multipartition is close to the aspect ratio of the grid zone, offered the best performance. The (uniprocessor) performance of the CRAY vector systems was superior to all other systems tested. However, the distributed-memory version when run on an SGI Origin system offers a price performance advantage over the CRAY vector systems. Performance on personal computer systems is promising but faces several hurdles.</p>
</abstract>
<custom-meta-wrap>
<custom-meta xlink:type="simple">
<meta-name>sagemeta-type</meta-name>
<meta-value>Journal Article</meta-value>
</custom-meta>
<custom-meta xlink:type="simple">
<meta-name>cover-date</meta-name>
<meta-value>Spring 2001</meta-value>
</custom-meta>
<custom-meta xlink:type="simple">
<meta-name>search-text</meta-name>
<meta-value> COMPUTING APPLICATIONS PERFORMANCE TESTING OF A PARALLEL MULTIBLOCK CFD SOLVER David Kerlick Eric Dillon MATHEMATICS AND COMPUTING TECHNOLOGY, BOEING COMPANY, SEATTLE, WASHINGTON David Levine ROSETTA INPHARMATICS, KIRKLAND, WASHINGTON Summary A distributed-memory version of the OVERFLOW compu- tational fluid dynamics code was evaluated on several par- allel systems and compared with other approaches using test cases provided by the NASA-Boeing High-Speed Civil Transport program. A principal goal was to develop parti- tioning and load-balancing strategies that led to a reduc- tion in computation time. We found multipartitioning, in which the aspect ratio of the multipartition is close to the aspect ratio of the grid zone, offered the best performance. The (uniprocessor) performance of the CRAY vector sys- tems was superior to all other systems tested. However, the distributed-memory version when run on an SGI Origin system offers a price performance advantage over the CRAY vector systems.Performance on personal computer systems is promising but faces several hurdles. 1 Introduction Computational fluid dynamics (CFD) calculations are used for calculating internal and external fluid flows, as well as the corresponding pressures, forces, and moments on aerodynamic surfaces. Traditionally, these calcula- tions have been run on vector machines. The trend in the past few years, however, has been toward multiple-pro- cessor RISC systems and, most recently, mass-market PC clusters. The problems of arranging parallel computa- tions on the multiple processors of these systems are, however, far from simple. In this paper, we present the re- sults of our work, developing an effective parallel-parti- tioning strategy for an important CFD code. The rest of this paper is organized as follows. Section 2 describes the OVERFLOW code, including partitioning strategies and parallel implementations. Section 3 de- scribes the test problems used and the computers they were tested on. Section 4 contains our computational re- sults. This includes a comparison of partitioning strate- gies, a comparison of two parallel programming models, and performance results from different systems. Last, Section 5 presents our conclusions. 2 The OVERFLOW Solver The OVERFLOW CFD code has been used for at least a decade for solving viscous compressible flow (Reynolds-averaged Navier-Stokes equations with turbu- lence model) and is a part of many aerodynamic design processes. OVERFLOW is based on the ARC3D implicit approximate factorization approach of Pulliam and Steger (1978) and the scalar pentadiagonal form of Pulliam and Chaussee (1981). These implicit schemes dramatically accelerate convergence at the cost of matrix inversions at each step. In ARC3D, the matrix equations are solved on a single grid that is a 3-D lattice of integers. The ratio of the lattice edge lengths (the integers Ni , Nj , and Nk ) is the grid aspect ratio Ni : Nj : Nk . Complex aerodynamic configurations may have mul- tiple components that require differing amounts of discretization. It may be difficult or impossible to map the flow domain using a single grid. One solution is to use multiple overlapping grids, each of which is mapped to a particular component in the overall design. Each of these grids, which we refer to as a grid zone, is a topological rectangular solid. The complete flow field is represented by the overlapped grid zones. Overlapped grids enable the solution of complex and multipart aerodynamic configu- rations. 
It allows one to add or change a component of a 22 COMPUTING APPLICATIONS The International Journal of High Performance Computing Applications, Volume 15, No. 1, Spring 2001, pp. 22-35 2001 Sage Publications, Inc. PARALLEL CFD PERFORMANCE Address reprint requests to David Kerlick, Mathematics and Com- puting Technology, Boeing Company, P.O. Box 3707 MS 7L-43, Seattle, WA 98124-2207, U.S.A.; e-mail: david.kerlick@ boeing.com. configuration without having to generate new grids for other components. Overlapped grid zones require interpolation in the re- gions of overlap between the zones at every iteration of the flow solver, the so-called Chimera scheme (Steger, Dougherty, and Benek, 1983). In the serial version of the code, the most recently updated flow variables from the current time step are used when available. This is roughly analogous, on a zonal level, to Gauss-Seidel iteration, and the convergence rate may depend on the order in which the zones are updated. F. Hatay (personal communication, 1999) expects the convergence to be faster, starting from the known conditions at the flow field boundaries, rather than from interzone boundaries. When the zones are pro- cessed in parallel, only the values from the last time step are known, and an analogy to Jacobi iteration may be made. In this case, convergence may not occur as rapidly, but the result is independent of the ordering. We have not performed anystudies of convergence in the present work. As the number of grid zones increases, the amount of computing time devoted to interpolation increases and may ultimately require a significant portion of the comput- ing time and be a candidate for further parallelization. For the present, the interpolation calculations have not been parallelized, only the flow solutions within the grid zones. One recent enhancement to OVERFLOW is the multigrid method (Jespersen, Pulliam, and Buning, 1997) of convergence acceleration. In a multigrid scheme, a coarsening of the grid is effected by taking every other grid point in each grid direction. Thus, on a scheme with L multigrid levels, only points in the fine mesh that are sepa- rated by multiples of 2L fine-mesh points remain in the coarsest mesh. If grids are used that do not have edges di- visible by 2L + 1, then asymmetries in the coarsened grids are possible, even when the fine grid is symmetric. It is therefore necessary that grid points that represent features (e.g., a leading-edge discontinuity in the wing) be pre- served in the coarsest grid. The multigrid method acceler- ates the convergence of the solution, although at the cost of significant coding complexity. OVERFLOW is under continuous revision. Most of our work was done using code from OVERFLOW 1.7v. Some later runs used code from the version 1.8 release. We do not expect a significant difference in runtime for the cases we have studied. 2.1 PARTITIONING AND LOAD BALANCING Partitioning the flow field is a natural way to decompose the problem domain for parallel solution. Because the flow PARALLEL CFD PERFORMANCE 23 "The OVERFLOW CFD code has been used for at least a decade for solving viscous compressible flow (Reynolds-averaged Navier-Stokes equations with turbulence model) and is a part of many aerodynamic design processes." field has already been divided into grid zones, the sim- plest approach assigns one processor to each zone. This approach, which we call coarse grain, was taken in Atwood and Smith (1995). 
The theoretical speedup of the coarse-grain approach is limited by the ratio of the total number of points in the flow field to the number of points in the largest grid zone. An improvement in efficiency can be obtained by coalesc- ing small grid zones and assigning them to the same pro- cessor. This allows the use of fewer processors, but the overall time is still limited by the time for a single proces- sor to process the largest grid zone. It is possible to add new zone boundaries to divide the zones artificially, but doing so forces the update between zones to be handled in an explicit, time-accurate way, which can slow conver- gence substantially. In most cases, the coarse-grain approach does not bal- ance the load on the processors very well. As a result, pro- cessors that have too few points to process are idle. To im- prove performance and to apply larger numbers of processors than the number of grid zones, it is necessary to divide the work on a single grid zone among multiple processors. This is a complex step, particularly for multigrid solutions, that we call fine grain. The simplest form of fine-grain partitioning, uniparti- tioning, partitions a grid zone in a single grid direction, usually the one with the largest grid dimension. A unipar- titioning scheme was implemented in the POVERFLOW (Ryan and Weeratunga, 1993) variant of OVERFLOW. A more general fine-grain partitioning scheme is multipartitioning, in which a grid zone is partitioned along multiple dimensions. Thus, to each grid zone of di- mensions Ni × Nj × Nk, a triplet of numbers, Pi.Pj.Pk, is as- signed so that a total of Pi × Pj × Pk processors are assigned to a grid zone. The total number of processors required for a problem is the sum of the processors assigned to each grid zone. In Smith, Van der Wijngaart, and Yarrow (1995), a multipartitioning method for alternating-direction im- plicit algorithms was developed. Hatay et al. (1997) de- veloped the idea of multipartitioning grid zones. This was first demonstrated on the aeroelastic code ENSAERO and later incorporated into OVERFLOW. Hatay et al. demon- strated speedups for non-multi-grid OVERFLOW calcu- lations. This version of the code allowed both coalescing of small grid zones and splitting large grid zones in multi- ple directions. To implement an implicit solve on parti- tioned zones, it is necessary for the processors working on a partition to have access to the adjacent grid planes being solved by other processors. These so-called ghost or halo points increase the effective amount of memory required in the parallel calculation (see, e.g., Hatay et al., 1997, Figure 1). Jespersen, Pulliam, and Buning (1997) extended Hatay et al.'s (1997) work to a multigrid version. Their implementation of multipartitioning ensures that the multigrid scheme is unaffected by the choice of partition. Symmetry boundary conditions and internal symmetry planes or axes have not been implemented yet. All the partitioning schemes we described employ static partitioning; that is, grid partitions are specified in the input file and are not changed during the run of the job. At present, dynamic repartitioning for load balancing during the execution of a job has not been attempted. 2.2 PARALLEL IMPLEMENTATIONS Since OVERFLOW requires supercomputer perfor- mance, it has traditionally been run on vector supercom- puters. 
More recently, however, several strategies have been investigated to make the best use of parallel hard- ware (Buning et al., 1995; Hatay et al., 1997; Jespersen, 1998; Taft, 1998). We classify these approaches accord- ing to the memory model they assume. 2.2.1 Shared Memory. The standard distribution of the OVERFLOW code includes parallel directives for CRAY vector systems and Silicon Graphics (SGI) Origin architectures. These directives assume a shared-memory programming model. The directives are used primarily around the inner loops of the matrix solves in each grid zone. Taft (1998) reported success with a multilevel parallel (MLP) approach. This approach takes advantage of the parallel directives for inner-loop parallelism and uses the UNIX fork() command at the outer level to support flow field partitioning. Both coalescing of small grid zones and multipartitioning of large grid zones are supported. 2.2.2 Distributed-Memory Models. Two distrib- uted-memory implementations of OVERFLOW are available. The first uses coarse-grain partitioning and was implemented using PVM.1 The second implementation uses fine-grain partitioning and is implemented using the message-passing interface (MPI). In this paper, we focus on the MPI implementation that, because of its portabil- ity, can be run on both shared- and distributed-memory systems. 24 COMPUTING APPLICATIONS 3 Test Environment 3.1 TEST PROBLEMS The test problems we used were taken from the NASA-Boeing High-Speed Civil Transport (HSCT) proj- ect. In the HSCT project, NASA was working with indus- try to outline a plan for developing long-lead, high-risk technologies that could form the foundation for an industry HSCT program launch. The projected market of more than 1000 HSCT aircraft between 2006 and 2020 was quite sub- stantial, and technology development was essential to en- able an environmentally compatible and economically via- ble HSCT aircraft. This project has recently been terminated due to certain high-risk technologies not matur- ing as originally planned. The test problems represent external aerodynamic con- figurations. Drag reduction is the main objective. To cor- rectly calculate drag, viscous phenomena such as separa- tion must be predicted, and a Navier-Stokes solution is required. Turbulence is predicted by an engineering model, here the model of Spalart and Almaras (1992). The Navier-Stokes runs are used to verify design runs done us- ing linear aerodynamics. These cases are typical of those that were running in production mode on a CRAY C-90 at NASA. The first test case is a wing-and-body combination (the "wingbody" problem). The total grid comprises 3,579,004 points in six grid zones. The largest grid zone contains ap- proximately 1.5 million points, and the smallest grid zone has approximately 54,000 points. Figure 1 shows the inte- rior boundaries of each grid zone. Table 1 contains the dimensions and number of points in each grid zone. For each grid zone, three grids are speci- fied: a fine grid, a medium grid, and a coarse grid. These three grids correspond to the three levels of multigrid used in the solver. PARALLEL CFD PERFORMANCE 25 Fig. 
1 Wingbody test case Table 1 Wingbody Grid Zones Grid Zone Fine Grid Number Points Medium Grid Number Points Coarse Grid Number Points 1 208 × 93 × 49 947,856 105 × 48 × 25 126,000 53 × 26 × 13 17,914 2 271 × 121 × 46 1,508,386 136 × 62 × 24 202,368 69 × 32 × 13 28,704 3 220 × 28 × 37 227,920 111 × 15 × 19 31,635 56 × 8 × 10 4,480 4 52 × 28 × 37 53,872 27 × 15 × 19 7,695 14 × 8 × 10 1,120 5 220 × 52 × 64 732,160 111 × 27 × 33 98,901 56 × 15 × 17 14,280 6 78 × 45 × 31 108,810 40 × 24 × 16 15,360 21 × 14 × 9 2,646 Total 3,579,004 481,959 69,144 The second test case adds a nacelle and diverter 2 to the wingbody data set (the "wingbody-nacelle-diverter" prob- lem), as illustrated in Figure 2. An additional 14 grid zones are used to overlap the nacelle and diverter for a total of 20 gridzones.The14additionalgridzonesaregiveninTable2. The totalgrid comprises 9,784,210 points in20grid zones. 3.2 PARALLEL HARDWARE We tested OVERFLOW on three types of computers: vec- tor supercomputers, parallel computers with RISC micro- processors, and PCs. 3.2.1 Vector System. We tested OVERFLOW on two CRAY vector systems. One was a 16-processor C-90 lo- cated at NASA Ames with 8 GB of memory. This machine was the regular production environment for the HSCT pro- gram's use of OVERFLOW. The second was an 8-proces- sor CRAY T-90 with 2 GB of memory located at Boeing. On both machines, testing was done using a single proces- sor in vector mode during regular production hours. 3.2.2 RISC Systems. The largest system we had ac- cess to was a 512-node CRAY T3E. The T3E is a distrib- uted-memory system with a 3-torus interconnect. Each T3E node consists of an Alpha 21164 processor 26 COMPUTING APPLICATIONS Table 2 Wingbody-Nacelle-Diverter Additional Grid Zones Grid Zone Fine Grid Number Points Medium Grid Number Points Coarse Grid Number Points 7 131 × 67 × 107 939,139 66 × 34 × 54 121,176 34 × 18 × 28 17,136 8 131 × 67 × 107 939,139 66 × 34 × 54 121,176 34 × 18 × 28 17,136 9 80 × 161 × 37 476,560 41 × 81 × 19 63,099 21 × 41 × 10 8,610 10 57 × 161 × 37 339,549 29 × 81 × 19 44,631 15 × 41 × 10 6,150 11 71 × 161 × 37 422,947 36 × 81 × 19 55,404 19 × 41 × 10 7,790 12 176 × 37 × 57 371,184 89 × 19 × 29 49,039 45 × 10 × 15 6,750 13 176 × 37 × 57 371,184 89 × 19 × 29 49,039 45 × 10 × 15 6,750 14 113 × 59 × 37 246,679 57 × 30 × 19 32,490 29 × 16 × 10 4,640 15 80 × 161 × 37 476,560 41 × 81 × 19 63,099 21 × 41 × 10 8,610 16 57 × 161 × 37 339,549 29 × 81 × 19 44,631 15 × 41 × 10 6,150 17 71 × 161 × 37 422,947 36 × 81 × 19 55,404 19 × 41 × 10 7,790 18 176 × 37 × 49 319,088 89 × 19 × 25 38,475 45 × 10 × 13 5,850 19 176 × 37 × 49 319,088 89 × 19 × 25 38,475 45 × 10 × 13 5,850 20 113 × 53 × 37 221,593 57 × 27 × 19 29,241 29 × 14 × 10 4,060 Total 9,784,210 1,287,338 182,416 Fig. 2 Wingbody-nacelle-diverter test case (300 MHz) and 128 MB of memory. Each processor has a primary cache of 8 KB and a secondary cache of 96 KB and can execute two instructions per clock cycle. We tested three SGI Origin 2000 machines. The Origin has a nonuniform memory access (NUMA) architecture and is based on the MIPS R10000 processor. The first ma- chine had 8 processors (195 MHz) and 2 GB of main memory. The second Origin had 64 processors (195 MHz) and 16 GB of memory. During our testing, this ma- chine was upgraded to 128 processors and 32 GB of mem- ory. The third Origin tested had 128 processors (250 MHz) and 32 GB of main memory. All three machines had 4 MB level-2 caches for each processor. 3.2.3 PC Systems. 
The smallest (and least expen- sive) system we tested was a Compaq 8000 symmetric multiprocessor (SMP) with four Intel Pentium Pro pro- cessors (200 MHz). Each processor had a 512 KB level-2 cache. The processors shared 4 GB memory and 23 GB of disk space. The operating system was Windows NT ver- sion 4.0. The Fortran compiler was Digital Fortran ver- sion 5.0. The MPI used was a beta version from GENIAS. We also tested two PC clusters. The first consisted of 32 Compaq 6000 machines. The second consisted of 64 Hewlett-Packard (HP) Kayak machines. Each machine is an SMP with dual Intel Pentium II processors (300 MHz in the HP Kayak, 333 MHz in the Compaq 6000) sharing a 512 KB level-2 cache, 512 MB memory, and 4 GB of disk space. The machines in each cluster were connected to each other using a high-speed Myrinet network. Each PC was running the Windows NT version 4.0 operating sys- tem. The compiler was Digital Fortran version 5.0. The MKS Toolkit3 was used to facilitate code porting. MPI-FM (Lauria and Chien, 1997), a version of MPI based on Fast Messages and optimized for Myrinet net- works, was used for communication. 4 Results The results are presented in three stages. First, in Section 4.1, we analyze the results of various partitioning strate- gies on a single grid zone. Next, in Section 4.2, the single grid zone results are used to examine load-balancing strategies for multiple zones. Last, the multizone results are used for performance comparisons of coarse-grain versus fine-grain partitioning, MPI versus MLP, and dif- ferent computing systems and are presented in Sections 4.3, 4.4, and 4.5, respectively. The results presented are from runs that restarted a computation from an existing checkpoint file, ran 10 multigrid iterations, and wrote a new checkpoint file. This is typical of production runs to convergence, which may take several hundred to several thousand iterations. The times given are the average time (over 10 iterations) each time step takes. It includes solver and interpolation time, message-passing overhead, and the time to calculate vari- ous metrics. It does not include the time to restart and checkpoint the computation. Except in Section 4.5, all timings are from SGI Origin systems. 4.1 SINGLE GRID ZONE PARTITIONING RESULTS The effectiveness of a partitioning scheme depends on two effects. First, the speed of calculation for a single grid zone depends on the number of processors assigned to the grid zone and how they divide the work. Second, the over- all load balance is affected every time the performance on a single grid zone is improved. To study the first effect, a series of runs was made on grid zone 2 (a wing) in isolation. Recall (see Table 1) that this is the largest grid zone with approximately 1.5 mil- lion grid points. The grid aspect ratio is 5.8 : 2.6 : 1. The first set of runs used a unipartitioning strategy, with parti- tioning occurring in the dimension with the largest num- ber of points. The second set of runs used multipartition- ing where the partitions were the integer triplets closest to the grid aspect ratio.4 Figure 3 contains the results from both sets of parti- tioning experiments. The horizontal axis is the number of processors used, and the vertical axis is the parallel speedup. The straight line is the ideal speedup. The solid curved line is the unipartitioning results. The dotted curved line is the multipartitioning results. For unipartitioning, each data point corresponds to a partition of np1.1, where np is the number of processors. 
For multipartitioning, the data points correspond to the parti- tions 4.4.1, 6.3.1, 5.4.1, 7.3.1, 6.4.1, and 8.4.1. The unipartitioning results show good speedup (90 or greater efficiency) for up to 7 processors. Between 8 and 32 processors, there is a steady decline in efficiency, with the largest speedup never surpassing 10. The reason for this is that as a grid zone edge is partitioned into smaller and smaller sets of points, the ghost points of a partition overlap partitions other than just the neighboring one, with a consequent increase in communication and de- crease in performance. Multipartitioning clearly outperforms unipartitioning, sometimes by as much as a factor of 2. It is important, however, that the aspect ratio of the multipartition be as close as possible to the aspect ratio of the grid zone. This implies that the regions sent to each processor are approx- PARALLEL CFD PERFORMANCE 27 imately cubic and, therefore, that their surface-to-volume ratio is minimal for rectangular partitions.5 Results from experiments using multipartitions that did not have a simi- lar aspect ratio to the grid zone were consistently inferior to those that did. For example, the (inferior) multipartition 1.3.6 had a speedup of only 7.9 compared with the speedup of 12.7 for the multipartition 3.6.1, or even compared with the speedup of 9.6 for the unipartition 18.1.1. 4.2 MULTIPLE GRID ZONE PARTITIONING RESULTS Having studied the effect of the partition choice on the largest grid zone, we next looked at the effect of these choices in the context of the complete solution for all grid zones. These studies were performed on the wingbody problem using 44 processors.6 First, we calculated the average number of points that should be assigned to each processor. We then coalesced the smallest grid zones until they added up to this number. Then, the remaining grid zones were divided among the re- 28 COMPUTING APPLICATIONS Fig. 3 Partitioning of grid zone 2 "Multipartitioning clearly outperforms unipartitioning, sometimes by as much as a factor of 2. It is important, however, that the aspect ratio of the multipartition be as close as possible to the aspect ratio of the grid zone." maining processors. Within each grid zone, we calculated the unipartition by assigning the processors for each grid zone to the longest grid dimension. Our results are given in Table 3. The unipartition u was used as the starting point. This partition is computed by minimizing the maximal number of points assigned to any processor. Next, several multipartitionings of grid zone 2 were tested. The rows u921, u631, and u136 in Ta- ble 3 correspond to the multipartitions 9.2.1, 6.3.1, and 1.3.6, respectively. As expected, the "good" multiparti- tions, u921 and u631, improve performance, while the "bad" multipartition, 1.3.6, degrades performance.7 Next, we tried repartitioning grid zones other than the largest.8 Partition e, based on reducing the maximal edge lengths, is derived from partition u631 by transferring a processor from grid zone 5 to grid zone 1. This does not improve the result significantly compared to partition u631. The parti- tion c, based on a common divisor of the edge length, per- formed poorly because grid zone 5 is starved for proces- sors. We thus arrived at a heuristic for multizone grid parti- tioning. First, calculate the average number of grid points to assign to each processor. Second, coalesce grid zones with less than the average number of grid points. 
Third, calculate a unipartition for the remaining points and zones. Fourth, examine the load balance for the uniparti- tion and repeat the previous steps if it is not satisfactory.9 Fifth, change the unipartitions into multipartitions, at- tempting to make the aspect ratio of the multipartitions as close as possible to the grid aspect ratio, avoiding symme- try planes or boundary conditions. Finally, examine the results and rebalance. The first step, achieving an optimal unipartitioning, is the most important. Fine-tuning the larger zones (here grid zone 2, which comprises about 28 of the points) can then yield significant improvements. Further tuning of smaller grid zones did not appreciably improve the results. 4.3 COARSE-GRAIN VERSUS FINE-GRAIN PARTITIONING COMPARISON A performance comparison between fine- and coarse- grain partitioning was performed to validate that the addi- tional coding effort to implement a fine-grain partitioning scheme was worthwhile. Both strategies were compared using 6 and 20 processors (the number of grid zones in the wingbody and wingbody-nacelle-diverter test prob- lems, respectively). All tests used the MPI version of OVERFLOW. For the coarse-grain tests, each grid zone was mapped to a processor. For the fine-grain test, the mean number of points per processor was calculated; zones that had fewer points than this average were coalesced, and zones that had twice or more than this number were partitioned ac- cording to the aspect ratio criterion above, which for these cases means partitioning along the largest grid zone dimension. Timing results are given in Table 4 (SGI Origin 195 MHz processors) and Table 5 (SGI Origin 250 MHz processors). The time to run the MPI code on a single pro- cessor is used as a baseline for calculating speedup and efficiency.10 For the coarse-grain strategy, we expect the maximal speedup to be approximately the ratio of total grid points to the number of grid points in the largest zone. For the wingbody problem, this is 2.37; for the wingbody-na- celle-diverter problem, it is 6.49. As can be seen in Tables 4 and 5, the speedups achieved in these two cases are close to the theoretical maximum. The fine-grain results are significantly better than the coarse-grain results on both test problems. The reason is the better load balancing that the fine-grain strategy achieves by coalescing and splitting grid zones. In addi- tion to runtime reduction, the better load balancing also manifests itself in higher processor use. 4.4 MPI VERSUS MLP COMPARISON The MPI and MLP versions of OVERFLOW both support co- alescing multiple grid zones onto a single processor and the allocation of multiple processors to compute on a sin- gle grid zone. In the MPI version, when multiple proces- sors are allocated to a single grid zone, a multipartitioning scheme is used to allocate subzones to each processor. In the MLP version, the multiple processors simultaneously execute the inner loops of the zonal solver. There are several significant differences between the two methods. The MPI version assumes a distrib- uted-memory programming model. Conversely, the MLP version uses two shared-memory constructs, the UNIX fork system call and parallel directives, for parallelism. The MPI version requires approximately 25,000 addi- tional lines of code (compared with the serial version) to implement but can be run on SMP, NUMA, massively parallel processors (MPPs), and PC clusters. 
By contrast, the MLP version only uses an additional 300 lines of code but runs only on SMP and NUMA systems. Two attributes influence the performance of the meth- ods. The first is how evenly the work is distributed when PARALLEL CFD PERFORMANCE 29 multiple processors are applied to a single zone. In the MPI version, multipartitioning maps approximately equal amounts of work (subdomains) to each processor. In the MLP version, the multiple processors execute only the sec- tions of code that contain parallel loops. The second attrib- ute is the computational overhead associated with each method. The MPI version incurs message-passing over- head whenever data are exchanged. The MLP version in- curs shared-memory synchronization overheads whenever a parallel loop is encountered. Tables 6 and 7 contain timings that compare the perfor- mance of the two methods on an SGI Origin (250 MHz processors). For the wingbody case, the MPI version is sig- nificantly faster on 8 and 16 processors, slightly faster on 32 and 64 processors, and significantly slower on 128 pro- cessors. It appears that the MPI version has better load-bal- ancing characteristics due to a more equal distribution of work to the processors. However, as the number of proces- sors grows, the message-passing overhead associated with the larger number of partitions mitigates this advantage. For the wingbody-nacelle-diverter case, except for one anomalous result,11 the two versions perform similarly up to 64 processors. On 128 processors, the MLP version out- performs the MPI version. We believe the increased granu- larity of work resulting from the larger problem size is the reason that the MLP version performs similarly to the MPI version on smaller numbers of processors. Once again, however, the increased message-passing overhead associ- ated with the larger number of partitions degrades the per- formance of the MPI version with 128 processors. 4.5 MACHINE COMPARISON We tested OVERFLOW on several different machines. The sequential version was run on a CRAY C-90 and a CRAY T-90. These machines represent the vector super- 30 COMPUTING APPLICATIONS Table 3 Multizone Partitioning of Wingbody with 44 Processors Heuristic Zone 1 Zone 2 Zone 3 Zone 4 Zone 5 Zone 6 Time u 11.1.1 18.1.1 3.1.1 1.1.1 9.1.1 2.1.1 10.6 u921 11.1.1 9.2.1 3.1.1 1.1.1 9.1.1 2.1.1 8.8 u631 11.1.1 6.3.1 3.1.1 1.1.1 9.1.1 2.1.1 9.2 u136 11.1.1 1.3.6 3.1.1 1.1.1 9.1.1 2.1.1 12.9 e 12.1.1 6.3.1 3.1.1 1.1.1 4.1.2 2.1.1 9.1 c 10.1.1 7.3.1 5.1.1 1.1.1 5.1.1 2.1.1 13.7 Table 4 Wingbody Timings (seconds): Coarse Grain versus Fine Grain NP Strategy Time Speedup Efficiency 1 Sequential 289.2 (1) (100) 6 Coarse grain 139.8 2.1 34.5 6 Fine grain 75.2 3.8 64.1 Table 5 Wingbody-Nacelle-Diverter Timings (seconds): Coarse Grain versus Fine Grain NP Strategy Time Speedup Efficiency 1 Sequential 725.3 (1) (100) 20 Coarse grain 110.5 6.6 32.8 20 Fine grain 40.2 18.0 90.2 Table 6 Wingbody Timings (seconds): Message- Passing Interface (MPI) versus Multilevel Par- allel (MLP) Approach NP MPI MLP 8 31.4 44.8 16 15.9 26.2 32 13.8 15.4 64 8.7 11.2 128 10.2 8.7 computers OVERFLOW was originally optimized for. The MPI version was run on a CRAY T3E, SGI Origin, and sev- eral PC systems. The CRAY T3E and SGI Origin 12 are rep- resentative of the MPP systems that parallel versions of OVERFLOW are run on. The PC systems represent a pos- sible low-cost (and increasingly high performance) alter- native platform for running OVERFLOW. Details of the systems tested were given in Section 3.2. 
The results for the wingbody and wingbody-nacelle- diverter test problems are given in Tables 8 and 9, respec- tively. The minimal number of nodes needed to run these problems was 8 on the Compaq 6000 and HP Kayak and 29 on the CRAY T3E. The Compaq 6000 results for 64 pro- cessors and the HP Kayak results for 128 processors were run using both processors in each PC. All other runs used only one of the processors in each PC. The CRAY C-90 and T-90 systems significantly outper- formed the other systems tested. The T-90 was approxi- mately 50 times faster than the C-90 and an order of magni- tude faster than the next fastest system, the SGI Origin. One factor reflected in this performance gap is the maturity and performance tuning in the versions of OVERFLOW that were run. The sequential version had been highly opti- mized for the vector architecture of the CRAYs, while the MPI version was still under development during our test- ing and had not been optimized for cache architectures. Of the other systems we tested, the SGI Origin was the fastest. For all but one case, it was a factor of 2 to 4 faster than the PC clusters and CRAY T3E.13 For most cases, the Origin was as fast on 64 processors as the CRAY T3E was on 128 or more processors. The CRAY T3E had several significant limitations. The most severe was the relatively small amount of memory PARALLEL CFD PERFORMANCE 31 Table 7 Wingbody-Nacelle-Diverter Timings (seconds): Message-Passing Interface (MPI) versus Multi- level Parallel (MLP) Approach NP MPI MLP 8 128.6 98.4 16 53.7 56.1 32 30.6 31.1 64 17.0 18.2 128 15.2 11.7 Table 8 Wingbody Timings (seconds): Machine Comparison Number of Processors Compaq 6000 CRAY C-90 CRAY T-90 CRAY T3E HP Kayak SGI Origin 1 38.0 25.7 245.6 2 171.4 4 120.7 8 80.2 86.4 31.4 16 42.1 44.7 15.9 32 29.8 59.2 29.9 13.8 64 19.6 14.3 15.6 8.7 128 8.2 15.7 10.2 256 7.1 512 8.5 (and cache) on each processor, coupled with a lack of vir- tual memory. This meant that the minimal sizes for the two test cases were 29 and 70 processors, respectively. Other difficulties were poor per node performance and the un- usual floating-point representation. The performance of the PC clusters was surprisingly good. When "enough" level-2 cache was available (i.e., 32 or more processors), performance was half that of the SGI Origin. When sufficient cache was not available, perfor- mance degraded to a third of the SGI Origin. One unex- pected result was that the HP Kayak (300 MHz processors) PC cluster outperformed the Compaq 6000 (333 MHz pro- cessors) PC cluster. A likely explanation is that there were performance problems with the PCI chip set used in the Compaq 6000. The performance of the Compaq 8000 was limited in two ways. First, it used an older Pentium Pro processor, which only ran at 200 MHz. Second, its SMP architecture does not provide enough memory bandwidth to allow more than one processor to run at full speed. Nevertheless, the Compaq 8000 was still able to run the wingbody-nacelle- diverter test case on a single processor at one-fourth the performance of an SGI Origin using one processor. Superlinear speedup due to cache effects was observed on all systems except the Compaq 8000. For the Compaq 6000 and HP Kayak clusters, this occurred on up to 32 pro- cessors on the wingbody-nacelle-diverter problem. Be- yond 32 processors, the pieces of the partitioned problem were small enough to fit within each processor's cache. 
Table 9. Wingbody-Nacelle-Diverter Timings (seconds): Machine Comparison

Number of Processors   Compaq 6000   Compaq 8000   CRAY C-90   CRAY T-90   CRAY T3E   HP Kayak   SGI Origin
1                                    2531          111.2       75.1                              725.3
2                                    1686                                                        359.2
4                                    1345                                                        224.3
8                      1469.0                                                        428.4       128.6
16                      387.0                                                        176.7        53.7
32                       67.0                                                         74.6        30.6
64                       43.7                                                         35.4        17.0
128                                                                        17.9       28.0        15.2
256                                                                        14.7
512                                                                        12.8

For the SGI Origin, a small superlinear effect was noticeable going from 8 to 16 processors on the wingbody-nacelle-diverter test case and from 4 to 8 processors on the wingbody test case. For the CRAY T3E, a superlinear speedup effect was observed on the wingbody test case when going from 32 to 64 processors.

Most systems showed good speedup until a threshold number of processors was reached. For the CRAY T3E, this happened above 128 processors on both test problems. For the SGI Origin, this happened at 16 and 64 processors for the wingbody and wingbody-nacelle-diverter test cases, respectively. For these two systems, we attribute the performance degradation to the increased message-passing overhead.

For the PC clusters, performance degraded going from 32 to 64 processors on the Compaq 6000 and going from 64 to 128 processors on the HP Kayak. These cases correspond with the transition from computing with a single processor per PC to computing with both processors in a PC. The memory bandwidth limitation that results when both processors share the memory bus is responsible for the performance degradation observed. A similar memory bandwidth limitation affected the Compaq 8000.

5 Conclusions

Both the MLP and MPI versions of OVERFLOW offer promising approaches for taking advantage of parallel hardware. Both methods support coalescing multiple grid zones onto a single processor and the allocation of multiple processors to compute on a single grid zone. The MPI version performed better on smaller numbers of processors, and the MLP version performed better on larger numbers of processors. While we believe the MPI version has better overall load-balancing characteristics, for a fixed-size problem, this is mitigated by the overhead of message passing beyond some threshold number of processors.

There are several advantages to the MLP approach. First, only minimal changes to the sequential source code are necessary. Second, the dual levels of parallelism support good processor utilization. Third, the shared-memory implementation enables good scalability. The primary disadvantage of the MLP approach is portability; it will not run on distributed-memory architectures (including PC clusters) or under non-UNIX operating systems.

The MPI version has two important advantages. First, it will run on a variety of architectures, including SMPs, NUMAs, MPPs, and PC clusters. Second, the MPI version has good load-balancing characteristics. The disadvantages of the MPI version are the extensive changes to source code required for implementation and the overhead of message passing on large numbers of processors.

To use the MPI version effectively, a good partitioning strategy must be chosen. Simple strategies such as assigning a grid zone to each processor lead to poor load balancing and performance.
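The message-passing cost that sets this threshold comes from the boundary (halo) exchanges every partition performs on each iteration: more partitions mean more, and smaller, messages whose cost is pure overhead. The sketch below is a minimal illustration in C with MPI, not OVERFLOW's actual Fortran implementation; the array length, iteration count, and the one-dimensional decomposition are illustrative assumptions.

/* Illustrative sketch (not OVERFLOW): a 1-D domain decomposition with a
 * one-cell halo exchanged each iteration via MPI_Sendrecv. Every such
 * exchange is message-passing overhead of the kind discussed above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NLOCAL 1000   /* interior points owned by each rank (assumed size) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* u[0] and u[NLOCAL+1] are halo cells filled from the neighbors. */
    double *u = calloc(NLOCAL + 2, sizeof(double));
    for (int i = 1; i <= NLOCAL; i++) u[i] = (double)rank;

    for (int step = 0; step < 100; step++) {
        /* Exchange boundary values with both neighbors (the overhead). */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Local relaxation sweep over the interior points (the useful work). */
        for (int i = 1; i <= NLOCAL; i++)
            u[i] = 0.5 * (u[i - 1] + u[i + 1]);
    }

    if (rank == 0) printf("done on %d ranks\n", size);
    free(u);
    MPI_Finalize();
    return 0;
}

In a real multiblock solver the exchanges are three-dimensional and per-face, but the pattern of communicating halos and then sweeping the local interior is the same, which is why a fixed-size problem cut into ever more partitions eventually spends more time exchanging than computing.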
While unipartitioning can yield acceptable parallel performance, multipartitioning leads to significantly better results. It is important, however, to be sure that the aspect ratio of the multipartition is close to the aspect ratio of the grid zone.

The CRAY T-90 was the fastest machine we tested. Its uniprocessor performance was an order of magnitude faster than that of any nonvector system we tested. One reason for this is that the sequential version has been highly optimized for CRAY's vector architecture. The SGI Origin was the next fastest machine we tested. It had the advantage that its NUMA architecture supported both the MPI and MLP versions of OVERFLOW. The CRAY T3E had several significant limitations, including a small amount of memory and cache on each processor, no virtual memory, poor per node performance, and an unusual floating-point representation.

The performance of the PC clusters was surprisingly good. When the aggregate level-2 cache was of sufficient size, performance was half that of the SGI Origin using a comparable number of processors. The high performance Myrinet network played a key role in enabling this performance. The performance of the large-memory Compaq 8000 was notable for being able to run the wingbody-nacelle-diverter problem on a single PC processor. This is likely one of the largest numerical problems ever run on a PC.

A significant limitation we noticed with SMP PCs is a degradation in performance due to memory bandwidth limitations. On the dual-processor Compaq 6000 and HP Kayak PCs, we noted a degradation in performance that coincided with the transition from computing with a single processor per PC to computing with both processors in a PC. A similar problem manifested itself in the quad-processor Compaq 8000, which did not provide enough memory bandwidth to allow more than one processor to run at full speed.

ACKNOWLEDGMENTS

Thanks are due Pieter Buning of NASA for the serial version of OVERFLOW and to Dennis Jespersen of NASA for his unstinting help in debugging and testing his MPI version. We thank Ferhat Hatay for discussions about multipartitioning and the SGI Origin architecture. Stephen Chaney of the Boeing HSCT group supplied the production runs on which these benchmarks are based, and Anutosh Moitra of that group also contributed some additional runs and test cases. Jim Taft graciously ran our cases on his MLP version of OVERFLOW. Joel Hirsh ran our test cases on the CRAY T-90 at Boeing. We acknowledge a grant of computing time on the CRAY T3E and SGI Origin systems from NASA Ames Research Center. We thank the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign for access to their NT SuperCluster system.

BIOGRAPHIES

David Kerlick graduated with a degree in physics from Rensselaer Polytechnic Institute in 1970. He received a Ph.D. in theoretical physics from Princeton University in 1975 and worked in general relativity until 1979; in computational fluid dynamics, computational geometry, and electromagnetics until 1986; and thereafter in scientific visualization, parallel computing, and computer graphics. He has worked for Nielson Engineering, NASA Ames, Tektronix, and (since 1991) for the Boeing Company. His areas of interest are scientific, engineering, and information visualization; parallel and distributed high performance computing; and Web applications using modular servers and 3-D clients.
Eric Dillon received his master's degree in computer science in 1993 from ESIAL (Computer Science School) at University H. Poincaré, Nancy, France. He was a nonpermanent researcher at LORIA (Nancy, France) while preparing his Ph.D. in computer science and graduated in 1997 from University H. Poincaré. His main field of interest includes high performance computing related to both parallel scientific applications (computing intensive) and business applications (transaction-based applications on distributed architectures). He is now a researcher at Boeing in the Mathematics and Computing Technology Division in Seattle and is mainly involved in performance evaluation, simulation, and prediction for distributed applications.

David Levine is a computational biologist at Rosetta Inpharmatics in Kirkland, Washington, where he develops algorithms for the analysis of gene expression data. He received a Ph.D. degree in computer science from the Illinois Institute of Technology. He has previously worked at Control Data Corporation, Argonne National Laboratory, and the Boeing Company. He is the developer of the PGAPack parallel genetic algorithm library. His research interests are in computational biology, genetic algorithms, parallel computing, scientific applications, and performance evaluation.

NOTES

1. The choice of PVM is historical. If the work were done today, message-passing interface (MPI) would be used.

2. A nacelle is a faired engine housing. A diverter is used for directing airflow.

3. A commercial product that provides many UNIX commands for use in a Windows NT environment.

4. For example, 6.3.1 is the integer triplet closest to the grid aspect ratio of 5.8 : 2.6 : 1 for 18 processors.

5. This confirms earlier work. For example, Reed, Adams, and Patrick (1987) showed that in two dimensions, the efficiency of a partition is given by the ratio of points communicated to points computed, or perimeter-to-area ratio. These results generalize in a straightforward way to three dimensions.

6. This number of processors was originally required by a partitioning scheme, identified as c in Table 3, which divided each grid dimension by a common divisor L that is roughly the cube root of Npts/NP. Here, Npts is the total number of points in the problem, and NP is the number of processors used. For the choice L = 38, we obtain a requirement for 44 processors.

7. We obtained similar results for the wingbody-nacelle-diverter case on 120 processors, of which the 44-processor wingbody case is a proper subset, in which refining partitions toward the grid aspect ratios (6.3.1 and 9.2.1 in grid zone 2) improves overall performance, but poor partitioning, such as 1.3.6, degrades performance.

8. At the time we did this work, multipartitioning was not available for grid zone 1, which has a singular axis in the grid.

9. Here and also in the sixth step, we recommend that the user make a short OVERFLOW run (≈10 iterations), examine the printout of processor idle time, and try to hand-tune the partitions by reallocating processors to other grid zones if necessary. The intent is that a small savings in solution time could make an appreciable difference when run for thousands of iterations.

10. The MPI code run on a single processor is a reasonable approximation of the time to run the serial code.

11. The multilevel parallel version outperforms the MPI version on eight processors.

12. The SGI Origin system used in Tables 8 and 9 had 250 MHz processors.
13. The wingbody-nacelle-diverter problem is only 18% faster on the SGI Origin than on the CRAY T3E when using 128 processors.

REFERENCES

Atwood, C., and M. Smith. 1995. Nonlinear fluid computations in a distributed environment. In AIAA Proceedings, Paper No. 95-0224.

Buning, P., M. Smith, J. Ryan, C. Atwood, K. Chawla, and S. Weeratunga. 1995. OVERFLOW-Navier-Stokes CFD [Online]. Available: http://esdcd.gsfc.nasa.gov/ESS/annual.reports/ess95contents/app.jnnie.html.

Hatay, F., D. Jespersen, G. Guruswamy, Y. Rizk, C. Byun, and K. Gee. 1997. A multi-level parallelization concept for high-fidelity multi-block solvers. In Supercomputing 97 Proceedings. San Jose, CA: Association for Computing Machinery.

Jespersen, D. 1998. Parallelism and OVERFLOW. NASA Ames Research Center, NAS-98-013 [Online]. Available: http://www.nas.nasa.gov/Research/Reports/Techreports/1998/nas-98-013-abstract.html.

Jespersen, D., T. Pulliam, and P. Buning. 1997. Recent enhancements to OVERFLOW. In AIAA Proceedings, Paper No. 97-0644.

Lauria, M., and A. Chien. 1997. MPI-FM: High performance MPI on workstation clusters. Journal of Parallel and Distributed Computing 40 (1): 4-18.

Pulliam, T., and D. Chaussee. 1981. A diagonal form of an implicit approximate factorization algorithm. Journal of Computational Physics 39:347.

Pulliam, T., and J. Steger. 1978. On implicit finite-difference simulations of three-dimensional flows. In AIAA Proceedings, Paper No. 78-0010.

Reed, D., L. Adams, and M. Patrick. 1987. Stencils and problem partitionings: Their influence on the performance of multiple processor systems. IEEE Transactions on Computers C-36 (7): 845-858.

Ryan, J., and S. Weeratunga. 1993. Parallel computation of 3-D Navier-Stokes flowfields for supersonic vehicles. In AIAA Proceedings, Paper No. 93-0064.

Smith, M., R. Van der Wijngaart, and M. Yarrow. 1995. Improved multi-partition method for line-based iteration schemes. In Computational Aerosciences Workshop 95.

Spalart, P., and S. Allmaras. 1992. A one-equation turbulence model for aerodynamic flows. In AIAA Proceedings, Paper No. 92-0439.

Steger, J., F. Dougherty, and J. Benek. 1983. A chimera grid scheme. Advances in Grid Generation 5:59-69.

Taft, J. 1998. OVERFLOW gets excellent results on SGI Origin 2000. NASnews 3 (1) [Online]. Available: http://science.nas.nasa.gov/Pubs/NASnews/98/01/overflow.html.</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<back>
<notes>
<p>
<list list-type="order">
<list-item>
<p>1. The choice of PVM is historical. If the work were done today, message-passing interface (MPI) would be used.</p>
</list-item>
<list-item>
<p>2. A nacelle is a faired engine housing. A diverter is used for directing airflow.</p>
</list-item>
<list-item>
<p>3. A commercial product that provides many UNIX commands for use in a Windows NT environment.</p>
</list-item>
<list-item>
<p>4. For example, 6.3.1 is the integer triplet closest to the grid aspect ratio of 5.8 : 2.6 : 1 for 18 processors.</p>
</list-item>
<list-item>
<p>5. This confirms earlier work. For example, Reed, Adams, and Patrick (1987) showed that in two dimensions, the efficiency of a partition is given by the ratio of points communicated to points computed, or perimeter-to-area ratio. These results generalize in a straightforward way to three dimensions.</p>
</list-item>
<list-item>
<p>6. This number of processors was originally required by a partitioning scheme, identified as c in Table 3, which divided each grid dimension by a common divisor L that is roughly the cube root of Npts/NP. Here, Npts is the total number of points in the problem, and NP is the number of processors used. For the choice L = 38, we obtain a requirement for 44 processors.</p>
</list-item>
<list-item>
<p>7. We obtained similar results for the wingbody-nacelle-diverter case on 120 processors, of which the 44-processor wingbody case is a proper subset, in which refining partitions toward the grid aspect ratios (6.3.1 and 9.2.1 in grid zone 2) improves overall performance, but poor partitioning, such as 1.3.6, degrades performance.</p>
</list-item>
<list-item>
<p>8. At the time we did this work, multipartitioning was not available for grid zone 1, which has a singular axis in the grid.</p>
</list-item>
<list-item>
<p>9. Here and also in the sixth step, we recommend that the user make a short OVERFLOW run (≈ 10 iterations), examine the printout of processor idle time, and try to hand-tune the partitions by reallocating processors to other grid zones if necessary. The intent is that a small savings in solution time could make an appreciable difference when run for thousands of iterations.</p>
</list-item>
<list-item>
<p>10. The MPI code run on a single processor is a reasonable approximation of the time to run the serial code.</p>
</list-item>
<list-item>
<p>11. The multilevel parallel version outperforms the MPI version on eight processors.</p>
</list-item>
<list-item>
<p>12. The SGI Origin system used in Tables 8 and 9 had 250 MHz processors.</p>
</list-item>
<list-item>
<p>13. The wingbody-nacelle-diverter problem is only 18% faster on the SGI Origin than on the CRAY T3E when using 128 processors.</p>
</list-item>
</list>
</p>
</notes>
<ref-list>
<ref>
<citation citation-type="confproc" xlink:type="simple">
<name name-style="western">
<surname>Atwood, C.</surname>
</name>
, and
<name name-style="western">
<surname>M. Smith</surname>
</name>
. 1995.
<article-title>Nonlinear fluid computations in a distributed environment</article-title>
. In
<conf-name>AIAA Proceedings</conf-name>
, Paper No. 95-0224.</citation>
</ref>
<ref>
<citation citation-type="book" xlink:type="simple">
<name name-style="western">
<surname>Buning, P.</surname>
</name>
,
<name name-style="western">
<surname>M. Smith</surname>
</name>
,
<name name-style="western">
<surname>J. Ryan</surname>
</name>
,
<name name-style="western">
<surname>C. Atwood</surname>
</name>
,
<name name-style="western">
<surname>K. Chawla</surname>
</name>
, and
<name name-style="western">
<surname>S. Weeratunga</surname>
</name>
.
<year>1995</year>
.
<source>OVERFLOW-Navier-Stokes CFD</source>
[On-line]. Available:
<uri xlink:type="simple">http://esdcd.gsfc.nasa.gov/ESS/annual.reports/ess95contents/app.jnnie.html.</uri>
</citation>
</ref>
<ref>
<citation citation-type="confproc" xlink:type="simple">
<name name-style="western">
<surname>Hatay, F.</surname>
</name>
,
<name name-style="western">
<surname>D. Jespersen</surname>
</name>
,
<name name-style="western">
<surname>G. Guruswamy</surname>
</name>
,
<name name-style="western">
<surname>Y. Rizk</surname>
</name>
,
<name name-style="western">
<surname>C. Byun</surname>
</name>
, and
<name name-style="western">
<surname>K. Gee</surname>
</name>
. 1997.
<article-title>A multi-level parallelization concept for high-fidelity multi-block solvers</article-title>
. In
<conf-name>Supercomputing 97 Proceedings</conf-name>
.
<conf-loc>San Jose, CA: Association for Computing Machinery</conf-loc>
.</citation>
</ref>
<ref>
<citation citation-type="book" xlink:type="simple">
<name name-style="western">
<surname>Jespersen, D.</surname>
</name>
<year>1998</year>
.
<source>Parallelism and OVERFLOW</source>
. NASA Ames Research Center, NAS-98-013 [Online]. Available:
<uri xlink:type="simple">http://www.nas.nasa.gov/Research/Reports/Techreports/1998/nas-98-013-abstract.html.</uri>
</citation>
</ref>
<ref>
<citation citation-type="confproc" xlink:type="simple">
<name name-style="western">
<surname>Jespersen, D.</surname>
</name>
,
<name name-style="western">
<surname>T. Pulliam</surname>
</name>
, and
<name name-style="western">
<surname>P. Buning</surname>
</name>
. 1997.
<article-title>Recent enhancements to OVERFLOW</article-title>
. In
<conf-name>AIAA Proceedings</conf-name>
, Paper No. 97-0644.</citation>
</ref>
<ref>
<citation citation-type="journal" xlink:type="simple">
<name name-style="western">
<surname>Lauria, M.</surname>
</name>
, and
<name name-style="western">
<surname>A. Chien</surname>
</name>
.
<year>1997</year>
.
<article-title>MPI-FM: High performance MPI on workstation clusters</article-title>
.
<source>Journal of Parallel and Distributed Computing</source>
<volume>40</volume>
(
<issue>1</issue>
):
<fpage>4</fpage>
-
<lpage>18</lpage>
.</citation>
</ref>
<ref>
<citation citation-type="journal" xlink:type="simple">
<name name-style="western">
<surname>Pulliam, T.</surname>
</name>
, and
<name name-style="western">
<surname>D. Chaussee</surname>
</name>
.
<year>1981</year>
.
<article-title>A diagonal form of an implicit approximate factorization algorithm</article-title>
.
<source>Journal of Computational Physics</source>
<volume>39</volume>
:
<fpage>347</fpage>
.</citation>
</ref>
<ref>
<citation citation-type="confproc" xlink:type="simple">
<name name-style="western">
<surname>Pulliam, T.</surname>
</name>
, and
<name name-style="western">
<surname>J. Steger</surname>
</name>
. 1978.
<article-title>On implicit finite-difference simulations of three-dimensional flows</article-title>
. In
<conf-name>AIAA Proceedings</conf-name>
, Paper No. 78-0010.</citation>
</ref>
<ref>
<citation citation-type="journal" xlink:type="simple">
<name name-style="western">
<surname>Reed, D.</surname>
</name>
,
<name name-style="western">
<surname>L. Adams</surname>
</name>
, and
<name name-style="western">
<surname>M. Patrick</surname>
</name>
.
<year>1987</year>
.
<article-title>Stencils and problem partitionings: Their influence on the performance of multiple processor systems</article-title>
.
<source>IEEE Transactions on Computers</source>
<volume>C-36</volume>
(
<issue>7</issue>
):
<fpage>845</fpage>
-
<lpage>858</lpage>
.</citation>
</ref>
<ref>
<citation citation-type="confproc" xlink:type="simple">
<name name-style="western">
<surname>Ryan, J.</surname>
</name>
, and
<name name-style="western">
<surname>S. Weeratunga</surname>
</name>
. 1993.
<article-title>Parallel computation of 3-D Navier-Stokes flowfields for supersonic vehicles</article-title>
.
<conf-name>AIAA Proceedings</conf-name>
, Paper No. 93-0064.</citation>
</ref>
<ref>
<citation citation-type="confproc" xlink:type="simple">
<name name-style="western">
<surname>Smith, M.</surname>
</name>
,
<name name-style="western">
<surname>R. Van der Wijngaart</surname>
</name>
, and
<name name-style="western">
<surname>M. Yarrow</surname>
</name>
. 1995.
<article-title>Improved multi-partition method for line-based iteration schemes</article-title>
. In
<conf-name>Computational Aerosciences Workshop 95</conf-name>
.</citation>
</ref>
<ref>
<citation citation-type="confproc" xlink:type="simple">
<name name-style="western">
<surname>Spalart, P.</surname>
</name>
, and
<name name-style="western">
<surname>S. Allmaras</surname>
</name>
. 1992.
<article-title>A one-equation turbulence model for aerodynamic flows</article-title>
. In
<conf-name>AIAA Proceedings</conf-name>
, Paper No. 92-0439.</citation>
</ref>
<ref>
<citation citation-type="journal" xlink:type="simple">
<name name-style="western">
<surname>Steger, J.</surname>
</name>
,
<name name-style="western">
<surname>F. Dougherty</surname>
</name>
, and
<name name-style="western">
<surname>J. Benek</surname>
</name>
.
<year>1983</year>
.
<article-title>A chimera grid scheme</article-title>
.
<source>Advances in Grid Generation</source>
<volume>5</volume>
:
<fpage>59</fpage>
-
<lpage>69</lpage>
.</citation>
</ref>
<ref>
<citation citation-type="journal" xlink:type="simple">
<name name-style="western">
<surname>Taft, J.</surname>
</name>
<year>1998</year>
.
<article-title>OVERFLOW gets excellent results on SGI Origin 2000</article-title>
.
<source>NASnews</source>
<volume>3</volume>
(
<issue>1</issue>
) [Online]. Available:
<uri xlink:type="simple">http://science.nas.nasa.gov/Pubs/NASnews/98/01/overflow.html.</uri>
</citation>
</ref>
</ref-list>
</back>
</article>
</istex:document>
</istex:metadataXml>
<mods version="3.6">
<titleInfo lang="en">
<title>Performance Testing of a Parallel Multiblock CFD Solver</title>
</titleInfo>
<titleInfo type="alternative" lang="en" contentType="CDATA">
<title>Performance Testing of a Parallel Multiblock CFD Solver</title>
</titleInfo>
<name type="personal">
<namePart type="given">David</namePart>
<namePart type="family">Kerlick</namePart>
<affiliation>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</affiliation>
</name>
<name type="personal">
<namePart type="given">Eric</namePart>
<namePart type="family">Dillon</namePart>
<affiliation>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</affiliation>
</name>
<name type="personal">
<namePart type="given">David</namePart>
<namePart type="family">Levine</namePart>
<affiliation>Rosetta Inpharmatics, Kirkland, Washington</affiliation>
</name>
<typeOfResource>text</typeOfResource>
<genre type="research-article" displayLabel="research-article" authority="ISTEX" authorityURI="https://content-type.data.istex.fr" valueURI="https://content-type.data.istex.fr/ark:/67375/XTP-1JC4F85T-7">research-article</genre>
<originInfo>
<publisher>Sage Publications</publisher>
<place>
<placeTerm type="text">Sage CA: Thousand Oaks, CA</placeTerm>
</place>
<dateIssued encoding="w3cdtf">2001-02</dateIssued>
<copyrightDate encoding="w3cdtf">2001</copyrightDate>
</originInfo>
<language>
<languageTerm type="code" authority="iso639-2b">eng</languageTerm>
<languageTerm type="code" authority="rfc3066">en</languageTerm>
</language>
<abstract lang="en">A distributed-memory version of the OVERFLOW computational fluid dynamics code was evaluated on several parallel systems and compared with other approaches using test cases provided by the NASA-Boeing High-Speed Civil Transport program. A principal goal was to develop partitioning and load-balancing strategies that led to a reduction in computation time. We found multipartitioning, in which the aspect ratio of the multipartition is close to the aspect ratio of the grid zone, offered the best performance. The (uniprocessor) performance of the CRAY vector systems was superior to all other systems tested. However, the distributed-memory version when run on an SGI Origin system offers a price performance advantage over the CRAY vector systems. Performance on personal computer systems is promising but faces several hurdles.</abstract>
<relatedItem type="host">
<titleInfo>
<title>The International Journal of High Performance Computing Applications</title>
</titleInfo>
<genre type="journal" authority="ISTEX" authorityURI="https://publication-type.data.istex.fr" valueURI="https://publication-type.data.istex.fr/ark:/67375/JMC-0GLKJH51-B">journal</genre>
<identifier type="ISSN">1094-3420</identifier>
<identifier type="eISSN">1741-2846</identifier>
<identifier type="PublisherID">HPC</identifier>
<identifier type="PublisherID-hwp">sphpc</identifier>
<part>
<date>2001</date>
<detail type="volume">
<caption>vol.</caption>
<number>15</number>
</detail>
<detail type="issue">
<caption>no.</caption>
<number>1</number>
</detail>
<extent unit="pages">
<start>22</start>
<end>35</end>
</extent>
</part>
</relatedItem>
<identifier type="istex">10F89E9E8C955141EF1B7E305BA97BB38B570610</identifier>
<identifier type="ark">ark:/67375/M70-X79RBH0Z-6</identifier>
<identifier type="DOI">10.1177/109434200101500103</identifier>
<identifier type="ArticleID">10.1177_109434200101500103</identifier>
<recordInfo>
<recordContentSource authority="ISTEX" authorityURI="https://loaded-corpus.data.istex.fr" valueURI="https://loaded-corpus.data.istex.fr/ark:/67375/XBH-0J1N7DQT-B">sage</recordContentSource>
</recordInfo>
</mods>
<json:item>
<extension>json</extension>
<original>false</original>
<mimetype>application/json</mimetype>
<uri>https://api.istex.fr/ark:/67375/M70-X79RBH0Z-6/record.json</uri>
</json:item>
</metadata>
<serie></serie>
</istex>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Istex/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000384 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Istex/Corpus/biblio.hfd -nk 000384 | SxmlIndent | more

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Wicri/Lorraine
   |area=    InforLorV4
   |flux=    Istex
   |étape=   Corpus
   |type=    RBID
   |clé=     ISTEX:10F89E9E8C955141EF1B7E305BA97BB38B570610
   |texte=   Performance Testing of a Parallel Multiblock CFD Solver
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Jun 10 21:56:28 2019. Site generation: Fri Feb 25 15:29:27 2022