Exploration server for computer science research in Lorraine

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information it contains has therefore not been validated.

Performance Testing of a Parallel Multiblock CFD Solver

Internal identifier: 000384 (Istex/Corpus); previous: 000383; next: 000385

Performance Testing of a Parallel Multiblock CFD Solver

Authors: David Kerlick; Eric Dillon; David Levine

Source:

RBID: ISTEX:10F89E9E8C955141EF1B7E305BA97BB38B570610

English descriptors

Abstract

A distributed-memory version of the OVERFLOW computational fluid dynamics code was evaluated on several parallel systems and compared with other approaches using test cases provided by the NASA-Boeing High-Speed Civil Transport program. A principal goal was to develop partitioning and load-balancing strategies that led to a reduction in computation time. We found multipartitioning, in which the aspect ratio of the multipartition is close to the aspect ratio of the grid zone, offered the best performance. The (uniprocessor) performance of the CRAY vector systems was superior to all other systems tested. However, the distributed-memory version when run on an SGI Origin system offers a price performance advantage over the CRAY vector systems. Performance on personal computer systems is promising but faces several hurdles.
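The abstract's key finding is that a multipartition whose aspect ratio matches the grid zone's gives the best performance; the article notes that near-cubic subdomains minimize the surface-to-volume ratio of rectangular partitions. The short Python sketch below only illustrates that idea: the brute-force search and the helper names (factor_triplets, surface_to_volume, best_multipartition) are assumptions of this sketch, not the authors' code.

```python
from itertools import product

def factor_triplets(p):
    """All ordered triplets (pi, pj, pk) with pi * pj * pk == p."""
    return [t for t in product(range(1, p + 1), repeat=3)
            if t[0] * t[1] * t[2] == p]

def surface_to_volume(dims, part):
    """Surface-to-volume ratio of one subdomain when a zone of
    dims = (Ni, Nj, Nk) points is split into part = (pi, pj, pk) blocks."""
    ni, nj, nk = (d / q for d, q in zip(dims, part))
    return 2.0 * (ni * nj + nj * nk + ni * nk) / (ni * nj * nk)

def best_multipartition(dims, nprocs):
    """Pick the triplet whose subdomains are closest to cubic, i.e. whose
    aspect ratio best matches the grid zone's."""
    return min(factor_triplets(nprocs),
               key=lambda part: surface_to_volume(dims, part))

# Grid zone 2 of the paper's wingbody case: 271 x 121 x 46 points
# (the grid aspect ratio the paper quotes as 5.8 : 2.6 : 1).
print(best_multipartition((271, 121, 46), 18))
# -> (6, 3, 1); near-cubic splits like this beat skewed ones such as 1.3.6,
#    consistent with the results reported in the article.
```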

Url:
DOI: 10.1177/109434200101500103

Links to Exploration step

ISTEX:10F89E9E8C955141EF1B7E305BA97BB38B570610

The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Performance Testing of a Parallel Multiblock CFD Solver</title>
<author wicri:is="90%">
<name sortKey="Kerlick, David" sort="Kerlick, David" uniqKey="Kerlick D" first="David" last="Kerlick">David Kerlick</name>
<affiliation>
<mods:affiliation>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</mods:affiliation>
</affiliation>
</author>
<author wicri:is="90%">
<name sortKey="Dillon, Eric" sort="Dillon, Eric" uniqKey="Dillon E" first="Eric" last="Dillon">Eric Dillon</name>
<affiliation>
<mods:affiliation>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</mods:affiliation>
</affiliation>
</author>
<author wicri:is="90%">
<name sortKey="Levine, David" sort="Levine, David" uniqKey="Levine D" first="David" last="Levine">David Levine</name>
<affiliation>
<mods:affiliation>Rosetta Inpharmatics, Kirkland, Washington</mods:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:10F89E9E8C955141EF1B7E305BA97BB38B570610</idno>
<date when="2001" year="2001">2001</date>
<idno type="doi">10.1177/109434200101500103</idno>
<idno type="url">https://api.istex.fr/ark:/67375/M70-X79RBH0Z-6/fulltext.pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000384</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">000384</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">Performance Testing of a Parallel Multiblock CFD Solver</title>
<author wicri:is="90%">
<name sortKey="Kerlick, David" sort="Kerlick, David" uniqKey="Kerlick D" first="David" last="Kerlick">David Kerlick</name>
<affiliation>
<mods:affiliation>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</mods:affiliation>
</affiliation>
</author>
<author wicri:is="90%">
<name sortKey="Dillon, Eric" sort="Dillon, Eric" uniqKey="Dillon E" first="Eric" last="Dillon">Eric Dillon</name>
<affiliation>
<mods:affiliation>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</mods:affiliation>
</affiliation>
</author>
<author wicri:is="90%">
<name sortKey="Levine, David" sort="Levine, David" uniqKey="Levine D" first="David" last="Levine">David Levine</name>
<affiliation>
<mods:affiliation>Rosetta Inpharmatics, Kirkland, Washington</mods:affiliation>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">The International Journal of High Performance Computing Applications</title>
<idno type="ISSN">1094-3420</idno>
<idno type="eISSN">1741-2846</idno>
<imprint>
<publisher>Sage Publications</publisher>
<pubPlace>Sage CA: Thousand Oaks, CA</pubPlace>
<date type="published" when="2001-02">2001-02</date>
<biblScope unit="volume">15</biblScope>
<biblScope unit="issue">1</biblScope>
<biblScope unit="page" from="22">22</biblScope>
<biblScope unit="page" to="35">35</biblScope>
</imprint>
<idno type="ISSN">1094-3420</idno>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">1094-3420</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="Teeft" xml:lang="en">
<term>Aiaa proceedings</term>
<term>Aspect ratio</term>
<term>Average number</term>
<term>Better load balancing</term>
<term>Boeing company</term>
<term>Cache</term>
<term>Coalescing</term>
<term>Coarse grain</term>
<term>Common divisor</term>
<term>Compaq</term>
<term>Computational</term>
<term>Computer science</term>
<term>Convergence</term>
<term>Cray</term>
<term>Cray cray cray</term>
<term>Cray vector systems</term>
<term>Design processes</term>
<term>Digital fortran version</term>
<term>Disk space</term>
<term>Fine grain</term>
<term>Fine grid</term>
<term>Flow field</term>
<term>Full speed</term>
<term>Grain time speedup</term>
<term>Grid</term>
<term>Grid aspect ratio</term>
<term>Grid number points</term>
<term>Grid points</term>
<term>Grid zone</term>
<term>Grid zones</term>
<term>Hatay</term>
<term>High performance</term>
<term>Hsct</term>
<term>Inner loops</term>
<term>Iteration</term>
<term>Jespersen</term>
<term>Kayak</term>
<term>Larger number</term>
<term>Larger numbers</term>
<term>Largest grid zone</term>
<term>Machine comparison number</term>
<term>Main memory</term>
<term>Medium grid</term>
<term>Memory bandwidth</term>
<term>Methods support coalescing</term>
<term>More processors</term>
<term>Multigrid</term>
<term>Multigrid method</term>
<term>Multigrid scheme</term>
<term>Multipartition</term>
<term>Multipartitioning</term>
<term>Multiple processors</term>
<term>Nasa</term>
<term>Nasa ames</term>
<term>Nasa ames research center</term>
<term>Node performance</term>
<term>Number points</term>
<term>Origin system</term>
<term>Origin systems</term>
<term>Other systems</term>
<term>Overflow</term>
<term>Overflow code</term>
<term>Overflow offer</term>
<term>Parallel directives</term>
<term>Parallel hardware</term>
<term>Parallel implementations</term>
<term>Partitioning</term>
<term>Partitioning scheme</term>
<term>Partitioning strategies</term>
<term>Performance degradation</term>
<term>Performance evaluation</term>
<term>Processor</term>
<term>Programming model</term>
<term>Pulliam</term>
<term>Risc systems</term>
<term>Sequential version</term>
<term>Serial version</term>
<term>Single grid</term>
<term>Single grid zone</term>
<term>Single processor</term>
<term>Small amount</term>
<term>Small grid zones</term>
<term>Smaller numbers</term>
<term>Solver</term>
<term>Speedup</term>
<term>Strategy sequential</term>
<term>Test case</term>
<term>Test cases</term>
<term>Test problems</term>
<term>Threshold number</term>
<term>Total grid</term>
<term>Total number</term>
<term>Turbulence model</term>
<term>Unipartitioning</term>
<term>Unusual representation</term>
<term>Vector architecture</term>
<term>Vector supercomputers</term>
<term>Virtual memory</term>
<term>Viscous compressible flow equations</term>
<term>Wingbody</term>
<term>Wingbody case</term>
<term>Wingbody problem</term>
<term>Wingbody test case</term>
<term>Wingbody timings</term>
<term>Zone</term>
</keywords>
</textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">A distributed-memory version of the OVERFLOW computational fluid dynamics code was evaluated on several parallel systems and compared with other approaches using test cases provided by the NASA-Boeing High-Speed Civil Transport program. A principal goal was to develop partitioning and load-balancing strategies that led to a reduction in computation time. We found multipartitioning, in which the aspect ratio of the multipartition is close to the aspect ratio of the grid zone, offered the best performance. The (uniprocessor) performance of the CRAY vector systems was superior to all other systems tested. However, the distributed-memory version when run on an SGI Origin system offers a price performance advantage over the CRAY vector systems.Performance on personal computer systems is promising but faces several hurdles.</div>
</front>
</TEI>
<istex>
<corpusName>sage</corpusName>
<keywords>
<teeft>
<json:string>grid</json:string>
<json:string>partitioning</json:string>
<json:string>cray</json:string>
<json:string>grid zone</json:string>
<json:string>compaq</json:string>
<json:string>grid zones</json:string>
<json:string>wingbody</json:string>
<json:string>multipartitioning</json:string>
<json:string>processor</json:string>
<json:string>unipartitioning</json:string>
<json:string>kayak</json:string>
<json:string>single processor</json:string>
<json:string>multigrid</json:string>
<json:string>iteration</json:string>
<json:string>aspect ratio</json:string>
<json:string>convergence</json:string>
<json:string>overflow</json:string>
<json:string>speedup</json:string>
<json:string>multiple processors</json:string>
<json:string>hatay</json:string>
<json:string>nasa</json:string>
<json:string>single grid zone</json:string>
<json:string>coalescing</json:string>
<json:string>pulliam</json:string>
<json:string>test problems</json:string>
<json:string>jespersen</json:string>
<json:string>hsct</json:string>
<json:string>multipartition</json:string>
<json:string>solver</json:string>
<json:string>cache</json:string>
<json:string>largest grid zone</json:string>
<json:string>test cases</json:string>
<json:string>aiaa proceedings</json:string>
<json:string>high performance</json:string>
<json:string>grid points</json:string>
<json:string>zone</json:string>
<json:string>cray vector systems</json:string>
<json:string>grid aspect ratio</json:string>
<json:string>partitioning scheme</json:string>
<json:string>boeing company</json:string>
<json:string>computational</json:string>
<json:string>parallel directives</json:string>
<json:string>total number</json:string>
<json:string>turbulence model</json:string>
<json:string>serial version</json:string>
<json:string>fine grain</json:string>
<json:string>parallel hardware</json:string>
<json:string>coarse grain</json:string>
<json:string>number points</json:string>
<json:string>test case</json:string>
<json:string>other systems</json:string>
<json:string>average number</json:string>
<json:string>wingbody timings</json:string>
<json:string>sequential version</json:string>
<json:string>computer science</json:string>
<json:string>viscous compressible flow equations</json:string>
<json:string>single grid</json:string>
<json:string>programming model</json:string>
<json:string>inner loops</json:string>
<json:string>overflow code</json:string>
<json:string>risc systems</json:string>
<json:string>origin system</json:string>
<json:string>total grid</json:string>
<json:string>medium grid</json:string>
<json:string>larger numbers</json:string>
<json:string>grid number points</json:string>
<json:string>parallel implementations</json:string>
<json:string>nasa ames</json:string>
<json:string>design processes</json:string>
<json:string>main memory</json:string>
<json:string>multigrid method</json:string>
<json:string>disk space</json:string>
<json:string>flow field</json:string>
<json:string>multigrid scheme</json:string>
<json:string>origin systems</json:string>
<json:string>wingbody problem</json:string>
<json:string>partitioning strategies</json:string>
<json:string>common divisor</json:string>
<json:string>better load balancing</json:string>
<json:string>fine grid</json:string>
<json:string>strategy sequential</json:string>
<json:string>grain time speedup</json:string>
<json:string>wingbody case</json:string>
<json:string>larger number</json:string>
<json:string>smaller numbers</json:string>
<json:string>small grid zones</json:string>
<json:string>vector architecture</json:string>
<json:string>more processors</json:string>
<json:string>small amount</json:string>
<json:string>machine comparison number</json:string>
<json:string>cray cray cray</json:string>
<json:string>virtual memory</json:string>
<json:string>node performance</json:string>
<json:string>unusual representation</json:string>
<json:string>memory bandwidth</json:string>
<json:string>full speed</json:string>
<json:string>wingbody test case</json:string>
<json:string>threshold number</json:string>
<json:string>performance degradation</json:string>
<json:string>overflow offer</json:string>
<json:string>methods support coalescing</json:string>
<json:string>nasa ames research center</json:string>
<json:string>vector supercomputers</json:string>
<json:string>performance evaluation</json:string>
<json:string>digital fortran version</json:string>
</teeft>
</keywords>
<author>
<json:item>
<name>David Kerlick</name>
<affiliations>
<json:string>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</json:string>
</affiliations>
</json:item>
<json:item>
<name>Eric Dillon</name>
<affiliations>
<json:string>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</json:string>
</affiliations>
</json:item>
<json:item>
<name>David Levine</name>
<affiliations>
<json:string>Rosetta Inpharmatics, Kirkland, Washington</json:string>
</affiliations>
</json:item>
</author>
<articleId>
<json:string>10.1177_109434200101500103</json:string>
</articleId>
<arkIstex>ark:/67375/M70-X79RBH0Z-6</arkIstex>
<language>
<json:string>eng</json:string>
</language>
<originalGenre>
<json:string>research-article</json:string>
</originalGenre>
<abstract>A distributed-memory version of the OVERFLOW computational fluid dynamics code was evaluated on several parallel systems and compared with other approaches using test cases provided by the NASA-Boeing High-Speed Civil Transport program. A principal goal was to develop partitioning and load-balancing strategies that led to a reduction in computation time. We found multipartitioning, in which the aspect ratio of the multipartition is close to the aspect ratio of the grid zone, offered the best performance. The (uniprocessor) performance of the CRAY vector systems was superior to all other systems tested. However, the distributed-memory version when run on an SGI Origin system offers a price performance advantage over the CRAY vector systems. Performance on personal computer systems is promising but faces several hurdles.</abstract>
<qualityIndicators>
<score>8.452</score>
<pdfWordCount>6487</pdfWordCount>
<pdfCharCount>39548</pdfCharCount>
<pdfVersion>1.5</pdfVersion>
<pdfPageCount>14</pdfPageCount>
<pdfPageSize>612 x 792 pts (letter)</pdfPageSize>
<refBibsNative>true</refBibsNative>
<abstractWordCount>121</abstractWordCount>
<abstractCharCount>829</abstractCharCount>
<keywordCount>0</keywordCount>
</qualityIndicators>
<title>Performance Testing of a Parallel Multiblock CFD Solver</title>
<genre>
<json:string>research-article</json:string>
</genre>
<host>
<title>The International Journal of High Performance Computing Applications</title>
<language>
<json:string>unknown</json:string>
</language>
<issn>
<json:string>1094-3420</json:string>
</issn>
<eissn>
<json:string>1741-2846</json:string>
</eissn>
<publisherId>
<json:string>HPC</json:string>
</publisherId>
<volume>15</volume>
<issue>1</issue>
<pages>
<first>22</first>
<last>35</last>
</pages>
<genre>
<json:string>journal</json:string>
</genre>
</host>
<namedEntities>
<unitex>
<date></date>
<geogName></geogName>
<orgName></orgName>
<orgName_funder></orgName_funder>
<orgName_provider></orgName_provider>
<persName></persName>
<placeName></placeName>
<ref_url></ref_url>
<ref_bibl></ref_bibl>
<bibl></bibl>
</unitex>
</namedEntities>
<ark>
<json:string>ark:/67375/M70-X79RBH0Z-6</json:string>
</ark>
<categories>
<wos>
<json:string>1 - science</json:string>
<json:string>2 - computer science, theory & methods</json:string>
<json:string>2 - computer science, interdisciplinary applications</json:string>
<json:string>2 - computer science, hardware & architecture</json:string>
</wos>
<scienceMetrix>
<json:string>1 - applied sciences</json:string>
<json:string>2 - information & communication technologies</json:string>
<json:string>3 - distributed computing</json:string>
</scienceMetrix>
<scopus>
<json:string>1 - Physical Sciences</json:string>
<json:string>2 - Computer Science</json:string>
<json:string>3 - Hardware and Architecture</json:string>
<json:string>1 - Physical Sciences</json:string>
<json:string>2 - Mathematics</json:string>
<json:string>3 - Theoretical Computer Science</json:string>
<json:string>1 - Physical Sciences</json:string>
<json:string>2 - Computer Science</json:string>
<json:string>3 - Software</json:string>
</scopus>
<inist>
<json:string>1 - sciences appliquees, technologies et medecines</json:string>
<json:string>2 - sciences exactes et technologie</json:string>
<json:string>3 - sciences et techniques communes</json:string>
<json:string>4 - sciences de l'information. documentation</json:string>
</inist>
</categories>
<publicationDate>2001</publicationDate>
<copyrightDate>2001</copyrightDate>
<doi>
<json:string>10.1177/109434200101500103</json:string>
</doi>
<id>10F89E9E8C955141EF1B7E305BA97BB38B570610</id>
<score>1</score>
<fulltext>
<json:item>
<extension>pdf</extension>
<original>true</original>
<mimetype>application/pdf</mimetype>
<uri>https://api.istex.fr/ark:/67375/M70-X79RBH0Z-6/fulltext.pdf</uri>
</json:item>
<json:item>
<extension>zip</extension>
<original>false</original>
<mimetype>application/zip</mimetype>
<uri>https://api.istex.fr/ark:/67375/M70-X79RBH0Z-6/bundle.zip</uri>
</json:item>
<istex:fulltextTEI uri="https://api.istex.fr/ark:/67375/M70-X79RBH0Z-6/fulltext.tei">
<teiHeader>
<fileDesc>
<titleStmt>
<title level="a" type="main" xml:lang="en">Performance Testing of a Parallel Multiblock CFD Solver</title>
</titleStmt>
<publicationStmt>
<authority>ISTEX</authority>
<publisher scheme="https://scientific-publisher.data.istex.fr">Sage Publications</publisher>
<pubPlace>Sage CA: Thousand Oaks, CA</pubPlace>
<availability>
<licence>
<p>sage</p>
</licence>
</availability>
<p scheme="https://loaded-corpus.data.istex.fr/ark:/67375/XBH-0J1N7DQT-B"></p>
<date>2001</date>
</publicationStmt>
<notesStmt>
<note type="research-article" scheme="https://content-type.data.istex.fr/ark:/67375/XTP-1JC4F85T-7">research-article</note>
<note type="journal" scheme="https://publication-type.data.istex.fr/ark:/67375/JMC-0GLKJH51-B">journal</note>
</notesStmt>
<sourceDesc>
<biblStruct type="inbook">
<analytic>
<title level="a" type="main" xml:lang="en">Performance Testing of a Parallel Multiblock CFD Solver</title>
<author xml:id="author-0000">
<persName>
<forename type="first">David</forename>
<surname>Kerlick</surname>
</persName>
<affiliation>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</affiliation>
</author>
<author xml:id="author-0001">
<persName>
<forename type="first">Eric</forename>
<surname>Dillon</surname>
</persName>
<affiliation>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</affiliation>
</author>
<author xml:id="author-0002">
<persName>
<forename type="first">David</forename>
<surname>Levine</surname>
</persName>
<affiliation>Rosetta Inpharmatics, Kirkland, Washington</affiliation>
</author>
<idno type="istex">10F89E9E8C955141EF1B7E305BA97BB38B570610</idno>
<idno type="ark">ark:/67375/M70-X79RBH0Z-6</idno>
<idno type="DOI">10.1177/109434200101500103</idno>
<idno type="article-id">10.1177_109434200101500103</idno>
</analytic>
<monogr>
<title level="j">The International Journal of High Performance Computing Applications</title>
<idno type="pISSN">1094-3420</idno>
<idno type="eISSN">1741-2846</idno>
<idno type="publisher-id">HPC</idno>
<idno type="PublisherID-hwp">sphpc</idno>
<imprint>
<publisher>Sage Publications</publisher>
<pubPlace>Sage CA: Thousand Oaks, CA</pubPlace>
<date type="published" when="2001-02"></date>
<biblScope unit="volume">15</biblScope>
<biblScope unit="issue">1</biblScope>
<biblScope unit="page" from="22">22</biblScope>
<biblScope unit="page" to="35">35</biblScope>
</imprint>
</monogr>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<creation>
<date>2001</date>
</creation>
<langUsage>
<language ident="en">en</language>
</langUsage>
<abstract xml:lang="en">
<p>A distributed-memory version of the OVERFLOW computational fluid dynamics code was evaluated on several parallel systems and compared with other approaches using test cases provided by the NASA-Boeing High-Speed Civil Transport program. A principal goal was to develop partitioning and load-balancing strategies that led to a reduction in computation time. We found multipartitioning, in which the aspect ratio of the multipartition is close to the aspect ratio of the grid zone, offered the best performance. The (uniprocessor) performance of the CRAY vector systems was superior to all other systems tested. However, the distributed-memory version when run on an SGI Origin system offers a price performance advantage over the CRAY vector systems. Performance on personal computer systems is promising but faces several hurdles.</p>
</abstract>
</profileDesc>
<revisionDesc>
<change when="2001-02">Published</change>
</revisionDesc>
</teiHeader>
</istex:fulltextTEI>
<json:item>
<extension>txt</extension>
<original>false</original>
<mimetype>text/plain</mimetype>
<uri>https://api.istex.fr/ark:/67375/M70-X79RBH0Z-6/fulltext.txt</uri>
</json:item>
</fulltext>
<metadata>
<istex:metadataXml wicri:clean="corpus sage not found" wicri:toSee="no header">
<istex:xmlDeclaration>version="1.0" encoding="UTF-8"</istex:xmlDeclaration>
<istex:docType PUBLIC="-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" URI="journalpublishing.dtd" name="istex:docType"></istex:docType>
<istex:document>
<article article-type="research-article" dtd-version="2.3" xml:lang="EN">
<front>
<journal-meta>
<journal-id journal-id-type="hwp">sphpc</journal-id>
<journal-id journal-id-type="publisher-id">HPC</journal-id>
<journal-title>The International Journal of High Performance Computing Applications</journal-title>
<issn pub-type="ppub">1094-3420</issn>
<publisher>
<publisher-name>Sage Publications</publisher-name>
<publisher-loc>Sage CA: Thousand Oaks, CA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.1177/109434200101500103</article-id>
<article-id pub-id-type="publisher-id">10.1177_109434200101500103</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Articles</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Performance Testing of a Parallel Multiblock CFD Solver</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" xlink:type="simple">
<name name-style="western">
<surname>Kerlick</surname>
<given-names>David</given-names>
</name>
<aff>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</aff>
</contrib>
</contrib-group>
<contrib-group>
<contrib contrib-type="author" xlink:type="simple">
<name name-style="western">
<surname>Dillon</surname>
<given-names>Eric</given-names>
</name>
<aff>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</aff>
</contrib>
</contrib-group>
<contrib-group>
<contrib contrib-type="author" xlink:type="simple">
<name name-style="western">
<surname>Levine</surname>
<given-names>David</given-names>
</name>
<aff>Rosetta Inpharmatics, Kirkland, Washington</aff>
</contrib>
</contrib-group>
<pub-date pub-type="ppub">
<month>02</month>
<year>2001</year>
</pub-date>
<volume>15</volume>
<issue>1</issue>
<fpage>22</fpage>
<lpage>35</lpage>
<abstract>
<p>A distributed-memory version of the OVERFLOW computational fluid dynamics code was evaluated on several parallel systems and compared with other approaches using test cases provided by the NASA-Boeing High-Speed Civil Transport program. A principal goal was to develop partitioning and load-balancing strategies that led to a reduction in computation time. We found multipartitioning, in which the aspect ratio of the multipartition is close to the aspect ratio of the grid zone, offered the best performance. The (uniprocessor) performance of the CRAY vector systems was superior to all other systems tested. However, the distributed-memory version when run on an SGI Origin system offers a price performance advantage over the CRAY vector systems. Performance on personal computer systems is promising but faces several hurdles.</p>
</abstract>
<custom-meta-wrap>
<custom-meta xlink:type="simple">
<meta-name>sagemeta-type</meta-name>
<meta-value>Journal Article</meta-value>
</custom-meta>
<custom-meta xlink:type="simple">
<meta-name>cover-date</meta-name>
<meta-value>Spring 2001</meta-value>
</custom-meta>
<custom-meta xlink:type="simple">
<meta-name>search-text</meta-name>
<meta-value> COMPUTING APPLICATIONS PERFORMANCE TESTING OF A PARALLEL MULTIBLOCK CFD SOLVER David Kerlick Eric Dillon MATHEMATICS AND COMPUTING TECHNOLOGY, BOEING COMPANY, SEATTLE, WASHINGTON David Levine ROSETTA INPHARMATICS, KIRKLAND, WASHINGTON Summary A distributed-memory version of the OVERFLOW compu- tational fluid dynamics code was evaluated on several par- allel systems and compared with other approaches using test cases provided by the NASA-Boeing High-Speed Civil Transport program. A principal goal was to develop parti- tioning and load-balancing strategies that led to a reduc- tion in computation time. We found multipartitioning, in which the aspect ratio of the multipartition is close to the aspect ratio of the grid zone, offered the best performance. The (uniprocessor) performance of the CRAY vector sys- tems was superior to all other systems tested. However, the distributed-memory version when run on an SGI Origin system offers a price performance advantage over the CRAY vector systems.Performance on personal computer systems is promising but faces several hurdles. 1 Introduction Computational fluid dynamics (CFD) calculations are used for calculating internal and external fluid flows, as well as the corresponding pressures, forces, and moments on aerodynamic surfaces. Traditionally, these calcula- tions have been run on vector machines. The trend in the past few years, however, has been toward multiple-pro- cessor RISC systems and, most recently, mass-market PC clusters. The problems of arranging parallel computa- tions on the multiple processors of these systems are, however, far from simple. In this paper, we present the re- sults of our work, developing an effective parallel-parti- tioning strategy for an important CFD code. The rest of this paper is organized as follows. Section 2 describes the OVERFLOW code, including partitioning strategies and parallel implementations. Section 3 de- scribes the test problems used and the computers they were tested on. Section 4 contains our computational re- sults. This includes a comparison of partitioning strate- gies, a comparison of two parallel programming models, and performance results from different systems. Last, Section 5 presents our conclusions. 2 The OVERFLOW Solver The OVERFLOW CFD code has been used for at least a decade for solving viscous compressible flow (Reynolds-averaged Navier-Stokes equations with turbu- lence model) and is a part of many aerodynamic design processes. OVERFLOW is based on the ARC3D implicit approximate factorization approach of Pulliam and Steger (1978) and the scalar pentadiagonal form of Pulliam and Chaussee (1981). These implicit schemes dramatically accelerate convergence at the cost of matrix inversions at each step. In ARC3D, the matrix equations are solved on a single grid that is a 3-D lattice of integers. The ratio of the lattice edge lengths (the integers Ni , Nj , and Nk ) is the grid aspect ratio Ni : Nj : Nk . Complex aerodynamic configurations may have mul- tiple components that require differing amounts of discretization. It may be difficult or impossible to map the flow domain using a single grid. One solution is to use multiple overlapping grids, each of which is mapped to a particular component in the overall design. Each of these grids, which we refer to as a grid zone, is a topological rectangular solid. The complete flow field is represented by the overlapped grid zones. Overlapped grids enable the solution of complex and multipart aerodynamic configu- rations. 
It allows one to add or change a component of a 22 COMPUTING APPLICATIONS The International Journal of High Performance Computing Applications, Volume 15, No. 1, Spring 2001, pp. 22-35 2001 Sage Publications, Inc. PARALLEL CFD PERFORMANCE Address reprint requests to David Kerlick, Mathematics and Com- puting Technology, Boeing Company, P.O. Box 3707 MS 7L-43, Seattle, WA 98124-2207, U.S.A.; e-mail: david.kerlick@ boeing.com. configuration without having to generate new grids for other components. Overlapped grid zones require interpolation in the re- gions of overlap between the zones at every iteration of the flow solver, the so-called Chimera scheme (Steger, Dougherty, and Benek, 1983). In the serial version of the code, the most recently updated flow variables from the current time step are used when available. This is roughly analogous, on a zonal level, to Gauss-Seidel iteration, and the convergence rate may depend on the order in which the zones are updated. F. Hatay (personal communication, 1999) expects the convergence to be faster, starting from the known conditions at the flow field boundaries, rather than from interzone boundaries. When the zones are pro- cessed in parallel, only the values from the last time step are known, and an analogy to Jacobi iteration may be made. In this case, convergence may not occur as rapidly, but the result is independent of the ordering. We have not performed anystudies of convergence in the present work. As the number of grid zones increases, the amount of computing time devoted to interpolation increases and may ultimately require a significant portion of the comput- ing time and be a candidate for further parallelization. For the present, the interpolation calculations have not been parallelized, only the flow solutions within the grid zones. One recent enhancement to OVERFLOW is the multigrid method (Jespersen, Pulliam, and Buning, 1997) of convergence acceleration. In a multigrid scheme, a coarsening of the grid is effected by taking every other grid point in each grid direction. Thus, on a scheme with L multigrid levels, only points in the fine mesh that are sepa- rated by multiples of 2L fine-mesh points remain in the coarsest mesh. If grids are used that do not have edges di- visible by 2L + 1, then asymmetries in the coarsened grids are possible, even when the fine grid is symmetric. It is therefore necessary that grid points that represent features (e.g., a leading-edge discontinuity in the wing) be pre- served in the coarsest grid. The multigrid method acceler- ates the convergence of the solution, although at the cost of significant coding complexity. OVERFLOW is under continuous revision. Most of our work was done using code from OVERFLOW 1.7v. Some later runs used code from the version 1.8 release. We do not expect a significant difference in runtime for the cases we have studied. 2.1 PARTITIONING AND LOAD BALANCING Partitioning the flow field is a natural way to decompose the problem domain for parallel solution. Because the flow PARALLEL CFD PERFORMANCE 23 "The OVERFLOW CFD code has been used for at least a decade for solving viscous compressible flow (Reynolds-averaged Navier-Stokes equations with turbulence model) and is a part of many aerodynamic design processes." field has already been divided into grid zones, the sim- plest approach assigns one processor to each zone. This approach, which we call coarse grain, was taken in Atwood and Smith (1995). 
The theoretical speedup of the coarse-grain approach is limited by the ratio of the total number of points in the flow field to the number of points in the largest grid zone. An improvement in efficiency can be obtained by coalesc- ing small grid zones and assigning them to the same pro- cessor. This allows the use of fewer processors, but the overall time is still limited by the time for a single proces- sor to process the largest grid zone. It is possible to add new zone boundaries to divide the zones artificially, but doing so forces the update between zones to be handled in an explicit, time-accurate way, which can slow conver- gence substantially. In most cases, the coarse-grain approach does not bal- ance the load on the processors very well. As a result, pro- cessors that have too few points to process are idle. To im- prove performance and to apply larger numbers of processors than the number of grid zones, it is necessary to divide the work on a single grid zone among multiple processors. This is a complex step, particularly for multigrid solutions, that we call fine grain. The simplest form of fine-grain partitioning, uniparti- tioning, partitions a grid zone in a single grid direction, usually the one with the largest grid dimension. A unipar- titioning scheme was implemented in the POVERFLOW (Ryan and Weeratunga, 1993) variant of OVERFLOW. A more general fine-grain partitioning scheme is multipartitioning, in which a grid zone is partitioned along multiple dimensions. Thus, to each grid zone of di- mensions Ni × Nj × Nk, a triplet of numbers, Pi.Pj.Pk, is as- signed so that a total of Pi × Pj × Pk processors are assigned to a grid zone. The total number of processors required for a problem is the sum of the processors assigned to each grid zone. In Smith, Van der Wijngaart, and Yarrow (1995), a multipartitioning method for alternating-direction im- plicit algorithms was developed. Hatay et al. (1997) de- veloped the idea of multipartitioning grid zones. This was first demonstrated on the aeroelastic code ENSAERO and later incorporated into OVERFLOW. Hatay et al. demon- strated speedups for non-multi-grid OVERFLOW calcu- lations. This version of the code allowed both coalescing of small grid zones and splitting large grid zones in multi- ple directions. To implement an implicit solve on parti- tioned zones, it is necessary for the processors working on a partition to have access to the adjacent grid planes being solved by other processors. These so-called ghost or halo points increase the effective amount of memory required in the parallel calculation (see, e.g., Hatay et al., 1997, Figure 1). Jespersen, Pulliam, and Buning (1997) extended Hatay et al.'s (1997) work to a multigrid version. Their implementation of multipartitioning ensures that the multigrid scheme is unaffected by the choice of partition. Symmetry boundary conditions and internal symmetry planes or axes have not been implemented yet. All the partitioning schemes we described employ static partitioning; that is, grid partitions are specified in the input file and are not changed during the run of the job. At present, dynamic repartitioning for load balancing during the execution of a job has not been attempted. 2.2 PARALLEL IMPLEMENTATIONS Since OVERFLOW requires supercomputer perfor- mance, it has traditionally been run on vector supercom- puters. 
More recently, however, several strategies have been investigated to make the best use of parallel hard- ware (Buning et al., 1995; Hatay et al., 1997; Jespersen, 1998; Taft, 1998). We classify these approaches accord- ing to the memory model they assume. 2.2.1 Shared Memory. The standard distribution of the OVERFLOW code includes parallel directives for CRAY vector systems and Silicon Graphics (SGI) Origin architectures. These directives assume a shared-memory programming model. The directives are used primarily around the inner loops of the matrix solves in each grid zone. Taft (1998) reported success with a multilevel parallel (MLP) approach. This approach takes advantage of the parallel directives for inner-loop parallelism and uses the UNIX fork() command at the outer level to support flow field partitioning. Both coalescing of small grid zones and multipartitioning of large grid zones are supported. 2.2.2 Distributed-Memory Models. Two distrib- uted-memory implementations of OVERFLOW are available. The first uses coarse-grain partitioning and was implemented using PVM.1 The second implementation uses fine-grain partitioning and is implemented using the message-passing interface (MPI). In this paper, we focus on the MPI implementation that, because of its portabil- ity, can be run on both shared- and distributed-memory systems. 24 COMPUTING APPLICATIONS 3 Test Environment 3.1 TEST PROBLEMS The test problems we used were taken from the NASA-Boeing High-Speed Civil Transport (HSCT) proj- ect. In the HSCT project, NASA was working with indus- try to outline a plan for developing long-lead, high-risk technologies that could form the foundation for an industry HSCT program launch. The projected market of more than 1000 HSCT aircraft between 2006 and 2020 was quite sub- stantial, and technology development was essential to en- able an environmentally compatible and economically via- ble HSCT aircraft. This project has recently been terminated due to certain high-risk technologies not matur- ing as originally planned. The test problems represent external aerodynamic con- figurations. Drag reduction is the main objective. To cor- rectly calculate drag, viscous phenomena such as separa- tion must be predicted, and a Navier-Stokes solution is required. Turbulence is predicted by an engineering model, here the model of Spalart and Almaras (1992). The Navier-Stokes runs are used to verify design runs done us- ing linear aerodynamics. These cases are typical of those that were running in production mode on a CRAY C-90 at NASA. The first test case is a wing-and-body combination (the "wingbody" problem). The total grid comprises 3,579,004 points in six grid zones. The largest grid zone contains ap- proximately 1.5 million points, and the smallest grid zone has approximately 54,000 points. Figure 1 shows the inte- rior boundaries of each grid zone. Table 1 contains the dimensions and number of points in each grid zone. For each grid zone, three grids are speci- fied: a fine grid, a medium grid, and a coarse grid. These three grids correspond to the three levels of multigrid used in the solver. PARALLEL CFD PERFORMANCE 25 Fig. 
1 Wingbody test case Table 1 Wingbody Grid Zones Grid Zone Fine Grid Number Points Medium Grid Number Points Coarse Grid Number Points 1 208 × 93 × 49 947,856 105 × 48 × 25 126,000 53 × 26 × 13 17,914 2 271 × 121 × 46 1,508,386 136 × 62 × 24 202,368 69 × 32 × 13 28,704 3 220 × 28 × 37 227,920 111 × 15 × 19 31,635 56 × 8 × 10 4,480 4 52 × 28 × 37 53,872 27 × 15 × 19 7,695 14 × 8 × 10 1,120 5 220 × 52 × 64 732,160 111 × 27 × 33 98,901 56 × 15 × 17 14,280 6 78 × 45 × 31 108,810 40 × 24 × 16 15,360 21 × 14 × 9 2,646 Total 3,579,004 481,959 69,144 The second test case adds a nacelle and diverter 2 to the wingbody data set (the "wingbody-nacelle-diverter" prob- lem), as illustrated in Figure 2. An additional 14 grid zones are used to overlap the nacelle and diverter for a total of 20 gridzones.The14additionalgridzonesaregiveninTable2. The totalgrid comprises 9,784,210 points in20grid zones. 3.2 PARALLEL HARDWARE We tested OVERFLOW on three types of computers: vec- tor supercomputers, parallel computers with RISC micro- processors, and PCs. 3.2.1 Vector System. We tested OVERFLOW on two CRAY vector systems. One was a 16-processor C-90 lo- cated at NASA Ames with 8 GB of memory. This machine was the regular production environment for the HSCT pro- gram's use of OVERFLOW. The second was an 8-proces- sor CRAY T-90 with 2 GB of memory located at Boeing. On both machines, testing was done using a single proces- sor in vector mode during regular production hours. 3.2.2 RISC Systems. The largest system we had ac- cess to was a 512-node CRAY T3E. The T3E is a distrib- uted-memory system with a 3-torus interconnect. Each T3E node consists of an Alpha 21164 processor 26 COMPUTING APPLICATIONS Table 2 Wingbody-Nacelle-Diverter Additional Grid Zones Grid Zone Fine Grid Number Points Medium Grid Number Points Coarse Grid Number Points 7 131 × 67 × 107 939,139 66 × 34 × 54 121,176 34 × 18 × 28 17,136 8 131 × 67 × 107 939,139 66 × 34 × 54 121,176 34 × 18 × 28 17,136 9 80 × 161 × 37 476,560 41 × 81 × 19 63,099 21 × 41 × 10 8,610 10 57 × 161 × 37 339,549 29 × 81 × 19 44,631 15 × 41 × 10 6,150 11 71 × 161 × 37 422,947 36 × 81 × 19 55,404 19 × 41 × 10 7,790 12 176 × 37 × 57 371,184 89 × 19 × 29 49,039 45 × 10 × 15 6,750 13 176 × 37 × 57 371,184 89 × 19 × 29 49,039 45 × 10 × 15 6,750 14 113 × 59 × 37 246,679 57 × 30 × 19 32,490 29 × 16 × 10 4,640 15 80 × 161 × 37 476,560 41 × 81 × 19 63,099 21 × 41 × 10 8,610 16 57 × 161 × 37 339,549 29 × 81 × 19 44,631 15 × 41 × 10 6,150 17 71 × 161 × 37 422,947 36 × 81 × 19 55,404 19 × 41 × 10 7,790 18 176 × 37 × 49 319,088 89 × 19 × 25 38,475 45 × 10 × 13 5,850 19 176 × 37 × 49 319,088 89 × 19 × 25 38,475 45 × 10 × 13 5,850 20 113 × 53 × 37 221,593 57 × 27 × 19 29,241 29 × 14 × 10 4,060 Total 9,784,210 1,287,338 182,416 Fig. 2 Wingbody-nacelle-diverter test case (300 MHz) and 128 MB of memory. Each processor has a primary cache of 8 KB and a secondary cache of 96 KB and can execute two instructions per clock cycle. We tested three SGI Origin 2000 machines. The Origin has a nonuniform memory access (NUMA) architecture and is based on the MIPS R10000 processor. The first ma- chine had 8 processors (195 MHz) and 2 GB of main memory. The second Origin had 64 processors (195 MHz) and 16 GB of memory. During our testing, this ma- chine was upgraded to 128 processors and 32 GB of mem- ory. The third Origin tested had 128 processors (250 MHz) and 32 GB of main memory. All three machines had 4 MB level-2 caches for each processor. 3.2.3 PC Systems. 
The smallest (and least expen- sive) system we tested was a Compaq 8000 symmetric multiprocessor (SMP) with four Intel Pentium Pro pro- cessors (200 MHz). Each processor had a 512 KB level-2 cache. The processors shared 4 GB memory and 23 GB of disk space. The operating system was Windows NT ver- sion 4.0. The Fortran compiler was Digital Fortran ver- sion 5.0. The MPI used was a beta version from GENIAS. We also tested two PC clusters. The first consisted of 32 Compaq 6000 machines. The second consisted of 64 Hewlett-Packard (HP) Kayak machines. Each machine is an SMP with dual Intel Pentium II processors (300 MHz in the HP Kayak, 333 MHz in the Compaq 6000) sharing a 512 KB level-2 cache, 512 MB memory, and 4 GB of disk space. The machines in each cluster were connected to each other using a high-speed Myrinet network. Each PC was running the Windows NT version 4.0 operating sys- tem. The compiler was Digital Fortran version 5.0. The MKS Toolkit3 was used to facilitate code porting. MPI-FM (Lauria and Chien, 1997), a version of MPI based on Fast Messages and optimized for Myrinet net- works, was used for communication. 4 Results The results are presented in three stages. First, in Section 4.1, we analyze the results of various partitioning strate- gies on a single grid zone. Next, in Section 4.2, the single grid zone results are used to examine load-balancing strategies for multiple zones. Last, the multizone results are used for performance comparisons of coarse-grain versus fine-grain partitioning, MPI versus MLP, and dif- ferent computing systems and are presented in Sections 4.3, 4.4, and 4.5, respectively. The results presented are from runs that restarted a computation from an existing checkpoint file, ran 10 multigrid iterations, and wrote a new checkpoint file. This is typical of production runs to convergence, which may take several hundred to several thousand iterations. The times given are the average time (over 10 iterations) each time step takes. It includes solver and interpolation time, message-passing overhead, and the time to calculate vari- ous metrics. It does not include the time to restart and checkpoint the computation. Except in Section 4.5, all timings are from SGI Origin systems. 4.1 SINGLE GRID ZONE PARTITIONING RESULTS The effectiveness of a partitioning scheme depends on two effects. First, the speed of calculation for a single grid zone depends on the number of processors assigned to the grid zone and how they divide the work. Second, the over- all load balance is affected every time the performance on a single grid zone is improved. To study the first effect, a series of runs was made on grid zone 2 (a wing) in isolation. Recall (see Table 1) that this is the largest grid zone with approximately 1.5 mil- lion grid points. The grid aspect ratio is 5.8 : 2.6 : 1. The first set of runs used a unipartitioning strategy, with parti- tioning occurring in the dimension with the largest num- ber of points. The second set of runs used multipartition- ing where the partitions were the integer triplets closest to the grid aspect ratio.4 Figure 3 contains the results from both sets of parti- tioning experiments. The horizontal axis is the number of processors used, and the vertical axis is the parallel speedup. The straight line is the ideal speedup. The solid curved line is the unipartitioning results. The dotted curved line is the multipartitioning results. For unipartitioning, each data point corresponds to a partition of np1.1, where np is the number of processors. 
For multipartitioning, the data points correspond to the parti- tions 4.4.1, 6.3.1, 5.4.1, 7.3.1, 6.4.1, and 8.4.1. The unipartitioning results show good speedup (90 or greater efficiency) for up to 7 processors. Between 8 and 32 processors, there is a steady decline in efficiency, with the largest speedup never surpassing 10. The reason for this is that as a grid zone edge is partitioned into smaller and smaller sets of points, the ghost points of a partition overlap partitions other than just the neighboring one, with a consequent increase in communication and de- crease in performance. Multipartitioning clearly outperforms unipartitioning, sometimes by as much as a factor of 2. It is important, however, that the aspect ratio of the multipartition be as close as possible to the aspect ratio of the grid zone. This implies that the regions sent to each processor are approx- PARALLEL CFD PERFORMANCE 27 imately cubic and, therefore, that their surface-to-volume ratio is minimal for rectangular partitions.5 Results from experiments using multipartitions that did not have a simi- lar aspect ratio to the grid zone were consistently inferior to those that did. For example, the (inferior) multipartition 1.3.6 had a speedup of only 7.9 compared with the speedup of 12.7 for the multipartition 3.6.1, or even compared with the speedup of 9.6 for the unipartition 18.1.1. 4.2 MULTIPLE GRID ZONE PARTITIONING RESULTS Having studied the effect of the partition choice on the largest grid zone, we next looked at the effect of these choices in the context of the complete solution for all grid zones. These studies were performed on the wingbody problem using 44 processors.6 First, we calculated the average number of points that should be assigned to each processor. We then coalesced the smallest grid zones until they added up to this number. Then, the remaining grid zones were divided among the re- 28 COMPUTING APPLICATIONS Fig. 3 Partitioning of grid zone 2 "Multipartitioning clearly outperforms unipartitioning, sometimes by as much as a factor of 2. It is important, however, that the aspect ratio of the multipartition be as close as possible to the aspect ratio of the grid zone." maining processors. Within each grid zone, we calculated the unipartition by assigning the processors for each grid zone to the longest grid dimension. Our results are given in Table 3. The unipartition u was used as the starting point. This partition is computed by minimizing the maximal number of points assigned to any processor. Next, several multipartitionings of grid zone 2 were tested. The rows u921, u631, and u136 in Ta- ble 3 correspond to the multipartitions 9.2.1, 6.3.1, and 1.3.6, respectively. As expected, the "good" multiparti- tions, u921 and u631, improve performance, while the "bad" multipartition, 1.3.6, degrades performance.7 Next, we tried repartitioning grid zones other than the largest.8 Partition e, based on reducing the maximal edge lengths, is derived from partition u631 by transferring a processor from grid zone 5 to grid zone 1. This does not improve the result significantly compared to partition u631. The parti- tion c, based on a common divisor of the edge length, per- formed poorly because grid zone 5 is starved for proces- sors. We thus arrived at a heuristic for multizone grid parti- tioning. First, calculate the average number of grid points to assign to each processor. Second, coalesce grid zones with less than the average number of grid points. 
Third, calculate a unipartition for the remaining points and zones. Fourth, examine the load balance for the uniparti- tion and repeat the previous steps if it is not satisfactory.9 Fifth, change the unipartitions into multipartitions, at- tempting to make the aspect ratio of the multipartitions as close as possible to the grid aspect ratio, avoiding symme- try planes or boundary conditions. Finally, examine the results and rebalance. The first step, achieving an optimal unipartitioning, is the most important. Fine-tuning the larger zones (here grid zone 2, which comprises about 28 of the points) can then yield significant improvements. Further tuning of smaller grid zones did not appreciably improve the results. 4.3 COARSE-GRAIN VERSUS FINE-GRAIN PARTITIONING COMPARISON A performance comparison between fine- and coarse- grain partitioning was performed to validate that the addi- tional coding effort to implement a fine-grain partitioning scheme was worthwhile. Both strategies were compared using 6 and 20 processors (the number of grid zones in the wingbody and wingbody-nacelle-diverter test prob- lems, respectively). All tests used the MPI version of OVERFLOW. For the coarse-grain tests, each grid zone was mapped to a processor. For the fine-grain test, the mean number of points per processor was calculated; zones that had fewer points than this average were coalesced, and zones that had twice or more than this number were partitioned ac- cording to the aspect ratio criterion above, which for these cases means partitioning along the largest grid zone dimension. Timing results are given in Table 4 (SGI Origin 195 MHz processors) and Table 5 (SGI Origin 250 MHz processors). The time to run the MPI code on a single pro- cessor is used as a baseline for calculating speedup and efficiency.10 For the coarse-grain strategy, we expect the maximal speedup to be approximately the ratio of total grid points to the number of grid points in the largest zone. For the wingbody problem, this is 2.37; for the wingbody-na- celle-diverter problem, it is 6.49. As can be seen in Tables 4 and 5, the speedups achieved in these two cases are close to the theoretical maximum. The fine-grain results are significantly better than the coarse-grain results on both test problems. The reason is the better load balancing that the fine-grain strategy achieves by coalescing and splitting grid zones. In addi- tion to runtime reduction, the better load balancing also manifests itself in higher processor use. 4.4 MPI VERSUS MLP COMPARISON The MPI and MLP versions of OVERFLOW both support co- alescing multiple grid zones onto a single processor and the allocation of multiple processors to compute on a sin- gle grid zone. In the MPI version, when multiple proces- sors are allocated to a single grid zone, a multipartitioning scheme is used to allocate subzones to each processor. In the MLP version, the multiple processors simultaneously execute the inner loops of the zonal solver. There are several significant differences between the two methods. The MPI version assumes a distrib- uted-memory programming model. Conversely, the MLP version uses two shared-memory constructs, the UNIX fork system call and parallel directives, for parallelism. The MPI version requires approximately 25,000 addi- tional lines of code (compared with the serial version) to implement but can be run on SMP, NUMA, massively parallel processors (MPPs), and PC clusters. 
By contrast, the MLP version only uses an additional 300 lines of code but runs only on SMP and NUMA systems. Two attributes influence the performance of the meth- ods. The first is how evenly the work is distributed when PARALLEL CFD PERFORMANCE 29 multiple processors are applied to a single zone. In the MPI version, multipartitioning maps approximately equal amounts of work (subdomains) to each processor. In the MLP version, the multiple processors execute only the sec- tions of code that contain parallel loops. The second attrib- ute is the computational overhead associated with each method. The MPI version incurs message-passing over- head whenever data are exchanged. The MLP version in- curs shared-memory synchronization overheads whenever a parallel loop is encountered. Tables 6 and 7 contain timings that compare the perfor- mance of the two methods on an SGI Origin (250 MHz processors). For the wingbody case, the MPI version is sig- nificantly faster on 8 and 16 processors, slightly faster on 32 and 64 processors, and significantly slower on 128 pro- cessors. It appears that the MPI version has better load-bal- ancing characteristics due to a more equal distribution of work to the processors. However, as the number of proces- sors grows, the message-passing overhead associated with the larger number of partitions mitigates this advantage. For the wingbody-nacelle-diverter case, except for one anomalous result,11 the two versions perform similarly up to 64 processors. On 128 processors, the MLP version out- performs the MPI version. We believe the increased granu- larity of work resulting from the larger problem size is the reason that the MLP version performs similarly to the MPI version on smaller numbers of processors. Once again, however, the increased message-passing overhead associ- ated with the larger number of partitions degrades the per- formance of the MPI version with 128 processors. 4.5 MACHINE COMPARISON We tested OVERFLOW on several different machines. The sequential version was run on a CRAY C-90 and a CRAY T-90. These machines represent the vector super- 30 COMPUTING APPLICATIONS Table 3 Multizone Partitioning of Wingbody with 44 Processors Heuristic Zone 1 Zone 2 Zone 3 Zone 4 Zone 5 Zone 6 Time u 11.1.1 18.1.1 3.1.1 1.1.1 9.1.1 2.1.1 10.6 u921 11.1.1 9.2.1 3.1.1 1.1.1 9.1.1 2.1.1 8.8 u631 11.1.1 6.3.1 3.1.1 1.1.1 9.1.1 2.1.1 9.2 u136 11.1.1 1.3.6 3.1.1 1.1.1 9.1.1 2.1.1 12.9 e 12.1.1 6.3.1 3.1.1 1.1.1 4.1.2 2.1.1 9.1 c 10.1.1 7.3.1 5.1.1 1.1.1 5.1.1 2.1.1 13.7 Table 4 Wingbody Timings (seconds): Coarse Grain versus Fine Grain NP Strategy Time Speedup Efficiency 1 Sequential 289.2 (1) (100) 6 Coarse grain 139.8 2.1 34.5 6 Fine grain 75.2 3.8 64.1 Table 5 Wingbody-Nacelle-Diverter Timings (seconds): Coarse Grain versus Fine Grain NP Strategy Time Speedup Efficiency 1 Sequential 725.3 (1) (100) 20 Coarse grain 110.5 6.6 32.8 20 Fine grain 40.2 18.0 90.2 Table 6 Wingbody Timings (seconds): Message- Passing Interface (MPI) versus Multilevel Par- allel (MLP) Approach NP MPI MLP 8 31.4 44.8 16 15.9 26.2 32 13.8 15.4 64 8.7 11.2 128 10.2 8.7 computers OVERFLOW was originally optimized for. The MPI version was run on a CRAY T3E, SGI Origin, and sev- eral PC systems. The CRAY T3E and SGI Origin 12 are rep- resentative of the MPP systems that parallel versions of OVERFLOW are run on. The PC systems represent a pos- sible low-cost (and increasingly high performance) alter- native platform for running OVERFLOW. Details of the systems tested were given in Section 3.2. 
The results for the wingbody and wingbody-nacelle- diverter test problems are given in Tables 8 and 9, respec- tively. The minimal number of nodes needed to run these problems was 8 on the Compaq 6000 and HP Kayak and 29 on the CRAY T3E. The Compaq 6000 results for 64 pro- cessors and the HP Kayak results for 128 processors were run using both processors in each PC. All other runs used only one of the processors in each PC. The CRAY C-90 and T-90 systems significantly outper- formed the other systems tested. The T-90 was approxi- mately 50 times faster than the C-90 and an order of magni- tude faster than the next fastest system, the SGI Origin. One factor reflected in this performance gap is the maturity and performance tuning in the versions of OVERFLOW that were run. The sequential version had been highly opti- mized for the vector architecture of the CRAYs, while the MPI version was still under development during our test- ing and had not been optimized for cache architectures. Of the other systems we tested, the SGI Origin was the fastest. For all but one case, it was a factor of 2 to 4 faster than the PC clusters and CRAY T3E.13 For most cases, the Origin was as fast on 64 processors as the CRAY T3E was on 128 or more processors. The CRAY T3E had several significant limitations. The most severe was the relatively small amount of memory PARALLEL CFD PERFORMANCE 31 Table 7 Wingbody-Nacelle-Diverter Timings (seconds): Message-Passing Interface (MPI) versus Multi- level Parallel (MLP) Approach NP MPI MLP 8 128.6 98.4 16 53.7 56.1 32 30.6 31.1 64 17.0 18.2 128 15.2 11.7 Table 8 Wingbody Timings (seconds): Machine Comparison Number of Processors Compaq 6000 CRAY C-90 CRAY T-90 CRAY T3E HP Kayak SGI Origin 1 38.0 25.7 245.6 2 171.4 4 120.7 8 80.2 86.4 31.4 16 42.1 44.7 15.9 32 29.8 59.2 29.9 13.8 64 19.6 14.3 15.6 8.7 128 8.2 15.7 10.2 256 7.1 512 8.5 (and cache) on each processor, coupled with a lack of vir- tual memory. This meant that the minimal sizes for the two test cases were 29 and 70 processors, respectively. Other difficulties were poor per node performance and the un- usual floating-point representation. The performance of the PC clusters was surprisingly good. When "enough" level-2 cache was available (i.e., 32 or more processors), performance was half that of the SGI Origin. When sufficient cache was not available, perfor- mance degraded to a third of the SGI Origin. One unex- pected result was that the HP Kayak (300 MHz processors) PC cluster outperformed the Compaq 6000 (333 MHz pro- cessors) PC cluster. A likely explanation is that there were performance problems with the PCI chip set used in the Compaq 6000. The performance of the Compaq 8000 was limited in two ways. First, it used an older Pentium Pro processor, which only ran at 200 MHz. Second, its SMP architecture does not provide enough memory bandwidth to allow more than one processor to run at full speed. Nevertheless, the Compaq 8000 was still able to run the wingbody-nacelle- diverter test case on a single processor at one-fourth the performance of an SGI Origin using one processor. Superlinear speedup due to cache effects was observed on all systems except the Compaq 8000. For the Compaq 6000 and HP Kayak clusters, this occurred on up to 32 pro- cessors on the wingbody-nacelle-diverter problem. Be- yond 32 processors, the pieces of the partitioned problem were small enough to fit within each processor's cache. 
Table 9. Wingbody-Nacelle-Diverter Timings (seconds): Machine Comparison

Number of Processors   Compaq 6000   Compaq 8000   CRAY C-90   CRAY T-90   CRAY T3E   HP Kayak   SGI Origin
1                                    2531          111.2       75.1                              725.3
2                                    1686                                                        359.2
4                                    1345                                                        224.3
8                      1469.0                                                        428.4       128.6
16                      387.0                                                        176.7        53.7
32                       67.0                                                         74.6        30.6
64                       43.7                                                         35.4        17.0
128                                                                        17.9       28.0        15.2
256                                                                        14.7
512                                                                        12.8

For the SGI Origin, a small superlinear effect was noticeable going from 8 to 16 processors on the wingbody-nacelle-diverter test case and from 4 to 8 processors on the wingbody test case. For the CRAY T3E, a superlinear speedup effect was observed on the wingbody test case when going from 32 to 64 processors.

Most systems showed good speedup until a threshold number of processors was reached. For the CRAY T3E, this happened above 128 processors on both test problems. For the SGI Origin, this happened at 16 and 64 processors for the wingbody and wingbody-nacelle-diverter test cases, respectively. For these two systems, we attribute the performance degradation to the increased message-passing overhead.

For the PC clusters, performance degraded going from 32 to 64 processors on the Compaq 6000 and going from 64 to 128 processors on the HP Kayak. These cases correspond with the transition from computing with a single processor per PC to computing with both processors in a PC. The memory bandwidth limitation that results when both processors share the memory bus is responsible for the performance degradation observed. A similar memory bandwidth limitation affected the Compaq 8000.

5 Conclusions

Both the MLP and MPI versions of OVERFLOW offer promising approaches for taking advantage of parallel hardware. Both methods support coalescing multiple grid zones onto a single processor and the allocation of multiple processors to compute on a single grid zone. The MPI version performed better on smaller numbers of processors, and the MLP version performed better on larger numbers of processors. While we believe the MPI version has better overall load-balancing characteristics, for a fixed-size problem, this is mitigated by the overhead of message passing beyond some threshold number of processors.

There are several advantages to the MLP approach. First, only minimal changes to the sequential source code are necessary. Second, the dual levels of parallelism support good processor utilization. Third, the shared-memory implementation enables good scalability. The primary disadvantage of the MLP approach is portability; it will not run on distributed-memory architectures (including PC clusters) or under non-UNIX operating systems.

The MPI version has two important advantages. First, it will run on a variety of architectures, including SMPs, NUMAs, MPPs, and PC clusters. Second, the MPI version has good load-balancing characteristics. The disadvantages of the MPI version are the extensive changes to source code required for implementation and the overhead of message passing on large numbers of processors.

To use the MPI version effectively, a good partitioning strategy must be chosen. Simple strategies such as assigning a grid zone to each processor lead to poor load balancing and performance.
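The message-passing cost that sets this threshold comes from the boundary (halo) exchanges every partition performs on each iteration: more partitions mean more, and smaller, messages whose cost is pure overhead. The sketch below is a minimal illustration in C with MPI, not OVERFLOW's actual Fortran implementation; the array length, iteration count, and the one-dimensional decomposition are illustrative assumptions.

/* Illustrative sketch (not OVERFLOW): a 1-D domain decomposition with a
 * one-cell halo exchanged each iteration via MPI_Sendrecv. Every such
 * exchange is message-passing overhead of the kind discussed above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NLOCAL 1000   /* interior points owned by each rank (assumed size) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* u[0] and u[NLOCAL+1] are halo cells filled from the neighbors. */
    double *u = calloc(NLOCAL + 2, sizeof(double));
    for (int i = 1; i <= NLOCAL; i++) u[i] = (double)rank;

    for (int step = 0; step < 100; step++) {
        /* Exchange boundary values with both neighbors (the overhead). */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Local relaxation sweep over the interior points (the useful work). */
        for (int i = 1; i <= NLOCAL; i++)
            u[i] = 0.5 * (u[i - 1] + u[i + 1]);
    }

    if (rank == 0) printf("done on %d ranks\n", size);
    free(u);
    MPI_Finalize();
    return 0;
}

In a real multiblock solver the exchanges are three-dimensional and per-face, but the pattern of communicating halos and then sweeping the local interior is the same, which is why a fixed-size problem cut into ever more partitions eventually spends more time exchanging than computing.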
While unipartitioning can yield acceptable parallel performance, multipartitioning leads to significantly better results. It is important, however, to be sure that the aspect ratio of the multipartition is close to the aspect ratio of the grid zone.

The CRAY T-90 was the fastest machine we tested. Its uniprocessor performance was an order of magnitude faster than that of any nonvector system we tested. One reason for this is that the sequential version has been highly optimized for CRAY's vector architecture. The SGI Origin was the next fastest machine we tested. It had the advantage that its NUMA architecture supported both the MPI and MLP versions of OVERFLOW. The CRAY T3E had several significant limitations, including a small amount of memory and cache on each processor, no virtual memory, poor per node performance, and an unusual floating-point representation.

The performance of the PC clusters was surprisingly good. When the aggregate level-2 cache was of sufficient size, performance was half that of the SGI Origin using a comparable number of processors. The high performance Myrinet network played a key role in enabling this performance. The performance of the large-memory Compaq 8000 was notable for being able to run the wingbody-nacelle-diverter problem on a single PC processor. This is likely one of the largest numerical problems ever run on a PC.

A significant limitation we noticed with SMP PCs is a degradation in performance due to memory bandwidth limitations. On the dual-processor Compaq 6000 and HP Kayak PCs, we noted a degradation in performance that coincided with the transition from computing with a single processor per PC to computing with both processors in a PC. A similar problem manifested itself in the quad-processor Compaq 8000, which did not provide enough memory bandwidth to allow more than one processor to run at full speed.

ACKNOWLEDGMENTS

Thanks are due Pieter Buning of NASA for the serial version of OVERFLOW and to Dennis Jespersen of NASA for his unstinting help in debugging and testing his MPI version. We thank Ferhat Hatay for discussions about multipartitioning and the SGI Origin architecture. Stephen Chaney of the Boeing HSCT group supplied the production runs on which these benchmarks are based, and Anutosh Moitra of that group also contributed some additional runs and test cases. Jim Taft graciously ran our cases on his MLP version of OVERFLOW. Joel Hirsh ran our test cases on the CRAY T-90 at Boeing. We acknowledge a grant of computing time on the CRAY T3E and SGI Origin systems from NASA Ames Research Center. We thank the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign for access to their NT SuperCluster system.

BIOGRAPHIES

David Kerlick graduated with a degree in physics from Rensselaer Polytechnic Institute in 1970. He received a Ph.D. in theoretical physics from Princeton University in 1975 and worked in general relativity until 1979; in computational fluid dynamics, computational geometry, and electromagnetics until 1986; and thereafter in scientific visualization, parallel computing, and computer graphics. He has worked for Nielson Engineering, NASA Ames, Tektronix, and (since 1991) for the Boeing Company. His areas of interest are scientific, engineering, and information visualization; parallel and distributed high performance computing; and Web applications using modular servers and 3-D clients.
Eric Dillon received his master's degree in computer science in 1993 from ESIAL (Computer Science School) at University H. Poincaré, Nancy, France. He was a nonpermanent researcher at LORIA (Nancy, France) while preparing his Ph.D. in computer science and graduated in 1997 from University H. Poincaré. His main field of interest includes high performance computing related to both parallel scientific applications (computing intensive) and business applications (transaction-based applications on distributed architectures). He is now a researcher at Boeing in the Mathematics and Computing Technology Division in Seattle and is mainly involved in performance evaluation, simulation, and prediction for distributed applications.

David Levine is a computational biologist at Rosetta Inpharmatics in Kirkland, Washington, where he develops algorithms for the analysis of gene expression data. He received a Ph.D. degree in computer science from the Illinois Institute of Technology. He has previously worked at Control Data Corporation, Argonne National Laboratory, and the Boeing Company. He is the developer of the PGAPack parallel genetic algorithm library. His research interests are in computational biology, genetic algorithms, parallel computing, scientific applications, and performance evaluation.

NOTES

1. The choice of PVM is historical. If the work were done today, message-passing interface (MPI) would be used.

2. A nacelle is a faired engine housing. A diverter is used for directing airflow.

3. A commercial product that provides many UNIX commands for use in a Windows NT environment.

4. For example, 6.3.1 is the integer triplet closest to the grid aspect ratio of 5.8 : 2.6 : 1 for 18 processors.

5. This confirms earlier work. For example, Reed, Adams, and Patrick (1987) showed that in two dimensions, the efficiency of a partition is given by the ratio of points communicated to points computed, or perimeter-to-area ratio. These results generalize in a straightforward way to three dimensions.

6. This number of processors was originally required by a partitioning scheme, identified as c in Table 3, which divided each grid dimension by a common divisor L that is roughly the cube root of Npts/NP. Here, Npts is the total number of points in the problem, and NP is the number of processors used. For the choice L = 38, we obtain a requirement for 44 processors.

7. We obtained similar results for the wingbody-nacelle-diverter case on 120 processors, of which the 44-processor wingbody case is a proper subset, in which refining partitions toward the grid aspect ratios (6.3.1 and 9.2.1 in grid zone 2) improves overall performance, but poor partitioning, such as 1.3.6, degrades performance.

8. At the time we did this work, multipartitioning was not available for grid zone 1, which has a singular axis in the grid.

9. Here and also in the sixth step, we recommend that the user make a short OVERFLOW run (≈10 iterations), examine the printout of processor idle time, and try to hand-tune the partitions by reallocating processors to other grid zones if necessary. The intent is that a small savings in solution time could make an appreciable difference when run for thousands of iterations.

10. The MPI code run on a single processor is a reasonable approximation of the time to run the serial code.

11. The multilevel parallel version outperforms the MPI version on eight processors.

12. The SGI Origin system used in Tables 8 and 9 had 250 MHz processors.
13. The wingbody-nacelle-diverter problem is only 18% faster on the SGI Origin than on the CRAY T3E when using 128 processors.

REFERENCES

Atwood, C., and M. Smith. 1995. Nonlinear fluid computations in a distributed environment. In AIAA Proceedings, Paper No. 95-0224.

Buning, P., M. Smith, J. Ryan, C. Atwood, K. Chawla, and S. Weeratunga. 1995. OVERFLOW-Navier-Stokes CFD [Online]. Available: http://esdcd.gsfc.nasa.gov/ESS/annual.reports/ess95contents/app.jnnie.html.

Hatay, F., D. Jespersen, G. Guruswamy, Y. Rizk, C. Byun, and K. Gee. 1997. A multi-level parallelization concept for high-fidelity multi-block solvers. In Supercomputing 97 Proceedings. San Jose, CA: Association for Computing Machinery.

Jespersen, D. 1998. Parallelism and OVERFLOW. NASA Ames Research Center, NAS-98-013 [Online]. Available: http://www.nas.nasa.gov/Research/Reports/Techreports/1998/nas-98-013-abstract.html.

Jespersen, D., T. Pulliam, and P. Buning. 1997. Recent enhancements to OVERFLOW. In AIAA Proceedings, Paper No. 97-0644.

Lauria, M., and A. Chien. 1997. MPI-FM: High performance MPI on workstation clusters. Journal of Parallel and Distributed Computing 40 (1): 4-18.

Pulliam, T., and D. Chaussee. 1981. A diagonal form of an implicit approximate factorization algorithm. Journal of Computational Physics 39:347.

Pulliam, T., and J. Steger. 1978. On implicit finite-difference simulations of three-dimensional flows. In AIAA Proceedings, Paper No. 78-0010.

Reed, D., L. Adams, and M. Patrick. 1987. Stencils and problem partitionings: Their influence on the performance of multiple processor systems. IEEE Transactions on Computers C-36 (7): 845-858.

Ryan, J., and S. Weeratunga. 1993. Parallel computation of 3-D Navier-Stokes flowfields for supersonic vehicles. In AIAA Proceedings, Paper No. 93-0064.

Smith, M., R. Van der Wijngaart, and M. Yarrow. 1995. Improved multi-partition method for line-based iteration schemes. In Computational Aerosciences Workshop 95.

Spalart, P., and S. Allmaras. 1992. A one-equation turbulence model for aerodynamic flows. In AIAA Proceedings, Paper No. 92-0439.

Steger, J., F. Dougherty, and J. Benek. 1983. A chimera grid scheme. Advances in Grid Generation 5:59-69.

Taft, J. 1998. OVERFLOW gets excellent results on SGI Origin 2000. NASnews 3 (1) [Online]. Available: http://science.nas.nasa.gov/Pubs/NASnews/98/01/overflow.html.</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<back>
<notes>
<p>
<list list-type="order">
<list-item>
<p>1. The choice of PVM is historical. If the work were done today, message-passing interface (MPI) would be used.</p>
</list-item>
<list-item>
<p>2. A nacelle is a faired engine housing. A diverter is used for directing airflow.</p>
</list-item>
<list-item>
<p>3. A commercial product that provides many UNIX commands for use in a Windows NT environment.</p>
</list-item>
<list-item>
<p>4. For example, 6.3.1 is the integer triplet closest to the grid aspect ratio of 5.8 : 2.6 : 1 for 18 processors.</p>
</list-item>
<list-item>
<p>5. This confirms earlier work. For example, Reed, Adams, and Patrick (1987) showed that in two dimensions, the efficiency of a partition is given by the ratio of points communicated to points computed, or perimeter-to-area ratio. These results generalize in a straightforward way to three dimensions.</p>
</list-item>
<list-item>
<p>6. This number of processors was originally required by a partitioning scheme, identified as c in Table 3, which divided each grid dimension by a common divisor L that is roughly the cube root of Npts/NP. Here, Npts is the total number of points in the problem, and NP is the number of processors used. For the choice L = 38, we obtain a requirement for 44 processors.</p>
</list-item>
<list-item>
<p>7. We obtained similar results for the wingbody-nacelle-diverter case on 120 processors, of which the 44-processor wingbody case is a proper subset, in which refining partitions toward the grid aspect ratios (6.3.1 and 9.2.1 in grid zone 2) improves overall performance, but poor partitioning, such as 1.3.6, degrades performance.</p>
</list-item>
<list-item>
<p>8. At the time we did this work, multipartitioning was not available for grid zone 1, which has a singular axis in the grid.</p>
</list-item>
<list-item>
<p>9. Here and also in the sixth step, we recommend that the user make a short OVERFLOW run (≈ 10 iterations), examine the printout of processor idle time, and try to hand-tune the partitions by reallocating processors to other grid zones if necessary. The intent is that a small savings in solution time could make an appreciable difference when run for thousands of iterations.</p>
</list-item>
<list-item>
<p>10. The MPI code run on a single processor is a reasonable approximation of the time to run the serial code.</p>
</list-item>
<list-item>
<p>11. The multilevel parallel version outperforms the MPI version on eight processors.</p>
</list-item>
<list-item>
<p>12. The SGI Origin system used in Tables 8 and 9 had 250 MHz processors.</p>
</list-item>
<list-item>
<p>13. The wingbody-nacelle-diverter problem is only 18% faster on the SGI Origin than on the CRAY T3E when using 128 processors.</p>
</list-item>
</list>
</p>
</notes>
<ref-list>
<ref>
<citation citation-type="confproc" xlink:type="simple">
<name name-style="western">
<surname>Atwood, C.</surname>
</name>
, and
<name name-style="western">
<surname>M. Smith</surname>
</name>
. 1995.
<article-title>Nonlinear fluid computations in a distributed environment</article-title>
. In
<conf-name>AIAA Proceedings</conf-name>
, Paper No. 95-0224.</citation>
</ref>
<ref>
<citation citation-type="book" xlink:type="simple">
<name name-style="western">
<surname>Buning, P.</surname>
</name>
,
<name name-style="western">
<surname>M. Smith</surname>
</name>
,
<name name-style="western">
<surname>J. Ryan</surname>
</name>
,
<name name-style="western">
<surname>C. Atwood</surname>
</name>
,
<name name-style="western">
<surname>K. Chawla</surname>
</name>
, and
<name name-style="western">
<surname>S. Weeratunga</surname>
</name>
.
<year>1995</year>
.
<source>OVERFLOW-Navier-Stokes CFD</source>
[On-line]. Available:
<uri xlink:type="simple">http://esdcd.gsfc.nasa.gov/ESS/annual.reports/ess95contents/app.jnnie.html.</uri>
</citation>
</ref>
<ref>
<citation citation-type="confproc" xlink:type="simple">
<name name-style="western">
<surname>Hatay, F.</surname>
</name>
,
<name name-style="western">
<surname>D. Jespersen</surname>
</name>
,
<name name-style="western">
<surname>G. Guruswamy</surname>
</name>
,
<name name-style="western">
<surname>Y. Rizk</surname>
</name>
,
<name name-style="western">
<surname>C. Byun</surname>
</name>
, and
<name name-style="western">
<surname>K. Gee</surname>
</name>
. 1997.
<article-title>A multi-level parallelization concept for high-fidelity multi-block solvers</article-title>
. In
<conf-name>Supercomputing 97 Proceedings</conf-name>
.
<conf-loc>San Jose, CA: Association for Computing Machinery</conf-loc>
.</citation>
</ref>
<ref>
<citation citation-type="book" xlink:type="simple">
<name name-style="western">
<surname>Jespersen, D.</surname>
</name>
<year>1998</year>
.
<source>Parallelism and OVERFLOW</source>
. NASA Ames Research Center, NAS-98-013 [Online]. Available:
<uri xlink:type="simple">http://www.nas.nasa.gov/Research/Reports/Techreports/1998/nas-98-013-abstract.html.</uri>
</citation>
</ref>
<ref>
<citation citation-type="confproc" xlink:type="simple">
<name name-style="western">
<surname>Jespersen, D.</surname>
</name>
,
<name name-style="western">
<surname>T. Pulliam</surname>
</name>
, and
<name name-style="western">
<surname>P. Buning</surname>
</name>
. 1997.
<article-title>Recent enhancements to OVERFLOW</article-title>
. In
<conf-name>AIAA Proceedings</conf-name>
, Paper No. 97-0644.</citation>
</ref>
<ref>
<citation citation-type="journal" xlink:type="simple">
<name name-style="western">
<surname>Lauria, M.</surname>
</name>
, and
<name name-style="western">
<surname>A. Chien</surname>
</name>
.
<year>1997</year>
.
<article-title>MPI-FM: High performance MPI on workstation clusters</article-title>
.
<source>Journal of Parallel and Distributed Computing</source>
<volume>40</volume>
(
<issue>1</issue>
):
<fpage>4</fpage>
-
<lpage>18</lpage>
.</citation>
</ref>
<ref>
<citation citation-type="journal" xlink:type="simple">
<name name-style="western">
<surname>Pulliam, T.</surname>
</name>
, and
<name name-style="western">
<surname>D. Chaussee</surname>
</name>
.
<year>1981</year>
.
<article-title>A diagonal form of an implicit approximate factorization algorithm</article-title>
.
<source>Journal of Computational Physics</source>
<volume>39</volume>
:
<fpage>347</fpage>
.</citation>
</ref>
<ref>
<citation citation-type="confproc" xlink:type="simple">
<name name-style="western">
<surname>Pulliam, T.</surname>
</name>
, and
<name name-style="western">
<surname>J. Steger</surname>
</name>
. 1978.
<article-title>On implicit finite-difference simulations of three-dimensional flows</article-title>
. In
<conf-name>AIAA Proceedings</conf-name>
, Paper No. 78-0010.</citation>
</ref>
<ref>
<citation citation-type="journal" xlink:type="simple">
<name name-style="western">
<surname>Reed, D.</surname>
</name>
,
<name name-style="western">
<surname>L. Adams</surname>
</name>
, and
<name name-style="western">
<surname>M. Patrick</surname>
</name>
.
<year>1987</year>
.
<article-title>Stencils and problem partitionings: Their influence on the performance of multiple processor systems</article-title>
.
<source>IEEE Transactions on Computers</source>
<volume>C-36</volume>
(
<issue>7</issue>
):
<fpage>845</fpage>
-
<lpage>858</lpage>
.</citation>
</ref>
<ref>
<citation citation-type="confproc" xlink:type="simple">
<name name-style="western">
<surname>Ryan, J.</surname>
</name>
, and
<name name-style="western">
<surname>S. Weeratunga</surname>
</name>
. 1993.
<article-title>Parallel computation of 3-D Navier-Stokes flowfields for supersonic vehicles</article-title>
.
<conf-name>AIAA Proceedings</conf-name>
, Paper No. 93-0064.</citation>
</ref>
<ref>
<citation citation-type="confproc" xlink:type="simple">
<name name-style="western">
<surname>Smith, M.</surname>
</name>
,
<name name-style="western">
<surname>R. Van der Wijngaart</surname>
</name>
, and
<name name-style="western">
<surname>M. Yarrow</surname>
</name>
. 1995.
<article-title>Improved multi-partition method for line-based iteration schemes</article-title>
. In
<conf-name>Computational Aerosciences Workshop 95</conf-name>
.</citation>
</ref>
<ref>
<citation citation-type="confproc" xlink:type="simple">
<name name-style="western">
<surname>Spalart, P.</surname>
</name>
, and
<name name-style="western">
<surname>S. Allmaras</surname>
</name>
. 1992.
<article-title>A one-equation turbulence model for aerodynamic flows</article-title>
. In
<conf-name>AIAA Proceedings</conf-name>
, Paper No. 92-0439.</citation>
</ref>
<ref>
<citation citation-type="journal" xlink:type="simple">
<name name-style="western">
<surname>Steger, J.</surname>
</name>
,
<name name-style="western">
<surname>F. Dougherty</surname>
</name>
, and
<name name-style="western">
<surname>J. Benek</surname>
</name>
.
<year>1983</year>
.
<article-title>A chimera grid scheme</article-title>
.
<source>Advances in Grid Generation</source>
<volume>5</volume>
:
<fpage>59</fpage>
-
<lpage>69</lpage>
.</citation>
</ref>
<ref>
<citation citation-type="journal" xlink:type="simple">
<name name-style="western">
<surname>Taft, J.</surname>
</name>
<year>1998</year>
.
<article-title>OVERFLOW gets excellent results on SGI Origin 2000</article-title>
.
<source>NASnews</source>
<volume>3</volume>
(
<issue>1</issue>
) [Online]. Available:
<uri xlink:type="simple">http://science.nas.nasa.gov/Pubs/NASnews/98/01/overflow.html.</uri>
</citation>
</ref>
</ref-list>
</back>
</article>
</istex:document>
</istex:metadataXml>
<mods version="3.6">
<titleInfo lang="en">
<title>Performance Testing of a Parallel Multiblock CFD Solver</title>
</titleInfo>
<titleInfo type="alternative" lang="en" contentType="CDATA">
<title>Performance Testing of a Parallel Multiblock CFD Solver</title>
</titleInfo>
<name type="personal">
<namePart type="given">David</namePart>
<namePart type="family">Kerlick</namePart>
<affiliation>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</affiliation>
</name>
<name type="personal">
<namePart type="given">Eric</namePart>
<namePart type="family">Dillon</namePart>
<affiliation>Mathematics and Computing Technology, Boeing Company, Seattle, Washington</affiliation>
</name>
<name type="personal">
<namePart type="given">David</namePart>
<namePart type="family">Levine</namePart>
<affiliation>Rosetta Inpharmatics, Kirkland, Washington</affiliation>
</name>
<typeOfResource>text</typeOfResource>
<genre type="research-article" displayLabel="research-article" authority="ISTEX" authorityURI="https://content-type.data.istex.fr" valueURI="https://content-type.data.istex.fr/ark:/67375/XTP-1JC4F85T-7">research-article</genre>
<originInfo>
<publisher>Sage Publications</publisher>
<place>
<placeTerm type="text">Sage CA: Thousand Oaks, CA</placeTerm>
</place>
<dateIssued encoding="w3cdtf">2001-02</dateIssued>
<copyrightDate encoding="w3cdtf">2001</copyrightDate>
</originInfo>
<language>
<languageTerm type="code" authority="iso639-2b">eng</languageTerm>
<languageTerm type="code" authority="rfc3066">en</languageTerm>
</language>
<abstract lang="en">A distributed-memory version of the OVERFLOW computational fluid dynamics code was evaluated on several parallel systems and compared with other approaches using test cases provided by the NASA-Boeing High-Speed Civil Transport program. A principal goal was to develop partitioning and load-balancing strategies that led to a reduction in computation time. We found multipartitioning, in which the aspect ratio of the multipartition is close to the aspect ratio of the grid zone, offered the best performance. The (uniprocessor) performance of the CRAY vector systems was superior to all other systems tested. However, the distributed-memory version when run on an SGI Origin system offers a price performance advantage over the CRAY vector systems. Performance on personal computer systems is promising but faces several hurdles.</abstract>
<relatedItem type="host">
<titleInfo>
<title>The International Journal of High Performance Computing Applications</title>
</titleInfo>
<genre type="journal" authority="ISTEX" authorityURI="https://publication-type.data.istex.fr" valueURI="https://publication-type.data.istex.fr/ark:/67375/JMC-0GLKJH51-B">journal</genre>
<identifier type="ISSN">1094-3420</identifier>
<identifier type="eISSN">1741-2846</identifier>
<identifier type="PublisherID">HPC</identifier>
<identifier type="PublisherID-hwp">sphpc</identifier>
<part>
<date>2001</date>
<detail type="volume">
<caption>vol.</caption>
<number>15</number>
</detail>
<detail type="issue">
<caption>no.</caption>
<number>1</number>
</detail>
<extent unit="pages">
<start>22</start>
<end>35</end>
</extent>
</part>
</relatedItem>
<identifier type="istex">10F89E9E8C955141EF1B7E305BA97BB38B570610</identifier>
<identifier type="ark">ark:/67375/M70-X79RBH0Z-6</identifier>
<identifier type="DOI">10.1177/109434200101500103</identifier>
<identifier type="ArticleID">10.1177_109434200101500103</identifier>
<recordInfo>
<recordContentSource authority="ISTEX" authorityURI="https://loaded-corpus.data.istex.fr" valueURI="https://loaded-corpus.data.istex.fr/ark:/67375/XBH-0J1N7DQT-B">sage</recordContentSource>
</recordInfo>
</mods>
<json:item>
<extension>json</extension>
<original>false</original>
<mimetype>application/json</mimetype>
<uri>https://api.istex.fr/ark:/67375/M70-X79RBH0Z-6/record.json</uri>
</json:item>
</metadata>
<serie></serie>
</istex>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Istex/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000384 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Istex/Corpus/biblio.hfd -nk 000384 | SxmlIndent | more

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Wicri/Lorraine
   |area=    InforLorV4
   |flux=    Istex
   |étape=   Corpus
   |type=    RBID
   |clé=     ISTEX:10F89E9E8C955141EF1B7E305BA97BB38B570610
   |texte=   Performance Testing of a Parallel Multiblock CFD Solver
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Jun 10 21:56:28 2019. Site generation: Fri Feb 25 15:29:27 2022