OCR exploration server

Warning: this site is under development!
Warning: this site is generated automatically from raw corpora.
The information has therefore not been validated.

Languages cool as they expand: Allometric scaling and the decreasing need for new words

Internal identifier: 000000 (Pmc/Curation); next: 000001


Authors: Alexander M. Petersen [Italy]; Joel N. Tenenbaum [United States]; Shlomo Havlin [Israel]; H. Eugene Stanley [United States]; Matjaž Perc [Slovenia]

Source:

RBID: PMC:3517984

Abstract

We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing marginal need for new words, a feature that is likely related to the underlying correlations between words. We calculate the annual growth fluctuations of word use, which show a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This “cooling pattern” forms the basis of a third statistical regularity which, unlike the Zipf and Heaps laws, is dynamical in nature.
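The two static regularities named in the abstract can be illustrated on a small synthetic corpus. The following is a minimal sketch using only the Python standard library, with an invented toy vocabulary (`w1`, `w2`, ...); it is not the article's analysis of the Google Books n-gram data:

```python
# Illustrative sketch: Zipf and Heaps laws on a synthetic corpus.
# The word pool and sampling weights below are assumptions for demonstration,
# not data from the article.
import random
from collections import Counter

random.seed(1)

# Build a toy "language": sample from a pool with Zipf-like weights so a few
# words are very common and many are rare.
pool = [f"w{i}" for i in range(1, 5001)]
weights = [1 / i for i in range(1, 5001)]
corpus = random.choices(pool, weights=weights, k=50_000)

# Zipf law: word frequency decays roughly as a power of its rank.
freqs = sorted(Counter(corpus).values(), reverse=True)
print("top-5 word frequencies:", freqs[:5])

# Heaps law: vocabulary size V grows sublinearly with corpus size N (V ~ N^b, b < 1),
# i.e. the marginal need for new words decreases as the corpus expands.
seen, growth = set(), []
for n, word in enumerate(corpus, 1):
    seen.add(word)
    if n % 10_000 == 0:
        growth.append((n, len(seen)))
print("(corpus size, vocabulary size):", growth)
```

The declining ratio of vocabulary size to corpus size in the second printout is the sublinear Heaps scaling that the article tests at much larger scale.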


URL:
DOI: 10.1038/srep00943
PubMed: 23230508
PubMed Central: 3517984

Links to previous steps (curation, corpus...)


Links to Exploration step

PMC:3517984

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Languages cool as they expand: Allometric scaling and the decreasing need for new words</title>
<author>
<name sortKey="Petersen, Alexander M" sort="Petersen, Alexander M" uniqKey="Petersen A" first="Alexander M." last="Petersen">Alexander M. Petersen</name>
<affiliation wicri:level="1">
<nlm:aff id="a1">
<institution>Laboratory for the Analysis of Complex Economic Systems, IMT Lucca Institute for Advanced Studies</institution>
, Lucca 55100,
<country>Italy</country>
</nlm:aff>
<country xml:lang="fr">Italie</country>
<wicri:regionArea># see nlm:aff country strict</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Tenenbaum, Joel N" sort="Tenenbaum, Joel N" uniqKey="Tenenbaum J" first="Joel N." last="Tenenbaum">Joel N. Tenenbaum</name>
<affiliation wicri:level="1">
<nlm:aff id="a2">
<institution>Center for Polymer Studies and Department of Physics, Boston University</institution>
, Boston, Massachusetts 02215,
<country>USA</country>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea># see nlm:aff country strict</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="a3">
<institution>Operations and Technology Management, School of Management, Boston University</institution>
, Boston, Massachusetts 02215,
<country>USA</country>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea># see nlm:aff country strict</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Havlin, Shlomo" sort="Havlin, Shlomo" uniqKey="Havlin S" first="Shlomo" last="Havlin">Shlomo Havlin</name>
<affiliation wicri:level="1">
<nlm:aff id="a4">
<institution>Minerva Center and Department of Physics, Bar-Ilan University</institution>
, Ramat-Gan 52900, Israel</nlm:aff>
<country xml:lang="fr">Israël</country>
<wicri:regionArea>, Ramat-Gan 52900</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Stanley, H Eugene" sort="Stanley, H Eugene" uniqKey="Stanley H" first="H. Eugene" last="Stanley">H. Eugene Stanley</name>
<affiliation wicri:level="1">
<nlm:aff id="a2">
<institution>Center for Polymer Studies and Department of Physics, Boston University</institution>
, Boston, Massachusetts 02215,
<country>USA</country>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea># see nlm:aff country strict</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Perc, Matjaz" sort="Perc, Matjaz" uniqKey="Perc M" first="Matjaž" last="Perc">Matjaž Perc</name>
<affiliation wicri:level="1">
<nlm:aff id="a5">
<institution>Department of Physics, Faculty of Natural Sciences and Mathematics, University of Maribor</institution>
, Koroška cesta 160, SI-2000 Maribor, Slovenia</nlm:aff>
<country xml:lang="fr">Slovénie</country>
<wicri:regionArea>, Koroška cesta 160, SI-2000 Maribor</wicri:regionArea>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">23230508</idno>
<idno type="pmc">3517984</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3517984</idno>
<idno type="RBID">PMC:3517984</idno>
<idno type="doi">10.1038/srep00943</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Pmc/Corpus">000000</idno>
<idno type="wicri:Area/Pmc/Curation">000000</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Languages cool as they expand: Allometric scaling and the decreasing need for new words</title>
<author>
<name sortKey="Petersen, Alexander M" sort="Petersen, Alexander M" uniqKey="Petersen A" first="Alexander M." last="Petersen">Alexander M. Petersen</name>
<affiliation wicri:level="1">
<nlm:aff id="a1">
<institution>Laboratory for the Analysis of Complex Economic Systems, IMT Lucca Institute for Advanced Studies</institution>
, Lucca 55100,
<country>Italy</country>
</nlm:aff>
<country xml:lang="fr">Italie</country>
<wicri:regionArea># see nlm:aff country strict</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Tenenbaum, Joel N" sort="Tenenbaum, Joel N" uniqKey="Tenenbaum J" first="Joel N." last="Tenenbaum">Joel N. Tenenbaum</name>
<affiliation wicri:level="1">
<nlm:aff id="a2">
<institution>Center for Polymer Studies and Department of Physics, Boston University</institution>
, Boston, Massachusetts 02215,
<country>USA</country>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea># see nlm:aff country strict</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="a3">
<institution>Operations and Technology Management, School of Management, Boston University</institution>
, Boston, Massachusetts 02215,
<country>USA</country>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea># see nlm:aff country strict</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Havlin, Shlomo" sort="Havlin, Shlomo" uniqKey="Havlin S" first="Shlomo" last="Havlin">Shlomo Havlin</name>
<affiliation wicri:level="1">
<nlm:aff id="a4">
<institution>Minerva Center and Department of Physics, Bar-Ilan University</institution>
, Ramat-Gan 52900, Israel</nlm:aff>
<country xml:lang="fr">Israël</country>
<wicri:regionArea>, Ramat-Gan 52900</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Stanley, H Eugene" sort="Stanley, H Eugene" uniqKey="Stanley H" first="H. Eugene" last="Stanley">H. Eugene Stanley</name>
<affiliation wicri:level="1">
<nlm:aff id="a2">
<institution>Center for Polymer Studies and Department of Physics, Boston University</institution>
, Boston, Massachusetts 02215,
<country>USA</country>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea># see nlm:aff country strict</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Perc, Matjaz" sort="Perc, Matjaz" uniqKey="Perc M" first="Matjaž" last="Perc">Matjaž Perc</name>
<affiliation wicri:level="1">
<nlm:aff id="a5">
<institution>Department of Physics, Faculty of Natural Sciences and Mathematics, University of Maribor</institution>
, Koroška cesta 160, SI-2000 Maribor, Slovenia</nlm:aff>
<country xml:lang="fr">Slovénie</country>
<wicri:regionArea>, Koroška cesta 160, SI-2000 Maribor</wicri:regionArea>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Scientific Reports</title>
<idno type="eISSN">2045-2322</idno>
<imprint>
<date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing marginal need for new words, a feature that is likely related to the underlying correlations between words. We calculate the annual growth fluctuations of word use, which show a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This “cooling pattern” forms the basis of a third statistical regularity which, unlike the Zipf and Heaps laws, is dynamical in nature.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Evans, J A" uniqKey="Evans J">J. A. Evans</name>
</author>
<author>
<name sortKey="Foster, J G" uniqKey="Foster J">J. G. Foster</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ball, P" uniqKey="Ball P">P. Ball</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Helbing, D" uniqKey="Helbing D">D. Helbing</name>
</author>
<author>
<name sortKey="Balietti, S" uniqKey="Balietti S">S. Balietti</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lazer, D" uniqKey="Lazer D">D. Lazer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Barabasi, A L" uniqKey="Barabasi A">A. L. Barabási</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vespignani, A" uniqKey="Vespignani A">A. Vespignani</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Michel, J B" uniqKey="Michel J">J.-B. Michel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Petersen, A M" uniqKey="Petersen A">A. M. Petersen</name>
</author>
<author>
<name sortKey="Tenenbaum, J" uniqKey="Tenenbaum J">J. Tenenbaum</name>
</author>
<author>
<name sortKey="Havlin, S" uniqKey="Havlin S">S. Havlin</name>
</author>
<author>
<name sortKey="Stanley, H E" uniqKey="Stanley H">H. E. Stanley</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gao, J" uniqKey="Gao J">J. Gao</name>
</author>
<author>
<name sortKey="Hu, J" uniqKey="Hu J">J. Hu</name>
</author>
<author>
<name sortKey="Mao, X" uniqKey="Mao X">X. Mao</name>
</author>
<author>
<name sortKey="Perc, M" uniqKey="Perc M">M. Perc</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zipf, G K" uniqKey="Zipf G">G. K. Zipf</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tsonis, A A" uniqKey="Tsonis A">A. A. Tsonis</name>
</author>
<author>
<name sortKey="Schultz, C" uniqKey="Schultz C">C. Schultz</name>
</author>
<author>
<name sortKey="Tsonis, P A" uniqKey="Tsonis P">P. A. Tsonis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Serrano, M" uniqKey="Serrano M">M. Á. Serrano</name>
</author>
<author>
<name sortKey="Flammini, A" uniqKey="Flammini A">A. Flammini</name>
</author>
<author>
<name sortKey="Menczer, F" uniqKey="Menczer F">F. Menczer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ferrer I Cancho, R" uniqKey="Ferrer I Cancho R">R. Ferrer i Cancho</name>
</author>
<author>
<name sortKey="Sole, R V" uniqKey="Sole R">R. V. Solé</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ferrer I Cancho, R" uniqKey="Ferrer I Cancho R">R. Ferrer i Cancho</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ferrer I Cancho, R" uniqKey="Ferrer I Cancho R">R. Ferrer i Cancho</name>
</author>
<author>
<name sortKey="Sole, R V" uniqKey="Sole R">R. V. Solé</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Baek, S K" uniqKey="Baek S">S. K. Baek</name>
</author>
<author>
<name sortKey="Bernhardsson, S" uniqKey="Bernhardsson S">S. Bernhardsson</name>
</author>
<author>
<name sortKey="Minnhagen, P" uniqKey="Minnhagen P">P. Minnhagen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Heaps, H S" uniqKey="Heaps H">H. S. Heaps</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bernhardsson, S" uniqKey="Bernhardsson S">S. Bernhardsson</name>
</author>
<author>
<name sortKey="Correa Da Rocha, L E" uniqKey="Correa Da Rocha L">L. E. Correa da Rocha</name>
</author>
<author>
<name sortKey="Minnhagen, P" uniqKey="Minnhagen P">P. Minnhagen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bernhardsson, S" uniqKey="Bernhardsson S">S. Bernhardsson</name>
</author>
<author>
<name sortKey="Correa Da Rocha, L E" uniqKey="Correa Da Rocha L">L. E. Correa da Rocha</name>
</author>
<author>
<name sortKey="Minnhagen, P" uniqKey="Minnhagen P">P. Minnhagen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kleiber, M" uniqKey="Kleiber M">M. Kleiber</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="West, G B" uniqKey="West G">G. B. West</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Makse, H A" uniqKey="Makse H">H. A. Makse</name>
</author>
<author>
<name sortKey="Havlin, S" uniqKey="Havlin S">S. Havlin</name>
</author>
<author>
<name sortKey="Stanley, H E" uniqKey="Stanley H">H. E. Stanley</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Makse, H A Jr J S A" uniqKey="Makse H">H. A. Jr. J. S. A. Makse</name>
</author>
<author>
<name sortKey="Batty, M" uniqKey="Batty M">M. Batty</name>
</author>
<author>
<name sortKey="Havlin, S" uniqKey="Havlin S">S. Havlin</name>
</author>
<author>
<name sortKey="Stanley, H E" uniqKey="Stanley H">H. E. Stanley</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rozenfeld, H D" uniqKey="Rozenfeld H">H. D. Rozenfeld</name>
</author>
<author>
<name sortKey="Rybski, D" uniqKey="Rybski D">D. Rybski</name>
</author>
<author>
<name sortKey="Andrade, Jr J S" uniqKey="Andrade J">Jr. J. S. Andrade</name>
</author>
<author>
<name sortKey="Batty, M" uniqKey="Batty M">M. Batty</name>
</author>
<author>
<name sortKey="Stanley, H E" uniqKey="Stanley H">H. E. Stanley</name>
</author>
<author>
<name sortKey="Makse, H A" uniqKey="Makse H">H. A. Makse</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gabaix, X" uniqKey="Gabaix X">X. Gabaix</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bettencourt, L M A" uniqKey="Bettencourt L">L. M. A. Bettencourt</name>
</author>
<author>
<name sortKey="Lobo, J" uniqKey="Lobo J">J. Lobo</name>
</author>
<author>
<name sortKey="Helbing, D" uniqKey="Helbing D">D. Helbing</name>
</author>
<author>
<name sortKey="Kuhnert, C" uniqKey="Kuhnert C">C. Kuhnert</name>
</author>
<author>
<name sortKey="West, G B" uniqKey="West G">G. B. West</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Batty, M" uniqKey="Batty M">M. Batty</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rozenfeld, H D" uniqKey="Rozenfeld H">H. D. Rozenfeld</name>
</author>
<author>
<name sortKey="Rybski, D" uniqKey="Rybski D">D. Rybski</name>
</author>
<author>
<name sortKey="Gabaix, X" uniqKey="Gabaix X">X. Gabaix</name>
</author>
<author>
<name sortKey="Makse, H A" uniqKey="Makse H">H. A. Makse</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Newman, M E J" uniqKey="Newman M">M. E. J. Newman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stanley, M H R" uniqKey="Stanley M">M. H. R. Stanley</name>
</author>
<author>
<name sortKey="Buldyrev, S V" uniqKey="Buldyrev S">S. V. Buldyrev</name>
</author>
<author>
<name sortKey="Havlin, S" uniqKey="Havlin S">S. Havlin</name>
</author>
<author>
<name sortKey="Mantegna, R" uniqKey="Mantegna R">R. Mantegna</name>
</author>
<author>
<name sortKey="Salinger, M" uniqKey="Salinger M">M. Salinger</name>
</author>
<author>
<name sortKey="Stanley, H E" uniqKey="Stanley H">H. E. Stanley</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mantegna, R N" uniqKey="Mantegna R">R. N. Mantegna</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Clauset, A" uniqKey="Clauset A">A. Clauset</name>
</author>
<author>
<name sortKey="Shalizi, C R" uniqKey="Shalizi C">C. R. Shalizi</name>
</author>
<author>
<name sortKey="Newman, M E J" uniqKey="Newman M">M. E. J. Newman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mandelbrot, B" uniqKey="Mandelbrot B">B. Mandelbrot</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Karlin, S" uniqKey="Karlin S">S. Karlin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gnedin, A" uniqKey="Gnedin A">A. Gnedin</name>
</author>
<author>
<name sortKey="Hansen, B" uniqKey="Hansen B">B. Hansen</name>
</author>
<author>
<name sortKey="Pitman, J" uniqKey="Pitman J">J. Pitman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Van Leijenhorst, D C" uniqKey="Van Leijenhorst D">D. C. van Leijenhorst</name>
</author>
<author>
<name sortKey="Van Der Weide, Th P" uniqKey="Van Der Weide T">Th. P. van der Weide</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lu, L" uniqKey="Lu L">L. Lü</name>
</author>
<author>
<name sortKey="Zhang, Z K" uniqKey="Zhang Z">Z.-K. Zhang</name>
</author>
<author>
<name sortKey="Zhou, T" uniqKey="Zhou T">T. Zhou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Steyvers, M" uniqKey="Steyvers M">M. Steyvers</name>
</author>
<author>
<name sortKey="Tenenbaum, J B" uniqKey="Tenenbaum J">J. B. Tenenbaum</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Markosova, M" uniqKey="Markosova M">M. Markosova</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altmann, E G" uniqKey="Altmann E">E. G. Altmann</name>
</author>
<author>
<name sortKey="Pierrehumbert, J B" uniqKey="Pierrehumbert J">J. B. Pierrehumbert</name>
</author>
<author>
<name sortKey="Motter, A E" uniqKey="Motter A">A. E. Motter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Riccaboni, M" uniqKey="Riccaboni M">M. Riccaboni</name>
</author>
<author>
<name sortKey="Pammolli, F" uniqKey="Pammolli F">F. Pammolli</name>
</author>
<author>
<name sortKey="Buldyrev, S V" uniqKey="Buldyrev S">S. V. Buldyrev</name>
</author>
<author>
<name sortKey="Ponta, L" uniqKey="Ponta L">L. Ponta</name>
</author>
<author>
<name sortKey="Stanley, H E" uniqKey="Stanley H">H. E. Stanley</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Oehlert, G W" uniqKey="Oehlert G">G. W. Oehlert</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Amaral, L A N" uniqKey="Amaral L">L. A. N. Amaral</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Amaral, L A N" uniqKey="Amaral L">L. A. N. Amaral</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fu, D" uniqKey="Fu D">D. Fu</name>
</author>
<author>
<name sortKey="Pammolli, F" uniqKey="Pammolli F">F. Pammolli</name>
</author>
<author>
<name sortKey="Buldyrev, S V" uniqKey="Buldyrev S">S. V. Buldyrev</name>
</author>
<author>
<name sortKey="Riccaboni, M" uniqKey="Riccaboni M">M. Riccaboni</name>
</author>
<author>
<name sortKey="Matia, K" uniqKey="Matia K">K. Matia</name>
</author>
<author>
<name sortKey="Yamasaki, K" uniqKey="Yamasaki K">K. Yamasaki</name>
</author>
<author>
<name sortKey="Stanley, H E" uniqKey="Stanley H">H. E. Stanley</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Podobnik, B" uniqKey="Podobnik B">B. Podobnik</name>
</author>
<author>
<name sortKey="Horvatic, D" uniqKey="Horvatic D">D. Horvatic</name>
</author>
<author>
<name sortKey="Petersen, A M" uniqKey="Petersen A">A. M. Petersen</name>
</author>
<author>
<name sortKey="Stanley, H E" uniqKey="Stanley H">H. E. Stanley</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Podobnik, B" uniqKey="Podobnik B">B. Podobnik</name>
</author>
<author>
<name sortKey="Horvatic, D" uniqKey="Horvatic D">D. Horvatic</name>
</author>
<author>
<name sortKey="Petersen, A M" uniqKey="Petersen A">A. M. Petersen</name>
</author>
<author>
<name sortKey="Njavro, M" uniqKey="Njavro M">M. Njavro</name>
</author>
<author>
<name sortKey="Stanley, H E" uniqKey="Stanley H">H. E. Stanley</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mufwene, S" uniqKey="Mufwene S">S. Mufwene</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mufwene, S" uniqKey="Mufwene S">S. Mufwene</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Perc, M" uniqKey="Perc M">M. Perc</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sigman, M" uniqKey="Sigman M">M. Sigman</name>
</author>
<author>
<name sortKey="Cecchi, G A" uniqKey="Cecchi G">G. A. Cecchi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Alvarez Lacalle, E" uniqKey="Alvarez Lacalle E">E. Alvarez-Lacalle</name>
</author>
<author>
<name sortKey="Dorow, B" uniqKey="Dorow B">B. Dorow</name>
</author>
<author>
<name sortKey="Eckmann, J P" uniqKey="Eckmann J">J.-P. Eckmann</name>
</author>
<author>
<name sortKey="Moses, E" uniqKey="Moses E">E. Moses</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altmann, E A" uniqKey="Altmann E">E. A. Altmann</name>
</author>
<author>
<name sortKey="Cristadoro, G" uniqKey="Cristadoro G">G. Cristadoro</name>
</author>
<author>
<name sortKey="Esposti, M D" uniqKey="Esposti M">M. D. Esposti</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Montemurro, M A" uniqKey="Montemurro M">M. A. Montemurro</name>
</author>
<author>
<name sortKey="Pury, P A" uniqKey="Pury P">P. A. Pury</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Corral, A" uniqKey="Corral A">A. Corral</name>
</author>
<author>
<name sortKey="Ferrer I Cancho, R" uniqKey="Ferrer I Cancho R">R. Ferrer i Cancho</name>
</author>
<author>
<name sortKey="Diaz Guilera, A" uniqKey="Diaz Guilera A">A. Díaz-Guilera</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altmann, E G" uniqKey="Altmann E">E. G. Altmann</name>
</author>
<author>
<name sortKey="Pierrehumbert, J B" uniqKey="Pierrehumbert J">J. B. Pierrehumbert</name>
</author>
<author>
<name sortKey="Motter, A E" uniqKey="Motter A">A. E. Motter</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Sci Rep</journal-id>
<journal-id journal-id-type="iso-abbrev">Sci Rep</journal-id>
<journal-title-group>
<journal-title>Scientific Reports</journal-title>
</journal-title-group>
<issn pub-type="epub">2045-2322</issn>
<publisher>
<publisher-name>Nature Publishing Group</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">23230508</article-id>
<article-id pub-id-type="pmc">3517984</article-id>
<article-id pub-id-type="pii">srep00943</article-id>
<article-id pub-id-type="doi">10.1038/srep00943</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Languages cool as they expand: Allometric scaling and the decreasing need for new words</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Petersen</surname>
<given-names>Alexander M.</given-names>
</name>
<xref ref-type="corresp" rid="c1">a</xref>
<xref ref-type="aff" rid="a1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Tenenbaum</surname>
<given-names>Joel N.</given-names>
</name>
<xref ref-type="aff" rid="a2">2</xref>
<xref ref-type="aff" rid="a3">3</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Havlin</surname>
<given-names>Shlomo</given-names>
</name>
<xref ref-type="aff" rid="a4">4</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Stanley</surname>
<given-names>H. Eugene</given-names>
</name>
<xref ref-type="aff" rid="a2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Perc</surname>
<given-names>Matjaž</given-names>
</name>
<xref ref-type="corresp" rid="c2">b</xref>
<xref ref-type="aff" rid="a5">5</xref>
</contrib>
<aff id="a1">
<label>1</label>
<institution>Laboratory for the Analysis of Complex Economic Systems, IMT Lucca Institute for Advanced Studies</institution>
, Lucca 55100,
<country>Italy</country>
</aff>
<aff id="a2">
<label>2</label>
<institution>Center for Polymer Studies and Department of Physics, Boston University</institution>
, Boston, Massachusetts 02215,
<country>USA</country>
</aff>
<aff id="a3">
<label>3</label>
<institution>Operations and Technology Management, School of Management, Boston University</institution>
, Boston, Massachusetts 02215,
<country>USA</country>
</aff>
<aff id="a4">
<label>4</label>
<institution>Minerva Center and Department of Physics, Bar-Ilan University</institution>
, Ramat-Gan 52900, Israel</aff>
<aff id="a5">
<label>5</label>
<institution>Department of Physics, Faculty of Natural Sciences and Mathematics, University of Maribor</institution>
, Koroška cesta 160, SI-2000 Maribor, Slovenia</aff>
</contrib-group>
<author-notes>
<corresp id="c1">
<label>a</label>
<email>alexander.petersen@imtlucca.it</email>
</corresp>
<corresp id="c2">
<label>b</label>
<email>matjaz.perc@uni-mb.si</email>
</corresp>
</author-notes>
<pub-date pub-type="epub">
<day>10</day>
<month>12</month>
<year>2012</year>
</pub-date>
<pub-date pub-type="collection">
<year>2012</year>
</pub-date>
<volume>2</volume>
<elocation-id>943</elocation-id>
<history>
<date date-type="received">
<day>16</day>
<month>10</month>
<year>2012</year>
</date>
<date date-type="accepted">
<day>24</day>
<month>10</month>
<year>2012</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright © 2012, Macmillan Publishers Limited. All rights reserved</copyright-statement>
<copyright-year>2012</copyright-year>
<copyright-holder>Macmillan Publishers Limited. All rights reserved</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by-nc-sa/3.0/">
<pmc-comment>author-paid</pmc-comment>
<license-p>This work is licensed under a Creative Commons Attribution-NonCommercial-ShareALike 3.0 Unported License. To view a copy of this license, visit
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc-sa/3.0/">http://creativecommons.org/licenses/by-nc-sa/3.0/</ext-link>
</license-p>
</license>
</permissions>
<abstract>
<p>We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing marginal need for new words, a feature that is likely related to the underlying correlations between words. We calculate the annual growth fluctuations of word use, which show a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This “cooling pattern” forms the basis of a third statistical regularity which, unlike the Zipf and Heaps laws, is dynamical in nature.</p>
</abstract>
</article-meta>
</front>
<body>
<p>Books in libraries and attics around the world constitute an immense “crowd-sourced” historical record that traces the evolution of culture back beyond the limits of oral history. However, the disaggregation of written language into individual books makes the longitudinal analysis of language a difficult open problem. To this end, the book digitization project at
<italic>Google</italic>
Inc. presents a monumental step forward providing an enormous, publicly accessible, collection of written language in the form of the
<italic>Google Books Ngram Viewer</italic>
web application
<xref ref-type="bibr" rid="b1">1</xref>
. Approximately 4% of all books ever published have been scanned, making available over 10
<sup>7</sup>
occurrence time series (word-use trajectories) that archive cultural dynamics in seven different languages over a period of more than two centuries. This dataset highlights the utility of open “Big Data,” which is the gateway to “metaknowledge”
<xref ref-type="bibr" rid="b2">2</xref>
, the knowledge about knowledge. A digital data deluge is sustaining extensive interdisciplinary research efforts towards quantitative insights into the social and natural sciences
<xref ref-type="bibr" rid="b3">3</xref>
<xref ref-type="bibr" rid="b4">4</xref>
<xref ref-type="bibr" rid="b5">5</xref>
<xref ref-type="bibr" rid="b6">6</xref>
<xref ref-type="bibr" rid="b7">7</xref>
.</p>
<p>“Culturomics,” the use of high-throughput data for the purpose of studying human culture, is a promising new empirical platform for gaining insight into subjects ranging from political history to epidemiology
<xref ref-type="bibr" rid="b8">8</xref>
. As first demonstrated by Michel et al.
<xref ref-type="bibr" rid="b8">8</xref>
, the
<italic>Google</italic>
n-gram dataset is well-suited for examining the microscopic properties of an entire language ecosystem. Using this dataset to analyze the growth patterns of individual word frequencies, Petersen et al.
<xref ref-type="bibr" rid="b9">9</xref>
recently identified tipping points in the life trajectory of new words, statistical patterns that govern the fluctuations in word use, and quantitative measures for cultural memory. The statistical properties of cultural memory, derived from the quantitative analysis of individual word-use trajectories, were also investigated by Gao et al.
<xref ref-type="bibr" rid="b10">10</xref>
, who found that words describing social phenomena tend to have different long-range correlations than words describing natural phenomena.</p>
<p>Here we study the growth and evolution of written language by analyzing the macroscopic scaling patterns that characterize word-use. Using the
<italic>Google</italic>
1-gram data collected at the 1-year time resolution over the period 1800–2008, we quantify the annual fluctuation scale of words within a given corpus and show that languages can be said to “cool by expansion.” This effect constitutes a dynamic law, in contrast to the static laws of Zipf and Heaps, which are founded upon snapshots of single texts. The Zipf law
<xref ref-type="bibr" rid="b11">11</xref>
<xref ref-type="bibr" rid="b12">12</xref>
<xref ref-type="bibr" rid="b13">13</xref>
<xref ref-type="bibr" rid="b14">14</xref>
<xref ref-type="bibr" rid="b15">15</xref>
<xref ref-type="bibr" rid="b16">16</xref>
<xref ref-type="bibr" rid="b17">17</xref>
, quantifying the distribution of word frequencies, and the Heaps law
<xref ref-type="bibr" rid="b13">13</xref>
<xref ref-type="bibr" rid="b18">18</xref>
<xref ref-type="bibr" rid="b19">19</xref>
<xref ref-type="bibr" rid="b20">20</xref>
, relating the size of a corpus to the vocabulary size of that corpus, are classic paradigms that capture many complexities of language in remarkably simple statistical patterns. While these laws have been exhaustively tested on relatively small snapshots of empirical data, here we test the validity of these laws using extremely large corpora.</p>
<p>Interestingly, we observe two scaling regimes in the probability density functions of word usage, with the Zipf law holding only for the set of more frequently used words, referred to as the “kernel lexicon” by Ferrer i Cancho et al.
<xref ref-type="bibr" rid="b14">14</xref>
. The word frequency distribution for the rarely used words constituting the “unlimited lexicon”
<xref ref-type="bibr" rid="b14">14</xref>
obeys a distinct scaling law, suggesting that rare words belong to a distinct class. This “unlimited lexicon” is populated by highly technical words, new words, numbers, spelling variants of kernel words, and optical character recognition (OCR) errors.</p>
<p>Many new words start in relative obscurity, and their eventual importance can be underestimated from their initial frequency. This fact is closely related to the information cost of introducing new words and concepts. For single topical texts, Heaps observed that the vocabulary size exhibits sub-linear growth with document size
<xref ref-type="bibr" rid="b18">18</xref>
. Extending this concept to entire corpora, we find a scaling relation that indicates a decreasing “marginal need” for new words which are the manifestation of cultural evolution and the seeds for language growth. We introduce a pruning method to study the role of infrequent words on the allometric scaling properties of language. By studying progressively smaller sets of the kernel lexicon we can better understand the marginal utility of the core words. The pattern that arises for all languages analyzed provides insight into the intrinsic dependency structure between words.</p>
<p>The correlations in word use can also be author and topic dependent. Bernhardsson et al. recently introduced the “metabook” concept
<xref ref-type="bibr" rid="b19">19</xref>
<xref ref-type="bibr" rid="b20">20</xref>
, according to which word-frequency structures are author-specific: the word-frequency characteristics of a random excerpt from a compilation of everything that a specific author could ever conceivably write (his/her “metabook”) should accurately match those of the author's actual writings. It is not immediately obvious whether a compilation of all the metabooks of all authors would still conform to the Zipf law and the Heaps law. The immense size and time span of the
<italic>Google</italic>
n-gram dataset allows us to examine this question in detail.</p>
<sec disp-level="1" sec-type="results">
<title>Results</title>
<sec disp-level="2">
<title>Longitudinal analysis of written language</title>
<p>Allometric scaling analysis
<xref ref-type="bibr" rid="b21">21</xref>
is used to quantify the role of system size on general phenomena characterizing a system, and has been applied to systems as diverse as the metabolic rate of mitochondria
<xref ref-type="bibr" rid="b22">22</xref>
and city growth
<xref ref-type="bibr" rid="b23">23</xref>
<xref ref-type="bibr" rid="b24">24</xref>
<xref ref-type="bibr" rid="b25">25</xref>
<xref ref-type="bibr" rid="b26">26</xref>
<xref ref-type="bibr" rid="b27">27</xref>
<xref ref-type="bibr" rid="b28">28</xref>
<xref ref-type="bibr" rid="b29">29</xref>
. Indeed, city growth shares two common features with the growth of written text: (i) the Zipf law is able to describe the distribution of city sizes regardless of country or the time period of the data
<xref ref-type="bibr" rid="b26">26</xref>
, and (ii) city growth has inherent constraints due to geography, changing labor markets and their effects on opportunities for innovation and wealth creation
<xref ref-type="bibr" rid="b27">27</xref>
<xref ref-type="bibr" rid="b28">28</xref>
, just as vocabulary growth is constrained by human brain capacity and the varying utilities of new words across users
<xref ref-type="bibr" rid="b14">14</xref>
.</p>
<p>We construct a word counting framework by first defining the quantity
<italic>u
<sub>i</sub>
</italic>
(
<italic>t</italic>
) as the number of times word
<italic>i</italic>
is used in year
<italic>t</italic>
. Since the number of books and the number of distinct words grow dramatically over time, we define the
<italic>relative</italic>
word use,
<italic>f
<sub>i</sub>
</italic>
(
<italic>t</italic>
), as the fraction of the total body of text occupied by word
<italic>i</italic>
in the same year
<disp-formula id="m1">
<inline-graphic id="d32e268" xlink:href="srep00943-m1.jpg"></inline-graphic>
</disp-formula>
where the quantity
<inline-formula id="m9">
<inline-graphic id="d32e271" xlink:href="srep00943-m9.jpg"></inline-graphic>
</inline-formula>
is the total number of indistinct word uses while
<italic>N
<sub>w</sub>
</italic>
(
<italic>t</italic>
) is the total number of distinct words digitized from books printed in year
<italic>t</italic>
. Both the
<italic>N
<sub>w</sub>
</italic>
(“types” giving the vocabulary size) and the
<italic>N
<sub>u</sub>
</italic>
(“tokens” giving the size of the body of text) are generally increasing over time.</p>
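As a minimal illustration of this bookkeeping, the following Python sketch computes the counts *u<sub>i</sub>*(*t*), the token total *N<sub>u</sub>*(*t*), the vocabulary size *N<sub>w</sub>*(*t*), and the relative word use *f<sub>i</sub>*(*t*) of Eq. (1) for a toy token list; the corpus is invented for demonstration and is not drawn from the n-gram data.

```python
from collections import Counter

def word_statistics(tokens):
    """Return (u, N_u, N_w, f): per-word counts, token count,
    vocabulary size, and relative word use f_i = u_i / N_u."""
    u = Counter(tokens)                      # u_i(t): uses of word i in year t
    N_u = sum(u.values())                    # total word uses ("tokens")
    N_w = len(u)                             # distinct words ("types")
    f = {w: n / N_u for w, n in u.items()}   # relative word use, Eq. (1)
    return u, N_u, N_w, f

# a toy "year" of text
tokens = "the cat sat on the mat and the dog sat too".split()
u, N_u, N_w, f = word_statistics(tokens)
print(N_u, N_w, f["the"])   # 11 tokens, 8 types, f("the") = 3/11
```

By construction the relative uses sum to one each year, so *f<sub>i</sub>*(*t*) is directly comparable across years even as the corpus grows.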
</sec>
<sec disp-level="2">
<title>The Zipf law and the two scaling regimes</title>
<p>Zipf investigated a number of bodies of literature and observed that the frequency of any given word is roughly inversely proportional to its rank
<xref ref-type="bibr" rid="b11">11</xref>
, with the frequency of the
<italic>z</italic>
-ranked word given by the relation
<disp-formula id="m2">
<inline-graphic id="d32e306" xlink:href="srep00943-m2.jpg"></inline-graphic>
</disp-formula>
with a scaling exponent
<italic>ζ</italic>
≈ 1. This empirical law has been confirmed for a broad range of data, ranging from income rankings, city populations, and the varying sizes of avalanches, forest fires
<xref ref-type="bibr" rid="b30">30</xref>
and firm size
<xref ref-type="bibr" rid="b31">31</xref>
to the linguistic features of noncoding DNA
<xref ref-type="bibr" rid="b32">32</xref>
. The Zipf law can be derived through the “principle of least effort,” which minimizes the communication noise between speakers (writers) and listeners (readers)
<xref ref-type="bibr" rid="b16">16</xref>
. The Zipf law has been found to hold for a large dataset of English text
<xref ref-type="bibr" rid="b14">14</xref>
, but there are interesting deviations observed in the lexicon of individuals diagnosed with schizophrenia
<xref ref-type="bibr" rid="b15">15</xref>
. Here, we also find statistical regularity in the distribution of relative word use for 11 different datasets, each comprising more than half a million distinct words taken from millions of books
<xref ref-type="bibr" rid="b8">8</xref>
.</p>
<p>
<xref ref-type="fig" rid="f1">Figure 1</xref>
shows the probability density functions
<italic>P</italic>
(
<italic>f</italic>
) resulting from data aggregated over all the years (A,B) as well as over 1-year periods as demonstrated for the year
<italic>t</italic>
= 2000 (C,D). Regardless of the language and the considered time span, the probability density functions are characterized by a striking two-regime scaling, which was first noted by Ferrer i Cancho and Solé
<xref ref-type="bibr" rid="b14">14</xref>
, and can be quantified as
<disp-formula id="m3">
<inline-graphic id="d32e342" xlink:href="srep00943-m3.jpg"></inline-graphic>
</disp-formula>
These two regimes, designated “kernel lexicon” and “unlimited lexicon,” are thought to reflect the cognitive constraints of the brain's finite vocabulary
<xref ref-type="bibr" rid="b14">14</xref>
. The specialized words found in the unlimited lexicon are not universally shared and are used significantly less frequently than the words in the kernel lexicon. This is reflected in the kink in the probability density functions and gives rise to the anomalous two-scaling distribution shown in
<xref ref-type="fig" rid="f1">Fig. 1</xref>
.</p>
<p>The exponent
<italic>α</italic>
<sub>+</sub>
and the corresponding rank-frequency scaling exponent
<italic>ζ</italic>
in Eq. (2) are related asymptotically by
<xref ref-type="bibr" rid="b14">14</xref>
<disp-formula id="m4">
<inline-graphic id="d32e363" xlink:href="srep00943-m4.jpg"></inline-graphic>
</disp-formula>
with no analogous relationship for the unlimited lexicon values
<italic>α</italic>
<sub>−</sub>
and
<italic>ζ</italic>
<sub>−</sub>
.
<xref ref-type="table" rid="t1">Table I</xref>
lists the average
<italic>α</italic>
<sub>+</sub>
and
<italic>α</italic>
<sub>−</sub>
values calculated by aggregating
<italic>α</italic>
<sub>±</sub>
values for each year using a maximum likelihood estimator for the power-law distribution
<xref ref-type="bibr" rid="b33">33</xref>
. We characterize the two scaling regimes using a crossover region around
<italic>f</italic>
<sub>×</sub>
≈ 10
<sup>−5</sup>
to distinguish between
<italic>α</italic>
<sub>−</sub>
and
<italic>α</italic>
<sub>+</sub>
: (i) 10
<sup>−8</sup> ≤
<italic>f</italic>
≤ 10
<sup>−6</sup>
corresponds to
<italic>α</italic>
<sub>−</sub>
and (ii) 10
<sup>−4</sup> ≤
<italic>f</italic>
≤ 10
<sup>−1</sup>
corresponds to
<italic>α</italic>
<sub>+</sub>
. For the words that satisfy
<italic>f</italic> ≥
<italic>f</italic>
<sub>×</sub>
that comprise the kernel lexicon, we verify the Zipf scaling law
<italic>ζ</italic>
≈ 1 (corresponding to
<italic>α</italic>
≈ 2) for all corpora analyzed. For the unlimited lexicon regime
<italic>f</italic> <
<italic>f</italic>
<sub>×</sub>
, however, the Zipf law is not obeyed, as we find
<italic>α</italic>
<sub>−</sub>
≈ 1.7. Note that
<italic>α</italic>
<sub>−</sub>
is significantly smaller in the Hebrew, Chinese, and Russian corpora, which suggests that a more generalized version of the Zipf law
<xref ref-type="bibr" rid="b14">14</xref>
may be needed, one which is slightly language-dependent, especially when taking into account the usage of specialized words from the unlimited lexicon.</p>
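The exponents α<sub>±</sub> above are obtained with a maximum likelihood estimator for the power-law distribution. A minimal Python sketch of the standard continuous estimator, α = 1 + *n*/Σ ln(*f<sub>i</sub>*/*f*<sub>min</sub>), applied to synthetic draws with a known exponent, illustrates the procedure; the sample size and exponent below are illustrative choices, not n-gram frequencies.

```python
import math, random

def mle_powerlaw_alpha(samples, f_min):
    """Continuous MLE for P(f) ~ f**(-alpha) over samples >= f_min:
    alpha = 1 + n / sum(ln(f_i / f_min))."""
    xs = [f for f in samples if f >= f_min]
    return 1.0 + len(xs) / sum(math.log(f / f_min) for f in xs)

# synthetic check: inverse-CDF draws from a pure power law with alpha = 2,
# the value expected in the Zipf (kernel lexicon) regime
random.seed(1)
alpha_true, f_min = 2.0, 1e-6
draws = [f_min * (1 - random.random()) ** (-1.0 / (alpha_true - 1.0))
         for _ in range(200000)]
alpha_hat = mle_powerlaw_alpha(draws, f_min)
print(round(alpha_hat, 2))   # close to 2.0
```

In practice the estimator is run separately inside the two frequency windows so that the crossover region around *f*<sub>×</sub> does not contaminate either fit.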
</sec>
<sec disp-level="2">
<title>The Heaps law and the increasing marginal returns of new words</title>
<p>Heaps observed that vocabulary size, i.e. the number of distinct words, exhibits a sub-linear growth with document size
<xref ref-type="bibr" rid="b18">18</xref>
. This observation has important implications for the “return on investment” of a new word as it is established and becomes disseminated throughout the literature of a given language. As a proxy for this return, Heaps studied how often new words are invoked in lieu of preexisting competitors and examined the linguistic value of new words and ideas by analyzing the relation between the total number of words printed in a body of text
<italic>N
<sub>u</sub>
</italic>
, and the number of these which are distinct
<italic>N
<sub>w</sub>
</italic>
, i.e. the vocabulary size
<xref ref-type="bibr" rid="b18">18</xref>
. The marginal returns of new words, ∂
<italic>N
<sub>u</sub>
</italic>
/∂
<italic>N
<sub>w</sub>
</italic>
, quantifies the impact of adding a single word to the vocabulary of a corpus on the aggregate output (corpus size).</p>
<p>For individual books, the empirically-observed scaling relation between
<italic>N
<sub>u</sub>
</italic>
and
<italic>N
<sub>w</sub>
</italic>
obeys
<disp-formula id="m5">
<inline-graphic id="d32e524" xlink:href="srep00943-m5.jpg"></inline-graphic>
</disp-formula>
with
<italic>b</italic>
< 1; Eq. (5) is referred to as “the Heaps law”. It has subsequently been found that the Heaps law emerges naturally in systems that can be described as sampling from an underlying Zipf distribution. In an information-theoretic formulation of the abstract concept of word cost, B. Mandelbrot predicted the relation
<italic>b</italic>
= 1/
<italic>ζ</italic>
in 1961
<xref ref-type="bibr" rid="b34">34</xref>
, where
<italic>ζ</italic>
is the scaling exponent corresponding to
<italic>α</italic>
<sub>+</sub>
, as in Eqs. (3) and (4). This prediction is limited to relatively small texts where the unlimited lexicon, which manifests in the
<italic>α</italic>
<sub>−</sub>
regime, does not play a significant role. A mathematical extension of this result for general underlying rank-distributions is also provided by Karlin
<xref ref-type="bibr" rid="b35">35</xref>
using an infinite urn scheme, and extended to broader classes of heavy-tailed distributions recently by Gnedin et al
<xref ref-type="bibr" rid="b36">36</xref>
. Recent research efforts using stochastic master equation techniques to model the growth of a book have also predicted this intrinsic relation between Zipf's law and Heaps' law
<xref ref-type="bibr" rid="b13">13</xref>
<xref ref-type="bibr" rid="b37">37</xref>
<xref ref-type="bibr" rid="b38">38</xref>
.</p>
<p>
<xref ref-type="fig" rid="f2">Figure 2</xref>
confirms a sub-linear scaling (
<italic>b</italic>
< 1) between
<italic>N
<sub>u</sub>
</italic>
and
<italic>N
<sub>w</sub>
</italic>
for each corpus analyzed. These results show how the marginal returns of new words are given by
<disp-formula id="m6">
<inline-graphic id="d32e576" xlink:href="srep00943-m6.jpg"></inline-graphic>
</disp-formula>
which is an increasing function of
<italic>N
<sub>w</sub>
</italic>
for
<italic>b</italic>
< 1. Thus, the relative increase in the induced volume of written language is larger for new words than for old words. This is likely due to the fact that new words are typically technical in nature, requiring additional explanations that put the word into context with pre-existing words. Specifically, a new word requires the additional use of preexisting words as a result of both (i) the explanation of the content of the new word using existing technical terms, and (ii) the grammatical infrastructure necessary for that explanation. Hence, there are large spillovers in the size of the written corpus that follow from the intricate dependency structure of language stemming from the various grammatical roles
<xref ref-type="bibr" rid="b39">39</xref>
<xref ref-type="bibr" rid="b40">40</xref>
.</p>
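Eq. (6) follows from inverting the Heaps relation of Eq. (5): if *N<sub>w</sub>* ~ *N<sub>u</sub><sup>b</sup>*, then *N<sub>u</sub>* ~ *N<sub>w</sub>*<sup>1/*b*</sup> and ∂*N<sub>u</sub>*/∂*N<sub>w</sub>* ∝ *N<sub>w</sub>*<sup>1/*b*−1</sup>. A short numerical sketch, assuming an exact power law with the proportionality constant set to unity, shows that this marginal return grows with vocabulary size whenever *b* < 1.

```python
def marginal_return(N_w, b):
    """dN_u/dN_w implied by Heaps' law N_w ~ N_u**b, i.e. N_u ~ N_w**(1/b):
    dN_u/dN_w = (1/b) * N_w**(1/b - 1), increasing in N_w when b < 1."""
    return (1.0 / b) * N_w ** (1.0 / b - 1.0)

b = 0.5   # the approximately Heaps value observed for unpruned corpora
returns = [marginal_return(N_w, b) for N_w in (10**3, 10**4, 10**5)]
print(returns)   # grows with vocabulary size
```

For *b* = 1 the marginal return is constant, recovering the case of a language with no economies of scale.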
<p>In order to investigate the role of rare and new words, we calculate
<italic>N
<sub>u</sub>
</italic>
and
<italic>N
<sub>w</sub>
</italic>
using only words that have appeared at least
<italic>U
<sub>c</sub>
</italic>
times. We select the absolute number of uses as a word-use threshold because a word in a given year cannot appear with a frequency less than 1/
<italic>N
<sub>u</sub>
</italic>
, hence any criterion using relative frequency would necessarily introduce a bias in small corpus samples. This choice also eliminates words that arise spuriously from optical character recognition (OCR) errors in the digitization process, as well as from intrinsic spelling errors and orthographic variants.</p>
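The pruning step can be sketched as follows; the word counts below are hypothetical and serve only to show how an absolute threshold *U<sub>c</sub>* removes singletons such as OCR artifacts while leaving the kernel words intact.

```python
from collections import Counter

def pruned_sizes(counts, U_c):
    """Recompute (N_u, N_w) keeping only words used at least U_c times.
    An absolute-count threshold avoids the small-corpus bias of a
    relative-frequency cutoff (no word can have f < 1/N_u)."""
    kept = {w: n for w, n in counts.items() if n >= U_c}
    return sum(kept.values()), len(kept)

# hypothetical yearly counts, including two singleton artifacts
counts = Counter({"the": 50, "of": 30, "ocr": 1, "teh": 1, "allometric": 2})
print(pruned_sizes(counts, 1))   # (84, 5): nothing pruned
print(pruned_sizes(counts, 2))   # (82, 3): singletons (typos, OCR errors) removed
```

Sweeping *U<sub>c</sub>* = 2<sup>*n*</sup> and refitting Eq. (5) at each level yields the *b*(*U<sub>c</sub>*) curves discussed below.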
<p>
<xref ref-type="fig" rid="f3">Figures 3</xref>
and
<xref ref-type="fig" rid="f4">4</xref>
show the relational dependence of
<italic>N
<sub>u</sub>
</italic>
and
<italic>N
<sub>w</sub>
</italic>
on the exclusion of low-frequency words using a variable cutoff
<italic>U
<sub>c</sub>
</italic>
= 2
<italic>
<sup>n</sup>
</italic>
with
<italic>n</italic>
= 0 … 11. As
<italic>U
<sub>c</sub>
</italic>
increases the Heaps scaling exponent increases from
<italic>b</italic>
≈ 0.5, approaching
<italic>b</italic>
≈ 1, indicating that core words are structurally integrated into language as a proportional background. Interestingly, Altmann et al.
<xref ref-type="bibr" rid="b41">41</xref>
recently showed that “word niche” can be an essential factor in modeling word use dynamics. New niche words, though they are marginal increases to a language's lexicon, are themselves anything but “marginal”: they are core words within a subset of the language. This is particularly the case in online communities in which individuals strive to distinguish themselves on short timescales by developing stylistic jargon, highlighting how language patterns can be context dependent.</p>
<p>We now return to the relation between Heaps' law and Zipf's law.
<xref ref-type="table" rid="t1">Table I</xref>
summarizes the
<italic>b</italic>
values calculated by means of ordinary least squares regression using
<italic>U
<sub>c</sub>
</italic>
= 0 to relate
<italic>N
<sub>u</sub>
</italic>
(
<italic>t</italic>
) to
<italic>N
<sub>w</sub>
</italic>
(
<italic>t</italic>
). For
<italic>U
<sub>c</sub>
</italic>
= 1 we find that
<italic>b</italic>
≈ 0.5 for all languages analyzed, as expected from the Heaps law, but for
<italic>U
<sub>c</sub>
</italic>
≳ 8 the
<italic>b</italic>
value significantly deviates from 0.5, and for
<italic>U
<sub>c</sub>
</italic>
≳ 1000 the
<italic>b</italic>
value begins to saturate, approaching unity. Considering that
<italic>α</italic>
<sub>+</sub>
≈ 2 implies
<italic>ζ</italic>
≈ 1 for all corpora,
<xref ref-type="fig" rid="f3">Figures 3</xref>
and
<xref ref-type="fig" rid="f4">4</xref>
show that we can confirm the relation
<italic>b</italic>
(
<italic>U
<sub>c</sub>
</italic>
) ≈ 1/
<italic>ζ</italic>
only for the more pruned corpora that require relatively large
<italic>U
<sub>c</sub>
</italic>
. This hidden feature of the scaling relation highlights the underlying structure of language, which forms a dependency network between the common words of the kernel lexicon and their more esoteric counterparts in the unlimited lexicon. Moreover, the function ∂
<italic>N
<sub>w</sub>
</italic>
/∂
<italic>N
<sub>u</sub>
</italic>
~ (
<italic>N
<sub>u</sub>
</italic>
)
<italic>
<sup>b</sup>
</italic>
<sup>−1</sup>
is a monotonically decreasing function for
<italic>b</italic>
< 1, demonstrating the
<italic>decreasing marginal need</italic>
for additional words as a corpus grows. In other words, since we get more and more “mileage” out of new words in an already large language, additional words are needed less and less.</p>
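The emergence of Heaps scaling from an underlying Zipf distribution can be checked numerically: sampling tokens from a rank distribution *p*(*z*) ∝ *z*<sup>−ζ</sup> and fitting the growth of the vocabulary recovers *b* ≈ 1/ζ for ζ > 1, consistent with the Mandelbrot and Karlin results cited above. The vocabulary size, sample size, and ζ = 2 below are illustrative choices, not parameters estimated from the corpora.

```python
import math, random

def heaps_exponent_from_zipf(zeta, V=50000, N=200000, seed=7):
    """Sample N tokens from a Zipf rank distribution p(z) ~ z**(-zeta)
    over V word types and fit the Heaps exponent b in N_w ~ N_u**b
    by least squares on log-log checkpoints."""
    rng = random.Random(seed)
    weights = [z ** (-zeta) for z in range(1, V + 1)]
    draws = rng.choices(range(V), weights=weights, k=N)
    seen, pts = set(), []
    for n, z in enumerate(draws, 1):
        seen.add(z)
        if n % (N // 20) == 0:                       # log-log checkpoints
            pts.append((math.log(n), math.log(len(seen))))
    mx = sum(x for x, _ in pts) / len(pts)           # OLS slope
    my = sum(y for _, y in pts) / len(pts)
    return (sum((x - mx) * (y - my) for x, y in pts)
            / sum((x - mx) ** 2 for x, _ in pts))

b = heaps_exponent_from_zipf(zeta=2.0)
print(round(b, 2))   # roughly 1/zeta = 0.5
```

This sampling picture breaks down once the unlimited lexicon contributes, which is why the relation *b*(*U<sub>c</sub>*) ≈ 1/ζ holds only for the pruned corpora.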
</sec>
<sec disp-level="2">
<title>Corpora size and word-use fluctuations</title>
<p>Lastly, it is instructive to examine how vocabulary size
<italic>N
<sub>w</sub>
</italic>
and the overall size of the corpora
<italic>N
<sub>u</sub>
</italic>
affect fluctuations in word use.
<xref ref-type="fig" rid="f5">Figure 5</xref>
shows how
<italic>N
<sub>w</sub>
</italic>
(
<italic>t</italic>
) and
<italic>N
<sub>u</sub>
</italic>
(
<italic>t</italic>
) have varied over the past two centuries. Note that, apart from the periods during the two World Wars, the number of words printed, which we will refer to as the “literary productivity”, has been increasing over time. The number of distinct words (vocabulary size) has also increased, reflecting basic social and technological advancement
<xref ref-type="bibr" rid="b8">8</xref>
.</p>
<p>To investigate the role of fluctuations, we focus on the logarithmic growth rate, commonly used in finance and economics
<disp-formula id="m7">
<inline-graphic id="d32e808" xlink:href="srep00943-m7.jpg"></inline-graphic>
</disp-formula>
to measure the relative growth of word use over 1-year periods, Δ
<italic>t</italic>
≡ 1 year. Recent quantitative analysis on the distribution
<italic>P</italic>
(
<italic>r</italic>
) of word use growth rates
<italic>r
<sub>i</sub>
</italic>
(
<italic>t</italic>
) indicates that annual fluctuations in word use deviate significantly from the predictions of null models of language evolution
<xref ref-type="bibr" rid="b9">9</xref>
.</p>
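For a single word's frequency series, the growth rates of Eq. (7) reduce to differences of logarithms; the frequency values below are hypothetical and serve only to show the computation.

```python
import math

def log_growth_rates(freq_series):
    """Annual logarithmic growth rates r_i(t) = ln f_i(t) - ln f_i(t-1)
    for one word's relative-frequency time series (Delta t = 1 year)."""
    return [math.log(b) - math.log(a)
            for a, b in zip(freq_series, freq_series[1:])]

f = [1.0e-6, 1.1e-6, 0.9e-6, 0.9e-6]   # hypothetical f_i(t) over four years
r = log_growth_rates(f)
print([round(x, 3) for x in r])
```

Because the rates are logarithmic, they are invariant under the overall corpus normalization and can be pooled across words of very different frequency scales.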
<p>We define an aggregate fluctuation scale,
<italic>σ
<sub>r</sub>
</italic>
(
<italic>t</italic>
|
<italic>f
<sub>c</sub>
</italic>
), using a frequency cutoff
<italic>f
<sub>c</sub>
</italic>
∝ 1/
<italic>Min</italic>
[
<italic>N
<sub>u</sub>
</italic>
(
<italic>t</italic>
)] to eliminate infrequently used words. The quantity
<italic>Min</italic>
[
<italic>N
<sub>u</sub>
</italic>
(
<italic>t</italic>
)] is the minimum corpora size over the period of analysis, and so 1/
<italic>Min</italic>
[
<italic>N
<sub>u</sub>
</italic>
(
<italic>t</italic>
)] is an upper bound for the minimum observed frequency for words in the corpora.
<xref ref-type="fig" rid="f6">Figure 6</xref>
shows
<italic>σ
<sub>r</sub>
</italic>
(
<italic>t</italic>
|
<italic>f
<sub>c</sub>
</italic>
), the standard deviation of
<italic>r
<sub>i</sub>
</italic>
(
<italic>t</italic>
) calculated across all words that satisfy the condition 〈
<italic>f
<sub>i</sub>
</italic>
〉 ≥
<italic>f
<sub>c</sub>
</italic>
for words with lifetime
<italic>T
<sub>i</sub>
</italic>
≥ 10 years, using
<italic>f
<sub>c</sub>
</italic>
= 1/
<italic>Min</italic>
[
<italic>N
<sub>u</sub>
</italic>
(
<italic>t</italic>
)]. Visual inspection suggests a general decrease in
<italic>σ
<sub>r</sub>
</italic>
(
<italic>t</italic>
|
<italic>f
<sub>c</sub>
</italic>
) over time, marked by sudden increases during times of political conflict. Hence, the persistent increase in the volume of written language is correlated with a persistent downward trend in what could be thought of as the “system temperature”
<italic>σ
<sub>r</sub>
</italic>
(
<italic>t</italic>
|
<italic>f
<sub>c</sub>
</italic>
): as a language grows and matures, it also “cools off”.</p>
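A sketch of the aggregate fluctuation scale: the standard deviation of the single-year growth rates across all words whose mean frequency clears the cutoff *f<sub>c</sub>*. The rates and frequencies below are invented for illustration, and the lifetime condition *T<sub>i</sub>* ≥ 10 years is omitted for brevity.

```python
import statistics

def fluctuation_scale(rates_by_word, mean_freq, f_c):
    """sigma_r(t | f_c): standard deviation of the growth rates r_i(t),
    computed across all words whose mean frequency <f_i> >= f_c."""
    kept = [r for w, r in rates_by_word.items() if mean_freq[w] >= f_c]
    return statistics.pstdev(kept)

# hypothetical single-year snapshot of growth rates and mean frequencies
rates = {"the": 0.01, "of": -0.02, "allometry": 0.4, "rareword": -0.9}
mean_f = {"the": 5e-2, "of": 4e-2, "allometry": 2e-6, "rareword": 3e-8}
print(fluctuation_scale(rates, mean_f, f_c=1e-7))  # rare word excluded
print(fluctuation_scale(rates, mean_f, f_c=0.0))   # all words: larger sigma
```

Raising *f<sub>c</sub>* strips out the volatile unlimited-lexicon words, which is exactly the effect exploited in the β(*U<sub>c</sub>*) analysis below.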
<p>Since this cooling pattern could arise as a simple artifact of independent identically distributed (i.i.d.) sampling from an increasingly large dataset, we test the scaling of
<italic>σ
<sub>r</sub>
</italic>
(
<italic>t</italic>
|
<italic>f
<sub>c</sub>
</italic>
) with corpus size.
<xref ref-type="fig" rid="f7">Figure 7(A)</xref>
shows that for large
<italic>N
<sub>u</sub>
</italic>
(
<italic>t</italic>
), each language is characterized by a scaling relation
<disp-formula id="m8">
<inline-graphic id="d32e997" xlink:href="srep00943-m8.jpg"></inline-graphic>
</disp-formula>
with language-dependent scaling exponent
<italic>β</italic>
≈ 0.08–0.35. We use
<italic>f
<sub>c</sub>
</italic>
= 10/
<italic>Min</italic>
[
<italic>N
<sub>u</sub>
</italic>
(
<italic>t</italic>
)], which defines the frequency threshold for the inclusion of a given word in our analysis. There are two candidate null models which give insight into the limiting behavior of
<italic>β</italic>
. The Gibrat proportional growth model predicts
<italic>β</italic>
= 0 and the Yule-Simon urn model predicts
<italic>β</italic>
= 1/2
<xref ref-type="bibr" rid="b42">42</xref>
. We observe
<italic>β</italic>
< 1/2, which indicates that the fluctuation scale decreases more slowly with increasing corpus size than would be expected from the Yule-Simon urn model prediction, deducible via the “delta method” for determining the approximate scaling of a distribution and its standard deviation
<italic>σ</italic>
<xref ref-type="bibr" rid="b43">43</xref>
.</p>
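One way to estimate β in Eq. (8) is as the sign-flipped slope of log σ<sub>*r*</sub> against log *N<sub>u</sub>*. The sketch below fits synthetic corpora constructed to obey σ<sub>*r*</sub> = *N<sub>u</sub>*<sup>−0.2</sup> exactly; the exponent 0.2 is an illustrative value inside the observed 0.08–0.35 range, not a fit to the data.

```python
import math

def fit_beta(N_u_series, sigma_series):
    """Least-squares exponent beta in sigma_r ~ N_u**(-beta):
    slope of log(sigma) vs log(N_u), sign-flipped."""
    xs = [math.log(n) for n in N_u_series]
    ys = [math.log(s) for s in sigma_series]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# synthetic corpora obeying sigma_r = N_u**(-0.2) exactly
N_u = [10**6, 10**7, 10**8, 10**9]
sigma = [n ** -0.2 for n in N_u]
beta = fit_beta(N_u, sigma)
print(round(beta, 3))   # 0.2, between Gibrat (0) and Yule-Simon (1/2)
```

The two null models bracket the fit: proportional (Gibrat) growth would give a flat line (β = 0), while i.i.d. Yule-Simon sampling would give slope −1/2.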
<p>To further compare the roles of the kernel lexicon versus the unlimited lexicon, we apply our pruning method to quantify the dependence of the scaling exponent
<italic>β</italic>
on the fluctuations arising from rare words. We omit words from our calculation of
<italic>σ
<sub>r</sub>
</italic>
(
<italic>t</italic>
|
<italic>U
<sub>c</sub>
</italic>
) if their use
<italic>u
<sub>i</sub>
</italic>
(
<italic>t</italic>
) in year
<italic>t</italic>
falls below the word-use threshold
<italic>U
<sub>c</sub>
</italic>
.
<xref ref-type="fig" rid="f7">Fig. 7(B)</xref>
shows that
<italic>β</italic>
(
<italic>U
<sub>c</sub>
</italic>
) increases from values close to 0 to values less than 1/2 as
<italic>U
<sub>c</sub>
</italic>
increases exponentially. An increasing
<italic>β</italic>
(
<italic>U
<sub>c</sub>
</italic>
) confirms our conjecture that rare words are largely responsible for the fluctuations in a language. However, because of the dependency structure between words, there are residual fluctuation spillovers into the kernel lexicon, likely accounting for the fact that
<italic>β</italic>
< 1/2 even when the fluctuations from the unlimited lexicon are removed.</p>
<p>A size-variance relation showing that larger entities have smaller characteristic fluctuations was also demonstrated at the scale of individual words using the same
<italic>Google n-gram</italic>
dataset
<xref ref-type="bibr" rid="b9">9</xref>
. Moreover, this size-variance relation is strikingly analogous to the decreasing growth rate volatility observed as complex economic entities (i.e. firms or countries) increase in size
<xref ref-type="bibr" rid="b42">42</xref>
<xref ref-type="bibr" rid="b44">44</xref>
<xref ref-type="bibr" rid="b45">45</xref>
<xref ref-type="bibr" rid="b46">46</xref>
<xref ref-type="bibr" rid="b47">47</xref>
<xref ref-type="bibr" rid="b48">48</xref>
, which strengthens the analogy of language as a complex ecosystem of words governed by competitive forces.</p>
<p>A further possible explanation for
<italic>β</italic>
< 1/2 is that language growth is counteracted by the influx of new words which tend to have growth-spurts around 30–50 years following their birth in the written corpora
<xref ref-type="bibr" rid="b9">9</xref>
. Moreover, the fluctuation scale
<italic>σ
<sub>r</sub>
</italic>
(
<italic>t</italic>
|
<italic>f
<sub>c</sub>
</italic>
) is positively influenced by adverse conditions such as wars and revolutions, since a decrease in
<italic>N
<sub>u</sub>
</italic>
(
<italic>t</italic>
) may decrease the competitive advantage that old words have over new words, allowing new words to break through. The globalization effect, manifesting from increased human mobility during periods of conflict, is also responsible for the emergence of new words within a language.</p>
</sec>
</sec>
<sec disp-level="1" sec-type="discussion">
<title>Discussion</title>
<p>A coevolutionary description of language and culture requires many factors and much consideration
<xref ref-type="bibr" rid="b49">49</xref>
<xref ref-type="bibr" rid="b50">50</xref>
. While scientific and technological advances are largely responsible for written language growth as well as the birth of many new words
<xref ref-type="bibr" rid="b9">9</xref>
, socio-political factors also play a strong role. For example, the sexual revolution of the 1960s triggered the sudden emergence of the words “girlfriend” and “boyfriend” in the English corpora
<xref ref-type="bibr" rid="b1">1</xref>
, illustrating the evolving culture of romantic courting. Such technological and socio-political perturbations require case-by-case analysis for any deeper understanding, as demonstrated comprehensively by Michel et al.
<xref ref-type="bibr" rid="b8">8</xref>
.</p>
<p>Here we analyzed the macroscopic properties of written language using the
<italic>Google Books</italic>
database
<xref ref-type="bibr" rid="b1">1</xref>
. We find that the word frequency distribution
<italic>P</italic>
(
<italic>f</italic>
) is characterized by two scaling regimes. While frequently used words that constitute the kernel lexicon follow the Zipf law, the distribution has a less-steep scaling regime quantifying the rarer words constituting the
<italic>unlimited lexicon</italic>
. Our result is robust across languages as well as across other data subsets, thus extending the validity of the seminal observation by Ferrer i Cancho and Solé
<xref ref-type="bibr" rid="b14">14</xref>
, who first reported it for a large body of English text. The kink in the slope preceding the entry into the unlimited lexicon is a likely consequence of the limits of human mental ability that force the individual to optimize the usage of frequently used words and forget specialized words that are seldom used. This hypothesis agrees with the “principle of least effort” that minimizes communication noise between speakers (writers) and listeners (readers), which in turn may lead to the emergence of the Zipf law
<xref ref-type="bibr" rid="b16">16</xref>
.</p>
<p>Using an extremely large written corpus that documents the profound expansion of language over centuries, we analyzed the dependence of vocabulary growth on corpus growth and validated the Heaps law scaling relation given by Eq. (5). Furthermore, we systematically prune the corpus data using a word-occurrence threshold
<italic>U
<sub>c</sub>
</italic>
, and compare the resulting
<italic>b</italic>
(
<italic>U
<sub>c</sub>
</italic>
) value to the
<italic>ζ</italic>
≈ 1 value, which is stable since it is derived from the “kernel” lexicon. We conditionally confirm the theoretical prediction
<italic>ζ</italic>
≈ 1/
<italic>b</italic>
<xref ref-type="bibr" rid="b13">13</xref>
<xref ref-type="bibr" rid="b34">34</xref>
<xref ref-type="bibr" rid="b35">35</xref>
<xref ref-type="bibr" rid="b36">36</xref>
<xref ref-type="bibr" rid="b37">37</xref>
<xref ref-type="bibr" rid="b38">38</xref>
, which we validate only in the case that the extremely rare “unlimited” lexicon words are not included in the data sample (see
<xref ref-type="fig" rid="f3">Figs. 3</xref>
and
<xref ref-type="fig" rid="f4">4</xref>
).</p>
<p>The economies of scale (
<italic>b</italic>
< 1) indicate that there is an
<italic>increasing marginal return</italic>
for new words, or alternatively, a
<italic>decreasing marginal need</italic>
for new words, as evidenced by allometric scaling. This can intuitively be understood in terms of the increasing complexities and combinations of words that become available as more words are added to a language, lessening the need for lexical expansion. However, a relationship between new words and existing words is retained. Every introduction of a word, from an informal setting (e.g. an expository text) to a formal setting (e.g. a dictionary), is yet another chance for the more common describing words to play out their respective frequencies, underscoring the hierarchy of words. This can be demonstrated quite instructively from Eq. (6), which implies that for
<inline-formula id="m10">
<inline-graphic id="d32e1217" xlink:href="srep00943-m10.jpg"></inline-graphic>
</inline-formula>
, meaning that it requires a quantity proportional to the vocabulary size
<italic>N
<sub>w</sub>
</italic>
to introduce a new word, or alternatively, that a quantity proportional to
<italic>N
<sub>w</sub>
</italic>
necessarily results from the addition.</p>
<p>Though new words are needed less and less, the expansion of language continues, doing so with marked characteristics. Taking the growth rate fluctuations of word use to be a kind of temperature, we note that, like an ideal gas, most languages “cool” when they expand. Since the relationship between the temperature and corpus volume is a power law, one may, loosely speaking, liken language growth to the expansion of a gas or the growth of a company
<xref ref-type="bibr" rid="b42">42</xref>
<xref ref-type="bibr" rid="b44">44</xref>
<xref ref-type="bibr" rid="b45">45</xref>
<xref ref-type="bibr" rid="b46">46</xref>
<xref ref-type="bibr" rid="b47">47</xref>
<xref ref-type="bibr" rid="b48">48</xref>
. In contrast to the static laws of Zipf and Heaps, we note that this finding is of a dynamical nature.</p>
<p>Other aspects of language growth may also be understood in terms of expansion of a gas. Since larger literary productivity imposes a downward trend on growth rate fluctuations — which also implies that the ranking of the top words and phases becomes more stable
<xref ref-type="bibr" rid="b51">51</xref>
— productivity itself can be thought of as a kind of inverse pressure, in that highly productive years are observed to “cool” a language off. It is also during the “high-pressure,” low-productivity years that new words tend to emerge more frequently.</p>
<p>Interestingly, the appearance of new words is more like gas condensation, tending to cancel the cooling brought on by language expansion. These two effects, corpus expansion and new word “condensation,” therefore act against each other. Across all corpora we calculate a size-variance scaling exponent 0 <
<italic>β</italic>
< 1/2, bounded by the prediction of
<italic>β</italic>
= 0 (Gibrat growth model) and
<italic>β</italic>
= 1/2 (Yule-Simon growth model)
<xref ref-type="bibr" rid="b42">42</xref>
.</p>
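The bounding values β = 0 (Gibrat) and β = 1/2 (Yule-Simon) can be made concrete with a toy simulation — a sketch under simplified assumptions, not the growth model of ref. 42; the component counts and lognormal parameters are arbitrary. If a word's annual count is the sum of many independently fluctuating contributions, the standard deviation of its annual log growth rate decays as the inverse square root of the number of contributions:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigma_of_growth(n_components, n_trials=4000):
    """Std. dev. of the annual log growth rate for a 'word' whose yearly
    count is the sum of n_components independent fluctuating parts."""
    x1 = rng.lognormal(0.0, 0.5, size=(n_trials, n_components)).sum(axis=1)
    x2 = rng.lognormal(0.0, 0.5, size=(n_trials, n_components)).sum(axis=1)
    return np.std(np.log(x2 / x1))

sizes = np.array([10, 30, 100, 300, 1000])
sigma = np.array([sigma_of_growth(n) for n in sizes])

# The slope of log(sigma) versus log(size) is -beta.
slope, _ = np.polyfit(np.log(sizes), np.log(sigma), 1)
beta = -slope
print(f"size-variance exponent beta = {beta:.2f}")
```

Fully correlated components would instead leave the fluctuation amplitude independent of size (the Gibrat limit β = 0); the intermediate 0 < β < 1/2 measured for the corpora is consistent with partial correlations between word uses.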
<p>In the context of allometric relations, Bettencourt et al.
<xref ref-type="bibr" rid="b27">27</xref>
note that the scaling relations describing the dynamics of cities show an
<italic>increase</italic>
in the characteristic pace of life as the system size grows, whereas those found in biological systems show a
<italic>decrease</italic>
in characteristic rates as the system size grows. Since the languages we analyzed tend to “cool” as they expand, there may be deep-rooted parallels with biological systems based on principles of efficiency
<xref ref-type="bibr" rid="b16">16</xref>
. Languages, like biological systems, demonstrate economies of scale (
<italic>b</italic>
< 1) arising from a complex dependency structure that mimics a hierarchical “circulatory system” required by the organization of language
<xref ref-type="bibr" rid="b39">39</xref>
<xref ref-type="bibr" rid="b52">52</xref>
<xref ref-type="bibr" rid="b53">53</xref>
<xref ref-type="bibr" rid="b54">54</xref>
<xref ref-type="bibr" rid="b55">55</xref>
<xref ref-type="bibr" rid="b56">56</xref>
and the limits of the efficiency of the speakers/writers who exchange the words
<xref ref-type="bibr" rid="b19">19</xref>
<xref ref-type="bibr" rid="b41">41</xref>
<xref ref-type="bibr" rid="b57">57</xref>
.</p>
</sec>
<sec disp-level="1">
<title>Author Contributions</title>
<p>A.M.P., J.T., S.H., H.E.S. & M.P. designed research, performed research, wrote, reviewed and approved the manuscript. A.M.P. performed the numerical and statistical analysis of the data.</p>
</sec>
</body>
<back>
<ack>
<p>AMP acknowledges support from the IMT Lucca Foundation. JT, SH and HES acknowledge support from the DTRA, ONR, the European EPIWORK and LINC projects, and the Israel Science Foundation. MP acknowledges support from the Slovenian Research Agency.</p>
</ack>
<ref-list>
<ref id="b1">
<mixed-citation publication-type="other">
<article-title>Google Books Ngram Viewer</article-title>
.
<ext-link ext-link-type="uri" xlink:href="http://books.google.com/ngrams">http://books.google.com/ngrams</ext-link>
(date of access: 14 January 2011).</mixed-citation>
</ref>
<ref id="b2">
<mixed-citation publication-type="journal">
<name>
<surname>Evans</surname>
<given-names>J. A.</given-names>
</name>
&
<name>
<surname>Foster</surname>
<given-names>J. G.</given-names>
</name>
<article-title>Metaknowledge</article-title>
.
<source>Science</source>
<volume>331</volume>
,
<fpage>721</fpage>
<lpage>725</lpage>
(
<year>2011</year>
).
<pub-id pub-id-type="pmid">21311014</pub-id>
</mixed-citation>
</ref>
<ref id="b3">
<mixed-citation publication-type="book">
<name>
<surname>Ball</surname>
<given-names>P.</given-names>
</name>
<source>Why Society is a Complex Matter</source>
(Springer-Verlag, Berlin,
<year>2012</year>
).</mixed-citation>
</ref>
<ref id="b4">
<mixed-citation publication-type="journal">
<name>
<surname>Helbing</surname>
<given-names>D.</given-names>
</name>
&
<name>
<surname>Balietti</surname>
<given-names>S.</given-names>
</name>
<article-title>How to Create an Innovation Accelerator</article-title>
.
<source>Eur. Phys. J. Special Topics</source>
<volume>195</volume>
,
<fpage>101</fpage>
<lpage>136</lpage>
(
<year>2011</year>
).</mixed-citation>
</ref>
<ref id="b5">
<mixed-citation publication-type="journal">
<name>
<surname>Lazer</surname>
<given-names>D.</given-names>
</name>
<italic>et al.</italic>
<article-title>Computational social science</article-title>
.
<source>Science</source>
<volume>323</volume>
,
<fpage>721</fpage>
<lpage>723</lpage>
(
<year>2009</year>
).
<pub-id pub-id-type="pmid">19197046</pub-id>
</mixed-citation>
</ref>
<ref id="b6">
<mixed-citation publication-type="journal">
<name>
<surname>Barabási</surname>
<given-names>A. L.</given-names>
</name>
<article-title>The network takeover</article-title>
.
<source>Nature Physics</source>
<volume>8</volume>
,
<fpage>14</fpage>
<lpage>16</lpage>
(
<year>2012</year>
).</mixed-citation>
</ref>
<ref id="b7">
<mixed-citation publication-type="journal">
<name>
<surname>Vespignani</surname>
<given-names>A.</given-names>
</name>
<article-title>Modeling dynamical processes in complex socio-technical systems</article-title>
.
<source>Nature Physics</source>
<volume>8</volume>
,
<fpage>32</fpage>
<lpage>39</lpage>
(
<year>2012</year>
).</mixed-citation>
</ref>
<ref id="b8">
<mixed-citation publication-type="journal">
<name>
<surname>Michel</surname>
<given-names>J.-B.</given-names>
</name>
<italic>et al.</italic>
<article-title>Quantitative analysis of culture using millions of digitized books</article-title>
.
<source>Science</source>
<volume>331</volume>
,
<fpage>176</fpage>
<lpage>182</lpage>
(
<year>2011</year>
).
<pub-id pub-id-type="pmid">21163965</pub-id>
</mixed-citation>
</ref>
<ref id="b9">
<mixed-citation publication-type="journal">
<name>
<surname>Petersen</surname>
<given-names>A. M.</given-names>
</name>
,
<name>
<surname>Tenenbaum</surname>
<given-names>J.</given-names>
</name>
,
<name>
<surname>Havlin</surname>
<given-names>S.</given-names>
</name>
&
<name>
<surname>Stanley</surname>
<given-names>H. E.</given-names>
</name>
<article-title>Statistical laws governing fluctuations in word use from word birth to word death</article-title>
.
<source>Scientific Reports</source>
<volume>2</volume>
,
<fpage>313</fpage>
(
<year>2012</year>
).
<pub-id pub-id-type="pmid">22423321</pub-id>
</mixed-citation>
</ref>
<ref id="b10">
<mixed-citation publication-type="journal">
<name>
<surname>Gao</surname>
<given-names>J.</given-names>
</name>
,
<name>
<surname>Hu</surname>
<given-names>J.</given-names>
</name>
,
<name>
<surname>Mao</surname>
<given-names>X.</given-names>
</name>
&
<name>
<surname>Perc</surname>
<given-names>M.</given-names>
</name>
<article-title>Culturomics meets random fractal theory: Insights into long-range correlations of social and natural phenomena over the past two centuries</article-title>
.
<source>J. R. Soc. Interface</source>
<volume>9</volume>
,
<fpage>1956</fpage>
<lpage>1964</lpage>
(
<year>2012</year>
).
<pub-id pub-id-type="pmid">22337632</pub-id>
</mixed-citation>
</ref>
<ref id="b11">
<mixed-citation publication-type="book">
<name>
<surname>Zipf</surname>
<given-names>G. K.</given-names>
</name>
<source>Human Behavior and the Principle of Least-Effort: An Introduction to Human Ecology.</source>
(Addison-Wesley, Cambridge, MA,
<year>1949</year>
).</mixed-citation>
</ref>
<ref id="b12">
<mixed-citation publication-type="journal">
<name>
<surname>Tsonis</surname>
<given-names>A. A.</given-names>
</name>
,
<name>
<surname>Schultz</surname>
<given-names>C.</given-names>
</name>
&
<name>
<surname>Tsonis</surname>
<given-names>P. A.</given-names>
</name>
<article-title>Zipf's law and the structure and evolution of languages</article-title>
.
<source>Complexity</source>
<volume>3</volume>
,
<fpage>12</fpage>
<lpage>13</lpage>
(
<year>1997</year>
).</mixed-citation>
</ref>
<ref id="b13">
<mixed-citation publication-type="journal">
<name>
<surname>Serrano</surname>
<given-names>M. Á.</given-names>
</name>
,
<name>
<surname>Flammini</surname>
<given-names>A.</given-names>
</name>
&
<name>
<surname>Menczer</surname>
<given-names>F.</given-names>
</name>
<article-title>Modeling statistical properties of written text</article-title>
.
<source>PLoS ONE</source>
<volume>4</volume>
,
<fpage>e5372</fpage>
(
<year>2009</year>
).
<pub-id pub-id-type="pmid">19401762</pub-id>
</mixed-citation>
</ref>
<ref id="b14">
<mixed-citation publication-type="journal">
<name>
<surname>Ferrer i Cancho</surname>
<given-names>R.</given-names>
</name>
&
<name>
<surname>Solé</surname>
<given-names>R. V.</given-names>
</name>
<article-title>Two regimes in the frequency of words and the origin of complex lexicons: Zipf's law revisited</article-title>
.
<source>Journal of Quantitative Linguistics</source>
<volume>8</volume>
,
<fpage>165</fpage>
<lpage>173</lpage>
(
<year>2001</year>
).</mixed-citation>
</ref>
<ref id="b15">
<mixed-citation publication-type="journal">
<name>
<surname>Ferrer i Cancho</surname>
<given-names>R.</given-names>
</name>
<article-title>The variation of Zipf's law in human language</article-title>
.
<source>Eur. Phys. J. B</source>
<volume>44</volume>
,
<fpage>249</fpage>
<lpage>257</lpage>
(
<year>2005</year>
).</mixed-citation>
</ref>
<ref id="b16">
<mixed-citation publication-type="journal">
<name>
<surname>Ferrer i Cancho</surname>
<given-names>R.</given-names>
</name>
&
<name>
<surname>Solé</surname>
<given-names>R. V.</given-names>
</name>
<article-title>Least effort and the origins of scaling in human language</article-title>
.
<source>Proc. Natl. Acad. Sci. USA</source>
<volume>100</volume>
,
<fpage>788</fpage>
<lpage>791</lpage>
(
<year>2003</year>
).
<pub-id pub-id-type="pmid">12540826</pub-id>
</mixed-citation>
</ref>
<ref id="b17">
<mixed-citation publication-type="journal">
<name>
<surname>Baek</surname>
<given-names>S. K.</given-names>
</name>
,
<name>
<surname>Bernhardsson</surname>
<given-names>S.</given-names>
</name>
&
<name>
<surname>Minnhagen</surname>
<given-names>P.</given-names>
</name>
<article-title>Zipf's law unzipped</article-title>
.
<source>New J. Phys.</source>
<volume>13</volume>
,
<fpage>043004</fpage>
(
<year>2011</year>
).</mixed-citation>
</ref>
<ref id="b18">
<mixed-citation publication-type="book">
<name>
<surname>Heaps</surname>
<given-names>H. S.</given-names>
</name>
<source>Information Retrieval: Computational and Theoretical Aspects.</source>
(Academic Press, New York,
<year>1978</year>
).</mixed-citation>
</ref>
<ref id="b19">
<mixed-citation publication-type="journal">
<name>
<surname>Bernhardsson</surname>
<given-names>S.</given-names>
</name>
,
<name>
<surname>Correa da Rocha</surname>
<given-names>L. E.</given-names>
</name>
&
<name>
<surname>Minnhagen</surname>
<given-names>P.</given-names>
</name>
<article-title>The meta book and size-dependent properties of written language</article-title>
.
<source>New J. Phys.</source>
<volume>11</volume>
,
<fpage>123015</fpage>
(
<year>2009</year>
).</mixed-citation>
</ref>
<ref id="b20">
<mixed-citation publication-type="journal">
<name>
<surname>Bernhardsson</surname>
<given-names>S.</given-names>
</name>
,
<name>
<surname>Correa da Rocha</surname>
<given-names>L. E.</given-names>
</name>
&
<name>
<surname>Minnhagen</surname>
<given-names>P.</given-names>
</name>
<article-title>Size-dependent word frequencies and translational invariance of books</article-title>
.
<source>Physica A</source>
<volume>389</volume>
,
<fpage>330</fpage>
<lpage>341</lpage>
(
<year>2010</year>
).</mixed-citation>
</ref>
<ref id="b21">
<mixed-citation publication-type="journal">
<name>
<surname>Kleiber</surname>
<given-names>M.</given-names>
</name>
<article-title>Body size and metabolism</article-title>
.
<source>Hilgardia</source>
<volume>6</volume>
,
<fpage>315</fpage>
<lpage>351</lpage>
(
<year>1932</year>
).</mixed-citation>
</ref>
<ref id="b22">
<mixed-citation publication-type="journal">
<name>
<surname>West</surname>
<given-names>G. B.</given-names>
</name>
<article-title>Allometric scaling of metabolic rate from molecules and mitochondria to cells and mammals</article-title>
.
<source>Proc. Natl. Acad. Sci. USA</source>
<volume>98</volume>
,
<fpage>2473</fpage>
<lpage>2478</lpage>
(
<year>2002</year>
).
<pub-id pub-id-type="pmid">11875197</pub-id>
</mixed-citation>
</ref>
<ref id="b23">
<mixed-citation publication-type="journal">
<name>
<surname>Makse</surname>
<given-names>H. A.</given-names>
</name>
,
<name>
<surname>Havlin</surname>
<given-names>S.</given-names>
</name>
&
<name>
<surname>Stanley</surname>
<given-names>H. E.</given-names>
</name>
<article-title>Modelling urban growth patterns</article-title>
.
<source>Nature</source>
<volume>377</volume>
,
<fpage>608</fpage>
<lpage>612</lpage>
(
<year>1995</year>
).</mixed-citation>
</ref>
<ref id="b24">
<mixed-citation publication-type="journal">
<name>
<surname>Makse</surname>
<given-names>H. A.</given-names>
</name>
,
<name>
<surname>Andrade</surname>
<given-names>Jr. J. S.</given-names>
</name>
,
<name>
<surname>Batty</surname>
<given-names>M.</given-names>
</name>
,
<name>
<surname>Havlin</surname>
<given-names>S.</given-names>
</name>
&
<name>
<surname>Stanley</surname>
<given-names>H. E.</given-names>
</name>
<article-title>Modeling urban growth patterns with correlated percolation</article-title>
.
<source>Phys. Rev. E</source>
<volume>58</volume>
,
<fpage>7054</fpage>
<lpage>7062</lpage>
(
<year>1998</year>
).</mixed-citation>
</ref>
<ref id="b25">
<mixed-citation publication-type="journal">
<name>
<surname>Rozenfeld</surname>
<given-names>H. D.</given-names>
</name>
,
<name>
<surname>Rybski</surname>
<given-names>D.</given-names>
</name>
,
<name>
<surname>Andrade</surname>
<given-names>Jr. J. S.</given-names>
</name>
,
<name>
<surname>Batty</surname>
<given-names>M.</given-names>
</name>
,
<name>
<surname>Stanley</surname>
<given-names>H. E.</given-names>
</name>
&
<name>
<surname>Makse</surname>
<given-names>H. A.</given-names>
</name>
<article-title>Laws of population growth</article-title>
.
<source>Proc. Natl. Acad. Sci. USA</source>
<volume>105</volume>
,
<fpage>18702</fpage>
<lpage>18707</lpage>
(
<year>2008</year>
).
<pub-id pub-id-type="pmid">19033186</pub-id>
</mixed-citation>
</ref>
<ref id="b26">
<mixed-citation publication-type="journal">
<name>
<surname>Gabaix</surname>
<given-names>X.</given-names>
</name>
<article-title>Zipf's law for cities: An explanation</article-title>
.
<source>Quarterly Journal of Economics</source>
<volume>114</volume>
,
<fpage>739</fpage>
<lpage>767</lpage>
(
<year>1999</year>
).</mixed-citation>
</ref>
<ref id="b27">
<mixed-citation publication-type="journal">
<name>
<surname>Bettencourt</surname>
<given-names>L. M. A.</given-names>
</name>
,
<name>
<surname>Lobo</surname>
<given-names>J.</given-names>
</name>
,
<name>
<surname>Helbing</surname>
<given-names>D.</given-names>
</name>
,
<name>
<surname>Kuhnert</surname>
<given-names>C.</given-names>
</name>
&
<name>
<surname>West</surname>
<given-names>G. B.</given-names>
</name>
<article-title>Growth, innovation, scaling, and the pace of life in cities</article-title>
.
<source>Proc. Natl. Acad. Sci. USA</source>
<volume>104</volume>
,
<fpage>7301</fpage>
<lpage>7306</lpage>
(
<year>2007</year>
).
<pub-id pub-id-type="pmid">17438298</pub-id>
</mixed-citation>
</ref>
<ref id="b28">
<mixed-citation publication-type="journal">
<name>
<surname>Batty</surname>
<given-names>M.</given-names>
</name>
<article-title>The size, scale, and shape of cities</article-title>
.
<source>Science</source>
<volume>319</volume>
,
<fpage>769</fpage>
<lpage>771</lpage>
(
<year>2008</year>
).
<pub-id pub-id-type="pmid">18258906</pub-id>
</mixed-citation>
</ref>
<ref id="b29">
<mixed-citation publication-type="journal">
<name>
<surname>Rozenfeld</surname>
<given-names>H. D.</given-names>
</name>
,
<name>
<surname>Rybski</surname>
<given-names>D.</given-names>
</name>
,
<name>
<surname>Gabaix</surname>
<given-names>X.</given-names>
</name>
&
<name>
<surname>Makse</surname>
<given-names>H. A.</given-names>
</name>
<article-title>The area and population of cities: New insights from a different perspective on cities</article-title>
.
<source>American Economic Review</source>
<volume>101</volume>
,
<fpage>2205</fpage>
<lpage>2225</lpage>
(
<year>2011</year>
).</mixed-citation>
</ref>
<ref id="b30">
<mixed-citation publication-type="journal">
<name>
<surname>Newman</surname>
<given-names>M. E. J.</given-names>
</name>
<article-title>Power laws, Pareto distributions and Zipf's law</article-title>
.
<source>Contemporary Phys.</source>
<volume>46</volume>
,
<fpage>323</fpage>
<lpage>351</lpage>
(
<year>2005</year>
).</mixed-citation>
</ref>
<ref id="b31">
<mixed-citation publication-type="journal">
<name>
<surname>Stanley</surname>
<given-names>M. H. R.</given-names>
</name>
,
<name>
<surname>Buldyrev</surname>
<given-names>S. V.</given-names>
</name>
,
<name>
<surname>Havlin</surname>
<given-names>S.</given-names>
</name>
,
<name>
<surname>Mantegna</surname>
<given-names>R.</given-names>
</name>
,
<name>
<surname>Salinger</surname>
<given-names>M.</given-names>
</name>
&
<name>
<surname>Stanley</surname>
<given-names>H. E.</given-names>
</name>
<article-title>Zipf plots and the size distribution of firms</article-title>
.
<source>Econ. Lett.</source>
<volume>49</volume>
,
<fpage>453</fpage>
<lpage>457</lpage>
(
<year>1995</year>
).</mixed-citation>
</ref>
<ref id="b32">
<mixed-citation publication-type="journal">
<name>
<surname>Mantegna</surname>
<given-names>R. N.</given-names>
</name>
<italic>et al.</italic>
<article-title>Systematic analysis of coding and noncoding DNA sequences using methods of statistical linguistics</article-title>
.
<source>Phys. Rev. E</source>
<volume>52</volume>
,
<fpage>2939</fpage>
<lpage>2950</lpage>
(
<year>1995</year>
).</mixed-citation>
</ref>
<ref id="b33">
<mixed-citation publication-type="journal">
<name>
<surname>Clauset</surname>
<given-names>A.</given-names>
</name>
,
<name>
<surname>Shalizi</surname>
<given-names>C. R.</given-names>
</name>
&
<name>
<surname>Newman</surname>
<given-names>M. E. J.</given-names>
</name>
<article-title>Power-law distributions in empirical data</article-title>
.
<source>SIAM Rev.</source>
<volume>51</volume>
,
<fpage>661</fpage>
<lpage>703</lpage>
(
<year>2009</year>
).</mixed-citation>
</ref>
<ref id="b34">
<mixed-citation publication-type="journal">
<name>
<surname>Mandelbrot</surname>
<given-names>B.</given-names>
</name>
<article-title>On the theory of word frequencies and on related Markovian models of discourse</article-title>
. In: R. Jakobson (ed.),
<source>Structure of Language and its Mathematical Aspects, Proceedings of Symposia in Applied Mathematics</source>
<volume>Vol. XII</volume>
,
<fpage>190</fpage>
<lpage>219</lpage>
(
<year>1961</year>
).</mixed-citation>
</ref>
<ref id="b35">
<mixed-citation publication-type="journal">
<name>
<surname>Karlin</surname>
<given-names>S.</given-names>
</name>
<article-title>Central limit theorems for certain infinite urn schemes</article-title>
.
<source>Journal of Mathematics and Mechanics</source>
<volume>17</volume>
,
<fpage>373</fpage>
<lpage>401</lpage>
(
<year>1967</year>
).</mixed-citation>
</ref>
<ref id="b36">
<mixed-citation publication-type="journal">
<name>
<surname>Gnedin</surname>
<given-names>A.</given-names>
</name>
,
<name>
<surname>Hansen</surname>
<given-names>B.</given-names>
</name>
&
<name>
<surname>Pitman</surname>
<given-names>J.</given-names>
</name>
<article-title>Notes on the occupancy problem with infinitely many boxes: general asymptotics and power laws</article-title>
.
<source>Probability Surveys</source>
<volume>4</volume>
,
<fpage>146</fpage>
<lpage>171</lpage>
(
<year>2007</year>
).</mixed-citation>
</ref>
<ref id="b37">
<mixed-citation publication-type="journal">
<name>
<surname>van Leijenhorst</surname>
<given-names>D. C.</given-names>
</name>
&
<name>
<surname>van der Weide</surname>
<given-names>Th. P.</given-names>
</name>
<article-title>A formal derivation of Heaps' Law</article-title>
.
<source>Inform. Sci.</source>
<volume>170</volume>
,
<fpage>263</fpage>
<lpage>272</lpage>
(
<year>2005</year>
).</mixed-citation>
</ref>
<ref id="b38">
<mixed-citation publication-type="journal">
<name>
<surname>Lü</surname>
<given-names>L.</given-names>
</name>
,
<name>
<surname>Zhang</surname>
<given-names>Z.-K.</given-names>
</name>
&
<name>
<surname>Zhou</surname>
<given-names>T.</given-names>
</name>
<article-title>Zipf's law leads to Heaps' law: Analyzing their relation in finite-size systems</article-title>
.
<source>PLoS One</source>
<volume>5</volume>
,
<fpage>e14139</fpage>
(
<year>2010</year>
).
<pub-id pub-id-type="pmid">21152034</pub-id>
</mixed-citation>
</ref>
<ref id="b39">
<mixed-citation publication-type="journal">
<name>
<surname>Steyvers</surname>
<given-names>M.</given-names>
</name>
&
<name>
<surname>Tenenbaum</surname>
<given-names>J. B.</given-names>
</name>
<article-title>The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth</article-title>
.
<source>Cogn. Sci.</source>
<volume>29</volume>
,
<fpage>41</fpage>
<lpage>78</lpage>
(
<year>2005</year>
).
<pub-id pub-id-type="pmid">21702767</pub-id>
</mixed-citation>
</ref>
<ref id="b40">
<mixed-citation publication-type="journal">
<name>
<surname>Markosova</surname>
<given-names>M.</given-names>
</name>
<article-title>Network model of human language</article-title>
.
<source>Physica A</source>
<volume>387</volume>
,
<fpage>661</fpage>
<lpage>666</lpage>
(
<year>2008</year>
).</mixed-citation>
</ref>
<ref id="b41">
<mixed-citation publication-type="journal">
<name>
<surname>Altmann</surname>
<given-names>E. G.</given-names>
</name>
,
<name>
<surname>Pierrehumbert</surname>
<given-names>J. B.</given-names>
</name>
&
<name>
<surname>Motter</surname>
<given-names>A. E.</given-names>
</name>
<article-title>Niche as a determinant of word fate in online groups</article-title>
.
<source>PLoS ONE</source>
<volume>6</volume>
,
<fpage>e19009</fpage>
(
<year>2011</year>
).
<pub-id pub-id-type="pmid">21589910</pub-id>
</mixed-citation>
</ref>
<ref id="b42">
<mixed-citation publication-type="journal">
<name>
<surname>Riccaboni</surname>
<given-names>M.</given-names>
</name>
,
<name>
<surname>Pammolli</surname>
<given-names>F.</given-names>
</name>
,
<name>
<surname>Buldyrev</surname>
<given-names>S. V.</given-names>
</name>
,
<name>
<surname>Ponta</surname>
<given-names>L.</given-names>
</name>
&
<name>
<surname>Stanley</surname>
<given-names>H. E.</given-names>
</name>
<article-title>The size variance relationship of business firm growth rates</article-title>
.
<source>Proc. Natl. Acad. Sci. USA</source>
<volume>105</volume>
,
<fpage>19595</fpage>
<lpage>19600</lpage>
(
<year>2008</year>
).
<pub-id pub-id-type="pmid">19066227</pub-id>
</mixed-citation>
</ref>
<ref id="b43">
<mixed-citation publication-type="journal">
<name>
<surname>Oehlert</surname>
<given-names>G. W.</given-names>
</name>
<article-title>A Note on the Delta Method</article-title>
.
<source>The American Statistician</source>
<volume>46</volume>
,
<fpage>27</fpage>
<lpage>29</lpage>
(
<year>1992</year>
).</mixed-citation>
</ref>
<ref id="b44">
<mixed-citation publication-type="journal">
<name>
<surname>Amaral</surname>
<given-names>L. A. N.</given-names>
</name>
<italic>et al.</italic>
<article-title>Scaling Behavior in Economics: I. Empirical Results for Company Growth</article-title>
.
<source>J. Phys. I France</source>
<volume>7</volume>
,
<fpage>621</fpage>
<lpage>633</lpage>
(
<year>1997</year>
).</mixed-citation>
</ref>
<ref id="b45">
<mixed-citation publication-type="journal">
<name>
<surname>Amaral</surname>
<given-names>L. A. N.</given-names>
</name>
<italic>et al.</italic>
<article-title>Power Law Scaling for a System of Interacting Units with Complex Internal Structure</article-title>
.
<source>Phys. Rev. Lett.</source>
<volume>80</volume>
,
<fpage>1385</fpage>
<lpage>1388</lpage>
(
<year>1998</year>
).</mixed-citation>
</ref>
<ref id="b46">
<mixed-citation publication-type="journal">
<name>
<surname>Fu</surname>
<given-names>D.</given-names>
</name>
,
<name>
<surname>Pammolli</surname>
<given-names>F.</given-names>
</name>
,
<name>
<surname>Buldyrev</surname>
<given-names>S. V.</given-names>
</name>
,
<name>
<surname>Riccaboni</surname>
<given-names>M.</given-names>
</name>
,
<name>
<surname>Matia</surname>
<given-names>K.</given-names>
</name>
,
<name>
<surname>Yamasaki</surname>
<given-names>K.</given-names>
</name>
&
<name>
<surname>Stanley</surname>
<given-names>H. E.</given-names>
</name>
<article-title>The growth of business firms: Theoretical framework and empirical evidence</article-title>
.
<source>Proc. Natl. Acad. Sci. USA</source>
<volume>102</volume>
,
<fpage>18801</fpage>
<lpage>18806</lpage>
(
<year>2005</year>
).
<pub-id pub-id-type="pmid">16365284</pub-id>
</mixed-citation>
</ref>
<ref id="b47">
<mixed-citation publication-type="journal">
<name>
<surname>Podobnik</surname>
<given-names>B.</given-names>
</name>
,
<name>
<surname>Horvatic</surname>
<given-names>D.</given-names>
</name>
,
<name>
<surname>Petersen</surname>
<given-names>A. M.</given-names>
</name>
&
<name>
<surname>Stanley</surname>
<given-names>H. E.</given-names>
</name>
<article-title>Quantitative relations between risk, return, and firm size</article-title>
.
<source>EPL</source>
<volume>85</volume>
,
<fpage>50003</fpage>
(
<year>2009</year>
).</mixed-citation>
</ref>
<ref id="b48">
<mixed-citation publication-type="journal">
<name>
<surname>Podobnik</surname>
<given-names>B.</given-names>
</name>
,
<name>
<surname>Horvatic</surname>
<given-names>D.</given-names>
</name>
,
<name>
<surname>Petersen</surname>
<given-names>A. M.</given-names>
</name>
,
<name>
<surname>Njavro</surname>
<given-names>M.</given-names>
</name>
&
<name>
<surname>Stanley</surname>
<given-names>H. E.</given-names>
</name>
<article-title>Common scaling behavior in finance and macroeconomics</article-title>
.
<source>Eur. Phys. J. B</source>
<volume>76</volume>
,
<fpage>487</fpage>
<lpage>490</lpage>
(
<year>2010</year>
).</mixed-citation>
</ref>
<ref id="b49">
<mixed-citation publication-type="book">
<name>
<surname>Mufwene</surname>
<given-names>S.</given-names>
</name>
<source>The Ecology of Language Evolution.</source>
(Cambridge Univ. Press, Cambridge, UK,
<year>2001</year>
).</mixed-citation>
</ref>
<ref id="b50">
<mixed-citation publication-type="book">
<name>
<surname>Mufwene</surname>
<given-names>S.</given-names>
</name>
<source>Language Evolution: Contact, Competition and Change.</source>
(Continuum International Publishing Group, New York, NY,
<year>2008</year>
).</mixed-citation>
</ref>
<ref id="b51">
<mixed-citation publication-type="journal">
<name>
<surname>Perc</surname>
<given-names>M.</given-names>
</name>
<article-title>Evolution of the most common English words and phrases over the centuries</article-title>
.
<source>J. R. Soc. Interface</source>
<volume>9</volume>
,
<fpage>3323</fpage>
<lpage>3328</lpage>
(
<year>2012</year>
).</mixed-citation>
</ref>
<ref id="b52">
<mixed-citation publication-type="journal">
<name>
<surname>Sigman</surname>
<given-names>M.</given-names>
</name>
&
<name>
<surname>Cecchi</surname>
<given-names>G. A.</given-names>
</name>
<article-title>Global organization of the wordnet lexicon</article-title>
.
<source>Proc. Natl. Acad. Sci. USA</source>
<volume>99</volume>
,
<fpage>1742</fpage>
<lpage>1747</lpage>
(
<year>2002</year>
).
<pub-id pub-id-type="pmid">11830677</pub-id>
</mixed-citation>
</ref>
<ref id="b53">
<mixed-citation publication-type="journal">
<name>
<surname>Alvarez-Lacalle</surname>
<given-names>E.</given-names>
</name>
,
<name>
<surname>Dorow</surname>
<given-names>B.</given-names>
</name>
,
<name>
<surname>Eckmann</surname>
<given-names>J.-P.</given-names>
</name>
&
<name>
<surname>Moses</surname>
<given-names>E.</given-names>
</name>
<article-title>Hierarchical structures induce long-range dynamical correlations in written texts</article-title>
.
<source>Proc. Natl. Acad. Sci. USA</source>
<volume>103</volume>
,
<fpage>7956</fpage>
<lpage>7961</lpage>
(
<year>2006</year>
).
<pub-id pub-id-type="pmid">16698933</pub-id>
</mixed-citation>
</ref>
<ref id="b54">
<mixed-citation publication-type="journal">
<name>
<surname>Altmann</surname>
<given-names>E. A.</given-names>
</name>
,
<name>
<surname>Cristadoro</surname>
<given-names>G.</given-names>
</name>
&
<name>
<surname>Esposti</surname>
<given-names>M. D.</given-names>
</name>
<article-title>On the origin of long-range correlations in texts</article-title>
.
<source>Proc. Natl. Acad. Sci. USA</source>
<volume>109</volume>
,
<fpage>11582</fpage>
<lpage>11587</lpage>
(
<year>2012</year>
).
<pub-id pub-id-type="pmid">22753514</pub-id>
</mixed-citation>
</ref>
<ref id="b55">
<mixed-citation publication-type="journal">
<name>
<surname>Montemurro</surname>
<given-names>M. A.</given-names>
</name>
&
<name>
<surname>Pury</surname>
<given-names>P. A.</given-names>
</name>
<article-title>Long-range fractal correlations in literary corpora</article-title>
.
<source>Fractals</source>
<volume>10</volume>
,
<fpage>451</fpage>
<lpage>461</lpage>
(
<year>2002</year>
).</mixed-citation>
</ref>
<ref id="b56">
<mixed-citation publication-type="journal">
<name>
<surname>Corral</surname>
<given-names>A.</given-names>
</name>
,
<name>
<surname>Ferrer i Cancho</surname>
<given-names>R.</given-names>
</name>
&
<name>
<surname>Díaz-Guilera</surname>
<given-names>A.</given-names>
</name>
<article-title>Universal complex structures in written language</article-title>
.
<source>arXiv</source>
:
<fpage>0901.2924</fpage>
(
<year>2009</year>
).</mixed-citation>
</ref>
<ref id="b57">
<mixed-citation publication-type="journal">
<name>
<surname>Altmann</surname>
<given-names>E. G.</given-names>
</name>
,
<name>
<surname>Pierrehumbert</surname>
<given-names>J. B.</given-names>
</name>
&
<name>
<surname>Motter</surname>
<given-names>A. E.</given-names>
</name>
<article-title>Beyond word frequency: bursts, lulls, and scaling in the temporal distributions of words</article-title>
.
<source>PLoS ONE</source>
<volume>4</volume>
,
<fpage>e7678</fpage>
(
<year>2009</year>
).
<pub-id pub-id-type="pmid">19907645</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
<floats-group>
<fig id="f1">
<label>Figure 1</label>
<caption>
<title>Two-regime scaling distribution of word frequency.</title>
<p>The kink in the probability density functions
<italic>P</italic>
(
<italic>f</italic>
) occurs around
<italic>f</italic>
<sub>×</sub>
≈ 10
<sup>−5</sup>
for each corpus analyzed (see legend). (A,B) Data from all years are aggregated into a single distribution. (C,D)
<italic>P</italic>
(
<italic>f</italic>
) comprising data from only year
<italic>t</italic>
= 2000, providing evidence that the distribution is stable even over shorter time frames and likely emerges in corpora that are sufficiently large to be comprehensive of the language studied. For details concerning the scaling exponents we refer to
<xref ref-type="table" rid="t1">Table I</xref>
and the main text.</p>
</caption>
<graphic xlink:href="srep00943-f1"></graphic>
</fig>
<fig id="f2">
<label>Figure 2</label>
<caption>
<title>Allometric scaling of language.</title>
<p>Scatter plots of the output corpus size
<italic>N
<sub>u</sub>
</italic>
given the empirical vocabulary size
<italic>N
<sub>w</sub>
</italic>
using all data (
<italic>U
<sub>c</sub>
</italic>
= 0) over the 209-year period 1800–2008. Shown is the OLS estimation of the exponent
<italic>b</italic>
quantifying the Heaps' law relation
<italic>N
<sub>w</sub>
</italic>
~ [
<italic>N
<sub>u</sub>
</italic>
]
<italic>
<sup>b</sup>
</italic>
.</p>
</caption>
<graphic xlink:href="srep00943-f2"></graphic>
</fig>
<fig id="f3">
<label>Figure 3</label>
<caption>
<title>Pruning reveals the variable marginal return of words.</title>
<p>The Heaps scaling exponent
<italic>b</italic>
depends on the extent to which the rarest words are included. For a given corpus and
<italic>U
<sub>c</sub>
</italic>
value we make a scatter plot between
<italic>N
<sub>w</sub>
</italic>
(
<italic>t</italic>
|
<italic>U
<sub>c</sub>
</italic>
) and
<italic>N
<sub>u</sub>
</italic>
(
<italic>t</italic>
|
<italic>U
<sub>c</sub>
</italic>
) using words with
<italic>u
<sub>i</sub>
</italic>
(
<italic>t</italic>
) ≥
<italic>U
<sub>c</sub>
</italic>
. (Panel Inset) We use OLS to estimate the scaling exponent
<italic>b</italic>
(
<italic>U
<sub>c</sub>
</italic>
) for the model
<italic>N
<sub>w</sub>
</italic>
(
<italic>t</italic>
|
<italic>U
<sub>c</sub>
</italic>
) ~[
<italic>N
<sub>u</sub>
</italic>
(
<italic>t</italic>
|
<italic>U
<sub>c</sub>
</italic>
)]
<sup>
<italic>b</italic>
</sup>
to show that
<italic>b</italic>
(
<italic>U
<sub>c</sub>
</italic>
) increases from approximately 0.5 towards unity as we prune the corpora of extremely rare words. Our longitudinal language analysis provides insight into the structural importance of the most frequent words, which are used more times per appearance and which play a crucial role in the usage of new and rare words.</p>
</caption>
<graphic xlink:href="srep00943-f3"></graphic>
</fig>
<fig id="f4">
<label>Figure 4</label>
<caption>
<title>Pruning reveals the variable marginal return of words.</title>
<p>The Heaps scaling exponent
<italic>b</italic>
depends on the extent to which the rarest words are included. For a given corpus and
<italic>U
<sub>c</sub>
</italic>
value we make a scatter plot between
<italic>N
<sub>w</sub>
</italic>
(
<italic>t</italic>
|
<italic>U
<sub>c</sub>
</italic>
) and
<italic>N
<sub>u</sub>
</italic>
(
<italic>t</italic>
|
<italic>U
<sub>c</sub>
</italic>
) using words with
<italic>u
<sub>i</sub>
</italic>
(
<italic>t</italic>
) ≥
<italic>U
<sub>c</sub>
</italic>
, using the same data color-
<italic>U
<sub>c</sub>
</italic>
correspondence as in
<xref ref-type="fig" rid="f3">Fig. 3</xref>
. (Panel Inset) We use OLS to estimate the scaling exponent
<italic>b</italic>
(
<italic>U
<sub>c</sub>
</italic>
) for the model
<italic>N
<sub>w</sub>
</italic>
(
<italic>t</italic>
|
<italic>U
<sub>c</sub>
</italic>
) ~ [
<italic>N
<sub>u</sub>
</italic>
(
<italic>t</italic>
|
<italic>U
<sub>c</sub>
</italic>
)]
<italic>
<sup>b</sup>
</italic>
to show that
<italic>b</italic>
(
<italic>U
<sub>c</sub>
</italic>
) increases from approximately 0.5 towards unity as we prune the corpora of extremely rare words. Our longitudinal language analysis provides insight into the structural importance of the most frequent words, which are used more times per appearance and which play a crucial role in the usage of new and rare words.</p>
</caption>
<graphic xlink:href="srep00943-f4"></graphic>
</fig>
<fig id="f5">
<label>Figure 5</label>
<caption>
<title>Literary productivity and vocabulary size in the
<italic>Google Inc.</italic>
1-gram dataset over the past two centuries.</title>
<p>(A) Total size of the different corpora
<italic>N
<sub>u</sub>
</italic>
(
<italic>t</italic>
|
<italic>U
<sub>c</sub>
</italic>
) over time, calculated by using words that satisfy
<italic>u
<sub>i</sub>
</italic>
(
<italic>t</italic>
) ≥
<italic>U
<sub>c</sub>
</italic>
≡ 16 to eliminate extremely rare 1-grams. (B) Size of the written vocabulary
<italic>N
<sub>w</sub>
</italic>
(
<italic>t</italic>
|
<italic>U
<sub>c</sub>
</italic>
) over time, calculated under the same conditions as (A).</p>
</caption>
<graphic xlink:href="srep00943-f5"></graphic>
</fig>
<fig id="f6">
<label>Figure 6</label>
<caption>
<title>Non-stationarity in the characteristic growth fluctuation of word use.</title>
<p>The standard deviation
<italic>σ
<sub>r</sub>
</italic>
(
<italic>t</italic>
|
<italic>f
<sub>c</sub>
</italic>
) of the logarithmic growth rate
<italic>r
<sub>i</sub>
</italic>
(
<italic>t</italic>
) is presented for all examined corpora. There is an overall decreasing trend arising from the increasing size of the corpora, as depicted in
<xref ref-type="fig" rid="f5">Fig. 5(A)</xref>
. On the other hand, the steady production of new words, as depicted in
<xref ref-type="fig" rid="f5">Fig. 5(B)</xref>
counteracts this effect. We calculate
<italic>σ
<sub>r</sub>
</italic>
(
<italic>t</italic>
|
<italic>f
<sub>c</sub>
</italic>
) using the relatively common words that meet the criterion that their average word use 〈
<italic>f
<sub>i</sub>
</italic>
〉 over the entire word history
<italic>T
<sub>i</sub>
</italic>
(using words with lifetime
<italic>T
<sub>i</sub>
</italic>
≥ 10 years) is larger than a threshold
<italic>f
<sub>c</sub>
</italic>
≡ 1/
<italic>Min</italic>
[
<italic>N
<sub>u</sub>
</italic>
(
<italic>t</italic>
)] (see
<xref ref-type="table" rid="t1">Table I</xref>
).</p>
</caption>
<graphic xlink:href="srep00943-f6"></graphic>
</fig>
<fig id="f7">
<label>Figure 7</label>
<caption>
<title>Growth fluctuations of word use scale with the size of the corpora.</title>
<p>(A) Depicted is the quantitative relation in Eq.(8) between
<italic>σ
<sub>r</sub>
</italic>
(
<italic>t</italic>
|
<italic>f
<sub>c</sub>
</italic>
) and the corpus size
<italic>N
<sub>u</sub>
</italic>
(
<italic>t</italic>
|
<italic>f
<sub>c</sub>
</italic>
). We calculate
<italic>σ
<sub>r</sub>
</italic>
(
<italic>t</italic>
|
<italic>f
<sub>c</sub>
</italic>
) using the relatively common words that meet the criterion that their average word use 〈
<italic>f
<sub>i</sub>
</italic>
〉 over the entire word history (using words with lifetime
<italic>T
<sub>i</sub>
</italic>
≥ 10 years) is larger than a threshold
<italic>f
<sub>c</sub>
</italic>
≡ 10/
<italic>Min</italic>
[
<italic>N
<sub>u</sub>
</italic>
(
<italic>t</italic>
)] (see
<xref ref-type="table" rid="t1">Table I</xref>
). We show the language-dependent scaling value
<italic>β</italic>
≈ 0.08–0.35 in each panel. For each language we show the ordinary least squares best-fit
<italic>β</italic>
value with the standard error in parentheses. (B) Summary of
<italic>β</italic>
(
<italic>U
<sub>c</sub>
</italic>
) exponents calculated using a use-threshold
<italic>U
<sub>c</sub>
</italic>
, instead of a frequency threshold
<italic>f
<sub>c</sub>
</italic>
as used in (A). Error bars indicate the standard error in the OLS regression. We perform this additional analysis in order to provide alternative insight into the role of extremely rare words. For increasing
<italic>U
<sub>c</sub>
</italic>
the
<italic>β</italic>
(
<italic>U
<sub>c</sub>
</italic>
) value for each corpus increases from
<italic>β</italic>
≈ 0.05 to
<italic>β</italic>
< 0.25. This language pruning method quantifies the role of new rare words (also including OCR errors, spelling and other orthographic variants), which are the significant components of language volatility.</p>
</caption>
<graphic xlink:href="srep00943-f7"></graphic>
</fig>
<table-wrap position="float" id="t1">
<label>Table 1</label>
<caption>
<title>Summary of the scaling exponents characterizing the Zipf law and the Heaps law. To calculate σ
<italic>
<sub>r</sub>
</italic>
(
<italic>t</italic>
|
<italic>f
<sub>c</sub>
</italic>
) (see
<xref ref-type="fig" rid="f6">Figs. 6</xref>
and
<xref ref-type="fig" rid="f7">7</xref>
) we use only the relatively common words that meet the criterion that their average word use 〈
<italic>f
<sub>i</sub>
</italic>
〉 over the entire word history is larger than a threshold
<italic>f
<sub>c</sub>
</italic>
= 10/
<italic>Min</italic>
[
<italic>N
<sub>u</sub>
</italic>
(
<italic>t</italic>
)] listed in the first column for each corpus. The
<italic>b</italic>
values shown are calculated using all words (
<italic>U
<sub>c</sub>
</italic>
= 0). The “unlimited lexicon” scaling exponent
<italic>α</italic>
<sub>−</sub>
(
<italic>t</italic>
) is calculated for 10
<sup>−8</sup>
<
<italic>f</italic>
< 10
<sup>−6</sup>
and the “kernel lexicon” exponent
<italic>α</italic>
<sub>+</sub>
(
<italic>t</italic>
) is calculated for 10
<sup>−4</sup>
<
<italic>f</italic>
< 10
<sup>−1</sup>
using the maximum likelihood estimator method for each year. The average and standard deviation
<inline-formula id="m11">
<inline-graphic id="d32e1902" xlink:href="srep00943-m11.jpg"></inline-graphic>
</inline-formula>
listed are computed using the
<italic>α</italic>
<sub>+</sub>
(
<italic>t</italic>
) and
<italic>α</italic>
<sub>−</sub>
(
<italic>t</italic>
) values over the 209-year period 1800–2008 (except for Chinese, which is calculated from 1950–2008 data). We show the Zipf scaling exponent calculated as
<italic>ζ</italic>
= 1/(〈α
<sub>+</sub>
〉 −1). The last column indicates the
<italic>β</italic>
scaling exponents from
<xref ref-type="fig" rid="f7">Fig. 7(A)</xref>
.</title>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="bottom">
<tr>
<th align="left" valign="top" charoff="50"> </th>
<th colspan="6" align="center" valign="top" charoff="50">Scaling parameters</th>
</tr>
<tr>
<th align="center" valign="top" charoff="50">Corpus (1-grams)</th>
<th align="center" valign="top" charoff="50">
<italic>Min</italic>
[
<italic>N
<sub>u</sub>
</italic>
(
<italic>t</italic>
)]</th>
<th align="center" valign="top" charoff="50">
<italic>b</italic>
(
<italic>U
<sub>c</sub>
</italic>
= 0)</th>
<th align="center" valign="top" charoff="50">
<italic>α</italic>
<sub>−</sub>
</th>
<th align="center" valign="top" charoff="50">
<italic>α</italic>
<sub>+</sub>
</th>
<th align="center" valign="top" charoff="50">
<italic>ζ</italic>
</th>
<th align="center" valign="top" charoff="50">
<italic>β</italic>
</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center" valign="top" charoff="50">Chinese</td>
<td align="center" valign="top" charoff="50">35, 394</td>
<td align="center" valign="top" charoff="50">0.77 ± 0.02</td>
<td align="center" valign="top" charoff="50">1.49 ± 0.15</td>
<td align="center" valign="top" charoff="50">1.91 ± 0.04</td>
<td align="center" valign="top" charoff="50">1.10 ± 0.05</td>
<td align="center" valign="top" charoff="50">0.20 ± 0.01</td>
</tr>
<tr>
<td align="center" valign="top" charoff="50">English</td>
<td align="center" valign="top" charoff="50">42, 786, 702</td>
<td align="center" valign="top" charoff="50">0.54 ± 0.01</td>
<td align="center" valign="top" charoff="50">1.73 ± 0.05</td>
<td align="center" valign="top" charoff="50">2.04 ± 0.06</td>
<td align="center" valign="top" charoff="50">0.96 ± 0.06</td>
<td align="center" valign="top" charoff="50">0.19 ± 0.01</td>
</tr>
<tr>
<td align="center" valign="top" charoff="50">English fiction</td>
<td align="center" valign="top" charoff="50">13, 184, 111</td>
<td align="center" valign="top" charoff="50">0.49 ± 0.01</td>
<td align="center" valign="top" charoff="50">1.68 ± 0.10</td>
<td align="center" valign="top" charoff="50">1.97 ± 0.04</td>
<td align="center" valign="top" charoff="50">1.03 ± 0.04</td>
<td align="center" valign="top" charoff="50">0.18 ± 0.01</td>
</tr>
<tr>
<td align="center" valign="top" charoff="50">English GB</td>
<td align="center" valign="top" charoff="50">38, 956, 621</td>
<td align="center" valign="top" charoff="50">0.44 ± 0.01</td>
<td align="center" valign="top" charoff="50">1.71 ± 0.07</td>
<td align="center" valign="top" charoff="50">2.02 ± 0.05</td>
<td align="center" valign="top" charoff="50">0.98 ± 0.05</td>
<td align="center" valign="top" charoff="50">0.17 ± 0.01</td>
</tr>
<tr>
<td align="center" valign="top" charoff="50">English US</td>
<td align="center" valign="top" charoff="50">5, 821, 340</td>
<td align="center" valign="top" charoff="50">0.51 ± 0.01</td>
<td align="center" valign="top" charoff="50">1.70 ± 0.08</td>
<td align="center" valign="top" charoff="50">2.03 ± 0.06</td>
<td align="center" valign="top" charoff="50">0.97 ± 0.06</td>
<td align="center" valign="top" charoff="50">0.18 ± 0.01</td>
</tr>
<tr>
<td align="center" valign="top" charoff="50">English 1M</td>
<td align="center" valign="top" charoff="50">42, 778, 968</td>
<td align="center" valign="top" charoff="50">0.53 ± 0.01</td>
<td align="center" valign="top" charoff="50">1.71 ± 0.04</td>
<td align="center" valign="top" charoff="50">2.04 ± 0.06</td>
<td align="center" valign="top" charoff="50">0.96 ± 0.06</td>
<td align="center" valign="top" charoff="50">0.25 ± 0.01</td>
</tr>
<tr>
<td align="center" valign="top" charoff="50">French</td>
<td align="center" valign="top" charoff="50">34, 198, 362</td>
<td align="center" valign="top" charoff="50">0.52 ± 0.01</td>
<td align="center" valign="top" charoff="50">1.69 ± 0.06</td>
<td align="center" valign="top" charoff="50">1.98 ± 0.04</td>
<td align="center" valign="top" charoff="50">1.02 ± 0.04</td>
<td align="center" valign="top" charoff="50">0.26 ± 0.01</td>
</tr>
<tr>
<td align="center" valign="top" charoff="50">German</td>
<td align="center" valign="top" charoff="50">2, 274, 842</td>
<td align="center" valign="top" charoff="50">0.60 ± 0.01</td>
<td align="center" valign="top" charoff="50">1.63 ± 0.16</td>
<td align="center" valign="top" charoff="50">2.02 ± 0.03</td>
<td align="center" valign="top" charoff="50">0.98 ± 0.03</td>
<td align="center" valign="top" charoff="50">0.27 ± 0.01</td>
</tr>
<tr>
<td align="center" valign="top" charoff="50">Hebrew</td>
<td align="center" valign="top" charoff="50">9, 482</td>
<td align="center" valign="top" charoff="50">0.47 ± 0.01</td>
<td align="center" valign="top" charoff="50">1.34 ± 0.09</td>
<td align="center" valign="top" charoff="50">2.06 ± 0.05</td>
<td align="center" valign="top" charoff="50">0.94 ± 0.05</td>
<td align="center" valign="top" charoff="50">0.35 ± 0.01</td>
</tr>
<tr>
<td align="center" valign="top" charoff="50">Russian</td>
<td align="center" valign="top" charoff="50">6, 944, 366</td>
<td align="center" valign="top" charoff="50">0.65 ± 0.01</td>
<td align="center" valign="top" charoff="50">1.55 ± 0.17</td>
<td align="center" valign="top" charoff="50">2.04 ± 0.06</td>
<td align="center" valign="top" charoff="50">0.96 ± 0.06</td>
<td align="center" valign="top" charoff="50">0.08 ± 0.01</td>
</tr>
<tr>
<td align="center" valign="top" charoff="50">Spanish</td>
<td align="center" valign="top" charoff="50">1, 777, 563</td>
<td align="center" valign="top" charoff="50">0.51 ± 0.01</td>
<td align="center" valign="top" charoff="50">1.61 ± 0.15</td>
<td align="center" valign="top" charoff="50">2.07 ± 0.04</td>
<td align="center" valign="top" charoff="50">0.93 ± 0.04</td>
<td align="center" valign="top" charoff="50">0.26 ± 0.01</td>
</tr>
</tbody>
</table>
</table-wrap>
</floats-group>
</pmc>
</record>
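The Heaps' law fits described in Figures 2–4 regress the vocabulary size against the corpus size, N_w ~ [N_u]^b, by ordinary least squares on log-log transformed data. The following sketch illustrates that estimation procedure on synthetic data; it is an illustrative reconstruction, not the authors' code, and the `heaps_exponent` helper and the synthetic N_u, N_w series are assumptions for demonstration only.

```python
import math

def heaps_exponent(corpus_sizes, vocab_sizes):
    """Estimate the Heaps' law exponent b in N_w ~ [N_u]^b by
    ordinary least squares on the log-log transformed data."""
    xs = [math.log10(n) for n in corpus_sizes]
    ys = [math.log10(w) for w in vocab_sizes]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var  # slope of the log-log regression = b

# Synthetic check: a corpus whose vocabulary grows as the square
# root of its size, i.e. b = 0.5 (the regime reported for the
# unpruned English corpora in Table 1)
nu = [10 ** k for k in range(4, 10)]       # corpus sizes N_u
nw = [3.0 * n ** 0.5 for n in nu]          # N_w = 3 * N_u^0.5
print(round(heaps_exponent(nu, nw), 3))    # → 0.5
```

On real data, the b(U_c) curves in Figures 3–4 come from re-running this regression after discarding words with u_i(t) < U_c.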

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Pmc/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000000 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd -nk 000000 | SxmlIndent | more

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Pmc
   |étape=   Curation
   |type=    RBID
   |clé=     PMC:3517984
   |texte=   Languages cool as they expand: Allometric scaling and the decreasing need for new words
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Curation/RBID.i   -Sk "pubmed:23230508" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a OcrV1 

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024
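Figures 6 and 7 track σ_r(t|f_c), the standard deviation of the logarithmic growth rates r_i(t) = log[f_i(t)/f_i(t−1)] over words above a frequency threshold. A minimal sketch of that calculation, assuming per-year frequency dictionaries as input; the dictionaries, the `annual_growth_fluctuation` helper, and the two-year averaging used for the 〈f_i〉 filter are illustrative simplifications, not the authors' pipeline (which averages over the entire word history).

```python
import math
import statistics

def annual_growth_fluctuation(freqs_prev, freqs_curr, f_c):
    """Standard deviation sigma_r of the logarithmic growth rates
    r_i = log(f_i(t) / f_i(t-1)) across words, keeping only words
    whose average frequency exceeds the threshold f_c."""
    rates = []
    for word, f0 in freqs_prev.items():
        f1 = freqs_curr.get(word)
        if not f1 or f0 <= 0:
            continue
        if (f0 + f1) / 2 >= f_c:  # crude stand-in for the full-history mean <f_i>
            rates.append(math.log(f1 / f0))
    return statistics.stdev(rates)

# Toy two-year snapshot: the rare word falls below f_c and is excluded
prev = {"the": 0.050, "of": 0.030, "zipf": 1e-6}
curr = {"the": 0.055, "of": 0.027, "zipf": 2e-6}
print(round(annual_growth_fluctuation(prev, curr, f_c=1e-4), 3))  # → 0.142
```

Repeating this for every year t and regressing log σ_r(t|f_c) against log N_u(t|f_c) yields the scaling exponents β reported in Figure 7 and Table 1.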