OCR Exploration Server

Warning: this site is under development!
Warning: this site is generated automatically from raw corpora.
The information is therefore not validated.

Semi-Supervised Morphosyntactic Classification of Old Icelandic

Internal identifier: 000180 (Pmc/Curation); previous: 000179; next: 000181


Authors: Kryztof Urban [United States]; Timothy R. Tangherlini [United States]; Aurelijus Vijūnas [Republic of China]; Peter M. Broadwell [United States]

Source:

RBID: PMC:4100772

Abstract

We present IceMorph, a semi-supervised morphosyntactic analyzer of Old Icelandic. In addition to machine-read corpora and dictionaries, it applies a small set of declension prototypes to map corpus words to dictionary entries. A web-based GUI allows expert users to modify and augment data through an online process. A machine learning module incorporates prototype data, edit-distance metrics, and expert feedback to continuously update part-of-speech and morphosyntactic classification. An advantage of the analyzer is its ability to achieve competitive classification accuracy with minimum training data.


Url:
DOI: 10.1371/journal.pone.0102366
PubMed: 25029462
PubMed Central: 4100772

Links to previous steps (curation, corpus...)


Links to Exploration step

PMC:4100772

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Semi-Supervised Morphosyntactic Classification of Old Icelandic</title>
<author>
<name sortKey="Urban, Kryztof" sort="Urban, Kryztof" uniqKey="Urban K" first="Kryztof" last="Urban">Kryztof Urban</name>
<affiliation wicri:level="1">
<nlm:aff id="aff1">
<addr-line>The Scandinavian Section, University of California Los Angeles, Los Angeles, California, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>The Scandinavian Section, University of California Los Angeles, Los Angeles, California</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Tangherlini, Timothy R" sort="Tangherlini, Timothy R" uniqKey="Tangherlini T" first="Timothy R." last="Tangherlini">Timothy R. Tangherlini</name>
<affiliation wicri:level="1">
<nlm:aff id="aff1">
<addr-line>The Scandinavian Section, University of California Los Angeles, Los Angeles, California, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>The Scandinavian Section, University of California Los Angeles, Los Angeles, California</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Vij Nas, Aurelijus" sort="Vij Nas, Aurelijus" uniqKey="Vij Nas A" first="Aurelijus" last="Vij Nas">Aurelijus Vij Nas</name>
<affiliation wicri:level="1">
<nlm:aff id="aff2">
<addr-line>Department of English, National Kaohsiung Normal University, Kaohsiung, Republic of China</addr-line>
</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Department of English, National Kaohsiung Normal University, Kaohsiung</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Broadwell, Peter M" sort="Broadwell, Peter M" uniqKey="Broadwell P" first="Peter M." last="Broadwell">Peter M. Broadwell</name>
<affiliation wicri:level="1">
<nlm:aff id="aff3">
<addr-line>The University Library, University of California Los Angeles, Los Angeles, California, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>The University Library, University of California Los Angeles, Los Angeles, California</wicri:regionArea>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">25029462</idno>
<idno type="pmc">4100772</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4100772</idno>
<idno type="RBID">PMC:4100772</idno>
<idno type="doi">10.1371/journal.pone.0102366</idno>
<date when="2014">2014</date>
<idno type="wicri:Area/Pmc/Corpus">000180</idno>
<idno type="wicri:Area/Pmc/Curation">000180</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Semi-Supervised Morphosyntactic Classification of Old Icelandic</title>
<author>
<name sortKey="Urban, Kryztof" sort="Urban, Kryztof" uniqKey="Urban K" first="Kryztof" last="Urban">Kryztof Urban</name>
<affiliation wicri:level="1">
<nlm:aff id="aff1">
<addr-line>The Scandinavian Section, University of California Los Angeles, Los Angeles, California, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>The Scandinavian Section, University of California Los Angeles, Los Angeles, California</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Tangherlini, Timothy R" sort="Tangherlini, Timothy R" uniqKey="Tangherlini T" first="Timothy R." last="Tangherlini">Timothy R. Tangherlini</name>
<affiliation wicri:level="1">
<nlm:aff id="aff1">
<addr-line>The Scandinavian Section, University of California Los Angeles, Los Angeles, California, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>The Scandinavian Section, University of California Los Angeles, Los Angeles, California</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Vij Nas, Aurelijus" sort="Vij Nas, Aurelijus" uniqKey="Vij Nas A" first="Aurelijus" last="Vij Nas">Aurelijus Vij Nas</name>
<affiliation wicri:level="1">
<nlm:aff id="aff2">
<addr-line>Department of English, National Kaohsiung Normal University, Kaohsiung, Republic of China</addr-line>
</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Department of English, National Kaohsiung Normal University, Kaohsiung</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Broadwell, Peter M" sort="Broadwell, Peter M" uniqKey="Broadwell P" first="Peter M." last="Broadwell">Peter M. Broadwell</name>
<affiliation wicri:level="1">
<nlm:aff id="aff3">
<addr-line>The University Library, University of California Los Angeles, Los Angeles, California, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>The University Library, University of California Los Angeles, Los Angeles, California</wicri:regionArea>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PLoS ONE</title>
<idno type="eISSN">1932-6203</idno>
<imprint>
<date when="2014">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>We present IceMorph, a semi-supervised morphosyntactic analyzer of Old Icelandic. In addition to machine-read corpora and dictionaries, it applies a small set of declension prototypes to map corpus words to dictionary entries. A web-based GUI allows expert users to modify and augment data through an online process. A machine learning module incorporates prototype data, edit-distance metrics, and expert feedback to continuously update part-of-speech and morphosyntactic classification. An advantage of the analyzer is its ability to achieve competitive classification accuracy with minimum training data.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cucerzan, S" uniqKey="Cucerzan S">S Cucerzan</name>
</author>
<author>
<name sortKey="Yarowsky, D" uniqKey="Yarowsky D">D Yarowsky</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Forsberg, M" uniqKey="Forsberg M">M Forsberg</name>
</author>
<author>
<name sortKey="Ranta, A" uniqKey="Ranta A">A Ranta</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ranta, A" uniqKey="Ranta A">A Ranta</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wagner, Ra" uniqKey="Wagner R">RA Wagner</name>
</author>
<author>
<name sortKey="Fischer, Mj" uniqKey="Fischer M">MJ Fischer</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Goldwater, S" uniqKey="Goldwater S">S Goldwater</name>
</author>
<author>
<name sortKey="Griffiths, Tl" uniqKey="Griffiths T">TL Griffiths</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Borin, L" uniqKey="Borin L">L Borin</name>
</author>
<author>
<name sortKey="Forsberg, M" uniqKey="Forsberg M">M Forsberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Loftsson, H" uniqKey="Loftsson H">H Loftsson</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Toutanova, K" uniqKey="Toutanova K">K Toutanova</name>
</author>
<author>
<name sortKey="Johnson, M" uniqKey="Johnson M">M Johnson</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lafferty, J" uniqKey="Lafferty J">J Lafferty</name>
</author>
<author>
<name sortKey="Mccallum, A" uniqKey="Mccallum A">A McCallum</name>
</author>
<author>
<name sortKey="Pereira, F" uniqKey="Pereira F">F Pereira</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Clark, S" uniqKey="Clark S">S Clark</name>
</author>
<author>
<name sortKey="Curran, Jr" uniqKey="Curran J">JR Curran</name>
</author>
<author>
<name sortKey="Osborne, M" uniqKey="Osborne M">M Osborne</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Loftsson, H" uniqKey="Loftsson H">H Loftsson</name>
</author>
<author>
<name sortKey="Helgad Ttir, S" uniqKey="Helgad Ttir S">S Helgadóttir</name>
</author>
<author>
<name sortKey="Rognvaldsson, E" uniqKey="Rognvaldsson E">E Rögnvaldsson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schmid, H" uniqKey="Schmid H">H Schmid</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ratnaparkhi, A" uniqKey="Ratnaparkhi A">A Ratnaparkhi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chatzis, Sp" uniqKey="Chatzis S">SP Chatzis</name>
</author>
<author>
<name sortKey="Demiris, Y" uniqKey="Demiris Y">Y Demiris</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Renooij, S" uniqKey="Renooij S">S Renooij</name>
</author>
<author>
<name sortKey="Van Der Gaag, Lc" uniqKey="Van Der Gaag L">LC Van Der Gaag</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, P" uniqKey="Liu P">P Liu</name>
</author>
<author>
<name sortKey="Lei, L" uniqKey="Lei L">L Lei</name>
</author>
<author>
<name sortKey="Wu, N" uniqKey="Wu N">N Wu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rish, I" uniqKey="Rish I">I Rish</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rabiner, L" uniqKey="Rabiner L">L Rabiner</name>
</author>
<author>
<name sortKey="Juang, Bh" uniqKey="Juang B">BH Juang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Forney, G" uniqKey="Forney G">G Forney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tataru, P" uniqKey="Tataru P">P Tataru</name>
</author>
<author>
<name sortKey="Sand, A" uniqKey="Sand A">A Sand</name>
</author>
<author>
<name sortKey="Hobolth, A" uniqKey="Hobolth A">A Hobolth</name>
</author>
<author>
<name sortKey="Mailund, T" uniqKey="Mailund T">T Mailund</name>
</author>
<author>
<name sortKey="Pedersen, Cns" uniqKey="Pedersen C">CNS Pedersen</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ringger, E" uniqKey="Ringger E">E Ringger</name>
</author>
<author>
<name sortKey="Mcclanahan, P" uniqKey="Mcclanahan P">P McClanahan</name>
</author>
<author>
<name sortKey="Haertel, R" uniqKey="Haertel R">R Haertel</name>
</author>
<author>
<name sortKey="Busby, G" uniqKey="Busby G">G Busby</name>
</author>
<author>
<name sortKey="Carmen, M" uniqKey="Carmen M">M Carmen</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">PLoS One</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS ONE</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">plosone</journal-id>
<journal-title-group>
<journal-title>PLoS ONE</journal-title>
</journal-title-group>
<issn pub-type="epub">1932-6203</issn>
<publisher>
<publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">25029462</article-id>
<article-id pub-id-type="pmc">4100772</article-id>
<article-id pub-id-type="publisher-id">PONE-D-13-45341</article-id>
<article-id pub-id-type="doi">10.1371/journal.pone.0102366</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
<subj-group subj-group-type="Discipline-v2">
<subject>Computer and Information Sciences</subject>
<subj-group>
<subject>Information Technology</subject>
<subj-group>
<subject>Data Mining</subject>
<subject>Text Mining</subject>
</subj-group>
</subj-group>
<subj-group>
<subject>Software Engineering</subject>
<subj-group>
<subject>Software Tools</subject>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v2">
<subject>Physical Sciences</subject>
<subj-group>
<subject>Mathematics</subject>
<subj-group>
<subject>Applied Mathematics</subject>
<subj-group>
<subject>Algorithms</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v2">
<subject>Social Sciences</subject>
<subj-group>
<subject>Linguistics</subject>
<subj-group>
<subject>Computational Linguistics</subject>
<subject>Historical Linguistics</subject>
<subject>Linguistic Morphology</subject>
</subj-group>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Semi-Supervised Morphosyntactic Classification of Old Icelandic</article-title>
<alt-title alt-title-type="running-head">Morphosyntactic Classification of Old Icelandic</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Urban</surname>
<given-names>Kryztof</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Tangherlini</surname>
<given-names>Timothy R.</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="cor1">
<sup>*</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Vijūnas</surname>
<given-names>Aurelijus</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Broadwell</surname>
<given-names>Peter M.</given-names>
</name>
<xref ref-type="aff" rid="aff3">
<sup>3</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff1">
<label>1</label>
<addr-line>The Scandinavian Section, University of California Los Angeles, Los Angeles, California, United States of America</addr-line>
</aff>
<aff id="aff2">
<label>2</label>
<addr-line>Department of English, National Kaohsiung Normal University, Kaohsiung, Republic of China</addr-line>
</aff>
<aff id="aff3">
<label>3</label>
<addr-line>The University Library, University of California Los Angeles, Los Angeles, California, United States of America</addr-line>
</aff>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Aronoff</surname>
<given-names>Mark</given-names>
</name>
<role>Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1">
<addr-line>Stony Brook University, United States of America</addr-line>
</aff>
<author-notes>
<corresp id="cor1">* E-mail:
<email>tango@humnet.ucla.edu</email>
</corresp>
<fn fn-type="conflict">
<p>
<bold>Competing Interests: </bold>
The authors have declared that no competing interests exist.</p>
</fn>
<fn fn-type="con">
<p>Conceived and designed the experiments: KU TRT AV PB. Performed the experiments: KU TRT AV PB. Analyzed the data: KU TRT AV PB. Contributed reagents/materials/analysis tools: KU TRT AV PB. Wrote the paper: KU TRT AV PB.</p>
</fn>
</author-notes>
<pub-date pub-type="collection">
<year>2014</year>
</pub-date>
<pub-date pub-type="epub">
<day>16</day>
<month>7</month>
<year>2014</year>
</pub-date>
<volume>9</volume>
<issue>7</issue>
<elocation-id>e102366</elocation-id>
<history>
<date date-type="received">
<day>10</day>
<month>10</month>
<year>2013</year>
</date>
<date date-type="accepted">
<day>18</day>
<month>6</month>
<year>2014</year>
</date>
</history>
<permissions>
<copyright-year>2014</copyright-year>
<copyright-holder>Urban et al</copyright-holder>
<license>
<license-p>This is an open-access article distributed under the terms of the
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License</ext-link>
, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</license-p>
</license>
</permissions>
<abstract>
<p>We present IceMorph, a semi-supervised morphosyntactic analyzer of Old Icelandic. In addition to machine-read corpora and dictionaries, it applies a small set of declension prototypes to map corpus words to dictionary entries. A web-based GUI allows expert users to modify and augment data through an online process. A machine learning module incorporates prototype data, edit-distance metrics, and expert feedback to continuously update part-of-speech and morphosyntactic classification. An advantage of the analyzer is its ability to achieve competitive classification accuracy with minimum training data.</p>
</abstract>
<funding-group>
<funding-statement>Funding for this project was provided through National Science Foundation (NSF) #BCS-0921123; NSF #IIS-0122491/EU IST2001-32745; with additional support from UCLA's Center for Medieval and Renaissance Studies; the UCLA Council on Research; and the UCLA Office of the Vice Chancellor for Research. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</funding-statement>
</funding-group>
<counts>
<page-count count="8"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec id="s1">
<title>Introduction</title>
<p>IceMorph
<xref rid="pone.0102366-Icemorph1" ref-type="bibr">[1]</xref>
is a semi-supervised part-of-speech (POS) and morphosyntactic (MS) tagger for Old Icelandic. Old Icelandic is a difficult language to tag for morphosyntactic features given its inflectional and morphonological complexity. IceMorph is designed to achieve competitive classification accuracy using a minimum of cleanly tagged training data, and to allow for continuous online retraining.</p>
<p>The IceMorph system consists of a number of interacting modules, including a Perl machine parser for Old Icelandic dictionaries, a prototype-based inflection generator coded in Haskell based on similar tools used in Functional Morphology
<xref rid="pone.0102366-Forsberg1" ref-type="bibr">[11]</xref>
,
<xref rid="pone.0102366-Ranta1" ref-type="bibr">[12]</xref>
,
<xref rid="pone.0102366-Borin1" ref-type="bibr">[22]</xref>
, an edit distance classifier, a website to collect feedback from human experts, and a context-based machine learning algorithm for grammatical disambiguation. We hypothesize that this multi-pronged approach can offer better outcomes than any one of the approaches alone to the vexing problem of morphological analysis in Old Icelandic. Although this may seem to be an obvious solution for the problem of POS and MS tagging in a language that not only has a complex morphology but also for which there is a paucity of clean training data and a noisy target corpus, we have not encountered similar multi-pronged approaches to this problem for Old Icelandic.</p>
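<p>The edit-distance classifier is not detailed in this excerpt, but the bibliography cites Wagner and Fischer, whose dynamic program is the standard choice. The sketch below is our illustration in Python (the paper's own tooling is Perl and Haskell); the example word pair is invented to echo the -sk/-st mediopassive variation discussed under Methods.</p>

```python
def edit_distance(a: str, b: str) -> int:
    """Wagner-Fischer dynamic program: minimum number of single-character
    insertions, deletions, and substitutions transforming a into b."""
    m, n = len(a), len(b)
    # prev[j] holds the distance between a[:i-1] and b[:j]; one row at a time.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion from a
                          curr[j - 1] + 1,     # insertion into a
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[n]

# Orthographic variants are often a single edit apart, e.g. a mediopassive
# spelled -sk in Cleasby-Vigfusson but -st in Zoëga (illustrative forms):
print(edit_distance("kallask", "kallast"))  # 1
```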
<p>For the machine learning component, we rely on a Hidden Markov Model (HMM) classifier that makes use of the restricted Viterbi algorithm, and retrain from expert input as opposed to co-training
<xref rid="pone.0102366-Clark1" ref-type="bibr">[28]</xref>
. Although recent work on sequential tagging has returned excellent results with Conditional Random Fields (CRF)
<xref rid="pone.0102366-Lafferty1" ref-type="bibr">[27]</xref>
, because of problems associated with Old Icelandic's inflectional complexity and the very limited scope of our training data, the CRF we implemented returned sub-optimal results. Instead, our results show that the multi-pronged approach we describe, despite a very small and noisy training set, can achieve competitive classification (96.84% on the POS task, and 84.21% on the MS task).</p>
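<p>The restricted Viterbi algorithm is named but not specified in this excerpt. One plausible reading, sketched below, restricts the search at each token to a candidate tag set (for instance, the tags proposed by the dictionary/prototype stage) rather than the full tagset. All tags, tokens, and probabilities here are invented for illustration, not taken from the IceMorph training data.</p>

```python
import math

def viterbi_restricted(tokens, cands, start_p, trans_p, emit_p):
    """Viterbi decoding in log space, where position i only considers the
    tags in cands[i] instead of the whole tagset."""
    # best[t] = (log-probability of the best path ending in tag t, backpointer)
    best = {t: (math.log(start_p[t] * emit_p[t][tokens[0]]), None)
            for t in cands[0]}
    history = [best]
    for i in range(1, len(tokens)):
        layer = {}
        for t in cands[i]:
            score, back = max(
                (history[-1][s][0]
                 + math.log(trans_p[s][t] * emit_p[t][tokens[i]]), s)
                for s in cands[i - 1])
            layer[t] = (score, back)
        history.append(layer)
    # Trace the best final tag back through the stored backpointers.
    path = [max(history[-1], key=lambda t: history[-1][t][0])]
    for layer in reversed(history[1:]):
        path.append(layer[path[-1]][1])
    return path[::-1]

# Toy two-tag model (N = noun, V = verb) with invented probabilities:
start = {"N": 0.6, "V": 0.4}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"konungr": 0.9, "mælti": 0.1},
        "V": {"konungr": 0.1, "mælti": 0.9}}
print(viterbi_restricted(["konungr", "mælti"],
                         [{"N"}, {"N", "V"}], start, trans, emit))  # ['N', 'V']
```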
<p>We took inspiration for IceMorph from a number of sources. Several tools exist for morphosyntactic tagging of Modern Icelandic; for instance
<xref rid="pone.0102366-Rgnvaldsson1" ref-type="bibr">[21]</xref>
achieves 91.18% accuracy by applying a TnT tagger trained on an extensive corpus of Old Icelandic texts orthographically and grammatically normalized to Modern Icelandic. Another approach is IceTagger
<xref rid="pone.0102366-Loftsson1" ref-type="bibr">[23]</xref>
, a rule-based POS tagger for Modern Icelandic that achieves a 91.54% accuracy rate on a POS classification task. There are also a large number of semi-supervised Bayesian POS taggers such as
<xref rid="pone.0102366-Feldman1" ref-type="bibr">[24]</xref>
,
<xref rid="pone.0102366-Toutanova1" ref-type="bibr">[25]</xref>
, with
<xref rid="pone.0102366-Feldman1" ref-type="bibr">[24]</xref>
reporting an accuracy of 79.7% on an MS classification task, and
<xref rid="pone.0102366-Toutanova1" ref-type="bibr">[25]</xref>
reporting 93.4% accuracy on a POS task. However, all of the existing approaches require either a set of manually crafted rules or fairly extensive training sets. Importantly, the approaches for Icelandic described elsewhere
<xref rid="pone.0102366-Rgnvaldsson1" ref-type="bibr">[21]</xref>
,
<xref rid="pone.0102366-Loftsson1" ref-type="bibr">[23]</xref>
,
<xref rid="pone.0102366-Loftsson2" ref-type="bibr">[29]</xref>
are all tuned for Modern Icelandic, a space in which relatively large, clean training data exist. A philosophical underpinning of IceMorph is to provide competitive tagging performance for Old Icelandic utilizing available resources while requiring a minimum of clean input data. For example, our training sets are an order of magnitude smaller than those used in
<xref rid="pone.0102366-Rgnvaldsson1" ref-type="bibr">[21]</xref>
. Consequently, we feel that IceMorph is closely related to projects such as
<xref rid="pone.0102366-Cucerzan1" ref-type="bibr">[5]</xref>
,
<xref rid="pone.0102366-Brill1" ref-type="bibr">[6]</xref>
,
<xref rid="pone.0102366-Loftsson2" ref-type="bibr">[29]</xref>
which make use of language tools to reduce the amount of man-hours required to tag a corpus.
<xref rid="pone.0102366-Cucerzan1" ref-type="bibr">[5]</xref>
reports an accuracy of 93.1% on a Spanish POS task,
<xref rid="pone.0102366-Brill1" ref-type="bibr">[6]</xref>
reports an accuracy of 90.7% on a POS task in English, and
<xref rid="pone.0102366-Loftsson2" ref-type="bibr">[29]</xref>
reports an accuracy of 93.84% on a POS task in Modern Icelandic (
<xref ref-type="table" rid="pone-0102366-t001">Table 1</xref>
).</p>
<table-wrap id="pone-0102366-t001" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0102366.t001</object-id>
<label>Table 1</label>
<caption>
<title>Accuracies for different POS/MS taggers with commonalities to IceMorph.</title>
</caption>
<alternatives>
<graphic id="pone-0102366-t001-1" xlink:href="pone.0102366.t001"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">Approach</td>
<td align="left" rowspan="1" colspan="1">POS classification</td>
<td align="left" rowspan="1" colspan="1">MS classification</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">IceMorph HMM-rV (Expert/Gold)</td>
<td align="left" rowspan="1" colspan="1">96.84%/73.16%</td>
<td align="left" rowspan="1" colspan="1">84.21%/54.86%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Loftsson
<xref rid="pone.0102366-Loftsson2" ref-type="bibr">[29]</xref>
</td>
<td align="left" rowspan="1" colspan="1">93.84%</td>
<td align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Cucerzan & Yarowsky
<xref rid="pone.0102366-Cucerzan1" ref-type="bibr">[5]</xref>
</td>
<td align="left" rowspan="1" colspan="1">93.1% (Sp)/89.2% (Ro)</td>
<td align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Rögnvaldsson TnT
<xref rid="pone.0102366-Rgnvaldsson1" ref-type="bibr">[21]</xref>
</td>
<td align="left" rowspan="1" colspan="1">91.8%</td>
<td align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Loftsson IceTagger
<xref rid="pone.0102366-Loftsson1" ref-type="bibr">[23]</xref>
</td>
<td align="left" rowspan="1" colspan="1">91.54%</td>
<td align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Brill & Marcus
<xref rid="pone.0102366-Brill1" ref-type="bibr">[6]</xref>
</td>
<td align="left" rowspan="1" colspan="1">90.7%</td>
<td align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Feldman & Hana
<xref rid="pone.0102366-Feldman1" ref-type="bibr">[24]</xref>
</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">79.7%</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="nt101">
<label></label>
<p>For comparison, the accuracy of the IceMorph HMM-rV tagger is presented in the first row. Our measures of accuracy reflect the use of two distinct sets of tagged data. The first set (called EXPERT) contains longer sequences of training data and thus reflects more accurately IceMorph's performance when trained with a rich data set, and is also more comparable to the training data used in these comparison studies.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
<sec sec-type="methods" id="s2">
<title>Methods</title>
<sec id="s2a">
<title>System architecture</title>
<p>IceMorph consists of a collection of modules designed to streamline the creation, maintenance, and analysis of input data as well as the prediction of POS and morphosyntactic (MS) classes for previously unseen words. It can be conceptualized as consisting of two separate systems. The first system produces an initial set of tags for each corpus instance, providing broad coverage (>98%) with sub-optimal accuracy. The second system refines the initial set of tags by continuously directing novel expert feedback into a machine learning algorithm.</p>
<p>
<xref ref-type="fig" rid="pone-0102366-g001">Figures 1</xref>
and
<xref ref-type="fig" rid="pone-0102366-g002">2</xref>
depict the general layout of IceMorph. In the following paragraphs, each module is described in more detail.</p>
<fig id="pone-0102366-g001" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0102366.g001</object-id>
<label>Figure 1</label>
<caption>
<title>Creation of a base tagged corpus within IceMorph using various data sources.</title>
<p>Dictionaries and corpora are machine parsed and inserted into a relational database. Declension prototypes are created by an expert via a functional programming language using readily available Old Icelandic grammars. Each dictionary lemma is mapped to corresponding declension prototypes to yield multiple declension paradigms. Finally, each corpus instance is compared to the list of inflected lemmata to produce the base tagged corpus.</p>
</caption>
<graphic xlink:href="pone.0102366.g001"></graphic>
</fig>
<fig id="pone-0102366-g002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0102366.g002</object-id>
<label>Figure 2</label>
<caption>
<title>Integration of expert feedback to continuously improve POS and morphosyntactic tagging.</title>
<p>Human experts update and enrich the existing base tagged corpus via a website interface. A machine learning algorithm continuously updates its tagging performance based on new expert input.</p>
</caption>
<graphic xlink:href="pone.0102366.g002"></graphic>
</fig>
</sec>
<sec id="s2b">
<title>Dictionaries</title>
<p>IceMorph currently uses two standard dictionaries of Old Icelandic for basic lexical and grammatical information: Cleasby-Vigfusson
<xref rid="pone.0102366-Cleasby1" ref-type="bibr">[3]</xref>
(including the Lexicon Poeticum) and Zoëga
<xref rid="pone.0102366-Zoga1" ref-type="bibr">[4]</xref>
. The dictionaries were gathered from online sources
<xref rid="pone.0102366-A1" ref-type="bibr">[7]</xref>
,
<xref rid="pone.0102366-A2" ref-type="bibr">[8]</xref>
,
<xref rid="pone.0102366-An1" ref-type="bibr">[9]</xref>
or transformed into electronic text using optical character recognition. Each dictionary entry was machine parsed and, where necessary, normalized into standard Old Icelandic orthography using the widely accepted
<italic>Íslenzk fornrit</italic>
orthographical conventions
<xref rid="pone.0102366-slenzk1" ref-type="bibr">[10]</xref>
.</p>
<p>Each of the two dictionaries features approximately 27,000 entries with 42% overlap in headwords. We considered Fritzner
<xref rid="pone.0102366-Fritzner1" ref-type="bibr">[2]</xref>
as an additional resource because it contains considerably more unique lemmata compared to Cleasby-Vigfusson or Zoëga. However, its lack of morphosyntactic detail in its entries led us to disregard it for the purposes of this study.</p>
<p>We encountered a number of issues during this initial data preparation phase that can be classified into three problem areas as follows:</p>
<p>(1)
<bold>OCR errors and other inconsistencies in underlying data:</bold>
Although OCR errors are to be expected, we have uncovered both errors and inconsistencies in each of the underlying dictionaries. We corrected a number of those errors to reduce their influence on other modules of the IceMorph system.</p>
<p>For instance, while Zoëga differentiates between
<italic>ø</italic>
and
<italic>ö</italic>
,
<italic>æ</italic>
and
<italic>œ</italic>
, and uses -
<italic>st</italic>
for the mediopassive forms, Cleasby-Vigfusson only uses
<italic>æ</italic>
,
<italic>ö</italic>
, and -
<italic>sk</italic>
. Related characters (e.g.
<italic>i</italic>
and
<italic>í</italic>
) were often interpreted incorrectly by our OCR software.</p>
<p>(2)
<bold>Disagreement between sources:</bold>
not all sources agree on the classification of individual lemmata. For instance, Cleasby-Vigfusson defines
<italic>báðir</italic>
as a dual adjectival pronoun (adj. pron. dual), while Zoëga lists it simply as an adjective, but considers its dual form
<italic>bæði</italic>
as a conjunction. We relied on
<xref rid="pone.0102366-Ordbog1" ref-type="bibr">[41]</xref>
to mediate these differences.</p>
<p>(3)
<bold>Inconsistencies in the use of morphosyntactic information:</bold>
We relied heavily on morphosyntactic clues present in the dictionaries to determine the class of a given verb or noun. However, the same morphosyntactic notation was often used within a single dictionary to describe lemmata belonging to different classes.</p>
<p>On the other hand, morphosyntactic elements of irregular forms often had unique patterns that also affected classification negatively. For instance:</p>
<p>
<bold>faðir (gen., dat. and acc. föður, pl. feðr)</bold>
, m.
<italic>father</italic>
.</p>
<p>
<bold>feðr</bold>
, m.
<italic>father</italic>
,  = faðir.</p>
<p>The pattern [LEMMA]+“, m.” + [TRANSLATION] usually signals masculine a-class nouns in Zoëga, so our machine parser defined a lemma
<italic>feðr</italic>
. The same dictionary contains an additional entry for
<italic>faðir</italic>
with a unique morphosyntactic structure. In this case, the machine parser was unable to categorize the lemma.</p>
<p>In a final step, we aligned our various dictionary sources to produce a single uniform multi-dictionary relational database. Ambiguous or overlapping entries were identified using simple SQL queries, and the limited number of problematic entries were subsequently corrected by hand. Our current merged dictionary contains 48,973 lemmata. While this dictionary covers most words found in the Old Icelandic prose corpus, its coverage of compounds, names, and archaic words is less comprehensive. Each lemma is associated with at least one source entry in the dictionaries.
<xref ref-type="table" rid="pone-0102366-t002">Table 2</xref>
shows a sample source entry for lemma
<italic>afdrykkja</italic>
.</p>
<table-wrap id="pone-0102366-t002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0102366.t002</object-id>
<label>Table 2</label>
<caption>
<title>Sample source entry for lemma
<italic>afdrykkja</italic>
.</title>
</caption>
<alternatives>
<graphic id="pone-0102366-t002-2" xlink:href="pone.0102366.t002"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">
<bold>LEMMA</bold>
</td>
<td align="left" rowspan="1" colspan="1">afdrykkja</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<bold>COMPOUND (IF EXISTS)</bold>
</td>
<td align="left" rowspan="1" colspan="1">af-drykkja</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<bold>PART OF SPEECH</bold>
</td>
<td align="left" rowspan="1" colspan="1">noun</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<bold>CLASS (IF EXISTS)</bold>
</td>
<td align="left" rowspan="1" colspan="1">feminine –ijo:n</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<bold>DEFINITION/TRANSLATION</bold>
</td>
<td align="left" rowspan="1" colspan="1">u, f. over-drinking, drunkenness, = ofdrykkja [af- intens.]</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<bold>SEMANTIC EQUIVALENCES</bold>
</td>
<td align="left" rowspan="1" colspan="1"> = ofdrykkja</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="nt102">
<label></label>
<p>Each lemma may contain a separate source entry for each dictionary source. Different source entries are linked through semantic equivalence.</p>
</fn>
</table-wrap-foot>
</table-wrap>
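<p>The SQL-based discovery of ambiguous or overlapping entries described above can be sketched with a toy example. The schema and sample rows below are invented for illustration; the article does not specify the actual IceMorph database layout.</p>

```python
import sqlite3

# Hypothetical miniature schema for the merged dictionary; the real
# IceMorph database layout is not specified in the article.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_entries (lemma TEXT, pos TEXT, source TEXT)")
conn.executemany(
    "INSERT INTO source_entries VALUES (?, ?, ?)",
    [("afdrykkja", "noun", "Zoega"),
     ("afdrykkja", "noun", "Cleasby-Vigfusson"),
     ("badir", "adjective", "Zoega"),
     ("badir", "pronoun", "Cleasby-Vigfusson")])

# In the spirit of the queries described in the text: flag lemmata
# whose source dictionaries disagree on part of speech.
rows = conn.execute("""
    SELECT lemma, COUNT(DISTINCT pos) AS n_pos
    FROM source_entries
    GROUP BY lemma
    HAVING COUNT(DISTINCT pos) > 1
    ORDER BY lemma""").fetchall()
print(rows)  # lemmata flagged for manual correction
```

<p>Entries flagged this way correspond to the hand-corrected cases mentioned above, such as the <italic>báðir</italic> disagreement between Cleasby-Vigfusson and Zoëga.</p>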
<p>Both Cleasby-Vigfusson and Zoëga contain numerous definitions referring to other lemmata, typically using symbols such as “ = ” or “cf”. For instance:</p>
<p>
<bold>œði-vindr (noun m_a)</bold>
 = -veðr</p>
<p>
<bold>œði-veðr (noun n_a)</bold>
 = -stormr</p>
<p>
<bold>œði-stormr (noun m_a)</bold>
 = furious gale</p>
<p>We capture these semantic associations between lemmata in our source entry definitions (see
<xref ref-type="table" rid="pone-0102366-t002">Table 2</xref>
for an example). As an aside, both dictionaries contain instances of missing lemmata for a given semantic association, but those instances are fortunately rare.</p>
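<p>Resolving a chain of “ = ” cross-references like the one above amounts to following links transitively until a lemma with an actual gloss is reached. The sketch below is a minimal illustration; the data structures are hypothetical, and the guard covers both circular references and the rare missing-lemma cases.</p>

```python
# Hypothetical representation of the " = " cross-references described
# in the text: follow a chain of semantic equivalences until a lemma
# with a real gloss is reached.
equivalences = {
    "œði-vindr": "œði-veðr",
    "œði-veðr": "œði-stormr",
}
glosses = {"œði-stormr": "furious gale"}

def resolve(lemma, seen=None):
    """Follow ' = ' links transitively, guarding against cycles and
    missing lemmata."""
    seen = set() if seen is None else seen
    if lemma in glosses:
        return glosses[lemma]
    if lemma in seen or lemma not in equivalences:
        return None  # missing lemma or circular reference
    seen.add(lemma)
    return resolve(equivalences[lemma], seen)

print(resolve("œði-vindr"))  # furious gale
```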
</sec>
<sec id="s2c">
<title>Corpora</title>
<p>IceMorph uses the Icelandic Legendary Sagas
<xref rid="pone.0102366-FornaldarsgurNorurlanda1" ref-type="bibr">[13]</xref>
as a target corpus. The corpus spans a total of 357,604 non-unique words and 22,815 unique words.
<xref ref-type="fig" rid="pone-0102366-g003">Figure 3</xref>
illustrates the distribution of unique word frequencies in the corpus. Its shape is consistent with Zipf's law
<xref rid="pone.0102366-Manning1" ref-type="bibr">[26]</xref>
that few words occur with very high frequency. We take advantage of this common property by having human experts correct paradigms of high frequency words. We also take advantage of the fact that many of these high frequency words are conjunctions as well as other words that do not inflect. The effect is a sizeable reduction in the noise related to POS and morphosyntactic information.</p>
<fig id="pone-0102366-g003" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0102366.g003</object-id>
<label>Figure 3</label>
<caption>
<title>Distribution of unique word frequency in the Old Icelandic Legendary Sagas.</title>
<p>As expected, word frequencies in the corpus follow a Zipf-like distribution. IceMorph takes advantage of the fact that relatively few unique words in a corpus tend to occur with high frequency.</p>
</caption>
<graphic xlink:href="pone.0102366.g003"></graphic>
</fig>
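<p>The practical payoff of this distribution is that hand-correcting only the highest-frequency words covers a disproportionate share of all tokens. The sketch below uses an invented token stream to illustrate the point; the real corpus has 357,604 tokens and 22,815 unique words.</p>

```python
from collections import Counter

# Invented token stream illustrating a Zipf-like frequency profile:
# a handful of very frequent words plus a long tail of rare ones.
tokens = ["ok"] * 50 + ["at"] * 30 + ["hann"] * 20 + list("abcdefghij")

counts = Counter(tokens)
total = sum(counts.values())
top3 = counts.most_common(3)
# Correcting just the three most frequent types already covers
# the large majority of all tokens.
coverage = sum(n for _, n in top3) / total
print(f"top-3 types cover {coverage:.0%} of all tokens")
```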
</sec>
<sec id="s2d">
<title>Declension prototyping</title>
<p>IceMorph performs morphosyntactic classification in two steps. First, we create declension prototypes for the most common nouns, verbs, and adjectives, with the objective of generating declension paradigms for words whose inflections contain few or no irregularities. In keeping with the overall methodology of IceMorph, we used readily available Old Icelandic grammars
<xref rid="pone.0102366-Zoga1" ref-type="bibr">[4]</xref>
,
<xref rid="pone.0102366-Gordon1" ref-type="bibr">[14]</xref>
to produce those paradigms.</p>
<p>We integrated the declension paradigms into the system using the Functional Morphology (FM) approach
<xref rid="pone.0102366-Forsberg1" ref-type="bibr">[11]</xref>
,
<xref rid="pone.0102366-Ranta1" ref-type="bibr">[12]</xref>
,
<xref rid="pone.0102366-Borin1" ref-type="bibr">[22]</xref>
, which represents an intuitive method for implementing natural language morphology in the functional language Haskell
<xref rid="pone.0102366-The1" ref-type="bibr">[15]</xref>
.</p>
<p>The coding of Old Icelandic inflectional rules in FM/Haskell is accessible and easily understood by non-programmers, a necessary development criterion given the general lack of programming expertise among Old Icelandic language specialists. Such coding allowed us to take advantage of a panel of three Old Icelandic language experts who could then check for inaccuracies in the declension prototypes, which would have been impossible if we had used a different method of coding the inflection module. For instance,
<xref ref-type="fig" rid="pone-0102366-g004">Figure 4</xref>
illustrates the implementation of Old Icelandic masculine
<italic>i</italic>
-stem nouns using FM. Using the Old Norse noun “staðr” as its sample, this prototype produces correct or near-correct declension paradigms for most masculine
<italic>i</italic>
-stem nouns in Old Icelandic.</p>
<fig id="pone-0102366-g004" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0102366.g004</object-id>
<label>Figure 4</label>
<caption>
<title>FM implementation of Old Icelandic masculine
<italic>i</italic>
-stem noun.</title>
<p>Each declension entry is defined towards the end of the segment. Functions like ‘u_mutation’ or ‘syncope’ operate on the declension entry in question to execute the desired string manipulation.</p>
</caption>
<graphic xlink:href="pone.0102366.g004"></graphic>
</fig>
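<p>For readers without Haskell, a rough Python analogue of the prototype in Figure 4 is sketched below. The endings and the <italic>u</italic>-mutation helper are simplified for illustration and will not handle every masculine <italic>i</italic>-stem noun correctly; note that the table has the eight entries per noun mentioned below.</p>

```python
# A rough Python analogue of the FM/Haskell prototype in Figure 4.
# The endings and the u-mutation rule are simplified sketches.

def u_mutation(stem):
    """u-mutation: 'a' in the stem becomes 'ö' before an ending in -u."""
    return stem.replace("a", "ö")

def masc_i_stem(stem):
    """Simplified masculine i-stem declension table (8 entries)."""
    return {
        "nom_sg": stem + "r",
        "acc_sg": stem,
        "dat_sg": stem,
        "gen_sg": stem + "ar",
        "nom_pl": stem + "ir",
        "acc_pl": stem + "i",
        "dat_pl": u_mutation(stem) + "um",  # e.g. stað -> stöðum
        "gen_pl": stem + "a",
    }

paradigm = masc_i_stem("stað")
print(paradigm["nom_sg"], paradigm["dat_pl"])  # staðr stöðum
```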
<p>IceMorph has a total of 96 prototypes: 40 noun prototypes covering nine strong and three weak declensions, 55 verb prototypes describing seven strong as well as four weak classes, and one adjective prototype. Each prototype in turn populates declension tables of varying sizes. For instance, noun declension tables consist of eight entries while verb declension tables contain 55 inflectional forms.</p>
<p>Using these declension prototypes, we created inflection paradigms for each lemma in our composite dictionary. Depending on the properties of a lexicon entry, we performed one of the following mappings:</p>
<p>
<italic>Case 1 - known morpho-syntactic classification</italic>
: If the lemma is associated with POS and class information, we generate paradigms for each prototype matching this information. For instance, lemma
<bold>af-runr</bold>
was classified as a masculine
<italic>i</italic>
-stem by the dictionary parser. There are two prototypes for masculine
<italic>i</italic>
-stem nouns, so two inflectional paradigms with a total of sixteen entries were created for this lemma.</p>
<p>
<italic>Case 2 - unknown class</italic>
: If, for a given lemma, the dictionary parser was only able to determine POS but not class, then inflectional paradigms were generated using each prototype of the given POS. In all cases, we were able to determine the gender of nouns and whether a verb was weak or strong. For a strong verb, such as
<bold>antigna</bold>
, we generated 20 inflectional paradigms with a total of 1100 entries.</p>
<p>
<italic>Case 3 - unknown classification</italic>
. For a purely hypothetical case in which neither POS nor class are known, declensions for all prototypes would be generated.</p>
<p>At the end of this process, IceMorph produced approximately one million declension paradigms to which we added closed-class words taken directly from our composite dictionary.</p>
<p>Given the Old Icelandic target corpus and the generated list of inflectional paradigms, we were able to classify each word in the corpus using the Wagner-Fischer edit distance algorithm
<xref rid="pone.0102366-Wagner1" ref-type="bibr">[16]</xref>
. Each unique word in the corpus was compared to the set of declensions and classified as the declension with the smallest edit distance. To reduce computational overhead, we made the following three assumptions:</p>
<list list-type="order">
<list-item>
<p>compound prefixes do not undergo transformations; if a corpus word does not begin with the prefix of a compound word in the dictionary, the pair is skipped</p>
</list-item>
<list-item>
<p>certain Old Icelandic characters must be present in the corpus word if they are present in the lemma, and vice versa</p>
</list-item>
<list-item>
<p>the edit distance cost of transforming a declension instance into a corpus word could not exceed a value of 2</p>
</list-item>
</list>
<p>Furthermore, we used a modified cost schema tailored to the characteristics of Old Icelandic sound changes. For instance, the Old Icelandic character “a” might transform into an “ö” due to a process called
<italic>u</italic>
-mutation, so we reduced the transformation cost for those characters to a value of 0.2 (see
<xref ref-type="table" rid="pone-0102366-t003">Table 3</xref>
for more examples). On the other hand, “e” rarely changes to “ö” in Old Icelandic, so its cost remains fixed at 1. The purpose of these adjusted costs is to make IceMorph less susceptible to errors, such as those generated by optical character recognition, that occur in upstream system components.</p>
<table-wrap id="pone-0102366-t003" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0102366.t003</object-id>
<label>Table 3</label>
<caption>
<title>Examples of edit-distance transformations and their associated cost.</title>
</caption>
<alternatives>
<graphic id="pone-0102366-t003-3" xlink:href="pone.0102366.t003"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">TYPE OF CHANGE</td>
<td align="left" rowspan="1" colspan="1">FROM</td>
<td align="left" rowspan="1" colspan="1">TO</td>
<td align="left" rowspan="1" colspan="1">COST</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">gemination</td>
<td align="left" rowspan="1" colspan="1">E</td>
<td align="left" rowspan="1" colspan="1">t, or r</td>
<td align="left" rowspan="1" colspan="1">0.2</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">simplification</td>
<td align="left" rowspan="1" colspan="1">r, t, or n</td>
<td align="left" rowspan="1" colspan="1">E</td>
<td align="left" rowspan="1" colspan="1">0.2</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">assimilation</td>
<td align="left" rowspan="1" colspan="1">r</td>
<td align="left" rowspan="1" colspan="1">l</td>
<td align="left" rowspan="1" colspan="1">0.2</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">assimilation</td>
<td align="left" rowspan="1" colspan="1">ð</td>
<td align="left" rowspan="1" colspan="1">d, t, or s</td>
<td align="left" rowspan="1" colspan="1">0.2</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">devoicing</td>
<td align="left" rowspan="1" colspan="1">n, or g</td>
<td align="left" rowspan="1" colspan="1">k</td>
<td align="left" rowspan="1" colspan="1">0.2</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">consonant loss</td>
<td align="left" rowspan="1" colspan="1">l, or n</td>
<td align="left" rowspan="1" colspan="1">E</td>
<td align="left" rowspan="1" colspan="1">0.2</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="nt103">
<label></label>
<p>The transformations are specific to Old Icelandic; “E” denotes the empty string, so rows mapping from or to E represent insertions and deletions. Their purpose is to improve classification performance by making the classifier more robust with respect to errors introduced earlier in the IceMorph system, such as OCR errors or differences in spelling convention between words in the corpus and dictionary sources.</p>
</fn>
</table-wrap-foot>
</table-wrap>
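<p>A minimal sketch of the weighted Wagner-Fischer computation described above follows. The cheap-substitution set is a small invented subset in the spirit of Table 3, and the cutoff mirrors the rule that costs above 2 are discarded.</p>

```python
# Sketch of the weighted Wagner-Fischer edit distance described in
# the text. Sound changes common in Old Icelandic (e.g. u-mutation
# a -> ö) are made cheap (0.2); everything else costs 1.
CHEAP = {("a", "ö"), ("ö", "a"), ("ð", "d"), ("ð", "t"), ("ð", "s")}

def sub_cost(a, b):
    if a == b:
        return 0.0
    return 0.2 if (a, b) in CHEAP else 1.0

def edit_distance(src, dst, max_cost=2.0):
    """Dynamic-programming edit distance with adjusted substitution
    costs; returns None when the cost exceeds max_cost, matching the
    pruning rule described in the text."""
    m, n = len(src), len(dst)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,  # deletion
                d[i][j - 1] + 1.0,  # insertion
                d[i - 1][j - 1] + sub_cost(src[i - 1], dst[j - 1]),
            )
    cost = d[m][n]
    return cost if cost <= max_cost else None

# A u-mutated dative plural maps cheaply onto its unmutated stem.
print(edit_distance("landum", "löndum"))  # 0.2
```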
<p>At the end of this process, over 98% of the corpus was tagged for both POS and morphosyntactic class. Although this approach provided broad coverage, we anticipated considerable noise in these tags mainly due to the creation of imperfect declension paradigms. One of the key features of the IceMorph design is to allow expert users to manually correct data. To that end, we developed an online tool
<xref rid="pone.0102366-Icemorph2" ref-type="bibr">[17]</xref>
that enables expert users (currently a committee of three Old Icelandic language experts) to edit and correct any data point. At the time this article was written, our experts had tagged 490 (∼0.14%) corpus words involving 289 (0.59%) dictionary entries.</p>
<p>Language-specific phenomena such as homonymy also lead to ambiguity in classification. Homonymy is common in Old Icelandic. For instance, the corpus word
<bold>menn</bold>
(“men”) could be the Nominative or Accusative Plural of the lemma
<bold>maðr</bold>
. In order to provide correct MS classification for an observed word, we needed to consider its context in the corpus. For example, a classifier is more likely to classify
<bold>menn</bold>
as Accusative Plural if it is preceded by an Accusative Plural pronoun such as
<bold>sína</bold>
. This type of context-sensitive tagging is well described in the literature
<xref rid="pone.0102366-Lafferty1" ref-type="bibr">[27]</xref>
,
<xref rid="pone.0102366-Schmid1" ref-type="bibr">[30]</xref>
,
<xref rid="pone.0102366-Ratnaparkhi1" ref-type="bibr">[31]</xref>
.</p>
<p>The second portion of the IceMorph system is designed to address issues related to context-based morphosyntactic (MS) tagging.</p>
</sec>
<sec id="s2e">
<title>Semi-supervised morphosyntactic (MS) classifiers</title>
<p>IceMorph now has two very different sources of information for POS/MS tagging. On the one hand, there are prototype-generated inflectional paradigms that operate in conjunction with the edit-distance based mapping between corpus words and declension entries. Their coverage is expansive yet very noisy. On the other hand, we have a small set of declensions contributed by our experts.</p>
<p>As
<xref ref-type="table" rid="pone-0102366-t004">Table 4</xref>
shows, expert feedback is considered to be correct by default. On the other end of the spectrum, prototype mappings using edit distance are expected to contain a considerable degree of noise. The two intermediate knowledge sources result from homonyms and multiple occurrences of a word in a given inflection paradigm. The table also reveals an inverse relation between the usefulness of a knowledge source and its coverage of corpus words. We refer to the first three types of feedback as “expert-related”. Combined, they provide considerable corpus coverage (∼67.6%) with relatively low noise levels.</p>
<table-wrap id="pone-0102366-t004" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0102366.t004</object-id>
<label>Table 4</label>
<caption>
<title>Different knowledge sources.</title>
</caption>
<alternatives>
<graphic id="pone-0102366-t004-4" xlink:href="pone.0102366.t004"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">NAME</td>
<td align="left" rowspan="1" colspan="1">SOURCE</td>
<td align="left" rowspan="1" colspan="1">NOTE</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Expert Feedback</td>
<td align="left" rowspan="1" colspan="1">Declension table manually entered by a language expert for a specific word in the corpus and checked for accuracy by a second expert</td>
<td align="left" rowspan="1" colspan="1">Assumed accurate; corpus coverage: ∼0.14%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Unique matches</td>
<td align="left" rowspan="1" colspan="1">Corpus words that match a single expert form</td>
<td align="left" rowspan="1" colspan="1">Likely accurate; corpus coverage: ∼31.9%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Non-unique matches</td>
<td align="left" rowspan="1" colspan="1">Corpus words that match multiple expert forms</td>
<td align="left" rowspan="1" colspan="1">One of the forms likely accurate; corpus coverage: ∼35.6%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Edit-distance mapping</td>
<td align="left" rowspan="1" colspan="1">Corpus words that do not match an expert form; by default they are mapped to one or more prototype forms with the smallest edit-distance between them</td>
<td align="left" rowspan="1" colspan="1">Least likely to be accurate; corpus coverage: ∼31%</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="nt104">
<label></label>
<p>These different knowledge sources are associated with varying degrees of likelihood of providing noise-free data (overall corpus coverage: >98%).</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>Our classification module attempts to improve overall tagging accuracy based on this data. Our strategy was to classify MS tags directly and then infer the corresponding POS tags via simple lookup (for instance, the MS tag
<italic>nom_sg</italic>
uniquely maps to the POS tag
<italic>noun</italic>
). We considered three types of classifiers for this classification task: a dynamic Bayesian network classifier, a Hidden Markov Model (HMM) classifier with maximum likelihood estimation (MLE) using both a default and restricted Viterbi algorithm, and a linear chain Conditional Random Field (CRF) classifier.</p>
<p>For a given event, the
<bold>dynamic Bayesian network classifier</bold>
<xref rid="pone.0102366-Murphy1" ref-type="bibr">[20]</xref>
considers its prior likelihood, as well as its likelihood in the presence of other (presumably independent) features to determine the likelihood of the event itself. The following function picks the feature set yielding maximum likelihood.
<disp-formula id="pone.0102366.e001">
<graphic xlink:href="pone.0102366.e001.jpg" position="anchor" orientation="portrait"></graphic>
<label>(1)</label>
</disp-formula>
In the context of IceMorph, the prior likelihood is the distribution of morphosyntactic tags based on expert feedback as well as unique and non-unique matches. The features chosen are the morphosyntactic tags preceding and following a given corpus word. We then calculate the likelihood of a given morphosyntactic element being associated with that word (
<xref ref-type="table" rid="pone-0102366-t005">Table 5</xref>
).</p>
<table-wrap id="pone-0102366-t005" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0102366.t005</object-id>
<label>Table 5</label>
<caption>
<title>Probabilities for given target words using context feature window size = 3.</title>
</caption>
<alternatives>
<graphic id="pone-0102366-t005-5" xlink:href="pone.0102366.t005"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">LEFT CONTEXT</td>
<td align="left" rowspan="1" colspan="1">TARGET WORD</td>
<td align="left" rowspan="1" colspan="1">RIGHT CONTEXT</td>
<td align="left" rowspan="1" colspan="1">PROBABILITY</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">dat_sg_masc</td>
<td align="left" rowspan="1" colspan="1">nom_sg_masc</td>
<td align="left" rowspan="1" colspan="1">acc_pl</td>
<td align="left" rowspan="1" colspan="1">0.00024</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">dat_sg_masc</td>
<td align="left" rowspan="1" colspan="1">nom_sg_masc</td>
<td align="left" rowspan="1" colspan="1">preposition</td>
<td align="left" rowspan="1" colspan="1">0.00024</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">dat_sg_masc</td>
<td align="left" rowspan="1" colspan="1">nom_sg_masc</td>
<td align="left" rowspan="1" colspan="1">nom_pl</td>
<td align="left" rowspan="1" colspan="1">0.00024</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">dat_sg_masc</td>
<td align="left" rowspan="1" colspan="1">neut_strong_pl_pos_nom</td>
<td align="left" rowspan="1" colspan="1">acc_pl_masc</td>
<td align="left" rowspan="1" colspan="1">0.00024</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">dat_sg_masc</td>
<td align="left" rowspan="1" colspan="1">neut_strong_pl_pos_nom</td>
<td align="left" rowspan="1" colspan="1">act_opt_pres_1_sg</td>
<td align="left" rowspan="1" colspan="1">0.00048</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">dat_sg_masc</td>
<td align="left" rowspan="1" colspan="1">neut_strong_pl_pos_nom</td>
<td align="left" rowspan="1" colspan="1">adverb</td>
<td align="left" rowspan="1" colspan="1">0.00024</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">dat_sg_masc</td>
<td align="left" rowspan="1" colspan="1">neut_strong_pl_pos_nom</td>
<td align="left" rowspan="1" colspan="1">nom_pl_neut</td>
<td align="left" rowspan="1" colspan="1">0.00096</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">dat_sg_masc</td>
<td align="left" rowspan="1" colspan="1">neut_strong_pl_pos_nom</td>
<td align="left" rowspan="1" colspan="1">conjunc</td>
<td align="left" rowspan="1" colspan="1">0.00143</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">dat_sg_masc</td>
<td align="left" rowspan="1" colspan="1">neut_strong_pl_pos_nom</td>
<td align="left" rowspan="1" colspan="1">gen_pl_masc</td>
<td align="left" rowspan="1" colspan="1">0.00024</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">dat_sg_masc</td>
<td align="left" rowspan="1" colspan="1">neut_strong_pl_pos_nom</td>
<td align="left" rowspan="1" colspan="1">acc_pl_neut</td>
<td align="left" rowspan="1" colspan="1">0.00096</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="nt105">
<label></label>
<p>The first three rows illustrate relatively low probabilities for unlikely POS combinations: in this example, two consecutive pronouns. The remaining rows show how more likely POS sequences receive higher probability scores; for instance, the probability of finding a word associated with MS tag
<italic>nom_sg_masc</italic>
given that it is preceded by
<italic>dat_sg_masc</italic>
and followed by
<italic>acc_pl</italic>
is 0.00024.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>We restrict the knowledge sources for these features by prioritizing them from most to least reliable. For instance, if a preceding word is the unique match of a given expert form, then only that morphosyntactic tag is used when calculating likelihood. If, on the other hand, it does not match any expert-based tags, then all available edit-distance tags are used.</p>
<p>Previous studies have shown that dynamic Bayesian network classifiers are associated with a number of attractive features, such as computational efficiency
<xref rid="pone.0102366-Zhang1" ref-type="bibr">[18]</xref>
as well as robustness in the presence of noisy input
<xref rid="pone.0102366-Goldwater1" ref-type="bibr">[19]</xref>
and missing data
<xref rid="pone.0102366-Renooij1" ref-type="bibr">[33]</xref>
,
<xref rid="pone.0102366-Liu1" ref-type="bibr">[34]</xref>
due to their integration over the complete feature space. It has also been shown that these classifiers perform well even if the feature independence requirement has been violated
<xref rid="pone.0102366-Rish1" ref-type="bibr">[35]</xref>
.</p>
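<p>The context-based scoring described above can be sketched as follows. The training triples are invented, and the add-one smoothing is an assumption of this sketch, not a detail given in the article.</p>

```python
from collections import Counter, defaultdict

# Invented (left_tag, target_tag, right_tag) training triples; in
# IceMorph these would come from the expert-related knowledge sources.
triples = [
    ("dat_sg_masc", "nom_sg_masc", "acc_pl"),
    ("dat_sg_masc", "acc_pl_masc", "verb"),
    ("dat_sg_masc", "acc_pl_masc", "acc_pl"),
]

prior = Counter(t for _, t, _ in triples)
left = defaultdict(Counter)
right = defaultdict(Counter)
for l, t, r in triples:
    left[t][l] += 1
    right[t][r] += 1

def score(tag, l_ctx, r_ctx):
    """Prior times the (assumed independent) context likelihoods,
    with add-one smoothing to avoid zero probabilities."""
    n = sum(prior.values())
    p_tag = prior[tag] / n
    p_l = (left[tag][l_ctx] + 1) / (sum(left[tag].values()) + n)
    p_r = (right[tag][r_ctx] + 1) / (sum(right[tag].values()) + n)
    return p_tag * p_l * p_r

# Pick the tag with maximum likelihood for the observed context.
best = max(prior, key=lambda t: score(t, "dat_sg_masc", "acc_pl"))
print(best)
```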
<p>
<bold>Hidden Markov Models</bold>
<xref rid="pone.0102366-Rabiner1" ref-type="bibr">[36]</xref>
are widely used for the task of sequence tagging. The HMM defines the problem space in terms of</p>
<list list-type="bullet">
<list-item>
<p>
<italic>S</italic>
hidden states; in IceMorph, these are morphosyntactic tags</p>
</list-item>
<list-item>
<p>
<italic>O</italic>
observations; in IceMorph, these are corpus words</p>
</list-item>
<list-item>
<p>transition probabilities
<italic>T
<sub>i = 1..S,j = 1..S</sub>
</italic>
between two states
<italic>i</italic>
and
<italic>j</italic>
</p>
</list-item>
<list-item>
<p>emission probabilities
<italic>E
<sub>i = 1..S</sub>
</italic>
capturing the probability of an outcome for state
<italic>i</italic>
</p>
</list-item>
</list>
<p>We use a standard trigram HMM. In order to find the most likely sequence of hidden states based on given observations, we implement the Viterbi algorithm
<xref rid="pone.0102366-Forney1" ref-type="bibr">[37]</xref>
. For a given t ∈ T and observations
<italic>o<sub>1</sub></italic>, …, <italic>o<sub>n</sub></italic>
we find the most likely state sequence by solving
<disp-formula id="pone.0102366.e002">
<graphic xlink:href="pone.0102366.e002.jpg" position="anchor" orientation="portrait"></graphic>
<label>(2)</label>
</disp-formula>
for a given element
<italic>x</italic>
in the sequence.</p>
<p>Similar to the process applied when creating the dynamic Bayesian network classifier, we only used expert-related data from our corpus when creating the HMM. In addition, we created two versions of the Viterbi algorithm, a default and a restricted version. The default Viterbi (dV) uses all the transition probabilities offered by the HMM. In contrast, the restricted Viterbi (rV)
<xref rid="pone.0102366-Tataru1" ref-type="bibr">[38]</xref>
uses the expert-related subset of transition probabilities whenever they are available.</p>
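<p>A compact bigram Viterbi decoder illustrates the idea; IceMorph itself uses a trigram HMM, and all probabilities below are invented. The example mirrors the earlier <italic>menn</italic>/<italic>sína</italic> homonymy case: a preceding accusative pronoun pulls the homonym toward the accusative reading.</p>

```python
# Bigram Viterbi sketch with invented probabilities; IceMorph uses
# a trigram HMM, but the dynamic-programming structure is the same.
states = ["nom_pl", "acc_pl"]
start = {"nom_pl": 0.6, "acc_pl": 0.4}
trans = {("nom_pl", "nom_pl"): 0.7, ("nom_pl", "acc_pl"): 0.3,
         ("acc_pl", "acc_pl"): 0.6, ("acc_pl", "nom_pl"): 0.4}
# The homonym 'menn' can be emitted by either state; the pronoun
# 'sína' is overwhelmingly accusative.
emit = {("nom_pl", "menn"): 0.5, ("acc_pl", "menn"): 0.5,
        ("nom_pl", "sína"): 0.05, ("acc_pl", "sína"): 0.95}

def viterbi(obs):
    """Return the most likely state sequence for the observations."""
    V = [{s: (start[s] * emit.get((s, obs[0]), 0.0), [s]) for s in states}]
    for o in obs[1:]:
        layer = {}
        for s in states:
            prob, path = max(
                (V[-1][p][0] * trans[(p, s)] * emit.get((s, o), 0.0),
                 V[-1][p][1] + [s])
                for p in states)
            layer[s] = (prob, path)
        V.append(layer)
    return max(V[-1].values())[1]

# Preceded by accusative 'sína', 'menn' resolves to acc_pl.
print(viterbi(["sína", "menn"]))  # ['acc_pl', 'acc_pl']
```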
<p>
<bold>Conditional Random Fields</bold>
<xref rid="pone.0102366-Lafferty1" ref-type="bibr">[27]</xref>
,
<xref rid="pone.0102366-Chatzis1" ref-type="bibr">[32]</xref>
are undirected graphical models often used for tagging sequential data. A CRF assigns probabilities to output nodes based on the values of input nodes. Like the HMM, it encodes sequential knowledge, but it additionally allows the inclusion of arbitrary feature functions describing the feature space. A linear-chain CRF takes into account features from the current and previous positions in a given sequence and provides a score such that:
<disp-formula id="pone.0102366.e003">
<graphic xlink:href="pone.0102366.e003.jpg" position="anchor" orientation="portrait"></graphic>
<label>(3)</label>
</disp-formula>
for a given position
<italic>i</italic>
in a sequence of words, where
<italic>f
<sub>j</sub>
</italic>
denotes a feature function and λ
<sub>j</sub>
represents its corresponding weight. Its feature space may include a variety of data, such as corpus instances, POS, morphosyntactic tags, positioning in a given sequence, etc. This makes CRFs quite powerful, but at a higher computational cost. Our experiments were conducted using the open source CRF++ tool
<xref rid="pone.0102366-CRF1" ref-type="bibr">[39]</xref>
.</p>
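<p>To illustrate equation (3), the sketch below computes an unnormalized linear-chain score as a weighted sum of feature functions. The feature functions and weights are invented for illustration; IceMorph uses the CRF++ toolkit rather than hand-written code like this.</p>

```python
import math

# Two invented feature functions in the spirit of equation (3):
# each looks at the previous tag, the current tag, and the word.
def f_prev_tag(prev_tag, tag, word, i):
    # fires when an accusative pronoun precedes an acc_pl tag
    return 1.0 if (prev_tag == "acc_pl_pron" and tag == "acc_pl") else 0.0

def f_word_identity(prev_tag, tag, word, i):
    # fires when the homonym 'menn' carries a plural tag
    return 1.0 if (word == "menn" and tag.endswith("_pl")) else 0.0

features = [(f_prev_tag, 2.0), (f_word_identity, 0.5)]

def sequence_score(words, tags):
    """Unnormalized exp of the weighted feature sum over positions."""
    total = 0.0
    for i, (word, tag) in enumerate(zip(words, tags)):
        prev_tag = tags[i - 1] if i > 0 else "START"
        total += sum(w * f(prev_tag, tag, word, i) for f, w in features)
    return math.exp(total)

words = ["sína", "menn"]
# The accusative-plural tagging scores higher than the alternative.
print(sequence_score(words, ["acc_pl_pron", "acc_pl"]) >
      sequence_score(words, ["acc_pl_pron", "nom_sg"]))  # True
```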
</sec>
</sec>
<sec id="s3">
<title>Results and Discussion</title>
<sec id="s3a">
<title>Tagged corpora</title>
<p>When we started work on IceMorph, we manually tagged a subset of 462 words. They were chosen randomly but reflect the relative frequency distribution of POS in Old Icelandic. We refer to this tagged set as the GOLD corpus.</p>
<p>In addition to creating GOLD, we asked our language experts to check and, if necessary, correct declension paradigms created by our prototype classifier via our online tool. At the time of writing, 488 corpus words had been processed by our experts; we refer to this tagged set as the EXPERT corpus.</p>
<p>
<xref ref-type="fig" rid="pone-0102366-g005">Figure 5</xref>
provides details with respect to the two subsets we used for testing and evaluation. The two test corpora differ in nature. Since GOLD instances have been chosen randomly they are distributed evenly throughout the corpus. In addition, words representing high frequency POS (as measured by occurrence in a dictionary) such as nouns (192 GOLD instances) and adjectives (153 GOLD instances) occur in GOLD relatively more often than words that belong to less frequent POS.</p>
<fig id="pone-0102366-g005" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0102366.g005</object-id>
<label>Figure 5</label>
<caption>
<title>IceMorph uses two distinct test sets to evaluate classification performance.</title>
<p>Corpus GOLD consists of 462 randomly selected corpus words. Corpus EXPERT, on the other hand, consists of 488 words tagged by expert users. This figure shows the relative frequency of POS in EXPERT and GOLD.</p>
</caption>
<graphic xlink:href="pone.0102366.g005"></graphic>
</fig>
<p>EXPERT instances, on the other hand, tend to cluster at the beginning of the corpus because our language experts focused on that section. Moreover, EXPERT contains many instances of words occurring frequently in the corpus even though the relative frequency of their associated POS in the dictionary may be lower (for instance, verbs with 160 instances or about 33%, and pronouns with 74 instances or about 15%).
<xref ref-type="table" rid="pone-0102366-t006">Table 6</xref>
shows the distribution of POS in EXPERT, GOLD, and in our concatenated dictionary.</p>
<table-wrap id="pone-0102366-t006" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0102366.t006</object-id>
<label>Table 6</label>
<caption>
<title>Relative distribution of POS in the IceMorph dictionary, GOLD, and EXPERT.</title>
</caption>
<alternatives>
<graphic id="pone-0102366-t006-6" xlink:href="pone.0102366.t006"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">POS</td>
<td align="left" rowspan="1" colspan="1">DICTIONARY (%)</td>
<td align="left" rowspan="1" colspan="1">GOLD corpus (%)</td>
<td align="left" rowspan="1" colspan="1">EXPERT corpus (%)</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">noun</td>
<td align="left" rowspan="1" colspan="1">64.39</td>
<td align="left" rowspan="1" colspan="1">41.56</td>
<td align="left" rowspan="1" colspan="1">30.53</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">adjective</td>
<td align="left" rowspan="1" colspan="1">18.45</td>
<td align="left" rowspan="1" colspan="1">33.12</td>
<td align="left" rowspan="1" colspan="1">7.17</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">verb</td>
<td align="left" rowspan="1" colspan="1">8.45</td>
<td align="left" rowspan="1" colspan="1">13.85</td>
<td align="left" rowspan="1" colspan="1">32.79</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">adverb</td>
<td align="left" rowspan="1" colspan="1">3.93</td>
<td align="left" rowspan="1" colspan="1">8.01</td>
<td align="left" rowspan="1" colspan="1">7.17</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">pronoun</td>
<td align="left" rowspan="1" colspan="1">0.13</td>
<td align="left" rowspan="1" colspan="1">1.29</td>
<td align="left" rowspan="1" colspan="1">15.16</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">preposition</td>
<td align="left" rowspan="1" colspan="1">0.1</td>
<td align="left" rowspan="1" colspan="1">1.29</td>
<td align="left" rowspan="1" colspan="1">3.69</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">other</td>
<td align="left" rowspan="1" colspan="1">4.54</td>
<td align="left" rowspan="1" colspan="1">0.88</td>
<td align="left" rowspan="1" colspan="1">1.24</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="nt106">
<label></label>
<p>The tagged corpus GOLD more closely resembles the POS distribution of the dictionary, while the tagged corpus EXPERT owes its distribution to word frequencies in the saga corpus.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>When testing classifiers we distinguish between results obtained with EXPERT and GOLD, respectively. EXPERT is our closest analogue to a properly tagged test environment because it contains long sequences of tagged words. GOLD, on the other hand, allows us to study the robustness of a given classifier, since most of its instances occur in a highly noisy environment (i.e., the preceding and following words tend not to be tagged).</p>
<p>The data used for this project are available through the California Digital Library's “Merritt” data repository. We have deposited three sets of data in the repository, which can be used in conjunction with our code, available from GitHub. The three datasets are collected as a single data package on Merritt with the following DOI: 10.5068/D1WC7K. The contents of this package are as follows:</p>
<list list-type="alpha-lower">
<list-item>
<p>the concatenated dictionary file, stored as JSON (dictionary_20140605.json)</p>
</list-item>
<list-item>
<p>the untagged and tagged Fornaldarsögur corpus (allvol.zip and icemorph_corpus-2014-06-01.zip)</p>
</list-item>
<list-item>
<p>the EXPERT and GOLD training/testing corpora (tagged_corpora_20140605.json)</p>
</list-item>
</list>
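As a minimal illustration of how the deposited tagged corpora might be consumed, the following sketch recomputes a POS distribution like those reported in Table 6. The JSON field names "form" and "pos" are assumptions for illustration, not the actual schema of the deposited files.

```python
from collections import Counter

def pos_distribution(entries, key="pos"):
    """Relative frequency (in %) of each POS among tagged entries."""
    counts = Counter(e[key] for e in entries)
    total = sum(counts.values())
    return {pos: round(100.0 * n / total, 2) for pos, n in counts.items()}

# Hypothetical entries mimicking a tagged corpus loaded via json.load().
gold = [
    {"form": "konungr", "pos": "noun"},
    {"form": "mikill", "pos": "adjective"},
    {"form": "fara", "pos": "verb"},
    {"form": "hann", "pos": "pronoun"},
]
print(pos_distribution(gold))  # each POS occurs once here: 25.0% apiece
```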
</sec>
<sec id="s3b">
<title>Classification results</title>
<p>As a baseline measure, we ran all classifiers on an in-sample data set (i.e., the same data was used for training and testing) for both the EXPERT and GOLD tagged sets. As expected, all classifiers performed well. We then split our test data into 80% training and 20% testing. In future work, the selection of corpus instances will be driven by “Query by Uncertainty”, an active learning algorithm that has been shown
<xref rid="pone.0102366-Ringger1" ref-type="bibr">[40]</xref>
to increase accuracy for corpora with minimal training sets. From the EXPERT corpus we used the first 20% for testing, because forms tagged by experts tend to cluster near the beginning of our corpus. Since the GOLD forms are spread more evenly throughout the corpus, we chose the last 20% as test data.</p>
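The position-dependent split can be sketched as follows (a hypothetical helper, assuming each tagged set is a list ordered by corpus position):

```python
def split_80_20(instances, test_at_start):
    """Hold out 20% of the tagged instances for testing.

    EXPERT forms cluster at the beginning of the corpus, so its first 20%
    is used for testing; GOLD forms are spread evenly, so the last 20% is
    held out instead.
    """
    cut = len(instances) // 5  # 20% of the instances
    if test_at_start:
        return instances[cut:], instances[:cut]  # (train, test)
    return instances[:-cut], instances[-cut:]

expert = list(range(488))  # stand-ins for the 488 EXPERT instances
train, test = split_80_20(expert, test_at_start=True)
print(len(train), len(test))  # 391 97
```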
<p>When applying our classifiers to the split data set, the HMM classifier clearly outperformed the other two; its accuracy did not suffer relative to its baseline (indeed, it scored higher). The restricted Viterbi consistently outperformed the default Viterbi. This is most pronounced in the performance of HMM-rV on the GOLD corpus, which contains a higher degree of uncertainty. With respect to results from the EXPERT corpus on the POS tagging task, our HMM classifier yields results similar to state-of-the-art POS taggers trained on noise-free data.
<xref ref-type="table" rid="pone-0102366-t007">Table 7</xref>
contains the results of our classification tests.</p>
<table-wrap id="pone-0102366-t007" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0102366.t007</object-id>
<label>Table 7</label>
<caption>
<title>Accuracies for POS and MS tagging.</title>
</caption>
<alternatives>
<graphic id="pone-0102366-t007-7" xlink:href="pone.0102366.t007"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">TEST</td>
<td align="left" rowspan="1" colspan="1">POS EXPERT</td>
<td align="left" rowspan="1" colspan="1">POS GOLD</td>
<td align="left" rowspan="1" colspan="1">MS EXPERT</td>
<td align="left" rowspan="1" colspan="1">MS GOLD</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Bayes-base</td>
<td align="left" rowspan="1" colspan="1">95.43%</td>
<td align="left" rowspan="1" colspan="1">79.25%</td>
<td align="left" rowspan="1" colspan="1">80.67%</td>
<td align="left" rowspan="1" colspan="1">48.34%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Bayes-80/20</td>
<td align="left" rowspan="1" colspan="1">85.71%</td>
<td align="left" rowspan="1" colspan="1">75.14%</td>
<td align="left" rowspan="1" colspan="1">62.37%</td>
<td align="left" rowspan="1" colspan="1">43.24%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">HMM-dV-base</td>
<td align="left" rowspan="1" colspan="1">93.85%</td>
<td align="left" rowspan="1" colspan="1">25.60%</td>
<td align="left" rowspan="1" colspan="1">75.82%</td>
<td align="left" rowspan="1" colspan="1">13.62%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">HMM-dV-80/20</td>
<td align="left" rowspan="1" colspan="1">93.68%</td>
<td align="left" rowspan="1" colspan="1">34.74%</td>
<td align="left" rowspan="1" colspan="1">82.11%</td>
<td align="left" rowspan="1" colspan="1">18.75%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">HMM-rV-base</td>
<td align="left" rowspan="1" colspan="1">96.11%</td>
<td align="left" rowspan="1" colspan="1">71.58%</td>
<td align="left" rowspan="1" colspan="1">79.92%</td>
<td align="left" rowspan="1" colspan="1">53.98%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">HMM-rV-80/20</td>
<td align="left" rowspan="1" colspan="1">96.84%</td>
<td align="left" rowspan="1" colspan="1">73.16%</td>
<td align="left" rowspan="1" colspan="1">84.21%</td>
<td align="left" rowspan="1" colspan="1">54.86%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">CRF-1-base</td>
<td align="left" rowspan="1" colspan="1">89.75%</td>
<td align="left" rowspan="1" colspan="1">36.58%</td>
<td align="left" rowspan="1" colspan="1">78.07%</td>
<td align="left" rowspan="1" colspan="1">11.54%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">CRF-1-80/20</td>
<td align="left" rowspan="1" colspan="1">87.30%</td>
<td align="left" rowspan="1" colspan="1">46.07%</td>
<td align="left" rowspan="1" colspan="1">77.78%</td>
<td align="left" rowspan="1" colspan="1">16.55%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">CRF-2-80/20</td>
<td align="left" rowspan="1" colspan="1">84.13%</td>
<td align="left" rowspan="1" colspan="1">48.69%</td>
<td align="left" rowspan="1" colspan="1">56.08%</td>
<td align="left" rowspan="1" colspan="1">17.24%</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="nt107">
<label></label>
<p>Tests with postfix “base” were performed using in-sample test sets. For the others, the supervised set was split into 80% training and 20% testing.</p>
</fn>
</table-wrap-foot>
</table-wrap>
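The difference between the default and restricted decoders can be illustrated with a minimal Viterbi sketch in which each position may be limited to the tags the dictionary licenses for that word. This is our own log-space illustration of the idea, with -20.0 as a crude stand-in for smoothing of unseen events; it is not the project's implementation.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p, allowed=None):
    """Viterbi decoding over log-probabilities.

    When `allowed` maps positions to permissible tags (e.g. the readings a
    dictionary licenses for each word), the search is restricted to those
    tags -- the idea behind HMM-rV as opposed to the default HMM-dV.
    """
    UNSEEN = -20.0  # crude log-probability for unseen transitions/emissions

    def cands(t):
        return allowed[t] if allowed is not None else states

    V = [{s: start_p[s] + emit_p[s].get(obs[0], UNSEEN) for s in cands(0)}]
    path = {s: [s] for s in cands(0)}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in cands(t):
            prev, score = max(
                ((p, V[t - 1][p] + trans_p[p].get(s, UNSEEN)) for p in V[t - 1]),
                key=lambda x: x[1],
            )
            V[t][s] = score + emit_p[s].get(obs[t], UNSEEN)
            new_path[s] = path[prev] + [s]
        path = new_path
    return path[max(V[-1], key=V[-1].get)]

# Toy two-tag model; "konungr" is restricted to noun readings, "fór" to verb.
states = ["noun", "verb"]
start_p = {"noun": math.log(0.6), "verb": math.log(0.4)}
trans_p = {"noun": {"noun": math.log(0.3), "verb": math.log(0.7)},
           "verb": {"noun": math.log(0.7), "verb": math.log(0.3)}}
emit_p = {"noun": {"konungr": math.log(0.8)},
          "verb": {"fór": math.log(0.8)}}
restricted = {0: ["noun"], 1: ["verb"]}
print(viterbi(["konungr", "fór"], states, start_p, trans_p, emit_p, restricted))
# ['noun', 'verb']
```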
<p>The relatively poor performance of the CRF classifier deserves special explanation. Due to its higher demand for computing resources, we initially restricted its training set to sequences in which each word was associated with no more than one morphosyntactic form. As features we chose the surface forms and MS tags of the preceding and following corpus words. Test CRF-1-80/20 performed below its in-sample baseline, but the decline was considerably smaller than that of the dynamic Bayesian network classifier. We assumed that increasing the number of allowed morphosyntactic forms per word from one to two would improve CRF performance. But as test CRF-2-80/20 shows, the opposite was true: performance declined somewhat for EXPERT words. Our interpretation of these results is that while CRF performs very well when trained on noise-free input, it is less capable of handling uncertainty in its training set than our HMM classifier with restricted Viterbi.</p>
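Since the CRF classifier was built with the CRF++ toolkit cited above, its feature set can be illustrated with a CRF++-style feature template. This is a sketch under the assumption that column 0 of each training row holds the surface form and column 1 the candidate MS tag; the actual template used is not reproduced here.

```
# Unigram features over the surface form (column 0) of the previous,
# current, and next word, and the MS tag (column 1) of the neighbors.
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[-1,1]
U04:%x[1,1]

# Bigram feature combining the previous and current output tags.
B
```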
</sec>
</sec>
<sec id="s4">
<title>Conclusion and Outlook</title>
<p>The IceMorph POS and MS tagger attempts to maximize classification performance using a minimum of cleanly tagged training data. It is a hybrid system, combining readily available resources for Old Icelandic (such as dictionaries, grammars, and corpora) and human expert feedback with machine learning algorithms for continuous automated classification. Given a small set of tagged words, IceMorph achieves corpus-wide POS classification accuracy of over 96% and MS classification accuracy of over 84%.</p>
<p>None of the resources used by IceMorph is noise free. Dictionaries and corpora contain errors introduced during OCR or inherent in the source itself. Furthermore, the context-based classifier learns its probability matrix from highly noisy data. IceMorph is designed to maximize performance given this noisy environment. It does so by taking cues from human experts, as well as exploiting the logarithmic distribution of unique words in corpora, essentially reducing the task of classification to a process of disambiguation of homographs.</p>
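The effect of this skewed distribution can be made concrete with a small sketch: a handful of frequent word types accounts for a large share of corpus tokens, so resolving the readings of those few types settles most of the corpus. The counts below are synthetic Zipf-like data, not statistics from the saga corpus.

```python
from collections import Counter

def coverage(tokens, n_types):
    """Fraction of corpus tokens covered by the n most frequent word types."""
    counts = Counter(tokens)
    covered = sum(n for _, n in counts.most_common(n_types))
    return covered / len(tokens)

# Synthetic Zipf-like corpus: type i occurs about 1000/i times.
tokens = [f"w{i}" for i in range(1, 101) for _ in range(1000 // i)]
# The 10 most frequent of 100 types cover well over half of all tokens.
print(round(coverage(tokens, 10), 2))
```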
<p>The key to improved performance will be to further reduce noise throughout the IceMorph system, most easily accomplished by expanding expert feedback. We are exploring additional ways to improve accuracy by refining our machine learning algorithms. We are also investigating how to optimize the selection of corpus words to have maximum impact on classification performance by implementing appropriate active learning algorithms. Finally, we are looking at ways to incorporate phenomena specific to Old Icelandic, such as enclitics (suffixed determiners), so as to reduce classification failures.</p>
</sec>
<sec id="s5">
<title>Software and Data</title>
<p>Software for this project is available on GitHub (search for IceMorph). Data are available from the University of California/California Digital Library repository Merritt, with the following DOI: 10.5068/D1WC7K.</p>
</sec>
</body>
<back>
<ack>
<p>Jackson Crawford (UCLA), Zoe Borovsky (UCLA), David Gabriel (UCLA), and Monit Tyagi (UCLA) all contributed to the development of IceMorph.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="pone.0102366-Icemorph1">
<label>1</label>
<mixed-citation publication-type="other">Icemorph website. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.purl.org/icemorph/index">http://www.purl.org/icemorph/index</ext-link>
Accessed 2014 Jun 10.</mixed-citation>
</ref>
<ref id="pone.0102366-Fritzner1">
<label>2</label>
<mixed-citation publication-type="book">Fritzner J (1867) Ordbog over det gamle norske sprog. Christiania: Feilberg & Landmark. 874 p.</mixed-citation>
</ref>
<ref id="pone.0102366-Cleasby1">
<label>3</label>
<mixed-citation publication-type="book">Cleasby R, Vigfússon G (1874) An Icelandic-English Dictionary. Oxford: Clarendon Press. 779 p.</mixed-citation>
</ref>
<ref id="pone.0102366-Zoga1">
<label>4</label>
<mixed-citation publication-type="book">Zoëga G (1910) A concise dictionary of Old Icelandic. Oxford: Clarendon Press. 551 p.</mixed-citation>
</ref>
<ref id="pone.0102366-Cucerzan1">
<label>5</label>
<mixed-citation publication-type="journal">
<name>
<surname>Cucerzan</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Yarowsky</surname>
<given-names>D</given-names>
</name>
(
<year>2002</year>
)
<article-title>Bootstrapping a multilingual part-of-speech tagger in one person-day</article-title>
.
<source>Proc of CoNLL-2002</source>
<fpage>132</fpage>
<lpage>138</lpage>
</mixed-citation>
</ref>
<ref id="pone.0102366-Brill1">
<label>6</label>
<mixed-citation publication-type="book">Brill E, Marcus M (1992) Tagging an unfamiliar text with minimal human supervision. In: Goldman R (ed). Working Notes of the AAAI Fall Symposium on Probabilistic Approaches to Natural Language: 10–16.</mixed-citation>
</ref>
<ref id="pone.0102366-A1">
<label>7</label>
<mixed-citation publication-type="other">A Concise Dictionary of Old Icelandic. Available:
<ext-link ext-link-type="uri" xlink:href="http://norse.ulver.com/dct/Zoega">http://norse.ulver.com/dct/Zoega</ext-link>
Accessed 2014 Jun 10.
</mixed-citation>
</ref>
<ref id="pone.0102366-A2">
<label>8</label>
<mixed-citation publication-type="other">A Concise Dictionary of Old Icelandic. Available:
<ext-link ext-link-type="uri" xlink:href="http://lexicon.ff.cuni.cz/texts/oi_zoega_about.html">http://lexicon.ff.cuni.cz/texts/oi_zoega_about.html</ext-link>
Accessed 2014 Jun 10.
</mixed-citation>
</ref>
<ref id="pone.0102366-An1">
<label>9</label>
<mixed-citation publication-type="other">An Icelandic-English Dictionary. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.ling.upenn.edu/~kurisuto/germanic/oi_cleasbyvigfusson_about.html">http://www.ling.upenn.edu/~kurisuto/germanic/oi_cleasbyvigfusson_about.html</ext-link>
Accessed 2014 Jun 10.
</mixed-citation>
</ref>
<ref id="pone.0102366-slenzk1">
<label>10</label>
<mixed-citation publication-type="other">Íslenzk fornrit. Available:
<ext-link ext-link-type="uri" xlink:href="http://hib.is/kynningar/fornrit2011.pdf">http://hib.is/kynningar/fornrit2011.pdf</ext-link>
Accessed 2014 Jun 10.
</mixed-citation>
</ref>
<ref id="pone.0102366-Forsberg1">
<label>11</label>
<mixed-citation publication-type="journal">
<name>
<surname>Forsberg</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Ranta</surname>
<given-names>A</given-names>
</name>
(
<year>2004</year>
)
<article-title>Functional Morphology</article-title>
.
<source>Proc 9th ACM SIGPLAN International Conf on Functional Programming</source>
<fpage>213</fpage>
<lpage>223</lpage>
<comment>DOI:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1145/1016850.1016879"> 10.1145/1016850.1016879</ext-link>
</comment>
</mixed-citation>
</ref>
<ref id="pone.0102366-Ranta1">
<label>12</label>
<mixed-citation publication-type="journal">
<name>
<surname>Ranta</surname>
<given-names>A</given-names>
</name>
(
<year>2004</year>
)
<article-title>Grammatical Framework: A Type-theoretical Grammar Formalism</article-title>
.
<source>J Functional Programming</source>
<volume>14</volume>
<issue>(2)</issue>
<fpage>145</fpage>
<lpage>189</lpage>
</mixed-citation>
</ref>
<ref id="pone.0102366-FornaldarsgurNorurlanda1">
<label>13</label>
<mixed-citation publication-type="other">Fornaldarsögur_Norðurlanda. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.heimskringla.no/wiki/Fornaldars">http://www.heimskringla.no/wiki/Fornaldars</ext-link>
ögur_Norðurlanda. Accessed 2014 Jun 10.</mixed-citation>
</ref>
<ref id="pone.0102366-Gordon1">
<label>14</label>
<mixed-citation publication-type="book">Gordon E (1938) An Introduction to Old Norse. Oxford: Oxford University Press. 383 p.</mixed-citation>
</ref>
<ref id="pone.0102366-The1">
<label>15</label>
<mixed-citation publication-type="other">The Haskell Programming Language. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.haskell.org">http://www.haskell.org</ext-link>
Accessed 2014 Jun 10.
</mixed-citation>
</ref>
<ref id="pone.0102366-Wagner1">
<label>16</label>
<mixed-citation publication-type="journal">
<name>
<surname>Wagner</surname>
<given-names>RA</given-names>
</name>
,
<name>
<surname>Fischer</surname>
<given-names>MJ</given-names>
</name>
(
<year>1974</year>
)
<article-title>The string to string correction problem</article-title>
.
<source>J Assoc Comput Mach</source>
<volume>21</volume>
<issue>(1)</issue>
<fpage>168</fpage>
<lpage>183</lpage>
</mixed-citation>
</ref>
<ref id="pone.0102366-Icemorph2">
<label>17</label>
<mixed-citation publication-type="other">Icemorph Morphological Analyzer Interface. Available:
<ext-link ext-link-type="uri" xlink:href="http://icemorph.scandinavian.ucla.edu">http://icemorph.scandinavian.ucla.edu</ext-link>
Accessed 2014 Jun 10.</mixed-citation>
</ref>
<ref id="pone.0102366-Zhang1">
<label>18</label>
<mixed-citation publication-type="other">Zhang H (2004) The Optimality of Naive Bayes. Proc 17th International Florida Artificial Intelligence Research Society Conf (FLAIRS 2004) Available:
<ext-link ext-link-type="uri" xlink:href="http://www.cs.unb.ca/profs/hzhang/publications/FLAIRS04ZhangH.pdf">http://www.cs.unb.ca/profs/hzhang/publications/FLAIRS04ZhangH.pdf</ext-link>
Accessed 2014 Jun 10.</mixed-citation>
</ref>
<ref id="pone.0102366-Goldwater1">
<label>19</label>
<mixed-citation publication-type="journal">
<name>
<surname>Goldwater</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Griffiths</surname>
<given-names>TL</given-names>
</name>
(
<year>2007</year>
)
<article-title>A Fully Bayesian Approach to Unsupervised Part-Of-Speech Tagging</article-title>
.
<source>Proc 45th Annual Meeting of the Assoc of Computational Linguistics</source>
<fpage>744</fpage>
<lpage>751</lpage>
</mixed-citation>
</ref>
<ref id="pone.0102366-Murphy1">
<label>20</label>
<mixed-citation publication-type="other">Murphy KP (2002) Dynamic bayesian networks: representation, inference and learning PhD dissertation, University of California, Berkeley. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.ee.uwa.edu.au/~roberto/research/projectsbiblio/10.1.1.93.778.pdf">http://www.ee.uwa.edu.au/~roberto/research/projectsbiblio/10.1.1.93.778.pdf</ext-link>
Accessed 2014 May 5.
</mixed-citation>
</ref>
<ref id="pone.0102366-Rgnvaldsson1">
<label>21</label>
<mixed-citation publication-type="book">Rögnvaldsson E, Helgadóttir S (2011) Morphosyntactic Tagging of Old Icelandic Texts and Its Use in Studying Syntactic Variation and Change. In: Sporleder, C, van den Bosch, APJ Zervanou, KA (eds). Language Technology for Cultural Heritage: Selected Papers from the LaTeCH Workshop Series. Berlin: Springer. Pp. 63–76.</mixed-citation>
</ref>
<ref id="pone.0102366-Borin1">
<label>22</label>
<mixed-citation publication-type="journal">
<name>
<surname>Borin</surname>
<given-names>L</given-names>
</name>
,
<name>
<surname>Forsberg</surname>
<given-names>M</given-names>
</name>
(
<year>2008</year>
)
<article-title>Something Old, Something New: A Computational Morphological Description of Old Swedish</article-title>
.
<source>Proc 6th Language Resources and Evaluation Conf</source>
<fpage>9</fpage>
<lpage>16</lpage>
</mixed-citation>
</ref>
<ref id="pone.0102366-Loftsson1">
<label>23</label>
<mixed-citation publication-type="journal">
<name>
<surname>Loftsson</surname>
<given-names>H</given-names>
</name>
(
<year>2008</year>
)
<article-title>Tagging Icelandic text: A linguistic rule-based approach</article-title>
.
<source>Nordic J Linguistics</source>
<volume>31</volume>
<issue>(1)</issue>
<fpage>47</fpage>
<lpage>72</lpage>
</mixed-citation>
</ref>
<ref id="pone.0102366-Feldman1">
<label>24</label>
<mixed-citation publication-type="book">Feldman A, Hana J (2009) A Resource-Light Approach to Morpho-Syntactic Tagging. Amsterdam: Rodopi. 185p.</mixed-citation>
</ref>
<ref id="pone.0102366-Toutanova1">
<label>25</label>
<mixed-citation publication-type="journal">
<name>
<surname>Toutanova</surname>
<given-names>K</given-names>
</name>
,
<name>
<surname>Johnson</surname>
<given-names>M</given-names>
</name>
(
<year>2008</year>
)
<article-title>A Bayesian LDA-based model for semi-supervised part-of-speech tagging</article-title>
.
<source>Advances in NIPS</source>
<volume>20</volume>
:
<fpage>1521</fpage>
<lpage>1528</lpage>
</mixed-citation>
</ref>
<ref id="pone.0102366-Manning1">
<label>26</label>
<mixed-citation publication-type="book">Manning C, Schütze H (2003) Foundations of Statistical Natural Language Processing. Cambridge: MIT Press. Pp. 23–29.</mixed-citation>
</ref>
<ref id="pone.0102366-Lafferty1">
<label>27</label>
<mixed-citation publication-type="journal">
<name>
<surname>Lafferty</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>McCallum</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Pereira</surname>
<given-names>F</given-names>
</name>
(
<year>2001</year>
)
<article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
.
<source>Proc 18th International Conf on Machine Learning</source>
<fpage>282</fpage>
<lpage>289</lpage>
</mixed-citation>
</ref>
<ref id="pone.0102366-Clark1">
<label>28</label>
<mixed-citation publication-type="journal">
<name>
<surname>Clark</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Curran</surname>
<given-names>JR</given-names>
</name>
,
<name>
<surname>Osborne</surname>
<given-names>M</given-names>
</name>
(
<year>2003</year>
)
<article-title>Bootstrapping POS taggers using unlabeled data</article-title>
.
<source>Proc 7th Conf on Natural language learning at HLT-NAACL</source>
<volume>4</volume>
:
<fpage>49</fpage>
<lpage>55</lpage>
</mixed-citation>
</ref>
<ref id="pone.0102366-Loftsson2">
<label>29</label>
<mixed-citation publication-type="journal">
<name>
<surname>Loftsson</surname>
<given-names>H</given-names>
</name>
,
<name>
<surname>Helgadóttir</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Rögnvaldsson</surname>
<given-names>E</given-names>
</name>
(
<year>2011</year>
)
<article-title>Using a morphological database to increase the accuracy in PoS tagging</article-title>
.
<source>Proc Recent Advances in Natural Language Processing (RANLP 2011)</source>
<fpage>49</fpage>
<lpage>55</lpage>
</mixed-citation>
</ref>
<ref id="pone.0102366-Schmid1">
<label>30</label>
<mixed-citation publication-type="journal">
<name>
<surname>Schmid</surname>
<given-names>H</given-names>
</name>
(
<year>1994</year>
)
<article-title>Probabilistic part-of-speech tagging using decision trees</article-title>
.
<source>Proc International Conf New Methods in Language Processing</source>
<volume>12</volume>
:
<fpage>44</fpage>
<lpage>49</lpage>
</mixed-citation>
</ref>
<ref id="pone.0102366-Ratnaparkhi1">
<label>31</label>
<mixed-citation publication-type="journal">
<name>
<surname>Ratnaparkhi</surname>
<given-names>A</given-names>
</name>
(
<year>1996</year>
)
<article-title>A maximum entropy model for part-of-speech tagging</article-title>
.
<source>Proc Conf Empirical Methods in Natural Language Processing</source>
<volume>1</volume>
:
<fpage>133</fpage>
<lpage>142</lpage>
</mixed-citation>
</ref>
<ref id="pone.0102366-Chatzis1">
<label>32</label>
<mixed-citation publication-type="journal">
<name>
<surname>Chatzis</surname>
<given-names>SP</given-names>
</name>
,
<name>
<surname>Demiris</surname>
<given-names>Y</given-names>
</name>
(
<year>2013</year>
)
<article-title>The Infinite-Order Conditional Random Field Model for Sequential Data Modelling</article-title>
.
<source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
<volume>35</volume>
<issue>(6)</issue>
<fpage>1523</fpage>
<lpage>1534</lpage>
<pub-id pub-id-type="pmid">23599063</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0102366-Renooij1">
<label>33</label>
<mixed-citation publication-type="journal">
<name>
<surname>Renooij</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Van Der Gaag</surname>
<given-names>LC</given-names>
</name>
(
<year>2008</year>
)
<article-title>Evidence and scenario sensitivities in naive Bayesian classifiers</article-title>
.
<source>International J Approximate Reasoning</source>
<volume>49</volume>
<issue>(2)</issue>
<fpage>398</fpage>
<lpage>416</lpage>
</mixed-citation>
</ref>
<ref id="pone.0102366-Liu1">
<label>34</label>
<mixed-citation publication-type="journal">
<name>
<surname>Liu</surname>
<given-names>P</given-names>
</name>
,
<name>
<surname>Lei</surname>
<given-names>L</given-names>
</name>
,
<name>
<surname>Wu</surname>
<given-names>N</given-names>
</name>
(
<year>2005</year>
)
<article-title>A quantitative study of the effect of missing data in classifiers</article-title>
.
<comment>DOI:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1109/CIT.2005.41">10.1109/CIT.2005.41</ext-link>
</comment>
</mixed-citation>
</ref>
<ref id="pone.0102366-Rish1">
<label>35</label>
<mixed-citation publication-type="journal">
<name>
<surname>Rish</surname>
<given-names>I</given-names>
</name>
(
<year>2001</year>
)
<article-title>An empirical study of the naive Bayes classifier</article-title>
.
<source>IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence</source>
<volume>3</volume>
<issue>(22)</issue>
<fpage>41</fpage>
<lpage>46</lpage>
</mixed-citation>
</ref>
<ref id="pone.0102366-Rabiner1">
<label>36</label>
<mixed-citation publication-type="journal">
<name>
<surname>Rabiner</surname>
<given-names>L</given-names>
</name>
,
<name>
<surname>Juang</surname>
<given-names>BH</given-names>
</name>
(
<year>1986</year>
)
<article-title>An introduction to hidden Markov models</article-title>
.
<source>ASSP Magazine, IEEE</source>
<volume>3</volume>
<issue>(1)</issue>
<fpage>4</fpage>
<lpage>16</lpage>
</mixed-citation>
</ref>
<ref id="pone.0102366-Forney1">
<label>37</label>
<mixed-citation publication-type="journal">
<name>
<surname>Forney</surname>
<given-names>G</given-names>
<suffix>Jr</suffix>
</name>
(
<year>1973</year>
)
<article-title>The Viterbi algorithm</article-title>
.
<source>Proc of the IEEE</source>
<volume>61</volume>
<issue>(3)</issue>
<fpage>268</fpage>
<lpage>278</lpage>
</mixed-citation>
</ref>
<ref id="pone.0102366-Tataru1">
<label>38</label>
<mixed-citation publication-type="journal">
<name>
<surname>Tataru</surname>
<given-names>P</given-names>
</name>
,
<name>
<surname>Sand</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Hobolth</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Mailund</surname>
<given-names>T</given-names>
</name>
,
<name>
<surname>Pedersen</surname>
<given-names>CNS</given-names>
</name>
(
<year>2013</year>
)
<article-title>Algorithms for Hidden Markov Models Restricted to Occurrences of Regular Expressions</article-title>
.
<source>Biology</source>
<volume>2</volume>
<issue>(4)</issue>
<fpage>1282</fpage>
<lpage>1295</lpage>
<pub-id pub-id-type="pmid">24833225</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0102366-CRF1">
<label>39</label>
<mixed-citation publication-type="other">CRF++: Yet another CRF toolkit. Available:
<ext-link ext-link-type="uri" xlink:href="http://crfpp.googlecode.com/svn/trunk/doc/index.html">http://crfpp.googlecode.com/svn/trunk/doc/index.html</ext-link>
Accessed 2014 Jun 10.</mixed-citation>
</ref>
<ref id="pone.0102366-Ringger1">
<label>40</label>
<mixed-citation publication-type="journal">
<name>
<surname>Ringger</surname>
<given-names>E</given-names>
</name>
,
<name>
<surname>McClanahan</surname>
<given-names>P</given-names>
</name>
,
<name>
<surname>Haertel</surname>
<given-names>R</given-names>
</name>
,
<name>
<surname>Busby</surname>
<given-names>G</given-names>
</name>
,
<name>
<surname>Carmen</surname>
<given-names>M</given-names>
</name>
,
<etal>et al</etal>
(
<year>2007</year>
)
<article-title>Active learning for part-of-speech tagging: accelerating corpus annotation</article-title>
.
<source>Proc Linguistic Annotation Workshop (LAW '07)</source>
<fpage>101</fpage>
<lpage>108</lpage>
</mixed-citation>
</ref>
<ref id="pone.0102366-Ordbog1">
<label>41</label>
<mixed-citation publication-type="other">Ordbog over det norrøne prosaprog. Available:
<ext-link ext-link-type="uri" xlink:href="http://onp.ku.dk/">http://onp.ku.dk/</ext-link>
Accessed 2014 Jun 10.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024