CIDE (2009) Marcoux : Différence entre versions

De CIDE
imported>Abdelhakim Aidene
==Introduction==
 
In a structured document (XML, [[A pour norme citée::Standard Generalized Markup Language|SGML]], etc.), what is the “meaning” of the various tags (''the markup'') present in the document? How is the meaning of the document augmented—or otherwise affected—by the presence of markup?
  
Fundamentally, there are two possible avenues to give an answer to that question: the formal one and the informal one. One can devise a framework in which the meaning of a marked-up document is represented by a set of ''formal'' statements, for example in first-order logic. Or, one can seek a framework in which the meaning of a marked-up document is represented by a set of sentences in an ''informal'' language, for example a natural language.
  
 
If automatic inferencing (through an inference engine) is the aim, then a formal approach probably has the edge. However, if some other use of the “meaning” of the document is envisioned, for example one that involves showing that meaning to humans, then the situation may be reversed.
  
''Formal Tag-Set Descriptions'' (see for example [{{CIDE lien citation|6}}], [{{CIDE lien citation|7}}] and [{{CIDE lien citation|8}}]) are an example of the approaches along the formal avenue. Intertextual semantics [{{CIDE lien citation|1}}] [{{CIDE lien citation|2}}] [{{CIDE lien citation|4}}] is an approach along the informal avenue. In intertextual semantics (IS), the meaning of a marked-up document is entirely and exclusively represented in natural language.

The intertextual semantics (IS) approach is based on the hypothesis (of which traces can be found in, among other places, the works of Wierzbicka [{{CIDE lien citation|9}}], Smedslund [{{CIDE lien citation|5}}], and even Wittgenstein [{{CIDE lien citation|10}}]) that humans ultimately “make sense” of artefacts through the use of natural language (NL), and that in designing artefacts, one should be concerned with how, how easily, and with how much ambiguity (or unambiguity) humans can derive NL from those artefacts. No matter how useful intermediate formal representations of meaning (including marked-up documents) may be for conciseness, machine processing, etc., they must ultimately be translatable (not necessarily translated) to NL, and are only ever as “meaningful” as such NL expressions of them are.
  
 
In the realm of structured (i.e., marked-up) documents, IS suggests that the creators of tag-sets (modelers) should be preoccupied by how markup can be translated to NL. Even if “end users” never see any marked-up document, some other humans, for example, processing software developers, or archivists, will have to deal with them directly or indirectly, unless the documents are totally pointless. One might say it is even more important to be preoccupied by that translation as the number of intermediate representations increases, because there are then more opportunities for misinterpretations.
 
 
   
 
   
IS proposes a mechanism by which NL passages (or whole documents) are generated from marked-up documents, according to an ''IS specification'' for the tag-set. So far, only very weak NL generation mechanisms have been explored, and it is extremely important that those mechanisms be weak, because overly powerful mechanisms would “hide under the carpet” inherent interpretation complications which IS, in contrast, seeks to uncover. In the current state of the IS framework, an IS specification takes the form of a table giving, for each element type, two NL segments: a “text-before” segment and a “text-after” segment, generically called “peritexts.”
  
 
Attributes require special attention, but a way of handling them in keeping with the spirit of IS is presented in [{{CIDE lien citation|2}}]. They are handled through the possibility of including in the peritexts “guarded segments,” segments guarded by an attribute name, that are only included if the corresponding attribute is specified on the element, and that can refer to the attribute value. “Local” elements (in the sense of W3C schemas) are supported, so that different peritexts can be assigned depending on the ancestors of the element.
 
The IS generation process is akin to styling the document with the peritexts, concatenating peritexts and element contents as the document tree is traversed depth-first. The IS, or ''IS-meaning'', of the document is the resulting character string. It is important to stress that, in spite of the similarity between styling and the generation of the IS of a document, the preoccupations of IS are absolutely not at the presentational level, but really at the semantic level.
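
The traversal described above can be sketched in a few lines. This is only an illustration in Python, not the XSLT 1.0 implementation presented in this article; the peritext table and the function name <code>intertextual_semantics</code> are assumptions made for the example, and attributes, hyperlinks, and local elements are ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical peritext table: element name -> (text-before, text-after).
PERITEXTS = {
    "story": ("This document tells a tiny story. ", " End of the tiny story."),
    "para": ("A bit of the story: ", ""),
    "person": ("The person named ", ""),
}

def intertextual_semantics(element):
    """Concatenate peritexts and element content, depth-first, in document order."""
    before, after = PERITEXTS.get(element.tag, ("", ""))
    parts = [before, element.text or ""]
    for child in element:
        parts.append(intertextual_semantics(child))
        parts.append(child.tail or "")  # text that follows the child inside its parent
    parts.append(after)
    return "".join(parts)

doc = ET.fromstring(
    "<story><para><person>Dracula</person> went to France.</para></story>"
)
print(intertextual_semantics(doc))
```

The result is a single character string in which peritexts and document text alternate, which is what the IS-meaning of a document is defined to be.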
  
 
In this article, we present a complete implementation, in XSLT 1.0, of the intertextual semantics generation mechanism. The transformation is model-independent in that it reads the peritexts from an XML document encoding the IS specification for a given model. It implements attribute handling as defined in [{{CIDE lien citation|2}}]; hyperlinks in peritexts or as attribute or element content, as described in [{{CIDE lien citation|1}}]; and local element definitions, also as described in [{{CIDE lien citation|1}}]. In addition, it indents the output for increased readability (along the same lines as [{{CIDE lien citation|3}}], but more elaborately), and handles two kinds of exceptions: elements for which no peritext exists in the IS specification, and unexpected attributes.
 
  
 
==General approach==
 
As mentioned earlier, the implementation is model-independent: the same XSLT stylesheet is used to process any document. In principle, the association between elements and peritexts is determined by the model (DTD or schema) to which the document conforms. Knowing the model, the generic stylesheet can read an IS specification (ISS) file, giving the peritexts for all elements, and compute the IS of the instance.
  
 
In theory, namespace or schema-location information could be used to identify the appropriate ISS file applicable to a document. However, complications would arise from the fact that the same document can conform to different schemas, and contain elements of different namespaces. Thus, it seems simpler to determine the model of a document for IS purposes independently from namespace and schema-location information. One possibility would be to point explicitly to the ISS file through a processing instruction in the document. We chose a more implicit approach, requiring no model-specific addition to the documents, and which proved flexible enough:
 
The processing performed by the generic stylesheet is one-pass, i.e., it takes as input the document instance and directly generates its IS. Thus, no pipelining environment is necessary. Any current browser with an XSLT 1.0 processor can be used to view the IS of documents directly, provided a link to the generic stylesheet is included in the documents, for example:
 
<small>
<source lang="xml">
<?xml-stylesheet type="text/xsl" href="ISG.xsl" ?>
</source>
</small>
  
 
==IS specifications==
 
An ISS file is an XML document. All its elements and attributes belong to the specific namespace:

<small><nowiki><</nowiki>http://grds.ebsi.umontreal.ca/ns/ISS/<nowiki>></nowiki></small>
  
 
Its top-level element is an iss element. The content of that element is one or more rule elements. Each rule element is empty and has three mandatory attributes: paths, text-before, and text-after. The effect of a rule element is to assign the pair of peritexts text-before and text-after to the path or space-delimited paths given in paths.
 
  
 
The sequences {{ and }} in peritexts are hyperlink delimiters: what lies between them is interpreted as a URL and converted to a hyperlink in the IS. It is possible to have {{ in a text-before and }} in the corresponding text-after, but this will only work when the element contains neither sub-elements nor }} character sequences (which would be unusual in a URL). Peritexts can contain passages “guarded” by an attribute name, such as:
::@attribName[Some text containing exactly one @.]
  
 
Such guarded passages in peritexts are included in the resulting IS only if the guarding attribute is present on the element to which the peritext is applied. Otherwise, the entire guarded passage is omitted. When the passage is included, the actual value of the attribute is inserted in place of the @.
 
It is possible to use xmlns as an attribute to refer to the namespace-uri of an element. The guarded passage is then included only if the element belongs to a namespace.
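
The behaviour of guarded passages can be illustrated with a small Python sketch. It is not the article's XSLT code: the function name <code>expand_guarded</code> and the regular expression are assumptions made for the example, and the sketch further assumes that a guarded passage never contains a closing bracket.

```python
import re

# A guarded passage has the form @name[ ... ].
# Assumption for this sketch: the passage itself contains no "]".
GUARDED = re.compile(r"@([A-Za-z_][\w.-]*)\[([^\]]*)\]")

def expand_guarded(peritext, attributes, namespace_uri=None):
    """Keep a guarded passage only if its guarding attribute is present."""
    values = dict(attributes)
    if namespace_uri:
        # "xmlns" acts as a pseudo-attribute holding the element's namespace URI.
        values["xmlns"] = namespace_uri

    def substitute(match):
        name, passage = match.group(1), match.group(2)
        if name not in values:
            return ""  # guard absent: the entire passage is omitted
        return passage.replace("@", values[name])  # insert the attribute value

    return GUARDED.sub(substitute, peritext)

peritext = ('This document tells a tiny story.'
            '@xmlns[ The document belongs to the XML namespace "@".]'
            '@author[ The author of this story is @.]')
print(expand_guarded(peritext, {"author": "Bram Stocker"}))
print(expand_guarded(peritext, {"author": "Bram Stocker"}, "http://ts.org"))
```

In the first call the element has no namespace, so the passage guarded by xmlns disappears entirely; in the second call both passages are kept and each @ is replaced by the corresponding value.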
  
 
===Examples===

Here is the ISS file used for our examples. It is intended for a top-level element of story, and should thus be named story.iss.xml and reside in the same directory as the generic stylesheet:
 
<small>
<source lang="xml">
<?xml version="1.0"?>
<iss xmlns="http://grds.ebsi.umontreal.ca/ns/ISS/">
<rule paths="story"
  text-before="This document tells a tiny story.@xmlns[ The document
    belongs to the XML namespace &quot;@&quot; (if you are not
    familiar with XML namespaces, you can read about them at
    {{http://www.w3.org/TR/REC-xml-names/}}).]@author[ The author of
    this story is @.]"
  text-after="End of the tiny story."/>
<rule paths="para" text-before="A bit of the story: "
  text-after=""/>
<rule paths="person" text-before="The person named "
  text-after=" @key[{{http://en.wikipedia.org/wiki/@}} ]"/>
<rule paths="place" text-before="The place named "
  text-after=" @key[{{http://en.wikipedia.org/wiki/@}} ]"/>
</iss>
</source>
</small>
  
 
Example 1 is the following XML document:
 
[[Fichier:CIDE (2009) Marcoux fig 1.jpg|center|400px|thumb|]]

Note that the text contributed by peritexts is typeset in italics and the text contributed by the document is typeset in normal font on a blue background. This is in keeping with the philosophy of IS, which demands that the origin of all text in the IS of a document be clearly identifiable.
 
Example 2 uses more of the richness allowed by the model:
 
<small>
<source lang="xml">
<?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="ISG.xsl" ?>
<story author="Bram Stocker" xmlns="http://ts.org">
<para>
<person key="Bluebeard">Barbe-Bleue</person> went to
<place key="Transylvania">Transylvania</place>. There, he met
<person>Dracula</person>.</para>
<para>He did not like <person>Dracula</person>. So he decided
to go back to <place>France</place>.</para>
</story>
</source>
</small>
  
 
This time, the resulting IS is:
 
 
Example 3 will illustrate exception handling. In the document of  Example 1, we add to some element (story) an attribute (year) that is  not mentioned anywhere in the peritexts of that element. This is considered to be an “unknown attribute:”
 
<small>
<source lang="xml">
<?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="ISG.xsl" ?>
<story author="Bram Stocker" year="1897">
<para><person>Dracula</person> went to France. There, he met
<person>Barbe-Bleue</person>.</para>
</story>
</source>
</small>
 
The resulting IS is:
 

==Structure of the generic stylesheet==
 
 
The general structure of the generic stylesheet (ISG.xsl in the above examples) is as follows:
 
#Global variables initialization.
#Overall HTML structure template, matching /.
#Templates for text-only elements.
#Templates for all other elements.
#Templates for text nodes (PCDATA).
#Called templates for finding the best matching rule.
#Called templates for processing attributes.
#Called templates for processing hyperlinks.
#Called templates for exception handling.
  
 
The rules contained in the model-specific ISS file genID.iss.xml, where genID is the generic ID of the top-level element of the document, are read in Part 1 and placed in a global variable. Part 2 produces the overall HTML structure of the output, including an internal CSS stylesheet.
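
As a rough illustration of what Part 1 has to accomplish, the following Python sketch derives the ISS file name from the top-level element and loads the rules into a list. The helper names <code>iss_file_name</code> and <code>read_rules</code> are hypothetical, the sketch takes the generic ID to be the local name of the element, and the sample rule is invented for the example; the actual stylesheet does this in XSLT 1.0.

```python
import xml.etree.ElementTree as ET

ISS_NS = "http://grds.ebsi.umontreal.ca/ns/ISS/"

def iss_file_name(document_root):
    """Name of the ISS file for a document, e.g. story -> story.iss.xml."""
    local_name = document_root.tag.rsplit("}", 1)[-1]  # drop "{namespace}" if present
    return local_name + ".iss.xml"

def read_rules(iss_text):
    """Return the rules of an ISS document as (paths, text-before, text-after) tuples."""
    root = ET.fromstring(iss_text)
    rules = []
    for rule in root.findall("{%s}rule" % ISS_NS):
        paths = rule.get("paths").split()  # paths are space-delimited
        rules.append((paths, rule.get("text-before"), rule.get("text-after")))
    return rules

# Invented sample rule, for illustration only.
SAMPLE_ISS = """<iss xmlns="http://grds.ebsi.umontreal.ca/ns/ISS/">
<rule paths="person place" text-before="The entity named " text-after=""/>
</iss>"""

print(iss_file_name(ET.fromstring('<story xmlns="http://ts.org"/>')))
print(read_rules(SAMPLE_ISS))
```

Keeping the rules in file order matters, because that order is used to break ties between equally good matches.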

Parts 3, 4, and 5 are actually pairs of templates: one for block formatting and one for flowed formatting. The indentation heuristic mentioned earlier is realized by the templates in Part 4; it determines how blocks are indented relative to one another and at which level the formatting should change from block to flowed.

The templates in Part 6 are used to determine the rule, in the ISS file, that “best” matches an element. As sketched earlier, a rule is a best match for an element E iff one of its paths matches E (i.e., is a suffix of E’s ancestral line) and no other rule specifies a longer path matching E. If two or more rules are best matches for an element, the first one (in ISS file order) is chosen.
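
The selection criterion can be made concrete with a short Python sketch. It is an illustration rather than the stylesheet's code: the function name <code>best_rule</code> is hypothetical, paths are assumed here to be written as slash-separated element names, and the element names in the sample rules are invented.

```python
def best_rule(ancestral_line, rules):
    """Pick the rule with the longest path that is a suffix of the ancestral line.

    ancestral_line: element names from the root down to the element itself.
    rules: (paths, text_before, text_after) tuples in ISS file order.
    """
    best, best_length = None, 0
    for rule in rules:
        for path in rule[0]:
            steps = path.split("/")  # assumed path syntax: names separated by "/"
            is_suffix = (len(steps) <= len(ancestral_line)
                         and ancestral_line[-len(steps):] == steps)
            # Strict comparison: on a tie, the earlier rule in file order is kept.
            if is_suffix and len(steps) > best_length:
                best, best_length = rule, len(steps)
    return best

# Invented rules, for illustration only.
rules = [
    (["title"], "Title: ", ""),
    (["chapter/title"], "Chapter title: ", ""),
    (["section/title", "chapter/title"], "Heading: ", ""),
]
print(best_rule(["book", "chapter", "title"], rules))
print(best_rule(["book", "title"], rules))
```

The first call has two rules with a matching path of length two, so the one appearing first in the list wins; the second call falls back to the shorter path, which is how local elements receive different peritexts from their global namesakes.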
 
 
The templates in Parts 7 and 8 process attributes and hyperlinks, respectively. Processing consists essentially in recursive search-replace of various delimiters and placeholders. The templates in Part 9 are called by those of Parts 3 and 4 to verify the presence of exceptions and, if needed, include the appropriate warnings in the produced IS.
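
To give an idea of what a recursive search-replace of delimiters looks like, here is a Python sketch for the hyperlink delimiters. It is only an analogy to the XSLT templates: the function name <code>link_hyperlinks</code> is hypothetical, and the choices to escape the text as HTML and to leave an unmatched delimiter untouched are assumptions of the sketch.

```python
from html import escape

def link_hyperlinks(text):
    """Turn each {{url}} into an HTML link, processing the rest recursively."""
    start = text.find("{{")
    if start == -1:
        return escape(text)  # no delimiter left: plain text
    end = text.find("}}", start + 2)
    if end == -1:
        return escape(text)  # unmatched "{{": left as plain text in this sketch
    url = text[start + 2:end]
    link = '<a href="%s">%s</a>' % (escape(url, quote=True), escape(url))
    return escape(text[:start]) + link + link_hyperlinks(text[end + 2:])

print(link_hyperlinks("See {{http://en.wikipedia.org/wiki/Bluebeard}} for details."))
```

Each call handles the first pair of delimiters and recurses on the remainder, the usual way to express repeated string replacement in a language without loops or mutable variables such as XSLT 1.0.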
  
==Discussion==
 
 
The examples illustrate that peritexts can be very long. It is an essential feature of IS that there be no limit on their length. They should not be constrained lexically either. However, this is not entirely the case in the current implementation. Indeed, it is currently not possible to include some of the delimiters and placeholders for attribute guarded-passages and hyperlinks as data in the peritexts (and, in a few cases, in element content). One possible improvement would thus be to define and implement conventions for allowing all delimiters and placeholders to be included as data in peritexts (and element content).
 
 
A related issue is the validation of the syntax used for attribute guarded-passages and hyperlinks in peritexts. At the moment, no validation or error detection is performed. While this will never cause the abnormal termination of the transformation, it could yield unexpected results. Another possible improvement would thus be to implement full syntactic validation of the peritexts.
 
Certain delimiters are used internally as placeholders in text variables during processing. Their use relies implicitly on certain character sequences not occurring as textual content in the processed document. This should be replaced by more robust mechanisms.
 
 
One of the challenges of developing a generic stylesheet in XSLT 1.0 is  to maintain a one-pass approach. Solving the above-mentioned weaknesses would without doubt make this challenge even bigger. Switching to a two-pass approach may thus be an attractive avenue for future developments.
 
Adopting a two-pass approach can be done in essentially two ways: an external pipelining mechanism (such as XProc, <small><nowiki><</nowiki>http://www.w3.org/TR/xproc/<nowiki>></nowiki></small>) can be used with XSLT 1.0, or multiple passes can be handled internally in XSLT 2.0, where temporary trees allow some pipeline-like processing (XSLT 1.0 needs a processor-specific extension such as the node-set function for this). In both cases, browser integration could be non-trivial. One interesting way to exploit an external pipelining mechanism would be to have a generic stylesheet generate a model-specific stylesheet from the IS specification, then apply the generated stylesheet to the document instance to generate its IS.

Another functionality that could benefit from the enhanced possibilities of multi-pass / XSLT 2.0 processing is the automatic indentation of the output. Right now, the heuristic used is fairly simple, and it can break down even on simple cases. A more sophisticated and robust heuristic should thus be developed, and this would likely be easier with multi-pass / XSLT 2.0 processing.

A question that needs to be investigated through experimentation is that of determining how much of the indentation should be automatic. In [{{CIDE lien citation|1}}], indentation was specified explicitly in a conventional manner in the peritexts.
 
 
Let us now consider the foreseeable evolutions of the IS framework and how they could impact the IS generation mechanism. Consider the output of Example 1. As textual content, it includes the following passage:
 
::''There, he met The person named Barbe-Bleue''
  
 
Note that the article ''The'' has been capitalized. Why? It comes from a text-before segment that is sometimes located at the beginning of a sentence, where capitalization is appropriate. But capitalization in the middle of a sentence (as in Example 1) is inappropriate. The source of the problem is that, in the current framework, the same text-before segment must be used consistently, regardless of its position in a sentence.
 
Remember that in IS, the focus is not on presentation but on meaning. Since the problem at hand only affects presentation and does not hinder comprehension, it need not be considered major. Moreover, it can be alleviated by various devices, such as writing the peritext all in capitals. With Example 1, this gives the following output, which, though still unusual, is not as strange-looking as the original output:
  
  
 +
[[Fichier:CIDE (2009) Marcoux fig 2.jpg|center|400px|thumb|]]
  
 
Thus, it is not clear that the framework needs to be modified to accommodate peritexts that vary according to their position in a sentence.
 
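
The capitalization issue and the all-capitals workaround discussed above can be reproduced with a minimal sketch. It is written in Python rather than XSLT, and the element names (<code>sentence</code>, <code>person</code>) and peritexts are assumptions modeled on the quoted passage, not the actual markup or ISS of Example 1. The function only implements the basic definition of IS: text-before, content, and text-after concatenated in document order.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modeled on the passage quoted above.
DOC = "<sentence>There, he met <person>Barbe-Bleue</person></sentence>"


def generate_is(node, peritexts):
    """Concatenate text-before, content, and text-after in document order."""
    before, after = peritexts.get(node.tag, ("", ""))
    parts = [before, node.text or ""]
    for child in node:
        parts.append(generate_is(child, peritexts))
        parts.append(child.tail or "")
    parts.append(after)
    return "".join(parts)


root = ET.fromstring(DOC)

# One text-before per element: it is reused wherever <person> occurs,
# whether at the start or in the middle of a sentence.
mixed_case = {"person": ("The person named ", "")}
print(generate_is(root, mixed_case))
# There, he met The person named Barbe-Bleue

# Workaround mentioned in the text: write the peritext all in capitals.
all_caps = {"person": ("THE PERSON NAMED ", "")}
print(generate_is(root, all_caps))
# There, he met THE PERSON NAMED Barbe-Bleue
```

The workaround needs no change to the generation function, only to the peritexts, which is consistent with the view that the framework itself may not need to be modified.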
 
  
 
Other possible extensions along the same lines would include peritexts that vary with the position of the element relative to its siblings, with the number of children of the element, and with the grammatical gender of a word or expression in the content of some element or attribute. Clearly, adding any such extension to IS would complicate the IS-generation mechanism. For one thing, it might require the inclusion of additional peritexts in the IS specification of a model. Then, those additional peritexts would have to be appropriately processed during IS-generation. We believe that, in all cases, experimentation should be used to determine whether an extension is truly necessary or if some workaround without extension is possible. We think extreme parsimony is of utmost importance for the evolution of IS, because the inclusion of overly powerful mechanisms could severely impair the explanatory power of the approach.
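
To make the cost of such an extension concrete, here is a hypothetical Python sketch of peritexts that vary with the position of an element among its siblings. Nothing in it comes from the actual implementation: the keys <code>before_first</code> and <code>before_other</code>, the sample list, and the fallback rule are assumptions. It shows the two complications named above: the specification needs additional peritexts, and generation needs extra logic to select among them.

```python
import xml.etree.ElementTree as ET

# Hypothetical document instance.
DOC = "<list><item>bread</item><item>milk</item><item>eggs</item></list>"

# Hypothetical extended specification: "item" needs two text-before variants.
PERITEXTS = {
    "list": {"before": "The list contains: ", "after": "."},
    "item": {"before_first": "", "before_other": ", then ", "after": ""},
}


def text_before(spec, position):
    """Pick the text-before variant; fall back to the single "before" entry."""
    key = "before_first" if position == 0 else "before_other"
    return spec.get(key, spec.get("before", ""))


def generate_is(node, position=0):
    """Generate the IS of a node; position is its index among element siblings."""
    spec = PERITEXTS.get(node.tag, {})
    parts = [text_before(spec, position), node.text or ""]
    for index, child in enumerate(node):
        parts.append(generate_is(child, index))
        parts.append(child.tail or "")
    parts.append(spec.get("after", ""))
    return "".join(parts)


result = generate_is(ET.fromstring(DOC))
print(result)
# The list contains: bread, then milk, then eggs.
```

Even this small variant doubles the peritexts for one element and adds a selection rule, which illustrates why each extension should be justified by experimentation first.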
  
  
==Conclusion==
In this article, we presented a complete implementation, in XSLT 1.0, of the intertextual semantics (IS) generation mechanism for XML documents. The implementation is model-independent, in that a generic XSLT stylesheet reads the peritexts from an IS specification file, an XML document giving the IS specification (ISS) applicable to the document being processed. The implementation handles attributes as described in [{{CIDE lien citation|2}}], hyperlinks (in peritexts or as attribute or element content) as described in [{{CIDE lien citation|1}}], and local element definitions (in the sense of W3C schemas), also as described in [{{CIDE lien citation|1}}]. In addition, it performs indentation of the IS produced (along the same lines as [{{CIDE lien citation|3}}], but more elaborate) for increased readability, and handles exceptions, that is, elements for which no peritexts are given in the IS specification or attributes unexpected by the peritexts.
 
 
After describing the format adopted for ISS files, we gave examples illustrating the functionalities of the implementation, then outlined the structure of the generic stylesheet. Finally, we discussed various aspects of the implementation, possible improvements, and the impact that foreseeable generalizations of the IS framework might have on the IS generation mechanism.
 
The current version of the stylesheet is available through <http://grds.ebsi.umontreal.ca/>. It is published under the Creative Commons “Attribution-Noncommercial-Share Alike 2.5 Canada” license <http://creativecommons.org/licenses/by-nc-sa/2.5/ca/>. We warmly encourage readers to experiment with it, look at the examples, write IS specifications for their models, either extant or under development, and send comments and suggestions.
 
   
 
   
  
==References==

{{CIDE biblio
 |id=1
 |texte= Marcoux, Yves. “A natural-language approach to modeling: Why is some XML so difficult to write?” Proceedings of Extreme Markup Languages 2006.
}}
{{CIDE biblio
 |id=2
 |texte= Marcoux, Yves; Rizkallah, Élias. “Exploring intertextual semantics: a reflection on attributes and optionality.” Proceedings of Extreme Markup Languages 2007.
}}
{{CIDE biblio
 |id=3
 |texte= Marcoux, Yves; Rizkallah, Élias. “Experience with the use of peritexts to support modeler-author communication in a structured-document system.” Proceedings of SIGDOC 2007.
}}
{{CIDE biblio
 |id=4
 |texte= Marcoux, Yves; Rizkallah, Élias. “Intertextual semantics: a semantics for information design.” Journal of the American Society for Information Science & Technology, Perspectives issue on design. September 2009, in press.
}}
{{CIDE biblio
 |id=5
 |texte= Smedslund, J. Dialogues about a new psychology. Chagrin Falls, Ohio: Taos Institute. 2004.
}}
{{CIDE biblio
 |id=6
 |auteur=Michael Sperberg-McQueen{{!}}Sperberg-McQueen, C. M.
 |texte=, Huitfeldt, C., & Renear, A. “Meaning and interpretation of markup.” Markup Languages: Theory and Practice 2, 3 (2000), 215–234.
}}
{{CIDE biblio
 |id=7
 |texte= Sperberg-McQueen, C. M., Dubin, D., Huitfeldt, C., & Renear, A. “Drawing inferences on the basis of markup.” In Proceedings of Extreme Markup Languages 2002 (Montreal, Canada, August 2002), B. T. Usdin and S. R. Newcomb, Eds.
}}
{{CIDE biblio
 |id=8
 |texte= Sperberg-McQueen, C. M. & Miller, E. “On mapping from colloquial XML to RDF using XSLT.” Proceedings of Extreme Markup Languages 2004.
}}
{{CIDE biblio
 |id=9
 |texte= Wierzbicka, A. Semantics, culture, and cognition: universal human concepts in culture-specific configurations. Oxford University Press. 1992.
}}
{{CIDE biblio
 |id=10
 |texte= Wittgenstein, L. Philosophical investigations. Oxford: Blackwell. 1953.
}}
  
  
  
 
{{Clr}}
 
[[Catégorie:article de conférence]]
 
[[Catégorie:Article avec PDF]]
 
__SHOWFACTBOX__

Current version as of 18 July 2016, 11:45

Intertextual semantics generation for structured documents: a complete implementation in XSLT

Title
Intertextual semantics generation for structured documents: a complete implementation in XSLT
Authors
Yves Marcoux.
Affiliations
GRDS, EBSI, Université de Montréal.
In
CIDE.12 (Montréal), 2009
PDF
CIDE (2009) Marcoux.pdf.pdf
Mots-clés 
Sémantique intertextuelle, documents structurés, langages de balisage, XML, XSLT, descriptions formelles de jeux de balises.
Keywords
Intertextual semantics, structured documents, markup languages, XML, XSLT, formal tag-set descriptions.
Résumé
Intertextual semantics (IS) [1] [4] assigns a natural-language meaning to marked-up documents. Whereas formal semantics aim to represent the meaning of documents for machines, IS targets humans. In the current form of the approach, the IS of a model (DTD, schema) is given by two peritexts associated with each element: a text-before and a text-after. The IS of a document is the concatenation of the peritexts and the element contents in document order. We present a complete implementation, in XSLT 1.0, of IS generation. The implementation handles attributes as described in [2], and hyperlinks and local elements as described in [1]. It also indents the output for better readability, as suggested in [3], and handles exceptions, namely unknown elements and attributes.