CIDE (2009) Marcoux: Difference between revisions

From CIDE
imported>Abdelhakim Aidene
(8 intermediate revisions by the same user not shown)
Line 51: Line 51:
  
 
==General approach==

As mentioned earlier, the implementation is model-independent: the same XSLT stylesheet is used to process any document. In principle, the association between elements and peritexts is determined by the model (DTD or schema) to which the document conforms. Knowing the model, the generic stylesheet can read an IS specification (ISS) file, which gives the peritexts for all elements, and compute the IS of the instance.
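
The idea of computing the IS of an instance can be illustrated outside XSLT. The following Python sketch is not the ISG.xsl stylesheet; it only shows the concatenation of peritexts and element content in document order. The peritexts, the sample document, and the simplification of keying peritexts by element name alone (instead of by path) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical peritexts, keyed by element name only. The real ISS format
# uses paths; keying by name is a simplification assumed by this sketch.
PERITEXTS = {
    "story": ("The following is a story. ", " End of the story."),
    "person": ("the person named ", ""),
}


def generate_is(element, peritexts):
    """Concatenate text-before, content, and text-after in document order."""
    before, after = peritexts.get(element.tag, ("", ""))
    parts = [before, element.text or ""]
    for child in element:
        parts.append(generate_is(child, peritexts))
        parts.append(child.tail or "")  # text that follows the child element
    parts.append(after)
    return "".join(parts)


doc = ET.fromstring("<story>He met <person>Dracula</person>.</story>")
print(generate_is(doc, PERITEXTS))
# prints: The following is a story. He met the person named Dracula. End of the story.
```

Elements without an entry simply contribute their content, which is a choice of this sketch and not the exception handling described later.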
  
Line 75: Line 76:

An ISS file is an XML document. All its elements and attributes belong to the specific namespace:

<small><nowiki><</nowiki>http://grds.ebsi.umontreal.ca/ns/ISS/<nowiki>></nowiki></small>
  
 
Its top-level element is an iss element. The content of that element is one or more rule elements. Each rule element is empty and has three mandatory attributes: paths, text-before, and text-after. The effect of a rule element is to assign the pair of peritexts text-before and text-after to the path or space-delimited paths given in paths.
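
To make the structure of an ISS file concrete, here is a minimal Python sketch that reads one and checks the constraints just stated. The sample rules, the iss prefix, and the peritext wording are invented for illustration. Since the exact serialization of the attributes is not shown here, the sketch accepts both namespace-qualified and unqualified attribute names, which is an assumption.

```python
import xml.etree.ElementTree as ET

ISS_NS = "http://grds.ebsi.umontreal.ca/ns/ISS/"

# Hypothetical ISS content: the prefix, paths, and peritexts are invented.
SAMPLE_ISS = f"""
<iss:iss xmlns:iss="{ISS_NS}">
  <iss:rule iss:paths="story"
            iss:text-before="The following is a story. "
            iss:text-after=" End of the story."/>
  <iss:rule iss:paths="person place"
            iss:text-before="the entity named "
            iss:text-after=""/>
</iss:iss>
"""


def _mandatory(rule, name):
    """Return a mandatory attribute, qualified or not (an assumption)."""
    value = rule.get(f"{{{ISS_NS}}}{name}", rule.get(name))
    if value is None:
        raise ValueError(f"rule is missing mandatory attribute {name}")
    return value


def load_rules(iss_text):
    """Parse an ISS document into a list of rules, in file order."""
    root = ET.fromstring(iss_text)
    if root.tag != f"{{{ISS_NS}}}iss":
        raise ValueError("top-level element must be iss in the ISS namespace")
    rules = [
        {
            "paths": _mandatory(rule, "paths").split(),  # space-delimited
            "before": _mandatory(rule, "text-before"),
            "after": _mandatory(rule, "text-after"),
        }
        for rule in root.findall(f"{{{ISS_NS}}}rule")
    ]
    if not rules:
        raise ValueError("an iss element must contain at least one rule")
    return rules


for rule in load_rules(SAMPLE_ISS):
    print(rule["paths"], repr(rule["before"]), repr(rule["after"]))
```

The second sample rule assigns the same pair of peritexts to two paths, which is the purpose of the space-delimited paths attribute.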
Line 82: Line 84:

The sequences {{ and }} in peritexts are hyperlink delimiters, i.e., what is between them is interpreted as a URL and converted to a hyperlink in the IS. It is possible to have {{ in a text-before and }} in the corresponding text-after, but this will only work when the element contains neither sub-elements nor }} character sequences (which would be unusual in a URL). Peritexts can contain passages “guarded” by an attribute name, such as:
::@attribName[Some text containing exactly one @.]
  
 
Such guarded passages in peritexts are included in the resulting IS only if the guarding attribute is present on the element to which the peritext is applied. Otherwise, the entire guarded passage is omitted. When the passage is included, the actual value of the attribute is inserted in place of the @.

It is possible to use xmlns as an attribute to refer to the namespace-uri of an element. The guarded passage is then included only if the element belongs to a namespace.
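
The behaviour of guarded passages and hyperlink delimiters can be sketched in a few lines of Python. This is an illustration of the conventions described above, not the stylesheet's own logic: the regular expressions, the function names, and the sample peritext are assumptions, and the sketch only handles the simple case where {{ and }} occur in the same string.

```python
import re
from html import escape

# @name[passage]: a passage guarded by the attribute called name.
GUARD = re.compile(r"@([A-Za-z_][\w.\-]*)\[([^\]]*)\]")
# {{url}}: a hyperlink, with both delimiters in the same string.
LINK = re.compile(r"\{\{(.*?)\}\}")


def expand_guards(peritext, attributes, namespace_uri=None):
    """Keep a guarded passage only if its guarding attribute is present."""
    values = dict(attributes)
    if namespace_uri:
        values["xmlns"] = namespace_uri  # xmlns stands for the namespace-uri

    def substitute(match):
        name, passage = match.group(1), match.group(2)
        if name not in values:
            return ""  # the entire guarded passage is omitted
        return passage.replace("@", values[name], 1)  # value replaces the @

    return GUARD.sub(substitute, peritext)


def link_urls(text):
    """Turn {{url}} into an HTML anchor; the rest of the text is untouched."""
    def to_anchor(match):
        url = escape(match.group(1), quote=True)
        return f'<a href="{url}">{url}</a>'

    return LINK.sub(to_anchor, text)


peritext = "a story@author[ written by @]@year[ in @]: "
print(expand_guards(peritext, {"author": "Bram Stocker"}))
print(expand_guards("an element@xmlns[ from @]", {}, "http://ts.org"))
print(link_urls("See {{http://ts.org}}."))
```

In the first call the year passage disappears because the element carries no year attribute, while the author passage is kept with the attribute value substituted for the @.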
 
 
  
 
===Examples===
Line 124: Line 125:

Note that ISG.xsl is the generic stylesheet and is assumed here to be in the same directory as the document. The resulting IS, as viewed in any XSLT 1.0-compliant web browser, is:

[[Fichier:CIDE (2009) Marcoux fig 1.jpg|center|400px|thumb|]]
  
 
Note that the text contributed by peritexts is typeset in italics and the text contributed by the document is typeset in normal font on blue
Line 137: Line 140:
 
<story author="Bram Stocker" xmlns="http://ts.org">
  <para>
    <person
      key="Bluebeard">Barbe-Bleue
    </person>
      went to
    <place key="Transylvania">
      Transylvania
    </place>
    . There, he met
    <person>
      Dracula
    </person>
  .</para>
  <para>
    He did not like
    <person>
      Dracula
    </person>
    . So he decided to go back to
    <place>
      France
    </place>.
  </para>
</story>
  
Line 163: Line 175:

Example 3 illustrates exception handling. In the document of Example 1, we add to some element (story) an attribute (year) that is not mentioned anywhere in the peritexts of that element. This is considered to be an “unknown attribute”:
  
<small>
<source lang="xml">
<?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="ISG.xsl" ?>
<story author="Bram Stocker" year="1897">
  <para>
    <person>
      Dracula
    </person>
      went to France. There, he met
    <person>
      Barbe-Bleue
    </person>.
  </para>
</story>
</source>
</small>

The resulting IS is:
  
Line 178: Line 199:

==Structure of the generic stylesheet==

The general structure of the generic stylesheet (ISG.xsl in the above examples) is as follows:

#Global variables initialization.
#Overall HTML structure template, matching /.
#Templates for text-only elements.
#Templates for all other elements.
#Templates for text nodes (PCDATA).
#Called templates for finding the best matching rule.
#Called templates for processing attributes.
#Called templates for processing hyperlinks.
#Called templates for exception handling.
  
 
The rules contained in the model-specific ISS file genID.iss.xml, where genID is the generic ID of the top-level element of the document, are read in Part 1 and placed in a global variable. Part 2 produces the overall HTML structure of the output, including an internal CSS stylesheet.

Parts 3, 4, and 5 are actually pairs of templates: one for block formatting and one for flowed formatting. The indentation heuristic, mentioned earlier, is implemented by the templates in Part 4; it determines how blocks are indented relative to one another and at which level the formatting changes from block to flowed.
 
 
 
The templates in Part 6 are used to determine the rule, in the ISS file, that “best” matches an element. As sketched earlier, a rule is a best match for an element E iff one of its paths matches E (i.e., is a suffix of E’s ancestral line) and no other rule specifies a longer path matching E. If two or more rules are best matches for an element, the first one (in ISS file order) is chosen.
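
The best-match definition above translates directly into a small search. The Python sketch below is an illustration of that definition, not a transcription of the Part 6 templates. It assumes that a path is written as element names separated by slashes, that path length is counted in steps, and that rules are dictionaries with a paths list; none of these details is fixed by the text above.

```python
def path_matches(path, ancestral_line):
    """A path matches E when it is a suffix of E's ancestral line."""
    steps = path.split("/")  # assumed path syntax: names separated by "/"
    return len(steps) <= len(ancestral_line) and ancestral_line[-len(steps):] == steps


def best_rule(ancestral_line, rules):
    """Return the rule with the longest matching path, or None.

    Rules are scanned in ISS file order and replaced only by a strictly
    longer match, so the first of several equally good rules wins.
    """
    best, best_length = None, 0
    for rule in rules:
        for path in rule["paths"]:
            length = len(path.split("/"))
            if length > best_length and path_matches(path, ancestral_line):
                best, best_length = rule, length
    return best


# Hypothetical rules, in file order.
rules = [
    {"paths": ["person"], "before": "the person named "},
    {"paths": ["para/person", "place"], "before": "the character named "},
    {"paths": ["para/person"], "before": "never chosen: same length, later rule "},
]

# Ancestral line of a person element inside para inside story.
print(best_rule(["story", "para", "person"], rules)["before"])
```

An element for which no rule matches yields None here; how such unknown elements are reported is the business of the exception-handling templates.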
  
Line 201: Line 223:

The examples illustrate that peritexts can be very long. It is an essential feature of IS that there be no limit on their length. They should not be constrained lexically either. However, this is not entirely the case in the current implementation: it is currently not possible to include some of the delimiters and placeholders for attribute-guarded passages and hyperlinks as data in the peritexts (and, in a few cases, in element content). One possible improvement would thus be to define and implement conventions for allowing all delimiters and placeholders to be included as data in peritexts (and element content).

A related issue is the validation of the syntax used for attribute-guarded passages and hyperlinks in peritexts. At the moment, no validation or error detection is performed. While this will never cause the abnormal termination of the transformation, it could yield unexpected results. Another possible improvement would thus be to implement full syntactic validation of the peritexts.
  
Line 207: Line 230:

One of the challenges of developing a generic stylesheet in XSLT 1.0 is to maintain a one-pass approach. Solving the above-mentioned weaknesses would no doubt make this challenge even greater. Switching to a two-pass approach may thus be an attractive avenue for future developments.

Adopting a two-pass approach can be done in essentially two ways: an external pipelining mechanism (such as XProc <small><nowiki><</nowiki>http://www.w3.org/TR/xproc/<nowiki>></nowiki></small>) can be used with XSLT 1.0, or multiple passes can be handled internally, in XSLT 2.0 through temporary trees or in XSLT 1.0 through the node-set extension function, either of which allows some pipeline-like processing. In both cases, browser integration could be non-trivial. One interesting way to exploit an external pipelining mechanism would be to have a generic stylesheet generate a model-specific stylesheet from the IS specification, then apply the generated stylesheet to the document instance to generate its IS. Another functionality that could benefit from the enhanced possibilities of multi-pass / XSLT 2.0 processing is the automatic indentation of the output. Right now, the heuristic used is fairly simple, and it can break down even on simple cases. A more sophisticated and robust heuristic should thus be developed, and this would likely be easier with multi-pass / XSLT 2.0 processing.
 
 
A question that needs to be investigated through experimentation is that of determining how much of the indentation should be automatic. In [{{CIDE lien citation|1}}], indentation was specified explicitly in a conventional manner in the peritexts.
  
 
Let us now consider the foreseeable evolutions of the IS framework and how they could impact the IS generation mechanism. Consider the output of Example 1. As textual content, it includes the following passage:
::''There, he met The person named Barbe-Bleue''

Note that the article “The” has been capitalized. Why? The answer is that it comes from a text-before segment that is sometimes located at the beginning of a sentence, where capitalization is appropriate. But capitalization in the middle of a sentence (as in Example 1) is inappropriate. The source of the problem is that, in the current framework, the same text-before segment must be used consistently, regardless of its position in a sentence.
Line 223: Line 244:

[[Fichier:CIDE (2009) Marcoux fig 2.jpg|center|400px|thumb|]]

Thus, it is not clear that the framework needs to be modified to accommodate peritexts that vary according to their position in a sentence.
 
  
 
Other possible extensions along the same lines would include peritexts that vary with the position of the element relative to its siblings, with the number of children of the element, and with the grammatical gender of a word or expression in the content of some element or attribute. Clearly, adding any such extension to IS would complicate the IS-generation mechanism. For one thing, it might require the inclusion of additional peritexts in the IS specification of a model. Then, those additional peritexts would have to be appropriately processed during IS generation. We believe that, in all cases, experimentation should be used to determine whether an extension is truly necessary or whether some workaround without extension is possible. We think extreme parsimony is of utmost importance for the evolution of IS, because the inclusion of overly powerful mechanisms could severely impair the explanatory power of the approach.

Latest revision as of 11:45, 18 July 2016

Intertextual semantics generation for structured documents: a complete implementation in XSLT


 
 

 
Title
Intertextual semantics generation for structured documents: a complete implementation in XSLT
Authors
Yves Marcoux.
Affiliations
GRDS, EBSI, Université de Montréal.
In
CIDE.12 (Montréal), 2009
PDF
CIDE (2009) Marcoux.pdf.pdf
Keywords
Intertextual semantics, structured documents, markup languages, XML, XSLT, formal tag-set descriptions.
Abstract
Intertextual semantics (IS) [1] [4] assigns a natural-language meaning to marked-up documents. Whereas formal semantics aim at representing the meaning of documents for machines, IS targets humans. In the current form of the approach, the IS of a model (DTD, schema) is given by two peritexts associated with each element: a text-before and a text-after. The IS of a document is the concatenation of the peritexts and element contents in document order. We present a complete implementation, in XSLT 1.0, of IS generation. The implementation handles attributes as described in [2], and hyperlinks and local elements as described in [1]. It also indents the output for better readability, as suggested in [3], and handles the exceptions raised by unknown elements and attributes.