DC 2010 Artist paper V1

Revision as of 18:32, 25 March 2010 by imported>Jacques Ducloy (Handling Wicri network consistency)
DC 2010
This article will be submitted to the DC 2010 conference.
DC 2010 Conference
Dublin Core Metadata Initiative
Pittsburgh,
20-22 October 2010.
Title
Metadata for science & innovation wikis networks
Abstract
This article presents the issues surrounding metadata in networks of semantic wikis.
Authors
Jacques Ducloy(i), Thierry Daunois(ii), Muriel Foulonneau(iii), Alice Hermann(iv), Jean-Charles Lamirel(v), Stéphane Sire(vi) and Christine Vanoirbeek(vi).

 

Introduction

Since March 25, 1995, when Ward Cunningham launched WikiWikiWeb, a collaborative web site devoted to software development, wikis have played an increasing role in the fields of scientific and technical information. This paper analyses the place of metadata in the organization of a wiki. When a research working group launches a small standalone wiki dealing with a clearly identified topic, metadata is not perceived as playing an important role. This perception changes with the size of an application (Wikipedia, for instance) or with its complexity (a network, for instance). We are starting a project in which we have to think about a large network of semantic wikis.

With Wikipedia reaching 3 million articles, a large number of which relate to scientific or technical topics, the need for metadata becomes ubiquitous. It is not surprising, then, that the Wikipedia Statistics for January 2010[1] report 259,000 templates and 552,000 categories. A page such as "animal"[2] demonstrates the cooperation of several kinds of specialists: computer scientists for complex templates (such as infoboxes), experts in communication for the readability of the display, information scientists for the “poly taxonomy” design, zoologists and palaeontologists. All these specialists share the same pages and can modify the related metadata. The success story of Wikipedia is also based on the consistency of the encyclopaedia and, therefore, on its metadata system.

However, the global architecture hosted by the Wikimedia Foundation is rather centralized: a multilingual family structured around the largest wiki, supplemented by several specialized wikis. At present, most wikis that can be found in research organizations are quite monolithic. But what happens when a community of scientists builds an editorial collection of scientific information distributed over a network of semantic wikis? We are only now discovering the extent of this problem in the Wicri project.

This article aims at identifying several metadata issues we faced when starting the Wicri network. WICRI is an acronym that stands for "WIkis for Communities in Research and Innovation". At present, Wicri is a demonstrator containing about sixty wikis; some of them are designed on a regional or institutional basis, others are related to scientific topics. Nevertheless, the knowledge architecture we must design is much the same as would be required for several thousand wikis. Metadata therefore plays a crucial and increasing role. Semantic wikis introduce a new generation of metadata, allowing knowledge modelling in an RDF framework, which is worth considering.

In this paper, we first introduce the Wicri network; then we present the initial technical choices we started with. Paths to be explored in the future are discussed from two points of view: that of a contributor facing the production of metadata, and that of the computer scientist developing new services.

Note
This article was written collaboratively (in the same way as we did for DC 2006)[d1]. It will be published in two versions: a traditional one on the web site of the conference, and a wicrified[3] one on the Artist wiki.

Introducing Wicri

The Wicri network was created in the framework of the Ticri mission (Technologies dealing with Information and Communication for Communities involved in Research and Innovation). This initiative was launched by the Lorraine representative of the Ministry in charge of research affairs. Ticri aims at disseminating the main results of research communities in order to promote partnerships between innovation actors, to encourage outreach, and to develop technology transfer in a multidisciplinary context.

The first concrete action of this mission was the installation of a demonstrator, Wicri, to show the interest of the wiki approach. Following a positive reception, Wicri is now becoming an infrastructure that relies on a network of semantic wikis.

This section introduces the Wicri network and its main options: compatibility with Wikipedia, networked collections, thematic and regional views, CRIS (Current Research Information System) organization, and the use of Semantic MediaWiki[4] (SMW).

Wicri, a network of wikis for research and innovation

Customizing Wikipedia for research and innovation

Wikipedia has demonstrated the interest of the wiki approach for building and disseminating common knowledge on a very large scale. Wikipedia thus brings a first answer to research needs (and we use this medium). Yet it is not sufficient to provide a global solution.

A first point concerns the suspicion with which many heads and decision makers in academic institutions regard Wikipedia. Transparency of contributions and validity assessment are necessary conditions for their support. As a result, such a wiki infrastructure must be driven by institutional entities in order to manage registration processes. In other words, Wikipedia's authors can be anonymous as long as their bibliographic references are significant and link to explicitly named people. But the academic communities are the ones producing the knowledge that Wikipedia could use. In many cases, this knowledge is still in progress and many assumptions are only hypotheses. For these reasons we think that the authors must be clearly identified, and anonymous contributions are forbidden. Institutions must also find an advantage in investing in the wiki approach, and visibility becomes a strong parameter. The network approach allows each partner to promote its own wiki site and its own visibility.

Our first experience has also highlighted several parameters that may be more fundamental, from an editorial point of view.

For instance, publishing new results of research activities is not compatible with Wikipedia's practices: Wikipedia's contributors must present information attested by external references. New results, by contrast, must be written under the control (or at least the moderation) of scientific committees. We are testing this approach with a periodical (AMETIST) that will soon be published in the network. Publishing authored articles implies very constrained ways of modifying the original text, i.e. limited to adding links to articles explaining a particular topic, or to a discussion area.

Finally, a networked framework allows several editorial strategies to be managed, mainly institutional, thematic and regional. As a first step, we built a small demonstrator with several institutional wikis. The limits appeared almost immediately: if several organizations are working on the same topic, this topic must be developed on a thematic wiki. We therefore quickly introduced several wikis with a thematic or regional design.

As a consequence, a given piece of information can be described in different ways on several wikis. A small team, mainly 3 people in the same office, has operated the demonstrator. As soon as more than one person worked on a topic, many consistency problems arose. The need for effective handling of metadata appeared almost immediately.

Different classes of wikis

The WICRI network accepts two main classes of wikis.

  • Institutional wikis: an institutional wiki is handled by an organization. In this paper, we will often use a two-part naming scheme for institutional wikis: region, then acronym. For instance, Lorraine/SGE stands for the research cluster SGE (environmental sciences and engineering, Sciences et génie de l'environnement in French) in the Lorraine area. For wikis related to scientific working groups, the first part is a code identifying the global theme; for instance, ICT/Artist is the wiki of the Artist workgroup, dealing with Information and Communication Technologies.
  • Common wikis: the design of a common wiki is set up by the global Wicri community. Whether or not it is managed by an organization, it fully shares the common rules and is moderated by independent scientific committees. In this paper we will use, for all common wikis, a naming scheme with Wicri as the first part, as in Wicri/Lorraine or Wicri/Water.

Being directly related to an identified organization, institutional wikis may have specific rules that differ from the common rules of the Wicri web. For example, an institutional wiki could be open to anonymous contributions or, on the contrary, strictly limited. The editorial line can also differ strongly from Wicri's.

In a multilingual approach, most wikis are in fact families of wikis (i.e. a set of wikis, one for each language, connected by interwiki links). In this paper, we will use the notation Wicri/Water(fr) to define the French component of the family, and Wicri/Water(en) for the English one.

Several families include a "private" wiki, where registration is required not only to contribute, but also to read what is written. This happens mainly in families of institutional wikis. For instance, ICT/Artist(priv) refers to the "private" wiki of the Artist family. Such a private wiki could be used by scientific committees for a refereeing process.
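The naming notation used in this paper can be illustrated with a small parser. This is only a sketch: the names parse_wiki_name and WikiName are hypothetical and not part of any Wicri tool, and the code assumes the "first part/second part(variant)" shape shown above, with a first part equal to Wicri denoting a common wiki.

```python
import re
from typing import NamedTuple, Optional


class WikiName(NamedTuple):
    prefix: str              # region, thematic code, or "Wicri"
    name: str                # acronym or topic
    variant: Optional[str]   # language code or "priv", when given
    is_common: bool          # True when the first part is "Wicri"


# Assumed pattern for the notation of this paper (hypothetical helper).
_PATTERN = re.compile(r"^([^/()]+)/([^/()]+?)(?:\(([^()]+)\))?$")


def parse_wiki_name(text: str) -> WikiName:
    """Split a name such as 'Wicri/Water(fr)' into its parts."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a wiki name: {text!r}")
    prefix, name, variant = match.groups()
    return WikiName(prefix, name, variant, prefix.lower() == "wicri")


print(parse_wiki_name("Wicri/Water(fr)"))
print(parse_wiki_name("Lorraine/SGE"))
```

With such a structure, the members of a family (for instance the fr, en and priv variants) share the same prefix and name and differ only in the variant field.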

The current Wicri network

At the beginning of 2010, the Wicri network contains the following common wikis:

  • A first set of common wikis is designed on a regional basis, such as Wicri/Lorraine or Wicri/Alsace.
  • Another set of common wikis is devoted to thematic fields. At this time, one of them, Wicri/Ticri, is related to Information Science & Technology (a DCMI portal is included). Another part deals with the environment and contains 4 wiki families: Wicri/Water, Wicri/Woods, Wicri/Biomass and Wicri/UrbanSoils. They are also organized with information system items (such as program committees) and editorial contents (scientific articles, scientific surveys).

A few wikis have been designed to ensure the global consistency of the network.

  • The most visible is Wicri/Wicri, which gives a global view of the network: all topics must appear and link to more detailed pages or desks in other wikis.
  • Another wiki, Wicri/Media, is an image repository (and plays the same role as Commons in the Wikipedia family). It can also host PDF documents, but we are looking for a better solution, using Fedora for instance.
  • Related to metadata handling, a wiki named Wicri/Base contains templates and semantic items which can be used in all other wikis.
Fig 1. The current Wicri network (a subset)

Most institutional wikis have relationships with a regional wiki and one or two thematic wikis. The whole network could be built and explored along two main axes: by theme (which could be structured by corresponding ontologies or taxonomies) and as an information system.

Wicri: a networked Current Research Information System

Fig 2. CRIS on wiki

A Current Research Information System, commonly known as "CRIS", is any information tool dedicated to provide access to and disseminate research information, such as People, Projects, Organizations, Results (publications, patents and products), Facilities, and Equipment.[5]

The CRIS approach is supported by the European Commission through the CERIF (Common European Research Information Format) recommendation[6]. This way of working is spreading worldwide, for instance at the USDA (United States Department of Agriculture)[7].

Such a system could play a very strategic role in the Wicri network, acting as a kind of skeleton. This approach resembles those of Jeffery [j2] and Erbach [e2], who would like to merge organization-related items (CRIS) with open archives in order to produce an e-Science infrastructure[j1].

Wicri would like to go further in order to obtain a highly detailed and understandable CRIS, while using the editorial facilities of wikis to provide a human-readable summary. In this perspective, semantic wikis could provide a technical basis.

Initial technical issues

The Wicri project aims at setting up an operational set of services. At present, Wicri is a demonstrator which is becoming a digital infrastructure. At this point in the project, we cannot address research needs and promote pragmatic solutions at the same time. Consequently, we first try to deal with the reality of Wicri's web environment, following Zack Rosen's advice: "Researchers need to stop thinking of themselves as researchers and start thinking of themselves as implementors."[8]

When starting the Wicri project, we first had to choose the wiki engine. A priority of this project is to allow as many researchers as possible to disseminate their results to as many potentially interested actors as possible. We have therefore chosen to be fully compatible with Wikipedia and to use MediaWiki[9] as the wiki engine of the Wicri network. This engine is used by Wikipedia and is becoming very popular in research and innovation contexts. A strong argument was the possibility of using Semantic MediaWiki, an extension that enables wiki users to semantically annotate wiki pages, based on which the wiki contents can be browsed, searched, and reused in novel ways[k1].

Our early experience showed that a substantial investment is needed to achieve a level of functionality comparable to that of Wikipedia. This implies supplementing MediaWiki with the PHP extensions and templates that are commonly used in Wikipedia, so that an occasional contributor is not disoriented when moving from Wikipedia to the Wicri network. The wiki Wicri/Base has been specifically created to manage the collection of templates (and also semantic items) needed throughout the network.

An important note: the choice of MediaWiki is not exclusive. The network can theoretically support different engines but each will require a specific investment comparable to what has been done for MediaWiki. Due to the small size of the current Wicri team, we have limited our choice, at least temporarily, to a single engine type.

Writing a networked hypertext with formulas and metadata

In most content management systems, which were designed “before blogs and wikis”, a clear barrier exists between editing contents, programming and managing metadata. As a result, a scientist tends to write mainly short and isolated papers. In many research fields, digital libraries are an arrangement of isolated papers stored in archives and various databases. Can we expect that the global consistency of a knowledge domain will be provided to a human reader only by ontologies and semantic properties or, why not, as if by a magic wand, by folksonomies?

In quite the opposite way, on a wiki, any actor can handle all these activities (from programming to writing contents) on any page and at any time. Wikipedia acts as a digital library in which a portal is directly and fully designed by the experts of a given scientific area. Authors use categories and other metadata as a support, but they write pages and the associated metadata at the same time. They do not write one or several papers but a "human brain designed" hypertext.

In this section, we focus on several aspects of writing a scientific, readable and networked hypertext: handling scientific objects and knowledge, and presenting a given piece of information in different ways, in different contexts, for different audiences.

Semantic wikis for scientific objects

Scientists and engineers usually work with many technical objects, such as formulas, drawings and 3D images, and not only with texts. For this purpose, they must very often use formal notations, and not only WYSIWYG interfaces. This way of working resembles entering knowledge items or metadata. This section gives some first requirements for handling scientific objects in a wiki.

In other words, Wicri's pages must carry many texts containing scientific results described with scientific objects. MediaWiki offers a first set of features for formulas and drawings. Some of them are very easy to install, for instance "imagemap"[10]. However, our experience already shows some difficulties. For instance, installing the LaTeX extensions requires LaTeX to be installed on the system that hosts the wiki. Technical support thus appears to be mandatory, but we also need research and development activities.

For instance, MediaWiki supports SVG (Scalable Vector Graphics) quite poorly. A contributor can upload an SVG image, but this image is then converted into PNG format. At present, it is therefore difficult to manage interactions between text and images with the basic SVG facility. Life sciences give a good example of the interactions that could be needed in a scientific area. The Proteopedia project [h1] carries 3D images of molecular items such as proteins, RNA, DNA and other macromolecules[11]. The contributor can set up several kinds of interaction by using green links in the wiki text. These links interact with a Java applet (Jmol). If we want to generalize this approach, we will probably need more complete XML support.

The contributor will thus need a good understanding of structured markup languages. In such a context, handling the syntax of metadata or semantic items is not complex. The difficulties come from designing global knowledge in a collective way.

For instance, Wikipedia has implemented several sets of taxonomies, with a quite complete schema for living species. The implementation relies on a tree of categories and related templates (taxobox). This taxonomy is distributed over each language version, Wikimedia Commons (images) and Wikispecies. A comparison between these different wikis shows a multipurpose use of 3 classification schemes[12].

Regarding Semantic MediaWiki, we have found several wikis that mostly handle metadata dealing with the organization of science. For instance, semanticweb.org and openresearch.org provide a semantic metadata model for scientific events. We began by adapting this model and encountered several difficulties, due to the variety of situations in the different scientific communities and to the translation exercise (into French). We have also found some wikis whose purpose is to build or curate an ontology. But, until now, we have not found wikis that use ontologies in order to handle scientific data, objects or information.

As a remark, Semantic MediaWiki is not a universal solution for using wikis in scientific fields. For instance, SWiM [l1], a semantic wiki for mathematical knowledge management[13], handles mathematical formulas better than the LaTeX extension of SMW. In the future, Wicri will have to integrate several kinds of wiki engines, which implies robust handling of metadata.

Different writing in different contexts for different audiences

Most data will have to be presented several times on different wikis. For instance, each research project with several partners must be cited and commented on in the regional wiki of each partner, as well as in all relevant thematic wikis.

Even in the initial phase of the Wicri project, we have encountered a significant number of such cases. Here are 3 quite different examples: a city description, a scientific paper and a call for papers.

  • The city of Pittsburgh, where the DC 2010 conference will be held, appears on at least 3 wikis. On Wicri/Ticri, Pittsburgh is directly connected to DC 2010 and the corresponding page describes the main activities related to information science in this geographic area[14]. On Wicri/Water, we describe the confluence of the Allegheny and Monongahela rivers, which forms the Ohio River[15]. On Wicri/Wicri, we give general facts about the city and introduce commented links to the other pages[16]. These 3 pages relate to the same topic but display clearly distinct contents.
  • Carl Lagoze has written an article which is becoming very popular in French-speaking areas: Qu’est-ce qu’une bibliothèque numérique, au juste ? / What Is a Digital Library anymore, anyway? []. In ICT/Artist, the paper is integrated in the portal of the Ametist journal, in which it was first translated[17]. A copy has been made in Wicri/Ticri, as it is considered a reference paper for a wiki dealing with digital libraries[18]. Anchors and links sometimes differ from those on ICT/Artist. Since the introduction of this paper is of general interest and could reach a very large audience, this part of the article is also displayed on Wicri/Wicri[19].
  • Finally, we present a situation dealing with an ICT conference held in Lorraine. The call for papers is duplicated on two wikis, Wicri/Ticri and Wicri/Lorraine. Figure 3 shows different ways of managing the relation between this event and committee members. The event model of semanticweb.org is used, with the properties Has PC member and Has OC member. Paul Dupont, who works in Lorraine, is always annotated with the property Has PC member. On Wicri/Lorraine, John Smith is only linked to Wicri/Ticri through an interwiki mechanism ([[ticri.en:John Smith]]), because he has no author page on Wicri/Lorraine (and, up to now, SMW does not provide semantic links between different wikis).
<!-- it would be displayed in a thematic (Ticri) wiki as: -->

==Program Committee==
... 
* [[Has PC member::Paul Dupont]], Nancy (Fr)
* [[Has PC member::John Smith]], London (UK)
...
==Organizing Committee==
* [[wicri-lor.fr:Jean Durand|Jean Durand]]
...

<!-- it would be displayed in a regional (Lorraine) wiki as: -->

==Program Committee==
... 
* [[Has PC member::Paul Dupont]], Nancy (Fr)
* [[ticri.en:John Smith|John Smith]], London (UK)
...
==Organizing Committee==
* [[Has OC member::Jean Durand]]
...

Fig. 3. A part of a page relative to a conference happening in Nancy
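The rule illustrated in Fig. 3 can be sketched in a few lines: a committee member is annotated with a semantic property when a page for that person exists on the current wiki, and is otherwise reached through an interwiki link. The function member_link and its parameters are hypothetical names chosen for this illustration; only the property names and the ticri.en prefix come from the figure.

```python
def member_link(person, prop, local_pages, home_prefix):
    """Return the wikitext for one committee member.

    person      -- page name of the person
    prop        -- semantic property, e.g. "Has PC member"
    local_pages -- names of the author pages existing on the current wiki
    home_prefix -- interwiki prefix of the wiki holding the person's page
    """
    if person in local_pages:
        # The page exists locally: a semantic annotation is possible.
        return f"[[{prop}::{person}]]"
    # No local page: fall back to a plain interwiki link.
    return f"[[{home_prefix}:{person}|{person}]]"


# Assumed state of the regional (Lorraine) wiki in the example of Fig. 3.
lorraine_pages = {"Paul Dupont", "Jean Durand"}
for name in ("Paul Dupont", "John Smith"):
    print("* " + member_link(name, "Has PC member", lorraine_pages, "ticri.en"))
```

The two printed lines correspond to the Program Committee entries of the regional version in Fig. 3, without the city suffix.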

Managing network's consistency

A critical issue is managing the consistency of the network, which involves a large set of pages. This aspect can be illustrated by geographic items such as countries, regions, towns, etc.

Fig 4. Interlinks between geographic items

When a new city appears on a given wiki, the contributor should in theory preserve the connectivity of the networked hypertext. Figure 4 gives an example with the city of Nancy in an institutional wiki (Artist). The Nancy page on ICT/Artist must be linked with:

  • The Lorraine, France and Europe pages on the same wiki (which must be created if they do not exist).
  • The Nancy page on Wicri/Ticri, Wicri/Wicri, and so on.
  • In a multilingual context, this graph must be duplicated, taking care of translation (for instance, the page name for Lorraine would be "Lorraine (region)" in English, for disambiguation reasons).

For a better understanding by web users, this consistency needs to be explained by some text. We could provide automatic tools for the initial construction, but contributors must also be involved in writing explanations. Managing network consistency and the related metadata is thus a cooperative task involving both human contributors and computers.
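The part of this task that a tool could automate can be sketched as follows: given a new geographic page, list the links it should carry and flag the target pages that still have to be created. The PARENT table, the function required_links and the set of existing pages are assumptions made for this illustration (drawn from the Nancy example), not an existing Wicri service; the multilingual duplication is left out.

```python
# Hypothetical containment hierarchy for the Nancy example.
PARENT = {"Nancy": "Lorraine", "Lorraine": "France", "France": "Europe"}


def required_links(page, home_wiki, other_wikis, existing):
    """List (wiki, page, exists) triples for a new geographic page.

    existing is a set of (wiki, page) pairs that are already present.
    """
    links = []
    # Parent pages (region, country, continent) on the same wiki.
    parent = PARENT.get(page)
    while parent is not None:
        links.append((home_wiki, parent, (home_wiki, parent) in existing))
        parent = PARENT.get(parent)
    # The page on the same topic on the other wikis of the network.
    for wiki in other_wikis:
        links.append((wiki, page, (wiki, page) in existing))
    return links


existing = {("ICT/Artist", "France"), ("Wicri/Ticri", "Nancy")}
links = required_links("Nancy", "ICT/Artist",
                       ["Wicri/Ticri", "Wicri/Wicri"], existing)
to_create = [(wiki, page) for wiki, page, exists in links if not exists]
print(to_create)
```

Such a tool only produces the skeleton of links; the explanatory text on each page remains the contributor's work.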

Metadata for authors and contributors

All these pages are mainly written by human contributors, not by computers. Computers can help in various ways but, in the end, pages are made by contributors. In a repository-based network, using OAI-PMH for example, consistency is ensured by computer protocols which share controlled metadata. In a wiki network, a contributor can write on many wikis and interact with metadata. Metadata thus plays a crucial role not only in programming activities but also in the authoring process.

Fig. 5. Metadata consistency in a network of wikis (W1...) vs repositories (R1...)

This section introduces the need for a new wiki for designing metadata items, together with its contents and its organization.

Introducing Wicri/metadata

Almost any contributor may have to create new metadata in the Wicri network.

Here is a common situation in a researcher's life: writing a call for papers. The first sentence looks like: DCMI is pleased to announce that DC-2010 will take place in Pittsburgh. How should it be written in a semantic wiki, with the right properties?

According to the user manual of Semantic MediaWiki, introducing a new property seems very easy. The researcher just has to write something like this:

[[organizer::DCMI]] is pleased to announce that DC-2010 will take place in [[place::Pittsburgh]]

As soon as the author clicks the "Save page" button, the relations and, if needed, the properties are created. The true problem thus lies not in syntax but in semantics: how to choose and name a property? For instance, for the role of DCMI in the DC conference, we could write: organizer, has organizer, has global organizer, has local organizer, DC:contributor, dc:contributor, has dc:contributor, funded by, organized by, etc.
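To see which properties a page already uses, the annotations can be extracted from its wikitext. The following sketch assumes only the [[property::value]] form used in this paper, optionally followed by a |caption part; the regular expression and the function extract_annotations are illustrative and do not cover the full Semantic MediaWiki syntax.

```python
import re

# [[property::value]] or [[property::value|caption]] (simplified pattern).
ANNOTATION = re.compile(r"\[\[([^\[\]|]+?)::([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def extract_annotations(wikitext):
    """Return the (property, value) pairs annotated in a wikitext string."""
    return [(prop.strip(), value.strip())
            for prop, value in ANNOTATION.findall(wikitext)]


text = ("[[organizer::DCMI]] is pleased to announce that DC-2010 "
        "will take place in [[place::Pittsburgh]]")
print(extract_annotations(text))
```

Plain interwiki links such as [[ticri.en:John Smith|John Smith]] contain a single colon and are therefore not reported as annotations.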

A look at semanticweb.org illustrates this difficulty[20]. Its "Property namespace" contains 773 pages, of which 768 are real properties (5 are redirects). 277 pages are classified as "wanted properties" (without an explicit page). Looking for DC:creator, we found several variants. The preferred term is "Has author" (frequency 99). The most used term is "Author" (1058). The expression "Written by" appears 35 times. Finally, "Author of", "Content author" and "Creator" each appear once.
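The gap between the preferred term and actual usage can be made explicit with a short calculation on the frequencies quoted above. The variable names are chosen for this illustration; the counts are those given in the text.

```python
# Frequencies of the variants found for DC:creator on semanticweb.org.
usage = {
    "Has author": 99,       # preferred term
    "Author": 1058,
    "Written by": 35,
    "Author of": 1,
    "Content author": 1,
    "Creator": 1,
}

total = sum(usage.values())
most_used = max(usage, key=usage.get)
preferred_share = usage["Has author"] / total

print(total, most_used, f"{preferred_share:.1%}")
```

Under these figures, the preferred term accounts for fewer than one in ten of the observed uses.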

Thus the following aspects have to be addressed:

  1. How can someone know whether a property dealing with this situation exists in the semantic model of the wiki? In the Wicri project, the problem that we have pointed out for semanticweb.org is spread over a network of wikis.
  2. How can someone choose a name for a new property that is consistent with the existing ones?
  3. In a multilingual family of wikis, how should metadata items be translated?
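A first, very partial answer to questions 1 and 2 could be a lookup that compares a candidate name with the properties already declared across the network. The sketch below is an assumption-laden illustration: the normalization rules (ignoring case, a leading "has" or "is", and a "dc:" prefix), the function names and the sample property lists are all hypothetical, and translation (question 3) is not addressed.

```python
import re


def normalize(name):
    """Reduce a property name to a comparison key (assumed rules)."""
    key = name.lower().replace("_", " ")
    key = re.sub(r"^(has|is)\s+", "", key)   # drop a leading verb
    key = re.sub(r"^dc:", "", key)           # drop a Dublin Core prefix
    return re.sub(r"\s+", " ", key).strip()


def find_similar(candidate, properties_by_wiki):
    """Return the (wiki, property) pairs whose key matches the candidate."""
    key = normalize(candidate)
    return sorted((wiki, prop)
                  for wiki, props in properties_by_wiki.items()
                  for prop in props
                  if normalize(prop) == key)


# Hypothetical property lists for two wikis of the network.
known = {
    "Wicri/Ticri": ["Has organizer", "Has PC member"],
    "Wicri/Lorraine": ["organizer", "Has OC member"],
}
print(find_similar("organizer", known))
```

Such string matching only detects spelling variants; deciding whether two properties really mean the same thing remains a human, editorial decision.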

We propose to set up a wiki with an encyclopedic philosophy, dealing with metadata. There are some examples of wikis dedicated to metadata on the web. For instance, the DCMI site offers several wikis[e1]. But they are usually intended for specialists and often related to a particular schema. Here, we want to be understood by non-specialists[21] who have to deal with many topics at the same time.

Main lines for Wicri/metadata

Metadata are related to a model (possibly expressed through an ontology in a semantic wiki) that represents the structure of the wiki and the properties of wiki resources. Each wiki can be created with a different domain model (e.g. conference resources in the case of the Fuel Cell conference wiki[22], terminology resources in the case of the World Reference Base for soil resources). Moreover, some concepts may exist in different languages. As a result, different wikis may express close or similar concepts using different models. This limits navigation across the wiki network.

A specific wiki, called Wicri/Base[23], was created to provide common tools for the Wicri community, including presentation models or templates, particular metadata sets (e.g. Infobox laboratory) and metadata elements (e.g. Attribut:A pour ville, adapted from [1]).

Representing research resources

The wiki network is composed of resources of scientific communication. It relates to Current Research Information Systems (CRIS) as well as research repositories. The representation of resources is bounded by the general domain of research, including concepts which belong to CRIS, Knowledge Organization Systems used in the different research domains or created ad hoc, bibliographic formats such as MARC or the DCMI Scholarly Works Application Profile, dataset formatting models such as text formats (TEI, DocBook) and survey datasets (DDI), and educational formats such as LOM and the IMS-QTI application profile for assessment resources. Additionally, more general vocabularies are necessary to describe persons (e.g. FOAF) or Knowledge Organization Systems (e.g. SKOS).

Defining new properties to ensure interoperability with other semantic applications

The model used to build the wiki may also be different from the ones used in external applications or other wiki networks. For example, the WRB terminology [2] uses one model, whereas Agrowiki uses a different model [3]. A conference on the OpenResearch.org platform uses a specific model [4], whereas on Wicri, it uses another one [5].

Resources from ontology repositories can be used, such as Semanticweb.org [6] or Ontologypattern [7]. Similarly, the Watson system [8] allows existing concepts to be discovered on the Semantic Web. It is possible to import complete ontologies or vocabularies, such as FOAF.


However, metadata editors have to search specifically for existing properties, and sometimes they find close but not exactly equivalent properties. This raises the issue of defining the relations between concepts defined in different models.

The wiki as a metadata registry?

Until now, Wicri has chosen to define redirects (i.e. owl:sameAs relations) to concepts from ontology repositories. However, strict equivalence between two concepts is limiting. Ontology mapping requires richer relations to be encoded, such as the SKOS mapping properties [9] skos:exactMatch, skos:closeMatch, skos:broadMatch, skos:narrowMatch and skos:relatedMatch (see also Giunchiglia et al., 2007). Moreover, collaborative ontology mapping mechanisms (e.g. Correndo et al., 2008)[c2] should be available to the network, so that any contributor who creates a new metadata concept or identifies a relation between metadata concepts can enrich the system.
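As a minimal sketch of what recording such mappings could look like, the following code stores a relation between two concepts and restricts it to the five SKOS mapping properties listed above. The Mapping class and the example.org URIs are hypothetical; only the names of the SKOS properties come from the SKOS vocabulary, and the output is a single Turtle-style statement that assumes the skos: prefix is declared elsewhere.

```python
from dataclasses import dataclass

# The five SKOS mapping properties mentioned in the text.
SKOS_RELATIONS = {"exactMatch", "closeMatch", "broadMatch",
                  "narrowMatch", "relatedMatch"}


@dataclass(frozen=True)
class Mapping:
    source: str     # URI of the local concept
    relation: str   # one of SKOS_RELATIONS
    target: str     # URI of the external concept

    def __post_init__(self):
        if self.relation not in SKOS_RELATIONS:
            raise ValueError(f"not a SKOS mapping property: {self.relation}")

    def to_turtle(self):
        """Render the mapping as one Turtle-style statement."""
        return f"<{self.source}> skos:{self.relation} <{self.target}> ."


# Hypothetical URIs, for illustration only.
mapping = Mapping("http://example.org/wicri/A_pour_ville",
                  "closeMatch",
                  "http://example.org/other/Has_location_city")
print(mapping.to_turtle())
```

Compared with a redirect, the relation field lets a contributor state that two properties are close without claiming that they are identical.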

This should end up as a wiki-based metadata registry for the Wicri network, with some specific features though. The wiki architecture allows a mix of structured and unstructured content. Scientific concepts are defined not only by traditional definitions, but also through scientific literature, guidelines, etc. This is particularly important in a multilingual context, as we observed in the Wicri network as well as in other collaborative scientific platforms. A review of concepts used to describe e-assessment resources in the field of education (Sarre et al., 2010) shows that many concepts proposed as metadata for this domain are not fully specified. Some are found in metadata schemas, others are only defined in journal articles, guidelines, etc. It should therefore be possible to add concepts even outside the scope of a proper ontology. In addition, semantic wikis include some intelligence which can be useful for making inferences about the relations or potential relations between the concepts used in the network. The wiki network is not only an interface to a CRIS and research repositories; it also makes research content and scientific communication a building block of the Semantic Web by providing dereferenceable resources and reasoning mechanisms through a decentralized and collaborative environment.

Metadata for computers

The "wiki way of doing" puts the contributor at the heart of metadata handling. In this context, what could be the role of the computer? Our first impression is that we cannot expect real automation in the short term. However, several tools and approaches appear to be very interesting, though mainly for specific problems.

Networks and Distributed Wiki Applications

A major issue for a network of wikis is replication management. In the Wicri network, a given piece of data can appear on many pages of many wikis. What happens when this information must be modified? In the framework of Wicri, in order to examine our further development strategies, we have identified 5 classes of replication cases.

  1. Wiki replication. A given wiki, in its entirety, could be duplicated in a peer-to-peer network, distributed over several sites with a distributed replication mechanism [o1]. This feature is useful for technical reasons (a strategic wiki such as Wicri/wicri) or sometimes for political ones (a wiki that could bring visibility to several institutions). However, this feature concerns neither editorial replication nor metadata.
  2. Page replication. A set of pages is replicated on several wikis. This kind of facility is becoming available [c1], and could be very useful for invariant pages, such as templates related to semantic models. With the same kind of P2P mechanism as in the previous case, any change on any wiki is propagated to the other wikis. With the DSMW (Distributed Semantic MediaWiki) extension[24], this mechanism is driven by metadata (semantic properties).
  3. Paragraph replication. So far, we have not found an SMW extension able to extend the previous mechanism to the paragraph level. This need is quite ubiquitous in the Wicri network. In simple cases, a workaround that consists in creating a page template for each paragraph might work. In most cases, however, it cannot be used by a human contributor. For instance, we cannot ask an author to explicitly create one page for each bibliographic reference.
    From a metadata point of view, this case is interesting because it requires two mechanisms to be used simultaneously: data-centric and document-centric. Identifying pages to be replicated can be done through properties, in a pure data-centric RDF approach. Identifying a paragraph requires an explicit document-centric structure (XML).
  4. Page or paragraph replication with transformation. In many cases, the previous mechanisms cannot be applied because the paragraph must be transformed while being replicated. For instance, for editorial reasons, the requirements for handling organisation committees can differ between a regional wiki (with semantic links for local members) and a thematic wiki (no links). We have not yet analysed this case further, but here again we are in a document-centric approach.
  5. Replication of subsets of several pages. An example was introduced earlier in connection with geographical browsing.
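
The contrast drawn in case 3 between data-centric and document-centric identification can be illustrated with a minimal sketch. It is not the DSMW mechanism: the page store, the property name "Replicated", the paragraph identifiers and both function names are hypothetical, chosen only to show that pages can be selected by their properties while a paragraph needs an explicit XML structure to be addressed.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory pages: semantic properties plus an XML body.
# Property names, paragraph ids and texts are invented for illustration.
PAGES = {
    "Template:Conference": {
        "properties": {"Replicated": "yes", "Model": "conference"},
        "body": "<page><p id='usage'>How to fill in the template.</p></page>",
    },
    "DC 2010": {
        "properties": {"Replicated": "no", "Located in": "Pittsburgh"},
        "body": (
            "<page>"
            "<p id='intro'>DC 2010 takes place in Pittsburgh.</p>"
            "<p id='committee'>Organisation committee: to be completed.</p>"
            "</page>"
        ),
    },
}


def pages_to_replicate(pages, prop, value):
    """Data-centric selection: keep the titles whose property equals value."""
    return sorted(
        title for title, page in pages.items()
        if page["properties"].get(prop) == value
    )


def paragraph_to_replicate(pages, title, paragraph_id):
    """Document-centric selection: find one paragraph by its XML id."""
    root = ET.fromstring(pages[title]["body"])
    for paragraph in root.iter("p"):
        if paragraph.get("id") == paragraph_id:
            return "".join(paragraph.itertext())
    return None  # no paragraph carries this id


print(pages_to_replicate(PAGES, "Replicated", "yes"))
print(paragraph_to_replicate(PAGES, "DC 2010", "committee"))
```

The first function never looks inside the page body, whereas the second cannot work unless the author, or a tool, has marked up the paragraph; this is the practical cost of paragraph-level replication.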

Given this large number of problems, we have to give up on a fully automated system and think instead in terms of "computer-assisted hypertext writing".

Handling Wicri network consistency

Wicri operates among scientific communities and institutes. If we succeed, we can expect educational entities, such as libraries, to join. In other words, quite unlike Wikipedia, we could organize a true "a posteriori validation process". The question is therefore: what kind of tools could help scientists work together with experts in semantics or metadata issues?

As a first step, and in a pragmatic way, we would like to extend to the whole network the facilities available on a single wiki. We have begun to implement bots that use an XML schema of the network:

 <wicri>
  <wiki prefix="wicri.fr" 
        type="public" 
        server="http://maquettewicri.loria.fr" 
        path="/fr.wicri/index.php5?">
     <title>Wicri (fr)</title>
     <article title="$1"/>
     <log title="Special:Connexion"/>
     <recentChanges title="Special:Modifications_r%C3%A9centes"/>
  </wiki>
  <wiki prefix="wicri.en" 
        type="public"        
        server="http://maquettewicri.loria.fr" 
        path="/en.wicri/index.php5?">
     <title>Wicri (en)</title>
     <article title="$1"/>
     <log title="Special:UserLogin"/>
     <recentChanges title="Special:RecentChanges"/>
  </wiki>
  <!-- other wikis of the network are not shown in this excerpt -->
 </wicri>
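
As an illustration of how a bot might consume this description, the following sketch indexes the wiki entries by prefix and builds page URLs. It is not the Wicri bot code: the function names are hypothetical, and the rule "server + path + title=..." is an assumption inferred from the page URLs listed in the notes.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# The two entries shown above, closed with </wicri> so that they parse.
NETWORK_XML = """
<wicri>
  <wiki prefix="wicri.fr" type="public"
        server="http://maquettewicri.loria.fr"
        path="/fr.wicri/index.php5?">
     <title>Wicri (fr)</title>
     <article title="$1"/>
     <log title="Special:Connexion"/>
     <recentChanges title="Special:Modifications_r%C3%A9centes"/>
  </wiki>
  <wiki prefix="wicri.en" type="public"
        server="http://maquettewicri.loria.fr"
        path="/en.wicri/index.php5?">
     <title>Wicri (en)</title>
     <article title="$1"/>
     <log title="Special:UserLogin"/>
     <recentChanges title="Special:RecentChanges"/>
  </wiki>
</wicri>
"""


def load_network(xml_text):
    """Index the <wiki> elements of the description by their prefix."""
    root = ET.fromstring(xml_text.strip())
    return {wiki.get("prefix"): wiki for wiki in root.findall("wiki")}


def page_url(wiki, title):
    """Assumed URL rule: server + path + 'title=' + page title."""
    return wiki.get("server") + wiki.get("path") + "title=" + title


def article_url(wiki, name):
    """Substitute the article name for the $1 placeholder."""
    pattern = wiki.find("article").get("title")
    encoded = quote(name.replace(" ", "_"), safe="/:")
    return page_url(wiki, pattern.replace("$1", encoded))


def recent_changes_url(wiki):
    """The recentChanges title is already percent-encoded in the file."""
    return page_url(wiki, wiki.find("recentChanges").get("title"))


network = load_network(NETWORK_XML)
print(article_url(network["wicri.fr"], "Pittsburgh"))
print(recent_changes_url(network["wicri.en"]))
```

Keeping the localized special-page names in the description lets the same bot visit the French and English wikis without hard-coding either.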

As we have seen with geographic items, many specialized ontologies have to be replicated on many wikis, in pages where free text and structured parts are handled by non-specialist contributors. We could define a master wiki on which the ontology would be built, but a better approach seems to be the use of an external tool, such as Protégé. Several works on designing an ontology cooperatively [t1] point to a very interesting direction.

  • a first avenue for improvement: interoperability with external ontologies
    • example: exporting an ontology to, or importing it from, systems that are not part of Wicri
    • Tania Tudorache, Natalya F. Noy, Samson Tu and Mark A. Musen. Supporting Collaborative Ontology Development in Protégé. In Proceedings of the 7th International Conference on The Semantic Web, Lecture Notes in Computer Science, Vol. 5318.
  • a second avenue
  • a human-computer interface in XML; this, however, implies XML structuring and extensions of Xtiger

Authoring XML all the Time, Everywhere and by Everyone. Stéphane Sire, Christine Vanoirbeek, Vincent Quint, Cécile Roisin. In Proceedings of XML Prague 2010, pages 125-149, Institute for Theoretical Computer Science, March 2010.

A fundamental problem, however, is discovering the information with which links can be defined.

Enriching the wiki network through the use of Web data

The global exploitation of Web information is an important challenge for enhancing the dynamism, flexibility and scope of a wiki network like the one we propose. On the one hand, this process is essential for assisting future contributors with detailed and reliable writing guidelines during the network construction phase. On the other hand, it is also decisive for supplying end users with external information whose added value lies in maintaining a significant relationship with the semantic context of the wiki network.

On the author's side, relevant semantic roles that should take part in the wiki context can be selected, or even assigned, by looking up a large amount of unstructured Web data. In such a case, one can rely on a clustering process [l2], combined with wiki network metadata and external annotation sources such as the TEI[25], to organize the query results in a suitable way, with the final goal of facilitating the author's decision.

On the end user's side, the goal of querying the Web is both to complete and to enrich the information on a given topic once that topic has been provided to the user by the semantic context of the wiki network. The wiki network can thus be considered as a structured information support for intelligently querying and mining the Web. Clustering processes can also be used in a last step to synthesize the Web results obtained.
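
To make the idea of synthesizing Web results by clustering more concrete, here is a deliberately simple sketch. It is not the MultiSOM model cited above: it swaps in a greedy grouping based on Jaccard similarity between word sets, and the snippets, the threshold and the function names are all hypothetical.

```python
import re


def tokens(text):
    """Lower-case words longer than three letters, as a crude term set."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}


def jaccard(a, b):
    """Jaccard similarity between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(snippets, threshold=0.2):
    """Greedy single-pass clustering: attach each snippet to the most
    similar existing cluster, or open a new one below the threshold."""
    clusters = []  # each cluster: {"terms": set, "members": list}
    for snippet in snippets:
        terms = tokens(snippet)
        best, best_sim = None, threshold
        for candidate in clusters:
            sim = jaccard(terms, candidate["terms"])
            if sim >= best_sim:
                best, best_sim = candidate, sim
        if best is None:
            clusters.append({"terms": set(terms), "members": [snippet]})
        else:
            best["terms"] |= terms
            best["members"].append(snippet)
    return [c["members"] for c in clusters]


# Invented snippets standing in for Web query results.
RESULTS = [
    "Fuel cell research at the hydrogen laboratory",
    "Hydrogen fuel cell laboratory publishes research",
    "Dublin Core metadata conference in Pittsburgh",
    "Metadata conference on Dublin Core application profiles",
]

for group in cluster(RESULTS):
    print(group)
```

In a real setting, the term sets could be enriched with wiki network metadata (categories, semantic properties) so that groups align with the semantic context of the network rather than with raw vocabulary alone.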

In our case, one important task is to identify the main actors and the salient institutions of a domain. This task notably involves highlighting their various potential roles in that domain, as well as characterizing the nature of their relationships in the social networks associated with their disciplines. This kind of information can only be obtained through a wide-ranging querying process that accumulates enough information to bring out reliable hypotheses and conclusions. This leads us to consider intelligent and guided access to external wiki data through the use of existing wiki metadata.

In the context of our approach, one main challenge is thus to be able to isolate strategic wiki information, such as author or institution names, in a flow of unformatted data. This approach belongs to the general domain of automated named entity labelling techniques. The majority of such techniques are based on formal grammars associated with statistical models, possibly supplemented by ad hoc sample databases (lists of first names, names of cities or countries, for example) [f1]. In large evaluation campaigns, systems based on manually written grammars often obtain the best results. One obvious disadvantage is that this type of system sometimes requires months of drafting work. Such systems are thus inapplicable in most practical cases.
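
A toy example may help to picture what a hand-written grammar backed by sample lists looks like. The sketch below is only an illustration, with no statistical model: the first-name list, the institution head words, the two patterns and the sample sentence about "Alice Example" are all invented.

```python
import re

# Hypothetical sample databases (gazetteers).
FIRST_NAMES = {"Ward", "Alice", "Jacques"}
INSTITUTION_HEADS = {"University", "Institute", "Laboratory"}

CAPITALIZED = r"[A-Z][\w'-]+"


def alternation(words):
    """Build a regex alternation from a word list."""
    return "|".join(sorted(re.escape(w) for w in words))


# PERSON      := FIRST_NAME Capitalized+
PERSON_RE = re.compile(
    rf"\b(?:{alternation(FIRST_NAMES)})(?:\s+{CAPITALIZED})+"
)
# INSTITUTION := HEAD ("of" | "for") Capitalized+
INSTITUTION_RE = re.compile(
    rf"\b(?:{alternation(INSTITUTION_HEADS)})\s+(?:of|for)"
    rf"\s+{CAPITALIZED}(?:\s+{CAPITALIZED})*"
)


def label_entities(text):
    """Return (surface form, label) pairs in order of appearance."""
    found = []
    for label, pattern in (("PERSON", PERSON_RE),
                           ("INSTITUTION", INSTITUTION_RE)):
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(0), label))
    return [(surface, label) for _, surface, label in sorted(found)]


SAMPLE = ("Ward Cunningham launched WikiWikiWeb. "
          "Alice Example works at the Institute of Hydrology.")
print(label_entities(SAMPLE))
```

Even this tiny grammar shows why the approach is costly: every new name pattern, language or institution type requires another rule or another list entry written by hand.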

Current statistical systems, for their part, use a large quantity of pre-annotated data to learn the possible forms of named entities. It is no longer necessary to write rules by hand; instead, a corpus must be labelled to serve as training material [n1]. These systems are thus also very expensive in human time. To solve this problem, recent initiatives such as DBpedia[26] or Yago [s1] seek to provide suitable semantic corpora to help design labelling tools. In the same spirit, certain semantic ontologies such as NLGbAse[27] are largely directed towards labelling. Our wiki network can itself also be considered a particularly rich database from which to pick up reliable information about such potential entities.

Conclusion

  • working in a network, even in an incomplete or unsatisfactory way, is better than working in isolation
  • metadata play a fundamental role in maintaining the consistency of the wiki network
  • improving consistency and helping contributors to work better requires more and more formalism
    • technological and semantic mediators
    • training and appropriation

Acknowledgments

Thanks to the people who have contributed by reading and correcting this page: Jean-Pierre Thomesse.

References

  • [c1] Charbel Rahhal, Hala Skaf-Molli, Pascal Molli and Stéphane Weiss: Multi-synchronous Collaborative Semantic Wikis. In Wise'09: International Conference on Web Information Systems , 2009.
    < http://www.loria.fr/~molli/pmwiki/uploads/Main/Skaf09wise.pdf >
  • [c2] Correndo, G., Alani, H., & Smart, P. (2008). A community based approach for managing ontology alignments. In The 7th International Semantic Web Conference (p. 61).
    < http://eprints.ecs.soton.ac.uk/16673/ >
  • [d1] Jacques Ducloy, Yann Nicolas, Diane Le Hénaff, Muriel Foulonneau, Luc Grivel, Jean-Paul Ducasse. Metadata towards an e-research cyberinfrastructure - The case of francophone PhD theses. Proceedings of DC 2006, Manzanillo, Mexico, 2006.
  • [e1] Fredrik Enoksson: A MoinMoin Wiki Syntax for Description Set Profiles, DCMI Working draft,(2008)
    < http://dublincore.org/documents/2008/10/06/dsp-wiki-syntax/ >
  • [e2] Gregor Erbach - Data-centric view in e-Science information systems. Data Science Journal Vol. 5 (2006) pp.219-222
    < http://www.jstage.jst.go.jp/article/dsj/5/0/219/_pdf >
  • [f1] Jenny Rose Finkel and Christopher D. Manning. Joint parsing and named entity recognition. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (ACL), Boulder, Colorado, May 31-June 05, 2009.
  • [h1] Eran Hodis, Jaime Prilusky, Eric Martz, Israel Silman, John Moult and Joel L. Sussman: Proteopedia - a scientific 'wiki' bridging the rift between 3D structure and function of biomacromolecules, Genome Biology 2008, 9:R121 doi:10.1186/gb-2008-9-8-r121
    < http://genomebiology.com/2008/9/8/R121 >
  • [j1] Keith G. Jeffery. CRIS + open access = the route to research knowledge on the GRID. In 71st IFLA General Conference and Council proceedings, Oslo, Norway, 2005
    < http://www.ifla.org/IV/ifla71/papers/007e-Jeffery.pdf >
  • [j2] Keith G. Jeffery - Technical Infrastructure and Policy Framework for Maximising the Benefits from Research Output in:ELPUB2007. Openness in Digital Publishing: Awareness, Discovery and Access - Proceedings of the 11th International Conference on Electronic Publishing held in Vienna, Austria 13-15 June 2007 / Edited by: Leslie Chan and Bob Martens. ISBN 978-3-85437-292-9, 2007, pp. 1-12
    < http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.102.5044&rep=rep1&type=pdf>
  • [k1] Markus Krötzsch, Denny Vrandecic, Max Völkel, Heiko Haller, Rudi Studer. Semantic Wikipedia. In Journal of Web Semantics 5/2007, pp. 251–261. Elsevier 2007.
  • [l1] Christoph Lange. SWiM – a semantic wiki for mathematical knowledge management. In Sean Bechhofer, Manfred Hauswirth, Jörg Hoffmann, and Manolis Koubarakis, editors, ESWC, volume 5021 of Lecture Notes in Computer Science, pages 832–837. Springer, 2008.
  • [l2] Jean-Charles Lamirel and Shadi Al Shehabi. MultiSOM: a multiview neural model for accurately analyzing and mining complex data. In Proceedings of the 4th International Conference on Coordinated & Multiple Views in Exploratory Visualization (CMV), London, UK, July 2006.
  • [n1] Claire Nédellec, Philippe Bessières, Robert Bossy, Alain Kotoujansky and Alain-Pierre Manine. Annotation Guidelines for Machine Learning-Based Named Entity Recognition in Microbiology. In Proceedings of the Data and Text Mining in Integrative Biology Workshop. ECML/PKDD 2006, M. Hilario and C. Nédellec (Eds), p. 40-54, Berlin, Germany, September 2006.
  • [o1] Gérald Oster, Pascal Urso, Pascal Molli and Abdessamad Imine. In Proceedings of the 2006 ACM Conference on Computer Supported Cooperative Work, CSCW 2006, Banff, Alberta, Canada, November 4-8, 2006.
  • [s1] Fabian M. Suchanek, Gjergji Kasneci and Gerhard Weikum. YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia. In Proceedings of the 16th International World Wide Web Conference, WWW 2007, Banff, Alberta, Canada, May 8-12 2007.
  • [g1] Giunchiglia, F., Yatskevich, M., & Shvaiko, P. (2007). Semantic Matching: Algorithms and Implementation. In Journal on Data Semantics IX (pp. 1-38).

Notes

  1. < http://stats.wikimedia.org/EN/TablesWikipediaEN.htm#namespaces >
  2. <http://en.wikipedia.org/wiki/Animal>
  3. Wicrified is a neologism that comes from term “wikified” in Wikipedia jargon. This task consists in using Wiki mark-up in order to adapt a document to Wicri network, i.e. setting wiki links, categories or semantic annotations.
  4. < http://semantic-mediawiki.org/wiki/Semantic_MediaWiki >
  5. http://www.eurocris.org/fileadmin/Upload/200909.pdf
  6. < http://www.euroCRIS.org >
  7. http://cwf.uvm.edu/cris/
  8. RDF: Semantic web research isn't working, Zack Rosen's post < http://www.zacker.org/semantic-web-research-isnt-working >
  9. < http://www.mediawiki.org/wiki/MediaWiki >
  10. An image map is a list of coordinates relating to a specific image, created in order to hyperlink areas of this image to various destinations.
  11. < http://proteopedia.org/wiki/index.php >
  12. For instance Acer on
  13. http://wiki.openmath.org/
  14. http://maquettewicri.loria.fr/fr.ticri/index.php5?title=Pittsburgh
  15. http://maquettewicri.loria.fr/fr.wicri-t-eau/index.php5?title=Pittsburgh
  16. http://maquettewicri.loria.fr/fr.wicri/index.php5?title=Pittsburgh
  17. http://maquettewicri.loria.fr/fr.artist/index.php5?title=Qu%E2%80%99est-ce_qu%E2%80%99une_biblioth%C3%A8que_num%C3%A9rique%2C_au_juste_%3F
  18. http://maquettewicri.loria.fr/fr.ticri/index.php5?title=Qu%E2%80%99est-ce_qu%E2%80%99une_biblioth%C3%A8que_num%C3%A9rique%2C_au_juste_%3F
  19. http://maquettewicri.loria.fr/fr.wicri/index.php5?title=Qu%27est-ce_qu%27une_biblioth%C3%A8que_num%C3%A9rique%2C_au_juste_%3F
  20. Data collected on 4 March 2010.
  21. For instance, we must avoid linking to pages containing a thousand lines of RDF/XML as an explanation!
  22. < http://maquettewicri.loria.fr/en.incubWicri/index.php5?title=Fuel_cell >
  23. < http://maquettewicri.loria.fr/fr.wicri-base/index.php5?title=Accueil >
  24. < http://m3p.gforge.inria.fr/pmwiki/pmwiki.php >
  25. < http://www.tei-c.org/ >
  26. < http://dbpedia.org/ >
  27. < http://www.nlgbase.org/>