Open data in Luxembourg, strategy and best practices (2012) chapter 6

From Wicri Luxembourg (en)

Which datasets?

Figure: Steps leading to the creation of an open data portal



Formats

Analysis framework

Braunschweig [2012] categorizes formats according to whether they can be read by a human being or by machines, the latter enabling automated processing. If the format is human readable, the data have been transformed into documents. This allows easy viewing by the user, but the information has been selected and can be wrong, and what the user sees depends heavily on whoever made the selection, and therefore on their intentions. It is also more difficult to recover the underlying data. With a machine-readable format, users can select the data according to their needs, which requires some familiarity with the datasets and a minimal knowledge of the processing tools. Such a format facilitates the reuse of data for other purposes and allows automatic processing.
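To illustrate the machine-readable case, here is a minimal Python sketch in which a reuser selects only the rows they need from a small CSV extract. The dataset, its column names (commune, year, population) and its figures are invented for illustration and do not come from any real portal.

```python
import csv
import io

# Hypothetical CSV extract; names and figures are invented for illustration.
SAMPLE_CSV = """commune,year,population
Alpha,2010,1000
Alpha,2011,1100
Beta,2010,500
Beta,2011,450
"""


def select_rows(csv_text, **criteria):
    """Return the rows whose columns match every given criterion."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if all(row[column] == value for column, value in criteria.items())
    ]


# The reuser, not the publisher, decides which slice of the data matters.
rows_2011 = select_rows(SAMPLE_CSV, year="2011")
total_2011 = sum(int(row["population"]) for row in rows_2011)
print(total_2011)
```

The same selection on a PDF or Word document would first require recovering the table from the page layout, which is the extra cost of human-readable formats described above.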

Usability

The format largely determines how easily data can be reused. If the format is machine readable, the data can easily be reused and combined with other datasets. Providing an API offers flexibility and guaranteed access to up-to-date or even real-time data, but it requires a minimum of technical skills. While a large number of end users can understand a file in table format, fewer are able to connect to an API. An API therefore places more responsibility on the category of reusers. Text formats such as Word, PDF and, to some extent, ODF are difficult to reuse, apart from certain text mining tools, which for the moment have produced only preliminary ideas on how to cross such documents with open public data. However, if documents conform to a consistent structure, e.g. a mailing list, it may still be possible to process them automatically.

Should an organization aim for semantic formats from the outset? This is related to the role of innovation in organizations. For example, INSEE, while judging it too early to restructure the whole organization around the semantic web, has invested in the INRIA Data Lift project to adapt some of its datasets to the semantic web. This is one possible strategy. Another is to use the move to open data as an opportunity to start releasing data according to the principles of the semantic web, which avoids having to redesign the whole process in the longer term. The duration and the costs will differ depending on the approach chosen.

Technical openness

“An open format is one where the specifications for the software are available to anyone, free of charge, so that anyone can use these specifications in their own software without any limitations on reuse imposed by intellectual property rights. If a file format is ‘closed’, this may be either because the file format is proprietary and the specification is not publicly available, or because the file format is proprietary and even though the specification has been made public, reuse is limited. If information is released in a closed file format, this can cause significant obstacles to reuse the information encoded in it, forcing those who wish to use the information to buy the necessary software.”

Does the principle of open data necessarily imply an open format? Two trends can be distinguished. The first emphasizes the need to open as much data as possible, as quickly as possible; where it makes technical recommendations, they concern machine readability. The second, in contrast, emphasizes the need for consistency: if data repositories are opened, they must be in an open format. Taken radically, this idea would exclude documents in Excel format from the field of open data.

Formats in practice

Format | Frequency | Proprietary | Usability assessment | Semantic
CSV    | High      | No          | Average              | No (but conversion possible)
XLS    | High      | Yes         | Average              | No
TXT    | Average   | No          | Weak                 | No
DOC    | Average   | Yes         | Weak                 | No
RTF    | Low       | Yes         | Weak                 | No
PDF    | Average   | ?           | Weakest              | No
XML    | Low       | No          | Good                 | Yes
RDF    | Low       | No          | Good                 | Yes
API    | Low       | Depends     | Good                 | Depends

Most institutions have opened their datasets in one of the possible ways, namely publishing data as they are in their repositories and thus keeping the original format. As a result, a wide variety of formats is available: at least 72 for the European open data initiatives studied. There are, however, major trends. CSV is the most used format on most platforms in Europe, but with a share of only 29%. Where it does not dominate, the reason may be contextual (Rennes opened mostly geographic information) or, more rarely, a technical choice (in Berlin, XLS prevails with 29% of datasets, against 14% for CSV). The most widely used file formats are of the table type: at European level, CSV and XLS together reach 50%.

JSON can be interesting because it represents a compromise between fully human-readable and fully machine-readable formats, and it is generally considered easier to manipulate than XML. It is rarely used for opening data: less than 0.5% in Europe, but 10% in Berlin. APIs, which can also use JSON, are in the same situation. They have the advantage of providing access to data in real time, which is expected to allow more varied reuse and generate more value.

The analysis of current European open data also shows that formats such as Datex2 for traffic data, promoted at European level, meet with mixed success. Only one such dataset is recorded in the European catalog studied, opened by Bison Futé as experimental data on smart traffic around major French cities. European open data also has a specificity: while the GTFS format prevails for the publication of transport data in the United States, it is still in the minority in Europe, even if Deutsche Bahn has recently announced the release of its data in GTFS.
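Since CSV and JSON both carry tabular records, moving a table-type dataset from one to the other is largely mechanical. The sketch below uses only the Python standard library. The sample rows and column names (station, line, passengers) are invented for illustration, and the function name csv_to_json is a hypothetical helper, not part of any portal's tooling.

```python
import csv
import io
import json

# Hypothetical table-type dataset; values are invented for illustration.
SAMPLE_CSV = "station,line,passengers\nCentral,1,1200\nNorth,2,800\n"


def csv_to_json(csv_text):
    """Convert CSV text into a JSON array with one object per row."""
    records = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(records, indent=2)


json_text = csv_to_json(SAMPLE_CSV)
print(json_text)

# An API returning JSON would hand the reuser a structure like this one.
records = json.loads(json_text)
```

Note that every value stays a string after conversion, because CSV carries no type information. A publisher who wants numbers or dates in the JSON output has to declare the column types separately.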

Assessment for Luxembourg

Datasets found in Luxembourg show a strong predominance of PDF files. This is not surprising: in the absence of an open data policy, producers mainly want to convey information to citizens, a task for which PDF is well suited. Within these PDF files, however, the large amount of information in tabular form suggests that opening the corresponding datasets would be relatively easy. In Luxembourg, Statec is the only institution that opens a large number of datasets in a coherent manner and in many reusable formats.

Legal framework

Current framework

In current intellectual property law, databases are the closest analogue to open data, and in Europe they are generally subject to a sui generis right. Licenses specifically designed for open data reflect these characteristics.

At European level, the two most important texts for open data are the PSI and Inspire Directives. The PSI Directive, 2003/98/EC of 17 November 2003, regulates the provision and reuse of public sector information. It is being reformed, and the proposals already published tend to bring it closer to open data. In the particular case of geographic data, one must add the Inspire Directive, 2007/2/EC of 14 March 2007, establishing an Infrastructure for Spatial Information in the European Community.

In Luxembourg, the PSI Directive was transposed by the "Loi du 4 décembre 2007 sur la réutilisation des informations du secteur public" [1]. The Inspire Directive was transposed by the "Loi du 26 juillet 2010 portant transposition de la directive 2007/2/CE du Parlement européen et du Conseil du 14 mars 2007 établissant une infrastructure d’information géographique dans la Communauté européenne (INSPIRE) en droit national" [2].


Description and comparison of open data licenses

An important issue is the conditions of distribution. Should an open approach be imposed on reusers in exchange for access to open data? Creative Commons licenses let institutions choose. Only one criterion is obligatory: Attribution (BY). This condition can be combined with one or more of the following: Non Commercial (NC) and No Derivatives (ND) or, if changes are allowed, Share Alike (SA), under which products and services generated by the reuse must follow the same license. Most of these conditions are an obstacle to the reuse of data and are inconsistent with the definitions of open data. The choice is an important parameter that affects the economic landscape: do we want to encourage open source approaches, or leave full freedom of reuse in the hope that the ecosystem will structure itself to provide the maximum benefit to society as a whole?
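The combinations of conditions described above can be enumerated in a few lines of Python. This is only a sketch under stated assumptions: the function names are hypothetical, and the openness test assumes, in line with the usual definitions of open data, that a license stops being open as soon as it carries the NC or ND condition.

```python
from itertools import product


def cc_license_name(nc, nd, sa):
    """Build a Creative Commons license code; Attribution (BY) is always present."""
    if nd and sa:
        # Share Alike only makes sense when derivatives are allowed.
        raise ValueError("ND and SA cannot be combined")
    parts = ["BY"]
    if nc:
        parts.append("NC")
    if nd:
        parts.append("ND")
    if sa:
        parts.append("SA")
    return "CC " + "-".join(parts)


def is_open(nc, nd):
    """Assumption: a license counts as open only without NC and ND restrictions."""
    return not nc and not nd


licenses = {}
for nc, nd, sa in product([False, True], repeat=3):
    if nd and sa:
        continue  # skip the invalid combination
    licenses[cc_license_name(nc, nd, sa)] = is_open(nc, nd)

open_licenses = sorted(name for name, ok in licenses.items() if ok)
print(open_licenses)
```

Under these assumptions, six combinations are valid and only the two without NC or ND pass the openness test, which matches the observation that most of the optional conditions are obstacles to reuse.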


Within a national framework, there is the example of the "licence information publique librement réutilisable" in France, designed by the French Ministry of Justice. Adaptation and modification are permitted provided that the resource is enriched in some way. The same applies to sale.

While such licenses show that public bodies are taking the issues of open data into account, they may nevertheless lead to a fragmentation of platforms and a deterioration of the flow of data, while limiting the ability to create mashups: in addition to the technical aspects, if two datasets have different legal requirements, crossing them will be more difficult. The right to reuse, its limitations and its interaction with intellectual property rights are not uniform across countries. Significant differences appear between civil law countries and common law countries, even with a Creative Commons license.


The Open Knowledge Foundation has released the Open Database License (ODbL), adopted for example by Paris for its open data portal. The ideological and practical proximity between the open data and open source worlds explains the similarities between this license and Creative Commons licenses. ODbL was also originally designed to fit in with the Creative Commons licenses. Lawrence Lessig, who contributed to the 2007 statement of the principles of open data, also took part in the launch of the Creative Commons licenses. These licenses share the notion of share-alike and allow data to be accessed, used, downloaded, copied, shared and distributed. ODbL has now been adopted by OpenStreetMap.


Notes :