Foreword: this English text is that of the keynote I had the great pleasure of giving on 12 December 2019, at the invitation of Vincent Razanajao and Alberto Dalla Rosa, at the "Linked Pasts V" conference held in Bordeaux (11-13 December 2019). It was translated by Emmanuelle Bermès, whom I thank once again warmly for this work. It largely draws on posts already published on this blog. Before the text itself, you will find the slides that accompanied my talk.
I started to take an interest in Semantic Web technologies in 2005. My first talk on this topic was in 2006 at the Digital Humanities conference in Paris. Then, I had the opportunity to test them at full scale in 2007 for a project conducted by the CCH of King's College. But it was during the SPAR project of the National Library of France, started in 2008, that I really began to grasp the tremendous promise of these technologies and, already, their limits. Between 2008 and 2014, I had the opportunity to deploy them in different contexts, in order to address different use cases: data publication, harvesting of data embedded within web pages, bridging internal silos and data consistency, data enrichment and mashups... I would like to share this experience with you today, with two objectives:
- show in what contexts and how we can use Semantic Web technologies;
- take a comprehensive look at these technologies and explain how they have impacted my thinking in the field of data management, even if I don’t use them anymore.
But first, I'd like to go back to the history of these technologies: after all, history is a great way to put things into perspective...
A quick history of the Semantic Web
The initial purpose of the Web was to enable CERN researchers in Geneva and affiliated laboratories to share the contents of their computers quickly and easily with their peers. What did the researchers' machines contain?
- First, documents: experiment protocols and reports, scientific papers, monographs... that is, a set of finite, organized and coherent information intended for humans;
- But also, data: tables and databases containing research results, that is, information in a standardized form aimed at being used by machines.
Therefore, the Web was designed, from its origin, to allow the linking and exchange of documents and data in a global interoperability space.
The pioneers of the Web, led by Tim Berners-Lee, focused first on developing building blocks for the exchange and linking of documents. To do so, they relied on a mature set of technologies and principles:
- A communication protocol, HTTP, built on top of TCP/IP, the protocol suite underlying the Internet;
- An identification mechanism, the URLs, which makes it possible to access a document on a distributed network of machines;
- A principle for linking documents, hypertext, imagined at the end of the Second World War by Vannevar Bush and adapted to computing by Ted Nelson in the mid-1960s;
- A document encoding language, HTML, based on SGML, a standard for hierarchical description of information.
The success of the Web of documents is due to several factors:
- Web standards are open and free: anybody can implement them without having to pay;
- Web standards are robust: they contain no single point of failure (SPOF). The Web can be extended without limits; you just need a web server connected to the Internet. There is no centralized directory. Furthermore, linking to another page is not subject to any authorization or verification.
- Web standards are easy to implement: you can learn HTML in one or two hours. Hypertext feels natural because it is based on the principle of the association of ideas, just like human thinking.
So, the development of standards for the Web of documents and their appropriation by a large community were quite simple. The situation when it came to data, however, was much more complex. In the 90s, there was no technological consensus to build on in this field. In September 1994, at the first WWW conference, Tim Berners-Lee drew the future directions of the W3C and demonstrated the "need for semantics for the Web". Today, we no longer speak about the Semantic Web but about knowledge graphs. In between, these technologies have been redefined several times while keeping their technological base.
We can identify 3 periods:
Semantic Web era : from 1994 to 2004, a first period was dedicated to the development of major Semantic Web standards like RDF and OWL.
Linked Data era : Then from 2006 to 2014, the effort focused on adoption, with a new initiative called Linking Open Data. SPARQL was developed and major industry players like Google, Bing and Yahoo entered the game by releasing the Schema.org vocabulary. Finally, Wikidata was opened to contributors and new versions of RDF, JSON and SPARQL were released.
Knowledge graph era: Today, we can consider that we are in a third period, the knowledge graph era. It started when Google announced the knowledge vault and is characterized by the emergence of new de facto standards for graphs, more fit to be used by the industry.
Three periods, three redefinitions of what Semantic Web technologies are, but the same technologies. Isn't that the sign of a problem with them?
To illustrate my point, I would like to share feedback on a few projects I've been involved in. I will focus on the reasons that led us to choose Semantic Web technologies, the limits we faced and the lessons learnt.
The SPAR project / Flexibility and linking of heterogeneous data
The National Library of France’s project for a “Scalable Preservation and Archiving Repository” (SPAR) aims to ensure the long-term continuity of access to its digital collections. The system strictly follows the principles of the OAIS model (Open Archival Information System), including in its architecture. Indeed, each functional entity of the OAIS model is implemented as an application module: Ingest, Data Management, Storage, Access and Administration.
- The Ingest module enables the control and enrichment of information packages and the transformation of submission information packages into archival information packages;
- The Data Management module provides storage, indexing and querying of all metadata: reference data and information packages;
- The Storage module provides secure storage of all packages;
- The Access module retrieves information packages;
- Finally, the Administration module is used to manage the system.
The "Data Management" module, in charge of storing and querying metadata, had to address several requirements:
- all metadata needed to be searchable without any preconceived idea of how to query them;
- the data was heterogeneous and included semi-structured data;
- the metadata from the information packages was to be queried along with the reference data;
- we needed a powerful query language, preferably a standard, accessible to non-IT staff like preservation experts;
- the system had to be flexible and standard, independent of any software implementation, in order to ensure long-term evolvability and reversibility.
At the time (that is, in 2008), it turned out that the RDF model and the SPARQL query language were the most appropriate answer to all these issues:
- Relational databases seemed too rigid and software-dependent;
- Search engines did not allow indexing in real time and were also problematic because they limited the query to a single entity and made fine querying of structured data difficult;
- As for document databases, they were still in their infancy and even today, they do not offer the same query flexibility as SPARQL.
So we decided to implement the whole "Data Management" module with the RDF model and to expose a SPARQL endpoint as an API. To do this, we deployed the Virtuoso software, which had already been in use for two years within DBpedia.
As a result, this is how metadata is handled within SPAR. The submission information package is a zip archive that contains:
- all files to be preserved;
- an XML file using METS and Dublin Core to describe the list of files, their structure and the descriptive metadata.
The ingest module checks the package against reference data (agents, formats, etc.). At the end of the ingest process, the files are stored in the Storage module while the metadata is transformed into RDF to be indexed within the Data Management module.
The ontology used for this RDF transformation was created ad hoc based on existing models: Dublin Core, OAI-ORE and home-made ontologies.
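To make the appeal of this approach concrete, here is a minimal, purely illustrative Python sketch of the idea: metadata expressed as triples and queried by matching a pattern, the way a SPARQL basic graph pattern does. All URIs, predicates and values below are invented for the example; they are not SPAR's actual ontology, and the real system relies on Virtuoso rather than anything like this.

```python
# Preservation metadata as RDF-style (subject, predicate, object) triples.
# Everything here is made up for the illustration.
triples = [
    ("pack:ark-1", "rdf:type", "spar:ArchivalPackage"),
    ("pack:ark-1", "dc:format", "fmt:tiff"),
    ("pack:ark-1", "spar:ingestedBy", "agent:ingest-module"),
    ("pack:ark-2", "rdf:type", "spar:ArchivalPackage"),
    ("pack:ark-2", "dc:format", "fmt:jpeg2000"),
    ("fmt:tiff", "rdfs:label", "TIFF image"),
]

def match(pattern, data):
    """Return every triple matching a single pattern; None acts as a wildcard."""
    s, p, o = pattern
    return [t for t in data
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "Which packages use the TIFF format?" No schema change was needed to
# ask this question: this is the flexibility the project was after.
tiff_packages = [s for s, _, _ in match((None, "dc:format", "fmt:tiff"), triples)]
print(tiff_packages)  # ['pack:ark-1']
```

Any new predicate can be added at any time without migrating a schema, which is precisely what made the model attractive for metadata "without any preconceived idea of how to query them".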
The first problems appeared when we ran performance tests. The model proved to be as flexible as expected and the query language was expressive enough. But the system was lacking performance and scalability.
We had to adapt the architecture. So, we created multiple instances of Virtuoso, containing different metadata sets based on usage, including an instance with all the metadata. The result was much more complex than initially planned. We also worked on the volume of indexed metadata by excluding redundant information. Last but not least, it is worth noting that the instance containing all the metadata has a very limited number of users (roughly counted on the fingers of both hands).
Ten years later, the system is still in place. The BnF has gone from the open (and free) version of Virtuoso to the paid version to ensure scalability. As far as I know, they remain convinced of this choice. As for myself, I'll remember this experience as a crazy bet (a few of us spent restless nights asking ourselves whether it was the right solution...). Fortunately, speed of response and number of users were not an issue on this project. I remain convinced that it would be difficult to build a production service with a large number of users and/or data directly on an RDF database queried with SPARQL.
The Isidore project / Data retrieval and exposure
Isidore is a project led by the research infrastructure Huma-Num, started in 2008. Since 2010, it has provided an online platform for searching open access resources in the humanities and social sciences. Today, almost 6 million resources from more than 6,400 data sources are aggregated.
The architecture of Isidore is composed of three parts:
- Back office applications for managing data sources and repositories;
- A data pipeline system responsible for retrieving data sources, controlling them, enriching them and indexing them in a search engine;
- Two storage systems to request and publish the data: a search engine that publishes the data through a specific API and a triple store providing a SPARQL endpoint.
Semantic Web technologies are involved in several ways in this project:
Isidore harvests metadata and content in three different ways, one of which is RDFa: RDF statements embedded in web pages, retrieved from the pages listed via the Sitemap protocol. At the time, we were convinced that the OAI-PMH protocol, often used to expose the data, presented major drawbacks:
- it was impossible to express different granularity levels, because disparate entities were described at the same level;
- the description was limited to simple Dublin Core;
- the HTTP protocol was not used correctly (the error mechanism does not use the appropriate HTTP status codes, hypermedia mechanisms go unused...).
In our view, RDFa was a simple way to lead producers towards Semantic Web technologies and to go beyond the limits of OAI-PMH. With a supposedly limited investment, it provided a greater capacity for description using Dublin Core terms and OAI-ORE, an RDF vocabulary created in 2008 to address the limitations of OAI-PMH.
In Isidore, the reference data used to carry out the enrichments is expressed in RDF, connected with links;
All metadata retrieved from data sources or obtained from Isidore's enrichment mechanisms are converted to RDF and stored in an RDF database. They are published according to Linked Data principles.
Inspired by the emerging initiatives around Open Data, we felt it was essential to make the data available in return. Several objectives were targeted:
- as a public initiative, be transparent about the data used to build the search engine;
- make available to producers (and, by extension, to everyone) the enrichments produced from their data (generation of a unique Handle identifier, classification of resources, automatic annotation with reference vocabularies...), in a spirit of giving back;
- help humanities and social sciences scholars get acquainted with Semantic Web technologies.
I am not in the best position to draw conclusions from these different initiatives. With the hindsight of time, but also of my current position, I have the impression that the challenge was only half met. The choices were certainly good at the time. They served as an example to help us progress on the reuse of exposed data, the use of Semantic Web technologies and, more generally, the interoperability of research data.
However, actual reuse of these data remains scarce. Maybe this kind of data is not necessarily amenable to reuse; but above all, their exploitation requires a steep learning curve. Researchers want simple, accessible things. During a study day organized in 2017 around the relationship between research and heritage institutions, Raphaëlle Lapotre, product manager of data.bnf.fr, confirmed:
"We mostly get in touch with researchers when things go wrong with the data. And it often goes wrong, for several reasons. But, indeed, there was the question of these standards giving the researchers a hard time [...] they tell us: but why don't you just use CSV rather than bother with your semantic web business?"
This is a general finding in the world of Open Data: the more complex the data, the less reused they are.
As for RDFa, this formalism proved much more complex to handle than expected. There was no initiative or vocabulary at the time to structure this data in a consistent manner. We advocated Dublin Core terms because they seemed most appropriate for this type of data. Since then, Schema.org has come forth and, gradually, RDFa has given way to JSON-LD. Obviously, I would turn to this pair today. Finally, it should be noted that, despite its shortcomings, the OAI-PMH protocol remains the main channel for Isidore to retrieve data... a victory of simplicity over expressivity? Whatever the case, this is a lesson to remember as OAI-ORE celebrates its tenth anniversary.
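Part of what makes the Schema.org + JSON-LD pair lighter than RDFa is that the description lives in one self-contained block rather than being woven through the page markup. A minimal, purely illustrative sketch (the record, names and values are invented, not Isidore's actual output):

```python
import json

# A made-up record described with Schema.org types; the property choices
# are illustrative only.
record = {
    "@context": "https://schema.org",
    "@type": "ScholarlyArticle",
    "name": "An example open access paper",
    "author": {"@type": "Person", "name": "Jane Doe"},
    "datePublished": "2019-12-12",
}

# Embedding it in a web page is a single script tag, which a harvester
# can extract without parsing the rest of the HTML.
snippet = '<script type="application/ld+json">{}</script>'.format(
    json.dumps(record))
print(snippet.startswith("<script"))  # True
```

Compared with RDFa, the producer does not have to thread attributes through existing HTML elements, which is arguably why this approach spread where RDFa struggled.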
From mashups to Linked enterprise Data: breaking silos / linking and bringing consistency to heterogeneous data
In their founding paper published in 2001 in the journal Scientific American, Tim Berners-Lee, James Hendler and Ora Lassila illustrate their proposal with a concrete example. A software agent searches various sources of available data on the web and combines them in real time to arrange a medical appointment. This use case is supposed to demonstrate the possibilities of data publication and how Semantic Web technologies can connect heterogeneous data to deduce information. Decentralization, interoperability and inference are ultimately the three main objectives of the Semantic Web.
Following the same principles, we could mix several sources of heterogeneous data exposed in RDF (or another way) to create new applications with real time updated data from different sources. That's the whole principle of data mashups.
However, from theory to practice, there is a gap.
In the different mashups we developed, like this one on historical monuments, the general principle and the place of Semantic Web technologies are the same. We used them in two different ways:
- to retrieve data sources published according to the principles of Linked Data or through a SPARQL endpoint;
- to act as the "glue" between heterogeneous data sources, by building a graph, itself stored in an RDF database, from which we built the XML files to be indexed in the search engine.
The first obvious difference with the use case outlined in the 2001 paper is data retrieval. Given the current state of the art and the problems of network resilience, building a fast and scalable application requires retrieving the data asynchronously, processing it and then storing it in a local database. This requires putting in place heavy mechanisms to update the data, which cannot be done in real time, thus questioning the idea of decentralization.
In this kind of exercise, preparing a consistent dataset is complex in itself, regardless of whether or not Semantic Web technologies are at play. In addition, two other challenges have to be faced:
- the mapping of all data sources to RDF and the development of a data model that can describe all the harvested data;
- the conversion of the stored data back into a formalism readable by a search engine (JSON or XML), the capabilities of the RDF database being limited in this respect.
Is the conversion and storage in RDF really a useful step? We could go directly from the retrieval phase to storage in the search engine via data processing. The main interest of this choice is to separate the data and its logic from the way it is exploited. In this way, it is possible, simple and fast to create different views focused on the different entities of the model. We could invent new ways to navigate the data depending on how the graph is browsed. Obviously, this supposes that reusers know the structure of the graph perfectly and master Semantic Web technologies...
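The "graph to search-engine document" step mentioned above can be sketched in a few lines: the triples about one entity are folded into a flat document ready for indexing. The monument data, URIs and predicates below are invented for the illustration; the real pipeline produced XML, not JSON.

```python
# Made-up triples about a historical monument.
triples = [
    ("mon:1", "dc:title", "Château de Chambord"),
    ("mon:1", "geo:region", "Centre-Val de Loire"),
    ("mon:1", "dc:subject", "castle"),
    ("mon:1", "dc:subject", "Renaissance"),
]

def to_document(subject, data):
    """Fold all triples about one subject into a flat, multi-valued document."""
    doc = {"id": subject}
    for s, p, o in data:
        if s != subject:
            continue
        key = p.split(":", 1)[1]  # drop the namespace prefix for indexing
        doc.setdefault(key, []).append(o)
    return doc

doc = to_document("mon:1", triples)
print(doc["subject"])  # ['castle', 'Renaissance']
```

The flattening itself is trivial; the cost discussed in the text lies in the round trip, since the data was first mapped *into* RDF only to be mapped back *out* for the search engine.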
Yes, flexible it is. But in the end, when you compare the time spent developing and automating the mapping and RDF storage with the time gained in exploiting the data, there is almost no immediate advantage in using these technologies; it may even be the contrary. Such an effort will (perhaps) only be justified with time (without guarantee...) and the actual creation of different uses or views on the data.
The Linked Enterprise Data, a concept that we tried to push at Antidot, could be compared to a data mashup of the legacy information systems of organizations. The idea is to free data from existing silos, separate data from usage, link and create consistency between all data sources, in order to propose new uses and a new way of exploiting / exploring the data assets of the organization.
In the case of Electre, a company specializing in the supply of bibliographic data, the implementation of Linked Enterprise Data made it possible to recentralize all the data dispersed in different silos, to make them consistent and link them via a common model, and to enrich them. The goal was to simplify the reuse of data.
As in the case of mashups, the goal was achieved. But at what cost? It was necessary to convert all data sources into RDF and ensure consistency within a triple store in near real time with every change in the system. This proved very complex to supervise and maintain over time. In this case, the actual benefit of the RDF mapping isn't obvious. The construction of a new data silo in RDF may actually reveal a wider architecture problem in the information system or, even worse, be a way to avoid seeing it altogether.
We only had a few opportunities to develop this vision for organizations. Indeed, we ran into various problems, both technical and organizational:
- issue of scalability and performance;
- complexity in moving data out of silos and mapping it to RDF;
- disinterest of organizations for the data itself and its logic;
- real or supposed illegitimacy of IT units to take a transversal vision in the organization;
- limits of the RDF model in expressing the source of the information and, more generally, difficulty in contextualizing the triple;
- inability to guarantee a return on investment;
- lack of skills of developers in the field.
Finally, the lack of skills of developers in the field is an important flaw, because it determines the realization, the supervision and the maintainability of such systems. IT staff will never move towards a solution that they are not sure they can sustain over time.
Conclusions and perspectives
These different experiences demonstrate the benefits offered by semantic Web technologies regarding two elements:
- The flexibility of the graph at the core of the RDF model;
- The possibilities offered by these technologies in terms of data publication, interoperability and decentralization.
For each of these topics, I will now show the contribution of Semantic Web technologies, but also their limits and the means to overcome them.
The flexibility of the graph model
Benefits of Semantic Web technologies
Compared with the rigidity of relational databases - be it a reality or just an impression - the RDF graph feels like absolute freedom:
- the data structure is no longer separate but part of the data itself;
- where relationship tables sometimes had approximate typing, the graph model makes things explicit, and the logical model has never been closer to the conceptual model and real-world logic;
- no more local entity identifiers: URIs provide universal resource identification;
- the data is structured according to its own logic and not the use that is made of it.
It makes possible what the tabular model was never able to solve: easily linking heterogeneous entities, either directly by typed links or via reference data, while including the structure of the data as part of the data itself.
The graph can evolve over time and its growth is potentially infinite without needing to edit the entire logic model.
Yes, all these promises are held by the Semantic Web technologies, RDF and SPARQL in particular.
However, the very structure of the RDF model has revealed limitations concerning the management of the provenance of the various pieces of information and the contextualization of the triple. This point, already present in Tim Berners-Lee's Semantic Web, is still not really resolved. Solutions have appeared, but they are not entirely satisfactory. From this point of view, RDF 1.1 is a missed opportunity.
Meanwhile, another model, called the "property graph", has emerged. It proposes a response to this limit. This model is today at the heart of all the graph database technologies proposed by the major players in the sector: IBM, Microsoft, Amazon (reportedly based on the Blazegraph product, whose company appears to have been bought by Amazon), Google, not to mention newer entrants: Huawei, Datastax, Neo4j or OrientDB.
Thus, the graph model is doing well and for a good reason. It offers unparalleled flexibility in the manipulation of structured data and in the cross-query of heterogeneous data. But, most industry players made the choice to implement the property graph model and they all adopted the Apache Tinkerpop framework and the Gremlin query language to interact with the storage system, making it a de facto standard.
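The contextualization limit discussed above can be made concrete with a small sketch. In plain RDF, a statement is a bare triple with no room for qualifiers; in a property graph, the edge itself carries properties. The entities and values below are invented for the illustration (the identifier is merely Wikidata-flavoured), and real property graph databases of course use richer structures than a dictionary.

```python
# In RDF, "Paris has population 2187526" is a bare triple: there is no
# native slot for saying where that figure comes from or when it was true.
rdf_triple = ("w:Q90", "p:population", "2187526")

# In a property graph, the edge itself can carry properties - exactly
# the provenance and temporal context that plain triples lack.
property_graph_edge = {
    "from": "w:Q90",
    "label": "population",
    "to": "2187526",
    "properties": {"source": "census", "asOf": "2017"},  # edge-level context
}

print(property_graph_edge["properties"]["asOf"])  # 2017
```

RDF workarounds exist (reification, named graphs, and more recently RDF*), but the point of the property graph model is that this context is first-class rather than bolted on.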
Maintainability and management of data in a graph system
Beyond the limits
Acknowledging the limits of RDF doesn't mean renouncing everything that Semantic Web technologies have brought over the years. In particular, a reconciliation remains possible between property graphs and RDF with RDF*/SPARQL*. If there is no need to publish the data on the Web, the use of property graphs seems to me a good idea.
Working with these technologies also led us to think about data governance. Creating a global map of all your data seems a good starting point. You should also focus on conceptual modeling as the first stage of your project and all along the development.
Finally, when getting started, ask yourself: do you really need the graph model? We shouldn’t just use Semantic Web technologies just because they are fashionable. As Dan Brickley says: you do what you want inside your own system. RDF is to be seen as a tool to exchange data, not as a mandatory standard to be used at the core of your architecture.
Data publication / Interoperability / Decentralisation
Contributions of semantic Web technologies
Indeed, the main strength and interest of Semantic Web technologies (maybe the only one?) is to ensure the interoperability of structured data by offering a common model (the triple). From this point of view, the promise is perfectly kept. If we design data-level interoperability, Semantic Web technologies are at the moment the best solution, and they have deeply influenced our thinking on this matter.
Although it was not enough for these technologies to be widely adopted, they accelerated the reflexion on interoperability by opening up unexplored technical possibilities. They allowed us to better understand the conditions required for linking heterogeneous data and to create bridges between worlds that seemed remote or even impossible to reconcile (see also the presentation). Semantic Web technologies have made it possible to consider new ways of conceiving interoperability.
Moreover, as it is the case with Wikidata, SPARQL is a powerful tool for querying data, regardless of RDF being used as a model.
Limitations of Semantic Web Technologies
However, there are major shortcomings in this area as well:
- Network performance and resiliency issues still require asynchronous data retrieval.
- Maintaining the infrastructure over time implies significant costs and technological complexity, without providing any solution to the problem of 404 errors, just as for web pages.
- The level of knowledge required for exploitation is also high, and today another issue has arisen: querying Wikidata is not like querying just any SPARQL endpoint; appropriating the model ultimately takes as much time as appropriating a new proprietary API.
- Not all queries are possible: full-text search is very limited or impossible, because these technologies were not designed with it in mind.
- Structural interoperability doesn't actually work, because not everyone uses the same ontology, and even when they do, these ontologies are not necessarily used homogeneously.
Overcoming the limits
The data actually needs to be published according to its own nature, the possible uses and the intended users:
- Simple CSV or JSON/XML dumps;
- A simple API;
- A SPARQL endpoint for the most advanced uses, if resources are available to maintain it.
We should focus on simple and easy-to-use ontologies: in short, Schema.org to expose the data. Who has never wasted time trying to get their head around the data.bnf.fr model, or gotten lost in the maze of the British Museum SPARQL endpoint with its data model strictly following CIDOC-CRM?
Do we need this level of interoperability? We must face the facts: faced with an increasing mass of data, we must give up the idea of syntactic or structural interoperability through the use of a single model, be it for production, storage or exploitation within an information system. However, it is still possible to link the different bits of information, for example through the use of independent identifiers that are common to the whole information system. This does not mean that interoperability between organizations is a utopia; rather, it relies on end-to-end interoperability of systems via data processing, not on global interoperability at the data storage level.
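The idea of linking through shared identifiers rather than a shared model can be sketched very simply: each system keeps its own shape, and a processing step joins records on a common identifier. The ARK-style identifier and both record shapes below are invented for the example.

```python
# Two systems with different models for the same resources, linked only
# by a shared identifier (an invented ARK here).
catalog = [{"ark": "ark:/12345/x1", "title": "Some film"}]
rights = [{"ark": "ark:/12345/x1", "status": "restricted"}]

# The join happens in a processing step, not in a unified storage model.
by_ark = {r["ark"]: r for r in rights}
merged = [{**c, **by_ark.get(c["ark"], {})} for c in catalog]
print(merged[0]["status"])  # restricted
```

Neither system had to adopt the other's model, or RDF: the common identifier alone carries the interoperability, which is the "end-to-end via data processing" approach described above.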
Data management at the National Audiovisual Institute
It is still possible to envisage a global consistency of the different data of an organization without using Semantic Web technologies, by deploying transverse data governance and designing the data models based on the logic of the data itself rather than on its use. In short, it is a matter of properly managing the data, and the answer is not only technical... This observation, along with these years of experience with Semantic Web technologies, led us to the way we finally designed the new information system of the French National Audiovisual Institute (INA).
The main directions of our project are as follows:
- Technically separate data from their usage by setting up a technical data storage and processing infrastructure independent of the business applications that use them;
- Functionally separate data from their usage by managing all types of data handled by the institution, rethinking data models in relation to their logic and not their use and acknowledging that some data models are dedicated to production and storage while several other models are designed specifically for data publication.
- Five types of databases are used in order to meet the storage needs of all types of data (structured, semi-structured and unstructured) and their exploitation:
- A relational database;
- A document database;
- A graph database;
- A search engine;
- A column store.
- A single, centralized infrastructure includes the five storage systems, the layer for processing and synchronizing data (the real core of the system and what actually ensures interoperability), and a dissemination layer in order to abstract the architecture from the business applications that use the data.
So currently, Semantic Web technologies are only used to retrieve and process data from Wikidata so as to enrich our reference vocabularies for people and places. If the need arises and/or if there is a political intent, we can still consider exposing all or part of our data as Linked Data and through a SPARQL endpoint; but otherwise, we will prefer dumps in simple formats, to ensure greater data reuse. Thus, while steeped in all the reflection and contributions stemming from Semantic Web technologies, our project no longer uses them.