Projects Overview

Over the last 10+ years I have been involved in a number of research and industrial projects in various sectors. In these projects I have contributed to the design and implementation of novel concepts and ideas for knowledge extraction, integration, modelling and retrieval. These include European projects funded by the European Commission (EC) within the FP7 and H2020 funding schemes, as well as national projects. In addition, I have contributed to several community and open source projects. Below, I have listed the most relevant projects I have worked on over the last 10+ years, most of which involved research institutions and companies from different European countries.

EU Projects

COST NexusLinguarum - European network for Web-centred linguistic data science

Project lifetime: Sep 2019 - Aug 2023

The main aim of this Action is to promote synergies across Europe between linguists, computer scientists, terminologists, and other stakeholders in industry and society, in order to investigate and extend the area of linguistic data science. We understand linguistic data science as a subfield of the emerging “data science”, which focuses on the systematic analysis and study of the structure and properties of data at a large scale, along with methods and techniques to extract new knowledge and insights from it. Linguistic data science is a specific case, which is concerned with providing a formal basis to the analysis, representation, integration and exploitation of language data (syntax, morphology, lexicon, etc.). In fact, the specificities of linguistic data are an aspect largely unexplored so far in a big data context.

H2020 ALIGNED - Aligned, Quality-centric Software and Data Engineering

Project lifetime: Feb 2015 - Jan 2018

ALIGNED (Aligned, Quality-centric Software and Data Engineering) is a research project funded by Horizon 2020. ALIGNED developed new ways to build and maintain IT systems that use big data on the web. It provides: 1) A methodology for combined software and data engineering, based on a metamodel which describes the software and data lifecycles; 2) Tools to produce software development models from the metamodel, including transformations that generate or configure software applications; 3) Tools to produce data development models from the metamodel, incorporating data quality and integrity constraints, data curation, and data transformations; and 4) Methods to use the metamodel and tools as part of a unified software and data engineering process, emphasising techniques which ensure data quality and integrity, as well as software security and reliability.

H2020 FREME - Open Framework of E-services for Multilingual and Semantic Enrichment of Digital Content

Project lifetime: Feb 2015 - Jan 2017

The general objective of the FREME innovation action is to build an open, innovative, commercial-grade framework of e-services for multilingual and semantic enrichment of digital content. FREME will empower digital content managers with the advantages and benefits it brings to the market. The FREME Framework for multilingual and semantic enrichment of digital content will integrate the following e-services based on existing and mature technologies: e-Internationalisation, e-Link, e-Entity, e-Terminology, e-Translation and e-Publishing.

FP7 LinkedTV - Television Linked To The Web

Project lifetime: Oct 2011 – Mar 2015

Our vision of future Television Linked To The Web (LinkedTV) is of a ubiquitously online cloud of video content decoupled from place, device or source. LinkedTV will make broadcast and online information (videos, images, text) usable on TV in the same way as text-based information is used today in the original Web: interlinked with each other at different granularities, with any other kind of information, searchable, and accessible everywhere and at any time. Ultimately, this means creating Internet-based hypermedia for television.

FP7 LOD2 - Creating Knowledge out of Interlinked Data

Project lifetime: Sep 2010 - Aug 2014

The key goals of the LOD2 project are: i) research and development of novel, innovative Semantic Data Web technologies, ii) expansion and integration of openly accessible and interlinked data on the web, and iii) adoption and implementation of Linked Data for media, enterprise and government. LOD2 will integrate and syndicate linked data with large-scale, existing applications and showcase the benefits in the three application scenarios. The resulting tools, methods and data sets have the potential to change the Web as we know it today.

National Projects

BMWi Smart Data Web - Creation of an industry knowledge base for the German industry

Project lifetime: April 2015 – March 2018

Smart Data Web was a German national project funded by the BMWi funding scheme. Its ultimate goal is to create an industry knowledge base for the German industry. It leverages state-of-the-art data extraction and enrichment technologies as well as Linked Data to create value-added systems for the German industry. Knowledge which is relevant to decision-making processes will be extracted from government and industry data, official web pages and social media. Then the data is analyzed using natural language processing frameworks and integrated into knowledge graphs. These knowledge graphs are accessible via dashboards and APIs, as well as via Linked Data. Special concern is given to legal questions, such as data licensing as well as data security and privacy.

Community and Open Source Projects

GlobalFactSyncRE - Factual Consensus Across Different Knowledge Sources

Project lifetime: June 2019 – May 2020

GlobalFactSyncRE is a project funded by the Wikimedia Foundation which aims to synchronize facts extracted from Wikipedia infoboxes. It extracts all infobox facts, together with their references, with the goal of creating a tool for Wikipedia editors that detects and displays differences across infobox facts in an intelligent way. It will help sync infoboxes between languages and/or Wikidata. The project is run by DBpedia, which frequently crawls and analyses over 120 Wikipedia language editions and therefore has near-complete information about (1) which facts are in infoboxes across all Wikipedias and (2) where Wikidata is already used in those infoboxes. The extracted references will also be used to enhance Wikidata.

DBpedia - Global and Unified Access to Knowledge

Project lifetime: 2007 – ongoing

DBpedia is a crowd-sourced community effort to extract structured content from the information created in various Wikimedia projects. This structured information resembles an open knowledge graph (OKG) which is available for everyone on the Web. DBpedia data is served as Linked Data, which is revolutionizing the way applications interact with the Web. One can navigate this Web of facts with standard Web browsers, automated crawlers or pose complex queries with SQL-like query languages (e.g. SPARQL). Have you thought of asking the Web about all cities with low criminality, warm weather and open jobs? That's the kind of query we are talking about.
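To give a flavour of such queries, here is a minimal sketch in Python that builds a SPARQL query for large cities in a given country and the URL to send it to the public DBpedia endpoint. The helper names (build_city_query, build_request_url), the population threshold and the choice of properties are illustrative assumptions, as is the use of the endpoint's format parameter to request JSON results. The sketch only constructs the request and does not contact the endpoint.

```python
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint.
DBPEDIA_SPARQL_ENDPOINT = "https://dbpedia.org/sparql"


def build_city_query(country_uri: str, min_population: int, limit: int = 10) -> str:
    """Build a SPARQL query for cities in a country above a population threshold.

    Hypothetical helper for illustration; the selected properties are an
    assumption about which DBpedia ontology terms are useful here.
    """
    return f"""
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?city ?population WHERE {{
  ?city a dbo:City ;
        dbo:country <{country_uri}> ;
        dbo:populationTotal ?population .
  FILTER (?population > {min_population})
}}
ORDER BY DESC(?population)
LIMIT {limit}
""".strip()


def build_request_url(query: str) -> str:
    """Encode the query as a GET request URL (hypothetical helper).

    Assumes the endpoint accepts a 'format' parameter for JSON results.
    """
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{DBPEDIA_SPARQL_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    query = build_city_query("http://dbpedia.org/resource/Germany", 500000, limit=5)
    print(query)
    print(build_request_url(query))
```

The resulting URL can be opened in a browser or fetched with any HTTP client. Criteria such as weather or open jobs would require additional properties or data sources that are not shown in this sketch.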

DBpedia NIF - Open, Large-Scale and Multilingual Knowledge Extraction Corpus

Project lifetime: 2015 – ongoing

The DBpedia community has put a significant amount of effort into developing technical infrastructure and methods for efficient extraction of structured information from Wikipedia. These efforts have primarily focused on harvesting, refining and publishing semi-structured information found in Wikipedia articles, such as information from infoboxes, categorization information, images, wikilinks and citations. Nevertheless, a vast amount of valuable information is still contained in the unstructured Wikipedia article texts. DBpedia NIF aims to fill this gap and extract valuable information from Wikipedia article texts. At its core, DBpedia NIF is a large-scale and multilingual knowledge extraction corpus. The purpose of this project is two-fold: to dramatically broaden and deepen the amount of structured information in DBpedia, and to provide a large-scale, multilingual language resource for the development of various NLP and IR tasks. The dataset provides the content of all articles for 128 Wikipedia languages. It captures the content as it is found in Wikipedia: the structure (sections and paragraphs) and the annotations provided by the Wikipedia editors.

The Linked Web APIs Dataset - APIs meet Linked Data

Project lifetime: 2012 – ongoing

The Linked Web APIs dataset is a Linked Data dataset with semantic descriptions of Web APIs. It contains over 11,339 Web API descriptions, over 7,415 mashups and almost 7,717 mashup developers' profiles. The data is retrieved from the largest Web service and mashup repository. In total, the dataset contains over half a million RDF triples.

Unsupervised Named Entity Recognition and Linking

Project lifetime: 2012 – ongoing

This project is an open-source NER system supported by unsupervised methods for entity spotting and classification, and by several most-frequent-sense based and context-based entity linking methods. The system implements a method for learning entity salience based on local features, derived from the document, and global features, derived from an existing knowledge graph (i.e. DBpedia).