Friday, November 8, 2019 - 12:57pm
- Pinky: Gee, Brain, what are we gonna do this year?
- Brain: The same thing we do every year, Pinky. Taking over GSoC.
And this is exactly what we did. We were accepted as one of 206 open source organizations to participate in Google Summer of Code (GSoC) again. More than 25 students followed our call for project ideas. In the end, we chose six amazing students and their project proposals to work with during summer 2019.
In the following post, we will give you some insights into the project ideas and how they turned out. Additionally, we will shed some light on our amazing team of mentors, who devoted a lot of time and expertise to mentoring our students.
Meet the students and their projects
A Neural QA Model for DBpedia by Anand Panchbhai
With the booming amount of information continuously being added to the internet, organising the facts and serving this information to users becomes a very difficult task. Currently, DBpedia hosts billions of data points and corresponding relations in the RDF format. Accessing data on DBpedia via a SPARQL query is difficult for amateur users who do not know how to write a query. This project tried to make this humongous linked data available to a larger user base in their natural languages (currently restricted to English). The primary objective of the project was to translate natural language questions into valid SPARQL queries. Click here if you want to check his final code.
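To make the input and output of such a translation concrete, here is a minimal sketch. It is not the neural model built in the project: it swaps in two hand-written question templates (hypothetical, chosen for illustration only) and assumes the entity can be mapped to a DBpedia resource by replacing spaces with underscores, which only holds for simple names.

```python
import re

# Standard DBpedia namespace prefixes.
PREFIXES = (
    "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
    "PREFIX dbr: <http://dbpedia.org/resource/>\n"
)

# Hypothetical hand-written templates (an assumption for illustration).
# Each pairs a question pattern with a SPARQL skeleton.
TEMPLATES = [
    (re.compile(r"^who is the (?P<rel>\w+) of (?P<ent>.+?)\??$", re.IGNORECASE),
     "SELECT ?x WHERE {{ dbr:{ent} dbo:{rel} ?x }}"),
    (re.compile(r"^where was (?P<ent>.+?) born\??$", re.IGNORECASE),
     "SELECT ?x WHERE {{ dbr:{ent} dbo:birthPlace ?x }}"),
]


def question_to_sparql(question):
    """Return a SPARQL query string for a supported question, else None."""
    for pattern, skeleton in TEMPLATES:
        match = pattern.match(question.strip())
        if match:
            groups = match.groupdict()
            # Naive entity linking: "Barack Obama" -> "Barack_Obama".
            ent = groups["ent"].strip().replace(" ", "_")
            rel = groups.get("rel", "").lower()
            return PREFIXES + skeleton.format(ent=ent, rel=rel)
    return None


print(question_to_sparql("Who is the spouse of Barack Obama?"))
```

A learned model replaces the fixed template list with templates and slot fillings predicted from data, but the contract stays the same: a question goes in, a SPARQL string comes out.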
Multilingual Neural RDF Verbalizer for DBpedia by Dwaraknath Gnaneshwar
Recently, the generation of natural language from RDF data has gained substantial attention and has also been proven to support the creation of Natural Language Generation benchmarks. However, most models are aimed at generating coherent sentences in English, while other languages have enjoyed comparatively less attention from researchers. RDF data is usually in the form of triples, <subject, predicate, object>. The subject denotes the resource, while the predicate denotes traits or aspects of the resource and expresses the relationship between subject and object. In this project, we aimed to create a multilingual Neural Verbalizer, i.e., to generate high-quality natural-language text from sets of RDF triples in multiple languages using one stand-alone, end-to-end trainable model. You can follow up on the progress and outcome of the project here.
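As a rough illustration of the task, the sketch below represents one triple and verbalizes it in two languages. It uses fixed per-language sentence templates instead of the neural model described above; the template table and the example triple are assumptions made for this example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: <subject, predicate, object>."""
    subject: str
    predicate: str
    obj: str


# Hypothetical per-language templates, keyed by predicate name.
TEMPLATES = {
    "en": {"birthPlace": "{s} was born in {o}."},
    "de": {"birthPlace": "{s} wurde in {o} geboren."},
}


def verbalize(triple, lang="en"):
    """Turn a triple into a sentence using a fixed template."""
    subject = triple.subject.replace("_", " ")
    obj = triple.obj.replace("_", " ")
    template = TEMPLATES[lang].get(triple.predicate)
    if template is None:
        # Fallback when no template exists: plain concatenation.
        return f"{subject} {triple.predicate} {obj}."
    return template.format(s=subject, o=obj)


triple = Triple("Albert_Einstein", "birthPlace", "Ulm")
print(verbalize(triple, "en"))
print(verbalize(triple, "de"))
```

Templates like these have to be written by hand for every predicate and language, which is the effort an end-to-end trainable model tries to avoid.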
Predicate Detection using Word Embeddings for Question Answering over Linked Data by Yajing Bian
Knowledge-based question-answering (KBQA) systems have demonstrated an ability to generate answers to natural language questions from information stored in a large-scale knowledge base. Generally, such a system completes the analysis in three steps: identifying named entities, detecting predicates and generating SPARQL queries. Of these three steps, predicate detection identifies the KB relation(s) a question refers to. To build a predicate detection structure, we first identified all possible named entities and then collected all predicates corresponding to those entities. The next step is to calculate the similarity between the question and the candidate predicates using a multi-granularity neural network model (MGNN). To find the globally optimal entity-predicate assignment, we use a joint model based on the results of the entity linking and predicate detection processes, rather than taking the local predictions (i.e. the most probable entity or predicate) as the final result. More details on the project are available here.
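The similarity step can be sketched in a few lines. Note that this is a deliberately simple stand-in: instead of the MGNN it scores candidates with a bag-of-words cosine similarity, and the candidate predicate list is made up for the example.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercased word counts; camelCase names like 'birthPlace' are split."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    return Counter(re.findall(r"[a-z]+", spaced.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_predicates(question, candidates):
    """Sort candidate predicates by similarity to the question, best first."""
    q = tokens(question)
    return sorted(candidates, key=lambda p: cosine(q, tokens(p)), reverse=True)


candidates = ["deathPlace", "birthPlace", "spouse"]  # assumed candidate set
print(rank_predicates("What is the birth place of Einstein?", candidates))
```

Word overlap fails as soon as the question paraphrases the predicate ("Where was Einstein born?"), which is the gap that embedding-based models are meant to close.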
A tool to generate RDF triples from DBpedia abstract by Jayakrishna Sahit
The main aim of this project was to research and develop a tool to generate highly trustworthy RDF triples from DBpedia abstracts. To develop such a tool, we implemented algorithms that take the output generated by the syntactic analyzer along with DBpedia Spotlight’s named entity identifiers. Further information and the project’s results can be found here.
A transformer of Attention Mechanism for Long-context QA by Stuart Chan
In this GSoC project, I chose to employ a transformer language model with an attention mechanism to automatically discover query templates for the neural question-answering knowledge-based model. The ultimate goal was to train the attention-based NSpM model on DBpedia and evaluate it against the QALD benchmark. Check here for more details on the project.
Workflow for linking External datasets by Jaydeep Chakraborty
The requirement of the project was to create a workflow for entity linking between DBpedia and external data sets. We aimed at an approach for ontology alignment through the use of an unsupervised mixed neural network. We explored reading and parsing the ontology and extracted all necessary information about concepts and instances. Additionally, we generated semantic vectors for each entity with different meta information such as entity hierarchy, object property, data property, and restrictions. Finally, we designed a user-interface-based system that shows all necessary information about the workflow. Further info, download details and project results are available here.
Meet our Mentors
First of all, a big shout out and thank you to all mentors and co-mentors who helped our students to succeed in their endeavours.
- Aman Mehta, former GSoC student and current junior mentor, recently interned as a software engineer at Facebook, London.
- Beyza Yaman, a senior mentor and organizational admin, Post-Doctoral Researcher based in ADAPT, Dublin City University, former Springer Nature-DBpedia intern and former research associate at the InfAI/University of Leipzig. She is responsible for the Turkish DBpedia and her fields of interest are information retrieval, data extraction and integration over Linked Data.
- Tommaso Soru, senior mentor and organizational admin. He is a Machine Learning & AI enthusiast, Data Scientist at Data Lens Ltd in London and a PhD candidate at the University of Leipzig.
“DBpedia is my window to the world of semantic data, not only for its intuitive interface but also because its knowledge is organised in a simple and uncomplicated way” Tommaso Soru, GSoC 2019
- Amandeep Srivastava, Junior Mentor and analyst at Goldman Sachs. He’s a huge fan of Christopher Nolan and likes to read fiction books in his free time.
- Diego Moussalem, Senior mentor, Senior Researcher at Paderborn University, an active and vital member of the Portuguese DBpedia Chapter.
- Luca Virgili, currently a Computer Science PhD student at the Polytechnic University of Marche. He was a GSoC student for a year and a GSoC mentor for 2 years in DBpedia.
- Bharat Suri, former GSoC student, Junior Mentor, Master’s degree in Computer Science at The Ohio State University.
“I have thoroughly enjoyed both my years of GSoC with DBpedia and I plan to stay and help out in whichever way I can” Bharat Suri, GSoC 2019
- Mariano Rico, senior mentor, Senior Doctor Researcher at Ontology Engineering Group, Universidad Politécnica de Madrid.
- Nausheen Fatma, senior mentor, Data Scientist, Natural Language Processing, Machine Learning at Info Edge (naukri.com).
- Ram G Athreya, long-term GSoC mentor, Research Engineer at Viv Labs, Bay Area, San Francisco.
- Ricardo Usbeck, team leader ‘Conversational AI and Knowledge Graphs’ at Fraunhofer IAIS.
- Rricha Jalota, former GSoC student, current senior mentor, developer in the Data Science Group at University of Paderborn, Germany.
“The reason why I love collaborating with DBpedia (apart from the fact that, it’s a powerhouse of knowledge-driven applications) is not only it gave me my first big break to the amazing field of NLP but also to the world of open-source!” Rricha Jalota, GSoC 2019
Mentor Summit Recap
This GSoC marked the 15th consecutive year of the program and was the 8th season in a row for DBpedia. As in previous years, two of our mentors, Rricha Jalota and Aashay Singhal, joined the annual GSoC mentor summit. Selected mentors get the chance to meet each other and engage in a vital exchange of knowledge and expertise around various GSoC-related and unrelated topics. Apart from more entertaining activities such as games, a scavenger hunt and a guided trip through Munich, mentors also discussed pressing questions such as “why is it important to fail your students” or “how can we have our GSoC students stay and contribute for long”.
After GSoC is before the next GSoC
If you are interested in mentoring a DBpedia GSoC project or want to contribute to a project of your own, we are happy to have you on board. Here are a few things to get you started.
- Have a look at previous DBpedia projects on GitHub
- Get in touch with old mentors and potential future mentors, for example via our DBpedia Forum. We have a dedicated group for exchange about the upcoming season in 2020.
Likewise, if you are an ambitious student who is interested in open source development and working with DBpedia, you are more than welcome to either contribute your own project idea or apply for the project ideas we will offer starting in early 2020.
See you soon,
The post Better late than never – GSOC 2019 recap & outlook GSoC 2020 appeared first on DBpedia Blog.
Thursday, October 24, 2019 - 1:41pm
We will be spending the next three days in Berlin at WikidataCon 2019, the conference for open data enthusiasts. From October 24th till 26th we will be presenting the latest developments and first results of our work in the GlobalFactSyncRE-Project.
Short Project Intro
Funded by the Wikimedia Foundation, the project started in June 2019 and has two goals:
- Answer the following questions:
- How is data edited in Wikipedia and Wikidata?
- Where does it come from?
- How can we synchronize it globally?
- Build an information system to synchronize facts between all Wikipedia language-editions, Wikidata, DBpedia and eventually multiple external sources, while also providing respective references.
In order to help Wikipedians to maintain their infoboxes, check for factual correctness, and also improve data in Wikidata, we use data from Wikipedia infoboxes of different languages, Wikidata, and DBpedia and fuse them into our PreFusion dataset (in JSON-LD). More information on the fusion process, which is the engine behind GFS, can be found in the FlexiFusion paper.
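To give an idea of what "fusing" facts from several sources can look like, here is a small sketch that groups the values reported for the same subject and property and keeps track of which source reported what. The record layout, the source names and the numbers are invented for illustration; this is not the actual PreFusion schema or the FlexiFusion algorithm.

```python
from collections import defaultdict


def prefuse(facts):
    """Group (source, subject, property, value) tuples per subject/property.

    Returns one record per subject/property pair, listing every distinct
    value together with the sources that reported it.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for source, subject, prop, value in facts:
        grouped[(subject, prop)][value].append(source)

    records = []
    for (subject, prop), values in grouped.items():
        records.append({
            "subject": subject,
            "property": prop,
            "values": [
                {"value": value, "sources": sorted(sources)}
                for value, sources in values.items()
            ],
        })
    return records


# Made-up example facts (not real data).
facts = [
    ("enwiki", "Example_City", "populationTotal", "100"),
    ("dewiki", "Example_City", "populationTotal", "100"),
    ("wikidata", "Example_City", "populationTotal", "101"),
]
for record in prefuse(facts):
    print(record)
```

Keeping all candidate values side by side with their provenance makes disagreements between language editions visible before any decision about the "right" value is taken.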
Can’t join the conference or want to find out more about GlobalFactSync?
No problem. The poster we are presenting at the conference is currently available here and will soon be available here. Additionally, why not go through our project timeline, follow up on our progress so far and find out what’s coming up next?
In case you have specific questions regarding GlobalFactSync or even some helpful feedback, just ping us via email@example.com. We also have our new DBpedia Forum, home to the DBpedia Community, which is just waiting for you to start a discussion around GlobalFactSync. Why not start it now?
For general DBpedia news and updates follow us on Twitter.
…And if you are in Berlin at WikidataCon 2019, stop by our poster and talk to our developers. They are looking forward to vital exchanges with you.
All the best
Wednesday, October 2, 2019 - 3:41pm
… by and for Consumers until 2025
One Billion – what a mission! We are proud to announce that the DBpedia Databus website at https://databus.dbpedia.org and the SPARQL API at https://databus.dbpedia.org/(repo/sparql|yasgui) (docu) are in public beta now!
The system is usable (eat-your-own-dog-food tested) following a “working software over comprehensive documentation” approach. Due to its many components (website, SPARQL endpoints, keycloak, mods, upload client, download client, and data debugging), we estimate approximately six months in beta to fix bugs, implement all features and improve the details.
But, let’s start from the beginning
The DBpedia Databus is a platform to capture invested effort by data consumers who needed better data quality (fitness for use) in order to use the data and give improvements back to the data source and other consumers. DBpedia Databus enables anybody to build an automated DBpedia-style extraction, mapping and testing for any data they need. Databus incorporates features from DNS, Git, RSS, online forums and Maven to harness the full work power of data consumers.
Vision
Professional consumers of data worldwide have already built stable cleaning and refinement chains for all available datasets, but their efforts are invisible and not reusable. Deep, cleaned data silos exist beyond the reach of publishers and other consumers trapped locally in pipelines. Data is not oil that flows out of inflexible pipelines. Databus breaks existing pipelines into individual components that together form a decentralized, but centrally coordinated data network. In this set-up, data can flow back to previous components, the original sources, or end up being consumed by external components.
One Billion interconnected, quality-controlled Knowledge Graphs until 2025
The Databus provides a platform for re-publishing these files with very little effort (leaving file traffic as the only cost factor) while offering the full benefits of built-in system features such as automated publication, structured querying, automatic ingestion, as well as pluggable automated analysis, data testing via continuous integration, and automated application deployment (software with data). The impact is highly synergistic. Just a few thousand professional consumers and research projects can expose millions of cleaned datasets, which are on par with what has long existed in deep silos and pipelines.
To a data consumer network
As we are inverting the paradigm from a publisher-centric view to a data consumer network, we will open the download valve to enable discovery of and access to massive amounts of cleaner data than published by the original source. The main DBpedia Knowledge Graph alone has 600k file downloads per year, complemented by downloads at over 20 chapters, e.g. http://es.dbpedia.org, as well as over 8 million daily hits on the main Virtuoso endpoint.
Community extensions from the alpha phase, such as DBkWik and LinkedHypernyms, are being loaded onto the bus and consolidated. We expect this number to reach over 100 by the end of the year. Companies and organisations who have previously uploaded their backlinks here will be able to migrate to the Databus. Other datasets are cleaned and posted. In two of our research projects, LOD-GEOSS and PLASS, we will re-publish open datasets, clean them and create collections, which will result in DBpedia-style knowledge graphs for energy systems and supply-chain management.
A new era for decentralized collaboration on data quality
DBpedia was established around producing a queryable knowledge graph derived from Wikipedia content that’s able to answer questions like “What do Innsbruck and Leipzig have in common?” A community and consumer network quickly formed around this highly useful data, resulting in a large, well-structured, open knowledge graph that seeded the Linked Open Data Cloud, which is the largest knowledge graph on earth. The main lesson learned after these 13 years is that current data “copy” or “download” processes are inefficient by a magnitude that can only be grasped from a global perspective. Consumers spend tremendous effort fixing errors on the client-side. If one unparseable line needs 15 minutes to find and fix, we are talking about 104 days of work for 10,000 downloads. Providers, on the other hand, will never have the resources to fix the last error as cost increases exponentially (20/80 rule).
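The 104-day figure can be checked with a quick calculation. The sketch below assumes that every one of the 10,000 downloads leads to one 15-minute fix; the 104 days then correspond to round-the-clock days of 24 hours. The 8-hour working day in the second call is our own added assumption, included only for comparison.

```python
def cleanup_effort_days(minutes_per_fix=15, downloads=10_000, hours_per_day=24):
    """Total repeated fixing effort, expressed in days of `hours_per_day` hours."""
    total_hours = minutes_per_fix * downloads / 60  # 150,000 min = 2,500 h
    return total_hours / hours_per_day


print(round(cleanup_effort_days(), 1))                 # 104.2 (24-hour days)
print(round(cleanup_effort_days(hours_per_day=8), 1))  # 312.5 (8-hour working days)
```

Counted in regular working days, the same duplicated effort is therefore roughly three times larger than the headline number suggests.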
One billion knowledge graphs in mind – the progress so far
Discarding faulty data often means that a substitute source has to be found, which takes hours of research and might lead to similar problems. From the dozens of DBpedia Community meetings we held, we can summarize that for each clean-up procedure, data transformation, linkset or schema mapping that a consumer creates client-side, dozens of consumers have invested the same effort client-side before them, and none of it reaches the source or other consumers with the same problem. Holding the community meetings just showed us the tip of the iceberg.
As a foundation, we implemented a mappings wiki that allowed consumers to improve data quality centrally. The next advancement was the creation of the SHACL standard by our former CTO and board member Dimitris Kontokostas. SHACL allows consumers to specify repeatable tests on graph structures and datatypes, which is an effective way to systematically assess data quality. We established the DBpedia Databus as a central platform to better capture decentrally created, client-side value by consumers.
It is an open system, therefore value that is captured flows right back to everybody.
The full document “DBpedia’s Databus and strategic initiative to facilitate ‘One Billion derived Knowledge Graphs by and for Consumers’ until 2025” is available here.
Thursday, September 19, 2019 - 3:07pm
SEMANTiCS is THE leading European conference in the field of semantic technologies and the platform for professionals who make semantic computing work, understand its benefits and know its limitations.
Since we at DBpedia have a long-standing partnership with SEMANTiCS, we also joined this year’s event in Karlsruhe. September 12, the last day of the conference, was dedicated to the DBpedia community.
First and foremost, we would like to thank the Institute for Applied Informatics for supporting our community and many thanks to FIZ Karlsruhe for hosting our community meeting.
Below, we give you a brief retrospective of the presentations.
Katja Hose – “Querying the web of data”
… on the search for the killer app.
The concept of Linked Open Data and the promise of the Web of Data have been around for over a decade now. Yet, the great potential of free access to a broad range of data that these technologies offer has not yet been fully exploited. This talk will, therefore, review the current state of the art, highlight the main challenges from a query processing perspective, and sketch potential ways to solve them. Slides are available here.
Dan Weitzner – “timbr-DBpedia – Exploration and Query of DBpedia in SQL”
The timbr SQL Semantic Knowledge Platform enables the creation of virtual knowledge graphs in SQL. The DBpedia version of timbr supports query of DBpedia in SQL and seamless integration of DBpedia data into data warehouses and data lakes. We already published a detailed blogpost about timbr where you can find all relevant information about this amazing new DBpedia Service.
Maribel Acosta – “A closer look at the changing dynamics of DBpedia mappings”
Her presentation looked at the mappings wiki and how different language chapters use and edit it. Slides are available here.
Mariano Rico – “Polishing a diamond: techniques and results to enhance the quality of DBpedia data”
DBpedia is more than a source for creating papers. It is also being used by companies as a remarkable data source. This talk focuses on how we can detect errors and how to improve the data, from the perspective of academic researchers but also of private companies. We show the case of the Spanish DBpedia (the second DBpedia in size after the English chapter) through a set of techniques, paying attention to results and further work. Slides are available here.
Guillermo Vega-Gorgojo – “Clover Quiz: exploiting DBpedia to create a mobile trivia game”
Clover Quiz is a turn-based multiplayer trivia game for Android devices with more than 200K multiple-choice questions (in English and Spanish) about different domains generated out of DBpedia. Questions are created off-line through a data extraction pipeline and a versatile template-based mechanism. A back-end server manages the question set and the associated images, while a mobile app has been developed and released in Google Play. The game is available free of charge and has been downloaded by more than 10K users, who have answered more than 1M questions. Clover Quiz thus demonstrates the advantages of semantic technologies for collecting data and automating the generation of multiple-choice questions in a scalable way. Slides are available here.
Fabian Hoppe and Tabea Tietz – “The Return of German DBpedia”
Fabian and Tabea will present the latest news on the German DBpedia chapter as it returns to the language chapter family after an extended offline period. They will talk about the data set, discuss a few challenges along the way and give insights into future perspectives of the German chapter. Slides are available here.
Wlodzimierz Lewoniewski and Krzysztof Węcel – “References extraction from Wikipedia infoboxes”
In Wikipedia’s infoboxes, some facts have references, which can be useful for checking the reliability of the provided data. We present challenges and methods connected with the metadata extraction of Wikipedia’s sources. We used the DBpedia Extraction Framework along with our own extensions in Python to provide statistics about citations in 10 language versions. The provided methods can be used to verify and synchronize facts depending on the quality assessment of sources. Slides are available here.
Wlodzimierz Lewoniewski – “References extraction from Wikipedia infoboxes” … He gave insight into the process of extracting references for Wikipedia infoboxes, which we will use in our GFS project.
Sebastian Hellmann, Johannes Frey, Marvin Hofer – “The DBpedia Databus – How to build a DBpedia for each of your Use Cases”
The DBpedia Databus is a platform that is intended for data consumers. It will enable users to build an automated DBpedia-style Knowledge Graph for any data they need. The big benefit is that users not only have access to data, but are also encouraged to apply improvements and, therefore, will enhance the data source and benefit other consumers. We want to use this session to officially introduce the Databus, which is currently in beta and demonstrate its power as a central platform that captures decentrally created client-side value by consumers.
We will give insight into how the new monthly DBpedia releases are built and validated, so that you can copy and adapt the process for your use cases. Slides are available here.
Interactive session, moderator: Sebastian Hellmann – “DBpedia Connect & DBpedia Commerce – Discussing the new Strategy of DBpedia”
In order to keep growing and improving, DBpedia has been undergoing a growth hack for the last couple of months. As part of this process, we developed two new subdivisions of DBpedia: DBpedia Connect and DBpedia Commerce. The former is a low-code platform to interconnect your public or private databus data with the unified, global DBpedia graph and export the interconnected and enriched knowledge graph into your infrastructure. DBpedia Commerce is an access and payment platform to transform Linked Data into a networked data economy. It will allow DBpedia to offer any data, mod, application or service on the market. During this session, we will provide more insight into these as well as an overview of how DBpedia users can best utilize them. Slides are available here.
If you want to organize a DBpedia Community meeting yourself, just get in touch with us via firstname.lastname@example.org regarding program and organization.
The post More than 50 DBpedia enthusiasts joined the Community Meeting in Karlsruhe. appeared first on DBpedia Blog.
Thursday, August 29, 2019 - 12:02pm
Today’s post features an interview with our DBpedia Day keynote speaker Katja Hose, a Professor of Computer Science at Aalborg University, Denmark. In this interview, Katja talks about increasing the reliability of Knowledge Graph access as well as her expectations for SEMANTiCS 2019.
Prior to joining Aalborg University, Katja was a postdoc at the Max Planck Institute for Informatics in Saarbrücken. She received her doctoral degree in Computer Science from Ilmenau University of Technology in Germany.
Can you tell us something about your research focus?
The most important focus of my research has been querying the Web of Data, in particular, efficient query processing over distributed knowledge graphs and Linked Data. This includes indexing, source selection, and efficient query execution. Unfortunately, it happens all too often that the services needed to access remote knowledge graphs are temporarily not available, for instance, because a software component crashed. Hence, we are currently developing a decentralized architecture for knowledge sharing that will make access to knowledge graphs a reliable service, which I believe is the key to a wider acceptance and usage of this technology.
How do you personally contribute to the advancement of semantic technologies?
I contribute by doing research, advancing the state of the art, and applying semantic technologies to practical use cases. The most important achievements so far have been our works on indexing and federated query processing, and we have only recently published our first work on a decentralized architecture for sharing and querying semantic data. I have also been using semantic technologies in other contexts, such as data warehousing, fact-checking, sustainability assessment, and rule mining over knowledge bases.
Overall, I believe the greatest ideas and advancements come when trying to apply semantic technologies to real-world use cases and problems, and that is what I will keep on doing.
Which trends and challenges do you see for linked data and the semantic web?
The goal and the idea behind Linked Data and the Semantic Web is the second-best invention after the Internet. But unlike the Internet, Linked Data and the Semantic Web are only slowly being adopted by a broader community and by industry.
I think part of the reason is that, from a company’s point of view, there are not many incentives for, or added benefits of, broadly sharing the achievements. Some companies are simply reluctant to openly share their results and experiences in the hope of retaining an advantage over their competitors. I believe that if these success stories were shared more openly, and this is the trend we are witnessing right now, more companies would see the potential for their own problems and find new exciting use cases.
Another particular challenge, which we will have to overcome, is that it is currently still far too difficult to obtain and maintain an overview of what data is available and formulate a query as a non-expert in SPARQL and the particular domain… and of course, there is the challenge that accessing these datasets is not always reliable.
As artificial intelligence becomes more and more important, what is your vision of AI?
AI and machine learning are indeed becoming more and more important. I do believe that these technologies will bring us a huge step ahead. The process has already begun. But we also need to be aware that we are currently in the middle of a big hype where everybody wants to use AI and machine learning – although many people actually do not truly understand what it is and if it is actually the best solution to their problems. It reminds me a bit of the old saying “if the only tool you have is a hammer, then every problem looks like a nail”. Only time will tell us which problems truly require machine learning, and I am very curious to find out which solutions will prevail.
However, the current state of the art is still very far away from the AI systems that we all know from Science Fiction. Existing systems operate like black boxes on well-defined problems and lack true intelligence and understanding of the meaning of the data. I believe that the key to making these systems trustworthy and truly intelligent will be their ability to explain their decisions and their interpretation of the data in a transparent way.
What are your expectations about Semantics 2019 in Karlsruhe?
First and foremost, I am looking forward to meeting a broad range of people interested in semantic technologies. In particular, I would like to get in touch with industry-based research and to be exposed
We would like to thank Katja Hose for her insights and are happy to have her as one of our keynote speakers.
Yours DBpedia Association
Tuesday, August 20, 2019 - 1:45pm
As the upcoming 14th DBpedia Community Meeting, co-located with SEMANTiCS 2019 in Karlsruhe, Sep 9-12, is drawing nearer, we would like to take this opportunity to introduce you to our DBpedia keynote speakers.
Today’s post features an interview with Dan Weitzner from WPSemantix who talks about timbr-DBpedia, which we blogged about recently, as well as future trends and challenges of linked data and the semantic web.
Dan Weitzner is co-founder and Vice President of Research and Development of WPSemantix. He obtained his Bachelor of Science in Computer Science from Florida Atlantic University. In collaboration with DBpedia, he and his colleagues at WPSemantix launched timbr, the first SQL Semantic Knowledge Graph that integrates Wikipedia and Wikidata Knowledge into SQL engines.
1. Can you tell us something about your research focus?
WPSemantix bridges the worlds of standard databases and the Semantic Web by creating ontologies accessible in standard SQL.
Our platform – timbr– is a virtual knowledge graph that maps existing data-sources to abstract concepts, accessible directly in all the popular Business Intelligence (BI) tools and also natively integrated into Apache Spark, R, Python, Java and Scala.
timbr enables reasoning and inference for complex analytics without the need for costly Extract-Transform-Load (ETL) processes to graph databases.
2. How do you personally contribute to the advancement of semantic technologies?
We believe we have lowered the fundamental barriers to adoption of semantic technologies for large organizations who want to benefit from knowledge graph capabilities without firstly requiring fundamental changes in their database infrastructure and secondly, without requiring expensive organizational changes or significant personnel retraining.
Additionally, we implemented the W3C Semantic Web principles to enable inference and inheritance between concepts in SQL, and to allow seamless integration of existing ontologies from OWL. As a result, users across organizations can do complex analytics using the same tools that they currently use to access and query their databases and, in addition, run sophisticated queries on big data without requiring highly technical expertise.
timbr-DBpedia is one example of what can be achieved with our technology. This joint effort with the DBpedia Association allows semantic SQL query of the DBpedia knowledge graph, and the semantic integration of the DBpedia knowledge into data warehouses and data lakes. Finally, timbr-DBpedia allows organizations to benefit from enriching their data with DBpedia knowledge, combining it with machine learning and/or accessing it directly from their favourite BI tools.
3. Which trends and challenges do you see for linked data and the semantic web?
Currently, the use of semantic technologies for data exploration and data integration is a significant trend followed by data-driven communities. It allows companies to leverage the relationship-rich data to find meaningful insights into their data.
One of the big difficulties for the average developer and business intelligence analyst is the challenge to learn semantic technologies. Another one is to create ontologies that are flexible and easily maintained. We aim to solve both challenges with timbr.
4. Which application areas for semantic technologies do you perceive as most promising?
I think semantic technologies will bloom in applications that require data integration and contextualization for machine learning models.
Ontology-based integration seems very promising by enabling accurate interpretation of data from multiple sources through the explicit definition of terms and relationships – particularly in big data systems, where ontologies could bring consistency, expressivity and abstraction capabilities to the massive volumes of data.
5. As artificial intelligence becomes more and more important, what is your vision of AI?
I envision knowledge-based business intelligence and contextualized machine learning models. This will be the bedrock of cognitive computing as any analysis will be semantically enriched with human knowledge and statistical models.
This will bring analysts and data scientists to the next level of AI.
6. What are your expectations about SEMANTiCS 2019 in Karlsruhe?
I want to share our vision with the semantic community and I would also like to learn about the challenges, vision and expectations of companies and organizations dealing with semantic technologies. I will present “timbr-DBpedia – Exploration and Query of DBpedia in SQL”.
Visit SEMANTiCS 2019 in Karlsruhe, Sep 9-12 and find out more about timbr-DBpedia and all the other new developments at DBpedia. Get your tickets for our community meeting here. We are looking forward to meeting you during DBpedia Day.
Yours DBpedia Association
Thursday, August 8, 2019 - 1:41pm
During the DBpedia Day in Leipzig, I gave a talk about how to use the facts contained in the DBpedia Knowledge Graph for generating coherent sentences and texts.
We essentially rely on Natural Language Generation (NLG) techniques for accomplishing this task. NLG is the process of generating coherent natural language text from non-linguistic data (Reiter and Dale, 2000). Despite community agreement on the actual text and speech output of these systems, there is far less consensus on what the input should be (Gatt and Krahmer, 2017). A large number of inputs have been used for NLG systems, including images (Xu et al., 2015), numeric data (Gkatzia et al., 2014), and semantic representations (Theune et al., 2001).
Why not generate text from Knowledge graphs?
The generation of natural language from the Semantic Web was introduced some years ago (Ngonga Ngomo et al., 2013; Bouayad-Agha et al., 2014; Staykova, 2014). However, it has recently gained substantial attention, and some challenges have been proposed to investigate the quality of automatically generated texts from RDF (Colin et al., 2016). Moreover, RDF has demonstrated a promising ability to support the creation of NLG benchmarks (Gardent et al., 2017). Still, English is the only language that has been widely targeted. Thus, we proposed RDF2NL, which can generate texts in languages other than English by relying on different language versions of SimpleNLG.
What is RDF2NL?
While the exciting avenue of using deep learning techniques in NLG approaches (Gatt and Krahmer, 2017) is open to this task and deep learning has already shown promising results for RDF data (Sleimi and Gardent, 2016), the morphological richness of some languages led us to develop a rule-based approach. This was to ensure that we could identify the challenges imposed by each language from the Semantic Web (SW) perspective before applying Machine Learning (ML) algorithms. RDF2NL is able to generate either a single sentence or a summary of a given resource. RDF2NL is based on LD2NL by Ngonga Ngomo et al., and it uses the Brazilian Portuguese, Spanish, French, German and Italian adaptations of SimpleNLG for the realization task.
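To make the idea of rule-based verbalization concrete, here is a deliberately tiny sketch in Python. It is not RDF2NL or LD2NL: real systems derive sentence structure from linguistic rules and use a realizer such as SimpleNLG, whereas this toy uses a fixed lookup table. The `TEMPLATES` table and both function names are assumptions made up for this illustration.

```python
def humanize(uri):
    """Turn the last path segment of a URI into a readable label."""
    return uri.rstrip("/").rsplit("/", 1)[-1].replace("_", " ")


# Hypothetical templates keyed by predicate name. A real rule-based system
# would build these sentences from grammar rules, not from a fixed table.
TEMPLATES = {
    "birthPlace": "{s} was born in {o}.",
    "spouse": "{s} is married to {o}.",
}


def verbalize(triple):
    """Render one <subject, predicate, object> triple as an English sentence."""
    subject, predicate, obj = triple
    name = humanize(predicate)
    # Fall back to a generic pattern when no template is known.
    template = TEMPLATES.get(name, "{s} has " + name + " {o}.")
    return template.format(s=humanize(subject), o=humanize(obj))


sentence = verbalize((
    "http://dbpedia.org/resource/Barack_Obama",
    "http://dbpedia.org/ontology/birthPlace",
    "http://dbpedia.org/resource/Honolulu",
))
```

Even this toy shows why morphologically rich languages are harder: gender, case and agreement cannot be handled by filling slots in a fixed string.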
An example of RDF2NL application:
We envisioned a promising application of RDF2PT, the Portuguese version of RDF2NL: supporting the automatic creation of benchmarking datasets for Named Entity Recognition (NER) and Entity Linking (EL) tasks. In Brazilian Portuguese, there is a lack of gold standard datasets for these tasks, which makes the investigation of these problems difficult for the scientific community. Our aim was to create Brazilian Portuguese silver standard datasets that can be uploaded into GERBIL for easy evaluation. To this end, we implemented RDF2PT in BENGAL, which is an approach for automatically generating NER benchmarks based on RDF triples and Knowledge Graphs. This application has already resulted in promising datasets, which we have used to investigate the capability of multilingual entity linking systems for recognizing and disambiguating entities in Brazilian Portuguese texts. You can find some results below:
NER – http://gerbil.aksw.org/gerbil/experiment?id=201801050043
NED – http://gerbil.aksw.org/gerbil/experiment?id=201801110012
More application scenarios
- Summarize or Explain KBs to non-experts
- Create news automatically (automated journalism)
- Summarize medical records
- Generate technical manuals
- Support the training of other NLP tasks
- Generate product descriptions (Ebay)
Deep Learning into RDF2NL
After devising our rule-based approach, we realized that RDF2NL is really good at selecting adequate content from the RDF triples, but the fluency of its generated texts remains a challenge. Therefore, we decided to move forward and work with neural network models to improve the fluency of texts, as they have already shown promising results in the generation of translations. We first focused on the generation of referring expressions, which is an essential part of generating texts: it decides how the NLG model will present the information about a given entity. For example, the referring expressions of the entity Barack Obama can be “the former president of USA”, “Obama”, “Barack”, “He” and so on. Since then, we have been working on combining different NLG sub-tasks into single neural models to improve the fluency of our texts.
GSoC on it – Stay tuned!
Until now, we have relied on different language versions of SimpleNLG for the realization task. Apart from trying to improve the fluency of our models, we are now investigating the generation of multiple languages with a single neural model. Our student has been working hard to provide nice results and we are basically at the end of our GSoC project. So stay tuned to know the outcome of this exciting project.
Many thanks to Diego for his contribution. If you want to write a guest post, share your results on the DBpedia Blog, and thus give your work more visibility and outreach, just ping us via email@example.com.
Thursday, August 1, 2019 - 12:41pm
DBpedia Live is a long-term core project of DBpedia that immediately extracts fresh triples from all changed Wikipedia articles. After a long hiatus, fresh and live updated data is available once again, thanks to our former co-worker Lena Schindler whose work we feature in this blog post. Before we dive into Lena’s report, let’s have a look at some general info about DBpedia Live:
Live Enterprise Version
OpenLink Software provides a scalable, dedicated, live Virtuoso instance, built on Lena’s remastering. Kingsley Idehen announced the dedicated business service in our new DBpedia forum.
On the Databus, we collect publicly shared and business-ready dedicated services in the same place where you can download the data. Databus allows you to download the data, build a service, and offer that service, all in one place. Data uploaders can also see who builds something with their data.
Remastering the DBpedia Live Module
Contribution by Lena Schindler
After developing the DBpedia REST API as part of a student project in 2018, I worked as a student Research Assistant for DBpedia. My task was to analyze and patch severe issues in the DBpedia Live instance. I will briefly describe the purpose of DBpedia Live, the reasons it went out of service, what I did to fix these, and finally, the changes needed to support multi-language abstract extraction.
The DBpedia Extraction Framework is Scala-based software with numerous features that have evolved around extracting knowledge (as RDF) from Wikis. One part is the DBpedia Live module in the “live-deployed” branch, which is intended to provide a continuously updated version of DBpedia by processing Wikipedia pages on demand, immediately after they have been modified by a user. The backbone of this module is a queue that is filled with recently edited Wikipedia pages, combined with a relational database, called Live Cache, that handles the diff between two consecutive versions of a page. The module that fills the queue, called Feeder, needs some kind of connection to a Wiki instance that reports changes to a Wiki Page. The processing then takes place in four steps:
- A wiki page is taken out of the queue.
- Triples are extracted from the page, with a given set of extractors.
- The new triples from the page are compared to the old triples from the Live Cache.
- The triple sets that have been deleted and added are published as text files, and the Cache is updated.
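The four steps above can be sketched in a few lines of Python. This is only an illustration of the diff logic, not the Scala code of the Live module: the function name `process_next`, the in-memory dictionary standing in for the Live Cache, and the example triples are all assumptions, and publishing is reduced to returning the two triple sets.

```python
from collections import deque


def process_next(queue, cache, extract):
    """Process one page: extract triples, diff against the cache, update the cache."""
    title = queue.popleft()              # step 1: take a page out of the queue
    new = set(extract(title))            # step 2: run the extractors
    old = cache.get(title, set())        # step 3: compare with the cached triples
    added = new - old
    deleted = old - new
    cache[title] = new                   # step 4: update the cache ...
    return title, added, deleted         # ... and hand back the change-sets


# Hypothetical example data, invented for this sketch.
queue = deque(["Leipzig"])
cache = {"Leipzig": {("Leipzig", "label", "Leipzig"),
                     ("Leipzig", "population", "old value")}}


def fake_extract(title):
    """Stand-in extractor returning a fixed set of triples."""
    return [(title, "label", "Leipzig"), (title, "population", "new value")]


title, added, deleted = process_next(queue, cache, fake_extract)
```

Unchanged triples (here the label) appear in neither set, so only the actual difference between two page versions is published.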
DBpedia Live had been out of service since May 2018, due to the termination of the Wikimedia RCStream Service, upon which the old DBpedia Live Feeder module relied. This socket-based service provided information about changes to an existing Wikimedia instance and was replaced by the EventStreams service, which runs over a single HTTP connection using chunked transfer encoding and follows the Server-Sent Events (SSE) protocol. It provides a stream of events, each of which contains information about title, id, language, author, and time of every page edit of all Wikimedia instances.
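For readers unfamiliar with SSE, the wire format is plain text: `data:` lines carry the payload and a blank line ends an event. The following Python sketch parses such a stream from any iterable of lines. It is an illustration only, not the Feeder's code, and the sample payload (including its field names and values) is invented for this example rather than copied from the real stream.

```python
import json


def parse_sse(lines):
    """Yield the decoded JSON payload of each server-sent event."""
    data = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield json.loads("\n".join(data))
                data = []
        elif line.startswith(":"):
            continue                      # comment line, e.g. a keep-alive
        elif line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]         # the protocol strips one leading space
            data.append(value)
        # Other fields (event, id, retry) are ignored in this sketch.


# Invented sample input; a payload may span several data lines.
sample = [
    ": keep-alive comment\n",
    "event: message\n",
    'data: {"title": "Leipzig", "id": 1, "user": "ExampleUser"}\n',
    "\n",
    'data: {"title": "Grimma",\n',
    'data:  "id": 2}\n',
    "\n",
]
events = list(parse_sse(sample))
```

In a live setting the lines would come from a long-running HTTP response instead of a list, which is exactly where connection drops become a concern.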
Starting in September 2018, my first task was to implement a new Feeder for DBpedia Live that is based on this new Wikimedia EventStreams Service. For the Java world, the Akka framework provides an implementation of an SSE client. Akka is a toolkit developed by Lightbend. It simplifies the construction of concurrent and distributed JVM applications, enabling both Java and Scala access. The Akka SSE client and the Akka Streams module are used in the new EventStreamsFeeder (Akka Helper) to extract and process the data stream. I decided to use Scala instead of Java, because it is a more natural fit for Akka.
After I was able to process events, I had the problem that frequent interruptions in the upstream connection were causing the processing stream to fail. Luckily, Akka provides a fallback mechanism with back-off, similar to the Binary Exponential Backoff of the Ethernet protocol, which I could use to restart the stream (called “Graph” in Akka terminology).
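The idea behind restarting with back-off can be sketched independently of Akka. In the Python illustration below, the delay doubles after every failure up to a cap, with optional random jitter. The function names and the parameters `min_delay`, `max_delay` and `random_factor` are assumptions chosen for this sketch; it does not reproduce Akka's API or the Feeder's actual configuration.

```python
import random
import time


def backoff_delays(min_delay, max_delay, random_factor=0.2, rng=random.random):
    """Yield delays that double after each failure, capped at max_delay, plus jitter."""
    delay = min_delay
    while True:
        yield delay * (1.0 + rng() * random_factor)
        delay = min(delay * 2, max_delay)


def run_with_restarts(run_stream, delays, max_restarts, sleep=time.sleep):
    """Run run_stream(), restarting it after connection errors with back-off."""
    for attempt in range(max_restarts + 1):
        try:
            return run_stream()
        except ConnectionError:
            if attempt == max_restarts:
                raise                     # give up after the last allowed restart
            sleep(next(delays))
```

Passing `sleep` and `rng` as arguments is a small design choice that keeps the sketch testable without actually waiting or depending on random numbers.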
Another problem was that in many cases, there were many changes to a page within a short time interval, and if events were processed quickly enough, each change would be processed separately, stressing the Live Instance with unnecessary load. A simple “thread sleep” reduced the number of change-sets being published every hour from thousands to a few hundred.
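The effect of waiting before processing can be illustrated with a small Python sketch that collapses repeated edits of the same page within a time window into a single update. This is a variation on the idea for illustration, not the "thread sleep" used in the Live module; the function name `coalesce`, the 60-second window and the example events are assumptions.

```python
def coalesce(events, window_seconds):
    """Group (timestamp, title) events into fixed windows, one entry per page per window."""
    windows = {}
    for timestamp, title in events:
        bucket = int(timestamp // window_seconds)
        # A set keeps each page only once per window, however often it was edited.
        windows.setdefault(bucket, set()).add(title)
    return [(bucket, sorted(titles)) for bucket, titles in sorted(windows.items())]


# Invented example: page "A" is edited twice in the first minute and once later.
edits = [(0, "A"), (1, "A"), (2, "B"), (61, "A")]
batches = coalesce(edits, window_seconds=60)
```

Four incoming edits lead to three page updates here; on pages with bursts of edits the saving is correspondingly larger.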
The next task was to prepare the Live module for the extraction of abstracts (typically the first paragraph of a page, or the text before the table of contents). The extractors used for this task were re-implemented in 2017. This turned out to be first a configuration issue and then a candidate for long debugging sessions spent fixing issues in the dependencies between the “live” and “core” modules. Then, in order to allow the extraction of abstracts in multiple languages, the “live” module needed many small changes in places spread across the code-base, and care had to be taken not to slow down the extraction in the single-language case compared to the performance before the change. Deployment was delayed by an issue with the remote management unit of the production server, but was accomplished by May 2019.
I also collected my knowledge of the Live module in detailed documentation, addressed to developers who want to contribute to the code. This includes an explanation of the architecture as well as installation instructions. After 400 hours of work, DBpedia Live is alive and kicking, and now supports multi-language abstract extraction. Being responsible for many aspects of Software Engineering, like development, documentation, and deployment, I was able to learn a lot about DBpedia and the Semantic Web, hone new skills in database development and administration, and expand my programming experience using Scala and Akka.
“Thanks a lot to the whole DBpedia Team who always provided a warm and supportive environment!”
Thank you Lena, it is people like you who help DBpedia improve and develop further, and help to make data networks a reality.
Yours DBpedia Association
Thursday, July 25, 2019 - 1:33pm
How is data edited in Wikipedia/Wikidata? Where does it come from? And how can we synchronize it globally?
The GlobalFactSync (GFS) Project — funded by the Wikimedia Foundation — started in June 2019 and has two goals:
- Answer the above-mentioned three questions.
- Build an information system to synchronize facts between all Wikipedia language-editions and Wikidata.
Now we are seven weeks into the project (10+ more months to go) and we are releasing our first prototypes to gather feedback.
How – Synchronization vs Consensus
We follow an absolute “Human(s)-in-the-loop” approach when we talk about synchronization. The final decision whether to synchronize a value or not should rest with a human editor who understands consensus and the implications. There will be no automatic imports. Our focus is to drastically reduce the time to research all references for individual facts.
A trivial example to illustrate our reasoning is the release date of the single “Boys Don’t Cry” (March 16th, 1989) in the English, Japanese, and French Wikipedia, Wikidata and finally in the external open database MusicBrainz. A human editor might need 15-30 minutes to find and open all the different sources, while our current prototype can spot differences and display them in 5 seconds.
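Once the values from all sources sit side by side, spotting a discrepancy is a simple grouping problem. The Python sketch below shows the idea; it is not the prototype's code, and the source names and values are invented placeholders that do not reflect what any real source says.

```python
from collections import defaultdict


def find_disagreements(facts):
    """Return, per property, the competing values and which sources state each one."""
    by_property = defaultdict(lambda: defaultdict(set))
    for source, prop, value in facts:
        by_property[prop][value].add(source)
    # Only properties with more than one distinct value are of interest.
    return {
        prop: {value: sorted(sources) for value, sources in values.items()}
        for prop, values in by_property.items()
        if len(values) > 1
    }


# Invented placeholder facts: three sources, two properties.
facts = [
    ("wiki_en", "releaseDate", "value-1"),
    ("wiki_fr", "releaseDate", "value-1"),
    ("wikidata", "releaseDate", "value-2"),
    ("wiki_en", "label", "same"),
    ("wikidata", "label", "same"),
]
conflicts = find_disagreements(facts)
```

The output only lists what differs and who says so; deciding which value is right stays with the human editor.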
We already had our first successful edit where a Wikipedia editor fixed the discrepancy with our prototype: “I’ve updated Wikidata so that all five sources are in agreement.” We are now working on the following tasks:
- Scaling the system to all infoboxes, Wikidata and selected external databases (see below on the difficulties there)
- Making the system:
- “live” without stale information
- “reliable” with fewer technical errors when extracting and indexing data
- “better referenced” by not only synchronizing facts but also references
Contributions and Feedback
To ensure that GlobalFactSync will serve and help the Wikiverse we encourage everyone to try our data and micro-services and leave us some feedback, either on our Meta-Wiki page or via email. In the following 10+ months, we intend to improve and build upon these initial results. At the same time, these microservices are available to every developer to exploit and to hack together useful applications. The most promising contributions will be rewarded and receive the book “Engineering Agile Big-Data Systems”. Please post feedback or any tool or GUI here. In case you need changes to be made to the API, please let us know, too.
For the ambitious future developers among you, we have some budget left that we will dedicate to an internship. In order to apply, just mention it in your feedback post.
Data, APIs & Microservices (Technical prototypes)
Data Processing and Infobox Extraction
For GlobalFactSync we use data from Wikipedia infoboxes of different languages, as well as Wikidata and DBpedia, and fuse them to obtain one big, consolidated dataset – a PreFusion dataset (in JSON-LD). More information on the fusion process, which is the engine behind GFS, can be found in the FlexiFusion paper. One of our next steps is to integrate MusicBrainz into this process as an external dataset. We hope to integrate even more such external datasets to increase the amount of available information and references.
We deployed a set of microservices to show the current state of our toolchain.
- [Initial User Interface] The GlobalFactSync UI prototype (available at http://global.dbpedia.org) shows all extracted information available for one entity for different sources. It can be used to analyze the factual consensus between different Wikipedia articles for the same thing. Example: Look at the variety of population counts for Grimma.
- [PreFusion JSON API] While the UI allows simple, fast and easy browsing for one entity at a time, we also provide raw access to the underlying data (PreFusion dump). The query UI (http://global.dbpedia.org:8990, user: read, pw: gfs) can be utilized to run simple analytical queries. Thus, we can determine the number of locations having at least one population value (1,194,007) but can also focus on examples with data quality problems (e.g. one of the 4,268 locations with more than 10 population values). Moreover, documentation about the PreFusion dataset and the download link for the data are available on the Databus website.
- [Reference Data Download] We ran the Reference Extraction Service over 10 Wikipedia languages. Download dumps here.
- [Reference Extraction Service] Good references are crucial for an import of facts from Wikipedia to Wikidata. We are currently working with colleagues from Poznań University of Economics and Business on reference extraction for facts from Wikipedia. A current development reference extraction microservice shows all references and the location where they were spotted in the Infobox – ad hoc – for a given article: http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references?article=https://en.wikipedia.org/wiki/Facebook&format=json ( ‘&format=tsv’ also available)
- [Infobox Extraction Service] A similar ad hoc extraction of factual information from infoboxes and other Wikipedia article information is available here. This microservice displays information which can be extracted with the help of DBpedia mappings from an infobox e.g. from the German Facebook Wikipedia article: http://dbpedia.informatik.uni-leipzig.de:9998/server/extraction/en/extract?title=Facebook&revid=&format=trix&extractors=mappings. See here for more options: http://dbpedia.informatik.uni-leipzig.de:9999/server/extraction/.
- [ID service] Last but not least, we offer the Global ID Resolution Service. It ties together all available identifiers for one thing (i.e. at the moment all DBpedia/Wikipedia and Wikidata identifiers – MusicBrainz coming soon…) and shows their stable DBpedia Global ID.
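For developers who want to script against the Reference Extraction Service listed above, the request URL can be assembled as in the following Python sketch. The base URL, the `article` parameter and the `json`/`tsv` formats are taken from the example in this post; the function name is an assumption, and the sketch deliberately sends no request, since a development prototype may not be reachable.

```python
from urllib.parse import urlencode

# Base URL as given in the service description above.
BASE_URL = "http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references"


def reference_url(article, fmt="json"):
    """Build the request URL for one Wikipedia article and an output format."""
    if fmt not in ("json", "tsv"):
        raise ValueError("format must be 'json' or 'tsv'")
    # Keep ':' and '/' unescaped so the article URL stays readable,
    # matching the form used in the example above.
    query = urlencode({"article": article, "format": fmt}, safe=":/")
    return BASE_URL + "?" + query


url = reference_url("https://en.wikipedia.org/wiki/Facebook")
```

The resulting string can be opened in a browser or passed to any HTTP client of your choice.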
Finding sync targets
In order to test out our algorithms, we started by looking at various groups of subjects, our so-called sync targets. Based on the different subjects, a set of problems was identified with varying layers of complexity:
- identity check/check for ambiguity — Are we talking about the same entity?
- fixed vs. varying property — Some properties vary depending on nationality (e.g., release dates), or point in time (e.g., population count).
- reference — Depending on the entity’s identity check and the property’s fixed or varying state the reference might vary. Also, for some targets, no query-able online reference might be available.
- normalization/conversion of values — Depending on language/nationality of the article properties can have varying units (e.g., currency, metric vs imperial system).
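The normalization problem in the last item can be illustrated with a short Python sketch: values are converted to a common unit before they are compared, so that a metric and an imperial figure for the same quantity are not reported as a conflict. The conversion table covers only a few length units, and the function names and the tolerance of one centimetre are assumptions made for this example.

```python
# Conversion factors to metres (exact by definition of foot and inch).
TO_METRES = {"m": 1.0, "cm": 0.01, "ft": 0.3048, "in": 0.0254}


def to_metres(value, unit):
    """Convert a length to metres; raises KeyError for unknown units."""
    return value * TO_METRES[unit]


def same_length(a, b, tolerance=0.01):
    """Compare two (value, unit) pairs after normalizing both to metres."""
    return abs(to_metres(*a) - to_metres(*b)) <= tolerance


# A height of 6 ft 11 in, expressed in metres.
imperial_height = to_metres(6, "ft") + to_metres(11, "in")
matches_metric = same_length((imperial_height, "m"), (2.11, "m"))
```

Currencies are harder than lengths because the conversion factor depends on the point in time, which ties normalization to the "fixed vs. varying property" problem above.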
The check for ambiguity is the most crucial step to ensure that the infoboxes that are being compared do refer to the same entity. We found instances where the Wikipedia page and the infobox shown on that page were presenting information about different subjects (e.g., see here).
As a good sync target to start with, the group ‘NBA players’ was identified. There are no ambiguity issues, it is a clearly defined group of persons, and the number of varying properties is very limited. Information seems to be derived from mainly two web sites (nba.com and basketball-reference.com) and normalization is only a minor issue. ‘Video games’ also proved to be an easy sync target, with the main problem being varying properties such as different release dates for different platforms (Microsoft Windows, Linux, MacOS X, XBox) and different regions (NA vs EU).
More difficult topics, such as ‘cars’, ‘music albums’, and ‘music singles’ showed more potential for ambiguity as well as property variability. A major concern we found was Wikipedia pages that contain multiple infoboxes (often seen for pages referring to a certain type of car, such as this one). Reference and fact extraction can be done for each infobox, but currently, we run into trouble once we fuse this data.
Further information about sync targets and their challenges can be found on our Meta-Wiki discussion page, where Wikipedians that deal with infoboxes on a regular basis can also share their insights on the matter. Some issues were also found regarding the mapping of properties. In order to make GlobalFactSync as applicable as possible, we rely on the DBpedia community to help us improve the mappings. If you are interested in participating, we will connect with you at http://mappings.dbpedia.org and in the DBpedia forum.
Bottomline – We value your feedback
Your DBpedia Association
The post Global Fact Sync – Synchronizing Wikidata & Wikipedia’s infoboxes appeared first on DBpedia Blog.
Thursday, July 18, 2019 - 11:39am
With timbr, WPSemantix and the DBpedia Association launch the first SQL Semantic Knowledge Graph that integrates Wikipedia and Wikidata Knowledge into SQL engines.
In part three of DBpedia’s growth hack blog series, we feature timbr, the latest development at DBpedia in collaboration with WPSemantix. Read on to find out how it works.
timbr – DBpedia SQL Semantic Knowledge Platform
Tel Aviv, Israel and Leipzig, Germany – July 18, 2019 – WP-Semantix (WPS) – the “SQL Knowledge Graph Company” and DBpedia Association – Institut für Angewandte Informatik e.V., announced today the launch of the timbr-DBpedia SQL Semantic Knowledge Platform, a unique version of WPS’ timbr SQL Semantic Knowledge Graph that integrates timbr-DBpedia ontology, timbr’s ontology explorer/visualizer and timbr’s SQL query service, to provide for the first time semantic access to DBpedia knowledge in SQL and to thus facilitate DBpedia knowledge integration into standard data warehouses and data lakes.
DBpedia is the crowd-sourced community effort to extract structured content from the information created in various Wikimedia projects and publish these as files on the Databus and via online databases. This structured information resembles an open knowledge graph which has been available for everyone on the Web for over a decade. Knowledge graphs are a new kind of database developed to store knowledge in a machine-readable form, organized as connected, relationship-rich data. After the publication of DBpedia (in parallel to Freebase) 12 years ago, knowledge graphs have become very successful and Google uses a similar approach to create the knowledge cards displayed in search results.
Query the world’s knowledge in standard SQL
Amit Weitzner, founder and CEO at WPS commented: “Knowledge graphs use specialized languages, require resource-intensive, dedicated infrastructure and require costly ETL operations. That is, they did until timbr came along. timbr employs SQL – the most widely known database language, to eliminate the technological barriers to entry for using knowledge graphs and to implement Semantic Web principles to provide knowledge graph functionality in SQL. timbr enables modelling of data as connected, context-enriched concepts with inference and graph traversal capabilities while being queryable in standard SQL, to represent knowledge in data warehouses and data lakes. timbr-DBpedia is our first vertical application and we are very excited by the prospects of our cooperation with the DBpedia team to enable the largest user base to query the world’s knowledge in standard SQL.”
Sebastian Hellmann, executive director of the DBpedia Association commented that:
timbr will help to explore the power of semantic technologies
Prof. James Hendler, pioneer and a world-leading authority in Semantic Web technologies and WPS’ advisory board member, commented: “timbr can be a game-changing solution by enabling the semantic inference capabilities needed in many modelling applications to be done in SQL. This approach will enable many users to get the advantages of semantic AI technologies and data integration without the learning curve of many current systems. By giving more people access to the semantic version of Wikipedia, timbr-DBpedia will definitely contribute to allowing the majority of the market to explore the power of semantic technologies.”
timbr-DBpedia is available as a query service or licensed for use as SaaS or on-premises. See the DBpedia website: wiki.dbpedia.org/timbr.
WP-Semantix Ltd. (wpsemantix.com) is the developer of the timbr SQL semantic knowledge platform, a dynamic abstraction layer over relational and non-relational data, facilitating declaration and powerful exploration of semantically rich ontologies using a standard SQL query interface. timbr is natively accessible in Apache Spark, Python, R and SQL to empower data scientists to perform complex analytics and generate sophisticated ML algorithms. Its JDBC interface provides seamless integration with the most popular business intelligence solutions to make complex analytics accessible to analysts and domain experts across the organization.
WP-Semantix, timbr, “SQL Knowledge Graph”, “SQL Semantic Knowledge Graph” and associated marks and trademarks are registered trademarks of WP Semantix Ltd.
DBpedia is looking forward to this cooperation. Follow us on Twitter for the latest information and stay tuned for part four of our growth hack series. The next post features the GlobalFactSyncRe. Curious? You have to be a little more patient and wait till Thursday, July 25th.
Yours DBpedia Association
The post timbr – the DBpedia SQL Semantic Knowledge Platform appeared first on DBpedia Blog.