An extractor is a mapping from a page node to a graph of statements about it. All relevant classes are located in the org.dbpedia.extraction.extractors package.



1 Overview


2 Available Extractors

2.1 LabelExtractor

Extracts labels for articles based on their titles.
Supported languages: All languages

2.2 MappingExtractor

Extracts structured data based on hand-generated mappings of Wikipedia infoboxes to the DBpedia ontology. Mappings can be edited via the Mappings Wiki.
Supported languages: All languages for which mappings are available.

2.3 InfoboxExtractor

This extractor extracts all properties from all infoboxes. Extracted information is represented using properties in the http://dbpedia.org/property/ namespace. The names of these properties directly reflect the names of the Wikipedia infobox properties. Property names are not cleaned or merged. Property types are not part of a subsumption hierarchy and there is no consistent ontology for the infobox dataset. The infobox extractor performs only a minimal amount of property value clean-up, e.g., by converting a value like “June 2009” to the XML Schema format “2009-06”. You should therefore use the infobox dataset only if your application requires complete coverage of all Wikipedia properties and you are prepared to accept relatively noisy data.
Supported languages: All languages
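
The kind of value clean-up mentioned above can be illustrated with a small sketch. It is written in Python purely for illustration and is not the framework's actual code; the function name normalize_year_month is hypothetical, and the sketch assumes English month names. It converts a value such as “June 2009” to the xsd:gYearMonth form and leaves anything it does not recognise untouched, which is one plausible reading of "minimal" clean-up.

```python
from datetime import datetime


def normalize_year_month(raw: str) -> str:
    """Convert 'June 2009' style values to the 'YYYY-MM' form.

    Hypothetical helper, not part of the DBpedia framework. Values that
    do not match the '<English month name> <year>' pattern are returned
    unchanged, so noisy input stays noisy rather than being guessed at.
    """
    try:
        parsed = datetime.strptime(raw.strip(), "%B %Y")
    except ValueError:
        return raw
    return f"{parsed.year:04d}-{parsed.month:02d}"


cleaned = normalize_year_month("June 2009")
untouched = normalize_year_month("around 2009")
```

Here cleaned holds the normalised value, while untouched is returned exactly as it was passed in.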

2.4 WikiPageExtractor

Extracts links to the corresponding articles in Wikipedia.
Supported languages: All languages

2.5 PageLinksExtractor

Extracts internal links between DBpedia instances from the internal page links between Wikipedia articles. The page links can be useful for structural analysis, data mining, or ranking DBpedia instances using PageRank or similar algorithms.
Supported languages: All languages
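
As an illustration of the ranking use case mentioned above, the following is a minimal PageRank sketch over a page-link graph. It is written in Python for illustration only and is not part of the extraction framework; the function name page_rank, the dictionary input format, the damping factor of 0.85, and the example instance names are all assumptions made for this example.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Rank pages by power iteration.

    links maps each page to the list of pages it links to (an assumed
    in-memory form of the page-links dataset). Returns a dict of scores
    that sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Toy graph: both Berlin and Hamburg link to Germany, which links back to Berlin.
example_links = {
    "Berlin": ["Germany"],
    "Hamburg": ["Germany"],
    "Germany": ["Berlin"],
}
scores = page_rank(example_links)
```

In this toy graph, "Germany" receives the highest score because two pages link to it, and "Hamburg" the lowest because nothing links to it.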

2.6 GeoExtractor

Extracts geographic coordinates.
Supported languages: All languages

2.7 ArticleCategoriesExtractor

Extracts links from concepts to categories using the SKOS vocabulary.
Supported languages: en

2.8 CategoryLabelExtractor

Extracts labels for categories.
Supported languages: en

2.9 ImageExtractor

Extracts the first image of a Wikipedia page and provides both a thumbnail constructed from it and the full-size image.
Supported languages: en

2.10 ExternalLinksExtractor

Extracts links to external web pages.
Supported languages: All languages

2.11 HomepageExtractor

Extracts links to the official homepage of an instance.
Supported languages: en, de, fr

2.12 DisambiguationExtractor

Extracts disambiguation links.
Supported languages: All languages

2.13 PersondataExtractor

Extracts information about persons (date and place of birth etc.) from the English and German Wikipedia, represented using the FOAF vocabulary.
Supported languages: en, de

2.14 PndExtractor

Extracts PND (Personennamendatei) data about a person. PND is published by the German National Library. For each person there is a record with the person's name, birth and occupation, connected with a unique identifier, the PND number.
Supported languages: en, de

2.15 SkosCategoriesExtractor

Extracts information about which concept is a category and how categories are related, using the SKOS vocabulary.
Supported languages: en

2.16 RedirectExtractor

Extracts redirect links between articles in Wikipedia.
Supported languages: All languages

3 Using an Extractor

Because an Extractor is a first-class function, using one is simple: call it with the page node.
Because all extractors are thread-safe, they can be called from multiple threads without further synchronization.
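
The calling pattern described above can be sketched as follows. This is a Python model for illustration only, not the framework's API: PageNode here is a hypothetical stand-in with just a title, label_extractor is a hypothetical extractor, and the graph is modelled as a plain list of (subject, predicate, object) tuples. The point is only that an extractor is called like a function and, holding no shared mutable state, can be handed to a thread pool as-is.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


@dataclass(frozen=True)
class PageNode:
    """Hypothetical stand-in for the real page node; only carries a title."""
    title: str


def label_extractor(page):
    """Hypothetical extractor: maps a page node to a list of statements."""
    # Assumption: the subject URI is derived from the page title.
    subject = "http://dbpedia.org/resource/" + page.title.replace(" ", "_")
    return [(subject, RDFS_LABEL, page.title)]


pages = [PageNode("Berlin"), PageNode("New York City")]

# The extractor keeps no state between calls, so no locking is needed.
with ThreadPoolExecutor(max_workers=4) as pool:
    graphs = list(pool.map(label_extractor, pages))
```

pool.map preserves input order, so graphs[0] holds the statements for the first page regardless of which thread processed it.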

4 Implementing new Extractors

In order to implement a new extractor, all that is needed is to inherit from the Extractor class and to implement the extract method, which takes three arguments:

  • page : PageNode : The page node is the root of the Abstract Syntax Tree (AST) that represents the current MediaWiki page.
  • subjectUri : String : The URI of the instance that is currently being extracted.
  • context : PageContext : The page context holds the mutable state of the current page extraction. Among other things, it can be used to generate URIs.

The extracted statements are returned as a Graph from the extract method.
Note that each Extractor must be thread-safe.
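
The shape of such an extractor can be sketched as follows. This is a Python model for illustration only and does not reproduce the framework's real classes: the PageNode, PageContext, Graph and Extractor definitions are simplified stand-ins, and generate_uri, the URI scheme, and TitleLabelExtractor are hypothetical. The sketch shows the three-argument extract method and one way to satisfy the thread-safety requirement: the extractor itself holds no mutable fields, and all per-page state lives in the context.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


@dataclass(frozen=True)
class PageNode:
    """Stand-in for the AST root of a page; only the title is modelled."""
    title: str


@dataclass(frozen=True)
class Graph:
    """Stand-in for the returned graph: an immutable tuple of triples."""
    statements: tuple = ()


class PageContext:
    """Stand-in for the per-page mutable state."""

    def __init__(self):
        self.counter = 0

    def generate_uri(self, base_uri):
        # Hypothetical URI scheme: append a per-page running number.
        self.counter += 1
        return f"{base_uri}__{self.counter}"


class Extractor(ABC):
    """Stand-in base class: callable with a page, delegates to extract."""

    def __call__(self, page):
        # Assumption: the subject URI is derived from the page title, and
        # each call gets a fresh context so threads never share state.
        subject_uri = "http://dbpedia.org/resource/" + page.title.replace(" ", "_")
        return self.extract(page, subject_uri, PageContext())

    @abstractmethod
    def extract(self, page, subject_uri, context):
        """Return a Graph of statements about the given page."""


class TitleLabelExtractor(Extractor):
    """Hypothetical extractor that emits one label statement per page."""

    def extract(self, page, subject_uri, context):
        return Graph(((subject_uri, RDFS_LABEL, page.title),))


graph = TitleLabelExtractor()(PageNode("Berlin"))
```

Because TitleLabelExtractor defines no instance fields, a single instance can be shared by any number of threads.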

5 Related Extractors

Other projects may reuse and/or extend extractors from DBpedia. For example, the DBpedia Spotlight extraction pipeline contains extractors for mentions of DBpedia resources within Wikipedia paragraphs. For more information, see the Data Generation page for the project.