Step 1. Add the JitPack repository to your build file
Add it in your root settings.gradle at the end of repositories:
dependencyResolutionManagement {
repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
repositories {
mavenCentral()
maven { url 'https://jitpack.io' }
}
}
Add it in your settings.gradle.kts at the end of repositories:
dependencyResolutionManagement {
repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
repositories {
mavenCentral()
maven { url = uri("https://jitpack.io") }
}
}
Add it to your pom.xml:
<repositories>
<repository>
<id>jitpack.io</id>
<url>https://jitpack.io</url>
</repository>
</repositories>
Add it in your build.sbt at the end of resolvers:
resolvers += "jitpack" at "https://jitpack.io"
Add it in your project.clj at the end of repositories:
:repositories [["jitpack" "https://jitpack.io"]]
Step 2. Add the dependency
In build.gradle:
dependencies {
implementation 'com.github.diegoceccarelli:json-wikipedia:2.0.2-SNAPSHOT'
}
In build.gradle.kts:
dependencies {
implementation("com.github.diegoceccarelli:json-wikipedia:2.0.2-SNAPSHOT")
}
In pom.xml:
<dependency>
<groupId>com.github.diegoceccarelli</groupId>
<artifactId>json-wikipedia</artifactId>
<version>2.0.2-SNAPSHOT</version>
</dependency>
In build.sbt:
libraryDependencies += "com.github.diegoceccarelli" % "json-wikipedia" % "2.0.2-SNAPSHOT"
In project.clj:
:dependencies [[com.github.diegoceccarelli/json-wikipedia "2.0.2-SNAPSHOT"]]
Json Wikipedia contains code to convert the Wikipedia XML dump into a JSON or Avro dump.
Note that this tool does not work with the multistream dump.
Compile the project by running:
mvn package
The command produces a JAR file containing all the dependencies in the target folder.
You can convert the Wikipedia dump to JSON format by running the commands:
java -jar target/json-wikipedia-*.jar -input wikipedia-dump.xml.bz -output wikipedia-dump.json[.gz] -lang [en|it]
or
./scripts/convert-xml-dump-to-json.sh [en|it] wikipedia-dump.xml.bz wikipedia-dump.json[.gz]
Or to Apache Avro:
java -jar target/json-wikipedia-*.jar -input wikipedia-dump.xml.bz -output wikipedia-dump.avro -lang [en|it]
or
./scripts/convert-xml-dump-to-json.sh [en|it] wikipedia-dump.xml.bz wikipedia-dump.avro
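If the conversion has to be started from Java code rather than from a shell, the same invocation can be assembled as an argument list. The following is only a minimal sketch: the class ConvertCommand and the JAR path are hypothetical (use the file that mvn package actually produced), and only the -input, -output and -lang flags shown above are used.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class ConvertCommand {
    // Assembles "java -jar <jar> -input <in> -output <out> -lang <lang>".
    static List<String> build(String jarPath, String input, String output, String lang) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("-input");
        cmd.add(input);
        cmd.add("-output");
        cmd.add(output);
        cmd.add("-lang");
        cmd.add(lang);
        return cmd;
    }

    // Runs the command, forwarding its output to this process, and returns the exit code.
    static int run(List<String> cmd) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Hypothetical JAR path: replace it with the real file under target/.
        List<String> cmd = build("target/json-wikipedia.jar",
                "wikipedia-dump.xml.bz", "wikipedia-dump.json.gz", "en");
        // Only print the command here; call run(cmd) to actually start the conversion.
        System.out.println(String.join(" ", cmd));
    }
}
```

Changing the output file name to wikipedia-dump.avro in the call to build gives the Avro variant of the command.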
Both commands produce a file containing one record per article. In the JSON format, each line of the file contains one article of the dump encoded in JSON. Each record can be deserialized into an Article object, which represents an enriched version of the wikitext page. The Article object contains:
Once you have created (or downloaded) the JSON dump (say wikipedia.json), you can iterate over the articles of the collection using this snippet:
RecordReader<Article> reader = new RecordReader<Article>(
"wikipedia.json", new JsonRecordParser<Article>(Article.class)
);
for (Article a : reader) {
// do what you want with your articles
}
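For a quick sanity check of a JSON dump that does not need any project classes, the one-record-per-line layout can be read with the JDK alone. This is a rough sketch, not the project's reader: the class DumpLines is hypothetical, and it assumes that a file name ending in .gz means the dump is gzip-compressed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

final class DumpLines {
    // Opens the dump as UTF-8 text, decompressing it when the name ends in .gz (assumption).
    static BufferedReader open(Path dump) throws IOException {
        InputStream in = Files.newInputStream(dump);
        if (dump.getFileName().toString().endsWith(".gz")) {
            in = new GZIPInputStream(in);
        }
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    // Counts the non-empty lines of the dump, that is, the number of article records.
    static long countRecords(Path dump) throws IOException {
        try (BufferedReader reader = open(dump)) {
            return reader.lines().filter(line -> !line.trim().isEmpty()).count();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: DumpLines <dump.json[.gz]>");
            return;
        }
        System.out.println(countRecords(Paths.get(args[0])) + " records");
    }
}
```

The sketch does not parse the JSON of each line; to obtain typed Article objects, the RecordReader, JsonRecordParser and Article classes from the snippet above are still required.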
To use these classes, install json-wikipedia in your local Maven repository:
mvn install
and import it into your new Maven project by adding the dependency:
<dependency>
<groupId>it.cnr.isti.hpc</groupId>
<artifactId>json-wikipedia</artifactId>
<version>2.0.0-SNAPSHOT</version>
</dependency>
The full schema of a record is encoded in Avro.