Step 1. Add the JitPack repository to your build file
Add it to your root settings.gradle at the end of the repositories block:
dependencyResolutionManagement {
    repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
    repositories {
        mavenCentral()
        maven { url 'https://jitpack.io' }
    }
}
Or, if you use the Kotlin DSL, add it to your settings.gradle.kts at the end of the repositories block:
dependencyResolutionManagement {
    repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
    repositories {
        mavenCentral()
        maven { url = uri("https://jitpack.io") }
    }
}
For Maven, add it to your pom.xml:
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
For sbt, add it to your build.sbt at the end of the resolvers:
resolvers += "jitpack" at "https://jitpack.io"
For Leiningen, add it to your project.clj at the end of the repositories:
:repositories [["jitpack" "https://jitpack.io"]]
Step 2. Add the dependency
dependencies {
    implementation 'com.github.danneu:reddit:0.0.2'
}
dependencies {
    implementation("com.github.danneu:reddit:0.0.2")
}
<dependency>
    <groupId>com.github.danneu</groupId>
    <artifactId>reddit</artifactId>
    <version>0.0.2</version>
</dependency>
libraryDependencies += "com.github.danneu" % "reddit" % "0.0.2"
:dependencies [[com.github.danneu/reddit "0.0.2"]]
A Reddit crawler implemented in Kotlin for consuming submissions and comments with a flattened iterator.

The crawler's methods return Iterator<Submission> or Iterator<Comment>, which makes it useful for consuming the entire set of submissions and comments for a subreddit. Internally, the iterator lazily unrolls pagination, including "Continue Thread" and "Load More" nodes in comment trees, as you iterate.

By default, an ApiClient instance never sends more than one request per second, to comply with Reddit's API requirements.
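For a quick taste, here is a minimal sketch (commentsOf and the client's one-request-per-second default are covered below; asSequence() and take() are plain Kotlin stdlib). Because pagination unrolls lazily, stopping after five comments only fetches the first page or so of the listing:

import com.danneu.reddit.ApiClient

fun main(args: Array<String>) {
    val client = ApiClient() // throttled to one request per second by default
    // take(5) ends iteration early, so only the first page of the listing
    // is requested rather than the subreddit's entire comment history
    client.commentsOf("futurology").asSequence().take(5).forEach { comment ->
        println(comment)
    }
}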
repositories {
    maven { url "https://jitpack.io" }
}

dependencies {
    // Always get the latest commit:
    implementation "com.danneu:reddit:master-SNAPSHOT"
    // Or get a specific GitHub release:
    implementation "com.danneu:reddit:0.0.1"
}
You can see all releases here: https://github.com/danneu/reddit/releases
Reddit's API only returns the latest 1,000 comments, but because requests hit the CDN cache, this endpoint can paginate through 10,000+ comments. Due to that cache, though, comments become sparser the farther back you paginate, e.g. fewer comments per hour.
import com.danneu.reddit.ApiClient

fun main(args: Array<String>) {
    val client = ApiClient()
    client.recentComments().forEach { comment ->
        println(comment)
    }
}
However, if you run this 24/7 in a loop, you can consume all Reddit comments as they are published.
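For example, a minimal polling sketch. Note that the id() accessor on Comment is an assumption for illustration, not a documented method; substitute whatever unique identifier the library actually exposes:

import com.danneu.reddit.ApiClient

fun main(args: Array<String>) {
    val client = ApiClient()
    // Remember ids we've already handled so overlapping passes don't print duplicates
    val seen = HashSet<String>()
    while (true) {
        client.recentComments().forEach { comment ->
            // comment.id() is assumed here; the client's built-in throttle
            // keeps this loop within Reddit's rate limits
            if (seen.add(comment.id())) {
                println(comment)
            }
        }
    }
}

In a real crawler you would cap or persist seen, since it grows without bound in this sketch.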
You'll need to use the following subreddit-scoped methods to consume historical comments.
import com.danneu.reddit.ApiClient

fun main(args: Array<String>) {
    val client = ApiClient()
    client.submissionsOf("futurology").forEach { submission ->
        println(submission)
    }
}
Iterates comments depth-first.
import com.danneu.reddit.ApiClient

fun main(args: Array<String>) {
    val client = ApiClient()
    client.commentsOf("futurology").forEach { comment ->
        println(comment)
    }
}
Or:
import com.danneu.reddit.ApiClient

fun main(args: Array<String>) {
    val client = ApiClient()
    client.submissionsOf("futurology").forEach { submission ->
        println("Now crawling: ${submission.url()}")
        client.commentsOf(submission).forEach { comment ->
            println(comment)
        }
    }
}
ApiClient#fork() returns a new ApiClient with the same configuration as the original. Pass in a builder lambda to customize the copy.
import com.danneu.reddit.ApiClient
import java.net.InetSocketAddress
import java.net.Proxy
import java.time.Duration

fun main(args: Array<String>) {
    val client1 = ApiClient {
        proxy = Proxy(Proxy.Type.HTTP, InetSocketAddress("1.2.3.4", 8080))
    }
    val client2 = client1.fork {
        throttle = Duration.ofSeconds(2)
    }
}
client2 uses the same proxy as client1, but it waits two seconds between requests instead of the default one second.
import com.danneu.reddit.ApiClient
import com.danneu.reddit.HasContent
import java.net.URI

// Extend submissions and comments with a method that returns only the absolute URLs
fun HasContent.absoluteUrls(): List<URI> = urls().filter { it.isAbsolute }

fun processUrl(uri: URI) = println("found absolute url: $uri")

fun main(args: Array<String>) {
    val client = ApiClient()
    client.submissionsOf("futurology").forEach { submission ->
        // Process URLs in the submission body
        submission.absoluteUrls().forEach(::processUrl)
        client.commentsOf(submission).forEach { comment ->
            // Process URLs in each comment body
            comment.absoluteUrls().forEach(::processUrl)
        }
    }
}
Even better, in Kotlin we can rewrite the previous example with a client extension, #urlsOf(subreddit), that returns a lazy sequence of the URLs found in submission and comment bodies.
import com.danneu.reddit.ApiClient
import java.net.URI

fun ApiClient.urlsOf(subreddit: String): Sequence<URI> {
    return submissionsOf(subreddit).asSequence().flatMap { submission ->
        submission.urls().asSequence().plus(
            commentsOf(submission).asSequence().flatMap { comment ->
                comment.urls().asSequence()
            }
        )
    }
}

fun main(args: Array<String>) {
    val client = ApiClient()
    client.urlsOf("futurology").filter { it.isAbsolute }.forEach { url ->
        println("found absolute url: $url")
    }
}