umjammer/vavi-util-screenscraping


✂ Screen scraping utility for Java using annotations

Download


Step 1. Add the JitPack repository to your build file

Add it in your root settings.gradle at the end of repositories:

	dependencyResolutionManagement {
		repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
		repositories {
			mavenCentral()
			maven { url 'https://jitpack.io' }
		}
	}

Add it in your settings.gradle.kts at the end of repositories:

	dependencyResolutionManagement {
		repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
		repositories {
			mavenCentral()
			maven { url = uri("https://jitpack.io") }
		}
	}

Add it to your pom.xml:

	<repositories>
		<repository>
		    <id>jitpack.io</id>
		    <url>https://jitpack.io</url>
		</repository>
	</repositories>

Add it in your build.sbt at the end of resolvers:

 
    resolvers += "jitpack" at "https://jitpack.io"

Add it in your project.clj at the end of repositories:

 
    :repositories [["jitpack" "https://jitpack.io"]]

Step 2. Add the dependency

Gradle (Groovy DSL):

	dependencies {
		implementation 'com.github.umjammer:vavi-util-screenscraping:1.0.16'
	}

Gradle (Kotlin DSL):

	dependencies {
		implementation("com.github.umjammer:vavi-util-screenscraping:1.0.16")
	}

Maven (pom.xml):

	<dependency>
	    <groupId>com.github.umjammer</groupId>
	    <artifactId>vavi-util-screenscraping</artifactId>
	    <version>1.0.16</version>
	</dependency>

sbt (build.sbt):

    libraryDependencies += "com.github.umjammer" % "vavi-util-screenscraping" % "1.0.16"

Leiningen (project.clj):

    :dependencies [[com.github.umjammer/vavi-util-screenscraping "1.0.16"]]

Readme



Screen Scraping Library for Java

🌏 Scrape the world!

Install

Usage

This library scrapes data from HTML and injects it into POJOs using annotations.

    @WebScraper(url = "http://foo.com/bar.html")
    public class Baz {
        @Target(value = "//TABLE//TR/TD[2]/DIV/text()")
        String artist;
        @Target(value = "//TABLE//TR/TD[4]/A/text()")
        String title;
        @Target(value = "//TABLE//TR/TD[4]/A/@href")
        String url;
    }
    
    :
    
    List<Baz> bazs = WebScraper.Util.scrape(Baz.class);
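
To make the idea of annotation-driven injection more concrete, here is a minimal, self-contained sketch using only the JDK. It is not this library's implementation: `XPathField`, `Song`, and `XPathInjectionSketch` are hypothetical names invented for illustration, and the input is a well-formed XML string rather than a fetched HTML page. It only shows how an XPath expression attached to a field could be evaluated per table row and written into that field by reflection.

```java
import java.io.StringReader;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical stand-in for a field-level XPath annotation (not the library's @Target). */
@Retention(RetentionPolicy.RUNTIME)
@java.lang.annotation.Target(ElementType.FIELD)
@interface XPathField {
    /** XPath expression evaluated relative to one row node. */
    String value();
}

/** Example POJO; each annotated field receives the string value of its XPath. */
class Song {
    @XPathField("TD[1]/text()")
    String artist;
    @XPathField("TD[2]/A/text()")
    String title;
    @XPathField("TD[2]/A/@href")
    String url;
}

class XPathInjectionSketch {

    /** Creates one instance of {@code type} per node matched by {@code rowXPath}. */
    static <T> List<T> scrape(Class<T> type, String xml, String rowXPath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList rows = (NodeList) xpath.evaluate(rowXPath, doc, XPathConstants.NODESET);

        Constructor<T> constructor = type.getDeclaredConstructor();
        constructor.setAccessible(true);

        List<T> results = new ArrayList<>();
        for (int i = 0; i < rows.getLength(); i++) {
            Node row = rows.item(i);
            T bean = constructor.newInstance();
            for (Field field : type.getDeclaredFields()) {
                XPathField annotation = field.getAnnotation(XPathField.class);
                if (annotation == null) {
                    continue; // fields without the annotation are left untouched
                }
                field.setAccessible(true);
                // evaluate(String, Object) returns the string value of the first match
                field.set(bean, xpath.evaluate(annotation.value(), row));
            }
            results.add(bean);
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<TABLE>"
                + "<TR><TD>Artist A</TD><TD><A href=\"/a\">Title A</A></TD></TR>"
                + "<TR><TD>Artist B</TD><TD><A href=\"/b\">Title B</A></TD></TR>"
                + "</TABLE>";
        for (Song song : scrape(Song.class, xml, "//TABLE/TR")) {
            System.out.println(song.artist + " / " + song.title + " / " + song.url);
        }
    }
}
```

The sketch takes the row XPath as an explicit argument and keeps field expressions relative to each row; how the library itself splits a page into records is not shown in this README.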

Details

  • InputHandler ... applies any processing to the input before parsing

  • Parser

    • XPathParser ... the default
    • HtmlXPathParser ... for the original purpose (HTML scraping)
    • SaxonXPathParser ... for huge XML files
    • JsonPathParser ... for JSON responses
  • Parser#foreach() ... processes results like a Java collection stream
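
The roles listed above (pre-processing the input, then handing each parsed item to a callback) can be sketched with plain JDK classes. This is only an illustration under stated assumptions: `PipelineSketch` and `forEachText` are hypothetical names, the input handler is modelled as a simple `UnaryOperator<String>`, and none of it reflects the actual `InputHandler` or `Parser#foreach()` signatures.

```java
import java.io.StringReader;
import java.util.function.Consumer;
import java.util.function.UnaryOperator;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PipelineSketch {

    /**
     * Runs a hypothetical input handler over the raw text, parses the result as XML,
     * and passes the text content of every node matched by xpathExpr to the action,
     * one item at a time, without building a result list.
     */
    static void forEachText(String raw,
                            UnaryOperator<String> inputHandler,
                            String xpathExpr,
                            Consumer<String> action) throws Exception {
        String prepared = inputHandler.apply(raw); // pre-processing step before parsing
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(prepared)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(xpathExpr, doc, XPathConstants.NODESET);
        for (int i = 0; i < nodes.getLength(); i++) {
            action.accept(nodes.item(i).getTextContent());
        }
    }

    public static void main(String[] args) throws Exception {
        // The handler here strips anything before the first '<', as an example of cleanup.
        String raw = "junk<r><i>a</i><i>b</i></r>";
        forEachText(raw, s -> s.substring(s.indexOf('<')), "//i", System.out::println);
    }
}
```

Passing a callback instead of returning a list keeps memory use flat when the number of matched items is large, which is presumably the motivation for a stream-like API.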

Sample

References

  • https://www2.jasrac.or.jp/eJwid/

TODO

  • ~~Tidy version~~
  • ~~deleted garbled text~~
  • InputHandler w/o cache
  • ~~argument injection into WebScraper#url~~

        @WebScraper(url = "http://foo.com?bar={bar}")
        public static class Result {
            :
    
        List<Result> data = WebScraper.Util.scrape(Result.class, @UrlParam(bar) args[0]);
    
  • ~~json parser~~
  • ~~css selector~~
    • ~~https://github.com/jhy/jsoup~~ -> serdes
  • integrate serdes
  • @WebScraper#encoding()
  • @Target: add an exception handler, or second and third fallback options
  • ~~xml2xpath~~
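
For the URL argument injection item in the list above, the `{bar}` placeholder style can be illustrated with a small standalone sketch. The class `UrlTemplateSketch` and its `expand` method are hypothetical and are not part of this library; the sketch assumes placeholders are word characters in braces and that values should be form-encoded as UTF-8.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlTemplateSketch {

    /** Matches placeholders such as {bar}. */
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    /** Replaces each {name} in the template with the URL-encoded value from params. */
    static String expand(String template, Map<String, String> params) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuilder result = new StringBuilder();
        while (matcher.find()) {
            String name = matcher.group(1);
            String value = params.get(name);
            if (value == null) {
                throw new IllegalArgumentException("missing parameter: " + name);
            }
            String encoded = URLEncoder.encode(value, StandardCharsets.UTF_8);
            // quoteReplacement guards against '$' and '\' in the encoded value
            matcher.appendReplacement(result, Matcher.quoteReplacement(encoded));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("http://foo.com?bar={bar}", Map.of("bar", "a b")));
    }
}
```

Failing fast on a missing parameter is a design choice of this sketch; leaving the placeholder untouched would be an equally plausible behavior.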