Annotation-based HTML to Object Mapper using JSoup Parser

I recently worked on a project that required crawling a website and extracting information from it. After looking at open source Java HTML parsers, we settled on JSoup, a library that provides jQuery-like selectors for extracting data from an HTML source.

JSoup is awesome, but it left us with a lot of boilerplate code for parsing different HTML pages. To avoid that verbosity, I tried playing around with annotations. The idea is to use annotations to map an HTML source to a Java object (sort of like JAXB). This blog post walks through the basic code I came up with. (Please note that I used Spring, so there are some Spring APIs in the code.)

In this implementation, the annotations target the setter methods of the Java object’s fields.

The first annotation is @Selector. It stores the CSS selector used to find the HTML element whose value will be passed to the annotated setter. The selector goes in the value parameter.

@Target({ ElementType.METHOD })
@Retention(RetentionPolicy.RUNTIME)
public @interface Selector {
    String value();
}

@Selector must be paired with one of the following annotations, which determines how the value is extracted from the selected element:

  • @TextValue – retrieve the text within the element (remove all HTML tags within the element)
  • @HtmlValue – retrieve the HTML within the element
  • @AttributeValue – retrieve the value from an attribute in the element. The name of the attribute can be specified in the name parameter.

@Target({ ElementType.METHOD })
@Retention(RetentionPolicy.RUNTIME)
public @interface TextValue {
}

@Target({ ElementType.METHOD })
@Retention(RetentionPolicy.RUNTIME)
public @interface HtmlValue {
}

@Target({ ElementType.METHOD })
@Retention(RetentionPolicy.RUNTIME)
public @interface AttributeValue {
    String name();
}
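
To make the annotations concrete, here is a minimal sketch of how a bean might use them. The Article class, its fields and the CSS selectors are made up for illustration and are not part of the original project. The annotations are repeated as nested types only so that the snippet compiles on its own.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

class ArticleBeanDemo {

    // The annotations from this post, repeated so the sketch is self-contained.
    @Target({ ElementType.METHOD })
    @Retention(RetentionPolicy.RUNTIME)
    @interface Selector { String value(); }

    @Target({ ElementType.METHOD })
    @Retention(RetentionPolicy.RUNTIME)
    @interface TextValue { }

    @Target({ ElementType.METHOD })
    @Retention(RetentionPolicy.RUNTIME)
    @interface AttributeValue { String name(); }

    // Hypothetical bean: class name, fields and selectors are illustrative only.
    public static class Article {
        private String title;
        private String link;

        @Selector("h1.title")
        @TextValue
        public void setTitle(String title) { this.title = title; }

        @Selector("a.permalink")
        @AttributeValue(name = "href")
        public void setLink(String link) { this.link = link; }

        public String getTitle() { return title; }
        public String getLink() { return link; }
    }

    // Returns the selector declared on the named single-String setter, or null if it has none.
    static String selectorOf(String setterName) {
        try {
            Method m = Article.class.getMethod(setterName, String.class);
            Selector s = m.getAnnotation(Selector.class);
            return s == null ? null : s.value();
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No such setter: " + setterName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(selectorOf("setTitle"));
        System.out.println(selectorOf("setLink"));
    }
}
```

Because the annotations use RetentionPolicy.RUNTIME, their values stay available through reflection, which is what the parser relies on.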

The HTML parser just needs to read annotations from a Java bean’s methods and retrieve the different annotations above. When a @Selector is present in a method, the value of the @Selector will be used to retrieve the element. @TextValue, @HtmlValue or @AttributeValue will then be used to get the data from the element.

import java.io.InputStream;
import java.lang.reflect.Method;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.core.convert.ConversionService;
import org.springframework.core.convert.support.DefaultConversionService;
import com.google.common.base.Preconditions;

public class JSoupHtmlParser<T> implements HtmlParser<T> {

    // host of the website that will be crawled
    private final static String HOST = "localhost:8080/sample";

    private final Class<T> classModel;

    // Pass in the class Java bean that will contain the mapped data from the HTML source
    public JSoupHtmlParser(final Class<T> classModel) {
        this.classModel = classModel;
    }

    // Main method that will translate HTML to object
    public T parse(final InputStream is) throws HtmlParserException {
        try {
            final Document doc = Jsoup.parse(is, "UTF-8", HOST);
            T model = this.classModel.newInstance();

            for (Method m : this.classModel.getMethods()) {
                String value = null;
                // check if Selector annotation is present in any of the methods
                if (m.isAnnotationPresent(Selector.class)) {
                    value = parseValue(doc, m);
                }

                if (value != null) {
                    m.invoke(model, convertValue(value, m));
                }
            }

            return model;
        } catch (Exception e) {
            // Rethrow instead of swallowing the error: the method must return or throw on every path.
            // Assumes HtmlParserException has a (String, Throwable) constructor.
            throw new HtmlParserException("Failed to map HTML to " + this.classModel.getName(), e);
        }
    }

    // Use Spring's ConversionService to convert the selected value from String to the type of the parameter in the setter method
    private static final ConversionService conversion = new DefaultConversionService();

    private Object convertValue(final String value, final Method m) {
        Preconditions.checkArgument(m.getParameterTypes().length > 0);

        // Only set the first parameter
        return conversion.convert(value, m.getParameterTypes()[0]);
    }

    private String parseValue(final Document doc, final Method m) {
        final String selector = m.getAnnotation(Selector.class).value();

        final Elements elems = doc.select(selector);

        if (elems.size() > 0) {
            // no support for multiple selected elements yet. Just get the first element.
            final Element elem = elems.get(0);

            // Check which value annotation is present and retrieve data depending on the type of annotation
            if (m.isAnnotationPresent(TextValue.class)) {
                return elem.text();
            } else if (m.isAnnotationPresent(HtmlValue.class)) {
                return elem.html();
            } else if (m.isAnnotationPresent(AttributeValue.class)) {
                return elem.attr(m.getAnnotation(AttributeValue.class).name());
            }
        }

        return null;
    }
}
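
To try the reflection loop from parse() in isolation, the sketch below swaps out the external pieces. A plain Map from selector to already-extracted text stands in for JSoup's doc.select(...), and a tiny convert helper that only handles String and int stands in for Spring's ConversionService. The Product bean, its selectors and the helper names are assumptions made for illustration, not part of the original code.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;

class MapperFlowDemo {

    @Target({ ElementType.METHOD })
    @Retention(RetentionPolicy.RUNTIME)
    @interface Selector { String value(); }

    // Hypothetical bean with made-up selectors.
    public static class Product {
        private String name;
        private int price;

        @Selector("h1.name")
        public void setName(String name) { this.name = name; }

        @Selector("span.price")
        public void setPrice(int price) { this.price = price; }

        public String getName() { return name; }
        public int getPrice() { return price; }
    }

    // Stand-in for Spring's ConversionService: supports only String and int targets.
    static Object convert(String raw, Class<?> target) {
        if (target == String.class) {
            return raw;
        }
        if (target == int.class || target == Integer.class) {
            return Integer.valueOf(raw.trim());
        }
        throw new IllegalArgumentException("Unsupported type: " + target);
    }

    // Same shape as parse(): find annotated setters, look up the value, convert it, invoke the setter.
    // The Map lookup replaces the JSoup selection step.
    static <T> T map(Class<T> type, Map<String, String> selected) {
        try {
            T model = type.getDeclaredConstructor().newInstance();
            for (Method m : type.getMethods()) {
                Selector s = m.getAnnotation(Selector.class);
                if (s == null || m.getParameterCount() != 1) {
                    continue; // not a mapped single-argument setter
                }
                String raw = selected.get(s.value());
                if (raw != null) {
                    m.invoke(model, convert(raw, m.getParameterTypes()[0]));
                }
            }
            return model;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not map " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        Product p = map(Product.class, Map.of("h1.name", "Widget", "span.price", " 42 "));
        System.out.println(p.getName() + " costs " + p.getPrice()); // prints "Widget costs 42"
    }
}
```

Skipping methods that do not take exactly one parameter is a small safeguard added in this sketch; the parser above instead checks the parameter count with Preconditions.checkArgument.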