Web Data Commons Analysis Result
Petar Petrovski, , Research Group Data and Web Science

Many e-shops have started to mark up the products they offer within their HTML pages using the Microdata markup format. In this document, we analyze 1,986,359 product offers originating from 9,240 different e-shops that use Microdata markup and classify the offers into the 9 main product categories of the Amazon product catalog. The offers are taken from the Web Data Commons data set, which was extracted from the August 2012 version of the Common Crawl.

In addition to the basic statistics about the microdata product descriptions, this document analyzes:

  1. The distribution of the offers over the 9 main product categories from Amazon.
  2. The number of e-shops offering products from a specific category.
  3. The number of product categories offered by individual e-shops.

We first present the results of the analysis. Afterwards, we describe the methodology that was used for classifying the offers into the product categories.

1. Results

1.1 Number of Offers per Product Category

Figure 1 shows the total number of offers per product category. The most offered category is 'Books' with 233,439 offers. 'Electronics & Computers' and 'Clothing, Shoes & Jewelry' follow closely in second and third place with 219,188 and 206,315 offers, respectively. The least offered category is 'Toys, Kids, Baby & Pets' with 114,263 offers. The mean number of offers across the 9 categories is 175,415.

Fig. 1 - Number of Offers per Product Category (including the generic category Other Products)

1.2 Number of e-Shops offering Products belonging to a specific Category

Figure 2 shows the number of e-shops offering products from a specific category. 'Books' is the category offered by the most e-shops (2,974). 'Automotive & Industrial' is offered by the fewest e-shops (1,446). The average number of e-shops per product category is 2,365.

Fig. 2 - Number of e-shops offering products from a specific category

1.3 Number of Product Categories offered by individual Shops

Figure 3 shows the number of product categories offered by individual e-shops. The distribution is right-skewed, with most shops offering products from only one or two product categories. This is to be expected because most of the e-shops are small companies specializing in a particular product category.

Fig. 3 - Number of product categories offered by individual Shops.
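
The three statistics reported in Sections 1.1 to 1.3 can all be derived from a list of (e-shop, category) pairs, one pair per classified offer. The following Python code is a minimal sketch of that aggregation. The function name, the pair layout and the sample data are assumptions made for illustration; they are not taken from the original analysis pipeline.

```python
from collections import Counter, defaultdict

def summarize(offers):
    """offers: iterable of (shop, category) pairs, one per classified offer."""
    offers = list(offers)

    # Section 1.1: number of offers per product category
    offers_per_category = Counter(category for _, category in offers)

    # Collect the distinct categories seen for each shop
    categories_by_shop = defaultdict(set)
    for shop, category in offers:
        categories_by_shop[shop].add(category)

    # Section 1.2: number of distinct shops offering each category
    shops_per_category = Counter(
        category for cats in categories_by_shop.values() for category in cats
    )

    # Section 1.3: histogram of how many categories each shop offers
    categories_per_shop = Counter(len(cats) for cats in categories_by_shop.values())

    return offers_per_category, shops_per_category, categories_per_shop

# Hypothetical toy data
sample = [
    ("shop-a.example", "Books"),
    ("shop-a.example", "Books"),
    ("shop-a.example", "Electronics & Computers"),
    ("shop-b.example", "Books"),
]
per_category, shops, histogram = summarize(sample)
```

In this toy example, 'Books' has three offers from two shops, and one shop each offers one and two categories.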



2. Methodology


This section describes the methodology used for classifying the offers into the different product categories. The basic approach is to use product descriptions from Amazon to train and validate a classifier and to apply the classifier afterwards to the WebDataCommons data.

Microdata Vocabularies, Classes and Properties

There are three main vocabularies with which products are marked up in Microdata: schema (http://schema.org/), data-voc (http://data-vocabulary.org/) and gr (http://purl.org/goodrelations/v1#). The main class for marking up products in schema and data-voc is "Product", which accounts for more than 95% of usage; the remaining less than 5% is marked up with the class "Offer" or "ProductService" in gr. As stated on the additional statistics page, the most frequent Microdata properties used to describe products are title and description, which are present in 86% and 61% of the descriptions, respectively. All other, more specific properties are used in less than 50% of the descriptions.

Dataset and Training Set

As we only have title and description property values for most offers, we needed to treat the task as a text classification problem and learn a classifier for these two properties. As a training set for our model we chose 18,000 labeled products with titles and descriptions (Editorial review) from Amazon.com, spread across 9 product categories. We filtered the WebDataCommons product descriptions to only contain English descriptions that are at least 20 words long in order to give the classifier a fair chance to determine the correct category. This reduced the overall number of offers contained in the WebDataCommons data set from 9,454,403 to 1,986,359 offers originating from 9,240 e-shops. The data set used for the analysis is available for download and can be used for further investigations.
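
The length and language filter described above can be sketched in Python as follows. This is an illustration only: the way words were counted and the method used to detect English text are not specified in this document, so whitespace splitting and the caller-supplied is_english predicate are assumptions, as are the function names and the dictionary layout of an offer.

```python
MIN_WORDS = 20  # minimum description length stated in the text

def word_count(text):
    # Assumption: words are whitespace-separated tokens
    return len(text.split())

def keep_offer(offer, is_english):
    """offer: dict with a 'description' key.
    is_english: hypothetical predicate supplied by the caller, since the
    language detection method is not specified in this document."""
    description = offer.get("description") or ""
    return word_count(description) >= MIN_WORDS and is_english(description)

# Usage with a placeholder predicate that accepts every text
long_offer = {"description": " ".join(["word"] * 20)}
short_offer = {"description": " ".join(["word"] * 19)}
kept = [o for o in (long_offer, short_offer) if keep_offer(o, lambda text: True)]
```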

Feature Generation

Feature generation is a 4-step process that is based on generating a word vector from the documents (title + description) in our training set. The first step is tokenizing the text (splitting at non-alpha characters) and removing stop words. The second step applies multiple pruning techniques in order to reduce the number of features. First, we prune by the relative frequency of the terms in the documents, with a lower bound of 0.2% and an upper bound of 98%. In addition to the relative frequency pruning, a weighted association analysis was computed in order to determine each term's association with a certain class. The computed weights are used to prune a term if it is only weakly associated with any class (the maximum weight given to the term is smaller than 0.2) or if it is highly associated with multiple classes (the second highest weight is within 50% of the maximum weight assigned to the term). The third step is creating 1- to 4-grams out of the remaining terms. As a result we get ~3600 features, for which the TF-IDF is computed as the last step.
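
The tokenization, pruning and n-gram steps can be sketched in Python as shown below. This is a minimal sketch under several assumptions, not the original implementation: the stop word list is a tiny placeholder, "relative frequency" is read as the share of documents containing a term, "within 50% of the max weight" is read as a second highest weight of at least half the maximum, and the association weights are taken as precomputed input because the weighting scheme is not described here. All function names are hypothetical.

```python
import re

# Tiny illustrative stop word list (assumption; the original list is not given)
STOP_WORDS = {"the", "a", "an", "and", "of", "for", "with", "in", "to"}

def tokenize(text):
    # Split on runs of non-alphabetic characters, lowercase, drop stop words
    tokens = [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]
    return [t for t in tokens if t not in STOP_WORDS]

def prune_by_document_frequency(docs, lower=0.002, upper=0.98):
    """Keep terms whose share of documents lies within [lower, upper]."""
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t for t, df in doc_freq.items() if lower <= df / n_docs <= upper}

def keep_by_association(class_weights, min_weight=0.2, ratio=0.5):
    """class_weights: precomputed per-class association weights for one term."""
    ranked = sorted(class_weights.values(), reverse=True)
    best = ranked[0]
    second = ranked[1] if len(ranked) > 1 else 0.0
    if best < min_weight:
        return False  # only weakly associated with any class
    if second >= ratio * best:
        return False  # strongly associated with more than one class
    return True

def ngrams(tokens, max_n=4):
    # All contiguous 1- to max_n-grams, joined with underscores
    return ["_".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]
```

Note that splitting at non-alpha characters also discards digits, so a model name such as "EOS-5D" yields the tokens "eos" and "d".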

Model-Training and Application

The model we chose is the Naive Bayes classification model. Because it outputs a probability for each class, products can be labeled as not matching any of the 9 classes, which in turn allows better precision in classifying. The model was trained with the generated features (explained above). To improve the precision of the model, two threshold functions were applied. First, a flat threshold of 0.15 was applied, i.e. if a product's highest probability given by the Naive Bayes model is lower than 0.15, the product is assigned to the 'Other Products' category. To determine the threshold value, we simplified the problem to binary classification (match or non-match against a certain class) and compared the ROC curves. Furthermore, we analyzed the probabilities given for all the classes for a given product. This analysis determines whether the maximum probability for a given product exceeds the other probabilities by a margin large enough to make the prediction reliable. This margin is used as the second threshold for improving our model's precision.
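
The two thresholds can be sketched in Python as follows. The flat threshold of 0.15 is taken from the text. Everything else is an assumption for illustration: the margin value of 0.1 is hypothetical, the margin is measured as the absolute difference between the two highest probabilities, and products failing the margin test are assigned to 'Other Products' in the same way as products failing the flat threshold.

```python
OTHER = "Other Products"

def assign_category(probabilities, flat_threshold=0.15, min_margin=0.1):
    """probabilities: dict mapping each category to its class probability.
    min_margin is a hypothetical value; the text does not state the margin."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    best_label, best_p = ranked[0]
    runner_up_p = ranked[1][1] if len(ranked) > 1 else 0.0

    # First threshold: reject if the top probability is below the flat threshold
    if best_p < flat_threshold:
        return OTHER

    # Second threshold: reject if the top class does not lead by a clear margin
    if best_p - runner_up_p < min_margin:
        return OTHER

    return best_label

# Usage with hypothetical probabilities
confident = assign_category({"Books": 0.6, "Sports & Outdoors": 0.2})
ambiguous = assign_category({"Books": 0.4, "Sports & Outdoors": 0.35})
```

Here the first call returns 'Books', while the second returns 'Other Products' because the two highest probabilities are too close to each other.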

Validation and Evaluation


We validated the model with 10-fold cross-validation against the training set (18,000 labeled products from Amazon) using stratified sampling. The results of the cross-validation are shown in Table 1.

Category                     Precision %   Recall %
Books                        86.58         87.95
Movies, Music & Games        89.81         70.63
Electronics & Computers      92.98         88.00
Home, Garden & Tools         73.81         60.78
Grocery, Health & Beauty     70.20         72.86
Toys, Kids, Baby & Pets      75.00         64.85
Clothing, Shoes & Jewelry    88.56         89.93
Sports & Outdoors            72.83         67.90
Automotive & Industrial      73.06         65.50
Average                      80.31         74.26

Table 1 - Precision and recall per class from the 10-fold cross-validation
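
For reference, per-class precision and recall follow the standard definitions: precision is the share of products predicted as a class that truly belong to it, and recall is the share of products of a class that were predicted as such. The Python code below is a minimal sketch of that computation; the function name and the toy labels are assumptions for illustration and are not part of the original evaluation.

```python
def precision_recall(true_labels, predicted_labels, category):
    """Return (precision, recall) for one category as fractions in [0, 1]."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical toy labels
true_labels = ["Books", "Books", "Sports & Outdoors", "Sports & Outdoors"]
predicted = ["Books", "Sports & Outdoors", "Sports & Outdoors", "Sports & Outdoors"]
books_precision, books_recall = precision_recall(true_labels, predicted, "Books")
```

In this toy example, the single 'Books' prediction is correct but one of the two true 'Books' products is missed.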