Dominique Ritze
Oliver Lehmberg
Robert Meusel
Christian Bizer
Sanikumar Zope

This page provides basic statistics describing the relational subset of the WDC Web Tables Corpus 2015. The subset consists of 90 million tables out of the 233 million Web tables in the corpus. In relational tables, a set of similar entities is described with one or more attributes. In addition to this subset, we offer statistics about a subset consisting of only English-language relational tables and a subset containing entity tables. All tables are publicly available for download.


1. TLDs Distribution

Figure 1 shows the distribution of extracted Web tables per top-level domain.

Fig. 1 - Number of tables per TLD

The complete distribution of tables per top-level domain can be found here. The file contains a list of key-value pairs, with the TLD as key and the number of tables as value. For example, the entry Key : .com Value : 62249515 means that 62,249,515 tables were extracted from the "com" domain. Compared to the previous corpus, it is noticeable that the "gov" domain is not even among the top 20 TLDs in this corpus.
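
The distribution files can be read with a few lines of code. The following is a minimal sketch, assuming that every entry sits on its own line and follows the layout of the example entry above; the function name parse_distribution and the regular expression are illustrative and not part of the corpus tooling.

```python
import re

# Assumed line layout, taken from the example entry "Key : .com Value : 62249515".
ENTRY = re.compile(r"Key\s*:\s*(?P<key>.*?)\s+Value\s*:\s*(?P<value>\d+)\s*$")


def parse_distribution(lines):
    """Return a dict mapping each key to its integer count."""
    counts = {}
    for line in lines:
        match = ENTRY.search(line)
        if match:  # lines that do not follow the assumed layout are skipped
            counts[match.group("key")] = int(match.group("value"))
    return counts


sample = ["Key : .com Value : 62249515", "Key : Title Value : 4043775"]
distribution = parse_distribution(sample)
```

The same parser should work for the other distribution files referenced on this page, provided they share this layout.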

2. PLD Distribution

Figure 2 shows the distribution of extracted Web tables per pay-level domain.

Fig. 2 - Number of tables per PLD

Altogether, 540,418 different PLDs are represented in the 90 million relational tables. The complete distribution of tables per pay-level domain can be found here. Again, the file contains key-value pairs, where the key is the PLD and the value is the number of tables from this PLD.

3. Table Sizes and Distribution

Table 1 shows the overall number of extracted relational tables, divided into horizontal and vertical tables. In a horizontal table, the entities are represented in rows and the attributes in columns. When the entities are represented in columns and the attributes in rows, we speak of a vertical table. A vertical table can be transformed into a horizontal table by simply flipping (transposing) it.
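
Flipping a vertical table can be illustrated as follows. This is a minimal sketch, assuming the table is held as a list of equally long rows; the function name flip and the sample data are made up for illustration.

```python
def flip(table):
    """Transpose a table given as a list of equally long rows."""
    return [list(column) for column in zip(*table)]


# Vertical table: each row is an attribute, each column after the first is an entity.
vertical = [
    ["name", "Berlin", "Paris"],
    ["country", "Germany", "France"],
]
horizontal = flip(vertical)
```

After flipping, the first row of horizontal holds the attribute names and each following row describes one entity.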

             #tables
horizontal   84,784,969
vertical      5,481,254
sum          90,266,223
Table 1: Number of horizontal and vertical relational tables

Table 2 provides basic statistics about the size of the tables. The row numbers exclude the header row (if present) and thus refer to data rows. Data rows are all rows of the table that are positioned below the header row and contain at least one non-empty cell.
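
The definition of data rows can be expressed as a short function. This is a sketch under the assumption that a table is a list of rows of strings and that a cell counts as empty when it contains only whitespace; count_data_rows and its has_header flag are hypothetical names.

```python
def count_data_rows(rows, has_header=True):
    """Count rows below the header that contain at least one non-empty cell."""
    body = rows[1:] if has_header else rows
    return sum(1 for row in body if any(cell.strip() for cell in row))


table = [
    ["City", "Population"],
    ["Berlin", "3645000"],
    ["", " "],            # fully empty row, not counted
    ["Paris", ""],        # one non-empty cell, counted
]
data_rows = count_data_rows(table)
```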

                                        min.   max.     average   median
Columns Horizontal Tables (attributes)  2      18,106   5.20      4
Rows Horizontal Tables (entities)       2      17,033   14.45     6
Columns Vertical Tables (entities)      3      16,142   8.44      5
Rows Vertical Tables (attributes)       1      486      3.66      3
Table 2: Statistics about the columns and rows

3.1. Horizontal Tables

Figure 3 shows the distribution of the number of columns (attributes) per horizontal table.

Fig. 3 - Distribution of Number of Columns (attributes) per Table (Horizontal Table)


The complete distribution of the number of columns per horizontal table can be found here. The key of each key-value pair is the number of columns and the value is the number of horizontal tables having exactly this number of columns, e.g. the line Key : 33 Value : 1323 means that there are 1,323 tables with exactly 33 columns.

Figure 4 shows the distribution of the number of data rows (entities) per horizontal table.

Fig. 4 - Distribution of Number of Rows (entities) per Table (Horizontal Table)

The complete distribution of number of rows per horizontal table can be found here. The key of the key value pair represents the number of data rows and the value the number of horizontal tables having exactly this number of data rows, e.g. the line Key : 8 Value : 2932320 means that there are 2,932,320 tables with exactly 8 data rows.
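
Summary statistics such as the average and median can be derived directly from such a distribution. The sketch below assumes the distribution has already been loaded into a dict that maps a size to the number of tables with that size; the function name and the toy numbers are illustrative and are not taken from the corpus. For an even number of tables it reports the lower of the two middle values as the median.

```python
def mean_and_median(histogram):
    """histogram maps a size (e.g. number of rows) to the number of tables of that size."""
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("histogram must contain at least one table")
    mean = sum(size * count for size, count in histogram.items()) / total
    # Walk the sizes in ascending order until half of all tables are covered.
    seen = 0
    for size in sorted(histogram):
        seen += histogram[size]
        if 2 * seen >= total:
            return mean, size


# Toy distribution: three tables with 2 rows, one with 3 rows, one with 10 rows.
toy = {2: 3, 3: 1, 10: 1}
mean, median = mean_and_median(toy)
```

A few very large tables pull the mean well above the median, which matches the pattern visible in Table 2.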

3.2. Vertical Tables

Figure 5 shows the distribution of the number of columns (entities) per vertical table.

Fig. 5 - Distribution of Number of Columns (entities) per Table (Vertical Table)


The complete distribution of number of columns per vertical table can be found here. The key of the key value pair represents the number of columns and the value the number of vertical tables having exactly this number of columns, e.g. the line Key : 13 Value : 42917 means that there are 42,917 tables with exactly 13 columns. Since we know that these tables are vertical, this number corresponds to the number of rows after flipping the table.

Figure 6 shows the distribution of the number of data rows (attributes) per vertical table.

Fig. 6 - Distribution of Number of Rows (attributes) per Table (Vertical Table)

The complete distribution of number of rows per vertical table can be found here. The key of the key value pair represents the number of data rows and the value the number of vertical tables having exactly this number of data rows, e.g. the line Key : 8 Value : 38554 means that there are 38,554 tables with exactly 8 rows. Since we know that these tables are vertical, this number corresponds to the number of columns after flipping the table.

4. Column Headers

To get a first impression of the topics covered by the tables, we applied a simple heuristic for identifying the column headers of each Web table. The heuristic is based on the Cell Content Pattern, which is defined as a tuple containing a representation of the composition of characters in a cell [Tang2006]. After extracting the content pattern of each cell, the patterns in the current row are compared with those in the following rows. If a row shows different patterns than the rows that follow it, we consider this row to be the header containing the column names. So far, we only consider two cases: either the first row is the header row, or no header row exists. A more sophisticated header unfolding [Chen2013] would be necessary to find, for example, headers that span several rows.
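
The idea behind the heuristic can be sketched as follows. This is a simplified illustration and not the implementation used for the corpus: the pattern below maps every character to a coarse class (letter, digit, whitespace, other) and collapses repeated classes, which is only one possible reading of the Cell Content Pattern, and the decision rule (the first row is a header if at least one of its cells has a pattern that occurs nowhere below it in the same column) is an assumption made for this example.

```python
from itertools import groupby


def cell_pattern(cell):
    """Coarse character-class pattern, e.g. '3,645' -> '9P9' (assumed definition)."""
    classes = []
    for ch in cell.strip():
        if ch.isalpha():
            classes.append("A")
        elif ch.isdigit():
            classes.append("9")
        elif ch.isspace():
            classes.append("_")
        else:
            classes.append("P")
    # Collapse runs of the same class into a single symbol.
    return "".join(key for key, _ in groupby(classes))


def first_row_is_header(rows):
    """Return True if the first row's patterns differ from the rows below it."""
    if len(rows) < 2:
        return False
    header, body = rows[0], rows[1:]
    for index, cell in enumerate(header):
        below = {cell_pattern(row[index]) for row in body if index < len(row)}
        if below and cell_pattern(cell) not in below:
            return True
    return False


with_header = [["City", "Population"], ["Berlin", "3645000"], ["Paris", "2161000"]]
without_header = with_header[1:]
```

Note that a table consisting only of text columns yields the same pattern in every row, so this simple rule would report no header for it.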

In contrast to our previous extraction, we can now deal with headers of vertical tables [Crestan2011] and we know whether a header is present or not (about 20% of all tables do not have a header according to [Pimplikar2012]). We did not take column name synonyms such as 'population' and 'number of inhabitants' into account. The only normalization we apply is to remove a trailing 's' to obtain the singular form of nouns. Thus, the number of different headers can be seen as an upper bound.
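
The normalization step can be illustrated with a short sketch. It assumes that exactly one trailing 's' is removed and that nothing else (such as case folding) is applied; normalize_header and the sample headers are hypothetical.

```python
from collections import Counter


def normalize_header(header):
    """Remove a single trailing 's' as a crude singularization."""
    return header[:-1] if header.endswith("s") else header


headers = ["title", "titles", "price", "address"]
counts = Counter(normalize_header(header) for header in headers)
```

The example also shows the limits of such a rule: 'title' and 'titles' are merged as intended, but 'address' is turned into 'addres'.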


With the current approach, we were able to identify a total of 462,165,071 column headers, of which 5,477,071 are distinct. Figure 7 shows popular (useful) column headers together with their number of occurrences. The most frequent header is the empty string (29,879,843 occurrences), which we exclude from the figure since it does not provide any information about the content of the tables.

Fig. 7 - Popular Column Headers

The complete distribution of headers can be found here. The key of the key value pair represents the header and the value the number of columns having exactly this header, e.g. the line Key : Title Value : 4043775 means that there are 4,043,775 columns with title as header.

5. Column Data Types Distribution

We used a rough type-guessing algorithm to detect the data type of each table column. First, the data type of each cell in the column is detected by trying to parse it into the corresponding types. We use the following six predefined data types: string, numeric, date, link, boolean and list. Afterwards, the most frequent data type in the column is chosen as the data type of the whole column.
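
The two-step procedure (per-cell parsing followed by a majority vote) can be sketched as follows. The concrete parsing rules, such as the accepted date formats, the boolean literals and the list separators, are assumptions made for this example and are not the rules used to build the corpus.

```python
import re
from collections import Counter
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")  # assumed formats


def guess_cell_type(cell):
    """Guess one of: boolean, link, numeric, date, list, string."""
    text = cell.strip()
    if text.lower() in {"true", "false", "yes", "no"}:
        return "boolean"
    if re.match(r"^(https?://|www\.)\S+$", text):
        return "link"
    try:
        float(text.replace(",", ""))  # tolerate thousands separators
        return "numeric"
    except ValueError:
        pass
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "date"
        except ValueError:
            continue
    if len(re.split(r"[;|]", text)) > 1:
        return "list"
    return "string"


def guess_column_type(cells):
    """Choose the most frequent cell type as the column type."""
    votes = Counter(guess_cell_type(cell) for cell in cells)
    return votes.most_common(1)[0][0]


column_type = guess_column_type(["1,200", "3.5", "n/a"])
```

In this sketch, ties are resolved in favor of the type that was seen first in the column; a real implementation would need an explicit tie-breaking rule.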

Figure 8 shows the distribution of column data types. The category "Other" combines link, boolean and list, since each of them accounts for only a very small share of all columns.

Fig. 8 - Column Data Types Distribution

6. Context Information

Table 3 provides basic statistics about the context-related data that we additionally extracted. For each table, we extract the 200 words before and after the table. In previous experiments, we found that without additional context or temporal information it is difficult to process the tables further, e.g. to match them to a knowledge base [Zhang2013]. For almost half of the tables, we can extract a timestamp that is located after the relational table. In many cases, this timestamp is the imprint of the webpage. The last modified date comes from the HTTP header of the HTML page.
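
Extracting the surrounding words can be sketched as follows, assuming that the page text before and after the table is already available as plain strings and that words are separated by whitespace; context_window and its parameters are hypothetical names.

```python
def context_window(text_before, text_after, size=200):
    """Return the last `size` words before and the first `size` words after a table."""
    before = text_before.split()[-size:]
    after = text_after.split()[:size]
    return before, after


page_text = " ".join(f"w{i}" for i in range(500))
before, after = context_window(page_text, page_text)
```

If fewer than 200 words are available on either side, the function simply returns all of them.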

                                   Number of tables
Timestamp Before Relational Table  13,520,557
Timestamp After Relational Table   42,990,899
Last Modified Date                 19,275,614
Table 3: Extracted Context Information

7. Further Information

We also offer statistics describing the English-language subset of the relational tables and the entity tables subset.

All tables are publicly available for download.