Web Data Commons Extraction Report - August 2012 Corpus

This document provides statistics about the Web Data Commons data set, which has been extracted from the August 2012 version of the Common Crawl.

Beyond the per-format analysis and the raw statistics presented in this document, we have also produced additional statistics and analysis, which can be found here.
To make the data easy to access, we created an additional section that explains in detail how to get the data.

Please note that the term Domains refers to pay-level domains. Subdomains are not counted as separate domains.
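
To illustrate what counting by pay-level domain means, here is a minimal Python sketch that reduces a URL to its public suffix plus one label. It is not the extraction framework's own code: the function name `pay_level_domain` and the tiny `PUBLIC_SUFFIXES` set are assumptions made for this example, and a real implementation would consult the full Public Suffix List.

```python
from urllib.parse import urlsplit

# Hypothetical, heavily abbreviated stand-in for the Public Suffix List.
# It only covers suffixes that appear in the examples of this report.
PUBLIC_SUFFIXES = {"com", "org", "fr", "hu", "es", "ca", "uk", "co.uk", "br", "com.br"}


def pay_level_domain(url: str) -> str:
    """Return the public suffix plus the one label directly to its left.

    Expects a full URL including the scheme (e.g. "http://...").
    """
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    labels = host.split(".")
    # Try the longest candidate suffix first so that "co.uk" wins over "uk".
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in PUBLIC_SUFFIXES:
            # Keep one label in front of the suffix, if there is one.
            return ".".join(labels[max(i - 1, 0):])
    # Unknown suffix: fall back to the full host name.
    return host


if __name__ == "__main__":
    for example in (
        "http://myblog.blogspot.com/2012/08/post.html",
        "http://www.cylex-uk.co.uk/company",
        "https://en.wikipedia.org/wiki/RDFa",
    ):
        print(example, "->", pay_level_domain(example))
```

Under this sketch, myblog.blogspot.com and any other blogspot.com subdomain collapse into the single domain blogspot.com, which is why large hosting platforms appear as one entry in the domain rankings.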

Overall


Crawl Date: Q1/Q2 2012
Total Data: 40.1 terabytes (compressed)
Parsed HTML URLs: 3,005,629,093
URLs with Triples: 369,254,196
Domains in Crawl: 40,600,000
Domains with Triples: 2,286,277
Typed Entities: 1,811,471,956
Triples: 7,350,953,995
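
The overall figures can be turned into coverage ratios with a few lines of arithmetic. The following Python sketch uses only the numbers listed above; the variable names and the `share` helper are assumptions made for this example.

```python
# Figures copied from the "Overall" statistics of this report.
parsed_urls = 3_005_629_093
urls_with_triples = 369_254_196
domains_in_crawl = 40_600_000
domains_with_triples = 2_286_277
typed_entities = 1_811_471_956
triples = 7_350_953_995


def share(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


url_share = share(urls_with_triples, parsed_urls)
domain_share = share(domains_with_triples, domains_in_crawl)
triples_per_url = triples / urls_with_triples
triples_per_entity = triples / typed_entities

print(f"URLs with triples:         {url_share:.2f}% of parsed HTML URLs")
print(f"Domains with triples:      {domain_share:.2f}% of domains in crawl")
print(f"Triples per URL (overall): {triples_per_url:.2f}")
print(f"Triples per typed entity:  {triples_per_entity:.2f}")
```

By this calculation, roughly 12.3% of the parsed HTML URLs and about 5.6% of the domains in the crawl contain at least one triple.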

Results per Format


Format               Domains         URLs  Typed Entities        Triples
html-rdfa            519,379  168,654,234     188,243,535  1,079,175,202
html-microdata       140,312   97,048,329     266,169,151  1,488,063,426
html-mf-geo           48,415    6,602,779      13,206,248     32,722,603
html-mf-hcalendar     37,620    3,745,051      32,630,606    142,975,309
html-mf-hcard      1,511,855  120,027,602   1,113,527,360  3,547,824,107
html-mf-hrecipe        3,281    1,110,712      11,695,280     50,898,293
html-mf-hlisting       4,030      772,402      25,220,512     97,711,757
html-mf-hresume        1,257       39,412          43,100        678,097
html-mf-hreview       20,781    4,959,672      27,781,420    207,589,518
html-mf-species           91       37,186         274,862        774,671
html-mf-xfn          490,286   40,123,185     132,679,882    703,188,115
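
The per-extractor sections further down report an "Average Triples per URL" and an "Average Triples per Domain" for each format. The report does not state the formula, so the following Python sketch assumes that both are plain ratios of the Triples column to the URLs and Domains columns of the table above. The names `ROWS` and `averages` are chosen for this example only.

```python
# (format, domains, urls, typed entities, triples),
# copied from the "Results per Format" table.
ROWS = [
    ("html-rdfa", 519_379, 168_654_234, 188_243_535, 1_079_175_202),
    ("html-microdata", 140_312, 97_048_329, 266_169_151, 1_488_063_426),
    ("html-mf-geo", 48_415, 6_602_779, 13_206_248, 32_722_603),
    ("html-mf-hcalendar", 37_620, 3_745_051, 32_630_606, 142_975_309),
    ("html-mf-hcard", 1_511_855, 120_027_602, 1_113_527_360, 3_547_824_107),
    ("html-mf-hrecipe", 3_281, 1_110_712, 11_695_280, 50_898_293),
    ("html-mf-hlisting", 4_030, 772_402, 25_220_512, 97_711_757),
    ("html-mf-hresume", 1_257, 39_412, 43_100, 678_097),
    ("html-mf-hreview", 20_781, 4_959_672, 27_781_420, 207_589_518),
    ("html-mf-species", 91, 37_186, 274_862, 774_671),
    ("html-mf-xfn", 490_286, 40_123_185, 132_679_882, 703_188_115),
]


def averages(row):
    """Return (format, triples per URL, triples per domain) for one table row."""
    name, domains, urls, _entities, triples = row
    return name, triples / urls, triples / domains


# Lookup table: format name -> (triples per URL, triples per domain).
by_format = {name: (per_url, per_domain)
             for name, per_url, per_domain in map(averages, ROWS)}

print(f"{'Format':<18}{'Triples/URL':>12}{'Triples/Domain':>16}")
for name, (per_url, per_domain) in by_format.items():
    print(f"{name:<18}{per_url:>12.2f}{per_domain:>16,.2f}")
```

Dividing by URLs shows how densely a format is used on the pages that carry it, while dividing by domains is dominated by a few very large sites, so the two averages can rank the formats quite differently.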

Top Domains by Extracted Triples


  1. blogspot.com (410,845,586 Triples)
  2. youtube.com (366,504,194 Triples)
  3. hotels.com (160,556,202 Triples)
  4. twitter.com (134,812,653 Triples)
  5. wordpress.com (120,143,998 Triples)
  6. rhapsody.com (47,218,824 Triples)
  7. flickr.com (33,302,981 Triples)
  8. flogao.com.br (28,853,494 Triples)
  9. cylex-uk.co.uk (28,452,814 Triples)
  10. wikipedia.org (27,805,240 Triples)
  11. bizrate.com (24,965,149 Triples)
  12. food.com (23,894,192 Triples)
  13. tabelog.com (23,260,071 Triples)
  14. kelkoo.fr (22,869,887 Triples)
  15. identi.ca (22,862,917 Triples)
  16. thefreelibrary.com (22,541,841 Triples)
  17. cylex-usa.com (22,385,212 Triples)
  18. citysearch.com (21,791,903 Triples)
  19. cylex-tudakozo.hu (21,693,538 Triples)
  20. gumtree.com (21,299,104 Triples)

Top Domains by URLs with Triples


  1. youtube.com (48,634,453 URLs)
  2. blogspot.com (25,062,401 URLs)
  3. tumblr.com (14,503,656 URLs)
  4. flickr.com (9,130,849 URLs)
  5. wordpress.com (8,460,124 URLs)
  6. wikipedia.org (1,668,194 URLs)
  7. thefreedictionary.com (1,494,990 URLs)
  8. yahoo.com (1,426,569 URLs)
  9. hotels.com (1,279,488 URLs)
  10. flightaware.com (908,410 URLs)
  11. typepad.com (869,591 URLs)
  12. diplodocs.com (810,688 URLs)
  13. vivastreet.com (652,968 URLs)
  14. threadless.com (632,757 URLs)
  15. tripadvisor.com (574,741 URLs)
  16. shopping.com (568,365 URLs)
  17. tripadvisor.es (561,179 URLs)
  18. over-blog.com (554,427 URLs)
  19. ehow.com (542,091 URLs)
  20. fotolog.com (540,440 URLs)

Extractor html-rdfa


Triples Extracted: 1,079,175,202
URLs with Triples: 168,654,234
Average Triples per URL: 6.40
Domains with Triples: 519,379
Average Triples per Domain: 2,077.82
Typed Entities: 188,243,535

Extractor html-microdata


Triples Extracted: 1,488,063,426
URLs with Triples: 97,048,329
Average Triples per URL: 15.33
Domains with Triples: 140,312
Average Triples per Domain: 10,605.39
Typed Entities: 266,169,151

Extractor html-mf-geo


Triples Extracted: 32,722,603
URLs with Triples: 6,602,779
Average Triples per URL: 4.96
Domains with Triples: 48,415
Average Triples per Domain: 675.88
Typed Entities: 13,206,248

Extractor html-mf-hcalendar


Triples Extracted: 142,975,309
URLs with Triples: 3,745,051
Average Triples per URL: 38.18
Domains with Triples: 37,620
Average Triples per Domain: 3,800.51
Typed Entities: 32,630,606

Extractor html-mf-hcard


Triples Extracted: 3,547,824,107
URLs with Triples: 120,027,602
Average Triples per URL: 29.56
Domains with Triples: 1,511,855
Average Triples per Domain: 2,346.67
Typed Entities: 1,113,527,360

Extractor html-mf-hlisting


Triples Extracted: 97,711,757
URLs with Triples: 772,402
Average Triples per URL: 126.50
Domains with Triples: 4,030
Average Triples per Domain: 24,246.09
Typed Entities: 25,220,512

Extractor html-mf-hrecipe


Triples Extracted: 50,898,293
URLs with Triples: 1,110,712
Average Triples per URL: 45.82
Domains with Triples: 3,281
Average Triples per Domain: 15,513.04
Typed Entities: 11,695,280

Extractor html-mf-hresume


Triples Extracted: 678,097
URLs with Triples: 39,412
Average Triples per URL: 17.21
Domains with Triples: 1,257
Average Triples per Domain: 539.46
Typed Entities: 43,100

Extractor html-mf-hreview


Triples Extracted: 207,589,518
URLs with Triples: 4,959,672
Average Triples per URL: 41.86
Domains with Triples: 20,781
Average Triples per Domain: 9,989.39
Typed Entities: 27,781,420

Extractor html-mf-species


Triples Extracted: 774,671
URLs with Triples: 37,186
Average Triples per URL: 20.83
Domains with Triples: 91
Average Triples per Domain: 8,512.87
Typed Entities: 274,862

Extractor html-mf-xfn


Triples Extracted: 703,188,115
URLs with Triples: 40,123,185
Average Triples per URL: 17.53
Domains with Triples: 490,286
Average Triples per Domain: 1,434.24
Typed Entities: 132,679,882