Web Data Commons Extraction Report - February 2012 Corpus

This file contains the extraction report for the Web Data Commons project. Extraction and data statistics are given both overall and for each structured data format.

Overall


Total Data            20.9 Terabyte (compressed)
Total URLs            1,700,611,442
Parsed HTML URLs      1,486,186,868
Domains with Triples  65,408,946
URLs with Triples     188,821,015
Typed Entities        1,222,563,749
Triples               3,294,248,652

Results per Format


Extractor          Domains with Triples  URLs with Triples  Typed Entities  Triples
html-rdfa          16,976,232            67,901,246         49,370,729      456,169,126
html-microdata     3,952,674             26,929,865         90,526,013      404,413,915
html-mf-geo        897,080               2,491,933          4,787,126       11,222,766
html-mf-hcalendar  629,319               1,506,379          27,165,545      65,547,870
html-mf-hcard      30,417,192            61,360,686         865,633,059     1,837,847,772
html-mf-hlisting   69,569                197,027            8,252,632       20,703,189
html-mf-hresume    9,890                 20,762             92,346          432,363
html-mf-hreview    615,681               1,971,870          7,809,088       50,475,411
html-mf-species    4,109                 14,033             139,631         224,847
html-mf-hrecipe    127,381               422,289            5,516,036       5,513,030
html-mf-xfn        11,709,819            26,004,925         163,271,544     441,698,363
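
The per-extractor sections later in this report list an "Average Triples per URL" and an "Average Triples per Domain". The following is a minimal sketch of how those averages can be recomputed from the per-format counts above. The names FORMATS, averages and column_totals are illustrative and not part of the Web Data Commons tooling. The sketch also sums each column; in this report the column sums equal the figures in the Overall section, which suggests that a URL or domain carrying several formats is counted once per format.

```python
# Per-format counts copied from the "Results per Format" table:
# extractor -> (domains with triples, URLs with triples, typed entities, triples)
FORMATS = {
    "html-rdfa":         (16_976_232, 67_901_246, 49_370_729, 456_169_126),
    "html-microdata":    (3_952_674, 26_929_865, 90_526_013, 404_413_915),
    "html-mf-geo":       (897_080, 2_491_933, 4_787_126, 11_222_766),
    "html-mf-hcalendar": (629_319, 1_506_379, 27_165_545, 65_547_870),
    "html-mf-hcard":     (30_417_192, 61_360_686, 865_633_059, 1_837_847_772),
    "html-mf-hlisting":  (69_569, 197_027, 8_252_632, 20_703_189),
    "html-mf-hresume":   (9_890, 20_762, 92_346, 432_363),
    "html-mf-hreview":   (615_681, 1_971_870, 7_809_088, 50_475_411),
    "html-mf-species":   (4_109, 14_033, 139_631, 224_847),
    "html-mf-hrecipe":   (127_381, 422_289, 5_516_036, 5_513_030),
    "html-mf-xfn":       (11_709_819, 26_004_925, 163_271_544, 441_698_363),
}


def averages(extractor):
    """Return (average triples per URL, average triples per domain), rounded to 2 digits."""
    domains, urls, _entities, triples = FORMATS[extractor]
    return round(triples / urls, 2), round(triples / domains, 2)


def column_totals():
    """Sum each column over all extractors: (domains, URLs, typed entities, triples)."""
    return tuple(sum(column) for column in zip(*FORMATS.values()))


if __name__ == "__main__":
    for name in FORMATS:
        per_url, per_domain = averages(name)
        print(f"{name}: {per_url} triples per URL, {per_domain} triples per domain")
    print("column totals:", column_totals())
```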

Top Domains by URLs with Triples


  1. www.youtube.com (15,858,531 URLs)
  2. picasaweb.google.com (412,771 URLs)
  3. www.playlist.com (120,520 URLs)
  4. www.flogao.com.br (103,631 URLs)
  5. www.purepeople.com (97,773 URLs)
  6. elbo.ws (96,362 URLs)
  7. www.dogster.com (96,329 URLs)
  8. www.kovideo.net (93,896 URLs)
  9. www.ucomparehealthcare.com (93,868 URLs)
  10. www.mp3lyrics.org (87,954 URLs)
  11. www.ncbi.nlm.nih.gov (84,006 URLs)
  12. blog.moviefone.com (82,136 URLs)
  13. www.tvfanatic.com (82,037 URLs)
  14. www.xumbia.com (81,502 URLs)
  15. www.thecarconnection.com (72,541 URLs)
  16. www.partypop.com (69,782 URLs)
  17. weheartit.com (63,528 URLs)
  18. menmedia.co.uk (58,934 URLs)
  19. www.xing.com (56,572 URLs)
  20. www.tripadvisor.dk (53,109 URLs)

In the following statistics, the term "Property Values" refers to the overall number of properties that describe all typed entities. The term "URL Values" refers to the subset of the "Property Values" that have a URL as object. The term "Remote URL Values" refers to the subset of the "URL Values" that point to a different website (meaning that the namespace of the URL differs from the namespace of the described entity). The term "Literal Values" refers to the subset of the "Property Values" that are literals rather than URLs.
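
The definitions above can be illustrated with a short sketch that sorts the property values of one entity into these categories. It rests on several assumptions that are not taken from the report: the "namespace" of a URL is approximated by its host name, each value arrives already tagged as "uri", "literal" or something else, and everything that is neither a URL nor a literal is counted under "Other Values" (a category the report lists but does not define). The function names are hypothetical.

```python
from urllib.parse import urlparse


def namespace(url):
    """Assumption: the namespace of a URL is its lower-cased host name."""
    return urlparse(url).netloc.lower()


def count_values(entity_url, values):
    """Count property values of one entity by category.

    values is a list of (kind, value) pairs, where kind is "uri",
    "literal", or any other tag (for example "bnode").
    """
    counts = {"property": 0, "url": 0, "remote_url": 0, "literal": 0, "other": 0}
    for kind, value in values:
        counts["property"] += 1
        if kind == "uri":
            counts["url"] += 1
            # Remote URL values are the subset of URL values whose
            # namespace differs from that of the described entity.
            if namespace(value) != namespace(entity_url):
                counts["remote_url"] += 1
        elif kind == "literal":
            counts["literal"] += 1
        else:
            # Assumption: anything else falls under "Other Values".
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    example = [
        ("uri", "http://example.org/a"),
        ("uri", "http://other.example.com/b"),
        ("literal", "Alice"),
        ("bnode", "_:b0"),
    ]
    print(count_values("http://example.org/page", example))
```

Under this scheme the URL, literal and other counts always add up to the property count, which matches the per-extractor tables in this report, where "URL Values", "Literal Values" and "Other Values" sum to "Property Values".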

Extractor html-rdfa


Triples Extracted           456,169,126
URLs with Triples           67,901,246
Average Triples per URL     6.72
Domains with Triples        16,976,232
Average Triples per Domain  26.87
Typed Entities              49,370,729
Property Values             141,619,045
URL Values                  27,651,137
Remote URL Values           4,539,440
Literal Values              75,117,453
Other Values                38,850,455

Extractor html-microdata


Triples Extracted           404,413,915
URLs with Triples           26,929,865
Average Triples per URL     15.02
Domains with Triples        3,952,674
Average Triples per Domain  102.31
Typed Entities              90,526,013
Property Values             325,687,210
URL Values                  78,892,087
Remote URL Values           24,690,392
Literal Values              178,624,098
Other Values                68,171,025

Extractor html-mf-geo


Triples Extracted           11,222,766
URLs with Triples           2,491,933
Average Triples per URL     4.5
Domains with Triples        897,080
Average Triples per Domain  12.51
Typed Entities              4,787,126
Property Values             11,224,444
URL Values                  801
Remote URL Values           0
Literal Values              6,436,517
Other Values                4,787,126

Extractor html-mf-hcalendar


Triples Extracted           65,547,870
URLs with Triples           1,506,379
Average Triples per URL     43.51
Domains with Triples        629,319
Average Triples per Domain  104.16
Typed Entities              27,165,545
Property Values             57,201,453
URL Values                  6,320,698
Remote URL Values           451,431
Literal Values              36,937,042
Other Values                13,943,713

Extractor html-mf-hcard


Triples Extracted           1,837,847,772
URLs with Triples           61,360,686
Average Triples per URL     29.95
Domains with Triples        30,417,192
Average Triples per Domain  60.42
Typed Entities              865,633,059
Property Values             1,965,706,118
URL Values                  207,251,360
Remote URL Values           139,299,801
Literal Values              1,191,144,262
Other Values                567,310,496

Extractor html-mf-hlisting


Triples Extracted           20,703,189
URLs with Triples           197,027
Average Triples per URL     105.08
Domains with Triples        69,569
Average Triples per Domain  297.59
Typed Entities              8,252,632
Property Values             20,096,893
URL Values                  6,559,579
Remote URL Values           2,058,779
Literal Values              8,373,907
Other Values                5,163,407

Extractor html-mf-hresume


Triples Extracted           432,363
URLs with Triples           20,762
Average Triples per URL     20.82
Domains with Triples        9,890
Average Triples per Domain  43.72
Typed Entities              92,346
Property Values             92,815
URL Values                  806
Remote URL Values           801
Literal Values              68,002
Other Values                24,007

Extractor html-mf-hreview


Triples Extracted           50,475,411
URLs with Triples           1,971,870
Average Triples per URL     25.6
Domains with Triples        615,681
Average Triples per Domain  81.98
Typed Entities              7,809,088
Property Values             26,171,383
URL Values                  29,611
Remote URL Values           7,655
Literal Values              18,337,604
Other Values                7,804,168

Extractor html-mf-species


Triples Extracted           224,847
URLs with Triples           14,033
Average Triples per URL     16.02
Domains with Triples        4,109
Average Triples per Domain  54.72
Typed Entities              139,631
Property Values             225,104
URL Values                  0
Remote URL Values           0
Literal Values              127,524
Other Values                97,580

Extractor html-mf-hrecipe


Triples Extracted           5,513,030
URLs with Triples           422,289
Average Triples per URL     13.06
Domains with Triples        127,381
Average Triples per Domain  43.28
Typed Entities              5,516,036
Property Values             5,516,036
URL Values                  0
Remote URL Values           0
Literal Values              0
Other Values                5,516,036

Extractor html-mf-xfn


Triples Extracted           441,698,363
URLs with Triples           26,004,925
Average Triples per URL     16.99
Domains with Triples        11,709,819
Average Triples per Domain  37.72
Typed Entities              163,271,544
Property Values             162,387,719
URL Values                  81,173,478
Remote URL Values           49,901,614
Literal Values              16,440
Other Values                81,197,801