Web Data Commons - RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets - October 2021

This document provides statistics about the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats data sets, which were extracted from the October 2021 release of the Common Crawl.

In summary, we found structured data within 1.5 billion HTML pages out of the 3.2 billion pages contained in the crawl (47.5%). These pages originate from 14.6 million different pay-level-domains out of the 35.4 million pay-level-domains covered by the crawl (41.2%). Altogether, the extracted data sets consist of 82 billion RDF quads.

Instructions on how to download the RDFa, Microdata, Embedded JSON-LD, and Microformats data sets are given on the page "How to get the data".

In addition, we have extracted schema.org class-specific data sets from the Microdata and JSON-LD corpora.

Please note that, in the following, the term "Domains" refers to pay-level-domains. Subdomains are not counted as separate domains.
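
To make the notion of a pay-level-domain concrete, here is a minimal Python sketch that collapses host names to the domain one label below a public suffix. The suffix set is a deliberately tiny, hypothetical stand-in for a full public-suffix list, and the function name pay_level_domain is an illustrative choice, not part of the Web Data Commons tooling.

```python
# Hypothetical, deliberately tiny suffix set; a real pipeline would
# use a complete public-suffix list instead.
SUFFIXES = {"com", "org", "co.uk", "ac.uk", "com.br"}


def pay_level_domain(host: str, suffixes=SUFFIXES) -> str:
    """Return the pay-level domain of a host name (illustrative only)."""
    labels = host.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                # The host is itself a suffix; nothing to strip.
                return candidate
            # Keep exactly one label in front of the matched suffix.
            return ".".join(labels[i - 1:])
    # Unknown suffix: fall back to the last two labels.
    return ".".join(labels[-2:])


for host in ["foo.blogspot.com", "www.ox.ac.uk", "en.wikipedia.org"]:
    print(host, "->", pay_level_domain(host))
```

Under this assumption, every subdomain of blogspot.com is counted under the single domain blogspot.com, which matches how the domain lists in this document are reported.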

Overall


Crawl Date October 2021
Total Data 85.11 Terabyte (compressed)
Parsed HTML URLs 3,195,003,256
URLs with Triples 1,516,194,663
Domains in Crawl 35,377,372
Domains with Triples 14,564,790
Typed Entities 18,483,343,653
Triples 82,142,918,869
Size of Extracted Data 1.6 Terabyte (compressed)
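
The coverage shares quoted in the summary (roughly 47% of URLs and 41% of domains) can be recomputed from the counts above. The following Python sketch shows the arithmetic; the variable and function names are illustrative.

```python
# Counts copied from the Overall table.
parsed_urls = 3_195_003_256
urls_with_triples = 1_516_194_663
domains_in_crawl = 35_377_372
domains_with_triples = 14_564_790


def share(part: int, whole: int) -> float:
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


url_share = share(urls_with_triples, parsed_urls)
domain_share = share(domains_with_triples, domains_in_crawl)

print(f"URLs with triples:    {url_share:.2f}%")
print(f"Domains with triples: {domain_share:.2f}%")
```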

Results per Format


Format Domains URLs Typed Entities Triples
html-microdata 7,827,114 837,527,236 6,294,453,008 30,500,208,616
html-embedded-jsonld 8,342,031 793,347,572 7,952,535,579 37,872,880,504
html-mf-hcard 4,156,046 330,580,166 3,802,922,422 12,265,118,615
html-rdfa 720,156 111,741,339 321,223,153 939,395,103
html-mf-xfn 387,671 24,820,616 55,053,767 343,472,171
html-mf-adr 159,756 10,313,987 20,856,614 69,283,127
html-mf-geo 60,467 2,926,136 5,203,891 14,570,965
html-mf-hcalendar 28,363 2,176,078 14,144,815 62,146,461
html-mf-hreview 22,779 1,644,215 5,014,056 34,052,728
html-mf-hlisting 9,693 241,573 9,896,263 32,867,553
html-mf-hrecipe 3,951 303,445 1,873,395 7,415,652
html-mf2-h-adr 12,922 218,289 299,564 1,109,503
html-mf-hresume 121 2,392 5,681 14,492
html-mf-species 538 62,012 161,009 383,379
overall 14,564,790 1,516,194,663 18,483,643,217 82,142,918,869
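
As a consistency check, the per-format triple counts in the table above should add up to the overall triple count. The Python sketch below performs that sum. The Domains and URLs columns are not expected to add up in the same way, presumably because a single page or domain can use several formats at once.

```python
# Triples per format, copied from the "Results per Format" table.
triples_per_format = {
    "html-microdata": 30_500_208_616,
    "html-embedded-jsonld": 37_872_880_504,
    "html-mf-hcard": 12_265_118_615,
    "html-rdfa": 939_395_103,
    "html-mf-xfn": 343_472_171,
    "html-mf-adr": 69_283_127,
    "html-mf-geo": 14_570_965,
    "html-mf-hcalendar": 62_146_461,
    "html-mf-hreview": 34_052_728,
    "html-mf-hlisting": 32_867_553,
    "html-mf-hrecipe": 7_415_652,
    "html-mf2-h-adr": 1_109_503,
    "html-mf-hresume": 14_492,
    "html-mf-species": 383_379,
}

# Value from the "overall" row of the same table.
overall_triples = 82_142_918_869

total = sum(triples_per_format.values())
print("sum of formats:", f"{total:,}")
print("matches overall row:", total == overall_triples)
```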



Top Domains by Extracted Triples


  1. blogspot.com (861,352,418 triples)
  2. wordpress.com (490,439,578 triples)
  3. livejournal.com (201,853,562 triples)
  4. wikipedia.org (100,024,107 triples)
  5. onlinehome.us (84,907,774 triples)
  6. google.com (79,211,388 triples)
  7. minnesotamonthly.com (75,978,381 triples)
  8. alibaba.com (68,702,672 triples)
  9. drroyspencer.com (67,547,705 triples)
  10. smittenkitchen.com (63,955,228 triples)
  11. elpais.com (61,701,299 triples)
  12. rbweb.us (57,494,226 triples)
  13. kayak.com (56,800,084 triples)
  14. yahoo.com (53,766,775 triples)
  15. southleedslife.com (52,430,540 triples)
  16. marqspusta.com (48,636,891 triples)
  17. smugmug.com (47,688,953 triples)
  18. sputniknews.com (46,899,246 triples)
  19. dannybuyshouses.com (45,006,343 triples)
  20. mojepozivnice.com (44,439,637 triples)

Top Domains by URLs with Triples


  1. blogspot.com (16,321,154 urls)
  2. wordpress.com (11,464,158 urls)
  3. livejournal.com (7,379,255 urls)
  4. wikipedia.org (4,619,261 urls)
  5. photoshelter.com (1,900,496 urls)
  6. airbnb.com (1,812,725 urls)
  7. hatenablog.com (1,427,514 urls)
  8. typepad.com (952,700 urls)
  9. tistory.com (927,059 urls)
  10. yahoo.com (909,346 urls)
  11. stackexchange.com (878,180 urls)
  12. altervista.org (751,593 urls)
  13. ning.com (674,756 urls)
  14. google.com (617,974 urls)
  15. uol.com.br (601,577 urls)
  16. player.fm (591,320 urls)
  17. ox.ac.uk (583,626 urls)
  18. europa.eu (576,055 urls)
  19. cnn.com (572,071 urls)
  20. threadless.com (553,939 urls)

Extractor html-microdata


Triples Extracted 30,500,208,616
URLs with Triples 837,527,236
Average Triples per URL 36.42
Domains with Triples 7,827,114
Average Triples per Domain 3,896.74
Typed Entities 6,294,453,008
Detailed Statistics as Excel-File html-microdata.xlsx
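
The two averages reported for each extractor follow directly from its counts: triples divided by URLs with triples, and triples divided by domains with triples. A short Python sketch using the html-microdata figures (variable names are illustrative):

```python
# Counts for the html-microdata extractor, copied from this section.
triples = 30_500_208_616
urls_with_triples = 837_527_236
domains_with_triples = 7_827_114

# Averages rounded to two decimals, as in the statistics tables.
avg_per_url = round(triples / urls_with_triples, 2)
avg_per_domain = round(triples / domains_with_triples, 2)

print("Average triples per URL:   ", avg_per_url)
print("Average triples per domain:", avg_per_domain)
```

The same calculation applies to the other extractor sections in this document.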

Extractor html-embedded-jsonld


Triples Extracted 37,872,880,504
URLs with Triples 793,347,572
Average Triples per URL 47.74
Domains with Triples 8,342,031
Average Triples per Domain 4,540.01
Typed Entities 7,952,535,579
Detailed Statistics as Excel-File html-embedded-jsonld.xlsx

Extractor html-mf-hcard


Triples Extracted 12,265,118,615
URLs with Triples 330,580,166
Average Triples per URL 37.10
Domains with Triples 4,156,046
Average Triples per Domain 2,951.15
Typed Entities 3,802,922,422

Extractor html-rdfa


Triples Extracted 939,395,103
URLs with Triples 111,741,339
Average Triples per URL 8.41
Domains with Triples 720,156
Average Triples per Domain 1,304.43
Typed Entities 321,223,153
Detailed Statistics as Excel-File html-rdfa.xlsx

Extractor html-mf-xfn


Triples Extracted 343,472,171
URLs with Triples 24,820,616
Average Triples per URL 13.84
Domains with Triples 387,671
Average Triples per Domain 885.99
Typed Entities 55,053,767

Extractor html-mf-adr


Triples Extracted 69,283,127
URLs with Triples 10,313,987
Average Triples per URL 6.72
Domains with Triples 159,756
Average Triples per Domain 433.68
Typed Entities 20,856,614

Extractor html-mf-geo


Triples Extracted 14,570,965
URLs with Triples 2,926,136
Average Triples per URL 4.98
Domains with Triples 60,467
Average Triples per Domain 240.97
Typed Entities 5,203,891

Extractor html-mf-hcalendar


Triples Extracted 62,146,461
URLs with Triples 2,176,078
Average Triples per URL 28.56
Domains with Triples 28,363
Average Triples per Domain 2,191.11
Typed Entities 14,144,815

Extractor html-mf-hreview


Triples Extracted 34,052,728
URLs with Triples 1,644,215
Average Triples per URL 20.71
Domains with Triples 22,779
Average Triples per Domain 1,494.92
Typed Entities 5,014,056

Extractor html-mf-hlisting


Triples Extracted 32,867,553
URLs with Triples 241,573
Average Triples per URL 136.06
Domains with Triples 9,693
Average Triples per Domain 3,390.85
Typed Entities 9,896,263

Extractor html-mf-hrecipe


Triples Extracted 7,415,652
URLs with Triples 303,445
Average Triples per URL 24.44
Domains with Triples 3,951
Average Triples per Domain 1,876.91
Typed Entities 1,873,395

Extractor html-mf-hresume


Triples Extracted 14,492
URLs with Triples 2,392
Average Triples per URL 6.06
Domains with Triples 121
Average Triples per Domain 119.77
Typed Entities 5,681

Extractor html-mf-species


Triples Extracted 383,379
URLs with Triples 62,012
Average Triples per URL 6.18
Domains with Triples 538
Average Triples per Domain 712.60
Typed Entities 161,009