Class-Specific Subsets of the Schema.org Data Contained in the October 2023 Corpus

This page provides access to, and statistics about, class-specific subsets of the Schema.org data contained in the October 2023 version of the Web Data Commons Microdata and JSON-LD corpus. The datasets are part of the Web Data Commons Schema.org Data Set Series.

Introduction

As many users are only interested in specific types of Schema.org data (like product data, event data, job postings, or data describing local businesses), we have created class-specific subsets out of the complete and merged Microdata and JSON-LD corpora for a selection of Schema.org classes. Each subset contains all instances of a specific class in either format, as well as all other data found on the webpages containing these instances. For example, a page containing data about a product might also contain reviews and offers for this product; a page containing data about an event might also contain data about the location of the event and the persons involved in the event. The data is represented in N-Quads format, meaning that the fourth element of each quad contains the URL of the webpage from which the data was extracted. To facilitate download of and access to the class-specific data, we provide the Schema.org subsets in chunks. Each chunk contains the quads of specific pay-level domains (PLDs), i.e. all quads of one PLD, e.g. yummly.com, are stored in the same chunk file. Additionally, we provide lookup files containing the mappings between PLDs and their corresponding chunks, as well as CSV files with PLD-specific statistics.
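To illustrate how the fourth element of each quad ties a statement back to its source page, the following is a minimal Python sketch that splits N-Quads lines and groups them by page URL. It is not the Web Data Commons tooling: the names `Quad`, `parse_quad` and `group_by_page` are hypothetical, the sample lines are made up, and the simple whitespace-based split assumes well-formed lines whose last term is the page URL in angle brackets.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    page_url: str  # fourth element: page the statement was extracted from


def parse_quad(line: str) -> Optional[Quad]:
    """Split one N-Quads line into its four elements, or return None."""
    line = line.strip()
    if not line or line.startswith("#") or not line.endswith("."):
        return None
    body = line[:-1].rstrip()  # drop the terminating "."
    try:
        # The page URL is the last token. The object may contain spaces,
        # so peel the URL off the right and two terms off the left.
        rest, graph = body.rsplit(None, 1)
        subject, predicate, obj = rest.split(None, 2)
    except ValueError:
        return None
    if not (graph.startswith("<") and graph.endswith(">")):
        return None
    return Quad(subject, predicate, obj, graph[1:-1])


def group_by_page(lines: Iterable[str]) -> Dict[str, List[Quad]]:
    """Collect all parseable quads under the URL of their source page."""
    pages: Dict[str, List[Quad]] = defaultdict(list)
    for line in lines:
        quad = parse_quad(line)
        if quad is not None:
            pages[quad.page_url].append(quad)
    return dict(pages)


# Made-up sample lines, for illustration only.
sample = [
    '_:n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://schema.org/Recipe> <https://www.yummly.com/recipe/1> .',
    '_:n1 <http://schema.org/name> "Apple pie, \\"classic\\" style" '
    '<https://www.yummly.com/recipe/1> .',
    '_:n2 <http://schema.org/name> "Pancakes" '
    '<https://www.yummly.com/recipe/2> .',
]
pages = group_by_page(sample)
```

For real chunk files, a full N-Quads parser is likely the safer choice, since this sketch does not validate IRIs or unescape literals.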

Please note that:

You are welcome to use the datasets and to tell us about your findings. If you find our datasets useful for your research, please cite the poster: The Web Data Commons Schema.org Data Set Series by Alexander Brinkmann, Anna Primpeli and Christian Bizer, in Companion Proceedings of the ACM Web Conference 2023 (WWW ’23 Companion), Austin, Texas, USA, April 2023.

Class-Specific Subsets of the Schema.org Data

Schema.org Subset | General Stats | Related Classes | Size (# Files) | Download (Sample) | PLD to File Look-up | PLD-Specific Stats
AdministrativeArea Quads: 93,509,047
URLs: 475,278
Hosts: 3,567
http://schema.org/City (1,856,683)
http://schema.org/ListItem (1,485,028)
http://schema.org/ImageObject (1,298,188)
http://schema.org/AdministrativeArea (1,107,448)
http://schema.org/PostalAddress (916,224)
1.19 GB
(8)
AdministrativeArea (sample) lookup_file
pld_stats_file
Airport Quads: 34,301,832
URLs: 140,425
Hosts: 711
http://schema.org/Airport (2,523,676)
http://schema.org/GeoCoordinates (1,271,769)
http://schema.org/Flight (884,658)
http://schema.org/Airline (826,467)
http://schema.org/Offer (620,257)
319.61 MB
(3)
Airport (sample) lookup_file
pld_stats_file
Answer Quads: 1,980,845,491
URLs: 18,003,812
Hosts: 381,423
http://schema.org/Answer (79,065,507)
http://schema.org/Question (69,348,253)
http://schema.org/ListItem (40,860,417)
http://schema.org/ImageObject (25,098,071)
https://schema.org/Answer (23,475,828)
38.03 GB
(158)
Answer (sample) lookup_file
pld_stats_file
Book Quads: 349,074,007
URLs: 5,155,249
Hosts: 26,153
http://schema.org/Book (14,176,649)
http://schema.org/Country (10,281,356)
http://schema.org/Person (9,223,300)
http://schema.org/Offer (7,223,358)
http://schema.org/ListItem (4,037,512)
5.9 GB
(28)
Book (sample) lookup_file
pld_stats_file
City Quads: 280,236,099
URLs: 1,514,800
Hosts: 14,222
http://schema.org/City (7,187,495)
http://schema.org/PostalAddress (3,950,680)
http://schema.org/OpeningHoursSpecification (3,778,969)
http://schema.org/ListItem (3,575,442)
http://schema.org/Person (3,474,824)
3.04 GB
(23)
City (sample) lookup_file
pld_stats_file
CollegeOrUniversity Quads: 139,230,188
URLs: 1,152,370
Hosts: 3,287
http://schema.org/CollegeOrUniversity (4,789,189)
http://schema.org/ListItem (3,521,763)
http://schema.org/ImageObject (3,365,727)
http://schema.org/Person (2,952,316)
http://schema.org/PostalAddress (2,259,152)
1.63 GB
(12)
CollegeOrUniversity (sample) lookup_file
pld_stats_file
Continent Quads: 2,219,972
URLs: 32,599
Hosts: 69
http://schema.org/City (234,284)
http://schema.org/AdministrativeArea (139,110)
http://schema.org/Country (37,910)
http://schema.org/Continent (36,968)
http://schema.org/GeoCoordinates (29,135)
23.19 MB
(1)
Continent (sample) lookup_file
pld_stats_file
Country Quads: 678,417,918
URLs: 5,559,842
Hosts: 27,500
http://schema.org/Country (39,943,326)
http://schema.org/ListItem (16,171,738)
http://schema.org/Organization (9,427,215)
http://schema.org/Offer (9,129,260)
http://schema.org/PostalAddress (8,574,888)
9.29 GB
(54)
Country (sample) lookup_file
pld_stats_file
CreativeWork Quads: 3,190,496,257
URLs: 68,554,518
Hosts: 1,182,156
https://schema.org/CreativeWork (123,695,386)
https://schema.org/SiteNavigationElement (82,235,811)
https://schema.org/Person (71,552,642)
https://schema.org/WPHeader (48,950,793)
https://schema.org/WPFooter (47,191,030)
128.59 GB
(255)
CreativeWork (sample) lookup_file
pld_stats_file
Dataset Quads: 88,498,570
URLs: 839,031
Hosts: 1,958
http://schema.org/PropertyValue (6,202,068)
http://schema.org/DataDownload (2,724,251)
http://schema.org/Dataset (1,352,684)
http://schema.org/Organization (1,319,413)
http://schema.org/Person (860,605)
1.08 GB
(7)
Dataset (sample) lookup_file
pld_stats_file
EducationalOrganization Quads: 94,950,615
URLs: 1,278,468
Hosts: 10,287
http://schema.org/EducationalOrganization (2,179,457)
http://schema.org/ListItem (1,844,469)
http://schema.org/ImageObject (1,358,517)
http://schema.org/PostalAddress (1,352,028)
http://schema.org/Person (890,191)
1.47 GB
(8)
EducationalOrganization (sample) lookup_file
pld_stats_file
Event Quads: 2,532,122,969
URLs: 20,974,763
Hosts: 402,077
http://schema.org/Event (88,374,986)
http://schema.org/Place (64,567,498)
http://schema.org/PostalAddress (51,701,441)
http://schema.org/Person (30,113,455)
http://schema.org/ListItem (28,414,290)
32.25 GB
(202)
Event (sample) lookup_file
pld_stats_file
FAQPage Quads: 1,749,714,956
URLs: 14,411,496
Hosts: 354,617
http://schema.org/Question (66,030,524)
http://schema.org/Answer (65,823,505)
http://schema.org/ListItem (38,336,764)
http://schema.org/ImageObject (26,199,305)
https://schema.org/Answer (16,941,782)
31.72 GB
(140)
FAQPage (sample) lookup_file
pld_stats_file
GeoCoordinates Quads: 4,186,967,156
URLs: 33,179,627
Hosts: 509,401
http://schema.org/ListItem (114,912,609)
http://schema.org/PostalAddress (69,328,670)
http://schema.org/GeoCoordinates (63,593,552)
http://schema.org/OpeningHoursSpecification (41,284,705)
http://schema.org/Offer (36,532,761)
52.94 GB
(334)
GeoCoordinates (sample) lookup_file
pld_stats_file
GovernmentOrganization Quads: 31,048,065
URLs: 497,116
Hosts: 1,687
http://schema.org/ListItem (1,436,216)
http://schema.org/GovernmentOrganization (604,170)
http://schema.org/ImageObject (465,098)
https://schema.org/ImageObject (317,409)
http://schema.org/PostalAddress (311,063)
468.15 MB
(3)
GovernmentOrganization (sample) lookup_file
pld_stats_file
Hospital Quads: 30,710,791
URLs: 288,052
Hosts: 1,935
http://schema.org/PostalAddress (828,913)
http://schema.org/GeoCoordinates (691,957)
http://schema.org/Hospital (557,245)
http://schema.org/GeoCircle (550,398)
http://schema.org/ListItem (472,076)
388.31 MB
(3)
Hospital (sample) lookup_file
pld_stats_file
Hotel Quads: 388,224,871
URLs: 2,726,279
Hosts: 25,507
http://schema.org/ImageObject (15,257,350)
http://schema.org/Hotel (8,078,597)
http://schema.org/PostalAddress (7,075,683)
http://schema.org/Rating (5,379,135)
http://schema.org/LocationFeatureSpecification (5,272,388)
5.23 GB
(31)
Hotel (sample) lookup_file
pld_stats_file
JobPosting Quads: 189,836,812
URLs: 4,056,084
Hosts: 61,024
http://schema.org/Place (4,968,710)
http://schema.org/Organization (4,890,857)
http://schema.org/PostalAddress (4,883,844)
http://schema.org/JobPosting (4,741,019)
http://schema.org/ListItem (2,985,374)
7.57 GB
(16)
JobPosting (sample) lookup_file
pld_stats_file
LakeBodyOfWater Quads: 176,871
URLs: 4,422
Hosts: 135
http://schema.org/LakeBodyOfWater (5,018)
http://schema.org/PostalAddress (3,796)
http://schema.org/GeoCoordinates (1,018)
http://schema.org/City (952)
http://schema.org/PropertyValue (745)
4.95 MB
(1)
LakeBodyOfWater (sample) lookup_file
pld_stats_file
LandmarksOrHistoricalBuildings Quads: 2,848,557
URLs: 34,917
Hosts: 405
http://schema.org/PropertyValue (74,748)
http://schema.org/ImageObject (74,725)
http://schema.org/LandmarksOrHistoricalBuildings (73,368)
http://schema.org/PostalAddress (53,626)
http://schema.org/CreativeWork (42,448)
79.02 MB
(1)
LandmarksOrHistoricalBuildings (sample) lookup_file
pld_stats_file
Language Quads: 720,590,542
URLs: 5,684,120
Hosts: 12,880
http://schema.org/Person (31,686,159)
http://schema.org/Comment (25,150,067)
http://schema.org/ListItem (12,223,370)
http://schema.org/Language (11,339,333)
http://schema.org/InteractionCounter (9,290,581)
12.84 GB
(57)
Language (sample) lookup_file
pld_stats_file
Library Quads: 8,270,159
URLs: 197,901
Hosts: 818
http://schema.org/Library (214,924)
http://schema.org/PostalAddress (95,456)
http://schema.org/Place (93,798)
http://schema.org/ListItem (86,384)
http://schema.org/OpeningHoursSpecification (78,516)
128.91 MB
(1)
Library (sample) lookup_file
pld_stats_file
LocalBusiness Quads: 2,979,247,943
URLs: 36,711,236
Hosts: 1,354,750
http://schema.org/ListItem (107,056,900)
http://schema.org/LocalBusiness (55,520,207)
http://schema.org/PostalAddress (51,173,588)
http://schema.org/ImageObject (24,221,665)
http://schema.org/OpeningHoursSpecification (21,480,235)
38.61 GB
(238)
LocalBusiness (sample) lookup_file
pld_stats_file
Mountain Quads: 244,167
URLs: 12,064
Hosts: 63
http://schema.org/Mountain (16,723)
http://schema.org/GeoCoordinates (16,704)
http://schema.org/propertyValue (7,540)
http://schema.org/Place (2,887)
https://schema.org/ListItem (1,436)
5.74 MB
(1)
Mountain (sample) lookup_file
pld_stats_file
Movie Quads: 162,588,730
URLs: 2,003,583
Hosts: 7,641
http://schema.org/Person (12,242,939)
http://schema.org/Movie (4,451,037)
http://schema.org/ListItem (2,464,915)
http://schema.org/AggregateRating (1,348,493)
http://schema.org/ImageObject (962,320)
2.38 GB
(13)
Movie (sample) lookup_file
pld_stats_file
Museum Quads: 5,539,798
URLs: 92,048
Hosts: 675
http://schema.org/Museum (110,728)
http://schema.org/Event (93,667)
http://schema.org/ListItem (93,259)
http://schema.org/PostalAddress (85,710)
http://schema.org/OpeningHoursSpecification (67,635)
85.08 MB
(1)
Museum (sample) lookup_file
pld_stats_file
MusicAlbum Quads: 112,398,520
URLs: 819,666
Hosts: 18,779
http://schema.org/Country (8,123,813)
http://schema.org/MusicRecording (4,520,767)
http://schema.org/MusicAlbum (2,815,229)
http://schema.org/Offer (2,622,574)
http://schema.org/EntryPoint (1,338,109)
1.03 GB
(9)
MusicAlbum (sample) lookup_file
pld_stats_file
MusicRecording Quads: 173,478,132
URLs: 1,459,187
Hosts: 27,876
http://schema.org/Country (14,033,776)
http://schema.org/MusicRecording (10,037,791)
http://schema.org/Offer (2,842,197)
http://schema.org/MusicGroup (1,873,068)
http://schema.org/MusicAlbum (1,758,193)
1.63 GB
(14)
MusicRecording (sample) lookup_file
pld_stats_file
Organization Quads: 52,360,387,820
URLs: 824,557,426
Hosts: 6,764,349
http://schema.org/ListItem (1,475,935,021)
http://schema.org/ImageObject (1,183,253,284)
http://schema.org/Organization (1,062,852,659)
http://schema.org/BreadcrumbList (529,826,463)
http://schema.org/WebPage (514,302,028)
847.31 GB
(4168)
Organization (sample) lookup_file
pld_stats_file
Painting Quads: 12,180,582
URLs: 116,101
Hosts: 640
http://schema.org/Person (2,294,142)
http://schema.org/Painting (481,498)
http://schema.org/Offer (307,589)
http://schema.org/ListItem (219,232)
http://schema.org/Property (110,457)
123.99 MB
(1)
Painting (sample) lookup_file
pld_stats_file
Park Quads: 1,654,003
URLs: 13,506
Hosts: 324
http://schema.org/Organization (55,799)
http://schema.org/PostalAddress (30,844)
http://schema.org/OpeningHoursSpecification (17,788)
http://schema.org/ListItem (16,248)
http://schema.org/Park (14,366)
21.53 MB
(1)
Park (sample) lookup_file
pld_stats_file
Person Quads: 34,171,228,713
URLs: 465,715,308
Hosts: 5,117,767
http://schema.org/ImageObject (864,419,612)
http://schema.org/Person (778,026,765)
http://schema.org/ListItem (764,540,830)
http://schema.org/WebPage (420,788,422)
http://schema.org/Organization (383,671,490)
645.02 GB
(2723)
Person (sample) lookup_file
pld_stats_file
Place Quads: 4,649,072,412
URLs: 37,241,661
Hosts: 535,530
http://schema.org/ListItem (116,975,200)
http://schema.org/Place (109,578,157)
http://schema.org/PostalAddress (91,318,816)
http://schema.org/Event (71,585,370)
http://schema.org/Person (47,511,716)
63.13 GB
(371)
Place (sample) lookup_file
pld_stats_file
Product Quads: 23,849,625,142
URLs: 347,477,842
Hosts: 2,897,121
http://schema.org/Offer (826,534,359)
http://schema.org/ListItem (636,352,840)
http://schema.org/Product (595,117,400)
http://schema.org/Organization (322,255,895)
http://schema.org/ImageObject (191,985,760)
349.61 GB
(1899)
Product (sample) lookup_file
pld_stats_file
QAPage Quads: 186,659,370
URLs: 3,303,871
Hosts: 10,746
http://schema.org/Person (12,196,036)
http://schema.org/Answer (7,259,119)
https://schema.org/Answer (2,582,837)
http://schema.org/Question (2,441,450)
http://schema.org/QAPage (2,384,815)
4.69 GB
(15)
QAPage (sample) lookup_file
pld_stats_file
Question Quads: 2,013,523,983
URLs: 18,820,629
Hosts: 383,439
http://schema.org/Answer (78,080,422)
http://schema.org/Question (71,036,212)
http://schema.org/ListItem (41,071,086)
http://schema.org/ImageObject (26,505,287)
https://schema.org/Answer (23,080,355)
38.59 GB
(161)
Question (sample) lookup_file
pld_stats_file
RadioStation Quads: 17,376,036
URLs: 321,302
Hosts: 1,138
http://schema.org/ListItem (601,939)
http://schema.org/RadioStation (374,079)
http://schema.org/ImageObject (291,891)
http://schema.org/NewsArticle (270,600)
http://schema.org/Organization (169,207)
308.8 MB
(2)
RadioStation (sample) lookup_file
pld_stats_file
Recipe Quads: 502,684,939
URLs: 4,489,240
Hosts: 42,727
http://schema.org/HowToStep (16,492,994)
http://schema.org/ListItem (8,525,059)
http://schema.org/ImageObject (8,000,669)
http://schema.org/Person (6,387,072)
http://schema.org/Recipe (5,683,640)
8.07 GB
(40)
Recipe (sample) lookup_file
pld_stats_file
Restaurant Quads: 223,449,936
URLs: 1,717,316
Hosts: 64,072
http://schema.org/Offer (6,574,202)
http://schema.org/MenuItem (5,599,769)
http://schema.org/Restaurant (4,862,139)
http://schema.org/ListItem (4,136,057)
http://schema.org/PostalAddress (3,805,739)
2.5 GB
(18)
Restaurant (sample) lookup_file
pld_stats_file
RiverBodyOfWater Quads: 94,489
URLs: 2,224
Hosts: 16
http://schema.org/ImageObject (3,379)
http://schema.org/ListItem (2,875)
http://schema.org/RiverBodyOfWater (2,316)
http://schema.org/Organization (1,991)
http://schema.org/PropertyValue (1,913)
3.45 MB
(1)
RiverBodyOfWater (sample) lookup_file
pld_stats_file
School Quads: 14,495,397
URLs: 292,758
Hosts: 2,030
http://schema.org/ListItem (400,913)
http://schema.org/School (381,043)
http://schema.org/PostalAddress (234,219)
http://schema.org/ImageObject (165,397)
http://schema.org/GeoCoordinates (124,684)
221.82 MB
(2)
School (sample) lookup_file
pld_stats_file
ShoppingCenter Quads: 12,849,735
URLs: 152,010
Hosts: 1,285
http://schema.org/PostalAddress (241,647)
http://schema.org/Organization (238,002)
http://schema.org/Offer (231,301)
http://schema.org/ShoppingCenter (214,191)
http://schema.org/ListItem (167,885)
176.61 MB
(2)
ShoppingCenter (sample) lookup_file
pld_stats_file
SkiResort Quads: 1,494,935
URLs: 29,605
Hosts: 266
http://schema.org/ListItem (48,318)
http://schema.org/SkiResort (40,543)
http://schema.org/Person (34,224)
http://schema.org/Review (33,478)
http://schema.org/PostalAddress (25,515)
30.61 MB
(1)
SkiResort (sample) lookup_file
pld_stats_file
SportsEvent Quads: 165,906,925
URLs: 1,010,368
Hosts: 7,207
http://schema.org/SportsEvent (7,345,109)
http://schema.org/SportsTeam (6,804,715)
http://schema.org/Place (6,640,892)
http://schema.org/PostalAddress (5,556,554)
http://schema.org/Organization (2,138,486)
1.53 GB
(14)
SportsEvent (sample) lookup_file
pld_stats_file
SportsTeam Quads: 126,009,626
URLs: 781,371
Hosts: 3,853
http://schema.org/SportsTeam (7,898,303)
http://schema.org/SportsEvent (3,144,937)
http://schema.org/Place (2,738,920)
http://schema.org/Organization (2,010,041)
http://schema.org/PostalAddress (1,864,192)
1.05 GB
(11)
SportsTeam (sample) lookup_file
pld_stats_file
StadiumOrArena Quads: 28,984,463
URLs: 90,159
Hosts: 258
http://schema.org/Organization (1,223,801)
http://schema.org/ImageObject (954,725)
http://schema.org/SportsTeam (923,758)
http://schema.org/SportsEvent (453,481)
http://schema.org/BlogPosting (446,060)
234.01 MB
(3)
StadiumOrArena (sample) lookup_file
pld_stats_file
TelevisionStation Quads: 1,792,637
URLs: 19,196
Hosts: 94
http://schema.org/ListItem (50,738)
http://schema.org/CreativeWorkSeries (34,057)
http://schema.org/TelevisionStation (34,023)
http://schema.org/SiteNavigationElement (32,902)
http://schema.org/ImageObject (32,397)
29.12 MB
(1)
TelevisionStation (sample) lookup_file
pld_stats_file
TVEpisode Quads: 62,300,689
URLs: 305,104
Hosts: 1,048
http://schema.org/Country (3,827,991)
https://schema.org/TVEpisode (3,653,982)
http://schema.org/TVEpisode (1,556,216)
http://schema.org/ListItem (602,831)
http://schema.org/Person (409,220)
633.05 MB
(5)
TVEpisode (sample) lookup_file
pld_stats_file
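As an illustration of how a lookup file could be used to find the chunk that holds a given PLD, here is a small Python sketch. The layout of the lookup files is not described on this page, so the CSV format and the column names `pld` and `file_lookup` are assumptions, as are the function name `load_pld_lookup` and the made-up file names in the example; adjust them to the header of the file you download.

```python
import csv
import io
from typing import Dict, TextIO


def load_pld_lookup(handle: TextIO, pld_column: str = "pld",
                    file_column: str = "file_lookup") -> Dict[str, str]:
    """Read a PLD-to-chunk lookup table into a dict.

    The column names are assumptions, not a documented format.
    """
    reader = csv.DictReader(handle)
    return {row[pld_column].strip().lower(): row[file_column].strip()
            for row in reader}


# Made-up lookup content, for illustration only.
example = io.StringIO(
    "pld,file_lookup\n"
    "yummly.com,part_12.gz\n"
    "example.org,part_3.gz\n"
)
lookup = load_pld_lookup(example)
chunk = lookup.get("yummly.com")  # name of the chunk holding this PLD
```

With a real lookup file, the `io.StringIO` object would be replaced by an open file handle.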


If you are interested in a particular class or set of classes that is not listed above, please contact the Web Data Commons team via our mailing list or our Google Group.

Conversion to Other Formats

We provide the extracted data for download using a variation of the N-Quads format. For users who prefer other formats, we provide code for converting the download files into CSV and JSON formats, which are supported by a wide range of spreadsheet applications, relational databases, and data mining frameworks such as the Python data analysis library pandas. Please find further details on how to convert the download files to other formats on the main page.
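To give an idea of what such a conversion involves, the following Python sketch pivots quads into one CSV row per entity. It is not the conversion code referred to above: the function name `quads_to_csv`, the column names `page_url` and `node_id`, the " | " separator for repeated properties, and the sample data are all assumptions made for illustration. The input is assumed to be already split into (subject, predicate, object, page URL) tuples.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (subject, predicate, object, page_url)
QuadTuple = Tuple[str, str, str, str]


def quads_to_csv(quads: Iterable[QuadTuple]) -> str:
    """Pivot quads into one CSV row per (page_url, subject) pair."""
    entities: Dict[Tuple[str, str], Dict[str, List[str]]] = defaultdict(
        lambda: defaultdict(list))
    predicates: Dict[str, None] = {}  # keeps order of first appearance
    for subject, predicate, obj, page_url in quads:
        predicates.setdefault(predicate)
        entities[(page_url, subject)][predicate].append(obj)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    columns = list(predicates)
    writer.writerow(["page_url", "node_id"] + columns)
    for (page_url, subject), props in entities.items():
        # Repeated properties are joined so each entity stays on one row.
        cells = [" | ".join(props.get(col, [])) for col in columns]
        writer.writerow([page_url, subject] + cells)
    return out.getvalue()


# Made-up sample quads, for illustration only.
quads = [
    ("_:n1", "http://schema.org/name", "Apple pie",
     "https://example.org/r/1"),
    ("_:n1", "http://schema.org/recipeIngredient", "apples",
     "https://example.org/r/1"),
    ("_:n1", "http://schema.org/recipeIngredient", "flour",
     "https://example.org/r/1"),
    ("_:n2", "http://schema.org/name", "Pancakes",
     "https://example.org/r/2"),
]
csv_text = quads_to_csv(quads)
```

The resulting text can be written to a file and loaded with a CSV reader such as `pandas.read_csv`. Keying rows on the pair of page URL and node identifier matters because blank node identifiers are only meaningful within a single page.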

Get the Code

The Jupyter notebooks used to create the Schema.org subsets from the Microdata and JSON-LD corpus can be checked out from our Git repository.

The extraction of the October 2023 corpus was done with version 1.5 of the extractor. For more information about the framework and a detailed description of how to run your own extraction, visit the framework page.

Get Support

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.