Class-Specific Subsets of the Schema.org Data contained in the November 2015 Corpus

This page provides access to and statistics about class-specific subsets of the Schema.org data contained in the November 2015 version of the Web Data Commons Microdata corpus. The datasets are part of the Web Data Commons Schema.org Data Set Series.

Introduction

As many users are only interested in specific types of Schema.org data (like product data, event data, or address data), we have created class-specific subsets of the complete Microdata corpus for a selection of Schema.org classes. Each subset contains all instances of a specific class as well as all other data found on the webpages containing these instances. For example, a page containing data about a product might also contain reviews and offers for this product; a page containing data about an event might also contain data about the location of the event and the persons involved in it. The data is represented in N-Quads format, meaning that the fourth element of each quad contains the URL of the webpage from which the data was extracted.
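To illustrate the quad layout described above, the following Python sketch splits one N-Quads line into its four elements and returns the fourth one as the page URL. It is a simplified, regex-based reader written for illustration only; the function name parse_quad and the example line are assumptions, and lines the pattern does not cover are simply reported as None rather than handled according to the full N-Quads grammar.

```python
import re

# Simplified N-Quads pattern: subject, predicate, object, graph label, final dot.
# Assumption: every line carries a graph label that is an IRI (the page URL).
QUAD_RE = re.compile(
    r'^\s*(?P<s><[^>]*>|_:\S+)\s+'                  # subject: IRI or blank node
    r'(?P<p><[^>]*>)\s+'                            # predicate: IRI
    r'(?P<o><[^>]*>|_:\S+|'                         # object: IRI, blank node, or
    r'"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)\s+'  # (typed/tagged) literal
    r'(?P<g><[^>]*>)\s*\.\s*$'                      # graph label: IRI of the page
)


def parse_quad(line):
    """Return (subject, predicate, object, page_url), or None if unparseable."""
    match = QUAD_RE.match(line)
    if match is None:
        return None
    page_url = match.group("g")[1:-1]  # strip the surrounding angle brackets
    return match.group("s"), match.group("p"), match.group("o"), page_url


if __name__ == "__main__":
    # Hypothetical example line, not taken from the corpus.
    example = (
        '_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
        '<http://schema.org/Product> <http://example.com/shop/item.html> .'
    )
    print(parse_quad(example))
```

Because the page URL is carried on every quad, grouping lines by this fourth element recovers all data that was extracted from one webpage.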

Please note that

You are welcome to use the datasets and to tell us about your findings. If you find our datasets useful for your research, please cite the paper: The WebDataCommons Microdata, RDFa and Microformat Dataset Series by Robert Meusel, Petar Petrovski, and Christian Bizer in the Proceedings of the 13th International Semantic Web Conference: Replication, Benchmark, Data and Software Track (ISWC2014).

Class-Specific Subsets of the Schema.org Data

Class Name | Total Number of (Quads, URLs, Hosts) | Top Classes (Entity Count) | Total File Size | Quad File
http://schema.org/AdministrativeArea Quads: 4,849,338
URLs: 91,914
Hosts: 130
http://schema.org/City (445,468)
http://schema.org/AdministrativeArea (209,305)
http://schema.org/GeoCoordinates (91,632)
http://schema.org/Country (80,834)
http://schema.org/Continent (79,277)
Total File Size: 70.8 MB
Quad File: schema_AdministrativeArea.gz (sample)
http://schema.org/Airport Quads: 113,014,885
URLs: 1,371,521
Hosts: 70
http://schema.org/Airport (26,176,317)
http://schema.org/Thing (1,152,832)
http://schema.org/WebPage (384,308)
http://schema.org/PostalAddress (18,003)
http://schema.org/GeoCoordinates (9,483)
Total File Size: 1.8 GB
Quad File: schema_Airport.gz (sample)
http://schema.org/Book Quads: 160,179,686
URLs: 3,763,154
Hosts: 2,324
http://schema.org/Book (8,836,201)
http://schema.org/Person (7,786,582)
http://schema.org/Offer (3,499,192)
http://schema.org/ScholarlyArticle (2,126,138)
http://schema.org/Review (1,022,432)
Total File Size: 3.1 GB
Quad File: schema_Book.gz (sample)
http://schema.org/City Quads: 22,111,248
URLs: 287,383
Hosts: 294
http://schema.org/City (784,593)
http://schema.org/GeoCoordinates (633,325)
http://schema.org/PostalAddress (460,601)
http://schema.org/Person (356,327)
http://schema.org/Offer (328,818)
Total File Size: 392 MB
Quad File: schema_City.gz (sample)
http://schema.org/CollegeOrUniversity Quads: 19,073,398
URLs: 499,640
Hosts: 348
http://schema.org/CollegeOrUniversity (1,327,158)
http://schema.org/Person (904,617)
http://schema.org/CreativeWork (855,099)
http://schema.org/PostalAddress (337,986)
http://schema.org/AggregateRating (184,770)
Total File Size: 400 MB
Quad File: schema_CollegeOrUniversity.gz (sample)
http://schema.org/Continent Quads: 3,669,937
URLs: 81,720
Hosts: 9
http://schema.org/City (442,529)
http://schema.org/AdministrativeArea (138,841)
http://schema.org/GeoCoordinates (86,847)
http://schema.org/Continent (82,720)
http://schema.org/Country (81,813)
Total File Size: 47.6 MB
Quad File: schema_Continent.gz (sample)
http://schema.org/Country Quads: 120,125,213
URLs: 641,031
Hosts: 289
http://schema.org/MusicRecording (6,403,764)
http://schema.org/LodgingBusinessAmenity (1,928,390)
http://schema.org/Person (1,888,208)
http://schema.org/UserComments (1,864,924)
http://schema.org/Country (1,643,078)
Total File Size: 2 GB
Quad File: schema_Country.gz (sample)
http://schema.org/CreativeWork Quads: 295,800,594
URLs: 6,454,246
Hosts: 44,339
http://schema.org/CreativeWork (16,901,641)
http://schema.org/Person (10,249,776)
http://schema.org/Comment (5,465,208)
http://schema.org/Organization (3,582,214)
http://schema.org/WebPage (2,816,816)
Total File Size: 10.3 GB
Quad File: schema_CreativeWork.gz (sample)
http://schema.org/EducationalOrganization Quads: 5,209,884
URLs: 143,572
Hosts: 1,224
http://schema.org/EducationalOrganization (292,877)
http://schema.org/PostalAddress (220,587)
http://schema.org/MedicalScholarlyArticle (113,590)
http://schema.org/GeoCoordinates (89,911)
http://schema.org/EducationEvent (89,870)
Total File Size: 104.6 MB
Quad File: schema_EducationalOrganization.gz (sample)
http://schema.org/Event Quads: 240,250,191
URLs: 1,574,622
Hosts: 12,429
http://schema.org/Event (13,184,936)
http://schema.org/Place (9,504,931)
http://schema.org/PostalAddress (8,177,965)
http://schema.org/GeoCoordinates (3,735,707)
http://schema.org/AggregateOffer (3,332,782)
Total File Size: 4.2 GB
Quad File: schema_Event.gz (sample)
http://schema.org/GeoCoordinates Quads: 637,232,864
URLs: 5,235,522
Hosts: 17,365
http://schema.org/GeoCoordinates (25,900,663)
http://schema.org/PostalAddress (25,194,078)
http://schema.org/LocalBusiness (12,660,143)
http://schema.org/AggregateRating (10,030,610)
http://schema.org/Place (5,839,006)
Total File Size: 10.7 GB
Quad File: schema_GeoCoordinates.gz (sample)
http://schema.org/GovernmentOrganization Quads: 1,049,453
URLs: 36,199
Hosts: 161
http://schema.org/GovernmentOrganization (69,413)
http://schema.org/PostalAddress (39,190)
http://schema.org/Article (7,555)
http://schema.org/Event (5,206)
http://schema.org/NewsArticle (5,145)
Total File Size: 21.6 MB
Quad File: schema_GovernmentOrganization.gz (sample)
http://schema.org/Hospital Quads: 10,857,143
URLs: 406,687
Hosts: 223
http://schema.org/PostalAddress (625,422)
http://schema.org/Hospital (514,304)
http://schema.org/Physician (269,801)
http://schema.org/MedicalSpecialty (143,512)
http://schema.org/GeoCoordinates (126,877)
Total File Size: 203.1 MB
Quad File: schema_Hospital.gz (sample)
http://schema.org/Hotel Quads: 291,506,752
URLs: 4,040,460
Hosts: 5,362
http://schema.org/Hotel (23,297,263)
http://schema.org/LandmarksOrHistoricalBuildings (15,568,363)
http://schema.org/PostalAddress (3,413,459)
http://schema.org/Review (3,140,368)
http://schema.org/AggregateRating (2,999,410)
Total File Size: 5.7 GB
Quad File: schema_Hotel.gz (sample)
http://schema.org/JobPosting Quads: 271,062,391
URLs: 2,045,084
Hosts: 3,656
http://schema.org/JobPosting (25,507,180)
http://schema.org/Place (19,149,557)
http://schema.org/Organization (13,656,356)
http://schema.org/Postaladdress (5,803,830)
http://schema.org/PostalAddress (4,259,531)
Total File Size: 5.3 GB
Quad File: schema_JobPosting.gz (sample)
http://schema.org/LakeBodyOfWater Quads: 210,129
URLs: 1,371
Hosts: 15
http://schema.org/PostalAddress (10,379)
http://schema.org/GeoCoordinates (10,320)
http://schema.org/LakeBodyOfWater (3,135)
http://schema.org/City (1,835)
http://schema.org/Park (1,046)
Total File Size: 3.3 MB
Quad File: schema_LakeBodyOfWater.gz (sample)
http://schema.org/LandmarksOrHistoricalBuildings Quads: 112,059,617
URLs: 769,388
Hosts: 84
http://schema.org/LandmarksOrHistoricalBuildings (15,593,862)
http://schema.org/Hotel (11,889,717)
http://schema.org/Review (600,591)
http://schema.org/Offer (458,892)
http://schema.org/Organization (42,770)
Total File Size: 1.9 GB
Quad File: schema_LandmarksOrHistoricalBuildings.gz (sample)
http://schema.org/Language Quads: 536,574
URLs: 3,517
Hosts: 162
http://schema.org/SiteNavigationElement (17,739)
http://schema.org/Language (6,831)
http://schema.org/PostalAddress (5,425)
http://schema.org/Organization (4,079)
http://schema.org/WPFooter (3,713)
Total File Size: 13.5 MB
Quad File: schema_Language.gz (sample)
http://schema.org/Library Quads: 1,289,802
URLs: 33,804
Hosts: 45
http://schema.org/CreativeWork (57,293)
http://schema.org/Library (42,633)
http://schema.org/PostalAddress (39,664)
http://schema.org/GeoCoordinates (25,722)
http://schema.org/Place (24,707)
Total File Size: 20.2 MB
Quad File: schema_Library.gz (sample)
http://schema.org/LocalBusiness Quads: 569,754,144
URLs: 6,280,198
Hosts: 77,659
http://schema.org/LocalBusiness (31,690,304)
http://schema.org/PostalAddress (25,683,431)
http://schema.org/GeoCoordinates (12,859,248)
http://schema.org/AggregateRating (9,752,425)
http://schema.org/Product (6,397,651)
Total File Size: 9 GB
Quad File: schema_LocalBusiness.gz (sample)
http://schema.org/Mountain Quads: 301,954
URLs: 2,138
Hosts: 12
http://schema.org/GeoCoordinates (12,375)
http://schema.org/Mountain (12,127)
http://schema.org/PostalAddress (11,982)
http://schema.org/Review (2,611)
http://schema.org/City (1,569)
Total File Size: 4.7 MB
Quad File: schema_Mountain.gz (sample)
http://schema.org/Movie Quads: 109,148,410
URLs: 1,412,757
Hosts: 3,395
http://schema.org/Person (9,069,362)
http://schema.org/Movie (5,647,480)
http://schema.org/AggregateRating (1,056,819)
http://schema.org/CreativeWork (648,533)
http://schema.org/ImageGallery (594,508)
Total File Size: 2.4 GB
Quad File: schema_Movie.gz (sample)
http://schema.org/Museum Quads: 2,544,434
URLs: 23,669
Hosts: 69
http://schema.org/Painting (390,837)
http://schema.org/Event (94,761)
http://schema.org/PostalAddress (33,595)
http://schema.org/Museum (29,096)
http://schema.org/GeoCoordinates (26,620)
Total File Size: 44.1 MB
Quad File: schema_Museum.gz (sample)
http://schema.org/MusicAlbum Quads: 251,633,850
URLs: 879,573
Hosts: 409
http://schema.org/MusicRecording (22,619,712)
http://schema.org/MusicAlbum (13,062,133)
http://schema.org/Offer (8,586,854)
http://schema.org/AudioObject (8,519,865)
http://schema.org/Person (2,056,857)
Total File Size: 3.8 GB
Quad File: schema_MusicAlbum.gz (sample)
http://schema.org/MusicRecording Quads: 318,158,175
URLs: 1,871,921
Hosts: 2,138
http://schema.org/MusicRecording (31,348,530)
http://schema.org/MusicAlbum (11,898,247)
http://schema.org/AudioObject (8,750,393)
http://schema.org/Offer (8,676,133)
http://schema.org/Person (3,213,938)
Total File Size: 4.8 GB
Quad File: schema_MusicRecording.gz (sample)
http://schema.org/Organization Quads: 2,681,017,265
URLs: 41,853,100
Hosts: 79,102
http://schema.org/Organization (110,247,692)
http://schema.org/Product (58,567,430)
http://schema.org/TVSeries (50,436,187)
http://schema.org/Offer (35,571,153)
http://schema.org/AggregateRating (27,035,780)
Total File Size: 56.7 GB
Quad File: schema_Organization.gz (sample)
http://schema.org/Painting Quads: 1,425,159
URLs: 11,980
Hosts: 69
http://schema.org/Painting (400,189)
http://schema.org/Person (12,955)
http://schema.org/Comment (9,856)
http://schema.org/Museum (4,311)
http://schema.org/PostalAddress (4,127)
Total File Size: 29 MB
Quad File: schema_Painting.gz (sample)
http://schema.org/Park Quads: 548,686
URLs: 3,890
Hosts: 39
http://schema.org/PostalAddress (27,155)
http://schema.org/GeoCoordinates (26,145)
http://schema.org/Park (9,746)
http://schema.org/City (3,943)
http://schema.org/TouristAttraction (2,313)
Total File Size: 8.9 MB
Quad File: schema_Park.gz (sample)
http://schema.org/Person Quads: 2,021,449,102
URLs: 25,637,330
Hosts: 74,427
http://schema.org/Person (168,363,779)
http://schema.org/UserComments (25,500,181)
http://schema.org/Comment (21,193,122)
http://schema.org/ImageObject (18,999,858)
http://schema.org/Article (14,896,021)
Total File Size: 62 GB
Quad File: schema_Person.gz (sample)
http://schema.org/Place Quads: 663,039,048
URLs: 5,590,863
Hosts: 22,738
http://schema.org/Place (41,960,508)
http://schema.org/JobPosting (18,783,162)
http://schema.org/PostalAddress (18,598,576)
http://schema.org/Organization (13,370,924)
http://schema.org/Event (9,502,789)
Total File Size: 12.6 GB
Quad File: schema_Place.gz (sample)
http://schema.org/Product Quads: 3,775,412,920
URLs: 47,888,512
Hosts: 108,387
http://schema.org/Product (252,233,316)
http://schema.org/Offer (193,846,906)
http://schema.org/AggregateRating (59,608,310)
http://schema.org/Review (30,653,561)
http://schema.org/Rating (27,421,509)
Total File Size: 65.5 GB
Quad File: schema_Product.gz (sample)
http://schema.org/RadioStation Quads: 1,065,412
URLs: 71,308
Hosts: 82
http://schema.org/RadioStation (94,181)
http://schema.org/PostalAddress (83,928)
http://schema.org/Review (20,966)
http://schema.org/Rating (20,910)
http://schema.org/AggregateRating (12,973)
Total File Size: 20.3 MB
Quad File: schema_RadioStation.gz (sample)
http://schema.org/Recipe Quads: 75,222,033
URLs: 1,589,075
Hosts: 8,944
http://schema.org/Recipe (2,347,678)
http://schema.org/AggregateRating (1,537,937)
http://schema.org/Person (1,325,948)
http://schema.org/NutritionInformation (883,679)
http://schema.org/Comment (576,791)
Total File Size: 2.1 GB
Quad File: schema_Recipe.gz (sample)
http://schema.org/Restaurant Quads: 20,157,626
URLs: 294,134
Hosts: 3,831
http://schema.org/PostalAddress (857,827)
http://schema.org/Restaurant (851,035)
http://schema.org/LocalBusiness (299,868)
http://schema.org/Review (267,789)
http://schema.org/AggregateRating (245,252)
Total File Size: 383.3 MB
Quad File: schema_Restaurant.gz (sample)
http://schema.org/RiverBodyOfWater Quads: 161,839
URLs: 1,311
Hosts: 9
http://schema.org/PostalAddress (7,893)
http://schema.org/GeoCoordinates (7,835)
http://schema.org/RiverBodyOfWater (3,063)
http://schema.org/City (1,004)
http://schema.org/LakeBodyOfWater (589)
Total File Size: 2.6 MB
Quad File: schema_RiverBodyOfWater.gz (sample)
http://schema.org/School Quads: 16,427,157
URLs: 318,668
Hosts: 200
http://schema.org/PostalAddress (1,381,159)
http://schema.org/School (1,237,255)
http://schema.org/WebSite (157,554)
http://schema.org/SearchAction (157,552)
http://schema.org/Review (83,336)
Total File Size: 247.9 MB
Quad File: schema_School.gz (sample)
http://schema.org/ShoppingCenter Quads: 594,623
URLs: 4,863
Hosts: 82
http://schema.org/PostalAddress (27,239)
http://schema.org/ShoppingCenter (25,679)
http://schema.org/ClothingStore (12,270)
http://schema.org/GeoCoordinates (6,338)
http://schema.org/Restaurant (6,243)
Total File Size: 9.3 MB
Quad File: schema_ShoppingCenter.gz (sample)
http://schema.org/SkiResort Quads: 78,737
URLs: 4,414
Hosts: 25
http://schema.org/SkiResort (4,972)
http://schema.org/PostalAddress (2,128)
http://schema.org/GeoCoordinates (2,110)
http://schema.org/AggregateRating (1,151)
http://schema.org/Review (748)
Total File Size: 2.2 MB
Quad File: schema_SkiResort.gz (sample)
http://schema.org/SportsEvent Quads: 25,762,126
URLs: 94,534
Hosts: 410
http://schema.org/SportsEvent (1,302,541)
http://schema.org/PostalAddress (1,046,485)
http://schema.org/EventVenue (570,556)
http://schema.org/SportsTeam/Soccer (312,853)
http://schema.org/SportsAthlete/Soccer (312,853)
Total File Size: 363.7 MB
Quad File: schema_SportsEvent.gz (sample)
http://schema.org/SportsTeam Quads: 9,624,458
URLs: 158,256
Hosts: 197
http://schema.org/Article (527,658)
http://schema.org/SportsTeam (465,610)
http://schema.org/Person (434,668)
http://schema.org/SportsMatchCompetitor (205,434)
http://schema.org/SiteNavigationElement (198,154)
Total File Size: 201.6 MB
Quad File: schema_SportsTeam.gz (sample)
http://schema.org/StadiumOrArena Quads: 12,330,509
URLs: 11,457
Hosts: 43
http://schema.org/PostalAddress (913,351)
http://schema.org/SportsEvent (685,605)
http://schema.org/EventVenue (626,629)
http://schema.org/StadiumOrArena (295,348)
http://schema.org/MusicEvent (159,533)
Total File Size: 165.5 MB
Quad File: schema_StadiumOrArena.gz (sample)
http://schema.org/TelevisionStation Quads: 58,691
URLs: 1,955
Hosts: 19
http://schema.org/TelevisionStation (8,998)
http://schema.org/PostalAddress (637)
http://schema.org/Review (489)
http://schema.org/Rating (482)
http://schema.org/Event (326)
Total File Size: 1.4 MB
Quad File: schema_TelevisionStation.gz (sample)
http://schema.org/TVEpisode Quads: 44,044,409
URLs: 472,303
Hosts: 244
http://schema.org/TVEpisode (3,936,841)
http://schema.org/Person (1,754,279)
http://schema.org/TVSeries (605,323)
http://schema.org/AggregateRating (304,805)
http://schema.org/SiteNavigationElement (261,452)
Total File Size: 832 MB
Quad File: schema_TVEpisode.gz (sample)
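
Figures of the kind listed in the table (quads, URLs, hosts, and entity counts per class) can be approximated from a downloaded subset file. The Python sketch below is one possible way to do so and is not the procedure used to produce the table: it assumes that an entity is a subject with an rdf:type statement on a given page, that a host is the network location of the page URL, and that the file is a gzip-compressed N-Quads file stored locally. The function names and the simplified line pattern are illustrative choices, so results may differ from the published numbers.

```python
import gzip
import re
from collections import Counter
from urllib.parse import urlsplit

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# Simplified N-Quads pattern (assumption: the graph label is always an IRI).
QUAD_RE = re.compile(
    r'^\s*(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+'
    r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)\s+'
    r'(<[^>]*>)\s*\.\s*$'
)


def host_of(url):
    """Return the lower-cased network location of a URL, or '' if malformed."""
    try:
        return urlsplit(url).netloc.lower()
    except ValueError:
        return ""


def subset_statistics(lines, top_n=5):
    """Count quads, distinct page URLs, distinct hosts, and typed entities."""
    quads = 0
    urls = set()
    hosts = set()
    typed_entities = set()  # (page URL, subject, class) triples
    for line in lines:
        match = QUAD_RE.match(line)
        if match is None:
            continue  # skip lines the simplified pattern cannot read
        subject, predicate, obj, graph = match.groups()
        page_url = graph[1:-1]
        quads += 1
        urls.add(page_url)
        hosts.add(host_of(page_url))
        if predicate == RDF_TYPE and obj.startswith("<"):
            # Blank node labels may repeat across pages, so key by page as well.
            typed_entities.add((page_url, subject, obj[1:-1]))
    class_counts = Counter(cls for _, _, cls in typed_entities)
    return {
        "quads": quads,
        "urls": len(urls),
        "hosts": len(hosts),
        "top_classes": class_counts.most_common(top_n),
    }


def statistics_for_file(path, top_n=5):
    """Stream a local gzip-compressed N-Quads file through subset_statistics."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return subset_statistics(handle, top_n)
```

For example, statistics_for_file("schema_SkiResort.gz") would process a local copy of one of the smaller files. The sketch keeps all URLs and entities in memory, so it is only practical for the smaller subsets; the multi-gigabyte files would need an external or approximate counting strategy.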

If you are interested in a particular class or set of classes that is not listed above, please contact the Web Data Commons team via the mailing list or our Google Group.

Get the Code

The source code can be checked out from our GitHub repository. For more information about the framework and a detailed description of how to run your own extraction, visit the framework page.

Get Support

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.