
Web Data Commons - Feb 2012 Corpus - Download Instructions

This page contains the download instructions for the February 2012 corpus of the Web Data Commons project.

All our data and the code used to produce it are available for download.

Download the Extracted RDF Data

The extracted structured data is provided for download in the N-Quads RDF syntax and is divided according to the format in which the data was originally embedded. Files are compressed using GZIP and split after reaching a size of 100 MB. Overall, 501 files with a total size of 49 GB were produced.

List of download URLs for RDF from the February 2012 corpus (Example Content)

The extracted RDF data can be downloaded with wget using the command: wget -i http://webdatacommons.org/downloads/2012-02/nquads/files.list
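Once downloaded and decompressed, each file contains one quad per line. As a rough illustration of the N-Quads layout (subject, predicate, object, and the URL of the page the data was extracted from), the sketch below splits a single line with a regular expression; the example URIs are invented, and a real RDF library such as rdflib should be preferred for production parsing, since escaped literals and blank nodes need more care than this.

```python
import re

def parse_nquad(line):
    """Split one N-Quads line into (subject, predicate, object, graph).

    Minimal sketch only: it assumes the object term contains the only
    internal whitespace, which holds for simple literals but not for
    every legal N-Quads line.
    """
    m = re.match(r'(\S+)\s+(\S+)\s+(.+?)\s+(\S+)\s+\.\s*$', line)
    if m is None:
        return None  # malformed line
    return m.groups()

# Hypothetical quad in the style of the corpus files:
quad = parse_nquad(
    '<http://example.org/s> <http://example.org/p> "value" <http://example.org/page.html> .'
)
```

Here the fourth term (the graph) carries the URL of the crawled page, which is what makes the corpus data traceable back to its source.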

Download the Generated CSV Tables

The extracted microformat data is also available for download as CSV tables. The SPARQL queries used for generating the CSV tables are available as well.


CSV Table              SPARQL Query    Sample File (1000 entries)
hCalendar.csv (~7MB)   HCalendar.rq    hCalendar-sample.csv
Geo.csv (~136MB)       Geo.rq          Geo-sample.csv
hListing.csv (~256MB)  hListing.rq     hListing-sample.csv
hResume.csv (~10MB)    hResume.rq      hResume-sample.csv
hReview.csv (~452MB)   hReview.rq      hReview-sample.csv
hRecipe.csv (~11MB)    hRecipe.rq      hRecipe-sample.csv
Species.csv (~1MB)     species.rq      Species-sample.csv
XFN.csv (~510MB)       XFN.rq          XFN-sample.csv
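The CSV tables can be loaded with any standard CSV reader; Python's csv module handles them directly. The column names and values in the sketch below are hypothetical placeholders, since the actual headers come from the sample files themselves.

```python
import csv
import io

# Hypothetical excerpt standing in for a downloaded sample file such as
# hCalendar-sample.csv; the real column names come from the file's header row.
sample = io.StringIO(
    "url,summary,dtstart\n"
    "http://example.org/event,Meetup,2012-02-01\n"
)

# DictReader maps each row to the header columns by name.
rows = list(csv.DictReader(sample))
```

Reading with DictReader rather than a plain reader keeps the code independent of the column order, which differs between the eight tables.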

Download the Extraction Statistics

The detailed extraction statistics provide a general overview of the URLs containing structured data and link back to the Common Crawl .arc files. The extraction statistics record the amount of structured data found for each URL in the crawl data. Be advised to use a parser that is able to skip invalid lines, since such lines may be present in the tab-separated files. The table contains the following columns (not necessarily in this order):
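Skipping invalid lines, as advised above, can be done by checking each parsed row against the expected field count. The sketch below assumes that a fixed number of tab-separated fields identifies a valid row; the actual column count follows from the column lists on this page.

```python
import csv

def read_statistics(lines, expected_columns):
    """Yield rows from a tab-separated extraction-statistics file,
    silently skipping lines that do not have the expected number of
    fields.

    Sketch only: it assumes malformed lines can be recognized purely
    by their field count.
    """
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) == expected_columns:
            yield row
        # otherwise: malformed line, skip it

# Hypothetical input mixing valid rows with a broken line:
lines = ["a\tb\tc", "broken line", "d\te\tf"]
rows = list(read_statistics(lines, expected_columns=3))
```

Streaming the file line by line this way also keeps memory use flat, which matters for statistics files covering the full crawl.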

Source Data Columns

Result Data Columns

Sample Extraction Statistic File (csv)
Extraction Statistic File (csv)