Extracting Data from the Common Crawl Dataset


Common Crawl is a non-profit organization that crawls the web and makes its datasets and metadata freely available to the public. The Common Crawl corpus contains petabytes of data, including raw web page data, metadata extracts and text extracts, collected over more than eight years of web crawling. Common Crawl data is hosted on Amazon's Public Data Sets and on other cloud platforms around the world, so access to the corpus is free. You can download the data or use Amazon's cloud platform to run analysis jobs against it directly.

Data Location

The Common Crawl dataset lives on Amazon S3. You can download the files free of charge from the Public Data Sets bucket.

On the Common Crawl website, we can find the format details and metadata for the raw crawl data of 2020:

  • s3://commoncrawl/crawl-data/CC-MAIN-2020-05 – January 2020
  • s3://commoncrawl/crawl-data/CC-MAIN-2020-10 – February 2020
  • s3://commoncrawl/crawl-data/CC-MAIN-2020-16 – March/April 2020
  • s3://commoncrawl/crawl-data/CC-MAIN-2020-24 – May/June 2020
  • s3://commoncrawl/crawl-data/CC-MAIN-2020-29 – July 2020
  • s3://commoncrawl/crawl-data/CC-MAIN-2020-34 – August 2020
  • s3://commoncrawl/crawl-data/CC-MAIN-2020-40 – September 2020
  • s3://commoncrawl/crawl-data/CC-MAIN-2020-45 – October 2020
  • s3://commoncrawl/crawl-data/CC-MAIN-2020-50 – November/December 2020

For every crawl, the raw data is stored in Amazon S3 in the WARC file format, together with metadata (WAT) and text (WET) extracts. File path listings for the WAT and WET extracts are provided separately. You can access any file over HTTP by replacing s3://commoncrawl/ with https://commoncrawl.s3.amazonaws.com/ on each line.
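As a quick illustration, here is a minimal Python sketch of that prefix replacement. It assumes the requests package is installed and that a gzip-compressed warc.paths.gz listing file exists for the chosen crawl under the path layout shown above; it downloads the listing and prints the first few WARC file URLs.

import gzip
import io
import requests

# Sketch: turn an s3:// location into its HTTPS equivalent and download the
# (small) gzip-compressed listing of WARC file paths for one monthly crawl.
s3_path = "s3://commoncrawl/crawl-data/CC-MAIN-2020-50/warc.paths.gz"   # assumed listing file
https_url = s3_path.replace("s3://commoncrawl/", "https://commoncrawl.s3.amazonaws.com/")

response = requests.get(https_url, timeout=60)
response.raise_for_status()

# One WARC path per line; prefix each with the HTTPS base URL to fetch it.
with gzip.open(io.BytesIO(response.content), "rt") as listing:
    for i, line in enumerate(listing):
        print("https://commoncrawl.s3.amazonaws.com/" + line.strip())
        if i == 4:
            break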

Data Format

Common Crawl currently uses the Web ARChive (WARC) format for storing raw crawl data; previously, the raw data was stored in the ARC file format. WARC is a revision and generalization of the ARC format that allows more efficient storage and processing of raw crawl data. A WARC file combines several resources and their related information into a single aggregate archive file, and a crawl's raw data can run to hundreds of terabytes of such files. WARC is now used by most national libraries for web archiving.

Common Crawl publishes three types of files for each crawl:

  • WARC files – contain the raw crawl data
  • WAT files – contain computed metadata for the data stored in the WARC
  • WET files – contain the plain text extracted from the data stored in the WARC and are the easiest to use

WARC Format:-

The WARC format holds the raw data and provides full details of each request and response; in simple words, it gives a direct view of the crawl process. It stores the HTTP response from the website (WARC-Type: response), how that content was requested (WARC-Type: request) and metadata about the crawl process itself (WARC-Type: metadata).

Example of WARC extract

WARC/1.0
WARC-Type: response
WARC-Date: 2014-08-02T09:52:13Z
WARC-Record-ID: 
Content-Length: 43428
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: 
WARC-Concurrent-To: 
WARC-IP-Address: 212.58.244.61
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Payload-Digest: sha1:M63W6MNGFDWXDSLTHF7GWUPCJUH4JK3J
WARC-Block-Digest: sha1:YHKQUSBOS4CLYFEKQDVGJ457OAPD6IJO
WARC-Truncated: length
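To work with such records programmatically, a common approach is the open-source warcio library. The following is a minimal sketch (not official Common Crawl tooling) that iterates over the response records of a locally downloaded WARC file; the file name example.warc.gz is hypothetical.

from warcio.archiveiterator import ArchiveIterator

# Sketch: iterate over a WARC file and print basic details of each
# "response" record (the crawled HTTP responses described above).
with open("example.warc.gz", "rb") as stream:            # hypothetical local file
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI")
        date = record.rec_headers.get_header("WARC-Date")
        body = record.content_stream().read()            # HTTP payload bytes
        print(uri, date, len(body), "bytes")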

WAT Response Format:-

WAT files provide important metadata about the records stored in the WARC format. This metadata is computed for three record types only (metadata, request and response). If the crawled content is HTML, the computed metadata includes the HTTP headers and the links that appear on the HTML page.

The WAT metadata is stored as JSON. To keep records small, all unnecessary whitespace is removed, which makes the records hard for humans to read. The overall structure of the JSON is outlined below.

Envelope
  WARC-Header-Metadata
    WARC-Target-URI [string]
    WARC-Type [string]
    WARC-Date [datetime string]
    …
  Payload-Metadata
    HTTP-Response-Metadata
      Headers
        Content-Language
        Content-Encoding
        …
      HTML-Metadata
        Head
          Title [string]
          Link [list]
          Metas [list]
        Links [list]
      Headers-Length [int]
      Entity-Length [int]
      …
    …
  …
Container
  Gzip-Metadata [object]
  Compressed [boolean]
  Offset [int]
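A hedged sketch of reading this structure with Python: WAT records are WARC records of type metadata whose payload is the JSON outlined above, so they can be parsed with warcio plus the standard json module. The file name below is hypothetical.

import json
from warcio.archiveiterator import ArchiveIterator

# Sketch: walk the Envelope -> Payload-Metadata -> HTTP-Response-Metadata ->
# HTML-Metadata path of each WAT record and count the outgoing links.
with open("example.warc.wat.gz", "rb") as stream:        # hypothetical local file
    for record in ArchiveIterator(stream):
        if record.rec_type != "metadata":
            continue
        data = json.loads(record.content_stream().read())
        envelope = data.get("Envelope", {})
        uri = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
        html_meta = (envelope.get("Payload-Metadata", {})
                             .get("HTTP-Response-Metadata", {})
                             .get("HTML-Metadata", {}))
        links = html_meta.get("Links", [])
        print(uri, "->", len(links), "links")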

WET Response Format

As we already know, the Common Crawl dataset contains WARC, WAT and WET files. WET files provide only the extracted plain text, and the format is quite simple to understand. Each record starts with a small block of WARC metadata, including the URL and the length of the plain text, followed by the plain text itself.

Example:-

WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Date: 2014-08-02T09:52:13Z
WARC-Record-ID: 
WARC-Refers-To: 
WARC-Block-Digest: sha1:JROHLCS5SKMBR6XY46WXREW7RXM64EJC
Content-Type: text/plain
Content-Length: 6724

BBC NEWS | Africa | Namibia braces for Nujoma exit
President Sam Nujoma works in very pleasant surroundings in the small but beautiful old State House…
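Because WET records are plain text, extracting them is straightforward. Here is a minimal sketch with warcio, again against a hypothetical local file name, that prints the URL and the length of the extracted text of each conversion record.

from warcio.archiveiterator import ArchiveIterator

# Sketch: read the "conversion" records of a WET file and decode their
# plain-text payload.
with open("example.warc.wet.gz", "rb") as stream:        # hypothetical local file
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="replace")
        print(uri, "->", len(text), "characters of plain text")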

Dataset Size

The main challenge with the dataset is its size: every monthly crawl contains terabytes of data, so downloading a whole month's crawl is impractical. Instead, we need a way to select only the records we care about. Common Crawl generates an index of its records with attributes such as language, media (MIME) type and character set, so relevant records can be extracted for any application without scanning the entire corpus.
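One convenient way to use that index is the public CDX server at index.commoncrawl.org. The sketch below assumes its usual query interface (the per-crawl endpoint name, the url/output/limit parameters and JSON-lines output) and looks up a handful of captures for one site.

import json
import requests

# Sketch: ask the CDX index server which captures exist for a URL prefix,
# instead of scanning the raw archives.
api = "https://index.commoncrawl.org/CC-MAIN-2020-50-index"     # assumed endpoint name
params = {"url": "bbc.co.uk/*", "output": "json", "limit": 5}

response = requests.get(api, params=params, timeout=60)
response.raise_for_status()

# The server answers with one JSON object per line.
for line in response.text.splitlines():
    capture = json.loads(line)
    print(capture["url"], capture["mime"], capture["filename"])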

Language:-

On the Common Crawl website, we can see the percentage of crawled records available for each language.


From the details published there, Malayalam (mal) accounts for only about 0.017% of the whole dataset. The January crawl archive contains 3.1 billion web pages, or about 300 TiB of uncompressed content, so even this tiny percentage still corresponds to roughly 50 GB of raw records for the Malayalam language.
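If the index records expose the detected content languages, as they do in recent Common Crawl index releases, a language-specific subset can be selected without downloading any page content. The sketch below filters index results client-side for Malayalam ("mal"); both the endpoint name and the languages field are assumptions, and the query itself is only illustrative.

import json
import requests

# Sketch: keep only index records whose detected languages include
# Malayalam ("mal"); the "languages" field is assumed to be present.
api = "https://index.commoncrawl.org/CC-MAIN-2020-50-index"          # assumed endpoint name
params = {"url": "*.wikipedia.org", "output": "json", "limit": 200}  # illustrative query

response = requests.get(api, params=params, timeout=60)
response.raise_for_status()

for line in response.text.splitlines():
    capture = json.loads(line)
    if "mal" in capture.get("languages", ""):
        print(capture["url"], capture.get("languages"))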

MIME Types:-

The crawled data is dominated by HTML pages and contains only a small percentage of other document formats. The tables on the Common Crawl site show the percentage of the top 10 media (MIME) types in the latest monthly crawls.

It is immediately clear from these figures that HTML dominates all other MIME types, since the crawl consists of about 3.1 billion web pages built with HTML. All other media types combined, such as PDF and JPEG, account for less than 1% of the data.

One table on the site is based on the Content-Type HTTP header of the latest month's crawl.


Another table shows the top 10 MIME types detected by Apache Tika based on the actual content of the latest month's crawl.


Character sets:-

The character set or encoding is detected only for HTML pages, using Tika's AutoDetectReader. The corresponding table on the site shows how frequently each character set appears in the records of the latest monthly crawl.


UTF-8 is by far the dominant character set. UTF stands for Unicode Transformation Format: an encoding system used for electronic communication that maps every Unicode character to a unique binary string and can convert it back again. UTF-8 is a variable-width character encoding.
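A quick way to see the variable-width property is to encode a few characters from different scripts; the snippet below uses only the Python standard library.

# ASCII characters need one byte in UTF-8, while characters from other
# scripts need two to four bytes; decoding always recovers the original.
for ch in ["a", "é", "മ", "🙂"]:
    encoded = ch.encode("utf-8")
    assert encoded.decode("utf-8") == ch
    print(repr(ch), "->", len(encoded), "byte(s):", encoded.hex())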

Top-Level Domains

Top-level domains (abbreviated "TLD"/"TLDs") are a significant indicator of how representative the data is, i.e. whether the raw dataset or crawl is biased towards certain countries, regions or languages.

Top-level domains are classified by the IANA Root Zone Database into five types:

  • Generic TLDs ("gTLD") are not bound to a specific country. The core is made up of the .com, .info, .net and .org TLDs, but the list was later extended with generic terms (.bike), brands (.apple, .volkswagen) and geographical or cultural entities (.kiwi).
  • Sponsored TLDs ("sTLD") are restricted to defined groups of users; domain registration is not open to everybody. That is obvious for .gov, .edu and .mil, but also applies to .museum and others.
  • Generic-restricted TLDs ("grTLD"): .biz, .name and .pro.
  • A single "infrastructure" TLD: .arpa.
  • Country-code top-level domains ("ccTLD"): .uk, .fr, .jp, etc.

The generic and country-code TLDs include internationalized top-level domains that are written in non-Latin alphabets or contain non-ASCII characters. These fall into two groups:

  • internationalized country-code TLD (“IDN ccTLD”):  .рф – Russia
  • internationalized generic TLD (“IDN gTLD”):  .セール – Japanese for ‘sale’, .vermögensberatung – German ‘financial consulting’.
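When grouping crawled URLs by TLD yourself, a simple sketch with the Python standard library is often enough. Note that it only takes the last label of the host name, so handling public-suffix cases such as .co.uk properly would need an extra library (for example tldextract); the second and third URLs below are only illustrative.

from urllib.parse import urlsplit

# Sketch: derive the top-level domain of a few example URLs by taking the
# last label of the host name.
urls = [
    "http://news.bbc.co.uk/2/hi/africa/3414345.stm",
    "https://www.example.com/",          # illustrative gTLD host
    "https://пример.рф/",                # illustrative IDN ccTLD host
]
for url in urls:
    host = urlsplit(url).hostname
    tld = host.rsplit(".", 1)[-1]
    print(host, "-> TLD:", tld)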

Host- and Domain-Level Web Graphs

Host-level graph

The host-level graph contains 539 million nodes and 3.02 billion edges, including dangling nodes, that is, hosts that have not been crawled themselves but are pointed to by links on crawled pages. Interestingly, there are 467 million dangling nodes, 86.7% of the whole graph, while the largest strongly connected component contains 46 million nodes (8.5%).

You can download the host-level graph and the ranks of all 539 million hosts from AWS S3 under the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-nov-dec/host/. Alternatively, use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2020-nov-dec/host/ as the prefix to access the graph and rank files from anywhere.

Domain-Level Web Graph

The domain-level graph was created by aggregating the host graph at the level of pay-level domains (PLDs), based on the public suffix list maintained at publicsuffix.org. The domain-level graph contains 89 million nodes and 1.71 billion edges. Here 45 million nodes (51%) are dangling nodes, and the largest strongly connected component contains 35 million nodes (39%).

You can download the domain-level graph and the ranks of all 89 million domains from AWS S3 under the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-nov-dec/domain/, or use https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2020-nov-dec/domain/ as the prefix to access the graph files from anywhere.

How To Download Index Files

  1. Go to http://index.commoncrawl.org/
  2. It shows a list of index path files, for example https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-50/cc-index.paths.gz
  3. Download and decompress it.
  4. Fetch the listed files by adding the prefix https://commoncrawl.s3.amazonaws.com/, or s3://commoncrawl/ when accessing them via S3.

How to fetch index files for a single top-level domain (here .fr)?

  1. The file list contains a cluster.idx file: cc-index/collections/CC-MAIN-2020-50/indexes/cluster.idx
  2. Fetch it, for example: wget https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2020-50/indexes/cluster.idx
  3. The first field in cluster.idx contains the SURT representation of the URL,
    with the domain name reversed:
    (fr,01-portable)/pal-et-si-internet-nexistait-pas.htm
  4. Now we can easily list the .cdx files that contain all records from the .fr TLD (each line of these files stores the WARC filename, offset and length of a capture; see the sketch after this list):
    grep '^fr,' cluster.idx | cut -f2 | uniq
    cdx-00193.gz
    cdx-00194.gz
    cdx-00195.gz
    cdx-00196.gz
    Those are only 4 files; their full path/URL can be found in the file list.
  5. By comparison, .com records make up more than half of the index:
    grep '^com,' cluster.idx | cut -f2 | uniq | wc -l
    155
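Because every index entry records the WARC filename, byte offset and length of a capture, a single record can be fetched with an HTTP Range request instead of downloading the whole archive. The sketch below combines a one-record lookup against the index server (same assumed endpoint and field names as earlier) with such a range fetch, and parses the returned bytes with warcio.

import io
import json
import requests
from warcio.archiveiterator import ArchiveIterator

# Sketch: look one URL up in the index, then fetch only that record's byte
# range from the WARC file on S3 and parse it with warcio.
api = "https://index.commoncrawl.org/CC-MAIN-2020-50-index"      # assumed endpoint name
lookup = requests.get(api, params={"url": "commoncrawl.org", "output": "json", "limit": 1},
                      timeout=60)
lookup.raise_for_status()
capture = json.loads(lookup.text.splitlines()[0])

offset, length = int(capture["offset"]), int(capture["length"])
warc_url = "https://commoncrawl.s3.amazonaws.com/" + capture["filename"]
headers = {"Range": "bytes={}-{}".format(offset, offset + length - 1)}

chunk = requests.get(warc_url, headers=headers, timeout=60).content
for record in ArchiveIterator(io.BytesIO(chunk)):
    print(record.rec_type, record.rec_headers.get_header("WARC-Target-URI"))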

URL Search Tool

URL Search is a web application that lets users search for any URL, URL prefix, subdomain or top-level domain. The result shows how many files in the Common Crawl corpus come from that URL and provides a downloadable JSON metadata file with the location and offset of the data for each match. After downloading the JSON file, users can feed it to their own code and run their program against only the subset of the corpus they specified. URL Search therefore makes it much easier and cheaper to find the files of interest, because the program runs over only those files instead of the entire corpus.

Conclusion

The Common Crawl corpus is an excellent opportunity for any individual or business to access a large portion of the raw content of the internet free of charge or very cost-effectively: 210 terabytes of raw data corresponding to 3.83 billion documents from 41.4 million distinct second-level domains or hosts. Ten top-level domains are represented with a share above 1%, while documents from .com alone account for more than 55% of the whole corpus. The corpus includes a huge number of pages from youtube.com, from blog publishing services such as blogspot.com and wordpress.com, and from online shopping sites such as amazon.com and flipkart.com; these sites are good sources of comments and reviews. Almost half of all web documents are UTF-8 encoded, while the encoding of a further 43% share is unknown. The corpus consists of about 92% HTML content and 2.4% PDF files; the remainder is images, XML or code such as JavaScript and cascading style sheets, each with a very small share.