Common Crawl is a non-profit foundation dedicated to the Open Web.
Common Crawl Foundation
We are very happy to announce our next seminar: Pedro Ortiz Suarez @pjox.bsky.social (@commoncrawl.bsky.social), "Expanding Linguistic and Cultural Coverage in Common Crawl", on Friday, 12th June 2026, at 11am CEST. Details here 👉 almanach.inria.fr/seminars-en....
Browsers can now fetch Common Crawl data directly, no backend needed. Build SQL explorers, snapshot viewers and diff tools as static pages.
commoncrawl.org/blog/you-can...
RSVP and join speakers @very-laurie.bsky.social and @pjox.bsky.social from the Common Crawl Foundation and Kostis Saitas Zarkias and Robert Pugh from Mozilla Data Collective for a truly hands-on session.
Thursday, June 4th
6 PM CEST | 12 PM ET | 9 AM PDT
Register via Zoom: zoom.us/meeting/regi...
We are happy to announce the release of the May 2026 crawl archive, consisting of 2.16 billion web pages, or 365.56 TiB of uncompressed content.
www.commoncrawl.org/blog/may-202...
Under-represented languages deserve better tools!
On June 4th, the Common Crawl Foundation and Mozilla Data Collective will host a webinar to test language identification for the languages you care about.
Introducing the AI Visibility Audit!
A free guide for SEOs and GEOs on how to check whether AI systems can actually reach a site, and how to stay visible in the crawl that trains them.
commoncrawl.org/blog/introdu...
Zoom meeting: Text Language Identification (LID) with CommonCrawl and Mozilla Data Collective.
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of February, March, and April 2026.
commoncrawl.org/blog/host--a...
As an early experiment in distributing Common Crawl data through another channel, the April 2026 crawl archive is now available in a Hugging Face Storage Bucket, alongside its existing home on AWS S3.
commoncrawl.org/blog/april-2...
The Columnar Index Is Now the URL Index!
We have renamed the Columnar Index to the URL Index, to be clearer about its purpose and to pave the way for more datasets in a columnar format.
commoncrawl.org/blog/the-col...
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of March, April, and May 2026.
commoncrawl.org/blog/host--a...
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of February, March, and April 2026. The graphs consist of 269.0 million nodes and 9.4 billion edg...
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of March, April, and May 2026. The graphs consist of 262.4 million nodes and 8.1 billion edges at...