Common Crawl is a non-profit foundation dedicated to the Open Web.

Common Crawl Foundation

We are very happy to announce our next seminar: Pedro Ortiz Suarez @pjox.bsky.social (@commoncrawl.bsky.social), "Expanding Linguistic and Cultural Coverage in Common Crawl", on Friday 12th June 2026, 11am CEST. Details here 👉 almanach.inria.fr/seminars-en....

Browsers can now fetch Common Crawl data directly, no backend needed. Build SQL explorers, snapshot viewers and diff tools as static pages. commoncrawl.org/blog/you-can...

Common Crawl - Blog - You can now build directly on Common Crawl from the browser

Inria Paris NLP (ALMAnaCH team)

RSVP and join speakers @very-laurie.bsky.social and @pjox.bsky.social from the Common Crawl Foundation and Kostis Saitas Zarkias and Robert Pugh from Mozilla Data Collective for a truly hands-on session. Thursday, June 4th, 6 PM CEST | 12 PM ET | 9 AM PDT. Register via Zoom: zoom.us/meeting/regi...

We are happy to announce the release of the May 2026 crawl archive, consisting of 2.16 billion web pages, or 365.56 TiB of uncompressed content. www.commoncrawl.org/blog/may-202...

Under-represented languages deserve better tools! On June 4th, the Common Crawl Foundation and Mozilla Data Collective will host a webinar to test language identification for the languages you care about.

Introducing the AI Visibility Audit! A free guide for SEOs and GEOs on how to check whether AI systems can actually reach a site, and how to stay visible in the crawl that trains them. commoncrawl.org/blog/introdu...

Welcome! You are invited to join a meeting: Text Language Identification (LID) with CommonCrawl and Mozilla Data Collective. After registering, you will receive a confirmation email about joining the ...

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of February, March, and April 2026. commoncrawl.org/blog/host--a...

As an early experiment in distributing Common Crawl data through another channel, the April 2026 crawl archive is now available in a Hugging Face Storage Bucket, alongside its existing home on AWS S3. commoncrawl.org/blog/april-2...

The Columnar Index Is Now the URL Index! We have renamed the Columnar Index to the URL Index, to be clearer about its purpose and to pave the way for more datasets in a columnar format. commoncrawl.org/blog/the-col...

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of March, April, and May 2026. commoncrawl.org/blog/host--a...

Common Crawl - Blog - May 2026 Crawl Archive Now Available

Common Crawl - Blog - Introducing the AI Visibility Audit

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of February, March, and April 2026. The graphs consist of 269.0 million nodes and 9.4 billion edg...

Common Crawl - Blog - Host- and Domain-Level Web Graphs February, March, and April 2026

Common Crawl - Blog - April 2026 Crawl Archive Now Available in a Hugging Face Storage Bucket

Common Crawl - Blog - The Columnar Index Is Now the URL Index

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of March, April, and May 2026. The graphs consist of 262.4 million nodes and 8.1 billion edges at...

Common Crawl - Blog - Host- and Domain-Level Web Graphs March, April, and May 2026
