commoncrawl/web-languages
Crowd-sourced lists of URLs to help Common Crawl crawl content in under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.
GitHub repository with 69 stars and 93 forks.
Topics: crawling, dataset, language-detection