bigcode/the-stack · Datasets at Hugging Face

The Stack contains over 3TB of permissively-licensed source code files covering 30 programming languages crawled from GitHub. The dataset was created as part of the BigCode Project, an open scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs). The Stack serves as a pre-training dataset for Code LLMs, i.e., code-generating AI systems which enable the synthesis of programs from natural language descriptions as well as from other code snippets.

The Stack is a pre-training dataset for creating code LLMs. Code LLMs can be used for a wide variety of downstream tasks such as code completion from natural language descriptions (HumanEval, MBPP), documentation generation for individual functions (CodeSearchNet), and auto-completion of code snippets (HumanEval-Infilling). However, these downstream evaluation benchmarks are outside the scope of The Stack.

The following natural languages appear in the comments and docstrings from files in the dataset: EN, ZH, FR, PT, ES, RU, DE, KO, JA, UZ, IT, ID, RO, AR, FA, CA, HU, ML, NL, TR, TE, EL, EO, BN, LV, GL, PL, GU, CEB, IA, KN, SH, MK, UR, SV, LA, JKA, MY, SU, CS, MN. This kind of data is essential for applications such as documentation generation and natural-language-to-code translation.

Each data instance corresponds to one file. The content of the file is stored in one feature, and the other features provide some metadata. Note that a given file can appear in several different repositories that satisfy our safe-license criterion. If that is the case, only the first of these repositories, in alphabetical order, is shown for simplicity.

The dataset has no splits, and all data is loaded as the train split by default. If you want to set up a custom train-test split, beware that the dataset contains a lot of near-duplicates, which can cause leakage into the test split.
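One way to limit this kind of leakage is to assign whole groups of similar files to the same side of the split. The sketch below is only an illustration: `group_key` and `assign_split` are hypothetical helpers, and hashing whitespace-normalized content is a simple stand-in for real near-duplicate cluster ids, which would catch far more cases.

```python
import hashlib


def group_key(content: str) -> str:
    """Hypothetical grouping key: hash of whitespace-normalized content.

    Assumption: files that differ only in whitespace belong to one group.
    A real pipeline would use near-duplicate cluster ids here instead.
    """
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def assign_split(content: str, test_fraction: float = 0.1) -> str:
    """Deterministically map a file to 'train' or 'test' via its group key."""
    # The first 8 hex digits give an integer in [0, 2**32); scale it to [0, 1).
    bucket = int(group_key(content)[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"


files = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n\treturn a + b\n",  # same code, different indentation
    "print('hello')\n",
]
splits = [assign_split(f) for f in files]
```

Because the assignment depends only on the group key, the first two files always land in the same split, regardless of the order in which files are processed.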

One of the challenges faced by researchers working on code LLMs is the lack of openness and transparency around the development of these systems. Most prior works described the high-level data collection process but did not release the training data. It is therefore difficult for other researchers to fully reproduce these models and understand what kind of pre-training data leads to high-performing code LLMs. By releasing an open large-scale code dataset we hope to make training of code LLMs more reproducible.

220.92M active GitHub repository names were collected from the event archives published between January 1st, 2015 and March 31st, 2022 on GHArchive. Only 137.36M of these repositories were public and accessible on GitHub – others were not accessible as they had been deleted by their owners. 51.76B files were downloaded from the public repositories on GitHub between November 2021 and June 2022. 5.28B files were unique. The uncompressed size of all stored files is 92.36TB.

The list of programming language extensions is taken from this list (also provided in Appendix C of the paper).

Near-deduplication was implemented in the pre-processing pipeline on top of exact deduplication. To find near-duplicates, MinHash signatures with 256 permutations were computed for all documents in linear time. Locality-Sensitive Hashing was then used to find clusters of candidate duplicates. Jaccard similarities were computed inside these clusters, with a similarity threshold of 0.85, to remove any false positives. Roughly 40% of permissively licensed files were (near-)duplicates. See section 3 of the paper for further details.
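The three stages described above (MinHash signatures, LSH bucketing, Jaccard verification) can be sketched with the standard library alone. This is not the project's pipeline: only the 256 permutations and the 0.85 threshold come from the text. The whitespace tokenization, the 64 bands of 4 rows, the linear hash family, and all function names are assumptions chosen for illustration.

```python
import hashlib
import random
from collections import defaultdict
from itertools import combinations

NUM_PERM = 256        # from the text
THRESHOLD = 0.85      # from the text
BANDS, ROWS = 64, 4   # assumption: BANDS * ROWS == NUM_PERM, tuned for recall
PRIME = (1 << 61) - 1

_rng = random.Random(0)
# Assumption: each "permutation" is a random linear map h -> (a*h + b) mod PRIME.
PERMS = [(_rng.randrange(1, PRIME), _rng.randrange(0, PRIME))
         for _ in range(NUM_PERM)]


def shingles(text):
    """Assumption: a document is represented by its set of whitespace tokens."""
    return set(text.split())


def stable_hash(token):
    """64-bit hash that is stable across runs (unlike the built-in hash)."""
    return int.from_bytes(hashlib.sha1(token.encode("utf-8")).digest()[:8], "big")


def minhash(tokens):
    """One minimum per permutation; tokens must be non-empty."""
    hashed = [stable_hash(t) for t in tokens]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in PERMS]


def jaccard(x, y):
    return len(x & y) / len(x | y)


def near_duplicate_pairs(docs):
    """Return sorted (name, name) pairs whose exact Jaccard is >= THRESHOLD."""
    token_sets = {name: shingles(text) for name, text in docs.items()}
    buckets = defaultdict(set)
    for name, tokens in token_sets.items():
        if not tokens:
            continue  # empty files have no signature
        sig = minhash(tokens)
        for band in range(BANDS):
            key = (band, tuple(sig[band * ROWS:(band + 1) * ROWS]))
            buckets[key].add(name)
    # Documents sharing any band bucket become candidate pairs.
    candidates = set()
    for names in buckets.values():
        candidates.update(combinations(sorted(names), 2))
    # Exact Jaccard on candidates removes LSH false positives.
    return sorted(p for p in candidates
                  if jaccard(token_sets[p[0]], token_sets[p[1]]) >= THRESHOLD)


base = [f"tok{i}" for i in range(40)]
docs = {
    "a.py": " ".join(base),
    "b.py": " ".join(base[:39] + ["changed"]),  # Jaccard with a.py is 39/41
    "c.py": " ".join(f"other{i}" for i in range(40)),
}
pairs = near_duplicate_pairs(docs)
```

Signature computation is one pass over the documents, and the pairwise Jaccard check is limited to documents that share a bucket, which is what keeps the approach practical at scale. Using many narrow bands makes the LSH stage permissive, so the final Jaccard check does the actual filtering at 0.85.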

The following are not stored:

Permissive licenses have minimal restrictions on how the software can be copied, modified, and redistributed. These include MIT-0, MIT, MIT-feh, Apache-2.0, BSD-3-Clause, BSD-3-Clause-Clear, BSD-3-Clause-No-Nuclear-License-2014, BSD-2-Clause, CC0-1.0, EPL-1.0, MPL-2.0, Unlicense, ISC, Artistic-2.0, deprecated_LGPL-3.0+, deprecated_LGPL-2.1+, ECL-2.0, SHL-0.51, MPL-2.0-no-copyleft-exception.

GHArchive contained the license information for approximately 12% of the collected repositories. For the remaining repositories, go-license-detector was run to detect the most likely SPDX license identifier. The detector did not detect a license for ~81% of the repositories, in which case the repository was excluded from the dataset.

A file was included in the safe license dataset if at least one of the repositories containing the file had a permissive license.
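The inclusion rule, together with the convention of showing only the alphabetically first qualifying repository, can be illustrated as follows. This is a sketch under assumptions: `shown_repository` is a hypothetical helper, `PERMISSIVE` holds only a subset of the identifiers listed above, the repository names are made up, and "alphabetical order" is taken to mean Python's default string ordering.

```python
from typing import Dict, Optional

# Subset of the SPDX identifiers listed in the text, for illustration only.
PERMISSIVE = frozenset({
    "MIT", "Apache-2.0", "BSD-3-Clause", "BSD-2-Clause",
    "ISC", "Unlicense", "CC0-1.0",
})


def shown_repository(repo_licenses: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the repository to display for a file, or None to exclude the file.

    repo_licenses maps each repository containing the file to its detected
    SPDX identifier (None when no license was detected).
    """
    permissive_repos = sorted(
        repo for repo, spdx in repo_licenses.items() if spdx in PERMISSIVE
    )
    # The file is kept if at least one repository qualifies.
    return permissive_repos[0] if permissive_repos else None


# Hypothetical repositories that all contain the same file.
example = {
    "zed/tool": "MIT",
    "alice/lib": "GPL-3.0",
    "bob/fork": "Apache-2.0",
}
chosen = shown_repository(example)
```

Here the file is kept because two of the three repositories qualify, and `bob/fork` is shown because it sorts before `zed/tool`.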

The source (code) language producers are users of GitHub that created unique repository names between January 1st, 2015, and March 31st, 2022.

The released dataset may contain sensitive information such as emails, IP addresses, and API/ssh keys that have previously been published to public repositories on GitHub. Deduplication has helped to reduce the amount of sensitive data that may exist. In the event that the dataset contains personal information, researchers should only use public, non-personal information in support of conducting and publishing their open-access research. Personal information should not be used for spamming purposes, including sending unsolicited emails or selling of personal information. Complaints, removal requests, and "do not contact" requests can be sent to contact@bigcode-project.org.

The PII pipeline for this dataset is still a work in progress (see this issue for updates). Researchers that wish to contribute to the anonymization pipeline of the project can apply to join here. Developers with source code in the dataset can request to have it removed here (proof of code contribution is required).

The Stack is an output of the BigCode Project. BigCode aims to be responsible by design and by default. The project is conducted in the spirit of Open Science, focused on the responsible development of LLMs for code.

With the release of The Stack, we aim to increase access, reproducibility, and transparency of code LLMs in the research community. Work to de-risk and improve on the implementation of ethical best practices of code LLMs is conducted in various BigCode working groups. The Legal, Ethics, and Governance working group has explored topics such as licensing (including copyleft and the intended use of permissively licensed code), attribution of generated code to original code, rights to restrict processing, the inclusion of Personally Identifiable Information (PII), and risks of malicious code, among other topics. This work is ongoing as of October 25th, 2022.

We expect code LLMs to enable people from diverse backgrounds to write higher quality code and develop low-code applications. Mission-critical software could become easier to maintain as professional developers are guided by code-generating systems on how to write more robust and efficient code. While the social impact is intended to be positive, the increased accessibility of code LLMs comes with certain risks such as over-reliance on the generated code and long-term effects on the software development job market.

A broader impact analysis relating to Code LLMs can be found in section 7 of this paper. An in-depth risk assessment for Code LLMs can be found in section 4 of this paper.

The code collected from GitHub does not contain demographic information or proxy information about the demographics. However, it is not without risks, as the comments within the code may contain harmful or offensive language, which could be learned by the models.

Widely adopted programming languages like C and JavaScript are overrepresented compared to niche programming languages like Julia and Scala. Some programming languages such as SQL, Batchfile, and TypeScript are less likely to be permissively licensed (4% vs. the average of 10%). This may result in a biased representation of those languages. Permissively licensed files also tend to be longer.

Roughly 40 natural languages are present in docstrings and comments, with English being the most prevalent. In Python files, English makes up ~96% of the dataset.

For further information on data analysis of the Stack, see this repo.

One of the current limitations of The Stack is that scraped HTML for websites may not be compliant with Web Content Accessibility Guidelines (WCAG). This could affect generated HTML code, which may introduce web accessibility issues.

The training dataset could contain malicious code and/or the model could be used to generate malware or ransomware.

To the best of our knowledge, all files contained in the dataset are licensed with one of the permissive licenses (see list in Licensing information). The accuracy of license attribution is limited by the accuracy of GHArchive and go-license-detector. Any mistakes should be reported to the BigCode Project for review and follow-up as needed.

The Stack is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in The Stack must abide by the terms of the original licenses, including attribution clauses when relevant. We facilitate this by providing provenance information for each data point.

The SPDX license identifiers included in the dataset are:

IMPORTANT UPDATE 27/20 It was brought to our attention that licenses such as MPL, LGPL, and EPL were erroneously labeled as permissive and were included in the dataset when they are in fact weak copyleft licenses. We will remove these weak copyleft license files from The Stack and release an updated version in the coming weeks.

The Stack dataset is a collection of 3.1 TB of source code in 30 programming languages. We ask that you read and acknowledge the following points before using the dataset:
