Re-Launch of Platform Governance Archive (PGA): Datasets, downloads and data paper, new website and collaboration with OTA

The Lab Platform Governance, Media and Technology (PGMT) at the ZeMKI, Centre for Media, Communication and Information Research, and the Alexander von Humboldt Institute for Internet and Society (HIIG) are launching this week an updated version of their pioneering open-access repository of platform policies, the Platform Governance Archive (PGA). The extensive update includes the launch of a new website, which enables easier data access; the publication of a data paper, which gives a holistic overview of how the PGA was built; and the release of an updated dataset, which widens the scope of the PGA to cover more platforms and policies.

The power of social media platforms has been a focal point of critical discussion and research since long before Musk took over Twitter. Platforms' corporate policies are a key measure of the way platforms govern and order public discourse, as they articulate which kinds of content and conduct are allowed and prohibited on their services. These rulebooks are the subject of the Platform Governance Archive (PGA), an open-access repository of platform policies that aims to enable collaborative research on, and critical engagement with, how, when and why platforms change their rules. The PGA was founded by the Alexander von Humboldt Institute for Internet and Society (HIIG) and is now hosted at the University of Bremen.

The need for the systematic study of platform policies 

When first launched in April 2021, the PGA emerged out of the need to systematically study the historical evolution of platform policies and out of the lack of coherently collected data in this area that did not rely on the platforms’ own corporate archives. The resulting PGA v1 dataset contains all historical versions of the Terms of Service, Community Guidelines and Privacy Policies of Facebook, YouTube, Twitter and Instagram (with the exception of YouTube’s Community Guidelines) from the time when they were first introduced through late 2021.

New download option and data paper

The dataset was built through a combination of automated and manual approaches to data collection and data cleaning, which are explained in detail in our newly published data paper. The paper also lays out the conceptual setup of the PGA and gives a detailed overview of the specificities of the included policies, as well as some of the general trends and patterns that run through the historical evolution of the PGA v1 corpus.

As part of the new PGA website, the dataset is now available as a direct download. Overall, the PGA v1 corpus contains 153 policy documents with a total of 6,036 pages, which are provided in PDF, HTML and Markdown formats. The downloadable archive also contains additional material and tools that were used in the data collection process.

Collaboration with Open Terms Archive: New dataset includes more platforms and policies 

With the launch of PGA v2, we are also publishing a new dataset which widens the scope of the PGA to cover 18 platforms and currently 79 policies. The dataset is published in collaboration with the Open Terms Archive (OTA), an open-source initiative dedicated to increasing the transparency and democratic oversight of digital services.

The PGA v2 dataset goes back to April 2022 and is automatically updated on a daily basis to enable the continuous tracking of changes in the selected policies. Whenever a change is made to one of the tracked policies, the system stores a snapshot in a GitHub repository, where the change can be examined using GitHub’s change visualisation. The dataset can also be downloaded in bulk as an archive of Markdown files.
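
For readers who work with the bulk download rather than the GitHub view, a comparable line-by-line comparison of two Markdown snapshots can be produced locally. The following is only a minimal sketch using Python's standard difflib module; it is not the tooling used by the PGA or OTA, and the function name policy_diff, the file label and the two sample texts are hypothetical.

```python
import difflib


def policy_diff(old_text: str, new_text: str, name: str = "policy.md") -> list[str]:
    """Return unified-diff lines between two Markdown snapshots of one policy.

    An empty list means the two snapshots are identical.
    """
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=f"{name} (previous)",
            tofile=f"{name} (current)",
            lineterm="",
        )
    )


# Hypothetical snapshot texts, invented for illustration only.
previous = "# Community Guidelines\n\nSpam is not allowed.\n"
current = "# Community Guidelines\n\nSpam and scams are not allowed.\n"

changes = policy_diff(previous, current)
if changes:
    # Removed lines start with "-", added lines with "+".
    print("\n".join(changes))
```

In practice, the two texts would be read from two dated versions of the same Markdown file in the downloaded archive.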

Funding for the PGA has been provided by the hosting institutions as well as by different partners and funding schemes such as the EU Horizon 2020 project reCreating Europe, Wikimedia Deutschland and the Data Science Center (DSC) at the University of Bremen.

Future directions 

In the future, the PGMT Lab will continue developing the PGA by merging the historical dataset with the ongoing data collection into an integrated dataset. The roadmap also includes the addition of more platforms and more language versions. The PGA has been used for a growing body of research on platform policies and enables researchers, journalists and the public to answer questions on the historical evolution of platform policies.