Automated access to (non-personal) data for research purposes on disinformation under the Strengthened Code of Practice

Nathaniel Persily observed a few years ago that «Researcher access […] is a condition precedent to effective tech regulation. Right now, we do not know what we do not know.»[1] The perception in the EU at the time was similar, despite the 2018 Code of Practice on Online Disinformation and its commitment, under the pillar on Empowering the Research Community, to share privacy-protected datasets.[2] Rasmus Kleis Nielsen and Madeleine de Cock Buning similarly lamented that the near-total absence of independent evidence makes evidence-based policy-making almost impossible in the field of disinformation.[3] Despite platform commitments, as noted by Rebekah Tromble, «scholarly research on disinformation and its impacts has been severely hampered by a continued lack of platform transparency, by failed (or failing) platform-academic partnerships, and by continued—and sometimes even increasing—barriers to data access.»[4]

The limits of the first Code of Practice are well known, with the main official assessments all pointing in the same direction: providing researchers with access to platform data was one of the areas in which online platforms had perhaps most blatantly failed to deliver.[5] The Strengthened Code of Practice set out to remedy this through increased accountability and enforcement mechanisms and much more detailed commitments under the Empowering the Research Community section. These commitments cover the thorny issue of access to private data, but the chapter also opens with a clear commitment to provide access to non-personal data and anonymised, aggregated, or manifestly-made public data for research purposes on disinformation, «wherever safe and practicable, to continuous, real-time or near real-time, searchable stable access […] through automated means such as APIs or other open and accessible technical solutions allowing the analysis of said data.»[6]
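To make the commitment concrete, the following is a minimal sketch of what «continuous, real-time or near real-time, searchable stable access […] through automated means» could look like from the researcher's side. Everything in it is an assumption for illustration: the endpoint URL, parameters, and field names do not correspond to any signatory's actual API.

```python
import time

import requests

# Hypothetical researcher API: the endpoint, parameters, and field names
# below are illustrative assumptions, not any signatory's real interface.
BASE_URL = "https://platform.example/researcher/v1/posts"
ACCESS_TOKEN = "RESEARCHER_ACCESS_TOKEN"  # assumed to be issued after vetting


def poll_public_posts(query):
    """Continuously poll a searchable endpoint for manifestly-made-public
    posts matching a query, following pagination cursors to approximate
    near real-time, stable access."""
    cursor = None
    while True:
        resp = requests.get(
            BASE_URL,
            headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
            params={"q": query, "cursor": cursor},
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()
        yield from data.get("posts", [])
        next_cursor = data.get("next_cursor")
        if next_cursor:
            cursor = next_cursor  # more pages available: keep paginating
        else:
            time.sleep(60)  # caught up: back off, then re-poll same cursor


# Example: stream public posts matching a disinformation-related query.
for post in poll_public_posts("climate hoax"):
    print(post.get("id"), post.get("timestamp"), post.get("text"))
```

The cursor-based design is one common way such access is built: a stable cursor lets a researcher resume where they left off, which is what makes continuous, reproducible data collection (rather than one-off scraping) possible.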

The multiple roles of independent research on disinformation in EU policy

The mantra that only an increased understanding of disinformation patterns can lead to appropriately informed policy-making appears today to permeate disinformation policy-making itself. Researcher data access is not confined to self-regulatory instruments such as the Code: it is also a central provision of the DSA.[7] So much so that Alex Engler has called it the lynchpin of the legislation's efficacy,[8] while Martin Husovec calls researchers 'the heroes of the DSA'.[9] Researchers are called in to play multiple roles here. The first, and perhaps the most important, is to provide public accountability, given the lack of transparency and the extraordinary complexity of understanding the online information ecosystem. Data science is increasingly called upon to interpret datasets, to assess the impact of specific actions by different actors, including platforms, and inevitably also the role played by recommender systems and algorithms in this context. In an increasingly data-driven and AI-powered knowledge space, researchers translate the realities of the online sphere into common language and open them to public scrutiny, and researcher APIs have provided key avenues to do so. There is no doubt that with growing threats, such as the war in Ukraine and the Covid-19 pandemic, and the ensuing reactions to them, has come an increased urgency for a better understanding of the patterns of disinformation online. And there is a huge unexplored potential of business-to-science data sharing that goes well beyond disinformation studies.[10] The role of researchers spans from auditing algorithms and understanding the implications of online content moderation practices to studying specific responses to disinformation by the myriad of stakeholders involved in tackling it. Researchers ultimately help to shape the agenda for regulators and for providers of digital services themselves.[11]

But researchers may also play a role in assessing the transparency efforts of platforms: it has even been argued that the European Commission is outsourcing enforcement and oversight of the DSA to independent researchers.[12] While independent researchers should by no means substitute independent auditors, they may provide an independent check not only on platforms' self-assessments but also on the independent audits themselves. Similarly, in the context of the Code of Practice, increased access to data may allow for independent oversight of the signatories' current self-reporting. Given the highly technical nature of the Strengthened Code of Practice, researchers can also play a key role in subjecting the actions platforms report taking in response to their commitments to reality checks, informing the public about their possible impacts or any shortcomings. To do so, researchers will of course need the necessary data to gather independent evidence. Finally, independent researchers may play a role in assessing the overall effectiveness and possible shortcomings of the policy instruments themselves. It goes without saying that if researchers are to perform all these complex roles, they will need to be adequately resourced to do so in full independence.

Automated access to data for research in the first platform reports under the Strengthened Code

The first reports under the Strengthened Code of Practice were published in the new Transparency Centre in January 2023. While the research community has yet to pronounce itself on the full content of the baseline reports, the length and detail of most reports, developed along standard templates, are clear signs of improvement over reporting under the previous Code of Practice. At the same time, questions have already been raised about the verifiability of the reports in light of limited data access for researchers,[13] while Twitter announced, a few days ahead of the publication of the reports, that its researcher API would no longer be available for free. Twitter's report under the Code openly refers to its long-standing, industry-leading API program, yet makes no mention of the plan to start charging for access, and concern has been raised widely about the implications of such a move away from free access for researchers.

The first baseline reports offer only a starting point for understanding what reporting under the Strengthened Code will look like, and they capture many areas that may be in flux as platforms change their policies to align with the Code. With regard to the research empowerment provisions, platforms have started to report on the details of the APIs they offer researchers to access non-personal, anonymised, aggregated, or manifestly-made public data. Some reports, like those of Twitter and Meta, may be seen as a small step towards reassuring researchers that existing products such as CrowdTangle will not be dismantled despite reports to the contrary; for those VLOPs without a track record of researcher APIs, some initial, slow movement in the right direction may be starting to happen. TikTok reports that it has been working on a global transparency API to provide selected researchers with access to public and anonymised data, with a dedicated misinformation API also in the making. The reality of its first API may be somewhat disappointing for a European audience, with access open only to US researchers for the time being and Terms of Service that are proving to be a minefield for researchers. YouTube, too, reports launching its Researcher Program in July 2022, which «provides scaled, expanded access to global video metadata across the entire public YouTube corpus via a Data API.»[14] While the number of researchers reported to have accessed the Data API during the reporting period – fewer than 15 – may dampen researchers' excitement about the new program, there is hope that increasing openness is the direction in which YouTube is heading. Although the commitment on access to public data under the Strengthened Code does not spell out that access to APIs should be free, there is a general feeling that compliance means free access, that progress should at least be incremental over time, and that platforms should be increasing access to data for research rather than reducing it.
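As a point of reference for what metadata access «via a Data API» involves in practice, below is a minimal sketch using the public YouTube Data API v3, the API surface the Researcher Program reportedly extends with scaled access. The API key and video ID are placeholders, and the quota and scope granted to vetted researchers are assumptions beyond this public surface.

```python
# pip install google-api-python-client
from googleapiclient.discovery import build

# Client for the public YouTube Data API v3; the Researcher Program is
# reported to extend this surface with scaled access for vetted academics.
youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")  # placeholder

# Fetch public metadata for one or more videos: title, channel, statistics.
response = youtube.videos().list(
    part="snippet,statistics",
    id="VIDEO_ID",  # placeholder: any public video ID
).execute()

for item in response.get("items", []):
    snippet, stats = item["snippet"], item["statistics"]
    print(snippet["publishedAt"], snippet["channelTitle"], snippet["title"],
          stats.get("viewCount"), stats.get("likeCount"))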

 

[1] N. Persily, Journal of Online Trust and Safety, October 2021.

[2] 2018 Code of Practice on Disinformation, Section II.

[3] R. Kleis Nielsen – R. Gorwa – M. de Cock Buning, What Can Be Done? Digital Media Policy Options for Strengthening European Democracy, Reuters Institute Report, November 2019.

[4] R. Tromble, A Paucity of Data: The Digital Platforms’ Responses to Pillar 5 of the Code of Practice on Disinformation, The George Washington University, May 2020.

[5] Commission Staff Working Document, SWD(2020) 180, Assessment of the Code of Practice on Disinformation – Achievements and areas for further improvement, 11.

[6] 2022 Strengthened Code of Practice, Commitment 26.

[7] Digital Services Act, Article 31.

[8] A. Engler, Platform data access is a lynchpin of the EU’s Digital Services Act, 15 January 2021, Brookings.

[9] M. Husovec, Will the DSA Work?, 9 November 2022, Verfassungsblog.

[10] S. Verhulst – A. Young, Identifying and addressing data asymmetries so as to enable (better) science, Frontiers in Big Data, no. 5/2022.

[11] M. Husovec, Will the DSA Work?, 9 November 2022, Verfassungsblog.

[12] A. Engler, Platform data access is a lynchpin of the EU’s Digital Services Act, 15 January 2021, Brookings.

[13] J. Albert, Platforms’ promises to researchers: first reports missing the baseline, 16 February 2023, Algorithmwatch.org.

[14] See YouTube Research Program Launch.
