Access to Platform Data – 27th Nov 2020

One of the issues I am working on at the moment is the use of data held by digital platforms for research purposes.

This is a space where the prevailing attitude of many people in the academic and policy communities might best be described as “extremely frustrated”.

They believe that many important questions for society could be answered (or at least usefully explored) if there were more access to platform data (and algorithms).

They see platforms as unnecessarily obstructive in terms of the information they are willing to share voluntarily, and are increasingly looking to regulation as a means of forcing platforms to disclose more.

There is a lot to unpick here about what this might mean in practice – who should get access, which kinds of data they should be able to access, and what conditions might have to apply to the access.

There are particular sensitivities around access to personally identifiable information and there can be significant legal compliance issues here that are more onerous than for other kinds of public or aggregated data.

These questions about the implications of data protection law for research data are the subject of much debate at the moment and I do not plan to go into them in this post.

If you are interested in this field, you may wish to contribute to a proposed initiative by the European Digital Media Observatory to look into the legal questions in some detail [NB I am a member of the Exec Board of EDMO].

For today, I wanted to consider the other side of this equation – if researchers are feeling frustrated, then why might platforms be hesitant to release data and resolve that frustration?

My motive in writing about this is one of trying to get to a place where platforms do release more data – it is not to excuse platform inaction but rather to understand the mindset better so that progress can be made.

I strongly believe that it is in the platforms’ own interest, as well as the public interest, to keep improving their transparency and make more data available.

Scrutiny of large platforms is never going to stop – the choice is between having that scrutiny depend on partial data and/or assumptions made in the absence of data, or having it depend on comprehensive, reliable data.

There is a parallel with the UK’s somewhat qualified right to remain silent, which says:

You do not have to say anything. But, it may harm your defence if you do not mention when questioned something which you later rely on in court. 

UK Police Caution

Where a platform is initially hesitant to provide information about an issue under scrutiny, this should not be read as proof of guilt, but it is likely that people will apply a discount to information that is only provided late in the day.

If platforms believe that data helps them to make their case in response to claims that they are causing harm then they will benefit most when they make this available early on in (or even in anticipation of) the public debate.

Not Just Platforms

This post will naturally reflect my experience from being in conversations about sharing large digital platform data, but these dynamics apply more widely to any organisation that holds interesting data.

Before working at Facebook, I was involved in a UK government initiative called the Power of Information Taskforce which was attempting to get more government data into the public domain.

The attitudes I describe below were present in varying degrees in many of the organisations that held public sector data and were pushing back against efforts to have them share more.

You may be thinking that this is more easily fixable in the public sector as Ministers can just instruct government entities to disclose information, but in reality there is rarely a simple stick to wave that overcomes all obstacles where someone is not persuaded that disclosure is the “right thing”.

Let’s get on to the pop psychology and flesh out why platforms may end up doing what appears to be the “wrong thing”, contrary to societal and their own interests, on data disclosure [where there are no legal obstacles].

Embarrassment

“I felt ashamed."

"But of what? Psyche, they hadn't stripped you naked or anything?"

"No, no, Maia. Ashamed of looking like a mortal -- of being a mortal."

"But how could you help that?"

"Don't you think the things people are most ashamed of are things they can't help?”

C.S. Lewis, Till We Have Faces

I will make the bold claim that no organisation has perfect datasets – in the sense that they contain the right data, only the right data, and nothing but the right data.

I would be happy to be told about exceptions to this rule but my experience working in public sector organisations – the NHS and Parliament – and private sector entities – Cisco and Facebook – is that data is invariably less tidy internally than may be presented externally.

People within an organisation learn the workarounds that are needed for their datasets, and there will be periodic attempts to tidy them up before a certain amount of entropy creeps in again over time.

As long as the data is only exposed to people within the organisation then this is all manageable, but when a request comes in to make it available to strangers then those oddities and anomalies become a source of potential embarrassment.

A domestic parallel is when you offer to lend your house to someone.

You get used to the fact that things don’t work perfectly in your home and there is a list of workarounds that people who live in your household all know and follow daily.

You have internalised these quirky instructions, and you may be mildly embarrassed explaining them to close friends and family who come to stay but can make a joke out of it and trust they will understand.

But if you are faced with strangers staying at your house then explaining what doesn’t work may take the embarrassment to another level, and you might instead either put off the guests until you have fixed everything or not have them stay at all.

This is not meant to trivialise the importance of managing data correctly, especially when it comes to personally identifiable information where there are legal as well as professional obligations to do this well.

But it is an attempt to help understand one of the feelings that will come into play when people who manage datasets are asked to share them.

They may be entirely relaxed about internal sharing, cautious but persuadable when it comes to sharing with a close circle of trusted partners, and stressed and resistant when asked to share the same data with external entities.

In some cases, the problems with the data may be serious and the owner should rightly be embarrassed about the state of it.

You may feel that this is precisely one of the benefits of transparency – that it brings errors in data into the light, and while this is often true it is not necessarily compelling for the person making the decision to open up.

More typically for well-run organisations, any issues with the data will be explicable and comparable with those of their peers, but the thought of having to expose them can still remain daunting.

To add some colour, we can consider the kind of issues that might surface when you put together data about platform content takedowns for release.

You might find that a logging system stopped working during a data collection cycle but this was only picked up when the next quarterly statistics were compiled and seemed off the mark.

Or an error in a training manual might mean that some content reviewers in some locations attached the ‘pornography’ code to ‘hate speech’ content or vice-versa.

These are not the kind of errors that create a material data protection risk but they can mean that the platform does not have reliable data to share externally – at best they can provide only partial data and an explanation.

Within an organisation, those who are closest to the data may be especially hesitant as they both understand the precise nature of any anomalies and fear they will be blamed for the fact that they exist.

Where an organisation takes a decision in principle to share data, they may only later discover issues with its integrity and will then either try to delay the release to fix things, or backtrack on the release if problems cannot be resolved.

Overload

Won't you help me cure this Overload
Won't you help me cure this Overload

Zappacosta, Dirty Dancing Soundtrack

If concerns about the quality of the data within an organisation generate embarrassment, then concerns about what people externally might do with released data arouse fears about the knock-on workload this might create.

These concerns fall into two buckets – “bad” interpretations of the data, and data releases creating even more demand for data.

It is natural for people within an organisation to feel that they have a better understanding of the data they hold than people on the outside.

As long as insiders are doing the analyses of the data then the organisation gets to interpret and frame the results according to its own expert knowledge of what is happening.

If their analysis is rigorous then an outsider should come to similar conclusions when they have access to the same data and are asking the same questions.

But the starting point for many insiders is the worry that outsiders do not have the same deep knowledge of the data, and perhaps that they do not have the same analytical skills, and so they will arrive at the ‘wrong’ conclusions.

Again, proponents of transparency will feel that this is precisely why access to data is useful – that it allows various researchers to look at the same problem, and that if they reach different conclusions then this is a really interesting outcome that merits more work.

But from the insider perspective, if they have done an analysis and it is “right”, as far as they are concerned, then having to argue and fix things when outsiders produce a “wrong” analysis can feel like all downside.

Insiders are also likely to have access to a much wider range of data than any outsider, and may be suspicious that releasing some data will end up creating endless cycles of demand as outsiders seek parity.

If insiders thought a particular data release would satisfy people and allow them more space to get on with other work, then they might happily support this, but their suspicion is that the reverse will apply and each release means more of their time being pulled into arguing with outsiders.

This can create a dynamic where some people in an organisation are keen to push data out to satisfy an external party while colleagues are hesitant as they fear they will be diverted into responding to external analyses and meeting ever more extravagant data demands.

At this point, you may be wondering how large, wealthy organisations like digital platforms can ever claim they are overloaded, especially when you compare their resources with those of most external researchers.

There is definitely scope for more overall investment by platforms in supporting research, and there may be advantages to certain structures, eg having team members whose only job is to look after external researchers.

But there are also limits to what you can achieve by throwing people and money at some problems.

Where things are changing fast, as is often the case in the digital world, there may only be a small group of people fully up-to-speed on particular datasets and the methods used to analyse them.

The long-term solution may be investment to expand this group, if there is sustained demand for their expertise, but there will still be a lag where they feel overloaded and act as a bottleneck, however big the overall organisation.

The hesitancy I have just described relates to workload issues on people within an organisation rather than there necessarily being any significant concerns about the intent of external researchers.

These intent questions are where we turn to paranoia as the final element of this potent mix.

Paranoia

“Just because you're paranoid doesn't mean they aren't after you.”

Joseph Heller, Catch-22

Demands for transparency rarely see the provision of data as an end in itself – it is rather more likely that there is an underlying concern about the behaviour of an organisation that prompts the demands to be made.

A prominent current example of this is the claim that ‘social media is destroying democracy’ which has fuelled significant demands for data from social media companies.

These demands are often explicitly framed as ‘we know you are a threat to democracy, but we really need your data to demonstrate just how big a threat you are.’

The culture of academic research is such that platforms should be confident that it will find what is there, and so is as likely to show that platforms are benign as malign in some particular aspect, if that is indeed the case.

But when the public debate is loud and premised on the assumption that you are the ‘baddies’ then it is hard to escape the feeling that you will not get a fair hearing.

This creates a cycle of mistrust where outsiders see an organisation’s hesitancy as proof they are ‘covering something up’, while insiders have ever-decreasing trust that outsiders would be fair and impartial in using released data.

It is essential that researchers are free to determine their own questions within the ethical and legal frameworks that apply to their institutions, and inevitable that some of these will be ones that platforms feel are ‘hostile’.

In normal times, most questions will be ‘interesting but non-threatening’ to an organisation but in the current climate around digital platforms there is a high quotient of interest in areas that are ‘paranoia-inducing’ for them.

Again, proponents of transparency may feel this strengthens the case for external access to data as they expect platforms to filter out research questions they feel are hostile in their internal research programs.

It is hard for any organisation to feel they are facilitating work that is against their interests, or may even threaten their very existence, so this is likely always to be a challenging area.

But it will become proportionately more difficult where natural concerns about threats get spun up into paranoia and an organisation starts to treat everyone as hostile rather than seeing a more balanced picture.

Push Me Pull You

By this stage, those involved in these decisions around data-sharing may feel hard done by as I have described them as overloaded paranoiacs suffering from acute embarrassment about their dodgy data.

As some of these people are former colleagues, I hope they will not take this too personally [NB it is entirely coincidental that I am putting this out when many of them will still be digesting their Thanksgiving turkey and not online].

But the point of my working through these is an attempt to be more specific about the factors affecting these decisions rather than seeing this in broad brush ‘platforms don’t want to share data because they are evil’ terms.

And, having potentially offended former colleagues on the platform side, I may also be irritating those on the researcher side as they think ‘why the hell should I worry about how these people feel, just share the data already’.

The big stick of regulated data disclosure is on the agenda and is likely to come into use in more places over the next few years – the EU and UK are certainly talking up their intention to legislate for this.

In the regulated model, there is no need to worry about how platforms ‘feel’ about disclosure as they will just have to do it or pay the penalty.

But even if disclosure will increasingly be required, rather than requested, there are benefits to working through the dynamics that tend to hold this up within the regulated entities.

Open discussions about data quality and sharing of information about any weaknesses in particular datasets will be more useful to researchers than playing games of hide and seek where platforms try to conceal less-than-perfect data (ie most of it) and outsiders are looking to cry ‘gotcha’ and embarrass them over any errors.

Agreeing priorities for data releases and a realistic assessment of the capacity of platforms to respond to external requests (which should certainly be more than it has been but will never be infinite) can help overcome overload resistance.

Paranoia is perhaps the hardest state to address as it is always likely to be the case that research will focus on more problematic and sensitive areas for platforms.

And there is an additional challenge in the lens through which many decision makers in businesses see external research: they encounter it through high-profile media stories rather than through the broad scope of work being done and quietly published in journals and other fora.

I do not want to sound like an ‘anti-MSM nutter’ here, as I think it is entirely logical for stories about research finding problems to hit the headlines rather than a good story being ‘there is nothing to see here’, but this can create a real perception issue that tends to accentuate the paranoia.

I can contrast the research papers I now get to read in study groups at the Reuters Institute for the Study of Journalism with the diet of news stories I fed on while working at Facebook, and these paint a very different picture.

If we take one particular issue, the ‘filter bubble’, most news stories take this as a proven fact and focus on research (or anecdote) that backs it up, while the broad span of research tends to demonstrate that the theory is weak [while pointing to other potential problems so platforms are not off the hook].

If you are a decision-maker at a platform being asked to release data for filter bubble research, this may trigger your paranoia based on the stories you see in the news, when the bigger research picture would make you feel much more relaxed about opening up.

Eyes on the Prize

The end goal is for us all to be able to benefit from research that helps us understand the various impacts of the platforms that play such an important part in our lives.

There are lots of tools available for researchers to collect their own data and opportunities for us to provide data directly to researchers about our use of digital services, but it will always be hard to understand what is happening without access to data held by platforms.

There are three tracks running that are likely to result in improved access.

The first is the consideration of any legal and/or technical issues that might act as barriers to access even where a platform wants to share data.

I have not explored these questions in this post but hope that various initiatives, including that of EDMO, will help us make progress on them.

The second is the development of legislative instruments that will require new forms of disclosure by platforms.

There are several templates for how laws might do this from the transparency requirements in NetzDG in Germany to the real-time models described by a French government working group.

The third is a step change in the willingness and capacity of the platforms themselves to share data with researchers.

Understanding the dynamics raised in this post may help us to accelerate (or initiate if you are more sceptical about where platforms are to date) this change.

2 Comments

Philip Virgo

I very much agree.

In my original (late 1960s) training I was taught that datasets contained random and/or systemic rubbish unless maintained by staff who relied on their accuracy and had the authority to make corrections. Later, as a Corporate Planner at the Wellcome Foundation (including responsibility for assembling the annual UK R&D budgets), I well remember the cost/effort involved in tidying up research datasets before they were used in support of applications to register new drugs.

November 28, 2020

Neal

It would be interesting to see a similar post but from the other side – why do researchers want the data and what are the benefits? You mention it very briefly – to benefit society. But following the logic of your analysis makes it seem like researchers and regulators want the data for all the reasons the platforms fear – to embarrass them by exposing issues that create headlines and perhaps not coincidentally promote the careers of the researchers and regulators. Clarity on the benefit to platforms and consumers might help bridge this “trust” gap.

November 30, 2020