Anti-Human Data Analysis

I was taught to “treat everyone the way you want to be treated”. It is difficult for me to look at some data science projects that use data about humans without seeing some serious ethical problems alongside major logical issues. I hate to hop on a “moral high horse”, but hear me out.


It is unethical to reduce humans and their activities down to numbers and categories – anonymized or not.

The only way to have humans and human activities recorded inside a dataset is by reducing those humans and their actions down to numerical and categorical variables, which is inherently unethical, because it literally strips them of their humanity. Humans have intrinsic value beyond the handful of qualities or characteristics that can be recorded about them in a table, and humans exist in a multifaceted world whose nuances cannot be represented properly in a database. Therefore, humans and their actions are inherently misrepresented in datasets. The models built on such data may negatively impact their lives in the future for no justifiable reason – the models themselves are built from incomplete information.

Missing context and human perspective of collected information

This incomplete information completely misses the motives, purposes, and reasons for why certain things happen. Analyzing the husk of myself – one so shallow that it can be represented by a data table – is a crime against humanity. It whittles away all the human aspects of life into easily definable chunks for the computer. In doing so it strips away all the things that make me human. Databases treat me like a number and not a human, like a product with my barcode, SKU and item characteristics, and it is wildly unethical to analyze that data and make decisions that impact my life, as if this shallow representation of me is even close to the human being I am. Models are built to approve a loan application or determine insurance pricing, from nothing but a husk of who the person applying is.

Take all the data the world has on me and try to build a Daniel. You won’t even get close. Digital information systems are too rigid to reflect what happens in the world around us, or to provide more than a shallow picture of a person.

Humans are complex and deserve to be treated as “innocent until proven guilty”

Humans deserve to be treated with the dignity that comes with being a living human being. It is “cheap”, naïve, and inhuman to toss people’s entire lives into boxes and judge them for it. People have an entire spectrum of motives for taking the actions that they do. And sometimes they have no choice about being put into a certain category. Automatically treating someone as “guilty” for a data point, say a black mark on their credit history, doesn’t give them any option to explain or appeal that decision.

You are nothing in a dataset beyond the set of characteristics put in there about you. And you can’t add context, change it, or in many cases, even view it! No human will hear your story about the context behind that late-night credit purchase at the cookie shop and be able to wrangle the machine to go easier on you – from the data point of view, you are now a late-night sweets muncher prone to a hundred medical problems, so your health insurance policy will cost more, and there is nothing more to say. No amount of human reasoning or oversight to provide checks and balances changes the fact that the digital system is unable to properly represent human lives, personalities, motives, and actions. This is not an ethical way to treat human beings.

Weapons of Math Destruction is a great book and worth the read. One thing that blows the mind is “recidivism” models – designed to predict if criminals will commit another crime – which are actively influencing judges' decisions in some parts of the US.

A data center with water cooling pipes. Some data centers use so much water that protests have erupted to save the drinking water supplies for the humans who live nearby.

Unverified and false data

Datasets are chock-full of data points that were never verified as true by the very people whose information was recorded in them! Data might not only be false, but also unknown to be false, which creates a particularly insidious level of danger during modeling.

For example, years ago I downloaded my Instagram content when I decided to delete my main account. Alongside the pictures I had uploaded, the export had a variety of other files and data they held about me, which included my interests. Seriously, Instagram just guessed at what my interests were based on the things I liked, and there was a list of dozens of interests and interest groups, many of which I had zero real interest in. One that comes to mind was “magazines”. Where did this even come from, and why am I tied to it inside a database somewhere? If they had asked me about it, I would have laughed. I’ve read maybe one issue of a physical magazine in the past 10 years.

Anonymizing is not a solution

Even anonymizing the data, which regulations like GDPR encourage, provides no ethical justification to reduce people down to numbers and categories. What good is it if I change your name to Bob, and still reduce you to numbers and categories as before? Anonymization is an attempt to solve the privacy concern and doesn’t address my core argument here. The fact that GDPR ruins some data science projects is a good thing – it means people are being protected.

Yes, ethical analytics exists

Disclaimer: Not all data science projects deal with data that has humans and their activities recorded in it, or use this information like heuristics against humans in negative ways. Not all data scientists are heartless! Plenty of major data science efforts are in fields that aren’t trying to predict humans' activities or the like. The pop culture view of data science is “the algorithm” which tracks you around and shows you ads, and this doesn’t represent the entirety of the field.

Is analytics just a sly way to judge people in shallow ways?

Most of my argument is essentially: all we’re doing is using a computer to make shallow judgments about people, which is exactly what we were taught NOT to do, and it should be easy to see that this is unethical. The golden rule of “treat others like you would want to be treated” applies to people building models too. If a human were to act as judgmental and arrogant as the algorithm I described (reducing people down to data points and using those against them), we would scoff at their childishness and (hopefully) they wouldn’t be given much power in the world. Putting this arrogance and coldheartedness into a computer only justifies it for data scientists because nobody feels bad about it.

In the abstract world of datasets, it is easy to shirk the responsibility of treating human beings as humans because it doesn’t feel real – the people in the data are already numbered and categorized on the screen in front of you. The weight of the fact that these people are reduced to this inhuman state is forgotten. This, alongside “computer disease”, and a nice paycheck, are driving people to give up their humanity and use the tools against us all.

The exploitation and optimization of people’s collected data via algorithms reminds me of the Milgram experiment, where the fact that participants couldn’t see the human getting hurt was potentially a deciding factor in their decision to press the button. It also reminds me of the scene in Fahrenheit 451 where people in a car try to run Montag over for the thrill of it. After all, what value does a human body have?

Imagine if data scientists had to ask – in person – the people who they were analyzing if it was OK to do so. A chunk of the field would vanish, and that’s the chunk that I take issue with.

Sorry to be on some moral high horse, I’m just being honest. I am not the face of moral perfection, I am human.

David Foster Wallace - “This is Water”

Daniel