A blueprint for the responsible use of computer vision in society
Dr. Sean Carnaffan
The ever-increasing use of computer vision in the public domain is a double-edged sword. As a society, we only get one shot at doing it right. Here’s how we do it.
As the effects of COVID-19 reverberate around the world, it can sometimes be difficult to imagine anything else dominating the headlines. However, one subject that has seen a wealth of debate and discussion in recent months is computer vision: the use of algorithms to allow computers to perceive the world visually through photos and video. On one hand, this disruptive technology has delivered some incredible capabilities, with applications from medicine, to public safety, to professional sport. On the other, it has been a source of exacerbated racial bias in law enforcement, and of human rights violations on a mass scale.
In its current state, computer vision technology sits in an extremely precarious balance. It has such a potent disruptive capability that late adopters could soon see themselves left in the wake of competitors who are more willing to leverage its value. However, those who do adopt the technology face a host of ethical and technical landmines to navigate in its deployment. This leaves businesses and governments between a rock and a hard place: seemingly having to choose between either falling behind their competitors, or becoming the proverbial ‘Big Brother’, intruding on the public's privacy for profit.
As a technology-driven company (with computer vision capabilities) and an independent advisor to government and major organisations, Smash Delta would like to weigh in on the debate to give its stance on this divisive technology. In particular, we assert the existence of a middle ground that we should strive to achieve. We believe that computer vision technology can be leveraged to obtain powerful insights, while still respecting ethical principles, provided that a set of ground rules is adhered to. In this article, we suggest such a set of ground rules, including a move away from intrusive applications like facial recognition, and towards ones which drive societal benefit while prioritising privacy and equity.
To frame up the context for these ground rules, we will first take a deeper look at the controversy around computer vision in modern society.
A ‘round the world’ on computer vision
Computer vision (especially facial recognition) has hit the news several times in recent months, almost always highlighting its negative societal impacts.
It is well known that the use of facial recognition technology for public surveillance is rife in China, where it plays a central role in the Chinese government’s social credit program and is a centrepiece of its highly controversial COVID-19 contact tracing capabilities. However, the use of the technology as a mechanism for authoritarian mass surveillance is by no means confined to China. In the US, the case of Robert Julian-Borchak Williams made international news in June as the first case of a person wrongfully arrested following a false match from a facial recognition algorithm. That is, the first known case: the wide-scale use of this technology leads experts to believe that many others have most likely been wrongfully targeted by law enforcement due to false facial recognition matches. In addition, a recent UK Court of Appeals ruling found that facial recognition had been applied unlawfully and in violation of data privacy standards under the European Convention on Human Rights, and that systems had been deployed without being proactively checked for racial and gender bias, in breach of the Public Sector Equality Duty. Closer to home, the Australian Federal Police have admitted to recently trialling controversial facial recognition software provided by the privately-owned Clearview AI. Clearview AI’s facial recognition database, built by bulk-scraping images from social media sites such as Facebook and Instagram, powers many of the systems that law enforcement departments overseas have been criticised for using. In short, around the world, even in western democracies and on our own shores, we are seeing people’s faces misused, particularly by authorities applying this technology in an indiscriminate manner.
This misuse of faces has prompted major organisations and governments around the world to distance themselves from, and even ban, facial recognition technology. A number of prominent players in this space (including Amazon and IBM) have recently declared indefinite pauses on providing facial recognition technology to law enforcement. Many jurisdictions in the United States, most notably tech hub San Francisco, have taken the decision out of the tech giants’ hands and simply banned facial recognition use by law enforcement altogether. The city of Portland has gone a step further and outlawed its use by public-facing commercial businesses as well.
This movement away from facial recognition is hugely important (and overdue) in terms of preserving individual rights and freedoms. However, it should not mean the end of computer vision technology altogether. Instead, it is crucial to recognise what elements of current facial recognition technologies make them so invasive and prone to unethical use, and create a framework for computer vision that strips these elements away.
As a starting point, it is important to recognise that the term ‘computer vision’ describes a whole category of technologies, not just facial recognition, which vary greatly in their level of intrusiveness. Computer vision technology serves an enormous range of uses which span the entire ethical spectrum. In particular, there are many applications of computer vision technology designed for social benefit which do not need to use facial recognition, or link back to any user’s identity. Some examples include:
Public safety: protecting people from violence, overcrowding, hazards, and even the coronavirus, by helping to understand adherence to mask-wearing and social distancing advice
Helping with the safe provision of gambling or alcohol: assisting staff in venues to identify problem behaviours
More commercial applications, like retail analytics, medicine and diagnostics, and the assessment of elite performers in sport, which can leverage computer vision in ways that are far less intrusive than facial recognition to produce high-value insights
Again, all of the above can be achieved without facial recognition or any knowledge of any user’s identity.
Clearly there exists a plethora of possible uses of computer vision technology that can provide a positive benefit without infringing on our rights. So instead of throwing out computer vision technology altogether, we should strive to thread the needle in order to get the ethics right. We need to favour uses, such as the above, that are designed to be impactful without being intrusive, and move away from uses that put individual freedoms at risk. If we do this, it will enable us to harness the incredible potential in this technology without enabling the dystopian side effects.
But mastering the ethical application of computer vision is much more challenging than simply meaning well. It requires a rigorous framework, bolstered by an understanding of the underlying models and technology at play, which we will now get into.
The nuts and bolts (How do computers see?)
Broadly speaking, computer vision is the task of using algorithms to parse visual data inputs (photos and videos) to identify what is in them, i.e. teaching computers how to ‘see’. (In the context of this article, it will be assumed that these photos and videos are of people, or at least of scenes containing people, since this is most relevant to an ethics-based discussion.) There are a multitude of different tasks that fall under the computer vision heading, with an important distinction between ‘detection’ and ‘recognition’ tasks. Detection refers to finding specific objects (e.g. find all ‘people’ or ‘cars’ or ‘staircases’ in this image) while recognition refers to taking images of people and answering questions like “how similar are these people’s appearances?” or “are these two images of the same person?”.
As an aside, person recognition does not necessarily mean ‘re-identifying’ them in the sense of uncovering their identity. This is an extra step beyond person recognition, where a match is found specifically against a bulk database of images with associated identities, often scraped from social media such as Facebook or Instagram (as in, for example, Clearview AI’s controversial algorithm). For clarity, using recognition to learn people’s identities will be referred to as ‘connected’ recognition. This will be in contrast to ‘isolated’ recognition, in which images are not cross-referenced to an external database to learn identities.
Computer vision technologies, especially those used in modern recognition tasks, generally fall under the category of machine learning. This essentially means they require a lot of labelled training images, and a good statistical model to extrapolate the ‘rules’ associated with those labels. Generally speaking, deep learning models, particularly artificial neural networks, are best suited to this task.
In the domain of person recognition, however, there is an interesting and clever subtlety to how these neural nets are designed (and specifically what they ‘spit out’). This subtlety is not only interesting from a technical perspective, but it can be leveraged to help shore up computer vision from a privacy and ethics standpoint.
In short, whereas neural nets are usually asked to answer a ‘yes’ or ‘no’ question* by outputting a 1 or a 0, person recognition models output a list of numbers, called an ‘embedding’. Matches between two candidate people are identified by comparing how similar their embeddings are, and a match is declared if the two embeddings differ by less than a predefined amount.
*Technically ‘is this person the same as this person?’ is a yes or no question. However, in the context of person recognition, this question is usually not answered directly by the neural network. Instead, the neural network produces two embeddings, and an extra computation then compares these embeddings.
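To make the matching step concrete, here is a minimal sketch of how two embeddings might be compared. The vectors and threshold below are made-up values for illustration, not the output of any real model, and real embeddings typically have hundreds of dimensions.

```python
import math

# Hypothetical 4-dimensional embeddings with made-up values; real models
# output far longer vectors, but the matching logic is the same.
embedding_a = [0.12, -0.45, 0.88, 0.30]
embedding_b = [0.10, -0.40, 0.85, 0.33]   # a visually similar person
embedding_c = [-0.70, 0.55, -0.20, 0.95]  # a visually dissimilar person

MATCH_THRESHOLD = 0.5  # illustrative value; tuned per model in practice


def euclidean_distance(u, v):
    """Straight-line distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def is_same_person(u, v, threshold=MATCH_THRESHOLD):
    """Declare a match if the embeddings differ by less than the threshold."""
    return euclidean_distance(u, v) < threshold


print(is_same_person(embedding_a, embedding_b))  # similar embeddings: True
print(is_same_person(embedding_a, embedding_c))  # dissimilar embeddings: False
```

Euclidean distance is used here for simplicity; cosine similarity is another common choice, but the principle of ‘compare, then threshold’ is the same.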
A crucial element to note is that this approach of converting image data into embeddings to represent a person can be used to recognise people using their body, rather than their face. We call this body recognition*, and it allows recognition tasks to be performed without any face-specific data being used.
*This is different to the term usually used in the literature, which is ‘Person Re-identification’ or ‘Person Re-ID’ for short. We use different terminology to avoid a potential confusion that this technique is aimed at ‘re-identifying’ individuals by discovering their name etc. Instead, it refers to the process of taking an image of a person’s full body and turning that into an embedding for recognition, usually in an isolated way.
Having an approach that can recognise people without using their faces provides an opportunity for a huge step forward in privacy-enabled computer vision.
Firstly, it is much more achievable in a racially equitable way than facial recognition. This is because body recognition models place less emphasis on facial features and complexion, and instead tend to incorporate things like a person’s height, gait, hair colour, outfit colour etc. to create embeddings.
Moreover, it mitigates the possibility of data breaches leading to people being identified, since it is much harder to cross-reference an image of someone’s body against a database on the internet than it is with their face. There are two reasons for this. Firstly, from a data perspective, profile pictures on social media are usually public, and face-oriented, so there are far more widely accessible face images that can be scraped from the internet without access to user profiles even being required. Secondly, from a modelling perspective, full body pictures are harder to match across data sets, as they rely on contextual information like outfit, hair style etc. (which are likely to change when someone is captured on two different days).
Reframing person recognition as a “body-oriented” task rather than a face-oriented one may soon become not just preferential, but a matter of legal necessity. The Portland facial recognition ban is emblematic of a growing global movement, which has the potential to spell the end of the technology in the near future. The time to develop alternative, privacy-oriented approaches is now.
There are some other very important corollaries of being able to create embeddings to represent people. For example, they provide the opportunity for person recognition to be performed so that images do not need to be stored long term. This is because only the embeddings are required for checking matches. This is important because embeddings are not "human readable", and are therefore less sensitive than image and video data, despite containing essentially the same information. Therefore, storing embeddings instead of images adds an important layer of privacy to the data without compromising its usefulness.
Finally, the construction of embeddings also means that person recognition does not at all rely on knowledge of who that person ‘is’. That is, it is entirely possible to perform person recognition in an isolated way, rather than a connected one. This is because in order to answer questions like “are these two images of the same person?” or “is this the same person we saw an hour ago?” there is no need to link back to a name or identity; there is just a mathematical process that turns people’s likenesses into digital signatures and compares those signatures. This means that in practice, any application of recognition that does link back to who somebody ‘is’ (learning their name by cross-referencing to another database, for example) is doing so by choice, not by necessity.
This implies the potential for several benign applications of isolated person recognition, for example generating counts of unique customers in a shopping centre on a given day by detecting who has been seen earlier. This can be approached in a manner that does not require intruding on user privacy by uncovering their identity: customer counts can simply be ‘de-duplicated’ by counting the number of unique embeddings seen.
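As a rough sketch of this de-duplication idea, unique visitors can be counted by checking each new embedding against representatives already seen that day. The embeddings and threshold below are made-up illustrative values; no images or identities are involved at any point.

```python
import math

MATCH_THRESHOLD = 0.5  # illustrative value; tuned per model in practice


def euclidean(u, v):
    """Straight-line distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def count_unique_visitors(embeddings, threshold=MATCH_THRESHOLD):
    """Greedy de-duplication: an embedding counts as a 'new' visitor only
    if it matches no representative seen so far. Only embedding vectors
    are stored, never images or identities."""
    seen = []
    for emb in embeddings:
        if not any(euclidean(emb, rep) < threshold for rep in seen):
            seen.append(emb)
    return len(seen)


# Made-up embeddings from a day's footage: three sightings, two people.
sightings = [
    [0.12, -0.45, 0.88],
    [-0.70, 0.55, -0.20],
    [0.10, -0.40, 0.85],  # close to the first sighting: same person
]
print(count_unique_visitors(sightings))  # 2
```

A production system would use a more robust clustering approach, but the privacy property is the same: the count is derived entirely from anonymous digital signatures.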
By leveraging these privacy-enabled features of neural network embeddings, along with a shift away from facial recognition towards body recognition, a new suite of computer vision capabilities may be enabled: ones which are privacy-focused and more equitable, but which still allow the same useful insights to be gained.
However, understanding the general frameworks for how privacy- and equity-enabled computer vision works is one thing; having the technical capability to implement it is a completely separate challenge. There are a few different flavours of implementation, each with its own implications for security and privacy.
How computer vision can be implemented: The cloud vs edge computing
The computer vision marketplace is currently heavily dominated by cloud solution providers with their own proprietary solutions that can be deployed instantly. These services can provide powerful and efficient solutions to computer vision problems, but they face the challenge of requiring potentially sensitive information to be sent to a third party over the internet.
An alternative is for an organisation to deploy its own AI capability that functions ‘locally’ (using a computer very close to the camera site), which reduces many potential security risks. This is called ‘edge AI’, and it presents an important counterpoint to the existing cloud-dominated marketplace. Creating an edge AI is certainly harder than simply applying a pre-built cloud solution; however, there are some significant advantages that justify the investment.
The main draw card of edge AI is that it allows for the analysis of video footage to take place on-site and disconnected from the internet. Video footage can be fed to a local computer, and the relevant insights can be extracted and transmitted to a database to be stored for later use.
This approach makes it so that, for the purposes of computer vision, no raw footage ever needs to leave the premises.
For example, in a person recognition task, only the embeddings need to be sent to a central database (or, to add another layer of privacy, save only the number of unique people seen, the dwell time per person etc., whatever is relevant to the actual goal of the analysis). In a case where computer vision is being used to monitor a public space for overcrowding, only time-stamped information regarding space and person density needs to be kept. In general, only the outputs of the computer vision model need to be stored.
Specifically, this means that raw video footage can be thrown out once it is analysed.
It’s worth mentioning that cloud facial recognition services do operate on this principle (storing embeddings/outputs rather than raw footage). However, this can only take place after sending images to the cloud to create these outputs. The advantage of deploying edge computing is that the footage doesn't need to leave the site to achieve this.
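A minimal sketch of this edge pattern is below. The `detect_people` function is a hypothetical stand-in for a real on-device detector (it simply reads a pre-computed count so the sketch is self-contained); the point is that each frame is analysed locally, only time-stamped outputs are kept, and the raw footage is discarded.

```python
def detect_people(frame):
    """Hypothetical stand-in for a real on-device person detector
    (e.g. a neural network running on an edge box). Here it just
    reads a pre-computed count so the sketch runs anywhere."""
    return frame["people"]


def process_stream(frames, max_density):
    """Analyse footage on-site: keep only time-stamped person counts and
    an overcrowding flag. Raw frames are never transmitted or stored;
    only `records` would be sent to a central database."""
    records = []
    for frame in frames:
        count = detect_people(frame)
        records.append({
            "timestamp": frame["timestamp"],
            "person_count": count,
            "overcrowded": count > max_density,
        })
        # The frame itself goes out of scope here and can be discarded.
    return records


# Made-up frames from a monitored public space.
frames = [
    {"timestamp": "2020-10-01T09:00:00Z", "people": 12},
    {"timestamp": "2020-10-01T09:05:00Z", "people": 57},
]
print(process_stream(frames, max_density=50))
```

The names, frame format, and density limit are all illustrative assumptions; the structural point is that nothing person-level or image-level survives the analysis step.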
Another layer of privacy which comes from deploying an in-house edge AI, specifically in the context of person recognition, is that the mapping from photos to embeddings is unique to your model, rather than being the same as a publicly available and widely used one. Access to the model can then be granted only to those who require it for purposes you intend. The upshot of this is that, in the scenario where embeddings are stored in a database and happen to be stolen by someone acting adversarially, it is much harder for them to reconstruct the original images from those embeddings. Reconstructing original images from embeddings requires either query access to the model itself, or access to a large sample of the images used to train the model. If the model is not generally accessible, and the raw images are not stored, there is little risk that stolen embeddings can be used to reconstruct images.
Deploying edge computer vision enables a de-risked data asset and technical capability that is owned by the organisation and intrinsically embedded within its systems. This is an approach that Smash Delta champions, and works alongside organisations to enable.
The ground rules
Computer vision is fraught with ethical and social landmines, to which an easy response may be “why don’t we just shut it all down?”. However, there is great potential to harness this technology in a way that provides great social benefit, and simply blocking these uses from ever coming into existence would prevent us from achieving that. Computer vision should be used in the public domain, but if it is to be used in an ethical manner, the following principles must be followed.
Computer vision should be a means to very specific ends, not just applied ad hoc because your organisation owns images or video footage, or can install a camera somewhere. It’s important to remember that you’re handling sensitive data, and that people have a fundamental right to privacy. Therefore, any application of computer vision in the public domain should be justified by a legitimate and specific end use-case, and only the data needed to fulfil that use-case should be captured. If you can’t explain why you need a user’s data, you shouldn’t have it!
The sensitivity and scale of the data collection and analysis should always be weighed up against the value of the end use-case. The more sensitive the information being collected and analysed, the more benefit (particularly to the end-user who has provided the data) there needs to be in order to justify it.
Specifically, most everyday applications of computer vision do not require facial recognition, so move away from it where possible. We recommend reformulating problems so they can be approached with body recognition, or simple object detection.
Facial recognition should only be used in extremely rare circumstances where absolutely no other approach is fit for purpose, there is a clear demonstrable benefit, and the checks and balances to ensure privacy, equity and minimal usage are rigorously adhered to. In particular, any use of facial recognition should heavily favour isolated recognition over connected recognition, to maintain anonymity.
The vast majority of computer vision use cases do not require the long term storage of any potentially sensitive information, or any personally identifiable information. To achieve this, store the outputs of your computer vision model as your data asset rather than the actual images or video. For example in a person recognition task, only store embeddings rather than video footage. In many cases it is possible to even create completely de-identified summary information which virtually removes identification risk altogether (e.g. converting video into counts of unique people seen, or metrics on crowding). The use cases that carry the least identification risk are those which never need any kind of person-level data to begin with (e.g. assisting with public safety using computer vision to detect incidents and hazards, and only keeping time-stamped logs of incidents detected).
Be aware of risks that exist whenever sensitive, person-level information is being stored, particularly raw video and images which identify individuals. Edge AI offers a way to reduce potential vulnerabilities by analysing video on-site, minimising the number of instances where sensitive data is changing hands and could be stolen or interfered with.
Proactively examine systems for evidence of racial bias and discriminatory outputs. In particular, take a more holistic view of model performance than simple “accuracy”. Enquire as to how false positive and false negative rates vary across ethnic and gender groups, and use this to determine where more training data or model recalibration may be needed. In addition, consult with a diverse range of stakeholders in order to understand your potential blindspots, and work continually to correct these. If a demonstrable bias exists, take systems offline while the necessary improvements are made.
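One way to operationalise this check, sketched with made-up evaluation data, is to compute false positive and false negative rates separately per group rather than reporting a single overall accuracy. The group labels and results below are entirely hypothetical.

```python
def rates_by_group(results):
    """Compute per-group false positive and false negative rates from
    labelled evaluation results given as (group, predicted_match, actual_match)."""
    tallies = {}
    for group, predicted, actual in results:
        t = tallies.setdefault(group, {"fp": 0, "fn": 0, "pos": 0, "neg": 0})
        if actual:
            t["pos"] += 1
            if not predicted:
                t["fn"] += 1  # missed a true match
        else:
            t["neg"] += 1
            if predicted:
                t["fp"] += 1  # falsely declared a match
    return {
        g: {
            "false_positive_rate": t["fp"] / t["neg"] if t["neg"] else 0.0,
            "false_negative_rate": t["fn"] / t["pos"] if t["pos"] else 0.0,
        }
        for g, t in tallies.items()
    }


# Made-up evaluation results for two hypothetical demographic groups.
results = [
    ("group_a", True, True), ("group_a", False, False), ("group_a", True, False),
    ("group_b", True, True), ("group_b", False, False), ("group_b", False, True),
]
print(rates_by_group(results))
```

A large gap between groups on either rate, as in this toy data, is exactly the kind of signal that should trigger more training data, recalibration, or taking the system offline.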
Make it publicly known what you are using computer vision for, and make every effort to bring the public along for the journey. If you’re using computer vision for a purpose that you would not want the public to know about, consider that this is an indication that the use may be unethical.
Further to this, take an active role in educating the public and your users on how your systems work, what their data is being used for, and what their rights are.
Where possible, avail users of a simple ‘freedom of information’ request platform and a ‘right to be forgotten’.
Just as these principles set the standard for systems coming into existence in the future, they should set the standard for systems already in place. Therefore, if you have computer vision systems running, actively enquire as to whether they meet these criteria, and take the necessary steps to ensure your system is future-proofed for ethical use. If you’re not sure how to do this, ask!
What are your thoughts on computer vision and the direction it could take in the future? Weigh in on social media using the hashtag:
As an organisation which works alongside major organisations, both as an advisor and a hands-on leader in technology enablement, Smash Delta champions the application of cutting-edge technology to drive meaningful change in a way that centres on privacy, security and ethics. The world of computer vision is rapidly growing and presenting challenges for businesses, policy makers and the general public. We believe in getting it right the first time, and intend this blueprint as a critical step towards achieving that goal.
If your organisation is interested in using computer vision in an ethical way, or wishes to enable policy to forge a new path in terms of public safety and ethics, please reach out to our team here.