To see or not to see: A blueprint for the responsible use of computer vision in society
The ever-increasing use of computer vision in the public domain is a double-edged sword. As a society, we only get one shot at doing it right. Here’s how we do it.
As the effects of COVID-19 still reverberate around the world, it can sometimes be difficult to imagine anything else dominating the headlines. However, one subject that has seen a wealth of debate and discussion in recent months is computer vision: the use of algorithms to allow computers to perceive the world visually through photos and video. On one hand, the disruptive technology has delivered some incredible capabilities, with applications from medicine, to public safety, to professional sport. On the other, facial recognition technologies have been a source of racial bias in law enforcement, and of human rights violations on a mass scale.
In its current state, computer vision technology sits in an extremely precarious balance. It has such a potent disruptive capability that late adopters could soon see themselves left in the wake of competitors who are more willing to leverage its insights. However, it has also been known to fuel mass surveillance and wide-scale breaches of privacy. Its use is so controversial that there have been waves of divestment from, and outright bans on, the technology around the world. This is occurring particularly in the United States, where many jurisdictions, most notably tech hub San Francisco, have banned facial recognition altogether.
As a technology-driven company with an increasing suite of computer vision capabilities, and an independent advisor to government and major organisations, Smash Delta would like to weigh in on the debate to give its stance on this technology, and the ground rules that need to be agreed upon before it can be given the ethics “all clear”.
A ‘round the world’ on computer vision
Computer vision (especially facial recognition) has hit the news several times in recent months, almost always highlighting its potential negative societal impacts.
The case of Robert Julian-Borchak Williams made international news in June as the first case of a person falsely arrested following a faulty match from a facial recognition algorithm. That is, the first known case: the wide-scale use of this technology leads some experts to believe others have most likely been wrongfully convicted due to faulty face recognition matches. A recent UK Court of Appeal ruling found that facial recognition had been applied unlawfully and in violation of data privacy standards under the European Convention on Human Rights, and that systems were deployed without being proactively checked for racial and gender bias, in breach of the Public Sector Equality Duty. Computer vision also plays a central role in the Chinese government’s social credit program and was a centrepiece of its highly controversial COVID-19 contact tracing capabilities.

Clearly, there are many ethical landmines to navigate when it comes to the use of computer vision in society. In particular, we are seeing authorities around the world, even in western democracies, misuse people’s faces by applying the technology in an indiscriminate manner.

However, ‘computer vision’ is not synonymous with facial recognition, nor with any single use of it. Instead, it describes a whole category of technologies which vary greatly in their level of intrusiveness, and a limitless number of applications spanning the entire ethical spectrum. In particular, there are many applications of computer vision technology that can drive great social good and do not need to incorporate facial recognition, or link any visual information back to any user’s identity (more on how to do this later). Some examples include:

  • Public safety: monitoring overcrowding, hazards, incidents and even mask wearing and social distancing during the current pandemic
  • Helping with the safe provision of gambling or alcohol: assisting staff in venues to identify and intervene in problem behaviours
  • More industrial applications like retail analytics, medicine and diagnostics, and assessment of elite performers in sport can leverage computer vision in ways that are far less intrusive than facial recognition, to produce high value insights

Instead of throwing out computer vision technology, we should strive to thread the needle in order to get the ethics of computer vision right. We need to favour approaches and technologies that are designed to be impactful without being intrusive, and move away from those that put individual freedoms at risk. If we get it right, it will enable us to harness the incredible potential in this technology without inviting the dystopian side-effects.

But mastering the ethical application of computer vision is much more challenging than simply meaning well. It requires a rigorous framework bolstered by an understanding of the underlying models and technology at play.

We will get into this below, but first let’s set up some essential knowledge on how computer vision technology works.

The nuts and bolts (How do computers see?)

Broadly speaking, computer vision is the task of getting useful information out of visual data inputs (photos and videos), i.e. teaching computers how to ‘see’. There are a multitude of different specific tasks that fall under this heading, with an important distinction between ‘detection’ and ‘recognition’ tasks. Detection refers to drawing boxes around specific objects (e.g. find all ‘cars’ in this image), while recognition usually refers to taking an image of a person and answering questions like “who is this?” or “is this person the same as that person?”

Computer vision is a form of ‘supervised learning’, essentially meaning it requires a lot of labelled training data (e.g. an image with a label ‘cat’) and a good statistical model to process this data and extrapolate the ‘rules’ that determine the labels. Generally speaking, deep learning models, particularly artificial neural networks, are best suited to this task.

Deep learning is a field of AI where input information (image pixels in this case) is passed through many, many ‘layers’ of analysis (this is usually done with a model called an artificial neural network) to eventually produce an output at the end of the chain. The key aspect that makes the learning ‘deep’ is that the model is said to be ‘learning what to learn’ in the sense that it finds connections and relationships between aspects of the input image that we have no knowledge of or control over.
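To make the idea of layered analysis concrete, here is a minimal sketch of data flowing through a tiny network. The layer sizes and random weights are illustrative stand-ins, not any production model; training would tune the weights rather than leave them random.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # Non-linear 'activation' applied between layers
    return np.maximum(x, 0.0)

# Toy network: a flattened 8x8 'image' passes through two hidden layers.
x = rng.random(64)                  # input: 64 pixel values
W1 = rng.standard_normal((32, 64))  # layer 1: 64 pixels -> 32 features
W2 = rng.standard_normal((16, 32))  # layer 2: 32 features -> 16 'features of features'
W3 = rng.standard_normal((1, 16))   # output layer: 16 features -> 1 score

h1 = relu(W1 @ x)                      # first layer of analysis
h2 = relu(W2 @ h1)                     # deeper layer, built on the first
score = 1 / (1 + np.exp(-(W3 @ h2)))   # squashed to a confidence in (0, 1)

print(0.0 < float(score[0]) < 1.0)  # True
```

Each layer only ever sees the previous layer's output, which is why the intermediate 'features of features' are not something we design or control directly.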

Identifying people is not the only modern application of artificial neural networks. They have been used to, amongst other things, predict weather patterns, make medical diagnoses, trade algorithmically, and even predict the risk of a pandemic outbreak - like the one the world is currently experiencing.

In the domain of person recognition, there is an interesting and clever subtlety to how these neural nets are designed (and specifically what they ‘spit out’). This subtlety is not only interesting from a technical perspective, but it can be leveraged to shore up computer vision from a privacy and reidentification risk standpoint.

That is, whereas neural nets are usually asked to answer a ‘yes’ or ‘no’ question by outputting a 1 or a 0, person recognition models output a list of numbers, called an ‘embedding’. Matches between two images are identified by comparing how similar these embeddings are, and a match is declared if the two embeddings differ by less than a predefined amount. This means that recognition databases do not need to store images to detect matches; they just need to keep the embeddings. The embeddings are not human-readable and are therefore less sensitive than image and video data, though they contain essentially the same information.
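As a rough sketch of how such a match check works (the 128-number embeddings and the tolerance value here are illustrative assumptions, not any particular vendor’s settings):

```python
import numpy as np

def is_match(emb_a, emb_b, tol=0.6):
    # Declare a match if the Euclidean distance between two
    # embeddings is below a predefined tolerance.
    return float(np.linalg.norm(emb_a - emb_b)) < tol

# Hypothetical 128-number embeddings produced by a recognition model
person_x = np.zeros(128)
same_person = person_x + 0.001          # nearly identical embedding
different_person = np.ones(128) * 0.2   # clearly different embedding

print(is_match(person_x, same_person))       # True
print(is_match(person_x, different_person))  # False
```

Note that only the embeddings are needed for the comparison; the original images never enter into it.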

More detail for interested technical people here:

Most typical uses of neural networks, outside of computer vision, are to perform what are called ‘classification’ tasks: take data of a specific case of something, and put it in the correct ‘box’. The simplest example of this type of task is binary classification. For example, given this set of recent meteorological observations, is it going to rain tomorrow? Given the data on this patient, do they have heart disease? Given this set of recent stock market observations, is a certain stock’s price going to go up or down in the next time window? For these tasks, the model takes in input data of a predetermined size and shape, and outputs a ‘yes’ or ‘no’ depending on what it infers from that data (more precisely, it outputs a number between 0 and 1 representing the model’s confidence that the answer is ‘yes’).

However, this binary classification framework doesn’t really capture the complexity of recognition problems. Who somebody is isn’t a ‘yes’ or ‘no’ question. A yes or no framing like ‘are these two people the same?’ could be tried, but in practice you would then need to go through every person in your database, running each pair of images through the neural net to check for potential matches, which would be very inefficient.

A next attempt one might try could be leveraging the fact that artificial neural networks can be extended to become what are called ‘multi-class’ classifiers (categorising data amongst a list of possible categories, like ‘apple’, ‘orange’, ‘banana’, instead of just ‘yes’ or ‘no’). At first, this might seem to be a viable framing, since a recognition problem seems equivalent to asking ‘which of this list of people does this image belong to?’. However, this approach fails when we consider the fact that we encounter new people when we deploy the model: images which do not belong to any predefined category. Multi-class classifiers assume the data must belong to one of the predefined categories, leading to false matches and other obvious problems. A naive solution may be to say ‘why don’t we just “add” the new person as a new category if the confidence that it matches any of the existing categories is sufficiently low?’. However, since the number of categories is fixed when we train the model, doing this would require retraining the model every time a new person is seen, which is obviously impractical.
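The forced-choice problem above can be seen in a few lines. This sketch assumes a hypothetical recogniser trained on exactly three known people; the scores are made up for illustration:

```python
import numpy as np

def softmax(logits):
    # Converts raw scores into probabilities that always sum to 1
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

# A brand-new, unknown face produces weak, ambiguous evidence...
logits_for_new_face = np.array([0.3, 0.1, 0.2])
probs = softmax(logits_for_new_face)

# ...yet the model is forced to pick one of its fixed categories,
# producing a false 'match' rather than saying 'nobody I know'.
print(probs.argmax())         # 0 — assigned to person 0 regardless
print(round(probs.sum(), 6))  # 1.0 — no room for 'none of the above'
```

Because the probabilities must sum to 1 across the fixed list of people, there is no natural way for the model to express "this is someone new".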

Instead, a clever solution comes from framing the problem completely differently. That is, to ask the model to take in an image of a face, and instead of returning a ‘classification’ into a category, return a list of numbers that act as a ‘signature’ capturing the key information in the image data. This usually comes in the form of what’s called an ‘embedding’ (usually 128 numbers long, capturing the relevant information from millions of image pixel values). The key idea is that people that look similar will be given similar embeddings, and images that are of the same person will lead to embeddings that are so similar that the computer says “that’s a match”. This approach is called ‘metric learning’ since the neural network is being taught to learn a distance metric by which people can be given a measure of similarity/disparity.

There’s a very important upshot here which has a huge impact on the potential for user privacy: when a person recognition algorithm sees a new person and checks for a match, it does not check the face against raw images, but instead against a database of embeddings. These embeddings can therefore take the place of images and video footage in terms of what data is stored in perpetuity. They are inherently non human-readable, which adds a layer of privacy and security.

The even more technically minded might be interested to know how switching from classification to metric learning affects the process of training the model.

For this, the relevant concept is metric learning via Siamese neural networks. The training data fed to the model are triplets of images, generated so that each triplet contains an ‘anchor’, a ‘match’ (an image of the same person as the anchor), and a third ‘decoy’ image which is not a match. These triplets are split into pairs (anchor, match) and (anchor, decoy). Each pair is then sent through independent copies of the neural net. Defining a function f which takes in images and returns the corresponding (L2-normalised) embeddings, the objective function to be minimised by the neural network takes the form: L = sum_{image triplets} max(d(f(anchor), f(match)) - d(f(anchor), f(decoy)) + tol, 0), where d is whatever distance metric is going to be used to assess matches, usually the n-dimensional Euclidean/Pythagorean/L2 distance d(x, y) := sqrt((x_1 - y_1)^2 + (x_2 - y_2)^2 + ... + (x_n - y_n)^2), and ‘tol’ is a predetermined margin: minimising the loss pushes the anchor-match distance to be at least ‘tol’ smaller than the anchor-decoy distance, which is what makes the match threshold reliable. This loss is back-propagated across the weights of the neural net, with the enforced constraint that the weights in the independent copies of the model are updated by the same amount (i.e. the copies share weights). For more information, this paper on the construction of Google’s facial recognition algorithm FaceNet is an excellent read, and goes into more detail about how to optimally choose image triplets in order to accelerate the training process.
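A minimal sketch of the triplet loss described above, using 2-dimensional embeddings for readability (real systems use e.g. 128 dimensions, and the ‘tol’ margin here is an arbitrary illustrative value):

```python
import numpy as np

def l2_normalise(v):
    return v / np.linalg.norm(v)

def triplet_loss(anchor, match, decoy, tol=0.2):
    # Loss for one (anchor, match, decoy) triplet of embeddings:
    # penalise the model unless the anchor is closer to the match
    # than to the decoy by at least the margin 'tol'.
    a, m, d = map(l2_normalise, (anchor, match, decoy))
    d_match = np.linalg.norm(a - m)  # want this small
    d_decoy = np.linalg.norm(a - d)  # want this large
    return max(d_match - d_decoy + tol, 0.0)

# A well-separated triplet incurs zero loss...
good = triplet_loss(np.array([1.0, 0.0]), np.array([0.99, 0.01]),
                    np.array([0.0, 1.0]))
# ...while a confusable one is penalised, driving weight updates.
bad = triplet_loss(np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                   np.array([0.99, 0.01]))
print(good)        # 0.0
print(bad > 0.0)   # True
```

Summing this quantity over many triplets and back-propagating it is what teaches the network a distance metric under which same-person embeddings cluster together.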

How computer vision can be implemented: Edge computing vs the cloud

Understanding how computer vision works is one thing; deciding how to implement it is a completely separate challenge.

The computer vision marketplace is currently heavily dominated by cloud solution providers (such as Amazon Web Services, Microsoft Azure etc.) and some vendors with their own proprietary solutions they can apply out of the box. These services provide powerful and efficient solutions to computer vision problems, but they face the challenge of requiring potentially sensitive information to be sent over the internet, or to a third party.

The alternative, developing your organisation’s own AI capability that functions ‘locally’ (meaning using a single computer very close to the camera site), circumvents many potential security risks. This is called ‘edge AI’, and it presents an important counterpoint to the existing vendor marketplace.

The main drawcard of edge AI is that it allows the analysis of video footage to take place on-site and offline. Video footage can be fed to a single computer, and the raw video can be converted into the ‘important insights’ from the footage, which can be stored for later use. This approach means that nothing sensitive ever leaves the premises. For example, in a person recognition task, only the embeddings for the people observed need to be sent to a central database (or, to add another layer of privacy and security, save the number of unique people seen, dwell time per person, or whatever is relevant to the actual goal of the analysis). In a case where computer vision is being used to monitor a public space for overcrowding, only time-stamped information regarding space and person density needs to be used. In general, only the outputs of the computer vision model need to be stored.

Specifically, this means that raw video footage can be thrown out once it is analysed.
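As an illustrative sketch of this principle (the frame format and `count_people` stand-in are hypothetical, not a real detection model):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Edge-AI loop in miniature: raw frames are analysed on-site and
# discarded; only aggregate, non-identifying outputs (here, a
# time-stamped person count) ever leave the device.

@dataclass
class CrowdReading:
    timestamp: str
    person_count: int

def count_people(frame):
    # Stand-in for an on-device detection model; here we simply
    # treat the frame as a list of detected people.
    return len(frame)

def process_frame(frame):
    reading = CrowdReading(
        timestamp=datetime.now(timezone.utc).isoformat(),
        person_count=count_people(frame),
    )
    del frame          # raw footage is never stored or transmitted
    return reading     # only this summary goes to the central database

reading = process_frame(["person_a", "person_b", "person_c"])
print(reading.person_count)  # 3
```

The design choice is that the boundary of the premises is also the boundary of the raw data: everything downstream only ever sees the summaries.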

It’s worth mentioning that cloud facial recognition services, and most likely many private vendors do operate on this principle (storing embeddings/outputs rather than raw footage). However, the advantage of deploying your own edge computing capability is that the footage never leaves the site in the first place.

Another layer of privacy which comes as a result of deploying an edge AI is that the mapping from photos to embeddings is unique to your model, rather than being the same as a publicly available and widely used one, and the process can be set up so that the model cannot be accessed by anyone. This means that in the scenario where you store embeddings in a database and they happen to be stolen, it is much harder to reconstruct the original image data*. Reconstructing original images from embeddings requires either access to the model itself, or to a large sample of images used to feed the model. If the model is offline and the raw video footage is not stored historically, there is little risk of this.

*This is because publicly available models are more susceptible to ‘black box’ attacks, where adversaries with the ability to query a model (as they would have with a cloud service) feed examples into the model, examine the output, and use advanced techniques such as GANs (Generative Adversarial Networks) to learn enough about how the model works to decode the embeddings. Keeping the model offline also helps to protect against other forms of adversarial attack against AI, particularly those designed to ‘fool’ algorithms into making misclassifications. An insightful blog on the various forms of adversarial attacks against AI can be found here.

The growing controversy and debate surrounding facial recognition software is another great reason not to be fully beholden to cloud providers and vendors. Indeed, a number of prominent players in this space (including Amazon and IBM) have recently declared an indefinite pause on providing facial recognition technology to law enforcement. Who’s to say that in the next five years, Amazon won’t extend its embargo on facial recognition in law enforcement to simply unplugging its computer vision capabilities altogether, at least in certain jurisdictions? Or that other vendors won’t follow suit?

Being beholden to a particular software or provider always creates a risk that that product won’t exist in its current form in the future. With computer vision software, this risk is an especially pressing one.

Moving away from third party vendors and cloud computing solutions does leave a substantial technical challenge: creating your own edge facial recognition capability. However, if achieved, the end result is a derisked data asset and technical capability that is owned by the organisation and intrinsically embedded within its systems. This is an approach that Smash Delta champions, and works alongside organisations in order to enable.

Our Blueprint

With this understanding in place, we arrive at the main point of this article. Computer vision is fraught with ethical and social landmines, to which an easy response may be “why don’t we just shut it all down?”. However, there is great potential to harness this technology in a way that provides great social benefit, and simply blocking these uses from ever coming into existence would prevent us from achieving that. Computer vision should be used in the public domain, but if it is to be used in an ethical manner, the following principles must be followed.

Minimality: Computer vision should be a means to very specific ends, not just applied ad hoc because your organisation owns images or video footage. It’s important to remember that you’re handling sensitive data, and that users have a fundamental right to privacy. Therefore, any application of computer vision in the public domain should be justified by a legitimate and specific end use-case, and only the data that is needed to fulfil this use case should be captured. If you can’t explain why you need a user’s data, you shouldn’t have it!

The sensitivity and scale of the data collection and analysis should always be weighed up against the value of the end use-case. The more sensitive the information being collected and analysed, the more benefit (particularly to the end-user who has provided the data) there needs to be in order to justify it.

Specifically, most applications of computer vision do not require facial recognition, so move away from it where possible.

Even in identification tasks, where some sense of who a person ‘is’ is required, alternatives like ‘person reidentification’ (using the person’s body, rather than facial features, to identify matches) exist. This approach generates the same insights as facial recognition but boosts privacy by removing any possibility that data can be cross-checked against bulk face data sets (such as those held by social media platforms like Facebook and Instagram) to identify individuals.

Facial recognition should only be used in the very rare circumstances where no other approach is fit for purpose, there is a great demonstrable benefit (e.g. public safety and security), and all other checks and balances are in place to ensure user privacy, equity and minimal usage are still respected.

Deidentification: An enormous number of computer vision use cases do not require the long term storage of any potentially sensitive information, or any personally identifiable information. To achieve this, store the outputs of your computer vision model as your data asset rather than the actual images or video. For example, only storing embeddings, which aren’t human-readable, creates less identification risk than storing video footage. However, if these embeddings are used to build deidentified summary information pertaining to what you actually want to know, such as the counts of unique people detected per video frame, this removes reidentification risk altogether. The use cases that carry the least reidentification risk are those which never need any kind of person-level data to begin with (e.g. assisting with public safety by using computer vision to detect incidents and hazards, and only keeping time-stamped logs of incidents detected).

Security: Be aware of the risks that exist whenever sensitive information is being stored, particularly embeddings, which are unique to an individual. Edge AI offers a way to bypass several potential vulnerabilities, predominantly by analysing video on-site, which minimises the number of instances where sensitive data changes hands and could be stolen, intercepted or interfered with.

Equity: Proactively examine systems for evidence of racial bias and discriminatory outputs. Consult with a diverse range of stakeholders in order to understand your potential blindspots, and work continually to correct these. Only leave systems operational if they can be demonstrated to be equitable.

Transparency: Make it publicly known what you are using computer vision for, and make every effort to bring the public along for the journey. If you’re using computer vision for a purpose that you would not want the public to know about, consider that this is an indication that the use may be unethical.

Take an active role in educating the public and your users on how your systems work, what their data is being used for, and what their rights are.

Where possible, avail users of a simple ‘freedom of information’ request platform and a ‘right to be forgotten’.

Adaptation: Just as the above sets the standard for systems coming into existence in the future, it should set the standard for systems already in place. Ask yourself if your computer vision applications are meeting the above criteria. If you’re not sure, ask!

As an organisation which works alongside major organisations both as an advisor and a hands-on leader in technology enablement, Smash Delta champions the application of cutting edge technology driving meaningful change in a way that centres on privacy, security and ethics. The computer vision space is one that is rapidly growing and presenting challenges for businesses, policy makers and the general public worldwide. We believe in getting it right the first time, and intend this blueprint as a step towards achieving that.

If your organisation is currently using computer vision, or has an interest in expanding into this area, and wishes to do it in a cutting-edge way that forges a new path in terms of public safety and ethics, please feel free to get in touch with us!


Copyright 2020 - Smash Delta