Look who's talking

Tools for the responsible use of speech technology

Digitalisation

Report


A woman speaks to a voice assistant. Cover photo of the report 'Hoor wie het zegt' (Look who's talking). Photo: Frank Duenzl/ANP

In recent years, speech technology has become commonplace. Many drivers give verbal instructions to their cars, and some people even wake up to the voice of their digital assistant wishing them good morning and giving the weather forecast. We are increasingly talking to computers, and that has consequences. After all, nothing is more human than our speech. In our conversations, we express ourselves and develop customs.

The Rathenau Instituut has therefore devoted this study to speech technology. How does speech technology work, what is it used for and what ethical questions does it raise? We examine how the government, businesses and citizens can contribute to speech technology that enriches our society and social relations rather than impoverishing them.

29 January 2021

Computers are getting better at recognising, interpreting and producing human speech. Thanks to improvements in speech technology, it is possible to talk to computers, and users can control the digital world with their voice. Speech technology is already widely used in cars and homes, and companies and organisations are experimenting with it in many other fields, including healthcare and the security sector. The increasing use of speech technology has important consequences for society. Our speech is an essential part of who we are as human beings, and of our social relationships. Our conversations also contain highly sensitive information – about our identity, the type of conversations we have, and even about our health and mood. Our speech therefore has to be protected. This study examines ways in which society can shape this protection.

The study is based on desk research and interviews. Given that the widespread application of speech technology is a relatively recent phenomenon, the desk research combined academic literature with grey literature. The interviews were intended to give a better picture of the technical possibilities of speech technology and are exploratory in nature.

Speech technology is getting better all the time

In this study, we first analyse the technical situation: how does speech technology work, and how good is it? Speech technology consists of three key processes: recognising speech, interpreting speech, and producing speech, which is referred to as speech synthesis (see Figure 1). Progress has been made in all three areas, mainly thanks to richer and larger datasets, advanced machine learning technology and greater computer processing power. However, despite the improvement in quality, the picture is mixed.

Figure 1: Three elements of speech technology

Speech recognition already works pretty well. In ideal conditions, speech computers achieve an error rate of around 5%. But conditions matter a lot: the error rate increases sharply in a noisy setting, when technical words are used or when the system listens to voices of groups that are less strongly represented in the training data, such as those of children. Nevertheless, speech recognition is sufficiently accurate to provide many useful services, for example when it comes to remotely controlling music or transcribing an interview. But there are plenty of applications, such as in healthcare or heavy industry, where such an error rate is not acceptable.

Progress is less clear-cut in the area of speech interpretation. To perform tasks, speech systems still need help from their environment and from the user, who must give simple commands and formulate and answer questions correctly. Although it was promised that computers would learn our language, human beings still have to adapt to speech technology if their speech is to be interpreted properly.

Speech synthesis, on the other hand, has become much better. In short, speech systems can already make themselves clearly understood. Developers have raised the bar: speech synthesis has to be so good that people are no longer aware that they are talking to a computer. This is not yet the case for the vast majority of applications, but things are developing fast. Some speech systems, such as Google Duplex, come very close to producing human speech, including "ums" and "ahs".

Speech technology is our guide in the digital world

This study also reviewed the application of speech technology. Speech technology is already widely used in cars and in the home. Technology providers and companies are also experimenting with speech. The applications can be divided into two groups: applications that control devices and applications that support or provide services. In the first category, we are familiar with speech technology in the car (hands-free calling) and in the home (voice assistants such as Google Assistant or Amazon's Alexa). But machines can be voice-controlled in industry as well. The second category includes voice assistants that book trips for us, assist us at the office, and check our identity, for example when we want to do our banking.

Impact on social relationships and norms

This wide range of applications raises various societal and ethical issues (see Figure 2). First, speech technology interferes with people's social lives. This raises questions about the desired relationship between people and computers: do we want to know, and should we always be able to know, that we are talking to a computer instead of a human being? Do we actually hear who, or what, is saying something? Is it problematic if users consider their voice assistant to be their best friend? And how do we ensure that speech technology respects existing social norms, for example with regard to equal treatment and disciplining? We have to make sure that speech technology does not compromise our dignity as human beings.

The voice as a new data source

Moreover, all these applications collect data, by means of both call logs and audio recordings. Our study shows that the voice thus acts as a new data source. The data is used by developers to personalise speech systems, and forms the basis of analyses in the field of emotion recognition and the diagnosis of diseases. These analyses are often not scientifically proven, but various companies expect a great deal from the future possibilities of audio recordings. Speech data contains very sensitive information: more than anywhere else, people reveal themselves in conversations at home, in the car and at work. This requires extra attention from developers and regulators to ensure that our private and family life remains respected.

Our autonomy is at stake

The use of speech technology also affects our autonomy. This technology helps users in many domains to perform tasks, take decisions and have a pleasant experience. This offers opportunities but also raises concerns. Does the use of speech technology lead to the loss of skills, and can it exert a malign influence and mislead? Take, as an example, deepfake videos, in which someone's appearance and voice are faked ("cloned"), and which can fool people and undermine public debate. In addition, speech technology offers fewer possibilities for nuance and questioning compared with screens. Who controls and decides on the answer the voice assistant gives? Finally, an empathetic and comfortable voice assistant can be so useful that people overuse it and become addicted to it.

The importance of safe and healthy use

Speech technology can also compromise people's security. Speech data can be stolen and misused, for example to commit identity fraud. And, despite the improvements made, speech technology is not perfect and accidents can happen. Before speech technology is used in critical applications in healthcare, defence or manufacturing industry, the reliability of the technology will have to be beyond doubt and money will have to be invested in technologies to combat misuse.

Tech giants' growing market power

Finally, the study shows that the power of several large technology companies is growing even faster thanks to speech technology. The objective of several technology giants such as Google and Amazon is to create a broad platform of speech applications and link them to a voice assistant, such as Alexa and Google Assistant, which can perform a multitude of tasks. In this way, these assistants will take on the role of a guide who helps us navigate through the digital world, while keeping us as deep as possible within the environment of a particular platform. To achieve this, technology giants are buying up start-ups and making significant investments. Although other players are also active in the speech technology market, such as the Houndify platform, and companies sometimes develop their own voice assistants, the question is how these players will hold their own against the tech giants' increasingly dominant position.

Figure 2: Ethical aspects of speech technology

Our voices and conversations are an essential part of who we are as human beings and the relationships we enter into with others. Speech technology gives us someone to talk to at all times – at home, in the car, at work and when shopping – and this will affect our speech and our relationships – both with each other and with computers. In addition, speech technology creates a new source of data, containing highly sensitive information. Our speech is at stake.

Speech technology adds a new dimension to the general task of managing digital technology effectively and shows that it is once again up to government and industry to take the lead. After all, speech technology not only influences the way individuals use computers, it also affects the behaviours we develop together. It changes not only the way in which individuals acquire knowledge, but also the knowledge on which public debate is based. And it has an impact not only on the relationship between customers and companies but also on the platform economy as a whole.

The Rathenau Instituut is therefore making six recommendations to government and industry to protect human speech and to manage the speech technology applications effectively:

1. Ensure effective privacy protection

Speech technology makes it possible to collect sensitive voice data from people and use it to influence them. This includes biometric and health data. This means the processing of voice data poses risks to people and their fundamental rights. Existing privacy rules must be enforced more vigorously. The Rathenau Instituut is therefore calling on the government to introduce a permit system for biometric voice analysis and to develop strategies to regulate emotion recognition and health analysis. It is also important to monitor the use of speech analysis by law enforcement agencies: is it desirable for the police to scrape voice data from social media? Finally, it is incumbent on industry not merely to meet the minimum privacy requirements in its product development and service provision, but to implement privacy rules vigorously, for example by investing in technologies that minimise data use.

2. Promote inclusive speech technology

Speech technology provides opportunities to make information more easily accessible. But speech systems can also exclude groups of users, confirm biases, and encourage discrimination. It is very important to ensure that everyone can use speech technology. To this end, government can invest in a Dutch speech database on which numerous players can base their speech technology. Industry also has responsibilities in this regard. In particular, the Rathenau Instituut calls on industry to combat stereotyping, for example by offering a diverse range of voice assistants.

3. Create a fair market

Concerns have been raised in the data economy with regard to the dominance of a few large technology companies. Speech technology gives these companies an opportunity to expand this dominant position even further. In order to make the market accessible and fair to all players, government can tighten up competition law; steps are being taken to this end at European level. It is also important to provide opportunities for alternative providers, and not just to work with the tech giants. Industry is advised to apply consumer rights, such as the right to request information, effectively and generously.

4. Protect human dignity

The Rathenau Instituut calls on government and industry to initiate an ethical dialogue on speech technology. Particular attention should be paid to protecting human dignity: guaranteeing the right to human contact and preventing situations in which users confuse computers with people. Government and industry should reach agreements in this regard.

5. Make sure speech technology is reliable

Speech technology has a lot to offer society, provided that it is reliable. It is up to both government and industry to take the following steps: act decisively to combat disinformation and voice cloning, reduce the error rate of speech technology, invest in technology that prevents misuse and develop security standards.

6. Invest in technological citizenship

Responsible and effective use of speech technology also requires knowledge and skills, for example about searching for information, setting up routines, and understanding what information the devices collect. It is therefore necessary to help people deal with speech technology, which requires investment in education and training in media literacy. In addition, government, knowledge institutes and industry must invest in research to analyse the impact of speech technology on our physical and mental health. Finally, individuals also have an important role to play. They can make their voices heard and put speech technology on the agenda for public debate. Our speech is a vulnerable and meaningful commodity, and worthy of debate.

FAQ

What is speech technology?

Speech technology enables computers to recognise and interpret human speech and to produce speech themselves. These three processes are known as speech recognition, speech interpretation and speech synthesis (see chapter two of the report).

Why is this publication on speech technology relevant?

Nothing is more human than our speech. In our conversations, we express ourselves and develop customs. It is therefore important to get speech technology right. In addition, speech technology comes close to us: we install speech systems in our living rooms and in our offices. In the wrong hands, a speech computer is a surveillance tool that can unlock our secrets. It is even possible to clone voices and put words in someone's mouth. Moreover, our autonomy is at stake. Speech technology is increasingly functioning as a guide that leads the user through the digital world. But this guide is made by companies that pursue their own interests, and these do not necessarily correspond to the interests and wishes of citizens.

The study therefore calls for the development of ethical speech technology that, among other things, is inclusive, respects our private lives and is offered on a healthy market. The study also calls for social dialogue and political debate. The emergence of speech technology raises questions that we must answer together. For example, do we want to be disciplined by a speech assistant? In the past, this question would have sounded fanciful, but today it is real. Computers have started to talk: time for a good conversation.

To what extent is speech technology already present in society?

The market for voice technology is currently growing rapidly. For example, in 2018, 6% of Dutch households purchased a voice-controlled speaker, and this percentage grew to 19% in 2019 (Multiscope, 2020). In America and China, developments are moving even faster (Kimmich, 2019). According to some analyses, the rise of smart speakers there seems to be even faster than the rise of mobile phones was in its day, and mobile phones are themselves devices that you can now increasingly control with your voice (Kinsella & Mutchler, 2018). For the full source citations, see the publication.

Have the effects of other immersive technologies such as AR and VR also been studied?

Yes, this report on speech technology is the second in the series on immersive technologies. A study on virtual reality appeared at the end of 2019: 'Verantwoord virtueel - Bescherm consumenten in virtual reality' (Responsible VR). The publication on augmented reality followed on Wednesday 21 October: 'Nep echt - verrijk de wereld met augmented reality' (Fake for real).

With the breakthrough of immersive technologies, digital society is entering a new phase. The physical and digital worlds are becoming more intertwined than ever. This raises urgent social and political questions. The Rathenau Instituut has therefore published a manifesto with ten design requirements for the digital society of tomorrow.

Can I have a say in speech technology?

During Dutch Design Week we organised an online talk show, 'Enriching Reality: Designing human-centered AR, VR and Voice applications'. In this talk show, coordinator Rinie van Est and researcher Jurriën Hamer hosted inspiring guests to discuss how AR, VR and voice technology touch people's lives, and under what conditions they can enrich society. Watch the talk show on the website of the Dutch Design Week.

On 26 November 2020 (15.30-17.00 hrs), our annual Rathenau Live event took place. This year it was an online event entirely dedicated to Virtual Reality, Augmented Reality and Speech Technology. Together we discussed and experienced what these techniques do to our perception of ourselves, others and the world around us.

What is a voice assistant?

A computer system that can perform speech recognition, speech interpretation and/or speech synthesis is called a speech system. There are various types of speech systems available on the market. The most important is the voice assistant, a speech system that can usually perform a wide range of tasks. Well-known examples are Amazon's Alexa and Google Assistant. These assistants can be installed on all kinds of digital devices, such as a mobile phone, a desktop PC, or a smart speaker. The study also looks at other speech systems, such as transcription software and navigation systems.

Voice assistants are sometimes grouped with cognitive or virtual assistants. These digital systems can also perform tasks and are usually able to interpret text, but they do not have to be based on speech technology. In this exploration, we focus on systems equipped with speech technology. We will therefore use the term 'voice assistant'.

