The Rathenau Instituut has therefore devoted this study to speech technology. How does speech technology work, what is it used for and what ethical questions does it raise? We examine how the government, businesses and citizens can contribute to speech technology that enriches our society and social relations, and does not impoverish them.
Computers are getting better at recognising, interpreting and producing human speech. Thanks to improvements in speech technology, it is possible to talk to computers, and users can control the digital world with their voice. Speech technology is already widely used in cars and homes, and companies and organisations are experimenting with it in many other fields, including healthcare and the security sector. The increasing use of speech technology has important consequences for society. Our speech is an essential part of who we are as human beings, and of our social relationships. Our conversations also contain highly sensitive information – about our identity, the type of conversations we have, and even about our health and mood. Our speech therefore has to be protected. This study examines ways in which society can shape this protection.
The study is based on desk research and interviews. Given that the widespread application of speech technology is a relatively recent phenomenon, the desk research consisted of studying a combination of academic literature and grey literature. The interviews were intended to get a better picture of the technical possibilities of speech technology and are exploratory in nature.
Speech technology is getting better all the time
In this study, we first analyse the technical situation: how does speech technology work, and how good is it? Speech technology consists of three key processes: recognising speech, interpreting speech, and producing speech, which is referred to as speech synthesis (see Figure 1). Progress has been made in all three areas, mainly thanks to richer and larger datasets, advanced machine learning technology and faster processing power of computers. However, despite the improvement in quality, it is a mixed picture.
Speech recognition already works pretty well. In ideal conditions, speech computers achieve an error rate of around 5%. But conditions matter a lot: the error rate increases sharply in a noisy setting, when technical words are used or when the system listens to voices of groups that are less strongly represented in the training data, such as those of children. Nevertheless, speech recognition is sufficiently accurate to provide many useful services, for example when it comes to remotely controlling music or transcribing an interview. But there are plenty of applications, such as in healthcare or heavy industry, where such an error rate is not acceptable.
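To make the 5% figure concrete, the sketch below assumes that the error rate refers to the word error rate, the customary metric for speech recognition; the study itself does not specify the metric. The function name and the example sentences are illustrative and not taken from the study. The word error rate is the minimum number of word substitutions, deletions and insertions needed to turn the system's transcript into the correct one, divided by the number of words actually spoken.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (not from the study): `reference` is what was
    actually said, `hypothesis` is what the speech recogniser produced.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: 20 spoken words, one of which is misrecognised.
reference = ("please turn down the music in the living room and set "
             "an alarm for seven in the morning on monday")
hypothesis = ("please turn down the music in the living room and set "
              "an alarm for eleven in the morning on monday")
wer = word_error_rate(reference, hypothesis)
```

In this made-up example one word in twenty is wrong, which corresponds to a word error rate of 5%. The example also shows why such a rate can be acceptable for controlling music but not for healthcare: the single misrecognised word changes the meaning of the command.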
Progress is less clear-cut in the area of speech interpretation. To perform tasks, speech systems still need help from their environment and from the user, who must give simple commands and formulate and answer questions correctly. Although it was promised that computers would learn our language, people still have to adapt to speech technology if they want to be understood properly.
Speech synthesis, on the other hand, has become much better. In short, speech systems can already make themselves clearly understood. Developers have raised the bar: speech synthesis has to be so good that people are no longer aware that they are talking to a computer. This is not yet the case for the vast majority of applications, but things are developing fast. Some speech systems, such as Google Duplex, come very close to producing human speech, including "ums" and "ahs".
Speech technology is our guide in the digital world
This study also reviewed the application of speech technology. Speech technology is already widely used in cars and in the home. Technology providers and companies are also experimenting with speech. The applications can be divided into two groups: applications that control devices and applications that support or provide services. In the first category, we are familiar with speech technology in the car (hands-free calling) and in the home (voice assistants such as Google Assistant or Amazon's Alexa). But machines can be voice-controlled in industry as well. The second category includes voice assistants that book trips for us, assist us at the office, and check our identity, for example when we want to do our banking.
Impact on social relationships and norms
This wide range of applications raises various societal and ethical issues (see Figure 2). First, speech technology interferes with people's social lives. This raises questions about the desired relationship between people and computers: do we want to know, and should we always know, that we are talking to a computer instead of a human being? Do we actually hear who, or what, is saying something? Is it problematic if users consider their voice assistant to be their best friend? And how do we ensure that speech technology respects existing social norms, for example with regard to equal treatment and disciplining? We have to make sure that speech technology does not compromise our dignity as human beings.
The voice as a new data source
Moreover, all these applications collect data, by means of both call logs and audio recordings. Our study shows that the voice thus acts as a new data source. The data is used by developers to personalise speech systems, and forms the basis of analyses in the field of emotion recognition and the diagnosis of diseases. These analyses are often not scientifically proven, but various companies are expecting a lot from the future possibilities of audio recordings. Speech data contains very sensitive information: more than anywhere else, people reveal themselves in conversations at home, in the car and at work. This requires extra attention from developers and regulators to ensure that our private and family life remains respected.
Our autonomy is at stake
The use of speech technology also affects our autonomy. This technology helps users in many domains to perform tasks, take decisions and have a pleasant experience. This offers opportunities but also raises concerns. Does the use of speech technology lead to the loss of skills, and can it exert a malign influence and mislead people? Take, as an example, deepfake videos, in which someone's appearance and voice are faked ("cloned"), and which can fool people and undermine public debate. In addition, speech technology offers fewer possibilities for nuance and questioning compared with screens. Who controls and decides on the answer the voice assistant gives? Finally, an empathetic and comfortable voice assistant can be so useful that people overuse it and become addicted to it.
The importance of safe and healthy use
Speech technology can also compromise people's security. Speech data can be stolen and misused, for example to commit identity fraud. And, despite the improvements made, speech technology is not perfect and accidents can happen. Before speech technology is used in critical applications in healthcare, defence or the manufacturing industry, the reliability of the technology will have to be beyond doubt and money will have to be invested in technologies to combat misuse.
Tech giants' growing market power
Finally, the study shows that the power of several large technology companies is growing even faster thanks to speech technology. The objective of several technology giants such as Google and Amazon is to create a broad platform of speech applications and link them to a voice assistant, such as Alexa and Google Assistant, which can perform a multitude of tasks. In this way, these assistants will take on the role of a guide who helps us navigate through the digital world, while keeping us as deep as possible within the environment of a particular platform. To achieve this, technology giants are buying up start-ups and making significant investments. Although other players are also active in the speech technology market, such as the Houndify platform, and companies sometimes develop their own voice assistants, the question is how these players will hold their own against the tech giants' increasingly dominant position.
Our voices and conversations are an essential part of who we are as human beings and the relationships we enter into with others. Speech technology gives us someone to talk to at all times – at home, in the car, at work and when shopping – and this will affect our speech and our relationships – both with each other and with computers. In addition, speech technology creates a new source of data, containing highly sensitive information. Our speech is at stake.
Speech technology adds a new dimension to the general task of managing digital technology effectively and shows that government and industry are once again taking the lead. After all, speech technology not only influences the way individuals use computers, it also affects the behaviours we develop together. It changes not only the way in which individuals acquire knowledge, but also the knowledge on which public debate is based. And it has an impact not only on the relationship between customers and companies but also on the platform economy as a whole.
The Rathenau Instituut is therefore making six recommendations to government and industry to protect human speech and to manage the speech technology applications effectively:
1. Ensure effective privacy protection
Speech technology makes it possible to collect sensitive voice data from people and use it to influence them. This includes biometric and health data. This means the processing of voice data poses risks to people and their fundamental rights. Existing privacy rules must be enforced more vigorously. The Rathenau Instituut is therefore calling on the government to introduce a permit system for biometric voice analysis and to develop strategies to regulate emotion recognition and health analysis. It is also important to monitor the use of speech analysis by law enforcement agencies: is it desirable for the police to scrape voice data from social media? Finally, it is incumbent on industry not merely to meet the minimum privacy requirements in its product development and service provision, but to implement privacy rules vigorously – for example by investing in technologies that minimise data use.
2. Promote inclusive speech technology
Speech technology provides opportunities to make information more easily accessible. But speech systems can also exclude groups of users, confirm biases, and encourage discrimination. It is very important to ensure that everyone can use speech technology. To this end, government can invest in a Dutch speech database on which numerous players can base their speech technology. Industry also has responsibilities in this regard. In particular, the Rathenau Instituut calls on industry to combat stereotyping, for example by offering a diverse range of voice assistants.
3. Create a fair market
Concerns have been raised in the data economy with regard to the dominance of a few large technology companies. Speech technology gives these companies an opportunity to expand this dominant position even further. In order to make the market accessible and fair to all players, government can tighten up competition law – steps are being taken to this end at European level. It is also important to provide opportunities for alternative providers, and not just to work with the tech giants. Industry is advised to apply consumer rights, such as the right to request information, effectively and generously.
4. Protect human dignity
The Rathenau Instituut calls on government and industry to initiate an ethical dialogue on speech technology. Particular attention should be paid to protecting human dignity: guaranteeing the right to human contact and preventing situations in which users confuse computers with people. Government and industry should reach agreements in this regard.
5. Make sure speech technology is reliable
Speech technology has a lot to offer society, provided that it is reliable. It is up to both government and industry to take the following steps: act decisively to combat disinformation and voice cloning, reduce the error rate of speech technology, invest in technology that prevents misuse and develop security standards.
6. Invest in technological citizenship
Responsible and effective use of speech technology also requires knowledge and skills, for example when it comes to searching for information, setting up routines and understanding what information the devices collect. It is therefore necessary to help people deal with speech technology, which requires investment in education and training in media literacy. In addition, government, knowledge institutes and industry must invest in research to analyse the impact of speech technology on our physical and mental health. Finally, individuals also have an important role to play. They can make their voices heard and put speech technology on the agenda for public debate. Our speech is a vulnerable and meaningful commodity – and worthy of debate.