Wake word: what is it?

Updated: 06.07.2024

Snowboy is a highly customizable wake word detection engine (for example, Baidu's voice wake word "Xiao Du" is based on it). It can be used in real-time embedded systems and is always listening, even offline. It currently runs on Raspberry Pi, (Ubuntu) Linux, and Mac OS X.

Some popular wake words include "Alexa" on the Amazon Echo, "OK Google" on Android devices, and "Hey Siri" on the iPhone. These wake words are used to initiate a full voice interaction session. Beyond that, wake words can also be used for other purposes, such as executing simple commands and control actions.

A naive solution would be to run full automatic speech recognition (ASR) to perform hotword detection: the device would run ASR continuously and watch for the specific trigger word appearing in the transcription. Moreover, when cloud-based solutions are used, this approach does not protect your privacy. Fortunately, Snowboy was created to solve these problems.

Snowboy has the following features:

  • Highly customizable. You are free to define your own magic words, such as (but not limited to) "open sesame", "open the garage door", or "hello dream house". If you can think of it, you can set it up.
  • Always listening, yet privacy-preserving. Snowboy does not connect to the network, so your audio never needs to be uploaded anywhere.
  • Lightweight and embeddable. It runs on a Raspberry Pi, consuming less than 10% CPU on the smallest Pi (single-core 700 MHz ARMv6).
  • Apache-licensed.

Currently, Snowboy supports:

  • All Raspberry Pi models (running Debian Jessie 8.0)
  • 64-bit Mac OS X
  • 64-bit Ubuntu (12.04 and 14.04)
  • iOS
  • Android (ARMv7 CPU)
  • Pine64 (running Debian Jessie 8.5, kernel 3.10.102)
  • Intel Edison (running Ubilinux, based on Debian Wheezy 7.8)

First: prerequisites

1. A device that supports Snowboy and has a microphone;

Second: download Snowboy

You can download the pre-packaged Snowboy binaries and the accompanying Python package:

Download link: 1/2/3/Zero

Or download it from GitHub

Third: set up the microphone

Here we use PortAudio for cross-platform audio input/output. We also use SoX as a quick utility to check that the microphone is configured correctly. We therefore need to install this software and get the microphone working.

1. Install SoX

2. Install the PyAudio Python module
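To check that PyAudio can actually read from your microphone, a short recording test like the following can help; this is a minimal sketch, and the output file name and parameters are illustrative:

```python
import wave

import pyaudio

RATE, CHUNK, SECONDS = 16000, 1024, 3   # 16 kHz mono, ~3 second test clip

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)

# Read roughly 3 seconds of audio from the default input device.
frames = [stream.read(CHUNK) for _ in range(RATE * SECONDS // CHUNK)]

stream.stop_stream()
stream.close()

# Save the clip so it can be played back to confirm the microphone works.
with wave.open("mic_test.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(pa.get_sample_size(pyaudio.paInt16))
    wf.setframerate(RATE)
    wf.writeframes(b"".join(frames))

pa.terminate()
```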

Fourth: run the demo

This demo can run on any device, but we recommend running it on a laptop or desktop with speakers, because when your wake word is triggered, the demo plays a "ding" sound.
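The demo itself boils down to a few lines. Below is a minimal sketch using the snowboydecoder Python module from the Snowboy repository; the model path is illustrative and should point to the .umdl/.pmdl file you downloaded or trained:

```python
import snowboydecoder

def on_hotword():
    snowboydecoder.play_audio_file()    # play the bundled "ding" sound
    print("Wake word detected!")

# Illustrative model path; replace it with your own downloaded or trained model.
detector = snowboydecoder.HotwordDetector("resources/models/snowboy.umdl",
                                          sensitivity=0.5)

# Blocks and listens; the official demo adds a signal handler so Ctrl-C exits cleanly.
detector.start(detected_callback=on_hotword, sleep_time=0.03)
detector.terminate()
```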


The gateway between you and your AI assistant is the Wake Word. It’s a small thing that a lot of people take for granted.

But it is no small thing to create an algorithm that’s always listening for a particular utterance that lasts less than a second that can be said by anybody, that can run on everything from a wristwatch to a car, and that maintains the privacy and security of everyone it can hear while it’s doing its job.

Wake word technology is a specific branch of AI, with its own unique challenges. For companies like Cisco, which are developing AI assistant products, there are teams of people who work on nothing else but tools to extract these sub-second signals from a continuous stream of noise.

Here are some of their biggest challenges.

1. Wake word technology has to run on everything. Except the cloud.

We expect our watches to respond to our voice. And our phones, laptops, cars, and also big communication devices like dual-70” screen room systems.

On small devices, the environment the wake word system has to deal with is relatively simple. A watch or smartphone can assume that only one person will be speaking to it at a time, and that the person will be fairly close. Room-based systems have a more complicated challenge: they have to pick out vocalizations from multiple overlapping speakers who might be far away, in acoustically complex spaces like big conference rooms.

In all cases, the system has to respond very quickly, which means the processing has to happen locally. The system can’t stream what its microphone hears to a cloud service continuously; the round-trip lag and the cloud-based processing would slow down wake word recognition enough to impact the user experience.

More importantly, no one wants everything they say to be sent over the Internet to a cloud service, where it might be recorded, analyzed, or possibly stolen. Using a cloud service to pick wake words out of an always-on audio stream is a security risk.

2. Wake word technology requires specialized hardware

In some ways, it is more difficult to build accurate wake word technology that works at a distance than it is to do continuous speech recognition. To trigger on a wake word — and nothing else — you need to grab a clear audio signal in a very small time window. On the input side, that means using microphone arrays or some other method to surgically extract potential wake word utterances from other noise in the room. Humans are very good at picking out a single voice in a crowd or at a distance. Machines using just standard microphones, much less so. And in meeting rooms, where several people might be talking at once, it’s that much harder to make this work.

And obviously, your wake word process needs to be running continuously, always analyzing the last second or so of sound for its cue. It’s not so difficult to find the processing cycles for this on a device with excess power, like a wall-mounted video system or a car, but running the AI continuously on a handheld, battery-powered device requires special algorithm tuning or hardware, or both. Apple, for example, runs its wake word process on the iPhone’s “Always On Processor,” a “low-power auxiliary processor,” which in turn is embedded in the Motion Coprocessor. Keeping the phone’s main CPU running constantly just to listen for the wake word would use too much power.


Cisco’s newest conference room video systems use specialized beamforming microphone arrays to isolate individual speakers from background noise — and from other people who are talking.

3. Diversity is quality

Wake word algorithms are based on neural nets, which need to be trained. The more data you provide for training, the better they’ll be. It’s not enough to just provide a lot of data, though. The datasets must be diverse. If training data is just men speaking, even if it’s millions of them, there will be more errors when women try to invoke the system. You’ll have the same problem if people with different accents or native languages try to use your system, and your training set didn’t include them. More diverse training cohorts make for better AIs — for everyone.

I could put some examples here showing where people-recognizing algorithms from the past have ended up favoring one group over another, but they are so cringe-worthy I don’t even want to link to them. Diverse training sets are required for building AI algorithms.

4. There are good wake words and bad wake words

There’s a subset of wake word theory that overlaps with linguistics, which is about creating the best wake words for the algorithms to pick out. Wake words need to be long enough to be distinct, but not so long that they vary a lot between speakers. They also need combinations of phonemes that are both easy to discern by a machine, and easy to say by a human. And what’s easy to say will vary based on which languages a speaker is comfortable with.

A good wake word has an uncommon collection of phonemes, with both fricatives (hard sounds) and distinct vowels. There are numerous things to avoid, at least in theory, like preceding a “sh” sound (technically, the unvoiced post-alveolar fricative) with an “h” as in “hey” (a voiceless glottal fricative).

5. Many of the wake words we use are terrible — thank goodness

Engineers don’t have the last say on a product’s wake word, though. The wake word is a huge part of a product’s brand, and marketers, lawyers, and other non-technical people have more say in what a product is named. Voice assistants’ invocation names need to be memorable and on-brand.

When engineers and marketing disagree on naming, marketing usually wins. Engineers don’t always get the wake words they want. But they make them work. I have seen that the challenges imposed on wake word coders for non-technical reasons actually make for more robust code.

But this and the two other reasons above are why you can’t change your smartphone’s wake word to anything you want. (See also this story from Salon: Don’t call it “Siri”: Why the wake word should be “computer.”)

6. Want to make your own wake word technology? There’s an app for that, sort of

If you’re a developer and you want to make an AI product that starts up when you call its name, you can use code available from several sources, like Snowboy, Sensory, and other companies.

These tools will get you off the ground, but when building an AI-powered tool, the much bigger challenge is providing it with training data. It takes a well-funded team to recruit the thousands of people needed to provide the voice recordings and additional human training time to coach a wake word neural net until it works well. These tools aren’t as easy to come by for a smaller company or a hobbyist. At least not today.

7. Wake words are a blip in time

Wake words won’t always be the sole method to get the attention of a speech-recognizing AI. Over time, we will have new invocation mechanisms, like intention based on context, intonation, gaze direction, a “wake wave,” and so on.

Wake word algorithms will also advance to be able to hold on to the conversational state for longer, just like one person does when talking to another person: if I start a conversation with Kathy, I don’t have to start every single sentence to her with her name: “Kathy, good morning. Kathy, want to go get some coffee with me? Kathy, how are your kids?”

We will likely have wake words for the foreseeable future. But for some commands, like “turn on the lights,” saying them in a commanding tone might be enough to trigger the speech recognizer without requiring a specific wake word. In other cases, a more robust always-listening AI will be able to know when to chime in based on what you’re talking about (although we can’t build this without also working extremely hard on keeping our human conversations secure and private).

Waking up to the challenge

Everyone working on AI assistants is applying these lessons to their products. We just released a bot with wake-word technology (Spark Assistant) but we’re also going to roll out face recognition to tune and personalize the responsiveness of our AI. We have built a “wake wave” invocation for our always-on video connection experiment, TeamTV. We are also working on “eye contact wake” in Spark Assistant VR.

There’s still a lot to learn. We are at a very interesting point in the development of how we co-exist with AI-powered assistants. What do you think? Drop a note in the comments.

The wake word is the word you use to start a conversation with a voice assistant. Inside every device in which a voice assistant is embedded, a tiny process keeps listening, waiting to detect the wake word out of a continuous stream of audio. The process at stake is both typical of how voice is generally transformed to feed a machine learning model, and quite simple from a Machine Learning perspective. We are going to take this example as an illustration of how Machine Learning is done on voice.

The personal wake word detector is a new feature of the Snips Voice Platform, which we are releasing in response to strong demand from the community. It makes it possible for anyone to pick any wake word they want to use to call their voice assistant. Go straight to the tutorial if you want to start playing with it. To understand how this detector works, and how voice is handled in many other Machine Learning applications, carry on reading!

There are two kinds of wake word detectors: Universal and Personal ones.

The universal wake word detectors are trained on a large variety of voices. The underlying model is generally a Deep Learning algorithm, that is trained to identify when anyone says the wake word.

On the other hand, the personal wake word detector is trained locally, on your device, with a small number of voice samples that you provided. This alternative is much more versatile compared to the universal one, since it allows you to use any arbitrary wake word. The difference is just that it’s not meant to work when someone else says the wake word.

What’s common to both types of detectors is the way voice is transformed before it’s fed into the Machine Learning model. This pre-processing is actually shared with many other applications of Machine Learning with voice, like Speech Recognition or Speaker Identification. It just turns out that results obtained with this pre-processing step are often better than without, although exceptions are starting to arise.

Let’s see how this personal wake word detector works in terms of training, inference, and performance.

Audio trimming

The principle of the personal wake word detector is to compare an incoming stream of audio to a set of templates of the wake word recorded by a user. It is a nearest neighbor logic. Hence, the first step is to acquire those templates. In practice, since we have no idea of how long the wake word will last, a margin is taken to record them. In our case, we give the user 2 seconds to record each template (see the documentation here). The sample rate used to acquire the sound is set to 16000 samples per second. Each recording therefore initially consists of 32000 samples.

Naturally, what comes before and after the wake word in the recording is not useful. If the recording takes place in a quiet environment, which we heavily recommend, what comes before and after are silences. In order to remove those silences and keep only the meaningful part of each template, a process called trimming is applied. This process consists in:

  • Dividing the signal into small chunks (framing)
  • Computing the energy of each frame
  • Removing every frame whose energy is lower than a predefined threshold.

For example, a 65 millisecond signal might be sliced into 5 frames of 0.025 seconds, with a hop of 0.01 seconds between consecutive frames (so adjacent frames overlap by 0.015 seconds). The figure below illustrates the framing process on a short audio signal.


The energy of a frame is the mean of the squared samples in that frame. A classic approach is to compute the energy of each frame, and compare it to a predefined threshold to classify the frame as silence or not.

To increase robustness, we take a slightly different approach: we threshold the ratio between the energy of each frame and the energy of the loudest frame. If a frame’s energy is more than 20 decibels below that of the loudest frame, the frame is classified as silence. The threshold can be manually configured in the Snips platform.
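As an illustration, a minimal NumPy sketch of this energy-ratio trimming could look as follows; the frame sizes and the 20 dB threshold follow the description above, but the exact Snips implementation may differ:

```python
import numpy as np

def trim_silence(signal, sample_rate=16000, frame_len=0.025, frame_step=0.01,
                 threshold_db=20.0):
    """Keep only the frames whose energy is within `threshold_db` of the loudest frame."""
    flen, fstep = int(frame_len * sample_rate), int(frame_step * sample_rate)
    starts = range(0, max(len(signal) - flen, 0) + 1, fstep)
    frames = np.array([signal[s:s + flen] for s in starts], dtype=float)

    energies = (frames ** 2).mean(axis=1)                         # mean square per frame
    ratio_db = 10.0 * np.log10(energies / (energies.max() + 1e-12) + 1e-12)
    keep = ratio_db > -threshold_db                               # within 20 dB of the peak

    first = np.argmax(keep)                                       # first non-silent frame
    last = len(keep) - 1 - np.argmax(keep[::-1])                  # last non-silent frame
    return signal[first * fstep : last * fstep + flen]            # trimmed signal
```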

The trimming process is illustrated below. On the left, the signal was recorded in clean conditions (no noise), and the trimmed signal strictly captures the wake word. On the right, an example with background noise recorded after the wake word. This example shows how noise can disrupt the trimming process.


Feature extraction

The second step of the training process is to extract a concise and meaningful representation of each template, to feed a simple machine learning model. This process is called feature extraction.

Mel Frequency Cepstral Coefficients (MFCCs) are widely used for Automatic Speech Recognition applications. This transformation tries to mimic some aspects of human speech perception by reproducing the logarithmic perception of loudness by the human ear. For a full tutorial on how MFCCs work, follow this link. The process to obtain the MFCCs from an audio signal is the following:

a. Pre-emphasis (optional). This step is aimed at amplifying the high frequencies, in order to balance the signal and improve the overall signal-to-noise ratio. In practice, we use a pre-emphasis coefficient of 0.97.


b. Framing. The signal is split into fixed-size slices (usually between 20 ms and 40 ms long), with a predefined amount of overlap between consecutive slices. This step is the same as the one described in the initial trimming step, but the window sizes and the overlaps might be different. Let us denote by nFrames the number of frames obtained from the original signal.

c. Windowing. A Povey window function is applied to each frame to reduce side lobes. The Povey window function is similar to the Hamming window, but equals zero at the edges.


At this point, the original signal has been pre-emphasised, framed and windowed, as illustrated in the figure below.


d. Transformation. We first compute the Discrete Fourier Transform of each frame, with a predefined number of components (512 here). Then, we compute the squared modulus of each coefficient to obtain the energy distribution of the signal across frequencies. At this stage, each frame is represented by 512 values, which represent the energy of the signal at different frequencies. The i-th value corresponds to a frequency of i*16000/(2*512).

We illustrate this on our toy examples below.


e. MEL scale mapping. We map each component onto the mel scale using triangular filters: this transformation is aimed at reflecting the way the human ear perceives sounds. The number of filters, denoted nFilters, is a parameter of the algorithm. At that point, each frame is represented by a vector of size nFilters. We then take the log of each component to obtain the log-filter banks.

f. Normalisation. Finally, we apply a discrete cosine transform (DCT) and normalise each component by removing its mean. This last step is meant to decorrelate the log-filter bank coefficients.

At this point, our audio sample has been transformed into a feature matrix of size nFrames × nFilters, where nFrames is the number of frames resulting from the framing step above and nFilters is the number of mel filters chosen in the last steps. The figure below illustrates the feature matrix of the trimmed audio signal from the previous section.
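To make steps a-f concrete, here is a compact NumPy/SciPy sketch of the whole pipeline. It is an illustration rather than the Snips code: a Hamming window stands in for the Povey window, and the parameter values (frame sizes, FFT size, number of filters) are common defaults, not necessarily the exact ones used by the platform.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def features(signal, sample_rate=16000, frame_len=0.025, frame_step=0.01,
             n_fft=512, n_filters=26, preemph=0.97):
    # a. Pre-emphasis: boost high frequencies.
    signal = np.append(signal[0], signal[1:] - preemph * signal[:-1])

    # b. Framing.
    flen, fstep = int(frame_len * sample_rate), int(frame_step * sample_rate)
    n_frames = 1 + max(len(signal) - flen, 0) // fstep
    frames = np.stack([signal[i * fstep : i * fstep + flen] for i in range(n_frames)])

    # c. Windowing (a Hamming window is used here in place of the Povey window).
    frames = frames * np.hamming(flen)

    # d. DFT and power spectrum (one-sided, n_fft // 2 + 1 bins).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # e. Triangular mel filterbank, then log.
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i - 1, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    log_fbank = np.log(power @ fbank.T + 1e-12)

    # f. DCT and mean normalisation, yielding an (nFrames x nFilters) feature matrix.
    ceps = dct(log_fbank, type=2, axis=1, norm='ortho')
    return ceps - ceps.mean(axis=0)
```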


In the next section, we will explain how we use those transformed templates in our wake word detection algorithm.

Dynamic Time Warping

Now that we have a clean representation of the audio templates, our goal is to compare an incoming audio stream to these templates, in order to decide whether a wake word is present in the audio stream or not.

To make things simpler, let us consider for now that only one audio template is available. Concretely, the live audio stream is sliced into windows with a size equal to that of the template, and a predefined shift between consecutive windows. Each window is then processed with the feature extractor, and finally compared to the template. To perform this comparison and output a decision, we use Dynamic Time Warping (DTW), which measures the similarity between two time series (see this link for a full introduction to DTW techniques).

Let us consider two time series of respective sizes N × K and M × K, where N and M are time dimensions, and K is the dimension of the feature space. In our application N=nFrames[Template], M=nFrames[Stream] and K=nFilters.
The first step to compute the DTW between those series is to evaluate the related pairwise element cost matrix. The metric used to compute this matrix can change depending on the need; for our algorithm we chose the cosine similarity. Each element of this matrix is defined as:



After this step, we end up with a matrix of size (N, M) called the cost matrix. The aim of DTW is to find a path going from the first element (with coordinates (1,1)) to the last element (with coordinates (N,M)) of the cost matrix with a minimal cumulated cost. This path is called an optimal warping path, and its total cost defines the DTW distance between both signals.

The figure below on the left represents the cost matrix between two series together with the related optimal warping path. Intuitively, the warping path tends to align the points of the two time series, with the constraint that every point of sequence 1 (respectively sequence 2) must be mapped to at least one point of sequence 2 (respectively sequence 1). In our specific case, since a wake word is always pronounced at a relatively constant tempo and since all its syllables are always pronounced in the same order, some warping paths can be discarded. This constraint can be encoded directly in the cost matrix by assigning an infinite cost to each element in the area that we want to discard. For our wake word detection algorithm we use a diagonal constraint (see the figure below).
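A straightforward (unoptimised) sketch of this comparison is shown below, using the cosine distance to build the cost matrix and an optional diagonal band as the constraint; the exact shape of the constraint used by Snips is not detailed here, so the band is an assumption.

```python
import numpy as np

def cosine_cost(a, b):
    """Pairwise cosine distance between the frames of two feature matrices."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
    return 1.0 - a @ b.T                          # shape (N, M)

def dtw(template, window, band=None):
    """Normalised DTW distance between a template and an audio-stream window.

    `band`, if given, is a diagonal constraint expressed as a fraction: cells whose
    relative positions along the two series differ by more than `band` keep an
    infinite cost, which discards the corresponding warping paths.
    """
    cost = cosine_cost(template, window)
    N, M = cost.shape
    acc = np.full((N + 1, M + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            if band is not None and abs(i / N - j / M) > band:
                continue                          # outside the diagonal band
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],       # insertion
                                                 acc[i, j - 1],       # deletion
                                                 acc[i - 1, j - 1])   # match
    # Normalise by the sum of the temporal dimensions, as described in the next section.
    return acc[N, M] / (N + M)
```

For example, `dtw(template_features, window_features, band=0.2)` compares two feature matrices while only allowing paths that stay reasonably close to the diagonal.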


Reference distance and prediction

At this point, we are able to compare a live audio stream with each template by computing the corresponding DTWs. Formally, if we have 3 templates and consider a window from the audio stream, we can compute the 3 related DTWs, denoted DTW₁, DTW₂ and DTW₃. In practice, the duration of the audio stream window is set to the average duration of the 3 templates.

Our objective is to classify the audio stream window as containing a wake word if its DTW with respect to at least one of the templates is less than a predefined threshold. An optimal value for this threshold can be found empirically for each specific wake word. Yet, this value heavily depends on the size of the templates: the higher the number of frames, the longer the path, and hence the higher the DTW. To counter this effect, and to be able to define a universal threshold (i.e. one that is not wake-word dependent), each DTW is normalized by the sum of the temporal dimensions of both inputs.

Finally, if we call the decision threshold 𝜏, and consider an input sequence from the audio stream, the detector will trigger if either of the normalised versions of DTW₁, DTW₂ and DTW₃ is lower than 𝜏.

Let’s now see how we determined a good default value for 𝜏.
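Putting the pieces together, the streaming decision rule can be sketched as follows, reusing the `features` and `dtw` sketches above; the hop size and window bookkeeping are illustrative, not the platform's actual values.

```python
import numpy as np

def detect(stream, template_feats, tau=0.22, sample_rate=16000, hop_s=0.1):
    """Yield the sample offsets at which the wake word is detected.

    `stream` is a 1-D array of audio samples; `template_feats` is a list of
    feature matrices computed from the trimmed user recordings.
    """
    frame_step = int(0.01 * sample_rate)
    frame_len = int(0.025 * sample_rate)
    # Window length ~ average duration of the templates, converted back to samples.
    avg_frames = int(np.mean([t.shape[0] for t in template_feats]))
    win = (avg_frames - 1) * frame_step + frame_len
    hop = int(hop_s * sample_rate)

    for start in range(0, len(stream) - win + 1, hop):
        window_feats = features(stream[start:start + win])
        # Trigger if the window is close enough to at least one template.
        if min(dtw(t, window_feats) for t in template_feats) < tau:
            yield start
```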

Confidence

The wake word detection problem can be seen as a binary classification problem: the audio stream window either contains, or doesn’t contain, the wake word. To tune this kind of problem, it is particularly useful to be able to output a probability of belonging to each class (in our case, of being or not being the wake word). To this end, we artificially define 3 confidences, one per template, for each input sample:


where i is the template index. Note that if DTWᵢ is less than 𝜏, then the probability will be greater than 0.5, and will increase as DTWᵢ becomes smaller, as expected. Finally, the probability threshold above which the detector will trigger is the parameter that we expose to the user (see the tutorial here). We call it the sensitivity. The higher the sensitivity, the higher the number of false alarms and the lower the number of missed wake words.
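The exact confidence formula is not reproduced here; for illustration only, any monotone mapping with the properties described above works. One hypothetical choice, which equals 0.5 when DTWᵢ = 𝜏 and grows towards 1 as DTWᵢ shrinks, is:

```python
def confidence(dtw_i, tau=0.22):
    # Hypothetical mapping: equals 0.5 when dtw_i == tau, tends to 1 as dtw_i -> 0.
    return tau / (tau + dtw_i)

def triggers(dtw_values, sensitivity=0.5, tau=0.22):
    # The detector fires if the best template match is confident enough.
    return max(confidence(d, tau) for d in dtw_values) >= sensitivity
```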

The next section is dedicated to an analysis of the performance of our algorithm, and its limitations.

Finding an acceptable reference distance 𝜏

We looked for an acceptable universal value for 𝜏. To do so, we empirically recorded different wake word templates, for different people, languages, and lengths of the wake word. For each of them, we computed the normalized DTWs with respect to the following set of audio samples:

  1. Other templates, not noisy: for each wake word, and each template of this wake word, this is the set of all the other wake word templates of the same wake word, except the one in question. Ideally, all DTWs with respect to the studied template should be lower than 𝜏.
  2. Other templates, noisy (20db): similarly, for each template of each wake word, this is the set of templates of the same wake word, except the one in question, augmented with background noise, with a signal to noise ratio of 20 decibels. Ideally, all DTWs with respect to the studied template should be lower than 𝜏.
  3. Voices: this is a set of audio recordings composed of people pronouncing random text. This set is around 10 hours long in total. Ideally, all DTWs with respect to the wake word templates should be greater than 𝜏.
  4. Noises: this is a set of recordings of background noise. Ideally, all DTWs with respect to the wake word templates should be greater than 𝜏.

We repeated this process for each wake word template, for different wake words, and aggregated the results. The figure below represents the normalized DTW distribution for each of the sets defined above. Looking at these figures gives a first intuition about the distance threshold that could be used. It appears that setting it around 0.22 leads to a good separation between wake word and non-wake word recordings, both with and without background noise.


To confirm this intuition, we computed the False Alarm rate per Hour (FAH) on the Voices dataset, which is the trickiest one, and the False Rejection rate (FRR) on actual templates, both in noisy and non-noisy conditions, with a distance threshold of 0.22. The FAH quantifies the number of times the detector will accidentally trigger each hour, while the FRR quantifies the rate at which a wake word will be missed. Both are the measures most commonly used in the literature for this task.
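For reference, both metrics are simple ratios over a labelled test run; a minimal sketch (the variable names are illustrative):

```python
def false_alarms_per_hour(n_false_alarms, audio_hours):
    """Number of spurious detections per hour of non-wake-word audio."""
    return n_false_alarms / audio_hours

def false_rejection_rate(n_missed, n_wake_words):
    """Fraction of genuine wake word utterances that were not detected."""
    return n_missed / n_wake_words

# e.g. roughly 20 false alarms over the ~10 h Voices set corresponds to FAH ≈ 2,
# in line with the figures reported in the next paragraph.
```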

On voices, the False Alarm rate per Hour is 2.06, which means that with people constantly talking next to it, the wake word detector will trigger about twice per hour with default settings. Improved performance can be obtained by tuning the decision parameters on a case-by-case basis. The False Rejection rate is 0% without background noise, and 0.3% in noisy conditions.

Limits of the approach

To push the exploration further, we increased the level of noise in the test datasets. We created a first dataset with a 10 decibel signal-to-noise ratio, which is lower than the 20 dB used initially, and a second one with a 5 decibel signal-to-noise ratio. Intuitively, the performance of the algorithm should decrease as the signal-to-noise ratio decreases, since the wake word becomes more difficult to detect.

The same plots as in the previous section are shown below, with the 10 decibel set on the left and the 5 decibel set on the right. As expected, in both cases the frontier between positive and negative samples becomes blurrier as the signal-to-noise ratio decreases.


Keeping the reference distance at 0.22, the False Rejection rate increases to 2.8% and 20.4% at 10 and 5 decibels signal-to-noise ratio, respectively.

Of course, as mentioned before, those are average performances obtained with the default sensitivity threshold. Improved performance can be obtained by adapting the sensitivity for each wake word.

In order to improve the robustness of the system, we are currently working on two approaches:

  • Adding an audio frontend in order to artificially reduce the ambient noise (noise cancellation)
  • Using deep learning approaches, which have been shown to be more robust to noise and to transfer knowledge between wake words.

Now that you understand how sound is cleaned, transformed and processed to determine whether a wake word has been said, try it out! A full tutorial is available here. To experiment further, you can play with all the parameters defined in this article for trimming, feature extraction, etc., in the script_recording.py file.

Happy hacking, and don’t hesitate to share your feedback and new ideas!



It is also highly likely we have a job for you at Snips 🤩! We are the largest voice startup in Europe, and are hiring in machine learning, software engineering, blockchain, sales, product, marketing, etc…

Accurate, widely deployed, wake word & multiple phrase technology that recognizes, analyzes and responds to dozens of keywords and phrases.

Always-Listening Wake Word


Custom Branded Wake Words

It is common for wake words like Alexa, Siri and Google to become associated with the highly valued, high tech product experience. On the other end of the automotive spectrum, a company like Tesla is sure to protect their brand and refuses to offer Siri, Google or Alexa integrations. This isn’t by accident, and companies with a strong brand presence are beginning to figure it out. Owning the voice experience starts with a branded custom wake word. Some early examples include branded wake words like, “Hi LG” or “OK Honda”. In these examples, the brand name is the doorway to the voice user interface and every incantation reinforces the consumer bond to the brand.


Why Embedded Word Detection?


Key Features

Always-Listening Hotword Detection: TrulyHandsfree

Our technology delivers customizable wake words, small to medium-sized command sets, speaker identification, and speaker verification models. TrulyHandsfree™ technology has been included in products ranging from mobile phones, tablets, PCs, and IoT devices to wearables, hearables, medical equipment, and vehicles around the globe. Download the product brief to learn more.

How Wake Words Work

Wake words, sometimes dubbed “hotwords,” are not created equal, and this is important because a wake word can be considered the key to the front door. When the wake word is uttered, the door should open every time; in other words, the rate of false rejects is ideally close to zero. When the wake word is NOT said, the door shouldn’t open; that is, the rate of false accepts should also be close to zero. Achieving this balance of low false accepts and false rejects in an embedded algorithm for a multitude of applications requires well-trained and fine-tuned AI, and with over 25 years of experience, Sensory can deliver.

Highly accurate hotword detection is made complex by factors like noise, distance, and variations in accents and language. Platforms such as Amazon, Google, and a few others seem to have pretty good wake words, but if you go into your Alexa settings you can see all of the voice data that’s been collected, and a lot of it is being collected when you weren’t intentionally talking to Alexa! You can see this performance issue in the Vocalize test report. Sensory substantially outperformed Amazon in the false reject area. There’s a reason that companies like Amazon, Google, Microsoft and Samsung have turned to Sensory for our TrulyHandsfree technology.
