On the Application Prospects of Intelligent Speech Technology in the Security Industry

From the age of text to the age of images, and now to the age of ubiquitous voice, the spread and rapid growth of intelligent voice technology keep reshaping the way people live. The arrival of the Amazon Echo is the most striking milestone.

The current market shows that intelligent voice technology is most widely used in smart products (smart speakers, robots) and the smart home, and that speech recognition is the core technology through which intelligent voice reaches real deployment. It is worth noting, however, that as intelligent voice technology continues to develop, it should find new deployment scenarios.

Against this background, this article analyzes the application of speech recognition technology in the security industry.

The security industry should be an excellent place for speech recognition to "choose a career".

With the development of AI technology, many enterprises have shifted their strategy to "AI+". Given the broad application prospects of the security industry, "AI + security" quickly became the main theme of the market. As a major branch of artificial intelligence, intelligent voice technology naturally needs to "choose a career" in the security industry as well, and speech recognition is the first technology to do so.

Enabling people and machines to understand each other, that is, human-computer interaction, has always been at the core of intelligence in the security industry. Speech recognition, as the core deployed technology of human-computer interaction, has many footholds in the security industry. Its main use is in security robots, typified by intelligent inspection robots.

Like other voice-enabled service robots, security robots pick up external sound through built-in microphones and then recognize and understand human speech. Once a robot recognizes dangerous behavior behind what it hears, it automatically triggers the alarm system and enters a defensive state, thereby protecting the target person.

Beyond security robots, speech recognition also plays a key role in smart hotel scenarios within the security industry. In Alibaba's recently opened Future Hotel, face recognition is the main technology, but the intelligent robots that run through the entire service process are just as indispensable. There, a robot serves as the front desk and guides guests through the whole check-in process. In their rooms, guests can also improve their stay by talking to the Tmall Genie smart speaker. In the hotel check-in of the future, whether the service is delivered by a robot or by a human attendant, human-computer interaction will be completed through speech recognition, and the full-stack voice interaction system built on it will create connected intelligent scenarios anytime and anywhere.

Of course, the application of speech recognition in the security industry also extends to many other intelligent scenarios, such as smart finance and smart education.

Alternatively, intelligent voice technology can serve as a "good helper" for face recognition.

Face recognition is the core application of video surveillance in the security industry, which needs no further discussion here. Looking ahead, however, we can let our imagination run further and use intelligent voice technology to assist face recognition, making video surveillance more intelligent.

The market talks a great deal about speech recognition, but few enterprises pay attention to voiceprint recognition and speech emotion recognition, which also belong to intelligent speech technology.

Voiceprint recognition, also known as speaker recognition, converts sound signals into electrical signals that a computer then recognizes. It can be divided into speaker identification and speaker verification. Different scenarios call for different voiceprint techniques. For example, narrowing the scope of a criminal investigation may call for identification (which of many known speakers is this?), while a banking transaction calls for verification (is this speaker who they claim to be?).
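The difference between the two tasks can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-dimensional "voiceprints", the speaker names and the 0.8 threshold are all hypothetical, and a real system would obtain its vectors from a trained speaker model rather than typing them in by hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(probe, enrolled):
    """Identification (1:N): return the enrolled speaker closest to the probe."""
    return max(enrolled, key=lambda name: cosine_similarity(probe, enrolled[name]))


def verify(probe, claimed_voiceprint, threshold=0.8):
    """Verification (1:1): accept only if the probe matches the claimed identity."""
    return cosine_similarity(probe, claimed_voiceprint) >= threshold


# Hypothetical voiceprints; real ones would come from a speaker model.
ENROLLED = {
    "alice": [0.9, 0.1, 0.2],
    "bob": [0.1, 0.8, 0.3],
    "carol": [0.2, 0.2, 0.9],
}

if __name__ == "__main__":
    probe = [0.85, 0.15, 0.25]
    print("identified as:", identify(probe, ENROLLED))
    print("claim 'alice' accepted:", verify(probe, ENROLLED["alice"]))
    print("claim 'bob' accepted:", verify(probe, ENROLLED["bob"]))
```

Identification searches the whole enrolled set and always returns a best candidate, which suits narrowing down a list of suspects. Verification compares against a single claimed identity and can refuse, which is what a banking transaction needs.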

Speech emotion recognition is one approach to emotion recognition. It refers to a computer automatically recognizing the emotional state of input speech. Using sensor input, the computer calculates and analyzes the structural characteristics and distribution patterns of different emotional tones in terms of time structure, amplitude structure, fundamental frequency structure and formant structure, so as to identify the emotional content implied in the speech.
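To make three of these feature families concrete, the following Python sketch computes a duration (time structure), an RMS level (amplitude structure) and an autocorrelation-based pitch estimate (fundamental frequency structure) from a synthetic tone. The sample rate, function names and test signal are assumptions made for illustration. Formant analysis is left out because it needs more machinery than fits in a short example, and a real emotion recognizer would feed features like these into a trained classifier.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)


def synth_tone(freq_hz, duration_s, amplitude):
    """Generate a pure sine tone as a stand-in for a voiced speech segment."""
    n = int(SAMPLE_RATE * duration_s)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n)]


def duration_seconds(signal):
    """Time structure: length of the segment in seconds."""
    return len(signal) / SAMPLE_RATE


def rms_amplitude(signal):
    """Amplitude structure: root-mean-square level of the segment."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def fundamental_frequency(signal, fmin=50, fmax=400):
    """Fundamental frequency: lag with the highest autocorrelation in [fmin, fmax]."""
    min_lag = SAMPLE_RATE // fmax
    max_lag = SAMPLE_RATE // fmin
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return SAMPLE_RATE / best_lag


if __name__ == "__main__":
    tone = synth_tone(freq_hz=200, duration_s=0.1, amplitude=0.5)
    print("duration (s):", duration_seconds(tone))
    print("RMS amplitude:", round(rms_amplitude(tone), 4))
    print("estimated F0 (Hz):", fundamental_frequency(tone))
```

Real speech is far less regular than a sine tone, so practical pitch trackers add windowing, voicing detection and smoothing on top of this basic idea.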

Although the recognition rate of current face recognition technology is as high as 99% or even 99.9%, the remaining 1% or 0.1% is a problem that current technology cannot overcome. Imagine adding voiceprint recognition and speech emotion recognition to today's face-recognition video surveillance systems, forming audio-visual fusion technology (lip reading), so that even when the target person is silent, the system might still predict and recognize their intentions and behavior. Would video surveillance then rise to a new level of intelligence and truly achieve "prevention" in the future?
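One simple way to picture voiceprint recognition assisting face recognition is score-level fusion, sketched below in Python. The weights, the threshold and the scores are all hypothetical, and weighted averaging is only one of many possible fusion strategies; the article does not prescribe a particular one.

```python
def fuse_scores(face_score, voice_score, face_weight=0.6):
    """Weighted average of two match scores, each assumed to lie in [0, 1]."""
    if not 0.0 <= face_weight <= 1.0:
        raise ValueError("face_weight must be between 0 and 1")
    return face_weight * face_score + (1.0 - face_weight) * voice_score


def is_match(face_score, voice_score, threshold=0.7):
    """Accept the identity only if the fused score reaches the threshold."""
    return fuse_scores(face_score, voice_score) >= threshold


if __name__ == "__main__":
    # A borderline face score (for example, a partly occluded face) is
    # rescued by a strong voiceprint match ...
    print(is_match(face_score=0.6, voice_score=0.95))
    # ... and rejected when the voiceprint disagrees.
    print(is_match(face_score=0.6, voice_score=0.3))
```

The point of the sketch is that the second modality contributes most in exactly the cases where face recognition alone is uncertain.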

It is widely held that deploying AI in the security industry comes down to four "hows": How can scenario-based AI applications be created to meet user needs? How can an industry intelligence system be built to solve the industry's practical problems? How can infrastructure, industry standards and security mechanisms be improved? How can a new, mutually beneficial, win-win ecosystem for the intelligent industry be built? These four questions apply just as naturally to intelligent voice technology in the security industry.

Far-field speech recognition should be the key intelligent speech technology for the security industry. However, it still faces three bottlenecks: echo, noise and reverberation. The most intuitive example is a security robot working in a public area: it receives so many speech signals that it cannot separate the target speech, and so cannot recognize it properly.
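The noise bottleneck can be illustrated with delay-and-sum beamforming, a common microphone-array technique that the article itself does not name and that is introduced here only as an example. The Python sketch below assumes the arrival delay of the target at each microphone is already known, uses a synthetic tone in place of speech, and does nothing about echo or reverberation.

```python
import math
import random


def simulate_microphones(target, delays, noise_std, rng):
    """Each microphone hears the target late by its delay, plus independent noise."""
    channels = []
    for delay in delays:
        delayed = [0.0] * delay + target
        channels.append([s + rng.gauss(0.0, noise_std) for s in delayed])
    return channels


def delay_and_sum(channels, delays):
    """Undo each channel's known delay, then average the aligned channels."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
            for i in range(length)]


def mean_square_error(estimate, reference):
    """Average squared difference between an estimate and the clean reference."""
    return sum((e - r) ** 2 for e, r in zip(estimate, reference)) / len(reference)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    target = [math.sin(2 * math.pi * 200 * i / 8000) for i in range(2000)]
    delays = [0, 3, 5, 8]  # hypothetical arrival delays, in samples
    channels = simulate_microphones(target, delays, noise_std=0.5, rng=rng)
    enhanced = delay_and_sum(channels, delays)
    print("single microphone error:", mean_square_error(channels[0], target))
    print("beamformed error:", mean_square_error(enhanced, target))
```

Because the noise at each microphone is independent while the aligned target is identical, averaging keeps the target and shrinks the noise. Competing speakers, echo and reverberation are correlated across microphones and are much harder to remove, which is why the article treats them as open bottlenecks.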

Another example is the speech emotion recognition technology mentioned above. Characterizing emotion in speech is in fact much harder than in facial expressions. Facial expression signals convey personal features and expressions but no linguistic information, whereas speech signals are a mixture that includes speaker characteristics, emotion, and the vocabulary and grammar emphasized in the speech content. Speech emotion recognition therefore requires far more training data than face recognition.

Beyond the technical problems of far-field speech recognition and speech emotion recognition, intelligent voice technology itself still has many unsolved problems, including accents, target speaker separation, mixed-language speech, efficient transfer and data iteration, industry standards, and defense against attacks. As a result, not only in the security industry but in its current applications across all industries, AI voice technology often seems better described as "artificial stupidity" than as artificial intelligence.

It is undeniable that a multi-modal interaction system combining face recognition, voiceprint recognition and speech emotion recognition should open many new application doors in the security industry, such as scene analysis and event detection. In the new round of AI-driven industry change, multi-modal technology will also become a key to success.

However, difficulties remain before intelligent voice can settle on a "career" in the security industry.

"No security without voice" sounds like a fine vision. Unfortunately, at present many difficulties must still be solved before intelligent voice can truly establish itself in the security industry.


It is widely held that AI is not about showing off technology, but about genuinely driving technological innovation and solving industry problems. Now that AI technology is entering large-scale application, it is all the more necessary for each technology to "choose a career" sensibly and to find differentiation amid homogeneity.

In the era of artificial intelligence, how can technical bottlenecks be broken, and how can every industry be empowered? The four solutions proposed by Yunzhisheng Chairman/CTO Liang Jiazhen may be the more rational answer: solving the problems deep learning faces in industrial-scale application; solving the problems of non-big-data, end-to-end and sequence-mapping learning; effectively combining data with knowledge to form an efficient iterative closed loop; and fundamentally upgrading machines' cognitive and learning abilities.

In 2019, artificial intelligence technology is gradually returning to rationality, and more and more problems are beginning to surface. But for the industry, it is both the worst of times and the best of times.