Machine listening

From Wikipedia, the free encyclopedia

Machine listening is a technique that uses software and hardware to extract meaningful information from audio signals. The engineer Paris Smaragdis, interviewed in Technology Review, describes these systems as "software that uses sound to locate people moving through rooms, monitor machinery for impending breakdowns, or activate traffic cameras to record accidents."[1]

Because audio signals are ultimately interpreted by the human ear-brain system, machine listening software needs to model that complex perceptual mechanism in some form. In other words, to perform on par with humans, the computer should hear and understand audio content much as humans do. Analyzing audio accurately draws on several fields: electrical engineering (spectrum analysis, filtering, and audio transforms); psychoacoustics (sound perception); cognitive sciences (neuroscience and artificial intelligence); acoustics (physics of sound production); and music (harmony, rhythm, and timbre). Furthermore, audio transformations such as pitch shifting, time stretching, and sound object filtering should be perceptually and musically meaningful. For best results, these transformations require perceptual understanding of spectral models, high-level feature extraction, and sound analysis/synthesis. Finally, structuring and coding the content of an audio file (sound and metadata) stand to benefit from efficient compression schemes, which discard inaudible information in the sound.[2]
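As a rough illustration of the spectrum analysis mentioned above, the following minimal sketch estimates the strongest frequency in a short signal using a naive discrete Fourier transform. It is not taken from any system described in this article: the function names `magnitude_spectrum` and `dominant_frequency`, the 8 kHz sample rate, and the synthetic 440 Hz test tone are all assumptions chosen for the example, and a practical system would normally use a fast Fourier transform library instead.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Return DFT magnitudes for bins 0..N/2 of a real signal (naive O(N^2) DFT)."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = 0j
        for t, x in enumerate(samples):
            # Correlate the signal with a complex sinusoid at bin k.
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        spectrum.append(abs(acc))
    return spectrum


def dominant_frequency(samples, sample_rate):
    """Estimate the strongest frequency component in Hz, ignoring the DC bin."""
    mags = magnitude_spectrum(samples)
    peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])
    # Each bin spans sample_rate / N hertz.
    return peak_bin * sample_rate / len(samples)


# Hypothetical test signal: a 440 Hz sine tone, 400 samples at 8 kHz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(400)]
print(dominant_frequency(tone, SAMPLE_RATE))
```

With these assumed values the frequency resolution is 20 Hz per bin, so the tone falls exactly on bin 22 and the estimate is 440 Hz. Real recordings rarely align with a bin this neatly, which is one reason practical analysis applies window functions and perceptually motivated frequency scales.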

The Medical Intelligence and Language Engineering Laboratory (MILE Lab) at the Department of Electrical Engineering, Indian Institute of Science, Bangalore, India, is working toward the long-term goal of multilingual speech recognition using a knowledge-based approach. The work assumes realistic speech, in which the signal being analyzed may contain non-speech sounds made by the speaker, such as laughter, coughing, sneezing, humming, and disfluencies. Natural noises such as thunder and man-made noises such as vehicle noise are also to be handled. The knowledge-based approach is distinct from statistical approaches, which require a large amount of speech data, usually annotated at some level. The sensory pathways in human beings are rich in active feedback channels, which can modulate how the signal is pre-processed in the early stages of the recognition pathway. Prof. A G Ramakrishnan has proposed the concept of Attention-Feedback, which attempts to model such active feedback mechanisms in machine learning (pattern recognition) systems.

See also