YOSEKS VAS
Voice Analysis System
Platform Features
Speaker Identification (SID)
Speaker Diarization (DIAR)
Language Identification (LID)
Gender Identification (GID)
Age Estimation (AGE)
Speech Transcription (STT)
Keyword Spotting (KWS)
Voice Activity Detection (VAD)
Speech Quality Estimation (SQE)
Speaker Identification (SID)
Phonexia Speaker Identification (SID) uses the power of voice biometry to automatically recognize a speaker by their voice.
Technology
- A calibration tool for even higher accuracy
- 1:1 (verification), 1:n and n:m (identification) comparison possible
- The technology is language-, accent-, text-, and channel-independent
- Uses deep neural networks to generate highly representative voiceprints
- Applies state-of-the-art channel compensation techniques, verified by NIST evaluation
- Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.
Input
- Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)
- Recommended speech signal for enrolment: 20+ seconds
- Minimum speech signal for identification: 7+ seconds
In specific use cases, the time required for speaker enrolment and identification can be much shorter.
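The 8 kHz minimum can be verified before submitting audio. A minimal sketch using Python's standard wave module; the helper names are illustrative and not part of the product API:

```python
import io
import wave

MIN_SAMPLE_RATE = 8000  # the platform expects 8 kHz or higher sampling

def make_silent_wav(rate: int, seconds: float = 0.1) -> bytes:
    """Build a minimal 16-bit mono PCM WAV payload for demonstration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 16-bit samples
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(rate * seconds))
    return buf.getvalue()

def needs_conversion(wav_bytes: bytes) -> bool:
    """True when a WAV payload is sampled below 8 kHz and must be resampled."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getframerate() < MIN_SAMPLE_RATE
```

Recordings that fail the check would be resampled before enrolment.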
Output
- Results in XML/JSON format or as result files, with a log-likelihood ratio score (−∞; ∞) and/or a percentage score (0–100 %)
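The relation between the two score types can be illustrated with a logistic mapping from a log-likelihood ratio onto a 0–100 % scale. This is a hypothetical sketch; the product applies its own calibrated mapping, which this does not reproduce:

```python
import math

def llr_to_percentage(llr: float) -> float:
    """Map a log-likelihood ratio from (-inf, inf) to a 0-100 % score
    using a logistic curve; real deployments use a calibrated mapping."""
    return 100.0 / (1.0 + math.exp(-llr))
```

An LLR of 0 (no evidence either way) lands at exactly 50 %.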
Accuracy and Processing speed
Achieves more than 99% accuracy (0.96% Equal Error Rate based on NIST evaluation data set).
Up to 182× faster than real-time processing on 1 CPU core with the most precise model – for example, a standard 1 CPU core server processes up to 4,368 hours of audio in one day of computing time.
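The throughput figures throughout this datasheet follow directly from the real-time factor: hours of audio per day equals the speed factor times 24 hours, per core. A small sketch of that arithmetic:

```python
def hours_per_day(speed_factor: float, cores: int = 1) -> float:
    """Audio hours processed per day of computing time at a given
    faster-than-real-time factor, scaled by the number of CPU cores."""
    return speed_factor * 24 * cores
```

For example, 182x real time on one core yields 182 × 24 = 4,368 hours per day.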
Speaker Diarization (DIAR)
Speaker Diarization (DIAR) segments the voices of individual speakers in single-channel audio.
Technology
- Trained with an emphasis on spontaneous telephone conversation
- The technology is language-, accent-, text-, and channel-independent
- Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.
Input
- Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)
Output
- Results in XML/JSON format or as result files, with segmentation of speech, silence, and technical signals (i.e., elimination of phone-line beeps, DTMF tones, music, etc.)
- Audio file extracted for each speaker
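Downstream tools typically aggregate the per-speaker segments, for example to compute talk time per speaker. A sketch assuming segments arrive as (speaker_label, start_sec, end_sec) tuples; the tuple layout is an assumption, not the product schema:

```python
from collections import defaultdict

def talk_time_per_speaker(segments):
    """Sum speech time per speaker from diarization segments
    given as (speaker_label, start_sec, end_sec) tuples."""
    totals = defaultdict(float)
    for speaker, start, end in segments:
        totals[speaker] += end - start
    return dict(totals)
```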
Processing speed
Approx. 50x faster than real-time processing on 1 CPU core.
I.e., a standard 1 CPU core server processes 1,200 hours of audio in 1 day of computing time.
Language Identification (LID)
Language Identification (LID) automatically detects the spoken language or dialect.
Technology
- The technology is text- and channel-independent
- Applies state-of-the-art channel compensation techniques, verified by NIST evaluation
- Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.
Supported languages
Afan_Oromo, Albanian, Amharic, Arabic, Arabic_Gulf, Arabic_Iraqi, Arabic_Levantine, Arabic_Maghrebi, Arabic_MSA, Azerbaijani, Bangla_Bengali, Bosnian, Burmese, Chinese_Cantonese, Chinese_Dialects, Chinese_Mandarin, Creole, Croatian, Czech, Dari, English_American, English_British, English_Indian, Farsi, French, Georgian, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Khmer, Kirundi_Kinyarwanda, Korean, Lao, Macedonian, Ndebele, Pashto, Polish, Portuguese, Punjabi, Russian, Serbian, Shona, Slovak, Somali, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Tibetan, Tigrigna, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese
A user can add new languages to the system without assistance from Phonexia. Approximately 20 hours of audio recordings are recommended for training a new language.
Input
- Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)
- Minimum speech signal for identification: 7+ seconds
Output
- Results in XML/JSON format or as result files, with a log-probability score (−∞; 0] and/or a percentage score (0–100 %)
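Log-probability scores can be turned into comparable percentages by exponentiating and normalizing. A sketch under the assumption that per-language scores arrive as a dict of log probabilities; the normalization shown is illustrative, not the engine's own percentage metric:

```python
import math

def log_scores_to_percentages(log_scores):
    """Convert per-language log-probability scores (each <= 0) into
    percentages that sum to 100; the schema is illustrative."""
    probs = {lang: math.exp(s) for lang, s in log_scores.items()}
    total = sum(probs.values())
    return {lang: 100.0 * p / total for lang, p in probs.items()}
```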
Processing speed
Approx. 20x faster than real-time processing on 1 CPU core with the most precise model.
I.e., a standard 1 CPU core server processes 480 hours of audio in 1 day of computing time.
Gender Identification (GID)
Gender Identification (GID) automatically recognizes the gender of a speaker.
Technology
- Uses the acoustic characteristics of speech
- Speech is converted to frequency spectra and modeled with advanced statistical methods
- The technology is language-, accent-, text-, and channel-independent
- Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.
Input
- Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)
- Minimum speech signal for identification: 7+ seconds
Output
- Results in XML/JSON format or as result files, with processed information (scores for male and female)
Processing speed
Approx. 200x faster than real-time processing on 1 CPU core.
I.e., a standard 1 CPU core server processes 4,800 hours of audio in 1 day of computing time.
Age Estimation (AGE)
Age Estimation (AGE) estimates the age of a speaker from an audio recording.
Technology
- Trained with an emphasis on spontaneous telephone conversation
- The technology is language-, accent-, text-, and channel-independent
- Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.
Input
- Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)
Output
- Results in XML/JSON format or as result files, with age estimates
Processing speed
Up to 182× faster than real-time processing on 1 CPU core with the most precise model – for example, a standard 1 CPU core server processes up to 4,368 hours of audio in one day of computing time.
Speech Transcription (STT)
Speech Transcription (STT) converts speech signals into plain text.
Technology
- Trained with an emphasis on spontaneous telephone conversation
- Based on state-of-the-art techniques for acoustic modeling, including discriminative training and neural network-based features
- Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.
Supported languages
Arabic, Chinese (beta version), Czech, Dutch, English UK, English US, Farsi (beta version), French, German, Italian, Spanish – Lat.Am., Polish, Russian, Slovak
Input
- Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)
Output
- Results in XML/JSON format or as result files, with:
- One-best transcription, i.e., a time-aligned transcript with the start and end time of each word
- n-best transcription, i.e., a confusion network with word hypotheses at each moment
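A one-best transcript can be flattened back into plain text from its time-aligned words. A sketch assuming the words arrive as (start_sec, end_sec, word) tuples; the tuple layout is illustrative, not the product schema:

```python
def one_best_to_text(words):
    """Join a time-aligned one-best transcript, given as
    (start_sec, end_sec, word) tuples, into plain text in time order."""
    return " ".join(w for _, _, w in sorted(words, key=lambda t: t[0]))
```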
Processing speed
The 5th generation is approximately 7x faster than real-time processing on 1 CPU core – for example, a standard 1 CPU core server processes 168 hours of audio in one day of computing time.
The 4th generation is approximately 1.2x faster than real-time processing on 1 CPU core.
Keyword Spotting (KWS)
Keyword Spotting (KWS) identifies the occurrences of keywords and/or keyphrases in audio recordings.
Technology
- Robust acoustic-based technology that remains accurate even with noisy recordings
- Keywords are automatically converted into phonemes and searched for
- Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.
Supported languages
Arabic, Chinese (beta version), Croatian, Czech, Dutch, English US, Farsi (beta version), French, German, Hungarian, Italian, Pashtu, Polish, Russian, Slovak, Spanish – Lat.Am, Turkish (beta version)
A user can add an unlimited number of keywords to the system, as well as an unlimited number of pronunciation variants for each keyword.
Input
- Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)
Output
- Results in XML/JSON format or as result files, with detected keywords (keyword, start/end time, file path, probability, etc.)
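Clients commonly post-filter hits by the reported probability. A sketch assuming each detection is a dict with keyword, start, end, and probability keys; the key names are an assumption, not the product schema:

```python
def filter_detections(detections, min_probability=0.8):
    """Keep keyword hits, given as dicts with 'keyword', 'start',
    'end', and 'probability' keys, whose confidence meets the threshold."""
    return [d for d in detections if d["probability"] >= min_probability]
```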
Processing speed
The 5th generation is approximately 30x faster than real-time processing on 1 CPU core, i.e., a standard 1 CPU core server processes 720 hours of audio in one day of computing time.
The 4th generation is approximately 10x faster than real-time processing on 1 CPU core.
Voice Activity Detection (VAD)
Voice Activity Detection (VAD) identifies the parts of audio recordings that contain speech vs. nonspeech content.
Technology
- Trained with an emphasis on spontaneous telephone conversation
- The technology is language-, accent-, text-, and channel-independent
- Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, satphones, etc.
Input
- Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)
Output
- Results in XML/JSON format or as result files, with labels (speech vs. nonspeech segments)
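The speech/nonspeech labels make it easy to compute how much of a recording is actually speech. A sketch assuming segments arrive as (label, start_sec, end_sec) tuples; the layout is illustrative:

```python
def speech_ratio(segments, total_duration):
    """Fraction of a recording labeled as speech, from
    (label, start_sec, end_sec) segments."""
    speech = sum(end - start for label, start, end in segments if label == "speech")
    return speech / total_duration
```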
Processing speed
Approx. 150x faster than real-time processing on 1 CPU core.
I.e., a standard 1 CPU core server processes 3,600 hours of audio in 1 day of computing time.
Speech Quality Estimation (SQE)
Speech Quality Estimation (SQE) measures the quality parameters of speech in an audio recording.
Technology
- The technology is language-, accent-, text-, and channel-independent
- Compatible with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, etc.
Input
- Input format for processing: WAV or RAW (PCM unsigned 8 or 16 bits, IEEE float 32-bit, A-law or Mu-law, ADPCM), FLAC, OPUS; 8 kHz+ sampling (other audio formats automatically converted)
Output
- Results in XML/JSON format or as result files, with:
- Global score, i.e., a percentage expression of audio quality (range 0–100); by default, the global score is calculated from the waveform_n_bits and waveform_snr variables
- Detailed outputs, i.e., clipped signal, amplitude, sample values, sampling frequency, SNR, technical signal, encoding, etc.
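A common use of the global score is gating recordings before further processing, e.g., rejecting audio too poor for speaker identification. A hypothetical sketch; the threshold value and the result schema are assumptions:

```python
def passes_quality_gate(result, min_global_score=60.0):
    """Decide whether a recording's SQE result (a dict with a
    'global_score' in 0-100) is usable for downstream processing."""
    return result.get("global_score", 0.0) >= min_global_score
```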
Processing speed
Approx. 2,000x faster than real-time processing on 1 CPU core.
I.e. a standard 1 CPU core server processes 48,000 hours of audio in 1 day of computing time
