
Name : Joseph Razik
Joseph Razik graduated in both mathematics and computer science from
Nancy University in France. He then began research in speech
recognition and obtained an M.S. degree in computer science from the same
university. In 2007, he received a Ph.D. in computer science under the
direction of Prof. Jean-Paul Haton. His thesis concerned the definition of
confidence measures for automatic speech recognition. He is currently a
postdoctoral fellow, supervised by Gérard Chollet, in the Signal and Image
Processing department of TELECOM ParisTech.
His research interests include:
· Speech/music segmentation,
· Automatic dialogue,
· Speech and speaker recognition,
· Voice transformation and conversion.
Publications :
P. Perrot, M. Morel, J. Razik, G. Chollet. Vocal Forgery in Forensic
Sciences. e-Forensics 2009.
J. Razik, O. Mella, D. Fohr, J.-P. Haton. Frame-Synchronous and Local
Confidence Measures for on-the-fly Automatic Speech Recognition. Interspeech
2008.
Title of Project : Voice conversion: a toy, a threat or a forensic tool?
Voice conversion is an increasingly active topic, with applications in
entertainment, speech synthesis, and other areas. This development also makes
it a real threat in the hands of criminals, and perhaps an efficient tool for
investigators. The aim of voice conversion is to automatically transform a
source (impostor) speaker's voice so that it sounds like a target (client)
speaker's voice. In a criminal case (from malicious calls to terrorism), it is
impractical to rely on a professional impersonator to imitate the voice of a
target; a system able to do so automatically is far more attractive. Different
methods exist in the literature. The aim of this presentation is to review the
possibilities, to propose a comparative evaluation of three specific methods
based on a client voice extracted from the Internet, and to open a perspective
on the reversibility of voice disguise for investigators.
Nowadays, it is easy to collect speech or video material of someone from the
Internet, especially for well-known people such as politicians. For our study,
we collected a speech by the French president from the Internet and trained a
conversion function to imitate his voice. Fortunately, the converted signal is
not of high quality: it has low intelligibility, sounds unnatural, and contains
many artifacts and much noise. However, according to three different measures
(spectral distortion, likelihood ratio, and a perceptual test), the converted
source voice is closer to the target than the original source voice is, and it
can deceive automatic speaker verification systems.
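As an illustration of the first of these measures, the following is a minimal sketch of one common form of spectral distortion: the root-mean-square difference, in dB, between two power spectra. The abstract does not say which formulation the study used, so the function name `log_spectral_distortion`, the `floor` parameter, and the toy spectra are assumptions made for illustration only, not data or code from the study.

```python
import math


def log_spectral_distortion(power_a, power_b, floor=1e-12):
    """RMS difference in dB between two power spectra of equal length.

    Hypothetical helper: one common definition of spectral distortion,
    not necessarily the one used in the study.
    """
    if len(power_a) != len(power_b) or not power_a:
        raise ValueError("spectra must be non-empty and of equal length")
    squared_diffs = []
    for pa, pb in zip(power_a, power_b):
        # Clamp to a small floor to avoid log10(0) on empty bins.
        db_a = 10.0 * math.log10(max(pa, floor))
        db_b = 10.0 * math.log10(max(pb, floor))
        squared_diffs.append((db_a - db_b) ** 2)
    return math.sqrt(sum(squared_diffs) / len(squared_diffs))


# Toy power spectra for a single frame (made-up values, for illustration).
target = [1.0, 0.5, 0.25, 0.125]
source = [0.1, 0.8, 0.05, 0.3]
converted = [0.8, 0.5, 0.2, 0.15]

d_source = log_spectral_distortion(source, target)
d_converted = log_spectral_distortion(converted, target)

# A lower distortion means the spectrum is closer to the target.
print(d_converted < d_source)
```

In practice such a distance would be averaged over many time-aligned frames of real speech; the single-frame toy values here only show how "closer to the target" can be expressed as a number.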
Although this automatic voice conversion technique is a potential threat, it
is also a potential forensic tool for inverting voice disguise. At the current
level of knowledge, the goal is not to perform speaker recognition on the
"cleaned" voice, but to provide better intelligibility or clues as to whether
the voice is genuine.