Songshan District, Taipei City · 1+ years of experience · Master's degree
Cerence AI China RD is seeking a Research Engineer to design and implement next-generation text-to-speech systems and applications. In this role, collaborating with team members around the globe, you will work on both the frontend (grapheme-to-phoneme conversion, text normalization, phrasing and prosodic control, etc.) and the backend (acoustic modeling, neural vocoding) components of engine pipelines and end-to-end systems for major languages and dialects. Your work will directly impact Cerence AI products ranging from voice assistants to accessibility tools, delivering natural, expressive, and multilingual speech synthesis.
Job Description
Representative responsibilities/duties will include, but are not limited to:
• Design and optimize text/NLP preprocessing pipelines with deep learning or machine learning methods, including grapheme-to-phoneme (G2P) conversion for multilingual support, text normalization, polyphone disambiguation, and prosody prediction and control (see the frontend sketch after this list)
• Integrate language models (e.g., BERT, GPT variants) to improve contextual and semantic understanding for natural intonation
• Develop rule-based and neural solutions for emotion/style control in synthesized speech
• Build state-of-the-art acoustic models (e.g., Tacotron, FastSpeech, VITS) to map linguistic features to spectrograms or waveform parameters
• Optimize neural vocoders (e.g., WaveNet, HiFi-GAN, MelGAN, LPCNet) for high-fidelity, real-time speech synthesis
• Optimize inference latencies for both edge devices and cloud platforms
• Enhance robustness through noise suppression, speaker adaptation, and multilingual/cross-language/cross-gender voice cloning
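To give a flavor of the frontend work involved, here is a minimal, illustrative Python sketch of a text-preprocessing pipeline: a normalization step that expands a non-standard word class, followed by lexicon-based G2P. The toy lexicon, the single currency rule, and the letter-name fallback are hypothetical simplifications for illustration only; production systems use far richer rule sets and learned G2P models.

```python
import re

# Toy pronunciation lexicon (hypothetical); real systems combine large
# lexicons with a learned G2P model for out-of-vocabulary words.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
    "two": ["T", "UW"],
    "dollars": ["D", "AA", "L", "ER", "Z"],
}

def number_to_words(n: int) -> str:
    # Minimal stub covering single digits only.
    words = ["zero", "one", "two", "three", "four",
             "five", "six", "seven", "eight", "nine"]
    return words[n] if 0 <= n <= 9 else str(n)

def normalize(text: str) -> str:
    """Expand non-standard word classes into spoken form (one toy rule)."""
    # Currency: "$2" -> "two dollars"
    text = re.sub(r"\$(\d+)",
                  lambda m: number_to_words(int(m.group(1))) + " dollars",
                  text)
    return text.lower()

def g2p(word: str) -> list[str]:
    """Lexicon lookup with a naive letter-name fallback for OOV words."""
    return LEXICON.get(word, list(word.upper()))

def frontend(text: str) -> list[str]:
    phonemes = []
    for word in normalize(text).split():
        phonemes.extend(g2p(word.strip(".,!?")))
    return phonemes

print(frontend("Hello world, that costs $2!"))
# ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D', ...]
```

In a real engine, each stage (normalization, G2P, polyphone disambiguation, prosody prediction) would be a trained model or a hybrid of rules and neural networks rather than a lookup table.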
Knowledge, skills, and qualifications
Education: Master's degree in CS, AI, EE, Math, or a related field.
Required/preferred skills:
• Hands-on experience in generative speech system development, with deep expertise in both frontend and backend components
• Proficiency in C/C++ and Python, with mastery of ML frameworks (PyTorch, TensorFlow, etc.)
• Some background in NLP techniques and/or speech signal processing is welcome
• Knowledge of transformer-based language models for prosody prediction
• Basic understanding of autoregressive/non-autoregressive acoustic models and neural vocoders
• Experience in optimizing models via quantization, pruning, or knowledge distillation
• Experience with ONNX Runtime, TensorRT, TorchScript, etc. (see the deployment sketch after this list)
• Experience with zero-shot/one-shot/few-shot voice cloning or emotional TTS systems
• Skilled user of GPU/TPU clusters and grid computing environments
• Fluent English is a must-have
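As an illustration of the deployment-side optimization mentioned above (dynamic quantization plus TorchScript export), here is a minimal PyTorch sketch. The `ToyProsodyPredictor` module is a hypothetical stand-in, not a Cerence model; real pipelines would apply the same steps to full acoustic models or vocoders before validating latency on the target edge or cloud runtime.

```python
import torch
import torch.nn as nn

class ToyProsodyPredictor(nn.Module):
    """Hypothetical stand-in for a real prosody/acoustic model."""
    def __init__(self, in_dim: int = 80, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # e.g., per-frame duration or F0
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = ToyProsodyPredictor().eval()

# Post-training dynamic quantization: weights of Linear layers are stored
# in int8 and dequantized on the fly, shrinking the model and speeding up
# CPU inference at a small accuracy cost.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# TorchScript export via tracing, producing a self-contained artifact
# that can be loaded from C++ without a Python runtime.
example = torch.randn(1, 100, 80)  # (batch, frames, features)
scripted = torch.jit.trace(quantized, example)
scripted.save("prosody_predictor_int8.pt")

# Sanity check: the traced, quantized model still runs.
with torch.no_grad():
    print(scripted(example).shape)  # torch.Size([1, 100, 1])
```

Pruning and knowledge distillation follow the same spirit: shrink or transfer the model first, then export through a runtime such as ONNX Runtime or TensorRT for the target platform.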