CerenceAI China R&D is seeking a Research Engineer to design and implement next-generation text-to-speech (TTS) systems and applications. In this role, collaborating with team members around the globe, you will work on both the frontend (grapheme-to-phoneme conversion, text normalization, phrasing and prosodic control, etc.) and backend (acoustic modeling, neural vocoding) components of engine pipelines and end-to-end systems for major languages and dialects. Your work will directly impact CerenceAI products ranging from voice assistants to accessibility tools, delivering natural, expressive, and multilingual speech synthesis.

Job Description
Representative responsibilities/duties include, but are not limited to:
• Design and optimize text/NLP preprocessing pipelines using deep learning or machine learning methods, including grapheme-to-phoneme (G2P) conversion for multilingual support, text normalization, polyphone disambiguation, and prosody prediction and control
• Integrate language models (e.g., BERT, GPT variants) to improve contextual and semantic understanding for natural intonation
• Develop rule-based and neural solutions for emotion/style control in synthesized speech
• Build state-of-the-art acoustic models (e.g., Tacotron, FastSpeech, VITS) to map linguistic features to spectrograms or waveform parameters
• Optimize neural vocoders (e.g., WaveNet, HiFi-GAN, MelGAN, LPCNet) for high-fidelity, real-time speech synthesis
• Optimize inference latency for both edge devices and cloud platforms
• Enhance robustness through noise suppression, speaker adaptation, and multilingual/cross-language/cross-gender voice cloning

Knowledge, skills, and qualifications
Education: Master's degree in CS, AI, EE, Math, or a related field.
Required/preferred skills:
• Hands-on experience in speech generation system development, with deep expertise in both frontend and backend components
• Proficiency in C/C++ and Python, with mastery of ML frameworks (PyTorch, TensorFlow, etc.)
• Background in NLP techniques and/or speech signal processing is welcome
• Knowledge of transformer-based language models for prosody prediction
• Basic understanding of autoregressive and non-autoregressive acoustic models and neural vocoders
• Experience optimizing models via quantization, pruning, or knowledge distillation
• Experience with ONNX Runtime, TensorRT, TorchScript, etc.
• Experience with zero-shot/one-shot/few-shot voice cloning or emotional TTS systems
• Skilled GPU/TPU cluster and grid user
• Fluent English is a must
Monthly salary: 90,000–120,000 yuan
(Fixed or variable pay varies with individual experience or performance.) No restriction
Not specified
Two-day weekends; statutory annual leave plus additional annual leave; group insurance; annual health checkup reimbursement; EAP services