nvidia/personaplex-7b-v1 · Hugging Face

# PersonaPlex: Voice and role control for full duplex conversational speech models

➡️ **Code:** [nvidia/personaplex](https://github.com/NVIDIA/personaplex)

➡️ **Demo:** [PersonaPlex Project Page](https://research.nvidia.com/labs/adlr/personaplex/)

➡️ **Paper:** [PersonaPlex Preprint](https://arxiv.org/abs/2602.06053)

### Description:

PersonaPlex is a real-time speech-to-speech conversational model that jointly performs streaming speech understanding and speech generation. The model operates on continuous audio encoded with a neural codec and predicts both text tokens and audio tokens autoregressively to produce its spoken responses. Incoming user audio is incrementally encoded and fed to the model while PersonaPlex simultaneously generates its own outgoing speech, enabling natural conversational dynamics such as interruptions, barge-ins, overlaps, and rapid turn-taking.

PersonaPlex runs in a dual-stream configuration in which listening and speaking occur concurrently. This design allows the model to update its internal state based on the user’s ongoing speech while still producing fluent output audio, supporting highly interactive conversations.

Before the conversation begins, PersonaPlex is conditioned on two prompts: a voice prompt and a text prompt.
The voice prompt consists of a sequence of audio tokens that establish the target vocal characteristics and speaking style. The text prompt specifies persona attributes such as role, background, and scenario context. Together, these prompts define the model's conversational identity and guide its linguistic and acoustic behavior throughout the interaction.

This model is ready for commercial use.

## Explore more from NVIDIA:

For documentation, deployment guides, enterprise-ready APIs, and the latest open models (including Nemotron and other cutting-edge speech, translation, and generative AI models), visit the NVIDIA Developer Portal at [developer.nvidia.com](https://developer.nvidia.com/). Join the community to access tools, support, and resources to accelerate your development with NVIDIA's NeMo, Riva, NIM, and foundation models.

- What is [Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/)?
- NVIDIA Developer [Nemotron](https://developer.nvidia.com/nemotron)
- [NVIDIA Riva Speech](https://developer.nvidia.com/riva?sortBy=developer_learning_library%2Fsort%2Ffeatured_in.riva%3Adesc%2Ctitle%3Aasc#demos)
- [NeMo Documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html)

### License/Terms of Use:

GOVERNING TERMS: Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). ADDITIONAL INFORMATION: [CC-BY-4.0](https://huggingface.co/kyutai/moshiko-pytorch-bf16).

### Use Case:

Wherever NVIDIA’s speech-to-speech conversational models are used, PersonaPlex can generate English speech responses to English speech input.
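
As a rough illustration of the dual-stream loop described under Description, the Python sketch below simulates a model that is conditioned once on a voice prompt and a text prompt, then consumes one incoming user frame and emits one (text token, audio token) pair per step. `ToyDuplexModel`, `step`, and `run_conversation` are hypothetical names invented for this example and are not part of the PersonaPlex or Moshi code. The integer "frames" and the arithmetic are placeholders for real codec tokens and autoregressive prediction.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyDuplexModel:
    """Hypothetical stand-in for a full-duplex model; NOT the PersonaPlex API."""

    voice_prompt: List[int]  # audio tokens that set the voice (assumed representation)
    text_prompt: str         # persona description: role, background, scenario
    history: List[Tuple[int, int]] = field(default_factory=list)

    def step(self, user_frame: int) -> Tuple[str, int]:
        """Consume one incoming user frame and emit one outgoing frame.

        Returns a (text_token, audio_token) pair. The arithmetic below is a
        placeholder for real autoregressive prediction.
        """
        # Both prompts and everything heard so far influence the next output.
        state = (
            sum(self.voice_prompt)
            + len(self.text_prompt)
            + sum(user for user, _ in self.history)
        )
        agent_audio = (state + user_frame) % 256
        text_token = f"tok{len(self.history)}"
        self.history.append((user_frame, agent_audio))
        return text_token, agent_audio


def run_conversation(model: ToyDuplexModel, user_frames: List[int]) -> List[Tuple[str, int]]:
    """Listen and speak in lockstep: one output frame per input frame."""
    outputs = []
    for frame in user_frames:
        # The model never waits for the user's turn to end before responding.
        outputs.append(model.step(frame))
    return outputs


model = ToyDuplexModel(
    voice_prompt=[3, 5, 7],
    text_prompt="You are a helpful bank teller.",
)
outputs = run_conversation(model, user_frames=[10, 20, 30])
```

The point of the sketch is the control flow: prompts are fixed before the first frame, and every step both reads user audio and writes agent audio, which is what allows overlaps and interruptions to be handled without explicit turn boundaries.
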
### Deployment Geography:

Global

### Release Date:

- Hugging Face: 01/15/2026 via [https://huggingface.co/nvidia/personaplex-7b-v1](https://huggingface.co/nvidia/personaplex-7b-v1)
- GitHub: 01/15/2026 via [https://github.com/NVIDIA/personaplex](https://github.com/NVIDIA/personaplex)

## Model Architecture:

**Architecture Type:** Transformer

**Network Architecture:** [Moshi](https://github.com/kyutai-labs/moshi)

Moshi uses:

- Mimi Speech Encoder (ConvNet, Transformer)
- Moshi Temporal Transformer + Depth Transformer
- Mimi Speech Decoder (Transformer, ConvNet)

This model was developed based on [Moshi (Moshiko weights)](https://huggingface.co/kyutai/moshiko-pytorch-bf16).

**Number of model parameters:** 7B

## Input(s):

**Input Type(s):** Text (prompt), Audio (user speech)

**Input Format:** String, WAV/WebAudio

**Input Parameters:** One-Dimensional (1D)

**Other Properties Related to Input:** 24kHz sample rate for audio.

## Output(s)

**Output Type(s):** Text (agent text), Audio (agent speech)

**Output Format:** String, WAV/WebAudio

**Output Parameters:** One-Dimensional (1D)

**Other Properties Related to Output:** 24kHz sample rate for audio.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and softw
