Microsoft introduces new high-speed voice and image models

Microsoft Corp. today introduced three artificial intelligence models optimized for image and audio processing.
The algorithms are available through Microsoft Foundry, an Azure service that developers can use to build AI applications. The tech giant has also started rolling the models out to several of its other products.
The first new algorithm, MAI-Image-2, can generate images with a resolution of up to 1024 by 1024 pixels based on user prompts. Each prompt may contain up to 32,000 tokens of text. Under the hood, MAI-Image-2 turns prompts into images using between 10 billion and 50 billion non-embedding parameters. Non-embedding parameters are the part of a model that handles content generation rather than initial data preparation.
Microsoft claims that MAI-Image-2 is at least twice as fast as its previous-generation image generator. The second new model released today, MAI-Transcribe-1, also brings significant speed improvements. It can transcribe speech 2.5 times faster than Microsoft's previous offering.
Another selling point of MAI-Transcribe-1 is its accuracy. Microsoft measured the model's average word error rate, a measure of transcription quality, across 25 languages. MAI-Transcribe-1 posted an error rate of 3.9%, which puts it ahead of Gemini 3.1 Flash and OpenAI Group PBC’s GPT-Transcribe. Another contributor to the model's accuracy is that it includes noise filtering features.
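For readers unfamiliar with the metric, word error rate is conventionally computed as the number of word-level substitutions, deletions and insertions needed to turn the model's transcript into a reference transcript, divided by the number of words in the reference. The sketch below shows that standard calculation in Python. It is an illustration only: the function names and sample sentences are hypothetical, and Microsoft has not published the exact text normalization or per-language averaging method it used.

```python
from statistics import mean


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mean_word_error_rate(pairs: list[tuple[str, str]]) -> float:
    """Unweighted mean of per-sample error rates (an assumed averaging scheme)."""
    return mean(word_error_rate(ref, hyp) for ref, hyp in pairs)


if __name__ == "__main__":
    # Hypothetical sample: one substituted word out of six.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition, a 3.9% error rate corresponds to roughly four word-level mistakes per 100 reference words.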
At launch, MAI-Transcribe-1 supports only batch transcription. That means the model can process only prerecorded files such as audiobooks. According to Microsoft, an upcoming update will add the ability to transcribe real-time audio streams. The company is also working on a so-called diarization feature that can break up a transcript into speaker-specific segments.
The third model introduced by Microsoft today is called MAI-Voice-1. As the name suggests, it is designed to generate synthetic speech from user-supplied text. Customers can choose one of the built-in AI voices or use their own voice.
Microsoft says all three models are priced competitively against rival offerings. MAI-Image-2 is priced at $5 per 1 million input tokens and $33 per 1 million output tokens. MAI-Transcribe-1 costs $0.36 per hour of transcribed speech, while MAI-Voice-1 starts at $22 per 1 million characters.
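To put those list prices in context, the following Python sketch estimates the cost of a workload from the figures quoted above. The constant and function names are hypothetical, the calculation assumes simple linear billing with no minimums, tiers or discounts, and the MAI-Voice-1 figure is only the starting price, so actual charges may differ.

```python
# List prices quoted in the article, in U.S. dollars.
IMAGE_USD_PER_MILLION_INPUT_TOKENS = 5.00
IMAGE_USD_PER_MILLION_OUTPUT_TOKENS = 33.00
TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36
VOICE_USD_PER_MILLION_CHARACTERS = 22.00  # "starts at" price; treated as flat here


def image_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated MAI-Image-2 cost, assuming linear per-token billing."""
    return (
        input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_INPUT_TOKENS
        + output_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_OUTPUT_TOKENS
    )


def transcription_cost(audio_hours: float) -> float:
    """Estimated MAI-Transcribe-1 cost for a given number of audio hours."""
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated MAI-Voice-1 cost at the starting price."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARACTERS


if __name__ == "__main__":
    # Hypothetical workload: 100 hours of audio and 500,000 characters of speech.
    print(f"Transcription: ${transcription_cost(100):.2f}")
    print(f"Voice: ${voice_cost(500_000):.2f}")
```

At these rates, transcribing 100 hours of audio would cost about $36, and synthesizing 500,000 characters of speech would cost about $11.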
The models are available not only through Microsoft Foundry but also through several other products. Microsoft is currently rolling out MAI-Image-2 to Bing and PowerPoint, while MAI-Voice-1 is accessible through an audio creation tool called Copilot Audio Expressions.
The tech giant has developed a line of custom AI chips called Maia to power its AI workloads. The newest addition to the series, the Maia 200, made its debut in late January. Microsoft says the three-nanometer chip outperforms other cloud providers’ custom AI chips across several benchmarks.
Image: Microsoft



