SenseTime wants to create a "Super Moment"
AI Racing
Author | Liu Baodan
Editor | Zhou Zhiyu
The skyrocketing popularity of ChatGPT has shown the world the huge potential of large AI models. After more than a year of technological catch-up, China's large-model companies are now all betting on the application side.
However, creating a truly influential product is not an easy task.
At the 2024 World Artificial Intelligence Conference, SenseTime CEO Xu Li cautiously pointed out, "Despite the surging trend, we are still far from that truly industry-shaking 'super moment'." He emphasized that AI has not yet fully penetrated into the marrow of various industries, nor has it sparked broad and profound waves of change in society.
Based on this clear understanding, SenseTime is focusing on the performance of large models themselves.
On July 5th, at the "Boundless Love, Towards New Power" Artificial Intelligence Forum, SenseTime released "RiRiXin 5o", the first domestically available "what you see is what you get" model, delivering an interactive experience comparable to GPT-4o.
Specifically, "RiRiXin 50" integrates cross-modal information, based on various forms such as sound, text, images, and videos, bringing a new AI interaction mode, namely real-time streaming multimodal interaction.
Regarding the naming of this release, Lu Lewei, research and development director at SenseTime Research Institute, told Wall Street News that this version introduces many cutting-edge capabilities that can now rival GPT-4o, yet its name was kept deliberately conservative: V6 is reserved for a bigger plan that will bring a more comprehensive and fundamental upgrade.
Innovative Interaction Mode
On site, SenseTime demonstrated the capabilities of "RiRiXin 5o":
To start, when a staff member greeted "RiRiXin 5o", it automatically recognized the text on the lanyard of the staff member's badge, determined that the scene was the World Artificial Intelligence Conference venue, and remarked that it was a place to "study well".
Next, when the staff member picked up a cute puppy doll, "RiRiXin 5o" accurately described the puppy's appearance, expression, and key attire: a white cap bearing the SenseTime logo, quite befitting a host.
Going a step further, the staff member flipped to a random page of a book, and "RiRiXin 5o" introduced it automatically: not mere OCR text recognition, but reading both the graphics and the text to produce an easy-to-understand summary, all in an instant, achieving genuinely real-time interaction.
The staff member also showed off some "drawing skills" on the spot, casually sketching a simple bunny. "RiRiXin 5o" immediately exclaimed that it was cute. The staff member then drew a smiling mouth, and the model picked up the smile in the previously calm expression. With one more stroke to enlarge the mouth and add a tongue, "RiRiXin 5o" immediately noticed that the expression looked even happier.
The dialogue with "RiRiXin 5o" feels like chatting with a real person. According to SenseTime, this interaction mode is particularly well suited to real-time conversation and voice-recognition applications. The ability to deliver an interactive experience comparable to GPT-4o stems from the across-the-board improvement in the base-model capabilities of "RiRiXin 5.5".
Upcoming Plans
In April of this year, SenseTime released "RiRiXin 5.0", the first domestic large model to benchmark against GPT-4 Turbo, sparking a frenzy in the capital markets.
In just over two months, the brand-new "RiRiXin 5.5" system has received multiple upgrades, with overall performance improving by an average of 30% over "RiRiXin 5.0". Its mathematical reasoning, English proficiency, and instruction following have strengthened significantly, and its interaction quality and several core metrics now align with GPT-4o.
Lu Lewei said that, in terms of technical research, the 5.5 release was not the work of just the past few months; it consolidates the methodology SenseTime has built for native multimodal research since the end of last year. "This direction happens to match what the 'o' in GPT-4o actually stands for. We anticipated this trend very early on, and a dedicated technical team has been working on it."
"It can cover the knowledge brought by multiple modalities during the training process and then integrate them with each other, which greatly helps improve the performance accuracy of the algorithm." Lu Lewei further emphasized that this native multimodal integration includes audio, video, and even the earliest images, all fully integrated into one model from the input encoder to the output decoder.
Furthermore, "RiRiXin 5.5" adopts a hybrid edge-cloud collaborative expert architecture to maximize edge-cloud collaboration, reduce inference costs, and train models based on over 10TB tokens of high-quality training data, including a large amount of synthesized chain-of-thought data to enhance reasoning abilities.
Regarding plans for the next version, Lu Lewei said this update is already quite substantial and, by convention, could have been called V6; but V6 is being reserved for a larger plan, one able to carry a more comprehensive and fundamental upgrade.
"We are first conservatively promoting the release of version 5.5, hoping to build anticipation. When V6 is released, it will bring a more comprehensive upgrade."