A prelude to the AI iPhone? Apple publishes a paper on running large models within mobile memory limits
The new study makes it possible to run LLMs up to twice the size of a device's available memory and can raise GPU inference speed by 20-25 times. Media reports suggest Apple's plan to integrate generative AI into iOS 18 may be accelerated as a result.
AI iPhone is coming?
According to media reports, Apple recently published a paper introducing a method for running LLMs (large language models) whose size exceeds a device's available DRAM capacity.
By significantly improving memory utilization, this new research makes it possible to run LLMs up to twice the size of a device's available memory, while increasing GPU inference speed by 20-25 times.
The paper states that the intensive compute and memory requirements of LLMs pose a major challenge for DRAM capacity. It constructs a flash-based inference cost model and optimizes along two key axes: reducing the volume of data transferred from flash, and reading data in larger, more contiguous chunks.
This breakthrough research expands the applicability and accessibility of LLMs, and Apple's plan to integrate generative AI into iOS 18 may be accelerated.
Can an LLM run on a smartphone?
Generally speaking, DRAM is what we commonly call "memory" (RAM), while flash is the device's storage, playing the role a hard disk does in a PC.
When performing computation, the CPU typically uses DRAM as an "intermediate bridge": data is copied from storage into memory before it is processed, because DRAM is orders of magnitude faster to access than flash.
In terms of capacity, however, DRAM is at least an order of magnitude smaller than flash storage. Running an LLM requires keeping a large amount of data available at once, which poses a serious challenge for devices with limited memory.
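For illustration only, here is a minimal Python sketch of that idea, with a made-up file name and matrix shape: instead of copying an entire weight matrix into DRAM, a memory map leaves the data on flash and pages in only the rows that are actually accessed.

```python
import numpy as np

ROWS, COLS = 4_096, 4_096                  # made-up weight-matrix shape
PATH = "weights_fp16.bin"                  # illustrative file living on flash

# Write a dummy fp16 weight file once, standing in for real model weights.
np.zeros((ROWS, COLS), dtype=np.float16).tofile(PATH)

# Memory-map the file: the data stays on flash, and DRAM acts only as the
# "intermediate bridge" for the rows we actually touch.
weights = np.memmap(PATH, dtype=np.float16, mode="r", shape=(ROWS, COLS))

active_rows = [3, 17, 42]
chunk = np.asarray(weights[active_rows])   # copies just these 3 rows into DRAM
print(chunk.shape, chunk.nbytes, "bytes now resident in DRAM")
```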
This paper appears to offer a way around that constraint. The proposed framework stores the model parameters in flash and loads data into DRAM only when it is needed, so that an LLM larger than the available DRAM can still run.
Specifically, Apple mainly uses two techniques:
(1) "Windowing" technique: Reusing previously activated neurons to reduce data transmission;
(2) "Row-column bundling" technique: Customizing the order of accessing data blocks based on the characteristics of flash data, thereby increasing the size of data blocks read from flash.
The paper mentions that a model with 7 billion parameters requires more than 14GB of memory just to load its parameters in half-precision floating-point format, which exceeds the capacity of most edge devices. The framework, however, minimizes data transfer and maximizes flash throughput, reducing the data that must be loaded and improving memory utilization.
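That 14GB figure follows from simple arithmetic, shown here as a quick check:

```python
params = 7e9            # 7 billion parameters
bytes_per_param = 2     # half precision (fp16) uses 2 bytes per parameter
print(params * bytes_per_param / 1e9, "GB")   # 14.0 GB for the weights alone
```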
The results show that, compared with naive loading, the framework supports models up to twice the size of the available DRAM, and inference speed on the CPU and GPU increases by 4-5 times and 20-25 times, respectively. The research team concludes at the end of the paper:
"This breakthrough technology is particularly important for deploying advanced LLM in resource-limited environments, thereby expanding the applicability and accessibility of LLM."
Mobile Giants Target AI
In the AI era, major phone makers are all positioning themselves around "AI + smartphone".
According to earlier market reports, Apple will introduce AI in iOS 18, mainly for intelligent Q&A and text generation in Siri and messaging apps. Apple is also exploring AI's potential in applications such as Apple Music, Pages, Keynote, and Xcode.
Samsung launched its generative AI model, Samsung Gauss, in early November. It is reported that this model will be incorporated into the Galaxy S24 series of smartphones, which will be released in early 2024. Samsung's laptops and tablets may also integrate this model.
Google's large model Gemini will likewise be integrated into its products. In December, Google announced that Gemini 1.0 would roll out gradually across Google products: Gemini Pro is being integrated into Bard for more advanced reasoning and planning, while Gemini Nano powers features on the Pixel 8 Pro smartphone.