Why does Apple want to use "small models"?

Apple introduced Apple Intelligence, a personal intelligence system, at WWDC 2024. Unlike other tech companies, Apple focuses more on user experience and customization. The system consists of multiple generative models that adapt to users' current activities and are specialized for tasks such as writing text, summarization, and notification prioritization. Apple prefers to run these tasks on small on-device models, while also letting users choose third-party services. Apple trains the models with its AXLearn framework and uses the web crawler Applebot to collect publicly available data.

At WWDC 2024, Apple redefined AI as Apple Intelligence.

It is a personal intelligence system deeply integrated into iOS 18, iPadOS 18, and macOS Sequoia.

Unlike other tech giants, Apple's approach to AI is not about "bigger is better".

Instead, Apple's attitude is more practical, prioritizing user experience and emphasizing the customization of AI models.

Integrating generative AI seamlessly into the operating system is, in a sense, a very "Apple" approach.

Apple Intelligence consists of multiple highly capable generative models that are specialized for users' everyday tasks and can adapt to their current activity in real time.

The foundation models within Apple Intelligence are fine-tuned for user experiences such as writing and refining text, summarizing, prioritizing notifications, creating playful images for conversations, and simplifying interactions across apps.

Apple tends to handle these tasks with small on-device models, although users can choose to use third-party services like ChatGPT, in which case Apple is not responsible for the data.

Apple highlighted two models in particular: an on-device language model with about 3 billion parameters, and a larger server-based language model that runs on Apple silicon servers through Private Cloud Compute.

Keep Small

Apple's foundational models are trained on the AXLearn framework.

AXLearn is an open-source project released by Apple in 2023. It is built on JAX and XLA and lets Apple train models efficiently and scalably on various training hardware and cloud platforms, including TPUs and both cloud and on-premise GPUs.

Apple combines data parallelism, tensor parallelism, sequence parallelism, and Fully Sharded Data Parallel (FSDP) to scale training along multiple dimensions such as data, model size, and sequence length.

Apple uses the web crawler Applebot to collect publicly available data. Web publishers who do not want their content used to train Apple Intelligence can opt out through the data usage controls Apple provides.

Apple stated that it never uses users' private personal data or user interactions when training its foundation models, and that it applies filters to remove personally identifiable information (such as social security and credit card numbers) that is publicly available on the internet.
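Apple has not published its filters, so the following is only a minimal sketch of what such a filter can look like. The regular expressions, the Luhn checksum test, and the placeholder strings are all assumptions chosen for illustration.

```python
import re

# Hypothetical patterns for illustration; production PII filters are far more thorough.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_ok(digits):
    """Return True if the digit string passes the Luhn checksum used by card numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_pii(text):
    """Replace SSN-like strings and Luhn-valid card-like numbers with placeholders."""
    text = SSN_RE.sub("[SSN]", text)

    def _card(match):
        digits = re.sub(r"\D", "", match.group())
        return "[CARD]" if luhn_ok(digits) else match.group()

    return CARD_RE.sub(_card, text)


print(redact_pii("SSN 123-45-6789 and card 4111 1111 1111 1111."))
```

The checksum step keeps the filter from redacting arbitrary long digit runs such as order numbers, at the cost of missing mistyped card numbers.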

In addition to filtering, Apple performs data extraction and deduplication, and applies a model-based classifier to identify high-quality documents.
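As a rough illustration of the deduplication step, the sketch below removes exact duplicates after light normalization. The normalization rule (lowercasing and collapsing whitespace) and the use of SHA-256 are assumptions; the source does not say how Apple deduplicates, and real pipelines usually add near-duplicate detection as well.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def deduplicate(docs):
    """Keep the first occurrence of each document, comparing hashes of normalized text."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


print(deduplicate(["Hello  World", "hello world", "Other"]))
```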

Post-training

Apple uses a hybrid data strategy in the training pipeline, combining human-annotated and synthetic data, and applies thorough data curation and filtering procedures.

Apple has developed two novel algorithms for the post-training stage:

A rejection sampling fine-tuning algorithm;

A reinforcement learning from human feedback (RLHF) algorithm that uses mirror descent policy optimization and a leave-one-out advantage estimator.

According to Apple, these two algorithms significantly improve the models' instruction-following quality.
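The leave-one-out advantage estimator can be sketched in its generic form: for several sampled responses to the same prompt, each response's baseline is the average reward of the other responses. This is the textbook formulation, not Apple's code, and the reward values below are made up.

```python
def leave_one_out_advantages(rewards):
    """For each sample, subtract the mean reward of the other samples for the same prompt."""
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Three hypothetical responses to one prompt, scored by a reward model.
print(leave_one_out_advantages([1.0, 2.0, 3.0]))  # [-1.5, 0.0, 1.5]
```

Because the baseline for a sample never includes that sample's own reward, the estimate stays unbiased without training a separate value model.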

Besides making sure the generative models are highly capable, Apple uses a range of techniques to optimize them for speed and efficiency on devices and in its private cloud.

Both the on-device and server models use grouped-query attention to optimize inference performance.
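In grouped-query attention, several query heads share one key/value head, which shrinks the key-value cache. The sketch below shows the head mapping and the resulting cache size. The head counts, layer count, head dimension, and sequence length are hypothetical, since the source does not give Apple's configuration.

```python
def kv_group_for_head(query_head, num_query_heads, num_kv_heads):
    """Map a query head to the key/value head it shares under grouped-query attention."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV heads")
    heads_per_group = num_query_heads // num_kv_heads
    return query_head // heads_per_group


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key cache plus the value cache for one sequence."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical shapes: 24 query heads, compared with 24 versus 8 KV heads.
full = kv_cache_bytes(num_layers=28, num_kv_heads=24, head_dim=128, seq_len=4096)
grouped = kv_cache_bytes(num_layers=28, num_kv_heads=8, head_dim=128, seq_len=4096)
print(kv_group_for_head(5, 24, 8), full // grouped)  # 1 3
```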

Apple uses shared input and output vocabulary embedding tables to reduce memory requirements and inference cost. The shared embedding tensors are mapped without duplication.

The device-side model uses a vocabulary size of 49K, while the server-side model uses a vocabulary size of 100K.
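A quick calculation shows why sharing the embedding table matters for a small model. The hidden size of 3072 is a hypothetical value, and 49K is approximated as 49,000; neither figure is confirmed by the source.

```python
def embedding_params(vocab_size, hidden_size, tied):
    """Parameters in the input embedding plus the output projection."""
    table = vocab_size * hidden_size
    return table if tied else 2 * table


# Hypothetical hidden size; "49K" approximated as 49,000 tokens.
untied = embedding_params(49_000, 3072, tied=False)
tied = embedding_params(49_000, 3072, tied=True)
print(untied - tied)  # 150528000 parameters saved by sharing the table
```

Under these assumptions the saving is about 150 million parameters, a noticeable share of a model with roughly 3 billion parameters.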

For on-device inference, Apple uses low-bit palettization to meet the necessary memory, power, and performance requirements.
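This technique stores each weight as a small index into a shared lookup table of values. A minimal sketch, assuming a plain one-dimensional k-means clustering with evenly spaced starting centroids (Apple's actual procedure is not described in the source):

```python
def palettize(weights, bits=2, iterations=10):
    """Cluster weights into 2**bits shared values; return (palette, indices)."""
    k = 2 ** bits
    lo, hi = min(weights), max(weights)
    # Start from evenly spaced centroids across the weight range.
    palette = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    indices = [0] * len(weights)
    for _ in range(iterations):
        # Assignment step: nearest palette entry for every weight.
        indices = [min(range(k), key=lambda j: abs(w - palette[j])) for w in weights]
        # Update step: move each entry to the mean of its assigned weights.
        for j in range(k):
            members = [w for w, idx in zip(weights, indices) if idx == j]
            if members:
                palette[j] = sum(members) / len(members)
    return palette, indices


def depalettize(palette, indices):
    """Rebuild approximate weights by looking each index up in the palette."""
    return [palette[i] for i in indices]


weights = [-1.0, -0.9, 0.0, 0.1, 1.0, 1.1, 2.0, 2.1]
palette, indices = palettize(weights, bits=2)
print(indices)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

With 2 bits, each weight costs only two bits of storage plus a share of the four-entry table, and inference reads the real value back through the lookup.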

To maintain model quality, Apple developed a new framework that uses LoRA adapters together with a mixed 2-bit and 4-bit configuration strategy (averaging 3.5 bits per weight), which Apple says achieves the same accuracy as the uncompressed models.
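The 3.5-bit average implies a particular mix of 2-bit and 4-bit weights, which can be worked out directly. The sketch assumes only those two bit widths are used and ignores lookup-table overhead; the 3 billion parameter count is the rounded figure from the text.

```python
def high_bit_fraction(target_avg_bits, low_bits=2, high_bits=4):
    """Fraction of weights stored at high_bits so the average equals target_avg_bits."""
    return (target_avg_bits - low_bits) / (high_bits - low_bits)


def weight_bytes(num_params, avg_bits):
    """Storage for the weights alone, ignoring lookup tables and metadata."""
    return num_params * avg_bits / 8


print(high_bit_fraction(3.5))                 # 0.75
print(weight_bytes(3_000_000_000, 3.5))       # 1312500000.0
print(weight_bytes(3_000_000_000, 16))        # 6000000000.0
```

Under these assumptions three quarters of the weights sit at 4 bits, and the weights shrink from about 6 GB at 16 bits to about 1.3 GB.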

Apple also uses Talaria, an interactive model latency and power analysis tool, to guide bit-rate selection for each operation.

Apple also uses activation quantization and embedding quantization, and has developed an approach for efficient key-value (KV) cache updates on the Apple Neural Engine. With these optimizations, the iPhone 15 Pro reaches a time-to-first-token latency of about 0.6 milliseconds per prompt token and a generation rate of 30 tokens per second.
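These two figures can be turned into a rough end-to-end estimate. The sketch assumes the 0.6 millisecond figure is a per-prompt-token cost before the first output token appears, as in Apple's published description, and that generation then proceeds at a steady rate. The prompt and output lengths are hypothetical.

```python
def estimated_response_seconds(prompt_tokens, output_tokens,
                               ms_per_prompt_token=0.6, tokens_per_second=30.0):
    """Rough response time: prompt processing plus steady-rate generation."""
    time_to_first_token = prompt_tokens * ms_per_prompt_token / 1000.0
    generation_time = output_tokens / tokens_per_second
    return time_to_first_token + generation_time


# A hypothetical 1,000-token prompt that produces a 60-token summary.
print(round(estimated_response_seconds(1000, 60), 2))
```

For this example the estimate comes to roughly 2.6 seconds, most of it spent generating output rather than reading the prompt.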

Adapter

Apple's foundation models are fine-tuned for users' everyday activities and can dynamically specialize themselves for the task at hand.

The approach involves inserting small neural networks as modules (adapters) into various layers of the pre-trained model to achieve fine-tuning for specific tasks.

Specifically, Apple adapts the attention matrices, the attention projection matrix, and the fully connected layers of the feedforward networks in a suitable set of the Transformer's decoding layers.

By fine-tuning only the adapter layers, the original parameters of the base pre-trained model remain unchanged, preserving the model's general knowledge while supporting specific tasks.
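A minimal sketch of how a low-rank (LoRA-style) adapter leaves the base weights untouched: the adapted layer computes the frozen output plus a small trainable correction. The matrices, the scaling factor, and the function names are illustrative assumptions, not Apple's implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute y = W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    rank = len(A)                      # A has shape (r, d_in), B has shape (d_out, r)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


W = [[1, 0], [0, 1]]      # frozen base weights (identity, for illustration)
A = [[1, 1]]              # rank-1 down-projection
B = [[1], [2]]            # rank-1 up-projection
print(lora_forward(W, A, B, [1, 2]))  # [4.0, 8.0]
```

Setting B to zeros reproduces the base model exactly, which is why removing or swapping an adapter restores the original behavior.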

Apple Intelligence includes a wide range of adapters, which is an effective way to expand the functionality of the base model.

Apple represents the values of the adapter parameters using 16 bits. For the on-device model with about 3 billion parameters, the parameters of a rank-16 adapter typically require tens of megabytes.
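The size of such an adapter can be estimated with simple arithmetic, assuming the "16" refers to the adapter rank as in LoRA. The layer count and matrix shapes below are hypothetical, picked only to show how a result in the tens of megabytes can arise.

```python
def lora_params(d_in, d_out, rank):
    """A rank-r adapter adds an (r x d_in) matrix and a (d_out x r) matrix."""
    return rank * (d_in + d_out)


def adapter_bytes(num_layers, matrix_shapes, rank=16, bytes_per_param=2):
    """Total adapter size when every listed matrix in every layer gets an adapter."""
    per_layer = sum(lora_params(d_in, d_out, rank) for d_in, d_out in matrix_shapes)
    return num_layers * per_layer * bytes_per_param


# Hypothetical model: 28 layers, four 3072x3072 attention matrices
# and two feedforward matrices between widths 3072 and 8192.
shapes = [(3072, 3072)] * 4 + [(3072, 8192), (8192, 3072)]
print(adapter_bytes(28, shapes) / 1e6)  # size in megabytes
```

With these made-up shapes the adapter comes to about 42 MB, a small fraction of the base model's size.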

Adapter models can be dynamically loaded, temporarily cached in memory, and swapped, ensuring the responsiveness of the operating system.
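One common way to get this load, cache, and swap behavior is a least-recently-used cache. The class below is a sketch under that assumption; the `loader` callable and the adapter names are hypothetical, and the source does not describe Apple's actual caching policy.

```python
from collections import OrderedDict


class AdapterCache:
    """Keep at most `capacity` adapters in memory, evicting the least recently used."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader               # hypothetical callable: name -> adapter weights
        self._cache = OrderedDict()

    def get(self, name):
        if name in self._cache:
            self._cache.move_to_end(name)  # mark as most recently used
            return self._cache[name]
        adapter = self.loader(name)        # load on demand
        self._cache[name] = adapter
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict the least recently used adapter
        return adapter

    def cached_names(self):
        return list(self._cache)


cache = AdapterCache(capacity=2, loader=lambda name: {"name": name})
for task in ["summarization", "proofreading", "summarization", "mail_reply"]:
    cache.get(task)
print(cache.cached_names())  # ['summarization', 'mail_reply']
```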

As user experience is the highest priority, Apple focuses on human evaluation when benchmarking models.

Summarization

The training data for the summarization adapter is based on synthetic summaries generated by the larger server model and filtered through a rejection sampling strategy that keeps only high-quality summaries.
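The filtering idea can be sketched as follows: score every candidate summary and keep only those above a threshold. The keyword-coverage scorer is a toy stand-in chosen for illustration; the source does not say how Apple scores summaries.

```python
def keyword_coverage(summary, keywords):
    """Toy quality score: fraction of required keywords present in the summary."""
    words = set(summary.lower().split())
    return sum(k.lower() in words for k in keywords) / len(keywords)


def rejection_sample(candidates, score_fn, threshold):
    """Keep only candidates whose score meets the threshold; reject the rest."""
    return [c for c in candidates if score_fn(c) >= threshold]


candidates = ["meeting moved to friday", "see you soon"]
kept = rejection_sample(
    candidates,
    score_fn=lambda s: keyword_coverage(s, ["meeting", "friday"]),
    threshold=1.0,
)
print(kept)  # ['meeting moved to friday']
```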

To evaluate product-specific summarization, Apple uses a set of 750 responses carefully sampled for each use case.

The evaluation dataset covers the variety of inputs that Apple's product features may encounter in production, including a stratified mixture of single and stacked documents of different content types and lengths.

Furthermore, evaluating summary functions also considers some inherent risks, such as occasional omission of important details by the model.

Based on the graders' scores across five dimensions, each summary is classified as good, neutral, or poor.

Experimental results show that the models with adapters generate better summaries than a comparable model.

Moreover, in over 99% of targeted adversarial examples, the summary adapter does not amplify sensitive content.

Base Functions

To evaluate the general capabilities of the on-device and server models, Apple uses a comprehensive set of real-world prompts.

These prompts vary in difficulty and cover major categories such as brainstorming, classification, closed question answering, coding, extraction, mathematical reasoning, open question answering, rewriting, safety, summarization, and writing. Apple compares its models with open-source models (Phi-3, Gemma, Mistral, DBRX) and commercial models of comparable size (GPT-3.5-Turbo, GPT-4-Turbo).

Experiments show that compared to most competitors, Apple's model is more favored by human evaluators.
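Human preference results of this kind are usually reported as win, tie, and loss rates from side-by-side comparisons. A minimal sketch of that bookkeeping, with made-up judgments and label names that are assumptions rather than Apple's protocol:

```python
from collections import Counter

LABELS = ("win", "tie", "loss")


def preference_rates(judgments):
    """Turn pairwise judgments ('win', 'tie', 'loss') into rates that sum to 1."""
    counts = Counter(judgments)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments given")
    return {label: counts[label] / total for label in LABELS}


# Hypothetical grader verdicts for one model against one competitor.
print(preference_rates(["win", "win", "tie", "loss"]))
```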

Apple's roughly 3B-parameter on-device model outperforms larger models such as Phi-3-mini, Mistral-7B, and Gemma-7B, while Apple's server model compares favorably with DBRX-Instruct, Mixtral-8x22B, and GPT-3.5-Turbo and is highly efficient.

Safety

Apple uses a set of different adversarial prompts to test the model's performance on harmful content, sensitive topics, and factual accuracy.

The violation rate of each model is measured, also through human evaluation:

The above figures show the comparison with competitors on safety prompts. Human evaluators found Apple's responses to be safer and more helpful.

Instruction Following

To further evaluate the models, Apple also uses the Instruction-Following Eval (IFEval) benchmark to compare their instruction-following capabilities with those of models of comparable size.

Results indicate that both Apple's on-device and server models follow detailed instructions better than open-source and commercial models of comparable size.
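IFEval relies on instructions that can be verified by a program, such as a minimum word count or a ban on commas. The sketch below follows that idea; the checker functions and the scoring helper are hypothetical illustrations, not IFEval's actual code.

```python
def check_min_words(response, n):
    """Instruction: answer with at least n words."""
    return len(response.split()) >= n


def check_no_commas(response):
    """Instruction: do not use any commas."""
    return "," not in response


def instruction_following_accuracy(samples):
    """Fraction of responses that satisfy every instruction attached to their prompt."""
    if not samples:
        raise ValueError("no samples given")
    passed = sum(all(check(response) for check in checks) for response, checks in samples)
    return passed / len(samples)


samples = [
    ("Hello there general Kenobi", [lambda r: check_min_words(r, 3), check_no_commas]),
    ("Hi, all", [check_no_commas]),
]
print(instruction_following_accuracy(samples))  # 0.5
```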

Lastly, the models' writing ability is assessed on internal summarization and composition benchmarks made up of a variety of writing prompts. These results do not involve the task-specific adapters.

Reference:

https://machinelearning.apple.com/research/introducing-apple-foundation-models