Introduction

In the ever-evolving world of machine learning and artificial intelligence, staying ahead of the curve is crucial. This is where the WasmEdge project, and in particular LlamaEdge, comes into play. The WasmEdge WASI-NN plugins make it possible to run LLMs such as Llama on edge devices like Apple M-series MacBooks. This led to my project: building a plugin that integrates the mlx.cpp library into the WASI-NN plugin ecosystem.

mlx.cpp is the C++ API of MLX, a machine learning framework developed by Apple ML Research. It is designed to provide a flexible and efficient platform for implementing and deploying machine learning models on Apple silicon, and the ability to extend its functionality through custom modules is what made this project possible.

The Challenge

The first task was to create a custom module, mlx_llm.cpp, that would allow MLX to load models such as Llama and Phi-3 through the C++ API of the MLX framework. This involved diving deep into the core architecture of MLX and understanding how its various components interact with each other. I also had to study how frontend APIs such as PyTorch's implement their C++ modules, in order to understand how weights are loaded from formats like GGUF and SafeTensors.
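To make this concrete, here is a minimal, hypothetical sketch of how a flat name-to-array weight map, of the kind produced by MLX's SafeTensors/GGUF loaders, can be turned into a typed layer; the `Linear`/`make_linear` names and the parameter-name convention are my own illustrations, not mlx_llm.cpp's actual API.

```cpp
// Hypothetical sketch: mapping a flat weight map into a layer object.
// The struct/function names here are illustrative, not mlx_llm.cpp's real API.
#include <string>
#include <unordered_map>

#include "mlx/mlx.h"

namespace mx = mlx::core;

// Flat name -> tensor map, as yielded by MLX's SafeTensors/GGUF loaders.
using Weights = std::unordered_map<std::string, mx::array>;

// A minimal linear layer that stores its parameters as MLX arrays.
struct Linear {
  mx::array weight;  // shape: [out_features, in_features], PyTorch-style layout
  mx::array bias;    // shape: [out_features]

  // y = x @ W^T + b; MLX records this lazily and evaluates on demand.
  mx::array forward(const mx::array &x) const {
    return mx::add(mx::matmul(x, mx::transpose(weight)), bias);
  }
};

// Builds one projection layer from the weight map. A real model repeats this
// mapping for every attention/MLP projection in every transformer block.
Linear make_linear(const Weights &w, const std::string &prefix) {
  return Linear{w.at(prefix + ".weight"), w.at(prefix + ".bias")};
}
```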

The goal was to extend the capabilities of WASI-NN by building a custom module that would allow the framework to handle LLM models using mlx.cpp operations, thereby speeding up the ops computed on Mac devices through Metal acceleration. This involved not only understanding the intricacies of WasmEdge and WASI-NN but also navigating the complex landscape of mlx.cpp and edge computing.

The Process

The first step was to familiarize myself with the existing codebase and documentation. MLX is built using modern C++ and leverages advanced techniques like template meta-programming and Metal acceleration. While this presented a steep learning curve, the comprehensive documentation and supportive community made the journey smoother. I would also like to thank the MLX maintainers, especially @awni, for supporting my work and providing guidance along the way.

After gaining a solid understanding of the framework, I started designing the custom module. This involved defining the module's interface, data structures, and algorithms, and making sure the module would integrate seamlessly with the existing MLX ecosystem while providing the desired functionality.

One of the challenges I faced was implementing the custom module in a way that would not compromise the performance and efficiency of MLX. This required careful optimisation and profiling to identify and address potential bottlenecks.
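As an illustration of the kind of interface this design converged towards, here is a minimal sketch of a module base class in the spirit of mlx.nn's Python `Module`; the names (`Module`, `forward`, `parameters`) are assumptions for this post rather than the exact mlx_llm.cpp API.

```cpp
// Hypothetical sketch of a module interface, mirroring mlx.nn's Python design.
// Names are illustrative; the real mlx_llm.cpp interface may differ.
#include <string>
#include <unordered_map>

#include "mlx/mlx.h"

namespace mx = mlx::core;

struct Module {
  virtual ~Module() = default;

  // Maps input arrays to output arrays; MLX records the graph lazily.
  virtual mx::array forward(const mx::array &x) = 0;

  // Flat view of the parameters, so checkpoint weights loaded by name
  // (e.g. from SafeTensors/GGUF) can be copied into the right slots.
  virtual std::unordered_map<std::string, mx::array *> parameters() = 0;
};
```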

After that, the task was to complete the mlx_llm.cpp library and use it as a separate dependency of the WasmEdge project. I then added mlx.h and mlx.cpp files to the WASI-NN plugin so that models can be loaded and run through the new MLX backend.
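In simplified form, the backend follows the usual WASI-NN lifecycle: load a graph, create an execution context, set inputs, compute, and read outputs. The sketch below illustrates that shape only; the `MLXGraph`/`MLXContext` types, the plain `ErrNo` enum, and the function signatures are simplified assumptions, not the real WasmEdge plugin interface.

```cpp
// Simplified illustration of a WASI-NN style backend lifecycle for MLX.
// This is NOT the real WasmEdge plugin interface; types and signatures
// are reduced to their essence for the sake of the post.
#include <memory>
#include <string>
#include <vector>

#include "mlx/mlx.h"

namespace mx = mlx::core;

struct MLXGraph {
  // In the real plugin this holds the model built from mlx_llm.cpp plus
  // tokenizer/config state; here it is just a placeholder weight.
  mx::array weight = mx::zeros({1, 1});
};

struct MLXContext {
  std::shared_ptr<MLXGraph> graph;
  mx::array input = mx::zeros({1, 1});
  mx::array output = mx::zeros({1, 1});
};

enum class ErrNo { Success, InvalidArgument, RuntimeError };

// load: parse the model bytes/paths handed over by the guest and build a graph.
ErrNo load(const std::vector<std::string> &model_paths,
           std::shared_ptr<MLXGraph> &out) {
  if (model_paths.empty()) return ErrNo::InvalidArgument;
  out = std::make_shared<MLXGraph>();
  return ErrNo::Success;
}

// compute: run the MLX graph; eval() forces MLX's lazy graph to execute.
ErrNo compute(MLXContext &ctx) {
  if (!ctx.graph) return ErrNo::InvalidArgument;
  ctx.output = mx::matmul(ctx.input, ctx.graph->weight);
  mx::eval(ctx.output);
  return ErrNo::Success;
}
```

The real plugin of course goes through WasmEdge's plugin framework and its own error types; the point here is only where the MLX graph and its lazy evaluation fit into the WASI-NN lifecycle.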

Figure 1: An example NN that could be built using mlx components and my new mlx_llm.cpp library (used for testing the new WASI-NN MLX plugin)

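For a concrete picture of what Figure 1 refers to, here is a toy two-layer network written directly against MLX's C++ ops (rather than against mlx_llm.cpp); the shapes and layer sizes are arbitrary choices for illustration.

```cpp
// A toy two-layer MLP using raw MLX C++ ops (shapes chosen arbitrarily).
#include <iostream>

#include "mlx/mlx.h"

namespace mx = mlx::core;

int main() {
  // Random input batch and randomly initialised parameters.
  mx::array x  = mx::random::normal({8, 16});   // batch of 8, 16 features
  mx::array w1 = mx::random::normal({16, 32});
  mx::array w2 = mx::random::normal({32, 4});

  // Hidden layer with ReLU, then an output projection.
  mx::array h = mx::maximum(mx::matmul(x, w1), mx::array(0.0f));
  mx::array y = mx::matmul(h, w2);

  // MLX is lazy: nothing actually runs until eval() is called.
  mx::eval(y);
  std::cout << "output shape: " << y.shape()[0] << " x " << y.shape()[1]
            << std::endl;
  return 0;
}
```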

Benchmarking:

The sample example built using the mlx_llm.cpp library API made it possible to measure the difference in speed between MLX's Python and C++ APIs, which should benefit the new projects that will build on this ecosystem in the future.

| Model / test | Python API (time in s) | C++ API (time in s) | Speed-up |
|--------------|------------------------|---------------------|----------|
| Test_nn      | 0.0003715919494628906  | 8.1458e-05          | ~4.56x   |
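As a reference point, the C++ side of such a measurement can be collected with a simple wall-clock harness like the sketch below (my illustration, not the exact benchmark script); the MLX-specific detail is that eval() must sit inside the timed region, because MLX builds its compute graph lazily.

```cpp
// Minimal sketch of timing an MLX C++ forward pass with std::chrono.
// Because MLX is lazy, eval() must be inside the timed region, otherwise
// only graph construction (not the actual compute) gets measured.
#include <chrono>
#include <iostream>

#include "mlx/mlx.h"

namespace mx = mlx::core;

int main() {
  mx::array x  = mx::random::normal({8, 16});
  mx::array w1 = mx::random::normal({16, 32});
  mx::array w2 = mx::random::normal({32, 4});

  // Warm-up run so one-time initialisation is not counted.
  mx::eval(mx::matmul(mx::maximum(mx::matmul(x, w1), mx::array(0.0f)), w2));

  auto start = std::chrono::high_resolution_clock::now();
  mx::array y = mx::matmul(mx::maximum(mx::matmul(x, w1), mx::array(0.0f)), w2);
  mx::eval(y);
  auto end = std::chrono::high_resolution_clock::now();

  std::chrono::duration<double> elapsed = end - start;
  std::cout << "forward pass took " << elapsed.count() << " s" << std::endl;
  return 0;
}
```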

Even though this speed-up was measured on a small neural network, it is worth noting, and it suggests that this path could lead to more performant edge inference than llama.cpp in the future. Unfortunately, I was not able to finish a complete LLM implementation such as Phi-3 (still in progress) with the new library, so comparing token generation speeds of the two ecosystems, mlx_llm.cpp and llama.cpp, will need more time.

My contributions:

  1. https://github.com/WasmEdge/WasmEdge/pull/3330
  2. https://github.com/guptaaryan16/mlx_llm.cpp

The Result:

After several iterations of development, testing, and refinement, I successfully created a custom module that met the project requirements. The module not only extended MLX's capabilities but also demonstrated the framework's flexibility and extensibility. Throughout the process, I had the opportunity to collaborate with experienced developers and researchers from the MLX team. Their guidance and feedback were invaluable in shaping the final product and helping me grow as a developer.