Benchmark Results
GPU (Metal shader) vs CPU (MLX) on Apple Silicon.
Timing is the average of 5 runs with 2 warmup runs discarded.
Skipped cells (—) exceeded the per-config batch limit.
Small dimensions — M < 512
| Shape |
|
batch=10 |
batch=50 |
batch=100 |
batch=500 |
batch=1000 |
batch=5000 |
batch=10000 |
batch=15000 |
| 8 × 8 |
GPU |
0.49 ms |
0.59 ms |
0.64 ms |
1.63 ms |
3.17 ms |
8.40 ms |
16.54 ms |
24.64 ms |
| |
CPU |
1.10 ms |
1.81 ms |
4.21 ms |
29.51 ms |
47.43 ms |
194.28 ms |
392.69 ms |
592.42 ms |
| |
Speedup |
2.27x |
3.10x |
6.58x |
18.09x |
14.94x |
23.13x |
23.74x |
24.04x |
| 16 × 16 |
GPU |
0.54 ms |
0.45 ms |
0.85 ms |
2.40 ms |
3.59 ms |
12.48 ms |
25.01 ms |
39.91 ms |
| |
CPU |
0.33 ms |
2.05 ms |
4.05 ms |
20.89 ms |
41.17 ms |
204.96 ms |
411.91 ms |
619.39 ms |
| |
Speedup |
0.61x |
4.53x |
4.77x |
8.70x |
11.47x |
16.42x |
16.47x |
15.52x |
| 32 × 32 |
GPU |
0.74 ms |
0.90 ms |
1.47 ms |
4.45 ms |
7.39 ms |
33.84 ms |
71.03 ms |
107.61 ms |
| |
CPU |
0.47 ms |
2.32 ms |
4.72 ms |
23.73 ms |
46.94 ms |
233.64 ms |
468.85 ms |
727.55 ms |
| |
Speedup |
0.64x |
2.57x |
3.22x |
5.34x |
6.36x |
6.90x |
6.60x |
6.76x |
| 64 × 64 |
GPU |
1.01 ms |
2.21 ms |
3.56 ms |
14.19 ms |
27.47 ms |
135.44 ms |
284.26 ms |
411.40 ms |
| |
CPU |
1.00 ms |
4.81 ms |
9.52 ms |
47.38 ms |
93.89 ms |
469.27 ms |
940.34 ms |
1413.26 ms |
| |
Speedup |
0.99x |
2.18x |
2.68x |
3.34x |
3.42x |
3.46x |
3.31x |
3.44x |
| 128 × 64 |
GPU |
1.85 ms |
2.39 ms |
6.04 ms |
20.12 ms |
41.24 ms |
207.19 ms |
409.29 ms |
579.60 ms |
| |
CPU |
1.60 ms |
6.70 ms |
13.21 ms |
66.34 ms |
132.21 ms |
665.30 ms |
1326.68 ms |
2035.80 ms |
| |
Speedup |
0.87x |
2.81x |
2.19x |
3.30x |
3.21x |
3.21x |
3.24x |
3.51x |
| 256 × 128 |
GPU |
4.46 ms |
11.27 ms |
22.12 ms |
97.19 ms |
209.31 ms |
— |
— |
— |
| |
CPU |
6.12 ms |
30.15 ms |
65.47 ms |
325.97 ms |
596.31 ms |
— |
— |
— |
| |
Speedup |
1.37x |
2.68x |
2.96x |
3.35x |
2.85x |
— |
— |
— |
| 512 × 256 |
GPU |
13.53 ms |
59.71 ms |
112.91 ms |
519.76 ms |
— |
— |
— |
— |
| |
CPU |
24.12 ms |
119.07 ms |
237.59 ms |
1196.59 ms |
— |
— |
— |
— |
| |
Speedup |
1.78x |
1.99x |
2.10x |
2.30x |
— |
— |
— |
— |
Observations
- At very small batch sizes (≤ 10), GPU launch overhead can exceed the compute cost, resulting in sub-1x speedup for small matrices. This is expected — the GPU only becomes worthwhile once there is enough work to amortise the dispatch cost.
- Speedup scales strongly with batch size. At batch=15000, an 8×8 workload reaches 24x over MLX CPU.
- The crossover from CPU-faster to GPU-faster occurs around batch=50 for most small configurations.
Large dimensions — M ≥ 512
| Shape |
|
batch=1 |
batch=8 |
batch=16 |
batch=32 |
| 512 × 512 |
GPU |
12.07 ms |
24.07 ms |
35.50 ms |
69.65 ms |
| |
CPU |
7.00 ms |
43.75 ms |
85.92 ms |
174.61 ms |
| |
Speedup |
0.58x |
1.82x |
2.42x |
2.51x |
| 1024 × 512 |
GPU |
17.93 ms |
53.30 ms |
92.24 ms |
— |
| |
CPU |
12.95 ms |
98.48 ms |
196.12 ms |
— |
| |
Speedup |
0.72x |
1.85x |
2.13x |
— |
| 5000 × 5000 |
GPU |
1213.09 ms |
7494.11 ms |
— |
— |
| |
CPU |
1970.29 ms |
15769.81 ms |
— |
— |
| |
Speedup |
1.62x |
2.10x |
— |
— |
Observations
- At batch=1, the GPU is slower than MLX CPU for 512×512 and 1024×512 matrices. A single large matrix does not generate enough parallelism to saturate the GPU’s shader multiprocessors.
- Speedup grows steadily with batch size, reaching 2.51x at batch=32 for 512×512.
- Even at batch=1, the GPU outperforms MLX CPU on 5000×5000 (1.62x), where the sheer volume of in-matrix computation saturates the shader cores without needing a large batch.