Fastest way to multiply $10^4$ 2x2 matrices

Computational Science | matlab, python, matrix, performance, numpy

2021-12-08 09:44:35

In the code I work with (written in Python, but also tagged matlab because numpy is very similar and I can use MATLAB if needed), we use the transfer matrix method to compute the properties of a physical system. That is, for a particle with initial position $\vec{x}_i$, we compute the final position as

$\vec{x}_f = \hat{M}(z)\cdot\vec{x}_i \approx \hat{M}(z_n)\cdots\hat{M}(z_1)\cdot\hat{M}(z_0)\cdot\vec{x}_i,$

where, for convergence reasons, $n\sim 10^4$. Profiling our code shows that about 90% of the CPU time is spent on the final matrix multiplication. I am currently using a naive implementation of it, shown here:

import numpy as np

# The list of 2x2 matrices (M1 ... Mn are placeholders for the actual arrays)
Ms = [M1, M2, M3, ..., Mn]

# Start with the identity matrix
result = np.identity(2)

# Multiply the matrices
for M in Ms:
    result = M @ result

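My question is: is there a clever way to speed up the matrix multiplication step? Alternatively, I would also be interested in not-so-clever ways that save time using numpy voodoo.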

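Unfortunately, these matrices do not commute, so I cannot take their logarithms, sum them, and then take the matrix exponential, which I think would be faster.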
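A minimal sketch of why the order matters, using two made-up 2x2 matrices (a drift and a thin lens, chosen only for illustration and not taken from the system in the question). The identity $e^{A}e^{B} = e^{A+B}$ is guaranteed when $A$ and $B$ commute, which is why the logarithm-sum-exponential shortcut is not available here.

```python
import numpy as np

# Hypothetical transfer matrices, for illustration only:
# a drift of length 2 and a thin lens of focal length 4.
drift = np.array([[1.0, 2.0],
                  [0.0, 1.0]])
lens = np.array([[1.0, 0.0],
                 [-0.25, 1.0]])

# The two orderings give different products.
print(drift @ lens)
print(lens @ drift)

# A non-zero commutator means the matrices do not commute.
commutator = drift @ lens - lens @ drift
print(np.allclose(commutator, 0.0))
```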

Edit: The matrices are generated as follows:

# Calculate the constant matrices and edge matrices
Ms = get_M_const(E, B, gammas[:-1], delta_z)
rising_Ms = np.concatenate((np.array([[[1.0,], [0.0,]], [[0.0,], [1.0,]]]), get_M_edge(E[1:], gammas[1:-1], 'rising')), axis=2)
falling_Ms = get_M_edge(E, gammas[1:], 'falling')

# Interleave the arrays
c = np.empty((2,2, Ms.shape[-1]+rising_Ms.shape[-1]+falling_Ms.shape[-1],), dtype=Ms.dtype)
c[:,:,0::3] = rising_Ms
c[:,:,1::3] = Ms
c[:,:,2::3] = falling_Ms

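Technically there are three different types of matrices, $\hat{M}_{\text{total}} = \hat{M}_{\text{falling}}\cdot\hat{M}_{\text{const}}\cdot\hat{M}_{\text{rising}}$, which I compute with numpy functions to take advantage of vectorized routines. The variables E, B and gammas are numpy arrays of shape (n), and delta_z is just a number. These functions return (2,2,n) arrays, which I then interleave to obtain the full (2,2,3n) array of matrices to be multiplied.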

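I suppose I simplified my code in the first code block by listing the matrices as a native Python list. Everything else is how I actually perform the matrix multiplication: I run a for loop over the elements of the transpose of c.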
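A minimal sketch of what looping over the transpose looks like, assuming c has shape (2, 2, N) as described; the random array is only a stand-in for the real data. One detail worth checking: c.T has shape (N, 2, 2) and c.T[k] is the transpose of the k-th matrix, so each factor (or the final result) has to be transposed back. Whether the original loop does this is not shown in the question.

```python
import functools
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the interleaved (2, 2, N) array c; random values, for illustration only.
c = rng.random((2, 2, 50))

# Explicit indexing: c[:, :, k] is the k-th matrix, applied on the left.
result = np.identity(2)
for k in range(c.shape[-1]):
    result = c[:, :, k] @ result

# Looping over c.T: each element is a transposed matrix, so undo the transpose.
result_T = np.identity(2)
for Mt in c.T:
    result_T = Mt.T @ result_T

# Equivalent one-liner, using (M_N ... M_1 M_0)^T = M_0^T M_1^T ... M_N^T.
result_reduce = functools.reduce(np.dot, c.T).T
```

All three variants compute the same ordered product; they differ only in where the transpose is undone.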

2 Answers

In general, I agree with Chris's comment that using a compiled language, where the matrices can be allocated on the stack, can help a lot.

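If we restrict ourselves to Python and numpy, there are several possibilities:

Consider np.array vs. np.matrix: it may turn out that np.matrix is faster than np.array for matrix-matrix products (it is not clear what you are using now, nor how the $2\times2$ size affects the result).
Following whpwell96's comment, consider computing the final matrix in parallel.
Perhaps you do not need to compute the full matrix $\hat{M}(z)$ at all. Instead of $(10^4-1)$ matrix-matrix products and $1$ matrix-vector product, the alternative of $10^4$ matrix-vector products may be better, provided the full matrix is not needed for other computations.
Consider Cython and/or performance-oriented Python distributions.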
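Two of the options above can be sketched in plain numpy. The function names, the (2, 2, N) input layout, and the near-identity test matrices are assumptions made for this example. The pairwise reduction is only one possible reading of the "parallel" suggestion, not necessarily what the commenter had in mind: it relies on associativity to multiply neighbouring pairs with one batched np.matmul call per pass, so the Python-level loop runs about $\log_2 N$ times. No timings are claimed here.

```python
import numpy as np

def propagate_vector(mats, x):
    """Apply M_0, then M_1, ... to x using only matrix-vector products.

    mats has shape (2, 2, N); x has shape (2,).
    """
    for k in range(mats.shape[-1]):
        x = mats[:, :, k] @ x
    return x

def pairwise_product(mats):
    """Return M_{N-1} ... M_1 M_0 by repeatedly multiplying neighbouring pairs."""
    if mats.shape[-1] == 0:
        return np.eye(2)
    stack = np.moveaxis(mats, -1, 0)  # shape (N, 2, 2)
    while stack.shape[0] > 1:
        if stack.shape[0] % 2:
            # Odd count: append an identity so every matrix has a partner.
            stack = np.concatenate([stack, np.eye(2)[None]], axis=0)
        # Later matrix goes on the left: (M_1 M_0), (M_3 M_2), ...
        stack = stack[1::2] @ stack[0::2]
    return stack[0]

# Hypothetical test data: matrices close to the identity.
rng = np.random.default_rng(1)
mats = np.eye(2)[:, :, None] + 0.01 * rng.standard_normal((2, 2, 1001))
x0 = np.array([1.0, 0.0])

# Reference: the naive left-multiplication loop.
reference = np.identity(2)
for k in range(mats.shape[-1]):
    reference = mats[:, :, k] @ reference

print(np.allclose(pairwise_product(mats), reference))
print(np.allclose(propagate_vector(mats, x0), reference @ x0))
```

The pairwise version changes the order in which floating-point products are grouped, so its result can differ from the sequential loop at the level of rounding error.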

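I wanted to follow up on this question because I was quite shocked by the performance improvement I got from implementing this as a C extension for Python.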

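I wrote a simple function in C that takes my (2,2,n) numpy array and performs the repeated matrix multiplication on it. Following KobeGote's suggestion, I hard-coded the 2x2 matrix multiplication to avoid the for-loop overhead of the numpy version. With these changes, I tested the performance using the following Python script.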
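The C source is not shown in this answer. As a rough illustration of what "hard-coding the 2x2 multiplication" means, here is a pure-Python sketch with the product written out element by element; the function name chain_product_2x2 is made up for this example, and this is not the author's extension code. In C, the same expanded arithmetic would sit inside a compiled loop over the last axis.

```python
import functools
import numpy as np

def chain_product_2x2(mats):
    """Return M_{N-1} ... M_1 M_0 for mats of shape (2, 2, N).

    The 2x2 product is expanded by hand on scalars, so no numpy call
    is made per step.
    """
    # Running product, starting from the identity.
    r00, r01, r10, r11 = 1.0, 0.0, 0.0, 1.0
    # Convert once to nested lists: rows[i][j] is the list of (i, j) entries.
    rows = mats.tolist()
    m00s, m01s = rows[0]
    m10s, m11s = rows[1]
    for m00, m01, m10, m11 in zip(m00s, m01s, m10s, m11s):
        # result <- M @ result, written out term by term.
        r00, r01, r10, r11 = (
            m00 * r00 + m01 * r10,
            m00 * r01 + m01 * r11,
            m10 * r00 + m11 * r10,
            m10 * r01 + m11 * r11,
        )
    return np.array([[r00, r01], [r10, r11]])

# Compare against the reduce-based version on random test data.
rng = np.random.default_rng(2)
arr = rng.random((2, 2, 200))
expected = functools.reduce(np.dot, arr.T).T
print(np.allclose(chain_product_2x2(arr), expected))
```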

import spam
import numpy as np
import timeit
import functools as f

# Make a test array
arr = np.random.rand(2,2,10000)

# Test the speed of the new method
testfun_new = lambda: spam.numpy_test(arr)
print("Execution Time New: {:.0f} us".format(timeit.timeit(testfun_new, number=100)/100*1e6))

# Test the speed of the old method
testfun_old = lambda: f.reduce(np.dot, arr.T).T
print("Execution Time Old: {:.0f} ms".format(timeit.timeit(testfun_old, number=100)/100*1e3))

# Make sure they are the same
print()
print('Relative Difference of Elements:')
print((testfun_new() - testfun_old())/testfun_old())

spam is just the silly name I gave my C extension for the time being. The output is:

Execution Time New: 35 us
Execution Time Old: 22 ms

Relative Difference of Elements:
[[ 8.04395924e-15 -3.77388718e-15]
 [ 7.98691127e-15 -3.94433965e-15]]

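I was expecting some performance gain, but I never imagined it would be nearly three orders of magnitude faster! I even had to double-check that the two versions return the same values, because I thought I had simply forgotten something, but they agree essentially to machine precision. With this change, the runtime of my full matrix calculation dropped by 95%, which is well below the threshold I care about; it is now limited by other numerical operations.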

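Given the performance difference, I am surprised that numpy/scipy does not already have such a function. Maybe I will suggest it to them later, or at least publish my own code once it is polished.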

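Edit: For anyone who runs into this problem in the future, take a look at the Python library I wrote, which contains my implementation. Find it on PyPI at https://pypi.org/project/matprod/, or install it with pip install matprod.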
