A Tragedy Caused by a Piece of C++ Code: R's Inefficient for Loops

Posted on 2021-07-15 in Rcpp

Sigh. If only I knew R's functions a little better, I would not have been forced to reach for C++ to speed things up.

A Tragedy Caused by a Piece of C++ Code

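I am processing a data set of 80000+ rows by 58 columns of floating-point numbers. For every row, I have to use that row's label to find the matching row in another n x 58 data frame, and then divide the row element-wise by the row vector I found. That is roughly 80000 lookups, 80000 * 58 divisions, and 80000 * 58 assignments.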
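To make the task concrete, here is a minimal sketch in plain C++ (standard library only, no Rcpp). The names `Row` and `normalize_rows` are made up for illustration and are not part of the real code; the sketch assumes every label has a matching entry in the lookup table.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical type for illustration: one Row holds the values of one record.
using Row = std::vector<double>;

// Divide each data row element-wise by the standard row that matches its label.
// Assumes every label is present in `standards`
// (at() throws std::out_of_range for an unknown label).
std::vector<Row> normalize_rows(
    const std::vector<Row>& data,
    const std::vector<std::string>& labels,
    const std::unordered_map<std::string, Row>& standards) {
  assert(data.size() == labels.size());
  std::vector<Row> result;
  result.reserve(data.size());
  for (std::size_t i = 0; i < data.size(); ++i) {
    const Row& divisor = standards.at(labels[i]);  // one lookup per row
    assert(divisor.size() == data[i].size());
    Row out(data[i].size());
    for (std::size_t j = 0; j < out.size(); ++j) {
      out[j] = data[i][j] / divisor[j];  // one division and one assignment per cell
    }
    result.push_back(out);
  }
  return result;
}
```

With the labels "a", "b", "a" and standard rows a = (2, 4) and b = (2, 10), the data rows (1, 8), (2, 5), (3, 2) would become (0.5, 2), (1, 0.5), (1.5, 0.5).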

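At first I worked directly on a data.frame, which really was writing code without thinking first. A data frame in R is essentially a collection of column vectors, so row-wise operations on it are painful. (If only there were a place that summarized the design philosophy and characteristics of every programming language.) So at the start I had to use a for loop to traverse the 80000+ rows: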

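library(dplyr)   # provides %>%
library(tibble)  # provides column_to_rownames()

# First and last index of the 58 value columns (65 - 8 + 1 = 58)
st_ind <- 8
ed_ind <- 65

# Turn the standard table into a named list keyed by product name
# ("产品名称" is the product-name column)
standard <- standard %>%
  column_to_rownames(var = "产品名称") %>%
  t() %>%
  as.data.frame() %>%
  as.list()

rows <- nrow(processed_data)
for (i in seq_len(rows)) {
  if (i %% 1000 == 0) {
    # Progress output
    cat(sprintf("Done: %i/%i    %.2f%%\n", i, rows, 100 * round(i/rows, 4)))
  }
  product_name <- processed_data$product[i]
  processed_data[i,st_ind:ed_ind] <-
    processed_data[i,st_ind:ed_ind] / standard[[product_name]]
}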

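In the code above, the data frame is transposed into a matrix with t(), converted back to a data frame, and then passed to as.list(), which turns each column into one element of a list that serves as a Map. See: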

How to Correctly Use Lists in R? - Stack Overflow

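A named list in R looks and behaves a lot like a hash map when used this way, although I could not find any material that says so. (Strictly speaking, looking up a list element by name is a linear search over the names; R's hashed container is the environment.)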

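We want each row's id to become a "key" of the list, and the values in the remaining columns w:z to become the "value" under that key, which is somewhat like converting the data frame to a list row by row.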
# A tibble: 4 x 5
     id     w     x     y     z
  <int> <dbl> <dbl> <dbl> <dbl>
1     1 0.117 0.617 0.698 0.899
2     2 0.126 0.690 0.968 0.789
3     3 0.596 0.691 0.716 0.125
4     4 0.661 0.404 0.279 0.730
First, tibble::column_to_rownames() turns the label column into row names, and t() then transposes the result into a matrix:
> df %>% column_to_rownames(var = "id") %>% t()
          1         2         3         4
w 0.1171960 0.1258210 0.5957329 0.6611707
x 0.6167798 0.6896939 0.6910173 0.4037931
y 0.6979593 0.9682786 0.7157607 0.2793822
z 0.8992127 0.7893004 0.1253743 0.7300913
Then base R's as.data.frame() converts it to a data frame (in effect, it groups the individual cells into one column vector per column):
> df %>% column_to_rownames(var = "id") %>% t() %>% as.data.frame()
          1         2         3         4
w 0.1171960 0.1258210 0.5957329 0.6611707
x 0.6167798 0.6896939 0.6910173 0.4037931
y 0.6979593 0.9682786 0.7157607 0.2793822
z 0.8992127 0.7893004 0.1253743 0.7300913
Finally, as.list() turns it into the "Map" list we want:
> df %>% column_to_rownames(var = "id") %>% t() %>% as.data.frame() %>% as.list()
$`1`
[1] 0.1171960 0.6167798 0.6979593 0.8992127

$`2`
[1] 0.1258210 0.6896939 0.9682786 0.7893004

$`3`
[1] 0.5957329 0.6910173 0.7157607 0.1253743

$`4`
[1] 0.6611707 0.4037931 0.2793822 0.7300913
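
For comparison, the same "one key per row" structure can also be built directly. The sketch below is plain C++ with the standard library, not Rcpp, and `build_row_map` is a hypothetical helper written only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Build a map from each row's key to the vector of that row's values,
// the same shape of result as the column_to_rownames() / t() / as.list() chain.
// Hypothetical helper: keys[i] is the label of rows[i].
std::map<std::string, std::vector<double>> build_row_map(
    const std::vector<std::string>& keys,
    const std::vector<std::vector<double>>& rows) {
  assert(keys.size() == rows.size());
  std::map<std::string, std::vector<double>> row_map;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    row_map[keys[i]] = rows[i];  // a repeated key overwrites the earlier row
  }
  return row_map;
}
```

Called with the keys "1" to "4" and the w:z values of each row, `row_map.at("2")` would hold the four values of the row whose id is 2, just like `$`2`` in the R output.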

😿 I have wandered off topic... Processing the 80000 * 58 data frame with the for loop took 30+ minutes, which is close to unbearable. I also considered other ways to optimize:

Parallelize
Come up with a new algorithm
Switch to another computer ×

Hmm, my skills are not up to it; all rejected.

So I had to bring out the ultimate solution: switch languages.

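Fortunately R has the excellent Rcpp package, which provides an object interface so that writing, compiling, and running C++ code can all be done from within R. Since I am using C++ anyway, I skip any optimization and go straight to a double loop (I also wrote the Map construction in C++ while I was at it, although that barely improves efficiency):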

#include <Rcpp.h>
using namespace Rcpp;

// Build a map from product name to the matching row of mrl.
// [[Rcpp::export]]
std::map<String, NumericVector> mapping_mrl(NumericMatrix mrl, CharacterVector products) {
  std::map<String, NumericVector> mrl_map;

  int n = products.size();
  for (int i = 0; i < n; i++) {
    mrl_map[products[i]] = mrl(i,_);
  }

  return mrl_map;
}

// Divide each row of vm element-wise by the vector stored under its product name.
// [[Rcpp::export]]
NumericMatrix calc_(NumericMatrix vm, CharacterVector products, List mrl_map) {
  int rows = vm.nrow();
  int cols = vm.ncol();

  // Result matrix with the same shape as vm
  NumericMatrix nvm(rows, cols);

  for (int i = 0; i < rows; i++) {
    for (int j = 0; j < cols; j++) {
      // The lookup and conversion are repeated for every cell (no optimization)
      String product_name = products[i];
      nvm(i,j) = vm(i,j) / as<NumericVector>(mrl_map[product_name])[j];
    }
  }

  return nvm;
}


// You can include R code blocks in C++ files processed with sourceCpp
// (useful for testing and development). The R code will be automatically
// run after the compilation.
//

/*** R
df <- data.frame(
  name = c('a', 'b', 'a', 'c', 'c'),
  t1 = c(1,2,3,4,5),
  t2 = c(0,9,8,7,6)
)
mrl <- data.frame(
  product = c('a','b','c'),
  t1 = c(2,2,4),
  t2 = c(4,10,6)
)
df1 <- data.frame(
  a = c('a','b','c'),
  b = c('c','b','a')
)

mrl_map <- mapping_mrl(as.matrix(mrl[,2:3]), mrl$product)
nvm <- calc_(as.matrix(df[,2:3]), df$name, mrl_map)
*/

As a result, it processes the 80000 * 58 data in 20 to 30 seconds, roughly a 60-fold speedup...
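
One detail of the double loop in calc_ is that it fetches the product's vector again for every cell. The sketch below, in plain C++ rather than Rcpp, counts lookups for a per-cell and a per-row variant on toy data. `CountingTable`, `divide_per_cell`, and `divide_per_row` are hypothetical names, and nothing here measures the real 80000 * 58 run.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Row = std::vector<double>;

// Hypothetical wrapper that counts how often the lookup table is queried.
struct CountingTable {
  std::map<std::string, Row> rows;
  mutable std::size_t lookups = 0;

  const Row& find(const std::string& key) const {
    ++lookups;
    return rows.at(key);  // throws std::out_of_range for an unknown key
  }
};

// Variant 1: look the divisor row up again for every cell (rows * cols lookups).
std::vector<Row> divide_per_cell(const std::vector<Row>& data,
                                 const std::vector<std::string>& labels,
                                 const CountingTable& table) {
  assert(data.size() == labels.size());
  std::vector<Row> out = data;
  for (std::size_t i = 0; i < out.size(); ++i) {
    for (std::size_t j = 0; j < out[i].size(); ++j) {
      out[i][j] /= table.find(labels[i])[j];
    }
  }
  return out;
}

// Variant 2: look the divisor row up once per row (rows lookups).
std::vector<Row> divide_per_row(const std::vector<Row>& data,
                                const std::vector<std::string>& labels,
                                const CountingTable& table) {
  assert(data.size() == labels.size());
  std::vector<Row> out = data;
  for (std::size_t i = 0; i < out.size(); ++i) {
    const Row& divisor = table.find(labels[i]);
    for (std::size_t j = 0; j < out[i].size(); ++j) {
      out[i][j] /= divisor[j];
    }
  }
  return out;
}
```

On the toy data from the test block (5 rows, 2 columns), the first variant performs 10 lookups and the second 5, and both return the same values. Whether hoisting the lookup would noticeably change the timing of the Rcpp version is not something this sketch shows.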

When will R's speed ever catch up...?