12. 优化

91 CPU缓存 #

CPU架构

i5-7300 2个物理内核,4个逻辑内核.

graph TD subgraph CPU subgraph core0 C0_T0[T0] C0_T1[T1] C0_L1D[L1D] C0_L1I[L1I] C0_L2[L2] C0_T0 <-->C0_L1D C0_T0 <-->C0_L1I C0_T1 <-->C0_L1D C0_T1 <-->C0_L1I C0_L1D<-->C0_L2 C0_L1I<-->C0_L2 end subgraph core1 T0 <--> L1D T0 <--> L1I T1 <--> L1D T1 <--> L1I L1D<-->L2 L1I<-->L2 end L2<-->L3 C0_L2<-->L3 end m1[Main memory] L3 <--> m1

cache大小

L1: 64KB
L2: 256KB
L3: 4MB

访问速度.

L1: 1ns
L2: 1/4 L1速度
L3: 1/10 L1速度
主存: 1/50~1/100 L1速度

L1 可以挂上100个变量,推算关系–L1存储大小/变量大小=变量数量.

缓存行相关

访问特定内存位置时

同位置再次引用
附近内存位置将被引用

同位置再次引用

int a = arr[0]; 
//代码运算
int b = arr[0];

第二次访问arr[0]时,数据仍在CPU缓存中,直接从缓存中获取,速度更快.

附近内存位置将被引用

int a = arr[0];
int b = arr[1]; 
int c = arr[2];

访问arr[0],会同时读入arr[0]-arr[N]到缓存.

以上为局部性原理体现.

slice与结构体

type Foo struct {
	a int64 
	b int64 
}

type Bar struct {
	a []int64 
	b []int64 
}

在sum(a)计算中,Bar比[]Foo紧凑.

大概分布

# []Foo
a b a b a b a b

# Bar
a a a a b b b b

预测性

CPU stride // CPU在连续访问内存地址时,两个地址之间的间隔大小. 大概CPU步幅意思.

CPU stride类型

unit stride 可预测
constant stride 可预测但低效
non-unit stride 不可预测

type node struct { 
	value int64 
	next *node
}

node是链表,node.value的sum计算就是constant stride类型.

[]int64的sum计算就是unit stride类型.

Cache placement policy

当CPU决定复制内存块并将其放入缓存时,遵守的策略.

常见策略

Fully associative
Set associative

Set associative比较主流,如下图所示.

graph TD A[内存地址] --> B(索引) B --> C{组选择} C --> D1[组1块1] C --> D2[组1块2] C --> D3[组1块3] C --> D4[组2块1] C --> D5[组2块2]

92 错误共享 #

多个协程处理时,结构体padding作用.

type Result struct { 
	sumA int64
    _ [56]byte 
	sumB int64
}

如果没有padding

sumA和sumB可能共享一个Cache line,读写时可能发生Cache冲突
保存sumA结果时会污染sumB所在Cache line的数据
读取sumB时更高概率需要从内存重新加载Cache

padding存在优化内存对齐和Cache利用率.

93 指令并行 #

func add(s [2]int64) [2]int64 {
    for i := 0; i < n; i++ {
        s[0]++
        if s[0]%2 == 0 {
            s[1]++
        }
    }
    return s
}

func add2(s [2]int64) [2]int64 {
    for i := 0; i < n; i++ {
        v := s[0] // v的存在降低读写冲突
		s[0] = v + 1
        if v%2 != 0 {
            s[1]++
        }
    }
    return s
}

add2的优化让存在instruction-level parallelism (ILP)特性的CPU对其优化.

94 数据对齐 #

byte, uint8, int8: 1 byte
uint16, int16: 2 bytes
uint32, int32, float32: 4 bytes
uint64, int64, float64, complex64: 8 bytes
complex128: 16 bytes

type Foo struct {
	b1 byte
	i int64 
	b2 byte 
}

优化后

type Foo struct { 
	i int64 
	b1 byte 
	b2 byte 
}

95 堆栈 #

stack特点

last-in, first-out (LIFO)

heap上的变量规则

全局变量
发送chan的指针变量
chan值里面的结构体嵌套指针
局部变量过大时
无法预估变量大小

96 降低内存分配 #

设计接口时考虑到逃逸分析

sync.Pool有利于对象重用

97 内联 #

函数过于复杂则不会内联.

98 profiling #

Golang性能剖析

99 垃圾回收 #

GC流程

mark
sweep

GOGC参数控制触发GC的频率,值越大触发GC的频率越低

频繁GC与延迟有关.

100 k8s调度 #

golang应用跑在k8s,在cpu调度上目前存在问题.

k8s目前的调度策略 Completely Fair Scheduler (CFS)

配置k8s配置上cpu CFS的参数

cpu.cfs_period_us (global setting)
cpu.cfs_quota_us (setting per Pod)

https://github.com/golang/go/issues/33803 提出了相关的问题,

其中 https://github.com/uber-go/automaxprocs 让 GOMAXPROCS 自动根据 CPU 配额进行配置.