20241025 notes: feeding Megatron data to an HF model

As it happens, all of the pretraining data processing lives in Megatron.

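But a later step changed the model architecture, and that change was implemented in the HF version. It's awkward to port to Megatron, and converting back and forth is a hassle.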

So I hooked the data already processed in Megatron into the accelerate-based HF training code.

The various mpu ranks can be replaced with torch.distributed, and then plain data parallel is all that's needed:

torch.distributed.get_rank()
torch.distributed.get_world_size()
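
A minimal sketch of how these two calls can stand in for mpu's data-parallel rank and world size when splitting samples across ranks. The helper names `dp_rank_and_size` and `shard_indices` are made up for illustration, and the fallback to (0, 1) when no process group exists is an assumption added so the snippet also runs in a single process.

```python
import torch.distributed as dist


def dp_rank_and_size():
    """Return (rank, world_size), treating every process as one data-parallel rank."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank(), dist.get_world_size()
    # Assumption: no initialized process group means a plain single-process run.
    return 0, 1


def shard_indices(num_samples, rank, world_size):
    """Strided split of sample indices so each rank reads a disjoint subset."""
    return list(range(rank, num_samples, world_size))


rank, world_size = dp_rank_and_size()
my_indices = shard_indices(10, rank, world_size)
```

With data parallel only, the global rank and the data-parallel rank are the same number, which is why nothing else from mpu is needed here.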

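When a process_group argument is needed, pass None. That selects the default group, which accelerate creates implicitly by calling

torch.distributed.init_process_group()  # returns None; the group it creates is what group=None resolves to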
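
As a rough illustration of passing None for the group, here is a small averaging helper. The function name `average_across_ranks` is hypothetical, and the early return for the uninitialized case is an assumption so that the sketch can be tried without launching multiple processes.

```python
import torch
import torch.distributed as dist


def average_across_ranks(value: torch.Tensor) -> torch.Tensor:
    """Average a tensor over all ranks using the default process group."""
    if not (dist.is_available() and dist.is_initialized()):
        return value  # single process: nothing to reduce
    reduced = value.clone()
    # group=None means "the default group"; with NCCL the tensor must be on GPU.
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM, group=None)
    return reduced / dist.get_world_size(group=None)
```

Any Megatron helper that used to take an mpu group can be called the same way, with None in place of the group.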

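Note: don't hand Megatron's labels straight to HF. Megatron's labels are already shifted by one position relative to the tokens, and HF causal LM models shift the labels again when computing the loss.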
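
One way to avoid the double shift, sketched under assumptions: build HF labels from Megatron's tokens instead of its labels, and carry the loss mask over with a one-position offset. The name `megatron_to_hf_labels` is hypothetical, and the code assumes `loss_mask` is aligned with Megatron's labels (entry t covers the prediction of token t+1).

```python
import torch

IGNORE_INDEX = -100  # label value that HF's cross-entropy loss skips


def megatron_to_hf_labels(tokens, loss_mask=None):
    """Build unshifted HF labels from Megatron's `tokens` of shape [batch, seq].

    HF compares logits[:, :-1] with labels[:, 1:], so labels must line up
    with input_ids instead of being shifted in advance.
    """
    labels = tokens.clone()
    if loss_mask is not None:
        # loss_mask[:, t] governs the target at position t+1 of the unshifted labels.
        keep = loss_mask[:, :-1].bool()
        labels[:, 1:] = labels[:, 1:].masked_fill(~keep, IGNORE_INDEX)
    return labels


tokens = torch.tensor([[10, 11, 12, 13]])         # Megatron text[:-1]
loss_mask = torch.tensor([[1.0, 0.0, 1.0, 1.0]])  # aligned with text[1:]
hf_labels = megatron_to_hf_labels(tokens, loss_mask)
```

The trade-off is that the final target of each sequence (the last entry of Megatron's labels) is no longer trained on, since HF drops the last logit position.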

HF doesn't support 4D attention masks. As a hack, patch _prepare_4d_causal_mask in transformers into an identity function so the mask you pass in goes through unchanged.
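
A sketch of the patching idea. The helper names `passthrough_mask` and `patch_mask_helper` are made up, and the function name `_prepare_4d_causal_mask` is taken from this note as-is, so check the exact name and the module that calls it in your installed transformers version. A stand-in namespace replaces the real module here only so the snippet runs on its own.

```python
import types


def passthrough_mask(attention_mask, *args, **kwargs):
    """Return the caller's mask untouched, ignoring the other arguments."""
    return attention_mask


def patch_mask_helper(module, name="_prepare_4d_causal_mask"):
    """Replace module.<name> with the pass-through; return the original for restoring."""
    original = getattr(module, name)  # raises AttributeError if the name is wrong
    setattr(module, name, passthrough_mask)
    return original


# Stand-in for the real modeling module (assumption, for illustration only).
fake_modeling = types.SimpleNamespace(
    _prepare_4d_causal_mask=lambda mask, *args, **kwargs: "expanded"
)
original = patch_mask_helper(fake_modeling)
```

If the modeling file pulls the helper in with a from-import, the patch has to target that modeling module's namespace, since rebinding the name in the module where it is defined does not affect the copy already imported elsewhere. The mask then needs to already be in whatever form the attention code expects.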