Commit f134283 (parent: 3709285)
13 files changed: +82 -21 lines

File 1: @@ -377,7 +377,7 @@ LLM Data Engineering
  - [MindIE 1.0.RC1 released: Huawei Ascend finally ships a complete LLM deployment solution](https://www.zhihu.com/question/654472145/answer/3482521709)
  - [LLM adaptation for domestic hardware 8: Deploying Qwen-72B with the Ascend MindIE inference toolkit in practice (inference engine, inference serving)](https://juejin.cn/post/7365879319598727180)
    - Qwen-72B, Baichuan2-7B, ChatGLM3-6B
- - [LLM adaptation for domestic hardware 9: Benchmarking the MindIE-Service LLM inference framework](https://juejin.cn/post/7370925923455713320)
+ - [LLM adaptation for domestic hardware 9: Benchmarking the MindIE-Service LLM inference framework](https://zhuanlan.zhihu.com/p/704649189)
  - [LLM adaptation for domestic hardware 10: A step-by-step guide to migrating LLMs to Ascend 910B (PyTorch)](https://juejin.cn/post/7375351908896866323)
  - [LLM adaptation for domestic hardware 11: LLM training performance benchmarks (Ascend 910B3)](https://juejin.cn/post/7380995631790964772)
File 1, second hunk: @@ -447,6 +447,7 @@ An AI compiler transforms and optimizes a machine-learning algorithm from the development stage...

  - [Why do so many people experiment on top of Tsinghua's ChatGLM-6B?](https://www.zhihu.com/question/602504880/answer/3041965998)
  - [Why do many newly released LLMs default to BF16 rather than FP16?](https://www.zhihu.com/question/616600181/answer/3195333332)
  - [Can ZeRO-2 / ZeRO-3 be combined with pipeline parallelism when training LLMs?](https://www.zhihu.com/question/652836990/answer/3468210626)
+ - [A deep dive into Safetensors, the new format for storing model weights](https://juejin.cn/post/7386360803039838235)

  ## [LLM interview questions](https://github.com/liguodongiot/llm-action/blob/main/llm-interview/README.md)
File 2 (TensorRT Model Optimizer)

+ - Code: https://github.com/NVIDIA/TensorRT-Model-Optimizer
+ - Docs: https://nvidia.github.io/TensorRT-Model-Optimizer/
- - https://nvidia.github.io/TensorRT-Model-Optimizer/
- - https://github.com/NVIDIA/TensorRT-Model-Optimizer
+ - Best practices for choosing a quantization method: https://nvidia.github.io/TensorRT-Model-Optimizer/guides/_choosing_quant_methods.html
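The links above point at Model Optimizer's quantization guides; below is a minimal post-training quantization sketch assuming the `modelopt.torch.quantization` API from those docs. The model id, calibration prompts, and the SmoothQuant INT8 config choice are illustrative, not from the commit.

```python
# Hedged sketch: PTQ with NVIDIA TensorRT Model Optimizer (modelopt).
# Model id and config are illustrative; see the "choosing quant methods" guide.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"  # placeholder small model
model = AutoModelForCausalLM.from_pretrained(model_id).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(m):
    # Calibration: run a few representative prompts through the model.
    for prompt in ["Hello, world!", "Quantization trades accuracy for speed."]:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        with torch.no_grad():
            m(**inputs)

# Quantize in place with a SmoothQuant-style INT8 recipe.
model = mtq.quantize(model, mtq.INT8_SMOOTHQUANT_CFG, forward_loop)
```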
File 3: new file (three blank lines only)
File 4: @@ -14,7 +14,7 @@ transformers

  accelerate
- - https://zhuanlan.zhihu.com/p/671742753
+ - Fine-tuning Llama 2 70B with PyTorch FSDP: https://zhuanlan.zhihu.com/p/671742753
  - https://huggingface.co/docs/transformers/v4.41.0/en/fsdp#fsdp-configuration
  - https://huggingface.co/docs/transformers/v4.41.0/en/main_classes/trainer#transformers.TrainingArguments
  - https://github.com/pacman100/LLM-Workshop/blob/main/chat_assistant/sft/training/configs/fsdp_config.yaml
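To tie the FSDP links together, a minimal sketch of enabling FSDP through the `TrainingArguments` flags documented above; the option names follow the v4.41 docs, while the values and output path are illustrative and the exact `fsdp_config` keys may vary across transformers versions.

```python
# Hedged sketch: FSDP via transformers' Trainer, in the spirit of the
# fsdp_config.yaml linked above. Values are illustrative.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama2-70b-sft",   # placeholder path
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    bf16=True,
    fsdp="full_shard auto_wrap",   # shard params, grads, optimizer state
    fsdp_config={"backward_prefetch": "backward_pre"},  # key name per v4.41 docs
)
```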
File 5: deleted (diff not shown)
File 6 (llama.cpp)

  - https://github.com/ggerganov/llama.cpp

+ GGUF quantization format
+ - https://lightning.ai/cosmo3769/studios/post-training-quantization-to-gguf-format-and-evaluation
+ - https://medium.com/@metechsolutions/llm-by-examples-use-gguf-quantization-3e2272b66343
+ - ctransformers, llama.cpp
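Since the added bullet names ctransformers and llama.cpp as GGUF consumers, here is a minimal loading sketch with ctransformers; the repo and file names are illustrative placeholders for any GGUF checkpoint you have locally or on the hub.

```python
# Hedged sketch: running a GGUF-quantized model via ctransformers.
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GGUF",           # illustrative GGUF repo
    model_file="llama-2-7b.Q4_K_M.gguf",  # 4-bit K-quant variant
    model_type="llama",
)
print(llm("GGUF is", max_new_tokens=32))
```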
File 7 (OpenAI Triton)

+ - An introductory OpenAI Triton tutorial: https://zhuanlan.zhihu.com/p/684473453
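In the spirit of the tutorial linked above, a self-contained Triton "hello world" (vector add); the block size is an arbitrary choice, and a CUDA device is assumed.

```python
# Hedged sketch: a minimal Triton vector-add kernel.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)   # one program per 1024-element block
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, x + y)
```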
File 8 (LLM training and inference optimization)

+ - Mastering LLM Techniques: Inference Optimization: https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
  - 7 Ways To Speed Up Inference of Your Hosted LLMs: https://betterprogramming.pub/speed-up-llm-inference-83653aa24c47
  - How to Speed Up LLM Training with Distributed Systems?: https://www.appypie.com/blog/llm-training-with-distributed-systems
  - A detailed discussion of LLM training and inference optimization techniques: https://wjn1996.blog.csdn.net/article/details/130764843
File 9: @@ -11,3 +11,11 @@ https://github.com/ModelTC/lightllm/blob/main/docs/LightLLM.md

+ - https://github.com/ModelTC/lightllm/tree/main
+ - LightLLM: an ultra-lightweight, high-performance pure-Python LLM inference framework: https://mp.weixin.qq.com/s/-wMLMGAHkxeyDYkixqni9Q
File 10: new file (LMDeploy)

+ - https://hub.docker.com/r/openmmlab/lmdeploy-builder/tags
+ - https://hub.docker.com/r/openmmlab/lmdeploy/tags
+ - https://github.com/InternLM/lmdeploy/blob/main/docs/zh_cn/serving/api_server.md
+ - https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/cli/utils.py#L64
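To complement the api_server doc linked above, a minimal offline-inference sketch with LMDeploy's Python pipeline API; the model id is illustrative, and serving would instead go through the `lmdeploy serve api_server` CLI described in that doc.

```python
# Hedged sketch: offline batched inference with LMDeploy's pipeline API.
from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2-chat-7b")  # illustrative model id
responses = pipe(["Hi, please introduce yourself.", "Shanghai is"])
for r in responses:
    print(r.text)
```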
File 11 (TensorRT-LLM)

  - https://github.com/NVIDIA/TensorRT-LLM
  - https://nvidia.github.io/TensorRT-LLM/index.html
+ - Performance optimization best practices: https://nvidia.github.io/TensorRT-LLM/performance/perf-best-practices.htm

  ## FP8
File 12 (Triton Inference Server)

- - https://github.com/triton-inference-server/backend
+ - https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver
+ - https://github.com/triton-inference-server/backend
+ - https://github.com/triton-inference-server/server
File 13: new file (notes on the ORCA paper)

This paper presents ORCA, a distributed serving system designed specifically for Transformer-based generative models. An overview of the paper:

1. **Background and motivation**:
   - Large Transformer-based models such as GPT-3 excel at generation tasks, but existing inference-serving systems hit performance bottlenecks with them: these models need one iteration per output token, and existing systems handle such multi-iteration requests inefficiently.

2. **Core problem**:
   - Existing systems schedule requests batch by batch. Requests that finish early within a batch cannot be returned to the client until the whole batch completes, inflating latency; meanwhile, newly arrived requests must wait for the current batch to finish, inflating queueing time.

3. **Solution**:
   - **Iteration-level scheduling**: a new scheduling mechanism that dispatches execution one iteration at a time rather than one request at a time. After every iteration, newly arrived requests can be admitted immediately, reducing waiting time.
   - **Selective batching**: when combining batching with iteration-level scheduling, apply batching only to selected operations. This lets requests be handled flexibly across operations and avoids the batching problems that arise when different requests are processing different numbers of tokens.

4. **The ORCA system**:
   - ORCA is a distributed serving system implementing both techniques. It also adopts model-parallelism strategies (intra-layer and inter-layer) to support very large models.
   - Its design comprises a request pool, a scheduler, and an execution engine: the scheduler selects requests from the pool, and the execution engine runs the model iterations.

5. **Evaluation**:
   - Evaluated with GPT-3 models, ORCA significantly outperforms NVIDIA FasterTransformer in both latency and throughput; at the same level of latency, throughput improves by 36.9x.

6. **Conclusion**:
   - Through iteration-level scheduling and selective batching, ORCA provides an efficient, low-latency serving system for Transformer-based generative models, markedly improving throughput and responsiveness at large model scale.

The paper's main contribution is a new scheduling and batching mechanism that resolves the performance problems existing systems face when serving Transformer-based generative models.

Chapter 3 is the key part. (A toy sketch of iteration-level scheduling follows below.)
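To make iteration-level scheduling concrete, here is a toy Python simulation (not ORCA's code): the batch is re-formed after every decode iteration, so finished requests return immediately and queued requests join without waiting for the whole batch to drain.

```python
# Hedged sketch: a toy simulation of ORCA-style iteration-level scheduling.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: int
    max_new_tokens: int
    generated: list = field(default_factory=list)

def decode_one_token(req: Request) -> None:
    # Stand-in for one forward pass producing one token for this request.
    req.generated.append(f"tok{len(req.generated)}")

def serve(pool: deque, max_batch: int = 4) -> None:
    running: list[Request] = []
    while pool or running:
        # Iteration-level scheduling: top up the batch before every iteration.
        while pool and len(running) < max_batch:
            running.append(pool.popleft())
        for req in running:
            decode_one_token(req)
        # Finished requests return to the client immediately.
        done = [r for r in running if len(r.generated) >= r.max_new_tokens]
        for r in done:
            print(f"request {r.rid} finished after {len(r.generated)} tokens")
        running = [r for r in running if r not in done]

serve(deque(Request(i, max_new_tokens=2 + i) for i in range(6)))
```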