update-2024-07-17_22:36:39

liguodongiot · liguodongiot · commit f134283504cb · 2024-07-17T22:36:39.000+08:00
diff --git a/README.md b/README.md
@@ -377,7 +377,7 @@ LLM Data Engineering
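 - [MindIE 1.0.RC1 released: Huawei Ascend finally ships a complete LLM deployment solution, ending the era of making do with makeshift tools](https://www.zhihu.com/question/654472145/answer/3482521709)
 - [Adapting LLMs to domestic hardware, part 8: deploying Qwen-72B in practice with the Ascend MindIE inference toolkit (inference engine and inference serving)](https://juejin.cn/post/7365879319598727180)
   - Qwen-72B, Baichuan2-7B, ChatGLM3-6B
-- [Adapting LLMs to domestic hardware, part 9: performance benchmarking of the LLM inference framework MindIE-Service](https://juejin.cn/post/7370925923455713320)
+- [Adapting LLMs to domestic hardware, part 9: performance benchmarking of the LLM inference framework MindIE-Service](https://zhuanlan.zhihu.com/p/704649189)
 - [Adapting LLMs to domestic hardware, part 10: a step-by-step guide to quickly migrating LLMs to Ascend 910B (PyTorch version)](https://juejin.cn/post/7375351908896866323)
 - [Adapting LLMs to domestic hardware, part 11: LLM training performance benchmarks (Ascend 910B3)](https://juejin.cn/post/7380995631790964772)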
 
@@ -447,6 +447,7 @@ AI编译器是指将机器学习算法从开发阶段，通过变换和优化算
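 - [Why are so many people now experimenting with Tsinghua University's ChatGLM-6B as a base model?](https://www.zhihu.com/question/602504880/answer/3041965998)
 - [Why do many newly released large models default to BF16 instead of FP16?](https://www.zhihu.com/question/616600181/answer/3195333332)
 - [Can ZeRO-2 and ZeRO-3 be combined with pipeline parallelism when training large models?](https://www.zhihu.com/question/652836990/answer/3468210626)
+- [A detailed guide to Safetensors, the new model weight storage format](https://juejin.cn/post/7386360803039838235)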
 
 
 ## [LLM面试题](https://github.com/liguodongiot/llm-action/blob/main/llm-interview/README.md)
diff --git a/ai-framework/TensorRT-Model-Optimizer.md b/ai-framework/TensorRT-Model-Optimizer.md
@@ -2,13 +2,7 @@
 
 
 
+- Code: https://github.com/NVIDIA/TensorRT-Model-Optimizer
+- Docs: https://nvidia.github.io/TensorRT-Model-Optimizer/
 
-https://nvidia.github.io/TensorRT-Model-Optimizer/
-
-
-
-https://github.com/NVIDIA/TensorRT-Model-Optimizer
-
-
-
-
+- Best practices for choosing quantization methods: https://nvidia.github.io/TensorRT-Model-Optimizer/guides/_choosing_quant_methods.html
diff --git a/ai-framework/cuda/README.md b/ai-framework/cuda/README.md
@@ -0,0 +1,3 @@
+
+
+
diff --git a/ai-framework/huggingface-transformers/FSDP.md b/ai-framework/huggingface-transformers/FSDP.md
@@ -14,7 +14,7 @@ transformers
 
 
 accelerate
-- https://zhuanlan.zhihu.com/p/671742753
+- Fine-tuning Llama 2 70B with PyTorch FSDP: https://zhuanlan.zhihu.com/p/671742753
 - https://huggingface.co/docs/transformers/v4.41.0/en/fsdp#fsdp-configuration
 - https://huggingface.co/docs/transformers/v4.41.0/en/main_classes/trainer#transformers.TrainingArguments
 - https://github.com/pacman100/LLM-Workshop/blob/main/chat_assistant/sft/training/configs/fsdp_config.yaml
diff --git a/ai-framework/lightllm/README.md b/ai-framework/lightllm/README.md
diff --git a/ai-framework/llama-cpp/README.md b/ai-framework/llama-cpp/README.md
@@ -5,7 +5,11 @@
 - https://github.com/ggerganov/llama.cpp
 
 
+GGUF quantization format
 
+- https://lightning.ai/cosmo3769/studios/post-training-quantization-to-gguf-format-and-evaluation
+- https://medium.com/@metechsolutions/llm-by-examples-use-gguf-quantization-3e2272b66343
+- ctransformers, llama.cpp
 
 
 
diff --git a/ai-framework/openai-triton/README.md b/ai-framework/openai-triton/README.md
@@ -4,3 +4,4 @@
 
 
 
+- OpenAI Triton introductory tutorial: https://zhuanlan.zhihu.com/p/684473453
diff --git a/docs/llm-inference/llm推理优化技术.md b/docs/llm-inference/llm推理优化技术.md
@@ -1,6 +1,8 @@
 
 
 
+
+- Mastering LLM Techniques: Inference Optimization: https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
 - 7 Ways To Speed Up Inference of Your Hosted LLMs: https://betterprogramming.pub/speed-up-llm-inference-83653aa24c47
 - How to Speed Up LLM Training with Distributed Systems?: https://www.appypie.com/blog/llm-training-with-distributed-systems
 - A detailed look at training and inference optimization techniques for large models: https://wjn1996.blog.csdn.net/article/details/130764843
diff --git a/llm-inference/lightllm/README.md b/llm-inference/lightllm/README.md
@@ -11,3 +11,11 @@ https://github.com/ModelTC/lightllm/blob/main/docs/LightLLM.md
 
 
 
+
+
+- https://github.com/ModelTC/lightllm/tree/main
+
+
+- LightLLM: a pure-Python, ultra-lightweight, high-performance LLM inference framework: https://mp.weixin.qq.com/s/-wMLMGAHkxeyDYkixqni9Q
+
+
diff --git a/llm-inference/lmdeploy/README.md b/llm-inference/lmdeploy/README.md
@@ -0,0 +1,12 @@
+
+
+
+- https://hub.docker.com/r/openmmlab/lmdeploy-builder/tags
+- https://hub.docker.com/r/openmmlab/lmdeploy/tags
+
+
+
+- https://github.com/InternLM/lmdeploy/blob/main/docs/zh_cn/serving/api_server.md
+
+- https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/cli/utils.py#L64
+
diff --git a/llm-inference/tensorrt-llm/README.md b/llm-inference/tensorrt-llm/README.md
@@ -6,7 +6,7 @@
 - https://github.com/NVIDIA/TensorRT-LLM
 - https://nvidia.github.io/TensorRT-LLM/index.html
 
-
+- Performance tuning best practices: https://nvidia.github.io/TensorRT-LLM/performance/perf-best-practices.htm
 
 
 ## FP8
diff --git a/llm-inference/triton/REAEME.md b/llm-inference/triton/REAEME.md
@@ -1,8 +1,14 @@
 
 
 
-https://github.com/triton-inference-server/backend
 
+- https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver
+
+
+
+
+- https://github.com/triton-inference-server/backend
+- https://github.com/triton-inference-server/server
 
 
 
diff --git a/paper/inference/orca.md b/paper/inference/orca.md
@@ -0,0 +1,38 @@
+
+
+
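+This paper presents ORCA, a distributed serving system designed specifically for Transformer-based generative models. Its main points are summarized below:
+
+1. **Background and motivation**:
+   - Large Transformer-based models such as GPT-3 perform well on generation tasks, but existing inference serving systems hit performance bottlenecks when serving them. These models produce output one token per iteration, so a single request runs the model many times, and existing systems handle such multi-iteration requests inefficiently.
+
+2. **Main problem**:
+   - Existing systems schedule work one batch of requests at a time. A request that finishes early inside a batch cannot be returned to the client until the whole batch completes, which increases latency. Newly arrived requests must also wait until the current batch has been fully processed, which increases queueing time.
+
+3. **Solution**:
+   - **Iteration-level scheduling**: a new scheduling mechanism that schedules execution at the granularity of a single iteration instead of a whole request. After each iteration, finished requests can be returned and newly arrived requests can be admitted right away, which reduces waiting time.
+   - **Selective batching**: when batching is combined with iteration-level scheduling, batching is applied only to a selected set of operations. Requests can then be handled flexibly on a per-operation basis, which avoids the batching problems that arise when different requests process different numbers of tokens.
+
+4. **The ORCA system**:
+   - ORCA is a distributed serving system that implements both techniques. It also uses model parallelism (intra-layer and inter-layer) to support large-scale models.
+   - ORCA's design consists of a request pool, a scheduler, and an execution engine. The scheduler selects requests from the request pool, and the execution engine runs the model iterations.
+
+5. **Evaluation**:
+   - Evaluated on GPT-3 models, ORCA significantly outperforms NVIDIA FasterTransformer in both latency and throughput. Specifically, at the same latency level, ORCA delivers a 36.9x throughput improvement.
+
+6. **Conclusion**:
+   - Through iteration-level scheduling and selective batching, ORCA provides an efficient, low-latency serving system for Transformer-based generative models. The approach works well for large-scale models and significantly improves serving throughput and response time.
+
+The paper's main contribution is a new scheduling and batching mechanism that addresses the performance problems of existing systems when serving Transformer-based generative models.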
+
+
+
+Chapter 3 - key points
+
+
+
+
+
+
+
+
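
Note on the ORCA summary in `paper/inference/orca.md`: iteration-level scheduling is easier to follow with a toy example. The sketch below is only an illustration under stated assumptions, not ORCA's implementation. `Request`, `fake_model_step`, `serve`, and the `arrivals` mapping are hypothetical names invented here; the "model" just appends a placeholder token; and selective batching and model parallelism are left out entirely.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    """Toy generation request; `max_new_tokens` stands in for an end-of-sequence check."""
    req_id: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

    @property
    def done(self) -> bool:
        return len(self.generated) >= self.max_new_tokens


def fake_model_step(batch):
    """Hypothetical stand-in for one model iteration: one new token per request."""
    for req in batch:
        req.generated.append(f"tok{len(req.generated)}")


def serve(arrivals, max_batch_size=2):
    """Iteration-level scheduling over a toy workload.

    `arrivals` maps an iteration index to the requests arriving just before it.
    Returns a dict mapping each request id to the iteration at which it finished.
    """
    pool = deque()   # request pool: waiting requests, FIFO
    running = []     # requests that take part in the next iteration
    finished_at = {}
    step = 0
    while pool or running or any(t >= step for t in arrivals):
        pool.extend(arrivals.get(step, []))
        # Scheduler: refill free batch slots before every single iteration.
        while pool and len(running) < max_batch_size:
            running.append(pool.popleft())
        # Execution engine: run exactly one iteration for the selected requests.
        if running:
            fake_model_step(running)
        # Finished requests leave right away instead of waiting for the batch.
        still_running = []
        for req in running:
            if req.done:
                finished_at[req.req_id] = step
            else:
                still_running.append(req)
        running = still_running
        step += 1
    return finished_at


if __name__ == "__main__":
    arrivals = {0: [Request("a", 1), Request("b", 4)], 1: [Request("c", 2)]}
    print(serve(arrivals, max_batch_size=2))  # {'a': 0, 'c': 2, 'b': 3}
```

In this toy run, request `a` is returned after iteration 0 and `c` takes the freed slot at iteration 1, finishing at iteration 2 while `b` is still generating. Under request-level batching with the same batch size, `c` would have to wait for `b` to finish at iteration 3 before it could start.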
