Important articles for personal reference, especially on computer science.
-
tensorflow lite micro
- TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems
- https://arxiv.org/abs/2010.08678
-
cnn-winograd
- Fast Algorithms for Convolutional Neural Networks
- https://arxiv.org/abs/1509.09308
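A minimal numpy sketch of the paper's 1-D Winograd minimal filtering algorithm F(2,3): two outputs of a 3-tap filter with 4 multiplies instead of 6. The transform matrices are the ones given in the paper.

```python
import numpy as np

# F(2,3) transforms from Lavin & Gray (arXiv:1509.09308)
BT = np.array([[1, 0, -1,  0],
               [0, 1,  1,  0],
               [0, -1, 1,  0],
               [0, 1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float64)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 outputs."""
    U = G @ g            # filter transform (precomputable, amortized)
    V = BT @ d           # data transform
    return AT @ (U * V)  # 4 elementwise multiplies, then output transform

d, g = np.random.rand(4), np.random.rand(3)
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)
```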
-
how to optimize gemm
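A minimal numpy sketch of the first optimization these tutorials usually teach: cache blocking (tiling), so each sub-block of A, B, and C stays cache-resident while it is reused. The tile sizes here are illustrative; real implementations add packing into contiguous buffers and a register-level SIMD microkernel.

```python
import numpy as np

def gemm_blocked(A, B, MC=64, KC=64, NC=64):
    """C = A @ B computed tile by tile; slicing handles ragged edges."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for jc in range(0, N, NC):          # column panels of B and C
        for pc in range(0, K, KC):      # K panels: reuse the packed B block
            for ic in range(0, M, MC):  # row panels of A and C
                C[ic:ic+MC, jc:jc+NC] += A[ic:ic+MC, pc:pc+KC] @ B[pc:pc+KC, jc:jc+NC]
    return C

A, B = np.random.rand(200, 300), np.random.rand(300, 150)
assert np.allclose(gemm_blocked(A, B), A @ B)
```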
-
separable convolutions: depthwise conv and pointwise conv
- A Basic Introduction to Separable Convolutions
- MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
- https://arxiv.org/abs/1704.04861
- MobileNet is built from separable convolutions, so the paper doubles as an introduction to them.
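A minimal numpy sketch of the factorization: a k x k depthwise conv (one spatial filter per input channel) followed by a 1x1 pointwise conv that mixes channels, costing roughly (1/Cout + 1/k^2) of a standard convolution.

```python
import numpy as np

def separable_conv(x, dw, pw):
    """x: (H, W, Cin); dw: (k, k, Cin) depthwise filters;
    pw: (Cin, Cout) 1x1 pointwise filters. Valid padding, stride 1."""
    H, W, Cin = x.shape
    k = dw.shape[0]
    Ho, Wo = H - k + 1, W - k + 1
    # depthwise: spatial filtering only, no cross-channel mixing
    y = np.empty((Ho, Wo, Cin))
    for i in range(Ho):
        for j in range(Wo):
            y[i, j] = (x[i:i+k, j:j+k, :] * dw).sum(axis=(0, 1))
    # pointwise: 1x1 conv mixes channels
    return y @ pw

x = np.random.rand(8, 8, 16)
dw = np.random.rand(3, 3, 16)
pw = np.random.rand(16, 32)
assert separable_conv(x, dw, pw).shape == (6, 6, 32)
# multiply ratio vs a standard k x k conv: 1/Cout + 1/k**2
# e.g. k=3, Cout=256 -> ~0.115, about an 8.7x reduction
```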
-
binary neural networks
- https://github.com/JDAI-CV/dabnn
- daBNN: A Super Fast Inference Framework for Binary Neural Networks on ARM devices
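The speed of frameworks like daBNN comes from replacing multiply-accumulate with XNOR plus popcount over bit-packed {-1, +1} values. A minimal pure-Python sketch of that identity; the function names here are mine, not daBNN's API.

```python
# Pack a {-1, +1} vector into an integer bit mask: +1 -> 1, -1 -> 0.
def pack(v):
    bits = 0
    for i, x in enumerate(v):
        if x == 1:
            bits |= 1 << i
    return bits

# dot(a, b) over n binary elements = n - 2 * popcount(a XOR b):
# matching bits contribute +1, differing bits contribute -1.
def bin_dot(a_bits, b_bits, n):
    return n - 2 * bin(a_bits ^ b_bits).count("1")

a = [1, -1, -1, 1, 1]
b = [1, 1, -1, -1, 1]
assert bin_dot(pack(a), pack(b), 5) == sum(x * y for x, y in zip(a, b))
```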
-
On the Measure of Intelligence
-
memory hierarchy and caches
- What Every Programmer Should Know About Memory
- Carnegie Mellon University: 11-memory-hierarchy.pdf, 12-cache-memories.pdf
-
quantization (the shared uniform affine scheme is sketched at the end of this list)
- 2021, A White Paper on Neural Network Quantization
- Quantizing deep convolutional networks for efficient inference: A whitepaper
- Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
- Outlier Channel Splitting
- Improving Neural Network Quantization without Retraining using Outlier Channel Splitting
- https://arxiv.org/abs/1901.09504 , three methods about clipping
- NVIDIA, Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation
- https://arxiv.org/pdf/2004.09602.pdf, shows some interesting empirical data
- DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients
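The whitepapers above all build on the same uniform affine (asymmetric) quantization; a minimal numpy sketch of the 8-bit case:

```python
import numpy as np

def affine_qparams(x_min, x_max, bits=8):
    """Scale and zero point mapping [x_min, x_max] onto [0, 2**bits - 1]."""
    qmax = 2**bits - 1
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)  # range must contain 0
    scale = max((x_max - x_min) / qmax, 1e-8)
    zero_point = int(round(-x_min / scale))          # real value 0 maps exactly
    return scale, zero_point

def quantize(x, scale, zp, bits=8):
    return np.clip(np.round(x / scale) + zp, 0, 2**bits - 1).astype(np.uint8)

def dequantize(q, scale, zp):
    return scale * (q.astype(np.float32) - zp)

x = np.random.randn(1000).astype(np.float32)
s, zp = affine_qparams(x.min(), x.max())
err = np.abs(dequantize(quantize(x, s, zp), s, zp) - x).max()
assert err <= 0.5 * s + 1e-6  # worst-case rounding error is half a step
```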
-
large model inference and deployment
- Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
- stable diffusion 3, Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
- BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
- deploy:
- source
- vllm: https://github.com/vllm-project/vllm
- lightllm: https://github.com/ModelTC/lightllm
- mlc-llm: https://github.com/mlc-ai/mlc-llm
- kv cache
- paged attention
- paged attention: used by both vllm and lightllm (lightllm's variant is called token attention); toy sketch after the kv cache items.
- Efficient Memory Management for Large Language Model Serving with PagedAttention
- https://blog.vllm.ai/2023/06/20/vllm.html
- qcom-qualla: NSPKVManager
- llama.cpp: ggml_cpy, just for memory copy and concat of the kv cache
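A toy sketch of the paged attention bookkeeping from the vLLM paper: the kv cache lives in fixed-size physical blocks, and a per-sequence block table maps logical token positions to blocks, so sequences grow without large contiguous allocations. Names here are illustrative, not vLLM's API.

```python
BLOCK = 16  # tokens per physical kv block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # free physical block ids
        self.tables = {}                     # seq_id -> [physical block ids]
        self.lens = {}                       # seq_id -> tokens written

    def append(self, seq_id, kv):
        """Store one token's kv entry; allocate a new block on demand."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lens.get(seq_id, 0)
        if n % BLOCK == 0:                 # current block full (or first token)
            table.append(self.free.pop())  # grab any free physical block
        block, offset = table[n // BLOCK], n % BLOCK
        # a real engine would write kv into GPU memory at (block, offset)
        self.lens[seq_id] = n + 1
        return block, offset

    def release(self, seq_id):
        """Sequence finished: its blocks return to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(20):
    cache.append("seq0", kv=None)
assert len(cache.tables["seq0"]) == 2  # 20 tokens -> two 16-token blocks
cache.release("seq0")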
- MQA, GQA: need to change the model structure; see the sketch after these links.
- https://arxiv.org/pdf/2305.13245 GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
- https://zhuanlan.zhihu.com/p/655325832
- https://zhuanlan.zhihu.com/p/659238103
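A minimal numpy sketch of the structural change GQA requires: Hq query heads share Hkv < Hq key heads, each kv head broadcast to a group of query heads, shrinking the kv cache by a factor of n_groups (MQA is the single-kv-head case). Shapes and names are illustrative.

```python
import numpy as np

def gqa_scores(q, k, n_groups):
    """q: (Hq, d) query heads; k: (Hkv, T, d) key heads, Hq = Hkv * n_groups.
    Returns per-head attention weights over the T cached positions."""
    Hq, d = q.shape
    assert Hq == k.shape[0] * n_groups
    k_expanded = np.repeat(k, n_groups, axis=0)             # (Hq, T, d)
    scores = np.einsum("hd,htd->ht", q, k_expanded) / np.sqrt(d)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))  # stable softmax
    return e / e.sum(axis=1, keepdims=True)

q = np.random.rand(8, 64)      # 8 query heads
k = np.random.rand(2, 10, 64)  # 2 kv heads -> groups of 4 query heads each
assert gqa_scores(q, k, n_groups=4).shape == (8, 10)
```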
- quantize
- aimet: a library that provides advanced quantization and compression techniques for trained neural network models
- qserve mit, QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
-
Latency Numbers Every Programmer Should Know
-
DEN0018A_neon_programmers_guide_en
-
硬件体系架构浅析 (A Brief Analysis of Hardware Architecture)
-
google, tpu, In-Datacenter Performance Analysis of a Tensor Processing Unit