Releases: modelscope/dash-infer
v2.1.0
What's Changed
- [JSON mode]: FormatEnforcer use cudaMallocHost for scores buffer by @WangNorthSea in #56
- [A16W8 & A8W8]: Further optimize the Ampere A16W8 fused GEMM kernel; fix LoRA doc by @wyajieha in #58
- [Multimodal]: Support LLM quantization with GPTQ and AXWY by @x574chen in #60
- [PKG]: Reduce package size by only compiling flash-attn src with hdim128 by @laiwenzh in #62
- [MOE]: add high performance moe kernel; fix a16w8 compile bug for sm<80 by @laiwenzh in #67
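The A16W8 items above refer to weight-only quantization: weights stored as int8 with a per-channel scale, activations kept in 16-bit float, and dequantization fused into the GEMM. A minimal numpy sketch of the idea (function names `quantize_weights_a16w8` and `a16w8_matmul` are illustrative, not DashInfer's API):

```python
import numpy as np

def quantize_weights_a16w8(w: np.ndarray):
    """Symmetric per-output-channel int8 quantization of a weight matrix."""
    # Choose a scale so the largest |w| in each column maps to 127.
    scale = np.abs(w).max(axis=0) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero columns
    w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_int8, scale

def a16w8_matmul(x: np.ndarray, w_int8: np.ndarray, scale: np.ndarray):
    """Dequantize-on-the-fly GEMM: y ≈ x @ w, with w reconstructed as w_int8 * scale."""
    return x @ (w_int8.astype(np.float32) * scale)
```

A fused kernel performs the `w_int8 * scale` reconstruction inside the GEMM inner loop rather than materializing the dequantized matrix, which is what saves memory bandwidth.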
Full Changelog: v2.0.0...v2.1.0
v2.0.0
What's Changed
- engine: stop and release the model when the engine is released, and remove a deprecated lock
- sampling: generate_op heavily modified, remove dependency on global tensors
- prefix cache: some bug fixes, improve eviction performance
- json mode: update lmfe-cpp patch, add process_logits, sampling with top_k top_p
- span-attention: move span_attn decoderReshape to init
- lora: add docs, fix typo
- ubuntu: add ubuntu dockerfile, fix install dir err
- bugfix: fix multi-batch rep_penalty bug
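The json-mode item above mentions sampling with top_k and top_p. A sketch of the standard logits filtering that these parameters imply (the function `filter_top_k_top_p` is illustrative, not DashInfer's implementation):

```python
import numpy as np

def filter_top_k_top_p(logits: np.ndarray, top_k: int = 0, top_p: float = 1.0):
    """Mask logits outside the top-k set and the top-p (nucleus) set with -inf."""
    logits = logits.copy()
    if top_k > 0:
        # Keep only the k largest logits.
        kth = np.sort(logits)[-top_k]
        logits[logits < kth] = -np.inf
    if top_p < 1.0:
        order = np.argsort(logits)[::-1]
        probs = np.exp(logits[order] - logits[order][0])
        probs /= probs.sum()
        cum = np.cumsum(probs)
        # Keep the smallest prefix whose cumulative mass reaches top_p
        # (the best token is always kept).
        cutoff = np.searchsorted(cum, top_p) + 1
        logits[order[cutoff:]] = -np.inf
    return logits
```

Sampling then draws from the softmax over the surviving logits; in json mode the format enforcer additionally masks tokens that would violate the grammar before this step.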
Full Changelog: v1.3.0...v2.0.0
v2.0.0-rc3
Some bugfixes:
- fix uuid crash issue
- update LoRA implementation
- set page size by param
- delete deprecated files
v2.0.0-rc2
release script: reduce python wheel size (#46)
v1.3.0
Highlight
- Support Baichuan-7B and Baichuan2-7B & 13B by @WangNorthSea in #38
Full Changelog: v1.2.1...v1.3.0
v1.2.1
v1.2.0
Expand context length to 32K and support flash attention on the Intel AVX-512 platform
- remove currently unsupported cache mode
- examples: update qwen prompt template, add print func to examples
- support glm-4-9b-chat
- change to size_t to avoid overflow when the sequence is long
- update README since we support 32k context length
- Add flash attention on intel-avx512 platform
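The core idea behind flash attention, on CPU as on GPU, is processing K/V in blocks with an online softmax so the full attention matrix is never materialized. A minimal numpy sketch of that tiled recurrence (`streaming_attention` is illustrative; the project's AVX-512 kernel is a vectorized C++ implementation, not this code):

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference: full softmax(QK^T / sqrt(d)) V."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def streaming_attention(q, k, v, block=4):
    """Process K/V block by block, keeping a running max and sum (online softmax)."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full((n, 1), -np.inf)      # running row maximum
    l = np.zeros((n, 1))              # running softmax denominator
    acc = np.zeros((n, v.shape[-1]))  # unnormalized output accumulator
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T * scale
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        corr = np.exp(m - m_new)      # rescale previous partial results
        p = np.exp(s - m_new)
        l = l * corr + p.sum(axis=-1, keepdims=True)
        acc = acc * corr + p @ vb
        m = m_new
    return acc / l
```

Because only one block of scores lives in registers/cache at a time, the working set stays small regardless of sequence length, which is what makes 32K contexts practical.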