Releases: modelscope/dash-infer

v2.1.0

11 Feb 07:28
069c74e

What's Changed

  • [JSON mode]: FormatEnforcer now uses cudaMallocHost for the scores buffer by @WangNorthSea in #56 (see the pinned-memory sketch after this list)
  • [A16W8 & A8W8]: further optimize the Ampere A16W8 fused GEMM kernel; fix the LoRA doc by @wyajieha in #58
  • [Multimodal]: Support LLM quantization with GPTQ and AXWY by @x574chen in #60
  • [PKG]: Reduce package size by compiling only the flash-attn source for hdim128 by @laiwenzh in #62
  • [MOE]: add a high-performance MoE kernel; fix an A16W8 compile bug for SM < 80 by @laiwenzh in #67
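
The pinned-memory change in the JSON mode item above swaps pageable host memory for cudaMallocHost, which lets the device-to-host copy of the scores buffer run asynchronously. A minimal sketch of that pattern, assuming an illustrative vocab size (this is not DashInfer's actual FormatEnforcer code):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t vocab_size = 32000;  // illustrative; the real size comes from the model

    // Pinned (page-locked) host buffer: required for truly async
    // device-to-host copies, and avoids an extra staging copy.
    float* h_scores = nullptr;
    cudaError_t err = cudaMallocHost(&h_scores, vocab_size * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocHost: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float* d_scores = nullptr;
    cudaMalloc(&d_scores, vocab_size * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // The async copy can overlap with other stream work; it is only
    // asynchronous because h_scores is pinned.
    cudaMemcpyAsync(h_scores, d_scores, vocab_size * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    // ... CPU-side format enforcement would mask h_scores here ...

    cudaFree(d_scores);
    cudaFreeHost(h_scores);
    cudaStreamDestroy(stream);
    return 0;
}
```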

Full Changelog: v2.0.0...v2.1.0

v2.0.0

21 Jan 07:31
012eb1b

What's Changed

  • engine: stop and release the model when the engine is released; remove a deprecated lock
  • sampling: heavily refactor generate_op; remove the dependency on global tensors
  • prefix cache: assorted bug fixes; improve eviction performance
  • json mode: update the lmfe-cpp patch; add process_logits; sample with top_k/top_p (see the sampling sketch after this list)
  • span-attention: move span_attn decoderReshape to init
  • lora: add docs, fix typos
  • ubuntu: add an Ubuntu Dockerfile; fix an install-directory error
  • bugfix: fix a multi-batch repetition-penalty bug
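
For the top_k/top_p sampling mentioned in the json mode item, the standard filter keeps the k highest logits, then trims that set to the smallest prefix whose softmax mass reaches p. A self-contained sketch of that textbook algorithm (function and variable names are illustrative, not DashInfer's generate_op API):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <numeric>
#include <vector>

// Return the candidate token ids that survive top-k then top-p filtering.
std::vector<int> TopKTopPFilter(const std::vector<float>& logits,
                                int top_k, float top_p) {
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(),
              [&](int a, int b) { return logits[a] > logits[b]; });
    idx.resize(std::min<size_t>(top_k, idx.size()));  // top-k cut

    // Softmax over the survivors (subtract the max for numerical stability).
    float max_logit = logits[idx[0]], sum = 0.f;
    std::vector<float> probs(idx.size());
    for (size_t i = 0; i < idx.size(); ++i) {
        probs[i] = std::exp(logits[idx[i]] - max_logit);
        sum += probs[i];
    }

    // Top-p (nucleus) cut: smallest prefix with cumulative mass >= top_p.
    float cum = 0.f;
    size_t keep = idx.size();
    for (size_t i = 0; i < idx.size(); ++i) {
        cum += probs[i] / sum;
        if (cum >= top_p) { keep = i + 1; break; }
    }
    idx.resize(keep);
    return idx;
}

int main() {
    std::vector<float> logits = {2.0f, 1.0f, 0.5f, -1.0f};
    for (int id : TopKTopPFilter(logits, /*top_k=*/3, /*top_p=*/0.9f))
        printf("candidate token %d\n", id);
    return 0;
}
```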

Full Changelog: v1.3.0...v2.0.0

v2.0.0-rc3

20 Dec 13:10
Some bug fixes:

- fix a uuid crash issue
- update the LoRA implementation
- make the page size configurable via a parameter
- delete deprecated files

v2.0.0-rc2

17 Dec 12:29
release script: reduce the Python wheel size (#46)

v1.3.0

27 Aug 03:33

Full Changelog: v1.2.1...v1.3.0

v1.2.1

01 Jul 03:28
5ceddf9

What's Changed

  • Add llama.cpp benchmark steps
  • fix: fall back to MHA when AVX512F is unavailable (see the detection sketch after this list)
  • solve a security issue; helper: bug fixes and a CPU platform check
  • add a release-package workflow
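
The MHA fallback above is a runtime dispatch decision: check whether the CPU actually reports AVX512F before taking the vectorized path. A minimal sketch using GCC/Clang's __builtin_cpu_supports (the two attention functions are placeholders, not DashInfer symbols):

```cpp
#include <cstdio>

// Placeholder kernels standing in for the real implementations.
void AttentionAvx512()   { puts("AVX-512F attention path"); }
void AttentionFallback() { puts("portable MHA fallback path"); }

int main() {
    // __builtin_cpu_supports queries CPUID at runtime (x86, GCC/Clang).
    if (__builtin_cpu_supports("avx512f")) {
        AttentionAvx512();
    } else {
        AttentionFallback();
    }
    return 0;
}
```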

v1.2.0

24 Jun 05:32
3a0417b

Expand context length to 32K and support flash attention on the Intel AVX-512 platform

  • remove a currently unsupported cache mode
  • examples: update the Qwen prompt template; add a print function to the examples
  • support glm-4-9b-chat
  • change to size_t to avoid overflow with long sequences (see the overflow sketch after this list)
  • update the README now that 32K context length is supported
  • add flash attention on the Intel AVX-512 platform
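
The size_t item above guards against a classic 32-bit overflow that a 32K context makes easy to hit: intermediate products computed in int can exceed INT_MAX even when each factor looks small. A standalone illustration with hypothetical shapes (not DashInfer's actual buffer math):

```cpp
#include <cstddef>
#include <cstdio>

int main() {
    // 4 * 32768 * 4096 = 536,870,912 elements; times 4 bytes = 2^31,
    // which no longer fits in a signed 32-bit int.
    int batch = 4, seq_len = 32768, hidden = 4096;

    // All-int arithmetic: signed overflow (undefined behavior; in
    // practice it typically wraps to a negative value).
    int bad_bytes = batch * seq_len * hidden * (int)sizeof(float);

    // Promote to size_t before multiplying, as the fix does.
    size_t good_bytes = (size_t)batch * seq_len * hidden * sizeof(float);

    printf("int    bytes: %d\n", bad_bytes);
    printf("size_t bytes: %zu\n", good_bytes);  // 2147483648
    return 0;
}
```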

v1.1.0

29 May 08:32

Support Qwen2; change DashInfer model extensions

  • support Qwen2; add model_type Qwen_v20
  • change DashInfer model extensions (asgraph, asparam -> dimodel, ditensors)
  • python example: remove the xxx_quantize.json config file; use a command-line argument instead

v1.0.4

14 May 05:50

First official release.