Add MixtralForCausalLM in Turbomind #4623
    self._n_experts = getattr(cfg, 'num_experts', 0)

    if self._n_experts > 0:
        self._moe_cfg = make_moe_config(
            cfg,
            experts_per_token=cfg.num_experts_per_tok)
        self._moe_cfg.expert_num = self._n_experts

    experts = ModuleListBuilder(ModuleListConfig(), self._ctx)
    for e in range(self.cfg.num_experts):
        experts[e] = self.ffn(pfx + 'experts' + e, is_expert=True)
    m.experts = experts.build()
    if self._n_experts:
        d.moe_ffn = self.moe(p + 'block_sparse_moe')
    else:
        d.feed_forward = self.ffn(p + 'block_sparse_moe')

    # ------------------------------------------------------------------
    # model() - walks full hierarchy
    # ------------------------------------------------------------------
    def __init__(self, cfg: MixtralConfig, *, resolver):
        super().__init__(cfg, resolver=resolver)

        self._attn_cfg = make_attention_config(cfg)
to suit make_attention_config rather than changing make_attention_config
    lm_pfx = (pfx + 'model.embed_tokens'
              if self.cfg.tie_word_embeddings
              else pfx + 'lm_head')
Mixtral has no tie_word_embeddings.
    self._ffn_cfg = make_ffn_config(cfg,
                                    act_type=_act_type_id('silu'))

    self._n_experts = getattr(cfg, 'num_experts', 0)
Overusing getattr(cfg, 'num_experts', 0) and similar patterns makes the code overly defensive and obscures the intent. If num_experts is a required field, just use cfg.num_experts and let the missing attribute fail explicitly. If a default value is truly appropriate, please double-check its correctness. Otherwise, reject this cargo-cult style of getattr usage.
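A minimal sketch of why the defaulted lookup is risky. A `types.SimpleNamespace` stands in for the real config object, and it assumes the expert count lives under `num_local_experts` (which is, as far as I know, the field name used by Hugging Face's `MixtralConfig`; please verify against the actual class). With that assumption, the `getattr` form silently reports zero experts, while plain attribute access fails loudly:

```python
from types import SimpleNamespace

# Stand-in for a Mixtral-style config. Assumption: the expert count is
# stored as `num_local_experts`, not `num_experts`.
cfg = SimpleNamespace(num_local_experts=8, num_experts_per_tok=2)

# Defensive style: a wrong field name is masked by the default value.
n_experts_defensive = getattr(cfg, 'num_experts', 0)

# Explicit style: a wrong field name raises AttributeError immediately.
try:
    n_experts_explicit = cfg.num_experts
except AttributeError as exc:
    n_experts_explicit = None
    error_message = str(exc)

# Explicit access with the field name assumed above.
n_experts = cfg.num_local_experts
```

In the defensive case the model would be built as a dense network without any error, which is much harder to diagnose than the `AttributeError`.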
    self._n_experts = getattr(cfg, 'num_experts', 0)

    if self._n_experts > 0:
We can assert self._n_experts > 0, so the `if self._n_experts > 0` check can be removed.
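Roughly what this suggestion could look like, as a sketch only. `MoeBuilderSketch` is a hypothetical stand-in for the builder in this PR, the MoE config is a plain dict instead of the result of `make_moe_config`, and the `num_experts` field name is taken from the diff as-is:

```python
from types import SimpleNamespace


class MoeBuilderSketch:
    """Hypothetical stand-in for the Mixtral builder under review."""

    def __init__(self, cfg):
        # Mixtral is always a mixture-of-experts model, so a non-positive
        # expert count indicates a broken config rather than a dense model.
        self._n_experts = cfg.num_experts
        assert self._n_experts > 0, 'Mixtral requires at least one expert'

        # No conditional needed: the MoE config is always built.
        self._moe_cfg = {
            'expert_num': self._n_experts,
            'experts_per_token': cfg.num_experts_per_tok,
        }


builder = MoeBuilderSketch(
    SimpleNamespace(num_experts=8, num_experts_per_tok=2))
```

The same reasoning would let the later `if self._n_experts:` / `else:` branch collapse to the MoE path only.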
    if self._n_experts:
        d.moe_ffn = self.moe(p + 'block_sparse_moe')
    else:
        d.feed_forward = self.ffn(p + 'block_sparse_moe')
Is there an "mlp" layer in Mixtral?
    hidden_dim = cfg.hidden_size
    head_num = cfg.num_attention_heads
    head_dim = head_dim if head_dim is not None else getattr(cfg, 'head_dim', hidden_dim // head_num)
    cfg_head_dim = getattr(cfg, 'head_dim', None)
This function should not be changed.
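One possible way to leave the shared helper untouched is to resolve `head_dim` on the caller side and pass it in. The sketch below is an assumption-heavy illustration: `make_attention_config` is a stand-in whose `head_dim` keyword is inferred from the hunk above, `resolve_head_dim` is a hypothetical helper, and the config is assumed to expose `head_dim` with a value that may be `None`:

```python
from types import SimpleNamespace


def make_attention_config(cfg, head_dim=None):
    """Stand-in for the shared helper, kept as in the original hunk.

    The signature is an assumption; the return value is simplified to a dict.
    """
    hidden_dim = cfg.hidden_size
    head_num = cfg.num_attention_heads
    head_dim = head_dim if head_dim is not None else getattr(
        cfg, 'head_dim', hidden_dim // head_num)
    return {'head_num': head_num, 'head_dim': head_dim}


def resolve_head_dim(cfg):
    """Hypothetical caller-side helper: derive head_dim when it is None."""
    if cfg.head_dim is None:
        return cfg.hidden_size // cfg.num_attention_heads
    return cfg.head_dim


# Stand-in config where head_dim is present but unset.
cfg = SimpleNamespace(hidden_size=4096, num_attention_heads=32, head_dim=None)

# The Mixtral-specific handling stays in the caller.
attn_cfg = make_attention_config(cfg, head_dim=resolve_head_dim(cfg))
```

This keeps the behavior of the helper identical for every other model that already calls it.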