[Paper Reading] [KR only] Are Transformers universal approximators of sequence-to-sequence functions?


Main References

  • Chulhee Yun et al., "Are Transformers universal approximators of sequence-to-sequence functions?", ICLR 2020.

Abstract

  • The paper shows that the Transformer encoder is a universal approximator of continuous, permutation equivariant sequence-to-sequence functions with compact support.
  • It also shows that a Transformer encoder combined with learnable positional encodings can universally approximate arbitrary continuous sequence-to-sequence functions with compact domain, i.e. without the permutation equivariance restriction.
  • It formally defines the notion of a contextual mapping and shows that the multi-head self-attention layers of a Transformer encoder can compute contextual mappings of the input sequence.
  • (Experiments are also reported in the paper, but omitted here.)

Keywords & Definitions

1. Sequence-to-sequence Function

A sequence-to-sequence function is a function from $\mathbb{R}^{d\times n}$ to $\mathbb{R}^{d\times n}$; more precisely, both its domain and its codomain are subsets of $\mathbb{R}^{d\times n}$. ($\mathbb{R}^{d\times n}$: the set of all $d\times n$ real matrices.)

Here $d$ and $n$ correspond to the embedding dimension and the input sequence length of the original Transformer paper, which uses essentially the same notation ($d_{\text{model}} = d$). One difference: the Transformer paper works with $n\times d$ matrices, while this paper uses the transpose ($d\times n$ matrices), so each column of the matrix corresponds to one input word embedding (token). Throughout, the paper still calls the $d\times n$ matrix $X$ the input sequence.

  • Continuity of a sequence-to-sequence function

    Since a sequence-to-sequence function maps a matrix to a matrix, its continuity has to be defined with some care. The paper appears to equip $\mathbb{R}^{d\times n}$ with the entry-wise $\ell^p$ norm ($\Vert\cdot\Vert_p$) and the corresponding norm topology, and to define continuity with respect to it, with $1\le p<\infty$.

  • Distance between functions (function metric)

    To express how close two functions are, a distance between sequence-to-sequence functions is defined. Writing out the metric $d_p$ on this function space:

    \[d_p(f_1, f_2) := N_p(f_1 - f_2), \qquad N_p(f) := \left( \int \Vert f(\boldsymbol{X}) \Vert_p^p \, d\boldsymbol{X} \right)^{1/p}\]

    (Written here using the usual $L^p$ function norm, slightly differently from the notation in the paper.)

    • Note: since the paper always assumes a compact domain or compact support, there seems to be no need to worry about $N_p(f)$ diverging to infinity. A small numerical sketch of this metric follows below.
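
To make the metric concrete, here is a rough Monte-Carlo sketch of $d_p$ under the assumptions above: entry-wise $\ell^p$ norm on $\mathbb{R}^{d\times n}$ and integration over the compact domain $[0,1]^{d\times n}$. The estimator and the toy functions `f1`, `f2` below are mine, purely for illustration.

```python
import numpy as np

def entrywise_lp(A, p=2):
    """Entry-wise l^p norm of a matrix."""
    return (np.abs(A) ** p).sum() ** (1.0 / p)

def d_p(f1, f2, d=2, n=3, p=2, n_samples=2000, seed=0):
    """Monte-Carlo estimate of (integral over [0,1]^{d x n} of ||f1(X)-f2(X)||_p^p)^(1/p)."""
    rng = np.random.default_rng(seed)
    vals = [entrywise_lp(f1(X) - f2(X), p) ** p
            for X in rng.random((n_samples, d, n))]
    return np.mean(vals) ** (1.0 / p)

f1 = lambda X: X            # identity map
f2 = lambda X: X + 0.1      # identity shifted by 0.1 in every entry
print(d_p(f1, f2))          # ~0.1 * (d*n)**(1/2) = 0.245 for p = 2
```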

2. Permutation Equivariant

  • What is a permutation matrix?

    A permutation matrix is a square matrix with exactly one 1 in every row and every column. Multiplying a matrix $A\in \mathbb{R}^{m\times n}$ by a permutation matrix $P$ shuffles the rows or the columns of $A$. More precisely, (1) if $P\in \mathbb{R}^{n\times n}$, then $AP$ is $A$ with its columns reordered, and (2) if $P\in \mathbb{R}^{m\times m}$, then $PA$ is $A$ with its rows reordered. For example:

    \[\begin{pmatrix} 1&2&3 \\ 4&5&6 \\ 7&8&9\end{pmatrix}\begin{pmatrix} 0&1&0 \\ 0&0&1 \\ 1&0&0\end{pmatrix} = \begin{pmatrix} 3&1&2 \\ 6&4&5 \\ 9&7&8\end{pmatrix}\]

    Note that a permutation matrix is always orthogonal: $P^TP=PP^T=I$. (Think about how $P$ reorders the rows/columns.)

์ž„์˜์˜ $X\in \mathbb{R}^{m\times n}$์™€ ์ž„์˜์˜ permutation matrix $P\in \mathbb{R}^{n\times n}$์— ๋Œ€ํ•ด์„œ, Sequence-to-sequence function์ธ $f$๊ฐ€ $f(XP)=f(X)P$๋ฅผ ๋งŒ์กฑํ•˜๋ฉด ์ด๋Ÿฌํ•œ ํ•จ์ˆ˜๊ฐ€ permutation equivariantํ•˜๋‹ค๊ณ  ๋งํ•œ๋‹ค.

In other words, it is a function for which permuting the order of the sequence before applying the function gives the same result as permuting it afterwards.

For reference, the paper proves that each Transformer (encoder) block is a permutation equivariant sequence-to-sequence function (Claim 1). A small numerical check of this property is sketched below.
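
The claim is easy to check numerically. Below is a minimal sketch (my own, not the paper's) using a parameter-free stand-in for a single self-attention layer, namely the residual plus column-wise-softmax attention, which is enough to see $f(XP)=f(X)P$ hold up to floating-point error.

```python
import numpy as np

def softmax_cols(A):
    """Column-wise softmax."""
    A = A - A.max(axis=0, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)

def toy_attention(X):
    """X + X * softmax(X^T X): a parameter-free stand-in for Attn(X)."""
    return X + X @ softmax_cols(X.T @ X)

d, n = 4, 5
rng = np.random.default_rng(0)
X = rng.standard_normal((d, n))        # columns are tokens
P = np.eye(n)[:, rng.permutation(n)]   # random n x n permutation matrix

# Permuting the tokens before or after the layer gives the same result.
print(np.allclose(toy_attention(X @ P), toy_attention(X) @ P))  # True
```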

3. Universal Approximation

A theorem that can fairly be called the starting point of deep learning theory, the 'universal approximation theorem', describes the expressive power of neural networks. Its content can be summarized as follows.

Even a neural network with a single hidden layer can approximate any continuous function (with compact support) to arbitrarily small error. (Caveats: there is no bound on the width of the network, and the activation function must not be a polynomial.)

In this sense, a universal approximator is a model that can approximate a very large class of functions to arbitrary accuracy. Many follow-up studies on universal approximation have since appeared; they are well surveyed in Section 1.2 (related works & notation) of the paper.
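
As a small illustration of the one-hidden-layer statement above, the sketch below constructs (rather than trains) a single-hidden-layer ReLU network that equals the piecewise-linear interpolant of a target function on $[0,1]$; the sup-norm error shrinks as the width grows. This is a standard textbook device, not something taken from this paper.

```python
import numpy as np

def relu_net(target, n_hidden, x):
    """One-hidden-layer ReLU net equal to the piecewise-linear interpolant of `target`."""
    knots = np.linspace(0.0, 1.0, n_hidden + 1)
    y = target(knots)
    slopes = np.diff(y) / np.diff(knots)      # slope on each linear piece
    coeffs = np.diff(slopes, prepend=0.0)     # slope change contributed by each ReLU
    # network output: y(0) + sum_k coeffs[k] * ReLU(x - knots[k])
    return y[0] + np.maximum(x[:, None] - knots[:-1], 0.0) @ coeffs

target = lambda x: np.sin(2 * np.pi * x)
x = np.linspace(0.0, 1.0, 1001)
for width in (4, 16, 64):
    err = np.max(np.abs(relu_net(target, width, x) - target(x)))
    print(width, err)                         # error decreases as the width grows
```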

4. Contextual Mapping

According to the paper, the strong performance of Transformers is commonly attributed to their ability to compute 'contextual mappings' well, that is, to their excellent ability to distinguish different contexts from one another.

The paper sets out to prove the universal approximation abilities of the Transformer described above, and a crucial intermediate step is showing that '(multi-head) self-attention layers compute contextual mappings well'. To this end, the paper gives a fully formal definition of a contextual mapping and uses it in the proof. The definition given in the paper is as follows.

(Definition 3.1 in the paper, paraphrased.) Let $\mathbb{L} \subset \mathbb{R}^{d\times n}$ be a finite set. A map $q: \mathbb{L} \to \mathbb{R}^{1\times n}$ defines a contextual mapping if it satisfies:

  1. For any $\boldsymbol{L} \in \mathbb{L}$, the $n$ entries of $q(\boldsymbol{L})$ are all distinct.
  2. For any $\boldsymbol{L}, \boldsymbol{L}' \in \mathbb{L}$ with $\boldsymbol{L} \ne \boldsymbol{L}'$, all entries of $q(\boldsymbol{L})$ and $q(\boldsymbol{L}')$ are distinct.

That is, a contextual mapping is a function that takes a length-$n$ input sequence and outputs $n$ values (an $n$-dimensional row vector). Since the words within one sentence (sequence) play different roles, each of them is assigned a different context value (an entry of the contextual mapping), which is condition 1. Moreover, reflecting the fact that the same word is interpreted differently in different sentences, all $2n$ entries of the contextual mappings of two distinct input sequences ($L$, $L'$) must be pairwise distinct, which is condition 2. (A small programmatic check of these two conditions is sketched after the next bullet.)

  • Why (I think) the set $\mathbb{L}$ is taken to be finite

    The vocabulary is finite and the sequence length is finite, so the number of possible input sequences is finite. If $\mathbb{L}$ is meant to correspond to such a set of sequences, assuming it is finite seems harmless. (Whether this condition is essential would require a closer look at the proof.)
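
Here is a small checker (my own; the names `q` and `seqs` are hypothetical) for the two conditions of the definition over a finite set of sequences. The toy example also shows why a purely token-wise map fails condition 2, while adding information about the whole sequence fixes it.

```python
import numpy as np
from itertools import combinations

def all_distinct(values, tol=1e-9):
    v = np.sort(np.asarray(values, dtype=float))
    return bool(np.all(np.diff(v) > tol))

def is_contextual_mapping(q, seqs, tol=1e-9):
    outs = [np.asarray(q(L), dtype=float).ravel() for L in seqs]
    # Condition 1: for every L, the n entries of q(L) are pairwise distinct.
    if not all(all_distinct(o, tol) for o in outs):
        return False
    # Condition 2: for L != L', all 2n entries of q(L) and q(L') are distinct.
    return all(all_distinct(np.concatenate([a, b]), tol)
               for a, b in combinations(outs, 2))

seqs = [np.array([[1.0, 2.0, 3.0]]),   # d = 1, n = 3
        np.array([[1.0, 5.0, 6.0]])]   # shares the token "1" with the first

print(is_contextual_mapping(lambda L: L[0], seqs))                 # False: token-wise only
print(is_contextual_mapping(lambda L: L[0] + 10 * L.sum(), seqs))  # True: uses the context
```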

Main Text

1. Why showing the universal approximation property is hard

  • Parameter sharing that looks excessive. In both the self-attention layer and the feed-forward layer, a very large number of parameters are shared across tokens.
  • Token-wise interaction that looks too limited. By construction, the self-attention layer captures interactions between tokens only through pairwise dot-products.

(The second reason seems plausible to me; I have not yet fully understood the first.)

The paper views the two reasons above as limiting the class of sequence-to-sequence functions the Transformer encoder by itself can represent, and resolves this with trainable positional encodings.

โ“ ์ผ๋ฐ˜์ ์œผ๋กœ, Parameter sharing์ด ๋งŽ์„์ˆ˜๋ก universal approximator๊ฐ€ ๋˜๊ธฐ ์–ด๋ ค์šด ์ด์œ ๋Š” ๋ฌด์—‡์ผ๊นŒ?

2. The Transformer as used in the paper

Below are the equations for the Transformer block used in the paper.

\[\begin{aligned}
\mathrm{Attn}(\boldsymbol{X}) &= \boldsymbol{X} + \sum_{i=1}^{h} \boldsymbol{W}_O^i \,(\boldsymbol{W}_V^i \boldsymbol{X})\, \sigma\!\left[ (\boldsymbol{W}_K^i \boldsymbol{X})^T (\boldsymbol{W}_Q^i \boldsymbol{X}) \right], \\
\mathrm{FF}(\boldsymbol{X}) &= \mathrm{Attn}(\boldsymbol{X}) + \boldsymbol{W}_2 \cdot \mathrm{ReLU}\!\left( \boldsymbol{W}_1 \cdot \mathrm{Attn}(\boldsymbol{X}) + \boldsymbol{b}_1 \boldsymbol{1}_n^T \right) + \boldsymbol{b}_2 \boldsymbol{1}_n^T,
\end{aligned}\]

with $\boldsymbol{W}_O^i \in \mathbb{R}^{d\times m}$, $\boldsymbol{W}_V^i, \boldsymbol{W}_K^i, \boldsymbol{W}_Q^i \in \mathbb{R}^{m\times d}$, $\boldsymbol{W}_1 \in \mathbb{R}^{r\times d}$, $\boldsymbol{b}_1 \in \mathbb{R}^{r}$, $\boldsymbol{W}_2 \in \mathbb{R}^{d\times r}$, $\boldsymbol{b}_2 \in \mathbb{R}^{d}$.

As is well known, a Transformer encoder block consists of two (sub-)layers: a multi-head self-attention layer ('Attn') and a token-wise feed-forward layer ('FF').
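
Below is a minimal numpy sketch of a block as written above: $h$ heads of size $m$, feed-forward hidden width $r$, column-wise softmax, residual connections, and no layer normalization (see 2.2 below). The parameter shapes follow the paper; the random initialization and the function names are mine.

```python
import numpy as np

def softmax_cols(A):
    A = A - A.max(axis=0, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)

def transformer_block(X, heads, ff):
    # Attn(X) = X + sum_i W_O^i (W_V^i X) softmax[(W_K^i X)^T (W_Q^i X)]
    A = X.copy()
    for W_Q, W_K, W_V, W_O in heads:
        A = A + W_O @ (W_V @ X) @ softmax_cols((W_K @ X).T @ (W_Q @ X))
    # FF(X) = Attn(X) + W_2 ReLU(W_1 Attn(X) + b_1 1_n^T) + b_2 1_n^T
    W_1, b_1, W_2, b_2 = ff
    return A + W_2 @ np.maximum(W_1 @ A + b_1, 0.0) + b_2

d, n, h, m, r = 4, 6, 2, 1, 4
rng = np.random.default_rng(0)
heads = [tuple(rng.standard_normal(s) for s in [(m, d), (m, d), (m, d), (d, m)])
         for _ in range(h)]
ff = (rng.standard_normal((r, d)), rng.standard_normal((r, 1)),
      rng.standard_normal((d, r)), rng.standard_normal((d, 1)))

X = rng.standard_normal((d, n))
print(transformer_block(X, heads, ff).shape)   # (4, 6) = (d, n)
```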

2.1. Similarities with the original Transformer paper

  • As the equations show, the residual connections are kept as-is.

2.2. Differences from the original Transformer paper

  • Layer normalization is dropped, reportedly to keep the analysis simple.
  • The self-attention equation contains a summation sign ($\sum$) not found in the original paper. The original Transformer concatenates the attention heads, and that concatenation can be written equivalently in this summed form; the formula does not mean anything different.
  • The lowercase sigma ($\sigma(\cdot)$) in the self-attention layer denotes the (column-wise) softmax. Also, while the original paper uses scaled dot-product attention, plain dot-product attention appears to be used here. In practice nothing is lost: parameters such as $\boldsymbol{W}_K$ and $\boldsymbol{W}_Q$ can simply learn to absorb the scaling factor ($\frac{1}{\sqrt{d_k}}$).

โ“ Layer normalization์„ ๋นผ๋„ ๊ดœ์ฐฎ์€ ์ด์œ ๋Š” ๋ฌด์—‡์ผ๊นŒ?

2.3. Positional encoding

  • A pure Transformer block without trainable positional encodings can only approximate functions of the 'permutation equivariant' kind. By introducing positional encodings, however, it can approximate any sequence-to-sequence function (with compact domain) without this restriction on the function class.
  • The positional encoding $\boldsymbol{E}$ is likewise a $d\times n$ real matrix. Writing the Transformer block as a function $g$, the block with positional encoding applied to an input sequence $\boldsymbol{X}$ computes $g(\boldsymbol{X}+\boldsymbol{E})$.
  • Since the paper assumes $\boldsymbol{E}$ is trainable, it can be set arbitrarily. In fact, using the compactness of the domain to assume the input sequence satisfies $\boldsymbol{X}\in [0,1]^{d\times n}$, the paper simply fixes the positional encoding matrix as follows (see Appendix C). A tiny numerical check follows the list.
\[\boldsymbol{E} = \begin{pmatrix} 0&1&2&\cdots&n-1\\0&1&2&\cdots&n-1\\\vdots&\vdots&\vdots&&\vdots\\0&1&2&\cdots&n-1\end{pmatrix}\]
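
A tiny check of why such an $\boldsymbol{E}$ helps (my own illustration, not the paper's argument in Appendix C): with $\boldsymbol{X}\in[0,1]^{d\times n}$, adding $\boldsymbol{E}$ pushes different columns into disjoint value ranges, so identical tokens at different positions become distinguishable.

```python
import numpy as np

d, n = 3, 4
X = np.zeros((d, n))                             # worst case: all tokens identical
E = np.tile(np.arange(n, dtype=float), (d, 1))   # the matrix E shown above

Z = X + E
# Before adding E all columns coincide; afterwards column j sits in [j, j+1].
print(np.unique(X[0]).size, np.unique(Z[0]).size)   # 1 4
```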

3. Main results (two theorems)

The two key results claimed by the paper are the first two bullet points of the Abstract above. Here they are stated in more detail.

Theorem 2. (For any $\epsilon>0$ and $1\le p < \infty$.) Suppose a function $f$ satisfies the following conditions.

  1. $f$ is a sequence-to-sequence function.
  2. $f$ has compact support.
  3. $f$ is continuous (w.r.t. the entry-wise $\ell^p$ norm).
  4. $f$ is permutation equivariant.

Then there exists a Transformer network $g$ satisfying the following conditions.

  1. $g$ satisfies $(h,m,r)=(2,1,4)$.
  2. $d_p (f,g ) \le \epsilon$.
  • Note: a Transformer network is a stack of Transformer blocks of the same form. The symbols $h$, $m$, $r$ used above denote the following.
    • $h$: the number of attention heads
    • $m$: the size of each attention head
    • $r$: the hidden dimension of the feed-forward layer (= $d_{ff}$)

Theorem 3. (For any $\epsilon>0$ and $1\le p < \infty$.) Suppose a function $f$ satisfies the following conditions.

  1. $f$ is a sequence-to-sequence function.
  2. $f$ has a compact domain.
  3. $f$ is continuous (w.r.t. the entry-wise $\ell^p$ norm).

Then there exists a Transformer network $g$ with (trainable) positional encoding $\boldsymbol{E}$ satisfying the following conditions.

  1. $g$ satisfies $(h,m,r)=(2,1,4)$.
  2. $d_p (f,g ) \le \epsilon$.

Almost everything is the same as in Theorem 2, but the Transformer network now includes a positional encoding, and in exchange the permutation equivariance condition on the target sequence-to-sequence function is dropped.

  • Why $(h,m,r)=(2,1,4)$? (Isn't that too small a block?)

    A Transformer block with only 2 attention heads, each of size just 1, and a feed-forward hidden dimension of only 4 is never used in practice. But the reason for using such a block is not merely that this simplification makes the proof easier.

    ๋” ํฐ ๋ชจ๋ธ์€ ์ž๋ช…ํ•˜๊ฒŒ expressive power๊ฐ€ ๋” ํฌ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์‹ค์งˆ์ ์œผ๋กœ ์“ฐ์ด๋Š” transformer block์€ ํ›จ์”ฌ ๋” ๋งŽ์€ parameter๋ฅผ ์“ธ ํ…๋ฐ, ๊ทธ๋Ÿฐ model์€ ๋…ผ๋ฌธ์—์„œ ์“ฐ์ด๋Š” ๋งค์šฐ ์ž‘์€ transformer block์— ๋น„ํ•˜๋ฉด ๋‹น์—ฐํžˆ ๋”์šฑ๋” ๋งŽ์€ ํ•จ์ˆ˜๋“ค์„ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ์„ ๊ฒƒ์ด๋‹ค. ๊ทธ๋Ÿฌ๋‹ˆ ์ด๋ ‡๊ฒŒ ์ž‘์€ ์Šค์ผ€์ผ๋กœ ๋ฌธ์ œ๋ฅผ ์ถ•์†Œ์‹œ์ผœ์„œ ๋ฌธ์ œ๋ฅผ ํ’€์–ด๋„ ์ถฉ๋ถ„ํ•˜๋‹ค.

โ“ ์œ„์˜ ๋‘ ์ •๋ฆฌ๋Š” universal approximation์˜ ์ธก๋ฉด์—์„œ ๋งค์šฐ ์œ ์˜๋ฏธํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๊ณ  ์žˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋ชจ๋‘ ์กด์žฌ์„ฑ ์ •๋ฆฌ์ธ ํƒ“์—, ํ›ˆ๋ จ ๊ณผ์ •์—์„œ transformer๊ฐ€ โ€˜์šฐ๋ฆฌ๊ฐ€ ์›ํ•˜๋Š” ํ•จ์ˆ˜โ€™๋ฅผ ์‹ค์ œ๋กœ ์ž˜ ๊ทผ์‚ฌํ•  ์ˆ˜ ์žˆ๋Š”์ง€๋Š” ๋งํ•ด์ฃผ์ง€ ์•Š๋Š” ๊ฒŒ ๋ถ„๋ช…ํ•˜๋‹ค. ์ด๊ฒƒ์ด ๊ฐ€๋Šฅํ•œ์ง€๋Š” ์–ด๋–ป๊ฒŒ ์—ฐ๊ตฌํ•ด์•ผ ํ• ๊นŒ?/ ์–ด๋–ป๊ฒŒ ์—ฐ๊ตฌ๋˜๊ณ  ์žˆ์„๊นŒ?

4. How is it proved?

The proofs of Theorem 2 and Theorem 3 are very similar, and the main text summarizes the proof of Theorem 2. In three steps, an arbitrary continuous, permutation equivariant, sequence-to-sequence function $f$ with compact support is approximated by a suitable Transformer network. The roadmap is as follows.

4.1) Approximate $f$ by a piecewise constant function

Being a constant function here does not mean that $f$ suddenly becomes real-valued. The constant functions in question still map matrices to matrices; such a function is constant when its matrix value is fixed.
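
A one-line picture of this step, under my own simplification: replace $f$ by the piecewise constant function that returns $f$ evaluated at the corner of the $\delta$-grid cube containing the input. Continuity of $f$ on a compact set means the error vanishes as $\delta \to 0$.

```python
import numpy as np

def quantize(X, delta):
    """Snap every entry to the corner of its delta-grid cube."""
    return np.floor(X / delta) * delta

f = lambda X: np.sin(X)                      # stand-in for a continuous seq-to-seq map
X = np.random.default_rng(0).random((2, 3))  # a point in [0, 1]^{2 x 3}

for delta in (0.5, 0.1, 0.01):
    err = np.max(np.abs(f(X) - f(quantize(X, delta))))
    print(delta, err)                        # error shrinks with delta
```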

4.2) Approximate the piecewise constant function by a 'modified' Transformer network

A 'modified' Transformer replaces the (column-wise) softmax ($\sigma$) used in the standard Transformer by a column-wise hardmax ($\sigma_H$), and replaces the ReLU activation of FF by a different, special class of functions ($\phi \in \Phi$; definition below, followed by a small hardmax sketch).

  • Definition of $\Phi$:
    The set of all piece-wise linear functions with at most three pieces, where at least one piece is constant. (p.9)
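
For reference, here is how I read the column-wise hardmax $\sigma_H$: each column of the score matrix becomes an indicator of its maximal entries (one-hot when the maximum is unique; splitting the mass over ties is my own convention, not necessarily the paper's).

```python
import numpy as np

def hardmax_cols(A):
    one_hot = (A == A.max(axis=0, keepdims=True)).astype(float)
    return one_hot / one_hot.sum(axis=0, keepdims=True)

A = np.array([[1.0, 3.0],
              [2.0, 3.0]])
print(hardmax_cols(A))
# [[0.  0.5]
#  [1.  0.5]]
```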

To prove this part, the paper appears to rearrange the layer order of the modified Transformer. Thanks to the residual connections, one can exploit a composition of only self-attention layers, or only feed-forward layers, rather than applying self-attention and feed-forward layers strictly in alternation.

(…) we note that even though a Transformer network stacks self-attention and feed-forward layers in an alternate manner, the skip connections enable these networks to employ a composition of multiple self-attention or feed-forward layers. (…) self-attention and feed-forward layers play in realizing the ability to universally approximate sequence-to-sequence functions: 1) self-attention layers compute precise contextual maps; and 2) feed-forward layers then assign the results of these contextual maps to the desired output values. (p.6)

โ“ Modified Transformer network์˜ layer ์ˆœ์„œ๋ฅผ ๋’ค๋ฐ”๊พธ์–ด ๊ฐ™์€ ์ข…๋ฅ˜์˜ layer๋งŒ ์ด์–ด๋ถ™์ผ ์ˆ˜ ์žˆ๋Š” ์ด์œ ๊ฐ€ ๊ตฌ์ฒด์ ์œผ๋กœ ๋ฌด์—‡์ผ๊นŒ? ์—ฌ๊ธฐ์— skip connection์€ ์–ด๋–ค ์—ญํ• ์„ ํ• ๊นŒ?

4.3) Approximate the modified Transformer network by a (standard) Transformer network

This step can be seen as undoing the earlier substitutions, putting softmax and ReLU back in place of hardmax and $\phi$.

5. How many blocks need to be stacked?

As a byproduct, Theorem 2 shows how many Transformer blocks need to be stacked. According to the paper, approximating a permutation equivariant function well requires $O(n(1/\delta)^{dn}/n!)$ of the $(h,m,r)=(2,1,4)$ Transformer blocks. With positional encodings added, approximating the broader class of sequence-to-sequence functions well requires $O(n(1/\delta)^{dn})$ blocks.

Here $\delta$ is the side length of the (hyper-)cubes forming the grid that partitions the domain of the piecewise constant function used in steps 1-2 of the proofs of Theorems 2/3, and it must be taken sufficiently small. (According to the proof, $O(\delta^{d/p}) \le \epsilon/3$.)
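
Back-of-the-envelope arithmetic (mine, not from the paper) for the Theorem 2 count $O(n(1/\delta)^{dn}/n!)$, ignoring the hidden constant: even with the $n!$ saving from permutation equivariance, the count blows up very quickly in $d$ and $n$.

```python
from math import factorial

def blocks_thm2(d, n, delta):
    return n * (1.0 / delta) ** (d * n) / factorial(n)

for d, n in [(2, 4), (4, 8), (8, 16)]:
    print(d, n, f"{blocks_thm2(d, n, delta=0.1):.3g}")
# 2 4 1.67e+07
# 4 8 1.98e+28
# 8 16 7.65e+115
```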

โ“ ๋…ผ๋ฌธ์—์„œ๋Š” ์ฆ๋ช…์„ ์œ„ํ•ด ์•„์ฃผ ์ž‘์€ transformer block์„ ์ด์šฉํ•˜๊ณ  ์žˆ๋‹ค. ๋งŒ์•ฝ ์ด transformer block์˜ ํฌ๊ธฐ๋ฅผ ํ‚ค์šด๋‹ค๋ฉด ํ•„์š”ํ•œ block์˜ ์ˆ˜๋Š” ์ค„์–ด๋“ค๊นŒ? (์•„๋งˆ $d$์™€ $n$์— ๋”ฐ๋ฅธ complexity์—๋Š” ํฌ๊ฒŒ ์ฐจ์ด๊ฐ€ ์žˆ์ง€ ์•Š์„ ๊ฒƒ ๊ฐ™๋‹ค. $h$, $m$, $r$ ๋“ฑ์˜ ๊ฐ’์€ $d$๋‚˜ $n$์˜ ๊ฐ’๊ณผ๋Š” ๊ด€๋ จ์ด ์—†์œผ๋ฏ€๋กœ.)

My Comments & Questions

  • The paper uses a fair amount of linear algebra, but in substance it turned out to be a heavily analysis-flavored paper. It brought back memories of learning the Weierstrass Approximation Theorem (approximating a continuous function on a compact domain by polynomials to arbitrary accuracy) in my first real analysis course…
  • The questions raised above are the ones I could not fully resolve, or could not answer to my own complete satisfaction, while reading the paper. Collected here again:

โ“ ์ผ๋ฐ˜์ ์œผ๋กœ, Parameter sharing์ด ๋งŽ์„์ˆ˜๋ก universal approximator๊ฐ€ ๋˜๊ธฐ ์–ด๋ ค์šด ์ด์œ ๋Š” ๋ฌด์—‡์ผ๊นŒ?

โ“ Layer normalization์„ ๋นผ๋„ ๊ดœ์ฐฎ์€ ์ด์œ ๋Š” ๋ฌด์—‡์ผ๊นŒ?

โ“ (Paraphrased:) ํ›ˆ๋ จ ๊ณผ์ •์—์„œ transformer๊ฐ€ โ€˜์šฐ๋ฆฌ๊ฐ€ ์›ํ•˜๋Š” ํ•จ์ˆ˜โ€™๋ฅผ ์‹ค์ œ๋กœ ์ž˜ ๊ทผ์‚ฌํ•  ์ˆ˜ ์žˆ๋Š”์ง€๋Š” ์–ด๋–ป๊ฒŒ ์•Œ ์ˆ˜ ์žˆ์„๊นŒ?

โ“ Modified Transformer network์˜ layer ์ˆœ์„œ๋ฅผ ๋’ค๋ฐ”๊พธ์–ด ๊ฐ™์€ ์ข…๋ฅ˜์˜ layer๋งŒ ์ด์–ด๋ถ™์ผ ์ˆ˜ ์žˆ๋Š” ์ด์œ ๊ฐ€ ๊ตฌ์ฒด์ ์œผ๋กœ ๋ฌด์—‡์ผ๊นŒ? ์—ฌ๊ธฐ์— skip connection์€ ์–ด๋–ค ์—ญํ• ์„ ํ• ๊นŒ?

โ“ ๋…ผ๋ฌธ์—์„œ๋Š” ์ฆ๋ช…์„ ์œ„ํ•ด ์•„์ฃผ ์ž‘์€ transformer block์„ ์ด์šฉํ•˜๊ณ  ์žˆ๋‹ค. ๋งŒ์•ฝ ์ด transformer block์˜ ํฌ๊ธฐ๋ฅผ ํ‚ค์šด๋‹ค๋ฉด ํ•„์š”ํ•œ block์˜ ์ˆ˜๋Š” ์ค„์–ด๋“ค๊นŒ?