0. Explainer Video

Please watch the explainer video first, then move on to the discussion questions, paper content, and details below.

1. Transformer

Contents

Attention Mechanism pdf file

Transformer pdf file

Discussion

  • When the Transformer feeds the whole output back into the decoder input one token at a time, does that count as recurrent?
    • The process is autoregressive, so it can be considered implicitly recurrent, but the Transformer architecture as a whole is still different from the recurrence in an RNN architecture
    •   The autoregressive generation process in the Transformer can be
        considered a form of implicit recurrence, although it doesn't 
        involve explicit recurrent connections like traditional recurrent 
        neural networks (RNNs).
      
        In the context of sequence generation, recurrence refers to the
        dependence on previously generated tokens to generate subsequent 
        tokens. In the Transformer, the autoregressive generation process
        achieves this by using the previously generated tokens as input 
        at each decoding step.
      
        During generation, the Transformer decoder operates sequentially,
        generating one token at a time. At each decoding step, the model
        attends to all the previously generated tokens through self-attention
        and generates the next token based on the attended information. 
        This process can be seen as an implicit recurrence because the 
        decoder leverages the information from earlier generated tokens 
        to make predictions for the current token.
      
        While there are no explicit recurrent connections or hidden states
        in the Transformer decoder, the autoregressive generation process
        allows the model to capture dependencies between tokens and generate
        coherent sequences. So, while it's not recurrent in the same sense
        as RNNs, it still exhibits a form of sequential dependence and
        can be considered a form of implicit recurrence.
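A minimal sketch of this implicit recurrence, assuming a hypothetical decoder-only `model` callable that maps a token prefix to next-token logits: every step re-feeds the whole prefix, and no hidden state is carried over the way an RNN would.

```python
import torch

@torch.no_grad()
def greedy_decode(model, bos_id, eos_id, max_len=50):
    """Autoregressive (implicitly recurrent) generation: each step re-feeds
    ALL previously generated tokens; no explicit RNN-style hidden state."""
    tokens = torch.tensor([[bos_id]])           # (batch=1, seq_len=1)
    for _ in range(max_len):
        logits = model(tokens)                  # (1, seq_len, vocab); assumed interface
        next_id = logits[:, -1].argmax(dim=-1)  # predict only from the last position
        tokens = torch.cat([tokens, next_id[:, None]], dim=1)
        if next_id.item() == eos_id:            # stop once the end-of-sequence token appears
            break
    return tokens
```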
      

Learning Resources

2. Vision Transformer

Contents

    • Hidden size $D$: the vector dimension
    • MLP size: the number of neurons in each MLP layer
    • ViT-L/16 means the “Large” variant with a $16 \times 16$ input patch size
      • Patch size is inversely related to sequence length, so larger patches mean fewer tokens and fewer computational resources (see the sketch after this list)
    • Needing a large amount of training data and being hard to train are two different things
      • ViT does need a large amount of data before the results are good
      • but given the same amount of data, ViT needs fewer computational resources than a CNN
    • Model naming used in the paper:

      | Name | Model | Pre-training dataset |
      |---|---|---|
      | Ours-JFT | ViT-H/14 | JFT-300M |
      | Ours-I21k | ViT-L/16 | ImageNet-21k |
      | BiT-L (Big Transfer) | ResNet | JFT-300M |
      | Noisy Student | EfficientNet | semi-supervised learning on ImageNet and JFT-300M with the labels removed |
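As a quick check of the patch-size versus sequence-length trade-off mentioned above, here is a small sketch (plain arithmetic, assuming the paper's 224 × 224 input resolution):

```python
def vit_sequence_length(image_size=224, patch_size=16):
    """Each non-overlapping patch_size x patch_size patch becomes one token."""
    n_patches = (image_size // patch_size) ** 2
    return n_patches + 1  # +1 for the [CLS] token

# Larger patches -> fewer tokens -> cheaper O(N^2) self-attention.
print(vit_sequence_length(224, 16))  # ViT-L/16: 14*14 + 1 = 197 tokens
print(vit_sequence_length(224, 32))  # /32 variants: 7*7 + 1 = 50 tokens
print(vit_sequence_length(224, 14))  # ViT-H/14: 16*16 + 1 = 257 tokens
```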

Discussion

  • Why patches are needed: otherwise the image resolution is too large (treating every pixel as a token would make the sequence far too long)
  • The paper states: “Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data…” The two terms are introduced below.
    1. Translation Equivariance: CNNs possess a property known as translation equivariance. This means that if an object in an image is shifted, the same feature will be detected regardless of its position in the image. CNNs achieve this through shared weights and local receptive fields. However, Transformers do not inherently have this property. They process the entire image in parallel using self-attention, and the attention mechanism is not constrained by local receptive fields. As a result, Transformers may not capture translation equivariance as effectively as CNNs.
    2. Locality (+bias): CNNs have an inherent bias towards local patterns in images due to their use of local receptive fields (the receptive field is the portion of the input that a particular neuron is “looking at” or “receptive to”) and weight sharing. This locality bias allows them to capture spatial hierarchies of features in images effectively. On the other hand, Transformers do not have a built-in bias towards local patterns. The self-attention mechanism in Transformers enables capturing global relationships between image patches but does not explicitly enforce locality. This can make it more challenging for Transformers to learn spatial hierarchies from limited amounts of training data.
  • When comparing against ResNet, the paper replaces Batch Normalization in the ResNet baselines with Group Normalization
  • Group Normalization: within each sample, the channels are further divided into groups, and each group is normalized in a way similar to layer normalization (GN normalizes all the channels within a group together and handles different groups separately, whereas layer normalization normalizes across all channels at once; a minimal sketch is given after this discussion list)
    • The main idea behind Group Normalization is to divide the channels of a feature map into groups and normalize each group separately across the spatial dimensions (H and W).
    • The feature map of size (N, C, H, W) is divided into G groups. Each group contains C/G channels.
    • Scale and shift the normalized values using learnable parameters (gamma and beta) for each group.
    • Advantages:
      • Reduced Dependency on Batch Statistics: GN computes group-level statistics within each sample rather than batch statistics, which makes it far less sensitive to the batch size than BN.
      • Performance on Small Datasets: better generalization capabilities.
      • Spatial Independence, Compatibility with Non-sequential Data: the group-level statistics are computed only over the grouped channels and spatial dimensions (H and W), making it more suitable for tasks where the spatial layout of the data is critical. This property can be advantageous in computer vision tasks, such as object detection or semantic segmentation, where objects can appear at different positions in the image.
    • Which normalization works best still has to be decided experimentally
  • What does “mean attention distance” mean in self-attention?
    • In a self-attention mechanism, each position in the sequence can attend to all other positions, including itself. The attention mechanism calculates attention weights that represent the importance of each position relative to the others.
    • These attention weights determine how much each position contributes to the representation of other positions in the output.
    • Within a single layer, it measures how far each position's attention reaches: the distances to the other positions, weighted by the corresponding attention weights
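A minimal sketch of the Group Normalization recipe described above, assuming PyTorch: reshape (N, C, H, W) into G groups of C/G channels, normalize each group per sample, then apply the learnable gamma/beta. The check against `torch.nn.GroupNorm` is only there to confirm the reshaping logic.

```python
import torch

def group_norm(x, num_groups, gamma, beta, eps=1e-5):
    """x: (N, C, H, W); gamma, beta: learnable per-channel scale/shift of shape (C,)."""
    N, C, H, W = x.shape
    g = x.reshape(N, num_groups, C // num_groups, H, W)        # split channels into G groups
    mean = g.mean(dim=(2, 3, 4), keepdim=True)                 # per-sample, per-group statistics
    var = g.var(dim=(2, 3, 4), keepdim=True, unbiased=False)   # no batch statistics involved
    g = (g - mean) / torch.sqrt(var + eps)
    x = g.reshape(N, C, H, W)
    return x * gamma.view(1, C, 1, 1) + beta.view(1, C, 1, 1)  # scale and shift

x = torch.randn(2, 8, 4, 4)
ref = torch.nn.GroupNorm(num_groups=4, num_channels=8)(x)      # default gamma=1, beta=0
out = group_norm(x, 4, torch.ones(8), torch.zeros(8))
print(torch.allclose(out, ref, atol=1e-5))                     # True; the batch size never enters the computation
```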

Learning Resources

3. Swin Transformer

  • 48, C, 2C, and so on are the size of one patch, i.e. the dimension of one token, which is the dimension of one vector ($48 = 4 \times 4 \times 3$ after the initial patch partition)
    • $C$ here essentially means the number of channels, though that may not be the most intuitive way to read it
  • $H/4 \times W/4$ is the number of tokens
  • W-MSA frames a few patches at a time into a window and runs self-attention inside that window; the window size has nothing to do with the shape changes above
    • The window size (how many patches share one attention computation) does not affect the output shape: however many tokens go in ($H/4 \times W/4$ of them), the same number come out

Patch Merging

  • The $W \times H \times C$ figure
    • each color is one patch
    • $C$ is the dimension of one token, i.e. of one patch, i.e. of one color block
    • so the number of color blocks is $W \times H$
  • The $W/2 \times H/2 \times 4C$ figure
    • the count becomes $W/2 \times H/2$
    • the four colors (red, yellow, blue, and green) are concatenated together, so the dimension of one token, one patch, one color block becomes four times larger
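A minimal sketch of this patch merging step, assuming PyTorch and a (B, H, W, C) layout (the official Swin code works on flattened (B, L, C) tokens, but the idea is the same): the 2 × 2 neighbours are concatenated into 4C channels, then a linear layer projects them down to 2C.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 group of neighbouring patches (C -> 4C),
    then project 4C -> 2C with a linear layer, halving H and W."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                        # x: (B, H, W, C), H and W even
        x0 = x[:, 0::2, 0::2, :]                 # top-left patch of every 2x2 block
        x1 = x[:, 1::2, 0::2, :]                 # bottom-left
        x2 = x[:, 0::2, 1::2, :]                 # top-right
        x3 = x[:, 1::2, 1::2, :]                 # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2C)

x = torch.randn(1, 56, 56, 96)                   # stage-1 tokens for a 224x224 image, C = 96
print(PatchMerging(96)(x).shape)                 # torch.Size([1, 28, 28, 192])
```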

W-MSA

Each window of patches is simply fed into self-attention separately.

  • The left half of the figure below
  • In layer $l$ there are four windows, so four separate attention computations
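A minimal sketch of the window partition behind W-MSA, assuming PyTorch: the feature map is cut into non-overlapping M × M windows, and self-attention then runs inside each window independently, so the number of tokens is unchanged and only the attention cost drops.

```python
import torch

def window_partition(x, window_size):
    """x: (B, H, W, C) -> (num_windows * B, M, M, C); self-attention is then
    applied inside each window independently."""
    B, H, W, C = x.shape
    M = window_size
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C)

x = torch.randn(1, 56, 56, 96)   # H/4 x W/4 tokens of a 224x224 image
w = window_partition(x, 7)       # Swin uses window size M = 7
print(w.shape)                   # torch.Size([64, 7, 7, 96]): 64 windows, 64 separate attentions
```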

SW-MSA

  • Example
      • During computation the shifted layout is not treated as 9 regions but as 4 windows (controlled with masked MSA), so the computational cost is the same as in layer $l$
        • region 4 forms one window
        • regions 3 and 5 form one window
        • regions 7 and 1 form one window
        • regions 0, 2, 6, and 8 form one window
    • Finally, the cyclic shift is reversed to restore the original layout
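A minimal sketch of the cyclic-shift trick, assuming PyTorch: rather than attending over 9 irregular regions, the feature map is rolled so the shifted windows line up with the regular 4-window partition, an attention mask (not shown) stops tokens from different regions from attending to each other, and the roll is undone afterwards.

```python
import torch

def shifted_window_step(x, window_size):
    """x: (B, H, W, C). Cyclic shift so SW-MSA can reuse the same regular
    window partition as W-MSA; masked MSA (omitted here) prevents tokens
    from different original regions from attending to each other."""
    shift = window_size // 2
    shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))   # 9 regions -> 4 regular windows
    # ... window partition + masked self-attention would run here ...
    return torch.roll(shifted, shifts=(shift, shift), dims=(1, 2))  # restore the original layout

x = torch.randn(1, 56, 56, 96)
print(torch.equal(shifted_window_step(x, 7), x))  # True: the shift is exactly reversed
```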

Relative position bias

    • This is the $B$ term
    • added to the attention matrix
    • its size is $M^2 \times M^2$, where $M$ is the window size
  • The bias is first indexed by relative position; if the row and column offsets were simply added together, different relative positions would collide into the same index, so each value needs adjusting
    • add $M-1$ to every entry to keep the indices non-negative (the $M$ in the example is …):
    • multiply the row offset by $2M-1$ to keep different relative positions distinct:
  • After summing the row and column indices, look up the learnable relative position bias table:
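A minimal sketch of that index bookkeeping, assuming PyTorch and following the steps above: offsets are shifted by $M-1$ to be non-negative, the row offset is multiplied by $2M-1$, and the summed indices select entries from a learnable $(2M-1)^2$ table to build $B$.

```python
import torch

def relative_position_index(M):
    """Return the (M*M, M*M) index map into the (2M-1)^2-entry bias table."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(M), torch.arange(M), indexing="ij"))  # (2, M, M)
    coords = coords.flatten(1)                             # (2, M*M)
    rel = coords[:, :, None] - coords[:, None, :]          # pairwise offsets in [-(M-1), M-1]
    rel = rel.permute(1, 2, 0)                             # (M*M, M*M, 2)
    rel[:, :, 0] += M - 1                                  # shift row offsets to be non-negative
    rel[:, :, 1] += M - 1                                  # shift column offsets to be non-negative
    rel[:, :, 0] *= 2 * M - 1                              # separate rows from columns
    return rel.sum(-1)                                     # unique index per relative position

M = 7
idx = relative_position_index(M)
bias_table = torch.zeros((2 * M - 1) ** 2, 1)              # learnable table (one column per head)
B = bias_table[idx.view(-1)].view(M * M, M * M, -1)        # the bias added to the attention matrix
print(idx.shape, B.shape)                                  # torch.Size([49, 49]) torch.Size([49, 49, 1])
```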

Detailed Structure

    • concat:
      • The first patch merging layer concatenates the features of each group of 2 × 2 neighboring patches, and applies a linear layer on the 4C-dimensional concatenated features.
      • The $4$ is how many values are concatenated together during patch merging
      • Before stage 1 there is a patch partition, which converts the image into patches of size $4 \times 4 = 16$ pixels, so 16 pixels are concatenated together
    • downsp. rate
      • this refers to how the downsampling changes
        • right after the patch partition the patch size is $4 \times 4$, so the downsampling rate becomes $4\times$ and the resolution becomes $\frac{1}{4}$ of the original, because the information is compressed into $\frac{1}{4}$ of the original extent
      • after each patch merging, the side length of the feature map is halved, the downsampling rate doubles, and the resolution is multiplied by another $\frac{1}{2}$
      • ViT fixes the patch size at $16 \times 16$ from the start, so its downsampling rate stays high and its resolution stays low throughout, which is unfavorable for more advanced dense tasks such as detection or segmentation
    • output size
      • the original image is $224 \times 224$ (ignoring the channel dimension, which starts at 3)
      • so the side length needed to represent the original image at each stage is $224 / \text{downsp. rate}$ (see the sketch after this list)
  • The letter suffixes stand for
    • T(Tiny)
    • S(Small)
    • B(Base)
    • L(Large)
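A small arithmetic check of the downsampling rates and output sizes, assuming the 224 × 224 input used in the paper:

```python
# Swin: 4x downsampling after the patch partition, then doubled by every patch merging.
image_side = 224
for stage, rate in enumerate([4, 8, 16, 32], start=1):
    side = image_side // rate
    print(f"stage {stage}: downsp. rate {rate}x -> output size {side} x {side}")
# stage 1: 56 x 56, stage 2: 28 x 28, stage 3: 14 x 14, stage 4: 7 x 7
# ViT keeps a single 16x rate throughout: 224 // 16 = 14 -> always 14 x 14 tokens
```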

Problems Encountered

  • The computational resources required by ViT are still too large
    • improvement: perform attention only within each window
    • patch merging
  • position encoding
    • Comparison of position encodings:

      | Model | Basis | Learnable | Where it is added |
      |---|---|---|---|
      | Transformer | sin/cos functions | not learnable | added to the input embeddings |
      | Vision Transformer | absolute position | learnable | added to the patch embeddings at the input |
      | Swin Transformer | relative position | learnable | added to the attention matrix |

Learning Resources