cs231n Assignment 2: Q2 (Implementing Batch Normalization and Layer Normalization)

츄츄츄츄츄츄츄 2023. 2. 14. 20:50

Link to my solution: https://github.com/lionkingchuchu/cs231n.git

This assignment is about batch normalization. While working on it I relied on the batch normalization paper, https://arxiv.org/abs/1502.03167; click Download: PDF on that page to read it.

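The idea behind batch normalization starts from an observation about initialization: when the initial weights are well spread out (for example, drawn from a normal distribution), backpropagation updates each weight differently, and the weights move in varied directions that reduce the loss function.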

If every weight of every neuron were 0, all neurons would produce the same output and receive the same dw, so there would be no point in using multiple neurons. If the weights are too small or too large relative to the number of neurons, dw also becomes too small or too large, and useful updates become difficult. This is why weight initialization is based on a normal distribution with a suitable scale: the neurons then start from varied weights that are neither too large nor too small.
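The symmetry problem can be seen in a few lines of NumPy. This is only a sketch on made-up data: the tiny two-layer network, the squared-error loss, and the constant value 0.5 are my own choices for illustration (all-zero weights are the extreme case, where the gradients are not just equal but all zero).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))        # 4 samples, 3 input features (made up)
y = rng.standard_normal((4, 1))        # regression targets (made up)

W1 = np.full((3, 5), 0.5)              # every hidden unit starts with the same weights
W2 = np.full((5, 1), 0.5)

h = np.maximum(x @ W1, 0)              # hidden activations after ReLU
out = h @ W2
dout = out - y                         # gradient of 0.5 * squared error
dh = (dout @ W2.T) * (h > 0)           # backprop through ReLU
dW1 = x.T @ dh                         # one column of gradients per hidden unit

print(np.allclose(h, h[:, :1]))        # all hidden units output the same values
print(np.allclose(dW1, dW1[:, :1]))    # and all receive the same gradient
```

Both checks come out true, so the five hidden units would stay identical after every update and behave like a single neuron.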

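As the data passes through the layers, however, the outputs drift away from that distribution. A ReLU layer removes all negative values, and each time the data passes through another layer initialized this way, more of the outputs end up close to 0. Batch normalization addresses this problem. A batch normalization layer comes after an affine layer and normalizes each output feature over the mini-batch, so the next layer always receives values with zero mean and unit variance (before the learnable scale and shift are applied).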
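To make the shrinking concrete, here is a rough simulation, not part of the assignment code. The helper `layer_stds`, the layer width of 256, and the weight scale of 0.01 are assumptions chosen for illustration; the normalization step is batchnorm without gamma and beta.

```python
import numpy as np

def layer_stds(num_layers=10, dim=256, weight_scale=0.01, normalize=False, seed=0):
    """Std of the activations after each (affine -> [normalize] -> ReLU) layer."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((128, dim))          # a batch of 128 random inputs
    stds = []
    for _ in range(num_layers):
        W = weight_scale * rng.standard_normal((dim, dim))
        h = h @ W                                # affine layer (bias omitted)
        if normalize:
            # normalize each feature over the batch, as batchnorm does
            h = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + 1e-5)
        h = np.maximum(h, 0)                     # ReLU
        stds.append(h.std())
    return stds

plain = layer_stds(normalize=False)
normed = layer_stds(normalize=True)
print("without normalization:", ["%.1e" % s for s in plain])
print("with normalization:   ", ["%.1e" % s for s in normed])
```

Without the normalization step the activation std drops by roughly an order of magnitude per layer, while with it the std stays at the same level in every layer.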

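import numpy as np


def batchnorm_forward(x, gamma, beta, bn_param):
    # In train mode the running statistics are updated as exponential moving averages:
    #   running_mean = momentum * running_mean + (1 - momentum) * sample_mean
    #   running_var = momentum * running_var + (1 - momentum) * sample_var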

    mode = bn_param["mode"]
    eps = bn_param.get("eps", 1e-5)
    momentum = bn_param.get("momentum", 0.9)

    N, D = x.shape
    running_mean = bn_param.get("running_mean", np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get("running_var", np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == "train":
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        # per-feature statistics over the mini-batch
        sample_mean = np.mean(x, axis=0)
        sample_var = np.var(x, axis=0)

        # exponential moving averages, used later in test mode
        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var

        # normalize, then scale by gamma and shift by beta
        x_normal = (x - sample_mean) / np.sqrt(sample_var + eps)
        out = gamma * x_normal + beta

        cache = (x, sample_mean, sample_var, gamma)

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        
    elif mode == "test":
        
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        x_normal = (x - running_mean) / np.sqrt(running_var + eps)
        out = gamma * x_normal + beta

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    # Store the updated running means back into bn_param
    bn_param["running_mean"] = running_mean
    bn_param["running_var"] = running_var

    return out, cache

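First I implement the forward function of the batch normalization layer. In train mode it follows the logic above: compute the mean and variance of the inputs over the batch, subtract the mean from each value, and divide by the square root of the variance (plus eps). The batch normalization layer also has two learnable parameters, gamma and beta, which control the degree of normalization. gamma scales x_normal (multiplication) and so controls its spread, and beta shifts it (addition) and so controls its mean. They exist so that how far the values are spread and shifted can be learned in the direction that reduces the loss function, because we cannot know in advance whether always forcing mean 0 and variance 1 gives the best predictions.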
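A small check of what gamma and beta do, as a sketch on made-up data (the values 2.0 and -1.0 are arbitrary). It also shows that the layer can undo the normalization entirely if that turns out to be what reduces the loss.

```python
import numpy as np

rng = np.random.default_rng(0)
x = 3.0 * rng.standard_normal((200, 4)) + 7.0   # features with mean ~7 and std ~3
eps = 1e-5

mean, var = x.mean(axis=0), x.var(axis=0)
x_normal = (x - mean) / np.sqrt(var + eps)

# gamma sets the spread and beta sets the center of the output
gamma, beta = 2.0, -1.0
out = gamma * x_normal + beta
print(out.mean(axis=0), out.std(axis=0))        # close to beta and gamma

# choosing gamma = sqrt(var + eps) and beta = mean recovers the original input
restored = np.sqrt(var + eps) * x_normal + mean
print(np.allclose(restored, x))
```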

For test mode, the batch mean and variance are folded into running mean and running var during training. I cannot explain the reason for using them rigorously, but my understanding is this: at test time gamma, beta, and the weights have already been tuned to the training data, so normalizing with the running mean and running var accumulated over training should give better accuracy than normalizing with the mean and variance of the test inputs.
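One concrete case where the running statistics matter is a single test sample, which has no meaningful batch statistics of its own. The sketch below uses simulated mini-batches (mean 5, std 2, both made up) and the same update rule as `batchnorm_forward`.

```python
import numpy as np

rng = np.random.default_rng(0)
momentum, eps = 0.9, 1e-5
running_mean, running_var = np.zeros(3), np.zeros(3)

# simulate training: every mini-batch comes from a distribution with mean 5 and std 2
for _ in range(200):
    batch = 2.0 * rng.standard_normal((64, 3)) + 5.0
    running_mean = momentum * running_mean + (1 - momentum) * batch.mean(axis=0)
    running_var = momentum * running_var + (1 - momentum) * batch.var(axis=0)

print(running_mean, running_var)        # approach 5 and 4

# a test "batch" of one sample
x_test = np.array([[7.0, 5.0, 3.0]])
own_stats = (x_test - x_test.mean(axis=0)) / np.sqrt(x_test.var(axis=0) + eps)
running_stats = (x_test - running_mean) / np.sqrt(running_var + eps)
print(own_stats)                        # all zeros: the input is wiped out
print(running_stats)                    # roughly [1, 0, -1]
```

With its own statistics the sample is always mapped to zero, whereas the running statistics place it where it sits relative to the training distribution.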

def batchnorm_backward(dout, cache):

    dx, dgamma, dbeta = None, None, None

    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    eps = 1e-5  # must match the eps used in the forward pass (its default value)
    N = dout.shape[0]
    x, sample_mean, sample_var, gamma = cache
    x_normal = (x - sample_mean) / np.sqrt(sample_var + eps)

    # gradients of the scale and shift parameters
    dgamma = np.sum(dout * x_normal, axis=0)
    dbeta = np.sum(dout, axis=0)

    # chain rule through x_normal, the variance, and the mean
    dx_normal = dout * gamma
    dlvar = np.sum(dx_normal * (x - sample_mean) * -0.5 * (sample_var + eps) ** -1.5, axis=0)
    # the parentheses keep the second term in the same statement as the first
    dlmean = (np.sum(dx_normal * -1 / np.sqrt(sample_var + eps), axis=0)
              + dlvar * np.sum(-2 * (x - sample_mean), axis=0) / N)
    dx = dx_normal / np.sqrt(sample_var + eps) + dlvar * 2 * (x - sample_mean) / N + dlmean / N

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    return dx, dgamma, dbeta

Next is the backward pass. It was too complicated to work out by hand from scratch, so I implemented dx, dgamma, and dbeta by following the derivation in the paper step by step.
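A formula copied from a paper is easy to get subtly wrong, so a numerical gradient check is worth having. The sketch below is self-contained rather than the assignment's own checker: `bn_forward`, `bn_backward`, and `numerical_grad` are hypothetical helpers, and storing eps in the cache is my own variation on the code above.

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    """Train-mode batchnorm forward pass (running statistics omitted)."""
    mean, var = x.mean(axis=0), x.var(axis=0)
    x_normal = (x - mean) / np.sqrt(var + eps)
    return gamma * x_normal + beta, (x, mean, var, gamma, eps)

def bn_backward(dout, cache):
    """Backward pass following the chain-rule steps of the paper."""
    x, mean, var, gamma, eps = cache
    N = x.shape[0]
    x_normal = (x - mean) / np.sqrt(var + eps)
    dgamma = np.sum(dout * x_normal, axis=0)
    dbeta = np.sum(dout, axis=0)
    dx_normal = dout * gamma
    dvar = np.sum(dx_normal * (x - mean) * -0.5 * (var + eps) ** -1.5, axis=0)
    dmean = (np.sum(-dx_normal / np.sqrt(var + eps), axis=0)
             + dvar * np.sum(-2 * (x - mean), axis=0) / N)
    dx = dx_normal / np.sqrt(var + eps) + dvar * 2 * (x - mean) / N + dmean / N
    return dx, dgamma, dbeta

def numerical_grad(f, v, dout, h=1e-5):
    """Centered-difference gradient of sum(f(v) * dout) with respect to v."""
    grad = np.zeros_like(v)
    for idx in np.ndindex(v.shape):
        old = v[idx]
        v[idx] = old + h
        pos = f(v).copy()
        v[idx] = old - h
        neg = f(v).copy()
        v[idx] = old                             # restore the original value
        grad[idx] = np.sum((pos - neg) * dout) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = 5 * rng.standard_normal((4, 5)) + 12
gamma, beta = rng.standard_normal(5), rng.standard_normal(5)
dout = rng.standard_normal((4, 5))

_, cache = bn_forward(x, gamma, beta)
dx, dgamma, dbeta = bn_backward(dout, cache)
dx_num = numerical_grad(lambda v: bn_forward(v, gamma, beta)[0], x, dout)
dgamma_num = numerical_grad(lambda g: bn_forward(x, g, beta)[0], gamma, dout)
dbeta_num = numerical_grad(lambda b: bn_forward(x, gamma, b)[0], beta, dout)
print(np.max(np.abs(dx - dx_num)), np.max(np.abs(dgamma - dgamma_num)))
```

The printed differences should be tiny compared with the gradients themselves. The second term of dmean is mathematically zero, because the deviations from the mean sum to zero, but keeping it makes the code match the paper term by term.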

Backward-pass equations from the paper, derived with the chain rule
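For reference, these are the chain-rule equations from the batch normalization paper, written in the paper's notation: m is the batch size (N in the code), \mu_B and \sigma_B^2 are the batch mean and variance, and \hat{x}_i is x_normal. In the code, dout is \partial\ell/\partial y_i, dx_normal is \partial\ell/\partial\hat{x}_i, dlvar is \partial\ell/\partial\sigma_B^2, and dlmean is \partial\ell/\partial\mu_B.

```latex
\begin{aligned}
\frac{\partial \ell}{\partial \hat{x}_i} &= \frac{\partial \ell}{\partial y_i} \cdot \gamma \\
\frac{\partial \ell}{\partial \sigma_B^2} &= \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot (x_i - \mu_B) \cdot \left(-\tfrac{1}{2}\right) \left(\sigma_B^2 + \epsilon\right)^{-3/2} \\
\frac{\partial \ell}{\partial \mu_B} &= \left( \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma_B^2 + \epsilon}} \right) + \frac{\partial \ell}{\partial \sigma_B^2} \cdot \frac{\sum_{i=1}^{m} -2\,(x_i - \mu_B)}{m} \\
\frac{\partial \ell}{\partial x_i} &= \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial \ell}{\partial \sigma_B^2} \cdot \frac{2\,(x_i - \mu_B)}{m} + \frac{\partial \ell}{\partial \mu_B} \cdot \frac{1}{m} \\
\frac{\partial \ell}{\partial \gamma} &= \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i} \cdot \hat{x}_i \\
\frac{\partial \ell}{\partial \beta} &= \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}
\end{aligned}
```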

Next, using the FullyConnectedNets built in Q1, I compare a network whose repeated block is (Affine - batchnorm - ReLU) with one whose repeated block is (Affine - ReLU).

In all three cases, the network with batch normalization shows better results.

Next, I compare the baseline network and the batch norm network as the weight scale, which controls the initial weight values, changes.

Batchnorm is better in most cases, but at the best weight scale the result is reversed.

Inline Question 1:

Describe the results of this experiment. How does the weight initialization scale affect models with/without batch normalization differently, and why?

Answer:

The weight initialization scale affects both models, but it affects the model without batch normalization much more. With batch normalization, even when a layer's outputs are badly scaled, the batch normalization layer renormalizes them, so the model reaches decent accuracy even if the weight scale is not well tuned. The model without batch norm shows poor accuracy whenever the weight scale is not close to optimal.

This question asks for an interpretation of the plots above, which relate the weight scale to each network's results. The weight scale had the larger effect on the network without batchnorm. With batchnorm, normalized outputs are passed forward even when the weights are not ideally spread, so the network scores reasonably well without a tuned weight scale. Without batchnorm, the results are poor unless the weight scale is near its best value.
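The insensitivity to weight scale can be checked directly: scaling the weights scales the affine output, and the normalization divides that scale back out. This is a sketch with random data; `normalize` is a hypothetical helper standing in for batchnorm without gamma and beta.

```python
import numpy as np

def normalize(h, eps=1e-5):
    """Batchnorm without gamma/beta: zero mean and unit variance per feature."""
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 50))
W = rng.standard_normal((50, 20))

for scale in [0.1, 1.0, 10.0]:
    h = x @ (scale * W)                 # the affine output grows or shrinks with the scale
    print(scale, h.std(), normalize(h).std())

reference = normalize(x @ W)
```

The raw std changes by a factor of 100 across the three scales, while the normalized output is almost identical in every case. The match is not exact because of eps, which starts to matter once the weight scale makes the variance comparable to eps.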

Next are the results for networks with batch norm at different batch sizes, and for a network without batch norm.

Results with batch norm by batch size / result without batch norm

Inline Question 2:

Describe the results of this experiment. What does this imply about the relationship between batch normalization and batch size? Why is this relationship observed?

Answer:

When the batch size is small, the batch norm model shows even worse accuracy than the model without batch norm. During training, each iteration uses only one mini-batch of the data. If the batch is too small, it may not be representative of the whole dataset. Because batch normalization uses the mean and variance of that batch, small unrepresentative batches give noisy statistics, and the parameters are not updated in a way that generalizes.

This question asks why the results change with batch size as shown above. If the batch is too small, its samples may not represent the whole dataset. We train a model so that it can classify whatever data arrives, but if gamma, beta, and the weights are adjusted using the mean and variance of only a few samples, the parameter updates drift away from improving classification on the data in general.
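How noisy the batch mean is at each batch size can be estimated with a quick simulation. This is a sketch on a made-up one-feature dataset (mean 5, std 2); `batch_mean_spread` is a hypothetical helper, and 5, 10, and 50 are the batch sizes I assume for the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
data = 2.0 * rng.standard_normal(100_000) + 5.0   # one feature over a made-up dataset

def batch_mean_spread(batch_size, num_batches=2000):
    """Std of the mini-batch mean across many randomly drawn mini-batches."""
    idx = rng.integers(0, data.size, size=(num_batches, batch_size))
    return data[idx].mean(axis=1).std()

spreads = {n: batch_mean_spread(n) for n in [5, 10, 50]}
for n, s in spreads.items():
    print(n, round(float(s), 3))        # shrinks roughly like 2 / sqrt(n)
```

The smaller the batch, the further its mean tends to land from the dataset mean, so every training step normalizes with a different, noisier estimate.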

Next, layer normalization: it normalizes with the mean and variance computed over the features of each sample, rather than the mean and variance computed over the batch.
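In NumPy terms the difference is just the axis over which the statistics are taken. A minimal sketch on a toy 2x3 input (gamma and beta left out):

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 6.0, 8.0]])         # N=2 samples (rows), D=3 features (columns)
eps = 1e-5

# batch normalization: statistics per feature, taken over the batch (axis=0)
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# layer normalization: statistics per sample, taken over its features (axis=1)
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

print(bn.mean(axis=0))                  # every column has zero mean
print(ln.mean(axis=1))                  # every row has zero mean
```

Because each row of `ln` depends only on that row, layer normalization gives the same result for a sample no matter which other samples are in the batch.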

Inline Question 3:

Which of these data preprocessing steps is analogous to batch normalization, and which is analogous to layer normalization?

Scaling each image in the dataset, so that the RGB channels for each row of pixels within an image sums up to 1.
Scaling each image in the dataset, so that the RGB channels for all pixels within an image sums up to 1.
Subtracting the mean image of the dataset from each image in the dataset.
Setting all RGB values to either 0 or 1 depending on a given threshold.

Answer:

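3 is analogous to batch normalization, and 2 is analogous to layer normalization. In 3 the statistic (the mean image) is computed for each pixel across all images in the dataset, while in 2 it is computed within each image across all of its pixels. 1 also uses only values inside a single image, one row of pixels at a time, so it is closer to layer normalization than to batch normalization. 4 is a fixed threshold that uses no statistics of the data, so it does not really correspond to either.

The question asks which of the four preprocessing steps above correspond to batch norm and which to layer norm. Each feature is a pixel. Batch norm takes its statistics for each feature across the samples, as 3 does, and layer norm takes them across the features of one sample, as 2 does.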
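Written as array operations, the two clearest options differ only in the axes they reduce over. A sketch on random made-up images, with the array layout (image, height, width, channel) assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((10, 8, 8, 3))      # 10 made-up RGB images of 8x8 pixels

# option 3: subtract the mean image, a statistic taken over the dataset (axis=0)
centered = images - images.mean(axis=0)

# option 2: divide each image by its own total, a statistic taken within one image
scaled = images / images.sum(axis=(1, 2, 3), keepdims=True)

print(np.allclose(centered.mean(axis=0), 0))        # each pixel is centered across images
print(np.allclose(scaled.sum(axis=(1, 2, 3)), 1))   # each image sums to 1
```

Reducing over axis 0 mixes information between images, like batch normalization; reducing over the remaining axes keeps every image independent, like layer normalization.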

Next is the implementation of the layer norm forward and backward functions.

def layernorm_forward(x, gamma, beta, ln_param):
   
    out, cache = None, None
    eps = ln_param.get("eps", 1e-5)

    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

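    # per-sample statistics over the features, kept as (N, 1) columns for broadcasting
    feature_mean = np.mean(x, axis=1, keepdims=True)
    feature_var = np.var(x, axis=1, keepdims=True)

    # normalize each sample, then apply the per-feature scale and shift
    x_normal = (x - feature_mean) / np.sqrt(feature_var + eps)
    out = gamma * x_normal + beta

    cache = (x, feature_mean, feature_var, gamma)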

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    
    return out, cache

def layernorm_backward(dout, cache):
    
    dx, dgamma, dbeta = None, None, None
   
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    eps = 1e-5  # must match the eps used in the forward pass (its default value)
    D = dout.shape[1]  # statistics are taken over the D features of each sample
    x, feature_mean, feature_var, gamma = cache
    x_normal = (x - feature_mean) / np.sqrt(feature_var + eps)

    # gamma and beta are per-feature parameters, so they still sum over the batch (axis=0)
    dgamma = np.sum(dout * x_normal, axis=0)
    dbeta = np.sum(dout, axis=0)
    dx_normal = dout * gamma

    dlvar = np.sum(dx_normal * (x - feature_mean)
                   * -0.5 * (feature_var + eps) ** -1.5, axis=1)[:, np.newaxis]

    # the parentheses keep the second term in the same statement as the first
    dlmean = (np.sum(dx_normal * -1 / np.sqrt(feature_var + eps), axis=1)[:, np.newaxis]
              + dlvar * np.sum(-2 * (x - feature_mean), axis=1)[:, np.newaxis] / D)

    dx = dx_normal / np.sqrt(feature_var + eps) + dlvar * 2 * (x - feature_mean) / D + dlmean / D

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    
    return dx, dgamma, dbeta

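Starting from the batch normalization functions above, it is enough to adjust the axis of mean, var, sum, and so on to fit layer normalization; no new derivation is needed, only small changes. Next is a comparison of models using layer norm at different batch sizes.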
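One way to convince yourself that only the axis changes: normalizing the transposed input with the batchnorm computation and transposing back gives the layernorm result. A sketch with hypothetical helpers `batchnorm_core` and `layernorm_core`, both without gamma and beta:

```python
import numpy as np

def batchnorm_core(x, eps=1e-5):
    """Normalize each column of x over its rows (batchnorm without gamma/beta)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def layernorm_core(x, eps=1e-5):
    """Normalize each row of x over its columns (layernorm without gamma/beta)."""
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6))
print(np.allclose(layernorm_core(x), batchnorm_core(x.T).T))
```

gamma and beta are still per-feature parameters in both layers, so they are applied after transposing back, and dgamma and dbeta keep summing over axis 0.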

Comparison of models with and without layer norm at different batch sizes

Inline Question 4:

When is layer normalization likely to not work well, and why?

Using it in a very deep network
Having a very small dimension of features
Having a high regularization term

Answer:

With a small feature dimension: just as batch normalization suffers from a small batch size, when the feature dimension is small the mean and variance are computed from too few values, so the normalized features are not representative of the data in general.

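The last question asks when layer normalization is likely not to work well. As with a small batch size in batch normalization, I think layer normalization performs worse when the feature dimension is small, because the statistics of so few features do not give representative values for the data in general.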
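The extreme cases make the problem visible. This is a sketch with made-up inputs; `layernorm_core` is a hypothetical helper that applies layer normalization without gamma and beta.

```python
import numpy as np

def layernorm_core(x, eps=1e-5):
    """Normalize each row of x over its columns (layernorm without gamma/beta)."""
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# D = 1: every sample collapses to 0, whatever its value was
one_feature = layernorm_core(np.array([[3.0], [-7.0], [100.0]]))
print(one_feature)

# D = 2: only the order of the two features survives (about -1 and +1)
two_features = layernorm_core(np.array([[1.0, 2.0], [5.0, 50.0], [9.0, 8.0]]))
print(two_features)
```

With one feature all information is lost, and with two features the magnitudes are lost, so the layers that follow have very little to work with.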
