[Paper Review] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

capaca 2023. 3. 31. 04:19

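Paper link: https://arxiv.org/pdf/2003.08934v2.pdf
Why I read it: I wanted to read an interesting-looking paper called DBARF, and NeRF seemed to be required background for it.
Level of understanding: ⭐ [⭐ concepts only, ⭐⭐ read properly, ⭐⭐⭐ understood down to the code]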

📌 Summary of Contribution

In summary, our technical contributions are:
- An approach for representing continuous scenes with complex geometry and materials as 5D neural radiance fields, parameterized as basic MLP networks.
- A differentiable rendering procedure based on classical volume rendering techniques, which we use to optimize these representations from standard RGB images. This includes a hierarchical sampling strategy to allocate the MLP’s capacity towards space with visible scene content.
  - differentiable rendering: a rendering function that can be differentiated, so that a neural implicit shape representation can be optimized using only 2D images
    - Niemeyer, M., Mescheder, L., Oechsle, M., Geiger, A.: Differentiable volumetric rendering: Learning implicit 3D representations without 3D supervision. In: CVPR (2019)
- A positional encoding to map each input 5D coordinate into a higher dimensional space, which enables us to successfully optimize neural radiance fields to represent high-frequency scene content.

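A basic MLP network that takes a 5D input can represent a continuous scene with complex geometry and materials.
The paper proposes a differentiable rendering procedure based on classical volume rendering and uses it to learn the network parameters.
Positional encoding lets the scene be represented in fine detail (the input is 5-dimensional, but it is mapped to a higher-dimensional space).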
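To make the volume rendering step concrete, here is a minimal NumPy sketch of the discrete compositing rule from the paper, $\hat{C} = \sum_i T_i (1 - e^{-\sigma_i \delta_i}) c_i$ with $T_i = \exp(-\sum_{j<i} \sigma_j \delta_j)$. The function name `composite_ray`, the toy densities and colors, and the choice to treat the last interval as effectively infinite are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def composite_ray(sigmas, colors, t_values):
    """Composite per-sample densities and colors into one pixel color."""
    sigmas = np.asarray(sigmas, dtype=float)      # (N,) volume density per sample
    colors = np.asarray(colors, dtype=float)      # (N, 3) RGB per sample
    t_values = np.asarray(t_values, dtype=float)  # (N,) sample depths along the ray
    # delta_i = t_{i+1} - t_i; the last interval is treated as very large (assumption).
    deltas = np.append(np.diff(t_values), 1e10)
    # alpha_i = 1 - exp(-sigma_i * delta_i): opacity contributed by sample i.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # T_i = exp(-sum_{j<i} sigma_j * delta_j) = prod_{j<i} (1 - alpha_j).
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = transmittance * alphas              # (N,)
    rgb = (weights[:, None] * colors).sum(axis=0) # (3,)
    return rgb, weights

# Toy ray: empty space, then one very dense red sample, then a sample behind it.
sigmas = [0.0, 0.0, 50.0, 0.0]
colors = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [1, 1, 1]]
t_values = [2.0, 3.0, 4.0, 5.0]
rgb, weights = composite_ray(sigmas, colors, t_values)
```

Every operation above is differentiable with respect to the densities and colors, which is why a loss on rendered pixels can be backpropagated to the MLP.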

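💡 Background

Ray: a line that passes through the camera's focal point and one pixel of the image. Whatever lies along that line is projected onto that pixel, so one ray determines one pixel value.
- Once the position of the camera's focal point (o) and the viewing direction (d) are fixed, any point on the ray can be computed linearly as o + td.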
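The ray equation is simple enough to sketch directly. The snippet below evaluates r(t) = o + td at several depths; the function name `points_along_ray`, the camera placement, and the near/far values of 2 and 6 are hypothetical choices for the example.

```python
import numpy as np

def points_along_ray(origin, direction, t_values):
    """Return the 3D points r(t) = o + t * d for every t in t_values."""
    origin = np.asarray(origin, dtype=float)        # o, shape (3,)
    direction = np.asarray(direction, dtype=float)  # d, shape (3,)
    t_values = np.asarray(t_values, dtype=float)    # shape (N,)
    # Broadcasting: (1, 3) + (N, 1) * (1, 3) -> (N, 3)
    return origin[None, :] + t_values[:, None] * direction[None, :]

# Hypothetical camera at the origin looking down the -z axis.
o = [0.0, 0.0, 0.0]
d = [0.0, 0.0, -1.0]
t = np.linspace(2.0, 6.0, 5)  # 5 evenly spaced depths between near=2 and far=6
pts = points_along_ray(o, d, t)
```

Each row of `pts` is one 3D location on the ray; these are the positions that get fed to the network.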

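📑 What is NeRF?

It turns a set of 2D images into a 3D scene representation (that is, it builds a model that can predict what a 3D object looks like from a given viewpoint).
It is a view synthesis model: it takes several photos of an object as input and generates an image of the object from a "new viewpoint".
novel view synthesis: the model knows what the object looks like from any direction, so you can think of it as rendering the object in 3D.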

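📑 NeRF inputs and outputs

$\text{input}: (x, y, z, \theta, \phi)$ → $F_\Theta$ → $\text{output}: (R, G, B, \sigma)$

Input: the 3D position (x, y, z) of a point in the scene and the viewing direction (θ, φ) from which the camera looks at that point [5-dimensional data]
- ??) How can the point position and the viewing direction actually be used as inputs during training?
  - From the paper: the camera pose of each training image is known, so a ray is cast through every pixel, and the points sampled along that ray together with the ray's direction are what goes into the network.
FC: fully-connected network
Output: the RGB color and the predicted volume density σ at that point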
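The paper notes that in practice the viewing direction is expressed as a 3D Cartesian unit vector d rather than as two angles. A minimal sketch of that conversion follows; the angle convention (θ measured from the +z axis, φ measured from the +x axis in the xy-plane) and the name `view_direction` are assumptions, since the convention is not spelled out in these notes.

```python
import numpy as np

def view_direction(theta, phi):
    """Unit vector for polar angle theta (from +z) and azimuth phi (from +x)."""
    return np.array([
        np.sin(theta) * np.cos(phi),
        np.sin(theta) * np.sin(phi),
        np.cos(theta),
    ])

d = view_direction(np.pi / 2, 0.0)  # points along +x under this convention
```

Whatever convention is used, the result always has unit length, so two angles and a unit 3-vector carry the same information.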

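📑 NeRF training process

The (x, y, z) position is passed through 8 fully-connected layers (ReLU, 256 channels each), which produce the volume density σ and a 256-dimensional feature vector.
That feature vector is concatenated with the viewing direction and passed through one more layer (128 channels) to produce the RGB value. The density therefore depends only on position, while the color also depends on the viewing direction.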
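The two-stage structure can be sketched as a plain NumPy forward pass. This is a rough illustration only: the weights are random and untrained, positional encoding and the skip connection of the full architecture are left out, and the helper names (`dense`, `radiance_field`) as well as the exact layout of the output heads are assumptions based on the paper's description rather than the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(in_dim, out_dim):
    """Randomly initialised (untrained) weights and bias for one FC layer."""
    return rng.normal(0.0, 0.1, (in_dim, out_dim)), np.zeros(out_dim)

def relu(x):
    return np.maximum(x, 0.0)

WIDTH = 256
trunk = [dense(3, WIDTH)] + [dense(WIDTH, WIDTH) for _ in range(7)]  # 8 layers
sigma_head = dense(WIDTH, 1)          # density from position only
feature_head = dense(WIDTH, WIDTH)    # 256-d feature vector
color_hidden = dense(WIDTH + 3, 128)  # feature vector + viewing direction
rgb_head = dense(128, 3)

def radiance_field(xyz, view_dir):
    """Map positions (N, 3) and unit view directions (N, 3) to (rgb, sigma)."""
    h = xyz
    for w, b in trunk:
        h = relu(h @ w + b)
    # ReLU keeps the density non-negative.
    sigma = relu(h @ sigma_head[0] + sigma_head[1])[..., 0]
    feat = h @ feature_head[0] + feature_head[1]
    h = np.concatenate([feat, view_dir], axis=-1)
    h = relu(h @ color_hidden[0] + color_hidden[1])
    # Sigmoid squashes the color into [0, 1].
    rgb = 1.0 / (1.0 + np.exp(-(h @ rgb_head[0] + rgb_head[1])))
    return rgb, sigma

pts = rng.uniform(-1.0, 1.0, (4, 3))        # 4 hypothetical sample points
dirs = np.tile([0.0, 0.0, -1.0], (4, 1))    # all seen from the same direction
rgb, sigma = radiance_field(pts, dirs)
```

Note that `view_dir` only enters after σ has been computed, which matches the idea that geometry should not change with the viewpoint while color (for example specular highlights) can.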

📑 NeRF Contribution

Positional encoding (??)
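- The authors found that the basic way of optimizing NeRF for a complex scene needs a large number of samples per camera ray and does not reach a sufficiently high-resolution result.
- Positional encoding transforms the input data so that the MLP can represent higher frequencies (which lets edges be rendered in fine detail).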
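The encoding in the paper is $\gamma(p) = (\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p))$, applied to each coordinate separately, with L = 10 for the position and L = 4 for the viewing direction. Below is a small NumPy sketch of that formula; the function name `positional_encoding` and the ordering of the output entries are assumptions for illustration.

```python
import numpy as np

def positional_encoding(p, num_freqs):
    """Encode each coordinate of p with sin/cos at frequencies 2^k * pi."""
    p = np.asarray(p, dtype=float)                        # (..., D)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi         # (L,)
    angles = p[..., None] * freqs                         # (..., D, L)
    enc = np.stack([np.sin(angles), np.cos(angles)], axis=-1)  # (..., D, L, 2)
    return enc.reshape(*p.shape[:-1], -1)                 # (..., D * 2L)

xyz_enc = positional_encoding([0.5, 0.0, 0.0], num_freqs=10)          # position
dir_enc = positional_encoding(np.zeros((7, 3)), num_freqs=4)          # directions
```

With L = 10 a 3D position becomes a 60-dimensional vector, and with L = 4 a 3D direction becomes 24-dimensional, which is the "higher dimensional space" mentioned in the contribution list.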
hierarchical sampling procedure
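As described in the paper, hierarchical sampling first evaluates a coarse network on a set of samples, turns the resulting compositing weights into a piecewise-constant PDF along the ray, and then draws extra samples from that PDF by inverse transform sampling so the fine network concentrates on regions with visible content. The sketch below covers only the resampling step; the name `sample_fine`, the small constant added to the weights, and the toy weights are assumptions, not the authors' implementation.

```python
import numpy as np

def sample_fine(bin_edges, weights, num_samples, rng):
    """Draw depths from the piecewise-constant PDF defined by coarse weights."""
    bin_edges = np.asarray(bin_edges, dtype=float)     # (N + 1,) bin boundaries
    weights = np.asarray(weights, dtype=float) + 1e-5  # (N,), avoids an all-zero PDF
    pdf = weights / weights.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])      # (N + 1,), from 0 up to 1
    u = rng.uniform(0.0, 1.0, num_samples)
    # Find the bin whose CDF interval contains each u.
    idx = np.searchsorted(cdf, u, side="right") - 1
    idx = np.clip(idx, 0, len(pdf) - 1)
    # Linear interpolation inside the chosen bin.
    frac = (u - cdf[idx]) / pdf[idx]
    return bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx])

rng = np.random.default_rng(0)
edges = np.linspace(2.0, 6.0, 5)          # 4 coarse bins between near=2 and far=6
coarse_weights = [0.0, 0.0, 1.0, 0.0]     # hypothetical: content only in bin [4, 5]
t_fine = sample_fine(edges, coarse_weights, 1000, rng)
```

In this toy case almost all of `t_fine` falls between depths 4 and 5, the only bin the coarse pass considered occupied.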
