\section{3D streaming\label{sote:3d-streaming}}
In this thesis, we focus on delivering large 3D scenes over the network.
While 3D streaming is not the most popular research field, special attention has been paid to 3D content compression, and in particular to progressive compression, which can be considered a premise of 3D streaming.
In the next sections, we review related work on 3D streaming, from 3D compression and structuring to 3D interaction.
\subsection{Compression and structuring}
According to \citep{maglo20153d}, mesh compression can be divided into four categories:
\begin{itemize}
\item single-rate mesh compression, seeking to reduce the size of a mesh;
\item progressive mesh compression, encoding meshes in many levels of resolution that can be downloaded and rendered one after the other;
\item random accessible mesh compression, where different parts of the models can be decoded in an arbitrary order;
\item mesh sequence compression, compressing mesh animations.
\end{itemize}
Since our objective is to stream static 3D scenes, single-rate mesh compression and mesh sequence compression are of less interest to us.
This section thus focuses on progressive meshes and random accessible mesh compression.
Progressive meshes were introduced in~\citep{progressive-meshes} and allow progressive transmission of a mesh by sending a low resolution mesh first, called the \emph{base mesh}, and then transmitting detail information that a client can use to increase the resolution.
To do so, a \emph{decimation algorithm} starts from the original, full resolution mesh and iteratively removes vertices and faces by merging vertices through the so-called \emph{edge collapse} operation (Figure~\ref{sote:progressive-scheme}).
\begin{figure}[ht]
\centering
\begin{tikzpicture}[scale=2]
\node (Top1) at (0.5, 1) {};
\node (A) at (0, 0.8) {};
\node (B) at (1, 0.9) {};
\node (C) at (1.2, 0) {};
\node (D) at (0.9, -0.8) {};
\node (E) at (0.2, -0.9) {};
\node (F) at (-0.2, 0) {};
\node (G) at (0.5, 0.5) {};
\node (H) at (0.6, -0.5) {};
\node (Bottom1) at (0.5, -1) {};
\node (Top2) at (3.5, 1) {};
\node (A2) at (3, 0.8) {};
\node (B2) at (4, 0.9) {};
\node (C2) at (4.2, 0) {};
\node (D2) at (3.9, -0.8) {};
\node (E2) at (3.2, -0.9) {};
\node (F2) at (2.8, 0) {};
\node (G2) at (3.55, 0) {};
\node (Bottom2) at (3.5, -1) {};
\draw (A.center) -- (B.center) -- (C.center) -- (D.center) -- (E.center) -- (F.center) -- (A.center);
\draw (A.center) -- (G.center);
\draw (B.center) -- (G.center);
\draw (C.center) -- (G.center);
\draw (F.center) -- (G.center);
\draw (C.center) -- (H.center);
\draw (F.center) -- (H.center);
\draw (E.center) -- (H.center);
\draw (D.center) -- (H.center);
\draw[color=red, line width=1mm] (G.center) -- (H.center);
\draw (A2.center) -- (B2.center) -- (C2.center) -- (D2.center) -- (E2.center) -- (F2.center) -- (A2.center);
\draw (A2.center) -- (G2.center);
\draw (B2.center) -- (G2.center);
\draw (C2.center) -- (G2.center);
\draw (F2.center) -- (G2.center);
\draw (E2.center) -- (G2.center);
\draw (D2.center) -- (G2.center);
\node at (G2) [circle,fill=red,inner sep=2pt]{};
\draw[-{Latex[length=3mm]}] (Top1) to [out=30, in=150] (Top2);
\draw[-{Latex[length=3mm]}] (Bottom2) to [out=-150, in=-30] (Bottom1);
\node at (2, 1.75) {Edge collapse};
\node at (2, -1.75) {Vertex split};
\end{tikzpicture}
\caption{Vertex split and edge collapse\label{sote:progressive-scheme}}
\end{figure}
Every time two vertices are merged, a vertex and two faces are removed from the original mesh, decreasing the model resolution.
At the end of this content preparation phase, the mesh has been reorganized into a base mesh and a sequence of partially ordered vertex split operations.
Thus, a client can start by downloading the base mesh, display it to the user, and keep downloading refinement operations (vertex splits) to display increasing detail as time goes by.
This process reduces the time a user has to wait before seeing a downloaded 3D object, thus increasing the quality of experience.
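To illustrate this content preparation phase, the following Python sketch (using a deliberately naive toy mesh representation of our own; a real decimator would select edges according to a geometric error metric) shows how iterative edge collapses produce a base mesh together with a log of the inverse operations, which can then be streamed back as vertex splits.
\begin{verbatim}
from dataclasses import dataclass

@dataclass
class SplitRecord:
    kept: int            # index of the surviving vertex
    removed_pos: tuple   # position of the vertex merged away
    removed_faces: list  # faces deleted by the collapse (shared the edge)
    remapped_faces: list # original faces whose corner pointed to `removed`

def collapse_edge(vertices, faces, kept, removed):
    """Merge vertex `removed` into `kept` and log the inverse vertex split."""
    record = SplitRecord(kept, vertices[removed], [], [])
    new_faces = []
    for f in faces:
        if kept in f and removed in f:
            record.removed_faces.append(f)   # degenerate face: drop it
        elif removed in f:
            record.remapped_faces.append(f)
            new_faces.append(tuple(kept if v == removed else v for v in f))
        else:
            new_faces.append(f)
    return new_faces, record

def decimate(vertices, faces, target_face_count):
    """Collapse edges until few enough faces remain; a real decimator
    would pick the edge minimizing a geometric error, not the first one."""
    log = []
    while len(faces) > target_face_count:
        kept, removed = faces[0][0], faces[0][1]   # naive edge choice
        faces, record = collapse_edge(vertices, faces, kept, removed)
        log.append(record)
    return faces, log   # base mesh + vertex splits (replayed in reverse)

# Two triangles sharing the edge (0, 1):
vertices = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
faces = [(0, 1, 2), (1, 3, 2)]
base, log = decimate(vertices, faces, 1)
print(base, len(log))  # one face left, one recorded split
\end{verbatim}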
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{assets/state-of-the-art/3d-streaming/progressivemesh.png}
\caption{Four levels of resolution of a mesh~\citep{progressive-meshes}}
\end{figure}
%These methods have been vastly researched \citep{bayazit20093,mamou2010shape}, but very few of these methods can handle meshes with attributes, such as texture coordinates.
\citep{streaming-compressed-webgl} develops a dedicated progressive compression algorithm based on iterative decimation that allows efficient decoding, so that it can be used on web clients.
With the same objective, \citep{pop-buffer} proposes the pop buffer, a progressive compression method based on quantization that allows efficient decoding.
Following these works, many approaches use multi-triangulation, which creates mesh fragments at different levels of resolution and encodes the dependencies between fragments in a directed acyclic graph.
In \citep{batched-multi-triangulation}, the authors propose Nexus: a GPU-optimized version of multi-triangulation that improves performance enough to make real-time rendering possible.
It is notably used in 3DHOP (3D Heritage Online Presenter, \citep{3dhop}), a framework for easily building web interfaces that present 3D objects to users in the context of cultural heritage.
Each of these approaches defines its own compression and coding for a single mesh.
However, users are often interested in scenes that contain multiple meshes, hence the need to structure content emerged.
To address these issues, the Khronos group proposed a generic format called glTF (GL Transmission Format,~\citep{gltf}) to handle all types of 3D content representations: point clouds, meshes, animated models, etc.\
glTF is based on a JSON file, which encodes the structure of a scene of 3D objects.
It contains a scene graph with cameras, meshes, buffers, materials, textures and animations.
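To make this structure concrete, here is a minimal, hand-written glTF asset describing a single triangle; the field names follow the glTF 2.0 specification, while the binary buffer file name is of course a placeholder (\texttt{componentType} 5126 denotes 32-bit floats, so three \texttt{VEC3} positions occupy 36 bytes).
\begin{verbatim}
{
  "asset":   { "version": "2.0" },
  "scene":   0,
  "scenes":  [ { "nodes": [0] } ],
  "nodes":   [ { "mesh": 0 } ],
  "meshes":  [ { "primitives": [ { "attributes": { "POSITION": 0 } } ] } ],
  "accessors": [ {
    "bufferView": 0, "componentType": 5126,
    "count": 3, "type": "VEC3",
    "min": [0, 0, 0], "max": [1, 1, 0]
  } ],
  "bufferViews": [ { "buffer": 0, "byteOffset": 0, "byteLength": 36 } ],
  "buffers":     [ { "uri": "triangle.bin", "byteLength": 36 } ]
}
\end{verbatim}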
Although relevant for compression, transmission and in particular streaming, this standard does not yet consider view-dependent streaming, which is required for remote visualization of large scenes and which we address in our work.
% Zampoglou
\subsection{Viewpoint dependency}
3D streaming means that content is downloaded while the user is interacting with the 3D object.
In terms of quality of experience, it is desirable that the downloaded content falls into the user's field of view.
This means that the progressive compression must encode spatial information in order to allow the decoder to select the content adapted to its viewpoint.
This is typically called \emph{random accessible mesh compression}.
\citep{maglo2013pomar} is one example of random accessible progressive mesh compression.
\citep{cheng2008receiver} proposes a receiver-driven way of achieving viewpoint dependency with progressive meshes: the client starts by downloading the base mesh, and from then on is able to estimate the importance of the different vertex splits in order to choose which ones to download.
Doing so drastically reduces the server's computational load, since it only has to send data, and improves the scalability of the framework.
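As a rough illustration of such receiver-driven selection (our own simplification, not the exact criterion of the cited work), a client can rank the vertex splits it knows about by an estimate of their on-screen benefit and spend its download budget on the best ones:
\begin{verbatim}
import math

def split_importance(center, extent, cam_pos, cam_dir):
    """Estimated visual benefit of one vertex split: refinements that
    are close to the camera and inside the viewing cone score higher.
    `cam_dir` is assumed to be a unit vector."""
    to_split = [c - p for c, p in zip(center, cam_pos)]
    dist = math.sqrt(sum(d * d for d in to_split)) or 1e-6
    cos = sum(d * v for d, v in zip(to_split, cam_dir)) / dist
    if cos <= 0.0:
        return 0.0               # behind the camera: no visible benefit
    return cos * extent / dist   # roughly the refinement's projected size

def choose_splits(splits, cam_pos, cam_dir, budget):
    """Request the `budget` most useful splits first."""
    ranked = sorted(splits,
                    key=lambda s: -split_importance(s[0], s[1],
                                                    cam_pos, cam_dir))
    return ranked[:budget]

splits = [((0, 0, 5), 0.5), ((0, 0, 50), 0.5), ((0, 0, -5), 0.5)]
print(choose_splits(splits, (0, 0, 0), (0, 0, 1), 2))
\end{verbatim}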
In the case of streaming a large 3D scene, view-dependent streaming is fundamental: a user only sees a small portion of the scene at any given time, and a system that does not adapt its streaming to the user's point of view is bound to induce a low quality of experience.
A simple way to implement viewpoint dependency is to request the content that is spatially close to the user's camera.
This approach, implemented in Second Life and several other NVEs (e.g.,~\citep{peer-texture-streaming}), only depends on the location of the avatar, not on its viewing direction.
It exploits spatial coherence and works well for any continuous movement of the user, including turning.
Once the set of objects that are likely to be accessed by the user is determined, the next question is in what order these objects should be retrieved.
A simple approach is to retrieve the objects based on distance: the spatial distance from the user's virtual location and rotational distance from the user's view.
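A sketch of such an ordering follows; the weights are illustrative (ours, not taken from any cited system), and setting the angular weight to zero recovers the purely location-based approach described above:
\begin{verbatim}
import math

def retrieval_key(obj_pos, avatar_pos, view_dir, w_dist=1.0, w_angle=5.0):
    """Combined rank of an object: its spatial distance to the avatar
    plus the angular distance between the view direction and the
    direction to the object (`view_dir` is a unit vector)."""
    delta = [o - a for o, a in zip(obj_pos, avatar_pos)]
    dist = math.sqrt(sum(d * d for d in delta)) or 1e-6
    cos = max(-1.0, min(1.0,
              sum(d * v for d, v in zip(delta, view_dir)) / dist))
    return w_dist * dist + w_angle * math.acos(cos)

objects = {"tree": (-3, 0, 0), "house": (10, 0, 0), "tower": (40, 5, 0)}
order = sorted(objects,
               key=lambda n: retrieval_key(objects[n], (0, 0, 0), (1, 0, 0)))
print(order)  # ['house', 'tree', 'tower']: close, in-view objects first
\end{verbatim}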
More recently, Google integrated the Google Earth 3D module into Google Maps (Figure~\ref{sota:google-maps}).
Users are now able to go to Google Maps and click the 3D button, which shifts the camera away from the top-down aerial view.
Even though there are no associated publications to support this assertion, the streaming appears to be view-dependent: low resolution data around the center of the viewpoint is downloaded first, and closer objects receive higher resolution data than distant ones.
\begin{figure}[h]
\centering
\includegraphics[width=0.8\textwidth]{assets/state-of-the-art/3d-streaming/googlemaps.png}
\caption{Screenshot of the 3D interface of Google Maps\label{sota:google-maps}}
\end{figure}
Other approaches use levels of detail.
Levels of detail were initially used for efficient 3D rendering~\citep{lod}.
When the change from one level of detail to another is abrupt, it can create visual discomfort for the user.
This is called the \emph{popping effect}; levels of detail have the advantage of enabling techniques, such as geomorphing \citep{hoppe-lod}, that transition smoothly from one level of detail to another.
Levels of detail have since been used for 3D streaming.
For example, \citep{streaming-hlod} propose an out-of-core viewer for remote model visualization based on adapting hierarchical levels of detail~\citep{hlod} to the context of 3D streaming.
Levels of detail can also be used to perform viewpoint-dependent streaming, as in \citep{view-dependent-lod}.
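As an illustration of the geomorphing mentioned above (a minimal sketch of the general idea, not Hoppe's exact scheme), a client can blend vertex positions between two levels over a short transition period instead of swapping them instantly:
\begin{verbatim}
def geomorph(coarse_positions, fine_positions, alpha):
    """Linearly interpolate vertex positions between two levels of
    detail. alpha goes from 0 (coarse) to 1 (fine) over a short
    transition, so the switch never appears as a sudden 'pop'.
    Corresponding vertices are assumed to be paired (each fine vertex
    knows its coarse ancestor)."""
    return [
        tuple(c + alpha * (f - c) for c, f in zip(cv, fv))
        for cv, fv in zip(coarse_positions, fine_positions)
    ]

# Example: one vertex sliding from its coarse to its fine position.
print(geomorph([(0.0, 0.0, 0.0)], [(1.0, 2.0, 0.0)], 0.25))
# [(0.25, 0.5, 0.0)]
\end{verbatim}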
\subsection{Texture streaming}
In order to increase texture rendering speed, a common technique is \emph{mipmapping}.
It consists in generating progressively lower resolutions of an initial texture.
Lower resolutions of the textures are used for polygons which are far away from the camera, and higher resolutions for polygons closer to the camera.
Not only does this reduce the time needed to render the polygons, but it can also reduce aliasing effects.
Using these lower resolutions can be especially interesting for streaming.
\citep{mipmap-streaming} proposes the PTM format, which encodes the mipmap levels of a texture so that they can be downloaded progressively: a lower resolution can be shown to the user while the higher resolutions are being downloaded.
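The following sketch illustrates the mipmap pyramid and a standard way of picking a level; the geometric series $1/4 + 1/16 + \dots = 1/3$ explains why the whole chain costs only about a third more memory than the base texture.
\begin{verbatim}
import math

def mip_chain(width, height):
    """Resolutions of the mipmap pyramid: each level halves both
    dimensions of the previous one, down to a single texel."""
    sizes = [(width, height)]
    while sizes[-1] != (1, 1):
        w, h = sizes[-1]
        sizes.append((max(1, w // 2), max(1, h // 2)))
    return sizes

def mip_level(texels_per_pixel):
    """Pick the level where one texel covers roughly one screen pixel;
    `texels_per_pixel` is a pixel's footprint in base-level texels."""
    return max(0, int(round(math.log2(max(1.0, texels_per_pixel)))))

print(mip_chain(512, 512)[:4])  # [(512,512), (256,256), (128,128), (64,64)]
print(mip_level(4.0))           # a pixel covering 4x4 texels -> level 2
\end{verbatim}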
Since 3D data can contain many textures, \citep{simon2019streaming} propose a way to stream a set of textures by encoding them into a video.
Each texture is segmented into tiles of a fixed size.
Those tiles are then ordered to minimize dissimilarities between consecutive tiles, and encoded as a video.
By benefiting from video compression techniques, the authors reach a better rate-distortion ratio than WebP, the new standard for texture transmission, and JPEG.
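The benefit comes from inter-frame prediction: the more consecutive tiles resemble each other, the cheaper the video encoding. A greedy nearest-neighbour ordering (our own illustrative heuristic; the cited work may use a different optimization) already captures the idea:
\begin{verbatim}
def order_tiles(tiles, dissimilarity):
    """Order tiles so that consecutive ones look alike, starting from
    tile 0 and repeatedly appending the most similar remaining tile."""
    order, remaining = [0], set(range(1, len(tiles)))
    while remaining:
        best = min(remaining,
                   key=lambda t: dissimilarity(tiles[order[-1]], tiles[t]))
        remaining.remove(best)
        order.append(best)
    return order

def mad(a, b):
    """Toy dissimilarity: mean absolute difference of grayscale tiles."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

tiles = [[10, 10, 10], [200, 200, 200], [20, 20, 20]]
print(order_tiles(tiles, mad))  # [0, 2, 1]: dark tiles end up adjacent
\end{verbatim}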
\subsection{Geometry and textures}
As discussed in Chapter~\ref{f:3d}, most 3D scenes consist of two main types of data: geometry and textures.
When addressing 3D streaming, geometry and textures compete for the available bandwidth, and the system needs to address this compromise.
Balancing between streaming of geometry and texture data is addressed by~\citep{batex3},~\citep{visual-quality-assessment}, and~\citep{mesh-texture-multiplexing}.
Their approaches combine the distortion caused by having lower resolution meshes and textures into a single view-independent metric.
\citep{progressive-compression-textured-meshes} also deals with the geometry / texture compromise.
This work designs a cost-driven framework for 3D data compression, both in terms of geometry and textures.
The authors generate an atlas for textures that enables efficient compression and a multi-resolution scheme.
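A generic way of arbitrating this compromise (a sketch of the common greedy principle, not the specific metrics of the works above) is to fetch, at each step, whichever chunk, mesh refinement or texture level, offers the largest estimated quality gain per transmitted byte:
\begin{verbatim}
def stream_order(chunks):
    """Sort candidate chunks by estimated quality gain per byte,
    highest first; `chunks` holds (name, gain, size_in_bytes) triples."""
    return [name for name, gain, size
            in sorted(chunks, key=lambda c: -c[1] / c[2])]

chunks = [
    ("mesh_lod1",    8.0,  2_000),  # large visual gain, small payload
    ("texture_512", 12.0, 40_000),  # larger gain but much heavier
    ("mesh_lod2",    2.0,  8_000),
]
print(stream_order(chunks))  # ['mesh_lod1', 'texture_512', 'mesh_lod2']
\end{verbatim}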
All four works consider a single mesh and impose constraints on the types of meshes that they are able to compress.
Since the 3D scenes we consider in our work consist of soups of textured polygons, these constraints are not satisfied and we cannot use these techniques.
% All four works considered a single, manifold textured mesh model with progressive meshes, and are not applicable in our work since we deal with large and potentially non-manifold scenes.
\subsection{Streaming in game engines}
In traditional video games, including online games, there is no requirement for 3D data streaming.
Video games either come with a physical support (CD, DVD, Blu-Ray) or they require the downloading of the game itself, which includes the 3D data, before letting the user play.
However, transferring data from the disk to the memory is already a form of streaming.
This is why optimized game engines use techniques that are also relevant to streaming, such as levels of detail, to reduce the detail of objects far away from the point of view and save resources to enhance the level of detail of closer objects.
Some other online games, such as \href{https://secondlife.com}{Second Life}, rely on user-generated data, and thus have to send data from some users to others.
In such scenarios, 3D streaming is appropriate, and this is why the idea of streaming 3D content for video games has been investigated.
For example, \citep{game-on-demand} proposes an online game engine based on geometry streaming that addresses the challenge of streaming 3D content while synchronizing the different players.
\subsection{NVE streaming frameworks}
An example of an NVE streaming framework is 3D Tiles \citep{3d-tiles}, a specification for visualizing massive 3D geospatial data developed by Cesium and built on top of glTF\@.
Their main goal is to display 3D objects on top of regular maps, and their visualization consists of a top-down view, whereas we seek to let users freely navigate in our scenes, whether flying over the scene or moving along the roads.
\begin{figure}[ht]
\centering
\includegraphics[width=0.8\textwidth]{assets/state-of-the-art/3d-streaming/3dtiles.png}
\caption{Screenshot of 3D Tiles interface~\citep{3d-tiles}}
\end{figure}
3D Tiles, as its name suggests, is based on a spatial partitioning of the scene.
It started with a regular octree, but has since moved to a $k$-d tree (see Figure~\ref{sote:3d-tiles-partition}).
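A $k$-d tree adapts the partition to the content density: a sketch of the principle (our own illustration, not Cesium's actual implementation) splits at the median along alternating axes so that each tile holds a comparable amount of content, where a regular octree can leave many cells almost empty.
\begin{verbatim}
def build_kd_tree(objects, depth=0, leaf_size=2):
    """Recursive k-d partitioning of a scene: split the objects at the
    median along alternating axes, so every leaf (tile) holds a
    similar amount of content."""
    if len(objects) <= leaf_size:
        return {"objects": objects}
    axis = depth % 3
    objects = sorted(objects, key=lambda o: o[axis])
    mid = len(objects) // 2
    return {
        "axis": axis,
        "split": objects[mid][axis],
        "left": build_kd_tree(objects[:mid], depth + 1, leaf_size),
        "right": build_kd_tree(objects[mid:], depth + 1, leaf_size),
    }

scene = [(0, 0, 0), (5, 1, 0), (5, 9, 2), (9, 9, 1), (2, 7, 3)]
tree = build_kd_tree(scene)
print(tree["axis"], tree["split"])  # root splits on x, at x = 5
\end{verbatim}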
\begin{figure}[th]
\centering
\begin{subfigure}[b]{0.45\textwidth}
\includegraphics[width=1\textwidth]{assets/state-of-the-art/3d-streaming/3d-tiles-octree.png}
\caption{With regular octree (depth 4)}
\end{subfigure}
\begin{subfigure}[b]{0.45\textwidth}
\includegraphics[width=1\textwidth]{assets/state-of-the-art/3d-streaming/3d-tiles-kd-tree.png}
\caption{With $k$-d tree (depth 6)}
\end{subfigure}
\caption{Spatial partitioning used in 3D Tiles\label{sote:3d-tiles-partition}}
\end{figure}
In~\citeyear{3d-tiles-10x}, the 3D Tiles streaming system was improved by preloading the data at the camera's next position when it is known in advance (with ideas similar to those we discuss and implement in Chapter~\ref{bi}, published in~\citeyear{bookmarks-impact}) and by ordering tile requests depending on the user's position (with ideas similar to those we discuss and implement in Chapter~\ref{d3}, published in~\citeyear{dash-3d}).
\citep{zampoglou} is another example of a streaming framework: it is the first paper that proposes to use DASH to stream 3D content.
In their work, the authors describe a system that allows users to access 3D content at multiple resolutions.
They organize the content, following DASH terminology, into periods, adaptation sets, representations and segments.
Their first adaptation set codes the tree structure of the scene graph.
Each further adaptation set contains both geometry and texture information and is available at different resolutions defined in a corresponding representation.
To avoid requests that would take too long and thus introduce latency, the representations are split into segments.
The authors discuss the optimal number of polygons that should be stored in a single segment.
On the one hand, using segments containing very few faces will induce many HTTP requests from the client, and will lead to poor streaming efficiency.
On the other hand, if segments contain too many faces, the time to load the segment is long and the system loses adaptability.
Their approach works well for a scene composed of several objects, but does not handle view-dependent streaming, which is desirable in the use case of large NVEs\@.
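This trade-off on segment size can be made concrete with a back-of-the-envelope model in which each sequential HTTP request pays one round-trip of latency before its transfer (the numbers below are arbitrary; pipelining or HTTP/2 would change them, not the qualitative behaviour):
\begin{verbatim}
def download_time(total_bytes, segment_bytes, rtt_s, bandwidth_Bps):
    """Cost of fetching a fixed amount of content in segments over
    sequential HTTP requests: each segment pays one round-trip of
    latency plus its share of the transfer time."""
    n_requests = -(-total_bytes // segment_bytes)   # ceiling division
    return n_requests * rtt_s + total_bytes / bandwidth_Bps

# 10 MB of content, 50 ms RTT, 2 MB/s of bandwidth:
for seg_kb in (10, 100, 1000):
    t = download_time(10_000_000, seg_kb * 1000,
                      rtt_s=0.05, bandwidth_Bps=2_000_000)
    print(f"{seg_kb:5d} kB segments -> {t:5.1f} s")
# Tiny segments waste time on round-trips (55 s here); huge segments
# hurt adaptability because each single request takes long to complete.
\end{verbatim}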
% \subsection{Prefetching in NVE}
% The general prefetching problem can be described as follows: what are the data most likely to be accessed by the user in the near future, and in what order do we download the data?
%
%
% Other approaches consider the movement of the user and attempt to predict where the user will move to in the future.
% \citep{motion-prediction} and~\citep{walkthrough-ve} predict the direction of movement from the user's mouse input pattern.
% The predicted mouse movement direction is then mapped to the navigation path in the NVE\@.
% Objects that fall in the predicted path are then prefetched.
% CyberWalk~\citep{cyberwalk} uses an exponentially weighted moving average of past movement vectors, adjusted with the residual of prediction, to predict the next location of the user.
%
% \citep{prefetching-walkthrough-latency} cluster the navigation paths of users and use them to predict the future navigation paths.
% Objects that fall within the predicted navigation path are prefetched.
% All these approaches work well for a navigation path that is continuous --- once the user clicks on a bookmark and jumps to a new location, the path is no longer continuous and the prediction becomes wrong.
%
% Moving beyond ordering objects to prefetch based on distance only,~\citep{caching-prefetching-dve} propose to predict the user's interest in an object as well.
% Objects within AoI are then retrieved in decreasing order of predicted interest value to the user.
%
% % \cite{learning-user-access-patterns} investigates how to render large-scale 3-D scenes on a thin client.
% % Efficient scene prefetching to provide timely data with a limited cache is one of the most critical issues for remote 3-D data scheduling in networked virtual environment applications.
% % Existing prefetching schemes predict the future positions of each individual user based on user traces.
% % In this paper, we investigate scene content sequences accessed by various users instead of user viewpoint traces and propose a user access pattern-based 3-D scene prefetching scheme.
% % We make a relationship graph-based clustering to partition history user access sequences into several clusters and choose representative sequences from among these clusters as user access patterns.
% % Then, these user access patterns are prioritized by their popularity and users' personal preference.
% % Based on these access patterns, the proposed prefetching scheme predicts the scene contents that will most likely be visited in the future and delivers them to the client in advance.
%
% \citep{remote-rendering-streaming} investigate remote image-based rendering (IBR) as the most suitable solution for rendering complex 3D scenes on mobile devices, where the server renders the 3D scene and streams the rendered images to the client.
% However, sending a large number of images is inefficient due to the possible limitations of wireless connections.
% They propose a prefetching scheme at the server side that predicts client movements and hence prefetches the corresponding images.
%
% Prefetching techniques easing 3D data streaming and real-time rendering for remote walkthroughs are considered in~\citep{prefetching-remote-walkthroughs}.
% Culling methods, that don't possess frame to frame coherence, can successfully be combined with remote scene databases, if the prefetching algorithm is adapted accordingly.
% We present a quantitative transmission policy, that takes the limited bandwidth of the network and the limited memory available at the client computer into account.
%
% Also in the context remote visualization,~\citep{cache-remote-visualization} study caching and prefetching and optimize configurations of remote visualization architectures.
% They aim at minimizing the fetch time in a remote visualization system and defend a practical infrastructure software to adaptively optimize the caching architecture of such systems under varying conditions (e.g.\ when network ressources vary).
%