\fresh{}
\section{Video scenario}
Despite what one may think, the video streaming scenario and the 3D streaming one share many similarities: at a higher level of abstraction, both are interfaces that allow a user to access remote content without having to wait until everything is loaded.
Analyzing the similarities and differences between the video and the 3D scenarios, as well as knowing the video streaming literature, is key to developing an efficient 3D streaming system.
\subsection{Similarities and differences between video and 3D}
\subsubsection{Data persistence}
One of the main differences between video and 3D streaming is the persistence of data.
In video streaming, only one second of video is required at a time.
Of course, most video streaming services prefetch some future chunks and keep some previous ones in cache, but a chunk is only useful around the moment of playback it encodes.
In 3D streaming, each chunk is part of a scene: not only are many chunks required to perform a satisfying rendering for the user, but it is also impossible to know in advance which chunks are necessary, since that depends on the user's viewpoint.
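The sketch below illustrates this difference with a minimal chunk selection policy for both scenarios; the names and the distance-based ranking are hypothetical and only meant as an illustration, not as an actual streaming policy.
\begin{verbatim}
from dataclasses import dataclass

@dataclass
class Chunk:
    center: tuple  # position of the chunk in the scene

def next_video_chunks(current_second, prefetch=3):
    # Video: the useful chunks are simply the next seconds of playback.
    return [current_second + i for i in range(1, prefetch + 1)]

def next_3d_chunks(camera_position, chunks):
    # 3D: the useful chunks depend on the viewpoint and must be
    # recomputed every time the camera moves; here they are simply
    # ranked by distance to the camera.
    def squared_distance(chunk):
        return sum((a - b) ** 2
                   for a, b in zip(chunk.center, camera_position))
    return sorted(chunks, key=squared_distance)
\end{verbatim}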
\subsubsection{Multiresolution}
All the major video streaming platforms support multiresolution streaming.
This means that the client can choose the resolution at which it requests the content.
The resolution can be chosen directly by the user or automatically determined by analyzing the available resources (size of the screen, downloading bandwidth, device performance, etc.).
\begin{figure}[ht]
\centering
\includegraphics[width=\textwidth]{assets/state-of-the-art/video/youtube-multiresolution.png}
\caption{The different resolutions available for a YouTube video}
\end{figure}
In the same way, recent works in 3D streaming have proposed many ways to stream 3D models progressively, allowing the user to see a low-resolution version of the model without having to wait, and to interact with the model while the details are being downloaded.
\subsubsection{Media types}
Just like a video, a 3D scene is composed of different types of media.
In video, those types are typically images, sounds, and possibly subtitles, whereas in 3D, they are typically geometry and textures.
In both cases, an algorithm for content streaming has to acknowledge those different media types and manage them accordingly.
In video streaming, most of the data (in terms of bytes) is used for images.
Thus, the most important thing a video streaming system should do is optimize the streaming of images.
That is why a YouTube video, for example, may come with six resolutions for images (144p, 240p, 360p, 480p, 720p and 1080p) but only two for sound.
This is one of the main differences between video and 3D streaming: in a 3D scene, geometry and textures have sizes of the same order of magnitude, and streaming optimization needs to be performed on both.
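As a toy illustration, assuming purely hypothetical byte distributions, a streaming system could split its bandwidth between media types proportionally to their weight in the payload; the figures below are illustrative only.
\begin{verbatim}
# Hypothetical shares of the total payload, for illustration only.
VIDEO_SHARES = {"images": 0.95, "sound": 0.04, "subtitles": 0.01}
SCENE_SHARES = {"geometry": 0.5, "textures": 0.5}

def split_bandwidth(total_bps, shares):
    # Dedicate to each media type a fraction of the bandwidth
    # proportional to its weight in the payload.
    return {media: total_bps * share for media, share in shares.items()}

split_bandwidth(10e6, VIDEO_SHARES)  # images get almost everything
split_bandwidth(10e6, SCENE_SHARES)  # geometry and textures get half each
\end{verbatim}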
\subsubsection{Interaction}
The way a user interacts with the content is probably the most important difference between video and 3D.
In a video interface, there is only one degree of freedom: time.
The only things a user can do are watch the video (without interacting), pause or resume it, or jump to another moment in the video.
Even though these interactions seem easy to handle, giving the best possible experience to the user is already challenging. For example, to perform these few actions, YouTube gives the user multiple options (a sketch of how such controls can be dispatched follows the list).
\begin{itemize}
\item To pause or resume a video, the user can:
\begin{itemize}
\item click the video;
\item press the \texttt{k} key;
\item press the space key if the video is focused by the browser.
\end{itemize}
\item To navigate to another moment of the video, the user can:
\begin{itemize}
\item click the timeline of the video at the desired position;
\item press the left arrow key to move 5 seconds backwards;
\item press the right arrow key to move 5 seconds forwards;
\item press the \texttt{j} key to move 10 seconds backwards;
\item press the \texttt{l} key to move 10 seconds forwards;
\item press one of the number keys (on the first row of the keyboard, below the function keys) to jump to the corresponding decile of the video.
\end{itemize}
\end{itemize}
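The sketch below shows how such a key map can be dispatched; the \texttt{Player} class is a hypothetical stand-in for a real video player, and the actual YouTube implementation certainly differs.
\begin{verbatim}
from dataclasses import dataclass

@dataclass
class Player:  # hypothetical stand-in for a real video player
    current_time: float = 0.0
    duration: float = 600.0
    paused: bool = False

    def toggle_pause(self):
        self.paused = not self.paused

    def seek(self, t):
        self.current_time = max(0.0, min(self.duration, t))

SEEK_KEYS = {"ArrowLeft": -5, "ArrowRight": +5, "j": -10, "l": +10}

def on_key(player, key):
    if key in ("k", " "):
        player.toggle_pause()
    elif key in SEEK_KEYS:
        player.seek(player.current_time + SEEK_KEYS[key])
    elif key.isdigit():
        # number keys jump to the corresponding decile of the video
        player.seek(player.duration * int(key) / 10)
\end{verbatim}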
Those interactions are different if the user is using a mobile device.
\begin{itemize}
\item To pause a video, the user must touch the screen once to make the controls appear, and then touch the pause button at the center of the screen.
\item To resume a video, the user must touch the play button at the center of the screen.
\item To navigate to another moment of the video, the user can:
\begin{itemize}
\item double-tap the left part of the screen to move 5 seconds backwards;
\item double-tap the right part of the screen to move 5 seconds forwards.
\end{itemize}
\end{itemize}
When interacting with a 3D model, there are many approaches.
Some interfaces mimic the video scenario, where the only variable is time and the user has no control over the camera.
Such interfaces are not interactive, and can frustrate users who expect to move freely.
Other interfaces add two degrees of freedom to the previous setting: the user does not control the position of the camera but can control its angles; this mimics the scenario of the 360 video.
Finally, most other interfaces give the user at least five degrees of freedom: three for the coordinates of the camera position, and two for its angles (assuming the up vector is fixed; interfaces that let the camera roll provide a sixth degree of freedom).
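These degrees of freedom can be summed up in a minimal camera representation such as the sketch below; names and conventions are illustrative.
\begin{verbatim}
from dataclasses import dataclass

@dataclass
class Camera:
    # three degrees of freedom for the position of the camera...
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0
    # ...and two for its angles, assuming a fixed up vector;
    # letting the camera roll would add a sixth degree of freedom.
    theta: float = 0.0  # rotation around the up axis (yaw)
    phi: float = 0.0    # elevation angle (pitch)
\end{verbatim}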
\subsection{DASH\@: the standard for video streaming}
\copied{}
Dynamic Adaptive Streaming over HTTP (DASH), or MPEG-DASH~\cite{stockhammer2011dynamic,Sodagar2011}, is now a widely deployed
standard for streaming adaptive video content on the Web~\cite{dashstandard}, made to be simple and scalable.
\fresh{}
DASH is based on a clever way of structuring the content, which allows great adaptability during streaming without requiring any server-side computation.
The content is described in a Media Presentation Description (MPD) file, written in the XML format.
This file has four layers: periods, adaptation sets, representations, and segments.
Each period can have many adaptation sets, each adaptation set can have many representations, and each representation can have many segments.
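This four-layer hierarchy can be sketched with the nested structures below; the attributes are simplified, and real MPD files carry many more of them.
\begin{verbatim}
from dataclasses import dataclass, field
from typing import List

@dataclass
class Segment:
    url: str  # address of roughly one second of media

@dataclass
class Representation:
    bandwidth: int  # bits per second needed for smooth playback
    segments: List[Segment] = field(default_factory=list)

@dataclass
class AdaptationSet:
    mime_type: str  # e.g. "video/mp4" or "audio/mp4"
    representations: List[Representation] = field(default_factory=list)

@dataclass
class Period:
    start: float  # start time of the period, in seconds
    adaptation_sets: List[AdaptationSet] = field(default_factory=list)

# an MPD is essentially a list of periods
MPD = List[Period]
\end{verbatim}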
\subsubsection{Periods}
Periods are used to delimit content depending on the time. They can be used to delimit chapters, or to insert advertisements at the beginning, in the middle, or at the end of a video.
\subsubsection{Adaptation sets}
Adaptation sets are used to delimit content depending on the format.
Each adaptation set has a mime-type, and all the representations and segments that it contains share this mime-type.
In videos, most of the time, each period has at least one adaptation set containing the images, and one adaptation set containing the sound.
\subsubsection{Representations}
The representation level is the one DASH uses to offer the same content at different resolutions.
For example, an adaptation set containing images has a representation for each available resolution (480p, 720p, 1080p, etc.).
This allows a user to choose a representation and to change it during the video; most importantly, since the client can estimate its downloading speed based on the time it took to download past data, it is able to find the optimal resolution: the highest one that arrives on time to avoid stalling.
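A minimal version of this decision, reusing the \texttt{Representation} type from the previous sketch, could look as follows; a real player smooths its bandwidth estimates and adds hysteresis to avoid oscillating between representations.
\begin{verbatim}
def choose_representation(representations, estimated_bandwidth):
    # Keep the representations that the estimated bandwidth can
    # sustain and pick the best of them; if none fits, fall back
    # to the lowest one.
    affordable = [r for r in representations
                  if r.bandwidth <= estimated_bandwidth]
    if affordable:
        return max(affordable, key=lambda r: r.bandwidth)
    return min(representations, key=lambda r: r.bandwidth)
\end{verbatim}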
\subsubsection{Segments}
Until this level of the MPD, the content being described can still be long.
For example, the images of a chapter of a movie at a given resolution can make a heavy file that is long to download.
However, downloading heavy files is not suitable for streaming because it prevents dynamic adaptation: if the user requests a change of resolution, the system has to either wait until the file is fully downloaded, or cancel the request and waste all the progress already made.
Segments are used to prevent this behaviour. They typically encode approximately one second of video each, and give the software a great ability to adapt dynamically: if a user seeks elsewhere in the video, at most one second of downloaded data is lost, and only one segment has to be downloaded for playback to resume.
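With fixed-duration segments, finding what to download after a seek reduces to simple arithmetic, as in the sketch below; the one-second duration is a typical value, not a requirement of the standard.
\begin{verbatim}
SEGMENT_DURATION = 1.0  # seconds of media per segment (typical value)

def segment_to_download(seek_time):
    # After a seek, only the segment containing the target instant
    # needs to be downloaded before playback can resume.
    return int(seek_time // SEGMENT_DURATION)
\end{verbatim}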
\subsubsection{Client side computation}
Once a video is encoded in the DASH format, i.e.\ the files have been structured and the MPD has been generated, they can simply be hosted on a static HTTP server that does no computation other than serving files when it receives requests.
All the intelligence and decision making are moved to the client side.
A client typically starts by downloading the MPD file, and then keeps downloading segments from the adaptation sets it needs, estimating its downloading speed on its own and deciding by itself whether to change representation.
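The sketch below sums up this client-side logic for a single adaptation set, reusing the structures and the \texttt{choose\_representation} function from the previous sketches; \texttt{fetch} stands for a hypothetical function that downloads a URL and returns its content as bytes.
\begin{verbatim}
import time

def play(fetch, adaptation_set, initial_bandwidth=1e6):
    bandwidth = initial_bandwidth  # initial guess, in bits per second
    representation = choose_representation(
        adaptation_set.representations, bandwidth)
    for index in range(len(representation.segments)):
        start = time.monotonic()
        data = fetch(representation.segments[index].url)
        elapsed = time.monotonic() - start
        # refresh the bandwidth estimate after every download...
        bandwidth = 8 * len(data) / max(elapsed, 1e-6)
        # ...and possibly switch representation for the next segment
        representation = choose_representation(
            adaptation_set.representations, bandwidth)
        yield data  # hand each segment over to the decoder
\end{verbatim}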