\section{Video\label{sote:vide}}
Accessing a remote video through the web has been a widely studied problem since the 1990s.
The Real-time Transport Protocol (RTP,~\cite{rtp-std}) was an early attempt to formalize audio and video streaming.
The protocol transfers data unidirectionally from a server to a client, and requires the server to maintain a separate session for each client.
% While this protocol can be useful in particular scenarii, such as video-conferencing, it suffers from several is a stateful protocol: the server keeps track of every user along the streaming session.
In the following years, HTTP servers became ubiquitous, and many industrial actors (Apple, Microsoft, Adobe, etc.) developed HTTP streaming systems to deliver multimedia content over the network.
In an effort to bring interoperability between all different actors, the MPEG group launched an initiative, which eventually became a standard known as DASH, Dynamic Adaptive Streaming over HTTP\@.
Using HTTP for multimedia streaming has many advantages over RTP\@.
While RTP is stateful (that is to say, the server must keep track of every client throughout the streaming session), HTTP is stateless, which makes it far easier to serve a large number of clients.
Furthermore, an HTTP server can easily be replicated at different geographical locations, allowing users to fetch data from the closest server.
This type of network architecture is called a CDN (Content Delivery Network); it speeds up HTTP requests and thus makes HTTP-based multimedia streaming more efficient.
\subsection{DASH\@: the standard for video streaming\label{sote:dash}}
Dynamic Adaptive Streaming over HTTP (DASH), or MPEG-DASH \citep{dash-std,dash-std-2}, is now a widely deployed
standard for adaptively streaming video on the web \citep{dash-std-full}, made to be simple, scalable and inter-operable.
DASH describes guidelines on how to prepare and structure video content so that streaming can adapt to varying conditions without requiring any server-side computation. The client must be able to decide which parts of the content to download based solely on an estimation of the network constraints and on the information provided in a descriptive file: the MPD\@.
\subsubsection{DASH structure}
All the content structure is described in a Media Presentation Description (MPD) file, written in the XML format.
This file is organized in four hierarchical levels: periods, adaptation sets, representations, and segments.
An MPD may contain multiple periods, each period may contain multiple adaptation sets, each adaptation set may contain multiple representations, and each representation may contain multiple segments.
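As a concrete illustration of this hierarchy, the short sketch below walks through the periods, adaptation sets and representations declared in an MPD\@. It is only a sketch: it assumes Python with the standard \texttt{xml.etree.ElementTree} module, the file name \texttt{example.mpd} is hypothetical, and the segment level is omitted for brevity; the element and attribute names, however, are those of the DASH schema.
\begin{lstlisting}[language=Python]
# Walk the Period / AdaptationSet / Representation hierarchy of an MPD.
import xml.etree.ElementTree as ET

root = ET.parse("example.mpd").getroot()  # hypothetical file name

# "{*}" matches any XML namespace (Python >= 3.8), so the DASH namespace
# does not need to be spelled out explicitly.
for period in root.findall("{*}Period"):
    print("Period", period.get("id"))
    for aset in period.findall("{*}AdaptationSet"):
        print("  AdaptationSet", aset.get("mimeType"))
        for rep in aset.findall("{*}Representation"):
            print("    Representation", rep.get("id"),
                  rep.get("bandwidth"), rep.get("height"))
\end{lstlisting}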
\paragraph{Periods.}
Periods are used to delimit content along the time axis.
They can be used to separate chapters, or to insert advertisements at the beginning, in the middle, or at the end of a video.
\paragraph{Adaptation sets.}
Adaptation sets are used to delimit content according to the format.
Each adaptation set has a mime-type, and all the representations and segments that it contains share this mime-type.
For a typical video, each period has at least one adaptation set containing the images and one adaptation set containing the sound.
It may also have an adaptation set for subtitles.
\paragraph{Representations.}
The representation level is where DASH offers the same content at different levels of quality.
For example, an adaptation set containing images has one representation for each available quality (e.g.\ 480p, 720p or 1080p).
This allows users to choose a representation and to change it during playback but, most importantly, it allows the client to adapt automatically: since the software can estimate its download speed from past downloads, it can select the optimal representation, that is, the highest quality it can request without stalling.
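The sketch below makes this selection rule explicit. It is a simplified heuristic and not a standardized algorithm: the representations are assumed to expose their declared \texttt{bandwidth} attribute, and the safety margin is hypothetical.
\begin{lstlisting}[language=Python]
# Simplified rate selection: pick the highest-bitrate representation whose
# declared bandwidth fits under the estimated throughput.
def select_representation(representations, estimated_throughput, margin=0.8):
    """representations: list of dicts with a 'bandwidth' key (bits/s).
    estimated_throughput: measured download speed (bits/s).
    margin: hypothetical safety factor absorbing estimation errors."""
    affordable = [r for r in representations
                  if r["bandwidth"] <= margin * estimated_throughput]
    if not affordable:
        # Fall back to the lowest quality rather than stalling.
        return min(representations, key=lambda r: r["bandwidth"])
    return max(affordable, key=lambda r: r["bandwidth"])
\end{lstlisting}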
\paragraph{Segments.}
At this level of the MPD, the content has already been split, but not yet finely enough to be streamed efficiently.
A representation of the images of a chapter of a movie is still a long video, and such a large file hinders streaming adaptability: if the user asks to change the quality, the system either has to wait until the file is entirely downloaded, or has to cancel the request and lose all the progress already made.
Segments are used to prevent this issue.
They are files that typically contain two to ten seconds of video, and they give the client a much finer granularity to dynamically adapt to the network and to the user.
If a user seeks somewhere else in the video, at most one segment of data is lost, and only one segment needs to be downloaded for playback to resume. The impact of the segment duration has been investigated in many works, including \citep{sideris2015mpeg, stohr2017sweet}. For example, \citet{stohr2017sweet} discuss how the segment duration affects the streaming: short segments lower the initial delay and provide the best stalling quality of experience, but make the total downloading time of the video longer because of the per-request overhead.
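This trade-off can be illustrated with a back-of-the-envelope computation: assuming a fixed cost per HTTP request (round-trip time plus headers), shorter segments multiply the number of requests and therefore the total overhead. All figures in the sketch below are purely illustrative.
\begin{lstlisting}[language=Python]
# Back-of-the-envelope illustration of the segment-duration trade-off:
# each segment costs one request overhead, so shorter segments increase
# the total time needed to download the whole video.
video_duration = 600.0    # seconds of video (illustrative)
bitrate = 4e6             # bits per second (illustrative)
throughput = 8e6          # bits per second (illustrative)
request_overhead = 0.1    # seconds per HTTP request (illustrative)

for segment_duration in (2, 4, 10):
    n_segments = video_duration / segment_duration
    download_time = video_duration * bitrate / throughput
    total = download_time + n_segments * request_overhead
    print(segment_duration, "s segments:", int(n_segments),
          "requests,", round(total), "s total")
\end{lstlisting}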
\subsubsection{Content preparation and server}
Encoding a video in DASH format consists in partitioning the content into periods, adaptation sets, representations and segments as explained above, and generating a Media Presentation Description file (MPD) which describes this organization.
Once the data are prepared, they can simply be hosted on a static HTTP server which does no computation other than serving files when it receives requests.
All the intelligence and the decision making is moved to the client side.
This is one of DASH's strengths: no powerful server is required and, since static HTTP servers are mature and efficient, all DASH clients benefit from it.
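To give an idea of how little the server has to do, the sketch below serves a directory of pre-packaged DASH content with Python's built-in \texttt{http.server} module; the directory name is hypothetical, and a production setup would of course rely on a dedicated HTTP server or a CDN\@.
\begin{lstlisting}[language=Python]
# Serve pre-packaged DASH content (MPD + segments) as plain static files;
# no streaming-specific logic runs on the server side.
import functools
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Advertise the MPD MIME type; "dash-content" is an illustrative directory.
SimpleHTTPRequestHandler.extensions_map[".mpd"] = "application/dash+xml"
handler = functools.partial(SimpleHTTPRequestHandler, directory="dash-content")
HTTPServer(("", 8000), handler).serve_forever()
\end{lstlisting}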
\subsubsection{Client side adaptation}
A client typically starts by downloading the MPD file, and then downloads segments from the different adaptation sets. While the standard precisely describes how to structure content on the server side, the client may be freely implemented to take into account the specificities of a given application.
The most important part of any DASH client implementation is called the adaptation logic. This component takes into account a set of parameters, such as network conditions (bandwidth, throughput), buffer state or segment sizes, to decide which segments should be downloaded next. Most industrial actors have their own adaptation logic, and many more have been proposed in the literature. A thorough review is beyond the scope of this state of the art, but examples include \citet{chiariotti2016online}, who formulate the problem in a reinforcement learning framework, \citet{yadav2017quetra}, who formulate it using queuing theory, or \citet{huang2019hindsight}, who use a formulation derived from the knapsack problem.
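The sketch below gives the flavour of such a component. It is a deliberately simple heuristic, not one of the published algorithms cited above: it combines a throughput estimate with the current buffer occupancy, and the thresholds and factors are hypothetical.
\begin{lstlisting}[language=Python]
# Illustrative adaptation logic: choose the bitrate of the next segment
# from a throughput estimate, damped by the current buffer occupancy.
def next_bitrate(bitrates, throughput_estimate, buffer_level,
                 low_buffer=5.0, high_buffer=20.0):
    """bitrates: available representation bitrates (bits/s).
    throughput_estimate: e.g. a moving average over past segment downloads.
    buffer_level: seconds of video currently buffered.
    low_buffer / high_buffer: hypothetical thresholds, in seconds."""
    if buffer_level < low_buffer:
        target = 0.5 * throughput_estimate   # nearly empty: be conservative
    elif buffer_level > high_buffer:
        target = 1.0 * throughput_estimate   # comfortable: be more aggressive
    else:
        target = 0.8 * throughput_estimate
    candidates = [b for b in sorted(bitrates) if b <= target]
    return candidates[-1] if candidates else min(bitrates)
\end{lstlisting}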
\subsection{DASH-SRD}
Now widely adopted for video streaming, DASH has also been adapted to various other contexts.
DASH-SRD (Spatial Relationship Description,~\citep{dash-srd}) is a feature that extends the DASH standard to allow streaming only a spatial subpart of a video to a device.
It works by encoding a video at multiple resolutions, and tiling the highest resolutions as shown in Figure~\ref{sota:srd-png}.
That way, a client can choose to download either the low resolution of the whole video or higher resolutions of a subpart of the video.
\begin{figure}[th]
\centering
\includegraphics[width=0.6\textwidth]{assets/state-of-the-art/video/srd.png}
\caption{DASH-SRD~\citep{dash-srd}\label{sota:srd-png}}
\end{figure}
For each tile of the video, an adaptation set is declared in the MPD, and a supplemental property is defined in order to give the client information about the tile.
This supplemental property contains several elements, the most important ones being the position ($x$ and $y$) and the size (width and height) of the tile relative to the full video.
An example of such a property is given in Snippet~\ref{sota:srd-xml}.
\begin{figure}[th]
\lstinputlisting[%
language=XML,
caption={MPD of a video encoded using DASH-SRD},
label=sota:srd-xml,
emph={%
MPD,
Period,
AdaptationSet,
Representation,
BaseURL,
SegmentBase,
Initialization,
Role,
SupplementalProperty,
SegmentList,
SegmentURL,
Viewpoint
}
]{assets/state-of-the-art/video/srd.xml}
\end{figure}
Essentially, this feature is a way of achieving view-dependent streaming, since the client only displays a part of the video and can avoid downloading content that will not be displayed. While Figure~\ref{sota:srd-png} illustrates how DASH-SRD can be used in the context of zoomable video streaming, the ideas developed in DASH-SRD have proven to be particularly useful in the context of 360 video streaming (see for example \citep{ozcinar2017viewport}).
This is especially interesting in the context of 3D streaming, since the same pattern arises: the user views only a part of the content at any given time.
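To give an idea of how a client can exploit this property, the sketch below parses the \texttt{value} attribute of the SRD supplemental property (a comma-separated list giving the source identifier, the $x$ and $y$ coordinates, the width and the height of the tile, optionally followed by the total width and height of the reference space) and keeps only the tiles that intersect the current viewport. The viewport coordinates and tile values are of course hypothetical.
\begin{lstlisting}[language=Python]
# Parse the SRD "value" attribute of each tile and keep the tiles that
# intersect the current viewport.
def parse_srd(value):
    # e.g. "0,1920,0,1920,1080,3840,2160":
    # source_id, x, y, width, height, then optional total width/height.
    fields = [int(v) for v in value.split(",")]
    _, x, y, w, h = fields[:5]
    return {"x": x, "y": y, "w": w, "h": h}

def intersects(tile, viewport):
    return not (tile["x"] + tile["w"] <= viewport["x"] or
                viewport["x"] + viewport["w"] <= tile["x"] or
                tile["y"] + tile["h"] <= viewport["y"] or
                viewport["y"] + viewport["h"] <= tile["y"])

# Hypothetical viewport covering the top-left quarter of a 3840x2160 video.
viewport = {"x": 0, "y": 0, "w": 1920, "h": 1080}
tiles = [parse_srd(v) for v in ("0,0,0,1920,1080,3840,2160",
                                "0,1920,0,1920,1080,3840,2160")]
to_download = [t for t in tiles if intersects(t, viewport)]
\end{lstlisting}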
% \subsection{Prefetching in video streaming}
%
% We briefly survey other research on prefetching that focuses on non-continuous interaction in other types of media.
%
% In the context of navigating in a video, a recent work in~\citep{video-bookmarks} prefetches video chunks located after bookmarks along the video timeline.
% Their work, however, focuses on changing the user behavior to improve the prefetching hit rate, by depicting the state of the prefetched buffer to the user.
% Carlier et al.\ also consider prefetching in the context of zoomable videos in an earlier work~\citep{zoomable-video}, and showed that predicting which region of videos the user will zoom into or pan to by analyzing interaction traces from users is difficult.
%
% Prefetching for navigation through a sequence of short online videos is considered in~\citep{user-generated-videos}.
% Each switch from the current video to the next can be treated as a non-continuous interaction.
% The authors proposed recommendation-aware prefetching --- to prefetch the prefix of videos from the search result list and related video list, as these videos are likely to be of interest to the user and other users from the same community.
%
% \citep{video-navigation-mpd} consider the problem of prefetching in the context of a hypervideo; non-continuous interaction happens when users click on a hyperlink in the video.
% They propose a formal framework that captures the click probability, the bandwidth, and the bit rate of videos as a markov decision problem, and derive an optimal prefetching policy.
%
% \citep{joserlin} propose Joserlin, a generic framework for prefetching that applies to any non-continuous media, but focuses on peer-to-peer streaming applications.
% They do not predict which item to prefetch, but rather focus on how to schedule the prefetch request and response.
%
% There is a huge body of work on prefetching web objects in the context of the world wide web.
% Interested readers can refer to numerous surveys written on this topic, e.g.~\citep{survey-caching-prefetching}.
%