= Video
Accessing a remote video through the web has been a widely studied problem since the 1990s.
The Real-time Transport Protocol (RTP, #cite("rtp-std")) was an early attempt to formalize audio and video streaming.
The protocol transferred data unidirectionally from a server to a client, and required the server to maintain a separate session for each client.
In the following years, HTTP servers became ubiquitous, and many industry actors (Apple, Microsoft, Adobe, etc.) developed their own HTTP streaming systems to deliver multimedia content over the network.
In an effort to bring interoperability between these different actors, the MPEG group launched an initiative that eventually became the standard known as DASH (Dynamic Adaptive Streaming over HTTP).
Using HTTP for multimedia streaming has many advantages over RTP.
While RTP is stateful (the server must keep track of every client throughout the streaming session), HTTP is stateless: each request is served independently of previous ones, which makes delivery simpler and easier to scale.
Furthermore, an HTTP server can easily be replicated at different geographical locations, allowing users to fetch data from the closest server.
This type of network architecture is called a CDN (Content Delivery Network); it reduces the latency of HTTP requests and makes HTTP-based multimedia streaming more efficient.
== DASH: the standard for video streaming
Dynamic Adaptive Streaming over HTTP (DASH), or MPEG-DASH #cite("dash-std", "dash-std-2"), is now a widely deployed
standard for adaptively streaming video on the web #cite("dash-std-full"), designed to be simple, scalable and interoperable.
DASH describes guidelines to prepare and structure video content so that streaming can adapt to varying conditions without requiring any server-side computation. The client alone decides which parts of the content to download, based on an estimation of the network constraints and on the information provided in a descriptive file: the MPD.
#heading(level: 3, numbering: none)[DASH structure]
All the content structure is described in a Media Presentation Description (MPD) file, written in XML.
This file has four layers: the periods, the adaptation sets, the representations and the segments.
An MPD is hierarchical: it contains one or more periods, each period can contain multiple adaptation sets, each adaptation set can contain multiple representations, and each representation can contain multiple segments.
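The skeletal MPD below illustrates this four-layer hierarchy; the element and attribute names follow the DASH schema, but all the values (durations, bitrates, file names) are purely illustrative.
```xml
<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="static"
     mediaPresentationDuration="PT120S">
  <Period duration="PT120S">
    <!-- One adaptation set per kind of content: images, sound, ... -->
    <AdaptationSet mimeType="video/mp4">
      <!-- One representation per quality level. -->
      <Representation id="720p" bandwidth="3000000"
                      width="1280" height="720">
        <SegmentList duration="4">
          <Initialization sourceURL="init-720p.mp4"/>
          <SegmentURL media="seg-720p-0001.m4s"/>
          <SegmentURL media="seg-720p-0002.m4s"/>
          <!-- ... one URL per segment ... -->
        </SegmentList>
      </Representation>
      <Representation id="480p" bandwidth="1500000"
                      width="854" height="480">
        <!-- Same structure, lower quality. -->
      </Representation>
    </AdaptationSet>
    <AdaptationSet mimeType="audio/mp4">
      <!-- Representations and segments for the sound. -->
    </AdaptationSet>
  </Period>
</MPD>
```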
#heading(level: 4, numbering: none)[Periods]
Periods are used to delimit content in time.
They can be used to delimit chapters, or to insert advertisements at the beginning, in the middle or at the end of a video.
#heading(level: 4, numbering: none)[Adaptation sets]
Adaptation sets are used to delimit content according to its format.
Each adaptation set has a MIME type, and all the representations and segments it contains share this MIME type.
For videos, most of the time, each period has at least one adaptation set containing the images and one adaptation set containing the sound.
It may also have an adaptation set for subtitles.
#heading(level: 4, numbering: none)[Representations]
The representation level is where DASH offers the same content at different levels of quality.
For example, an adaptation set containing images has a representation for each available quality (480p, 720p, 1080p, etc.).
This allows a user to choose a representation and to change it during the video; most importantly, since the client can estimate its download speed from the time past data took to download, it can find the optimal representation: the highest quality it can request without stalling.
#heading(level: 4, numbering: none)[Segments]
The levels described so far divide the content, but not finely enough for it to be streamed efficiently.
A representation of the images of a chapter of a movie is still a long video, and such a large file harms adaptability: if the user asks to change the quality, the system must either wait for the file to finish downloading, or cancel the request and waste all the progress already made.
Segments prevent this issue.
They typically encode files that contain two to ten seconds of video, giving the client a much finer grain at which to adapt dynamically.
If a user seeks elsewhere in the video, at most one segment of data is lost, and only one segment
of data needs to be downloaded for playback to resume. The impact of the segment duration has been investigated in many works, including #cite("sideris2015mpeg", "stohr2017sweet").
For example, #cite("stohr2017sweet") discuss how the segment duration affects streaming: short segments lower the initial delay and provide the best stalling quality of experience, but increase the total download time of the video because of per-request overhead.
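Note that segments need not be listed one by one as in the earlier sketch: DASH also lets the MPD describe them through a URL template, from which the client derives the address of any segment. The snippet below is a sketch with illustrative file names; `$Number$` and `$RepresentationID$` are standard template identifiers.
```xml
<!-- The client computes segment URLs by substituting the template
     identifiers, e.g. seg-720p-3.m4s for the third segment. -->
<SegmentTemplate initialization="init-$RepresentationID$.mp4"
                 media="seg-$RepresentationID$-$Number$.m4s"
                 duration="4" startNumber="1"/>
```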
#heading(level: 3, numbering: none)[Content preparation and server]
Encoding a video in the DASH format consists in partitioning the content into periods, adaptation sets, representations and segments as explained above, and in generating a Media Presentation Description (MPD) file that describes this organization.
Once the data are prepared, they can simply be hosted on a static HTTP server that does no computation other than serving files when it receives requests.
All the intelligence and decision making are moved to the client side.
This is one of DASH's strengths: no powerful server is required, and since static HTTP servers are mature and efficient, all DASH clients can benefit from them.
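To illustrate how little the server has to do, the sketch below serves a prepared DASH directory using nothing but the Python standard library; the directory name is an assumption, and any production static server or CDN plays exactly the same role.
```python
# Minimal static HTTP server for DASH content; every request is a
# plain file read (MPD or segment), with no per-client state.
from functools import partial
from http.server import ThreadingHTTPServer, SimpleHTTPRequestHandler

# Serve the directory holding the MPD and the segments
# ("dash-content" is an illustrative name).
handler = partial(SimpleHTTPRequestHandler, directory="dash-content")

ThreadingHTTPServer(("", 8000), handler).serve_forever()
```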
#heading(level: 3, numbering: none)[Client side adaptation]
A client typically starts by downloading the MPD file, then proceeds to download segments from the different adaptation sets. While the standard describes precisely how to structure content on the server side, the client may be freely implemented to take into account the specificities of a given application.
The most important part of any DASH client implementation is called the adaptation logic. This component takes into account a set of parameters, such as network conditions (bandwidth, throughput), buffer state or segment sizes, to decide which segments should be downloaded next. Most industrial actors have their own
adaptation logic, and many more have been proposed in the literature.
A thorough review is beyond the scope of this state of the art, but examples include #cite("chiariotti2016online"), who formulate the problem in a reinforcement learning framework, #cite("yadav2017quetra"), who formulate it using queuing theory, or #cite("huang2019hindsight"), who use a formulation derived from the knapsack problem.
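As an illustration, here is a deliberately simple throughput-based adaptation logic. The harmonic-mean estimator, the safety margin and the buffer threshold are common ingredients of rule-based heuristics, but every name and constant below is an illustrative assumption, not a standard algorithm.
```python
def estimate_throughput(history, window=5):
    """Harmonic mean of the last `window` throughput measurements
    (bits per second); robust to isolated spikes."""
    recent = history[-window:]
    return len(recent) / sum(1.0 / t for t in recent)

def choose_representation(bitrates, history, buffer_level, margin=0.8):
    """Pick the highest bitrate sustainable at `margin` times the
    estimated throughput; when the buffer is nearly empty, fall
    back to the lowest bitrate to avoid stalling."""
    if buffer_level < 2.0:  # seconds of video left in the buffer
        return min(bitrates)
    budget = margin * estimate_throughput(history)
    feasible = [b for b in sorted(bitrates) if b <= budget]
    return feasible[-1] if feasible else min(bitrates)

# Example: ~3 Mbps measured and a comfortable buffer
# -> the 1.5 Mbps representation is selected.
print(choose_representation(
    bitrates=[500_000, 1_500_000, 3_000_000],
    history=[2.8e6, 3.1e6, 3.0e6],
    buffer_level=8.0))
```
Real players refine this kind of rule with buffer occupancy targets, quality-switch smoothing and many other signals, as the references above show.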
== DASH-SRD
Now widely adopted in the context of video streaming, DASH has been adapted to various other contexts.
DASH-SRD (Spatial Relationship Description, #cite("dash-srd")) is a feature that extends the DASH standard to allow streaming only a spatial subpart of a video to a device.
It works by encoding the video at multiple resolutions, and tiling the highest resolutions as shown in @rw:srd-png.
That way, a client can choose to download either the low resolution of the whole video, or higher resolutions of a subpart of the video.
#figure(
image("../assets/related-work/video/srd.png", width: 60%),
caption: [DASH-SRD #cite("dash-srd")],
)<rw:srd-png>
For each tile of the video, an adaptation set is declared in the MPD, and a supplemental property is defined to give the client information about the tile.
This supplemental property contains several elements, but the most important ones are the position ($x$ and $y$) and the size (width and height) of the tile relative to the full video.
An example of such a property is given in @rw:srd-xml.
#figure(
align(left,
raw(
read("../assets/related-work/video/srd.xml"),
block: true,
lang: "xml",
),
),
caption: [MPD of a video encoded using DASH-SRD]
)<rw:srd-xml>
Essentially, this feature is a way of achieving view-dependent streaming: since the client displays only a part of the video, it can avoid downloading content that will not be displayed.
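As a sketch of how a client could exploit this information, the function below selects the tiles whose SRD rectangle intersects the current viewport; the dictionary layout and every name are illustrative assumptions, not part of the standard.
```python
def tiles_to_download(tiles, viewport):
    """Return the tiles whose SRD rectangle intersects the viewport.

    Tiles and viewport are dicts with keys x, y, w, h, expressed in
    the coordinate space declared by the SRD supplemental property.
    """
    def intersects(a, b):
        return (a["x"] < b["x"] + b["w"] and b["x"] < a["x"] + a["w"]
                and a["y"] < b["y"] + b["h"] and b["y"] < a["y"] + a["h"])
    return [t for t in tiles if intersects(t, viewport)]
```
The selected tiles would then be requested at a high-quality representation, while the rest of the video is covered by the low-resolution version, following the scheme described above.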
While @rw:srd-png illustrates how DASH-SRD can be used in the context of zoomable video streaming, the ideas developed in DASH-SRD have proven to be particularly useful in the context of 360 video streaming (see for example #cite("ozcinar2017viewport")).
This is especially interesting in the context of 3D streaming, where the same pattern arises: a user views only a part of the content.