From 947a32d97073b5de2709b875b6b0493489504a0c Mon Sep 17 00:00:00 2001
From: Thomas Forgione
Date: Tue, 1 Oct 2019 17:34:05 +0200
Subject: [PATCH] Update

---
 README.md                                   |  7 +---
 src/dash-3d-implementation/introduction.tex |  4 +--
 src/foreword/3d-model.tex                   | 22 ++++++------
 src/foreword/video-vs-3d.tex                | 40 ++++++++++-----------
 src/introduction/challenges.tex             | 11 +++---
 src/introduction/main.tex                   | 37 ++++++++++++-------
 src/introduction/outline.tex                |  6 ++--
 src/main.tex                                |  2 +-
 8 files changed, 68 insertions(+), 61 deletions(-)

diff --git a/README.md b/README.md
index ed4fcdc..39014bd 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,5 @@
 # Thesis

-| Link | Size | Comment |
-|----------------------------------------------------------------------------------------------------------------------------------------------------|-------|---------------------------------|
-| [![](https://img.shields.io/badge/pdf-uncompressed-green.svg)](https://gitea.tforgione.fr/tforgione-phd/phd-release/raw/branch/master/main.pdf)     | ~20MB | LaTeX output                    |
-| [![](https://img.shields.io/badge/pdf-printer-green.svg)](https://gitea.tforgione.fr/tforgione-phd/phd-release/raw/branch/master/main-printer.pdf)  | ~3MB  | compressed, good quality        |
-| [![](https://img.shields.io/badge/pdf-screen-green.svg)](https://gitea.tforgione.fr/tforgione-phd/phd-release/raw/branch/master/main-screen.pdf)    | ~1MB  | heavily compressed, low quality |
-
 This repository holds the code for the PhD thesis.
+[The PDFs can be found here.](https://tforgione.fr/phd/)
diff --git a/src/dash-3d-implementation/introduction.tex b/src/dash-3d-implementation/introduction.tex
index 1c17973..3157026 100644
--- a/src/dash-3d-implementation/introduction.tex
+++ b/src/dash-3d-implementation/introduction.tex
@@ -52,7 +52,7 @@ But the most important thing here is that since we add elements to the vector, i
 The equivalent code in Rust is in Listings~\ref{d3i:undefined-behaviour-rs} and~\ref{d3i:undefined-behaviour-rs-it}.
 \begin{figure}[ht]
     \centering
-    \begin{minipage}[b]{0.45\textwidth}
+    \begin{minipage}[c]{0.45\textwidth}
         \lstinputlisting[
             language=rust,
             caption={Rust version of Listing~\rawref{d3i:undefined-behaviour-cpp}},
@@ -60,7 +60,7 @@ The equivalent code in Rust is in Listings~\ref{d3i:undefined-behaviour-rs} and~
         ]{assets/dash-3d-implementation/undefined-behaviour.rs}
     \end{minipage}
     \quad\quad\quad
-    \begin{minipage}[b]{0.45\textwidth}
+    \begin{minipage}[c]{0.45\textwidth}
         \lstinputlisting[
             language=rust,
             caption={Rust version of Listing~\rawref{d3i:undefined-behaviour-cpp-it}},
diff --git a/src/foreword/3d-model.tex b/src/foreword/3d-model.tex
index 540909c..7e21df0 100644
--- a/src/foreword/3d-model.tex
+++ b/src/foreword/3d-model.tex
@@ -8,55 +8,55 @@ A 3D model consists in a set of data.
 \begin{itemize}
     \item \textbf{Vertices} are simply 3D points;
     \item \textbf{Faces} are polygons defined from vertices (most of the time, they are triangles);
-    \item \textbf{Textures} are images that can be applied to faces;
+    \item \textbf{Textures} are images that can be used to paint faces to add visual richness;
     \item \textbf{Texture coordinates} are information added to a face to describe how the texture should be applied on a face;
     \item \textbf{Normals} are 3D vectors that can give information about light behaviour on a face.
 \end{itemize}
-The Wavefront OBJ is probably the best format to give an example of 3D model since it describes all these elements in text format.
+The Wavefront OBJ is one of the most popular formats, and it describes all these elements in plain text.
 A 3D model encoded in the OBJ format typically consists of two files: the materials file (\texttt{.mtl}) and the object file (\texttt{.obj}).

 \paragraph{}
 The materials file declares all the materials that the object file will reference.
-Each material has a name, and can have photometric properties such as ambient, diffuse and specular colors, as well as texture maps.
+A material consists of a name and photometric properties such as ambient, diffuse and specular colors, as well as texture maps.
+Each face corresponds to a material, and a renderer can use the material's information to render the faces.
 A simple material file is shown in Listing~\ref{i:mtl}.

 \paragraph{}
-The object file declare the 3D content of the objects.
+The object file declares the 3D content of the objects.
 It declares vertices, texture coordinates and normals from coordinates (e.g.\ \texttt{v 1.0 2.0 3.0} for a vertex, \texttt{vt 1.0 2.0} for a texture coordinate, \texttt{vn 1.0 2.0 3.0} for a normal).
 These elements are numbered starting from 1.
 Faces are declared by using the indices of these elements.
 A face is a polygon with any number of vertices and can be declared in multiple manners:
 \begin{itemize}
     \item \texttt{f 1 2 3} defines a triangle face that joins the first, the second and the third vertex declared;
-    \item \texttt{f 1/1 2/3 3/4} defines a triangle similar but with texture coordinates, the first texture coordinate is associated to the first vertex, the third texture coordinate is associated to the second vertex, and the fourth texture coordinate is associated with the third vertex;
-    \item \texttt{f 1//1 2//3 3//4} defines a triangle similar but using normal instead of texture coordinates;
+    \item \texttt{f 1/1 2/3 3/4} defines a similar triangle but with texture coordinates: the first texture coordinate is associated with the first vertex, the third texture coordinate with the second vertex, and the fourth texture coordinate with the third vertex;
+    \item \texttt{f 1//1 2//3 3//4} defines a similar triangle but using normals instead of texture coordinates;
     \item \texttt{f 1/1/1 2/3/3 3/4/4} defines a triangle with both texture coordinates and normals.
 \end{itemize}
-It can include materials from a material file (\texttt{mtllib path.mtl}) and apply the materials that it declares to faces.
+An object file can include materials from a material file (\texttt{mtllib path.mtl}) and apply the materials that it declares to faces.
 A material is applied by using the \texttt{usemtl} keyword, followed by the name of the material to use.
 The faces declared after a \texttt{usemtl} are painted using the material in question.
 An example of an object file is shown in Listing~\ref{i:obj}.
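+The cube of Figure~\ref{i:cube} below gives a full example; as an even smaller illustration, the following hand-written pair of files describes a single textured triangle.
+This listing is only a sketch written for this explanation (it is not one of the asset listings referenced above); the material name \texttt{red}, the texture file \texttt{brick.png} and the file names are made up for the example:
+\begin{lstlisting}[caption={A minimal material file and object file describing one triangle (illustrative example)}]
+# materials.mtl -- declares a single material named "red"
+newmtl red
+Kd 1.0 0.0 0.0       # diffuse color, as RGB values in [0, 1]
+map_Kd brick.png     # diffuse texture map
+
+# triangle.obj -- one triangle painted with that material
+mtllib materials.mtl # include the material file above
+v 0.0 0.0 0.0        # vertex 1
+v 1.0 0.0 0.0        # vertex 2
+v 0.0 1.0 0.0        # vertex 3
+vt 0.0 0.0           # texture coordinate 1
+vt 1.0 0.0           # texture coordinate 2
+vt 0.0 1.0           # texture coordinate 3
+vn 0.0 0.0 1.0       # normal 1, shared by the three vertices
+usemtl red           # paint the following faces with "red"
+f 1/1/1 2/2/1 3/3/1  # vertex/texture/normal indices, 1-based
+\end{lstlisting}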
 \begin{figure}[th]
     \centering
-    \begin{subfigure}[b]{0.4\textwidth}
+    \begin{subfigure}[c]{0.4\textwidth}
         \lstinputlisting[
             language=XML,
             caption={An object file describing a cube},
             label=i:obj,
         ]{assets/introduction/cube.obj}
     \end{subfigure}\quad%
-    \begin{subfigure}[b]{0.4\textwidth}
+    \begin{subfigure}[c]{0.4\textwidth}
         \lstinputlisting[
             language=XML,
             caption={A material file describing a material},
             label=i:mtl,
         ]{assets/introduction/materials.mtl}
-        \vspace{0.2cm}
         \includegraphics[width=\textwidth]{assets/introduction/cube.png}
-        \caption*{A rendering of the cube}
+        \captionof{figure}{A rendering of the cube}
     \end{subfigure}
     \caption{The OBJ representation of a cube and its render\label{i:cube}}
 \end{figure}
diff --git a/src/foreword/video-vs-3d.tex b/src/foreword/video-vs-3d.tex
index fff01e1..c3ce171 100644
--- a/src/foreword/video-vs-3d.tex
+++ b/src/foreword/video-vs-3d.tex
@@ -3,7 +3,7 @@
 \section{Similarities and differences between video and 3D\label{i:video-vs-3d}}

 Contrary to what one might think, the video streaming scenario and the 3D streaming one share many similarities: at a higher level of abstraction, they are both systems that allow a user to access remote content without having to wait until everything is loaded.
-Analyzing the similarities and the differences between the video and the 3D scenarios as well as having knowledge about video streaming litterature is\todo{is key or are key?} key to developing an efficient 3D streaming system.
+Analyzing the similarities and the differences between the video and the 3D scenarios, as well as having knowledge about the video streaming literature, is key to developing an efficient 3D streaming system.

 \subsection{Data persistence}

@@ -40,7 +40,7 @@ In both cases, an algorithm for content streaming has to acknowledge those diffe
 In video streaming, most of the data (in terms of bytes) is used for images.
 Thus, the most important thing a video streaming system should do is optimize the image streaming.
 That's why, on a video on Youtube for example, there may be 6 resolutions for images (144p, 240p, 320p, 480p, 720p and 1080p) but only 2 resolutions for sound.
-This is one of the main differences between video and 3D streaming: in a 3D scene, the geometry and the texture size are approximately the same, and work to improve the streaming needs to be performed on both.
+This is one of the main differences between video and 3D streaming: in a 3D scene, geometry and texture sizes are approximately the same, and balancing the streaming between these two types of content is a key problem.

 \subsection{Chunks of data}

@@ -51,9 +51,9 @@ In mesh streaming, it can either by segmenting faces in chunks, with a certain n
 \subsection{Interaction}

 The way of interacting with the content is probably the most important difference between video and 3D.
-In a video interface, there is only one degree of freedom: the time.
+In a video interface, there is only one degree of freedom: time.
 The only thing a user can do is let the video play itself, pause or resume it, or jump to another moment in the video.
-Even though these interactions seem easy to handle, giving the best possible experience to the user is already challenging. For example, to perform these few actions, Youtube gives the user multiple options.
+Even though these interactions seem easy to handle, giving the best possible experience to the user is already challenging. For example, to perform these few actions, Youtube provides the user with multiple options.
 \begin{itemize}

     \item To navigate to another moment of the video, the user can:
         \begin{itemize}
-            \item click the timeline of the video where he wants;
+            \item click the timeline of the video where they want;
             \item press the left arrow key to move 5 seconds backwards;
             \item press the right arrow key to move 5 seconds forwards;
             \item press the \texttt{J} key to move 10 seconds backwards;
@@ -84,7 +84,7 @@ All the interactions are summed up in Figure~\ref{i:youtube-keyboard}.
 \newcommand{\playpausecontrol}{Pink}
 \newcommand{\othercontrol}{PalePaleGreen}

-\newcommand{\keystrokescale}{0.55}
+\newcommand{\keystrokescale}{0.625}
 \newcommand{\tuxlogo}{\FA\symbol{"F17C}}
 \newcommand{\keystrokemargin}{0.1}
 \newcommand{\keystroke}[5]{%
@@ -245,27 +245,27 @@ All the interactions are summed up in Figure~\ref{i:youtube-keyboard}.

     % Legend
     \begin{tikzpicture}[scale=\keystrokescale]
-        \keystrokebg{7}{8}{-6}{-5}{}{\absoluteseekcontrol};
-        \node[right=0.2cm] at (7.5, -5.55) {Absolute seek keys};
+        \keystrokebg{0}{1}{0}{1}{}{\absoluteseekcontrol};
+        \node[right=0.3cm] at (0.5, 0.5) {\small Absolute seek keys};

-        \keystrokebg{7}{8}{-7}{-6}{}{\relativeseekcontrol};
-        \node[right=0.2cm] at (7.5, -6.55) {Relative seek keys};
+        \keystrokebg{6}{7}{0}{1}{}{\relativeseekcontrol};
+        \node[right=0.3cm] at (6.5, 0.5) {\small Relative seek keys};

-        \keystrokebg{7}{8}{-8}{-7}{}{\playpausecontrol};
-        \node[right=0.2cm] at (7.5, -7.55) {Play or pause keys};
+        \keystrokebg{12}{13}{0}{1}{}{\playpausecontrol};
+        \node[right=0.3cm] at (12.5, 0.5) {\small Play or pause keys};

-        \keystrokebg{7}{8}{-9}{-8}{}{\othercontrol};
-        \node[right=0.2cm] at (7.5, -8.55) {Other keys};
+        \keystrokebg{18}{19}{0}{1}{}{\othercontrol};
+        \node[right=0.3cm] at (18.5, 0.5) {\small Other keys};

     \end{tikzpicture}
-    \caption{Youtube shortcuts\label{i:youtube-keyboard}}
+    \caption{Youtube shortcuts (white keys are unused)\label{i:youtube-keyboard}}
 \end{figure}

 Those interactions are different if the user is using a mobile device.
 \begin{itemize}
-    \item To pause a video, the user must touch the screen once to make the HUD appear and once on the pause button at the center of the screen.
+    \item To pause a video, the user must touch the screen once to make the timeline and the buttons appear, and then touch the pause button at the center of the screen.
     \item To resume a video, the user must touch the play button at the center of the screen.
     \item To navigate to another moment of the video, the user can:
     \begin{itemize}
@@ -278,16 +278,16 @@ When it comes to 3D, there are many approaches to manage user interaction.
 Some interfaces mimic the video scenario, where the only variable is the time and the camera follows a predetermined path over which the user has no control.
 These interfaces are not interactive, and can be frustrating to the user who might feel constrained.

-Some other interfaces add 2 degrees of freedom to the previous one: the user does not control the position of the camera but he can control the angle. This mimics the scenario of the 360 video.
+Some other interfaces add 2 degrees of freedom to the previous one: the user does not control the position of the camera but they can control the angle. This mimics the scenario of 360 videos.
 Finally, most of the other interfaces give at least 5 degrees of freedom to the user: 3 being the coordinates of the position of the camera, and 2 being the angles (assuming the up vector is fixed; some interfaces let the user change it, adding a sixth degree of freedom).

 \subsection{Relationship between interface, interaction and streaming}

 In both video and 3D systems, streaming affects the interaction.
-For example, in a video streaming scenario, if a user sees that the video is fully loaded, he might start moving around on the timeline, but if he sees that the streaming is just enough to not stall, he might prefer staying peaceful and just watch the video.
-If the streaming stalls for too long, the user migth seek somewhere else hoping for the video to resume, or totally give up and leave the video.
-The same types of behaviour occur in 3D streaming: if a user is somewhere in a scene, and sees more data appearing, he might wait until enough data has arrived, but if he sees nothing happens, he might leave to look for data somewhere else.
+For example, in a video streaming scenario, if a user sees that the video is fully loaded, they might start moving around on the timeline, but if they see that the streaming is barely keeping up, they might prefer to sit back and just watch the video.
+If the streaming stalls for too long, the user might seek somewhere else hoping for the video to resume, or get frustrated and leave the video.
+The same types of behaviour occur in 3D streaming: if a user is somewhere in a scene and sees more data appearing, they might wait until enough data has arrived, but if they see nothing happening, they might leave to look for data somewhere else.

 Those examples show how streaming can affect the interaction, but the interaction also affects the streaming.
 In a video streaming scenario, if a user is watching peacefully without interacting, the system just has to request the next chunks of video and display them.
diff --git a/src/introduction/challenges.tex b/src/introduction/challenges.tex
index e604b2a..13868c4 100644
--- a/src/introduction/challenges.tex
+++ b/src/introduction/challenges.tex
@@ -22,20 +22,19 @@ Before streaming content, it needs to be prepared.
 This includes but is not limited to compression and segmentation.
 One of the questions this thesis has to answer is \emph{what is the best way to prepare 3D content so that a client can benefit from it?}

-\subsection{Chunk utility}
-Once our content is prepared and split in chunks, we need to be able to rate those chunks depending on the user's position.
-A chunk that contains data in the field of view of the user should have a higher score than a chunk outside of it; a chunk that is close to the camera should have a higher score than a chunk far away from the camera, etc\ldots. An open question of this thesis is \emph{how do we determine how useful is a chunk of data depending on the user's position?}

 \subsection{Streaming policies}
-Rating the chunks is not enough, there are other contextual parameters that need to be taken into account, such as the size of a chunk, the bandwidth, the user's behaviour, etc\ldots.
-Another question that raises from this is \emph{how do we take into the context into account to decide which chunks to download?}
+Once our content is prepared and split into chunks, one needs to be able to rate those chunks depending on the user's position.
+A chunk that contains data in the field of view of the user should have a higher score than a chunk outside of it; a chunk that is close to the camera should have a higher score than a chunk far away from the camera, etc.
+This rating should also include other contextual parameters, such as the size of a chunk, the available bandwidth, the user's behaviour, etc.
+The most important question we have to answer is \emph{how do we determine which chunks need to be downloaded depending on the chunks themselves and the user's interactions?}
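+To make this rating concrete, the criteria above can be combined into a single score per chunk.
+The following sketch, in Rust (the language of the implementation presented later in this thesis), is only an illustration: the types, the visibility test and the weighting are assumptions made for this example, not the actual policies evaluated in this thesis.
+\begin{lstlisting}[language=rust, caption={A minimal sketch of a chunk utility function (illustrative only)}]
+// Hypothetical types standing in for the real data structures.
+struct Chunk { center: [f64; 3], size_bytes: u64 }
+struct Camera { position: [f64; 3] }
+
+fn distance(a: &[f64; 3], b: &[f64; 3]) -> f64 {
+    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
+}
+
+/// Higher is better: visible chunks beat invisible ones, close chunks
+/// beat distant ones, and cheap chunks beat heavy ones (utility per byte).
+fn utility(chunk: &Chunk, camera: &Camera, in_field_of_view: bool) -> f64 {
+    // A chunk outside the field of view keeps a small score, since
+    // the user may turn the camera towards it at any moment.
+    let visibility = if in_field_of_view { 1.0 } else { 0.1 };
+    visibility
+        / ((1.0 + distance(&chunk.center, &camera.position))
+            * chunk.size_bytes as f64)
+}
+\end{lstlisting}
+A streaming policy could then repeatedly download the not-yet-downloaded chunk with the highest score, recomputing the scores as the camera moves.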

 \subsection{Evaluation}
 In such systems, the two most important criteria for evaluation are quality of service and quality of experience.
 The quality of service is a network-centric metric, which considers values such as throughput.
 The quality of experience is a user-centric metric, and can only be measured by asking how users feel about a system.
-To be able to know which streaming policies, we need to know \emph{how can we compare streaming policies and evalute the impact of their parameters in terms of quality of service and quality of experience?}
+To be able to know which streaming policies perform best, one needs to know \emph{how can we compare streaming policies and evaluate the impact of their parameters in terms of quality of service and quality of experience?}

 \subsection{Implementation}
 The objective of our work is to set up a client-server architecture that answers the problems mentioned earlier (content preparation, chunk utility, streaming policies).
diff --git a/src/introduction/main.tex b/src/introduction/main.tex
index 980d31a..65407d7 100644
--- a/src/introduction/main.tex
+++ b/src/introduction/main.tex
@@ -1,19 +1,32 @@
 \chapter{Introduction\label{i}}

-\copied{}
+\fresh{}

-With the progress in data acquisition and modeling techniques, networked virtual environments, or NVE, are increasing in scale.
-For instance,~\cite{urban-data-visualisation} reported that the 3D scene for the city of Lyon takes more than 30 GB of data.
-It has become impractical to download the whole 3D scene before the user begins to navigate in the scene.
-A more common approach is to stream the required 3D content (models and textures) on demand, as the user moves around the scene.
-Downloading the required 3D content the moment the user demands it, however, leads to ``popping effect'' where 3D objects materialize suddenly in the view of the user, due to the latency between requesting for and receiving the 3D content from the server~\cite{visibility-determination}.
-Such latency can be quite high --- Varvello et al.\ reported a median of about 30 seconds for all 3D data in an avatar's surrounding to be loaded in high density Second Life regions under their experimental network conditions, due to a bottleneck at the server~\cite{second-life}.
+In recent years, 3D acquisition and modeling techniques have progressed considerably.
+Recent software such as \href{https://alicevision.org/#meshroom}{Meshroom} uses \emph{structure from motion} and \emph{multi-view stereo} to infer a 3D model from a set of photographs.
+More and more devices are specifically built to obtain 3D data: some, such as Lidar devices, are expensive and provide very precise information, while cheaper ones, such as the Kinect, obtain coarser data.
+Thanks to these techniques, more and more 3D data is becoming available.
+These models can serve multiple purposes: for example, they can be 3D printed, which can reduce the production cost of some pieces of hardware or enable the creation of new objects, but most uses revolve around visualisation.
+For example, 3D models can be used in augmented reality to give users feedback that helps workers with complex tasks, but also in fashion (\emph{Fitting Box}, for example, is a company that develops software to virtually try on glasses).
+3D acquisition and visualisation are also useful for preserving cultural heritage (software such as Google Heritage or 3DHop are examples of this), or for letting users navigate in a city (as in Google Earth or Google Maps in 3D).
+\href{https://sketchfab.com}{Sketchfab} is an example of a website that allows users to share their 3D models and visualise the models of other users.
+In most 3D visualisation systems, the 3D data needs to be transmitted to a terminal before the user can visualise it.
+The improvements in the acquisition setups described above lead to 3D models of increasing quality, and of increasing size in bytes as well.
+Simply downloading the 3D content and waiting until it is fully downloaded before letting the user visualise it is no longer a satisfactory solution, and streaming needs to be performed.
+In this thesis, we are especially interested in the navigation and streaming of large 3D scenes, such as districts or whole cities.

+% With the progress in data acquisition and modeling techniques, networked virtual environments, or NVE, are increasing in scale.
+% For instance,~\cite{urban-data-visualisation} reported that the 3D scene for the city of Lyon takes more than 30 GB of data.
+% It has become impractical to download the whole 3D scene before the user begins to navigate in the scene.
+% A more common approach is to stream the required 3D content (models and textures) on demand, as the user moves around the scene.
+% Downloading the required 3D content the moment the user demands it, however, leads to ``popping effect'' where 3D objects materialize suddenly in the view of the user, due to the latency between requesting for and receiving the 3D content from the server~\cite{visibility-determination}.
+% Such latency can be quite high --- Varvello et al.\ reported a median of about 30 seconds for all 3D data in an avatar's surrounding to be loaded in high density Second Life regions under their experimental network conditions, due to a bottleneck at the server~\cite{second-life}.
+%
+% For a smoother user experience, NVE typically prefetch 3D content, so that a 3D object is readily available for rendering when the object falls into the view of the user.
+% Efficient prefetching, however, requires the client or the server to predict where the user would navigate to in the future and retrieve the corresponding 3D content before the user reaches there.
+% In a typical scenario, users navigate along a continuous path in a NVE, leading to a significant overlap between the 3D content visible from the user's known current position and possible next positions (i.e., \textit{spatial data locality}).
+% Furthermore, there is a significant overlap between the 3D content visible from the current point in time to the next point in time (i.e., \textit{temporal data locality}).
+% Both forms of locality lead to content overlaps, thus making a correct prediction easier and a wrong prediction less costly. 3D content overlaps are particularly common in a NVE with open space, such as a 3D archaeological site or a 3D city.

 \resetstyle{}
diff --git a/src/introduction/outline.tex b/src/introduction/outline.tex
index 2a31abf..89d5941 100644
--- a/src/introduction/outline.tex
+++ b/src/introduction/outline.tex
@@ -3,7 +3,7 @@
 First, in Chapter~\ref{f}, we give some preliminary information required to understand the types of objects we are manipulating in this thesis.
 We then proceed to compare 3D and video content: surprisingly, video and 3D share many problems, and analysing them gives inspiration for building a 3D streaming system.

-In Chapter~\ref{sote}, we present a review of the state of the art on the fields that we are interesting in.
+In Chapter~\ref{sote}, we present a review of the state of the art in multimedia interaction and streaming.
 This chapter starts with an analysis of the video streaming standards.
 Then it reviews the different manners of performing 3D streaming.
 The last section of this chapter focuses on 3D interaction.

@@ -12,12 +12,12 @@ Then, in Chapter~\ref{bi}, we present our first contribution: an in-depth analys
 We first develop a basic interface for navigating in 3D and we introduce 3D objects called \emph{bookmarks} that help users navigate in the scene.
 We then present a user study that we conducted on 50 people, which shows that bookmarks have a great impact on how easy it is for a user to perform tasks such as finding objects.
 % Then, we setup a basic 3D streaming system that allows us to replay the traces collected during the user study and simulate 3D streaming at the same time.
-Finally, we analyse how the presence of bookmarks impacts the streaming, and we propose and evaluate a few streaming policies that rely on precomputations that can be made thanks to bookmarks and that can increase the quality of experience.
+We analyse how the presence of bookmarks impacts the streaming, and we propose and evaluate a few streaming policies that rely on precomputations enabled by bookmarks and that can increase the quality of experience.

 In Chapter~\ref{d3}, we present the most important contribution of this thesis: DASH-3D.
 DASH-3D is an adaptation of the video streaming standard to 3D streaming.
 We first describe how we adapt the concepts of DASH to 3D content, including the segmentation of content.
-We then define utilty metrics that associates score to each chunk depending on the camera's position.
+We then define utility metrics that associate a score to each chunk depending on the camera's position.
 Then, we present a client and various streaming policies based on our utilities that can benefit from the DASH format.
 We finally evaluate the different parameters of our client.
diff --git a/src/main.tex b/src/main.tex
index 75db5b5..ce5fb07 100644
--- a/src/main.tex
+++ b/src/main.tex
@@ -1,6 +1,6 @@
 \RequirePackage{fix-cm}
 \documentclass[
-    fontsize=10pt,
+    fontsize=11pt,
     paper=a4,
     pagesize,
     bibliography=totoc,