Youtube From Scratch (Part 2)

#tech

This is a direct sequel to my original post on building Youtube from scratch, which covers how Youtube works under the hood and how we might replicate Youtube-like performance for hosted video content. At the end of that post, I remarked that beating Youtube video performance is crushingly difficult without extensive resources. I ended up hosting all my videos through Bunny (not Bunny Stream) which supports range requests and breaks videos into 5 MB chunks for faster delivery. This provides great performance for my needs.

In this post, I want to expand on my video hosting solution and cover the user-facing aspects of Youtube I neglected to address: user interface. Can we replicate Youtube's video player from scratch?

You can find the final video player being used on all of my streams.

Observations ¶

Let's gather what we know about the existing HTML video players and what they can accomplish. HTML video has a default player for all major browsers, but a lot of key controls we want to replicate are missing. Most offer the following:

  • a play and pause button
  • a progress bar
  • the current timestamp
  • a fullscreen button
  • a volume slider

Although this is a nice starting point, it doesn't account for some unique features of Youtube I have come to take for granted:

  • hovering over the progress bar to seek
  • displaying a preview thumbnail while seeking
  • a better loading indicator
  • auto-hiding controls on mobile devices

In addition, I want to propose three more requirements we must meet:

  • the player must be accessible. This is the bare minimum requirement for any website.
  • the player must have a non-JavaScript fallback. My entire website respects non-JavaScript users and this video player should align with that goal.
  • the player must be self-contained. It should be one code file that is "plug and play".

These are our requirements for MVP. Let's see what we can accomplish.

Web Components ¶

Regrettably, there are no standard CSS selectors that style native <video /> controls. We can, however, follow MDN's guide to video player styling and programmatically create controls on top of <video /> with JavaScript. This means non-JavaScript users will see a plain HTML video element and JavaScript users will see our dynamic video controls, meeting our requirement for a non-JavaScript fallback. For the fallback to be usable, the <video /> in the markup should carry the controls attribute, which our script turns off once it takes over.

Instead of writing our video control logic in a standard JS script, we can make our solution more extensible by writing a web component to wrap our video element. Web components are fantastically robust out of the box because browsers treat an undefined custom element as a generic inline container with no behavior of its own, so its children still render normally. Imagine we have this HTML:

<style>
  wc-yt-video {
    background: red;
  }
</style>
<div>
  <wc-yt-video>
    <video src="input.mp4">
      Your browser does not support HTML video.
    </video>
  </wc-yt-video>
</div>

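If your browser does not support web components or cannot load the script that defines wc-yt-video, the wrapper is inert and the page behaves roughly as if we had written the following HTML (strictly speaking, the wc-yt-video element and any CSS rules targeting it remain, but the video inside works as a normal HTML video):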

<style>
</style>
<div>
  <video src="input.mp4">
    Your browser does not support HTML video.
  </video>
</div>

Web components are an amazing solution for creating progressively enhanced islands of interactivity in websites. I'm surprised they get such a bad rep in the frontend engineering realm.

Let's start by defining a custom element in our script.

<div>
  <wc-yt-video>
    <video src="input.mp4">
      Your browser does not support HTML video.
    </video>
  </wc-yt-video>
</div>
<script>
  class WCYTVideo extends HTMLElement {
    constructor() {
      super();
    }

    connectedCallback() {
      console.log("custom element");
    }
  }
  customElements.define("wc-yt-video", WCYTVideo);
</script>

Web components are standard JS classes that become real HTML elements via customElements.define(). Once defined, they can be used like any other element in the DOM. Because they're constructed with vanilla JS, we can define custom methods on web components if we want to expose new functionality:

class MyComponent extends HTMLElement {
  fetchMyData() { /* ... */ }
}
customElements.define("my-component", MyComponent);

document.querySelector("my-component").fetchMyData();

Or we can create a reusable chunk of HTML:

<my-component data-name="Bob" data-age="32"></my-component>
<my-component data-name="Alice" data-age="28"></my-component>
<script>
  class MyComponent extends HTMLElement {
    constructor() {
      super();
    }

    connectedCallback() {
      this.innerHTML = `
        <div>
          <p>Hi, my name is <strong>${this.dataset.name}</strong>!</p>
          <p>My age is ${this.dataset.age}.</p>
        </div>
      `;
    }
  }
  customElements.define("my-component", MyComponent);
</script>

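Although web components have constructors, it is recommended to perform all setup in connectedCallback, because the element's attributes and children are not guaranteed to be available when the constructor runs.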
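
One thing to keep in mind is that connectedCallback fires every time the element is inserted into a document, not just the first time. Below is a minimal sketch of a run-once guard, assuming a hypothetical SetupOnceElement component (the name, the initialized flag, and the setupCount counter are all made up for illustration and are not part of the video player). The empty base class fallback is only there so the sketch can also run outside a browser.

```javascript
// Fall back to an empty base class so the sketch also runs outside a browser.
const BaseElement = typeof HTMLElement === "undefined" ? class {} : HTMLElement;

// Hypothetical component showing a run-once guard around setup.
class SetupOnceElement extends BaseElement {
  constructor() {
    super();
    this.initialized = false;
    this.setupCount = 0;
  }

  connectedCallback() {
    // connectedCallback fires on every insertion into a document,
    // so skip setup after the first run.
    if (this.initialized) {
      return;
    }
    this.initialized = true;
    this.setupCount += 1;
  }
}

// Register the element only in a browser, and only once.
if (typeof customElements !== "undefined" && !customElements.get("setup-once")) {
  customElements.define("setup-once", SetupOnceElement);
}
```

The customElements.get() check is optional, but it avoids an error if the same script ends up being loaded twice on a page.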

Shadow DOM ¶

There are two general kinds of web components: web components that inherit information from the main DOM, and web components that are isolated from the main DOM. We'll be constructing the latter using a shadow DOM because we don't want individual webpage code to interfere with our video player styling or event handling. The shadow DOM is a hidden or detached DOM connected to the regular DOM. You can think of the shadow DOM as an <iframe> in its isolation of content but with key differences.

First, shadow DOMs are less restrictive than iframes. iframes are isolated for security whereas shadow DOMs are isolated for encapsulation. For example, you can traverse an open shadow DOM with JavaScript and use it like any other DOM:

const shadowDOM = document.querySelector("div").shadowRoot;
const child = shadowDOM.getElementById("shadow-child");

Second, shadow DOMs can be constructed directly in an HTML document with HTML or JavaScript. Here is how to create a shadow DOM using the <template /> element:

<div id="parent">
  <template shadowrootmode="open">
    <span>I'm better than Sonic</span>
  </template>
</div>

Alternatively, the same shadow DOM can be constructed in JavaScript:

const shadowDOM = document.getElementById("parent").attachShadow({
  mode: "open",
});
const child = document.createElement("span");
child.textContent = "I'm better than Sonic";
shadowDOM.appendChild(child);

The shadow DOM allows us to create a video player component without worrying about external scripts affecting the controls or external stylesheets interfering with the UI. However, this also means we'll need to manually write all styling and logic ourselves. Let's start with basic styling.

Programmatic CSS ¶

If we want our video player to be self-contained in a single JavaScript file, we need a way to construct CSS styles using JavaScript. Fortunately, this can be done via CSSStyleSheet() and adoptedStyleSheets. We can attach a stylesheet constructed in JavaScript to our shadow DOM:

const stylesheet = new CSSStyleSheet();
stylesheet.replaceSync(`
:host {
  position: relative;
  display: block;
  overflow: hidden;
  border-radius: 8px;
}
`);

const shadow = this.attachShadow({ mode: "open" });
shadow.adoptedStyleSheets = [stylesheet];

shadow.appendChild(video);

This constructs a CSS stylesheet, creates a shadow DOM on the web component, adopts the stylesheet into that shadow DOM, then moves the video element (the <video /> wrapped by our component) into it. replaceSync lets us write any CSS we want for our elements. The :host selector applies styles to the web component itself. We hide overflow so the video's corners always appear rounded.

JS DOM Manipulation ¶

Now that we understand the basic building blocks of our video player, we can start constructing HTML elements to make a functional video player. We want to create a user interface that sits on top of our HTML video.

const controlbox = document.createElement("div");
controlbox.id = "controlbox";
controlbox.innerHTML = `
<div id="viewport" tabindex="0">
  <svg aria-hidden="true" id="loading" height="100" width="100" viewBox="0 0 120 120">
    <path d="M60 30 a 30 30 270 1 1 -30,30" />
  </svg>
</div>
<div id="floatbar">
  <div
    id="progressbox"
    role="slider"
    tabindex="0"
    aria-label="video progress"
    aria-valuemin="0"
    aria-valuemax="0"
    aria-valuenow="0"
  ></div>
  <div id="controls">
    <div class="ctrlside">
      <button id="play" aria-label="play video">
        <svg aria-hidden="true" viewBox="0 0 28 28" height="28" width="28">
          <path d="M6 5 L22 14 L6 23 Z" fill="#fff" />
        </svg>
        <svg aria-hidden="true" viewBox="0 0 28 28" height="28" width="28">
          <rect x="6" y="5" width="6" height="18" fill="#fff" />
          <rect x="16" y="5" width="6" height="18" fill="#fff" />
        </svg>
      </button>
      <input id="volume" type="range" min="0" max="1" step="0.05" aria-label="volume" />
      <span id="timestamp" tabindex="0"></span>
    </div>
    <div class="ctrlside">
      <button id="captions" aria-label="toggle captions" disabled>
        <svg aria-hidden="true" viewBox="0 0 28 28" height="28" width="28">
          <rect x="2" y="4" rx="4" ry="4" width="24" height="20" />
          <text x="6" y="18">cc</text>
          <path d="M28 0 L0 28Z" />
        </svg>
      </button>
      <button id="fullscreen" aria-label="toggle fullscreen">
        <svg aria-hidden="true" viewBox="0 0 28 28" height="24" width="24">
          <path d="M3 11 L3 3 L11 3 M17 3 L25 3 L25 11 M25 17 L25 25 L17 25 M11 25 L3 25 L3 17 M0 0Z" />
        </svg>
      </button>
    </div>
  </div>
</div>
`;

const video = this.querySelector("video");
video.controls = false;
video.tabIndex = -1;

shadow.appendChild(video);
shadow.appendChild(controlbox);

There are a lot of moving parts here, so let's break this down. Below is a diagram of the core HTML elements to help you conceptualize each piece:

shadow DOM
  <video />
  #controlbox
    #viewport
    #floatbar
      #progressbox
      #controls

First, we create a new HTML element, #controlbox. This is a div that will house all of our video controls. To make our lives easier, we can write raw HTML children to our control box by setting innerHTML. This saves us from writing document.createElement() for every element and gives us a clear view of the resulting HTML without having to remember where and how elements are nested. innerHTML may be slightly more expensive than a series of document.createElement() and appendChild() calls, but the cost is negligible because we only do this once per video player.

Next, we construct #floatbar, then append #progressbox and #controls as children. #floatbar will handle the visibility of our main control bar. We also have #viewport which captures any keyboard controls registered from the main video viewer (for example, pressing the space bar to pause or play the video).

Finally, we append our <video /> and #controlbox elements to our shadow DOM as children. An interesting feature of appendChild() is that if the element already exists in the DOM, appendChild() will move the original element to the new position instead of copying it. This allows us to take the video element wrapped by our web component and move it to where we need it.

Notice that we also turn off controls and tab indexing. We don't want to show two sets of video controls and we don't want users to be able to keyboard navigate to the video element. Next, we'll add event listeners to hook everything up.
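
As a preview of that wiring, here is one way the keyboard handling on #viewport could be organized: a lookup table from key to action name. This is only a sketch. The specific bindings and the action names (togglePlay, seekBackward, and so on) are assumptions for illustration, not necessarily what the final player uses.

```javascript
// Hypothetical key bindings; the real player may use different ones.
const KEY_ACTIONS = {
  " ": "togglePlay", // KeyboardEvent.key for the space bar is a single space
  k: "togglePlay",
  f: "toggleFullscreen",
  ArrowLeft: "seekBackward",
  ArrowRight: "seekForward",
};

// Returns the action name for a KeyboardEvent.key value, or null if unbound.
function keyToAction(key) {
  return Object.prototype.hasOwnProperty.call(KEY_ACTIONS, key)
    ? KEY_ACTIONS[key]
    : null;
}
```

A keydown listener on #viewport could then call keyToAction(event.key), run the matching handler, and call event.preventDefault() only when an action was found, so unbound keys keep their default browser behavior.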

Event Listeners ¶

Now we can create event listeners to capture button clicks and mouse movements. Similar to React, we need to properly set up and tear down event listeners in lifecycle callbacks. Events should be added in connectedCallback() and removed in disconnectedCallback() to prevent memory leaks. In modern browsers, however, a listener only leaks when it is attached to something that outlives our component, such as document. Because we will not be attaching any event listeners to the document or other elements outside our web component, we don't actually need to remove any of them. If our web component is removed from the DOM, its event listeners and the elements they are listening to will all be garbage collected. This makes it much easier for us to manage our event listeners.
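
If the player ever did need a listener outside the component (say, on document), one way to keep teardown simple is an AbortController: every listener registered with its signal is removed by a single abort() call. The sketch below is an assumption about how that could look, not something taken from the player's source. ListenerScope is a hypothetical name, and a plain EventTarget stands in for document so the example can run outside a browser.

```javascript
// Hypothetical helper: groups listeners so one call removes them all.
class ListenerScope {
  constructor() {
    this.controller = new AbortController();
  }

  // Register a listener that is tied to this scope's abort signal.
  listen(target, type, handler) {
    target.addEventListener(type, handler, { signal: this.controller.signal });
  }

  // Remove every listener registered through listen().
  dispose() {
    this.controller.abort();
  }
}

// Stand-in for `document`. In the component, listen() would be called in
// connectedCallback() and dispose() in disconnectedCallback().
const fakeDocument = new EventTarget();
const scope = new ListenerScope();
let keydownCount = 0;

scope.listen(fakeDocument, "keydown", () => {
  keydownCount += 1;
});

fakeDocument.dispatchEvent(new Event("keydown")); // handler runs
scope.dispose();
fakeDocument.dispatchEvent(new Event("keydown")); // handler no longer runs
```

Since an aborted controller cannot be reused, a component that can be reconnected would create a fresh ListenerScope each time connectedCallback runs.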

In addition to explicit click and mouse events, we also need to capture events that occur on the video itself. What happens when the video begins buffering? What about when the video stops buffering? What about when the video is paused?

Let's set up some basic listeners based on the descriptions of the video events.

const loading = shadow.getElementById("loading");

const setLoadingOff = () => {
  loading.style.display = "none";
};
const setLoadingOn = () => {
  loading.style.display = "block";
};

video.addEventListener("canplay", setLoadingOff);
video.addEventListener("canplaythrough", setLoadingOff);
video.addEventListener("playing", setLoadingOff);
video.addEventListener("seeked", setLoadingOff);
video.addEventListener("seeking", setLoadingOn);
video.addEventListener("waiting", setLoadingOn);

video.addEventListener("durationchange", () => {
  // video metadata was loaded, the duration changed
});
video.addEventListener("timeupdate", () => {
  // the video progress position changed
});
video.addEventListener("play", () => {
  // the video is playing
})
video.addEventListener("pause", () => {
  // the video is paused
})
video.addEventListener("volumechange", () => {
  // the volume changed
});
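
The six loading listeners above all do one of two things, so they can also be written as a small table. This is just an alternative sketch: bindLoadingIndicator is a hypothetical helper name, and it accepts any EventTarget plus any object with a style property so it is easy to try without a real <video /> element.

```javascript
// Events after which the spinner should be hidden or shown,
// taken from the listeners registered above.
const LOADING_OFF_EVENTS = ["canplay", "canplaythrough", "playing", "seeked"];
const LOADING_ON_EVENTS = ["seeking", "waiting"];

// Hypothetical helper: `video` is any EventTarget (normally the <video />
// element) and `loading` is any object with a `style` property.
function bindLoadingIndicator(video, loading) {
  const setDisplay = (value) => () => {
    loading.style.display = value;
  };
  for (const type of LOADING_OFF_EVENTS) {
    video.addEventListener(type, setDisplay("none"));
  }
  for (const type of LOADING_ON_EVENTS) {
    video.addEventListener(type, setDisplay("block"));
  }
}
```

In the component this would be called once as bindLoadingIndicator(video, loading).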

There are a lot of events to capture. Rather than bore you with all of them, I'll skip to some of the interesting ones.

timeupdate ¶

When timeupdate occurs, we want to update both the progress bar and the timestamp printed on the controls. However, all we have is a video currentTime and duration (both in seconds). How can we modify these values to fit our needs?

The progress bar can be solved by including a dummy <div /> (#progressbar in the code below) that represents the actual progress. Every time the video's progress position changes, we set the width of this element to video.currentTime / video.duration as a percentage. Then we update the real progress bar's ARIA attributes to match.

const progressbox = shadow.getElementById("progressbox");
const progressbar = shadow.getElementById("progressbar");

progressbar.style.width = (video.currentTime / video.duration * 100) + "%";
progressbox.setAttribute("aria-valuemax", video.duration);
progressbox.setAttribute("aria-valuenow", video.currentTime);

Problem solved!
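
One edge case worth guarding against: video.duration is NaN until the metadata has loaded and Infinity for live streams, which would produce a width like "NaN%". Here is a minimal sketch of a guard, assuming a hypothetical progressPercent helper that is not part of the original code.

```javascript
// Hypothetical helper: returns the progress bar width as a number from 0 to 100.
function progressPercent(currentTime, duration) {
  // duration is NaN before metadata loads and Infinity for live streams
  if (!Number.isFinite(duration) || duration <= 0) {
    return 0;
  }
  const ratio = currentTime / duration;
  // clamp in case currentTime falls slightly outside the 0..duration range
  return Math.min(Math.max(ratio, 0), 1) * 100;
}
```

The width assignment then becomes progressbar.style.width = progressPercent(video.currentTime, video.duration) + "%".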

What about our timestamp? We could modulo the number of seconds to get minutes and hours but this solution ends up being a bit messy. Instead, we'll use a trick to let JS Date handle this for us.

const currentSec = video.currentTime; // 3000 seconds
const currentDate = new Date(currentSec * 1000).toISOString(); // "1970-01-01T00:50:00.000Z"
const timeStart = currentDate.indexOf("T") + 1; 
const currentTimestamp = `${currentDate.substring(timeStart, timeStart + 8)}`; // "00:50:00"
progressbox.setAttribute("aria-valuetext", currentTimestamp);

We convert the video's current time to milliseconds and pass it to Date, then convert the result to an ISO 8601 date string. Date accepts a numeric argument, which it interprets as the number of milliseconds since January 1, 1970 UTC. This means that when we convert the date to an ISO string, the time portion is already our properly formatted timestamp.

Note that this trick won't work for videos longer than 24 hours, but I've never seen a video longer than 24 hours and I don't plan to stream over 24 hours at once.
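
For comparison, here is roughly what the arithmetic version looks like. It is a sketch of the alternative rather than what the player uses, with formatTimestamp as a hypothetical helper name. The one thing it buys us is that hours keep counting past 23.

```javascript
// Hypothetical helper: formats seconds as "HH:MM:SS" without relying on Date,
// so hours keep counting past 23.
function formatTimestamp(totalSeconds) {
  // treat NaN, Infinity, and negative values as zero
  const safe =
    Number.isFinite(totalSeconds) && totalSeconds > 0
      ? Math.floor(totalSeconds)
      : 0;
  const hours = Math.floor(safe / 3600);
  const minutes = Math.floor((safe % 3600) / 60);
  const seconds = safe % 60;
  return [hours, minutes, seconds]
    .map((part) => String(part).padStart(2, "0"))
    .join(":");
}

formatTimestamp(3000); // "00:50:00"
formatTimestamp(90061); // "25:01:01"
```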

play/pause ¶

When we switch between playing and pausing the video, we want to switch between the different play and pause icons. Instead of creating and removing SVG icons in the DOM, we can use CSS to toggle between states. In the HTML, we render both icons, then use a data- attribute selector to choose which one is shown.

<button id="play" aria-label="play video">
  <svg aria-hidden="true" viewBox="0 0 28 28" height="28" width="28">
    <path d="M6 5 L22 14 L6 23 Z" fill="#fff" />
  </svg>
  <svg aria-hidden="true" viewBox="0 0 28 28" height="28" width="28">
    <rect x="6" y="5" width="6" height="18" fill="#fff" />
    <rect x="16" y="5" width="6" height="18" fill="#fff" />
  </svg>
</button>

#play > svg { display: none; }
#play[data-play=false] > svg:first-child {
  display: block;
}
#play[data-play=true] > svg:last-child {
  display: block;
}

const play = shadow.getElementById("play");
video.addEventListener("play", () => {
  play.dataset.play = true;
});
video.addEventListener("pause", () => {
  play.dataset.play = false;
});

When we set data-play="true", the button displays the second icon (pause). When data-play="false", it switches back to the first icon (play). This conditional rendering trick makes toggling the button's visual state a single line of JavaScript.
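
Since accessibility is one of our requirements, the button's aria-label could be kept in sync the same way. The sketch below is an assumption on my part: playButtonState and syncPlayButton are hypothetical helpers, and the "pause video" label is made up to mirror the existing "play video" one.

```javascript
// Hypothetical helper: derives the button's attributes from video.paused.
function playButtonState(paused) {
  return {
    // dataset values are always stored as strings
    play: paused ? "false" : "true",
    label: paused ? "play video" : "pause video",
  };
}

// Applies the state to any object shaped like the #play button.
function syncPlayButton(button, paused) {
  const state = playButtonState(paused);
  button.dataset.play = state.play;
  button.setAttribute("aria-label", state.label);
}
```

Both the play and pause listeners could then call syncPlayButton(play, video.paused). Calling it once during setup would also give the button an initial icon, since with the CSS above neither SVG is shown until data-play is set.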

Fullscreen API ¶

To make the video player fullscreen, browsers provide a Fullscreen API that allows us to make any element fill the screen with element.requestFullscreen(). Similarly, we can exit fullscreen using document.exitFullscreen(). To make our video player fill the screen we can call these methods on our web component.

const toggleFullscreen = () => {
  if (document.fullscreenElement !== null) {
    document.exitFullscreen();
  } else {
    this.requestFullscreen();
  }
};

const fullscreen = shadow.getElementById("fullscreen");
fullscreen.addEventListener("click", toggleFullscreen);
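
Both requestFullscreen() and exitFullscreen() return promises, and the request can be rejected (for example, when fullscreen is not permitted for the element). Here is a hedged variant that handles the rejection. It takes the document and element as parameters purely so it can be exercised with stand-in objects; that signature is my own assumption and not how the player's source is written.

```javascript
// Hypothetical variant of toggleFullscreen that takes its dependencies as
// arguments: `doc` is normally `document`, `element` is the web component.
async function toggleFullscreen(doc, element) {
  try {
    if (doc.fullscreenElement) {
      await doc.exitFullscreen();
    } else {
      await element.requestFullscreen();
    }
    return true;
  } catch (error) {
    // requestFullscreen() rejects when fullscreen is unavailable or denied
    return false;
  }
}
```

Inside connectedCallback the click handler would become fullscreen.addEventListener("click", () => toggleFullscreen(document, this)), and the boolean result could be used to leave the button state untouched when the request fails.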

Hover Thumbnails ¶

One of Youtube's unique features we want to implement is the thumbnail previews that display while hovering over the progress bar. How can we accomplish this?

Let's take a step back and make some observations on how this works. When you hover over a specific point on the progress bar, Youtube displays an approximate video frame thumbnail of that timestamp in the video.

  • The thumbnail is not exact. Sometimes the same thumbnail displays for a 5-10 second segment.
  • The thumbnail is not high quality. It's acceptable to be slightly blurry.
  • The thumbnail is very performant. We should not see any buffering time for the thumbnails to load.

As a first attempt, we might try taking a snapshot of every frame of the video. Then we can downscale these and remove some of them (we only need one frame per second at a maximum). While this could be performant, it's not very ergonomic. It's not easy to export and store thousands of small photos. In addition, how do we determine which photo to display at any given timestamp?

Instead of individual photos, why not use an entire video? When a user hovers over the progress bar, we can display a second <video /> element and set its currentTime. If the video is small enough, loading individual frames should be immediate.

Let's compress our stream to be as small as possible:

ffmpeg $(
  echo "-i stream.mp4"

  # no audio necessary for a visual preview
  echo "-an"

  # scale down a 1080p60 video to a tiny video with 1 frame every 5 seconds
  echo "-vf scale=192x108,fps=0.2"

  # use AV1 codec for efficiency
  echo "-c:v libsvtav1"
  echo "-preset 1"

  # lower quality, we don't care too much
  echo "-crf 35"

  echo "thumbnails.mp4"
)

This can bring a 9 GB stream down to 1-2 MB in thumbnails. Even for lower end devices this means almost instantaneous loading. We can set this as an optional data attribute to pass to the web component.

<wc-yt-video data-thumbs="thumbnails.mp4">
  <video src="input.mp4">
    Your browser does not support HTML video.
  </video>
</wc-yt-video>

const html = `
  <div id="progresshover">
    ${
      this.dataset.thumbs
        ? `<video tabindex="-1" id="progresshoverthumb" src="${this.dataset.thumbs}"></video>`
        : ""
    }
  </div>
`;

Then, when hovering over a timestamp:

// on hover
const progresshoverthumb = shadow.getElementById("progresshoverthumb");
if (progresshoverthumb) {
  progresshoverthumb.currentTime = thumbTime - (thumbTime % 5);
}
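
The snippet above assumes we already have thumbTime. Here is a sketch of how it could be derived from the pointer position over the progress bar. hoverTime and snapToThumbnail are hypothetical helpers; barLeft and barWidth are assumed to come from progressbox.getBoundingClientRect(), and the 5 second interval matches the fps=0.2 setting in the ffmpeg command.

```javascript
// Hypothetical helper: maps a pointer position over the progress bar to a
// video time in seconds.
function hoverTime(clientX, barLeft, barWidth, duration) {
  if (!Number.isFinite(duration) || barWidth <= 0) {
    return 0;
  }
  const ratio = (clientX - barLeft) / barWidth;
  // clamp so positions just outside the bar map to the start or end
  return Math.min(Math.max(ratio, 0), 1) * duration;
}

// Snaps a time to the start of its thumbnail interval.
// Same idea as `thumbTime - (thumbTime % 5)`, written with Math.floor.
function snapToThumbnail(time, interval = 5) {
  return Math.floor(time / interval) * interval;
}
```

In a mousemove listener on #progressbox this might be used as: const rect = progressbox.getBoundingClientRect(); const thumbTime = hoverTime(event.clientX, rect.left, rect.width, video.duration); followed by progresshoverthumb.currentTime = snapToThumbnail(thumbTime).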

Results ¶

After putting all the core pieces together (with a bit of JavaScript glue), here is the result! Unlike Youtube, this solution is plug and play, non-JavaScript compatible, and is less than 600 lines of JavaScript code. You can find the source code here - feel free to use it in your own projects.