Youtube From Scratch

I've always wondered how Youtube works under the hood. Sure, they have a server hosting your videos, but how do they make it so fast? You have billions of video files from all over the world at your fingertips and you can load and start watching any of them in less than a second. How does it scale so well?

I was motivated to dig into this when I heard the announcement of Elden Ring's DLC Shadow of the Erdtree back in early 2023. I wanted to stream myself playing the game and upload 1080p recordings of those streams to this website, but performance was my primary concern. I spent the next few months trying to replicate Youtube's raw performance and quality for my streams. The conclusion I reached was that Youtube is practically unbeatable at its own game, so I made some acceptable tradeoffs to achieve the results I wanted. You can see the fruits of my labor here. I'm happy with my stream setup even if it isn't a perfect Youtube replica. I'll explain why in more detail below.

How can we build our own Youtube from scratch?

Observations ¶

To fully understand the features and techniques used by Youtube, I need to state a few observations and basic concepts to make sure we're all on the same page and lay the groundwork for our implementation. These observations might seem painfully obvious but we need to explicitly acknowledge them in order to explain the reasoning behind some of the technical decisions we'll make below.

Observation 1: There is a relationship between download time and file size. The larger the file size, the longer it takes to download. People have limited internet download speeds, so smaller file sizes are always faster to load than larger file sizes.

Observation 2: What is Youtube under the hood? If you take away all the bells and whistles, Youtube is just a server (aka "someone else's computer") hosting videos. You upload video files to Youtube's "computer" and other people watch those videos by downloading them from that computer in their browser. It doesn't matter where the computer is or who owns it.

Observation 3: When you watch a video, you are only watching a portion of that video. When you watch the first 5 seconds of a 20 minute video, you aren't watching the last 19 minutes of that video.

Video File Optimization ¶

Our first attempt to recreate Youtube might be to create a webserver, upload some video files, then display those files in HTML using the <video /> tag. Simple enough, right?

<video controls width="720">
  <source src="https://example.com/video.mp4" type="video/mp4" />
</video>

This works great for small two-minute videos, but once we try to load a three-hour stream, we have to wait 10 minutes for the whole file to download before we can play it. Not great.

What if I told you we can halve the size of our video and still keep 1080p quality? The secret is in the encoding. A video is just a combination of a video data stream and an audio data stream. We generally call these data streams tracks. The video file itself is just a container that holds the video tracks and audio tracks together. If we can compress the tracks and how they're packed together, we can reduce the size of our file. Based on observation 1 above, if we can reduce the size of our file, we can reduce the time it takes to download and view the video!

We use codecs to compress our tracks, then pack them into a container. Mozilla has a great comparison of codecs and supported file formats, which demonstrates the combinations commonly used. Most of the internet uses AVC (H.264) in an MP4 container, but more efficient codecs such as HEVC (H.265) and AV1 are rising in popularity and can generate video files half the size of one encoded with H.264. The container can also make a difference: WebM boasts smaller file sizes than MP4 for the same video quality!

Note: As of August 2024, neither H.265 nor AV1 has full support across browsers.

Now that we've optimized our container and codec, we can start thinking about video quality itself. Most of the time you don't actually need perfect quality in a video. For example, if you record a 4K video at 70fps, most people won't be able to tell the visual difference between that video and a 1080p video at 60fps. Video quality is measured through bitrate, and codecs let us control bitrate using a Constant Rate Factor (CRF). You can think of CRF as a numeric quality scale: 0 means pristine, unaltered video quality at the cost of a large file size, and a high number means the smallest possible file size at the cost of severely degraded quality. H.264's CRF ranges from 0 to 51 and VP8/VP9's ranges from 0 to 63.

Video compression uses the same techniques as image and audio compression, discarding imperceptible (or inaudible) detail to the benefit of file size. (A quick terminology note: remuxing only repackages existing tracks into a new container, while re-encoding, or transcoding, actually decodes and re-compresses them.) When you re-encode a video file, setting a CRF of 0-18 is generally discouraged because most of the quality you keep at those CRFs is invisible to the human eye. In practice, I've found that a CRF of 23 is perfect for H.264 video. Keep in mind, however, that optimization comes at a cost: transcoding costs CPU power.

A comparison of 4:4:4 chroma subsampling vs 4:2:0 chroma subsampling. 4:2:0 chroma subsampling displays nearly identical image quality to 4:4:4.

For video transcoding I highly recommend ffmpeg, a free and open source tool for all kinds of video editing and processing.

# transcode MKV to MP4 (H.264)
ffmpeg -i input.mkv -c:v libx264 -c:a aac -crf 23 -preset veryslow output.mp4

# transcode MKV to WebM (VP9); libvpx-vp9 has no -preset option, so we use
# -crf with -b:v 0 for constant-quality mode instead
ffmpeg -i input.mkv -c:v libvpx-vp9 -crf 31 -b:v 0 -c:a libopus output.webm

See Video Playback Optimization Demo for an interactive demo comparing loading times for differently encoded video files.

Let's use the same webserver as before but with optimized video files. In case a source isn't supported, we'll provide some fallbacks. The browser will try each source sequentially until it finds one it can play. In case HTML video isn't supported at all, it will display a link to the video instead.

<video controls width="720">
  <source src="https://example.com/video1.webm" type="video/webm" />
  <source src="https://example.com/video1.mp4" type="video/mp4" />
  <p><a href="https://example.com/video1.mp4">Download the video here.</a></p>
</video>

Server Optimization (Caching) ¶

Now we've optimized the video file, but what if the video is 10 hours long? It doesn't matter if we've reduced the file size by 30GB if the resulting file is still 80GB!

On the server we can optimize how we send files to the client to produce nearly seamless playback. When you download a video from the internet, that video is most likely never going to change, so it only makes sense for the client to cache it: if you watch the same video more than once, you don't need to download it multiple times. Caching also helps when you stop watching a video halfway through and come back to it later.

This is a simple fix. For any media asset server, we just need to send back a Cache-Control response header:

"Cache-Control" "max-age=31536000, public, immutable"

This tells your browser the following:

  • This file will be considered "fresh" for 31536000 seconds (1 year)
  • This file is public and doesn't need to be stored privately
  • This file is immutable and will never change so it never needs to be refetched
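
Here's a minimal sketch of what that might look like on a Node server (the file name video.mp4 and port 8080 are placeholders):

// cache-server.ts: a minimal sketch of a media server with long-lived caching.
// The file name and port are placeholders.
import { createServer } from "node:http";
import { createReadStream, statSync } from "node:fs";

const FILE = "video.mp4";

const server = createServer((_req, res) => {
  res.writeHead(200, {
    "Content-Type": "video/mp4",
    "Content-Length": statSync(FILE).size,
    // fresh for a year, publicly cacheable, never refetched
    "Cache-Control": "max-age=31536000, public, immutable",
  });
  createReadStream(FILE).pipe(res); // stream the file instead of buffering it
});

server.listen(8080);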

However, even with this caching in place, we would still have to wait a terribly long time for our video to download if it was hosted in New York and we lived in India. That's why CDNs exist. A Content Delivery Network (CDN) is essentially a network of file servers spread across the globe to reduce loading times. Instead of sending a request to New York and waiting for a response, I can instead send a request to Mumbai or even Singapore and get a response back much faster. This follows observation 2: we don't need to care about where the video file is hosted, we just care about how long it takes to download. Using a CDN gives us consistently fast video loading times regardless of country or location.

Server Optimization (MSE) ¶

There's still an issue with our current implementation: if we try to load a long stream, the browser tries to download the majority of the file before playing - even if we only care about watching the first five minutes. This is because the browser fetches media files for us, so we have little control over how they're downloaded.

What if we fetched the media file ourselves through an API call instead? We could fetch the media via the JavaScript fetch API or even on the server, then convert the raw video file data into something the user can view in the browser. If we fetch the file ourselves, we have complete control over how the data is fetched - including what part of the data is fetched. This is where things get interesting.

There's one more piece of the puzzle we need to learn about: HTTP status code 206. The HTTP status code 206 refers to "Partial Content": when a range of data is requested, the server can choose to send back just that slice of the data via status code 206. This is most commonly used for resuming interrupted downloads, but we can also apply the concept to our video files. If we control both the server and the client, we can send an HTTP request with a Range header asking for specific bytes in a video file:

Range: "bytes=32768-"

Then the server can process that range and respond with the range of bytes of the video file we originally requested:

Content-Range: "bytes 32768-15072908755"

With this setup, you'll notice that the partial content fetching... doesn't work. That's because video files generally aren't made to be broken up into arbitrary segments. Video file containers are composed of smaller boxes holding varying types of metadata. This is how video players know how long a video is, whether subtitles are baked into the video, and which tracks can be played. If we took a random slice of this file, we might get a mix of metadata and a piece of actual video data!

A diagram representation of the MP4 file format. Each section is broken into a small box with a unique label. Some example labels are ftyp, moov, mvex, and moof.

If we want to evenly distribute metadata and video data across the file, we need to perform fragmentation. This is easily done through an external tool such as bento4, which is made for exactly this purpose. Next, we need to segment the data into literal file chunks. Instead of the server hosting one big video file, we can host multiple small pieces of a properly fragmented video file, then serve the pieces one by one through range requests. That means that on the client, we request one chunk of the video and it immediately loads and plays while the next chunk is fetched in the background. This perfectly aligns with observation 3: we only need to fetch the part of the video we're watching and nothing more.

# mp4 utilities provided by bento4
mp4fragment input.mp4 input-fragmented.mp4

# check that the "ftyp" atom is immediately followed by a "moov" atom
mp4dump input-fragmented.mp4 | head

# check codecs used
mp4info input.mp4 | grep "Codec String"

What we've just built is the workflow behind Media Source Extensions (MSE), a W3C API created to make streaming video easier on the web by letting JavaScript feed media segments directly into a <video /> element. Because we segmented our video into chunks, we can also swap chunks out on the fly, whether for ads or for different quality levels (more on that below).
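
To make that concrete, here's a stripped-down sketch of the client side of MSE. The segment URLs and the codec string are placeholders for whatever your fragmenter actually produced:

// mse-player.ts: a stripped-down sketch of playing segments through MSE.
// Segment URLs and the codec string below are placeholders.
const video = document.querySelector("video")!;
const mediaSource = new MediaSource();
video.src = URL.createObjectURL(mediaSource);

mediaSource.addEventListener("sourceopen", async () => {
  const mime = 'video/mp4; codecs="avc1.64001f, mp4a.40.2"';
  const buffer = mediaSource.addSourceBuffer(mime);
  // append the init segment first, then each media segment in order
  for (const url of ["/segments/init.mp4", "/segments/seg-1.m4s", "/segments/seg-2.m4s"]) {
    const segment = await (await fetch(url)).arrayBuffer();
    buffer.appendBuffer(segment);
    // appendBuffer is asynchronous: wait for it to finish before the next one
    await new Promise((done) => buffer.addEventListener("updateend", done, { once: true }));
  }
  mediaSource.endOfStream();
});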

See this Shaka player MSE demo for an interactive demo showcasing Media Source Extensions.

Server Optimization (DASH) ¶

There's still one feature I haven't talked about that we all take for granted when we browse Youtube: the ability to change video quality on the fly! This is known as Dynamic Adaptive Streaming over HTTP (DASH). DASH is relatively trivial to implement on top of MSE: instead of segmenting a single video, we can clone our original video at varying qualities and segment each of them. Then, before fetching the next video segment, the client can check its connection strength using an API like Navigator.connection. If the connection is spotty or shaky, we can switch to fetching a lower quality segment to adapt to the network conditions.
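
As a rough sketch of that decision on the client (the downlink thresholds and quality labels here are made up for illustration, and navigator.connection isn't available in every browser):

// quality-picker.ts: a rough sketch of picking segment quality from the
// Network Information API. Thresholds and quality labels are made up.
type Quality = "1080p" | "480p" | "144p";

function pickQuality(): Quality {
  // navigator.connection isn't in every browser (or in TypeScript's DOM
  // types), so cast and fall back to an optimistic default
  const connection = (navigator as any).connection;
  const downlinkMbps: number = connection?.downlink ?? 10;
  if (downlinkMbps > 5) return "1080p";
  if (downlinkMbps > 1.5) return "480p";
  return "144p";
}

// re-check the connection before fetching each segment
const nextSegmentUrl = `/segments/${pickQuality()}/seg-3.m4s`;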

To tell the client what video resolutions are available, we generate a Media Presentation Description (MPD) manifest file. This file is just an XML file summarizing metadata and required bandwidths for each video resolution.

# generate multiple resolutions for a video
mp4-dash-encode.py -b 5 input.mp4

# generate an MPD file
mp4-dash.py --exec-dir=. video*
# you can also generate a (simple) MPD file with ffmpeg
ffmpeg -i input.mp4 -f dash output.mpd

Bonus Features ¶

There are a few other minor features of Youtube I'll cover here that are relatively simple to implement.

Video Thumbnails ¶

A video poster or thumbnail is what draws users into your video. This can be implemented with the poster attribute on the <video /> element:

<video controls poster="https://example.com/video-thumbnail.png">
  <source src="https://example.com/video.mp4" type="video/mp4" />
</video>

To make video buffering feel more seamless, you can even capture the first frame of the video and use it as the poster so the user doesn't perceive any loading state.

Auto Play ¶

Unfortunately, due to rampant autoplay abuse, browsers started blocking unmuted autoplay in 2018. You can only autoplay muted videos:

<video muted autoplay>
  <source src="https://example.com/autoplay.mp4" type="video/mp4" />
</video>

The only reason Youtube is still able to autoplay videos is because Chrome and some other browsers have agreed to allowlist "certain frequently visited sites" to be able to autoplay videos. Another example of Google dominating the web.

Auto Captions ¶

How does Youtube generate automatic translations and captions on videos? It's quite simple in 2024 with AI speech-to-text models. I tested OpenAI's Whisper on some of my first few streams and it's surprising how many proper nouns and pop culture words it gets right. I find base.en to be the most accurate English model.

# extract audio from video into a format whisper understands
ffmpeg -i input.mp4 -acodec pcm_s16le -ac 1 -ar 16000 temp.wav

# whisper automatically adds the .vtt extension to the output
whisper-cpp -m "base.en" -f temp.wav -ml 60 -ovtt -pp -of output

Once Whisper has generated a VTT caption file, you can include captions on videos using the <track /> element:

<video controls>
  <source src="https://example.com/video.mp4" type="video/mp4" />
  <track
    default
    kind="captions"
    label="English"
    srclang="en"
    src="https://example.com/video-captions.vtt"
  />
</video>

The kind attribute can dictate how the captions are used. In most cases you'll want to use "captions".

Timestamps/Chapters ¶

Timestamps and video chapters give the user ways to mark certain points in a video for reference. There are two ways we can implement this:

  1. HTML video allows us to set a playback start time on the video URL itself via a media fragment. If we render the page on the server, we can set the video's starting time on page load.
    <video controls>
      <source src="https://example.com/video.mp4#t=479" type="video/mp4" />
    </video>
    
  2. JavaScript provides a currentTime property on media elements that can be set to seek to a specific timestamp.

If we want to mimic Youtube's timestamp share feature, we can parse URL query params and set the currentTime based on the timestamp query param.
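
A minimal sketch of that, assuming the timestamp arrives as a plain number of seconds in a ?t= query param:

// timestamp-share.ts: a minimal sketch of Youtube-style "?t=" sharing.
// Assumes the t query param holds a plain number of seconds.
const params = new URLSearchParams(window.location.search);
const t = Number(params.get("t"));

const video = document.querySelector("video")!;
if (Number.isFinite(t) && t > 0) {
  // seek once the browser has loaded enough metadata to know the duration
  video.addEventListener("loadedmetadata", () => {
    video.currentTime = t;
  }, { once: true });
}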

How Youtube Actually Works ¶

Now that I've gone over many different strategies for optimized browser video playback and rich video playback features, let's talk about how Youtube does it. Because Youtube is closed source, we can only speculate based on page inspection and any technical articles or software released by them. However, if I had to guess, this is how I suspect Youtube works under the hood:

  1. A creator uploads their video to Youtube.
  2. Youtube re-encodes your video into WebM (VP9).
  3. Youtube generates autocaptions in multiple languages for your video using some AI model (likely Gemini).
  4. Youtube guesses big timestamps for your video using some AI model and formats them into a <track /> with kind="chapters".
  5. Youtube properly fragments the video so that data is spread out evenly into chunks.
  6. Youtube generates various different resolutions (720p, 480p, 360p, 240p, 144p) and creates a manifest file.
  7. Youtube breaks each of these videos into segments (most likely into 1 minute chunks).
  8. Youtube uploads all of these video chunks to an internal CDN.
  9. Youtube serves the video files using MSE via Shaka Player.

As I mentioned at the start of this article, I was unable to replicate Youtube perfectly in my streams. The reason is not necessarily feasibility but practicality. I don't have the CPU power to generate all of these resolutions, captions, and chapters for each of my streams, and I don't have the storage capacity to keep all the resulting artifacts. If you have a 3 hour stream recorded at 1080p, you have to make 5 copies of that stream (one per resolution), then break each copy into 180 files (1 per minute) to upload to a CDN. I didn't want to have to manage 900 files for every stream. I made some tradeoffs - prioritizing convenience over performance. At the end of the day, I'm pretty happy with my stream setup. It isn't perfect, but it gets the job done sufficiently.

It's impressive what Youtube is able to accomplish on their platform. Then again, they're able to do it because they have so many resources and a lot of money.

Resources ¶

Here are some resources and explanations I found helpful:

Additionally, here are tools I recommend using if you would like to implement your own Youtube: