RTMP (Real-Time Messaging Protocol) is a low-latency protocol for streaming video, audio, and data over the Internet.
It may sound like something obsolete, reminiscent of the ancient Flash Player 🙂. Yet at the same time, big platforms like Facebook, YouTube, and others support it as the main protocol for delivering your video content to them. So, what’s happening?
This post provides an in-depth explanation of RTMP: its history, internal structure, and current status in the Web and video streaming industry. After reading it, you’ll have a clear picture of the state of things and will know just enough about RTMP to use it efficiently.
The first part of the article contains information useful for anyone interested in streaming. The second part is more for programmers or those who want to get a more detailed look at the protocol 🐱💻.
This is the very first post in this blog and I really hope you’ll enjoy it. So, let’s start!
RTMP: Quick Characteristics
- Based on TCP, which means no lost frames and no video glitches
- Ultra-low latency (under 1-2 seconds) on good network conditions
- The de-facto standard for streaming to CDNs and streaming platforms
- Supports encryption (RTMPS)
- Based on TCP, which means higher delay once the transmission reaches maximum bandwidth or network congestion occurs
- No support for modern codecs other than H.264/AAC
- Obsolete for playback
Originally, RTMP was a proprietary protocol created by Macromedia as a messaging protocol for the Flash media platform. Flash was a serious multimedia platform to build games, rich applications, and embedded video players. And RTMP was a communication protocol between browsers, mobile apps or other types of clients and the Flash Media Server. It was developed to transfer video, audio and arbitrary data.
Wait, but what about RTMP? Is it out of play too?
Well, yes, but not yet.
As was said above, when it comes to playing or capturing video in a browser, we don’t need Flash (and RTMP) anymore. We can accomplish the same things natively with HTML5: for instance, play a video with an HTML5 `<video>` tag and capture the camera and microphone with the `getUserMedia()` API. But what about live streaming from a hardware device, camera, desktop, or mobile application to the server over the Internet?
In 2009 Adobe published an (incomplete) 📜 RTMP specification that allowed third-party developers to build their own RTMP servers and clients. Some information was missing from the spec, which prevented implementations from being fully compatible with Adobe Media Server. However, over time, RTMP support was implemented in many client applications, hardware devices, and servers. As a result, RTMP became the de-facto standard for streaming media from client to server over the Internet.
In this post, we’ll dive inside RTMP from a live-streaming point of view, because, as said above, it’s the only niche where this protocol is still relevant. RTMP is a two-way protocol, so it allows data transfer in both directions, but in live streaming the client sends video to the server, and the server sends only control messages back to the client.
RTMP is an L7 (application layer) protocol that works over TCP and uses port 1935 by default. There are also multiple flavors of the protocol, like RTMPE and RTMPT (encapsulated into HTTP to bypass firewalls). But the most important protocol variant seems to be RTMPS.
RTMPS is the same RTMP but over a TLS/SSL connection, so it’s encrypted 🔒. Interestingly, since May 2019 Facebook supports only RTMPS for its live video ingest.
RTMP allows ultra-low-latency continuous streaming (less than 1-2 seconds) on good network conditions. Since it is based on TCP, it guarantees packet delivery and retransmission of lost packets, so no frames will be lost. On the other hand, once the stream reaches maximum network bandwidth or network congestion occurs, TCP slows the transmission down, and we get a larger buffer and, hence, a larger delay.
Audio & Video Codecs
RTMP encapsulates and can carry multiple multiplexed FLV media streams. But usually, it’s used to stream one H.264 (AVC) video stream and one AAC audio stream. The codecs supported by RTMP are restricted by the FLV media container format. The full list of supported codecs is given in the FLV specification, but most of them are obsolete and not widely used.
And that’s the biggest 🔴 problem of RTMP: the lack of support for modern codecs. You may want to stream H.265 (HEVC) or VP8/VP9 with Opus, but you can’t. Technically, of course, you can do it by patching your software, but since it’s not in the spec, it won’t be compatible with anything.
Adobe seems unresponsive to proposals to add more codecs to the spec. So people end up either using H.264/AAC only, switching to a different protocol, or patching RTMP for their own private use (there was an unsuccessful attempt to get such a modification into official FFmpeg, however).
This situation pushes companies to gradually move away from RTMP. ➡️ For example, if you want to deliver VP8/VP9 video with Vorbis/Opus audio to YouTube, you can now do it with DASH or HLS. The price for that is higher latency than with RTMP, but there are some tips on how to deal with it.
In this section, I’ll jump into RTMP from a developer’s point of view.
First, the client initiates a TCP connection to the server. After that, the client and the server should perform an RTMP handshake. Then, after some negotiations, the client starts streaming video and audio to the server.
ℹ️ Important Note. It’s always a good idea to read the specification when trying to sort out how some protocol works. So please refer to RTMP Specification for the details.
Typically, before you start streaming, you first need an RTMP URL. It has the following structure: `rtmp://host:port/app/stream-key` (in the case of RTMPS, the URL obviously starts with `rtmps://` instead). Your client application parses the URL and connects to the corresponding host and port, providing the RTMP server with the "app name" and "stream key" so the server can decide how to handle the connection. Typically, the "stream key" uniquely distinguishes your live-streaming session among others, and it is quite frequently used as a secret that makes the server accept your live stream. We’ll take a closer look at this in the next sections, but for now let’s start with the handshake.
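As a sketch of that URL structure, here is how a client might break an RTMP(S) URL into its parts using only the standard library (the function name and the default port for `rtmps://` are my assumptions; 1935 for plain `rtmp://` comes from the protocol itself):

```python
from urllib.parse import urlparse

def parse_rtmp_url(url):
    """Split an RTMP(S) URL into host, port, app name, and stream key."""
    parsed = urlparse(url)
    if parsed.scheme not in ("rtmp", "rtmps"):
        raise ValueError("not an RTMP URL: " + url)
    # 1935 is the RTMP default; 443 is a common choice for RTMPS.
    port = parsed.port or (1935 if parsed.scheme == "rtmp" else 443)
    # Path looks like "/app/stream-key"; some servers allow slashes
    # inside the app name, here we take the simple case.
    app, _, stream_key = parsed.path.lstrip("/").partition("/")
    return {"scheme": parsed.scheme, "host": parsed.hostname,
            "port": port, "app": app, "stream_key": stream_key}

print(parse_rtmp_url("rtmp://live.example.com/myapp/s3cr3t-key"))
```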
The handshake consists of an exchange of fixed-size messages: the client sends C0, C1, and C2, and the server sends S0, S1, and S2. In C0 and S0, the client and server agree on the RTMP version, which should be 3. In the following messages, they agree on the epoch for future messages and confirm that they are both RTMP peers by exchanging some random data. Consult the RTMP specification for more details.
```
Client                        Server
   --- C0 C1 ----------------->
   <-- S0 S1 S2 --------------
   --- C2 -------------------->
```
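The client side of the handshake can be sketched as follows. The helper names are mine, and echoing S1 back as C2 is a simplification that most servers accept; the spec describes the time fields of C2 in more detail:

```python
import os
import struct

RTMP_VERSION = 3
HANDSHAKE_SIZE = 1536  # C1/C2 and S1/S2 are all 1536 bytes

def build_c0_c1(epoch_ms=0):
    """C0 is a single version byte; C1 is 4 bytes of epoch time,
    4 zero bytes, and 1528 bytes of random data."""
    c0 = bytes([RTMP_VERSION])
    c1 = struct.pack(">II", epoch_ms, 0) + os.urandom(HANDSHAKE_SIZE - 8)
    return c0 + c1

def build_c2(s1):
    """C2 echoes the server's S1 (time + random payload)."""
    assert len(s1) == HANDSHAKE_SIZE
    return s1

# A client would then do roughly:
#   sock.sendall(build_c0_c1())
#   s0 = recv_exact(sock, 1)               # must be 0x03
#   s1 = recv_exact(sock, HANDSHAKE_SIZE)
#   s2 = recv_exact(sock, HANDSHAKE_SIZE)  # echoes our C1
#   sock.sendall(build_c2(s1))
```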
After the handshake, the client and server communicate by sending each other RTMP messages, multiplexed in chunks over several chunk streams. Let’s take a closer look at what a Chunk Stream is.
RTMP Chunk Stream
Chunk Stream is an abstraction used to represent the multiplexing and packetizing of RTMP messages. The reason for introducing chunk streams into the protocol is to enable splitting and interleaving of big messages.
For example, if we have to transfer a 1 MB video packet through our TCP connection, we can’t send anything else until the 1 MB of video packet data is fully sent. But if we split our 1 MB message into chunks of, say, 50 KB, we can easily send something else in between these 50 KB chunks. Smaller messages thus get a higher priority: they won’t be blocked by a lower-priority (but bigger) message.
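The splitting itself is trivial; a minimal sketch (function name is mine), using the 1 MB / 50 KB numbers from the example above:

```python
def split_into_chunks(payload, chunk_size=128):
    """Split one message payload into chunk-sized pieces.
    On the wire, the first piece carries a full chunk header
    and the rest carry the minimal continuation header."""
    return [payload[i:i + chunk_size]
            for i in range(0, len(payload), chunk_size)]

pieces = split_into_chunks(b"\x00" * 1_000_000, chunk_size=50_000)
print(len(pieces))  # 20
```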
The default chunk size in the RTMP chunk stream is 128 bytes, and it can be changed by sending a special message (discussed later). Keep in mind that chunk streams are separate for the two directions of the TCP connection, i.e., the receiving and sending chunk streams are different and may have different chunk sizes.
Each chunk stream has its own numeric identifier. RTMP supports up to 65597 chunk streams, which is more than enough for most situations. Typically, we need one chunk stream for video and one for audio.
Chunk Stream ID = 2 is reserved for protocol control messages and commands.
Each chunk has a header, and the Chunk Stream ID is included in it. The header also contains a Chunk Type ID (fmt in the spec) that determines how the remaining header (if any) should be processed. There are 4 chunk types, and depending on the type, the size of the header and the information it contains may be bigger or smaller. This is done, for example, for situations where several subsequent chunks have the same chunk stream id and are parts of the same message, so they share some message properties (headers), like the timestamp, that were already sent in the first chunk header. In this case, we can save a few bytes on each chunk by setting chunk type 3 and omitting all other header data.
For more info about chunk types please refer to RTMP specification.
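To make the layout concrete, here is a sketch of parsing the chunk basic header as the spec describes it: 2 bits of chunk type and a chunk stream id that occupies 1, 2, or 3 bytes (the function name and return shape are my own):

```python
def parse_basic_header(data):
    """Parse the RTMP chunk basic header.
    Returns (fmt, chunk stream id, header length in bytes)."""
    fmt = data[0] >> 6          # 2-bit chunk type
    csid = data[0] & 0x3F       # 6-bit chunk stream id field
    if csid == 0:               # 2-byte form: ids 64..319
        return fmt, data[1] + 64, 2
    if csid == 1:               # 3-byte form: ids 64..65599
        return fmt, data[2] * 256 + data[1] + 64, 3
    return fmt, csid, 1         # 1-byte form: ids 2..63

# fmt=0 (full message header follows), chunk stream id 4:
print(parse_basic_header(bytes([0x04])))  # (0, 4, 1)
```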
RTMP messages are designed to be used with the RTMP Chunk Stream, but they themselves could be used with a different transport (for example, over HTTP). Usually, when we talk about RTMP/RTMPS, we mean the original RTMP implementation over Chunk Streams.
Okay, so we have some RTMP messages to send. We split them into chunks and transfer them with different Chunk Stream IDs. But what exactly are these messages? Each message has a Type ID that indicates its meaning. It may be one of the protocol control messages, like SET_CHUNK_SIZE (used to indicate a change of the sender’s chunk size). It may also be an RTMP message, like VIDEO (video message), AUDIO (audio message), DATA_AMF3 (serialized data), or COMMAND_AMF3 (serialized RPC command). AMF0 and AMF3 are different versions of the Action Message Format; most implementations use the AMF0 version.
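As an example of a protocol control message, SET_CHUNK_SIZE has the simplest possible payload: the new chunk size as a 32-bit big-endian integer. A sketch (the function name is mine; per the spec, the message goes out on chunk stream id 2 with message stream id 0, and the most significant bit of the value must be 0):

```python
import struct

SET_CHUNK_SIZE = 1  # protocol control message type id

def build_set_chunk_size(new_size):
    """Build the 4-byte payload of a SET_CHUNK_SIZE message."""
    assert 1 <= new_size <= 0x7FFFFFFF  # high bit must stay 0
    return struct.pack(">I", new_size)

print(build_set_chunk_size(4096).hex())  # 00001000
```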
VIDEO and AUDIO message content is interpreted as VIDEODATA and AUDIODATA FLV tags, correspondingly. Please refer to the FLV specification for more info.
VIDEO messages with H.264 content
The message content is represented in the AVCC (AVCDecoderConfigurationRecord) format (see ISO 14496-15). That means the AVCC Sequence Header (or extradata, i.e., SPS and PPS) is transferred separately in the first message, and AVCC-coded (length-prefixed) NALUs follow in the subsequent messages.
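A sketch of walking the length-prefixed NALUs in an AVCC payload (the function name is mine; the prefix length comes from the `lengthSizeMinusOne` field of the AVCDecoderConfigurationRecord and is most commonly 4):

```python
def iter_avcc_nalus(payload, length_size=4):
    """Yield the NAL units from an AVCC-formatted payload,
    where each NALU is prefixed with its big-endian length."""
    pos = 0
    while pos + length_size <= len(payload):
        nalu_len = int.from_bytes(payload[pos:pos + length_size], "big")
        pos += length_size
        yield payload[pos:pos + nalu_len]
        pos += nalu_len

# Two NALUs of 3 and 2 bytes, each with a 4-byte length prefix:
data = b"\x00\x00\x00\x03ABC" + b"\x00\x00\x00\x02DE"
print(list(iter_avcc_nalus(data)))  # [b'ABC', b'DE']
```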
It may be worth mentioning that FLV (and, respectively, RTMP) supports H.264 streams with B-frames. This is done by providing a CompositionTime in the AVCVIDEOPACKET structure of the FLV video tag. B-frames add additional delay, hence they are not used intensively in live streaming.
AUDIO messages with AAC content
Everything is almost the same: first an FLV tag with the AAC Sequence Header, an AudioSpecificConfig structure (see ISO 14496-3), and then tags with raw AAC data.
Another important property of an RTMP message is its Timestamp. For audio and video messages, it is the decoding timestamp (and also the presentation timestamp for streams without B-frames), i.e., the time when the data should be decoded and presented to the user.
In the case of B-frames, the presentation timestamp should be calculated from the decoding timestamp and the CompositionTime mentioned above (a signed offset in milliseconds, per the FLV spec; see ISO 14496-12 for an explanation of composition times). Briefly, it looks like this:
DTS (Decoding Timestamp) = Timestamp
PTS (Presentation Timestamp) = DTS + CompositionTime
AUDIO and VIDEO message timestamps are synchronized with each other and share the same starting epoch that is agreed upon during the handshake.
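Since the FLV spec defines CompositionTime as a signed millisecond offset, the PTS calculation is a plain addition; a trivial sketch (the function name is mine):

```python
def presentation_timestamp(dts_ms, composition_time_ms=0):
    """PTS for an FLV/RTMP H.264 frame: the message Timestamp is
    the DTS, and CompositionTime (a signed millisecond offset
    from the AVCVIDEOPACKET header) shifts it to the PTS.
    Without B-frames, CompositionTime is 0 and PTS == DTS."""
    return dts_ms + composition_time_ms

print(presentation_timestamp(1000, 33))  # 1033
```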
RTMP Streaming Messages Flow
OK, now we know about RTMP messages and how they are multiplexed over chunk streams. Let’s get back to the connection life-cycle. After the client and server finish the handshake (whose format differs from the rest of the protocol), the peers start to send data using Chunk Streams. Here is a typical rough sequence of messages to start streaming an H.264/AAC stream from client to server.
```
Client                                                          Server
   --- TCP connection establish -------------------------------->
   <-- RTMP handshake --------------------------------------------
   --- COMMAND_AMF0 ("connect", tx1, ...) ----------------------->
   <-- WINDOW_ACKNOWLEDGEMENT_SIZE -------------------------------
   <-- SET_PEER_BANDWIDTH ----------------------------------------
   <-- COMMAND_AMF0 ("_result", tx1, ...) ------------------------
   --- COMMAND_AMF0 ("createStream", tx2, ...) ------------------>
   <-- COMMAND_AMF0 ("_result", tx2, ...) ------------------------
   --- COMMAND_AMF0 ("publish", tx3, ... streamId, live, ...) --->
   <-- COMMAND_AMF0 ("onStatus", tx3, ...code="NetStream.Publish.Start")
   --- DATA_AMF0 ("@setDataFrame", "onMetaData", ...) ----------->
   --- VIDEO (AVCDecoderConfigurationRecord) -------------------->
   --- AUDIO (AAC Sequence Header) ------------------------------>
   ...
   --- VIDEO ---------------------------------------------------->
   --- AUDIO ---------------------------------------------------->
   ...
   --- COMMAND_AMF0 ("deleteStream") ---------------------------->
   --- TCP connection drop -------------------------------------->
```
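The COMMAND_AMF0 payloads above are AMF0-serialized values. As a sketch of what goes on the wire for the "connect" command, here is a minimal AMF0 encoder covering just the three types that command needs (helper names are mine; hosts, app name, and tcUrl are placeholder values):

```python
import struct

def amf0_string(s):
    """AMF0 string: marker 0x02, u16 length, UTF-8 bytes."""
    data = s.encode("utf-8")
    return b"\x02" + struct.pack(">H", len(data)) + data

def amf0_number(n):
    """AMF0 number: marker 0x00, 8-byte big-endian double."""
    return b"\x00" + struct.pack(">d", float(n))

def amf0_value(v):
    return amf0_string(v) if isinstance(v, str) else amf0_number(v)

def amf0_object(props):
    """AMF0 object: marker 0x03, (u16 key length, key, value)*,
    then the empty-key object-end marker 0x00 0x00 0x09."""
    body = b"".join(
        struct.pack(">H", len(k.encode())) + k.encode() + amf0_value(v)
        for k, v in props.items())
    return b"\x03" + body + b"\x00\x00\x09"

# Payload of a COMMAND_AMF0 "connect" message (transaction id 1):
payload = (amf0_string("connect") + amf0_number(1) +
           amf0_object({"app": "myapp",
                        "tcUrl": "rtmp://live.example.com/myapp"}))
print(payload[:10].hex())  # 020007636f6e6e656374
```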
Please note that the concrete set of messages may be slightly different depending on the client and server, as long as it matches the specification. For example, some peers may send additional SET_CHUNK_SIZE or other messages during streaming. Also, ACKNOWLEDGEMENT messages are required from a party after it receives WINDOW_ACKNOWLEDGEMENT_SIZE bytes. Please consult the RTMP Specification for the required and optional messages and their flow.
If you’re interested in how it works in reality, you can easily use packet-capturing software like Wireshark. It has RTMP support and can show you everything we describe here. One small trick: to get all RTMP messages parsed and displayed correctly, you may need to set Maximum packet size to a higher value (for example, 327680) in Edit → Preferences → Protocols → RTMPT.
There are a lot of libraries that contain an RTMP implementation, and you can easily find a native one for your favorite programming language, for example, go-rtmp (Golang). Or, if you want, you can implement it yourself; it would be a really good exercise 🤖.
Docs & Specs
- RTMP article on Wikipedia: encyclopedic general information
- RTMP Specification: protocol official spec
- FLV Specification: media container format spec
- AMF0/AMF3 Specifications: serialization format spec
- ISO/IEC 14496-10 and 14496-3: specs for H.264 and AAC
We have taken a look at the history of RTMP and its properties, and did a deep dive into its internal structure. RTMP has its advantages and disadvantages, but it is still in play for live-streaming ingest until something newer and better comes out.