SD what?
“You had me at ‘Hello’.”
You have probably heard about SIP, SIP trunks, the SIP stack, SIP legs, and so on. The Session Initiation Protocol (SIP) is a signalling protocol used for establishing, maintaining, and tearing down real-time sessions. These sessions include voice, video, and messaging applications. SIP carries SDP information within the SIP message body.
The Session Description Protocol (SDP) provides a structured way to describe media to the endpoints. In essence, it describes the media, the transport address, transport protocol, ports, codecs, and other session description metadata to the participants in the session.
It is important to note that SDP does not carry the media itself. In essence, all it does is let one endpoint tell the other: “these are the codecs I support” and “hey, you can reach me on these protocol, IP address, and port combinations.”
SDP can be divided into three main sections:
1. Session description
2. Time description, and
3. Media description, which advertises details about the media that will be streamed in the advertised session.
We will now look at these three segments.
Session description
v= (protocol version, currently only 0)
o= (originator and session identifier: username, session identification, version number, network address)
s= (session name: mandatory with at least one UTF-8-encoded character)
i= (session title or short information – optional)
u= (URI of description – optional)
e= (zero or more email addresses with names of contacts – optional)
p= (zero or more phone numbers with names of contacts – optional)
c= (connection information – not required if included in media description – optional)
b= (zero or more session bandwidth information lines – optional)
z= (time zone adjustments – optional)
k= (encryption key – optional)
a= (zero or more session attribute lines – optional)
Time description (mandatory)
t= (time the session is active)
r= (zero or more repeat times – optional)
Media description
m= (media name and transport address)
i= (media title – optional)
c= (connection information – not required if included in session description – optional)
b= (zero or more bandwidth information lines – optional)
k= (encryption key – optional)
a= (zero or more media attribute lines – optional)
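The layout above can be sketched in a few lines of Python. This is only an illustration, not a validating parser: the function name split_sdp and the sample addresses (taken from the 192.0.2.0/24 documentation range) are made up for the example. It relies on the rule that every m= line opens a new media description, so anything before the first m= line is session level (including the t= and r= lines).

```python
def split_sdp(text):
    """Split an SDP body into session-level lines and per-media blocks."""
    session, media = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep or len(key) != 1:
            raise ValueError(f"not an SDP line: {line!r}")
        if key == "m":
            media.append([(key, value)])    # each m= line opens a new media block
        elif media:
            media[-1].append((key, value))  # belongs to the most recent m= line
        else:
            session.append((key, value))    # before the first m=: session level
    return session, media


# Hypothetical SDP body, used only to exercise the function.
example = """v=0
o=- 1 1 IN IP4 192.0.2.10
s=-
c=IN IP4 192.0.2.10
t=0 0
m=audio 49170 RTP/AVP 0
a=rtpmap:0 PCMU/8000
"""

session, media = split_sdp(example)
print([key for key, _ in session])  # ['v', 'o', 's', 'c', 't']
print(media[0][0])                  # ('m', 'audio 49170 RTP/AVP 0')
```

Note that the same letter can appear at both levels (i=, c=, b=, k=, a=), which is why the position relative to the m= lines matters more than the letter itself.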
Description of each field
Session Description
v=<version> – Specifies the version of Session Description Protocol. As specified in RFC 4566, up to now there is only one version, which is Version 0. No minor versions exist.
o=<username> <sess-id> <sess-version> <nettype> <addrtype> <unicast-address> – Details about the originator and identification of the session.
<username> – The user’s login on the originating host. It MUST NOT contain spaces.
<sess-id> – A numeric string used as unique identifier for the session.
<sess-version> – A numeric string used as version number for this session description.
<nettype> – Text string, specifying the network type, e.g., IN for Internet.
<addrtype> – Text string specifying the type of the originator’s address, e.g., IP4 or IP6.
<unicast-address> – The address of the machine from where the session is originating, which can be either FQDN or IP address.
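As a small illustration of the six sub-fields, the following sketch splits an o= line into named parts. The names Origin and parse_origin, and the sample values, are hypothetical; the only rule taken from the description above is that the fields are separated by single spaces and that none of them may contain a space.

```python
from typing import NamedTuple


class Origin(NamedTuple):
    username: str
    sess_id: str
    sess_version: str
    nettype: str
    addrtype: str
    unicast_address: str


def parse_origin(line):
    """Split an o= line into its six space-separated sub-fields."""
    if not line.startswith("o="):
        raise ValueError(f"expected an o= line, got {line!r}")
    fields = line[2:].split(" ")
    if len(fields) != 6:
        raise ValueError(f"o= needs exactly 6 fields, got {len(fields)}")
    return Origin(*fields)


# Hypothetical originator line for illustration.
origin = parse_origin("o=alice 2890844526 2890842807 IN IP4 192.0.2.10")
print(origin.username, origin.addrtype, origin.unicast_address)
```

The session id and version are kept as strings here because they are described as numeric strings and can be larger than a typical fixed-width integer in other languages.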
s=<session name> – Only one session name can be specified per session description. It must not be empty; therefore, if no name is assigned to the session, a single space should be used as the session name.
i=<session description> – Only one session-level “i” field can be specified in the session description. The “i” field can be used in the session or media description. When used in a media description, it is primarily intended for labelling media streams. It can be a human-readable description.
u=<uri> – The URI (Uniform Resource Identifier) specified in the “u” field is a pointer to additional information about the session.
e=<email address>
p=<phone-number> – Together with “e=”, specifies contact information for the person responsible for the conference.
c=<nettype> <addrtype> <connection-address> – Connection information can be included in the session description or in a media description. A session description MUST contain either at least one “c=” field in each media description or a single “c=” field at the session level.
<nettype> A text string describing the network type, e.g., IN for Internet.
<addrtype> A text string describing the type of the address used in connection-address, e.g., IP4 or IP6.
<connection-address> The address to use for the connection. For a unicast session this is the IP address or FQDN of the endpoint. An IPv4 multicast address is specified with a TTL appended, e.g., 224.2.36.42/127.
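To make the connection-address format concrete, here is a minimal sketch that parses a c= line and separates the optional TTL from an IPv4 multicast address. The function name parse_connection and the shape of the returned dictionary are assumptions for this example; it deliberately ignores the IPv6 form, where a value after the slash has a different meaning.

```python
import ipaddress


def parse_connection(line):
    """Parse c=<nettype> <addrtype> <connection-address> into a dict."""
    if not line.startswith("c="):
        raise ValueError(f"expected a c= line, got {line!r}")
    parts = line[2:].split()
    if len(parts) != 3:
        raise ValueError("c= needs <nettype> <addrtype> <connection-address>")
    nettype, addrtype, conn = parts
    address, _, rest = conn.partition("/")
    info = {"nettype": nettype, "addrtype": addrtype,
            "address": address, "ttl": None}
    if addrtype == "IP4" and rest:
        # IPv4 multicast: the first value after the slash is the TTL.
        info["ttl"] = int(rest.split("/")[0])
    try:
        info["multicast"] = ipaddress.ip_address(address).is_multicast
    except ValueError:
        info["multicast"] = False  # not an IP literal, e.g. an FQDN
    return info


print(parse_connection("c=IN IP4 224.2.36.42/127"))
print(parse_connection("c=IN IP4 84.64.105.21"))
```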
b=<bwtype>:<bandwidth> – The bandwidth field can be used in the session description, where it specifies the total bandwidth of the whole session, and in a media description, where it applies to that media stream only.
<bwtype> The bandwidth type is either CT (conference total), the upper limit on the bandwidth used by the whole conference, or AS (application specific), the application’s own concept of maximum bandwidth.
<bandwidth> is interpreted as kilobits per second by default.
z=<adjustment time> <offset> <adjustment time> <offset> – When a repeated session spans a change from daylight saving time to standard time or vice versa, the “z” field specifies the times at which an adjustment occurs and the offset to apply relative to the originating time.
k=<method>:<encryption key> – If the channel is secure and trusted, SDP can be used to convey encryption keys. A key can be specified for the whole session or for each media description.
<method> Indicates how the key is obtained, either from an external source or by decoding the key given in the field. Several methods exist, such as prompt and uri.
<encryption key> The encryption key or, if uri is used as the method, the URI from which the key can be retrieved.
a=<attribute>:<value> – Attributes may be defined at session level, at media level, or both. Session-level attributes advertise additional information that applies to the conference as a whole. Media-level attributes are specific to one media stream, i.e., they advertise information about that stream.
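Attribute lines are the most open-ended part of SDP, so a parser usually just splits the name from the value and leaves interpretation to later code. The sketch below does that; parse_attribute is a hypothetical helper name. It also accepts attributes that have no value at all, such as a=sendrecv.

```python
def parse_attribute(line):
    """Split an a= line into (name, value); value is None for flag attributes."""
    if not line.startswith("a="):
        raise ValueError(f"expected an a= line, got {line!r}")
    name, sep, value = line[2:].partition(":")
    return name, (value if sep else None)


print(parse_attribute("a=rtpmap:0 PCMU/8000"))  # ('rtpmap', '0 PCMU/8000')
print(parse_attribute("a=sendrecv"))            # ('sendrecv', None)
```

Splitting only on the first colon matters because attribute values may themselves contain colons.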
Time Description
t=<start-time> <stop-time> – Specifies the start and stop times for a session. If a session is active at irregular intervals, multiple “t=” lines can be used.
r=<repeat interval> <active duration> <offsets from start-time> – If a session is to be repeated at fixed intervals, the “r” field is used. By default, all values are specified in seconds, but to make the description more compact, times can also be given in other units, such as days, hours, or minutes, e.g., r=6d 2h 14m.
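The compact time units are easy to normalise back to seconds. The following sketch does so for an r= line, assuming the unit suffixes d, h, m, and s; the names UNITS, to_seconds, and parse_repeat are made up for this illustration.

```python
# Seconds per unit suffix; a bare number is already in seconds.
UNITS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def to_seconds(token):
    """Convert a token such as '6d', '2h' or '3600' to seconds."""
    if token[-1] in UNITS:
        return int(token[:-1]) * UNITS[token[-1]]
    return int(token)


def parse_repeat(line):
    """Return (repeat interval, active duration, [offsets]) in seconds."""
    if not line.startswith("r="):
        raise ValueError(f"expected an r= line, got {line!r}")
    interval, duration, *offsets = [to_seconds(t) for t in line[2:].split()]
    return interval, duration, offsets


print(parse_repeat("r=6d 2h 14m"))  # (518400, 7200, [840])
```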
Media Description
m=<media> <port>/<number of ports> <proto> <fmt> – This field is used in the media description section to advertise properties of the media stream, such as the port it will be using for transmitting, the protocol used for streaming and the format or codec.
<media> Used to specify media type, generally this can be audio, video, text etc.
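<port> The transport port to which the media stream will be sent. If the stream uses more than one port, the number of ports is given after a slash, e.g., 49170/2.
<proto> The transport protocol used for streaming, e.g., RTP/AVP (the Real-time Transport Protocol with the audio/video profile) or RTP/SAVP (its secure variant).
<fmt> The format of the media being sent. For RTP, this is a list of payload type numbers, each identifying the codec in which the media is encoded, e.g., PCMU or GSM.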
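A media line can be taken apart along the same lines. The sketch below is one possible way to do it, with parse_media as a hypothetical function name; it treats everything after <proto> as the format list and defaults the port count to 1 when no slash is present.

```python
def parse_media(line):
    """Parse m=<media> <port>[/<number of ports>] <proto> <fmt> ..."""
    if not line.startswith("m="):
        raise ValueError(f"expected an m= line, got {line!r}")
    parts = line[2:].split()
    if len(parts) < 4:
        raise ValueError("m= needs <media> <port> <proto> and at least one <fmt>")
    media, port_spec, proto, *formats = parts
    port, _, count = port_spec.partition("/")
    return {
        "media": media,
        "port": int(port),
        "port_count": int(count) if count else 1,
        "proto": proto,
        "formats": formats,  # kept as strings: not every proto uses numbers
    }


audio = parse_media("m=audio 40834 RTP/SAVP 0 8 101")
print(audio["port"], audio["proto"], audio["formats"])
```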
Let us look at a sample SDP message:
A breakdown of this sample SDP is given below:
c=IN IP4 84.64.105.21 tells us the address the media will come from and where it should be sent.
m=audio 40834 RTP/SAVP 104 114 9 112 111 0 8 103 116 115 97 13 118 119 101 – As described above, there is a media line for each type of media. Here, we have only audio; if the session included video, there would be a separate m=video line. The numbers are RTP payload types, each identifying a codec that can be used.
a=rtpmap – There is one a=rtpmap attribute line for each codec advertised in the media line, mapping the payload type number to a codec name and clock rate.
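The relationship between the payload type numbers and the a=rtpmap lines can be shown with a short sketch. The helper name parse_rtpmap is hypothetical, and the parser is deliberately lenient: it returns None for a missing clock rate rather than rejecting the line, which a strict implementation might not do.

```python
def parse_rtpmap(line):
    """Parse a=rtpmap:<payload type> <name>/<clock rate>[/<channels>]."""
    prefix = "a=rtpmap:"
    if not line.startswith(prefix):
        raise ValueError(f"expected an a=rtpmap line, got {line!r}")
    payload_type, _, encoding = line[len(prefix):].partition(" ")
    name, _, rest = encoding.partition("/")
    clock, _, channels = rest.partition("/")
    return (int(payload_type), name,
            int(clock) if clock else None,
            int(channels) if channels else 1)


lines = ["a=rtpmap:9 G722/8000", "a=rtpmap:0 PCMU/8000", "a=rtpmap:8 PCMA/8000"]
codecs = {pt: (name, clock) for pt, name, clock, _ in map(parse_rtpmap, lines)}
print(codecs[0])  # ('PCMU', 8000)
```

With such a table, each number in the m= line can be looked up to find the codec it stands for.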
The far-end party will typically respond to the above SDP with its own SDP (in a 183 Session Progress or in a 200 OK).
The above SDP would read like this: My SIP client on IPv4 address of 84.64.105.21 can support the following codecs on port 40834:
a=rtpmap:104 SILK/16000
SILK was developed for Skype and later served as a basis for the Internet-standard Opus audio codec.
a=rtpmap:114 x-msrta/16000
This is Microsoft’s proprietary audio codec which can be licensed for use in other third-party clients and devices.
RTAudio is provided in both narrowband (8 kHz) and wideband (16 kHz) options.
a=rtpmap:9 G722/8000
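A freely available and widely popular wideband audio codec. Although G.722 samples audio at 16 kHz, its RTP clock rate is advertised as 8000 for historical reasons.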
a=rtpmap:112 G7221/16000
This is a variant of the Siren 7 codec.
a=rtpmap:111 SIREN/16000
The Siren family of codecs was originally developed by Polycom.
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
The above two codecs are the two variants of G.711: PCMU is G.711 µ-law (used in North America and Japan) and PCMA is G.711 A-law (used by the rest of the world).
a=rtpmap:103 SILK/8000
This is the narrowband (8 kHz) variant of the SILK codec described above.
a=rtpmap:116 AAL2-G726-32/8000
G.726 is an Adaptive Differential Pulse Code Modulation (ADPCM) codec designed to compress speech more effectively than older PCM-based codecs.
a=rtpmap:115 x-msrta/8000
This is the narrowband (8 kHz) variant of Microsoft’s RTAudio codec described above.
a=rtpmap:97 RED/8000
RED (redundant audio data, RFC 2198) is a payload format that carries redundant copies of audio data, providing a form of Forward Error Correction (FEC).
a=rtpmap:13 CN/8000
a=rtpmap:118 CN/16000
a=rtpmap:119 CN/24000
All three carry Comfort Noise (CN), at clock rates of 8, 16, and 24 kHz respectively.
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16
These two lines advertise DTMF signalling, used to support common telephone events such as pushing buttons on the dial pad while in a call. The a=fmtp line lists the supported event codes, here 0 to 16.
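To see what the range in a=fmtp:101 0-16 stands for, the sketch below expands it into individual event codes and maps them to dial pad keys. The mapping follows the telephone-event registry (0 to 9 for the digits, 10 for *, 11 for #, 12 to 15 for A to D, 16 for flash); the names DTMF_EVENTS and expand_events are made up for this example.

```python
# Event code -> key, per the telephone-event registry (RFC 4733).
DTMF_EVENTS = {i: str(i) for i in range(10)}
DTMF_EVENTS.update({10: "*", 11: "#", 12: "A", 13: "B",
                    14: "C", 15: "D", 16: "flash"})


def expand_events(spec):
    """Expand an event list such as '0-16' or '0-11,16' into event codes."""
    events = []
    for part in spec.split(","):
        first, _, last = part.partition("-")
        events.extend(range(int(first), int(last or first) + 1))
    return events


names = [DTMF_EVENTS.get(code, f"event {code}") for code in expand_events("0-16")]
print(names)
```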
The far-end participant picks from this list and media can be exchanged.
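The selection step can be sketched as follows. This is a simplified illustration under two assumptions that are not stated in the text: that the answerer honours the offerer's preference order, and that codec names are compared case-insensitively. Real endpoints may apply their own policy and may answer with more than one codec. The names OFFERED and choose_codec are hypothetical, and OFFERED lists only a subset of the codecs from the sample.

```python
# A subset of the offered codecs, in the order they appear in the m= line.
OFFERED = ["G722/8000", "G7221/16000", "SIREN/16000", "PCMU/8000", "PCMA/8000"]


def choose_codec(offered, supported):
    """Return the first offered codec that the answerer also supports."""
    supported_lower = {name.lower() for name in supported}
    for codec in offered:
        if codec.lower() in supported_lower:
            return codec
    return None  # no common codec: the media stream cannot be set up


print(choose_codec(OFFERED, {"PCMA/8000", "pcmu/8000"}))  # PCMU/8000
```

Here the answerer supports both G.711 variants, and PCMU wins because it comes first in the offer.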