Unicode in your SMS: Emojis and Non-English Characters

27-2

We are glad to announce that we have improved our routes and our systems, so you can now send messages of any length, using any character available in the Unicode character set, which covers more than 128,000 characters (and yes, emojis are part of the package 😀).

Not too long ago, we limited our SMS messages to 160 characters and to printable ASCII characters only. We did this to be sure that any message that passed this filter could reach any mobile device in the world, because its content could be sent using the GSM 03.38 7-bit character set.

Of course this was fine for English and Spanish speakers, but it left other languages (like Hindi, Hebrew, Chinese, etc.) without support. And we know this is just not acceptable in the 21st century!

Unicode? Encoding? Characters? Bytes?

Unicode is now the globally adopted standard for handling internationalization in everything that has to do with IT. It basically maps each grapheme (a character, punctuation symbol, etc.) of nearly every language known to man (and even more!) to a Code Point (a number). This makes it possible to render any kind of text on different platforms and architectures (everyone will draw the same character based on the code point).
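To make the idea of a code point concrete, here is a minimal Python sketch that prints the code point of each character in a short text. The sample text is just an illustration and has nothing to do with our API.

```python
# Map each character of a short sample text to its Unicode code point.
text = "Añ😀"

for ch in text:
    # ord() returns the code point as an integer; U+XXXX is the usual notation.
    print(f"{ch!r} -> U+{ord(ch):04X}")

# The same mapping as a list of integers.
code_points = [ord(ch) for ch in text]
```

Going the other way, chr(0x1F600) gives back the 😀 character, on any platform.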

The character encoding defines how we actually write the code points, and how many bytes we use for each one of them.

For example, we can use 1 byte or 4 bytes per code point (i.e. 8 bits or 32 bits). When a code point takes more than one byte, we can also choose to write the most significant byte first or the least significant byte first (this is called the endianness).

That’s why it’s important to distinguish between the number of characters of a text, and the number of bytes required to encode each one of the characters in that text.
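The difference between characters and bytes is easy to see with a small Python sketch. It encodes the same sample text (chosen only for illustration) in UTF-8 and in both byte orders of UTF-16, then compares the lengths.

```python
# The same 4-character text, written out with three different encodings.
text = "Hi 😀"

utf8 = text.encode("utf-8")          # 1 byte per ASCII character, 4 for the emoji
utf16_be = text.encode("utf-16-be")  # big-endian: most significant byte first
utf16_le = text.encode("utf-16-le")  # little-endian: least significant byte first

char_count = len(text)  # number of code points, not bytes

print(char_count, len(utf8), len(utf16_be))
print(utf16_be[:2], utf16_le[:2])  # the letter "H" in each byte order
```

The text has 4 characters, but it takes 7 bytes in UTF-8 and 10 bytes in UTF-16, and the two UTF-16 variants differ only in the order of the bytes.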

The standard GSM character encoding: Limits and constraints

The GSM standard specifies that messages should be sent by encoding the text according to the GSM 03.38 Recommendation, which includes different encoding formats: 7-bit, 8-bit, and UCS-2 (now replaced by UTF-16).

GSM allows at most 160 7-bit characters per message. Depending on the encoding and the type of characters (whether they need to be escaped or not, etc.), this limit can drop to 140 or fewer. When using UCS-2 (aka UTF-16), a message can have at most 70 characters, because each character needs 16 bits (2 bytes).
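These limits all follow from the same arithmetic. The sketch below assumes the usual 140-byte payload of a single SMS (a figure that is not spelled out above), and the function names are ours, for illustration only.

```python
PAYLOAD_BYTES = 140  # assumed size of the text payload of a single SMS


def max_chars(bits_per_char: int) -> int:
    """How many fixed-width characters fit in one message."""
    return PAYLOAD_BYTES * 8 // bits_per_char


def utf16_units(text: str) -> int:
    """Number of 16-bit units the text needs when encoded as UTF-16."""
    return len(text.encode("utf-16-be")) // 2


print(max_chars(7), max_chars(8), max_chars(16))
print(utf16_units("שלום"), utf16_units("😀"))
```

Note that characters outside the Basic Multilingual Plane, such as most emojis, take two 16-bit units in UTF-16, so a single emoji uses up two of the 70 available positions.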

So a given text may have to be split into several messages in order to be fully sent. Each one of these messages (or segments) will be charged separately, as if they were different, unrelated messages (you can read more about this in our FAQ “How are my messages billed?”). For this reason, we highly recommend that you take this into account when designing the text for your A2P or Marketing messages.

In the following link you can find the complete GSM 03.38 character set. You can also get the full character list by using your panel, or the GSM Charset API Endpoint from one of our available SDKs.

Nothing to change on your side! We choose the right encoding automagically

That’s right! Our systems will try to automatically detect the encoding needed to send your message and act accordingly when delivering your message, so you don’t have to change anything in your systems, no matter if you’re using one of our SDKs or the REST API directly.
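If you are curious about what such a detection can look like, here is a minimal Python sketch. It is not our actual implementation: it simply assumes that a text can go out as 7-bit when every character belongs to the GSM 03.38 default alphabet or its extension table, and needs UCS-2 otherwise. The names pick_encoding and gsm_septets are hypothetical.

```python
# GSM 03.38 default alphabet (without the escape code) and its extension table.
GSM_BASIC = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
GSM_EXTENDED = set("\f^{}\\[~]|€")  # each of these is sent as escape + character


def pick_encoding(text: str) -> str:
    """Return "gsm-7bit" if every character fits the GSM alphabet, else "ucs-2"."""
    if all(ch in GSM_BASIC or ch in GSM_EXTENDED for ch in text):
        return "gsm-7bit"
    return "ucs-2"


def gsm_septets(text: str) -> int:
    """7-bit units needed for a GSM-encodable text (escaped characters count twice)."""
    return sum(2 if ch in GSM_EXTENDED else 1 for ch in text)


print(pick_encoding("Hola señor"), pick_encoding("Hi 😀"))
```

Escaped characters such as € are the reason a 7-bit message can sometimes hold fewer than 160 characters.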

Also, if your text doesn’t fit completely in one message, we will split it up and send as many “segments” as needed to deliver the full text. Just fire and forget.
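To estimate how many segments a text will need, you can use a rough sketch like the one below. It assumes the common convention for multi-part messages, where each part gives up a few bytes to a header and so carries 153 7-bit characters or 67 UCS-2 characters. The exact figures applied to your messages may differ, and the function name is hypothetical.

```python
import math

# Assumed capacities: (single message, each part of a multi-part message).
LIMITS = {"gsm-7bit": (160, 153), "ucs-2": (70, 67)}


def segments_needed(units: int, encoding: str) -> int:
    """Estimate the segment count for a text that takes `units` encoded units."""
    single, multi = LIMITS[encoding]
    if units <= single:
        return 1
    return math.ceil(units / multi)


print(segments_needed(161, "gsm-7bit"), segments_needed(135, "ucs-2"))
```

Here "units" means 7-bit units for GSM text (escaped characters count twice) and 16-bit units for UCS-2 text.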

One more thing: You can simulate your message before sending it to know beforehand whether it will need Unicode or not, how many segments will be needed to send it completely, and some other useful information.

Don’t hesitate to contact us if you have any questions. We’re always glad to help.

Enjoy 🙂

— The PortaText Team.