Playing Together with Elixir Binaries-Strings :)
This article covers things you’ll encounter while working with strings and raw bytes, explained with real situational examples. I tried to design the images to focus on what we are talking about. Hope you like them.
All the examples used in this article were executed in iex.
In one of my projects, I did a lot of heavy packet parsing: using header lengths on raw binaries, and decoding and encoding 16-, 32-, and 64-bit strings. So I thought I would share the experience.
I hope you already know the difference between a bit and a byte. If true, do: skip the following screenshot, else: have a glance at it.
🔥 “Every binary is a bitstring, but not every bitstring is a binary.”
In Elixir, a binary is written with <<>>. Of course, everybody knows that.
iex(8)> data = <<"hello">>
"hello"
iex(9)> is_binary data
true
iex(10)> is_bitstring data
true
iex(11)> data2 = <<1,2,3::4>>
<<1, 2, 3::size(4)>>
iex(12)> is_bitstring data2
true
iex(13)> is_binary data2
false
What makes a binary different from a bitstring?
If the number of bits is a multiple of 8, we call it a binary. Consider <<1,2,3::4>> from the session above.
In that expression, we did not mention the number of bits to be used for 1 and 2, but we did for 3. In Elixir, if the size is not mentioned, the default of 8 bits is used. So <<1,2,3::4>> is equal to <<1::8, 2::8, 3::4>>, which is 20 bits of data. We cannot call it a binary because the number of bits is 20, which is not a multiple of 8.
Have a look at the following representation.
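The 20-bit arithmetic above can be checked directly. This is a small sketch that uses only Kernel functions; the variable names data and padded are just for illustration.

```elixir
# 1 and 2 get the default 8 bits each, 3 gets the 4 bits we asked for.
data = <<1, 2, 3::4>>

IO.inspect(bit_size(data))                  # 20
IO.inspect(data == <<1::8, 2::8, 3::4>>)    # true
IO.inspect(rem(bit_size(data), 8))          # 4, so not a whole number of bytes
IO.inspect(is_bitstring(data))              # true
IO.inspect(is_binary(data))                 # false

# Giving the last segment 8 bits makes the total 24 bits, which is a binary.
padded = <<1, 2, 3::8>>
IO.inspect(is_binary(padded))               # true
```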
Raw bytes and Understanding Elixir representation
Strings in Elixir are binaries. Sorry for repeating the same statement again and again, but I have to. Even if someone wakes you from sleep and asks, you are supposed to say that.
Consider the word hello. Each letter (grapheme) in it is a plain ASCII character and takes 8 bits, so the total byte_size of the word hello is 5.
iex> byte_size "hello"
5
iex> String.graphemes "hello"
["h", "e", "l", "l", "o"]
iex> String.valid? <<35>>
true
iex> <<35>>
"#" # valid string
The ASCII (American Standard Code for Information Interchange) code for # is 35. The binary representation of 35 is 100011, which is 6 bits of data.
<<35>> tells Elixir to use 8 bits for 35, so 00100011 is the binary form stored for 35. If you write <<35::6>> instead, the result is only 6 bits long: it is a bitstring, not a valid string, and falls under raw data.
iex> <<35::6>>
<<35::size(6)>>
iex> String.valid?(<<35::6>>)
false
iex> String.valid?(<<35::8>>)
true
Understanding Elixir Representation
Consider the following lines of code
iex> match?("#", <<35>>)
true
iex> match? "#", <<0::1, 0::1, 1::1, 0::1, 0::1, 0::1, 1::1, 1::1>>
true
iex> match? <<35>>, <<0::1, 0::1, 1::1, 0::1, 0::1, 0::1, 1::1, 1::1>>
true
Here, we are literally spelling out each bit of <<35::8>> as <<0::1, 0::1, 1::1, 0::1, 0::1, 0::1, 1::1, 1::1>>.
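The same bit-by-bit view can be produced by Elixir itself instead of being typed by hand. The sketch below uses a bitstring comprehension; the names byte, bits, and rebuilt are only for illustration.

```elixir
byte = <<35>>

# Take one bit at a time out of the byte.
bits = for <<bit::1 <- byte>>, do: bit
IO.inspect(bits)                     # [0, 0, 1, 0, 0, 0, 1, 1]

# Integer.digits/2 gives the same digits without the two leading zeros.
IO.inspect(Integer.digits(35, 2))    # [1, 0, 0, 0, 1, 1]

# Putting the eight single bits back together gives the string "#".
rebuilt = for bit <- bits, into: <<>>, do: <<bit::1>>
IO.inspect(rebuilt)                  # "#"
```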
Back End Story of Learning
When I was learning the basics of programming in Elixir, I used to turn the pages without reading whenever I saw the symbols <<>>. These symbols were a nightmare when I was a kid in Elixir terms. Learning them felt like hitting a mountain with your head at a speed of 200. Just imagine.
OK, stories apart. Once you get a clear picture in your mind of what raw bytes and valid strings mean, you’ll climb the mountain with ease.
Programmers deal with raw bytes far more than with strings, especially those who do parsing all the time.
Programmers count in memory, not in length.
Remember the previous line; we will talk about it in depth later in the article.
This is a real-world situation.
Extracting a String of Known Length
If you know the exact length of the string and the position from where you want to extract it, you can go with the following approach.
Using binary_part for raw bytes
When you work on a real-world project, you’ll encounter situations that deal with raw bytes of data. I would suggest you learn as much as possible before working with them.
iex> binary_part("hello medium", 6, 6)
"medium"
binary_part(binary, start, length) extracts the part of binary that begins at byte offset start and is length bytes long. It is used for splitting raw bytes of data.
When the length is negative and within bounds, it extracts the bytes that come before start, going from right to left instead of left to right. For example, binary_part("hello", 5, -3) gives llo.
Things to remember.
→ Here, the start index cannot be negative.
→ Here, the binaries are zero-indexed. binary_part("hello", 1, 1) results in e, not h. To get h, you have to try binary_part("hello", 0, 1). Hope you understood what zero-indexed means.
→ start plus length cannot exceed the byte_size of the string. Otherwise, it raises an ArgumentError exception.
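The points above can be tried out in one go. This is only a sketch; the variable names word and out_of_bounds are made up for the example.

```elixir
word = "hello"

IO.inspect(binary_part(word, 0, 1))     # "h" (offsets start at 0)
IO.inspect(binary_part(word, 1, 1))     # "e"

# A negative length takes the bytes that come before the start offset.
IO.inspect(binary_part(word, 5, -3))    # "llo"

# Asking for more bytes than are available raises ArgumentError.
out_of_bounds =
  try do
    binary_part(word, 3, 10)
  rescue
    ArgumentError -> :argument_error
  end

IO.inspect(out_of_bounds)               # :argument_error
```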
Using binary_part in Guard clause
binary_part can be used in a guard clause as well.
Example: Packet Parsing
For example, suppose you are parsing packets like $user#blackode#a#medium#writer. You are asked to write a definition that receives a packet and tells each kind of packet apart from the others.
You could do this by splitting the packet with String.split(packet, "#") and using the if macro to do the job, but that takes more code. Instead, you can use binary_part in a guard clause, like the following.
defmodule Parser do
  def parse(packet) when binary_part(packet, 1, 5) == "admin" do
    IO.puts "Admin Packet !"
  end
  ...
end
Check out the execution screenshot
= = = = = = = = = = ======Warning====== = = = = = = = = = =
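As I already mentioned in the things to remember section, if either the start or the length value is out of bounds, binary_part raises an ArgumentError exception. Inside a guard, that error is not raised to the caller; it makes the guard fail, so the clause does not match, and if no other clause matches you get a FunctionClauseError.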
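To see how the guard behaves with short or unexpected packets, here is a minimal sketch with a fallback clause. The module name PacketParser, the "user" clause, and the returned atoms are assumptions for illustration, not part of the original Parser.

```elixir
defmodule PacketParser do
  # Each clause looks at the bytes right after the leading "$".
  def parse(packet) when binary_part(packet, 1, 5) == "admin", do: :admin
  def parse(packet) when binary_part(packet, 1, 4) == "user", do: :user

  # Reached when no guard matched, including when binary_part failed
  # inside a guard because the packet was too short or not a binary.
  def parse(_packet), do: :unknown
end

IO.inspect(PacketParser.parse("$admin#root"))                     # :admin
IO.inspect(PacketParser.parse("$user#blackode#a#medium#writer"))  # :user
IO.inspect(PacketParser.parse("$u"))                              # :unknown
```

Without the last clause, the call with "$u" would end in a FunctionClauseError instead of returning a value.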
— Extracting a string of Unknown Length
If you don’t know the length of the substring, you cannot use the binary_part function. This is where binary pattern matching with <<>> comes in handy.
Suppose you are asked to extract the string from position 6 to the end of the string.
A string in Elixir is always a multiple of 8 bits, which we call a binary. That is, if the bit_size is divisible by 8, we call it a binary.
As we discussed in the intro section, each letter in a plain ASCII string takes 8 bits, which is 1 byte. So, to skip 6 letters you have to skip 6 × 8 = 48 bits.
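Skipping those 48 bits with a binary pattern might look like the sketch below. The sample text "hello medium" is reused from the earlier binary_part example, and the variable names are illustrative.

```elixir
text = "hello medium"

# Skip 48 bits (6 bytes) and bind everything that follows to rest.
<<_skipped::48, rest::binary>> = text
IO.inspect(rest)                 # "medium"

# The same skip, written in bytes instead of bits.
<<_head::binary-size(6), rest_again::binary>> = text
IO.inspect(rest_again)           # "medium"
```

The rest::binary segment has no size, so it takes whatever is left, which is why the length of the substring does not need to be known.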
— Extracting first letter from the string
Extract the currency symbol from the string “$500”.
This can be achieved in many ways.
iex> string = "$500"
"$500"
iex> string |> String.first
"$"
iex> string = "$500"
"$500"
iex> <<first::8, _rest::binary>> = string
"$500"
iex> <<first>>
"$"
iex> first
36 # code point (ASCII code) of $
iex> <<35>>
"#"
Using String.split is not recommended in this situation, but it is good to know the option exists.
As we know, String.split splits the string based on the given pattern. If the pattern is "", it gives a somewhat different result.
iex> string = "$500"
"$500"
iex> string |> String.split("")
["", "$", "5", "0", "0", ""]
🔥 There is no space between the quotes in "".
If you look closely, it added an extra "" at the head and the tail. You have to trim them by passing the option trim: true.
iex> string = "$500"
"$500"
iex> string |> String.split("", trim: true) |> hd
"$"
iex> String.slice "$500", 0, 1
"$"
iex> String.slice "$500", -4, 1
"$"
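One thing to keep in mind with the first::8 pattern is that it reads exactly one byte, which is fine for $ but not for a symbol that takes several bytes. The sketch below uses "€500" as an assumed example (it is not from the cases above) and the utf8 segment type to read a whole code point.

```elixir
price = "€500"

# ::8 reads only the first byte of the three-byte € character.
<<first_byte::8, _rest::binary>> = price
IO.inspect(first_byte)                  # 226

# ::utf8 reads one full UTF-8 encoded code point.
<<first_char::utf8, _rest::binary>> = price
IO.inspect(first_char)                  # 8364
IO.inspect(<<first_char::utf8>>)        # "€"

# String.first works on graphemes, so it handles this case too.
IO.inspect(String.first(price))         # "€"
```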
String.slice [ VS ] binary_part
As we know, both take the arguments (str, start, len) and return a substring starting at the offset start and of length len.
I kept wondering why there would be two functions with such similar functionality, so I started checking out the things that differentiate them.
Out of bound options
If start or len is out of bounds, binary_part raises an ArgumentError, as it is designed to be used with raw bytes. String.slice does not; it works in terms of String.length.
Let’s check that.
iex(14)> str = "hello medium"
"hello medium"
iex(15)> String.slice str, 6, 10
"medium"
iex(16)> binary_part str, 6, 10
# Bug Bug ..!!
** (ArgumentError) argument error
    :erlang.binary_part("hello medium", 6, 10)
Here, only 6 letters remain after position 6, but we tried to extract a substring of len 10. So binary_part raised an error, while String.slice returned the substring from index 6 to the end of the string. Hope you got the point.
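If you want the forgiving behaviour of String.slice while still counting in bytes, one option is to clamp the length yourself. SafeBinary.part/3 below is a hypothetical helper written for this article, not a function that ships with Elixir.

```elixir
defmodule SafeBinary do
  # Hypothetical helper: like binary_part/3, but clamps len to the bytes
  # that are actually available instead of raising.
  def part(bin, start, len) when is_binary(bin) and start >= 0 and len >= 0 do
    size = byte_size(bin)

    if start >= size do
      ""
    else
      binary_part(bin, start, min(len, size - start))
    end
  end
end

str = "hello medium"
IO.inspect(String.slice(str, 6, 10))     # "medium"
IO.inspect(SafeBinary.part(str, 6, 10))  # "medium"
IO.inspect(SafeBinary.part(str, 20, 3))  # ""
```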
Raw Bytes and Graphemes
In String.slice(str, start, len), start is an index into the graphemes, whereas in binary_part it is the index of a byte.
It will be clearer with the following example.
iex> str = "hełło"
"hełło"
iex> String.length str
5
iex> byte_size str
7
iex> String.graphemes str
["h", "e", "ł", "ł", "o"]
I hope you see what I mean here. The string has a grapheme length of 5, but its byte_size is 7; that is where these functions differ from each other.
byte_size/1 counts the underlying raw bytes, and String.length/1 counts characters (graphemes).
String.slice deals with Unicode graphemes, and binary_part deals with raw bytes.
Internal Representation of String (Raw Bytes)
iex> str = "hełło"
"hełło"
iex> raw = str <> <<0>>
<<104, 101, 197, 130, 197, 130, 111, 0>>
iex(37)> String.slice raw, 2, 3
"łło"
iex(38)> binary_part raw, 2, 3
<<197, 130, 197>>
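The difference between counting graphemes and counting bytes can also be seen without appending the <<0>> byte. A small sketch follows, using :binary.bin_to_list/1 to list the bytes; the variable name broken is illustrative.

```elixir
str = "hełło"

# Each ł is stored as the two bytes 197, 130.
IO.inspect(:binary.bin_to_list(str))    # [104, 101, 197, 130, 197, 130, 111]

# Two graphemes starting at grapheme index 2.
IO.inspect(String.slice(str, 2, 2))     # "łł"

# The same two letters need four bytes starting at byte offset 2.
IO.inspect(binary_part(str, 2, 4))      # "łł"

# Cutting in the middle of a letter leaves bytes that are not a valid string.
broken = binary_part(str, 2, 3)
IO.inspect(String.valid?(broken))       # false
```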
Elixir has a Base module that helps with decoding and encoding binaries. Have a look at its documentation.
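As a quick taste of the Base module, here is a sketch that encodes and decodes the word hello in base 16 and base 64. The variable names are illustrative.

```elixir
hex = Base.encode16("hello")
IO.inspect(hex)                                     # "68656C6C6F"
IO.inspect(Base.decode16(hex))                      # {:ok, "hello"}

# Lowercase hex needs the :case option on both sides.
lower = Base.encode16("hello", case: :lower)
IO.inspect(lower)                                   # "68656c6c6f"
IO.inspect(Base.decode16!(lower, case: :lower))     # "hello"

b64 = Base.encode64("hello")
IO.inspect(b64)                                     # "aGVsbG8="
IO.inspect(Base.decode64!(b64))                     # "hello"
```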
Hope you enjoyed playing with strings. Practice makes perfect. Try to parse an IPv4 packet based on its header length.
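If you want a starting point for that exercise, here is one possible sketch. It assumes the standard IPv4 header layout, where the header length field (IHL) counts 32-bit words and the first 5 words are fixed. The module name IPv4Sketch, the field names, and the sample packet are made up for illustration; it does not verify the checksum or the total length.

```elixir
defmodule IPv4Sketch do
  # Matches the fixed 20-byte part of the header, then uses ihl to work
  # out how many option bytes come before the payload.
  def parse(
        <<4::4, ihl::4, _tos::8, _total_length::16, _id::16, _flags::3, _fragment::13,
          ttl::8, protocol::8, _checksum::16, src::binary-size(4), dst::binary-size(4),
          rest::binary>>
      )
      when ihl >= 5 do
    options_size = (ihl - 5) * 4

    case rest do
      <<options::binary-size(options_size), payload::binary>> ->
        {:ok,
         %{ttl: ttl, protocol: protocol, src: src, dst: dst, options: options, payload: payload}}

      _ ->
        {:error, :truncated}
    end
  end

  def parse(_other), do: {:error, :invalid}
end

# A hand-built sample packet: version 4, ihl 5 (no options), payload "data".
packet =
  <<4::4, 5::4, 0, 24::16, 0::16, 0::3, 0::13, 64, 6, 0::16, 192, 168, 0, 1, 10, 0, 0, 1,
    "data">>

{:ok, header} = IPv4Sketch.parse(packet)
IO.inspect(header.src)        # <<192, 168, 0, 1>>
IO.inspect(header.payload)    # "data"
```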
If you found this helpful, please share it so that others can benefit from it too.