Protobuf serialization - How it works + How to combine serialized messages

· 6 min read
Hreniuc Cristian-Alexandru

We wanted to optimize the way we combine multiple protobuf messages. Our scenario was this: we had a NestedRepeatedMessage that contained a NestedSingleInt message. We were creating a NestedRepeatedMessage that was the same for multiple users, so we were serializing it once and posting it to be sent to each user. After some tests, we found that sending one message per send operation is very expensive, so we had to find a way to combine multiple serialized messages before sending them.

The first solution was to keep a reference to the protobuf object that was used for serialization. If the client had multiple messages in the queue, we would create one big NestedRepeatedMessage and copy into it each NestedSingleInt from the NestedRepeatedMessage objects in the queue. If there was only one element in the queue, we would use the already serialized version. This was also expensive.

Our next step was to find a way to reuse the already serialized strings. After understanding how the protobuf encoding works, we noticed that if we append the serialized NestedRepeatedMessage messages to each other, the result is the same serialized string as the one from the approach above, where we copied all the messages into one big message and then re-serialized it.

Below you can find all steps that were done to get to this conclusion.


Resources:

I've created messages step by step and I've tried to understand how they are created. To do so, I've used the following code resources:

Makefile:

all:
	g++ -o exe main.cpp message.pb.cc -I/home/chreniuc/.conan/data/protobuf/3.15.5/user/stable/package/2b4a21ccda4a91fa18db143988901cc28f2f7109/include -L/home/chreniuc/.conan/data/protobuf/3.15.5/user/stable/package/2b4a21ccda4a91fa18db143988901cc28f2f7109/lib/ -lprotobuf

clean:
	rm -rf exe main.o

protoc:
	/home/chreniuc/.conan/data/protobuf/3.15.5/user/stable/package/2b4a21ccda4a91fa18db143988901cc28f2f7109/bin/protoc message.proto --cpp_out=.

message.proto

syntax = "proto3";


message SingleInt {
int32 a = 1;
}

message TwoInts {
int32 a = 1;
int32 b = 2;
}

message RepeatedInts
{
repeated int32 a=1;
}

message NestedSingleInt {
SingleInt singleint = 1;
}

message NestedTwoInts {
TwoInts twoints = 1;
}

message NestedRepeatedInts {
RepeatedInts repeatedints = 1;
}

message NestedRepeatedMessage {
repeated SingleInt ints = 1;
}

message NestedRepeatedMessageString {
repeated bytes ints = 1;
}

main.cpp

#include <iomanip>
#include <sstream>
#include <iostream>
#include <bitset>
#include <string>

#include "./message.pb.h"

using namespace ::std;

void print_hex(const string& input);
void print_binary(const string& input);
void print_serialized(google::protobuf::Message& message, const string& test_name, string* manually_serialized = nullptr);

// Check this: https://developers.google.com/protocol-buffers/docs/encoding#structure

int main()
{
SingleInt singleInt;
singleInt.set_a(3);
print_serialized(singleInt, "SingleInt"s);

TwoInts twoInts;
twoInts.set_a(3);
twoInts.set_b(5);
print_serialized(twoInts, "TwoInts"s);

RepeatedInts repeatedInts;
repeatedInts.add_a(3);
print_serialized(repeatedInts, "RepeatedInts - 1"s);
repeatedInts.add_a(5);
print_serialized(repeatedInts, "RepeatedInts - 2"s);


cout << "======== Nested examples ===========" << endl;

NestedSingleInt nestedSingleInt;
SingleInt* singleIntNested = new SingleInt();
singleIntNested->set_a(3);
nestedSingleInt.set_allocated_singleint(singleIntNested);
print_serialized(nestedSingleInt, "NestedSingleInt"s);

NestedTwoInts nestedTwoInts;
TwoInts* twoIntsNested = new TwoInts();
twoIntsNested->set_a(3);
twoIntsNested->set_b(5);
nestedTwoInts.set_allocated_twoints(twoIntsNested);
print_serialized(nestedTwoInts, "NestedTwoInts"s);

NestedRepeatedMessage nestedRepeatedMessage;
NestedRepeatedMessageString nestedRepeatedMessageString;
string manually_serialized;
auto single_int1 = nestedRepeatedMessage.add_ints();
single_int1->set_a(3);
manually_serialized += single_int1->SerializeAsString();
nestedRepeatedMessageString.add_ints(single_int1->SerializeAsString());
print_serialized(nestedRepeatedMessage, "NestedRepeatedMessage - 1"s, &manually_serialized);

auto single_int2 = nestedRepeatedMessage.add_ints();
single_int2->set_a(3);
manually_serialized += single_int2->SerializeAsString();
nestedRepeatedMessageString.add_ints(single_int2->SerializeAsString());
print_serialized(nestedRepeatedMessage, "NestedRepeatedMessage - 2"s, &manually_serialized);

auto single_int3 = nestedRepeatedMessage.add_ints();
single_int3->set_a(3);
manually_serialized += single_int3->SerializeAsString();
nestedRepeatedMessageString.add_ints(single_int3->SerializeAsString());
print_serialized(nestedRepeatedMessage, "NestedRepeatedMessage - 3"s, &manually_serialized);

// Appended manually: serialize a one-element NestedRepeatedMessage and concatenate it three times
NestedRepeatedMessage nestedRepeatedMessageSingle;
nestedRepeatedMessageSingle.add_ints()->set_a(3);
string append_serialization = nestedRepeatedMessageSingle.SerializeAsString();
append_serialization += nestedRepeatedMessageSingle.SerializeAsString();
append_serialization += nestedRepeatedMessageSingle.SerializeAsString();
cout <<"[Apended manually]" << endl;
print_hex(append_serialization);
print_binary(append_serialization);
cout <<"[ " << setw(20) << setfill('-') << "-";
cout << "]\n" << endl;


print_serialized(nestedRepeatedMessageString, "nestedRepeatedMessageString"s);

return 0;
}




void print_serialized(google::protobuf::Message& message, const string& test_name, string* manually_serialized)
{
cout <<"[-----" << test_name << "-----]" << endl;
string serialized = message.SerializeAsString();
print_hex(serialized);
print_binary(serialized);
if(manually_serialized != nullptr)
{
cout <<"[Serialized manually]" << endl;
print_hex(*manually_serialized);
print_binary(*manually_serialized);
}
cout <<"[ " << setw(20) << setfill('-') << "-";
cout << "]\n" << endl;
}

void print_hex(const string& input)
{
std::stringstream stream;
for(auto const& bytte: input)
{
stream << setw(6) << setfill(' ') << ' ';
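// Cast through unsigned char first, otherwise bytes >= 0x80 are sign-extended and printed as FFFFFFxx.
stream << setw(2) << setfill('0') << std::uppercase << std::hex << static_cast<int>(static_cast<unsigned char>(bytte));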
stream << " ";
}
cout<< "HEX: " << stream.str() << endl;
}

void print_binary(const string& input)
{
std::stringstream stream;
for(auto const& bytte: input)
{
stream << setw(8) << setfill('0') << std::bitset<8> ((int)bytte);
stream << ' ';
}
cout << "BIN: " << stream.str() << endl;
}

To use this, first update the paths in the Makefile so that they point to your protobuf installation, then run:

make protoc

make

./exe

Results:

I've printed the serialization of each message in HEX and in binary and I've followed their documentation to understand each bit from the serialization.

These are the outputs and the notes:

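// Legend for the annotated BIN lines:
// 0 - msb (continuation bit of a varint; 0 means this is its last byte)
// 0001 - key (field number)
// 000 - wire type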

// SingleInt
[-----SingleInt-----]
HEX: 08 03
BIN: 0 0001 000 00000011
msb key=1 wire_type=0 value=3
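
The annotation above splits the first byte into an msb, a key and a wire type. A minimal sketch of that split could look like the code below. The `Tag` struct and `decode_tag` are hypothetical helpers written for illustration, not part of the protobuf API, and the code assumes a one-byte tag, which only holds for field numbers 1 to 15.

```cpp
#include <cstdint>

// Hypothetical helper type, not part of the protobuf API.
struct Tag
{
    std::uint32_t field_number; // called "key" in the notes above
    std::uint32_t wire_type;    // 0 = varint, 2 = length-delimited
};

// Splits a single-byte tag into its parts.
// Assumption: the msb is 0, i.e. the field number is between 1 and 15.
Tag decode_tag(std::uint8_t byte)
{
    Tag tag;
    tag.field_number = static_cast<std::uint32_t>(byte >> 3); // upper bits
    tag.wire_type = static_cast<std::uint32_t>(byte & 0x07);  // lowest 3 bits
    return tag;
}
```

For example, `decode_tag(0x08)` gives field number 1 with wire type 0 (the SingleInt case), and `decode_tag(0x0A)` gives field number 1 with wire type 2.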

// TwoInts: a (field 1) = 3, b (field 2) = 5
[-----TwoInts-----]
HEX: 08 03 10 05
BIN: 0 0001 000 00000011 | 0 0010 000 00000101
msb key=1 wire_type=0 value=3 | msb key=2 wire_type=0 value=5


[-----RepeatedInts - 1-----]
HEX: 0A 01 03
BIN: 0 0001 010 0 000 0001 0000 0011
msb key=1 wire=2 msb payload_size = 1 byte value=3

Note: payload_size is not the length of the array, it's the size of the payload in bytes; e.g. a big number can take multiple bytes.
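
To make the difference between element count and byte count concrete, here is a hand-rolled sketch of the packed encoding used by RepeatedInts. `append_varint` and `encode_repeated_ints` are hypothetical helpers, not protobuf API, and the sketch assumes non-negative values and field number 1.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Appends `value` as a base-128 varint: 7 bits per byte, least significant
// group first, msb set on every byte except the last one.
void append_varint(std::string& out, std::uint64_t value)
{
    while (value >= 0x80)
    {
        out.push_back(static_cast<char>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<char>(value));
}

// Hand-rolled encoding of RepeatedInts (field 1, packed).
// Assumption: all values are non-negative.
std::string encode_repeated_ints(const std::vector<std::uint32_t>& values)
{
    if (values.empty())
        return std::string(); // an empty repeated field is not written at all

    std::string payload;
    for (std::uint32_t v : values)
        append_varint(payload, v);

    std::string out;
    out.push_back('\x0A');              // key=1, wire type 2
    append_varint(out, payload.size()); // payload size in bytes, not the element count
    out += payload;
    return out;
}
```

With the values 3 and 5 this produces `0A 02 03 05`, as in the output below. With the single value 300 it produces `0A 02 AC 02`: one element, but a payload size of 2, because 300 needs two varint bytes.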

[-----RepeatedInts - 2-----]
HEX: 0A 02 03 05
BIN: 0 0001 010 0 000 0010 0000 0011 0000 0101
msb key=1 wire=2 msb payload_size = 2 byte value=3 value=5

[-----NestedSingleInt-----]
HEX: 0A 02 08 03
BIN: 0 0001 010 0 000 0010 0 0001 000 0000 0011
msb key=1 wire=2 msb payload_size = 2 byte msb key=1 wire=0 value=3

Doc: embedded messages are encoded like strings, so we need to specify wire type 2 and the payload size, and that's it. The last two bytes are the same as the ones from SingleInt.
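
The note above can be sketched as a small wrapping step: take the bytes of the inner message and put a tag and a size in front of them. `wrap_as_field1` is a hypothetical helper, not protobuf API, and it assumes the payload is shorter than 128 bytes so that the size fits in a single byte.

```cpp
#include <stdexcept>
#include <string>

// Wraps already serialized bytes as field 1 with wire type 2.
// Assumption: the payload is shorter than 128 bytes, so the size is one byte.
std::string wrap_as_field1(const std::string& payload)
{
    if (payload.size() >= 0x80)
        throw std::length_error("sketch only supports one-byte sizes");

    std::string out;
    out.push_back('\x0A');                            // key=1, wire type 2
    out.push_back(static_cast<char>(payload.size())); // payload size in bytes
    out += payload;                                   // inner message, unchanged
    return out;
}
```

Wrapping the SingleInt bytes `08 03` gives `0A 02 08 03`, the NestedSingleInt output above. Wrapping the TwoInts bytes `08 03 10 05` gives `0A 04 08 03 10 05`.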

[-----NestedTwoInts-----]
HEX: 0A 04 08 03 10 05
BIN: 0 0001 010 00000100 00001000 00000011 00010000 00000101
msb key=1 wire=2 payload_size = 4 byte TwoInts message serialized


[-----NestedRepeatedMessage - 1-----]
HEX: 0A 02 08 03
BIN: 00001010 00000010 00001000 00000011
[ --------------------]

[-----NestedRepeatedMessage - 2-----]
HEX: 0A 02 08 03 0A 02 08 03
BIN: 00001010 00000010 00001000 00000011 00001010 00000010 00001000 00000011
[ --------------------]


Solutions to our optimization problems:

[-----NestedRepeatedMessage - 3-----]
HEX: 0A 02 08 03 0A 02 08 03 0A 02 08 03
BIN: 00001010 00000010 00001000 00000011 00001010 00000010 00001000 00000011 00001010 00000010 00001000 00000011
NestedSingleInt NestedSingleInt NestedSingleInt

Solution 1:

This means that we can serialize each message separately, append the serialized strings, and get the same result. But each piece needs to be serialized as a NestedSingleInt (or as a NestedRepeatedMessage with a single element), because if we serialize the bare SingleInt the tag and the payload size are missing and we won't get the correct serialization. Ex: check the [Apended manually] scenario.

[Apended manually]
HEX: 0A 02 08 03 0A 02 08 03 0A 02 08 03
BIN: 00001010 00000010 00001000 00000011 00001010 00000010 00001000 00000011 00001010 00000010 00001000 00000011
[ --------------------]

This is the same as: NestedRepeatedMessage - 3
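
The reason this works is that a parser reads one field after another and, for a repeated field, keeps every occurrence it finds, so it cannot tell whether the bytes came from one serialization or from three appended ones. A simplified sketch of such a walk is shown here. `split_field1_entries` is a hypothetical helper, not protobuf API, and it assumes the buffer contains only field 1 entries with wire type 2 and one-byte sizes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Walks a buffer made only of "0A <size> <payload>" entries and returns
// each payload. Returns an empty vector if the input is malformed.
std::vector<std::string> split_field1_entries(const std::string& buffer)
{
    std::vector<std::string> entries;
    std::size_t pos = 0;
    while (pos < buffer.size())
    {
        // Every entry must start with key=1, wire type 2 and have a size byte.
        if (buffer[pos] != '\x0A' || pos + 1 >= buffer.size())
            return {};
        const std::size_t length = static_cast<unsigned char>(buffer[pos + 1]);
        // Only one-byte sizes are handled, and the payload must fit in the buffer.
        if (length >= 0x80 || pos + 2 + length > buffer.size())
            return {};
        entries.push_back(buffer.substr(pos + 2, length));
        pos += 2 + length;
    }
    return entries;
}
```

Called on the 12 bytes `0A 02 08 03 0A 02 08 03 0A 02 08 03`, it returns three entries, each holding the SingleInt bytes `08 03`.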

Solution 2:
[-----nestedRepeatedMessageString-----]
HEX: 0A 02 08 03 0A 02 08 03 0A 02 08 03
BIN: 00001010 00000010 00001000 00000011 00001010 00000010 00001000 00000011 00001010 00000010 00001000 00000011

Another way to do it is to have a message on the server side that has a repeated bytes field. The client apps will not have access to this message, because they will interpret the data as a NestedRepeatedMessage, not as a NestedRepeatedMessageString.