SELECT * FROM vendors WHERE id = 'Meraki'

Meraki is one of those wireless companies where the more you learn about them, the more you like what they're doing. They are a nimble young company, now owned by Cisco and operating as its Cloud Networking Group.

I’ve had experience using their wireless, security, switching, and mobile device management platforms, but I never really understood how they worked from the back-end services. With Meraki being a cloud-based solution, performance is key: the infrastructure has to scale to handle numerous client requests for data through the dashboard application. Having built numerous web applications using PHP and MySQL, I understand where they are coming from. I’ve written applications in the past that queried MySQL tables in excess of a million rows, and I can tell you that isn’t a quick operation unless it’s done right.

So let’s take a look at the Meraki cloud from about 1000 feet:

  • Customers are partitioned across Meraki servers
    • Each partition is called a shard.
    • Each shard is a beefy 1U RAID server, plus a geographically dispersed 1U backup.
    • One shard acts as the master and serves as a demultiplexer, routing each customer to their home shard. When you hit the dashboard GUI you can see which shard you landed on (a hypothetical lookup sketch follows this list).
  • Example scale from a representative shard
    • Thousands of Meraki devices
    • Hundreds of thousands of clients
    • About 300GB of stats, dating back over a year
    • Gathers new data from every device every 45 seconds

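To make the demultiplexer idea concrete, here is a minimal sketch of how a master might map a customer to its home shard. Everything here (the Shard case class, lookupShard, and the example hostnames) is my own illustration, not Meraki's actual code:

```scala
// Minimal sketch (not Meraki's code) of the master-as-demultiplexer idea:
// the master knows which shard holds each customer and redirects
// dashboard requests accordingly.
case class Shard(name: String, host: String)

object Master {
  // Illustrative partition table; real assignments live in Meraki's backend.
  private val partition: Map[Long, Shard] = Map(
    1001L -> Shard("n1", "n1.meraki.com"),
    1002L -> Shard("n7", "n7.meraki.com")
  )

  // Given a customer ID from a dashboard login, return the shard URL to redirect to.
  def lookupShard(customerId: Long): Option[String] =
    partition.get(customerId).map(s => "https://" + s.host + "/dashboard")
}
```
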
Those representative-shard numbers create some interesting engineering challenges, though:

  • How to connect to thousands of Meraki devices per shard?
    • How do you reach devices behind NATs? Meraki created custom tunneling software; the tunnel itself is encrypted, as is the data streamed through it.
    • This secure tunnel requires only 2 packets per device every 25 seconds, which keeps it scalable (see the back-of-the-envelope sketch after this list).
  • How to gather stats from those devices every 45 secs?
    • You need to minimize network overhead.
    • You need to minimize CPU costs.
  • How to quickly store and retrieve all that data for analysis?
    • The customer database is tuned specifically for statistical data.
    • It achieves SSD-like speeds from inexpensive spinning disks, which reduces the cost of scaling.

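As a back-of-the-envelope check on why 2 packets per device every 25 seconds scales so well, here is a quick sketch; the packet size and device count are my assumptions, not figures from the presentation:

```scala
// Rough keepalive-overhead math for the custom tunnel:
// 2 packets per device every 25 seconds, at an assumed ~100 bytes/packet.
object TunnelOverhead {
  def main(args: Array[String]): Unit = {
    val devices      = 10000   // e.g. thousands of devices on a shard
    val pktsPerCycle = 2
    val cycleSecs    = 25.0
    val bytesPerPkt  = 100.0   // assumption for illustration

    val pps = devices * pktsPerCycle / cycleSecs
    val bps = pps * bytesPerPkt * 8
    println(f"$pps%.0f packets/sec, ${bps / 1e6}%.2f Mbit/sec for $devices devices")
    // ~800 packets/sec and well under a megabit for 10k devices --
    // cheap enough to keep tunnels to every device alive.
  }
}
```
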
Meraki solved these challenges with a poller of sorts (think Cacti, if you're familiar with it) called Poder. At first it didn't scale well because of how the daemon grabbed data from the devices: threads were used, with each grabber (one per data point, such as CPU usage or uptime) running on its own. This was a very simple approach that relied on basic building blocks (a sketch of such a grabber follows the list below):

  • HTTP and XML libraries are widely available
  • TCP transport easily carries arbitrarily large responses
  • XML is loosely structured, making it easy to add new stats
  • Blocking code is easy to write
  • It’s easy to restart individual grabbers during testing/dev

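Here is a minimal sketch of what a blocking grabber in that original style might look like; the URL scheme, stat names, and Grabber class are placeholders of my own, and the real Poder daemon is certainly more involved:

```scala
import scala.io.Source
import scala.xml.XML

// Sketch of the original approach: one blocking grabber per stat.
// Each grabber does its own HTTP fetch and XML parse -- simple, but every
// fetch costs a full TCP+HTTP round trip and each grabber needs its own thread.
class Grabber(device: String, stat: String) extends Thread {
  override def run(): Unit = {
    while (true) {
      // Placeholder URL; the real device endpoint is internal to Meraki.
      val body  = Source.fromURL(s"http://$device/stats?name=$stat").mkString
      val value = (XML.loadString(body) \ stat).text
      println(s"$device $stat = $value")   // in reality: write to the stats DB
      Thread.sleep(45000)                  // poll every 45 seconds
    }
  }
}

// One thread per (device, stat) pair multiplies out quickly:
// new Grabber("ap-1.example", "cpu_usage").start()
// new Grabber("ap-1.example", "uptime").start()
```
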
But this approach is expensive at scale:

  • A single HTTP fetch with an empty payload costs 10 packets and 510 bytes
  • Each grabber does its own fetch; connections are not shared
  • Lots of threads are needed for I/O parallelism

So a new, high-performance approach was created, based on an event-driven RPC engine. There is still a separate module for each type of stat (LLDP, probing, etc.); each module creates a request and sends it to the RPC engine, and each device's response flows back and into the DB. This yields a single networking core for all statistics, where a transaction is one fetch thread plus non-blocking reads/writes driven by a select loop (sketched below). Switching to UDP by default, with a fallback to TCP for large responses, gives quicker packet transfer times, and binary encoding via Google Protocol Buffers makes efficient use of buffers. The per-stat modules are single-threaded and block only on DB access; most are 200-400 lines of straight-line Scala code and don't present a locking issue.

The backend DB was changed as well. Meraki uses an interesting tree-node design that keeps more information in RAM before writing it to disk. Once data is on disk, index identifiers allow quicker seeking, improving performance by limiting the number of reads required to find the section containing the stats the dashboard needs (a storage sketch follows the networking one below).
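
The single networking core maps naturally onto a non-blocking select loop. Here is a compressed sketch using Java NIO from Scala; it is UDP-only, skips the TCP fallback and protobuf decoding, and none of the names come from Meraki's actual engine:

```scala
import java.net.InetSocketAddress
import java.nio.ByteBuffer
import java.nio.channels.{DatagramChannel, SelectionKey, Selector}

// Sketch of the event-driven core: one thread, one UDP socket,
// non-blocking reads multiplexed by a select loop. Responses would be
// protobuf-decoded and handed to the per-stat modules / DB writer.
object RpcCore {
  def main(args: Array[String]): Unit = {
    val selector = Selector.open()
    val channel  = DatagramChannel.open()
    channel.bind(new InetSocketAddress(7777))   // illustrative port
    channel.configureBlocking(false)
    channel.register(selector, SelectionKey.OP_READ)

    val buf = ByteBuffer.allocate(1500)         // one MTU; big replies fall back to TCP
    while (true) {
      selector.select()                         // block until any socket is ready
      val keys = selector.selectedKeys().iterator()
      while (keys.hasNext) {
        val key = keys.next(); keys.remove()
        if (key.isReadable) {
          buf.clear()
          val from = channel.receive(buf)       // non-blocking UDP read
          buf.flip()
          // decode the protobuf payload here and enqueue it for the DB writer
          println(s"got ${buf.remaining()} bytes from $from")
        }
      }
    }
  }
}
```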

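And here is a tiny sketch of the storage idea: accumulate stats in an in-RAM sorted structure, append them to disk in batches, and remember each batch's file offset so a later lookup needs only one seek. The record format and names are my own invention; Meraki's actual node store is surely more sophisticated:

```scala
import java.io.{DataOutputStream, FileOutputStream}
import scala.collection.mutable

// Sketch of buffered, index-assisted stat storage:
// keep recent samples in RAM, append them to disk in sorted batches,
// and record the byte offset of each device's data for one-seek reads.
object StatStore {
  private val ram   = mutable.TreeMap.empty[(String, Long), Double] // (device, ts) -> value
  private val index = mutable.Map.empty[String, Long]               // device -> file offset
  private val path  = "stats.dat"

  def record(device: String, ts: Long, value: Double): Unit =
    ram((device, ts)) = value

  // Flush RAM to disk; sequential appends are where spinning disks
  // deliver SSD-like throughput.
  def flush(): Unit = {
    val out  = new FileOutputStream(path, true)
    val data = new DataOutputStream(out)
    var off  = out.getChannel.position()
    for (((device, ts), v) <- ram) {
      index.getOrElseUpdate(device, off)   // remember where this device's data starts
      data.writeUTF(device); data.writeLong(ts); data.writeDouble(v)
      off = out.getChannel.position()
    }
    data.close()
    ram.clear()
  }

  // One indexed seek instead of scanning the whole file.
  def firstOffset(device: String): Option[Long] = index.get(device)
}
```
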
This new approach lets Meraki scale the dashboard application, which is crucial to the user experience and the usability of their equipment, to levels that were once out of reach. I really enjoyed getting an in-depth, drink-from-the-fire-hose look at the back-end services running the dashboard GUI. While it wasn't directly related to RF or wireless, I think it is important to understand how Meraki handles the complexity of running a hosted platform while growing to support more and more customers. Kudos to Meraki for giving us this behind-the-scenes look.
