Or, an amusing misuse of a cloud datacenter operating system. As a reminder, I’m using this product in an entirely weird way, and anything I’ve listed below should not be seen as a shortcoming of the product. I’ve pushed this product to a weird edge, and it’s performing admirably.
I need to test a new stack of switches for a data center. The data center isn’t very large, but it is complex configuration-wise. I’d like to be able to place real, instrumented traffic on the devices, but for lack of resources I have one server to work with. Said server is a serious piece of hardware, but still a single server that was scavenged from spare parts.
So, Joyent’s Triton product https://www.joyent.com/triton/compute is an impressive stack of technologies, some of which I will use for this solution. One of the core technologies I’m going to use is Triton’s native container service, implemented with Illumos zones. Zones provide virtual systems but share the kernel, thus allowing me to build densely. That, plus the Illumos Crossbow networking stack, gives each zone its own IP address and its own MAC address. Perfect for populating with small services moving network traffic.
If I’m going to be testing networking hardware, I need to make sure the traffic is actually passing through the network. Crossbow’s stack will generate a virtual ethernet switch in normal operation, so that traffic passing from one zone to another on the same host never exits the host. That’s the exact opposite of what I want to accomplish in my testing. I would like 2 zones, both attached to vlan 4, to talk to each other over physical network interfaces.
The hardware setup
I’m using a reasonably average Dell R820 server. Getting Triton to boot is a different discussion https://blog.goekesmi.com/bootstrapping-triton-on-whatever-you-can-find/ . After you’ve got Triton online in a single headnode configuration running local provisioning of instances, fill the chassis with ethernet cards. Full. In my case, I ended up with 14 gigabit ports and two 10 gig SFP+ ports total.
The networking setup
Now we go to the weirder edges of what can be configured. Triton has reasonable documentation about nictags https://docs.joyent.com/private-cloud/networks/nic-tags and networks, and the napi https://github.com/joyent/sdc-napi/blob/master/docs/index.md .
So, a nictag is basically a name. A network is an IP configuration and a range of available addresses, and is associated with a vlan id and a nictag. Compute Nodes have nictags assigned to a physical interface. Many networks can be on one nictag; many nictags can be on one physical interface.
Can networks have overlapping ip allocations, but unique names?
Yes, as long as you are in RFC1918 space. https://github.com/joyent/sdc-napi/blob/392178ed2d1de72d814cfd9fea0b728b8dbd84d9/lib/models/network.js#L778-L790
Can two networks with the same vlan id mapped to different nictags have both nictags on the same physical interface?
Yes, as confirmed by experimental testing.
With this, I have enough to construct the test jig.
Construct a Triton::Network (yes, I’m namespacing this because a network has so many, many meanings) that has a unique Triton::nictag and a wide subnet, but a restricted allocation range.
network: tnt_10_0_4_0
nictag: tnt_10.0.4.0
ip subnet: 10.0.4.0/24
default route: 10.0.4.1
first address: 10.0.4.2
last address: 10.0.4.15

network: tnt_10_0_4_16
nictag: tnt_10.0.4.16
ip subnet: 10.0.4.0/24
default route: 10.0.4.1
first address: 10.0.4.16
last address: 10.0.4.31
Finally, assign each of these nictags to an independent physical interface on your Compute Node.
Create an instance in each network and then have the two attempt to talk to each other. The traffic flows out one network interface, through the device under test, and into the other interface. We have the bare bones of a test jig.
And now we have to make it scale
I wanted this to be a non-trivial number of IP addresses and MAC addresses active at any given time. I’m aiming for around 1000. I’ve had 1000 zones up and running before, but that was with SmartOS directly, not using Triton, and not using the above networking configuration.
So, first, configure the Triton::nictag and Triton::Network entries in the network configuration. The above sample can be done in the web interface, but doing that the hundreds of times I might need to seemed painful. Thankfully, this is Triton. It’s all open source, and there’s good documentation about how it all fits together. You have the headnode, you are on the control plane. Write your own thing to talk to napi.
There’s a terrible piece of code I’ve written, which I’m too embarrassed to share, that runs from a zone sitting on the control plane (the admin network in Triton parlance) and generates the names, nictags, and network configurations. That’s harder than it seems it should be, because the allocation ranges have to avoid the default router address (provided by the network equipment under test) and the broadcast addresses. For amusement value, the way I do this sort of address manipulation is almost always with a PostgreSQL instance that I’m not storing data in. I find PostgreSQL has a remarkably complete set of types, and functions to manipulate those types in native ways. In this case, https://www.postgresql.org/docs/current/static/functions-net.html has almost every manipulation you might want, and lets you think in IP addressing, not text manipulation. Also, napi is reasonably well documented. https://github.com/joyent/sdc-napi/blob/392178ed2d1de72d814cfd9fea0b728b8dbd84d9/docs/index.md
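For a feel of what that generation step involves, here is a minimal sketch. It is not my actual code, and it swaps PostgreSQL for Python's standard ipaddress module. The function name, the dictionary keys, and the 16-address chunk size are all assumptions made for illustration, and it assumes the gateway sits at the start of the subnet so that each range stays contiguous.

```python
import ipaddress


def plan_networks(subnet, gateway, chunk=16):
    """Split one subnet into allocation ranges, one network/nictag pair each.

    Hypothetical helper: the key names mirror the example above, not any
    particular API payload.
    """
    net = ipaddress.ip_network(subnet)
    gw = ipaddress.ip_address(gateway)
    # Never allocate the network address, the broadcast address, or the gateway.
    reserved = {net.network_address, net.broadcast_address, gw}
    plans = []
    for offset in range(0, net.num_addresses, chunk):
        end = min(offset + chunk, net.num_addresses)
        block = [net.network_address + i for i in range(offset, end)]
        usable = [addr for addr in block if addr not in reserved]
        if not usable:
            continue
        start = str(block[0])
        plans.append({
            "network": "tnt_" + start.replace(".", "_"),
            "nictag": "tnt_" + start,
            "subnet": str(net),
            "default_route": str(gw),
            # Assumes reserved addresses only sit at the edges of a block.
            "first_address": str(usable[0]),
            "last_address": str(usable[-1]),
        })
    return plans


if __name__ == "__main__":
    for plan in plan_networks("10.0.4.0/24", "10.0.4.1")[:2]:
        print(plan)
```

Run against 10.0.4.0/24 with a gateway of 10.0.4.1, the first two entries match the two sample networks shown earlier.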
A key thing to remember is that napi is really just a database holding thoughts about what a configuration should look like; it doesn’t actually implement the configuration.
Also, at this point, I realized I was on the control plane and it’s speaking http. Not https. And there was no authentication. Hello and welcome to the control plane. Protect your admin networks. Don’t let those leak.
Once that was done, I had to assign nictags to physical interfaces. Again, I could do this by hand in the adminui, but that seemed like a painful way, and I was already on the control plane. So how does that work? My first guess was napi, but as I said above, it’s a database, it doesn’t actually do the work.
Assigning nictags to physical interfaces is the role of cnapi.
The obvious first response was to loop through each nictag, and match it up with a physical interface. To make sure things don’t always line up, reduce the number of physical interfaces to 13 (prime) and that should spray the nictags for a given IP subnet and range in a delightfully spread out pattern. For each nictag, call the api to add a nictag to the interface.
Don’t do this.
Jobs, workflows and scheduling
Each time you call this API, a job is created to go out to the CN, make a change, collect up the final state, and update Triton about the new state of the universe. This process takes 20-30 seconds overall. I had hundreds of nictags to add and update. Each time I added a nictag, the system went slower, because reading the present configuration, parsing it, and updating the universe is, apparently, an O(n) operation. I was doing n operations. I had accidentally made an O(n^2) operation. About halfway through populating, the process that collects information about the system (called after the assignment to a physical interface, but before the job is done) started timing out and throwing job failures.
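To see why this hurts, here is a toy cost model, not a measurement. The 20 second base cost comes from the low end of the job time above; the per-existing-tag cost is a made-up parameter, and the function name is hypothetical.

```python
def total_seconds(tag_count, base_seconds, seconds_per_existing_tag):
    """Total time to add tag_count nictags one job at a time.

    Each job pays a fixed base cost plus a cost proportional to the number
    of nictags already configured, which is what makes the total quadratic.
    """
    total = 0.0
    for existing in range(tag_count):
        total += base_seconds + seconds_per_existing_tag * existing
    return total


if __name__ == "__main__":
    # Illustrative numbers only: 20 s base, 1 s per already-present nictag.
    for count in (100, 200, 400):
        print(count, total_seconds(count, 20.0, 1.0))
```

Under those assumed numbers, doubling the number of nictags far more than doubles the total time, which is the shape of the slowdown I was watching.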
There has to be a better way
Reading the documentation again, you see the second call is a replace operation. Since I’m using a dedicated set of physical interfaces, and I get to choose how all of them are set up, I can aggregate the whole configuration into a single job that sets all the interfaces at once. As an intermediate step, I attempted to set each physical interface from an independent job. This was better than each nictag one at a time, but each job still took minutes and eventually timed out, thankfully completing in the background.
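Aggregating means working out the complete nictag list for every interface up front. A minimal sketch of that grouping step follows, assuming a simple round-robin spread. The interface names and the function name are hypothetical, and the request body cnapi actually expects is not shown here.

```python
from collections import defaultdict


def spread_nictags(nictags, interfaces):
    """Assign nictags to interfaces round-robin and group them per interface.

    Returns a plain dict mapping each interface name to its full nictag list,
    ready to be sent as one replace-style update instead of many small adds.
    """
    assignment = defaultdict(list)
    for index, tag in enumerate(nictags):
        interface = interfaces[index % len(interfaces)]
        assignment[interface].append(tag)
    return dict(assignment)


if __name__ == "__main__":
    # Hypothetical names: 13 interfaces, 26 nictags.
    interfaces = ["nic%d" % n for n in range(13)]
    nictags = ["tnt_example_%d" % n for n in range(26)]
    for interface, tags in spread_nictags(nictags, interfaces).items():
        print(interface, tags)
```

With 13 interfaces, consecutive nictags land on different ports, so neighbouring address ranges have to cross the hardware under test to reach each other.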
But to do that, I had to clean out and reset my interfaces to empty. I think I’ve found a bug where if you attempt to replace the whole set of nictags with a null set, nothing happens. I’ll get to submitting that one day.
Finally, the network configuration is in place and it’s time to launch containers. Lots of containers.
Now back to somewhat normal Triton
For this testing, I created an independent Triton user for the purpose of owning all the networks and all of the instances I would be using. This makes it easy to clean out everything and reset the universe for another go.
sdc-listnetworks will get you some networks. sdc-createmachine will let you launch an image, with a package, on a network, with some user metadata. A bash loop lets you do this quickly. In pseudocode:
for network in $(sdc-listnetworks | json -a id) ; do sdc-createmachine --network "$network" --bar etc ; done
And away we go, launching an instance on each of the networks I’ve created.
Want to get a list of IPs of machines that are up? Again, pseudocode:
sdc-listmachines | json -a primaryIp > ipaddrs
Have something running that uses the Triton metadata service https://docs.joyent.com/private-cloud/instances/using-mdata for targets? Want to use that list?
for machine in $(sdc-listmachines | json -a id) ; do sdc-updatemachinemetadata --metadatafile targets=ipaddrs "$machine" ; done
Composable systems with easy changes and updates, isn’t this wonderful?
The job and workflow engine returns.
Starting an instance isn’t a free operation. That process actually creates a job in the system and has to grind through a workflow. Under unloaded conditions, it takes about 2 minutes to create an instance in my system. You can keep queuing jobs all you want, but all that does is create a backlog. This is okay, until you want to add another job that needs to run right now. If you are 100 two-minute jobs back, it’s going to be a while before that new job runs. What could that new job be? Anything, but importantly, stopping or destroying a system will get in the queue after the create machine calls, so if you are attempting to reduce load on the system, you won’t be able to via that method.
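The arithmetic is worth spelling out. This is a back-of-the-envelope sketch with a hypothetical function name; the 2 minute job time is the unloaded figure above, and the concurrency parameter is an assumption, not something I measured.

```python
import math


def wait_minutes(jobs_ahead, minutes_per_job=2.0, concurrent_jobs=1):
    """Rough wait before a newly queued job starts running.

    Assumes every job takes the same time and that the runner works through
    the backlog in batches of concurrent_jobs.
    """
    batches = math.ceil(jobs_ahead / concurrent_jobs)
    return batches * minutes_per_job


if __name__ == "__main__":
    print(wait_minutes(100))                     # strictly sequential runner
    print(wait_minutes(100, concurrent_jobs=4))  # a runner handling 4 at once
```

Even with a few jobs running side by side, a stop request queued behind 100 creates waits the better part of an hour.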
Also, remember how I said 2 minutes under ‘unloaded’ conditions? Loading the system down makes that take longer. I’ve seen it up at the 300 second range to create a new machine.
Also, the queue runner will only process a few jobs at once. There appear to be locks to make sure jobs don’t step on each other. This is all sound engineering, but it reduces the rate at which you can create instances. Which leads me to:
The placement engine
If you start pushing the envelope, you will eventually attempt to launch an instance and get a failed job, with the shockingly unhelpful message “No compute resources available”. When you do, use this https://gist.github.com/bahamat/8f5df9789c99afe482fc430bf0ea3de7 to find out why. That will go find your job and dump the placement engine’s reasoning for giving up on your request.
In my case, I discovered that Triton has some default limits that I was running into.
ALLOC_FILTER_VM_COUNT (Integer, default 224): CNs with equal or more VMs than this will be removed from consideration.
Well, that isn’t going to work for me. I updated that to something silly in the 10,000 range and I was once again able to create machines.
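The rule itself is simple. Here is an illustration of it as described in that one line of documentation; this is not Triton's code, and the server records and field names are made up.

```python
def filter_by_vm_count(servers, limit=224):
    """Keep only servers whose VM count is strictly below the limit.

    Mirrors the documented rule: CNs with equal or more VMs than the limit
    are removed from consideration.
    """
    return [server for server in servers if server["vm_count"] < limit]


if __name__ == "__main__":
    # Hypothetical single-server setup, like my headnode.
    servers = [{"hostname": "headnode", "vm_count": 224}]
    print(filter_by_vm_count(servers))               # nothing left to place on
    print(filter_by_vm_count(servers, limit=10000))  # headnode is eligible again
```

With one server in the pool, hitting the limit removes the only candidate, which is why the error reads as no compute resources at all.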
And away we go
This works. I can build machines, push bits, and orchestrate the whole test rig from a single command line. There are some interesting challenges, but most have been overcome at this point.
If the node gets busy, the provisioning process times out. I either need the zones to get very quiet when I’m doing provisioning work, or I need to specifically stop the zones via sdc-stopmachine to make them idle. Zone provisioning takes 2+ minutes, but zone startup takes 5-15 seconds, depending on load. All of these are jobs, and thus are processed by the job scheduler sequentially. Don’t get too far ahead of your job queue, or you will effectively lose control of your system. When that happened to me, I created an on-the-fly job canceler for the workflow engine. It wasn’t nice, but it did work. Again, it’s really nice having direct access to the control plane, and all the parts laid out where I can see them.
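One way to avoid getting ahead of the queue is to cap how many jobs you have outstanding. This is a minimal sketch of that idea, not the job canceler mentioned above. The submit and is_finished callables are hypothetical stand-ins for whatever launches a job and checks its state, and the limit of 4 is arbitrary.

```python
import time


def submit_throttled(requests, submit, is_finished,
                     max_in_flight=4, poll_seconds=5.0, sleep=time.sleep):
    """Submit requests while never having more than max_in_flight outstanding.

    submit(request) must return a job handle; is_finished(job) must return
    True once that job is done. Both are supplied by the caller.
    """
    pending = list(requests)
    in_flight = []
    completed = []
    while pending or in_flight:
        # Top up the in-flight set, but never past the cap.
        while pending and len(in_flight) < max_in_flight:
            in_flight.append(submit(pending.pop(0)))
        # Sort jobs into finished and still running.
        still_running = []
        for job in in_flight:
            if is_finished(job):
                completed.append(job)
            else:
                still_running.append(job)
        in_flight = still_running
        if in_flight:
            sleep(poll_seconds)
    return completed
```

Keeping the backlog this short means a stop or destroy request only ever waits behind a handful of jobs.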
Seeing load averages of over 120 on the system and still having it responsive is nice.
Overprovisioning 1000% on disk and 120% on RAM and still having headroom tells me that I can push this system farther. I’ve hit 1000 test zones. I suspect I can get 1500, but the job runner is likely to have a heck of a time.
Finally, to be absolutely clear, what I’m doing with Triton is a very, very unusual workload and use case. Anything I’ve encountered here should not be read as a problem, so much as an exploration of how useful generic, open, composable systems can be. Many thanks to Joyent for building a system I could even attempt this rig with.
5 Aug 2017 additional notes:
If you attempt to do this, be very, very careful when you reboot the headnode. Hundreds of zones will attempt to launch at the same time. I had the load average well over 1300 on a 32 core box. While the system was still somewhat responsive, it was clearly in trouble. Significant sections of the networking stack failed to assemble. I had to remove or shut down most of the test zones before another reboot to bring the networking stack back up to a reasonable state.
Again, well outside the intended envelope. Test pilot beware.