Author Archives: John Ford

New OS X 10.7 build machine configuration almost ready — Are you able to help test the builds?

We are working on replacing our old Mac build slaves.  As shown in a previous blog post, our current Mac builders are very slow.  I am working on creating an OS X 10.7 builder configuration in bug 720470 and it has gotten to the point where I need help verifying that these builds are valid and correct.  I’ve posted the complete output of two builds done on this new reference machine to my personal server (i.e. this site):

The keen eye will notice that there are no links to shark builds.  As it happens, shark support on 10.7 is something we still need to figure out.  Once I have 10 more machines in our data center to test with, I am going to start producing these builds in a production environment for general consumption.  Soon after, we will want to start switching over to this new builder.  I’ll set these machines to build mozilla-central and project branches first and as we gain confidence, I’ll request approval to move Aurora and possibly Beta to these new machines.

Writing a native rm program for Windows

Our Windows build machines use msys to emulate a posix environment.  Msys is a great tool, providing a lot of common posix utilities, like cp, mv and rm.  Sadly there are bugs in the posix emulation.  For us, this manifests in the rm program being unable to delete certain files.

Bug 583129 is about using native Windows tools for file removals.  In that bug, we’ve looked at a combination of the rmdir and attrib Windows tools.  The problem is that rmdir doesn’t like to delete files if they have the read-only or system attribute.  To fix that, we need to run attrib to remove those attributes then rmdir to delete them.  Running rmdir is fast but attrib is very slow.
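
To make the two-step approach concrete, here is a minimal Python sketch that clears the read-only flag on each entry before deleting it.  It is not the code from the bug and it is not winrm; the function name force_remove_tree is made up for illustration.  One caveat: on Windows, Python's os.chmod can only toggle the read-only flag, so this sketch does nothing about the system attribute.

```python
import os
import stat


def force_remove_tree(root):
    """Delete root and everything under it, clearing read-only flags first."""
    # Walk bottom-up so every directory is empty by the time we remove it.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        # Make the directory itself writable so its entries can be removed.
        os.chmod(dirpath, stat.S_IRWXU)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                # Step 1 (what attrib does): drop the read-only flag.
                os.chmod(path, stat.S_IREAD | stat.S_IWRITE)
            # Step 2 (the actual delete).
            os.remove(path)
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                # Links to directories are removed, never followed.
                os.remove(path)
            else:
                os.rmdir(path)
    os.rmdir(root)
```

Doing both steps in one process avoids launching a separate tool per pass, which is the same motivation as for a native rm.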

Last year, I spent a bit of time writing a native Windows version of rm.  I am by no means a Windows developer, so I spent a couple of hours learning the basics of the Windows API.  The API is quite different to what I’m used to, but the documentation seems to have been written quite well for my purposes.  I was able to get the basics working pretty quickly.  Yesterday, I decided to finish up the program by adding directory deletion, recursive deletes and a command line parser.

I present to you winrm.  The code is available in my Mozilla user repo.  This tool works similarly to the standard posix rm utility.  For simplicity’s sake, only single character options are supported.  These options can be joined, like “-rf”, or specified individually, like “-r -f”.  The “--” option is also supported, signalling the program to treat all following arguments as files to be deleted.  Because of how the option parser works, files are deleted in reverse order to how they are on the command line.

jhford@JHFORD-VM ~/mozilla/jhford-native-rm
$ touch a b c d e

jhford@JHFORD-VM ~/mozilla/jhford-native-rm
$ winrm -v -- a b c d e
deleting "e"
deleted "e"
deleting "d"
deleted "d"
deleting "c"
deleted "c"
deleting "b"
deleted "b"
deleting "a"
deleted "a"
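
The option handling described above can be sketched in a few lines of Python.  This is only an illustration of the behaviour, not the winrm source: the names parse_args and deletion_order are hypothetical, and I am assuming the reverse ordering comes from the files being collected first and then taken off the end of the list.

```python
def parse_args(argv):
    """Split argv into a set of single-character options and a list of files."""
    options = set()
    files = []
    only_files = False
    for arg in argv:
        if only_files:
            # Everything after "--" is a file, even if it starts with "-".
            files.append(arg)
        elif arg == "--":
            only_files = True
        elif arg.startswith("-") and len(arg) > 1:
            # "-rf" and "-r -f" both end up as {"r", "f"}.
            options.update(arg[1:])
        else:
            files.append(arg)
    return options, files


def deletion_order(files):
    """Assumed behaviour: the last file on the command line is deleted first."""
    return list(reversed(files))
```

With this sketch, parse_args(["-v", "--", "a", "b", "c", "d", "e"]) yields the option set {"v"} and the five file names, and deletion_order puts "e" first, matching the transcript above.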

Because my program is written directly against the standard Windows API and is a much simpler program, it is also much faster than the msys rm program.  To test this, I timed deletion of a mozilla-central clone using both my rm and the msys rm.  My program took 37s, whereas the msys program took 113s.
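
For a sense of scale, the two timings work out as follows.  This is just arithmetic on the numbers quoted above, from a single run of each tool.

```python
msys_rm_seconds = 113
winrm_seconds = 37

# How many times faster the native program was.
speedup = msys_rm_seconds / winrm_seconds
# Fraction of the msys wall clock time that was saved.
time_saved_fraction = (msys_rm_seconds - winrm_seconds) / msys_rm_seconds

print(f"{speedup:.2f}x faster, {time_saved_fraction:.0%} less wall clock time")
# prints "3.05x faster, 67% less wall clock time"
```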

If you know the Windows API and have the cycles, please let me know if you find any glaring errors in my program.  If you want to test the program without having to build it, I’ve uploaded a copy here: winrm-0.1.

Evaluating new Mac Mini builders, SSDs and -j settings

We build Firefox a lot[1].  We build every time someone pushes to try.  We build every time someone pushes to mozilla-central, a project branch or an integration branch.  We build every night on many branches.  We build every time we want to ship an update.

Each of those builds is done on Windows, Linux, Mac OS and Android.  In the simplest sense, the time it takes to get results from any of these platforms is made up of:

  • Time from push to build start
  • Time from build start to build completion
  • Time from build completion to test start
  • Time from test start to test completion

What I am looking at here is the time from build start to build completion for Mac OS builds.  The overall end-to-end time is the time from push to the last test completion.  Currently, our Mac OS builds are by far the slowest.  If we are able to speed up the Mac build times, we should see improvements in overall end-to-end times.

We also want release builds to be as fast as possible.  During a release, every minute we can save in the build is a minute more we can spend qualifying the product.  During a chemspill release, every minute saved in the build is a minute sooner that we can protect our users.

For a multitude of reasons, we currently have some pretty slow Mac build hardware. We are building on 5-year-old 1.83GHz Mac Minis [2].  It’s definitely time to upgrade.

I have evaluated the build times of four different specs of Mac Mini that are available today.  Because these machines only run 10.7, we can’t start using them straight away.  We need to figure out how to build 10.5 compatible builds on 10.7 before we are able to use this hardware.

The specifications are:

  • Mac Mini 5,3 – Quad i7 2.0GHz with 256GB SSD and 8GB ram
  • Mac Mini 5,2 – Dual i7 2.7GHz with 256GB SSD and 8GB ram
  • Mac Mini 5,2 – Dual i7 2.7GHz with 750GB 7200rpm and 8GB ram
  • Mac Mini 5,1 – Dual i5 2.3GHz with 500GB 5400rpm and 8GB ram

Both the Quad i7 and Dual i5 have Intel Integrated as their only video chip, whereas both Dual i7 machines have a Radeon 6630M.  This shouldn’t make any difference to building Firefox, but I do want to note this as a difference.

Because I don’t have access to actual pricing, I am unable to do a complete and accurate analysis of cost.  All of my measurements are wall clock timings of scratch builds done without rebooting.  Since we have a limited number of configuration options, we have to figure out which configuration is best for us, not how to pick each component for maximum performance.

Value of solid state disks

One of the biggest decisions we need to make is whether to buy solid state drives or magnetic platter hard drives.  There are a lot of things that factor into this decision.  Having a faster disk means faster I/O.  I did this test using Dual Core machines, because that’s what was available to me.

As shown in the graph above, the total time saved on a Dual i7 2.7GHz machine is just over 8 minutes, an 11% improvement.  With Quad i7s it is more likely that we become I/O starved.  It might be worthwhile to test a Quad i7 with a magnetic disk to see if we get a larger improvement there.

Interestingly, the portions of the build that take the least amount of time overall show the biggest improvement from using an SSD.  Nothing at all is slower on SSD than on magnetic disk.  I have another graph which shows this information as percentage of reduction here.
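
The two figures quoted for the Dual i7 (just over 8 minutes saved, an 11% improvement) also imply roughly how long the builds took.  The snippet below is a back-of-the-envelope estimate from those rounded figures, not measured data, so treat the result as approximate.

```python
minutes_saved = 8.0   # "just over 8 minutes", rounded down
improvement = 0.11    # 11% of the hard drive build time

# If 8 minutes is 11% of the hard drive build, the full build is 8 / 0.11.
hd_minutes = minutes_saved / improvement
ssd_minutes = hd_minutes - minutes_saved

print(f"hard drive ~{hd_minutes:.0f} min, SSD ~{ssd_minutes:.0f} min")
# prints "hard drive ~73 min, SSD ~65 min"
```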

Parallelism in the build system

One of the main things we want to learn from this experiment is whether to buy dual or quad core machines.  One major way we exploit parallelism in the build system is through the use of GNU Make’s -j option.  In order to figure out which -j setting to use for the evaluation builds, I tested each machine using -j2, -j4, -j8, -j12 and -j16.  The following graph is a plot of the build time against the -j setting used for each specification we are evaluating.

The best setting for the Quad-i7 is -j12 and the best setting for all dual core chips is -j4.  The advantage of -j12 over -j8 for the Quad-i7 is less than one minute.  This graph shows that for the compile step, the Quad-i7 is always the fastest and the Dual-i5 is always the slowest.  Each machine exhibited roughly the same trend: build times dropped significantly until an ideal setting was reached, and any setting above the ideal caused times to increase.  I did find that it is better to set the -j setting too high than too low.

Interestingly, as the -j setting increases, the time advantage the Dual-i7 machine with SSD has over the otherwise identical machine with a 7200rpm hard drive increases.  This suggests that as we increase parallelism in the build system, we will need faster I/O to supply the processors with data.
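
A sweep over -j settings like the one described in this section could be scripted roughly as follows.  This is a sketch rather than the harness I actually used: the function names are hypothetical, and the plain “make -jN” command is a placeholder for whatever really drives the build.  A real scratch-build sweep would also need to wipe the object directory between runs, which is not shown.

```python
import subprocess
import time


def time_command(command, cwd=None):
    """Return the wall clock seconds taken to run command to completion."""
    start = time.perf_counter()
    subprocess.run(command, cwd=cwd, check=True)
    return time.perf_counter() - start


def make_command(jobs):
    """Placeholder build invocation; substitute the real build command."""
    return ["make", f"-j{jobs}"]


def sweep_jobs(job_counts, build_command=make_command, timer=time_command):
    """Time one build per -j value and return {jobs: seconds}."""
    results = {}
    for jobs in job_counts:
        results[jobs] = timer(build_command(jobs))
    return results


def best_setting(results):
    """Return the -j value with the shortest wall clock time."""
    return min(results, key=results.get)
```

Calling sweep_jobs([2, 4, 8, 12, 16]) on each machine and passing the result to best_setting would give the per-machine optimum.  The timer is a parameter so the bookkeeping can be checked without running a real build.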

Comparing to existing hardware

It is also worth comparing each of the machine specifications at their optimal -j setting to what we have now.  The data source for the existing hardware is our Buildbot Status DB.  I am using a nearly identical mozconfig file to our nightly builds to make this comparison valid.  The only settings changed were the make -j flag and a couple of Mac OS target settings to allow me to build on 10.7.  The builds on the old hardware are also done on 10.6 instead of 10.7, so this is not an exact comparison.

There isn’t an easy way to map all of the steps that my test builds do to the steps that our production builds do, so I selected the two longest steps of every Mac OS build.

This graph shows the absolute times for each specification.  You’ll notice that the symbols generation is slower on the Quad-i7 than on either Dual-i7 machine.  Symbol generation is not a parallelized process, which explains why higher clock rate dual core machines are faster than the quad core.  Symbol generation is also a much smaller portion of the overall build time than the compile step.

It is clear that no matter what we do, new hardware is a giant step in the right direction.  The question is by how much we want to improve the situation.

This graph shows the time it takes each of the new machine specs to build and generate symbols as a percentage of time on the current machines.  The Quad-i7 is clearly the fastest of the new machine configurations.

Cost

While I don’t know what our actual pricing might be, I do know what each of these configurations costs retail.  Below is a graph of how much each minute of time saved from the compile step costs.  In case we get discounts, I have included the costs adjusted for 5%, 10% and 15% discounts.

The order of absolute number of minutes saved is Quad-i7, Dual-i7-SSD, Dual-i7-HD then Dual-i5-HD.  This graph does not take into account the cost of racking, networking, powering and cooling.  It also doesn’t take into account the throughput advantages of fast machines or wait time improvements of many slow machines.  This is for one single build, not counting any wait time.
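
The metric in the graph can be written down as a small function.  This is my reading of “cost per minute saved” (discounted price divided by minutes saved on the compile step), so the formula itself is an assumption, and the function names are hypothetical.  No real prices or timings are included here.

```python
def cost_per_minute_saved(retail_price, minutes_saved, discount=0.0):
    """Discounted price divided by the minutes saved on the compile step."""
    if minutes_saved <= 0:
        raise ValueError("minutes_saved must be positive")
    return retail_price * (1.0 - discount) / minutes_saved


def discount_table(retail_price, minutes_saved,
                   discounts=(0.0, 0.05, 0.10, 0.15)):
    """Cost per minute saved at retail and at each discount level."""
    return {
        discount: cost_per_minute_saved(retail_price, minutes_saved, discount)
        for discount in discounts
    }
```

As with the graph, this covers a single machine and a single build; racking, power, cooling and wait times are left out.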

Conclusions

The information above shows that the Quad Core i7 is faster than all the dual core machines.  I think we should buy the Quad i7 with SSD, as evaluated.  We have work underway to improve the parallelism in the build system, which favours buying quad core.  The quad core is also the overall fastest machine by 10 to 20 minutes.  If we decide not to get the quad core, I think we should get the Dual i7 2.7GHz machine with a 7200rpm hard drive, as the SSD adds significant cost in this configuration for minimal performance gain on dual core machines.

If you are a developer and understand Mac OS and want faster build times, please help fix bug 715397 so we can start making use of whatever faster build machines we end up buying.

Further Tests

There are a couple of other tests that I think would give us useful information.  I don’t have time to run these right now, but I am interested in the results:

  1. Using a Mac Pro and DistCC to optimize compile times
  2. Using a mix of SSD and HD for OS, repo and objdir
  3. 16GB of memory in quad core minis
  4. Quad core minis with hard drives
  5. Effects on incremental builds

Code and data available on GitHub.

[1] Yesterday, we built it nearly 2400 times
[2] http://support.apple.com/kb/sp7 + ram upgrade

Using OpenVPN to tunnel all traffic through my home server

I want to be able to send all my internet traffic to the Linux machine I have running in my apartment and I am not a networking expert. My motivation for this post is threefold: to document my process for future reference, to share my info and to see if people have suggestions for how to do this better. I am not going to go through every option, just what I did and what worked for me.

The next step was to figure out what I needed to do. I decided on using openvpn because I already use it for work and because it’s open source. I found the how-to document on the openvpn site to be really useful. I am using Fedora, so I skipped the section on installing openvpn from source and ran “sudo yum install openvpn”. My next step was to copy the pki support files into a directory by running “cp -r /usr/share/openvpn/easy-rsa/2.0/* .”. I then followed the directions for generating the pki infrastructure.

For this to work you need an open port on your server. I used the openvpn standard of 1194. I tested that the port was open with netcat by running “nc -l 1194” on my server and “nc server.name 1194” on my laptop. Each line typed in either terminal shows up in the other once you press enter.
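
If netcat isn’t handy on the client side, the same reachability check can be sketched in Python.  The function name is hypothetical, and something still has to be listening on the server (for example the “nc -l 1194” above).  Like plain nc, this only tests TCP; a UDP port would need a different kind of check.

```python
import socket


def tcp_port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For the setup in this post the call would be tcp_port_is_open("server.name", 1194), where server.name is the same placeholder hostname used in the configs below.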

At this point, I needed to set up the server configuration. I copied the sample config file to my directory by running “cp /usr/share/doc/openvpn-2.1.4/sample-config-files/server.conf server.conf”. I found that the sample server config file worked great for me with the following changes:

diff -U0 sample-config-files/server.conf config/server.conf
--- sample-config-files/server.conf	2011-12-12 21:43:31.000000000 -0800
+++ config/server.conf	2011-12-12 22:16:46.000000000 -0800
@@ -196,0 +197,2 @@
+push "dhcp-option DNS 0.0.0.0"
+push "dhcp-option DNS 0.0.0.0"
@@ -204 +206 @@
-;client-to-client
+client-to-client

The first change pushes DNS servers to my client (fake ips, obviously) and the second change is to allow different clients to talk to each other. I am not sure how useful the inter-client link will end up being.

I am using the Viscosity client because that’s the only sane way to do this on OS X and Windows. Sending all traffic over the vpn link is the default behaviour for Network Manager (Linux). I started with the sample by running “cp /usr/share/doc/openvpn-2.1.4/sample-config-files/client.conf .”. My changes were pretty basic:

diff -U0 sample-client.conf client.conf
--- sample-client.conf	2011-12-12 22:43:11.000000000 -0800
+++ client.conf	2011-12-12 21:49:17.000000000 -0800
@@ -42 +42 @@
-remote my-server-1 1194
+remote server.name 1194
@@ -89,2 +89,2 @@
-cert client.crt
-key client.key
+cert laptop.crt
+key laptop.key

At this point, the client side configuration was ready to transfer, so I tarred up the needed files with:

mkdir ovpn-configs
cp keys/ca.crt keys/laptop.crt keys/laptop.key client.conf ovpn-configs/
tar jcf laptop-openvpn-config.tar.bz2 ovpn-configs

and used scp to transfer the files over to my laptop.

Once on my laptop, I untarred the files and imported the configuration into Viscosity. I did this by:

  • clicking on the Viscosity menu icon, then selecting preferences
  • clicking on the plus button with the down arrow, selecting “import connection”, then selecting “from file”
  • selecting the client.conf file from the tarball

Next, I configured all my traffic to go over vpn. I selected the “client” configuration from the list of configurations and pressed the “edit” button. In the sheet, I navigated to the “networking” tab and checked the box for “send all traffic over VPN connection”. My client side configuration was complete.

At this stage, I tested that my machine was able to connect to my openvpn server. I gathered the various files needed for the openvpn server into a single directory:

mkdir ~/openvpn-server/
cp keys/* ~/openvpn-server #lazy
cp server.conf ~/openvpn-server

and started the server with “cd ~/openvpn-server && sudo openvpn server.conf”. I connected to the server using Viscosity. The client connected properly, but I was unable to resolve any DNS names or reach anything other than my openvpn server. Reading the openvpn howto suggested setting up a NAT. I did some searching and found a page with information on setting up the NAT. I did:

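# run these as root; let the kernel forward IPv4 packets between interfaces
echo 1 > /proc/sys/net/ipv4/ip_forward
# rewrite the source address of traffic leaving via eth0 to the server's own
/sbin/iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
# allow replies back in to vpn clients (tun0), and vpn clients out via eth0
/sbin/iptables -A FORWARD -i eth0 -o tun0 -m state --state RELATED,ESTABLISHED -j ACCEPT
/sbin/iptables -A FORWARD -i tun0 -o eth0 -j ACCEPT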

At this point, everything worked! I ran traceroute, and the first hop was my vpn server’s vpn address (10.8.0.1). I also used some websites to check my public IP and it was showing as my server’s IP.

I hope this is useful to others. If I’ve done something really dumb, I’d appreciate any suggestions for how to do it better! I have left out information about how to start the openvpn service on boot. This isn’t really important to me right now but if I ever bother with it, I’ll update this blog post.

A *useful* IRC channel for discussions about Release Engineering and its systems

At some point or another, I am sure that every code contributor has had a question about Release Engineering or the systems we run. The best place for this should have been #build. Instead, these discussions end up happening in various other channels because of the barrage of alerts from nagios. To put some approximate numbers on it, in 5 months, we had 69,000 messages in #build. Of these, 42,000 were messages from or to nagios. This makes it very difficult to have a conversation in #build.

To fix this, we have created #buildduty. This channel is specifically for nagios alerts and buildduty queries. The work to switch nagios to point to #buildduty is being tracked by bug 700817. Once this is done, #build should become a useful collaboration point for all things Release Engineering.