DæmonNews: News and views for the BSD community

September 2000

An In-depth Study of NetBSD Bootstrapping
and a Few Speedy Enhancements

Final Report


May 1st, 2000

Chen Zhou & Raghav Ramesh
czen@cs.columbia.edu, raghav@nymed.net
http://www.10k.org/~jake/pub/sb

Advanced Operating Systems
Prof. Jason Nieh

Introduction

       ...
       #ifdef notyet
       if (boothowto & RB_FASTBOOT) {
          *flagsp++ = 'f';
          options = 1;
       }
       #endif
       ...
       - orphan code found in the NetBSD kernel

Motivation

The motivation for this project is two-fold: one, to gain a detailed understanding of the low-level bootstrap process; two, to exploit areas of performance improvement.

There has been minimal work in this area, as most OS developers consider bootstrapping uninteresting[7]. This has led to a dearth of knowledge in the field. In addition, most available documentation is either incomplete or out-of-date, reflecting neither modern hardware nor modern OSes.

The OS of our choice for this research project is NetBSD/macppc. We chose BSD mainly for its coherent interface and organization, and also for its "OS maturity": it has been successfully tested and refined over the years.

In the following sections, we present our experimental setup, an illustrated description of the multi-stage boot process, and the empirical measurements and analysis we used to identify the bottlenecks. We then describe the optimization techniques applied in our enhancements, the improvements we achieved, and, last, our conclusions.

Experimental Setup

The machine used for the project is a PowerBook 2400c with 112 MB of
main memory, running a PowerPC 750 (G3) processor rated at 240 MHz with
a 512 KB backside cache. The computer provides OpenFirmware 2.0.1. The
three SCSI devices used in this project are the following:

IMAGE sb01.jpg

The operating system is NetBSD/macppc; we used the source dated
02/05/2000, the latest stable version in the -current tree at the time.

Boot Sequence

The boot process on NetBSD/macppc can be logically separated into three stages. In each stage, a larger and more complex program is loaded and executed. We will identify each stage by the name of the program that is loaded: "bootxx", "ofwboot", and "netbsd" respectively. This process is visually illustrated in Figure 1.[1]

Stage 1 (bootxx)

This is the very first stage of booting. First, the computer is turned on by hitting the "power" key located on the top right corner of the keyboard. We are then presented with an OpenFirmware command prompt. The first boot stage is coordinated through several environment variables in OpenFirmware, which can be examined and set using "printenv" and "setenv" respectively. The one of interest here is called "boot-device"; in our scenario, it is set to "scsi/sd@1:0". This value tells the computer to boot from the SCSI drive located at ID = 1, the Seagate ST34572N. "load-base" is another variable which affects the behavior of booting at this stage: it specifies the memory address at which OpenFirmware should load the first stage booter so that execution can begin from that address. The default is set to 4000, and we leave it as is.

Now, typing "boot" at the prompt loads the "bootxx" program from the boot disk. "bootxx" is located on the second partition of the disk, the first partition being the Apple Partition Map. The purpose of "bootxx" is simply to load the second stage boot program, namely "ofwboot". The physical location of "ofwboot" is hardcoded within "bootxx". It should be noted that "bootxx" itself must first be installed at a fixed location via the command "installboot". "installboot" collects all the physical disk sectors where "ofwboot" resides and hardcodes them into the static portion of the "bootxx" executable.

Figure 1
IMAGE sb03.jpg

Stage 2 (ofwboot)

Now we are at the second stage booter. "ofwboot" in our case is 63 KB, compared to the much smaller "bootxx", which is only 1 KB in size. Part of the complexity of "ofwboot" is its ability to read the standard BSD filesystem, FFS. "ofwboot" can also be run in an interactive mode whereby the user can specify the root partition as well as pass flags to the kernel that it is about to load. "ofwboot" examines the "boot-file" environment variable within OpenFirmware to determine the name of the kernel (which defaults to "netbsd" if nothing is specified). It then searches the root filesystem for the kernel, in our example "/netbsd.level1". Once the kernel has been successfully located, it is loaded into memory, and "ofwboot" transfers control to it.

Stage 3 (netbsd)

The kernel is often specific to the configuration for which it was built. In general, the kernel now begins various machine dependent initializations, such as the virtual memory subsystem. The kernel continues by calling the "cpu_startup()" routine, which initializes various system data structures, and begins autoconfiguration of the different I/O devices. Once the kernel completes the initialization, it sets the scheduler in motion and forks off the first real process, "init". The boot then proceeds by starting the services specified in the "/etc/rc" and "/etc/rc.local" bootscripts. The boot completes its journey by printing a login prompt on the console device.

Bottleneck

After understanding the boot-up process, our next step was to isolate the main bottleneck so that we could focus our efforts. Our approach was to use the human perceived response time as a guideline and then to validate our hypothesis using empirical data. Our qualitative observations led us to believe that the bottleneck lies within the SCSI device configuration portion of the boot process.

In order to obtain accurate statistics, we decided to bound our measurements between two points in time: the time when the kernel begins execution and the time just before the kernel forks the first process, "init". Our choice was inspired by the following motivations: first, the first and second stage booters are both machine specific and simple in nature, and do not leave much room for additional optimization; second, we wanted to choose an interval whose operation is minimally affected by external variables. If we were to follow "the rest" of the booting process (all the way until the login prompt is printed), the results would be heavily adulterated by the interaction of various subsystems such as bootscript organization, disk performance, the virtual memory subsystem, and scheduling policies. Within our chosen duration, on the other hand, the kernel executes each step sequentially on a single thread of control. We shall refer to our defined interval as one "fullboot" throughout the rest of this paper.

We sprinkled calls to the kernel library function microtime() at major steps along the kernel's execution path and collected the results.



From Figure 2 and Figure 3, we can see that the SCSI device probing and attachment is indeed the primary bottleneck.

Though our measurement stops right before "init" is executed, it is interesting to note that the bootscript organization of a BSD operating system makes the boot process inherently more efficient than the System V style organization employed by OSes such as Linux and Solaris. In a System V environment, the operating system must fork a shell for each service that it intends to start, incurring additional overhead. BSD, on the other hand, carries out all of its operations in two scripts, "rc" and "rc.local".

Enhancements

After having identified the bottleneck, we are now ready to find ways to speed it up.

The first approach that we considered was to take a snapshot of the entire memory image once the machine is fully booted. With this file in hand, we could simply reload the image on the next boot. Although this technique sounds elegant and promising, it unfortunately has many problems. One is that consistency cannot be taken for granted: memory allocation is a very delicate issue, and if even one pointer in kernel space were misplaced, the result could be disastrous. In addition, internal variables such as clock counters and random seeds would be rendered completely worthless; relying on them would have cascading effects that cannot be accurately predicted or guarded against. Furthermore, a snapshot of the entire memory image would be quite large, making it infeasible to reload. As a result, we did not embark down this path.

We then went back to the basics and considered some general principles of optimization. The most common trick to speed up an operation is to replace unpredictable dynamic behavior with a static procedure, provided that the faster technique yields the same correct results. Many applications gain performance this way. For example, compare an array with a linked list: while a linked list gives you flexibility in making run-time decisions, an array performs much better in terms of speed due to its static nature. If performance is the criterion, an array is often preferred over a linked list. C and Java provide another instance of this comparison: statically compiled programs often run an order of magnitude faster than dynamically interpreted code. With this principle in mind, we returned to our enhancement workshop.

We carefully examined the kernel code that probes and attaches SCSI devices at boot time, hoping to discover areas we could exploit with our static enhancement approach. We installed a recording mechanism to capture the information obtained via expensive dynamic probing. By storing these values statically, we can reproduce the same results by replaying the data on future boots, thus generating a speed boost.

Conceptually this is simple; however, one interesting problem surfaced during the implementation: where can we store the acquired data? Since the disk has not yet been fully attached and configured (we are still in the booting stage), there is no easy way to write those values out to disk reliably. There are some possible solutions. One would be to write directly to the raw disk device, a daunting task that raises many eyebrows. We also entertained the suggestion of storing our data in the non-volatile PRAM. In the end, we favored a more elegant solution: we stored the data in a set of preallocated kernel buffers. Once the machine is fully booted, the buffers are retrieved from userland using libkvm. A suite of programs was written to generate C header files containing these static assignments. The rest is elementary: the header files are compiled into the kernel, and we get the speedup on the next boot.

As we mentioned earlier, one must recompile the kernel in order to take advantage of our speedboot optimizations. In a normal application, one could often avoid modifying the executable by simulating a static replay with kernel or system level services. Our situation is quite different here because we are the kernel. Booting is a lonely journey, and there is little to rely on. The one sensible place to store values that the kernel can use during a boot is inside the kernel itself.

Below are the three simple steps to integrate our optimizations into a running system (more detailed installation instructions can be found in the README of the accompanying source code archive).

1. Install the recording mechanism in the kernel and recompile.

2. Install the modified first and second stage booters to enable the new boot flag (this way the optimization can be toggled directly at boot time), and reboot with the new kernel.

3. Recompile the kernel to incorporate the static structures generated by the enhancement programs.

On the next reboot, the user would be able to take advantage of our wondrous enhancement and live happily ever after.

Boot Flag

A natural question follows: what happens when a device is added to or removed from the SCSI chain? Obviously, replaying the stored values during booting would yield incorrect system behavior. Our solution was to implement a new boot flag, "-f", which lets the user toggle the enhancement at each boot. To use the speedboot, one types "boot -f"; conversely, to boot "slowly" but perform all dynamic configuration, one continues to use the normal "boot" command. This addition not only gives the user flexibility (consider a mobile user on the road who often changes devices), but also greatly facilitated our tests and measurements.

Implementing a boot flag, as it turned out, was a little more involved than simply modifying the kernel's argc and argv parsing code. In a multi-stage boot process, the boot flags are captured and forwarded along through the three stages. As a result, in addition to modifying the kernel's argument parsing, we also changed the source of "bootxx" and "ofwboot" to ensure that the flag gets passed along appropriately. It is important to note that one must explicitly run "installboot" for changes to "bootxx" and "ofwboot" to take effect.

Now, with everything in place, we describe the two enhancements that we implemented.

Level 1 Enhancement:

Our first enhancement stemmed from our observation of the SCSI device detection mechanism. We noticed that during a SCSI probing run, the controller actually checks every possible LUN (Logical Unit) on every valid target (SCSI ID), attempting to find a usable device. Since there are seven targets on our SCSI bus and eight LUNs per target, this inquiry process can occur up to 56 times during a boot. In our setup, it occurred 24 times because we had only three SCSI devices.

Naturally, our first attempt was to short-circuit this nested detection loop to reduce the overall number of SCSI probes. Since we knew beforehand where the devices were, we could force the controller to inquire only of the existing devices. This simple change reduced the number of low-level inquiry calls from 24 down to 3. As can be seen from Figure 4, the nesting level of the function calls is quite high; by reducing the number of times the computer iterates through this pipeline, we achieved a large improvement in speed.

Level 2 Enhancement:

Our second enhancement was based on the fact that actual hardware inquiries are expensive: we sampled the duration of each inquiry, and in each case it took about 450-500 microseconds for a valid device. If the underlying hardware stays unchanged, the responses remain constant. With this in mind, we decided to trap all SCSI device discovery calls. We saved the information returned by each inquiry call and then simulated the inquiry with the pre-stored values via a memcpy (which takes only 3 microseconds by our measurement). With this modification, we can boot without making a single SCSI device inquiry call whatsoever! The following graph illustrates the performance gained from this improvement.

The next improvement that naturally flows from this train of thought is to prevent dynamic device driver attachment and attach the drivers manually ourselves. After careful analysis, we deemed a speedup from this enhancement unlikely, due to the complexity of the implementation. To perform such manual attachment, we must save a very large collection of data structures, many of them list-based and containing pointers to other structures. What is required is the ability to recursively dump all relevant C structures to an arbitrary depth. No known tool performs this operation gracefully; we would have to write specialized functions, each of which must know the precise definition of the structure it handles. The overhead incurred by this complexity could offset the performance gain. Another reason we avoided this approach is scalability: a single change to one of these structure definitions in a later revision of the operating system could quickly render the functionality obsolete, requiring constant maintenance.

Conclusions

Bootstrapping is a deterministic process in which each step must be taken carefully, correctly, and efficiently. During this research project, we gained an intricate knowledge of its operation. We confirmed our opinion that BSD is a mature operating system, and we acquired a thorough understanding of the inner workings of its SCSI subsystem. The performance improvement from our enhancements was useful and perceptible, even on a mature operating system. We feel that we have made considerable progress in achieving our goals, making the boot process faster and less painful.

Acknowledgements

We thank Prof. Jason Nieh for his patient guidance. His suggestions and support have been extremely valuable to the success of this project. We would also like to thank Bill Studenmund and Bob Nester on the NetBSD port-macppc mailing list for answering many of our questions, and Tim Sweeney (Epic Games), Prof. Chris Okasaki, and Hua Zhong for their insight on recursive data serialization.

Bibliography

1. "The Design and Implementation of the 4.4BSD Operating System", Marshall Kirk McKusick, Keith Bostic, Michael J. Karels, John S. Quarterman, Addison-Wesley, 1996

2. "Computer Architecture: a Quantitative Approach (2nd Edition)", John L. Hennessy and David A. Patterson, Morgan Kaufmann, 1996

3. "Computer Organization and Design: the Hardware/Software Interface, (2nd Edition)", David A. Patterson and John L. Hennessy, Morgan Kaufmann, 1998

4. "Advanced Programming in the Unix Environment", W. Richard Stevens, Addison Wesley, 1993

5. "C Programming Language, Second Edition", Brian Kernighan, Dennis Ritchie, Prentice-Hall, 1989

6. "The Design and Implementation of a Log-Structured File System," Rosenblum, M. and Ousterhout, J., ACM Transactions on Computer Systems, Vol. 10, No. 1, February 1992, pp. 26-52.

7. "The Flux OSKit: A Substrate for Kernel and Language Research," Ford, B., Back, G., Benson, G., Lepreau, J., Lin, A., Shivers, O., Proceedings of the Sixteenth ACM Symposium on Operating System Principles, St. Malo, France, October 1997, pp. 38-51.

8. "Fundamentals of Open Firmware, Part I: The User Interface", Apple Computer Technotes, July, 1996, http://developer.apple.com/technotes/tn/tn1061.html

9. "Fundamentals of Open Firmware, Part II: The Device Tree", Apple Computer Technotes, September, 1996, http://developer.apple.com/technotes/tn/tn1062.html

10. "Fundamentals of Open Firmware, Part III: PCI Expansion ROM Contents for Mac OS 8 ", Apple Computer Technotes, May, 1996, http://developer.apple.com/technotes/tn/tn1044.html

11. NetBSD port-macppc mailing list archive, 1998 – 2000, http://mail-index.netbsd.org/port-macppc/


[1] Actually, there are a number of different ways to bootstrap NetBSD/macppc. We have chosen to describe this particular procedure because it best resembles traditional UNIX bootstrapping.



Author maintains all copyrights on this article.
Images and layout Copyright © 1998-2000 Dæmon News. All Rights Reserved.