This paper was prepared by Ilya Gertner, Ph.D., February, 2001
TEL: (508) 872-4586 FAX: (508) 872-2414
Evolution of Intelligent Storage: Assembly of Standard Components
Network Disk™ and
Copyright© 1998, Network Disk, Inc.
Intelligent storage subsystems have been the preserve of a few high-end storage vendors selling to large data centers. This situation is ready for a change. With the advent of Intel-based multiprocessors and Gigabit networks, the increased reliability of Windows NT, and the surge of Linux, one can now build a high-performance, highly reliable intelligent storage subsystem out of standard components. This paper describes Network Disk, a software product that enables such storage systems and that runs on off-the-shelf PCs with off-the-shelf operating systems such as Windows NT or Linux. Network Disk core technologies include integrated cache, snapshot copy, block-level incremental backup, remote command invocation, and remote mirroring.
The primary audiences for this paper are vendors that would be interested in integrating Network Disk technologies with their systems. These include:
A Network Disk demo is available in several versions:
Intelligent storage subsystems have been the exclusive domain of high-end storage vendors who are selling to large data centers. These systems have typically been big - both in terms of size and cost - and have been based on proprietary hardware and embedded operating systems. Such was the price one had to pay, however, to achieve the performance and reliability required of enterprise-wide storage systems.
This situation is now about to change. As with the initial computer revolution (from mainframes and minicomputers, to workstations, and ultimately to PCs), the mechanism for change will be the move to standard platforms, both hardware and software. That process has delivered increasingly powerful computer processors (led by Intel), steadily more reliable and higher-performance PC operating systems (led by Microsoft), and ever faster networks (not dominated by any one player, but continually advanced by dozens of major companies and hundreds of smaller ones). Indeed, today one can build entire computer systems from standard components, and these systems are significantly cheaper and easier to maintain than the proprietary, monolithic systems that preceded them.
One of the most significant recent developments in the computer revolution has been the emergence of Gigabit networks whose speeds approach those of internal computer buses. Historically, the communication speeds within a computer, i.e. intra-computer communication, have been an order of magnitude faster than the speed of tying computers together in a network, i.e. inter-computer communication. The Gigabit network, however, has flipped this traditional bandwidth hierarchy: inter-computer bandwidth rates now exceed backplane rates!
Where do the intelligent storage subsystems vendors stand on the path of the computer revolution? With the move toward standard platforms, a number of these vendors have begun to use standard components in their products. However, these have typically been restricted to hardware components such as processors and disks. By and large, the storage subsystems still rely on highly proprietary software. This software is frequently referred to as the microcode of the I/O subsystem, and is used to provide the performance and reliability required in such storage systems.
The availability of Gigabit networks marks the beginning of a radical new phase in computer architectures. Computer systems will now begin to be organized as networks of computers. Importantly, peripherals will also start to be organized as networks of peripheral device servers, with each server controlling a cluster of peripherals. If the PC revolution changed the way enterprises bought and sold computing power in terms of MIPS, a similar revolution is now brewing within PC peripherals.
The impact of the PC peripheral revolution stands to be even more profound than the PC MIPS revolution in terms of flexibility offered to customers. For example, for an IS manager today, buying a tape backup system from a mainframe, minicomputer or workstation vendor is a fairly complex exercise that involves calling a specialized salesperson, who in turn will likely do a needs analysis, and so on. Imagine being able to walk into COMPUSA and pick the enterprise-ready peripheral of your choice!
This statement may sound too simplistic as well as too good to be true. In fact, it would be very difficult for an IS manager to find a high capacity, high-performance, and highly reliable system readily available on a shelf in COMPUSA. However, if one were to assemble together a number of components - such as PC’s, storage disks, tape drives, and network cards - then in aggregate this off-the-shelf equipment could have more processing power than any existing monolithic and proprietary product. There is a catch, however: the problem remains of writing software to make it all appear as a single server for an enterprise.
This last statement is likely to arouse suspicion in any IS manager’s mind, who remembers all too well how the famed PC revolution of the early nineties failed to replace mainframes in IT centers. Could storage subsystems suffer the same fate, in spite of advances in Pentium hardware, the abundance of cheap memory, and fast network connections? Would the lack of efficient software result in the failure to replace large computers with networks of PCs?
Herein lies the real promise of the peripheral revolution: unlike the PC revolution, it does not require a change in the computing paradigm. Data centers will continue to be data-centric operations and continue to run the same set of applications. However, these data centers will now be equipped with up-to-date peripheral devices organized as networks of peripheral device servers, provided the necessary software is available to make it all work as a unified system.
Network Disk, Inc. provides such software, which turns the dream of standards-based storage subsystems into reality. In the next several pages, you will find a description of the software, how it works, and the benefits it offers.
Network Disk
Network Disk is software that enables the use of off-the-shelf, Intel-based, multiprocessor PC hardware and off-the-shelf, Windows NT-based software for enterprise-level storage subsystems. The advantages of using standard hardware and software are great. The prices for both standard hardware and software are significantly lower than those of proprietary solutions, while offering the user greater flexibility. In addition, on the hardware side, the user benefits from performance enhancements to computer processors (which typically occur several times a year) and from other emerging technologies, such as Windows NT clustering, Gigabit Ethernet, and Fibre Channel. On the software side, the benefits are even greater, because the user has access to a plethora of software products that include storage administration tools, backup utilities, diagnostic and troubleshooting tools, performance monitoring tools, etc. Together, the use of standard hardware and software components gives users the ability to design their own storage systems using off-the-shelf products.
Network Disk, Inc. software can be loosely described as a collection of Windows NT services, libraries and slightly modified device drivers that implement the functions of an intelligent storage subsystem. In order to describe the system, it is useful to look at how traditional high-end storage subsystems work.
There are three major types of storage subsystems: direct-attached systems, network-attached systems, and storage subsystems supplied by the major computer system vendors.
Direct-Attached Systems:
Direct-attached systems use the SCSI protocol, the de facto standard for accessing disks from open system architectures. Data is accessed directly on the "raw disk" itself, which makes these solutions file system-transparent. Direct-attached systems are limited, however, by the number of SCSI cables supported by the device in question, since each server requires a point-to-point SCSI connection. SCSI cables are also restricted to 20 feet in length, although Fibre Channel technology will overcome this limitation. However, the greatest shortcoming of direct-attached solutions is that they have all been implemented to date on proprietary embedded software systems. Major storage subsystem vendors that fall into the direct-attached category include IBM, StorageTek, EMC and Data General (Clariion).
Network-Attached Systems:
All network-attached systems rely on a network file system of some sort. (Common network file systems include NFS and CIFS.) These solutions do not suffer from the limitations introduced by the use of SCSI cables. The number of application servers connected and their physical distance is limited only by the network topology. Hence, there is greater flexibility in the number of application servers that can share storage. However, while all open systems support SCSI, not all open systems support every network file system. Therefore, the usefulness of these systems as cross-platform solutions for heterogeneous enterprises is limited. In addition, because the file system is bundled with the storage system, the customer is locked in to a particular file system architecture. Another concern with network-attached solutions is that network protocols do not support the high-bandwidth demands of servers running On-line Transaction Processing (OLTP). This is evidenced by a lack of TPC benchmark results using any network file system. Major storage subsystem vendors that fall into the network-attached category include Network Appliance, Novell, and Vinca.
Computer System Vendors:
Several of the major computer system vendors also sell their own proprietary external storage products. Because these computer system vendors focus on selling storage as a back end to their own specific server systems, they tend to have a relatively narrow scope in the market of enterprises comprised of heterogeneous servers. Major computer system vendors that supply their own storage products include IBM, Compaq, and Sun Microsystems.
Network Disk:
Network Disk is a direct-attached solution. However, it has been implemented on standard platforms comprised of Intel-based PCs running the Windows NT operating system. This allows users greater flexibility in configuring the storage system to their needs, without sacrificing performance. Network Disk thus combines the best characteristics of direct-attached and network-attached solutions.
Network Disk sits in the middle of the data center of an enterprise with heterogeneous systems, as depicted in the above diagram. Network Disk PCs are stripped of peripherals such as monitors, keyboards, and mice, in order to cut costs. They are directly attached to application servers via the SCSI protocol. In addition, network-attached application servers are also supported via Fast LAN. An inexpensive LAN is used to connect user terminals and desktop PCs to the application servers.
Network Disk’s core software modules include Integrated cache, Remote mirroring, Snapshot copy, Incremental backup, and Remote command invocation.
Integrated cache is the centerpiece of the architecture. It uniformly handles incoming requests over the SCSI I/O channel and network connections.
Remote mirroring can replicate individual files, separate partitions or whole disks. This fine level of replication configurability allows users to minimize network traffic.
Snapshot copy creates a "frozen-in-time" image of data that can be used for data warehousing or data mining without affecting on-going OLTP applications that work against real data.
Incremental backup creates a log of data blocks that have been modified. In conjunction with the previously stored full backup, this log can be played back to create the next full backup.
Remote command invocation lets an application server use a shell script to issue commands directly to the storage server.
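The incremental backup idea described above (a log of modified blocks that is played back over the previous full backup) can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class name BlockDevice, the function play_back, and the 512-byte block size are hypothetical and are not taken from the Network Disk software.

```python
BLOCK_SIZE = 512  # assumed block size in bytes (hypothetical)


class BlockDevice:
    """In-memory volume that remembers which blocks changed since the last backup."""

    def __init__(self, num_blocks):
        self.blocks = [bytes(BLOCK_SIZE)] * num_blocks
        self.modified = {}  # block number -> latest data written since last backup

    def write(self, block_no, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("writes must be exactly one block")
        self.blocks[block_no] = data
        self.modified[block_no] = data  # record the change in the log

    def full_backup(self):
        """Copy every block and start a fresh modification log."""
        self.modified = {}
        return list(self.blocks)

    def incremental_backup(self):
        """Return only the blocks modified since the previous backup."""
        log, self.modified = self.modified, {}
        return log


def play_back(full, log):
    """Apply an incremental log to a full backup to produce the next full backup."""
    merged = list(full)
    for block_no, data in log.items():
        merged[block_no] = data
    return merged
```

In this model, the incremental log contains one entry per modified block, so its size depends on how much data changed rather than on the size of the volume.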
The implementation view of Network Disk shown below depicts three layers: (1) front-end SCSI controller, (2) cache manager and (3) back-end SCSI controller. All layers use a configuration file that defines the mapping of the low-end SCSI IDs as seen by the application servers and other logical devices, and file names as seen by the user employing administration tools that run on Windows NT. This configuration file is also used to define special options such as snapshot copy and remote mirroring.
Incoming I/O Requests
          ↓
Front-End SCSI Controller
          |
Cache Manager
          |
Back-End SCSI Controller
          ↓
Disks or RAID Storage Array
The configuration file defines the mapping between the virtual devices as seen by the user and managed by the front-end to the actual physical devices as seen by Windows NT.
Network Disk uses standard Windows NT Administration tools along with Network Disk extensions that control access to the physical devices. The configuration file is very flexible. For example, a raw disk partition on the front-end may map to a physical disk, a partition, or even an NT file on the back-end. In addition to device name mapping, the configuration file controls advanced features such as remote mirroring, concurrent backup and on-line data sharing. Below is an example of a configuration file that defines local resource mapping:
volume PHYSICALDRIVE1 disk RW 0
volume D: partition RW 0
volume E: partition RO 0
#volume L: partition RO 0
volume c:\temp\disk_0 file RO 0
#volume c:\temp\disk_1 file RW 0
The entries may represent an entire disk, a volume, or just a file. In this example, "PHYSICALDRIVE1" represents the disk, "volume D:" represents a volume, and "c:\temp\disk_0" maps to a file named "disk_0" on drive C. Lines that begin with "#" are commented out.
The configuration file is also used to define advanced, network-based mappings such as those listed in the example below:
rem_server 129.91.2.8 256 C:\temp\disk_0 RW
rem_mirror: 129.91.2.8 256 C:\temp\disk_0 RW C:\temp\disk_0
In this example, the file named "C:\temp\disk_0" is accessed as a disk on the remote server with the IP address of 129.91.2.8, using UDP port number 256. The line following defines remote mirroring between the file disk_0 on the local host and file disk_0 on the remote host.
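To make the structure of these entries concrete, the following sketch parses the sample lines shown in this section. It is an illustration only, not the Network Disk parser. The field names are assumptions: the meaning of the trailing "0" on volume lines is not stated in this paper, and treating the two paths on the rem_mirror line as local first and remote second is a guess. The sketch also assumes paths contain no spaces.

```python
def parse_config(text):
    """Parse whitespace-separated configuration lines into dictionaries.

    Field names are hypothetical; only the line layout comes from the examples.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out entries
        fields = line.split()
        keyword = fields[0].rstrip(":").lower()  # "rem_mirror:" carries a colon
        if keyword == "volume" and len(fields) == 5:
            entries.append({"kind": "volume", "name": fields[1],
                            "type": fields[2], "access": fields[3],
                            "flag": fields[4]})  # meaning of the last field is not documented
        elif keyword == "rem_server" and len(fields) == 5:
            entries.append({"kind": "rem_server", "host": fields[1],
                            "port": int(fields[2]), "path": fields[3],
                            "access": fields[4]})
        elif keyword == "rem_mirror" and len(fields) == 6:
            entries.append({"kind": "rem_mirror", "host": fields[1],
                            "port": int(fields[2]),
                            "local_path": fields[3],   # assumed order
                            "access": fields[4],
                            "remote_path": fields[5]})  # assumed order
        else:
            raise ValueError("unrecognized configuration line: " + line)
    return entries


SAMPLE = r"""
volume PHYSICALDRIVE1 disk RW 0
volume D: partition RW 0
#volume L: partition RO 0
volume c:\temp\disk_0 file RO 0
rem_server 129.91.2.8 256 C:\temp\disk_0 RW
rem_mirror: 129.91.2.8 256 C:\temp\disk_0 RW C:\temp\disk_0
"""

for entry in parse_config(SAMPLE):
    print(entry["kind"], entry)
```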
Extensive logging capabilities can also be enabled in the configuration file.
Network Disk’s Remote Mirroring application is a two-node networked system that provides remote file replication and disk mirroring services for two application servers. Each application server uses a standard SCSI connection to the Remote Mirroring node. The application server "sees" Network Disk’s Remote Mirroring node as a disk (in fact, it is configured as a SCSI device) and remains otherwise unaware of the on-going replication. The Network Disk Remote Mirroring application can be thought of as a very intelligent controller that offloads the mirroring function from the application servers’ databases and operating systems to an I/O channel.
A Network Disk Remote Mirroring application is shown below.
Network Disk’s software is a fairly complex program (40,000 lines of Visual C++) that is installed as a Windows NT service and a library. A stand-alone demo is available on a laptop or as a free download. Demonstration on live hardware requires a cluster of 2 PCs connected via SCSI cable and Fast Ethernet. Demonstration of Remote Mirroring requires 4 PCs. These demonstrations show core technologies, including:
The demonstrations also show the diagnostic tools that are available with the system.
Future work for the Network Disk product line is being planned in several directions:
This paper describes the core technologies and applications of Network Disk, Inc., a new class of intelligent storage subsystems built entirely on standard hardware and software platforms. The architecture is based on off-the-shelf PCs running Windows NT. Storage industry vendors and integrators are invited to contact Network Disk, Inc. regarding custom projects. Application software vendors that support databases and data warehousing applications are also welcome to contact Network Disk, Inc. regarding future products that are tailored to large data centers. Finally, data centers with large legacy databases should contact us directly for Web-to-legacy system solutions provided by Network Disk, Inc.
Network Disk, Inc. was founded by Dr. Ilya Gertner in 1998 to develop portable, scalable, and reliable storage system software for enterprise systems at a cost far less than existing products. In spite of the on-going PC revolution in enterprise computing, Dr. Gertner saw a staggering gap between enterprise storage and off-the-shelf PC storage prices. While a number of companies have made an effort to close this gap by bundling off-the-shelf storage components (e.g. Seagate disks) into large arrays of disks, they still depend on highly proprietary software systems (frequently referred to as the microcode of the I/O subsystem). The reason for such proprietary software is the need for performance and reliability, which has been difficult to achieve using general purpose operating systems.
Network Disk, Inc. develops open, Windows NT-based intelligent storage subsystems that will force a change in the storage industry and have a serious impact on the Internet server industry. In comparison to existing vendors, Network Disk, Inc. makes two leaps forward: it uses off-the-shelf, Intel-based hardware and off-the-shelf, Windows NT-based software. Advances in Intel processors (which happen several times a year) assure performance leadership in Network Disk, Inc. architectures. Similarly, reliance on the standard Windows NT operating system assures availability of a vast variety of devices and drivers available in the PC marketplace. Furthermore, availability of third-party storage management software provides a competitive edge to Network Disk, Inc.
The company founder is a specialist in high-availability storage, clustering and parallel processing. He was the chief architect of Encore’s Storage Product (SP) Division which was sold to Sun Microsystems in 1997 for $185M. He was also general manager of Encore’s remote development site that was the focal point for many innovative products, including Remote Dual Copy (RDC), Backup-while-Open (BwO) and DataShare®. He has accumulated a wealth of experience while working for Encore Computer, Digital Equipment Corporation, Prime Computer and as an independent consultant. He has published extensively and presented papers at both industry tradeshows and computer architecture conferences. He holds a Ph.D. in Computer Science from the University of Rochester and a B.Sc. from Technion in Israel.
The company’s Principal Engineer, Dr. Peter Walker, is a specialist in high-performance SCSI target emulation, bitmap-based optimization of concurrent copy, and remote mirroring. He was a Senior Engineer at Encore Computer Corporation’s Storage Product (SP) Division, where he was the lead developer of high-performance software on Gigabit Network and storage subsystems on Unix. He holds a Ph.D. in Computer Science from Brown University.
The company’s Marketing Consultant, James Aucoin, is a specialist in electronic pre-press publishing, electronic imaging, video storage and data warehousing.
Ilya Gertner. "True Data Sharing: A Simple Solution to a Complex Storage Challenge". Server I/O '97 Conference, High Performance I/O Architectures. January 27-30, 1997.
Ilya Gertner. "Encore Storage Processor Overview". SHARE-87 Session 3031. New Orleans, LA. August, 1996.
Ilya Gertner, et al. "Disaster Recovery Using an Open UNIX Multiprocessor that Supports Remote Dual Copy and Concurrent Copy". SHARE-87 Session 3018. March, 1996.
Ilya Gertner, Stephen Mckellar, and Mark Aldred. "A Distributed Lock Manager on Fault Tolerant MPP". Computer Architecture Track, 28th HICSS Conference. January, 1995.
Ilya Gertner and Ike Nassi. "Symmetric Parallel Processing". Aerospace Software Engineering: A Collection of Concepts. American Institute of Aeronautics and Astronautics, Inc., Volume 136, pp. 505-521. 1991.
Ilya Gertner, Ziya Aral and Greg Schaffer. "Efficient Debugging Primitives for Multiprocessors". ASPLOS-III Conference. Boston, MA. May, 1989.
Ilya Gertner, Ziya Aral and Alan Langerman. "Variable Weight Processes with Flexible Shared Resources". USENIX 1989 Winter Conference. San Diego, CA. February, 1989. Also implemented in UNIX BSD 4.4.
Ilya Gertner and Ziya Aral. "Parasight: A High Level Debugger/Profiler Architecture for Shared-Memory Multiprocessors". 1988 ACM International Conference on Supercomputing. Saint Malo, France. July 4-7, 1988.
Peter Walker and S. Ghosh. "Asynchronous Distributed Event Driven Simulation for Execution of VHDL on Parallel Processors". 32nd Design Automation Conference. San Francisco, CA. June 12-16, 1995.
Peter Walker and S. Ghosh. "Asynchronous, Distributed Event Driven Simulation Algorithm with Inconsistent Event Preemption for Accurate Execution of VHDL Descriptions on Parallel Processors". High Performance Computing Symposium 95. Phoenix, Arizona. April 9-13, 1995.
Integrated cache serves as an intermediate depository of recently read and written data. It is optimized to always satisfy write requests and maintain a least recently used (LRU) list of read buffers. Since the behavior of cache has a very strong impact on the performance of applications, Integrated Cache provides a variety of tools to configure applications and measure their performance. It also includes low-level tools for troubleshooting and diagnosing faulty SCSI connections. The screen shot below shows some of the functionality of the Integrated Cache application.

In the upper right-hand side of the screen is a "View LOGFILE" button which, when executed, displays information in the box below it. This information includes detailed diagnostic and warning messages. Categories of specific information to be displayed are chosen from the "Trace Level" and "STE Trace Level" option menus in the upper left-hand side of the screen. In the example shown above, all options have been selected (checked) in both option menus.
The lower left-hand corner of the screen displays cache statistics. The "Process priority class" option at the top of the screen controls the priority of the cache manager. Normally, this is set as "High" or "Realtime" to provide the best performance for the storage application. However, the priority can be lowered to make it possible for other storage applications (such as backups or remote mirroring) to run concurrently, even on a single processor computer. In the example above, the priority has been reduced to "Normal".
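The caching policy described at the start of this section (writes are always accepted, and read buffers are kept on a least recently used list) can be sketched as follows. This is a simplified model under stated assumptions, not the Integrated Cache implementation: the class name IntegratedCacheSketch, the dictionary standing in for the back-end disk, and the hit and miss counters are all hypothetical.

```python
from collections import OrderedDict


class IntegratedCacheSketch:
    """Write-accepting cache with an LRU list of read buffers (illustrative only)."""

    def __init__(self, backing, read_capacity):
        self.backing = backing              # dict: block number -> bytes (stand-in for disk)
        self.read_capacity = read_capacity  # maximum number of read buffers
        self.read_buffers = OrderedDict()   # ordered from least to most recently used
        self.dirty = {}                     # accepted writes not yet flushed to disk
        self.hits = 0
        self.misses = 0

    def write(self, block_no, data):
        """Always satisfy the write from cache; the disk is updated on flush()."""
        self.dirty[block_no] = data
        self.read_buffers.pop(block_no, None)  # discard any stale read copy

    def read(self, block_no):
        if block_no in self.dirty:
            self.hits += 1
            return self.dirty[block_no]
        if block_no in self.read_buffers:
            self.hits += 1
            self.read_buffers.move_to_end(block_no)  # mark as most recently used
            return self.read_buffers[block_no]
        self.misses += 1
        data = self.backing[block_no]
        self.read_buffers[block_no] = data
        if len(self.read_buffers) > self.read_capacity:
            self.read_buffers.popitem(last=False)    # evict the least recently used buffer
        return data

    def flush(self):
        """Write all accepted data back to the backing store."""
        self.backing.update(self.dirty)
        self.dirty.clear()
```

Counters like hits and misses are the kind of figures a cache statistics display could report, although the statistics actually shown by the product are not listed in this paper.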
The "View/Tail SCSI Trace" option provides a low-level diagnostic facility for monitoring SCSI commands. A sample of the output of the execution of this option appears below.
______________________________________________
00000015: ..............................
Command: SCSI_WRITE10
[0]0x2A [1]0x00 [2]0x00 [3]0x00 [4]0x00 [5]0x38
[6]0xFFFFFFF0 [7]0x00 [8]0x04 [9]0x00
RW: 1 LUN: 0 sector: 56 blocks: 4
_______________________________________________
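The summary line in the trace above can be reproduced from the command bytes. In a SCSI WRITE(10) command descriptor block, byte 0 is the operation code 0x2A, bytes 2 to 5 hold the logical block address in big-endian order, and bytes 7 and 8 hold the transfer length in blocks. The sketch below decodes those fields. The function name decode_write10 is hypothetical, and reading the logged value 0xFFFFFFF0 for byte 6 as the single byte 0xF0 is an assumption about how the trace was printed.

```python
def decode_write10(cdb):
    """Decode the main fields of a 10-byte SCSI WRITE(10) command descriptor block."""
    if len(cdb) != 10:
        raise ValueError("WRITE(10) uses a 10-byte CDB")
    if cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) command")
    return {
        "opcode": cdb[0],
        "lun": cdb[1] >> 5,                          # SCSI-2 placed the LUN in bits 7-5 of byte 1
        "lba": int.from_bytes(cdb[2:6], "big"),      # starting sector
        "blocks": int.from_bytes(cdb[7:9], "big"),   # number of blocks to write
    }


# Bytes copied from the trace; byte 6 is taken as 0xF0 (assumption, see above).
cdb = bytes([0x2A, 0x00, 0x00, 0x00, 0x00, 0x38, 0xF0, 0x00, 0x04, 0x00])
fields = decode_write10(cdb)
print("LUN:", fields["lun"], "sector:", fields["lba"], "blocks:", fields["blocks"])
```

The decoded values (LUN 0, sector 56, 4 blocks) agree with the last line of the trace.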
Snapshot copy serves as a mechanism to "freeze" a copy of data at a particular moment in time, for the purposes of backup and data analysis, while the user is allowed to continue modifying the original data. Snapshot copy is an immediate, minimal overhead operation that makes a virtual copy of data. The screen shot below shows a Snapshot copy control window.

Some of the controls and buttons in the Snapshot copy utility are similar to the controls in the Integrated cache utility. An additional set of buttons allows for Snapshot control. The information box in the upper right-hand portion of the screen shows the status of devices. In this particular example, the device "disk_2" is currently "logging", while other devices are "inactive". The user executes a Snapshot by selecting one or more devices from the list of available devices in the lower right-hand side of the screen and clicking on "COMMIT".
If a Snapshot is in progress, the user can select the "View Snapshot Trace" option on the left side of the screen to log transactions. A sample of the output from the selection of this option appears below.
_____________________________________________
Logged: 8188 offset: 20969984
Logged: 8189 offset: 20970496
Logged: 8190 offset: 20971008
Logged: 8191 offset: 20971520
1 - Replayed: startblock: 8064 offset: 20905984 seq: 128
2 - Replayed: startblock: 7936 offset: 20840448 seq: 128
3 - Replayed: startblock: 7808 offset: 20774912 seq: 128
4 - Replayed: startblock: 7680 offset: 20709376 seq: 128
_____________________________________________
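One common way to obtain an immediate, low-overhead virtual copy of the kind described in this section is copy-on-write: taking the snapshot copies nothing, and the original contents of a block are preserved only when that block is first overwritten. The sketch below uses that technique as an assumption. This paper does not state which mechanism Network Disk uses, and the "Logged" and "Replayed" entries in the trace suggest a log-based design that may differ. The class name SnapshotVolume and its methods are hypothetical.

```python
class SnapshotVolume:
    """Volume with a single copy-on-write snapshot (illustrative only)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.preserved = None  # block number -> original data; None when no snapshot exists

    def commit_snapshot(self):
        """Freeze the current contents without copying any data."""
        self.preserved = {}

    def write(self, block_no, data):
        # Preserve the frozen contents the first time a block is overwritten.
        if self.preserved is not None and block_no not in self.preserved:
            self.preserved[block_no] = self.blocks[block_no]
        self.blocks[block_no] = data

    def read(self, block_no):
        """Read the live data that applications continue to modify."""
        return self.blocks[block_no]

    def read_snapshot(self, block_no):
        """Read the data as it was when the snapshot was committed."""
        if self.preserved is None:
            raise RuntimeError("no snapshot has been committed")
        return self.preserved.get(block_no, self.blocks[block_no])
```

In this model a backup or data analysis job would call read_snapshot while on-line applications keep calling write, and the two views stay independent.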
Remote mirroring creates and sends a network request to a secondary server for every write that it executes on the primary server. Because this data is buffered in cache, the Remote mirroring function is non-intrusive and imposes less than a 5% overhead on concurrent applications. The networking layer employed by Remote mirroring is efficient: in a stand-alone version it achieves a throughput of 80 Mb/sec on a 100 Mb/sec Ethernet link. The Remote mirroring utility also provides a detailed mirroring logging facility, an example of which is shown below.
______________________________________
# client traces
Creating Rdc devices associated with volume c:\temp\disk_2
Opening socket
Connecting...
Setting destination...
SND: seq:0 num:0 tot:1
New RDC device:: Local: c:\temp\disk_2 Remote: c:\temp\disk_2 Host:129.91.2.19
Volume c:\temp\disk_3 => _deviceSize: 4194304 Final offset:0 _blockSize: 512
SND: seq:1 num:0 tot:4
SND: seq:2 num:1 tot:4
SND: seq:3 num:2 tot:4
SND: seq:4 num:3 tot:4
SND: seq:5 num:0 tot:4
SND: seq:6 num:1 tot:4
SND: seq:7 num:2 tot:4
SND: seq:8 num:3 tot:4
# server traces
Starting poller and driver...
Setting poller sleep to 1000ms
Poller started...
Driver started...
Winsock Started...
Initializing async UDP on host serpent, port 2003
Allocated 300 buffers successfully
SRV RCV: async pkt ind:0, len:512
SRV RCV: async pkt ind:1, len:512
SRV RCV: async pkt ind:2, len:512
SRV RCV: async pkt ind:3, len:512
SRV RCV: async pkt ind:4, len:512
SRV RCV: async pkt ind:5, len:512
SRV RCV: async pkt ind:6, len:512
SRV RCV: async pkt ind:7, len:512
SRV RCV: async pkt ind:8, len:512
______________________________________
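The client trace above shows each write being sent as a group of packets labeled with a running sequence number (seq), a position within the group (num), and the group size (tot), and the server trace shows 512-byte packets arriving. The sketch below illustrates that numbering scheme on plain byte strings. It is not the Network Disk wire protocol: the header layout, the names packetize and reassemble, and the assumption that each packet carries one 512-byte payload are all hypothetical.

```python
import struct

# Hypothetical header: seq (32-bit), num (16-bit), tot (16-bit), network byte order.
HEADER = struct.Struct("!IHH")
PAYLOAD_SIZE = 512  # assumed payload per packet


def packetize(data, first_seq):
    """Split one write into numbered packets: seq runs on, num counts within the group."""
    chunks = [data[i:i + PAYLOAD_SIZE] for i in range(0, len(data), PAYLOAD_SIZE)]
    return [HEADER.pack(first_seq + num, num, len(chunks)) + chunk
            for num, chunk in enumerate(chunks)]


def reassemble(packets):
    """Rebuild the original write from its packets, in whatever order they arrived."""
    parts = {}
    total = None
    for packet in packets:
        _seq, num, tot = HEADER.unpack_from(packet)
        parts[num] = packet[HEADER.size:]
        total = tot
    if total is None or len(parts) != total:
        raise ValueError("write is incomplete; some packets are missing")
    return b"".join(parts[num] for num in range(total))


write = bytes(range(256)) * 8  # a 2048-byte write, i.e. four 512-byte packets
for packet in packetize(write, first_seq=1):
    seq, num, tot = HEADER.unpack_from(packet)
    print("SND: seq:%d num:%d tot:%d" % (seq, num, tot))
```

Carrying num and tot in every packet lets the receiver detect an incomplete group, which matters on a datagram transport such as UDP where packets can be lost or reordered.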