non-volatile storage (NVS)
Posted by Thang Le Toan on 16 August 2018 05:26 AM

Non-volatile storage (NVS) is a broad collection of technologies and devices that do not require a continuous power supply to retain data or program code persistently on a short- or long-term basis.


Three common examples of NVS devices that persistently store data are tape, a hard disk drive (HDD) and a solid-state drive (SSD). The term non-volatile storage also applies to the semiconductor chips that store the data or controller program code within devices such as SSDs, HDDs, tape drives and memory modules.

Many types of non-volatile memory chips are in use today. For instance, NAND flash memory chips commonly store data in SSDs in enterprise and personal computer systems, USB sticks, and memory cards in consumer devices such as mobile telephones and digital cameras. NOR flash memory chips commonly store controller code in storage drives and personal electronic devices.

Non-volatile storage technologies and devices vary widely in the manner and speed in which they transfer data to and retrieve data or program code from a chip or device. Other differentiating factors that have a significant impact on the type of non-volatile storage a system manufacturer or user chooses include cost, capacity, endurance and latency.

For example, an SSD equipped with NAND flash memory chips can program, or write, and read data faster and at lower latency through electrical mechanisms than a mechanically addressed HDD or tape drive that uses a head to write and read data to magnetic storage media. However, the per-bit price to store data in a flash-based SSD is generally higher than the per-bit cost of an HDD or tape drive, and flash SSDs can sustain a limited number of write cycles before they wear out.

Volatile vs. non-volatile storage devices

The key difference between volatile and non-volatile storage devices is whether or not they are able to retain data in the absence of a power supply. Volatile storage devices lose data when power is interrupted or turned off. By contrast, non-volatile devices are able to keep data regardless of the status of the power source.

Common types of volatile storage include static random access memory (SRAM) and dynamic random access memory (DRAM). Manufacturers may add battery power to a volatile memory device to enable it to persistently store data or controller code.

Enterprise and consumer computing systems often use a mix of volatile and non-volatile memory technologies, and each memory type has advantages and disadvantages. For instance, SRAM is faster than DRAM and well suited to high-speed caching. DRAM is less expensive to produce and requires less power than SRAM, and manufacturers often use it to store program code that a computer needs to operate.

Comparison of non-volatile memory types

By contrast, non-volatile NAND flash is slower than SRAM and DRAM, but it is cheaper to produce. Manufacturers commonly use NAND flash memory to store data persistently in business systems and consumer devices. Storage devices such as flash-based SSDs access data at a block level, whereas SRAM and DRAM support random data access at a byte level.

Like NAND, NOR flash is less expensive to produce than volatile SRAM and DRAM. NOR flash costs more than NAND flash, but it can read data faster than NAND, making it a common choice to boot consumer and embedded devices and to store controller code in SSDs, HDDs and tape drives. NOR flash is generally not used for long-term data storage due to its poor endurance.

Trends and future directions

Manufacturers are working on additional types of non-volatile storage to try to lower the per-bit cost to store data and program code, improve performance, increase endurance levels and reduce power consumption.

For instance, manufacturers developed 3D NAND flash technology in response to physical scaling limitations of two-dimensional, or planar, NAND flash. They are able to reach higher densities at a lower cost per bit by vertically stacking memory cells with 3D NAND technology than they can by using a single layer of memory cells with planar NAND.

NVM use cases

Emerging 3D XPoint technology, co-developed by Intel Corp. and Micron Technology Inc., offers higher throughput, lower latency, greater density and improved endurance over more commonly used NAND flash technology. Intel ships 3D XPoint technology under the brand name Optane in SSDs and in persistent memory modules intended for data center use. Persistent memory modules are also known as storage class memory.

An image of a 3D XPoint technology die (source: Micron Technology Inc.).

Using non-volatile memory express (NVMe) technology over a computer's PCI Express (PCIe) bus in conjunction with flash storage and newer options such as 3D XPoint can further accelerate performance and reduce latency and power consumption. NVMe offers a more streamlined command set to process input/output (I/O) requests with PCIe-based SSDs than the Small Computer System Interface (SCSI) command set does with Serial-Attached SCSI (SAS) drives or the Advanced Technology Attachment (ATA) command set does with Serial ATA (SATA) drives.

Everspin's EMD3D064M 64 Mb DDR3 ST-MRAM in a Ball Grid Array package (source: Everspin Technologies Inc.).

Emerging non-volatile storage technologies currently in development or in limited use include ferroelectric RAM (FRAM or FeRAM), magnetoresistive RAM (MRAM), phase-change memory (PCM), resistive RAM (RRAM or ReRAM) and spin-transfer torque magnetoresistive RAM (STT-MRAM or STT-RAM).


SSD write cycle
Posted by Thang Le Toan on 16 August 2018 05:24 AM

An SSD write cycle is the process of programming data to a NAND flash memory chip in a solid-state storage device.

A block of data stored on a flash memory chip must be electrically erased before new data can be written, or programmed, to the solid-state drive (SSD). The SSD write cycle is also known as the program/erase (P/E) cycle.

When an SSD is new, all of the blocks are erased, and new, incoming data is written directly to the flash media. Once the SSD has filled all of the free blocks on the flash storage media, it must erase previously programmed blocks to make room for new data to be written. Valid data in blocks that also contain invalid or unnecessary data is copied to different blocks, freeing the old blocks to be erased. The SSD controller periodically erases the invalidated blocks and returns them to the free block pool.

The background process an SSD uses to clean out the unnecessary blocks and make room for new data is called garbage collection. The garbage collection process is generally invisible to the user, and the programming process is often identified simply as a write cycle, rather than a write/erase or P/E cycle.
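The write/erase flow described above can be sketched as a toy simulation. The block and page counts here are tiny, made-up values, and the single-block garbage collection policy is a deliberate simplification of what real SSD controllers do:

```python
# Toy model of an SSD block pool: pages are written into free blocks,
# and garbage collection copies still-valid pages out of a dirty block
# before erasing it. Sizes and policies are illustrative only.

PAGES_PER_BLOCK = 4

class ToySSD:
    def __init__(self, num_blocks):
        self.free_blocks = num_blocks
        self.current = []          # pages in the partially filled block
        self.dirty_blocks = []     # fully written blocks awaiting GC
        self.erase_count = 0       # P/E cycles consumed so far

    def write_page(self, valid=True):
        if self.free_blocks == 0:
            self.garbage_collect()
        self.current.append(valid)
        if len(self.current) == PAGES_PER_BLOCK:
            self.dirty_blocks.append(self.current)
            self.current = []
            self.free_blocks -= 1

    def garbage_collect(self):
        # Copy valid pages out of the oldest dirty block, then erase it.
        block = self.dirty_blocks.pop(0)
        survivors = [p for p in block if p]
        self.erase_count += 1      # one erase = one P/E cycle spent
        self.free_blocks += 1
        for page in survivors:
            self.write_page(page)  # relocated writes cause amplification

ssd = ToySSD(num_blocks=2)
for i in range(12):
    ssd.write_page(valid=(i % 2 == 0))  # half the pages become invalid
print(ssd.erase_count)                  # erases consumed by GC
```

Note how the relocated survivor pages count as extra internal writes: this is the root of write amplification discussed later.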

Why write cycles are important

A NAND flash SSD is able to endure only a limited number of write cycles. The program/erase process causes a deterioration of the oxide layer that traps electrons in a NAND flash memory cell, and the SSD will eventually become unreliable, wear out and lose its ability to store data.

The number of write cycles, or endurance, varies based on the type of NAND flash memory cell. An SSD that stores a single data bit per cell, known as single-level cell (SLC) NAND flash, can typically support up to 100,000 write cycles. An SSD that stores two bits of data per cell, commonly referred to as multi-level cell (MLC) flash, generally sustains up to 10,000 write cycles with planar NAND and up to 35,000 write cycles with 3D NAND. The endurance of SSDs that store three bits of data per cell, called triple-level cell (TLC) flash, can be as low as 300 write cycles with planar NAND and as high as 3,000 write cycles with 3D NAND. The latest quadruple-level cell (QLC) NAND will likely support a maximum of 1,000 write cycles.

Comparison of NAND flash memory

As the number of bits per NAND flash memory cell increases, the cost per gigabyte (GB) of the SSD declines. However, the endurance and the reliability of the SSD are also lower.


Common write cycle problems

Challenges that SSD manufacturers have had to address to use NAND flash memory to store data reliably over an extended period of time include cell-to-cell interference as the dies get smaller, bit failures and errors, slow data erases and write amplification.

Manufacturers have enhanced the endurance and reliability of all types of SSDs through controller software-based mechanisms such as wear-leveling algorithms, external data buffering, improved error correction code (ECC) and error management, data compression, overprovisioning, better internal NAND management and block wear-out feedback. As a result, flash-based SSDs have not worn out as quickly as users once feared they would.

Vendors commonly offer SSD warranties that specify a maximum number of device drive writes per day (DWPD) or terabytes written (TBW). DWPD is the number of times the entire capacity of the SSD can be overwritten on a daily basis during the warranty period. TBW is the total amount of data that an SSD can write before it is likely to fail. Vendors of flash-based systems and SSDs often offer guarantees of five years or more on their enterprise drives.
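DWPD and TBW describe the same endurance budget in different units, so converting between them is simple arithmetic. A small sketch, using illustrative (not vendor-quoted) numbers:

```python
def dwpd_to_tbw(dwpd, capacity_tb, warranty_years):
    """Convert a drive-writes-per-day rating to total terabytes written.

    TBW = DWPD x drive capacity (TB) x 365 days x warranty years.
    """
    return dwpd * capacity_tb * 365 * warranty_years

def tbw_to_dwpd(tbw, capacity_tb, warranty_years):
    """The inverse conversion: total TB written back to writes per day."""
    return tbw / (capacity_tb * 365 * warranty_years)

# Illustrative example: a 1.92 TB drive rated at 1 DWPD over a
# 5-year warranty can absorb about 3,504 TB of host writes in total.
print(dwpd_to_tbw(1, 1.92, 5))
```

This is why a high-DWPD drive at a given capacity implies a proportionally larger TBW figure, and vice versa.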

Manufacturers sometimes specify the type of application workload for which an SSD is designed, such as write-intensive, read-intensive or mixed-use. Some vendors allow the customer to select the optimal level of endurance and capacity for a particular SSD. For instance, an enterprise user with a high-transaction database might opt for a greater DWPD number at the expense of capacity. Or a user operating a database that does infrequent writes might choose a lower DWPD and a higher capacity.


cache memory
Posted by Thang Le Toan on 16 August 2018 05:11 AM

Cache memory, also called CPU memory, is high-speed static random access memory (SRAM) that a computer microprocessor can access more quickly than it can access regular random access memory (RAM). This memory is typically integrated directly into the CPU chip or placed on a separate chip that has a separate bus interconnect with the CPU. The purpose of cache memory is to store program instructions and data that are used repeatedly in the operation of programs, or information that the CPU is likely to need next. The computer processor can access this information quickly from the cache rather than having to get it from the computer's main memory. Fast access to these instructions increases the overall speed of the program.

As the microprocessor processes data, it looks first in the cache memory. If it finds the instructions or data there from a previous read, it does not have to perform a more time-consuming read from the larger main memory or other storage devices. In this way, cache memory speeds up computer operations and processing.
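The benefit of checking the cache first can be quantified with a simple average access time model, weighted by the hit rate. The latencies below are round illustrative figures, not measurements of any particular CPU:

```python
def effective_access_time(hit_rate, cache_ns, memory_ns):
    """Average memory access time given a cache hit rate.

    On a hit, the access costs only the cache latency; on a miss,
    the processor pays for the slower main-memory access instead.
    """
    return hit_rate * cache_ns + (1 - hit_rate) * memory_ns

# Illustrative latencies: 1 ns cache vs. 100 ns DRAM, 95% hit rate.
print(effective_access_time(0.95, 1, 100))
```

Even a modest hit rate pulls the average far below the raw DRAM latency, which is why caching pays off despite the small cache size.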

Once they have been opened and operated for a time, most programs use few of a computer's resources. That's because frequently re-referenced instructions tend to be cached. This is why system performance measurements for computers with slower processors but larger caches can be faster than those for computers with faster processors but less cache space.


Multi-tier or multilevel caching has become popular in server and desktop architectures, with different levels providing greater efficiency through managed tiering. Simply put, the less frequently certain data or instructions are accessed, the lower down the cache level the data or instructions are written.

Implementation and history

Mainframes used an early version of cache memory, but the technology as it is known today began to be developed with the advent of microcomputers. With early PCs, processor performance increased much faster than memory performance, and memory became a bottleneck, slowing systems.

In the 1980s, the idea took hold that a small amount of more expensive, faster SRAM could be used to improve the performance of the less expensive, slower main memory. Initially, the memory cache was separate from the system processor and not always included in the chipset. Early PCs typically had from 16 KB to 128 KB of cache memory.

With 486 processors, Intel added 8 KB of memory to the CPU as Level 1 (L1) memory. As much as 256 KB of external Level 2 (L2) cache memory was used in these systems. Pentium processors saw the external cache memory double again to 512 KB on the high end. They also split the internal cache memory into two caches: one for instructions and the other for data.

Processors based on Intel's P6 microarchitecture, introduced in 1995, were the first to incorporate L2 cache memory into the CPU and enable all of a system's cache memory to run at the same clock speed as the processor. Prior to the P6, L2 memory external to the CPU was accessed at a much slower clock speed than the rate at which the processor ran, and slowed system performance considerably.

Early memory cache controllers used a write-through cache architecture, where data written into cache was also immediately updated in RAM. This approach minimized data loss, but also slowed operations. With later 486-based PCs, the write-back cache architecture was developed, where RAM isn't updated immediately. Instead, data is stored in cache, and RAM is updated only at specific intervals or under certain circumstances where data is missing or old.

Cache memory mapping

Caching configurations continue to evolve, but cache memory traditionally works under three different configurations:

  • Direct mapped cache has each block mapped to exactly one cache memory location. Conceptually, direct mapped cache is like rows in a table with three columns: the data block or cache line that contains the actual data fetched and stored, a tag with all or part of the address of the data that was fetched, and a flag bit that shows the presence in the row entry of a valid bit of data.
  • Fully associative cache mapping is similar to direct mapping in structure but allows a block to be mapped to any cache location rather than to a prespecified cache memory location as is the case with direct mapping.
  • Set associative cache mapping can be viewed as a compromise between direct mapping and fully associative mapping in which each block is mapped to a subset of cache locations. It is sometimes called N-way set associative mapping, which provides for a location in main memory to be cached to any of "N" locations in the L1 cache.
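The direct-mapped scheme in the first bullet can be sketched as a minimal model in which each cache line holds just a tag and a valid flag, and every block address maps to exactly one line. Line counts and the access trace are illustrative:

```python
# Minimal direct-mapped cache model: index = block address mod number
# of lines; each line stores a tag plus a valid flag, as described above.

class DirectMappedCache:
    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = [{"valid": False, "tag": None} for _ in range(num_lines)]
        self.hits = 0
        self.misses = 0

    def access(self, block_addr):
        index = block_addr % self.num_lines   # which line it must use
        tag = block_addr // self.num_lines    # identifies the block
        line = self.lines[index]
        if line["valid"] and line["tag"] == tag:
            self.hits += 1
            return "hit"
        line["valid"], line["tag"] = True, tag  # replace on a miss
        self.misses += 1
        return "miss"

cache = DirectMappedCache(num_lines=4)
# Addresses 0 and 4 collide on line 0, so they evict each other.
for addr in [0, 1, 0, 4, 0]:
    cache.access(addr)
print(cache.hits, cache.misses)
```

The trace shows the characteristic weakness of direct mapping: two addresses that share a line evict each other even when the rest of the cache is empty, which is the conflict problem that set associative mapping relaxes.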

Format of the cache hierarchy

Cache memory is fast and expensive. Traditionally, it is categorized as "levels" that describe its closeness and accessibility to the microprocessor.


L1 cache, or primary cache, is extremely fast but relatively small, and is usually embedded in the processor chip as CPU cache.

L2 cache, or secondary cache, is often more capacious than L1. L2 cache may be embedded on the CPU, or it can be on a separate chip or coprocessor and have a high-speed alternative system bus connecting the cache and CPU. That way it doesn't get slowed by traffic on the main system bus.

Level 3 (L3) cache is specialized memory developed to improve the performance of L1 and L2. L1 or L2 can be significantly faster than L3, though L3 is usually double the speed of RAM. With multicore processors, each core can have dedicated L1 and L2 cache, but they can share an L3 cache. If an L3 cache references an instruction, it is usually elevated to a higher level of cache.

In the past, L1, L2 and L3 caches have been created using combined processor and motherboard components. Recently, the trend has been toward consolidating all three levels of memory caching on the CPU itself. That's why the primary means for increasing cache size has begun to shift from the acquisition of a specific motherboard with different chipsets and bus architectures to buying a CPU with the right amount of integrated L1, L2 and L3 cache.

Contrary to popular belief, implementing flash or more dynamic RAM (DRAM) on a system won't increase cache memory. This can be confusing since the terms memory caching (hard disk buffering) and cache memory are often used interchangeably. Memory caching, using DRAM or flash to buffer disk reads, is meant to improve storage I/O by caching data that is frequently referenced in a buffer ahead of slower magnetic disk or tape. Cache memory, on the other hand, provides read buffering for the CPU.

Specialization and functionality

In addition to instruction and data caches, other caches are designed to provide specialized system functions. According to some definitions, the L3 cache's shared design makes it a specialized cache. Other definitions keep instruction caching and data caching separate, and refer to each as a specialized cache.

Translation lookaside buffers (TLBs) are also specialized memory caches whose function is to record virtual address to physical address translations.

Still other caches are not, technically speaking, memory caches at all. Disk caches, for instance, can use RAM or flash memory to provide data caching similar to what memory caches do with CPU instructions. If data is frequently accessed from disk, it is cached into DRAM or flash-based silicon storage technology for faster access time and response.

SSD caching vs. primary storage

Dennis Martin, founder and president of Demartek LLC, explains the pros and cons of using solid-state drives as cache and as primary storage.

Specialized caches are also available for applications such as web browsers, databases, network address binding and client-side Network File System protocol support. These types of caches might be distributed across multiple networked hosts to provide greater scalability or performance to an application that uses them.


The ability of cache memory to improve a computer's performance relies on the concept of locality of reference. Locality describes various situations that make a system more predictable, such as where the same storage location is repeatedly accessed, creating a pattern of memory access that the cache memory relies upon.

There are several types of locality. Two key ones for cache are temporal and spatial. Temporal locality is when the same resources are accessed repeatedly in a short amount of time. Spatial locality refers to accessing various data or resources that are in close proximity to each other.
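One way to see why locality matters is to feed different access traces to a small cache model. The fully associative LRU cache below is an illustrative stand-in for real hardware, and the traces are made up to contrast strong temporal locality with no locality at all:

```python
from collections import OrderedDict

def lru_hit_rate(trace, capacity):
    """Hit rate of a tiny fully associative LRU cache over an access trace."""
    cache, hits = OrderedDict(), 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(trace)

# Temporal locality: the same few addresses are revisited over and over.
looping = [0, 1, 2, 3] * 10
# No locality: every address is touched exactly once.
scan = list(range(40))

print(lru_hit_rate(looping, capacity=4))
print(lru_hit_rate(scan, capacity=4))
```

The looping trace fits entirely in the cache and hits almost every time, while the one-pass scan never hits, which is why caches help typical programs but not pure streaming workloads.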

Cache vs. main memory

DRAM serves as a computer's main memory, holding the data, retrieved from storage, on which the processor performs calculations. Both DRAM and cache memory are volatile memories that lose their contents when the power is turned off. DRAM is installed on the motherboard, and the CPU accesses it through a bus connection.

An example of dynamic RAM.

DRAM is usually about half as fast as L1, L2 or L3 cache memory, and much less expensive. It provides faster data access than flash storage, hard disk drives (HDDs) and tape storage. It came into use in the last few decades to provide a place to store frequently accessed disk data to improve I/O performance.

DRAM must be refreshed every few milliseconds. Cache memory, which also is a type of random access memory, does not need to be refreshed. It is built directly into the CPU to give the processor the fastest possible access to memory locations, and provides nanosecond speed access time to frequently referenced instructions and data. SRAM is faster than DRAM, but because it's a more complex chip, it's also more expensive to make.

Comparison of memory types

Cache vs. virtual memory

A computer has a limited amount of RAM and even less cache memory. When a large program or multiple programs are running, it's possible for memory to be fully used. To compensate for a shortage of physical memory, the computer's operating system (OS) can create virtual memory.

To do this, the OS temporarily transfers inactive data from RAM to disk storage. This approach increases virtual address space by using active memory in RAM and inactive memory in HDDs to form contiguous addresses that hold both an application and its data. Virtual memory lets a computer run larger programs or multiple programs simultaneously, and each program operates as though it has unlimited memory.

Where virtual memory fits in the memory hierarchy.

To copy virtual memory into physical memory, the OS divides memory into pages that contain a fixed number of addresses. Those pages are stored on disk in a page file or swap file and, when they're needed, the OS copies them from the disk to main memory and translates the virtual addresses into real addresses.
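The page-based translation described above can be sketched as a lookup in a simple one-level page table; the page size and table contents here are hypothetical:

```python
PAGE_SIZE = 4096  # bytes; a common page size, used here for illustration

def translate(virtual_addr, page_table):
    """Translate a virtual address to a physical one via a page table.

    page_table maps virtual page numbers to physical frame numbers;
    a missing entry means the page is on disk (a page fault), at which
    point a real OS would copy the page into main memory first.
    """
    vpn, offset = divmod(virtual_addr, PAGE_SIZE)
    if vpn not in page_table:
        raise LookupError(f"page fault: virtual page {vpn} is not resident")
    return page_table[vpn] * PAGE_SIZE + offset

# Hypothetical page table: virtual pages 0 and 1 live in frames 5 and 2.
table = {0: 5, 1: 2}
print(translate(100, table))       # offset 100 within frame 5
print(translate(4096 + 7, table))  # offset 7 within frame 2
```

The offset within a page is unchanged by translation; only the page number is remapped, which is what lets physically scattered frames appear contiguous to the program.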


all-flash array (AFA)
Posted by Thang Le Toan on 16 August 2018 05:09 AM

An all-flash array (AFA), also known as a solid-state storage disk system, is an external storage array that uses only flash media for persistent storage. Flash memory is used in place of the spinning hard disk drives (HDDs) that have long been associated with networked storage systems.

Vendors that sell all-flash arrays usually allow customers to mix flash and disk drives in the same chassis, a configuration known as a hybrid array. However, those products often represent the vendor's attempt to retrofit an existing disk array by replacing some of the media with flash.

All-flash array design: Retrofit or purpose-built

Other vendors sell purpose-built systems designed natively from the ground up to only support flash. These models also embed a broad range of software-defined storage features to manage data on the array.

A defining characteristic of an AFA is the inclusion of native software services that enable users to perform data management and data protection directly on the array hardware. This is different from server-side flash installed on a standard x86 server. Inserting flash storage into a server is much cheaper than buying an all-flash array, but it also requires the purchase and installation of third-party management software to supply the needed data services.

Leading all-flash vendors have written algorithms for array-based services for data management, including clones, compression and deduplication -- either an inline or post-process operation -- snapshots, replication, and thin provisioning.

As with its disk-based counterpart, an all-flash array provides shared storage in a storage area network (SAN) or network-attached storage (NAS) environment.

How an all-flash array differs from disk

Flash memory, which has no moving parts, is a type of nonvolatile memory that can be erased and reprogrammed in units of memory called blocks. It is a variation of electrically erasable programmable read-only memory (EEPROM), which got its name because the memory blocks can be erased with a single action, or flash. A flash array can transfer data to and from solid-state drives (SSDs) much faster than electromechanical disk drives.

The advantage of an all-flash array, relative to disk-based storage, is full bandwidth performance and lower latency when an application makes a query to read the data. The flash memory in an AFA typically comes in the form of SSDs, which are similar in design to an integrated circuit.

A Pure Storage FlashBlade enterprise storage array (source: Pure Storage).

Flash is more expensive than spinning disk, but the development of multi-level cell (MLC) flash, triple-level cell (TLC) NAND flash and 3D NAND flash has lowered the cost. These technologies enable greater flash density without the cost involved in shrinking NAND cells.

MLC flash is slower and less durable than single-level cell (SLC) flash, but companies have developed software that improves its wear level to make MLC acceptable for enterprise applications. SLC flash remains the choice for applications with the highest I/O requirements, however. TLC flash reduces the price more than MLC, although it also comes with performance and durability tradeoffs that can be mitigated with software. Vendor products that support TLC SSDs include the Dell EMC SC Series and Kaminario K2 arrays.

Considerations for buying an all-flash array

Deciding to buy an AFA involves more than simple comparisons of vendor products. An all-flash array that delivers massive performance increases to a specific set of applications may not provide equivalent benefits to other workloads. For example, running virtualized applications in flash with inline data deduplication and compression tends to be more cost-effective than flash that supports streaming media, in which unique files are incompressible.

An all-SSD system exhibits smaller variations in maximum, minimum and average latencies than an HDD array does. This makes flash a good fit for most read-intensive applications.

The tradeoff comes in write amplification, which stems from the fact that an SSD must erase an entire block before it can rewrite data. Write-intensive workloads require a special algorithm to collect all the writes on the same block of the SSD, thus ensuring the software always writes multiple changes to the same block.

Garbage collection can present a similar issue with SSDs. A flash cell can only withstand a limited number of writes, so wear leveling can be used to increase flash endurance. Most vendors design their all-flash systems to minimize the impact of garbage collection and wear leveling, although users with write-intensive workloads may wish to independently test a vendor's array to determine the best configuration.
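Write amplification is usually expressed as a factor: bytes physically programmed to flash divided by bytes the host asked to write. The figures below are illustrative, not taken from any real drive:

```python
def write_amplification(host_bytes_written, flash_bytes_written):
    """Write amplification factor (WAF).

    Garbage collection copying still-valid pages before an erase is
    what pushes this ratio above the ideal value of 1.0.
    """
    return flash_bytes_written / host_bytes_written

def usable_host_writes_tb(pe_cycles, capacity_tb, waf):
    """Total host data (TB) writable before the rated P/E cycles are spent.

    Higher amplification burns through the endurance budget faster.
    """
    return pe_cycles * capacity_tb / waf

# Illustrative: the host wrote 100 GB, but GC relocation meant the
# drive internally programmed 250 GB of flash -- a WAF of 2.5.
print(write_amplification(100, 250))
# A hypothetical 1 TB drive rated for 3,000 P/E cycles at that WAF.
print(usable_host_writes_tb(3000, 1.0, 2.5))
```

This is why controllers work so hard on garbage collection and wear leveling: halving the WAF roughly doubles how much host data the same flash can absorb.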

Despite paying a higher upfront price for the system, users who buy an AFA may see the cost of storage decline over time. Because an all-flash array lets servers run at higher CPU utilization, an organization needs to buy fewer application servers.

The physical size of an AFA is smaller than that of a disk array, which lowers the rack count. Having fewer racks in a system also reduces the heat generated and the cooling power consumed in the data center.

All-flash array vendors, products and markets

Flash was first introduced as a handful of SSDs in otherwise all-HDD systems to create a small flash tier that accelerated a few critical applications. Thus was born the hybrid flash array.

The next phase of evolution arrived with the advent of software that enabled an SSD to serve as a front-end cache for disk storage, extending the benefit of faster performance across all the applications running on the array.

The now-defunct vendor Fusion-io was an early pioneer of fast flash. Launched in 2005, Fusion-io sold Peripheral Component Interface Express (PCIe) cards packed with flash chips. Inserting the PCIe flash cards in server slots enabled a data center to boost the performance of traditional server hardware. Fusion-io was acquired by SanDisk in 2014, which itself was subsequently acquired by Western Digital Corp.

Also breaking ground early was Violin, whose systems -- designed with custom-built silicon -- gained customers quickly, fueling its rise in public markets in 2013. By 2017, Violin was surpassed by all-flash competitors whose arrays integrated sophisticated software data services. After filing for bankruptcy, the vendor was relaunched by private investors as Violin Systems in 2018, with a focus on selling all-flash storage to managed service providers.

comparison of all-flash storage arrays
Independent analyst Logan G. Harbaugh compares various all-flash arrays. This chart was created in August 2017.

All-flash array vendors, such as Pure Storage and XtremIO -- part of Dell EMC -- were among the earliest to incorporate inline compression and data deduplication, which most other vendors now include as a standard feature. Adding deduplication helped give AFAs the opportunity for price parity with storage based on cheaper rotating media.

A sampling of leading all-flash array products includes the following:

  • Dell EMC VMAX
  • Dell EMC Unity
  • Dell EMC XtremIO
  • Dell EMC Isilon NAS
  • Fujitsu Eternus AF
  • Hewlett Packard Enterprise (HPE) 3PAR StoreServ
  • HPE Nimble Storage AF series
  • Hitachi Vantara Virtual Storage Platform
  • Huawei OceanStor
  • IBM FlashSystem V9000
  • IBM Storwize 5000 and Storwize V7000F
  • Kaminario K2
  • NetApp All-Flash Fabric-Attached Array (NetApp AFF)
  • NetApp SolidFire family -- including NetApp HCI
  • Pure Storage FlashArray
  • Pure FlashBlade NAS/object storage array
  • Tegile Systems T4600 -- bought in 2017 by Western Digital
  • Tintri EC Series

Impact on hybrid arrays use cases

Falling flash prices, data growth and integrated data services have increased the appeal of all-flash arrays for many enterprises. This has led to industry speculation that all-flash storage can supplant hybrid arrays, although there remain good reasons to consider using a hybrid storage infrastructure.

HDDs offer predictable performance at a fairly low cost per gigabyte, although they use more power and are slower than flash, resulting in a high cost per IOPS. All-flash arrays, by contrast, offer a lower cost per IOPS, coupled with the advantages of speed and lower power consumption, but they carry a higher upfront acquisition price and per-gigabyte cost.
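The cost-per-gigabyte versus cost-per-IOPS tradeoff can be made concrete with two hypothetical drives; the prices, capacities and IOPS figures below are invented for illustration only:

```python
def cost_per_gb(price, capacity_gb):
    """Dollars per gigabyte of capacity."""
    return price / capacity_gb

def cost_per_iops(price, iops):
    """Dollars per I/O operation per second of performance."""
    return price / iops

# Hypothetical drives -- not real product pricing:
# an HDD (cheap per GB, few IOPS) vs. an SSD (pricier per GB, many IOPS).
hdd = {"price": 200.0, "capacity_gb": 8000, "iops": 200}
ssd = {"price": 400.0, "capacity_gb": 2000, "iops": 100_000}

print(cost_per_gb(hdd["price"], hdd["capacity_gb"]))    # HDD wins per GB
print(cost_per_gb(ssd["price"], ssd["capacity_gb"]))
print(cost_per_iops(hdd["price"], hdd["iops"]))         # SSD wins per IOPS
print(cost_per_iops(ssd["price"], ssd["iops"]))
```

The same hardware can therefore look cheap or expensive depending on whether capacity or performance is the scarce resource, which is the core of the hybrid-versus-all-flash decision.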

AFA vs. hybrid array

A hybrid flash array enables enterprises to strike a balance between relatively low cost and balanced performance. Since a hybrid array supports high-capacity disk drives, it offers greater total storage than an AFA.

All-flash NVMe and NVMe over Fabrics

All-flash arrays based on nonvolatile memory express (NVMe) flash technologies represent the next phase of maturation. The NVMe host controller interface speeds data transfer by enabling an application to communicate directly with back-end storage.

NVMe is meant to be a faster alternative to the Small Computer System Interface (SCSI) standard that transfers data between a host and a target device. Development of the NVMe standard is under the auspices of NVM Express Inc., a nonprofit organization comprising more than 100 member technology companies.

The NVMe standard is widely considered to be the eventual successor to the SAS and SATA protocols. NVMe form factors include add-in cards, U.2 2.5-inch and M.2 SSD devices.

Some of the NVMe-based products available include:

  • DataDirect Networks Flashscale
  • Datrium DVX hybrid system
  • HPE Persistent Memory
  • Kaminario K2.N
  • Micron Accelerated Solutions NVMe reference architecture
  • Micron SolidScale NVMe over Fabrics appliances
  • Pure Storage FlashArray//X
  • Tegile IntelliFlash

A handful of NVMe-flash startups are bringing products to market, as well, including:

  • Apeiron Data Systems combines NVMe drives with data services housed in field-programmable gate arrays instead of servers attached to storage arrays.
  • E8 Storage E8-D24 NVMe flash arrays replicate snapshots to attached compute servers to reduce management overhead on the array.
  • Excelero software-defined storage runs on any x86 server.
  • Mangstor MX6300 NVMe over Fabrics (NVMe-oF) storage consists of branded PCIe NVMe add-in cards on Dell PowerEdge servers.
  • Pavilion Data Systems offers its branded Pavilion Memory Array.
  • Vexata VX-100 is based on the software-defined Vexata Active Data Fabric.

Industry experts expect 2018 to usher in more end-to-end, rack-scale flash storage systems based on NVMe-oF. These systems integrate custom NVMe flash modules as a fabric in place of a bunch of NVMe SSDs.

The NVMe-oF transport mechanism enables a long-distance connection between host devices and NVMe storage devices. IBM, Kaminario and Pure Storage have publicly disclosed products to support NVMe-oF, although most storage vendors have pledged support.

All-flash storage arrays in hyper-converged infrastructure

Hyper-converged infrastructure (HCI) systems combine computing, networking, storage and virtualization resources as an integrated appliance. Most hyper-convergence products are designed to use disk as front-end storage, relying on a moderate flash cache layer to accelerate applications or to use as cold storage. For reasons related to performance, most HCI arrays were not traditionally built primarily for flash storage, although that started to change in 2017.

Now the leading HCI vendors sell all-flash versions. Among these vendors are Cisco, Dell EMC, HPE, Nutanix, Pivot3 and Scale Computing. NetApp launched an HCI product in October 2017 built around its SolidFire all-flash storage platform.


How to use Iometer to Simulate a Desktop Workload
Posted by Thang Le Toan on 06 September 2015 10:54 PM

Understanding I/O performance is key when designing your VDI architecture. Iometer is widely treated as the industry-standard tool for load-testing a storage subsystem. While there are many tools available, Iometer's balance between usability and function sets it apart. However, Iometer has its quirks, and I'll attempt to show exactly how you should use it to get the best results, especially when testing for VDI environments. I'll also show you how to stop people using Iometer to fool you.

As Iometer requires almost no infrastructure, you can use it to determine storage subsystem performance very quickly. In steady state, a desktop (VDI or RDS) I/O profile will be approximately 80/20 write/read and 80/20 random/sequential, with reads and writes in 4K blocks. The block size in a real Windows workload does vary between 512 B and 1 MB, but the vast majority of I/O is at 4K; as Iometer does not allow a mixed block size during testing, we will use a constant 4K.

That said, while Iometer is great for analysing storage subsystem performance, if you need to simulate a real-world workload for your VDI environment, I would recommend using tools from the likes of Login VSI or DeNamik.

 Bottlenecks for Performance in VDI

Iometer is usually run from within a Windows guest sitting on top of the storage subsystem. This means that there are many layers between it and the storage, as we see below:

If we are to test the performance of the storage, the storage must be the bottleneck. This means there must be sufficient resources in all the other layers to handle the traffic.

 Impact of Provisioning Technologies on Storage Performance

If your VM is provisioned using Citrix Provisioning Services (PVS), Citrix Machine Creation Services (MCS) or VMware View Linked Clones, you will be bottlenecked by the provisioning technology. If you test with Iometer against the C: drive of a provisioned VM, you will not get full insight into storage performance, as these three technologies fundamentally change the way I/O is treated.

You cannot drive maximum IOPS from a single VM, so it is not recommended to run Iometer against these VMs when attempting to stress-test storage.

I would always add a second drive to the VM and run Iometer against that second hard drive, as this bypasses the issue with PVS/MCS/linked clones.

In 99% of cases I would actually rather test against a 'vanilla' Windows 7 VM. By this I mean a new VM installed from scratch, not joined to the domain and with only the appropriate hypervisor tools installed. Remember, Iometer is designed to test storage. By testing with a 'vanilla' VM you baseline core performance delivery. From there you can go on to test a fully configured VM, and now you can understand the impact that AV filter drivers, provisioning technologies such as linked clones, and other software/agents have on storage performance.

Using Iometer for VDI testing: advantages and disadvantages

Before we move on to the actual configuration settings within Iometer, I want to talk a little bit about the test file that Iometer creates to throw I/O against. This file is called iobw.tst and is why I both love and hate Iometer. It's the source of Iometer's biggest bugs and also its biggest advantage.

First, the advantage: Iometer can create any size of test file you like in order to represent the scenario you need to test. When we talk about a single host with 100 Win 7 VMs, or 8 RDS VMs, the size of the I/O 'working set' must be, at a minimum, the aggregate size of the pagefiles, as this will be a set of unique data that is consistently in use. So for the 100 Win 7 VMs with 1 GB RAM each, this test file will be at least 100 GB, and for the 8 RDS VMs with 10 GB RAM each, it would be at least 80 GB. The actual working set will probably be much larger than this, but I'm happy to recommend this as a minimum. It means that it would be very hard for a storage array or RAID card to hold the working set in cache. Iometer allows us to set the test file to a size that mimics such a working set. In practice, I've found that a 20 GB test file is sufficient to accurately mimic a single-host VDI load. If you are still getting unexpected results from your storage, try increasing the size of this test file.
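The working-set arithmetic above can be sketched in a few lines. The assumption, taken from the text, is that the minimum working set is roughly the aggregate pagefile size, with each VM's pagefile equal to its configured RAM:

```python
# Minimum Iometer test-file size, assuming (per the text) the working
# set is at least the aggregate pagefile size of the VMs on the host,
# and each VM's pagefile roughly equals its configured RAM.

def min_test_file_gb(vm_count: int, ram_gb_per_vm: int) -> int:
    """Return the minimum recommended Iometer test-file size in GB."""
    return vm_count * ram_gb_per_vm

print(min_test_file_gb(100, 1))  # 100 Win 7 VMs x 1 GB RAM -> 100
print(min_test_file_gb(8, 10))   # 8 RDS VMs x 10 GB RAM    -> 80
```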

Second, the disadvantage: iobw.tst is buggy. If you resize the file without deleting it, it fails to resize (without error), and if you delete the file without closing Iometer, Iometer crashes. In addition, if you do not run Iometer as administrator, Windows 7 will put iobw.tst in the user profile instead of the root of C:. OK, that's not technically Iometer's fault, but it's still annoying.

Recommended Configuration of Iometer for VDI workloads

 First tab (Disk Targets)

The number of workers is essentially the number of threads used to create the I/O requests. Adding workers adds latency, and it also adds a small amount of total I/O. I consider four workers to be the best balance between latency and IOPS.

Highlighting the computer icon means that all workers are configured simultaneously; you can check that the workers are configured correctly by highlighting the individual workers.

The second drive should be used to avoid issues with filter drivers/provisioning etc on C: (although Iometer should always be run in a ‘vanilla’ installation).

The number of sectors gives you the size of the test file; this is extremely important, as mentioned above. You can use the following website to determine the sectors/GB:

The size used in the example to get 20GB is 41943040 sectors.
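The conversion is straightforward once you know that Iometer counts in 512-byte sectors, which is how 20 GB works out to 41,943,040. A minimal sketch:

```python
SECTOR_BYTES = 512  # Iometer sizes iobw.tst in 512-byte sectors

def gb_to_sectors(size_gb: int) -> int:
    """Convert a desired test-file size in binary GB to an Iometer sector count."""
    return size_gb * 1024**3 // SECTOR_BYTES

print(gb_to_sectors(20))  # 41943040, the value used in the example
```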

The reasoning for configuring 16 outstanding I/Os is similar to that for the number of workers: increasing outstanding I/Os will increase latency while slightly increasing IOPS. As with workers, I think 16 is a good compromise. You can also refer to the following article regarding outstanding I/Os:

Second tab (Network Targets)

No changes are needed on the Network Targets tab.

 Third tab (Access Specifications)

To configure a workload that mimics a desktop, we need to create a new specification.

The new access specification should have the following settings, to ensure that the tests model a VDI workload as closely as possible:

  • 80% Write

  • 80% Random

  • 4K blocks

  • Sector boundaries at 128K; this is probably overkill and 4K would be fine, but it should eliminate any disk alignment issues.
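For reference, the spec can be captured as a simple data structure; this is purely illustrative and is not Iometer's .icf file format:

```python
# Illustrative representation of the desktop access specification
# described above (this is NOT Iometer's .icf file format).
desktop_spec = {
    "block_size_kb": 4,   # 4K blocks, matching typical Windows I/O
    "write_pct": 80,      # 80% write / 20% read
    "random_pct": 80,     # 80% random / 20% sequential
    "align_kb": 128,      # sector-boundary alignment (4K would also do)
}

def is_desktop_like(spec: dict) -> bool:
    """Check a spec against the VDI guidelines above: 4K blocks,
    write-heavy, and at least 75% random."""
    return (spec["block_size_kb"] == 4
            and spec["write_pct"] >= 75
            and spec["random_pct"] >= 75)

print(is_desktop_like(desktop_spec))  # True
```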

The reasons for choosing these values are too detailed to go into here, but you can refer to the following document on Windows 7 I/O:

You should then Add the access specification to the manager.

 Fifth tab (Test Setup)

I'd advise configuring the test to run for only 30 seconds; the results should be representative after that amount of time. More importantly, if you are testing your production SAN, a correctly configured Iometer will eat all of your SAN's performance. Therefore, if you have other workloads on the SAN, running Iometer for a long time will severely impact them.

 Fourth tab (Results Display)

Set the Update Frequency (seconds) slider to the left so you can see the results as they happen.

 Set the ‘Results Since’ to ‘Start of Test’ which will give you a reliable average.

Both Read and Write avg. response times (Latency) are essential.

It should be noted that the CSV file Iometer creates captures all metrics, while the GUI shows only six.

Save Configuration

It is recommended that you save the configuration for later use by clicking the disk icon. This will save you having to re-configure Iometer each test run you do. The file is saved as *.icf in a location of your choosing.  Or to save some time, download a preconfigured Iometer Desktop Virtualization Configuration file and load it into Iometer.

Start the test using the green flag.

Interpreting Results

Generally, the higher the IOPS the better, as indicated by the 'total IOPS per second' counter above, but this must be delivered at a reasonable latency; anything under 5 ms will provide a good user experience.

Given that the maximum possible IOPS for a single spindle is around 200, you should sanity-check your results against predicted values. For an SSD you can expect anywhere from 3,000 to 15,000 IOPS, depending on how empty it is and how expensive it is, so again you can sanity-check your results.
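That spindle rule of thumb lends itself to a quick plausibility check. The 200-IOPS-per-spindle ceiling comes from the text; the 1.5x tolerance factor is my own assumption:

```python
SPINDLE_IOPS = 200  # per-disk random-IOPS ceiling cited in the text

def plausible_hdd_result(measured_iops: float, spindles: int,
                         tolerance: float = 1.5) -> bool:
    """Flag measured IOPS far beyond what the spindle count can deliver,
    which usually means a cache (or a too-small test file) is answering."""
    return measured_iops <= spindles * SPINDLE_IOPS * tolerance

print(plausible_hdd_result(1500, 8))   # True: within an 8-disk ceiling
print(plausible_hdd_result(50000, 8))  # False: suspect cache effects
```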

You don't need to split IOPS or throughput into read and write, because we know Iometer will be working at 80/20, as configured in the access specification.
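If you do want the split, it follows directly from the configured mix:

```python
def split_iops(total_iops: float, write_fraction: float = 0.80):
    """Derive (read, write) IOPS from the 80/20 write/read mix
    configured in the access specification."""
    writes = total_iops * write_fraction
    reads = total_iops - writes
    return reads, writes

print(split_iops(5000.0))  # (1000.0, 4000.0)
```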

 How can I check someone isn’t using Iometer to trick me?

To the untrained eye, Iometer can be used to show very unrepresentative results. Here is a list of things to check when someone shows you an Iometer result.

  • What size is the test file in Explorer? It needs to be very large (minimum 20 GB); don't rely on the Iometer GUI.
  • How sequential is the workload? The more sequential it is, the easier it is to show better IOPS and throughput. (It should be set to a minimum of 75% random.)
  • What's the block size? Windows has a block size of 4K; anything else is not a relevant test and probably helps out the vendor.

