Whenever we chat about IO and storage, there are actually FOUR factors that determine requirements. Often people only discuss the first one (Capacity), so we have to tease out the others:
1. Capacity
2. Throughput
3. Latency
4. Service Levels
1. Capacity
Capacity is the easy one that no one misses. How many GB do I need?
The obvious mistake we see people making is to size only for the raw data. You also need to size for the items below (a rough sizing sketch follows the list):
- Indexes. These can be 20% to 100% of the data size, depending on how heavily indexed the database is. A database that is used for both OLTP and relational reporting (and is tuned) is likely to be heavily indexed. A data warehouse that just exists to load a cube might not have any indexes on it at all!
- TempDB. How to size TempDB is a topic in itself, but as a rough rule of thumb, you’ll need to store at least 1.5 to 2 times your largest table (or partition) in TempDB.
- Backup and General Space. You are going to need somewhere to back up your SQL Server to, or restore from. You may also need to store more than one full copy. How much space you need depends on whether your backups are compressed, but in a lot of cases the space required is 2 or more times the database + index size. The good news is that this “general” storage space does not often need top of the range performance, so using near-line storage in RAID 5 can save buckets.
- OLAP. If you have cubes, then these need to be sized. A cube can be 15% to 100% of the size of the data warehouse, and you may need multiple copies. Often, cubes don’t need terribly performant storage if queries are served from RAM.
- Data Processing. It’s not uncommon to have ETL processes drop files onto or pick files up from a SQL Server, and this requires both space and a drive letter allocated. You can’t really put this on the same volume as SQL data files or you will get fragmentation.
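To pull those together, here is a minimal back-of-the-envelope sketch (in Python) using the rules of thumb above. The ratios and the sample figures are purely illustrative assumptions; substitute your own measurements.

```python
# Rough capacity estimator using the rules of thumb above.
# All ratios and sample inputs are illustrative assumptions.

def estimate_capacity_gb(raw_data_gb, largest_table_gb,
                         index_ratio=0.4,      # indexes: 20%-100% of data, assume 40%
                         backup_copies=2,      # full backup copies kept on disk
                         cube_ratio=0.3,       # OLAP cube: 15%-100% of warehouse size
                         etl_staging_gb=50):   # flat-file drop/pick-up area for ETL
    indexes_gb = raw_data_gb * index_ratio
    tempdb_gb = largest_table_gb * 2            # 1.5-2x the largest table or partition
    backup_gb = (raw_data_gb + indexes_gb) * backup_copies
    cube_gb = raw_data_gb * cube_ratio
    total = (raw_data_gb + indexes_gb + tempdb_gb
             + backup_gb + cube_gb + etl_staging_gb)
    return {"data": raw_data_gb, "indexes": indexes_gb, "tempdb": tempdb_gb,
            "backup": backup_gb, "olap": cube_gb, "etl": etl_staging_gb,
            "total": total}

# Example: 500 GB of raw data, largest table 80 GB
print(estimate_capacity_gb(raw_data_gb=500, largest_table_gb=80))
```

The point is not the exact numbers but that the total comes out far larger than the raw data figure most people start with.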
2. Throughput
Throughput is how fast I need the IO to be, or the rate of transfer. This is typically measured in IOPS, which is the Windows perfmon counter “LogicalDisk\Disk Transfers/sec”, or it can also be measured in MB/sec.
Usually, we measure how many 64 KB random read/write IOs we need per second at the correct latency (8-20ms) for data files.
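For a fixed block size the two units convert directly into each other; a quick sketch of the arithmetic, assuming 64 KB transfers:

```python
# Convert between IOPS and MB/sec for a fixed transfer size.
# Assumes 64 KB IOs, as in the data file discussion above.

BLOCK_KB = 64

def iops_to_mb_per_sec(iops, block_kb=BLOCK_KB):
    return iops * block_kb / 1024.0

def mb_per_sec_to_iops(mb_per_sec, block_kb=BLOCK_KB):
    return mb_per_sec * 1024.0 / block_kb

print(iops_to_mb_per_sec(1000))   # 1,000 x 64 KB IOPS is 62.5 MB/sec
print(mb_per_sec_to_iops(200))    # 200 MB/sec is 3,200 x 64 KB IOPS
```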
An exception to this is data warehouses, where we sometimes speak in terms of MB/sec instead. The reasoning here is that a lot of research has gone into Microsoft’s “Fast Track” design for data warehouses, which estimates that a single CPU can consume about 200 MB/sec of sequential reads. While a full-blown “Fast Track” design might be too expensive for most Irish pockets, the underlying principles and estimator calculations are still really useful for sizing.
http://www.microsoft.com/sqlserver/2008/en/us/fasttrack.aspx
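As a rough illustration of the Fast Track style of estimator calculation (the 200 MB/sec per CPU figure comes from the guidance above; the CPU count below is a made-up example):

```python
# Fast Track style throughput estimate: size the IO subsystem so the CPUs,
# not the disks, become the bottleneck for sequential scans.
# 200 MB/sec per CPU is the figure from the Fast Track guidance;
# the CPU count is an illustrative assumption.

MB_PER_SEC_PER_CPU = 200

def required_sequential_mb_per_sec(cpu_count):
    return cpu_count * MB_PER_SEC_PER_CPU

# e.g. a 16-CPU data warehouse server
print(required_sequential_mb_per_sec(16))   # 3,200 MB/sec of sequential read
```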
3. Latency
This is how responsive I need the disks to be, and it is measured by the perfmon counter “LogicalDisk\Avg. Disk sec/Transfer”. SQL Server is very prescriptive here:
- Data Files need 8-20ms latency
- Log files need 1-5ms latency
The exception to this is data warehouses, where we accept higher latency (8-50ms). The reasoning is very simple: if a query is going to take 30 seconds to scan millions of rows, then having to wait 50ms to get started isn’t going to lose anyone any sleep …
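As a quick sanity check, you could compare sampled “Avg. Disk sec/Transfer” readings against these targets. A minimal sketch, where the thresholds follow the figures above and the sample readings are made up:

```python
# Check sampled "LogicalDisk\Avg. Disk sec/Transfer" values (in ms)
# against the latency targets above. The sample readings are illustrative.

TARGETS_MS = {
    "data": (8, 20),      # data files
    "log": (1, 5),        # transaction log files
    "dw_data": (8, 50),   # data warehouse data files
}

def check_latency(volume_type, avg_ms):
    low, high = TARGETS_MS[volume_type]
    status = "OK" if avg_ms <= high else "TOO SLOW"
    return f"{volume_type}: {avg_ms}ms (target {low}-{high}ms) -> {status}"

print(check_latency("data", 12))
print(check_latency("log", 9))
print(check_latency("dw_data", 45))
```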
4. Service Levels
This is how resilient we need the storage to be:
- Do we need the throughput and latency guaranteed 9-5 or 24×7?
- What throughput and latency can we accept during a disk failure event?
- What throughput and latency can we accept during a backup event?
- Following a failover to a DR site, what storage throughput and latency do you expect (or is this a case of “all bets are off”)?
As an example, imagine you have a 24×7 SQL Server on RAID 5, and a disk blows and will take eight hours to rebuild. What if the IOPS drop from 1,000 to 500 during the rebuild and latency goes from 8ms to 60ms? This could mean that the event is technically an outage for eight hours, and if a disk blows a few times a year you will have lots of outages. Therefore the requirements should state “we need 1,000 IOPS and 8ms latency DURING a disk failure event”, or “we can accept 800 IOPS and 10ms latency during a disk failure event”. This will help the storage designer make the correct choice on configuration. For example, he may use 15 spindles in RAID 5 rather than 8 to ensure the requirements are met, or he may change from RAID 5 to RAID 6. If you don’t specify the requirements, you can’t really blame the storage designer or vendor for not meeting them!
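To make that requirement testable, you could write the service level and the measured figures down explicitly. A minimal sketch using the hypothetical numbers from the example above:

```python
# Compare measured throughput/latency during a disk-failure (rebuild) event
# against the stated service level. Figures match the hypothetical example above.

from dataclasses import dataclass

@dataclass
class StorageSla:
    min_iops: int
    max_latency_ms: int

def meets_sla(sla, measured_iops, measured_latency_ms):
    return (measured_iops >= sla.min_iops
            and measured_latency_ms <= sla.max_latency_ms)

# "We can accept 800 IOPS and 10ms latency during a disk failure event"
failure_sla = StorageSla(min_iops=800, max_latency_ms=10)

# Measured during the RAID 5 rebuild: 500 IOPS at 60ms, effectively an outage
print(meets_sla(failure_sla, measured_iops=500, measured_latency_ms=60))   # False
```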
For a real mission-critical SQL Server, I would run the “SQLIO” test under these conditions to baseline expectations and check that the requirements are met.
Bottom line: Service Levels are the stick that you use to get the storage designer to over-spec the IO beyond the basic requirements 😉 It will raise the cost, of course.