I am regularly asked about storing data in the database and whether storing large amounts of data will slow down the performance of NLS, our core loan servicing software application. To answer the question, I ran a couple of tests that illustrate database performance as tables grow in size.
But before I talk about these tests, let’s look at the size of the largest databases over the past 10 years. In 2001 the largest database was around 10 terabytes (10,000 GB). In 2005 the largest was considered to be 100TB. Data sizes started exploding around 2008, when Yahoo was running a 1PB database (1,000,000 GB); by 2009 that database had grown to more than 10PB. By 2010, SQL Server had a 1.1PB database in production and Oracle had a 5PB database in production.
A petabyte is huge: it would take a stack of CD-ROMs over a mile high to store a petabyte. To put this in NLS terms, a 1PB database would contain approximately 250 million loans with 10 full years of history each.
To look at the performance side, the best way to illustrate this is to run a couple of tests against the largest table in the NLS database, “daily_trial_balance”. The “daily_trial_balance” table contains the end-of-day balances of each active loan, including actual, effective, and general ledger balances. If a loan has been active for one year, there will be 365 entries in this table for that loan. Each row in this table consumes approximately 800 bytes.
The first test is on an NLS database that contains 4.8 million rows in the “daily_trial_balance” table; the data size of this table is 5GB. The test query is “select * from daily_trial_balance where acctrefno = x”, with x being a random loan reference number. This is equivalent to selecting the Daily Trial Balance tab of a loan. Running the test in Microsoft SQL Server Management Studio with the SET STATISTICS TIME option enabled, the execution time is 0.297 seconds:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
(306 row(s) affected)
SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 297 ms.
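For reference, here is a sketch of how this test can be scripted in Management Studio; the account reference number shown is only a placeholder, as any loan’s acctrefno will do.

SET STATISTICS TIME ON;

-- Pull every daily trial balance row for a single loan.
-- 12345 is a placeholder acctrefno, not a real loan reference.
SELECT *
FROM daily_trial_balance
WHERE acctrefno = 12345;

SET STATISTICS TIME OFF;

The “SQL Server Execution Times” lines above are the SET STATISTICS TIME output, which reports both the CPU time and the elapsed (wall clock) time for the query.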
The second test is on a “daily_trial_balance” table that contains 48 million rows, an order of magnitude larger. To make sure there is no help from the cache, I restarted the database server prior to the query test. The data size of this table is 42.7GB. Using the same basic query, the execution time is 0.269 seconds:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
(306 row(s) affected)
SQL Server Execution Times:
CPU time = 15 ms, elapsed time = 269 ms.
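As an aside, restarting the server is the most thorough way to take caching out of the picture, but in a test environment a similar cold-cache effect can be approximated without a full restart by flushing SQL Server’s buffer and plan caches:

-- Test/development environments only: write out dirty pages, then
-- drop cached data pages and cached query plans so the next run
-- reads from disk and recompiles.
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;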
The final test is on a “daily_trial_balance” table that contains 480 million rows, again an order of magnitude larger. Once again I restarted the database server prior to the query test. The data size of this table is 420GB. Using the same basic query, the execution time is 0.271 seconds:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
(306 row(s) affected)
SQL Server Execution Times:
CPU time = 15 ms, elapsed time = 271 ms.
To answer the question of whether the size of the data affects performance, the short answer is no. As the timings of the three queries show, execution time does not track table size. Increasing the data size tenfold, and then tenfold again, did not change the time needed to query a random loan’s daily trial balance entries. The small fluctuations between the timings, including the slightly slower result on the smallest table, come down to other load on the server at the time. This is what you would expect from an indexed lookup: the work needed to locate one loan’s rows grows roughly logarithmically with the size of the table, so even a hundredfold increase in rows adds only a few extra page reads.
With a properly indexed database, finding specific data even in very large datasets can be very fast and efficient.
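In tests like these, the speed comes from the lookup being satisfied by an index seek on the loan reference number rather than a scan of the whole table. As a simplified sketch (the index name here is hypothetical and the actual NLS schema may define its indexes differently), an index supporting this access pattern looks something like:

-- Illustrative only; the index name is made up and the real NLS
-- schema may differ.
CREATE NONCLUSTERED INDEX IX_daily_trial_balance_acctrefno
    ON daily_trial_balance (acctrefno);

With an index like this in place, the query touches only the index pages needed to locate the roughly 300 rows for the requested loan, no matter how many hundreds of millions of rows the table holds.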