09 Juni 2015
von Tim Heron
Improvements in data load times
The team at Apteco has long acknowledged that the development of
multi-threaded algorithms are the means of driving performance
around data compression in modern high performance software
development and as a result have seen significant improvements in
data load times with each new version of FastStats Designer.
It has been just over 10 years since the publication of Herb
Sutter’s article “The Free Lunch Is Over”. In his article
Herb described the changing performance characteristics of modern
processors and the consequences this has for modern high
performance software development.
In the Apteco High Performance Team we have long recognised that
improvements in query and data load performance will only come
from implementing multi-threaded algorithms instead of continual
micro optimisations to sequential code.
With FastStats Designer we have worked over the last few years on
developing multi-threaded versions of our sequential data load
algorithms. We have worked a component at a time replacing
the sequential processes with new multi-threaded equivalents and
in doing so we achieved a 25-30% speed improvement over the
course of 2014.
For the 2015 Q2 version of FastStats Designer we are releasing
for beta use our new multi-threaded data load and compression
This routine still produces our standard FastStats optimised,
compressed binary data file formats but does so with entirely new
code that has been written from scratch using the latest
parallelisation technology. The technology we decided to
base our component on is a Microsoft library called TPL Dataflow
that extends the multi-threaded features of the .Net Framework to
support data processing pipelines. We re-implemented our
compression routines as components of this pipeline to make full
use of the multi-threaded capabilities of modern processors.
Previously with the Nov 14 sequential load process we were able
to build a TPC-H reference test system with 1.3Bn records (1Bn
Line Item records) in 4 hours. The sequential data load
portion of this build took nearly 2 hours. The new
Apteco.DataBuild component performs the same processing in under
50 minutes dropping the total load time (including auto
discovery, sort and compression) to 2 hours 50 minutes.
Starting with the 2015 Q2 release users of FastStats Designer
will have the option to build selected tables with this new
component and the improvements will benefit any FastStats
Designer data builds that execute on multi-processor machines.
Apteco.DataBuild giving a powerful 32 core AWS EC2 instance
something to think about.
The ability to access and analyse big data is key to driving
business forward and creating target marketing campaigns that
give you and your organisation the edge.
Sutter, H. 2005. "The free lunch is over: A fundamental turn
toward concurrency in software," Dr. Dobb's Journal. http://www.gotw.ca/publications/concurrency-ddj.htm