BigData is a big deal. It's changing how we look at data and analytics, but it isn't the end. What are the enablers of BigData? First and foremost, cheap computing resources (CPU, disk, memory, bandwidth, and so on), all thanks to Moore's Law. Today even startups can afford huge amounts of computing power, the likes of which previously only the big boys could afford. This has also given rise to commodity hardware and cloud computing, which further the proliferation of large amounts of cheap, quickly-provisioned computing resources. Second, to apply all that power, we have open source data processing systems based on years of distributed systems research, like Hadoop and the many incarnations of NoSQL. Open source has allowed the proliferation of systems that scale, systems that until recently only the highly capitalized could afford.

These two things alone have allowed for the democratization of BigData. A guy in a garage can process terabytes of data with little more than a credit card and elbow grease.

With all these tools and recently acquired computing power, where are we going? Of course we can expect datasets to continue to grow, the computational complexity of our data processing to increase, and compute power to continue to rise (GPGPUs, multicore and so on). In addition, I anticipate the emergence of something I'm calling NewData. NewData will build on what we currently have with BigData, but will include some trends just beginning to take off.

First, the development of ubiquitous public APIs (the Meatcloud Manifesto). Public APIs have yet to proliferate to all online systems, and as a consequence there is still a lot of screen scraping going on. With easily query-able and parse-able datasets available through ubiquitous APIs, consuming the internet with machines becomes easier, making the application of BigData more powerful. Netflix is a good example of this (a small sketch of what API-driven consumption looks like follows below).

Second, and similarly enabling, will be the development of standardized public datasets. Current datasets are generally hard to find and use; standardized dataset formats will let BigData analysis be more productive and waste less time munging. Data.gov is a start.

These two developments have yet to be fully realized in current systems, but they will allow for the rise of NewData. As they roll out, we will begin to see changes in how our BigData systems look. NewData systems will be less concerned with how big the data is and what it looks like, and will instead emphasize deriving more information from the data. Bradford Cross gets this, and as a result FlightCaster is an early example of what I mean by NewData.
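To make the API point concrete, here is a minimal sketch of what machine consumption looks like once a provider exposes a real API. The endpoint URL and field names here are hypothetical, not any particular provider's API, but the pattern is the whole story: one HTTP GET, one JSON parse, and no brittle HTML scraping.

```python
import json
import urllib.request

# Hypothetical public API endpoint; a real provider (Data.gov, Netflix, etc.)
# would document its own URL and schema. The point is that the response is
# structured JSON, not HTML that has to be screen-scraped.
URL = "https://api.example.org/v1/flights?origin=PDX&date=2010-01-15"

def fetch_flights(url=URL):
    with urllib.request.urlopen(url) as resp:
        records = json.loads(resp.read().decode("utf-8"))
    # Each record is already a dict -- no regexes, no DOM walking.
    return [(r["flight"], r["scheduled_departure"], r["status"]) for r in records]

if __name__ == "__main__":
    for flight, departs, status in fetch_flights():
        print(flight, departs, status)
```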
The scale of data and computations is an important issue, but the data age is less about the raw size of your data, and more about the cool stuff you can do with it.
Asking the right questions of the data is important, especially if you're trying to do cool stuff. The Freakonomics guys proved this a few times over. NewData will be about creating value from data, and asking the right questions is worth as much as the answers. The key enabler of this will be using newfound APIs and datasets to combine data from disparate sources in ways that BigData couldn't, and asking questions that we wouldn't have thought to ask of BigData. Where BigData was about a handful of datasets at most, NewData will be about dozens of datasets. The mashup is the cornerstone of NewData.

That being said, we will need new systems to process this data and enable us to ask these questions. NewData analysis will need inter-process communication and collaboration. Currently, systems like Hadoop process data by splitting it up and processing the chunks in parallel on hundreds to thousands of machines, with each process isolated from the others. This will continue, but NewData will require more from these systems to ask deeper questions, and complex inter-process communication will be needed to ask them. Think of the simplicity of writing Map/Reduce jobs, the robustness of Hadoop, the workflow and dataflow of Cascading and DryadLINQ, respectively, and the power of a message passing system like MPI (a toy sketch of a multi-dataset job in this spirit appears at the end of this post). These jobs will likely include large in-memory collaborative computations across thousands of machines. Where data locality was key in BigData, both data locality and memory locality (NUMA/ccNUMA) will be important in NewData.

It is clear that BigData still has some runway before NewData takes over. However, if the trends in the democratization of compute and processing continue (beyond Hadoop and EC2), and the opening of APIs and datasets proliferates online and off, NewData and its new questions, mashups, and systems are inevitable. Where having readily available compute resources and the software to use it defined BigData, NewData will be defined solely by asking the right questions, the algorithms to derive answers, and the systems used to produce them.

Thanks to Mike Miller, Bradford Stephens and my awesome wife Erin for the help on this article. Follow me on twitter.
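As promised above, here is a rough, single-machine sketch of the kind of multi-dataset mashup job I have in mind, written in the Map/Reduce style. The datasets (flight delays and weather observations keyed by airport) and their layout are invented for illustration; a real job would run the same map and reduce functions on Hadoop rather than in a local loop.

```python
from collections import defaultdict

# Two hypothetical datasets a NewData mashup might combine:
# flight delay records and weather observations, both keyed by airport code.
delays = [("PDX", 12), ("PDX", 45), ("SEA", 7), ("SFO", 95)]
weather = [("PDX", "rain"), ("SEA", "rain"), ("SFO", "fog")]

def map_delay(record):
    airport, minutes = record
    yield airport, ("delay", minutes)

def map_weather(record):
    airport, conditions = record
    yield airport, ("weather", conditions)

def reduce_airport(airport, values):
    # Join the two sources per airport and derive something new:
    # the average delay under the observed conditions.
    minutes = [v for tag, v in values if tag == "delay"]
    conditions = [v for tag, v in values if tag == "weather"]
    avg = sum(minutes) / len(minutes) if minutes else None
    return airport, conditions, avg

def run_local():
    # Stand-in for Hadoop's shuffle phase: group all mapped values by key.
    groups = defaultdict(list)
    for record in delays:
        for key, value in map_delay(record):
            groups[key].append(value)
    for record in weather:
        for key, value in map_weather(record):
            groups[key].append(value)
    return [reduce_airport(k, v) for k, v in sorted(groups.items())]

if __name__ == "__main__":
    for airport, conditions, avg_delay in run_local():
        print(airport, conditions, avg_delay)
```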