Is your (or your end client’s) business growing rapidly? We certainly hope so. But to have an edge on your IT-savvy competition, you need to start making sense of the mass of business data that your enterprise generates daily and that at present just quietly goes to waste.
Before choosing a solution, you should understand the basics so that you neither overpay your Big Data consultant nor make platform mistakes – which can be extremely costly to correct later on. The Big Data consulting industry is still in its adolescence. As a result, many consultancies are competent in only one platform, so they will try to sell you their particular offering. Our quick 101 can save you a bunch of money and help you talk to Big Data techies with confidence.
First off, all those terabytes of your business information must be collected and stored properly – otherwise they will be difficult or impossible to analyze and, to make things worse, hard to move elsewhere once a solution is deployed. Imagine you are (for whatever reason) lying under your car changing the oil. You are holding a small pot into which all of the dirty oil is dripping. You watch as it quickly fills up and realize you need to figure out how to pour it somewhere else without messing up your clothes and your garage.
Have you heard about the 3Vs of Big Data? Fret not, it’s simple:
- Volume: presumably your business generates LOTS of data – otherwise you wouldn’t be reading this.
- Velocity: is your data flowing more like a waterfall than a river? Should it be collected at all times and at high rates?
- Variety: probably the most decisive one. Types of data can vary greatly: from data produced by machines (and therefore easy to process by machine) to works of art that only human beings can create and really comprehend.
While big volumes can to an extent be conquered by the brute force of buying ever more data storage, it’s your unique combination of the latter two Vs that will determine the best platform for you. After all, maximizing the efficiency and convenience of your information analysis is the main reason behind your entire Big Data endeavor.
Below we’ll briefly review the main families of Big Data systems – see which one best fits your business circumstances:
- Key-Value Store. Popular platforms: Azure Table Storage, DynamoDB, Redis
  - Extremely fast at saving streams of data
  - Saves data as a key-value pair where the value can be of any type: BLOBs, CLOBs, encoded strings, etc. – and all of them can be combined
  - Caveat: Does not support data relationships well (though this is not really a deciding factor, as it is a common problem across most NoSQL engines)
  - Caveat: For later processing, the value needs to be transformed into a form that supports SQL-like querying; raw data can usually be retrieved only by key lookup
  - Best for: video capture, encoded data, real-time logging
- Column Family. Popular platforms: Hadoop/HBase, Cassandra
  - An extension of the raw key-value model, where the value is a set of columns, each defined as a name-value-timestamp triplet
  - Resembles the good old table structures of relational databases, though the resemblance is not so obvious
  - The number of columns can vary from row to row, which gives a flexible data representation
  - Querying column family structures is very fast because you can define the specific set of columns that you query most frequently, so not all of the information has to be read, as it is in relational DBMSs
  - Best for: variable data representation, data analysis (both science and business)
- Document Store. Popular platforms: MongoDB, Azure DocumentDB, CouchDB
  - Represents data as a key-value pair, where the value is a “document” – a variable set of fields, each with a name and a value, with documents nestable inside documents for greater flexibility
  - Supports rich querying mechanisms and techniques for representing data relationships, which enables quick adoption by teams coming from RDBMSs
  - Data denormalization – all necessary information is in one place
  - Best for: metadata storage, web applications that read/write massive amounts of information, sales/product definitions and online marketing
- Graph Database. Popular platforms: FlockDB, Onyx, InfiniteGraph
  - If you really need relations in your NoSQL, these engines are for you
  - Hierarchies are maintained simply and transparently
  - Objects can be queried not only by attributes but also by relations, using built-in join capabilities
  - Best for: building infrastructure models, social network maintenance, business process modeling
- Multi-Model Database. Popular platforms: ArangoDB, AlchemyDB, CortexDB
  - Combines two or more approaches to Big Data to implement the polyglot persistence paradigm
  - Best for: large-scale enterprises where multiple types of Big Data have to be maintained and served through one integrated solution
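To make the differences between the families more tangible, here is a minimal sketch in plain Python. It does not use any of the platforms named above or their client libraries; the dictionaries, the function names (`kv_get`, `cf_read`, `docs_where`, `reachable`) and all sample records are hypothetical and only illustrate how each data model is shaped and queried.

```python
# Plain-Python sketches of four data models. No real database is involved;
# every name and record below is made up for illustration.

# 1. Key-value: an opaque value that can only be fetched by its key.
kv_store = {"log:2024-01-01T00:00:00": b"\x00\x01raw-frame-bytes"}

def kv_get(key):
    """Return the raw value stored under `key`, or None if it is absent."""
    return kv_store.get(key)

# 2. Column family: row key -> column name -> (value, timestamp).
cf_table = {
    "user:1": {"name": ("Ann", 1), "city": ("Lviv", 1)},
    "user:2": {"name": ("Bob", 2)},  # rows may carry different columns
}

def cf_read(row_key, columns):
    """Read only the requested columns of one row and ignore the rest."""
    row = cf_table.get(row_key, {})
    return {col: row[col][0] for col in columns if col in row}

# 3. Document: nested, denormalized fields stored under one key.
documents = {
    "order:1": {"customer": {"name": "Ann"}, "items": [{"sku": "A1", "qty": 2}]},
    "order:2": {"customer": {"name": "Bob"}, "items": [{"sku": "B7", "qty": 1}]},
}

def docs_where(path, expected):
    """Return ids of documents whose nested field at `path` equals `expected`."""
    hits = []
    for doc_id, doc in documents.items():
        node = doc
        for part in path:
            node = node.get(part) if isinstance(node, dict) else None
        if node == expected:
            hits.append(doc_id)
    return hits

# 4. Graph: labelled edges between nodes, queried by relation.
edges = [("ann", "manages", "bob"), ("bob", "manages", "eve")]

def reachable(start, relation):
    """Return every node reachable from `start` by following `relation` edges."""
    found, frontier = [], [start]
    while frontier:
        current = frontier.pop()
        for src, rel, dst in edges:
            if src == current and rel == relation and dst not in found:
                found.append(dst)
                frontier.append(dst)
    return found
```

Note how the key-value sketch can answer only "give me the value for this key", while the document and graph sketches can be asked about field contents and relations. A multi-model engine would, roughly speaking, offer several of these access styles over one store.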
Here’s a brief comparison of the main features across the described platform families:
To date, Edvantis has worked extensively with all five groups of Big Data systems, and our engineers have hands-on experience with most of the listed technologies (while we are not associated with any particular vendor or platform). Want more free advice? I’m a Big Data evangelist at Edvantis Software; it’s my passion, and I’d be happy to chat. Feel free to reach out to me at [email@example.com, ATTN Volo Klymko Big Data 101] and we’ll talk.