Friday, 1 September 2017

How to Design a Big Data Architecture:



Designing a Big Data architecture is a complex task, considering the volume, variety and velocity of data today. Add to that the pace of technology innovation and the number of competing products in the market, and this is no trivial challenge for a Big Data architect.
Analyze the Business Problem
Look at the business problem objectively and identify whether it is truly a Big Data problem. Sheer volume or cost should not be the only deciding factors; criteria such as velocity, variety, challenges with the current system, and the time taken for processing should be considered as well.
Some Common Use Cases:
  • Data Archival / Data Offload – Despite the cumbersome process and long SLAs for retrieving data from tape, it remains the most commonly used backup method, because cost limits how much active data can be kept in current systems. Hadoop, by contrast, can store huge amounts of data spanning years (as active data) at a very low cost.
  • Process Offload – Offload jobs that consume expensive MIPS or extensive CPU cycles on the current systems.
  • Data Lake Implementation – Data lakes help in storing and processing massive amounts of data.
  • Unstructured Data Processing – Big Data technologies can store and process any amount of unstructured data natively. An RDBMS can also store unstructured data as BLOBs or CLOBs, but it does not provide native processing capabilities for it.
  • Data Warehouse Modernization – Integrate the capabilities of Big Data and your data warehouse to increase operational efficiency.
Vendor Selection
The choice of Hadoop distribution is often driven by the client, based on their own preferences, the vendor's market share, or existing partnerships. The main Hadoop distribution vendors are Cloudera, Hortonworks, MapR and IBM BigInsights, with Cloudera and Hortonworks being the most prominent.
Deployment Strategy
The deployment strategy determines whether the solution will be on-premises, cloud-based, or a mix of both. Each has its own pros and cons.
  • An on-premises solution tends to be perceived as more secure (at least in the customer's mind). Banking, insurance, and healthcare customers have typically preferred this approach, as data does not leave their premises. However, hardware procurement and maintenance cost significantly more money, effort and time.
  • A cloud-based solution is a more cost-effective, pay-as-you-go model that provides greater flexibility in terms of scalability and eliminates procurement and maintenance overhead.
  • A mixed deployment strategy offers the best of both worlds and can be planned so that PII data stays on-premises while the rest moves to the cloud.
Capacity Planning
Capacity planning plays a pivotal role in hardware and infrastructure sizing. Important factors to be considered are listed below, followed by a rough sizing sketch:
  • Data volume for one-time historical load
  • Daily data ingestion volume
  • Retention period of data
  • HDFS Replication factor based on criticality of data
  • Time period for which the cluster is sized (typically 6 months to 1 year), after which the cluster is scaled horizontally based on requirements
  • Multi datacenter deployment
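As a rough illustration of how these factors combine, here is a minimal Python sketch of a raw-capacity estimate. The formula and all figures (daily volume, retention, replication factor, headroom) are simplified assumptions for illustration, not sizing recommendations.

    # Rough HDFS raw-capacity estimate from the capacity-planning factors above.
    # All figures used here are hypothetical placeholders.
    def raw_storage_tb(historical_tb, daily_gb, retention_days,
                       replication=3, headroom=0.25):
        """Approximate raw HDFS capacity needed, in TB."""
        incremental_tb = (daily_gb * retention_days) / 1024.0
        logical_tb = historical_tb + incremental_tb
        replicated_tb = logical_tb * replication
        return replicated_tb * (1 + headroom)  # spare room for temp data and compaction

    # Example: 50 TB of history, 100 GB/day, 1-year retention, replication factor 3
    print(round(raw_storage_tb(50, 100, 365), 1), "TB raw")  # ~321.2 TB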
Infrastructure Sizing
Infrastructure sizing is based on the capacity plan and determines the type of hardware required, such as the number of machines, CPU, and memory. It also involves deciding the number of clusters/environments required.
Important factors to be considered are listed below; a node-count sketch follows the list:
  • Type of processing – memory-intensive or I/O-intensive
  • Type of disk
  • Number of disks per machine
  • Memory size
  • HDD size
  • Number of CPUs and cores
  • Data retained and stored in each environment (Ex: Dev may be 30% of prod)
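Building on the hypothetical capacity figure from the sketch above, a ballpark data-node count can be derived from the per-machine disk configuration. The per-node numbers here are illustrative assumptions, not hardware recommendations.

    import math

    # Indicative data-node count from raw capacity and per-node disk configuration.
    # Per-node figures are assumptions, not recommendations.
    def data_node_count(raw_capacity_tb, disks_per_node=12, disk_tb=4,
                        usable_fraction=0.8):
        usable_per_node_tb = disks_per_node * disk_tb * usable_fraction
        return math.ceil(raw_capacity_tb / usable_per_node_tb)

    print(data_node_count(321))  # roughly 9 data nodes for ~321 TB raw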
Backup and Disaster Recovery Planning
Backup and disaster recovery is a critical part of planning and involves the following considerations (a simple copy sketch follows the list):
  • The criticality of data stored
  • RPO (Recovery Point Objective) and RTO (Recovery Time Objective) requirements
  • Active-active or active-passive disaster recovery
  • Multi datacenter deployment
  • Backup Interval (can be different for different types of data)
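One common building block for cross-cluster backup is Hadoop's DistCp tool. The sketch below simply shells out to it from Python; the cluster addresses and paths are hypothetical, and a real setup would add authentication (e.g. Kerberos), error handling, bandwidth throttling, and scheduling aligned with the backup interval.

    import subprocess

    # Minimal cross-cluster copy using Hadoop's DistCp (hypothetical hosts and paths).
    def backup_path(src, dst):
        """Copy an HDFS path from the primary cluster to the DR cluster."""
        subprocess.run(["hadoop", "distcp", "-update", src, dst], check=True)

    backup_path("hdfs://prod-nn:8020/data/sales",
                "hdfs://dr-nn:8020/data/sales")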
The following sections cover each of the logical layers involved in architecting the Big Data solution.

Get to the Source!
Source profiling is one of the most important steps in deciding the architecture. It involves identifying the different source systems and categorizing them based on their nature and type.
Points to be considered while profiling the data sources (an illustrative inventory sketch follows the list):
  • Identify the internal and external source systems
  • Make a high-level estimate of the volume of data ingested from each source
  • Identify the mechanism used to get data – push or pull
  • Determine the type of data source – Database, File, web service, streams etc.
  • Determine the type of data – structured, semi structured or unstructured
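A lightweight way to capture the outcome of source profiling is a simple inventory, for example one record per source. The fields and sample entries below are illustrative assumptions.

    from dataclasses import dataclass

    # Illustrative source inventory covering the profiling points above.
    @dataclass
    class DataSource:
        name: str
        origin: str          # "internal" or "external"
        source_type: str     # database, file, web service, stream, ...
        data_format: str     # structured, semi-structured, unstructured
        mechanism: str       # "push" or "pull"
        est_daily_gb: float  # high-level volume estimate

    sources = [
        DataSource("orders_db", "internal", "database", "structured", "pull", 20.0),
        DataSource("clickstream", "external", "stream", "semi-structured", "push", 80.0),
    ]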
Ingestion Strategy and Acquisition
Data ingestion in the Hadoop world means ELT (Extract, Load and Transform), as opposed to the ETL (Extract, Transform and Load) used in traditional warehouses.
Points to be considered (a minimal ELT sketch follows the list):
  • Determine the frequency at which data would be ingested from each source
  • Is there a need to change the semantics of the data (append, replace, etc.)?
  • Is there any data validation or transformation required before ingestion (Pre-processing)?
  • Segregate the data sources based on mode of ingestion – Batch or real-time
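As a minimal ELT illustration using PySpark (one of several possible tools; the paths, file format, and column names here are assumptions), the raw data is landed first and only transformed afterwards:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ingest-sketch").getOrCreate()

    # Extract + Load: land the source file as-is in a raw/landing zone on HDFS.
    raw = spark.read.option("header", "true").csv("hdfs:///raw/orders/2017-09-01/")
    raw.write.mode("append").parquet("hdfs:///landing/orders/")

    # Transform: cleansing and deduplication happen after the data is already loaded.
    landed = spark.read.parquet("hdfs:///landing/orders/")
    curated = landed.dropDuplicates().filter("order_id IS NOT NULL")
    curated.write.mode("overwrite").parquet("hdfs:///curated/orders/")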
Storage
The storage layer should be able to hold large amounts of data of any type and scale as needed. The number of IOPS (input/output operations per second) it can provide should also be considered. The Hadoop Distributed File System (HDFS) is the most commonly used storage framework in the Big Data world; others are NoSQL data stores such as MongoDB, HBase and Cassandra. Salient features of Hadoop storage are its ability to scale, self-manage and self-heal.
There are 2 kinds of analytical requirements that storage can support:
  • Synchronous – Data is analysed in real time or near real time, so the storage should be optimized for low latency.
  • Asynchronous – Data is captured, recorded and analysed in batch.
Things to consider while planning the storage methodology (a short partitioning and compression sketch follows the list):
  • Type of data (Historical or Incremental)
  • Format of data (structured, semi-structured or unstructured)
  • Compression requirements
  • Frequency of incoming data
  • Query pattern on the data
  • Consumers of the data
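To make a few of these considerations concrete (format, compression, and query pattern), here is a hedged PySpark sketch that stores data as Snappy-compressed Parquet partitioned by date; the paths and column names are assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("storage-sketch").getOrCreate()

    # Columnar format + compression + partitioning chosen to match the expected
    # query pattern (filtering by date). Paths and columns are hypothetical.
    events = spark.read.json("hdfs:///landing/events/")
    (events.write
           .mode("append")
           .option("compression", "snappy")
           .partitionBy("event_date")      # aligns with date-range queries
           .parquet("hdfs:///warehouse/events/"))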
And Now We Process
Not only has the amount of data being stored grown, but processing requirements have increased manifold as well.
Earlier, frequently accessed data was kept in dynamic RAM; now, because of sheer volume, it is stored on multiple disks across a number of machines connected over the network. Instead of bringing data to the processing, processing is moved closer to the data, which significantly reduces network I/O. The processing methodology is driven by business requirements and can be categorized as batch, real-time or hybrid based on the SLA (a short sketch using a single engine follows the list below).
  • Batch Processing – Batch processing collects input over a specified interval of time and runs transformations on it in a scheduled way. A historical data load is a typical batch operation.
Technology Used: MapReduce, Hive, Pig
  • Real-time Processing – Real-time processing involves running transformations as and when data is acquired.
Technology Used: Impala, Spark, Spark SQL, Tez, Apache Drill
  • Hybrid Processing – It’s a combination of both batch and real-time processing needs.
The best-known example is the Lambda architecture.
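To illustrate the batch/real-time split with a single engine (Spark is used here purely as an example; the topic, broker address, paths and columns are assumptions, and the streaming part assumes the Spark Kafka connector is available):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("processing-sketch").getOrCreate()

    # Batch: aggregate a full day's worth of data on a schedule.
    daily = spark.read.parquet("hdfs:///warehouse/events/event_date=2017-09-01/")
    daily.groupBy("country").agg(F.count("*").alias("events")).show()

    # Real-time: a running count over a Kafka stream (hypothetical topic/broker).
    stream = (spark.readStream
                   .format("kafka")
                   .option("kafka.bootstrap.servers", "broker:9092")
                   .option("subscribe", "events")
                   .load())
    counts = stream.groupBy("key").count()
    counts.writeStream.outputMode("complete").format("console").start()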
The Last Mile – Consumption
This layer consumes the output provided by the processing layer. Different users, such as administrators, business users, vendors and partners, can consume the data in different formats. The output of the analysis can feed a recommendation engine, or business processes can be triggered based on the analysis.
Different forms of data consumption are listed below, followed by a small query sketch:
  • Export Data Sets – There can be requirements to generate data sets for third parties. Data sets can be generated using a Hive export or pulled directly from HDFS.
  • Reporting and Visualization – Reporting and visualization tools can connect to Hadoop using JDBC/ODBC connectivity to Hive.
  • Data Exploration – Data scientists can build models and perform deep exploration in a sandbox environment. The sandbox can be a separate cluster (the recommended approach) or a separate schema within the same cluster containing a subset of the actual data.
  • Ad hoc Querying – Ad hoc or interactive querying can be supported using Hive, Impala or Spark SQL.
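As a sketch of programmatic consumption over HiveServer2 (PyHive is just one of several client options; the host, port and table are assumptions):

    from pyhive import hive  # one possible HiveServer2 client

    # Ad hoc query sketch against Hive (hypothetical host and table).
    conn = hive.Connection(host="hive-server.example.com", port=10000,
                           database="default")
    cursor = conn.cursor()
    cursor.execute("SELECT country, COUNT(*) AS cnt FROM events GROUP BY country")
    for row in cursor.fetchall():
        print(row)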
  And finally, the key things to remember when designing a Big Data architecture are:
  • Dynamics of the use case: There are a number of scenarios, as illustrated in this article, that need to be considered while designing the architecture – the form and frequency of the data, the type of data, and the type of processing and analytics required.
  • Myriad of technologies: The proliferation of tools in the market has led to a lot of confusion around what to use and when; multiple technologies offer similar features and claim to be better than the others.