Wednesday, 4 October 2017

What is Big Data Analytics?

The term "big data" refers to digital stores of information that have a high volume, velocity and variety. Big data analytics is the process of using software to uncover trends, patterns, correlations or other useful insights in those large stores of data.
The advantage: big data analytics enables companies to increase revenues, decrease costs and become more competitive within their industries, which is why many firms are investing heavily in it.
Big data analytics is quickly gaining adoption. Enterprises have awakened to the reality that their big data stores represent a largely untapped gold mine that could help them lower costs, increase revenue and become more competitive. They don't just want to store their vast quantities of data; they want to convert that data into valuable insights that can help improve their companies.
As a result, investment in big data analytics tools is seeing remarkable gains. According to IDC, worldwide sales of big data and business analytics tools are likely to reach $150.8 billion in 2017, which is 12.4 percent higher than in 2016. And the market research firm doesn't see that trend stopping anytime soon. It forecasts 11.9 percent annual growth through 2020 when revenues will top $210 billion.

Data analytics isn't new. It has been around for decades in the form of business intelligence and data mining software. Over the years, that software has improved dramatically so that it can handle much larger data volumes, run queries more quickly and perform more advanced algorithms.
The market research firm Gartner divides big data analytics tools into four categories (a short code sketch illustrating the first two follows the list):

  1. Descriptive Analytics: These tools tell companies what happened. They create simple reports and visualizations that show what occurred at a particular point in time or over a period of time. These are the least advanced analytics tools.
  2. Diagnostic Analytics: Diagnostic tools explain why something happened. More advanced than descriptive reporting tools, they allow analysts to dive deep into the data and determine root causes for a given situation.
  3. Predictive Analytics: Among the most popular big data analytics tools available today, predictive analytics tools use highly advanced algorithms to forecast what might happen next. Often these tools make use of artificial intelligence and machine learning technology.
  4. Prescriptive Analytics: A step above predictive analytics, prescriptive analytics tell organizations what they should do in order to achieve a desired result. These tools require very advanced machine learning capabilities, and few solutions on the market today offer true prescriptive capabilities.
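To make the first two categories concrete, here is a minimal sketch in Spark (Scala). The sales dataset, its columns and the HDFS path are all hypothetical, so treat this as an illustration rather than a reference implementation:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object AnalyticsCategories {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("analytics-categories").getOrCreate()
    val sales = spark.read.parquet("hdfs:///data/sales") // hypothetical dataset

    // Descriptive: what happened? Revenue per store per month.
    sales.groupBy(col("store"), month(col("date")).as("month"))
      .agg(sum("amount").as("revenue"))
      .show()

    // Diagnostic: why did it happen? Drill into one underperforming store
    // and rank its products by revenue to look for a root cause.
    sales.filter(col("store") === "store-42") // hypothetical store id
      .groupBy(col("product"))
      .agg(sum("amount").as("revenue"))
      .orderBy(col("revenue").asc)
      .show()

    spark.stop()
  }
}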

Benefits of Big Data Analytics

Organizations decide to deploy big data analytics for a wide variety of reasons, including the following:
  • Business Transformation In general, executives believe that big data analytics offers tremendous potential to revolutionize their organizations. In the 2016 Data & Analytics Survey from IDGE, 78 percent of people surveyed agreed that over the next one to three years the collection and analysis of big data could fundamentally change the way their companies do business.
  • Competitive Advantage In the MIT Sloan Management Review Research Report Analytics as a Source of Business Innovation, sponsored by SAS, 57 percent of enterprises surveyed said their use of analytics was helping them achieve competitive advantage, up from 51 percent who said the same thing in 2015.
  • Innovation Big data analytics can help companies develop products and services that appeal to their customers, as well as helping them identify new opportunities for revenue generation. Also in the MIT Sloan Management survey, 68 percent of respondents agreed that analytics has helped their company innovate. That's an increase from 52 percent in 2015.
  • Lower Costs In the NewVantage Partners Big Data Executive Survey 2017, 49.2 percent of companies surveyed said that they had successfully decreased expenses as a result of a big data project.
  • Improved Customer Service Organizations often use big data analytics to examine social media, customer service, sales and marketing data. This can help them better gauge customer sentiment and respond to customers in real time.
  • Increased Security Another key area for big data analytics is IT security. Security software creates an enormous amount of log data. By applying big data analytics techniques to this data, organizations can sometimes identify and thwart cyberattacks that would otherwise have gone unnoticed.

Big Data Analytics Tools

Big data analytics has become so trendy that nearly every major technology company sells a product with the "big data analytics" label on it, and a huge crop of startups also offers similar tools. Cloud-based big data analytics have become particularly popular. In fact, the 2016 Big Data Maturity Survey conducted by AtScale found that 53 percent of those surveyed were already using cloud-based big data solutions, and 72 percent planned to do so in the future. Open source tools like Hadoop are also very important, often providing the backbone to commercial solutions.
The lists below are not exhaustive, but do include a sampling of some of the better-known big data analytics solutions.

Open Source Big Data Analytics Tools

Big Data Analytics Vendors

How to Select a Big Data Application

Choosing big data software is a complicated process that requires a careful evaluation of your goals and the solutions available from vendors.

To be sure, big data solutions are in great demand. Today, enterprise leaders know that their big data is one of their most valuable resources — and one they can't afford to ignore. As a result, they are looking for hardware and software that can help them store, manage and analyze their big data.
According to IDC, enterprises will likely spend $150.8 billion on big data and analytics in 2017, 12.4 percent more than they spent last year. And that spending is likely to increase at 11.9 percent per year through 2020, when revenues will likely top $210 billion.
Much of that revenue is going toward big data applications. IDC forecasts that spending on software alone could exceed $70 billion in 2020. Spending is increasing particularly rapidly on non-relational analytic data stores (like NoSQL databases), which will likely grow 38.6 percent per year, and cognitive software platforms (like analytics tools with artificial intelligence and machine learning capabilities), which will likely grow 23.3 percent per year.
In order to capitalize on all that big data spending, vendors have slapped the "big data" label on a dizzying array of different products and services. That product proliferation can make it difficult for organizations to find the right big data applications to meet their needs. Experts suggest that a good way to start the process of selecting a big data application is to determine exactly what kind of application (or applications) you need.

Types of Big Data Applications

Enterprise software vendors offer a wide array of different types of big data applications. The kind of big data application that is right for you will depend on your goals.
For example, if you just want to expand your existing financial reporting capabilities with greater detail and depth, a data warehouse and business intelligence solution might be sufficient for your needs. If your sales and marketing teams want to use your big data to uncover new opportunities for increasing your revenue and margins, you might consider creating a data lake and/or investing in a data mining solution. If you want to create a data-driven culture where everyone in your organization is using data to guide their decision-making, you might want a data lake, predictive analytics, an in-memory database and possibly streaming analytics as well.
Things can get a little more complicated because the lines between the different types of tools can be a little fuzzy. Some business intelligence tools have data mining and predictive analytics capabilities. Some predictive analytics tools include streaming capabilities.
Your best approach is to define your goals clearly at the outset and then go looking for products that will help you reach those goals. The chart below offers an overview of some of the most common types of big data applications and how they can be useful in the enterprise.



Key Decisions When Selecting a Big Data Application

No matter which type of big data application you select, you'll need to make some key decisions that will help you narrow down your options. Here are a few of the most important of these considerations:

On-premise vs cloud-based big data applications

The first big decision you'll need to make is whether you want to host your big data software in your own data center or if you want to use a cloud-based solution.
Currently, more organizations seem to be opting for the cloud. “Global spending on big data solutions via cloud subscriptions will grow almost 7.5 times faster than on-premise subscriptions,” Brian Hopkins, Forrester vice president and principal analyst, wrote in an August 2017 blog post. “Furthermore, public cloud was the number one technology priority for big data according to our 2016 and 2017 surveys of data analytics professionals.”
Cloud-based big data applications are popular for several reasons, including scalability and ease of management. The major cloud vendors are also leading the way with artificial intelligence and machine learning research, which is allowing them to add advanced features to their solutions.
However, cloud isn't always the best option. Organizations with high compliance or security requirements sometimes find that they need to keep sensitive data on premises. In addition, some organizations already have investments in existing on-premises data solutions, and they find it more cost effective to continue running their big data applications locally or to use a hybrid approach.

Proprietary vs open source big data applications

Some of the most popular big data tools available, including the Hadoop ecosystem, are available under open source licenses. Forrester has estimated, “Firms will spend $800 million in Hadoop software and related services in 2017.”
One of the big appeals of Hadoop and other open source software is the low total cost of ownership. While proprietary solutions have hefty license fees and may require expensive specialized hardware, Hadoop has no licensing fees and can run on industry-standard hardware.
However, enterprises sometimes find it difficult to get the open source solutions up and running and configured for their needs. They may need to purchase support or consulting services, and organizations need to consider those expenses when figuring out total cost of ownership.

Batch vs streaming big data applications

The earliest big data solutions, like Hadoop, processed batch data only, but enterprises increasingly find that they want to analyze data in real time. That has generated more interest in streaming solutions such as Spark, Storm, Samza and others.
Many analysts say that even if organizations don't think they need to process streaming data today, streaming capabilities are likely to become standard operating procedure in the not-too-distant future. For that reason, many organizations are moving toward Lambda architecture, a data processing architecture that can handle both real-time and batch data.
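As a rough sketch of that idea in Spark (Scala): the same aggregation logic can back both a batch path and a streaming path, which is the essence of the Lambda approach. The event schema and HDFS path are assumptions for illustration:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

object BatchAndStreaming {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("batch-and-streaming").getOrCreate()
    val schema = new StructType()            // assumed event schema
      .add("eventType", StringType)
      .add("ts", TimestampType)

    // Logic shared by the batch and speed layers: counts per event type.
    def countsByType(events: DataFrame): DataFrame =
      events.groupBy(col("eventType")).count()

    // Batch layer: process everything that has landed so far.
    countsByType(spark.read.schema(schema).json("hdfs:///data/events")).show()

    // Speed layer: the same logic applied to events as they arrive.
    countsByType(spark.readStream.schema(schema).json("hdfs:///data/events"))
      .writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}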

Characteristics to Look for in a Big Data Application

Once you have narrowed down your options, you'll need to evaluate the big data applications you are considering. The criteria below include some of the most important factors to examine.
  • Integration with Legacy Technology – Most organizations already have existing investments in data management and analytics technology. Replacing that technology completely can be expensive and disruptive, so organizations often choose to look for solutions that can be used alongside their current tools or that can augment their existing software.
  • Performance – A 2017 Talend study found that real-time analytics capabilities were one of business leaders' top IT priorities. Executives and managers need to be able to access insights in a timely manner if they are going to profit from those insights. That means investing in technology that can provide the speed they need.
  • Scalability – Big data stores get larger every day. Organizations not only need big data applications that perform quickly right now, they need big data applications that can continue to perform quickly as data stores grow exponentially. This need for scalability is one of the key reasons why cloud-based big data applications have become very popular.
  • Usability – Organizations should also consider the "learning curve" for any big data applications that they intend to purchase. Tools with easy deployment, easy configuration, intuitive interfaces and/or similarity or integration with tools the organization already uses can provide tremendous value.
  • Visualization – According to BI-Survey.com, "Visualization and explorative data analysis for business users (known as data discovery) have evolved into the hottest business intelligence and analytics topic in today’s market." Presenting data in charts and graphs makes it easier for human brains to spot trends and outliers, speeding up the process of identifying actionable insights.
  • Flexibility – The big data needs you have today are likely very different from the needs you will have in another year or two. That's why many enterprises choose to look for tools with the capacity to serve a variety of different goals rather than performing a single function very well.
  • Security – Much of the data included in those big data stores is sensitive information that would be highly valuable to competitors, nation-states or hackers. Organizations need to ensure that their big data has adequate protection to prevent the sorts of large data breaches that have recently been dominating headlines. That means looking either for tools that have security features like encryption and strong authentication built in or tools that integrate with your existing security solutions.
  • Support – Even experienced IT professionals sometimes find it difficult to deploy, maintain and use complex big data applications. Don't forget to consider the quality and cost of the support available from the various vendors.
  • Ecosystem – Most organizations need a number of different applications to meet all of their big data needs. That means looking for a big data platform that integrates with a lot of other popular tools and a vendor with strong partnerships with other providers.
  • Self-Service Capabilities – The Harvey Nash KPMG CIO Survey 2017 found that 60 percent of CIOs consistently report talent shortages, with big data and analytics being the most in-demand skillset. Because there aren't enough qualified data scientists to go around, organizations are looking for tools that other business professionals can use on their own. A recent Gartner blog post noted that in an average organization, about 32 percent of employees are using BI and analytics.
  • Total Cost of Ownership – The upfront costs of a big data application are only a small part of the picture. Organizations need to make sure they consider related hardware costs, ongoing license or subscription fees, employee time, support costs and any expenses related to the physical space for on-premises applications. Don't forget to factor in that cloud computing costs generally decrease over time.
  • Estimated Time to Value – Another important financial consideration is how quickly you'll be able to get up and running with a particular solution. Most companies would prefer to see benefit from their big data projects within days or weeks rather than months or years.
  • Artificial Intelligence and Machine Learning – Finally, consider how innovative the various big data applications vendors are. AI and machine learning research are advancing at an incredible rate and becoming a mainstream part of big data analytics. Forrester has predicted, “In 2017, investments in AI will triple as firms work to convert customer data into personalized experiences.” If you choose a vendor that isn't on the cutting-edge of this research, you may find yourself falling behind the competition.

Tips for Selecting a Big Data Application

Clearly, choosing the right big data application is a complicated process that involves a myriad of factors. Experts and organizations that have successfully deployed big data software offer the following advice:
  • Understand your goals — As previously mentioned, knowing what you want to accomplish is of paramount importance when choosing a big data application. If you aren't sure why you are investing in a particular technology, your project is unlikely to succeed.
  • Start small — If you can demonstrate success with a small-scale big data analytics project, that will generate interest in using the tool throughout the company.
  • Take a holistic approach — While a small-scale project can help you gain experience and expertise with your technology, it's important to choose an application that can ultimately be used throughout the business. Gartner advises, “To support a ‘data and analytics everywhere’ world, IT professionals need to create a new end-to-end architecture built for agility, scale and experimentation. Today, disciplines are merging and approaches to data and analytics are becoming more holistic and encompassing the entire business.”
  • Work together — That same blog post also notes, “Gartner recommends data and analytics leaders work proactively to spread analytics throughout their organization, to get the largest possible benefit from enabling data to drive business actions.” Many organizations are attempting to build a data-driven culture, and that requires a great deal of cooperation among business and IT leaders.
  • Go viral — Those previously mentioned self-service capabilities can also help with the creation of data-driven culture. Gartner advises, “Enable analytics to truly go viral, within and outside the enterprise. Empower more business users to perform analytics by fostering a pragmatic approach to self-service and by embedding analytic capabilities at the point of data ingestion within interactions and processes.”

Friday, 1 September 2017

How to Design a Big Data Architecture



Designing a Big Data architecture is a complex task, considering the volume, variety and velocity of data today. Add to that the speed of technology innovations and competitive products in the market, and this is no trivial challenge for a Big Data Architect.
Analyze the Business Problem
Look at the business problem objectively and identify whether or not it is a Big Data problem. Sheer volume or cost may not be the deciding factor. Multiple criteria, like velocity, variety, challenges with the current system and the time taken for processing, should be considered as well.
Some Common Use Cases:
  • Data Archival/Data Offload – Despite the cumbersome process and long SLAs for retrieving data from tapes, tape is still the most commonly used method of backup, because cost limits the amount of active data that can be maintained in current systems. Alternatively, Hadoop facilitates storing huge amounts of data spanning years (active data) at a very low cost.
  • Process Offload – Offload jobs that consume expensive MIPS cycles or extensive CPU cycles on current systems.
  • Data Lake Implementation – Data lakes help in storing and processing massive amounts of data.
  • Unstructured Data Processing – Big Data technologies provide capabilities to store and process any amount of unstructured data natively. RDBMSs can also store unstructured data as BLOBs or CLOBs, but don't provide processing capabilities natively.
  • Data Warehouse Modernization – Integrate the capabilities of Big Data and your data warehouse to increase operational efficiency.
Vendor Selection
Vendor selection for the Hadoop distribution is often driven by the client, depending on their personal bias, the vendor's market share, or existing partnerships. The main vendors for Hadoop distributions are Cloudera, Hortonworks, MapR and IBM BigInsights (with Cloudera and Hortonworks being the prominent ones).
Deployment Strategy
Deployment strategy determines whether the solution will be on premise, cloud based, or a mix of both. Each has its own pros and cons.
  • An on-premise solution tends to be more secure (at least in the customer's mind). Banking, insurance and healthcare customers have typically preferred this method, as data doesn't leave the premises. However, hardware procurement and maintenance cost much more money, effort and time.
  • A cloud-based solution is a more cost-effective, pay-as-you-go model, which provides a lot of flexibility in terms of scalability and eliminates procurement and maintenance overhead.
  • A mixed deployment strategy gives us the best of both worlds and can be planned to retain PII data on premise and the rest in the cloud.
Capacity Planning
Capacity planning plays a pivotal role in hardware and infrastructure sizing. Important factors to be considered are listed below (a rough worked example follows the list):
  • Data volume for one-time historical load
  • Daily data ingestion volume
  • Retention period of data
  • HDFS Replication factor based on criticality of data
  • Time period for which the cluster is sized (typically 6 months to 1 year), after which the cluster is scaled horizontally based on requirements
  • Multi datacenter deployment
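As a rough illustration of how these factors combine into a storage estimate, here is a back-of-the-envelope sketch in plain Scala; every input figure is an invented assumption, not a recommendation:

object CapacityPlan {
  def main(args: Array[String]): Unit = {
    val historicalTb  = 40.0  // one-time historical load, in TB (assumed)
    val dailyIngestTb = 0.5   // daily ingestion volume, in TB (assumed)
    val retentionDays = 365   // retention period (assumed)
    val replication   = 3     // HDFS replication factor
    val headroom      = 1.3   // ~30% growth headroom for the sizing period

    val logicalTb = historicalTb + dailyIngestTb * retentionDays
    val rawTb     = logicalTb * replication * headroom
    println(f"Logical data: $logicalTb%.1f TB; raw HDFS capacity needed: $rawTb%.1f TB")
  }
}

With these assumed numbers, 222.5 TB of logical data becomes roughly 868 TB of raw HDFS capacity once replication and headroom are applied, which is why the replication factor and retention period tend to dominate sizing discussions.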
Infrastructure Sizing
Infrastructure sizing is based on our capacity planning, and determines the type of hardware required, like the number of machines, CPU, memory, etc. It also involves deciding the number of clusters/environments required.
Important factors to be considered:
  • Type of processing – memory or I/O intensive
  • Type of disk
  • Number of disks per machine
  • Memory size
  • HDD size
  • Number of CPUs and cores
  • Data retained and stored in each environment (e.g., Dev may be 30% of Prod)
Backup and Disaster Recovery Planning
Backup and disaster recovery is a very important part of planning, and involves the following considerations:
  • The criticality of data stored
  • RPO (Recovery Point Objective) and RTO (Recovery Time Objective) requirements
  • Active-Active or Active-Passive Disaster recovery
  • Multi datacenter deployment
  • Backup Interval (can be different for different types of data)
Next, let's look at each of the logical layers involved in architecting the Big Data solution.

Get to the Source!
Source profiling is one of the most important steps in deciding the architecture. It involves identifying the different source systems and categorizing them based on their nature and type.
Points to be considered while profiling the data sources:
  • Identify the internal and external source systems
  • Make a high-level assumption of the amount of data ingested from each source
  • Identify the mechanism used to get data – push or pull
  • Determine the type of data source – database, file, web service, streams, etc.
  • Determine the type of data – structured, semi-structured or unstructured
Ingestion Strategy and Acquisition
Data ingestion in the Hadoop world means ELT (Extract, Load and Transform), as opposed to ETL (Extract, Transform and Load) in the case of traditional warehouses.
Points to be considered (a minimal ELT sketch follows this list):
  • Determine the frequency at which data would be ingested from each source
  • Is there a need to change the semantics of the data (append, replace, etc.)?
  • Is there any data validation or transformation required before ingestion (Pre-processing)?
  • Segregate the data sources based on mode of ingestion – Batch or real-time
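As a minimal sketch of the ELT pattern in Spark (Scala): extract from a source system and load it into the raw zone untransformed, deferring the transform step to the cluster. The JDBC connection details and paths are hypothetical, and the example assumes the PostgreSQL JDBC driver is on the classpath:

import org.apache.spark.sql.SparkSession

object RawIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("raw-ingest").getOrCreate()

    // Extract: pull a table from an assumed source database.
    val orders = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://src-db:5432/sales") // hypothetical source
      .option("dbtable", "orders")
      .option("user", "etl")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .load()

    // Load: land the data as-is in the raw zone; transform later, in Hadoop.
    orders.write.mode("append").parquet("hdfs:///raw/sales/orders")
    spark.stop()
  }
}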
Storage
One should be able to store large amounts of data of any type and scale on an as-needed basis. We should also consider the number of IOPS (input/output operations per second) the storage can provide. The Hadoop Distributed File System (HDFS) is the most commonly used storage framework in the Big Data world; others are the NoSQL data stores – MongoDB, HBase, Cassandra, etc. One of the salient features of Hadoop storage is its capability to scale, self-manage and self-heal.
There are 2 kinds of analytical requirements that storage can support:
  • Synchronous – Data is analysed in real time or near real time, so the storage should be optimized for low latency.
  • Asynchronous – Data is captured, recorded and analysed in batch.
Things to consider while planning storage methodology (a short storage sketch follows this list):
  • Type of data (Historical or Incremental)
  • Format of data (structured, semi-structured and unstructured)
  • Compression requirements
  • Frequency of incoming data
  • Query pattern on the data
  • Consumers of the data
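A short sketch of how some of these considerations (format, compression, query pattern) translate into a Spark (Scala) write; the raw landing path and the eventDate column are assumptions:

import org.apache.spark.sql.SparkSession

object StoreEvents {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("store-events").getOrCreate()
    val events = spark.read.json("hdfs:///raw/events") // assumed raw landing zone

    // Columnar Parquet with Snappy compression, partitioned by date so the
    // most common query pattern (filter by date) can prune whole directories.
    events.write
      .mode("append")
      .option("compression", "snappy")
      .partitionBy("eventDate") // assumes this column exists in the data
      .parquet("hdfs:///data/curated/events")
  }
}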
And Now We Process
Not only has the amount of data being stored increased multifold, but so has the processing.
Earlier, frequently accessed data was stored in dynamic RAM, but now, due to the sheer volume, it is stored on multiple disks on a number of machines connected via the network. Instead of bringing the data to the processing, the new approach takes the processing closer to the data, which significantly reduces network I/O. The processing methodology is driven by business requirements, and can be categorized into batch, real-time or hybrid based on the SLA.
  • Batch Processing – Batch processing collects input over a specified interval of time and runs transformations on it in a scheduled way. Historical data load is a typical batch operation.
Technology Used: MapReduce, Hive, Pig
  • Real-time Processing – Real-time processing involves running transformations as and when data is acquired.
Technology Used: Impala, Spark, Spark SQL, Tez, Apache Drill
  • Hybrid Processing – A combination of both batch and real-time processing needs. The best example is the Lambda architecture.
The Last Mile – Consumption
This layer consumes the output provided by the processing layer. Different users, like administrators, business users, vendors and partners, can consume data in different formats. The output of analysis can be consumed by a recommendation engine, or business processes can be triggered based on the analysis.
Different forms of data consumption are (a short sketch follows this list):
  • Export Data Sets – There can be requirements for third-party data set generation. Data sets can be generated using Hive export or directly from HDFS.
  • Reporting and Visualization – Different reporting and visualization tools can connect to Hadoop using JDBC/ODBC connectivity to Hive.
  • Data Exploration – Data scientists can build models and perform deep exploration in a sandbox environment. The sandbox can be a separate cluster (the recommended approach) or a separate schema within the same cluster that contains a subset of the actual data.
  • Ad Hoc Querying – Ad hoc or interactive querying can be supported by using Hive, Impala or Spark SQL.
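A small sketch of two of these consumption forms in Spark (Scala): an ad hoc Spark SQL query, followed by a dataset export a third party could pick up. The table and path names are hypothetical, and Hive support is assumed to be available:

import org.apache.spark.sql.SparkSession

object Consume {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("consume")
      .enableHiveSupport() // assumes a Hive metastore is configured
      .getOrCreate()

    // Ad hoc querying: Spark SQL against a hypothetical Hive table.
    val topCustomers = spark.sql(
      """SELECT customer_id, SUM(amount) AS revenue
        |FROM sales.orders
        |GROUP BY customer_id
        |ORDER BY revenue DESC
        |LIMIT 100""".stripMargin)

    // Export data set: a single CSV extract for a third party.
    topCustomers.coalesce(1).write
      .option("header", "true")
      .csv("hdfs:///exports/top_customers")
  }
}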
  And finally, the key things to remember in designing a Big Data architecture are:
  • Dynamics of the use case: There are a number of scenarios, as illustrated in the article, which need to be considered while designing the architecture – the form and frequency of data, the type of data, and the type of processing and analytics required.
  • Myriad of technologies: The proliferation of tools in the market has led to a lot of confusion around what to use and when; there are multiple technologies offering similar features and claiming to be better than the others.

Monday, 29 May 2017

JAVA VS SCALA

POJO IN JAVA VS SCALA


The Java version of a simple POJO:
public class PostJ {
    private String title;
    private String text;
    private String author;
    public String getTitle() {
        return title;
    }
    public void setTitle(String title) {
        this.title = title;
    }
    public String getText() {
        return text;
    }
    public void setText(String text) {
        this.text = text;
    }
    public String getAuthor() {
        return author;
    }
    public void setAuthor(String author) {
        this.author = author;
    }
}


And the Scala version.

case class Post(title: String, text: String, author: String)
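For comparison, here is a quick sketch of what that single line generates for you – field accessors, a readable toString, structural equality, copy, and pattern-matching support. The sample values are invented:

val post = Post("Intro to Spark", "Spark basics.", "Ravi")

println(post.title)                      // accessor generated for each field
println(post)                            // Post(Intro to Spark,Spark basics.,Ravi)
val edited = post.copy(author = "Priya") // copy with one field changed
println(post == edited)                  // false: structural, not reference, equality

post match {                             // extractor generated for pattern matching
  case Post(title, _, author) => println(s"$title by $author")
}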



Variables:

Java syntax:
------------------

<data type> <variable name> = <value>;
int a = 10;
String s = "Big Data";

Scala syntax:
------------------

val <variable name>: <data type> = <value>
val a: Int = 10
val s: String = "Spark"

val is immutable: once assigned, it cannot be reassigned.

var <variable name>: <data type> = <value>

var is mutable: it can be reassigned.

NOTE: Scala supports type inference – the compiler automatically finds the data type from the value, so the type annotation can be omitted:

val a = 10
val s = "Spark"
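A tiny sketch of the val/var difference, as it would behave in the Scala REPL:

val x = 10   // type inferred as Int
// x = 11    // does not compile: reassignment to val
var y = 10   // type inferred as Int
y = 11       // fine: var is mutable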