Monday 7 December 2015

Cassandra Installation on a Windows Machine

Prior to installation, set up Java on Windows and set the JAVA_HOME and PATH environment variables as well.

Step 1)
Download the Cassandra tar file to any location on your Windows machine.
Use the link below to download the tar file.
If you want the newest version of Cassandra, click on the latest release; otherwise pick a version from the Cassandra archives (check this section of the above URL -> Previous and Archived Cassandra Server Releases).
Click on the tar file and it will download to your Windows machine.

Step 2)
Go to the downloaded tar file's location and extract it using WinZip or 7-Zip.
Copy the extracted folder to any drive. In my case I am placing it in 'D:\<location>', where <location> is your extracted Cassandra folder.
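If you prefer the command line, 7-Zip can do the extraction in two passes (a sketch; the archive name here is an example, use the file you actually downloaded):

    :: first pass: decompress the .gz, producing a .tar file
    "C:\Program Files\7-Zip\7z.exe" x apache-cassandra-2.1.2-bin.tar.gz

    :: second pass: unpack the .tar into D:\
    "C:\Program Files\7-Zip\7z.exe" x apache-cassandra-2.1.2-bin.tar -oD:\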

Step 3)
Set the CASSANDRA_HOME path in the environment variables (System Properties -> Environment Variables):
CASSANDRA_HOME=D:\<location of Cassandra>
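Alternatively, the variable can be set from a Command Prompt with the setx command (a sketch; the path is an example and should point at your extracted folder):

    :: persist CASSANDRA_HOME for the current user (example path)
    setx CASSANDRA_HOME "D:\apache-cassandra-2.1.2"

    :: open a new Command Prompt afterwards; setx does not change the current session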

Step 4)
We have to modify two settings in the conf/cassandra.yaml file inside the Cassandra folder.
Open cassandra.yaml and search for these entries (commitlog_directory and data_file_directories).
Change
    commitlog_directory: /var/lib/cassandra/commitlog
to
    commitlog_directory: D:/<location of Cassandra>/commitlog
and create a commitlog folder at that location.
Similarly, change
    data_file_directories:
        - /var/lib/cassandra/data
to
    data_file_directories:
        - D:/<location of Cassandra>/data
and create a data folder at that location.
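A quick way to create the two folders from a Command Prompt (a sketch, assuming the Cassandra folder is D:\apache-cassandra-2.1.2):

    mkdir D:\apache-cassandra-2.1.2\commitlog
    mkdir D:\apache-cassandra-2.1.2\data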
Step 5)
Once the above four steps are done:
Open a Command Prompt -> switch to the Cassandra bin folder -> start the Cassandra instance by entering the cassandra.bat command.
Then enter cassandra-cli.bat in another Command Prompt to interact with Cassandra. A sketch of the session is shown below.
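For example (assuming the same D:\apache-cassandra-2.1.2 folder as above):

    :: first Command Prompt -- start the server
    cd /d D:\apache-cassandra-2.1.2\bin
    cassandra.bat

    :: second Command Prompt -- connect with the command-line client
    cd /d D:\apache-cassandra-2.1.2\bin
    cassandra-cli.bat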
Once everything runs without errors, the installation was done properly.

Thursday 3 December 2015

Cassandra Introduction

Cassandra:
Apache Cassandra is a column-oriented NoSQL database for processing large amounts of data spread across multiple clusters and nodes. Cassandra processes unstructured data, and data is stored as key-value pairs. It has some unique features compared with other data models.
Features
·         Highly available service
·         No single point of failure
·         Linearly scalable performance
·         Easy data distribution across multiple data centers

Some differences in key features: RDBMS vs Cassandra

Feature               | RDBMS                                           | Cassandra
Type of data          | Deals only with structured data                 | Deals with unstructured data
Schema                | Fixed schema                                    | Flexible schema, designed according to the data
Relationships         | Represented through joins and foreign keys     | Represented through collections
Data storage          | Tables of rows and columns                     | Nested key-value pairs
Data model            | Database -> tables                             | Keyspaces -> column families
Row representation    | A row is an individual record in a table       | A row is a unit of replication
Column representation | A column represents an attribute of a relation | A column is a unit of storage

The top 5 most common use cases are:
1. Internet of Things
Cassandra is a perfect fit for scaling time-series data from users, devices, and sensors.

2. Personalization
Use Cassandra to ingest and analyze user data for custom, fast, low-cost, scalable user experiences.

3. Messaging
Cassandra's original Facebook use case; storing, managing, and analyzing messages requires sophisticated systems and massive scale.

4. Fraud detection
Staying a step ahead of fraud is now best solved at the database level. Apache Cassandra lets you analyze patterns quickly, accurately, and effectively.

5. Playlists
Product catalogs, movie ratings, you name it: storing a collection of user-selected items has massive performance and availability demands.


In the next article we are going to see how to work with keyspaces in Cassandra.

Cassandra Operations Part 1 (Working with Keyspaces)

Cassandra operations:
In this article we are going to create, alter, and delete keyspaces in Cassandra. Keyspaces are like databases in Cassandra: inside a keyspace we can create tables and then load data into them.
We are going to learn the following concepts in this article:
·         Creating a keyspace
·         Altering a keyspace
·         Dropping a keyspace
First we will open cqlsh to connect to the local Cassandra cluster, as shown below.

Go to the Cassandra installation location, then to the bin folder, and type ./cqlsh
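For example (a sketch, using a placeholder for your install location):

    cd <cassandra-install-location>/bin
    ./cqlsh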
Creation of a Keyspace

Before creating one, we will first check which keyspaces are already present in Cassandra. For this we can use the DESC command to get the list of keyspaces.

The steps are:
1)      DESC lists the keyspaces present in Cassandra. If we look, there are 5 keyspaces present as of now.
2)      Create the keyspace Sample_Cassandra with replication options. In the replication options we have to mention the class name and the replication factor.
3)      Check whether Sample_Cassandra was created or not using the DESC command once again (see the sketch below).
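A sketch of these commands in cqlsh (the replication class and factor shown are assumptions; adjust them to your cluster):

    -- step 1: list the keyspaces currently present
    DESC KEYSPACES;

    -- step 2: create the keyspace with a class name and replication factor
    CREATE KEYSPACE Sample_Cassandra
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

    -- step 3: confirm the new keyspace appears in the list
    DESC KEYSPACES;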



Altering a Keyspace:

We can alter an already existing keyspace with the ALTER command.
In this example we are altering the Sample_Cassandra keyspace that we created in the step above.

Here we are altering the schema with a replication factor of '3'.
We can then check the modified schema by looking at the keyspace information, as mentioned below.
Very similar to SQL, Cassandra supports its own query language.

To see the keyspace information we can execute a command like the following.
Query: Select * from system.schema_keyspaces;
This query finds the schema details of the keyspaces present in Cassandra. In total it returns 6 rows, each representing one keyspace.
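A sketch of the two commands (SimpleStrategy is an assumption, as above; system.schema_keyspaces exists in Cassandra 2.x):

    -- raise the replication factor of the keyspace to 3
    ALTER KEYSPACE Sample_Cassandra
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

    -- inspect the schema details of all keyspaces
    SELECT * FROM system.schema_keyspaces;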

Dropping a Keyspace:

We can drop a keyspace present in Cassandra using the DROP command.

We will observe the following steps:
1)      Drop the keyspace Sample_Cassandra.
2)      Check whether the schema was dropped or not using the DESC command, as sketched below.
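For example:

    -- step 1: drop the keyspace
    DROP KEYSPACE Sample_Cassandra;

    -- step 2: verify it no longer appears in the list
    DESC KEYSPACES;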



In the next article we will see table operations in Cassandra.

Thursday 6 August 2015

Social Media Analytics using R + Hadoop

Social Media Analytics Using R + Hadoop (RHadoop):
This article is about the idea of doing analytics using RHadoop. In domains like biomedical research, analysis in educational institutions, and statistical computing, we use R to find different patterns, do predictive analysis, and draw more insights from the data. If the data is limited and its usage is nominal, then we can do those analyses with R alone. But think of scenarios where the data is huge, in terms of petabytes.
I am plotting a diagram here to show how R fits together with Hadoop and social media analytics.
Fig (1): RHadoop with social media analytics

RHadoop Setup and Installation:-
--> Set up R on your system, the latest version being R 3.1.3, with the required packages we will work with. Check this for installation:
Refer --> http://cran.r-project.org/bin/windows/base/

--> Set up a Hadoop system as a single-node or multi-node cluster.
Refer --> http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

RHadoop Use Cases

--> Coming to the use cases of RHadoop, it is present in two ways: one with streamed data (like social media sites and news feeds from different sources) and one with data that resides in standard traditional or NoSQL DBs (like MongoDB).

For social media analytics using RHadoop we have the following setup:
--> A Hadoop setup with R running on it
--> APIs to connect with different social media such as LinkedIn, Facebook, and Twitter
--> Packages that must be loaded in R (ROAuth, twitteR, RLinkedin, RCurl)

A key use case for streaming data looks like this:
R <------> Twitter: connecting with Twitter, fetching tweets, and slicing and dicing the fetched data.
R <------> LinkedIn: connecting with LinkedIn, getting data, and slicing and dicing it.
We can do the same with Facebook and Instagram.
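A minimal R sketch of the Twitter case using the ROAuth/twitteR packages mentioned above (the credentials are placeholders obtained from Twitter's developer portal, and the search term is just an example):

    # load the packages for OAuth and the Twitter API
    library(ROAuth)
    library(twitteR)

    # authenticate with placeholder credentials (assumptions)
    setup_twitter_oauth(consumer_key    = "YOUR_KEY",
                        consumer_secret = "YOUR_SECRET",
                        access_token    = "YOUR_TOKEN",
                        access_secret   = "YOUR_TOKEN_SECRET")

    # fetch recent tweets on a topic and flatten them into a data frame
    tweets <- searchTwitter("big data", n = 100)
    df     <- twListToDF(tweets)

    # slice and dice: count tweets per user
    head(sort(table(df$screenName), decreasing = TRUE))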

The second use case looks like this:
R <------> MongoDB: fetching documents, applying logic to the fetched documents, and performing the analytics.
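A sketch of the MongoDB case using the mongolite package (the database, collection, field names, and query are all assumptions for illustration):

    # connect to a local MongoDB instance
    library(mongolite)
    con <- mongo(collection = "tweets", db = "social", url = "mongodb://localhost")

    # fetch documents matching a query into a data frame
    docs <- con$find('{"lang": "en"}')

    # apply logic to the fetched documents, e.g. total retweets per user
    aggregate(docs$retweetCount, by = list(user = docs$screenName), FUN = sum)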

As of now there is no parallel, distributed processing support in R as a standalone.
But some distributions come with parallel, distributed processing support.


Wednesday 5 August 2015

Big Data Core Indicators and Key Aspects

Big Data Core Indicators:

As we all talk about big data, the core indicators that come into the picture are the four V's:
Volume, Velocity, Variety, and Veracity.
These V's are going to define big data and its future. Technically, big data comes into the picture whenever an organization or company has to deal with any of these V's at scale.

Key Aspects of a Big Data Platform

1. Integration -- The point is to have one platform to manage all of the data. Big data has to be bigger than just one technology.
2. Analytics -- A very important point. We see big data as a viable place to analyze and store data; the sophistication and accuracy of the analytics matter.
3. Visualization -- Need to bring big data to the users.
4. Development -- Need sophisticated development tools for the engines, and across them, to enable the market to develop analytic applications.
5. Workload optimization -- Improvements upon open source for efficient processing and storage.
6. Security and Governance -- Sensitive data needs to be protected, and retention policies need to be determined.


As technology advances day by day, the amount of data involved in business requirements keeps increasing. So big data analytics and solutions provide better, enhanced ways to solve business problems in different industry verticals.

Big Data at a Glance

The big data ecosystem can be confusing. The popularity of "big data" as an industry buzzword has created a broad category. As Hadoop steamrolls through the industry, solutions from the business intelligence and data warehousing fields are also attracting the big data label. To confuse matters, Hadoop-based solutions such as Hive are at the same time evolving toward being a competitive data warehousing solution.
Understanding the nature of your big data problem is a helpful first step in evaluating potential solutions. Let’s remind ourselves of Big Data.

“Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it."

Big data problems vary in how heavily they weigh in on the axes of volume, velocity and variability. Predominantly structured yet large data, for example, may be most suited to an analytical database approach.                                                  



This survey makes the assumption that a data warehousing solution alone is not the answer to your problems, and concentrates on analyzing the commercial Hadoop ecosystem. We’ll focus on the solutions that incorporate storage and data processing, excluding those products which only sit above those layers, such as the visualization or analytical workbench software.

Getting started with Hadoop doesn't require a large investment, as the software is open source and is also available instantly through the Amazon Web Services cloud. But for production environments, support, professional services, and training are often required.

Monday 26 January 2015

Pentaho with Real Time Data Analytics

Pentaho with Real-Time Data Analytics:

The importance of big data is well recognized today, with implementations across every size and type of business. What has become apparent is that the real value of big data is not the data in and of itself, but in the combination of that data with other relevant data from existing internal and external systems and sources. The need to blend data to derive maximum value will only escalate as new types and sources of data and information continue to emerge.

With Pentaho BI Analytics, you can easily create architected, blended views across both the traditional Call Detail Records in the warehouse and the network data. Just-in-time, architected blending delivers accurate big data analytics based on blended data. You can connect to, combine, and even transform data from any of the multiple data stores in your hybrid data ecosystem into blended views, then query the data directly via that view using the full spectrum of analytics in the Pentaho Analytics platform, including predictive analytics.


Examining a typical big data analytics process workflow helps identify where many of these potential problems may occur, where special skill sets are required, and where delays are introduced.

Common steps in the big data analytics workflow include data ingestion, manipulation, access, modeling, and visualization.