Monday, 27 July 2015

HADOOP : WORD COUNT PROBLEM



WHAT IS WORD COUNT
      Word count is a typical problem that runs on the Hadoop Distributed File System using MapReduce; it is intended to count the number of occurrences of each word in a provided input file. The word count operation takes place in two phases:
1.      Mapper phase: In this phase the text is first tokenized into words, and then we form a key/value pair from each word, the key being the word itself and the value ‘1’. The mapper executes over the entire data set, splitting the text into words and forming the initial key/value pairs. Only after this entire process is completed do the reducers start.
2.      Reducer phase: In the reduce phase the keys are grouped together and the values for identical keys are added up. This gives the number of occurrences of each word in the input file; in effect, the reducer is an aggregation phase over the keys.

MAPREDUCE ALGORITHM
1.      MapReduce is a programming model designed to compute large volumes of data in a parallel fashion.
2.      The map operation, written by the user, takes a set of input key/value pairs and produces a set of intermediate key/value pairs.
3.      The reduce function, also written by the user, accepts an intermediate key I and the set of values for that key, and merges these values together to form a possibly smaller set of values.
4.      MapReduce operations are carried out in Hadoop.
5.      Hadoop is a distributed sorting engine.
  
Map and reduce in the word count problem (pseudocode):
                             mapper(filename, file-contents):
                                 for each word in file-contents:
                                     emit(word, 1)

                             reducer(word, values):
                                 sum = 0
                                 for each value in values:
                                     sum = sum + value
                                 emit(word, sum)
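
The same algorithm in runnable form, using the Hadoop Java MapReduce API. This is a minimal sketch modelled on the standard Hadoop word count example; the class names TokenizerMapper and IntSumReducer follow that example.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper phase: tokenize each input line and emit a (word, 1) pair per word.
public class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit(word, 1)
        }
    }
}

// Reducer phase: sum the values grouped under each word.
class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();            // sum = sum + value
        }
        result.set(sum);
        context.write(key, result);      // emit(word, sum)
    }
}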

    DATA FLOW DIAGRAM




METHODOLOGY

MapReduce is a three-step approach to solving a problem:
Step 1: Map
The purpose of the map step is to group or divide data into sets based on the desired values. When writing a map function we need to be careful about three things:
      1. How do we want to divide or group the data?
      2. Which part of the data do we need, and which part is extraneous?
      3. In what form or structure do we need our data?
 Step 2: Reduce
     The reduce operation combines the different values for each given key using a user-defined function. It takes up each key, picks up all the values produced for it in the map step, and processes them one by one using custom logic. It takes two parameters:
1.    Key
2.    Array of values
 Step 3: Finalize
     The finalize step performs any required transformation on the final output of the reduce.
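
To make the three steps concrete outside Hadoop, here is a minimal sketch in plain Java: the stream splits the text into words (map), counts the occurrences grouped under each word (reduce), and prints the result in its final form (finalize). The input string and class name are illustrative.

import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class ThreeStepWordCount {
    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog the end";

        // Step 1 (Map): divide the input into words, grouping the data by word.
        // Step 2 (Reduce): count the values collected under each key.
        Map<String, Long> counts = Arrays.stream(text.toLowerCase().split("\\s+"))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        // Step 3 (Finalize): transform the reduced output into its final form.
        counts.forEach((word, n) -> System.out.println(word + " : " + n));
    }
}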

FUTURE SCOPE
Because of its parallel processing, MapReduce is used in many applications, such as document clustering, web link-graph reversal and inverted index construction. It increases the efficiency of handling big data, and it is used where data needs to be available at all times and security is required. Some well-known MapReduce-related deployments are:
·     Yahoo!: The WebMap application uses Hadoop to create a database of information on all known webpages.
·    Facebook: Hadoop powers Hive, Facebook's data warehouse.
·     Rackspace: It analyzes server log files and usage data using Hadoop.






HADOOP: OVERVIEW

Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Hadoop Architecture
Hadoop framework includes following four modules:
·        Hadoop Common: These are Java libraries and utilities required by other Hadoop modules. These libraries provide file-system and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.
·        Hadoop YARN: This is a framework for job scheduling and cluster resource management.
·        Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
·        Hadoop MapReduce: This is YARN-based system for parallel processing of large data sets.

We can use the following diagram to depict these four components of the Hadoop framework.

                                                                  Fig: Hadoop architecture.

MapReduce

Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
The term MapReduce actually refers to the following two different tasks that Hadoop programs perform:
  • The Map Task: This is the first task, which takes input data and converts it into a set of data, where individual elements are broken down into tuples (key/value pairs).
  • The Reduce Task: This task takes the output from a map task as input and combines those data tuples into a smaller set of tuples. The reduce task is always performed after the map task.
Typically both the input and the output are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for resource management, tracking resource consumption/availability, and scheduling the job's component tasks on the slaves, monitoring them and re-executing failed tasks. The slave TaskTrackers execute the tasks as directed by the master and provide task-status information to the master periodically.
The JobTracker is a single point of failure for the Hadoop MapReduce service, which means that if the JobTracker goes down, all running jobs are halted.

Hadoop Distributed File System

Hadoop can work directly with any mountable distributed file system such as Local FS, HFTP FS, S3 FS, and others, but the most common file system used by Hadoop is the Hadoop Distributed File System (HDFS).
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is designed to run on large clusters (thousands of computers) of small computer machines in a reliable, fault-tolerant manner.
HDFS uses a master/slave architecture where master consists of a single NameNode that manages the file system metadata and one or more slave DataNodes that store the actual data.
A file in an HDFS namespace is split into several blocks and those blocks are stored in a set of DataNodes. The NameNode determines the mapping of blocks to the DataNodes. The DataNodes take care of read and write operations with the file system. They also take care of block creation, deletion and replication based on instructions given by the NameNode.
HDFS provides a shell like any other file system, and a list of commands is available to interact with the file system. These shell commands will be covered in a separate post along with appropriate examples.
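
Applications can also talk to HDFS programmatically through the Hadoop FileSystem Java API. Below is a minimal sketch that uploads a local file and lists a directory; the NameNode address and paths are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; normally read from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS (paths are illustrative).
        fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                             new Path("/user/demo/input.txt"));

        // List the directory, much like 'hdfs dfs -ls /user/demo' in the shell.
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}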

How Does Hadoop Work?

Stage 1

A user/application can submit a job to Hadoop (via a Hadoop job client) for the required process by specifying the following items (a driver sketch follows the list):
  1. The location of the input and output files in the distributed file system.
  2. The Java classes, in the form of a JAR file, containing the implementation of the map and reduce functions.
  3. The job configuration by setting different parameters specific to the job.
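
A minimal driver sketch covering these three items, reusing the TokenizerMapper and IntSumReducer classes from the word count post above (input and output paths are passed on the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        // Item 2: the classes (packaged into a JAR) implementing map and reduce.
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);

        // Item 3: job configuration parameters, such as the output key/value types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Item 1: input and output locations in the distributed file system.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}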

Stage 2

The Hadoop job client then submits the job (JAR/executable etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks, monitoring them, and providing status and diagnostic information to the job client.

Stage 3

The TaskTrackers on different nodes execute the tasks per the MapReduce implementation, and the output of the reduce function is stored in output files on the file system.

Monday, 20 July 2015

GREEN CLOUD COMPUTING VERSUS MOBILE CLOUD COMPUTING


Computing means any goal-oriented activity requiring, benefiting from, or creating computers. Thus, computing includes designing and building hardware and software systems for a wide range of purposes; processing, structuring, and managing various kinds of information; doing scientific studies using computers; making computer systems behave intelligently; creating and using communications and entertainment media; finding and gathering information relevant to any particular purpose.

GREEN COMPUTING

Green computing is the environmentally responsible and eco-friendly use of computers and their resources. In broader terms, it is also defined as the study of designing, manufacturing/engineering, using and disposing of computing devices in a way that reduces their environmental impact. Many IT manufacturers and vendors are continuously investing in designing energy efficient computing devices, reducing the use of dangerous materials and encouraging the recyclability of digital devices and paper. Green computing is also known as green information technology (green IT). Green computing, or green IT, aims to attain economic viability and improve the way computing devices are used. Green IT practices include the development of environmentally sustainable production practices, energy efficient computers and improved disposal and recycling procedures.

 MOBILE COMPUTING

Mobile computing is human–computer interaction by which a computer is expected to be transported during normal usage. Mobile computing involves mobile communication, mobile hardware, and mobile software. Communication issues include ad-hoc and infrastructure networks as well as communication properties, protocols, data formats and concrete technologies. Hardware includes mobile devices or device components. Mobile software deals with the characteristics and requirements of mobile applications. Thus, mobile computing is the ability to use computing capability without a pre-defined location and/or connection to a network, to publish and/or subscribe to information. The purpose of this post is to compare green cloud computing with mobile cloud computing, examine their security issues, and identify the security solutions they have in common.

 GREEN CLOUD COMPUTING

Green cloud is a buzzword that refers to the potential environmental benefits that information technology (IT) services delivered over the Internet can offer society. The term combines the words green -- meaning environmentally friendly -- and cloud, the traditional symbol for the Internet and the shortened name for a type of service delivery model known as cloud computing.

Benefits of Green Cloud Computing  
·         Reduced Cost
·         Automatic Updates
·         Green Benefits of Cloud computing
·         Remote Access
·         Disaster Relief
·         Self-service provisioning
·         Scalability
·         Reliability and fault-tolerance
·         Ease of Use
·         Skills and Proficiency
·         Response Time
·         Increased Storage
·         Mobility

 Security Issues in Green cloud computing
The chief concern in cloud environments is to provide security around multi-tenancy and isolation, giving customers more comfort than the mere “trust us” assurances of cloud providers. Survey works have been reported that classify security threats in the cloud based on the nature of the service delivery models of a cloud computing system. However, security requires a holistic approach: the service delivery model is only one of many aspects that need to be considered for a comprehensive survey of cloud security. Security at different levels, such as the network level, host level and application level, is necessary to keep the cloud up and running continuously. In accordance with these different levels, various types of security breaches may occur.
Four types of issues arise while discussing the security of a cloud:
·         Data Issues
·         Privacy issues
·         Infected Application
·         Security issues

Solution to security issues in Green Cloud Computing

 1) Control the consumer access devices: Be sure the consumer's access devices or points, such as personal computers, virtual terminals, tablets and mobile phones, are secure enough. The loss of an endpoint access device, or access to the device by an unauthorized user, can undermine even the best security protocols in the cloud. Be sure the user computing devices are managed properly, secured against malware, and support advanced authentication features.

2) Monitor the data access: Cloud service providers have to give assurance about who accesses the data, when, and for what purpose. For example, many websites or servers have had security complaints regarding snooping activities by many people, such as listening to voice calls, reading emails and personal data, etc.

3) Share demanded records and verify data deletion: If the user or consumer needs to report on its compliance, the cloud service provider should share diagrams or any other information, or provide audit records, to the consumer or user. Also verify the proper deletion of data from shared or reused media. Many providers do not provide for the proper degaussing of data from drives each time the drive space is abandoned. Insist on a secure deletion process and have that process written into the contract.

4) Security check events: Ensure that the cloud service provider gives enough detail about the fulfillment of promises, breach remediation and reporting contingencies. These security events describe the responsibility, promises and actions of the cloud computing service provider.


MOBILE CLOUD COMPUTING

Mobile cloud computing is the combination of cloud computing and mobile networks to bring benefits to mobile users, network operators, and cloud providers. Cloud computing exists when tasks and data are kept on the Internet rather than on individual devices, providing on-demand access. Mobile apps may use the cloud for both app development and hosting. A number of unique characteristics of hosted apps make the mobile cloud different from regular cloud computing. Mobile apps may be more reliant on the cloud to provide much of their computing, storage, and communication fault tolerance than regular cloud applications are.

Benefits of Mobile Cloud Computing
·         Extending battery lifetime
·         Improving data storage capacity and processing power
·         Improving reliability

Security Issues in Mobile cloud Computing

Cloud computing, as opposed to standard computing, has several issues which can cause reluctance or fear in the user base. These issues include concerns about privacy, data ownership and security, some of which are especially relevant to mobile devices. This section discusses these issues, including both incidents involving them and techniques used to combat them.
·         Privacy
·         Data Ownership
·         Data Access and Security

Solution to Security issues in Mobile Cloud computing
Individuals and enterprises take advantage of the cloud's benefits by storing large amounts of data or applications on it. However, issues in terms of integrity, authentication, and digital rights must be taken care of:

1) Integrity: Every mobile cloud user must ensure the integrity of their information stored on the cloud. Every access they make must be authenticated and verified. Different approaches to preserving the integrity of information stored on the cloud have been proposed (a simple checksum sketch follows this list).

2) Authentication: Different authentication mechanisms have been presented and proposed for securing data access in mobile cloud environments. Some use open standards and even support the integration of various authentication methods: for example, the use of access or log-in IDs, passwords or PINs, authentication requests, etc.


3) Digital rights management: Illegal distribution and piracy of digital content such as video, images, audio, e-books and programs is becoming more and more common. Solutions to protect such content from illegal access have been implemented, such as the provision of encryption and decryption keys. Encoding or decoding must be done before any mobile user can access such digital content.
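
As a deliberately simple illustration of the integrity point (1) above, a client can record a cryptographic digest of a file before uploading it to the cloud and compare it with the digest of the downloaded copy. The sketch below uses the standard Java MessageDigest API; the file names are illustrative, and real integrity-preserving schemes for cloud storage are considerably more elaborate.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class IntegrityCheck {
    // Compute the SHA-256 digest of a file's contents as a hex string.
    static String sha256(String path) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(path));
        byte[] hash = MessageDigest.getInstance("SHA-256").digest(data);
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Digest recorded before upload (file names are illustrative).
        String before = sha256("report.pdf");
        // ... upload to the cloud, then later download it again ...
        String after = sha256("report-downloaded.pdf");
        System.out.println(before.equals(after)
                ? "Integrity verified" : "File was modified!");
    }
}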

Sunday, 19 July 2015

iCanCloud: Cloud Computing Simulator


iCanCloud is a simulation platform aimed at modelling and simulating cloud computing systems, targeted at users who deal closely with these kinds of systems. The main objective of iCanCloud is to predict the trade-offs between cost and performance of a given set of applications executed on specific hardware, and then provide users with useful information about such costs. That said, iCanCloud can be used by a wide range of users, from basic active users to developers of large distributed applications.


Features


The most remarkable features of the iCanCloud simulation platform include the following:

  • Both existing and non-existing cloud computing architectures can be modeled and simulated.
  • A flexible cloud hypervisor module provides an easy method for integrating and testing both new and existent cloud brokering policies.
  • iCanCloud provides methods for obtaining the energy consumption of each hardware component in cloud computing systems.
  • Users are able to design and model resource provisioning policies for cloud systems to balance the trade-offs between performance and energy consumption. Since energy consumption in large distributed systems is directly correlated with the management of resources, it is a major requirement to let users customize their own policies to analyze the impact of energy consumption on the overall system performance.
  • Customizable VMs can be used to quickly simulate uni-core/multi-core systems.
  • iCanCloud provides a wide range of configurations for storage systems, which include models for local storage systems, remote storage systems, like NFS, and parallel storage systems, like parallel file systems and RAID systems.
  • iCanCloud provides a user-friendly GUI to ease the generation and customization of large distributed models. This GUI is especially useful for: managing a repository of pre-configured VMs, managing a repository of pre-configured Cloud systems, managing a repository of pre-configured experiments, launching experiments from the GUI, and generating graphical reports.
  • iCanCloud provides a POSIX-based API and an adapted MPI library for modelling and simulating applications. Also, several methods for modelling applications can be used in iCanCloud: using traces of real applications; using a state graph; and programming new applications directly in the simulation platform.

Monday, 6 July 2015

OPENNEBULA: MANAGING HETEROGENEOUS DISTRIBUTED DATA CENTER INFRASTRUCTURES


OpenNebula is the result of many years of research and development in the efficient and scalable management of virtual machines on large-scale distributed infrastructures. Its innovative features have been developed to address the requirements of business use cases from leading companies in the context of flagship European projects in cloud computing. OpenNebula is being used as an open platform for innovation in several international projects to research the challenges that arise in cloud management, and also as a production-ready tool in both academia and industry to manage clouds.

As virtualization technologies mature at an incredibly rapid pace, there is a growing interest in applying them to the data-centre. After the success of cloud computing, companies are seeking reliable and efficient technologies to transform their rigid infrastructure into a flexible and agile provisioning platform. These so-called private clouds allow you to provide IT services with an elastic capacity, obtained from your local resources in the form of Virtual Machines (VM). Local resources can be further combined with public clouds in a hybrid cloud computing setup, thus enabling highly scalable hosting environments.

The main component involved in implementing this provisioning scheme is the Cloud Management Tool, which is responsible for the secure, efficient and scalable management of cloud resources. A Cloud Management Tool provides IT staff with a uniform management layer across distributed hypervisors and cloud providers, giving infrastructure users the impression of interacting with a single elastic cloud of seemingly infinite capacity.

Because no two data centres are the same, building clouds is about integration and orchestration of the underlying infrastructure systems, services and processes. The Cloud Management Tool should seamlessly integrate any existing security, virtualization, storage, and network solutions deployed in the data-centre. Moreover, the right design and configuration in the Cloud architecture depend not only on the underlying infrastructure but also on the execution requirements of the service workload. The capacity requirements of the virtual machines as well as their level of coupling determine the best hardware configuration for the networking, computing and storage subsystems.

                                                                  Fig: OpenNebula architecture.

OpenNebula is an open-source Cloud Management Tool that embraces this vision. Its open architecture, interfaces and components provide the flexibility and extensibility that many enterprise IT shops need for internal cloud adoption. These features also facilitate its integration with any product and service in the cloud and virtualization ecosystem, and with any management tool in the data centre. OpenNebula provides an abstraction layer independent of the underlying services for security, virtualization, networking and storage, avoiding vendor lock-in and enabling interoperability. OpenNebula is not only built on standards, but has also provided reference implementations of open community specifications, such as the OGF Open Cloud Computing Interface. This open and flexible approach to cloud management ensures the widest possible market and user acceptability, and simplifies adaptation to different environments.
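
To give a flavor of how this management layer is driven programmatically, the sketch below submits a VM from a template using OpenNebula's Java OCA (OpenNebula Cloud API) bindings. The credentials, endpoint and template values are assumptions for illustration; consult the OCA documentation for the exact setup of your frontend.

import org.opennebula.client.Client;
import org.opennebula.client.OneResponse;
import org.opennebula.client.vm.VirtualMachine;

public class LaunchVm {
    public static void main(String[] args) throws Exception {
        // Assumed credentials and the default XML-RPC endpoint of the frontend.
        Client oneClient = new Client("user:password",
                                      "http://localhost:2633/RPC2");

        // A minimal VM template (all values are illustrative).
        String template = "NAME = test-vm\n"
                        + "CPU = 1\n"
                        + "MEMORY = 512\n";

        OneResponse rc = VirtualMachine.allocate(oneClient, template);
        if (rc.isError()) {
            System.err.println("Allocation failed: " + rc.getErrorMessage());
        } else {
            System.out.println("Created VM with ID " + rc.getMessage());
        }
    }
}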
Features
  • Openness means you can run production-ready software that is fully open-source without proprietary extensions that lock you in. Yes, this means that OpenNebula does not need enterprise extensions. Yes, OpenNebula is not a limited version of an enterprise software… There is one and only one OpenNebula distribution, and it is truly open-source, Apache licensed, and enterprise-ready. There is no fragmentation.  
  • Simplicity means that you do not need an army of administrators to build and maintain your cloud. OpenNebula is a product, not a toolkit of components that you have to integrate to build something functional. Moreover, your cloud will run for years with little maintenance. 
  • Flexibility means that you can easily build a cloud to fit into your data center and policies. Because no two data centers are the same, we do not think there’s a one-size-fits-all in the cloud, and we do not try to impose requirements on data center infrastructure. We try to make cloud an evolution by leveraging existing IT infrastructure, protecting your investments, and avoiding vendor lock-in. 
  • Scalability means that you can easily grow the size of each zone and the number of zones. Some of our main users have reported infrastructures with tens of zones distributed worldwide that have executed several hundred thousand virtual machines. 




Saturday, 4 July 2015

OPENSTACK: THE OPEN SOURCE CLOUD OPERATING SYSTEM

OpenStack is a set of software tools for building and managing cloud computing platforms for public and private clouds. Backed by some of the biggest companies in software development and hosting, as well as thousands of individual community members, many think that OpenStack is the future of cloud computing. OpenStack is managed by the OpenStack Foundation, a non-profit which oversees both development and community-building around the project.
Introduction to OpenStack
OpenStack lets users deploy virtual machines and other instances which handle different tasks for managing a cloud environment on the fly. It makes horizontal scaling easy, which means that tasks which benefit from running concurrently can easily serve more or fewer users on the fly by just spinning up more instances. For example, a mobile application which needs to communicate with a remote server might be able to divide the work of communicating with each user across many different instances, all communicating with one another but scaling quickly and easily as the application gains more users.
And most importantly, OpenStack is open source software, which means that anyone who chooses to can access the source code, make any changes or modifications they need, and freely share these changes back out to the community at large. It also means that OpenStack has the benefit of thousands of developers all over the world working in tandem to develop the strongest, most robust, and most secure product that they can.

How is OpenStack used in a cloud environment?
The cloud is all about providing computing for end users in a remote environment, where the actual software runs as a service on reliable and scalable servers rather than on each end user's computer. Cloud computing can refer to a lot of different things, but typically the industry talks about running different items "as a service"—software, platforms, and infrastructure. OpenStack falls into the latter category and is considered Infrastructure as a Service (IaaS). Providing infrastructure means that OpenStack makes it easy for users to quickly add new instances, upon which other cloud components can run. Typically, the infrastructure then runs a "platform" upon which a developer can create the software applications that are delivered to the end users.
What are the components of OpenStack?
OpenStack is made up of many different moving parts. Because of its open nature, anyone can add additional components to OpenStack to help it meet their needs. But the OpenStack community has collaboratively identified nine key components that form the "core" of OpenStack; they are distributed as a part of any OpenStack system and officially maintained by the OpenStack community. A short programmatic example follows the list.
·         Nova is the primary computing engine behind OpenStack. It is used for deploying and managing large numbers of virtual machines and other instances to handle computing tasks.
·         Swift is a storage system for objects and files. Rather than the traditional idea of referring to files by their location on a disk drive, developers can instead refer to a unique identifier for the file or piece of information and let OpenStack decide where to store it. This makes scaling easy, as developers don’t have to worry about the capacity of a single system behind the software. It also allows the system, rather than the developer, to worry about how best to make sure that data is backed up in case of the failure of a machine or network connection.
·         Cinder is a block storage component, which is more analogous to the traditional notion of a computer being able to access specific locations on a disk drive. This more traditional way of accessing files might be important in scenarios in which data access speed is the most important consideration.
·         Neutron provides the networking capability for OpenStack. It helps to ensure that each of the components of an OpenStack deployment can communicate with one another quickly and efficiently.
·         Horizon is the dashboard behind OpenStack. It is the only graphical interface to OpenStack, so for users wanting to give OpenStack a try, this may be the first component they actually “see.” Developers can access all of the components of OpenStack individually through an application programming interface (API), but the dashboard gives system administrators a look at what is going on in the cloud and lets them manage it as needed.
·         Keystone provides identity services for OpenStack. It is essentially a central list of all of the users of the OpenStack cloud, mapped against all of the services provided by the cloud which they have permission to use. It provides multiple means of access, meaning developers can easily map their existing user access methods against Keystone.
·         Glance provides image services to OpenStack. In this case, "images" refers to images (or virtual copies) of hard disks. Glance allows these images to be used as templates when deploying new virtual machine instances.
·         Ceilometer provides telemetry services, which allow the cloud to provide billing services to individual users of the cloud. It also keeps a verifiable count of each user’s system usage of each of the various components of an OpenStack cloud. Think metering and usage reporting.
·         Heat is the orchestration component of OpenStack, which allows developers to store the requirements of a cloud application in a file that defines what resources are necessary for that application. In this way, it helps to manage the infrastructure needed for a cloud service to run.
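
To show how these components are consumed in practice, the sketch below authenticates against Keystone and asks Nova for the list of running instances. It uses openstack4j, a popular third-party Java SDK (the SDK choice, endpoint, credentials and project names are assumptions, not part of OpenStack itself).

import org.openstack4j.api.OSClient.OSClientV3;
import org.openstack4j.model.common.Identifier;
import org.openstack4j.model.compute.Server;
import org.openstack4j.openstack.OSFactory;

public class ListServers {
    public static void main(String[] args) {
        // Authenticate against Keystone (endpoint and credentials are illustrative).
        OSClientV3 os = OSFactory.builderV3()
                .endpoint("http://controller:5000/v3")
                .credentials("admin", "secret", Identifier.byName("Default"))
                .scopeToProject(Identifier.byName("demo"),
                                Identifier.byName("Default"))
                .authenticate();

        // Ask Nova for the virtual machine instances visible in the project.
        for (Server server : os.compute().servers().list()) {
            System.out.println(server.getName() + "  " + server.getStatus());
        }
    }
}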