Apache Hadoop: Built for big data, insights, and innovation
Posted by Thang Le Toan on 17 August 2018 10:55 PM

An open source platform for the distributed processing of structured, semi-structured and unstructured data, and the foundation of the IBM and Hortonworks enterprise-grade distribution.

What is Apache Hadoop®?

Apache Hadoop offers highly reliable, scalable, distributed processing of large data sets using simple programming models. With the ability to be built on clusters of commodity computers, Hadoop provides a cost-effective solution for storing and processing structured, semi- and unstructured data with no format requirements.
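The "simple programming models" mentioned above are exemplified by MapReduce. As a minimal illustration (not part of the IBM/Hortonworks distribution itself), a word-count job can be expressed as a mapper and a reducer in the style of Hadoop Streaming; the helper below simulates the framework's shuffle/sort phase locally so the idea can be seen end to end:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Sum all the counts emitted for a single word."""
    return (word, sum(counts))

def run_job(lines):
    """Simulate map -> shuffle/sort -> reduce on a local list of lines."""
    mapped = [pair for line in lines for pair in mapper(line)]
    mapped.sort(key=itemgetter(0))  # the framework's shuffle/sort phase
    return dict(reducer(w, (c for _, c in grp))
                for w, grp in groupby(mapped, key=itemgetter(0)))
```

In a real cluster, Hadoop runs the mapper and reducer on many nodes in parallel and handles the shuffle itself; the programmer only writes the two small functions.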

Key Big Data Use Cases for Hadoop

  1. New data formats – Utilize new forms of semi-structured and unstructured data such as streaming audio and video, social media, sentiment and clickstream data that can’t be ingested into the Enterprise Data Warehouse (EDW). This data can support more accurate analytic decisions in response to today’s new technologies such as the Internet of Things (IoT), artificial intelligence (AI), cloud and mobile.
  2. Data lake analytics: Provide a platform for real-time, self-service access and advanced analytics for data users such as data scientists, line-of-business (LOB) owners and developers. The Hadoop-based data lake is the future of data science, an interdisciplinary field that combines machine learning, statistics, advanced analysis and programming.
  3. Data offload and consolidation: Optimize your Enterprise Data Warehouse (EDW) and streamline costs by moving “cold,” infrequently used data to a Hadoop-based data lake. Consolidating siloed data into the data lake decreases costs, increases accessibility and drives better, more accurate decisions.
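Use case 1 hinges on tolerating data "with no format requirements." As a minimal sketch (the field names `user`, `page` and `ts` are illustrative, not a standard schema), clickstream events arriving as JSON lines can be normalized for landing in a data lake while dirty records are counted rather than rejected:

```python
import json

def parse_clickstream(raw_lines):
    """Parse JSON-lines clickstream events, skipping malformed records.

    Field names (user, page, ts) are hypothetical examples.
    """
    events, bad = [], 0
    for line in raw_lines:
        try:
            rec = json.loads(line)
            events.append({"user": rec["user"],
                           "page": rec.get("page", "unknown"),
                           "ts": rec["ts"]})
        except (json.JSONDecodeError, KeyError):
            bad += 1  # semi-structured data: tolerate dirty records
    return events, bad
```

Unlike an EDW load, nothing is dropped on schema mismatch; the lake keeps the raw lines and this kind of parsing is applied on read.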

Learn more about Big Data


100% Open Source

The IBM and Hortonworks partnership provides an integrated, open source Hadoop-based platform with the tools needed for advanced analytic workloads. Both companies are members of the Open Data Platform Initiative (ODPi), a multi-vendor standards association focused on advancing the adoption of Hadoop.

Enterprise grade distribution

The combination of the Hortonworks platform with IBM Db2® Big SQL offers the benefits of Hadoop with added security, governance and machine learning capabilities. Db2 Big SQL is the first SQL-on-Hadoop solution that understands commonly used SQL syntax from other vendors and products such as Oracle, IBM Db2 and IBM Netezza®.

IBM and Hortonworks, better together

Build, govern, secure and quickly gain valuable analytic insights from your data using a single ecosystem of products and services. Benefit from combined collaboration and investment in the open source community, while removing concerns about connectivity and stability.

In the spotlight

Get started with Apache Hadoop®

IBM, in partnership with Hortonworks, offers the Hortonworks Data Platform (HDP), a secure, enterprise-ready open source Hadoop distribution based on a centralized architecture. Used with IBM Db2 Big SQL, HDP addresses a range of data-at-rest and data-in-motion use cases, provides data federation across the organization, powers real-time customer applications, and delivers robust analytics that accelerate decision-making.

[Image: screen capture of getting started with Hadoop]

Accelerate big data collection and dataflow management

Hortonworks DataFlow (HDF) for IBM, powered by Apache NiFi, is the first integrated platform that solves the challenges of collecting and transporting data from a multitude of sources. HDF for IBM enables simple, fast data acquisition, secure data transport, prioritized data flow and clear traceability of data from the edge of your network to the core data center. It uses a combination of an intuitive visual interface, a high-fidelity access and authorization mechanism and an always-on chain of custody (data provenance) framework.
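Two of the ideas named above, prioritized data flow and a chain of custody, can be seen together in a toy model. This is only a conceptual sketch in Python; the real HDF/NiFi platform implements them with visual flow processors and a dedicated provenance repository, and all names below are made up for illustration:

```python
import heapq
import itertools

def transport(events):
    """Toy model of prioritized delivery with a provenance trail.

    Each event is a dict with hypothetical keys: id, priority, source.
    Higher priority is delivered first; ties keep arrival order.
    """
    counter = itertools.count()  # tie-breaker preserving arrival order
    heap = [(-e["priority"], next(counter), e) for e in events]
    heapq.heapify(heap)
    delivered, provenance = [], []
    while heap:
        _, _, e = heapq.heappop(heap)
        delivered.append(e["id"])
        # chain of custody: record where the event came from and what happened
        provenance.append((e["id"], e["source"], "delivered"))
    return delivered, provenance
```

The point of the sketch is the pairing: every routing decision leaves a traceable record, which is what "always-on chain of custody" means in practice.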

[Image: flow chart of Hortonworks DataFlow (HDF) for IBM]

Accelerated and Stable Apache Hadoop®

The best way to move forward with Hadoop is to choose an installation package that simplifies interoperability. The Open Data Platform Initiative (ODPi) is a multi-vendor standards association focused on advancing the adoption of Hadoop in the enterprise by promoting the interoperability of big data tools. ODPi simplifies and standardizes the Apache Hadoop big data ecosystem with a common reference specification called the ODPi Core.


Db2 Big SQL

Db2 Big SQL lets you access, query, and summarize data from any platform including databases, data warehouses, NoSQL databases, and more. Db2 Big SQL can concurrently exploit Apache Hive, HBase and Spark using a single database connection—even a single query.
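As a conceptual sketch of what "a single query" across engines means, the snippet below composes the kind of ordinary ANSI join that Db2 Big SQL can run even when the two tables live in different stores (say, one in Hive and one in HBase). The table and column names are hypothetical; the point is that the SQL itself carries no hint of where each table lives:

```python
def federated_join(fact_table, dim_table, key):
    """Build a single cross-source join statement.

    fact_table and dim_table may be backed by different engines
    (e.g. Hive and HBase); the SQL is the same either way.
    """
    return (f"SELECT d.*, f.amount "
            f"FROM {fact_table} f "
            f"JOIN {dim_table} d ON f.{key} = d.{key}")

sql = federated_join("hive_sales", "hbase_customers", "cust_id")
```

In Big SQL the statement would be submitted over one database connection, with the engine deciding how to reach each underlying store.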

Hortonworks on Power

Use Hortonworks Data Platform on IBM Power Systems™ to increase efficiency, maximize performance and accelerate insights.

IBM Big Replicate

This active-transactional replication technology delivers continuous availability, streaming backup, uninterrupted migration, hybrid cloud and burst-to-cloud, and data consistency across clusters any distance apart.


Db2 Big SQL demos

Explore several Db2 Big SQL demos that walk through business benefits and core features, integration with Data Server Manager for creating federated connections to Db2 Warehouse on Cloud, and how to integrate with IBM Cognos® Analytics to create dashboards and reports.

Connect more data from more sources with a data lake

Data lakes are gaining prominence as businesses incorporate more unstructured data and look to generate insights from real-time ad hoc queries and analysis. Learn more about the new types of data and sources that can be leveraged by integrating data lakes into your existing data management.

eBook: Build a better data lake

Discover best practices to follow and the potential pitfalls to avoid when integrating a data lake in your existing data infrastructure. Learn how enterprise-grade security and governance can allow any business to leverage a growing diversity of data to drive innovation across the organization.


SQL Server Parallel Data Warehouse (PDW)
Posted by Thang Le Toan on 16 September 2015 03:28 PM

Fari Payandeh




Unlike Hadoop and NoSQL databases, MPP is not a new technology. Yet it is a strong contender in the “Big Data” space. The SQL Server PDW appliance boasts up to 100x performance gains over legacy data warehouses. Moreover, it is a fault-tolerant, horizontally scalable, high-capacity RDBMS. Simply put, it is an excellent solution for companies that are heavily invested in the RDBMS but need to break free of its constraints. We will narrow down our comparison by juxtaposing SQL Server SMP with SQL Server MPP.


Performance (Velocity)
Scalability: Throwing more hardware at SQL Server SMP eventually hits a point of diminishing returns as data sets grow. By contrast, the SQL Server MPP architecture is horizontally scalable, and performance grows linearly as more nodes (physical servers) are added to the appliance, with up to 100x performance gains over legacy data warehouses.
CPU utilization: A database task in SQL Server SMP is bound to a single CPU, whereas a task runs on multiple CPUs in SQL Server MPP.
Resource sharing: SQL Server MPP has a shared-nothing architecture, which lets each node dedicate its resources to processing queries, avoiding the resource contention and I/O bottlenecks caused by resource sharing.
Distributed queries: Query execution time is reduced significantly. Each query is broken into pieces and fed to different nodes, enabling parallel processing.
Data distribution: SQL Server MPP automatically distributes the data among the nodes. Each node processes its own data set before sending its output to the control node, which in turn merges the results.
Parallel load: Data is automatically loaded in parallel.
In-memory operations and columnar data store: Both SQL Server SMP and SQL Server MPP support in-memory operations and columnar data stores.
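The distributed-query and data-distribution points above can be sketched in a few lines. This is a simplified model, not PDW's actual implementation: rows are hash-distributed across nodes, each node aggregates only its own slice (shared nothing), and the control node merges the partial results:

```python
from collections import Counter

def mpp_count(rows, n_nodes=4):
    """Sketch of MPP-style execution for a COUNT-per-value query."""
    # Distribution phase: each row lands on exactly one node.
    shards = [[] for _ in range(n_nodes)]
    for row in rows:
        shards[hash(row) % n_nodes].append(row)
    # Parallel phase: each node aggregates its own shard
    # (run serially here; an appliance runs these concurrently).
    partials = [Counter(shard) for shard in shards]
    # Control node merges the partial results.
    total = Counter()
    for partial in partials:
        total += partial
    return dict(total)
```

Because no node ever touches another node's shard, adding nodes shrinks each shard and the per-node work, which is where the near-linear scaling comes from.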

Capacity (Volume)
SQL Server SMP can handle a few terabytes, whereas SQL Server MPP can linearly scale out to 6 petabytes.

High Availability
SQL Server MPP is fault tolerant. Redundancy is applied to all hardware and software components of the appliance. Moreover, the appliance runs on Microsoft Hyper-V, which gives the nodes failover capabilities.

Analytics Capabilities (Variety and Variability)
PDW is part of Microsoft Analytics Platform System, which supports connectivity and query access to Hadoop and unstructured data via PolyBase data querying technology.

Maintenance
SQL Server SMP databases are simpler and less costly to maintain.
SQL Server MPP: Upgrades are not seamless and may require downtime. Patches are applied by Microsoft.




The Best Of Open Source For Big Data
Posted by Thang Le Toan on 16 September 2015 03:22 PM

Originally posted on Data Science Central

It was not easy to select just a few of the many open source projects. My objective was to choose the ones that best fit Big Data’s needs. What has changed in the world of open source is that the big players have become stakeholders: IBM’s alliance with Cloud Foundry, Microsoft providing a development platform for Hadoop, Dell’s OpenStack-powered cloud solution, VMware and EMC partnering on cloud, and Oracle releasing its NoSQL database as open source.

“If you can’t beat them, join them”. History has vindicated the Open Source visionaries and advocates.

Hadoop Distributions


Cloud Operating System

Cloud Foundry -- By VMware

OpenStack -- Worldwide participation and well-known companies


Fusion-io -- Not open source, but very supportive of open source projects; flash-aware applications.

Development Platforms and Tools

REEF -- Microsoft's Hadoop development platform

Lingual -- By Concurrent

Pattern -- By Concurrent

Python -- Awesome programming language

Mahout -- Machine learning library for Hadoop

Impala -- SQL query engine by Cloudera

R -- MVP among statistical tools

Storm -- Stream processing by Twitter

LucidWorks -- Search, based on Apache Solr

Giraph -- Graph processing, used at Facebook

NoSql Databases

MongoDB, Cassandra, HBase

Sql Databases

MySQL -- Owned by Oracle

MariaDB -- Partnered with SkySQL

PostgreSQL -- Object Relational Database

TokuDB -- Storage engine that improves RDBMS performance

Server Operating Systems

Red Hat -- The de facto OS for Hadoop servers

BI, Data Integration, and Analytics




See Big Data Studio



