ZFS on Linux lets admins correct errors in real time and use solid-state disks for data caching. They can install it from the command-line interface to gain these benefits.
ZFS is a file system that provides a way to store and manage large volumes of data, but you must manually install it.
ZFS on Linux does more than file organization, so its terminology differs from standard disk-related vocabulary. The file system collects data in pools. Vdevs, or virtual devices, make up each pool and provide redundancy if a physical device fails. You can store these pools on a single storage disk -- which is not a good idea if you encounter file corruption or if the drive fails -- or many disks.
Benefits of ZFS
It is free to install ZFS on Linux, and it provides robust storage with features such as:
on-the-fly error correction;
disk-level, enterprise-strength encryption;
transactional writes -- writing all or none of the data to ensure integrity;
use of solid-state disks to cache data; and
use of high-performance software rather than proprietary RAID hardware.
ZFS on Linux offers significant advantages over more traditional file systems such as ext, the journaling file system and Btrfs. With ZFS, it is easy to create a crash-consistent point-in-time snapshot that you can back up. ZFS can also support massive file sizes of up to 16 exabytes if the hardware meets performance requirements.
How to install ZFS
To install ZFS on a Debian- or Ubuntu-based Linux system, type sudo apt-get install zfsutils-linux -y into the command-line interface (CLI). This tutorial uses the zfsutils-linux package. The example shows how to create a new ZFS pool mirrored across two disks, but other ZFS disk configurations are also available.
Next, create the vdev disk container. This example adds two 20 GB disks. To identify the disks, use the sudo fdisk -l command. In this case, the two disks are /dev/sdb and /dev/sdc.
Now you can create the mirror setup with sudo zpool create mypool mirror /dev/sdb /dev/sdc.
Depending on the disk reader's setup when you install ZFS, you might get an error that states "/dev/sdb does not contain an extensible firmware interface label but it may contain partition information in the master boot record."
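To fix it, use the -f switch so the full command is sudo zpool create -f mypool mirror /dev/sdb /dev/sdc. Because -f overrides the check that produced the warning, use it only when you are sure the disks hold no data you need. If you are successful, you won't receive an output or error message.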
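The two forms of the zpool create command above differ only in the -f switch. As a minimal sketch, the helper below assembles either form as an argument list without running anything. The function name zpool_create_command and its parameters are hypothetical and exist only for illustration.

```python
import shlex


def zpool_create_command(pool, devices, layout="mirror", force=False):
    """Build the argv for `zpool create`. Nothing is executed here."""
    if layout == "mirror" and len(devices) < 2:
        raise ValueError("a mirror needs at least two devices")
    argv = ["sudo", "zpool", "create"]
    if force:
        # -f overrides the label warning described in the text
        argv.append("-f")
    argv += [pool, layout, *devices]
    return argv


cmd = zpool_create_command("mypool", ["/dev/sdb", "/dev/sdc"], force=True)
print(shlex.join(cmd))
# prints: sudo zpool create -f mypool mirror /dev/sdb /dev/sdc
```

Building the command as a list keeps the pool name and device paths visible for review before anything touches the disks.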
To reduce clutter in the root folder, you can mount ZFS pools under a subfolder instead of directly off the root.
At this point, the system has created the pool. To check the pool's status, use the sudo zpool status command. The CLI shows the pool's status and the volumes it includes.
Your pools should automatically mount and be available within the system. The pools' default location is in a directory off the root folder with the pool name. For example, mypool will mount on the /mypool folder, and you can use the pool just like any other mount point.
If you're not sure of a pool location, use sudo zfs get all | grep mountpoint to show which mount point the program uses and identify the mount point needed to bring the pool online.
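If you want to use the mount points in a script, the filtered output can be parsed. The sketch below assumes the usual four-column layout of zfs get (name, property, value, source). The sample text is illustrative, not captured from a real system, and parse_mountpoints is a hypothetical helper.

```python
def parse_mountpoints(zfs_get_output):
    """Map each dataset name to its mountpoint from `zfs get` style lines."""
    mountpoints = {}
    for line in zfs_get_output.splitlines():
        fields = line.split()
        # Expected columns: NAME PROPERTY VALUE SOURCE
        if len(fields) >= 3 and fields[1] == "mountpoint":
            mountpoints[fields[0]] = fields[2]
    return mountpoints


# Illustrative sample of what `sudo zfs get all | grep mountpoint` might return
sample = "mypool  mountpoint  /mypool  default\n"
print(parse_mountpoints(sample))
# prints: {'mypool': '/mypool'}
```

This simple whitespace split would mis-parse a mount point that contains spaces, so treat it as a starting point only.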
With your data pools online, you can set most ZFS options via the CLI with sudo zfs. To set up more advanced ZFS functions, such as how to snapshot a read-only version of a file system, define storage pool thresholds or check data integrity with the checksum function, search the in-system ZFS resources with man zfs and reference the Ubuntu ZFS wiki.
If you're new to ZFS, double-check commands before you run them and ensure you understand how they move data within pools, address storage limits and sync data.
Posted by Thang Le Toan on 28 January 2019 01:34 AM
The term "backup," which has become synonymous with data protection, may be accomplished via several methods. Here's how to choose the best way to safeguard your data.
Protecting data against loss, corruption, disasters (manmade or natural) and other problems is one of the top priorities for IT organizations. In concept, the ideas are simple, although implementing an efficient and effective set of backup operations can be difficult.
The term backup has become synonymous with data protection over the past several decades and may be accomplished via several methods. Backup software applications have been developed to reduce the complexity of performing backup and recovery operations. Backing up data is only one part of a disaster protection plan, and may not provide the level of data and disaster recovery capabilities desired without careful design and testing.
The purpose of most backups is to create a copy of data so that a particular file or application may be restored after data is lost, corrupted, deleted or a disaster strikes. Thus, backup is not the goal, but rather it is one means to accomplish the goal of protecting data. Testing backups is just as important as backing up and restoring data. Again, the point of backing up data is to enable restoration of data at a later point in time. Without periodic testing, it is impossible to guarantee that the goal of protecting data is being met.
Backing up data is sometimes confused with archiving data, although these operations are different. A backup is a secondary copy of data used for data protection. In contrast, an archive is the primary data, which is moved to a less-expensive type of media (such as tape) for long-term, low-cost storage.
The most basic and complete type of backup operation is a full backup. As the name implies, this type of backup makes a copy of all data to another set of media, which can be tape, disk or a DVD or CD. The primary advantage to performing a full backup during every operation is that a complete copy of all data is available with a single set of media. This results in a minimal time to restore data, a metric known as a recovery time objective (RTO). However, the disadvantages are that it takes longer to perform a full backup than other types (sometimes by a factor of 10 or more), and it requires more storage space.
Thus, full backups are typically run only periodically. Data centers that have a small amount of data (or critical applications) may choose to run a full backup daily, or even more often in some cases. Typically, backup operations employ a full backup in combination with either incremental or differential backups.
An incremental backup operation will result in copying only the data that has changed since the last backup operation of any type. The modified time stamp on files is typically used and compared to the time stamp of the last backup. Backup applications track and record the date and time that backup operations occur in order to track files modified since these operations.
Because an incremental backup copies only the data changed since the last backup of any type, it may be run as often as desired, with only the most recent changes stored. The benefit of an incremental backup is that it copies a smaller amount of data than a full backup. Thus, these operations complete faster and require less media to store the backup.
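The time stamp comparison behind an incremental backup can be sketched in a few lines. This is a simplified model, not how any particular backup product works: file_mtimes is a hypothetical mapping of path to modification time, and the times are plain numbers.

```python
def files_for_incremental(file_mtimes, last_backup_time):
    """Return the paths modified after the most recent backup of any type."""
    return sorted(
        path for path, mtime in file_mtimes.items() if mtime > last_backup_time
    )


# Hypothetical modification times (for example, seconds since some epoch)
file_mtimes = {"/data/a.txt": 100, "/data/b.txt": 250, "/data/c.txt": 300}
print(files_for_incremental(file_mtimes, last_backup_time=200))
# prints: ['/data/b.txt', '/data/c.txt']
```

After each run, the backup application would record the new backup time so the next incremental compares against it.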
A differential backup operation is similar to an incremental the first time it is performed, in that it will copy all data changed from the previous backup. However, each time it is run afterwards, it will continue to copy all data changed since the previous full backup. Thus, it will store more data than an incremental on subsequent operations, although typically far less than a full backup. Moreover, differential backups require more space and time to complete than incremental backups, although less than full backups.
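Table 1: A comparison of different backup operations
Backup number | Incremental | Differential
Backup 1 | Full backup | Full backup
Backup 2 | Changes from backup 1 | Changes from backup 1
Backup 3 | Changes from backup 2 | Changes from backup 1
Backup 4 | Changes from backup 3 | Changes from backup 1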
As shown in "Table 1: A comparison of different backup operations," each type of backup works differently. A full backup must be performed at least once. Afterwards, it is possible to run either another full, an incremental or a differential backup. The first partial backup performed, whether differential or incremental, will back up the same data. By the third backup operation, the data that is backed up with an incremental is limited to the changes since the last incremental. In comparison, the third backup with a differential will back up all changes since the first full backup, which was backup 1.
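The difference between the two partial backup types can be illustrated with a small simulation. It is a sketch under simple assumptions: backup 1 is a full backup, each later backup sees a set of changed files, and the file names are made up.

```python
def copied_per_backup(changes_between_backups):
    """Return what each partial backup copies after an initial full backup.

    changes_between_backups[i] is the set of files changed between
    backup i+1 and backup i+2 (backup 1 being the full backup).
    """
    incremental, differential = [], []
    changed_since_full = set()
    for changed in changes_between_backups:
        changed_since_full |= changed
        incremental.append(set(changed))              # since the previous backup
        differential.append(set(changed_since_full))  # since the full backup
    return incremental, differential


changes = [{"a"}, {"b"}, {"a", "c"}]  # hypothetical changed files
inc, diff = copied_per_backup(changes)
for number, (i, d) in enumerate(zip(inc, diff), start=2):
    print(f"backup {number}: incremental={sorted(i)} differential={sorted(d)}")
# prints:
# backup 2: incremental=['a'] differential=['a']
# backup 3: incremental=['b'] differential=['a', 'b']
# backup 4: incremental=['a', 'c'] differential=['a', 'b', 'c']
```

The first partial backup is identical for both types, and the differential grows with every run until the next full backup resets it.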
From these three primary backup types, it is possible to develop an approach to protecting data. Typically one of the following approaches is used:
Full daily
Full weekly + Differential daily
Full weekly + Incremental daily
Many considerations will affect the choice of the optimal backup strategy. Typically, each alternative and strategy choice involves making tradeoffs between performance, data protection levels, total amount of data retained and cost. In "Table 2: A backup strategy's impact on space" below, the media capacity requirements and media required for recovery are shown for three typical backup strategies. These calculations presume 20 TB of total data, with 5% of the data changing daily, and no increase in total storage during the period. The calculations are based on 22 working days in a month and a one month retention period for data.
Table 2: A backup strategy's impact on space
Common backup scenarios | Media space required for one month (20 TB @ 5% daily rate of change) | Media required for recovery
Full daily (weekdays) | Space for 22 daily fulls: (22 * 20 TB) = 440.00 TB | Most recent backup only
Full (weekly) + Differential (weekdays) | Fulls, plus most recent differential since full: (5 * 20 TB) + (22 * 5% * 20 TB) = 124.23 TB | Most recent full + most recent differential
Full (weekly) + Incremental (weekdays) | Fulls, plus all incrementals since weekly full: (5 * 20 TB) + (22 * 5% * 20 TB) = 122.00 TB | Most recent full + all incrementals since full
As printed, the differential formula is identical to the incremental one and evaluates to 122.00 TB; the source does not show how the 124.23 TB figure was derived.
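The arithmetic behind the full daily and full plus incremental figures can be checked with a short sketch. It uses only the stated assumptions (20 TB of data, 5% daily change, 22 working days, five weekly fulls in the month). The function names are hypothetical, and the 124.23 TB differential figure is not reproduced because its derivation is not given.

```python
def full_daily_tb(total_tb, working_days):
    """Space for one full backup per working day."""
    return working_days * total_tb


def weekly_full_plus_incrementals_tb(total_tb, daily_change, working_days, weekly_fulls):
    """Space for the weekly fulls plus one incremental per working day."""
    daily_incremental_tb = total_tb * daily_change
    return weekly_fulls * total_tb + working_days * daily_incremental_tb


print(f"{full_daily_tb(20, 22):.2f} TB")
# prints: 440.00 TB
print(f"{weekly_full_plus_incrementals_tb(20, 0.05, 22, 5):.2f} TB")
# prints: 122.00 TB
```

Changing the parameters shows how quickly the gap between the strategies widens as the data set grows.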
As shown above, performing a full backup daily requires the most space and also takes the most time. However, more total copies of data are available, and fewer pieces of media are required to perform a restore operation. As a result, this backup policy has a higher tolerance to disasters and provides the shortest time to restore, since any piece of data required will be located on at most one backup set.
As an alternative, performing a full backup weekly, coupled with running incremental backups daily will deliver the shortest backup time during weekdays and use the least amount of storage space. However, there are fewer copies of data available and restore time is the longest, since it may be required to utilize six sets of media to recover the information needed. If data is needed from data backed up on Wednesday, the Sunday full backup, plus the Monday, Tuesday and Wednesday incremental media sets are required. This can dramatically increase recovery times, and requires that each media set work properly; a failure in one backup set can impact the entire restoration.
Running a weekly full backup plus daily differential backups delivers results in between the other alternatives. Namely, more backup media sets are required to restore than with a daily full policy, although fewer than with a daily incremental policy. Also, the restore time is less than with daily incrementals and more than with daily fulls. In order to restore data from a particular day, at most two media sets are required, diminishing the time needed to recover and the potential for problems with an unreadable backup set.
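The restore requirements of the two weekly strategies can be compared with a small sketch. It assumes a Sunday full backup followed by weekday partial backups, as in the Wednesday example; the names are illustrative only.

```python
WEEK = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]


def media_sets_for_restore(strategy, day):
    """List the media sets needed to restore data as of the given day."""
    index = WEEK.index(day)
    if index == 0:
        return ["Sunday full"]
    if strategy == "incremental":
        # The full plus every incremental taken since it
        return ["Sunday full"] + [f"{d} incremental" for d in WEEK[1:index + 1]]
    if strategy == "differential":
        # The full plus only the most recent differential
        return ["Sunday full", f"{day} differential"]
    raise ValueError(f"unknown strategy: {strategy}")


print(media_sets_for_restore("incremental", "Wednesday"))
# prints: ['Sunday full', 'Monday incremental', 'Tuesday incremental', 'Wednesday incremental']
print(media_sets_for_restore("differential", "Wednesday"))
# prints: ['Sunday full', 'Wednesday differential']
```

A Friday restore under the incremental strategy needs all six sets, which is the worst case the text describes.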
For organizations with small data sets, running a daily full backup provides a high level of protection without much additional storage cost. Larger organizations or those with more data find that running a weekly full backup, coupled with either daily incrementals or differentials, provides a better option. Using differentials provides a higher level of data protection with less restore time for most scenarios, with a small increase in storage capacity. For this reason, using a strategy of weekly full backups with daily differential backups is a good option for many organizations.
Most of the advanced backup options, such as synthetic full, mirror, reverse incremental and continuous data protection (CDP), require disk storage as the backup target. A synthetic full simply reconstructs the full backup image using all required incrementals or the differential on disk. This synthetic full may then be stored to tape for offsite storage, with the advantage being reduced restoration time. Mirroring is copying of disk storage to another set of disk storage, with reverse incrementals used to add incremental-style backup support. Finally, CDP allows a greater number of restoration points than traditional backup options.
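The idea of a synthetic full can be sketched as a merge of the last full backup with the incrementals that followed it. This is a simplified model that treats each backup as a mapping of file name to content version and ignores deleted files; it does not describe any specific product.

```python
def synthetic_full(full_backup, incrementals):
    """Rebuild a full image from a full backup and later incrementals.

    incrementals must be ordered oldest first so newer versions win.
    """
    image = dict(full_backup)
    for incremental in incrementals:
        image.update(incremental)
    return image


full = {"a.txt": "v1", "b.txt": "v1"}  # hypothetical file versions
incrementals = [{"a.txt": "v2"}, {"c.txt": "v1"}, {"a.txt": "v3"}]
print(synthetic_full(full, incrementals))
# prints: {'a.txt': 'v3', 'b.txt': 'v1', 'c.txt': 'v1'}
```

Because the merge happens on the backup target, the production system does not have to be read again to produce the new full image.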
When deciding on a backup strategy, the question is not which type of backup to use, but when to use each, and how these options should be combined with testing to meet the overall business cost, performance and availability goals.
Posted by Thang Le Toan on 05 October 2018 06:46 AM
Kubernetes TLS bootstrap improvements in version 1.12 tackle container management complexity, and users hope there's more where that came from.
IT pros have said Kubernetes TLS bootstrap is a step in the right direction, and they have professed hope that it's the first of many more to come.
Automated Transport Layer Security (TLS) bootstrap is now a stable, production-ready feature as of last week's release of Kubernetes 1.12 to the open source community. Previously, IT pros set up secure communication between new nodes separately, and often manually, as they were added to a Kubernetes cluster. The Kubernetes TLS bootstrap feature automates the way Kubernetes nodes launch themselves into TLS-secured clusters at startup.
"The previous process was more complicated and error-prone. [TLS bootstrap] enables simpler pairing similar to Bluetooth or Wi-Fi push-button pairing," said Tim Pepper, a senior staff engineer at VMware and release lead for Kubernetes 1.12 at the Cloud Native Computing Foundation.
Kubernetes maintainers predict this automation will discourage sys admins' previous workarounds to ease management, such as the use of a single TLS credential for an entire cluster. This workaround prevented the use of Kubernetes security measures that require each node to have a separate credential, such as node authorization and admission controls.
Kubernetes 1.12 pushed to beta a similarly automated process for TLS certificate requests and rotation once clusters are set up. Stable support for such long-term manageability tops Kubernetes users' wish list.
"TLS bootstrap helps, but doesn't completely automate the process of TLS handshakes between nodes and the Kubernetes master," said Arun Velagapalli, principal security engineer at Kabbage Inc., a fintech startup in Atlanta. "It's still a lot of manual work within the [command-line interface] right now."
Kubernetes TLS bootstrap automates TLS communication between Kubernetes nodes, but security in depth also requires Kubernetes TLS management between pods and even individual containers. This has prompted Kabbage engineers to explore the Istio service mesh and HashiCorp Vault for automated container security management.
Kubernetes management challenges linger
Industry analysts overwhelmingly agreed that Kubernetes is the industry standard for container orchestration. A 451 Research survey of 200 enterprise decision-makers and developers in North America conducted in March 2018 found 84% of respondents plan to adopt Kubernetes, rather than use multiple container orchestration tools.
"It will take one to three years for most enterprises to standardize on Kubernetes, and we still see some use of Mesos, which has staying power for data-rich applications," said Jay Lyman, analyst at 451 Research. "But Kubernetes is well-timed as a strong distributed application framework for use in hybrid clouds."
Still, while many enterprises plan to deploy Kubernetes, IT experts questioned the extent of its widespread production use.
"A lot of people say they're using Kubernetes, but they're just playing around with it," said Jeremy Pullen, CEO and principal consultant at Polodis Inc., a DevSecOps and Lean management advisory firm in Tucker, Ga., which works with large enterprise clients. "The jury's still out on how many companies have actually adopted it, as far as I'm concerned."
The Kubernetes community still must make the container orchestration technology accessible to enterprise customers. Vendors such as Red Hat, Rancher and Google Cloud Platform offer Kubernetes distributions that automate cluster setup, but IT pros would like to see such features enter the standard Kubernetes upstream distribution, particularly for on-premises use.
"Manually creating on-premises Kubernetes is not a simple proposition, and the automation features for load balancers, storage, etc., are really public-cloud-centric," said Chris Riley, director of solutions architecture at cPrime Inc., an Agile software development consulting firm in Foster City, Calif. "If that same ease of use [came to] the default distro, I think that would help clients who are still sensitive about public cloud consider Kubernetes."
Kubernetes community leaders don't rule out this possibility as they consider the future of the project. Features in the works include the Kubernetes Cluster API and a standardized container storage interface slated for stable release by the end of 2018. Standardized and accessible cluster management is the top priority for the Kubernetes architecture special interest group.
"The question is, how many and which variations on [Kubernetes cluster management automation] does the community test there, and how do we curate the list we focus on?" Pepper said. "It becomes complicated to balance that. So, for now, we rely on service providers to do opinionated infrastructure integrations."
Posted by Thang Le Toan on 25 September 2018 12:12 AM
Storage at the edge is the collective methods and technologies that capture and retain digital information at the periphery of the network, as close to the originating source as possible. In the early days of the internet, managing storage at the edge was primarily the concern of network administrators who had employees at remote offices/branch offices (ROBOs). By the turn of the century, the term was also being used to describe direct-attached storage (DAS) in notebook computers and personal digital assistants (PDAs) used by field workers. Because employees did not always remember to back up storage at the edge manually, a primary concern was how to automate backups and keep the data secure.
Today, as more data is being generated by networked internet of things (IoT) devices, administrators who deal with storage at the edge are more concerned with establishing workarounds for limited or intermittent connectivity and dealing with raw data that might need to be archived indefinitely. The prodigious volume of data coming from highway video surveillance cameras, for example, can easily overwhelm a traditional centralized storage model. This has led to experiments with pre-processing data at its source and centralizing storage for only a small part of the data.
In the case of automotive data, for example, log files stored in the vehicle might simply be tagged so that, should the need arise, the data in question could be sent to the cloud for deeper analysis. In such a scenario, intermediary micro-data centers or high-performance fog computing servers could be installed at remote locations to replicate cloud services locally. This not only improves performance, but also allows connected devices to act upon perishable data in fractions of a second. Depending upon the vendor and technical implementation, the intermediary storage location may be referred to by one of several names, including IoT gateway, base station or hub.
Unstructured data is information, in many different forms, that doesn't hew to conventional data models and thus typically isn't a good fit for a mainstream relational database. Thanks to the emergence of alternative platforms for storing and managing such data, it is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications.
Traditional structured data, such as the transaction data in financial systems and other business applications, conforms to a rigid format to ensure consistency in processing and analyzing it. Sets of unstructured data, on the other hand, can be maintained in formats that aren't uniform, freeing analytics teams to work with all of the available data without necessarily having to consolidate and standardize it first. That enables more comprehensive analyses than would otherwise be possible.
Types of unstructured data
One of the most common types of unstructured data is text. Unstructured text is generated and collected in a wide range of forms, including Word documents, email messages, PowerPoint presentations, survey responses, transcripts of call center interactions, and posts from blogs and social media sites.
Other types of unstructured data include images, audio and video files. Machine data is another category, one that's growing quickly in many organizations. For example, log files from websites, servers, networks and applications -- particularly mobile ones -- yield a trove of activity and performance data. In addition, companies increasingly capture and analyze data from sensors on manufacturing equipment and other internet of things (IoT) connected devices.
In some cases, such data may be considered to be semi-structured -- for example, if metadata tags are added to provide information and context about the content of the data. The line between unstructured and semi-structured data isn't absolute, though; some data management consultants contend that all data, even the unstructured kind, has some level of structure.
Unstructured data analytics
Because of its nature, unstructured data isn't suited to transaction processing applications, which are the province of structured data. Instead, it's primarily used for BI and analytics. One popular application is customer analytics. Retailers, manufacturers and other companies analyze unstructured data to improve customer relationship management processes and enable more-targeted marketing; they also do sentiment analysis to identify both positive and negative views of products, customer service and corporate entities, as expressed by customers on social networks and in other forums.
Predictive maintenance is an emerging analytics use case for unstructured data. For example, manufacturers can analyze sensor data to try to detect equipment failures before they occur in plant-floor systems or finished products in the field. Energy pipelines can also be monitored and checked for potential problems using unstructured data collected from IoT sensors.
Analyzing log data from IT systems highlights usage trends, identifies capacity limitations and pinpoints the cause of application errors, system crashes, performance bottlenecks and other issues. Unstructured data analytics also aids regulatory compliance efforts, particularly in helping organizations understand what corporate documents and records contain.
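As a minimal illustration of log analysis, the sketch below counts log entries by severity level. The line format (a time stamp, an uppercase level and a message) is an assumption made for this example; real log formats vary widely.

```python
import re
from collections import Counter

# Assumed format: "<timestamp> <LEVEL> <message>"
LOG_LINE = re.compile(r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def count_levels(lines):
    """Count log entries per severity level, skipping lines that do not match."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[match.group("level")] += 1
    return counts


sample_lines = [  # hypothetical log entries
    "2018-09-25T00:12:00 INFO service started",
    "2018-09-25T00:12:05 ERROR database timeout",
    "2018-09-25T00:12:09 ERROR database timeout",
    "not a log line",
]
print(count_levels(sample_lines))
# prints: Counter({'ERROR': 2, 'INFO': 1})
```

Grouping the same counts by hour or by message text is a small extension that starts to reveal the trends and bottlenecks mentioned above.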
Unstructured data techniques and platforms
Analyst firms report that the vast majority of new data being generated is unstructured. In the past, that type of information often was locked away in siloed document management systems, individual manufacturing devices and the like -- making it what's known as dark data, unavailable for analysis.
But things changed with the development of big data platforms, primarily Hadoop clusters, NoSQL databases and the Amazon Simple Storage Service (S3). They provide the required infrastructure for processing, storing and managing large volumes of unstructured data without the imposition of a common data model and a single database schema, as in relational databases and data warehouses.
A variety of analytics techniques and tools are used to analyze unstructured data in big data environments. Text analytics tools look for patterns, keywords and sentiment in textual data; at a more advanced level, natural language processing technology is a form of artificial intelligence that seeks to understand meaning and context in text and human speech, increasingly with the aid of deep learning algorithms that use neural networks to analyze data. Other techniques that play roles in unstructured data analytics include data mining, machine learning and predictive analytics.
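The simplest form of text analytics, looking for keywords, can be sketched as follows. This is a deliberately naive example with made-up survey responses: it matches exact words only and does no stemming, sentiment scoring or natural language processing.

```python
import re
from collections import Counter


def keyword_counts(texts, keywords):
    """Count exact, case-insensitive occurrences of each keyword."""
    wanted = {keyword.lower() for keyword in keywords}
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z']+", text.lower()):
            if token in wanted:
                counts[token] += 1
    return counts


responses = [  # hypothetical customer comments
    "Great service, great price",
    "Slow delivery and slow refunds",
]
counts = keyword_counts(responses, ["great", "slow", "refund"])
print(counts["great"], counts["slow"], counts["refund"])
# prints: 2 2 0
```

The zero for "refund" shows the limit of exact matching, since "refunds" is not counted; handling such variation is where more advanced language tools come in.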
Data science teams need the right skills and solid processes
For data scientists, big data systems and AI-enabled advanced analytics technologies open up new possibilities to help drive better business decision-making. "Like never before, we have access to data, computing power and rapidly evolving tools," Forrester Research analyst Kjell Carlsson wrote in a July 2017 blog post.
The downside, Carlsson added, is that many organizations "are only just beginning to crack the code on how to unleash this potential." Often, that isn't due to a lack of internal data science skills, he said in a June 2018 blog; it's because companies treat data science as "an artisanal craft" instead of a well-coordinated process that involves analytics teams, IT and business units.
Of course, possessing the right data science skills is a predicate to making such processes work. The list of skills that LinkedIn's analytics and data science team wants in job candidates includes the ability to manipulate data, design experiments with it and build statistical and machine learning models, according to Michael Li, who heads the team.
But softer skills are equally important, Li said in an April 2018 blog. He cited communication, project management, critical thinking and problem-solving skills as key attributes. Being able to influence decision-makers is also an important part of "the art of being a data scientist," he wrote.
The problem is that such skills requirements are often "completely out of reach for a single person," Miriam Friedel wrote in a September 2017 blog when she was director and senior scientist at consulting services provider Elder Research. Friedel, who has since moved on to software vendor Metis Machine as data science director, suggested in the blog that instead of looking for the proverbial individual unicorn, companies should build "a team unicorn."
This handbook more closely examines that team-building approach as well as critical data science skills for the big data and AI era.
Reskilling the analytics team: Math, science and creativity
Technical skills are a must for data scientists. But to make analytics teams successful, they also need to think creatively, work in harmony and be good communicators.
In a 2009 study of its employee data, Google discovered that the top seven characteristics of a successful manager at the company didn't involve technical expertise. For example, they included being a good coach and an effective communicator, having a clear vision and strategy, and empowering teams without micromanaging them. Technical skills were No. 8.
Google's list, which was updated this year to add collaboration and strong decision-making capabilities as two more key traits, applies specifically to its managers, not to technical workers. But the findings from the study, known as Project Oxygen, are also relevant to building an effective analytics team.
Obviously, STEM skills are incredibly important in analytics. But as Google's initial and subsequent studies have shown, they aren't the whole or even the most important part of the story. As an analytics leader, I'm very glad that someone has put numbers to all this, but I've always known that the best data scientists are also empathetic and creative storytellers.
According to the latest employment projections report by the U.S. Bureau of Labor Statistics, statisticians are in high demand. Among occupations that currently employ at least 25,000 people, statistician ranks fifth in projected growth rate; it's expected to grow by 33.8% from 2016 to 2026. For context, the average rate of growth that the statistics bureau forecasts for all occupations is 7.4%. And with application software developers as the only other exception, all of the other occupations in the top 10 are in the healthcare or senior care verticals, which is consistent with an aging U.S. population.
Statistician is fifth among occupations with at least 25,000 workers projected to grow at the fastest rates.
Thanks to groundbreaking innovations in technology and computing power, the world is producing more data than ever before. Businesses are using actionable analytics to improve their day-to-day processes and drive diverse functions like sales, marketing, capital investment, HR and operations. Statisticians and data scientists are making that possible, using not only their mathematical and scientific skills, but also creativity and effective communication to extract and convey insights from the new data resources.
In 2017, IBM partnered with job market analytics software vendor Burning Glass Technologies and the Business-Higher Education Forum on a study that showed how the democratization of data is forcing change in the workforce. Without diving into the minutiae, I gathered from the study that with more and more data now available to more and more people, the insights garnered from the data set you apart as an employee -- or as a company.
Developing and encouraging our analytics team
The need to find and communicate these insights influences how we hire and train our up-and-coming analytics employees at Dun & Bradstreet. Our focus is still primarily on mathematics, but we also consider other characteristics like critical- and innovative-thinking abilities as well as personality traits, so our statisticians and data scientists are effective in their roles.
Our employees have the advantage of working for a business-to-business company that has incredibly large and varied data sets -- containing more than 300 million business records -- and a wide variety of customers that are interested in our analytics services and applications. They get to work on a very diverse set of business challenges, share cutting-edge concepts with data scientists in other companies and develop creative solutions to unique problems.
Our associates are encouraged to pursue new analytical models and data analyses, and we have special five-day sprints where we augment and enhance some of the team's more creative suggestions. These sprints not only challenge the creativity of our data analysts, but also require them to work on their interpersonal and communication skills while developing these applications as a group.
Socializing the new, creative data analyst
It's very important to realize that some business users aren't yet completely comfortable with a well-rounded analytics team. For the most part, when bringing in an analyst, they're looking for confirmation of a hypothesis rather than a full analysis of the data at hand.
If that's the case in your organization, then be persistent. As your team continues to present valuable insights and creative solutions, your peers and business leaders across the company will start to seek guidance from data analysts as partners in problem-solving much more frequently and much earlier in their decision-making processes.
As companies and other institutions continue to amass data exponentially and rapid technological changes continue to affect the landscape of our businesses and lives, growing pains will inevitably follow. Exceptional employees who have creativity and empathy, in addition to mathematical skills, will help your company thrive through innovation. Hopefully, you have more than a few analysts who possess those capabilities. Identify and encourage them -- and give permission to the rest of your analytics team to think outside the box and rise to the occasion.
Data scientist vs. business analyst: What's the difference?
Data science and business analyst roles differ in that data scientists must dive deep into data and come up with unique business solutions -- but the distinctions don't end there.
What is the difference between data science and business analyst jobs? And what kind of training or education is required to become a data scientist?
There are a number of differences between data scientists and business analysts, the two most common business analytics roles, but at a high level, you can think about the distinction as similar to a medical researcher and a lab technician. One uses experimentation and the scientific method to search out new, potentially groundbreaking discoveries, while the other applies existing knowledge in an operational context.
Data scientist vs. business analyst comes down to the realms they inhabit. Data scientists delve into big data sets and use experimentation to discover new insights in data. Business analysts, on the other hand, typically use self-service analytics tools to review curated data sets, build reports and data visualizations, and report targeted findings -- things like revenue by quarter or sales needed to hit targets.
What does a data scientist do?
A data scientist takes analytics and data warehousing programs to the next level: What does the data really say about the company, and is the company able to distinguish relevant data from irrelevant data?
A data scientist should be able to leverage the enterprise data warehouse to dive deeper into the data that comes out of it or to analyze new types of data stored in Hadoop clusters and other big data systems. A data scientist doesn't just report on data as a classic business analyst does; they also deliver business insights based on the data.
A data scientist's job also requires a strong business sense and the ability to communicate data-driven conclusions to business stakeholders. Strong data scientists don't just address business problems; they also pinpoint the problems that have the most value to the organization. A data scientist plays a more strategic role within an organization.
Data scientist education, skills and personality traits
Data scientists look through all the available data with the goal of discovering a previously hidden insight that, in turn, can provide a competitive advantage or address a pressing business problem. Data scientists do not simply collect and report on data -- they also look at it from many angles, determine what it means and then recommend ways to apply the data. These insights could lead to a new product or even an entirely new business model.
Data scientists apply advanced machine learning models to automate processes that previously took too long or were inefficient. They use data processing and programming tools -- often open source, like Python, R and TensorFlow -- to develop new applications that take advantage of advances in artificial intelligence. These applications may perform a task such as transcribing calls to a customer service line using natural language processing or automatically generating text for email campaigns.
What does a business analyst do?
A business analyst -- a title often used interchangeably with data analyst -- focuses more on delivering operational insights to lines of business using smaller, more targeted data sets. For example, a business analyst tied to a sales team will work primarily with sales data to see how individual team members are performing, to identify members who might need extra coaching and to search for other areas where the team can improve on its performance.
Business analysts typically use self-service analytics and data visualization tools. Using these tools, business analysts can build reports and dashboards that team members can use to track their performance. Typically, the information contained in these reports is retrospective rather than predictive.
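As a rough illustration of the retrospective reporting described above, the following Python sketch totals revenue by quarter and by sales rep. The records, names and function names are hypothetical and exist only for this example; a self-service BI tool would normally produce the same figures through a graphical interface rather than code.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sales records: (sale date, sales rep, amount in dollars).
SALES = [
    (date(2023, 1, 15), "Avery", 1200.0),
    (date(2023, 2, 3), "Blake", 800.0),
    (date(2023, 4, 22), "Avery", 1500.0),
    (date(2023, 5, 9), "Blake", 700.0),
    (date(2023, 8, 30), "Avery", 950.0),
]


def quarter_of(day):
    """Return a label such as '2023-Q1' for the calendar quarter of a date."""
    return f"{day.year}-Q{(day.month - 1) // 3 + 1}"


def revenue_by_quarter(sales):
    """Sum sale amounts per calendar quarter."""
    totals = defaultdict(float)
    for day, _rep, amount in sales:
        totals[quarter_of(day)] += amount
    return dict(totals)


def revenue_by_rep(sales):
    """Sum sale amounts per sales rep."""
    totals = defaultdict(float)
    for _day, rep, amount in sales:
        totals[rep] += amount
    return dict(totals)


if __name__ == "__main__":
    print(revenue_by_quarter(SALES))
    print(revenue_by_rep(SALES))
```

Both functions only summarize what has already happened, which is the sense in which such reports are retrospective rather than predictive.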
Data scientist vs. business analyst training, tools and trends
To become a business analyst, you need familiarity with statistics and the fundamentals of data analysis, but there are many self-service analytics tools that do the mathematical heavy lifting for you. Of course, you have to know if it's statistically meaningful to join two separate data sets, and you have to understand the distinction between correlation and causation. But, on the whole, a deep background in mathematics is unnecessary.
To become a data scientist, on the other hand, you need a strong background in math. This is one of the primary differences in the question of data scientists vs. business analysts.
Many data scientists have doctorates in some field of math. Many have backgrounds in physics or other advanced sciences that lean heavily on statistical inference.
Business analysts can generally pick up the technical skills they need on the job. Whether an enterprise uses Tableau, Qlik or Power BI -- the three most common self-service analytics platforms -- or another tool, most use graphical user interfaces that are designed to be intuitive and easy to pick up.
Data science jobs require more specific technical training. In addition to advanced mathematical education, data scientists need deep technical skills. They must be proficient in several common coding languages -- including Python, SQL and Java -- which enable them to run complex machine learning models against big data stored in Hadoop or other distributed data management platforms. Most often, data scientists pick up these skills from a college-level computer science curriculum.
However, trends in data analytics are beginning to collapse the line between data science and data analysis. Increasingly, software companies are introducing platforms that can automate complex tasks using machine learning. At the same time, self-service software supports deeper analytical functionality, meaning data scientists are increasingly using tools that were once solely for business analysts.
Companies often report the highest analytics success when blending teams, so data scientists working alongside business analysts can produce operational benefits. This means that the data scientist vs. business analyst distinctions could become less important as time goes on -- a trend that may pay off for enterprises.
Hiring vs. training data scientists: The case for each approach
Hiring data scientists is easier said than done -- so should you try to train current employees in data science skills? That depends on your company's needs, writes one analytics expert.
Companies are faced with a dilemma on big data analytics initiatives: whether to hire data scientists from outside or train current employees to meet new demands. In many cases, realizing big data's enormous untapped potential brings the accompanying need to increase data science skills -- but building up your capacity can be tricky, especially in a crowded market of businesses looking for analytics talent.
Even with a shortage of available data scientists, screening and interviewing for quality hires is time- and resource-intensive. Alternatively, training data scientists from within may be futile if internal candidates don't have the fundamental aptitude.
At The Data Incubator, we've helped hundreds of companies train employees on data science and hire new talent -- and, often, we've aided organizations in handling the tradeoffs between the two approaches. Based on the experiences we've had with our corporate clients, you should consider the following factors when deciding which way to go.
New hires bring in new thinking
The main benefit of hiring rather than training data scientists comes from introducing new ideas and capabilities into your organization. What you add may be technical in nature: For example, are you looking to adopt advanced machine learning techniques, such as neural networks, or to develop real-time customer insights by using Spark Streaming? It may be cultural, too: Do you want an agile data science team that can iterate rapidly -- even at the expense of "breaking things," in Facebook's famous parlance? Or one that can think about data creatively and find novel approaches to using both internal and external information?
At other times, it's about having a fresh set of eyes looking at the same problems. Many quant hedge funds intentionally hire newly minted STEM Ph.D. holders -- people with degrees in science, technology, engineering or math -- instead of industry veterans precisely to get a fresh take on financial markets. And it isn't just Wall Street; in other highly competitive industries, too, new ideas are the most important currency, and companies fight for them to remain competitive.
How a company sources new talent can also require some innovation, given the scarcity of skilled data scientists. Kaggle and other competition platforms can be great places to find burgeoning data science talent. The public competitions on Kaggle are famous for bringing unconventional stars and unknown whiz kids into the spotlight and demonstrating that the best analytics performance may come from out of left field.
Similarly, we've found that economists and other social scientists often possess the same strong quantitative skill sets as their traditional STEM peers, but are overlooked by HR departments and hiring managers alike.
Training adds to existing expertise
In other cases, employers may value industry experience first and foremost. Domain expertise is complex, intricate and difficult to acquire in some industries. Such industries often already have another science at their core. Rocketry, mining, chemicals, oil and gas -- these are all businesses in which knowledge of the underlying science is more important than data science know-how.
Highly regulated industries are another case in point. Companies facing complex regulatory burdens must often meet very specific, and frequently longstanding, requirements. Banks must comply with financial risk testing and with statutes that were often written decades ago. Similarly, the drug approval process in healthcare is governed by a complex set of immutable rules. While there is certainly room for innovation via data science and big data in these fields, it is constrained by regulations.
Companies in this position often find training data scientists internally to be a better option for developing big data analytics capabilities than hiring new talent. For example, at The Data Incubator, we work with a large consumer finance institution that was looking for data science capabilities to help enhance its credit modeling. But its ideal candidate profile for that job was very different from the ones sought by organizations looking for new ideas on business operations or products and services.
The relevant credit data comes in slowly: Borrowers who are initially reliable could become insolvent months or years after the initial credit decision, which makes it difficult to predict defaults without a strong credit model. And wrong decisions are very expensive: Loan defaults result in direct hits to the company's profitability. In this case, we worked with the company to train existing statisticians and underwriters on complementary data science skills around big data.
Of course, companies must be targeted in selecting training candidates. They often start by identifying employees who possess strong foundational skills for data science -- things like programming and statistics experience. Suitable candidates go by many titles, including statisticians, actuaries and quantitative analysts, more popularly known as quants.
Find the right balance for your needs
For many companies, weighing the options for hiring or training data scientists comes down to understanding their specific business needs, which can vary even in different parts of an organization. It's worth noting that the same financial institution that trained its staffers to do analytics for credit modeling also hired data scientists for its digital marketing team.
Without the complex regulatory requirements imposed on the underwriting side, the digital marketing team felt it could more freely innovate -- and hence decided to bring in new blood with new ideas. These new hires are now building analytical models that leverage hundreds of data signals and use advanced AI and machine learning techniques to more precisely target marketing campaigns at customers and better understand the purchase journeys people take.
Ultimately, the decision of whether to hire or train data scientists must make sense for an organization. Companies must balance the desire to innovate with the need to incorporate existing expertise and satisfy regulatory requirements. Getting that balance right is a key step in a successful data science talent strategy.
Self-service business intelligence (SSBI) is an approach to data analytics that enables business users to access and work with corporate data even though they do not have a background in statistical analysis, business intelligence (BI) or data mining. Allowing end users to make decisions based on their own queries and analyses frees up the organization's business intelligence and information technology (IT) teams from creating the majority of reports and allows those teams to focus on other tasks that will help the organization reach its goals.
Because self-service BI software is used by people who may not be tech-savvy, it is imperative that the user interface (UI) be intuitive, with a user-friendly dashboard and navigation. Ideally, training should be provided to help users understand what data is available and how that information can be queried to make data-driven decisions to solve business problems. But once the IT department has set up the data warehouse and data marts that support the business intelligence system, business users should be able to query the data and create personalized reports with very little effort.
While self-service BI encourages users to base decisions on data instead of intuition, the flexibility it provides can cause unnecessary confusion if there is not a data governance policy in place. Among other things, the policy should define what the key metrics for determining success are, what processes should be followed to create and share reports, what privileges are necessary for accessing confidential data and how data quality, security and privacy will be maintained.
Explore the data discovery software market, including the products and vendors helping enterprises glean insights using data visualization and self-service BI.
Turning data into business insight is the ultimate goal. It's not about gathering as much data as possible; it's about applying tools and making discoveries that help a business succeed. The data discovery software market includes a range of software and cloud-based services that can help organizations gain value from their constantly growing information resources.
These products fall within the broad BI category, and at their most basic, they search for patterns within data and data sets. Many of these tools use visual presentation mechanisms, such as maps and models, to highlight patterns or specific items of relevance. The tools deliver visualizations to users, including nontechnical workers, such as business analysts, via dashboards, reports, charts and tables.
The big benefit here: data discovery tools provide detailed insights gleaned from data to better inform business decisions. In many cases, the tools accomplish this with limited IT involvement because the products offer self-service features.
Using extensive research into the data discovery software market, TechTarget editors focused on the data discovery software vendors that lead in market share, plus those that offer traditional and advanced functionality. Our research included data from TechTarget surveys, as well as reports from respected research firms, including Gartner and Forrester.
Alteryx Connect
Alteryx Inc. markets Connect as a collaborative data exploration and data cataloging platform for the enterprise that changes how information workers discover, prioritize and analyze all the relevant information within an organization.
Alteryx Connect key features include:
Data Asset Catalog, which collects metadata from information systems, enabling better organization of relevant data;
Business Glossary, which defines standard business terms in a data dictionary and links them to assets in the catalog; and
Data Discovery, which lets users discover the information they need through search capabilities.
Other features include:
Data Enrichment and Collaboration, which allows users to annotate, discuss and rate information to offer business context and provide an organization with relevant data; and
Certification and Trust, which provides insights into information asset trustworthiness through certification, lineage and versioning.
Alteryx touts these features as decreasing the time necessary to gain insight and supporting faster, data-driven decisions by improving collaboration, enhancing analytic productivity and ensuring data governance.
Domo
Domo Inc. provides a single-source system for end-to-end data integration and preparation, data discovery, and sharing in the cloud. It's mobile-focused, and it doesn't require users to integrate desktop software, third-party tools or on-premises servers.
With more than 500 native connectors, Domo designed the platform for quick and easy access to data from across the business, according to the company. It contains a central repository that ingests the data and aids version and access control.
Domo also provides one workspace from which people can choose and explore all the data sets available to them in the platform.
Data discovery capabilities include Data Lineage, a path-based view that clarifies data sources. This feature also enables simultaneous display of data tables alongside visualizations, aiding insight discovery, as well as card-based publishing and sharing.
GoodData Enterprise Insights Platform
GoodData Corp.'s cloud-based Enterprise Insights Platform is an end-to-end data discovery software platform that gathers data and user decisions, transforming them into actionable insights for line-of-business users.
The platform provides insights in the form of recommendations and predictive analytics with the goal of delivering the analytics that matter most for real-time decision-making. Customers, partners and employees see information that is relevant to the decision at hand, presented in what GoodData claims is a personalized, contextual, intuitive and actionable form. Users can also integrate these insights directly into applications.
IBM Watson Explorer
IBM has a host of data discovery products, and one of the key offerings is IBM Watson Explorer. It's a cognitive exploration and content analysis platform that enables business users to easily explore and analyze structured, unstructured, internal, external and public data for trends and patterns.
Organizations have used Watson Explorer to understand 100% of incoming calls and emails, to improve the quality of information, and to enhance their ability to use that information, according to IBM.
Machine learning models, natural language processing and next-generation APIs combine to help organizations unlock value from all of their data and gain a secure, 360-degree view of their customers, in context, according to the company.
The platform also enables users to classify and score structured and unstructured data with machine learning to reach the most relevant information. And a new mining application gives users deep insights into structured and unstructured data.
Informatica
Informatica LLC offers multiple data management products powered by its Claire engine as part of its Intelligent Data Platform. The Claire engine is a metadata-driven AI technology that automatically scans enterprise data sets and exploits machine learning algorithms to infer relationships about the data structure and provide recommendations and insights. By augmenting end users' individual knowledge with AI, organizations can discover more data from more users in the enterprise, according to the company.
Another component, Informatica Enterprise Data Catalog, scans and catalogs data assets across the enterprise to deliver recommendations, suggestions and data management task automation. Semantic search and dynamic facet capabilities allow users to filter search results and get data lineage, profiling statistics and holistic relationship views.
Informatica Enterprise Data Lake enables data analysts to quickly find data using semantic and faceted search and to collaborate with one another in shared project workspaces. Machine learning algorithms recommend alternative data sets. Analysts can sample and prepare data sets in an Excel-like data preparation interface and operationalize that work as reusable workflows.
Information Builders WebFocus
Information Builders claims its WebFocus data discovery software platform helps companies use BI and analytics strategically across and beyond the enterprise.
The platform includes a self-service visual discovery tool that enables nontechnical business users to conduct data preparation; visually analyze complex data sets; generate sophisticated data visualizations, dashboards, and reports; and share content with other users. Its extensive visualization and charting capabilities provide an approach to self-service discovery that supports any type of user, Information Builders claims.
Information Builders offers a number of tools related to the WebFocus BI and analytics platform that provide enterprise-grade analytics and data discovery. One is WebFocus InfoApps, which can take advantage of custom information applications designed to enable nontechnical users to rapidly gather insights and explore specific business contexts. InfoApps can include parameterized dashboards, reports, charts and visualizations.
Another tool, WebFocus InfoAssist, provides governed self-service reporting, analysis and discovery capabilities to nontechnical users. The product offers a self-service BI capability for immediate data access and analysis.
Microsoft Power BI
Microsoft Power BI is a cloud-based business analytics service that enables users to visualize and analyze data. The same users can distribute data insights anytime, anywhere, on any device in just a few clicks, according to the company.
As a BI and analytics SaaS tool, Power BI equips users across an organization to build reports with colleagues and share insights. It connects to a broad range of live data through dashboards, provides interactive reports and delivers visualizations that include KPIs from data on premises and in the cloud.
Organizations can use machine learning to automatically scan data and gain insights, ask questions of the data using natural language queries, and take advantage of more than 140 free custom visuals created by the user community.
Power BI applications include dashboards with prebuilt content for cloud services, including Salesforce, Google Analytics and Dynamics 365. It also integrates seamlessly with Microsoft products, such as Office 365, SharePoint, Excel and Teams.
Organizations can start by downloading Power BI Desktop for free, while Power BI Pro and Premium offer several licensing options for companies that want to deploy Power BI across their organization.
MicroStrategy Desktop Client
MicroStrategy Ltd. designed its Desktop client to deliver self-service BI and help business users or departmental analysts analyze data with out-of-the-box visualizations. Data discovery capabilities are available via Mac or Windows PC web browsers and native mobile apps for iOS and Android.
All the interfaces are consistent, and users can promote content between them. With the MicroStrategy Desktop client, business users can visualize data on any chart or graph, including natural language generation narratives, Google Charts, geospatial maps and Data-Driven Documents (D3) visualizations.
They can access data from more than 100 data sources, including spreadsheets, RDBMS, cloud systems, and more; prepare, blend, and profile data with graphical interfaces; share data as a static PDF or as an interactive dashboard file; and promote offline content to a server and publish governed and certified dashboards.
OpenText EnCase Risk Manager
OpenText EnCase Risk Manager enables organizations to understand the sensitive data they have in their environment, where the data exists and its value.
The data discovery software platform helps organizations identify, categorize and remediate sensitive information across the enterprise, whether that information exists in the form of personally identifiable customer data, financial records or intellectual property. EnCase Risk Manager provides the ability to search for standard patterns, such as national identification numbers and credit card data, with the ability to discover entirely unique or proprietary information specific to a business or industry.
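To make the idea of searching for standard patterns concrete, here is a minimal Python sketch that scans text for candidate payment card numbers and keeps only those that pass the Luhn checksum. It is an illustration of the general technique, not a description of how EnCase Risk Manager works internally; the regular expression, function names and sample text are assumptions made for this example.

```python
import re

# Candidate card numbers: 13 to 19 digits, optionally separated by single
# spaces or hyphens. This pattern is an assumption for illustration only.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits):
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text):
    """Return digit strings in text that look like card numbers and pass Luhn."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits


if __name__ == "__main__":
    sample = "Card 4111 1111 1111 1111 on file; order ref 1234567890123456."
    print(find_card_numbers(sample))
```

The checksum step matters because a digit pattern alone produces many false positives, such as order or invoice numbers of the same length.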
Risk Manager is platform-agnostic and able to identify this information throughout the enterprise wherever structured or unstructured data is stored, be that on endpoints, servers, cloud repositories, SharePoint or Exchange. Pricing starts at $60,000.
Oracle Big Data Discovery
Oracle Big Data Discovery enables users to find, explore and analyze big data. They can use the platform to discover new insights from data and share results with other tools and resources in a big data ecosystem, according to the company.
The platform uses Apache Spark, and Oracle claims it's designed to speed time to completion, make big data more accessible to business users across an organization and decrease the risks associated with big data projects.
Big Data Discovery provides rapid visual access to data through an interactive catalog of the data; loads local data from Excel and CSV files through self-service wizards; provides data set summaries, annotations from other users, and recommendations for related data sets; and enables search and guided navigation.
Together with statistics about each individual attribute in any data set, these capabilities expose the shape of the data, according to Oracle, enabling users to understand data quality, detect anomalies, uncover outliers and ultimately determine potential. Organizations can use the platform to visualize attributes by data type; glean which are the most relevant; sort attributes by potential, so the most meaningful information displays first; and use a scratchpad to uncover potential patterns and correlations between attributes.
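The kind of per-attribute statistics and outlier detection described above can be sketched in a few lines of Python. This is a generic illustration that uses the common interquartile range (IQR) rule, not Oracle's method; the function names, the sample values and the 1.5 multiplier are assumptions for this example.

```python
import statistics


def profile_attribute(values):
    """Summarize one numeric attribute; None marks a missing value."""
    present = [v for v in values if v is not None]
    q1, _q2, q3 = statistics.quantiles(present, n=4)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.fmean(present),
        "median": statistics.median(present),
        "q1": q1,
        "q3": q3,
    }


def find_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    present = [v for v in values if v is not None]
    q1, _q2, q3 = statistics.quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in present if v < low or v > high]


if __name__ == "__main__":
    order_values = [10, 12, 11, 13, 12, 11, 95, None]  # hypothetical attribute
    print(profile_attribute(order_values))
    print(find_outliers(order_values))
```

A summary like this shows the shape of an attribute at a glance: how much of it is missing, where the bulk of the values sits and which values fall far outside that range.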
Qlik Sense
Qlik Sense is Qlik's next-generation data discovery software platform for self-service BI. It supports a full range of analytics use cases including self-service visualization and exploration, guided analytics applications and dashboards, custom and embedded analytics, mobile analytics, and reporting, all within a governed, multi-cloud architecture.
It offers analytics capabilities for all types of users, including associative exploration and search, smart visualizations, self-service creation and data preparation, geographic analysis, collaboration, storytelling, and reporting. The platform also offers fully interactive online and offline mobility and an insight advisor that generates relevant charts and insights using AI.
Tibco's Spotfire can readily integrate streaming data sources from IoT, social media and messaging with at-rest data for real-time contextual analysis.
Freely distributed accelerators include product templates to help users get to production quickly.
Tibco's Insight Platform combines live streaming data with queries on large at-rest volumes. Historical patterns are interactively identified with Spotfire, running directly against Hadoop and Spark. The Insight Platform can then apply these patterns to streaming data for predictive and operational insights.
For the enterprise, Qlik Sense provides a platform that includes open and standard APIs for customization and extension, data integration scripting, broad data connectivity and data-as-a-service, centralized management and governance, and a multi-cloud architecture for scalability across on-premises environments, as well as private and public cloud environments.
Qlik Sense runs on the patented Qlik Associative Engine, which allows users to explore information without query-based tools. And the new Qlik cognitive engine works with the associative engine to augment the user, offering insight suggestions and automation in context with user behavior.
Qlik Sense is available in cloud and enterprise editions.
Salesforce Einstein Discovery
Salesforce's Einstein Discovery, an AI-powered feature within the Einstein Analytics portfolio, allows business users to automatically analyze millions of data points to understand their current business, explore historical trends, and automatically receive guided recommendations on what they can do to expand deals or resolve customer service cases faster.
Einstein Discovery for Analysts lets users analyze data in Salesforce CRM, CSV files or data from external data sources. In addition, users can take advantage of smart data preparation capabilities to make data improvements, run analyses to create stories, further explore these stories in Einstein Analytics for advanced visualization capabilities, and push insights into Salesforce objects for all business users.
Einstein Discovery for Business Users delivers insights in natural language directly within Salesforce -- in Sales Cloud or Service Cloud, for example. Einstein Discovery for Analysts is available for $2,000 per user, per month. Einstein Discovery for Business Users is $75 per user, per month.
SAS Visual Analytics
SAS Institute Inc.'s Visual Analytics on SAS Viya provides interactive data visualizations to help users explore and better understand data.
The product provides a scalable, in-memory engine along with a user-friendly interface, SAS claims. The combination of interactive data exploration, dashboards, reporting and analytics is designed to help business users find valuable insights without coding. Any user can assess probable outcomes and make more informed, data-driven decisions.
SAS Visual Analytics capabilities include:
automated forecasting, so users can select the most appropriate forecasting method to suit the data;
scenario analysis, which identifies important variables and how changes to them can influence forecasts;
goal-seeking to determine the values of underlying factors that would be required to achieve the target forecast; and
decision trees, allowing users to create a hierarchical segmentation of the data based on a series of rules applied to each observation.
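Of the capabilities listed above, goal-seeking is the easiest to show in miniature: given a forecast model and a target, search for the input value that produces the target. The Python sketch below does this with simple bisection. It is not SAS's implementation; the forecast formula, the function names and the bounds are hypothetical, and the approach assumes the model is continuous and increasing over the search interval.

```python
def forecast_revenue(ad_spend):
    """Hypothetical forecast: baseline revenue plus diminishing returns on ad spend."""
    return 50_000.0 + 400.0 * ad_spend ** 0.5


def goal_seek(model, target, low, high, tolerance=1e-6, max_iterations=200):
    """Find an input in [low, high] whose model output is close to target.

    Assumes the model is continuous and increasing on the interval.
    """
    if not model(low) <= target <= model(high):
        raise ValueError("target is not reachable within the given bounds")
    for _ in range(max_iterations):
        middle = (low + high) / 2
        if model(middle) < target:
            low = middle   # target lies in the upper half
        else:
            high = middle  # target lies in the lower half
        if high - low < tolerance:
            break
    return (low + high) / 2


if __name__ == "__main__":
    # How much ad spend would this toy model need to forecast $70,000 in revenue?
    needed = goal_seek(forecast_revenue, 70_000.0, 0.0, 10_000.0)
    print(round(needed, 2))
```

In this toy model, the answer can be checked by hand: 400 times the square root of the spend must equal 20,000, so the spend is 2,500.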
Other features include network diagrams so users can see how complex data is interconnected; path analysis, which displays the flow of data from one event to another as a series of paths; and text analysis, which applies sentiment analysis to video, social media streams or customer comments to provide quick insights into what's being discussed online.
SAP Analytics Cloud
SAP's Analytics Cloud service offers analytics capabilities for all users, including discovery, analysis, planning, predicting and collaborating, in one integrated cloud platform, according to SAP.
The service gives users business insights based on its ability to turn embedded data analytics into business applications, the company claims.
Among the potential benefits:
enhanced user experience with the service's visualization and role-based personalization features;
better business results from deep collaboration and informed decisions due to SAP's ability to integrate with existing on-premises applications; and
simplified data across an organization to ensure faster, fact-based decision-making.
In addition, Analytics Cloud is free from operating system constraints, download requirements and setup tasks. It provides real-time analytics and extensibility using SAP Cloud Platform, which can reduce the total cost of ownership because all the features are offered in one SaaS product for all users.
Sisense Ltd. offers an end-to-end platform that ingests data from a variety of sources before analyzing, mashing up and visualizing it. Its open API framework also enables a high degree of customization without the input of designers, data scientists or IT specialists, according to Sisense.
The Sisense analytics engine runs 10 to 100 times faster than in-memory platforms, according to the company, dealing with terabytes of data and potentially eliminating onerous data preparation work. The platform provides business insights augmented by machine learning and anomaly detection. In addition, the analytics tool delivers insights beyond the dashboard through new forms of BI access, including chatbots and autonomous alerts.
Tableau Software Inc.'s Desktop is a visual analytics and data discovery software platform that lets users see and understand their data with drag-and-drop simplicity, according to the company. Users can create interactive visualizations and dashboards to gain immediate insights without the need for any programming. They can then share their findings with colleagues.
Tableau Desktop can connect to an organization's data in the cloud, on premises or both using one of 75 native data connectors or Tableau's Web Data Connector. This includes connectors to data sources such as Amazon Redshift, Google BigQuery, SQL Server, SAP and Oracle, plus applications such as Salesforce and ServiceNow.
Tibco Software Inc.'s Spotfire is an enterprise analytics platform that connects to and blends data from files, relational and NoSQL databases, OLAP, Hadoop and web services, as well as from cloud applications such as Google Analytics and Salesforce.
Operational intelligence (OI) is an approach to data analysis that enables decisions and actions in business operations to be based on real-time data as it's generated or collected by companies. Typically, the data analysis process is automated, and the resulting information is integrated into operational systems for immediate use by business managers and workers.
OI applications are primarily targeted at front-line workers who, hopefully, can make better-informed business decisions or take faster action on issues if they have access to timely business intelligence (BI) and analytics data. Examples include call-center agents, sales representatives, online marketing teams, logistics planners, manufacturing managers and medical professionals. In addition, operational intelligence can be used to automatically trigger responses to specified events or conditions.
What is now known as OI evolved from operational business intelligence, an initial step focused more on applying traditional BI querying and reporting. OI takes the concept to a higher analytics level, but operational BI is sometimes still used interchangeably with operational intelligence as a term.
How operational intelligence works
In most OI initiatives, data analysis is done in tandem with data processing or shortly thereafter, so workers can quickly identify and act on problems and opportunities in business operations. Deployments often include real-time business intelligence systems set up to analyze incoming data, plus real-time data integration tools to pull together different sets of relevant data for analysis.
Stream processing systems and big data platforms, such as Hadoop and Spark, can also be part of the OI picture, particularly in applications that involve large amounts of data and require advanced analytics capabilities. In addition, various IT vendors have combined data streaming, real-time monitoring and data analytics tools to create specialized operational intelligence platforms.
As data is analyzed, organizations often present operational metrics, key performance indicators (KPIs) and business insights to managers and other workers in interactive dashboards that are embedded in the systems they use as part of their jobs; data visualizations are usually included to help make the information easy to understand. Alerts can also be sent to notify users of developments and data points that require their attention, and automated processes can be kicked off if predefined thresholds or other metrics are exceeded, such as stock trades being spurred by prices hitting particular levels.
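The threshold-driven alerting described above can be sketched in a few lines of Python. This is a minimal illustration rather than a description of any particular OI product; the `Rule` class, the `notify` function, the metric names and the threshold value are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Rule:
    """A hypothetical alert rule: fire when a metric exceeds a threshold."""
    metric: str
    threshold: float
    action: Callable[[str, float], None]  # called with (metric, value)


def process_events(events: Iterable[dict], rules: list[Rule]) -> int:
    """Check each incoming event against the rules as it arrives.

    Returns the number of actions triggered.
    """
    triggered = 0
    for event in events:
        for rule in rules:
            value = event.get(rule.metric)
            if value is not None and value > rule.threshold:
                rule.action(rule.metric, value)
                triggered += 1
    return triggered


alerts: list[str] = []


def notify(metric: str, value: float) -> None:
    # Stand-in for sending an alert or kicking off an automated process.
    alerts.append(f"{metric} exceeded threshold: {value}")


# Illustrative data only.
rules = [Rule(metric="stock_price", threshold=100.0, action=notify)]
stream = [{"stock_price": 98.5}, {"stock_price": 101.2}, {"queue_length": 40}]
count = process_events(stream, rules)
```

In a real deployment the events would come from a streaming source rather than a list, and the action would call a notification or workflow system instead of appending to a list.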
Operational intelligence uses and examples
Stock trading and other types of investment management are prime candidates for operational intelligence initiatives because of the need to monitor huge volumes of data in real time and respond rapidly to events and market trends. Customer analytics is another area that's ripe for OI. For example, online marketers use real-time tools to analyze internet clickstream data, so they can better target marketing campaigns to consumers. And cable TV companies track data from set-top boxes in real time to analyze the viewing activities of customers and how the boxes are functioning.
The growth of the internet of things has sparked operational intelligence applications for analyzing sensor data being captured from manufacturing machines, pipelines, elevators and other equipment; that enables predictive maintenance efforts designed to detect potential equipment failures before they occur. Other types of machine data also fuel OI applications, including server, network and website logs that are analyzed in real time to look for security threats and IT operations issues.
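As a rough sketch of how sensor readings might be screened for signs of trouble, the following Python flags values that deviate sharply from a rolling window of recent readings. It assumes a simple z-score test, which is only one of many possible techniques and not necessarily what a production predictive maintenance system would use; the function name, window size, limit and readings are all made up for illustration.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(readings, window=5, z_limit=3.0):
    """Flag readings that deviate sharply from the recent rolling window.

    Returns a list of (index, value) pairs for suspicious readings.
    """
    recent = deque(maxlen=window)
    flagged = []
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = stdev(recent)
            # Skip the check if the window is perfectly flat (sigma == 0).
            if sigma > 0 and abs(value - mu) / sigma > z_limit:
                flagged.append((i, value))
        recent.append(value)
    return flagged


# Made-up vibration readings with one spike at index 6.
vibration = [1.0, 1.1, 0.9, 1.0, 1.05, 1.0, 4.8, 1.0, 0.95]
anomalies = detect_anomalies(vibration)
```

Note that the spike itself enters the window after it is flagged, which temporarily widens the tolerance; a more careful version might exclude flagged values from the baseline.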
There are less grandiose operational intelligence use cases, as well. That includes the likes of call-center applications that provide operators with up-to-date customer records and recommend promotional offers while they're on the phone with customers, as well as logistics ones that help calculate the most efficient driving routes for fleets of delivery vehicles.
OI benefits and challenges
The primary benefit of OI implementations is the ability to address operational issues and opportunities as they arise -- or even before they do, as in the case of predictive maintenance. Operational intelligence also empowers business managers and workers to make more informed -- and hopefully better -- decisions on a day-by-day basis. Ultimately, if managed successfully, the increased visibility and insight into business operations can lead to higher revenue and competitive advantages over rivals.
But there are challenges. Building operational intelligence architecture typically involves piecing together different technologies, and there are numerous data processing platforms and analytics tools to choose between, some of which may require new skills in organizations. High performance and sufficient scalability are also needed to handle the real-time workloads and large volumes of data common in OI applications without choking the system.
Also, most business processes at a typical company don't require real-time data analysis. With that in mind, a key part of operational intelligence projects involves determining which end users need up-to-the-minute data and then training them to handle the information once it starts being delivered to them in that fashion.
Operational intelligence vs. business intelligence
Conventional BI systems support the analysis of historical data that has been cleansed and consolidated in a data warehouse or data mart before being made available for business analytics uses. BI applications generally aim to tell corporate executives and business managers what happened in the past on revenues, profits and other KPIs to aid in budgeting and strategic planning.
Early on, BI data was primarily distributed to users in static operational reports. That's still the case in some organizations, although many have shifted to dashboards with the ability to drill down into data for further analysis. In addition, self-service BI tools let users run their own queries and create data visualizations on their own, but the focus is still mostly on analyzing data from the past.
Operational intelligence systems let business managers and front-line workers see what's currently happening in operational processes and then immediately act upon the findings, either on their own or through automated means. The purpose is not to facilitate planning, but to drive operational decisions and actions in the moment.
A CDN (content delivery network), also called a content distribution network, is a group of geographically distributed and interconnected servers that provide cached internet content from a network location closest to a user to accelerate its delivery. The primary goal of a CDN is to improve web performance by reducing the time needed to transmit content and rich media to users' internet-connected devices.
Content delivery network architecture is also designed to reduce network latency, which is often caused by hauling traffic over long distances and across multiple networks. Reducing latency has become increasingly important, as more dynamic content, video and software as a service are delivered to a growing number of mobile devices.
CDN providers house cached content in either their own network points of presence (POP) or in third-party data centers. When a user requests content from a website, if that content is cached on a content delivery network, the CDN redirects the request to the server nearest to that user and delivers the cached content from its location at the network edge. This process is generally invisible to the user.
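To make the idea of routing a request to the nearest server concrete, here is a minimal Python sketch that picks the geographically closest POP using great-circle distance. The POP names and coordinates are hypothetical, and distance alone is a simplifying assumption; an actual CDN would likely also weigh factors such as network conditions and server load.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))


# Hypothetical POPs with approximate city coordinates (latitude, longitude).
POPS = {
    "boston": (42.36, -71.06),
    "seattle": (47.61, -122.33),
    "london": (51.51, -0.13),
}


def nearest_pop(user_lat, user_lon, pops=POPS):
    """Return the name of the POP geographically closest to the user."""
    return min(pops, key=lambda name: haversine_km(user_lat, user_lon, *pops[name]))


# A user near New York (approx. 40.71, -74.01).
choice = nearest_pop(40.71, -74.01)
```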
A wide variety of organizations and enterprises use CDNs to cache their website content to meet their businesses' performance and security needs. The need for CDN services is growing, as websites offer more streaming video, e-commerce applications and cloud-based applications where high performance is key. Few CDNs have POPs in every country, which means many organizations use multiple CDN providers to make sure they can meet the needs of their business or consumer customers wherever they are located.
In addition to content caching and web delivery, CDN providers are capitalizing on their presence at the network edge by offering services that complement their core functionalities. These include security services that encompass distributed denial-of-service (DDoS) protection, web application firewalls (WAFs) and bot mitigation; web and application performance and acceleration services; streaming video and broadcast media optimization; and even digital rights management for video. Some CDN providers also make their APIs available to developers who want to customize the CDN platform to meet their business needs, particularly as webpages become more dynamic and complex.
How does a CDN work?
The process of accessing content cached on a CDN network edge location is almost always transparent to the user. CDN management software dynamically calculates which server is located nearest to the requesting user and delivers content based on those calculations. The CDN server at the network edge communicates with the content's origin server to make sure any content that has not been cached previously is also delivered to the user. This not only shortens the distance that content travels, but also reduces the number of hops a data packet must make. The result is less packet loss, optimized bandwidth and faster performance, which minimizes timeouts, latency and jitter and improves the overall user experience. In the event of an internet attack or outage, content hosted on a CDN server will remain available to at least some users.
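The interaction between an edge server and the origin server can be illustrated with a toy cache in Python. This is a sketch under simplifying assumptions: the `EdgeCache` class and the `origin` function are hypothetical, and the example leaves out cache expiry and invalidation, which any real CDN would need.

```python
class EdgeCache:
    """A toy edge server: serve from cache, fall back to the origin on a miss."""

    def __init__(self, fetch_from_origin):
        self._fetch_from_origin = fetch_from_origin  # callable: path -> content
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, path):
        if path in self._cache:
            self.hits += 1
            return self._cache[path]
        # Cache miss: ask the origin server, then keep a copy at the edge.
        self.misses += 1
        content = self._fetch_from_origin(path)
        self._cache[path] = content
        return content


# Stand-in for the origin server; records how often it is contacted.
origin_requests = []


def origin(path):
    origin_requests.append(path)
    return f"<content of {path}>"


edge = EdgeCache(origin)
first = edge.get("/index.html")   # miss: fetched from the origin
second = edge.get("/index.html")  # hit: served from the edge cache
```

Only the first request reaches the origin; the second is answered from the edge, which is where the savings in distance and hops come from.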
Organizations buy services from CDN providers to deliver their content to their users from the nearest location. CDN providers either host content themselves or pay network operators and internet service providers (ISPs) to host CDN servers. Beyond placing servers at the network edge, CDN providers use load balancing and solid-state drives to help data reach users faster. They also work to reduce file sizes using compression and special algorithms, and they are deploying machine learning and AI to enable quicker load and transmission times.
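As a small illustration of how compression reduces file sizes, the sketch below uses gzip from Python's standard library on a made-up, repetitive payload. Gzip is chosen here only because it is widely available; the text does not say which algorithms any given CDN provider uses.

```python
import gzip

# A repetitive HTML-like payload, made up for illustration.
payload = ("<li class='item'>product</li>\n" * 500).encode("utf-8")

compressed = gzip.compress(payload)
restored = gzip.decompress(compressed)

print(f"original: {len(payload)} bytes, compressed: {len(compressed)} bytes")
```

Highly repetitive text such as this compresses very well; already-compressed media such as video would see far less benefit.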
History of CDNs
The first CDN was launched in 1998 by Akamai Technologies soon after the public internet was created. Akamai's original techniques serve as the foundation of today's content distribution networks. Because content creators realized they needed to find a way to reduce the time it took to deliver information to users, CDNs were seen as a way to improve network performance and to use bandwidth efficiently. That basic premise remains important, as the amount of online content continues to grow.
So-called first-generation CDNs specialized in e-commerce transactions, software downloads, and audio and video streaming. As cloud and mobile computing gained traction, second-generation CDN services evolved to enable the efficient delivery of more complex multimedia and web content to a wider community of users via a more diverse mix of devices. As internet use grew, the number of CDN providers multiplied, as did the services CDN companies offer.
New CDN business models also include a variety of pricing methods that range from charges per usage and volume of content delivered to a flat rate or free for basic services, with add-on fees for additional performance and optimization services. A wide variety of organizations use CDN services to accelerate static and dynamic content, online gaming and mobile content delivery, streaming video and a number of other uses.
What are the main benefits of using a CDN?
The primary benefits of traditional CDN services include the following:
Improved webpage load times to prevent users from abandoning a slow-loading site or e-commerce application where purchases remain in the shopping cart;
Improved security from a growing number of services that include DDoS mitigation, WAFs and bot mitigation;
Increased content availability because CDNs can handle more traffic and avoid network failures better than the origin server that may be located several networks away from the end user; and
A diverse mix of performance and web content optimization services that complement cached site content.
How do you manage CDN security?
A representative list of CDN providers in this growing market includes the following:
A wide variety of organizations use CDNs to meet their businesses' performance and security needs. The need for CDN services is growing, as websites offer more streaming video, e-commerce applications and cloud-based applications, where high performance is essential.
CDN technology is also an ideal method to distribute web content that experiences surges in traffic, because distributed CDN servers can handle sudden bursts of client requests at one time over the internet. For example, spikes in internet traffic due to a popular event, like online streaming video of a presidential inauguration or a live sports event, can be spread out across the CDN, making content delivery faster and less likely to fail due to server overload.
AWS GPU instance type slashes cost of streaming apps
The cost of graphics acceleration can often make the technology prohibitive, but a new AWS GPU instance type for AppStream 2.0 makes that process more affordable.
Amazon AppStream 2.0, which enables enterprises to stream desktop apps from AWS to an HTML5-compatible web browser, delivers graphics-intensive applications for workloads such as creative design, gaming and engineering that rely on DirectX, OpenGL or OpenCL for hardware acceleration. The managed AppStream service eliminates the need for IT teams to recode applications to be browser-compatible.
The newest AWS GPU instance type for AppStream, Graphics Design, cuts the cost of streaming graphics applications up to 50%, according to the company. AWS customers can launch Graphics Design GPU instances or create a new instance fleet with the Amazon AppStream 2.0 console or AWS software development kit. AWS’ Graphics Design GPU instances come in four sizes that range from 2-16 virtual CPUs and 7.5-61 gibibytes (GiB) of system memory, and run on AMD FirePro S7150x2 Server GPUs with AMD Multiuser GPU technology.
Developers can now also select between two types of Amazon AppStream instance fleets in a streaming environment. Always-On fleets provide instant access to apps, but charge fees for every instance in the fleet. On-Demand fleets charge fees for instances when end users are connected, plus an hourly fee, but there is a delay when an end user accesses the first application.
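The trade-off between the two fleet types can be explored with a rough cost model in Python. Everything here is an assumption for illustration: the rates are invented and are not AWS prices, and the On-Demand hourly fee is modeled as a smaller charge on instance-hours when no user is connected, which is one plausible reading of the description above.

```python
def always_on_cost(instances, hours, running_rate):
    """Every instance in the fleet is billed for every hour."""
    return instances * hours * running_rate


def on_demand_cost(instances, hours, connected_hours, running_rate, idle_fee):
    """Connected instance-hours are billed at the running rate.

    Assumption: the remaining instance-hours incur a smaller hourly fee.
    """
    total_hours = instances * hours
    idle_hours = total_hours - connected_hours
    return connected_hours * running_rate + idle_hours * idle_fee


# Hypothetical numbers: 10 instances over a 720-hour month, with users
# connected for 1,600 instance-hours in total.
always_on = always_on_cost(10, 720, running_rate=0.50)
on_demand = on_demand_cost(10, 720, connected_hours=1600,
                           running_rate=0.50, idle_fee=0.03)
```

Under these invented numbers the On-Demand fleet is cheaper because users are connected for only a fraction of the month; the gap narrows as utilization rises, and the startup delay is the price paid for the saving.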
New features and support
In addition to the new AWS GPU instance type, the cloud vendor rolled out several other features this month, including:
ELB adds network balancer. AWS Network Load Balancer helps maintain low latency during spikes on a single static IP address per Availability Zone. Network Load Balancer — the second offshoot of Elastic Load Balancing features, following Application Load Balancer — routes connections to Virtual Private Cloud-based Elastic Compute Cloud (EC2) instances and containers.
New edge locations on each coast. Additional Amazon CloudFront edge locations in Boston and Seattle improve speed and performance for end users when they interact with content via CloudFront. AWS now has 95 edge locations across 50 cities in 23 countries.
X1 instance family welcomes new member. The AWS x1e.32xlarge instance joins the X1 family of memory-optimized instances, with the most memory of any EC2 instance — 3,904 GiB of DDR4 instance memory — to help businesses reduce latency for large databases, such as SAP HANA. The instance is also AWS’ most expensive at about $16-$32 per hour, depending on the environment and payment model.
AWS Config opens up support. The AWS Config service, which enables IT teams to manage service and resource configurations, now supports both DynamoDB tables and Auto Scaling groups. Administrators can integrate those resources to evaluate the health and scalability of their cloud deployments.
Start and stop on the Spot. IT teams can now stop Amazon EC2 Spot Instances when an interruption occurs and then start them back up as needed. Previously, Spot Instances were terminated when prices rose above the user-defined level. AWS saves the EBS root device, attached volumes and the data within those volumes; those resources restore when capacity returns, and instances maintain their ID numbers.
EC2 expands networking performance. The largest instances of the M4, X1, P2, R4, I3, F1 and G3 families now use Elastic Network Adapter (ENA) to reach a maximum bandwidth of 25 Gb per second. The ENA interface enables both existing and new instances to reach this capacity, which boosts workloads reliant on high-performance networking.
New Direct Connect locations. Three new global AWS Direct Connect locations allow businesses to establish dedicated connections to the AWS cloud from an on-premises environment. New locations include: Boston, at Markley, One Summer Data Center for US-East-1; Houston, at CyrusOne West I-III data center for US-East-2; and Canberra, Australia, at NEXTDC C1 Canberra data center for AP-Southeast-2.
Role and policy changes. Several changes to AWS Identity and Access Management (IAM) aim to better protect an enterprise’s resources in the cloud. A policy summaries feature lets admins identify errors and evaluate permissions in the IAM console to ensure each action properly matches to the resources and conditions it affects. Other updates include a wizard for admins to create the IAM roles, and the ability to delete service-linked roles through the IAM console, API or CLI — IAM ensures that no resources are attached to a role before deletion.
Six new data streams. Amazon Kinesis Analytics, which enables businesses to process and query streaming data in an SQL format, has six new SQL functions to simplify data processing: STEP(), LAG(), TO_TIMESTAMP(), UNIX_TIMESTAMP(), REGEX_REPLACE() and SUBSTRING(). AWS also increased the service’s capacity to process higher data volume streams.
Get DevOps notifications. Additional notifications from AWS CodePipeline for stage or action status changes enable a DevOps team to track, manage and act on changes during continuous integration and continuous delivery. CodePipeline integrates with Amazon CloudWatch to enable Amazon Simple Notification Service messages, which can trigger an AWS Lambda function in response.
AWS boosts HIPAA eligibility. Amazon’s HIPAA Compliance Program now includes Amazon Connect, AWS Batch and two Amazon Relational Database Service (RDS) engines, RDS for SQL Server and RDS for MariaDB — all six RDS engines are HIPAA eligible. AWS customers that sign a Business Associate Agreement can use those services to build HIPAA-compliant applications.
RDS for Oracle adds features. The Amazon RDS for Oracle engine now supports Oracle Multimedia, Oracle Spatial and Oracle Locator features, with which businesses can store, manage and retrieve multimedia and multi-dimensional data as they migrate databases from Oracle to AWS. The RDS Oracle engine also added support for multiple Oracle Application Express versions, which enables developers to build applications within a web browser.
Assess RHEL security. Amazon Inspector expanded support to Red Hat Enterprise Linux (RHEL) 7.4, so IT teams can run Common Vulnerabilities & Exposures, Amazon Security Best Practices and Runtime Behavior Analysis assessments in that RHEL environment on EC2 instances.
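As an illustration of the CodePipeline notification flow in the list above, the following Python sketch shows what a Lambda function triggered by an SNS message might look like. It assumes each SNS message body is the JSON state-change event for a pipeline, stage or action, with `pipeline`, `stage` and `state` fields under `detail`; those field names and the sample values should be checked against the AWS documentation before use.

```python
import json


def handler(event, context):
    """Summarize CodePipeline state changes delivered through SNS.

    Assumes each SNS message body is a JSON state-change event with
    'pipeline', 'stage' and 'state' fields under 'detail'.
    """
    summaries = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        detail = message.get("detail", {})
        pipeline = detail.get("pipeline", "unknown")
        state = detail.get("state", "unknown")
        # "stage" is expected only on stage- and action-level events.
        stage = detail.get("stage")
        where = f"{pipeline}/{stage}" if stage else pipeline
        summaries.append(f"{where}: {state}")
    return summaries


# Hand-built sample event for local testing; field values are made up.
sample_event = {
    "Records": [{
        "Sns": {
            "Message": json.dumps({
                "detail-type": "CodePipeline Stage Execution State Change",
                "detail": {"pipeline": "my-pipeline", "stage": "Deploy",
                           "state": "FAILED"},
            })
        }
    }]
}
result = handler(sample_event, None)
```

A DevOps team could extend the handler to post the summary to a chat channel or open a ticket when a stage fails.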
BPM in cloud evolves to suit line of business, IoT
While on-premises BPM tools have caused a tug of war between lines of business and IT, the cloud helps appease both sides. Here's what to expect from this cloud BPM trend and more.
Business process management tools rise in importance as companies try to make better use -- and reuse -- of IT assets. And, when coupled with cloud, this type of software can benefit from a pay-as-you-go model for more efficient cost management, as well as increased scalability.
As a result, cloud-based BPM has become a key SaaS tool in the enterprise. Looking forward, the growth of BPM in cloud will drive three major trends that enterprise users should track.
BPM is designed to encourage collaboration between line departments and IT, but the former group often complains that BPM tools hosted in the data center favor the IT point of view in both emphasis and design. To avoid this and promote equality between these two groups, many believe that BPM tools have to move to neutral territory: the cloud.
Today, BPM supports roughly a dozen different roles and is increasingly integrated with enterprise architecture practices and models. This expands the scope of BPM software, as well as the number of non-IT professionals who use it. Collaboration and project management, for example, account for most of the new features in cloud BPM software.
Collaboration features in cloud-based BPM include project tools and integration with social networks. While business people widely use platforms like LinkedIn for social networking, IT professionals use other wiki-based tools. Expect to see a closer merger between the two.
This push for a greater line department focus in BPM could also divide the BPM suites themselves. While nearly all the cloud BPM products are fairly broad in their application, those from vendors with a CIO-level sales emphasis, such as IBM's Business Process Manager on Cloud or Appian, focus more on IT. NetSuite, on the other hand, is an example of cloud BPM software with a broader organizational target.
Software practices influence BPM
Cloud, in general, affects application design and development, which puts pressure on BPM to accommodate changes in software practices. Cloud platforms, for example, have encouraged a more component-driven vision for applications, which maps more effectively to business processes. This will be another factor that expands line department participation in BPM software.
BPM in cloud encourages line organizations to take more control over applications. The adoption of third-party tools, rather than custom development, helps them target specific business problems. This, however, is a double-edged sword: It can improve automated support for business processes but also duplicate capabilities and hinder workflow integration among organizations. IT and line departments will have to define a new level of interaction.
The third trend to watch around BPM in cloud involves internet of things (IoT) and machine-to-machine communications. These technologies presume that sensors will activate processes, either directly or through sensor-linked analytics. This poses a challenge for BPM, because it takes human judgment out of the loop and requires instead that business policies anticipate human review of events and responses. That shifts the emphasis of BPM toward automated policies, which, in the past, has led to the absorption of BPM into things like Business Process Modeling Language, and puts the focus back on IT.
In theory, business policy automation has always been within the scope of BPM. But, in practice, BPM suites have offered only basic support for policy automation or even for the specific identification of business policies. It's clear that this will change and that policy controls to guide IoT deployments will be built into cloud-based BPM.