Unstructured data is information, in many different forms, that doesn't hew to conventional data models and thus typically isn't a good fit for a mainstream relational database. Thanks to the emergence of alternative platforms for storing and managing such data, it is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications.
Traditional structured data, such as the transaction data in financial systems and other business applications, conforms to a rigid format to ensure consistency in processing and analyzing it. Sets of unstructured data, on the other hand, can be maintained in formats that aren't uniform, freeing analytics teams to work with all of the available data without necessarily having to consolidate and standardize it first. That enables more comprehensive analyses than would otherwise be possible.
Types of unstructured data
One of the most common types of unstructured data is text. Unstructured text is generated and collected in a wide range of forms, including Word documents, email messages, PowerPoint presentations, survey responses, transcripts of call center interactions, and posts from blogs and social media sites.
Other types of unstructured data include images, audio and video files. Machine data is another category, one that's growing quickly in many organizations. For example, log files from websites, servers, networks and applications -- particularly mobile ones -- yield a trove of activity and performance data. In addition, companies increasingly capture and analyze data from sensors on manufacturing equipment and other internet of things (IoT) connected devices.
In some cases, such data may be considered to be semi-structured -- for example, if metadata tags are added to provide information and context about the content of the data. The line between unstructured and semi-structured data isn't absolute, though; some data management consultants contend that all data, even the unstructured kind, has some level of structure.
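To make the role of metadata tags concrete, the following minimal Python sketch wraps a piece of free text in a tagged record. The field names (source, author, tags, body) and the sample message are hypothetical, chosen only for illustration; they are not a standard schema.

```python
import json

# Raw, unstructured content: free text with no fields or schema.
raw_email = "Hi team, the new dashboard keeps timing out when I filter by region."

# The same content wrapped in hypothetical metadata tags. The body is still
# free text, but the tags give tools something structured to filter on.
tagged_email = {
    "source": "email",
    "author": "customer-1042",
    "tags": ["dashboard", "performance"],
    "body": raw_email,
}

def has_tag(record, tag):
    """Return True if the record's metadata includes the given tag."""
    return tag in record.get("tags", [])

# Serialize to JSON, a common format for semi-structured data.
serialized = json.dumps(tagged_email, indent=2)
print(serialized)
print(has_tag(tagged_email, "performance"))
```

The text itself is unchanged; only the surrounding tags make it searchable by topic or source, which is why the boundary between the two categories is a matter of degree.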
Unstructured data analytics
Because of its nature, unstructured data isn't suited to transaction processing applications, which are the province of structured data. Instead, it's primarily used for BI and analytics. One popular application is customer analytics. Retailers, manufacturers and other companies analyze unstructured data to improve customer relationship management processes and enable more-targeted marketing; they also do sentiment analysis to identify both positive and negative views of products, customer service and corporate entities, as expressed by customers on social networks and in other forums.
Predictive maintenance is an emerging analytics use case for unstructured data. For example, manufacturers can analyze sensor data to try to detect equipment failures before they occur in plant-floor systems or finished products in the field. Energy pipelines can also be monitored and checked for potential problems using unstructured data collected from IoT sensors.
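As a rough illustration of the idea, the Python sketch below flags sensor readings that deviate sharply from a recent baseline. It is a deliberately simple stand-in, assuming a rolling z-score test; production predictive maintenance systems typically use far richer models. The function name, window size, threshold and vibration values are all hypothetical.

```python
from statistics import mean, stdev

def flag_anomalies(readings, window=5, threshold=3.0):
    """Return indexes of readings that deviate sharply from the recent baseline.

    A reading is flagged when it lies more than `threshold` standard
    deviations from the mean of the preceding `window` readings.
    """
    flagged = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        # Skip perfectly flat baselines to avoid dividing by zero.
        if sigma == 0:
            continue
        if abs(readings[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Hypothetical vibration readings from one machine; the spike is at index 7.
vibration = [0.50, 0.52, 0.49, 0.51, 0.50, 0.52, 0.51, 0.95, 0.50, 0.51]
print(flag_anomalies(vibration))  # [7]
```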
Analyzing log data from IT systems highlights usage trends, identifies capacity limitations and pinpoints the cause of application errors, system crashes, performance bottlenecks and other issues. Unstructured data analytics also aids regulatory compliance efforts, particularly in helping organizations understand what corporate documents and records contain.
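The kind of log analysis described above can be sketched in a few lines of Python. The log format, the sample entries and the component names here are assumptions made for illustration; real logs vary widely, which is part of what makes them unstructured.

```python
import re
from collections import Counter

# Hypothetical log format: "<date> <time> <LEVEL> <component>: <message>"
LOG_PATTERN = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<level>[A-Z]+) (?P<component>[\w-]+): (?P<message>.*)$"
)

sample_log = """\
2018-06-01 10:00:01 INFO web: GET /home 200
2018-06-01 10:00:02 ERROR db: connection pool exhausted
2018-06-01 10:00:03 WARN web: slow response 2300ms
2018-06-01 10:00:04 ERROR db: connection pool exhausted
2018-06-01 10:00:05 ERROR auth: token expired
this line does not match the expected format
"""

def summarize(log_text):
    """Count log entries by level and errors by component."""
    levels = Counter()
    errors_by_component = Counter()
    for line in log_text.splitlines():
        match = LOG_PATTERN.match(line)
        if match is None:
            # Free-form lines are common in logs; skip rather than fail.
            continue
        levels[match["level"]] += 1
        if match["level"] == "ERROR":
            errors_by_component[match["component"]] += 1
    return levels, errors_by_component

levels, errors_by_component = summarize(sample_log)
print(levels.most_common())
print(errors_by_component.most_common(1))  # [('db', 2)]
```

Counting errors per component is one simple way to point at the likely source of a problem; here the hypothetical database component accounts for most of the errors.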
Unstructured data techniques and platforms
Analyst firms report that the vast majority of new data being generated is unstructured. In the past, that type of information often was locked away in siloed document management systems, individual manufacturing devices and the like -- making it what's known as dark data, unavailable for analysis.
But things changed with the development of big data platforms, primarily Hadoop clusters, NoSQL databases and the Amazon Simple Storage Service (S3). They provide the required infrastructure for processing, storing and managing large volumes of unstructured data without the imposition of a common data model and a single database schema, as in relational databases and data warehouses.
A variety of analytics techniques and tools are used to analyze unstructured data in big data environments. Text analytics tools look for patterns, keywords and sentiment in textual data; at a more advanced level, natural language processing technology is a form of artificial intelligence that seeks to understand meaning and context in text and human speech, increasingly with the aid of deep learning algorithms that use neural networks to analyze data. Other techniques that play roles in unstructured data analytics include data mining, machine learning and predictive analytics.
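At the simplest end of that spectrum, keyword and sentiment detection can be approximated with word counts. The Python sketch below is a toy example under stated assumptions: the word lists, stopwords and reviews are hypothetical, and the scoring is a basic lexicon count rather than the natural language processing or deep learning approaches mentioned above.

```python
import re
from collections import Counter

# Tiny hypothetical word lists; real tools rely on far larger lexicons
# or on trained models.
POSITIVE = {"great", "love", "fast", "helpful"}
NEGATIVE = {"slow", "broken", "rude", "terrible"}
STOPWORDS = {"the", "a", "is", "was", "and", "but", "i", "it", "to", "my"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())

def top_keywords(texts, n=3):
    """Return the n most frequent non-stopword tokens across all texts."""
    counts = Counter(
        word for text in texts for word in tokenize(text) if word not in STOPWORDS
    )
    return counts.most_common(n)

def sentiment_score(text):
    """Positive-word count minus negative-word count for one text."""
    words = tokenize(text)
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

reviews = [
    "I love the new app, support was helpful and fast.",
    "The app is slow and the checkout is broken.",
    "Support was rude but the app is great.",
]

print(top_keywords(reviews, 2))  # [('app', 3), ('support', 2)]
for review in reviews:
    print(sentiment_score(review), review)
```

The third review scores zero because one positive and one negative word cancel out, which hints at why word counting alone misses context and why more advanced techniques are used.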
Data science teams need the right skills and solid processes
For data scientists, big data systems and AI-enabled advanced analytics technologies open up new possibilities to help drive better business decision-making. "Like never before, we have access to data, computing power and rapidly evolving tools," Forrester Research analyst Kjell Carlsson wrote in a July 2017 blog post.
The downside, Carlsson added, is that many organizations "are only just beginning to crack the code on how to unleash this potential." Often, that isn't due to a lack of internal data science skills, he said in a June 2018 blog; it's because companies treat data science as "an artisanal craft" instead of a well-coordinated process that involves analytics teams, IT and business units.
Of course, possessing the right data science skills is a prerequisite to making such processes work. The list of skills that LinkedIn's analytics and data science team wants in job candidates includes the ability to manipulate data, design experiments with it and build statistical and machine learning models, according to Michael Li, who heads the team.
But softer skills are equally important, Li said in an April 2018 blog. He cited communication, project management, critical thinking and problem-solving skills as key attributes. Being able to influence decision-makers is also an important part of "the art of being a data scientist," he wrote.
The problem is that such skills requirements are often "completely out of reach for a single person," Miriam Friedel wrote in a September 2017 blog when she was director and senior scientist at consulting services provider Elder Research. Friedel, who has since moved on to software vendor Metis Machine as data science director, suggested in the blog that instead of looking for the proverbial individual unicorn, companies should build "a team unicorn."
This handbook more closely examines that team-building approach as well as critical data science skills for the big data and AI era.
Reskilling the analytics team: Math, science and creativity
Technical skills are a must for data scientists. But for analytics teams to succeed, their members also need to think creatively, work in harmony and be good communicators.
In a 2009 study of its employee data, Google discovered that the top seven characteristics of a successful manager at the company didn't involve technical expertise. For example, they included being a good coach and an effective communicator, having a clear vision and strategy, and empowering teams without micromanaging them. Technical skills were No. 8.
Google's list, which was updated this year to add collaboration and strong decision-making capabilities as two more key traits, applies specifically to its managers, not to technical workers. But the findings from the study, known as Project Oxygen, are also relevant to building an effective analytics team.
Obviously, STEM skills are incredibly important in analytics. But as Google's initial and subsequent studies have shown, they aren't the whole or even the most important part of the story. As an analytics leader, I'm very glad that someone has put numbers to all this, but I've always known that the best data scientists are also empathetic and creative storytellers.
According to the latest employment projections report by the U.S. Bureau of Labor Statistics, statisticians are in high demand. Among occupations that currently employ at least 25,000 people, statistician ranks fifth in projected growth rate; it's expected to grow by 33.8% from 2016 to 2026. For context, the average rate of growth that the statistics bureau forecasts for all occupations is 7.4%. And with application software developers as the only other exception, all of the other occupations in the top 10 are in the healthcare or senior care verticals, which is consistent with an aging U.S. population.
Statistician is fifth among occupations with at least 25,000 workers projected to grow at the fastest rates.
Thanks to groundbreaking innovations in technology and computing power, the world is producing more data than ever before. Businesses are using actionable analytics to improve their day-to-day processes and drive diverse functions like sales, marketing, capital investment, HR and operations. Statisticians and data scientists are making that possible, using not only their mathematical and scientific skills, but also creativity and effective communication to extract and convey insights from the new data resources.
In 2017, IBM partnered with job market analytics software vendor Burning Glass Technologies and the Business-Higher Education Forum on a study that showed how the democratization of data is forcing change in the workforce. Without diving into the minutiae, I gathered from the study that with more and more data now available to more and more people, the insights garnered from the data set you apart as an employee -- or as a company.
Developing and encouraging our analytics team
The need to find and communicate these insights influences how we hire and train our up-and-coming analytics employees at Dun & Bradstreet. Our focus is still primarily on mathematics, but we also consider other characteristics like critical- and innovative-thinking abilities as well as personality traits, so our statisticians and data scientists are effective in their roles.
Our employees have the advantage of working for a business-to-business company that has incredibly large and varied data sets -- containing more than 300 million business records -- and a wide variety of customers that are interested in our analytics services and applications. They get to work on a very diverse set of business challenges, share cutting-edge concepts with data scientists in other companies and develop creative solutions to unique problems.
Our associates are encouraged to pursue new analytical models and data analyses, and we have special five-day sprints where we augment and enhance some of the team's more creative suggestions. These sprints not only challenge the creativity of our data analysts, but also require them to work on their interpersonal and communication skills while developing these applications as a group.
Socializing the new, creative data analyst
It's very important to realize that some business users aren't yet completely comfortable with a well-rounded analytics team. For the most part, when bringing in an analyst, they're looking for confirmation of a hypothesis rather than a full analysis of the data at hand.
If that's the case in your organization, then be persistent. As your team continues to present valuable insights and creative solutions, your peers and business leaders across the company will start to seek guidance from data analysts as partners in problem-solving much more frequently and much earlier in their decision-making processes.
As companies and other institutions continue to amass data exponentially and rapid technological changes continue to affect the landscape of our businesses and lives, growing pains will inevitably follow. Exceptional employees who have creativity and empathy, in addition to mathematical skills, will help your company thrive through innovation. Hopefully, you have more than a few analysts who possess those capabilities. Identify and encourage them -- and give permission to the rest of your analytics team to think outside the box and rise to the occasion.
Data scientist vs. business analyst: What's the difference?
Data science and business analyst roles differ in that data scientists must dive deep into data and come up with unique business solutions -- but the distinctions don't end there.
What is the difference between data science and business analyst jobs? And what kind of training or education is required to become a data scientist?
There are a number of differences between data scientists and business analysts, the two most common business analytics roles, but at a high level, you can think about the distinction as similar to the one between a medical researcher and a lab technician. One uses experimentation and the scientific method to search out new, potentially groundbreaking discoveries, while the other applies existing knowledge in an operational context.
The data scientist vs. business analyst distinction comes down to the realms they inhabit. Data scientists delve into big data sets and use experimentation to discover new insights in data. Business analysts, on the other hand, typically use self-service analytics tools to review curated data sets, build reports and data visualizations, and report targeted findings -- things like revenue by quarter or sales needed to hit targets.
What does a data scientist do?
A data scientist takes analytics and data warehousing programs to the next level: What does the data really say about the company, and is the company able to distinguish relevant data from irrelevant data?
A data scientist should be able to leverage the enterprise data warehouse to dive deeper into the data that comes out of it or to analyze new types of data stored in Hadoop clusters and other big data systems. A data scientist doesn't just report on data as a classic business analyst does; a data scientist also delivers business insights based on the data.
A data scientist job also requires a strong business sense and the ability to communicate data-driven conclusions to business stakeholders. Strong data scientists don't just address business problems; they also pinpoint the problems that have the most value to the organization. In that sense, a data scientist plays a more strategic role within an organization.
Data scientist education, skills and personality traits
Data scientists look through all the available data with the goal of discovering a previously hidden insight that, in turn, can provide a competitive advantage or address a pressing business problem. Data scientists do not simply collect and report on data -- they also look at it from many angles, determine what it means and then recommend ways to apply the data. These insights could lead to a new product or even an entirely new business model.
Data scientists apply advanced machine learning models to automate processes that previously took too long or were inefficient. They use data processing and programming tools -- often open source, like Python, R and TensorFlow -- to develop new applications that take advantage of advances in artificial intelligence. These applications may perform a task such as transcribing calls to a customer service line using natural language processing or automatically generating text for email campaigns.
What does a business analyst do?
A business analyst -- a title often used interchangeably with data analyst -- focuses more on delivering operational insights to lines of business using smaller, more targeted data sets. For example, a business analyst tied to a sales team will work primarily with sales data to see how individual team members are performing, to identify members who might need extra coaching and to search for other areas where the team can improve on its performance.
Business analysts typically use self-service analytics and data visualization tools. Using these tools, business analysts can build reports and dashboards that team members can use to track their performance. Typically, the information contained in these reports is retrospective rather than predictive.
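Self-service tools usually produce such reports through a graphical interface, but the underlying aggregation is simple. The Python sketch below shows a retrospective revenue-by-quarter summary of the kind described above; the sales records, names and amounts are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical curated sales records: (sale date, sales rep, amount).
sales = [
    (date(2018, 1, 15), "Ana", 1200.0),
    (date(2018, 2, 3), "Ben", 800.0),
    (date(2018, 4, 20), "Ana", 1500.0),
    (date(2018, 5, 9), "Ben", 700.0),
    (date(2018, 8, 30), "Ana", 900.0),
]

def quarter_of(d):
    """Map a date to a label such as '2018-Q1'."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"

def revenue_by_quarter(records):
    """Sum sale amounts per calendar quarter."""
    totals = defaultdict(float)
    for sale_date, _rep, amount in records:
        totals[quarter_of(sale_date)] += amount
    return dict(sorted(totals.items()))

report = revenue_by_quarter(sales)
for quarter, total in report.items():
    print(f"{quarter}: {total:,.2f}")
```

The report only summarizes what has already happened; nothing in it forecasts future quarters, which is the retrospective character noted above.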
Data scientist vs. business analyst training, tools and trends
To become a business analyst, you need a familiarity with statistics and the fundamentals of data analysis, but there are many self-service analytics tools that do the mathematical heavy lifting for you. Of course, you have to know if it's statistically meaningful to join two separate data sets, and you have to understand the distinction between correlation and causation. But, on the whole, a deep background in mathematics is unnecessary.
To become a data scientist, on the other hand, you need a strong background in math. This is one of the primary differences in the question of data scientists vs. business analysts.
Many data scientists have doctorates in some field of math. Many have backgrounds in physics or other advanced sciences that lean heavily on statistical inference.
Business analysts can generally pick up the technical skills they need on the job. Whether an enterprise uses Tableau, Qlik or Power BI -- the three most common self-service analytics platforms -- or another tool, most use graphical user interfaces that are designed to be intuitive and easy to pick up.
Data science jobs require more specific technical training. In addition to advanced mathematical education, data scientists need deep technical skills. They must be proficient in several common coding languages -- including Python, SQL and Java -- which enable them to run complex machine learning models against big data stored in Hadoop or other distributed data management platforms. Most often, data scientists pick up these skills from a college-level computer science curriculum.
However, trends in data analytics are beginning to collapse the line between data science and data analysis. Increasingly, software companies are introducing platforms that can automate complex tasks using machine learning. At the same time, self-service software supports deeper analytical functionality, meaning data scientists are increasingly using tools that were once solely for business analysts.
Companies often report the highest analytics success when blending teams, so data scientists working alongside business analysts can produce operational benefits. This means that the data scientist vs. business analyst distinctions could become less important as time goes on -- a trend that may pay off for enterprises.
Hiring vs. training data scientists: The case for each approach
Hiring data scientists is easier said than done -- so should you try to train current employees in data science skills? That depends on your company's needs, writes one analytics expert.
Companies are faced with a dilemma on big data analytics initiatives: whether to hire data scientists from outside or train current employees to meet new demands. In many cases, realizing big data's enormous untapped potential brings the accompanying need to increase data science skills -- but building up your capacity can be tricky, especially in a crowded market of businesses looking for analytics talent.
Even with a shortage of available data scientists, screening and interviewing for quality hires is time- and resource-intensive. Alternatively, training data scientists from within may be futile if internal candidates don't have the fundamental aptitude.
At The Data Incubator, we've helped hundreds of companies train employees on data science and hire new talent -- and, often, we've aided organizations in handling the tradeoffs between the two approaches. Based on the experiences we've had with our corporate clients, you should consider the following factors when deciding which way to go.
New hires bring in new thinking
The main benefit of hiring rather than training data scientists comes from introducing new ideas and capabilities into your organization. What you add may be technical in nature: For example, are you looking to adopt advanced machine learning techniques, such as neural networks, or to develop real-time customer insights by using Spark Streaming? It may be cultural, too: Do you want an agile data science team that can iterate rapidly -- even at the expense of "breaking things," in Facebook's famous parlance? Or one that can think about data creatively and find novel approaches to using both internal and external information?
At other times, it's about having a fresh set of eyes looking at the same problems. Many quant hedge funds intentionally hire newly minted STEM Ph.D. holders -- people with degrees in science, technology, engineering or math -- instead of industry veterans precisely to get a fresh take on financial markets. And it isn't just Wall Street; in other highly competitive industries, too, new ideas are the most important currency, and companies fight for them to remain competitive.
How a company sources new talent can also require some innovation, given the scarcity of skilled data scientists. Kaggle and other competition platforms can be great places to find burgeoning data science talent. The public competitions on Kaggle are famous for bringing unconventional stars and unknown whiz kids into the spotlight and demonstrating that the best analytics performance may come from out of left field.
Similarly, we've found that economists and other social scientists often possess the same strong quantitative skill sets as their traditional STEM peers, but are overlooked by HR departments and hiring managers alike.
Training adds to existing expertise
In other cases, employers may value industry experience first and foremost. Domain expertise is complex, intricate and difficult to acquire in some industries. Such industries often already have another science at their core. Rocketry, mining, chemicals, oil and gas -- these are all businesses in which knowledge of the underlying science is more important than data science know-how.
Highly regulated industries are another case in point. Companies facing complex regulatory burdens must often meet very specific, and frequently longstanding, requirements. Banks must comply with financial risk testing and with statutes that were often written decades ago. Similarly, the drug approval process in healthcare is governed by a complex set of immutable rules. While there is certainly room for innovation via data science and big data in these fields, it is constrained by regulations.
Companies in this position often find training data scientists internally to be a better option for developing big data analytics capabilities than hiring new talent. For example, at The Data Incubator, we work with a large consumer finance institution that was looking for data science capabilities to help enhance its credit modeling. But its ideal candidate profile for that job was very different from the ones sought by organizations looking for new ideas on business operations or products and services.
The relevant credit data comes in slowly: Borrowers who are initially reliable could become insolvent months or years after the initial credit decision, which makes it difficult to predict defaults without a strong credit model. And wrong decisions are very expensive: Loan defaults result in direct hits to the company's profitability. In this case, we worked with the company to train existing statisticians and underwriters on complementary data science skills around big data.
Of course, companies must be targeted in selecting training candidates. They often start by identifying employees who possess strong foundational skills for data science -- things like programming and statistics experience. Suitable candidates go by many titles, including statisticians, actuaries and quantitative analysts, more popularly known as quants.
Find the right balance for your needs
For many companies, weighing the options for hiring or training data scientists comes down to understanding their specific business needs, which can vary even in different parts of an organization. It's worth noting that the same financial institution that trained its staffers to do analytics for credit modeling also hired data scientists for its digital marketing team.
Without the complex regulatory requirements imposed on the underwriting side, the digital marketing team felt it could more freely innovate -- and hence decided to bring in new blood with new ideas. These new hires are now building analytical models that leverage hundreds of data signals and use advanced AI and machine learning techniques to more precisely target marketing campaigns at customers and better understand the purchase journeys people take.
Ultimately, the decision of whether to hire or train data scientists must make sense for an organization. Companies must balance the desire to innovate with the need to incorporate existing expertise and satisfy regulatory requirements. Getting that balance right is a key step in a successful data science talent strategy.
Self-service business intelligence (SSBI) is an approach to data analytics that enables business users to access and work with corporate data even though they do not have a background in statistical analysis, business intelligence (BI) or data mining. Allowing end users to make decisions based on their own queries and analyses frees up the organization's business intelligence and information technology (IT) teams from creating the majority of reports and allows those teams to focus on other tasks that will help the organization reach its goals.
Because self-service BI software is used by people who may not be tech-savvy, it is imperative that the user interface (UI) for BI software be intuitive, with a dashboard and navigation that are user-friendly. Ideally, training should be provided to help users understand what data is available and how that information can be queried to make data-driven decisions to solve business problems. But once the IT department has set up the data warehouse and data marts that support the business intelligence system, business users should be able to query the data and create personalized reports with very little effort.
While self-service BI encourages users to base decisions on data instead of intuition, the flexibility it provides can cause unnecessary confusion if there is not a data governance policy in place. Among other things, the policy should define what the key metrics for determining success are, what processes should be followed to create and share reports, what privileges are necessary for accessing confidential data and how data quality, security and privacy will be maintained.
Explore the data discovery software market, including the products and vendors helping enterprises glean insights using data visualization and self-service BI.
Turning data into business insight is the ultimate goal. It's not about gathering as much data as possible; it's about applying tools and making discoveries that help a business succeed. The data discovery software market includes a range of software and cloud-based services that can help organizations gain value from their constantly growing information resources.
These products fall within the broad BI category, and at their most basic, they search for patterns within data and data sets. Many of these tools use visual presentation mechanisms, such as maps and models, to highlight patterns or specific items of relevance. The tools deliver visualizations to users, including nontechnical workers, such as business analysts, via dashboards, reports, charts and tables.
The big benefit here: data discovery tools provide detailed insights gleaned from data to better inform business decisions. In many cases, the tools accomplish this with limited IT involvement because the products offer self-service features.
Using extensive research into the data discovery software market, TechTarget editors focused on the data discovery software vendors that lead in market share, plus those that offer traditional and advanced functionality. Our research included data from TechTarget surveys, as well as reports from respected research firms, including Gartner and Forrester.
Alteryx Connect
Alteryx Inc.'s Connect markets itself as a collaborative data exploration and data cataloging platform for the enterprise that changes how information workers discover, prioritize and analyze all the relevant information within an organization.
Alteryx Connect key features include:
Data Asset Catalog, which collects metadata from information systems, enabling better organization of relevant data;
Business Glossary, which defines standard business terms in a data dictionary and links them to assets in the catalog; and
Data Discovery, which lets users discover the information they need through search capabilities.
Other features include:
Data Enrichment and Collaboration, which allows users to annotate, discuss and rate information to offer business context and provide an organization with relevant data; and
Certification and Trust, which provides insights into information asset trustworthiness through certification, lineage and versioning.
Alteryx touts these features as decreasing the time necessary to gain insight and supporting faster, data-driven decisions by improving collaboration, enhancing analytic productivity and ensuring data governance.
Domo
Domo Inc. provides a single-source system for end-to-end data integration and preparation, data discovery, and sharing in the cloud. It's mobile-focused, and it doesn't require you to integrate desktop software, third-party tools or on-premises servers.
With more than 500 native connectors, Domo designed the platform for quick and easy access to data from across the business, according to the company. It contains a central repository that ingests the data and aids version and access control.
Domo also provides one workspace from which people can choose and explore all the data sets available to them in the platform.
Data discovery capabilities include Data Lineage, a path-based view that clarifies data sources. This feature also enables simultaneous display of data tables alongside visualizations, aiding insight discovery, as well as card-based publishing and sharing.
GoodData Enterprise Insights Platform
GoodData Corp.'s cloud-based Enterprise Insights Platform is an end-to-end data discovery software platform that gathers data and user decisions, transforming them into actionable insights for line-of-business users.
The platform provides insights in the form of recommendations and predictive analytics with the goal of delivering the analytics that matter most for real-time decision-making. Customers, partners and employees see information that is relevant to the decision at hand, presented in what GoodData claims is a personalized, contextual, intuitive and actionable form. Users can also integrate these insights directly into applications.
IBM Watson Explorer
IBM has a host of data discovery products, and one of the key offerings is IBM Watson Explorer. It's a cognitive exploration and content analysis platform that enables business users to easily explore and analyze structured, unstructured, internal, external and public data for trends and patterns.
Organizations have used Watson Explorer to understand 100% of incoming calls and emails, to improve the quality of information, and to enhance their ability to use that information, according to IBM.
Machine learning models, natural language processing and next-generation APIs combine to help organizations unlock value from all of their data and gain a secure, 360-degree view of their customers, in context, according to the company.
The platform also enables users to classify and score structured and unstructured data with machine learning to reach the most relevant information. And a new mining application gives users deep insights into structured and unstructured data.
Informatica LLC offers multiple data management products powered by its Claire engine as part of its Intelligent Data Platform. The Claire engine is a metadata-driven AI technology that automatically scans enterprise data sets and exploits machine learning algorithms to infer relationships about the data structure and provide recommendations and insights. By augmenting end users' individual knowledge with AI, organizations can discover more data from more users in the enterprise, according to the company.
Another component, Informatica Enterprise Data Catalog, scans and catalogs data assets across the enterprise to deliver recommendations, suggestions and data management task automation. Semantic search and dynamic facet capabilities allow users to filter search results and get data lineage, profiling statistics and holistic relationship views.
Informatica Enterprise Data Lake enables data analysts to quickly find data using semantic and faceted search and to collaborate with one another in shared project workspaces. Machine learning algorithms recommend alternative data sets. Analysts can sample and prepare data sets in an Excel-like data preparation interface and then operationalize that preparation work as reusable workflows.
Information Builders WebFocus
Information Builders claims its WebFocus data discovery software platform helps companies use BI and analytics strategically across and beyond the enterprise.
The platform includes a self-service visual discovery tool that enables nontechnical business users to conduct data preparation; visually analyze complex data sets; generate sophisticated data visualizations, dashboards, and reports; and share content with other users. Its extensive visualization and charting capabilities provide an approach to self-service discovery that supports any type of user, Information Builders claims.
Information Builders offers a number of tools related to the WebFocus BI and analytics platform that provide enterprise-grade analytics and data discovery. One is WebFocus InfoApps, which can take advantage of custom information applications designed to enable nontechnical users to rapidly gather insights and explore specific business contexts. InfoApps can include parameterized dashboards, reports, charts and visualizations.
Another tool, WebFocus InfoAssist, provides governed self-service reporting, analysis and discovery capabilities to nontechnical users. The product offers a self-service BI capability for immediate data access and analysis.
Microsoft Power BI
Microsoft Power BI is a cloud-based business analytics service that enables users to visualize and analyze data. The same users can distribute data insights anytime, anywhere, on any device in just a few clicks, according to the company.
As a BI and analytics SaaS tool, Power BI equips users across an organization to build reports with colleagues and share insights. It connects to a broad range of live data through dashboards, provides interactive reports and delivers visualizations that include KPIs from data on premises and in the cloud.
Organizations can use machine learning to automatically scan data and gain insights, ask questions of the data using natural language queries, and take advantage of more than 140 free custom visuals created by the user community.
Power BI applications include dashboards with prebuilt content for cloud services, including Salesforce, Google Analytics and Dynamics 365. It also integrates seamlessly with Microsoft products, such as Office 365, SharePoint, Excel and Teams.
Organizations can start by downloading Power BI Desktop for free, while Power BI Pro and Premium offer several licensing options for companies that want to deploy Power BI across their organization.
MicroStrategy Desktop Client
MicroStrategy Ltd. designed its Desktop client to deliver self-service BI and help business users or departmental analysts analyze data with out-of-the-box visualizations. Data discovery capabilities are available via Mac or Windows PC web browsers and native mobile apps for iOS and Android.
All the interfaces are consistent and users can promote content between the interfaces. With the MicroStrategy Desktop client, business users can visualize data on any chart or graph, including natural language generation narratives, Google Charts, geospatial maps and data-driven documents visualizations.
They can access data from more than 100 data sources, including spreadsheets, RDBMS, cloud systems, and more; prepare, blend, and profile data with graphical interfaces; share data as a static PDF or as an interactive dashboard file; and promote offline content to a server and publish governed and certified dashboards.
OpenText EnCase Risk Manager
OpenText EnCase Risk Manager enables organizations to understand the sensitive data they have in their environment, where the data exists and its value.
The data discovery software platform helps organizations identify, categorize and remediate sensitive information across the enterprise, whether that information exists in the form of personally identifiable customer data, financial records or intellectual property. EnCase Risk Manager provides the ability to search for standard patterns, such as national identification numbers and credit card data, with the ability to discover entirely unique or proprietary information specific to a business or industry.
Risk Manager is platform-agnostic and able to identify this information throughout the enterprise wherever structured or unstructured data is stored, be that on endpoints, servers, cloud repositories, SharePoint or Exchange. Pricing starts at $60,000.
Oracle Big Data Discovery
Oracle Big Data Discovery enables users to find, explore and analyze big data. They can use the platform to discover new insights from data and share results with other tools and resources in a big data ecosystem, according to the company.
The platform uses Apache Spark, and Oracle claims it's designed to speed time to completion, make big data more accessible to business users across an organization and decrease the risks associated with big data projects.
Big Data Discovery provides rapid visual access to data through an interactive catalog of the data; loads local data from Excel and CSV files through self-service wizards; provides data set summaries, annotations from other users, and recommendations for related data sets; and enables search and guided navigation.
Together with statistics about each individual attribute in any data set, these capabilities expose the shape of the data, according to Oracle, enabling users to understand data quality, detect anomalies, uncover outliers and ultimately determine potential. Organizations can use the platform to visualize attributes by data type; glean which are the most relevant; sort attributes by potential, so the most meaningful information displays first; and use a scratchpad to uncover potential patterns and correlations between attributes.
Qlik Sense
Qlik Sense is Qlik's next-generation data discovery software platform for self-service BI. It supports a full range of analytics use cases including self-service visualization and exploration, guided analytics applications and dashboards, custom and embedded analytics, mobile analytics, and reporting, all within a governed, multi-cloud architecture.
It offers analytics capabilities for all types of users, including associative exploration and search, smart visualizations, self-service creation and data preparation, geographic analysis, collaboration, storytelling, and reporting. The platform also offers fully interactive online and offline mobility and an insight advisor that generates relevant charts and insights using AI.
The product can readily integrate streaming data sources from IoT, social media and messaging with at-rest data for real-time contextual analysis.
Freely distributed accelerators include product templates to help users get to production quickly.
Tibco's Insight Platform combines live streaming data with queries on large at-rest volumes. Historical patterns are interactively identified with Spotfire, running directly against Hadoop and Spark. The Insight Platform can then apply these patterns to streaming data for predictive and operational insights.
For the enterprise, Qlik Sense provides a platform that includes open and standard APIs for customization and extension, data integration scripting, broad data connectivity and data-as-a-service, centralized management and governance, and a multi-cloud architecture for scalability across on-premises environments, as well as private and public cloud environments.
Qlik Sense runs on the patented Qlik Associative Engine, which allows users to explore information without query-based tools. And the new Qlik cognitive engine works with the associative engine to augment the user, offering insight suggestions and automation in context with user behavior.
Qlik Sense is available in cloud and enterprise editions.
Salesforce Einstein Discovery
Salesforce's Einstein Discovery, an AI-powered feature within the Einstein Analytics portfolio, allows business users to automatically analyze millions of data points to understand their current business, explore historical trends, and automatically receive guided recommendations on what they can do to expand deals or resolve customer service cases faster.
Einstein Discovery for Analysts lets users analyze data in Salesforce CRM, CSV files or data from external data sources. In addition, users can take advantage of smart data preparation capabilities to make data improvements, run analyses to create stories, further explore these stories in Einstein Analytics for advanced visualization capabilities, and push insights into Salesforce objects for all business users.
Einstein Discovery for Business Users provides access to insights in natural language and into Salesforce -- within Sales Cloud or Service Cloud, for example. Einstein Discovery for Analysts is available for $2,000 per user, per month. Einstein Discovery for Business Users is $75 per user, per month.
SAS Visual Analytics
SAS Institute Inc.'s Visual Analytics on SAS Viya provides interactive data visualizations to help users explore and better understand data.
The product provides a scalable, in-memory engine along with a user-friendly interface, SAS claims. The combination of interactive data exploration, dashboards, reporting and analytics is designed to help business users find valuable insights without coding. Any user can assess probable outcomes and make more informed, data-driven decisions.
SAS Visual Analytics capabilities include:
automated forecasting, so users can select the most appropriate forecasting method to suit the data;
scenario analysis, which identifies important variables and how changes to them can influence forecasts;
goal-seeking to determine the values of underlying factors that would be required to achieve the target forecast; and
decision trees, allowing users to create a hierarchical segmentation of the data based on a series of rules applied to each observation.
Other features include network diagrams so users can see how complex data is interconnected; path analysis, which displays the flow of data from one event to another as a series of paths; and text analysis, which applies sentiment analysis to video, social media streams or customer comments to provide quick insights into what's being discussed online.
SAP Analytics Cloud
SAP's Analytics Cloud service offers analytics capabilities for all users, including discovery, analysis, planning, prediction and collaboration, in one integrated cloud platform, according to SAP.
The service gives users business insights based on its ability to turn embedded data analytics into business applications, the company claims.
Among the potential benefits:
enhanced user experience with the service's visualization and role-based personalization features;
better business results from deep collaboration and informed decisions due to SAP's ability to integrate with existing on-premises applications; and
simplified data across an organization to ensure faster, fact-based decision-making.
In addition, Analytics Cloud is free from operating system constraints, download requirements and setup tasks. It provides real-time analytics and extensibility using SAP Cloud Platform, which can reduce the total cost of ownership because all the features are offered in one SaaS product for all users.
Sisense Ltd. is an end-to-end platform that ingests data from a variety of sources before analyzing, mashing and visualizing it. Its open API framework also enables a high degree of customization without the input of designers, data scientists or IT specialists, according to Sisense.
The Sisense analytics engine runs 10 to 100 times faster than in-memory platforms, according to the company, dealing with terabytes of data and potentially eliminating onerous data preparation work. The platform provides business insights augmented by machine learning and anomaly detection. In addition, the analytics tool offers the delivery of insights beyond the dashboard, offering new forms of BI access, including chatbots and autonomous alerts.
Tableau Software Inc.'s Desktop is a visual analytics and data discovery software platform that lets users see and understand their data with drag-and-drop simplicity, according to the company. Users can create interactive visualizations and dashboards to gain immediate insights without the need for any programming. They can then share their findings with colleagues.
Tableau Desktop can connect to an organization's data in the cloud, on premises or both using one of 75 native data connectors or Tableau's Web Data Connector. These include connectors to data sources such as Amazon Redshift, Google BigQuery, SQL Server, SAP and Oracle, plus applications such as Salesforce and ServiceNow.
Tibco Software Inc.'s Spotfire is an enterprise analytics platform that connects to and blends data from files, relational and NoSQL databases, OLAP, Hadoop and web services, as well as to cloud applications such as Google Analytics and Salesforce.
Operational intelligence (OI) is an approach to data analysis that enables decisions and actions in business operations to be based on real-time data as it's generated or collected by companies. Typically, the data analysis process is automated, and the resulting information is integrated into operational systems for immediate use by business managers and workers.
OI applications are primarily targeted at front-line workers who, hopefully, can make better-informed business decisions or take faster action on issues if they have access to timely business intelligence (BI) and analytics data. Examples include call-center agents, sales representatives, online marketing teams, logistics planners, manufacturing managers and medical professionals. In addition, operational intelligence can be used to automatically trigger responses to specified events or conditions.
What is now known as OI evolved from operational business intelligence, an initial step focused more on applying traditional BI querying and reporting. OI takes the concept to a higher analytics level, but operational BI is sometimes still used interchangeably with operational intelligence as a term.
How operational intelligence works
In most OI initiatives, data analysis is done in tandem with data processing or shortly thereafter, so workers can quickly identify and act on problems and opportunities in business operations. Deployments often include real-time business intelligence systems set up to analyze incoming data, plus real-time data integration tools to pull together different sets of relevant data for analysis.
Stream processing systems and big data platforms, such as Hadoop and Spark, can also be part of the OI picture, particularly in applications that involve large amounts of data and require advanced analytics capabilities. In addition, various IT vendors have combined data streaming, real-time monitoring and data analytics tools to create specialized operational intelligence platforms.
As data is analyzed, organizations often present operational metrics, key performance indicators (KPIs) and business insights to managers and other workers in interactive dashboards that are embedded in the systems they use as part of their jobs; data visualizations are usually included to help make the information easy to understand. Alerts can also be sent to notify users of developments and data points that require their attention, and automated processes can be kicked off if predefined thresholds or other metrics are exceeded, such as stock trades being spurred by prices hitting particular levels.
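The threshold-driven alerts and automated actions described above can be sketched in a few lines of Python. This is a minimal illustration, not a depiction of any particular OI product: the `Rule` class, the `evaluate` function, the `notify` action and the sample price events are all hypothetical names and data introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Rule:
    """Hypothetical alert rule: fire when a metric exceeds a threshold."""
    metric: str
    threshold: float
    action: Callable[[str, float], str]


def evaluate(events: Iterable[dict], rules: List[Rule]) -> List[str]:
    """Check each incoming event against every rule as it arrives."""
    triggered = []
    for event in events:
        for rule in rules:
            value = event.get(rule.metric)
            if value is not None and value > rule.threshold:
                # In a real system this might send an alert or start a process.
                triggered.append(rule.action(rule.metric, value))
    return triggered


def notify(metric: str, value: float) -> str:
    """Stand-in action that just formats an alert message."""
    return f"ALERT: {metric}={value}"


# Illustrative data: a stream of price events and one predefined threshold.
rules = [Rule(metric="price", threshold=100.0, action=notify)]
stream = [{"price": 98.5}, {"price": 101.2}, {"price": 99.0}]
alerts = evaluate(stream, rules)
```

In this sketch, only the second event crosses the threshold, so only one action fires. A production deployment would more likely evaluate rules inside a stream processing system rather than over an in-memory list.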
Operational intelligence uses and examples
Stock trading and other types of investment management are prime candidates for operational intelligence initiatives because of the need to monitor huge volumes of data in real time and respond rapidly to events and market trends. Customer analytics is another area that's ripe for OI. For example, online marketers use real-time tools to analyze internet clickstream data, so they can better target marketing campaigns to consumers. And cable TV companies track data from set-top boxes in real time to analyze the viewing activities of customers and how the boxes are functioning.
The growth of the internet of things has sparked operational intelligence applications for analyzing sensor data being captured from manufacturing machines, pipelines, elevators and other equipment; that enables predictive maintenance efforts designed to detect potential equipment failures before they occur. Other types of machine data also fuel OI applications, including server, network and website logs that are analyzed in real time to look for security threats and IT operations issues.
There are less grandiose operational intelligence use cases, as well. These include call-center applications that provide operators with up-to-date customer records and recommend promotional offers while they're on the phone with customers, as well as logistics applications that help calculate the most efficient driving routes for fleets of delivery vehicles.
OI benefits and challenges
The primary benefit of OI implementations is the ability to address operational issues and opportunities as they arise -- or even before they do, as in the case of predictive maintenance. Operational intelligence also empowers business managers and workers to make more informed -- and hopefully better -- decisions on a day-by-day basis. Ultimately, if managed successfully, the increased visibility and insight into business operations can lead to higher revenue and competitive advantages over rivals.
But there are challenges. Building operational intelligence architecture typically involves piecing together different technologies, and there are numerous data processing platforms and analytics tools to choose between, some of which may require new skills in organizations. High performance and sufficient scalability are also needed to handle the real-time workloads and large volumes of data common in OI applications without choking the system.
Also, most business processes at a typical company don't require real-time data analysis. With that in mind, a key part of operational intelligence projects involves determining which end users need up-to-the-minute data and then training them to handle the information once it starts being delivered to them in that fashion.
Operational intelligence vs. business intelligence
Conventional BI systems support the analysis of historical data that has been cleansed and consolidated in a data warehouse or data mart before being made available for business analytics uses. BI applications generally aim to tell corporate executives and business managers what happened in the past on revenues, profits and other KPIs to aid in budgeting and strategic planning.
Early on, BI data was primarily distributed to users in static operational reports. That's still the case in some organizations, although many have shifted to dashboards with the ability to drill down into data for further analysis. In addition, self-service BI tools let users run their own queries and create data visualizations on their own, but the focus is still mostly on analyzing data from the past.
Operational intelligence systems let business managers and front-line workers see what's currently happening in operational processes and then immediately act upon the findings, either on their own or through automated means. The purpose is not to facilitate planning, but to drive operational decisions and actions in the moment.
A CDN (content delivery network), also called a content distribution network, is a group of geographically distributed and interconnected servers that provide cached internet content from a network location closest to a user to accelerate its delivery. The primary goal of a CDN is to improve web performance by reducing the time needed to transmit content and rich media to users' internet-connected devices.
Content delivery network architecture is also designed to reduce network latency, which is often caused by hauling traffic over long distances and across multiple networks. Reducing latency has become increasingly important, as more dynamic content, video and software as a service are delivered to a growing number of mobile devices.
CDN providers house cached content in either their own network points of presence (POP) or in third-party data centers. When a user requests content from a website, if that content is cached on a content delivery network, the CDN redirects the request to the server nearest to that user and delivers the cached content from its location at the network edge. This process is generally invisible to the user.
A wide variety of organizations and enterprises use CDNs to cache their website content to meet their businesses' performance and security needs. The need for CDN services is growing, as websites offer more streaming video, e-commerce applications and cloud-based applications where high performance is key. Few CDNs have POPs in every country, which means many organizations use multiple CDN providers to make sure they can meet the needs of their business or consumer customers wherever they are located.
In addition to content caching and web delivery, CDN providers are capitalizing on their presence at the network edge by offering services that complement their core functionalities. These include security services that encompass distributed denial-of-service (DDoS) protection, web application firewalls (WAFs) and bot mitigation; web and application performance and acceleration services; streaming video and broadcast media optimization; and even digital rights management for video. Some CDN providers also make their APIs available to developers who want to customize the CDN platform to meet their business needs, particularly as webpages become more dynamic and complex.
How does a CDN work?
The process of accessing content cached on a CDN network edge location is almost always transparent to the user. CDN management software dynamically calculates which server is located nearest to the requesting user and delivers content based on those calculations. The CDN server at the network edge communicates with the content's origin server to make sure any content that has not been cached previously is also delivered to the user. This not only shortens the distance that content travels, but also reduces the number of hops a data packet must make. The result is less packet loss, optimized bandwidth and faster performance, which minimizes timeouts, latency and jitter and improves the overall user experience. In the event of an internet attack or outage, content hosted on a CDN server will remain available to at least some users.
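The request flow described above, picking the nearest edge server, serving from its cache and falling back to the origin server on a miss, can be sketched as follows. This is a simplified illustration under stated assumptions: the edge locations, the `ORIGIN` dictionary and the function names are hypothetical, and geographic distance is used as a stand-in for "nearest," whereas real CDNs typically rely on DNS, anycast routing and network measurements.

```python
import math

# Hypothetical edge locations as (latitude, longitude) pairs.
EDGES = {
    "boston": (42.36, -71.06),
    "seattle": (47.61, -122.33),
    "london": (51.51, -0.13),
}

# Stand-in for the origin server's content.
ORIGIN = {"/index.html": "<html>home</html>"}

# One initially empty cache per edge location.
caches = {name: {} for name in EDGES}


def distance_km(a, b):
    """Great-circle distance between two (lat, lon) points (haversine)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))


def nearest_edge(user_location):
    """Return the name of the edge location closest to the user."""
    return min(EDGES, key=lambda name: distance_km(user_location, EDGES[name]))


def fetch(user_location, path):
    """Serve from the nearest edge cache, filling it from the origin on a miss."""
    edge = nearest_edge(user_location)
    cache = caches[edge]
    if path in cache:
        return edge, "HIT", cache[path]
    body = ORIGIN[path]   # cache miss: go back to the origin server
    cache[path] = body    # keep a copy at the edge for later requests
    return edge, "MISS", body
```

In this sketch, the first request for a path from a given region is a miss that travels to the origin; later requests routed to the same edge are served from its cache.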
Organizations buy services from CDN providers to deliver their content to their users from the nearest location. CDN providers either host content themselves or pay network operators and internet service providers (ISPs) to host CDN servers. Beyond placing servers at the network edge, CDN providers use load balancing and solid-state drives to help data reach users faster. They also work to reduce file sizes using compression and special algorithms, and they are deploying machine learning and AI to enable quicker load and transmission times.
History of CDNs
The first CDN was launched in 1998 by Akamai Technologies, in the early years of the commercial internet. Akamai's original techniques serve as the foundation of today's content distribution networks. Because content creators realized they needed to find a way to reduce the time it took to deliver information to users, CDNs were seen as a way to improve network performance and to use bandwidth efficiently. That basic premise remains important, as the amount of online content continues to grow.
So-called first-generation CDNs specialized in e-commerce transactions, software downloads, and audio and video streaming. As cloud and mobile computing gained traction, second-generation CDN services evolved to enable the efficient delivery of more complex multimedia and web content to a wider community of users via a more diverse mix of devices. As internet use has grown, the number of CDN providers has multiplied, as have the services CDN companies offer.
New CDN business models also include a variety of pricing methods that range from charges per usage and volume of content delivered to a flat rate or free for basic services, with add-on fees for additional performance and optimization services. A wide variety of organizations use CDN services to accelerate static and dynamic content, online gaming and mobile content delivery, streaming video and a number of other uses.
What are the main benefits of using a CDN?
The primary benefits of traditional CDN services include the following:
Improved webpage load times to prevent users from abandoning a slow-loading site or e-commerce application where purchases remain in the shopping cart;
Improved security from a growing number of services that include DDoS mitigation, WAFs and bot mitigation;
Increased content availability because CDNs can handle more traffic and avoid network failures better than the origin server that may be located several networks away from the end user; and
A diverse mix of performance and web content optimization services that complement cached site content.
How do you manage CDN security?
A representative list of CDN providers in this growing market includes the following:
A wide variety of organizations use CDNs to meet their businesses' performance and security needs. The need for CDN services is growing, as websites offer more streaming video, e-commerce applications and cloud-based applications, where high performance is essential.
CDN technology is also an ideal method to distribute web content that experiences surges in traffic, because distributed CDN servers can handle sudden bursts of client requests at one time over the internet. For example, spikes in internet traffic due to a popular event, like online streaming video of a presidential inauguration or a live sports event, can be spread out across the CDN, making content delivery faster and less likely to fail due to server overload.
AWS GPU instance type slashes cost of streaming apps
The cost of graphics acceleration can often make the technology prohibitive, but a new AWS GPU instance type for AppStream 2.0 makes that process more affordable.
Amazon AppStream 2.0, which enables enterprises to stream desktop apps from AWS to an HTML5-compatible web browser, delivers graphics-intensive applications for workloads such as creative design, gaming and engineering that rely on DirectX, OpenGL or OpenCL for hardware acceleration. The managed AppStream service eliminates the need for IT teams to recode applications to be browser-compatible.
The newest AWS GPU instance type for AppStream, Graphics Design, cuts the cost of streaming graphics applications up to 50%, according to the company. AWS customers can launch Graphics Design GPU instances or create a new instance fleet with the Amazon AppStream 2.0 console or AWS software development kit. AWS’ Graphics Design GPU instances come in four sizes that range from 2-16 virtual CPUs and 7.5-61 gibibytes (GiB) of system memory, and run on AMD FirePro S7150x2 Server GPUs with AMD Multiuser GPU technology.
Developers can now also select between two types of Amazon AppStream instance fleets in a streaming environment. Always-On fleets provide instant access to apps, but charge fees for every instance in the fleet. On-Demand fleets charge fees for instances only when end users are connected, plus an hourly fee, but there is a delay when an end user accesses the first application.
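The tradeoff between the two fleet types comes down to simple arithmetic, sketched below. All rates and usage figures are hypothetical placeholders, not AWS prices, and the sketch assumes the On-Demand hourly fee applies to instance-hours in which no user is connected; check current AppStream 2.0 pricing before relying on either formula.

```python
def always_on_cost(instances, hours, running_rate):
    """Every instance in the fleet is billed for every hour."""
    return instances * hours * running_rate


def on_demand_cost(instances, hours, connected_instance_hours,
                   running_rate, idle_rate):
    """Connected instance-hours are billed at the running rate.

    Assumption: the remaining instance-hours are billed at a smaller
    hourly idle fee.
    """
    idle_instance_hours = instances * hours - connected_instance_hours
    return (connected_instance_hours * running_rate
            + idle_instance_hours * idle_rate)


# Hypothetical scenario: 10 instances over 100 hours, with users
# connected for 300 of the 1,000 available instance-hours.
always_on = always_on_cost(10, 100, running_rate=0.50)
on_demand = on_demand_cost(10, 100, connected_instance_hours=300,
                           running_rate=0.50, idle_rate=0.03)
```

Under these made-up numbers, the On-Demand fleet is cheaper because users are connected for only a fraction of the time; the gap narrows as utilization rises, and the On-Demand startup delay remains a separate consideration.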
New features and support
In addition to the new AWS GPU instance type, the cloud vendor rolled out several other features this month, including:
ELB adds network balancer. AWS Network Load Balancer helps maintain low latency during spikes on a single static IP address per Availability Zone. Network Load Balancer — the second offshoot of Elastic Load Balancing features, following Application Load Balancer — routes connections to Virtual Private Cloud-based Elastic Compute Cloud (EC2) instances and containers.
New edge locations on each coast. Additional Amazon CloudFront edge locations in Boston and Seattle improve end user speed and performance when they interact with content via CloudFront. AWS now has 95 edge locations across 50 cities in 23 countries.
X1 instance family welcomes new member. The AWS x1e.32xlarge instance joins the X1 family of memory-optimized instances, with the most memory of any EC2 instance — 3,904 GiB of DDR4 instance memory — to help businesses reduce latency for large databases, such as SAP HANA. The instance is also AWS’ most expensive at about $16-$32 per hour, depending on the environment and payment model.
AWS Config opens up support. The AWS Config service, which enables IT teams to manage service and resource configurations, now supports both DynamoDB tables and Auto Scaling groups. Administrators can integrate those resources to evaluate the health and scalability of their cloud deployments.
Start and stop on the Spot. IT teams can now stop Amazon EC2 Spot Instances when an interruption occurs and then start them back up as needed. Previously, Spot Instances were terminated when prices rose above the user-defined level. AWS saves the EBS root device, attached volumes and the data within those volumes; those resources restore when capacity returns, and instances maintain their ID numbers.
EC2 expands networking performance. The largest instances of the M4, X1, P2, R4, I3, F1 and G3 families now use Elastic Network Adapter (ENA) to reach a maximum bandwidth of 25 Gb per second. The ENA interface enables both existing and new instances to reach this capacity, which boosts workloads reliant on high-performance networking.
New Direct Connect locations. Three new global AWS Direct Connect locations allow businesses to establish dedicated connections to the AWS cloud from an on-premises environment. New locations include: Boston, at Markley, One Summer Data Center for US-East-1; Houston, at CyrusOne West I-III data center for US-East-2; and Canberra, Australia, at NEXTDC C1 Canberra data center for AP-Southeast-2.
Role and policy changes. Several changes to AWS Identity and Access Management (IAM) aim to better protect an enterprise’s resources in the cloud. A policy summaries feature lets admins identify errors and evaluate permissions in the IAM console to ensure each action properly matches to the resources and conditions it affects. Other updates include a wizard for admins to create the IAM roles, and the ability to delete service-linked roles through the IAM console, API or CLI — IAM ensures that no resources are attached to a role before deletion.
Six new data streams. Amazon Kinesis Analytics, which enables businesses to process and query streaming data with SQL, has six new SQL functions to simplify stream processing: STEP(), LAG(), TO_TIMESTAMP(), UNIX_TIMESTAMP(), REGEX_REPLACE() and SUBSTRING(). AWS also increased the service’s capacity to process higher data volume streams.
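For readers unfamiliar with two of the functions named above, the following is a rough Python sketch of what STEP() and LAG() are generally understood to do in streaming SQL: STEP() rounds a timestamp down to an interval boundary, and LAG() looks back at an earlier row. It is plain Python written for illustration, not Kinesis Analytics SQL, and the function names and signatures here are assumptions.

```python
from datetime import datetime, timezone


def step(ts: datetime, interval_seconds: int) -> datetime:
    """Round a timestamp down to the nearest multiple of the interval."""
    epoch = int(ts.timestamp())
    floored = epoch - (epoch % interval_seconds)
    return datetime.fromtimestamp(floored, tz=timezone.utc)


def lag(values, offset=1, default=None):
    """For each row, return the value `offset` rows earlier (or a default)."""
    return [values[i - offset] if i >= offset else default
            for i in range(len(values))]


# Illustrative input: one event timestamp and a short series of readings.
event_time = datetime(2017, 9, 1, 12, 0, 47, tzinfo=timezone.utc)
bucket = step(event_time, 30)       # start of the 30-second bucket
previous = lag([10, 12, 15])        # each reading paired with the prior one
```

Bucketing with a step-style function is a common way to group streaming records into fixed windows, while a lag-style function makes it easy to compare a record with the one before it.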
Get DevOps notifications. Additional notifications from AWS CodePipeline for stage or action status changes enable a DevOps team to track, manage and act on changes during continuous integration and continuous delivery. CodePipeline integrates with Amazon CloudWatch to enable Amazon Simple Notification Service messages, which can trigger an AWS Lambda function in response.
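As a minimal sketch of the last step in that chain, the Lambda function below reads a pipeline status change delivered through Amazon SNS. It assumes the SNS message body is the JSON event emitted for CodePipeline, with `pipeline` and `state` fields under `detail`; the handler name and the returned summary format are hypothetical choices for illustration.

```python
import json


def handler(event, context):
    """Summarize CodePipeline state changes delivered via SNS (assumed event shape)."""
    summaries = []
    for record in event.get("Records", []):
        # SNS delivers the original event as a JSON string in Sns.Message.
        message = json.loads(record["Sns"]["Message"])
        detail = message.get("detail", {})
        state = detail.get("state")
        summaries.append({
            "pipeline": detail.get("pipeline"),
            "state": state,
            "needs_attention": state == "FAILED",
        })
    return summaries


# Synthetic event for local testing; the pipeline name is made up.
sample_event = {
    "Records": [{
        "Sns": {
            "Message": json.dumps({
                "detail-type": "CodePipeline Pipeline Execution State Change",
                "detail": {"pipeline": "demo-pipeline", "state": "FAILED"},
            })
        }
    }]
}
result = handler(sample_event, None)
```

A real function would likely post the summary to a chat channel or ticketing system rather than return it.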
AWS boosts HIPAA eligibility. Amazon’s HIPAA Compliance Program now includes Amazon Connect, AWS Batch and two Amazon Relational Database Service (RDS) engines, RDS for SQL Server and RDS for MariaDB — all six RDS engines are HIPAA eligible. AWS customers that sign a Business Associate Agreement can use those services to build HIPAA-compliant applications.
RDS for Oracle adds features. The Amazon RDS for Oracle engine now supports Oracle Multimedia, Oracle Spatial and Oracle Locator features, with which businesses can store, manage and retrieve multimedia and multi-dimensional data as they migrate databases from Oracle to AWS. The RDS Oracle engine also added support for multiple Oracle Application Express versions, which enables developers to build applications within a web browser.
Assess RHEL security. Amazon Inspector expanded support for Red Hat Enterprise Linux (RHEL) 7.4 assessments, to run Vulnerabilities & Exposures, Amazon Security Best Practices and Runtime Behavior Analysis scans in that RHEL environment on EC2 instances.
BPM in cloud evolves to suit line of business, IoT
While on-premises BPM tools have caused a tug of war between lines of business and IT, the cloud helps appease both sides. Here's what to expect from this cloud BPM trend and more.
Business process management tools rise in importance as companies try to make better use -- and reuse -- of IT assets. And, when coupled with cloud, this type of software can benefit from a pay-as-you-go model for more efficient cost management, as well as increased scalability.
As a result, cloud-based BPM has become a key SaaS tool in the enterprise. Looking forward, the growth of BPM in cloud will drive three major trends that enterprise users should track.
BPM is designed to encourage collaboration between line departments and IT, but the former group often complains that BPM tools hosted in the data center favor the IT point of view in both emphasis and design. To avoid this and promote equality between these two groups, many believe that BPM tools have to move to neutral territory: the cloud.
Today, BPM supports roughly a dozen different roles and is increasingly integrated with enterprise architecture practices and models. This expands the scope of BPM software, as well as the number of non-IT professionals who use it. Collaboration and project management, for example, account for most of the new features in cloud BPM software.
Collaboration features in cloud-based BPM include project tools and integration with social networks. While business people widely use platforms like LinkedIn for social networking, IT professionals use other wiki-based tools. Expect to see a closer merger between the two.
This push for a greater line department focus in BPM could also divide the BPM suites themselves. While nearly all the cloud BPM products are fairly broad in their application, those from vendors with a CIO-level sales emphasis, such as IBM's Business Process Manager on Cloud or Appian, focus more on IT. NetSuite, on the other hand, is an example of cloud BPM software with a broader organizational target.
Software practices influence BPM
Cloud, in general, affects application design and development, which puts pressure on BPM to accommodate changes in software practices. Cloud platforms, for example, have encouraged a more component-driven vision for applications, which maps more effectively to business processes. This will be another factor that expands line department participation in BPM software.
BPM in cloud encourages line organizations to take more control over applications. The adoption of third-party tools, rather than custom development, helps them target specific business problems. This, however, is a double-edged sword: It can improve automated support for business processes but also duplicate capabilities and hinder workflow integration among organizations. IT and line departments will have to define a new level of interaction.
The third trend to watch around BPM in cloud involves internet of things (IoT) and machine-to-machine communications. These technologies presume that sensors will activate processes, either directly or through sensor-linked analytics. This poses a challenge for BPM, because it takes human judgment out of the loop and requires instead that business policies anticipate human review of events and responses. That shifts the emphasis of BPM toward automated policies, which, in the past, has led to the absorption of BPM into things like Business Process Modeling Language, and puts the focus back on IT.
In theory, business policy automation has always been within the scope of BPM. But, in practice, BPM suites have offered only basic support for policy automation or even for the specific identification of business policies. It's clear that this will change and that policy controls to guide IoT deployments will be built into cloud-based BPM.
The foundations of big data continue to shift, driven in great part by AI and machine learning applications. The push to work in real time, and to quickly place AI tools in the hands of data scientists and business analysts, has created interest in software containers as a more flexible mechanism for deploying big data systems and applications.
Now, Kubernetes container orchestration is emerging to provide an underpinning for the new container-based workloads. It has stepped into the big data spotlight -- one formerly reserved for data frameworks like Hadoop and Spark.
These frameworks continue to play an important role in big data, but in more of a supporting role, as discussed in this podcast review of the 2018 Strata Data Conference in San Jose, Calif. That's particularly true in the case of Hadoop, the featured topic in only a couple of sessions at the conference, which until last year was called Strata + Hadoop World.
"It's not that people are turning their backs on Hadoop," said Craig Stedman, SearchDataManagement's senior executive editor and a podcast participant. "But it is becoming part of the woodwork."
The attention of IT teams is shifting more toward the actual applications and how they can get more immediate value out of data science, AI and machine learning, he indicated. Maximizing resources is a must, and this is where Kubernetes-based containers are seen as potential helpers for teams looking to swap workloads in and out and maximize the use of computing resources in fast-moving environments.
Kubernetes connections for Spark and Flink, a rival stream processing engine, are being watched more closely.
At Strata, Stedman said, the deployment of a prototype Kubernetes-Spark combination and several other machine learning frameworks in a Kubernetes-based architecture was seen partly as a way to nimbly shift workloads between CPUs and GPUs, the latter processor type playing a growing role in training the neural networks underlying machine learning and deep learning applications.
The deployment was the work of JD.com Inc., a Beijing-based online retailer and Strata presenter. It is worth emphasizing the early adopter status of such implementations, however. While JD.com is running production applications in the container architecture, Stedman reported that it's still studying performance and reliability issues around the new coupling of Spark and Kubernetes that's included in Apache Spark 2.3 as an experimental technology.
Overall, in fact, there is much learning ahead for Kubernetes container orchestration when it comes to big data. That is because containers tend to be ephemeral or stateless, while big data is traditionally stateful, providing data persistence.
Bridging the two takes on state is the goal of a Kubernetes volume driver that MapR Technologies announced at Strata, which is integrated into the company's big data platform. As such, it addresses one of the obstacles Kubernetes container orchestration faces in big data applications.
Stedman said the march to stateful applications on Kubernetes continued to advance after the conference, as Data Artisans launched its dA Platform for what it described as stateful stream processing with Flink. The development and runtime environment is intended for use with real-time analytics, machine learning and other applications that can be deployed on Kubernetes in order to provide dynamic allocation of computing resources.
Listen to this podcast to learn more about the arrival of containers in the world of Hadoop and Spark and the overall evolution of big data as seen at the Strata event.
Big data vendors and users are looking to Kubernetes-managed containers to help accelerate system and application deployments and enable more flexible use of computing resources.
It's still early going for containerizing the big data implementation process. However, users and vendors alike are increasingly eying software containers and Kubernetes, a technology for orchestrating and managing them, as tools to help ease deployments of big data systems and applications.
Early adopters expect big data containers running in Kubernetes clusters to accelerate development and deployment work by enabling the reuse of system builds and application code. The container approach should also make it easier to move systems and applications to new platforms, reallocate computing resources as workloads change and optimize the use of an organization's available IT infrastructure, advocates say.
The pace is picking up on big data technology vendors adding support for containers and Kubernetes to their product offerings. For example, at the Strata Data Conference in San Jose, Calif., this month, MapR Technologies Inc. said it has integrated a Kubernetes volume driver into its big data platform to provide persistent data storage for containerized applications tied to the orchestration technology.
MapR previously supported the use of specialized Docker containers with built-in connectivity to the MapR Converged Data Platform, but the Kubernetes extension is "much more transparent and native to the environment," said Jack Norris, the Santa Clara, Calif., company's senior vice president of data and applications. He added that the persistent storage capability lets containers be used for stateful applications, a requirement for a typical big data implementation with Hadoop and related technologies.
Also, the version 2.3 update of the open source Apache Spark processing engine released in late February includes a native Kubernetes scheduler. The Spark on Kubernetes technology, which is being developed by contributors from Bloomberg, Google, Intel and several other companies, is still described as experimental in nature, but it enables Spark 2.3 workloads to be run in Kubernetes clusters.
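To give a sense of what running a Spark 2.3 workload on Kubernetes looks like, here is a small Python sketch that assembles a spark-submit command line using the `k8s://` master URL scheme and the `spark.kubernetes.container.image` setting documented for that release. The API server address, container image and jar path are hypothetical placeholders, and the helper function itself is only an illustration.

```python
import shlex


def build_spark_submit(api_server, image, app_jar, main_class, executors=2):
    """Return the argv list for a cluster-mode spark-submit against Kubernetes."""
    return [
        "bin/spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]


argv = build_spark_submit(
    api_server="https://k8s.example.internal:6443",        # hypothetical API server
    image="registry.example.internal/spark:2.3.0",         # hypothetical image
    app_jar="local:///opt/spark/examples/jars/spark-examples.jar",  # hypothetical path in the image
    main_class="org.apache.spark.examples.SparkPi",
    executors=5,
)
command_line = " ".join(shlex.quote(arg) for arg in argv)
```

The driver and executors then run as pods in the cluster, which is why a container image has to be supplied alongside the application jar.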
Not to be outdone, an upcoming 1.5 release of Apache Flink -- a stream processing rival to Spark -- will provide increased ties to both Kubernetes and the rival Apache Mesos technology, according to Fabian Hueske, a co-founder and software engineer at Flink vendor Data Artisans. Users can run the Berlin-based company's current Flink distribution on Kubernetes, "but it's not always straightforward to do that now," Hueske said at the Strata conference. "It will be much easier with the new release."
Big data containers achieve liftoff
JD.com Inc., an online retailer based in Beijing, is an early user of Spark on Kubernetes. The company has also containerized TensorFlow, Caffe and other machine learning and deep learning frameworks in a single Kubernetes-based architecture, which it calls Moonshot.
The use of containers is designed to streamline and simplify big data implementation efforts in support of machine learning and other AI analytics applications that are being run in the new architecture, said Zhen Fan, a software development engineer at JD.com. "A major consideration was that we should support all of the AI workloads in one cluster so we can maximize our resource usage," Fan said during a conference session.
However, he added that the containers also make it possible to quickly deploy analytics systems on the company's web servers to take advantage of overnight processing downtime.
"In e-commerce, the [web servers] are quite busy until midnight," Fan said. "But from 12 to 6 a.m., they can be used to run some offline jobs."
JD.com began work on the AI architecture in mid-2017; the retailer currently has 300 nodes running production jobs in containers, and it plans to expand the node count to 1,000 in the near future, Fan said. The Spark on Kubernetes technology was installed in the third quarter of last year, initially to support applications run with Spark's stream processing module.
However, that part of the deployment is still a proof-of-concept project intended to test "if Spark on Kubernetes is ready for a production environment," said Wei Ting Chen, a senior software engineer at Intel, which is helping JD.com build the architecture. Chen noted that some pieces of Spark have yet to be tied to Kubernetes, and he cited several other issues that need to be assessed.
For example, JD.com and Intel are looking at whether using Kubernetes could cause performance bottlenecks when launching large numbers of containers, Chen said. Reliability is another concern, as more and more processing workloads are run through Spark on Kubernetes, he added.
Out on the edge with Kubernetes
Spark on Kubernetes is a bleeding-edge technology that's currently best suited to big data implementations in organizations that have sufficient "technical muscle," said Vinod Nair, director of product management at Pepperdata Inc., a vendor of performance management tools for big data systems that is involved in the Spark on Kubernetes development effort.
The Kubernetes scheduler is a preview feature in Spark 2.3 and likely won't be ready for general availability for another six to 12 months, according to Nair. "It's a fairly large undertaking, so I expect it will be some time before it's out in production," he said. "It's at about an alpha test state at this point."
Pepperdata plans to support Kubernetes-based containers for Spark and the Hadoop Distributed File System in some of its products, starting with Application Spotlight, a performance management portal for big data application developers that the Cupertino, Calif., company announced this month. With the recent release of Hadoop 3.0, the YARN resource manager built into Hadoop can also control Docker containers, "but Kubernetes seems to have much bigger ambitions to what it wants to do," Nair said.
Not everyone is sold on Kubernetes -- or K8s, as it's informally known. BlueData Software Inc. uses a custom orchestrator to manage the Docker containers at the heart of its big-data-as-a-service platform. Tom Phelan, co-founder and chief architect at BlueData, said he still thinks the homegrown tool has a technical edge on Kubernetes, particularly for stateful applications. He added, though, that the Santa Clara, Calif., vendor is working with Kubernetes in the lab with an eye on possible future adoption.
Pinterest Inc. is doing the same thing. The San Francisco company is moving to use Docker containers to speed up development and deployment of various machine learning applications that help drive its image bookmarking and social networking site under the covers, said Kinnary Jangla, a senior software engineer at Pinterest.
Jangla, who built a container-based setup for debugging machine learning models as a test case, said in a presentation at Strata that Pinterest is also testing a Kubernetes cluster. "We're trying to see if that is going to be useful to us as we migrate to production," she said. "But we're not there yet."
Data security needs to be addressed upfront in deployments of big data systems -- and users are likely to find they have to build some security capabilities themselves.
When TMW Systems Inc. began building a big data environment to run advanced analytics applications three years ago, the first step wasn't designing and implementing the Hadoop-based architecture -- rather, it involved putting together a framework to secure the data going into the new platform.
"I started with the security model," said Timothy Leonard, TMW's executive vice president of operations and CTO. "I wanted my customers to know that when it comes to the security of their data, it's like Fort Knox -- the data is protected. Then I built the rest of the environment on top of it."
Big data security issues shouldn't be an afterthought in deployments of Hadoop, Spark and related technologies, according to technology analysts and experienced IT managers. That's partly because of the importance of safeguarding data against theft or misuse -- and partly because of the work it typically takes to create effective defenses in data lakes and other big data systems.
TMW, which develops transportation management software for trucking companies and collects operational data from them for analysis, has implemented three tiers of data protections. That starts with system-level security on the Mayfield Heights, Ohio, company's big data architecture, which is based on Hortonworks Inc.'s distribution of Hadoop. In addition, data security and governance functions specify who's authorized to access information and under what circumstances.
And, finally, a metadata layer built by Leonard's team provides end-to-end data lineage records on how individual data elements are being used and by whom. That enables TMW to track the use of sensitive data and run audits in search of suspicious activities, he said -- "to see if [a data element] moves 400 times today," for example.
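The kind of audit Leonard describes can be sketched in a few lines of Python. This is an illustrative example only, not TMW's implementation: the `LineageEvent` record, the element names and the `flag_heavy_movement` helper are all hypothetical, and the threshold of 400 simply echoes his example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEvent:
    element: str   # name of the data element that was moved or read
    user: str      # who touched it
    day: str       # ISO date of the event, e.g. "2018-03-12"


def flag_heavy_movement(events, day, threshold=400):
    """Return elements moved at least `threshold` times on the given day."""
    counts = Counter(e.element for e in events if e.day == day)
    return {element: n for element, n in counts.items() if n >= threshold}


# Synthetic lineage log: one element is moved suspiciously often.
events = [LineageEvent("driver_license_no", "svc_export", "2018-03-12")
          for _ in range(400)]
events += [LineageEvent("route_id", "analyst_a", "2018-03-12") for _ in range(3)]
suspicious = flag_heavy_movement(events, "2018-03-12")
```

In practice the events would come from the metadata layer rather than an in-memory list, and a flagged element would prompt a closer review of who moved it.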
Self-improvement security projects
Leonard said TMW uses Apache Ranger and Knox, two open source tools spearheaded by Hortonworks, to support role-based security in some data science applications and encrypt data while it's stored in the big data environment and when it's moving between different points.
But the metadata repository was a DIY technology, and TMW also created a custom data dictionary that maps data elements to different levels of security based on their sensitivity. "We discovered some areas where we had to improve on what was there," Leonard said, adding that, overall, "big data at the security level hasn't fully matured yet."
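A data dictionary of this sort can be pictured as a simple lookup from data element to sensitivity level. The Python sketch below is a hypothetical illustration: the level names, the sample elements and the `can_read` check are assumptions, not details of TMW's system.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    RESTRICTED = 2


# Hypothetical dictionary mapping data elements to security levels.
DATA_DICTIONARY = {
    "route_id": Sensitivity.PUBLIC,
    "fuel_cost": Sensitivity.INTERNAL,
    "driver_license_no": Sensitivity.RESTRICTED,
}


def can_read(clearance: Sensitivity, element: str) -> bool:
    """Allow access only if the user's clearance meets the element's level."""
    # Unclassified elements default to the most restrictive level.
    required = DATA_DICTIONARY.get(element, Sensitivity.RESTRICTED)
    return clearance >= required
```

Defaulting unknown elements to the strictest level is one way to keep newly ingested data protected until someone classifies it.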
The lack of technology maturity is one of the biggest big data security issues facing users, Gartner analyst Merv Adrian said. That applies to the data security and governance tools currently available for use in big data environments and to big data technologies themselves, he noted.
Hadoop, NoSQL databases and other big data platforms don't provide the same level of built-in security features that mainstream relational databases do, Adrian said. Also, data lakes generally incorporate a variety of technologies that aren't configured consistently for security tasks such as activity logging and auditing. "There's a lot of complexity down at the surface to what people are trying to do," he explained.
Piece parts for big data security
Meanwhile, the commercial and open source security tools now on the market address some pieces of, but not the entire, big data puzzle, according to Adrian. "Very few, if any, vendors can cover the gamut," he said. "Ultimately, user organizations are going to have to get to a holistic view [of big data security] -- and today, they're going to have to build that themselves."
In a report published in March 2017, Forrester Research analysts Brian Hopkins and Mike Gualtieri pointed to a common framework for managing metadata, security and data governance as the top item needed to make technologies in the big data ecosystem work better together. But Hortonworks and rivals Cloudera and MapR Technologies are taking different paths. The tools they offer "do not work together, and none of them unifies everything [users] need," Hopkins and Gualtieri wrote. That also applies to Amazon Web Services, the other major big data platform vendor (see "Security menu").
Other big data security issues that Adrian cited include the scale of the data volumes typically involved; the use of data from new sources, including external ones; a lack of upfront data classification as raw data is pulled into data lakes; and the movement of data between cloud and on-premises systems in hybrid environments. The analytics outputs generated by data scientists can also expose sensitive data in unforeseen ways, he said.
Network security startup ProtectWise Inc. designed its internal big data security strategy to address such issues across the spectrum of data acquisition, transport, processing, storage and usage, according to co-founder and CTO Gene Stevens. And, like TMW, ProtectWise had to do a lot of custom development to meet its needs for securing the network operations data it collects from customers to monitor and analyze.
To transmit data from corporate networks to its data lake in the AWS cloud, for example, the Denver-based company built software sensors that generate customer-specific encryption keys to prevent a compromise in one network from exposing the data of other customers to attackers. The keys are used just once and then disposed of; doing so "relegates any compromises to one moment in time, which makes them essentially useless," Stevens said.
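The single-use idea can be illustrated with a short Python sketch. It models only the key lifecycle (issue a random key per customer, allow one use, then discard it) and performs no actual encryption; the `OneTimeKeyIssuer` class and its methods are hypothetical and are not a description of ProtectWise's software.

```python
import secrets


class OneTimeKeyIssuer:
    """Hands out a fresh random key per customer transfer; each key works once."""

    def __init__(self):
        self._active = {}  # key_id -> (customer_id, key bytes)

    def issue(self, customer_id):
        key_id = secrets.token_hex(8)
        key = secrets.token_bytes(32)  # 256 bits from the OS random source
        self._active[key_id] = (customer_id, key)
        return key_id, key

    def redeem(self, key_id, customer_id):
        """Return the key once, then dispose of it; reject reuse or the wrong customer."""
        entry = self._active.get(key_id)
        if entry is None:
            raise KeyError("key already used or never issued")
        owner, key = entry
        if owner != customer_id:
            raise PermissionError("key belongs to a different customer")
        del self._active[key_id]
        return key
```

Because a redeemed key is deleted immediately, a key captured from one transfer cannot be used to read any later transfer, or any other customer's data.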
Security weaknesses not desired
ProtectWise, which collects more than 40 billion data records amounting to 600 TB daily, also set up its own key management system to oversee security processes on most of the data transfers into the Amazon Simple Storage Service (S3) instead of only relying on the one provided by AWS. "We have good faith in Amazon in general," Stevens said. "But any weaknesses they have in their key management system, we don't want to inherit that."
Furthermore, ProtectWise developed routines to encrypt data in the Apache Spark processing engine and the DataStax Enterprise edition of the Cassandra NoSQL database, which it uses in conjunction with the Amazon EMR platform to run analytics applications on both real-time and historical data. Stevens said Spark currently doesn't offer the kind of encryption support ProtectWise needs; Cassandra does "but at a tremendous performance hit" that the company can't afford to take.
All hands on deck for big data security
Security is an underappreciated topic among many data management professionals, according to Gartner's Adrian. But he believes that needs to change, particularly as organizations face up to big data security issues.
Data management teams should get more involved in the process of protecting big data systems, Adrian said. In data lakes built around Hadoop and other technologies that aren't as mature as relational databases are, "security is everybody's business," he noted.
And security initiatives can go hand in hand with efforts to improve data management and usage, TMW's Leonard said. In addition to supporting security audits, Leonard said a metadata repository lets his team see whether data scientists are correctly applying trucking operations data in the transportation management software vendor's big data environment as part of analytics applications.
"We've found things, not that they weren't authorized to access a certain data element, but when they do, they're using it in the wrong way," Leonard explained. As a result, he added, TMW's training program has been upgraded to give the data scientists better information on how to use the data at their disposal.
ProtectWise's Stevens said he's open to using embedded functionality that's "more security-friendly" in technologies like Spark and Cassandra. "But we're happy to build some of this ourselves because it's business-critical," he noted. "Security is in our DNA. Not taking it seriously is not an option."
It's the same for TMW's Leonard when it comes to dealing with big data security issues. Protecting the data in the company's Hadoop environment "is the No. 1 thing on my mind," he said. "It's one thing to drive into big data, but boy, you better have security around it."
Sage says the move will boost its cloud financial management software and U.S. presence. Analysts think it's a good technology move but are unsure about the market impact.
Sage Software intends to expand both its cloud offerings and its customer base in North America.
Sage, an ERP vendor based in Newcastle upon Tyne, U.K., is acquiring Intacct, a San Jose-based vendor of financial management software, for $850 million, according to the company.
Sage's core products include the Sage X3 ERP system, the Sage One accounting and invoicing application and Sage Live real-time accounting software. The company's products are aimed primarily at SMBs, and Sage claims that it has just over 6 million users worldwide, with the majority of these in Europe.
Intacct provides SaaS financial management software to SMBs, with most of its customer base in North America, according to the company.
The move to acquire Intacct demonstrates Sage's determination to "win the cloud" and expand its U.S. customer base, according to a Sage press release announcing the deal.
"Today we take another major step forward in delivering our strategy and we are thrilled to welcome Intacct into the Sage family," Stephen Kelly, Sage CEO, said in the press release. "The acquisition of Intacct supports our ambitions for accelerating growth by winning new customers at scale and builds on our other cloud-first acquisitions, strengthening the Sage Business Cloud. Intacct opens up huge opportunities in the North American market, representing over half of our total addressable market."
Combining forces makes sense for Intacct because the company shares the same goals as Sage, according to Intacct CEO Robert Reid.
"We are excited to become part of Sage because we are relentlessly focused on the same goal -- to deliver the most innovative cloud solutions for our customers," Reid said in the press release. "Intacct is growing rapidly in our market and we are proud to be a recognized customer satisfaction leader across midsize, large and global enterprise businesses. By combining our strengths with those of Sage, we can jointly accelerate success for our customers."
Intacct brings real cloud DNA to financial management software
Intacct's specialty in cloud financial management software should complement Sage's relatively weak financial functionality, according to Cindy Jutras, president of the ERP consulting firm Mint Jutras.
"[Intacct] certainly brings real cloud DNA, and a financial management solution that would be a lot harder to grow out of than the solutions they had under the Sage One brand," Jutras said. "It also has stronger accounting than would be embedded within Sage X3. I would expect X3 to still be the go-to solution for midsize manufacturers since that was never Intacct's target, but Intacct may very well become the go-to ERP for service companies, like professional services."
Jutras also mentioned that Intacct was one of the first applications to address the new ASC 606 revenue recognition rules, something that Sage has not done yet. Sage's cloud strategy has been murky up to this point, and Jutras was unsure whether this move will clarify it.
"It doesn't seem any of its existing products -- except their new Sage Live developed on the Salesforce platform -- are multi-tenant SaaS and up until recently they seemed to be going the hybrid route by leaving ERP on premises and surrounding it with cloud services," she said.
The deal should strengthen Sage's position in the SMB market, according to Chris Devault, manager of software selection at Panorama Consulting Solutions.
"This is a very good move for Sage, as it will bring a different platform and much needed technology to help Sage round out their small to mid-market offerings," Devault said.
Getting into the U.S. market
Overall it appears to be a positive move for Sage, both from a technology and market perspective, according to Holger Mueller, vice president and principal analyst at Constellation Research Inc.
"It's a good move by Sage to finally tackle finance in the cloud and get more exposure to the largest software market in the world, the U.S.," Mueller said. "But we see more than finance moving to the cloud, as customers are starting to look for or demand a complete suite to be available on the same platform. Sage will have to move fast to integrate Intacct and get to a compelling cloud suite roadmap."
Time will also tell if this move will position Sage better in the SMB ERP landscape.