When TMW Systems Inc. began building a big data environment to run advanced analytics applications three years ago, the first step wasn't designing and implementing the Hadoop-based architecture -- rather, it involved putting together a framework to secure the data going into the new platform.
"I started with the security model," said Timothy Leonard, TMW's executive vice president of operations and CTO. "I wanted my customers to know that when it comes to the security of their data, it's like Fort Knox -- the data is protected. Then I built the rest of the environment on top of it."
Big data security issues shouldn't be an afterthought in deployments of Hadoop, Spark and related technologies, according to technology analysts and experienced IT managers. That's partly because of the importance of safeguarding data against theft or misuse -- and partly because of the work it typically takes to create effective defenses in data lakes and other big data systems.
TMW, which develops transportation management software for trucking companies and collects operational data from them for analysis, has implemented three tiers of data protection. The first is system-level security on the Mayfield Heights, Ohio, company's big data architecture, which is based on Hortonworks Inc.'s distribution of Hadoop. The second is a set of data security and governance functions that specify who's authorized to access information and under what circumstances.
And, finally, a metadata layer built by Leonard's team provides end-to-end data lineage records on how individual data elements are being used and by whom. That enables TMW to track the use of sensitive data and run audits in search of suspicious activities, he said -- "to see if [a data element] moves 400 times today," for example.
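As a rough illustration of the kind of audit Leonard describes, the sketch below counts how often each data element appears in a day's lineage records and flags any element that reaches a threshold. It is not TMW's implementation: the `LineageEvent` record, the `flag_busy_elements` function and the field names are hypothetical, and the 400-movement threshold is taken only from his example.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LineageEvent:
    """One hypothetical lineage record: who touched which data element, and when."""
    element: str
    user: str
    day: date


def flag_busy_elements(events, day, threshold=400):
    """Return {element: count} for elements used at least `threshold` times on `day`."""
    counts = Counter(e.element for e in events if e.day == day)
    return {name: n for name, n in counts.items() if n >= threshold}


# Example: one sensitive element is touched 400 times, another only 12 times.
audit_day = date(2017, 6, 1)
sample_events = (
    [LineageEvent("driver_license_no", "analyst_a", audit_day)] * 400
    + [LineageEvent("route_id", "analyst_b", audit_day)] * 12
)
suspicious = flag_busy_elements(sample_events, audit_day)
print(suspicious)
```

In practice the events would come from the metadata repository rather than an in-memory list, and the threshold would likely vary with each element's sensitivity.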
Self-improvement security projects
Leonard said TMW uses Apache Ranger and Knox, two open source tools spearheaded by Hortonworks, to support role-based security in some data science applications and encrypt data while it's stored in the big data environment and when it's moving between different points.
But the metadata repository was a DIY technology, and TMW also created a custom data dictionary that maps data elements to different levels of security based on their sensitivity. "We discovered some areas where we had to improve on what was there," Leonard said, adding that, overall, "big data at the security level hasn't fully matured yet."
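A data dictionary of the kind described here can be pictured as a lookup from data elements to sensitivity levels, checked against the clearance attached to a user's role. The sketch below is an assumption-laden illustration, not TMW's design: the level names, element names, roles and the `can_access` function are all hypothetical.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical sensitivity levels; higher values are more sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    RESTRICTED = 2


# Hypothetical data dictionary: data element -> required sensitivity level.
DATA_DICTIONARY = {
    "route_id": Sensitivity.PUBLIC,
    "fuel_cost": Sensitivity.INTERNAL,
    "driver_license_no": Sensitivity.RESTRICTED,
}

# Hypothetical role clearances: role -> highest level the role may read.
ROLE_CLEARANCE = {
    "dispatcher": Sensitivity.PUBLIC,
    "data_scientist": Sensitivity.INTERNAL,
    "security_admin": Sensitivity.RESTRICTED,
}


def can_access(role, element):
    """Allow access only if the role's clearance meets the element's level."""
    # Elements missing from the dictionary are treated as most sensitive.
    required = DATA_DICTIONARY.get(element, Sensitivity.RESTRICTED)
    # Unknown roles have no clearance and are always denied.
    clearance = ROLE_CLEARANCE.get(role)
    return clearance is not None and clearance >= required


print(can_access("data_scientist", "fuel_cost"))
print(can_access("data_scientist", "driver_license_no"))
```

Defaulting unclassified elements to the most restrictive level is one design choice among several; it trades convenience for the assurance that raw data landing in a data lake isn't exposed before someone classifies it.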
The lack of technology maturity is one of the biggest big data security issues facing users, Gartner analyst Merv Adrian said. That applies to the data security and governance tools currently available for use in big data environments and to big data technologies themselves, he noted.
Hadoop, NoSQL databases and other big data platforms don't provide the same level of built-in security features that mainstream relational databases do, Adrian said. Also, data lakes generally incorporate a variety of technologies that aren't configured consistently for security tasks such as activity logging and auditing. "There's a lot of complexity down at the surface to what people are trying to do," he explained.
Piece parts for big data security
Meanwhile, the commercial and open source security tools now on the market address some pieces of the big data security puzzle but not all of it, according to Adrian. "Very few, if any, vendors can cover the gamut," he said. "Ultimately, user organizations are going to have to get to a holistic view [of big data security] -- and today, they're going to have to build that themselves."
In a report published in March 2017, Forrester Research analysts Brian Hopkins and Mike Gualtieri pointed to a common framework for managing metadata, security and data governance as the top item needed to make technologies in the big data ecosystem work better together. But Hortonworks and rivals Cloudera and MapR Technologies are taking different paths. The tools they offer "do not work together, and none of them unifies everything [users] need," Hopkins and Gualtieri wrote. That also applies to Amazon Web Services, the other major big data platform vendor (see "Security menu").
Other big data security issues that Adrian cited include the scale of the data volumes typically involved; the use of data from new sources, including external ones; a lack of upfront data classification as raw data is pulled into data lakes; and the movement of data between cloud and on-premises systems in hybrid environments. The analytics outputs generated by data scientists can also expose sensitive data in unforeseen ways, he said.
Network security startup ProtectWise Inc. designed its internal big data security strategy to address such issues across the spectrum of data acquisition, transport, processing, storage and usage, according to co-founder and CTO Gene Stevens. And, like TMW, ProtectWise had to do a lot of custom development to meet its needs for securing the network operations data it collects from customers to monitor and analyze.
To transmit data from corporate networks to its data lake in the AWS cloud, for example, the Denver-based company built software sensors that generate customer-specific encryption keys to prevent a compromise in one network from exposing the data of other customers to attackers. The keys are used just once and then disposed of; doing so "relegates any compromises to one moment in time, which makes them essentially useless," Stevens said.
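The single-use, customer-specific key lifecycle Stevens describes can be sketched in a few lines. The example below models only how keys are issued, used once and discarded; it does not perform encryption, and it is not ProtectWise's sensor code. The `OneTimeKeyring` class and its method names are hypothetical.

```python
import secrets


class OneTimeKeyring:
    """Hypothetical keyring: a fresh random key per customer transfer, usable once."""

    def __init__(self):
        # Maps (customer_id, key_id) -> key bytes for keys not yet used.
        self._active = {}

    def issue(self, customer_id):
        """Create a new 256-bit key scoped to one customer and return its ID."""
        key_id = secrets.token_hex(8)
        self._active[(customer_id, key_id)] = secrets.token_bytes(32)
        return key_id

    def consume(self, customer_id, key_id):
        """Hand the key out once and forget it; a repeat call raises KeyError."""
        return self._active.pop((customer_id, key_id))


keyring = OneTimeKeyring()
key_id = keyring.issue("customer_a")
key = keyring.consume("customer_a", key_id)
print(len(key))
```

Because each key is tied to one customer and removed after its first use, a key captured from one transfer can't be requested again or used against another customer's data in this model.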
Security weaknesses not desired
ProtectWise, which collects more than 40 billion data records amounting to 600 TB daily, also set up its own key management system to oversee security processes on most of the data transfers into the Amazon Simple Storage Service (S3) instead of only relying on the one provided by AWS. "We have good faith in Amazon in general," Stevens said. "But any weaknesses they have in their key management system, we don't want to inherit that."
Furthermore, ProtectWise developed routines to encrypt data in the Apache Spark processing engine and the DataStax Enterprise edition of the Cassandra NoSQL database, which it uses in conjunction with the Amazon EMR platform to run analytics applications on both real-time and historical data. Stevens said Spark currently doesn't offer the kind of encryption support ProtectWise needs; Cassandra does "but at a tremendous performance hit" that the company can't afford to take.
All hands on deck for big data security
Security is an underappreciated topic among many data management professionals, according to Gartner's Adrian. But he believes that needs to change, particularly as organizations face up to big data security issues.
Data management teams should get more involved in the process of protecting big data systems, Adrian said. In data lakes built around Hadoop and other technologies that aren't as mature as relational databases are, "security is everybody's business," he noted.
And security initiatives can go hand in hand with efforts to improve data management and usage, TMW's Leonard said. In addition to supporting security audits, he said, the company's metadata repository lets his team see whether data scientists are correctly applying trucking operations data in the analytics applications they run in TMW's big data environment.
"We've found things, not that they weren't authorized to access a certain data element, but when they do, they're using it in the wrong way," Leonard explained. As a result, he added, TMW's training program has been upgraded to give the data scientists better information on how to use the data at their disposal.
ProtectWise's Stevens said he's open to using embedded functionality that's "more security-friendly" in technologies like Spark and Cassandra. "But we're happy to build some of this ourselves because it's business-critical," he noted. "Security is in our DNA. Not taking it seriously is not an option."
It's the same for TMW's Leonard when it comes to dealing with big data security issues. Protecting the data in the company's Hadoop environment "is the No. 1 thing on my mind," he said. "It's one thing to drive into big data, but boy, you better have security around it."