The delicate art of the data lake
In an ideal world, a data lake should keep data in a way that makes it readily available. That's why the tools put into the lake are so important. Otherwise, it's just a blob.
Companies have become obsessed with data, and for good reason: collecting the right data, and knowing how to analyze it, can unlock potential a company never knew it had.
One term executives will hear tossed around during almost any discussion about big data: "data lakes."
"At a very high level, a data lake is simply a storage component where you can put structured or unstructured data in its raw format," Shaun Bierweiler, vice president of U.S. public sector at data software company Hortonworks, told CIO Dive in an interview. "When you dig a little bit lower and get use case and applicability of [data], that's where the magic really happens."
Diving into the data deep end
How helpful a data lake is depends on what's done with raw data.
"One person's data lake is another person's data swamp," Colin Britton, chief strategy officer at Devo, told CIO Dive in an interview.
A lot of companies are storing data but often don't know what to do with it, which is where things can get swampy.
"Most of these big companies have massive centralized IT assets like data warehouses and data lakes, but it's very hard to access it because they've never been built with this specific business purpose at the time," Prat Moghe, CEO of data platform company Cazena, told CIO Dive in an interview.
In an ideal world, a data lake should keep data in a way that makes it readily available, and speeds up tasks that would otherwise take weeks or even months to complete.
That's why the tools put into the lake are so important. Otherwise, it's just a blob.
Hortonworks gets a lot of calls from companies in that kind of situation, said Bierweiler. They have a lot of data, but "they either have the inability to meet the requirement of their mission, or they're no longer getting results in a timely manner for the results they're trying to accomplish."
Being able to do these things quickly can prepare a company for when it needs information in the future for a purpose it may not know about yet, said Britton. That's especially true of security threats.
"We don't know how they're going to look, what the sequence of events will lead to that threat or breach," he said. "We don't know what that looks like in the future, so we collect the information and are able to look at it in a way that makes sense in a future state."
Teaching an old company new tricks
Data lakes aren't just for new and upcoming companies, either.
Carlson Wagonlit Travel (CWT) is over 100 years old and today a leading corporate travel management company, big enough that it serves enough travelers to fill almost 200 Boeing 747s every day.
CWT wanted to learn from traveler behavior and deliver personalized services, which would require bringing together and analyzing existing customer data, transaction data, traveler comments and external market data from more than 1,600 data sources.
"Like many companies at that scale, they're trying to transform themselves and get more digital," said Moghe. CWT uses Cloudera on Microsoft Azure through Cazena's big-data-as-a-service platform. "What they realized is that the real asset they have is the data that they've collected and they wanted to figure out [how] to monetize the data."
This was a CEO-level initiative, Moghe said, and it's big: they didn't just build a lake. CWT created its own data solutions group that now includes over 100 data scientists in locations around the world.
When a data lake works right, as it does for CWT, "it's the data lake that bridges the gap between IT and business," Moghe said.
Security beyond borders
Data lakes are powerful because of the information that can be drawn from them. That's why they need to be secure too.
Putting up walls around the data is important, Britton said, as is making sure the data therein doesn't include personally identifiable information, whether it could unmask someone by itself or when combined with other information.
"It may be personal information in a single event doesn't make sense, but across many different events and messages, you can piece together something about them," Britton said.
Think of Strava data, for example, which was first used to identify the locations of military bases and then extrapolated to unmask individuals as well. It wasn't necessarily one piece of information that allowed the identifications to be made but many different pieces put together.
Businesses want to make sure someone couldn't also do that with the components that make up a data lake.
"The ability to do masked views of the data as you go get it is just as much of a challenge as the security of the boundaries," said Britton.
Follow Jen A. Miller on Twitter