Inside Lenovo's transition to a 1 PB, 115-person data analytics operation
It's not just big data anymore: It's fast data.
In the last decade, Lenovo grew its data analytics from a three-person operation using Microsoft Access and 250 megabytes of data to a unit of more than 115 people, which applies cognitive analytics to more than 1 petabyte of data.
The company pulled off this feat by adopting a cloud-first strategy and moving all infrastructure to Amazon Web Services (with some systems kept on internal infrastructure for hybrid and Hadoop workloads), according to Pradeep Kumar, senior big data architect at Lenovo, speaking at Talend Connect in New York City last week. It also added a variety of tools and services, such as Spark, Talend, TensorFlow and Alexa, that transformed how the business designed solutions and used its cloud platform.
The use of middleware has helped the company speed up the onboarding process, getting new hires to develop and deploy jobs by their third week at work, according to Kumar. Such software already takes care of a lot of the coding, so employees can skip the upfront hard coding and jump ahead to the assembly and execution.
The trade-off here is that employees can become tied to a middleware platform without understanding the underlying code.
The availability of integration software and other intermediary tools also helps a company divide work among types of data workers. Data engineers work in integration software such as Talend or Microsoft SQL Server, while data scientists work closer to the code in a language such as R or Python. In the middle, data architects help bridge the two sides.
Centralizing all of this work in the cloud helped Lenovo improve its scalability and flexibility. The cloud offers a host of advantages over on-premises solutions for most companies, but one factor can end up biting customers who make the switch: cost.
With on-premises solutions, resource utilization is not a problem, but once a company moves to outsourced infrastructure, costs can spiral out of control if not managed effectively, according to Kumar.
Companies need a way to calculate cost, manage unpredictable workloads and let cost drive resource optimization, he said, but the intersection of fast, good and cheap is often a small area.
Rather than settling on a provisioned amount of compute power, serverless and event-driven computing work hand in hand to solve this dilemma, Kumar said. And every tool needs to be added with cost, scalability and interoperability in mind.
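To illustrate the model Kumar describes: in a serverless, event-driven setup, code runs (and is billed) only when an event triggers it, rather than on always-on provisioned servers. The sketch below is loosely modeled on an AWS Lambda-style handler in Python; the event shape and function names are illustrative assumptions, not Lenovo's actual pipeline.

```python
import json


def handler(event, context=None):
    """Event-driven entry point, invoked per event (e.g. a file
    landing in object storage). Compute is consumed only while
    records are being processed, so idle time costs nothing."""
    records = event.get("records", [])
    # Parse any JSON-encoded records; pass through ones already decoded.
    processed = [json.loads(r) if isinstance(r, str) else r for r in records]
    return {"status": "ok", "count": len(processed)}
```

Because each invocation is stateless and billed per execution, spend scales with actual workload instead of peak capacity.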
The movement from IaaS, PaaS and batch processing to real-time computing and Function as a Service has accompanied a shift in management responsibility. Under FaaS, vendors manage applications, runtime, containers, operating systems, virtualization and hardware, while customers retain some management of the functions themselves, according to Kumar.
In the past, customers were responsible for more of these pieces, especially functions, applications and runtime.
The shift in computing landscape and customer management has ushered in a new paradigm in how businesses design solutions and use the cloud.
Follow Alex Hickey on Twitter