Unlocking the potential of data: lessons learned from leading data42

Achim Plueckebaum
10 min read · Jan 20, 2023

In June 2019 I took over the leadership role for data42, Novartis’ transformational approach to leveraging one of the largest and most diverse R&D datasets in the pharma industry to unlock additional value from existing drugs and to enable the discovery of new drugs, with the ultimate moonshot goal of changing the way medicines are developed. Throughout these years I have learned and grown with the role (see my post from 2021 here), and at every stage of the journey I adapted my leadership style. Now, as I close that chapter and look back over the whole period, here are my main learnings, which I hope will benefit the innovators who will think differently about discovering and developing medicines in the near future. I have categorized them into four buckets: technology, data, customers, and people.


Learning 1: Do not try to build it yourself. When we started in 2018, our default choice for the platform was a homegrown solution which we repurposed for data42. This turned out to be very cumbersome: while we wanted to work on use cases and show results, we spent a lot of time on architecture diagrams and on stitching together various solutions. About a year later, we went off-the-shelf: a platform which would give us the core data management capabilities (technically, functionally, but also from the talent perspective), so that we as the data42 team could focus on what we knew best: our data, our processes and our business use cases. We used this kernel platform and then built the remaining, missing capabilities around it, such as the high performance computing and analytical platforms which statisticians and data scientists asked for.

Learning 2: Use the cloud to the maximum. The hesitation was enormous in the beginning: how can we even consider putting highly sensitive data from our clinical trials into the cloud? Well, it was the best decision ever. Cloud providers must adhere to the highest security standards and constantly optimize their server landscapes for performance and state-of-the-art cloud innovations while keeping cost manageable. This is hard to match with any internal solution. Moreover, for an endeavor like data42 we needed full flexibility and elasticity for compute and storage, which is best achieved in the cloud! The upfront effort we had to go through (internal approvals to ensure compliance with internal and external requirements, change management to bring management and users along) was more than worth it.

Learning 3: Stay non-GxP as long as possible, and then go for targeted GxP validation. In healthcare, GxP is a must if and when you touch regulatory processes. It’s a good thing, also for systems validation: it ensures that processes which are coded in a system stay compliant and do exactly what they are supposed to do. For data42, a platform which was supposed to be open, flexible, and exploratory, mainly for secondary use of data, it was not suitable in the beginning: we decided early on to stay non-GxP to keep flexibility, speed and openness. The downside was twofold. First, any analysis done on our platform which scientists wanted to use in regulatory filings needed to be redone in a GxP system. Second, our non-GxP status would not allow us to take over functionalities from neighboring systems (e.g. systems for primary use) which were GxP validated. Hence, once our platform was up and running, we decided to go for a targeted validation of certain parts of data42. This was the right decision: it allowed us to be fast in the beginning (GxP systems validation takes approximately 6–12 months), and to be very targeted on validation once we knew what to validate and why.


Learning 4: The R&D data spectrum is endless…focus is key. We started our mission by stating that we would integrate “all R&D data” into data42, not realizing initially that this would be a never-ending exercise which would take years without clear output and results. Therefore, after a few months of analyzing and understanding business needs, external market trends and our own strengths and weaknesses, we narrowed our focus to patient-level data coming from our own clinical trials. These datasets became the nucleus of our data work, and the patient ID became the golden record for our data harmonization. Soon it was clear that non-human data generated and used in the early discovery phase (e.g., chemical structures, compound libraries) should not be in our focus, and we added real world data (RWD) only in phase two of data42. The trial data, however, allowed us to quickly link all our genomic and proteomic data into our datasets: a rich source of information which had never been available before, as these databases had always been separate from each other. In total, we ingested data from about 3,000 internal databases, which was still a massive effort despite the focused approach.
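To make the idea of the patient ID as the linking key a bit more concrete, here is a minimal sketch in Python. It is purely illustrative and not how data42 was implemented: the function name `link_by_patient_id`, the field names, and the sample records are all made-up assumptions.

```python
from collections import defaultdict


def link_by_patient_id(*sources):
    """Merge records from several named sources into one entry per patient ID.

    Each source is a (source_name, records) pair; every record is a dict
    that carries a "patient_id" key (a hypothetical field name).
    """
    linked = defaultdict(dict)
    for source_name, records in sources:
        for record in records:
            patient_id = record["patient_id"]
            # Keep each source's fields under its own name so columns never collide.
            linked[patient_id][source_name] = {
                key: value for key, value in record.items() if key != "patient_id"
            }
    return dict(linked)


# Made-up sample data, for illustration only.
clinical = [
    {"patient_id": "P001", "trial": "TRIAL-A", "arm": "treatment"},
    {"patient_id": "P002", "trial": "TRIAL-A", "arm": "placebo"},
]
genomic = [
    {"patient_id": "P001", "variant": "hypothetical-variant-1"},
]

linked = link_by_patient_id(("clinical", clinical), ("genomic", genomic))
```

In this sketch, patient P001 ends up with both a clinical and a genomic entry, while P002 only has clinical data. That is exactly the kind of gap a shared key makes visible once formerly separate databases are linked.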

Learning 5: FAIRification of R&D data is hard, and cannot be fully automated. Right at the beginning, we declared the industry-standard FAIR principles (findability, accessibility, interoperability, and reusability) to be our mantra for data42 as well. With our focus on clinical trial data (and related omics), the findability and accessibility problems were solvable for us; interoperability and reusability were harder nuts to crack. Building the data pipelines which would harmonize our data from about two decades, during which multiple different data standards were applied, soon became a half automated, half manual effort: we needed clinical expertise from our business teams to make sense of datasets that were generated a long time ago and that were partly acquired through Novartis’ M&A activities. It was a tedious (but necessary) effort which took us about 6 months and delivered, for the first time, a standardized, interoperable dataset across the main domains of all 3,000 trials (for which Novartis electronically captured trial information over the past 20 years). It also needed a lot of expectation management with our internal Novartis stakeholders, as in this period we could not deliver very tangible business outcomes. Over time, we learnt that, given the number of CDISC variables and domains, the clinical harmonization evolved constantly to FAIRify more variables and domains, whereas a full FAIRification across all domains and variables would not be worth the effort. Based on the first mapping done, we then created a new ML-based mapping recommender that helps to fast-track the harmonization of new trials: the way to make large-scale, automated harmonization work in the future.
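The "half automated, half manual" split can be sketched in a few lines of Python. To be clear, this is not the data42 recommender: plain string similarity from the standard library stands in for the ML model, the short list of target variable names is only an illustrative subset, and the function names `recommend_mappings` and `split_for_review` are hypothetical.

```python
from difflib import get_close_matches

# Illustrative subset of standard variable names used as mapping targets.
STANDARD_VARIABLES = ["USUBJID", "AGE", "SEX", "RACE", "ARM", "VSTESTCD"]


def recommend_mappings(legacy_name, standard_variables=STANDARD_VARIABLES,
                       n=3, cutoff=0.5):
    """Suggest up to n standard variables for a legacy column name.

    String similarity is a stand-in here; a real recommender would be
    trained on mappings that clinical experts have already confirmed.
    """
    candidate = legacy_name.strip().upper()
    return get_close_matches(candidate, standard_variables, n=n, cutoff=cutoff)


def split_for_review(legacy_names):
    """Separate columns with suggestions from those needing manual mapping."""
    suggested, manual = {}, []
    for name in legacy_names:
        suggestions = recommend_mappings(name)
        if suggestions:
            suggested[name] = suggestions  # an expert still confirms these
        else:
            manual.append(name)  # no usable suggestion: fully manual mapping
    return suggested, manual


suggested, manual = split_for_review(["age", "zzzz"])
```

Even in this toy version, the suggestions are only recommendations: a clinical expert still confirms each mapping, and columns without a usable suggestion go straight to manual review.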

Learning 6: The data you have is never good enough. After all the data harmonization and integration work we had done, we opened Pandora’s box to our expected user group and: nothing much happened! Statisticians were skeptical: they couldn’t find (or reproduce) the exact same analysis they had in their favorite legacy tool, either because some datasets were missing or because the anonymization of the data (needed for secondary research on trial data) had “skewed” the data so that it wasn’t usable anymore, in their opinion. Data scientists were looking for even more data, especially as building AI models needed an even larger dataset. So, we quickly changed strategy for data42: we built a “T”-shaped database. A data inventory provided the initial, horizontal overview of our dataset, also showing cross-disease information and a multi-modal overview. Then, once the data requirements were clear, we (the data42 team) created purpose-built data pools, mainly around our given disease areas of interest. This strategy quickly increased user adoption and acceptance, but to this day it still requires a lot of dedication and intensive work with our scientific teams in the Novartis disease areas.


Learning 7: Focus user adoption on clearly segmented user groups. Analytical work happens almost everywhere across the R&D value chain. So, a key question we had to ask ourselves early on was: who is actually our main user group? Who would benefit most from an innovative platform for exploratory analysis? Given our focus on patient-level trial data, we could narrow down our main user population quickly: statisticians in clinical development, and data scientists in translational medicine and in development. We could then segment these users into self-service users (user groups who knew how to write code and could largely do their analysis without our help) and serviced users (user groups which needed data engineering support and/or had complex scientific asks). These two user groups became our “internal TAM” (total addressable market) and allowed us to focus our change management efforts for user adoption on fewer user groups.

Learning 8: Stakeholder management is a 100% job, and cannot be delegated. Creating a large, complex database like data42 is a long, costly, and unusual investment for a rather traditional pharma company. The fact that Novartis made that commitment in 2019 was great, but it took time, patience, and resilience to keep that commitment alive, especially as stakeholders and business priorities change over time, and in a fast-paced business world outcomes are expected in record time. To manage these expectations, we developed a sophisticated stakeholder plan with various levels of engagement, including an external advisory board, regular business and finance updates for executives, and business roadshows for upper management. This was a full-time job which mainly rested with me as the leader of data42 and involved my leadership team to a large extent as well. The best time investment we could make!

Learning 9: Data alone is not a scalable business. Our vision, right from the beginning of data42, was to treat our work as if we were a business. We had the concept of a start-up (freedom to operate within a large corporation), we had bold ideas (revolutionize the way we discover and develop medicines), and we operated with a P&L mindset (clear return on investment for every dollar spent). But how much should data be monetized, if at all? Apart from this being an ethically important question, we learned throughout the years of working with our users that you need more than just data to generate additional value. What creates tangible value is the clever integration of data, data science and scientific expertise. Only if and when these three come together do you have the magic to work on use cases like patient subgrouping, disease progression, and external control arms, as shown in the picture below.


Learning 10: Data work is an art. Hire artists! It’s amazing how many different skills are needed to make data usable, and to make sense of it. While we started with “classical” hires in the area of technology and data science, we soon widened our search to more “unusual” talents from various fields of expertise: non-pharma data experts, strategists from consulting firms, entrepreneurs from various cultures, young graduates and experienced executives; the list could go on. These diverse talents soon became the “secret sauce” of our success and made data42 the “place to be” within Novartis and beyond. Orchestrating these different backgrounds required a lot of attention and time. What bonded us was a bigger purpose, a clear vision and the freedom to operate. And, as we soon figured out, we had not just hired talented individuals and a new generation of leaders into our teams, but a bunch of artists who wanted to leave a legacy.

Learning 11: Failures matter. Pivots are key. An adventure like data42 naturally faces obstacles, and many of them could be considered failures along the way. For example, our first handful of use cases did not deliver any tangible business outcome and literally failed. While these failures certainly did not please us in the moment, they were eventually key in making us pause, reflect, and adapt. During the last four years, we had these “pivotal moments” many times per year, and at some point we turned them into a success factor: we used the learnings from the failures to pivot our strategy, and the teams quickly adjusted to our rhythm and built these pivots into our agile methodology. Making failures a part of our strategy was key!

Learning 12: Resilience and humility are key leadership attributes. We encountered a lot of skepticism along our journey. Not everyone is a supporter, and in fact the people who push back initially are the ones to convert, as they usually have great input for improvement. Approaching the skeptics with openness, patience and understanding was key to success for us as a team, and for me as the leader. It required a high level of resilience and humility from us. After all, we were all dealing with a topic where there was not one straight answer, and we owed it to our patients to deal with their data in the most sensible and sensitive way. It also required collaborating with the users to define the use cases which drive curation needs, to get input on harmonization, and to jointly implement analyses.

In closing, I felt humbled and honored that I could lead this innovative effort over the past years. Using the words of Mahatma Gandhi — which are even more relevant in today’s world of social media — I feel proud that data42 could generate the next generation of leaders that will rock the healthcare space!




Achim Plueckebaum is the Head of data42 for Novartis Research & Development (R&D).