As privacy regulations such as Europe’s GDPR and California’s CCPA have come into force, and with countless others in the pipeline, companies across the industrial and technological spectrum have had to elevate their data management efforts to a whole new level. Jurisdictions globally are enforcing their own data-residency regulations too, with the likes of China and Russia saying to companies: “If you want to do business with us, keep your data here.”
Cumulatively, these various regulations have thrust the closely aligned issues of data residency, localization, and sovereignty to the forefront of companies’ consciousness. They can no longer play fast and loose with data: they must pay close attention to where they store data and the jurisdiction it falls under. These pressures are often cited in support of the burgeoning multi- and hybrid-cloud movement, since spreading workloads across providers not only helps companies avoid vendor lock-in, but also gives them flexibility over where their data and applications are hosted.
Control, ultimately, is the name of the game — both at a nation-state and at an individual company level, where digital autonomy is paramount.
“The adoption of data localization laws has been increasing, driven by the fear that a nation’s sovereignty will be threatened by their inability to exert full control over data stored outside their borders,” Russell Christopher, director of product strategy at data analytics company Starburst, told VentureBeat. “In an environment of shifting privacy laws, it’s increasingly difficult for businesses to analyze all critical data quickly, and at scale, while ensuring compliance.”
Permeating all of this is trusty ol’ open source software. Kubernetes, for example, is one of the most popular open source projects out there, serving as a common operating environment that allows companies to embrace the hybrid cloud, powering their applications across all public and private infrastructure. And there are countless companies that bake data sovereignty into the core of their product, even if it’s not immediately obvious that’s what it’s there for — with open source taking center stage.
Starburst is the venture-backed business behind the open source Presto-based SQL query engine Trino. The company recently announced a new fully managed, cross-cloud analytics product that allows companies to query data hosted across the “big three’s” infrastructure — without moving the data from its original location. For some companies, the core benefit here is simply circumventing the data silo problem, as they don’t have to pool the data in a single cloud or data warehouse. But even in situations where a company is using a single cloud provider, they will often have to store data in different “regions” to satisfy data residency requirements — cross-cloud analytics enables them to leave the data where it is, with only aggregated insights transferred out of that location.
“The data warehouse model is predicated on consolidating data assets to create a ‘single source of truth,’ but the regulatory realities of today are proving this to not only be a lie, but legally impossible,” Christopher said. “Adopting a multi-cloud strategy is a significant and necessary initiative for modern computing, but it’s not a final solution. The challenge of accessing and analyzing data across clouds and regions without moving that data to a central location is forcing a paradigm shift in the way we approach modern data management.”
Global biotech giant Sophia Genetics uses Starburst to query data from several regions across the world, while adhering to all the local data sovereignty and compliance requirements.
“Local users in the region can query atomic level data, and those outside of the region can only query aggregated data,” Christopher said.
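To make the access rule Christopher describes concrete, here is a minimal Python sketch. It is not Starburst's or Trino's implementation; the store layout, the `query_region` function, and the sample records are all hypothetical, chosen only to show row-level results staying in-region while callers elsewhere receive aggregates.

```python
# Hypothetical per-region stores. In this sketch the raw rows never leave
# the region they are stored in.
REGIONAL_STORES = {
    "eu": [
        {"sample_id": "eu-1", "reading": 40},
        {"sample_id": "eu-2", "reading": 60},
    ],
    "us": [
        {"sample_id": "us-1", "reading": 30},
    ],
}


def query_region(store_region, caller_region):
    """Return atomic rows to in-region callers, aggregates to everyone else."""
    rows = REGIONAL_STORES[store_region]
    if caller_region == store_region:
        # Local caller: row-level ("atomic") data is allowed.
        return {"level": "atomic", "rows": list(rows)}
    # Remote caller: only summary statistics cross the regional boundary.
    readings = [row["reading"] for row in rows]
    return {
        "level": "aggregate",
        "count": len(readings),
        "mean_reading": sum(readings) / len(readings),
    }
```

Calling `query_region("eu", "eu")` returns the individual rows, while `query_region("eu", "us")` returns only a count and a mean. A real deployment would enforce this in the query engine and its access policies rather than in application code.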
But how much of this solution is reliant on “open source,” exactly? Well, given that Starburst is built on such a foundational open source project as Trino, it is pretty integral, even if the commercial Starburst product makes the data sovereignty process easier.
“The ability to ‘plug in’ third-party integrations is baked into Trino — Trino also has some native capacity to ‘push down’ portions of a query to local compute, like a database,” Christopher said. “Starburst has taken advantage of this foundational potential to build technology which really does the heavy lifting.”
One of the major difficulties in adhering to a strict data sovereignty regimen is the sheer complexity of it. With systems and data spread across departments and regions, some attained through acquisitions that exist in their own siloed worlds, it’s enough of a challenge in itself for global companies to unlock big data insights — add the thorny issue of regional data residency regulations to the mix and things get just that little bit harder. Companies are dealing with this conundrum in different ways.
“While not every customer I speak to is far enough along on the maturity model (or big enough) to need to ‘slay this dragon,’ there is awareness across the board that data sovereignty is something that they’ll need to address at some point in the near future,” Christopher explained. “Some are just ignoring data sovereignty in the hopes that it will go away because of how big of an undertaking it is to solve, while others are taking the stance of ‘don’t share data, problem solved.’”
And this is why some companies might not be utilizing their data to its full potential — it’s just easier to keep data where it is, rather than risk regulatory wrath.
“Organizations need to share data or potentially find themselves at a competitive disadvantage,” Christopher added. “Now is the time for dealing with data sovereignty issues, and there are tools out there to help with this undertaking.”
Data sovereignty and decentralized team chats
Element is the company behind an end-to-end encrypted team messaging platform powered by Matrix, a decentralized open standards-based communication protocol that promises not to lock people into a closed ecosystem. Similar to how people can send emails to each other across providers and clients (e.g. Gmail to Yahoo), in a Matrix world, WhatsApp users can message people on Slack or Skype. It’s all about interoperability.
For context, thousands of separate organizations permeate Germany’s health care system, including hospitals, clinics, local doctors, and insurance companies. News emerged earlier this year that Gematik, the national agency responsible for digitizing Germany’s health care system, was switching to Matrix following a series of separate digital transformation efforts that left the various health care bodies unable to communicate effectively with each other. There were also questions over the security and privacy of the systems they had chosen to transmit confidential medical data.
By switching to Matrix, the different bodies involved didn’t necessarily have to use the exact same apps, but given they were all built on a single common standard, they had the flexibility to create systems to suit their own unique use cases while still being able to connect to each other.
“There are organizations where no one gets fired for buying Microsoft Teams or Slack, and there are organizations where people get fired for a lack of data security,” Matthew Hodgson, Element CEO and technical cofounder of Matrix, told VentureBeat. “We serve the latter group — organizations that need the best security available.”
All this feeds back to the core concepts around data sovereignty, digital autonomy, and control — not putting all your eggs in one digital basket (or multiple digital baskets that don’t play nice with each other).
“Data sovereignty is one of the main reasons why people and organizations choose Element, particularly in the public sector,” Hodgson explained. “A vendor-owned and managed SaaS model run from the U.S. — like Microsoft Teams or Slack — simply doesn’t work for the majority of governments, even if the datacenter happens to be local.”
Of course, even without an open source and open standards ethos, companies can go some way toward achieving data sovereignty through proprietary software. They can run their own email system through Microsoft Exchange, for example, which takes their data out of the cloud — but the company is still being locked into Microsoft’s gargantuan ecosystem. This type of approach to regaining control and digital autonomy “significantly undermines sovereignty,” according to Hodgson.
“Instead, open source solutions embrace open standards and empower the user to have full ownership over their data — the idea of vendor-locking users into proprietary data formats is a contradiction in terms for an open source app, where vendor-specific IP is considered toxic,” Hodgson said. “Open source solutions are leading the charge in empowering data sovereignty, and at last empowering the user or admin to have total control and ownership over their data.”
Elastic is best known for Elasticsearch, a database search engine companies use for any application that relies on the access and retrieval of data or documents. While it was formerly available under an Apache 2.0 open source license, the company transitioned to a duo of proprietary “source available” licenses earlier this year following an ongoing spat with Amazon Web Services (AWS). Elastic still adheres to most of the core principles of open source through a model it refers to as “free and open.”
“Many of Elastic’s customers are multinational, which necessitates that they have total control over their data to abide by the privacy and security laws of the countries in which they operate,” Elastic’s chief product officer Ash Kulkarni told VentureBeat. “Not surprisingly, we are seeing data sovereignty coming up in more customer conversations.”
Elastic operates what is known as a “single-tenancy” architecture, meaning each customer has its own database and instance of the software, affording it full control over the entire environment. Crucially, data is kept completely separate from other customers’ data. This is in contrast to a multi-tenancy architecture, in which a single instance of the software and underlying infrastructure serves multiple customers. While each customer’s data is kept logically separate, it still lives in the same environment in a multi-tenancy system, meaning individual companies have fewer customization options and users from different organizations share the same database.
There are pros and cons to both architectures, but single-tenancy is ultimately the preferred option in terms of retaining full data control.
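The difference between the two architectures can be sketched in a few lines of Python. This is an illustration only, not a description of how Elastic builds its service; the class names, the `region` field, and the sample documents are hypothetical.

```python
class MultiTenantStore:
    """One shared instance: every customer's rows sit in the same storage."""

    def __init__(self):
        self._rows = []  # shared by all tenants

    def insert(self, tenant_id, doc):
        self._rows.append({"tenant": tenant_id, "doc": doc})

    def search(self, tenant_id):
        # Separation is logical only: a filter applied at query time.
        return [r["doc"] for r in self._rows if r["tenant"] == tenant_id]

    def total_rows(self):
        return len(self._rows)


class SingleTenantStore:
    """One instance per customer, which the customer can place in its own region."""

    def __init__(self, owner, region):
        self.owner = owner
        self.region = region  # hypothetical per-customer placement setting
        self._docs = []       # holds this customer's data and nothing else

    def insert(self, doc):
        self._docs.append(doc)

    def search(self):
        return list(self._docs)


# Multi-tenancy: two customers, one underlying store.
shared = MultiTenantStore()
shared.insert("acme", "acme-invoice")
shared.insert("globex", "globex-invoice")

# Single-tenancy: each customer gets a separate instance and region.
acme = SingleTenantStore("acme", region="eu-central")
globex = SingleTenantStore("globex", region="us-east")
acme.insert("acme-invoice")
globex.insert("globex-invoice")
```

In the shared store, isolation depends on every query applying the tenant filter correctly. In the per-customer stores there is nothing to filter, and each instance can be placed and configured independently, which is the control the single-tenancy argument rests on.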
“For Elastic, data sovereignty means giving customers full jurisdictional control over their data and the infrastructure systems that the data flows through — an important aspect of that control is how the data is secured internally,” Kulkarni said. “Data has gravity, and our customers want foundational architecture that gives them country-specific controls to manage the data in the country where the data resides while allowing for analytics across all their data globally.”
Elastic recently rolled out cross-cluster search (CCS) on Elastic’s Cloud Enterprise plan, enabling more companies to search their data across all their datacenters. A business that runs an AWS instance in Northern Virginia, a Google Cloud instance in London, and a Microsoft Azure instance in Cape Town can search all of its data in a single pane without moving that data.
“This enables customers to maintain compliance with the privacy and security laws in the countries they operate in, while simultaneously helping them break down data silos and derive greater insights from their data,” Kulkarni explained.
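The general idea behind searching several clusters from one place can be sketched as a fan-out: the query travels to each cluster, each cluster searches its own documents, and only the matching hits come back. The Python below is a toy illustration of that pattern under assumed names (`CLUSTERS`, `search_cluster`, `cross_cluster_search`) and made-up documents; it does not use or represent Elasticsearch's actual API.

```python
# Hypothetical clusters, keyed by where they run. Each list stands in for
# an index that stays in its own datacenter.
CLUSTERS = {
    "aws-virginia": ["order 17 shipped", "invoice 9 paid"],
    "gcp-london": ["order 42 pending"],
    "azure-cape-town": ["order 7 shipped", "refund 3 issued"],
}


def search_cluster(name, term):
    """Run the search inside one cluster and return only its matching hits."""
    return [{"cluster": name, "doc": doc} for doc in CLUSTERS[name] if term in doc]


def cross_cluster_search(term, clusters=None):
    """Fan the query out to each cluster and merge the hits into one result."""
    names = clusters if clusters is not None else list(CLUSTERS)
    hits = []
    for name in names:
        hits.extend(search_cluster(name, term))
    return hits
```

Here `cross_cluster_search("shipped")` collects hits from the Virginia and Cape Town clusters in a single list, and each hit records which cluster it came from. The documents that did not match never leave their cluster.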
Global hedge fund Citadel uses Elastic for exactly that, allowing it to grow globally while “meeting data sovereignty requirements” where its customers reside. This is particularly important in highly regulated markets such as finance.
“They chose to work with Elastic to help scale their business, and ensure that any data being processed in specific countries was being run on infrastructure physically located in the country,” Kulkarni said.
But where does open source (or “free and open”) come into all this — wouldn’t it be possible to offer data sovereignty with a fully proprietary closed stack? Irrespective of the specific license that Elastic now issues its software under, the visibility it affords through its “source available” approach is the important factor.
“One of the benefits of open code is that the entire lifecycle of an organization’s data is open for inspection or compliance auditing by any legal party enforcing a law,” Kulkarni said. “Openness provides an additional level of transparency for a government agency to inspect and verify that the organization is compliant. You’re not getting the same level of transparency ‘goodness’ with closed-source software as you do with open code. Organizations have to trust that what their vendor says they are doing with their data is true. In contrast, open code allows an organization to verify that those compliance claims are accurate.”
The open source factor
A quick peek across the technological landscape reveals a slew of commercial open source companies going to market with “data sovereignty” as one of their core selling points. Cal.com, an open source alternative to meeting-scheduling platform Calendly, launched back in September and just last week raised a $7.4 million seed round of funding.
“Transparency and control of companies’ data is what can make or break their choice in which software they use,” cofounder and co-CEO Bailey Pumfleet told VentureBeat. “We’ve spoken with many companies who simply cannot use any other solution out there — due to the inability to self-host, a lack of transparency, and other data protection related characteristics which Cal.com has. This is absolutely vital for industries like health care and government, but an increasing number of non-regulated industries are [also] looking at how their software products treat and use their data.”
The past year offered countless examples that highlight not only the growing importance of data sovereignty and digital autonomy, but also the role that open source plays in it.
Back in September, Google announced a partnership in Germany with Deutsche Telekom’s IT services and consulting subsidiary T-Systems, with a view toward building a “sovereign cloud” for German organizations. While all the main cloud providers already offer some data residency controls as part of their regional datacenters, they don’t go far enough for industries and regulatory frameworks that require a tighter control over how data is handled, particularly as it relates to personally identifiable information (PII).
And so T-Systems will “manage sovereignty controls and measures” such as encryption and identity management of the Google Cloud Platform for German businesses that need it. It will also oversee other integral parts of the Google Cloud infrastructure, including supervising physical or virtual access to sensitive infrastructure.
The problem that this partnership ultimately seeks to address is one that the open source world has long set out to solve: bringing data control and oversight closer to home. As part of Google’s partnership with T-Systems, the duo have made specific provisions for “openness and transparency,” including collaborating on open source technologies, supporting integrations with existing IT environments, and serving up access to Google’s “open source expertise that provides freedom of choice and prevents lock-in,” Google Cloud CEO Thomas Kurian said at the time.
Elsewhere, a report from the European Commission (EC) shed light on the impact open source software has had, and could have, on the European Union (EU) economy. Notably, it observed that open source helps avoid vendor lock-in and increases an organization’s digital autonomy, or “technological independence,” as the report called it.
Meanwhile, SaaS-powered password management giant 1Password launched a survey this year to determine whether its (potential) customers would like a self-hosted version of its service.
“Currently, we believe that a 1Password membership is the best way to store, sync, and manage your passwords and other important information,” 1Password CTO Pedro Canahuati said in an interview with VentureBeat this year. “However, we’re constantly looking into new avenues to make sure we always offer what’s best for our customers. Right now, we’re in the exploratory phase of investigating a self-hosted 1Password. We’ll assess the demand for this as we gather results.”
While it’s not clear to what degree — if any — such a solution would embrace open source, it further highlights the growing push toward giving companies and individuals more control over their data. And open source will play a fundamental part in that.
“More and more of our personal data is moving into services that are hosted online,” Kulkarni said. “For the vast majority of people, their digital and physical worlds are indistinguishable — from ecommerce and social media to entertainment and communication. It’s all happening online. This mass migration of data is driving more scrutiny from regulators about what is happening to the data, who sees it, and who has access to it.”