The Role of Data Lakes in Modern Data Platforms: Post Webinar Q&A Session

DataArt
4 min read · May 5, 2020


On April 9th, 2020, DataArt hosted a free webinar titled “The Role of Data Lakes in Modern Data Platforms,” presented by experts in Data, BI, and Analytics. The panelists, Oleg Komissarov, Principal Consultant at DataArt, and Alexey Utkin, Principal Solution Consultant at DataArt, explained how data lakes, data warehouses, and data hubs differ in purpose and capability. They also discussed the core principles, technologies, and benefits that data lakes can bring to a data solution.

Watch the webinar here.

Data lakes are centralized repositories for structured and unstructured raw data that is later used for analytics with any toolset: ad-hoc querying, visualization, real-time and/or big data processing, and machine learning. A data lake lets you store and process data in a low-cost, scalable way. It serves both as the single landing point for incoming raw data of any quality and as a reliable source of higher-quality processed data.
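
To make the idea concrete, here is a minimal Python sketch (not from the webinar) of the basic pattern: land raw events as they arrive in a raw zone, then analyze the same files later with whatever tool fits. A local folder stands in for lake storage, and all paths and field names are hypothetical.

```python
# Minimal sketch: land raw events of any shape in a single store ("raw zone"),
# then read the same files later for ad-hoc analysis. Paths are hypothetical,
# and a local folder stands in for cloud object storage.
import json
from pathlib import Path

import pandas as pd

RAW_ZONE = Path("data-lake/raw/clickstream")
RAW_ZONE.mkdir(parents=True, exist_ok=True)

# 1. Ingest raw, possibly messy events exactly as they arrive.
events = [
    {"user": "u1", "action": "view", "ts": "2020-04-09T10:00:00"},
    {"user": "u2", "action": "buy", "amount": 19.99, "ts": "2020-04-09T10:05:00"},
]
(RAW_ZONE / "events.jsonl").write_text("\n".join(json.dumps(e) for e in events))

# 2. Later, any toolset can pick up the same files for analysis.
df = pd.read_json(RAW_ZONE / "events.jsonl", lines=True)
print(df.groupby("action").size())
```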


After the webinar, we offered a free Q&A session to webinar attendees. Here are some select questions and the answers our experts gave:

1. If a company’s data is not recorded automatically and users input it manually, what is the most efficient way to build a data platform and implement a data lake for it?

Alexey Utkin: I would record the data with software such as Office 365 or its analogues, store the data files on shared corporate drives, and then integrate them into the data lake in a consistent format (see the sketch after this answer). Alternatively, setting up a cloud-based warehouse for analytics is often much easier. If your company’s data lives in locally stored files (PDFs, MS Excel/Word documents), you will probably need considerable effort and investment to extract structured data from them properly. As a workaround, you could index the document content and use search systems, rather than a data lake, to store and process the data.

However, no data lake by itself will resolve data governance challenges or establish a proper culture of data capture. User-friendly solutions such as Anaplan and TrueCue can help you gradually improve data practices in your organization.
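
As a rough illustration of the file-based route Alexey describes (the sketch referenced in the answer above), the snippet below picks up manually maintained Excel files from a shared drive and lands them as Parquet in the raw zone of a lake. The share path, bucket name, and partitioning scheme are assumptions, not details from the webinar.

```python
# Minimal sketch: ingest manually maintained Excel files from a shared drive
# into the raw zone of a data lake as Parquet. The share path and bucket name
# are hypothetical. Requires pandas, openpyxl, pyarrow, and s3fs.
from datetime import date
from pathlib import Path

import pandas as pd

SHARED_DRIVE = Path("//corp-share/finance/manual-entries")   # assumed location
RAW_ZONE = f"s3://example-data-lake/raw/manual_entries/ingest_date={date.today()}"

for xlsx in SHARED_DRIVE.glob("*.xlsx"):
    df = pd.read_excel(xlsx)                 # read the manually entered sheet
    df["source_file"] = xlsx.name            # keep lineage back to the source file
    df.to_parquet(f"{RAW_ZONE}/{xlsx.stem}.parquet", index=False)
```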

2. Is there a risk that raw, unfiltered data in a data lake will lead to a GDPR audit and violations?

Oleg Komissarov: A GDPR violation can happen in any data system that is not properly designed to deal with sensitive data. We usually recommend anonymizing or hashing PII data fields before sourcing them into a data lake. Raw, unfiltered data should not sit in a data lake without proper access configuration, audit, and lineage. Analytics based on the hashed data can still be linked back to the real data in the customer-facing application, where sensitive data is handled accordingly; this is what is often called a de-identified data lake. Besides, some editions of cloud data warehouses, for example Snowflake Enterprise, offer built-in capabilities to hash and mask sensitive data in the data warehouse.
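
To show what the hashing step can look like in practice, here is a minimal Python sketch; it illustrates the general approach rather than any specific project setup, and the field names and salt handling are assumptions.

```python
# Minimal sketch: hash PII columns before records reach the raw zone, so the
# lake only stores pseudonymized identifiers. Field names and the salt source
# are hypothetical.
import hashlib
import os

PII_FIELDS = {"email", "phone", "full_name"}        # assumed PII columns
SALT = os.environ["PII_HASH_SALT"].encode()         # keep the salt outside the lake

def pseudonymize(record: dict) -> dict:
    out = dict(record)
    for field in PII_FIELDS & record.keys():
        digest = hashlib.sha256(SALT + str(record[field]).encode()).hexdigest()
        out[field] = digest            # same input -> same token, so joins still work
    return out

# The pseudonymized record is what gets written to the data lake.
print(pseudonymize({"email": "jane@example.com", "order_total": 42.5}))
```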


3. What is your take on data-to-compute versus compute-to-data patterns for data consumption and access in analytics?

Oleg Komissarov: Data-to-compute versus compute-to-data, in other words scale-up versus scale-out, is a classic question in designing any big data system. In general, the scale-out (compute-to-data) strategy is easier to implement, more accessible, and more cost-effective, particularly in the cloud. Its weak point is the SLA on latency and throughput, so the answer depends on the use case. If you need to guarantee a particular SLA, you will have to select and tune specific technologies (and possibly even hardware) rather than rely on what is generally available as a standard PaaS solution; that pushes you toward data-to-compute. You may end up configuring Kafka/Spark in a specific way or using in-memory or GPU technologies.

I always favor a compromise based on the cost/performance ratio acceptable to the business. If computations are fast and cost-effective and can essentially be done at the query level, there is no need to pre-compute data. Conversely, if the computed data is used widely or requires significant computational resources, I would pre-compute it.
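
As one concrete reading of the pre-compute option, here is a minimal PySpark sketch (an illustration, not Oleg’s code) that aggregates raw orders once and writes the result to a curated zone, so downstream queries read a small table instead of re-scanning raw data. Paths and column names are hypothetical.

```python
# Minimal sketch of the pre-compute option: aggregate once, write the result to
# a curated zone, and let consumers query the compact table. Paths and columns
# are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("precompute-daily-sales").getOrCreate()

raw = spark.read.parquet("s3://example-data-lake/raw/orders/")   # large raw data

daily_sales = (
    raw.groupBy(F.to_date("order_ts").alias("order_date"))
       .agg(F.sum("amount").alias("total_amount"),
            F.countDistinct("customer_id").alias("customers"))
)

# Query-level consumers now hit this small table instead of the raw zone.
daily_sales.write.mode("overwrite").parquet("s3://example-data-lake/curated/daily_sales/")
```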

4. How do you manage data validation and exception resolution (i.e., data quality) for data in data lakes?

Alexey Utkin: It is usually helpful to implement automated data quality checks with the same analytical tooling that is used for big data analytics. When data is saved to the raw zone, it can be automatically scanned with validation scripts. Workflow management tools such as Apache Airflow or AWS Step Functions can then orchestrate the appropriate response when data quality falls below the required level, for example by sending alerts and/or notifications, flagging the data, or preventing it from moving to the next zone (a sketch of this pattern follows). Most data governance and data quality solutions on the market also offer similar features for data lakes.
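
The sketch below shows one way this pattern might look as an Airflow 2 DAG: validate a newly landed raw batch, then either promote it or flag it and alert. The DAG id, zones, and validation rule are placeholders, not details from the webinar.

```python
# Minimal Airflow 2 sketch: validate a newly landed raw batch, then branch to
# either promoting it to the next zone or flagging it and alerting the team.
# The DAG id, task logic, and schedule are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import BranchPythonOperator, PythonOperator

def validate_raw_batch(**context):
    # Placeholder check; in practice run validation scripts over the raw zone.
    failed_rows = 0   # e.g., rows violating schema or null/range constraints
    return "promote_to_clean_zone" if failed_rows == 0 else "flag_and_alert"

with DAG(
    dag_id="raw_zone_quality_check",
    start_date=datetime(2020, 4, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    check = BranchPythonOperator(task_id="validate_raw_batch",
                                 python_callable=validate_raw_batch)
    promote = PythonOperator(task_id="promote_to_clean_zone",
                             python_callable=lambda: print("copy raw -> clean"))
    alert = PythonOperator(task_id="flag_and_alert",
                           python_callable=lambda: print("flag batch, notify team"))

    check >> [promote, alert]
```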

5. In terms of source systems, which ones make data lake implementation complex? For a closed platform where you do not have access to the database, would you rather access the data via an API or via reports?

Oleg Komissarov: In general, ingestion from external systems via an API is more complex than ingestion through direct access. If possible, SDKs, APIs, or direct access (via drivers) are preferable to report exports. Still, the true answer depends on the specific system.
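
To illustrate why the API route tends to be heavier, here is a minimal Python sketch contrasting the two ingestion paths; the endpoint, table name, and connection string are hypothetical.

```python
# Minimal sketch contrasting the two ingestion routes: a paginated REST API pull
# versus a single query through a database driver. Endpoint, table name, and
# connection string are hypothetical.
import pandas as pd
import requests
import sqlalchemy

# API ingestion: typically paginated, rate-limited, and prone to schema drift.
def ingest_via_api(base_url: str) -> list:
    rows, page = [], 1
    while True:
        resp = requests.get(f"{base_url}/orders", params={"page": page}, timeout=30)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            return rows
        rows.extend(batch)
        page += 1

# Direct access via a driver: usually a single, simpler pull.
def ingest_via_driver(conn_str: str) -> pd.DataFrame:
    engine = sqlalchemy.create_engine(conn_str)
    return pd.read_sql("SELECT * FROM orders", engine)
```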


Originally published at: https://bit.ly/35vOFPJ
