What's new in Databricks - July 2024
Top Highlights
Databricks partnered with Meta to release the Llama 3.1 series of models on Databricks
Databricks Assistant and AI-generated comments are now generally available
Serverless Compute for Workflows and Notebooks are GA
Databricks Lakeflow Connect is now available
Governance & Delta Sharing
New connection support for Lakehouse federation
You can run federated queries on data managed by Salesforce Data Cloud
You can use SSO authentication to connect to SQL Server
Query History and Node timeline are now available in System Tables
Query history show sql queries run by SQL warehouses. It’s powerful to track your queries (execution status, error, data management etc). Learn more
Node timeline captures node-level resource utilization data at minute granularity. Each record contains data for a given minute of time per instance. Learn more
Resource quotas for Unity Catalog
The quotas for Unity Catalog objects have been increased! Check the new values below:
Partition metadata logging for UC external tables
For DBR13.3 LTS or above, you can enable partition metadata logging, which is a partition discovery strategy for external tables registered to Unity Catalog. The behavior is consistent with the partition discovery strategy used in Hive metastore and only impacts Unity Catalog external tables that have partitions and use Parquet, ORC, CSV, or JSON.
Databricks recommends enabling the new behavior for improved read speeds and query performance for these tables. Learn more
What’s new in Delta Sharing ?
Delta Sharing lets you share tables where liquid clustering is enabled as well as the objects metadata including comments and primary constraints and AI Models.
Liquid Clustering Sharing Support AI Models Sharing
Compute & Data Engineering
Databricks Lakeflow Connect is available
LakeFlow Connect offers native connectors that enable you to ingest data from databases and enterprise applications and load it into Databricks. LakeFlow Connect leverages efficient incremental reads and writes to make data ingestion faster, scalable, and more cost-efficient, while your data remains fresh for downstream consumption.
Salesforce Sales Cloud, Microsoft Azure SQL Database, Amazon RDS for SQL Server, and Workday Reports-as-a-Service (RaaS) are currently supported.
Blog Announcement, Keynote presentation
Databricks Serverless Compute for Workflows and Notebooks are GA
Configuring and managing compute such as Spark clusters has long been a challenge for data engineers and data scientists. Time spent on configuring and managing compute is time not spent providing value to the business.
Databricks Connect for Python now supports Serverless Compute
GenAI/ML
No time to read the ML updates? Check out this recap video of all the announcements listed below for July 2024 👇
Llama 3.1: A New Standard in Open Source AI: Meta Llama 3.1 on Databricks
Databricks partnered with Meta to release the Llama 3.1 series of models on Databricks → Blog Post
Mosaic AI Model Training available in public preview to all AWS us-east-1 and us-west-2 customers
Mosaic AI Model Training is now in ungated public preview for all AWS us-east-1 and us-west-2 customers. With this release, we now fully support AWS Private Link for workspaces in these regions. → Blog Post
Shutterstock ImageAI available in private preview on Foundation Model API pay-per-token
Shutterstock ImageAI foundation model (announced at Data+AI Summit) is now available in Private Preview on Foundation Models API (pay per token), with Provisioned Throughput support coming soon. → Blog Post
AI Functions now supports ai_forecast()
AI Functions now supports ai_forecast(), a new Databricks SQL function for analysts and data scientists designed to extrapolate time series data into the future. Demo below 👇
Mosaic AI Model Serving now supports serving multiple external models per model serving endpoint
Mosaic AI Model Serving now supports serving multiple external models per model serving endpoint. You can now also directly input API keys as plaintext strings to model serving endpoints that host external models. Demo below 👇
Function calling is now available in Public Preview
Function calling is in Public Preview, it is available using Foundation Model APIs pay-per-token models DBRX Instruct and Meta-Llama-3-70B-Instruct. → Documentation
Databricks Assistant and AI-generated comments are now generally available
The Databricks Assistant is GA. It was added to every page in Databricks and provides more accurate answers with additional data context from stack traces, lineage, popular tables, and more. AI-generated comments is also GA. It leverages GenAI to provide relevant table descriptions and column comments. → Blog Post
Blog Post: How Long Should You Train Your Language Model? (by MosaicAI Research)
How Long Should You Train Your Language Model? In this video, we explore a paper by the Mosaic AI Research team. The key takeaway is that the more inference demand you expect from your users, the smaller and longer you should train your models, as you should continue to see quality improvements. Link to the blog post.
Blog Post: Training MoEs at Scale with PyTorch
Over the past year, Mixture of Experts (MoE) models have surged in popularity. Databricks worked closely with the PyTorch team to scale the training of MoE models. In this video, we discuss how they scaled to over three thousand GPUs using PyTorch Distributed and MegaBlocks, an efficient open-source MoE implementation in PyTorch. Link to the blog post.
Data Warehousing
Predictive Optimisation, which can improve your query performance by 2x through intelligent optimization of data layouts, is now in GA. Learn more
Cost management dashboards are now in Public preview, making it easy to import a dashboard to monitor costs on a workspace or account level. Learn more
In a nutshell
In Databricks Runtime 15.4 LTS and above, Scala is generally available on shared access mode Unity Catalog-enabled compute, including support for scalar user-defined functions (UDFs)
Account SCIM v2.1. Learn more
End of life for Databricks managed passwords. Learn more
Databricks provides an open source software (OSS) JDBC driver that enables you to connect tools such as DataGrip, DBeaver, and SQL Workbench/J to Databricks through Java Database Connectivity (JDBC), an industry-standard specification for accessing database management systems.
Library installation on clusters now has a timeout of 2 hours.