Dataset Concepts

This section describes key concepts associated with Dremio datasets.

Physical and Virtual Datasets

In Dremio, datasets are either physical or virtual.

Physical Datasets

Physical datasets (PDS) are represented by one of two icons in the Dremio UI. A PDS created from a folder of files displays as a folder icon, whereas a PDS created from a single file is represented by a grid icon.

A PDS is stored within its respective data source.

Partition information is available for the columns of each dataset. Hover over the PDS icon to reveal the information icon, and click the PDS icon to see the partition information.

Virtual Datasets

Virtual datasets (VDS) are represented by a grid icon in the Dremio UI.

They are derived from physical datasets or other virtual datasets. Virtual datasets are defined by the steps necessary for their creation, including transformations, filters, joins, and other modifications. Because virtual datasets do not make a copy of the data, they use very little space, and they always reflect the current state of the physical datasets they are derived from.

Example: Suppose we have a collection called sales on a MongoDB source. Inside Dremio, this collection is represented as a physical dataset named 'sales'. We can open this dataset within Dremio and save it as a virtual dataset called 'salesRaw'. Later, we can derive another virtual dataset called 'salesNY' from 'salesRaw' by excluding all data that doesn't originate from the state of New York. 'salesRaw' and 'salesNY' can each be queried and will return different results, but both are based on the same underlying physical dataset.
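The layering in this example can be sketched outside Dremio, with SQLite views standing in for virtual datasets. This is only an analogy: the `state` and `amount` columns and the sample rows are assumptions made for illustration, and SQLite views are not Dremio virtual datasets. The sketch shows that the two derived datasets return different results and that both pick up a change to the underlying data without any copy being refreshed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for the physical dataset 'sales'. Columns and rows are invented.
conn.execute("CREATE TABLE sales (id INTEGER, state TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "NY", 10.0), (2, "CA", 20.0), (3, "NY", 5.0)],
)

# 'salesRaw' is defined directly on the physical data.
conn.execute("CREATE VIEW salesRaw AS SELECT * FROM sales")
# 'salesNY' is derived from 'salesRaw' by keeping only New York rows.
conn.execute("CREATE VIEW salesNY AS SELECT * FROM salesRaw WHERE state = 'NY'")


def count(name):
    """Return the number of rows currently visible through a table or view."""
    return conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]


raw_before, ny_before = count("salesRaw"), count("salesNY")

# Change the underlying data; neither view is touched.
conn.execute("INSERT INTO sales VALUES (4, 'NY', 7.5)")

raw_after, ny_after = count("salesRaw"), count("salesNY")
print(raw_before, ny_before, raw_after, ny_after)
```

Because the views store only their definitions, the new row is visible through both of them as soon as it is inserted into the base table.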

Spaces

Spaces are where virtual datasets are saved. Spaces provide a way to group datasets by a common theme such as a project, department, or geographic region.

For instance, a list of spaces for an online retailer might look like:

  • Users
  • Transactions
  • Products
  • Sales Analysis
  • Web Traffic

Home Space

Each user has a default private Home space for running tests and experiments without sharing them. You can add, update, and delete catalog objects within the home space, but you cannot update or delete the home space itself.

Folders

You can use folders to provide a deeper layer of organization to spaces. Folders can contain other folders.

Paths in Dremio

A path is a dot-separated list of names that indicates the location of a dataset. It starts with the name of the source or space in which the dataset resides, continues with any folders or data source structures, and ends with the name of the dataset. Here are a few examples of what dataset paths look like in Dremio:

  • Transactions.regions.salesNY
  • Web Traffic.october.visits

In the first example, Transactions is a space, regions is a folder, and salesNY is a virtual dataset. In the second example, Web Traffic is a file system data source, october is a directory on that file system, and visits is a sub-directory containing a group of files with a common structure.

Tip: SQL queries always reference datasets using their full path, for example: SELECT * FROM "web traffic".october.visits
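As a minimal sketch of how a full path like the one in the Tip might be assembled from its components, the hypothetical helper below joins names with dots and wraps a name in double quotes when it is not a plain identifier (for example, when it contains a space). The function name `to_sql_path` and the exact quoting rule are assumptions for illustration, not part of Dremio. Reserved words are not handled, and names that contain a double quote are rejected rather than escaped.

```python
import re

# Assumption: a name made only of letters, digits, and underscores
# (not starting with a digit) can be written without quotes.
_SIMPLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def to_sql_path(components):
    """Join path components with dots, double-quoting the ones that need it."""
    parts = []
    for name in components:
        if '"' in name:
            # Escaping rules are out of scope for this sketch.
            raise ValueError(f"unsupported character in name: {name!r}")
        if _SIMPLE_NAME.fullmatch(name):
            parts.append(name)
        else:
            parts.append(f'"{name}"')
    return ".".join(parts)


space_path = to_sql_path(["Transactions", "regions", "salesNY"])
source_path = to_sql_path(["Web Traffic", "october", "visits"])

print(space_path)   # Transactions.regions.salesNY
print(source_path)  # "Web Traffic".october.visits
print(f"SELECT * FROM {source_path}")
```

Keeping the components in a list until the last moment avoids ambiguity when a source or folder name itself contains a space.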