Protocol Documentation

metastore.proto

General notes

Design goals

Some notes on protobuf proto3

- It supports maps, which are used extensively here. Maps can be simulated in proto2 if needed.

- All fields are optional. If we need to move to proto2, it is important to add "optional" to every field to preserve these semantics.

AddPartitionRequest

Add a single partition to a table.

A partition is described by a list of "values", one value per column in the partition schema.

There is no validation that the values actually match the partition schema.

Each partition belongs to a table, and each table belongs to a database.

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| sequence | uint64 | | Request sequence (used for bulk requests) |
| catalog | string | | |
| db_id | Id | | |
| table_id | Id | | |
| partition | Partition | | |

AddPartitionResponse

Response to AddPartitionRequest. The sequence field matches the response to its request.

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| sequence | uint64 | | Comes from request |
| status | RequestStatus | | |
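The sequence field lets a client pair responses with requests even when they arrive out of order. The sketch below is one possible way to do that on the client side. The AddPartitionResponse class here is a hypothetical stand-in for the generated message, and its status is flattened to the numeric RequestStatus.Status code for brevity; neither is taken from the actual generated code.

```python
from dataclasses import dataclass


@dataclass
class AddPartitionResponse:
    """Hypothetical stand-in for the generated AddPartitionResponse message."""
    sequence: int
    status: int  # simplified: RequestStatus.Status code, 0 means STATUS_OK


def match_responses(sent_sequences, responses):
    """Map every sent sequence number to its response, or None if unanswered."""
    by_sequence = {response.sequence: response for response in responses}
    return {seq: by_sequence.get(seq) for seq in sent_sequences}


# Responses may arrive in any order; sequence 3 was never answered.
responses = [
    AddPartitionResponse(sequence=2, status=0),
    AddPartitionResponse(sequence=1, status=3),
]
matched = match_responses([1, 2, 3], responses)

# Collect the requests that failed or received no response at all.
failed = sorted(
    seq for seq, response in matched.items()
    if response is None or response.status != 0
)
```

Keying the lookup on sequence rather than on arrival order means a missing or reordered response only affects its own request.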

AlterDatabaseRequest

Alter database

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| catalog | string | | Catalog this database belongs to |
| id | Id | | Database ID. Database can be found by name or id |
| database | Database | | Database object |
| cookie | string | | Session cookie |

CreateDatabaseRequest

Create a new database.

If database.id.id is empty, a unique ID is assigned.

database.seq_id is assigned when the database is created.

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| catalog | string | | Catalog this database belongs to |
| database | Database | | Database object |
| cookie | string | | Session cookie |

CreateTableRequest

Create a new table.

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| catalog | string | | |
| db_id | Id | | |
| table | Table | | |
| cookie | string | | |

Database

Database is a container for tables.

The Database object has two sets of parameters:

- User parameters are intended for the user and are passed through transparently.

- System parameters are intended to be used by Hive for its internal purposes.

A Database has two IDs:

- Id.id is assigned during database creation and is a unique, stable ID. While the database name can change, the ID can't, so clients can cache a Database by ID.

- seq_id is assigned during database creation. It should be a unique ID within the catalog. The intention is to have an incrementing integer value for each new database. It is not guaranteed to be monotonic.

The original Metastore Database object also had owner information. This can be represented using system parameters if needed, since the current metastore service does not interpret owner info.

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| id | Id | | Unique database ID |
| seq_id | uint64 | | Unique sequence ID within catalog |
| location | string | | Default location of database objects |
| parameters | Database.ParametersEntry | repeated | Database user parameters |
| system_parameters | Database.SystemParametersEntry | repeated | System parameters (can't be set by user) |

Database.ParametersEntry

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| key | string | | |
| value | string | | |

Database.SystemParametersEntry

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| key | string | | |
| value | string | | |

DropDatabaseRequest

Request to drop a database.

Dropping a database also drops all objects contained in the database.

TODO: Add flag to prohibit dropping non-empty databases

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| catalog | string | | |
| id | Id | | |
| cookie | string | | |

DropPartitionsRequest

Delete partition.

A partition is described by a list of "values", one value per column in the partition schema.

There is no validation that the values actually match the partition schema.

TODO: have flag for specifying fields in the returned value

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| catalog | string | | |
| db_id | Id | | |
| table_id | Id | | |
| values | PartitionValues | repeated | |
| cookie | string | | |

DropTableRequest

Request to drop a table.

Dropping a table also drops all objects contained in the table

TODO: Add flag to prohibit dropping of non-empty table

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| catalog | string | | |
| db_id | Id | | |
| id | Id | | |
| cookie | string | | |

FieldSchema

FieldSchema defines name and type for each column.

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| name | string | | Name of the field |
| type | string | | Type of the field |
| comment | string | | User description |

GetDatabaseRequest

Request to get database by its ID.

The database can be located by either part of the ID. If id.id is specified, it is used first; otherwise id.name is used. One of these must be specified.
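The lookup rule above can be sketched as follows. This is a minimal illustration over an in-memory list of dictionaries, not the service implementation. The function name find_database is hypothetical, an empty string is treated as "not specified" (the proto3 default for string fields), and the sketch assumes there is no fallback to the name when an id.id lookup finds nothing, which the text does not state either way.

```python
def find_database(databases, id_id="", id_name=""):
    """Locate a database by id.id if given, otherwise by id.name.

    `databases` is a list of dicts shaped like {"id": {"name": ..., "id": ...}}.
    Returns the matching dict, or None if nothing matches.
    """
    if id_id:
        # id.id takes precedence whenever it is specified.
        return next((db for db in databases if db["id"]["id"] == id_id), None)
    if id_name:
        return next((db for db in databases if db["id"]["name"] == id_name), None)
    # Neither part of the Id was supplied.
    raise ValueError("either id.id or id.name must be specified")


databases = [
    {"id": {"name": "sales", "id": "db-1"}},
    {"id": {"name": "hr", "id": "db-2"}},
]

by_id = find_database(databases, id_id="db-2")
by_name = find_database(databases, id_name="sales")
# When both are given, id.id wins and the name is ignored.
both = find_database(databases, id_id="db-2", id_name="sales")
```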

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| catalog | string | | Catalog this database belongs to |
| id | Id | | Database ID. Database can be found by name or id |
| cookie | string | | Session cookie |

GetDatabaseResponse

Result of GetDatabase request

The result consists of the database information (which may be empty in case of failure) and request status.

TODO: specify error cases

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| database | Database | | |
| status | RequestStatus | | |

GetPartitionRequest

Get partition information.

A partition is described by a list of "values", one value per column in the partition schema.

There is no validation that the values actually match the partition schema.

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| catalog | string | | |
| db_id | Id | | |
| table_id | Id | | |
| values | string | repeated | |

GetPartitionResponse

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| partition | Partition | | |
| status | RequestStatus | | |

GetTableRequest

Request to get table by its ID.

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| catalog | string | | |
| db_id | Id | | Database ID |
| id | Id | | |
| cookie | string | | |

GetTableResponse

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| table | Table | | |
| status | RequestStatus | | |

Id

Objects have a unique name and a unique ID.

The name of an object can change, but its ID never changes. This allows caching of objects by ID.

Both name and ID are just sequences of bytes; there are no assumptions about encoding or length.

Implementations may enforce specific assumptions.

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| name | string | | Object name |
| id | string | | Permanent object ID |
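Because the ID is permanent while the name can change, a client cache is safest when keyed on Id.id. The sketch below illustrates the idea with plain dictionaries. The ObjectCache class and its methods are hypothetical and are not part of the Metastore API.

```python
class ObjectCache:
    """Client-side cache keyed by the permanent Id.id, never by name."""

    def __init__(self):
        self._by_id = {}

    def put(self, obj):
        # obj is a dict shaped like {"id": {"name": ..., "id": ...}, ...}
        self._by_id[obj["id"]["id"]] = obj

    def get(self, object_id):
        # Returns None on a cache miss.
        return self._by_id.get(object_id)


cache = ObjectCache()
cache.put({"id": {"name": "sales", "id": "db-1"}, "location": "/warehouse/sales"})

# A rename replaces the entry under the same key; the cache key stays valid.
cache.put({"id": {"name": "sales_v2", "id": "db-1"}, "location": "/warehouse/sales"})
current_name = cache.get("db-1")["id"]["name"]
```

A cache keyed by name would instead keep a stale entry under the old name after the rename.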

ListDatabasesRequest

Request to get the list of databases.

If exclude_params is set, the result may omit parameters.

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| catalog | string | | |
| cookie | string | | |
| name_pattern | string | | |
| exclude_params | bool | | |
| fields | string | repeated | Field selectors. If specified, only certain fields are sent. The following fields are supported: location, id, id.name, parameters |
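To make the effect of field selectors concrete, here is a small sketch that applies them to a Database represented as a nested dictionary. It is an illustration only: the select_fields function is hypothetical, and two behaviours are assumptions rather than documented facts, namely that an empty selector list returns every field and that an unsupported selector is rejected.

```python
import copy

# Selectors listed as supported for ListDatabasesRequest.
SUPPORTED_SELECTORS = {"location", "id", "id.name", "parameters"}


def select_fields(database, fields):
    """Return a copy of `database` restricted to the selected fields."""
    if not fields:
        # Assumption: no selectors means the full object is returned.
        return copy.deepcopy(database)
    unknown = set(fields) - SUPPORTED_SELECTORS
    if unknown:
        # Assumption: unsupported selectors are an error.
        raise ValueError(f"unsupported field selectors: {sorted(unknown)}")
    result = {}
    for selector in fields:
        *parents, leaf = selector.split(".")
        source, target = database, result
        for key in parents:
            # Walk down nested messages such as "id" in "id.name".
            source = source.get(key, {})
            target = target.setdefault(key, {})
        if leaf in source:
            target[leaf] = copy.deepcopy(source[leaf])
    return result


database = {
    "id": {"name": "sales", "id": "db-1"},
    "seq_id": 7,
    "location": "/warehouse/sales",
    "parameters": {"team": "etl"},
}
trimmed = select_fields(database, ["id.name", "location"])
```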

ListPartitionsRequest

Return all partitions in a table

Field selectors.

If specified, only certain fields are sent. The following fields are supported:

- location

- values

- parameters

- sd

- sd.parameters

- sd.serdeinfo

- sd.serdeinfo.parameters

- table

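| Field | Type | Label | Description |
| --- | --- | --- | --- |
| catalog | string | | |
| db_id | Id | | |
| table_id | Id | | |
| cookie | string | | |
| fields | string | repeated | |
| values | PartitionValues | repeated | |
| exclude | string | repeated | |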

ListTablesRequest

Request to get the list of tables.

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| catalog | string | | |
| db_id | Id | | |
| cookie | string | | |
| fields | string | repeated | Field selectors. If specified, only certain fields are sent. The following fields are supported: id (table Id), location (table location), parameters (table user parameters), partkeys (table partition keys) |

Order

Sort order of a column (column name along with asc/desc)

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| col | string | | Sort column name |
| ascending | bool | | asc(1) or desc(0) |
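As an illustration of what a list of Order entries expresses, the sketch below sorts rows by several columns with a per-column direction. It is not part of the Metastore, which only stores this metadata. The sort_rows function is hypothetical, and each Order is modelled as a (col, ascending) tuple.

```python
def sort_rows(rows, sort_cols):
    """Sort dict rows by a list of (col, ascending) pairs, first pair most significant."""
    ordered = list(rows)
    # Python's sort is stable, so sorting by the least significant
    # column first yields the combined multi-column order.
    for col, ascending in reversed(sort_cols):
        ordered.sort(key=lambda row: row[col], reverse=not ascending)
    return ordered


rows = [
    {"region": "eu", "day": 2},
    {"region": "us", "day": 1},
    {"region": "eu", "day": 1},
]
# region ascending, then day descending within each region
result = sort_rows(rows, [("region", True), ("day", False)])
```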

Partition

Partition

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| id | Id | | |
| seq_id | uint64 | | Sequential ID within table |
| values | string | repeated | Values for each partition |
| sd | StorageDescriptor | | Partition descriptor |
| parameters | Partition.ParametersEntry | repeated | User parameters |
| location | string | | Partition location |
| table | Table | | Enclosing table |

Partition.ParametersEntry

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| key | string | | |
| value | string | | |

PartitionValues

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| value | string | repeated | |

RequestStatus

General status for results.

All non-streaming requests should return RequestStatus.

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| status | RequestStatus.Status | | Request status |
| error | string | | Detailed error message |

SerDeInfo

Serialization/Deserialization information

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| type | SerdeType | | Serde type. If CUSTOM, use the name |
| name | string | | Name of the serde, table name by default |
| serializationLib | string | | Usually the class that implements the extractor & loader |
| parameters | SerDeInfo.ParametersEntry | repeated | NOTE: Should we enum this as well? initialization parameters |

SerDeInfo.ParametersEntry

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| key | string | | |
| value | string | | |

StorageDescriptor

StorageDescriptor holds all the information about physical storage of the data belonging to a table

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| cols | FieldSchema | repeated | Required (refer to types defined above) |
| inputFormat | InputFormat | | Specification of input format. If custom, use inputFormatName. |
| inputFormatName | string | | Name of input format if custom |
| outputFormat | OutputFormat | | Specification of output format. If custom, use outputFormatName. |
| outputFormatName | string | | Name of output format if custom |
| numBuckets | int32 | | This must be specified if there are any dimension columns |
| serdeInfo | SerDeInfo | | Serialization and deserialization information |
| bucketCols | string | repeated | Reducer grouping columns and clustering columns and bucketing columns |
| sortCols | Order | repeated | Sort order of the data in each bucket |
| parameters | StorageDescriptor.ParametersEntry | repeated | Any user-supplied key-value hash |
| system_parameters | StorageDescriptor.SystemParametersEntry | repeated | Internal parameters not settable by user |

StorageDescriptor.ParametersEntry

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| key | string | | |
| value | string | | |

StorageDescriptor.SystemParametersEntry

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| key | string | | |
| value | string | | |

Table

Table information

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| id | Id | | |
| seq_id | uint64 | | Sequential ID within database |
| sd | StorageDescriptor | | Storage descriptor of the table |
| partitionKeys | FieldSchema | repeated | Partition keys of the table. Only primitive types are supported |
| tableType | TableType | | Table type enum, e.g. EXTERNAL_TABLE |
| parameters | Table.ParametersEntry | repeated | User-settable parameters |
| system_parameters | Table.SystemParametersEntry | repeated | Internal parameters |
| location | string | | Table location |

Table.ParametersEntry

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| key | string | | |
| value | string | | |

Table.SystemParametersEntry

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| key | string | | |
| value | string | | |

InputFormat

Known Input Formats. CUSTOM means that it should be specified as a string.

| Name | Number | Description |
| --- | --- | --- |
| IF_CUSTOM | 0 | |
| IF_SEQUENCE | 1 | |
| IF_TEXT | 2 | |
| IF_HIVE | 3 | |
| IF_PARQUET | 4 | |

OutputFormat

Known Output Formats. CUSTOM means that it should be specified as a string.

Here is a list of known output formats:

- org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

- org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

- org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat

- org.apache.hadoop.hive.ql.io.HivePassThroughOutputFormat

- org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat

- org.apache.hadoop.hive.ql.io.HiveBinaryOutputFormat

- org.apache.hadoop.hive.ql.io.RCFileOutputFormat

- org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat

| Name | Number | Description |
| --- | --- | --- |
| OF_CUSTOM | 0 | |
| OF_SEQUENCE | 2 | |
| OF_IGNORE_KEY | 3 | |
| OF_HIVE | 4 | |
| OF_PARQUET | 5 | |

RequestStatus.Status

| Name | Number | Description |
| --- | --- | --- |
| STATUS_OK | 0 | Successful request |
| STATUS_ERROR | 1 | General error |
| STATUS_NOTFOUND | 2 | Requested object not found |
| STATUS_CONFLICT | 3 | Object already exists |
| STATUS_BUSY | 4 | Object is busy/used and can't be accessed/destroyed |
| STATUS_INTERNAL_ERR | 5 | Internal server error |
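A client will typically translate these status codes into its own error handling. The following sketch shows one way to do so. The Status enum mirrors the names and numbers listed above, while MetastoreError and check_status are hypothetical helpers and not part of the generated API.

```python
from enum import IntEnum


class Status(IntEnum):
    """Mirror of RequestStatus.Status as documented above."""
    STATUS_OK = 0
    STATUS_ERROR = 1
    STATUS_NOTFOUND = 2
    STATUS_CONFLICT = 3
    STATUS_BUSY = 4
    STATUS_INTERNAL_ERR = 5


class MetastoreError(Exception):
    """Hypothetical client-side exception carrying the status code and message."""

    def __init__(self, code, message):
        super().__init__(f"{code.name}: {message}")
        self.code = code
        self.message = message


def check_status(status, error=""):
    """Raise MetastoreError unless the status is STATUS_OK."""
    code = Status(status)
    if code is not Status.STATUS_OK:
        raise MetastoreError(code, error)


check_status(0)  # STATUS_OK: returns without raising

try:
    check_status(2, "database 'sales' does not exist")
except MetastoreError as exc:
    caught = exc
```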

SerdeType

Known SerDes are represented using enum. Unknown ones are represented using strings.

| Name | Number | Description |
| --- | --- | --- |
| SERDE_CUSTOM | 0 | |
| SERDE_LAZY_SIMPLE | 1 | |
| SERDE_AVRO | 2 | |
| SERDE_JSON | 3 | |
| SERDE_ORC | 4 | |
| SERDE_REGEX | 5 | |
| SERDE_THRIFT | 6 | |
| SERDE_PARQUET | 7 | "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe" |
| SERDE_CSV | 8 | |

SerializationLib

| Name | Number | Description |
| --- | --- | --- |
| SL_CUSTOM | 0 | Unknown lib, use string |
| SL_LAZY_SIMPLE | 1 | LazySimpleSerDe |
| SL_PARQUET | 2 | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe |

TableType

| Name | Number | Description |
| --- | --- | --- |
| TTYPE_MANAGED | 0 | |
| TTYPE_EXTERNAL | 1 | |
| TTYPE_INDEX | 2 | |

Metastore

Metastore service

This API is different from the traditional metastore API. It separates out all metadata-only operations and does not include any filesystem operations. The assumption is that some other service or the client deals with filesystem operations.

This API also uses cookies to associate requests with a session. The value of the cookie is likely to be printed in logs, so it shouldn't contain any sensitive information. The Metastore service does not interpret the cookie but may print it in its logs. We could call it SessionId, but callers may decide to use it for other purposes, so a generic term is used here.

| Method Name | Request Type | Response Type | Description |
| --- | --- | --- | --- |
| CreateDabatase | CreateDatabaseRequest | GetDatabaseResponse | Create a new database. |
| GetDatabase | GetDatabaseRequest | GetDatabaseResponse | Get database information |
| ListDatabases | ListDatabasesRequest | Database | Return all databases in a catalog |
| DropDatabase | DropDatabaseRequest | RequestStatus | Destroy the database |
| AlterDatabase | AlterDatabaseRequest | GetDatabaseResponse | Alter database |
| CreateTable | CreateTableRequest | GetTableResponse | Create a new table |
| GetTable | GetTableRequest | GetTableResponse | Get table information |
| ListTables | ListTablesRequest | Table | Get all tables from a database |
| DropTable | DropTableRequest | RequestStatus | Destroy a table |
| AddPartition | AddPartitionRequest | AddPartitionResponse | Add partition to a table |
| AddManyPartitions | AddPartitionRequest | AddPartitionResponse | Add multiple partitions. The first request contains DB and table info, followed by others, for which db and table info is not needed |
| GetPartition | GetPartitionRequest | GetPartitionResponse | Get partition information |
| ListPartitions | ListPartitionsRequest | Partition | List all partitions in a table |
| DropPartitions | DropPartitionsRequest | RequestStatus | Drop partition |

Scalar Value Types

| .proto Type | Notes | C++ Type | Java Type | Python Type |
| --- | --- | --- | --- | --- |
| double | | double | double | float |
| float | | float | float | float |
| int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int |
| int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long |
| uint32 | Uses variable-length encoding. | uint32 | int | int/long |
| uint64 | Uses variable-length encoding. | uint64 | long | int/long |
| sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int |
| sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long |
| fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int |
| fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long |
| sfixed32 | Always four bytes. | int32 | int | int |
| sfixed64 | Always eight bytes. | int64 | long | int/long |
| bool | | bool | boolean | boolean |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode |
| bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str |