Semantic (e.g. numerical, categorical) of an input feature.
Inherits From: Enum
Determines how a feature is interpreted by the model.
Similar to the "column type" of Yggdrasil Decision Forest.
Attributes |
NUMERICAL
|
Numerical value. Generally for quantities or counts with full
ordering. For example, the age of a person, or the number of items in a
bag. Can be a float or an integer. Missing values are represented by
math.nan or with an empty sparse tensor. If a numerical tensor contains
multiple values, its size should be constant, and each dimension is
treated independently (and each dimension should always have the same
"meaning").
|
CATEGORICAL
|
A categorical value. Generally for a type/class in a finite set
of possible values without ordering. For example, the color RED in the set
{RED, BLUE, GREEN}. Can be a string or an integer. Missing values are
represented by "" (empty string), the value -2, or an empty sparse tensor.
An out-of-vocabulary value (i.e. a value that was never seen in training)
is represented by any new string value or the value -1. If a numerical
tensor contains multiple values, its size should be constant, and each
value is treated independently (each value in the tensor should always
have the same meaning). Integer categorical values: (1) The training logic
and model representation are optimized with the assumption that values are
dense. (2) Internally, the value is stored as int32. The values should be
<~2B. (3) The number of possible values is computed automatically from the
training dataset. During inference, integer values greater than any value
seen during training will be treated as out-of-vocabulary. (4) Minimum
frequency and maximum vocabulary size constraints don't apply.
|
HASH
|
The hash of a string value. Used when only the equality between values
is important (not the value itself). Currently used only for groups in
ranking problems, e.g. the query in a query/document problem. The hash
is computed with Google's FarmHash and stored as a uint64.
|
CATEGORICAL_SET
|
Set of categorical values. Well suited to representing tokenized
text. Can be a string or an integer in a sparse tensor or a ragged tensor
(recommended). Unlike CATEGORICAL, the number of items in a
CATEGORICAL_SET can change and the order/index of each item doesn't
matter.
|
BOOLEAN
|
Boolean value. WARNING: Boolean values are not yet supported for
training. Can be a float or an integer. Missing values are represented by
math.nan or with an empty sparse tensor. If a numerical tensor contains
multiple values, its size should be constant, and each dimension is
treated independently (and each dimension should always have the same
"meaning").
|
DISCRETIZED_NUMERICAL
|
Numerical values automatically discretized into bins.
Discretized numerical features are faster to train than (non-discretized)
numerical features. If the number of unique values of these features is
lower than the number of bins, the discretization is lossless from the
point of view of the model. If the number of unique values of these
features is greater than the number of bins, the discretization is lossy
from the point of view of the model. Lossy discretization can reduce and
sometimes increase (due to regularization) the quality of the model.
|
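Two of the rules in the table above can be illustrated with a small, library-free sketch. It is not the library's implementation. The function names (`classify_int_categorical`, `fit_boundaries`, `discretize`) are hypothetical, the equal-frequency binning is an assumption (the table does not say how bin boundaries are chosen), and the handling of integers below -2 is a choice made here because the table does not cover them.

```python
import bisect

MISSING = -2  # integer code for a missing CATEGORICAL value (from the table)
OOV = -1      # integer code for an out-of-vocabulary value (from the table)


def classify_int_categorical(value, max_seen_in_training):
    """Classify an integer CATEGORICAL value at inference time (sketch)."""
    if value == MISSING:
        return "missing"
    # Values greater than any value seen during training count as
    # out-of-vocabulary, as does the explicit OOV code.
    if value == OOV or value > max_seen_in_training:
        return "out-of-vocabulary"
    if value < 0:
        # Assumption: the table does not define other negative values.
        raise ValueError(f"unsupported categorical value: {value}")
    return "in-vocabulary"


def fit_boundaries(values, num_bins):
    """Return sorted bin boundaries for DISCRETIZED_NUMERICAL-style binning.

    Assumption: equal-frequency boundaries when there are more unique
    values than bins; one bin per unique value otherwise.
    """
    unique = sorted(set(values))
    if len(unique) <= num_bins:
        # Midpoints between consecutive unique values: every unique value
        # lands in its own bin, so no information is lost.
        return [(a + b) / 2 for a, b in zip(unique, unique[1:])]
    ordered = sorted(values)
    cuts = [ordered[i * len(ordered) // num_bins] for i in range(1, num_bins)]
    return sorted(set(cuts))


def discretize(value, boundaries):
    """Map a numerical value to the index of its bin."""
    return bisect.bisect_right(boundaries, value)


# Fewer unique values (3) than bins (4): distinct values keep distinct bins.
few = [1.0, 2.0, 2.0, 5.0]
few_bounds = fit_boundaries(few, num_bins=4)
few_bins = {v: discretize(v, few_bounds) for v in few}

# More unique values (10) than bins (3): several values share a bin.
many = [float(i) for i in range(10)]
many_bounds = fit_boundaries(many, num_bins=3)
many_bins = {v: discretize(v, many_bounds) for v in many}
```

In the first case the three unique values map to three different bins, which is the lossless situation described above. In the second case ten unique values collapse into three bins, which is the lossy situation.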
Class Variables |
BOOLEAN
|
<Semantic.BOOLEAN: 5>
|
CATEGORICAL
|
<Semantic.CATEGORICAL: 2>
|
CATEGORICAL_SET
|
<Semantic.CATEGORICAL_SET: 4>
|
DISCRETIZED_NUMERICAL
|
<Semantic.DISCRETIZED_NUMERICAL: 6>
|
HASH
|
<Semantic.HASH: 3>
|
NUMERICAL
|
<Semantic.NUMERICAL: 1>
|
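The class variables above follow the standard Python `Enum` pattern. The sketch below uses a local stand-in class (named `Semantic` only to match the reprs listed above, and not imported from the library) to show how such members can be looked up by name or by value.

```python
import enum


class Semantic(enum.Enum):
    """Local stand-in mirroring the member names and values listed above."""
    NUMERICAL = 1
    CATEGORICAL = 2
    HASH = 3
    CATEGORICAL_SET = 4
    BOOLEAN = 5
    DISCRETIZED_NUMERICAL = 6


# Lookup by value and by name both return the same singleton member.
by_value = Semantic(3)
by_name = Semantic["HASH"]

# Members expose their name and integer value.
names_by_value = {member.value: member.name for member in Semantic}
```

Because enum members are singletons, comparisons such as `by_value is Semantic.HASH` are safe, which is convenient when branching on a feature's semantic.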