File Source
The Vector file source ingests data through one or more local files and outputs log events.
Configuration
```toml
[sources.my_source_id]
  # REQUIRED - General
  type = "file" # example, must be: "file"
  include = ["/var/log/nginx/*.log"] # example

  # OPTIONAL - General
  ignore_older = 86400 # example, no default, seconds
  start_at_beginning = false # default

  # OPTIONAL - Priority
  oldest_first = false # default
```
Options
data_dir
The directory used to persist file checkpoint positions. By default, the global [data_dir](#data_dir) option is used. Make sure the Vector process has write permissions to this directory. See Checkpointing for more info.
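For example, checkpoints for this source could be kept in a dedicated directory (the path below is illustrative):

```toml
[sources.my_source_id]
  type = "file"
  include = ["/var/log/nginx/*.log"]
  data_dir = "/var/lib/vector" # must be writable by the Vector process
```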
exclude
Array of file patterns to exclude. Globbing is supported. Takes precedence over the include option.
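As a sketch with hypothetical paths, the following includes all nginx logs except the error log; exclude wins when both patterns match a file:

```toml
[sources.my_source_id]
  type = "file"
  include = ["/var/log/nginx/*.log"]
  exclude = ["/var/log/nginx/error.log"] # takes precedence over include
```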
file_key
The key name added to each event with the full path of the file. The default is "file". See Context for more info.
fingerprinting
Configuration for how the file source should identify files.
fingerprint_bytes
The number of bytes read off the head of the file to generate a unique fingerprint. See File Identification for more info.
ignored_header_bytes
The number of bytes to skip ahead (or ignore) when generating a unique fingerprint. This is helpful if all files share a common header. See File Identification for more info.
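For example, to skip a shared 128-byte header and fingerprint the bytes that follow it (the values here are illustrative):

```toml
[sources.my_source_id.fingerprinting]
  strategy = "checksum"
  fingerprint_bytes = 256    # bytes read to compute the fingerprint
  ignored_header_bytes = 128 # skip a header shared by all files
```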
strategy
The strategy used to uniquely identify files, either "checksum" or "device_and_inode". The default is "checksum". This is important for checkpointing when file rotation is used.
glob_minimum_cooldown
Delay between file discovery calls, in milliseconds. This controls the interval at which Vector searches for files. The default is 1000 ms. See Auto Discovery and Globbing for more info.
host_key
The key name added to each event representing the current host. The default is "host". See Context for more info.
ignore_older
Ignore files whose last modification is older than this age, specified in seconds.
include
Array of file patterns to include. Globbing is supported. See File Read Order and File Rotation for more info.
max_line_bytes
The maximum number of bytes a line can contain before being discarded. This protects against malformed lines or tailing incorrect files. The default is 102400 bytes.
max_read_bytes
An approximate limit on the amount of data read from a single file at a given time. The default is 2048 bytes.
message_start_indicator
When present, Vector will aggregate multiple lines into a single event, using this pattern as the indicator that the previous lines should be flushed and a new event started. The pattern will be matched against entire lines as a regular expression, so remember to anchor as appropriate.
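For example, lines beginning with a date could mark the start of a new event, so that continuation lines (such as stack traces) are folded into the preceding event. The pattern and paths below are a sketch; adjust them to your log format:

```toml
[sources.my_source_id]
  type = "file"
  include = ["/var/log/app/*.log"]
  # Anchored pattern: a line starting with YYYY-MM-DD begins a new event;
  # lines that do not match are appended to the previous event.
  message_start_indicator = '^\d{4}-\d{2}-\d{2}'
```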
multi_line_timeout
When message_start_indicator is present, this sets the amount of time, in milliseconds, that Vector will buffer lines into a single event before flushing, regardless of whether or not it has seen a line indicating the start of a new message. The default is 1000 ms.
oldest_first
Instead of balancing read capacity fairly across all watched files, prioritize draining the oldest files before moving on to read data from younger files. See File Read Order for more info.
start_at_beginning
When true Vector will read from the beginning of new files, when false Vector will only read new data added to the file. See Read Position for more info.
Output
The file source outputs log events. For example:
```json
{
  "file": "/var/log/nginx.log",
  "host": "my.host.com",
  "message": "Started GET / for 127.0.0.1 at 2012-03-10 14:28:14 +0100",
  "timestamp": "2019-11-01T21:15:47+00:00"
}
```
More detail on the output schema is below.
file
The full path of the file that the log originated from. See Checkpointing and Context for more info.
host
The current hostname, equivalent to the gethostname command.
message
The raw log message, unaltered.
timestamp
The exact time the event was ingested.
How It Works
Auto Discovery
Vector will continually look for new files matching any of your include
patterns. The frequency is controlled via the glob_minimum_cooldown option.
If a new file is added that matches any of the supplied patterns, Vector will
begin tailing it. Vector maintains a unique list of files and will not tail a
file more than once, even if it matches multiple patterns. You can read more
about how we identify files in the File Identification
section.
Checkpointing
Vector checkpoints the current read position in the file after each successful
read. This ensures that Vector resumes where it left off if restarted,
preventing data from being read twice. The checkpoint positions are stored in
the data directory which is specified via the
global [data_dir](#data_dir) option but can be
overridden via the data_dir option in the file source directly.
Compressed Files
Vector will transparently detect files which have been compressed using gzip
and decompress them for reading. This detection process looks for the unique
sequence of bytes in the gzip header and does not rely on the compressed files
adhering to any kind of naming convention.
One caveat with reading compressed files is that Vector is not able to efficiently seek into them. Rather than implement a potentially-expensive full scan as a seek mechanism, Vector currently will not attempt to make further reads from a file for which it has already stored a checkpoint in a previous run. For this reason, users should take care to allow Vector to fully process any compressed files before shutting the process down or moving the files to another location on disk.
Context
By default, the file source will add context
keys to your events via the file_key and host_key
options.
Environment Variables
Environment variables are supported through all of Vector's configuration.
Simply add ${MY_ENV_VAR} in your Vector configuration file and the variable
will be replaced before being evaluated.
You can learn more in the Environment Variables section.
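For example, the include path could be parameterized with a hypothetical LOG_DIR variable:

```toml
[sources.my_source_id]
  type = "file"
  include = ["${LOG_DIR}/*.log"] # LOG_DIR is substituted before evaluation
```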
File Deletion
When a watched file is deleted, Vector will maintain its open file handle and
continue reading until it reaches EOF. When a file is no longer matched by the
include glob and the reader has reached EOF, that file's reader is discarded.
File Identification
By default, Vector identifies files by creating a cyclic redundancy check
(CRC) on the first 256 bytes of the file. This serves as a
fingerprint to uniquely identify the file. The amount of bytes read can be
controlled via the fingerprint_bytes and ignored_header_bytes options.
This strategy avoids the common pitfalls of identifying files by device and inode numbers, since inodes can be reused across files. This enables Vector to properly tail files across various rotation strategies.
File Read Order
By default, Vector attempts to allocate its read bandwidth fairly across all of the files it's currently watching. This prevents a single very busy file from starving other independent files from being read. In certain situations, however, this can lead to interleaved reads from files that should be read one after the other.
For example, consider a service that logs to a timestamped file, creating a new one at an interval and leaving the old one as-is. Under normal operation, Vector would follow writes as they happen to each file and there would be no interleaving. In an overload situation, however, Vector may pick up and begin tailing newer files before catching up to the latest writes from older files. This would cause writes from a single logical log stream to be interleaved in time and potentially slow down ingestion as a whole, since the fixed total read bandwidth is allocated across an increasing number of files.
To address this type of situation, Vector provides the oldest_first flag. When
set, Vector will not read from any file younger than the oldest file that it
hasn't yet caught up to. In other words, Vector will continue reading from older
files as long as there is more data to read. Only once it hits the end will it
then move on to read from younger files.
Whether or not to use the oldest_first flag depends on the organization of the
logs you're configuring Vector to tail. If your include glob contains multiple
independent logical log streams (e.g. nginx's access.log and error.log, or
logs from multiple services), you are likely better off with the default
behavior. If you're dealing with a single logical log stream or if you value
per-stream ordering over fairness across streams, consider setting
oldest_first to true.
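For a single logical stream of timestamped files, the option could be enabled like so (the paths are illustrative):

```toml
[sources.my_source_id]
  type = "file"
  include = ["/var/log/app/app-*.log"]
  oldest_first = true # drain older files fully before reading younger ones
```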
File Rotation
Vector supports tailing across a number of file rotation strategies. The default
behavior of logrotate is simply to move the old log file and create a new one.
This requires no special configuration of Vector, as it will maintain its open
file handle to the rotated log until it has finished reading and it will find
the newly created file normally.
A popular alternative strategy is copytruncate, in which logrotate will copy
the old log file to a new location before truncating the original. Vector will
also handle this well out of the box, but there are a couple of configuration options
that will help reduce the very small chance of missed data in some edge cases.
We recommend a combination of delaycompress (if applicable) on the logrotate
side and including the first rotated file in Vector's include option. This
allows Vector to find the file after rotation, read it uncompressed to identify
it, and then ensure it has all of the data, including any written in a gap
between Vector's last read and the actual rotation event.
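As a sketch, assuming an app.log rotated with copytruncate, the include option could cover the first rotated file as recommended above:

```toml
[sources.my_source_id]
  type = "file"
  # Covers both the live file and the first rotated copy, so data written
  # in the gap around the rotation event is not missed.
  include = ["/var/log/app.log", "/var/log/app.log.1"]
```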
Globbing
Globbing is supported in all provided file paths. Files will be
discovered continually at a rate defined by the
glob_minimum_cooldown option.
Line Delimiters
Each line is read until a new line delimiter (the 0xA byte) or EOF is found.
Read Position
By default, Vector will read new data only for newly discovered files, similar
to the tail command. You can read from the beginning of the file by setting
the start_at_beginning option to true.
Previously discovered files will be checkpointed, and the read position will resume from the last checkpoint.
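For example, to read newly discovered files from the beginning rather than only tailing new data:

```toml
[sources.my_source_id]
  type = "file"
  include = ["/var/log/nginx/*.log"]
  start_at_beginning = true # read new files from the start
```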