Overview

Loki is the main server; it stores logs and handles queries.
Promtail is the agent; it collects logs and ships them to Loki.
Grafana provides the UI.

loki

Download

curl -O -L "https://github.com/grafana/loki/releases/download/v2.3.0/loki-linux-amd64.zip"

Install

mkdir -p /home/gather/data/loki/{chunks,index}
unzip loki-linux-amd64.zip
chmod a+x loki-linux-amd64
mv loki-linux-amd64 /home/gather/data/loki/

Download the config file

wget https://raw.githubusercontent.com/grafana/loki/master/cmd/loki/loki-local-config.yaml

Loki configuration

auth_enabled: false

server:
  http_listen_port: 3100 # HTTP listen port

ingester:
  lifecycler:
    address: 127.0.0.1 # lifecycler address
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 1h       # Any chunk not receiving new logs in this time will be flushed
  max_chunk_age: 1h           # All chunks will be flushed when they hit this age, default is 1h
  chunk_target_size: 1048576  # Loki will attempt to build chunks up to this size (1 MB here), flushing first if chunk_idle_period or max_chunk_age is reached
  chunk_retain_period: 30s    # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m)
  max_transfer_retries: 0     # Chunk transfers disabled

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper: 
    active_index_directory: /home/gather/data/loki/boltdb-shipper-active
    cache_location: /home/gather/data/loki/boltdb-shipper-cache
    cache_ttl: 24h         # Can be increased for faster performance over longer query periods, uses more disk space
    shared_store: filesystem
  filesystem:
    directory: /home/gather/data/loki/chunks

compactor:
  working_directory: /home/gather/data/loki/boltdb-shipper-compactor
  shared_store: filesystem

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s

ruler:
  storage:
    type: local
    local:
      directory: /home/gather/data/loki/rules
  rule_path: /home/gather/data/loki/rules-temp
  alertmanager_url: http://localhost:9093 # Alertmanager address
  ring:
    kvstore:
      store: inmemory
  enable_api: true
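In the schema above, index.period: 24h makes boltdb-shipper roll the index over daily. As a rough illustration of the naming this produces (my assumption, based on Loki's periodic-table scheme where the table name is the prefix plus the number of periods since the Unix epoch):

```python
import datetime

# Sketch (assumption): with prefix "index_" and a 24h period, the active
# index table is prefix + floor(unix_seconds / period_seconds).
def index_table_name(ts, prefix="index_", period_hours=24):
    period_secs = period_hours * 3600
    return f"{prefix}{int(ts.timestamp()) // period_secs}"

# The schema_config "from" date above, 2020-10-24, falls into table index_18559.
start = datetime.datetime(2020, 10, 24, tzinfo=datetime.timezone.utc)
print(index_table_name(start))  # → index_18559
```

This is why a 24h period with prefix index_ yields one new index table per day under active_index_directory.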

Start Loki

cd /home/gather/data/loki

# Start Loki
nohup ./loki-linux-amd64 -config.file=loki-local-config.yaml  > loki.log 2>&1 &

# Start with debug logging
nohup ./loki-linux-amd64 --log.level=debug -config.file=./loki-local-config.yaml > /opt/logs/loki-3100.log 2>&1 &

# Check that startup succeeded (is anything listening on port 3100?)
netstat -tunlp | grep 3100

# Find the process by name (output like the line below means Loki is running)
ps -ef | grep loki-linux-amd64
root     11037 22022  0 15:44 pts/0    00:00:55 ./loki-linux-amd64 -config.file=loki-local-config.yaml

Create a systemd service for Loki

Create the service file
vim /usr/lib/systemd/system/loki.service
Add the following configuration
[Unit]
Description=loki
Documentation=https://github.com/grafana/loki/tree/master
After=network.target

[Service]
Type=simple
User=root
# Adjust the paths below to your installation. systemd does not perform shell
# redirection in ExecStart; stdout/stderr go to the journal (journalctl -u loki).
ExecStart=/usr/local/src/loki-linux-amd64 -config.file=/usr/local/src/loki-local-config.yaml
Restart=on-failure

[Install]
WantedBy=multi-user.target
Register the service
# Reload systemd units
systemctl daemon-reload

# Start the service
systemctl start loki

# Check service status
systemctl status loki

# Enable at boot
systemctl enable loki

promtail

Download

curl -O -L "https://github.com/grafana/loki/releases/download/v2.3.0/promtail-linux-amd64.zip"

Install

mkdir -p /home/gather/data/promtail
unzip promtail-linux-amd64.zip
chmod a+x promtail-linux-amd64
mv promtail-linux-amd64 /home/gather/data/promtail/

Download the promtail config file

wget https://raw.githubusercontent.com/grafana/loki/master/cmd/promtail/promtail-local-config.yaml

Edit the promtail config file; the following is an example configuration

# promtail-local-config.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

# Where promtail records how far it has read each file
positions:
  filename: /tmp/positions.yaml

# Loki server push endpoint
clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: jm-admin
    pipeline_stages:
      - match:
          selector: '{job="jm-admin"}'
          stages:
            - regex:
                expression: '^(?P<time>[\d\s-:,]*)(?P<level>[a-zA-Z]+)\s(?P<pid>[\d]+)\s(?P<content>.*)$'
            - labels:
                level:
                content:
                pid:
                time:
    static_configs:
      - targets:
          - localhost
        labels:
          job: jm-admin
          host: localhost
          __path__: /mnt/data/jm/jm-admin/logs/project.artifactId_IS_UNDEFINED/debug.log
  - job_name: tcc-common
    static_configs:
      - targets:
          - localhost
        labels:
          job: tcc-common
          host: localhost
          __path__: /mnt/data/tcc/common/log/*.log
  - job_name: tcc-admin
    pipeline_stages:
      - match:
          selector: '{job="tcc-admin"}'
          stages:
            - json:
                expressions:
                  timej: time
                  pidj: pid
                  levelj: level
            - labels:
                levelj:
                pidj:
                timej:
    static_configs:
      - targets:
          - localhost
        labels:
          job: tcc-admin
          host: localhost
          __path__: /mnt/data/tcc/admin/log*.log
  - job_name: sqhgy-admin
    static_configs:
      - targets:
          - localhost
        labels:
          job: sqhgy-admin
          host: localhost
          __path__: /mnt/data/sq/sqServer/admin/log/*.log
  - job_name: sqhgy-api
    static_configs:
      - targets:
          - localhost
        labels:
          job: sqhgy-api
          host: localhost
          __path__: /mnt/data/sq/sqServer/api/log/*.log
  - job_name: sqhgy-common
    static_configs:
      - targets:
          - <ip-address> # replace with the actual host IP
        labels:
          job: sqhgy-common
          host: <ip-address>
          __path__: /mnt/data/sq/sqServer/common/log/*.log
  - job_name: createDataServer
    static_configs:
      - targets:
          - 127.0.0.1
        labels:
          job: createDataServer
          host: 127.0.0.1
          __path__: /home/gather/data/createDataServer/log/*/*.log
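The regex pipeline stage in the jm-admin job above splits each line into time, level, pid, and content groups. It can help to dry-run such an expression against a sample line before deploying; a minimal Python sketch (the sample log line is made up for illustration, and the hyphen is moved to the end of the character class because Python's re module requires a literal - there):

```python
import re

# Equivalent of the promtail regex stage above, adapted for Python's re module
# (hyphen moved to the end of the character class so it is parsed literally).
LOG_RE = re.compile(
    r'^(?P<time>[\d\s:,-]*)(?P<level>[a-zA-Z]+)\s(?P<pid>\d+)\s(?P<content>.*)$'
)

# Hypothetical sample line in the format the jm-admin debug.log produces.
line = "2021-09-26 17:30:00,123 INFO 11037 Started AdminApplication in 8.2 seconds"

m = LOG_RE.match(line)
print(m.group("level"), m.group("pid"))  # → INFO 11037
```

If the match comes back None for your real log lines, adjust the expression before wiring it into pipeline_stages, since unmatched lines simply get no extracted labels.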

Start promtail (adjust the paths to your environment)

nohup ./promtail-linux-amd64 -config.file=promtail-local-config.yaml > /home/gather/data/promtail/logs/promtail-9080.log 2>&1 &

Alternative: shipping logs directly from the application

Add the dependency
<dependency>
  <groupId>cn.allbs</groupId>
  <artifactId>allbs-logback</artifactId>
  <version>1.1.5</version>
</dependency>
Add the configuration to the application yml
allbs:
  logging:
    console:
      close-after-start: true
    files:
      enabled: true
    loki:
      enabled: true
      http-url: http://${LOKI_HOST}:3100/loki/api/v1/push
      metrics-enabled: true

Create a systemd service for promtail

Create the service file
vim /usr/lib/systemd/system/promtail.service
Add the following configuration
[Unit]
Description=promtail
Documentation=https://github.com/grafana/loki/tree/master
After=network.target

[Service]
Type=simple
User=root
# Adjust the paths below to your installation. systemd does not perform shell
# redirection in ExecStart; stdout/stderr go to the journal (journalctl -u promtail).
ExecStart=/usr/local/src/promtail-linux-amd64 -config.file=/usr/local/src/promtail-local-config.yaml
Restart=on-failure

[Install]
WantedBy=multi-user.target
Register the service
# Reload systemd units
systemctl daemon-reload

# Start the service
systemctl start promtail

# Check service status
systemctl status promtail

# Enable at boot
systemctl enable promtail

grafana

Download

wget https://dl.grafana.com/oss/release/grafana-8.1.2-1.x86_64.rpm

# or, after downloading, install directly with yum

sudo yum install grafana-8.1.2-1.x86_64.rpm

Install

rpm -ivh grafana-8.1.2-1.x86_64.rpm

Start and manage the service

# Reload systemd units
systemctl daemon-reload
# Start
systemctl start grafana-server
# Enable at boot
systemctl enable grafana-server
# Check status
systemctl status grafana-server

# Alternatively, with the service command
service grafana-server start

Verify

API check

curl "http://127.0.0.1:3100/api/prom/label"  # legacy endpoint
curl localhost:3100/loki/api/v1/labels       # v1 endpoint
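Beyond listing labels, ingestion can be checked end-to-end by pushing a test entry to the same endpoint promtail uses. A minimal sketch of the JSON body that POST /loki/api/v1/push expects (the job label and message here are made up for illustration; sending assumes Loki is reachable on localhost:3100):

```python
import json
import time

def build_push_payload(line, labels):
    """Build the JSON body for POST /loki/api/v1/push.

    Loki expects streams of label sets, each with a list of
    [<unix nanoseconds as a string>, <log line>] value pairs.
    """
    ts_ns = str(time.time_ns())
    return json.dumps({
        "streams": [
            {"stream": labels, "values": [[ts_ns, line]]}
        ]
    })

payload = build_push_payload("hello from push test", {"job": "manual-test"})
# To actually send it (assumes Loki on localhost:3100):
#   curl -X POST -H "Content-Type: application/json" \
#        -d "$PAYLOAD" http://localhost:3100/loki/api/v1/push
```

After a successful push, {job="manual-test"} should appear in the labels query above.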

Usage

Notes

Open ip:port in a browser (Grafana listens on port 3000 by default).
Default credentials: admin / admin.
You will be asked to change the password on first login.


Filtering for specific content


Syntax
- |= : log line contains the string.
- != : log line does not contain the string.
- |~ : log line matches the regular expression.
- !~ : log line does not match the regular expression.
Count all log lines over the last five minutes:
count_over_time({job="jm-admin"}[5m])


Per-second rate of all errors that are not timeouts over the last ten seconds:
rate({job="jm-admin"} |= "error" != "timeout" [10s])
Aggregation operators

Like PromQL, LogQL supports a subset of the built-in aggregation operators, which can be used to aggregate the elements of a single vector into a new vector with fewer elements but aggregated values.

Operator  Description
sum       sum over labels
min       minimum over labels
max       maximum over labels
avg       average over labels
stddev    population standard deviation over labels
stdvar    population standard variance over labels
count     count of the elements in the vector
bottomk   smallest k elements by sample value
topk      largest k elements by sample value

Top ten applications by log throughput, grouped by container:
topk(10,sum(rate({job="fluent-bit"}[5m])) by(container))
Log line count over the last five minutes, grouped by level:
sum(count_over_time({job="fluent-bit"}[5m])) by (level)
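The two range functions used above can be pictured as simple window computations over log timestamps. A rough Python sketch of their semantics, illustrative only (Loki evaluates these server-side over chunks, not like this):

```python
def count_over_time(timestamps, window, now):
    """Number of log lines whose timestamp falls in (now - window, now]."""
    return sum(1 for t in timestamps if now - window < t <= now)

def rate(timestamps, window, now):
    """Per-second rate of log lines over the window."""
    return count_over_time(timestamps, window, now) / window

# Six log lines (made-up timestamps, in seconds) inside a 300s (5m) window:
ts = [10, 50, 120, 180, 250, 299]
print(count_over_time(ts, 300, 300))  # → 6
print(rate(ts, 300, 300))             # → 0.02
```

This is why rate is just count_over_time divided by the window length: 6 lines over 5 minutes is 0.02 lines per second.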