-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Drive IO stats #891
Drive IO stats #891
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you fix the lint errors? @r-scheele
pkg/metrics/collector.go
Outdated
return &stats, nil | ||
} | ||
|
||
func getSectorSize(device string) int64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let this function return an (error, int64)
func (c *metricsCollector) publishDriveStats(ctx context.Context, drive *types.Drive, ch chan<- prometheus.Metric) { | ||
device, err := c.getDeviceByFSUUID(drive.Status.FSUUID) | ||
if err != nil { | ||
klog.ErrorS( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the device is not found by its FSUUID, this means the drive is not READY. we should also export this information as a metric.
pkg/metrics/collector.go
Outdated
// Drive is offline | ||
ch <- prometheus.MustNewConstMetric( | ||
prometheus.NewDesc( | ||
prometheus.BuildFQName(consts.AppName, "stats", "drive_status"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we have _drive_ready
instead of _drive_status
?
if these directories are not present, then it means the drive is "not ok" (or) "not ready". we can have a label name accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add respective functional tests
@@ -11,3 +11,4 @@ dist/ | |||
kustomize | |||
operator-sdk | |||
kubectl-directpv_* | |||
directpv.tar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As none of our build tool creates this file, could you explain why do we need to ignore this file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was added to the .gitignore
because it is generated during testing.
https://github.com/minio/directpv/blob/master/docs/developer.md the image tar ball was created by podman
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a new line
pkg/metrics/collector.go
Outdated
ReadTicks int64 | ||
WriteSectors int64 | ||
WriteTicks int64 | ||
TimeInQueue int64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Add a function in
pkg/device/sysfs_linux.go
like below
func GetStat(name string) (stats []uint64, err error) {
line, err := readFirstLine("/sys/class/block/" + name + "/stat")
if err != nil {
return nil, err
}
for _, token := range strings.Split(line, " ") {
token = strings.TrimSpace(token)
ui64, err := strconv.ParseUint(token, 10, 64)
if err != nil {
return nil, err
}
stats = append(stats, ui64)
}
return stats, nil
}
and use it here like
func getDriveStats(driveName string) (*driveStats, error) {
stats, err := device.GetStat(driveName)
if err != nil {
return nil, err
}
if len(stats) == 0 {
// The drive is lost
return nil, nil
}
if len(stats) < 10 {
return nil, fmt.Errorf("unknown stats from sysfs for drive %v", driveName)
}
// Refer https://www.kernel.org/doc/Documentation/block/stat.txt
// for meaning of each field.
return &driveStats{
ReadSectors: stats[2],
ReadTicks: stats[3],
WriteSectors: stats[6],
WriteTicks: stats[7],
TimeInQueue: stats[9],
}, nil
}
- unexport/make private all the fields in
driveStats
struct
pkg/metrics/collector.go
Outdated
"Drive Online/Offline Status", | ||
[]string{"drive"}, nil), | ||
prometheus.GaugeValue, | ||
0, // 0 for offline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean boolean value here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I get you correctly, I'm using 0
to show offline and 1
to show online. As it seems more Boolean-like. Prometheus does not support a native Boolean data type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the other way to do to indicate boolean state? If none, default constants for drive online/offline and user it here.
pkg/metrics/collector.go
Outdated
|
||
return | ||
} | ||
device = strings.TrimPrefix(device, "/dev") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pkg/metrics/collector.go
Outdated
} | ||
device = strings.TrimPrefix(device, "/dev") | ||
// Online/Offline Status | ||
if _, err := os.Stat("/sys/block" + device); os.IsNotExist(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- As mentioned previously, this won't work for disk partition used as DirectPVDrive. You have to use
/sys/class/block
instead of/sys/block
for both disk and partition - Use
path.Join()
instead of+
sysfs
is unreliable to check drive existent on the system i.e. in some systemssysfs
entry is present but actual drive is gone from the system
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^^ @r-scheele
pkg/metrics/collector.go
Outdated
) | ||
} | ||
|
||
filePath := "/sys/block" + device + "/stat" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use path.Join()
pkg/metrics/collector.go
Outdated
|
||
func getSectorSize(device string) (int64, error) { | ||
// Construct the file path to the sector size file | ||
filePath := "/sys/block/" + device + "/queue/hw_sector_size" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- This file doesn't exist for disk partitions. For partitions, find it is disk and fetch sector size from it and use it.
- if
/sys/class/block/<device>/queue/hw_sector_size
is missing for disk, use 512 as default block size. - Move this function as
GetHardwareSectorSize
topkg/device/sysfs_linux.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found a defaultBlockSize
as 512 in the sysfs_linux.go
, Since it's the same value, should I go ahead and use it? but I'm thinking block size and sector size aren't the same thing, I could create a new constant just for sector size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the system have default block/sector size as 512. you could define default sector size as default block size.
if result.Err != nil { | ||
return | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why continue
? do you mean break
?
pkg/metrics/collector.go
Outdated
List(ctx) | ||
for result := range driveResultCh { | ||
if result.Err != nil { | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why continue
? do you mean break
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used continue
to ensure we process all drives, even if an error occurs on one of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once the error is thrown, we cannot get values further. break
is logically correct thing to do.
Yes i will, I'm going through the previous one written already. |
PTAL @r-scheele |
Hi @Praveenrajmani yes, i'm working on this review now |
Hi @balamurugana @Praveenrajmani PTAL? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initial review
@@ -3,7 +3,7 @@ apiVersion: apiextensions.k8s.io/v1 | |||
kind: CustomResourceDefinition | |||
metadata: | |||
annotations: | |||
controller-gen.kubebuilder.io/version: v0.15.0 | |||
controller-gen.kubebuilder.io/version: v0.14.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please update your controller-gen version and run build.sh
again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, i did that already.
@@ -0,0 +1,149 @@ | |||
--- | |||
apiVersion: apiextensions.k8s.io/v1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this file looks unrelated, can you remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was present in the file before https://github.com/minio/directpv/blob/master/pkg/admin/installer/directpv.min.io_directpvvolumes.yaml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these files are not present in the existing source. can you remove these?
@@ -3,7 +3,7 @@ apiVersion: apiextensions.k8s.io/v1 | |||
kind: CustomResourceDefinition | |||
metadata: | |||
annotations: | |||
controller-gen.kubebuilder.io/version: v0.15.0 | |||
controller-gen.kubebuilder.io/version: v0.16.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The controller-gen version should be v0.15.0. You can remove $GOPATH/bin/controller-gen
and then run ./build.sh
- It will install the correct version @r-scheele
|
||
// GetStat retrieves and returns statistics for a given device name. | ||
func GetStat(name string) ([]uint64, int, error) { | ||
filePath := path.Join("/sys/class/block/", name, "/stat") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/sys/class/block
can be defined as a constant as it is referred in more than one place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, use path.Join
and keep this as a global var
func GetStat(name string) ([]uint64, int, error) { | ||
filePath := path.Join("/sys/class/block/", name, "/stat") | ||
|
||
driveStatus := 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we actually don't need a variable here, the function can be simply return the StatInfo, err
.
ie, func Stat(name string) (*StatInfo, error)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
driveStatus
is used to track the status of the drive, so it can also be exported as a metric. Should i still remove it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we can remove it.
err == nil
implies the drive is online and err != nil
implies drive is offline. we don't need the extra param to indicate this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@r-scheele ^^
return nil, driveStatus, nil | ||
} | ||
|
||
klog.Infof("Reading drive statistics from: %s", filePath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can use verbose levels for loggers, we dont want these logs to be printed all the time when prometheus scrapes the metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like, klog.V(4).Infof
|
||
ui64, err := strconv.ParseUint(token, 10, 64) | ||
if err != nil { | ||
klog.Warningf("Failed to parse token '%s': %v", token, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we return err here as the parsing fails
// GetHardwareSectorSize retrieves the hardware sector size for a given device. | ||
// It works for both whole disks and partitions. | ||
// The device name should be without the "/dev/" prefix (e.g., "sda" or "sda1"). | ||
func GetHardwareSectorSize(deviceName string) (uint64, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func GetHardwareSectorSize(deviceName string) (uint64, error) { | |
func HardwareSectorSize(deviceName string) (uint64, error) { |
return 0, fmt.Errorf("failed to parse hardware sector size for %s: %v", deviceName, err) | ||
} | ||
|
||
klog.Infof("Hardware sector size for %s: %d bytes", deviceName, sectorSize) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same: Use verbose levels in logging
sorry for the late review @r-scheele |
No problem @Praveenrajmani, would look into it soon. |
@balamurugana Can you PTAL |
@r-scheele I sent a PR r-scheele#1 to your branch in order to avoid many review cycles. Please check and merge it accordingly. |
closing this as #955 PR got merged |
Implemented new metrics in
publishDriveStats
:Similar to the existing implementation
publishVolumeStats