
AWS Pacemaker awsvip failing with different errors #1876

Open

jayan800 opened this issue Jun 23, 2023 · 12 comments

@jayan800

Hi All,

We are running a two-node Pacemaker cluster in AWS and use the "awsvip" resource type to configure the VIP's secondary private IP. Below is the configuration:

pcs resource show privip_node1

Resource: privip_node1 (class=ocf provider=heartbeat type=awsvip)
Attributes: secondary_private_ip=10.x.x.x
Operations: migrate_from interval=0s timeout=30s (privip_node1-migrate_from-interval-0s)
migrate_to interval=0s timeout=30s (privip_node1-migrate_to-interval-0s)
monitor interval=20s timeout=30s (privip_node1-monitor-interval-20s)
start interval=0s timeout=30s (privip_node1-start-interval-0s)
stop interval=0s timeout=30s (privip_node1-stop-interval-0s)
validate interval=0s timeout=10s (privip_node1-validate-interval-0s)

pcs resource show node1_vip

Resource: node1_vip (class=ocf provider=heartbeat type=IPaddr2)
Attributes: ip=10.x.x.x
Operations: monitor interval=10s timeout=20s (node1_vip-monitor-interval-10s)
start interval=0s timeout=20s (node1_vip-start-interval-0s)
stop interval=0s timeout=20s (node1_vip-stop-interval-0s)
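
For context, these two resources would normally be grouped (or colocated and ordered) so the secondary private IP and the VIP always move together. A hedged sketch, assuming no group exists yet and using a placeholder group name:

# keep the awsvip and IPaddr2 resources on the same node, started in order
pcs resource group add node1_vip_group privip_node1 node1_vip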

The EC2 instance is configured to use IMDSv2. The fence_aws agent and resource-agents packages have also been upgraded to the most recent versions, which support IMDSv2. Additionally, the resource is set up to use the IAM instance profile credentials.
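
As a quick sanity check that the instance profile credentials are actually reachable through IMDSv2 (relevant to the "Unable to locate credentials" error further down), something like the following can be run on each node; the first curl lists whatever role is attached to the instance profile:

# request an IMDSv2 session token
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
# list the IAM role attached to the instance profile
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
# confirm the AWS CLI picks up those credentials
aws sts get-caller-identity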

fence-agents-aws-4.2.1-41.el7_9.3.x86_64
python-s3transfer-0.1.13-1.0.1.el7.noarch
resource-agents-4.1.1-61.el7_9.15.x86_64

pip list | grep -i boto
boto3 (1.10.0)
botocore (1.13.50)

aws --version
aws-cli/2.9.4 Python/3.9.11 Linux/3.10.0-1160.80.1.0.1.el7.x86_64 exe/x86_64.oracle.7 prompt/off

pip3 list | grep -i boto
boto3 1.23.10
botocore 1.26.10

The privip resource consistently fails with different errors:

pengine: warning: unpack_rsc_op_failure: Processing failed monitor of privip_node2 on node2: unknown error | rc=1
Apr 13 11:09:54 node2 lrmd[3773]: warning: privip_node2_monitor_20000 process (PID 109357) timed out
Apr 13 11:09:54 node2 lrmd[3773]: warning: privip_node2_monitor_20000:109357 - timed out after 30000ms

Jun 16 10:01:43 node2 lrmd[36967]: notice: privip_node2_monitor_20000:13042:stderr [ Unable to locate credentials. You can configure credentials by running "aws configure". ]
Jun 16 10:01:43 node2 crmd[36970]: notice: privip_node2_monitor_20000:91 [ % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r100 359 100 359 0 0 37513 0 --:--:-- --:--:-- --:--:-- 39888\n\nUnable to locate credentials. You can configure credentials by running "aws configure".\n ]

Jun 22 10:10:10 node1 lrmd[12465]: notice: privip_node1_monitor_20000:105561:stderr [ #15 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed connect to 169.254.169.254:80; Connection refused ]
Jun 22 10:10:10 node1 lrmd[12465]: notice: privip_node1_monitor_20000:105561:stderr [ An error occurred (MissingParameter) when calling the DescribeInstances operation: The request must contain the parameter InstanceId ]

Failed Resource Actions:

  • privip_node1_start_0 on node1 'not running' (7): call=250, status=complete, exitreason='instance_id not found. Is this a EC2 instance?',
    last-rc-change='Fri May 26 07:27:46 2023', queued=0ms, exec=6597ms

Any advice would be great.

@oalbrigt
Contributor

Try running pcs resource debug-start --full <resource>. That should show you all the commands it's running, and hopefully some pointers to what's wrong.
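
For example, with the resource name from the original report (the --full option also shows the agent's own trace output):

pcs resource debug-start privip_node1 --full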

@jayan800
Author

jayan800 commented Jun 26, 2023

Thank you.

The debug command completed without any errors.

Is there anything else I should check?

@jayan800 jayan800 changed the title AWS Pacemaker awsvip faling with different errors AWS Pacemaker awsvip failing with different errors Jun 26, 2023
@oalbrigt
Contributor

You can run pcs resource update <resource> trace_ra=1 and then disable/enable or restart the resource.

The trace files will be available in /var/lib/heartbeat/trace_ra/.
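
A hedged sketch of the full sequence, using the resource name from the original report (the exact layout under the trace directory can vary between resource-agents versions):

pcs resource update privip_node1 trace_ra=1
pcs resource restart privip_node1
# inspect the newest trace file for the failing monitor/start action
ls -lt /var/lib/heartbeat/trace_ra/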

@jayan800
Author

Thank you, I will enable the trace. Fingers crossed.

@ahelbadry

Hello Good people,

This thread is a bit old, but I have the same error:

Oct 29 08:30:22 [1749] auto-2.dhsscegypt.local lrmd: notice: operation_finished: awsvip_start_0:6361:stderr [ ocf-exit-reason:instance_id not found. Is this a EC2 instance? ]
Oct 29 08:30:22 [1749] auto-2.dhsscegypt.local lrmd: notice: operation_finished: awsvip_start_0:6361:stderr [ ]
Oct 29 08:30:22 [1749] auto-2.dhsscegypt.local lrmd: notice: operation_finished: awsvip_start_0:6361:stderr [ An error occurred (MissingParameter) when calling the DescribeInstances operation: The request must contain the parameter InstanceId ]

Did anyone find a solution to this?

@oalbrigt
Contributor

oalbrigt commented Nov 4, 2024

It sounds like the AWS metadata service isn't replying to the requests.

You can try running curl http://169.254.169.254/latest/meta-data/instance-id on the node to see any additional info.
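
If the instance enforces IMDSv2 (as in the original report), the plain curl needs a session token first; a hedged token-based variant of the same check:

TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/instance-id"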

Does it happen every time, or just at random?

@ahelbadry

No, it's just random, and there are no indications of any errors of any kind before that.

@ahelbadry

> It sounds like the AWS metadata service isn't replying to the requests.
>
> You can try running curl http://169.254.169.254/latest/meta-data/instance-id on the node to see any additional info.
>
> Does it happen every time, or just at random?

I did run that curl, and it usually returns the instance ID successfully; however, sometimes it doesn't, which causes this error. Any idea why that might happen?

@oalbrigt
Contributor

oalbrigt commented Nov 4, 2024

Ah. If it's random, it's probably the requests getting throttled.

You should check whether there's a resource-agents update for your distro, as retry functionality has been added to avoid this:
#1936

If the latest version for your distro still has the issue, you should report the bug to them and provide a link to the fix so they can apply it.

@ahelbadry

Thanks man.

So just to confirm, that's a resource-agents issue, not something on the AWS side?

@oalbrigt
Contributor

oalbrigt commented Nov 4, 2024

The issue can be due to a hiccup in the AWS metadata service, the network, or simply AWS throttling requests when it receives too many over a short period.

The fix makes the agent retry a set number of times before failing, and lets the user configure the number of retries and the sleep between retries so it works well with their setup.
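
If the updated agent is available, the retry behaviour can then be tuned on the resource. The parameter names below (curl_retries / curl_sleep) are an assumption based on newer resource-agents releases, so verify them on your version first:

# assumed parameter names; verify with: pcs resource describe awsvip
pcs resource update awsvip curl_retries=5 curl_sleep=3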

@ahelbadry

Thanks man.

Great help.
