Jack Arildson

Proxmox Database Corruption - ipcc_send_rec[1] failed: Connection refused

Resolving Proxmox Web Interface Issues Due to Corrupted Cluster Database

Recently, I encountered an issue where my Proxmox node's web interface became inaccessible. After manually SSH-ing into the affected node, I started diagnosing the issue.

Symptoms and Initial Troubleshooting

Upon checking the status of the pve-firewall service, I saw multiple errors:

Jun 27 12:44:15 hv02 pve-firewall[503543]: ipcc_send_rec[1] failed: Connection refused
Jun 27 12:44:15 hv02 pve-firewall[503543]: ipcc_send_rec[2] failed: Connection refused
Jun 27 12:44:15 hv02 pve-firewall[503543]: ipcc_send_rec[3] failed: Connection refused

Additionally, the pveproxy service reported issues with SSL certificates:

Jun 29 03:52:52 hv03 pveproxy[3462962]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key)
Jun 29 03:52:52 hv03 pveproxy[3462963]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key)

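Both errors are consistent with the Proxmox Cluster File System (pmxcfs) not running: ipcc_send_rec is the IPC call that services use to talk to pmxcfs, and /etc/pve, where pve-ssl.key lives, is the FUSE mount that pmxcfs provides.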

Identifying the Root Cause

Running the command /usr/bin/pmxcfs revealed a critical issue:

[database] crit: found entry with duplicate name 'lxc' - A:(inode = 0x000000000303AD52...) vs. B:(inode = 0x000000000306F1A7...)
[database] crit: DB load failed
[main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'

This indicated corruption in the pmxcfs configuration database (config.db): two entries named lxc, with different inodes, existed under the same parent directory, which caused the database load to fail.

Resolving the Issue

To inspect the problematic entries, I executed:

sqlite3 /var/lib/pve-cluster/config.db 'SELECT inode,mtime,name FROM tree WHERE parent = 0x000000000303AD50'

This confirmed two entries named lxc under the same parent. To fix the issue safely, I first made a backup of the database:

cp /var/lib/pve-cluster/config.db /var/lib/pve-cluster/config.db.bk
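
For a broader check than the single-parent query above, here is a minimal Python sketch that lists every name occurring more than once under the same parent. It assumes only the tree columns used in the query above (inode, parent, mtime, name). The demo rows are made up, apart from the two lxc inodes and the parent taken from this post; the function name find_duplicate_names is hypothetical. To use it for real, point sqlite3.connect at a copy of config.db rather than the in-memory demo.

```python
import sqlite3

def find_duplicate_names(conn):
    """Return (parent, name, inode, mtime) rows for names that repeat under one parent."""
    query = """
        SELECT t.parent, t.name, t.inode, t.mtime
        FROM tree AS t
        JOIN (
            SELECT parent, name
            FROM tree
            GROUP BY parent, name
            HAVING COUNT(*) > 1
        ) AS d ON d.parent = t.parent AND d.name = t.name
        ORDER BY t.parent, t.name, t.inode
    """
    return conn.execute(query).fetchall()

# Demo database: a cut-down stand-in for the real tree table (made-up mtimes)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tree (inode INTEGER PRIMARY KEY, parent INTEGER, mtime INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO tree VALUES (?, ?, ?, ?)",
    [
        (0x303AD52, 0x303AD50, 1, "lxc"),          # entry A from the error message
        (0x306F1A7, 0x303AD50, 2, "lxc"),          # entry B from the error message
        (0x303AD53, 0x303AD50, 1, "qemu-server"),  # made-up, non-duplicated sibling
    ],
)

duplicates = find_duplicate_names(conn)
for parent, name, inode, mtime in duplicates:
    print(f"parent={parent:#x} name={name} inode={inode} ({inode:#x}) mtime={mtime}")
```

Comparing the mtime of each duplicate is one way to guess which copy is stale before deleting anything.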

Initially, I deleted entry B from the error message (inode 0x306F1A7, which is 50786727 in decimal) together with its children:

sqlite3 /var/lib/pve-cluster/config.db 'DELETE FROM tree WHERE parent = 50786727 OR inode = 50786727'

This restored access to the web interface, but all VMs and containers were now missing from pct list. Recognizing that I had deleted the wrong copy, I restored the database from the backup:

cp /var/lib/pve-cluster/config.db.bk /var/lib/pve-cluster/config.db

I then targeted the other duplicate, entry A (inode 0x303AD52, which is 50572626 in decimal):

sqlite3 /var/lib/pve-cluster/config.db 'DELETE FROM tree WHERE parent = 50572626 OR inode = 50572626'
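
The pmxcfs error prints inodes in hex, while the DELETE statements above use decimal. A small Python sketch of the conversion, using the two inode values from the error message (the dictionary names are just for illustration):

```python
# Inode values exactly as printed in the pmxcfs error message
logged = {"A": "0x000000000303AD52", "B": "0x000000000306F1A7"}

# int(..., 16) accepts the 0x prefix and the leading zeros
decimal = {label: int(value, 16) for label, value in logged.items()}

for label, value in decimal.items():
    print(f"entry {label}: {logged[label]} -> {value}")
```

Entry A maps to 50572626 and entry B to 50786727, which are the two values used in the DELETE statements.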

After executing this, I restarted the essential Proxmox services:

systemctl restart pve-cluster pveproxy pvestatd

Successful Outcome

This second attempt resolved the issue. The Proxmox web interface became accessible again, and all VMs and containers reappeared.

Resources used:

https://forum.proxmox.com/threads/vm-status-unknown-grey-question-mark.92359/

https://forum.proxmox.com/threads/pve-cluster-fails-to-start.82861/

https://nramkumar.org/tech/blog/2023/07/08/proxmox-fixing-your-database-after-a-host-name-change/

https://forum.proxmox.com/threads/the-etc-pve-directory-disappeared.128117/

https://forum.proxmox.com/threads/unable-to-load-access-control-list-connection-refused.72245/page-2

https://www.reddit.com/r/Proxmox/comments/1bx92wv/web_admin_not_loading_after_reboot/

A weird SQLAlchemy session wrapper I wrote

SQLAlchemy Wrapper Decorator Code

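import logging
from functools import wraps

# Assumption: TimeoutError here is SQLAlchemy's pool timeout, not the builtin
from sqlalchemy.exc import (
    DataError, DisconnectionError, IntegrityError,
    OperationalError, StatementError, TimeoutError,
)
# Assumption: the Postgres-specific exceptions come from asyncpg
from asyncpg.exceptions import (
    CannotConnectNowError, ConnectionDoesNotExistError, PostgresError,
)

# AsyncSessionLocal is assumed to be an async session factory defined elsewhere in the project

# Decorator for managing the database session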
def db_session_decorator(func):
    @wraps(func)
    async def session_wrapper(*args, **kwargs):
        async with AsyncSessionLocal() as session:
            try:
                # Pass the session to the wrapped function
                response = await func(session, *args, **kwargs)
                await session.commit()
                return response
            except IntegrityError as e:
                await session.rollback()
                logging.error("Integrity error detected.", exc_info=True)
                raise
            except (OperationalError, DisconnectionError, TimeoutError) as e:
                # Checked before StatementError, which is an ancestor of OperationalError
                await session.rollback()
                logging.error("Connection or service availability issue.", exc_info=True)
                raise
            except (DataError, StatementError) as e:
                await session.rollback()
                logging.error("Data integrity or statement error detected.", exc_info=True)
                raise
            except ConnectionDoesNotExistError as e:
                await session.rollback()
                logging.error("Operation on closed connection attempted.", exc_info=True)
                raise
            except CannotConnectNowError as e:
                await session.rollback()
                logging.error("Cannot connect to the database.", exc_info=True)
                raise
            except MemoryError as e:
                await session.rollback()
                logging.critical("Memory or resource exhaustion detected.", exc_info=True)
                raise
            except PostgresError as e:  # Catch-all for Postgres related errors not caught by SQLAlchemy
                await session.rollback()
                logging.error("Postgres error detected.", exc_info=True)
                raise
            except Exception as e:
                # Rollback for any other exception not explicitly handled above
                await session.rollback()
                # Log the exception details
                logging.error("An unexpected error occurred.", exc_info=True)
                return {'error': str(e)}
    return session_wrapper
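
To see the control flow without a database, here is a stdlib-only sketch of the same commit-or-rollback idea. FakeSession, with_session and make_session are hypothetical stand-ins written for this example, not SQLAlchemy APIs. Unlike the decorator above, this sketch re-raises every exception instead of returning an error dict.

```python
import asyncio
from functools import wraps

class FakeSession:
    """Hypothetical stand-in for an async DB session; it only records calls."""
    def __init__(self):
        self.calls = []

    async def commit(self):
        self.calls.append("commit")

    async def rollback(self):
        self.calls.append("rollback")

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self.calls.append("close")
        return False  # never swallow exceptions

def with_session(session_factory):
    """Open a session, pass it as the first argument, commit on success, roll back on error."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            async with session_factory() as session:
                try:
                    result = await func(session, *args, **kwargs)
                    await session.commit()
                    return result
                except Exception:
                    await session.rollback()
                    raise
        return wrapper
    return decorator

sessions = []  # keeps every session created so the calls can be inspected

def make_session():
    session = FakeSession()
    sessions.append(session)
    return session

@with_session(make_session)
async def double(session, value):
    return value * 2

@with_session(make_session)
async def fail(session):
    raise ValueError("simulated failure")

result = asyncio.run(double(21))
try:
    asyncio.run(fail())
except ValueError:
    pass

print(result, sessions[0].calls, sessions[1].calls)
```

The successful call records a commit followed by a close, and the failing call records a rollback followed by a close.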

Example Usage of the Decorator

While this isn't the cleanest code I've written, it's an example I had on hand of how the decorator might be used.

from datetime import datetime

from sqlalchemy import select

async def get_object(modeltype, uuid, session=None):
    has_outside_session = session is not None
    # If no session was passed, create a new session for this call
    if not has_outside_session:
        session = AsyncSessionLocal()
    try:
        result = await session.execute(
            select(modeltype).filter_by(uuid=int(uuid))
        )
        instance = result.scalars().first()
        # If a new session was created, commit any changes (if any were made, although get_object is typically read-only) and close the session
        if not has_outside_session:
            await session.commit()
        return instance
    except Exception as e:
        # If a new session was created, rollback any changes due to an error
        if not has_outside_session:
            await session.rollback()
        raise
    finally:
        # If a new session was created, close it when done
        if not has_outside_session:
            await session.close()

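@db_session_decorator
async def init_user(session, userid, username):
    # Look up the user inside the session supplied by the decorator
    user = await get_object(User, userid, session)
    if not user:
        # guildname is kept from the original snippet; it looks like a leftover from a guild model, so rename it to match your User model
        new_user = User(uuid=userid, registered_at=str(datetime.now()), guildname=username)
        session.add(new_user)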

Deploying LXC Containers in Proxmox using Terraform and Ansible

Problem

As far as I can tell, the default LXC container templates downloadable through the CT Templates section give you no way to run a script on boot: no user data, no cloud-init, nothing similar. This led me to come up with a kind of "hack" that isn't perfect but works for my specific scenario.

Why

My solution was to use Proxmox hookscripts. Hookscripts are scripts (typically Perl, ewww lol) that run directly on the Proxmox hypervisor, NOT inside the LXC container. The Terraform docs don't make it clear whether the script runs on the hypervisor or in the container, since the option is part of the LXC container resource. You essentially have to read through the Proxmox docs and forums to understand it.

Solution

Specifically, I use a hookscript that runs commands inside the container from the hypervisor, using Proxmox's CLI tooling (pct exec). Below is an example of a hookscript I wrote:

#!/usr/bin/perl

use strict;
use warnings;

my $vmid = shift;
my $phase = shift;
my $target_host = 'hv02'; # Replace with the hostname of the target Proxmox host that holds the ansible scripts
my $container_id_on_target_host = '108'; # Replace with the container ID

if ($phase eq 'post-start') {
    # First, install openssh-server inside the container that has just been started
    my $install_cmd = 'dnf install -y openssh-server'; # openssh-server is not installed on the AlmaLinux 9 LXC CT image provided by Proxmox
    my $sed_cmd = "sed -i.bak 's/^PermitRootLogin \\(prohibit-password\\|no\\)/PermitRootLogin yes/' /etc/ssh/sshd_config"; # edits sshd_config to allow root login; this is within my security tolerance for what this server does
    my $restart_sshd = 'systemctl restart sshd';
    my $start_sshd = 'systemctl enable sshd';
    # Build the command as a list so it bypasses the hypervisor's shell: in a single
    # "pct exec ... bash -c '...'" string, the single quotes inside $sed_cmd would end
    # the outer quoting early and the restart/enable steps would never run.
    my @full_cmd = ('pct', 'exec', $vmid, '--', 'bash', '-c', "$install_cmd && $sed_cmd && $restart_sshd && $start_sshd");

    # Execute the installation command (this runs on the hypervisor that this script is run on, which should be the same hypervisor you want the LXC container created on)
    system(@full_cmd);

    sleep(5); # wait so we don't trip over Proxmox's locking system

    my $reboot_ct = "pct exec $vmid -- bash -c 'sudo shutdown -r now'"; # kind of a hack: reboot the container so sshd comes up on boot, in case the restart earlier in this script did not take effect
    system($reboot_ct);

    # Next, retrieve the IP address of the container's eth0 interface
    my $get_ip_cmd = "pct exec $vmid -- ip -4 addr show eth0 | grep -oP '(?<=inet\\s)\\d+(\\.\\d+){3}'";
    my $container_ip = `$get_ip_cmd`;
    chomp $container_ip; # Remove any trailing newline

    # If the IP address was successfully retrieved, run ssh-keygen -R on the target host's container. I spin test containers up and down quite often, and having to remove the stale known_hosts key every time is annoying when I just want Ansible to work without extra effort. It's within my security tolerance, but feel free to remove this from the script if it isn't within yours.
    # This command SSHes to the hypervisor that holds the container that runs Ansible.
    if ($container_ip) {
        my $ssh_cmd = "ssh root\@$target_host 'pct exec $container_id_on_target_host -- ssh-keygen -R $container_ip'";
        system($ssh_cmd);
    } else {
        warn "Failed to get the IP address of the container with ID $vmid";
    }
}

exit(0);

Save the script to /var/lib/vz/snippets/install-ssh.pl. Feel free to rename the script itself, but that is the directory it needs to live in. Make sure it's on the host you want the container to be on.
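
The IP extraction step in the hookscript (grep -oP on the output of ip -4 addr show eth0) can be prototyped away from the hypervisor. Below is a rough Python sketch using the same lookbehind pattern; the sample output is made up with a documentation address, and first_ipv4 is a hypothetical helper name.

```python
import re

# Same pattern as the grep -oP call in the hookscript:
# an IPv4 address that directly follows "inet" and one whitespace character
INET_RE = re.compile(r"(?<=inet\s)\d+(\.\d+){3}")

def first_ipv4(ip_addr_output):
    """Return the first IPv4 address found after 'inet ', or None if there is none."""
    match = INET_RE.search(ip_addr_output)
    return match.group(0) if match else None

# Made-up sample of what `ip -4 addr show eth0` might print inside a container
sample = """2: eth0@if45: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    inet 192.0.2.10/24 brd 192.0.2.255 scope global eth0
       valid_lft forever preferred_lft forever
"""

print(first_ipv4(sample))
print(first_ipv4(""))
```

An empty result is the case the hookscript guards against with its warn branch, for example when the interface has no address yet.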

Terraform Example:

module "container-name-here" {
  source = "../modules/lxc-vm"

  hostname           = "put hostname here"
  cores              = 4
  memory             = "4096"
  swap               = "2048"
  ostemplate         = "local:vztmpl/almalinux-9-default_20221108_amd64.tar.xz"
  hookscript         = "local:snippets/install-ssh.pl" # the script we referred to earlier
  container_password = var.container_password # please, please do not put the actual password in code; use CI/CD for this
  target_node        = "hv03" # the hypervisor you want this LXC container placed on

  ssh_public_keys = <<-EOT
  <put your public key here>
  EOT

  unprivileged     = true
  onboot           = true
  start            = true
  features_nesting = true # turn off if you don't need
  features_keyctl  = true # turn off if you don't need
  features_fuse    = true # turn off if you don't need
  rootfs_storage   = "local"
  rootfs_size      = "40G"
  network_name     = "eth0"
  network_bridge   = "vmbr0"
  network_ip       = "x.x.x.x/24"
  network_gw       = "x.x.x.254"
  network_hwaddr   = "mac address here"
}

Discord.py Permanent Button Panels in Dislash

Permanent button panels.

Sometimes with Discord bots you want to create a button that persists in a channel for users to click, whether it's a way to weed out spam bots or a way to open a ticket in your ticketing system bot. I struggled for a bit to find a way to get this done. This was my solution using the Dislash library.

def build_panel():
    # Build the embed and button row in one place so the command and the click handler stay in sync
    embed = discord.Embed(
        title="Click Button to gain access to the features of this server!",
        color=0x00ff00
    )
    row = ActionRow(
        Button(style=ButtonStyle.green, label="Button Name", custom_id="ex_button")
    )
    return embed, row

@inter_client.slash_command(name="create_button", description="Creates a panel for your button channel")
async def create_button2(ctx):
    embed, row = build_panel()
    await ctx.send(embed=embed, components=[row])

@bot.event
async def on_button_click(res):
    if res.component.custom_id == "ex_button":
        user = res.author
        dm_channel = user.dm_channel or await user.create_dm()
        await button_action(res) # call the function where you want the button's actions to take place
        await res.respond(type=6) # type 6 acknowledges the click without sending a visible reply
        await asyncio.sleep(1)
        # Send a fresh copy of the panel to the channel
        embed, row = build_panel()
        await res.channel.send(embed=embed, components=[row])

An example of what this results in:

Celestica Seastone DX010 100GBE - 25GBE breakout configuration file

Preface:

I was having trouble setting up my Celestica Seastone DX010 100GbE switch to break out a single 100GbE port into 25GbE ports due to a bug in SONiC NOS. This reddit guide was a bit confusing to follow, but after hours of troubleshooting, it seemed to work.

Procedure


Copyright © 2020
