
      The BASIC programming language turns 60

      news.movim.eu / ArsTechnica · 2 days ago - 16:17

    [Image: Part of the cover illustration from "The Applesoft Tutorial" BASIC manual that shipped with the Apple II computer starting in 1981. (credit: Apple, Inc.)]

    Sixty years ago, on May 1, 1964, at 4 am, a quiet revolution in computing began at Dartmouth College. That's when mathematicians John G. Kemeny and Thomas E. Kurtz successfully ran the first program written in their newly developed BASIC (Beginner's All-Purpose Symbolic Instruction Code) programming language on the college's General Electric GE-225 mainframe.

    Little did they know that their creation would go on to democratize computing and inspire generations of programmers over the next six decades.

    What is BASIC?

    In its most traditional form, BASIC is an interpreted programming language that runs line by line, with line numbers. A typical program might look something like this:
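    The listing itself is truncated in this feed; as a stand-in, a minimal line-numbered example in classic BASIC (ours, not the article's) would be:

    10 PRINT "HELLO"
    20 GOTO 10

    The interpreter runs line 10, then line 20 jumps back to line 10, so the program loops until interrupted.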


      'Women Who Code' Shuts Down Unexpectedly

      pubsub.blastersklan.com / slashdot · Friday, 19 April - 22:38 · 1 minute

    Women Who Code (WWC), a U.S.-based organization of 360,000 people supporting women who work in the tech sector, is shutting down due to a lack of funding. "It is with profound sadness that, today, on April 18, 2024, we are announcing the difficult decision to close Women Who Code, following a vote by the Board of Directors to dissolve the organization," the organization said in a blog post. "This decision has not been made lightly. It only comes after careful consideration of all options and is due to factors that have materially impacted our funding sources -- funds that were critical to continuing our programming and delivering on our mission. We understand that this news will come as a disappointment to many, and we want to express our deepest gratitude to each and every one of you who have been a part of our journey." The BBC reports: WWC was started in 2011 by engineers who "were seeking connection and support for navigating the tech industry" in San Francisco. It became a nonprofit organization in 2013 and expanded globally. In a post announcing its closure, it said it had held more than 20,000 events and given out $3.5m in scholarships. A month before the closure, WWC had announced a conference for May, which has now been cancelled. A spokesperson for WWC said: "We kept our programming moving forward while exploring all options." They would not comment on questions about the charity's funding. The most recent annual report, for 2022, showed the charity made almost $4m that year, while its expenses were just under $4.2m. WWC said that "while so much has been accomplished," their mission was not complete. It continued: "Our vision of a tech industry where diverse women and historically excluded people thrive at every level is not fulfilled."

      C++ Creator Rebuts White House Warning

      pubsub.blastersklan.com / slashdot · Tuesday, 19 March - 13:33 · 2 minutes

    An anonymous reader quotes a report from InfoWorld: C++ creator Bjarne Stroustrup has defended the widely used programming language in response to a Biden administration report that calls on developers to use memory-safe languages and avoid using vulnerable ones such as C++ and C. In a March 15 response to an inquiry from InfoWorld, Stroustrup pointed out strengths of C++, which was designed in 1979. "I find it surprising that the writers of those government documents seem oblivious of the strengths of contemporary C++ and the efforts to provide strong safety guarantees," Stroustrup said. "On the other hand, they seem to have realized that a programming language is just one part of a tool chain, so that improved tools and development processes are essential." Safety improvement always has been a goal of C++ development efforts, Stroustrup stressed. "Improving safety has been an aim of C++ from day one and throughout its evolution. Just compare the K&R C language with the earliest C++, and the early C++ with contemporary C++. My CppCon 2023 keynote outlines that evolution," he said. "Much quality C++ is written using techniques based on RAII (Resource Acquisition Is Initialization), containers, and resource management pointers rather than conventional C-style pointer messes." Stroustrup cited a number of efforts to improve C++ safety. "There are two problems related to safety. Of the billions of lines of C++, few completely follow modern guidelines, and peoples' notions of which aspects of safety are important differ. I and the C++ standard committee are trying to deal with that," he said. "Profiles is a framework for specifying what guarantees a piece of code requires and enable implementations to verify them. There are documents describing that on the committee's website -- look for WG21 -- and more are coming. However, some of us are not in a mood to wait for the committee's necessarily slow progress." Profiles, Stroustrup said, "is a framework that allows us to incrementally improve guarantees -- e.g., to eliminate most range errors relatively soon -- and to gradually introduce guarantees into large code bases through local static analysis and minimal run-time checks. My long-term aim for C++ is and has been for C++ to offer type and resource safety when and where needed. Maybe the current push for memory safety -- a subset of the guarantees I want -- will prove helpful to my efforts, which are shared by many in the C++ standards committee." Stroustrup previously defended the safety of C++ against the NSA, which recommended using memory-safe languages instead of C++ and C in a November 2022 bulletin.
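    Not from the article, but a minimal sketch of the RAII style Stroustrup contrasts with "C-style pointer messes":

    #include <memory>
    #include <string>
    #include <vector>

    // RAII: a resource's lifetime is tied to a scope, so cleanup is automatic.
    // Contrast with C-style new/delete, which leaks on early returns or exceptions.
    int main() {
        std::vector<std::string> lines;              // container owns its memory
        lines.push_back("no manual delete needed");
        auto owned = std::make_unique<std::string>("freed automatically");
        return 0;                                    // both released here, even if an exception unwinds
    }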

      Free/Libre 'GnuCOBOL' Compiler Reaches Maturity, Can Compete with Proprietary Offerings

      pubsub.blastersklan.com / slashdot · Sunday, 17 March - 18:58 · 1 minute

    An anonymous reader shared this report: After 20 years of development, the open source GnuCOBOL "has reached an industrial maturity and can compete with proprietary offers in all environments," said OCamlPro founder and GnuCOBOL contributor Fabrice Le Fessant, in a FOSDEM talk about the technology. GnuCOBOL turns COBOL source code into executable applications. It is very cross-platform, running on Linux, BSD, many proprietary Unixes, macOS, Windows, and even Android. And the latest version, v3.2, is being used in many commercial settings... GnuCOBOL maintainer Simon Sobisch noted that GnuCOBOL is seeing a lot of commercial deployments, such as for banking back-end apps, many of which are being migrated from Micro Focus, with users reporting performance improvements as a result. The French DGFIP federal agency moved from a GCOS mainframe to GnuCOBOL, with the help of Le Fessant's firm. Originally called OpenCOBOL, the project was started in 2002 and renamed GnuCOBOL in 2013. In the past three years, it has received attention from 13 contributors with 460 commits. Most Linux package managers carry a GnuCOBOL package for download... It can compile to C code (C89+), making it extremely portable, from mainframes to Raspberry Pis, Sobisch said... Also new is SuperBOL, a development studio for GnuCOBOL developed by Le Fessant's OCamlPro. It runs as a VSCode extension and features a full COBOL processor (written in OCaml).
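    For the curious, a trivial smoke test with GnuCOBOL's cobc compiler driver might look like this (the file name is arbitrary; the source uses fixed-format COBOL, so the divisions start in column 8):

    $ cat hello.cob
           IDENTIFICATION DIVISION.
           PROGRAM-ID. HELLO.
           PROCEDURE DIVISION.
               DISPLAY "Hello from GnuCOBOL".
               STOP RUN.
    $ cobc -x -o hello hello.cob && ./hello
    Hello from GnuCOBOL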

      developers.slashdot.org/story/24/03/17/1810257/freelibre-gnucobol-compiler-reaches-maturity-can-compete-with-proprietary-offerings


      C2PA's Time Warp

      pubsub.slavino.sk / hackerfactor · Monday, 4 March - 17:35 · 12 minutes

    Throughout my review of the C2PA specification and implementation, I've been focused on how easy it is to create forgeries that appear authentic. But why worry about forgeries when C2PA can't even get ordinary uses correct?

    Just consider the importance of the recorded timestamps. Accurate time records can resolve questions related to ordering and precedence, like "when did this happen?" and "who had it first?" Timestamps can address copyright assignment issues and are used with investigations to identify if something could or could not have happened.

    At my FotoForensics service, I've seen an increase in pictures containing C2PA metadata. They have come from Adobe, Microsoft, OpenAI (and DALL-E), Stability AI (Stable Diffusion), Leica (camera company), and others. Unfortunately, with more images, I'm seeing more problems -- including problems with timestamps.

    I typically use my FotoForensics service for these analysis blog entries. However, this time I'm going to use my Hintfo service (hintfo.com) to show the metadata. I also want to emphasize that all of the examples in this blog entry were submitted by real people to the public FotoForensics service; I didn't manufacture any of these pictures.

    Out of Sync

    I first noticed the problem with Microsoft's AI-generated pictures. For example:

    [Image: example Microsoft AI-generated picture; click through to view the C2PA metadata at Hintfo.]

    Adobe's Content Credentials web site does not identify any issues with this picture. However, the internal metadata contains two interesting timestamps. I extracted them using Adobe's c2patool. The first timestamp is part of the provenance: how, what, and when the picture was created:
    "assertion_store": {
    "c2pa.actions": {
    "actions": [
    {
    "action": "c2pa.created",
    "description": "AI Generated Image",
    "softwareAgent": "Bing Image Creator",
    "when": "2024-01-28T19:34:25Z"
    }
    ]
    }

    This provenance information identifies an AI Generated Image. It was created by Microsoft's Bing Image Creator on 2024-01-28 at 19:34:25 GMT.

    The other timestamp identifies when the metadata was notarized by an external third-party signatory:
    "signature": {
    "alg": "ps256",
    "issuer": "Microsoft Corporation",
    "time": "2024-01-28T19:34:24+00:00"
    }

    The external third-party timestamp authority works like a notary. It authoritatively states that it saw a signature for this picture at a specific date and time. The picture had to have been created at or before this timestamp, but not later.

    Adobe's c2patool has a bug that conflates information from different X.509 certificates. The cryptographic signature over the entire file was issued by Microsoft, but the time from the timestamp authority response was issued by DigiCert (not Microsoft); DigiCert isn't mentioned anywhere in the c2patool output. This bug gives the false impression that Microsoft notarized their own data. To be clear: Microsoft generated the file and it was notarized by DigiCert. Although attribution is a critical component to provenance, Adobe's c2patool mixes up the information and omits a signatory's identification, resulting in a misleading attribution. (This impacts Adobe's c2patool and Adobe's Content Credentials web site.)

    Ignoring the attribution bug, we can combine these provenance and notary timestamps with the time when FotoForensics received the picture; FotoForensics defines the last possible modification time since the files are stored on my servers in a forensically sound manner:

    2024-01-28 19:34:24 GMT   X.509 signed timestamp    Trusted external timestamp from DigiCert
    2024-01-28 19:34:25 GMT   JUMBF: AI image created   Internal C2PA metadata from Microsoft
    2024-02-01 10:33:29 GMT   FotoForensics: received   File cannot be modified after this time

    The problem, as the timeline shows, is that Bing Image Creator's creation time is one second after the file was notarized by the external third party. There are a few ways this can happen:
    • The external signer could have supplied the wrong time. In this case, the external signer is DigiCert. DigiCert abides by the X.509 certificate standards and maintains a synchronized clock. If we have to trust anything in this example, then I trust the timestamp from DigiCert.
    • Microsoft intentionally post-dated their creation time. (Seems odd, but it's an option.)
    • Microsoft's server is not using a synchronized clock. As noted in RFC 3628 (sections 4.3, 6.2, 6.3, and 7.3.1d), clocks need to be accurately synchronized. There could be a teeny tiny amount of drift, but certainly not at the tenths-of-a-second scale.
    • Microsoft modified the file after it was notarized. This is the only option that we can immediately rule out. Changing Microsoft's timestamp from "19:34:25" to "19:34:24" causes the cryptographic signature to fail. This becomes a detectable alteration. We can be certain that the signed file said "19:34:25" and not "19:34:24" in the provenance record.
    Now, I know what you're thinking. This might be a one-off case. The X.509 timestamp authority system permits clocks to drift by a tiny fraction of a second. With 0.00001 seconds drift, 24.99999 and 25.00000 seconds can be equivalent. With integer truncation, this could look like 24 vs 25 seconds. However, I'm seeing lots of pictures from Microsoft that contain this same "off by 1 second" error. Here are a few more examples:

    [Images: three more examples: a Lucy/dog picture, an apple, and waffles.]

    The Lucy/dog picture is from Bing Image Creator, the apple picture is from Microsoft Designer, and the waffles are from Microsoft's Azure DALL-E service. All of these files have the same "off by 1 second" error. In fact, the majority of pictures that I see from Microsoft have this same error. If I had to venture a guess, I'd say Microsoft's clocks were out of sync by almost a full second.

    Being inaccurate by 1 second usually isn't a big deal. Except in this case, it demonstrates that we cannot trust the embedded C2PA timestamps created by Microsoft. Today it's one second. It may increase over time to two seconds, three seconds, etc.
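    If you want to check your own files for this pattern, here's a rough sketch; the jq paths assume that c2patool's -d output uses the field names excerpted above, so treat it as a starting point rather than a definitive recipe:

    c2patool -d file.jpg > manifest.json
    # Creation time from the provenance assertion ("when"):
    created=$(jq -r '[.. | objects | select(.action? == "c2pa.created") | .when] | first' manifest.json)
    # Reported notary time from the signature block ("time"):
    notarized=$(jq -r '[.. | objects | select(.issuer? and .time?) | .time] | first' manifest.json)
    # GNU date parses both ISO 8601 forms shown above ("...Z" and "...+00:00").
    if [ "$(date -u -d "$created" +%s)" -gt "$(date -u -d "$notarized" +%s)" ]; then
        echo "Postdated: created $created is after notarized $notarized"
    fi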

    Out of Time

    Many of the C2PA-enabled files that I encounter have other timestamps beyond the C2PA metadata. It's problematic when the other timestamps in the file fail to align with the C2PA metadata. Does it mean that the external trusted authority signer is wrong, that the device requesting the signature is inaccurate, that the user's clock is wrong, or that some other timestamp is incorrect? Or maybe a combination?

    As an example, here's a picture that was edited using Adobe's Photoshop and includes an Adobe C2PA signature:

    [Image: a red car picture edited with Adobe Photoshop.]

    In this case, the picture includes XMP, IPTC, and EXIF timestamps. Putting them together into a timeline shows metadata alterations after the trusted notary timestamp:

    2022-02-25 12:09:40 GMT   EXIF: Date/Time Original
                              EXIF: Create Date
                              IPTC: Created Date/Time
                              IPTC: Digital Creation Date/Time
                              XMP: Create Date
                              XMP: Date Created
    2023-12-13 17:29:15 GMT   XMP: History from Adobe Photoshop 25.2 (Windows)
    2023-12-13 18:22:00 GMT   XMP: History from Adobe Photoshop 25.2 (Windows)
    2023-12-13 18:32:53 GMT   X.509 signed timestamp by the authoritative third-party (DigiCert)
    2023-12-13 18:33:12 GMT   EXIF: Modify Date
                              XMP: History (Adobe Photoshop 25.2 (Windows))
                              XMP: Modify Date
                              XMP: Metadata Date
    2023-12-14 03:32:15 GMT   XMP: History from Adobe Photoshop Lightroom Classic 12.0 (Windows)
    2024-02-06 14:31:58 GMT   FotoForensics: Received

    With this picture:
    1. Adobe's C2PA implementation at Content Credentials doesn't identify any problems. The picture and metadata seem legitimate.
    2. The Adobe-generated signature covers the XMP data. Since the signature is valid, it implies that the XMP data was not altered after it was signed.
    3. The authoritative external timestamp authority (DigiCert) provided a signed timestamp. The only other timeline entry after this signature should be when FotoForensics received the picture.
    4. However, according to the EXIF and XMP metadata, the file was further altered without invalidating the cryptographic signatures or externally supplied timestamp. These modifications are timestamped minutes and hours after they could have happened.
    There are a few ways this mismatched timeline can occur:
    • Option 1: Unauthenticated: As noted by IBM: "Authentication is the process of establishing the identity of a user or system and verifying that the identity is valid." Validity is a critical step in determining authenticity. With this picture, it appears that the XMP metadata was postdated prior to signing by Adobe. This option means that Adobe will happily sign anything and there is no validation or authenticity. (Even though "authenticity" is the "A" in C2PA.)
    • Option 2: Tampered: This option assumes that the file was altered after it was signed and the cryptographic signatures were replaced. In my previous blog entry, I demonstrated how easy it is to replace these C2PA signatures and how the X.509 certificates can have forged attribution.

      At Hintfo, I use GnuTLS's certtool to validate the certificates.

      • To view the certificate information, use: c2patool --certs file.jpg | certtool -i
      • To check the certificate information, use: c2patool --certs file.jpg | certtool --verify-profile=high --verify-chain
      • To verify the digital signatures, use: c2patool -d file.jpg

      Although the digital signatures in this car picture appear valid, certtool reports a warning for Adobe's certificate:

      Not verified. The certificate is NOT trusted. The certificate issuer is unknown.

      In contrast to Adobe, the certs from Microsoft, OpenAI, Stability AI, and Leica don't have this problem. Because the certificate is unauthenticated, only Adobe can confirm if the public cert is really theirs. I'm not Adobe; I cannot validate their certificate.

      I also can't validate the DigiCert certificate because Adobe's c2patool doesn't extract this cert for external validation. It is technically feasible for someone to replace both Adobe's and DigiCert's certificates with forgeries.
    Of these two options, I'm pretty certain it's the first one: C2PA doesn't authenticate and Adobe's software can be used to sign anything.

    With this car example, I don't think this user was intentionally trying to create a forgery. But an "unintentional undetected alteration" actually makes the situation worse! An intentional forgery could be trivially accepted as legitimate.

    It's relatively easy to detect when the clock appears to be running fast, postdating times, and listing events after they could have happened. However, if the clocks were slow and backdating timestamps, then it might go unnoticed. In effect, we know that we can't trust postdated timestamps. But even if it isn't postdated, we cannot trust that a timestamp wasn't backdated.

    Time After Time

    This red car picture is not a one-off special case. Here are other examples of mismatched timestamps that are signed by Adobe:

    [Image: a cheerleader picture signed by Adobe.]
    The timeline from this cheerleader picture shows that the EXIF and XMP were altered 48 seconds after it was cryptographically signed and notarized by DigiCert. Adobe's Content Credentials doesn't notice any problems.

    [Image: a photo of lights.]
    This photo of lights was notarized by DigiCert over a minute before the last alteration. Again, Adobe's Content Credentials doesn't notice any problems.

    [Image: another Adobe-signed picture.]
    This picture has XMP entries that postdate the DigiCert notarized signature by 3 hours. And again, Adobe's Content Credentials finds no problems.

    Unfortunately, I cannot include examples received at FotoForensics that show longer postdated intervals (some by days) because they are associated with personal information. These include fake identity cards, medical records, and legal documents. It appears that organized criminal groups are already taking advantage of this C2PA limitation by generating intentional forgeries with critical timestamp requirements.

    Timing is Everything

    Timestamps identify when files were created and updated. Inconsistent timestamps often indicate alterations or tampering. In previous blog entries, I demonstrated how metadata can be altered and signatures can be forged. In this blog entry, I've shown that we can't even trust the timestamps provided by C2PA steering committee members. Microsoft uses unsynchronized clocks, so we can't be sure when something was created, and Adobe will happily sign anything as if it were legitimate.

    In my previous conversations with C2PA management, we got into serious discussions about what data can and cannot be trusted. One of the C2PA leaders lamented that "you have to trust something." Even with a zero-trust model, you must trust your computer or the validation software. However, C2PA requires users to trust everything. There's a big difference between trusting something and trusting everything. For example:

    Trust area: Metadata
      C2PA requirement: C2PA trusts that the EXIF, IPTC, XMP, and other types of metadata accurately reflect the content.
      Forgeries: A forgery can easily supply false information without being detected. Adobe's products can be trivially convinced to authentically sign false metadata as if it were legitimate.
      Real world: We have seen Microsoft provide false timestamps and Adobe generate valid cryptographic signatures for altered metadata.

    Trust area: Prior claims
      C2PA requirement: C2PA trusts that each new signer verified the previous claims. However, C2PA does not require validation before signing.
      Forgeries: Forgeries can alter metadata and "authentically" sign false claims. The signatures will be valid under C2PA.
      Real world: The altered metadata examples in this blog entry show that Adobe will sign anything.

    Trust area: Signing certificates
      C2PA requirement: C2PA trusts that the cryptographic certificate (cert) was issued by an authoritative source. However, validation is not required.
      Forgeries: A forgery can create a cert with false attribution.
      Real world: In my previous blog entry, I quoted where the C2PA specification explicitly permits revoked and expired certificates. I also demonstrated how to backdate an expired certificate. As noted by certtool, Adobe's real certificates are not verifiable outside of Adobe.

    Trust area: Tools
      C2PA requirement: Evaluating C2PA metadata requires tools. We trust that the tools provided by C2PA work properly.
      Forgeries: The back-end C2PA library displays whatever information is in the C2PA metadata. Forged information in the C2PA metadata will be displayed as valid by c2patool and the Content Credentials web site.
      Real world: Both c2patool and Content Credentials omit provenance information that identifies the timestamp authority. Both systems also misassociate the third-party timestamp with the first-party data signature.

    Trust area: Timestamps
      C2PA requirement: C2PA treats timestamps like any other kind of metadata; it trusts that the information is valid.
      Forgeries: A forgery can easily alter timestamps.
      Real world: We have seen misleading timestamps due to clock drift and other factors.

    The entire architecture of C2PA is a house-of-cards based on 'trust'. It does nothing to prevent malicious actors from falsely attributing an author to some media, claiming ownership over someone else's media, or manufacturing fraudulent content for use as fake news, propaganda, or other nefarious purposes. At best, C2PA gives a false impression of authenticity that is based on the assumption that nobody has ill intent.

    Ironically, the only part of C2PA that seems trustworthy is the third-party timestamp authority's signed timestamp. (I trust that companies like DigiCert are notarizing the date correctly and I can test it by submitting my own signatures for signing.) Unfortunately, the C2PA specification says that using a timestamp authority is optional.
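    Anyone can run that kind of test with OpenSSL's RFC 3161 client; a sketch (the TSA URL is an assumption here, so substitute whichever authority you want to test):

    # Build a timestamp query over a file's hash, submit it, and print the reply.
    openssl ts -query -data myfile.bin -sha256 -cert -out request.tsq
    curl -s -H "Content-Type: application/timestamp-query" \
         --data-binary @request.tsq http://timestamp.digicert.com > response.tsr
    openssl ts -reply -in response.tsr -text    # shows the notarized time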

    Recently Google and Meta pledged support for the C2PA specification. Google even became a steering committee member. I've previously spoken to employees associated with both companies. I don't think this decision was because they believe in the technology. (Neither company has deployed C2PA's solution yet.) Rather, I suspect that it was strictly a management decision based on peer pressure. I don't expect their memberships to increase C2PA's reliability and I doubt they can improve the C2PA solution without a complete bottom-to-top rewrite. The only real benefit right now is that they increase the scope of the class action lawsuit when someone eventually gets burned by C2PA. Now that's great timing!

    Tags: #Authentication, #FotoForensics, #Network, #Forensics, #Programming


      Catching Flies with Honey

      pubsub.slavino.sk / hackerfactor · Sunday, 25 February - 18:15 · 13 minutes

    Recently, the buzz around security risks has focused on AI: AI telemarketing scams, deepfake real-time video impersonations, ChatGPT phishing scams, etc. However, traditional network attacks haven't suddenly vanished. My honeypot servers have been seeing an increase in scans and attacks, particularly from China.

    Homemade Solutions

    I've built most of my honeypot servers from scratch. While there are downloadable servers, most of the github repositories haven't been updated in years. Are they no longer maintained, or just continuing to work well? Since I don't know, I don't bother with them.

    What I usually do is start with an existing stable server and then modify it into a honeypot. For example, I run a Secure Shell server (sshd) that captures brute-force login attempts. Based on the collected data, I can evaluate information about the attackers.

    Securing Secure Shell

    Secure Shell (ssh) is a cornerstone technology used by almost every server administrator. Every modern operating system, including macOS, Linux, and BSD, includes an ssh client by default for accessing remote systems. On Windows, most technical people use PuTTY as an ssh client.

    Because it's ubiquitous, attackers often look for servers running the Secure Shell server (sshd). When they find it, they can be relentless in their brute-force hacking attempts. They will try every combination of username and password until they find a working account.

    If you have an internet-accessible sshd port (default: 22/tcp) and look at your sshd logs (the location is OS-specific; try /var/log/system.log or /var/log/auth.log), then you should see tons of brute-force login attempts. These will appear as login failures and might list the username that failed.
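    For example (the log path varies by distribution):

    grep "Failed password" /var/log/auth.log | tail -n 20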

    The old common wisdom was to move sshd to a non-standard port, like moving it from 22/tcp to 2222/tcp. The belief was that attackers only look for standard ports. However, the attackers and scanners have become smarter. They now scan for every open port. When they find one, they start to query the service. And when (not if) they find your non-standard port for sshd, they will immediately try brute forcing logins. Sure, they probably can't get in. But that doesn't stop them from trying continually for years.

    These days, I've found that combining sshd with a knock-knock daemon is an ideal solution (sudo apt install knockd). Knockd watches for someone to connect to a few ports (port knocking), even if nothing is running on those ports. If knockd sees the correct knocking pattern, then it opens up the desired port only for the client who knocked. For example, your /etc/knockd.conf might look like:
    [options]
    UseSyslog
    Interface = eth0

    [ssh]
    sequence = 1234,2345,3456
    seq_timeout = 5
    start_command = /sbin/iptables -A INPUT -s %IP% -p tcp --dport 22 -j ACCEPT
    cmd_timeout = 60
    stop_command = /sbin/iptables -D INPUT -s %IP% -p tcp --dport 22 -j ACCEPT
    tcpflags = syn

    This tells knockd to watch for someone trying to connect to ports 1234/tcp, 2345/tcp, and 3456/tcp in that specific order. The client has five seconds to complete the knocking pattern. If they do it, then port 22/tcp (sshd) will be opened up for the client. It will only be open for 60 seconds, so the client has one minute to connect. (Also be sure to configure ufw to deny access to 22/tcp from the general public!)

    After you connect, the knocking port is closed down. Your existing connection will continue to work, but any new logins will require you to repeat the knocking sequence.

    An attacker who is port scanning will never find the sshd port because it's hidden and they don't know the secret knock pattern.

    For myself, I created a small shell function that wraps ssh (a plain bash alias can't take arguments):

    kssh() { knock -d 500 "$1" 1234 2345 3456; ssh "$@"; }


    This wrapper does the port knocking with a half-second delay (500 ms) between each knock, and then runs the ssh client. The small delay is because sequential packets may take different network routes and arrive out of order; the delay helps them arrive in the correct order.

    Since I deployed port knocking on my production servers, I've had zero scanners and attackers find my sshd. I don't see any brute-force login attempts.

    The Inside View

    I'm always hesitant to explicitly say how I've built or secured my own servers. I don't want to give the attackers any detailed insight. But in this case, knowing how I've hardened my own systems doesn't help the attackers. If they scan my server and find no sshd port, it could mean:
    • I'm not running an external sshd server on this network address.
    • I'm running it, but on a non-standard port.
    • I'm running it, but it requires a special knocking sequence to unlock it, and they don't know the sequence. (With 65,536 possible ports, a three-port knock sequence allows 65,536³, over 2.8×10^14, possible combinations. And that's assuming that I'm using 3 ports; if I use 5 or more ports then it's practically impossible.)
    • Maybe I'm using both knockd and a non-standard port! Even if they find the knock sequence, they only have seconds to find the port. (I don't have to permit the port for 60 seconds; I could drop it down to 10 seconds to really narrow the window of opportunity.)
    • Assuming they can find the knock sequence and access the sshd port, then they still have to contend with trying to crack sshd, which is probably the most secure software on the internet. Will brute-force password guessing work? Or do I require a pre-shared key for login access? And every time they fail, they need to repeat the knock sequence, which adds in a lot of delay.
    On top of this, the act of scanning my servers for open ports is guaranteed to trigger a hostile scanner alert that will block them from accessing any services on my system.

    Not only do I feel safe telling people how I do this, I think everyone should do this!

    BYOHD (Build Your Own Honeypot Daemon)

    While it's usually desirable to hide sshd from attackers on production servers, a honeypot shouldn't try to hide. Turning a Secure Shell server into a honeypot server requires a little code change to sshd in order to enable more logging.

    Keep in mind, there are honeypot sshd daemons that you can download, but they are usually unmaintained. OpenSSH is battle tested, hardened, and maintained. Turning it into a honeypot means I don't need to worry about possible vulnerabilities in old source code.
    1. Since we're going to be logging every password attempt, we don't want to log your administrative login. You need to configure your sshd to permit logins using certificates based on pre-shared keys (PSK) and not passwords. This allows you (the administrator) to login without a password; you just need the PSK. There are plenty of online tutorials for generating the public/private key pair and configuring your sshd to support PSK-only logins. The main changes that you need in /etc/ssh/sshd_config are:

      ChallengeResponseAuthentication no
      PasswordAuthentication no
      UsePAM no
      PermitRootLogin no

      These changes ensure that you cannot login with a password; you must use the pre-shared keys.
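      For example, with standard OpenSSH tooling (the key path and host below are illustrative):

      ssh-keygen -t ed25519 -f ~/.ssh/honeypot_admin
      ssh-copy-id -i ~/.ssh/honeypot_admin.pub admin@honeypot.example.net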
    2. Honeypots generate lots of logs. I moved sshd's logs into a separate file. I redirected sshd logging by creating /etc/rsyslog.d/20-sshd.conf:
      template(name="sshdlog_list" type="list") {
      property(name="timereported" dateFormat="year")
      constant(value="-")
      property(name="timereported" dateFormat="month")
      constant(value="-")
      property(name="timereported" dateFormat="day")
      constant(value=" ")
      property(name="timereported" dateFormat="hour")
      constant(value=":")
      property(name="timereported" dateFormat="minute")
      constant(value=":")
      property(name="timereported" dateFormat="second")
      constant(value=" ")
      property(name="hostname")
      constant(value=" ")
      property(name="app-name")
      constant(value=":")
      property(name="msg" spifno1stsp="on" ) # add space if $msg doesn't start with one
      property(name="msg" droplastlf="on" ) # remove trailing \n from $msg if there is one
      constant(value="\n")
      }

      if $programname == 'sshd' then /var/log/sshd.log;sshdlog_list
      & stop
      Then I updated the log rotation by creating /etc/logrotate.d/sshd:
      /var/log/sshd.log
      {
      rotate 7
      weekly
      missingok
      notifempty
      compress
      delaycompress
      create 0644 syslog adm
      postrotate
      /usr/lib/rsyslog/rsyslog-rotate
      endscript
      }
      Finally, restart the logging: sudo service rsyslog restart.
    3. The easiest way to turn a supported server into a honeypot is to modify the source code. In this case, I download the source for OpenSSH and patch it to log every login attempt. Since I deploy this often, I ended up writing a script to automate this part:
      #!/bin/bash
      # For the honeypot: Create an openssh that logs passwords
      mkdir tmp
      cd tmp
      apt-get source openssh
      cd openssh-*
      patch << EOF
      --- auth-passwd.c 2020-02-13 17:40:54.000000000 -0700
      +++ auth-passwd.c 2023-02-25 10:31:53.946913899 -0700
      @@ -84,14 +84,20 @@
      #endif

      if (strlen(password) > MAX_PASSWORD_LEN)
      + {
      + logit("Failed login by host '%s' port '%d' username '%.100s', password '%.100s' (truncated)", ssh_remote_ipaddr(ssh), ssh_remote_port(ssh), authctxt->user, password);
      return 0;
      + }

      #ifndef HAVE_CYGWIN
      if (pw->pw_uid == 0 && options.permit_root_login != PERMIT_YES)
      ok = 0;
      #endif
      if (*password == '\0' && options.permit_empty_passwd == 0)
      + {
      + logit("Failed login by host '%s' port '%d' username '%.100s', password '' (empty)", ssh_remote_ipaddr(ssh), ssh_remote_port(ssh), authctxt->user);
      return 0;
      + }

      #ifdef KRB5
      if (options.kerberos_authentication == 1) {
      @@ -113,7 +119,12 @@
      #endif
      #ifdef USE_PAM
      if (options.use_pam)
      - return (sshpam_auth_passwd(authctxt, password) && ok);
      + {
      + /* Only log failed passwords */
      + result = sshpam_auth_passwd(authctxt, password);
      + if (!result) { logit("Failed login by host '%s' port '%d' username '%.100s', password '%.100s'", ssh_remote_ipaddr(ssh), ssh_remote_port(ssh), authctxt->user, password); }
      + return (result && ok);
      + }
      #endif
      #if defined(USE_SHADOW) && defined(HAS_SHADOW_EXPIRE)
      if (!expire_checked) {
      @@ -123,6 +134,8 @@
      }
      #endif
      result = sys_auth_passwd(ssh, password);
      + /* Only log failed passwords */
      + if (!result) { logit("Failed login by host '%s' port '%d' username '%.100s', password '%.100s'", ssh_remote_ipaddr(ssh), ssh_remote_port(ssh), authctxt->user, password); }
      if (authctxt->force_pwchange)
      auth_restrict_session(ssh);
      return (result && ok);
      @@ -199,7 +212,10 @@
      char *pw_password = authctxt->valid ? shadow_pw(pw) : pw->pw_passwd;

      if (pw_password == NULL)
      + {
      + logit("Failed login by host '%s' port '%d' username '%.100s', password '' (empty)", ssh_remote_ipaddr(ssh), ssh_remote_port(ssh), authctxt->user);
      return 0;
      + }

      /* Check for users with no password. */
      if (strcmp(pw_password, "") == 0 && strcmp(password, "") == 0)
      @@ -217,7 +233,9 @@
      * Authentication is accepted if the encrypted passwords
      * are identical.
      */
      - return encrypted_password != NULL &&
      - strcmp(encrypted_password, pw_password) == 0;
      + int result=0;
      + if (encrypted_password != NULL) { result = strcmp(encrypted_password, pw_password); }
      + if (!result) { logit("Failed login by host '%s' port '%d' username '%.100s', password '%.100s'", ssh_remote_ipaddr(ssh), ssh_remote_port(ssh), authctxt->user, password); }
      + return ((encrypted_password != NULL) && (result == 0));
      }
      #endif
      EOF
      autoreconf && ./configure --with-pam --with-systemd --sysconfdir=/etc/ssh && make clean && make -j 3
      These patches are inserted everywhere a password is checked. They log the host, port, username, and attempted password.
    4. Finally, tell the server to run this sshd instead of the system one. (sudo install sshd /usr/bin/sshd ; sudo service sshd restart)
    If your public servers are like mine, you'll start seeing entries in /var/log/sshd.log very quickly (under a minute). They might look like:
    2024-02-24 13:48:59 sshd: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=218.92.0.22  user=root
    2024-02-24 13:49:02 sshd: Failed login by host '218.92.0.22' port '58463' username 'root', password 'toor123'
    2024-02-24 13:49:02 sshd: Failed password for root from 218.92.0.22 port 58463 ssh2
    2024-02-24 13:49:03 sshd: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=43.153.207.98 user=root
    2024-02-24 13:49:05 sshd: Failed login by host '43.153.207.98' port '55874' username 'root', password 'qweASDqwe'
    2024-02-24 13:49:05 sshd: Failed password for root from 43.153.207.98 port 55874 ssh2
    2024-02-24 13:49:05 sshd: Failed login by host '218.92.0.22' port '58463' username 'root', password 'asdasd123'
    2024-02-24 13:49:05 sshd: Failed password for root from 218.92.0.22 port 58463 ssh2
    2024-02-24 13:49:06 sshd: Received disconnect from 43.153.207.98 port 55874:11: Bye Bye [preauth]
    2024-02-24 13:49:06 sshd: Disconnected from authenticating user root 43.153.207.98 port 55874 [preauth]
    2024-02-24 13:49:09 sshd: Failed login by host '218.92.0.22' port '58463' username 'root', password '456852'
    2024-02-24 13:49:09 sshd: Failed password for root from 218.92.0.22 port 58463 ssh2
    2024-02-24 13:49:10 sshd: Received disconnect from 218.92.0.22 port 58463:11: [preauth]
    2024-02-24 13:49:10 sshd: Disconnected from authenticating user root 218.92.0.22 port 58463 [preauth]
    2024-02-24 13:49:10 sshd: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=218.92.0.22 user=root
    2024-02-24 13:49:13 sshd: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=43.134.111.125 user=root
    2024-02-24 13:49:15 sshd: Failed login by host '43.134.111.125' port '40806' username 'root', password 'P@ssw0rdd'
    2024-02-24 13:49:15 sshd: Failed password for root from 43.134.111.125 port 40806 ssh2
    2024-02-24 13:49:16 sshd: Received disconnect from 43.134.111.125 port 40806:11: Bye Bye [preauth]
    2024-02-24 13:49:16 sshd: Disconnected from authenticating user root 43.134.111.125 port 40806 [preauth]
    Now I have detailed logs about every brute-force login attempt.

    Gathering Statistics

    All of my honeypot tracking logs contain the string "Failed login by host". I can filter those lines to detect brute-force login attacks. From the sample log above:

    2024-02-24 13:49:02 sshd: Failed login by host '218.92.0.22' port '58463' username 'root', password 'toor123'
    2024-02-24 13:49:05 sshd: Failed login by host '43.153.207.98' port '55874' username 'root', password 'qweASDqwe'
    2024-02-24 13:49:05 sshd: Failed login by host '218.92.0.22' port '58463' username 'root', password 'asdasd123'
    2024-02-24 13:49:09 sshd: Failed login by host '218.92.0.22' port '58463' username 'root', password '456852'
    2024-02-24 13:49:15 sshd: Failed login by host '43.134.111.125' port '40806' username 'root', password 'P@ssw0rdd'
    (Yes, that's five failed logins in under a minute from three different IP addresses! And that's typical.)

    After a few days, you can start creating histograms related to who attacks the most (IP address), what accounts are attacked the most (username), and what passwords are tried the most. For the last 7 days, my own honeypot has seen 4,934 unique brute force usernames and 19,453 unique brute force passwords from 2,308 unique IP addresses. The vast majority of attacks (56%) are from China, with Singapore coming in at a distant second with 7%, and the United States rounding out third at 5%.
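    The histograms are simple one-liners over the log; a sketch, assuming the exact "Failed login by host" format shown above:

    grep "Failed login by host" /var/log/sshd.log |
      sed -E "s/.*host '([^']*)'.*/\1/" | sort | uniq -c | sort -rn | head     # top source IPs
    grep "Failed login by host" /var/log/sshd.log |
      sed -E "s/.*username '([^']*)'.*/\1/" | sort | uniq -c | sort -rn | head # top usernames
    grep "Failed login by host" /var/log/sshd.log |
      sed -E "s/.*password '([^']*)'.*/\1/" | sort | uniq -c | sort -rn | head # top passwords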

    The top 10 usernames account for 76% of all brute-force login attempts:
    # Sightings % Username
    1 42,732 69.55% root
    2 1,011 1.65% ubuntu
    3 918 1.49% admin
    4 622 1.01% user
    5 593 0.97% test
    6 339 0.55% oracle
    7 304 0.49% ftpuser
    8 304 0.49% postgres
    9 175 0.28% test1
    10 156 0.25% git

    In contrast, the top 10 passwords only account for about 10% of all password guesses:
    # Sightings % Password
    1 2,869 4.67% 123456
    2 851 1.39% 123
    3 321 0.52% 1234
    4 308 0.50% 1
    5 308 0.50% password
    6 293 0.48% 12345
    7 278 0.45% test
    8 274 0.45% admin
    9 268 0.44% root
    10 243 0.40% 111111
    (Don't use user "root" with password "123456" unless you want to be compromised in under an hour.)

    I ran similar login metrics last year. The usernames list is almost the same; only 'debian' dropped out while 'test1' came in. Similarly, password '12345678' swapped positions with '111111' (the previous #11). By volume, the number of attacks has nearly tripled since last year.

    It's not just my sshd honeypot that has seen this increase in volume. All of my honeypot servers have seen similar increases in volume and mostly from China. A few days ago, the FBI Director warned of an ‘Unprecedented Increase’ in Chinese cyberattacks on US infrastructure . This definitely matches my own observations. They're not just attacking banks and power grids and telecommunications; they are attacking everyone. Even if your server isn't "critical infrastructure" or contains sensitive customer information, it can still be compromised and used to attack other systems. With the upcoming U.S. election and extreme unrest in Europe and the Middle East, it's time to batten down the cyber hatches. If you're not tracking attacks against your own servers and taking steps to mitigate attacks, then it's time to start. (Not sure where to begin? Make a beeline to my series on No-NOC Networking : simple steps to stop attacks before they happen.)

    Tags: #Security, #Honeypot, #Programming, #Network

      The Linux Kernel Prepares For Rust 1.77 Upgrade

      pubsub.blastersklan.com / slashdot · Monday, 19 February - 12:57

    An anonymous reader shared this post from Phoronix: With Linux 6.8 the kernel's Rust code was brought up to Rust 1.75 while new patches posted this weekend port the code over to Rust 1.76 and then the upcoming Rust 1.77... With Rust 1.77 they have now stabilized the single-field "offset_of" feature used by the kernel's Rust code. Rust 1.77 also adds a "--check-cfg" option that the Rust kernel code will likely transition to in the future. This follows the Rust for Linux policy of tracking the upstream Rust version upgrades until there is a minimum version that can be declared where all used features are considered stable.

      How Rust Improves the Security of Its Ecosystem

      pubsub.blastersklan.com / slashdot · Monday, 19 February - 02:27 · 1 minute

    This week the non-profit Rust Foundation announced the release of a report on what their Security Initiative accomplished in the last six months of 2023. "There is already so much to show for this initiative," says the foundation's executive director, "from several new open source security projects to several completed and publicly available security threat models." From the executive summary: When the user base of any programming language grows, it becomes more attractive to malicious actors. As any programming language ecosystem expands with more libraries, packages, and frameworks, the surface area for attacks increases. Rust is no different. As the steward of the Rust programming language, the Rust Foundation has a responsibility to provide a range of resources to the growing Rust community. This responsibility means we must work with the Rust Project to help empower contributors to participate in a secure and scalable manner, eliminate security burdens for Rust maintainers, and educate the public about security within the Rust ecosystem... Recent Achievements of the Security Initiative Include: - Completing and releasing Rust Infrastructure and Crates Ecosystem threat models - Further developing Rust Foundation open source security project Painter [for building a graph database of dependencies/invocations between crates] and releasing new security project, Typomania [a toolbox to check for typosquatting in package registries]. - Utilizing new tools and best practices to identify and address malicious crates. - Helping reduce technical debt within the Rust Project, producing/contributing to security-focused documentation, and elevating security priorities for discussion within the Rust Project. ... and more! Over the Coming Months, Security Initiative Engineers Will Primarily Focus On: - Completing all four Rust security threat models and taking action to address encompassed threats - Standing up additional infrastructure to support redundancy, backups, and mirroring of critical Rust assets - Collaborating with the Rust Project on the design and potential implementation of signing and PKI solutions for crates.io to achieve security parity with other popular ecosystems - Continuing to create and further develop tools to support Rust ecosystem, including the crates.io admin functionality, Painter, Typomania, and Sandpit


      The Jitter Bug

      pubsub.slavino.sk / hackerfactor · Thursday, 15 February - 20:48 · 11 minutes

    I recently attended a presentation about an online "how to program" system. Due to the Chatham House Rule, I'm not going to name the organization, speaker, or programming system. What I will say: as an old programmer, I often forget how entertaining it can be to watch a new programmer try to debug code during a live demonstration. (My Gawd, the presenter needs to go into comedy. The colorful phrases -- without swearing -- were priceless.) I totally understand the frustration. And while I did see many of the bugs (often before the presenter hit 'Enter'), the purpose was to watch how this new system helps you learn how to solve problems.

    At the end of the 45-minute presentation, it was revealed that this was the culmination of over two months of learning effort. But honestly, having seen the workflow and thought process, I think the speaker is on track to becoming an excellent software guru. At this point, the methodology is known and it just takes experience to improve.

    It's a feature! Ship it!

    As someone who works with computers every day, I know that tracking down bugs can be really hard. In my opinion, there are four basic difficulty levels when debugging any system:
    • Level 1: Easy. Sometimes you get lucky. Maybe the system generates an informative error message. Programs sometimes alert you to a bad configuration file, missing parameters, or incorrect usage. Compilers often identify the line number with an issue. (And sometimes it's the correct line number!) Other times there might be helpful log messages that tell you about the problem.
    • Level 2: Medium. More often, the error messages and logs provide hints and clues. It's up to you to figure out where the error is coming from, what is causing the error, and how to fix it. Because of the familiarity, problems in your own code are usually easier to debug compared to problems in someone else's code. In the worst case, you might end up consulting online manuals (man pages), documentation, or even diving into source code. Blind debugging, when you have no code or documents, is much more difficult.
    • Level 3: Frustrating. The hardest problems to resolve are when bugs appear inconsistently. Sometimes it fails and sometimes it works. These are much more difficult to track down. I hate those bugs that appear to vanish when you put in debugging code, but that resurface the instant the debugging code is disabled. Or that work fine under a debugger, like gdb or valgrind, but consistently fail without the debugger. (Those are almost always due to dynamic library issues or memory leaks, and the failure often surfaces long after the problem started.)
    • Level 4: Soul-Crushing. The worst-case scenarios are the ones that appear to happen randomly and leave no logs about the cause. Any initial debugging is really just a blind guess in the dark.
    I've been battling with one of those worst-case scenarios for nearly 2 years -- and I finally got it solved. (I think?)

    Reboot is Needed

    I have a handful of servers in a rack. Each piece of hardware has plenty of RAM, CPUs, and disk space. But rather than running each as one big computer with tons of CPU power and memory, I've subdivided the resources into a handful of virtual machines. I may allocate 2 CPUs and 1 GB of RAM to my mail server, and 6 CPUs with more RAM to FotoForensics. The specific resource allocation is configurable based on each VM's requirements.

    For my servers, the hypervisor (parent of the virtual machines, dom0) uses Xen. Xen is a very common virtualization environment. Each piece of hardware has a dom0 and runs a group of virtual machines (VMs, or generically called domu).


    The problem I was having: occasionally a CPU on one VM would hang. The problem seemed to jump around between VMs and didn't appear regularly. The error in the VM's kernel.log looked like:

    kernel: [2333839.516291] RIP: e030:zap_pte_range.isra.0+0x168/0x860
    kernel: [2333839.516298] Code: 00 10 00 00 e8 c9 f6 ff ff 49 89 c0 48 85 c0 74 0b 48 83 7d 88 00 0f 85 0a 06 00 00 41 f6 47 20 01 0f 84 d7 02 00 00 4c 8b 23 <48> c7 03 00 00 00 00 4d 39 6f 10 4c 89 e8 49 0f 46 47 10 4d 39 77
    kernel: [2333839.516316] RSP: e02b:ffffc9004114ba60 EFLAGS: 00010202
    kernel: [2333839.516322] RAX: ffffea000192fe40 RBX: ffff88807854e760 RCX: 0000000000000125
    kernel: [2333839.516330] RDX: 0000000000000000 RSI: 00007f2410aec000 RDI: 00000000135f9125
    kernel: [2333839.516339] RBP: ffffc9004114bb10 R08: ffffea000192fe40 R09: 0000000000000000
    kernel: [2333839.516347] R10: 0000000000000001 R11: 000000000000073f R12: 00000000135f9125
    kernel: [2333839.516355] R13: 00007f2410aec000 R14: 00007f2410aed000 R15: ffffc9004114bc48
    kernel: [2333839.516371] FS: 00007f2410bf1580(0000) GS:ffff88807d500000(0000) knlGS:0000000000000000
    kernel: [2333839.516380] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
    kernel: [2333839.516387] CR2: 00007f2410ae1151 CR3: 0000000002a0a000 CR4: 0000000000040660
    kernel: [2333839.516398] Fixing recursive fault but reboot is needed!

    In this case, the log shows that the CPU had a failure. However, it doesn't identify what caused the failure. Was it a hardware problem? Some bad software? Or something else? There's also no information about how to fix it other than "reboot is needed!"

    When this error happens, the bad CPU would be pinned at 100% usage and not doing anything. If the VM had 2 CPUs, then it would limp along with one good CPU until the VM was rebooted. However, sometimes one CPU would die and then the other would die (sometimes minutes apart, sometimes days). At that point, the entire VM would be dead and require a reboot. The last failure never makes it to the logs.

    I started calling this problem the 'jitter bug' because it happened irregularly and infrequently. There was some unidentified event happening that was capable of hanging a CPU at random.

    Debugging a Bad Bug

    The jitter bug was limited to the VMs. I never saw it on dom0, dom0 never experienced this kind of crash, and dom0 had no logs related to any of these domu CPU failures. When a VM failed, I could use dom0 to destroy the instance and recreate the VM. The new VM would start up without a problem. Whatever was locking the CPU was limited in scope to the VM.

    I searched Google for "Fixing recursive fault but reboot is needed". It currently has over 3,000 results, so I'm definitely not the only person seeing this problem. For me, the problem was irregular but it would happen at least every few weeks on at least one randomly chosen VM across all of my hardware servers. Other people reported the problem happening daily, or every few days. I also noticed a commonality: in almost every reported instance, they were using a virtualized server.

    This is when I went through the long debugging process, trying to catch a server crash that happens intermittently, and often weeks apart. I ended up writing a ton of monitoring tools that watch everything from packets to processes. All of this effort was really to debug the cause of this CPU hang. What I found:
    • Hardware failure. I ruled this out. I had four different hardware servers acting the exact same way. The chances of having the exact same hardware failure appear on four different servers (different ages, different models) was extremely unlikely.
    • My custom settings. Different VMs running different software and with different configurations were experiencing the same problem. Also, other people in various forums were reporting the same problem, and they were not using my software or settings. I could rule out my own software and customizations.
    • Packet of Doom™. I was concerned that someone might have found a new "ping of death" or some other kind of killer packet. I configured a few boxes that would capture every packet sent to and from the VMs. (I rotated the logs hourly.) I did catch every packet around two different crashes. Nothing unusual, so I ruled out a networking issue.
    • Kernel patch. A few forums suggested applying a kernel patch or upgrading Xen. I tried that on a test system, but it had no impact. The problem still happened.
    • Operating system. The domu virtual machines don't need to run the same operating system as the parent dom0. On my test system, I installed a different OS. It took a month, but it crashed the same way. This means that the problem is independent of the guest VM operating system.
    • Blocking issue. On one of the Xen forums, a person from Amazon suggested that it might be a block device deadlock situation. The workaround is to disable underlying block device merges (see the sketch after this list). They didn't say where to apply this, so I put it in both dom0 and every domu. While this is probably overkill, it did result in a change!

      1. The CPU failures happened less often. (Almost every 3 weeks instead of roughly every 2 weeks.)
      2. When they happened, they usually didn't hang the CPU or require a reboot. The system usually recovered. In kernel.log, I'd see a similar CPU failure trace, but it rarely had the 'reboot needed' message and the CPU wasn't hung. (Having a CPU report an error is really bad, but it's a huge improvement over a hung VM.)

      Unfortunately, I was still seeing an occasional hang with a reboot requirement.
    With the block device workaround, I didn't notice any performance problem and the hangs happened much less often.
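    The exact setting isn't named above; a plausible mechanism is the kernel's per-device nomerges flag, which would look something like this (the device name varies, and the value does not persist across reboots):

    # 2 = disable all request merging on this block device (0 re-enables it).
    echo 2 | sudo tee /sys/block/xvda/queue/nomerges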

    Backtrace

    Each of these different tests took weeks to perform. This is why it's taken me years to find a solution. Thinking back on it, I have been battling with this problem for almost as long as I've had the servers in the new rack.

    Wait... the new rack? The new location?

    When I moved all of my servers out of my former hosting location (they went out of business), I reinstalled the OS on each server. The previous OS was old and losing vendor support, so I needed to upgrade. Upgrading during the move seemed like a good idea at the time. Looking over every other reported instance of this error, I noticed that each sighting was related to a newer operating system. This looked like some kind of incompatibility between Xen and the underlying OS -- either Ubuntu or Debian.

    I ran a test and installed the really old OS on my spare server: Ubuntu 16.04 LTS, from 2016. Yup, no instance of the problem, even though the same hardware running Ubuntu 20.04 LTS had the bug. (It took me two months to confirm this since the problem is irregular.) Unfortunately, rolling back the OS on my production servers is a no-go. I needed a fix for a supported OS.

    Bingo! (Maybe?)

    This got me thinking. The problem never appeared on dom0. But what if dom0 was the cause? And what if it was caused by something found in the newer OS versions that didn't previously exist?

    Buried in the logs of dom0 was an update process that ran every few hours. It's called fwupd, the firmware update daemon. According to its GitHub repository, "This project aims to make updating firmware on Linux automatic, safe, and reliable." Ubuntu appears to have incorporated it into 18.04 LTS (circa 2018), which is the same time people began reporting this CPU hang problem. Every version of Ubuntu since then has included this process.

    To see if your system is using it, try this command: sudo systemctl list-timers | grep fwupd
    You should see a line that says when it will run next and when it last ran:

    $ sudo systemctl list-timers | grep fwupd
    Thu 2024-02-15 09:43:37 MST 18min left Thu 2024-02-15 05:37:05 MST 3h 47min ago fwupd-refresh.timer fwupd-refresh.service


    On my system, /usr/lib/systemd/system/fwupd-refresh.timer says to run the process twice a day, with a random delay of up to 12 hours. This explains why the crashes happened at random times:

    Description=Refresh fwupd metadata regularly
    ConditionVirtualization=!container

    [Timer]
    OnCalendar=*-*-* 6,18:00
    RandomizedDelaySec=12h
    Persistent=true

    [Install]
    WantedBy=timers.target


    When fwupd runs, it queries the existing firmware and then checks whether it needs to apply any updates. The act of querying the firmware from Xen's dom0 can hang VMs. As a test, I repeatedly called "fwupdmgr get-devices" and eventually forced a CPU hang on domu. The hang isn't always immediate; I've clocked it happening as much as 10 minutes after the process ran! This delayed failure is why I wasn't able to associate the hang with any specific application; the crash wasn't immediate. It also appears to be a race condition: on my servers, it's about a 1 in 50 chance of a hang, which explains why I usually saw any given CPU hang at least monthly. I'm sure the odds of a hang vary based on your hardware, which would explain why some people see this same problem more often.
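    A hypothetical way to reproduce that stress test (disposable test hardware only; the iteration count and delay are arbitrary):

    # Hammer the firmware query that appears to trigger the hang.
    for i in $(seq 1 200); do fwupdmgr get-devices >/dev/null 2>&1; sleep 5; done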

    I disabled this daemon last month. (It's really unnecessary.)

    sudo systemctl stop fwupd fwupd-refresh fwupd-refresh.timer
    sudo systemctl disable fwupd fwupd-refresh fwupd-refresh.timer
    sudo systemctl mask fwupd fwupd-refresh fwupd-refresh.timer


    These three commands are basically (1) stop running, (2) never run, and (3) don't allow any other process to make it run.

    Poof! It's been a month and a half since I last saw any CPU failures on any of my servers. While this isn't proof of a fix, it does give me a high sense of confidence. Rather than doing a monthly reboot "just in case" it fixes the problem, I'm going to try to go back to rebooting only when the kernel is upgraded due to a security patch. (I like having stable systems with uptimes that are measured in months or years.)

    Debugging computer problems can vary from simple typos to complex interactions. In this case, I think it's the combination of Xen, fwupd, and the hardware that causes a random timing error, race condition, and a hardware hang. I wish I had some colorful description for this problem that didn't involve swearing.

    Tags: #Network, #Security, #Programming