<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>incident on Digital Garden</title>
    <link>/tags/incident/</link>
    <description>Recent content in incident on Digital Garden</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <copyright>© Axel Neergaard</copyright>
    <lastBuildDate>Fri, 22 Oct 2021 20:50:00 +0000</lastBuildDate><atom:link href="/tags/incident/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Troubleshooting NixOS and ZFS</title>
      <link>/posts/nix-zfs-space-issue/</link>
      <pubDate>Fri, 22 Oct 2021 20:50:00 +0000</pubDate>
      
      <guid>/posts/nix-zfs-space-issue/</guid>
      <description>Incident on 21/10/2021. NixOS ZFS partition on laptop reported 0 available bytes on all pools.
Teardown:   Tried to set up Android Studio with emulator capabilities on my NixOS machine
 Installed a bunch of packages. Nix, the package manager, eats a lot of storage if old generations are not garbage-collected. Installed and built so much on my machine that suddenly nothing worked anymore.    du reported 100% full on everything.</description>
      <content>&lt;p&gt;Incident on 21/10/2021.
NixOS ZFS partition on laptop reported 0 available bytes on all pools.&lt;/p&gt;
&lt;h2 id=&#34;teardown&#34;&gt;Teardown:&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Tried to set up Android Studio with emulator capabilities on my NixOS machine.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Installed a bunch of packages.&lt;/li&gt;
&lt;li&gt;Nix, the package manager, eats a lot of storage if old generations are not garbage-collected.&lt;/li&gt;
&lt;li&gt;Installed and built so much on my machine that suddenly nothing worked anymore.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;du&lt;/code&gt; reported 100% full on everything.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Things got weird enough that my command line prompt started printing random stuff, since it couldn&amp;rsquo;t create a &lt;code&gt;tmp&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;I could not collect garbage with &lt;code&gt;nix-collect-garbage&lt;/code&gt; since it couldn&amp;rsquo;t create a lock file.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Rebooted the machine, wrong call.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Most services enabled at boot reported failure to start.&lt;/li&gt;
&lt;li&gt;Got to login screen.
&lt;ul&gt;
&lt;li&gt;Non-existent users were denied entry.&lt;/li&gt;
&lt;li&gt;Existing users (&lt;code&gt;root&lt;/code&gt; and my &lt;code&gt;cookie&lt;/code&gt;) passed the login, but with no space available they were thrown back to the login screen.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Troubleshooting with built-in tools.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Laptop firmware settings unable to help.&lt;/li&gt;
&lt;li&gt;Lenovo has SMART tools available.
&lt;ul&gt;
&lt;li&gt;Did not report hardware error on disk.&lt;/li&gt;
&lt;li&gt;All &lt;a href=&#34;https://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes&#34;&gt;SMART attributes&lt;/a&gt; seemed in order.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Nikola flashed a USB with an ISO of NixOS.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Able to access a command line with my system.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;smartctl -a/x&lt;/code&gt; confirmed no hardware errors present.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Also tried&lt;/em&gt; &lt;a href=&#34;https://www.finnix.org/&#34;&gt;Finnix&lt;/a&gt;, &lt;em&gt;but it did not help because its built-in ZFS version was too old (v23 vs v28 on Nix).&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;Thought NixOS wouldn&amp;rsquo;t have tools for ZFS, but both &lt;code&gt;zpool&lt;/code&gt; and &lt;code&gt;zfs&lt;/code&gt; are available.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Tried to get access to my ZFS pool (&lt;code&gt;rpool&lt;/code&gt;).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Imported all ZFS pools available.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Loaded encryption keys for pool.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
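&lt;p&gt;Mounted the dataset with &lt;code&gt;mount&lt;/code&gt;, since the datasets use &lt;code&gt;mountpoint=legacy&lt;/code&gt;.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;zpool import -a                   # import every pool found on the attached disks
zfs list                          # show datasets and their mountpoints
zfs load-key -a                   # load the keys for all encrypted datasets
mount -t zfs rpool/home /mnt/rec  # legacy mountpoints are mounted with plain mount
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;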
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Attempt to back up &lt;code&gt;/home&lt;/code&gt; to an external drive.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;70GB of data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;External drive was connected over USB, so extremely slow.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Learnt that &lt;code&gt;cp&lt;/code&gt; has no built-in progress reporting.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You can, however, get the &lt;code&gt;cp&lt;/code&gt; process ID and then watch its open files and their current offsets under &lt;code&gt;/proc/&amp;lt;id&amp;gt;&lt;/code&gt;.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;pgrep -x cp                       # -x matches the name exactly; prints only the ID
ls -l /proc/&amp;lt;id&amp;gt;/fd             # lists open file descriptors and the files they point to
cat /proc/&amp;lt;id&amp;gt;/fdinfo/&amp;lt;fd&amp;gt;  # pos: is the current offset into that file
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Because of the enormous write size (and my impatience), I did a lazy &lt;code&gt;umount&lt;/code&gt;, which corrupted the external (NTFS) drive&amp;rsquo;s MFT.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tried to salvage with &lt;code&gt;ntfsfix&lt;/code&gt; on NixOS command line, no success.&lt;/li&gt;
&lt;li&gt;Final option is to use &lt;code&gt;chkdsk&lt;/code&gt; on a Windows machine, but that can take around &lt;a href=&#34;https://qa.social.microsoft.com/Forums/windows/en-US/f363af20-da4d-4b8a-b0ae-df08a52c5d5a/disk-error-and-chkdsk-execution-time?forum=whshardware&#34;&gt;8 hours per TB&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
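&lt;p&gt;The &lt;code&gt;/proc&lt;/code&gt; trick for watching &lt;code&gt;cp&lt;/code&gt; can be wrapped in a small helper. A minimal sketch, assuming Linux &lt;code&gt;/proc&lt;/code&gt; and GNU &lt;code&gt;stat&lt;/code&gt;; &lt;code&gt;fd_progress&lt;/code&gt; is a made-up name, not an existing tool:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# Report how far a process has got into one of its open files.
# usage: fd_progress PID FD   (hypothetical helper)
fd_progress() {
    pid=$1
    fd=$2
    # fdinfo exposes the current file offset on the "pos:" line
    pos=$(awk '/^pos:/ {print $2}' "/proc/$pid/fdinfo/$fd")
    # -L follows the /proc symlink to the real file to get its size
    size=$(stat -L -c %s "/proc/$pid/fd/$fd")
    echo "$pos of $size bytes"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Something like &lt;code&gt;fd_progress "$(pgrep -x cp)" 3&lt;/code&gt; would then show progress on the file currently being read. The descriptor number 3 is an assumption; check &lt;code&gt;ls -l /proc/&amp;lt;id&amp;gt;/fd&lt;/code&gt; first.&lt;/p&gt;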
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Attempt to garbage collect &lt;code&gt;rpool/root/nixos&lt;/code&gt; manually.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;nix-env&lt;/code&gt; can be pointed at a specific profile with &lt;code&gt;-p&lt;/code&gt;, here the system profile on the mounted disk.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Idea was to delete old NixOS generations to free up space.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;nix-env -p /mnt/rec/nix/var/nix/profiles/system
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Failed again due to no disk space.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Started reading up on ZFS snapshots.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A ZFS snapshot takes no storage space until the data it references is changed or deleted on the live dataset.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;My machine takes automatic ZFS snapshots on a monthly, weekly, daily, hourly, and &amp;ldquo;frequent&amp;rdquo; cadence.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;THEORY&lt;/em&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Because the automatic background snapshots still held on to the data, &lt;code&gt;zfs list&lt;/code&gt; kept reporting the referenced storage at full size.
&lt;ul&gt;
&lt;li&gt;Referenced storage &lt;a href=&#34;https://docs.oracle.com/cd/E27998_01/html/E48433/shares__space_management.html#shares__space_management__referenced_data&#34;&gt;&amp;ldquo;represents the total amount of space referenced by the active share, independent of snapshots &amp;hellip; the share would consume should all snapshots be destroyed&amp;rdquo;&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Thus deleting files did not increase the reported available space.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Deleted some of the newer snapshots to free 14.1 GB.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Deleting old generations now worked.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;nix-env -p /mnt/rec/nix/var/nix/profiles/system --delete-generations 30d
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;/ul&gt;
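&lt;p&gt;To decide which snapshots to delete, it helps to see which ones hold the most space. A minimal sketch; &lt;code&gt;largest_snapshots&lt;/code&gt; is a made-up helper that only sorts the output of &lt;code&gt;zfs list&lt;/code&gt;. Note that a snapshot&amp;rsquo;s &lt;code&gt;used&lt;/code&gt; value counts only the space unique to that snapshot, so the space actually freed can differ.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# Print the snapshots holding the most space, largest first.
# Reads tab-separated "name used-bytes" lines on stdin, as produced by:
#   zfs list -Hp -t snapshot -o name,used
# usage: largest_snapshots [COUNT]   (hypothetical helper, default 5)
largest_snapshots() {
    # sort numerically on the second tab-separated column, descending
    sort -t "$(printf '\t')" -k 2,2nr | head -n "${1:-5}"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running &lt;code&gt;zfs list -Hp -t snapshot -o name,used | largest_snapshots 3&lt;/code&gt; lists three candidates, each of which can then be removed with &lt;code&gt;zfs destroy&lt;/code&gt;.&lt;/p&gt;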
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Finally able to log in to the system and GC.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Followed &lt;a href=&#34;https://nixos.wiki/wiki/Storage_optimization&#34;&gt;NixOS documentation&lt;/a&gt; to GC properly.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Next steps are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;to implement the documentation steps for automated GC,&lt;/li&gt;
&lt;li&gt;set up proper backups to my server for critical files,&lt;/li&gt;
&lt;li&gt;invest in a proper hard drive that snapshots and larger files can be exported to.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
</content>
    </item>
    
  </channel>
</rss>
