4a4725386d
Add documentation about the new mnemofs design and the NAND flash structure. Signed-off-by: Saurav <resyfer.dev@gmail.com>
264 lines
11 KiB
ReStructuredText
264 lines
11 KiB
ReStructuredText
=======
|
|
MNEMOFS
|
|
=======
|
|
|
|
Mnemofs is a NAND Flash File System built for NuttX.
|
|
|
|
Usage
|
|
=====
|
|
|
|
If there's a NAND flash available at a location, for example, ``/dev/nand``,
|
|
you can mount it with ``mnemofs`` to a location like ``/mydir`` using::
|
|
|
|
mount -t mnemofs /dev/nand /mydir
|
|
|
|
The above command will only work if the device was already formatted using
|
|
mnemofs. For a brand new device, or if you want to switch from an existing
|
|
file system, this won't work, and would need a format::
|
|
|
|
mount -t mnemofs -o forceformat /dev/nand /mydir
|
|
|
|
Unsure of whether you need to do a format? This will help::
|
|
|
|
mount -t mnemofs -o autoformat /dev/nand /mydir
|
|
|
|
This will format the device only if it can not detect mnemofs being already
|
|
formatted onto it. Do note this includes cases where mnemofs is formatted to
|
|
the device, but it's been mutilated to the point of being unrecognizable.
|
|
|
|
After this, use it like a regular file system. That's the job of a file
|
|
system after all...to hide the storage device's pecularities behind an
|
|
abstraction. A file system is considered good if you don't have to think
|
|
about its existence during regular usage.
|
|
|
|
NAND Flashes
|
|
============
|
|
|
|
Programmatically, the NAND flash has some problems. The whole device can be
|
|
condensed into three layers: blocks, pages and cells.
|
|
|
|
Cells represent the smallest unit of storage in NAND flashes, but are often
|
|
ignored, as direct access is not allowed. If a cell stores one bit, it's a
|
|
Single Level Cell. There are MLC, TLC, etc. for more bits per cell. Often,
|
|
the more bits per cell, the lesser is the wear resilience. Thus, higher
|
|
bits per cell are easier to wear out and become unreliable.
|
|
|
|
Pages are the smallest readable or writable unit of the NAND flash. It's
|
|
made up of several cells, and can be expected to have a size of the similar
|
|
order of 512 B.
|
|
|
|
Blocks are the smallest erasable unit of NAND flash. They are made up of
|
|
several pages. If a page is already written, it needs to be erased before it
|
|
can be written again. And since blocks are the smallest erasable unit, the
|
|
entire block needs to be erased if the user wants to update the contents of
|
|
one page.
|
|
|
|
The erase operation is what causes a block to wear out. If a block is worn
|
|
out too much, it will lose its ability to reliably store data. An unreliable
|
|
block can not guarantee that the data read from the pages in it is the same
|
|
as what was written to it. This state is called as a bad block.
|
|
|
|
A manufacturer can also deem a block to be unreliable from their testing,
|
|
and can mark them as bad blocks right from manufacture.
|
|
|
|
A good file system will aim to level out the wear between blocks as much as
|
|
it can.
|
|
|
|
Design
|
|
======
|
|
|
|
There are various layers and components in mnemofs, and they interact with
|
|
various layers on abstraction over each other.
|
|
|
|
Mnemofs works on a Copy-On-Write (CoW) basis, which means, if a page needs to
|
|
be updated, it is copied over in memory, and then the change is applied, and
|
|
the new data is written to a new location.
|
|
|
|
R/W Layer
|
|
---------
|
|
|
|
This works with the NAND flash device driver directly. It can write an
|
|
entire page, read an entire page, erase an entire block, check if a block is
|
|
bad (from it's bad block marker), or set a block as bad. It's the simplest
|
|
layer.
|
|
|
|
Block Allocator
|
|
---------------
|
|
|
|
The block allocator contains two arrays. One is a bit mask is for tracking
|
|
the free pages, while the other is an array of numbers, one number for each
|
|
block, denoting the number of pages in that block that are ready to be
|
|
erased.
|
|
|
|
The block allocator allocates pages or blocks in a sequential manner to keep
|
|
it fair for all pages, thus, ensuring wear levelling. It also starts from a
|
|
random offset to prevent bias to the front of the device in case of multiple
|
|
power losses and reinitialization in such casses. If a block is required it
|
|
skips pages to the start of the next block. Since block allocations happen
|
|
only in the journal, they happen in bulk and the number of skipped pages is
|
|
very minimal.
|
|
|
|
Once reaching the end of the device, it cycles back to the front. Thus any
|
|
skipped pages get the chance to be allocated in the next cycle.
|
|
|
|
CTZ Layer
|
|
---------
|
|
|
|
This works with the R/W Layer, and acts as an abstraction layer for other
|
|
components in mnemofs. Mnemofs uses
|
|
`CTZ lists <https://github.com/littlefs-project/littlefs/blob/master/DESIGN.md#ctz-skip-lists>`_
|
|
to represent both files and directories in flash. CTZ lists of files contain
|
|
only the data of the file, while CTZ lists of directories contain directory
|
|
entries (direntries) for each FS Object (file or directory) grouped into it.
|
|
|
|
This layer abstracts away the complex division of flash space that's present
|
|
in CTZ skip lists, and allows users of this layer to not worry about the
|
|
complexities of a CTZ skip list, and infact, to feel that the data is like a
|
|
contiguous space.
|
|
|
|
This layer allows the user to specify a data offset, which refers to the
|
|
offset into the actual data stored in the CTZ skip list (ie. excluding the
|
|
pointers), and the number of bytes, and perform operations on the CTZ list
|
|
almost like if it were a single array.
|
|
|
|
In mnemofs, each CTZ block takes up the space of exactly 1 page, and each
|
|
pointer takes up 4 bytes.
|
|
|
|
Littlefs design document shows how a CTZ list can be identified using the
|
|
index of the last CTZ block, and the page number of that CTZ block.
|
|
|
|
Journal
|
|
-------
|
|
|
|
The journal in mnemofs is made out of ``n + 2`` blocks. The last two block
|
|
concern the master node. These blocks are arranged in a singly linked list.
|
|
|
|
Due to CoW policy, when a CTZ list is updated, it now has a new location. The
|
|
first ``n`` blocks of the journal is responsible for storing logs containing
|
|
information about this very update. It will contain the old location of the
|
|
CTZ skip list, and the new location.
|
|
|
|
Thus, when the user requires an updated location of a CTZ list, they will
|
|
first find the old location by traversing the FS tree in the flash, and then
|
|
will traverse the journal to find the latest location.
|
|
|
|
So, the FS tree on the flash acts like a "base" state with updates stored in
|
|
the journal. Each log in journal is followed by a checksum to verify if all
|
|
of it was written properly. This helps in making it power loss resilient.
|
|
|
|
The journal, when combined with CoW, plays another important role. In pure
|
|
CoW, any update to a CTZ file will result in it having a new location. This
|
|
new location wil need to be updated in the parent, which itself will have a
|
|
new location after the update, and so on till it reaches the root. The
|
|
journal stops this propagation immediately. When the journal is full above
|
|
a certain limit, it will flush, and apply all of these changes to the FS
|
|
tree in one go. This helps in wear reduction.
|
|
|
|
The journal mainly works with the CTZ layer, and any updates to a CTZ list
|
|
using this layer automatically adds a log for it in the journal.
|
|
|
|
The journal starts with a magic sequence, then the number of blocks in the
|
|
journal (excluding master blocks), and then follows an array with the block
|
|
numbers of the blocks in the journal (including the master blocks). Following
|
|
this, logs are stored in the blocks.
|
|
|
|
Master Node and Root
|
|
--------------------
|
|
|
|
The root of the file system is treated like any directory as far as its
|
|
storage on the flash is concerned. This is because the master node acts as a
|
|
parent to the root, and contains information of the root in a way identical
|
|
to direntries.
|
|
|
|
The master node is stored in the master blocks. There are two master blocks,
|
|
and both are duplicated of each other for backup. Each master block is a
|
|
block, and thus have multiple pages in them. Each page contains one revision
|
|
of the master node. The master nodes are written sequentially, and have a
|
|
timestamp on them as well.
|
|
|
|
When a CTZ list is moved to a new location, the obsolete pages of the old
|
|
CTZ list are marked for deletion.
|
|
|
|
LRU Cache
|
|
---------
|
|
|
|
Mnemofs has a Least Recently Used (LRU) Cache component. The main use of this
|
|
component is to reduce wear on the flash at the expense of memory.
|
|
|
|
The LRU is a kernel list of nodes. Each node represents an FS object. Each
|
|
node also contains a kernel list of deltas. Each delta contains information
|
|
about an update or deletion from the user (which is what all of the VFS write
|
|
operations can be condensed to).
|
|
|
|
There's a pre-configured limit for both deltas per node and nodes in the LRU.
|
|
|
|
If the delta list in a node is full, and another is to be added, all the
|
|
existing deltas in the list are clubbed together and written to the flash
|
|
using the CTZ layer. The layer also automatically adds a log for this update.
|
|
When a node receives a delta, it is bumped from its current location in the
|
|
LRU to be at the front. This way, the last node in the LRU is always the
|
|
least used node.
|
|
|
|
If the node limit is reached in the LRU, and a new node is to be added to the
|
|
LRU, then the final node (which is also the least recently used node), is
|
|
popped from the LRU to make space for the new node. This popped node is then
|
|
written to the flash using the CTZ layer as well.
|
|
|
|
The LRU helps in clubbing updates to a single FS object and thus helps in
|
|
reducing the wear of the flash.
|
|
|
|
Journal Flush
|
|
-------------
|
|
|
|
The latest master node revision is the most useful out of the revisions. As
|
|
in CoW it's prudent to update the FS tree from bottom up, the root is the
|
|
last one to get updated in the case of a journal flush.
|
|
|
|
The logs are just location updates. So, when a journal flush occurs, it will
|
|
update the locations of all the children in the parent. This updates the
|
|
parent, and then this update goes through the same procedure as any other
|
|
update.
|
|
|
|
This is why it's best to start the flush operation when the journal is filled
|
|
up over a certain limit, instead of waiting it to be full. Why? Any log of
|
|
a parent makes any log of its children written **before** it useless, as the
|
|
updated location of the parent can be read to get the updated location of the
|
|
child till that point in the logs.
|
|
|
|
So, it will be best to move up the FS tree from bottom during update and
|
|
update the root last, since the root is the parent of every FS object.
|
|
|
|
Once the root is updated, all other journal logs become useless, and can be
|
|
erased. The root's log is not written in the first ``n`` blocks of the
|
|
journal, but written as a new master node entry in the master blocks.
|
|
|
|
Once the new root is written, the first ``n`` blocks can be erased, and
|
|
reallocated (due to the rules of wear levelling). The master blocks however
|
|
have some conditions for reallocation. This is called moving of the journal.
|
|
|
|
Every time the first ``n`` blocks are cleared, a new master node is added.
|
|
The only time a master block needs to be erased is when it becomes full.
|
|
Thus, if there are ``p`` pages in a block, the master blocks will be
|
|
moved along with the rest of the journal for every ``p`` journal flushes.
|
|
|
|
Before the new master node is written, none of the old pages should be erased
|
|
to allow rollback to the previous FS tree state. The moment the new master
|
|
node is updated, any block which has all of the pages in it ready for
|
|
deletion will be erased to make space.
|
|
|
|
FS Object Layer
|
|
---------------
|
|
|
|
This layer provides an abstraction for iterating, adding, deleting or
|
|
reading direntries.
|
|
|
|
This works with the LRU and the journal to get the latest data and thus the
|
|
user of this layer does not have to worry about these underlying mnemofs
|
|
components.
|
|
|
|
VFS Method Layer
|
|
----------------
|
|
|
|
VFS method layer contains methods exposed to the VFS. This layer works with
|
|
the FS Object layer for direntry related tasks, or the LRU for file level
|
|
read/write tasks. |