		Software Suspend 2.2 Internal Documentation.
				Version 1

1.  Introduction.

    Software Suspend 2.2 is an addition to the Linux Kernel, designed to
    allow the user to quickly shut down and quickly boot a computer, without
    needing to close documents or programs. It is equivalent to the
    hibernate facility in some laptops. This implementation, however,
    requires no special BIOS or hardware support.

    The code in these files is based upon the original implementation
    prepared by Gabor Kuti and additional work by Pavel Machek and a
    host of others. This code has been substantially reworked by Nigel
    Cunningham, again with the help and testing of many others, not the
    least of whom is Michael Frank. At its heart, however, the operation is
    essentially the same as Gabor's version.

2.  Overview of operation.

    The basic sequence of operations is as follows:

	a. Quiesce all other activity.
	b. Ensure enough memory and storage space are available, and attempt
	   to free memory/storage if necessary.
	c. Allocate the required memory and storage space.
	d. Write the image.
	e. Power down.

    There are a number of complicating factors which mean that things are
    not as simple as the above would imply, however...

    o The activity of each process must be stopped at a point where it will
    not be holding locks necessary for saving the image, or unexpectedly
    restart operations due to something like a timeout and thereby make
    our image inconsistent.

    o It is desirable that we sync outstanding I/O to disk before calculating
    image statistics. This reduces corruption if one should suspend but
    then not resume, and also makes later parts of the operation safer (see
    below).

    o We need to get as close as we can to an atomic copy of the data.
    Inconsistencies in the image will result in inconsistent memory contents at
    resume time, and thus in instability of the system and/or file system
    corruption. This would appear to imply a maximum image size of one half of
    the amount of RAM, but we have a solution... (again, below).

    o In 2.6, we choose to play nicely with the other suspend-to-disk
    implementations.

3.  Detailed description of internals.

    a. Quiescing activity.

    Safely quiescing the system is achieved using two methods.

    First, we note that the vast majority of processes don't need to run during
    suspend. They can be 'frozen'. We therefore implement a refrigerator
    routine, which processes enter and in which they remain until the cycle is
    complete. Processes enter the refrigerator via try_to_freeze() invocations
    at appropriate places.  A process cannot be frozen in any old place. It
    must not be holding locks that will be needed for writing the image or
    freezing other processes. For this reason, userspace processes generally
    enter the refrigerator via the signal handling code, and kernel threads at
    the place in their event loops where they drop locks and yield to other
    processes or sleep.
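
    As a rough illustration of the safe-point rule, the following userspace
    sketch models one pass of a kernel thread's event loop. The names
    (struct sim_task, sim_try_to_freeze, sim_worker_iteration) are
    hypothetical and are not the kernel's API. In the kernel the rule is met
    by where the try_to_freeze() call is placed; the explicit lock check
    below is there only to make the rule visible.

```c
#include <stdbool.h>

/* Hypothetical userspace model of a task; not a kernel structure. */
struct sim_task {
	int locks_held;		/* locks currently held by the task */
	bool freeze_requested;	/* set by the suspend core */
	bool frozen;		/* true once the task is in the refrigerator */
};

/*
 * Model of a try_to_freeze() call site. The task enters the refrigerator
 * only if a freeze was requested and it holds no locks. Returns true if
 * the task froze.
 */
static bool sim_try_to_freeze(struct sim_task *t)
{
	if (!t->freeze_requested || t->locks_held > 0)
		return false;
	t->frozen = true;
	return true;
}

/* One pass of a worker loop: do locked work, drop the lock, then check. */
static bool sim_worker_iteration(struct sim_task *t)
{
	t->locks_held++;		/* take a lock and do some work */
	/* A freeze attempt at this point would be refused. */
	t->locks_held--;		/* drop the lock before yielding */
	return sim_try_to_freeze(t);	/* safe point: no locks held */
}
```

    A task with no pending freeze request simply carries on; once the
    request is set, the next pass through the loop freezes it.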

    The second part of our method for quiescing the system involves freezing
    the filesystems. We use the standard freeze_bdev and thaw_bdev functions to
    ensure that all of the user's data is synced to disk before we begin to
    write the image.

    Quiescing the system works most quickly and reliably when we add one more
    element to the algorithm: separating the freezing of userspace processes
    from the freezing of kernel space processes, and doing the filesystem freeze
    in between. The filesystem freeze needs to be done while kernel threads such
    as kjournald can still run. At the same time, though, everything will be
    less racy and run more quickly if we stop userspace submitting more I/O
    work while we're trying to quiesce.

    Quiescing the system is therefore done in three steps:
	- Freeze userspace
	- Freeze filesystems
	- Freeze kernel threads

    If we need to free memory, we thaw kernel threads and filesystems, but not
    userspace. We can then free caches without worrying about deadlocks due to
    swap files being on frozen filesystems or such like.
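
    The ordering above can be modelled with a small bitmask. This is only a
    sketch of the sequence; the flag and function names are invented for the
    example and do not correspond to symbols in the Suspend2 source.

```c
/* Hypothetical flags recording what is currently frozen. */
enum {
	SIM_FROZEN_USERSPACE   = 1 << 0,
	SIM_FROZEN_FILESYSTEMS = 1 << 1,
	SIM_FROZEN_KTHREADS    = 1 << 2,
};

/* Freeze in the three steps described above; returns the final state. */
static unsigned int sim_quiesce(void)
{
	unsigned int state = 0;

	state |= SIM_FROZEN_USERSPACE;	 /* no new I/O from userspace */
	state |= SIM_FROZEN_FILESYSTEMS; /* kjournald and friends still run */
	state |= SIM_FROZEN_KTHREADS;	 /* finally stop kernel threads */
	return state;
}

/* Before freeing memory: thaw kernel threads and filesystems only. */
static unsigned int sim_thaw_for_freeing_memory(unsigned int state)
{
	return state & ~(SIM_FROZEN_KTHREADS | SIM_FROZEN_FILESYSTEMS);
}
```

    After sim_thaw_for_freeing_memory(), only the userspace flag remains
    set, which matches the state in which caches can be freed safely.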

    b. Ensure enough memory & storage are available.

    We have a number of constraints to meet to be able to successfully suspend
    and resume.

    First, the image will be written in two parts, described below. One of these
    parts (pageset1) needs to have an atomic copy made, which of course implies
    a maximum size of one half of the amount of system memory. The other part
    (pageset2) is not atomically copied, and can therefore be as large or small
    as desired.

    Second, we have constraints on the amount of storage available. In these
    calculations, we may also consider any compression that will be done. The
    cryptoapi module allows the user to configure an expected compression ratio.
   
    Third, the user can specify an arbitrary limit on the image size, in
    megabytes. This limit is treated as a soft limit, so that we don't fail the
    attempt to suspend if we cannot meet this constraint.
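
    The storage and image size constraints amount to simple arithmetic, which
    might be sketched as follows. The function names are hypothetical, the
    compression ratio is expressed as a percentage (100 meaning no
    compression) purely for convenience, and the sketch assumes that the
    soft limit is compared against the uncompressed image size and that the
    page size divides one megabyte evenly.

```c
/*
 * Pages of storage needed for an image of image_pages pages, given an
 * expected compression ratio in percent. Rounds up so that we never
 * underestimate.
 */
static unsigned long sim_storage_needed(unsigned long image_pages,
					unsigned int expected_ratio_pct)
{
	return (image_pages * expected_ratio_pct + 99) / 100;
}

/* Hard constraint: the (compressed) image must fit in the storage. */
static int sim_storage_ok(unsigned long image_pages,
			  unsigned int expected_ratio_pct,
			  unsigned long storage_pages)
{
	return sim_storage_needed(image_pages, expected_ratio_pct)
		<= storage_pages;
}

/*
 * Soft constraint: the number of pages we would like to free to get
 * under the user's limit (in megabytes). Zero means the limit is met.
 * Failing to free this many pages does not abort the suspend.
 */
static unsigned long sim_pages_over_soft_limit(unsigned long image_pages,
					       unsigned long limit_mb,
					       unsigned long page_size)
{
	unsigned long pages_per_mb = 1024UL * 1024UL / page_size;
	unsigned long limit_pages = limit_mb * pages_per_mb;

	return image_pages > limit_pages ? image_pages - limit_pages : 0;
}
```

    For example, a 1000 page image with an expected ratio of 50% needs 500
    pages of storage, and with 4096 byte pages it is 488 pages over a 2
    megabyte soft limit.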

    c. Allocate the required memory and storage space.

    Having done the initial freeze, we determine whether the above constraints
    are met, and seek to allocate the metadata for the image. If the constraints
    are not met, or we fail to allocate the required space for the metadata, we
    seek to free the amount of memory that we calculate is needed and try again.
    We allow up to four iterations of this loop before aborting the cycle. If we
    do fail, it should only be because of a bug in Suspend's calculations.
    
    These steps are merged together in the prepare_image function, found in
    prepare_image.c. The functions are merged because of the cyclical nature
    of the problem of calculating how much memory and storage is needed. Since
    the data structures containing the information about the image must
    themselves take memory and use storage, the amount of memory and storage
    required changes as we prepare the image. Since the changes are not large,
    only one or two iterations will be required to achieve a solution.
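
    The cyclical nature of the calculation can be seen in a toy model. The
    sketch below is not the code in prepare_image.c; it assumes a made-up
    cost of one metadata page per 1024 image pages and treats "freeing
    memory" as simply dropping pages from the image. All names are
    hypothetical.

```c
#define SIM_MAX_ATTEMPTS 4	/* the text allows up to four iterations */

/* Hypothetical cost model: one metadata page per 1024 image pages. */
static unsigned long sim_metadata_pages(unsigned long image_pages)
{
	return (image_pages + 1023) / 1024;
}

/*
 * Shrink *image_pages until the image plus its metadata fits in
 * storage_pages. Returns the number of attempts used, or -1 if it
 * still does not fit after SIM_MAX_ATTEMPTS.
 */
static int sim_prepare_image(unsigned long *image_pages,
			     unsigned long storage_pages)
{
	int attempt;

	for (attempt = 1; attempt <= SIM_MAX_ATTEMPTS; attempt++) {
		unsigned long needed = *image_pages +
				       sim_metadata_pages(*image_pages);
		unsigned long shortfall;

		if (needed <= storage_pages)
			return attempt;

		/* "Free memory": drop the shortfall from the image. */
		shortfall = needed - storage_pages;
		if (shortfall > *image_pages)
			shortfall = *image_pages;
		*image_pages -= shortfall;
	}
	return -1;
}
```

    Because shrinking the image also shrinks the metadata, this simplified
    model settles on the second attempt at the latest: a 10000 page image
    and 5000 pages of storage end up as a 4990 page image after two passes.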

    d. Write the image.

    We previously mentioned the need to create an atomic copy of the data, and
    the half-of-memory limitation that is implied in this. This limitation is
    circumvented by dividing the memory to be saved into two parts, called
    pagesets.

    Pageset2 contains the page cache - the pages on the active and inactive
    lists. These pages are saved first and reloaded last. While saving these
    pages, the swapwriter module carefully ensures that the work of writing
    the pages doesn't make the image inconsistent. Pages added to the LRU
    lists are immediately shot down, and careful accounting for available
    memory aids debugging. No atomic copy of these pages needs to be made.

    Writing the image requires memory, of course, and at this point we have
    also not yet suspended the drivers. To avoid the possibility of remaining
    activity corrupting the image, we allocate a special memory pool. Calls
    to __alloc_pages and __free_pages_ok are then diverted to use our memory
    pool. Pages in the memory pool are saved as part of pageset1 regardless of
    whether or not they are used.

    Once pageset2 has been saved, we suspend the drivers, save the CPU
    context and make an atomic copy of pageset1. We then resume the drivers
    and save the atomic copy. After saving the two pagesets, we just need to
    save our metadata before powering down.

    Having saved pageset2 pages, we can safely overwrite their contents with
    the atomic copy of pageset1. This is how we manage to overcome the half of
    memory limitation. Pageset2 is normally far larger than pageset1, and
    pageset1 is normally much smaller than half of the memory, with the result
    that pageset2 pages can be safely overwritten with the atomic copy of
    pageset1. This is where we need to be careful about syncing, however.
    Pageset2 will probably contain filesystem metadata. If this is overwritten
    with pageset1 and then a sync occurs, the filesystem will be corrupted -
    at least until resume time and another sync of the restored data. Since
    there is a possibility that the user might not resume or (may it never be!)
    that suspend might oops, we do our utmost to avoid syncing filesystems after
    copying pageset1.
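
    The effect of reusing pageset2 pages can be shown with a small check.
    This is a sketch under a stated assumption, namely that the atomic copy
    may be placed in free pages and, once pageset2 is on disk, in pageset2
    pages as well. The function name is hypothetical.

```c
/*
 * Can an atomic copy of pageset1 be made?
 *
 * Assumption for this sketch: the copy may go into free pages and, once
 * pageset2 has been written out, into pageset2 pages too.
 */
static int sim_atomic_copy_fits(unsigned long pageset1_pages,
				unsigned long pageset2_pages,
				unsigned long free_pages,
				int pageset2_saved)
{
	unsigned long room = free_pages;

	if (pageset2_saved)
		room += pageset2_pages;	/* contents are already on disk */
	return pageset1_pages <= room;
}
```

    Take a machine with 100000 pages, of which 30000 are in pageset1, 60000
    in pageset2 and 10000 free. Before pageset2 is saved there is no room
    for the copy; afterwards there is, so an image of 90000 pages can be
    written even though it is far more than half of memory.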

    e. Power down.

    Powering down uses standard kernel routines. Prior to this, however, we
    suspend drivers again, ensuring that write caches are flushed.

4.  The method of writing the image.

    Suspend2 contains an internal API which is designed to simplify the
    implementation of new methods of transforming the image to be written and
    writing the image itself. In early versions of Suspend2, compression support
    was inlined in the image writing code, and the data structures and code for
    managing swap were intertwined with the rest of the code. A number of people
    had expressed interest in implementing image encryption, and alternative
    methods of storing the image. This internal API makes that possible by
    implementing 'modules'.

    A module is a single file which encapsulates the functionality needed
    to transform a pageset of data (encryption or compression, for example),
    or to write the pageset to a device. The former type of module is called
    a 'page-transformer', the latter a 'writer'.

    Modules are linked together in pipeline fashion. There may be zero or more
    page transformers in a pipeline, and there is always exactly one writer.
    The pipeline follows this pattern:

		---------------------------------
		|          Suspend2 Core        |
		---------------------------------
				|
				|
		---------------------------------
		|	Page transformer 1	|
		---------------------------------
				|
				|
		---------------------------------
		|	Page transformer 2	|
		---------------------------------
				|
				|
		---------------------------------
		|            Writer		|
		---------------------------------

    During the writing of an image, the core code feeds pages one at a time
    to the first module. This module performs whatever transformations it
    implements on the incoming data, completely consuming the incoming data and
    feeding output in a similar manner to the next module. A module may buffer
    its output.

    During reading, the pipeline works in the reverse direction. The core code
    calls the first module with the address of a buffer which should be filled.
    (Note that the buffer size is always PAGE_SIZE at this time). This module
    will in turn request data from the next module and so on down until the
    writer is made to read from the stored image.
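
    The two directions of the pipeline can be tried out in userspace with a
    sketch like the one below. It is not the Suspend2 API: the module
    structure is trimmed down, the chunk routines take plain byte buffers
    and an explicit self pointer instead of struct page, the "device" is an
    in-memory array, and the transformer is a toy XOR standing in for
    compression or encryption. All names are hypothetical.

```c
#include <string.h>

#define SIM_PAGE_SIZE 64	/* stand-in for PAGE_SIZE, kept small */
#define SIM_MAX_PAGES 8

/* Hypothetical, trimmed-down module: only the chunk routines. */
struct sim_module {
	int (*write_chunk)(struct sim_module *self, const unsigned char *page);
	int (*read_chunk)(struct sim_module *self, unsigned char *page);
	struct sim_module *next;	/* next module in the pipeline */
};

/* Writer: stores pages in memory instead of on a device. */
static unsigned char sim_store[SIM_MAX_PAGES][SIM_PAGE_SIZE];
static int sim_written, sim_read_pos;

static int sim_writer_write(struct sim_module *self, const unsigned char *page)
{
	(void)self;
	if (sim_written >= SIM_MAX_PAGES)
		return -1;		/* out of space */
	memcpy(sim_store[sim_written++], page, SIM_PAGE_SIZE);
	return 0;
}

static int sim_writer_read(struct sim_module *self, unsigned char *page)
{
	(void)self;
	if (sim_read_pos >= sim_written)
		return -1;		/* nothing left to read */
	memcpy(page, sim_store[sim_read_pos++], SIM_PAGE_SIZE);
	return 0;
}

/* Toy transform: XOR with a constant. Applying it twice undoes it. */
static void sim_xor(unsigned char *dst, const unsigned char *src)
{
	int i;

	for (i = 0; i < SIM_PAGE_SIZE; i++)
		dst[i] = src[i] ^ 0x5a;
}

static int sim_xor_write(struct sim_module *self, const unsigned char *page)
{
	unsigned char out[SIM_PAGE_SIZE];

	sim_xor(out, page);		/* consume the incoming page */
	return self->next->write_chunk(self->next, out); /* feed it on */
}

static int sim_xor_read(struct sim_module *self, unsigned char *page)
{
	/* Ask the next module to fill the buffer, then undo the transform. */
	int ret = self->next->read_chunk(self->next, page);

	if (ret == 0)
		sim_xor(page, page);
	return ret;
}

static struct sim_module sim_writer = {
	sim_writer_write, sim_writer_read, NULL
};
static struct sim_module sim_transformer = {
	sim_xor_write, sim_xor_read, &sim_writer
};
```

    The core would call sim_transformer.write_chunk() once per page while
    suspending and sim_transformer.read_chunk() once per page while
    resuming; in both cases it only ever talks to the first module.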

    Part of the definition of the structure of a module thus looks like this:

        int (*rw_init) (int rw, int stream_number);
        int (*rw_cleanup) (int rw);
        int (*write_chunk) (struct page *buffer_page);
        int (*read_chunk) (struct page *buffer_page, int sync);

    It should be noted that the _cleanup routine may be called before the
    full stream of data has been read or written. While writing the image,
    the user may (depending upon settings) choose to abort suspending, and
    if we are in the midst of writing the last portion of the image, a portion
    of the second pageset may be reread.

    In addition to the above routines for writing the data, all modules have a
    number of other fields and routines:

    TYPE indicates whether the module is a page transformer or a writer.
    #define TRANSFORMER_MODULE 1
    #define WRITER_MODULE 2

    NAME is the name of the module, used in generic messages.

    MODULE_LIST is used to link the module into the list of all modules.

    MEMORY_NEEDED returns the number of pages of memory required by the module
    to do its work.

    STORAGE_NEEDED returns the number of pages in the suspend header required
    to store the module's configuration data.

    PRINT_DEBUG_INFO fills a buffer with information to be displayed about the
    operation or settings of the module.

    SAVE_CONFIG_INFO returns a buffer of PAGE_SIZE or smaller (the size is the
    return code), containing the module's configuration info. This information
    will be written in the image header and restored at resume time. Since this
    buffer is allocated after the atomic copy of the kernel is made, you don't
    need to worry about the buffer being freed.

    LOAD_CONFIG_INFO gives the module a pointer to the configuration info
    which was saved during suspending. Once again, the module doesn't need to
    worry about freeing the buffer. The kernel will be overwritten with the
    original kernel, so no memory leak will occur.

    OPS contains the operations specific to transformers and writers. These are
    described below.
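
    A pair of configuration routines for an imaginary module might look
    roughly like this. The routine signatures follow the save_config_info
    and load_config_info members of struct suspend_module_ops; the settings
    structure, its fields and the sim_ names are invented for the example.

```c
#include <string.h>

/* Hypothetical settings for an imaginary compression module. */
struct sim_config {
	int enabled;
	int expected_ratio_pct;
};

static struct sim_config sim_cfg = { 1, 50 };

/* Copy the settings into buffer; the return value is the size used. */
static int sim_save_config_info(char *buffer)
{
	memcpy(buffer, &sim_cfg, sizeof(sim_cfg));
	return (int)sizeof(sim_cfg);
}

/* Restore the settings at resume time; ignore data of unexpected size. */
static void sim_load_config_info(char *buffer, int len)
{
	if (len != (int)sizeof(sim_cfg))
		return;
	memcpy(&sim_cfg, buffer, sizeof(sim_cfg));
}
```

    The size check in the load routine is a design choice of the sketch
    rather than something the text requires; it guards against an image
    written with a different layout of the settings.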

    The complete definition of struct suspend_module_ops is:

	struct suspend_module_ops {
	        /* Functions common to all modules */
	        int type;
	        char *name;
	        struct module *module;
	        int disabled;
	        struct list_head module_list;

	        /* List of filters or writers */
	        struct list_head list, type_list;

	        /*
	         * Requirements for memory and storage in
	         * the image header.
	         */
	        unsigned long (*memory_needed) (void);
	        unsigned long (*storage_needed) (void);

	        /*
	         * Debug info
	         */
	        int (*print_debug_info) (char *buffer, int size);
	        int (*save_config_info) (char *buffer);
	        void (*load_config_info) (char *buffer, int len);

	        /*
	         * Initialise & cleanup - general routines called
	         * at the start and end of a cycle.
	         */
	        int (*initialise) (int starting_cycle);
	        void (*cleanup) (int finishing_cycle);

	        /*
	         * Calls for allocating storage (writers only).
	         *
	         * Header space is allocated separately. Note that allocation
	         * of space for the header might result in allocated space
	         * being stolen from the main pool if there is no unallocated
	         * space. We have to be able to allocate enough space for
	         * the header. We can eat memory to ensure there is enough
	         * for the main pool.
	         */

	        int (*storage_available) (void);
	        int (*allocate_header_space) (int space_requested);
	        int (*allocate_storage) (int space_requested);
	        int (*storage_allocated) (void);
	        int (*release_storage) (void);

	        /*
	         * Routines used in image I/O.
	         */
	        int (*rw_init) (int rw, int stream_number);
	        int (*rw_cleanup) (int rw);
	        int (*write_chunk) (struct page *buffer_page);
	        int (*read_chunk) (struct page *buffer_page, int sync);

	        /* Reset module if image exists but reading aborted */
	        void (*noresume_reset) (void);

	        /* Read and write the metadata */
	        int (*write_header_init) (void);
	        int (*write_header_cleanup) (void);

	        int (*read_header_init) (void);
	        int (*read_header_cleanup) (void);

	        int (*rw_header_chunk) (int rw, char *buffer_start, int buffer_size);

	        /* Attempt to parse an image location */
	        int (*parse_sig_location) (char *buffer, int only_writer);

	        /* Determine whether image exists that we can restore */
	        int (*image_exists) (void);

	        /* Mark the image as having tried to resume */
	        void (*mark_resume_attempted) (void);

	        /* Destroy image if one exists */
	        int (*invalidate_image) (void);
	};
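
    To show how the memory_needed and storage_needed members might be used,
    here is a sketch in which the core adds up the requirements of every
    enabled module. The structure is a trimmed copy of a few fields from
    the definition above, the three modules and their numbers are made up,
    and skipping modules whose disabled field is set is an assumption.

```c
#include <stddef.h>

/* Trimmed copy of a few fields of struct suspend_module_ops. */
struct sim_module_ops {
	const char *name;
	int disabled;
	unsigned long (*memory_needed)(void);
	unsigned long (*storage_needed)(void);
};

/* Made-up requirements (in pages) for three imaginary modules. */
static unsigned long sim_comp_memory(void)    { return 4; }
static unsigned long sim_comp_storage(void)   { return 1; }
static unsigned long sim_crypt_memory(void)   { return 8; }
static unsigned long sim_crypt_storage(void)  { return 1; }
static unsigned long sim_writer_memory(void)  { return 16; }
static unsigned long sim_writer_storage(void) { return 2; }

static const struct sim_module_ops sim_modules[] = {
	{ "compressor", 0, sim_comp_memory,   sim_comp_storage },
	{ "encryptor",  1, sim_crypt_memory,  sim_crypt_storage }, /* disabled */
	{ "writer",     0, sim_writer_memory, sim_writer_storage },
};

#define SIM_MODULE_COUNT (sizeof(sim_modules) / sizeof(sim_modules[0]))

/* Pages of working memory needed by all enabled modules. */
static unsigned long sim_total_memory_needed(void)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < SIM_MODULE_COUNT; i++)
		if (!sim_modules[i].disabled)
			total += sim_modules[i].memory_needed();
	return total;
}

/* Pages of header space needed for all enabled modules' config data. */
static unsigned long sim_total_header_storage(void)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < SIM_MODULE_COUNT; i++)
		if (!sim_modules[i].disabled)
			total += sim_modules[i].storage_needed();
	return total;
}
```

    With the numbers above, the enabled modules need 20 pages of memory and
    3 pages of header space between them.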


	Expected compression returns the expected ratio between the amount of
	data sent to this module and the amount of data it passes to the next
	module. The value is used by the core code to calculate the amount of
	space required to write the image. If the ratio is not achieved, the
	writer will complain when it runs out of space with data still to
	write, and the core code will abort the suspend.

	transformer_list links together page transformers, in the order in
	which they register, which is in turn determined by order in the
	Makefile.
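
	Keeping the list in registration order only requires appending at
	the tail. The sketch below uses a hand-rolled singly linked list
	with hypothetical names; the kernel's list_head helpers, such as
	list_add_tail(), do the same job, though the text does not say
	which of them Suspend2 uses.

```c
#include <stddef.h>

/* Hypothetical, minimal stand-in for a registered page transformer. */
struct sim_xform {
	const char *name;
	struct sim_xform *next;
};

static struct sim_xform *sim_xform_head, *sim_xform_tail;

/* Append at the tail so that pipeline order equals registration order. */
static void sim_register_xform(struct sim_xform *t)
{
	t->next = NULL;
	if (sim_xform_tail)
		sim_xform_tail->next = t;
	else
		sim_xform_head = t;
	sim_xform_tail = t;
}
```

	If a compressor registers before an encryptor, data is compressed
	first and encrypted second when the list is walked from the head.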
