CPR101: Week 3

File Compression and Backup

CP4P_Compression-and-Backup.pptx Lecture PowerPoint slides
CP4P_CompressionBackup_Activity_Archive.zip

Activity Archive (Instructions and Answer docs) — download and unzip/decompress

For any .zip archive, ensure you work with the extracted files.
Extract everything to a folder, then delete the .zip file to avoid confusion.

Yes, you can double-click to open a file within a zip archive (the OS automatically extracts it to a deeply buried temporary work folder), but keeping track of those temp files is harder than taking a few seconds to do a proper extraction.
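If you would rather script the extraction than click through it, the steps above can be sketched in Python with the standard-library zipfile module. The function name extract_archive, its parameters, and the optional delete step are assumptions made for illustration; they are not part of the course materials.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir, delete_zip=False):
    """Extract every member of zip_path into dest_dir and return the member names.

    extract_archive and delete_zip are hypothetical names used only in this sketch.
    """
    zip_path = Path(zip_path)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(dest_dir)

    if delete_zip:
        zip_path.unlink()  # remove the archive only after extraction succeeded
    return names
```

A call such as extract_archive("CP4P_CompressionBackup_Activity_Archive.zip", "Activity_Archive", delete_zip=True) would leave only the extracted folder behind, assuming the archive sits in the current directory.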

Show provenance of _Activity_Answers development.docx How to show you did the work

Notes

AI uses lossy compression
ChatGPT Is a Blurry JPEG of the Web | The New Yorker
"Sometimes it’s only in the process of writing that you discover your original ideas. Some might say that the output of large language models doesn’t look all that different from a human writer’s first draft, but, again, I think this is a superficial resemblance. Your first draft isn’t an unoriginal idea expressed clearly; it’s an original idea expressed poorly, and it is accompanied by your amorphous dissatisfaction, your awareness of the distance between what it says and what you want it to say. That’s what directs you during rewriting, and that’s one of the things lacking when you start with text generated by an A.I." --- Ted Chiang
Ransomware defence:
Immutable
unchangeable storage of a Backup and/or a Snapshot
Snapshot vs Backup
The same, only different.
Backup
copy of folders & files from the host OS file system to another, disconnected system or storage medium
Block storage
is a physical/logical address in secondary storage that is independent of the OS file system
Snapshot
• copy of storage blocks
• Full & Incremental strategy
• always reflects current state of host's secondary storage
• immutable Snapshots can be used to recover from ransomware
• caution: many previous states (versioning) will soon require very large amounts of Snapshot storage
Synchronization
file system replicated across multiple hosts
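To make the Full & Incremental strategy for Snapshots concrete, here is a toy Python sketch that treats secondary storage as a list of fixed-size blocks. A full snapshot copies every block; an incremental snapshot records only the blocks that changed since the previous state. The function names, the in-memory representation, and the sample data are all assumptions for illustration; real snapshot systems work at the volume or file-system layer.

```python
def full_snapshot(blocks):
    """Copy every block: maps block index -> block contents."""
    return {i: bytes(b) for i, b in enumerate(blocks)}


def incremental_snapshot(previous_blocks, current_blocks):
    """Record only the blocks that are new or differ from the previous state."""
    return {
        i: bytes(cur)
        for i, cur in enumerate(current_blocks)
        if i >= len(previous_blocks) or previous_blocks[i] != cur
    }


def restore(full, increments):
    """Rebuild the block list: start from the full copy, apply increments oldest first."""
    state = dict(full)
    for inc in increments:
        state.update(inc)
    return [state[i] for i in sorted(state)]


# Three states of a four-block disk (hypothetical data).
day0 = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
day1 = [b"AAAA", b"bbbb", b"CCCC", b"DDDD"]  # block 1 changed
day2 = [b"AAAA", b"bbbb", b"CCCC", b"XXXX"]  # block 3 changed

full = full_snapshot(day0)
inc1 = incremental_snapshot(day0, day1)
inc2 = incremental_snapshot(day1, day2)

print(sorted(inc1))                          # indexes stored in the first increment
print(restore(full, [inc1, inc2]) == day2)   # True when the restore matches day2
```

Each increment is small, but every previous state you want to keep means one more increment to store, which is why long version histories eventually need a lot of Snapshot storage.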
Have a lot of data to back up? Watch this: https://youtu.be/y2F0wjoKEhg

Use your mySeneca OneDrive with 1TB storage but make sure it is not synchronized.
(What OneDrive calls "backup" is really folder synchronization, not backup.)

What “Dead to Me” (Netflix) Taught Us About 3-2-1 Backup

Top cloud backup services -- reasonably permanent, under your control. But that is only 1 of 3-2-1.
iDrive has a free "Basic" tier of 10GB (no credit card required), more than enough for all Seneca-related files. Not enough? How about 100GB for USD$2.95 per year, about the price of one Starbucks coffee annually -- and you don't have to wait in a long line.

How Nunavut recovered from a ransomware attack

Canadian firm pays $175,000 to recover from ransomware attack.
MapleSEC: The ransomware attack that turned into a horror story
https://www.itworldcanada.com/article/maplesec-the-ransomware-attack-that-turned-into-a-horror-story/436726

Canadian firm pays $425,000 to recover from ransomware attack.
http://www.itworldcanada.com/article/canadian-firm-pays-425000-to-recover-from-ransomware-attack/394844#ixzz4n70bPzrD
"Another lesson apparently is to ensure backups aren’t connected to the primary system."
Because, if your backup is not platform independent, ransomware will encrypt your backups too.

800+ Million Emails Leaked by Email Verification Service
To: verifications.io
Subject: [security alert] Verifications.io emails database exposed to public
Date: February 25, 2019
Ticket ID: #683614
Thank you for reporting the issue. We appreciate you reaching out and informing us. We
were able to quickly secure the database. Goes to show, even with 12 years of experience
you can't let your guard down.
After closer inspection, it appears that the database used for appends was briefly exposed.
This is our company database built with public information, not client data.
As you pointed out, data breaches and ransom are one of the largest threats our industry and
businesses face. We maintain full backups (both offline and in a different geographic
location) so the destruction of data or ransom is not a concern for us. The exposure of PII
(Personally Identifiable Information) to criminals is our primary concern and we take it seriously.
We have taken appropriate measures to correct this.

World Backup Day advice: Give your system some love | IT World Canada News

Pop songs compressed by the Lempel-Ziv algorithm

Huffman coding is a lossless data compression algorithm. It assigns variable-length codes based on the frequency of characters.

Huffman tree building animation: input text LOSSLESS, move "Animation Speed" slider to left, click [ Build tree ].
Huffman Coding
raw text:   LOSSLESS              <== 8 chars x 8 bits = 64 bits
charArray:  L   O    S  E         <== unique chars in text
countArray: 2   1    4  1         <== count of each char in text
Huff.Code:  10  110  0  111       <== variable-length binary code based on frequency of characters (see tree below)
encode:     10 110 0 0 10 111 0 0 <== each char in text replaced by its Huffman code: 14 bits, or 22% of the original
Frequency Table (Char, Frequency, bit length): (S,4,1)(L,2,2)(E,1,3)(O,1,3) <== data needed to build the binary tree for decoding

charArray and countArray are used to build a binary tree in which the most frequent characters have the shortest paths from the root.
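The tree-building step can be sketched in Python as follows. This is a minimal version of the standard Huffman procedure (repeatedly merge the two least frequent subtrees), not the code behind the animation or the worksheet; the function name huffman_codes and the tie-breaking counter are assumptions for illustration. The exact bit patterns depend on how ties are broken and on which branch is labelled 0, but the code lengths and the total encoded size do not.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(text):
    """Return {char: bit string} for text, built by merging the two rarest subtrees."""
    if not text:
        return {}
    freq = Counter(text)          # plays the role of charArray + countArray
    tiebreak = count()            # keeps heap entries comparable when counts are equal
    # Heap entry: (frequency, tiebreaker, tree); a tree is a char or a (left, right) pair.
    heap = [(f, next(tiebreak), ch) for ch, f in freq.items()]
    heapq.heapify(heap)

    if len(heap) == 1:            # one distinct char still needs a 1-bit code
        return {heap[0][2]: "0"}

    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # least frequent subtree
        f2, _, t2 = heapq.heappop(heap)   # next least frequent subtree
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (t1, t2)))

    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):       # internal node: descend both branches
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                             # leaf: record the path as this char's code
            codes[tree] = prefix

    walk(heap[0][2], "")
    return codes


codes = huffman_codes("LOSSLESS")
encoded = "".join(codes[ch] for ch in "LOSSLESS")
print({ch: len(bits) for ch, bits in codes.items()})  # code length per character
print(len(encoded))                                   # 14
```

For LOSSLESS the code lengths come out as S=1, L=2, O=3, E=3, matching the Frequency Table, and the encoded text is 14 bits long.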

Huffman Tree

Decode logic:

For each encoded bit
{
    follow one branch down the binary tree (0 right, 1 left)

    IF at leaf node
    {
        get/print char;
        reposition at binary tree root node;
    }
}
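The decode logic can be sketched in Python as shown below. The tree is rebuilt from the Huff.Code table in the worked example and stored as nested dictionaries keyed by bit value, so the left/right drawing convention does not matter here. The function names build_tree and decode are assumptions for illustration.

```python
def build_tree(codes):
    """Turn {char: bit string} into a nested-dict binary tree whose leaves are chars."""
    root = {}
    for ch, bits in codes.items():
        node = root
        for bit in bits[:-1]:
            node = node.setdefault(bit, {})   # create internal nodes as needed
        node[bits[-1]] = ch                   # last bit leads to the leaf
    return root


def decode(bits, root):
    """Walk the tree one bit at a time, emitting a char at each leaf."""
    out = []
    node = root
    for bit in bits:
        node = node[bit]                      # follow the branch for this bit
        if isinstance(node, str):             # at a leaf node
            out.append(node)                  # get the char
            node = root                       # reposition at the root
    return "".join(out)


codes = {"L": "10", "O": "110", "S": "0", "E": "111"}  # Huff.Code table from the notes
print(decode("10110001011100", build_tree(codes)))     # LOSSLESS
```

Because no code is a prefix of another, the decoder never needs a separator between characters: reaching a leaf is the only signal that one character has ended.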

HuffmanCoding Worked Example.xlsx