Data Engineering Cheat Sheet

These are commands, patterns, and very short gists I have come across that help with my day-to-day work.

Nginx

Here are some useful commands and setups.

General purpose reverse proxy

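To create a general-purpose reverse proxy that masks upstream connections, such as a database, put this in your modules conf. The stream block must sit at the top level of the Nginx configuration, not inside http, and Nginx needs the stream module built in or loaded.

The configuration below exposes the Postgres instance running on localhost (port 5432) on port 8000: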

stream {
  upstream db {
    server localhost:5432;
  }

  server {
    listen 8000;
    proxy_pass db;
  }
}
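
To check that the proxy is actually accepting connections, a plain TCP connect is enough. Here is a minimal Python sketch, assuming the proxy from the config above runs on the same machine; the helper name port_is_open is my own and not part of Nginx or Postgres:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # 8000 is the listen port from the stream block above.
    print(port_is_open("localhost", 8000))
```

This only shows that something is listening on the port. It does not prove that Postgres is answering behind it, so a real client connection is still the final test.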

Text processing

I expect this list will grow as I utilise more and more of the GNU and Unix based tools to accomplish my daily tasks.

Detecting inconsistent columns in csv

Assuming you had some data that looked like this:

1,2,3,4
1,2,3,4
1,2,3
1,2,3,4

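And you quickly wanted to verify it, you can use awk to walk through your CSV and find rows with an inconsistent number of columns. The command below prints every row whose field count differs from the previous row's, along with the difference. Note that the row after a bad one gets flagged as well, since it differs from its predecessor.

$ awk -F ',' 'NR > 1 && NF != old { printf "%d | %s <-- %d\n", NR, $0, NF - old } { old = NF }' test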
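
The same check can be sketched in Python when the file has quoted fields, which a plain comma split miscounts. This is a minimal sketch, assuming the first row has the correct width; the function name inconsistent_rows is made up for illustration:

```python
import csv
from typing import Iterable, List, Tuple


def inconsistent_rows(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (row_number, field_count) for rows whose width differs from row 1."""
    expected = None
    mismatches = []
    for row_number, row in enumerate(csv.reader(lines), start=1):
        if expected is None:
            expected = len(row)  # the first row sets the expected width
        elif len(row) != expected:
            mismatches.append((row_number, len(row)))
    return mismatches


if __name__ == "__main__":
    sample = ["1,2,3,4", "1,2,3,4", "1,2,3", "1,2,3,4"]
    print(inconsistent_rows(sample))  # [(3, 3)]
```

Because it compares against the first row rather than the previous one, only the short row is reported. For a real file, pass an open file handle (opened with newline="") instead of the sample list.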

Removing null characters

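Sometimes a data dump will contain stray null characters. Use sed to clean them out. The \x00 escape is a GNU sed extension, and the result goes to stdout unless you add -i to edit the file in place:

$ sed 's/\x00//g' SomeData.csv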
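
If GNU sed is not available, the same clean-up can be sketched in Python. This is a minimal sketch; the names strip_nuls and clean_file are my own, and it assumes the NUL bytes are genuinely junk rather than part of the encoding:

```python
def strip_nuls(data: bytes) -> bytes:
    """Remove every NUL (0x00) byte from data."""
    return data.replace(b"\x00", b"")


def clean_file(src: str, dst: str, chunk_size: int = 1 << 20) -> None:
    """Copy src to dst without NUL bytes, reading in chunks to bound memory."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        # Read fixed-size chunks until read() returns an empty bytes object.
        while chunk := fin.read(chunk_size):
            fout.write(strip_nuls(chunk))
```

Usage would be something like clean_file("SomeData.csv", "SomeData.clean.csv"). If the file turns out to be UTF-16, the NUL bytes are part of the encoding, and decoding it properly is the right fix rather than stripping them.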

FizzBuzz, in awk

Okay, this has no practical application, but if you want to flex in a job interview:

$ seq 100|awk '$0=$1%15?$1%5?$1%3?$1:"Fizz":"Buzz":"FizzBuzz"'
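
Nested ternaries are hard to read, so here is the conventional FizzBuzz rule written out as a Python sketch, handy for checking a one-liner's output: multiples of 3 give Fizz, multiples of 5 give Buzz, and multiples of both give FizzBuzz. The function name is my own:

```python
def fizzbuzz(n: int) -> str:
    """Return the FizzBuzz word for n, or n itself as a string."""
    # Same test order a nested ternary would use: 15 first, then 5, then 3.
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 5 == 0:
        return "Buzz"
    if n % 3 == 0:
        return "Fizz"
    return str(n)


if __name__ == "__main__":
    for i in range(1, 101):
        print(fizzbuzz(i))
```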

SSH

Entering the SSH command line

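Type ~C (tilde, then capital C) on a fresh line and you will enter the SSH command line. From here, you can control your current SSH session, including forwarding new ports. On OpenSSH 9.2 and later this is disabled by default and has to be turned on with EnableEscapeCommandline yes in your ssh_config.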

Agent forwarding

SSH agent forwarding lets you SSH onto a remote machine and have the keys you have added to your local agent available there. This allows you to SSH onto other machines, or use git+ssh, without having to put your keys on the remote box. Only forward your agent to machines you trust.

The ssh-agent can be launched by running:

$ eval $(ssh-agent)

If your agent is running, you can forward it by adding the -A flag:

$ ssh user@remotehost -A

This will allow you to SSH onward from that host.

NOTE: Please never forward your agent when connecting to machines you do not trust! For the duration of your session, anyone with root access on that machine can use your agent to authenticate as you, almost as if your SSH keys were on that machine.

Port forwarding

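To forward a specific port from the remote host to your localhost, so that the service running on the remote server appears to be running on your own machine, use the -L flag. The argument has the form local_port:host:remote_port, where host is resolved from the remote side: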

$ ssh user@remotehost -L 8080:localhost:8080

Killing an SSH session that is not responding

If your network has blipped, or the remote instance is no longer responding, your SSH session is essentially “locked” while it waits for packets to arrive, or for the connection to time out. You could wait around for the timeout, but we are professionals here. So, the SSH client allows you to send a “kill session” command by typing ~. on a fresh line.

Sending the SSH shell to the background

If you want to quickly background the current SSH shell to check on your local shell, type ~^Z (tilde, then ctrl-z) on a fresh line. Run fg in your local shell to return to the session.

Python

Opening a super simple web server using python

If you wish to share the content of the directory /path/to/content on port 9999, you can use the command below (the --directory option needs Python 3.7 or later):

$ python -m http.server --directory /path/to/content --bind 0.0.0.0 9999

This is quite useful if you want to quickly share files across a network.
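
If you need the same thing from inside a script, the standard library exposes it directly. This is a minimal sketch; the helper name serve_directory is my own, and the default host and port simply mirror the command above:

```python
import functools
import http.server
import threading


def serve_directory(directory: str, host: str = "0.0.0.0", port: int = 9999):
    """Start serving directory over HTTP in a background thread; return the server."""
    # Bind the directory to the handler class, as --directory does on the CLI.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer((host, port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Call server.shutdown() and then server.server_close() on the returned object when you are done. Passing port=0 lets the OS pick a free port, which you can read back from server.server_address.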

Rendering a pip freeze with only the required packages

When rendering your requirements.txt, there are often a lot of packages that were only installed as dependencies of the packages you actually asked for. This can make your requirements.txt quite large, and somewhat inflexible. There is in fact a better way though! You can list only the packages that nothing else depends on:

$ pip list --format=freeze --not-required
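
To see roughly what that filter is doing, here is a Python sketch built on importlib.metadata. It is an approximation for illustration, not pip's actual implementation: it skips any requirement guarded by an extra marker and ignores all other environment markers. The function names are my own:

```python
import re
from importlib import metadata
from typing import Dict, Iterable, List


def normalize(name: str) -> str:
    """Normalize a project name (lowercase, runs of -, _ and . become -)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def top_level(requires_by_name: Dict[str, Iterable[str]]) -> List[str]:
    """Return the packages that no other package in the mapping depends on."""
    required = set()
    for requirements in requires_by_name.values():
        for requirement in requirements:
            if re.search(r"\bextra\s*==", requirement):
                continue  # optional extras are not hard dependencies
            # The project name is the leading run of name characters.
            match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", requirement.strip())
            if match:
                required.add(normalize(match.group(0)))
    return sorted(n for n in map(normalize, requires_by_name) if n not in required)


def installed_requirements() -> Dict[str, List[str]]:
    """Map each installed distribution name to its declared requirement strings."""
    result = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:
            result[name] = dist.requires or []
    return result
```

Running top_level(installed_requirements()) inside a venv should give a list close to what pip prints, minus the version pins.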

Getting all packages required into a directory

Sometimes you will need to install Python packages on a computer that does not have access to a pip registry. If you can put files on the computer, you can first download the Python packages to a local directory, and then copy them over to your remote system. The steps are as follows:

Create your venv

$ python3 -m venv venv
$ source venv/bin/activate

Install your package(s)

$ pip install ipython

Getting the dependencies

$ pip freeze > requirements.txt

Get pip to spew out the files into the current directory

$ pip download -r requirements.txt

Testing in a docker container

$ docker container run -it --rm -v "$(pwd)":/reqs python bash

Now, if you can see the directory inside the container, you can install the packages from it without touching the network:

$ pip install --no-index --find-links /reqs/ -r /reqs/requirements.txt
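
Before copying the directory over, it can be worth checking that every pinned requirement actually has a downloaded file. This is a rough Python sketch, assuming name==version pins as produced by pip freeze and the usual wheel and sdist file naming; all function names are my own, and version strings are compared literally:

```python
import re
from typing import Iterable, List, Tuple


def normalize(name: str) -> str:
    """Normalize a project name (lowercase, runs of -, _ and . become -)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_pins(requirement_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Extract (name, version) from lines of the form name==version."""
    pins = []
    for line in requirement_lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins.append((normalize(name), version.strip()))
    return pins


def parse_filename(filename: str) -> Tuple[str, str]:
    """Best-effort (name, version) from a wheel or sdist file name."""
    if filename.endswith(".whl"):
        # Wheels are name-version-tags.whl, with no dashes inside the name.
        parts = filename[:-4].split("-")
        name, version = (parts[0], parts[1]) if len(parts) >= 2 else (parts[0], "")
    else:
        # Sdists are name-version.tar.gz; the name itself may contain dashes.
        stem = re.sub(r"\.(tar\.gz|tar\.bz2|zip)$", "", filename)
        name, _, version = stem.rpartition("-")
    return normalize(name), version


def missing_downloads(
    requirement_lines: Iterable[str], filenames: Iterable[str]
) -> List[Tuple[str, str]]:
    """Return the pinned requirements that have no matching downloaded file."""
    available = {parse_filename(f) for f in filenames}
    return [pin for pin in parse_pins(requirement_lines) if pin not in available]
```

For example, missing_downloads(open("requirements.txt"), os.listdir(".")) should return an empty list when everything is present. Lines that are not simple pins, such as editable installs or direct URLs, are ignored by this sketch.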