Data Engineering Cheat Sheet
These are commands, patterns, and very short gists that I have come across and that help with my day-to-day work.
Nginx
Here are some useful commands and setups.
General purpose reverse proxy
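To create a general purpose reverse proxy that you can use to mask upstream connections, like a database, put this in your modules conf. The stream block must sit at the top level of the nginx configuration (not inside http), and nginx must be built with the stream module.
The config below exposes the Postgres running on localhost:5432 on port 8000.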
stream {
    upstream db {
        server localhost:5432;
    }
    server {
        listen 8000;
        proxy_pass db;
    }
}
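A quick way to confirm the proxy is listening is to attempt a TCP connection to it. The sketch below is a minimal check in Python; the function name port_is_open and the 2 second default timeout are assumptions made for illustration, not anything nginx provides.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake and nothing more.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling port_is_open("localhost", 8000) only shows that nginx accepted the connection. It does not prove the upstream Postgres is reachable, so follow up with a real client such as psql if you need that.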
Text processing
I expect this list will grow as I use more and more of the GNU and Unix-based tools to accomplish my daily tasks.
Detecting inconsistent columns in csv
Assuming you had some data that looked like this:
1,2,3,4
1,2,3,4
1,2,3
1,2,3,4
If you want to quickly verify it, you can use awk to walk through your CSV and find inconsistent columns.
$ awk -F ',' 'NR>1 && NF!=old { printf "%d | %s <-- %d\n", NR, $0, NF - old } { old = NF }' test
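The awk one-liner compares each line with the one before it, and splitting on a bare comma miscounts quoted fields that contain commas. As a rough alternative, here is a sketch using Python's csv module. It assumes the first row has the correct number of columns, and the function name inconsistent_rows is made up for this example.

```python
import csv
from typing import Iterable, Iterator, Tuple


def inconsistent_rows(lines: Iterable[str]) -> Iterator[Tuple[int, int, int]]:
    """Yield (record_number, field_count, expected_count) for each odd row.

    The expected count is taken from the first row (an assumption).
    """
    expected = None
    # csv.reader understands quoting, so "a,b" inside quotes is one field.
    for record_number, row in enumerate(csv.reader(lines), start=1):
        if expected is None:
            expected = len(row)
        elif len(row) != expected:
            yield record_number, len(row), expected
```

Pass it an open file, for example inconsistent_rows(open("test", newline="")), and loop over the results.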
Removing null characters
Sometimes you will have weird null characters in the data that was dumped. Use sed to clean them out (this prints the cleaned data to stdout):
$ sed 's/\x0//g' SomeData.csv
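If sed is not available, the same clean-up can be sketched in Python. This is a minimal version that reads the whole file into memory, so it assumes the file fits in RAM; the function name strip_nulls and both path arguments are hypothetical.

```python
from pathlib import Path


def strip_nulls(source: str, destination: str) -> int:
    """Copy source to destination without NUL bytes; return how many were dropped."""
    data = Path(source).read_bytes()
    # Work on bytes so that the file's text encoding does not matter.
    cleaned = data.replace(b"\x00", b"")
    Path(destination).write_bytes(cleaned)
    return len(data) - len(cleaned)
```

The return value doubles as a quick sanity check: zero means the file was already clean.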
FizzBuzz, in awk
Okay, this has no practical application, but if you want to flex in a job interview:
$ seq 100|awk '$0=$1%15?$1%5?$1%3?$1:"Fizz":"Buzz":"FizzBuzz"'
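The nested ternary is hard to read at a glance. Spelled out in Python, the conventional FizzBuzz rules it is aiming for (multiples of 3 give Fizz, multiples of 5 give Buzz, multiples of both give FizzBuzz) look like the sketch below; the function name is just for illustration.

```python
def fizzbuzz(n: int) -> str:
    """Return the FizzBuzz word for n, or n itself as a string."""
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 5 == 0:
        return "Buzz"
    if n % 3 == 0:
        return "Fizz"
    return str(n)
```

Comparing its output for 1 to 100 with the awk output is an easy way to check the one-liner.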
SSH
Entering the SSH command line
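Press ~ followed by C (a capital C, so ~C) on a fresh line to enter the SSH command line. From here, you can
control your current SSH session, including forwarding new ports. Note that OpenSSH 9.2 and later
disable this by default; set EnableEscapeCommandline yes in your ssh_config to turn it back on.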
Agent forwarding
SSH agent forwarding allows you to SSH onto a remote machine and have the keys you have added to your agent forwarded with you. This allows you to SSH onto other machines or use git+ssh without having to put your keys on the remote box. Note: only forward your agent to machines you trust.
Launching the ssh-agent can be done by running:
$ eval $(ssh-agent)
You can forward your SSH agent, if it's running, by adding the -A flag:
$ ssh user@remotehost -A
This will allow you to SSH onward from that host.
NOTE: Please never forward your agent when connecting to machines you do not trust! Your keys never leave your machine, but anyone with root on the remote host can use your agent to authenticate as you for the duration of your session.
Port forwarding
To forward a specific port from the remote host to your localhost, so that the service running on the remote server appears to be running on your own machine, use the -L flag:
$ ssh user@remotehost -L 8080:localhost:8080
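The -L argument has the shape local_port:remote_host:remote_port, which is easy to get backwards. Here is a small sketch that builds the command from named parts; the function name forward_command is hypothetical, and adding -N (do not run a remote command) is my own choice for a tunnel-only session rather than part of the command above.

```python
from typing import List


def forward_command(target: str, local_port: int, remote_port: int,
                    remote_host: str = "localhost") -> List[str]:
    """Build the argv for an ssh local port forward to target."""
    # remote_host is resolved on the remote side, so "localhost" here means
    # the machine you are SSHing into, not your own.
    spec = f"{local_port}:{remote_host}:{remote_port}"
    # -N asks ssh not to run a remote command (an assumption, see above).
    return ["ssh", "-N", "-L", spec, target]
```

The resulting list can be handed to subprocess.run, for example subprocess.run(forward_command("user@remotehost", 8080, 8080)).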
Killing an SSH session that is not responding
If your network has blipped, or the remote instance is no longer responding, your SSH session is
essentially “locked” while it waits for packets to arrive or for the connection to time out. You could
wait around for the timeout, but we are professionals here. The SSH client allows you to terminate
the connection by typing ~. (tilde, then a full stop) on a fresh line.
Sending the SSH shell to the background
To quickly background the current SSH session and check on your local shell, type ~^Z
(tilde, then ctrl-z) on a fresh line. Run fg in your local shell to bring the session back.
Python
Opening a super simple web server using python
If you wish to share the content of the directory /path/to/content on port 9999, you can use the command:
$ python -m http.server --directory /path/to/content --bind 0.0.0.0 9999
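If you need the same thing from inside a script, the standard library exposes the pieces directly. The sketch below assumes Python 3.7 or later (for the directory argument and ThreadingHTTPServer); the function name serve_directory is made up for this example.

```python
import functools
import http.server
import threading


def serve_directory(directory: str, host: str = "0.0.0.0", port: int = 9999):
    """Start serving directory in a background thread and return the server."""
    # partial pins the directory so the handler does not serve the cwd.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer((host, port), handler)
    # A daemon thread will not keep the interpreter alive on exit.
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Call server.shutdown() followed by server.server_close() on the returned object when you are done. As with the command line version, binding to 0.0.0.0 exposes the directory to everyone who can reach the machine.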
This is quite useful if you want to quickly share files across a network.
Rendering a pip freeze with only the required packages
When rendering your requirements.txt, there are often a lot of packages that were installed as
dependencies of the packages you actually asked for. This can make your requirements.txt quite large, and
somewhat inflexible. There is in fact a better way though! You can list only the packages that nothing else depends on:
$ pip list --format=freeze --not-required
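To get a feel for what "not required" means, here is a rough approximation of the idea using importlib.metadata. It is a sketch, not pip's implementation: it counts every declared requirement, including ones that only apply to optional extras, so its answer may differ from pip's. All function names are invented for the example.

```python
import re
from importlib import metadata
from typing import Dict, List


def normalise(name: str) -> str:
    """Normalise a project name so 'Foo_Bar' and 'foo-bar' compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement: str) -> str:
    """Extract the project name from a string such as 'idna>=2.5'."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    return normalise(match.group(1)) if match else ""


def not_required(requires: Dict[str, List[str]]) -> List[str]:
    """Return the packages that no other package in the mapping depends on."""
    needed = {
        requirement_name(req)
        for name, reqs in requires.items()
        for req in reqs
        # Ignore a package that lists itself, e.g. through its own extras.
        if requirement_name(req) != normalise(name)
    }
    return sorted(name for name in requires if normalise(name) not in needed)


def installed_requirements() -> Dict[str, List[str]]:
    """Map each installed distribution to its declared requirement strings."""
    return {
        dist.metadata["Name"]: list(dist.requires or [])
        for dist in metadata.distributions()
    }
```

Running not_required(installed_requirements()) inside your venv gives the top-level package names, without versions.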
Getting all packages required into a directory
Sometimes you will need to install Python packages on a computer that does not have access to a pip registry. If you can put files on the computer, you can download the packages to a local directory first, and then copy them over to your remote system. The steps look like this:
Create your venv
$ python3 -m venv venv
$ source venv/bin/activate
Install your package(s)
$ pip install ipython
Freeze the dependencies
$ pip freeze > requirements.txt
Get pip to spew out the package files into the current directory
$ pip download -r requirements.txt
Test in a Docker container
$ docker container run -it --rm -v "$(pwd)":/reqs python bash
Now, if you can see the directory inside the container, you can install the packages from it
$ pip install --no-index --find-links /reqs/ -r /reqs/requirements.txt
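Before copying the directory to the offline machine, it can be worth checking that every pinned requirement actually has a downloaded file. The sketch below is one way to do that. It assumes requirements.txt contains only plain name==version lines, as pip freeze normally writes, and that the files are wheels or .tar.gz/.zip source archives; the function names are hypothetical.

```python
import re
from pathlib import Path
from typing import List


def normalise(name: str) -> str:
    """Normalise a project name so 'Foo_Bar' and 'foo-bar' compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def missing_packages(requirements_file: str, download_dir: str) -> List[str]:
    """List pinned requirements that have no matching file in download_dir."""
    available = set()
    for path in Path(download_dir).iterdir():
        if path.name.endswith(".whl"):
            # Wheel names look like name-version-...; hyphens in the project
            # name are escaped to underscores, so the first field is the name.
            available.add(normalise(path.name.split("-")[0]))
        elif path.name.endswith((".tar.gz", ".zip")):
            # Source archives look like name-version.ext.
            stem = re.sub(r"\.(tar\.gz|zip)$", "", path.name)
            available.add(normalise(stem.rsplit("-", 1)[0]))

    missing = []
    for line in Path(requirements_file).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # pip freeze writes name==version; keep only the name.
        name = normalise(line.split("==")[0])
        if name not in available:
            missing.append(name)
    return missing
```

An empty list from missing_packages("requirements.txt", ".") suggests the directory is complete. It only matches names, not versions or platforms, so the Docker test above is still the real proof.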