These are just random thoughts that have been occurring to me over the last few days! They are subjective, and my opinions may change tomorrow.

ML Stacks

  • ML stacks are so fragile that any modification of the environment can bring the whole show crashing down; cue the rise of containers and tools such as Conda. Pre-built images on cloud platforms are often limited by the configuration expertise of whoever assembled them and can miss vital tools, which means extra work before they function usefully. DIY builds are the way I go now; far less hassle.
  • GCP should offer base images with specific CUDA/cuDNN versions and nothing else beyond the base OS, letting you build up from there. At the moment they sort of offer images with CUDA pre-installed, but they’re missing cuDNN and feel half-finished.
  • Many public ML images on GCP lack basics such as pandas, NumPy, matplotlib, opencv-python and all the supporting packages; they are little more than TensorFlow plus a GPU driver thrown together (a quick sanity-check sketch follows this list).
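
To make that last point concrete, this is the kind of check I run on a freshly provisioned image before trusting it. A minimal sketch: the package list is just my own baseline, so adjust it to whatever your stack actually needs.

    # Report which baseline packages a fresh image can actually import,
    # and at what version. The package list is illustrative.
    import importlib

    PACKAGES = ["numpy", "pandas", "matplotlib", "cv2", "tensorflow"]

    for name in PACKAGES:
        try:
            module = importlib.import_module(name)
            print("{:<12} OK ({})".format(name, getattr(module, "__version__", "unknown")))
        except ImportError as exc:
            print("{:<12} MISSING ({})".format(name, exc))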

TensorFlow

  • BSD isn’t the only thing to have come out of Berkeley it seems.
  • TensorFlow builds are tied to a specific Python version and to specific package releases (as well as to the notorious Bazel).
  • The pre-built wheels target Compute Capability (CC) 3.5 by default; CC 3.0 (older GTX cards) requires a source build, and pre-3.0 cards have no CUDA support at all in current versions. The bump may well be down to optimisations the TF team wanted, but either way you’ll need to build your own wheel… and face Bazel!
  • TensorFlow versions are entangled with specific CUDA/cuDNN versions (TF 1.13 pairs with CUDA 10.0, earlier releases with CUDA 9, TF 2.1 with CUDA 10.1, and so on), which makes upgrading unpleasant, since CUDA drivers can be hard enough to get functional in the first place (a version-check sketch follows this list).
  • Building TensorFlow with Bazel is not pleasant; nor is Bazel itself. Who needs another build tool, especially one so incredibly unfriendly? It is so unwieldy that Bazelisk (itself requiring numerous dependencies) exists just to tame it for TF!
  • Building from source fails on Ubuntu due (again) to poor Debian packaging: the cudnn.h issue that seems to have been around for years. Yes, the library works at runtime, but the packages omit the developer components (the headers). Not purely a TF issue, admittedly.
  • Keras should have remained a separate library – folding it into TensorFlow is just confusing (but not surprising).
  • Keras (in its previous incarnation) constantly broke, or worse still changed functionality, without warning: the “no attribute ‘get_graph’” errors in 2.3.x, pre-defined models dropped then re-added, and so on.
  • Given its present state, I have no confidence that another rewrite of TensorFlow (from 2 to 3) is not on the cards.
  • Google offer no wheels built without AVX, isolating older servers/workstations; I had to build custom wheels, exposing myself to Bazel (which itself needs its own version manager just to stay on top of TensorFlow!). To be fair, I guess this is to be expected.
  • Too many ways to skin a cat in TensorFlow. Orthogonal it is not; in fact it is the complete opposite, with sometimes three or more ways to import a similar function (same name, same behaviour) from Keras, v1, v2 and whatever comes next (see the import example after this list). Of course Stack Overflow will chide you into being consistent about naming and scope, but that isn’t really the point.
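
On the version-entanglement point: recent TF 2.x releases can at least report which CUDA/cuDNN a wheel was compiled against. A minimal sketch; tf.sysconfig.get_build_info() only exists in newer 2.x releases, hence the guard.

    import tensorflow as tf

    print("TensorFlow:", tf.__version__)
    print("Built with CUDA:", tf.test.is_built_with_cuda())

    # get_build_info() is only present in recent TF 2.x releases.
    build_info = getattr(tf.sysconfig, "get_build_info", None)
    if build_info is not None:
        info = build_info()
        print("CUDA:", info.get("cuda_version"))
        print("cuDNN:", info.get("cudnn_version"))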
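
And on the too-many-ways point, all of the following can resolve to “the same” Dense layer depending on era and installation (the standalone keras package is a separate install and may be absent, so that line is left commented out):

    # Several import paths for what is effectively the same layer.
    from tensorflow.keras.layers import Dense         # the blessed TF 2.x path
    from tensorflow.python.keras.layers import Dense  # internal path that also works
    # from keras.layers import Dense                  # standalone Keras, if installed

    import tensorflow as tf
    layer = tf.keras.layers.Dense(10)                 # the attribute-access spelling
    print(type(layer))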

CUDA

  • NVIDIA have not provided a consistent, straightforward way to completely remove CUDA/cuDNN from a build; many times you will simply have to rebuild the entire machine. Their entire instructions amount to (in Debian terms) apt remove cuda… oh, sure.
  • CUDA/cuDNN cannot simply be wget/curl’d to an instance, as the downloads (cuDNN in particular) require web authentication! Yes, there are ways around it with cookie manipulation, but still… what are they thinking?
  • You always need to pay attention to the drivers, the frameworks and the card(s) in question, as well as their relation to specific TensorFlow builds!
  • The Debian packages are broken: installing CUDA 10.0 can result in a 10.2 install (mispackaged). A quick probe of what is actually on the machine follows this list.
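
Given that mispackaging, it is worth probing which libraries are actually loadable rather than trusting the package name. A minimal sketch; the soname list is illustrative, so extend it to match your setup.

    # Probe which CUDA runtime / cuDNN sonames the dynamic linker can load.
    import ctypes

    CANDIDATES = [
        "libcudart.so.9.0", "libcudart.so.10.0",
        "libcudart.so.10.1", "libcudart.so.10.2",
        "libcudnn.so.7",
    ]

    for soname in CANDIDATES:
        try:
            ctypes.CDLL(soname)
            print("{:<20} loadable".format(soname))
        except OSError:
            print("{:<20} not found".format(soname))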

Python

The path to using interpreted, “instant feedback” languages with inferior tools is the path to ruin. Yes, Python may well “work”, but at what cost? Being reluctantly bound to Python (through its libraries) is like being put in chains. I once dismissed Scala as the new “PL/1”, but Scala had the last laugh there; Python is the BASIC of the new generation.

  • Python 3.6 breaks 3.5 environments completely (“illegal opcode” errors, for example), and upgrading Python is a nightmare: all of its compiled libraries need to be upgraded as well, which rapidly becomes reminiscent of DLL hell (see the fingerprint sketch after this list).
  • There is no defined upgrade mechanism and PPA support is patchy; the best option is to build from source, outside of any package mechanism… in 2020???
  • Forced indentation is just plain frustrating; it imposes a style on the user, who may well wish to indent in other ways.
  • Upgrading Python will break TensorFlow (and hence CUDA/GPU access) and all sorts of other functionality and libraries, you name it. You will likely need a CUDA driver upgrade too, meaning a potential trashing of your existing environment.
  • Without its libraries (pandas, for example) it comes across as a third-rate kludge with no elegance and no discernible benefit over a second-rate scripting language. Years with Java have truly made me appreciate consistency, even at the expense of huge functionality rewrites; the JVM seems close to perfection (I have never really thought otherwise). Python is arguably successful only because of its massive adoption and undisciplined coding culture; the language itself is nothing special, and its supporting architecture borders on atrocious.
  • pip(3) blindly installs tensorflow irrespective of your base CUDA environment. That is, it may well simply not work, as TF will try to open versions of the dependent CUDA libraries that do not exist on the machine!
  • Jupyter is at best a mediocre solution – neither fish nor fowl.
  • Without PyCharm I’d be lost; it is fantastic. I have used JetBrains tools since PhpStorm (2011), through IntelliJ IDEA and Android Studio: great consistency, a fantastic IDE, and JetBrains are amazingly responsive to comments and criticism – you can email the person responsible for something directly and receive a reply.
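
On the upgrade-breakage points: compiled wheels are tagged against a specific interpreter ABI, which is why a 3.5 environment dies wholesale under 3.6, and why a pip-installed TensorFlow can still fail at import time. A small fingerprint sketch (the SOABI value assumes CPython on Linux):

    # Fingerprint the interpreter ABI, then check whether TensorFlow
    # can actually load its compiled dependencies under it.
    import sys
    import sysconfig

    print("Python :", sys.version.split()[0])
    print("ABI tag:", sysconfig.get_config_var("SOABI"))  # e.g. cpython-36m-x86_64-linux-gnu

    try:
        import tensorflow as tf
        print("TensorFlow", tf.__version__, "imported cleanly")
    except ImportError as exc:
        # Typically a missing CUDA soname, per the pip point above.
        print("TensorFlow failed to import:", exc)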

GCP

  • Google’s base ML Compute Engine image is still on Python 3.5 and TensorFlow 1.9? Do they ever update these things?
  • Pre-built images often fail with “resource level errors” during instance creation (often with K80s/P4s) – GPU availability issues, perhaps?
  • GCP disks cannot be migrated directly across regions; the workaround is to snapshot and recreate (a sketch follows this list).
  • GCP attached-disk latency seems high?
  • GPU model availability seems to differ not only from region to region but from build to build (to be fair, this may well be an allocation issue, but I have my suspicions).
  • Pre-emptible instances cannot be converted to standard instances (nor vice versa).
  • Pre-emptible GPU instances often last just minutes before being pre-empted (an hour of uptime is unusual!).
  • In their favour: reliability is excellent, and once a build is finalised the instances are very easy to work with. My overall impression of GCP is pretty good, and my quibbles here are minor.
  • Importing a GCP image (where the account holder has granted usage permissions) doesn’t seem to make the image readily available from your own account – I seem to have to re-add it through the initial link each time I need it. This could just be me not having figured out how to keep it resident.
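
For the disk-migration point, the snapshot-and-recreate workaround relies on snapshots being global resources. A rough sketch using the google-api-python-client library; the project, zone and disk names are placeholders, and real code should poll the returned operations for completion rather than firing and forgetting.

    # Move a disk across regions: snapshot it (snapshots are global),
    # then recreate it in a zone of the target region.
    from googleapiclient import discovery

    compute = discovery.build("compute", "v1")  # application default credentials

    PROJECT = "my-project"       # placeholder
    SRC_ZONE = "us-east1-b"      # placeholder
    DST_ZONE = "europe-west4-a"  # placeholder

    # 1. Snapshot the source disk (asynchronous: poll the operation in real code).
    compute.disks().createSnapshot(
        project=PROJECT, zone=SRC_ZONE, disk="ml-disk",
        body={"name": "ml-disk-snap"},
    ).execute()

    # 2. Once the snapshot is READY, recreate the disk in the destination zone.
    compute.disks().insert(
        project=PROJECT, zone=DST_ZONE,
        body={"name": "ml-disk-moved",
              "sourceSnapshot": "global/snapshots/ml-disk-snap"},
    ).execute()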