{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "Sszrzyybok2v"
},
"outputs": [],
"source": [
"################ template to run PySpark on Colab #######################"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {},
"colab_type": "code",
"executionInfo": {
"elapsed": 52545,
"status": "ok",
"timestamp": 1592470596242,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "0WGeMNr-ok5y"
},
"outputs": [],
"source": [
"!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n",
"!wget -q https://www-us.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz\n",
"!tar xf spark-2.4.5-bin-hadoop2.7.tgz\n",
"!pip install -q findspark"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {},
"colab_type": "code",
"executionInfo": {
"elapsed": 52539,
"status": "ok",
"timestamp": 1592470596245,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "S33Pv2YDokyN"
},
"outputs": [],
"source": [
"import os\n",
"os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n",
"os.environ[\"SPARK_HOME\"] = \"/content/spark-2.4.5-bin-hadoop2.7\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {},
"colab_type": "code",
"executionInfo": {
"elapsed": 57533,
"status": "ok",
"timestamp": 1592470601245,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "VIQewOA8oktH"
},
"outputs": [],
"source": [
"import findspark\n",
"findspark.init()\n",
"\n",
"from pyspark.sql import SparkSession\n",
"spark = SparkSession.builder.master(\"local[*]\").getOrCreate()\n",
"spark1 = SparkSession.builder.appName('basic').getOrCreate()\n",
"#Test must give no error"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {},
"colab_type": "code",
"executionInfo": {
"elapsed": 57532,
"status": "ok",
"timestamp": 1592470601248,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "0VyUq14yokoz"
},
"outputs": [],
"source": [
"import pyspark"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {},
"colab_type": "code",
"executionInfo": {
"elapsed": 57530,
"status": "ok",
"timestamp": 1592470601249,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "CAbqiCrDokkX"
},
"outputs": [],
"source": [
"from pyspark import SparkConf, SparkContext\n",
"conf = SparkConf().setAppName(\"basic\").setMaster(\"local\")\n",
"#sc = SparkContext(conf=conf) ## for jupyter and Databricks\n",
"sc = SparkContext.getOrCreate() ## for Colab"
]
},
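{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, both entry points should report the same Spark version (a minimal sketch):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check: the SparkSession and the SparkContext expose the same version\n",
"print(spark.version)\n",
"print(sc.version)"
]
},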
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {},
"colab_type": "code",
"executionInfo": {
"elapsed": 57527,
"status": "ok",
"timestamp": 1592470601250,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "N55X_8Rgokia"
},
"outputs": [],
"source": [
"from pyspark.sql.types import *"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "LFRcBVBGoGUd"
},
"source": [
"# SQL e Dataframe \n",
"Lo Structured Query Language (SQL) è un linguaggio standardizzato per poter interrogare ed operare su Database relazionai. SQL Supporta le seguenti operazioni:\n",
"\n",
"* creare e modificare schemi di database \n",
"* inserire, modificare e gestire dati memorizzati\n",
"* interrogare i dati memorizzati\n",
"* creare e gestire strumenti di controllo e accesso ai dati\n",
"\n",
"E' possibile utilizzare query SQL su di un Dataframe Spark"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {},
"colab_type": "code",
"executionInfo": {
"elapsed": 3282,
"status": "ok",
"timestamp": 1592470604547,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "0AxxLf9AoGUj"
},
"outputs": [],
"source": [
"data = [(\"Gianluca\", \"M\", 23, 174, 70.5),\n",
" (\"Andrea\", \"M\", 37, 179, 68.),\n",
" (\"Marco\", \"M\", 33, 172, 88.5),\n",
" (\"Annalisa\", \"F\", 38, 155, 50.2),\n",
" (\"Monica\", \"F\", 25, 165, 54.3)]\n",
"\n",
"df = spark.createDataFrame(data, [\"name\", \"gender\", \"age\", \"height\",\"weight\"])"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ZqOpmFk2oGUn"
},
"source": [
"## Creare una View\n",
"Una view è una tabella virtuale, solitamente frutto di un'operazione di SELECT tra più tabelle. Non viene salvata su disco ma risiede in memoria e al termine della sessione viene cancellata. Possiamo creare una View partendo da un Dataframe usando il metodo *.createTempView(name)*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "GTbO2hQcoGUo"
},
"outputs": [],
"source": [
"df.createTempView(\"people\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "7eTPnHjGoGUq"
},
"source": [
"Se riproviamo a ricreare la stessa view con lo stesso metodo, otterremo un'eccezione che ci informa del fatto che tale view esiste già."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "qd1c9tRmoGUr",
"outputId": "1186e84f-964a-4ba0-9e09-4470d2d02506"
},
"outputs": [
{
"ename": "AnalysisException",
"evalue": "\"Temporary view 'people' already exists;\"",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mPy4JJavaError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m~/spark/python/pyspark/sql/utils.py\u001b[0m in \u001b[0;36mdeco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m 62\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 63\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkw\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 64\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mpy4j\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mprotocol\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mPy4JJavaError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/.local/lib/python3.6/site-packages/py4j/protocol.py\u001b[0m in \u001b[0;36mget_return_value\u001b[0;34m(answer, gateway_client, target_id, name)\u001b[0m\n\u001b[1;32m 327\u001b[0m \u001b[0;34m\"An error occurred while calling {0}{1}{2}.\\n\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 328\u001b[0;31m format(target_id, \".\", name), value)\n\u001b[0m\u001b[1;32m 329\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mPy4JJavaError\u001b[0m: An error occurred while calling o40.createTempView.\n: org.apache.spark.sql.catalyst.analysis.TempTableAlreadyExistsException: Temporary view 'people' already exists;\n\tat org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTempView(SessionCatalog.scala:495)\n\tat org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:146)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)\n\tat org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)\n\tat org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)\n\tat org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)\n\tat org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)\n\tat org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)\n\tat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)\n\tat org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)\n\tat org.apache.spark.sql.Dataset.(Dataset.scala:194)\n\tat org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)\n\tat org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3406)\n\tat org.apache.spark.sql.Dataset.createTempView(Dataset.scala:3082)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:282)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:748)\n",
"\nDuring handling of the above exception, another exception occurred:\n",
"\u001b[0;31mAnalysisException\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcreateTempView\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"people\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m~/spark/python/pyspark/sql/dataframe.py\u001b[0m in \u001b[0;36mcreateTempView\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m 159\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 160\u001b[0m \"\"\"\n\u001b[0;32m--> 161\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcreateTempView\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 162\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 163\u001b[0m \u001b[0;34m@\u001b[0m\u001b[0msince\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m2.0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/.local/lib/python3.6/site-packages/py4j/java_gateway.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args)\u001b[0m\n\u001b[1;32m 1284\u001b[0m \u001b[0manswer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgateway_client\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msend_command\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcommand\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1285\u001b[0m return_value = get_return_value(\n\u001b[0;32m-> 1286\u001b[0;31m answer, self.gateway_client, self.target_id, self.name)\n\u001b[0m\u001b[1;32m 1287\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1288\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mtemp_arg\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mtemp_args\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/spark/python/pyspark/sql/utils.py\u001b[0m in \u001b[0;36mdeco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m 69\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mAnalysisException\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m': '\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstackTrace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 70\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstartswith\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'org.apache.spark.sql.catalyst.analysis'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 71\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mAnalysisException\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m': '\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstackTrace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 72\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstartswith\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'org.apache.spark.sql.catalyst.parser.ParseException: '\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 73\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mParseException\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m': '\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstackTrace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mAnalysisException\u001b[0m: \"Temporary view 'people' already exists;\""
]
}
],
"source": [
"df.createTempView(\"people\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "gZyk5EzjoGUt"
},
"source": [
"In questi casi ci conviene usare il metodo *.createOrReplaceTempView(name)*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "bb_SmSLQoGUu"
},
"outputs": [],
"source": [
"df.createOrReplaceTempView(\"People\")"
]
},
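{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an alternative, an existing temporary view can be dropped explicitly with *spark.catalog.dropTempView(name)*; a minimal sketch, assuming the view created above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: drop the temp view explicitly, then recreate it with createTempView.\n",
"# dropTempView returns True if the view existed, False otherwise.\n",
"spark.catalog.dropTempView(\"People\")\n",
"df.createTempView(\"People\")  # succeeds now that the old view is gone"
]
},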
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "3_yTveZwoGUw"
},
"source": [
"## Comandi di selezione\n",
"I comandi SQL sono gli stessi sia per le tabelle che per le view, dato che quest'ultime non sono altro che tabelle virtuali."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Fsr2krqAoGUx"
},
"source": [
"Possiamo selezionare una colonna della view usando l'istruzione *SELECT column FROM view*."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1495,
"status": "ok",
"timestamp": 1592470632305,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "ejfn5ngLoGUx",
"outputId": "0a4b66f3-b16e-498a-be7c-1aad30b5fcca"
},
"outputs": [
{
"data": {
"text/plain": [
"pyspark.sql.dataframe.DataFrame"
]
},
"execution_count": 10,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"df_sql = spark.sql(\"SELECT name FROM People\")\n",
"type(df_sql)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "fuG47wauoGU1"
},
"source": [
"Il risultato della query sarà sempre un Dataframe, che quindi possiamo stampare usa"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 187
},
"colab_type": "code",
"executionInfo": {
"elapsed": 2761,
"status": "ok",
"timestamp": 1592470652404,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "8kA-Ek0UoGU1",
"outputId": "20a91665-3e42-42c5-82e9-c332fb08534e"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+\n",
"| name|\n",
"+--------+\n",
"|Gianluca|\n",
"| Andrea|\n",
"| Marco|\n",
"|Annalisa|\n",
"| Monica|\n",
"+--------+\n",
"\n"
]
}
],
"source": [
"df_sql.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "4Kj9cP1hoGU3"
},
"source": [
"**NOTA BENE** SQL è case-insensitive, quindi non fa distinsione tra maiuscole e minuscole."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 187
},
"colab_type": "code",
"executionInfo": {
"elapsed": 780,
"status": "ok",
"timestamp": 1592470655586,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "v9g940MZoGU4",
"outputId": "098e3f99-0fda-462e-81c2-b8ae0fa3a41b"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+\n",
"| nAmE|\n",
"+--------+\n",
"|Gianluca|\n",
"| Andrea|\n",
"| Marco|\n",
"|Annalisa|\n",
"| Monica|\n",
"+--------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"Select nAmE FRoM people\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "snMXq_9woGU8"
},
"source": [
"Possiamo utilizzare lo stesso comando anche per selezionare più colonne."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 187
},
"colab_type": "code",
"executionInfo": {
"elapsed": 884,
"status": "ok",
"timestamp": 1592470665070,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "ESkeViKboGU8",
"outputId": "1d179beb-3e94-4f47-a8cc-79360f2357ee"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+---+\n",
"| name|age|\n",
"+--------+---+\n",
"|Gianluca| 23|\n",
"| Andrea| 37|\n",
"| Marco| 33|\n",
"|Annalisa| 38|\n",
"| Monica| 25|\n",
"+--------+---+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT name, age FROM People\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "snhd5ZteoGU-"
},
"source": [
"Per selezionarle tutte possiamo utilizzare il selettore *."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 187
},
"colab_type": "code",
"executionInfo": {
"elapsed": 890,
"status": "ok",
"timestamp": 1592470669343,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "rupLEJPQoGU_",
"outputId": "0cf85b92-f523-4aad-cef9-e82dac845ba0"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+------+---+------+------+\n",
"| name|gender|age|height|weight|\n",
"+--------+------+---+------+------+\n",
"|Gianluca| M| 23| 174| 70.5|\n",
"| Andrea| M| 37| 179| 68.0|\n",
"| Marco| M| 33| 172| 88.5|\n",
"|Annalisa| F| 38| 155| 50.2|\n",
"| Monica| F| 25| 165| 54.3|\n",
"+--------+------+---+------+------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT * FROM People\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ZG1pdMuyoGVC"
},
"source": [
"Possiamo effettuare una selezione basata su una condizione utilizzando il comando WHERE, ad esempio selezioniamo nome e altezza solo per gli uomini."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 153
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1220,
"status": "ok",
"timestamp": 1592470680993,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "n9-kfzYboGVC",
"outputId": "cead653f-5d70-4535-f1f7-50b92ca76fc1"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+------+\n",
"| name|height|\n",
"+--------+------+\n",
"|Gianluca| 174|\n",
"| Andrea| 179|\n",
"| Marco| 172|\n",
"+--------+------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT name, height FROM People WHERE gender='M'\").show()"
]
},
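{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, the same selection can be expressed with the DataFrame API instead of SQL; a minimal sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# DataFrame-API equivalent of the SQL query above\n",
"df.filter(df.gender == \"M\").select(\"name\", \"height\").show()"
]
},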
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "4zw1wFdKoGVF"
},
"source": [
"Possiamo ordinare il risultato in base ai valori di una colonna utilizzando il comando ORDER BY, ad esempio ordiniamo le righe in base al peso."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 187
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1132,
"status": "ok",
"timestamp": 1592470683183,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "PIqg_F0-oGVF",
"outputId": "df1636ff-07ac-49d6-98f7-82bb73a29b20"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+------+---+------+------+\n",
"| name|gender|age|height|weight|\n",
"+--------+------+---+------+------+\n",
"|Annalisa| F| 38| 155| 50.2|\n",
"| Monica| F| 25| 165| 54.3|\n",
"| Andrea| M| 37| 179| 68.0|\n",
"|Gianluca| M| 23| 174| 70.5|\n",
"| Marco| M| 33| 172| 88.5|\n",
"+--------+------+---+------+------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT * FROM People ORDER BY weight\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "m3tX1oUvoGVH"
},
"source": [
"di default l'ordinamento viene effettuato in maniera ascendente (cioè dal valore più piccolo a quello più grande), se vogliamo effettuare un'ordinamento discendente ci basta aggiungere DESC."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 187
},
"colab_type": "code",
"executionInfo": {
"elapsed": 860,
"status": "ok",
"timestamp": 1592470684688,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "witK5iOsoGVI",
"outputId": "73cd84b1-9441-4f4a-d985-43ecd7040c12"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+------+---+------+------+\n",
"| name|gender|age|height|weight|\n",
"+--------+------+---+------+------+\n",
"| Marco| M| 33| 172| 88.5|\n",
"|Gianluca| M| 23| 174| 70.5|\n",
"| Andrea| M| 37| 179| 68.0|\n",
"| Monica| F| 25| 165| 54.3|\n",
"|Annalisa| F| 38| 155| 50.2|\n",
"+--------+------+---+------+------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT * FROM People ORDER BY weight DESC\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "YPGnjm6boGVL"
},
"source": [
"Se vogliamo limitare il numero di righe restituite dalla nostra query possiamo utilizzare LIMIT, ad esempio basandoci sulla query eseguita appena sopra limitiamo il risultato alle sole prime 3 righe."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 153
},
"colab_type": "code",
"executionInfo": {
"elapsed": 994,
"status": "ok",
"timestamp": 1592470687978,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "tSGJ6_fkoGVL",
"outputId": "8f44023b-5dcf-4fa6-cddf-763212f6b4ed"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+------+---+------+------+\n",
"| name|gender|age|height|weight|\n",
"+--------+------+---+------+------+\n",
"| Marco| M| 33| 172| 88.5|\n",
"|Gianluca| M| 23| 174| 70.5|\n",
"| Andrea| M| 37| 179| 68.0|\n",
"+--------+------+---+------+------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT * FROM People ORDER BY weight DESC LIMIT 3\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "KyXzhbXcoGVO"
},
"source": [
"Per contare il numero di risultati possiamo utilizzare count, ad esempio contiamo il numero di persone di sesso maschile e più alte di 175 cm."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 119
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1309,
"status": "ok",
"timestamp": 1592470690157,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "y8GFMSgIoGVP",
"outputId": "4af5b67c-3efa-46ab-ff70-840c48ee0d53"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+\n",
"|count(1)|\n",
"+--------+\n",
"| 1|\n",
"+--------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT count(*) FROM People WHERE gender='M' and height>175\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Y2jKFvzeoGVR"
},
"source": [
"In questo caso il nome della colonna (count(1)) viene impostato automaticamente, se vogliamo modificarlo possiamo creare un alias con il comando AS, ad esempio modifichiamo in counter."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 119
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1530,
"status": "ok",
"timestamp": 1592470691753,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "eDVWKxREoGVS",
"outputId": "5abfe129-3685-4116-b077-53b59c03414d"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------+\n",
"|counter|\n",
"+-------+\n",
"| 1|\n",
"+-------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT count(*) AS counter FROM People WHERE gender='M' and height>175\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Heq6yYQnoGVU"
},
"source": [
"Per concludere vediamo una serie di funzioni statistiche, come la media (avg), usiamola per calolare il peso medio degli uomini."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 119
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1372,
"status": "ok",
"timestamp": 1592470693831,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "We_1ZYaboGVU",
"outputId": "c50b4bc7-490f-4540-a709-86260b0a0f41"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----------------+\n",
"| avg_weight|\n",
"+-----------------+\n",
"|75.66666666666667|\n",
"+-----------------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT avg(weight) AS avg_weight FROM People WHERE gender='M'\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ZD2rQWQioGVX"
},
"source": [
"oppure peso massimo (max) e minimo (min), sempre per i soli uomini"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 119
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1032,
"status": "ok",
"timestamp": 1592470695415,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "o1-mnVVpoGVY",
"outputId": "7fa092ab-ed7e-4048-b12b-f14208183277"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+----------+----------+\n",
"|max_weight|min_weight|\n",
"+----------+----------+\n",
"| 88.5| 68.0|\n",
"+----------+----------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT max(weight) AS max_weight, min(weight) AS min_weight FROM People WHERE gender='M'\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "1XbOS9_6oGVZ"
},
"source": [
"Utilizzando il comando GROUP BY possiamo raggruppare le righe in base al valore di una colonna, ad esempio raggruppiamo in base al sesso e poi calcoliamo peso medio, massimo e minimo."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 136
},
"colab_type": "code",
"executionInfo": {
"elapsed": 3386,
"status": "ok",
"timestamp": 1592470699293,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "5IIscRrioGVa",
"outputId": "4890b845-a1f0-4381-a8c0-9a0e4f9e45ce"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------+-----------------+----------+----------+\n",
"|gender| avg_weight|max_weight|min_weight|\n",
"+------+-----------------+----------+----------+\n",
"| F| 52.25| 54.3| 50.2|\n",
"| M|75.66666666666667| 88.5| 68.0|\n",
"+------+-----------------+----------+----------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT gender, avg(weight) AS avg_weight, max(weight) AS max_weight, min(weight) AS min_weight FROM People GROUP BY gender\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "t4MCtiSDoGVd"
},
"source": [
"Il peso medio per l'uomo ha troppe cifre dopo la virgola, arrotondiamole utilizzando la funzione round."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 136
},
"colab_type": "code",
"executionInfo": {
"elapsed": 3246,
"status": "ok",
"timestamp": 1592470700945,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "G7tUwj13oGVd",
"outputId": "238cd928-9cad-4b41-814f-a034e8ed5dc5"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------+----------+----------+----------+\n",
"|gender|avg_weight|max_weight|min_weight|\n",
"+------+----------+----------+----------+\n",
"| F| 52.25| 54.3| 50.2|\n",
"| M| 75.67| 88.5| 68.0|\n",
"+------+----------+----------+----------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT gender, round(avg(weight), 2) AS avg_weight, max(weight) AS max_weight, min(weight) AS min_weight FROM People GROUP BY gender\").show()"
]
},
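{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, the same aggregation can be written with the DataFrame API via *groupBy* and *agg*; a minimal sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.sql import functions as F\n",
"\n",
"# DataFrame-API equivalent of the GROUP BY query above\n",
"df.groupBy(\"gender\").agg(\n",
"    F.round(F.avg(\"weight\"), 2).alias(\"avg_weight\"),\n",
"    F.max(\"weight\").alias(\"max_weight\"),\n",
"    F.min(\"weight\").alias(\"min_weight\")\n",
").show()"
]
},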
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "bD_H5k1FoGVf"
},
"source": [
"## Creare una tabella permanente\n",
"Come abbiamo già detto, la view viene salvata in RAM e al termine della sessione viene perduta, per creare una tabella permanente in memoria partendo da un Dataframe possiamo usare il metodo *.saveAsTable(name).*, per usare questo metodo dobbiamo importare la classe *DataFrameWriter*."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"colab": {},
"colab_type": "code",
"executionInfo": {
"elapsed": 3255,
"status": "ok",
"timestamp": 1592470702099,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "0I-edEsIoGVf"
},
"outputs": [],
"source": [
"from pyspark.sql import DataFrameWriter\n",
"\n",
"df.write.saveAsTable(\"People\")"
]
},
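{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also check that the table was registered by inspecting the catalog; a small sketch using *spark.catalog.listTables()*:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# List the tables and temporary views registered in the current catalog\n",
"for t in spark.catalog.listTables():\n",
"    print(t.name, t.tableType, t.isTemporary)"
]
},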
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "0X9qeP8EoGVi"
},
"source": [
"Adesso nella directory nella quale stiamo lavorando dovremmo trovarci una cartella spark-warehouse contentente un'altra cartella people"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"executionInfo": {
"elapsed": 2842,
"status": "ok",
"timestamp": 1592470715980,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "YLDt_fwsoGVj",
"outputId": "aba0e327-ca7c-4713-d367-82e9e46975d3"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"people\n"
]
}
],
"source": [
"!ls spark-warehouse/"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"colab": {},
"colab_type": "code",
"executionInfo": {
"elapsed": 794,
"status": "ok",
"timestamp": 1592470736595,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "YLj1NkkTpt4j"
},
"outputs": [],
"source": [
"df.write.saveAsTable(\"People2\")"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"executionInfo": {
"elapsed": 2103,
"status": "ok",
"timestamp": 1592470738642,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "5pcJ3TKBpwKw",
"outputId": "16520a8a-78b9-4d3d-ca7e-fc6387c0b213"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"people\tpeople2\n"
]
}
],
"source": [
"!ls spark-warehouse/"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "zr4hGhpIoGVl"
},
"source": [
" e qui si trova il nostro database, se proviamo a creare nuovamente la tabella otterremo un'errore, che ci informa che questa già esiste."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 681
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1040,
"status": "error",
"timestamp": 1592470757218,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "SQClHz-BoGVl",
"outputId": "80335e27-d718-4ef4-c9c6-bb6cc8e39e9d"
},
"outputs": [
{
"ename": "AnalysisException",
"evalue": "ignored",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mPy4JJavaError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m/content/spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/utils.py\u001b[0m in \u001b[0;36mdeco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m 62\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 63\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkw\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 64\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mpy4j\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mprotocol\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mPy4JJavaError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/content/spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py\u001b[0m in \u001b[0;36mget_return_value\u001b[0;34m(answer, gateway_client, target_id, name)\u001b[0m\n\u001b[1;32m 327\u001b[0m \u001b[0;34m\"An error occurred while calling {0}{1}{2}.\\n\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 328\u001b[0;31m format(target_id, \".\", name), value)\n\u001b[0m\u001b[1;32m 329\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mPy4JJavaError\u001b[0m: An error occurred while calling o95.saveAsTable.\n: org.apache.spark.sql.AnalysisException: Table `People` already exists.;\n\tat org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:424)\n\tat org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:409)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:282)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:748)\n",
"\nDuring handling of the above exception, another exception occurred:\n",
"\u001b[0;31mAnalysisException\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwrite\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msaveAsTable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"People\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/content/spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/readwriter.py\u001b[0m in \u001b[0;36msaveAsTable\u001b[0;34m(self, name, format, mode, partitionBy, **options)\u001b[0m\n\u001b[1;32m 776\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mformat\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 777\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 778\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jwrite\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msaveAsTable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 779\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 780\u001b[0m \u001b[0;34m@\u001b[0m\u001b[0msince\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1.4\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/content/spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args)\u001b[0m\n\u001b[1;32m 1255\u001b[0m \u001b[0manswer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgateway_client\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msend_command\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcommand\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1256\u001b[0m return_value = get_return_value(\n\u001b[0;32m-> 1257\u001b[0;31m answer, self.gateway_client, self.target_id, self.name)\n\u001b[0m\u001b[1;32m 1258\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1259\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mtemp_arg\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mtemp_args\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/content/spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/utils.py\u001b[0m in \u001b[0;36mdeco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m 67\u001b[0m e.java_exception.getStackTrace()))\n\u001b[1;32m 68\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstartswith\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'org.apache.spark.sql.AnalysisException: '\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 69\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mAnalysisException\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m': '\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstackTrace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 70\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstartswith\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'org.apache.spark.sql.catalyst.analysis'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 71\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mAnalysisException\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m': '\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstackTrace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mAnalysisException\u001b[0m: 'Table `People` already exists.;'"
]
}
],
"source": [
"df.write.saveAsTable(\"People\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "cdgM_HBroGVn"
},
"source": [
"In questi casi dobbiamo utilizzare il metodo *.mode(m)*, passando al suo interno:\n",
"* 'append': per eseguire un join con la tabella già esistente, se possibile.\n",
"* 'overwrite': per sovrascrivere la tabella già esistente."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"colab": {},
"colab_type": "code",
"executionInfo": {
"elapsed": 916,
"status": "ok",
"timestamp": 1592470766282,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "mqotrssRoGVo"
},
"outputs": [],
"source": [
"df.write.mode('append').saveAsTable(\"People\")"
]
},
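{
"cell_type": "markdown",
"metadata": {},
"source": [
"The 'overwrite' mode works the same way; a minimal sketch, applied to the People2 table created earlier so the People table above is left untouched:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: overwrite replaces the contents of an existing table\n",
"df.write.mode('overwrite').saveAsTable(\"People2\")"
]
},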
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "-2HB2--loGVq"
},
"source": [
"Le istruzioni di selezione per le tabelle sono esattamente le stesse di quelle che abbiamo visto su per le view."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 187
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1084,
"status": "ok",
"timestamp": 1592470783410,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "uorfMBUEoGVq",
"outputId": "abdb46c4-648f-4c62-c981-aec0ba93e8b2"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+------+---+------+------+\n",
"| name|gender|age|height|weight|\n",
"+--------+------+---+------+------+\n",
"|Gianluca| M| 23| 174| 70.5|\n",
"| Andrea| M| 37| 179| 68.0|\n",
"| Marco| M| 33| 172| 88.5|\n",
"|Annalisa| F| 38| 155| 50.2|\n",
"| Monica| F| 25| 165| 54.3|\n",
"+--------+------+---+------+------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT * FROM People\").show()"
]
}
],
"metadata": {
"colab": {
"name": "pyspark_dataframe_sql.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 1
}