{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "lmJ97hWzo2zF" }, "source": [ "\"Open" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "Sszrzyybok2v" }, "outputs": [], "source": [ "################ template to run PySpark on Colab #######################" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": {}, "colab_type": "code", "executionInfo": { "elapsed": 52545, "status": "ok", "timestamp": 1592470596242, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "0WGeMNr-ok5y" }, "outputs": [], "source": [ "!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n", "!wget -q https://www-us.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz\n", "!tar xf spark-2.4.5-bin-hadoop2.7.tgz\n", "!pip install -q findspark" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": {}, "colab_type": "code", "executionInfo": { "elapsed": 52539, "status": "ok", "timestamp": 1592470596245, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "S33Pv2YDokyN" }, "outputs": [], "source": [ "import os\n", "os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n", "os.environ[\"SPARK_HOME\"] = \"/content/spark-2.4.5-bin-hadoop2.7\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": {}, "colab_type": "code", "executionInfo": { "elapsed": 57533, "status": "ok", "timestamp": 1592470601245, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "VIQewOA8oktH" }, "outputs": [], "source": [ "import findspark\n", "findspark.init()\n", "\n", "from pyspark.sql import SparkSession\n", "spark = SparkSession.builder.master(\"local[*]\").getOrCreate()\n", "spark1 = SparkSession.builder.appName('basic').getOrCreate()\n", "#Test must give no error" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": {}, "colab_type": "code", "executionInfo": { "elapsed": 57532, "status": "ok", "timestamp": 1592470601248, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "0VyUq14yokoz" }, "outputs": [], "source": [ "import pyspark" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": {}, "colab_type": "code", "executionInfo": { "elapsed": 57530, "status": "ok", "timestamp": 1592470601249, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "CAbqiCrDokkX" }, "outputs": [], "source": [ "from pyspark import SparkConf, SparkContext\n", "conf = SparkConf().setAppName(\"basic\").setMaster(\"local\")\n", "#sc = SparkContext(conf=conf) ## for jupyter and Databricks\n", "sc = SparkContext.getOrCreate() ## for Colab" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": {}, "colab_type": "code", "executionInfo": { "elapsed": 57527, "status": "ok", "timestamp": 1592470601250, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "N55X_8Rgokia" }, "outputs": [], "source": [ "from pyspark.sql.types import *" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "LFRcBVBGoGUd" }, "source": [ "# SQL e Dataframe \n", "Lo Structured Query Language (SQL) è un linguaggio standardizzato per poter interrogare ed operare su Database relazionai. SQL Supporta le seguenti operazioni:\n", "\n", "* creare e modificare schemi di database \n", "* inserire, modificare e gestire dati memorizzati\n", "* interrogare i dati memorizzati\n", "* creare e gestire strumenti di controllo e accesso ai dati\n", "\n", "E' possibile utilizzare query SQL su di un Dataframe Spark" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": {}, "colab_type": "code", "executionInfo": { "elapsed": 3282, "status": "ok", "timestamp": 1592470604547, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "0AxxLf9AoGUj" }, "outputs": [], "source": [ "data = [(\"Gianluca\", \"M\", 23, 174, 70.5),\n", " (\"Andrea\", \"M\", 37, 179, 68.),\n", " (\"Marco\", \"M\", 33, 172, 88.5),\n", " (\"Annalisa\", \"F\", 38, 155, 50.2),\n", " (\"Monica\", \"F\", 25, 165, 54.3)]\n", "\n", "df = spark.createDataFrame(data, [\"name\", \"gender\", \"age\", \"height\",\"weight\"])" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "ZqOpmFk2oGUn" }, "source": [ "## Creare una View\n", "Una view è una tabella virtuale, solitamente frutto di un'operazione di SELECT tra più tabelle. Non viene salvata su disco ma risiede in memoria e al termine della sessione viene cancellata. Possiamo creare una View partendo da un Dataframe usando il metodo *.createTempView(name)*." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "GTbO2hQcoGUo" }, "outputs": [], "source": [ "df.createTempView(\"people\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "7eTPnHjGoGUq" }, "source": [ "Se riproviamo a ricreare la stessa view con lo stesso metodo, otterremo un'eccezione che ci informa del fatto che tale view esiste già." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "qd1c9tRmoGUr", "outputId": "1186e84f-964a-4ba0-9e09-4470d2d02506" }, "outputs": [ { "ename": "AnalysisException", "evalue": "\"Temporary view 'people' already exists;\"", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mPy4JJavaError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m~/spark/python/pyspark/sql/utils.py\u001b[0m in \u001b[0;36mdeco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m 62\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 63\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkw\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 64\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mpy4j\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mprotocol\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mPy4JJavaError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/.local/lib/python3.6/site-packages/py4j/protocol.py\u001b[0m in \u001b[0;36mget_return_value\u001b[0;34m(answer, gateway_client, target_id, name)\u001b[0m\n\u001b[1;32m 327\u001b[0m \u001b[0;34m\"An error occurred while calling {0}{1}{2}.\\n\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 328\u001b[0;31m format(target_id, \".\", name), value)\n\u001b[0m\u001b[1;32m 329\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mPy4JJavaError\u001b[0m: An error occurred while calling o40.createTempView.\n: org.apache.spark.sql.catalyst.analysis.TempTableAlreadyExistsException: Temporary view 'people' already exists;\n\tat org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTempView(SessionCatalog.scala:495)\n\tat org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:146)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)\n\tat org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)\n\tat org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)\n\tat org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)\n\tat org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)\n\tat org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)\n\tat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)\n\tat org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)\n\tat org.apache.spark.sql.Dataset.(Dataset.scala:194)\n\tat org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)\n\tat org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3406)\n\tat org.apache.spark.sql.Dataset.createTempView(Dataset.scala:3082)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:282)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:748)\n", "\nDuring handling of the above exception, another exception occurred:\n", "\u001b[0;31mAnalysisException\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcreateTempView\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"people\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m~/spark/python/pyspark/sql/dataframe.py\u001b[0m in \u001b[0;36mcreateTempView\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m 159\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 160\u001b[0m \"\"\"\n\u001b[0;32m--> 161\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcreateTempView\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 162\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 163\u001b[0m \u001b[0;34m@\u001b[0m\u001b[0msince\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m2.0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/.local/lib/python3.6/site-packages/py4j/java_gateway.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args)\u001b[0m\n\u001b[1;32m 1284\u001b[0m \u001b[0manswer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgateway_client\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msend_command\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcommand\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1285\u001b[0m return_value = get_return_value(\n\u001b[0;32m-> 1286\u001b[0;31m answer, self.gateway_client, self.target_id, self.name)\n\u001b[0m\u001b[1;32m 1287\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1288\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mtemp_arg\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mtemp_args\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/spark/python/pyspark/sql/utils.py\u001b[0m in \u001b[0;36mdeco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m 69\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mAnalysisException\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m': '\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstackTrace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 70\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstartswith\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'org.apache.spark.sql.catalyst.analysis'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 71\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mAnalysisException\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m': '\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstackTrace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 72\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstartswith\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'org.apache.spark.sql.catalyst.parser.ParseException: '\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 73\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mParseException\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m': '\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstackTrace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mAnalysisException\u001b[0m: \"Temporary view 'people' already exists;\"" ] } ], "source": [ "df.createTempView(\"people\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "gZyk5EzjoGUt" }, "source": [ "In questi casi ci conviene usare il metodo *.createOrReplaceTempView(name)*." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "bb_SmSLQoGUu" }, "outputs": [], "source": [ "df.createOrReplaceTempView(\"People\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "3_yTveZwoGUw" }, "source": [ "## Comandi di selezione\n", "I comandi SQL sono gli stessi sia per le tabelle che per le view, dato che quest'ultime non sono altro che tabelle virtuali." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Fsr2krqAoGUx" }, "source": [ "Possiamo selezionare una colonna della view usando l'istruzione *SELECT column FROM view*." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "executionInfo": { "elapsed": 1495, "status": "ok", "timestamp": 1592470632305, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "ejfn5ngLoGUx", "outputId": "0a4b66f3-b16e-498a-be7c-1aad30b5fcca" }, "outputs": [ { "data": { "text/plain": [ "pyspark.sql.dataframe.DataFrame" ] }, "execution_count": 10, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "df_sql = spark.sql(\"SELECT name FROM People\")\n", "type(df_sql)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "fuG47wauoGU1" }, "source": [ "Il risultato della query sarà sempre un Dataframe, che quindi possiamo stampare usa" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 187 }, "colab_type": "code", "executionInfo": { "elapsed": 2761, "status": "ok", "timestamp": 1592470652404, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "8kA-Ek0UoGU1", "outputId": "20a91665-3e42-42c5-82e9-c332fb08534e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------+\n", "| name|\n", "+--------+\n", "|Gianluca|\n", "| Andrea|\n", "| Marco|\n", "|Annalisa|\n", "| Monica|\n", "+--------+\n", "\n" ] } ], "source": [ "df_sql.show()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "4Kj9cP1hoGU3" }, "source": [ "**NOTA BENE** SQL è case-insensitive, quindi non fa distinsione tra maiuscole e minuscole." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 187 }, "colab_type": "code", "executionInfo": { "elapsed": 780, "status": "ok", "timestamp": 1592470655586, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "v9g940MZoGU4", "outputId": "098e3f99-0fda-462e-81c2-b8ae0fa3a41b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------+\n", "| nAmE|\n", "+--------+\n", "|Gianluca|\n", "| Andrea|\n", "| Marco|\n", "|Annalisa|\n", "| Monica|\n", "+--------+\n", "\n" ] } ], "source": [ "spark.sql(\"Select nAmE FRoM people\").show()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "snMXq_9woGU8" }, "source": [ "Possiamo utilizzare lo stesso comando anche per selezionare più colonne." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 187 }, "colab_type": "code", "executionInfo": { "elapsed": 884, "status": "ok", "timestamp": 1592470665070, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "ESkeViKboGU8", "outputId": "1d179beb-3e94-4f47-a8cc-79360f2357ee" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------+---+\n", "| name|age|\n", "+--------+---+\n", "|Gianluca| 23|\n", "| Andrea| 37|\n", "| Marco| 33|\n", "|Annalisa| 38|\n", "| Monica| 25|\n", "+--------+---+\n", "\n" ] } ], "source": [ "spark.sql(\"SELECT name, age FROM People\").show()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "snhd5ZteoGU-" }, "source": [ "Per selezionarle tutte possiamo utilizzare il selettore *." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 187 }, "colab_type": "code", "executionInfo": { "elapsed": 890, "status": "ok", "timestamp": 1592470669343, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "rupLEJPQoGU_", "outputId": "0cf85b92-f523-4aad-cef9-e82dac845ba0" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------+------+---+------+------+\n", "| name|gender|age|height|weight|\n", "+--------+------+---+------+------+\n", "|Gianluca| M| 23| 174| 70.5|\n", "| Andrea| M| 37| 179| 68.0|\n", "| Marco| M| 33| 172| 88.5|\n", "|Annalisa| F| 38| 155| 50.2|\n", "| Monica| F| 25| 165| 54.3|\n", "+--------+------+---+------+------+\n", "\n" ] } ], "source": [ "spark.sql(\"SELECT * FROM People\").show()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "ZG1pdMuyoGVC" }, "source": [ "Possiamo effettuare una selezione basata su una condizione utilizzando il comando WHERE, ad esempio selezioniamo nome e altezza solo per gli uomini." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 153 }, "colab_type": "code", "executionInfo": { "elapsed": 1220, "status": "ok", "timestamp": 1592470680993, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "n9-kfzYboGVC", "outputId": "cead653f-5d70-4535-f1f7-50b92ca76fc1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------+------+\n", "| name|height|\n", "+--------+------+\n", "|Gianluca| 174|\n", "| Andrea| 179|\n", "| Marco| 172|\n", "+--------+------+\n", "\n" ] } ], "source": [ "spark.sql(\"SELECT name, height FROM People WHERE gender='M'\").show()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "4zw1wFdKoGVF" }, "source": [ "Possiamo ordinare il risultato in base ai valori di una colonna utilizzando il comando ORDER BY, ad esempio ordiniamo le righe in base al peso." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 187 }, "colab_type": "code", "executionInfo": { "elapsed": 1132, "status": "ok", "timestamp": 1592470683183, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "PIqg_F0-oGVF", "outputId": "df1636ff-07ac-49d6-98f7-82bb73a29b20" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------+------+---+------+------+\n", "| name|gender|age|height|weight|\n", "+--------+------+---+------+------+\n", "|Annalisa| F| 38| 155| 50.2|\n", "| Monica| F| 25| 165| 54.3|\n", "| Andrea| M| 37| 179| 68.0|\n", "|Gianluca| M| 23| 174| 70.5|\n", "| Marco| M| 33| 172| 88.5|\n", "+--------+------+---+------+------+\n", "\n" ] } ], "source": [ "spark.sql(\"SELECT * FROM People ORDER BY weight\").show()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "m3tX1oUvoGVH" }, "source": [ "di default l'ordinamento viene effettuato in maniera ascendente (cioè dal valore più piccolo a quello più grande), se vogliamo effettuare un'ordinamento discendente ci basta aggiungere DESC." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 187 }, "colab_type": "code", "executionInfo": { "elapsed": 860, "status": "ok", "timestamp": 1592470684688, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "witK5iOsoGVI", "outputId": "73cd84b1-9441-4f4a-d985-43ecd7040c12" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------+------+---+------+------+\n", "| name|gender|age|height|weight|\n", "+--------+------+---+------+------+\n", "| Marco| M| 33| 172| 88.5|\n", "|Gianluca| M| 23| 174| 70.5|\n", "| Andrea| M| 37| 179| 68.0|\n", "| Monica| F| 25| 165| 54.3|\n", "|Annalisa| F| 38| 155| 50.2|\n", "+--------+------+---+------+------+\n", "\n" ] } ], "source": [ "spark.sql(\"SELECT * FROM People ORDER BY weight DESC\").show()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "YPGnjm6boGVL" }, "source": [ "Se vogliamo limitare il numero di righe restituite dalla nostra query possiamo utilizzare LIMIT, ad esempio basandoci sulla query eseguita appena sopra limitiamo il risultato alle sole prime 3 righe." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 153 }, "colab_type": "code", "executionInfo": { "elapsed": 994, "status": "ok", "timestamp": 1592470687978, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "tSGJ6_fkoGVL", "outputId": "8f44023b-5dcf-4fa6-cddf-763212f6b4ed" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------+------+---+------+------+\n", "| name|gender|age|height|weight|\n", "+--------+------+---+------+------+\n", "| Marco| M| 33| 172| 88.5|\n", "|Gianluca| M| 23| 174| 70.5|\n", "| Andrea| M| 37| 179| 68.0|\n", "+--------+------+---+------+------+\n", "\n" ] } ], "source": [ "spark.sql(\"SELECT * FROM People ORDER BY weight DESC LIMIT 3\").show()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "KyXzhbXcoGVO" }, "source": [ "Per contare il numero di risultati possiamo utilizzare count, ad esempio contiamo il numero di persone di sesso maschile e più alte di 175 cm." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 119 }, "colab_type": "code", "executionInfo": { "elapsed": 1309, "status": "ok", "timestamp": 1592470690157, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "y8GFMSgIoGVP", "outputId": "4af5b67c-3efa-46ab-ff70-840c48ee0d53" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------+\n", "|count(1)|\n", "+--------+\n", "| 1|\n", "+--------+\n", "\n" ] } ], "source": [ "spark.sql(\"SELECT count(*) FROM People WHERE gender='M' and height>175\").show()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Y2jKFvzeoGVR" }, "source": [ "In questo caso il nome della colonna (count(1)) viene impostato automaticamente, se vogliamo modificarlo possiamo creare un alias con il comando AS, ad esempio modifichiamo in counter." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 119 }, "colab_type": "code", "executionInfo": { "elapsed": 1530, "status": "ok", "timestamp": 1592470691753, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "eDVWKxREoGVS", "outputId": "5abfe129-3685-4116-b077-53b59c03414d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------+\n", "|counter|\n", "+-------+\n", "| 1|\n", "+-------+\n", "\n" ] } ], "source": [ "spark.sql(\"SELECT count(*) AS counter FROM People WHERE gender='M' and height>175\").show()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Heq6yYQnoGVU" }, "source": [ "Per concludere vediamo una serie di funzioni statistiche, come la media (avg), usiamola per calolare il peso medio degli uomini." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 119 }, "colab_type": "code", "executionInfo": { "elapsed": 1372, "status": "ok", "timestamp": 1592470693831, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "We_1ZYaboGVU", "outputId": "c50b4bc7-490f-4540-a709-86260b0a0f41" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----------------+\n", "| avg_weight|\n", "+-----------------+\n", "|75.66666666666667|\n", "+-----------------+\n", "\n" ] } ], "source": [ "spark.sql(\"SELECT avg(weight) AS avg_weight FROM People WHERE gender='M'\").show()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "ZD2rQWQioGVX" }, "source": [ "oppure peso massimo (max) e minimo (min), sempre per i soli uomini" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 119 }, "colab_type": "code", "executionInfo": { "elapsed": 1032, "status": "ok", "timestamp": 1592470695415, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "o1-mnVVpoGVY", "outputId": "7fa092ab-ed7e-4048-b12b-f14208183277" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------+----------+\n", "|max_weight|min_weight|\n", "+----------+----------+\n", "| 88.5| 68.0|\n", "+----------+----------+\n", "\n" ] } ], "source": [ "spark.sql(\"SELECT max(weight) AS max_weight, min(weight) AS min_weight FROM People WHERE gender='M'\").show()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "1XbOS9_6oGVZ" }, "source": [ "Utilizzando il comando GROUP BY possiamo raggruppare le righe in base al valore di una colonna, ad esempio raggruppiamo in base al sesso e poi calcoliamo peso medio, massimo e minimo." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 136 }, "colab_type": "code", "executionInfo": { "elapsed": 3386, "status": "ok", "timestamp": 1592470699293, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "5IIscRrioGVa", "outputId": "4890b845-a1f0-4381-a8c0-9a0e4f9e45ce" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------+-----------------+----------+----------+\n", "|gender| avg_weight|max_weight|min_weight|\n", "+------+-----------------+----------+----------+\n", "| F| 52.25| 54.3| 50.2|\n", "| M|75.66666666666667| 88.5| 68.0|\n", "+------+-----------------+----------+----------+\n", "\n" ] } ], "source": [ "spark.sql(\"SELECT gender, avg(weight) AS avg_weight, max(weight) AS max_weight, min(weight) AS min_weight FROM People GROUP BY gender\").show()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "t4MCtiSDoGVd" }, "source": [ "Il peso medio per l'uomo ha troppe cifre dopo la virgola, arrotondiamole utilizzando la funzione round." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 136 }, "colab_type": "code", "executionInfo": { "elapsed": 3246, "status": "ok", "timestamp": 1592470700945, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "G7tUwj13oGVd", "outputId": "238cd928-9cad-4b41-814f-a034e8ed5dc5" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------+----------+----------+----------+\n", "|gender|avg_weight|max_weight|min_weight|\n", "+------+----------+----------+----------+\n", "| F| 52.25| 54.3| 50.2|\n", "| M| 75.67| 88.5| 68.0|\n", "+------+----------+----------+----------+\n", "\n" ] } ], "source": [ "spark.sql(\"SELECT gender, round(avg(weight), 2) AS avg_weight, max(weight) AS max_weight, min(weight) AS min_weight FROM People GROUP BY gender\").show()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "bD_H5k1FoGVf" }, "source": [ "## Creare una tabella permanente\n", "Come abbiamo già detto, la view viene salvata in RAM e al termine della sessione viene perduta, per creare una tabella permanente in memoria partendo da un Dataframe possiamo usare il metodo *.saveAsTable(name).*, per usare questo metodo dobbiamo importare la classe *DataFrameWriter*." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": {}, "colab_type": "code", "executionInfo": { "elapsed": 3255, "status": "ok", "timestamp": 1592470702099, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "0I-edEsIoGVf" }, "outputs": [], "source": [ "from pyspark.sql import DataFrameWriter\n", "\n", "df.write.saveAsTable(\"People\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "0X9qeP8EoGVi" }, "source": [ "Adesso nella directory nella quale stiamo lavorando dovremmo trovarci una cartella spark-warehouse contentente un'altra cartella people" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "executionInfo": { "elapsed": 2842, "status": "ok", "timestamp": 1592470715980, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "YLDt_fwsoGVj", "outputId": "aba0e327-ca7c-4713-d367-82e9e46975d3" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "people\n" ] } ], "source": [ "!ls spark-warehouse/" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "colab": {}, "colab_type": "code", "executionInfo": { "elapsed": 794, "status": "ok", "timestamp": 1592470736595, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "YLj1NkkTpt4j" }, "outputs": [], "source": [ "df.write.saveAsTable(\"People2\")" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "executionInfo": { "elapsed": 2103, "status": "ok", "timestamp": 1592470738642, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "5pcJ3TKBpwKw", "outputId": "16520a8a-78b9-4d3d-ca7e-fc6387c0b213" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "people\tpeople2\n" ] } ], "source": [ "!ls spark-warehouse/" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "zr4hGhpIoGVl" }, "source": [ " e qui si trova il nostro database, se proviamo a creare nuovamente la tabella otterremo un'errore, che ci informa che questa già esiste." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 681 }, "colab_type": "code", "executionInfo": { "elapsed": 1040, "status": "error", "timestamp": 1592470757218, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "SQClHz-BoGVl", "outputId": "80335e27-d718-4ef4-c9c6-bb6cc8e39e9d" }, "outputs": [ { "ename": "AnalysisException", "evalue": "ignored", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mPy4JJavaError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/content/spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/utils.py\u001b[0m in \u001b[0;36mdeco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m 62\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 63\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkw\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 64\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mpy4j\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mprotocol\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mPy4JJavaError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/content/spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py\u001b[0m in \u001b[0;36mget_return_value\u001b[0;34m(answer, gateway_client, target_id, name)\u001b[0m\n\u001b[1;32m 327\u001b[0m \u001b[0;34m\"An error occurred while calling {0}{1}{2}.\\n\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 328\u001b[0;31m format(target_id, \".\", name), value)\n\u001b[0m\u001b[1;32m 329\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mPy4JJavaError\u001b[0m: An error occurred while calling o95.saveAsTable.\n: org.apache.spark.sql.AnalysisException: Table `People` already exists.;\n\tat org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:424)\n\tat org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:409)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:282)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:748)\n", "\nDuring handling of the above exception, another exception occurred:\n", "\u001b[0;31mAnalysisException\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwrite\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msaveAsTable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"People\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/content/spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/readwriter.py\u001b[0m in \u001b[0;36msaveAsTable\u001b[0;34m(self, name, format, mode, partitionBy, **options)\u001b[0m\n\u001b[1;32m 776\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mformat\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 777\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 778\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jwrite\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msaveAsTable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 779\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 780\u001b[0m \u001b[0;34m@\u001b[0m\u001b[0msince\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1.4\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/content/spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args)\u001b[0m\n\u001b[1;32m 1255\u001b[0m \u001b[0manswer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgateway_client\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msend_command\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcommand\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1256\u001b[0m return_value = get_return_value(\n\u001b[0;32m-> 1257\u001b[0;31m answer, self.gateway_client, self.target_id, self.name)\n\u001b[0m\u001b[1;32m 1258\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1259\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mtemp_arg\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mtemp_args\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/content/spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/utils.py\u001b[0m in \u001b[0;36mdeco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m 67\u001b[0m e.java_exception.getStackTrace()))\n\u001b[1;32m 68\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstartswith\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'org.apache.spark.sql.AnalysisException: '\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 69\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mAnalysisException\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m': '\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstackTrace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 70\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstartswith\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'org.apache.spark.sql.catalyst.analysis'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 71\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mAnalysisException\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m': '\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstackTrace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mAnalysisException\u001b[0m: 'Table `People` already exists.;'" ] } ], "source": [ "df.write.saveAsTable(\"People\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "cdgM_HBroGVn" }, "source": [ "In questi casi dobbiamo utilizzare il metodo *.mode(m)*, passando al suo interno:\n", "* 'append': per eseguire un join con la tabella già esistente, se possibile.\n", "* 'overwrite': per sovrascrivere la tabella già esistente." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "colab": {}, "colab_type": "code", "executionInfo": { "elapsed": 916, "status": "ok", "timestamp": 1592470766282, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "mqotrssRoGVo" }, "outputs": [], "source": [ "df.write.mode('append').saveAsTable(\"People\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "-2HB2--loGVq" }, "source": [ "Le istruzioni di selezione per le tabelle sono esattamente le stesse di quelle che abbiamo visto su per le view." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 187 }, "colab_type": "code", "executionInfo": { "elapsed": 1084, "status": "ok", "timestamp": 1592470783410, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "uorfMBUEoGVq", "outputId": "abdb46c4-648f-4c62-c981-aec0ba93e8b2" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------+------+---+------+------+\n", "| name|gender|age|height|weight|\n", "+--------+------+---+------+------+\n", "|Gianluca| M| 23| 174| 70.5|\n", "| Andrea| M| 37| 179| 68.0|\n", "| Marco| M| 33| 172| 88.5|\n", "|Annalisa| F| 38| 155| 50.2|\n", "| Monica| F| 25| 165| 54.3|\n", "+--------+------+---+------+------+\n", "\n" ] } ], "source": [ "spark.sql(\"SELECT * FROM People\").show()" ] } ], "metadata": { "colab": { "name": "pyspark_dataframe_sql.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 1 }