{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "Sszrzyybok2v"
},
"outputs": [],
"source": [
"################ template to run PySpark on Colab #######################"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {},
"colab_type": "code",
"executionInfo": {
"elapsed": 52545,
"status": "ok",
"timestamp": 1592470596242,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "0WGeMNr-ok5y"
},
"outputs": [],
"source": [
"!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n",
"!wget -q https://www-us.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz\n",
"!tar xf spark-2.4.5-bin-hadoop2.7.tgz\n",
"!pip install -q findspark"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {},
"colab_type": "code",
"executionInfo": {
"elapsed": 52539,
"status": "ok",
"timestamp": 1592470596245,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "S33Pv2YDokyN"
},
"outputs": [],
"source": [
"import os\n",
"os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n",
"os.environ[\"SPARK_HOME\"] = \"/content/spark-2.4.5-bin-hadoop2.7\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {},
"colab_type": "code",
"executionInfo": {
"elapsed": 57533,
"status": "ok",
"timestamp": 1592470601245,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "VIQewOA8oktH"
},
"outputs": [],
"source": [
"import findspark\n",
"findspark.init()\n",
"\n",
"from pyspark.sql import SparkSession\n",
"spark = SparkSession.builder.master(\"local[*]\").getOrCreate()\n",
"spark1 = SparkSession.builder.appName('basic').getOrCreate()\n",
"#Test must give no error"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {},
"colab_type": "code",
"executionInfo": {
"elapsed": 57532,
"status": "ok",
"timestamp": 1592470601248,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "0VyUq14yokoz"
},
"outputs": [],
"source": [
"import pyspark"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {},
"colab_type": "code",
"executionInfo": {
"elapsed": 57530,
"status": "ok",
"timestamp": 1592470601249,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "CAbqiCrDokkX"
},
"outputs": [],
"source": [
"from pyspark import SparkConf, SparkContext\n",
"conf = SparkConf().setAppName(\"basic\").setMaster(\"local\")\n",
"#sc = SparkContext(conf=conf) ## for jupyter and Databricks\n",
"sc = SparkContext.getOrCreate() ## for Colab"
]
},
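{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, both entry points should report the same Spark version (a minimal sketch):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check: the SparkSession and the SparkContext expose the same version\n",
"print(spark.version)\n",
"print(sc.version)"
]
},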
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {},
"colab_type": "code",
"executionInfo": {
"elapsed": 57527,
"status": "ok",
"timestamp": 1592470601250,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "N55X_8Rgokia"
},
"outputs": [],
"source": [
"from pyspark.sql.types import *"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "LFRcBVBGoGUd"
},
"source": [
"# SQL e Dataframe \n",
"Lo Structured Query Language (SQL) è un linguaggio standardizzato per poter interrogare ed operare su Database relazionai. SQL Supporta le seguenti operazioni:\n",
"\n",
"* creare e modificare schemi di database \n",
"* inserire, modificare e gestire dati memorizzati\n",
"* interrogare i dati memorizzati\n",
"* creare e gestire strumenti di controllo e accesso ai dati\n",
"\n",
"E' possibile utilizzare query SQL su di un Dataframe Spark"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {},
"colab_type": "code",
"executionInfo": {
"elapsed": 3282,
"status": "ok",
"timestamp": 1592470604547,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "0AxxLf9AoGUj"
},
"outputs": [],
"source": [
"data = [(\"Gianluca\", \"M\", 23, 174, 70.5),\n",
" (\"Andrea\", \"M\", 37, 179, 68.),\n",
" (\"Marco\", \"M\", 33, 172, 88.5),\n",
" (\"Annalisa\", \"F\", 38, 155, 50.2),\n",
" (\"Monica\", \"F\", 25, 165, 54.3)]\n",
"\n",
"df = spark.createDataFrame(data, [\"name\", \"gender\", \"age\", \"height\",\"weight\"])"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ZqOpmFk2oGUn"
},
"source": [
"## Creare una View\n",
"Una view è una tabella virtuale, solitamente frutto di un'operazione di SELECT tra più tabelle. Non viene salvata su disco ma risiede in memoria e al termine della sessione viene cancellata. Possiamo creare una View partendo da un Dataframe usando il metodo *.createTempView(name)*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "GTbO2hQcoGUo"
},
"outputs": [],
"source": [
"df.createTempView(\"people\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "7eTPnHjGoGUq"
},
"source": [
"Se riproviamo a ricreare la stessa view con lo stesso metodo, otterremo un'eccezione che ci informa del fatto che tale view esiste già."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "qd1c9tRmoGUr",
"outputId": "1186e84f-964a-4ba0-9e09-4470d2d02506"
},
"outputs": [
{
"ename": "AnalysisException",
"evalue": "\"Temporary view 'people' already exists;\"",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mPy4JJavaError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m~/spark/python/pyspark/sql/utils.py\u001b[0m in \u001b[0;36mdeco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m 62\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 63\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkw\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 64\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mpy4j\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mprotocol\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mPy4JJavaError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/.local/lib/python3.6/site-packages/py4j/protocol.py\u001b[0m in \u001b[0;36mget_return_value\u001b[0;34m(answer, gateway_client, target_id, name)\u001b[0m\n\u001b[1;32m 327\u001b[0m \u001b[0;34m\"An error occurred while calling {0}{1}{2}.\\n\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 328\u001b[0;31m format(target_id, \".\", name), value)\n\u001b[0m\u001b[1;32m 329\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mPy4JJavaError\u001b[0m: An error occurred while calling o40.createTempView.\n: org.apache.spark.sql.catalyst.analysis.TempTableAlreadyExistsException: Temporary view 'people' already exists;\n\tat org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTempView(SessionCatalog.scala:495)\n\tat org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:146)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)\n\tat org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)\n\tat org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)\n\tat org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)\n\tat org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)\n\tat org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)\n\tat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)\n\tat org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)\n\tat org.apache.spark.sql.Dataset.(Dataset.scala:194)\n\tat org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)\n\tat org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3406)\n\tat org.apache.spark.sql.Dataset.createTempView(Dataset.scala:3082)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:282)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:748)\n",
"\nDuring handling of the above exception, another exception occurred:\n",
"\u001b[0;31mAnalysisException\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcreateTempView\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"people\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m~/spark/python/pyspark/sql/dataframe.py\u001b[0m in \u001b[0;36mcreateTempView\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m 159\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 160\u001b[0m \"\"\"\n\u001b[0;32m--> 161\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcreateTempView\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 162\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 163\u001b[0m \u001b[0;34m@\u001b[0m\u001b[0msince\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m2.0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/.local/lib/python3.6/site-packages/py4j/java_gateway.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args)\u001b[0m\n\u001b[1;32m 1284\u001b[0m \u001b[0manswer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgateway_client\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msend_command\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcommand\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1285\u001b[0m return_value = get_return_value(\n\u001b[0;32m-> 1286\u001b[0;31m answer, self.gateway_client, self.target_id, self.name)\n\u001b[0m\u001b[1;32m 1287\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1288\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mtemp_arg\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mtemp_args\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/spark/python/pyspark/sql/utils.py\u001b[0m in \u001b[0;36mdeco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m 69\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mAnalysisException\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m': '\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstackTrace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 70\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstartswith\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'org.apache.spark.sql.catalyst.analysis'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 71\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mAnalysisException\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m': '\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstackTrace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 72\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstartswith\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'org.apache.spark.sql.catalyst.parser.ParseException: '\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 73\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mParseException\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m': '\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstackTrace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mAnalysisException\u001b[0m: \"Temporary view 'people' already exists;\""
]
}
],
"source": [
"df.createTempView(\"people\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "gZyk5EzjoGUt"
},
"source": [
"In questi casi ci conviene usare il metodo *.createOrReplaceTempView(name)*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "bb_SmSLQoGUu"
},
"outputs": [],
"source": [
"df.createOrReplaceTempView(\"People\")"
]
},
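{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an alternative, an existing temporary view can be dropped explicitly with *spark.catalog.dropTempView(name)*; a minimal sketch, assuming the view created above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: drop the temp view explicitly, then recreate it with createTempView.\n",
"# dropTempView returns True if the view existed, False otherwise.\n",
"spark.catalog.dropTempView(\"People\")\n",
"df.createTempView(\"People\")  # succeeds now that the old view is gone"
]
},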
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "3_yTveZwoGUw"
},
"source": [
"## Comandi di selezione\n",
"I comandi SQL sono gli stessi sia per le tabelle che per le view, dato che quest'ultime non sono altro che tabelle virtuali."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Fsr2krqAoGUx"
},
"source": [
"Possiamo selezionare una colonna della view usando l'istruzione *SELECT column FROM view*."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1495,
"status": "ok",
"timestamp": 1592470632305,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "ejfn5ngLoGUx",
"outputId": "0a4b66f3-b16e-498a-be7c-1aad30b5fcca"
},
"outputs": [
{
"data": {
"text/plain": [
"pyspark.sql.dataframe.DataFrame"
]
},
"execution_count": 10,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"df_sql = spark.sql(\"SELECT name FROM People\")\n",
"type(df_sql)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "fuG47wauoGU1"
},
"source": [
"Il risultato della query sarà sempre un Dataframe, che quindi possiamo stampare usa"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 187
},
"colab_type": "code",
"executionInfo": {
"elapsed": 2761,
"status": "ok",
"timestamp": 1592470652404,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "8kA-Ek0UoGU1",
"outputId": "20a91665-3e42-42c5-82e9-c332fb08534e"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+\n",
"| name|\n",
"+--------+\n",
"|Gianluca|\n",
"| Andrea|\n",
"| Marco|\n",
"|Annalisa|\n",
"| Monica|\n",
"+--------+\n",
"\n"
]
}
],
"source": [
"df_sql.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "4Kj9cP1hoGU3"
},
"source": [
"**NOTA BENE** SQL è case-insensitive, quindi non fa distinsione tra maiuscole e minuscole."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 187
},
"colab_type": "code",
"executionInfo": {
"elapsed": 780,
"status": "ok",
"timestamp": 1592470655586,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "v9g940MZoGU4",
"outputId": "098e3f99-0fda-462e-81c2-b8ae0fa3a41b"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+\n",
"| nAmE|\n",
"+--------+\n",
"|Gianluca|\n",
"| Andrea|\n",
"| Marco|\n",
"|Annalisa|\n",
"| Monica|\n",
"+--------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"Select nAmE FRoM people\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "snMXq_9woGU8"
},
"source": [
"Possiamo utilizzare lo stesso comando anche per selezionare più colonne."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 187
},
"colab_type": "code",
"executionInfo": {
"elapsed": 884,
"status": "ok",
"timestamp": 1592470665070,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "ESkeViKboGU8",
"outputId": "1d179beb-3e94-4f47-a8cc-79360f2357ee"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+---+\n",
"| name|age|\n",
"+--------+---+\n",
"|Gianluca| 23|\n",
"| Andrea| 37|\n",
"| Marco| 33|\n",
"|Annalisa| 38|\n",
"| Monica| 25|\n",
"+--------+---+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT name, age FROM People\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "snhd5ZteoGU-"
},
"source": [
"Per selezionarle tutte possiamo utilizzare il selettore *."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 187
},
"colab_type": "code",
"executionInfo": {
"elapsed": 890,
"status": "ok",
"timestamp": 1592470669343,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "rupLEJPQoGU_",
"outputId": "0cf85b92-f523-4aad-cef9-e82dac845ba0"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+------+---+------+------+\n",
"| name|gender|age|height|weight|\n",
"+--------+------+---+------+------+\n",
"|Gianluca| M| 23| 174| 70.5|\n",
"| Andrea| M| 37| 179| 68.0|\n",
"| Marco| M| 33| 172| 88.5|\n",
"|Annalisa| F| 38| 155| 50.2|\n",
"| Monica| F| 25| 165| 54.3|\n",
"+--------+------+---+------+------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT * FROM People\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ZG1pdMuyoGVC"
},
"source": [
"Possiamo effettuare una selezione basata su una condizione utilizzando il comando WHERE, ad esempio selezioniamo nome e altezza solo per gli uomini."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 153
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1220,
"status": "ok",
"timestamp": 1592470680993,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "n9-kfzYboGVC",
"outputId": "cead653f-5d70-4535-f1f7-50b92ca76fc1"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+------+\n",
"| name|height|\n",
"+--------+------+\n",
"|Gianluca| 174|\n",
"| Andrea| 179|\n",
"| Marco| 172|\n",
"+--------+------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT name, height FROM People WHERE gender='M'\").show()"
]
},
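{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, the same selection can be expressed with the DataFrame API instead of SQL; a minimal sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# DataFrame-API equivalent of the SQL query above\n",
"df.filter(df.gender == \"M\").select(\"name\", \"height\").show()"
]
},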
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "4zw1wFdKoGVF"
},
"source": [
"Possiamo ordinare il risultato in base ai valori di una colonna utilizzando il comando ORDER BY, ad esempio ordiniamo le righe in base al peso."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 187
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1132,
"status": "ok",
"timestamp": 1592470683183,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "PIqg_F0-oGVF",
"outputId": "df1636ff-07ac-49d6-98f7-82bb73a29b20"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+------+---+------+------+\n",
"| name|gender|age|height|weight|\n",
"+--------+------+---+------+------+\n",
"|Annalisa| F| 38| 155| 50.2|\n",
"| Monica| F| 25| 165| 54.3|\n",
"| Andrea| M| 37| 179| 68.0|\n",
"|Gianluca| M| 23| 174| 70.5|\n",
"| Marco| M| 33| 172| 88.5|\n",
"+--------+------+---+------+------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT * FROM People ORDER BY weight\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "m3tX1oUvoGVH"
},
"source": [
"di default l'ordinamento viene effettuato in maniera ascendente (cioè dal valore più piccolo a quello più grande), se vogliamo effettuare un'ordinamento discendente ci basta aggiungere DESC."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 187
},
"colab_type": "code",
"executionInfo": {
"elapsed": 860,
"status": "ok",
"timestamp": 1592470684688,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "witK5iOsoGVI",
"outputId": "73cd84b1-9441-4f4a-d985-43ecd7040c12"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+------+---+------+------+\n",
"| name|gender|age|height|weight|\n",
"+--------+------+---+------+------+\n",
"| Marco| M| 33| 172| 88.5|\n",
"|Gianluca| M| 23| 174| 70.5|\n",
"| Andrea| M| 37| 179| 68.0|\n",
"| Monica| F| 25| 165| 54.3|\n",
"|Annalisa| F| 38| 155| 50.2|\n",
"+--------+------+---+------+------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT * FROM People ORDER BY weight DESC\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "YPGnjm6boGVL"
},
"source": [
"Se vogliamo limitare il numero di righe restituite dalla nostra query possiamo utilizzare LIMIT, ad esempio basandoci sulla query eseguita appena sopra limitiamo il risultato alle sole prime 3 righe."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 153
},
"colab_type": "code",
"executionInfo": {
"elapsed": 994,
"status": "ok",
"timestamp": 1592470687978,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "tSGJ6_fkoGVL",
"outputId": "8f44023b-5dcf-4fa6-cddf-763212f6b4ed"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+------+---+------+------+\n",
"| name|gender|age|height|weight|\n",
"+--------+------+---+------+------+\n",
"| Marco| M| 33| 172| 88.5|\n",
"|Gianluca| M| 23| 174| 70.5|\n",
"| Andrea| M| 37| 179| 68.0|\n",
"+--------+------+---+------+------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT * FROM People ORDER BY weight DESC LIMIT 3\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "KyXzhbXcoGVO"
},
"source": [
"Per contare il numero di risultati possiamo utilizzare count, ad esempio contiamo il numero di persone di sesso maschile e più alte di 175 cm."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 119
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1309,
"status": "ok",
"timestamp": 1592470690157,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "y8GFMSgIoGVP",
"outputId": "4af5b67c-3efa-46ab-ff70-840c48ee0d53"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+\n",
"|count(1)|\n",
"+--------+\n",
"| 1|\n",
"+--------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT count(*) FROM People WHERE gender='M' and height>175\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Y2jKFvzeoGVR"
},
"source": [
"In questo caso il nome della colonna (count(1)) viene impostato automaticamente, se vogliamo modificarlo possiamo creare un alias con il comando AS, ad esempio modifichiamo in counter."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 119
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1530,
"status": "ok",
"timestamp": 1592470691753,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "eDVWKxREoGVS",
"outputId": "5abfe129-3685-4116-b077-53b59c03414d"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------+\n",
"|counter|\n",
"+-------+\n",
"| 1|\n",
"+-------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT count(*) AS counter FROM People WHERE gender='M' and height>175\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Heq6yYQnoGVU"
},
"source": [
"Per concludere vediamo una serie di funzioni statistiche, come la media (avg), usiamola per calolare il peso medio degli uomini."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 119
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1372,
"status": "ok",
"timestamp": 1592470693831,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "We_1ZYaboGVU",
"outputId": "c50b4bc7-490f-4540-a709-86260b0a0f41"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----------------+\n",
"| avg_weight|\n",
"+-----------------+\n",
"|75.66666666666667|\n",
"+-----------------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT avg(weight) AS avg_weight FROM People WHERE gender='M'\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ZD2rQWQioGVX"
},
"source": [
"oppure peso massimo (max) e minimo (min), sempre per i soli uomini"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 119
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1032,
"status": "ok",
"timestamp": 1592470695415,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "o1-mnVVpoGVY",
"outputId": "7fa092ab-ed7e-4048-b12b-f14208183277"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+----------+----------+\n",
"|max_weight|min_weight|\n",
"+----------+----------+\n",
"| 88.5| 68.0|\n",
"+----------+----------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT max(weight) AS max_weight, min(weight) AS min_weight FROM People WHERE gender='M'\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "1XbOS9_6oGVZ"
},
"source": [
"Utilizzando il comando GROUP BY possiamo raggruppare le righe in base al valore di una colonna, ad esempio raggruppiamo in base al sesso e poi calcoliamo peso medio, massimo e minimo."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 136
},
"colab_type": "code",
"executionInfo": {
"elapsed": 3386,
"status": "ok",
"timestamp": 1592470699293,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "5IIscRrioGVa",
"outputId": "4890b845-a1f0-4381-a8c0-9a0e4f9e45ce"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------+-----------------+----------+----------+\n",
"|gender| avg_weight|max_weight|min_weight|\n",
"+------+-----------------+----------+----------+\n",
"| F| 52.25| 54.3| 50.2|\n",
"| M|75.66666666666667| 88.5| 68.0|\n",
"+------+-----------------+----------+----------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT gender, avg(weight) AS avg_weight, max(weight) AS max_weight, min(weight) AS min_weight FROM People GROUP BY gender\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "t4MCtiSDoGVd"
},
"source": [
"Il peso medio per l'uomo ha troppe cifre dopo la virgola, arrotondiamole utilizzando la funzione round."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 136
},
"colab_type": "code",
"executionInfo": {
"elapsed": 3246,
"status": "ok",
"timestamp": 1592470700945,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "G7tUwj13oGVd",
"outputId": "238cd928-9cad-4b41-814f-a034e8ed5dc5"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------+----------+----------+----------+\n",
"|gender|avg_weight|max_weight|min_weight|\n",
"+------+----------+----------+----------+\n",
"| F| 52.25| 54.3| 50.2|\n",
"| M| 75.67| 88.5| 68.0|\n",
"+------+----------+----------+----------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT gender, round(avg(weight), 2) AS avg_weight, max(weight) AS max_weight, min(weight) AS min_weight FROM People GROUP BY gender\").show()"
]
},
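{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, the same aggregation can be written with the DataFrame API via *groupBy* and *agg*; a minimal sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.sql import functions as F\n",
"\n",
"# DataFrame-API equivalent of the GROUP BY query above\n",
"df.groupBy(\"gender\").agg(\n",
"    F.round(F.avg(\"weight\"), 2).alias(\"avg_weight\"),\n",
"    F.max(\"weight\").alias(\"max_weight\"),\n",
"    F.min(\"weight\").alias(\"min_weight\")\n",
").show()"
]
},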
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "bD_H5k1FoGVf"
},
"source": [
"## Creare una tabella permanente\n",
"Come abbiamo già detto, la view viene salvata in RAM e al termine della sessione viene perduta, per creare una tabella permanente in memoria partendo da un Dataframe possiamo usare il metodo *.saveAsTable(name).*, per usare questo metodo dobbiamo importare la classe *DataFrameWriter*."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"colab": {},
"colab_type": "code",
"executionInfo": {
"elapsed": 3255,
"status": "ok",
"timestamp": 1592470702099,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "0I-edEsIoGVf"
},
"outputs": [],
"source": [
"from pyspark.sql import DataFrameWriter\n",
"\n",
"df.write.saveAsTable(\"People\")"
]
},
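{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also check that the table was registered by inspecting the catalog; a small sketch using *spark.catalog.listTables()*:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# List the tables and temporary views registered in the current catalog\n",
"for t in spark.catalog.listTables():\n",
"    print(t.name, t.tableType, t.isTemporary)"
]
},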
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "0X9qeP8EoGVi"
},
"source": [
"Adesso nella directory nella quale stiamo lavorando dovremmo trovarci una cartella spark-warehouse contentente un'altra cartella people"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"executionInfo": {
"elapsed": 2842,
"status": "ok",
"timestamp": 1592470715980,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "YLDt_fwsoGVj",
"outputId": "aba0e327-ca7c-4713-d367-82e9e46975d3"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"people\n"
]
}
],
"source": [
"!ls spark-warehouse/"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"colab": {},
"colab_type": "code",
"executionInfo": {
"elapsed": 794,
"status": "ok",
"timestamp": 1592470736595,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "YLj1NkkTpt4j"
},
"outputs": [],
"source": [
"df.write.saveAsTable(\"People2\")"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"executionInfo": {
"elapsed": 2103,
"status": "ok",
"timestamp": 1592470738642,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "5pcJ3TKBpwKw",
"outputId": "16520a8a-78b9-4d3d-ca7e-fc6387c0b213"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"people\tpeople2\n"
]
}
],
"source": [
"!ls spark-warehouse/"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "zr4hGhpIoGVl"
},
"source": [
" e qui si trova il nostro database, se proviamo a creare nuovamente la tabella otterremo un'errore, che ci informa che questa già esiste."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 681
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1040,
"status": "error",
"timestamp": 1592470757218,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "SQClHz-BoGVl",
"outputId": "80335e27-d718-4ef4-c9c6-bb6cc8e39e9d"
},
"outputs": [
{
"ename": "AnalysisException",
"evalue": "ignored",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mPy4JJavaError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m/content/spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/utils.py\u001b[0m in \u001b[0;36mdeco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m 62\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 63\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkw\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 64\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mpy4j\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mprotocol\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mPy4JJavaError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/content/spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py\u001b[0m in \u001b[0;36mget_return_value\u001b[0;34m(answer, gateway_client, target_id, name)\u001b[0m\n\u001b[1;32m 327\u001b[0m \u001b[0;34m\"An error occurred while calling {0}{1}{2}.\\n\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 328\u001b[0;31m format(target_id, \".\", name), value)\n\u001b[0m\u001b[1;32m 329\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mPy4JJavaError\u001b[0m: An error occurred while calling o95.saveAsTable.\n: org.apache.spark.sql.AnalysisException: Table `People` already exists.;\n\tat org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:424)\n\tat org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:409)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:282)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:748)\n",
"\nDuring handling of the above exception, another exception occurred:\n",
"\u001b[0;31mAnalysisException\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwrite\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msaveAsTable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"People\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/content/spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/readwriter.py\u001b[0m in \u001b[0;36msaveAsTable\u001b[0;34m(self, name, format, mode, partitionBy, **options)\u001b[0m\n\u001b[1;32m 776\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mformat\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 777\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 778\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jwrite\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msaveAsTable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 779\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 780\u001b[0m \u001b[0;34m@\u001b[0m\u001b[0msince\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1.4\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/content/spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args)\u001b[0m\n\u001b[1;32m 1255\u001b[0m \u001b[0manswer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgateway_client\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msend_command\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcommand\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1256\u001b[0m return_value = get_return_value(\n\u001b[0;32m-> 1257\u001b[0;31m answer, self.gateway_client, self.target_id, self.name)\n\u001b[0m\u001b[1;32m 1258\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1259\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mtemp_arg\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mtemp_args\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/content/spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/utils.py\u001b[0m in \u001b[0;36mdeco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m 67\u001b[0m e.java_exception.getStackTrace()))\n\u001b[1;32m 68\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstartswith\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'org.apache.spark.sql.AnalysisException: '\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 69\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mAnalysisException\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m': '\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstackTrace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 70\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstartswith\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'org.apache.spark.sql.catalyst.analysis'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 71\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mAnalysisException\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m': '\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstackTrace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mAnalysisException\u001b[0m: 'Table `People` already exists.;'"
]
}
],
"source": [
"df.write.saveAsTable(\"People\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "cdgM_HBroGVn"
},
"source": [
"In questi casi dobbiamo utilizzare il metodo *.mode(m)*, passando al suo interno:\n",
"* 'append': per eseguire un join con la tabella già esistente, se possibile.\n",
"* 'overwrite': per sovrascrivere la tabella già esistente."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"colab": {},
"colab_type": "code",
"executionInfo": {
"elapsed": 916,
"status": "ok",
"timestamp": 1592470766282,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "mqotrssRoGVo"
},
"outputs": [],
"source": [
"df.write.mode('append').saveAsTable(\"People\")"
]
},
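{
"cell_type": "markdown",
"metadata": {},
"source": [
"The 'overwrite' mode works the same way; a minimal sketch, applied to the People2 table created earlier so the People table above is left untouched:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: overwrite replaces the contents of an existing table\n",
"df.write.mode('overwrite').saveAsTable(\"People2\")"
]
},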
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "-2HB2--loGVq"
},
"source": [
"Le istruzioni di selezione per le tabelle sono esattamente le stesse di quelle che abbiamo visto su per le view."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 187
},
"colab_type": "code",
"executionInfo": {
"elapsed": 1084,
"status": "ok",
"timestamp": 1592470783410,
"user": {
"displayName": "T3Lab Vision",
"photoUrl": "",
"userId": "14779383426442114373"
},
"user_tz": -120
},
"id": "uorfMBUEoGVq",
"outputId": "abdb46c4-648f-4c62-c981-aec0ba93e8b2"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+------+---+------+------+\n",
"| name|gender|age|height|weight|\n",
"+--------+------+---+------+------+\n",
"|Gianluca| M| 23| 174| 70.5|\n",
"| Andrea| M| 37| 179| 68.0|\n",
"| Marco| M| 33| 172| 88.5|\n",
"|Annalisa| F| 38| 155| 50.2|\n",
"| Monica| F| 25| 165| 54.3|\n",
"+--------+------+---+------+------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"SELECT * FROM People\").show()"
]
}
],
"metadata": {
"colab": {
"name": "pyspark_dataframe_sql.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 1
}