{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "713FbuMDwaVh" }, "source": [ "\"Open" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": {}, "colab_type": "code", "executionInfo": { "elapsed": 796, "status": "ok", "timestamp": 1592472621795, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "BA43OLGJwW8h" }, "outputs": [], "source": [ "################ template to run PySpark on Colab #######################" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": {}, "colab_type": "code", "executionInfo": { "elapsed": 32019, "status": "ok", "timestamp": 1592472653025, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "VAZTYHcYwW3W" }, "outputs": [], "source": [ "!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n", "!wget -q https://www-us.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz\n", "!tar xf spark-2.4.5-bin-hadoop2.7.tgz\n", "!pip install -q findspark" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": {}, "colab_type": "code", "executionInfo": { "elapsed": 564, "status": "ok", "timestamp": 1592472661406, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "n0lEmDq1wW03" }, "outputs": [], "source": [ "import os\n", "os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n", "os.environ[\"SPARK_HOME\"] = \"/content/spark-2.4.5-bin-hadoop2.7\"" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": {}, "colab_type": "code", "executionInfo": { "elapsed": 5731, "status": "ok", "timestamp": 1592472667341, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "JdCIBuSDwWxf" }, "outputs": [], "source": [ "import findspark\n", "findspark.init()\n", "\n", "from pyspark.sql import SparkSession\n", "spark = SparkSession.builder.master(\"local[*]\").getOrCreate()\n", "spark1 = SparkSession.builder.appName('basic').getOrCreate()\n", "#Test must give no error" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": {}, "colab_type": "code", "executionInfo": { "elapsed": 3954, "status": "ok", "timestamp": 1592472667342, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "FSH53MwnwWoT" }, "outputs": [], "source": [ "import pyspark" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": {}, "colab_type": "code", "executionInfo": { "elapsed": 3370, "status": "ok", "timestamp": 1592472667344, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "WrgDccHPwqte" }, "outputs": [], "source": [ "from pyspark import SparkConf, SparkContext\n", "conf = SparkConf().setAppName(\"basic\").setMaster(\"local\")\n", "#sc = SparkContext(conf=conf) ## for jupyter and Databricks\n", "sc = SparkContext.getOrCreate() ## for Colab" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": {}, "colab_type": "code", "executionInfo": { "elapsed": 2801, "status": "ok", "timestamp": 1592472667345, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "_32n-VwawqqX" }, "outputs": [], "source": [ "from pyspark.sql.types import *" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "nWNHbdMXw06T" }, "source": [ "# Analisi del prezzo delle Azioni di Apple\n", "\n", "In questo notebook faremo un po' di pratica con Spark, Dataframe e serie storiche (timeseries). Come lo faremo ? Andando ad analizzare il titolo azionario di Apple (AAPL)\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "colab_type": "code", "executionInfo": { "elapsed": 2406, "status": "ok", "timestamp": 1592472705548, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "xYVKJFJHLz2Q", "outputId": "26b164c9-0877-4c82-eb30-cff45f0d475e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2020-06-18 09:31:44-- https://frenzy86.s3.eu-west-2.amazonaws.com/fav/tecno/AAPL.csv\n", "Resolving frenzy86.s3.eu-west-2.amazonaws.com (frenzy86.s3.eu-west-2.amazonaws.com)... 52.95.149.42\n", "Connecting to frenzy86.s3.eu-west-2.amazonaws.com (frenzy86.s3.eu-west-2.amazonaws.com)|52.95.149.42|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 646721 (632K) [application/vnd.ms-excel]\n", "Saving to: ‘AAPL.csv’\n", "\n", "AAPL.csv 100%[===================>] 631.56K 1.40MB/s in 0.4s \n", "\n", "2020-06-18 09:31:45 (1.40 MB/s) - ‘AAPL.csv’ saved [646721/646721]\n", "\n" ] } ], "source": [ "!wget https://frenzy86.s3.eu-west-2.amazonaws.com/fav/tecno/AAPL.csv" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 460 }, "colab_type": "code", "executionInfo": { "elapsed": 6718, "status": "ok", "timestamp": 1592472714051, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "uaJLPnYZLz2X", "outputId": "250d9781-c6e1-4f80-92d3-3a73ac8c0308" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------------------+--------+--------+--------+--------+---------+---------+\n", "| Date| Open| High| Low| Close|Adj Close| Volume|\n", "+-------------------+--------+--------+--------+--------+---------+---------+\n", "|1980-12-12 00:00:00|0.513393|0.515625|0.513393|0.513393| 0.410525|117258400|\n", "|1980-12-15 00:00:00|0.488839|0.488839|0.486607|0.486607| 0.389106| 43971200|\n", "|1980-12-16 00:00:00|0.453125|0.453125|0.450893|0.450893| 0.360548| 26432000|\n", "|1980-12-17 00:00:00|0.462054|0.464286|0.462054|0.462054| 0.369472| 21610400|\n", "|1980-12-18 00:00:00|0.475446|0.477679|0.475446|0.475446| 0.380182| 18362400|\n", "|1980-12-19 00:00:00|0.504464|0.506696|0.504464|0.504464| 0.403385| 12157600|\n", "|1980-12-22 00:00:00|0.529018|0.531250|0.529018|0.529018| 0.423019| 9340800|\n", "|1980-12-23 00:00:00|0.551339|0.553571|0.551339|0.551339| 0.440868| 11737600|\n", "|1980-12-24 00:00:00|0.580357|0.582589|0.580357|0.580357| 0.464072| 12000800|\n", "|1980-12-26 00:00:00|0.633929|0.636161|0.633929|0.633929| 0.506909| 13893600|\n", "|1980-12-29 00:00:00|0.642857|0.645089|0.642857|0.642857| 0.514049| 23290400|\n", "|1980-12-30 00:00:00|0.629464|0.629464|0.627232|0.627232| 0.501554| 17220000|\n", "|1980-12-31 00:00:00|0.611607|0.611607|0.609375|0.609375| 0.487276| 8937600|\n", "|1981-01-02 00:00:00|0.616071|0.620536|0.616071|0.616071| 0.492630| 5415200|\n", "|1981-01-05 00:00:00|0.604911|0.604911|0.602679|0.602679| 0.481921| 8932000|\n", "|1981-01-06 00:00:00|0.578125|0.578125|0.575893|0.575893| 0.460502| 11289600|\n", "|1981-01-07 00:00:00|0.553571|0.553571|0.551339|0.551339| 0.440868| 13921600|\n", "|1981-01-08 00:00:00|0.542411|0.542411|0.540179|0.540179| 0.431944| 9956800|\n", "|1981-01-09 00:00:00|0.569196|0.571429|0.569196|0.569196| 0.455147| 5376000|\n", "|1981-01-12 00:00:00|0.569196|0.569196|0.564732|0.564732| 0.451577| 5924800|\n", "+-------------------+--------+--------+--------+--------+---------+---------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "df = spark.read.csv(\"AAPL.csv\", header=True, inferSchema=True)\n", "df.show()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 170 }, "colab_type": "code", "executionInfo": { "elapsed": 790, "status": "ok", "timestamp": 1592472721231, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "HzQEnNR_Lz2c", "outputId": "22dd403c-ad8f-4c55-8adb-a2f0289c0424" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- Date: timestamp (nullable = true)\n", " |-- Open: string (nullable = true)\n", " |-- High: string (nullable = true)\n", " |-- Low: string (nullable = true)\n", " |-- Close: string (nullable = true)\n", " |-- Adj Close: string (nullable = true)\n", " |-- Volume: string (nullable = true)\n", "\n" ] } ], "source": [ "df.printSchema()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "nZtkD_GULz2f" }, "source": [ "Lo schema è stato interpretato male, tutte le colonne, ad eccezione del timestamp, sono state lette come stringhe." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "V5I_FgHELz2f" }, "source": [ "## Correggiamo lo schema\n", "Definiamo manualmente lo schema con i tipi corretti." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 460 }, "colab_type": "code", "executionInfo": { "elapsed": 999, "status": "ok", "timestamp": 1592472736538, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "8l8j6j1eLz2g", "outputId": "86cb9697-6c5d-4097-8782-d4b69ca2318c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------------------+--------+--------+--------+--------+---------+---------+\n", "| Date| Open| High| Low| Close|Adj Close| Volume|\n", "+-------------------+--------+--------+--------+--------+---------+---------+\n", "|1980-12-12 00:00:00|0.513393|0.515625|0.513393|0.513393| 0.410525|117258400|\n", "|1980-12-15 00:00:00|0.488839|0.488839|0.486607|0.486607| 0.389106| 43971200|\n", "|1980-12-16 00:00:00|0.453125|0.453125|0.450893|0.450893| 0.360548| 26432000|\n", "|1980-12-17 00:00:00|0.462054|0.464286|0.462054|0.462054| 0.369472| 21610400|\n", "|1980-12-18 00:00:00|0.475446|0.477679|0.475446|0.475446| 0.380182| 18362400|\n", "|1980-12-19 00:00:00|0.504464|0.506696|0.504464|0.504464| 0.403385| 12157600|\n", "|1980-12-22 00:00:00|0.529018| 0.53125|0.529018|0.529018| 0.423019| 9340800|\n", "|1980-12-23 00:00:00|0.551339|0.553571|0.551339|0.551339| 0.440868| 11737600|\n", "|1980-12-24 00:00:00|0.580357|0.582589|0.580357|0.580357| 0.464072| 12000800|\n", "|1980-12-26 00:00:00|0.633929|0.636161|0.633929|0.633929| 0.506909| 13893600|\n", "|1980-12-29 00:00:00|0.642857|0.645089|0.642857|0.642857| 0.514049| 23290400|\n", "|1980-12-30 00:00:00|0.629464|0.629464|0.627232|0.627232| 0.501554| 17220000|\n", "|1980-12-31 00:00:00|0.611607|0.611607|0.609375|0.609375| 0.487276| 8937600|\n", "|1981-01-02 00:00:00|0.616071|0.620536|0.616071|0.616071| 0.49263| 5415200|\n", "|1981-01-05 00:00:00|0.604911|0.604911|0.602679|0.602679| 0.481921| 8932000|\n", "|1981-01-06 00:00:00|0.578125|0.578125|0.575893|0.575893| 0.460502| 11289600|\n", "|1981-01-07 00:00:00|0.553571|0.553571|0.551339|0.551339| 0.440868| 13921600|\n", "|1981-01-08 00:00:00|0.542411|0.542411|0.540179|0.540179| 0.431944| 9956800|\n", "|1981-01-09 00:00:00|0.569196|0.571429|0.569196|0.569196| 0.455147| 5376000|\n", "|1981-01-12 00:00:00|0.569196|0.569196|0.564732|0.564732| 0.451577| 5924800|\n", "+-------------------+--------+--------+--------+--------+---------+---------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "from pyspark.sql.types import *\n", "\n", "data_schema = [ #StructField('Date', DateType(), True),\n", " StructField('Date', TimestampType(), True),\n", " StructField('Open', FloatType(), True),\n", " StructField('High', FloatType(), True),\n", " StructField('Low', FloatType(), True),\n", " StructField('Close', FloatType(), True),\n", " StructField('Adj Close', FloatType(), True),\n", " StructField('Volume', IntegerType(), True),]\n", " \n", "schema = StructType(fields=data_schema)\n", "\n", "df = spark.read.schema(schema).option(\"header\",\"true\").option(\"inferSchema\",\"false\").csv(\"AAPL.csv\")\n", "df.show()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 170 }, "colab_type": "code", "executionInfo": { "elapsed": 663, "status": "ok", "timestamp": 1592472743951, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "xQKvufgQLz2j", "outputId": "c4a9f510-cb96-4ea0-ff7d-63f98860165c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- Date: timestamp (nullable = true)\n", " |-- Open: float (nullable = true)\n", " |-- High: float (nullable = true)\n", " |-- Low: float (nullable = true)\n", " |-- Close: float (nullable = true)\n", " |-- Adj Close: float (nullable = true)\n", " |-- Volume: integer (nullable = true)\n", "\n" ] } ], "source": [ "df.printSchema()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "kjSKLrYXLz2m" }, "source": [ "Adesso lo schema è corretto, l'ora presente nel timestamp è un'informazione superflua, dato che coincide sempre con la mezzanotte, rimuoviamola." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "oVju-629Lz2n" }, "source": [ "### Convertiamo il timestamp in una data\n", "Per rimuovere l'ora ci basta convertire il timestamp in una data, per farlo utilizziamo la funzione *.to_date(column, format)* di spark." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 460 }, "colab_type": "code", "executionInfo": { "elapsed": 694, "status": "ok", "timestamp": 1592472748923, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "6e0JXNmULz2o", "outputId": "63e5c3fe-36d6-4255-8fd2-3c1338ee3e22" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------+--------+--------+--------+--------+---------+---------+\n", "| Date| Open| High| Low| Close|Adj Close| Volume|\n", "+----------+--------+--------+--------+--------+---------+---------+\n", "|1980-12-12|0.513393|0.515625|0.513393|0.513393| 0.410525|117258400|\n", "|1980-12-15|0.488839|0.488839|0.486607|0.486607| 0.389106| 43971200|\n", "|1980-12-16|0.453125|0.453125|0.450893|0.450893| 0.360548| 26432000|\n", "|1980-12-17|0.462054|0.464286|0.462054|0.462054| 0.369472| 21610400|\n", "|1980-12-18|0.475446|0.477679|0.475446|0.475446| 0.380182| 18362400|\n", "|1980-12-19|0.504464|0.506696|0.504464|0.504464| 0.403385| 12157600|\n", "|1980-12-22|0.529018| 0.53125|0.529018|0.529018| 0.423019| 9340800|\n", "|1980-12-23|0.551339|0.553571|0.551339|0.551339| 0.440868| 11737600|\n", "|1980-12-24|0.580357|0.582589|0.580357|0.580357| 0.464072| 12000800|\n", "|1980-12-26|0.633929|0.636161|0.633929|0.633929| 0.506909| 13893600|\n", "|1980-12-29|0.642857|0.645089|0.642857|0.642857| 0.514049| 23290400|\n", "|1980-12-30|0.629464|0.629464|0.627232|0.627232| 0.501554| 17220000|\n", "|1980-12-31|0.611607|0.611607|0.609375|0.609375| 0.487276| 8937600|\n", "|1981-01-02|0.616071|0.620536|0.616071|0.616071| 0.49263| 5415200|\n", "|1981-01-05|0.604911|0.604911|0.602679|0.602679| 0.481921| 8932000|\n", "|1981-01-06|0.578125|0.578125|0.575893|0.575893| 0.460502| 11289600|\n", "|1981-01-07|0.553571|0.553571|0.551339|0.551339| 0.440868| 13921600|\n", "|1981-01-08|0.542411|0.542411|0.540179|0.540179| 0.431944| 9956800|\n", "|1981-01-09|0.569196|0.571429|0.569196|0.569196| 0.455147| 5376000|\n", "|1981-01-12|0.569196|0.569196|0.564732|0.564732| 0.451577| 5924800|\n", "+----------+--------+--------+--------+--------+---------+---------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "from pyspark.sql.functions import to_date\n", "\n", "df = df.withColumn('Date', to_date(df[\"Date\"], \"yyyy-MM-dd\"))\n", "df.show()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 170 }, "colab_type": "code", "executionInfo": { "elapsed": 536, "status": "ok", "timestamp": 1592472754950, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "d5pOVpPPLz2t", "outputId": "6a7922c6-b64f-4193-ca2d-7f861ddbecab" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- Date: date (nullable = true)\n", " |-- Open: float (nullable = true)\n", " |-- High: float (nullable = true)\n", " |-- Low: float (nullable = true)\n", " |-- Close: float (nullable = true)\n", " |-- Adj Close: float (nullable = true)\n", " |-- Volume: integer (nullable = true)\n", "\n" ] } ], "source": [ "df.printSchema()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "tm61T2-TLz2w" }, "source": [ "### Cerchiamo il giorno con il valore maggiore\n", "Per trovare il giorno in cui il prezzo del titolo ha raggiunto il valore maggiore dobbiamo semplicemente ordinare il dataset in base alla colonna *High*, in ordine discendente, poi stampiamo la prima riga." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 136 }, "colab_type": "code", "executionInfo": { "elapsed": 1170, "status": "ok", "timestamp": 1592472771188, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "B3yN7RvhLz2x", "outputId": "059bfadd-60b8-4dc5-8ff3-ce4b8a724bad" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------+------+------+------+------+---------+--------+\n", "| Date| Open| High| Low| Close|Adj Close| Volume|\n", "+----------+------+------+------+------+---------+--------+\n", "|2018-10-03|230.05|233.47|229.78|232.07|229.39209|28654800|\n", "+----------+------+------+------+------+---------+--------+\n", "only showing top 1 row\n", "\n" ] } ], "source": [ "from pyspark.sql.functions import max\n", "\n", "df.orderBy(\"High\", ascending=False).show(1)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "49Cy6xzSLz20" }, "source": [ "### Cerchiamo il giorno con il valore minore dopo il 2000\n", "Anche in questo caso l'operazione è semplice, filtriamo i giorni successivi al primo Gennaio del 2000 e ordiniamoli in base alla colonna *Low*, in ordine ascendente, quindi stampiamo la prima riga." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 136 }, "colab_type": "code", "executionInfo": { "elapsed": 1519, "status": "ok", "timestamp": 1592472781747, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "zVaOE504Lz20", "outputId": "e672a4a8-0f19-4c10-e8fe-39712c7328a2" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------+--------+--------+--------+--------+---------+---------+\n", "| Date| Open| High| Low| Close|Adj Close| Volume|\n", "+----------+--------+--------+--------+--------+---------+---------+\n", "|2003-04-17|0.942857|0.946429|0.908571|0.937143| 0.820964|154064400|\n", "+----------+--------+--------+--------+--------+---------+---------+\n", "only showing top 1 row\n", "\n" ] } ], "source": [ "df.filter(\"Date > '2000-01-01'\").orderBy(\"Low\", ascending=True).show(1)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "4l9nh2OULz23" }, "source": [ "### Calcoliamo la percentuale di giorni con un valore in chiusura minore di 100 USD \n", "Per calcolare la percentuale dobbiamo dividere il numero di giorni in cui il valore ha chiuso sotto i 100$ per il numero di gironi totali e moltiplicare per 100, dato che il numero avrà molte cifre dopo la virogola lo arrotondiamo utilizzando la funzio round di python." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "executionInfo": { "elapsed": 1267, "status": "ok", "timestamp": 1592472794360, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "35AUvbaBLz23", "outputId": "cdef3310-582b-4f3d-c125-fc40481451b5" }, "outputs": [ { "data": { "text/plain": [ "89.67" ] }, "execution_count": 19, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "round((df.filter(df[\"Close\"]<100).count() / df.count())*100, 2)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "NIFuNPqVLz27" }, "source": [ "I giorni sono quasi il 90% ed è comune dato che stiamo considerando fin da quando la Apple è stata quotata in borsa negli anni 80." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "_29AYycgLz27" }, "source": [ "### Contiamo i giorni successivi al 2014 con un valore in chiusura minore di 100 USD \n", "Ripetiamo lo stesso processo di sopra, ma questa volta filtriamo soltanto i giorno successivi al primo Gennaio 2014." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 460 }, "colab_type": "code", "executionInfo": { "elapsed": 747, "status": "ok", "timestamp": 1592472867807, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "88mvPRCKLz29", "outputId": "da3377e7-1ecb-406b-c7be-d4e4d7d5f79b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------+---------+---------+---------+---------+---------+---------+\n", "| Date| Open| High| Low| Close|Adj Close| Volume|\n", "+----------+---------+---------+---------+---------+---------+---------+\n", "|2014-01-02| 79.38286|79.575714| 78.86| 79.01857| 71.59167| 58671200|\n", "|2014-01-03| 78.98| 79.1|77.204285| 77.28286| 70.0191| 98116900|\n", "|2014-01-06| 76.77857| 78.11429| 76.22857|77.704285| 70.4009|103152700|\n", "|2014-01-07| 77.76|77.994286| 76.84571|77.148575| 69.89742| 79302300|\n", "|2014-01-08|76.972855| 77.93714| 76.95571|77.637146|70.340096| 64632400|\n", "|2014-01-09| 78.11429|78.122856| 76.47857| 76.64571| 69.44184| 69787200|\n", "|2014-01-10| 77.11857| 77.25714|75.872856|76.134285| 68.97846| 76244000|\n", "|2014-01-13| 75.70143| 77.5| 75.69714| 76.53286|69.339584| 94623200|\n", "|2014-01-14| 76.88857| 78.10429| 76.80857| 78.05572| 70.71932| 83140400|\n", "|2014-01-15| 79.07429| 80.02857| 78.80857|79.622856| 72.13917| 97909700|\n", "|2014-01-16| 79.27143| 79.55| 78.81143| 79.17857| 71.73665| 57319500|\n", "|2014-01-17| 78.78286| 78.86714| 77.12857| 77.23857| 69.97897|106684900|\n", "|2014-01-21| 77.28429| 78.58143| 77.20286| 78.43857|71.066185| 82131700|\n", "|2014-01-22| 78.70143|79.612854| 78.25857| 78.78714| 71.382| 94996300|\n", "|2014-01-23| 78.56286| 79.5| 77.83|79.454285| 71.98645|100809800|\n", "|2014-01-24| 79.14286| 79.37428| 77.82143| 78.01|70.677895|107338700|\n", "|2014-01-27| 78.58143| 79.25714| 77.96429| 78.64286| 71.25127|138719700|\n", "|2014-01-28| 72.68| 73.57143| 71.72429| 72.35714| 65.55634|266380800|\n", "|2014-01-29| 71.99286| 72.48143| 71.23143| 71.53571|64.812126|125702500|\n", "|2014-01-30| 71.79143| 72.35714|70.957146| 71.39714|64.686554|169625400|\n", "+----------+---------+---------+---------+---------+---------+---------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "df2014 = df.filter(\"Date >= '2014-01-01'\")\n", "df2014.show()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "executionInfo": { "elapsed": 1124, "status": "ok", "timestamp": 1592472873933, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "TcIc-O7oLz2_", "outputId": "86cc136b-fb2b-461d-e63a-a6ff54f2a98e" }, "outputs": [ { "data": { "text/plain": [ "21.54" ] }, "execution_count": 25, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "round((df2014.filter(df[\"Close\"]<100).count() / df2010.count())*100, 2)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "2483r3xILz3C" }, "source": [ "In questo caso la Apple ha chiuso sotto i 100$ per soltanto il 21.5% delle volte" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "kIDKAUQdLz3D" }, "source": [ "### Visualizziamo il massimo per anno\n", "Raggruppiamo il dataframe per anno, per estrarre solo l'anno dalla data possiamo usare la funzione *year(col)* di spark." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "colab": {}, "colab_type": "code", "executionInfo": { "elapsed": 467, "status": "ok", "timestamp": 1592472900756, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "3bXMu91vLz3E" }, "outputs": [], "source": [ "from pyspark.sql.functions import year\n", "\n", "dfGroupYear = df.groupBy(year(\"Date\").alias(\"Year\"))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "LoKkyj7ALz3H" }, "source": [ "Ora aggreghiamo in base al valore bassimo della colonna *High* e ordiniamo in base all'anno." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 785 }, "colab_type": "code", "executionInfo": { "elapsed": 2895, "status": "ok", "timestamp": 1592472905052, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "LxJbR5imLz3H", "outputId": "40311959-b5d0-46f6-ad50-941162754e73" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----+---------+\n", "|Year|max(High)|\n", "+----+---------+\n", "|2018| 233.47|\n", "|2017| 177.2|\n", "|2016| 118.69|\n", "|2015| 134.54|\n", "|2014| 119.75|\n", "|2013| 82.16286|\n", "|2012|100.72429|\n", "|2011| 60.95714|\n", "|2010|46.665714|\n", "|2009|30.564285|\n", "|2008|28.608572|\n", "|2007|28.994286|\n", "|2006|13.308572|\n", "|2005| 10.78|\n", "|2004| 4.969285|\n", "|2003| 1.786429|\n", "|2002| 1.869286|\n", "|2001| 1.937143|\n", "|2000| 5.370536|\n", "|1999| 4.214286|\n", "|1998| 1.5625|\n", "|1997| 1.055804|\n", "|1996| 1.267857|\n", "|1995| 1.790179|\n", "|1994| 1.5625|\n", "|1993| 2.330357|\n", "|1992| 2.5|\n", "|1991| 2.616071|\n", "|1990| 1.705357|\n", "|1989| 1.799107|\n", "|1988| 1.705357|\n", "|1987| 2.133929|\n", "|1986| 0.783482|\n", "|1985| 0.555804|\n", "|1984| 0.613839|\n", "|1983| 1.129464|\n", "|1982| 0.622768|\n", "|1981| 0.620536|\n", "|1980| 0.645089|\n", "+----+---------+\n", "only showing top 39 rows\n", "\n" ] } ], "source": [ "from pyspark.sql.functions import year\n", "\n", "dfHigh = dfGroupYear.agg(max(\"High\"))\n", "dfHigh.orderBy(\"Year\", ascending=False).show(39)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "62MNzdmZLz3J" }, "source": [ "### Troviamo l'anno con il volume maggiore\n", "Utiliziamo il Dataframe raggruppato per anno che abbiamo creato sopra e sommiamo i volumi di ogni anno utilizzando il metodo *.agg*, poi stampiamo i 10 risultati ordinati per volumi totali." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 290 }, "colab_type": "code", "executionInfo": { "elapsed": 2584, "status": "ok", "timestamp": 1592472909413, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "g2mvakqMLz3K", "outputId": "4074b66c-ad46-4006-f66c-62926a9b03e9" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----+-----------+\n", "|Year| Volume|\n", "+----+-----------+\n", "|2008|71495301500|\n", "|2007|61748996400|\n", "|2006|53924741500|\n", "|2005|45600245600|\n", "|2010|37756231800|\n", "|2009|35813421700|\n", "|1999|34275676400|\n", "|2012|32991051100|\n", "|2011|31014834900|\n", "|2004|30450417200|\n", "+----+-----------+\n", "only showing top 10 rows\n", "\n" ] } ], "source": [ "from pyspark.sql.functions import sum\n", "\n", "dfVolume = dfGroupYear.agg(sum(\"Volume\").alias(\"Volume\"))\n", "dfVolume.orderBy(\"Volume\", ascending=False).show(10)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "eaiFaULsLz3N" }, "source": [ "### Verifichiamo la variazione del titolo di Apple in seguito alla commercializzazione dell'iPhone." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "4Jsu5v9nLz3N" }, "source": [ "Dobbiamo considerare il valore del titolo tra il 2007-06-29 e 180 giorni dopo, per calcolare la seconda data utilizziamo il modulo datetime di python." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "executionInfo": { "elapsed": 481, "status": "ok", "timestamp": 1592472913995, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "eD17RwKcLz3O", "outputId": "9fd3f0b4-78ce-4061-c5cb-80a919f603ec" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2007-12-26\n" ] } ], "source": [ "from datetime import datetime, timedelta\n", "\n", "start_date = '2007-06-29'\n", "end_date = datetime.strptime(start_date, \"%Y-%m-%d\")+timedelta(days=180) # convertiamo la stringa in data e aggiungiamo 180 giorni\n", "end_date = datetime.strftime(end_date, \"%Y-%m-%d\") # convertiamo la data in stringa\n", "\n", "print(end_date)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "pDp9B0xXLz3Q" }, "source": [ "Ottimo, ora estraiamo i valori per la data di inizio all'interno di un dizionario python, possiamo farlo usando un semplice filtro e poi il metodo *.asDict()* sull'oggetto Row." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 136 }, "colab_type": "code", "executionInfo": { "elapsed": 890, "status": "ok", "timestamp": 1592472916054, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "M7URbYAjLz3Q", "outputId": "b656bc61-8ebd-415c-c665-d31bf5bec6ca" }, "outputs": [ { "data": { "text/plain": [ "{'Adj Close': 15.272941589355469,\n", " 'Close': 17.43428611755371,\n", " 'Date': datetime.date(2007, 6, 29),\n", " 'High': 17.714284896850586,\n", " 'Low': 17.29857063293457,\n", " 'Open': 17.424285888671875,\n", " 'Volume': 284460400}" ] }, "execution_count": 33, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "rowFirst = df.filter(\"Date == '\"+start_date+\"'\").take(1)\n", "dictFirst = rowFirst[0].asDict()\n", "\n", "dictFirst" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "6E_s-voWLz3T" }, "source": [ "Ripetiamo lo stesso procedimento per la data di fine." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 136 }, "colab_type": "code", "executionInfo": { "elapsed": 491, "status": "ok", "timestamp": 1592472917673, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "hY6Yo7sgLz3U", "outputId": "558bb650-dfad-4078-ed05-689714f15fb1" }, "outputs": [ { "data": { "text/plain": [ "{'Adj Close': 24.89799690246582,\n", " 'Close': 28.421428680419922,\n", " 'Date': datetime.date(2007, 12, 26),\n", " 'High': 28.70857048034668,\n", " 'Low': 28.117143630981445,\n", " 'Open': 28.43000030517578,\n", " 'Volume': 175933100}" ] }, "execution_count": 34, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "rowLast = df.filter(\"Date == '\"+end_date+\"'\").take(1)\n", "dictLast = rowLast[0].asDict()\n", "\n", "dictLast" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "9abhcqbRLz3W" }, "source": [ "Ora per ottenere la variazione in percentuale ci basta applicare la seguente formula:\n", "

\n", "$$\\frac{close-open}{close}*100$$\n", "e arrotondiamo il risultato a 2 cifre dopo la virgola." ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "executionInfo": { "elapsed": 531, "status": "ok", "timestamp": 1592472919206, "user": { "displayName": "T3Lab Vision", "photoUrl": "", "userId": "14779383426442114373" }, "user_tz": -120 }, "id": "JWWyyLVLLz3X", "outputId": "4d2af453-b8e7-470b-a57c-ebfc1198398b" }, "outputs": [ { "data": { "text/plain": [ "38.69" ] }, "execution_count": 35, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "round((dictLast[\"Close\"]-dictFirst[\"Open\"])/dictLast[\"Close\"]*100,2)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "ym8tUPopLz3a" }, "source": [ "Un +38.7% per la gioia degli investitori :D" ] } ], "metadata": { "colab": { "name": "apple_stock.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 1 }