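Building a dashboard in Python using Streamlit

You're probably familiar with the saying "a picture is worth a thousand words." In data science, a visualization is just as valuable: it offers different views of tabular data, such as a simple line chart, a histogram of a distribution, or a more elaborate pivot table.

Useful as they are, the charts we typically see in print or on the web are most likely static. Imagine how much more engaging it would be to manipulate the variables behind those static charts in an interactive dashboard.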

🏂

Ready to jump right in? Here's the dashboard app and the GitHub repo.

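In this blog, you'll learn how to build a population dashboard app that presents US population data and visualizations for 2010-2019, using data from the US Census Bureau.

I'll walk you through building this interactive dashboard app from scratch, with Streamlit as the frontend. The backend muscle comes from PyData heavyweights such as NumPy, Pandas, Scikit-Learn, and Altair, which handle the data processing and analytics.

You'll learn how to:

Define key metrics
Perform exploratory data analysis (EDA)
Build a dashboard app with Streamlit

What's inside the dashboard?

Here's a visual breakdown of the components that make up this population dashboard app:

Let's get started!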

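1. Define key metrics

Before we actually build the dashboard, we need to come up with well-defined metrics that measure what matters.

1.1 Overview of key metrics

The goal of any dashboard is to surface insights that inform data-driven decisions. What is the dashboard's primary purpose? The answer guides the follow-up questions you want the dashboard to answer in the form of key metrics.

For example:

In sales, the primary goal may be to understand "How is the sales team performing?" Metrics might include total revenue by sales rep, units sold by region, or new leads generated over time.
In marketing, the primary goal may be to understand "How are my marketing campaigns performing?" This could involve measuring leading indicators, such as response rate or click-through rate, as well as lagging indicators, such as revenue conversion rate or customer acquisition cost.
In finance, the dashboard may need to answer "How profitable is our business?" This could include gross profit, operating margin, and return on assets.

1.2 Key metrics selected for this app

The primary question this population dashboard aims to answer is: how do US state populations change over time?

What other questions do we need to ask to reach that goal?

How do total populations compare among different states?
How do state populations evolve over time, and how do they compare with each other?
In a given year, which states saw more than 50,000 people move in or out? We'll label these as the inbound and outbound migration metrics.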

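2. Perform EDA

With the key metrics settled, we need to gather the data and understand it well before we can present it in a visually pleasing way in the dashboard.

Exploratory data analysis (EDA) can be described as an iterative process of understanding data by asking and answering questions through exploratory work. In essence, your dashboard starts as a blank canvas, and EDA offers a pragmatic way to arrive at compelling data visualizations that tell a story.

John Tukey's seminal 1977 work on EDA laid the foundation for communicating data effectively. A few key takeaways worth noting:

"The greatest value of a picture is when it forces us to notice what we never expected to see." Indeed, Tukey introduced the box plot (also known as the box-and-whisker plot).
Keep a flexible, open mind when working with data; that is what makes EDA "exploratory".

2.1 What data is available to you?

Here's a sample of the US Census Bureau dataset we use for the population dashboard. Three candidate variables (states, year, and population) will serve as the basis for our metrics.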

states, states_code, id, year, population
Alabama, AL, 1, 2010, 4785437
Alaska, AK, 2, 2010, 713910
Arizona, AZ, 4, 2010, 6407172
Arkansas, AR, 5, 2010, 2921964
California, CA, 6, 2010, 37319502

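2.2 Prepare the data

Merge the per-year columns into a single unified year column.

The advantage of this long format is that the data can be subset by year, which is the shape needed to generate the visualizations (for example, a heatmap or a choropleth map) and a sortable dataframe.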
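The reshaping step can be sketched with pandas.melt. This is only an illustration: the wide layout below (one column per year) is an assumption about how the raw file is organized, and the 2011 values are placeholders rather than Census figures.

```python
import pandas as pd

# Hypothetical wide-format input: one row per state, one column per year.
# The 2010 values come from the sample above; the 2011 values are placeholders.
df_wide = pd.DataFrame({
    "states": ["Alabama", "Alaska"],
    "states_code": ["AL", "AK"],
    "id": [1, 2],
    "2010": [4785437, 713910],
    "2011": [4800000, 720000],  # placeholder values, not Census figures
})

# Melt the year columns into two columns: `year` and `population`.
df_long = pd.melt(
    df_wide,
    id_vars=["states", "states_code", "id"],
    var_name="year",
    value_name="population",
)
df_long["year"] = df_long["year"].astype(int)  # column labels arrive as strings

# Subsetting by year is now a single boolean filter.
df_2010 = df_long[df_long["year"] == 2010]
print(df_long)
```

Once the data is in this shape, every chart in the dashboard can filter or group on the year column instead of picking a different column per year.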

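2.3 Choose the charts that best visualize the key metrics

Now that we understand the data better and know which key metrics to measure, it's time to decide how to visualize the results on the dashboard. There are countless ways to visualize a dataset; here are the ones selected for our population dashboard app.

How do total populations compare among different states?
- A choropleth map adds a geospatial dimension that highlights the most and least populous states.
How do state populations evolve over time, and how do they compare with each other?
- A heatmap gives a full overview of the most and least populous states by showing the data across years.
- A sorted dataframe allows a quick, direct comparison of the most and least populous states without jumping between different parts of a chart.
In a given year, what percentage of states saw more than 50,000 people move in or out?
- A donut chart is a pie chart with a hollow center; we use it to visualize the percentage of states with inbound and outbound migration.

There are countless ways to visualize a dataset!

You can discover more visualization options in the community's growing collection of custom components. Here are a few you can try:

streamlit-extras offers a wide range of components that extend Streamlit's native functionality.
streamlit-shadcn-ui provides several frontend UI components (modal, hover card, badge, and so on) that can be used in a dashboard app.
streamlit-elements lets you create draggable and resizable dashboard components.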

3. Build the dashboard with Streamlit

💡

Here's the dashboard app and the GitHub repo.

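3.1 Import libraries

First, we'll import the required libraries:

Streamlit - a low-code web framework
Pandas - a data analysis and wrangling tool
Altair - a data visualization library
Plotly Express - a terse, high-level API for creating figures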

import streamlit as st
import pandas as pd
import altair as alt
import plotly.express as px

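3.2 Page configuration

Next, we'll define the app's settings, giving it the page title and icon shown in the browser. The same call sets a wide layout, so the content spans the full width of the page, and shows the sidebar in its expanded state.

Here, we also set the Altair color theme to dark so the charts match the app's dark theme.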

st.set_page_config(
    page_title="US Population Dashboard",
    page_icon="🏂",
    layout="wide",
    initial_sidebar_state="expanded")

alt.themes.enable("dark")

3.3 Load data

Next, we'll load the data into the app with Pandas' read_csv() function, as follows:

df_reshaped = pd.read_csv('data/us-population-2010-2019-reshaped.csv')

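3.4 Add a sidebar

Now, we'll create the app title with st.title() and the drop-down widgets with st.selectbox(), which let users pick a specific year and a color theme.

selected_year (chosen from the available years, 2010-2019) is used to subset the data for that year, which is then displayed in the app.

selected_color_theme colors the choropleth map and the heatmap according to the theme chosen in that widget.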

with st.sidebar:
    st.title('🏂 US Population Dashboard')
    
    year_list = list(df_reshaped.year.unique())[::-1]
    
    selected_year = st.selectbox('Select a year', year_list, index=len(year_list)-1)
    df_selected_year = df_reshaped[df_reshaped.year == selected_year]
    df_selected_year_sorted = df_selected_year.sort_values(by="population", ascending=False)

    color_theme_list = ['blues', 'cividis', 'greens', 'inferno', 'magma', 'plasma', 'reds', 'rainbow', 'turbo', 'viridis']
    selected_color_theme = st.selectbox('Select a color theme', color_theme_list)
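One detail in the sidebar code is easy to miss: the year list is reversed so the newest year appears first, and the default index is then set to the last position. Here is a minimal sketch of which year that selects, assuming the CSV lists years in ascending order (an assumption, since the file itself is not shown):

```python
# Assumed order of df_reshaped.year.unique(): ascending, 2010 through 2019.
years_in_data = list(range(2010, 2020))

# Same reversal as in the sidebar code: newest year first.
year_list = years_in_data[::-1]

# The selectbox is created with index=len(year_list)-1, i.e. the last entry.
default_index = len(year_list) - 1
default_year = year_list[default_index]

print(year_list[0])   # first entry shown in the drop-down
print(default_year)   # entry pre-selected when the app loads
```

Under that assumption the app opens on the earliest year, where the Gains/Losses cards fall back to '-' because there is no previous year to compare against. Passing index=0 instead would open the app on the most recent year.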

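3.5 Plot and chart types

Next, we'll define custom functions that create the various plots shown in the dashboard.

Heatmap

The heatmap lets us see population growth from 2010 to 2019 for the 52 entries in the dataset (presumably the 50 states plus the District of Columbia and Puerto Rico, which the Census Bureau state totals also cover).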

def make_heatmap(input_df, input_y, input_x, input_color, input_color_theme):
    heatmap = alt.Chart(input_df).mark_rect().encode(
            y=alt.Y(f'{input_y}:O', axis=alt.Axis(title="Year", titleFontSize=18, titlePadding=15, titleFontWeight=900, labelAngle=0)),
            x=alt.X(f'{input_x}:O', axis=alt.Axis(title="", titleFontSize=18, titlePadding=15, titleFontWeight=900)),
            color=alt.Color(f'max({input_color}):Q',
                             legend=None,
                             scale=alt.Scale(scheme=input_color_theme)),
            stroke=alt.value('black'),
            strokeWidth=alt.value(0.25),
        ).properties(width=900
        ).configure_axis(
        labelFontSize=12,
        titleFontSize=12
        ) 
    # height=300
    return heatmap
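To get a feel for what the heatmap encodes, here is a rough pandas equivalent of its aggregation: one cell per year and state, colored by max(population). The tiny dataframe is illustrative only, and the 2011 values are placeholders rather than Census figures.

```python
import pandas as pd

# Long-format sample in the same shape as df_reshaped (2011 values are placeholders).
df_long = pd.DataFrame({
    "states": ["Alabama", "Alaska", "Alabama", "Alaska"],
    "year": [2010, 2010, 2011, 2011],
    "population": [4785437, 713910, 4800000, 720000],
})

# Rows play the role of the heatmap's y axis (year), columns its x axis (states),
# and each cell holds the value that drives the color scale.
grid = df_long.pivot_table(
    index="year", columns="states", values="population", aggfunc="max"
)
print(grid)
```

Because the reshaped data has one row per state and year, the max aggregation simply returns that row's population; it only matters if a combination appears more than once.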

Choropleth map

Next, the choropleth map shows a colored map of the 52 US states and territories in the dataset for the selected year.

def make_choropleth(input_df, input_id, input_column, input_color_theme):
    choropleth = px.choropleth(input_df, locations=input_id, color=input_column, locationmode="USA-states",
                               color_continuous_scale=input_color_theme,
                               range_color=(0, max(input_df[input_column])),  # scale to the data passed in, not a global
                               scope="usa",
                               labels={'population':'Population'}
                              )
    choropleth.update_layout(
        template='plotly_dark',
        plot_bgcolor='rgba(0, 0, 0, 0)',
        paper_bgcolor='rgba(0, 0, 0, 0)',
        margin=dict(l=0, r=0, t=0, b=0),
        height=350
    )
    return choropleth

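Donut chart

Next, we'll create a donut chart for the percentage of states with notable migration.

Specifically, it represents the percentage of states with annual inbound or outbound migration of more than 50,000 people. For example, in 2019, 12 of the 52 entries in the dataset met the criterion, which corresponds to 23%.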

Before creating the donut chart, we need to calculate the year-over-year population migration.

def calculate_population_difference(input_df, input_year):
  selected_year_data = input_df[input_df['year'] == input_year].reset_index()
  previous_year_data = input_df[input_df['year'] == input_year - 1].reset_index()
  selected_year_data['population_difference'] = selected_year_data.population.sub(previous_year_data.population, fill_value=0)
  return pd.concat([selected_year_data.states, selected_year_data.id, selected_year_data.population, selected_year_data.population_difference], axis=1).sort_values(by="population_difference", ascending=False)
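The function above lines up the two years by row position after reset_index(), so it relies on both years listing the states in the same order. As a hedged alternative, here is a sketch that matches on state name instead. The function name and the sample numbers are hypothetical, and unlike the original it returns NaN (rather than the full population) when there is no previous year.

```python
import pandas as pd

def population_difference_by_state(df, year):
    """Year-over-year change per state, matched on state name rather than row order."""
    ordered = df.sort_values(["states", "year"]).copy()
    # diff() within each state subtracts the previous row, i.e. the previous year,
    # assuming every state has one row per consecutive year.
    ordered["population_difference"] = ordered.groupby("states")["population"].diff()
    result = ordered[ordered["year"] == year]
    return result.sort_values("population_difference", ascending=False).reset_index(drop=True)

# Placeholder data for two hypothetical states, not Census figures.
sample = pd.DataFrame({
    "states": ["A", "B", "A", "B"],
    "year": [2010, 2010, 2011, 2011],
    "population": [1_000_000, 500_000, 1_060_000, 440_000],
})
result = population_difference_by_state(sample, 2011)
print(result[["states", "population", "population_difference"]])
```

Either way, the first year in the data has no meaningful difference, which is why the layout code later guards the metric cards with selected_year > 2010.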

The donut chart is then created from the state migration percentage described above.

def make_donut(input_response, input_text, input_color):
  # Map each supported color name to its [highlight, background] hex pair.
  color_pairs = {
      'blue': ['#29b5e8', '#155F7A'],
      'green': ['#27AE60', '#12783D'],
      'orange': ['#F39C12', '#875A12'],
      'red': ['#E74C3C', '#781F16'],
  }
  chart_color = color_pairs[input_color]
    
  source = pd.DataFrame({
      "Topic": ['', input_text],
      "% value": [100-input_response, input_response]
  })
  source_bg = pd.DataFrame({
      "Topic": ['', input_text],
      "% value": [100, 0]
  })
    
  plot = alt.Chart(source).mark_arc(innerRadius=45, cornerRadius=25).encode(
      theta="% value",
      color= alt.Color("Topic:N",
                      scale=alt.Scale(
                          #domain=['A', 'B'],
                          domain=[input_text, ''],
                          # range=['#29b5e8', '#155F7A']),  # 31333F
                          range=chart_color),
                      legend=None),
  ).properties(width=130, height=130)
    
  text = plot.mark_text(align='center', color="#29b5e8", font="Lato", fontSize=32, fontWeight=700, fontStyle="italic").encode(text=alt.value(f'{input_response} %'))
  plot_bg = alt.Chart(source_bg).mark_arc(innerRadius=45, cornerRadius=20).encode(
      theta="% value",
      color= alt.Color("Topic:N",
                      scale=alt.Scale(
                          # domain=['A', 'B'],
                          domain=[input_text, ''],
                          range=chart_color),  # 31333F
                      legend=None),
  ).properties(width=130, height=130)
  return plot_bg + plot + text

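Convert population to text

Next, we'll create a custom function that makes population values more concise and easier on the eye. Specifically, instead of showing a number such as 28,995,881 in the metric card, it shows the more compact 29.0 M. The same applies to values in the thousands.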

def format_number(num):
    if num > 1000000:
        if not num % 1000000:
            return f'{num // 1000000} M'
        return f'{round(num / 1000000, 1)} M'
    return f'{num // 1000} K'
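The same helper is also applied to the year-over-year deltas, which can be negative. With floor division, a loss of 1,234,567 would come out as -1235 K, and values under 1,000 as 0 K. A possible sign-aware variant is sketched below; the name format_number_signed is hypothetical and this is not part of the original app.

```python
def format_number_signed(num):
    """Format a count as 'x.y M' or 'n K', keeping the sign for losses."""
    sign = "-" if num < 0 else ""
    magnitude = abs(num)
    if magnitude >= 1_000_000:
        return f"{sign}{round(magnitude / 1_000_000, 1)} M"
    if magnitude >= 1_000:
        return f"{sign}{magnitude // 1_000} K"
    return f"{sign}{magnitude}"  # small values are shown unabbreviated

print(format_number_signed(28995881))   # 29.0 M
print(format_number_signed(713910))     # 713 K
print(format_number_signed(-1234567))   # -1.2 M
print(format_number_signed(-60000))     # -60 K
```

One difference from the original: exact multiples of a million are shown as, say, 2.0 M rather than 2 M, since this sketch drops that special case.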

3.6 App layout

Finally, it's time to put everything together in the app.

Define the layout

Start by creating 3 columns:

col = st.columns((1.5, 4.5, 2), gap='medium')

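Specifically, the input argument (1.5, 4.5, 2) sets relative widths: the second column is three times as wide as the first, and the third column is a little under half the width of the second (2 / 4.5 ≈ 0.44).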
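If the ratios feel abstract, the arithmetic is quick to check. This small sketch converts the weights into fractions of the available width; it ignores the space taken by the gap between columns.

```python
# Relative column weights passed to st.columns().
weights = (1.5, 4.5, 2)
total = sum(weights)

# Share of the available width that each column receives.
fractions = [w / total for w in weights]

second_vs_first = weights[1] / weights[0]
third_vs_second = weights[2] / weights[1]

print(fractions)                    # [0.1875, 0.5625, 0.25]
print(second_vs_first)              # 3.0
print(round(third_vs_second, 2))    # 0.44
```

So the middle column, which holds the map and the heatmap, gets a bit more than half of the page.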

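Column 1

The Gains/Losses section shows metric cards for the states with the highest inbound and outbound migration in the selected year (as chosen in the "Select a year" drop-down created with st.selectbox).

The States Migration section shows donut charts with the percentage of states whose annual inbound or outbound migration exceeds 50,000.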

with col[0]:
    st.markdown('#### Gains/Losses')

    df_population_difference_sorted = calculate_population_difference(df_reshaped, selected_year)

    if selected_year > 2010:
        first_state_name = df_population_difference_sorted.states.iloc[0]
        first_state_population = format_number(df_population_difference_sorted.population.iloc[0])
        first_state_delta = format_number(df_population_difference_sorted.population_difference.iloc[0])
    else:
        first_state_name = '-'
        first_state_population = '-'
        first_state_delta = ''
    st.metric(label=first_state_name, value=first_state_population, delta=first_state_delta)

    if selected_year > 2010:
        last_state_name = df_population_difference_sorted.states.iloc[-1]
        last_state_population = format_number(df_population_difference_sorted.population.iloc[-1])   
        last_state_delta = format_number(df_population_difference_sorted.population_difference.iloc[-1])   
    else:
        last_state_name = '-'
        last_state_population = '-'
        last_state_delta = ''
    st.metric(label=last_state_name, value=last_state_population, delta=last_state_delta)

    
    st.markdown('#### States Migration')

    if selected_year > 2010:
        # Filter states with population difference > 50000
        # df_greater_50000 = df_population_difference_sorted[df_population_difference_sorted.population_difference_absolute > 50000]
        df_greater_50000 = df_population_difference_sorted[df_population_difference_sorted.population_difference > 50000]
        df_less_50000 = df_population_difference_sorted[df_population_difference_sorted.population_difference < -50000]
        
        # % of States with population difference > 50000
        states_migration_greater = round((len(df_greater_50000)/df_population_difference_sorted.states.nunique())*100)
        states_migration_less = round((len(df_less_50000)/df_population_difference_sorted.states.nunique())*100)
        donut_chart_greater = make_donut(states_migration_greater, 'Inbound Migration', 'green')
        donut_chart_less = make_donut(states_migration_less, 'Outbound Migration', 'red')
    else:
        states_migration_greater = 0
        states_migration_less = 0
        donut_chart_greater = make_donut(states_migration_greater, 'Inbound Migration', 'green')
        donut_chart_less = make_donut(states_migration_less, 'Outbound Migration', 'red')

    migrations_col = st.columns((0.2, 1, 0.2))
    with migrations_col[1]:
        st.write('Inbound')
        st.altair_chart(donut_chart_greater)
        st.write('Outbound')
        st.altair_chart(donut_chart_less)
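The percentage logic in the States Migration block can be checked in isolation. Below is a minimal sketch with a hypothetical helper, migration_shares, fed with made-up deltas. Note that the dashboard uses the year-over-year population change as its stand-in for migration.

```python
def migration_shares(differences, threshold=50_000):
    """Return (inbound %, outbound %) of entries whose change exceeds the threshold."""
    total = len(differences)
    inbound = sum(1 for d in differences if d > threshold)
    outbound = sum(1 for d in differences if d < -threshold)
    return round(inbound / total * 100), round(outbound / total * 100)

# Made-up deltas for 52 entries: 12 large gains, 3 large losses, 37 small changes.
differences = [60_000] * 12 + [-70_000] * 3 + [10_000] * 37
inbound_pct, outbound_pct = migration_shares(differences)
print(inbound_pct, outbound_pct)  # 23 6
```

With 12 of 52 entries above the threshold, the inbound share rounds to 23%, matching the 2019 example given earlier.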

Column 2

Next, the second column displays the choropleth map and the heatmap, using the custom functions created earlier.

with col[1]:
    st.markdown('#### Total Population')
    
    choropleth = make_choropleth(df_selected_year, 'states_code', 'population', selected_color_theme)
    st.plotly_chart(choropleth, use_container_width=True)
    
    heatmap = make_heatmap(df_reshaped, 'year', 'states', 'population', selected_color_theme)
    st.altair_chart(heatmap, use_container_width=True)

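Column 3

Finally, the third column shows the top states in a dataframe, where population is rendered as a colored progress bar through the column_config parameter of st.dataframe.

An About section is shown in an st.expander() container; it gives information on the data source and defines the terms used in the dashboard.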

with col[2]:
    st.markdown('#### Top States')

    st.dataframe(df_selected_year_sorted,
                 column_order=("states", "population"),
                 hide_index=True,
                 width=None,
                 column_config={
                    "states": st.column_config.TextColumn(
                        "States",
                    ),
                    "population": st.column_config.ProgressColumn(
                        "Population",
                        format="%f",
                        min_value=0,
                        max_value=max(df_selected_year_sorted.population),
                     )}
                 )
    
    with st.expander('About', expanded=True):
        st.write('''
            - Data: [U.S. Census Bureau](<https://www.census.gov/data/datasets/time-series/demo/popest/2010s-state-total.html>).
            - :orange[**Gains/Losses**]: states with high inbound/ outbound migration for selected year
            - :orange[**States Migration**]: percentage of states with annual inbound/ outbound migration > 50,000
            ''')

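3.7 Deploy the dashboard app to the cloud

For a video walkthrough of deploying a Streamlit app, watch this tutorial on YouTube.

Bonus: 5 things to keep in mind when building a dashboard

Perform EDA to gain an understanding of the data
Identify key metrics to track what matters
Decide on the charts that best visualize the key metrics
Group related metrics together
Use clear and concise labels to describe metrics

Wrapping up

In summary, Streamlit offers a quick, efficient, and code-friendly way to build interactive dashboard apps in Python, making it a go-to tool for data scientists and developers who work on data visualization.

A key feature of Streamlit is that it automatically reruns and re-renders the app whenever the data or an input parameter changes, which makes it well suited to real-time data visualization tasks.

Watch this tutorial video to follow along:

What kind of dashboard are you building? Share it in the comments below to inspire the community, or to ask for feedback!

Follow me on X at @thedataprof, on LinkedIn at Chanin Nantasenamat, or subscribe to my YouTube channel, Data Professor!

Happy Streamlit-ing! 📊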