Daniel Ng
Email Me Contact Me

You're welcome to email me at dng11@sympatico.ca.

While I do like hearing from you, I am not interested in SEO services nor offshoring my business, so please don't offer me any.

Follow me on Twitter Subscribe to my RSS feed

Extract Data From Excel Files Into Database VB.NET Code Sample
Originally posted by Daniel Ng on June 16, 2013.

I was recently tasked to extract data from Excel files into a SQL Server database. These Excel files are fairly complex (see sample screenshots #1 & #2). They are actually the Ontario Structure Inspection Manual (OSIM) inspection forms for bridge structure. Each Excel file represents an entire inspection form of a particular bridge. And within each file, they may have numerous Excel worksheets. Each worksheet could be the sub-form of a particular bridge element inspected.

It is an understatement to say that these files are not in a format that can be readily imported into SQL Server. And I am under the gun to get this task completed as soon as possible.


Task Challenges

Some of the challenges associated with this task are:

  1. I have more than 180 Excel files to go through. From each file, we need to extract around 50 attributes for the bridge and around 30 attributes for each bridge element. The number of bridge elements varies from bridge to bridge. It can be anywhere from 5 to 30.
  2. Lack of data validation in Excel, e.g., a numeric form field could contain a string, or a date form field could contain just the year value and not the full date, and etc.
  3. Rectangle shape object is being used as checkbox in inspection form.
  4. A particular form field may be present in some files but not in others.
  5. Most form fields have common cell location but some don't. For instance, on one file, the field may be in cell "A10". On another file, the same field may be in cell "A20". They will most certainly be under the same column but will be on different row.

Bottom line is that manually transferring data into the database is too labour-intensive and time-consuming. On the other hand, automating this process can be tricky.


My Solution

I decided to develop a VB.NET Windows application to automate the data extraction and import process. This small application took me 2 to 3 days to develop from start to finish. Majority of my time were spent on identifying all the different variations in my Excel files. Now, with a click of a button, all 180 Excel files can be processed under 10 minutes. Extracted data will be automatically inserted into the 2 database tables which I have set up prior (i.e., one for storing bridge data and one for bridge element).

As you can tell, the development effort I put into this is totally justified. If I would to do this task manually, it would take me at least a week to complete and you can expect plenty of data entry errors with it. Furthermore, If my client comes back to me and asks for additional attributes to extract or they have a new set of files to upload, I can simply modify my code, re-run the program and be done with it. The cost-saving, in this case, is evident.


My VB.NET Source Code

Main Screen

I would like to share some of my source code here for couple of reasons:

  1. To illustrate how I handle some of the common challenges associating with extracting values from an Excel file.
  2. This is an example of a multi-threaded application. I leveraged the BackgroundWorker class to do the main processing on a separate thread to avoid UI freezing. In my code, you will see how I manage to pass data to the worker thread and how I receive progress report back from it.

Please bear in mind that this application is not polished and doesn't have all the bells and whistles. It is an automating tool for me to use and not my client. Making the application looks pretty is not one of my priorities.

I have inserted comments throughout my code, so they should be fairly easy to follow. And here is my main form class:

Imports System.IO
Imports Microsoft.Office.Core
Imports Microsoft.Office.Interop
Imports System.Runtime.InteropServices

Public Class frmMain

  Private excludeSheets As List(Of String)

  Private Sub frmMain_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load

    '*** Define a list of Excel worksheet names to exlude from processing. ***
    excludeSheets = New List(Of String)

    With excludeSheets
	  .Add("CanWeb")
	  .Add("General Input")
	  .Add("Lists")
	  .Add("East Approach")
	  .Add("East Elevation")
	  .Add("North Approach*") 'i.e., name that starts with "North Approach".
	  .Add("North Elevation")
	  .Add("S. Approach")
	  .Add("South Approach*") 'i.e., name that starts with "South Approach".
	  .Add("South Elevation")
	  .Add("West Approach")
	  .Add("West Elevation")
	  .Add("Sheet*")          'i.e., name that starts with "Sheet".
	  .Add("TP*")             'i.e., name that starts with "TP".
    End With

  End Sub

  Private Sub btnImport_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles btnImport.Click

    Try

      '*** Disable the button to prevent double-click. ***
      Me.btnImport.Enabled = False

      '*** Output message to UI. ***
      Me.txtOutput.Text = "Processing... please wait" & environment.newline

      '*** Retrieve a list of OSIM Excel files from a given directory. ***
      Dim files As List(Of String) = GetOSIMFiles("C:\Temp\2012 OSIM")

      '*** Proceed only if there is one or more files. ***
      If Not files Is Nothing AndAlso files.Count > 0 Then

        '*** Configure the progress bar. ***
        With Me.pbarProgress
          .Style = ProgressBarStyle.Continuous
          .Maximum = 100
          .Minimum = 0
          .Value = 0
          .Step = CInt(100 / files.Count)
          .Visible = True
        End With

        '*** Do processing on a separate thread. Pass in a list of file names. ***
        '*** This line will trigger the BackgroundWorker1_DoWork() event. ***
        Me.BackgroundWorker1.RunWorkerAsync(files)

      Else

        MessageBox.Show("There is no file to process in the specified directory.", _
        "Action Terminated", MessageBoxButtons.OK, MessageBoxIcon.Warning)

        Me.btnImport.Enabled = True
        Me.txtOutput.Text = "Action terminated."

      End If

    Catch ex As Exception

      MessageBox.Show(ex.ToString, "Application Error", MessageBoxButtons.OK, MessageBoxIcon.Error)

    End Try

  End Sub

  Private Sub BackgroundWorker1_DoWork(ByVal sender As Object, ByVal e As System.ComponentModel.DoWorkEventArgs) Handles BackgroundWorker1.DoWork

    Dim oApp As Excel.Application = Nothing
    Dim oWorkbooks As Excel.Workbooks = Nothing
    Dim oWorkbook As Excel.Workbook = Nothing
    Dim oSheets As Excel.Sheets = Nothing
    Dim oSheet As Excel.Worksheet = Nothing
    Dim elementNo As Integer = 0
    Dim counter As Integer = 0
    Dim bridgeAdded As Integer = 0
    Dim bridgeElementAdded As Integer = 0
    Dim k As Integer = 0

    Try

      '*** This is a LINQ to SQL class created for me to connect to my db. ***
      Using db As New DCMainDataContext

      '*** Instantiate a new Excel application object. ***
      oApp = New Excel.Application
      oApp.Visible = False
      oApp.ScreenUpdating = False
      oApp.DisplayAlerts = False

      '*** Get a list of files from the Argument object. ***
      Dim files As List(Of String) = CType(e.Argument, List(Of String))

      '*** Iterate through all the Excel files. ***
      For Each f As String In files

        '*** Report progress back to UI thread. ***
        Me.BackgroundWorker1.ReportProgress(counter, String.Format("Reading {0}", f))

        k += 1
		
        '*** Convert the counter into a percent. ***
        counter = CInt((k * 100) / files.Count)

        '*** Open each Excel file. ***
        oWorkbooks = oApp.Workbooks
        oWorkbook = oWorkbooks.Open(f)
        oWorkbook.KeepChangeHistory = False
        oSheets = oWorkbook.Worksheets

        '*** Instantiate a new bridge object to store data. ***
        Dim newBridge As New Bridge

        '*** Assign some default attributes to the object. ***
        With newBridge
          .SiteID = "RDS"
          .ImportFile = f
          .ImportDate = Now
        End With

        Dim sheetName As String = ""
        Dim tmp As Integer = -1

        '*** Iterate through all the Excel worksheets. ***
        For i As Integer = 1 To oSheets.Count

          oSheet = CType(oSheets(i), Excel.Worksheet)
          sheetName = oSheet.Name.Trim

          If IsExclude(sheetName) = False Then

            '*** Proceed here only if the worksheet is not on the exclusion list. ***

            Select Case sheetName

              Case "General"

                With newBridge

                  '*** These form fields have a common cell location. ***
                  '*** I merely have to pass in a cell address to a custom method. ***
				  
                  .InspectionID = GetCellAsString(oSheet, "K1").Left(36)
                  .MTOSiteNumber = GetCellAsString(oSheet, "K2").Left(100)
                  .BridgeName = GetCellAsString(oSheet, "C6").Left(100)
                  .StructureType = GetCellAsString(oSheet, "C16").Left(100)
                  .MainHighwayOrRoad = GetCellAsString(oSheet, "J8").Left(100)
                  .RoadwayWidth = GetCellAsDouble(oSheet, "K20")
                  .TotalDeckArea_sqm = GetCellAsDouble(oSheet, "C18")
                  .TotalDeckLength_m = GetCellAsDouble(oSheet, "C20")
                  .NoOfSpans = GetCellAsDouble(oSheet, "C22")
                  .DirectionofStructure = GetCellAsString(oSheet, "C26").Left(100)
                  .YearBuilt = GetCellAsDate(oSheet, "D31")
                  .LastBiennialInspection = GetCellAsDate(oSheet, "D35")
                  .LastConditionSurvey = GetCellAsString(oSheet, "D37").Left(100)
                  .CurrentLoadLimit = GetCellAsDouble(oSheet, "K33")
                  .LastBridgeMasterInspection = GetCellAsString(oSheet, "K35").Left(100)
                  .LastUnderwaterInspection = GetCellAsString(oSheet, "K37").Left(100)
                  .OnOrUnderRoad = GetCellAsString(oSheet, "K8").Left(100)
				 
                  '*** These form fields may or may not be there. ***
                  '*** I need to find whether a form field label is there first. ***
                  '*** The form field value is typicall right beside it. ***
				  
                  tmp = FindLabelRowNumber(oSheet, "Region", "B6", "B16")
                  If tmp > 0 Then 
                    .MTORegion = GetCellAsString(oSheet, "C" & tmp.ToString).Left(100)
                  End If
				  
                  tmp = FindLabelRowNumber(oSheet, "District", "B6", "B16")
                  If tmp > 0 Then 
                    .MTODistrict = GetCellAsString(oSheet, "C" & tmp.ToString).Left(100)
                  End If                 
                  
                  tmp = FindLabelRowNumber(oSheet, "Crossing:", "H8", "H22")
                  If tmp > 0 Then 
                    .CrossingType = GetCellAsString(oSheet, "J" & tmp.ToString).Left(100)
                  End If
                 
                  tmp = FindLabelRowNumber(oSheet, "Owner", "H8", "H22")
                  If tmp > 0 Then 
                    .Owners = GetCellAsString(oSheet, "J" & tmp.ToString).Left(100)
                  End If
                  
                  tmp = FindLabelRowNumber(oSheet, "AADT", "H8", "H22")
                  If tmp > 0 Then 
                    .AADT = GetCellAsDouble(oSheet, "J" & tmp.ToString)
                  End If
				  
                  tmp = FindLabelRowNumber(oSheet, "Posted Speed", "H8", "H22")
                  If tmp > 0 Then 
                    .PostedSpeed = GetCellAsDouble(oSheet, "K" & tmp.ToString)
                  End If
				   
                  tmp = FindLabelRowNumber(oSheet, "Road Class", "H8", "H22")
                  If tmp > 0 Then 
                    .RoadClass = GetCellAsString(oSheet, "K" & tmp.ToString).Left(100)
                  End If
				  
                End With

              Case "General  (2)", "General (2)", "General 2"

                '*** Get additional attributes for the bridge object. ***
                With newBridge
                  .InspectionDate = GetCellAsDate(oSheet, "C6")
                  .Inspector = GetCellAsString(oSheet, "C7").Left(100)
                  .OthersInParty = GetCellAsString(oSheet, "C8").Left(500)
                  .EquipmentUsed = GetCellAsString(oSheet, "C9").Left(500)
                  .Weather = GetCellAsString(oSheet, "C10").Left(100)
                  .Temperature = GetCellAsString(oSheet, "C11").Left(100)
                End With

              Case "BCI"

                '*** Get additional attributes for the bridge object. ***
                With newBridge
                  tmp = FindLabelRowNumber(oSheet, "BCI", "B1", "B250")
                  If tmp > 0 Then .BCI = GetCellAsDouble(oSheet, "C" & tmp.ToString)
                End With

              Case "GPS & Costings"

                '*** Get additional attributes for the bridge object. ***
                With newBridge
                  .Latitude = GetCellAsDouble(oSheet, "B2")
                  .Longitude = GetCellAsDouble(oSheet, "C2")
                End With

              Case Else

                '*** Note: each worksheet represents a new bridge element. ***

                '*** Instantiate a new bridge element object to store data. ***
                Dim newBrideElement As New BridgeElement

                With newBrideElement

                  elementNo += 1
                  newBrideElement.ElementNumber = elementNo
                  newBrideElement.ImportFile = f
                  .SiteID = GetCellAsString(oSheet, "J2")
                  .ElementGroup = GetCellAsString(oSheet, "C4").Left(100)
                  .ElementName = GetCellAsString(oSheet, "C5").Left(100)
                  .Location = GetCellAsString(oSheet, "C6").Left(100)
                  .Material = GetCellAsString(oSheet, "C7").Left(36)
                  .ElementType = GetCellAsString(oSheet, "C8").Left(36)
                  If GetCheckBoxValue(oSheet, "C9") Then
                    .Environment = "Benign"
                  Else
                    If GetCheckBoxValue(oSheet, "E9") Then
                      .Environment = "Moderate"
                    Else
                      If GetCheckBoxValue(oSheet, "G9") Then
                        .Environment = "Severe"
                      End If
                    End If
                  End If
                  .Length = GetCellAsDouble(oSheet, "I4")
                  .Width = GetCellAsDouble(oSheet, "I5")
                  .Height = GetCellAsDouble(oSheet, "I6")
                  .Count = GetCellAsDouble(oSheet, "I7", "All", 1)
                  .TotalQuantity = GetCellAsDouble(oSheet, "I8", "All", 1)
                  If GetCheckBoxValue(oSheet, "I9") Then 
                    .NotInspected = "Y" 
                  Else 
                    .NotInspected = "N"
                  End If
                  .Units = GetCellAsString(oSheet, "C11").Left(36)
                  .Excellent = GetCellAsDouble(oSheet, "D11", "All", 1)
                  .Good = GetCellAsDouble(oSheet, "E11", "All", 1)
                  .Fair = GetCellAsDouble(oSheet, "F11", "All", 1)
                  .Poor = GetCellAsDouble(oSheet, "G11", "All", 1)
                  .PerformDeficiencies = GetCellAsString(oSheet, "H11").Left(36)
                  .MaintenanceNeeds = GetCellAsString(oSheet, "I11").Left(36)
                  .Comments = GetCellAsString(oSheet, "B13").Left(2000)
                  .RecommendedWork = GetCellAsString(oSheet, "K11").Left(2000)
                  .EstimatedCost = GetCellAsDouble(oSheet, "G12")
                  .ImportDate = Now

                End With

                If String.IsNullOrEmpty(newBrideElement.ElementName) = False Then

                  bridgeElementAdded += 1
				
                  '*** Add new bridge element object to a pending table. ***
                  db.BridgeElements.InsertOnSubmit(newBrideElement)

                End If

              End Select

            Else

              '*** Report progress back to the UI thread. ***
              '*** Telling user that this current worksheet has been skipped. ***
              Me.BackgroundWorker1.ReportProgress(counter, String.Format("... Skipping '{0}'", sheetName))

            End If

          Next '*** Get next worksheet. ***

          oWorkbook.Close()

          bridgeAdded += 1
		  
          '*** Add new bridge object to a pending table. ***
          db.Bridges.InsertOnSubmit(newBridge)

        Next '*** Get next file. ***

        Me.BackgroundWorker1.ReportProgress(100, "Deleting existing data from tables.")
		
        '*** Call a stored procedure to purge existing records from tables first. ***
        db.sp_DeleteBridgeData()

        Me.BackgroundWorker1.ReportProgress(100, String.Format("Adding {0} new Bridge(s).", bridgeAdded))
        Me.BackgroundWorker1.ReportProgress(100, String.Format("Adding {0} new Bridge Element(s).", bridgeElementAdded))
        
        '*** Submit all changes to database. ***
        db.SubmitChanges()

        Me.BackgroundWorker1.ReportProgress(100, "Completed.")

      End Using

    Catch ex As Exception

      Throw

    Finally

      '*** Clean up COM objects. ***
      If Not oWorkbooks Is Nothing Then Marshal.FinalReleaseComObject(oWorkbooks)
      If Not oWorkbook Is Nothing Then Marshal.FinalReleaseComObject(oWorkbook)
      If Not oSheets Is Nothing Then Marshal.FinalReleaseComObject(oSheets)
      If Not oSheet Is Nothing Then Marshal.FinalReleaseComObject(oSheet)

      oApp.Quit()
      Marshal.FinalReleaseComObject(oApp)

    End Try

  End Sub

  Private Sub BackgroundWorker1_ProgressChanged(ByVal sender As Object, ByVal e As System.ComponentModel.ProgressChangedEventArgs) Handles BackgroundWorker1.ProgressChanged

    '*** Retrieve data from the worker thread and display them on the UI. *** 

    Me.txtOutput.AppendText(e.UserState.ToString & Environment.NewLine)
    Me.lblStatus.Text = String.Format("Percent Complete... {0}%", e.ProgressPercentage)
    Me.pbarProgress.Value = e.ProgressPercentage

  End Sub

  Private Sub BackgroundWorker1_RunWorkerCompleted(ByVal sender As Object, ByVal e As System.ComponentModel.RunWorkerCompletedEventArgs) Handles BackgroundWorker1.RunWorkerCompleted

    '*** This event is automatically triggered when the worker is done. *** 

    MessageBox.Show("Done", "Confirmation", MessageBoxButtons.OK, MessageBoxIcon.Information)

    Me.pbarProgress.Visible = False
    Me.lblStatus.Text = "Ready"
    Me.btnImport.Enabled = True

  End Sub

  Private Function GetOSIMFiles(ByVal parentDirectory As String) As List(Of String)

    GetOSIMFiles = Nothing

    If String.IsNullOrEmpty(parentDirectory) = False Then

      Dim oDirInfo As New DirectoryInfo(parentDirectory)
      Dim ret As List(Of String) = Nothing

      '*** Get all the Excel files within the specified directory. ***
      For Each oFileInfo As FileInfo In oDirInfo.GetFiles("*.xls", SearchOption.AllDirectories)

        If ret Is Nothing Then ret = New List(Of String)

        '*** Add full file path to a running list. ***
        ret.Add(oFileInfo.FullName)

      Next

      Return ret

    End If

  End Function

  Private Function GetCellAsString(ByRef oSheet As Excel.Worksheet, ByVal cell As String) As String

    GetCellAsString = Nothing

    If Not oSheet Is Nothing AndAlso String.IsNullOrEmpty(cell) = False Then

      Dim oRange As Excel.Range = oSheet.Range(cell)

      If Not oRange Is Nothing AndAlso Not oRange.Text Is Nothing Then

        If String.IsNullOrEmpty(oRange.Text.ToString.Trim) = False Then
          GetCellAsString = oRange.Text.ToString.Trim
        End If

      End If

      If Not oRange Is Nothing Then Marshal.ReleaseComObject(oRange)

    End If

  End Function

  Private Function GetCellAsDouble(ByRef oSheet As Excel.Worksheet, ByVal cell As String) As Double?

    Return GetCellAsDouble(oSheet, cell, Nothing, Nothing)

  End Function

  Private Function GetCellAsDouble(ByRef oSheet As Excel.Worksheet, ByVal cell As String, ByVal match As String, ByVal def As Double) As Double?

    GetCellAsDouble = Nothing

    If Not oSheet Is Nothing AndAlso String.IsNullOrEmpty(cell) = False Then

      Dim oRange As Excel.Range = oSheet.Range(cell)

      If Not oRange Is Nothing AndAlso Not oRange.Text Is Nothing Then

        '*** If a number is found, return the number. ***
        '*** If text is found, return nothing unless the text matches an input string. ***
        '*** In that case, return the user-specified default value. ***
        If IsNumeric(oRange.Text) Then
          GetCellAsDouble = CDbl(oRange.Text)
        Else
          If String.IsNullOrEmpty(match) = False AndAlso oRange.Text.ToString.Trim = match Then
            GetCellAsDouble = def
          End If
        End If

      End If

      If Not oRange Is Nothing Then Marshal.ReleaseComObject(oRange)

    End If

  End Function

  Private Function GetCellAsDate(ByRef oSheet As Excel.Worksheet, ByVal cell As String) As Date?

    GetCellAsDate = Nothing

    If Not oSheet Is Nothing AndAlso String.IsNullOrEmpty(cell) = False Then

      Dim oRange As Excel.Range = oSheet.Range(cell)

      If Not oRange Is Nothing AndAlso Not oRange.Text Is Nothing Then

        If String.IsNullOrEmpty(oRange.Text.ToString.Trim) = False Then

          Dim val As String = oRange.Text.ToString.Trim

          '*** The cell value could be anything. ***
          '*** e.g., "2013", "Unknown", "21-Sep-12" ***

          If val Like "####" Then

            '*** If the cell value has a 4-digit number, treat it as a year. ***
            '*** Return a January 1st date for that year. ***
            GetCellAsDate = New Date(CInt(val), 1, 1)

          Else

            '*** If the cell value matches this format "24-Apr-12", then parse it. ***
            If val Like "*-???-##" Then
              GetCellAsDate = Date.ParseExact(val, "d-MMM-yy", New System.Globalization.CultureInfo("en-ca"))
            End If

          End If

        End If

      End If

      If Not oRange Is Nothing Then Marshal.ReleaseComObject(oRange)

    End If

  End Function

  Private Function FindLabelRowNumber(ByRef oSheet As Excel.Worksheet, ByVal findString As String, ByVal fromRange As String, ByVal toRange As String) As Integer

    FindLabelRowNumber = -1

    Dim oSearch As Excel.Range
    Dim oFind As Excel.Range

    '*** Look up a string within a user-specified search range. ***
    oSearch = oSheet.Range(fromRange, toRange)
    oFind = oSearch.Find(findString, , _
            Excel.XlFindLookIn.xlValues, _
            Excel.XlLookAt.xlPart, _
            Excel.XlSearchOrder.xlByRows, _
            Excel.XlSearchDirection.xlNext, False)

    '*** If found, return the row number. ***
    If Not oFind Is Nothing Then FindLabelRowNumber = oFind.Row

    If Not oSearch Is Nothing Then Marshal.ReleaseComObject(oSearch)
    If Not oFind Is Nothing Then Marshal.ReleaseComObject(oFind)

  End Function

  Private Function GetCheckBoxValue(ByRef oSheet As Excel.Worksheet, ByVal cell As String) As Boolean

    GetCheckBoxValue = False

    If Not oSheet Is Nothing AndAlso String.IsNullOrEmpty(cell) = False Then

      Dim oShapes As Excel.Shapes = oSheet.Shapes

      If Not oShapes Is Nothing Then

        For Each oShape As Excel.Shape In oShapes

          Dim ctr As Object = oShape.OLEFormat.Object

          If Not ctr Is Nothing AndAlso TypeName(ctr) = "Rectangle" Then
            If oShape.TopLeftCell.Address.Replace("$", "") = cell Then
              If String.IsNullOrEmpty(ctr.Text.trim) = False Then GetCheckBoxValue = True
              Exit For
            End If
          End If

        Next

      End If

    End If

  End Function

  Private Function IsExclude(ByVal sheetName As String) As Boolean

    IsExclude = False

    '*** Check if a worksheet name is on the exclusion list. ***
    For Each s As String In Me.excludeSheets
      If sheetName.ToLower Like s.ToLower Then Return True
    Next

  End Function

End Class

Here is a Module:

Module StringExtensions

  <System.Runtime.CompilerServices.Extension()> _
  Public Function Left(ByVal input As String, ByVal length As Integer) As String

    '*** This is a simple example on how to extend the string class. ***
    '*** This method will ensure the return string is no more than x characters long. ***

    Left = input

    If String.IsNullOrEmpty(input) = False AndAlso input.Length > length Then
      Return input.Substring(0, length)
    End If

  End Function

End Module

Final Notes

Feel free to email me if you have any questions or concerns. You may also want to consider subscribing to my RSS feeds for future posts.

Copyright © 2013 www.xcentricway.ca. All rights reserved.