owenG
home learn tableau about
divider
legalific swsom swsom diagram swapps ahk ue








GenChart

What

A ClickOnce-deployed utility, accessed via a pseudo-ActiveX control hosted in a Summation window, for counting the file extensions of ediscovery documents present in a database. Results are displayed in a pie chart, with optional export of data to Excel 2007.

How

Uses the swsom .Net wrapper to read the 'Linked Document Field' from displayed records and parse out the individual file extensions. A JPEG image is created on the fly, with the DocLink tally information serialized into XML. If appropriate prerequisites are in place, a Visual Studio Tools for Office (VSTO) document-level project provides the ability to load the XML and auto-generate a native Excel chart.

Why

To get a quick read on the contents of an ediscovery dataset.

Where

Check for prerequisites and install from here GenChart.


divider element

Usage

Installing and/or running the GenChart ClickOnce package will result in a number of files being copied into a new subfolder (oweng.GenChart) under the Application-level profile directory within Summation e.g. C:\Program Files\Summation\Blaze\Profiles\ogalvin\oweng.GenChart. Once the files are in place, a new tabbed window (aka 'View') will open within Summation, hosting the controls needed to count the data:

screenshot showing initial dialog

There are a number of scenarios that may result in an error message appearing during the file copy process, where the errors may be safely ignored. Essentially this happens when the program files dropped by the initial ClickOnce installation are locked by Summation, preventing them from being overwritten by a subsequent ClickOnce file copy. For technical reasons, the file copy happens every time the GenChart ClickOnce application is opened, even though the same exact files (same version etc.) may already be in place and ready for use. This file copy is a good thing if a newer version of GenChart has been installed since the last time the application was opened but otherwise is simply a side effect of deploying this type of ActiveX control via ClickOnce. In any case, closing and re-opening Summation will clear the file locks.

Once the desired recordset is available in the Column or Form view, click the 'Create Chart' button within the new View. GenChart will take a snapshot (Clone) of the data in each record's DocLink field (as defined in 'Linked Document Field' on the Link Fields tab of Summation Defaults dialog) and attempt to parse out the value's file extension. The initial Clone process may take anywhere from a few seconds to a minute or more, depending on the number of records currently displayed in Summation.

NOTE: the app makes no attempt to confirm the file's identity by examining the file's internal metadata - a Word document renamed with a .txt extension will simply be counted as '.txt.'
  • Empty values are ignored and the final tally figures do not take these records into account.

  • The ten extensions with the highest count are preserved and sorted, highest to lowest. If the corpus of edocs spanned more than ten file extensions, the extensions that didn't register on the 'top ten' list are all lumped together into an 'other' category.

  • By default the pie slices will be generated with random colors, which may result in adjacent slices that are not chromatically distinct. Re-run the app with same default to get new random colors, or uncheck the 'Use Random Colors' checkbox' beforehand so that the chart is displayed with standard, distinct colors.

  • The chart's legend will include total files counted (records with a doc link value that could be parsed into a file extension) and separate counts for each file extension - percent of total followed by raw count in parentheses.

  • After the image has been loaded a GenChart.xml file will be generated, persisting the tally amounts.

  • Clicking 'Create Chart' will always result in the app analyzing the current recordset and generating a new GenChart.jpg (overwriting previous). Searches, custodian-specific or other, can be repeatedly performed, followed each time by re-running GenChart via the 'Create Chart' button.


A completed chart:

screenshot showing created pie chart

Once a given recordset has been processed, the 'Export to Excel' button becomes available, though the feature can only be taken advantage of if Excel 2007 and version 3.5 of the Microsoft .Net Framework have both been installed. Clicking the button will open the GenChartExcel.xlsx file dropped by the GenChartLauncher, which in turn will 1) analyze the GenChart.xml file mentioned above, 2) transfer the file extension data into worksheet cells, and 3) draw a simple Excel pie chart based on those values. Within Excel the chart type can be changed if desired.

screenshot showing created pie chart

Cleanup and Removal

Uninstalling the GenChartLauncher application via Add/Remove Programs will remove the source files from their original ClickOnce application directory but there are additional steps for a full cleanup. First, before uninstalling, run the application so that the new Summation View is created. Then click the 'Unregister ActiveX' button-bar at the bottom of the window, which will remove the two registry entries that had been created by the app. (As an alternative to re-initiating the GenChart Summation View only in order to unregister, there should be an UnregisterGenChartDLL.bat in the oweng.GenChart folder discussed below.) Next, close Summation in order to remove any file locks and, assuming there is no data you wish to save, delete the oweng.GenChart folder under the user's application-level profile directory e.g. C:\Program Files\Summation\Blaze\Profiles\ogalvin\oweng.GenChart. Finally, uninstall GenChartLauncher via Add/Remove Programs.





divider element



Technical Details

Appropriate App Approach

The easiest way of providing a rich UI while interacting with Summation programmatically would probably be to simply create a standalone WinForm .exe but I like the elegance of embedding an application within an ActiveX container as per Embedding VB Applications in CT Summation. Other integration possibilities included adding a new context-menu item to the core db columnar view or adding a button to Excel itself via Visual Studio Tools for Office. The former option can get a little tricky when dealing with the possibility of existing .mxs Context Menu Files at a variety of user/case levels and the latter idea would have introduced its own problems in terms of necessary prerequistes.


Deciding to pursue the embed-in-view idea I investigated a number of possible implementations:

  • Hosting .Net Windows Control in IE, via # syntax

    You can use Visual Studio to create a Windows Forms Control Library, add any number of standard controls or create your own custom ones, and wrap up into a .dll. The result is a visually appealing, feature-rich UI that can then be hosted in Internet Explorer using simple markup code on the HTML side. For example, assuming there is a Windows Form Library DLL named WinFormLib.dll that holds a UserControl class named MyUserControl, that control can be hosted in an IE window with only a few lines of HTML. Within the classid attribute value, the first part is the full path to the .dll (WinControlLib.dll), followed by the fully qualified name of the control (WinControlLib.MyUserControl), with a pound sign (#) separating the two:

    HTML
        <object
            id="Object1"
            classid="WinControlLib.dll#WinControlLib.MyUserControl"
            height="700" width="550">
        </object>
    

    Which is all well and good when the ASP.NET page can be rendered via IIS i.e. over the internet or an intranet. I would in fact consider the latter intranet solution to be the best overall, since presumably the necessary Code Access permissions could be configured as needed, it being an intranet and all. Although an assembly from the Intranet Zone would natively have somewhat looser security restrictions than something working from the Internet Zone, what it comes down to is that anything needing to interact with unmanaged code (like the Summation API) would need full trust.

    On the other hand, one might assume that everything would work fine if the .dll were accessed from the local hard drive, which would tend to imply broader security permissions. Presumably this is not a common scenario, which is why I initially couldn't find any documentation regarding expected behavior. I tried a number of possible workarounds, troubleshooting the resulting fusion bind errors etc. but it wasn't until I more or less gave up on this approach that I came across PRB: Cannot Host .NET User Controls in Internet Explorer 6.0 from File System, which states the cannot-use-local-windows-control-in-IE behavior is by design, because of security concerns. That article instead recommends installing the assembly into the GAC, which seems significantly more heavy handed than my eventual choice, which was to make the appropriate COM entries in the Windows Registry.

  • C++ MFC ActiveX Control

    The only project type I could find in Visual Studio 2008 that referenced the ActiveX techology was the one for an MFC ActiveX Control, available under C++. For a little while there seemed a chance this might actually work. Then after a few hours of playing around reality intruded, primarily in the form of a steep learning curve. I've done some relatively straightforward work in C++, including with MFC apps, but not for a few years and pursuing this option didn't seem worthwhile in terms of time investment vs. likely benefits (would need to be registered anyway?). And I did do enough research to make me believe the necessary type of ActiveX component, and ActiveX document, might not be supported anyways.

  • Hosting Windows Control in IE, as COM-visible ActiveX

    Finally, returning to more of a traditional ActiveX approach, though only traditional in that the .Net component would need to be registered on the host machine and then referenced in the .htm file with the clsid syntax:

    HTML
        <object
            id="mainControl"
            classid="clsid:29BFC741-B8B0-4ad0-9F49-693FD73E8CF8" 
            height="100%" width="100%">	   
        </object>
    

    The first concern here was setting everything up so that the control was properly registered and visible and for that I would need to thank Michal Kosmala and his Create ActiveX in .NET Step by Step article up on www.codeproject.com. The implementation details are discussed below but the downside to this solution include the need to make Registry entries in the first place and the later difficulty in cleaning up said entries on uninstallation. It would be nice to avoid the Registry and the whole spirit of the ClickOnce technology is to make a very light footprint on the target machine. On the other hand it struck me as 'good enough' and the best of the alternatives. ClickOnce does generally allow for deploying COM components in a way that avoids registration but ActiveX Document (and ActiveX EXE) projects are specifically not supported.


And the whole COM registration process worked fine, assuming adequate permissions, when testing in XP. Once everything seemed to be operating correctly I looked at the app in Windows 7, where attempting the .dll registration resulted in a System.UnauthorizedAccess exception. Specifically, the HKCR\owenG.GenChart.GenChartCtrl key could not be created. I'm presuming the same error would occur in Vista and perhaps the problem could be sledgehammered away by turning off User Account Control entirely. I did spend some time trying to come up with a more fine-tuned workaround but between resource constraints (OS/hardware) and no compelling need at this point, I deferred. Also, as noted above, a real solution would not involve COM at all.

Initial Setup

After selecting the 'hybrid Windows Control via ActiveX clsid reference' approach, the next hurdle dealt with how to actually deploy the files. In theory all of the application files could have remained in their original ClickOnce directory, which would be in some obfuscated directory under C:\Documents and Settings\. The relative complexity of having three different sets of assemblies interact made selecting a single known target directory seem like a better idea, where swsom is used to determine the current user's profile directory at the app level (sw.Application.AppProfile) and create a new sub-directory there, \owenG.GenChart.

The deploy steps wind up going something like:

  1. The GenChartLauncher ClickOnce application is installed, where its job is really to serve as a delivery mechanism for the GenChart control and VSTO .dll/.xlsx combo. On the Visual Studio project level, it is pretty much a WinForm app w/o a Form that has been marked as COM visible. The next steps happen each time the app is launched, which is not ideal but at least allows everything to function while still taking advantage of ClickOnce.
  2. After making sure Summation is running, the GenChart binaries are copied to the current user's Summation profile, as mentioned above. If the GenChart Summation view is already open at this point, there will likely be file copy problems, as noted under Usage. At a minimum I really should be checking file versions of existing .dlls before bothering to try and overwrite but that step awaits a future day.
  3. With the GenChart files in place, including a host.htm that references the clsid of the GenChart control, the RegistrationServices.RegisterAssembly() method is used to register GenChart.dll.
  4. Finally, (swsom)sw.Views.CreateHtmlView is used to create a new Tabbed view, passing in the file path to host.htm as the url.

Having put the target files in place, registered the ActiveX control and created a Summation view for the host.htm, the actual ClickOnce application (GenChartLauncher), is more or less out of the picture.


Getting the raw data

Clicking 'Create Chart' results in the generation of a GenChart.jpg file, which is then displayed in the control's PictureBox. Formulating that image file takes a bit of doing though. The first substantive step is to check to see if the 'doc link' field exists in the table currently loaded into Summation:

C#
    string doclinkFieldName = sw.DB.Defaults.LinkCol;
    if (sw.DB.Fields.Exists(doclinkFieldName))
    {

Once that is confirmed, it is time to get the actual doclink values from the database. For the first pass I tried direct access to the database, which worked just fine if there were only a few records:

C#
    sw.DB.Field fld = new sw.DB.Field(sw.DB.Defaults.LinkCol);
    ...
    object val = fld.Value;
    if (val != DBNull)
    {
        string docLinkValue = (string)val;
 

For any appreciable number of records though, using the Clone method on Summation's db object proved much superior, where it was only necessary to Clone the doclink field:

C#
    ADODB.Recordset rs = sw.DB.Clone(1, sw.DB.Defaults.LinkCol, ",", false);
    ...
    string val = rs.Fields[0].Value as string;
    if (val != null)
    {
 

I later did some comparisons between the two different methods of data access, which demonstrate the greater performance of db.Clone. The primary disadvantage, not relevant for GenChart, is of course that the data is read-only. With CT Summation Enterprise a second advantage to Clone is that db values comes across clearly, where some versions of that product have problems reading directly data using the db object (at least in earlier Enterprise versions).

Record CountUsing (swsom)sw.db.CloneDirect (swsom)sw.db field access
100.08 seconds18.5 seconds
10001 second3 min., 4 seconds
100008.25 secondsI'm guessing a while; 2000 recs = 5.5 min.
  • All of the summaries in recordsets had populated doclink fields
  • Tests run on local VM, RAM constrained to 1 GB
  • Time elapsed values include walking the recordset and extracting a file extension from doclink field

"The greater part of progress is the desire for progress"

In order to provide an accurate progress display in the control window, a delegate was declared in the class (Charter) that is doing all of the processing and I added a RecordsReviewed event based on that delegate:

C#
    class Charter
    {
        public delegate void UpdateProgressBar();
        public event UpdateProgressBar RecordsReviewed;
Within the loop of the Charter method responsible for walking cloned recordset the RecordsReviewed event is fired every 10 records:
C#
    countRecords++;
    if (countRecords % 10 == 0)
    {
        RecordsReviewed();
    }
Then the UI class of the main user control (GenChartCtrl) the Maximum in the progress bar control is set to the recset count and the chart_RecordsReviewed function subscribed to the RecordsReviewed event:
C#
    public partial class GenChartCtrl : UserControl
    {
        ...
        progRecset.Maximum = (int)sw.DB.RecordCount;
        Charter chart = new Charter();
        chart.RecordsReviewed += new Charter.UpdateProgressBar(chart_RecordsReviewed);

And this RecordsReviewed subscriber, chart_RecordsReviewed, is resposible for advancing the progress bar:

C#
    public void chart_RecordsReviewed()
    {
        //where Step value is set to 10, matching the modulo trigger in the Charter class
        //progbar range is from 0 to total # of records
        progRecset.PerformStep(); 
    }

GetFileExtension

Meanwhile, the main code tries to parse out a file extension from the doclink value, using a combination of (System.IO)FileInfo.Extension and a series of more manual calculations.

The file paths that were most likely to result in FileInfo.Extension failure were those that were email attachments and either really did not have a file extension:
  e.g. eMail\Session006\PSTXYZ003336^CPB-TXT_zip
or were .msg files with an illegal path character, most often a colon in from a Re: subject line:
  e.g. eMail\Session008\PSTXYZ003974^RE: How to make.msg
 
If FileInfo.Extension call fails, the parsing attempt on the doclink value consists of:
  - confirming there is a least one backslash
  - finding the final '.', with any characters after that becoming the file extension
Record will be skipped if there is no backslash and/or no period character.

The tally for each valid file extension is stored in a generic Dictionary<string, int>, where if a given file extension is already stored in the Dictionary (via TKey string), it is removed and re-added with an incremented count (TValue int). When all the records have been processed, the Dictionary is transmogrified into a List<KeyValuePair<string, int>> ( http://stackoverflow.com/questions/289/how-do-you-sort-a-c-dictionary-by-value) which then means it can be sorted, by highest count. The ten file extensions with the highest tally are preserved, while the rest are thrown out. A dummy "<other>" key with count = 0 is appended and the final kvp List is returned to the calling procedure.


Pie Making

The following set of code blocks are all contiguous, where the CreateImage method is in the Charter class and is called directly from the GenChartCtrl user control. The parameters for CreateImage are:
* filename = Summation user profile directory at app level + oweng.GenChart\ + GenChart.jpg
* rect = PictureBox.Bounds in GenChartCtrl
* useRandomColors = check state of that box in User Control UI
* xmlFileName = file name to which data will be exported

C#
    public void CreateImage(string filename, Rectangle rect, bool useRandomColors, string xmlFileName)
    {
        int width = rect.Width, height = rect.Height;
        Bitmap myBitmap = new Bitmap(width, height);
        Graphics g = Graphics.FromImage(myBitmap);
        g.Clear(Color.White);

DocLinkInfo is a very simple class, taking a string (file extension) and int (extension count) in its constructor. It has a couple of Xml Serialization attributes and exists only to make the export of data into XML file simple.

C#
        List<DocLinkInfo> doclinks = new List<DocLinkInfo>();

The FillSlices method returns the List that holds all the tally info, with total = the total count of valid file extensions.

C#
        int total = 0;
        List<KeyValuePair<string, int>> slices = FillSlices(out total);
        if (slices.Count > 0)
        {
            Random rnd = new Random();
            Rectangle pieRect = new Rectangle(210, 10, 375, 375);
            float deg = 0;
            int totalDrawn = 0;
            int baseY = 10;
            int counter = 0;

Iterate through each <file extension, count> pairing, first setting a color for what will become the related pie slice (colors is a string array of basic colors).

C#
            foreach (KeyValuePair<string, int> kvp in slices)
            {
                SolidBrush brush = null;
                if (useRandomColors)
                {
                    brush = new SolidBrush(Color.FromArgb(rnd.Next(255), rnd.Next(255), 
                        rnd.Next(255)));
                }
                else
                {
                    brush = new SolidBrush(Color.FromName(colors[counter]));
                    counter++;
                }

Pull out the data pairs. If the count = 0, must be on the dummy "<other>" key and its aggregate count can be deduced from the overall total minus what's been accounted for already.

C#
                int extensionCount = kvp.Value;
                string extensionString = kvp.Key;
                if (extensionCount == 0) 
                {
                    //on last kvp, for the agglomerated extensions that were not in top 10
                    extensionCount = total - totalDrawn;
                }
Create new DocLinkInfo object for current data pair and add to the List of DocLinkInfo's. Draw current extension's pie slice, after calculating deg startAngle and arc sweepAngle.
C#                
                DocLinkInfo info = new DocLinkInfo(extensionString, extensionCount);
                doclinks.Add(info);

                float percent = (float)extensionCount / total;
                float arc = percent * 360;
                Pen p = new Pen(Color.Black, 1);
                g.DrawPie(p, pieRect, deg, arc);
                g.FillPie(brush, pieRect, deg, arc);
                deg += arc;

Create the color square for the legend. If on the first slice, write out the legend header text.

C#                
                Rectangle keyBox = new Rectangle(10, baseY + 5, 10, 10);
                if (totalDrawn == 0)
                {
                    Font legendFont = new Font("Verdana", 12, FontStyle.Bold | FontStyle.Underline);
                    string legendText = "Files counted: " + total;
                    g.DrawString(legendText, legendFont, new SolidBrush(Color.Black), 
                        new Rectangle(keyBox.X + 10, baseY, 250, 20), StringFormat.GenericDefault);
                    baseY += 30;
                    keyBox.Y += 30;
                }
                g.FillRectangle(brush, keyBox);

Screen space for legend is limited, need to truncate any file extensions that are likely to result in text & pie occupying same pixels.

C#                
                //Verdana is proportional, don't really #chars that would fit in a given space
                if (extensionString.Length > 12) 
                {
                    extensionString = extensionString.Substring(0, 9) + "...";
                }
                Font f = new Font("Verdana", 10, FontStyle.Bold);
                string keyText = String.Format("{0} - {1:0.0%} ({2})", 
                    extensionString, percent, extensionCount);
                g.DrawString(keyText, f, new SolidBrush(Color.Black), 
                    new Rectangle(keyBox.X + 10, baseY, 275, 20), StringFormat.GenericDefault);
                baseY += 20;
                totalDrawn += kvp.Value;
            }
        }

If no parsable doclink values had ever been found ('slices.Count > 0' is not true), let user know via a graphic.

C#                        
        else
        {
            Font f = new Font("Verdana", 20, FontStyle.Bold);
            string doclinkFieldName = sw.DB.Defaults.LinkCol.ToUpper();
            if (doclinkFieldName.Length <= 0) doclinkFieldName = " 'Linked Document' ";
            string nothingFoundText = String.Format("Nothing found - no records with valid {0} field",
                 doclinkFieldName);
            g.DrawString(nothingFoundText, f, new SolidBrush(Color.Black), 
                new Rectangle(100, 150, 500, 300), StringFormat.GenericDefault);
        }

Save the constructed Bitmap object as a .jpg. Call DocLinkInfoToXml to serialize as an .xml file (see method definition below).

C#                
        myBitmap.Save(filename, System.Drawing.Imaging.ImageFormat.Jpeg);
        DocLinkInfoToXml(xmlFileName, doclinks);

        g.Dispose();
        myBitmap.Dispose();
    }  //end CreateImage

DocLinkInfoToXml definition

C#
    private void DocLinkInfoToXml(string xmlFileName, List<DocLinkInfo> doclinkInfoList)
    {
        XmlSerializer s = new XmlSerializer(typeof(List<DocLinkInfo>));            
        TextWriter w = new StreamWriter(xmlFileName);
        s.Serialize(w, doclinkInfoList);
        w.Close();
    }

To Excel

Clicking 'Open in Excel', on the GenChart user control UI, doesn't do much more than try a Process.Start on the GenChartExcel.xlsx file. That file is actually the result of a 'Excel 2007 Workbook' VSTO project in Visual Studio 2008 and in conjunction with it's companion GenChartToExcel12.dll is capable of loading the tally data and creating a linked pie chart. Upon file open, which leads to calling of the default Sheet1_Startup method, the project will attempt to open the genchart.xml (xmlFilePath) file. The xml file is loaded as an XmlDocument and then queried via XPath, where the query matches the format of the serialized DocLinkInfo class in the GenChart control:

C#
    XmlDocument xmlDoc = new XmlDocument();
    xmlDoc.Load(xmlFilePath);
    XPathNavigator nav = xmlDoc.CreateNavigator();
    //the serialization process will have capitalized the first letter of type in ArrayOfO..., 
    //though it will remain with orig lowercase in child elements
    XPathNodeIterator it = nav.Select("/ArrayOfOwenGGenChartdocLinkInfo/owenG.GenChart.docLinkInfo");

Then the code moves through each result from the Select statement, extracting the two needed values and writing them directly to descending cells in the first two columns of the Excel worksheet:

C#
    index++;
    XPathNavigator navInner = it.Current;
    string ext = navInner.GetAttribute("extension", "");
    string cntString = navInner.GetAttribute("count", "");
    ((Excel.Range)Globals.Sheet1.Cells[index, 1]).Value2 = ext;
    int count = 0;
    if (int.TryParse(cntString, out count))
    {
        ((Excel.Range)Globals.Sheet1.Cells[index, 2]).Value2 = count;
    }
    else
    {
        ((Excel.Range)Globals.Sheet1.Cells[index, 2]).Value2 = 
            "Error parsing number from count value in .xml";
    }

And finally a pie chart of defined size is created, using the newly written cells as source data:

C#
    Microsoft.Office.Tools.Excel.Chart pieChart;
    pieChart = this.Controls.AddChart(250, 20, 400, 400, "DocLink Info");
    pieChart.ChartType = Microsoft.Office.Interop.Excel.XlChartType.xlPie;

    string endingCell = String.Format("B{0}", index);
    Excel.Range docLinkRange = this.get_Range("A2", endingCell);
    pieChart.SetSourceData(docLinkRange, missing);